📚 NVIDIA Profiling Basics

Overview

So far you have learned the fundamentals of GPU programming and a set of advanced patterns. Part II covered correctness debugging with compute-sanitizer and cuda-gdb, and other parts covered GPU features such as warp programming, the memory system, and block-level operations. Your kernels work correctly - but are they fast?

This tutorial follows the NVIDIA profiling methodology recommended in the CUDA Best Practices Guide.

Key insight: A correct kernel can still be tens of times slower than the optimum. Profiling bridges the gap between working code and high-performance code.

The profiling toolkit

Because cuda-toolkit is installed through pixi, NVIDIA's professional profiling tools are available right away:

NSight Systems (nsys) - the “big picture” tool

Purpose: System-wide performance analysis (NSight Systems documentation)

  • Timeline view of CPU-GPU interaction
  • Memory transfer bottlenecks
  • Kernel launch overhead
  • Multi-GPU coordination
  • API call tracing

Available interfaces: command line (nsys) and GUI (nsys-ui)

Use when:

  • Understanding overall application flow
  • Identifying CPU-GPU synchronization issues
  • Analyzing memory transfer patterns
  • Finding kernel launch bottlenecks

# Show help
pixi run nsys --help

# Basic system-wide profiling
pixi run nsys profile --trace=cuda,nvtx --output=timeline mojo your_program.mojo

# Text summary of the recorded profile
pixi run nsys stats --force-export=true timeline.nsys-rep

NSight Compute (ncu) - the “kernel deep-dive” tool

Purpose: Detailed single-kernel performance analysis (NSight Compute documentation)

  • Roofline model analysis
  • Memory hierarchy utilization
  • Warp execution efficiency
  • Register/shared memory usage
  • Compute unit utilization

Available interfaces: command line (ncu) and GUI (ncu-ui)

Use when:

  • Optimizing the performance of a specific kernel
  • Understanding memory access patterns
  • Analyzing compute-bound vs memory-bound kernels
  • Identifying warp divergence issues

# Show help
pixi run ncu --help

# Detailed kernel profiling
pixi run ncu --set full -o kernel_profile mojo your_program.mojo

# Focus on a specific kernel
pixi run ncu --kernel-name regex:your_kernel_name mojo your_program.mojo

Tool selection decision tree

Performance problem
        |
        v
Know which kernel?
    |           |
   No          Yes
    |           |
    v           v
NSight    Kernel-specific?
Systems       |       |
    |        No      Yes
    v         |       |
Timeline      |       v
analysis <----+   NSight Compute
                      |
                      v
                Kernel deep-dive

Quick decision guide:

  • Start with NSight Systems (nsys) if you don't know where the bottleneck is
  • Use NSight Compute (ncu) when you know exactly which kernel to optimize
  • Use both for comprehensive analysis (the common workflow)

Hands-on: system-wide profiling with NSight Systems

Let's profile the matrix multiplication implementations from Puzzle 16 to understand their performance differences.

GUI note: The NSight Systems and NSight Compute GUIs (nsys-ui, ncu-ui) require a display and OpenGL support. On headless servers or remote systems without X11 forwarding, use the command-line versions (nsys, ncu) and do text-based analysis with nsys stats and ncu --import --page details. You can also copy the .nsys-rep and .ncu-rep files to a local machine and analyze them in the GUI.

Step 1: Prepare your code for profiling

Important: For accurate profiling, build with full debug information while keeping optimizations enabled:

pixi shell -e nvidia
# Build with full debug info while keeping optimizations (for comprehensive source mapping)
mojo build --debug-level=full solutions/p16/p16.mojo -o solutions/p16/p16_optimized

# Test the optimized build
./solutions/p16/p16_optimized --naive

Why this matters:

  • Full debug info: gives the profiler complete symbol tables, variable names, and source line mapping
  • Comprehensive analysis: lets the NSight tools correlate performance data with specific code locations
  • Optimizations kept: ensures realistic performance measurements that match production builds

Step 2: Capture a system-wide profile

# Profile the optimized build with comprehensive tracing
nsys profile \
  --trace=cuda,nvtx \
  --output=matmul_naive \
  --force-overwrite=true \
  ./solutions/p16/p16_optimized --naive

Command breakdown:

  • --trace=cuda,nvtx: capture CUDA API calls and custom annotations
  • --output=matmul_naive: save the profile as matmul_naive.nsys-rep
  • --force-overwrite=true: replace an existing profile of the same name
  • Final arguments: the compiled Mojo binary and its --naive flag

Step 3: Analyze the timeline

# Generate text-based statistics
nsys stats --force-export=true matmul_naive.nsys-rep

# Key metrics to look for:
# - GPU utilization
# - Memory transfer times
# - Kernel execution times
# - CPU-GPU synchronization gaps

What you'll see (actual output from a 2×2 matrix multiplication):

** CUDA API Summary (cuda_api_sum):
 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)          Name
 --------  ---------------  ---------  ---------  --------  --------  --------  -----------  --------------------
     81.9          8617962          3  2872654.0    2460.0      1040   8614462    4972551.6  cuMemAllocAsync
     15.1          1587808          4   396952.0    5965.5      3810   1572067     783412.3  cuMemAllocHost_v2
      0.6            67152          1    67152.0   67152.0     67152     67152          0.0  cuModuleLoadDataEx
      0.4            44961          1    44961.0   44961.0     44961     44961          0.0  cuLaunchKernelEx

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                    Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------
    100.0             1920          1    1920.0    1920.0      1920      1920          0.0  p16_naive_matmul_Layout_Int6A6AcB6A6AsA6A6A

** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):
 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ----------------------------
     49.4             4224      3    1408.0    1440.0      1312      1472         84.7  [CUDA memcpy Device-to-Host]
     36.0             3072      4     768.0     528.0       416      1600        561.0  [CUDA memset]
     14.6             1248      3     416.0     416.0       416       416          0.0  [CUDA memcpy Host-to-Device]

Key performance insights:

  • Memory allocation dominates: 81.9% of the traced CUDA API time is spent in cuMemAllocAsync
  • The kernel is lightning fast: only 1,920 ns (0.000001920 seconds) of execution time
  • Memory operation breakdown: 49.4% Device→Host, 36.0% memset, 14.6% Host→Device
  • Tiny data: every memory operation is under 0.001 MB (4 float32 values = 16 bytes)
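
To see how lopsided these numbers are, here is a small sketch that redoes the arithmetic in Python. The values are copied from the nsys tables above (only the four API rows shown there); the variable names are illustrative and are not part of nsys.

```python
# Values copied from the nsys tables above, all in nanoseconds.
api_ns = {
    "cuMemAllocAsync": 8_617_962,
    "cuMemAllocHost_v2": 1_587_808,
    "cuModuleLoadDataEx": 67_152,
    "cuLaunchKernelEx": 44_961,
}
kernel_ns = 1_920
memops_ns = {"Device-to-Host": 4_224, "memset": 3_072, "Host-to-Device": 1_248}

# How many times longer the two allocation calls take than the kernel itself.
alloc_ns = api_ns["cuMemAllocAsync"] + api_ns["cuMemAllocHost_v2"]
alloc_to_kernel_ratio = alloc_ns / kernel_ns

# Share of each memory operation within the MemOps table.
memops_total_ns = sum(memops_ns.values())
memops_share_pct = {
    name: round(100 * ns / memops_total_ns, 1) for name, ns in memops_ns.items()
}

print(f"allocation / kernel time: {alloc_to_kernel_ratio:,.0f}x")
for name, pct in memops_share_pct.items():
    print(f"{name}: {pct}%")
```

Note that each "Time (%)" column in the nsys output is relative to its own table: the kernel's 100.0% means it is the only kernel, not that it dominates the run.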

Step 4: Compare implementations

Profile the other versions and compare:

# Make sure you are still inside the pixi shell (`pixi shell -e nvidia`)

# Profile the shared memory version
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_shared ./solutions/p16/p16_optimized --single-block

# Profile the tiled version
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_tiled ./solutions/p16/p16_optimized --tiled

# Profile the idiomatic tiled version
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_idiomatic_tiled ./solutions/p16/p16_optimized --idiomatic-tiled

# Analyze each implementation separately (nsys stats processes one file at a time)
nsys stats --force-export=true matmul_shared.nsys-rep
nsys stats --force-export=true matmul_tiled.nsys-rep
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep

How to compare the results:

  1. Look at the GPU Kernel Summary - compare execution times across implementations
  2. Check the Memory Operations - this summary lists memcpy and memset operations, so use it to confirm the transfers match across versions; whether shared memory reduces in-kernel global memory traffic is a question for NSight Compute
  3. Compare API overhead - all versions should show similar memory allocation patterns

Manual comparison workflow:

# Save each analysis to a file for comparison
nsys stats --force-export=true matmul_naive.nsys-rep > naive_stats.txt
nsys stats --force-export=true matmul_shared.nsys-rep > shared_stats.txt
nsys stats --force-export=true matmul_tiled.nsys-rep > tiled_stats.txt
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep > idiomatic_tiled_stats.txt

Fair comparison results (actual profiling output):

Comparison 1: 2 x 2 matrices

Implementation           Memory allocation       Kernel execution  Performance
-----------------------  ----------------------  ----------------  -------------
Naive                    81.9% cuMemAllocAsync   ✅ 1,920 ns       Baseline
Shared (--single-block)  81.8% cuMemAllocAsync   ✅ 1,984 ns       +3.3% slower

Comparison 2: 9 x 9 matrices

Implementation           Memory allocation       Kernel execution  Performance
-----------------------  ----------------------  ----------------  -------------
Tiled (manual)           81.1% cuMemAllocAsync   ✅ 2,048 ns       Baseline
Idiomatic Tiled          81.6% cuMemAllocAsync   ✅ 2,368 ns       +15.6% slower
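
The "Performance" column is just the relative difference in kernel time against each baseline. A minimal sketch of that calculation, using the kernel times from the two comparisons (the helper name slowdown_pct is made up for this example):

```python
def slowdown_pct(baseline_ns, candidate_ns):
    """Relative slowdown of candidate vs. baseline, in percent (negative = faster)."""
    return 100.0 * (candidate_ns - baseline_ns) / baseline_ns

# Kernel times (ns) from the two comparisons above.
shared_vs_naive = slowdown_pct(1_920, 1_984)
idiomatic_vs_tiled = slowdown_pct(2_048, 2_368)

print(f"shared vs naive:          +{shared_vs_naive:.1f}% ({1_984 - 1_920} ns)")
print(f"idiomatic tiled vs tiled: +{idiomatic_vs_tiled:.1f}% ({2_368 - 2_048} ns)")
```

The absolute gaps are 64 ns and 320 ns from a single run each, so treat the percentages as indicative rather than precise.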

Key insights from the fair comparisons:

Both matrix sizes are too small for GPU work:

  • 2×2 matrices: 4 elements - completely overhead-dominated
  • 9×9 matrices: 81 elements - still overhead-dominated
  • Real GPU workloads: thousands to millions of elements per dimension

What these results actually show:

  • Every variant is dominated by memory allocation (over 81% of the time)
  • Kernel execution is irrelevant - tiny compared with the setup cost
  • “Optimizations” can hurt: shared memory adds 3.3% and async_copy adds 15.6% overhead
  • The real lesson: for tiny workloads the choice of algorithm doesn't matter - overhead dominates everything

Why this happens:

  • GPU setup cost (memory allocation, kernel launch) is largely independent of problem size
  • For tiny problems this fixed cost dwarfs the computation time
  • Optimizations designed for large problems become overhead on small ones
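
A toy cost model makes the fixed-cost argument concrete. This is only a sketch under stated assumptions: the fixed cost of 10 ms roughly matches the allocation time measured above, while ns_per_flop is a hypothetical throughput figure chosen for illustration, not a measurement.

```python
def overhead_fraction(n, fixed_ns=10_000_000.0, ns_per_flop=0.001):
    """Fraction of total time spent on fixed setup for an n x n matrix multiply.

    fixed_ns    -- setup cost (allocation, launch); ~10 ms as measured above
    ns_per_flop -- hypothetical compute cost per floating-point operation
    """
    compute_ns = 2 * n**3 * ns_per_flop  # an n x n matmul needs about 2*n^3 flops
    return fixed_ns / (fixed_ns + compute_ns)

for n in (2, 9, 1_000, 10_000):
    print(f"n={n:>6}: setup is {100 * overhead_fraction(n):.2f}% of total time")
```

Under these assumptions the setup share only drops noticeably once n reaches the thousands, which is the regime where kernel-level optimization starts to pay off.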

Real-world profiling lessons:

  • Problem size context matters: both 2×2 and 9×9 are tiny for a GPU
  • Fixed costs dominate small problems: memory allocation, kernel launch overhead
  • “Optimizations” can hurt tiny workloads: shared memory and async operations add overhead
  • Don't optimize tiny problems: focus on algorithms that scale to real workloads
  • Always benchmark: assumptions about “better” code are often wrong

Understanding small-kernel profiling: This 2×2 matrix example shows a typical small-kernel pattern:

  • The actual computation (matrix multiply) is extremely fast (1,920 ns)
  • Memory setup overhead dominates total time (over 97% of execution)
  • This is why real-world GPU optimization focuses on:
    • Batching operations to amortize setup cost
    • Reusing memory to reduce allocation overhead
    • Larger problem sizes where compute becomes the bottleneck

Hands-on: kernel deep-dive with NSight Compute

Now let's dive deep into the performance characteristics of a specific kernel.

Step 1: Profile a specific kernel

# Make sure you are in an active shell
pixi shell -e nvidia

# Profile the naive MatMul kernel in detail (using the optimized build)
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

Common issue: permission error

If you get ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters, try the following fixes:

# Add the NVIDIA driver option (safer than rmmod)
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

# Set the kernel parameter
sudo sysctl -w kernel.perf_event_paranoid=0

# Make it permanent
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf

# Reboot so the driver change takes effect
sudo reboot

# Then run the ncu command again
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

Step 2: Analyze key metrics

# Generate the detailed report (correct syntax)
ncu --import kernel_analysis.ncu-rep --page details

Actual NSight Compute output (2×2 naive MatMul):

GPU Speed Of Light Throughput
----------------------- ----------- ------------
DRAM Frequency              Ghz         6.10
SM Frequency                Ghz         1.30
Elapsed Cycles            cycle         3733
Memory Throughput             %         1.02
DRAM Throughput               %         0.19
Duration                     us         2.88
Compute (SM) Throughput       %         0.00
----------------------- ----------- ------------

Launch Statistics
-------------------------------- --------------- ---------------
Block Size                                                     9
Grid Size                                                      1
Threads                           thread               9
Waves Per SM                                                0.00
-------------------------------- --------------- ---------------

Occupancy
------------------------------- ----------- ------------
Theoretical Occupancy                 %        33.33
Achieved Occupancy                    %         2.09
------------------------------- ----------- ------------

Key insights from the real data:

Performance analysis - the brutal truth

  • Compute Throughput: 0.00% - the GPU is completely idle computationally
  • Memory Throughput: 1.02% - barely touching the memory bandwidth
  • Achieved Occupancy: 2.09% - using only about 2% of the GPU's capability
  • Grid Size: 1 block - leaves 79 of the GPU's 80 multiprocessors idle!

Why performance is so poor

  • Tiny problem size: 2×2 matrix = 4 elements total
  • Poor launch configuration: 9 threads in 1 block (should be a multiple of 32)
  • Massive underutilization: 0.00 waves per SM (an efficient launch needs at least one full wave, ideally many)

Key optimization recommendations from NSight Compute

  • “Est. Speedup: 98.75%” - increase the grid size to use all 80 SMs
  • “Est. Speedup: 71.88%” - use thread blocks that are a multiple of 32
  • “Kernel grid is too small” - a much larger problem is needed for GPU efficiency
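
The two estimated speedups line up with simple utilization arithmetic on the launch configuration. The sketch below shows that correspondence; it is an observation about the numbers in this report, not a description of how NSight Compute derives its estimates, and the variable names are illustrative.

```python
import math

WARP_SIZE = 32
num_sms = 80             # SM count quoted in the NSight Compute recommendation
threads_per_block = 9    # "Block Size" from Launch Statistics
grid_blocks = 1          # "Grid Size" from Launch Statistics

# A block is scheduled in whole warps, so 9 threads still occupy one 32-lane warp.
warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
idle_lane_pct = 100 * (1 - threads_per_block / (warps_per_block * WARP_SIZE))

# One block can run on only one SM, so the others have nothing to do.
idle_sm_pct = 100 * (1 - grid_blocks / num_sms)

print(f"idle warp lanes: {idle_lane_pct:.2f}%")
print(f"idle SMs:        {idle_sm_pct:.2f}%")
```

In this report the idle-lane share (1 - 9/32) and the idle-SM share (1 - 1/80) come out at the same values as the 71.88% and 98.75% figures quoted above.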

Step 3: The reality check

What this profiling data teaches us:

  1. Tiny problems are GPU poison: a 2×2 matrix completely wastes GPU resources
  2. Launch configuration matters: wrong thread/block sizes kill performance
  3. Scale matters more than algorithm: no optimization can fix a fundamentally tiny problem
  4. NSight Compute is honest: it tells you plainly when kernel performance is poor

The real lesson:

  • Don't optimize toy problems - they are not representative of real GPU workloads
  • Focus on realistic workloads - 1000×1000+ matrices, where optimizations actually matter
  • Let profiling guide optimization - but only for problems worth optimizing

For the 2×2 example: sophisticated algorithms (shared memory, tiling) only add overhead to a workload that is already overhead-dominated.

Reading profiler output like a performance detective

Common performance patterns

Pattern 1: Memory-bound kernel

NSight Systems shows: long memory transfer times
NSight Compute shows: high memory throughput, low compute utilization
Solution: optimize memory access patterns, use shared memory

Pattern 2: Low occupancy

NSight Systems shows: short kernel executions with gaps between them
NSight Compute shows: low achieved occupancy
Solution: reduce register usage, optimize block size

Pattern 3: Warp divergence

NSight Systems shows: irregular kernel execution patterns
NSight Compute shows: low warp execution efficiency
Solution: minimize conditional branches, restructure the algorithm

Profiling detective workflow

Performance issue
        |
        v
NSight Systems: big picture
        |
        v
Is the GPU well utilized?
    |             |
   No            Yes
    |             |
    v             v
Fix CPU-GPU   NSight Compute: kernel detail
pipeline                  |
                          v
              Memory- or compute-bound?
                |         |         |
              Memory   Compute   Neither
                |         |         |
                v         v         v
            Optimize   Optimize   Check
            memory     arithmetic occupancy
            access
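
The lower half of this workflow can be written as a small decision function over the two Speed Of Light throughput percentages. This is a sketch only: the function name next_step and the 60% threshold are assumptions made for illustration, not values taken from NSight Compute.

```python
def next_step(compute_pct, memory_pct, busy_threshold=60.0):
    """Pick the next optimization step from Speed Of Light throughput values.

    compute_pct    -- "Compute (SM) Throughput" in percent
    memory_pct     -- "Memory Throughput" in percent
    busy_threshold -- hypothetical cut-off above which a unit counts as the limiter
    """
    if memory_pct >= busy_threshold and memory_pct >= compute_pct:
        return "optimize memory access"
    if compute_pct >= busy_threshold:
        return "optimize arithmetic"
    # Neither unit is busy: the kernel is likely latency- or launch-limited.
    return "check occupancy"

# The 2x2 naive MatMul measured earlier: 0.00% compute, 1.02% memory.
print(next_step(compute_pct=0.00, memory_pct=1.02))
```

For the 2×2 kernel both values are near zero, so the function lands on the "Neither" branch, which matches the occupancy findings in the report.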

Profiling best practices

For comprehensive profiling guidelines, refer to the Best Practices Guide - Performance Metrics.

Do's

  1. Profile representative workloads: use realistic data sizes and patterns
  2. Build with full debug info: use --debug-level=full, with optimizations enabled, for comprehensive profiling data and source mapping
  3. Warm up the GPU: run kernels several times and profile the later iterations
  4. Compare alternatives: always profile multiple implementations
  5. Focus on hotspots: optimize the kernels that take the most time

Don'ts

  1. Don't profile without debug info: you won't be able to map performance back to source code (mojo build --help)
  2. Don't profile only a single run: GPU performance can vary between runs
  3. Don't ignore memory transfers: CPU-GPU transfers often dominate
  4. Don't optimize prematurely: profile first, then optimize

Common pitfalls and solutions

Pitfall 1: Cold start effects

# Wrong: profile the first run
nsys profile mojo your_program.mojo

# Right: profile after warm-up
nsys profile --delay=5 mojo your_program.mojo  # start collecting 5 s after launch, skipping start-up work

Pitfall 2: Wrong build configuration

# Wrong: full debug build (optimizations disabled), i.e. `--no-optimization`
mojo build -O0 your_program.mojo -o your_program

# Wrong: no debug info (source mapping impossible)
mojo build your_program.mojo -o your_program

# Right: optimized build with full debug info for profiling
mojo build --debug-level=full your_program.mojo -o optimized_program
nsys profile ./optimized_program

Pitfall 3: Ignoring memory transfers

# Look for this pattern in NSight Systems:
CPU -> GPU transfer: 50ms
Kernel execution: 2ms
GPU -> CPU transfer: 48ms
# Total: 100ms (the kernel is only 2%!)

Solution: overlap transfers with compute and reduce how often you transfer (covered in Part IX)
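
The timings in this pattern also show how much overlap could help at best. A rough sketch, assuming the work can be split into many independent chunks and that host-to-device copies, kernels, and device-to-host copies can all run concurrently (both assumptions are idealizations):

```python
# Stage times from the pattern above, in milliseconds.
h2d_ms, kernel_ms, d2h_ms = 50.0, 2.0, 48.0

# Fully serialized execution: each stage waits for the previous one.
serial_ms = h2d_ms + kernel_ms + d2h_ms
kernel_share_pct = 100 * kernel_ms / serial_ms

# Idealized pipelining: total time approaches the slowest single stage.
overlapped_ms = max(h2d_ms, kernel_ms, d2h_ms)
best_case_speedup = serial_ms / overlapped_ms

print(f"serial: {serial_ms:.0f} ms, kernel share: {kernel_share_pct:.0f}%")
print(f"ideal overlap: {overlapped_ms:.0f} ms ({best_case_speedup:.1f}x faster)")
```

Even in this best case the run stays transfer-bound, which is why reducing the amount or frequency of transferred data matters as much as overlapping it.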

Pitfall 4: Focusing on a single kernel

# Wrong: profile only the "slow" kernel
ncu --kernel-name regex:slow_kernel program

# Right: profile the whole application first
nsys profile mojo program.mojo  # find the real bottleneck

Best practices and advanced options

Advanced NSight Systems profiling

For comprehensive system-wide analysis, use these advanced nsys flags:

# Production-grade profiling command
nsys profile \
  --gpu-metrics-devices=all \
  --trace=cuda,osrt,nvtx \
  --trace-fork-before-exec=true \
  --cuda-memory-usage=true \
  --cuda-um-cpu-page-faults=true \
  --cuda-um-gpu-page-faults=true \
  --opengl-gpu-workload=false \
  --delay=2 \
  --duration=30 \
  --sample=cpu \
  --cpuctxsw=process-tree \
  --output=comprehensive_profile \
  --force-overwrite=true \
  ./your_program

Flag explanations:

  • --gpu-metrics-devices=all: collect GPU metrics on all devices
  • --trace=cuda,osrt,nvtx: comprehensive API tracing
  • --cuda-memory-usage=true: track memory allocation/deallocation
  • --cuda-um-cpu/gpu-page-faults=true: monitor Unified Memory page faults
  • --delay=2: wait 2 seconds before profiling (avoids the cold start)
  • --duration=30: profile for at most 30 seconds
  • --sample=cpu: include CPU sampling for hotspot analysis
  • --cpuctxsw=process-tree: trace CPU context switches

Advanced NSight Compute profiling

Detailed kernel analysis with comprehensive metrics:

# Full kernel analysis with all metric sets
ncu \
  --set full \
  --import-source=on \
  --kernel-id=:::1 \
  --launch-skip=0 \
  --launch-count=1 \
  --target-processes=all \
  --replay-mode=kernel \
  --cache-control=all \
  --clock-control=base \
  --apply-rules=yes \
  --check-exit-code=yes \
  --export=detailed_analysis \
  --force-overwrite \
  ./your_program

# Focus on specific performance aspects
ncu \
  --set=@roofline \
  --section=InstructionStats \
  --section=LaunchStats \
  --section=Occupancy \
  --section=SpeedOfLight \
  --section=WarpStateStats \
  --metrics=sm__cycles_elapsed.avg,dram__throughput.avg.pct_of_peak_sustained_elapsed \
  --kernel-name regex:your_kernel_.* \
  --export=targeted_analysis \
  ./your_program

Key NSight Compute flags:

  • --set full: collect all available metrics (comprehensive but slow)
  • --set @roofline: metric set optimized for roofline analysis
  • --import-source=on: map results back to source code
  • --replay-mode=kernel: replay kernels for accurate measurement
  • --cache-control=all: control GPU caches for consistent results
  • --clock-control=base: lock clocks to the base frequency
  • --section=SpeedOfLight: include the Speed of Light analysis
  • --metrics=...: collect only specific metrics
  • --kernel-name regex:pattern: select kernels by regex pattern (not --kernel-regex)

Profiling workflow best practices

1. Progressive profiling strategy

# Step 1: Quick overview (fast)
nsys profile --trace=cuda --duration=10 --output=quick_look ./program

# Step 2: Detailed system analysis (medium)
nsys profile --trace=cuda,osrt,nvtx --cuda-memory-usage=true --output=detailed ./program

# Step 3: Kernel deep-dive (slow but comprehensive)
ncu --set=@roofline --kernel-name regex:hotspot_kernel ./program

2. Multi-run analysis for reliability

# Profile several times and compare
for i in {1..5}; do
  nsys profile --output=run_${i} ./program
  nsys stats run_${i}.nsys-rep > stats_${i}.txt
done

# Compare results
diff stats_1.txt stats_2.txt
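
A plain diff flags nearly every timing line, because timings always differ a little between runs. A sketch of summarizing the run-to-run spread instead; the five kernel times below are hypothetical placeholders that you would replace with the "Total Time (ns)" values read from each stats_N.txt, and the 2x noise factor is an arbitrary choice.

```python
import statistics

# Hypothetical kernel times (ns) from five runs of the same binary.
kernel_ns_runs = [1_920, 1_984, 1_952, 2_016, 1_920]

mean_ns = statistics.mean(kernel_ns_runs)
stdev_ns = statistics.stdev(kernel_ns_runs)   # sample standard deviation
spread_pct = 100 * stdev_ns / mean_ns         # coefficient of variation

def exceeds_noise(delta_ns, noise_ns=stdev_ns, factor=2.0):
    """True when a measured difference is larger than factor x the run-to-run noise."""
    return abs(delta_ns) > factor * noise_ns

print(f"mean {mean_ns:.1f} ns, stdev {stdev_ns:.1f} ns ({spread_pct:.1f}%)")
print("64 ns gap beyond noise? ", exceeds_noise(64))
print("320 ns gap beyond noise?", exceeds_noise(320))
```

With a spread like this, a 64 ns gap between two implementations would not be distinguishable from noise, while a 320 ns gap would be.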

3. Targeted kernel profiling

# First identify the hotspot kernels
nsys profile --trace=cuda,nvtx --output=overview ./program
nsys stats overview.nsys-rep | grep -A 10 "GPU Kernel Summary"

# Then profile the specific kernel
ncu --kernel-name="identified_hotspot_kernel" --set full ./program

Environment and build best practices

Optimal build configuration

# For profiling: optimized build with full debug info
mojo build --debug-level=full --optimization-level=3 program.mojo -o program_profile

# Check the build settings
mojo build --help | grep -E "(debug|optimization)"

Profiling environment setup

# Disable GPU boost for consistent results
# (clock values are GPU-specific; list valid pairs with nvidia-smi -q -d SUPPORTED_CLOCKS)
sudo nvidia-smi -ac 1215,1410  # lock memory and GPU clocks

# Set deterministic behavior
export CUDA_LAUNCH_BLOCKING=1  # synchronous launches for accurate timing

# Relax driver restrictions for profiling
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

Memory and performance isolation

# Clear GPU memory before profiling (requires root and an idle GPU)
nvidia-smi --gpu-reset

# Stop other GPU processes
sudo fuser -v /dev/nvidia*  # list the processes using the GPU
sudo pkill -f cuda  # kill CUDA processes if needed (matches any command line containing "cuda", so check the fuser output first)

# Run with high priority
sudo nice -n -20 nsys profile ./program

Analysis and reporting best practices

Generating comprehensive reports

# Generate multiple report formats
nsys stats --report=cuda_api_sum,cuda_gpu_kern_sum,cuda_gpu_mem_time_sum --format=csv --output=. profile.nsys-rep

# Export for external analysis
nsys export --type=sqlite profile.nsys-rep
nsys export --type=json profile.nsys-rep

# Generate comparison reports
nsys stats --report=cuda_gpu_kern_sum baseline.nsys-rep > baseline_kernels.txt
nsys stats --report=cuda_gpu_kern_sum optimized.nsys-rep > optimized_kernels.txt
diff -u baseline_kernels.txt optimized_kernels.txt

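Performance regression testing

#!/bin/bash
# Automated profiling script for CI/CD
# Sums the "Total Time (ns)" column of the GPU kernel summary. Assumes CSV output with
# that value in the 2nd field, matching the column order of the tables shown earlier.
total_kernel_ns() {
    nsys stats --report=cuda_gpu_kern_sum --format=csv --output=- "$1" \
        | awk -F, '$2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

BASELINE_TIME=$(total_kernel_ns baseline.nsys-rep)
CURRENT_TIME=$(total_kernel_ns current.nsys-rep)

REGRESSION_THRESHOLD=1.10  # 10% regression threshold
if (( $(echo "$CURRENT_TIME > $BASELINE_TIME * $REGRESSION_THRESHOLD" | bc -l) )); then
    echo "Performance regression detected: ${CURRENT_TIME}ns vs ${BASELINE_TIME}ns"
    exit 1
fi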
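
If the CI job already exports CSV reports, the same threshold check can be done in Python, where the csv module copes with kernel names that contain commas. This is a sketch under assumptions: the column header "Total Time (ns)" is taken from the summary tables shown earlier, and the two CSV snippets are made-up stand-ins for real exported files.

```python
import csv
import io

def total_kernel_time_ns(csv_text):
    """Sum the "Total Time (ns)" column of a kernel-summary CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["Total Time (ns)"]) for row in reader)

def is_regression(baseline_ns, current_ns, threshold=1.10):
    """True when current time exceeds baseline by more than the threshold factor."""
    return current_ns > baseline_ns * threshold

# Stand-in CSV content (hypothetical); real data would be read from the exported files.
baseline_csv = 'Time (%),Total Time (ns),Instances,Name\n100.0,2048,1,"matmul_tiled<a, b>"\n'
current_csv = 'Time (%),Total Time (ns),Instances,Name\n100.0,2368,1,"matmul_tiled<a, b>"\n'

baseline_ns = total_kernel_time_ns(baseline_csv)
current_ns = total_kernel_time_ns(current_csv)
print("regression" if is_regression(baseline_ns, current_ns) else "ok",
      f"({current_ns} ns vs {baseline_ns} ns)")
```

A single-run comparison like this is sensitive to noise, so in practice it is worth comparing averages over several runs before failing a build.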

Next steps

Now that you understand the profiling fundamentals:

  1. Practice with your existing kernels: profile the puzzles you have already solved
  2. Prepare for optimization: Puzzle 31 applies these insights to occupancy optimization
  3. Get to know the tools: experiment with different NSight Systems and NSight Compute options

Remember: profiling is not just about finding slow code - it is about understanding your program's behavior and making informed optimization decisions.

Additional profiling resources: