๐Ÿ” ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ ์ตํžŒ ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น… ์Šคํ‚ฌ์„ ๋ฐ”ํƒ•์œผ๋กœ, ์ด๋ฒˆ์—๋Š” ์ „ํ˜€ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ๋„์ „์„ ๋งˆ์ฃผํ•ฉ๋‹ˆ๋‹ค: ํฌ๋ž˜์‹œ ์—†์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๋กœ์ง ๋ฒ„๊ทธ์ž…๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ๊ด€์ ์˜ ์ „ํ™˜:

  • ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ๋ช…ํ™•ํ•œ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ(CUDA_ERROR_ILLEGAL_ADDRESS)๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•จ
  • ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํฌ๋ž˜์‹œ๋„ ์—†๊ณ  ์—๋Ÿฌ ๋ฉ”์‹œ์ง€๋„ ์—†์Œ - ํƒ์ •์ฒ˜๋Ÿผ ํŒŒํ—ค์ณ์•ผ ํ•˜๋Š” ๋ฏธ๋ฌ˜ํ•˜๊ฒŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์Œ

์ด๋ฒˆ ์ค‘๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” LayoutTensor ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ๊ทธ๋žจ์€ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ์ถœ๋ ฅ์„ ๋‚ด๋Š”๋ฐ, ์‹ค์ œ ๊ฐœ๋ฐœ์—์„œ ํ›จ์”ฌ ํ”ํ•˜๋ฉด์„œ๋„ ๊นŒ๋‹ค๋กœ์šด ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค์ž…๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ๊ณผ ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ์™€ ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•์„ ์ตํ˜€๋‘์„ธ์š”. ์•„๋ž˜ ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • LayoutTensor ๋””๋ฒ„๊น…: ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด ์กฐ์‚ฌํ•˜๊ธฐ
  • ๋กœ์ง ๋ฒ„๊ทธ ํƒ์ง€: ํฌ๋ž˜์‹œํ•˜์ง€ ์•Š๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์ฐพ๊ธฐ
  • ๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„ ๋ถ„์„: ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ฌธ์ œ ์ดํ•ดํ•˜๊ธฐ
  • ๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„: ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

fn process_sliding_window(
    output: LayoutTensor[dtype, vector_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin],
):
    thread_id = thread_idx.x

    # Each thread processes a sliding window of 3 elements
    window_sum = Scalar[dtype](0.0)

    # Sum elements in sliding window: [i-1, i, i+1]
    for offset in range(ITER):
        idx = Int(thread_id) + offset - 1
        if 0 <= idx < SIZE:
            value = rebind[Scalar[dtype]](a[idx])
            window_sum += value

    output[thread_id] = window_sum


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --second-case

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค - ํฌ๋ž˜์‹œ ์—†์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ:

This program computes sliding window sums for each position...

Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]
stack trace was not collected. Enable stack trace collection with environment variable `MOJO_ENABLE_STACK_TRACE_ON_ERROR`
Unhandled exception caught during execution: At open-source/max/mojo/stdlib/stdlib/gpu/host/device_context.mojo:2082:17: CUDA call failed: CUDA_ERROR_INVALID_IMAGE (device kernel image is invalid)
To get more accurate error information, set MODULAR_DEVICE_CONTEXT_SYNC_MODE=true.
/home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/nvidia/bin/mojo: error: execution exited with a non-zero result: 1

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ํ”„๋กœ๊ทธ๋žจ์€ ํฌ๋ž˜์‹œ ์—†์ด ์‹คํ–‰๋˜์ง€๋งŒ ์ผ์ •ํ•œ ํŒจํ„ด์œผ๋กœ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ๋กœ์ง ๋ฒ„๊ทธ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฌด์—‡์ผ๊นŒ์š”?

์ƒ๊ฐํ•ด ๋ณผ ์ :

  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์—์„œ ์–ด๋–ค ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?
  • ์ œ๋Œ€๋กœ ๋Œ์ง€ ์•Š๋Š” ๊ฒƒ ๊ฐ™์€ ๋ฐ˜๋ณต๋ฌธ์€ ์–ด๋–ป๊ฒŒ ์กฐ์‚ฌํ•  ๊ฑด๊ฐ€์š”?
  • ๋ณ€์ˆ˜๋ฅผ ์ง์ ‘ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์—†์„ ๋•Œ ์–ด๋–ค ๋””๋ฒ„๊น… ์ „๋žต์ด ํšจ๊ณผ์ ์ผ๊นŒ์š”?
  • ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•ด ์ค„ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ๊ฐ€ ์—†์„ ๋•Œ, ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์˜ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๋ฐฉ๋ฒ•์„ ์–ด๋–ป๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

GDB ๋ช…๋ น์–ด ๋‹จ์ถ•ํ‚ค (๋น ๋ฅธ ๋””๋ฒ„๊น…)

์ด ๋‹จ์ถ•ํ‚ค๋“ค์„ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์„ธ์…˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๋‹จ์ถ•์ „์ฒด์‚ฌ์šฉ ์˜ˆ์‹œ
rrun(cuda-gdb) r
nnext(cuda-gdb) n
ccontinue(cuda-gdb) c
bbreak(cuda-gdb) b 39
pprint(cuda-gdb) p thread_id
qquit(cuda-gdb) q

์•„๋ž˜ ๋ชจ๋“  ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋Š” ํšจ์œจ์„ ์œ„ํ•ด ์ด ๋‹จ์ถ•ํ‚ค๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค!

ํŒ
  1. ํŒจํ„ด ๋ถ„์„๋ถ€ํ„ฐ - ๊ธฐ๋Œ€๊ฐ’๊ณผ ์‹ค์ œ ๊ฒฐ๊ณผ์˜ ๊ด€๊ณ„๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” (์ฐจ์ด์— ์–ด๋–ค ์ˆ˜ํ•™์  ํŒจํ„ด์ด ์žˆ๋‚˜์š”?)
  2. ์‹คํ–‰ ํ๋ฆ„์— ์ง‘์ค‘ - ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์œผ๋ฉด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์„ธ์–ด๋ณด์„ธ์š”
  3. ๋‹จ์ˆœํ•œ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์‚ฌ์šฉ - ์ตœ์ ํ™”๋œ ์ฝ”๋“œ์—์„œ๋Š” ๋ณต์žกํ•œ ๋””๋ฒ„๊น… ๋ช…๋ น์ด ์‹คํŒจํ•˜๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค
  4. ์ˆ˜ํ•™์  ์ถ”๋ก  - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ๋”ฐ์ ธ๋ณด์„ธ์š”
  5. ๋ˆ„๋ฝ๋œ ๋ฐ์ดํ„ฐ ์กฐ์‚ฌ - ๊ฒฐ๊ณผ๊ฐ€ ์ผ๊ด€๋˜๊ฒŒ ๊ธฐ๋Œ€๋ณด๋‹ค ์ž‘๋‹ค๋ฉด, ๋ฌด์—‡์ด ๋น ์กŒ์„๊นŒ์š”?
  6. ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ๊ฒ€์ฆ - ์ตœ์ข… ๊ฒฐ๊ณผ์—์„œ ๋ฒ„๊ทธ์˜ ํŒจํ„ด์ด ๋“œ๋Ÿฌ๋‚˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค
  7. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฒฝ๊ณ„ ๋ถ„์„ - ๋ฐ˜๋ณต๋ฌธ์ด ์˜ฌ๋ฐ”๋ฅธ ๊ฐœ์ˆ˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  8. ์ž‘๋™ํ•˜๋Š” ์ผ€์ด์Šค์™€ ๊ต์ฐจ ๊ฒ€์ฆ - ์Šค๋ ˆ๋“œ 3์€ ์ •ํ™•ํ•˜๊ฒŒ ์ž‘๋™ํ•˜๋Š”๋ฐ ๋‹ค๋ฅธ ๊ฒƒ๋“ค์€ ์™œ ์•ˆ ๋ ๊นŒ์š”?
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

1๋‹จ๊ณ„: ์‹คํ–‰๊ณผ ์ดˆ๊ธฐ ๋ถ„์„

Step 1: ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

Step 2: ์ฆ์ƒ๋ถ€ํ„ฐ ๋ถ„์„

๋””๋ฒ„๊ฑฐ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

์‹ค์ œ ๊ฒฐ๊ณผ: [0.0, 1.0, 3.0, 5.0]
๊ธฐ๋Œ€๊ฐ’: [1.0, 3.0, 6.0, 5.0]

๐Ÿ” ํŒจํ„ด ์ธ์‹:

  • ์Šค๋ ˆ๋“œ 0: 0.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 1.0 โ†’ 1.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 1: 1.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 3.0 โ†’ 2.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 2: 3.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 6.0 โ†’ 3.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 3: 5.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 5.0 โ†’ โœ… ์ •ํ™•

์ดˆ๊ธฐ ๊ฐ€์„ค: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ผ๋ถ€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ˆ„๋ฝํ•˜๊ณ  ์žˆ๋Š”๋ฐ, ์Šค๋ ˆ๋“œ 3๋งŒ ์ •ํ™•ํ•˜๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

2๋‹จ๊ณ„: ์ปค๋„ ์ง„์ž…

Step 3: ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ง„์ž… ํ™•์ธ

์‹ค์ œ ๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค:

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --second-case

This program computes sliding window sums for each position...
Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30
30          input: LayoutTensor[mut=False, dtype, vector_layout],

Step 4: ๋ฉ”์ธ ๋กœ์ง์œผ๋กœ ์ด๋™

(cuda-gdb) n
29          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
32          thread_id = thread_idx.x
(cuda-gdb) n
38          for offset in range(ITER):

Step 5: ๋ณ€์ˆ˜ ์ ‘๊ทผ์„ฑ ํ…Œ์ŠคํŠธ - ์ค‘์š”ํ•œ ๋ฐœ๊ฒฌ

(cuda-gdb) p thread_id
$1 = 0

โœ… ์ข‹์Œ: Thread ID์— ์ ‘๊ทผ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

(cuda-gdb) p window_sum
Cannot access memory at address 0x0

โŒ ๋ฌธ์ œ: window_sum์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

(cuda-gdb) p a[0]
Attempt to take address of value not located in memory.

โŒ ๋ฌธ์ œ: LayoutTensor ์ง์ ‘ ์ธ๋ฑ์‹ฑ์ด ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

(cuda-gdb) p a.ptr[0]
$2 = {0}
(cuda-gdb) p a.ptr[0]@4
$3 = {{0}, {1}, {2}, {3}}

๐ŸŽฏ ๋ŒํŒŒ๊ตฌ: a.ptr[0]@4๋กœ ์ „์ฒด ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด LayoutTensor ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ํ•ต์‹ฌ ๋ฐ˜๋ณต๋ฌธ ์กฐ์‚ฌ

Step 6: ๋ฐ˜๋ณต๋ฌธ ๋ชจ๋‹ˆํ„ฐ๋ง ์„ค์ •

(cuda-gdb) b 42
Breakpoint 1 at 0x7fffd326ffd0: file problems/p09/p09.mojo, line 42.
(cuda-gdb) c
Continuing.

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:42
42              idx = thread_id + offset - 1

๐Ÿ” ์ด์ œ ๋ฐ˜๋ณต๋ฌธ ๋ณธ๋ฌธ ์•ˆ์— ์žˆ์Šต๋‹ˆ๋‹ค. ์ง์ ‘ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์„ธ์–ด๋ด…์‹œ๋‹ค.

Step 7: ์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 0)

(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
41          for offset in range(ITER):

์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต ์™„๋ฃŒ: ๋ฐ˜๋ณต๋ฌธ์ด 42๋ฒˆ ์ค„ โ†’ 43๋ฒˆ ์ค„ โ†’ 41๋ฒˆ ์ค„๋กœ ๋Œ์•„์™”์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ณต๋ฌธ์ด ๊ณ„์†๋ฉ๋‹ˆ๋‹ค.

Step 8: ๋‘ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 1)

(cuda-gdb) n

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
42              idx = thread_id + offset - 1
(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
44                  value = rebind[Scalar[dtype]](input[idx])
(cuda-gdb) n
45                  window_sum += value
(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
41          for offset in range(ITER):

๋‘ ๋ฒˆ์งธ ๋ฐ˜๋ณต ์™„๋ฃŒ: ์ด๋ฒˆ์—๋Š” if ๋ธ”๋ก(44-45๋ฒˆ ์ค„)์„ ํ†ต๊ณผํ–ˆ์Šต๋‹ˆ๋‹ค.

Step 9: ์„ธ ๋ฒˆ์งธ ๋ฐ˜๋ณต ํ…Œ์ŠคํŠธ

(cuda-gdb) n
47          output[thread_id] = window_sum

๊ฒฐ์ •์  ๋ฐœ๊ฒฌ: ๋ฐ˜๋ณต๋ฌธ์ด 2๋ฒˆ๋งŒ ๋Œ๊ณ  ์ข…๋ฃŒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! 42๋ฒˆ ์ค„์˜ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์— ๋‹ค์‹œ ๊ฑธ๋ฆฌ์ง€ ์•Š๊ณ  47๋ฒˆ ์ค„๋กœ ๋ฐ”๋กœ ๋„˜์–ด๊ฐ”์Šต๋‹ˆ๋‹ค.

๊ฒฐ๋ก : ๋ฐ˜๋ณต๋ฌธ์ด ์ •ํ™•ํžˆ 2๋ฒˆ ๋Œ๊ณ  ์ข…๋ฃŒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Step 10: ์ปค๋„ ์‹คํ–‰ ์™„๋ฃŒ์™€ ์ปจํ…์ŠคํŠธ ์†์‹ค

(cuda-gdb) n
31      fn process_sliding_window(
(cuda-gdb) n
[Switching to Thread 0x7ffff7cc0e00 (LWP 110927)]
0x00007ffff064f84a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) p output.ptr[0]@4
No symbol "output" in current context.
(cuda-gdb) p offset
No symbol "offset" in current context.

๐Ÿ” ์ปจํ…์ŠคํŠธ ์†์‹ค: ์ปค๋„ ์‹คํ–‰์ด ๋๋‚˜๋ฉด ์ปค๋„ ๋ณ€์ˆ˜์— ๋” ์ด์ƒ ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ •์ƒ์ ์ธ ๋™์ž‘์ž…๋‹ˆ๋‹ค.

4๋‹จ๊ณ„: ๊ทผ๋ณธ ์›์ธ ๋ถ„์„

Step 11: ๊ด€์ฐฐ๋œ ์‹คํ–‰์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„

๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ ๊ด€์ฐฐํ•œ ๊ฒƒ:

  1. ๋ฐ˜๋ณต ํšŸ์ˆ˜: 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต (offset = 0, offset = 1)
  2. ๊ธฐ๋Œ€๊ฐ’: ํฌ๊ธฐ 3์˜ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ๋Š” 3๋ฒˆ ๋ฐ˜๋ณตํ•ด์•ผ ํ•จ (offset = 0, 1, 2)
  3. ๋ˆ„๋ฝ: ์„ธ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 2)

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐํ•ด์•ผ ํ•  ๊ฒƒ:

  • ์Šค๋ ˆ๋“œ 0: window_sum = input[-1] + input[0] + input[1] = (๊ฒฝ๊ณ„) + 0 + 1 = 1.0
  • ์Šค๋ ˆ๋“œ 1: window_sum = input[0] + input[1] + input[2] = 0 + 1 + 2 = 3.0
  • ์Šค๋ ˆ๋“œ 2: window_sum = input[1] + input[2] + input[3] = 1 + 2 + 3 = 6.0
  • ์Šค๋ ˆ๋“œ 3: window_sum = input[2] + input[3] + input[4] = 2 + 3 + (๊ฒฝ๊ณ„) = 5.0

Step 12: ์Šค๋ ˆ๋“œ 0์˜ ์‹ค์ œ ์‹คํ–‰ ์ถ”์ 

2๋ฒˆ๋งŒ ๋ฐ˜๋ณตํ•  ๊ฒฝ์šฐ (offset = 0, 1):

๋ฐ˜๋ณต 1 (offset = 0):

  • idx = thread_id + offset - 1 = 0 + 0 - 1 = -1
  • if 0 <= idx < SIZE: โ†’ if 0 <= -1 < 4: โ†’ False
  • ํ•ฉ์‚ฐ ์—ฐ์‚ฐ ๊ฑด๋„ˆ๋œ€

๋ฐ˜๋ณต 2 (offset = 1):

  • idx = thread_id + offset - 1 = 0 + 1 - 1 = 0
  • if 0 <= idx < SIZE: โ†’ if 0 <= 0 < 4: โ†’ True
  • window_sum += input[0] โ†’ window_sum += 0

๋ˆ„๋ฝ๋œ ๋ฐ˜๋ณต 3 (offset = 2):

  • idx = thread_id + offset - 1 = 0 + 2 - 1 = 1
  • if 0 <= idx < SIZE: โ†’ if 0 <= 1 < 4: โ†’ True
  • window_sum += input[1] โ†’ window_sum += 1 โ† ์ด ์—ฐ์‚ฐ์ด ์‹คํ–‰๋˜์ง€ ์•Š์Œ

๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 0์€ window_sum = 0 + 1 = 1 ๋Œ€์‹  window_sum = 0์„ ์–ป์Šต๋‹ˆ๋‹ค

5๋‹จ๊ณ„: ๋ฒ„๊ทธ ํ™•์ธ

๋ฌธ์ œ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด:

comptime ITER = 2                       # โ† ๋ฒ„๊ทธ: 3์ด์–ด์•ผ ํ•จ!

for offset in range(ITER):           # โ† 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต: [0, 1]
    idx = Int(thread_id) + offset - 1     # โ† offset = 2 ๋ˆ„๋ฝ
    if 0 <= idx < SIZE:
        value = rebind[Scalar[dtype]](a[idx])
        window_sum += value

๐ŸŽฏ ๊ทผ๋ณธ ์›์ธ ํ™•์ธ: ํฌ๊ธฐ 3์˜ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ๋ฅผ ์œ„ํ•ด ITER = 2๊ฐ€ ITER = 3์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜์ • ๋ฐฉ๋ฒ•: ์†Œ์Šค ์ฝ”๋“œ์—์„œ comptime ITER = 2๋ฅผ comptime ITER = 3์œผ๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์„ ๋•Œ:

  1. ์‹คํ–‰ ํ๋ฆ„์— ์ง‘์ค‘ - ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๊ฐ€ ๋ช‡ ๋ฒˆ ๊ฑธ๋ฆฌ๋Š”์ง€, ๋ฐ˜๋ณต์ด ๋ช‡ ๋ฒˆ ๋„๋Š”์ง€ ์„ธ์–ด๋ณด์„ธ์š”
  2. ์ˆ˜ํ•™์  ์ถ”๋ก  ์‚ฌ์šฉ - ์ผ์–ด๋‚˜์•ผ ํ•  ์ผ๊ณผ ์‹ค์ œ๋กœ ์ผ์–ด๋‚˜๋Š” ์ผ์„ ๋”ฐ์ ธ๋ณด์„ธ์š”
  3. ํŒจํ„ด ๋ถ„์„ - ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์ด๋Œ๋„๋ก ํ•˜์„ธ์š”
  4. ๊ต์ฐจ ๊ฒ€์ฆ - ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•ด ๊ฐ€์„ค์„ ํ…Œ์ŠคํŠธํ•˜์„ธ์š”

์ „๋ฌธ์ ์ธ GPU ๋””๋ฒ„๊น…์˜ ํ˜„์‹ค:

  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™” ๋•Œ๋ฌธ์— ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์‹คํŒจํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค
  • ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„์ด ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ๋ณด๋‹ค ๋” ์‹ ๋ขฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ํŒจํ„ด์ด ์ค‘์š”ํ•œ ๋””๋ฒ„๊น… ๋‹จ์„œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
  • ์†Œ์Šค ์ฝ”๋“œ ์ถ”๋ก ์ด ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ๊ธฐ๋Šฅ์„ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค

LayoutTensor ๋””๋ฒ„๊น…:

  • LayoutTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•ด๋„ ๊ทผ๋ณธ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฒ„๊ทธ๋Š” ๊ทธ๋Œ€๋กœ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค
  • ํ…์„œ ๋‚ด์šฉ์„ ๊ฒ€์‚ฌํ•˜๋ ค ํ•˜๊ธฐ๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋กœ์ง์— ์ง‘์ค‘ํ•˜์„ธ์š”
  • ์ฒด๊ณ„์ ์ธ ์ถ”๋ก ์œผ๋กœ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ์ถ”์ ํ•˜์„ธ์š”

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋Ÿฐ ์œ ํ˜•์˜ off-by-one (์—ญ์ฃผ: ๊ฒฝ๊ณ„๊ฐ’์ด 1๋งŒํผ ์–ด๊ธ‹๋‚˜๋Š” ์˜ค๋ฅ˜) ๋ฐ˜๋ณต๋ฌธ ๋ฒ„๊ทธ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งค์šฐ ํ”ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ• - ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ์ •๋ณด์— ์ˆ˜ํ•™์  ๋ถ„์„๊ณผ ํŒจํ„ด ์ธ์‹์„ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฒƒ - ์€ ๋„๊ตฌ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ „๋ฌธ GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ์‹ ๊ทธ๋Œ€๋กœ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: ๋กœ์ง ๋ฒ„๊ทธ์—์„œ ๊ต์ฐฉ ์ƒํƒœ๋กœ

๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค! ์ด์ œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • โœ… ํฌ๋ž˜์‹œ๋‚˜ ๋šœ๋ ทํ•œ ์ฆ์ƒ ์—†์ด๋„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • โœ… ํŒจํ„ด ๋ถ„์„์œผ๋กœ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ 
  • โœ… ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„์œผ๋กœ ๋ณ€์ˆ˜ ์ ‘๊ทผ์ด ์ œํ•œ๋œ ์ƒํ™ฉ์—์„œ ๋””๋ฒ„๊น…
  • โœ… ๋””๋ฒ„๊ฑฐ ๋„๊ตฌ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ˆ˜ํ•™์  ์ถ”๋ก  ์ ์šฉ

๋งˆ์ง€๋ง‰ ๋„์ „: ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ทธ๋Ÿฐ๋ฐ ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜์ง€๋„ ์•Š๊ณ  ๋๋‚˜์ง€๋„ ์•Š๋Š”๋‹ค๋ฉด์š”? ๊ทธ๋ƒฅ ์˜์›ํžˆ ๋ฉˆ์ถฐ๋ฒ„๋ฆฐ๋‹ค๋ฉด์š”?

์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€๋Š” ๊ถ๊ทน์˜ ๋””๋ฒ„๊น… ๋„์ „์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค:

  • โŒ ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€ ์—†์Œ (์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์ฒ˜๋Ÿผ)
  • โŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์—†์Œ (๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€์ฒ˜๋Ÿผ)
  • โŒ ์™„๋ฃŒ ์ž์ฒด๊ฐ€ ์—†์Œ - ๊ทธ๋ƒฅ ๋ฌดํ•œํžˆ ๋ฉˆ์ถค
  • โœ… ๊ณ ๊ธ‰ ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋ถ„์„์ด ํ•„์š”ํ•œ ์กฐ์šฉํ•œ ๊ต์ฐฉ ์ƒํƒœ

์ƒˆ๋กญ๊ฒŒ ์ตํžˆ๊ฒŒ ๋  ์Šคํ‚ฌ:

  • ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€ - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ์—์„œ ์กฐ์œจ ์‹คํŒจ ์ฐพ๊ธฐ
  • ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ๊ฒ€์‚ฌํ•˜๊ธฐ
  • ๋™๊ธฐํ™” ๋””๋ฒ„๊น… - ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ์‹คํŒจ ์ดํ•ดํ•˜๊ธฐ

๋””๋ฒ„๊น… ์ง„ํ™”:

  1. ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํฌ๋ž˜์‹œ ์‹ ํ˜ธ ๋”ฐ๋ผ๊ฐ€๊ธฐ โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ์ฐพ๊ธฐ
  2. ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„ํ•˜๊ธฐ โ†’ ๋กœ์ง ๋ฒ„๊ทธ ์ฐพ๊ธฐ
  3. ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€: ์Šค๋ ˆ๋“œ ์ƒํƒœ ์กฐ์‚ฌํ•˜๊ธฐ โ†’ ์กฐ์œจ ๋ฒ„๊ทธ ์ฐพ๊ธฐ

์ด์ „ ๋‘ ์‚ฌ๋ก€์—์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ์Šคํ‚ฌ - ๊ฐ€์„ค ์ˆ˜๋ฆฝ, ์ฆ๊ฑฐ ์ˆ˜์ง‘, ํŒจํ„ด ๋ถ„์„ - ์€ ๊ฐ€์žฅ ์–ด๋ ค์šด GPU ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ๋•Œ ํ•ต์‹ฌ์ด ๋ฉ๋‹ˆ๋‹ค: ์กฐ์œจ์ด ์–ด๊ธ‹๋‚˜ ์˜์›ํžˆ ์„œ๋กœ๋ฅผ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค.