Overview

Implement a kernel that adds 10 to each position of vector a and stores the result in output.

Note: There are fewer threads per block than elements in a.

Key concepts

In this puzzle, you will learn about:

  • Using shared memory within a thread block
  • Synchronizing threads with a barrier
  • Managing block-local data storage

The key insight is that shared memory is fast, block-local storage that every thread in a block can access, and that using it requires coordination between those threads.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Notes:

  • Shared memory: fast storage shared by the threads in a block
  • Thread synchronization: coordination using barrier()
  • Memory scope: shared memory is visible only within its block
  • Access pattern: local index vs. global index

Warning: The amount of shared memory each block can have must be fixed as a constant. The size must be a compile-time constant (the code below passes the comptime value TPB), not a runtime variable. After writing to shared memory, call barrier() so that no thread runs ahead and reads data before the other threads have finished writing.

Educational note: In this puzzle, each thread accesses only its own shared memory location, so barrier() is not strictly necessary. It is included so that you practice the correct synchronization pattern needed in more complex scenarios.
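To see when the barrier does matter, consider a hypothetical variant (not part of this puzzle) in which each thread reads a slot written by a different thread. The Python sketch below simulates one block on the CPU. The sequential loop in the unsynchronized branch is only one possible interleaving, chosen for illustration; real GPU scheduling is not this predictable.

```python
# CPU-side simulation, not GPU code. It models one block of TPB threads that
# each load one value into shared memory and then read a *different* slot.
TPB = 4


def reverse_block(a, use_barrier):
    """Each thread i writes shared[i] = a[i], then reads shared[TPB - 1 - i]."""
    shared = [None] * TPB   # None stands in for uninitialized shared memory
    out = [None] * TPB
    if use_barrier:
        # Phase 1: every thread finishes its write before any thread reads.
        for i in range(TPB):
            shared[i] = a[i]
        # (the barrier would sit here)
        # Phase 2: every thread reads a slot written by another thread.
        for i in range(TPB):
            out[i] = shared[TPB - 1 - i]
    else:
        # One possible unsynchronized schedule: each thread runs its write and
        # its read back to back before the next thread starts.
        for i in range(TPB):
            shared[i] = a[i]
            out[i] = shared[TPB - 1 - i]
    return out


print(reverse_block([1, 2, 3, 4], use_barrier=True))   # [4, 3, 2, 1]
print(reverse_block([1, 2, 3, 4], use_barrier=False))  # [None, None, 2, 1]
```

In the second call, threads 0 and 1 read slots that have not been written yet. That is the failure the barrier prevents once threads depend on each other's data.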

Code to complete

comptime TPB = 4
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32


fn add_10_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # Load local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # wait for all threads to complete
    # works within a thread block
    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08.mojo

Tips
  1. Wait for the shared memory load to complete with barrier() (educational; not strictly needed here)
  2. Access shared memory with local_i: shared[local_i]
  3. Write the output with global_i: output[global_i]
  4. Add a guard: if global_i < size
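In this puzzle SIZE is an exact multiple of TPB, so the guard never rejects a thread. The CPU-side Python sketch below uses a hypothetical size of 6 to show when it starts to matter. The ceiling-division launch size is an assumption borrowed from common practice, not something the puzzle specifies, and guarded_threads is an illustrative name.

```python
# CPU-side illustration of the bounds guard; not GPU code.
TPB = 4


def guarded_threads(size, tpb=TPB):
    """Split launched global indices into those that pass `global_i < size` and those that fail."""
    blocks = (size + tpb - 1) // tpb  # ceiling division: enough blocks to cover size
    active, skipped = [], []
    for block_idx in range(blocks):
        for thread_idx in range(tpb):
            global_i = tpb * block_idx + thread_idx
            if global_i < size:
                active.append(global_i)
            else:
                skipped.append(global_i)
    return active, skipped


print(guarded_threads(8))  # ([0, 1, 2, 3, 4, 5, 6, 7], [])
print(guarded_threads(6))  # ([0, 1, 2, 3, 4, 5], [6, 7])
```

With 6 elements and 2 blocks of 4 threads, the last two threads have no element to process. Without the guard they would read and write past the end of the arrays.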

Running the code

To test your solution, run the command that matches your environment in a terminal:

pixi run p08
pixi run -e amd p08
pixi run -e apple p08
uv run poe p08

If you have not solved the puzzle yet, the output looks like this:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # Load local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Wait for all threads to complete (works within a thread block).
    # Note: barrier is not strictly needed here since each thread only accesses
    # its own shared memory location. However, it's included to teach proper
    # shared memory synchronization patterns for more complex scenarios where
    # threads need to coordinate access to shared data.
    barrier()

    # process using shared memory
    if global_i < size:
        output[global_i] = shared[local_i] + 10


This solution demonstrates the key concepts of using shared memory in GPU programming:

  1. Memory hierarchy

    • Global memory: the a and output arrays (slow, visible to all blocks)

    • Shared memory: the shared array (fast, local to a thread block)

    • Example with 8 elements processed by 4 threads per block:

      Global array a: [1 1 1 1 | 1 1 1 1]  # input: all ones
      
      Block (0):      Block (1):
      shared[0..3]    shared[0..3]
      [1 1 1 1]       [1 1 1 1]
      
  2. Thread coordination

    • Load phase:

      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # wait for all loads to complete
      
    • Processing phase: each thread adds 10 to its own shared memory value

    • Result: output[global_i] = shared[local_i] + 10 = 11

    Note: In this case, each thread writes and reads only its own shared memory location (shared[local_i]), so barrier() is not strictly necessary. It is included to practice the synchronization pattern that becomes essential when threads access each other's data.

  3. Index mapping

    • Global index: block_dim.x * block_idx.x + thread_idx.x

      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]
      
    • Local index: thread_idx.x is used for shared memory access

      Both blocks compute: 1 + 10 = 11
      
  4. Memory access pattern

    • Load: global → shared (coalesced reads bring in the 1 values)
    • Synchronize: barrier() guarantees that all loads are complete
    • Process: add 10 to the shared memory values
    • Store: write the result (11) to global memory

This pattern shows how to use shared memory to optimize data access while keeping the threads within a block coordinated.
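The four steps above (load, synchronize, process, store) can be traced on the CPU with a short Python simulation. This is a sketch for following the data flow, not Mojo and not GPU code: simulate_add_10_shared is a hypothetical name, the per-block loop stands in for blocks that would run in parallel, and the barrier is represented only by the boundary between the two inner loops.

```python
# CPU-side simulation of the load -> barrier -> process -> store pattern.
# Not Mojo and not GPU code; `simulate_add_10_shared` is a hypothetical name.
TPB = 4
SIZE = 8
BLOCKS = 2


def simulate_add_10_shared(a, size, tpb=TPB, blocks=BLOCKS):
    output = [0.0] * len(a)
    for block_idx in range(blocks):
        shared = [0.0] * tpb  # each block gets its own shared buffer
        # Load phase: every thread copies one element from global to shared.
        for local_i in range(tpb):
            global_i = tpb * block_idx + local_i
            if global_i < size:
                shared[local_i] = a[global_i]
        # barrier(): on a GPU, all threads in the block would wait here.
        # Process and store phase: add 10 and write back to global memory.
        for local_i in range(tpb):
            global_i = tpb * block_idx + local_i
            if global_i < size:
                output[global_i] = shared[local_i] + 10
    return output


result = simulate_add_10_shared([1.0] * SIZE, SIZE)
print(result)  # [11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]
```

The printed list matches the expected buffer shown in the test output for this puzzle.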