๊ฐœ์š”

๋ฒกํ„ฐ a์™€ ๋ฒกํ„ฐ b์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ output(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ๊ฐ’์„ ํ•˜๋‚˜๋กœ ํ•ฉ์น˜๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜(parallel reduction) ๊ตฌํ˜„ํ•˜๊ธฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅํ•˜๊ธฐ
  • ์Šค๋ ˆ๋“œ๋ผ๋ฆฌ ํ˜‘๋ ฅํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ ๋งŒ๋“ค๊ธฐ

ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•ด, ํฉ์–ด์ ธ ์žˆ๋Š” ๊ฐ’๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋กœ ๋ชจ์•„๊ฐ€๋Š” ๊ณผ์ •์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ์ถœ๋ ฅ ํฌ๊ธฐ: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • ์š”์†Œ ์ ‘๊ทผ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ a์™€ b์—์„œ ๋Œ€์‘ํ•˜๋Š” ์š”์†Œ๋ฅผ ์ฝ์Œ
  • ๋ถ€๋ถ„ ๊ฒฐ๊ณผ: ์ค‘๊ฐ„ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ €์žฅ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ: ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์น˜๊ธฐ ์ „์— ๋™๊ธฐํ™”
  • ์ตœ์ข… ๋ฆฌ๋•์…˜: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์Šค์นผ๋ผ ์ถœ๋ ฅ์œผ๋กœ ๋ณ€ํ™˜

์ฐธ๊ณ : ์ด ๋ฌธ์ œ์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ํšŸ์ˆ˜๋ฅผ ์‹ ๊ฒฝ ์“ธ ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ ๋ฌธ์ œ๋Š” ๋‚˜์ค‘์— ๋‹ค๋ฃจ๊ฒ ์Šต๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32


fn dot_product(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    # FILL ME IN (roughly 13 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12.mojo

ํŒ
  1. shared[local_i]์— a[global_i] * b[global_i]๋ฅผ ์ €์žฅ
  2. barrier()๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋™๊ธฐํ™”
  3. ์Šค๋ ˆ๋“œ 0์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ชจ๋“  ๊ณฑ์„ ํ•ฉ์‚ฐ
  4. ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ output[0]์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p12
pixi run -e amd p12
pixi run -e apple p12
uv run poe p12

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

์†”๋ฃจ์…˜

fn dot_product(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    barrier()

    # The following causes race condition: all threads writing to the same location
    # out[0] += shared[local_i]

    # Instead can do parallel reduction in shared memory as opposed to
    # global memory which has no guarantee on synchronization.
    # Loops using global memory can cause thread divergence because
    # fundamentally GPUs execute threads in warps (groups of 32 threads typically)
    # and warps can be scheduled independently.
    # However, shared memory does not have such issues as long as we use `barrier()`
    # correctly when we're in the same thread block.
    stride = UInt(TPB // 2)
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # only thread 0 writes the final result
    if local_i == 0:
        output[0] = shared[0]


๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณฑ์…ˆ ํ•˜๋‚˜๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

Thread i: shared[i] = a[i] * b[i]

2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๋ฅผ ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค:

์ดˆ๊ธฐ๊ฐ’:    [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]

๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง•

  1. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ •ํ™•ํžˆ ๋‘ ๊ฐ’์„ ๋กœ๋“œ (a[i], b[i])
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
    • ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— 1ํšŒ ๊ธฐ๋ก
  2. ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”:

    • ์ดˆ๊ธฐ ๊ณฑ์…ˆ ํ›„ barrier()
    • ๊ฐ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ํ›„ barrier()
    • ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ๊ฐ„ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€
  3. ๋ฆฌ๋•์…˜ ๋กœ์ง:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
    • ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค stride๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ
    • ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๋งŒ ๋ง์…ˆ ์ˆ˜ํ–‰
    • ์ž‘์—… ํšจ์œจ์„ฑ ์œ ์ง€
  4. ์„ฑ๋Šฅ ๊ณ ๋ ค ์‚ฌํ•ญ:

    • \(n\)๊ฐœ ์š”์†Œ์— ๋Œ€ํ•ด \(\log_2(n)\) ๋‹จ๊ณ„
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
    • ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ

์ด ๊ตฌํ˜„์€ ์ˆœ์ฐจ ์‹คํ–‰์˜ \(O(n)\)์— ๋น„ํ•ด \(O(\log n)\) ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ๋‹ฌ์„ฑํ•˜๋ฉฐ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์œ„๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ

๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ barrier()๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

barrier()๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค:

์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16
Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25
Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36
Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49

barrier ์—†์ด:
- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16
- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ
  16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค!

barrier()๊ฐ€ ์žˆ์œผ๋ฉด:

Step 1 (stride = 4):
๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก:
[16 26 40 58 16 25 36 49]
barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ

Step 2 (stride = 2):
์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

barrier()๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ
  2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ
  3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€

์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด:

  • ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ 
  • ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ
  • ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ 
  • ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค