Overview

Implement a kernel that computes the dot product of the 1D LayoutTensor a and the 1D LayoutTensor b and stores the result in the 1D LayoutTensor output (a single value).

Note: There is one thread per position. You need only 2 global reads per thread and 1 global write per block.

Key concepts

In this puzzle you will learn:

  • Parallel reduction with LayoutTensor, continuing from Puzzle 8 and Puzzle 11
  • Managing shared memory with address_space
  • How multiple threads cooperate to produce a single result
  • Layout-aware tensor operations

The key is to understand how LayoutTensor simplifies memory management while preserving the efficiency of parallel reduction.

Configuration

  • Vector size: SIZE = 8
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Output size: 1
  • Shared memory: TPB elements

Notes:

  • LayoutTensor allocation: use LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
  • Element access: natural indexing with automatic bounds checking
  • Layout handling: separate layouts for input and output
  • Thread coordination: the same synchronization pattern, using barrier()

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(1)


fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    # FILL ME IN (roughly 13 lines)
    ...


View full file: problems/p12/p12_layout_tensor.mojo

Tips
  1. Create shared memory with LayoutTensor and address_space
  2. Store a[global_i] * b[global_i] in shared[local_i]
  3. Apply the parallel reduction pattern together with barrier()
  4. Have thread 0 write the final result to output[0]

Running the code

To test your solution, run the following command in your terminal, choosing the variant for your environment:

pixi run p12_layout_tensor
pixi run -e amd p12_layout_tensor
pixi run -e apple p12_layout_tensor
uv run poe p12_layout_tensor

If you have not solved the puzzle yet, the output looks like this:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])
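
As a quick sanity check on the expected value, here is a minimal host-side sketch in plain Python (not Mojo, and not part of the puzzle files). It assumes that a and b both hold the values 0 through 7, which matches the reduction trace shown in the solution walkthrough; the names a, b, and expected are illustrative only.

```python
# Host-side check of the expected dot product.
# Assumption: a and b both hold 0..SIZE-1, as in the reduction trace.
SIZE = 8
a = [float(i) for i in range(SIZE)]
b = [float(i) for i in range(SIZE)]

# Dot product = sum of element-wise products.
expected = sum(x * y for x, y in zip(a, b))
print(expected)  # 140.0
```

Under that assumption the sum is 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 = 140, which is the value the test harness reports as expected.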

Solution

fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Compute element-wise multiplication into shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Synchronize threads within block
    barrier()

    # Parallel reduction in shared memory
    stride = UInt(TPB // 2)
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        output[0] = shared[0]


This solution computes the dot product with a parallel reduction built on LayoutTensor. Let's walk through it step by step:

Phase 1: Element-wise multiplication

Each thread handles one multiplication using intuitive indexing:

shared[local_i] = a[global_i] * b[global_i]

Phase 2: Parallel reduction

A layout-aware, tree-based reduction:

Initial:  [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]
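
The trace above can be reproduced with a small host-side model in plain Python. This is only a sketch of the access pattern, not GPU code: the function name tree_reduce and the snapshot list are hypothetical, and it assumes the input length is a power of two, as TPB = 8 is here.

```python
def tree_reduce(values):
    """Return snapshots of the buffer after each step of an in-place tree reduction."""
    shared = list(values)
    snapshots = [list(shared)]
    stride = len(shared) // 2
    while stride > 0:
        # On the GPU, threads 0..stride-1 perform these adds in parallel.
        # Within one step no thread reads a slot that another thread writes,
        # so running them one after another gives the same result.
        for local_i in range(stride):
            shared[local_i] += shared[local_i + stride]
        snapshots.append(list(shared))
        stride //= 2
    return snapshots


steps = tree_reduce([i * i for i in range(8)])
for n, snap in enumerate(steps):
    print(n, snap)
```

Each printed row corresponds to one line of the trace, and the first element of the last row holds the final sum.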

Key implementation features

  1. Memory management:

    • Clean shared memory allocation with a single address_space parameter
    • Type-safe operations
    • Automatic bounds checking
    • Layout-aware indexing
  2. Thread synchronization:

    • barrier() after the initial multiplication
    • barrier() between reduction steps
    • Safe coordination between threads
  3. Reduction logic:

    stride = UInt(TPB // 2)
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
  4. Performance benefits:

    • \(O(\log n)\) time complexity (counted in parallel steps)
    • Coalesced memory access
    • Minimal thread divergence
    • Efficient use of shared memory
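
To make the \(O(\log n)\) claim concrete, the following plain-Python sketch counts the barrier-separated steps of the stride-halving loop and compares them with the n - 1 additions a single thread would perform one after another. The helper name reduction_steps is hypothetical, and the count assumes n is a power of two.

```python
def reduction_steps(n):
    """Count the stride-halving steps of the reduction loop for a power-of-two n."""
    steps = 0
    stride = n // 2
    while stride > 0:
        steps += 1
        stride //= 2
    return steps


for n in (8, 256, 1024):
    # Columns: element count, parallel steps, sequential additions.
    print(n, reduction_steps(n), n - 1)
```

The total number of additions is still n - 1; what shrinks is the number of steps that have to happen one after another.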

The LayoutTensor version keeps the efficiency of the parallel reduction and adds:

  • Stronger type safety
  • Cleaner memory management
  • Automatic layout awareness
  • More natural indexing syntax

Why barrier synchronization matters

The barrier() between reduction steps is essential for a correct result. Here is why:

Without barrier(), a race condition occurs:

Initial shared memory: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49

Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 0 can then move on to the next step (stride = 2) before Thread 2 has written shared[2] = 40,
  so it reads the stale value shared[2] = 4 instead of 40!
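
One way to see the effect of a missing barrier is to replay the same updates in different orders on the host. The sketch below is a deterministic plain-Python model, not a real GPU race: run_schedule and both schedules are hypothetical, and the "runaway" order is just one possible interleaving in which thread 0 completes all of its steps before any other thread has written anything.

```python
def run_schedule(schedule, tpb=8):
    """Replay (thread, stride) updates in the given order and return shared[0]."""
    shared = [float(i * i) for i in range(tpb)]
    for thread, stride in schedule:
        if thread < stride:
            shared[thread] += shared[thread + stride]
    return shared[0]


strides = [4, 2, 1]

# Barrier-like order: every thread finishes a stride before any thread starts the next.
in_step = [(t, s) for s in strides for t in range(8)]

# One possible order without barriers: thread 0 runs through all strides first.
runaway = [(0, s) for s in strides] + [(t, s) for s in strides for t in range(1, 8)]

print(run_schedule(in_step))  # 140.0
print(run_schedule(runaway))  # 21.0
```

In the runaway order thread 0 adds the stale values shared[2] = 4 and shared[1] = 1, so it ends with 16 + 4 + 1 = 21 instead of 140.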

With barrier():

Step 1 (stride = 4):
Threads 0 to 3 write their sums:
[16 26 40 58 16 25 36 49]
barrier() guarantees that these values are visible to every thread

Step 2 (stride = 2):
The updated values can now be read safely:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

barrier() guarantees that:

  1. All writes of the current step finish before anything moves on to the next
  2. Every thread sees the latest values
  3. No thread runs ahead
  4. Shared memory always stays in a consistent state

Without these synchronization points:

  • Race conditions occur
  • Threads read stale values
  • Results vary from run to run
  • The final sum can be wrong
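
The same synchronization pattern can be tried on the CPU with Python's threading.Barrier standing in for the GPU barrier(). This is only an analogy under stated assumptions: Python threads are not GPU threads, the worker function and the plain lists used for shared and output are hypothetical, and the inputs are assumed to be 0 through 7 as in the trace.

```python
import threading

TPB = 8
shared = [0.0] * TPB             # stands in for the shared-memory tensor
output = [0.0]                   # stands in for the 1-element output tensor
sync = threading.Barrier(TPB)    # stands in for the GPU barrier()


def worker(local_i, a, b):
    # Phase 1: one product per thread.
    shared[local_i] = a[local_i] * b[local_i]
    sync.wait()

    # Phase 2: tree reduction, one barrier per step.
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        sync.wait()  # every thread waits here, including the idle ones
        stride //= 2

    # Only thread 0 writes the final result.
    if local_i == 0:
        output[0] = shared[0]


a = [float(i) for i in range(TPB)]
b = [float(i) for i in range(TPB)]
threads = [threading.Thread(target=worker, args=(i, a, b)) for i in range(TPB)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(output[0])  # 140.0
```

Note that every thread calls sync.wait() in every step, even when it has no addition to perform; if idle threads skipped the wait, the barrier would never release.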