Overview

Implement a kernel that adds 10 to each position of a 1D LayoutTensor a and stores the result in a 1D LayoutTensor output.

Note: Each block has fewer threads than a has elements, so a single block cannot cover the whole input.

Key concepts

In this puzzle, you'll learn about:

  • Using LayoutTensor's shared memory support through address_space
  • Synchronizing threads when using shared memory
  • Managing block-local data with LayoutTensor

The key insight is how LayoutTensor simplifies shared memory management while keeping the performance of block-local storage.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block
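
The configuration above can be checked with a small host-side model. The following is a minimal Python sketch, not Mojo GPU code: it only reproduces the index arithmetic global_i = block_dim.x * block_idx.x + thread_idx.x that the kernel uses, and the helper name global_indices is hypothetical.

```python
# Host-side model of the launch configuration; this is not GPU code.
SIZE = 8    # number of elements in a
TPB = 4     # threads per block
BLOCKS = 2  # number of blocks


def global_indices(block_idx: int, tpb: int = TPB) -> list[int]:
    """Global index handled by each thread of one block (hypothetical helper)."""
    return [tpb * block_idx + thread_idx for thread_idx in range(tpb)]


# Each block owns one contiguous slice of TPB elements.
mapping = {b: global_indices(b) for b in range(BLOCKS)}
for b, idxs in mapping.items():
    print(f"block {b}: local 0..{TPB - 1} -> global {idxs}")
```

With 2 blocks of 4 threads, every one of the 8 elements is handled by exactly one thread, and each block only needs TPB shared memory slots because it indexes shared memory with the local thread index.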

Key differences from the raw memory approach

  1. Memory allocation: use a LayoutTensor with address_space instead of stack_allocation

    # Raw memory approach
    shared = stack_allocation[TPB, Scalar[dtype]]()

    # LayoutTensor approach
    shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()

  2. Memory access: same syntax

    # Raw memory approach
    shared[local_i] = a[global_i]

    # LayoutTensor approach
    shared[local_i] = a[global_i]

  3. Safety features:

    • Type safety
    • Layout management
    • Memory alignment handling

Note: LayoutTensor handles the memory layout, but you still have to manage thread synchronization yourself with barrier() when using shared memory.

Educational note: In this puzzle each thread accesses only its own shared memory location, so barrier() is not strictly necessary. It is included to teach the correct synchronization pattern needed in more complex scenarios.
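
To illustrate the kind of "more complex scenario" the note refers to, here is a minimal host-side sketch in Python, not Mojo GPU code. It models a hypothetical kernel (not part of this puzzle) in which each thread also reads its neighbour's shared slot. Running the threads one after another stands in for just one possible schedule without a barrier; on a real GPU the unsynchronized version is a data race and its result is not defined. The names neighbour_sum and STALE are assumptions made for this illustration.

```python
TPB = 4
STALE = -1.0  # stands in for whatever shared memory held before the load


def neighbour_sum(a, use_barrier):
    """Each thread t computes shared[t] + shared[(t + 1) % TPB]."""
    shared = [STALE] * TPB
    out = [0.0] * TPB
    if use_barrier:
        # Phase 1: every thread finishes its load before any thread reads.
        for t in range(TPB):
            shared[t] = a[t]
        # barrier() would sit here in the kernel.
        # Phase 2: all slots are populated, so neighbour reads are valid.
        for t in range(TPB):
            out[t] = shared[t] + shared[(t + 1) % TPB]
    else:
        # No barrier: thread t may read slot t + 1 before it has been written.
        for t in range(TPB):
            shared[t] = a[t]
            out[t] = shared[t] + shared[(t + 1) % TPB]
    return out


a = [1.0, 2.0, 3.0, 4.0]
with_barrier = neighbour_sum(a, use_barrier=True)
without_barrier = neighbour_sum(a, use_barrier=False)
print(with_barrier)     # [3.0, 5.0, 7.0, 5.0]
print(without_barrier)  # [0.0, 1.0, 2.0, 5.0] under this particular schedule
```

In the puzzle itself every thread reads back only the slot it wrote, which is why removing barrier() would not change the result there.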

Code to complete

comptime TPB = 4
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using LayoutTensor with explicit address_space
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08_layout_tensor.mojo

Tips
  1. Create the shared memory LayoutTensor with the address_space parameter
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Synchronize with barrier() (educational; not strictly needed here)
  4. Process the data using shared memory indices
  5. Guard against out-of-bounds access
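
The five tips can be traced end to end with a host-side model. This is a minimal Python sketch, not the Mojo solution: loops stand in for GPU threads, the function name add_10_guarded is hypothetical, and the 6-element input is an assumption added to show the guard at work, since the puzzle's SIZE = 8 divides evenly by TPB and never triggers it.

```python
TPB = 4
BLOCKS = 2


def add_10_guarded(a):
    """Model of load -> barrier -> process -> store with a bounds guard."""
    size = len(a)
    output = [0.0] * size
    skipped = []  # global indices rejected by the guard
    for block_idx in range(BLOCKS):
        shared = [0.0] * TPB  # each block gets its own shared buffer
        # Load phase: copy this block's slice of a into shared memory.
        for thread_idx in range(TPB):
            global_i = TPB * block_idx + thread_idx
            if global_i < size:
                shared[thread_idx] = a[global_i]
        # barrier() would sit here in the kernel.
        # Process and store phase: add 10 and write back to global memory.
        for thread_idx in range(TPB):
            global_i = TPB * block_idx + thread_idx
            if global_i < size:
                output[global_i] = shared[thread_idx] + 10
            else:
                skipped.append(global_i)
    return output, skipped


full_out, full_skipped = add_10_guarded([1.0] * 8)
short_out, short_skipped = add_10_guarded([1.0] * 6)
print(full_out, full_skipped)
print(short_out, short_skipped)
```

With 8 elements no thread is skipped and every output is 11.0. With 6 elements the last two threads of the second block compute global indices 6 and 7, and the guard keeps them from reading or writing past the end of the tensors.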

Running the code

To test your solution, run the command for your environment in a terminal:

pixi run p08_layout_tensor
pixi run -e amd p08_layout_tensor
pixi run -e apple p08_layout_tensor
uv run poe p08_layout_tensor

If you haven't solved the puzzle yet, the output will look like this:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using LayoutTensor with explicit address_space
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Note: barrier is not strictly needed here since each thread only accesses
    # its own shared memory location. However, it's included to teach proper
    # shared memory synchronization patterns for more complex scenarios where
    # threads need to coordinate access to shared data.
    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10


This solution demonstrates how LayoutTensor simplifies shared memory usage while maintaining performance:

  1. Memory hierarchy with LayoutTensor

    • Global tensors: a and output (slow, visible to all blocks)

    • Shared tensor: shared (fast, local to a thread block)

    • Example with 8 elements processed by 4 threads per block:

      Global tensor a: [1 1 1 1 | 1 1 1 1]  # input: all ones
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. Thread coordination

    • Load phase (with natural indexing):

      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    ↓         ↓        ↓         ↓   # wait for all loads to complete

    • Process phase: each thread adds 10 to its own shared tensor value

    • Result: output[global_i] = shared[local_i] + 10 = 11

    Note: In this case each thread only writes to and reads from its own shared memory location (shared[local_i]), so barrier() is not strictly necessary. It is included to teach the synchronization pattern that becomes essential when threads access each other's data.

  3. LayoutTensor benefits

    • Shared memory allocation:

      # Clean LayoutTensor API using address_space
      shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()

    • Natural indexing for both global and shared memory:

      Block 0 output: [11 11 11 11]
      Block 1 output: [11 11 11 11]

    • Built-in layout management and type safety

  4. Memory access pattern

    • Load: global tensor → shared tensor (optimized)
    • Sync: same barrier() requirement as the raw memory version
    • Process: add 10 to the shared memory values
    • Store: write the results (11) back to the global tensor

This pattern shows how LayoutTensor preserves the performance benefits of shared memory while providing a more ergonomic API and built-in features.