Overview

Implement a kernel that, for each position of the 1D LayoutTensor a, computes the sum of the previous 3 values (a window of 3 ending at the current position) and stores it in the 1D LayoutTensor output.

Note: You have 1 thread per position. You only need 1 global read and 1 global write per thread.
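To make the expected behavior concrete, here is a CPU reference of the pooling operation in plain Python (pooling_reference is a hypothetical helper name, not part of the puzzle code):

```python
def pooling_reference(a, window=3):
    # CPU reference: each output position holds the sum of the window of
    # up to `window` values ending at that position (shorter near the start).
    out = []
    for i in range(len(a)):
        start = max(0, i - window + 1)
        out.append(float(sum(a[start : i + 1])))
    return out

print(pooling_reference([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
# → [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```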

Key concepts

In this puzzle, you'll learn about:

  • Implementing sliding window operations with LayoutTensor
  • Managing shared memory with LayoutTensor's address_space, as covered in Puzzle 8
  • Efficient neighbor access patterns
  • Boundary condition handling

The key insight is how LayoutTensor simplifies shared memory management while preserving efficient window-based operations.

Configuration

  • Array size: SIZE = 8
  • Threads per block: TPB = 8
  • Window size: 3
  • Shared memory: TPB elements

Notes:

  • LayoutTensor allocation: Use LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
  • Window access: Natural indexing over a 3-element window
  • Boundary handling: The first two positions are special cases
  • Memory pattern: One shared-memory load per thread

Code to complete

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn pooling[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using tensor builder
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11_layout_tensor.mojo

Tips
  1. Create shared memory with LayoutTensor and address_space
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Handle the first two positions as special cases
  4. Use shared memory for the window operation
  5. Guard against out-of-bounds access

Running the code

To test your solution, run one of the following commands in your terminal:

pixi run p11_layout_tensor
pixi run -e amd p11_layout_tensor
pixi run -e apple p11_layout_tensor
uv run poe p11_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

Solution

fn pooling[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using tensor builder
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        output[0] = shared[0]
    elif global_i == 1:
        output[1] = shared[0] + shared[1]
    # Handle general case
    elif UInt(1) < global_i < size:
        output[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )
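The kernel's branch structure can be mirrored in plain Python to sanity-check the logic on the host (a sketch with hypothetical names; each loop iteration stands in for one GPU thread):

```python
def pooling_simulated(a):
    size = len(a)
    shared = list(a)        # stands in for the shared-memory tile
    output = [0.0] * size
    for global_i in range(size):  # each iteration plays the role of one thread
        if global_i == 0:
            output[0] = shared[0]
        elif global_i == 1:
            output[1] = shared[0] + shared[1]
        elif 1 < global_i < size:
            output[global_i] = (
                shared[global_i - 2] + shared[global_i - 1] + shared[global_i]
            )
    return output

print(pooling_simulated([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
# → [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```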


This is a sliding window sum implementation using LayoutTensor. The key steps are:

  1. Shared memory setup

    • LayoutTensor creates block-local storage via address_space:

      shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()

    • Each thread loads one element:

      Input array:  [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]

    • barrier() guarantees all loads have completed before any thread reads

  2. Boundary cases

    • Position 0: only one value

      output[0] = shared[0] = 0.0

    • Position 1: sum of the first two values

      output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0
  3. ๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ

    • ์œ„์น˜ 2 ์ดํ›„:

      Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0
      Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0
      Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0
      ...
      
    • LayoutTensor์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ:

      # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
      window_sum = shared[i-2] + shared[i-1] + shared[i]
      
  4. Memory access pattern

    • One global read per thread, staged through the shared tensor
    • Efficient neighbor access through shared memory
    • Benefits of LayoutTensor:
      • Automatic bounds checking
      • Natural window indexing
      • Layout-aware memory access
      • Type safety throughout

This approach combines the performance of shared memory with the safety and convenience of LayoutTensor:

  • Minimized global memory access
  • Simplified window operations
  • Clean boundary handling
  • Preserved coalesced access patterns

์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„์ž…๋‹ˆ๋‹ค:

[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
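As a quick self-check (plain Python, independent of the Mojo code), each entry above equals the sum of the input window of up to 3 values ending at that index:

```python
a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
expected = [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
for i, want in enumerate(expected):
    got = sum(a[max(0, i - 2) : i + 1])  # window of up to 3 values ending at i
    assert got == want, (i, got, want)
print("all window sums match")
```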