LayoutTensor Version

Overview

Implement a kernel that broadcast-adds the 1D LayoutTensors a and b and stores the result in the 2D LayoutTensor output.

Note: There are more threads than positions in the matrix.

Key concepts

In this puzzle, you'll learn about:

  • Using LayoutTensor for broadcast operations
  • Working with different tensor shapes
  • Handling 2D indexing with LayoutTensor

The key insight is that LayoutTensor lets the different tensor shapes \((1, n)\) and \((n, 1)\) broadcast naturally to \((n, n)\). Bounds checking is still required.

  • Tensor shapes: the input vectors have shapes \((1, n)\) and \((n, 1)\)
  • Broadcasting: the two dimensions combine to produce the \((n, n)\) output
  • Guard condition: bounds checking against the output size is still required
  • Thread bounds: there are more threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)
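
To make the thread-bounds point concrete, here is a minimal sketch in plain Python (a CPU-side emulation, not Mojo GPU code) that enumerates the 3×3 threads and keeps only those passing the guard. The helper name active_threads is hypothetical and used only for illustration; the constants mirror SIZE and THREADS_PER_BLOCK from the puzzle.

```python
# CPU-side emulation of the launch geometry; this is plain Python, not Mojo GPU code.
SIZE = 2                      # output is SIZE x SIZE, as in the puzzle
THREADS_PER_BLOCK = (3, 3)    # (x, y) threads launched in the single block


def active_threads(size, threads_per_block):
    """Return the (row, col) pairs that pass the guard `row < size and col < size`."""
    threads_x, threads_y = threads_per_block
    active = []
    for y in range(threads_y):        # plays the role of thread_idx.y (row)
        for x in range(threads_x):    # plays the role of thread_idx.x (col)
            row, col = y, x
            if row < size and col < size:
                active.append((row, col))
    return active


active = active_threads(SIZE, THREADS_PER_BLOCK)
total = THREADS_PER_BLOCK[0] * THREADS_PER_BLOCK[1]
print(f"{len(active)} of {total} threads write an element: {active}")
```

Under these assumptions, 4 of the 9 threads map to an output element, and the remaining 5 do nothing because the guard rejects them.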

Code to complete

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime out_layout = Layout.row_major(SIZE, SIZE)
comptime a_layout = Layout.row_major(1, SIZE)
comptime b_layout = Layout.row_major(SIZE, 1)


fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05_layout_tensor.mojo

Tips
  1. Get the 2D indices: row = thread_idx.y, col = thread_idx.x
  2. Add the guard: if row < size and col < size
  3. Inside the guard: think about how to broadcast the values of a and b as LayoutTensors

Running the code

To test your solution, run one of the following commands in your terminal:

pixi run p05_layout_tensor
pixi run -e amd p05_layout_tensor
pixi run -e apple p05_layout_tensor
uv run poe p05_layout_tensor

If you haven't solved the puzzle yet, the output looks like this:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 2.0, 11.0, 12.0])

Solution

fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row, col] = a[0, col] + b[row, 0]


This solution demonstrates the key concepts of LayoutTensor broadcasting and GPU thread mapping:

  1. Thread-to-matrix mapping

    • thread_idx.y selects the row and thread_idx.x selects the column
    • The natural 2D indexing matches the structure of the output matrix
    • Excess threads (3×3 grid) are handled by the bounds check
  2. Broadcasting mechanics

    • Input a has shape (1,n): a[0,col] broadcasts across rows
    • Input b has shape (n,1): b[row,0] broadcasts across columns
    • The output has shape (n,n): each element is the sum of the corresponding broadcast values
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. Bounds checking

    • The guard condition row < size and col < size prevents out-of-bounds access
    • A single check covers both the matrix bounds and the excess threads
    • No separate checks are needed for a and b: any row and col that pass the guard are also valid indices for a[0,col] and b[row,0]
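
The three points above can be traced on the CPU with a minimal sketch in plain Python (an emulation, not Mojo GPU code). The function name emulate_broadcast_add is hypothetical, and the inputs a = [1, 2] and b = [0, 10] are an assumption: they are one choice consistent with the expected buffer shown earlier, not values taken from the test harness.

```python
def emulate_broadcast_add(a, b, size, threads_per_block=(3, 3)):
    """Emulate the kernel on the CPU.

    a is a 1 x size row, b is a size x 1 column (nested lists).
    Returns the size x size result as a flat row-major list.
    """
    output = [0.0] * (size * size)
    threads_x, threads_y = threads_per_block
    for y in range(threads_y):          # plays the role of thread_idx.y
        for x in range(threads_x):      # plays the role of thread_idx.x
            row, col = y, x
            # Same guard as the kernel: excess threads skip the write.
            if row < size and col < size:
                # a varies only with col, b varies only with row.
                output[row * size + col] = a[0][col] + b[row][0]
    return output


# Assumed inputs, chosen to be consistent with the expected output above.
a = [[1.0, 2.0]]        # shape (1, 2)
b = [[0.0], [10.0]]     # shape (2, 1)
result = emulate_broadcast_add(a, b, size=2)
print(result)  # [1.0, 2.0, 11.0, 12.0]
```

Each of the four active threads writes exactly one element, so the loop order does not affect the result; on the GPU the same writes happen in parallel.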

This pattern forms the foundation for the more complex tensor operations covered in later puzzles.