Naive Version Using Global Memory

Overview

Implement a kernel that multiplies the square matrices \(A\) and \(B\) and stores the result in \(\text{output}\). This is the most basic implementation: each thread computes one element of the output matrix.

Key concepts

This puzzle covers:

  • 2D thread organization for matrix operations
  • Global memory access patterns
  • Matrix indexing in a row-major layout
  • Mapping between threads and output elements

The key is understanding how to map 2D thread indices to matrix elements and how to compute the dot products in parallel.

Configuration

  • Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(1 \times 1\)
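
Because the 3 x 3 block is larger than the 2 x 2 matrix, some launched threads have no output element to compute. The following is a minimal CPU-side sketch in plain Python (not GPU code) that enumerates the thread indices under the configuration above and applies a bounds check on row and col. The helper name active_threads is hypothetical and exists only for this illustration.

```python
# CPU-side model of the launch configuration; this is not GPU code.
SIZE = 2                   # the matrix is SIZE x SIZE
TPB = 3                    # threads per block along each axis
BLOCKS_PER_GRID = (1, 1)   # (blocks along x, blocks along y)


def active_threads(size, tpb, blocks):
    """Return the (row, col) pairs whose thread passes the bounds check."""
    active = []
    for block_y in range(blocks[1]):
        for block_x in range(blocks[0]):
            for thread_y in range(tpb):
                for thread_x in range(tpb):
                    # Same formula as the kernel: block_dim * block_idx + thread_idx
                    row = tpb * block_y + thread_y
                    col = tpb * block_x + thread_x
                    if row < size and col < size:
                        active.append((row, col))
    return active


active = active_threads(SIZE, TPB, BLOCKS_PER_GRID)
launched = TPB * TPB * BLOCKS_PER_GRID[0] * BLOCKS_PER_GRID[1]
print(f"{len(active)} of {launched} threads do work: {active}")
# prints: 4 of 9 threads do work: [(0, 0), (0, 1), (1, 0), (1, 1)]
```

The five remaining threads fall outside the matrix, which is why the kernel needs a guard before it reads or writes anything.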

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE)
  • Output: Layout.row_major(SIZE, SIZE)
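
As a rough illustration of what a row-major layout means for a SIZE x SIZE matrix, the sketch below computes the flat offset of each (row, col) pair using the standard row-major formula. It is plain Python and does not call LayoutTensor; the helper row_major_offset is a hypothetical name used only here.

```python
SIZE = 2


def row_major_offset(row, col, num_cols):
    """Flat index of element (row, col) when rows are stored one after another."""
    return row * num_cols + col


# Offsets of every element of a SIZE x SIZE matrix, arranged as the matrix itself.
offsets = [[row_major_offset(r, c, SIZE) for c in range(SIZE)] for r in range(SIZE)]
print(offsets)  # prints: [[0, 1], [2, 3]]
```

Elements in the same row sit next to each other in memory, while moving down a column jumps by num_cols elements.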

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


comptime TPB = 3
comptime SIZE = 2
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, TPB)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 6 lines)


View full file: problems/p16/p16.mojo

Tips
  1. Compute row and col from the thread indices
  2. Check that both indices are within size
  3. Accumulate the sum of products in a local variable
  4. Write the final sum to the correct output position

Running the code

To test your solution, run the command below that matches your environment in your terminal:

pixi run p16 --naive
pixi run -e amd p16 --naive
pixi run -e apple p16 --naive
uv run poe p16 --naive

If you have not solved the puzzle yet, the output looks like this:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])

Solution

fn naive_matmul[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x

    if row < size and col < size:
        var acc: output.element_type = 0

        @parameter
        for k in range(size):
            acc += a[row, k] * b[k, col]

        output[row, col] = acc


The naive matrix multiplication with LayoutTensor takes the following approach:

Matrix layout (2×2 example)

Matrix A:          Matrix B:                   Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]             [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]             [c[1,0] c[1,1]]

Implementation details

  1. Thread mapping:

    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
  2. Memory access pattern:

    • Direct 2D indexing: a[row, k]
    • Column access: b[k, col] (walks down column col of B as k increases)
    • Output write: output[row, col]
  3. Computation flow:

    # Declare a mutable accumulator with var, using the tensor's element type
    var acc: output.element_type = 0
    
    # @parameter unrolls the loop at compile time
    @parameter
    for k in range(size):
        acc += a[row, k] * b[k, col]
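
The three steps above can be sketched on the CPU to check the arithmetic by hand. The following plain-Python model runs the per-thread body once for every thread of a 3 x 3 block; it is an emulation, not the Mojo kernel. The input values for a and b are an assumption chosen for illustration (this section does not show how the host fills the buffers), and the names naive_matmul_thread and launch are hypothetical.

```python
def naive_matmul_thread(output, a, b, size, row, col):
    """Work done by the single thread mapped to (row, col)."""
    if row < size and col < size:          # threads outside the matrix do nothing
        acc = 0.0                          # local accumulator
        for k in range(size):
            acc += a[row][k] * b[k][col]   # row of A times column of B
        output[row][col] = acc             # one write per active thread


def launch(a, b, size, tpb):
    """Run the thread body sequentially for every thread of one tpb x tpb block."""
    output = [[0.0] * size for _ in range(size)]
    for thread_y in range(tpb):
        for thread_x in range(tpb):
            naive_matmul_thread(output, a, b, size, thread_y, thread_x)
    return output


# Assumed inputs, for illustration only.
a = [[0.0, 1.0], [2.0, 3.0]]
b = [[0.0, 2.0], [4.0, 6.0]]

result = launch(a, b, size=2, tpb=3)
flat = [value for row in result for value in row]
print(flat)  # prints: [4.0, 6.0, 12.0, 22.0]
```

With these assumed inputs the flattened result agrees with the expected buffer shown earlier, for example c[1,1] = 2 * 2 + 3 * 6 = 22.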
    

Key language features

  1. Variable declaration:

    • In var acc: output.element_type = 0, var declares a mutable variable and output.element_type gives it the same element type as the output tensor
    • Initialized to 0 before accumulation
  2. Loop optimization:

    • The @parameter decorator unrolls the loop at compile time
    • Improves performance for small matrices whose size is known in advance
    • Allows better instruction scheduling
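
To make "unrolling" concrete, the sketch below writes out by hand what the loop body becomes when size is 2: the loop disappears and one multiply-add remains per value of k. This is plain Python and only illustrates the idea; Python has no @parameter, and the code is not meant to show what the Mojo compiler actually emits. Both function names are hypothetical.

```python
def dot_loop(a_row, b_col, size):
    """Dot product with a runtime loop over k."""
    acc = 0.0
    for k in range(size):
        acc += a_row[k] * b_col[k]
    return acc


def dot_unrolled_size_2(a_row, b_col):
    """The same computation with the loop written out for size == 2."""
    acc = 0.0
    acc += a_row[0] * b_col[0]   # k = 0
    acc += a_row[1] * b_col[1]   # k = 1
    return acc


a_row = [2.0, 3.0]   # row 1 of an example A
b_col = [2.0, 6.0]   # column 1 of an example B
print(dot_loop(a_row, b_col, 2), dot_unrolled_size_2(a_row, b_col))
# prints: 22.0 22.0
```

Writing the statements out removes the loop counter and branch, which is only possible because the trip count is known before the code runs.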

Performance characteristics

  1. Memory access:

    • Each thread performs 2 x SIZE global memory reads
    • One global memory write per thread
    • No data reuse between threads
  2. Computational efficiency:

    • Simple to implement, but performance is suboptimal
    • Many redundant global memory reads
    • Does not use fast shared memory
  3. Limitations:

    • Heavy use of global memory bandwidth
    • Poor data locality
    • Scales poorly as matrices grow
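
The read counts above can be turned into a quick estimate of how much global memory traffic is redundant. The sketch below is a back-of-the-envelope model in plain Python that assumes one active thread per output element and ignores any hardware caching; the function name global_read_stats is hypothetical.

```python
def global_read_stats(size):
    """Return (total reads, distinct input elements, average reads per element)."""
    reads_per_thread = 2 * size            # one row of A plus one column of B
    active_threads = size * size           # one thread per output element
    total_reads = reads_per_thread * active_threads
    distinct_elements = 2 * size * size    # every element of A and of B
    return total_reads, distinct_elements, total_reads // distinct_elements


for size in (2, 64, 1024):
    total, distinct, reuse = global_read_stats(size)
    print(f"SIZE={size}: {total} reads of {distinct} distinct elements "
          f"(each element read {reuse} times)")
# prints:
# SIZE=2: 16 reads of 8 distinct elements (each element read 2 times)
# SIZE=64: 524288 reads of 8192 distinct elements (each element read 64 times)
# SIZE=1024: 2147483648 reads of 2097152 distinct elements (each element read 1024 times)
```

Under this model every input element is fetched SIZE times, so the redundancy grows linearly with the matrix dimension.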

This naive implementation serves as a baseline for understanding matrix multiplication on the GPU and shows why memory access patterns need to be optimized.