νμΌλ§ λ²μ
κ°μ
LayoutTensorλ₯Ό μ¬μ©ν νμΌλ§ νλ ¬ κ³±μ μΌλ‘ μ λ°© νλ ¬ \(A\) μ \(B\) λ₯Ό κ³±νλ 컀λμ ꡬννμΈμ. ν° νλ ¬μ μμ μ‘°κ°(νμΌ)μΌλ‘ λλμ΄ μ²λ¦¬νλ λ°©μμ λλ€.
ν΅μ¬ κ°λ
- LayoutTensorλ₯Ό μ¬μ©ν νλ ¬ νμΌλ§μΌλ‘ ν¨μ¨μ μΈ μ°μ°
- μ μ ν λ μ΄μμμ μ¬μ©ν λ©ν° λΈλ‘ μ‘°μ¨
- TensorBuilderλ₯Ό ν΅ν ν¨μ¨μ μΈ κ³΅μ λ©λͺ¨λ¦¬ νμ©
- LayoutTensor μΈλ±μ±μ μ¬μ©ν νμΌ κ²½κ³ μ²λ¦¬
ꡬμ±
- νλ ¬ ν¬κΈ°: \(\text{SIZE_TILED} = 9\)
- λΈλ‘λΉ μ€λ λ μ: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- 그리λ μ°¨μ: \(3 \times 3\) λΈλ‘
- 곡μ λ©λͺ¨λ¦¬: λΈλ‘λΉ \(\text{TPB} \times \text{TPB}\) LayoutTensor 2κ°
λ μ΄μμ ꡬμ±:
- μ
λ ₯ A:
Layout.row_major(SIZE_TILED, SIZE_TILED) - μ
λ ₯ B:
Layout.row_major(SIZE_TILED, SIZE_TILED) - μΆλ ₯:
Layout.row_major(SIZE_TILED, SIZE_TILED) - 곡μ λ©λͺ¨λ¦¬: TensorBuilderλ₯Ό μ¬μ©ν
TPB Γ TPBLayoutTensor 2κ°
νμΌλ§ μ λ΅
λΈλ‘ ꡬμ±
Grid Layout (3Γ3): Thread Layout per Block (3Γ3):
[B00][B01][B02] [T00 T01 T02]
[B10][B11][B12] [T10 T11 T12]
[B20][B21][B22] [T20 T21 T22]
κ° λΈλ‘μ LayoutTensor μΈλ±μ±μ μ¬μ©νμ¬ νλμ νμΌμ μ²λ¦¬
νμΌ μ²λ¦¬ λ¨κ³
- μ€λ λ μμΉμ λν μ μ μΈλ±μ€μ λ‘컬 μΈλ±μ€ κ³μ°
- Aμ B νμΌμ μν 곡μ λ©λͺ¨λ¦¬ ν λΉ
- κ° νμΌμ λν΄:
- νλ ¬ Aμ Bμμ νμΌ λ‘λ
- λΆλΆ κ³± κ³μ°
- λ μ§μ€ν°μ κ²°κ³Ό λμ
- μ΅μ’ λμ κ²°κ³Ό κΈ°λ‘
λ©λͺ¨λ¦¬ μ κ·Ό ν¨ν΄
Matrix A (8Γ8) Matrix B (8Γ8) Matrix C (8Γ8)
+---+---+---+ +---+---+---+ +---+---+---+
|T00|T01|T02| ... |T00|T01|T02| ... |T00|T01|T02| ...
+---+---+---+ +---+---+---+ +---+---+---+
|T10|T11|T12| |T10|T11|T12| |T10|T11|T12|
+---+---+---+ +---+---+---+ +---+---+---+
|T20|T21|T22| |T20|T21|T22| |T20|T21|T22|
+---+---+---+ +---+---+---+ +---+---+---+
... ... ...
νμΌ μ²λ¦¬ κ³Όμ (C[T11] κ³μ° μμ):
1. Aμ Bμμ νμΌ λ‘λ:
+---+ +---+
|A11| Γ |B11| κ° λ¨κ³ kμ λν΄:
+---+ +---+ C[T11] += A[row, k] Γ B[k, col]
2. νμΌ μ΄λ:
λ¨κ³ 1 λ¨κ³ 2 λ¨κ³ 3
A: [T10] A: [T11] A: [T12]
B: [T01] B: [T11] B: [T21]
3. νμΌ λ΄ κ° μ€λ λ (i,j)μ μ°μ°:
C[i,j] = Ξ£ (A[i,k] Γ B[k,j]), kλ νμΌ λλΉ λ²μ
λκΈ°ν νμ μμ :
* νμΌμ 곡μ λ©λͺ¨λ¦¬μ λ‘λν ν
* κ° λ¨κ³μ μ°μ°μ΄ λλ ν
μμ±ν μ½λ
comptime SIZE_TILED = 9
comptime BLOCKS_PER_GRID_TILED = (3, 3) # each block convers 3x3 elements
comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
fn matmul_tiled[
layout: Layout, size: UInt
](
output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
local_row = thread_idx.y
local_col = thread_idx.x
tiled_row = block_idx.y * TPB + thread_idx.y
tiled_col = block_idx.x * TPB + thread_idx.x
# FILL ME IN (roughly 20 lines)
μ 체 νμΌ λ³΄κΈ°: problems/p16/p16.mojo
ν
-
νμ€ μΈλ±μ± κ·μΉμ μ¬μ©νμΈμ:
local_row = thread_idx.y,local_col = thread_idx.x -
μ μ μμΉ κ³μ°:
global_row = block_idx.y * TPB + local_row그리κ³
global_col = block_idx.x * TPB + local_colμ μ μΈλ±μ± 곡μ μ΄ν΄νκΈ°:
-
κ° λΈλ‘μ νλ ¬μ
TPB Γ TPBνμΌμ μ²λ¦¬ν©λλ€ -
block_idx.yλ νμ¬ λͺ λ²μ§Έ λΈλ‘ νμΈμ§λ₯Ό λνλ λλ€ (0, 1, 2β¦) -
block_idx.y * TPBλ ν΄λΉ λΈλ‘ νμΌμ μμ νμ λλ€ -
local_row(0~TPB-1)μ λΈλ‘ λ΄ μ€λ λμ μ€νμ μ λλ€ -
λμ λνλ©΄ μ 체 νλ ¬μμμ μ€μ ν μμΉκ° λ©λλ€
TPB=3 μμ:
Block Layout: Global Matrix (9Γ9): [B00][B01][B02] [0 1 2 | 3 4 5 | 6 7 8] [B10][B11][B12] β [9 A B | C D E | F G H] [B20][B21][B22] [I J K | L M N | O P Q] ββββββββββββββββββββββ [R S T | U V W | X Y Z] [a b c | d e f | g h i] [j k l | m n o | p q r] ββββββββββββββββββββββ [s t u | v w x | y z Ξ±] [Ξ² Ξ³ Ξ΄ | Ξ΅ ΞΆ Ξ· | ΞΈ ΞΉ ΞΊ] [Ξ» ΞΌ Ξ½ | ΞΎ ΞΏ Ο | Ο Ο Ο] Thread(1,2) in Block(1,0): - block_idx.y = 1, local_row = 1 - global_row = 1 * 3 + 1 = 4 - μ΄ μ€λ λλ νλ ¬μ 4λ²μ§Έ νμ λ΄λΉ -
-
곡μ λ©λͺ¨λ¦¬ ν λΉ (
.fill(0)μΌλ‘ μ¬μ μ΄κΈ°νλ¨) -
9Γ9 μλ²½ν νμΌλ§μ΄λ―λ‘ κ²½κ³ κ²μ¬κ° λΆνμ!
-
μ μ ν λκΈ°νμ ν¨κ» νμΌ κ° κ²°κ³Όλ₯Ό λμ
μ½λ μ€ν
μ루μ μ ν μ€νΈνλ €λ©΄ ν°λ―Έλμμ λ€μ λͺ λ Ήμ΄λ₯Ό μ€ννμΈμ:
pixi run p16 --tiled
pixi run -e amd p16 --tiled
pixi run -e apple p16 --tiled
uv run poe p16 --tiled
νΌμ¦μ μμ§ νμ§ μμλ€λ©΄ μΆλ ₯μ λ€μκ³Ό κ°μ΅λλ€:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 4176.0, 4248.0, 9504.0, 9738.0, 9972.0, 10206.0, 10440.0, 10674.0, 10908.0, 11142.0, 11376.0, 15336.0, 15732.0, 16128.0, 16524.0, 16920.0, 17316.0, 17712.0, 18108.0, 18504.0, 21168.0, 21726.0, 22284.0, 22842.0, 23400.0, 23958.0, 24516.0, 25074.0, 25632.0, 27000.0, 27720.0, 28440.0, 29160.0, 29880.0, 30600.0, 31320.0, 32040.0, 32760.0, 32832.0, 33714.0, 34596.0, 35478.0, 36360.0, 37242.0, 38124.0, 39006.0, 39888.0, 38664.0, 39708.0, 40752.0, 41796.0, 42840.0, 43884.0, 44928.0, 45972.0, 47016.0, 44496.0, 45702.0, 46908.0, 48114.0, 49320.0, 50526.0, 51732.0, 52938.0, 54144.0, 50328.0, 51696.0, 53064.0, 54432.0, 55800.0, 57168.0, 58536.0, 59904.0, 61272.0])
μ루μ : μλ νμΌλ§
fn matmul_tiled[
layout: Layout, size: UInt
](
output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
local_row = thread_idx.y
local_col = thread_idx.x
tiled_row = block_idx.y * TPB + local_row
tiled_col = block_idx.x * TPB + local_col
a_shared = LayoutTensor[
dtype,
Layout.row_major(TPB, TPB),
MutAnyOrigin,
address_space = AddressSpace.SHARED,
].stack_allocation()
b_shared = LayoutTensor[
dtype,
Layout.row_major(TPB, TPB),
MutAnyOrigin,
address_space = AddressSpace.SHARED,
].stack_allocation()
var acc: output.element_type = 0
# Iterate over tiles to compute matrix product
@parameter
for tile in range((size + TPB - 1) // TPB):
# Load A tile - global row stays the same, col determined by tile
if tiled_row < size and (tile * TPB + local_col) < size:
a_shared[local_row, local_col] = a[
tiled_row, tile * TPB + local_col
]
# Load B tile - row determined by tile, global col stays the same
if (tile * TPB + local_row) < size and tiled_col < size:
b_shared[local_row, local_col] = b[
tile * TPB + local_row, tiled_col
]
barrier()
# Matrix multiplication within the tile
if tiled_row < size and tiled_col < size:
@parameter
for k in range(min(Int(TPB), Int(size - tile * TPB))):
acc += a_shared[local_row, k] * b_shared[k, local_col]
barrier()
# Write out final result
if tiled_row < size and tiled_col < size:
output[tiled_row, tiled_col] = acc
νμΌλ§ νλ ¬ κ³±μ ꡬνμ μμ νμΌ \((3 \times 3)\) μ μ¬μ©νμ¬ ν° νλ ¬ \((9 \times 9)\) μ ν¨μ¨μ μΌλ‘ μ²λ¦¬νλ λ°©λ²μ 보μ¬μ€λλ€. λμ λ°©μμ λ€μκ³Ό κ°μ΅λλ€:
-
곡μ λ©λͺ¨λ¦¬ ν λΉ
Input matrices (9Γ9) - (3Γ3) νμΌλ§μ λ± λ§λ ν¬κΈ°: A = [0 1 2 3 4 5 6 7 8 ] B = [0 2 4 6 8 10 12 14 16] [9 10 11 12 13 14 15 16 17] [18 20 22 24 26 28 30 32 34] [18 19 20 21 22 23 24 25 26] [36 38 40 42 44 46 48 50 52] [27 28 29 30 31 32 33 34 35] [54 56 58 60 62 64 66 68 70] [36 37 38 39 40 41 42 43 44] [72 74 76 78 80 82 84 86 88] [45 46 47 48 49 50 51 52 53] [90 92 94 96 98 100 102 104 106] [54 55 56 57 58 59 60 61 62] [108 110 112 114 116 118 120 122 124] [63 64 65 66 67 68 69 70 71] [126 128 130 132 134 136 138 140 142] [72 73 74 75 76 77 78 79 80] [144 146 148 150 152 154 156 158 160] λΈλ‘λΉ κ³΅μ λ©λͺ¨λ¦¬ (3Γ3): a_shared[TPB, TPB] b_shared[TPB, TPB] -
νμΌ μ²λ¦¬ 루ν
νμΌ μ = 9 // 3 = 3κ° (λλ¨Έμ§ μμ΄ λ± λλ μ§!) κ° νμΌμ λν΄: 1. Aμ Bμμ νμΌ λ‘λ 2. λΆλΆ κ³± κ³μ° 3. λ μ§μ€ν°μ λμ -
λ©λͺ¨λ¦¬ λ‘λ© ν¨ν΄
-
\((9 \times 9)\) μ΄ λ± λλ μ§λ―λ‘ κ²½κ³ κ²μ¬κ° κΈ°μ μ μΌλ‘λ λΆνμνμ§λ§, λ°©μ΄μ νλ‘κ·Έλλ°κ³Ό λ€λ₯Έ νλ ¬ ν¬κΈ°μλ λμν μ μλλ‘ ν¬ν¨ν©λλ€.
# A νμΌ λ‘λ - μ μ νμ κ·Έλλ‘, μ΄μ νμΌμ μν΄ κ²°μ if tiled_row < size and (tile * TPB + local_col) < size: a_shared[local_row, local_col] = a[ tiled_row, tile * TPB + local_col ] # B νμΌ λ‘λ - νμ νμΌμ μν΄ κ²°μ , μ μ μ΄μ κ·Έλλ‘ if (tile * TPB + local_row) < size and tiled_col < size: b_shared[local_row, local_col] = b[ tile * TPB + local_row, tiled_col ]
-
-
νμΌ λ΄ μ°μ°
for k in range(min(TPB, size - tile * TPB)): acc += a_shared[local_row, k] * b_shared[k, local_col]-
곡μ λ©λͺ¨λ¦¬ λ± ν¬ μΆ©λ ννΌ:
Bank Conflict Free (Good): Bank Conflicts (Bad): Thread0: a_shared[0,k] b_shared[k,0] Thread0: a_shared[k,0] b_shared[0,k] Thread1: a_shared[0,k] b_shared[k,1] Thread1: a_shared[k,0] b_shared[1,k] Thread2: a_shared[0,k] b_shared[k,2] Thread2: a_shared[k,0] b_shared[2,k] β β μλ‘ λ€λ₯Έ λ± ν¬μ λ³λ ¬ μ κ·Ό b_sharedκ° μ΄ μ°μ μ΄μλ€λ©΄ (a_sharedλ broadcast) κ°μ λ± ν¬μ μ§λ ¬ μ 근곡μ λ©λͺ¨λ¦¬ λ± ν¬ μΆ©λ μ€λͺ :
- μΌμͺ½ (Good):
b_shared[k,threadIdx.x]λ μλ‘ λ€λ₯Έ λ± ν¬μ μ κ·Όνκ³ ,a_shared[0,k]λ λͺ¨λ μ€λ λμ λΈλ‘λμΊμ€νΈ λ©λλ€ - μ€λ₯Έμͺ½ (Bad): b_sharedκ° μ΄ μ°μ μ΄μλ€λ©΄ μ€λ λλ€μ΄ λμμ κ°μ λ± ν¬μ μ κ·Όνκ² λ©λλ€
- ν΅μ¬: μ΄κ²μ μ μ λ©λͺ¨λ¦¬ λ³ν©μ΄ μλ 곡μ λ©λͺ¨λ¦¬ μ κ·Ό ν¨ν΄μ κ΄ν κ²μ λλ€
- λ± ν¬ κ΅¬μ‘°: 곡μ λ©λͺ¨λ¦¬λ 32κ° λ± ν¬λ‘ ꡬμ±λμ΄ μμΌλ©°, μ¬λ¬ μ€λ λκ° λμμ κ°μ λ± ν¬μ λ€λ₯Έ μ£Όμμ μ κ·Όν λ μΆ©λμ΄ λ°μν©λλ€
- μΌμͺ½ (Good):
-
-
λκΈ°ν μ§μ
barrier() νΈμΆ μμ : 1. νμΌ λ‘λ© ν 2. νμΌ μ°μ° ν
μ£Όμ μ±λ₯ νΉμ±:
- \((3 \times 3)\) νμΌλ‘ \((9 \times 9)\) νλ ¬ μ²λ¦¬ (λ± λ§λ ν¬κΈ°!)
- 곡μ λ©λͺ¨λ¦¬λ‘ λΉ λ₯Έ νμΌ μ κ·Ό
- λ³ν©λ λ©λͺ¨λ¦¬ μ κ·ΌμΌλ‘ μ μ λ©λͺ¨λ¦¬ νΈλμμ μ΅μν
- λ± ν¬ μΆ©λμ νΌνλλ‘ μ΅μ νλ 곡μ λ©λͺ¨λ¦¬ λ μ΄μμκ³Ό μ κ·Ό ν¨ν΄
-
κ²°κ³Ό κΈ°λ‘:
if tiled_row < size and tiled_col < size: output[tiled_row, tiled_col] = acc- λ€λ₯Έ νλ ¬ ν¬κΈ°μ νμΌλ§ μ λ΅μ μν λ°©μ΄μ κ²½κ³ κ²μ¬ ν¬ν¨
- μΆλ ₯ νλ ¬μ μ§μ λμ
- λͺ¨λ μ€λ λκ° μ ν¨ν κ²°κ³Όλ₯Ό κΈ°λ‘
μ£Όμ μ΅μ ν
-
λ μ΄μμ μ΅μ ν:
- λͺ¨λ ν μμ ν μ°μ λ μ΄μμ
- ν¨μ¨μ μΈ 2D μΈλ±μ±
-
λ©λͺ¨λ¦¬ μ κ·Ό:
- λ³ν©λ μ μ λ©λͺ¨λ¦¬ λ‘λ
- ν¨μ¨μ μΈ κ³΅μ λ©λͺ¨λ¦¬ νμ©
-
μ°μ°:
- λ μ§μ€ν° κΈ°λ° λμ , μ¦
var acc: output.element_type = 0 @parameterλ₯Ό ν΅ν μ»΄νμΌ νμ 루ν μ κ°
- λ μ§μ€ν° κΈ°λ° λμ , μ¦
μ΄ κ΅¬νμ λ€μμ ν΅ν΄ λμ μ±λ₯μ λ¬μ±ν©λλ€:
- LayoutTensorλ₯Ό νμ©ν ν¨μ¨μ μΈ λ©λͺ¨λ¦¬ μ κ·Ό
- μ΅μ μ νμΌλ§ μ λ΅
- μ μ ν μ€λ λ λκΈ°ν
- μΈμ¬ν κ²½κ³ μ²λ¦¬
μ루μ : κ΄μ©μ LayoutTensor νμΌλ§
from gpu.memory import async_copy_wait_all
from layout.layout_tensor import copy_dram_to_sram_async
comptime NUM_THREADS = TPB * TPB
comptime BLOCK_DIM_COUNT = 2
fn matmul_idiomatic_tiled[
layout: Layout, size: UInt
](
output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
local_row = thread_idx.y
local_col = thread_idx.x
tiled_row = block_idx.y * TPB + local_row
tiled_col = block_idx.x * TPB + local_col
# Get the tile of the output matrix that this thread block is responsible for
out_tile = output.tile[TPB, TPB](Int(block_idx.y), Int(block_idx.x))
a_shared = LayoutTensor[
dtype,
Layout.row_major(TPB, TPB),
MutAnyOrigin,
address_space = AddressSpace.SHARED,
].stack_allocation()
b_shared = LayoutTensor[
dtype,
Layout.row_major(TPB, TPB),
MutAnyOrigin,
address_space = AddressSpace.SHARED,
].stack_allocation()
var acc: output.element_type = 0
comptime load_a_layout = Layout.row_major(1, TPB) # Coalesced loading
comptime load_b_layout = Layout.row_major(1, TPB) # Coalesced loading
# Note: Both matrices stored in same orientation for correct matrix multiplication
# Transposed loading would be useful if B were pre-transposed in global memory
@parameter
for idx in range(size // TPB): # Perfect division: 9 // 3 = 3 tiles
# Get tiles from A and B matrices
a_tile = a.tile[TPB, TPB](Int(block_idx.y), Int(idx))
b_tile = b.tile[TPB, TPB](Int(idx), Int(block_idx.x))
# Asynchronously copy tiles to shared memory with consistent orientation
copy_dram_to_sram_async[
thread_layout=load_a_layout,
num_threads=NUM_THREADS,
block_dim_count=BLOCK_DIM_COUNT,
](a_shared, a_tile)
copy_dram_to_sram_async[
thread_layout=load_b_layout,
num_threads=NUM_THREADS,
block_dim_count=BLOCK_DIM_COUNT,
](b_shared, b_tile)
# Wait for all async copies to complete
async_copy_wait_all()
barrier()
# Compute partial matrix multiplication for this tile
@parameter
for k in range(TPB):
acc += a_shared[local_row, k] * b_shared[k, local_col]
barrier()
# Write final result to output tile
if tiled_row < size and tiled_col < size:
out_tile[local_row, local_col] = acc
κ΄μ©μ νμΌλ§ νλ ¬ κ³±μ μ Mojoμ LayoutTensor APIμ λΉλκΈ° λ©λͺ¨λ¦¬ μ°μ°μ νμ©νμ¬ κΉλν ꡬνμ μ 곡ν©λλ€.
ν΅μ¬ ν¬μΈνΈ: μ΄ κ΅¬νμ λ νλ ¬ λͺ¨λ λ³ν© λ‘λ©μ μ¬μ©νμ¬ νμ€ A Γ B νλ ¬ κ³±μ μ μνν©λλ€.
μ΄ κ΅¬νμ΄ νλ κ²:
- νλ ¬ μ°μ°: νμ€ \(A \times B\) κ³±μ (\(A \times B^T\) κ° μλ)
- λ‘λ© ν¨ν΄: λ νλ ¬ λͺ¨λ
Layout.row_major(1, TPB)λ‘ λ³ν© μ κ·Ό - μ°μ°:
acc += a_shared[local_row, k] * b_shared[k, local_col] - λ°μ΄ν° λ μ΄μμ: λ‘λ© μ μ μΉ μμ - λ νλ ¬μ κ°μ λ°©ν₯μΌλ‘ λ‘λ
μ΄ κ΅¬νμ΄ νμ§ μλ κ²:
- \(A \times B^T\) κ³±μ μ μννμ§ μμ
- μ μΉ λ‘λ© ν¨ν΄μ μ¬μ©νμ§ μμ
- λ³΅μ¬ κ³Όμ μμ λ°μ΄ν°λ₯Ό μ μΉνμ§ μμ
\((9 \times 9)\) νλ ¬ ν¬κΈ°μμλ μλ²½ν νμΌλ§μ΄ μ΄λ£¨μ΄μ Έ λͺ¨λ κ²½κ³ κ²μ¬κ° λΆνμν©λλ€:
-
LayoutTensor νμΌ API
out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x) a_tile = a.tile[TPB, TPB](block_idx.y, idx) b_tile = b.tile[TPB, TPB](idx, block_idx.x)μλ μ’ν κ³μ° μμ΄ β(block_idx.y, block_idx.x) μμΉμ νμΌμ κ°μ Έμ¨λ€βλ₯Ό μ§μ ννν©λλ€. μμΈν λ΄μ©μ λ¬Έμλ₯Ό μ°Έκ³ νμΈμ.
-
λΉλκΈ° λ©λͺ¨λ¦¬ μ°μ°
copy_dram_to_sram_async[ thread_layout = load_a_layout, num_threads = NUM_THREADS, block_dim_count = BLOCK_DIM_COUNT ](a_shared,a_tile) copy_dram_to_sram_async[ thread_layout = load_b_layout, num_threads = NUM_THREADS, block_dim_count = BLOCK_DIM_COUNT ](b_shared, b_tile) async_copy_wait_all()μ΄ μ°μ°λ€μ:
- λ μ§μ€ν°λ₯Ό μ°ννλ μ μ© λ³΅μ¬ μμ§μ μ¬μ©νμ¬ μ°μ°κ³Ό λ©λͺ¨λ¦¬ μ μ‘μ μ€μ²©μ κ°λ₯νκ² ν©λλ€ (copy_dram_to_sram_async μ°Έκ³ )
- μ΅μ μ λ©λͺ¨λ¦¬ μ κ·Ό ν¨ν΄μ μν νΉνλ μ€λ λ λ μ΄μμμ μ¬μ©ν©λλ€
- μλ λ©λͺ¨λ¦¬ μ΄κΈ°νκ° λΆνμν©λλ€
- μ€μ:
- νμ€ GPU λ‘λλ μ΄λ―Έ λΉλκΈ°μ μ λλ€. μ΄ ν¨μλ€μ λ λμ 리μμ€ νμ©κ³Ό λ μ§μ€ν° μ°νλ₯Ό μ 곡ν©λλ€
copy_dram_to_sram_asyncλ κΈ°λ³Έμ μΌλ‘ 1D μ€λ λ λΈλ‘(block_dim.y == block_dim.z == 1)μ κ°μ νλ©°, λ³λ μ§μ μ΄ μμΌλ©΄ μ€λ λ λΈλ‘μ λͺ¨λ μ€λ λκ° λ³΅μ¬μ μ°Έμ¬ν©λλ€. λ€μμ μ§μ νμ¬ μ΄ λμμ λ³κ²½ν μ μμ΅λλ€:block_dim_count: μ€λ λ λΈλ‘μ μ°¨μ μ (2D μ€λ λ λΈλ‘THREADS_PER_BLOCK_TILED = (TPB, TPB)μ κ²½μ°2)num_threads: μ€λ λ λΈλ‘μ μ€λ λ μ (TPB*TPB == 9)
-
μ΅μ νλ λ©λͺ¨λ¦¬ μ κ·Ό λ μ΄μμ
comptime load_a_layout = Layout.row_major(1, TPB) # λ³ν© λ‘λ© comptime load_b_layout = Layout.row_major(1, TPB) # λ³ν© λ‘λ© # μ°Έκ³ : νμ€ A Γ B κ³±μ μμ λ νλ ¬ λͺ¨λ κ°μ λ μ΄μμμ μ¬μ©νμ¬ κ΅¬νμ λ©λͺ¨λ¦¬ μ κ·Ό λΆμ:
λ νλ ¬ λͺ¨λ μ μ λ©λͺ¨λ¦¬μμ λ³ν© λ‘λ©μ μν΄
Layout.row_major(1, TPB)λ₯Ό μ¬μ©ν©λλ€:load_a_layout: μ€λ λλ€μ΄ νλ ₯νμ¬ νλ ¬ A νμ μ°μ μμλ₯Ό λ‘λload_b_layout: μ€λ λλ€μ΄ νλ ₯νμ¬ νλ ¬ B νμ μ°μ μμλ₯Ό λ‘λ- ν΅μ¬: μ€λ λ λ μ΄μμμ λ³΅μ¬ μ μ€λ λ κ° νλ ₯ λ°©μμ κ²°μ νλ©°, μ΅μ’ λ°μ΄ν° λ μ΄μμκ³Όλ λ³κ°μ λλ€
μ€μ μ°μ° ν¨ν΄ (A Γ Bμμ μ¦λͺ ):
# νμ¬ κ΅¬νμ μ€μ μ°μ° acc += a_shared[local_row, k] * b_shared[k, local_col] # μ΄κ²μ C[i,j] = Ξ£(A[i,k] * B[k,j])μ ν΄λΉ # μ¦, νμ€ νλ ¬ κ³±μ A Γ Bλ νλ ¬μ΄ κ°μ λ³ν© λ‘λ© ν¨ν΄μ μ¬μ©νλ μ΄μ :
μ μ λ©λͺ¨λ¦¬μμ νμΌ λ‘λ©: - Matrix A νμΌ: μ€λ λλ€μ΄ A[block_row, k], A[block_row, k+1], A[block_row, k+2]... λ‘λ (μ°μ) - Matrix B νμΌ: μ€λ λλ€μ΄ B[k, block_col], B[k, block_col+1], B[k, block_col+2]... λ‘λ (μ°μ) Layout.row_major(1, TPB)λ‘ λ ν¨ν΄ λͺ¨λ λ³ν©μΈ κ°μ§ λ³κ°μ λ©λͺ¨λ¦¬ κ³ λ €μ¬ν:
- μ μβ곡μ λ³ν©:
Layout.row_major(1, TPB)λ‘ λ³ν© μ μ λ©λͺ¨λ¦¬ μ κ·Ό 보μ₯ - 곡μ λ©λͺ¨λ¦¬ μ°μ°:
a_shared[local_row, k] * b_shared[k, local_col]λ‘ λ± ν¬ μΆ©λ ννΌ - νλ ¬ μ°μ°: μ°μ° ν¨ν΄μ΄ A Γ Bλ₯Ό κ²°μ (A Γ B^Tκ° μλ)
-
μλ²½ν νμΌλ§μΌλ‘ κ²½κ³ κ²μ¬ λΆνμ
@parameter for idx in range(size // TPB): # λλ¨Έμ§ μλ λλμ : 9 // 3 = 3\((9 \times 9)\) νλ ¬κ³Ό \((3 \times 3)\) νμΌμμλ λͺ¨λ νμΌμ΄ μ νν κ½ μ°¨κΈ° λλ¬Έμ κ²½κ³ κ²μ¬κ° νμ μμ΅λλ€!
-
λ°©μ΄μ κ²½κ³ κ²μ¬λ₯Ό ν¬ν¨ν κΉλν νμΌ μ²λ¦¬
# μλ²½ν νμΌλ§μμλ λ°©μ΄μ κ²½κ³ κ²μ¬ ν¬ν¨ if tiled_row < size and tiled_col < size: out_tile[local_row, local_col] = acc\((9 \times 9)\) μ μλ²½ν νμΌλ§μμλ μ΄ κ²½κ³ κ²μ¬κ° κΈ°μ μ μΌλ‘ λΆνμνμ§λ§, λ°©μ΄μ νλ‘κ·Έλλ°κ³Ό λ€λ₯Έ νλ ¬ ν¬κΈ°μμ μΌκ΄μ±μ μν΄ ν¬ν¨ν©λλ€.
μ±λ₯ κ³ λ €μ¬ν
κ΄μ©μ ꡬνμ νμΌλ§μ μ±λ₯ μ΄μ μ μ μ§νλ©΄μ λ κΉλν μΆμνλ₯Ό μ 곡ν©λλ€:
- λ©λͺ¨λ¦¬ μ§μμ±: νμΌλ§μ ν΅ν΄ 곡κ°μ , μκ°μ μ§μμ±μ νμ©
- λ³ν© μ κ·Ό: νΉνλ λ‘λ λ μ΄μμμΌλ‘ λ³ν© λ©λͺ¨λ¦¬ μ κ·Ό ν¨ν΄ 보μ₯
- μ°μ°-λ©λͺ¨λ¦¬ μ€μ²©: λΉλκΈ° λ©λͺ¨λ¦¬ μ°μ°μ ν΅ν μ€μ²© κ°λ₯
- 곡μ λ©λͺ¨λ¦¬ ν¨μ¨: λΆνμν 곡μ λ©λͺ¨λ¦¬ μ΄κΈ°ν μμ
- λ μ§μ€ν° μλ ₯: μ΅μ μ μ°μ° μ²λ¦¬λμ μν λμ λ μ§μ€ν° μ¬μ©
μ΄ κ΅¬νμ κ³ μμ€ μΆμνλ‘λ μ±λ₯ μ ν μμ΄ λ³΅μ‘ν GPU μκ³ λ¦¬μ¦μ ννν μ μμμ 보μ¬μ€λλ€. κ³ μμ€μ ννλ ₯κ³Ό μ μμ€μ μ±λ₯ μ μ΄λ₯Ό κ²°ν©νλ Mojoμ μ² νμ μ 보μ¬μ£Όλ μμμ λλ€.
μλ νμΌλ§κ³Όμ μ£Όμ μ°¨μ΄μ
| κΈ°λ₯ | μλ Tiling | κ΄μ©μ Tiling |
|---|---|---|
| λ©λͺ¨λ¦¬ μ κ·Ό | κ²½κ³ κ²μ¬κ° μλ μ§μ μΈλ±μ± | LayoutTensor νμΌ API |
| νμΌ λ‘λ© | μμλ³ λͺ μμ λ³΅μ¬ | μ μ© λ³΅μ¬ μμ§μ λ²ν¬ μ μ‘ |
| 곡μ λ©λͺ¨λ¦¬ | μλ μ΄κΈ°ν (λ°©μ΄μ ) | λ³΅μ¬ ν¨μκ° κ΄λ¦¬ |
| μ½λ 볡μ‘λ | λͺ μμ μΈλ±μ±μΌλ‘ λ€μ μ₯ν© | κ³ μμ€ APIλ‘ λ κ°κ²° |
| κ²½κ³ κ²μ¬ | λ‘λ©κ³Ό μ°μ° μ€ λ€μμ κ²μ¬ | μ΅μ’ κΈ°λ‘ μ λ¨μΌ λ°©μ΄μ κ²μ¬ |
| νλ ¬ λ°©ν₯ | Aμ B λͺ¨λ κ°μ λ°©ν₯ (νμ€ A Γ B) | Aμ B λͺ¨λ κ°μ λ°©ν₯ (νμ€ A Γ B) |
| μ±λ₯ | λ©λͺ¨λ¦¬ ν¨ν΄μ λͺ μμ μ μ΄ | λ μ§μ€ν° μ°νλ₯Ό ν¬ν¨ν μ΅μ νλ λ μ΄μμ |
κ΄μ©μ μ κ·Ό λ°©μμ λ¨μν λ κΉλν λΏ μλλΌ, νΉνλ λ©λͺ¨λ¦¬ λ μ΄μμκ³Ό λΉλκΈ° μ°μ° λλΆμ μ±λ₯λ λ μ’μ μ μμ΅λλ€.
μ°Έκ³ : μ μΉ λ‘λ©μ μΈμ μ μ©ν κΉ?
νμ¬ κ΅¬νμ μ μΉ λ‘λ©μ μ¬μ©νμ§ μμ΅λλ€. μ΄ μΉμ μ λ μ΄μμ μμ€ν μΌλ‘ ν μ μλ κ²μ 보μ¬μ£ΌκΈ° μν κ΅μ‘μ λ΄μ©μ λλ€.
νμ¬ κ΅¬ν μμ½:
- λ νλ ¬ λͺ¨λ
Layout.row_major(1, TPB)μ¬μ© - νμ€ A Γ B κ³±μ μν
- λ³΅μ¬ μ€ λ°μ΄ν° μ μΉ μμ
μ μΉ λ‘λ©μ μ¬μ©νλ κ΅μ‘μ μλ리μ€:
μ΄ νΌμ¦μ λ νλ ¬ λͺ¨λ νμ€ λ³ν© λ‘λ©μ μ¬μ©νμ§λ§, λ μ΄μμ μμ€ν μ μ μ°μ±μ λ€λ₯Έ μλ리μ€μμ κ°λ ₯ν μ΅μ νλ₯Ό κ°λ₯νκ² ν©λλ€:
# μμ: A Γ Bλ₯Ό κ³μ°νκΈ° μν΄ μ¬μ μ μΉλ νλ ¬ B^Tλ₯Ό λ‘λ
# (νμ¬ κ΅¬νμμλ μ΄λ κ² νμ§ μμ)
comptime load_b_layout = Layout.row_major(TPB, 1) # B^Tλ₯Ό λ³ν© μ κ·ΌμΌλ‘ λ‘λ
comptime store_b_layout = Layout.row_major(1, TPB) # 곡μ λ©λͺ¨λ¦¬μ Bλ‘ μ μ₯
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)
μ μΉ λ‘λ©μ νμ© μ¬λ‘ (μ΄ νΌμ¦μμλ μ¬μ©νμ§ μμ):
- μ΄λ―Έ μ μΉλ μ λ ₯ νλ ¬: \(B\) κ° μ μ λ©λͺ¨λ¦¬μ μ μΉ μνλ‘ μ μ₯λμ΄ μλ κ²½μ°
- λ€λ₯Έ μκ³ λ¦¬μ¦: \(A^T \times B\), \(A \times B^T\), λλ \(A^T \times B^T\) κ³μ°
- λ©λͺ¨λ¦¬ λ μ΄μμ λ³ν: ν μ°μ κ³Ό μ΄ μ°μ λ μ΄μμ κ° λ³ν
- λ³λ μ μΉ μ°μ° μμ΄ λ‘λ: νμν λ°©ν₯μΌλ‘ λ°μ΄ν°λ₯Ό μ§μ λ‘λ
ν΅μ¬ ꡬλΆ:
- νμ¬ κ΅¬ν: λ νλ ¬ λͺ¨λ νμ€ \(A \times B\) κ³±μ
μ
Layout.row_major(1, TPB)μ¬μ© - μ μΉ λ‘λ© μμ: μ΄λ―Έ μ μΉλ λ°μ΄ν°λ λ€λ₯Έ νλ ¬ μ°μ°μ μ²λ¦¬ν λ λ€λ₯Έ λ μ΄μμ μ¬μ©
μ΄κ²μ Mojoμ μ² νμ 보μ¬μ€λλ€: μΌλ°μ μΈ κ²½μ°μ κ³ μμ€ μΆμνλ₯Ό μ μ§νλ©΄μλ, νμν λ μ μμ€ μ μ΄λ₯Ό μ 곡ν©λλ€.
μμ½: ν΅μ¬ μ 리
κ΄μ©μ νμΌλ§ ꡬνμ΄ μ€μ λ‘ νλ κ²:
- νλ ¬ μ°μ°: νμ€ A Γ B κ³±μ
- λ©λͺ¨λ¦¬ λ‘λ©: λ νλ ¬ λͺ¨λ
Layout.row_major(1, TPB)λ‘ λ³ν© μ κ·Ό - μ°μ° ν¨ν΄:
acc += a_shared[local_row, k] * b_shared[k, local_col] - λ°μ΄ν° λ μ΄μμ: λ‘λ© μ μ μΉ μμ
μ΄κ²μ΄ μ΅μ μΈ μ΄μ :
- λ³ν© μ μ λ©λͺ¨λ¦¬ μ κ·Ό:
Layout.row_major(1, TPB)λ‘ ν¨μ¨μ μΈ λ‘λ© λ³΄μ₯ - λ± ν¬ μΆ©λ ννΌ: 곡μ λ©λͺ¨λ¦¬ μ κ·Ό ν¨ν΄μ΄ μΆ©λμ λ°©μ§
- νμ€ μκ³ λ¦¬μ¦: κ°μ₯ μΌλ°μ μΈ νλ ¬ κ³±μ ν¨ν΄μ ꡬν