Puzzle 12: Dot Product

Overview

Implement a kernel that computes the dot product of the 1D TileTensor a and the 1D TileTensor b and stores the result in the 1D TileTensor output (a single value). The dot product multiplies the corresponding elements of two equal-length vectors and then sums all the products, yielding a single number (a scalar).

For example, given two vectors:
\[a = [a_{1}, a_{2}, \ldots, a_{n}]\] \[b = [b_{1}, b_{2}, \ldots, b_{n}]\]
the dot product is computed as: \[a \cdot b = a_{1}b_{1} + a_{2}b_{2} + \ldots + a_{n}b_{n}\]
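The definition above can be sketched in a few lines of plain Python (the vectors here are made-up sample values, not the puzzle's inputs):

```python
# Dot product: multiply corresponding elements, then sum the products.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

dot = sum(x * y for x, y in zip(a, b))
print(dot)  # 1*4 + 2*5 + 3*6 = 32.0
```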
Note: You have 1 thread per position. You only need 2 global reads per thread and 1 global write per block.
Key concepts

In this puzzle you will learn:
- TileTensor-based parallel reduction, building on Puzzle 8 and Puzzle 11
- Shared memory management using address_space
- How multiple threads cooperate to produce a single result
- Layout-aware tensor operations

The key is to understand how TileTensor abstracts away memory management while preserving the full efficiency of the parallel reduction.
Configuration

- Vector size: SIZE = 8
- Threads per block: TPB = 8
- Number of blocks: 1
- Output size: 1
- Shared memory: TPB values

Notes:
- TileTensor allocation: use stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())
- Element access: natural indexing with automatic bounds checking
- Layout handling: separate layouts for input and output
- Thread coordination: use barrier() with the same synchronization pattern as before
Code to complete
from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace
from layout import TileTensor
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime out_layout = row_major[1]()
comptime LayoutType = type_of(layout)
comptime OutLayout = type_of(out_layout)


def dot_product(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...
View the full file: problems/p12/p12.mojo
Tips

- Create shared memory with TileTensor's address_space
- Store a[global_i] * b[global_i] in shared[local_i]
- Apply the parallel reduction pattern together with barrier()
- Thread 0 writes the final result to output[0]
Running the code

To test your solution, run the following command in your terminal:
pixi run p12
pixi run -e amd p12
pixi run -e apple p12
uv run poe p12
If you haven't solved the puzzle yet, the output will look like this:
out: HostBuffer([0.0])
expected: HostBuffer([140.0])
Solution
def dot_product(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Compute element-wise multiplication into shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Synchronize threads within block
    barrier()

    # Parallel reduction in shared memory
    var stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        output[0] = shared[0]
This solution computes the dot product with a parallel reduction built on TileTensor. Let's walk through it step by step:

Step 1: Element-wise multiplication

Each thread handles one multiplication using intuitive indexing:

shared[local_i] = a[global_i] * b[global_i]

Step 2: Parallel reduction

A layout-aware, tree-based reduction:
Initial: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7]
= [0 1 4 9 16 25 36 49]
Step 1: [0+16 1+25 4+36 9+49 16 25 36 49]
= [16 26 40 58 16 25 36 49]
Step 2: [16+40 26+58 40 58 16 25 36 49]
= [56 84 40 58 16 25 36 49]
Step 3: [56+84 84 40 58 16 25 36 49]
= [140 84 40 58 16 25 36 49]
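The trace above can be reproduced with a small Python sketch that runs the same stride loop sequentially (the sequential loop stands in for the barrier between passes; this is an illustration, not the kernel itself):

```python
TPB = 8
a = list(range(TPB))          # [0, 1, ..., 7]
b = list(range(TPB))

# Step 1: each "thread" i writes its product into the shared buffer
shared = [a[i] * b[i] for i in range(TPB)]
print(shared)                 # [0, 1, 4, 9, 16, 25, 36, 49]

# Step 2: tree reduction; each pass halves the number of active threads
stride = TPB // 2
while stride > 0:
    for i in range(stride):   # threads with local_i < stride
        shared[i] += shared[i + stride]
    stride //= 2

print(shared[0])              # 140
```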
Key features of the implementation:

- Memory management:
  - a single address_space parameter gives a clean shared-memory allocation
  - type-safe operations are guaranteed
  - bounds checking comes along automatically
  - indexing is layout-aware
- Thread synchronization:
  - barrier() once the initial multiplication finishes
  - barrier() between reduction steps
  - guarantees safe coordination between threads
- Reduction logic:

  stride = TPB // 2
  while stride > 0:
      if local_i < stride:
          shared[local_i] += shared[local_i + stride]
      barrier()
      stride //= 2
- Performance benefits:
  - \(O(\log n)\) time complexity
  - coalesced memory access
  - minimal thread divergence
  - efficient use of shared memory
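The \(O(\log n)\) bound follows directly from the stride loop: each pass halves the stride, so TPB elements are reduced in log2(TPB) passes. A quick check in plain Python (illustrative only):

```python
import math

TPB = 8
passes = 0
stride = TPB // 2
while stride > 0:   # same loop shape as the kernel's reduction
    passes += 1
    stride //= 2

print(passes)       # 3, i.e. log2(8)
assert passes == int(math.log2(TPB))
```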
The TileTensor version keeps the efficiency of the parallel reduction intact while adding:
- stronger type safety
- cleaner memory management
- automatic layout awareness
- more natural indexing syntax
The importance of barrier synchronization

The barrier() between reduction steps is strictly required for correct results. Let's look at why:

Without barrier(), a race condition occurs:
Initial shared memory: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49

Without a barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- If Thread 0 then races ahead into the next step (stride = 2) before
  Thread 2's write has landed, it reads the stale shared[2] = 4
  instead of 40!
With barrier():

Step 1 (stride = 4):
All threads record their sums:
[16 26 40 58 16 25 36 49]
barrier() guarantees these values are visible to every thread

Step 2 (stride = 2):
The updated values can now be read safely:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84
barrier() guarantees that:
- all writes of the current step finish before the next step begins
- every thread sees the most recent values
- no thread runs ahead of the others
- shared memory always stays in a consistent state
Without these synchronization points:
- race conditions occur
- threads read values that are already stale
- results differ from run to run
- the final sum may be wrong
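The failure mode can be replayed deterministically in plain Python by hand-ordering the reads and writes the way a missing barrier would allow. This is a simulation of one bad interleaving, not GPU code:

```python
# Products for a = b = [0..7], as in the walkthrough above
shared = [i * i for i in range(8)]   # [0, 1, 4, 9, 16, 25, 36, 49]

# stride = 4: threads 0, 1 and 3 complete, but thread 2's write is delayed
shared[0] += shared[4]               # 16
shared[1] += shared[5]               # 26
shared[3] += shared[7]               # 58

# No barrier: thread 0 races into stride = 2 and reads the stale
# shared[2] (still 4 instead of the expected 40)
shared[0] += shared[2]               # 16 + 4 = 20, should have been 56

shared[2] += shared[6]               # thread 2's late write finally lands (40)
shared[1] += shared[3]               # thread 1, stride = 2: 26 + 58 = 84

# stride = 1
shared[0] += shared[1]               # 20 + 84 = 104
print(shared[0])                     # 104 instead of the correct 140
```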