warp.shuffle_down() One-to-One Communication
For warp-level neighbor communication, shuffle_down() gives each lane access to data held by adjacent lanes in the same warp. This primitive makes finite differences, moving averages, and other neighbor-based computations efficient without shared memory or explicit synchronization.
Key insight: The shuffle_down() operation relies on SIMT execution to let each lane read data from its neighbors within the same warp, which enables efficient stencil patterns and sliding-window operations.
What are stencil operations? A stencil operation is a computation in which each output element depends on a fixed pattern of neighboring input elements. Common examples are finite differences (derivatives), convolutions, and moving averages. The "stencil" is the neighbor access pattern: a 3-point stencil reads [i-1, i, i+1], and a 5-point stencil reads [i-2, i-1, i, i+1, i+2].
Key concepts
In this puzzle, you'll learn:
- Warp-level data shuffling with shuffle_down()
- Neighbor access patterns for stencil computations
- Boundary handling at warp edges
- Multi-offset shuffling for extended neighbor access
- Cross-warp coordination in multi-block scenarios
The shuffle_down operation lets each lane fetch data from a lane at a higher index:
\[\Large \text{shuffle_down}(\text{value}, \text{offset}) = \text{value_from_lane}(\text{lane_id} + \text{offset})\]
This turns complex neighbor access patterns into simple warp-level operations, so stencil computations need no explicit memory indexing for the neighbor values.
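To make the lane-indexing rule concrete, here is a minimal host-side sketch in Python that emulates the semantics described above. It is not the Mojo GPU API: `emulate_shuffle_down` and `WARP_SIZE = 32` are names assumed for illustration, and lanes whose source lane falls outside the warp are marked `None` to stand in for the undefined value a real shuffle returns there.

```python
WARP_SIZE = 32  # assumed warp width for this sketch (64 on some GPUs)


def emulate_shuffle_down(values, offset):
    """Host-side model: lane i receives the value held by lane i + offset.

    Lanes whose source lane does not exist get None, standing in for the
    undefined value a real shuffle would return there.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else None for i in range(n)]


lanes = list(range(WARP_SIZE))       # each lane holds its own lane id
shifted = emulate_shuffle_down(lanes, 1)
print(shifted[:3], shifted[-2:])     # [1, 2, 3] [31, None]
```

The `None` entries mark exactly the lanes that a kernel has to mask with a check such as `lane < WARP_SIZE - offset`.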
1. Basic neighbor difference
Configuration
- Vector size: SIZE = WARP_SIZE (32 or 64 depending on GPU)
- Grid configuration: (1, 1) blocks per grid
- Block configuration: (WARP_SIZE, 1) threads per block
- Data type: DType.float32
- Layout: Layout.row_major(SIZE) (1D row-major)
The shuffle_down concept
Traditional neighbor access requires complex indexing and bounds checking:
# Traditional approach: complex and error-prone
if global_i < size - 1:
    next_value = input[global_i + 1]  # Potential out-of-bounds access
    result = next_value - current_value
Problems with the traditional approach:
- Bounds checking: array bounds must be verified manually
- Memory access: a separate memory load is required
- Synchronization: shared memory patterns may need barriers
- Complex logic: handling edge cases at the boundary becomes verbose
With shuffle_down(), neighbor access becomes concise:
# Warp shuffle approach: simple and safe
current_val = input[global_i]
next_val = shuffle_down(current_val, 1)  # Get the value from lane + 1
if lane < WARP_SIZE - 1:
    result = next_val - current_val
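Benefits of shuffle_down:
- Zero memory overhead: no additional memory accesses
- Boundary safety: the shuffle never reads memory, so it cannot index out of bounds; only the lanes without a source lane need a lane check
- No synchronization: SIMT execution guarantees correctness within the warp
- Composable: easy to combine with other warp operations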
Code to complete
Implement finite differences using shuffle_down() to access the next element.
Mathematical operation: Compute the discrete derivative (finite difference) for each element: \[\Large \text{output}[i] = \text{input}[i+1] - \text{input}[i]\]
This transforms the input data [0, 1, 4, 9, 16, 25, ...] (squares: i * i) into the differences [1, 3, 5, 7, 9, ...] (odd numbers), which is the discrete derivative of the quadratic function.
comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)
fn neighbor_difference[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute finite differences: output[i] = input[i+1] - input[i]
    Uses shuffle_down(val, 1) to get the next neighbor's value.
    Works across multiple blocks, each processing one warp worth of data.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    # FILL IN (roughly 7 lines)
View full file: problems/p25/p25.mojo
Tips
1. Understanding shuffle_down
The shuffle_down(value, offset) operation lets each lane receive data from a lane at a higher index. Consider how this gives you access to neighboring elements without explicit memory loads.
What shuffle_down(val, 1) does:
- Lane 0 receives the value from lane 1
- Lane 1 receives the value from lane 2
- …
- Lane 30 receives the value from lane 31
- Lane 31 receives an undefined value (handled by the boundary check)
2. Warp boundary considerations
Think about what happens at the edges of a warp. Some lanes may have no valid neighbor to access through the shuffle operation.
Challenge: Design your algorithm to handle the cases where a shuffle operation may return undefined data at a warp boundary.
For the neighbor difference with WARP_SIZE = 32:
- Valid difference (lane < WARP_SIZE - 1): lanes 0-30 (31 lanes)
  - When: \(\text{lane_id}() \in \{0, 1, \cdots, 30\}\)
  - Why: shuffle_down(current_val, 1) successfully fetches the next neighbor's value
  - Result: output[i] = input[i+1] - input[i] (finite difference)
- Boundary case (else): lane 31 (1 lane)
  - When: \(\text{lane_id}() = 31\)
  - Why: shuffle_down(current_val, 1) returns undefined data (there is no lane 32)
  - Result: output[i] = 0 (the difference cannot be computed)
3. Lane identification
lane = lane_id()  # Returns 0 to WARP_SIZE-1
Lane numbering: Within each warp, lanes are numbered 0, 1, 2, …, WARP_SIZE-1
Test the neighbor difference:
pixi run p25 --neighbor
pixi run -e amd p25 --neighbor
pixi run -e apple p25 --neighbor
uv run poe p25 --neighbor
Expected output when solved:
WARP_SIZE: 32
SIZE: 32
output: [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0, 55.0, 57.0, 59.0, 61.0, 0.0]
expected: [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0, 55.0, 57.0, 59.0, 61.0, 0.0]
✅ Basic neighbor difference test passed!
Solution
fn neighbor_difference[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute finite differences: output[i] = input[i+1] - input[i]
    Uses shuffle_down(val, 1) to get the next neighbor's value.
    Works across multiple blocks, each processing one warp worth of data.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    if global_i < size:
        # Get current value
        current_val = input[global_i]
        # Get next neighbor's value using shuffle_down
        next_val = shuffle_down(current_val, 1)
        # Compute difference - valid within warp boundaries
        # Last lane of each warp has no valid neighbor within the warp
        # Note there's only one warp in this test, so we don't need to check global_i < size - 1
        # We'll see how this works with multiple blocks in the next tests
        if lane < WARP_SIZE - 1:
            output[global_i] = next_val - current_val
        else:
            # Last thread in warp or last thread overall, set to 0
            output[global_i] = 0
This solution shows how shuffle_down() turns traditional array indexing into efficient warp-level communication.
Algorithm breakdown:
if global_i < size:
    current_val = input[global_i]            # Each lane reads its own element
    next_val = shuffle_down(current_val, 1)  # Each lane receives the value from lane + 1
    if lane < WARP_SIZE - 1:
        output[global_i] = next_val - current_val  # Compute the difference
    else:
        output[global_i] = 0                 # Boundary handling
SIMT execution deep dive:
Cycle 1: All lanes load their values simultaneously
  Lane 0: current_val = input[0] = 0
  Lane 1: current_val = input[1] = 1
  Lane 2: current_val = input[2] = 4
  ...
  Lane 31: current_val = input[31] = 961
Cycle 2: shuffle_down(current_val, 1) executes on all lanes
  Lane 0: receives current_val from lane 1 → next_val = 1
  Lane 1: receives current_val from lane 2 → next_val = 4
  Lane 2: receives current_val from lane 3 → next_val = 9
  ...
  Lane 30: receives current_val from lane 31 → next_val = 961
  Lane 31: receives an undefined value (there is no lane 32) → next_val = ?
Cycle 3: Difference computation (lanes 0-30 only)
  Lane 0: output[0] = 1 - 0 = 1
  Lane 1: output[1] = 4 - 1 = 3
  Lane 2: output[2] = 9 - 4 = 5
  ...
  Lane 31: output[31] = 0 (boundary condition)
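The three cycles above can be replayed on the host. The following Python sketch is an emulation written for illustration, not the Mojo kernel: `neighbor_difference_reference` is a hypothetical helper name, and `WARP_SIZE = 32` is assumed to match the output shown earlier.

```python
WARP_SIZE = 32  # assumed warp width for this sketch


def neighbor_difference_reference(values):
    """Emulate one warp: output[i] = input[i+1] - input[i], last lane = 0."""
    out = []
    for lane, current_val in enumerate(values):
        # Cycle 2: the value shuffle_down(current_val, 1) would deliver
        has_neighbor = lane < len(values) - 1
        next_val = values[lane + 1] if has_neighbor else None
        # Cycle 3: difference for lanes with a neighbor, 0 at the boundary
        out.append(next_val - current_val if has_neighbor else 0.0)
    return out


squares = [float(i * i) for i in range(WARP_SIZE)]
result = neighbor_difference_reference(squares)
print(result[:4], result[-2:])  # [1.0, 3.0, 5.0, 7.0] [61.0, 0.0]
```

The result matches the expected test output: the odd numbers 1 through 61 followed by a 0 for the last lane.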
Mathematical insight: This implements the discrete derivative operator \(D\): \[\Large D\lbrack f\rbrack(i) = f(i+1) - f(i)\]
For the quadratic input \(f(i) = i^2\): \[\Large D[i^2] = (i+1)^2 - i^2 = i^2 + 2i + 1 - i^2 = 2i + 1\]
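Why shuffle_down is superior:
- Memory efficiency: the traditional approach needs an extra input[global_i + 1] load, which can cause cache misses
- Boundary safety: the neighbor value is never read from memory, so there is no risk of an out-of-bounds access; lanes with no source lane are masked by the lane check
- SIMT optimization: a single instruction serves all lanes at once
- Register communication: data moves between registers rather than through the memory hierarchy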
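Performance characteristics:
- Latency: a register-to-register exchange, far below the 100+ cycles of a memory access
- Memory traffic: 0 extra bytes for the neighbor value (the traditional approach loads 4 more bytes per thread)
- Parallelism: all 32 lanes are processed simultaneously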
2. Multi-offset moving average
Configuration
- Vector size: SIZE_2 = 64 (multi-block scenario)
- Grid configuration: BLOCKS_PER_GRID_2 = (2, 1) blocks per grid
- Block configuration: THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) threads per block
Code to complete
Implement a 3-point moving average using multiple shuffle_down operations.
Mathematical operation: Compute a sliding-window average over three consecutive elements: \[\Large \text{output}[i] = \frac{1}{3}\left(\text{input}[i] + \text{input}[i+1] + \text{input}[i+2]\right)\]
Boundary handling: The algorithm adapts gracefully at warp boundaries:
- Full 3-point window: \(\text{output}[i] = \frac{1}{3}\sum_{k=0}^{2} \text{input}[i+k]\) when both neighbors are available
- 2-point window: \(\text{output}[i] = \frac{1}{2}\sum_{k=0}^{1} \text{input}[i+k]\) when only the next neighbor is available
- 1-point window: \(\text{output}[i] = \text{input}[i]\) when no neighbor is available
This shows how shuffle_down() supports efficient stencil operations, with the boundary cases handled by lane checks inside the warp.
comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = Layout.row_major(SIZE_2)
fn moving_average_3[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
    Uses shuffle_down with offsets 1 and 2 to access neighbors.
    Works within warp boundaries across multiple blocks.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    # FILL IN (roughly 10 lines)
Tips
1. Multi-offset shuffle patterns
This puzzle needs access to several neighbors at once, so you will use shuffle operations with different offsets.
Key questions:
- How can you get both input[i+1] and input[i+2] using shuffle operations?
- What is the relationship between the shuffle offset and the neighbor distance?
- Can you perform multiple shuffles on the same source value?
Visualization concept:
  Your lane needs: current_val, next_val, next_next_val
  Shuffle offsets: 0 (direct), 1, 2
Think about: How many shuffle operations do you need, and which offsets should you use?
2. Tiered boundary handling
Unlike the simple neighbor difference, this puzzle needs 2 neighbors, so there are several boundary scenarios.
Boundary scenarios to consider:
- Full window: the lane can access both neighbors → use all 3 values
- Partial window: the lane can access only 1 neighbor → use 2 values
- No window: the lane cannot access any neighbor → use 1 value
Critical thinking:
- Which lanes fall into each category?
- How should the average be weighted when fewer values are available?
- Which boundary conditions should you check?
Pattern to consider:
if (can access both neighbors):
    # 3-point average
elif (can access only one neighbor):
    # 2-point average
else:
    # 1-point (no averaging)
3. Multi-block coordination
This puzzle uses multiple blocks, each processing a different section of the data.
Important considerations:
- Each block has its own warp with lanes 0 to WARP_SIZE-1
- Boundary conditions apply independently within each warp
- Lane numbering resets for each block
Questions to think about:
- Does your boundary logic work correctly for both block 0 and block 1?
- Are you checking both the lane boundary and the global array boundary?
- How does global_i relate to lane_id() in different blocks?
Debugging tip: Test your logic by tracing what happens at the boundary lanes of each block.
Test the moving average:
pixi run p25 --average
pixi run -e amd p25 --average
uv run poe p25 --average
Expected output when solved:
WARP_SIZE: 32
SIZE_2: 64
output: HostBuffer([3.3333333, 6.3333335, 10.333333, 15.333333, 21.333334, 28.333334, 36.333332, 45.333332, 55.333332, 66.333336, 78.333336, 91.333336, 105.333336, 120.333336, 136.33333, 153.33333, 171.33333, 190.33333, 210.33333, 231.33333, 253.33333, 276.33334, 300.33334, 325.33334, 351.33334, 378.33334, 406.33334, 435.33334, 465.33334, 496.33334, 512.0, 528.0, 595.3333, 630.3333, 666.3333, 703.3333, 741.3333, 780.3333, 820.3333, 861.3333, 903.3333, 946.3333, 990.3333, 1035.3334, 1081.3334, 1128.3334, 1176.3334, 1225.3334, 1275.3334, 1326.3334, 1378.3334, 1431.3334, 1485.3334, 1540.3334, 1596.3334, 1653.3334, 1711.3334, 1770.3334, 1830.3334, 1891.3334, 1953.3334, 2016.3334, 2048.0, 2080.0])
expected: HostBuffer([3.3333333, 6.3333335, 10.333333, 15.333333, 21.333334, 28.333334, 36.333332, 45.333332, 55.333332, 66.333336, 78.333336, 91.333336, 105.333336, 120.333336, 136.33333, 153.33333, 171.33333, 190.33333, 210.33333, 231.33333, 253.33333, 276.33334, 300.33334, 325.33334, 351.33334, 378.33334, 406.33334, 435.33334, 465.33334, 496.33334, 512.0, 528.0, 595.3333, 630.3333, 666.3333, 703.3333, 741.3333, 780.3333, 820.3333, 861.3333, 903.3333, 946.3333, 990.3333, 1035.3334, 1081.3334, 1128.3334, 1176.3334, 1225.3334, 1275.3334, 1326.3334, 1378.3334, 1431.3334, 1485.3334, 1540.3334, 1596.3334, 1653.3334, 1711.3334, 1770.3334, 1830.3334, 1891.3334, 1953.3334, 2016.3334, 2048.0, 2080.0])
✅ Moving average test passed!
Solution
fn moving_average_3[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
    Uses shuffle_down with offsets 1 and 2 to access neighbors.
    Works within warp boundaries across multiple blocks.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    if global_i < size:
        # Get current, next, and next+1 values
        current_val = input[global_i]
        next_val = shuffle_down(current_val, 1)
        next_next_val = shuffle_down(current_val, 2)
        # Compute 3-point average - valid within warp boundaries
        if lane < WARP_SIZE - 2 and global_i < size - 2:
            output[global_i] = (current_val + next_val + next_next_val) / 3.0
        elif lane < WARP_SIZE - 1 and global_i < size - 1:
            # Second-to-last in warp: only current + next available
            output[global_i] = (current_val + next_val) / 2.0
        else:
            # Last thread in warp or boundary cases: only current available
            output[global_i] = current_val
This solution demonstrates advanced multi-offset shuffling for complex stencil operations.
Complete algorithm analysis:
if global_i < size:
    # Step 1: Acquire all needed data with multiple shuffles
    current_val = input[global_i]                 # Direct access
    next_val = shuffle_down(current_val, 1)       # Right neighbor
    next_next_val = shuffle_down(current_val, 2)  # Right+1 neighbor
    # Step 2: Adaptive computation based on the available data
    if lane < WARP_SIZE - 2 and global_i < size - 2:
        # Full 3-point stencil available
        output[global_i] = (current_val + next_val + next_next_val) / 3.0
    elif lane < WARP_SIZE - 1 and global_i < size - 1:
        # Only a 2-point stencil available (near the warp boundary)
        output[global_i] = (current_val + next_val) / 2.0
    else:
        # No stencil possible (at the warp boundary)
        output[global_i] = current_val
Multi-offset execution trace (WARP_SIZE = 32):
Initial state (block 0, elements 0-31):
  Lane 0: current_val = input[0] = 1
  Lane 1: current_val = input[1] = 3
  Lane 2: current_val = input[2] = 6
  ...
  Lane 31: current_val = input[31] = 528
First shuffle: shuffle_down(current_val, 1)
  Lane 0: next_val = input[1] = 3
  Lane 1: next_val = input[2] = 6
  Lane 2: next_val = input[3] = 10
  ...
  Lane 30: next_val = input[31] = 528
  Lane 31: next_val = undefined
Second shuffle: shuffle_down(current_val, 2)
  Lane 0: next_next_val = input[2] = 6
  Lane 1: next_next_val = input[3] = 10
  Lane 2: next_next_val = input[4] = 15
  ...
  Lane 29: next_next_val = input[31] = 528
  Lane 30: next_next_val = undefined
  Lane 31: next_next_val = undefined
Computation stage:
  Lanes 0-29: full 3-point average → (current + next + next_next) / 3
  Lane 30: 2-point average → (current + next) / 2
  Lane 31: 1-point average → current (passed through)
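The trace can be checked end to end with a host-side model. The Python sketch below is an emulation for illustration only, not the Mojo kernel: `moving_average_3_reference` is a hypothetical helper, each block is modelled as a single warp of `WARP_SIZE = 32` lanes, and the input sequence 1, 3, 6, 10, ... is an assumption chosen because it reproduces the expected output listed earlier.

```python
WARP_SIZE = 32  # assumed warp width; each block is modelled as one warp
SIZE_2 = 64


def moving_average_3_reference(values, warp_size=WARP_SIZE):
    """Emulate the adaptive 3/2/1-point window, one warp per block."""
    size = len(values)
    out = []
    for global_i, current_val in enumerate(values):
        lane = global_i % warp_size  # lane numbering restarts in each block
        if lane < warp_size - 2 and global_i < size - 2:
            total = current_val + values[global_i + 1] + values[global_i + 2]
            out.append(total / 3.0)
        elif lane < warp_size - 1 and global_i < size - 1:
            out.append((current_val + values[global_i + 1]) / 2.0)
        else:
            out.append(current_val)
    return out


# Assumed input: 1, 3, 6, 10, ... (matches the expected output shown earlier)
data = [float((i + 1) * (i + 2) // 2) for i in range(SIZE_2)]
avg = moving_average_3_reference(data)
print(avg[30], avg[31], avg[62], avg[63])  # 512.0 528.0 2048.0 2080.0
```

Lanes 30 and 31 of each block produce the 2-point and 1-point results, which is why the values 512.0, 528.0, 2048.0 and 2080.0 stand out in the expected output.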
Mathematical foundation: This implements a variable-width discrete convolution: \[\Large h[i] = \sum_{k=0}^{K(i)-1} w_k^{(i)} \cdot f[i+k]\]
The kernel adapts to the position:
- Interior points: \(K(i) = 3\), \(\mathbf{w}^{(i)} = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]\)
- Near the boundary: \(K(i) = 2\), \(\mathbf{w}^{(i)} = [\frac{1}{2}, \frac{1}{2}]\)
- At the boundary: \(K(i) = 1\), \(\mathbf{w}^{(i)} = [1]\)
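As a small illustration of the position-dependent kernel, the Python sketch below returns the weight vector for a given lane. It is a host-side model, not part of the puzzle code: `kernel_weights` is a hypothetical helper, `WARP_SIZE = 32` is assumed, and only the warp boundary is modelled (the global array bound coincides with it when the array size is a multiple of the warp size).

```python
from fractions import Fraction

WARP_SIZE = 32  # assumed warp width for this sketch


def kernel_weights(lane, warp_size=WARP_SIZE):
    """Return the averaging weights w^(i) used at a given lane."""
    k = min(3, warp_size - lane)  # K(i): taps that stay inside the warp
    return [Fraction(1, k)] * k


for lane in (0, 29, 30, 31):
    w = kernel_weights(lane)
    print(lane, len(w), sum(w))
# 0 3 1
# 29 3 1
# 30 2 1
# 31 1 1
```

In every case the weights sum to 1, so the output stays on the same scale as the input even where the window shrinks.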
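Multi-block coordination: with SIZE_2 = 64 and 2 blocks:
Block 0 (global indices 0-31):
  Lanes 30 and 31 (global indices 30, 31) use the reduced 2-point and 1-point windows
Block 1 (global indices 32-63):
  Lanes 30 and 31 (global indices 62, 63) use the reduced 2-point and 1-point windows
Lane numbering resets per block: global_i=32 → lane=0, global_i=63 → lane=31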
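The relationship between the global index and the lane can be written down directly. This Python sketch assumes, as in this puzzle's configuration, that every block holds exactly one warp (block size equal to `WARP_SIZE = 32`); `block_and_lane` is a hypothetical helper name used only for illustration.

```python
WARP_SIZE = 32  # assumed warp width; the block size equals WARP_SIZE here


def block_and_lane(global_i, block_size=WARP_SIZE):
    """Split a global index into (block index, lane) for one-warp blocks."""
    return divmod(global_i, block_size)


for g in (0, 31, 32, 63):
    print(g, block_and_lane(g))
# 0 (0, 0)
# 31 (0, 31)
# 32 (1, 0)
# 63 (1, 31)
```

With larger blocks the lane would instead follow from the thread index within the block, so this shortcut holds only under the one-warp-per-block assumption.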
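Performance optimizations:
- Parallel data acquisition: the two shuffle operations are independent, so they can be issued back to back with no synchronization between them
- Conditional branching: the GPU handles divergent lanes efficiently through predication
- Memory coalescing: the sequential global memory access pattern is optimal for the GPU
- Register reuse: all intermediate values stay in registers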
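Signal processing perspective: This is a 3-tap FIR moving-average filter with impulse response \(h[n] = \frac{1}{3}[\delta[n] + \delta[n+1] + \delta[n+2]]\). Because the window looks ahead to \(i+1\) and \(i+2\), the filter is anti-causal as written. It provides low-pass smoothing with a -3 dB cutoff near \(f_c \approx 0.155f_s\) and a null at \(f_s/3\).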
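The smoothing behaviour can be checked numerically. The Python sketch below evaluates the magnitude response of a 3-point average at a few normalized frequencies; `moving_average_gain` is a hypothetical helper written for this illustration, and frequencies are expressed as a fraction of the sampling rate.

```python
import cmath
import math


def moving_average_gain(f_over_fs, taps=3):
    """Magnitude of the frequency response of a `taps`-point moving average."""
    omega = 2.0 * math.pi * f_over_fs
    # Sum of unit phasors, one per tap, scaled by the averaging weight 1/taps.
    # A look-ahead window only changes the phase, not the magnitude.
    h = sum(cmath.exp(-1j * omega * k) for k in range(taps)) / taps
    return abs(h)


for f in (0.0, 0.155, 0.25, 1.0 / 3.0):
    print(f"{f:.3f} {moving_average_gain(f):.3f}")
# 0.000 1.000
# 0.155 0.708
# 0.250 0.333
# 0.333 0.000
```

The gain is 1 at DC, falls to about 0.708 near 0.155 of the sampling rate, to one third at a quarter of the sampling rate, and to zero at one third of the sampling rate.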
Summary
The core pattern of this section is:
current_val = input[global_i]
neighbor_val = shuffle_down(current_val, offset)
if lane < WARP_SIZE - offset:
    result = compute(current_val, neighbor_val)
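Key benefits:
- Hardware efficiency: direct register-to-register communication
- Boundary safety: no out-of-bounds memory access; a single lane check masks the lanes without a source lane
- SIMT optimization: a single instruction, with all lanes processed in parallel
Applications: finite differences, stencil operations, moving averages, convolutions.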