block.sum() Essentials - Block-Level Dot Product
This puzzle implements the dot product from Puzzle 12 as a block-level sum operation, replacing a complex shared-memory pattern with a single function call. Each thread in the block processes one element and block.sum() automatically totals the results, showing how block programming transforms GPU synchronization across an entire thread block.
Key insight: the block.sum() operation leverages full-block execution to replace shared memory + barriers + tree reduction with a carefully optimized implementation that uses warp patterns across all threads in the block. See the LLVM-based technical analysis later in this section.
Key concepts
In this puzzle, you'll learn:
- Block-level reduction with block.sum() - block-wide synchronization and thread coordination
- Cross-warp communication within a single block
- The performance transformation from complex patterns to simple ones
- Thread 0 result management and conditional writes
The mathematical operation is a dot product: \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]
However, along the way you'll learn fundamental patterns that apply to all block-level GPU programming in Mojo.
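For reference, the formula above can be checked with a minimal sequential version (plain Python rather than Mojo, as a CPU-side sketch):

```python
def dot_product(a, b):
    """Sequential reference for output[0] = sum of a[i] * b[i]."""
    assert len(a) == len(b)
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

# Small example: 1*4 + 2*5 + 3*6 = 32
print(dot_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```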
Configuration
- Vector size: SIZE = 128
- Data type: DType.float32
- Block configuration: (128, 1) threads per block (TPB = 128)
- Grid configuration: (1, 1) blocks per grid
- Layout: Layout.row_major(SIZE) (1D row-major)
- Warps per block: 128 / WARP_SIZE (4 on NVIDIA, 2 or 4 on AMD)
The complexity of the traditional approach (from Puzzle 12)
Recall the complex approach from Puzzle 12, which required shared memory, barriers, and tree reduction:
fn traditional_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Traditional dot product using shared memory + barriers + tree reduction.
    Educational but complex - shows the manual coordination needed."""
    shared = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Each thread computes partial product
    if global_i < size:
        a_val = rebind[Scalar[dtype]](a[global_i])
        b_val = rebind[Scalar[dtype]](b[global_i])
        shared[local_i] = a_val * b_val

    barrier()

    # Tree reduction in shared memory - complex but educational
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    # Only thread 0 writes final result
    if local_i == 0:
        output[0] = shared[0]
Why this approach is complex:
- Shared memory allocation: manual memory management within the block
- Explicit barriers: barrier() calls to synchronize all threads in the block
- Tree reduction: a complex loop with stride-based indexing (64→32→16→8→4→2→1)
- Cross-warp coordination: synchronization required across multiple warps
- Conditional writes: only thread 0 records the final result
This approach works across the whole block (128 threads spanning 2 or 4 warps, depending on the GPU), but the code is verbose, error-prone, and requires a deep understanding of block-level GPU synchronization.
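The stride-halving tree reduction above can be modeled on the CPU. This plain-Python sketch (not Mojo) runs the same stride schedule sequentially, so the barrier between steps is implicit:

```python
def tree_reduce(values):
    """CPU model of the kernel's tree reduction: at each step, every
    'thread' local_i < stride adds shared[local_i + stride] into
    shared[local_i]; the stride halves until only slot 0 remains."""
    shared = list(values)
    stride = len(shared) // 2  # 128 -> 64, 32, 16, 8, 4, 2, 1
    while stride > 0:
        for local_i in range(stride):
            shared[local_i] += shared[local_i + stride]
        stride //= 2
    return shared[0]  # thread 0's slot holds the final sum

partials = [float(i) for i in range(128)]
print(tree_reduce(partials))  # 8128.0, i.e. sum(0..127)
```

On a GPU, each `for local_i` pass runs in parallel across threads, which is exactly why a `barrier()` is needed between strides.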
The warp-level improvement (from Puzzle 24)
Before moving on to block-level operations, recall how Puzzle 24 simplified reduction within a single warp using warp.sum():
fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        output[global_i // WARP_SIZE] = total
What warp.sum() achieved:
- Single-warp scope: operates within 32 threads (NVIDIA) or 32/64 threads (AMD)
- Hardware shuffles: uses efficient shfl.sync.bfly.b32 instructions
- No shared memory: zero explicit memory management
- One-line reduction: total = warp_sum[warp_size=WARP_SIZE](val=partial_product)
But there is a limitation: warp.sum() only works within a single warp. For problems that need multiple warps (e.g., a 128-thread block), the complex shared memory + barrier approach is still required for cross-warp coordination.
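The gap warp.sum() leaves can be sketched on the CPU: a multi-warp block needs two levels of reduction, and the second level is exactly the cross-warp step that requires shared memory plus a barrier on real hardware. A plain-Python model (WARP_SIZE fixed at 32 here for illustration):

```python
WARP_SIZE = 32  # 32 on NVIDIA; 32 or 64 on AMD

def block_sum_two_level(partials):
    """Model of a multi-warp block reduction: reduce each warp
    independently (what warp.sum() provides), then combine the
    per-warp sums -- the cross-warp step block.sum() automates."""
    # Level 1: each warp reduces its own lanes
    warp_sums = [
        sum(partials[w : w + WARP_SIZE])
        for w in range(0, len(partials), WARP_SIZE)
    ]
    # Level 2: cross-warp combine (needs shared memory + barrier on GPU)
    return sum(warp_sums)

# 128 partial products -> 4 warp sums -> 1 block total
print(block_sum_two_level([1.0] * 128))  # 128.0
```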
Test the traditional approach:
pixi run p27 --traditional-dot-product
pixi run -e amd p27 --traditional-dot-product
pixi run -e apple p27 --traditional-dot-product
uv run poe p27 --traditional-dot-product
Code to complete
The block.sum() approach
Transform the complex traditional approach into a simple block kernel using block.sum():
comptime SIZE = 128
comptime TPB = 128
comptime NUM_BINS = 8
comptime in_layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(1)
comptime dtype = DType.float32


fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # FILL IN (roughly 6 lines)
View the full file: problems/p27/p27.mojo
pixi run p27 --block-sum-dot-product
pixi run -e amd p27 --block-sum-dot-product
uv run poe p27 --block-sum-dot-product
Expected output when solved:
SIZE: 128
TPB: 128
Expected result: 1381760.0
Block.sum result: 1381760.0
Block.sum() gives identical results!
Compare the code: 15+ lines of barriers → 1 line of block.sum()!
Just like warp.sum() but for the entire block
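The expected value 1381760.0 can be sanity-checked on the CPU. A plain-Python sketch, assuming the inputs follow the pattern visible in the execution trace further down in this section (a[i] = i, b[i] = 2·i):

```python
SIZE = 128
# Assumed inputs, inferred from the execution trace: a[i] = i, b[i] = 2*i
a = [float(i) for i in range(SIZE)]
b = [2.0 * i for i in range(SIZE)]

# sum of 2*i^2 for i in 0..127 = 2 * (127*128*255)/6 = 1381760
expected = sum(x * y for x, y in zip(a, b))
print(expected)  # 1381760.0
```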
Tips
1. Understand the three-phase pattern
All block reductions follow the same conceptual pattern:
- Each thread computes its local contribution
- All threads participate in the block-wide reduction
- One designated thread handles the final result
2. Remember the dot product math
Each thread should process one element from vectors a and b. What operation combines them into a "partial result" that can be summed across threads?
3. The LayoutTensor indexing pattern
When accessing LayoutTensor elements, remember that indexing returns SIMD values. You need to extract scalar values for arithmetic operations.
4. block.sum() API concepts
Look at the function signature - you will need:
- A template parameter specifying the block size
- A template parameter (broadcast) controlling how the result is distributed
- A runtime parameter for the value being reduced
5. Thread coordination principles
- Which threads hold valid data to process? (hint: boundary checking)
- Which thread should write the final result? (hint: a consistent choice)
- How do you identify that specific thread? (hint: thread indexing)
Solution
fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Each thread computes partial product
    var partial_product: Scalar[dtype] = 0.0
    if global_i < size:
        # LayoutTensor indexing `[0]` returns the underlying SIMD value
        partial_product = a[global_i][0] * b[global_i][0]

    # The magic: block.sum() replaces 15+ lines of manual reduction!
    # Just like warp.sum() but for the entire block
    total = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](partial_product)
    )

    # Only thread 0 writes the result
    if local_i == 0:
        output[0] = total[0]
The block.sum() kernel demonstrates a fundamental transformation from complex block synchronization to an elegantly optimized implementation:
What disappeared from the traditional approach:
- 15+ lines → 8 lines: dramatic code reduction
- Shared memory allocation: no memory management needed
- 7+ barrier() calls: zero explicit synchronization
- Complex tree reduction: replaced by a single function call
- Stride-based indexing: eliminated entirely
- Cross-warp coordination: handled automatically by the optimized implementation
Block-wide execution model:

Block threads (128 threads, 4 warps):
Warp 0 (threads 0-31):
  Thread 0:   partial_product = a[0] * b[0] = 0.0
  Thread 1:   partial_product = a[1] * b[1] = 2.0
  ...
  Thread 31:  partial_product = a[31] * b[31] = 1922.0
Warp 1 (threads 32-63):
  Thread 32:  partial_product = a[32] * b[32] = 2048.0
  ...
Warp 2 (threads 64-95):
  Thread 64:  partial_product = a[64] * b[64] = 8192.0
  ...
Warp 3 (threads 96-127):
  Thread 96:  partial_product = a[96] * b[96] = 18432.0
  Thread 127: partial_product = a[127] * b[127] = 32258.0

block.sum() hardware operation:
All threads → 0.0 + 2.0 + ... + 1922.0 + 2048.0 + ... + 32258.0 = 1381760.0
Thread 0 receives → total = 1381760.0 (with broadcast=False)
Why it works without explicit barriers:
- Block-wide execution: all threads execute each instruction in lockstep within their warps
- Built-in synchronization: the block.sum() implementation handles synchronization internally
- Cross-warp communication: optimized communication between the warps in the block
- Coordinated result delivery: only thread 0 receives the final result
Comparison with warp.sum() (Puzzle 24):
- Warp scope: warp.sum() operates within 32/64 threads (a single warp)
- Block scope: block.sum() operates across the entire block (multiple warps)
- Same simplicity: both replace a complex manual reduction with a one-line call
- Automatic coordination: block.sum() handles the cross-warp barriers that warp.sum() cannot
Technical analysis: what does block.sum() actually compile to?
To understand what block.sum() really generates, we compiled the puzzle with debug information:
pixi run mojo build --emit llvm --debug-level=line-tables solutions/p27/p27.mojo -o solutions/p27/p27.ll
The generated LLVM file solutions/p27/p27.ll embeds the PTX assembly, showing the actual GPU instructions on compatible NVIDIA GPUs:
Finding 1: it is not a single instruction
block.sum() compiles to roughly 20+ PTX instructions, organized as a two-phase reduction:

Phase 1: warp-level reduction (butterfly shuffles)
shfl.sync.bfly.b32 %r23, %r46, 16, 31, -1;  // Shuffle with offset 16
add.f32 %r24, %r46, %r23;                   // Add the shuffled value
shfl.sync.bfly.b32 %r25, %r24, 8, 31, -1;   // Shuffle with offset 8
add.f32 %r26, %r24, %r25;                   // Add the shuffled value
// ... continues for offsets 4, 2, 1

Phase 2: cross-warp coordination
shr.u32 %r32, %r1, 5;                           // Compute the warp ID
mov.b32 %r34, _global_alloc_$__gpu_shared_mem;  // Shared memory
bar.sync 0;                                     // Barrier synchronization
// ... another butterfly shuffle sequence for the cross-warp reduction
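The butterfly (bfly) pattern in phase 1 can be modeled on the CPU: each step pairs every lane with the lane whose index differs by the current offset (an XOR), so after log2(width) steps every lane holds the full sum. A plain-Python sketch of the idea, not the actual PTX semantics:

```python
def butterfly_reduce(lane_vals):
    """CPU model of the shfl.sync.bfly pattern: at each step every lane
    reads the value of lane (i XOR offset) and adds it to its own.
    After log2(n) steps, every lane holds the same total."""
    n = len(lane_vals)  # warp width, assumed a power of two
    vals = list(lane_vals)
    offset = n // 2  # 16, 8, 4, 2, 1 for a 32-lane warp
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(n)]
        offset //= 2
    return vals  # all lanes end with the sum

lanes = butterfly_reduce([float(i) for i in range(32)])
print(lanes[0], lanes[31])  # 496.0 496.0 (sum of 0..31)
```

Note how, unlike the tree reduction, every lane stays busy at every step and every lane finishes with the result, which is what makes the broadcast variant essentially free.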
Finding 2: a hardware-optimized implementation
- Butterfly shuffles: more efficient than a tree reduction
- Automatic barrier placement: cross-warp synchronization is handled automatically
- Optimized memory access: shared memory is used strategically
- Architecture awareness: the same API works on NVIDIA (32-thread warps) and AMD (32- or 64-thread warps)
Finding 3: algorithmic complexity analysis
Analysis methodology:
- Examined the PTX assembly in the binary's ELF section (.nv_debug_ptx_txt)
- Identified algorithmic differences rather than counting individual instructions
Key algorithmic differences observed:
- Traditional approach: tree reduction using shared memory + numerous bar.sync calls
- block.sum(): butterfly shuffle patterns + optimized cross-warp coordination
The performance advantage comes from a carefully optimized choice of algorithm (butterfly > tree), not from magic instruction-count-reducing hardware. For implementation details, see block.mojo in Mojo's gpu module.
Performance insights
block.sum() vs. the traditional approach:
- Code simplicity: the reduction shrinks from 15+ lines to 1
- Memory usage: no shared memory allocation required
- Synchronization: no explicit barriers required
- Scalability: works for any block size within hardware limits
block.sum() vs. warp.sum():
- Scope: the entire block (128 threads) vs. a single warp (32 threads)
- Use case: when you need a reduction across the whole block
- Convenience: the same programming model at a different scale
When to use block.sum():
- Single-block problems: when all the data fits in one block
- Block-level algorithms: shared-memory operations that require a reduction
- Convenience over scalability: simpler than multi-block approaches
Relationship to previous puzzles
From Puzzle 12 (the traditional approach):
Complexity: shared memory + barriers + tree reduction
↓
Simplicity: the block.sum() hardware primitive
From Puzzle 24 (warp.sum()):
Warp level: warp.sum() - 32 threads (a single warp)
↓
Block level: block.sum() - 128 threads (multiple warps)
The three-stage progression:
- Manual reduction (Puzzle 12): complex shared memory + barriers + tree reduction
- Warp primitives (Puzzle 24): warp.sum() - simple, but limited to a single warp
- Block primitives (Puzzle 27): block.sum() - warp simplicity extended to multiple warps
Key insight: block.sum() delivers the simplicity of warp.sum() while scaling to the entire block, automatically handling the complex cross-warp coordination you would otherwise implement by hand.
Next steps
Now that you have learned block.sum() operations, you can move on to:
- block.prefix_sum() and parallel histogram binning: cumulative operations across block threads
- block.broadcast() and vector normalization: sharing a value with all threads in the block
💡 Key takeaway: Block operations extend warp programming concepts to an entire thread block, providing optimized primitives that operate concurrently across multiple warps while replacing complex synchronization patterns. Just as warp.sum() simplified warp-level reduction, block.sum() simplifies block-level reduction without sacrificing performance.