The core of warp.sum() - warp-level dot product

Puzzle 12μ—μ„œ μ‚΄νŽ΄λ³Έ 내적을 Mojo의 μ›Œν”„ μ—°μ‚°μœΌλ‘œ κ΅¬ν˜„ν•©λ‹ˆλ‹€. λ³΅μž‘ν•œ 곡유 λ©”λͺ¨λ¦¬ νŒ¨ν„΄μ„ κ°„λ‹¨ν•œ ν•¨μˆ˜ 호좜둜 λŒ€μ²΄ν•©λ‹ˆλ‹€. 각 μ›Œν”„ 레인이 ν•˜λ‚˜μ˜ μš”μ†Œλ₯Ό μ²˜λ¦¬ν•˜κ³  warp.sum()으둜 κ²°κ³Όλ₯Ό μžλ™μœΌλ‘œ ν•©μ‚°ν•˜μ—¬, μ›Œν”„ ν”„λ‘œκ·Έλž˜λ°μ΄ GPU 동기화λ₯Ό μ–΄λ–»κ²Œ λ³€ν™˜ν•˜λŠ”μ§€ λ³΄μ—¬μ€λ‹ˆλ‹€.

Key insight: the warp.sum() operation leverages SIMT execution to replace shared memory + barriers + tree reduction with a single hardware-accelerated instruction.

Key concepts

What you will learn in this puzzle:

  • Warp-level reduction with warp.sum()
  • The SIMT execution model and lane synchronization
  • Cross-architecture portability via WARP_SIZE
  • The performance transformation from complex to simple patterns
  • Lane-ID management and conditional writes

μˆ˜ν•™μ  연산은 λ‚΄μ μž…λ‹ˆλ‹€: \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]

Along the way, the implementation teaches fundamental patterns that apply to all warp-level GPU programming in Mojo.
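As a quick numeric sanity check, here is a host-side Python sketch of the dot product, assuming inputs a[i] = b[i] = i (the pattern implied by the expected output 10416.0 shown later in this puzzle):

```python
# Reference dot product on the host, assuming a[i] = b[i] = i for SIZE = 32.
SIZE = 32
a = [float(i) for i in range(SIZE)]
b = [float(i) for i in range(SIZE)]

result = sum(x * y for x, y in zip(a, b))
print(result)  # 10416.0
```

The GPU versions below must reproduce this value.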

Configuration

  • Vector size: SIZE = WARP_SIZE (32 or 64, depending on GPU architecture)
  • Data type: DType.float32
  • Block configuration: (WARP_SIZE, 1) threads per block
  • Grid configuration: (1, 1) blocks per grid
  • Layout: Layout.row_major(SIZE) (1D row-major)

κΈ°μ‘΄ λ°©μ‹μ˜ λ³΅μž‘μ„± (Puzzle 12μ—μ„œ)

solutions/p12/p12.mojo의 λ³΅μž‘ν•œ 방식을 λ– μ˜¬λ € λ΄…μ‹œλ‹€. 곡유 λ©”λͺ¨λ¦¬, 배리어, 트리 λ¦¬λ•μ…˜μ΄ ν•„μš”ν–ˆμŠ΅λ‹ˆλ‹€:

comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype]()
comptime in_layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(1)


fn traditional_dot_product_p12_style[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    """
    This is the complex approach from p12_layout_tensor.mojo - kept for comparison.
    """
    shared = LayoutTensor[
        dtype,
        Layout.row_major(WARP_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    if global_i < size:
        shared[local_i] = (a[global_i] * b[global_i]).reduce_add()
    else:
        shared[local_i] = 0.0

    barrier()

    stride = WARP_SIZE // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    if local_i == 0:
        output[global_i // WARP_SIZE] = shared[0]


Why this approach is complex:

  • Shared memory allocation: memory managed manually within the block
  • Explicit barriers: barrier() calls to synchronize threads
  • Tree reduction: a loop with intricate stride-based indexing
  • Conditional writes: only thread 0 records the final result

It works, but the code is verbose, error-prone, and demands a deep understanding of GPU synchronization.

Testing the traditional approach:

pixi run p24 --traditional
pixi run -e amd p24 --traditional
pixi run -e apple p24 --traditional
uv run poe p24 --traditional

Code to complete

1. The simple warp kernel approach

Transform the complex traditional approach into a simple warp kernel that uses warp_sum():

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    # FILL IN (6 lines at most)


View the full file: problems/p24/p24.mojo

Tips

1. Understanding the simple warp kernel structure

You need to complete the simple_warp_dot_product function in at most 6 lines:

fn simple_warp_dot_product[...](output, a, b):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # Fill in here (6 lines at most)

The pattern to follow:

  1. Compute the partial product for this thread's element
  2. Combine the values from all warp lanes with warp_sum()
  3. Have lane 0 write the final result
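The three steps can be emulated on the host. This is a Python sketch of the kernel's logic (not Mojo), assuming a[i] = b[i] = i as the input data:

```python
# Host-side emulation of the three-step warp kernel pattern.
WARP_SIZE = 32
size = WARP_SIZE
a = [float(i) for i in range(size)]
b = [float(i) for i in range(size)]
output = [0.0]

# 1. Each lane computes one partial product (with a bounds guard).
partials = [a[g] * b[g] if g < size else 0.0 for g in range(WARP_SIZE)]

# 2. warp_sum() combines all lanes; a plain sum stands in for it here.
total = sum(partials)

# 3. Only "lane 0" writes the final result, once per warp.
output[0] = total
print(output[0])  # 10416.0
```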

2. Computing the partial product

var partial_product: Scalar[dtype] = 0
if global_i < size:
    partial_product = (a[global_i] * b[global_i]).reduce_add()

Why .reduce_add() is needed: values in Mojo are SIMD-based, so a[global_i] * b[global_i] returns a SIMD vector. .reduce_add() collapses that vector into a single scalar.

Bounds check: essential, because not every thread is guaranteed to hold valid data.
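The effect of the guard can be modeled on the host: out-of-range lanes contribute 0.0, so they add nothing to the warp-wide sum. A Python sketch with a hypothetical size smaller than the warp (not the puzzle's actual configuration):

```python
WARP_SIZE = 32
size = 20  # hypothetical: fewer valid elements than lanes

a = [float(i) for i in range(size)]
b = [float(i) for i in range(size)]

# Each lane computes a partial product; lanes past `size` are forced to 0.0,
# so the subsequent sum only counts valid elements.
partials = [a[i] * b[i] if i < size else 0.0 for i in range(WARP_SIZE)]
print(sum(partials))  # 2470.0 == sum of i*i for i in 0..19
```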

3. The magic of warp reduction

total = warp_sum(partial_product)

What warp_sum() does:

  • Takes each lane's partial_product value
  • Sums the values across every lane in the warp (hardware-accelerated)
  • Returns the same total to all lanes (not just lane 0)
  • Needs no explicit synchronization at all (SIMT handles it)
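The broadcast behavior can be emulated on the host. A Python sketch (the real warp_sum runs in hardware, with all lanes in lockstep):

```python
WARP_SIZE = 32

def emulated_warp_sum(lane_values):
    """Every lane receives the same total, mimicking warp_sum's broadcast."""
    total = sum(lane_values)
    return [total] * WARP_SIZE   # one copy of the result per lane

# partial products assuming a[i] = b[i] = i
partials = [float(i * i) for i in range(WARP_SIZE)]
totals = emulated_warp_sum(partials)
print(totals[0], totals[31])  # 10416.0 10416.0 - identical on every lane
```

Because every lane ends up with the same value, the kernel needs a lane check only to avoid redundant writes, not to obtain the result.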

4. Writing the result

if lane_id() == 0:
    output[global_i // WARP_SIZE] = total

Why only lane 0? After warp_sum() every lane holds the same total, but we write it exactly once to avoid a race condition.

Why not write directly to output[0]? Flexibility. The same function works when there are multiple warps: each warp's result lands at position global_i // WARP_SIZE.

lane_id(): returns 0-31 (NVIDIA) or 0-63 (AMD), identifying which lane within the warp a thread is.
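The mapping from a global thread index to its lane and its output slot can be sketched as follows (a Python illustration assuming WARP_SIZE = 32):

```python
WARP_SIZE = 32

for global_i in (0, 31, 32, 95):
    lane = global_i % WARP_SIZE    # what lane_id() returns for this thread
    warp = global_i // WARP_SIZE   # which output slot this thread's warp writes
    print(global_i, lane, warp)
# 0 0 0
# 31 31 0
# 32 0 1
# 95 31 2
```

Threads 0 and 31 are the first and last lanes of warp 0; thread 32 starts warp 1, whose lane 0 writes output[1].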

Testing the simple warp kernel:

uv run poe p24 --kernel
pixi run p24 --kernel

ν’€μ—ˆμ„ λ•Œμ˜ μ˜ˆμƒ 좜λ ₯:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
πŸš€ Notice how simple the warp version is compared to p12.mojo!
   Same kernel structure, but warp_sum() replaces all the complexity!

Solution

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        output[global_i // WARP_SIZE] = total


The simple warp kernel demonstrates a fundamental shift from complex synchronization to hardware-accelerated primitives:

What disappeared from the traditional approach:

  • 15+ lines → 6 lines: a dramatic code reduction
  • Shared memory allocation: no manual memory management
  • 3+ barrier() calls: zero explicit synchronization
  • Complex tree reduction: replaced by a single function call
  • Stride-based indexing: eliminated entirely

The SIMT execution model:

Warp lanes (SIMT execution):
Lane 0:  partial_product = a[0] * b[0]   = 0.0
Lane 1:  partial_product = a[1] * b[1]   = 1.0
Lane 2:  partial_product = a[2] * b[2]   = 4.0
...
Lane 31: partial_product = a[31] * b[31] = 961.0

warp_sum() hardware operation:
All lanes → 0.0 + 1.0 + 4.0 + ... + 961.0 = 10416.0
All lanes receive → total = 10416.0 (broadcast result)

Why it works without barriers:

  1. SIMT execution: all lanes execute each instruction simultaneously
  2. Hardware synchronization: by the time warp_sum() begins, every lane has already computed its partial_product
  3. Built-in communication: the GPU hardware performs the reduction itself
  4. Broadcast result: every lane receives the same total

2. The functional approach

This time, implement the same warp dot product using Mojo's functional programming pattern:

fn functional_warp_dot_product[
    layout: Layout,
    out_layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
](
    output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn compute_dot_product[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        print("idx:", idx)
        # FILL IN (10 lines at most)

    # Launch exactly size == WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](size, ctx)


Tips

1. Understanding the structure of the functional approach

You need to complete the compute_dot_product function in at most 10 lines:

@parameter
@always_inline
fn compute_dot_product[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    idx = indices[0]
    # Fill in here (10 lines at most)

ν•¨μˆ˜ν˜• νŒ¨ν„΄μ˜ 차이점:

  • elementwiseλ₯Ό μ‚¬μš©ν•˜μ—¬ μ •ν™•νžˆ WARP_SIZE개의 μŠ€λ ˆλ“œ μ‹€ν–‰
  • 각 μŠ€λ ˆλ“œκ°€ idxλ₯Ό 기반으둜 ν•˜λ‚˜μ˜ μš”μ†Œ 처리
  • 같은 μ›Œν”„ μ—°μ‚°, λ‹€λ₯Έ μ‹€ν–‰ λ©”μ»€λ‹ˆμ¦˜

2. Computing the partial product

var partial_product: Scalar[dtype] = 0.0
if idx < size:
    a_val = a.load[1](idx, 0)
    b_val = b.load[1](idx, 0)
    partial_product = (a_val * b_val).reduce_add()
else:
    partial_product = 0.0

Loading pattern: a.load[1](idx, 0) loads exactly 1 element at position idx (no SIMD vectorization).

Boundary handling: out-of-range threads set partial_product to 0.0 so they contribute nothing to the sum.

3. Warp operation and store

total = warp_sum(partial_product)

if lane_id() == 0:
    output.store[1](Index(idx // WARP_SIZE), total)

Store pattern: output.store[1](Index(idx // WARP_SIZE), total) stores 1 element at position idx // WARP_SIZE of the output tensor.

Same warp logic: warp_sum() and the lane-0 write behave exactly as they do in the kernel approach.

4. importμ—μ„œ μ‚¬μš© κ°€λŠ₯ν•œ ν•¨μˆ˜λ“€

from gpu import lane_id
from gpu.primitives.warp import sum as warp_sum, WARP_SIZE

# Inside the function:
my_lane = lane_id()           # 0 to WARP_SIZE-1
total = warp_sum(my_value)    # hardware-accelerated reduction
warp_size = WARP_SIZE         # 32 (NVIDIA) or 64 (AMD)

Testing the functional approach:

uv run poe p24 --functional
pixi run p24 --functional

ν’€μ—ˆμ„ λ•Œμ˜ μ˜ˆμƒ 좜λ ₯:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
πŸ”§ Functional approach shows modern Mojo style with warp operations!
   Clean, composable, and still leverages warp hardware primitives!

Solution

fn functional_warp_dot_product[
    layout: Layout,
    out_layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
](
    output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn compute_dot_product[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]

        # Each thread computes one partial product
        var partial_product: Scalar[dtype] = 0.0
        if idx < size:
            a_val = a.load[1](Index(idx))
            b_val = b.load[1](Index(idx))
            partial_product = a_val * b_val
        else:
            partial_product = 0.0

        # Warp magic - combines all WARP_SIZE partial products!
        total = warp_sum(partial_product)

        # Only lane 0 writes the result (all lanes have the same total)
        if lane_id() == 0:
            output.store[1](Index(idx // WARP_SIZE), total)

    # Launch exactly size == WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](size, ctx)


The functional warp approach showcases a modern Mojo programming pattern built on warp operations:

Characteristics of the functional approach:

elementwise[compute_dot_product, 1, target="gpu"](size, ctx)

Advantages:

  • Type safety: compile-time tensor layout checks
  • Composability: integrates easily with other functional operations
  • Modern pattern: leverages Mojo's functional programming features
  • Automatic optimization: the compiler can apply high-level optimizations

컀널 λ°©μ‹κ³Όμ˜ μ£Όμš” 차이:

  • μ‹€ν–‰ λ©”μ»€λ‹ˆμ¦˜: enqueue_function λŒ€μ‹  elementwise μ‚¬μš©
  • λ©”λͺ¨λ¦¬ μ ‘κ·Ό: .load[1]()κ³Ό .store[1]() νŒ¨ν„΄ μ‚¬μš©
  • 톡합성: λ‹€λ₯Έ ν•¨μˆ˜ν˜• μ—°μ‚°κ³Ό μžμ—°μŠ€λŸ½κ²Œ κ²°ν•©

λ™μΌν•œ μ›Œν”„μ˜ 이점:

  • 동기화 제둜: warp_sum()이 λ™μΌν•˜κ²Œ λ™μž‘
  • ν•˜λ“œμ›¨μ–΄ 가속: 컀널 방식과 같은 μ„±λŠ₯
  • 크둜슀 μ•„ν‚€ν…μ²˜: WARP_SIZEκ°€ μžλ™μœΌλ‘œ 적응

Performance comparison with benchmarks

Run the comprehensive benchmark to see how the warp operations scale:

uv run poe p24 --benchmark
pixi run p24 --benchmark

Example output from a full benchmark run:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
--------------------------------------------------------------------------------
Testing SIZE=1 x WARP_SIZE, BLOCKS=1
Running traditional_1x
Running simple_warp_1x
Running functional_warp_1x
--------------------------------------------------------------------------------
Testing SIZE=4 x WARP_SIZE, BLOCKS=4
Running traditional_4x
Running simple_warp_4x
Running functional_warp_4x
--------------------------------------------------------------------------------
Testing SIZE=32 x WARP_SIZE, BLOCKS=32
Running traditional_32x
Running simple_warp_32x
Running functional_warp_32x
--------------------------------------------------------------------------------
Testing SIZE=256 x WARP_SIZE, BLOCKS=256
Running traditional_256x
Running simple_warp_256x
Running functional_warp_256x
--------------------------------------------------------------------------------
Testing SIZE=2048 x WARP_SIZE, BLOCKS=2048
Running traditional_2048x
Running simple_warp_2048x
Running functional_warp_2048x
--------------------------------------------------------------------------------
Testing SIZE=16384 x WARP_SIZE, BLOCKS=16384 (Large Scale)
Running traditional_16384x
Running simple_warp_16384x
Running functional_warp_16384x
--------------------------------------------------------------------------------
Testing SIZE=65536 x WARP_SIZE, BLOCKS=65536 (Massive Scale)
Running traditional_65536x
Running simple_warp_65536x
Running functional_warp_65536x
| name                   | met (ms)              | iters |
| ---------------------- | --------------------- | ----- |
| traditional_1x         | 0.00460128            | 100   |
| simple_warp_1x         | 0.00574047            | 100   |
| functional_warp_1x     | 0.00484192            | 100   |
| traditional_4x         | 0.00492671            | 100   |
| simple_warp_4x         | 0.00485247            | 100   |
| functional_warp_4x     | 0.00587679            | 100   |
| traditional_32x        | 0.0062406399999999996 | 100   |
| simple_warp_32x        | 0.0054918400000000004 | 100   |
| functional_warp_32x    | 0.00552447            | 100   |
| traditional_256x       | 0.0050614300000000004 | 100   |
| simple_warp_256x       | 0.00488768            | 100   |
| functional_warp_256x   | 0.00461472            | 100   |
| traditional_2048x      | 0.01120031            | 100   |
| simple_warp_2048x      | 0.00884383            | 100   |
| functional_warp_2048x  | 0.007038720000000001  | 100   |
| traditional_16384x     | 0.038533750000000005  | 100   |
| simple_warp_16384x     | 0.0323264             | 100   |
| functional_warp_16384x | 0.01674271            | 100   |
| traditional_65536x     | 0.19784991999999998   | 100   |
| simple_warp_65536x     | 0.12870176            | 100   |
| functional_warp_65536x | 0.048680310000000004  | 100   |

Benchmarks completed!

WARP OPERATIONS PERFORMANCE ANALYSIS:
   GPU Architecture: NVIDIA (WARP_SIZE=32) vs AMD (WARP_SIZE=64)
   - 1,...,256 x WARP_SIZE: Grid size too small to benchmark
   - 2048 x WARP_SIZE: Warp primitive benefits emerge
   - 16384 x WARP_SIZE: Large scale (512K-1M elements)
   - 65536 x WARP_SIZE: Massive scale (2M-4M elements)

   Expected Results at Large Scales:
   β€’ Traditional: Slower due to more barrier overhead
   β€’ Warp operations: Faster, scale better with problem size
   β€’ Memory bandwidth becomes the limiting factor

이 μ˜ˆμ‹œμ—μ„œ 얻을 수 μžˆλŠ” μ„±λŠ₯ μΈμ‚¬μ΄νŠΈ:

  • μ†Œκ·œλͺ¨ (1x-4x): μ›Œν”„ 연산이 μ†Œν­μ˜ κ°œμ„ μ„ λ³΄μž„ (~10-15% 빠름)
  • μ€‘κ·œλͺ¨ (32x-256x): ν•¨μˆ˜ν˜• 방식이 κ°€μž₯ 쒋은 μ„±λŠ₯을 λ³΄μ΄λŠ” κ²½μš°κ°€ 많음
  • λŒ€κ·œλͺ¨ (16K-65K): λ©”λͺ¨λ¦¬ λŒ€μ—­ν­μ΄ 지배적이 λ˜λ©΄μ„œ λͺ¨λ“  λ°©μ‹μ˜ μ„±λŠ₯이 수렴
  • 변동성: μ„±λŠ₯은 νŠΉμ • GPU μ•„ν‚€ν…μ²˜μ™€ λ©”λͺ¨λ¦¬ μ„œλΈŒμ‹œμŠ€ν…œμ— 크게 의쑴

Note: results vary substantially with hardware (GPU model, memory bandwidth, WARP_SIZE). The key is to observe the relative performance trends rather than the absolute numbers.
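Using the sample timings in the table above, the speedups at the two largest scales work out as follows (a Python check of the published numbers, not a new measurement):

```python
# met (ms) values copied from the sample benchmark table above
timings = {
    "traditional_16384x": 0.03853375,
    "functional_warp_16384x": 0.01674271,
    "traditional_65536x": 0.19784992,
    "functional_warp_65536x": 0.04868031,
}

for scale in ("16384x", "65536x"):
    speedup = timings[f"traditional_{scale}"] / timings[f"functional_warp_{scale}"]
    print(scale, round(speedup, 2))  # ~2.3x at 16384x, ~4.06x at 65536x
```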

λ‹€μŒ 단계

warp.sum 연산을 λ°°μ› μœΌλ‹ˆ, λ‹€μŒμœΌλ‘œ μ§„ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€:

  • μ–Έμ œ μ›Œν”„ ν”„λ‘œκ·Έλž˜λ°μ„ μ‚¬μš©ν• κΉŒ: μ›Œν”„ vs κΈ°μ‘΄ 방식에 λŒ€ν•œ μ „λž΅μ  μ˜μ‚¬κ²°μ • ν”„λ ˆμž„μ›Œν¬
  • κ³ κΈ‰ μ›Œν”„ μ—°μ‚°: λ³΅μž‘ν•œ 톡신 νŒ¨ν„΄μ„ μœ„ν•œ shuffle_idx(), shuffle_down(), prefix_sum()
  • λ©€ν‹° μ›Œν”„ μ•Œκ³ λ¦¬μ¦˜: μ›Œν”„ μ—°μ‚°κ³Ό 블둝 레벨 λ™κΈ°ν™”μ˜ κ²°ν•©
  • λ©”λͺ¨λ¦¬ 병합 μ΅œμ ν™”: μ΅œλŒ€ λŒ€μ—­ν­μ„ μœ„ν•œ λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄ μ΅œμ ν™”

πŸ’‘ 핡심 μš”μ : μ›Œν”„ 연산은 λ³΅μž‘ν•œ 동기화 νŒ¨ν„΄μ„ ν•˜λ“œμ›¨μ–΄ 가속 κΈ°λ³Έ μš”μ†Œλ‘œ λŒ€μ²΄ν•˜μ—¬ GPU ν”„λ‘œκ·Έλž˜λ°μ„ λ³€ν™˜ν•©λ‹ˆλ‹€. μ‹€ν–‰ λͺ¨λΈμ„ μ΄ν•΄ν•˜λ©΄ μ„±λŠ₯을 ν¬μƒν•˜μ§€ μ•Šκ³ λ„ 획기적인 λ‹¨μˆœν™”κ°€ κ°€λŠ₯ν•©λ‹ˆλ‹€.