warp.shuffle_xor() ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์—์„œ๋Š” shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๋‚ด์— ์ •๊ตํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด์„ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜, ์ •๋ ฌ ๋„คํŠธ์›Œํฌ, ๊ณ ๊ธ‰ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: shuffle_xor() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋ฉฐ, ์›Œํ”„ ํฌ๊ธฐ์— ๋Œ€ํ•ด \(O(\log n)\) ๋ณต์žก๋„๋กœ ํ™•์žฅ๋˜๋Š” ํšจ์œจ์ ์ธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋ž€? ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค์ด ์ธ๋ฑ์Šค์˜ XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•˜๋Š” ํ†ต์‹  ํ† ํด๋กœ์ง€์ž…๋‹ˆ๋‹ค. ์ด๋ฆ„์€ ์‹œ๊ฐ์ ์œผ๋กœ ๊ทธ๋ ธ์„ ๋•Œ ๋‚˜๋น„ ๋‚ ๊ฐœ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ์—ฐ๊ฒฐ ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋„คํŠธ์›Œํฌ๋Š” \(O(\log n)\) ํ†ต์‹  ๋ณต์žก๋„๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— FFT, bitonic ์ •๋ ฌ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ฐ™์€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_xor()์„ ํ™œ์šฉํ•œ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€
  • \(O(\log n)\) ๋ณต์žก๋„์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ๊ณ ๊ธ‰ ์กฐ์ •์„ ์œ„ํ•œ ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ

shuffle_xor ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{shuffle_xor}(\text{value}, \text{mask}) = \text{value_from_lane}(\text{lane_id} \oplus \text{mask})\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ์กฐ์ • ์—†์ด ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜๊ณผ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ๊ธฐ๋ณธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)

shuffle_xor ๊ฐœ๋…

๊ธฐ์กด ํŽ˜์–ด ๊ตํ™˜ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ณผ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”
shared_memory[lane] = input[global_i]
barrier()
if lane % 2 == 0:
    partner = lane + 1
else:
    partner = lane - 1
if partner < WARP_SIZE:
    swapped_val = shared_memory[partner]

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”
  • ๋ณต์žกํ•œ ๋กœ์ง: ์ˆ˜๋™ ํŒŒํŠธ๋„ˆ ๊ณ„์‚ฐ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ๋‚ฎ์€ ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ํ†ต์‹ ์„ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•จ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜๋ฉด ํŽ˜์–ด ๊ตํ™˜์ด ์šฐ์•„ํ•ด์ง‘๋‹ˆ๋‹ค:

# ๋ฒ„ํ„ฐํ”Œ๋ผ์ด XOR ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”
current_val = input[global_i]
swapped_val = shuffle_xor(current_val, 1)  # 1๊ณผ XORํ•˜๋ฉด ํŽ˜์–ด๊ฐ€ ์ƒ์„ฑ๋จ
output[global_i] = swapped_val

shuffle_xor์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ชจ๋“  ๋ ˆ์ธ์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์ธ์ ‘ ํŽ˜์–ด ๊ฐ„ ๊ฐ’์„ ๊ตํ™˜ํ•˜๋Š” ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: XOR ํŒจํ„ด์œผ๋กœ ์ธ์ ‘ ํŽ˜์–ด๋ฅผ ๋งŒ๋“ค์–ด ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \text{input}[i \oplus 1]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [0, 1, 2, 3, 4, 5, 6, 7, ...]์„ ํŽ˜์–ด [1, 0, 3, 2, 5, 4, 7, 6, ...]์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๊ฐ ํŽ˜์–ด (i, i+1)์ด XOR ํ†ต์‹ ์œผ๋กœ ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค.


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p26/p26.mojo

ํŒ

1. shuffle_xor ์ดํ•ดํ•˜๊ธฐ

shuffle_xor(value, mask) ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ๋งˆ์Šคํฌ๋งŒํผ ์ฐจ์ด๋‚˜๋Š” ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์Šคํฌ ๊ฐ’์œผ๋กœ ๋ ˆ์ธ ID๋ฅผ XORํ–ˆ์„ ๋•Œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํƒ๊ตฌํ•  ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ๋ ˆ์ธ 1์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?

ํžŒํŠธ: ์ฒ˜์Œ ๋ช‡ ๊ฐœ์˜ ๋ ˆ์ธ ID์— ๋Œ€ํ•ด XOR ์—ฐ์‚ฐ์„ ์ง์ ‘ ํ•ด๋ณด๋ฉด ํŽ˜์–ด๋ง ํŒจํ„ด์„ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2. XOR ํŽ˜์–ด ํŒจํ„ด

๋ ˆ์ธ ID์˜ ์ด์ง„ ํ‘œํ˜„๊ณผ ์ตœํ•˜์œ„ ๋น„ํŠธ๋ฅผ ๋’ค์ง‘์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

๊ณ ๋ คํ•  ์งˆ๋ฌธ:

  • ์ง์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ํ™€์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ์™œ ์ด๊ฒƒ์ด ์™„๋ฒฝํ•œ ํŽ˜์–ด๋ฅผ ๋งŒ๋“œ๋‚˜์š”?

3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

shuffle_down()๊ณผ ๋‹ฌ๋ฆฌ shuffle_xor() ์—ฐ์‚ฐ์€ ์›Œํ”„ ๊ฒฝ๊ณ„ ๋‚ด์—์„œ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ž‘์€ ๋งˆ์Šคํฌ๋กœ์˜ XOR์ด ์ ˆ๋Œ€๋กœ ๋ฒ”์œ„ ๋ฐ–์˜ ๋ ˆ์ธ ID๋ฅผ ๋งŒ๋“ค์ง€ ์•Š๋Š” ์ด์œ ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ์œ ํšจํ•œ ๋ ˆ์ธ ID๋ฅผ 1๊ณผ XORํ–ˆ์„ ๋•Œ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ๋ ˆ์ธ ID๋Š” ์–ผ๋งˆ์ธ๊ฐ€์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜ ํ…Œ์ŠคํŠธ:

pixi run p26 --pair-swap
pixi run -e amd p26 --pair-swap
pixi run -e apple p26 --pair-swap
uv run poe p26 --pair-swap

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
expected: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
โœ… Butterfly pair swap test passed!

์†”๋ฃจ์…˜

fn butterfly_pair_swap[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
    Each thread exchanges its value with its XOR-1 neighbor, creating pairs: (0,1), (2,3), (4,5), etc.
    Uses shuffle_xor(val, 1) to swap values within each pair.
    This is the foundation of butterfly network communication patterns.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        current_val = input[global_i]

        # Exchange with XOR-1 neighbor using butterfly pattern
        # Lane 0 exchanges with lane 1, lane 2 with lane 3, etc.
        swapped_val = shuffle_xor(current_val, 1)

        # For demonstration, we'll store the swapped value
        # In real applications, this might be used for sorting, reduction, etc.
        output[global_i] = swapped_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด XOR ํ†ต์‹  ํŒจํ„ด์„ ํ†ตํ•ด ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ์–ด๋–ป๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]              # ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ฝ์Œ
    swapped_val = shuffle_xor(current_val, 1)  # XOR๋กœ ํŽ˜์–ด ๊ตํ™˜ ์ƒ์„ฑ

    # ๊ตํ™˜๋œ ๊ฐ’์„ ์ €์žฅ
    output[global_i] = swapped_val

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  Lane 0: current_val = input[0] = 0
  Lane 1: current_val = input[1] = 1
  Lane 2: current_val = input[2] = 2
  Lane 3: current_val = input[3] = 3
  ...
  Lane 31: current_val = input[31] = 31

์‚ฌ์ดํด 2: shuffle_xor(current_val, 1)์ด ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์‹คํ–‰
  Lane 0: Lane 1์—์„œ ์ˆ˜์‹  (0โŠ•1=1) โ†’ swapped_val = 1
  Lane 1: Lane 0์—์„œ ์ˆ˜์‹  (1โŠ•1=0) โ†’ swapped_val = 0
  Lane 2: Lane 3์—์„œ ์ˆ˜์‹  (2โŠ•1=3) โ†’ swapped_val = 3
  Lane 3: Lane 2์—์„œ ์ˆ˜์‹  (3โŠ•1=2) โ†’ swapped_val = 2
  ...
  Lane 30: Lane 31์—์„œ ์ˆ˜์‹  (30โŠ•1=31) โ†’ swapped_val = 31
  Lane 31: Lane 30์—์„œ ์ˆ˜์‹  (31โŠ•1=30) โ†’ swapped_val = 30

์‚ฌ์ดํด 3: ๊ฒฐ๊ณผ ์ €์žฅ
  Lane 0: output[0] = 1
  Lane 1: output[1] = 0
  Lane 2: output[2] = 3
  Lane 3: output[3] = 2
  ...

์ˆ˜ํ•™์  ํ†ต์ฐฐ: XOR ์†์„ฑ์„ ํ™œ์šฉํ•œ ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{XOR}(i, 1) = \begin{cases} i + 1 & \text{if } i \bmod 2 = 0 \\ i - 1 & \text{if } i \bmod 2 = 1 \end{cases}\]

shuffle_xor์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ์™„๋ฒฝํ•œ ๋Œ€์นญ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ํŽ˜์–ด์— ์ฐธ์—ฌ
  2. ์กฐ์ • ๋ถˆํ•„์š”: ๋ชจ๋“  ํŽ˜์–ด๊ฐ€ ๋™์‹œ์— ๊ตํ™˜
  3. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์›Œํ”„ ์ „์ฒด์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  4. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜)
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์—†์Œ)
  • ๋ณ‘๋ ฌ์„ฑ: WARP_SIZE๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ๊ตํ™˜
  • ํ™•์žฅ์„ฑ: ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๊ด€๊ณ„์—†์ด \(O(1)\) ๋ณต์žก๋„

2. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

๊ฐ์†Œํ•˜๋Š” offset์œผ๋กœ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด shuffle_xor์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ํ†ตํ•ด ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{max_result} = \max_{i=0}^{\small\text{WARP_SIZE}-1} \text{input}[i]\]

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŒจํ„ด: XOR ์˜คํ”„์…‹์„ WARP_SIZE/2์—์„œ 1๊นŒ์ง€ ์ ˆ๋ฐ˜์”ฉ ์ค„์—ฌ๊ฐ€๋ฉฐ, ํ†ต์‹  ๋ฒ”์œ„๊ฐ€ ๋‹จ๊ณ„๋งˆ๋‹ค ๋ฐ˜์œผ๋กœ ์ข์•„์ง€๋Š” ์ด์ง„ ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • 1๋‹จ๊ณ„: WARP_SIZE/2 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (์›Œํ”„ ์ „์ฒด๋ฅผ ํฌ๊ด„)
  • 2๋‹จ๊ณ„: WARP_SIZE/4 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (๋ฒ”์œ„๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ขํž˜)
  • 3๋‹จ๊ณ„: WARP_SIZE/8 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต
  • 4๋‹จ๊ณ„: offset = 1์ด ๋  ๋•Œ๊นŒ์ง€ ๊ณ„์† ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ž„

\(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„๋ฅผ ๊ฑฐ์น˜๋ฉด ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ–๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

fn butterfly_parallel_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets starting from WARP_SIZE/2 down to 1.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm that works
    for any WARP_SIZE (32, 64, etc.).
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # FILL ME IN (roughly 7 lines)


ํŒ

1. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ์ดํ•ดํ•˜๊ธฐ

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ ์ด์ง„ ํŠธ๋ฆฌ ํ†ต์‹  ํŒจํ„ด์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„์—์„œ ๋ฌธ์ œ ํฌ๊ธฐ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ์ตœ๋Œ€ ๋ฒ”์œ„๋ฅผ ์ปค๋ฒ„ํ•˜๋ ค๋ฉด ์‹œ์ž‘ offset์ด ์–ผ๋งˆ์—ฌ์•ผ ํ•˜๋‚˜์š”?
  • ๋‹จ๊ณ„ ์‚ฌ์ด์— ์˜คํ”„์…‹์„ ์–ด๋–ป๊ฒŒ ๋ณ€๊ฒฝํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ์–ธ์ œ ๋ฆฌ๋•์…˜์„ ๋ฉˆ์ถฐ์•ผ ํ•˜๋‚˜์š”?

ํžŒํŠธ: โ€œ๋ฒ„ํ„ฐํ”Œ๋ผ์ดโ€œ๋ผ๋Š” ์ด๋ฆ„์€ ํ†ต์‹  ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ•ฉ๋‹ˆ๋‹ค - ์ž‘์€ ์˜ˆ์ œ์— ๋Œ€ํ•ด ์ง์ ‘ ๊ทธ๋ ค๋ณด์„ธ์š”.

2. XOR ๋ฆฌ๋•์…˜ ํŠน์„ฑ

XOR์€ ๊ฐ ๋‹จ๊ณ„์—์„œ ๊ฒน์น˜์ง€ ์•Š๋Š” ํ†ต์‹  ํŽ˜์–ด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์—์„œ ์™œ ์ค‘์š”ํ•œ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ์„œ๋กœ ๋‹ค๋ฅธ ์˜คํ”„์…‹์œผ๋กœ์˜ XOR์ด ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ ํ†ต์‹  ํŒจํ„ด์„ ๋งŒ๋“œ๋‚˜์š”?
  • ๊ฐ™์€ ๋‹จ๊ณ„์—์„œ ๋ ˆ์ธ๋“ค์ด ์™œ ์„œ๋กœ ๊ฐ„์„ญํ•˜์ง€ ์•Š๋‚˜์š”?
  • XOR์ด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์— ํŠนํžˆ ์ ํ•ฉํ•œ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

3. ์ตœ๋Œ“๊ฐ’ ๋ˆ„์ 

๊ฐ ๋ ˆ์ธ์€ ์ž์‹ ์˜ โ€œ์˜์—ญโ€œ์—์„œ ์ตœ๋Œ“๊ฐ’์˜ ์ง€์‹์„ ์ ์ง„์ ์œผ๋กœ ์Œ“์•„๊ฐ€์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ:

  • ์ž์‹ ์˜ ๊ฐ’์œผ๋กœ ์‹œ์ž‘
  • ๊ฐ ๋‹จ๊ณ„์—์„œ ์ด์›ƒ์˜ ๊ฐ’๊ณผ ๋น„๊ต
  • ์ตœ๋Œ“๊ฐ’์„ ์œ ์ง€ํ•˜๊ณ  ๊ณ„์† ์ง„ํ–‰

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๊ฐ ๋‹จ๊ณ„ ํ›„, โ€œ์ง€์‹์˜ ์˜์—ญโ€œ์ด ๋‘ ๋ฐฐ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.

  • ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„ ํ›„: ๊ฐ ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ์•Œ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

4. ์ด ํŒจํ„ด์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ 

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ \(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„ ํ›„์— ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  • ๋ชจ๋“  ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์˜ ๊ฐ’์„ ๊ฐ„์ ‘์ ์œผ๋กœ ํ™•์ธ
  • ์ค‘๋ณต ํ†ต์‹  ์—†์Œ: ๊ฐ ํŽ˜์–ด๊ฐ€ ๋‹จ๊ณ„๋‹น ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ๊ตํ™˜
  • ์ตœ์  ๋ณต์žก๋„: \(O(n)\) ์ˆœ์ฐจ ๋น„๊ต ๋Œ€์‹  \(O(\log n)\) ๋‹จ๊ณ„

์ถ”์  ์˜ˆ์ œ (4๊ฐœ ๋ ˆ์ธ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ: Lane 0=3, Lane 1=1, Lane 2=7, Lane 3=2

1๋‹จ๊ณ„ (offset=2): 0 โ†” 2, 1 โ†” 3
  Lane 0: max(3, 7) = 7
  Lane 1: max(1, 2) = 2
  Lane 2: max(7, 3) = 7
  Lane 3: max(2, 1) = 2

2๋‹จ๊ณ„ (offset=1): 0 โ†” 1, 2 โ†” 3
  Lane 0: max(7, 2) = 7
  Lane 1: max(2, 7) = 7
  Lane 2: max(7, 2) = 7
  Lane 3: max(2, 7) = 7

๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’ = 7์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --parallel-max
pixi run -e amd p26 --parallel-max
uv run poe p26 --parallel-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
expected: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
โœ… Butterfly parallel max test passed!

์†”๋ฃจ์…˜

fn butterfly_parallel_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets (16, 8, 4, 2, 1) to perform tree-based reduction.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        max_val = input[global_i]

        # Butterfly reduction tree: dynamic for any WARP_SIZE (32, 64, etc.)
        # Start with half the warp size and reduce by half each step
        offset = WARP_SIZE // 2
        while offset > 0:
            max_val = max(max_val, shuffle_xor(max_val, offset))
            offset //= 2

        # All threads now have the maximum value across the entire warp
        output[global_i] = max_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด \(O(\log n)\) ๋ณต์žก๋„์˜ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ๋ฅผ ์–ด๋–ป๊ฒŒ ์ƒ์„ฑํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    max_val = input[global_i]  # ๋กœ์ปฌ ๊ฐ’์œผ๋กœ ์‹œ์ž‘

    # ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ: ๋ชจ๋“  WARP_SIZE์— ๋™์ ์œผ๋กœ ๋Œ€์‘
    offset = WARP_SIZE // 2
    while offset > 0:
        max_val = max(max_val, shuffle_xor(max_val, offset))
        offset //= 2

    output[global_i] = max_val  # ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์‹คํ–‰ ์ถ”์  (8-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [0,2,4,6,8,10,12,1000]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: max_val = 0,    Lane 1: max_val = 2
  Lane 2: max_val = 4,    Lane 3: max_val = 6
  Lane 4: max_val = 8,    Lane 5: max_val = 10
  Lane 6: max_val = 12,   Lane 7: max_val = 1000

1๋‹จ๊ณ„: shuffle_xor(max_val, 4) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”4: max(0,8)=8,     Lane 1โ†”5: max(2,10)=10
  Lane 2โ†”6: max(4,12)=12,   Lane 3โ†”7: max(6,1000)=1000
  Lane 4โ†”0: max(8,0)=8,     Lane 5โ†”1: max(10,2)=10
  Lane 6โ†”2: max(12,4)=12,   Lane 7โ†”3: max(1000,6)=1000

2๋‹จ๊ณ„: shuffle_xor(max_val, 2) - 1/4 ๊ตํ™˜
  Lane 0โ†”2: max(8,12)=12,   Lane 1โ†”3: max(10,1000)=1000
  Lane 2โ†”0: max(12,8)=12,   Lane 3โ†”1: max(1000,10)=1000
  Lane 4โ†”6: max(8,12)=12,   Lane 5โ†”7: max(10,1000)=1000
  Lane 6โ†”4: max(12,8)=12,   Lane 7โ†”5: max(1000,10)=1000

3๋‹จ๊ณ„: shuffle_xor(max_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max(12,1000)=1000,  Lane 1โ†”0: max(1000,12)=1000
  Lane 2โ†”3: max(12,1000)=1000,  Lane 3โ†”2: max(1000,12)=1000
  Lane 4โ†”5: max(12,1000)=1000,  Lane 5โ†”4: max(1000,12)=1000
  Lane 6โ†”7: max(12,1000)=1000,  Lane 7โ†”6: max(1000,12)=1000

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์˜ max_val = 1000

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์œผ๋กœ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ์ž๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Reduce}(\oplus, [a_0, a_1, \ldots, a_{n-1}]) = a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}\]

์—ฌ๊ธฐ์„œ \(\oplus\)๋Š” max ์—ฐ์‚ฐ์ด๋ฉฐ, ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ์ตœ์  \(O(\log n)\) ๋ณต์žก๋„๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ๋กœ๊ทธ ๋ณต์žก๋„: ์ˆœ์ฐจ ๋ฆฌ๋•์…˜์˜ \(O(n)\)์— ๋น„ํ•ด \(O(\log n)\)
  2. ์™„๋ฒฝํ•œ ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋‹จ๊ณ„์—์„œ ๋™๋“ฑํ•˜๊ฒŒ ์ฐธ์—ฌ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ ์—†์Œ: ์ˆœ์ˆ˜ ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ํ†ต์‹ 
  4. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์— ์ง์ ‘ ๋งคํ•‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ๋‹จ๊ณ„ ์ˆ˜: \(\log_2(\text{WARP_SIZE})\) (์˜ˆ: 32-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 5๋‹จ๊ณ„, 64-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 6๋‹จ๊ณ„)
  • ๋‹จ๊ณ„๋‹น ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜ + ๋น„๊ต)
  • ์ด ์ง€์—ฐ ์‹œ๊ฐ„: ์ˆœ์ฐจ ๋ฐฉ์‹์˜ \((\text{WARP_SIZE}-1)\) ์‚ฌ์ดํด ๋Œ€๋น„ \(\log_2(\text{WARP_SIZE})\) ์‚ฌ์ดํด
  • ๋ณ‘๋ ฌ์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „์ฒด์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์ด ํ™œ์„ฑ ์ƒํƒœ

3. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE_2 = 64 (๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: BLOCKS_PER_GRID_2 = (2, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

์ง์ˆ˜ ๋ ˆ์ธ์€ ์ตœ๋Œ“๊ฐ’์„, ํ™€์ˆ˜ ๋ ˆ์ธ์€ ์ตœ์†Ÿ๊ฐ’์„ ์ €์žฅํ•˜๋Š” ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’ ๋ชจ๋‘์— ๋Œ€ํ•ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \begin{cases} \max_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 0 \\ \min_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 1 \end{cases}\]

์ด์ค‘ ๋ฆฌ๋•์…˜ ํŒจํ„ด: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•œ ํ›„, ๋ ˆ์ธ ID ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ๋ณต์žกํ•œ ๋‹ค์ค‘ ๊ฐ’ ๋ฆฌ๋•์…˜์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = Layout.row_major(SIZE_2)


fn butterfly_conditional_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = lane_id()

    if global_i < size:
        current_val = input[global_i]
        min_val = current_val

        # FILL ME IN (roughly 11 lines)


ํŒ

1. ์ด์ค‘ ์ถ”์  ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ํผ์ฆ์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ฆฌ๋•์…˜ ๊ณผ์ •์—์„œ ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋™์‹œ์— ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ์–ด๋–ค ๋ณ€์ˆ˜๋ฅผ ์ถ”์ ํ•ด์•ผ ํ•˜๋‚˜์š”?

2. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ ๋กœ์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์™„๋ฃŒํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ๋ ˆ์ธ์ด ์ง์ˆ˜์ธ์ง€ ํ™€์ˆ˜์ธ์ง€ ์–ด๋–ป๊ฒŒ ํŒ๋ณ„ํ•˜๋‚˜์š”?
  • ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ๋Œ“๊ฐ’์„, ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ์†Ÿ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ๋ ˆ์ธ ID์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•˜๋‚˜์š”?

3. min๊ณผ max ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ๊ณผ์ œ์˜ ํ•ต์‹ฌ์€ ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ min๊ณผ max๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ณ‘๋ ฌ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • min๊ณผ max์— ๋ณ„๋„์˜ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•œ๊ฐ€์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ์ด์›ƒ ๊ฐ’์„ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์™„๋ฃŒ๋˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?

4. ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ณ ๋ ค์‚ฌํ•ญ

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฆฌ๋•์…˜ ๋ฒ”์œ„์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ค‘์š”ํ•œ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๊ฐ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์˜ ๋ฒ”์œ„๋Š” ์–ด๋””๊นŒ์ง€์ธ๊ฐ€์š”?
  • ๋ธ”๋ก ๊ตฌ์กฐ๊ฐ€ ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
  • ์ „์—ญ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”, ๋ธ”๋ก๋ณ„ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --conditional-max
pixi run -e amd p26 --conditional-max
uv run poe p26 --conditional-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE_2:  64
output: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
expected: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
โœ… Butterfly conditional max test passed!

์†”๋ฃจ์…˜

fn butterfly_conditional_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = lane_id()

    if global_i < size:
        current_val = input[global_i]
        min_val = current_val

        # Butterfly reduction for both maximum and minimum: dynamic for any WARP_SIZE
        offset = WARP_SIZE // 2
        while offset > 0:
            neighbor_val = shuffle_xor(current_val, offset)
            current_val = max(current_val, neighbor_val)

            min_neighbor_val = shuffle_xor(min_val, offset)
            min_val = min(min_val, min_neighbor_val)

            offset //= 2

        # Conditional output: max for even lanes, min for odd lanes
        if lane % 2 == 0:
            output[global_i] = current_val  # Maximum
        else:
            output[global_i] = min_val  # Minimum


์ด ํ’€์ด๋Š” ์ด์ค‘ ์ถ”์ ๊ณผ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ณ ๊ธ‰ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]
    min_val = current_val  # ์ตœ์†Ÿ๊ฐ’์„ ๋ณ„๋„๋กœ ์ถ”์ 

    # max์™€ min ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ (log_2(WARP_SIZE) ๋‹จ๊ณ„)
    offset = WARP_SIZE // 2
    while offset > 0:
        neighbor_val = shuffle_xor(current_val, offset)
        current_val = max(current_val, neighbor_val)    # Max ๋ฆฌ๋•์…˜

        min_neighbor_val = shuffle_xor(min_val, offset)
        min_val = min(min_val, min_neighbor_val)        # Min ๋ฆฌ๋•์…˜

        offset //= 2

    # ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    if lane % 2 == 0:
        output[global_i] = current_val  # ์ง์ˆ˜ ๋ ˆ์ธ: ์ตœ๋Œ“๊ฐ’
    else:
        output[global_i] = min_val      # ํ™€์ˆ˜ ๋ ˆ์ธ: ์ตœ์†Ÿ๊ฐ’

์ด์ค‘ ๋ฆฌ๋•์…˜ ์‹คํ–‰ ์ถ”์  (4-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: current_val=3, min_val=3
  Lane 1: current_val=1, min_val=1
  Lane 2: current_val=7, min_val=7
  Lane 3: current_val=2, min_val=2

1๋‹จ๊ณ„: shuffle_xor(current_val, 2)์™€ shuffle_xor(min_val, 2) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”2: max_neighbor=7, min_neighbor=7 โ†’ current_val=max(3,7)=7, min_val=min(3,7)=3
  Lane 1โ†”3: max_neighbor=2, min_neighbor=2 โ†’ current_val=max(1,2)=2, min_val=min(1,2)=1
  Lane 2โ†”0: max_neighbor=3, min_neighbor=3 โ†’ current_val=max(7,3)=7, min_val=min(7,3)=3
  Lane 3โ†”1: max_neighbor=1, min_neighbor=1 โ†’ current_val=max(2,1)=2, min_val=min(2,1)=1

2๋‹จ๊ณ„: shuffle_xor(current_val, 1)์™€ shuffle_xor(min_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 1โ†”0: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1
  Lane 2โ†”3: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 3โ†”2: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด current_val=7 (์ „์—ญ max)๊ณผ min_val=1 (์ „์—ญ min)์„ ๊ฐ€์ง

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๋ชจ๋“  WARP_SIZE์—์„œ ๋™์ž‘):

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = max(current_val, neighbor_val)

    min_neighbor_val = shuffle_xor(min_val, offset)
    min_val = min(min_val, min_neighbor_val)

    offset //= 2

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์กฐ๊ฑด๋ถ€ ๋””๋ฉ€ํ‹ฐํ”Œ๋ ‰์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์ค‘ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \begin{align} \text{max_result} &= \max_{i=0}^{n-1} \text{input}[i] \\ \text{min_result} &= \min_{i=0}^{n-1} \text{input}[i] \\ \text{output}[i] &= \text{lane_parity}(i) \; \text{?} \; \text{min_result} : \text{max_result} \end{align}\]

์ด์ค‘ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ๋…๋ฆฝ์  ๋ฆฌ๋•์…˜: Max์™€ min ๋ฆฌ๋•์…˜์€ ์ˆ˜ํ•™์ ์œผ๋กœ ๋…๋ฆฝ
  2. ๋ณ‘๋ ฌ ์‹คํ–‰: ๋‘˜ ๋‹ค ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  3. ํ†ต์‹  ๊ณต์œ : ๊ฐ™์€ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘์— ํ™œ์šฉ
  4. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ: ๋ ˆ์ธ ํ™€์ง์ด ์–ด๋–ค ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ• ์ง€ ๊ฒฐ์ •

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ํ†ต์‹  ๋‹จ๊ณ„: \(\log_2(\text{WARP_SIZE})\) (๋‹จ์ผ ๋ฆฌ๋•์…˜๊ณผ ๋™์ผ)
  • ๋‹จ๊ณ„๋‹น ์—ฐ์‚ฐ: ๋‹จ์ผ ๋ฆฌ๋•์…˜์˜ 1๊ฐœ ๋Œ€๋น„ 2๊ฐœ ์—ฐ์‚ฐ (max + min)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ ๋Œ€๋น„ ์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ 2๊ฐœ
  • ์ถœ๋ ฅ ์œ ์—ฐ์„ฑ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ฆฌ๋•์…˜ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๊ฐ€๋Šฅ

์š”์•ฝ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ†ตํ•ด ๋‹ค์Œ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด

  1. ํŽ˜์–ด ๊ตํ™˜ (shuffle_xor(value, 1)):

    • ์™„๋ฒฝํ•œ ์ธ์ ‘ ํŽ˜์–ด ์ƒ์„ฑ: (0,1), (2,3), (4,5), โ€ฆ
    • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ์˜ \(O(1)\) ๋ณต์žก๋„
    • ์ •๋ ฌ ๋„คํŠธ์›Œํฌ์™€ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜์˜ ๊ธฐ๋ฐ˜
  2. ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ (๋™์  offset: WARP_SIZE/2 โ†’ 1):

    • ๋กœ๊ทธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ์ˆœ์ฐจ์˜ \(O(n)\) ๋Œ€๋น„ \(O(\log n)\)
    • ๋ชจ๋“  ๊ฒฐํ•ฉ ์—ฐ์‚ฐ์— ์ ์šฉ ๊ฐ€๋Šฅ (max, min, sum ๋“ฑ)
    • ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์— ๊ฑธ์ณ ์ตœ์ ์˜ ๋ถ€ํ•˜ ๋ถ„์‚ฐ
  3. ์กฐ๊ฑด๋ถ€ ๋‹ค์ค‘ ๋ฆฌ๋•์…˜ (์ด์ค‘ ์ถ”์  + ๋ ˆ์ธ ํ™€์ง):

    • ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋™์‹œ์— ๋ณ‘๋ ฌ ์ˆ˜ํ–‰
    • ์Šค๋ ˆ๋“œ ํŠน์„ฑ์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    • ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†๋Š” ๊ณ ๊ธ‰ ์กฐ์ •

ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ†ต์ฐฐ

XOR ํ†ต์‹  ํŠน์„ฑ:

  • shuffle_xor(value, mask)๊ฐ€ ๋Œ€์นญ์ ์ด๊ณ  ๊ฒน์น˜์ง€ ์•Š๋Š” ํŽ˜์–ด๋ฅผ ์ƒ์„ฑ
  • ๊ฐ ๋งˆ์Šคํฌ๊ฐ€ ๊ณ ์œ ํ•œ ํ†ต์‹  ํ† ํด๋กœ์ง€๋ฅผ ์ƒ์„ฑ
  • ์ด์ง„ XOR ํŒจํ„ด์—์„œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๊ฐ€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋„์ถœ

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„:

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = operation(current_val, neighbor_val)
    offset //= 2

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ณต์žก๋„: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ \(O(\log n)\)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”

์‹ค์šฉ์  ํ™œ์šฉ

์ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ํ•ฉ๊ณ„, max, min, ๋…ผ๋ฆฌ ์—ฐ์‚ฐ
  • ๋ˆ„์  ํ•ฉ/์Šค์บ” ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ๋ณ‘๋ ฌ ์ •๋ ฌ
  • FFT ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ํ•ฉ์„ฑ๊ณฑ
  • Bitonic ์ •๋ ฌ: ๋ณ‘๋ ฌ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ
  • ๊ทธ๋ž˜ํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ํŠธ๋ฆฌ ์ˆœํšŒ์™€ ์—ฐ๊ฒฐ์„ฑ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์กฐ์ •์„ ์šฐ์•„ํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ํšจ์œจ์ ์œผ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.