warp.shuffle_xor() ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์—์„œ๋Š” shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๋‚ด์— ์ •๊ตํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด์„ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜, ์ •๋ ฌ ๋„คํŠธ์›Œํฌ, ๊ณ ๊ธ‰ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: shuffle_xor() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋ฉฐ, ์›Œํ”„ ํฌ๊ธฐ์— ๋Œ€ํ•ด \(O(\log n)\) ๋ณต์žก๋„๋กœ ํ™•์žฅ๋˜๋Š” ํšจ์œจ์ ์ธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋ž€? ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค์ด ์ธ๋ฑ์Šค์˜ XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•˜๋Š” ํ†ต์‹  ํ† ํด๋กœ์ง€์ž…๋‹ˆ๋‹ค. ์ด๋ฆ„์€ ์‹œ๊ฐ์ ์œผ๋กœ ๊ทธ๋ ธ์„ ๋•Œ ๋‚˜๋น„ ๋‚ ๊ฐœ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ์—ฐ๊ฒฐ ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋„คํŠธ์›Œํฌ๋Š” \(O(\log n)\) ํ†ต์‹  ๋ณต์žก๋„๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— FFT, bitonic ์ •๋ ฌ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ฐ™์€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_xor()์„ ํ™œ์šฉํ•œ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€
  • \(O(\log n)\) ๋ณต์žก๋„์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ๊ณ ๊ธ‰ ์กฐ์ •์„ ์œ„ํ•œ ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ

shuffle_xor ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{shuffle_xor}(\text{value}, \text{mask}) = \text{value_from_lane}(\text{lane_id} \oplus \text{mask})\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ์กฐ์ • ์—†์ด ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜๊ณผ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ๊ธฐ๋ณธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)

shuffle_xor ๊ฐœ๋…

๊ธฐ์กด ํŽ˜์–ด ๊ตํ™˜ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ณผ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”
shared_memory[lane] = input[global_i]
barrier()
if lane % 2 == 0:
    partner = lane + 1
else:
    partner = lane - 1
if partner < WARP_SIZE:
    swapped_val = shared_memory[partner]

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”
  • ๋ณต์žกํ•œ ๋กœ์ง: ์ˆ˜๋™ ํŒŒํŠธ๋„ˆ ๊ณ„์‚ฐ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ๋‚ฎ์€ ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ํ†ต์‹ ์„ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•จ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜๋ฉด ํŽ˜์–ด ๊ตํ™˜์ด ์šฐ์•„ํ•ด์ง‘๋‹ˆ๋‹ค:

# ๋ฒ„ํ„ฐํ”Œ๋ผ์ด XOR ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”
current_val = input[global_i]
swapped_val = shuffle_xor(current_val, 1)  # 1๊ณผ XORํ•˜๋ฉด ํŽ˜์–ด๊ฐ€ ์ƒ์„ฑ๋จ
output[global_i] = swapped_val

shuffle_xor์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ชจ๋“  ๋ ˆ์ธ์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์ธ์ ‘ ํŽ˜์–ด ๊ฐ„ ๊ฐ’์„ ๊ตํ™˜ํ•˜๋Š” ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: XOR ํŒจํ„ด์œผ๋กœ ์ธ์ ‘ ํŽ˜์–ด๋ฅผ ๋งŒ๋“ค์–ด ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \text{input}[i \oplus 1]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [0, 1, 2, 3, 4, 5, 6, 7, ...]์„ ํŽ˜์–ด [1, 0, 3, 2, 5, 4, 7, 6, ...]์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๊ฐ ํŽ˜์–ด (i, i+1)์ด XOR ํ†ต์‹ ์œผ๋กœ ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค.


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p26/p26.mojo

ํŒ

1. shuffle_xor ์ดํ•ดํ•˜๊ธฐ

shuffle_xor(value, mask) ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ๋งˆ์Šคํฌ๋งŒํผ ์ฐจ์ด๋‚˜๋Š” ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์Šคํฌ ๊ฐ’์œผ๋กœ ๋ ˆ์ธ ID๋ฅผ XORํ–ˆ์„ ๋•Œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํƒ๊ตฌํ•  ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ๋ ˆ์ธ 1์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?

ํžŒํŠธ: ์ฒ˜์Œ ๋ช‡ ๊ฐœ์˜ ๋ ˆ์ธ ID์— ๋Œ€ํ•ด XOR ์—ฐ์‚ฐ์„ ์ง์ ‘ ํ•ด๋ณด๋ฉด ํŽ˜์–ด๋ง ํŒจํ„ด์„ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2. XOR ํŽ˜์–ด ํŒจํ„ด

๋ ˆ์ธ ID์˜ ์ด์ง„ ํ‘œํ˜„๊ณผ ์ตœํ•˜์œ„ ๋น„ํŠธ๋ฅผ ๋’ค์ง‘์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

๊ณ ๋ คํ•  ์งˆ๋ฌธ:

  • ์ง์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ํ™€์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ์™œ ์ด๊ฒƒ์ด ์™„๋ฒฝํ•œ ํŽ˜์–ด๋ฅผ ๋งŒ๋“œ๋‚˜์š”?

3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

shuffle_down()๊ณผ ๋‹ฌ๋ฆฌ shuffle_xor() ์—ฐ์‚ฐ์€ ์›Œํ”„ ๊ฒฝ๊ณ„ ๋‚ด์—์„œ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ž‘์€ ๋งˆ์Šคํฌ๋กœ์˜ XOR์ด ์ ˆ๋Œ€๋กœ ๋ฒ”์œ„ ๋ฐ–์˜ ๋ ˆ์ธ ID๋ฅผ ๋งŒ๋“ค์ง€ ์•Š๋Š” ์ด์œ ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ์œ ํšจํ•œ ๋ ˆ์ธ ID๋ฅผ 1๊ณผ XORํ–ˆ์„ ๋•Œ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ๋ ˆ์ธ ID๋Š” ์–ผ๋งˆ์ธ๊ฐ€์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜ ํ…Œ์ŠคํŠธ:

pixi run p26 --pair-swap
pixi run -e amd p26 --pair-swap
pixi run -e apple p26 --pair-swap
uv run poe p26 --pair-swap

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
expected: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
โœ… Butterfly pair swap test passed!

์†”๋ฃจ์…˜

def butterfly_pair_swap[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
    Each thread exchanges its value with its XOR-1 neighbor, creating pairs: (0,1), (2,3), (4,5), etc.
    Uses shuffle_xor(val, 1) to swap values within each pair.
    This is the foundation of butterfly network communication patterns.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var current_val = input[global_i]

        # Exchange with XOR-1 neighbor using butterfly pattern
        # Lane 0 exchanges with lane 1, lane 2 with lane 3, etc.
        var swapped_val = shuffle_xor(current_val, 1)

        # For demonstration, we'll store the swapped value
        # In real applications, this might be used for sorting, reduction, etc.
        output[global_i] = swapped_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด XOR ํ†ต์‹  ํŒจํ„ด์„ ํ†ตํ•ด ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ์–ด๋–ป๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]              # ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ฝ์Œ
    swapped_val = shuffle_xor(current_val, 1)  # XOR๋กœ ํŽ˜์–ด ๊ตํ™˜ ์ƒ์„ฑ

    # ๊ตํ™˜๋œ ๊ฐ’์„ ์ €์žฅ
    output[global_i] = swapped_val

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  Lane 0: current_val = input[0] = 0
  Lane 1: current_val = input[1] = 1
  Lane 2: current_val = input[2] = 2
  Lane 3: current_val = input[3] = 3
  ...
  Lane 31: current_val = input[31] = 31

์‚ฌ์ดํด 2: shuffle_xor(current_val, 1)์ด ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์‹คํ–‰
  Lane 0: Lane 1์—์„œ ์ˆ˜์‹  (0โŠ•1=1) โ†’ swapped_val = 1
  Lane 1: Lane 0์—์„œ ์ˆ˜์‹  (1โŠ•1=0) โ†’ swapped_val = 0
  Lane 2: Lane 3์—์„œ ์ˆ˜์‹  (2โŠ•1=3) โ†’ swapped_val = 3
  Lane 3: Lane 2์—์„œ ์ˆ˜์‹  (3โŠ•1=2) โ†’ swapped_val = 2
  ...
  Lane 30: Lane 31์—์„œ ์ˆ˜์‹  (30โŠ•1=31) โ†’ swapped_val = 31
  Lane 31: Lane 30์—์„œ ์ˆ˜์‹  (31โŠ•1=30) โ†’ swapped_val = 30

์‚ฌ์ดํด 3: ๊ฒฐ๊ณผ ์ €์žฅ
  Lane 0: output[0] = 1
  Lane 1: output[1] = 0
  Lane 2: output[2] = 3
  Lane 3: output[3] = 2
  ...

์ˆ˜ํ•™์  ํ†ต์ฐฐ: XOR ์†์„ฑ์„ ํ™œ์šฉํ•œ ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{XOR}(i, 1) = \begin{cases} i + 1 & \text{if } i \bmod 2 = 0 \\ i - 1 & \text{if } i \bmod 2 = 1 \end{cases}\]

shuffle_xor์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ์™„๋ฒฝํ•œ ๋Œ€์นญ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ํŽ˜์–ด์— ์ฐธ์—ฌ
  2. ์กฐ์ • ๋ถˆํ•„์š”: ๋ชจ๋“  ํŽ˜์–ด๊ฐ€ ๋™์‹œ์— ๊ตํ™˜
  3. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์›Œํ”„ ์ „์ฒด์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  4. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜)
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์—†์Œ)
  • ๋ณ‘๋ ฌ์„ฑ: WARP_SIZE๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ๊ตํ™˜
  • ํ™•์žฅ์„ฑ: ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๊ด€๊ณ„์—†์ด \(O(1)\) ๋ณต์žก๋„

2. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

๊ฐ์†Œํ•˜๋Š” offset์œผ๋กœ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด shuffle_xor์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ํ†ตํ•ด ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{max_result} = \max_{i=0}^{\small\text{WARP_SIZE}-1} \text{input}[i]\]

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŒจํ„ด: XOR ์˜คํ”„์…‹์„ WARP_SIZE/2์—์„œ 1๊นŒ์ง€ ์ ˆ๋ฐ˜์”ฉ ์ค„์—ฌ๊ฐ€๋ฉฐ, ํ†ต์‹  ๋ฒ”์œ„๊ฐ€ ๋‹จ๊ณ„๋งˆ๋‹ค ๋ฐ˜์œผ๋กœ ์ข์•„์ง€๋Š” ์ด์ง„ ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • 1๋‹จ๊ณ„: WARP_SIZE/2 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (์›Œํ”„ ์ „์ฒด๋ฅผ ํฌ๊ด„)
  • 2๋‹จ๊ณ„: WARP_SIZE/4 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (๋ฒ”์œ„๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ขํž˜)
  • 3๋‹จ๊ณ„: WARP_SIZE/8 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต
  • 4๋‹จ๊ณ„: offset = 1์ด ๋  ๋•Œ๊นŒ์ง€ ๊ณ„์† ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ž„

\(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„๋ฅผ ๊ฑฐ์น˜๋ฉด ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ–๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

def butterfly_parallel_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets starting from WARP_SIZE/2 down to 1.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm that works
    for any WARP_SIZE (32, 64, etc.).
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    # FILL ME IN (roughly 7 lines)


ํŒ

1. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ์ดํ•ดํ•˜๊ธฐ

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ ์ด์ง„ ํŠธ๋ฆฌ ํ†ต์‹  ํŒจํ„ด์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„์—์„œ ๋ฌธ์ œ ํฌ๊ธฐ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ์ตœ๋Œ€ ๋ฒ”์œ„๋ฅผ ์ปค๋ฒ„ํ•˜๋ ค๋ฉด ์‹œ์ž‘ offset์ด ์–ผ๋งˆ์—ฌ์•ผ ํ•˜๋‚˜์š”?
  • ๋‹จ๊ณ„ ์‚ฌ์ด์— ์˜คํ”„์…‹์„ ์–ด๋–ป๊ฒŒ ๋ณ€๊ฒฝํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ์–ธ์ œ ๋ฆฌ๋•์…˜์„ ๋ฉˆ์ถฐ์•ผ ํ•˜๋‚˜์š”?

ํžŒํŠธ: โ€œ๋ฒ„ํ„ฐํ”Œ๋ผ์ดโ€œ๋ผ๋Š” ์ด๋ฆ„์€ ํ†ต์‹  ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ•ฉ๋‹ˆ๋‹ค - ์ž‘์€ ์˜ˆ์ œ์— ๋Œ€ํ•ด ์ง์ ‘ ๊ทธ๋ ค๋ณด์„ธ์š”.

2. XOR ๋ฆฌ๋•์…˜ ํŠน์„ฑ

XOR์€ ๊ฐ ๋‹จ๊ณ„์—์„œ ๊ฒน์น˜์ง€ ์•Š๋Š” ํ†ต์‹  ํŽ˜์–ด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์—์„œ ์™œ ์ค‘์š”ํ•œ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ์„œ๋กœ ๋‹ค๋ฅธ ์˜คํ”„์…‹์œผ๋กœ์˜ XOR์ด ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ ํ†ต์‹  ํŒจํ„ด์„ ๋งŒ๋“œ๋‚˜์š”?
  • ๊ฐ™์€ ๋‹จ๊ณ„์—์„œ ๋ ˆ์ธ๋“ค์ด ์™œ ์„œ๋กœ ๊ฐ„์„ญํ•˜์ง€ ์•Š๋‚˜์š”?
  • XOR์ด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์— ํŠนํžˆ ์ ํ•ฉํ•œ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

3. ์ตœ๋Œ“๊ฐ’ ๋ˆ„์ 

๊ฐ ๋ ˆ์ธ์€ ์ž์‹ ์˜ โ€œ์˜์—ญโ€œ์—์„œ ์ตœ๋Œ“๊ฐ’์˜ ์ง€์‹์„ ์ ์ง„์ ์œผ๋กœ ์Œ“์•„๊ฐ€์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ:

  • ์ž์‹ ์˜ ๊ฐ’์œผ๋กœ ์‹œ์ž‘
  • ๊ฐ ๋‹จ๊ณ„์—์„œ ์ด์›ƒ์˜ ๊ฐ’๊ณผ ๋น„๊ต
  • ์ตœ๋Œ“๊ฐ’์„ ์œ ์ง€ํ•˜๊ณ  ๊ณ„์† ์ง„ํ–‰

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๊ฐ ๋‹จ๊ณ„ ํ›„, โ€œ์ง€์‹์˜ ์˜์—ญโ€œ์ด ๋‘ ๋ฐฐ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.

  • ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„ ํ›„: ๊ฐ ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ์•Œ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

4. ์ด ํŒจํ„ด์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ 

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ \(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„ ํ›„์— ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  • ๋ชจ๋“  ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์˜ ๊ฐ’์„ ๊ฐ„์ ‘์ ์œผ๋กœ ํ™•์ธ
  • ์ค‘๋ณต ํ†ต์‹  ์—†์Œ: ๊ฐ ํŽ˜์–ด๊ฐ€ ๋‹จ๊ณ„๋‹น ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ๊ตํ™˜
  • ์ตœ์  ๋ณต์žก๋„: \(O(n)\) ์ˆœ์ฐจ ๋น„๊ต ๋Œ€์‹  \(O(\log n)\) ๋‹จ๊ณ„

์ถ”์  ์˜ˆ์ œ (4๊ฐœ ๋ ˆ์ธ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ: Lane 0=3, Lane 1=1, Lane 2=7, Lane 3=2

1๋‹จ๊ณ„ (offset=2): 0 โ†” 2, 1 โ†” 3
  Lane 0: max(3, 7) = 7
  Lane 1: max(1, 2) = 2
  Lane 2: max(7, 3) = 7
  Lane 3: max(2, 1) = 2

2๋‹จ๊ณ„ (offset=1): 0 โ†” 1, 2 โ†” 3
  Lane 0: max(7, 2) = 7
  Lane 1: max(2, 7) = 7
  Lane 2: max(7, 2) = 7
  Lane 3: max(2, 7) = 7

๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’ = 7์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --parallel-max
pixi run -e amd p26 --parallel-max
uv run poe p26 --parallel-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
expected: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
โœ… Butterfly parallel max test passed!

์†”๋ฃจ์…˜

def butterfly_parallel_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets (16, 8, 4, 2, 1) to perform tree-based reduction.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var max_val = input[global_i]

        # Butterfly reduction tree: dynamic for any WARP_SIZE (32, 64, etc.)
        # Start with half the warp size and reduce by half each step
        var offset = WARP_SIZE // 2
        while offset > 0:
            max_val = max(max_val, shuffle_xor(max_val, UInt32(offset)))
            offset //= 2

        # All threads now have the maximum value across the entire warp
        output[global_i] = max_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด \(O(\log n)\) ๋ณต์žก๋„์˜ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ๋ฅผ ์–ด๋–ป๊ฒŒ ์ƒ์„ฑํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    max_val = input[global_i]  # ๋กœ์ปฌ ๊ฐ’์œผ๋กœ ์‹œ์ž‘

    # ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ: ๋ชจ๋“  WARP_SIZE์— ๋™์ ์œผ๋กœ ๋Œ€์‘
    offset = WARP_SIZE // 2
    while offset > 0:
        max_val = max(max_val, shuffle_xor(max_val, offset))
        offset //= 2

    output[global_i] = max_val  # ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์‹คํ–‰ ์ถ”์  (8-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [0,2,4,6,8,10,12,1000]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: max_val = 0,    Lane 1: max_val = 2
  Lane 2: max_val = 4,    Lane 3: max_val = 6
  Lane 4: max_val = 8,    Lane 5: max_val = 10
  Lane 6: max_val = 12,   Lane 7: max_val = 1000

1๋‹จ๊ณ„: shuffle_xor(max_val, 4) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”4: max(0,8)=8,     Lane 1โ†”5: max(2,10)=10
  Lane 2โ†”6: max(4,12)=12,   Lane 3โ†”7: max(6,1000)=1000
  Lane 4โ†”0: max(8,0)=8,     Lane 5โ†”1: max(10,2)=10
  Lane 6โ†”2: max(12,4)=12,   Lane 7โ†”3: max(1000,6)=1000

2๋‹จ๊ณ„: shuffle_xor(max_val, 2) - 1/4 ๊ตํ™˜
  Lane 0โ†”2: max(8,12)=12,   Lane 1โ†”3: max(10,1000)=1000
  Lane 2โ†”0: max(12,8)=12,   Lane 3โ†”1: max(1000,10)=1000
  Lane 4โ†”6: max(8,12)=12,   Lane 5โ†”7: max(10,1000)=1000
  Lane 6โ†”4: max(12,8)=12,   Lane 7โ†”5: max(1000,10)=1000

3๋‹จ๊ณ„: shuffle_xor(max_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max(12,1000)=1000,  Lane 1โ†”0: max(1000,12)=1000
  Lane 2โ†”3: max(12,1000)=1000,  Lane 3โ†”2: max(1000,12)=1000
  Lane 4โ†”5: max(12,1000)=1000,  Lane 5โ†”4: max(1000,12)=1000
  Lane 6โ†”7: max(12,1000)=1000,  Lane 7โ†”6: max(1000,12)=1000

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์˜ max_val = 1000

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์œผ๋กœ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ์ž๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Reduce}(\oplus, [a_0, a_1, \ldots, a_{n-1}]) = a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}\]

์—ฌ๊ธฐ์„œ \(\oplus\)๋Š” max ์—ฐ์‚ฐ์ด๋ฉฐ, ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ์ตœ์  \(O(\log n)\) ๋ณต์žก๋„๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ๋กœ๊ทธ ๋ณต์žก๋„: ์ˆœ์ฐจ ๋ฆฌ๋•์…˜์˜ \(O(n)\)์— ๋น„ํ•ด \(O(\log n)\)
  2. ์™„๋ฒฝํ•œ ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋‹จ๊ณ„์—์„œ ๋™๋“ฑํ•˜๊ฒŒ ์ฐธ์—ฌ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ ์—†์Œ: ์ˆœ์ˆ˜ ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ํ†ต์‹ 
  4. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์— ์ง์ ‘ ๋งคํ•‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ๋‹จ๊ณ„ ์ˆ˜: \(\log_2(\text{WARP_SIZE})\) (์˜ˆ: 32-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 5๋‹จ๊ณ„, 64-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 6๋‹จ๊ณ„)
  • ๋‹จ๊ณ„๋‹น ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜ + ๋น„๊ต)
  • ์ด ์ง€์—ฐ ์‹œ๊ฐ„: ์ˆœ์ฐจ ๋ฐฉ์‹์˜ \((\text{WARP_SIZE}-1)\) ์‚ฌ์ดํด ๋Œ€๋น„ \(\log_2(\text{WARP_SIZE})\) ์‚ฌ์ดํด
  • ๋ณ‘๋ ฌ์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „์ฒด์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์ด ํ™œ์„ฑ ์ƒํƒœ

3. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE_2 = 64 (๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: BLOCKS_PER_GRID_2 = (2, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

์ง์ˆ˜ ๋ ˆ์ธ์€ ์ตœ๋Œ“๊ฐ’์„, ํ™€์ˆ˜ ๋ ˆ์ธ์€ ์ตœ์†Ÿ๊ฐ’์„ ์ €์žฅํ•˜๋Š” ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’ ๋ชจ๋‘์— ๋Œ€ํ•ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \begin{cases} \max_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if} i \bmod 2 = 0 \\ \min_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 1 \end{cases}\]

์ด์ค‘ ๋ฆฌ๋•์…˜ ํŒจํ„ด: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•œ ํ›„, ๋ ˆ์ธ ID ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ๋ณต์žกํ•œ ๋‹ค์ค‘ ๊ฐ’ ๋ฆฌ๋•์…˜์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = row_major[SIZE_2]()
comptime LayoutType_2 = type_of(layout_2)


def butterfly_conditional_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = lane_id()

    if global_i < size:
        var current_val = input[global_i]
        var min_val = current_val

        # FILL ME IN (roughly 11 lines)


ํŒ

1. ์ด์ค‘ ์ถ”์  ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ํผ์ฆ์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ฆฌ๋•์…˜ ๊ณผ์ •์—์„œ ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋™์‹œ์— ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ์–ด๋–ค ๋ณ€์ˆ˜๋ฅผ ์ถ”์ ํ•ด์•ผ ํ•˜๋‚˜์š”?

2. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ ๋กœ์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์™„๋ฃŒํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ๋ ˆ์ธ์ด ์ง์ˆ˜์ธ์ง€ ํ™€์ˆ˜์ธ์ง€ ์–ด๋–ป๊ฒŒ ํŒ๋ณ„ํ•˜๋‚˜์š”?
  • ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ๋Œ“๊ฐ’์„, ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ์†Ÿ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ๋ ˆ์ธ ID์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•˜๋‚˜์š”?

3. min๊ณผ max ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ๊ณผ์ œ์˜ ํ•ต์‹ฌ์€ ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ min๊ณผ max๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ณ‘๋ ฌ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • min๊ณผ max์— ๋ณ„๋„์˜ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•œ๊ฐ€์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ์ด์›ƒ ๊ฐ’์„ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์™„๋ฃŒ๋˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?

4. ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ณ ๋ ค์‚ฌํ•ญ

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฆฌ๋•์…˜ ๋ฒ”์œ„์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ค‘์š”ํ•œ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๊ฐ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์˜ ๋ฒ”์œ„๋Š” ์–ด๋””๊นŒ์ง€์ธ๊ฐ€์š”?
  • ๋ธ”๋ก ๊ตฌ์กฐ๊ฐ€ ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
  • ์ „์—ญ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”, ๋ธ”๋ก๋ณ„ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --conditional-max
pixi run -e amd p26 --conditional-max
uv run poe p26 --conditional-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE_2:  64
output: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
expected: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
โœ… Butterfly conditional max test passed!

์†”๋ฃจ์…˜

def butterfly_conditional_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, Layout2Type, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = lane_id()

    if global_i < size:
        var current_val = input[global_i]
        var min_val = current_val

        # Butterfly reduction for both maximum and minimum: dynamic for any WARP_SIZE
        var offset = WARP_SIZE // 2
        while offset > 0:
            var neighbor_val = shuffle_xor(current_val, UInt32(offset))
            current_val = max(current_val, neighbor_val)

            var min_neighbor_val = shuffle_xor(min_val, UInt32(offset))
            min_val = min(min_val, min_neighbor_val)

            offset //= 2

        # Conditional output: max for even lanes, min for odd lanes
        if lane % 2 == 0:
            output[global_i] = current_val  # Maximum
        else:
            output[global_i] = min_val  # Minimum


์ด ํ’€์ด๋Š” ์ด์ค‘ ์ถ”์ ๊ณผ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ณ ๊ธ‰ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]
    min_val = current_val  # ์ตœ์†Ÿ๊ฐ’์„ ๋ณ„๋„๋กœ ์ถ”์ 

    # max์™€ min ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ (log_2(WARP_SIZE) ๋‹จ๊ณ„)
    offset = WARP_SIZE // 2
    while offset > 0:
        neighbor_val = shuffle_xor(current_val, offset)
        current_val = max(current_val, neighbor_val)    # Max ๋ฆฌ๋•์…˜

        min_neighbor_val = shuffle_xor(min_val, offset)
        min_val = min(min_val, min_neighbor_val)        # Min ๋ฆฌ๋•์…˜

        offset //= 2

    # ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    if lane % 2 == 0:
        output[global_i] = current_val  # ์ง์ˆ˜ ๋ ˆ์ธ: ์ตœ๋Œ“๊ฐ’
    else:
        output[global_i] = min_val      # ํ™€์ˆ˜ ๋ ˆ์ธ: ์ตœ์†Ÿ๊ฐ’

์ด์ค‘ ๋ฆฌ๋•์…˜ ์‹คํ–‰ ์ถ”์  (4-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: current_val=3, min_val=3
  Lane 1: current_val=1, min_val=1
  Lane 2: current_val=7, min_val=7
  Lane 3: current_val=2, min_val=2

1๋‹จ๊ณ„: shuffle_xor(current_val, 2)์™€ shuffle_xor(min_val, 2) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”2: max_neighbor=7, min_neighbor=7 โ†’ current_val=max(3,7)=7, min_val=min(3,7)=3
  Lane 1โ†”3: max_neighbor=2, min_neighbor=2 โ†’ current_val=max(1,2)=2, min_val=min(1,2)=1
  Lane 2โ†”0: max_neighbor=3, min_neighbor=3 โ†’ current_val=max(7,3)=7, min_val=min(7,3)=3
  Lane 3โ†”1: max_neighbor=1, min_neighbor=1 โ†’ current_val=max(2,1)=2, min_val=min(2,1)=1

2๋‹จ๊ณ„: shuffle_xor(current_val, 1)์™€ shuffle_xor(min_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 1โ†”0: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1
  Lane 2โ†”3: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 3โ†”2: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด current_val=7 (์ „์—ญ max)๊ณผ min_val=1 (์ „์—ญ min)์„ ๊ฐ€์ง

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๋ชจ๋“  WARP_SIZE์—์„œ ๋™์ž‘):

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = max(current_val, neighbor_val)

    min_neighbor_val = shuffle_xor(min_val, offset)
    min_val = min(min_val, min_neighbor_val)

    offset //= 2

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์กฐ๊ฑด๋ถ€ ๋””๋ฉ€ํ‹ฐํ”Œ๋ ‰์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์ค‘ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \begin{align} \text{max_result} &= \max_{i=0}^{n-1} \text{input}[i] \\ \text{min_result} &= \min_{i=0}^{n-1} \text{input}[i] \\ \text{output}[i] &= \text{lane_parity}(i) \; \text{?} \; \text{min_result}: \text{max_result} \end{align}\]

์ด์ค‘ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ๋…๋ฆฝ์  ๋ฆฌ๋•์…˜: Max์™€ min ๋ฆฌ๋•์…˜์€ ์ˆ˜ํ•™์ ์œผ๋กœ ๋…๋ฆฝ
  2. ๋ณ‘๋ ฌ ์‹คํ–‰: ๋‘˜ ๋‹ค ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  3. ํ†ต์‹  ๊ณต์œ : ๊ฐ™์€ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘์— ํ™œ์šฉ
  4. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ: ๋ ˆ์ธ ํ™€์ง์ด ์–ด๋–ค ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ• ์ง€ ๊ฒฐ์ •

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ํ†ต์‹  ๋‹จ๊ณ„: \(\log_2(\text{WARP_SIZE})\) (๋‹จ์ผ ๋ฆฌ๋•์…˜๊ณผ ๋™์ผ)
  • ๋‹จ๊ณ„๋‹น ์—ฐ์‚ฐ: ๋‹จ์ผ ๋ฆฌ๋•์…˜์˜ 1๊ฐœ ๋Œ€๋น„ 2๊ฐœ ์—ฐ์‚ฐ (max + min)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ ๋Œ€๋น„ ์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ 2๊ฐœ
  • ์ถœ๋ ฅ ์œ ์—ฐ์„ฑ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ฆฌ๋•์…˜ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๊ฐ€๋Šฅ

์š”์•ฝ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ†ตํ•ด ๋‹ค์Œ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด

  1. ํŽ˜์–ด ๊ตํ™˜ (shuffle_xor(value, 1)):

    • ์™„๋ฒฝํ•œ ์ธ์ ‘ ํŽ˜์–ด ์ƒ์„ฑ: (0,1), (2,3), (4,5), โ€ฆ
    • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ์˜ \(O(1)\) ๋ณต์žก๋„
    • ์ •๋ ฌ ๋„คํŠธ์›Œํฌ์™€ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜์˜ ๊ธฐ๋ฐ˜
  2. ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ (๋™์  offset: WARP_SIZE/2 โ†’ 1):

    • ๋กœ๊ทธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ์ˆœ์ฐจ์˜ \(O(n)\) ๋Œ€๋น„ \(O(\log n)\)
    • ๋ชจ๋“  ๊ฒฐํ•ฉ ์—ฐ์‚ฐ์— ์ ์šฉ ๊ฐ€๋Šฅ (max, min, sum ๋“ฑ)
    • ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์— ๊ฑธ์ณ ์ตœ์ ์˜ ๋ถ€ํ•˜ ๋ถ„์‚ฐ
  3. ์กฐ๊ฑด๋ถ€ ๋‹ค์ค‘ ๋ฆฌ๋•์…˜ (์ด์ค‘ ์ถ”์  + ๋ ˆ์ธ ํ™€์ง):

    • ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋™์‹œ์— ๋ณ‘๋ ฌ ์ˆ˜ํ–‰
    • ์Šค๋ ˆ๋“œ ํŠน์„ฑ์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    • ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†๋Š” ๊ณ ๊ธ‰ ์กฐ์ •

ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ†ต์ฐฐ

XOR ํ†ต์‹  ํŠน์„ฑ:

  • shuffle_xor(value, mask)๊ฐ€ ๋Œ€์นญ์ ์ด๊ณ  ๊ฒน์น˜์ง€ ์•Š๋Š” ํŽ˜์–ด๋ฅผ ์ƒ์„ฑ
  • ๊ฐ ๋งˆ์Šคํฌ๊ฐ€ ๊ณ ์œ ํ•œ ํ†ต์‹  ํ† ํด๋กœ์ง€๋ฅผ ์ƒ์„ฑ
  • ์ด์ง„ XOR ํŒจํ„ด์—์„œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๊ฐ€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋„์ถœ

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„:

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = operation(current_val, neighbor_val)
    offset //= 2

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ณต์žก๋„: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ \(O(\log n)\)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”

์‹ค์šฉ์  ํ™œ์šฉ

์ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ํ•ฉ๊ณ„, max, min, ๋…ผ๋ฆฌ ์—ฐ์‚ฐ
  • ๋ˆ„์  ํ•ฉ/์Šค์บ” ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ๋ณ‘๋ ฌ ์ •๋ ฌ
  • FFT ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ํ•ฉ์„ฑ๊ณฑ
  • Bitonic ์ •๋ ฌ: ๋ณ‘๋ ฌ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ
  • ๊ทธ๋ž˜ํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ํŠธ๋ฆฌ ์ˆœํšŒ์™€ ์—ฐ๊ฒฐ์„ฑ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์กฐ์ •์„ ์šฐ์•„ํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ํšจ์œจ์ ์œผ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.