🧠 Advanced Cluster Algorithms

Overview

In this final challenge, you will combine every level of the GPU programming hierarchy (warp level from Puzzles 24-26, block level from Puzzle 27, and cluster coordination) to implement a sophisticated multi-stage algorithm that maximizes GPU utilization.

Challenge: implement a hierarchical cluster algorithm that uses warp-level optimization (elect_one_sync()), block-level aggregation, and cluster-level coordination as one unified pattern.

Key learning: the complete GPU programming stack, together with production-grade coordination patterns used in advanced compute workloads.

The problem: a multi-stage data processing pipeline

Real GPU algorithms often require hierarchical coordination, where different levels of the GPU hierarchy (warps from Puzzle 24, blocks from Puzzle 27, and clusters) each play a specialized role in a coordinated compute pipeline. This extends the multi-stage processing of Puzzle 29.

Task: implement a multi-stage algorithm that:

  1. Warp level: uses elect_one_sync() for efficient intra-warp coordination (SIMT execution)
  2. Block level: aggregates warp results using shared-memory coordination
  3. Cluster level: coordinates across blocks using the staged synchronization from Puzzle 29 (cluster_arrive() / cluster_wait())

Algorithm specification

Multi-stage processing pipeline:

  1. Stage 1 (warp level): each warp elects one thread to sum 32 consecutive elements
  2. Stage 2 (block level): aggregate all warp sums within each block
  3. Stage 3 (cluster level): coordinate across blocks with cluster_arrive() / cluster_wait()

Input: 1024 float values following the pattern (i % 50) * 0.02 for testing
Output: 4 block results demonstrating the effect of hierarchical processing
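As a sanity check, the three stages can be simulated on the host. The following is a plain Python sketch of the data flow only (no election or synchronization, and it omits the per-block scaling that the full solution applies, so the numbers differ from the final expected output):

```python
# Host-side Python simulation of the three processing stages (a sketch for
# intuition only; the real kernel runs these stages on the GPU).
SIZE, TPB, WARP_SIZE, CLUSTER_SIZE = 1024, 256, 32, 4

data = [(i % 50) * 0.02 for i in range(SIZE)]     # test input pattern

block_results = []
for block_id in range(CLUSTER_SIZE):              # one result per block
    block_start = block_id * TPB
    # Stage 1 (warp level): one sum per 32 consecutive elements
    warp_sums = [
        sum(data[block_start + w * WARP_SIZE:block_start + (w + 1) * WARP_SIZE])
        for w in range(TPB // WARP_SIZE)
    ]
    # Stage 2 (block level): aggregate the 8 warp sums
    block_results.append(sum(warp_sums))
# Stage 3 (cluster level) corresponds to synchronizing before results are read

print(block_results)
```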

Configuration

  • Problem size: SIZE = 1024 elements
  • Block configuration: TPB = 256 threads per block (256, 1)
  • Grid configuration: CLUSTER_SIZE = 4 blocks (4, 1)
  • Warp size: WARP_SIZE = 32 threads per warp (NVIDIA standard)
  • Warps per block: TPB / WARP_SIZE = 8 warps
  • Data type: DType.float32
  • Memory layout: input Layout.row_major(SIZE), output Layout.row_major(CLUSTER_SIZE)

Processing distribution:

  • Block 0: 256 threads → 8 warps → elements 0-255
  • Block 1: 256 threads → 8 warps → elements 256-511
  • Block 2: 256 threads → 8 warps → elements 512-767
  • Block 3: 256 threads → 8 warps → elements 768-1023
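The distribution above is pure index arithmetic; a small Python check (constants mirror this puzzle's configuration):

```python
# Each block of TPB threads covers TPB consecutive elements; splitting a
# block into warps of 32 gives the per-warp sub-ranges.
TPB, WARP_SIZE, CLUSTER_SIZE = 256, 32, 4

block_ranges = [(b * TPB, b * TPB + TPB - 1) for b in range(CLUSTER_SIZE)]
warps_per_block = TPB // WARP_SIZE

print(block_ranges)       # [(0, 255), (256, 511), (512, 767), (768, 1023)]
print(warps_per_block)    # 8
```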

Code to complete

fn advanced_cluster_patterns[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Advanced cluster programming using cluster masks and relaxed synchronization.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # FILL IN (roughly 26 lines)


View the full file: problems/p34/p34.mojo

Tips

Warp-level optimization pattern

Block-level aggregation strategy

  • After warp processing, aggregate all warp results (extending the block coordination from Puzzle 27)
  • Read from the elected positions: indices 0, 32, 64, 96, 128, 160, 192, 224
  • Iterate over the warp leaders with a for i in range(0, tpb, 32) loop (the pattern from reduction algorithms)
  • Only thread 0 should compute the final block sum (the single-writer pattern from barrier coordination)
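The elected read positions follow directly from that loop; in plain Python form:

```python
# Warp-leader positions: one entry per warp, stepping by the warp size.
tpb = 256
leader_positions = list(range(0, tpb, 32))
print(leader_positions)   # [0, 32, 64, 96, 128, 160, 192, 224]
```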

Cluster coordination flow

  1. Process: each block processes its data with hierarchical warp optimization
  2. Signal: announce local completion with cluster_arrive()
  3. Store: thread 0 writes the block result to the output
  4. Wait: block with cluster_wait() until all blocks have finished

Data scaling and bounds checking

  • Scale the input by Float32(block_id + 1) to create a distinct per-block pattern
  • Always check global_i < size before reading the input (the guard from Puzzle 3)
  • Use barrier() between processing stages within a block (the synchronization pattern)
  • Handle warp boundary conditions in loops carefully (a warp-programming consideration)

Advanced cluster APIs

The gpu.primitives.cluster module:

  • elect_one_sync(): warp-level thread election for efficient computation
  • cluster_arrive(): completion signal for staged cluster coordination
  • cluster_wait(): wait until all blocks reach the synchronization point
  • block_rank_in_cluster(): returns the block's unique identifier within the cluster

Hierarchical coordination pattern

This puzzle demonstrates a three-level coordination hierarchy:

Level 1: warp coordination (Puzzle 24)

Warp (32 threads) → elect_one_sync() → 1 elected thread → processes 32 elements

Level 2: block coordination (Puzzle 27)

Block (8 warps) → aggregate warp results → 1 block total

Level 3: cluster coordination (this puzzle)

Cluster (4 blocks) → cluster_arrive/wait → synchronized completion

Combined effect: 1024 threads → 32 warp leaders → 4 block results → coordinated cluster completion

Running the code

pixi run p34 --advanced
uv run poe p34 --advanced

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Advanced Cluster Algorithms
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Advanced cluster algorithm results:
  Block 0 : 122.799995
  Block 1 : 247.04001
  Block 2 : 372.72
  Block 3 : 499.83997
โœ… Advanced cluster patterns tests passed!
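These values can be reproduced on the host. The sketch below assumes the (i % 50) * 0.02 input pattern and the (block_id + 1) scaling applied in the solution; it uses Python doubles, so it matches the float32 kernel output only approximately:

```python
# Host-side re-derivation of the expected per-block results (not GPU code).
SIZE, TPB, CLUSTER_SIZE = 1024, 256, 4

data = [(i % 50) * 0.02 for i in range(SIZE)]

expected = []
for block_id in range(CLUSTER_SIZE):
    scale = block_id + 1                     # per-block scaling factor
    start = block_id * TPB
    expected.append(scale * sum(data[start:start + TPB]))

print(expected)   # approximately [122.8, 247.04, 372.72, 499.84]
```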

Success criteria:

  • Hierarchical scaling: the results reflect the multi-stage coordination
  • Warp optimization: elect_one_sync() reduces redundant computation
  • Cluster coordination: all blocks complete their processing successfully
  • Performance pattern: higher block IDs produce proportionally larger results

Solution

fn advanced_cluster_patterns[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Advanced cluster programming using cluster masks and relaxed synchronization.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    shared_data = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Compute cluster mask for advanced coordination
    # base_mask = cluster_mask_base()  # Requires cluster_shape parameter

    # FIX: Process data with block_idx-based scaling for guaranteed uniqueness
    var data_scale = Float32(block_id + 1)
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Advanced pattern: Use elect_one_sync for efficient coordination
    if elect_one_sync():  # Only one thread per warp does this work
        var warp_sum: Float32 = 0.0
        var warp_start = (local_i // 32) * 32  # Get warp start index
        for i in range(32):  # Sum across warp
            if warp_start + i < tpb:
                warp_sum += shared_data[warp_start + i][0]
        shared_data[local_i] = warp_sum

    barrier()

    # Use cluster_arrive for staged synchronization in sm90+
    cluster_arrive()

    # Only first thread in each block stores result
    if local_i == 0:
        var block_total: Float32 = 0.0
        for i in range(0, tpb, 32):  # Sum warp results
            if i < tpb:
                block_total += shared_data[i][0]
        output[block_id] = block_total

    # Wait for all blocks to complete their calculations in sm90+
    cluster_wait()


The advanced cluster patterns solution demonstrates a sophisticated three-level hierarchical optimization that combines warp, block, and cluster coordination to maximize GPU utilization:

Level 1: warp-level optimization (thread election)

Data preparation and scaling:

var data_scale = Float32(block_id + 1)  # Block-specific scaling factor
if global_i < size:
    shared_data[local_i] = input[global_i] * data_scale
else:
    shared_data[local_i] = 0.0  # Zero-pad for out-of-bounds
barrier()  # Ensure all threads complete data loading

Warp-level thread election:

if elect_one_sync():  # Hardware elects exactly 1 thread per warp
    var warp_sum: Float32 = 0.0
    var warp_start = (local_i // 32) * 32  # Calculate warp boundary
    for i in range(32):  # Process entire warp's data
        if warp_start + i < tpb:
            warp_sum += shared_data[warp_start + i][0]
    shared_data[local_i] = warp_sum  # Store result at elected thread's position

How the warp boundary calculation works:

  • Thread 37 (warp 1): warp_start = (37 // 32) * 32 = 1 * 32 = 32
  • Thread 67 (warp 2): warp_start = (67 // 32) * 32 = 2 * 32 = 64
  • Thread 199 (warp 6): warp_start = (199 // 32) * 32 = 6 * 32 = 192

Election pattern visualization (TPB=256, 8 warps):
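The same arithmetic as a standalone Python helper (the function name is illustrative, not part of the puzzle's API):

```python
def warp_start(local_i: int, warp_size: int = 32) -> int:
    # Integer division maps any thread index to its warp's first lane.
    return (local_i // warp_size) * warp_size

print(warp_start(37), warp_start(67), warp_start(199))   # 32 64 192
```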

Warp 0 (threads 0-31):   elect_one_sync() → Thread 0   processes elements 0-31
Warp 1 (threads 32-63):  elect_one_sync() → Thread 32  processes elements 32-63
Warp 2 (threads 64-95):  elect_one_sync() → Thread 64  processes elements 64-95
Warp 3 (threads 96-127): elect_one_sync() → Thread 96  processes elements 96-127
Warp 4 (threads 128-159): elect_one_sync() → Thread 128 processes elements 128-159
Warp 5 (threads 160-191): elect_one_sync() → Thread 160 processes elements 160-191
Warp 6 (threads 192-223): elect_one_sync() → Thread 192 processes elements 192-223
Warp 7 (threads 224-255): elect_one_sync() → Thread 224 processes elements 224-255

Level 2: block-level aggregation (warp-leader coordination)

Inter-warp synchronization:

barrier()  # Ensure all warps complete their elected computations

์›Œํ”„ ๋ฆฌ๋” ์ง‘๊ณ„ (์Šค๋ ˆ๋“œ 0๋งŒ):

if local_i == 0:
    var block_total: Float32 = 0.0
    for i in range(0, tpb, 32):  # Iterate through warp leader positions
        if i < tpb:
            block_total += shared_data[i][0]  # Sum warp results
    output[block_id] = block_total

Memory access pattern:

  • Thread 0 reads from shared_data[0], shared_data[32], shared_data[64], shared_data[96], shared_data[128], shared_data[160], shared_data[192], shared_data[224]
  • These positions hold the warp sums computed by the elected threads
  • Result: 8 warp sums → 1 block sum

Level 3: cluster-level staged synchronization

The staged synchronization approach:

cluster_arrive()  # Non-blocking: signal this block's completion
# ... Thread 0 computes and stores block result ...
cluster_wait()    # Blocking: wait for all blocks to complete

Why staged synchronization?

  • Calling cluster_arrive() before the final computation enables work overlap
  • A block can compute its own result while other blocks are still processing
  • cluster_wait() guarantees a deterministic completion order
  • More efficient than cluster_sync() for independent per-block computations

Advanced pattern characteristics

Hierarchical computation reduction:

  1. 256 threads → 8 elected threads (32× reduction per block)
  2. 8 warp sums → 1 block sum (8× reduction per block)
  3. 4 blocks → staged completion (synchronized termination)
  4. Overall efficiency: 256× reduction in redundant computation per block
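The reduction factors listed above follow from the configuration; a quick arithmetic check in Python (constants mirror this puzzle's setup):

```python
TPB, WARP_SIZE = 256, 32

elected_per_block = TPB // WARP_SIZE        # 256 threads -> 8 elected threads
warp_reduction = TPB // elected_per_block   # 32x fewer active summers per block
block_reduction = elected_per_block         # 8 warp sums -> 1 block sum
overall = warp_reduction * block_reduction  # combined 256x reduction

print(elected_per_block, warp_reduction, block_reduction, overall)   # 8 32 8 256
```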

Memory access optimization:

  • Level 1: coalesced reads from input[global_i], scaled writes to shared memory
  • Level 2: elected threads perform the warp-level aggregation (8 operations instead of 256)
  • Level 3: thread 0 performs the block-level aggregation (1 operation instead of 8)
  • Result: memory bandwidth usage is minimized through hierarchical reduction

Synchronization hierarchy:

  1. barrier(): intra-block thread synchronization (after data loading and after warp processing)
  2. cluster_arrive(): inter-block signal (non-blocking, enables work overlap)
  3. cluster_wait(): inter-block synchronization (blocking, guarantees completion order)

Why this is "advanced":

  • Multi-level optimization: combines warp, block, and cluster programming techniques
  • Hardware efficiency: leverages elect_one_sync() to optimize warp utilization
  • Staged coordination: uses advanced cluster APIs for flexible synchronization
  • Production quality: demonstrates patterns used in real GPU libraries

Practical performance benefits:

  • Reduced memory pressure: fewer threads access shared memory simultaneously
  • Better warp utilization: elected threads perform focused computation
  • Scalable coordination: staged synchronization handles larger cluster sizes
  • Algorithmic flexibility: a foundation for complex multi-stage processing pipelines

๋ณต์žก๋„ ๋ถ„์„:

  • ์›Œํ”„ ๋ ˆ๋ฒจ: ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๋‹น O(32) ์—ฐ์‚ฐ = ๋ธ”๋ก๋‹น ์ด O(256)
  • ๋ธ”๋ก ๋ ˆ๋ฒจ: ๋ธ”๋ก๋‹น O(8) ์ง‘๊ณ„ ์—ฐ์‚ฐ
  • ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: ๋ธ”๋ก๋‹น O(1) ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  • ์ „์ฒด: ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌํ™” ์ด์ ์„ ๊ฐ€์ง„ ์„ ํ˜• ๋ณต์žก๋„

The complete GPU hierarchy

Congratulations! By completing this puzzle, you have covered the complete GPU programming stack:

✅ Thread-level programming: individual execution units
✅ Warp-level programming: 32-thread SIMT coordination
✅ Block-level programming: multi-warp coordination with shared memory
✅ 🆕 Cluster-level programming: multi-block coordination with SM90+ APIs
✅ Coordinated multiple thread blocks using cluster synchronization primitives
✅ Scaled algorithms beyond single-block limits using the cluster APIs
✅ Implemented hierarchical algorithms combining warp + block + cluster coordination
✅ Leveraged next-generation GPU hardware through SM90+ cluster programming

์‹ค์ „ ์‘์šฉ

์ด ํผ์ฆ์˜ ๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ…:

  • ๋ฉ€ํ‹ฐ ๊ทธ๋ฆฌ๋“œ ๊ธฐ๋ฒ•: ๊ฐ ๋ ˆ๋ฒจ์ด ์„œ๋กœ ๋‹ค๋ฅธ ํ•ด์ƒ๋„์˜ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋„๋ฉ”์ธ ๋ถ„ํ•ด: ๋ฌธ์ œ์˜ ํ•˜์œ„ ๋„๋ฉ”์ธ์— ๊ฑธ์นœ ๊ณ„์ธต์  ์กฐ์ •
  • ๋ณ‘๋ ฌ ๋ฐ˜๋ณต๋ฒ•: ์›Œํ”„ ๋ ˆ๋ฒจ์˜ ๋กœ์ปฌ ์—ฐ์‚ฐ๊ณผ ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ์˜ ์ „์—ญ ํ†ต์‹ 

๋”ฅ๋Ÿฌ๋‹:

  • ๋ชจ๋ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ๋ชจ๋ธ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ํŒŒ์ดํ”„๋ผ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์—ฌ๋Ÿฌ ํŠธ๋žœ์Šคํฌ๋จธ ๋ ˆ์ด์–ด์— ๊ฑธ์นœ ๋‹จ๊ณ„์  ์ฒ˜๋ฆฌ
  • ๊ธฐ์šธ๊ธฐ ์ง‘๊ณ„: ๋ถ„์‚ฐ ํ•™์Šต ๋…ธ๋“œ์— ๊ฑธ์นœ ๊ณ„์ธต์  ๋ฆฌ๋•์…˜

Graphics and visualization:

  • Multi-pass rendering: staged processing for complex visual effects
  • Hierarchical culling: each level culls at a different granularity
  • Parallel geometry processing: coordinated transformation pipelines

๋‹ค์Œ ๋‹จ๊ณ„

์ด์ œ ์ตœ์‹  ํ•˜๋“œ์›จ์–ด์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ตœ์ฒจ๋‹จ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค!

๋” ๋งŽ์€ ๋„์ „์„ ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋‹ค๋ฅธ ๊ณ ๊ธ‰ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ฃผ์ œ๋ฅผ ํƒ๊ตฌํ•˜๊ณ , Puzzle 30-32์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๋ณต์Šตํ•˜๊ณ , NVIDIA ๋„๊ตฌ์˜ ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ์ ์šฉํ•˜๊ฑฐ๋‚˜, ์ด ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž์‹ ๋งŒ์˜ ์—ฐ์‚ฐ ์›Œํฌ๋กœ๋“œ๋ฅผ ๊ตฌ์ถ•ํ•ด ๋ณด์„ธ์š”!