Multi-Block Coordination Basics

Overview

Welcome to your first cluster programming challenge! This section introduces the basic building blocks of inter-block coordination using the SM90+ cluster APIs.

The challenge: Implement a multi-block histogram algorithm in which 4 thread blocks coordinate to process different ranges of data and store their results in a shared output array.

Key learning: Learn the essential cluster synchronization pattern, cluster_arrive() → process → cluster_wait(), which extends the synchronization concepts you learned with barrier() in Puzzle 29.

The problem: multi-block histogram binning

Traditional single-block algorithms like the one in Puzzle 27 can only process data that fits within the thread capacity of one block (for example, 256 threads). For larger datasets that exceed the shared memory capacity covered in Puzzle 8, multiple blocks have to cooperate.

Your task: Implement a histogram in which each of 4 blocks processes a different data range, scales the values by a factor derived from its unique block rank, and coordinates with the other blocks using the synchronization patterns from Puzzle 29, so that the final results are read only after every block has finished processing.

Problem specification

Multi-block data distribution:

  • Block 0: processes elements 0-255, scaled by 1
  • Block 1: processes elements 256-511, scaled by 2
  • Block 2: processes elements 512-767, scaled by 3
  • Block 3: processes elements 768-1023, scaled by 4

Coordination requirements:

  1. Each block must signal completion with cluster_arrive()
  2. Every block must wait for the others with cluster_wait()
  3. The final output is a 4-element array holding each block's processed sum

Configuration

  • Problem size: SIZE = 1024 elements (1D array)
  • Block configuration: TPB = 256 threads per block (256, 1)
  • Grid configuration: CLUSTER_SIZE = 4 blocks per cluster (4, 1)
  • Data type: DType.float32
  • Memory layout: input Layout.row_major(SIZE), output Layout.row_major(CLUSTER_SIZE)

Thread block distribution:

  • Block 0: threads 0-255 → elements 0-255
  • Block 1: threads 0-255 → elements 256-511
  • Block 2: threads 0-255 → elements 512-767
  • Block 3: threads 0-255 → elements 768-1023
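The mapping above follows from the index formula used in the kernel, global_i = block_dim.x * block_idx.x + thread_idx.x. As a rough CPU-side check, the sketch below recomputes each block's element range and scale factor in Python. The helper names (global_index, block_range, data_scale) are hypothetical and exist only for this illustration; they are not part of the puzzle code.

```python
SIZE = 1024
TPB = 256
CLUSTER_SIZE = 4


def global_index(block_idx: int, thread_idx: int, block_dim: int = TPB) -> int:
    """Mirror of global_i = block_dim.x * block_idx.x + thread_idx.x."""
    return block_dim * block_idx + thread_idx


def block_range(block_idx: int) -> tuple[int, int]:
    """First and last element index (inclusive) touched by one block."""
    return global_index(block_idx, 0), global_index(block_idx, TPB - 1)


def data_scale(block_idx: int) -> float:
    """Scale factor used by the kernel: block index plus one."""
    return float(block_idx + 1)


for b in range(CLUSTER_SIZE):
    first, last = block_range(b)
    print(f"Block {b}: elements {first}-{last}, scaled by {data_scale(b):.0f}")
```

Because CLUSTER_SIZE * TPB equals SIZE, the four ranges cover the input exactly once with no overlap.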

Code to complete

fn cluster_coordination_basics[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Real cluster coordination using SM90+ cluster APIs."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Check what's happening with cluster ranks
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    shared_data = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # FIX: Use block_idx.x for data distribution instead of cluster rank
    # Each block should process different portions of the data
    var data_scale = Float32(
        block_id + 1
    )  # Use block_idx instead of cluster rank

    # Phase 1: Each block processes its portion
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Phase 2: Use cluster_arrive() for inter-block coordination
    # Signal this block has completed processing

    # FILL IN 1 line here

    # Block-level aggregation (only thread 0)
    if local_i == 0:
        # FILL IN 4 lines here
        ...

    # Wait for all blocks in cluster to complete

    # FILL IN 1 line here


View full file: problems/p34/p34.mojo

Tips

Block identification patterns

  • Use block_rank_in_cluster() to get the cluster rank (0-3)
  • Use Int(block_idx.x) for reliable block indexing in a grid launch
  • Scale the data processing by block position so that each block produces a distinct result

Shared memory coordination

  • Allocate shared memory with LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() (see the shared memory basics in Puzzle 8)
  • Scale the input by block_id + 1 so that each block applies a distinct scale factor
  • Use bounds checking when reading the input data (the guard pattern from Puzzle 3)

Cluster synchronization pattern

  1. Process: each block works on its own portion of the data
  2. Signal: cluster_arrive() announces that processing is complete
  3. Compute: block-local operations (reduction, aggregation)
  4. Wait: cluster_wait() waits until every block has completed
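The four steps above can be mimicked on a CPU to see how the non-blocking signal and the blocking wait fit together. The sketch below is only an analogy, not the GPU API: SplitBarrier is a hypothetical single-use stand-in built on Python's threading.Condition, one Python thread plays the role of a whole thread block, and the input is assumed to be all ones.

```python
import threading


class SplitBarrier:
    """Hypothetical single-use CPU stand-in for the arrive/wait pair."""

    def __init__(self, parties: int) -> None:
        self._parties = parties
        self._arrived = 0
        self._cond = threading.Condition()

    def arrive(self) -> None:
        # Non-blocking: record the arrival and return immediately.
        with self._cond:
            self._arrived += 1
            if self._arrived == self._parties:
                self._cond.notify_all()

    def wait(self) -> None:
        # Blocking: return only once every party has arrived.
        with self._cond:
            self._cond.wait_for(lambda: self._arrived == self._parties)


def run_cluster(num_blocks: int = 4, tpb: int = 256) -> list[float]:
    barrier = SplitBarrier(num_blocks)
    output = [0.0] * num_blocks

    def block(block_id: int) -> None:
        # 1. Process: scale this block's slice (assumed input: all ones).
        shared = [1.0 * (block_id + 1) for _ in range(tpb)]
        # 2. Signal: announce completion without blocking.
        barrier.arrive()
        # 3. Compute: block-local sum, written to this block's own slot.
        output[block_id] = sum(shared)
        # 4. Wait: block until every other "block" has arrived.
        barrier.wait()

    threads = [threading.Thread(target=block, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return output


print(run_cluster())
```

Splitting the barrier into two calls lets step 3 overlap with other blocks that are still finishing step 1, which a single combined barrier would not allow.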

Thread coordination within blocks

  • Use barrier() for intra-block synchronization before the cluster operations (the barrier concept from Puzzle 29)
  • Only thread 0 should write the final block result (the single-writer pattern from block programming)
  • Store the result in output[block_id] for reliable indexing

Running the code

pixi run p34 --coordination
uv run poe p34 --coordination

Expected output:

Testing Multi-Block Coordination
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Block coordination results:
  Block 0 : 127.5
  Block 1 : 255.0
  Block 2 : 382.5
  Block 3 : 510.0
✅ Multi-block coordination tests passed!

Success criteria:

  • All 4 blocks produce non-zero results
  • The results show the scaling pattern: Block 1 > Block 0, Block 2 > Block 1, and so on
  • There are no race conditions or coordination failures
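The first two criteria can be checked with plain arithmetic on the CPU. This section does not show the test input, so the sketch below assumes a hypothetical one, input[i] = (i mod 256) / 256, chosen only because its unscaled per-block sum is 127.5 and it therefore reproduces the numbers in the expected output. The real data in p34.mojo may differ.

```python
SIZE, TPB, CLUSTER_SIZE = 1024, 256, 4

# ASSUMPTION: hypothetical input, not taken from p34.mojo.
inp = [(i % TPB) / TPB for i in range(SIZE)]


def block_sums(data: list[float]) -> list[float]:
    """Per-block sum of the scaled slice, as the kernel computes it."""
    sums = []
    for block_id in range(CLUSTER_SIZE):
        scale = float(block_id + 1)
        start = block_id * TPB
        sums.append(sum(x * scale for x in data[start:start + TPB]))
    return sums


results = block_sums(inp)
for b, s in enumerate(results):
    print(f"  Block {b} : {s}")

# Success criteria 1 and 2 from the list above.
all_nonzero = all(s != 0.0 for s in results)
increasing = all(a < b for a, b in zip(results, results[1:]))
print("all non-zero:", all_nonzero, "| strictly increasing:", increasing)
```

The third criterion (no race conditions) cannot be shown by a sequential program like this one; it depends on the barrier() and cluster synchronization calls in the kernel.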

Solution

fn cluster_coordination_basics[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Real cluster coordination using SM90+ cluster APIs."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Check what's happening with cluster ranks
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    shared_data = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # FIX: Use block_idx.x for data distribution instead of cluster rank
    # Each block should process different portions of the data
    var data_scale = Float32(
        block_id + 1
    )  # Use block_idx instead of cluster rank

    # Phase 1: Each block processes its portion
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Phase 2: Use cluster_arrive() for inter-block coordination
    cluster_arrive()  # Signal this block has completed processing

    # Block-level aggregation (only thread 0)
    if local_i == 0:
        var block_sum: Float32 = 0.0
        for i in range(tpb):
            block_sum += shared_data[i][0]
        # FIX: Store result at block_idx position (guaranteed unique per block)
        output[block_id] = block_sum

    # Wait for all blocks in cluster to complete
    cluster_wait()


The cluster coordination solution demonstrates the basic multi-block synchronization pattern through a carefully designed two-phase approach:

Phase 1: Independent block processing

Thread and block identification:

global_i = block_dim.x * block_idx.x + thread_idx.x  # Global thread index
local_i = thread_idx.x                               # Local thread index within block
my_block_rank = Int(block_rank_in_cluster())         # Cluster rank (0-3)
block_id = Int(block_idx.x)                          # Block index for reliable addressing

Shared memory allocation and data processing:

  • Each block allocates its own shared memory workspace: LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
  • Scaling strategy: data_scale = Float32(block_id + 1) makes each block process its data differently
    • Block 0: ×1.0, Block 1: ×2.0, Block 2: ×3.0, Block 3: ×4.0
  • Bounds checking: if global_i < size: prevents out-of-bounds memory access
  • Data processing: shared_data[local_i] = input[global_i] * data_scale scales the input data per block

Intra-block synchronization:

  • barrier() guarantees that every thread in a block has finished loading its data before the block moves on
  • This prevents race conditions between the data loading and the cluster coordination that follows

Phase 2: Cluster coordination

Inter-block signaling:

  • cluster_arrive() signals that this block has completed its local processing phase
  • It is a non-blocking operation that registers completion with the cluster hardware

Local aggregation (thread 0 only):

if local_i == 0:
    var block_sum: Float32 = 0.0
    for i in range(tpb):
        block_sum += shared_data[i][0]  # Sum all elements in shared memory
    output[block_id] = block_sum        # Store result at unique block position

  • Only thread 0 performs the sum, which avoids race conditions
  • The result is stored in output[block_id], so each block writes to its own unique location

Final synchronization:

  • cluster_wait() blocks until every block in the cluster has signaled completion with cluster_arrive()
  • As a result, no block proceeds past this point before all blocks in the cluster have finished their processing phase

Key technical insights

Why use block_id instead of my_block_rank?

  • block_idx.x provides reliable grid-launch indexing (0, 1, 2, 3)
  • block_rank_in_cluster() can behave differently depending on the cluster configuration
  • Using block_id guarantees that each block gets a unique data portion and output position
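One way to see the difference is to compare the two indices for launches larger than a single cluster. The sketch below rests on an assumption that this section does not confirm for Mojo: in a 1D launch, clusters tile the grid in order, so a block's rank within its cluster is block_idx.x modulo the cluster size. The helpers rank_in_cluster and output_slots are hypothetical.

```python
CLUSTER_SIZE = 4


def rank_in_cluster(block_idx: int, cluster_size: int = CLUSTER_SIZE) -> int:
    # ASSUMPTION: 1D launch in which clusters tile the grid in order.
    return block_idx % cluster_size


def output_slots(num_blocks: int, use_rank: bool) -> list[int]:
    """Output index each block would write to under either scheme."""
    return [rank_in_cluster(b) if use_rank else b for b in range(num_blocks)]


one_cluster = output_slots(4, use_rank=True)    # grid used in this puzzle
two_clusters = output_slots(8, use_rank=True)   # ranks repeat per cluster
by_block_idx = output_slots(8, use_rank=False)  # always unique

print("1 cluster, by rank:      ", one_cluster)
print("2 clusters, by rank:     ", two_clusters)
print("2 clusters, by block_idx:", by_block_idx)
```

With the puzzle's single cluster of 4 blocks both schemes agree. Under the stated assumption, a grid with two clusters would have pairs of blocks writing to the same slot if the rank were used as the output index.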

Memory access pattern:

  • Global memory: each thread reads input[global_i] exactly once
  • Shared memory: used for intra-block communication and aggregation
  • Output memory: each block writes to output[block_id] exactly once

Synchronization hierarchy:

  1. barrier(): synchronizes the threads within each block (intra-block)
  2. cluster_arrive(): signals completion to the other blocks (inter-block, non-blocking)
  3. cluster_wait(): waits until all blocks have completed (inter-block, blocking)

Performance characteristics:

  • Compute complexity: O(TPB) per block for the local sum, O(1) for cluster coordination
  • Memory bandwidth: each input element is read only once, and inter-block communication is minimal
  • Scalability: the pattern extends to larger cluster sizes with minimal overhead
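The O(TPB) figure comes from thread 0 adding all TPB shared-memory values one after another. For comparison only, the sketch below counts sequential steps for that serial sum and for a pairwise tree reduction. The tree reduction is a different technique that the solution in this section does not use, and the input values are made up for the illustration.

```python
def serial_sum(values: list[float]) -> tuple[float, int]:
    """Sum as thread 0 does in the solution: one addition per element."""
    total, steps = 0.0, 0
    for v in values:
        total += v
        steps += 1
    return total, steps


def tree_sum(values: list[float]) -> tuple[float, int]:
    """Pairwise reduction; len(values) must be a power of two."""
    data = list(values)
    steps = 0
    stride = len(data) // 2
    while stride > 0:
        # On a GPU, the additions inside one stride would run in parallel.
        for i in range(stride):
            data[i] += data[i + stride]
        steps += 1
        stride //= 2
    return data[0], steps


values = [float(i % 4) for i in range(256)]  # made-up data, TPB = 256
print("serial:", serial_sum(values))
print("tree:  ", tree_sum(values))
```

Both functions return the same total; only the number of sequential steps differs (TPB for the serial loop, log2(TPB) for the tree).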

Understanding the pattern

The essential cluster coordination pattern follows a simple but powerful structure:

  1. Phase 1: each block independently processes its assigned portion of the data
  2. Signal: cluster_arrive() announces that processing is complete
  3. Phase 2: each block continues with block-local work (here, the per-block sum); anything that depends on other blocks' results has to come after cluster_wait()
  4. Synchronize: cluster_wait() makes sure every block has finished before execution continues

Next step: Ready for more advanced coordination? Continue to Cluster-Wide Collective Operations to learn how to extend the block.sum() pattern from Puzzle 27 to cluster scale, building on the warp-level reductions from Puzzle 24!