block.broadcast() and Vector Normalization

Implement vector mean normalization by combining the block.sum and block.broadcast operations, demonstrating the complete block-level communication workflow. Each thread contributes to the mean calculation, then receives the broadcast mean and normalizes its own element, showing how block operations work together to solve a real parallel algorithm.

Key insight: The block.broadcast() operation enables one→all communication, completing the fundamental block communication patterns: reduction (all→one), scan (all→each), and broadcast (one→all).

Key concepts

In this puzzle, you'll learn:

  • Block-level broadcast with block.broadcast()
  • One→all communication patterns
  • Source thread specification and parameter control
  • A complete block operations workflow that combines multiple operations
  • Real-world algorithm implementation using coordinated block primitives

The algorithm demonstrates vector mean normalization: \[\Large \text{output}[i] = \frac{\text{input}[i]}{\frac{1}{N}\sum_{j=0}^{N-1} \text{input}[j]}\]

Each thread contributes to the mean calculation, then receives the broadcast mean and normalizes its own element.

Configuration

  • Vector size: SIZE = 128 elements
  • Data type: DType.float32
  • Block configuration: (128, 1) threads per block (TPB = 128)
  • Grid configuration: (1, 1) blocks per grid
  • Layout: Layout.row_major(SIZE) (1D row-major for both input and output)
  • Test data: values cycling through 1-8, so mean = 4.5
  • Expected output: normalized vector with mean = 1.0
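
As a quick host-side check of the configuration above, the following Python sketch rebuilds the test vector and confirms the stated sum and mean. The helper name make_test_vector is hypothetical; the only facts taken from the text are SIZE = 128 and values cycling through 1-8.

```python
SIZE = 128  # vector size from the configuration above


def make_test_vector(size: int) -> list[float]:
    """Hypothetical helper: values cycle 1, 2, ..., 8, 1, 2, ..."""
    return [float(i % 8 + 1) for i in range(size)]


values = make_test_vector(SIZE)
total = sum(values)
mean = total / len(values)
normalized = [v / mean for v in values]

print(total, mean)  # 576.0 4.5
```

Dividing every element by 4.5 is what should bring the output mean to 1.0.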

The challenge: coordinating block-wide computation and distribution

Traditional approaches to mean normalization require complex coordination:

# Sequential approach - does not exploit parallelism
total = sum(input_array)
mean = total / len(input_array)
output_array = [x / mean for x in input_array]

Problems with naive GPU parallelization:

  • Multiple kernel launches: computing the mean and normalizing each need a separate pass
  • Global memory round trip: the mean is stored to global memory and read back later
  • Synchronization complexity: barriers are needed between computation phases
  • Thread divergence: different threads perform different tasks

Complexity of the traditional GPU solution:

# Phase 1: reduction to get the sum (complex shared memory + barriers)
shared_sum[local_i] = my_value
barrier()
# Manual tree reduction requiring multiple barrier() calls...

# Phase 2: thread 0 computes the mean
if local_i == 0:
    mean = shared_sum[0] / size
    shared_mean[0] = mean

barrier()

# Phase 3: all threads read the mean and normalize
mean = shared_mean[0]  # everyone reads the same value
output[global_i] = my_value / mean
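
To make the elided "manual tree reduction" step concrete, here is a minimal CPU sketch in Python that models the shared-memory buffer as a list and counts where barriers would be needed. The function name tree_reduce_sum and the barrier counter are illustrative assumptions, not part of the puzzle code; the sketch assumes a power-of-two block size such as TPB = 128.

```python
def tree_reduce_sum(values: list[float]) -> tuple[float, int]:
    """Simulate a shared-memory tree reduction; return (sum, barrier count).

    Assumes len(values) is a power of two, as with TPB = 128.
    """
    shared = list(values)   # stands in for the shared-memory buffer
    barriers = 1            # barrier after the initial store
    stride = len(shared) // 2
    while stride > 0:
        # On a GPU, threads with local_i < stride run this step in parallel.
        for local_i in range(stride):
            shared[local_i] += shared[local_i + stride]
        barriers += 1       # each step must finish before the next begins
        stride //= 2
    return shared[0], barriers


values = [float(i % 8 + 1) for i in range(128)]
total, barriers = tree_reduce_sum(values)
print(total, barriers)  # 576.0 8
```

For 128 threads this model needs seven halving steps plus the initial store, and the mean hand-off in Phases 2 and 3 adds one more barrier on top.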

The advanced approach: block.sum() + block.broadcast() coordination

Transform the multi-step coordination into a concise block operations workflow:

Code to complete

Complete block operations workflow

Implement vector mean normalization using the full block operations toolkit:


comptime vector_layout = Layout.row_major(SIZE)


fn block_normalize_vector[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all → one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one → all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Step 1: Each thread loads its element

    # FILL IN (roughly 3 lines)

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)

    # FILL IN (1 line)

    # Step 3: Thread 0 computes mean value

    # FILL IN (roughly 4 lines)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration

    # FILL IN (1 line)

    # Step 5: Each thread normalizes by the mean

    # FILL IN (roughly 3 lines)


View full file: problems/p27/p27.mojo

Tips

1. Complete workflow structure (builds on all previous operations)

The algorithm follows the standard block operations pattern:

  1. Each thread loads its element (the familiar pattern from all previous puzzles)
  2. Use block.sum() to compute the total (covered earlier in this puzzle)
  3. Thread 0 computes the mean from the sum
  4. Use block.broadcast() to share the mean with all threads (new!)
  5. Each thread normalizes using the broadcast mean

2. Data loading and sum computation (familiar patterns)

Load your element using the established LayoutTensor pattern:

var my_value: Scalar[dtype] = 0.0
if global_i < size:
    my_value = input_data[global_i][0]  # SIMD extraction

Then use block.sum() exactly as in the dot product earlier:

total_sum = block.sum[block_size=tpb, broadcast=False](...)

3. Mean computation (thread 0 only)

Only thread 0 should compute the mean:

var mean_value: Scalar[dtype] = 1.0  # safe default
if local_i == 0:
    # compute the mean from total_sum and size

Why thread 0? It stays consistent with the block.sum() pattern, where thread 0 receives the result.

4. block.broadcast() API concepts

Study the function signature - it needs:

  • Template parameters: dtype, width, block_size
  • Runtime parameters: val (the SIMD value to broadcast), src_thread (default = 0)

The call pattern follows the established template style:

result = block.broadcast[
    dtype = DType.float32,
    width = 1,
    block_size = tpb
](val=SIMD[DType.float32, 1](value_to_broadcast), src_thread=UInt(0))

5. Understanding the broadcast pattern

Key insight: block.broadcast() takes a value from one thread and delivers it to all threads:

  • Thread 0 holds the computed mean value
  • All threads need that same mean value
  • block.broadcast() copies thread 0's value to everyone

This is the opposite of block.sum() (all→one) and differs from block.prefix_sum() (all→each position).
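
The three communication patterns can be contrasted with a small Python model in which a list holds one value per thread. This is only a sketch of the semantics, not the Mojo API: the function names mirror the block operations for readability, and the scan is written as an inclusive scan purely for illustration.

```python
from itertools import accumulate


def block_sum(vals):
    """Reduction (all -> one): a single aggregated result."""
    return sum(vals)


def block_prefix_sum(vals):
    """Inclusive scan (all -> each): thread i gets the sum of vals[0..i]."""
    return list(accumulate(vals))


def block_broadcast(vals, src_thread=0):
    """Broadcast (one -> all): every thread gets the source thread's value."""
    return [vals[src_thread]] * len(vals)


per_thread = [3, 1, 4, 1]            # one value per thread, 4-thread "block"
print(block_sum(per_thread))         # 9
print(block_prefix_sum(per_thread))  # [3, 4, 8, 9]
print(block_broadcast(per_thread))   # [3, 3, 3, 3]
```

Only the broadcast ignores the values held by the non-source threads, which is why the default mean on threads 1-127 does not matter.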

6. Final normalization step

Once every thread has the broadcast mean, it normalizes its own element:

if global_i < size:
    normalized_value = my_value / broadcasted_mean[0]  # SIMD extraction
    output_data[global_i] = normalized_value

SIMD extraction: block.broadcast() returns a SIMD value, so use [0] to extract the scalar.

7. Pattern recognition from previous puzzles

  • Thread indexing: the same global_i, local_i pattern as always
  • Bounds checking: the same if global_i < size validation
  • SIMD handling: the same [0] extraction pattern
  • Block operations: the same template parameter style as block.sum()

The key point is that every block operation follows the same consistent pattern!

Test the block.broadcast() approach:

pixi run p27 --normalize
pixi run -e amd p27 --normalize
pixi run -e apple p27 --normalize
uv run poe p27 --normalize

Expected output when solved:

SIZE: 128
TPB: 128

Input sample: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 ...
Sum value: 576.0
Mean value: 4.5

Mean Normalization Results:
Normalized sample: 0.22222222 0.44444445 0.6666667 0.8888889 1.1111112 1.3333334 1.5555556 1.7777778 ...

Output sum: 128.0
Output mean: 1.0
✅ Success: Output mean is 1.0 (should be close to 1.0)

Solution

fn block_normalize_vector[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all → one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one → all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Step 1: Each thread loads its element
    var my_value: Scalar[dtype] = 0.0
    if global_i < size:
        my_value = input_data[global_i][0]  # Extract SIMD value

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)
    total_sum = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](my_value)
    )

    # Step 3: Thread 0 computes mean value
    var mean_value: Scalar[dtype] = 1.0  # Default to avoid division by zero
    if local_i == 0:
        if total_sum[0] > 0.0:
            mean_value = total_sum[0] / Float32(size)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration
    broadcasted_mean = block.broadcast[
        dtype = DType.float32, width=1, block_size=tpb
    ](val=SIMD[DType.float32, 1](mean_value), src_thread=UInt(0))

    # Step 5: Each thread normalizes by the mean
    if global_i < size:
        normalized_value = my_value / broadcasted_mean[0]
        output_data[global_i] = normalized_value


The block.broadcast() kernel completes the set of three fundamental communication patterns and demonstrates a full block operations workflow: it combines reduction and broadcast in a real algorithm that produces mathematically verifiable results:

Complete algorithm analysis with a concrete execution:

Phase 1: Parallel data loading (the established pattern from all previous puzzles)

Thread indexing (consistent across all puzzles):
  global_i = block_dim.x * block_idx.x + thread_idx.x  // maps to the input array position
  local_i = thread_idx.x                              // position within the block (0-127)

Parallel element loading using the LayoutTensor pattern:
  Thread 0:   my_value = input_data[0][0] = 1.0    // first cycle value
  Thread 1:   my_value = input_data[1][0] = 2.0    // second cycle value
  Thread 7:   my_value = input_data[7][0] = 8.0    // last cycle value
  Thread 8:   my_value = input_data[8][0] = 1.0    // cycle repeats: 1,2,3,4,5,6,7,8,1,2...
  Thread 15:  my_value = input_data[15][0] = 8.0   // 15 % 8 = 7, 8th value
  Thread 127: my_value = input_data[127][0] = 8.0  // 127 % 8 = 7, 8th value

All 128 threads load simultaneously - full parallel efficiency!

Phase 2: Block-wide sum reduction (applying the earlier block.sum() knowledge)

block.sum() coordination across 128 threads:
  Contribution analysis:
    - Values 1,2,3,4,5,6,7,8 each repeat 16 times (128/8 = 16)
    - Thread contributions: 16×1 + 16×2 + 16×3 + 16×4 + 16×5 + 16×6 + 16×7 + 16×8
    - Mathematical sum: 16 × (1+2+3+4+5+6+7+8) = 16 × 36 = 576.0

block.sum() hardware execution:
  All threads → [reduction tree] → thread 0
  total_sum = SIMD[DType.float32, 1](576.0)  // only thread 0 receives this value

Threads 1-127: their total_sum is not meaningful (broadcast=False in block.sum)

Phase 3: Exclusive mean computation (single-thread processing)

Thread 0 performs the critical computation:
  Input: total_sum[0] = 576.0, size = 128
  Computation: mean_value = 576.0 / 128.0 = 4.5

  Verification: expected mean = (1+2+3+4+5+6+7+8)/8 = 36/8 = 4.5 ✓

All other threads (1-127):
  mean_value = 1.0 (the safe default)
  These values are irrelevant - the broadcast will override them

Key insight: at this point, only thread 0 holds the correct mean value!

Phase 4: Block-wide broadcast distribution (one → all communication)

block.broadcast() API execution:
  Source: src_thread = UInt(0) → thread 0's mean_value = 4.5
  Target: all 128 threads in the block

Before the broadcast:
  Thread 0:   mean_value = 4.5  ← source of truth
  Thread 1:   mean_value = 1.0  ← will be overridden
  Thread 2:   mean_value = 1.0  ← will be overridden
  ...
  Thread 127: mean_value = 1.0  ← will be overridden

After block.broadcast():
  Thread 0:   broadcasted_mean[0] = 4.5  ← receives its own value back
  Thread 1:   broadcasted_mean[0] = 4.5  ← now has the correct value!
  Thread 2:   broadcasted_mean[0] = 4.5  ← now has the correct value!
  ...
  Thread 127: broadcasted_mean[0] = 4.5  ← now has the correct value!

Result: every thread holds the same mean value!

Phase 5: Parallel mean normalization (coordinated processing)

Each thread normalizes independently using the broadcast mean:
  Thread 0:   normalized = 1.0 / 4.5 = 0.22222222...
  Thread 1:   normalized = 2.0 / 4.5 = 0.44444444...
  Thread 2:   normalized = 3.0 / 4.5 = 0.66666666...
  Thread 7:   normalized = 8.0 / 4.5 = 1.77777777...
  Thread 8:   normalized = 1.0 / 4.5 = 0.22222222...  (pattern repeats)
  ...

Mathematical verification:
  Output sum = (0.222... + 0.444... + ... + 1.777...) × 16 = (36 / 4.5) × 16 = 8 × 16 = 128.0
  Output mean = 128.0 / 128 = 1.0  Normalization confirmed!

Dividing each value by the original mean yields an output whose mean is 1.0

Phase 6: Verification of correctness

Input analysis:
  - Sum: 576.0, mean: 4.5
  - Max: 8.0, min: 1.0
  - Range: [1.0, 8.0]

Output analysis:
  - Sum: 128.0, mean: 1.0 ✓
  - Max: 1.777..., min: 0.222...
  - Range: [0.222, 1.777] (all values scaled by a factor of 1/4.5)

Proportional relationships preserved:
  - The original 8:1 ratio becomes 1.777:0.222 = 8:1 ✓
  - All relative magnitudes are preserved exactly
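
The six phases can be replayed on the CPU with a short Python model that keeps one list entry per thread. It is a sketch of the data flow only: the function name simulate_block_normalize is hypothetical, plain Python floats stand in for float32, and a Python sum stands in for block.sum().

```python
TPB = 128
SIZE = 128


def simulate_block_normalize(inputs: list[float]) -> list[float]:
    """CPU model of the kernel's five steps for a single block."""
    # Step 1: each thread loads its element (0.0 if out of bounds).
    my_value = [inputs[t] if t < SIZE else 0.0 for t in range(TPB)]

    # Step 2: reduction; with broadcast=False only thread 0 uses the result.
    total_sum = sum(my_value)

    # Step 3: every thread starts at the safe default, thread 0 overwrites it.
    mean_value = [1.0] * TPB
    if total_sum > 0.0:
        mean_value[0] = total_sum / float(SIZE)

    # Step 4: broadcast copies thread 0's value to every thread.
    src_thread = 0
    broadcasted_mean = [mean_value[src_thread]] * TPB

    # Step 5: each in-bounds thread normalizes its own element.
    return [my_value[t] / broadcasted_mean[t] for t in range(TPB) if t < SIZE]


inputs = [float(i % 8 + 1) for i in range(SIZE)]
output = simulate_block_normalize(inputs)
print(sum(output) / SIZE)          # approximately 1.0
print(max(output) / min(output))   # approximately 8.0
```

The two printed values correspond to the output mean and the preserved 8:1 ratio from the verification above.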

Why this complete workflow is mathematically and computationally sound:

Technical accuracy and verification:

Mathematical proof of correctness:
  Input: x₁, x₂, ..., xₙ (n = 128)
  Mean: μ = (∑xᵢ)/n = 576/128 = 4.5

  Normalization: yᵢ = xᵢ/μ
  Output mean: (∑yᵢ)/n = (∑xᵢ/μ)/n = (1/μ)(∑xᵢ)/n = (1/μ)μ = 1 ✓

The algorithm produces a provably correct mathematical result.
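
The identity in the proof can also be checked with exact rational arithmetic, which avoids floating-point rounding. This is a small sketch; the helper name normalize_exact is hypothetical, and it assumes a nonzero mean.

```python
from fractions import Fraction


def normalize_exact(xs: list[int]) -> list[Fraction]:
    """Divide each element by the mean using exact rational arithmetic."""
    mu = Fraction(sum(xs), len(xs))
    return [Fraction(x) / mu for x in xs]


xs = [i % 8 + 1 for i in range(128)]
ys = normalize_exact(xs)
output_mean = sum(ys) / len(ys)
print(output_mean)  # 1
```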

Connection to Puzzle 12 (foundational patterns):

  • Thread coordination evolution: the same global_i, local_i pattern, but using block primitives
  • Memory access pattern: the same LayoutTensor SIMD extraction [0], but in an optimized workflow
  • Complexity elimination: replaces 20+ lines of manual barriers with 2 block operations
  • Educational progression: manual → automatic, complex → simple, error-prone → reliable

Connection to block.sum() (seamless integration):

  • API consistency: the same template structure [block_size=tpb, broadcast=False]
  • Result flow design: thread 0 receives the sum and naturally computes the derived parameter
  • Seamless composition: the output of block.sum() becomes the input to the computation + broadcast
  • Performance optimization: single-kernel workflow vs multi-pass approach

Connection to block.prefix_sum() (complementary communication):

  • Distribution pattern: prefix_sum gives each thread a unique position, broadcast gives all threads a shared value
  • Usage scenarios: prefix_sum for parallel partitioning, broadcast for parameter sharing
  • Template consistency: the same dtype, block_size parameter pattern across all operations
  • SIMD handling uniformity: all block operations return SIMD values that require [0] extraction

Advanced algorithmic insights:

Communication pattern comparison:
  Traditional approach:
    1. Manual reduction:       O(log n), explicit barriers required
    2. Shared memory write:    O(1), synchronization required
    3. Shared memory read:     O(1), potential bank conflicts
    Total: many synchronization points, error-prone

  Block operations approach:
    1. block.sum():            O(log n), hardware-optimized, automatic barriers
    2. Computation:            O(1), single thread
    3. block.broadcast():      O(log n), hardware-optimized, automatic distribution
    Total: two primitives, automatic synchronization, proven correctness

Real-world algorithm patterns:

Common parallel algorithm structure:
  Phase 1: Parallel data processing      → all threads contribute
  Phase 2: Global parameter computation  → one thread computes
  Phase 3: Parameter distribution        → all threads receive
  Phase 4: Coordinated parallel output   → all threads process

Fields where this exact pattern appears:
  - Batch normalization (deep learning)
  - Histogram equalization (image processing)
  - Iterative numerical methods (scientific computing)
  - Lighting calculations (computer graphics)

Mean normalization is a clean teaching example of this fundamental pattern.

Block operations trilogy completed:

1. block.sum() - all→one (Reduction)

  • Input: all threads provide a value
  • Output: thread 0 receives the aggregated result
  • Use cases: computing sums, maxima, and similar aggregates

2. block.prefix_sum() - all→each (Scan)

  • Input: all threads provide a value
  • Output: each thread receives its cumulative position
  • Use cases: computing write positions, parallel partitioning

3. block.broadcast() - one→all (Broadcast)

  • Input: one thread provides a value (typically thread 0)
  • Output: all threads receive the same value
  • Use cases: sharing computed parameters, distributing configuration values

Complete block operations progression:

  1. Manual coordination (Puzzle 12): understand the parallel fundamentals
  2. Warp primitives (Puzzle 24): learn hardware-accelerated patterns
  3. Block reduction (block.sum()): learn all→one communication
  4. Block scan (block.prefix_sum()): learn all→each communication
  5. Block broadcast (block.broadcast()): learn one→all communication

The big picture: block operations provide the fundamental communication building blocks for advanced parallel algorithms, replacing complex manual coordination with clean, composable primitives.

Performance insights and technical analysis

Quantitative performance comparison:

block.broadcast() vs the traditional shared memory approach (illustrative figures, for reference only):

Traditional manual approach:

Phase 1: Manual reduction
  • Shared memory allocation: ~5 cycles
  • Barrier synchronization: ~10 cycles
  • Tree reduction loop: ~15 cycles
  • Error-prone manual indexing

Phase 2: Mean computation: ~2 cycles

Phase 3: Shared memory broadcast
  • Manual write to shared memory: ~2 cycles
  • Barrier synchronization: ~10 cycles
  • All threads read: ~3 cycles

Total: ~47 cycles
  + synchronization overhead
  + potential race conditions
  + manual error debugging

Block operations approach:

Phase 1: block.sum()
  • Hardware-optimized: ~3 cycles
  • Automatic barriers: no explicit cost
  • Optimized reduction: ~8 cycles
  • Verified correct implementation

Phase 2: Mean computation: ~2 cycles

Phase 3: block.broadcast()
  • Hardware-optimized: ~4 cycles
  • Automatic distribution: no explicit cost
  • Verified correct implementation

Total: ~17 cycles
  + automatic optimization
  + guaranteed correctness
  + composable design

Memory hierarchy advantages:

Cache efficiency:

  • block.sum(): optimized memory access patterns reduce cache misses
  • block.broadcast(): efficient distribution minimizes memory bandwidth usage
  • Combined workflow: a single kernel eliminates the global-memory round trip for the intermediate mean

Memory bandwidth utilization:

Traditional multi-kernel approach:
  Kernel 1: input → reduction → global memory write
  Kernel 2: global memory read → broadcast → output
  Total global memory transfers: 3× the array size

Block operations single kernel:
  input → block.sum() → block.broadcast() → output
  Total global memory transfers: 2× the array size (33% improvement)
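
The 3× vs 2× figures can be reproduced with simple arithmetic. The sketch below assumes the accounting implied by the comparison above: the two-kernel version reads the input twice and writes the output once, the single-kernel version reads and writes once each, and the scalar mean is treated as negligible.

```python
SIZE = 128          # elements, from the configuration
BYTES_PER_ELEM = 4  # float32

array_bytes = SIZE * BYTES_PER_ELEM

# Two-kernel approach: kernel 1 reads the input, kernel 2 reads it again
# and writes the output (the single scalar mean is ignored as negligible).
multi_kernel_bytes = 3 * array_bytes

# Single-kernel approach: read the input once, write the output once.
single_kernel_bytes = 2 * array_bytes

saving = 1 - single_kernel_bytes / multi_kernel_bytes
print(array_bytes, multi_kernel_bytes, single_kernel_bytes)  # 512 1536 1024
print(f"{saving:.0%}")  # 33%
```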

When to use each block operation:

Best scenarios for block.sum():

  • Data aggregation: computing sums, means, maxima/minima
  • Reduction patterns: any case that needs all→one communication
  • Statistical operations: computing mean, variance, correlation

Best scenarios for block.prefix_sum():

  • Parallel partitioning: stream compaction, histogram binning
  • Write position calculation: parallel output generation
  • Parallel algorithms: sorting, searching, data reorganization

Best scenarios for block.broadcast():

  • Parameter distribution: sharing a computed value with all threads
  • Configuration propagation: mode flags, scaling factors, thresholds
  • Coordinated processing: when all threads need the same computed parameter

Composition benefits:

Individual operations:  good performance, limited scope
Combined operations:    excellent performance, comprehensive algorithms

Composition examples seen in real applications:
• block.sum() + block.broadcast():       normalization algorithms
• block.prefix_sum() + block.sum():      advanced partitioning
• All three combined:                    complex parallel algorithms
• With traditional patterns:             hybrid optimization strategies

Next steps

Now that you've learned the complete block operations trilogy, you can move on to:

  • Multi-block algorithms: coordinating operations across multiple thread blocks
  • Advanced parallel patterns: combining block operations for complex algorithms
  • Memory hierarchy optimization: efficient data movement patterns
  • Algorithm design: structuring parallel algorithms with block operation building blocks
  • Performance optimization: choosing the best block sizes and operation combinations

💡 Key takeaway: The block operations trilogy (sum, prefix_sum, broadcast) provides a complete set of communication primitives for block-level parallel programming. By composing these operations, you can implement advanced parallel algorithms in clean, maintainable code that takes advantage of GPU hardware optimizations. Mean normalization shows how these operations work together to solve a real computational problem efficiently.