Puzzle 6: Blocks

Overview

Implement a kernel that adds 10 to each position of vector a and stores the result in output.

Note: There are fewer threads per block than elements in a.

[Figure: Blocks visualization]

Key concepts

This puzzle covers:

  • Processing data larger than a single thread block
  • Coordinating multiple blocks of threads
  • Computing global thread positions

The key insight is how multiple thread blocks cooperate to process data larger than one block can hold, while keeping the mapping between elements and threads correct.
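
To make the element-to-thread mapping concrete, the following is a minimal sketch in plain Python (a CPU-side model, not Mojo GPU code). It uses the launch configuration from this puzzle (3 blocks of 4 threads, 9 elements); the helper name `global_indices` is hypothetical and exists only for this illustration.

```python
# CPU-side model of how each GPU thread derives its global position.
# Constants mirror the puzzle's configuration (x dimension only).
SIZE = 9
BLOCKS_PER_GRID = 3
THREADS_PER_BLOCK = 4


def global_indices(blocks_per_grid, threads_per_block):
    """Return, for every block, the global index computed by each of its threads."""
    return [
        [threads_per_block * block_idx + thread_idx
         for thread_idx in range(threads_per_block)]
        for block_idx in range(blocks_per_grid)
    ]


mapping = global_indices(BLOCKS_PER_GRID, THREADS_PER_BLOCK)
for block_idx, indices in enumerate(mapping):
    print(f"Block {block_idx}: {indices}")
```

In this model the grid launches 12 threads for 9 elements, so the last block produces some indices that do not correspond to any element of a.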

Code to complete

comptime SIZE = 9
comptime BLOCKS_PER_GRID = (3, 1)
comptime THREADS_PER_BLOCK = (4, 1)
comptime dtype = DType.float32


fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p06/p06.mojo

Note: The LayoutTensor version of this puzzle is nearly identical, so it is left to the reader.

Tips
  1. Compute the global index: i = block_dim.x * block_idx.x + thread_idx.x
  2. Add a guard: if i < size
  3. Inside the guard: output[i] = a[i] + 10.0

Running the code

To test your solution, run one of the following commands in your terminal, depending on your environment:

pixi run p06
pixi run -e amd p06
pixi run -e apple p06
uv run poe p06

If you have not solved the puzzle yet, the output looks like this:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])
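
The expected values can be reproduced on the CPU with a short Python sketch. It assumes the input vector a holds the values 0.0 through 8.0; that is inferred from the expected output above and is not stated in this section.

```python
SIZE = 9

# Assumption: a is initialized to 0.0, 1.0, ..., 8.0 (inferred from the expected output).
a = [float(i) for i in range(SIZE)]

# Reference result: add 10 to every element.
expected = [x + 10.0 for x in a]
print(expected)
```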

Solution

fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0


This solution covers the key concepts of block-based GPU processing:

  1. Global thread indexing

    • Combines the block index and thread index: block_dim.x * block_idx.x + thread_idx.x

    • Maps each thread to a unique global position

    • Example with 3 threads per block (a simplified illustration; the puzzle itself launches 4 threads per block):

      Block 0: [0 1 2]
      Block 1: [3 4 5]
      Block 2: [6 7 8]
      
  2. Block coordination

    • Each block processes a contiguous chunk of data

    • Block size (3 in this illustration) < data size (9), so multiple blocks are needed

    • Work is distributed across blocks automatically:

      Data:    [0 1 2 3 4 5 6 7 8]
      Block 0: [0 1 2]
      Block 1:       [3 4 5]
      Block 2:             [6 7 8]
      
  3. Bounds checking

    • The guard condition i < size handles edge cases
    • Prevents out-of-bounds access when the data size is not evenly divisible by the block size
    • Essential for handling the partial block at the end of the data
  4. Memory access pattern

    • Coalesced memory access: threads in a block access contiguous memory
    • Each thread processes one element: output[i] = a[i] + 10.0
    • Block-level parallelism makes efficient use of memory bandwidth
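
The four points above can be checked with a sequential Python model of the kernel. This is only a sketch: real GPU threads run in parallel, and the function name `add_10_blocks_sim` and the `skipped` list are hypothetical additions for illustration. It uses the puzzle's actual launch configuration (3 blocks of 4 threads) and assumes a holds 0.0 through 8.0.

```python
def add_10_blocks_sim(a, size, blocks_per_grid, threads_per_block):
    """Sequentially replay what each (block, thread) pair would do on the GPU."""
    output = [0.0] * size
    skipped = []  # global indices rejected by the guard
    for block_idx in range(blocks_per_grid):
        for thread_idx in range(threads_per_block):
            # Same formula as the kernel: block_dim.x * block_idx.x + thread_idx.x
            i = threads_per_block * block_idx + thread_idx
            if i < size:  # guard against out-of-bounds access
                output[i] = a[i] + 10.0
            else:
                skipped.append(i)
    return output, skipped


a = [float(i) for i in range(9)]  # assumed input values
output, skipped = add_10_blocks_sim(a, size=9, blocks_per_grid=3, threads_per_block=4)
print(output)
print(skipped)
```

Here `skipped` collects the indices from the partial last block that the guard filters out; without the guard those threads would read and write past the end of the buffers.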

This pattern is the foundation for processing large datasets that exceed the size of a single thread block.
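
When scaling this pattern to larger inputs, the number of blocks is commonly derived with a ceiling division so that every element is covered by some thread. The Python sketch below shows that arithmetic; `blocks_needed` is a hypothetical helper and is not part of the puzzle code.

```python
def blocks_needed(size, threads_per_block):
    """Smallest number of blocks whose threads cover `size` elements (ceiling division)."""
    return (size + threads_per_block - 1) // threads_per_block


# The puzzle's configuration: 9 elements, 4 threads per block.
print(blocks_needed(9, 4))
# A larger, made-up example: one million elements, 256 threads per block.
print(blocks_needed(1_000_000, 256))
```

Whenever the size is not a multiple of the block size, the last block is only partly used, which is why the i < size guard remains necessary.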