Mojo 🔥 GPU Puzzles, Edition 1

"For the things we have to learn before we can do them, we learn by doing them." — Aristotle (Nicomachean Ethics)

Welcome to this hands-on guide to GPU programming with Mojo 🔥, the programming language that combines Pythonic syntax with systems-level performance.

Watch the overview video below first, or simply read on.

์™œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ธ๊ฐ€?

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ „๋ฌธ ๊ธฐ์ˆ ์—์„œ ํ˜„๋Œ€ ์ปดํ“จํŒ…์˜ ํ•ต์‹ฌ ์ธํ”„๋ผ๋กœ ๋ฐœ์ „ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ˆ˜์‹ญ์–ต ๊ฐœ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ๋ถ€ํ„ฐ ์‹ค์‹œ๊ฐ„ ์˜์ƒ ์ŠคํŠธ๋ฆผ์„ ๋ถ„์„ํ•˜๋Š” ์ปดํ“จํ„ฐ ๋น„์ „ ์‹œ์Šคํ…œ๊นŒ์ง€, GPU ๊ฐ€์†์ด ์˜ค๋Š˜๋‚ ์˜ ์—ฐ์‚ฐ ํ˜์‹ ์„ ์ด๋Œ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐํ›„ ๋ชจ๋ธ๋ง, ์‹ ์•ฝ ๋ฐœ๊ฒฌ, ์–‘์ž ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋“ฑ ๊ณผํ•™์  ๋ฐœ์ „์€ GPU๋งŒ์ด ์ œ๊ณตํ•˜๋Š” ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ๋Šฅ๋ ฅ์— ์˜์กดํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธˆ์œต ๊ธฐ๊ด€์€ ์‹ค์‹œ๊ฐ„ ๋ฆฌ์Šคํฌ ๋ถ„์„๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŠธ๋ ˆ์ด๋”ฉ์— GPU ์ปดํ“จํŒ…์„ ํ™œ์šฉํ•˜๋ฉฐ, ์ž์œจ์ฃผํ–‰ ์ฐจ๋Ÿ‰์€ GPU ๊ฐ€์† ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ์„ผ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜์—ฌ ์ค‘์š”ํ•œ ์˜์‚ฌ๊ฒฐ์ •์„ ๋‚ด๋ฆฝ๋‹ˆ๋‹ค.

๊ฒฝ์ œ์  ํŒŒ๊ธ‰๋ ฅ๋„ ์ƒ๋‹นํ•ฉ๋‹ˆ๋‹ค. GPU ์ปดํ“จํŒ…์„ ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ์กฐ์ง์€ ๊ฐœ๋ฐœ ์ฃผ๊ธฐ ๋‹จ์ถ•, ์—ฐ์‚ฐ ๋น„์šฉ ์ ˆ๊ฐ, ๊ทธ๋ฆฌ๊ณ  ์ด์ „์—๋Š” ํ’€๊ธฐ ์–ด๋ ค์› ๋˜ ๊ณ„์‚ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ ๋“ฑ ์ƒ๋‹นํ•œ ๊ฒฝ์Ÿ ์šฐ์œ„๋ฅผ ํ™•๋ณดํ•ฉ๋‹ˆ๋‹ค. ๊ณ„์‚ฐ ๋Šฅ๋ ฅ์ด ๋น„์ฆˆ๋‹ˆ์Šค ๊ฐ€์น˜์™€ ์ง๊ฒฐ๋˜๋Š” ์‹œ๋Œ€์—, GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ญ๋Ÿ‰์€ ์—”์ง€๋‹ˆ์–ด, ์—ฐ๊ตฌ์ž, ์กฐ์ง์—๊ฒŒ ์ „๋žต์  ์ฐจ๋ณ„ํ™” ์š”์†Œ์ž…๋‹ˆ๋‹ค.

์™œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— Mojo๐Ÿ”ฅ๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๊ฐ€?

์ปดํ“จํŒ… ์‚ฐ์—…์€ ์ค‘๋Œ€ํ•œ ์ „ํ™˜์ ์— ๋„๋‹ฌํ–ˆ์Šต๋‹ˆ๋‹ค. CPU ์„ฑ๋Šฅ์€ ์ „๋ ฅ๊ณผ ๋ฐœ์—ด ์ œ์•ฝ์œผ๋กœ ์ธํ•ด ํด๋Ÿญ ์†๋„ ํ–ฅ์ƒ๋งŒ์œผ๋กœ๋Š” ํ•œ๊ณ„์— ์ด๋ฅด๋ €์Šต๋‹ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ ํ•˜๋“œ์›จ์–ด ์ œ์กฐ์‚ฌ๋“ค์€ ๋ฌผ๋ฆฌ์  ์ฝ”์–ด ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋‚˜์•„๊ฐ”๊ณ , ์ด๋Ÿฌํ•œ ๋ฉ€ํ‹ฐ์ฝ”์–ด ์ ‘๊ทผ ๋ฐฉ์‹์˜ ์ •์ ์ด ๋ฐ”๋กœ ์ˆ˜์ฒœ ๊ฐœ์˜ ์ฝ”์–ด๊ฐ€ ๋ณ‘๋ ฌ๋กœ ๋™์ž‘ํ•˜๋Š” ํ˜„๋Œ€ GPU์ž…๋‹ˆ๋‹ค. NVIDIA H100์„ ์˜ˆ๋กœ ๋“ค๋ฉด, ๋‹จ์ผ ํด๋Ÿญ ์‚ฌ์ดํด์— 16,896๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋ฉด์„œ 270,000๊ฐœ ์ด์ƒ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋Œ€๊ธฐ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Mojo๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ๋Œ€ํ•œ ์‹ค์šฉ์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ๊ณตํ•˜์—ฌ, ์ด๋Ÿฌํ•œ ๋ณ‘๋ ฌ์„ฑ์„ ๋” ์‰ฝ๊ฒŒ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ ์Šคํƒ€์ผ ๋ฌธ๋ฒ•์œผ๋กœ ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊นŒ์ง€
  • ์ถ”์ƒํ™”ํ•ด๋„ ์„ฑ๋Šฅ ์†์‹ค ์—†์ด ๋จธ์‹  ์ฝ”๋“œ๋กœ ์ปดํŒŒ์ผ๋˜๋Š” ์ œ๋กœ ์ฝ”์ŠคํŠธ ์ถ”์ƒํ™”
  • ์ปดํŒŒ์ผ ํƒ€์ž„์— ์˜ค๋ฅ˜๋ฅผ ์žก๋Š” ๊ฐ•๋ ฅํ•œ ํƒ€์ž… ์‹œ์Šคํ…œ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ๊ณ ๋ คํ•œ ํ…์„œ ๊ธฐ๋ณธ ์ง€์›
  • CPUยทGPU ๋‚ด์žฅ ํ•จ์ˆ˜๋ฅผ ์ง์ ‘ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ๋Š” ํ•˜๋“œ์›จ์–ด ์ง์ ‘ ์ œ์–ด
  • CPU์™€ GPU ๋ชจ๋‘์—์„œ ๋™์ž‘ํ•˜๋Š” ํฌ๋กœ์Šค ํ•˜๋“œ์›จ์–ด ์ด์‹์„ฑ
  • C/C++ ๋Œ€๋น„ ํ–ฅ์ƒ๋œ ์•ˆ์ „์„ฑ
  • ๋‚ฎ์€ ์ง„์ž… ์žฅ๋ฒฝ์œผ๋กœ ๋” ๋งŽ์€ ํ”„๋กœ๊ทธ๋ž˜๋จธ๊ฐ€ GPU ์„ฑ๋Šฅ์„ ํ™œ์šฉ

Mojo๐Ÿ”ฅ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋ˆ„๊ตฌ๋‚˜ ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋งŒ๋“ค์–ด ํ˜์‹ ์„ ์ด๋Œ๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค. >์ต์ˆ™ํ•œ ํŒŒ์ด์ฌ ๋ฌธ๋ฒ•์„ ๋ฐ”ํƒ•์œผ๋กœ GPU์— ์ง์ ‘ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์–ด, ๊นŠ์€ ์ „๋ฌธ ์ง€์‹ ์—†์ด๋„ CPU์™€ GPU๋ฅผ ํ•จ๊ป˜ ํ™œ์šฉํ•˜๋Š” ๊ณ ์„ฑ๋Šฅ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์™œ ํผ์ฆ๋กœ ๋ฐฐ์šฐ๋Š”๊ฐ€?

๋Œ€๋ถ€๋ถ„์˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ž๋ฃŒ๋Š” ์‹ค์Šต์— ์•ž์„œ ๋ฐฉ๋Œ€ํ•œ ์ด๋ก ์„ ๋จผ์ € ๋‹ค๋ฃน๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ง์ ‘ ํ•ด๋ด์•ผ ์ดํ•ด๋˜๋Š” ์ถ”์ƒ์  ๊ฐœ๋…๋“ค์€ ์ž…๋ฌธ์ž์—๊ฒŒ ๋ถ€๋‹ด์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ์ฑ…์€ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ „ ๋ฌธ์ œ์— ๋ฐ”๋กœ ๋›ฐ์–ด๋“ค์–ด, ๋‹จ๊ณ„์ ์œผ๋กœ ๊ฐœ๋…์„ ๋ฐœ๊ฒฌํ•ด ๋‚˜๊ฐ‘๋‹ˆ๋‹ค.

ํผ์ฆ ๊ธฐ๋ฐ˜ ํ•™์Šต์˜ ์žฅ์ :

  • ์ง์ ‘ ์ฒดํ—˜: GPU์—์„œ ๋ฐ”๋กœ ์‹คํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ์ ์ง„์  ๋ณต์žก๋„: ๊ฐ ํผ์ฆ์ด ์ด์ „์— ๋ฐฐ์šด ๊ฐœ๋… ์œ„์— ์Œ“์—ฌ๊ฐ‘๋‹ˆ๋‹ค
  • ์‹ค์šฉ์  ์ดˆ์ : ์‹ค์ œ ๊ณ„์‚ฐ ๋ฌธ์ œ๋ฅผ ๋ฐ˜์˜ํ•œ ํผ์ฆ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค
  • ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ: ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ์—ฐ์Šต์„ ํ†ตํ•ด ๋ฌธ์ œ ํ•ด๊ฒฐ ๊ฐ๊ฐ์„ ํ‚ค์›๋‹ˆ๋‹ค
  • ์ง€์‹ ์ •์ฐฉ: ์ง์ ‘ ํ’€์–ด๋ณด๋Š” ๊ฒƒ์ด ์ฝ๊ธฐ๋งŒ ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ดํ•ด๊ฐ€ ๋” ๊นŠ์–ด์ง‘๋‹ˆ๋‹ค

์•”๊ธฐ๊ฐ€ ์•„๋‹Œ ๋ฐœ๊ฒฌ์— ์ค‘์ ์„ ๋‘ก๋‹ˆ๋‹ค. ์ง์ ‘ ์‹คํ—˜ํ•˜๋ฉด์„œ ๊ฐœ๋…์„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ตํžˆ๊ณ , ๊นŠ์€ ์ดํ•ด์™€ ์‹ค์ „ ์—ญ๋Ÿ‰์„ ํ•จ๊ป˜ ์Œ“์•„๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐ์‚ฌ์˜ ๋ง: ์ด ์ฑ…์˜ Part I๊ณผ III์€ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ NVIDIA GPU ํ•™์Šต ํ”„๋กœ์ ํŠธ์ธ GPU Puzzles์—์„œ ํฐ ์˜๊ฐ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ์ด ์ฑ…์€ ํ•ด๋‹น ๊ฐœ๋…๋“ค์„ Mojo์˜ ์ถ”์ƒํ™”์™€ ์„ฑ๋Šฅ์„ ํ™œ์šฉํ•˜์—ฌ ์žฌ๊ตฌํ˜„ํ•˜๊ณ , Mojo์— ํŠนํ™”๋œ ์ตœ์ ํ™”๋กœ ๊ณ ๊ธ‰ ์ฃผ์ œ๋ฅผ ๋” ๋„“๊ฒŒ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‚ฌ๊ณ ๋ฐฉ์‹

ํšจ๊ณผ์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•ด์„œ๋Š” ๊ณ„์‚ฐ์„ ๋ฐ”๋ผ๋ณด๋Š” ๋ฐฉ์‹ ์ž์ฒด๋ฅผ ๋ฐ”๊ฟ”์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์•ž์œผ๋กœ์˜ ํ•™์Šต์— ๊ธธ์žก์ด๊ฐ€ ๋  ํ•ต์‹ฌ ์‚ฌ๊ณ  ๋ชจ๋ธ์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค:

์ˆœ์ฐจ์—์„œ ๋ณ‘๋ ฌ๋กœ: ๋ฐ˜๋ณต๋ฌธ์„ ์Šค๋ ˆ๋“œ๋กœ ๋Œ€์ฒด

๊ธฐ์กด CPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ๋ฐ˜๋ณต๋ฌธ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

# CPU ๋ฐฉ์‹
for i in range(data_size):
    result[i] = process(data[i])

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ด ๋ฐฉ์‹์„ ์™„์ „ํžˆ ๋’ค์ง‘์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ์ˆœํšŒํ•˜๋Š” ๋Œ€์‹ , ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ํ• ๋‹นํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

# GPU ๋ฐฉ์‹ (๊ฐœ๋…์ )
thread_id = get_global_id()
if thread_id < data_size:
    result[thread_id] = process(data[thread_id])

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋งก์•„ ์ฒ˜๋ฆฌํ•˜๋ฏ€๋กœ, ๋ช…์‹œ์ ์ธ ๋ฐ˜๋ณต๋ฌธ์ด ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์‹คํ–‰์œผ๋กœ ๋ฐ”๋€๋‹ˆ๋‹ค. ์ˆœ์ฐจ ์ฒ˜๋ฆฌ์—์„œ ๋™์‹œ ์‹คํ–‰์œผ๋กœ์˜ ์ „ํ™˜์ด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์ž…๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์œ„์— ์—ฐ์‚ฐ ๊ทธ๋ฆฌ๋“œ ๋งž์ถ”๊ธฐ

๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์กฐํ™”๋œ ๊ทธ๋ฆฌ๋“œ๋กœ, GPU ์Šค๋ ˆ๋“œ๊ฐ€ ์ด์— ๋Œ€์‘ํ•˜๋Š” ์—ฐ์‚ฐ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ˜•์„ฑํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ํšจ๊ณผ์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ด ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์„ ์ž˜ ์„ค๊ณ„ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๊ณต๊ฐ„์„ ์ตœ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ: ๊ฐ๊ฐ ํŠน์ • ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋‹ด๋‹นํ•˜๋Š” ๊ฐœ๋ณ„ ์ฒ˜๋ฆฌ ๋‹จ์œ„
  • ๋ธ”๋ก: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ๋™๊ธฐํ™” ๊ธฐ๋Šฅ์„ ๊ฐ–์ถ˜ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน
  • ๊ทธ๋ฆฌ๋“œ: ์ „์ฒด ๊ณ„์‚ฐ ๋ฌธ์ œ๋ฅผ ์•„์šฐ๋ฅด๋Š” ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์ž˜ํ•˜๋ ค๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ๋™๊ธฐํ™” ์š”๊ตฌ์‚ฌํ•ญ์„ ๊ด€๋ฆฌํ•˜๋ฉด์„œ ๋ณ‘๋ ฌ ํšจ์œจ์„ ์ตœ๋Œ€ํ•œ ๋Œ์–ด์˜ฌ๋ฆฌ๋„๋ก ์ด ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์˜ ๊ท ํ˜•์„ ์žก์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์ด๋™ vs. ์—ฐ์‚ฐ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ์ž์ฒด๋ณด๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์˜ฎ๊ธฐ๋Š” ๋น„์šฉ์ด ๋” ํด ๋•Œ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค:

  • CPU์™€ GPU ๊ฐ„ ๋ฐ์ดํ„ฐ ์ด๋™์€ ๋А๋ฆฝ๋‹ˆ๋‹ค
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ์˜ ์ด๋™์€ ๊ทธ๋ณด๋‹ค ๋น ๋ฆ…๋‹ˆ๋‹ค
  • ๋ ˆ์ง€์Šคํ„ฐ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ด๋ฏธ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฒƒ์€ ๋งค์šฐ ๋น ๋ฆ…๋‹ˆ๋‹ค

์ด๋Š” ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ํ”ํžˆ ๊ฐ€์ง€๋Š” ๊ฐ€์ •์„ ๋’ค์ง‘์Šต๋‹ˆ๋‹ค. ๋ณ‘๋ชฉ์€ ์—ฐ์‚ฐ์ด ์•„๋‹ˆ๋ผ ๋ฐ์ดํ„ฐ ์ด๋™์ž…๋‹ˆ๋‹ค.
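A back-of-envelope sketch makes this concrete. Vector addition reads two floats and writes one per element but performs only a single arithmetic operation, so its arithmetic intensity (FLOPs per byte moved) is tiny; the numbers below are illustrative round figures, not measurements of any particular GPU.

```python
# Arithmetic intensity of elementwise vector addition (float32).
n = 1_000_000
bytes_moved = n * 3 * 4   # 2 reads + 1 write per element, 4 bytes each
flops = n * 1             # one addition per element

intensity = flops / bytes_moved  # FLOPs per byte
print(intensity)  # ≈ 0.083 — far below what a GPU can compute per byte fetched
```

Because modern GPUs can perform tens of FLOPs for every byte of memory bandwidth, a kernel like this is entirely memory-bound: making the arithmetic faster changes nothing, while improving data movement changes everything.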

์ด ์ฑ…์˜ ํผ์ฆ๋“ค์„ ํ’€์–ด๊ฐ€๋ฉด์„œ ์ด๋Ÿฌํ•œ ์›์น™์„ ์ง๊ด€์ ์œผ๋กœ ์ฒด๋“ํ•˜๊ณ , ๊ณ„์‚ฐ ๋ฌธ์ œ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹์„ ๋ฐ”๊ฟ” ๋‚˜๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ

์ด ์ฑ…์€ ๊ธฐ์ดˆ ์›๋ฆฌ๋ถ€ํ„ฐ ๊ณ ๊ธ‰ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋ฒ•๊นŒ์ง€ ๋‹ค๋ฃน๋‹ˆ๋‹ค. GPU๋ฅผ ์•Œ ์ˆ˜ ์—†๋Š” ๋ธ”๋ž™๋ฐ•์Šค๋กœ ๋‘์ง€ ์•Š๊ณ , ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ์˜ ๋™์ž‘๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊นŒ์ง€ ๋‹จ๊ณ„๋ณ„๋กœ ์ดํ•ด๋ฅผ ์Œ“์•„๊ฐ‘๋‹ˆ๋‹ค. ์ €์ˆ˜์ค€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์™€ ๊ณ ์ˆ˜์ค€ ํ…์„œ ์ถ”์ƒํ™”๋ฅผ ๋ชจ๋‘ ๋ฐฐ์›€์œผ๋กœ์จ, ์–ด๋–ค GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณผ์ œ์—๋„ ์œ ์—ฐํ•˜๊ฒŒ ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Current curriculum

| Core skill                          | Status       | Puzzles          |
|-------------------------------------|--------------|------------------|
| Thread/block fundamentals           | ✅ Available | Part I (1-8)     |
| GPU program debugging               | ✅ Available | Part II (9-10)   |
| Core algorithms                     | ✅ Available | Part III (11-16) |
| MAX Graph integration               | ✅ Available | Part IV (17-19)  |
| PyTorch integration                 | ✅ Available | Part V (20-22)   |
| Functional patterns & benchmarking  | ✅ Available | Part VI (23)     |
| Warp programming                    | ✅ Available | Part VII (24-26) |
| Block-level programming             | ✅ Available | Part VIII (27)   |
| Advanced memory operations          | ✅ Available | Part IX (28-29)  |
| Performance analysis                | ✅ Available | Part X (30-32)   |
| Modern GPU features                 | ✅ Available | Part XI (33-34)  |

์ƒ์„ธ ํ•™์Šต ๋ชฉํ‘œ

Part I: GPU ๊ธฐ์ดˆ (ํผ์ฆ 1-8) โœ…

  • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ๊ณผ ๋ธ”๋ก ๊ตฌ์„ฑ ๋ฐฐ์šฐ๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ๊ฐ€๋“œ ์ดํ•ดํ•˜๊ธฐ
  • ์›์‹œ ํฌ์ธํ„ฐ์™€ LayoutTensor ์ถ”์ƒํ™” ๋ชจ๋‘ ๋‹ค๋ค„๋ณด๊ธฐ
  • ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ ์ตํžˆ๊ธฐ

Part II: GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… (ํผ์ฆ 9-10) โœ…

  • GPU ๋””๋ฒ„๊ฑฐ์™€ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ• ๋ฐฐ์šฐ๊ธฐ
  • ์ƒˆ๋‹ˆํƒ€์ด์ €๋กœ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜์™€ ๊ฒฝ์Ÿ ์ƒํƒœ ์ฐพ๊ธฐ
  • GPU ๋ฒ„๊ทธ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์‹๋ณ„ํ•˜๊ณ  ์ˆ˜์ •ํ•˜๊ธฐ
  • ๋ณต์žกํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณผ์ œ์— ๋„์ „ํ•  ์ž์‹ ๊ฐ ์Œ“๊ธฐ

์ฐธ๊ณ : ๋””๋ฒ„๊น… ํผ์ฆ์„ ์‹คํ–‰ํ•˜๋ ค๋ฉด NVIDIA GPU ๋””๋ฒ„๊น… ๋„๊ตฌ ์ ‘๊ทผ์„ ์œ„ํ•œ pixi๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. CUDA๋ฅผ ์ง€์›ํ•˜๋Š” NVIDIA GPU์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

Part III: GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜ (ํผ์ฆ 11-16) โœ…

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜๊ณผ ํ’€๋ง ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • ํšจ์œจ์ ์ธ ํ•ฉ์„ฑ๊ณฑ ์ปค๋„ ๋งŒ๋“ค๊ธฐ
  • ๋ˆ„์  ํ•ฉ(์Šค์บ”) ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฐฐ์šฐ๊ธฐ
  • ํƒ€์ผ๋ง ์ „๋žต์œผ๋กœ ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ตœ์ ํ™”ํ•˜๊ธฐ

Part IV: MAX Graph Integration (Puzzles 17-19) ✅

  • Create custom MAX Graph operations
  • Connect GPU kernels with Python code
  • Implement production-grade operations such as softmax and attention

Part V: PyTorch Integration (Puzzles 20-22) ✅

  • Connect Mojo GPU kernels with PyTorch tensors
  • Handle tensor marshaling seamlessly with CustomOpLibrary
  • Integrate with torch.compile to optimize execution
  • Learn kernel fusion and custom backward passes

Part VI: Mojo Functional Patterns and Benchmarking (Puzzle 23) ✅

  • Learn functional patterns: elementwise, tiled processing, vectorization
  • Learn systematic performance optimization and its trade-offs
  • Analyze performance with quantitative benchmarking
  • Understand the GPU threading vs. SIMD execution hierarchy

Part VII: Warp-Level Programming (Puzzles 24-26) ✅

  • Learn warp fundamentals and the SIMT execution model
  • Master the core warp operations: sum, shuffle_down, broadcast
  • Implement advanced patterns with shuffle_xor and prefix_sum
  • Combine warp programming with functional patterns effectively

Part VIII: Block-Level Programming (Puzzle 27) ✅

  • Learn block-wide reductions with block.sum() and block.max()
  • Master block-level prefix-sum patterns and communication
  • Implement efficient intra-block coordination with block.broadcast()

Part IX: Advanced Memory Systems (Puzzles 28-29) ✅

  • Implement optimal memory coalescing patterns
  • Hide latency by overlapping computation and transfers with asynchronous memory operations
  • Learn memory fences and synchronization primitives
  • Master prefetching and cache optimization strategies

Part X: Performance Analysis and Optimization (Puzzles 30-32) ✅

  • Find bottlenecks by profiling GPU kernels
  • Optimize occupancy and resource utilization
  • Eliminate shared memory bank conflicts

Part XI: Advanced GPU Features (Puzzles 33-34) ✅

  • Learn tensor core programming for AI workloads
  • Learn cluster programming on modern GPUs

Unlike conventional approaches, this book first builds understanding through low-level memory manipulation, then progressively transitions to Mojo's LayoutTensor abstraction. This gives you both a deep understanding of GPU memory patterns and practical knowledge of modern tensor-based approaches.

Ready to begin?

You have seen why GPU programming matters, why Mojo is well suited to it, and how learning through puzzles works. Let's get started.

Next step: see the puzzle usage guide for environment setup, system requirements, and how to run your first puzzle.

How to use the puzzles

Each puzzle follows a consistent structure designed to build your skills step by step:

  • Overview: defines the problem and introduces the key concepts
  • Configuration: explains the technical setup and memory organization
  • Code to complete: an implementation template in problems/pXX/ with the parts you need to fill in marked
  • Hints: strategic hints to consult when needed, without giving away the answer
  • Solution: a comprehensive analysis including performance considerations and concept explanations

The puzzles grow steadily more complex, each building on concepts from earlier ones. Advanced puzzles assume knowledge from the puzzles that precede them, so working through them in order is recommended.

Running the code

Every puzzle includes a test framework that compares your implementation's output against the expected results. Each puzzle explains how to run it and verify your work.

Prerequisites

System requirements

First, make sure your system meets the system requirements.

Supported GPUs

You need a supported GPU to run the puzzles. After completing the environment setup, you can check GPU compatibility with the gpu-specs command described in the setup section below.

Operating system

[!NOTE] This section explains how to set up GPU support for each operating system.

Windows (WSL2) with NVIDIA

To set up an NVIDIA GPU under Windows Subsystem for Linux (WSL2, e.g. Ubuntu), see the NVIDIA CUDA on WSL guide.

The key step is installing the NVIDIA CUDA driver for Windows, which fully supports WSL2. Once the NVIDIA GPU driver is installed on Windows, CUDA is immediately usable inside WSL2: the Windows host's CUDA driver is exposed inside WSL2 as a libcuda.so stub, so you must not install a separate NVIDIA Linux GPU driver inside WSL2.

๋“œ๋ผ์ด๋ฒ„ ์„ค์น˜ ํ›„ ์ •์ƒ ๋™์ž‘์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

Windows์—์„œ ํ™•์ธ: PowerShell์„ ์—ฝ๋‹ˆ๋‹ค (WSL์ด ์•„๋‹™๋‹ˆ๋‹ค)

nvidia-smi

WSL ๋‚ด๋ถ€์—์„œ ํ™•์ธ: (๋จผ์ € WSL์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ: wsl -d Ubuntu)

ls -l /usr/lib/wsl/lib/nvidia-smi
/usr/lib/wsl/lib/nvidia-smi

Pixi์—์„œ ์„ค์ •์„ ํ™•์ธํ•˜๊ณ , ํ•„์š”์‹œ ๋ˆ„๋ฝ๋œ ์š”๊ตฌ์‚ฌํ•ญ์„ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค (์˜ˆ: cuda-gdb ๋””๋ฒ„๊น…์šฉ)

pixi run nvidia-smi
pixi run setup-cuda-gdb
pixi run mojo debug --help
pixi run cuda-gdb --version

WSL์—์„œ๋Š” VS Code๋ฅผ ์—๋””ํ„ฐ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Windows์—์„œ https://code.visualstudio.com/์„ ํ†ตํ•ด VS Code๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฐ ๋‹ค์Œ Remote - WSL ํ™•์žฅ์„ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค.

[!NOTE] ํผ์ฆ 1-15๋Š” ๋ชจ๋‘ WSL๊ณผ Linux์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

Native Linux with NVIDIA

First check your GPU and Ubuntu version (supported Ubuntu LTS releases: 20.04, 22.04, 24.04):

lspci | grep -i nvidia
lsb_release -a

Install the NVIDIA driver (required):

sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot

On Linux you can use VS Code as your editor. To install it from the VS Code APT repository:

Import the Microsoft GPG key:

wget -qO- https://packages.microsoft.com/keys/microsoft.asc \
  | gpg --dearmor \
  | sudo tee /usr/share/keyrings/packages.microsoft.gpg > /dev/null

Add the VS Code APT repository:

echo "deb [arch=amd64 signed-by=/usr/share/keyrings/packages.microsoft.gpg] \
https://packages.microsoft.com/repos/code stable main" \
| sudo tee /etc/apt/sources.list.d/vscode.list

Install and verify VS Code:

sudo apt update
sudo apt install code
code --version

[!NOTE] Puzzles 1-15 all work on Linux.

macOS Apple Silicon

osx-arm64 users need the following:

  • macOS 15.0 or later — recommended for best compatibility. Check with pixi run check-macos, and upgrade if it fails.
  • Xcode 16 or later — the minimum requirement. Check with xcodebuild -version.

If running xcrun -sdk macosx metal fails with cannot execute tool 'metal' due to missing Metal toolchain, run:

xcodebuild -downloadComponent MetalToolchain

Afterward, running xcrun -sdk macosx metal again should produce a no input files error, which indicates everything is working.

[!NOTE] Puzzles 1-8 and 11-15 currently work on macOS. Support for more puzzles is on the way!

ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ง€์‹

๋‹ค์Œ์— ๋Œ€ํ•œ ๊ธฐ๋ณธ์ ์ธ ์ดํ•ด๊ฐ€ ์žˆ์œผ๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค:

  • ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ์ดˆ (๋ณ€์ˆ˜, ๋ฐ˜๋ณต๋ฌธ, ์กฐ๊ฑด๋ฌธ, ํ•จ์ˆ˜)
  • ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ๊ฐœ๋… (์Šค๋ ˆ๋“œ, ๋™๊ธฐํ™”, ๊ฒฝ์Ÿ ์ƒํƒœ)
  • Mojo ๊ธฐ๋ณธ ๋ฌธ๋ฒ• (ํฌ์ธํ„ฐ ์ž…๋ฌธ ์„น์…˜ ํฌํ•จ)
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ์ดˆ๋ฅผ ๋ฏธ๋ฆฌ ์ฝ์–ด๋‘๋ฉด ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค!

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฒฝํ—˜์ด ์—†์–ด๋„ ๊ดœ์ฐฎ์Šต๋‹ˆ๋‹ค! ํผ์ฆ์„ ํ’€์–ด๊ฐ€๋ฉฐ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ตํž ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Mojo๐Ÿ”ฅ์™€ ํ•จ๊ป˜ GPU ์ปดํ“จํŒ…์˜ ์„ธ๊ณ„๋กœ ๋– ๋‚˜๋ด…์‹œ๋‹ค!

Setting up your environment

  1. Clone the GitHub repository and move into its directory:

    # Clone the repository
    git clone https://github.com/modular/mojo-gpu-puzzles
    cd mojo-gpu-puzzles

  2. Install a package manager for running Mojo 🔥 programs:

    Option 1 (strongly recommended): pixi

    We recommend pixi for this project because it:

    • Provides easy access to Modular's MAX/Mojo packages
    • Handles GPU dependencies automatically
    • Supports both the conda and PyPI ecosystems

    Note: some puzzles work only with pixi

    Install:

    curl -fsSL https://pixi.sh/install.sh | sh

    Update:

    pixi self-update

    Option 2: uv

    Install:

    curl -fsSL https://astral.sh/uv/install.sh | sh

    Update:

    uv self update

    Create a virtual environment:

    uv venv && source .venv/bin/activate

  3. Verify your setup and run your first puzzle:

# Check your GPU specs
pixi run gpu-specs

# Run your first puzzle
# It fails because it's not implemented yet! Follow the book to implement it
pixi run p01             # NVIDIA (default)
pixi run -e amd p01      # AMD GPU
pixi run -e apple p01    # Apple GPU

With uv instead:

# Install GPU-specific dependencies
uv pip install -e ".[nvidia]"  # for NVIDIA GPUs
# or
uv pip install -e ".[amd]"     # for AMD GPUs

# Check your GPU specs
uv run poe gpu-specs

# Run your first puzzle
# It fails because it's not implemented yet! Follow the book to implement it
uv run poe p01

ํผ์ฆ ํ’€๊ธฐ

ํ”„๋กœ์ ํŠธ ๊ตฌ์กฐ

  • problems/: ํ’€์ด๋ฅผ ์ง์ ‘ ๊ตฌํ˜„ํ•˜๋Š” ๊ณณ์ž…๋‹ˆ๋‹ค (์—ฌ๊ธฐ์„œ ์ž‘์—…ํ•ฉ๋‹ˆ๋‹ค!)
  • solutions/: ๋น„๊ต์™€ ํ•™์Šต์„ ์œ„ํ•œ ์ฐธ๊ณ  ํ’€์ด์ž…๋‹ˆ๋‹ค. ์ฑ… ์ „๋ฐ˜์— ๊ฑธ์ณ ํ™œ์šฉ๋ฉ๋‹ˆ๋‹ค

์ž‘์—… ํ๋ฆ„

  1. problems/pXX/์—์„œ ํผ์ฆ ํ…œํ”Œ๋ฆฟ์„ ์—ฝ๋‹ˆ๋‹ค
  2. ์ œ๊ณต๋œ ํ”„๋ ˆ์ž„์›Œํฌ ์•ˆ์— ํ’€์ด๋ฅผ ์ž‘์„ฑํ•ฉ๋‹ˆ๋‹ค
  3. ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•ฉ๋‹ˆ๋‹ค: pixi run pXX ๋˜๋Š” uv run poe pXX (ํ”Œ๋žซํผ์— ๋”ฐ๋ผ -e platform์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ: -e amd)
  4. solutions/pXX/์˜ ์ฐธ๊ณ  ํ’€์ด์™€ ๋น„๊ตํ•˜๋ฉฐ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋ฐฐ์›๋‹ˆ๋‹ค

Key commands

# Run puzzles (specify a platform with -e if needed)
pixi run pXX             # NVIDIA (default), same as `pixi run -e nvidia pXX`
pixi run -e amd pXX      # AMD GPU
pixi run -e apple pXX    # Apple GPU

# Test solutions
pixi run tests           # test all solutions
pixi run tests pXX       # test a specific puzzle

# Manual execution
pixi run mojo problems/pXX/pXX.mojo     # your implementation
pixi run mojo solutions/pXX/pXX.mojo    # reference solution

# Interactive shell
pixi shell               # enter the environment
mojo problems/p01/p01.mojo              # run directly
exit                     # leave the shell

# Development
pixi run format         # format code
pixi task list          # list available commands

With uv instead:

# Note: uv is limited; some chapters require pixi
# Install GPU-specific dependencies:
uv pip install -e ".[nvidia]"  # for NVIDIA GPUs
uv pip install -e ".[amd]"     # for AMD GPUs

# Test solutions
uv run poe tests        # test all solutions
uv run poe tests pXX    # test a specific puzzle

# Manual execution
uv run mojo problems/pXX/pXX.mojo      # your implementation
uv run mojo solutions/pXX/pXX.mojo     # reference solution

GPU ์ง€์› ํ˜„ํ™ฉ

์•„๋ž˜ ํ‘œ๋Š” ํผ์ฆ๋ณ„ GPU ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํผ์ฆ์— ๋”ฐ๋ผ ํ•„์š”ํ•œ GPU ๊ธฐ๋Šฅ๊ณผ ๋ฒค๋”๋ณ„ ๋„๊ตฌ๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

ํผ์ฆNVIDIA GPUAMD GPUApple GPU๋น„๊ณ 
Part I: GPU ๊ธฐ์ดˆ
1 - Mapโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
2 - Zipโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
3 - ๊ฐ€๋“œโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
4 - Map 2Dโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
5 - ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
6 - ๋ธ”๋กโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
7 - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
8 - ์Šคํ…์‹คโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
Part II: ๋””๋ฒ„๊น…
9 - GPU ๋””๋ฒ„๊ฑฐโœ…โŒโŒNVIDIA ์ „์šฉ ๋””๋ฒ„๊น… ๋„๊ตฌ
10 - ์ƒˆ๋‹ˆํƒ€์ด์ €โœ…โŒโŒNVIDIA ์ „์šฉ ๋””๋ฒ„๊น… ๋„๊ตฌ
Part III: GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜
11 - ๋ฆฌ๋•์…˜โœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
12 - ์Šค์บ”โœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
13 - ํ’€๋งโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
14 - ํ•ฉ์„ฑ๊ณฑโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
15 - ํ–‰๋ ฌ ๊ณฑ์…ˆโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
16 - Flashdotโœ…โœ…โœ…๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด
Part IV: MAX ๊ทธ๋ž˜ํ”„
17 - ์ปค์Šคํ…€ Opโœ…โœ…โœ…MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ
18 - ์†Œํ”„ํŠธ๋งฅ์Šคโœ…โœ…โœ…MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ
19 - ์–ดํ…์…˜โœ…โœ…โœ…MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ
Part V: PyTorch ํ†ตํ•ฉ
20 - Torch ๋ธŒ๋ฆฟ์ง€โœ…โœ…โŒPyTorch ํ†ตํ•ฉ
21 - ์˜คํ† ๊ทธ๋ž˜๋“œโœ…โœ…โŒPyTorch ํ†ตํ•ฉ
22 - ํ“จ์ „โœ…โœ…โŒPyTorch ํ†ตํ•ฉ
Part VI: ํ•จ์ˆ˜ํ˜• ํŒจํ„ด
23 - ํ•จ์ˆ˜ํ˜•โœ…โœ…โœ…๊ณ ๊ธ‰ Mojo ํŒจํ„ด
Part VII: ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ
24 - ์›Œํ”„ ํ•ฉ๊ณ„โœ…โœ…โœ…์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ
25 - ์›Œํ”„ ํ†ต์‹ โœ…โœ…โœ…์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ
26 - ๊ณ ๊ธ‰ ์›Œํ”„โœ…โœ…โœ…์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ
Part VIII: ๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ
27 - ๋ธ”๋ก ์—ฐ์‚ฐโœ…โœ…โœ…๋ธ”๋ก ๋‹จ์œ„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด
Part IX: ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ
28 - ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌโœ…โœ…โœ…๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
29 - ๋ฐฐ๋ฆฌ์–ดโœ…โŒโŒNVIDIA ์ „์šฉ ๊ณ ๊ธ‰ ๋™๊ธฐํ™”
Part X: ์„ฑ๋Šฅ ๋ถ„์„
30 - ํ”„๋กœํŒŒ์ผ๋งโœ…โŒโŒNVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ (NSight)
31 - ์ ์œ ์œจโœ…โŒโŒNVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ
32 - ๋ฑ…ํฌ ์ถฉ๋Œโœ…โŒโŒNVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ
Part XI: ์ตœ์‹  GPU ๊ธฐ๋Šฅ
33 - ํ…์„œ ์ฝ”์–ดโœ…โŒโŒNVIDIA ํ…์„œ ์ฝ”์–ด ์ „์šฉ
34 - ํด๋Ÿฌ์Šคํ„ฐโœ…โŒโŒNVIDIA ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ

Legend

  • ✅ Supported: the puzzle works on that platform
  • ❌ Not supported: requires platform-specific features

Platform notes

NVIDIA GPUs (full support)

  • All puzzles (1-34) work on CUDA-capable NVIDIA GPUs
  • Requires the CUDA toolkit and a compatible driver
  • Offers the most complete learning experience, with every feature available

AMD GPUs (broad support)

  • Most puzzles (1-8, 11-29) work via ROCm
  • Not supported: debugging tools (9-10), profiling (30-32), tensor cores (33-34)
  • Lets you study GPU programming broadly, including advanced algorithms and memory patterns

Apple GPUs (basic support)

  • Supports the fundamentals (1-8, 11-18) and some advanced puzzles (23-27)
  • Not supported: most advanced features, plus debugging and profiling tools
  • Well suited to learning the fundamental patterns of GPU programming

Future support plans: We are steadily expanding tooling and platform support for AMD and Apple GPUs. Features not yet supported, such as debugging tools, profiling capabilities, and advanced GPU operations, will be included in future releases. We keep improving cross-platform compatibility, so check back for updates.

GPU resources

Free cloud GPU platforms

If you don't have a local GPU, you can use cloud platforms that offer free GPU access:

Google Colab

Google Colab provides free GPU access, but with some limitations for Mojo GPU programming:

Available GPUs:

  • Tesla T4 (older-generation Turing architecture)
  • Tesla V100 (limited availability)

Limitations for Mojo GPU Puzzles:

  • Older GPU architecture: T4 GPUs may be incompatible with advanced Mojo GPU features
  • Session time limits: sessions disconnect automatically after at most 12 hours
  • Limited debugging support: NVIDIA debugging tools (puzzles 9-10) may not be fully usable
  • Package installation restrictions: installing Mojo/MAX may require workarounds
  • Performance limits: shared infrastructure makes consistent benchmarking difficult

Recommended use: learning basic GPU programming concepts (puzzles 1-8, 11-15) and foundational patterns.

Kaggle Notebooks

Kaggle offers more generous free GPU time than Colab:

Available GPUs:

  • Tesla T4 (30 hours free per week)
  • P100 (limited availability)

Advantages over Colab:

  • More generous quota: 30 hours per week, unlike Colab's daily session limits
  • Auto-save: notebooks are saved automatically
  • More stable environment: package installation is more reliable

Limitations for Mojo GPU Puzzles:

  • GPU architecture constraints: the same T4 advanced-feature compatibility issues as Colab
  • Limited debugging tools: NVIDIA profiling and debugging tools (puzzles 9-10, 30-32) are unavailable
  • Mojo installation complexity: you must set up the Mojo environment manually
  • No cluster programming support: advanced puzzles (33-34) won't run

Recommended use: longer sessions on basic GPU programming (puzzles 1-16).

Recommendations

  • Full curriculum: with an NVIDIA GPU you can work through every puzzle (all 34)
  • Broad coverage: an AMD GPU covers most of the material (27 of 34)
  • Fundamentals: an Apple GPU covers the core concepts (13 of 34)
  • Free platforms: Google Colab/Kaggle cover beginner-to-intermediate concepts (puzzles 1-16)
  • Debugging and profiling: debugging tools and performance analysis require an NVIDIA GPU
  • Modern GPU features: tensor cores and cluster programming require an NVIDIA GPU

Development

See the README for details.

Join the community

Subscribe to updates · Modular Forum · Discord

Talk about GPU programming, share your solutions, and help each other in the community.

๐Ÿ† ๋ณด์ƒ์„ ๋ฐ›์•„๊ฐ€์„ธ์š”

ํผ์ฆ์„ ๋ชจ๋‘ ํ’€์–ด๋ณด์…จ๋‚˜์š”? ์—ฌ๋Ÿฌ๋ถ„์˜ ๋„์ „์„ ์ถ•ํ•˜ํ•˜๋ฉฐ ๋ฌด๋ฃŒ ์Šคํ‹ฐ์ปค ํŒฉ์„ ์„ ๋ฌผ๋กœ ๋“œ๋ ค์š”!

๋ฌด๋ฃŒ ์Šคํ‹ฐ์ปค๋ฅผ ๋ฐ›๋Š” ๋ฐฉ๋ฒ•:

  1. GitHub ์ €์žฅ์†Œ https://github.com/modular/mojo-gpu-puzzles๋ฅผ Forkํ•ฉ๋‹ˆ๋‹ค
  2. ํผ์ฆ ์†”๋ฃจ์…˜์„ ์ž‘์„ฑํ•ด์„œ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค
  3. ์ด ์–‘์‹์œผ๋กœ ์ œ์ถœํ•˜๋ฉด Modular ํ•œ์ • ์Šคํ‹ฐ์ปค๋ฅผ ๋ณด๋‚ด๋“œ๋ ค์š”!

ํ˜„์žฌ๋Š” ๋ถ๋ฏธ ์ง€์—ญ์œผ๋กœ๋งŒ ๋ฐฐ์†ก์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์ง€์—ญ์— ๊ณ„์‹  ๋ถ„๋“ค๋„ ์†”๋ฃจ์…˜์„ ์ œ์ถœํ•ด ์ฃผ์„ธ์š” โ€“ ๋ฐฐ์†ก ๋ฒ”์œ„๋ฅผ ๋„“ํ˜€๊ฐ€๊ณ  ์žˆ์œผ๋‹ˆ, ๊ฐ€๋Šฅํ•ด์ง€๋ฉด ๊ผญ ๋ณด์ƒ์„ ๋ณด๋‚ด๋“œ๋ฆด๊ฒŒ์š”.

Puzzle 1: Map

Overview

This puzzle covers the fundamental idea of GPU parallelism: each thread takes one data element and processes it simultaneously with all the others. Implement a kernel that adds 10 to each element of vector a and stores the result in output.

Note: one thread is assigned per position.


Key concepts

  • The basic structure of a GPU kernel
  • One-to-one mapping between threads and data
  • Memory access patterns
  • Array operations on the GPU

For each position \(i\): \[\Large output[i] = a[i] + 10\]

What this puzzle covers

🔰 The raw memory approach

Learn GPU fundamentals by working with memory directly.

💡 Preview: the modern approach with LayoutTensor

See how LayoutTensor simplifies GPU programming, letting you write safer, cleaner code.

💡 Tip: learning both approaches gives you a deeper understanding of modern GPU programming patterns.

Key concepts

What you'll learn in this puzzle:

  • Basic GPU kernel structure

  • Thread indexing with thread_idx.x

  • Simple parallel operations

  • Parallelism: each thread executes independently

  • Thread indexing: access the element at position i = thread_idx.x

  • Memory access: read from a[i] and write to output[i]

  • Data independence: each output depends only on its corresponding input

Code to complete

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32


fn add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View the full code: problems/p01/p01.mojo

Hints
  1. Store thread_idx.x in i
  2. Add 10 to a[i]
  3. Store the result in output[i]

Running the code

To test your solution, run the following command in your terminal:

pixi run p01             # NVIDIA (default)
pixi run -e amd p01      # AMD GPU
pixi run -e apple p01    # Apple GPU
uv run poe p01           # with uv

If you haven't solved the puzzle yet, the output will look like this:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + 10.0


This solution:

  • Gets the thread index with i = thread_idx.x
  • Adds 10 to the input value: output[i] = a[i] + 10.0

Why consider LayoutTensor?

Looking at the implementation above, you can spot some potential problems:

The current approach

i = thread_idx.x
output[i] = a[i] + 10.0

It works fine for 1D arrays, but what about these situations?

  • When you need to handle 2D or 3D data
  • When you need to deal with different memory layouts
  • When you need to guarantee coalesced memory access

์•ž์œผ๋กœ์˜ ๋„์ „ ๋ฏธ๋ฆฌ๋ณด๊ธฐ

ํผ์ฆ์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์€ ์ ์  ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค:

# ์ดํ›„ ํผ์ฆ์—์„œ ๋‹ค๋ฃฐ 2D ์ธ๋ฑ์‹ฑ
idx = row * WIDTH + col

# 3D ์ธ๋ฑ์‹ฑ
idx = (batch * HEIGHT + row) * WIDTH + col

# ํŒจ๋”ฉ์ด ์žˆ๋Š” ๊ฒฝ์šฐ
idx = (batch * padded_height + row) * padded_width + col
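These flattening formulas can be sanity-checked on the host. The sketch below uses NumPy's own C-order (row-major) flattening as an oracle for the 3D formula; the shape is an arbitrary small example.

```python
import numpy as np

BATCH, HEIGHT, WIDTH = 2, 3, 4
arr = np.arange(BATCH * HEIGHT * WIDTH).reshape(BATCH, HEIGHT, WIDTH)
flat = arr.ravel()  # C-order flat view, the layout GPU buffers typically use

for batch in range(BATCH):
    for row in range(HEIGHT):
        for col in range(WIDTH):
            # The 3D formula from above:
            idx = (batch * HEIGHT + row) * WIDTH + col
            assert flat[idx] == arr[batch, row, col]

print("3D indexing formula verified")
```

Writing (and getting wrong) this arithmetic by hand is precisely the bookkeeping that LayoutTensor takes over.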

A LayoutTensor preview

With LayoutTensor, these cases become much cleaner:

# Preview - don't worry about this syntax yet!
output[i, j] = a[i, j] + 10.0  # 2D indexing
output[b, i, j] = a[b, i, j] + 10.0  # 3D indexing

You'll study LayoutTensor in detail in Puzzle 4, where these concepts become essential. For now, focus on understanding:

  • Basic thread indexing
  • Simple memory access patterns
  • One-to-one mapping between threads and data

💡 Key point: direct indexing works well for simple cases, but complex GPU programming patterns soon call for more sophisticated tools.

Puzzle 2: Zip

Overview

Implement a kernel that adds each position of vector a and vector b and stores the result in output.

Note: one thread is assigned per position.


Key concepts

What you'll learn in this puzzle:

  • Parallel processing of multiple input arrays
  • Element-wise operations over multiple inputs
  • Thread-to-data mapping across arrays
  • Memory access patterns with multiple arrays

For each thread \(i\): \[\Large output[i] = a[i] + b[i]\]

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

Thread 0:  a[0] + b[0] โ†’ output[0]
Thread 1:  a[1] + b[1] โ†’ output[1]
Thread 2:  a[2] + b[2] โ†’ output[2]
...

๐Ÿ’ก ์ฐธ๊ณ : ์ด์ œ ์ปค๋„์—์„œ ์„ธ ๊ฐœ์˜ ๋ฐฐ์—ด(a, b, output)์„ ๋‹ค๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์—ฐ์‚ฐ์ด ๋ณต์žกํ•ด์งˆ์ˆ˜๋ก ์—ฌ๋Ÿฌ ๋ฐฐ์—ด์— ๋Œ€ํ•œ ์ ‘๊ทผ์„ ๊ด€๋ฆฌํ•˜๊ธฐ๊ฐ€ ์ ์  ์–ด๋ ค์›Œ์ง‘๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32


fn add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    # FILL ME IN (roughly 1 line)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p02/p02.mojo

ํŒ
  1. thread_idx.x๋ฅผ i์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  2. a[i]์™€ b[i]๋ฅผ ๋”ํ•ฉ๋‹ˆ๋‹ค
  3. ๊ฒฐ๊ณผ๋ฅผ output[i]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p02
pixi run -e amd p02
pixi run -e apple p02
uv run poe p02

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 2.0, 4.0, 6.0])

์†”๋ฃจ์…˜

fn add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + b[i]


์ด ์†”๋ฃจ์…˜์€:

  • i = thread_idx.x๋กœ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค
  • ๋‘ ๋ฐฐ์—ด์˜ ๊ฐ’์„ ๋”ํ•ฉ๋‹ˆ๋‹ค: output[i] = a[i] + b[i]

์•ž์œผ๋กœ ๋‹ค๋ฃฐ ๋‚ด์šฉ

์ง์ ‘ ์ธ๋ฑ์‹ฑ์€ ๊ฐ„๋‹จํ•œ ์š”์†Œ๋ณ„ ์—ฐ์‚ฐ์—์„œ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๋‹ค์Œ ์ƒํ™ฉ์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ๋ฐฐ์—ด์˜ ๋ ˆ์ด์•„์›ƒ์ด ์„œ๋กœ ๋‹ค๋ฅด๋‹ค๋ฉด?
  • ํ•œ ๋ฐฐ์—ด์„ ๋‹ค๋ฅธ ๋ฐฐ์—ด์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ•œ๋‹ค๋ฉด?
  • ์—ฌ๋Ÿฌ ๋ฐฐ์—ด์—์„œ ๋ณ‘ํ•ฉ(coalesced) ์ ‘๊ทผ์„ ์–ด๋–ป๊ฒŒ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์„๊นŒ?

์ด๋Ÿฌํ•œ ์งˆ๋ฌธ๋“ค์€ Puzzle 4์˜ LayoutTensor ์•Œ์•„๋ณด๊ธฐ์—์„œ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

Puzzle 3: ๊ฐ€๋“œ

๊ฐœ์š”

๋ฒกํ„ฐ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜๋ณด๋‹ค ๋งŽ์•„์„œ, ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๋Š” ์ฒ˜๋ฆฌํ•  ๋ฐ์ดํ„ฐ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜์ง€ ์•Š๋„๋ก ๋ฐฉ์ง€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Guard 시각화

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜์™€ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ ๋ถˆ์ผ์น˜ ์ฒ˜๋ฆฌ
  • ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ฐฉ์ง€
  • GPU ์ปค๋„์—์„œ ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ์‚ฌ์šฉ
  • ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ˆ˜ํ•™์  ํ‘œํ˜„

๊ฐ ์Šค๋ ˆ๋“œ \(i\)์— ๋Œ€ํ•ด: \[\Large \text{if}\ i < \text{size}: output[i] = a[i] + 10\]

๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „ ํŒจํ„ด

Thread 0 (i=0):  if 0 < size:  output[0] = a[0] + 10  โœ“ Valid
Thread 1 (i=1):  if 1 < size:  output[1] = a[1] + 10  โœ“ Valid
Thread 2 (i=2):  if 2 < size:  output[2] = a[2] + 10  โœ“ Valid
Thread 3 (i=3):  if 3 < size:  output[3] = a[3] + 10  โœ“ Valid
Thread 4 (i=4):  if 4 < size:  โŒ Skip (out of bounds)
Thread 5 (i=5):  if 5 < size:  โŒ Skip (out of bounds)

๐Ÿ’ก ์ฐธ๊ณ : ๋‹ค์Œ ์ƒํ™ฉ์—์„œ ๊ฒฝ๊ณ„(boundary) ๊ฒ€์‚ฌ๋Š” ์ ์  ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค:

  • ๋‹ค์ฐจ์› ๋ฐฐ์—ด
  • ๋‹ค์–‘ํ•œ ๋ฐฐ์—ด ํ˜•ํƒœ
  • ๋ณต์žกํ•œ ์ ‘๊ทผ ํŒจํ„ด

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = 8
comptime dtype = DType.float32


fn add_10_guard(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p03/p03.mojo

ํŒ
  1. thread_idx.x๋ฅผ i์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if i < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: output[i] = a[i] + 10.0

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p03
pixi run -e amd p03
pixi run -e apple p03
uv run poe p03

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

fn add_10_guard(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  • i = thread_idx.x๋กœ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค
  • if i < size๋กœ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ€๋“œ ๋‚ด๋ถ€: ์ž…๋ ฅ๊ฐ’์— 10์„ ๋”ํ•ฉ๋‹ˆ๋‹ค

๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์—†์ด๋„ ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผ๋˜๋Š” ์ด์œ ๊ฐ€ ๊ถ๊ธˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ํ…Œ์ŠคํŠธ ํ†ต๊ณผ๊ฐ€ ์ฝ”๋“œ์˜ ์•ˆ์ „์„ฑ์ด๋‚˜ ๋ฏธ์ •์˜ ๋™์ž‘(Undefined Behavior) ๋ถ€์žฌ๋ฅผ ๋ณด์žฅํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค๋Š” ์ ์„ ํ•ญ์ƒ ๊ธฐ์–ตํ•˜์„ธ์š”. Puzzle 10์—์„œ ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ฅผ ์‚ดํŽด๋ณด๊ณ , ์•ˆ์ „์„ฑ ๋ฒ„๊ทธ๋ฅผ ์žก๋Š” ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ด ๋ด…๋‹ˆ๋‹ค.

์•ž์œผ๋กœ ๋‹ค๋ฃฐ ๋‚ด์šฉ

๊ฐ„๋‹จํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ๊ธฐ์„œ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๋‹ค์Œ ์ƒํ™ฉ์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • 2D/3D ๋ฐฐ์—ด์˜ ๊ฒฝ๊ณ„๋Š” ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ?
  • ๋‹ค์–‘ํ•œ ํ˜•ํƒœ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋ ค๋ฉด?
  • ํŒจ๋”ฉ์ด๋‚˜ ๊ฐ€์žฅ์ž๋ฆฌ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋ฉด?

๋ณต์žก๋„๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” ์˜ˆ์‹œ:

# ํ˜„์žฌ: 1D ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
if i < size: ...

# ๊ณง ๋‹ค๋ฃฐ ๋‚ด์šฉ: 2D ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
if i < height and j < width: ...

# ์ดํ›„: ํŒจ๋”ฉ์ด ์žˆ๋Š” 3D
if (i < height and j < width and k < depth and
    i >= padding and j >= padding): ...

์ด๋Ÿฐ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ํŒจํ„ด์€ Puzzle 4์˜ LayoutTensor ์•Œ์•„๋ณด๊ธฐ์—์„œ ๋ฐฐ์šฐ๋ฉด ํ›จ์”ฌ ๊น”๋”ํ•ด์ง‘๋‹ˆ๋‹ค. LayoutTensor๋Š” ํ˜•ํƒœ ๊ด€๋ฆฌ ๊ธฐ๋Šฅ์„ ๊ธฐ๋ณธ์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Puzzle 4: 2D Map

๊ฐœ์š”

2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

2D 행렬 매핑 시각화

ํ•ต์‹ฌ ๊ฐœ๋…

  • 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
  • GPU์—์„œ์˜ ํ–‰๋ ฌ ์—ฐ์‚ฐ
  • ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ํŒจํ„ด

๊ฐ ์œ„์น˜ \((i,j)\)์— ๋Œ€ํ•ด: \[\Large output[i,j] = a[i,j] + 10\]

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ ๊ทœ์น™

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ 2D ํ–‰๋ ฌ์„ ๋‹ค๋ฃฐ ๋•Œ๋Š” ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค์™€ ํ–‰๋ ฌ ์ขŒํ‘œ ์‚ฌ์ด์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋งคํ•‘์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  • thread_idx.y๋Š” ํ–‰(row) ์ธ๋ฑ์Šค
  • thread_idx.x๋Š” ์—ด(column) ์ธ๋ฑ์Šค
2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

์ด ๊ทœ์น™์€ ๋‹ค์Œ๊ณผ ์ž˜ ๋งž์Šต๋‹ˆ๋‹ค:

  1. ํ–‰๋ ฌ ์œ„์น˜๋ฅผ (row, column)์œผ๋กœ ์“ฐ๋Š” ํ‘œ์ค€ ์ˆ˜ํ•™ ํ‘œ๊ธฐ๋ฒ•
  2. ํ–‰์€ ์œ„์—์„œ ์•„๋ž˜๋กœ(y์ถ•), ์—ด์€ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ(x์ถ•) ๊ฐ€๋Š” ํ–‰๋ ฌ์˜ ์‹œ๊ฐ์  ๊ตฌ์กฐ
  3. ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ํ–‰๋ ฌ ๊ตฌ์กฐ์— ๋งž์ถฐ 2D ๊ทธ๋ฆฌ๋“œ๋กœ ๊ตฌ์„ฑํ•˜๋Š” ์ผ๋ฐ˜์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด

์—ญ์‚ฌ์  ๋ฐฐ๊ฒฝ

๊ทธ๋ž˜ํ”ฝ์ด๋‚˜ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ์—์„œ๋Š” ๋ณดํ†ต \((x,y)\) ์ขŒํ‘œ๋ฅผ ์“ฐ์ง€๋งŒ, ํ–‰๋ ฌ ์—ฐ์‚ฐ์—์„œ๋Š” ์ „ํ†ต์ ์œผ๋กœ (row, column) ์ธ๋ฑ์‹ฑ์„ ์จ์™”์Šต๋‹ˆ๋‹ค. ์ดˆ๊ธฐ ์ปดํ“จํ„ฐ๊ฐ€ 2D ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์ฒ˜๋ฆฌํ•˜๋˜ ๋ฐฉ์‹์—์„œ ๋น„๋กฏ๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค: ์œ„์—์„œ ์•„๋ž˜๋กœ ํ•œ ์ค„์”ฉ, ๊ฐ ์ค„์€ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ฝ์—ˆ์ฃ . ์ด๋Ÿฐ ํ–‰ ์šฐ์„ (row-major) ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹๊ณผ ๋งž์•„์„œ CPU์™€ GPU ๋ชจ๋‘์—์„œ ํšจ์œจ์ ์ž„์ด ์ž…์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์šฉ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ๋„์ž…๋์„ ๋•Œ, thread_idx.y๋ฅผ ํ–‰์—, thread_idx.x๋ฅผ ์—ด์— ๋งคํ•‘ํ•œ ๊ฑด ๊ธฐ์กด์— ํ™•๋ฆฝ๋œ ํ–‰๋ ฌ ์ธ๋ฑ์‹ฑ ๊ทœ์น™๊ณผ ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•˜๋ ค๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์„ ํƒ์ด์—ˆ์Šต๋‹ˆ๋‹ค.

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌํ•˜๋ฉด์„œ 2D ์ธ๋ฑ์‹ฑ์ด ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“š LayoutTensor ์•Œ์•„๋ณด๊ธฐ

GPU์—์„œ ๋‹ค์ฐจ์› ๋ฐฐ์—ด ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„ํŽธํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” ๊ฐ•๋ ฅํ•œ ์ถ”์ƒํ™”๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๐Ÿš€ ํ˜„๋Œ€์  2D ์—ฐ์‚ฐ

์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ๊ณผ ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ๊ฐ–์ถ˜ LayoutTensor๋ฅผ ์ง์ ‘ ์จ๋ด…๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : ์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” ๋” ๊น”๋”ํ•˜๊ณ  ์•ˆ์ „ํ•œ GPU ์ฝ”๋“œ๋ฅผ ์œ„ํ•ด LayoutTensor๋ฅผ ์ฃผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฐœ์š”

2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๋‹ค๋ฃจ๊ธฐ (thread_idx.x, thread_idx.y)
  • 2D ์ขŒํ‘œ๋ฅผ 1D ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ
  • 2์ฐจ์›์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ 2D ์Šค๋ ˆ๋“œ ์ขŒํ‘œ \((i,j)\)๋ฅผ ํฌ๊ธฐ \(n \times n\)์ธ ํ–‰ ์šฐ์„  ํ–‰๋ ฌ์˜ ์›์†Œ๋กœ ๋งคํ•‘ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋™์‹œ์— ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๊ฐ€ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜์ง€ ์•Š๋Š”์ง€๋„ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  • 2D ์ธ๋ฑ์‹ฑ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ ์œ ํ•œ \((i,j)\) ์œ„์น˜๋ฅผ ๊ฐ€์ง
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ํ–‰ ์šฐ์„  ์ˆœ์„œ๋กœ 2D๋ฅผ 1D ๋ฉ”๋ชจ๋ฆฌ์— ๋งคํ•‘
  • ๊ฐ€๋“œ ์กฐ๊ฑด: ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: ์Šค๋ ˆ๋“œ \((3 \times 3)\)๊ฐ€ ํ–‰๋ ฌ ์›์†Œ \((2 \times 2)\)๋ณด๋‹ค ๋งŽ์Œ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32


fn add_10_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p04/p04.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€์—์„œ ํ–‰ ์šฐ์„  ๋ฐฉ์‹์œผ๋กœ 10 ๋”ํ•˜๊ธฐ!

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p04
pixi run -e amd p04
pixi run -e apple p04
uv run poe p04

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

fn add_10_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: output[row * size + col] = a[row * size + col] + 10.0
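2D 좌표를 행 우선으로 펼치는 이 솔루션의 동작을 순수 파이썬으로 흉내 낸 스케치입니다. 3×3 스레드 배치가 2×2 행렬을 처리하며, 입력값은 위 expected 출력에 맞춰 가정했습니다.

```python
SIZE = 2
THREADS_PER_BLOCK = (3, 3)  # 데이터(2x2)보다 많은 스레드

def add_10_2d(output, a, size, ty, tx):
    row, col = ty, tx
    if row < size and col < size:  # 두 차원 모두 경계 검사
        output[row * size + col] = a[row * size + col] + 10.0

a = [0.0, 1.0, 2.0, 3.0]          # 2x2 행렬을 행 우선으로 펼친 버퍼
output = [0.0] * (SIZE * SIZE)
for ty in range(THREADS_PER_BLOCK[1]):
    for tx in range(THREADS_PER_BLOCK[0]):
        add_10_2d(output, a, SIZE, ty, tx)

print(output)  # [10.0, 11.0, 12.0, 13.0]
```

9개 스레드 중 4개만 가드를 통과하고, 각 스레드가 `row * size + col`로 고유한 1D 오프셋에 씁니다.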

LayoutTensor ์•Œ์•„๋ณด๊ธฐ

ํผ์ฆ ํ’€์ด๋ฅผ ์ž ์‹œ ๋ฉˆ์ถ”๊ณ , GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋” ์ฆ๊ฒ๊ฒŒ ๋งŒ๋“ค์–ด์ค„ ๊ฐ•๋ ฅํ•œ ์ถ”์ƒํ™”๋ฅผ ๋ฏธ๋ฆฌ ์‚ดํŽด๋ด…์‹œ๋‹ค: ๐Ÿฅ โ€ฆ ๋ฐ”๋กœ LayoutTensor ์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก LayoutTensor๊ฐ€ ์–ด๋–ค ์ผ์„ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ง›๋ณด๊ธฐ๋กœ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ๊ฑธ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์–ด์š” - ํผ์ฆ์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ์ž์„ธํžˆ ์•Œ์•„๋ณผ ๊ฒ๋‹ˆ๋‹ค.

๋ฌธ์ œ: ์ ์  ๋ณต์žกํ•ด์ง€๋Š” ์ฝ”๋“œ

์ง€๊ธˆ๊นŒ์ง€ ๊ฒช์€ ์–ด๋ ค์›€์„ ์‚ดํŽด๋ด…์‹œ๋‹ค:

# Puzzle 1: ๋‹จ์ˆœ ์ธ๋ฑ์‹ฑ
output[i] = a[i] + 10.0

# Puzzle 2: ์—ฌ๋Ÿฌ ๋ฐฐ์—ด ๊ด€๋ฆฌ
output[i] = a[i] + b[i]

# Puzzle 3: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
if i < size:
    output[i] = a[i] + 10.0

์ฐจ์›์ด ๋Š˜์–ด๋‚˜๋ฉด ์ฝ”๋“œ๋Š” ๋” ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค:

# ์ „ํ†ต์ ์ธ 2D ์ธ๋ฑ์‹ฑ (ํ–‰ ์šฐ์„  2D ํ–‰๋ ฌ)
idx = row * WIDTH + col
if row < height and col < width:
    output[idx] = a[idx] + 10.0

ํ•ด๊ฒฐ์ฑ…: LayoutTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ

LayoutTensor๋Š” ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค์„ ๊น”๋”ํ•˜๊ฒŒ ํ•ด๊ฒฐํ•ด์ค๋‹ˆ๋‹ค. ์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ์„ ์‚ด์ง ์—ฟ๋ณด๋ฉด:

  1. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  tensor[i, j] ์‚ฌ์šฉ
  2. ์œ ์—ฐํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ํ–‰ ์šฐ์„ , ์—ด ์šฐ์„ , ํƒ€์ผ ๊ตฌ์„ฑ ์ง€์›
  3. ์„ฑ๋Šฅ ์ตœ์ ํ™”: GPU์— ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ ๋ง›๋ณด๊ธฐ

LayoutTensor๊ฐ€ ํ•  ์ˆ˜ ์žˆ๋Š” ์ผ์„ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋กœ ์‚ดํŽด๋ด…์‹œ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์Šต๋‹ˆ๋‹ค - ์•ž์œผ๋กœ ๋‚˜์˜ฌ ํผ์ฆ์—์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ๊ผผ๊ผผํžˆ ๋‹ค๋ฃฐ ๊ฑฐ์˜ˆ์š”.

๊ธฐ๋ณธ ์‚ฌ์šฉ ์˜ˆ์‹œ

from layout import Layout, LayoutTensor

# ๋ ˆ์ด์•„์›ƒ ์ •์˜
comptime HEIGHT = 2
comptime WIDTH = 3
comptime layout = Layout.row_major(HEIGHT, WIDTH)

# ํ…์„œ ์ƒ์„ฑ
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())

# ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์š”์†Œ ์ ‘๊ทผ
tensor[0, 0] = 1.0  # ์ฒซ ๋ฒˆ์งธ ์š”์†Œ
tensor[1, 2] = 2.0  # ๋งˆ์ง€๋ง‰ ์š”์†Œ

Layout과 LayoutTensor에 대해 더 알아보려면 Mojo 매뉴얼의 가이드를 참고하세요.

๊ฐ„๋‹จํ•œ ์˜ˆ์ œ

LayoutTensor์˜ ๊ธฐ๋ณธ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋กœ ๋ชจ๋“  ๊ฒƒ์„ ์ •๋ฆฌํ•ด๋ด…์‹œ๋‹ค:

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

comptime HEIGHT = 2
comptime WIDTH = 3
comptime dtype = DType.float32
comptime layout = Layout.row_major(HEIGHT, WIDTH)


fn kernel[
    dtype: DType, layout: Layout
](tensor: LayoutTensor[dtype, layout, MutAnyOrigin]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)


def main() raises:
    ctx = DeviceContext()

    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH)
    a.enqueue_fill(0)
    tensor = LayoutTensor[dtype, layout, MutAnyOrigin](a)
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout], kernel[dtype, layout]](
        tensor, grid_dim=1, block_dim=1
    )

    ctx.synchronize()

๋‹ค์Œ ๋ช…๋ น์–ด๋กœ ์ด ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๋ฉด:

pixi run layout_tensor_intro
pixi run -e amd layout_tensor_intro
pixi run -e apple layout_tensor_intro
uv run poe layout_tensor_intro

Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0

๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์‚ดํŽด๋ด…์‹œ๋‹ค:

  1. ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ์œผ๋กœ 2 x 3 ํ…์„œ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  2. ์ฒ˜์Œ์—๋Š” ๋ชจ๋“  ์š”์†Œ๊ฐ€ 0์ž…๋‹ˆ๋‹ค
  3. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค
  4. ๋ณ€๊ฒฝ ์‚ฌํ•ญ์ด ์ถœ๋ ฅ์— ๋ฐ˜์˜๋ฉ๋‹ˆ๋‹ค

์ด ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋Š” LayoutTensor์˜ ํ•ต์‹ฌ ์žฅ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • ํ…์„œ ์ƒ์„ฑ๊ณผ ์ ‘๊ทผ์„ ์œ„ํ•œ ๊น”๋”ํ•œ ๋ฌธ๋ฒ•
  • ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ
  • ์ž์—ฐ์Šค๋Ÿฌ์šด ๋‹ค์ฐจ์› ์ธ๋ฑ์‹ฑ

์ด ์˜ˆ์ œ๋Š” ๊ฐ„๋‹จํ•˜์ง€๋งŒ, ๊ฐ™์€ ํŒจํ„ด์ด ์•ž์œผ๋กœ ๋‚˜์˜ฌ ํผ์ฆ์˜ ๋ณต์žกํ•œ GPU ์—ฐ์‚ฐ์—๋„ ๊ทธ๋Œ€๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๊ธฐ๋ณธ ๊ฐœ๋…์ด ๋‹ค์Œ์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€ ๋ณด๊ฒŒ ๋  ๊ฑฐ์˜ˆ์š”:

  • ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ GPU ์—ฐ์‚ฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”
  • ๋ณต์žกํ•œ ํƒ€์ผ๋ง ์ „๋žต
  • ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ

LayoutTensor์™€ ํ•จ๊ป˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ฌ์ •์„ ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋๋‚˜์š”? ํผ์ฆ๋กœ ๋“ค์–ด๊ฐ€๋ด…์‹œ๋‹ค!

๐Ÿ’ก ํŒ: ์ง„ํ–‰ํ•˜๋ฉด์„œ ์ด ์˜ˆ์ œ๋ฅผ ๊ธฐ์–ตํ•ด๋‘์„ธ์š” - ์ด ๊ธฐ๋ณธ ๊ฐœ๋…์„ ๋ฐ”ํƒ•์œผ๋กœ ์ ์  ๋” ์ •๊ตํ•œ GPU ํ”„๋กœ๊ทธ๋žจ์„ ๋งŒ๋“ค์–ด๊ฐˆ ๊ฒ๋‹ˆ๋‹ค.

LayoutTensor ๋ฒ„์ „

๊ฐœ์š”

2D LayoutTensor a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • 2D ๋ฐฐ์—ด ์ ‘๊ทผ์— LayoutTensor ์‚ฌ์šฉํ•˜๊ธฐ
  • tensor[i, j]๋กœ ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑํ•˜๊ธฐ
  • LayoutTensor์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•˜์—ฌ ๋‚ด๋ถ€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ถ”์ƒํ™”ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

  • 2D ์ ‘๊ทผ: LayoutTensor๋กœ ์ž์—ฐ์Šค๋Ÿฌ์šด \((i,j)\) ์ธ๋ฑ์‹ฑ
  • ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ƒํ™”: ์ˆ˜๋™ ํ–‰ ์šฐ์„  ๊ณ„์‚ฐ ๋ถˆํ•„์š”
  • ๊ฐ€๋“œ ์กฐ๊ฑด: ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: ์Šค๋ ˆ๋“œ \((3 \times 3)\)๊ฐ€ ํ…์„œ ์›์†Œ \((2 \times 2)\)๋ณด๋‹ค ๋งŽ์Œ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE, SIZE)


fn add_10_2d(
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p04/p04_layout_tensor.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€์—์„œ a[row, col]์— 10 ๋”ํ•˜๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p04_layout_tensor
pixi run -e amd p04_layout_tensor
pixi run -e apple p04_layout_tensor
uv run poe p04_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

fn add_10_2d(
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  • row = thread_idx.y, col = thread_idx.x๋กœ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ด
  • if row < size and col < size๋กœ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ ๋ฐฉ์ง€
  • LayoutTensor์˜ 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: output[row, col] = a[row, col] + 10.0

Puzzle 5: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ

๊ฐœ์š”

๋ฒกํ„ฐ a์™€ b๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(broadcast)๋กœ ๋”ํ•ด 2D ํ–‰๋ ฌ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ž€ ์š”์†Œ๋ณ„ ์—ฐ์‚ฐ์„ ํ•  ๋•Œ ์ €์ฐจ์› ๋ฐฐ์—ด์„ ๊ณ ์ฐจ์› ๋ฐฐ์—ด์˜ ํ˜•์ƒ์— ๋งž๊ฒŒ ์ž๋™์œผ๋กœ ํ™•์žฅํ•˜๋Š” ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ ๋ฉ”๋ชจ๋ฆฌ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์ œํ•˜์ง€ ์•Š๊ณ , ์ถ”๊ฐ€ ์ฐจ์›์— ๊ฑธ์ณ ๊ฐ’์„ ๋…ผ๋ฆฌ์ ์œผ๋กœ ๋ฐ˜๋ณตํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, 2D ํ–‰๋ ฌ์˜ ๊ฐ ํ–‰(๋˜๋Š” ์—ด)์— 1D ๋ฒกํ„ฐ๋ฅผ ๋”ํ•  ๋•Œ ๋ฒกํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ณต์‚ฌํ•˜์ง€ ์•Š์•„๋„ ๊ฐ™์€ ์š”์†Œ๊ฐ€ ์ž๋™์œผ๋กœ ๋ฐ˜๋ณต ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

Broadcast ์‹œ๊ฐํ™” Broadcast ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

  • ๋ฒกํ„ฐ๋ฅผ ํ–‰๋ ฌ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•˜๊ธฐ
  • 2D ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ
  • ์„œ๋กœ ๋‹ค๋ฅธ ์ฐจ์› ๊ฐ„ ์—ฐ์‚ฐ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ํŒจํ„ด

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“ LayoutTensor ๋ฒ„์ „

์„œ๋กœ ๋‹ค๋ฅธ ์ฐจ์› ๊ฐ„ ์—ฐ์‚ฐ์„ LayoutTensor๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : ์ˆ˜๋™ ์ธ๋ฑ์‹ฑ๊ณผ ๋น„๊ตํ–ˆ์„ ๋•Œ LayoutTensor๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ฃผ๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”.

๊ฐœ์š”

๋ฒกํ„ฐ a์™€ b๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋”ํ•ด 2D ํ–‰๋ ฌ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • 1D ๋ฒกํ„ฐ๋ฅผ ๊ฐ๊ฐ ๋‹ค๋ฅธ ์ฐจ์› ๋ฐฉํ–ฅ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•˜๊ธฐ
  • 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ ์ˆ˜ํ–‰ํ•˜๊ธฐ
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์—์„œ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ ๋‘ 1D ๋ฒกํ„ฐ์˜ ์›์†Œ๋“ค์„ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ 2D ์ถœ๋ ฅ ํ–‰๋ ฌ์— ๋งคํ•‘ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๊ณ , ์Šค๋ ˆ๋“œ ๊ฒฝ๊ณ„๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: a์˜ ๊ฐ ์›์†Œ๊ฐ€ b์˜ ๊ฐ ์›์†Œ์™€ ๊ฒฐํ•ฉ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: \(2 \times 2\) ์ถœ๋ ฅ์— \((3 \times 3)\) ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ ์‚ฌ์šฉ
  • ๋ฒกํ„ฐ ์ ‘๊ทผ: a์™€ b๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ ‘๊ทผ ํŒจํ„ด ์‚ฌ์šฉ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ํ–‰๋ ฌ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์Šค๋ ˆ๋“œ๋ฅผ ๊ฐ€๋“œ๋กœ ์ฒ˜๋ฆฌ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32


fn broadcast_add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p05/p05.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: a์™€ b ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ• ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p05
pixi run -e amd p05
pixi run -e apple p05
uv run poe p05

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 2.0, 11.0, 12.0])

์†”๋ฃจ์…˜

fn broadcast_add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[col] + b[row]


LayoutTensor ์ถ”์ƒํ™” ์—†์ด GPU ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์˜ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ์—์„œ ํ–‰๋ ฌ๋กœ ๋งคํ•‘

    • thread_idx.y๋กœ ํ–‰, thread_idx.x๋กœ ์—ด์— ์ ‘๊ทผ
    • 2D ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์ถœ๋ ฅ ํ–‰๋ ฌ ์›์†Œ์— ์ง์ ‘ ๋งคํ•‘
    • 3ร—3 ๊ทธ๋ฆฌ๋“œ์˜ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ๋ฅผ 2ร—2 ์ถœ๋ ฅ์— ๋งž๊ฒŒ ์ฒ˜๋ฆฌ
  2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ž‘๋™ ๋ฐฉ์‹

    • ๋ฒกํ„ฐ a๋Š” ์ˆ˜ํ‰ ๋ฐฉํ–ฅ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๊ฐ ํ–‰์—์„œ ๋™์ผํ•œ a[col] ์‚ฌ์šฉ
    • ๋ฒกํ„ฐ b๋Š” ์ˆ˜์ง ๋ฐฉํ–ฅ์œผ๋กœ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๊ฐ ์—ด์—์„œ ๋™์ผํ•œ b[row] ์‚ฌ์šฉ
    • ๋‘ ๋ฒกํ„ฐ๋ฅผ ๋”ํ•ด ์ถœ๋ ฅ ์ƒ์„ฑ
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๋‹จ์ผ ๊ฐ€๋“œ ์กฐ๊ฑด row < size and col < size๋กœ ๋‘ ์ฐจ์› ๋ชจ๋‘ ์ฒ˜๋ฆฌ
    • ์ž…๋ ฅ ๋ฒกํ„ฐ์™€ ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๋ฐฉ์ง€
    • 3ร—3 ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๊ฐ€ 2ร—2 ๋ฐ์ดํ„ฐ๋ณด๋‹ค ํฌ๋ฏ€๋กœ ๋ฐ˜๋“œ์‹œ ํ•„์š”

LayoutTensor ๋ฒ„์ „๊ณผ ๋น„๊ตํ•ด์„œ ๋™์ผํ•œ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ถ”์ƒํ™”๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์„ ์–ผ๋งˆ๋‚˜ ๋‹จ์ˆœํ•˜๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”.

LayoutTensor ๋ฒ„์ „

๊ฐœ์š”

1D LayoutTensor a์™€ b๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋”ํ•ด 2D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์— LayoutTensor ์‚ฌ์šฉํ•˜๊ธฐ
  • ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ ๋‹ค๋ฃจ๊ธฐ
  • LayoutTensor๋กœ 2D ์ธ๋ฑ์‹ฑ ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ \((1, n)\)์™€ \((n, 1)\)์„ \((n,n)\)์œผ๋กœ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

  • ํ…์„œ ํฌ๊ธฐ: ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋Š” \((1, n)\)๊ณผ \((n, 1)\)
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๋‘ ์ฐจ์›์„ ๊ฒฐํ•ฉํ•ด \((n,n)\) ์ถœ๋ ฅ ์ƒ์„ฑ
  • ๊ฐ€๋“œ ์กฐ๊ฑด: ์ถœ๋ ฅ ํฌ๊ธฐ์— ๋Œ€ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: ํ…์„œ ์›์†Œ \((2 \times 2)\)๋ณด๋‹ค ์Šค๋ ˆ๋“œ \((3 \times 3)\)๊ฐ€ ๋งŽ์Œ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime out_layout = Layout.row_major(SIZE, SIZE)
comptime a_layout = Layout.row_major(1, SIZE)
comptime b_layout = Layout.row_major(SIZE, 1)


fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p05/p05_layout_tensor.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: LayoutTensor๋กœ a์™€ b ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ• ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p05_layout_tensor
pixi run -e amd p05_layout_tensor
pixi run -e apple p05_layout_tensor
uv run poe p05_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 2.0, 11.0, 12.0])

์†”๋ฃจ์…˜

fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row, col] = a[0, col] + b[row, 0]


LayoutTensor ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์™€ GPU ์Šค๋ ˆ๋“œ ๋งคํ•‘์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ์—์„œ ํ–‰๋ ฌ๋กœ ๋งคํ•‘

    • thread_idx.y๋กœ ํ–‰, thread_idx.x๋กœ ์—ด์— ์ ‘๊ทผ
    • ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ์ด ์ถœ๋ ฅ ํ–‰๋ ฌ ๊ตฌ์กฐ์™€ ์ผ์น˜
    • ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ(3ร—3 ๊ทธ๋ฆฌ๋“œ)๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ์ฒ˜๋ฆฌ
  2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ž‘๋™ ๋ฐฉ์‹

    • ์ž…๋ ฅ a์˜ ํฌ๊ธฐ๋Š” (1,n): a[0,col]์ด ํ–‰์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
    • ์ž…๋ ฅ b์˜ ํฌ๊ธฐ๋Š” (n,1): b[row,0]์ด ์—ด์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
    • ์ถœ๋ ฅ์˜ ํฌ๊ธฐ๋Š” (n,n): ๊ฐ ์›์†Œ๋Š” ํ•ด๋‹น ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’๋“ค์˜ ํ•ฉ
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ ์กฐ๊ฑด row < size and col < size๋กœ ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๋ฐฉ์ง€
    • ํ–‰๋ ฌ ๋ฒ”์œ„์™€ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
    • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋•๋ถ„์— a์™€ b์— ๋Œ€ํ•œ ๋ณ„๋„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

์ด ํŒจํ„ด์€ ์ดํ›„ ํผ์ฆ์—์„œ ๋‹ค๋ฃฐ ๋” ๋ณต์žกํ•œ ํ…์„œ ์—ฐ์‚ฐ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

Puzzle 6: ๋ธ”๋ก

๊ฐœ์š”

๋ฒกํ„ฐ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค.

๋ธ”๋ก ์‹œ๊ฐํ™” ๋ธ”๋ก ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  • ์—ฌ๋Ÿฌ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ ์กฐ์œจ
  • ์ „์—ญ ์Šค๋ ˆ๋“œ ์œ„์น˜ ๊ณ„์‚ฐ

์—ฌ๊ธฐ์„œ ํ•ต์‹ฌ์€ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ํ˜‘๋ ฅํ•˜์—ฌ ๋‹จ์ผ ๋ธ”๋ก ์šฉ๋Ÿ‰๋ณด๋‹ค ํฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉด์„œ๋„, ์š”์†Œ์™€ ์Šค๋ ˆ๋“œ ๊ฐ„ ์˜ฌ๋ฐ”๋ฅธ ๋งคํ•‘์„ ์œ ์ง€ํ•˜๋Š” ์›๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 9
comptime BLOCKS_PER_GRID = (3, 1)
comptime THREADS_PER_BLOCK = (4, 1)
comptime dtype = DType.float32


fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p06/p06.mojo

์ฐธ๊ณ : ์ด ํผ์ฆ์˜ LayoutTensor ๋ฒ„์ „์€ ๊ฑฐ์˜ ๋™์ผํ•˜๋ฏ€๋กœ ๋…์ž์—๊ฒŒ ๋งก๊น๋‹ˆ๋‹ค.

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: i = block_dim.x * block_idx.x + thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if i < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: output[i] = a[i] + 10.0

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p06
pixi run -e amd p06
pixi run -e apple p06
uv run poe p06

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

์†”๋ฃจ์…˜

fn add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0


์ด ์†”๋ฃจ์…˜์€ ๋ธ”๋ก ๊ธฐ๋ฐ˜ GPU ์ฒ˜๋ฆฌ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค:

  1. ์ „์—ญ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

    • ๋ธ”๋ก ์ธ๋ฑ์Šค์™€ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฒฐํ•ฉ: block_dim.x * block_idx.x + thread_idx.x

    • ๊ฐ ์Šค๋ ˆ๋“œ๋ฅผ ๊ณ ์œ ํ•œ ์ „์—ญ ์œ„์น˜์— ๋งคํ•‘

    • ๋ธ”๋ก๋‹น 3๊ฐœ ์Šค๋ ˆ๋“œ ์˜ˆ์‹œ:

      Block 0: [0 1 2]
      Block 1: [3 4 5]
      Block 2: [6 7 8]
      
  2. ๋ธ”๋ก ์กฐ์œจ

    • ๊ฐ ๋ธ”๋ก์€ ์—ฐ์†๋œ ๋ฐ์ดํ„ฐ ์ฒญํฌ๋ฅผ ์ฒ˜๋ฆฌ

    • ๋ธ”๋ก ํฌ๊ธฐ(3) < ๋ฐ์ดํ„ฐ ํฌ๊ธฐ(9)์ด๋ฏ€๋กœ ์—ฌ๋Ÿฌ ๋ธ”๋ก ํ•„์š”

    • ๋ธ”๋ก ๊ฐ„ ์ž๋™ ์ž‘์—… ๋ถ„๋ฐฐ:

      Data:    [0 1 2 3 4 5 6 7 8]
      Block 0: [0 1 2]
      Block 1:       [3 4 5]
      Block 2:             [6 7 8]
      
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ ์กฐ๊ฑด i < size๋กœ ๊ฒฝ๊ณ„ ์ผ€์ด์Šค ์ฒ˜๋ฆฌ
    • ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋กœ ๋‚˜๋ˆ„์–ด ๋–จ์–ด์ง€์ง€ ์•Š์„ ๋•Œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ ๋ฐฉ์ง€
    • ๋ฐ์ดํ„ฐ ๋๋ถ€๋ถ„์˜ ๋ถˆ์™„์ „ํ•œ ๋ธ”๋ก ์ฒ˜๋ฆฌ์— ํ•„์ˆ˜
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ๋ณ‘ํ•ฉ(coalesced) ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์š”์†Œ ์ฒ˜๋ฆฌ: output[i] = a[i] + 10.0
    • ๋ธ”๋ก ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉ

์ด ํŒจํ„ด์€ ๋‹จ์ผ ์Šค๋ ˆ๋“œ ๋ธ”๋ก ํฌ๊ธฐ๋ฅผ ์ดˆ๊ณผํ•˜๋Š” ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ์ฒ˜๋ฆฌ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

Puzzle 7: 2D ๋ธ”๋ก

๊ฐœ์š”

ํ–‰๋ ฌ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค.

2D Blocks ์‹œ๊ฐํ™” 2D Blocks ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

  • ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ
  • ๊ทธ๋ฆฌ๋“œ์™€ ๋ธ”๋ก์˜ ์กฐ์œจ
  • ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์นœ ์ธ๋ฑ์‹ฑ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

๐Ÿ”‘ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ ๋ฐฉ์‹

Puzzle 4: 2D Map์˜ ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ 2D๋กœ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

์ „์—ญ ์œ„์น˜ ๊ณ„์‚ฐ:
row = block_dim.y * block_idx.y + thread_idx.y
col = block_dim.x * block_idx.x + thread_idx.x

예를 들어, 2×2 블록으로 이루어진 2×2 그리드로 4×4 공간을 덮으면:

Block (0,0):   Block (1,0):
[0,0  0,1]     [0,2  0,3]
[1,0  1,1]     [1,2  1,3]

Block (0,1):   Block (1,1):
[2,0  2,1]     [2,2  2,3]
[3,0  3,1]     [3,2  3,3]

๊ฐ ์œ„์น˜๋Š” ํ•ด๋‹น ์Šค๋ ˆ๋“œ์˜ ์ „์—ญ ์ธ๋ฑ์Šค (row, col)๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๋ธ”๋ก ์ฐจ์›๊ณผ ์ธ๋ฑ์Šค๊ฐ€ ํ•จ๊ป˜ ์ž‘๋™ํ•˜์—ฌ ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  • 2D ๊ณต๊ฐ„ ์ „์ฒด๋ฅผ ๋นˆํ‹ˆ์—†์ด ์ฒ˜๋ฆฌ
  • ๋ธ”๋ก ๊ฐ„ ๊ฒน์นจ ์—†์Œ
  • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์ˆ˜๋™ ์ธ๋ฑ์‹ฑ์œผ๋กœ ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์นœ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“ LayoutTensor ๋ฒ„์ „

LayoutTensor ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•ด ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : LayoutTensor๊ฐ€ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์–ผ๋งˆ๋‚˜ ๋‹จ์ˆœํ™”ํ•˜๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”.

๊ฐœ์š”

ํ–‰๋ ฌ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • 2D ๋ธ”๋ก๊ณผ ์Šค๋ ˆ๋“œ ๋ฐฐ์น˜ ๋‹ค๋ฃจ๊ธฐ
  • ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ํ–‰๋ ฌ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌํ•˜๊ธฐ
  • 2D ์ธ๋ฑ์Šค์™€ ์„ ํ˜• ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ฐ„ ๋ณ€ํ™˜ํ•˜๊ธฐ

ํ•ต์‹ฌ์€ ํ•˜๋‚˜์˜ ๋ธ”๋ก๋ณด๋‹ค ํฐ 2D ํ–‰๋ ฌ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ์—ฌ๋Ÿฌ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ๋“ค์ด ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ์ž‘๋™ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(5 \times 5\) ์›์†Œ
  • 2D ๋ธ”๋ก: ๊ฐ ๋ธ”๋ก์ด \(3 \times 3\) ์˜์—ญ ์ฒ˜๋ฆฌ
  • ๊ทธ๋ฆฌ๋“œ ๋ ˆ์ด์•„์›ƒ: \(2 \times 2\) ๊ทธ๋ฆฌ๋“œ์— ๋ธ”๋ก ๋ฐฐ์น˜
  • ์ด ์Šค๋ ˆ๋“œ ์ˆ˜: \(25\)๊ฐœ ์›์†Œ์— ๋Œ€ํ•ด \(36\)๊ฐœ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: 2D ๋ฐ์ดํ„ฐ๋ฅผ ํ–‰ ์šฐ์„ ์œผ๋กœ ์ €์žฅ
  • ์ปค๋ฒ„๋ฆฌ์ง€: ๋ชจ๋“  ํ–‰๋ ฌ ์›์†Œ๊ฐ€ ๋น ์ง์—†์ด ์ฒ˜๋ฆฌ๋˜๋„๋ก ๋ณด์žฅ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 5
comptime BLOCKS_PER_GRID = (2, 2)
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32


fn add_10_blocks_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p07/p07.mojo

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: ํ–‰ ์šฐ์„  ๋ฐฉ์‹์œผ๋กœ 10์„ ๋”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”!

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p07
pixi run -e amd p07
pixi run -e apple p07
uv run poe p07

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0])

์†”๋ฃจ์…˜

fn add_10_blocks_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0


์›์‹œ ๋ฉ”๋ชจ๋ฆฌ๋กœ 2D ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•  ๋•Œ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

    • ์ „์—ญ ํ–‰(row): block_dim.y * block_idx.y + thread_idx.y

    • ์ „์—ญ ์—ด(col): block_dim.x * block_idx.x + thread_idx.x

    • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ–‰๋ ฌ ์›์†Œ์— ๋งคํ•‘:

      3ร—3 ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ 5ร—5 ํ–‰๋ ฌ:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      

      (* = ์Šค๋ ˆ๋“œ๋Š” ์กด์žฌํ•˜์ง€๋งŒ ํ–‰๋ ฌ ๊ฒฝ๊ณ„ ๋ฐ–)

  2. ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ

    • ํ–‰ ์šฐ์„  ์„ ํ˜• ๋ฉ”๋ชจ๋ฆฌ: index = row * size + col

    • 5ร—5 ํ–‰๋ ฌ ์˜ˆ์‹œ:

      2D ์ธ๋ฑ์Šค:       ์„ ํ˜• ๋ฉ”๋ชจ๋ฆฌ:
      (2,1) -> 11   [00 01 02 03 04]
                    [05 06 07 08 09]
                    [10 11 12 13 14]
                    [15 16 17 18 19]
                    [20 21 22 23 24]
      
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ row < size and col < size๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒฝ์šฐ:
      • ๋ถ€๋ถ„ ๋ธ”๋ก์—์„œ ๋‚จ๋Š” ์Šค๋ ˆ๋“œ
      • ํ–‰๋ ฌ ๊ฒฝ๊ณ„์˜ ์—ฃ์ง€ ์ผ€์ด์Šค
      • 3ร—3 ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ 2ร—2 ๊ทธ๋ฆฌ๋“œ = 25๊ฐœ ์›์†Œ์— 36๊ฐœ ์Šค๋ ˆ๋“œ
  4. ๋ธ”๋ก ์กฐ์œจ

    • ๊ฐ 3ร—3 ๋ธ”๋ก์ด 5ร—5 ํ–‰๋ ฌ์˜ ์ผ๋ถ€๋ถ„์„ ๋‹ด๋‹น
    • 2ร—2 ๋ธ”๋ก ๊ทธ๋ฆฌ๋“œ๋กœ ์ „์ฒด๋ฅผ ๋น ์ง์—†์ด ์ปค๋ฒ„
    • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๊ฒน์น˜๋Š” ์Šค๋ ˆ๋“œ ์ฒ˜๋ฆฌ
    • ๋ธ”๋ก๋“ค์ด ํ•จ๊ป˜ ํšจ์œจ์ ์œผ๋กœ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ

์ด ํŒจํ„ด์€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ 2D ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃฐ ๋•Œ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์–ด๋–ป๊ฒŒ ์œ ์ง€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

LayoutTensor ๋ฒ„์ „

๊ฐœ์š”

2D LayoutTensor a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ๋ธ”๋ก๊ณผ ํ•จ๊ป˜ LayoutTensor ์‚ฌ์šฉํ•˜๊ธฐ
  • 2D ๋ธ”๋ก ๊ตฌ์„ฑ์œผ๋กœ ํฐ ํ–‰๋ ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ
  • ๋ธ”๋ก ์ธ๋ฑ์‹ฑ๊ณผ LayoutTensor ์ ‘๊ทผ ๊ฒฐํ•ฉํ•˜๊ธฐ

ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ 2D ์ธ๋ฑ์‹ฑ์„ ๋‹จ์ˆœํ™”ํ•ด ์ฃผ์ง€๋งŒ, ํฐ ํ–‰๋ ฌ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(5 \times 5\) ์›์†Œ
  • ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ: LayoutTensor๊ฐ€ ํ–‰ ์šฐ์„  ๊ตฌ์„ฑ ๊ด€๋ฆฌ
  • ๋ธ”๋ก ์กฐ์œจ: ์—ฌ๋Ÿฌ ๋ธ”๋ก์œผ๋กœ ์ „์ฒด ํ–‰๋ ฌ ์ปค๋ฒ„
  • 2D ์ธ๋ฑ์‹ฑ: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ ์ž์—ฐ์Šค๋Ÿฌ์šด \((i,j)\) ์ ‘๊ทผ
  • ์ด ์Šค๋ ˆ๋“œ ์ˆ˜: \(25\)๊ฐœ ์›์†Œ์— ๋Œ€ํ•ด \(36\)๊ฐœ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ–‰๋ ฌ ์›์†Œ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 5
comptime BLOCKS_PER_GRID = (2, 2)
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime out_layout = Layout.row_major(SIZE, SIZE)
comptime a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p07/p07_layout_tensor.mojo

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: 2D LayoutTensor์— 10์„ ๋”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p07_layout_tensor
pixi run -e amd p07_layout_tensor
pixi run -e apple p07_layout_tensor
uv run poe p07_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0])

์†”๋ฃจ์…˜

fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, ImmutAnyOrigin],
    size: UInt,
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0


LayoutTensor๊ฐ€ 2D ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

    • ์ „์—ญ ํ–‰(row): block_dim.y * block_idx.y + thread_idx.y

    • ์ „์—ญ ์—ด(col): block_dim.x * block_idx.x + thread_idx.x

    • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ…์„œ ์›์†Œ์— ๋งคํ•‘:

      3ร—3 ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ 5ร—5 ํ…์„œ:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      

      (* = ์Šค๋ ˆ๋“œ๋Š” ์กด์žฌํ•˜์ง€๋งŒ ํ…์„œ ๊ฒฝ๊ณ„ ๋ฐ–)

  2. LayoutTensor์˜ ์žฅ์ 

    • ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  tensor[row, col] ์‚ฌ์šฉ

    • ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”

    • ์ ‘๊ทผ ํŒจํ„ด ์˜ˆ์‹œ:

      ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ:          LayoutTensor:
      row * size + col    tensor[row, col]
      (2,1) -> 11        (2,1) -> ๊ฐ™์€ ์›์†Œ
      
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ row < size and col < size๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ์ƒํ™ฉ:
      • ๋ถ€๋ถ„ ๋ธ”๋ก์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์Šค๋ ˆ๋“œ
      • ํ…์„œ ๊ฒฝ๊ณ„์˜ ์—ฃ์ง€ ์ผ€์ด์Šค
      • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์€ LayoutTensor๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
      • 25๊ฐœ ์›์†Œ๋ฅผ 36๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ์ฒ˜๋ฆฌ (3ร—3 ๋ธ”๋ก์˜ 2ร—2 ๊ทธ๋ฆฌ๋“œ)
  4. ๋ธ”๋ก ์กฐ์œจ

    • ๊ฐ 3ร—3 ๋ธ”๋ก์ด 5ร—5 ํ…์„œ์˜ ์ผ๋ถ€๋ถ„์„ ๋‹ด๋‹น
    • LayoutTensor๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ถ€๋ถ„:
      • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”
      • ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด
      • ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ฐ„ ์กฐ์œจ
      • ์บ์‹œ ์นœํ™”์  ๋ฐ์ดํ„ฐ ์ ‘๊ทผ

์ด ํŒจํ„ด์€ LayoutTensor๊ฐ€ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ 2D ๋ธ”๋ก ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 8: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ

๊ฐœ์š”

๋ฒกํ„ฐ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด ๋ฒกํ„ฐ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ๊ฐํ™” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ๊ฐํ™”

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋™๊ธฐํ™”๋ฅผ ์ˆ˜๋™์œผ๋กœ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“ LayoutTensor ๋ฒ„์ „

LayoutTensor์— ๋‚ด์žฅ๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : LayoutTensor๊ฐ€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๊ฒฝํ—˜ํ•ด ๋ณด์„ธ์š”.

๊ฐœ์š”

๋ฒกํ„ฐ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋‚ด์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉํ•˜๊ธฐ
  • ๋ฐฐ๋ฆฌ์–ด(barrier)๋กœ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”ํ•˜๊ธฐ
  • ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ์ €์žฅ์†Œ ๊ด€๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋Š” ๋น ๋ฅธ ๋กœ์ปฌ ์ €์žฅ์†Œ๋ผ๋Š” ์ , ๊ทธ๋ฆฌ๊ณ  ์ด๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ ์Šค๋ ˆ๋“œ ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8 ์›์†Œ
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 4
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น TPB๊ฐœ ์›์†Œ

์ฐธ๊ณ :

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋Š” ๋น ๋ฅธ ์ €์žฅ์†Œ
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: barrier()๋ฅผ ์‚ฌ์šฉํ•œ ์กฐ์œจ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฒ”์œ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” ๋ธ”๋ก ๋‚ด์—์„œ๋งŒ ๋ณด์ž„
  • ์ ‘๊ทผ ํŒจํ„ด: ๋กœ์ปฌ ์ธ๋ฑ์Šค vs ์ „์—ญ ์ธ๋ฑ์Šค

์ฃผ์˜: ๊ฐ ๋ธ”๋ก์ด ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ๋Š” ์ƒ์ˆ˜ ๋กœ ์ •ํ•ด์ ธ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฐ’์€ ๋ณ€์ˆ˜๊ฐ€ ์•„๋‹Œ ๋ฆฌํ„ฐ๋Ÿด Python ์ƒ์ˆ˜์—ฌ์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์“ด ํ›„์—๋Š” barrier๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ์•ž์„œ๊ฐ€์ง€ ์•Š๋„๋ก ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ์ฐธ๊ณ : ์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์—๋งŒ ์ ‘๊ทผํ•˜๋ฏ€๋กœ barrier()๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋ณต์žกํ•œ ์ƒํ™ฉ์—์„œ ํ•„์š”ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 4
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32


fn add_10_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # Load local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # wait for all threads to complete
    # works within a thread block
    barrier()

    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p08/p08.mojo

ํŒ
  1. barrier()๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ (ํ•™์Šต์šฉ - ์—ฌ๊ธฐ์„œ๋Š” ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Œ)
  2. local_i๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: shared[local_i]
  3. global_i๋กœ ์ถœ๋ ฅ: output[global_i]
  4. ๊ฐ€๋“œ ์ถ”๊ฐ€: if global_i < size

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p08
pixi run -e amd p08
pixi run -e apple p08
uv run poe p08

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

์†”๋ฃจ์…˜

fn add_10_shared(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # Load local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Wait for all threads to complete (works within a thread block).
    # Note: barrier is not strictly needed here since each thread only accesses
    # its own shared memory location. However, it's included to teach proper
    # shared memory synchronization patterns for more complex scenarios where
    # threads need to coordinate access to shared data.
    barrier()

    # process using shared memory
    if global_i < size:
        output[global_i] = shared[local_i] + 10


GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: a์™€ output ๋ฐฐ์—ด (๋А๋ฆผ, ๋ชจ๋“  ๋ธ”๋ก์—์„œ ๋ณด์ž„)

    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: shared ๋ฐฐ์—ด (๋น ๋ฆ„, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋กœ์ปฌ)

    • ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ๋กœ 8๊ฐœ ์›์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์˜ˆ์‹œ:

      ์ „์—ญ ๋ฐฐ์—ด a: [1 1 1 1 | 1 1 1 1]  # ์ž…๋ ฅ: ๋ชจ๋‘ 1
      
      Block (0):      Block (1):
      shared[0..3]    shared[0..3]
      [1 1 1 1]       [1 1 1 1]
      
  2. ์Šค๋ ˆ๋“œ ์กฐ์œจ

    • ๋กœ๋“œ ๋‹จ๊ณ„:

      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    โ†“         โ†“        โ†“         โ†“   # ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ
      
    • ์ฒ˜๋ฆฌ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10์„ ๋”ํ•จ

    • ๊ฒฐ๊ณผ: output[i] = shared[local_i] + 10 = 11

    ์ฐธ๊ณ : ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜(shared[local_i])์—๋งŒ ์“ฐ๊ณ  ์ฝ์œผ๋ฏ€๋กœ barrier()๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•˜๋Š” ์ƒํ™ฉ์—์„œ ํ•„์ˆ˜์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

  3. ์ธ๋ฑ์Šค ๋งคํ•‘

    • ์ „์—ญ ์ธ๋ฑ์Šค: block_dim.x * block_idx.x + thread_idx.x

      Block 0 ์ถœ๋ ฅ: [11 11 11 11]
      Block 1 ์ถœ๋ ฅ: [11 11 11 11]
      
    • ๋กœ์ปฌ ์ธ๋ฑ์Šค: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— thread_idx.x ์‚ฌ์šฉ

      ๋‘ ๋ธ”๋ก ๋ชจ๋‘ ์ฒ˜๋ฆฌ: 1 + 10 = 11
      
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ๋กœ๋“œ: ์ „์—ญ โ†’ ๊ณต์œ  (๋ณ‘ํ•ฉ ์ฝ๊ธฐ๋กœ 1 ๊ฐ’๋“ค ๋กœ๋“œ)
    • ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋ณด์žฅ
    • ์ฒ˜๋ฆฌ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10 ๋”ํ•˜๊ธฐ
    • ์ €์žฅ: ๊ฒฐ๊ณผ(11)๋ฅผ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ์“ฐ๊ธฐ

์ด ํŒจํ„ด์€ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

1D LayoutTensor a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • address_space๋ฅผ ํ™œ์šฉํ•œ LayoutTensor์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ๋Šฅ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ์˜ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  • LayoutTensor๋กœ ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ์˜ ์„ฑ๋Šฅ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8 ์›์†Œ
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 4
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น TPB๊ฐœ ์›์†Œ

์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด์ 

  1. ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: stack_allocation ๋Œ€์‹  address_space๋ฅผ ์‚ฌ์šฉํ•œ LayoutTensor ์‚ฌ์šฉ

    # ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹
    shared = stack_allocation[TPB, Scalar[dtype]]()
    
    # LayoutTensor ๋ฐฉ์‹
    shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
    
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋™์ผํ•œ ๋ฌธ๋ฒ•

    # ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹
    shared[local_i] = a[global_i]
    
    # LayoutTensor ๋ฐฉ์‹
    shared[local_i] = a[global_i]
    
  3. ์•ˆ์ „ ๊ธฐ๋Šฅ:

    • ํƒ€์ž… ์•ˆ์ „์„ฑ
    • ๋ ˆ์ด์•„์›ƒ ๊ด€๋ฆฌ
    • ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ์ฒ˜๋ฆฌ

์ฐธ๊ณ : LayoutTensor๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ฒ˜๋ฆฌํ•˜์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์‹œ barrier()๋ฅผ ํ†ตํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๋Š” ์—ฌ์ „ํžˆ ์ง์ ‘ ๊ด€๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ์ฐธ๊ณ : ์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์—๋งŒ ์ ‘๊ทผํ•˜๋ฏ€๋กœ barrier()๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋ณต์žกํ•œ ์ƒํ™ฉ์—์„œ ํ•„์š”ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 4
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using LayoutTensor with explicit address_space
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p08/p08_layout_tensor.mojo

ํŒ
  1. address_space ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ LayoutTensor ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ
  2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: shared[local_i] = a[global_i]
  3. barrier()๋กœ ๋™๊ธฐํ™” (ํ•™์Šต์šฉ - ์—ฌ๊ธฐ์„œ๋Š” ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Œ)
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋กœ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  5. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ฐ€๋“œ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p08_layout_tensor
pixi run -e amd p08_layout_tensor
pixi run -e apple p08_layout_tensor
uv run poe p08_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

์†”๋ฃจ์…˜

fn add_10_shared_layout_tensor[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using tensor builder
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Note: barrier is not strictly needed here since each thread only accesses
    # its own shared memory location. However, it's included to teach proper
    # shared memory synchronization patterns for more complex scenarios where
    # threads need to coordinate access to shared data.
    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10


LayoutTensor๊ฐ€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

    • ์ „์—ญ ํ…์„œ: a์™€ output (๋А๋ฆผ, ๋ชจ๋“  ๋ธ”๋ก์—์„œ ๋ณด์ž„)

    • ๊ณต์œ  ํ…์„œ: shared (๋น ๋ฆ„, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋กœ์ปฌ)

    • ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ๋กœ 8๊ฐœ ์›์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์˜ˆ์‹œ:

      ์ „์—ญ ํ…์„œ a: [1 1 1 1 | 1 1 1 1]  # ์ž…๋ ฅ: ๋ชจ๋‘ 1
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. ์Šค๋ ˆ๋“œ ์กฐ์œจ

    • ๋กœ๋“œ ๋‹จ๊ณ„ (์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ):

      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    โ†“         โ†“        โ†“         โ†“   # ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ
      
    • ์ฒ˜๋ฆฌ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ํ…์„œ ๊ฐ’์— 10์„ ๋”ํ•จ

    • ๊ฒฐ๊ณผ: output[global_i] = shared[local_i] + 10 = 11

    ์ฐธ๊ณ : ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜(shared[local_i])์—๋งŒ ์“ฐ๊ณ  ์ฝ์œผ๋ฏ€๋กœ barrier()๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•˜๋Š” ์ƒํ™ฉ์—์„œ ํ•„์ˆ˜์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

  3. LayoutTensor์˜ ์žฅ์ 

    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น:

      # address_space๋ฅผ ์‚ฌ์šฉํ•œ ๊น”๋”ํ•œ LayoutTensor API
      shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
      
    • ์ „์—ญ๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋‘ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ:

      Block 0 ์ถœ๋ ฅ: [11 11 11 11]
      Block 1 ์ถœ๋ ฅ: [11 11 11 11]
      
    • ๋‚ด์žฅ๋œ ๋ ˆ์ด์•„์›ƒ ๊ด€๋ฆฌ์™€ ํƒ€์ž… ์•ˆ์ „์„ฑ

  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ๋กœ๋“œ: ์ „์—ญ ํ…์„œ โ†’ ๊ณต์œ  ํ…์„œ (์ตœ์ ํ™”๋จ)
    • ๋™๊ธฐํ™”: ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „๊ณผ ๋™์ผํ•œ barrier() ํ•„์š”
    • ์ฒ˜๋ฆฌ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10 ๋”ํ•˜๊ธฐ
    • ์ €์žฅ: ๊ฒฐ๊ณผ(11)๋ฅผ ์ „์—ญ ํ…์„œ์— ์“ฐ๊ธฐ

์ด ํŒจํ„ด์€ LayoutTensor๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ ์ด์ ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋” ํŽธ๋ฆฌํ•œ API์™€ ๋‚ด์žฅ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 9: GPU ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ

โš ๏ธ ์ด ํผ์ฆ์€ ํ˜ธํ™˜๋˜๋Š” NVIDIA GPU์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ GPU ๋ฒค๋” ์ง€์›์„ ์œ„ํ•œ ๋„๊ตฌ ๊ฐœ๋ฐœ์ด ์ง„ํ–‰ ์ค‘์ž…๋‹ˆ๋‹ค.

GPU ํ”„๋กœ๊ทธ๋žจ์ด ์‹คํŒจํ•  ๋•Œ

์ง€๊ธˆ๊นŒ์ง€ GPU ์ปค๋„์„ ์ž‘์„ฑํ•˜๊ณ , ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋‹ค๋ฃจ๊ณ , ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋ฅผ ์กฐ์œจํ•ด ์™”์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ๊ฐ€ ์ปดํŒŒ์ผ๋ฉ๋‹ˆ๋‹ค. ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋Œ€ํ•˜๋ฉฐ ์‹คํ–‰ํ•˜๋ฉด:

  • ํฌ๋ž˜์‹œ
  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ
  • ๋ฌดํ•œ ์ •์ง€

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ˜„์‹ค์ด ๋ฐ”๋กœ ์ด๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ์—์„œ ๋™์‹œ์— ์‹คํ–‰๋˜๋Š” ๋ณ‘๋ ฌ ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•ด์•ผ ํ•˜์ฃ . ์ด๋ก ๊ณผ ์‹ค์ „์ด ๋งŒ๋‚˜๊ณ , ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ง€์‹๊ณผ ์กฐ์‚ฌ ๋Šฅ๋ ฅ์ด ๊ต์ฐจํ•˜๋Š” ์˜์—ญ์ž…๋‹ˆ๋‹ค.

GPU ๋””๋ฒ„๊น…์ด ์–ด๋ ค์šด ์ด์œ 

๋‹จ์ผ ์Šค๋ ˆ๋“œ์˜ ์ˆœ์ฐจ ์‹คํ–‰์„ ๋”ฐ๋ผ๊ฐ€๋Š” ์ „ํ†ต์ ์ธ CPU ๋””๋ฒ„๊น…๊ณผ ๋‹ฌ๋ฆฌ, GPU ๋””๋ฒ„๊น…์€ ๋‹ค์Œ์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค:

  • ๋ณ‘๋ ฌ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ: ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋˜๋ฉฐ, ๊ฐ๊ฐ ๋‹ค๋ฅธ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ
  • ์—ฌ๋Ÿฌ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„ ํƒ์ƒ‰: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ ˆ์ง€์Šคํ„ฐ, ์ƒ์ˆ˜ ๋ฉ”๋ชจ๋ฆฌ
  • ์กฐ์œจ ์‹คํŒจ ์ฒ˜๋ฆฌ: ๊ฒฝ์Ÿ ์ƒํƒœ, ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์œ„๋ฐ˜
  • ์ตœ์ ํ™”๋œ ์ฝ”๋“œ ๋””๋ฒ„๊น…: JIT ์ปดํŒŒ์ผ, ๋ณ€์ˆ˜ ์ตœ์ ํ™”, ์ œํ•œ๋œ ์‹ฌ๋ณผ ์ •๋ณด
  • ์ „๋ฌธ ๋„๊ตฌ ์‚ฌ์šฉ: ์ปค๋„ ๊ฒ€์‚ฌ, ์Šค๋ ˆ๋“œ ํƒ์ƒ‰, ๋ณ‘๋ ฌ ์ƒํƒœ ๋ถ„์„์„ ์œ„ํ•œ CUDA-GDB

GPU ๋””๋ฒ„๊น…์„ ์ตํžˆ๋ฉด ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์˜ ๊ธฐ์ดˆ๋ฅผ ๊นŠ์ด ์ดํ•ดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ

์ด ํผ์ฆ์—์„œ๋Š” GPU ์ฝ”๋“œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋งค์ผ ์‚ฌ์šฉํ•˜๋Š” ์ ‘๊ทผ๋ฒ•, ๋„๊ตฌ, ๊ธฐ๋ฒ•์„ ์ตํžˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ตํžˆ๊ฒŒ ๋  ํ•ต์‹ฌ ๊ธฐ์ˆ 

  1. ์ „๋ฌธ์ ์ธ ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ - ์ „๋ฌธ๊ฐ€๋“ค์ด ์‚ฌ์šฉํ•˜๋Š” ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•
  2. ๋„๊ตฌ ์ˆ™๋ จ๋„ - ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ์šฉ LLDB, GPU ์ปค๋„์šฉ CUDA-GDB
  3. ํŒจํ„ด ์ธ์‹ - ํ”ํ•œ GPU ๋ฒ„๊ทธ ์œ ํ˜•๊ณผ ์ฆ์ƒ
  4. ์กฐ์‚ฌ ๊ธฐ๋ฒ• - ๋ณ€์ˆ˜๊ฐ€ ์ตœ์ ํ™”๋กœ ์ œ๊ฑฐ๋˜์—ˆ์„ ๋•Œ ๊ทผ๋ณธ ์›์ธ ์ฐพ๊ธฐ
  5. ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น… - ๊ณ ๊ธ‰ GPU ๋””๋ฒ„๊น… ๊ธฐ์ˆ 

์‹ค์ œ ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค

๊ฐ€์žฅ ํ”ํ•œ ์„ธ ๊ฐ€์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‹คํŒจ ์ƒํ™ฉ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ - Null ํฌ์ธํ„ฐ, ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ, ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ํดํŠธ
  • ๋กœ์ง ๋ฒ„๊ทธ - ์ •์ƒ ์‹คํ–‰๋˜์ง€๋งŒ ๊ฒฐ๊ณผ๊ฐ€ ํ‹€๋ฆผ, ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜
  • ์กฐ์œจ ๊ต์ฐฉ ์ƒํƒœ - ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™” ์‹คํŒจ, ๋ฌดํ•œ ์ •์ง€

๊ฐ ์‹œ๋‚˜๋ฆฌ์˜ค๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์กฐ์‚ฌ ๊ธฐ๋ฒ•์„ ๊ฐ€๋ฅด์น˜๊ณ  ๋””๋ฒ„๊น… ๊ฐ๊ฐ์„ ๊ธธ๋Ÿฌ์ค๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์—ฌ์ •

์ด ํผ์ฆ์€ ๊ธฐ๋ณธ ๋””๋ฒ„๊น… ๊ฐœ๋…๋ถ€ํ„ฐ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์กฐ์œจ ์‹คํŒจ๊นŒ์ง€, ์ฒด๊ณ„์ ์œผ๋กœ ์„ค๊ณ„๋œ ๊ณผ์ •์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ“š Step 1: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ

๊ธฐ์ดˆ ๋‹ค์ง€๊ธฐ - ๋„๊ตฌ์™€ ์›Œํฌํ”Œ๋กœ์šฐ ๋ฐฐ์šฐ๊ธฐ

  • pixi์™€ CUDA-GDB๋กœ ๋””๋ฒ„๊น… ํ™˜๊ฒฝ ์„ค์ •
  • ๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ๋ฐฐ์šฐ๊ธฐ: JIT vs ๋ฐ”์ด๋„ˆ๋ฆฌ, CPU vs GPU
  • GPU ์ปค๋„ ๊ฒ€์‚ฌ๋ฅผ ์œ„ํ•œ ํ•„์ˆ˜ CUDA-GDB ๋ช…๋ น์–ด ํ•™์Šต
  • ์ด์ „ ํผ์ฆ์˜ ์ต์ˆ™ํ•œ ์ฝ”๋“œ๋กœ ์‹ค์Šต
  • ๊ฐ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•์„ ์–ธ์ œ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š”์ง€ ์ดํ•ด

๋ชฉํ‘œ: ์ „๋ฌธ์ ์ธ ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๋„๊ตฌ ์ˆ™๋ จ๋„

๐Ÿง Step 2: ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€

๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ ์กฐ์‚ฌ - ํฌ๋ž˜์‹œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น…

  • CUDA_ERROR_ILLEGAL_ADDRESS ํฌ๋ž˜์‹œ ์กฐ์‚ฌ
  • ์ฒด๊ณ„์ ์ธ ํฌ์ธํ„ฐ ๊ฒ€์‚ฌ ๊ธฐ๋ฒ• ํ•™์Šต
  • Null ํฌ์ธํ„ฐ ํƒ์ง€ ๋ฐ ๊ฒ€์ฆ ํ•™์Šต
  • ์ „๋ฌธ์ ์ธ ํฌ๋ž˜์‹œ ๋ถ„์„ ์›Œํฌํ”Œ๋กœ์šฐ ์‹ค์Šต
  • GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹คํŒจ ์ดํ•ด

๋ชฉํ‘œ: GPU ๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ์™€ ํฌ์ธํ„ฐ ๋ฌธ์ œ ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ

๐Ÿ” Step 3: ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๋กœ์ง ๋ฒ„๊ทธ ์กฐ์‚ฌ - ๊ฒฐ๊ณผ๊ฐ€ ํ‹€๋ฆฐ ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น…

  • LayoutTensor ๊ธฐ๋ฐ˜์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • ์ตœ์ ํ™”๋กœ ๋ณ€์ˆ˜๊ฐ€ ์‚ฌ๋ผ์กŒ์„ ๋•Œ ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„ํ•˜๊ธฐ
  • ๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„์™€ ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ถ„์„ํ•˜๊ธฐ
  • ํ‹€๋ฆฐ ๊ฒฐ๊ณผ์—์„œ ํŒจํ„ด ์ฐพ์•„๋‚ด๊ธฐ
  • ๋ณ€์ˆ˜๋ฅผ ์ง์ ‘ ํ™•์ธํ•˜์ง€ ์•Š๊ณ  ๋””๋ฒ„๊น…ํ•˜๊ธฐ

๋ชฉํ‘œ: GPU ์ปค๋„์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜์™€ ๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ

๐Ÿ•ต๏ธ Step 4: ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ์กฐ์‚ฌ - ์˜์›ํžˆ ๋ฉˆ์ถ”๋Š” ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น…

  • ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™” ์‹คํŒจ ์กฐ์‚ฌ
  • ๋ณ‘๋ ฌ ์‹คํ–‰ ์ „๋ฐ˜์˜ ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„ ํ•™์Šต
  • ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๊ฒฝ๋กœ ์ถ”์  ํ•™์Šต
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น… ์‹ค์Šต
  • ๊ฐ€์žฅ ์–ด๋ ค์šด GPU ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค ์ดํ•ด

๋ชฉํ‘œ: ๊ณ ๊ธ‰ ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น… - GPU ๋””๋ฒ„๊น… ๊ธฐ์ˆ ์˜ ์ •์ 

ํƒ์ •์˜ ๋งˆ์ธ๋“œ์…‹

GPU ๋””๋ฒ„๊น…์€ ์ผ๋ฐ˜์ ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ค๋ฅธ ์‚ฌ๊ณ ๋ฐฉ์‹์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ๋ถ„์€ ๋ฒ”์ฃ„ ํ˜„์žฅ์„ ์กฐ์‚ฌํ•˜๋Š” ํƒ์ •์ด ๋ฉ๋‹ˆ๋‹ค:

  • ๋‹จ์„œ๊ฐ€ ๋ถ€์กฑํ•จ - ๋ณ€์ˆ˜๋Š” ์ตœ์ ํ™”๋กœ ์‚ฌ๋ผ์ง€๊ณ , ์‹ฌ๋ณผ๋ช…์€ ์•Œ์•„๋ณด๊ธฐ ์–ด๋ ค์›€
  • ์šฉ์˜์ž๊ฐ€ ๋„˜์นจ - ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ, ๋ˆ„๊ตฌ๋“  ๋ฒ”์ธ์ผ ์ˆ˜ ์žˆ์Œ
  • ํƒ€์ž„๋ผ์ธ์ด ๋ณต์žกํ•จ - ๋ณ‘๋ ฌ ์‹คํ–‰, ๊ฒฝ์Ÿ ์ƒํƒœ, ํƒ€์ด๋ฐ ์˜์กด์„ฑ
  • ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•จ - CUDA-GDB, ์Šค๋ ˆ๋“œ ํƒ์ƒ‰, GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ

ํ•˜์ง€๋งŒ ํ›Œ๋ฅญํ•œ ํƒ์ •์ด ๊ทธ๋ ‡๋“ฏ, ์—ฌ๋Ÿฌ๋ถ„๋„ ๋‹ค์Œ์„ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  • ๋‹จ์„œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ถ”์  - ์—๋Ÿฌ ๋ฉ”์‹œ์ง€, ํฌ๋ž˜์‹œ ํŒจํ„ด, ์Šค๋ ˆ๋“œ ์ƒํƒœ
  • ๊ฐ€์„ค ์ˆ˜๋ฆฝ - ์ด ๋™์ž‘์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์›์ธ์€ ๋ฌด์—‡์ผ๊นŒ?
  • ์ด๋ก  ๊ฒ€์ฆ - ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋กœ ์•„์ด๋””์–ด๋ฅผ ํ™•์ธํ•˜๊ฑฐ๋‚˜ ๋ฐ˜์ฆ
  • ๊ทผ๋ณธ ์›์ธ ์ถ”์  - ์ฆ์ƒ์—์„œ ์‹ค์ œ ๋ฌธ์ œ์˜ ์›์ธ๊นŒ์ง€

์‹œ์ž‘ํ•˜๊ธฐ ์ „์—

์•Œ์•„์•ผ ํ•  ๊ฒƒ:

  • Puzzle 1-8์—์„œ ๋‹ค๋ฃฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋… (์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ, ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด)
  • ๊ธฐ๋ณธ์ ์ธ ๋ช…๋ น์ค„ ์‚ฌ์šฉ์— ์ต์ˆ™ํ•จ (ํ„ฐ๋ฏธ๋„ ๊ธฐ๋ฐ˜ ๋””๋ฒ„๊น… ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค)
  • ์ธ๋‚ด์‹ฌ๊ณผ ์ฒด๊ณ„์  ์‚ฌ๊ณ  (GPU ๋””๋ฒ„๊น…์€ ๊ผผ๊ผผํ•œ ์กฐ์‚ฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค)

๋ชฉํ‘œ:

  • GPU ๊ฐœ๋ฐœํŒ€์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ „๋ฌธ ๋””๋ฒ„๊น… ๊ธฐ์ˆ 
  • ์Šค๋ ˆ๋“œ ์ˆ˜์ค€์˜ ์‹คํ–‰์„ ๊ด€์ฐฐํ•˜๋ฉฐ ์–ป๋Š” ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด
  • ๊ฐ€์žฅ ๊นŒ๋‹ค๋กœ์šด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ƒํ™ฉ์—์„œ๋„ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ž์‹ ๊ฐ
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ปค๋ฆฌ์–ด ์ „๋ฐ˜์— ๋„์›€์ด ๋  ๋„๊ตฌ ์ˆ™๋ จ๋„

์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”?

GPU ๋””๋ฒ„๊น…์€ GPU ํ”„๋กœ๊ทธ๋žจ์„ ์ž‘์„ฑํ•˜๋Š” ๊ฒƒ์—์„œ ๊นŠ์ด ์ดํ•ดํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜์•„๊ฐ€๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ „๋ฌธ GPU ๊ฐœ๋ฐœ์ž๋ผ๋ฉด ๋ˆ„๊ตฌ๋‚˜ ๋ณ‘๋ ฌ ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๊ณ , ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋กœ ๋™์‹œ์— ์‚ฌ๊ณ ํ•˜๋Š” ๋ฒ•์„ ์ตํžˆ๊ณ , ๋ณต์žกํ•œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ๋ˆ๊ธฐ ์žˆ๊ฒŒ ์กฐ์‚ฌํ•˜๋ฉฐ ์ˆ˜๋งŽ์€ ์‹œ๊ฐ„์„ ๋ณด๋ƒˆ์Šต๋‹ˆ๋‹ค.

์ง€๊ธˆ์ด ๋ฐ”๋กœ ๊ทธ ์ „๋ฌธ๊ฐ€ ๊ทธ๋ฃน์— ํ•ฉ๋ฅ˜ํ•  ๊ธฐํšŒ์ž…๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์—ฌ์ • ์‹œ์ž‘ํ•˜๊ธฐ: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ


โ€œ๋””๋ฒ„๊น…์€ ์ฝ”๋“œ ์ž‘์„ฑ๋ณด๋‹ค ๋‘ ๋ฐฐ๋Š” ์–ด๋ ต๋‹ค. ๋”ฐ๋ผ์„œ ์ตœ๋Œ€ํ•œ ์˜๋ฆฌํ•˜๊ฒŒ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ๋‹ค๋ฉด, ์ •์˜์ƒ ๊ทธ๊ฒƒ์„ ๋””๋ฒ„๊น…ํ•  ๋งŒํผ ๋˜‘๋˜‘ํ•˜์ง€ ์•Š๋‹ค๋Š” ๋œป์ด๋‹ค.โ€ - Brian Kernighan

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ์ด ๋ง์ด ์ˆ˜์ฒœ ๋ฐฐ๋กœ ์™€๋‹ฟ์Šต๋‹ˆ๋‹ค. ๋™์‹œ์— ๋””๋ฒ„๊น…ํ•ด์•ผ ํ•  ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ์ˆ˜๋งŒํผ์š”.

๐Ÿ“š Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ

GPU ๋””๋ฒ„๊น…์˜ ์„ธ๊ณ„์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! Puzzle 1-8์„ ํ†ตํ•ด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์„ ๋ฐฐ์› ์œผ๋‹ˆ, ์ด์ œ ๋ชจ๋“  GPU ํ”„๋กœ๊ทธ๋ž˜๋จธ์—๊ฒŒ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ธฐ์ˆ ์„ ๋ฐฐ์šธ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค: ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ์„ ๋•Œ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ๋ฒ•.

GPU ๋””๋ฒ„๊น…์€ ์ฒ˜์Œ์—๋Š” ์–ด๋ ค์›Œ ๋ณด์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋˜๊ณ , ๋‹ค์–‘ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์ด ์žˆ์œผ๋ฉฐ, ํ•˜๋“œ์›จ์–ด๋ณ„ ๋™์ž‘๋„ ๋‹ค๋ฃจ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ ์ ˆํ•œ ๋„๊ตฌ์™€ ์›Œํฌํ”Œ๋กœ์šฐ๋งŒ ์žˆ์œผ๋ฉด GPU ์ฝ”๋“œ ๋””๋ฒ„๊น…๋„ ์ฒด๊ณ„์ ์œผ๋กœ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ(GPU ์ž‘์—…์„ ์„ค์ •ํ•˜๋Š” ๋ถ€๋ถ„)์™€ GPU ์ปค๋„ ์ฝ”๋“œ(๋ณ‘๋ ฌ ์—ฐ์‚ฐ์ด ์‹คํ–‰๋˜๋Š” ๋ถ€๋ถ„) ๋ชจ๋‘๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์‹ค์ œ ์˜ˆ์ œ, ์‹ค์ œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ, ๊ทธ๋ฆฌ๊ณ  ์—ฌ๋Ÿฌ๋ถ„์˜ ํ”„๋กœ์ ํŠธ์— ๋ฐ”๋กœ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋‹จ๊ณ„๋ณ„ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ๋‹ค์Œ ๋‚ด์šฉ์€ ๋ฒ”์šฉ IDE ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•ด ๋ช…๋ น์ค„ ๋””๋ฒ„๊น…์— ์ดˆ์ ์„ ๋งž์ถฅ๋‹ˆ๋‹ค. VS Code ๋””๋ฒ„๊น…์„ ์„ ํ˜ธํ•œ๋‹ค๋ฉด Mojo ๋””๋ฒ„๊น… ๋ฌธ์„œ์—์„œ VS Code ์ „์šฉ ์„ค์ •๊ณผ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

GPU ๋””๋ฒ„๊น…์ด ๋‹ค๋ฅธ ์ด์œ 

๋„๊ตฌ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, GPU ๋””๋ฒ„๊น…์ด ํŠน๋ณ„ํ•œ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  • ์ „ํ†ต์ ์ธ CPU ๋””๋ฒ„๊น…: ๋‹จ์ผ ์Šค๋ ˆ๋“œ, ์ˆœ์ฐจ ์‹คํ–‰, ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋ธ
  • GPU ๋””๋ฒ„๊น…: ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ, ๋ณ‘๋ ฌ ์‹คํ–‰, ์—ฌ๋Ÿฌ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„, ๊ฒฝ์Ÿ ์ƒํƒœ

์ด๋Š” ๋‹ค์Œ์„ ํ•  ์ˆ˜ ์žˆ๋Š” ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค:

  • ์„œ๋กœ ๋‹ค๋ฅธ GPU ์Šค๋ ˆ๋“œ ๊ฐ„ ์ „ํ™˜
  • ์Šค๋ ˆ๋“œ๋ณ„ ๋ณ€์ˆ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ
  • ๋ณ‘๋ ฌ ์‹คํ–‰์˜ ๋ณต์žก์„ฑ ์ฒ˜๋ฆฌ
  • CPU ์„ค์ • ์ฝ”๋“œ์™€ GPU ์ปค๋„ ์ฝ”๋“œ ๋ชจ๋‘ ๋””๋ฒ„๊น…

๋””๋ฒ„๊น… ๋„๊ตฌ ๋ชจ์Œ

Mojo์˜ GPU ๋””๋ฒ„๊น… ๊ธฐ๋Šฅ์€ ํ˜„์žฌ NVIDIA GPU๋กœ ์ œํ•œ๋ฉ๋‹ˆ๋‹ค. Mojo ๋””๋ฒ„๊น… ๋ฌธ์„œ์— ๋”ฐ๋ฅด๋ฉด Mojo ํŒจํ‚ค์ง€์—๋Š” ๋‹ค์Œ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค:

  • CPU ์ธก ๋””๋ฒ„๊น…์„ ์œ„ํ•œ Mojo ํ”Œ๋Ÿฌ๊ทธ์ธ์ด ํฌํ•จ๋œ LLDB ๋””๋ฒ„๊ฑฐ
  • GPU ์ปค๋„ ๋””๋ฒ„๊น…์„ ์œ„ํ•œ CUDA-GDB ํ†ตํ•ฉ
  • ๋ฒ”์šฉ IDE ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•œ mojo debug๋ฅผ ํ†ตํ•œ ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค

GPU ์ „์šฉ ๋””๋ฒ„๊น…์— ๋Œ€ํ•ด์„œ๋Š” Mojo GPU ๋””๋ฒ„๊น… ๊ฐ€์ด๋“œ์—์„œ ์ถ”๊ฐ€ ๊ธฐ์ˆ  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์ด ์•„ํ‚คํ…์ฒ˜๋Š” ์ต์ˆ™ํ•œ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด์™€ GPU ์ „์šฉ ๊ธฐ๋Šฅ, ๋‘ ๊ฐ€์ง€ ์žฅ์ ์„ ๋ชจ๋‘ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ: ๋ฌธ์ œ์—์„œ ํ•ด๊ฒฐ๊นŒ์ง€

GPU ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜๊ฑฐ๋‚˜, ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๊ฑฐ๋‚˜, ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋™์ž‘์„ ํ•  ๋•Œ ๋‹ค์Œ์˜ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ๋”ฐ๋ฅด์„ธ์š”:

  1. ๋””๋ฒ„๊น…์„ ์œ„ํ•œ ์ฝ”๋“œ ์ค€๋น„ (์ตœ์ ํ™” ๋น„ํ™œ์„ฑํ™”, ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ์ถ”๊ฐ€)
  2. ์ ์ ˆํ•œ ๋””๋ฒ„๊ฑฐ ์„ ํƒ (CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ vs GPU ์ปค๋„ ๋””๋ฒ„๊น…)
  3. ์ „๋žต์  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ • (๋ฌธ์ œ๊ฐ€ ์˜์‹ฌ๋˜๋Š” ์œ„์น˜์—)
  4. ์‹คํ–‰ ๋ฐ ๊ฒ€์‚ฌ (์ฝ”๋“œ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•˜๋ฉฐ ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ)
  5. ํŒจํ„ด ๋ถ„์„ (๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ, ์Šค๋ ˆ๋“œ ๋™์ž‘, ๊ฒฝ์Ÿ ์ƒํƒœ)

์ด ์›Œํฌํ”Œ๋กœ์šฐ๋Š” Puzzle 01์˜ ๊ฐ„๋‹จํ•œ ๋ฐฐ์—ด ์—ฐ์‚ฐ์ด๋“  Puzzle 08์˜ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ”๋“œ๋“  ์ƒ๊ด€์—†์ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

Step 1: ๋””๋ฒ„๊น…์„ ์œ„ํ•œ ์ฝ”๋“œ ์ค€๋น„

๐Ÿฅ‡ ์ฒ ์น™: ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋Š” ์ ˆ๋Œ€ ๋””๋ฒ„๊น…ํ•˜์ง€ ๋งˆ์„ธ์š”. ์ตœ์ ํ™”๋Š” ๋ช…๋ น์–ด ์ˆœ์„œ๋ฅผ ๋ฐ”๊พธ๊ณ , ๋ณ€์ˆ˜๋ฅผ ์ œ๊ฑฐํ•˜๊ณ , ํ•จ์ˆ˜๋ฅผ ์ธ๋ผ์ธํ™”ํ•˜์—ฌ ๋””๋ฒ„๊น…์„ ๊ฑฐ์˜ ๋ถˆ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ๋นŒ๋“œํ•˜๊ธฐ

๋””๋ฒ„๊น…์šฉ Mojo ํ”„๋กœ๊ทธ๋žจ์„ ๋นŒ๋“œํ•  ๋•Œ๋Š” ํ•ญ์ƒ ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ์„ ํฌํ•จํ•˜์„ธ์š”:

# ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ๋นŒ๋“œ
mojo build -O0 -g your_program.mojo -o your_program_debug

์ด ํ”Œ๋ž˜๊ทธ๋“ค์ด ํ•˜๋Š” ์ผ:

  • -O0: ๋ชจ๋“  ์ตœ์ ํ™”๋ฅผ ๋น„ํ™œ์„ฑํ™”ํ•˜์—ฌ ์›๋ž˜ ์ฝ”๋“œ ๊ตฌ์กฐ๋ฅผ ๋ณด์กด
  • -g: ๋””๋ฒ„๊ฑฐ๊ฐ€ ๋จธ์‹  ์ฝ”๋“œ๋ฅผ Mojo ์†Œ์Šค์— ๋งคํ•‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ํฌํ•จ
  • -o: ์‰ฌ์šด ์‹๋ณ„์„ ์œ„ํ•ด ๋ช…๋ช…๋œ ์ถœ๋ ฅ ํŒŒ์ผ ์ƒ์„ฑ

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ 

๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ์—†์ด๋Š” ๋””๋ฒ„๊น… ์„ธ์…˜์ด ์ด๋ ‡๊ฒŒ ๋ณด์ž…๋‹ˆ๋‹ค:

(lldb) print my_variable
error: use of undeclared identifier 'my_variable'

๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ์ด ์žˆ์œผ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ฉ๋‹ˆ๋‹ค:

(lldb) print my_variable
(int) $0 = 42

Step 2: ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ์„ ํƒ

์—ฌ๊ธฐ์„œ GPU ๋””๋ฒ„๊น…์ด ํฅ๋ฏธ๋กœ์›Œ์ง‘๋‹ˆ๋‹ค. ๋„ค ๊ฐ€์ง€ ๋‹ค๋ฅธ ์กฐํ•ฉ ์ค‘์—์„œ ์„ ํƒํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ ์ ˆํ•œ ๊ฒƒ์„ ๊ณ ๋ฅด๋ฉด ์‹œ๊ฐ„์„ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์กฐํ•ฉ

๋น ๋ฅธ ์ฐธ์กฐ:

# 1. JIT + LLDB: ์†Œ์Šค์—์„œ ์ง์ ‘ CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ ๋””๋ฒ„๊น…
pixi run mojo debug your_gpu_program.mojo

# 2. JIT + CUDA-GDB: ์†Œ์Šค์—์„œ ์ง์ ‘ GPU ์ปค๋„ ๋””๋ฒ„๊น…
pixi run mojo debug --cuda-gdb --break-on-launch your_gpu_program.mojo

# 3. ๋ฐ”์ด๋„ˆ๋ฆฌ + LLDB: ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ ๋””๋ฒ„๊น…
pixi run mojo build -O0 -g your_gpu_program.mojo -o your_program_debug
pixi run mojo debug your_program_debug

# 4. ๋ฐ”์ด๋„ˆ๋ฆฌ + CUDA-GDB: ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ GPU ์ปค๋„ ๋””๋ฒ„๊น…
pixi run mojo debug --cuda-gdb --break-on-launch your_program_debug

๊ฐ ์ ‘๊ทผ๋ฒ•์„ ์–ธ์ œ ์‚ฌ์šฉํ• ๊นŒ

ํ•™์Šต๊ณผ ๋น ๋ฅธ ์‹คํ—˜์šฉ:

  • JIT ๋””๋ฒ„๊น… ์‚ฌ์šฉ - ๋นŒ๋“œ ๋‹จ๊ณ„๊ฐ€ ํ•„์š” ์—†์–ด ๋” ๋น ๋ฅด๊ฒŒ ๋ฐ˜๋ณต ๊ฐ€๋Šฅ

๋ณธ๊ฒฉ์ ์ธ ๋””๋ฒ„๊น… ์„ธ์…˜์šฉ:

  • ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น… ์‚ฌ์šฉ - ๋” ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ณ  ๊น”๋”ํ•œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ

CPU ์ธก ๋ฌธ์ œ์šฉ (๋ฒ„ํผ ํ• ๋‹น, ํ˜ธ์ŠคํŠธ ๋ฉ”๋ชจ๋ฆฌ, ํ”„๋กœ๊ทธ๋žจ ๋กœ์ง):

  • LLDB ๋ชจ๋“œ ์‚ฌ์šฉ - main() ํ•จ์ˆ˜์™€ ์„ค์ • ์ฝ”๋“œ ๋””๋ฒ„๊น…์— ์ ํ•ฉ

GPU ์ปค๋„ ๋ฌธ์ œ์šฉ (์Šค๋ ˆ๋“œ ๋™์ž‘, GPU ๋ฉ”๋ชจ๋ฆฌ, ์ปค๋„ ํฌ๋ž˜์‹œ):

  • CUDA-GDB ๋ชจ๋“œ ์‚ฌ์šฉ - ๊ฐœ๋ณ„ GPU ์Šค๋ ˆ๋“œ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ์œ ์ผํ•œ ๋ฐฉ๋ฒ•

์žฅ์ ์€ ๋‹ค์–‘ํ•˜๊ฒŒ ์กฐํ•ฉํ•ด์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. JIT + LLDB๋กœ ์„ค์ • ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•œ ๋‹ค์Œ, JIT + CUDA-GDB๋กœ ์ „ํ™˜ํ•ด์„œ ์‹ค์ œ ์ปค๋„์„ ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


CUDA-GDB๋กœ GPU ์ปค๋„ ๋””๋ฒ„๊น… ์ดํ•ดํ•˜๊ธฐ

์ด์ œ GPU ์ปค๋„ ๋””๋ฒ„๊น…์ž…๋‹ˆ๋‹ค - ๋””๋ฒ„๊น… ๋„๊ตฌ ๋ชจ์Œ์—์„œ ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•˜๋ฉด์„œ๋„ ๋ณต์žกํ•œ ๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค.

--cuda-gdb๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด Mojo๋Š” NVIDIA์˜ CUDA-GDB ๋””๋ฒ„๊ฑฐ์™€ ํ†ตํ•ฉ๋ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋‹จ์ˆœํ•œ ๋””๋ฒ„๊ฑฐ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค - GPU ์ปดํ“จํŒ…์˜ ๋ณ‘๋ ฌ ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋“œ ์„ธ๊ณ„๋ฅผ ์œ„ํ•ด ํŠน๋ณ„ํžˆ ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

CUDA-GDB๊ฐ€ ํŠน๋ณ„ํ•œ ์ด์œ 

์ผ๋ฐ˜ GDB๋Š” ํ•œ ๋ฒˆ์— ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋ฉฐ ์ˆœ์ฐจ ์ฝ”๋“œ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค. CUDA-GDB๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ GPU ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ๋””๋ฒ„๊น…ํ•˜๋ฉฐ, ๊ฐ๊ฐ์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š” ๋‹ค์Œ์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค:

  • GPU ์ปค๋„ ๋‚ด๋ถ€์— ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ • - ์–ด๋–ค ์Šค๋ ˆ๋“œ๋“  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์— ๋„๋‹ฌํ•˜๋ฉด ์‹คํ–‰์„ ์ผ์‹œ ์ •์ง€
  • GPU ์Šค๋ ˆ๋“œ ๊ฐ„ ์ „ํ™˜ - ๊ฐ™์€ ์ˆœ๊ฐ„์— ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฌด์—‡์„ ํ•˜๋Š”์ง€ ๊ฒ€์‚ฌ
  • ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ - ๊ฐ™์€ ๋ณ€์ˆ˜๊ฐ€ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง€๋Š” ๊ฒƒ์„ ํ™•์ธ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋””๋ฒ„๊น… - ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ, ๊ฒฝ์Ÿ ์ƒํƒœ, ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ ํฌ์ฐฉ (์ด๋Ÿฐ ๋ฌธ์ œ ๊ฐ์ง€์— ๋Œ€ํ•ด์„œ๋Š” Puzzle 10์—์„œ ๋” ์ž์„ธํžˆ)
  • ๋ณ‘๋ ฌ ์‹คํ–‰ ๋ถ„์„ - ์Šค๋ ˆ๋“œ๋“ค์ด ์–ด๋–ป๊ฒŒ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ณ  ๋™๊ธฐํ™”ํ•˜๋Š”์ง€ ์ดํ•ด
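์œ„ ๋ชฉ๋ก์—์„œ ์–ธ๊ธ‰ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์™œ ๋””๋ฒ„๊ฑฐ ์—†์ด๋Š” ์žก๊ธฐ ์–ด๋ ค์šด์ง€, ๊ฐœ๋…๋งŒ ๋ณด์—ฌ์ฃผ๋Š” ์„ค๋ช…์šฉ Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ์‹ค์ œ GPU ์ฝ”๋“œ๊ฐ€ ์•„๋‹ˆ๋ฉฐ, ๋‘ "์Šค๋ ˆ๋“œ"์˜ ์ตœ์•…์˜ ์ธํ„ฐ๋ฆฌ๋น™์„ ์ˆœ์ฐจ ์ฝ”๋“œ๋กœ ์žฌํ˜„ํ•œ ๊ฐ€์ƒ์˜ ์˜ˆ์ž…๋‹ˆ๋‹ค:

```python
# ๊ฒฝ์Ÿ ์ƒํƒœ(race condition) ๊ฐœ๋… ์Šค์ผ€์น˜ (์„ค๋ช…์šฉ Python, GPU ์ฝ”๋“œ ์•„๋‹˜)
# ๋‘ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™๊ธฐํ™” ์—†์ด ๊ฐ™์€ ๊ฐ’์„ ์ฝ๊ณ -๋”ํ•˜๊ณ -์“ฐ๋ฉด ๊ฐฑ์‹  ํ•˜๋‚˜๊ฐ€ ์†์‹ค๋ฉ๋‹ˆ๋‹ค.
counter = 0

read_t0 = counter          # ์Šค๋ ˆ๋“œ 0: 0์„ ์ฝ์Œ
read_t1 = counter          # ์Šค๋ ˆ๋“œ 1: ์Šค๋ ˆ๋“œ 0์ด ์“ฐ๊ธฐ ์ „์ด๋ผ ์—ญ์‹œ 0์„ ์ฝ์Œ
counter = read_t0 + 1      # ์Šค๋ ˆ๋“œ 0: 1 ์“ฐ๊ธฐ
counter = read_t1 + 1      # ์Šค๋ ˆ๋“œ 1: ์—ญ์‹œ 1 ์“ฐ๊ธฐ - ์Šค๋ ˆ๋“œ 0์˜ ๊ฐฑ์‹ ์ด ์†์‹ค!

print(counter)  # 2๊ฐ€ ์•„๋‹ˆ๋ผ 1
```

์ด๋Ÿฐ ๊ฒฐ๊ณผ๋Š” ์ธํ„ฐ๋ฆฌ๋น™ ์ˆœ์„œ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€๊ธฐ ๋•Œ๋ฌธ์—, ์Šค๋ ˆ๋“œ๋ณ„ ์ƒํƒœ๋ฅผ ์ง์ ‘ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ๋Š” CUDA-GDB ๊ฐ™์€ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.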

์ด์ „ ํผ์ฆ์˜ ๊ฐœ๋…๊ณผ ์—ฐ๊ฒฐ

Puzzle 1-8์—์„œ ๋ฐฐ์šด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์„ ๊ธฐ์–ตํ•˜์‹œ๋‚˜์š”? CUDA-GDB๋กœ ๋Ÿฐํƒ€์ž„์— ๋ชจ๋“  ๊ฒƒ์„ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ ๋””๋ฒ„๊น…

Puzzle 1-8์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค:

# Puzzle 1์—์„œ: ๊ธฐ๋ณธ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
i = thread_idx.x  # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ ์œ ํ•œ ์ธ๋ฑ์Šค๋ฅผ ์–ป์Œ

# Puzzle 7์—์„œ: 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
row = thread_idx.y  # 2D ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ
col = thread_idx.x
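์Šค๋ ˆ๋“œ ์ขŒํ‘œ๊ฐ€ ๊ณ ์œ ํ•œ ์ธ๋ฑ์Šค๋กœ ์ด์–ด์ง€๋Š” ๋ฐฉ์‹์„ Python์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์ตœ์†Œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. `global_index` ํ•จ์ˆ˜๋ช…๊ณผ ๋ธ”๋ก/์Šค๋ ˆ๋“œ ์ˆ˜๋Š” ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ€์ •์ด๋ฉฐ, ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์“ธ ๋•Œ์˜ ์ „ํ˜•์ ์ธ ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

```python
# ์Šค๋ ˆ๋“œ ๊ณ„์ธต์„ ํ‰๋‚ด ๋‚ธ ์„ค๋ช…์šฉ Python ์‹œ๋ฎฌ๋ ˆ์ด์…˜ (Mojo ์ปค๋„ ์•„๋‹˜)
THREADS_PER_BLOCK = 4   # ๊ฐ€์ •: ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ
NUM_BLOCKS = 2          # ๊ฐ€์ •: ๋ธ”๋ก 2๊ฐœ

def global_index(block_idx: int, thread_idx: int) -> int:
    # block_idx.x * block_dim.x + thread_idx.x ์— ํ•ด๋‹นํ•˜๋Š” ์ „ํ˜•์  ๊ณ„์‚ฐ
    return block_idx * THREADS_PER_BLOCK + thread_idx

indices = [global_index(b, t)
           for b in range(NUM_BLOCKS)
           for t in range(THREADS_PER_BLOCK)]
print(indices)  # [0, 1, 2, 3, 4, 5, 6, 7] - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ ์œ ํ•œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ง
```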

CUDA-GDB๋กœ ์ด ์Šค๋ ˆ๋“œ ์ขŒํ‘œ๋“ค์ด ์‹ค์ œ๋กœ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

(cuda-gdb) info cuda threads

์ถœ๋ ฅ:

  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffcf26fed0 /home/ubuntu/workspace/mojo-gpu-puzzles/solutions/p01/p01.mojo    13

๊ทธ๋ฆฌ๊ณ  ํŠน์ • ์Šค๋ ˆ๋“œ๋กœ ์ด๋™ํ•ด์„œ ๋ฌด์—‡์„ ํ•˜๋Š”์ง€ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

(cuda-gdb) cuda thread (1,0,0)

์ถœ๋ ฅ:

[Switching to CUDA thread (1,0,0)]

์ •๋ง ๊ฐ•๋ ฅํ•œ ๊ธฐ๋Šฅ์ž…๋‹ˆ๋‹ค - ๋ง ๊ทธ๋Œ€๋กœ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ์—์„œ ์‹คํ–‰๋˜๋Š” ๊ฒƒ์„ ์ง์ ‘ ์ง€์ผœ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„ ๋””๋ฒ„๊น…

๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ GPU ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•ด ๋ฐฐ์šด Puzzle 8์„ ๊ธฐ์–ตํ•˜์‹œ๋‚˜์š”? CUDA-GDB๋กœ ๋ชจ๋“  ๊ฒƒ์„ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ (Puzzle 1-5์˜ ๋ฐฐ์—ด๋“ค)
(cuda-gdb) print input_array[0]@4
$1 = {{1}, {2}, {3}, {4}}   # Mojo ์Šค์นผ๋ผ ํ˜•์‹

# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ (thread_idx.x๋Š” ์ž‘๋™ํ•˜์ง€ ์•Š์Œ)
(cuda-gdb) print shared_data[i]   # thread_idx.x ๋Œ€์‹  ๋กœ์ปฌ ๋ณ€์ˆ˜ 'i' ์‚ฌ์šฉ
$2 = {42}

๋””๋ฒ„๊ฑฐ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ •ํ™•ํžˆ ๋ฌด์—‡์„ ๋ณด๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ฒ„๊ทธ๋ฅผ ์žก๊ธฐ์— ์™„๋ฒฝํ•ฉ๋‹ˆ๋‹ค.

์ „๋žต์  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ๋ฐฐ์น˜

CUDA-GDB ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋Š” ๋ณ‘๋ ฌ ์‹คํ–‰๊ณผ ํ•จ๊ป˜ ์ž‘๋™ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ผ๋ฐ˜ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋ณด๋‹ค ํ›จ์”ฌ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค:

# ์–ด๋–ค ์Šค๋ ˆ๋“œ๋“  ์ปค๋„์— ์ง„์ž…ํ•  ๋•Œ ์ค‘๋‹จ
(cuda-gdb) break add_kernel

# ํŠน์ • ์Šค๋ ˆ๋“œ์— ๋Œ€ํ•ด์„œ๋งŒ ์ค‘๋‹จ (๋ฌธ์ œ ๊ฒฉ๋ฆฌ์— ์ข‹์Œ)
(cuda-gdb) break add_kernel if thread_idx.x == 0

# ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์œ„๋ฐ˜ ์‹œ ์ค‘๋‹จ
(cuda-gdb) watch input_array[thread_idx.x]

# ํŠน์ • ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์—์„œ ์ค‘๋‹จ
(cuda-gdb) break add_kernel if input_array[thread_idx.x] > 100.0

์ด๋ฅผ ํ†ตํ•ด ์ˆ˜์ฒœ ๊ฐœ ์Šค๋ ˆ๋“œ์˜ ์ถœ๋ ฅ์— ํŒŒ๋ฌปํžˆ์ง€ ์•Š๊ณ  ์ •ํ™•ํžˆ ๊ด€์‹ฌ ์žˆ๋Š” ์Šค๋ ˆ๋“œ์™€ ์กฐ๊ฑด์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


ํ™˜๊ฒฝ ์ค€๋น„ํ•˜๊ธฐ

๋””๋ฒ„๊น…์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๊ฐœ๋ฐœ ํ™˜๊ฒฝ์ด ์ œ๋Œ€๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”. ์ด์ „ ํผ์ฆ๋“ค์„ ์ง„ํ–‰ํ•ด์™”๋‹ค๋ฉด ๋Œ€๋ถ€๋ถ„ ์ด๋ฏธ ์„ค์ •๋˜์–ด ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค!

์ฐธ๊ณ : pixi ์—†์ด๋Š” NVIDIA ๊ณต์‹ ๋ฆฌ์†Œ์Šค์—์„œ CUDA Toolkit์„ ์ˆ˜๋™์œผ๋กœ ์„ค์น˜ํ•˜๊ณ , ๋“œ๋ผ์ด๋ฒ„ ํ˜ธํ™˜์„ฑ์„ ๊ด€๋ฆฌํ•˜๊ณ , ํ™˜๊ฒฝ ๋ณ€์ˆ˜๋ฅผ ๊ตฌ์„ฑํ•˜๊ณ , ์ปดํฌ๋„ŒํŠธ ๊ฐ„ ๋ฒ„์ „ ์ถฉ๋Œ์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. pixi๋Š” ๋ชจ๋“  CUDA ์˜์กด์„ฑ, ๋ฒ„์ „, ํ™˜๊ฒฝ ๊ตฌ์„ฑ์„ ์ž๋™์œผ๋กœ ๊ด€๋ฆฌํ•˜์—ฌ ์ด ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

pixi๊ฐ€ ๋””๋ฒ„๊น…์— ์ค‘์š”ํ•œ ์ด์œ 

๋ฌธ์ œ์ : GPU ๋””๋ฒ„๊น…์€ CUDA ํˆดํ‚ท, GPU ๋“œ๋ผ์ด๋ฒ„, Mojo ์ปดํŒŒ์ผ๋Ÿฌ, ๋””๋ฒ„๊ฑฐ ์ปดํฌ๋„ŒํŠธ ๊ฐ„์˜ ์ •๋ฐ€ํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๋ฒ„์ „ ๋ถˆ์ผ์น˜๋Š” โ€œ๋””๋ฒ„๊ฑฐ๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์Œโ€ ์˜ค๋ฅ˜๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ด๊ฒฐ์ฑ…: pixi๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด ๋ชจ๋“  ์ปดํฌ๋„ŒํŠธ๊ฐ€ ์กฐํ™”๋กญ๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. pixi run mojo debug --cuda-gdb๋ฅผ ์‹คํ–‰ํ•˜๋ฉด pixi๊ฐ€ ์ž๋™์œผ๋กœ:

  • CUDA ํˆดํ‚ท ๊ฒฝ๋กœ ์„ค์ •
  • ์˜ฌ๋ฐ”๋ฅธ GPU ๋“œ๋ผ์ด๋ฒ„ ๋กœ๋“œ
  • Mojo ๋””๋ฒ„๊น… ํ”Œ๋Ÿฌ๊ทธ์ธ ๊ตฌ์„ฑ
  • ํ™˜๊ฒฝ ๋ณ€์ˆ˜๋ฅผ ์ผ๊ด€๋˜๊ฒŒ ๊ด€๋ฆฌ

์„ค์ • ํ™•์ธ

๋ชจ๋“  ๊ฒƒ์ด ์ž‘๋™ํ•˜๋Š”์ง€ ํ™•์ธํ•ด ๋ด…์‹œ๋‹ค:

# 1. GPU ํ•˜๋“œ์›จ์–ด ์ ‘๊ทผ ๊ฐ€๋Šฅ ์—ฌ๋ถ€ ํ™•์ธ
pixi run nvidia-smi
# GPU์™€ ๋“œ๋ผ์ด๋ฒ„ ๋ฒ„์ „์ด ํ‘œ์‹œ๋˜์–ด์•ผ ํ•จ

# 2. CUDA-GDB ํ†ตํ•ฉ ์„ค์ • (GPU ๋””๋ฒ„๊น…์— ํ•„์š”)
pixi run setup-cuda-gdb
# ์‹œ์Šคํ…œ CUDA-GDB ๋ฐ”์ด๋„ˆ๋ฆฌ๋ฅผ conda ํ™˜๊ฒฝ์— ๋งํฌ

# 3. Mojo ๋””๋ฒ„๊ฑฐ ์‚ฌ์šฉ ๊ฐ€๋Šฅ ์—ฌ๋ถ€ ํ™•์ธ
pixi run mojo debug --help
# --cuda-gdb๋ฅผ ํฌํ•จํ•œ ๋””๋ฒ„๊น… ์˜ต์…˜์ด ํ‘œ์‹œ๋˜์–ด์•ผ ํ•จ

# 4. CUDA-GDB ํ†ตํ•ฉ ํ…Œ์ŠคํŠธ
pixi run cuda-gdb --version
# NVIDIA CUDA-GDB ๋ฒ„์ „ ์ •๋ณด๊ฐ€ ํ‘œ์‹œ๋˜์–ด์•ผ ํ•จ

์ด ๋ช…๋ น์–ด ์ค‘ ํ•˜๋‚˜๋ผ๋„ ์‹คํŒจํ•˜๋ฉด pixi.toml ๊ตฌ์„ฑ์„ ๋‹ค์‹œ ํ™•์ธํ•˜๊ณ  CUDA ํˆดํ‚ท ๊ธฐ๋Šฅ์ด ํ™œ์„ฑํ™”๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.

์ค‘์š”: conda์˜ cuda-gdb ํŒจํ‚ค์ง€๋Š” ๋ž˜ํผ ์Šคํฌ๋ฆฝํŠธ๋งŒ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์— pixi run setup-cuda-gdb ๋ช…๋ น์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ช…๋ น์€ ์‹œ์Šคํ…œ CUDA ์„ค์น˜์—์„œ ์‹ค์ œ CUDA-GDB ๋ฐ”์ด๋„ˆ๋ฆฌ๋ฅผ ์ž๋™ ๊ฐ์ง€ํ•˜๊ณ  conda ํ™˜๊ฒฝ์— ๋งํฌํ•˜์—ฌ ์ „์ฒด GPU ๋””๋ฒ„๊น… ๊ธฐ๋Šฅ์„ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค.

์ด ๋ช…๋ น์ด ํ•˜๋Š” ์ผ:

์Šคํฌ๋ฆฝํŠธ๋Š” ์—ฌ๋Ÿฌ ์ผ๋ฐ˜์ ์ธ ์œ„์น˜์—์„œ CUDA๋ฅผ ์ž๋™ ๊ฐ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • $CUDA_HOME ํ™˜๊ฒฝ ๋ณ€์ˆ˜
  • /usr/local/cuda (Ubuntu/Debian ๊ธฐ๋ณธ๊ฐ’)
  • /opt/cuda (ArchLinux ๋ฐ ๊ธฐํƒ€ ๋ฐฐํฌํŒ)
  • ์‹œ์Šคํ…œ PATH (which cuda-gdb ํ†ตํ•ด)

๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ์€ scripts/setup-cuda-gdb.sh๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

WSL ์‚ฌ์šฉ์ž๋ฅผ ์œ„ํ•œ ํŠน๋ณ„ ์ฐธ๊ณ ์‚ฌํ•ญ: Part II์—์„œ ์‚ฌ์šฉํ•  ๋‘ ๊ฐ€์ง€ ๋””๋ฒ„๊ทธ ๋„๊ตฌ(cuda-gdb์™€ compute-sanitizer)๋Š” WSL์—์„œ CUDA ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋””๋ฒ„๊น…์„ ์ง€์›ํ•˜์ง€๋งŒ, ๋ ˆ์ง€์ŠคํŠธ๋ฆฌ ํ‚ค HKEY_LOCAL_MACHINE\SOFTWARE\NVIDIA Corporation\GPUDebugger\EnableInterface๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  (DWORD) 1๋กœ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ํ”Œ๋žซํผ๊ณผ OS๋ณ„ ๋™์ž‘์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ cuda-gdb์™€ compute-sanitizer๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.


์‹ค์Šต ํŠœํ† ๋ฆฌ์–ผ: ์ฒซ GPU ๋””๋ฒ„๊น… ์„ธ์…˜

์ด๋ก ๋„ ์ข‹์ง€๋งŒ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋Š” ๊ฒƒ๋งŒ ํ•œ ๊ฒŒ ์—†์Šต๋‹ˆ๋‹ค. Puzzle 01 - ์—ฌ๋Ÿฌ๋ถ„์ด ์ž˜ ์•„๋Š” ๊ฐ„๋‹จํ•œ โ€œ๋ฐฐ์—ด ๊ฐ ์š”์†Œ์— 10 ๋”ํ•˜๊ธฐโ€ ์ปค๋„์„ ์‚ฌ์šฉํ•ด์„œ ์‹ค์ œ ํ”„๋กœ๊ทธ๋žจ์„ ๋””๋ฒ„๊น…ํ•ด ๋ด…์‹œ๋‹ค.

์™œ Puzzle 01์ธ๊ฐ€? ๋‹ค์Œ ์ด์œ ๋กœ ์™„๋ฒฝํ•œ ๋””๋ฒ„๊น… ํŠœํ† ๋ฆฌ์–ผ์ž…๋‹ˆ๋‹ค:

  • ์ถฉ๋ถ„ํžˆ ๋‹จ์ˆœํ•ด์„œ ๋ฌด์—‡์ด ์ผ์–ด๋‚˜์•ผ ํ•˜๋Š”์ง€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Œ
  • ์‹ค์ œ ์ปค๋„ ์‹คํ–‰์ด ์žˆ๋Š” ์ง„์งœ GPU ์ฝ”๋“œ
  • CPU ์„ค์ • ์ฝ”๋“œ์™€ GPU ์ปค๋„ ์ฝ”๋“œ ๋ชจ๋‘ ํฌํ•จ
  • ์งง์€ ์‹คํ–‰ ์‹œ๊ฐ„์œผ๋กœ ๋น ๋ฅธ ๋ฐ˜๋ณต ๊ฐ€๋Šฅ

์ด ํŠœํ† ๋ฆฌ์–ผ์ด ๋๋‚˜๋ฉด ๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ๋ชจ๋‘๋กœ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ์„ ๋””๋ฒ„๊น…ํ•˜๊ณ , ์‹ค์ œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ์„ ๋ณด๊ณ , ๋งค์ผ ์‚ฌ์šฉํ•  ํ•„์ˆ˜ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋ฅผ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ํ•™์Šต ๊ฒฝ๋กœ

Puzzle 01์„ ์˜ˆ์ œ๋กœ ๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์กฐํ•ฉ์„ ํƒ์ƒ‰ํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต ๊ฒฝ๋กœ: JIT + LLDB(๊ฐ€์žฅ ์‰ฌ์›€)๋กœ ์‹œ์ž‘ํ•ด์„œ CUDA-GDB(๊ฐ€์žฅ ๊ฐ•๋ ฅํ•จ)๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

โš ๏ธ GPU ๋””๋ฒ„๊น… ์‹œ ์ค‘์š”์‚ฌํ•ญ:

  • --break-on-launch ํ”Œ๋ž˜๊ทธ๋Š” CUDA-GDB ์ ‘๊ทผ๋ฒ•์—์„œ ํ•„์ˆ˜
  • ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ (์ ‘๊ทผ๋ฒ• 3 & 4)๋Š” ๋””๋ฒ„๊น…์„ ์œ„ํ•ด i ๊ฐ™์€ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ๋ณด์กด
  • JIT ์ปดํŒŒ์ผ (์ ‘๊ทผ๋ฒ• 1 & 2)์€ ๋Œ€๋ถ€๋ถ„์˜ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์ตœ์ ํ™”๋กœ ์ œ๊ฑฐ
  • ๋ณธ๊ฒฉ์ ์ธ GPU ๋””๋ฒ„๊น…์—๋Š” ์ ‘๊ทผ๋ฒ• 4 (๋ฐ”์ด๋„ˆ๋ฆฌ + CUDA-GDB) ์‚ฌ์šฉ

ํŠœํ† ๋ฆฌ์–ผ Step 1: LLDB๋กœ CPU ๋””๋ฒ„๊น…

๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค๋กœ ์‹œ์ž‘ํ•ฉ์‹œ๋‹ค: ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜๊ฑฐ๋‚˜ ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋™์ž‘์„ ํ•ด์„œ main() ํ•จ์ˆ˜์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ๋ด์•ผ ํ•  ๋•Œ.

๋ฏธ์…˜: Puzzle 01์˜ CPU ์ธก ์„ค์ • ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜์—ฌ Mojo๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ์ปค๋„์„ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํŒŒ์•…ํ•ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊ฑฐ ์‹คํ–‰

JIT ์ปดํŒŒ์ผ๋กœ LLDB ๋””๋ฒ„๊ฑฐ๋ฅผ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค:

# ํ•œ ๋‹จ๊ณ„๋กœ p01.mojo๋ฅผ ์ปดํŒŒ์ผํ•˜๊ณ  ๋””๋ฒ„๊น…
pixi run mojo debug solutions/p01/p01.mojo

LLDB ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋ณด์ž…๋‹ˆ๋‹ค: (lldb). ์ด์ œ ๋””๋ฒ„๊ฑฐ ์•ˆ์—์„œ ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์„ ๊ฒ€์‚ฌํ•  ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค!

์ฒซ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋“ค

Puzzle 01์ด ์‹คํ–‰๋  ๋•Œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ถ”์ ํ•ด ๋ด…์‹œ๋‹ค. ๋ณด์—ฌ๋“œ๋ฆฐ ๋Œ€๋กœ ์ •ํ™•ํžˆ ์ด ๋ช…๋ น์–ด๋“ค์„ ์ž…๋ ฅํ•˜๊ณ  ์ถœ๋ ฅ์„ ๊ด€์ฐฐํ•˜์„ธ์š”:

Step 1: main ํ•จ์ˆ˜์— ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •

(lldb) br set -n main

์ถœ๋ ฅ:

Breakpoint 1: where = mojo`main, address = 0x00000000027d7530

๋””๋ฒ„๊ฑฐ๊ฐ€ main ํ•จ์ˆ˜๋ฅผ ์ฐพ์•˜๊ณ  ๊ฑฐ๊ธฐ์„œ ์‹คํ–‰์„ ์ผ์‹œ ์ •์ง€ํ•ฉ๋‹ˆ๋‹ค.

Step 2: ํ”„๋กœ๊ทธ๋žจ ์‹œ์ž‘

(lldb) run

์ถœ๋ ฅ:

Process 186951 launched: '/home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/default/bin/mojo' (x86_64)
Process 186951 stopped
* thread #1, name = 'mojo', stop reason = breakpoint 1.1
    frame #0: 0x0000555557d2b530 mojo`main
mojo`main:
->  0x555557d2b530 <+0>: pushq  %rbp
    0x555557d2b531 <+1>: movq   %rsp, %rbp
    ...

ํ”„๋กœ๊ทธ๋žจ์ด ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์—์„œ ๋ฉˆ์ท„์Šต๋‹ˆ๋‹ค. ํ˜„์žฌ ์–ด์…ˆ๋ธ”๋ฆฌ ์ฝ”๋“œ๋ฅผ ๋ณด๊ณ  ์žˆ๋Š”๋ฐ ์ด๋Š” ์ •์ƒ์ž…๋‹ˆ๋‹ค - ๋””๋ฒ„๊ฑฐ๊ฐ€ ๊ณ ์ˆ˜์ค€ Mojo ์†Œ์Šค์— ๋„๋‹ฌํ•˜๊ธฐ ์ „์— ์ €์ˆ˜์ค€ ๋จธ์‹  ์ฝ”๋“œ์—์„œ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

Step 3: ์‹œ์ž‘ ๊ณผ์ • ํƒ์ƒ‰

# ๋ช…๋ น์–ด ํ•˜๋‚˜๋ฅผ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ ์‹œ๋„
(lldb) next

์ถœ๋ ฅ:

Process 186951 stopped
* thread #1, name = 'mojo', stop reason = instruction step over
    frame #0: 0x0000555557d2b531 mojo`main + 1
mojo`main:
->  0x555557d2b531 <+1>: movq   %rsp, %rbp
    0x555557d2b534 <+4>: pushq  %r15
    ...

์–ด์…ˆ๋ธ”๋ฆฌ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์€ ์ง€๋ฃจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋” ๊ด€๋ จ ์žˆ๋Š” ๋ถ€๋ถ„์œผ๋กœ ์ง„ํ–‰ํ•ฉ์‹œ๋‹ค.

Step 4: Mojo ์†Œ์Šค ์ฝ”๋“œ์— ๋„๋‹ฌํ•˜๊ธฐ ์œ„ํ•ด ๊ณ„์†

# ์‹œ์ž‘ ์–ด์…ˆ๋ธ”๋ฆฌ๋ฅผ ๊ฑด๋„ˆ๋›ฐ์–ด ์‹ค์ œ ์ฝ”๋“œ๋กœ ์ด๋™
(lldb) continue

์ถœ๋ ฅ:

Process 186951 resuming
Process 186951 stopped and restarted: thread 1 received signal: SIGCHLD
2 locations added to breakpoint 1
Process 186951 stopped
* thread #1, name = 'mojo', stop reason = breakpoint 1.3
    frame #0: 0x00007fff5c01e841 JIT(0x7fff5c075000)`stdlib::builtin::_startup::__mojo_main_prototype(argc=([0] = 1), argv=0x00007fffffffa858) at _startup.mojo:95:4

Mojo์˜ ๋Ÿฐํƒ€์ž„์ด ์ดˆ๊ธฐํ™” ์ค‘์ž…๋‹ˆ๋‹ค. _startup.mojo๋Š” Mojo์˜ ๋‚ด๋ถ€ ์‹œ์ž‘ ์ฝ”๋“œ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. SIGCHLD ์‹œ๊ทธ๋„์€ ์ •์ƒ์ž…๋‹ˆ๋‹ค - Mojo๊ฐ€ ๋‚ด๋ถ€ ํ”„๋กœ์„ธ์Šค๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

Step 5: ์‹ค์ œ ์ฝ”๋“œ๋กœ ๊ณ„์†

# ํ•œ ๋ฒˆ ๋” continueํ•ด์„œ p01.mojo ์ฝ”๋“œ์— ๋„๋‹ฌ!
(lldb) continue

์ถœ๋ ฅ:

Process 186951 resuming
Process 186951 stopped
* thread #1, name = 'mojo', stop reason = breakpoint 1.2
    frame #0: 0x00007fff5c014040 JIT(0x7fff5c075000)`p01::main(__error__=<unavailable>) at p01.mojo:24:23
   21
   22
   23   def main():
-> 24       with DeviceContext() as ctx:
   25           out = ctx.enqueue_create_buffer[dtype](SIZE)
   26           out.enqueue_fill(0)
   27           a = ctx.enqueue_create_buffer[dtype](SIZE)

์ด์ œ ์‹ค์ œ Mojo ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฃผ๋ชฉํ•  ์ :

  • p01.mojo ํŒŒ์ผ์˜ 21-27๋ฒˆ ์ค„
  • ํ˜„์žฌ ์ค„ 24: with DeviceContext() as ctx:
  • JIT ์ปดํŒŒ์ผ: JIT(0x7fff5c075000)์€ Mojo๊ฐ€ ์ฝ”๋“œ๋ฅผ ์ฆ‰์„์—์„œ ์ปดํŒŒ์ผํ–ˆ์Œ์„ ๋‚˜ํƒ€๋ƒ„

Step 6: ํ”„๋กœ๊ทธ๋žจ ์™„๋ฃŒ

# ํ”„๋กœ๊ทธ๋žจ์„ ์™„๋ฃŒ๊นŒ์ง€ ์‹คํ–‰
(lldb) continue

์ถœ๋ ฅ:

Process 186951 resuming
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
Process 186951 exited with status = 0 (0x00000000)

๋ฐฐ์šด ๋‚ด์šฉ

๐ŸŽ“ ์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ์ฒซ GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… ์„ธ์…˜์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฌด์Šจ ์ผ์ด ์žˆ์—ˆ๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๊ฑฐ์ณ์˜จ ๋””๋ฒ„๊น… ์—ฌ์ •:

  1. ์–ด์…ˆ๋ธ”๋ฆฌ๋กœ ์‹œ์ž‘ - ์ €์ˆ˜์ค€ ๋””๋ฒ„๊น…์—์„œ๋Š” ์ •์ƒ์ ์ธ ํ˜„์ƒ์ด๋ฉฐ, ๋””๋ฒ„๊ฑฐ๊ฐ€ ๋จธ์‹  ์ˆ˜์ค€์—์„œ ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”์ง€ ๋ณด์—ฌ์คŒ
  2. Mojo ์‹œ์ž‘ ๊ณผ์ • ํƒ์ƒ‰ - Mojo์— ๋‚ด๋ถ€ ์ดˆ๊ธฐํ™” ์ฝ”๋“œ๊ฐ€ ์žˆ์Œ์„ ํ•™์Šต
  3. ์†Œ์Šค ์ฝ”๋“œ ๋„๋‹ฌ - ๊ตฌ๋ฌธ ๊ฐ•์กฐ๊ฐ€ ๋œ ์‹ค์ œ p01.mojo 21-27๋ฒˆ ์ค„ ํ™•์ธ
  4. JIT ์ปดํŒŒ์ผ ๊ด€์ฐฐ - Mojo๊ฐ€ ์ฝ”๋“œ๋ฅผ ์ฆ‰์„์—์„œ ์ปดํŒŒ์ผํ•˜๋Š” ๊ฒƒ์„ ๊ด€์ฐฐ
  5. ์„ฑ๊ณต์ ์ธ ์‹คํ–‰ ํ™•์ธ - ํ”„๋กœ๊ทธ๋žจ์ด ์˜ˆ์ƒ๋œ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•จ์„ ํ™•์ธ

LLDB ๋””๋ฒ„๊น…์ด ์ œ๊ณตํ•˜๋Š” ๊ฒƒ:

  • โœ… CPU ์ธก ๊ฐ€์‹œ์„ฑ: main() ํ•จ์ˆ˜, ๋ฒ„ํผ ํ• ๋‹น, ๋ฉ”๋ชจ๋ฆฌ ์„ค์ • ํ™•์ธ
  • โœ… ์†Œ์Šค ์ฝ”๋“œ ๊ฒ€์‚ฌ: ์ค„ ๋ฒˆํ˜ธ๊ฐ€ ์žˆ๋Š” ์‹ค์ œ Mojo ์ฝ”๋“œ ๋ณด๊ธฐ
  • โœ… ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ: ํ˜ธ์ŠคํŠธ ์ธก ๋ณ€์ˆ˜(CPU ๋ฉ”๋ชจ๋ฆฌ) ๊ฐ’ ํ™•์ธ
  • โœ… ํ”„๋กœ๊ทธ๋žจ ํ๋ฆ„ ์ œ์–ด: ์„ค์ • ๋กœ์ง์„ ์ค„ ๋‹จ์œ„๋กœ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰
  • โœ… ์˜ค๋ฅ˜ ์กฐ์‚ฌ: ์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋“ฑ์˜ ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…

LLDB๊ฐ€ ํ•  ์ˆ˜ ์—†๋Š” ๊ฒƒ:

  • โŒ GPU ์ปค๋„ ๊ฒ€์‚ฌ: add_10 ํ•จ์ˆ˜ ์‹คํ–‰ ๋‚ด๋ถ€๋กœ ์ง„์ž… ๋ถˆ๊ฐ€๋Šฅ
  • โŒ ์Šค๋ ˆ๋“œ ์ˆ˜์ค€ ๋””๋ฒ„๊น…: ๊ฐœ๋ณ„ GPU ์Šค๋ ˆ๋“œ ๋™์ž‘ ํ™•์ธ ๋ถˆ๊ฐ€
  • โŒ GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณด๋Š” ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ ๋ถˆ๊ฐ€
  • โŒ ๋ณ‘๋ ฌ ์‹คํ–‰ ๋ถ„์„: ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ๋™๊ธฐํ™” ๋””๋ฒ„๊น… ๋ถˆ๊ฐ€

LLDB ๋””๋ฒ„๊น…์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • GPU ์ฝ”๋“œ๊ฐ€ ์‹คํ–‰๋˜๊ธฐ ์ „์— ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•  ๋•Œ
  • ๋ฒ„ํผ ํ• ๋‹น์ด๋‚˜ ๋ฉ”๋ชจ๋ฆฌ ์„ค์ • ๋ฌธ์ œ
  • ํ”„๋กœ๊ทธ๋žจ ์ดˆ๊ธฐํ™”์™€ ํ๋ฆ„ ์ดํ•ด
  • Mojo ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด ์–ด๋–ป๊ฒŒ ์‹œ์ž‘๋˜๋Š”์ง€ ํ•™์Šต
  • ๋น ๋ฅธ ํ”„๋กœํ† ํƒ€์ดํ•‘๊ณผ ์ฝ”๋“œ ๋ณ€๊ฒฝ ์‹คํ—˜

ํ•ต์‹ฌ ํ†ต์ฐฐ: LLDB๋Š” ํ˜ธ์ŠคํŠธ ์ธก ๋””๋ฒ„๊น…์— ์™„๋ฒฝํ•ฉ๋‹ˆ๋‹ค - GPU ์‹คํ–‰ ์ „ํ›„์— CPU์—์„œ ์ผ์–ด๋‚˜๋Š” ๋ชจ๋“  ๊ฒƒ. ์‹ค์ œ GPU ์ปค๋„ ๋””๋ฒ„๊น…์—๋Š” ๋‹ค์Œ ์ ‘๊ทผ๋ฒ•์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹คโ€ฆ

ํŠœํ† ๋ฆฌ์–ผ Step 2: ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…

JIT ๋””๋ฒ„๊น…์„ ๋ฐฐ์› ์œผ๋‹ˆ ์ด์ œ ํ”„๋กœ๋•์…˜ ํ™˜๊ฒฝ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ „๋ฌธ์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ํƒ์ƒ‰ํ•ฉ์‹œ๋‹ค.

์‹œ๋‚˜๋ฆฌ์˜ค: ์—ฌ๋Ÿฌ ํŒŒ์ผ์ด ์žˆ๋Š” ๋ณต์žกํ•œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋””๋ฒ„๊น…ํ•˜๊ฑฐ๋‚˜ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ๋””๋ฒ„๊น…ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋จผ์ € ๋ฐ”์ด๋„ˆ๋ฆฌ๋ฅผ ๋นŒ๋“œํ•˜๋ฉด ๋” ๋งŽ์€ ์ œ์–ด์™€ ๋น ๋ฅธ ๋””๋ฒ„๊น… ๋ฐ˜๋ณต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊ทธ ๋ฐ”์ด๋„ˆ๋ฆฌ ๋นŒ๋“œ

Step 1: ๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ์ปดํŒŒ์ผ

# ๋””๋ฒ„๊ทธ ๋นŒ๋“œ ์ƒ์„ฑ (๋ช…ํ™•ํ•œ ๋ช…๋ช…์— ์ฃผ๋ชฉ)
pixi run mojo build -O0 -g solutions/p01/p01.mojo -o solutions/p01/p01_debug

์—ฌ๊ธฐ์„œ ์ผ์–ด๋‚˜๋Š” ์ผ:

  • ๐Ÿ”ง -O0: ์ตœ์ ํ™” ๋น„ํ™œ์„ฑํ™” (์ •ํ™•ํ•œ ๋””๋ฒ„๊น…์— ๋ฐ˜๋“œ์‹œ ํ•„์š”)
  • ๐Ÿ” -g: ๋จธ์‹  ์ฝ”๋“œ๋ฅผ ์†Œ์Šค ์ฝ”๋“œ์— ๋งคํ•‘ํ•˜๋Š” ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ํฌํ•จ
  • ๐Ÿ“ -o p01_debug: ๋ช…ํ™•ํ•˜๊ฒŒ ์ด๋ฆ„ ์ง€์€ ๋””๋ฒ„๊ทธ ๋ฐ”์ด๋„ˆ๋ฆฌ ์ƒ์„ฑ

Step 2: ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…

# ๋ฏธ๋ฆฌ ๋นŒ๋“œ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…
pixi run mojo debug solutions/p01/p01_debug

๋ฌด์—‡์ด ๋‹ค๋ฅธ๊ฐ€ (๊ทธ๋ฆฌ๊ณ  ๋” ๋‚˜์€๊ฐ€)

์‹œ์ž‘ ๋น„๊ต:

JIT ๋””๋ฒ„๊น…๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…
ํ•œ ๋‹จ๊ณ„๋กœ ์ปดํŒŒ์ผ + ๋””๋ฒ„๊น…ํ•œ ๋ฒˆ ๋นŒ๋“œ, ์—ฌ๋Ÿฌ ๋ฒˆ ๋””๋ฒ„๊น…
๋А๋ฆฐ ์‹œ์ž‘ (์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ)๋น ๋ฅธ ์‹œ์ž‘
์ปดํŒŒ์ผ ๋ฉ”์‹œ์ง€๊ฐ€ ๋””๋ฒ„๊ทธ ์ถœ๋ ฅ๊ณผ ์„ž์ž„๊น”๋”ํ•œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ
๋””๋ฒ„๊น… ์ค‘ ์ƒ์„ฑ๋˜๋Š” ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ๊ณ ์ •๋œ ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ

๊ฐ™์€ LLDB ๋ช…๋ น์–ด(br set -n main, run, continue)๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฐจ์ด๋ฅผ ๋А๋‚„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋น ๋ฅธ ์‹œ์ž‘ - ์ปดํŒŒ์ผ ์ง€์—ฐ ์—†์Œ
  • ๊น”๋”ํ•œ ์ถœ๋ ฅ - JIT ์ปดํŒŒ์ผ ๋ฉ”์‹œ์ง€ ์—†์Œ
  • ๋” ์˜ˆ์ธก ๊ฐ€๋Šฅ - ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ์ด ์‹คํ–‰ ๊ฐ„์— ๋ณ€ํ•˜์ง€ ์•Š์Œ
  • ์ „๋ฌธ์ ์ธ ์›Œํฌํ”Œ๋กœ์šฐ - ํ”„๋กœ๋•์…˜ ๋””๋ฒ„๊น…์ด ์ด๋ ‡๊ฒŒ ์ž‘๋™ํ•จ

ํŠœํ† ๋ฆฌ์–ผ Step 3: GPU ์ปค๋„ ๋””๋ฒ„๊น…

์ง€๊ธˆ๊นŒ์ง€๋Š” CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ - ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ์ดˆ๊ธฐํ™”๋ฅผ ๋””๋ฒ„๊น…ํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์ด ์ผ์–ด๋‚˜๋Š” ์‹ค์ œ GPU ์ปค๋„์€ ์–ด๋–จ๊นŒ์š”?

๋ฌธ์ œ์ : add_10 ์ปค๋„์€ ์ž ์žฌ์ ์œผ๋กœ ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋˜๋Š” GPU์—์„œ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. LLDB๋Š” GPU์˜ ๋ณ‘๋ ฌ ์‹คํ–‰ ํ™˜๊ฒฝ์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

ํ•ด๊ฒฐ์ฑ…: CUDA-GDB - GPU ์Šค๋ ˆ๋“œ, GPU ๋ฉ”๋ชจ๋ฆฌ, ๋ณ‘๋ ฌ ์‹คํ–‰์„ ์ดํ•ดํ•˜๋Š” ์ „๋ฌธ ๋””๋ฒ„๊ฑฐ์ž…๋‹ˆ๋‹ค.

CUDA-GDB๊ฐ€ ํ•„์š”ํ•œ ์ด์œ 

GPU ๋””๋ฒ„๊น…์ด ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฅธ ์ด์œ ๋ฅผ ์ดํ•ดํ•ฉ์‹œ๋‹ค:

CPU ๋””๋ฒ„๊น… (LLDB):

  • ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋˜๋Š” ๋‹จ์ผ ์Šค๋ ˆ๋“œ
  • ์ถ”์ ํ•  ์ฝœ ์Šคํƒ์ด ํ•˜๋‚˜๋ฟ
  • ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋ธ
  • ๋ณ€์ˆ˜๊ฐ€ ๋‹จ์ผ ๊ฐ’์„ ๊ฐ€์ง

GPU ๋””๋ฒ„๊น… (CUDA-GDB):

  • ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋˜๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ
  • ์—ฌ๋Ÿฌ ์ฝœ ์Šคํƒ (์Šค๋ ˆ๋“œ๋‹น ํ•˜๋‚˜)
  • ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ (์ „์—ญ, ๊ณต์œ , ๋กœ์ปฌ, ๋ ˆ์ง€์Šคํ„ฐ)
  • ๊ฐ™์€ ๋ณ€์ˆ˜๊ฐ€ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง

์‹ค์ œ ์˜ˆ: add_10 ์ปค๋„์—์„œ thread_idx.x ๋ณ€์ˆ˜๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค - ์Šค๋ ˆ๋“œ 0์€ 0์„, ์Šค๋ ˆ๋“œ 1์€ 1์„ ๋ณด๋Š” ์‹์ž…๋‹ˆ๋‹ค. CUDA-GDB๋งŒ์ด ์ด ๋ณ‘๋ ฌ ํ˜„์‹ค์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
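๊ฐ™์€ ์ปค๋„ ์ฝ”๋“œ๊ฐ€ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ `thread_idx.x` ๊ฐ’์œผ๋กœ ์‹คํ–‰๋œ๋‹ค๋Š” ์ ์„, add_10 ์ปค๋„์„ Python์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. GPU์—์„œ๋Š” ๋„ค ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋˜์ง€๋งŒ, ์—ฌ๊ธฐ์„œ๋Š” ์„ค๋ช…์„ ์œ„ํ•ด ์ˆœ์ฐจ ๋ฃจํ”„๋กœ ๋Œ€์‹ ํ•ฉ๋‹ˆ๋‹ค:

```python
# add_10 ์ปค๋„์˜ ๋™์ž‘์„ ํ‰๋‚ด ๋‚ธ ์„ค๋ช…์šฉ Python ์Šค์ผ€์น˜ (์‹ค์ œ GPU ์‹คํ–‰ ์•„๋‹˜)
SIZE = 4
a = [0.0, 1.0, 2.0, 3.0]   # Puzzle 01์˜ ์ž…๋ ฅ๊ณผ ๊ฐ™์€ ๊ฐ’
output = [0.0] * SIZE

def add_10_kernel(thread_idx_x: int):
    # ๋ชจ๋“  "์Šค๋ ˆ๋“œ"๊ฐ€ ๊ฐ™์€ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜์ง€๋งŒ thread_idx_x ๊ฐ’๋งŒ ๋‹ค๋ฆ„
    i = thread_idx_x
    output[i] = a[i] + 10.0

for t in range(SIZE):       # GPU์—์„œ๋Š” ๋™์‹œ์—, ์—ฌ๊ธฐ์„œ๋Š” ์ˆœ์ฐจ๋กœ
    add_10_kernel(t)

print(output)  # [10.0, 11.0, 12.0, 13.0]
```

์Šค๋ ˆ๋“œ 0์€ `a[0]`๋งŒ, ์Šค๋ ˆ๋“œ 1์€ `a[1]`๋งŒ ๋ณด๋Š” ๊ฒƒ์ด ๋ฐ”๋กœ CUDA-GDB์˜ `cuda thread` ์ „ํ™˜์œผ๋กœ ๊ด€์ฐฐํ•˜๊ฒŒ ๋  ๋ชจ์Šต์ž…๋‹ˆ๋‹ค.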

CUDA-GDB ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

Step 1: GPU ์ปค๋„ ๋””๋ฒ„๊น… ์‹œ์ž‘

์ ‘๊ทผ๋ฒ•์„ ์„ ํƒํ•˜์„ธ์š”:

# ์ด๋ฏธ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธ (ํ•œ ๋ฒˆ์ด๋ฉด ์ถฉ๋ถ„)
pixi run setup-cuda-gdb

# JIT + CUDA-GDB ์‚ฌ์šฉ (์œ„์˜ ์ ‘๊ทผ๋ฒ• 2)
pixi run mojo debug --cuda-gdb --break-on-launch solutions/p01/p01.mojo

ํ•™์Šต๊ณผ ๋น ๋ฅธ ๋ฐ˜๋ณต์— ์ ํ•ฉํ•œ JIT + CUDA-GDB ์ ‘๊ทผ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Step 2: ์‹คํ–‰ํ•˜๊ณ  GPU ์ปค๋„ ์ง„์ž… ์‹œ ์ž๋™ ์ •์ง€

CUDA-GDB ํ”„๋กฌํ”„ํŠธ๋Š” ์ด๋ ‡๊ฒŒ ๋ณด์ž…๋‹ˆ๋‹ค: (cuda-gdb). ํ”„๋กœ๊ทธ๋žจ์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค:

# ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ - GPU ์ปค๋„์ด ์‹คํ–‰๋  ๋•Œ ์ž๋™์œผ๋กœ ์ •์ง€
(cuda-gdb) run

์ถœ๋ ฅ:

Starting program: /home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/default/bin/mojo...
[Thread debugging using libthread_db enabled]
...
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0)]

CUDA thread hit application kernel entry function breakpoint, p01_add_10_UnsafePointer...
   <<<(1,1,1),(4,1,1)>>> (output=0x302000000, a=0x302000200) at p01.mojo:16
16          i = thread_idx.x

์„ฑ๊ณต! GPU ์ปค๋„ ๋‚ด๋ถ€์—์„œ ์ž๋™์œผ๋กœ ์ •์ง€ํ–ˆ์Šต๋‹ˆ๋‹ค! --break-on-launch ํ”Œ๋ž˜๊ทธ๊ฐ€ ์ปค๋„ ์‹คํ–‰์„ ๊ฐ์ง€ํ–ˆ๊ณ  ์ด์ œ i = thread_idx.x๊ฐ€ ์‹คํ–‰๋˜๋Š” 16๋ฒˆ ์ค„์— ์žˆ์Šต๋‹ˆ๋‹ค.

์ค‘์š”: break add_10์ฒ˜๋Ÿผ ์ˆ˜๋™์œผ๋กœ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋ฅผ ์„ค์ •ํ•  ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค - ์ปค๋„ ์ง„์ž… ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋Š” ์ž๋™์ž…๋‹ˆ๋‹ค. GPU ์ปค๋„ ํ•จ์ˆ˜๋Š” CUDA-GDB์—์„œ ๋งน๊ธ€๋ง๋œ ์ด๋ฆ„(p01_add_10_UnsafePointer... ๊ฐ™์€)์„ ๊ฐ€์ง€์ง€๋งŒ, ์ด๋ฏธ ์ปค๋„ ์•ˆ์— ์žˆ์œผ๋ฏ€๋กœ ๋ฐ”๋กœ ๋””๋ฒ„๊น…์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Step 3: ๋ณ‘๋ ฌ ์‹คํ–‰ ํƒ์ƒ‰

# ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์—์„œ ์ผ์‹œ ์ •์ง€๋œ ๋ชจ๋“  GPU ์Šค๋ ˆ๋“œ ๋ณด๊ธฐ
(cuda-gdb) info cuda threads

์ถœ๋ ฅ:

  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffd326fb70 /home/ubuntu/workspace/mojo-gpu-puzzles/solutions/p01/p01.mojo    16

์™„๋ฒฝํ•ฉ๋‹ˆ๋‹ค! Puzzle 01์˜ ๋ชจ๋“  4๊ฐœ ๋ณ‘๋ ฌ GPU ์Šค๋ ˆ๋“œ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • *๊ฐ€ ํ˜„์žฌ ์Šค๋ ˆ๋“œ ํ‘œ์‹œ: (0,0,0) - ๋””๋ฒ„๊น… ์ค‘์ธ ์Šค๋ ˆ๋“œ
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: (0,0,0)์—์„œ (3,0,0)๊นŒ์ง€ - ๋ธ”๋ก์˜ ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ
  • Count: 4 - ์ฝ”๋“œ์˜ THREADS_PER_BLOCK = 4์™€ ์ผ์น˜
  • ๊ฐ™์€ ์œ„์น˜: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ p01.mojo์˜ 16๋ฒˆ ์ค„์—์„œ ์ผ์‹œ ์ •์ง€

Step 4: ์ปค๋„์„ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ํ•˜๊ณ  ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

# 'next'๋กœ ์ฝ”๋“œ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ ('step'์€ ๋‚ด๋ถ€๋กœ ๋“ค์–ด๊ฐ)
(cuda-gdb) next

์ถœ๋ ฅ:

p01_add_10_UnsafePointer... at p01.mojo:17
17          output[i] = a[i] + 10.0
# ๋กœ์ปฌ ๋ณ€์ˆ˜๋Š” ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ ์ž‘๋™!
(cuda-gdb) print i

์ถœ๋ ฅ:

$1 = 0                    # ์ด ์Šค๋ ˆ๋“œ์˜ ์ธ๋ฑ์Šค (thread_idx.x ๊ฐ’ ์บก์ฒ˜)
# GPU ๋‚ด์žฅ ๋ณ€์ˆ˜๋Š” ์ž‘๋™ํ•˜์ง€ ์•Š์ง€๋งŒ ํ•„์š” ์—†์Œ
(cuda-gdb) print thread_idx.x

์ถœ๋ ฅ:

No symbol "thread_idx" in current context.
# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
(cuda-gdb) print a[i]     # ์ด ์Šค๋ ˆ๋“œ์˜ ์ž…๋ ฅ: a[0]

์ถœ๋ ฅ:

$2 = {0}                  # ์ž…๋ ฅ ๊ฐ’ (Mojo ์Šค์นผ๋ผ ํ˜•์‹)
(cuda-gdb) print output[i] # ์—ฐ์‚ฐ ์ „ ์ด ์Šค๋ ˆ๋“œ์˜ ์ถœ๋ ฅ

์ถœ๋ ฅ:

$3 = {0}                  # ์•„์ง 0 - ์—ฐ์‚ฐ์ด ์•„์ง ์‹คํ–‰๋˜์ง€ ์•Š์Œ!
# ์—ฐ์‚ฐ ์ค„ ์‹คํ–‰
(cuda-gdb) next

์ถœ๋ ฅ:

13      fn add_10(         # ์—ฐ์‚ฐ ํ›„ ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜ ์ค„๋กœ ์ด๋™
# ์ด์ œ ๊ฒฐ๊ณผ ํ™•์ธ
(cuda-gdb) print output[i]

์ถœ๋ ฅ:

$4 = {10}                 # ์ด์ œ ๊ณ„์‚ฐ๋œ ๊ฒฐ๊ณผ ํ‘œ์‹œ: 0 + 10 = 10
# ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ์—ฌ์ „ํžˆ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
(cuda-gdb) print a

์ถœ๋ ฅ:

$5 = (!pop.scalar<f32> * @register) 0x302000200

Step 5: ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ๊ฐ„ ์ด๋™

# ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋กœ ์ „ํ™˜ํ•ด์„œ ์‹คํ–‰ ํ™•์ธ
(cuda-gdb) cuda thread (1,0,0)

์ถœ๋ ฅ:

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
13      fn add_10(         # ์Šค๋ ˆ๋“œ 1๋„ ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜์— ์žˆ์Œ
# ์Šค๋ ˆ๋“œ์˜ ๋กœ์ปฌ ๋ณ€์ˆ˜ ํ™•์ธ
(cuda-gdb) print i

์ถœ๋ ฅ:

$5 = 1                    # ์Šค๋ ˆ๋“œ 1์˜ ์ธ๋ฑ์Šค (์Šค๋ ˆ๋“œ 0๊ณผ ๋‹ค๋ฆ„!)
# ์ด ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ ๊ฒ€์‚ฌ
(cuda-gdb) print a[i]     # ์ด ์Šค๋ ˆ๋“œ์˜ ์ž…๋ ฅ: a[1]

์ถœ๋ ฅ:

$6 = {1}                  # ์Šค๋ ˆ๋“œ 1์˜ ์ž…๋ ฅ ๊ฐ’
# ์Šค๋ ˆ๋“œ 1์˜ ์—ฐ์‚ฐ์€ ์ด๋ฏธ ์™„๋ฃŒ (๋ณ‘๋ ฌ ์‹คํ–‰!)
(cuda-gdb) print output[i] # ์ด ์Šค๋ ˆ๋“œ์˜ ์ถœ๋ ฅ: output[1]

์ถœ๋ ฅ:

$7 = {11}                 # 1 + 10 = 11 (์ด๋ฏธ ๊ณ„์‚ฐ๋จ)
# ์ตœ๊ณ ์˜ ๊ธฐ๋ฒ•: ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ์— ๋ณด๊ธฐ
(cuda-gdb) print output[0]@4

์ถœ๋ ฅ:

$8 = {{10}, {11}, {12}, {13}}     # ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ช…๋ น์–ด๋กœ!
(cuda-gdb) print a[0]@4

์ถœ๋ ฅ:

$9 = {{0}, {1}, {2}, {3}}         # ๋น„๊ต๋ฅผ ์œ„ํ•œ ๋ชจ๋“  ์ž…๋ ฅ ๊ฐ’
# ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ง„ํ–‰ํ•˜๋ฉด CUDA ์ปจํ…์ŠคํŠธ๋ฅผ ์žƒ์Šต๋‹ˆ๋‹ค
(cuda-gdb) next

์ถœ๋ ฅ:

[Switching to Thread 0x7ffff7e25840 (LWP 306942)]  # ํ˜ธ์ŠคํŠธ ์Šค๋ ˆ๋“œ๋กœ ๋ณต๊ท€
0x00007fffeca3f831 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) print output[i]

์ถœ๋ ฅ:

No symbol "output" in current context.  # GPU ์ปจํ…์ŠคํŠธ๋ฅผ ์žƒ์Œ!

์ด ๋””๋ฒ„๊น… ์„ธ์…˜์˜ ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๐Ÿคฏ ๋ณ‘๋ ฌ ์‹คํ–‰์€ ์ง„์งœ์ž…๋‹ˆ๋‹ค - ์Šค๋ ˆ๋“œ (1,0,0)์œผ๋กœ ์ „ํ™˜ํ•˜๋ฉด ์ด๋ฏธ ์—ฐ์‚ฐ์ด ์™„๋ฃŒ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค!
  • ๊ฐ ์Šค๋ ˆ๋“œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๋ด…๋‹ˆ๋‹ค - i=0 vs i=1, a[i]={0} vs a[i]={1}, output[i]={10} vs output[i]={11}
  • ๋ฐฐ์—ด ๊ฒ€์‚ฌ๊ฐ€ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค - print output[0]@4๋กœ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: {{10}, {11}, {12}, {13}}
  • GPU ์ปจํ…์ŠคํŠธ๋Š” ๊นจ์ง€๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค - ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ง„ํ–‰ํ•˜๋ฉด ํ˜ธ์ŠคํŠธ ์Šค๋ ˆ๋“œ๋กœ ๋Œ์•„๊ฐ€ GPU ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

์ด๊ฒƒ์ด ๋ฐ”๋กœ ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์˜ ๋ณธ์งˆ์ž…๋‹ˆ๋‹ค: ๊ฐ™์€ ์ฝ”๋“œ, ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ, ๋™์‹œ ์‹คํ–‰.

CUDA-GDB๋กœ ๋ฐฐ์šด ๋‚ด์šฉ

๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ๋กœ GPU ์ปค๋„ ์‹คํ–‰ ๋””๋ฒ„๊น…์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ์€ ์‹ค์ œ๋กœ ์ž‘๋™ํ•˜๋Š” ๊ธฐ๋Šฅ๋“ค์ž…๋‹ˆ๋‹ค:

์Šต๋“ํ•œ GPU ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ:

  • โœ… GPU ์ปค๋„ ์ž๋™ ๋””๋ฒ„๊น… - --break-on-launch๊ฐ€ ์ปค๋„ ์ง„์ž… ์‹œ์ ์—์„œ ์ •์ง€ํ•ฉ๋‹ˆ๋‹ค
  • โœ… GPU ์Šค๋ ˆ๋“œ ๊ฐ„ ์ด๋™ - cuda thread๋กœ ์ปจํ…์ŠคํŠธ๋ฅผ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค
  • โœ… ๋กœ์ปฌ ๋ณ€์ˆ˜ ์ ‘๊ทผ - -O0 -g๋กœ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ print i๊ฐ€ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค
  • โœ… ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ i, a[i], output[i] ๊ฐ’์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • โœ… ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๊ฒฐ๊ณผ ๋ณด๊ธฐ - print output[0]@4๋กœ {{10}, {11}, {12}, {13}}์„ ํ•œ ๋ฒˆ์— ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค
  • โœ… GPU ์ฝ”๋“œ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ - next๊ฐ€ ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • โœ… ๋ณ‘๋ ฌ ์‹คํ–‰ ํ™•์ธ - ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค (์ „ํ™˜ํ•˜๋ฉด ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋Š” ์ด๋ฏธ ๊ณ„์‚ฐ ์™„๋ฃŒ)
  • โœ… ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ ‘๊ทผ - output๊ณผ a ํฌ์ธํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • โŒ GPU ๋‚ด์žฅ ๋ณ€์ˆ˜ ์‚ฌ์šฉ ๋ถˆ๊ฐ€ - thread_idx.x, blockIdx.x ๋“ฑ์€ ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค (ํ•˜์ง€๋งŒ ๋กœ์ปฌ ๋ณ€์ˆ˜๋Š” ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค!)
  • ๐Ÿ“Š Mojo ์Šค์นผ๋ผ ํ˜•์‹ - ๊ฐ’์ด 10.0 ๋Œ€์‹  {10}์œผ๋กœ ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค
  • โš ๏ธ ๊นจ์ง€๊ธฐ ์‰ฌ์šด GPU ์ปจํ…์ŠคํŠธ - ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ง„ํ–‰ํ•˜๋ฉด GPU ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ (mojo build -O0 -g)๋Š” ํ•„์ˆ˜์ž…๋‹ˆ๋‹ค - ๋กœ์ปฌ ๋ณ€์ˆ˜๊ฐ€ ๋ณด์กด๋ฉ๋‹ˆ๋‹ค
  • @N์„ ์‚ฌ์šฉํ•œ ๋ฐฐ์—ด ๊ฒ€์‚ฌ - ๋ชจ๋“  ๋ณ‘๋ ฌ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ์— ๋ณด๋Š” ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค
  • GPU ๋‚ด์žฅ ๋ณ€์ˆ˜๋Š” ์—†์Šต๋‹ˆ๋‹ค - ํ•˜์ง€๋งŒ i ๊ฐ™์€ ๋กœ์ปฌ ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค
  • Mojo๋Š” {value} ํ˜•์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค - ์Šค์นผ๋ผ๊ฐ€ 10.0 ๋Œ€์‹  {10}์œผ๋กœ ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค
  • ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰์— ์ฃผ์˜ํ•˜์„ธ์š” - GPU ์ปจํ…์ŠคํŠธ๋ฅผ ์žƒ๊ณ  ํ˜ธ์ŠคํŠธ ์Šค๋ ˆ๋“œ๋กœ ๋Œ์•„๊ฐ€๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค

์‹ค์ œ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•๋“ค

์ด์ œ ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งˆ์ฃผ์น˜๊ฒŒ ๋  ์‹ค์šฉ์ ์ธ ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์‚ดํŽด๋ด…์‹œ๋‹ค:

๊ธฐ๋ฒ• 1: ์Šค๋ ˆ๋“œ ๊ฒฝ๊ณ„ ํ™•์ธ

# ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ณ„์‚ฐํ–ˆ๋Š”์ง€ ํ™•์ธ
(cuda-gdb) print output[0]@4

์ถœ๋ ฅ:

$8 = {{10}, {11}, {12}, {13}}    # ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ณ„์‚ฐ
# ์œ ํšจ ๋ฒ”์œ„๋ฅผ ๋„˜์–ด ํ™•์ธํ•˜์—ฌ ๋ฒ”์œ„ ์ดˆ๊ณผ ๋ฌธ์ œ ๊ฐ์ง€
(cuda-gdb) print output[0]@5

์ถœ๋ ฅ:

$9 = {{10}, {11}, {12}, {13}, {0}}  # ์š”์†Œ 4๋Š” ์ดˆ๊ธฐํ™”๋˜์ง€ ์•Š์Œ (์ข‹์Œ!)
# ์ž…๋ ฅ๊ณผ ๋น„๊ตํ•˜์—ฌ ์—ฐ์‚ฐ ๊ฒ€์ฆ
(cuda-gdb) print a[0]@4

์ถœ๋ ฅ:

$10 = {{0}, {1}, {2}, {3}}       # ์ž…๋ ฅ ๊ฐ’: 0+10=10, 1+10=11 ๋“ฑ

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์€ GPU ํฌ๋ž˜์‹œ์˜ ๊ฐ€์žฅ ํ”ํ•œ ์›์ธ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋””๋ฒ„๊น… ๋‹จ๊ณ„๋กœ ์ผ์ฐ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ธฐ๋ฒ• 2: ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ์ดํ•ด

# ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธ”๋ก์œผ๋กœ ์–ด๋–ป๊ฒŒ ๊ตฌ์„ฑ๋˜๋Š”์ง€ ๋ณด๊ธฐ
(cuda-gdb) info cuda blocks

์ถœ๋ ฅ:

  BlockIdx To BlockIdx Count   State
Kernel 0
*  (0,0,0)     (0,0,0)     1 running
# ํ˜„์žฌ ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋ณด๊ธฐ
(cuda-gdb) info cuda threads

์ถœ๋ ฅ์€ ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ํ™œ์„ฑ ์ƒํƒœ์ธ์ง€, ์ •์ง€๋˜์—ˆ๋Š”์ง€, ์˜ค๋ฅ˜๊ฐ€ ์žˆ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๊ตฌ์„ฑ์„ ์ดํ•ดํ•˜๋ฉด ๋™๊ธฐํ™”์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ฒ• 3: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

# GPU ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ ํ™•์ธ:
(cuda-gdb) print a               # ์ž…๋ ฅ ๋ฐฐ์—ด GPU ํฌ์ธํ„ฐ

์ถœ๋ ฅ:

$9 = (!pop.scalar<f32> * @register) 0x302000200
(cuda-gdb) print output          # ์ถœ๋ ฅ ๋ฐฐ์—ด GPU ํฌ์ธํ„ฐ

์ถœ๋ ฅ:

$10 = (!pop.scalar<f32> * @register) 0x302000000
# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ™•์ธ:
(cuda-gdb) print a[i]            # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ 'i'๋ฅผ ์‚ฌ์šฉํ•ด ์ž์‹ ์˜ ์š”์†Œ์— ์ ‘๊ทผ

์ถœ๋ ฅ:

$11 = {0}                        # ์Šค๋ ˆ๋“œ์˜ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์€ ์„ฑ๋Šฅ๊ณผ ์ •ํ™•์„ฑ์— ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค. ์ž˜๋ชป๋œ ํŒจํ„ด์€ ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ํฌ๋ž˜์‹œ๋ฅผ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ฒ• 4: ๊ฒฐ๊ณผ ๊ฒ€์ฆ ๋ฐ ์™„๋ฃŒ

# ์ปค๋„ ์‹คํ–‰์„ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•œ ํ›„ ์ตœ์ข… ๊ฒฐ๊ณผ ํ™•์ธ
(cuda-gdb) print output[0]@4

์ถœ๋ ฅ:

$11 = {{10}, {11}, {12}, {13}}    # ์™„๋ฒฝ! ๊ฐ ์š”์†Œ๊ฐ€ 10 ์ฆ๊ฐ€ (Mojo ์Šค์นผ๋ผ ํ˜•์‹)
# ํ”„๋กœ๊ทธ๋žจ์„ ์ •์ƒ์ ์œผ๋กœ ์™„๋ฃŒ
(cuda-gdb) continue

์ถœ๋ ฅ:

...ํ”„๋กœ๊ทธ๋žจ ์ถœ๋ ฅ์ด ์„ฑ๊ณต ํ‘œ์‹œ...
# ๋””๋ฒ„๊ฑฐ ์ข…๋ฃŒ
(cuda-gdb) exit

์„ค์ •๋ถ€ํ„ฐ ๊ฒฐ๊ณผ๊นŒ์ง€ GPU ์ปค๋„ ์‹คํ–‰ ๋””๋ฒ„๊น…์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค.

GPU ๋””๋ฒ„๊น… ์—ฌ์ •: ํ•ต์‹ฌ ํ†ต์ฐฐ

ํฌ๊ด„์ ์ธ GPU ๋””๋ฒ„๊น… ํŠœํ† ๋ฆฌ์–ผ์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์— ๋Œ€ํ•ด ๋ฐœ๊ฒฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค:

๋ณ‘๋ ฌ ์‹คํ–‰์— ๋Œ€ํ•œ ๊นŠ์€ ํ†ต์ฐฐ

  1. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ์˜ ์‹ค์ œ: thread_idx.x๊ฐ€ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’(0, 1, 2, 3โ€ฆ)์„ ๊ฐ–๋Š” ๊ฒƒ์„ ์ด๋ก ์ด ์•„๋‹Œ ์ง์ ‘ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค

  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํŒŒ์•…: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ a[thread_idx.x]์—์„œ ์ฝ๊ณ  output[thread_idx.x]์— ์“ฐ๋ฉฐ, ์ถฉ๋Œ ์—†์ด ์™„๋ฒฝํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค

  3. ๋ณ‘๋ ฌ ์‹คํ–‰์˜ ์ดํ•ด: ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์ปค๋„ ์ฝ”๋“œ๋ฅผ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋ฉด์„œ ๊ฐ๊ฐ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

  4. GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ๋ฐฐ์—ด์€ ์ „์—ญ GPU ๋ฉ”๋ชจ๋ฆฌ์— ์žˆ์–ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์Šค๋ ˆ๋“œ๋ณ„ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค

๋ชจ๋“  ํผ์ฆ์— ์ ์šฉ๋˜๋Š” ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•

Puzzle 01๋ถ€ํ„ฐ Puzzle 08, ๊ทธ๋ฆฌ๊ณ  ๊ทธ ์ดํ›„๊นŒ์ง€ ๋ณดํŽธ์ ์œผ๋กœ ์ ์šฉ๋˜๋Š” ๊ธฐ๋ฒ•์„ ์Šต๋“ํ–ˆ์Šต๋‹ˆ๋‹ค:

  • CPU ์ธก ๋ฌธ์ œ(์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น)๋Š” LLDB๋กœ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค
  • GPU ์ปค๋„ ๋ฌธ์ œ(์Šค๋ ˆ๋“œ ๋™์ž‘, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ)๋Š” CUDA-GDB๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค
  • ํŠน์ • ์Šค๋ ˆ๋“œ๋‚˜ ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์— ์ง‘์ค‘ํ•˜๋ ค๋ฉด ์กฐ๊ฑด๋ถ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๋ณ‘๋ ฌ ์‹คํ–‰ ํŒจํ„ด์„ ์ดํ•ดํ•˜๋ ค๋ฉด ์Šค๋ ˆ๋“œ ๊ฐ„ ์ด๋™์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฝ์Ÿ ์ƒํƒœ์™€ ๋ฒ”์œ„ ์ดˆ๊ณผ ์˜ค๋ฅ˜๋ฅผ ์žก์œผ๋ ค๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค

ํ™•์žฅ์„ฑ: ์ด ๊ธฐ๋ฒ•๋“ค์€ ๋‹ค์Œ ๋ชจ๋“  ์ƒํ™ฉ์—์„œ ๋™์ผํ•˜๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค:

  • Puzzle 01: ๊ฐ„๋‹จํ•œ ๋ง์…ˆ์„ ํ•˜๋Š” 4๊ฐœ ์š”์†Œ ๋ฐฐ์—ด
  • Puzzle 08: ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•œ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
  • ํ”„๋กœ๋•์…˜ ์ฝ”๋“œ: ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฑ๋งŒ ๊ฐœ ์š”์†Œ ๋ฐฐ์—ด

ํ•„์ˆ˜ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด ์ฐธ์กฐ

๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๋ฐฐ์› ์œผ๋‹ˆ, ์ผ์ƒ์ ์ธ ๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ ์“ธ ๋น ๋ฅธ ์ฐธ์กฐ ๊ฐ€์ด๋“œ๋ฅผ ๋“œ๋ฆฝ๋‹ˆ๋‹ค. ์ด ์„น์…˜์„ ๋ถ๋งˆํฌํ•˜์„ธ์š”!

GDB ๋ช…๋ น์–ด ์•ฝ์–ด (์‹œ๊ฐ„ ์ ˆ์•ฝ!)

๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” ๋‹จ์ถ•ํ‚ค๋กœ ๋” ๋น ๋ฅธ ๋””๋ฒ„๊น…:

| ์•ฝ์–ด | ์ „์ฒด ๋ช…๋ น์–ด | ๊ธฐ๋Šฅ |
|---|---|---|
| r | run | ํ”„๋กœ๊ทธ๋žจ ์‹œ์ž‘/์‹คํ–‰ |
| c | continue | ์‹คํ–‰ ์žฌ๊ฐœ |
| n | next | ์Šคํ… ์˜ค๋ฒ„ (๊ฐ™์€ ๋ ˆ๋ฒจ) |
| s | step | ํ•จ์ˆ˜ ๋‚ด๋ถ€๋กœ ์ง„์ž… |
| b | break | ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ • |
| p | print | ๋ณ€์ˆ˜ ๊ฐ’ ์ถœ๋ ฅ |
| l | list | ์†Œ์Šค ์ฝ”๋“œ ํ‘œ์‹œ |
| q | quit | ๋””๋ฒ„๊ฑฐ ์ข…๋ฃŒ |

์˜ˆ์‹œ:

(cuda-gdb) r                    # 'run' ๋Œ€์‹ 
(cuda-gdb) b 39                 # 'break 39' ๋Œ€์‹ 
(cuda-gdb) p thread_id          # 'print thread_id' ๋Œ€์‹ 
(cuda-gdb) n                    # 'next' ๋Œ€์‹ 
(cuda-gdb) c                    # 'continue' ๋Œ€์‹ 

โšก Pro ํŒ: ์•ฝ์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์†๋„๊ฐ€ 3-5๋ฐฐ ๋นจ๋ผ์ง‘๋‹ˆ๋‹ค!

LLDB ๋ช…๋ น์–ด (CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ ๋””๋ฒ„๊น…)

์–ธ์ œ ์‚ฌ์šฉ: ์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ํ”„๋กœ๊ทธ๋žจ ํ๋ฆ„, ํ˜ธ์ŠคํŠธ ์ธก ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…

์‹คํ–‰ ์ œ์–ด

(lldb) run                   # ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰
(lldb) continue              # ์‹คํ–‰ ์žฌ๊ฐœ (๋ณ„์นญ: c)
(lldb) step                  # ํ•จ์ˆ˜ ๋‚ด๋ถ€๋กœ ์ง„์ž… (์†Œ์Šค ๋ ˆ๋ฒจ)
(lldb) next                  # ํ•จ์ˆ˜ ๊ฑด๋„ˆ๋›ฐ๊ธฐ (์†Œ์Šค ๋ ˆ๋ฒจ)
(lldb) finish                # ํ˜„์žฌ ํ•จ์ˆ˜์—์„œ ๋‚˜๊ฐ€๊ธฐ

๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ๊ด€๋ฆฌ

(lldb) br set -n main        # main ํ•จ์ˆ˜์— ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •
(lldb) br set -n function_name     # ์–ด๋–ค ํ•จ์ˆ˜์—๋“  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •
(lldb) br list               # ๋ชจ๋“  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ํ‘œ์‹œ
(lldb) br delete 1           # ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ #1 ์‚ญ์ œ
(lldb) br disable 1          # ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ #1 ์ž„์‹œ ๋น„ํ™œ์„ฑํ™”

๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

(lldb) print variable_name   # ๋ณ€์ˆ˜ ๊ฐ’ ํ‘œ์‹œ
(lldb) print pointer[offset]        # ํฌ์ธํ„ฐ ์—ญ์ฐธ์กฐ
(lldb) print array[0]@4      # ์ฒซ 4๊ฐœ ๋ฐฐ์—ด ์š”์†Œ ํ‘œ์‹œ

CUDA-GDB ๋ช…๋ น์–ด (GPU ์ปค๋„ ๋””๋ฒ„๊น…)

์–ธ์ œ ์‚ฌ์šฉ: GPU ์ปค๋„, ์Šค๋ ˆ๋“œ ๋™์ž‘, ๋ณ‘๋ ฌ ์‹คํ–‰, GPU ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ ๋””๋ฒ„๊น…

GPU ์ƒํƒœ ๊ฒ€์‚ฌ

(cuda-gdb) info cuda threads    # ๋ชจ๋“  GPU ์Šค๋ ˆ๋“œ์™€ ์ƒํƒœ ํ‘œ์‹œ
(cuda-gdb) info cuda blocks     # ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋ธ”๋ก ํ‘œ์‹œ
(cuda-gdb) cuda kernel          # ํ™œ์„ฑ GPU ์ปค๋„ ๋‚˜์—ด

์Šค๋ ˆ๋“œ ํƒ์ƒ‰ (๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ๊ธฐ๋Šฅ!)

(cuda-gdb) cuda thread (0,0,0)  # ํŠน์ • ์Šค๋ ˆ๋“œ ์ขŒํ‘œ๋กœ ์ „ํ™˜
(cuda-gdb) cuda block (0,0)     # ํŠน์ • ๋ธ”๋ก์œผ๋กœ ์ „ํ™˜
(cuda-gdb) cuda thread          # ํ˜„์žฌ ์Šค๋ ˆ๋“œ ์ขŒํ‘œ ํ‘œ์‹œ

์Šค๋ ˆ๋“œ๋ณ„ ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

# ๋กœ์ปฌ ๋ณ€์ˆ˜์™€ ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ:
(cuda-gdb) print i              # ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๋ณ€์ˆ˜
(cuda-gdb) print output         # ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํฌ์ธํ„ฐ
(cuda-gdb) print a              # ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํฌ์ธํ„ฐ

GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ๋ฐฐ์—ด ๊ฒ€์‚ฌ (์‹ค์ œ๋กœ ์ž‘๋™ํ•˜๋Š” ๊ฒƒ):
(cuda-gdb) print array[i]       # ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐฐ์—ด ์ ‘๊ทผ
(cuda-gdb) print array[0]@4     # ์—ฌ๋Ÿฌ ์š”์†Œ ๋ณด๊ธฐ: {{val1}, {val2}, {val3}, {val4}}

๊ณ ๊ธ‰ GPU ๋””๋ฒ„๊น…

# ๋ฉ”๋ชจ๋ฆฌ ๊ฐ์‹œ
(cuda-gdb) watch array[i]     # ๋ฉ”๋ชจ๋ฆฌ ๋ณ€๊ฒฝ ์‹œ ์ค‘๋‹จ
(cuda-gdb) rwatch array[i]    # ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ์‹œ ์ค‘๋‹จ

๋น ๋ฅธ ์ฐธ์กฐ: ๋””๋ฒ„๊น… ๊ฒฐ์ • ํŠธ๋ฆฌ

๐Ÿค” ์–ด๋–ค ์œ ํ˜•์˜ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๊ณ  ์žˆ๋‚˜์š”?

GPU ์ฝ”๋“œ ์‹คํ–‰ ์ „์— ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œ

โ†’ LLDB ๋””๋ฒ„๊น… ์‚ฌ์šฉ

pixi run mojo debug your_program.mojo

GPU ์ปค๋„์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ƒ์„ฑ

โ†’ ์กฐ๊ฑด๋ถ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์™€ ํ•จ๊ป˜ CUDA-GDB ์‚ฌ์šฉ

pixi run mojo debug --cuda-gdb --break-on-launch your_program.mojo

์„ฑ๋Šฅ ๋ฌธ์ œ๋‚˜ ๊ฒฝ์Ÿ ์ƒํƒœ

โ†’ ์žฌํ˜„์„ฑ์„ ์œ„ํ•ด ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น… ์‚ฌ์šฉ

pixi run mojo build -O0 -g your_program.mojo -o debug_binary
pixi run mojo debug --cuda-gdb --break-on-launch debug_binary

GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค

GPU ๋””๋ฒ„๊น… ๊ธฐ์ดˆ์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ ํŠœํ† ๋ฆฌ์–ผ์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ์€ ๋‹ฌ์„ฑํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค:

์Šต๋“ํ•œ ๊ธฐ์ˆ 

๋‹ค์ค‘ ๋ ˆ๋ฒจ ๋””๋ฒ„๊น… ์ง€์‹:

  • โœ… LLDB๋กœ CPU ํ˜ธ์ŠคํŠธ ๋””๋ฒ„๊น… - ์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ํ”„๋กœ๊ทธ๋žจ ํ๋ฆ„ ๋””๋ฒ„๊น…
  • โœ… CUDA-GDB๋กœ GPU ์ปค๋„ ๋””๋ฒ„๊น… - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ, GPU ๋ฉ”๋ชจ๋ฆฌ, ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…
  • โœ… JIT vs ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น… - ์ƒํ™ฉ์— ๋งž๋Š” ์ ‘๊ทผ๋ฒ• ์„ ํƒ
  • โœ… pixi๋กœ ํ™˜๊ฒฝ ๊ด€๋ฆฌ - ์ผ๊ด€๋˜๊ณ  ์‹ ๋ขฐํ•  ์ˆ˜ ์žˆ๋Š” ๋””๋ฒ„๊น… ์„ค์ • ๋ณด์žฅ

์‹ค์ œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํ†ต์ฐฐ:

  • ์Šค๋ ˆ๋“œ์˜ ์‹ค์ œ ๋™์ž‘ ํ™•์ธ - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค thread_idx.x๊ฐ€ ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ–๋Š” ๊ฒƒ์„ ์ง์ ‘ ๋ชฉ๊ฒฉํ–ˆ์Šต๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ดํ•ด - ์ „์—ญ GPU ๋ฉ”๋ชจ๋ฆฌ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ๋””๋ฒ„๊น…ํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์Šค๋ ˆ๋“œ ํƒ์ƒ‰ ํ•™์Šต - ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ์‚ฌ์ด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ด๋™ํ–ˆ์Šต๋‹ˆ๋‹ค

์ด๋ก ์—์„œ ์‹ค์ „์œผ๋กœ

GPU ๋””๋ฒ„๊น…์— ๋Œ€ํ•ด ์ฝ๊ธฐ๋งŒ ํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ฒฝํ—˜ํ–ˆ์Šต๋‹ˆ๋‹ค:

  • ์‹ค์ œ ์ฝ”๋“œ ๋””๋ฒ„๊น…: ์‹ค์ œ GPU ์‹คํ–‰์œผ๋กœ Puzzle 01์˜ add_10 ์ปค๋„์„ ๋””๋ฒ„๊น…ํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์‹ค์ œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ ํ™•์ธ: LLDB ์–ด์…ˆ๋ธ”๋ฆฌ, CUDA-GDB ์Šค๋ ˆ๋“œ ์ƒํƒœ, ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ๋ฅผ ์ง์ ‘ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์ „๋ฌธ ๋„๊ตฌ ์‚ฌ์šฉ: ํ”„๋กœ๋•์…˜ GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผํ•œ CUDA-GDB๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์‹ค์ œ ์‹œ๋‚˜๋ฆฌ์˜ค ํ•ด๊ฒฐ: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ, ๊ฒฝ์Ÿ ์ƒํƒœ, ์ปค๋„ ์‹คํ–‰ ์‹คํŒจ ๋ฌธ์ œ๋ฅผ ๋‹ค๋ค˜์Šต๋‹ˆ๋‹ค

๋””๋ฒ„๊น… ๋„๊ตฌ ๋ชจ์Œ

๋น ๋ฅธ ๊ฒฐ์ • ๊ฐ€์ด๋“œ (ํ•ญ์ƒ ๊ฐ€๊นŒ์ด ๋‘์„ธ์š”!):

๋ฌธ์ œ ์œ ํ˜•๋„๊ตฌ๋ช…๋ น์–ด
GPU ์ „์— ํ”„๋กœ๊ทธ๋žจ ํฌ๋ž˜์‹œLLDBpixi run mojo debug program.mojo
GPU ์ปค๋„ ๋ฌธ์ œCUDA-GDBpixi run mojo debug --cuda-gdb --break-on-launch program.mojo
๊ฒฝ์Ÿ ์ƒํƒœCUDA-GDB + ์Šค๋ ˆ๋“œ ํƒ์ƒ‰(cuda-gdb) cuda thread (0,0,0)

ํ•„์ˆ˜ ๋ช…๋ น์–ด (์ผ์ƒ ๋””๋ฒ„๊น…์šฉ):

# GPU ์Šค๋ ˆ๋“œ ๊ฒ€์‚ฌ
(cuda-gdb) info cuda threads          # ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋ณด๊ธฐ
(cuda-gdb) cuda thread (0,0,0)        # ์Šค๋ ˆ๋“œ ์ „ํ™˜
(cuda-gdb) print i                    # ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค (thread_idx.x ๋“ฑ๊ฐ€)

# ์Šค๋งˆํŠธ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ (GPU ๋‚ด์žฅ ๋ณ€์ˆ˜๊ฐ€ ์ž‘๋™ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ๋กœ์ปฌ ๋ณ€์ˆ˜ ์‚ฌ์šฉ)
(cuda-gdb) break kernel if i == 0      # ์Šค๋ ˆ๋“œ 0์— ์ง‘์ค‘
(cuda-gdb) break kernel if array[i] > 100  # ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์— ์ง‘์ค‘

# ๋ฉ”๋ชจ๋ฆฌ ๋””๋ฒ„๊น…
(cuda-gdb) print array[i]              # ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ
(cuda-gdb) print array[0]@4            # ๋ฐฐ์—ด ์„ธ๊ทธ๋จผํŠธ: {{val1}, {val2}, {val3}, {val4}}

์š”์•ฝ

GPU ๋””๋ฒ„๊น…์—๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ, ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ, ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ๊ด€์—ฌํ•ฉ๋‹ˆ๋‹ค. ์ด์ œ ๋‹ค์Œ์„ ๊ฐ–์ถ”๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

  • ์–ด๋–ค GPU ํ”„๋กœ๊ทธ๋žจ์—๋„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ฒด๊ณ„์ ์ธ ์›Œํฌํ”Œ๋กœ์šฐ
  • LLDB์™€ CUDA-GDB ์ „๋ฌธ ๋„๊ตฌ์— ๋Œ€ํ•œ ์นœ์ˆ™ํ•จ
  • ์‹ค์ œ ๋ณ‘๋ ฌ ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•œ ์‹ค์ „ ๊ฒฝํ—˜
  • ๋ณต์žกํ•œ ์ƒํ™ฉ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ์‹ค์šฉ์ ์ธ ์ „๋žต
  • GPU ๋””๋ฒ„๊น… ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ๊ธฐ์ดˆ

์ถ”๊ฐ€ ์ž๋ฃŒ

์ฐธ๊ณ : GPU ๋””๋ฒ„๊น…์—๋Š” ์ธ๋‚ด์‹ฌ๊ณผ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃฌ ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๋ช…๋ น์–ด๋Š” ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ๋งˆ์ฃผ์น˜๊ฒŒ ๋  ๋ณต์žกํ•œ GPU ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

๐Ÿง ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

์ด๋ฒˆ ํผ์ฆ์—์„œ๋Š” ํฌ๋ž˜์‹œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” GPU ํ”„๋กœ๊ทธ๋žจ์ด ์ฃผ์–ด์ง‘๋‹ˆ๋‹ค. ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  (cuda-gdb) ๋””๋ฒ„๊น… ๋„๊ตฌ๋งŒ์œผ๋กœ ๋ฌธ์ œ๋ฅผ ์ฐพ์•„๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋””๋ฒ„๊น… ์Šคํ‚ฌ์„ ๋ฐœํœ˜ํ•ด ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ณด์„ธ์š”!

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ์„ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์„ค์ •๊ณผ ๊ธฐ๋ณธ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋ฅผ ์ตํ˜€๋‘์„ธ์š”. ์•„๋ž˜ ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

์ด ๋ช…๋ น์€ ์‹œ์Šคํ…œ์˜ CUDA ์„ค์น˜๋ฅผ ์ž๋™์œผ๋กœ ๊ฐ์ง€ํ•˜๊ณ  GPU ๋””๋ฒ„๊น…์— ํ•„์š”ํ•œ ๋งํฌ๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น…: ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋ฅผ ๋‹จ์„œ ์‚ผ์•„ ๊ทผ๋ณธ ์›์ธ ์ฐพ๊ธฐ
  • ์˜ค๋ฅ˜ ๋ถ„์„: ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€์™€ ์Šคํƒ ์ถ”์ (stack trace) ํ•ด์„ํ•˜๊ธฐ
  • ๊ฐ€์„ค ์ˆ˜๋ฆฝ: ๋ฌธ์ œ์— ๋Œ€ํ•œ ํ•ฉ๋ฆฌ์ ์ธ ์ถ”์ธก ์„ธ์šฐ๊ธฐ
  • ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ: ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ ๊ณผ์ • ์ตํžˆ๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

fn add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    i = thread_idx.x
    output[i] = a[i] + 10.0


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --first-case

ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

First Case: Try to identify what's wrong without looking at the code!

stack trace was not collected. Enable stack trace collection with environment variable `MOJO_ENABLE_STACK_TRACE_ON_ERROR`
Unhandled exception caught during execution: At open-source/max/mojo/stdlib/stdlib/gpu/host/device_context.mojo:2082:17: CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)
To get more accurate error information, set MODULAR_DEVICE_CONTEXT_SYNC_MODE=true.
/home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/nvidia/bin/mojo: error: execution exited with a non-zero result: 1

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ํฌ๋ž˜์‹œ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ๋””๋ฒ„๊น… ์ „๋žต์€ ๋ฌด์—‡์ผ๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --first-case
ํŒ
  1. ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€๋ฅผ ๊ผผ๊ผผํžˆ ์ฝ๊ธฐ - CUDA_ERROR_ILLEGAL_ADDRESS๋Š” GPU๊ฐ€ ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋ ค ํ–ˆ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค
  2. ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ •๋ณด ํ™•์ธ - CUDA-GDB๊ฐ€ ๋ฉˆ์ถœ ๋•Œ ํ‘œ์‹œ๋˜๋Š” ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ดํŽด๋ณด์„ธ์š”
  3. ๋ชจ๋“  ํฌ์ธํ„ฐ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๊ฒ€์‚ฌ - print๋กœ ๊ฐ ํฌ์ธํ„ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ™•์ธํ•˜์„ธ์š”
  4. ์ˆ˜์ƒํ•œ ์ฃผ์†Œ ์ฐพ๊ธฐ - ์œ ํšจํ•œ GPU ์ฃผ์†Œ๋Š” ๋ณดํ†ต ํฐ 16์ง„์ˆ˜์ž…๋‹ˆ๋‹ค (0x0์€ ๋ฌด์—‡์„ ์˜๋ฏธํ• ๊นŒ์š”?)
  5. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํ…Œ์ŠคํŠธ - ๊ฐ ํฌ์ธํ„ฐ๋กœ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•ด์„œ ์–ด๋А ๊ฒƒ์ด ์‹คํŒจํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  6. ์ฒด๊ณ„์ ์œผ๋กœ ์ ‘๊ทผ - ํƒ์ •์ฒ˜๋Ÿผ ์ฆ๊ฑฐ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋ฉฐ ์ฆ์ƒ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ ํ•˜์„ธ์š”
  7. ์œ ํšจํ•œ ํŒจํ„ด๊ณผ ๊ทธ๋ ‡์ง€ ์•Š์€ ํŒจํ„ด ๋น„๊ต - ํ•œ ํฌ์ธํ„ฐ๊ฐ€ ์ž‘๋™ํ•˜๊ณ  ๋‹ค๋ฅธ ๊ฑด ์•ˆ ๋œ๋‹ค๋ฉด, ๋ฌธ์ œ๊ฐ€ ์žˆ๋Š” ์ชฝ์— ์ง‘์ค‘ํ•˜์„ธ์š”
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --first-case

๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ •๋ณด ํ™•์ธ

CUDA-GDB๊ฐ€ ๋ฉˆ์ถ”๋ฉด ๋ฐ”๋กœ ์œ ์šฉํ•œ ๋‹จ์„œ๊ฐ€ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

(cuda-gdb) run
CUDA thread hit breakpoint, p09_add_10_... (output=0x302000000, a=0x0)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:31
31          i = thread_idx.x

๐Ÿ” ์ฒซ ๋ฒˆ์งธ ๋‹จ์„œ: ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜์— (output=0x302000000, a=0x0)์ด ๋ณด์ž…๋‹ˆ๋‹ค

  • output์€ ์œ ํšจํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค
  • a๋Š” 0x0 - null ํฌ์ธํ„ฐ์ž…๋‹ˆ๋‹ค!

์ฒด๊ณ„์ ์ธ ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

(cuda-gdb) next
32          output[i] = a[i] + 10.0
(cuda-gdb) print i
$1 = 0
(cuda-gdb) print output
$2 = (!pop.scalar<f32> * @register) 0x302000000
(cuda-gdb) print a
$3 = (!pop.scalar<f32> * @register) 0x0

์ฆ๊ฑฐ ์ˆ˜์ง‘:

  • โœ… ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค i=0์€ ์œ ํšจํ•ฉ๋‹ˆ๋‹ค
  • โœ… ๊ฒฐ๊ณผ ํฌ์ธํ„ฐ 0x302000000์€ ์˜ฌ๋ฐ”๋ฅธ GPU ์ฃผ์†Œ์ž…๋‹ˆ๋‹ค
  • โŒ ์ž…๋ ฅ ํฌ์ธํ„ฐ 0x0์€ null์ž…๋‹ˆ๋‹ค

๋ฌธ์ œ ํ™•์ธ

(cuda-gdb) print a[i]
Cannot access memory at address 0x0

๊ฒฐ์ •์  ์ฆ๊ฑฐ: null ์ฃผ์†Œ์˜ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค - ๋ฐ”๋กœ ์ด๊ฒƒ์ด ํฌ๋ž˜์‹œ์˜ ์›์ธ์ž…๋‹ˆ๋‹ค!

๊ทผ๋ณธ ์›์ธ ๋ถ„์„

๋ฌธ์ œ์ : ์ด์ œ --first-crash์˜ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด, ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ œ๋Œ€๋กœ ํ• ๋‹นํ•˜์ง€ ์•Š๊ณ  null ํฌ์ธํ„ฐ๋ฅผ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

 input_buf = ctx.enqueue_create_buffer[dtype](0)  # 0๊ฐœ์˜ ์š”์†Œ๋ฅผ ๊ฐ€์ง„ `DeviceBuffer`๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์š”์†Œ๊ฐ€ 0๊ฐœ์ด๋ฏ€๋กœ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ• ๋‹น๋˜์ง€ ์•Š์•„ NULL ํฌ์ธํ„ฐ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค!

์™œ ํฌ๋ž˜์‹œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š”๊ฐ€:

  1. ctx.enqueue_create_buffer[dtype](0)์€ 0๊ฐœ ์š”์†Œ๋ฅผ ๊ฐ€์ง„ DeviceBuffer๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  2. ํ• ๋‹นํ•  ์š”์†Œ๊ฐ€ ์—†์œผ๋‹ˆ null ํฌ์ธํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
  3. ์ด null ํฌ์ธํ„ฐ๊ฐ€ GPU ์ปค๋„๋กœ ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค.
  4. ์ปค๋„์ด a[i]์— ์ ‘๊ทผํ•˜๋ ค ํ•  ๋•Œ null์„ ์—ญ์ฐธ์กฐ โ†’ CUDA_ERROR_ILLEGAL_ADDRESS

์ˆ˜์ • ๋ฐฉ๋ฒ•

Null ํฌ์ธํ„ฐ ์ƒ์„ฑ์„ ์ ์ ˆํ•œ ๋ฒ„ํผ ํ• ๋‹น์œผ๋กœ ๊ต์ฒดํ•ฉ๋‹ˆ๋‹ค:

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: Null ํฌ์ธํ„ฐ ์ƒ์„ฑ
input_buf = ctx.enqueue_create_buffer[dtype](0)

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ์•ˆ์ „ํ•œ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด ์‹ค์ œ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๊ณ  ์ดˆ๊ธฐํ™”
input_buf = ctx.enqueue_create_buffer[dtype](SIZE)
input_buf.enqueue_fill(0)

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

ํŒจํ„ด ์ธ์‹:

  • 0x0 ์ฃผ์†Œ๋Š” ํ•ญ์ƒ null ํฌ์ธํ„ฐ์ž…๋‹ˆ๋‹ค
  • ์œ ํšจํ•œ GPU ์ฃผ์†Œ๋Š” ํฐ 16์ง„์ˆ˜์ž…๋‹ˆ๋‹ค (์˜ˆ: 0x302000000)

๋””๋ฒ„๊น… ์ „๋žต:

  1. ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€ ์ฝ๊ธฐ - ๋Œ€์ฒด๋กœ ๋ฌธ์ œ ์œ ํ˜•์— ๋Œ€ํ•œ ํžŒํŠธ๋ฅผ ์ค๋‹ˆ๋‹ค
  2. ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํ™•์ธ - CUDA-GDB๊ฐ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ง„์ž… ์‹œ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  3. ๋ชจ๋“  ํฌ์ธํ„ฐ ๊ฒ€์‚ฌ - ์ฃผ์†Œ๋ฅผ ๋น„๊ตํ•ด์„œ null์ด๋‚˜ ์ž˜๋ชป๋œ ๊ฒƒ์„ ์ฐพ์Šต๋‹ˆ๋‹ค
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํ…Œ์ŠคํŠธ - ์ˆ˜์ƒํ•œ ํฌ์ธํ„ฐ๋ฅผ ์—ญ์ฐธ์กฐํ•ด ๋ด…๋‹ˆ๋‹ค
  5. ํ• ๋‹น ์ง€์ ๊นŒ์ง€ ์ถ”์  - ๋ฌธ์ œ์˜ ํฌ์ธํ„ฐ๊ฐ€ ์–ด๋””์„œ ์ƒ์„ฑ๋˜์—ˆ๋Š”์ง€ ์ฐพ์Šต๋‹ˆ๋‹ค

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋Ÿฐ ์œ ํ˜•์˜ null ํฌ์ธํ„ฐ ๋ฒ„๊ทธ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งค์šฐ ํ”ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ CUDA-GDB ์กฐ์‚ฌ ๋ฐฉ๋ฒ•์€ ๋‹ค๋ฅธ ๋งŽ์€ GPU ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ, ๊ฒฝ์Ÿ ์ƒํƒœ, ์ปค๋„ ํฌ๋ž˜์‹œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ๋•Œ๋„ ๊ทธ๋Œ€๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: ํฌ๋ž˜์‹œ์—์„œ ์กฐ์šฉํ•œ ๋ฒ„๊ทธ๋กœ

ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค! ์ด์ œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋ฅผ ๋‹จ์„œ๋กœ GPU ํฌ๋ž˜์‹œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์กฐ์‚ฌ
  • ํฌ์ธํ„ฐ ์ฃผ์†Œ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•ด null ํฌ์ธํ„ฐ ๋ฒ„๊ทธ ์‹๋ณ„
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ๋””๋ฒ„๊น…์— CUDA-GDB๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉ

๋‹ค์Œ ๋„์ „: ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ทธ๋Ÿฐ๋ฐ ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜์ง€ ์•Š๋Š”๋‹ค๋ฉด์š”? ์™„๋ฒฝํ•˜๊ฒŒ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜จ๋‹ค๋ฉด?

๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€๋Š” ์ „ํ˜€ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ๋””๋ฒ„๊น… ๋„์ „์ž…๋‹ˆ๋‹ค:

  • ๊ธธ์žก์ด๊ฐ€ ๋˜์–ด์ค„ ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค
  • ์กฐ์‚ฌํ•  ๋šœ๋ ทํ•œ ํฌ์ธํ„ฐ ๋ฌธ์ œ๋„ ์—†์Šต๋‹ˆ๋‹ค
  • ๋ฌธ์ œ๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋Š” ์Šคํƒ ์ถ”์ ๋„ ์—†์Šต๋‹ˆ๋‹ค
  • ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ๊ฐ€ ํ•„์š”ํ•œ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์Šต๋‹ˆ๋‹ค

์ƒˆ๋กญ๊ฒŒ ์ตํžˆ๊ฒŒ ๋  ์Šคํ‚ฌ:

  • ๋กœ์ง ๋ฒ„๊ทธ ํƒ์ง€ - ํฌ๋ž˜์‹œ ์—†์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์ฐพ๊ธฐ
  • ํŒจํ„ด ๋ถ„์„ - ์ž˜๋ชป๋œ ์ถœ๋ ฅ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๊ธฐ
  • ์‹คํ–‰ ํ๋ฆ„ ๋””๋ฒ„๊น… - ์ตœ์ ํ™” ๋•Œ๋ฌธ์— ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์•ˆ ๋  ๋•Œ ๋Œ€์ฒ˜ํ•˜๊ธฐ

์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๋ฐฉ๋ฒ• - ๋‹จ์„œ ์ฝ๊ธฐ, ๊ฐ€์„ค ์„ธ์šฐ๊ธฐ, ์ฒด๊ณ„์ ์œผ๋กœ ํ…Œ์ŠคํŠธํ•˜๊ธฐ - ์€ ์•ž์œผ๋กœ ๋งˆ์ฃผํ•  ๋” ๋ฏธ๋ฌ˜ํ•œ ๋กœ์ง ์˜ค๋ฅ˜๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

๐Ÿ” ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ ์ตํžŒ ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น… ์Šคํ‚ฌ์„ ๋ฐ”ํƒ•์œผ๋กœ, ์ด๋ฒˆ์—๋Š” ์ „ํ˜€ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ๋„์ „์„ ๋งˆ์ฃผํ•ฉ๋‹ˆ๋‹ค: ํฌ๋ž˜์‹œ ์—†์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๋กœ์ง ๋ฒ„๊ทธ์ž…๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ๊ด€์ ์˜ ์ „ํ™˜:

  • ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ๋ช…ํ™•ํ•œ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ(CUDA_ERROR_ILLEGAL_ADDRESS)๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•จ
  • ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํฌ๋ž˜์‹œ๋„ ์—†๊ณ  ์—๋Ÿฌ ๋ฉ”์‹œ์ง€๋„ ์—†์Œ - ํƒ์ •์ฒ˜๋Ÿผ ํŒŒํ—ค์ณ์•ผ ํ•˜๋Š” ๋ฏธ๋ฌ˜ํ•˜๊ฒŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์Œ

์ด๋ฒˆ ์ค‘๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” LayoutTensor ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ๊ทธ๋žจ์€ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ์ถœ๋ ฅ์„ ๋‚ด๋Š”๋ฐ, ์‹ค์ œ ๊ฐœ๋ฐœ์—์„œ ํ›จ์”ฌ ํ”ํ•˜๋ฉด์„œ๋„ ๊นŒ๋‹ค๋กœ์šด ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค์ž…๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ๊ณผ ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ์™€ ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•์„ ์ตํ˜€๋‘์„ธ์š”. ์•„๋ž˜ ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • LayoutTensor ๋””๋ฒ„๊น…: ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด ์กฐ์‚ฌํ•˜๊ธฐ
  • ๋กœ์ง ๋ฒ„๊ทธ ํƒ์ง€: ํฌ๋ž˜์‹œํ•˜์ง€ ์•Š๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์ฐพ๊ธฐ
  • ๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„ ๋ถ„์„: ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ฌธ์ œ ์ดํ•ดํ•˜๊ธฐ
  • ๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„: ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

fn process_sliding_window(
    output: LayoutTensor[dtype, vector_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin],
):
    thread_id = thread_idx.x

    # Each thread processes a sliding window of 3 elements
    window_sum = Scalar[dtype](0.0)

    # Sum elements in sliding window: [i-1, i, i+1]
    for offset in range(ITER):
        idx = Int(thread_id) + offset - 1
        if 0 <= idx < SIZE:
            value = rebind[Scalar[dtype]](input[idx])
            window_sum += value

    output[thread_id] = window_sum


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --second-case

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค - ํฌ๋ž˜์‹œ ์—†์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ:

This program computes sliding window sums for each position...

Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]
Results: [0.0, 1.0, 3.0, 5.0]
Expected: [1.0, 3.0, 6.0, 5.0]

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ํ”„๋กœ๊ทธ๋žจ์€ ํฌ๋ž˜์‹œ ์—†์ด ์‹คํ–‰๋˜์ง€๋งŒ ์ผ์ •ํ•œ ํŒจํ„ด์œผ๋กœ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ๋กœ์ง ๋ฒ„๊ทธ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฌด์—‡์ผ๊นŒ์š”?

์ƒ๊ฐํ•ด ๋ณผ ์ :

  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์—์„œ ์–ด๋–ค ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?
  • ์ œ๋Œ€๋กœ ๋Œ์ง€ ์•Š๋Š” ๊ฒƒ ๊ฐ™์€ ๋ฐ˜๋ณต๋ฌธ์€ ์–ด๋–ป๊ฒŒ ์กฐ์‚ฌํ•  ๊ฑด๊ฐ€์š”?
  • ๋ณ€์ˆ˜๋ฅผ ์ง์ ‘ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์—†์„ ๋•Œ ์–ด๋–ค ๋””๋ฒ„๊น… ์ „๋žต์ด ํšจ๊ณผ์ ์ผ๊นŒ์š”?
  • ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•ด ์ค„ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ๊ฐ€ ์—†์„ ๋•Œ, ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์˜ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๋ฐฉ๋ฒ•์„ ์–ด๋–ป๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

GDB ๋ช…๋ น์–ด ๋‹จ์ถ•ํ‚ค (๋น ๋ฅธ ๋””๋ฒ„๊น…)

์ด ๋‹จ์ถ•ํ‚ค๋“ค์„ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์„ธ์…˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๋‹จ์ถ• | ์ „์ฒด | ์‚ฌ์šฉ ์˜ˆ์‹œ
r | run | (cuda-gdb) r
n | next | (cuda-gdb) n
c | continue | (cuda-gdb) c
b | break | (cuda-gdb) b 39
p | print | (cuda-gdb) p thread_id
q | quit | (cuda-gdb) q

์•„๋ž˜ ๋ชจ๋“  ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋Š” ํšจ์œจ์„ ์œ„ํ•ด ์ด ๋‹จ์ถ•ํ‚ค๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค!

ํŒ
  1. ํŒจํ„ด ๋ถ„์„๋ถ€ํ„ฐ - ๊ธฐ๋Œ€๊ฐ’๊ณผ ์‹ค์ œ ๊ฒฐ๊ณผ์˜ ๊ด€๊ณ„๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” (์ฐจ์ด์— ์–ด๋–ค ์ˆ˜ํ•™์  ํŒจํ„ด์ด ์žˆ๋‚˜์š”?)
  2. ์‹คํ–‰ ํ๋ฆ„์— ์ง‘์ค‘ - ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์œผ๋ฉด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์„ธ์–ด๋ณด์„ธ์š”
  3. ๋‹จ์ˆœํ•œ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์‚ฌ์šฉ - ์ตœ์ ํ™”๋œ ์ฝ”๋“œ์—์„œ๋Š” ๋ณต์žกํ•œ ๋””๋ฒ„๊น… ๋ช…๋ น์ด ์‹คํŒจํ•˜๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค
  4. ์ˆ˜ํ•™์  ์ถ”๋ก  - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ๋”ฐ์ ธ๋ณด์„ธ์š”
  5. ๋ˆ„๋ฝ๋œ ๋ฐ์ดํ„ฐ ์กฐ์‚ฌ - ๊ฒฐ๊ณผ๊ฐ€ ์ผ๊ด€๋˜๊ฒŒ ๊ธฐ๋Œ€๋ณด๋‹ค ์ž‘๋‹ค๋ฉด, ๋ฌด์—‡์ด ๋น ์กŒ์„๊นŒ์š”?
  6. ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ๊ฒ€์ฆ - ์ตœ์ข… ๊ฒฐ๊ณผ์—์„œ ๋ฒ„๊ทธ์˜ ํŒจํ„ด์ด ๋“œ๋Ÿฌ๋‚˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค
  7. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฒฝ๊ณ„ ๋ถ„์„ - ๋ฐ˜๋ณต๋ฌธ์ด ์˜ฌ๋ฐ”๋ฅธ ๊ฐœ์ˆ˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  8. ์ž‘๋™ํ•˜๋Š” ์ผ€์ด์Šค์™€ ๊ต์ฐจ ๊ฒ€์ฆ - ์Šค๋ ˆ๋“œ 3์€ ์ •ํ™•ํ•˜๊ฒŒ ์ž‘๋™ํ•˜๋Š”๋ฐ ๋‹ค๋ฅธ ๊ฒƒ๋“ค์€ ์™œ ์•ˆ ๋ ๊นŒ์š”?
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

1๋‹จ๊ณ„: ์‹คํ–‰๊ณผ ์ดˆ๊ธฐ ๋ถ„์„

Step 1: ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

Step 2: ์ฆ์ƒ๋ถ€ํ„ฐ ๋ถ„์„

๋””๋ฒ„๊ฑฐ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

์‹ค์ œ ๊ฒฐ๊ณผ: [0.0, 1.0, 3.0, 5.0]
๊ธฐ๋Œ€๊ฐ’: [1.0, 3.0, 6.0, 5.0]

๐Ÿ” ํŒจํ„ด ์ธ์‹:

  • ์Šค๋ ˆ๋“œ 0: 0.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 1.0 โ†’ 1.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 1: 1.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 3.0 โ†’ 2.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 2: 3.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 6.0 โ†’ 3.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 3: 5.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 5.0 โ†’ โœ… ์ •ํ™•

์ดˆ๊ธฐ ๊ฐ€์„ค: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ผ๋ถ€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ˆ„๋ฝํ•˜๊ณ  ์žˆ๋Š”๋ฐ, ์Šค๋ ˆ๋“œ 3๋งŒ ์ •ํ™•ํ•˜๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

2๋‹จ๊ณ„: ์ปค๋„ ์ง„์ž…

Step 3: ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ง„์ž… ํ™•์ธ

์‹ค์ œ ๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค:

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --second-case

This program computes sliding window sums for each position...
Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30
30          input: LayoutTensor[mut=False, dtype, vector_layout],

Step 4: ๋ฉ”์ธ ๋กœ์ง์œผ๋กœ ์ด๋™

(cuda-gdb) n
29          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
32          thread_id = thread_idx.x
(cuda-gdb) n
38          for offset in range(ITER):

Step 5: ๋ณ€์ˆ˜ ์ ‘๊ทผ์„ฑ ํ…Œ์ŠคํŠธ - ์ค‘์š”ํ•œ ๋ฐœ๊ฒฌ

(cuda-gdb) p thread_id
$1 = 0

โœ… ์ข‹์Œ: Thread ID์— ์ ‘๊ทผ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

(cuda-gdb) p window_sum
Cannot access memory at address 0x0

โŒ ๋ฌธ์ œ: window_sum์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

(cuda-gdb) p input[0]
Attempt to take address of value not located in memory.

โŒ ๋ฌธ์ œ: LayoutTensor ์ง์ ‘ ์ธ๋ฑ์‹ฑ์ด ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

(cuda-gdb) p input.ptr[0]
$2 = {0}
(cuda-gdb) p input.ptr[0]@4
$3 = {{0}, {1}, {2}, {3}}

๐ŸŽฏ ๋ŒํŒŒ๊ตฌ: input.ptr[0]@4๋กœ ์ „์ฒด ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด LayoutTensor ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ํ•ต์‹ฌ ๋ฐ˜๋ณต๋ฌธ ์กฐ์‚ฌ

Step 6: ๋ฐ˜๋ณต๋ฌธ ๋ชจ๋‹ˆํ„ฐ๋ง ์„ค์ •

(cuda-gdb) b 42
Breakpoint 1 at 0x7fffd326ffd0: file problems/p09/p09.mojo, line 42.
(cuda-gdb) c
Continuing.

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:42
42              idx = thread_id + offset - 1

๐Ÿ” ์ด์ œ ๋ฐ˜๋ณต๋ฌธ ๋ณธ๋ฌธ ์•ˆ์— ์žˆ์Šต๋‹ˆ๋‹ค. ์ง์ ‘ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์„ธ์–ด๋ด…์‹œ๋‹ค.

Step 7: ์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 0)

(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
41          for offset in range(ITER):

์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต ์™„๋ฃŒ: ๋ฐ˜๋ณต๋ฌธ์ด 42๋ฒˆ ์ค„ โ†’ 43๋ฒˆ ์ค„ โ†’ 41๋ฒˆ ์ค„๋กœ ๋Œ์•„์™”์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ณต๋ฌธ์ด ๊ณ„์†๋ฉ๋‹ˆ๋‹ค.

Step 8: ๋‘ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 1)

(cuda-gdb) n

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
42              idx = thread_id + offset - 1
(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
44                  value = rebind[Scalar[dtype]](input[idx])
(cuda-gdb) n
45                  window_sum += value
(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
41          for offset in range(ITER):

๋‘ ๋ฒˆ์งธ ๋ฐ˜๋ณต ์™„๋ฃŒ: ์ด๋ฒˆ์—๋Š” if ๋ธ”๋ก(44-45๋ฒˆ ์ค„)์„ ํ†ต๊ณผํ–ˆ์Šต๋‹ˆ๋‹ค.

Step 9: ์„ธ ๋ฒˆ์งธ ๋ฐ˜๋ณต ํ…Œ์ŠคํŠธ

(cuda-gdb) n
47          output[thread_id] = window_sum

๊ฒฐ์ •์  ๋ฐœ๊ฒฌ: ๋ฐ˜๋ณต๋ฌธ์ด 2๋ฒˆ๋งŒ ๋Œ๊ณ  ์ข…๋ฃŒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! 42๋ฒˆ ์ค„์˜ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์— ๋‹ค์‹œ ๊ฑธ๋ฆฌ์ง€ ์•Š๊ณ  47๋ฒˆ ์ค„๋กœ ๋ฐ”๋กœ ๋„˜์–ด๊ฐ”์Šต๋‹ˆ๋‹ค.

๊ฒฐ๋ก : ๋ฐ˜๋ณต๋ฌธ์ด ์ •ํ™•ํžˆ 2๋ฒˆ ๋Œ๊ณ  ์ข…๋ฃŒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Step 10: ์ปค๋„ ์‹คํ–‰ ์™„๋ฃŒ์™€ ์ปจํ…์ŠคํŠธ ์†์‹ค

(cuda-gdb) n
31      fn process_sliding_window(
(cuda-gdb) n
[Switching to Thread 0x7ffff7cc0e00 (LWP 110927)]
0x00007ffff064f84a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) p output.ptr[0]@4
No symbol "output" in current context.
(cuda-gdb) p offset
No symbol "offset" in current context.

๐Ÿ” ์ปจํ…์ŠคํŠธ ์†์‹ค: ์ปค๋„ ์‹คํ–‰์ด ๋๋‚˜๋ฉด ์ปค๋„ ๋ณ€์ˆ˜์— ๋” ์ด์ƒ ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ •์ƒ์ ์ธ ๋™์ž‘์ž…๋‹ˆ๋‹ค.

4๋‹จ๊ณ„: ๊ทผ๋ณธ ์›์ธ ๋ถ„์„

Step 11: ๊ด€์ฐฐ๋œ ์‹คํ–‰์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„

๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ ๊ด€์ฐฐํ•œ ๊ฒƒ:

  1. ๋ฐ˜๋ณต ํšŸ์ˆ˜: 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต (offset = 0, offset = 1)
  2. ๊ธฐ๋Œ€๊ฐ’: ํฌ๊ธฐ 3์˜ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ๋Š” 3๋ฒˆ ๋ฐ˜๋ณตํ•ด์•ผ ํ•จ (offset = 0, 1, 2)
  3. ๋ˆ„๋ฝ: ์„ธ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 2)

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐํ•ด์•ผ ํ•  ๊ฒƒ:

  • ์Šค๋ ˆ๋“œ 0: window_sum = input[-1] + input[0] + input[1] = (๊ฒฝ๊ณ„) + 0 + 1 = 1.0
  • ์Šค๋ ˆ๋“œ 1: window_sum = input[0] + input[1] + input[2] = 0 + 1 + 2 = 3.0
  • ์Šค๋ ˆ๋“œ 2: window_sum = input[1] + input[2] + input[3] = 1 + 2 + 3 = 6.0
  • ์Šค๋ ˆ๋“œ 3: window_sum = input[2] + input[3] + input[4] = 2 + 3 + (๊ฒฝ๊ณ„) = 5.0

Step 12: ์Šค๋ ˆ๋“œ 0์˜ ์‹ค์ œ ์‹คํ–‰ ์ถ”์ 

2๋ฒˆ๋งŒ ๋ฐ˜๋ณตํ•  ๊ฒฝ์šฐ (offset = 0, 1):

๋ฐ˜๋ณต 1 (offset = 0):

  • idx = thread_id + offset - 1 = 0 + 0 - 1 = -1
  • if 0 <= idx < SIZE: โ†’ if 0 <= -1 < 4: โ†’ False
  • ํ•ฉ์‚ฐ ์—ฐ์‚ฐ ๊ฑด๋„ˆ๋œ€

๋ฐ˜๋ณต 2 (offset = 1):

  • idx = thread_id + offset - 1 = 0 + 1 - 1 = 0
  • if 0 <= idx < SIZE: โ†’ if 0 <= 0 < 4: โ†’ True
  • window_sum += input[0] โ†’ window_sum += 0

๋ˆ„๋ฝ๋œ ๋ฐ˜๋ณต 3 (offset = 2):

  • idx = thread_id + offset - 1 = 0 + 2 - 1 = 1
  • if 0 <= idx < SIZE: โ†’ if 0 <= 1 < 4: โ†’ True
  • window_sum += input[1] โ†’ window_sum += 1 โ† ์ด ์—ฐ์‚ฐ์ด ์‹คํ–‰๋˜์ง€ ์•Š์Œ

๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 0์€ window_sum = 0 + 1 = 1 ๋Œ€์‹  window_sum = 0์„ ์–ป์Šต๋‹ˆ๋‹ค

5๋‹จ๊ณ„: ๋ฒ„๊ทธ ํ™•์ธ

๋ฌธ์ œ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด:

comptime ITER = 2                       # โ† ๋ฒ„๊ทธ: 3์ด์–ด์•ผ ํ•จ!

for offset in range(ITER):           # โ† 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต: [0, 1]
    idx = Int(thread_id) + offset - 1     # โ† offset = 2 ๋ˆ„๋ฝ
    if 0 <= idx < SIZE:
        value = rebind[Scalar[dtype]](input[idx])
        window_sum += value

๐ŸŽฏ ๊ทผ๋ณธ ์›์ธ ํ™•์ธ: ํฌ๊ธฐ 3์˜ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ๋ฅผ ์œ„ํ•ด ITER = 2๊ฐ€ ITER = 3์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜์ • ๋ฐฉ๋ฒ•: ์†Œ์Šค ์ฝ”๋“œ์—์„œ comptime ITER = 2๋ฅผ comptime ITER = 3์œผ๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์„ ๋•Œ:

  1. ์‹คํ–‰ ํ๋ฆ„์— ์ง‘์ค‘ - ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๊ฐ€ ๋ช‡ ๋ฒˆ ๊ฑธ๋ฆฌ๋Š”์ง€, ๋ฐ˜๋ณต์ด ๋ช‡ ๋ฒˆ ๋„๋Š”์ง€ ์„ธ์–ด๋ณด์„ธ์š”
  2. ์ˆ˜ํ•™์  ์ถ”๋ก  ์‚ฌ์šฉ - ์ผ์–ด๋‚˜์•ผ ํ•  ์ผ๊ณผ ์‹ค์ œ๋กœ ์ผ์–ด๋‚˜๋Š” ์ผ์„ ๋”ฐ์ ธ๋ณด์„ธ์š”
  3. ํŒจํ„ด ๋ถ„์„ - ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์ด๋Œ๋„๋ก ํ•˜์„ธ์š”
  4. ๊ต์ฐจ ๊ฒ€์ฆ - ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•ด ๊ฐ€์„ค์„ ํ…Œ์ŠคํŠธํ•˜์„ธ์š”

์ „๋ฌธ์ ์ธ GPU ๋””๋ฒ„๊น…์˜ ํ˜„์‹ค:

  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™” ๋•Œ๋ฌธ์— ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์‹คํŒจํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค
  • ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„์ด ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ๋ณด๋‹ค ๋” ์‹ ๋ขฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ํŒจํ„ด์ด ์ค‘์š”ํ•œ ๋””๋ฒ„๊น… ๋‹จ์„œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
  • ์†Œ์Šค ์ฝ”๋“œ ์ถ”๋ก ์ด ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ๊ธฐ๋Šฅ์„ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค

LayoutTensor ๋””๋ฒ„๊น…:

  • LayoutTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•ด๋„ ๊ทผ๋ณธ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฒ„๊ทธ๋Š” ๊ทธ๋Œ€๋กœ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค
  • ํ…์„œ ๋‚ด์šฉ์„ ๊ฒ€์‚ฌํ•˜๋ ค ํ•˜๊ธฐ๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋กœ์ง์— ์ง‘์ค‘ํ•˜์„ธ์š”
  • ์ฒด๊ณ„์ ์ธ ์ถ”๋ก ์œผ๋กœ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ์ถ”์ ํ•˜์„ธ์š”

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋Ÿฐ ์œ ํ˜•์˜ off-by-one (์—ญ์ฃผ: ๊ฒฝ๊ณ„๊ฐ’์ด 1๋งŒํผ ์–ด๊ธ‹๋‚˜๋Š” ์˜ค๋ฅ˜) ๋ฐ˜๋ณต๋ฌธ ๋ฒ„๊ทธ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งค์šฐ ํ”ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ• - ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ์ •๋ณด์— ์ˆ˜ํ•™์  ๋ถ„์„๊ณผ ํŒจํ„ด ์ธ์‹์„ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฒƒ - ์€ ๋„๊ตฌ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ „๋ฌธ GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ์‹ ๊ทธ๋Œ€๋กœ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: ๋กœ์ง ๋ฒ„๊ทธ์—์„œ ๊ต์ฐฉ ์ƒํƒœ๋กœ

๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค! ์ด์ œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • โœ… ํฌ๋ž˜์‹œ๋‚˜ ๋šœ๋ ทํ•œ ์ฆ์ƒ ์—†์ด๋„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • โœ… ํŒจํ„ด ๋ถ„์„์œผ๋กœ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ 
  • โœ… ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„์œผ๋กœ ๋ณ€์ˆ˜ ์ ‘๊ทผ์ด ์ œํ•œ๋œ ์ƒํ™ฉ์—์„œ ๋””๋ฒ„๊น…
  • โœ… ๋””๋ฒ„๊ฑฐ ๋„๊ตฌ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ˆ˜ํ•™์  ์ถ”๋ก  ์ ์šฉ

๋งˆ์ง€๋ง‰ ๋„์ „: ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ทธ๋Ÿฐ๋ฐ ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜์ง€๋„ ์•Š๊ณ  ๋๋‚˜์ง€๋„ ์•Š๋Š”๋‹ค๋ฉด์š”? ๊ทธ๋ƒฅ ์˜์›ํžˆ ๋ฉˆ์ถฐ๋ฒ„๋ฆฐ๋‹ค๋ฉด์š”?

์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€๋Š” ๊ถ๊ทน์˜ ๋””๋ฒ„๊น… ๋„์ „์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค:

  • โŒ ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€ ์—†์Œ (์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์ฒ˜๋Ÿผ)
  • โŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์—†์Œ (๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€์ฒ˜๋Ÿผ)
  • โŒ ์™„๋ฃŒ ์ž์ฒด๊ฐ€ ์—†์Œ - ๊ทธ๋ƒฅ ๋ฌดํ•œํžˆ ๋ฉˆ์ถค
  • โœ… ๊ณ ๊ธ‰ ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋ถ„์„์ด ํ•„์š”ํ•œ ์กฐ์šฉํ•œ ๊ต์ฐฉ ์ƒํƒœ

์ƒˆ๋กญ๊ฒŒ ์ตํžˆ๊ฒŒ ๋  ์Šคํ‚ฌ:

  • ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€ - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ์—์„œ ์กฐ์œจ ์‹คํŒจ ์ฐพ๊ธฐ
  • ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ๊ฒ€์‚ฌํ•˜๊ธฐ
  • ๋™๊ธฐํ™” ๋””๋ฒ„๊น… - ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ์‹คํŒจ ์ดํ•ดํ•˜๊ธฐ

๋””๋ฒ„๊น… ์ง„ํ™”:

  1. ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํฌ๋ž˜์‹œ ์‹ ํ˜ธ ๋”ฐ๋ผ๊ฐ€๊ธฐ โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ์ฐพ๊ธฐ
  2. ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„ํ•˜๊ธฐ โ†’ ๋กœ์ง ๋ฒ„๊ทธ ์ฐพ๊ธฐ
  3. ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€: ์Šค๋ ˆ๋“œ ์ƒํƒœ ์กฐ์‚ฌํ•˜๊ธฐ โ†’ ์กฐ์œจ ๋ฒ„๊ทธ ์ฐพ๊ธฐ

์ด์ „ ๋‘ ์‚ฌ๋ก€์—์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ์Šคํ‚ฌ - ๊ฐ€์„ค ์ˆ˜๋ฆฝ, ์ฆ๊ฑฐ ์ˆ˜์ง‘, ํŒจํ„ด ๋ถ„์„ - ์€ ๊ฐ€์žฅ ์–ด๋ ค์šด GPU ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ๋•Œ ํ•ต์‹ฌ์ด ๋ฉ๋‹ˆ๋‹ค: ์กฐ์œจ์ด ์–ด๊ธ‹๋‚˜ ์˜์›ํžˆ ์„œ๋กœ๋ฅผ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค.

๐Ÿ•ต ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ์™€ ๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค. ์ด์ œ GPU ๋””๋ฒ„๊น…์˜ ์ตœ์ข… ๋ณด์Šค์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค: ํ”„๋กœ๊ทธ๋žจ์ด ๋ฌดํ•œ์ • ๋ฉˆ์ถฐ๋ฒ„๋ฆฌ๋Š” ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ. ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋„, ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋„ ์—†์ด - ๊ทธ์ € ๋์—†๋Š” ์นจ๋ฌต๋งŒ ์žˆ์Šต๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์—ฌ์ •์˜ ์™„๊ฒฐ:

  • ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํ”„๋กœ๊ทธ๋žจ ํฌ๋ž˜์‹œ โ†’ ์˜ค๋ฅ˜ ์‹ ํ˜ธ ์ถ”์  โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ
  • ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ถœ๋ ฅ โ†’ ํŒจํ„ด ๋ถ„์„ โ†’ ๋กœ์ง ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ
  • ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํ”„๋กœ๊ทธ๋žจ ๋ฌดํ•œ ์ •์ง€ โ†’ ์Šค๋ ˆ๋“œ ์ƒํƒœ ์กฐ์‚ฌ โ†’ ์กฐ์œจ ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ

์ด ๊ณ ๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, LayoutTensor ์—ฐ์‚ฐ, ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”๊ฐ€ ์–ฝํžŒ ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ์กฐ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค - ์ด์ „ ์‚ฌ๋ก€๋“ค์—์„œ ์ตํžŒ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๊ธฐ์ˆ ์„ ์ด๋™์›ํ•ฉ๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ, ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€, ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ, ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ์˜ ํ•œ๊ณ„, ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•์„ ์ดํ•ดํ•˜์„ธ์š”. ์•„๋ž˜ ์„ค์ • ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€: ์Šค๋ ˆ๋“œ๋“ค์ด ๋™๊ธฐํ™” ์ง€์ ์—์„œ ์˜์›ํžˆ ๊ธฐ๋‹ค๋ฆฌ๊ฒŒ ๋˜๋Š” ์ƒํ™ฉ ์‹๋ณ„ํ•˜๊ธฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ: LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ
  • ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๋ถ„์„: ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ํƒˆ ๋•Œ ๋””๋ฒ„๊น…ํ•˜๊ธฐ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น…: CUDA-GDB๋กœ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์‹คํŒจ ๋ถ„์„ํ•˜๊ธฐ
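๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ์˜ ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ "์ผ๋ถ€ ์Šค๋ ˆ๋“œ๋งŒ ๋™๊ธฐํ™” ์ง€์ ์— ๋„๋‹ฌํ•˜๋Š” ๊ฒƒ"์ž…๋‹ˆ๋‹ค. GPU์˜ barrier()์™€ ์™„์ „ํžˆ ๊ฐ™์ง€๋Š” ์•Š์ง€๋งŒ, ํŒŒ์ด์ฌ์˜ threading.Barrier๋กœ ์ด ๊ฐœ๋…์„ ๋‹จ์ˆœํ™”ํ•ด ์‹œ์—ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ GPU ๋ฐฐ๋ฆฌ์–ด๋Š” ํƒ€์ž„์•„์›ƒ ์—†์ด ์˜์›ํžˆ ๋Œ€๊ธฐํ•˜์ง€๋งŒ, ์—ฌ๊ธฐ์„œ๋Š” ๊ต์ฐฉ์„ ๊ด€์ฐฐํ•˜๊ธฐ ์œ„ํ•ด 0.5์ดˆ ์ œํ•œ์„ ๋‘์—ˆ์Šต๋‹ˆ๋‹ค:

```python
import threading

NUM_THREADS = 4
status = {}

def worker(tid, bar):
    # ์กฐ๊ฑด๋ฌธ ์•ˆ์˜ ๋ฐฐ๋ฆฌ์–ด: ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๋Š” ๋™๊ธฐํ™”์— ์ฐธ์—ฌํ•˜์ง€ ์•Š๋Š”๋‹ค
    if tid < NUM_THREADS - 1:
        try:
            # GPU barrier()๋Š” ํƒ€์ž„์•„์›ƒ์ด ์—†์–ด ์˜์›ํžˆ ๋Œ€๊ธฐ โ†’ ์‹œ์—ฐ์„ ์œ„ํ•ด 0.5์ดˆ ์ œํ•œ
            bar.wait(timeout=0.5)
            status[tid] = "ํ†ต๊ณผ"
        except threading.BrokenBarrierError:
            status[tid] = "๊ต์ฐฉ"
    else:
        status[tid] = "๋ฏธ์ฐธ์—ฌ"

bar = threading.Barrier(NUM_THREADS)  # 4๊ฐœ ์Šค๋ ˆ๋“œ ์ „์›์ด ๋„๋‹ฌํ•ด์•ผ ํ•ด์ œ
threads = [threading.Thread(target=worker, args=(t, bar)) for t in range(NUM_THREADS)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(status)  # ์Šค๋ ˆ๋“œ 0~2๋Š” "๊ต์ฐฉ", ์Šค๋ ˆ๋“œ 3์€ "๋ฏธ์ฐธ์—ฌ"
```

๋ฐ”๋กœ ์ด๋ ‡๊ฒŒ ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•˜์ง€ ๋ชปํ•˜๋ฉด, ๋„๋‹ฌํ•œ ์Šค๋ ˆ๋“œ๋“ค์ด ์˜์›ํžˆ ๊ธฐ๋‹ค๋ฆฌ๋Š” "์กฐ์šฉํ•œ ๊ต์ฐฉ ์ƒํƒœ"๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.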

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

fn collaborative_filter(
    output: LayoutTensor[dtype, vector_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, vector_layout, ImmutAnyOrigin],
):
    thread_id = thread_idx.x

    # Shared memory workspace for collaborative processing
    shared_workspace = LayoutTensor[
        dtype,
        Layout.row_major(SIZE - 1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](a[thread_id])
    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        # Apply collaborative filter with neighbors
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
        barrier()

    # Phase 3: Final synchronization and output
    barrier()

    # Write filtered results back to output
    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](a[thread_id])


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --third-case

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค - ํ”„๋กœ๊ทธ๋žจ์ด ๋ฌดํ•œ์ • ๋ฉˆ์ถฅ๋‹ˆ๋‹ค:

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...
[HANGS FOREVER - Use Ctrl+C to stop]

โš ๏ธ ๊ฒฝ๊ณ : ์ด ํ”„๋กœ๊ทธ๋žจ์€ ๋ฉˆ์ถฐ์„œ ์™„๋ฃŒ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. Ctrl+C๋กœ ์ค‘๋‹จํ•˜์„ธ์š”.

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ํ”„๋กœ๊ทธ๋žจ์ด ์ •์ƒ์ ์œผ๋กœ ์‹œ์ž‘๋˜์ง€๋งŒ GPU ์—ฐ์‚ฐ ์ค‘์— ๋ฉˆ์ถฐ์„œ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•์€ ๋ฌด์—‡์ผ๊นŒ์š”?

์ƒ๊ฐํ•ด๋ณผ ์ :

  • GPU ์ปค๋„์ด ์˜์˜ ์™„๋ฃŒ๋˜์ง€ ์•Š๊ฒŒ ๋งŒ๋“œ๋Š” ์›์ธ์€ ๋ฌด์—‡์ผ๊นŒ์š”?
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋ฌธ์ œ๋ฅผ ์–ด๋–ป๊ฒŒ ์กฐ์‚ฌํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ?
  • ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ๊ทธ๋ƒฅ โ€œ๋ฉˆ์ถฐ๋ฒ„๋ฆดโ€ ๋•Œ ์–ด๋–ค ๋””๋ฒ„๊น… ์ „๋žต์ด ํ†ตํ• ๊นŒ์š”?
  • ์Šค๋ ˆ๋“œ๋“ค์ด ์ œ๋Œ€๋กœ ํ˜‘๋ ฅํ•˜์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ๋‹ค๋ฉด ์–ด๋–ป๊ฒŒ ๋””๋ฒ„๊น…ํ• ๊นŒ์š”?
  • ์ฒด๊ณ„์  ์กฐ์‚ฌ(์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€)์™€ ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„(๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€)์„ ๊ฒฐํ•ฉํ•ด์„œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ์–ด๋–ป๊ฒŒ ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

GDB ๋ช…๋ น์–ด ๋‹จ์ถ•ํ‚ค (๋น ๋ฅธ ๋””๋ฒ„๊น…)

์ด ๋‹จ์ถ•ํ‚ค๋“ค์„ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์„ธ์…˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  단축   전체       사용 예시
  r      run        (cuda-gdb) r
  n      next       (cuda-gdb) n
  c      continue   (cuda-gdb) c
  b      break      (cuda-gdb) b 62
  p      print      (cuda-gdb) p thread_id
  q      quit       (cuda-gdb) q

์•„๋ž˜ ๋ชจ๋“  ๋””๋ฒ„๊น… ๋ช…๋ น์€ ํšจ์œจ์„ฑ์„ ์œ„ํ•ด ๋‹จ์ถ•ํ‚ค๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค!

ํŒ
  1. ์†Œ๋ฆฌ ์—†๋Š” ๋ฉˆ์ถค ์กฐ์‚ฌ - ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ๋ฉˆ์ถฐ๋ฒ„๋ฆด ๋•Œ, GPU์˜ ์–ด๋–ค ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋ฌดํ•œ ๋Œ€๊ธฐ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์„๊นŒ์š”?
  2. ์Šค๋ ˆ๋“œ ์ƒํƒœ ๊ฒ€์‚ฌ - info cuda threads๋กœ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋“ค์ด ์–ด๋””์„œ ๋ฉˆ์ท„๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  3. ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๋ถ„์„ - ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋–ค ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š” (๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฒฝ๋กœ๋ฅผ ๋”ฐ๋ฅด๋‚˜์š”?)
  4. ๋™๊ธฐํ™” ์ง€์  ์กฐ์‚ฌ - ์Šค๋ ˆ๋“œ๋“ค์ด ์กฐ์œจํ•ด์•ผ ํ•  ์ˆ˜๋„ ์žˆ๋Š” ์ง€์ ์„ ์ฐพ์œผ์„ธ์š”
  5. ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ํƒ์ง€ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ ์œ„์น˜์— ์žˆ๋‚˜์š”, ์•„๋‹ˆ๋ฉด ์ผ๋ถ€๋Š” ๋‹ค๋ฅธ ๊ณณ์— ์žˆ๋‚˜์š”?
  6. ์กฐ์œจ ๊ธฐ๋ณธ ์š”์†Œ ๋ถ„์„ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋™๊ธฐํ™” ์—ฐ์‚ฐ์— ์ฐธ์—ฌํ•˜์ง€ ์•Š์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ์š”?
  7. ์‹คํ–‰ ํ๋ฆ„ ์ถ”์  - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์กฐ๊ฑด๋ฌธ์„ ํ†ตํ•ด ์–ด๋–ค ๊ฒฝ๋กœ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋Š”์ง€ ์ถ”์ ํ•˜์„ธ์š”
  8. ์Šค๋ ˆ๋“œ ID ์˜ํ–ฅ ๋ถ„์„ - ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ID๊ฐ€ ์–ด๋–ค ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰ํ• ์ง€ ์–ด๋–ป๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

1๋‹จ๊ณ„: ์‹คํ–‰๊ณผ ์ดˆ๊ธฐ ์„ค์ •

Step 1: ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

Step 2: ์ •์ง€ ํ˜„์ƒ ๋ถ„์„

๋””๋ฒ„๊น…์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ์•Œ๊ณ  ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ๋Œ€๊ฐ’: ํ”„๋กœ๊ทธ๋žจ์ด ์™„๋ฃŒ๋˜๊ณ  ํ•„ํ„ฐ๋ง๋œ ๊ฒฐ๊ณผ ํ‘œ์‹œ
์‹ค์ œ: "Waiting for GPU computation to complete..."์—์„œ ๋ฉˆ์ถค

๐Ÿ” ์ดˆ๊ธฐ ๊ฐ€์„ค: GPU ์ปค๋„์ด ๊ต์ฐฉ ์ƒํƒœ์— ๋น ์ง - ์–ด๋–ค ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์Šค๋ ˆ๋“œ๋“ค์„ ์˜์›ํžˆ ๋Œ€๊ธฐ์‹œํ‚ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

2๋‹จ๊ณ„: ์ปค๋„ ์ง„์ž…

Step 3: ์‹คํ–‰ ๋ฐ ์ปค๋„ ์ง„์ž… ๊ด€์ฐฐ

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --third-case

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> (
    output=..., a=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:56
56          a: LayoutTensor[mut=False, dtype, vector_layout],

๐Ÿ” ์ฃผ์š” ๊ด€์ฐฐ:

  • Grid: (1,1,1) - ๋‹จ์ผ ๋ธ”๋ก
  • Block: (4,1,1) - ์ด 4๊ฐœ ์Šค๋ ˆ๋“œ (0, 1, 2, 3)
  • ํ˜„์žฌ ์Šค๋ ˆ๋“œ: (0,0,0) - ์Šค๋ ˆ๋“œ 0 ๋””๋ฒ„๊น… ์ค‘
  • ํ•จ์ˆ˜: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” collaborative_filter

Step 4: ์ดˆ๊ธฐํ™” ๊ณผ์ • ํƒ์ƒ‰

(cuda-gdb) n
55          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
58          thread_id = thread_idx.x
(cuda-gdb) n
66          ].stack_allocation()
(cuda-gdb) n
69          if thread_id < SIZE - 1:
(cuda-gdb) p thread_id
$1 = 0

โœ… ์Šค๋ ˆ๋“œ 0 ์ƒํƒœ: thread_id = 0, ์กฐ๊ฑด 0 < 3 ๊ฒ€์‚ฌ ์ง์ „ โ†’ True

Step 5: 1๋‹จ๊ณ„ ์ถ”์ 

(cuda-gdb) n
70              shared_workspace[thread_id] = rebind[Scalar[dtype]](a[thread_id])
(cuda-gdb) n
69          if thread_id < SIZE - 1:
(cuda-gdb) n
71          barrier()

1๋‹จ๊ณ„ ์™„๋ฃŒ: ์Šค๋ ˆ๋“œ 0์ด ์ดˆ๊ธฐํ™”๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ–ˆ์Šต๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ๊ฒฐ์ •์ ์ธ ๋ฐฐ๋ฆฌ์–ด ์กฐ์‚ฌ

Step 6: ์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด ๊ฒ€์‚ฌ

(cuda-gdb) n
74          if thread_id < SIZE - 1:
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffd3272180 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    74

โœ… ์ •์ƒ: 4๊ฐœ ์Šค๋ ˆ๋“œ ๋ชจ๋‘ 74๋ฒˆ ์ค„(์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด ํ†ต๊ณผ ํ›„)์— ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด๋Š” ์ •์ƒ ์ž‘๋™ํ–ˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ” ๊ฒฐ์ •์  ์ง€์ : ์ด์ œ ๋˜ ๋‹ค๋ฅธ ์กฐ๊ฑด๋ฌธ์ด ์žˆ๋Š” 2๋‹จ๊ณ„์— ์ง„์ž…ํ•ฉ๋‹ˆ๋‹ค.

Step 7: 2๋‹จ๊ณ„ ์ถ”์  - ์Šค๋ ˆ๋“œ 0 ๊ด€์ 

(cuda-gdb) n
76              if thread_id > 0:

์Šค๋ ˆ๋“œ 0 ๋ถ„์„: 0 < 3 โ†’ True โ†’ ์Šค๋ ˆ๋“œ 0์ด 2๋‹จ๊ณ„ ๋ธ”๋ก์— ์ง„์ž…

(cuda-gdb) n
78              barrier()

์Šค๋ ˆ๋“œ 0 ๊ฒฝ๋กœ: 0 > 0 โ†’ False โ†’ ์Šค๋ ˆ๋“œ 0์ด ๋‚ด๋ถ€ ์—ฐ์‚ฐ์€ ๊ฑด๋„ˆ๋›ฐ์ง€๋งŒ 78๋ฒˆ ์ค„์˜ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌ

๊ฒฐ์ •์  ์ˆœ๊ฐ„: ์Šค๋ ˆ๋“œ 0์ด ์ด์ œ 78๋ฒˆ ์ค„์˜ ๋ฐฐ๋ฆฌ์–ด์—์„œ ๋Œ€๊ธฐ ์ค‘์ž…๋‹ˆ๋‹ค.

(cuda-gdb) n # <-- ์‹คํ–‰ํ•˜๋ฉด ํ”„๋กœ๊ทธ๋žจ์ด ๋ฉˆ์ถฅ๋‹ˆ๋‹ค!
[HANGS HERE - ํ”„๋กœ๊ทธ๋žจ์ด ์ด ์ง€์ ์„ ๋„˜์–ด๊ฐ€์ง€ ๋ชปํ•จ]

Step 8: ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ์กฐ์‚ฌ

(cuda-gdb) cuda thread (1,0,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
78              barrier()
(cuda-gdb) p thread_id
$2 = 1
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (2,0,0)     3 0x00007fffd3273aa0 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    78
   (0,0,0)   (3,0,0)     (0,0,0)      (3,0,0)     1 0x00007fffd3273b10 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    81

๊ฒฐ์ •์  ์ฆ๊ฑฐ ๋ฐœ๊ฒฌ:

  • ์Šค๋ ˆ๋“œ 0, 1, 2: 78๋ฒˆ ์ค„์—์„œ ๋ชจ๋‘ ๋Œ€๊ธฐ ์ค‘ (์กฐ๊ฑด ๋ธ”๋ก ์•ˆ์˜ ๋ฐฐ๋ฆฌ์–ด)
  • ์Šค๋ ˆ๋“œ 3: 81๋ฒˆ ์ค„์— ์žˆ์Œ (์กฐ๊ฑด ๋ธ”๋ก์„ ์ง€๋‚˜์ณค๊ณ , ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•œ ์  ์—†์Œ!)

Step 9: ์Šค๋ ˆ๋“œ 3์˜ ์‹คํ–‰ ๊ฒฝ๋กœ ๋ถ„์„

๐Ÿ” info ์ถœ๋ ฅ์œผ๋กœ ๋ณธ ์Šค๋ ˆ๋“œ 3 ๋ถ„์„:

  • ์Šค๋ ˆ๋“œ 3: 81๋ฒˆ ์ค„์— ์œ„์น˜ (PC: 0x00007fffd3273b10)
  • 2๋‹จ๊ณ„ ์กฐ๊ฑด: thread_id < SIZE - 1 โ†’ 3 < 3 โ†’ False
  • ๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 3์€ 2๋‹จ๊ณ„ ๋ธ”๋ก(74-78๋ฒˆ ์ค„)์— ์ง„์ž…ํ•˜์ง€ ์•Š์Œ
  • ๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 3์€ 78๋ฒˆ ์ค„์˜ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•œ ์  ์—†์Œ
  • ํ˜„์žฌ ์ƒํƒœ: ์Šค๋ ˆ๋“œ 3์€ 81๋ฒˆ ์ค„(๋งˆ์ง€๋ง‰ ๋ฐฐ๋ฆฌ์–ด)์— ์žˆ๊ณ , ์Šค๋ ˆ๋“œ 0,1,2๋Š” 78๋ฒˆ ์ค„์—์„œ ๊ฐ‡ํ˜€ ์žˆ์Œ

4๋‹จ๊ณ„: ๊ทผ๋ณธ ์›์ธ ๋ถ„์„

Step 10: ๊ต์ฐฉ ์ƒํƒœ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์‹๋ณ„

# 2๋‹จ๊ณ„: ํ˜‘๋ ฅ์  ์ฒ˜๋ฆฌ
if thread_id < SIZE - 1:        # โ† ์Šค๋ ˆ๋“œ 0, 1, 2๋งŒ ์ด ๋ธ”๋ก์— ์ง„์ž…
    # ์ด์›ƒ๊ณผ ํ˜‘๋ ฅ ํ•„ํ„ฐ ์ ์šฉ
    if thread_id > 0:
        shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    barrier()                   # โ† ๊ต์ฐฉ ์ƒํƒœ: 4๊ฐœ ์ค‘ 3๊ฐœ ์Šค๋ ˆ๋“œ๋งŒ ์—ฌ๊ธฐ์— ๋„๋‹ฌ!

๐Ÿ’€ ๊ต์ฐฉ ์ƒํƒœ ๋ฉ”์ปค๋‹ˆ์ฆ˜:

  1. 스레드 0: 0 < 3 → True → 블록 진입 → 배리어에서 대기 (78번 줄)
  2. 스레드 1: 1 < 3 → True → 블록 진입 → 배리어에서 대기 (78번 줄)
  3. 스레드 2: 2 < 3 → True → 블록 진입 → 배리어에서 대기 (78번 줄)
  4. 스레드 3: 3 < 3 → False → 블록에 진입 안 함 → 81번 줄로 계속 진행

๊ฒฐ๊ณผ: 3๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ 4๋ฒˆ์งธ ์Šค๋ ˆ๋“œ๋ฅผ ์˜์›ํžˆ ๊ธฐ๋‹ค๋ฆฌ์ง€๋งŒ, ์Šค๋ ˆ๋“œ 3์€ ๊ทธ ๋ฐฐ๋ฆฌ์–ด์— ์ ˆ๋Œ€ ๋„์ฐฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
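위 메커니즘은 GPU가 아니어도 재현해 볼 수 있습니다. 아래는 `barrier()`를 Python의 `threading.Barrier`로 대신한 개념 스케치입니다(실제 Mojo/GPU 코드가 아닌 가정에 기반한 비유이며, 실제 GPU 배리어에는 타임아웃이 없어 그냥 영원히 멈춥니다):

```python
import threading

SIZE = 4
# GPU처럼 "블록의 모든 스레드(4개)"가 도달해야 통과하는 배리어
barrier = threading.Barrier(SIZE)
results = {}

def kernel_thread(thread_id: int) -> None:
    # 버그 재현: 스레드 3은 3 < 3 == False라서 조건 블록에 들어가지 않고,
    # 배리어에 절대 도달하지 않음 -> 나머지 3개 스레드가 영원히 대기
    if thread_id < SIZE - 1:
        try:
            # 시연을 위해 타임아웃을 둠; 타임아웃되면 배리어가 "깨짐" 상태가 됨
            barrier.wait(timeout=0.5)
            results[thread_id] = "passed"
        except threading.BrokenBarrierError:
            results[thread_id] = "deadlock"

threads = [threading.Thread(target=kernel_thread, args=(i,)) for i in range(SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # 스레드 0, 1, 2 모두 "deadlock"; 스레드 3은 기록 없음
```

스레드 3이 배리어에 도달하지 않는 한, 조건 블록 안에서 대기하는 스레드 0, 1, 2는 절대 풀려나지 못한다는 점이 핵심입니다.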

5๋‹จ๊ณ„: ๋ฒ„๊ทธ ํ™•์ธ๊ณผ ํ•ด๊ฒฐ์ฑ…

Step 11: ๊ทผ๋ณธ์ ์ธ ๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™ ์œ„๋ฐ˜

GPU ๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™: ๋™๊ธฐํ™”๊ฐ€ ์™„๋ฃŒ๋˜๋ ค๋ฉด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋ฌด์—‡์ด ์ž˜๋ชป๋˜์—ˆ๋‚˜:

# โŒ ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ์กฐ๊ฑด๋ฌธ ์•ˆ์— ๋ฐฐ๋ฆฌ์–ด
if thread_id < SIZE - 1:    # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ง„์ž…ํ•˜์ง€ ์•Š์Œ
    # ... ์—ฐ์‚ฐ ...
    barrier()               # ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๋งŒ ์—ฌ๊ธฐ์— ๋„๋‹ฌ

# โœ… ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ์กฐ๊ฑด๋ฌธ ๋ฐ–์— ๋ฐฐ๋ฆฌ์–ด
if thread_id < SIZE - 1:    # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ง„์ž…ํ•˜์ง€ ์•Š์Œ
    # ... ์—ฐ์‚ฐ ...
barrier()                   # 모든 스레드가 여기에 도달

์ˆ˜์ • ๋ฐฉ๋ฒ•: ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์กฐ๊ฑด ๋ธ”๋ก ๋ฐ–์œผ๋กœ ์ด๋™:

fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    a: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x
    shared_workspace = LayoutTensor[
        dtype,
        Layout.row_major(SIZE-1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # 1๋‹จ๊ณ„: ๊ณต์œ  ์ž‘์—…๊ณต๊ฐ„ ์ดˆ๊ธฐํ™” (๋ชจ๋“  ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](a[thread_id])
    barrier()

    # 2๋‹จ๊ณ„: ํ˜‘๋ ฅ์  ์ฒ˜๋ฆฌ
    if thread_id < SIZE - 1:
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    # โœ… ์ˆ˜์ •: ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์กฐ๊ฑด๋ฌธ ๋ฐ–์œผ๋กœ ์ด๋™ํ•ด์„œ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋„๋‹ฌํ•˜๋„๋ก
    barrier()

    # 3๋‹จ๊ณ„: ์ตœ์ข… ๋™๊ธฐํ™”์™€ ์ถœ๋ ฅ
    barrier()

    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](a[thread_id])
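수정된 패턴도 같은 방식의 Python 스케치(가정에 기반한 비유, 실제 Mojo 코드가 아님)로 확인할 수 있습니다. 배리어를 조건문 밖으로 옮기면 4개 스레드가 모두 도달하므로 교착 없이 통과합니다:

```python
import threading

SIZE = 4
barrier = threading.Barrier(SIZE)
results = {}

def kernel_thread(thread_id: int) -> None:
    if thread_id < SIZE - 1:
        pass  # ... 조건부 연산 (배리어는 여기 두지 않음) ...
    # 수정: 배리어가 조건문 밖에 있으므로 스레드 3도 도달
    barrier.wait(timeout=0.5)
    results[thread_id] = "passed"

threads = [threading.Thread(target=kernel_thread, args=(i,)) for i in range(SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # 4개 스레드 모두 "passed"
```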

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€:

  1. info cuda threads ์‚ฌ์šฉ - ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋А ์ค„์— ์žˆ๋Š”์ง€ ๋ณด์—ฌ์คŒ
  2. ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„๊ธฐ ์ฐพ๊ธฐ - ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋žจ ์œ„์น˜์— ์žˆ์Œ
  3. ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๊ฒฝ๋กœ ์ถ”์  - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•˜๋Š”์ง€ ํ™•์ธ
  4. ๋ฐฐ๋ฆฌ์–ด ๋„๋‹ฌ ๊ฐ€๋Šฅ์„ฑ ๊ฒ€์ฆ - ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋“ค์ด ๋„๋‹ฌํ•˜๋Š” ๋ฐฐ๋ฆฌ์–ด๋ฅผ ๊ฑด๋„ˆ๋›ฐ๋Š” ์Šค๋ ˆ๋“œ๊ฐ€ ์—†๋Š”์ง€ ํ™•์ธ

์‹ค๋ฌด GPU ๋””๋ฒ„๊น…์˜ ํ˜„์‹ค:

  • ๊ต์ฐฉ ์ƒํƒœ๋Š” ์†Œ๋ฆฌ ์—†๋Š” ์‚ด์ธ์ž - ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ๊ทธ๋ƒฅ ๋ฉˆ์ถค
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น…์€ ์ธ๋‚ด๊ฐ€ ํ•„์š” - ๊ฐ ์Šค๋ ˆ๋“œ ๊ฒฝ๋กœ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๋ถ„์„ํ•ด์•ผ ํ•จ
  • ์กฐ๊ฑด๋ถ€ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ๊ต์ฐฉ ์ƒํƒœ์˜ 1์ˆœ์œ„ ์›์ธ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋™๊ธฐํ™” ์ง€์ ์— ๋„๋‹ฌํ•˜๋Š”์ง€ ํ•ญ์ƒ ํ™•์ธ
  • CUDA-GDB ์Šค๋ ˆ๋“œ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์ˆ˜ - ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์œ ์ผํ•œ ๋ฐฉ๋ฒ•

๊ณ ๊ธ‰ GPU ๋™๊ธฐํ™”:

  • ๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™: ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•ด์•ผ ํ•จ
  • ์กฐ๊ฑด๋ถ€ ์‹คํ–‰์˜ ํ•จ์ •: ์–ด๋–ค if๋ฌธ์ด๋“  ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ: ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜์— ์ฃผ์˜ ํ•„์š”
  • LayoutTensor๊ฐ€ ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ๋ง‰์•„์ฃผ์ง€ ์•Š์Œ: ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ผ๋„ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋Š” ์—ฌ์ „ํžˆ ํ•„์š”

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ๋Š” GPU ๋ฒ„๊ทธ ์ค‘ ๋””๋ฒ„๊น…ํ•˜๊ธฐ ๊ฐ€์žฅ ์–ด๋ ค์šด ์œ ํ˜•์— ์†ํ•ฉ๋‹ˆ๋‹ค:

  • ์˜ค๋ฅ˜๊ฐ€ ๋ณด์ด์ง€ ์•Š์Œ - ๊ทธ์ € ๋ฌดํ•œ ๋Œ€๊ธฐ
  • ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋ถ„์„ ํ•„์š” - ์Šค๋ ˆ๋“œ ํ•˜๋‚˜๋งŒ ๋ด์„œ๋Š” ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์—†์Œ
  • ์กฐ์šฉํ•œ ์‹คํŒจ ๋ชจ๋“œ - ์ •ํ™•์„ฑ ๋ฒ„๊ทธ๊ฐ€ ์•„๋‹Œ ์„ฑ๋Šฅ ๋ฌธ์ œ์ฒ˜๋Ÿผ ๋ณด์ž„
  • ๋ณต์žกํ•œ ์Šค๋ ˆ๋“œ ์กฐ์œจ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฑธ์ณ ์‹คํ–‰ ๊ฒฝ๋กœ๋ฅผ ์ถ”์ ํ•ด์•ผ ํ•จ

CUDA-GDB๋กœ ์Šค๋ ˆ๋“œ ์ƒํƒœ๋ฅผ ๋ถ„์„ํ•˜๊ณ , ๋ถ„๊ธฐ๋œ ์‹คํ–‰ ๊ฒฝ๋กœ๋ฅผ ์‹๋ณ„ํ•˜๊ณ , ๋ฐฐ๋ฆฌ์–ด ๋„๋‹ฌ ๊ฐ€๋Šฅ์„ฑ์„ ๊ฒ€์ฆํ•˜๋Š” ์ด ๋””๋ฒ„๊น… ๋ฐฉ์‹์€ ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ์šด์˜ ์‹œ์Šคํ…œ์—์„œ ๊ต์ฐฉ ์ƒํƒœ ๋ฌธ์ œ์— ๋งž๋‹ฅ๋œจ๋ ธ์„ ๋•Œ ์“ฐ๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ •ํ™•ํžˆ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: GPU ๋””๋ฒ„๊น… ์Šคํ‚ฌ ์™„์„ฑ

GPU ๋””๋ฒ„๊น… ์‚ผ๋ถ€์ž‘์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค!

์™„์„ฑ๋œ GPU ๋””๋ฒ„๊น… ๋ฌด๊ธฐ๊ณ 

์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ - ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…:

  • โœ… ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋ฅผ ๊ฐ€์ด๋“œ ์‚ผ์•„ ์ฒด๊ณ„์ ์ธ ํฌ๋ž˜์‹œ ์กฐ์‚ฌ
  • โœ… ํฌ์ธํ„ฐ ์ฃผ์†Œ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ํƒ์ง€
  • โœ… ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ๋ฌธ์ œ๋ฅผ ์œ„ํ•œ CUDA-GDB ๊ธฐ์ดˆ

๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ - ๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…:

  • โœ… ๋šœ๋ ทํ•œ ์ฆ์ƒ ์—†์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • โœ… ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ ํ•˜๋Š” ํŒจํ„ด ๋ถ„์„ ๊ธฐ๋ฒ•
  • โœ… ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์•ˆ ๋  ๋•Œ ์‹คํ–‰ ํ๋ฆ„ ๋””๋ฒ„๊น…

์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ - ์กฐ์œจ ๋””๋ฒ„๊น…:

  • โœ… ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ์œ„ํ•œ ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ์กฐ์‚ฌ
  • โœ… ๊ณ ๊ธ‰ CUDA-GDB ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•œ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„
  • โœ… ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์„ ์œ„ํ•œ ๋™๊ธฐํ™” ๊ฒ€์ฆ

์ „๋ฌธ๊ฐ€์˜ GPU ๋””๋ฒ„๊น… ๋ฐฉ๋ฒ•๋ก 

์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ์‚ฌ์šฉํ•˜๋Š” ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค:

  1. ์ฆ์ƒ ์ฝ๊ธฐ - ํฌ๋ž˜์‹œ์ธ๊ฐ€? ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์ธ๊ฐ€? ๋ฌดํ•œ ์ •์ง€์ธ๊ฐ€?
  2. ๊ฐ€์„ค ์ˆ˜๋ฆฝ - ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ? ๋กœ์ง ์˜ค๋ฅ˜? ์กฐ์œจ ๋ฌธ์ œ?
  3. ์ฆ๊ฑฐ ์ˆ˜์ง‘ - ๋ฒ„๊ทธ ์œ ํ˜•์— ๋งž์ถฐ CUDA-GDB๋ฅผ ์ „๋žต์ ์œผ๋กœ ํ™œ์šฉ
  4. ์ฒด๊ณ„์ ์œผ๋กœ ํ…Œ์ŠคํŠธ - ๋ชฉํ‘œ ์ง€ํ–ฅ์  ์กฐ์‚ฌ๋ฅผ ํ†ตํ•ด ๊ฐ ๊ฐ€์„ค ๊ฒ€์ฆ
  5. ๊ทผ๋ณธ ์›์ธ ์ถ”์  - ์ฆ๊ฑฐ์˜ ์—ฐ๊ฒฐ ๊ณ ๋ฆฌ๋ฅผ ๋”ฐ๋ผ ์›์ฒœ๊นŒ์ง€

์—…์  ๋‹ฌ์„ฑ: ์ด์ œ ๊ฐ€์žฅ ํ”ํ•œ ์„ธ ๊ฐ€์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Puzzle 10: ์ƒˆ๋‹ˆํƒ€์ด์ €๋กœ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜์™€ ๊ฒฝ์Ÿ ์ƒํƒœ ์ฐพ๊ธฐ

โš ๏ธ ์ด ํผ์ฆ์€ ํ˜ธํ™˜๋˜๋Š” NVIDIA GPU์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ GPU ๋ฒค๋” ์ง€์›์„ ์œ„ํ•œ ๋„๊ตฌ ๊ฐœ๋ฐœ์ด ์ง„ํ–‰ ์ค‘์ž…๋‹ˆ๋‹ค.

๋ชจ๋“  GPU ๊ฐœ๋ฐœ์ž๊ฐ€ ๋‘๋ ค์›Œํ•˜๋Š” ์ˆœ๊ฐ„

์™„๋ฒฝํ•ด ๋ณด์ด๋Š” GPU ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ •ํ™•ํ•˜๊ณ , ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋„ ์˜ฌ๋ฐ”๋ฅธ ๊ฒƒ ๊ฐ™๊ณ , ์Šค๋ ˆ๋“œ ์กฐ์œจ๋„ ํ ์žก์„ ๋ฐ ์—†์–ด ๋ณด์ž…๋‹ˆ๋‹ค. ์ž์‹  ์žˆ๊ฒŒ ํ…Œ์ŠคํŠธ๋ฅผ ์‹คํ–‰ํ•˜๋ฉดโ€ฆ

  • โœ… ๋ชจ๋“  ํ…Œ์ŠคํŠธ ํ†ต๊ณผ
  • โœ… ์„ฑ๋Šฅ๋„ ํ›Œ๋ฅญํ•จ
  • โœ… ์ถœ๋ ฅ์ด ์˜ˆ์ƒ ๊ฒฐ๊ณผ์™€ ์ผ์น˜

๋ฟŒ๋“ฏํ•˜๊ฒŒ ์ฝ”๋“œ๋ฅผ ํ”„๋กœ๋•์…˜์— ๋ฐฐํฌํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๋ช‡ ์ฃผ ํ›„, ์—ฐ๋ฝ์ด ์˜ต๋‹ˆ๋‹ค:

  • โ€œํ”„๋กœ๋•์…˜์—์„œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด ํฌ๋ž˜์‹œ๋์–ด์š”โ€
  • โ€œ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์š”โ€
  • โ€œ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ์ด ๊ฐ์ง€๋์–ด์š”โ€

์กฐ์šฉํžˆ ์ˆจ์–ด๋“œ๋Š” GPU ๋ฒ„๊ทธ์˜ ์„ธ๊ณ„์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค. ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์˜ ๊ทธ๋Š˜์— ์ˆจ์–ด ์žˆ๋‹ค๊ฐ€ ๊ฐ€์žฅ ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ์ˆœ๊ฐ„์— ํŠ€์–ด๋‚˜์˜ค๋Š” ์˜ค๋ฅ˜๋“ค์ด์ฃ . ์ด๋Ÿฐ ๋ฒ„๊ทธ๋“ค์€ ๋ชจ๋“  ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๊ณ , 99%์˜ ๊ฒฝ์šฐ ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋‹ค๊ฐ€, ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์ˆœ๊ฐ„์— ์น˜๋ช…์ ์œผ๋กœ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค.

์ค‘์š”: ์ด ํผ์ฆ์€ NVIDIA GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ํ•„์š”ํ•˜๋ฉฐ, compute-sanitizer๊ฐ€ NVIDIA CUDA toolkit์— ํฌํ•จ๋˜์–ด ์žˆ์–ด pixi๋ฅผ ํ†ตํ•ด์„œ๋งŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ๋ฒ„๊ทธ๊ฐ€ ์œ ๋‚œํžˆ ๊ตํ™œํ•œ ์ด์œ 

CPU ํ”„๋กœ๊ทธ๋žจ์—์„œ๋Š” ๋ฒ„๊ทธ๊ฐ€ ๋ณดํ†ต ์ฆ‰๊ฐ์ ์ธ ํฌ๋ž˜์‹œ๋‚˜ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋กœ ์ž์‹ ์˜ ์กด์žฌ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ GPU ๋ฒ„๊ทธ๋Š” ์ˆจ๊ธฐ์˜ ๋‹ฌ์ธ์ž…๋‹ˆ๋‹ค:

์กฐ์šฉํžˆ ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ค๋Š” ํŒจํ„ด:

  • ํฌ๋ž˜์‹œ ์—†๋Š” ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜: ์šฐ์—ฐํžˆ ์œ ํšจํ•œ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜๋ฅผ ๊ฑด๋“œ๋ฆฌ๋Š” ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ
  • โ€œ๋Œ€๋ถ€๋ถ„์€ ์ž˜ ๋™์ž‘ํ•˜๋Š”โ€ ๊ฒฝ์Ÿ ์ƒํƒœ: ํƒ€์ด๋ฐ์— ๋”ฐ๋ผ ๋ฌด์ž‘์œ„์ฒ˜๋Ÿผ ๋‚˜ํƒ€๋‚˜๋Š” ๋ฒ„๊ทธ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ: ํŠน์ • ๋ถ€ํ•˜ ์กฐ๊ฑด์—์„œ๋งŒ ๋ฐœ์ƒํ•˜๋Š” ๊ต์ฐฉ ์ƒํƒœ

๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์—์„œ ์ฆํญ๋˜๋Š” ๋ฌธ์ œ:

  • ํ•œ ์Šค๋ ˆ๋“œ์˜ ๋ฒ„๊ทธ๊ฐ€ ์ˆ˜์ฒœ ๊ฐœ์— ์˜ํ–ฅ: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํ•˜๋‚˜๊ฐ€ ์ „์ฒด ์›Œํ”„๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ๊ฒฝ์Ÿ ์ƒํƒœ์˜ ๊ธฐํ•˜๊ธ‰์ˆ˜์  ์ฆ๊ฐ€: ์Šค๋ ˆ๋“œ๊ฐ€ ๋งŽ์„์ˆ˜๋ก ์†์ƒ ๊ฐ€๋Šฅ์„ฑ๋„ ์ปค์ง
  • ํ•˜๋“œ์›จ์–ด ์ฐจ์ด๊ฐ€ ๋ฌธ์ œ๋ฅผ ์€ํ: ๊ฐ™์€ ๋ฒ„๊ทธ๊ฐ€ GPU ์•„ํ‚คํ…์ฒ˜๋งˆ๋‹ค ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘

ํ•˜์ง€๋งŒ ํฌ์†Œ์‹์ด ์žˆ์Šต๋‹ˆ๋‹ค: GPU ๊ฒ€์‚ฌ ๋„๊ตฌ๋ฅผ ์ตํžˆ๋ฉด, ์ด๋ ‡๊ฒŒ ์ฐพ๊ธฐ ์–ด๋ ค์šด ๋ฒ„๊ทธ๋“ค์„ ํ”„๋กœ๋•์…˜์— ๋„๋‹ฌํ•˜๊ธฐ ์ „์— ์žก์•„๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ƒˆ๋‹ˆํƒ€์ด์ € ๋„๊ตฌ ๋ชจ์Œ: NVIDIA compute-sanitizer

NVIDIA compute-sanitizer๋Š” GPU ๋ฒ„๊ทธ์— ๋งž์„œ ์‹ธ์šฐ๋Š” ์—ฌ๋Ÿฌ๋ถ„์˜ ๋น„๋ฐ€ ๋ฌด๊ธฐ์ž…๋‹ˆ๋‹ค. ๋‹ค์Œ์„ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ, ์ž˜๋ชป๋œ ํฌ์ธํ„ฐ, ๋ฉ”๋ชจ๋ฆฌ ๋ˆ„์ˆ˜
  • ๊ฒฝ์Ÿ ์ƒํƒœ: ์Šค๋ ˆ๋“œ ๊ฐ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ hazard
  • ๋™๊ธฐํ™” ๋ฒ„๊ทธ: ๊ต์ฐฉ ์ƒํƒœ, barrier ์˜ค์šฉ, ๋ถ€์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ์กฐ์œจ
  • ๊ทธ ์™ธ: pixi run compute-sanitizer --help๋กœ ํ™•์ธ

๐Ÿ“– ๊ณต์‹ ๋ฌธ์„œ: NVIDIA Compute Sanitizer User Guide

GPU ํ”„๋กœ๊ทธ๋žจ์˜ X-ray๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜ ํ…Œ์ŠคํŠธ๋กœ๋Š” ๋ณผ ์ˆ˜ ์—†๋Š” ์ˆจ๊ฒจ์ง„ ๋ฌธ์ œ๊นŒ์ง€ ๋“œ๋Ÿฌ๋‚ด ์ค๋‹ˆ๋‹ค.

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ

์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ€์žฅ ์ฐพ๊ธฐ ์–ด๋ ค์šด GPU ๋ฒ„๊ทธ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ฐพ์•„ ์ˆ˜์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์œ ๋Šฅํ•œ GPU ๊ฐœ๋ฐœ์ž์™€ ๋›ฐ์–ด๋‚œ ๊ฐœ๋ฐœ์ž๋ฅผ ๊ตฌ๋ถ„ ์ง“๋Š” ํƒ์ • ๊ธฐ์ˆ ์„ ์ตํžˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ตํžˆ๊ฒŒ ๋  ํ•ต์‹ฌ ๊ธฐ์ˆ 

  1. ์ˆจ์€ ๋ฒ„๊ทธ ์ฐพ๊ธฐ - ํ…Œ์ŠคํŠธ๋กœ๋Š” ์žกํžˆ์ง€ ์•Š๋Š” ๋ฌธ์ œ ๋ฐœ๊ฒฌ
  2. ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ ์กฐ์‚ฌ - ํ”ผํ•ด๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์ „์— ๋ฏธ์ •์˜ ๋™์ž‘ ์ถ”์ 
  3. ๊ฒฝ์Ÿ ์ƒํƒœ ํƒ์ง€ - ๋™์‹œ์„ฑ ์œ„ํ—˜ ์š”์†Œ๋ฅผ ์ฐพ์•„๋‚ด๊ณ  ์ œ๊ฑฐ
  4. ๋„๊ตฌ ์„ ํƒ ๋Šฅ๋ ฅ - ์ƒํ™ฉ์— ๋งž๋Š” ์ƒˆ๋‹ˆํƒ€์ด์ € ์„ ํƒ
  5. ํ”„๋กœ๋•์…˜ ๋””๋ฒ„๊น… ์ž์‹ ๊ฐ - ์‚ฌ์šฉ์ž์—๊ฒŒ ๋„๋‹ฌํ•˜๊ธฐ ์ „์— ๋ฒ„๊ทธ ํฌ์ฐฉ

์‹ค์ „ ๋ฒ„๊ทธ ์‚ฌ๋ƒฅ ์‹œ๋‚˜๋ฆฌ์˜ค

๊ฐ€์žฅ ์œ„ํ—˜ํ•œ ๋‘ ์ข…๋ฅ˜์˜ GPU ๋ฒ„๊ทธ๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ - ๊ฒฝ๊ณ  ์—†์ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ง๊ฐ€๋œจ๋ฆฌ๋Š” ์กฐ์šฉํ•œ ์•”์‚ด์ž
  • ๊ฒฝ์Ÿ ์ƒํƒœ - ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ํ˜ผ๋ˆ์˜ ์”จ์•—

๊ฐ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ ์ผ๋ฐ˜ ํ…Œ์ŠคํŠธ๋กœ๋Š” ๋ณด์ด์ง€ ์•Š๋Š” ๋‹จ์„œ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋ฉฐ, GPU ๋ฒ„๊ทธ ํƒ์ •์ฒ˜๋Ÿผ ์‚ฌ๊ณ ํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฒ„๊ทธ ์‚ฌ๋ƒฅ ์—ฌ์ •

์ด ํผ์ฆ์€ ์กฐ์šฉํ•œ ์†์ƒ์„ ๋ฐœ๊ฒฌํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ๋ณ‘๋ ฌ ๋””๋ฒ„๊น…์„ ๋ฐฐ์šฐ๋Š” ๊ฒƒ๊นŒ์ง€, ์ฒด๊ณ„์ ์œผ๋กœ ์„ค๊ณ„๋œ ๊ณผ์ •์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ‘ฎ๐Ÿผโ€โ™‚๏ธ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํƒ์ง€

๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ์กฐ์‚ฌ - ํ…Œ์ŠคํŠธ๋Š” ํ†ต๊ณผํ•ด๋„ ๋ฉ”๋ชจ๋ฆฌ๋Š” ๊ฑฐ์ง“๋ง์„ ํ•  ๋•Œ

  • ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๋ฉด์„œ๋„ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ”์ฃ„๋ฅผ ์ €์ง€๋ฅด๋Š” ํ”„๋กœ๊ทธ๋žจ ์กฐ์‚ฌ
  • ๋ฏธ์ •์˜ ๋™์ž‘(UB)์˜ ์ง•ํ›„๋ฅผ ์•Œ์•„๋ณด๋Š” ๋ฒ• ์ตํžˆ๊ธฐ
  • memcheck ํ•™์Šต - ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์„ ์žก์•„๋‚ด๋Š” ํƒ์ง€๊ธฐ
  • GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜๋ฅผ ์ˆจ๊ธฐ๋Š” ์ด์œ  ์ดํ•ด
  • ์ฒด๊ณ„์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ฒ€์ฆ ์‹ค์Šต

๋ชฉํ‘œ: ๋ฐฉ์น˜ํ•˜๋ฉด ํ”„๋กœ๋•์…˜๊นŒ์ง€ ๋ฐœ๊ฒฌ๋˜์ง€ ์•Š์•˜์„ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํƒ์ง€ ๋Šฅ๋ ฅ

๐Ÿ ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…

๋™์‹œ์„ฑ ๋ฒ„๊ทธ ์กฐ์‚ฌ - ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋ฐœ๋ชฉ์„ ์žก์„ ๋•Œ

  • ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ ๋•Œ๋ฌธ์— ๋ฌด์ž‘์œ„๋กœ ์‹คํŒจํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ ์กฐ์‚ฌ
  • ๋ฐ์ดํ„ฐ๊ฐ€ ์†์ƒ๋˜๊ธฐ ์ „์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„ํ—˜ ์š”์†Œ ์‹๋ณ„๋ฒ• ์ตํžˆ๊ธฐ
  • racecheck ํ•™์Šต - ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์žก์•„๋‚ด๋Š” ํƒ์ง€๊ธฐ
  • ๋‹ค์–‘ํ•œ ๋™์‹œ์„ฑ ๋ฒ„๊ทธ์— ๋Œ€ํ•ด racecheck vs synccheck ๋น„๊ต
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์ „๋žต ์‹ค์Šต

๋ชฉํ‘œ: ๊ณ ๊ธ‰ ๋™์‹œ์„ฑ ๋””๋ฒ„๊น… - ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋ฅผ ๊ธธ๋“ค์ด๋Š” ๋Šฅ๋ ฅ

GPU ํƒ์ • ๋งˆ์ธ๋“œ์…‹

GPU ๊ฒ€์‚ฌ๋ฅผ ํ•˜๋ ค๋ฉด ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ํƒ์ •์ด ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‚ฌ๊ฑด์„ ์กฐ์‚ฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  • ์ฆ๊ฑฐ๊ฐ€ ์ˆจ๊ฒจ์ ธ ์žˆ๋‹ค - ์ง์ ‘ ๊ด€์ฐฐํ•  ์ˆ˜ ์—†๋Š” ๋ณ‘๋ ฌ ์‹คํ–‰ ์†์—์„œ ๋ฒ„๊ทธ๊ฐ€ ๋ฐœ์ƒ
  • ์šฉ์˜์ž๊ฐ€ ์ˆ˜์—†์ด ๋งŽ๋‹ค - ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ ์ค‘ ์–ด๋–ค ์กฐํ•ฉ์ด๋“  ๋ฒ”์ธ์ผ ์ˆ˜ ์žˆ์Œ
  • ๋ฒ”ํ–‰์ด ๊ฐ„ํ—์ ์ด๋‹ค - ๊ฒฝ์Ÿ ์ƒํƒœ์™€ ํƒ€์ด๋ฐ์— ๋”ฐ๋ฅธ ์‹คํŒจ
  • ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค - ์ผ๋ฐ˜ ๋””๋ฒ„๊น…์œผ๋กœ๋Š” ๋ณผ ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ์ƒˆ๋‹ˆํƒ€์ด์ €๊ฐ€ ๋ณด์—ฌ์คŒ

ํ•˜์ง€๋งŒ ํ›Œ๋ฅญํ•œ ํƒ์ •์ฒ˜๋Ÿผ, ์—ฌ๋Ÿฌ๋ถ„๋„ ๋‹ค์Œ์„ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  • ๋ณด์ด์ง€ ์•Š๋Š” ๋‹จ์„œ ๋”ฐ๋ผ๊ฐ€๊ธฐ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ, ๋™๊ธฐํ™” ์ง€์ 
  • ๋ณ‘๋ ฌ์ ์œผ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ - ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์–ด๋–ป๊ฒŒ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š”์ง€ ๊ณ ๋ ค
  • ๋ฏธ๋ž˜์˜ ๋ฒ”์ฃ„ ์˜ˆ๋ฐฉํ•˜๊ธฐ - ๊ฐœ๋ฐœ ์›Œํฌํ”Œ๋กœ์šฐ์— ๊ฒ€์‚ฌ ๋„๊ตฌ ํ†ตํ•ฉ
  • ๋„๊ตฌ ๋ฏฟ๊ธฐ - ์ˆ˜๋™ ํ…Œ์ŠคํŠธ๋กœ๋Š” ๋“œ๋Ÿฌ๋‚ผ ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ์ƒˆ๋‹ˆํƒ€์ด์ €์— ๋งก๊ธฐ๊ธฐ

์‹œ์ž‘ํ•˜๊ธฐ ์ „์—

์•Œ์•„์•ผ ํ•  ๊ฒƒ:

  • Puzzle 1-8์—์„œ ๋‹ค๋ฃฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋… (๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ์Šค๋ ˆ๋“œ ์กฐ์œจ, ๋ฐฐ๋ฆฌ์–ด)
  • ํ˜ธํ™˜ NVIDIA GPU ํ•˜๋“œ์›จ์–ด
  • compute-sanitizer ์ ‘๊ทผ์„ ์œ„ํ•œ pixi ํŒจํ‚ค์ง€ ๋งค๋‹ˆ์ € ํ™˜๊ฒฝ ์„ค์ •
  • ์„ ํ–‰ ํผ์ฆ: Puzzle 4์™€ Puzzle 8 ์ˆ™์ง€ ๊ถŒ์žฅ

๋ชฉํ‘œ:

  • ์ „๋ฌธ GPU ๊ฐœ๋ฐœํŒ€์—์„œ ์‚ฌ์šฉํ•˜๋Š” ํ”„๋กœ๋•์…˜๊ธ‰ ๋””๋ฒ„๊น… ๊ธฐ์ˆ 
  • ๋น„์šฉ์ด ํฐ ํ”„๋กœ๋•์…˜ ์žฅ์• ๋ฅผ ์˜ˆ๋ฐฉํ•˜๋Š” ์ˆจ์€ ๋ฒ„๊ทธ ํƒ์ง€ ๊ธฐ์ˆ 
  • ๊ฐ€์žฅ ๊นŒ๋‹ค๋กœ์šด ๋™์‹œ์„ฑ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋„ ๋ณ‘๋ ฌ ๋””๋ฒ„๊น… ์ž์‹ ๊ฐ
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ปค๋ฆฌ์–ด ์ „๋ฐ˜์— ๋„์›€์ด ๋  ๋„๊ตฌ ์ „๋ฌธ์„ฑ

๐Ÿ‘ฎ๐Ÿผโ€โ™‚๏ธ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํƒ์ง€

๊ฐœ์š”

ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์—ฌ๋„ GPU ํ”„๋กœ๊ทธ๋žจ์„ ์กฐ์šฉํžˆ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์„ ํƒ์ง€ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. NVIDIA์˜ compute-sanitizer(pixi๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉ ๊ฐ€๋Šฅ)์™€ memcheck ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, GPU ์ฝ”๋“œ์—์„œ ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๋™์ž‘์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ˆจ์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ํ”„๋กœ๊ทธ๋žจ์€ ๋ถˆ๋ฒ•์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ๋„ ๋™์‹œ์— โ€œ์˜ฌ๋ฐ”๋ฅธโ€ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์„ ํ–‰ ํ•™์Šต: Puzzle 4 LayoutTensor์™€ ๊ธฐ๋ณธ์ ์ธ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์กฐ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ์˜ ๋ฐœ๊ฒฌ

ํ…Œ์ŠคํŠธ๋Š” ํ†ต๊ณผํ–ˆ์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์ •๋ง ์˜ฌ๋ฐ”๋ฅธ ๊ฑธ๊นŒ?

์–ผํ• ๋ฌดํ•ดํ•ด ๋ณด์ด๊ณ  ์™„๋ฒฝํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋Š” ๋“ฏํ•œ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ด…์‹œ๋‹ค (๊ฐ€๋“œ๊ฐ€ ์—†๋Š” Puzzle 04์ž…๋‹ˆ๋‹ค):

fn add_10_2d(
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    output[row, col] = a[row, col] + 10.0


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p10/p10.mojo

์ด ํ”„๋กœ๊ทธ๋žจ์„ ์ผ๋ฐ˜์ ์œผ๋กœ ์‹คํ–‰ํ•˜๋ฉด, ๋ชจ๋“  ๊ฒƒ์ด ์ •์ƒ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค:

pixi run p10 --memory-bug
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
โœ… Memory test PASSED! (memcheck may find bounds violations)

โœ… ํ…Œ์ŠคํŠธ ํ†ต๊ณผ! ์ถœ๋ ฅ์ด ์˜ˆ์ƒ ๊ฒฐ๊ณผ์™€ ์™„๋ฒฝํ•˜๊ฒŒ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ๊ฑด ์ข…๊ฒฐ, ๋งž์ฃ ?

์•„๋‹™๋‹ˆ๋‹ค! compute-sanitizer๊ฐ€ ๋ฌด์—‡์„ ๋ณด์—ฌ์ฃผ๋Š”์ง€ ๋ด…์‹œ๋‹ค:

MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug

์ฐธ๊ณ : MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0์€ ๋””๋ฐ”์ด์Šค ์ปจํ…์ŠคํŠธ์˜ ๋ฒ„ํผ ์บ์‹œ๋ฅผ ๋น„ํ™œ์„ฑํ™”ํ•˜๋Š” ๋ช…๋ น์ค„ ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ์„ค์ •์ž…๋‹ˆ๋‹ค. ์ด ์„ค์ •์€ ์ผ๋ฐ˜์ ์ธ ์บ์‹ฑ ๋™์ž‘์— ์˜ํ•ด ์ˆจ๊ฒจ์ง€๋˜ ๊ฒฝ๊ณ„ ์œ„๋ฐ˜ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ๋ฅผ ๋“œ๋Ÿฌ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (์—ญ์ฃผ: ๋ฒ„ํผ ์บ์‹œ๊ฐ€ ํ™œ์„ฑํ™”๋˜๋ฉด ํ•ด์ œ๋œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฆ‰์‹œ ๋ฐ˜ํ™˜ํ•˜์ง€ ์•Š๊ณ  ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๋ณด๊ด€ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋•Œ๋ฌธ์— ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์ด ์•„์ง ์œ ํšจํ•œ ์บ์‹œ ์˜์—ญ์— ๋‹ฟ์•„ ์˜ค๋ฅ˜๊ฐ€ ๋“œ๋Ÿฌ๋‚˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋น„ํ™œ์„ฑํ™”ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ฆ‰์‹œ ๋ฐ˜ํ™˜๋˜์–ด ์œ„๋ฐ˜์ด ๊ฐ์ง€๋ฉ๋‹ˆ๋‹ค.)
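캐시가 위반을 숨기는 원리는 아주 단순한 모델로 표현할 수 있습니다. 아래는 가정에 기반한 Python 개념 스케치입니다(실제 할당기 동작의 단순화이며, 슬랩 크기 512는 임의 가정입니다):

```python
# 단순화 모델: 매핑된 영역 밖을 건드릴 때만 memcheck 오류로 보고된다고 가정
def access_is_reported(offset: int, mapped_size: int) -> bool:
    return offset >= mapped_size

REQUESTED = 16       # 커널이 실제로 요청한 할당: 2x2 float32 = 16바이트
CACHED_SLAB = 512    # 캐시가 재사용을 위해 내준 더 큰 블록 (임의 가정)
bad_offset = 20      # 할당 끝(16바이트)을 넘어서는 범위 초과 접근

# 캐시 활성: 접근이 여전히 매핑된 슬랩 안 -> 위반이 보고되지 않음
print(access_is_reported(bad_offset, CACHED_SLAB))   # False
# 캐시 비활성(SIZE_PERCENT=0): 정확히 16바이트만 매핑 -> 위반 감지
print(access_is_reported(bad_offset, REQUESTED))     # True
```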

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (2,1,0) in block (0,0,0)
=========     Access at 0xe0c000210 is out of bounds
=========     and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (0,2,0) in block (0,0,0)
=========     Access at 0xe0c000210 is out of bounds
=========     and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (1,2,0) in block (0,0,0)
=========     Access at 0xe0c000214 is out of bounds
=========     and is 517 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (2,2,0) in block (0,0,0)
=========     Access at 0xe0c000218 is out of bounds
=========     and is 521 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuStreamSynchronize.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuEventCreate.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuMemFreeAsync.

========= ERROR SUMMARY: 7 errors

๋ชจ๋“  ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ–ˆ์Œ์—๋„ ํ”„๋กœ๊ทธ๋žจ์—๋Š” ์ด 7๊ฐœ์˜ ์˜ค๋ฅ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • 4๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ (Invalid __global__ read)
  • 3๊ฐœ์˜ ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜ (๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์œผ๋กœ ์ธํ•ด ๋ฐœ์ƒ)

์ˆจ๊ฒจ์ง„ ๋ฒ„๊ทธ ์ดํ•ดํ•˜๊ธฐ

๊ทผ๋ณธ ์›์ธ ๋ถ„์„

๋ฌธ์ œ:

  • ํ…์„œ ํฌ๊ธฐ: 2ร—2 (์œ ํšจํ•œ ์ธ๋ฑ์Šค: 0, 1)
  • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ: 3ร—3 (์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค: 0, 1, 2)
  • ๋ฒ”์œ„ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ: (2,1), (0,2), (1,2), (2,2)๊ฐ€ ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ˆ„๋ฝ: ํ…์„œ ์ฐจ์›์— ๋Œ€ํ•œ thread_idx ๊ฒ€์ฆ์ด ์—†์Œ

7๊ฐœ ์˜ค๋ฅ˜ ์ „์ฒด ์ดํ•ดํ•˜๊ธฐ

4๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜:

  • ๊ฐ ๋ฒ”์œ„ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ (2,1), (0,2), (1,2), (2,2)๊ฐ€ Invalid __global__ read๋ฅผ ๋ฐœ์ƒ์‹œํ‚ด

3๊ฐœ์˜ CUDA ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜:

  • ์ปค๋„ ์‹คํ–‰ ์‹คํŒจ๋กœ ์ธํ•ด cuStreamSynchronize ์‹คํŒจ
  • ์ •๋ฆฌ ๊ณผ์ •์—์„œ cuEventCreate ์‹คํŒจ
  • ๋ฉ”๋ชจ๋ฆฌ ํ•ด์ œ ๊ณผ์ •์—์„œ cuMemFreeAsync ์‹คํŒจ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์€ ์—ฐ์‡„ ํšจ๊ณผ๋ฅผ ์ผ์œผํ‚ต๋‹ˆ๋‹ค - ํ•˜๋‚˜์˜ ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์—ฌ๋Ÿฌ ํ›„์† CUDA API ์‹คํŒจ๋ฅผ ์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿผ์—๋„ ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผํ•œ ์ด์œ :

  • ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ (0,0), (0,1), (1,0), (1,1)์ด ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋กํ•จ
  • ํ…Œ์ŠคํŠธ๊ฐ€ ์œ ํšจํ•œ ์ถœ๋ ฅ ์œ„์น˜๋งŒ ๊ฒ€์‚ฌํ•จ
  • ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์ด ํ”„๋กœ๊ทธ๋žจ์„ ์ฆ‰์‹œ ํฌ๋ž˜์‹œ์‹œํ‚ค์ง€ ์•Š์Œ

๋ฏธ์ •์˜ ๋™์ž‘ ์ดํ•ดํ•˜๊ธฐ

๋ฏธ์ •์˜ ๋™์ž‘์ด๋ž€?

๋ฏธ์ •์˜ ๋™์ž‘(Undefined Behavior, UB) ์€ ํ”„๋กœ๊ทธ๋žจ์ด ์–ธ์–ด ๋ช…์„ธ์ƒ ์ •์˜๋˜์ง€ ์•Š์€ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ๋ฒ”์œ„ ์ดˆ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋Œ€ํ‘œ์ ์ธ ์˜ˆ์ž…๋‹ˆ๋‹ค.

๋ฏธ์ •์˜ ๋™์ž‘์˜ ์ฃผ์š” ํŠน์„ฑ:

  • ํ”„๋กœ๊ทธ๋žจ์ด ๋ง ๊ทธ๋Œ€๋กœ ๋ฌด์Šจ ์ง“์ด๋“  ํ•  ์ˆ˜ ์žˆ์Œ: ํฌ๋ž˜์‹œ, ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ, ์ •์ƒ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด๊ธฐ, ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ
  • ์–ด๋–ค ๋ณด์žฅ๋„ ์—†์Œ: ์ปดํŒŒ์ผ๋Ÿฌ, ํ•˜๋“œ์›จ์–ด, ๋“œ๋ผ์ด๋ฒ„, ์‹ฌ์ง€์–ด ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๋™์ž‘์ด ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ

๋ฏธ์ •์˜ ๋™์ž‘์ด ํŠนํžˆ ์œ„ํ—˜ํ•œ ์ด์œ 

์ •ํ™•์„ฑ ๋ฌธ์ œ:

  • ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ฒฐ๊ณผ: ํ…Œ์ŠคํŠธ ์ค‘์—๋Š” ๋™์ž‘ํ•˜๋‹ค๊ฐ€ ํ”„๋กœ๋•์…˜์—์„œ ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Œ
  • ๋น„๊ฒฐ์ •์  ๋™์ž‘: ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ๋‹ค๋ฅธ ์‹คํ–‰์—์„œ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์Œ
  • ์กฐ์šฉํ•œ ์†์ƒ: ๋ฏธ์ •์˜ ๋™์ž‘์€ ๊ฐ€์‹œ์ ์ธ ์˜ค๋ฅ˜ ์—†์ด ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”: ์ปดํŒŒ์ผ๋Ÿฌ๋Š” ๋ฏธ์ •์˜ ๋™์ž‘์ด ์—†๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๊ณ  ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋ฐฉ์‹์œผ๋กœ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ์Œ

๋ณด์•ˆ ์ทจ์•ฝ์ :

  • ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ: ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ณด์•ˆ ๊ณต๊ฒฉ์˜ ๊ณ ์ „์ ์ธ ์›์ธ
  • ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ: ๊ถŒํ•œ ์ƒ์Šน์ด๋‚˜ ์ฝ”๋“œ ์ธ์ ์…˜ ๊ณต๊ฒฉ์œผ๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Œ
  • ์ •๋ณด ์œ ์ถœ: ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ฝ๊ธฐ๋กœ ๋ฏผ๊ฐํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ๋…ธ์ถœ๋  ์ˆ˜ ์žˆ์Œ
  • ์ œ์–ด ํ๋ฆ„ ํ•˜์ด์žฌํ‚น: ๋ฏธ์ •์˜ ๋™์ž‘์„ ์•…์šฉํ•ด ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ ํ๋ฆ„์„ ํƒˆ์ทจํ•  ์ˆ˜ ์žˆ์Œ

GPU ํŠน์œ ์˜ ๋ฏธ์ •์˜ ๋™์ž‘ ์œ„ํ—˜์„ฑ

๋Œ€๊ทœ๋ชจ ์˜ํ–ฅ:

  • ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ: ํ•œ ์Šค๋ ˆ๋“œ์˜ ๋ฏธ์ •์˜ ๋™์ž‘์ด ์ „์ฒด ์›Œํ”„(32๊ฐœ ์Šค๋ ˆ๋“œ)์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์ด ์ธ์ ‘ ์Šค๋ ˆ๋“œ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ์ปค๋„ ์‹คํŒจ: ๋ฏธ์ •์˜ ๋™์ž‘์ด GPU ์ปค๋„ ์ „์ฒด๋ฅผ ์™„์ „ํžˆ ๋ง๊ฐ€๋œจ๋ฆด ์ˆ˜ ์žˆ์Œ

ํ•˜๋“œ์›จ์–ด ์ฐจ์ด:

  • ๋‹ค๋ฅธ GPU ์•„ํ‚คํ…์ฒ˜: ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋‹ค๋ฅธ GPU ๋ชจ๋ธ์—์„œ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ
  • ๋“œ๋ผ์ด๋ฒ„ ์ฐจ์ด: ๊ฐ™์€ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋“œ๋ผ์ด๋ฒ„ ๋ฒ„์ „์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ณ€๊ฒฝ: GPU ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ํŒจํ„ด์— ๋”ฐ๋ผ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ

๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ์ˆ˜์ •ํ•˜๊ธฐ

ํ•ด๊ฒฐ์ฑ…

Puzzle 04์—์„œ ๋ณธ ๊ฒƒ์ฒ˜๋Ÿผ, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

fn add_10_2d(
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x
    if col < size and row < size:
        output[row, col] = a[row, col] + 10.0


ํ•ด๊ฒฐ์ฑ…์€ ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค: ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ํ•ญ์ƒ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ์ดํ„ฐ ์ฐจ์›์— ๋Œ€ํ•ด ๊ฒ€์ฆํ•˜์„ธ์š”.

compute-sanitizer๋กœ ๊ฒ€์ฆ

# p10.mojo ๋ณต์‚ฌ๋ณธ์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ˆ˜์ •ํ•œ ํ›„ ์‹คํ–‰:
MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug
========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
โœ… Memory test PASSED! (memcheck may find bounds violations)
========= ERROR SUMMARY: 0 errors

โœ… ์„ฑ๊ณต: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์ด ํƒ์ง€๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค!

ํ•ต์‹ฌ ํ•™์Šต ํฌ์ธํŠธ

์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ 

  1. ๋ช…ํ™•์„ฑ: ์ฝ”๋“œ์—์„œ ์•ˆ์ „ ์š”๊ตฌ์‚ฌํ•ญ์„ ๋ช…์‹œ์ ์œผ๋กœ ํ‘œํ˜„
  2. ์ œ์–ด: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ผ€์ด์Šค์—์„œ ์ •ํ™•ํžˆ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚ ์ง€ ์ง์ ‘ ๊ฒฐ์ •
  3. ๋””๋ฒ„๊น…: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์ด ๋ฐœ์ƒํ•  ๋•Œ ์ถ”๋ก ํ•˜๊ธฐ ์‰ฌ์›€

GPU ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „ ๊ทœ์น™

  1. ํ•ญ์ƒ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฒ€์ฆํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์ฐจ์›๊ณผ ๋น„๊ต
  2. ๋ฏธ์ •์˜ ๋™์ž‘์„ ์–ด๋–ค ๋Œ€๊ฐ€๋ฅผ ์น˜๋ฅด๋”๋ผ๋„ ํ”ผํ•˜๊ธฐ - ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์€ ๋ฏธ์ •์˜ ๋™์ž‘์ด๋ฉฐ ๋ชจ๋“  ๊ฒƒ์„ ๋ง๊ฐ€๋œจ๋ฆด ์ˆ˜ ์žˆ์Œ
  3. ๊ฐœ๋ฐœ๊ณผ ํ…Œ์ŠคํŠธ ์ค‘ compute-sanitizer ์‚ฌ์šฉ
  4. ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ ์—†์ด โ€œ๋™์ž‘ํ•œ๋‹คโ€œ๊ณ  ์ ˆ๋Œ€ ๊ฐ€์ •ํ•˜์ง€ ์•Š๊ธฐ
  5. ๋‹ค์–‘ํ•œ ๊ทธ๋ฆฌ๋“œ/๋ธ”๋ก ๊ตฌ์„ฑ์œผ๋กœ ํ…Œ์ŠคํŠธํ•˜์—ฌ ์ผ๊ด€์„ฑ ์—†์ด ๋‚˜ํƒ€๋‚˜๋Š” ๋ฏธ์ •์˜ ๋™์ž‘ ํฌ์ฐฉ

compute-sanitizer ๋ชจ๋ฒ” ์‚ฌ๋ก€

MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo your_code.mojo

์ฐธ๊ณ : ์ƒˆ๋‹ˆํƒ€์ด์ € ์ถœ๋ ฅ์—์„œ Mojo ๋Ÿฐํƒ€์ž„ ๊ฒฝ๊ณ ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์„ ํ™•์ธํ•˜๋ ค๋ฉด ========= Invalid์™€ ========= ERROR SUMMARY ๋ผ์ธ์— ์ง‘์ค‘ํ•˜์„ธ์š”.

๐Ÿ ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…

๊ฐœ์š”

NVIDIA compute-sanitizer๋ฅผ ์‚ฌ์šฉํ•ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ์ผ์œผํ‚ค๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์‹๋ณ„ํ•˜๋ฉด์„œ ์‹คํŒจํ•˜๋Š” GPU ํ”„๋กœ๊ทธ๋žจ์„ ๋””๋ฒ„๊น…ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์—์„œ ๋™์‹œ์„ฑ ๋ฒ„๊ทธ๋ฅผ ์ฐพ๋Š” racecheck ๋„๊ตฌ ์‚ฌ์šฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ์˜ ๊ฐ’์„ ๋ˆ„์ ํ•ด์•ผ ํ•˜๋Š” GPU ์ปค๋„์ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ…Œ์ŠคํŠธ๋Š” ์‹คํŒจํ•˜๋Š”๋ฐ, ๋กœ์ง์€ ์˜ฌ๋ฐ”๋ฅธ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋‹น์‹ ์˜ ๊ณผ์ œ๋Š” ์‹คํŒจ๋ฅผ ์ผ์œผํ‚ค๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ฐพ์•„ ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)  # 9๊ฐœ ์Šค๋ ˆ๋“œ ์ค‘ 4๊ฐœ๋งŒ ํ™œ์„ฑํ™”
comptime dtype = DType.float32

์‹คํŒจํ•˜๋Š” ์ปค๋„


comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE, SIZE)


fn shared_memory_race(
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    row = thread_idx.y
    col = thread_idx.x

    shared_sum = LayoutTensor[
        dtype,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    if row < size and col < size:
        shared_sum[0] += a[row, col]

    barrier()

    if row < size and col < size:
        output[row, col] = shared_sum[0]


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p10/p10.mojo

์ฝ”๋“œ ์‹คํ–‰

pixi run p10 --race-condition

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

out shape: 2 x 2
Running race condition example...
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
stack trace was not collected. Enable stack trace collection with environment variable `MOJO_ENABLE_STACK_TRACE_ON_ERROR`
Unhandled exception caught during execution: At /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p10/p10.mojo:122:33: AssertionError: `left == right` comparison failed:
   left: 0.0
  right: 6.0

compute-sanitizer๊ฐ€ GPU ์ฝ”๋“œ์˜ ๋ฌธ์ œ๋ฅผ ์–ด๋–ป๊ฒŒ ์ฐพ์•„๋‚ด๋Š”์ง€ ์‚ดํŽด๋ด…์‹œ๋‹ค.

compute-sanitizer๋กœ ๋””๋ฒ„๊น…ํ•˜๊ธฐ

1๋‹จ๊ณ„: racecheck๋กœ ๊ฒฝ์Ÿ ์ƒํƒœ ์‹๋ณ„

compute-sanitizer์™€ racecheck ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค:

pixi run compute-sanitizer --tool racecheck mojo problems/p10/p10.mojo --race-condition

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running race condition example...
========= Error: Race reported between Write access at p10_shared_memory_race_...+0x140
=========     and Read access at p10_shared_memory_race_...+0xe0 [4 hazards]
=========     and Write access at p10_shared_memory_race_...+0x140 [5 hazards]
=========
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
AssertionError: `left == right` comparison failed:
  left: 0.0
  right: 6.0
========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)

๋ถ„์„: ํ”„๋กœ๊ทธ๋žจ์— 1๊ฐœ์˜ ๊ฒฝ์Ÿ ์ƒํƒœ์™€ 9๊ฐœ์˜ ๊ฐœ๋ณ„ ์œ„ํ—˜ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • 4๊ฐœ์˜ read-after-write ์œ„ํ—˜ ์š”์†Œ (๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์“ฐ๋Š” ๋™์•ˆ ์ฝ๊ธฐ)
  • 5๊ฐœ์˜ write-after-write ์œ„ํ—˜ ์š”์†Œ (์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์“ฐ๊ธฐ)

2๋‹จ๊ณ„: synccheck์™€ ๋น„๊ต

๋™๊ธฐํ™” ๋ฌธ์ œ๊ฐ€ ์•„๋‹Œ ๊ฒฝ์Ÿ ์ƒํƒœ์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

pixi run compute-sanitizer --tool synccheck mojo problems/p10/p10.mojo --race-condition

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running race condition example...
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
AssertionError: `left == right` comparison failed:
  left: 0.0
  right: 6.0
========= ERROR SUMMARY: 0 errors

ํ•ต์‹ฌ ํ†ต์ฐฐ: synccheck๊ฐ€ 0๊ฐœ์˜ ์˜ค๋ฅ˜๋ฅผ ์ฐพ์•˜์Šต๋‹ˆ๋‹ค - ๊ต์ฐฉ ์ƒํƒœ ๊ฐ™์€ ๋™๊ธฐํ™” ๋ฌธ์ œ๋Š” ์—†์Šต๋‹ˆ๋‹ค. ๋ฌธ์ œ๋Š” ๋™๊ธฐํ™” ๋ฒ„๊ทธ๊ฐ€ ์•„๋‹Œ ๊ฒฝ์Ÿ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

๊ต์ฐฉ ์ƒํƒœ vs ๊ฒฝ์Ÿ ์ƒํƒœ: ์ฐจ์ด์  ์ดํ•ดํ•˜๊ธฐ

์ธก๋ฉด๊ต์ฐฉ ์ƒํƒœ๊ฒฝ์Ÿ ์ƒํƒœ
์ฆ์ƒํ”„๋กœ๊ทธ๋žจ์ด ์˜์›ํžˆ ๋ฉˆ์ถคํ”„๋กœ๊ทธ๋žจ์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ƒ์„ฑ
์‹คํ–‰์™„๋ฃŒ๋˜์ง€ ์•Š์Œ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒ๋จ
ํƒ€์ด๋ฐ๊ฒฐ์ •์ ์œผ๋กœ ๋ฉˆ์ถค๋น„๊ฒฐ์ •์  ๊ฒฐ๊ณผ
๊ทผ๋ณธ ์›์ธ๋™๊ธฐํ™” ๋กœ์ง ์˜ค๋ฅ˜๋™๊ธฐํ™”๋˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
ํƒ์ง€ ๋„๊ตฌsynccheckracecheck
์˜ˆ์‹œPuzzle 09: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€ ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ += ์—ฐ์‚ฐ

์šฐ๋ฆฌ ์‚ฌ๋ก€์—์„œ:

  • ํ”„๋กœ๊ทธ๋žจ ์™„๋ฃŒ๋จ โ†’ ๊ต์ฐฉ ์ƒํƒœ ์—†์Œ (์Šค๋ ˆ๋“œ๊ฐ€ ๋ฉˆ์ถ”์ง€ ์•Š์Œ)
  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ โ†’ ๊ฒฝ์Ÿ ์ƒํƒœ (์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ)
  • ๋„๊ตฌ ํ™•์ธ โ†’ synccheck๋Š” 0๊ฐœ ์˜ค๋ฅ˜, racecheck๋Š” 9๊ฐœ ์œ„ํ—˜ ์š”์†Œ ๋ณด๊ณ 

๋””๋ฒ„๊น…์—์„œ ์ด ๊ตฌ๋ถ„์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ๊ต์ฐฉ ์ƒํƒœ ๋””๋ฒ„๊น…: ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜, ์กฐ๊ฑด๋ถ€ ๋™๊ธฐํ™”, ์Šค๋ ˆ๋“œ ์กฐ์œจ์— ์ง‘์ค‘
  • ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์›์ž์  ์—ฐ์‚ฐ (์—ญ์ฃผ: ์ค‘๊ฐ„ ์ƒํƒœ ์—†์ด ์™„์ „ํžˆ ์‹คํ–‰๋˜๊ฑฐ๋‚˜ ์ „ํ˜€ ์‹คํ–‰๋˜์ง€ ์•Š๋Š” ์—ฐ์‚ฐ), ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์— ์ง‘์ค‘

๋„์ „ ๊ณผ์ œ

์ด ๋„๊ตฌ๋“ค์„ ํ™œ์šฉํ•˜์—ฌ ์‹คํŒจํ•˜๋Š” ์ปค๋„์„ ์ˆ˜์ •ํ•˜์„ธ์š”.

ํŒ

์œ„ํ—˜ ์š”์†Œ ๋ถ„์„

shared_sum[0] += a[row, col] ์—ฐ์‚ฐ์ด ์œ„ํ—˜ํ•œ ์ด์œ ๋Š” ์‹ค์ œ๋กœ ์„ธ ๊ฐœ์˜ ๋ณ„๋„ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค:

  1. shared_sum[0] ์ฝ๊ธฐ
  2. ์ฝ์€ ๊ฐ’์— a[row, col] ๋”ํ•˜๊ธฐ
  3. ๊ฒฐ๊ณผ๋ฅผ shared_sum[0]์— ๋‹ค์‹œ ์“ฐ๊ธฐ

4๊ฐœ์˜ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ(์œ„์น˜ (0,0), (0,1), (1,0), (1,1))์—์„œ ์ด ์—ฐ์‚ฐ๋“ค์ด ๊ฒน์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ ์ค‘์ฒฉ โ†’ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ์ดˆ๊ธฐ๊ฐ’(0.0)์„ ์ฝ์Œ
  • ์—…๋ฐ์ดํŠธ ์†์‹ค โ†’ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ 0.0 + ์ž์‹ ์˜_๊ฐ’์„ ์จ์„œ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ์˜ ์ž‘์—…์„ ๋ฎ์–ด์”€
  • ๋น„์›์ž์  ์—ฐ์‚ฐ โ†’ += ๋ณตํ•ฉ ๋Œ€์ž…์€ GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์›์ž์ ์ด์ง€ ์•Š์Œ (์—ญ์ฃผ: ์‹คํ–‰ ๋„์ค‘ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ผ์–ด๋“ค ์ˆ˜ ์žˆ์–ด ์ค‘๊ฐ„ ์ƒํƒœ๊ฐ€ ๋…ธ์ถœ๋จ)

์ •ํ™•ํžˆ 9๊ฐœ์˜ ์œ„ํ—˜ ์š”์†Œ๊ฐ€ ๋‚˜์˜ค๋Š” ์ด์œ :

  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ read-modify-write๋ฅผ ์‹œ๋„
  • 4๊ฐœ ์Šค๋ ˆ๋“œ ร— ์Šค๋ ˆ๋“œ๋‹น 2-3๊ฐœ ์œ„ํ—˜ ์š”์†Œ = ์ด 9๊ฐœ ์œ„ํ—˜ ์š”์†Œ
  • compute-sanitizer๊ฐ€ ๋ชจ๋“  ์ถฉ๋Œํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์Œ์„ ์ถ”์ 
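์œ„์—์„œ ์„ค๋ช…ํ•œ โ€˜์—…๋ฐ์ดํŠธ ์†์‹คโ€™์„ ๊ฒฐ์ •์ ์œผ๋กœ ์žฌํ˜„ํ•œ ํŒŒ์ด์ฌ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ์‹ค์ œ GPU ์‹คํ–‰์ด ์•„๋‹ˆ๋ผ, ๋„ค ์Šค๋ ˆ๋“œ๊ฐ€ ๋ชจ๋‘ โ€˜์ฝ๊ธฐโ€™๋ฅผ ๋จผ์ € ๋๋‚ด๊ณ  ๋‚˜์„œ โ€˜์“ฐ๊ธฐโ€™ํ•˜๋Š” ์ตœ์•…์˜ ๊ต์ฐจ ์ˆœ์„œ๋ฅผ ์ˆ˜๋™์œผ๋กœ ํ‰๋‚ด ๋‚ธ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

```python
# ๊ฐ€์ •: += ์˜ ์ฝ๊ธฐ-์ˆ˜์ •-์“ฐ๊ธฐ ์„ธ ๋‹จ๊ณ„๊ฐ€ ์Šค๋ ˆ๋“œ ๊ฐ„์— ๊ฒน์น˜๋Š” ์ƒํ™ฉ์˜ ์‹œ๋ฎฌ๋ ˆ์ด์…˜
shared = 0.0                       # shared_sum[0] ์—ญํ• 
values = [0.0, 1.0, 2.0, 3.0]      # 4๊ฐœ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ๋”ํ•˜๋ ค๋Š” ๊ฐ’

# 1๋‹จ๊ณ„: ๋„ค ์Šค๋ ˆ๋“œ๊ฐ€ ๋ชจ๋‘ ๊ฐ™์€ ์ดˆ๊ธฐ๊ฐ’ 0.0์„ ์ฝ์Œ (์ฝ๊ธฐ๊ฐ€ ๊ฒน์นจ)
reads = [shared for _ in values]

# 2~3๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ '์ž์‹ ์ด ์ฝ์€ ๊ฐ’ + ์ž๊ธฐ ๊ฐ’'์„ ๊ทธ๋Œ€๋กœ ์จ์„œ
#           ์•ž์„  ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ๋ฅผ ๋ฎ์–ด์”€
for old, v in zip(reads, values):
    shared = old + v

print(shared)  # 3.0 — ๊ธฐ๋Œ€๊ฐ’ 6.0(0+1+2+3)์ด ์•„๋‹˜
```

GPU์—์„œ๋Š” ์ด ๊ต์ฐจ ์ˆœ์„œ๊ฐ€ ์‹คํ–‰๋งˆ๋‹ค ๋‹ฌ๋ผ์ง€๋ฏ€๋กœ ๊ฒฐ๊ณผ๋„ ๋น„๊ฒฐ์ •์ ์ด ๋ฉ๋‹ˆ๋‹ค. racecheck๊ฐ€ ๋ณด๊ณ ํ•˜๋Š” ์œ„ํ—˜ ์š”์†Œ๊ฐ€ ๋ฐ”๋กœ ์ด๋Ÿฐ ์ถฉ๋Œ ์Œ์ž…๋‹ˆ๋‹ค.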

๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น… ํŒ

  1. ๋ฐ์ดํ„ฐ ๊ฒฝ์Ÿ์—๋Š” racecheck ์‚ฌ์šฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„ํ—˜ ์š”์†Œ์™€ ๋ฐ์ดํ„ฐ ์†์ƒ ํƒ์ง€
  2. ๊ต์ฐฉ ์ƒํƒœ์—๋Š” synccheck ์‚ฌ์šฉ: ๋™๊ธฐํ™” ๋ฒ„๊ทธ(๋ฐฐ๋ฆฌ์–ด ๋ฌธ์ œ, ๊ต์ฐฉ ์ƒํƒœ) ํƒ์ง€
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— ์ง‘์ค‘: ๊ณต์œ  ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ๋™๊ธฐํ™”๋˜์ง€ ์•Š์€ +=, = ์—ฐ์‚ฐ ์ฐพ๊ธฐ
  4. ํŒจํ„ด ์‹๋ณ„: read-modify-write ์—ฐ์‚ฐ์ด ํ”ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ ์›์ธ
  5. ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜ ํ™•์ธ: ๋ฐฐ๋ฆฌ์–ด๋Š” ์ถฉ๋Œ ์—ฐ์‚ฐ ์ด์ „์— ๋ฐฐ์น˜ํ•ด์•ผ ํ•จ, ์ดํ›„๊ฐ€ ์•„๋‹˜

๋””๋ฒ„๊น…์—์„œ ์ด ๊ตฌ๋ถ„์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ๊ต์ฐฉ ์ƒํƒœ ๋””๋ฒ„๊น…: ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜, ์กฐ๊ฑด๋ถ€ ๋™๊ธฐํ™”, ์Šค๋ ˆ๋“œ ์กฐ์œจ์— ์ง‘์ค‘
  • ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์›์ž์  ์—ฐ์‚ฐ, ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์— ์ง‘์ค‘

ํ”ผํ•ด์•ผ ํ•  ํ”ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ ํŒจํ„ด:

  • ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์“ฐ๊ธฐ
  • ๋™๊ธฐํ™”๋˜์ง€ ์•Š์€ read-modify-write ์—ฐ์‚ฐ (+=, ++ ๋“ฑ)
  • ๊ฒฝ์Ÿ ์ƒํƒœ ์ด์ „์ด ์•„๋‹Œ ์ดํ›„์— ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜

์†”๋ฃจ์…˜


comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE, SIZE)


fn shared_memory_race(
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    """Fixed: sequential access with barriers eliminates race conditions."""
    row = thread_idx.y
    col = thread_idx.x

    shared_sum = LayoutTensor[
        dtype,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Only thread 0 does all the accumulation work to prevent races
    if row == 0 and col == 0:
        # Use local accumulation first, then single write to shared memory
        local_sum = Scalar[dtype](0.0)
        for r in range(size):
            for c in range(size):
                local_sum += rebind[Scalar[dtype]](a[r, c])

        shared_sum[0] = local_sum  # Single write operation

    barrier()  # Ensure thread 0 completes before others read

    # All threads read the safely accumulated result after synchronization
    if row < size and col < size:
        output[row, col] = shared_sum[0]


๋ฌด์—‡์ด ์ž˜๋ชป๋˜์—ˆ๋Š”์ง€ ์ดํ•ดํ•˜๊ธฐ

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฌธ์ œ ํŒจํ„ด

์›๋ž˜ ์‹คํŒจํ•˜๋Š” ์ฝ”๋“œ์—๋Š” ์ด ํ•ต์‹ฌ์ ์ธ ์ค„์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค:

shared_sum[0] += a[row, col]  # ๊ฒฝ์Ÿ ์ƒํƒœ!

์ด ํ•œ ์ค„์ด 4๊ฐœ์˜ ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ ์‚ฌ์ด์—์„œ ์—ฌ๋Ÿฌ ์œ„ํ—˜ ์š”์†Œ๋ฅผ ์ผ์œผํ‚ต๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ (0,0)์ด shared_sum[0]์„ ์ฝ์Œ (๊ฐ’: 0.0)
  2. ์Šค๋ ˆ๋“œ (0,1)๋„ shared_sum[0]์„ ์ฝ์Œ (๊ฐ’: 0.0) ← Read-after-write ์œ„ํ—˜!
  3. ์Šค๋ ˆ๋“œ (0,0)์ด 0.0 + 0์„ ์”€
  4. ์Šค๋ ˆ๋“œ (1,0)์ด 0.0 + 2๋ฅผ ์”€ ← Write-after-write ์œ„ํ—˜!

ํ…Œ์ŠคํŠธ๊ฐ€ ์‹คํŒจํ•œ ์ด์œ 

  • += ์—ฐ์‚ฐ ์ค‘ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ์˜ ์“ฐ๊ธฐ๋ฅผ ์†์ƒ์‹œํ‚ด
  • += ์—ฐ์‚ฐ์ด ์ค‘๋‹จ๋˜์–ด ์—…๋ฐ์ดํŠธ ์†์‹ค ๋ฐœ์ƒ
  • ์˜ˆ์ƒ ํ•ฉ๊ณ„ 6.0 (0+1+2+3)์ด์ง€๋งŒ, ๊ฒฝ์Ÿ ์ƒํƒœ๋กœ ์ธํ•ด 0.0์ด ๋จ
  • barrier()๊ฐ€ ๋„ˆ๋ฌด ๋Šฆ๊ฒŒ ์˜ด - ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์ด๋ฏธ ๋ฐœ์ƒํ•œ ํ›„

๊ฒฝ์Ÿ ์ƒํƒœ๋ž€?

๊ฒฝ์Ÿ ์ƒํƒœ๋Š” ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฐ์ดํ„ฐ์— ๋™์‹œ์— ์ ‘๊ทผํ•˜๊ณ , ๊ฒฐ๊ณผ๊ฐ€ ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ์Šค๋ ˆ๋“œ ์‹คํ–‰ ํƒ€์ด๋ฐ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์งˆ ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์š” ํŠน์„ฑ:

  • ๋น„๊ฒฐ์ •์  ๋™์ž‘: ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ๋‹ค๋ฅธ ์‹คํ–‰์—์„œ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์Œ
  • ํƒ€์ด๋ฐ ์˜์กด์ : ๊ฒฐ๊ณผ๊ฐ€ ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ โ€œ๊ฒฝ์Ÿ์—์„œ ์ด๊ธฐ๋Š”์ง€โ€œ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง
  • ์žฌํ˜„ํ•˜๊ธฐ ์–ด๋ ค์›€: ํŠน์ • ์กฐ๊ฑด์ด๋‚˜ ํ•˜๋“œ์›จ์–ด์—์„œ๋งŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ

GPU ํŠน์œ ์˜ ์œ„ํ—˜์„ฑ

๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์˜ ์˜ํ–ฅ:

  • ์›Œํ”„ ์ˆ˜์ค€ ์†์ƒ: ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์ „์ฒด ์›Œํ”„(32๊ฐœ ์Šค๋ ˆ๋“œ)์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ๋ฌธ์ œ: ๊ฒฝ์Ÿ์œผ๋กœ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๊นจ์งˆ ์ˆ˜ ์žˆ์Œ
  • ์ปค๋„ ์ „์ฒด ์‹คํŒจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ์ด ์ „์ฒด GPU ์ปค๋„์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Œ

ํ•˜๋“œ์›จ์–ด ์ฐจ์ด:

  • ๋‹ค๋ฅธ GPU ์•„ํ‚คํ…์ฒ˜: ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ GPU ๋ชจ๋ธ๋งˆ๋‹ค ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต: L1 ์บ์‹œ, L2 ์บ์‹œ, ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๊ฐ๊ฐ ๋‹ค๋ฅธ ๊ฒฝ์Ÿ ๋™์ž‘์„ ๋ณด์ผ ์ˆ˜ ์žˆ์Œ
  • ์›Œํ”„ ์Šค์ผ€์ค„๋ง: ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ์Šค์ผ€์ค„๋ง์ด ๋‹ค๋ฅธ ๊ฒฝ์Ÿ ์ƒํƒœ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ๋…ธ์ถœ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ

์ „๋žต: ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด

ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•œ ๋™์‹œ ์“ฐ๊ธฐ๋ฅผ ์—†์• ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. Single writer: ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ(์œ„์น˜ (0,0))๋งŒ ๋ชจ๋“  ๋ˆ„์  ์ž‘์—… ์ˆ˜ํ–‰
  2. ๋กœ์ปฌ ๋ˆ„์ : ์œ„์น˜ (0,0) ์Šค๋ ˆ๋“œ๊ฐ€ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๋ฐ˜๋ณต์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ํ”ผํ•จ
  3. ๋‹จ์ผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ: ๋‹จ์ผ ์“ฐ๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ write-write ๊ฒฝ์Ÿ ์ œ๊ฑฐ
  4. ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”: writer๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ๋„๋ก ๋ณด์žฅ
  5. ๋‹ค์ค‘ ์ฝ๊ธฐ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์•ˆ์ „ํ•˜๊ฒŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ฝ์Œ

๋‹จ๊ณ„๋ณ„ ์†”๋ฃจ์…˜ ๋ถ„์„

1๋‹จ๊ณ„: ์Šค๋ ˆ๋“œ ์‹๋ณ„

if row == 0 and col == 0:

์ง์ ‘ ์ขŒํ‘œ ๊ฒ€์‚ฌ๋กœ ์œ„์น˜ (0,0)์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

2๋‹จ๊ณ„: ๋‹จ์ผ ์Šค๋ ˆ๋“œ ๋ˆ„์ 

if row == 0 and col == 0:
    local_sum = Scalar[dtype](0.0)
    for r in range(size):
        for c in range(size):
            local_sum += rebind[Scalar[dtype]](a[r, c])
    shared_sum[0] = local_sum  # ๋‹จ์ผ ์“ฐ๊ธฐ ์—ฐ์‚ฐ

์œ„์น˜ (0,0)์˜ ์Šค๋ ˆ๋“œ๋งŒ ๋ชจ๋“  ๋ˆ„์  ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ๋™๊ธฐํ™” ๋ฐฐ๋ฆฌ์–ด

barrier()  # ์Šค๋ ˆ๋“œ (0,0)์ด ์™„๋ฃŒํ•œ ํ›„ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ๋„๋ก ๋ณด์žฅ

๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ„์น˜ (0,0)์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์ ์„ ๋งˆ์น  ๋•Œ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆฝ๋‹ˆ๋‹ค.

4๋‹จ๊ณ„: ์•ˆ์ „ํ•œ ๋ณ‘๋ ฌ ์ฝ๊ธฐ

if row < size and col < size:
    output[row, col] = shared_sum[0]

๋™๊ธฐํ™” ํ›„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์•ˆ์ „ํ•˜๊ฒŒ ๊ฒฐ๊ณผ๋ฅผ ์ฝ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํšจ์œจ์„ฑ์— ๊ด€ํ•œ ์ค‘์š” ์‚ฌํ•ญ

์ด ์†”๋ฃจ์…˜์€ ํšจ์œจ์„ฑ๋ณด๋‹ค ์ •ํ™•์„ฑ์„ ์šฐ์„ ํ•ฉ๋‹ˆ๋‹ค. ๊ฒฝ์Ÿ ์ƒํƒœ๋Š” ์ œ๊ฑฐํ•˜์ง€๋งŒ, ์œ„์น˜ (0,0) ์Šค๋ ˆ๋“œ๋งŒ ๋ˆ„์ ์— ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ GPU ์„ฑ๋Šฅ์— ์ตœ์ ์ด ์•„๋‹™๋‹ˆ๋‹ค - ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์žฅ์น˜์—์„œ ์‚ฌ์‹ค์ƒ ์ง๋ ฌ ๊ณ„์‚ฐ์„ ํ•˜๋Š” ์…ˆ์ž…๋‹ˆ๋‹ค.

์ด์–ด์„œ Puzzle 11: ํ’€๋ง์—์„œ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ํ™œ์šฉํ•ด ๊ณ ์„ฑ๋Šฅ ํ•ฉ์‚ฐ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ๋„ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๋Š” ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ์ •ํ™•์„ฑ ์šฐ์„ ์˜ ๊ธฐ์ดˆ๋ฅผ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค - ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๊ณ  ๋‚˜๋ฉด, Puzzle 11์—์„œ ์ •ํ™•์„ฑ๊ณผ ์„ฑ๋Šฅ ๋ชจ๋‘๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๊ฒ€์ฆ

pixi run compute-sanitizer --tool racecheck mojo solutions/p10/p10.mojo --race-condition

์˜ˆ์ƒ ์ถœ๋ ฅ:

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running race condition example...
out: HostBuffer([6.0, 6.0, 6.0, 6.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
โœ… Race condition test PASSED! (racecheck will find hazards)
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)

โœ… ์„ฑ๊ณต: ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผํ•˜๊ณ  ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ํƒ์ง€๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค!

Puzzle 11: ํ’€๋ง

๊ฐœ์š”

๋ฒกํ„ฐ a์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๋ฒกํ„ฐ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

Pooling ์‹œ๊ฐํ™” Pooling ์‹œ๊ฐํ™”

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์„ ์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์™€ ๋™๊ธฐํ™”๋กœ ์ง์ ‘ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“ LayoutTensor ๋ฒ„์ „

LayoutTensor์˜ ๊ธฐ๋Šฅ์„ ํ™œ์šฉํ•ด ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๊ธฐ๋ฐ˜ ์—ฐ์‚ฐ๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : LayoutTensor๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์ด ์–ผ๋งˆ๋‚˜ ๊ฐ„๊ฒฐํ•ด์ง€๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๋„ ๊ทธ๋Œ€๋กœ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค.

๊ฐœ์š”

๋ฒกํ„ฐ a์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๋ฒกํ„ฐ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • ํ’€๋ง์˜ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ
  • ์ด์›ƒ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ์„ ์œ„ํ•œ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ

ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด ์œˆ๋„์šฐ ๋‚ด ๊ฐ’๋“ค์— ํšจ์œจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์‹œํ€€์Šค ์•ž๋ถ€๋ถ„์€ ํŠน๋ณ„ํžˆ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ์œˆ๋„์šฐ ํฌ๊ธฐ: 3
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • ์œˆ๋„์šฐ ์ ‘๊ทผ: ๊ฐ ์ถœ๋ ฅ์€ ์ด์ „ ์ตœ๋Œ€ 3๊ฐœ ๊ฐ’์— ์˜์กดํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ฒ˜์Œ ๋‘ ์œ„์น˜๋Š” ํŠน๋ณ„ํ•œ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: ์œˆ๋„์šฐ ์—ฐ์‚ฐ ์ „์— ์กฐ์œจ ํ•„์š”

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32


fn pooling(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11.mojo

ํŒ
  1. ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•˜๊ณ  barrier() ํ˜ธ์ถœ
  2. ํŠน์ˆ˜ ์ผ€์ด์Šค: output[0] = shared[0], output[1] = shared[0] + shared[1]
  3. ์ผ๋ฐ˜ ์ผ€์ด์Šค: if 1 < global_i < size
  4. ์„ธ ๊ฐ’์˜ ํ•ฉ: shared[local_i - 2] + shared[local_i - 1] + shared[local_i]

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p11
pixi run -e amd p11
pixi run -e apple p11
uv run poe p11

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

์†”๋ฃจ์…˜

fn pooling(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i == 0:
        output[0] = shared[0]
    elif global_i == 1:
        output[1] = shared[0] + shared[1]
    elif UInt(1) < global_i < size:
        output[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •

    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— TPB๊ฐœ ํ• ๋‹น:

      Input array:  [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ

    • barrier()๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

  2. ๊ฒฝ๊ณ„ ์ผ€์ด์Šค

    • ์œ„์น˜ 0: ์ฒซ ๋ฒˆ์งธ ๊ฐ’ ํ•˜๋‚˜๋งŒ

      output[0] = shared[0] = 0.0
      
    • ์œ„์น˜ 1: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ

      output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0
      
  3. ๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ

    • ์œ„์น˜ 2 ์ดํ›„:

      Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0
      Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0
      Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0
      ...
      
    • ๋กœ์ปฌ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•œ ์œˆ๋„์šฐ ๊ณ„์‚ฐ:

      # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
      window_sum = shared[i-2] + shared[i-1] + shared[i]
      
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ
    • ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ
    • ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€

์ด ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ํฌ์ธํŠธ:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™”
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋น ๋ฅธ ์ด์›ƒ ์กฐํšŒ
  • ๊น”๋”ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
  • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ

์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
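์œ„ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ๋กœ์ง(๊ฒฝ๊ณ„ ์ผ€์ด์Šค ๋‘ ๊ฐœ + ์ผ๋ฐ˜ ์ผ€์ด์Šค)์„ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ๋ฃจํ”„์˜ ๊ฐ ๋ฐ˜๋ณต์ด GPU ์Šค๋ ˆ๋“œ ํ•˜๋‚˜์— ๋Œ€์‘ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ ๊ฐœ๋… ํ™•์ธ์šฉ ์ฝ”๋“œ์ด๋ฉฐ, ๋ณ‘๋ ฌ ์‹คํ–‰์ด๋‚˜ barrier()๋Š” ํ‰๋‚ด ๋‚ด์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

```python
# ๊ฐ€์ •: shared๋Š” ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋กœ๋“œ๋ฅผ ๋งˆ์น˜๊ณ  barrier()๋ฅผ ์ง€๋‚œ ๋’ค์˜ ์ƒํƒœ
SIZE = 8
a = [float(i) for i in range(SIZE)]
shared = a[:]                   # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•œ ๊ฐ’์”ฉ ๋กœ๋“œํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
output = [0.0] * SIZE

for i in range(SIZE):           # ๊ฐ ๋ฐ˜๋ณต = ์Šค๋ ˆ๋“œ i
    if i == 0:                  # ๊ฒฝ๊ณ„ ์ผ€์ด์Šค: ์ฒซ ๊ฐ’ ํ•˜๋‚˜๋งŒ
        output[0] = shared[0]
    elif i == 1:                # ๊ฒฝ๊ณ„ ์ผ€์ด์Šค: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ
        output[1] = shared[0] + shared[1]
    else:                       # ์ผ๋ฐ˜ ์ผ€์ด์Šค: 3๊ฐœ์งœ๋ฆฌ ์œˆ๋„์šฐ
        output[i] = shared[i - 2] + shared[i - 1] + shared[i]

print(output)  # [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```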

๊ฐœ์š”

1D LayoutTensor a์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • LayoutTensor๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • Puzzle 8์—์„œ ๋‹ค๋ฃฌ LayoutTensor ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌํ•˜๊ธฐ
  • ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ

ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๊ธฐ๋ฐ˜ ์—ฐ์‚ฐ์€ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ์œˆ๋„์šฐ ํฌ๊ธฐ: 3
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • LayoutTensor ํ• ๋‹น: LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() ์‚ฌ์šฉ
  • ์œˆ๋„์šฐ ์ ‘๊ทผ: 3๊ฐœ์งœ๋ฆฌ ์œˆ๋„์šฐ์— ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ฒ˜์Œ ๋‘ ์œ„์น˜๋Š” ํŠน์ˆ˜ ์ผ€์ด์Šค
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn pooling[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using tensor builder
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11_layout_tensor.mojo

ํŒ
  1. LayoutTensor์™€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ
  2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: shared[local_i] = a[global_i]
  3. ์ฒ˜์Œ ๋‘ ์œ„์น˜๋ฅผ ํŠน์ˆ˜ ์ผ€์ด์Šค๋กœ ์ฒ˜๋ฆฌ
  4. ์œˆ๋„์šฐ ์—ฐ์‚ฐ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  5. ๊ฒฝ๊ณ„ ์ดˆ๊ณผ ์ ‘๊ทผ์— ๊ฐ€๋“œ ์ถ”๊ฐ€

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p11_layout_tensor
pixi run -e amd p11_layout_tensor
pixi run -e apple p11_layout_tensor
uv run poe p11_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

์†”๋ฃจ์…˜

fn pooling[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    # Allocate shared memory using tensor builder
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        output[0] = shared[0]
    elif global_i == 1:
        output[1] = shared[0] + shared[1]
    # Handle general case
    elif UInt(1) < global_i < size:
        output[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •

    • LayoutTensor๊ฐ€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ๋ฅผ ์ƒ์„ฑ:

      shared = LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
      
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ:

      Input array:  [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      
    • barrier()๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

  2. ๊ฒฝ๊ณ„ ์ผ€์ด์Šค

    • ์œ„์น˜ 0: ์ฒซ ๋ฒˆ์งธ ๊ฐ’ ํ•˜๋‚˜๋งŒ

      output[0] = shared[0] = 0.0
      
    • ์œ„์น˜ 1: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ

      output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0
      
  3. ๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ

    • ์œ„์น˜ 2 ์ดํ›„:

      Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0
      Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0
      Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0
      ...
      
    • LayoutTensor์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ:

      # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
      window_sum = shared[i-2] + shared[i-1] + shared[i]
      
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๊ณต์œ  ํ…์„œ๋กœ ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ
    • LayoutTensor์˜ ์žฅ์ :
      • ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
      • ์ž์—ฐ์Šค๋Ÿฌ์šด ์œˆ๋„์šฐ ์ธ๋ฑ์‹ฑ
      • ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
      • ์ „ ๊ณผ์ •์— ๊ฑธ์นœ ํƒ€์ž… ์•ˆ์ „์„ฑ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ๊ณผ LayoutTensor์˜ ์•ˆ์ „์„ฑ ๋ฐ ํŽธ์˜์„ฑ์„ ๊ฒฐํ•ฉํ•œ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™”
  • ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ฐ„์†Œํ™”
  • ๊น”๋”ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
  • ๋ณ‘ํ•ฉ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€

์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„์ž…๋‹ˆ๋‹ค:

[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]

Puzzle 12: ๋‚ด์ 

๊ฐœ์š”

๋ฒกํ„ฐ a์™€ ๋ฒกํ„ฐ b์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ output(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ๋‚ด์ ์€ ํฌ๊ธฐ๊ฐ€ ๊ฐ™์€ ๋‘ ๋ฒกํ„ฐ์—์„œ ๋Œ€์‘ํ•˜๋Š” ์›์†Œ๋ผ๋ฆฌ ๊ณฑํ•œ ๋’ค, ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋ชจ๋‘ ๋”ํ•ด ํ•˜๋‚˜์˜ ์ˆซ์ž(์Šค์นผ๋ผ)๋ฅผ ๊ตฌํ•˜๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, ๋‘ ๋ฒกํ„ฐ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์„ ๋•Œ:

\[a = [a_{1}, a_{2}, โ€ฆ, a_{n}] \] \[b = [b_{1}, b_{2}, โ€ฆ, b_{n}] \]

๋‚ด์ ์€ ์ด๋ ‡๊ฒŒ ๊ตฌํ•ฉ๋‹ˆ๋‹ค: \[a \cdot b = a_{1}b_{1} + a_{2}b_{2} + โ€ฆ + a_{n}b_{n}\]

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๋‚ด์  ์‹œ๊ฐํ™” ๋‚ด์  ์‹œ๊ฐํ™”

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์™€ ๋™๊ธฐํ™”๋กœ ๋ฆฌ๋•์…˜์„ ๋ฐ‘๋ฐ”๋‹ฅ๋ถ€ํ„ฐ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“ LayoutTensor ๋ฒ„์ „

LayoutTensor๋ฅผ ํ™œ์šฉํ•ด ๋ฆฌ๋•์…˜๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๋” ๊ฐ„๊ฒฐํ•˜๊ฒŒ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : LayoutTensor๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์–ผ๋งˆ๋‚˜ ๊น”๋”ํ•ด์ง€๋Š”์ง€ ํ™•์ธํ•ด ๋ณด์„ธ์š”. ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ์ž…๋‹ˆ๋‹ค.

๊ฐœ์š”

๋ฒกํ„ฐ a์™€ ๋ฒกํ„ฐ b์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ output(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ๊ฐ’์„ ํ•˜๋‚˜๋กœ ํ•ฉ์น˜๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜(parallel reduction) ๊ตฌํ˜„ํ•˜๊ธฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅํ•˜๊ธฐ
  • ์Šค๋ ˆ๋“œ๋ผ๋ฆฌ ํ˜‘๋ ฅํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ ๋งŒ๋“ค๊ธฐ

ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•ด, ํฉ์–ด์ ธ ์žˆ๋Š” ๊ฐ’๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋กœ ๋ชจ์•„๊ฐ€๋Š” ๊ณผ์ •์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ์ถœ๋ ฅ ํฌ๊ธฐ: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • ์š”์†Œ ์ ‘๊ทผ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ a์™€ b์—์„œ ๋Œ€์‘ํ•˜๋Š” ์š”์†Œ๋ฅผ ์ฝ์Œ
  • ๋ถ€๋ถ„ ๊ฒฐ๊ณผ: ์ค‘๊ฐ„ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ €์žฅ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ: ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์น˜๊ธฐ ์ „์— ๋™๊ธฐํ™”
  • ์ตœ์ข… ๋ฆฌ๋•์…˜: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์Šค์นผ๋ผ ์ถœ๋ ฅ์œผ๋กœ ๋ณ€ํ™˜

์ฐธ๊ณ : ์ด ๋ฌธ์ œ์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ํšŸ์ˆ˜๋ฅผ ์‹ ๊ฒฝ ์“ธ ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ ๋ฌธ์ œ๋Š” ๋‚˜์ค‘์— ๋‹ค๋ฃจ๊ฒ ์Šต๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32


fn dot_product(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    # FILL ME IN (roughly 13 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12.mojo

ํŒ
  1. shared[local_i]์— a[global_i] * b[global_i]๋ฅผ ์ €์žฅ
  2. barrier()๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋™๊ธฐํ™”
  3. ์Šค๋ ˆ๋“œ 0์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ชจ๋“  ๊ณฑ์„ ํ•ฉ์‚ฐ
  4. ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ output[0]์— ๊ธฐ๋ก
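์œ„ ํŒ์˜ ๋‹จ์ˆœํ•œ ์ ‘๊ทผ(์Šค๋ ˆ๋“œ 0์ด ๋ชจ๋“  ๊ณฑ์„ ํ•ฉ์‚ฐ)์„ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ๊ฐœ๋… ํ™•์ธ์šฉ ๊ฐ€์ • ์ฝ”๋“œ์ด๋ฉฐ, ๋’ค์— ๋‚˜์˜ค๋Š” ์†”๋ฃจ์…˜์€ ์ด๋ณด๋‹ค ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

```python
# ๊ฐ€์ •: a = b = [0, 1, ..., 7]์ธ ํ…Œ์ŠคํŠธ ์ž…๋ ฅ (ํผ์ฆ์˜ ๊ธฐ๋Œ€๊ฐ’ 140.0๊ณผ ์ผ์น˜)
TPB = 8
a = [float(i) for i in range(TPB)]
b = [float(i) for i in range(TPB)]
shared = [0.0] * TPB
output = [0.0]

# 1๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ shared[local_i] = a[global_i] * b[global_i]
for i in range(TPB):
    shared[i] = a[i] * b[i]

# (์ด ์ง€์ ์—์„œ barrier()๋กœ ๋ชจ๋“  ๊ณฑ์…ˆ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ)

# 2๋‹จ๊ณ„: ์Šค๋ ˆ๋“œ 0๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ณฑ์„ ๋ชจ๋‘ ํ•ฉ์‚ฐํ•ด ๊ธฐ๋ก
output[0] = sum(shared)

print(output[0])  # 140.0
```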

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p12
pixi run -e amd p12
pixi run -e apple p12
uv run poe p12

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

์†”๋ฃจ์…˜

fn dot_product(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: UInt,
):
    shared = stack_allocation[
        TPB,
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    barrier()

    # The following causes race condition: all threads writing to the same location
    # out[0] += shared[local_i]

    # Instead can do parallel reduction in shared memory as opposed to
    # global memory which has no guarantee on synchronization.
    # Loops using global memory can cause thread divergence because
    # fundamentally GPUs execute threads in warps (groups of 32 threads typically)
    # and warps can be scheduled independently.
    # However, shared memory does not have such issues as long as we use `barrier()`
    # correctly when we're in the same thread block.
    stride = UInt(TPB // 2)
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # only thread 0 writes the final result
    if local_i == 0:
        output[0] = shared[0]


๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณฑ์…ˆ ํ•˜๋‚˜๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

Thread i: shared[i] = a[i] * b[i]

2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๋ฅผ ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค:

์ดˆ๊ธฐ๊ฐ’:    [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]

๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง•

  1. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ •ํ™•ํžˆ ๋‘ ๊ฐ’์„ ๋กœ๋“œ (a[i], b[i])
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
    • ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— 1ํšŒ ๊ธฐ๋ก
  2. ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”:

    • ์ดˆ๊ธฐ ๊ณฑ์…ˆ ํ›„ barrier()
    • ๊ฐ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ํ›„ barrier()
    • ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ๊ฐ„ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€
  3. ๋ฆฌ๋•์…˜ ๋กœ์ง:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
    • ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค stride๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ
    • ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๋งŒ ๋ง์…ˆ ์ˆ˜ํ–‰
    • ์ž‘์—… ํšจ์œจ์„ฑ ์œ ์ง€
  4. ์„ฑ๋Šฅ ๊ณ ๋ ค ์‚ฌํ•ญ:

    • \(n\)๊ฐœ ์š”์†Œ์— ๋Œ€ํ•ด \(\log_2(n)\) ๋‹จ๊ณ„
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
    • ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ

์ด ๊ตฌํ˜„์€ ์ˆœ์ฐจ ์‹คํ–‰์˜ \(O(n)\)์— ๋น„ํ•ด \(O(\log n)\) ์‹œ๊ฐ„ ๋ณต์žก๋„๋ฅผ ๋‹ฌ์„ฑํ•˜๋ฉฐ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์œ„๋ ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ

๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ barrier()๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

barrier()๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค:

์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16
Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25
Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36
Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49

barrier ์—†์ด:
- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16
- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ
  16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค!

barrier()๊ฐ€ ์žˆ์œผ๋ฉด:

Step 1 (stride = 4):
๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก:
[16 26 40 58 16 25 36 49]
barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ

Step 2 (stride = 2):
์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

barrier()๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ
  2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ
  3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€

์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด:

  • ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ 
  • ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ
  • ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ 
  • ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

๊ฐœ์š”

1D LayoutTensor a์™€ 1D LayoutTensor b์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor output(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • Puzzle 8, Puzzle 11์—์„œ ์ด์–ด์ง€๋Š” LayoutTensor ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • address_space๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ
  • ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ํ˜‘๋ ฅํ•ด ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๊ฐ€๋Š” ๊ณผ์ •
  • ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํ…์„œ ์—ฐ์‚ฐ

ํ•ต์‹ฌ์€ LayoutTensor๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋ฉด์„œ๋„, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์‚ด๋ฆฌ๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ์ถœ๋ ฅ ํฌ๊ธฐ: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • LayoutTensor ํ• ๋‹น: LayoutTensor[dtype, Layout.row_major(TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation() ์‚ฌ์šฉ
  • ์š”์†Œ ์ ‘๊ทผ: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ
  • ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ: ์ž…๋ ฅ์šฉ๊ณผ ์ถœ๋ ฅ์šฉ ๋ ˆ์ด์•„์›ƒ์„ ๋”ฐ๋กœ ๊ตฌ์„ฑ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ: ๋™์ผํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์œผ๋กœ barrier() ์‚ฌ์šฉ

์™„์„ฑํ•  ์ฝ”๋“œ

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(1)


fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    # FILL ME IN (roughly 13 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12_layout_tensor.mojo

ํŒ
  1. LayoutTensor์™€ address_space๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ
  2. shared[local_i]์— a[global_i] * b[global_i]๋ฅผ ์ €์žฅ
  3. barrier()์™€ ํ•จ๊ป˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด ์ ์šฉ
  4. ์Šค๋ ˆ๋“œ 0์ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ output[0]์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p12_layout_tensor
pixi run -e amd p12_layout_tensor
pixi run -e apple p12_layout_tensor
uv run poe p12_layout_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

์†”๋ฃจ์…˜

fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Compute element-wise multiplication into shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Synchronize threads within block
    barrier()

    # Parallel reduction in shared memory
    stride = UInt(TPB // 2)
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        output[0] = shared[0]


LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ง๊ด€์ ์ธ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๊ณฑ์…ˆ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

shared[local_i] = a[global_i] * b[global_i]

2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜

๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์ž…๋‹ˆ๋‹ค:

์ดˆ๊ธฐ๊ฐ’:    [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]
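The reduction trace above can be replayed sequentially in plain Python (not Mojo). Completing all reads of a step before any write mimics the step separation that barrier() enforces on the GPU:

```python
# Sequential sketch of the tree reduction. Reading every pair before
# writing mimics the step separation that barrier() guarantees.
shared = [i * i for i in range(8)]  # [0, 1, 4, 9, 16, 25, 36, 49]

stride = len(shared) // 2
while stride > 0:
    # Read phase: gather all sums for this step first
    updates = [(i, shared[i] + shared[i + stride]) for i in range(stride)]
    # Write phase: apply them only after every read is done
    for i, value in updates:
        shared[i] = value
    stride //= 2

print(shared[0])  # 140
```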

๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง•

  1. ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ:

    • address_space ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜๋‚˜๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ํ• ๋‹น
    • ํƒ€์ž… ์•ˆ์ „ํ•œ ์—ฐ์‚ฐ์ด ๋ณด์žฅ๋˜๊ณ 
    • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋ฉฐ
    • ์ธ๋ฑ์‹ฑ๋„ ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹
  2. ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”:

    • ์ดˆ๊ธฐ ๊ณฑ์…ˆ์ด ๋๋‚˜๋ฉด barrier()
    • ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด๋งˆ๋‹ค barrier()
    • ์Šค๋ ˆ๋“œ ๊ฐ„ ์•ˆ์ „ํ•œ ์กฐ์œจ ๋ณด์žฅ
  3. ๋ฆฌ๋•์…˜ ๋กœ์ง:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
  4. ์„ฑ๋Šฅ์ƒ ์ด์ :

    • \(O(\log n)\) ์‹œ๊ฐ„ ๋ณต์žก๋„
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
    • ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ

LayoutTensor ๋ฒ„์ „์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ, ์—ฌ๊ธฐ์— ๋”ํ•ด:

  • ํƒ€์ž… ์•ˆ์ „์„ฑ์ด ํ•œ์ธต ๊ฐ•ํ™”๋˜๊ณ 
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๊ฐ€ ๋” ๊น”๋”ํ•ด์ง€๋ฉฐ
  • ๋ ˆ์ด์•„์›ƒ์„ ์ž๋™์œผ๋กœ ์ธ์‹ํ•˜๊ณ 
  • ์ธ๋ฑ์‹ฑ ๋ฌธ๋ฒ•๋„ ์ž์—ฐ์Šค๋Ÿฌ์›Œ์ง‘๋‹ˆ๋‹ค

๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ

๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ barrier()๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

barrier()๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค:

์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16
Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25
Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36
Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49

barrier ์—†์ด:
- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16
- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ
  16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค!

barrier()๊ฐ€ ์žˆ์œผ๋ฉด:

Step 1 (stride = 4):
๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก:
[16 26 40 58 16 25 36 49]
barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ

Step 2 (stride = 2):
์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

barrier()๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ
  2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ
  3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€

์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด:

  • ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ 
  • ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ
  • ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ 
  • ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

Puzzle 13: 1D ํ•ฉ์„ฑ๊ณฑ

LayoutTensor๋กœ ์ „ํ™˜ํ•˜๊ธฐ

์ง€๊ธˆ๊นŒ์ง€ GPU ํผ์ฆ ์—ฌ์ •์—์„œ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์— ๋Œ€ํ•œ ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํ•จ๊ป˜ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค:

  1. UnsafePointer๋ฅผ ์‚ฌ์šฉํ•œ ํฌ์ธํ„ฐ ์ง์ ‘ ์กฐ์ž‘ ๋ฐฉ์‹์˜ raw ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ
  2. ๊ฐ•๋ ฅํ•œ address_space ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๋Š”, ๋ณด๋‹ค ๊ตฌ์กฐํ™”๋œ LayoutTensor

์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” LayoutTensor๋กœ ์™„์ „ํžˆ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ถ”์ƒํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ํƒ€์ž… ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ์˜ ๋ช…ํ™•ํ•œ ํ‘œํ˜„
  • ์ฝ”๋“œ ์œ ์ง€๋ณด์ˆ˜์„ฑ ํ–ฅ์ƒ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ๋ฒ„๊ทธ ๋ฐœ์ƒ ๊ฐ€๋Šฅ์„ฑ ๊ฐ์†Œ
  • ๋‚ด๋ถ€ ์—ฐ์‚ฐ์˜ ์˜๋„๋ฅผ ๋” ์ž˜ ๋“œ๋Ÿฌ๋‚ด๋Š” ํ‘œํ˜„๋ ฅ ์žˆ๋Š” ์ฝ”๋“œ
  • ์•ž์œผ๋กœ ์ฐจ์ฐจ ์•Œ์•„๊ฐˆ ๋” ๋งŽ์€ ๊ฒƒ๋“ค!

์ด๋Ÿฌํ•œ ์ „ํ™˜์€ Mojo ๐Ÿ”ฅ์˜ ํ˜„๋Œ€์  GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ฒ” ์‚ฌ๋ก€์™€ ๋งž๋‹ฟ์•„ ์žˆ์Šต๋‹ˆ๋‹ค. ๋†’์€ ์ˆ˜์ค€์˜ ์ถ”์ƒํ™”๋กœ ๋ณต์žก์„ฑ์„ ๊ด€๋ฆฌํ•˜๋ฉด์„œ๋„ ์„ฑ๋Šฅ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ์ด๋ฏธ์ง€ ๋ถ„์„์—์„œ ํ•ฉ์„ฑ๊ณฑ(convolution)์€ ๋‘ ์‹œํ€€์Šค๋ฅผ ๊ฒฐํ•ฉํ•ด ์ƒˆ๋กœ์šด ์‹œํ€€์Šค๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ํ•ต์‹ฌ ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ๋Š” ์ž…๋ ฅ ๋ฐฐ์—ด ์œ„๋กœ ์ปค๋„์„ ์Šฌ๋ผ์ด๋”ฉํ•˜๋ฉด์„œ ๊ฐ ์ถœ๋ ฅ ์›์†Œ๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ GPU์—์„œ ๊ตฌํ˜„ํ•ด ๋ด…๋‹ˆ๋‹ค.

LayoutTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ a์™€ ๋ฒกํ„ฐ b์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

1D ํ•ฉ์„ฑ๊ณฑ ์‹œ๊ฐํ™” 1D ํ•ฉ์„ฑ๊ณฑ ์‹œ๊ฐํ™”

ํ•ฉ์„ฑ๊ณฑ์ด ์ฒ˜์Œ์ด๋ผ๋ฉด, ๊ฐ€์ค‘์น˜๊ฐ€ ์ ์šฉ๋œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ๊ฐ ์œ„์น˜์—์„œ ์ปค๋„ ๊ฐ’๊ณผ ๋Œ€์‘ํ•˜๋Š” ์ž…๋ ฅ ๊ฐ’์„ ๊ณฑํ•œ ๋’ค ํ•ฉ์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ˆ˜ํ•™์  ํ‘œ๊ธฐ๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

\[\Large output[i] = \sum_{j=0}^{\text{CONV}-1} a[i+j] \cdot b[j] \]

์˜์‚ฌ ์ฝ”๋“œ๋กœ ํ‘œํ˜„ํ•œ 1D ํ•ฉ์„ฑ๊ณฑ:

for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:
            ret[i] += a_host[i + j] * b_host[j]
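Translating the pseudocode above directly into Python (not Mojo) reproduces the expected output for this puzzle, assuming the test inputs a_host = [0..5] and b_host = [0, 1, 2]:

```python
# Direct Python port of the pseudocode above, used as a host-side reference.
SIZE, CONV = 6, 3
a_host = list(range(SIZE))  # [0, 1, 2, 3, 4, 5]
b_host = list(range(CONV))  # [0, 1, 2]

ret = [0] * SIZE
for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:  # stay inside the input array
            ret[i] += a_host[i + j] * b_host[j]

print(ret)  # [5, 8, 11, 14, 5, 0]
```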

์ด ํผ์ฆ์€ ๋‹จ๊ณ„์ ์œผ๋กœ ์ดํ•ด๋ฅผ ์Œ“์•„๊ฐˆ ์ˆ˜ ์žˆ๋„๋ก ๋‘ ํŒŒํŠธ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค:

  • ๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „ ์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์„ธ์š”. ๋‹จ์ผ ๋ธ”๋ก์—์„œ LayoutTensor์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ํ•ฉ์„ฑ๊ณฑ ๊ตฌํ˜„์˜ ๊ธฐ์ดˆ๋ฅผ ์ตํž™๋‹ˆ๋‹ค.

  • โญ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „ ์ด์–ด์„œ ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜์–ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ ํ•ด์•ผ ํ•˜๋Š” ๋” ๊นŒ๋‹ค๋กœ์šด ๊ฒฝ์šฐ์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค. LayoutTensor์˜ ๊ธฐ๋Šฅ์„ ๋ณธ๊ฒฉ์ ์œผ๋กœ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฐ ๋ฒ„์ „์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ ์ธก๋ฉด์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋„์ „ ๊ณผ์ œ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ์˜ ์›๋ฆฌ๋ฅผ ์ตํžŒ ๋‹ค์Œ, ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „์—์„œ๋Š” ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งˆ์ฃผ์น˜๋Š” ๋ณต์žกํ•œ ์ƒํ™ฉ์„ ๋‹ค๋ฃจ๋Š” ๋Šฅ๋ ฅ์„ ์‹œํ—˜ํ•ด ๋ด…๋‹ˆ๋‹ค.

๋‹จ์ผ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „

1D LayoutTensor a์™€ 1D LayoutTensor b์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • GPU์—์„œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ ๊ด€๋ฆฌํ•˜๊ธฐ
  • ๊ฒน์น˜๋Š” ์˜์—ญ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉํ•˜๊ธฐ

ํ•ต์‹ฌ์€ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ฒน์น˜๋Š” ์›์†Œ์— ํšจ์œจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ์ž…๋ ฅ ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 6
  • ์ปค๋„ ํฌ๊ธฐ: CONV = 3
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: SIZE์™€ CONV ํฌ๊ธฐ์˜ ๋ฐฐ์—ด 2๊ฐœ

์ฐธ๊ณ :

  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ์ปค๋„์—์„œ ์›์†Œ๋ฅผ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์„ ์ €์žฅํ•˜๋Š” ๊ณต์œ  ๋ฐฐ์—ด
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: ์—ฐ์‚ฐ ์‹œ์ž‘ ์ „ ์Šค๋ ˆ๋“œ ๊ฐ„ ์กฐ์œจ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 6
comptime CONV = 3
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime in_layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(SIZE)
comptime conv_layout = Layout.row_major(CONV)


fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = Int(thread_idx.x)
    # FILL ME IN (roughly 14 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p13/p13.mojo

ํŒ
  1. LayoutTensor[dtype, Layout.row_major(SIZE), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  2. ์ž…๋ ฅ์„ shared_a[local_i]์—, ์ปค๋„์„ shared_b[local_i]์— ๋กœ๋“œ
  3. ๋ฐ์ดํ„ฐ ๋กœ๋“œ ํ›„ barrier() ํ˜ธ์ถœ
  4. ๊ฒฝ๊ณ„ ์•ˆ์—์„œ ๊ณฑ์„ ํ•ฉ์‚ฐ: if local_i + j < SIZE
  5. global_i < SIZE์ผ ๋•Œ๋งŒ ๊ฒฐ๊ณผ ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p13 --simple
pixi run -e amd p13 --simple
pixi run -e apple p13 --simple
uv run poe p13 --simple

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])

์†”๋ฃจ์…˜

fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = Int(thread_idx.x)
    shared_a = LayoutTensor[
        dtype,
        Layout.row_major(SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_b = LayoutTensor[
        dtype,
        Layout.row_major(CONV),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < SIZE:
        shared_a[local_i] = a[global_i]

    if global_i < CONV:
        shared_b[local_i] = b[global_i]

    barrier()

    # Note: this is unsafe as it enforces no guard so could access `shared_a` beyond its bounds
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]

    # if global_i < SIZE:
    #     output[global_i] = local_sum

    # Safe and correct:
    if global_i < SIZE:
        # Note: using `var` allows us to include the type in the type inference
        # `out.element_type` is available in LayoutTensor
        var local_sum: output.element_type = 0

        # Note: `@parameter` decorator unrolls the loop at compile time given `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: do we need this check for this specific example with fixed SIZE, CONV
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum


๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•ด ๊ฒน์น˜๋Š” ์›์†Œ์— ํšจ์œจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ

์ž…๋ ฅ ๋ฐฐ์—ด a:       [0  1  2  3  4  5]
์ปค๋„ b:          [0  1  2]

์—ฐ์‚ฐ ๊ณผ์ •

  1. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ:

    shared_a: [0  1  2  3  4  5]  // ์ž…๋ ฅ ๋ฐฐ์—ด
    shared_b: [0  1  2]           // ํ•ฉ์„ฑ๊ณฑ ์ปค๋„
    
  2. ๊ฐ ์œ„์น˜ i์— ๋Œ€ํ•œ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ:

    output[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
    output[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
    output[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
    output[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
    output[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
    output[5] = a[5]*b[0] + 0*b[1]   + 0*b[2]     = 5*0 + 0*1 + 0*2 = 0
    

๊ตฌํ˜„ ์ƒ์„ธ

  1. ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ ๋ฒ”์œ„์™€ ํšจ์œจ์„ฑ:

    • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๊ฐ€๋“œ๊ฐ€ ์—†๋Š” ๋น„ํšจ์œจ์  ์ ‘๊ทผ:

      # ๋น„ํšจ์œจ์  ๋ฒ„์ „ - ๊ฒฐ๊ณผ๊ฐ€ ์‚ฌ์šฉ๋˜์ง€ ์•Š์„ ์Šค๋ ˆ๋“œ๋„ ๋ชจ๋‘ ์—ฐ์‚ฐ ์ˆ˜ํ–‰
      local_sum = Scalar[dtype](0)
      for j in range(CONV):
          if local_i + j < SIZE:
              local_sum += shared_a[local_i + j] * shared_b[j]
      # ๋งˆ์ง€๋ง‰ ์“ฐ๊ธฐ๋งŒ ๊ฐ€๋“œ
      if global_i < SIZE:
          output[global_i] = local_sum
      
    • ํšจ์œจ์ ์ด๊ณ  ์˜ฌ๋ฐ”๋ฅธ ๊ตฌํ˜„:

      if global_i < SIZE:
          var local_sum: output.element_type = 0  # var๋กœ ํƒ€์ž… ์ถ”๋ก  ํ™œ์šฉ
          @parameter  # CONV๊ฐ€ ์ƒ์ˆ˜์ด๋ฏ€๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„์— ๋ฃจํ”„ ์ „๊ฐœ
          for j in range(CONV):
              if local_i + j < SIZE:
                  local_sum += shared_a[local_i + j] * shared_b[j]
          output[global_i] = local_sum
      

    ํ•ต์‹ฌ์ ์ธ ์ฐจ์ด๋Š” ๊ฐ€๋“œ์˜ ์œ„์น˜์ž…๋‹ˆ๋‹ค. ๋น„ํšจ์œจ์  ๋ฒ„์ „์€ global_i >= SIZE์ธ ์Šค๋ ˆ๋“œ๋ฅผ ํฌํ•จํ•ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•œ ๋’ค, ๋งˆ์ง€๋ง‰ ์“ฐ๊ธฐ์—์„œ๋งŒ ๊ฐ€๋“œ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด:

    • ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ: ์œ ํšจ ๋ฒ”์œ„ ๋ฐ–์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์“ธ๋ชจ์—†๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰
    • ํšจ์œจ ์ €ํ•˜: ์‚ฌ์šฉ๋˜์ง€ ์•Š์„ ์—ฐ์‚ฐ์— ์ž์› ์†Œ๋น„
    • GPU ํ™œ์šฉ๋„ ์ €ํ•˜: ์˜๋ฏธ ์—†๋Š” ๊ณ„์‚ฐ์— GPU ์ฝ”์–ด๋ฅผ ๋‚ญ๋น„

    ํšจ์œจ์  ๋ฒ„์ „์€ ์œ ํšจํ•œ global_i ๊ฐ’์„ ๊ฐ€์ง„ ์Šค๋ ˆ๋“œ๋งŒ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ฏ€๋กœ GPU ์ž์›์„ ๋” ์ž˜ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

  2. ์ฃผ์š” ๊ตฌํ˜„ ํŠน์ง•:

    • var์™€ output.element_type์œผ๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์ถ”๋ก 
    • @parameter ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋กœ ํ•ฉ์„ฑ๊ณฑ ๋ฃจํ”„๋ฅผ ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ
    • ์—„๊ฒฉํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ ํ™•๋ณด
    • LayoutTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ์œผ๋กœ ์ฝ”๋“œ ์•ˆ์ „์„ฑ ํ–ฅ์ƒ
  3. ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ:

    • ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ์ปค๋„ ๋ชจ๋‘ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
    • ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ 1ํšŒ ๋กœ๋“œ
    • ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์  ์žฌ์‚ฌ์šฉ
  4. ์Šค๋ ˆ๋“œ ์กฐ์œจ:

    • barrier()๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ๊ฐ€ ๋๋‚œ ํ›„ ์—ฐ์‚ฐ ์‹œ์ž‘์„ ๋ณด์žฅ
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐ
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€
  5. ์„ฑ๋Šฅ ์ตœ์ ํ™”:

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™”
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋น ๋ฅธ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
    • ๋ฉ”์ธ ์—ฐ์‚ฐ ๋ฃจํ”„์—์„œ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ํšŒํ”ผ
    • @parameter ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋ฅผ ํ†ตํ•œ ๋ฃจํ”„ ์ „๊ฐœ

๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „

1D LayoutTensor a์™€ 1D LayoutTensor b์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ์ž…๋ ฅ ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE_2 = 15
  • ์ปค๋„ ํฌ๊ธฐ: CONV_2 = 4
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ์ž…๋ ฅ์šฉ TPB + CONV_2 - 1๊ฐœ

์ฐธ๊ณ :

  • ํ™•์žฅ ๋กœ๋”ฉ: ๊ฒฝ๊ณ„ ๊ฒน์นจ ์˜์—ญ์„ ๊ณ ๋ ค
  • ๋ธ”๋ก ๊ฐ€์žฅ์ž๋ฆฌ: ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜๋Š” ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ
  • ๋™๊ธฐํ™”: ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๊ฐ„ ์กฐ์œจ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE_2 = 15
comptime CONV_2 = 4
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime in_2_layout = Layout.row_major(SIZE_2)
comptime out_2_layout = Layout.row_major(SIZE_2)
comptime conv_2_layout = Layout.row_major(CONV_2)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    # FILL ME IN (roughly 18 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p13/p13.mojo

ํŒ
  1. LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  2. ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: shared_a[local_i] = a[global_i]
  3. ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: if local_i < CONV_2 - 1์ผ ๋•Œ ๋‹ค์Œ ๋ธ”๋ก์˜ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  4. ์ปค๋„ ๋กœ๋“œ: shared_b[local_i] = b[local_i]
  5. ์ž…๋ ฅ ๋ฒ”์œ„ ์•ˆ์—์„œ ํ•ฉ์‚ฐ: if global_i + j < SIZE_2

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p13 --block-boundary
pixi run -e amd p13 --block-boundary
pixi run -e apple p13 --block-boundary
uv run poe p13 --block-boundary

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])

์†”๋ฃจ์…˜

fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, conv_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    # first: need to account for padding
    shared_a = LayoutTensor[
        dtype,
        Layout.row_major(TPB + CONV_2 - 1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_b = LayoutTensor[
        dtype,
        Layout.row_major(CONV_2),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum


ํ™•์žฅ๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ์ž์„ธํžˆ ๋ถ„์„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ํฌ๊ธฐ ๊ณ„์‚ฐ

ํ…Œ์ŠคํŠธ ๊ตฌ์„ฑ:
- ์ „์ฒด ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE_2 = 15
- ๊ทธ๋ฆฌ๋“œ: 2 ๋ธ”๋ก ร— 8 ์Šค๋ ˆ๋“œ
- ํ•ฉ์„ฑ๊ณฑ ์ปค๋„: CONV_2 = 4

Block 0 ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ:  [0 1 2 3 4 5 6 7|8 9 10]  // TPB(8) + (CONV_2-1)(3) ํŒจ๋”ฉ
Block 1 ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ:  [8 9 10 11 12 13 14 0|0 0 0]  // ๋‘ ๋ฒˆ์งธ ๋ธ”๋ก. ๋ฐ์ดํ„ฐ(7) + ๊ทธ๋ฆฌ๋“œ ์ฑ„์›€์šฉ ํŒจ๋”ฉ(1) + (CONV_2-1)(3) ํŒจ๋”ฉ

ํฌ๊ธฐ ๊ณ„์‚ฐ:
- ๋ฉ”์ธ ๋ฐ์ดํ„ฐ: TPB๊ฐœ (8)
- ๊ฒน์นจ ์˜์—ญ: CONV_2 - 1๊ฐœ (4 - 1 = 3)
- ํ•ฉ๊ณ„: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11๊ฐœ
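The loading scheme can be replayed on the host. The Python sketch below (not Mojo; inputs assumed to be a = [0..14], b = [0..3] as in the test) loads TPB + CONV_2 - 1 elements per block, zero-padding past the end of the input, and reproduces the expected output:

```python
# Host-side sketch of the block-boundary loading scheme.
SIZE_2, CONV_2, TPB = 15, 4, 8
a = list(range(SIZE_2))
b = list(range(CONV_2))

out = [0] * SIZE_2
for blk in range(2):
    # Each block loads TPB elements plus a CONV_2 - 1 halo; out-of-range
    # slots are zero-initialized, as in the kernel.
    shared_a = [a[blk * TPB + i] if blk * TPB + i < SIZE_2 else 0
                for i in range(TPB + CONV_2 - 1)]
    for local_i in range(TPB):
        global_i = blk * TPB + local_i
        if global_i < SIZE_2:
            out[global_i] = sum(shared_a[local_i + j] * b[j]
                                for j in range(CONV_2) if global_i + j < SIZE_2)

print(out)  # [14, 20, 26, 32, 38, 44, 50, 56, 62, 68, 74, 80, 41, 14, 0]
```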

๊ตฌํ˜„ ์ƒ์„ธ

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น:

    # ํ•ฉ์„ฑ๊ณฑ ์œˆ๋„์šฐ์— ํ•„์š”ํ•œ ํŒจ๋”ฉ์„ ๋จผ์ € ๊ณ ๋ ค
    shared_a = LayoutTensor[dtype, Layout.row_major(TPB + CONV_2 - 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
    shared_b = LayoutTensor[dtype, Layout.row_major(CONV_2), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
    

    ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ธ”๋ก ๋ฐ์ดํ„ฐ์™€ ๊ฒน์นจ ์˜์—ญ์„ ๋ชจ๋‘ ๋‹ด๊ธฐ์— ์ถฉ๋ถ„ํ•œ ๊ณต๊ฐ„์ด ํ™•๋ณด๋ฉ๋‹ˆ๋‹ค.

  2. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ ์ „๋žต:

    # ๋ฉ”์ธ ๋ธ”๋ก ๋ฐ์ดํ„ฐ
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0
    
    # ๋‹ค์Œ ๋ธ”๋ก์˜ ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ
    if local_i < CONV_2 - 1:
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # ๋ฒ”์œ„ ๋ฐ– ์›์†Œ๋ฅผ 0์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜์—ฌ
            # ๋ฏธ์ •์˜ ๋™์ž‘์„ ์œ ๋ฐœํ•˜๋Š” ์ดˆ๊ธฐํ™”๋˜์ง€ ์•Š์€ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ๋ฅผ ๋ฐฉ์ง€
            shared_a[TPB + local_i] = 0
    
    • local_i < CONV_2 - 1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œ
    • ๋ถˆํ•„์š”ํ•œ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ๋ฐฉ์ง€
    • ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ์œ ์ง€
    • ๋ฒ”์œ„ ๋ฐ– ์›์†Œ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ 0์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜์—ฌ ๋ฏธ์ •์˜ ๋™์ž‘ ๋ฐฉ์ง€
  3. ์ปค๋„ ๋กœ๋”ฉ:

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]
    
    • ์Šค๋ ˆ๋“œ๋‹น 1ํšŒ ๋กœ๋“œ
    • ์ปค๋„ ํฌ๊ธฐ๋กœ ๋ฒ”์œ„ ์ œํ•œ
  4. ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ:

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0
        @parameter
        for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]
    
    • @parameter๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ
    • output.element_type์œผ๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์ถ”๋ก 
    • ์˜๋ฏธ์ ์œผ๋กœ ์˜ฌ๋ฐ”๋ฅธ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ์œ ํšจํ•œ ์ž…๋ ฅ ์œ„์น˜์—์„œ๋งŒ ํ•ฉ์„ฑ๊ณฑ ๊ณ„์‚ฐ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

  1. Block 0 ์ ‘๊ทผ ํŒจํ„ด:

    Thread 0: [0 1 2 3] ร— [0 1 2 3]
    Thread 1: [1 2 3 4] ร— [0 1 2 3]
    Thread 2: [2 3 4 5] ร— [0 1 2 3]
    ...
    Thread 7: [7 8 9 10] ร— [0 1 2 3]  // ๊ฒน์นจ ์˜์—ญ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ
    
  2. Block 1 ์ ‘๊ทผ ํŒจํ„ด: Thread 4๋ถ€ํ„ฐ๋Š” global_i + j < SIZE_2๊ฐ€ False๊ฐ€ ๋˜์–ด ํ•ด๋‹น ๋ฐ˜๋ณต์„ ๊ฑด๋„ˆ๋›ฐ๋Š” ์ ์— ์ฃผ๋ชฉํ•˜์„ธ์š”.

    Thread 0: [8  9 10 11] ร— [0 1 2 3]
    Thread 1: [9 10 11 12] ร— [0 1 2 3]
    ...
    Thread 4: [12 13 14] ร— [0 1 2]       // ๋๋ถ€๋ถ„ ์ œ๋กœ ํŒจ๋”ฉ
    Thread 5: [13 14]    ร— [0 1]
    Thread 6: [14]       ร— [0]
    Thread 7: ๊ฑด๋„ˆ๋œ€                      // ๋ชจ๋“  j์— ๋Œ€ํ•ด global_i + j < SIZE_2๊ฐ€ false, ์—ฐ์‚ฐ ์—†์Œ
    

์„ฑ๋Šฅ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ:

    • ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
    • ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ: ํ•„์š”ํ•œ ์Šค๋ ˆ๋“œ๋งŒ ์ฐธ์—ฌ
    • ๋‹จ์ผ ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™” ์ง€์ 
  2. ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ์ตœ์†Œํ™”:

    • ๋ฉ”์ธ ๋กœ๋”ฉ๊ณผ ๊ฒฝ๊ณ„ ๋กœ๋”ฉ์˜ ๊น”๋”ํ•œ ๋ถ„๋ฆฌ
    • ์›Œํ”„ ๋‚ด ๊ท ์ผํ•œ ์—ฐ์‚ฐ ํŒจํ„ด
    • ํšจ์œจ์ ์ธ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ:

    • ๋ธ”๋ก ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์— ์ตœ์ ํ™”๋œ ํฌ๊ธฐ ์„ค์ •
    • ์ ‘๊ทผ ํŒจํ„ด์—์„œ ๋ฑ…ํฌ ์ถฉ๋Œ ์—†์Œ
    • ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์  ์žฌ์‚ฌ์šฉ
  4. ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ:

    • ๋ฒ”์œ„ ๋ฐ– ์›์†Œ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ 0์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ์ดˆ๊ธฐํ™”๋˜์ง€ ์•Š์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ๋ฐฉ์ง€
    • global_i + j < SIZE_2๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์•„๋‹Œ ์‹ค์ œ ์ž…๋ ฅ ๋ฒ”์œ„ ๊ธฐ์ค€์˜ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
    • ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ ์—†์ด ์ ์ ˆํ•œ ์—ฃ์ง€ ์ผ€์ด์Šค ์ฒ˜๋ฆฌ

๊ฒฝ๊ณ„ ์กฐ๊ฑด ๊ฐœ์„ 

์ด ์†”๋ฃจ์…˜์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ”์œ„๋ฅผ ํ™•์ธํ•˜๋Š” ๋Œ€์‹  if global_i + j < SIZE_2:๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ํŒจํ„ด์€:

  • ์ˆ˜ํ•™์ ์œผ๋กœ ์ •ํ™•: ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์‹ค์ œ๋กœ ์กด์žฌํ•˜๋Š” ์œ„์น˜์—์„œ๋งŒ ํ•ฉ์„ฑ๊ณฑ ๊ณ„์‚ฐ
  • ๋” ํšจ์œจ์ : ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋„˜์–ด์„  ์œ„์น˜์— ๋Œ€ํ•œ ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ ํšŒํ”ผ
  • ๋” ์•ˆ์ „: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์ œ๋กœ ํŒจ๋”ฉ ๋™์ž‘์— ์˜์กดํ•˜์ง€ ์•Š์Œ

์ด ๊ตฌํ˜„์€ ๋ธ”๋ก ๊ฐ„ ํ•ฉ์„ฑ๊ณฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ ๋‹ค์Œ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • ์ ์ ˆํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ
  • ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ํ†ตํ•œ ๋†’์€ ์„ฑ๋Šฅ
  • LayoutTensor ์ถ”์ƒํ™”๋ฅผ ํ™œ์šฉํ•œ ๊น”๋”ํ•œ ์ฝ”๋“œ ๊ตฌ์กฐ
  • ์ตœ์†Œํ•œ์˜ ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  • ์ˆ˜ํ•™์ ์œผ๋กœ ๊ฑด์ „ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

Puzzle 14: ๋ˆ„์  ํ•ฉ

๊ฐœ์š”

๋ˆ„์  ํ•ฉ(prefix sum, scan ์ด๋ผ๊ณ ๋„ ํ•ฉ๋‹ˆ๋‹ค)์€ ์‹œํ€€์Šค์˜ ๊ฐ’์„ ์ฐจ๋ก€๋กœ ๋”ํ•ด ๋‚˜๊ฐ€๋Š” ๊ธฐ๋ณธ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ถ€ํ„ฐ ๊ณผํ•™ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๊นŒ์ง€ ์ˆ˜๋งŽ์€ ๋ณ‘๋ ฌ ์‘์šฉ์˜ ํ•ต์‹ฌ์— ์ž๋ฆฌํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ˆซ์ž ์‹œํ€€์Šค๋ฅผ ๋ˆ„์  ํ•ฉ๊ณ„๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ˆœ์ฐจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ธฐ๋Š” ๊ฐ„๋‹จํ•˜์ง€๋งŒ, GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ๋งŒ๋“ค๋ ค๋ฉด ๊ธฐ๋ฐœํ•œ ๋ณ‘๋ ฌ์  ์‚ฌ๊ณ ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค!

1D LayoutTensor a์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : a์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ๊ฐ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋งŒ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ˆ„์  ํ•ฉ ์‹œ๊ฐํ™” ๋ˆ„์  ํ•ฉ ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋กœ๊ทธ ๋ณต์žก๋„๋ฅผ ๊ฐ€์ง„ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ˜‘๋ ฅ ํŒจํ„ด
  • ๋‹ค๋‹จ๊ณ„ ์—ฐ์‚ฐ ์ „๋žต

ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ˆœ์ฐจ ์—ฐ์‚ฐ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, ์ž…๋ ฅ ์‹œํ€€์Šค \([3, 1, 4, 1, 5, 9]\) ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด, ๋ˆ„์  ํ•ฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค:

  • \([3]\) (์ฒซ ๋ฒˆ์งธ ์›์†Œ ๊ทธ๋Œ€๋กœ)
  • \([3, 4]\) (3 + 1)
  • \([3, 4, 8]\) (์ด์ „ ํ•ฉ + 4)
  • \([3, 4, 8, 9]\) (์ด์ „ ํ•ฉ + 1)
  • \([3, 4, 8, 9, 14]\) (์ด์ „ ํ•ฉ + 5)
  • \([3, 4, 8, 9, 14, 23]\) (์ด์ „ ํ•ฉ + 9)

์ˆ˜ํ•™์ ์œผ๋กœ, ์‹œํ€€์Šค \([x_0, x_1, โ€ฆ, x_n]\) ์˜ ๋ˆ„์  ํ•ฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: \[ [x_0, x_0+x_1, x_0+x_1+x_2, โ€ฆ, \sum_{i=0}^n x_i] \]

์ˆœ์ฐจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ผ๋ฉด \(O(n)\) ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•˜๊ฒ ์ง€๋งŒ, ์—ฌ๊ธฐ์„œ๋Š” ์˜๋ฆฌํ•œ 2๋‹จ๊ณ„ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ \(O(\log n)\) ๋‹จ๊ณ„๋งŒ์— ์™„๋ฃŒํ•ฉ๋‹ˆ๋‹ค! ์œ„์˜ ์• ๋‹ˆ๋ฉ”์ด์…˜์—์„œ ์ด ๊ณผ์ •์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
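The doubling-offset idea behind that claim can be checked with a short Python sketch (not Mojo) of the inclusive scan on the example sequence above — only ceil(log2 n) passes are needed:

```python
# Sequential sketch of the inclusive scan: the offset doubles each pass,
# so only ceil(log2(n)) passes are needed. Reads are completed before
# writes within a pass, mimicking barrier().
shared = [3, 1, 4, 1, 5, 9]

offset = 1
steps = 0
while offset < len(shared):
    reads = {i: shared[i - offset] for i in range(offset, len(shared))}
    for i, value in reads.items():
        shared[i] += value
    offset *= 2
    steps += 1

print(shared)  # [3, 4, 8, 9, 14, 23]
print(steps)   # 3 passes for n = 6
```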

์ด ํผ์ฆ์€ ๊ฐœ๋…์„ ๋‹จ๊ณ„์ ์œผ๋กœ ์ตํž ์ˆ˜ ์žˆ๋„๋ก ๋‘ ํŒŒํŠธ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค:

  • ๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋“ค์–ด๊ฐ€๋Š” ๋‹จ์ผ ๋ธ”๋ก ๊ตฌํ˜„๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์›๋ฆฌ๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์ข‹์Šต๋‹ˆ๋‹ค.

  • โญ ์™„์„ฑ ๋ฒ„์ „ ์ด์–ด์„œ ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์น˜๋Š” ํฐ ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋” ๊นŒ๋‹ค๋กœ์šด ๊ฒฝ์šฐ์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค. ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ฐ ๋ฒ„์ „์€ ์ด์ „ ๋ฒ„์ „ ์œ„์— ์Œ“์•„ ์˜ฌ๋ฆฌ๋Š” ๋ฐฉ์‹์œผ๋กœ, ๋ณ‘๋ ฌ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๊นŠ์ด ์žˆ๊ฒŒ ๋ฐœ์ „์‹œ์ผœ ์ค๋‹ˆ๋‹ค. ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹ค์ง€๊ณ , ์™„์„ฑ ๋ฒ„์ „์—์„œ๋Š” ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ™•์žฅํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค โ€” ์‹ค์ œ GPU ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์ž์ฃผ ๋งˆ์ฃผ์น˜๋Š” ๊ณผ์ œ์ž…๋‹ˆ๋‹ค.

๊ธฐ๋ณธ ๋ฒ„์ „

1D LayoutTensor a์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : a์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ๊ฐ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋งŒ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ ์›์†Œ

์ฐธ๊ณ :

  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ LayoutTensor ์ ‘๊ทผ์„ ํ†ตํ•ด ์›์†Œ ํ•˜๋‚˜๋ฅผ ๋กœ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: address_space๋ฅผ ์ง€์ •ํ•œ LayoutTensor๋กœ ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: ์—ฐ์‚ฐ ๋‹จ๊ณ„ ๊ฐ„ ์กฐ์œจ
  • ์ ‘๊ทผ ํŒจํ„ด: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ
  • ํƒ€์ž… ์•ˆ์ „์„ฑ: LayoutTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ ํ™œ์šฉ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn prefix_sum_simple[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p14/p14.mojo

ํŒ
  1. ๋ฐ์ดํ„ฐ๋ฅผ shared[local_i]์— ๋กœ๋“œ
  2. offset = 1์—์„œ ์‹œ์ž‘ํ•ด ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค 2๋ฐฐ๋กœ ์ฆ๊ฐ€
  3. local_i >= offset์ธ ์›์†Œ์— ๋Œ€ํ•ด ๋ง์…ˆ ์ˆ˜ํ–‰
  4. ๊ฐ ๋‹จ๊ณ„ ์‚ฌ์ด์— barrier() ํ˜ธ์ถœ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p14 --simple
pixi run -e amd p14 --simple
pixi run -e apple p14 --simple
uv run poe p14 --simple

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])

์†”๋ฃจ์…˜

fn prefix_sum_simple[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    offset = UInt(1)
    # Runs log2(TPB) times; `log2` comes from the `math` module (from math import log2)
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        var current_val: output.element_type = 0
        if local_i >= offset and local_i < size:
            current_val = shared[local_i - offset]  # read

        barrier()
        if local_i >= offset and local_i < size:
            shared[local_i] += current_val

        barrier()
        offset *= 2

    if global_i < size:
        output[global_i] = shared[local_i]


๋ณ‘๋ ฌ (ํฌํ•จ) ๋ˆ„์  ํ•ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค:

์„ค์ • ๋ฐ ๊ตฌ์„ฑ

  • TPB (๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜) = 8
  • SIZE (๋ฐฐ์—ด ํฌ๊ธฐ) = 8

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ช…์‹œ์  ๋™๊ธฐํ™”๋ฅผ ํ†ตํ•ด ์ฝ๊ธฐ-์“ฐ๊ธฐ ์ถฉ๋Œ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • ์ฝ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋จผ์ € ํ•„์š”ํ•œ ๊ฐ’์„ ๋กœ์ปฌ ๋ณ€์ˆ˜ current_val์— ์ฝ์–ด๋‘ 
  • ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ์“ฐ๊ธฐ๊ฐ€ ์‹œ์ž‘๋˜๋„๋ก ๋ณด์žฅ
  • ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก

์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜๋ฅผ ์ฝ๊ณ  ์“ธ ๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
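์ฝ๊ธฐ โ†’ ๋™๊ธฐํ™” โ†’ ์“ฐ๊ธฐ ๋ถ„๋ฆฌ๊ฐ€ ์™œ ํ•„์š”ํ•œ์ง€๋Š” CPU์—์„œ๋„ ์žฌํ˜„ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ํ•œ ๋‹จ๊ณ„(offset=1)๋ฅผ ๋‘ ๋ฐฉ์‹์œผ๋กœ ์ˆ˜ํ–‰ํ•ด ๋ณด๋Š” Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค(์ปค๋„ ์ž์ฒด๋Š” Mojo์ด์ง€๋งŒ, ๋™์ž‘ ์›๋ฆฌ ํ™•์ธ์šฉ์œผ๋กœ Python์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค). ์ฝ๊ธฐ/์“ฐ๊ธฐ๋ฅผ ๋ถ„๋ฆฌํ•˜์ง€ ์•Š์œผ๋ฉด ๊ฒฐ๊ณผ๊ฐ€ ์‹คํ–‰ ์ˆœ์„œ์— ์ข Œ์šฐ๋จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

```python
def step_with_read_phase(shared, offset):
    # barrier๋กœ ๋ถ„๋ฆฌ๋œ ์ฝ๊ธฐ โ†’ ์“ฐ๊ธฐ: ์Šค๋ ˆ๋“œ ์‹คํ–‰ ์ˆœ์„œ์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ๊ฐ™์€ ๊ฒฐ๊ณผ
    reads = [shared[i - offset] for i in range(len(shared))]  # ๋ชจ๋“  "์Šค๋ ˆ๋“œ"๊ฐ€ ๋จผ์ € ์ฝ์Œ
    return [v + reads[i] if i >= offset else v for i, v in enumerate(shared)]

def step_in_place(shared, offset):
    # ์ฝ๊ธฐ/์“ฐ๊ธฐ๋ฅผ ๋ถ„๋ฆฌํ•˜์ง€ ์•Š๊ณ  ์ œ์ž๋ฆฌ์—์„œ ๊ฐฑ์‹ : ์ด๋ฏธ ๊ฐฑ์‹ ๋œ ๊ฐ’์„ ์ฝ์„ ์ˆ˜ ์žˆ์Œ
    out = list(shared)
    for i in range(offset, len(out)):
        out[i] += out[i - offset]
    return out

a = list(range(8))
print(step_with_read_phase(a, 1))  # [0, 1, 3, 5, 7, 9, 11, 13]  (์˜๋„ํ•œ ํ•œ ๋‹จ๊ณ„ ๊ฒฐ๊ณผ)
print(step_in_place(a, 1))         # [0, 1, 3, 6, 10, 15, 21, 28] (์ˆœ์ฐจ ์‹คํ–‰ ์ˆœ์„œ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง)
```

GPU์—์„œ๋Š” ์Šค๋ ˆ๋“œ ์‹คํ–‰ ์ˆœ์„œ๋ฅผ ์˜ˆ์ธกํ•  ์ˆ˜ ์—†์œผ๋ฏ€๋กœ, ๋‘ ๋ฒˆ์งธ ๋ฐฉ์‹์€ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋ฉ๋‹ˆ๋‹ค.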

๋Œ€์•ˆ์  ์ ‘๊ทผ: ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์€ ๋”๋ธ” ๋ฒ„ํผ๋ง ์ž…๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ 2๋ฐฐ๋กœ ํ• ๋‹นํ•œ ๋’ค, ํ•œ ๋ฒ„ํผ์—์„œ ์ฝ๊ณ  ๋‹ค๋ฅธ ๋ฒ„ํผ์— ์“ฐ๋Š” ๊ฒƒ์„ ๋ฒˆ๊ฐˆ์•„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์™„์ „ํžˆ ์ œ๊ฑฐํ•˜์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ๋Š˜์–ด๋‚˜๊ณ  ๋ณต์žก๋„๊ฐ€ ์˜ฌ๋ผ๊ฐ‘๋‹ˆ๋‹ค. ํ•™์Šต ๋ชฉ์ ์œผ๋กœ๋Š” ์ดํ•ดํ•˜๊ธฐ ๋” ์‰ฌ์šด ๋ช…์‹œ์  ๋™๊ธฐํ™” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๋งคํ•‘

  • thread_idx.x: \([0, 1, 2, 3, 4, 5, 6, 7]\) (local_i)
  • block_idx.x: \([0, 0, 0, 0, 0, 0, 0, 0]\)
  • global_i: \([0, 1, 2, 3, 4, 5, 6, 7]\) (block_idx.x * TPB + thread_idx.x)

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ดˆ๊ธฐ ๋กœ๋“œ

Threads:      Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡
Input array:  [0    1    2    3    4    5    6    7]
shared:       [0    1    2    3    4    5    6    7]
               โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
              Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

Offset = 1: ์ฒซ ๋ฒˆ์งธ ๋ณ‘๋ ฌ ๋‹จ๊ณ„

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ: \(T_1 \ldots T_7\) (local_i โ‰ฅ 1์ธ ์Šค๋ ˆ๋“œ)

์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

Tโ‚ reads shared[0] = 0    Tโ‚… reads shared[4] = 4
Tโ‚‚ reads shared[1] = 1    Tโ‚† reads shared[5] = 5
Tโ‚ƒ reads shared[2] = 2    Tโ‚‡ reads shared[6] = 6
Tโ‚„ reads shared[3] = 3

๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ํ˜„์žฌ ์œ„์น˜์— ๋”ํ•จ:

Before:      [0    1    2    3    4    5    6    7]
Add:              +0   +1   +2   +3   +4   +5   +6
                   |    |    |    |    |    |    |
Result:      [0    1    3    5    7    9    11   13]
                   โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
                  Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

Offset = 2: ๋‘ ๋ฒˆ์งธ ๋ณ‘๋ ฌ ๋‹จ๊ณ„

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ: \(T_2 \ldots T_7\) (local_i โ‰ฅ 2์ธ ์Šค๋ ˆ๋“œ)

์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

Tโ‚‚ reads shared[0] = 0    Tโ‚… reads shared[3] = 5
Tโ‚ƒ reads shared[1] = 1    Tโ‚† reads shared[4] = 7
Tโ‚„ reads shared[2] = 3    Tโ‚‡ reads shared[5] = 9

๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

Before:      [0    1    3    5    7    9    11   13]
Add:                   +0   +1   +3   +5   +7   +9
                        |    |    |    |    |    |
Result:      [0    1    3    6    10   14   18   22]
                        โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
                       Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

Offset = 4: ์„ธ ๋ฒˆ์งธ ๋ณ‘๋ ฌ ๋‹จ๊ณ„

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ: \(T_4 \ldots T_7\) (local_i โ‰ฅ 4์ธ ์Šค๋ ˆ๋“œ)

์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

Tโ‚„ reads shared[0] = 0    Tโ‚† reads shared[2] = 3
Tโ‚… reads shared[1] = 1    Tโ‚‡ reads shared[3] = 6

๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

Before:      [0    1    3    6    10   14   18   22]
Add:                              +0   +1   +3   +6
                                  |    |    |    |
Result:      [0    1    3    6    10   15   21   28]
                                  โ†‘    โ†‘    โ†‘    โ†‘
                                  Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ output์— ๊ธฐ๋ก

Threads:      Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡
global_i:     0    1    2    3    4    5    6    7
output:       [0    1    3    6    10   15   21   28]
              โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
              Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

์ฃผ์š” ๊ตฌํ˜„ ์ƒ์„ธ

๋™๊ธฐํ™” ํŒจํ„ด: ๊ฐ ๋ฐ˜๋ณต์€ ์—„๊ฒฉํ•œ ์ฝ๊ธฐ โ†’ ๋™๊ธฐํ™” โ†’ ์“ฐ๊ธฐ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. var current_val: output.element_type = 0 - ๋กœ์ปฌ ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
  2. current_val = shared[local_i - offset] - ์ฝ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  3. barrier() - ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ช…์‹œ์  ๋™๊ธฐํ™”
  4. shared[local_i] += current_val - ์“ฐ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  5. barrier() - ๋‹ค์Œ ๋ฐ˜๋ณต ์ „ ๋™๊ธฐํ™”

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•˜์ง€ ์•Š์œผ๋ฉด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•˜์—ฌ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ช…์‹œ์  ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•œ 2๋‹จ๊ณ„ ์ ‘๊ทผ ๋ฐฉ์‹์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • if local_i >= offset and local_i < size๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ์ž„์‹œ ๋ณ€์ˆ˜์˜ ์ ์ ˆํ•œ ์ดˆ๊ธฐํ™”
  • ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๋Š” ์กฐ์œจ๋œ ์ ‘๊ทผ ํŒจํ„ด

์ด ์†”๋ฃจ์…˜์€ barrier()๋ฅผ ์‚ฌ์šฉํ•ด ๋‹จ๊ณ„ ๊ฐ„ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋ฅผ ๋ณด์žฅํ•˜๊ณ , if global_i < size๋กœ ๋ฐฐ์—ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” ๊ฐ ์›์†Œ \(i\)๊ฐ€ \(\sum_{j=0}^{i} a[j]\) ๋ฅผ ํฌํ•จํ•˜๋Š” ํฌํ•จ ๋ˆ„์  ํ•ฉ์ž…๋‹ˆ๋‹ค.

์™„์„ฑ ๋ฒ„์ „

1D LayoutTensor a์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D LayoutTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : a์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ๋ ค๋ฉด ์—ฌ๋Ÿฌ ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE_2 = 15
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น TPB๊ฐœ ์›์†Œ

์ฐธ๊ณ :

  • ๋‹ค์ค‘ ๋ธ”๋ก: ์ž…๋ ฅ ๋ฐฐ์—ด์ด ํ•˜๋‚˜์˜ ๋ธ”๋ก๋ณด๋‹ค ํด ๋•Œ๋Š” ๋‹ค๋‹จ๊ณ„ ์ ‘๊ทผ์ด ํ•„์š”
  • ๋ธ”๋ก ๋ ˆ๋ฒจ ๋™๊ธฐํ™”: ๋ธ”๋ก ๋‚ด์—์„œ๋Š” barrier()๋กœ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”
  • ํ˜ธ์ŠคํŠธ ๋ ˆ๋ฒจ ๋™๊ธฐํ™”: Mojo์˜ DeviceContext๊ฐ€ ์ปค๋„ ์‹คํ–‰ ์ˆœ์„œ๋ฅผ ๋ณด์žฅํ•˜๋ฏ€๋กœ, ์ปค๋„๋“ค์€ ํ์— ๋„ฃ์€ ์ˆœ์„œ๋Œ€๋กœ ์‹คํ–‰๋˜๊ณ  ์ด์ „ ์ปค๋„์ด ๋๋‚˜์•ผ ๋‹ค์Œ์ด ์‹œ์ž‘๋ฉ๋‹ˆ๋‹ค. ํ˜ธ์ŠคํŠธ์—์„œ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „์— ctx.synchronize()๋กœ ๋ชจ๋“  GPU ์ž‘์—… ์™„๋ฃŒ๋ฅผ ํ™•์ธํ•ด์•ผ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ณด์กฐ ์ €์žฅ์†Œ: ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•ด ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅํ•  ์ถ”๊ฐ€ ๊ณต๊ฐ„ ์‚ฌ์šฉ

์™„์„ฑํ•  ์ฝ”๋“œ

๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ˆ„์  ํ•ฉ์„ ์œ„ํ•ด ๋‘ ๊ฐœ์˜ ๋ณ„๋„ ์ปค๋„ ํ•จ์ˆ˜๋ฅผ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ฒซ ๋ฒˆ์งธ ์ปค๋„ (prefix_sum_local_phase): ๊ฐ ๋ธ”๋ก ๋‚ด์—์„œ ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅ
  2. ๋‘ ๋ฒˆ์งธ ์ปค๋„ (prefix_sum_block_sum_phase): ์ด์ „ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋ฅผ ํ›„์† ๋ธ”๋ก์˜ ์›์†Œ์— ๋”ํ•จ

๋ฉ”์ธ ํ•จ์ˆ˜๊ฐ€ ์ด ์ปค๋„๋“ค ์‚ฌ์ด์— ํ•„์š”ํ•œ ํ˜ธ์ŠคํŠธ ์ธก ๋™๊ธฐํ™”๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

comptime SIZE_2 = 15
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime EXTENDED_SIZE = SIZE_2 + 2  # up to 2 blocks
comptime layout_2 = Layout.row_major(SIZE_2)
comptime extended_layout = Layout.row_major(EXTENDED_SIZE)


# Kernel 1: Compute local prefix sums and store block sums in out
fn prefix_sum_local_phase[
    out_layout: Layout, in_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 20 lines)


# Kernel 2: Add block sums to their respective blocks
fn prefix_sum_block_sum_phase[
    layout: Layout
](output: LayoutTensor[dtype, layout, MutAnyOrigin], size: UInt):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 3 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p14/p14.mojo

์ด ํผ์ฆ์˜ ํ•ต์‹ฌ์€ barrier๊ฐ€ ๋ธ”๋ก ๋‚ด๋ถ€์˜ ์Šค๋ ˆ๋“œ๋งŒ ๋™๊ธฐํ™”ํ•˜๋ฉฐ, ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๋Š” ํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋””๋ฐ”์ด์Šค์—์„œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋˜๋Š” ์—ฌ๋Ÿฌ ์ปค๋„์„ ํ์— ๋„ฃ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

            # Phase 1: Local prefix sums
            comptime kernel = prefix_sum_local_phase[extended_layout, layout_2]
            ctx.enqueue_function[kernel, kernel](
                out_tensor,
                a_tensor,
                UInt(size),
                grid_dim=BLOCKS_PER_GRID_2,
                block_dim=THREADS_PER_BLOCK_2,
            )

            # Phase 2: Add block sums
            comptime kernel2 = prefix_sum_block_sum_phase[extended_layout]
            ctx.enqueue_function[kernel2, kernel2](
                out_tensor,
                UInt(size),
                grid_dim=BLOCKS_PER_GRID_2,
                block_dim=THREADS_PER_BLOCK_2,
            )

๋‘ ์ปค๋„์ด ์ˆœ์ฐจ์ ์œผ๋กœ ํ์— ๋“ค์–ด๊ฐ€์ง€๋งŒ, out_tensor๋Š” ๋‘ ์ปค๋„์˜ ์ž‘์—…์ด ๋ชจ๋‘ ๋๋‚  ๋•Œ๊นŒ์ง€ ํ˜ธ์ŠคํŠธ๋กœ ์ „์†ก๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์— ์ฃผ๋ชฉํ•˜์„ธ์š”. Mojo์˜ DeviceContext๊ฐ€ ๋‹จ์ผ ์‹คํ–‰ ์ŠคํŠธ๋ฆผ์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ, ํ์— ๋„ฃ์€ ๋ชจ๋“  ์ปค๋„์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. ํ˜ธ์ŠคํŠธ์—์„œ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „์— ๋ชจ๋“  GPU ์ž‘์—…์˜ ์™„๋ฃŒ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋Œ€๊ธฐํ•˜๋ ค๋ฉด ctx.synchronize()๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŒ

1. ๊ธฐ๋ณธ ๋ˆ„์  ํ•ฉ ์œ„์— ์Œ“์•„ ์˜ฌ๋ฆฌ๊ธฐ

๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ๋‹จ์ผ ๋ธ”๋ก ๋ˆ„์  ํ•ฉ ๊ตฌํ˜„ ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ๋ฒ•์„ ์—ฌ๋Ÿฌ ๋ธ”๋ก์—์„œ ๋™์ž‘ํ•˜๋„๋ก ํ™•์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ๋ณธ ๋ฒ„์ „ (๋‹จ์ผ ๋ธ”๋ก): [0,1,2,3,4,5,6,7] โ†’ [0,1,3,6,10,15,21,28]

์™„์„ฑ ๋ฒ„์ „ (๋‘ ๋ธ”๋ก):
Block 0: [0,1,2,3,4,5,6,7] โ†’ [0,1,3,6,10,15,21,28]
Block 1: [8,9,10,11,12,13,14] โ†’ [8,17,27,38,50,63,77]

๊ทธ๋Ÿฐ๋ฐ ๋‘ ๋ฒˆ์งธ ๋ธ”๋ก์˜ ๊ฐ’์€ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ์š”? ์ฒซ ๋ฒˆ์งธ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋ฅผ ํฌํ•จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค!

2. 2๋‹จ๊ณ„ ์ ‘๊ทผ

๊ธฐ๋ณธ ๋ˆ„์  ํ•ฉ์œผ๋กœ๋Š” ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ๋ถˆ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ, ์ž‘์—…์„ ๋‚˜๋ˆ•๋‹ˆ๋‹ค:

  1. 1๋‹จ๊ณ„: ๊ฐ ๋ธ”๋ก์ด ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐ (๊ธฐ๋ณธ ๋ฒ„์ „๊ณผ ๋™์ผ)
  2. 2๋‹จ๊ณ„: ๊ฐ ๋ธ”๋ก์ด ์ด์ „ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋ฅผ ๋ฐ˜์˜

์ฃผ์˜: barrier()๋Š” ํ•˜๋‚˜์˜ ๋ธ”๋ก ๋‚ด์—์„œ๋งŒ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ๊ณ„ ๊ฐ„์—๋Š” ํ˜ธ์ŠคํŠธ ๋ ˆ๋ฒจ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

3. ํ™•์žฅ ๋ฉ”๋ชจ๋ฆฌ ์ „๋žต

๋ธ”๋ก๋ผ๋ฆฌ ์ง์ ‘ ํ†ต์‹ ํ•  ์ˆ˜ ์—†์œผ๋ฏ€๋กœ, ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅํ•  ๊ณณ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ์ถœ๋ ฅ ๋ฒ„ํผ ๋์— ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹น
  • ๊ฐ ๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ ์ด ์ถ”๊ฐ€ ๊ณต๊ฐ„์— ์ €์žฅ
  • ํ›„์† ๋ธ”๋ก์ด ์ด ํ•ฉ๊ณ„๋ฅผ ์ฝ์–ด์„œ ์ž๊ธฐ ์›์†Œ์— ๋”ํ•จ

4. ์ฃผ์š” ๊ตฌํ˜„ ํฌ์ธํŠธ

  • ๋ ˆ์ด์•„์›ƒ ์ฐจ์ด: ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ํ˜•ํƒœ๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ํ•ญ์ƒ global_i < size๋กœ ๋ฐฐ์—ด ๋ฒ”์œ„ ํ™•์ธ
  • ์Šค๋ ˆ๋“œ ์—ญํ•  ๋ถ„๋‹ด: ํŠน์ • ์Šค๋ ˆ๋“œ(์˜ˆ: ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ)๋งŒ ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅ
  • ๋‘ ์ปค๋„ ๊ฐ„ ๋™๊ธฐํ™”: ๋‘ ๋ฒˆ์งธ ์ปค๋„์€ ๋ฐ˜๋“œ์‹œ ์ฒซ ๋ฒˆ์งธ ์ปค๋„์ด ์™„๋ฃŒ๋œ ํ›„์— ์‹คํ–‰๋˜์–ด์•ผ ํ•จ

5. ๋””๋ฒ„๊น… ์ „๋žต

๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด, 1๋‹จ๊ณ„ ์ดํ›„์˜ ์ค‘๊ฐ„ ์ƒํƒœ๋ฅผ ์‹œ๊ฐํ™”ํ•ด ๋ณด์„ธ์š”:

1๋‹จ๊ณ„ ์ดํ›„: [0,1,3,6,10,15,21,28, 8,17,27,38,50,63,77, ???,???]

์—ฌ๊ธฐ์„œ ???์—๋Š” 2๋‹จ๊ณ„์—์„œ ์‚ฌ์šฉ๋  ๋ธ”๋ก ํ•ฉ๊ณ„๊ฐ€ ๋“ค์–ด๊ฐ€์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๋ ค๋ฉด ๋จผ์ € ๋””๋ฐ”์ด์Šค์˜ ์ž‘์—… ์™„๋ฃŒ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ณด์žฅํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์„ ๊ธฐ์–ตํ•˜์„ธ์š”.

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p14 --complete
pixi run -e amd p14 --complete
pixi run -e apple p14 --complete
uv run poe p14 --complete

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0])

์†”๋ฃจ์…˜



# Kernel 1: Compute local prefix sums and store block sums in out
fn prefix_sum_local_phase[
    out_layout: Layout, in_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Load data into shared memory
    # Example with SIZE_2=15, TPB=8, BLOCKS=2:
    # Block 0 shared mem: [0,1,2,3,4,5,6,7]
    # Block 1 shared mem: [8,9,10,11,12,13,14,uninitialized]
    # Note: The last position remains uninitialized since global_i >= size,
    # but this is safe because that thread doesn't participate in computation
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # Compute local prefix sum using parallel reduction
    # This uses a tree-based algorithm with log(TPB) iterations
    # Iteration 1 (offset=1):
    #   Block 0: [0,0+1,2+1,3+2,4+3,5+4,6+5,7+6] = [0,1,3,5,7,9,11,13]
    # Iteration 2 (offset=2):
    #   Block 0: [0,1,3+0,5+1,7+3,9+5,11+7,13+9] = [0,1,3,6,10,14,18,22]
    # Iteration 3 (offset=4):
    #   Block 0: [0,1,3,6,10+0,14+1,18+3,22+6] = [0,1,3,6,10,15,21,28]
    #   Block 1 follows same pattern to get [8,17,27,38,50,63,77,???]
    offset = UInt(1)
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        var current_val: output.element_type = 0
        if local_i >= offset and local_i < TPB:
            current_val = shared[local_i - offset]  # read

        barrier()
        if local_i >= offset and local_i < TPB:
            shared[local_i] += current_val  # write

        barrier()
        offset *= 2

    # Write local results to output
    # Block 0 writes: [0,1,3,6,10,15,21,28]
    # Block 1 writes: [8,17,27,38,50,63,77,???]
    if global_i < size:
        output[global_i] = shared[local_i]

    # Store block sums in auxiliary space
    # Block 0: Thread 7 stores shared[7] == 28 at position size+0 (position 15)
    # Block 1: Thread 7 stores shared[7] == ??? at position size+1 (position 16).  This sum is not needed for the final output.
    # This gives us: [0,1,3,6,10,15,21,28, 8,17,27,38,50,63,77, 28,???]
    #                                                           โ†‘  โ†‘
    #                                                     Block sums here
    if local_i == TPB - 1:
        output[size + block_idx.x] = shared[local_i]


# Kernel 2: Add block sums to their respective blocks
fn prefix_sum_block_sum_phase[
    layout: Layout
](output: LayoutTensor[dtype, layout, MutAnyOrigin], size: UInt):
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # Second pass: add previous block's sum to each element
    # Block 0: No change needed - already correct
    # Block 1: Add Block 0's sum (28) to each element
    #   Before: [8,17,27,38,50,63,77]
    #   After: [36,45,55,66,78,91,105]
    # Final result combines both blocks:
    # [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]
    if block_idx.x > 0 and global_i < size:
        prev_block_sum = output[size + block_idx.x - 1]
        output[global_i] += prev_block_sum


์ด ์†”๋ฃจ์…˜์€ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์น˜๋Š” ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด 2๊ฐœ์˜ ์ปค๋„์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ˆ„์  ํ•ฉ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋ถ€๋ถ„์„ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์˜ ๊ณผ์ œ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ทผ๋ณธ์ ์ธ ์ œ์•ฝ์€ barrier()๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๊ฐ€ ๋ธ”๋ก ๋‚ด๋ถ€์—์„œ๋งŒ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์ณ ์žˆ์„ ๋•Œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณผ์ œ์— ์ง๋ฉดํ•ฉ๋‹ˆ๋‹ค: ๋ธ”๋ก์ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค๋ฅธ ๋ธ”๋ก์— ์–ด๋–ป๊ฒŒ ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์„๊นŒ?

๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์‹œ๊ฐํ™”

ํ…Œ์ŠคํŠธ ์ผ€์ด์Šค SIZE_2 = 15, TPB = 8์˜ ๊ฒฝ์šฐ:

Input array:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

Block 0 ์ฒ˜๋ฆฌ: [0, 1, 2, 3, 4, 5, 6, 7]
Block 1 ์ฒ˜๋ฆฌ: [8, 9, 10, 11, 12, 13, 14] (์œ ํšจ ์›์†Œ 7๊ฐœ)

๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์œ„ํ•œ ๊ณต๊ฐ„์„ ํฌํ•จํ•˜๋„๋ก ์ถœ๋ ฅ ๋ฒ„ํผ๋ฅผ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

ํ™•์žฅ ๋ฒ„ํผ: [๋ฐ์ดํ„ฐ ๊ฐ’ (15๊ฐœ)] + [๋ธ”๋ก ํ•ฉ๊ณ„ (2๊ฐœ)]
           [0...14] + [block0_sum, block1_sum]

์ด ํ™•์žฅ ๋ฒ„ํผ์˜ ํฌ๊ธฐ: EXTENDED_SIZE = SIZE_2 + num_blocks = 15 + 2 = 17

1๋‹จ๊ณ„ ์ปค๋„: ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ

๋กœ์ปฌ ๋‹จ๊ณ„์—์„œ์˜ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€

๋กœ์ปฌ ๋‹จ๊ณ„๋Š” ๊ธฐ๋ณธ ๋ฒ„์ „๊ณผ ๋™์ผํ•œ ๋ช…์‹œ์  ๋™๊ธฐํ™” ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฝ๊ธฐ-์“ฐ๊ธฐ ์ถฉ๋Œ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • ์ฝ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋จผ์ € ํ•„์š”ํ•œ ๊ฐ’์„ ๋กœ์ปฌ ๋ณ€์ˆ˜ current_val์— ์ฝ์–ด๋‘ 
  • ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ์“ฐ๊ธฐ๊ฐ€ ์‹œ์ž‘๋˜๋„๋ก ๋ณด์žฅ
  • ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก

์ด๋ฅผ ํ†ตํ•ด ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ค‘ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

Block 0 ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ฐ’ ๋กœ๋“œ:

    shared = [0, 1, 2, 3, 4, 5, 6, 7]
    
  2. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋ฐ˜๋ณต (\(\log_2(TPB) = 3\)ํšŒ ๋ฐ˜๋ณต):

    ๋ฐ˜๋ณต 1 (offset=1):

    ์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

    Tโ‚ reads shared[0] = 0    Tโ‚… reads shared[4] = 4
    Tโ‚‚ reads shared[1] = 1    Tโ‚† reads shared[5] = 5
    Tโ‚ƒ reads shared[2] = 2    Tโ‚‡ reads shared[6] = 6
    Tโ‚„ reads shared[3] = 3
    

    ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

    ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

    shared[0] = 0              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[1] = 1 + 0 = 1
    shared[2] = 2 + 1 = 3
    shared[3] = 3 + 2 = 5
    shared[4] = 4 + 3 = 7
    shared[5] = 5 + 4 = 9
    shared[6] = 6 + 5 = 11
    shared[7] = 7 + 6 = 13
    

    ๋ฐฐ๋ฆฌ์–ด ํ›„: shared = [0, 1, 3, 5, 7, 9, 11, 13]

    ๋ฐ˜๋ณต 2 (offset=2):

    ์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

    Tโ‚‚ reads shared[0] = 0    Tโ‚… reads shared[3] = 5
    Tโ‚ƒ reads shared[1] = 1    Tโ‚† reads shared[4] = 7
    Tโ‚„ reads shared[2] = 3    Tโ‚‡ reads shared[5] = 9
    

    ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

    ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

    shared[0] = 0              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[1] = 1              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[2] = 3 + 0 = 3      (๋ณ€๊ฒฝ ์—†์Œ)
    shared[3] = 5 + 1 = 6
    shared[4] = 7 + 3 = 10
    shared[5] = 9 + 5 = 14
    shared[6] = 11 + 7 = 18
    shared[7] = 13 + 9 = 22
    

    ๋ฐฐ๋ฆฌ์–ด ํ›„: shared = [0, 1, 3, 6, 10, 14, 18, 22]

    ๋ฐ˜๋ณต 3 (offset=4):

    ์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

    Tโ‚„ reads shared[0] = 0    Tโ‚† reads shared[2] = 3
    Tโ‚… reads shared[1] = 1    Tโ‚‡ reads shared[3] = 6
    

    ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

    ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

    shared[0] = 0              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[1] = 1              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[2] = 3              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[3] = 6              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[4] = 10 + 0 = 10    (๋ณ€๊ฒฝ ์—†์Œ)
    shared[5] = 14 + 1 = 15
    shared[6] = 18 + 3 = 21
    shared[7] = 22 + 6 = 28
    

    ๋ฐฐ๋ฆฌ์–ด ํ›„: shared = [0, 1, 3, 6, 10, 15, 21, 28]

  3. ๋กœ์ปฌ ๊ฒฐ๊ณผ๋ฅผ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก:

    output[0...7] = [0, 1, 3, 6, 10, 15, 21, 28]
    
  4. ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ๋ณด์กฐ ๊ณต๊ฐ„์— ์ €์žฅ (๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๋งŒ):

    output[15] = 28  # ์œ„์น˜: size + block_idx.x = 15 + 0
    

Block 1 ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ฐ’ ๋กœ๋“œ:

    shared = [8, 9, 10, 11, 12, 13, 14, ๋ฏธ์ดˆ๊ธฐํ™”]
    

    ์ฐธ๊ณ : ์Šค๋ ˆ๋“œ 7์€ global_i = 15 >= SIZE_2์ด๋ฏ€๋กœ ์•„๋ฌด๊ฒƒ๋„ ๋กœ๋“œํ•˜์ง€ ์•Š์•„ shared[7]์ด ๋ฏธ์ดˆ๊ธฐํ™” ์ƒํƒœ๋กœ ๋‚จ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ 7์€ ์ตœ์ข… ์ถœ๋ ฅ์— ์ฐธ์—ฌํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์•ˆ์ „ํ•ฉ๋‹ˆ๋‹ค.

  2. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋ฐ˜๋ณต (\(\log_2(TPB) = 3\)ํšŒ ๋ฐ˜๋ณต):

    ์‹ค์ œ๋กœ ์—ฐ์‚ฐ์— ์ฐธ์—ฌํ•˜๋Š” ๊ฒƒ์€ ์ฒ˜์Œ 7๊ฐœ ์Šค๋ ˆ๋“œ๋ฟ์ž…๋‹ˆ๋‹ค. ์„ธ ๋ฒˆ์˜ ๋ฐ˜๋ณต์„ ๊ฑฐ์น˜๋ฉด:

    shared = [8, 17, 27, 38, 50, 63, 77, ๋ฏธ์ดˆ๊ธฐํ™”]
    
  3. ๋กœ์ปฌ ๊ฒฐ๊ณผ๋ฅผ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก:

    output[8...14] = [8, 17, 27, 38, 50, 63, 77]  # ์œ ํšจ ์ถœ๋ ฅ 7๊ฐœ๋งŒ
    
  4. ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ๋ณด์กฐ ๊ณต๊ฐ„์— ์ €์žฅ (๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๋งŒ):

    output[16] = shared[7]  # ์Šค๋ ˆ๋“œ 7 (TPB-1)์ด shared[7]์˜ ๊ฐ’์„ ์ €์žฅ
    

    ์ฐธ๊ณ : ์Šค๋ ˆ๋“œ 7์€ ์œ ํšจํ•œ ์ž…๋ ฅ์„ ๋กœ๋“œํ•˜์ง€ ์•Š์•˜์ง€๋งŒ, ๋ธ”๋ก ๋‚ด ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์—๋Š” ๊ทธ๋Œ€๋กœ ์ฐธ์—ฌํ•ฉ๋‹ˆ๋‹ค. shared[7]์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ฑฐ์น˜๋ฉฐ ๊ฐฑ์‹ ๋˜์ง€๋งŒ, ๋ฏธ์ดˆ๊ธฐํ™” ์ƒํƒœ์—์„œ ์‹œ์ž‘ํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ตœ์ข… ๊ฐ’์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ Block 1์ด ๋งˆ์ง€๋ง‰ ๋ธ”๋ก์ด๋ฏ€๋กœ ์ด ๋ธ”๋ก ํ•ฉ๊ณ„๋Š” 2๋‹จ๊ณ„์—์„œ ์‚ฌ์šฉ๋˜์ง€ ์•Š์•„ ์ •ํ™•์„ฑ์—๋Š” ์˜ํ–ฅ์ด ์—†์Šต๋‹ˆ๋‹ค.

1๋‹จ๊ณ„ ์ดํ›„ ์ถœ๋ ฅ ๋ฒ„ํผ์˜ ๋‚ด์šฉ:

[0, 1, 3, 6, 10, 15, 21, 28, 8, 17, 27, 38, 50, 63, 77, 28, ???]
                                                        ^   ^
                                                ๋ธ”๋ก ํ•ฉ๊ณ„๊ฐ€ ์—ฌ๊ธฐ์— ์ €์žฅ๋จ

์ฐธ๊ณ : ๋งˆ์ง€๋ง‰ ๋ธ”๋ก ํ•ฉ๊ณ„ (???) ๋Š” ๋ฏธ์ดˆ๊ธฐํ™” ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ฐ˜ํ•˜๋ฏ€๋กœ ์˜ˆ์ธกํ•  ์ˆ˜ ์—†์ง€๋งŒ, ์ตœ์ข… ๊ฒฐ๊ณผ์—๋Š” ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”: ์‹ค์ œ๋กœ ํ•„์š”ํ•œ ์‹œ์ 

๋‘ ์ปค๋„ ๋‹จ๊ณ„๋Š” ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค:

# 1๋‹จ๊ณ„: ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ
ctx.enqueue_function[prefix_sum_local_phase[...], prefix_sum_local_phase[...]](...)

# 2๋‹จ๊ณ„: ๋ธ”๋ก ํ•ฉ๊ณ„ ๋”ํ•˜๊ธฐ (์ž๋™์œผ๋กœ 1๋‹จ๊ณ„ ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐ)
ctx.enqueue_function[prefix_sum_block_sum_phase[...], prefix_sum_block_sum_phase[...]](...)

ํ•ต์‹ฌ ํ†ต์ฐฐ: Mojo์˜ DeviceContext๋Š” ๋‹จ์ผ ์‹คํ–‰ ์ŠคํŠธ๋ฆผ(NVIDIA GPU์—์„œ๋Š” CUDA ์ŠคํŠธ๋ฆผ, AMD ROCm GPU์—์„œ๋Š” HIP ์ŠคํŠธ๋ฆผ)์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ, ํ์— ๋„ฃ์€ ์ปค๋„์ด ์ •ํ™•ํžˆ ๋„ฃ์€ ์ˆœ์„œ๋Œ€๋กœ ์‹คํ–‰๋จ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค. ์ปค๋„ ๊ฐ„์— ๋ช…์‹œ์  ๋™๊ธฐํ™”๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค.

ctx.synchronize()๊ฐ€ ํ•„์š”ํ•œ ์‹œ์ :

# ๋‘ ์ปค๋„ ์™„๋ฃŒ ํ›„, ํ˜ธ์ŠคํŠธ์—์„œ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „
ctx.synchronize()  # ํ˜ธ์ŠคํŠธ๊ฐ€ GPU ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐ

with out.map_to_host() as out_host:  # ์ด์ œ GPU ๊ฒฐ๊ณผ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ
    print("out:", out_host)

ctx.synchronize() ํ˜ธ์ถœ์˜ ์—ญํ• :

  • ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”: ๊ฒฐ๊ณผ์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ํ˜ธ์ŠคํŠธ๊ฐ€ ๋ชจ๋“  GPU ์ž‘์—…์˜ ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐํ•˜๋„๋ก ๋ณด์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ: ์—ฐ์‚ฐ์ด ๋๋‚˜๊ธฐ ์ „์— GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€

์‹คํ–‰ ๋ชจ๋ธ: ๋ธ”๋ก ๋‚ด๋ถ€์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๋Š” barrier()์™€ ๋‹ฌ๋ฆฌ, ์ปค๋„ ์‹คํ–‰ ์ˆœ์„œ๋Š” Mojo์˜ ๋‹จ์ผ ์ŠคํŠธ๋ฆผ ์‹คํ–‰ ๋ชจ๋ธ์—์„œ ๋ณด์žฅ๋˜๋ฉฐ, ctx.synchronize()๋Š” ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๊ฐ„ ์กฐ์œจ์„ ๋‹ด๋‹นํ•ฉ๋‹ˆ๋‹ค.

2๋‹จ๊ณ„ ์ปค๋„: ๋ธ”๋ก ํ•ฉ๊ณ„ ๋”ํ•˜๊ธฐ

  1. Block 0: ๋ณ€๊ฒฝ ๋ถˆํ•„์š” (์ด๋ฏธ ์˜ฌ๋ฐ”๋ฅธ ์ƒํƒœ).

  2. Block 1: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ Block 0์˜ ํ•ฉ๊ณ„๋ฅผ ์ž๊ธฐ ์›์†Œ์— ๋”ํ•จ:

    prev_block_sum = output[size + block_idx.x - 1] = output[15] = 28
    output[global_i] += prev_block_sum
    

    Block 1์˜ ๊ฐ’์ด ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค:

    Before: [8, 17, 27, 38, 50, 63, 77]
    After:  [36, 45, 55, 66, 78, 91, 105]
    

์„ฑ๋Šฅ ๋ฐ ์ตœ์ ํ™” ๊ณ ๋ ค ์‚ฌํ•ญ

์ฃผ์š” ๊ตฌํ˜„ ์ƒ์„ธ

๋กœ์ปฌ ๋‹จ๊ณ„ ๋™๊ธฐํ™” ํŒจํ„ด: ๋ธ”๋ก ๋‚ด ๊ฐ ๋ฐ˜๋ณต์€ ์—„๊ฒฉํ•œ ์ฝ๊ธฐ โ†’ ๋™๊ธฐํ™” โ†’ ์“ฐ๊ธฐ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. var current_val: output.element_type = 0 - ๋กœ์ปฌ ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
  2. current_val = shared[local_i - offset] - ์ฝ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  3. barrier() - ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ช…์‹œ์  ๋™๊ธฐํ™”
  4. shared[local_i] += current_val - ์“ฐ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  5. barrier() - ๋‹ค์Œ ๋ฐ˜๋ณต ์ „ ๋™๊ธฐํ™”

๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”: ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‘ ๋‹จ๊ณ„์˜ ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • ๋ธ”๋ก ๋‚ด๋ถ€: ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ ์ค‘ barrier()๋กœ ๊ฐ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”
  • ๋ธ”๋ก ๊ฐ„: DeviceContext๊ฐ€ ํ์— ๋„ฃ์€ ์ปค๋„์„ ์ˆœ์ฐจ ์‹คํ–‰ํ•˜์—ฌ 1๋‹จ๊ณ„๊ฐ€ 2๋‹จ๊ณ„ ์ „์— ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ. ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „์— ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•˜๋ฉด ctx.synchronize()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ๋กœ์ปฌ ๋‹จ๊ณ„์—์„œ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•˜์—ฌ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ค‘ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ๋™์‹œ์— ์ ‘๊ทผํ•  ๋•Œ ์ƒ๊ธธ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

  1. ์ž‘์—… ํšจ์œจ์„ฑ: ์ด ๊ตฌํ˜„์˜ ์ž‘์—… ๋ณต์žก๋„๋Š” \(O(n \log n)\)์ด๋ฉฐ, ์ˆœ์ฐจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ \(O(n)\)์ž…๋‹ˆ๋‹ค. ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ์ „ํ˜•์ ์ธ ๊ณต๊ฐ„-์‹œ๊ฐ„ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„์ž…๋‹ˆ๋‹ค.

  2. ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์œ„ํ•œ ์ถ”๊ฐ€ ๊ณต๊ฐ„์€ ์•„์ฃผ ์ ์Šต๋‹ˆ๋‹ค (๋ธ”๋ก๋‹น ์›์†Œ ํ•˜๋‚˜).

์ด 2๊ฐœ ์ปค๋„ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์ด ํ•„์š”ํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ณธ ํŒจํ„ด์ž…๋‹ˆ๋‹ค. ๊ธฐ์ˆ˜ ์ •๋ ฌ, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ณ„์‚ฐ, ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ ๋“ฑ ๋‹ค๋ฅธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—๋„ ๋™์ผํ•œ ์ „๋žต์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Puzzle 15: ์ถ• ํ•ฉ๊ณ„

๊ฐœ์š”

2D ํ–‰๋ ฌ a์˜ ๊ฐ ํ–‰์— ๋Œ€ํ•ด ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ LayoutTensor๋ฅผ ์‚ฌ์šฉํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ถ• ํ•ฉ๊ณ„ ์‹œ๊ฐํ™” ์ถ• ํ•ฉ๊ณ„ ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ํ–‰๋ ฌ ์ฐจ์› ๋ฐฉํ–ฅ์˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ์ด์šฉํ•œ ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
  • ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด
  • ๋‹ค์ฐจ์› ํ…์„œ ๋ ˆ์ด์•„์›ƒ ๋‹ค๋ฃจ๊ธฐ

ํ•ต์‹ฌ์€ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ํ–‰๋ ฌ์˜ ํ–‰์— ๋งคํ•‘ํ•˜๊ณ , LayoutTensor์˜ ์ฐจ์›๋ณ„ ์ธ๋ฑ์‹ฑ์„ ํ™œ์šฉํ•˜๋ฉด์„œ ๊ฐ ๋ธ”๋ก ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{BATCH} \times \text{SIZE} = 4 \times 6\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} = 8\)
  • ๊ทธ๋ฆฌ๋“œ ํฌ๊ธฐ: \(1 \times \text{BATCH}\)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \(\text{TPB}\)๊ฐœ ์›์†Œ
  • ์ž…๋ ฅ ๋ ˆ์ด์•„์›ƒ: Layout.row_major(BATCH, SIZE)
  • ์ถœ๋ ฅ ๋ ˆ์ด์•„์›ƒ: Layout.row_major(BATCH, 1)

ํ–‰๋ ฌ ์‹œ๊ฐํ™”:

Row 0: [0, 1, 2, 3, 4, 5]       โ†’ Block(0,0)
Row 1: [6, 7, 8, 9, 10, 11]     โ†’ Block(0,1)
Row 2: [12, 13, 14, 15, 16, 17] โ†’ Block(0,2)
Row 3: [18, 19, 20, 21, 22, 23] โ†’ Block(0,3)

์™„์„ฑํ•  ์ฝ”๋“œ

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


comptime TPB = 8
comptime BATCH = 4
comptime SIZE = 6
comptime BLOCKS_PER_GRID = (1, BATCH)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime in_layout = Layout.row_major(BATCH, SIZE)
comptime out_layout = Layout.row_major(BATCH, 1)


fn axis_sum[
    in_layout: Layout, out_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    batch = block_idx.y
    # FILL ME IN (roughly 15 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p15/p15.mojo

ํŒ
  1. batch = block_idx.y๋กœ ํ–‰ ์„ ํƒ
  2. ์›์†Œ ๋กœ๋“œ: cache[local_i] = a[batch, local_i]
  3. ์ŠคํŠธ๋ผ์ด๋“œ๋ฅผ ์ ˆ๋ฐ˜์”ฉ ์ค„์ด๋ฉฐ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ˆ˜ํ–‰
  4. ์Šค๋ ˆ๋“œ 0์ด ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ output[batch]์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p15
pixi run -e amd p15
pixi run -e apple p15
uv run poe p15

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([15.0, 51.0, 87.0, 123.0])

์†”๋ฃจ์…˜

fn axis_sum[
    in_layout: Layout, out_layout: Layout
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    batch = block_idx.y
    cache = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Visualize:
    # Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
    # Block(0,1): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 1: [6,7,8,9,10,11]
    # Block(0,2): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 2: [12,13,14,15,16,17]
    # Block(0,3): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 3: [18,19,20,21,22,23]

    # each row is handled by each block bc we have grid_dim=(1, BATCH)

    if local_i < size:
        cache[local_i] = a[batch, local_i]
    else:
        # Add zero-initialize padding elements for later reduction
        cache[local_i] = 0

    barrier()

    # do reduction sum per each block
    stride = UInt(TPB // 2)
    while stride > 0:
        # Read phase: all threads read the values they need first to avoid race conditions
        var temp_val: output.element_type = 0
        if local_i < stride:
            temp_val = cache[local_i + stride]

        barrier()

        # Write phase: all threads safely write their computed values
        if local_i < stride:
            cache[local_i] += temp_val

        barrier()
        stride //= 2

    # writing with local thread = 0 that has the sum for each batch
    if local_i == 0:
        output[batch, 0] = cache[0]


LayoutTensor๋ฅผ ํ™œ์šฉํ•ด 2D ํ–‰๋ ฌ์˜ ํ–‰ ๋ฐฉํ–ฅ ํ•ฉ๊ณ„๋ฅผ ๋ณ‘๋ ฌ๋กœ ๊ตฌํ•˜๋Š” ๋ฆฌ๋•์…˜ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

ํ–‰๋ ฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ๋ธ”๋ก ๋งคํ•‘

Input Matrix (4ร—6) with LayoutTensor:                Block Assignment:
[[ a[0,0]  a[0,1]  a[0,2]  a[0,3]  a[0,4]  a[0,5] ] โ†’ Block(0,0)
 [ a[1,0]  a[1,1]  a[1,2]  a[1,3]  a[1,4]  a[1,5] ] โ†’ Block(0,1)
 [ a[2,0]  a[2,1]  a[2,2]  a[2,3]  a[2,4]  a[2,5] ] โ†’ Block(0,2)
 [ a[3,0]  a[3,1]  a[3,2]  a[3,3]  a[3,4]  a[3,5] ] โ†’ Block(0,3)

๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ณผ์ •

  1. ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ:

    Block(0,0): cache = [a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] * *]  // * = ํŒจ๋”ฉ
    Block(0,1): cache = [a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] * *]
    Block(0,2): cache = [a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] * *]
    Block(0,3): cache = [a[3,0] a[3,1] a[3,2] a[3,3] a[3,4] a[3,5] * *]
    
  2. ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ (Block 0,0 ๊ธฐ์ค€):

    Initial:  [0  1  2  3  4  5  0  0]   (ํŒจ๋”ฉ์€ 0์œผ๋กœ ์ดˆ๊ธฐํ™”)
    Stride 4: [4  6  2  3  4  5  0  0]
    Stride 2: [6  9  2  3  4  5  0  0]
    Stride 1: [15 9  2  3  4  5  0  0]
    

์ฃผ์š” ๊ตฌํ˜„ ํŠน์ง•

  1. ๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

    • ์ž…๋ ฅ: ํ–‰ ์šฐ์„ (row-major) ๋ ˆ์ด์•„์›ƒ (BATCH ร— SIZE)
    • ์ถœ๋ ฅ: ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ (BATCH ร— 1)
    • ๊ฐ ๋ธ”๋ก์ด ํ•˜๋‚˜์˜ ํ–‰ ์ „์ฒด๋ฅผ ์ฒ˜๋ฆฌ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ์ž…๋ ฅ์— LayoutTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: a[batch, local_i]
    • ํšจ์œจ์ ์ธ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
    • ์ถœ๋ ฅ์— LayoutTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: output[batch, 0]
  3. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋กœ์ง:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            cache[local_i] += cache[local_i + stride]
        barrier()
        stride //= 2
    

    ์ฐธ๊ณ : ์ด ๊ตฌํ˜„์—์„œ๋Š” ๊ฐ™์€ ๋ฐ˜๋ณต ๋‚ด์—์„œ ์Šค๋ ˆ๋“œ๋“ค์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋™์‹œ์— ์ฝ๊ณ  ์“ฐ๊ธฐ ๋•Œ๋ฌธ์— ์ž ์žฌ์ ์ธ ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋” ์•ˆ์ „ํ•œ ๋ฐฉ๋ฒ•์€ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋‹จ๊ณ„๋ฅผ ๋ถ„๋ฆฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

    stride = TPB // 2
    while stride > 0:
        var temp_val: output.element_type = 0
        if local_i < stride:
            temp_val = cache[local_i + stride]  # ์ฝ๊ธฐ ๋‹จ๊ณ„
        barrier()
        if local_i < stride:
            cache[local_i] += temp_val  # ์“ฐ๊ธฐ ๋‹จ๊ณ„
        barrier()
        stride //= 2
    
  4. ์ถœ๋ ฅ ๊ธฐ๋ก:

    if local_i == 0:
        output[batch, 0] = cache[0]  # ๋ฐฐ์น˜๋‹น ๊ฒฐ๊ณผ ํ•˜๋‚˜
    

์„ฑ๋Šฅ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ:

    • LayoutTensor๋ฅผ ํ†ตํ•œ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
    • ๋น ๋ฅธ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
    • ํ–‰ ๊ฒฐ๊ณผ๋‹น ํ•œ ๋ฒˆ์˜ ์“ฐ๊ธฐ
  2. ์Šค๋ ˆ๋“œ ํ™œ์šฉ:

    • ํ–‰ ๊ฐ„ ์™„๋ฒฝํ•œ ๋ถ€ํ•˜ ๊ท ํ˜•
    • ์ฃผ์š” ์—ฐ์‚ฐ์—์„œ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ์—†์Œ
    • ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด
  3. ๋™๊ธฐํ™”:

    • ์ตœ์†Œํ•œ์˜ ๋ฐฐ๋ฆฌ์–ด (๋ฆฌ๋•์…˜ ์ค‘์—๋งŒ ์‚ฌ์šฉ)
    • ํ–‰ ๊ฐ„ ๋…๋ฆฝ์ ์ธ ์ฒ˜๋ฆฌ
    • ๋ธ”๋ก ๊ฐ„ ํ†ต์‹  ๋ถˆํ•„์š”
    • ๊ฒฝ์Ÿ ์ƒํƒœ ๊ณ ๋ ค์‚ฌํ•ญ: ํ˜„์žฌ ๊ตฌํ˜„์—์„œ๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ค‘์— ์ฝ๊ธฐ-์“ฐ๊ธฐ ์ถฉ๋Œ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋ช…์‹œ์ ์ธ ์ฝ๊ธฐ-์“ฐ๊ธฐ ๋‹จ๊ณ„ ๋ถ„๋ฆฌ๋กœ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

๋ณต์žก๋„ ๋ถ„์„

  • ์‹œ๊ฐ„: ํ–‰๋‹น \(O(\log n)\), n์€ ํ–‰์˜ ๊ธธ์ด
  • ๊ณต๊ฐ„: ๋ธ”๋ก๋‹น \(O(TPB)\) ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
  • ์ „์ฒด ๋ณ‘๋ ฌ ์‹œ๊ฐ„: ์Šค๋ ˆ๋“œ๊ฐ€ ์ถฉ๋ถ„ํ•  ๋•Œ \(O(\log n)\)

Puzzle 16: ํ–‰๋ ฌ ๊ณฑ์…ˆ (MatMul)

๊ฐœ์š”

ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ ๊ณผํ•™ ๊ณ„์‚ฐ, ๋จธ์‹  ๋Ÿฌ๋‹, ๊ทธ๋ž˜ํ”ฝ์Šค์—์„œ ๊ฐ€์žฅ ๊ธฐ๋ณธ์ด ๋˜๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ๋‘ ํ–‰๋ ฌ \(A\)์™€ \(B\)๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์ด๋“ค์˜ ๊ณฑ \(C = A \times B\) ๋ฅผ ๊ตฌํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.

ํ–‰๋ ฌ \(A_{m\times k}\)์™€ \(B_{k\times n}\)์— ๋Œ€ํ•ด, ๊ฒฐ๊ณผ \(C_{m\times n}\)์˜ ๊ฐ ์›์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค:

\[\Large C_{ij} = \sum_{l=0}^{k-1} A_{il} \cdot B_{lj} \]
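
이 정의를 그대로 옮기면 3중 루프가 됩니다. 아래는 파이썬 스케치이며, 예시 입력 값(a[i] = i, b[i] = 2i)은 뒤에 나오는 기대 출력 [4, 6, 12, 22]에서 역산한 가정입니다:

```python
# C[i][j] = sum_l A[i][l] * B[l][j] 를 그대로 구현한 3중 루프
def matmul(A, B):
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for l in range(k):
                C[i][j] += A[i][l] * B[l][j]
    return C

# 2x2 예시 (행 우선 저장 기준 a[i] = i, b[i] = 2*i 로 가정)
A = [[0.0, 1.0], [2.0, 3.0]]
B = [[0.0, 2.0], [4.0, 6.0]]
print(matmul(A, B))  # [[4.0, 6.0], [12.0, 22.0]]
```

GPU 커널은 이 중 가장 바깥 두 루프(i, j)를 스레드에 분배하고, 안쪽 l 루프만 각 스레드가 수행합니다.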

ํ–‰๋ ฌ ๊ณฑ์…ˆ ์‹œ๊ฐํ™” ํ–‰๋ ฌ ๊ณฑ์…ˆ ์‹œ๊ฐํ™”

์ด ํผ์ฆ์—์„œ๋Š” GPU์—์„œ ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ•˜๋Š” ์—ฌ๋Ÿฌ ์ ‘๊ทผ๋ฒ•์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ๊ฐ ๋ฒ„์ „์€ ์„œ๋กœ ๋‹ค๋ฅธ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ์ง๊ด€์ ์ธ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ดํ•ดํ•˜๊ธฐ ์‰ฝ์ง€๋งŒ, ์ค‘๋ณต๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋งŽ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „ ์ž…๋ ฅ ํ–‰๋ ฌ์˜ ๋ธ”๋ก์„ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ ค ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์ž…๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ๊ฐ™์ง€๋งŒ, ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ฝ์Šต๋‹ˆ๋‹ค.

  • ํƒ€์ผ๋ง ๋ฒ„์ „ ์—ฐ์‚ฐ์„ ํƒ€์ผ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„์–ด ์Šค๋ ˆ๋“œ๋“ค์ด ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ๋ธ”๋ก์„ ํ•จ๊ป˜ ๋กœ๋“œํ•˜๊ณ  ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ์„ ํ•œ์ธต ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

๊ฐ ๋ฒ„์ „์€ ์ด์ „ ๋ฒ„์ „ ์œ„์— ์Œ“์•„ ์˜ฌ๋ฆฌ๋ฉด์„œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ์ƒˆ๋กœ์šด ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ์ „๋žต์ด ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋‹จ๊ณ„์  ์ง„ํ–‰ ๊ณผ์ •์€ GPU ์ตœ์ ํ™”์˜ ๋Œ€ํ‘œ์ ์ธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ์ •ํ™•ํ•˜์ง€๋งŒ ๋‹จ์ˆœํ•œ ๊ธฐ๋ณธ ๊ตฌํ˜„์—์„œ ์ถœ๋ฐœ
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ค„์ด๊ธฐ
  3. ํƒ€์ผ๋ง์œผ๋กœ ๋ฐ์ดํ„ฐ ์ง€์—ญ์„ฑ๊ณผ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ๊ฐœ์„ 
  4. ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ฅผ ํ™œ์šฉํ•˜๋ฉด์„œ๋„ ์„ฑ๋Šฅ ์œ ์ง€

์›ํ•˜๋Š” ๋ฒ„์ „์„ ๊ณจ๋ผ ํ–‰๋ ฌ ๊ณฑ์…ˆ ์—ฌ์ •์„ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”!

์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „

๊ฐœ์š”

์ •๋ฐฉ ํ–‰๋ ฌ \(A\)์™€ \(B\)๋ฅผ ๊ณฑํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ \(\text{output}\)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์œ„ํ•œ 2D ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ํ–‰ ์šฐ์„ (row-major) ๋ ˆ์ด์•„์›ƒ์—์„œ์˜ ํ–‰๋ ฌ ์ธ๋ฑ์‹ฑ
  • ์Šค๋ ˆ๋“œ์™€ ์ถœ๋ ฅ ์›์†Œ ๊ฐ„ ๋งคํ•‘

ํ•ต์‹ฌ์€ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ํ–‰๋ ฌ ์›์†Œ์— ๋งคํ•‘ํ•˜๊ณ , ๋‚ด์ ์„ ๋ณ‘๋ ฌ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(1 \times 1\)

๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

  • ์ž…๋ ฅ A: Layout.row_major(SIZE, SIZE)
  • ์ž…๋ ฅ B: Layout.row_major(SIZE, SIZE)
  • ์ถœ๋ ฅ: Layout.row_major(SIZE, SIZE)

์™„์„ฑํ•  ์ฝ”๋“œ

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor


comptime TPB = 3
comptime SIZE = 2
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, TPB)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 6 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p16/p16.mojo

ํŒ
  1. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋กœ row์™€ col ๊ณ„์‚ฐ
  2. ์ธ๋ฑ์Šค๊ฐ€ size ๋ฒ”์œ„ ์•ˆ์— ์žˆ๋Š”์ง€ ํ™•์ธ
  3. ๋กœ์ปฌ ๋ณ€์ˆ˜์— ๊ณฑ์˜ ํ•ฉ ๋ˆ„์ 
  4. ์ตœ์ข… ํ•ฉ์„ ์˜ฌ๋ฐ”๋ฅธ ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p16 --naive
pixi run -e amd p16 --naive
pixi run -e apple p16 --naive
uv run poe p16 --naive

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])

์†”๋ฃจ์…˜

fn naive_matmul[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x

    if row < size and col < size:
        var acc: output.element_type = 0

        @parameter
        for k in range(size):
            acc += a[row, k] * b[k, col]

        output[row, col] = acc


LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

ํ–‰๋ ฌ ๋ ˆ์ด์•„์›ƒ (2ร—2 ์˜ˆ์‹œ)

Matrix A:          Matrix B:                   Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]             [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]             [c[1,0] c[1,1]]

๊ตฌํ˜„ ์ƒ์„ธ

  1. ์Šค๋ ˆ๋“œ ๋งคํ•‘:

    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑ: a[row, k]
    • ์ „์น˜ ์ ‘๊ทผ: b[k, col]
    • ์ถœ๋ ฅ ๊ธฐ๋ก: output[row, col]
  3. ์—ฐ์‚ฐ ํ๋ฆ„:

    # var๋กœ ๊ฐ€๋ณ€ ๋ˆ„์  ๋ณ€์ˆ˜๋ฅผ ์„ ์–ธํ•˜๊ณ  ํ…์„œ์˜ ์›์†Œ ํƒ€์ž…์„ ์‚ฌ์šฉ
    var acc: output.element_type = 0
    
    # @parameter๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ
    @parameter
    for k in range(size):
        acc += a[row, k] * b[k, col]
    

์ฃผ์š” ์–ธ์–ด ๊ธฐ๋Šฅ

  1. ๋ณ€์ˆ˜ ์„ ์–ธ:

    • var acc: output.element_type = 0์—์„œ var๋กœ ๊ฐ€๋ณ€ ๋ณ€์ˆ˜๋ฅผ ์„ ์–ธํ•˜๊ณ , output.element_type์œผ๋กœ ์ถœ๋ ฅ ํ…์„œ์™€ ๋™์ผํ•œ ํƒ€์ž…์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค
    • ๋ˆ„์  ์—ฐ์‚ฐ ์ „์— 0์œผ๋กœ ์ดˆ๊ธฐํ™”
  2. ๋ฃจํ”„ ์ตœ์ ํ™”:

    • @parameter ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„์— ๋ฃจํ”„ ์ „๊ฐœ
    • ํฌ๊ธฐ๊ฐ€ ์ž‘๊ณ  ๋ฏธ๋ฆฌ ์•Œ๋ ค์ง„ ํ–‰๋ ฌ์—์„œ ์„ฑ๋Šฅ ํ–ฅ์ƒ
    • ๋” ๋‚˜์€ ๋ช…๋ น์–ด ์Šค์ผ€์ค„๋ง ๊ฐ€๋Šฅ

์„ฑ๋Šฅ ํŠน์„ฑ

  1. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ 2 x SIZEํšŒ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฝ์Œ
    • ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ 1ํšŒ
    • ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ์—†์Œ
  2. ์—ฐ์‚ฐ ํšจ์œจ:

    • ๋‹จ์ˆœํ•œ ๊ตฌํ˜„์ด์ง€๋งŒ ์„ฑ๋Šฅ์€ ์ตœ์ ์ด ์•„๋‹˜
    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ค‘๋ณต์œผ๋กœ ๋งŽ์ด ์ฝ์Œ
    • ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์ง€ ์•Š์Œ
  3. ํ•œ๊ณ„:

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ๋งŽ์ด ์†Œ๋ชจ
    • ๋‚ฎ์€ ๋ฐ์ดํ„ฐ ์ง€์—ญ์„ฑ
    • ํฐ ํ–‰๋ ฌ๋กœ ๊ฐˆ์ˆ˜๋ก ํ™•์žฅ์„ฑ ๋ถ€์กฑ

์ด ๊ธฐ๋ณธ ๊ตฌํ˜„์€ GPU ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ์ค€์ ์œผ๋กœ, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ตœ์ ํ™”ํ•ด์•ผ ํ•˜๋Š” ์ด์œ ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

GPU ์„ฑ๋Šฅ ์ดํ•ดํ•˜๊ธฐ: ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ

๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ–ˆ์œผ๋‹ˆ, ์ด๋Ÿฐ ๊ถ๊ธˆ์ฆ์ด ์ƒ๊ธธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: ์šฐ๋ฆฌ ์ปค๋„์€ ์‹ค์ œ๋กœ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋™์ž‘ํ•˜๊ณ  ์žˆ์„๊นŒ? GPU์˜ ์—ฐ์‚ฐ ๋Šฅ๋ ฅ์— ์˜ํ•ด ์ œํ•œ๋˜๋Š” ๊ฑธ๊นŒ, ์•„๋‹ˆ๋ฉด ๋‹ค๋ฅธ ๋ฌด์–ธ๊ฐ€๊ฐ€ ๋ฐœ๋ชฉ์„ ์žก๊ณ  ์žˆ๋Š” ๊ฑธ๊นŒ?

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ(์—ญ์ฃผ: ๋ฃจํ”„๋ผ์ธ์€ โ€œ์ƒํ•œ์„ โ€œ์ด๋ผ๋Š” ๋œป์œผ๋กœ, ์„ฑ๋Šฅ์ด ๋„˜์„ ์ˆ˜ ์—†๋Š” ํ•œ๊ณ„๋ฅผ ์ง€๋ถ• ์„ ์— ๋น„์œ ํ•œ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค)์€ GPU ์ตœ์ ํ™”์˜ ๋‚˜์นจ๋ฐ˜์ž…๋‹ˆ๋‹ค. ์ปค๋„์˜ ์„ฑ๋Šฅ์„ ์ œํ•œํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๋ณ‘๋ชฉ์ด ๋ฌด์—‡์ธ์ง€ ์•Œ๋ ค์ฃผ๊ณ , ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ์ตœ์ ํ™” ๋ฐฉํ–ฅ์œผ๋กœ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค. ๊ฐ์œผ๋กœ ๊ฐœ์„ ํ•˜๋Š” ๋Œ€์‹ , ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์ด ์ •ํ™•ํžˆ ์–ด๋””์— ์ง‘์ค‘ํ•ด์•ผ ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

1. ๋ชจ๋“  GPU ์ปค๋„์˜ ๋‘ ๊ฐ€์ง€ ์„ฑ๋Šฅ ์ƒํ•œ

๋ชจ๋“  GPU ์ปค๋„์€ ๋‘ ๊ฐ€์ง€ ๊ทผ๋ณธ์ ์ธ ์ œ์•ฝ ์•„๋ž˜์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค:

  • ์—ฐ์‚ฐ ์ƒํ•œ(compute ceiling) โ€“ ์ฝ”์–ด๊ฐ€ ๋ถ€๋™์†Œ์ˆ˜์  ์—ฐ์‚ฐ์„ ์–ผ๋งˆ๋‚˜ ๋น ๋ฅด๊ฒŒ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€ (์ตœ๋Œ€ FLOPs/s)
  • ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ(memory ceiling) โ€“ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์ด ์ฝ”์–ด์— ๋ฐ์ดํ„ฐ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋น ๋ฅด๊ฒŒ ๊ณต๊ธ‰ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€ (์ตœ๋Œ€ bytes/s)

์–ด๋–ค ์ƒํ•œ์ด ์ปค๋„์„ ์ œ์•ฝํ•˜๋Š”์ง€ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์ด ์ตœ์ ํ™” ์ „๋žต์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์€ ๋‘ ๊ฐ€์ง€ ํ•ต์‹ฌ ์ง€ํ‘œ๋ฅผ ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„ํ•˜์—ฌ ์ด ๊ด€๊ณ„๋ฅผ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค:

X์ถ•: ์‚ฐ์ˆ  ๊ฐ•๋„(Arithmetic Intensity) โ€“ ๋ฐ์ดํ„ฐ 1๋ฐ”์ดํŠธ๋‹น ์ˆ˜ํ–‰ํ•˜๋Š” ์—ฐ์‚ฐ๋Ÿ‰

\[\Large I = \frac{\text{Total FLOPs}}{\text{Total Bytes from Memory}} \quad [\text{FLOP/B}]\]

Y์ถ•: ์‹ค์ธก ์„ฑ๋Šฅ(Sustained Performance) โ€“ ์ปค๋„์ด ์‹ค์ œ๋กœ ๋‹ฌ์„ฑํ•˜๋Š” ์†๋„

\[\Large P_{\text{sustained}} = \frac{\text{Total FLOPs}}{\text{Elapsed Time}} \quad [\text{GFLOP/s}]\]

๋‘ ๊ฐœ์˜ โ€œ์ƒํ•œ(roof)โ€œ์ด ๋‹ฌ์„ฑ ๊ฐ€๋Šฅํ•œ ์„ฑ๋Šฅ์˜ ์ƒํ•œ์„ ์ •ํ•ฉ๋‹ˆ๋‹ค:

์ƒํ•œ์ˆ˜์‹์˜๋ฏธ
๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ\(P = B_{\text{peak}} \cdot I\)๊ธฐ์šธ์–ด์ง„ ์ง์„ . ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์˜ํ•ด ์„ฑ๋Šฅ์ด ์ œํ•œ๋จ
์—ฐ์‚ฐ ์ƒํ•œ\(P = P_{\text{peak}}\)์ˆ˜ํ‰์„ . ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์˜ํ•ด ์„ฑ๋Šฅ์ด ์ œํ•œ๋จ

์ž„๊ณ„ ๊ฐ•๋„(critical intensity)

\[\Large I^* = \frac{P_{\text{peak}}}{B_{\text{peak}}}\]

๋Š” ์ปค๋„์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ(\(I < I^\ast\) )์—์„œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ(\(I > I^\ast\) )๋กœ ์ „ํ™˜๋˜๋Š” ์ง€์ ์ž…๋‹ˆ๋‹ค.

2. ํ•˜๋“œ์›จ์–ด ์˜ˆ์‹œ: NVIDIA A100 ์‚ฌ์–‘

์ด๋ก ์„ NVIDIA A100์˜ ๊ตฌ์ฒด์ ์ธ ์ˆซ์ž๋กœ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

์ตœ๋Œ€ FP32 ์ฒ˜๋ฆฌ๋Ÿ‰ \[\Large P_{\text{peak}} = 19.5 \text{ TFLOP/s} = 19{,}500 \text{ GFLOP/s}\]

์ตœ๋Œ€ HBM2 ๋Œ€์—ญํญ \[\Large B_{\text{peak}} = 1{,}555 \text{ GB/s}\]

์ž„๊ณ„ ๊ฐ•๋„ \[\Large I^* = \frac{19{,}500}{1{,}555} \approx 12.5 \text{ FLOP/B}\]

์ถœ์ฒ˜: NVIDIA A100 Tensor Core GPU Architecture

์ด๋Š” ์‚ฐ์ˆ  ๊ฐ•๋„๊ฐ€ 12.5 FLOP/B ๋ฏธ๋งŒ์ธ ์ปค๋„์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ, ๊ทธ ์ด์ƒ์ธ ์ปค๋„์€ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ž„์„ ๋œปํ•ฉ๋‹ˆ๋‹ค.

3. ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ตฌํ˜„์˜ ์‹œ๊ฐํ™”

์•„๋ž˜ ์• ๋‹ˆ๋ฉ”์ด์…˜์€ ์ด ํผ์ฆ์˜ ๊ตฌํ˜„๋“ค์ด A100์˜ ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์— ์–ด๋–ป๊ฒŒ ๋Œ€์‘ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ ์‹œ๊ฐํ™”

์ด ์‹œ๊ฐํ™”๋Š” ์ด ํผ์ฆ์—์„œ ๊ฑฐ์น˜๊ฒŒ ๋  ์ตœ์ ํ™” ๊ณผ์ •์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ํ•˜๋“œ์›จ์–ด ์ œ์•ฝ โ€“ ๋นจ๊ฐ„์ƒ‰ ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ๊ณผ ํŒŒ๋ž€์ƒ‰ ์—ฐ์‚ฐ ์ƒํ•œ์ด ์„ฑ๋Šฅ ํ•œ๊ณ„๋ฅผ ์ •์˜
  2. ์ถœ๋ฐœ์  โ€“ ๊ธฐ๋ณธ ๊ตฌํ˜„(์ฃผํ™ฉ์ƒ‰ ์ )์ด ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ ์œ„์— ์œ„์น˜
  3. ์ตœ์ ํ™” ๋ชฉํ‘œ โ€“ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „(์ฒญ๋ก์ƒ‰ ์ )์œผ๋กœ ์‚ฐ์ˆ  ๊ฐ•๋„๊ฐ€ ๊ฐœ์„ ๋จ
  4. ๊ถ๊ทน์  ๋ชฉํ‘œ โ€“ ๊ธˆ์ƒ‰ ํ™”์‚ดํ‘œ๋Š” ์ปค๋„์ด ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๊ฐ€ ๋˜๋Š” ์ž„๊ณ„ ๊ฐ•๋„ ์ง€์ ์„ ๊ฐ€๋ฆฌํ‚ด

4. ๊ธฐ๋ณธ ๊ตฌํ˜„ ๋ถ„์„

์ด์ „ ์„น์…˜์˜ ๊ธฐ๋ณธ ์ปค๋„์ด ์™œ ์ด๋Ÿฐ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. \(2 \times 2\) ํ–‰๋ ฌ ๊ณฑ์…ˆ์˜ ๊ฒฝ์šฐ:

์ถœ๋ ฅ ์›์†Œ๋‹น ์—ฐ์‚ฐ๋Ÿ‰: \(\text{SIZE} + (\text{SIZE}-1) = 3 \text{ FLOPs }\)

๊ฐ ์›์†Œ์—๋Š” \(\text{SIZE}\) ํšŒ์˜ ๊ณฑ์…ˆ๊ณผ \(\text{SIZE} - 1\) ํšŒ์˜ ๋ง์…ˆ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค: \[C_{00} = A_{00} \cdot B_{00} + A_{01} \cdot B_{10}\] \(\text{SIZE} = 2\) ์ผ ๋•Œ ๊ณฑ์…ˆ 2ํšŒ + ๋ง์…ˆ 1ํšŒ = 3 FLOPs

์ถœ๋ ฅ ์›์†Œ๋‹น ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

  • ํ–‰๋ ฌ A์˜ ํ–‰: \(2 \times 4 = 8\) bytes (FP32)
  • ํ–‰๋ ฌ B์˜ ์—ด: \(2 \times 4 = 8\) bytes (FP32)
  • ํ•ฉ๊ณ„: ์ถœ๋ ฅ ์›์†Œ๋‹น \(16\) bytes

์‚ฐ์ˆ  ๊ฐ•๋„: \[\Large I_{\text{naive}} = \frac{3 \text{ FLOPs}}{16 \text{ bytes}} = 0.1875 \text{ FLOP/B}\]

์ด ์‚ฐ์ˆ  ๊ฐ•๋„๋Š” A100์˜ ์ž„๊ณ„ ๊ฐ•๋„์— ํ•œ์ฐธ ๋ชป ๋ฏธ์น˜๋ฏ€๋กœ, ๊ธฐ๋ณธ ์ปค๋„์€ ์‹ฌ๊ฐํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

\[\Large I_{\text{naive}} = 0.1875 \ll I^* = 12.5\]

์˜ˆ์ƒ ์„ฑ๋Šฅ: \[\Large P \approx B_{\text{peak}} \times I_{\text{naive}} = 1{,}555 \times 0.1875 \approx 292 \text{ GFLOP/s}\]

์ด๋Š” GPU ์—ฐ์‚ฐ ์ž ์žฌ๋ ฅ์˜ \(\frac{292}{19{,}500} \approx 1.5\%\) ์— ๋ถˆ๊ณผํ•ฉ๋‹ˆ๋‹ค! ์‹œ๊ฐํ™”์—์„œ ๋…ธ๋ž€์ƒ‰ ์ ์ด ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ ์œ„์— ๋†“์ธ ๊ฒƒ์ด ์ด๋ฅผ ์ž˜ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค โ€” ์—ฐ์‚ฐ ์ƒํ•œ์—๋Š” ํ•œ์ฐธ ๋ฏธ์น˜์ง€ ๋ชปํ•˜๋Š” ์ˆ˜์ค€์ž…๋‹ˆ๋‹ค.

5. ๋‹ค์Œ ๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์ด ์•Œ๋ ค์ฃผ๋Š” ์ตœ์ ํ™” ์ „๋žต์€ ๋ช…ํ™•ํ•ฉ๋‹ˆ๋‹ค: ์ค‘๋ณต ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์—ฌ ์‚ฐ์ˆ  ๊ฐ•๋„๋ฅผ ๋†’์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๋ฒ•์ด ๋ฐ”๋กœ ์ด๋ฅผ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์ด์ :

  • ํ˜‘๋ ฅ์  ๋กœ๋”ฉ: ์Šค๋ ˆ๋“œ๋“ค์ด ํ•จ๊ป˜ ํ–‰๋ ฌ ๋ธ”๋ก์„ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ
  • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ: ๋กœ๋“œํ•œ ์›์†Œ ํ•˜๋‚˜๋ฅผ ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์— ํ™œ์šฉ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ: ๋А๋ฆฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•œ ์ ‘๊ทผ ํšŸ์ˆ˜ ๊ฐ์†Œ

์‚ฐ์ˆ  ๊ฐ•๋„ ๊ฐœ์„  ์˜ˆ์ƒ์น˜: \[\Large I_{\text{shared}} = \frac{12 \text{ FLOPs}}{32 \text{ bytes}} = 0.375 \text{ FLOP/B}\]

์ž‘์€ \(2 \times 2\) ๊ทœ๋ชจ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ด์ง€๋งŒ, ์ด 2๋ฐฐ์˜ ์‚ฐ์ˆ  ๊ฐ•๋„ ํ–ฅ์ƒ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ํ›จ์”ฌ ๋” ๋งŽ์ด ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํฐ ํ–‰๋ ฌ์—์„œ ๊ทน์ ์ธ ํšจ๊ณผ๋ฅผ ๋ฐœํœ˜ํ•ฉ๋‹ˆ๋‹ค.

6. ๋ฃจํ”„๋ผ์ธ์ด ์•Œ๋ ค์ฃผ๋Š” ์ตœ์ ํ™” ์ „๋žต

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์€ ํ˜„์žฌ ์„ฑ๋Šฅ์„ ์ง„๋‹จํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ์ตœ์ ํ™” ๋ฐฉํ–ฅ๊นŒ์ง€ ์•Œ๋ ค์ค๋‹ˆ๋‹ค. ์ดํ›„ ํผ์ฆ์—์„œ ์‚ดํŽด๋ณผ ํ•ต์‹ฌ ๊ธฐ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

๊ธฐ๋ฒ•๋ฃจํ”„๋ผ์ธ ํšจ๊ณผ๊ตฌํ˜„ ๋ฐฉ๋ฒ•
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ๋ง๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์œผ๋กœ ์‚ฐ์ˆ  ๊ฐ•๋„ โ†‘ํ˜‘๋ ฅ์  ๋กœ๋”ฉ, ๋ธ”๋ก ๋‹จ์œ„ ์—ฐ์‚ฐ
๋ ˆ์ง€์Šคํ„ฐ ๋ธ”๋กœํ‚น๋ ˆ์ง€์Šคํ„ฐ ๋ˆ„์ ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ๋ ˆ์ง€์Šคํ„ฐ ๋ณ€์ˆ˜์™€ ๋ฃจํ”„ ์ „๊ฐœ
์ปค๋„ ํ“จ์ „์—ฐ์‚ฐ ๊ฒฐํ•ฉ์œผ๋กœ ๋ฐ”์ดํŠธ๋‹น FLOPs ์ฆ๊ฐ€๋‹จ์ผ ์ปค๋„์—์„œ ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ ๋‹จ๊ณ„ ์ฒ˜๋ฆฌ
๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ(coalescing)์‹คํšจ ๋Œ€์—ญํญ ํ™œ์šฉ ๊ทน๋Œ€ํ™”๊ตฌ์กฐํ™”๋œ ์ ‘๊ทผ ํŒจํ„ด, ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ
๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ๋ณต์‚ฌ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์œผ๋กœ ์—ฐ์‚ฐ-๋ฉ”๋ชจ๋ฆฌ ์ค‘์ฒฉcopy_dram_to_sram_async์™€ ์—ฐ์‚ฐ ์ค‘์ฒฉ
ํ˜ผํ•ฉ ์ •๋ฐ€๋„์ž‘์€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€ํ•˜ ๊ฐ์†ŒFP16/BF16 ์ž…๋ ฅ + FP32 ๋ˆ„์ 

๊ฐ ๊ธฐ๋ฒ•์€ ์ปค๋„์„ ๋ฃจํ”„๋ผ์ธ ์œ„์—์„œ ์ด๋™์‹œํ‚ต๋‹ˆ๋‹ค โ€” ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ์„ ๋”ฐ๋ผ ์œ„๋กœ(๋Œ€์—ญํญ ํ™œ์šฉ ๊ฐœ์„ ), ๋˜๋Š” ์˜ค๋ฅธ์ชฝ ์—ฐ์‚ฐ ์ƒํ•œ์„ ํ–ฅํ•ด(์‚ฐ์ˆ  ๊ฐ•๋„ ํ–ฅ์ƒ).

๋น„๋™๊ธฐ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ฐธ๊ณ : ํ‘œ์ค€ GPU ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ(ld.global)๋Š” ์ด๋ฏธ ๋น„๋™๊ธฐ์ž…๋‹ˆ๋‹ค โ€” ์›Œํ”„๋Š” ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์‹ค์ œ๋กœ ํ•„์š”ํ•ด์งˆ ๋•Œ๊นŒ์ง€ ๊ณ„์† ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. cp.async(CUDA)๋‚˜ copy_dram_to_sram_async(Mojo) ๊ฐ™์€ ์ „์šฉ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๋ช…๋ น์€ ์—ฌ๊ธฐ์„œ ํ•œ ๊ฑธ์Œ ๋” ๋‚˜์•„๊ฐ€, ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์„ ์‚ฌ์šฉํ•˜๊ณ  ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์šฐํšŒํ•˜์—ฌ ์ž์› ํ™œ์šฉ์„ ๋†’์ž…๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ ๋™๊ธฐ ์—ฐ์‚ฐ์„ ๋น„๋™๊ธฐ๋กœ ๋ฐ”๊พธ๋Š” ๊ฒƒ๊ณผ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

7. ๋‹จ์ˆœํ•œ ๋ฃจํ”„๋ผ์ธ์„ ๋„˜์–ด์„œ

๋‹ค๋‹จ๊ณ„ ๋ฉ”๋ชจ๋ฆฌ: ๊ณ ๊ธ‰ ๋ฃจํ”„๋ผ์ธ์€ L2 ์บ์‹œ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ ˆ์ง€์Šคํ„ฐ ๋Œ€์—ญํญ์— ๋Œ€ํ•ด ๋ณ„๋„์˜ ์ƒํ•œ์„ ํฌํ•จํ•˜์—ฌ ์–ด๋–ค ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต์ด ์„ฑ๋Šฅ์„ ์ œ์•ฝํ•˜๋Š”์ง€ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

ํ†ต์‹  ๋ฃจํ”„๋ผ์ธ: ๋ฉ€ํ‹ฐ GPU ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ๋Œ€์‹  ์ธํ„ฐ์ปค๋„ฅํŠธ ๋Œ€์—ญํญ(NVLink, InfiniBand)์„ ์‚ฌ์šฉํ•˜์—ฌ ์Šค์ผ€์ผ๋ง ํšจ์œจ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.

์ „์šฉ ์œ ๋‹›: ์ตœ์‹  GPU๋Š” ๊ณ ์œ ํ•œ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ๊ฐ€์ง„ ํ…์„œ ์ฝ”์–ด๋ฅผ ํฌํ•จํ•˜๋ฉฐ, ๋ณ„๋„์˜ ๋ฃจํ”„๋ผ์ธ ๋ถ„์„์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

8. ์‹ค์ „์—์„œ ๋ฃจํ”„๋ผ์ธ ํ™œ์šฉํ•˜๊ธฐ

  1. ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง: Nsight Compute ๊ฐ™์€ ๋„๊ตฌ๋กœ ์‹ค์ œ FLOPs์™€ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ธก์ •
  2. ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ํ‘œ์‹œ: ์‚ฐ์ˆ  ๊ฐ•๋„์™€ ์‹ค์ธก ์„ฑ๋Šฅ ๊ณ„์‚ฐ
  3. ๋ณ‘๋ชฉ ์‹๋ณ„: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„์€ ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ ์œ„์—, ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ์ปค๋„์€ ์—ฐ์‚ฐ ์ƒํ•œ์— ๊ทผ์ ‘
  4. ์ตœ์ ํ™” ์„ ํƒ: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„์—๋Š” ๋Œ€์—ญํญ ๊ฐœ์„ ์—, ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ์ปค๋„์—๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ€๊ฒฝ์— ์ง‘์ค‘
  5. ์ธก์ •๊ณผ ๋ฐ˜๋ณต: ์ตœ์ ํ™”๊ฐ€ ์ปค๋„์„ ๊ธฐ๋Œ€ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋™์‹œํ‚ค๋Š”์ง€ ๊ฒ€์ฆ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํผ์ฆ๊ณผ์˜ ์—ฐ๊ฒฐ

๋‹ค์Œ ์„น์…˜์—์„œ๋Š” ์ปค๋„์„ ๋ฃจํ”„๋ผ์ธ ์œ„๋กœ ๋Œ์–ด์˜ฌ๋ฆฌ๊ธฐ ์‹œ์ž‘ํ•˜๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์‹œ๊ฐํ™”์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, ์ฃผํ™ฉ์ƒ‰ ์ (๊ธฐ๋ณธ)์—์„œ ์ฒญ๋ก์ƒ‰ ์ (๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)์œผ๋กœ ์ด๋™ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค โ€” ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ๊ฐœ์„ ์„ ํ†ตํ•œ ํ™•์‹คํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ž…๋‹ˆ๋‹ค.

\(2 \times 2\) ์˜ˆ์ œ์—์„œ๋Š” ์—ฐ์‚ฐ ์ƒํ•œ์— ๋„๋‹ฌํ•˜์ง€ ๋ชปํ•˜์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์„ฑ๋Šฅ์— ๊ฒฐ์ •์ ์ธ ์—ญํ• ์„ ํ•˜๋Š” ํฐ ํ–‰๋ ฌ์—์„œ ๋™์ผํ•œ ์›๋ฆฌ๊ฐ€ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์™œ ๋„์›€์ด ๋˜๊ณ  ์–ผ๋งˆ๋‚˜ ๊ฐœ์„ ์„ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ์ด๋ก ์  ํ† ๋Œ€๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๋ฉด GPU ์ตœ์ ํ™”๊ฐ€ ์ถ”์ธก์—์„œ ์ฒด๊ณ„์ ์ธ ์—”์ง€๋‹ˆ์–ด๋ง์œผ๋กœ ๋ฐ”๋€๋‹ˆ๋‹ค. ์ด ์ฑ…์˜ ๋ชจ๋“  ์ตœ์ ํ™” ๊ธฐ๋ฒ•์€ ์ด ๋‹จ์ˆœํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ ๋ชจ๋ธ์— ๋Œ€ํ•œ ํšจ๊ณผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „

๊ฐœ์š”

์ •๋ฐฉ ํ–‰๋ ฌ \(A\) ์™€ \(B\) ์˜ ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ \(\text{output}\)์— ์ €์žฅํ•˜๋Š” ํผ์ฆ์ž…๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์—ฐ์‚ฐ ์ „์— ํ–‰๋ ฌ ๋ธ”๋ก์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋ฏธ๋ฆฌ ๋กœ๋“œํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ธ”๋ก ๋กœ์ปฌ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ํŒจํ„ด
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”
  • 2D ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํ˜‘๋ ฅ์  ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
  • ํ–‰๋ ฌ ์—ฐ์‚ฐ์— LayoutTensor๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๊ธฐ

ํ•ต์‹ฌ์€ LayoutTensor๋ฅผ ํ†ตํ•ด ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋น„์šฉ์ด ํฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(1 \times 1\)

๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

  • ์ž…๋ ฅ A: Layout.row_major(SIZE, SIZE)
  • ์ž…๋ ฅ B: Layout.row_major(SIZE, SIZE)
  • ์ถœ๋ ฅ: Layout.row_major(SIZE, SIZE)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB ร— TPB ํฌ๊ธฐ์˜ LayoutTensor 2๊ฐœ

๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ:

Global Memory (LayoutTensor):          Shared Memory (LayoutTensor):
A[i,j]: Direct access                  a_shared[local_row, local_col]
B[i,j]: Direct access                  b_shared[local_row, local_col]

์™„์„ฑํ•  ์ฝ”๋“œ

fn single_block_matmul[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    local_row = thread_idx.y
    local_col = thread_idx.x
    # FILL ME IN (roughly 12 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p16/p16.mojo

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค์™€ ๋กœ์ปฌ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ–‰๋ ฌ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ
  2. ๋กœ๋“œ ํ›„ barrier() ํ˜ธ์ถœ
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‚ด์  ๊ณ„์‚ฐ
  4. ๋ชจ๋“  ์—ฐ์‚ฐ์—์„œ ๋ฐฐ์—ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p16 --single-block
pixi run -e amd p16 --single-block
pixi run -e apple p16 --single-block
uv run poe p16 --single-block

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])

์†”๋ฃจ์…˜

fn single_block_matmul[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    local_row = thread_idx.y
    local_col = thread_idx.x

    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    if row < size and col < size:
        a_shared[local_row, local_col] = a[row, col]
        b_shared[local_row, local_col] = b[row, col]

    barrier()

    if row < size and col < size:
        var acc: output.element_type = 0

        @parameter
        for k in range(size):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        output[row, col] = acc


LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ตฌํ˜„์€ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค:

๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ

Input Tensors (2ร—2):                Shared Memory (3ร—3):
Matrix A:                           a_shared:
 [a[0,0] a[0,1]]                     [s[0,0] s[0,1] s[0,2]]
 [a[1,0] a[1,1]]                     [s[1,0] s[1,1] s[1,2]]
                                     [s[2,0] s[2,1] s[2,2]]
Matrix B:                           b_shared: (๋น„์Šทํ•œ ๋ ˆ์ด์•„์›ƒ)
 [b[0,0] b[0,1]]                     [t[0,0] t[0,1] t[0,2]]
 [b[1,0] b[1,1]]                     [t[1,0] t[1,1] t[1,2]]
                                     [t[2,0] t[2,1] t[2,2]]

๊ตฌํ˜„ ๋‹จ๊ณ„

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •:

    # address_space๋ฅผ ์ง€์ •ํ•œ LayoutTensor๋กœ 2D ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ…์„œ ์ƒ์„ฑ
    a_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
    b_shared = LayoutTensor[dtype, Layout.row_major(TPB, TPB), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
    
  2. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ:

    # ํ–‰๋ ฌ ์ ‘๊ทผ์„ ์œ„ํ•œ ์ „์—ญ ์ธ๋ฑ์Šค
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
    # ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์šฉ ๋กœ์ปฌ ์ธ๋ฑ์Šค
    local_row = thread_idx.y
    local_col = thread_idx.x
    
  3. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ:

    # LayoutTensor ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ
    if row < size and col < size:
        a_shared[local_row, local_col] = a[row, col]
        b_shared[local_row, local_col] = b[row, col]
    
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ์—ฐ์‚ฐ:

    # ๊ฐ€๋“œ๋กœ ์œ ํšจํ•œ ํ–‰๋ ฌ ์›์†Œ๋งŒ ๊ณ„์‚ฐ
    if row < size and col < size:
        # ์ถœ๋ ฅ ํ…์„œ์˜ ํƒ€์ž…์œผ๋กœ ๋ˆ„์  ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
        var acc: output.element_type = 0
    
        # ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ๋˜๋Š” ํ–‰๋ ฌ ๊ณฑ์…ˆ ๋ฃจํ”„
        @parameter
        for k in range(size):
            acc += a_shared[local_row, k] * b_shared[k, local_col]
    
        # ํ–‰๋ ฌ ๊ฒฝ๊ณ„ ๋‚ด์˜ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฐ๊ณผ ๊ธฐ๋ก
        output[row, col] = acc
    

    ์ฃผ์š” ํฌ์ธํŠธ:

    • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: if row < size and col < size

      • ๋ฒ”์œ„ ๋ฐ– ์—ฐ์‚ฐ ๋ฐฉ์ง€
      • ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ๋งŒ ์ž‘์—… ์ˆ˜ํ–‰
      • TPB (3ร—3) > SIZE (2ร—2)์ด๋ฏ€๋กœ ํ•„์ˆ˜
    • ๋ˆ„์  ๋ณ€์ˆ˜ ํƒ€์ž…: var acc: output.element_type

      • ์ถœ๋ ฅ ํ…์„œ์˜ ์›์†Œ ํƒ€์ž…์œผ๋กœ ํƒ€์ž… ์•ˆ์ „์„ฑ ํ™•๋ณด
      • ์ผ๊ด€๋œ ์ˆ˜์น˜ ์ •๋ฐ€๋„ ๋ณด์žฅ
      • ๋ˆ„์  ์ „์— 0์œผ๋กœ ์ดˆ๊ธฐํ™”
    • ๋ฃจํ”„ ์ตœ์ ํ™”: @parameter for k in range(size)

      • ์ปดํŒŒ์ผ ํƒ€์ž„์— ๋ฃจํ”„ ์ „๊ฐœ
      • ๋” ๋‚˜์€ ๋ช…๋ น์–ด ์Šค์ผ€์ค„๋ง ๊ฐ€๋Šฅ
      • ํฌ๊ธฐ๊ฐ€ ์ž‘๊ณ  ๋ฏธ๋ฆฌ ์•Œ๋ ค์ง„ ํ–‰๋ ฌ์— ํšจ๊ณผ์ 
    • ๊ฒฐ๊ณผ ๊ธฐ๋ก: output[row, col] = acc

      • ๋™์ผํ•œ ๊ฐ€๋“œ ์กฐ๊ฑด์œผ๋กœ ๋ณดํ˜ธ
      • ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฐ๊ณผ ๊ธฐ๋ก
      • ํ–‰๋ ฌ ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ ์œ ์ง€

์Šค๋ ˆ๋“œ ์•ˆ์ „์„ฑ๊ณผ ๋™๊ธฐํ™”

  1. ๊ฐ€๋“œ ์กฐ๊ฑด:

    • ์ž…๋ ฅ ๋กœ๋”ฉ: if row < size and col < size
    • ์—ฐ์‚ฐ: ๋™์ผํ•œ ๊ฐ€๋“œ๋กœ ์Šค๋ ˆ๋“œ ์•ˆ์ „์„ฑ ๋ณด์žฅ
    • ์ถœ๋ ฅ ๊ธฐ๋ก: ๊ฐ™์€ ์กฐ๊ฑด์œผ๋กœ ๋ณดํ˜ธ
    • ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์•ˆ์ „์„ฑ:

    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB ๋ฒ”์œ„ ๋‚ด์—์„œ๋งŒ ์ ‘๊ทผ
    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: ํฌ๊ธฐ ๊ฒ€์‚ฌ๋กœ ๋ณดํ˜ธ
    • ์ถœ๋ ฅ: ๊ฐ€๋“œ๋œ ์“ฐ๊ธฐ๋กœ ๋ฐ์ดํ„ฐ ์†์ƒ ๋ฐฉ์ง€

์ฃผ์š” ์–ธ์–ด ๊ธฐ๋Šฅ

  1. LayoutTensor์˜ ์žฅ์ :

    • ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑ์œผ๋กœ ์ฝ”๋“œ ๋‹จ์ˆœํ™”
    • element_type์„ ํ†ตํ•œ ํƒ€์ž… ์•ˆ์ „์„ฑ
    • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น:

    • address_space๋ฅผ ์ง€์ •ํ•œ LayoutTensor๋กœ ๊ตฌ์กฐํ™”๋œ ํ• ๋‹น
    • ์ž…๋ ฅ ํ…์„œ์™€ ๋™์ผํ•œ ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ
    • ํšจ์œจ์  ์ ‘๊ทผ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ
  3. ๋™๊ธฐํ™”:

    • barrier()๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ผ๊ด€์„ฑ ๋ณด์žฅ
    • ๋กœ๋“œ์™€ ์—ฐ์‚ฐ ๊ฐ„ ์ ์ ˆํ•œ ๋™๊ธฐํ™”
    • ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ

์„ฑ๋Šฅ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ:

    • ์›์†Œ๋‹น ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋‹ค์ค‘ ์žฌ์‚ฌ์šฉ
    • ๋ณ‘ํ•ฉ๋œ(coalesced) ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  2. ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ:

    • ํ˜‘๋ ฅ์  ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
    • ๊ณต์œ  ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ
    • ํšจ์œจ์ ์ธ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  3. ์—ฐ์‚ฐ ์ด์ :

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ
    • ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
    • ๋ช…๋ น์–ด ์ฒ˜๋ฆฌ๋Ÿ‰ ๊ฐœ์„ 

์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๊ธฐ๋ณธ ๋ฒ„์ „ ๋Œ€๋น„ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšŸ์ˆ˜ ๊ฐ์†Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ
  • LayoutTensor์˜ ํšจ์œจ์ ์ธ 2D ์ธ๋ฑ์‹ฑ ํ™œ์šฉ
  • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์œ ์ง€

ํƒ€์ผ๋ง ๋ฒ„์ „

๊ฐœ์š”

LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์œผ๋กœ ์ •๋ฐฉ ํ–‰๋ ฌ \(A\) ์™€ \(B\) ๋ฅผ ๊ณฑํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ํฐ ํ–‰๋ ฌ์„ ์ž‘์€ ์กฐ๊ฐ(ํƒ€์ผ)์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

  • LayoutTensor๋ฅผ ์‚ฌ์šฉํ•œ ํ–‰๋ ฌ ํƒ€์ผ๋ง์œผ๋กœ ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ
  • ์ ์ ˆํ•œ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•œ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์œจ
  • TensorBuilder๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  • LayoutTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํƒ€์ผ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE_TILED} = 9\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(3 \times 3\) ๋ธ”๋ก
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \(\text{TPB} \times \text{TPB}\) LayoutTensor 2๊ฐœ

๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

  • ์ž…๋ ฅ A: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • ์ž…๋ ฅ B: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • ์ถœ๋ ฅ: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TensorBuilder๋ฅผ ์‚ฌ์šฉํ•œ TPB ร— TPB LayoutTensor 2๊ฐœ

ํƒ€์ผ๋ง ์ „๋žต

๋ธ”๋ก ๊ตฌ์„ฑ

Grid Layout (3ร—3):           Thread Layout per Block (3ร—3):
[B00][B01][B02]               [T00 T01 T02]
[B10][B11][B12]               [T10 T11 T12]
[B20][B21][B22]               [T20 T21 T22]

๊ฐ ๋ธ”๋ก์€ LayoutTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ํƒ€์ผ์„ ์ฒ˜๋ฆฌ

ํƒ€์ผ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„

  1. ์Šค๋ ˆ๋“œ ์œ„์น˜์— ๋Œ€ํ•œ ์ „์—ญ ์ธ๋ฑ์Šค์™€ ๋กœ์ปฌ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ
  2. A์™€ B ํƒ€์ผ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  3. ๊ฐ ํƒ€์ผ์— ๋Œ€ํ•ด:
    • ํ–‰๋ ฌ A์™€ B์—์„œ ํƒ€์ผ ๋กœ๋“œ
    • ๋ถ€๋ถ„ ๊ณฑ ๊ณ„์‚ฐ
    • ๋ ˆ์ง€์Šคํ„ฐ์— ๊ฒฐ๊ณผ ๋ˆ„์ 
  4. ์ตœ์ข… ๋ˆ„์  ๊ฒฐ๊ณผ ๊ธฐ๋ก

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

Matrix A (9×9)                 Matrix B (9×9)               Matrix C (9×9)
+---+---+---+                  +---+---+---+                +---+---+---+
|T00|T01|T02| ...              |T00|T01|T02| ...            |T00|T01|T02| ...
+---+---+---+                  +---+---+---+                +---+---+---+
|T10|T11|T12|                  |T10|T11|T12|                |T10|T11|T12|
+---+---+---+                  +---+---+---+                +---+---+---+
|T20|T21|T22|                  |T20|T21|T22|                |T20|T21|T22|
+---+---+---+                  +---+---+---+                +---+---+---+
  ...                            ...                          ...

ํƒ€์ผ ์ฒ˜๋ฆฌ ๊ณผ์ • (C[T11] ๊ณ„์‚ฐ ์˜ˆ์‹œ):
1. A์™€ B์—์„œ ํƒ€์ผ ๋กœ๋“œ:
   +---+      +---+
   |A11| ร—    |B11|     ๊ฐ ๋‹จ๊ณ„ k์— ๋Œ€ํ•ด:
   +---+      +---+     C[T11] += A[row, k] ร— B[k, col]

2. ํƒ€์ผ ์ด๋™:
   ๋‹จ๊ณ„ 1      ๋‹จ๊ณ„ 2      ๋‹จ๊ณ„ 3
   A: [T10]    A: [T11]    A: [T12]
   B: [T01]    B: [T11]    B: [T21]

3. ํƒ€์ผ ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ (i,j)์˜ ์—ฐ์‚ฐ:
   C[i,j] = ฮฃ (A[i,k] ร— B[k,j]), k๋Š” ํƒ€์ผ ๋„ˆ๋น„ ๋ฒ”์œ„

๋™๊ธฐํ™” ํ•„์š” ์‹œ์ :
* ํƒ€์ผ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œํ•œ ํ›„
* ๊ฐ ๋‹จ๊ณ„์˜ ์—ฐ์‚ฐ์ด ๋๋‚œ ํ›„

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE_TILED = 9
comptime BLOCKS_PER_GRID_TILED = (3, 3)  # each block covers 3x3 elements
comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)


fn matmul_tiled[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * TPB + thread_idx.y
    tiled_col = block_idx.x * TPB + thread_idx.x
    # FILL ME IN (roughly 20 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p16/p16.mojo

ํŒ
  1. ํ‘œ์ค€ ์ธ๋ฑ์‹ฑ ๊ทœ์น™์„ ์‚ฌ์šฉํ•˜์„ธ์š”: local_row = thread_idx.y, local_col = thread_idx.x

  2. ์ „์—ญ ์œ„์น˜ ๊ณ„์‚ฐ:

    global_row = block_idx.y * TPB + local_row
    

    ๊ทธ๋ฆฌ๊ณ 

    global_col = block_idx.x * TPB + local_col
    

    ์ „์—ญ ์ธ๋ฑ์‹ฑ ๊ณต์‹ ์ดํ•ดํ•˜๊ธฐ:

    • ๊ฐ ๋ธ”๋ก์€ ํ–‰๋ ฌ์˜ TPB ร— TPB ํƒ€์ผ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

    • block_idx.y๋Š” ํ˜„์žฌ ๋ช‡ ๋ฒˆ์งธ ๋ธ”๋ก ํ–‰์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค (0, 1, 2โ€ฆ)

    • block_idx.y * TPB๋Š” ํ•ด๋‹น ๋ธ”๋ก ํƒ€์ผ์˜ ์‹œ์ž‘ ํ–‰์ž…๋‹ˆ๋‹ค

    • local_row (0~TPB-1)์€ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ์˜ ์˜คํ”„์…‹์ž…๋‹ˆ๋‹ค

    • ๋‘˜์„ ๋”ํ•˜๋ฉด ์ „์ฒด ํ–‰๋ ฌ์—์„œ์˜ ์‹ค์ œ ํ–‰ ์œ„์น˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค

      TPB=3 ์˜ˆ์‹œ:

    Block Layout:        Global Matrix (9ร—9):
    [B00][B01][B02]      [0 1 2 | 3 4 5 | 6 7 8]
    [B10][B11][B12]  โ†’   [9 A B | C D E | F G H]
    [B20][B21][B22]      [I J K | L M N | O P Q]
                         โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”
                         [R S T | U V W | X Y Z]
                         [a b c | d e f | g h i]
                         [j k l | m n o | p q r]
                         โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”
                         [s t u | v w x | y z ฮฑ]
                         [ฮฒ ฮณ ฮด | ฮต ฮถ ฮท | ฮธ ฮน ฮบ]
                         [ฮป ฮผ ฮฝ | ฮพ ฮฟ ฯ€ | ฯ ฯƒ ฯ„]
    
    Thread(1,2) in Block(1,0):
    - block_idx.y = 1, local_row = 1
    - global_row = 1 * 3 + 1 = 4
    - ์ด ์Šค๋ ˆ๋“œ๋Š” ํ–‰๋ ฌ์˜ 4๋ฒˆ์งธ ํ–‰์„ ๋‹ด๋‹น
    
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น (.fill(0)์œผ๋กœ ์‚ฌ์ „ ์ดˆ๊ธฐํ™”๋จ)

  4. 9ร—9 ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์ด๋ฏ€๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๋ถˆํ•„์š”!

  5. ์ ์ ˆํ•œ ๋™๊ธฐํ™”์™€ ํ•จ๊ป˜ ํƒ€์ผ ๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ๋ˆ„์ 

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p16 --tiled
pixi run -e amd p16 --tiled
pixi run -e apple p16 --tiled
uv run poe p16 --tiled

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 4176.0, 4248.0, 9504.0, 9738.0, 9972.0, 10206.0, 10440.0, 10674.0, 10908.0, 11142.0, 11376.0, 15336.0, 15732.0, 16128.0, 16524.0, 16920.0, 17316.0, 17712.0, 18108.0, 18504.0, 21168.0, 21726.0, 22284.0, 22842.0, 23400.0, 23958.0, 24516.0, 25074.0, 25632.0, 27000.0, 27720.0, 28440.0, 29160.0, 29880.0, 30600.0, 31320.0, 32040.0, 32760.0, 32832.0, 33714.0, 34596.0, 35478.0, 36360.0, 37242.0, 38124.0, 39006.0, 39888.0, 38664.0, 39708.0, 40752.0, 41796.0, 42840.0, 43884.0, 44928.0, 45972.0, 47016.0, 44496.0, 45702.0, 46908.0, 48114.0, 49320.0, 50526.0, 51732.0, 52938.0, 54144.0, 50328.0, 51696.0, 53064.0, 54432.0, 55800.0, 57168.0, 58536.0, 59904.0, 61272.0])

์†”๋ฃจ์…˜: ์ˆ˜๋™ ํƒ€์ผ๋ง

fn matmul_tiled[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * TPB + local_row
    tiled_col = block_idx.x * TPB + local_col

    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    var acc: output.element_type = 0

    # Iterate over tiles to compute matrix product
    @parameter
    for tile in range((size + TPB - 1) // TPB):
        # Load A tile - global row stays the same, col determined by tile
        if tiled_row < size and (tile * TPB + local_col) < size:
            a_shared[local_row, local_col] = a[
                tiled_row, tile * TPB + local_col
            ]

        # Load B tile - row determined by tile, global col stays the same
        if (tile * TPB + local_row) < size and tiled_col < size:
            b_shared[local_row, local_col] = b[
                tile * TPB + local_row, tiled_col
            ]

        barrier()

        # Matrix multiplication within the tile
        if tiled_row < size and tiled_col < size:

            @parameter
            for k in range(min(Int(TPB), Int(size - tile * TPB))):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write out final result
    if tiled_row < size and tiled_col < size:
        output[tiled_row, tiled_col] = acc


ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ตฌํ˜„์€ ์ž‘์€ ํƒ€์ผ \((3 \times 3)\) ์„ ์‚ฌ์šฉํ•˜์—ฌ ํฐ ํ–‰๋ ฌ \((9 \times 9)\) ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋™์ž‘ ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น

    Input matrices (9ร—9) - (3ร—3) ํƒ€์ผ๋ง์— ๋”ฑ ๋งž๋Š” ํฌ๊ธฐ:
    A = [0  1  2  3  4  5  6  7  8 ]    B = [0  2  4  6  8  10 12 14 16]
        [9  10 11 12 13 14 15 16 17]        [18 20 22 24 26 28 30 32 34]
        [18 19 20 21 22 23 24 25 26]        [36 38 40 42 44 46 48 50 52]
        [27 28 29 30 31 32 33 34 35]        [54 56 58 60 62 64 66 68 70]
        [36 37 38 39 40 41 42 43 44]        [72 74 76 78 80 82 84 86 88]
        [45 46 47 48 49 50 51 52 53]        [90 92 94 96 98 100 102 104 106]
        [54 55 56 57 58 59 60 61 62]        [108 110 112 114 116 118 120 122 124]
        [63 64 65 66 67 68 69 70 71]        [126 128 130 132 134 136 138 140 142]
        [72 73 74 75 76 77 78 79 80]        [144 146 148 150 152 154 156 158 160]
    
    ๋ธ”๋ก๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ (3ร—3):
    a_shared[TPB, TPB]  b_shared[TPB, TPB]
    
  2. ํƒ€์ผ ์ฒ˜๋ฆฌ ๋ฃจํ”„

    ํƒ€์ผ ์ˆ˜ = 9 // 3 = 3๊ฐœ (๋‚˜๋จธ์ง€ ์—†์ด ๋”ฑ ๋‚˜๋ˆ ์ง!)
    
    ๊ฐ ํƒ€์ผ์— ๋Œ€ํ•ด:
    1. A์™€ B์—์„œ ํƒ€์ผ ๋กœ๋“œ
    2. ๋ถ€๋ถ„ ๊ณฑ ๊ณ„์‚ฐ
    3. ๋ ˆ์ง€์Šคํ„ฐ์— ๋ˆ„์ 
    
  3. ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ ํŒจํ„ด

    • \((9 \times 9)\) ์ด ๋”ฑ ๋‚˜๋ˆ ์ง€๋ฏ€๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๊ธฐ์ˆ ์ ์œผ๋กœ๋Š” ๋ถˆํ•„์š”ํ•˜์ง€๋งŒ, ๋ฐฉ์–ด์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ค๋ฅธ ํ–‰๋ ฌ ํฌ๊ธฐ์—๋„ ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

         # A ํƒ€์ผ ๋กœ๋“œ - ์ „์—ญ ํ–‰์€ ๊ทธ๋Œ€๋กœ, ์—ด์€ ํƒ€์ผ์— ์˜ํ•ด ๊ฒฐ์ •
         if tiled_row < size and (tile * TPB + local_col) < size:
             a_shared[local_row, local_col] = a[
                 tiled_row, tile * TPB + local_col
             ]
      
         # B ํƒ€์ผ ๋กœ๋“œ - ํ–‰์€ ํƒ€์ผ์— ์˜ํ•ด ๊ฒฐ์ •, ์ „์—ญ ์—ด์€ ๊ทธ๋Œ€๋กœ
         if (tile * TPB + local_row) < size and tiled_col < size:
             b_shared[local_row, local_col] = b[
                 tile * TPB + local_row, tiled_col
             ]
      
  4. ํƒ€์ผ ๋‚ด ์—ฐ์‚ฐ

    for k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ:

      Bank Conflict Free (Good):        Bank Conflicts (Bad):
      Thread0: a_shared[0,k] b_shared[k,0]  Thread0: a_shared[k,0] b_shared[0,k]
      Thread1: a_shared[0,k] b_shared[k,1]  Thread1: a_shared[k,0] b_shared[1,k]
      Thread2: a_shared[0,k] b_shared[k,2]  Thread2: a_shared[k,0] b_shared[2,k]
      โ†“                                     โ†“
      ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ๋ณ‘๋ ฌ ์ ‘๊ทผ             b_shared๊ฐ€ ์—ด ์šฐ์„ ์ด์—ˆ๋‹ค๋ฉด
      (a_shared๋Š” broadcast)               ๊ฐ™์€ ๋ฑ…ํฌ์— ์ง๋ ฌ ์ ‘๊ทผ
      

      ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ ์„ค๋ช…:

      • ์™ผ์ชฝ (Good): b_shared[k,threadIdx.x]๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๊ณ , a_shared[0,k]๋Š” ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ฉ๋‹ˆ๋‹ค
      • ์˜ค๋ฅธ์ชฝ (Bad): b_shared๊ฐ€ ์—ด ์šฐ์„ ์ด์—ˆ๋‹ค๋ฉด ์Šค๋ ˆ๋“œ๋“ค์ด ๋™์‹œ์— ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค
      • ํ•ต์‹ฌ: ์ด๊ฒƒ์€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์•„๋‹Œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ๊ด€ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค
      • ๋ฑ…ํฌ ๊ตฌ์กฐ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” 32๊ฐœ ๋ฑ…ํฌ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์œผ๋ฉฐ, ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๋ฑ…ํฌ์˜ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ์ ‘๊ทผํ•  ๋•Œ ์ถฉ๋Œ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค
  5. ๋™๊ธฐํ™” ์ง€์ 

    barrier() ํ˜ธ์ถœ ์‹œ์ :
    1. ํƒ€์ผ ๋กœ๋”ฉ ํ›„
    2. ํƒ€์ผ ์—ฐ์‚ฐ ํ›„
    

์ฃผ์š” ์„ฑ๋Šฅ ํŠน์„ฑ:

  • \((3 \times 3)\) ํƒ€์ผ๋กœ \((9 \times 9)\) ํ–‰๋ ฌ ์ฒ˜๋ฆฌ (๋”ฑ ๋งž๋Š” ํฌ๊ธฐ!)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋น ๋ฅธ ํƒ€์ผ ์ ‘๊ทผ
  • ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜ ์ตœ์†Œํ™”
  • ๋ฑ…ํฌ ์ถฉ๋Œ์„ ํ”ผํ•˜๋„๋ก ์ตœ์ ํ™”๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ์ ‘๊ทผ ํŒจํ„ด
  1. ๊ฒฐ๊ณผ ๊ธฐ๋ก:

    if tiled_row < size and tiled_col < size:
       output[tiled_row, tiled_col] = acc
    
    • ๋‹ค๋ฅธ ํ–‰๋ ฌ ํฌ๊ธฐ์™€ ํƒ€์ผ๋ง ์ „๋žต์„ ์œ„ํ•œ ๋ฐฉ์–ด์  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํฌํ•จ
    • ์ถœ๋ ฅ ํ–‰๋ ฌ์— ์ง์ ‘ ๋Œ€์ž…
    • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ ํšจํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก
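
์œ„ ํƒ€์ผ ์ฒ˜๋ฆฌ ์ ˆ์ฐจ(ํƒ€์ผ ๋กœ๋“œ -> ๋ถ€๋ถ„ ๊ณฑ -> ๋ˆ„์  -> ๊ฒฐ๊ณผ ๊ธฐ๋ก)๋ฅผ NumPy๋กœ ์˜ฎ๊ธด ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ์ปค๋„์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ๋ฐฐ์—ด ์Šฌ๋ผ์ด์Šค๋กœ, ๋ ˆ์ง€์Šคํ„ฐ ๋ˆ„์ ์„ acc ๋ฐฐ์—ด๋กœ ํ‰๋‚ด ๋‚ธ ๊ฒ€์ฆ์šฉ ์ฝ”๋“œ์ผ ๋ฟ ์‹ค์ œ GPU ์ฝ”๋“œ๋Š” ์•„๋‹™๋‹ˆ๋‹ค:

```python
import numpy as np

SIZE, TPB = 9, 3  # ๋ณธ๋ฌธ๊ณผ ๊ฐ™์€ 9x9 ํ–‰๋ ฌ, 3x3 ํƒ€์ผ
A = np.arange(SIZE * SIZE, dtype=np.float32).reshape(SIZE, SIZE)
B = (2 * np.arange(SIZE * SIZE, dtype=np.float32)).reshape(SIZE, SIZE)

out = np.zeros((SIZE, SIZE), dtype=np.float32)
for by in range(SIZE // TPB):          # ๋ธ”๋ก ํ–‰
    for bx in range(SIZE // TPB):      # ๋ธ”๋ก ์—ด
        acc = np.zeros((TPB, TPB), dtype=np.float32)  # ๋ ˆ์ง€์Šคํ„ฐ ๋ˆ„์ ์— ํ•ด๋‹น
        for t in range(SIZE // TPB):   # ํƒ€์ผ ๋ฃจํ”„: 9 // 3 = 3ํšŒ
            a_shared = A[by*TPB:(by+1)*TPB, t*TPB:(t+1)*TPB]  # A ํƒ€์ผ ๋กœ๋“œ
            b_shared = B[t*TPB:(t+1)*TPB, bx*TPB:(bx+1)*TPB]  # B ํƒ€์ผ ๋กœ๋“œ
            acc += a_shared @ b_shared                        # ๋ถ€๋ถ„ ๊ณฑ ๋ˆ„์ 
        out[by*TPB:(by+1)*TPB, bx*TPB:(bx+1)*TPB] = acc       # ๊ฒฐ๊ณผ ๊ธฐ๋ก

assert np.array_equal(out, A @ B)  # ํƒ€์ผ๋ง ๊ฒฐ๊ณผ๊ฐ€ ์ง์ ‘ ๊ณฑ์…ˆ๊ณผ ์ผ์น˜
print(out[0, :3])  # [3672. 3744. 3816.]
```

์ฒซ ํ–‰์˜ ๊ฐ’๋“ค์ด ์œ„ expected ์ถœ๋ ฅ(3672, 3744, 3816, ...)๊ณผ ์ผ์น˜ํ•˜๋ฉฐ, ํƒ€์ผ์„ ์–ด๋–ค ์ˆœ์„œ๋กœ ๋Œ์•„๋„ ๊ฐ™์€ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค.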

์ฃผ์š” ์ตœ์ ํ™”

  1. ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”:

    • ๋ชจ๋“  ํ…์„œ์— ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ
    • ํšจ์œจ์ ์ธ 2D ์ธ๋ฑ์‹ฑ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ๋ณ‘ํ•ฉ๋œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ
    • ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  3. ์—ฐ์‚ฐ:

    • ๋ ˆ์ง€์Šคํ„ฐ ๊ธฐ๋ฐ˜ ๋ˆ„์ , ์ฆ‰ var acc: output.element_type = 0
    • @parameter๋ฅผ ํ†ตํ•œ ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ

์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • LayoutTensor๋ฅผ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
  • ์ตœ์ ์˜ ํƒ€์ผ๋ง ์ „๋žต
  • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  • ์„ธ์‹ฌํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

์†”๋ฃจ์…˜: ๊ด€์šฉ์  LayoutTensor ํƒ€์ผ๋ง

from gpu.memory import async_copy_wait_all
from layout.layout_tensor import copy_dram_to_sram_async

comptime NUM_THREADS = TPB * TPB
comptime BLOCK_DIM_COUNT = 2


fn matmul_idiomatic_tiled[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * TPB + local_row
    tiled_col = block_idx.x * TPB + local_col

    # Get the tile of the output matrix that this thread block is responsible for
    out_tile = output.tile[TPB, TPB](Int(block_idx.y), Int(block_idx.x))
    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    var acc: output.element_type = 0

    comptime load_a_layout = Layout.row_major(1, TPB)  # Coalesced loading
    comptime load_b_layout = Layout.row_major(1, TPB)  # Coalesced loading
    # Note: Both matrices stored in same orientation for correct matrix multiplication
    # Transposed loading would be useful if B were pre-transposed in global memory

    @parameter
    for idx in range(size // TPB):  # Perfect division: 9 // 3 = 3 tiles
        # Get tiles from A and B matrices
        a_tile = a.tile[TPB, TPB](Int(block_idx.y), Int(idx))
        b_tile = b.tile[TPB, TPB](Int(idx), Int(block_idx.x))

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads=NUM_THREADS,
            block_dim_count=BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads=NUM_THREADS,
            block_dim_count=BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        # Wait for all async copies to complete
        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        @parameter
        for k in range(TPB):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result to output tile
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc


๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ Mojo์˜ LayoutTensor API์™€ ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•˜์—ฌ ๊น”๋”ํ•œ ๊ตฌํ˜„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํฌ์ธํŠธ: ์ด ๊ตฌํ˜„์€ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ‘œ์ค€ A ร— B ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ตฌํ˜„์ด ํ•˜๋Š” ๊ฒƒ:

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ: ํ‘œ์ค€ \(A \times B\) ๊ณฑ์…ˆ (\(A \times B^T\) ๊ฐ€ ์•„๋‹˜)
  • ๋กœ๋”ฉ ํŒจํ„ด: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ Layout.row_major(1, TPB)๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
  • ์—ฐ์‚ฐ: acc += a_shared[local_row, k] * b_shared[k, local_col]
  • ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ: ๋กœ๋”ฉ ์‹œ ์ „์น˜ ์—†์Œ - ๋‘ ํ–‰๋ ฌ์„ ๊ฐ™์€ ๋ฐฉํ–ฅ์œผ๋กœ ๋กœ๋“œ

์ด ๊ตฌํ˜„์ด ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ:

  • \(A \times B^T\) ๊ณฑ์…ˆ์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์Œ
  • ์ „์น˜ ๋กœ๋”ฉ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ
  • ๋ณต์‚ฌ ๊ณผ์ •์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ „์น˜ํ•˜์ง€ ์•Š์Œ

\((9 \times 9)\) ํ–‰๋ ฌ ํฌ๊ธฐ์—์„œ๋Š” ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์ด ์ด๋ฃจ์–ด์ ธ ๋ชจ๋“  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๋ถˆํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  1. LayoutTensor ํƒ€์ผ API

    out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)
    

    ์ˆ˜๋™ ์ขŒํ‘œ ๊ณ„์‚ฐ ์—†์ด โ€œ(block_idx.y, block_idx.x) ์œ„์น˜์˜ ํƒ€์ผ์„ ๊ฐ€์ ธ์˜จ๋‹คโ€œ๋ฅผ ์ง์ ‘ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

  2. ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ

    copy_dram_to_sram_async[
       thread_layout = load_a_layout,
       num_threads = NUM_THREADS,
       block_dim_count = BLOCK_DIM_COUNT
    ](a_shared,a_tile)
    copy_dram_to_sram_async[
       thread_layout = load_b_layout,
       num_threads = NUM_THREADS,
       block_dim_count = BLOCK_DIM_COUNT
    ](b_shared, b_tile)
    async_copy_wait_all()
    

    ์ด ์—ฐ์‚ฐ๋“ค์€:

    • ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์šฐํšŒํ•˜๋Š” ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์˜ ์ค‘์ฒฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค (copy_dram_to_sram_async ์ฐธ๊ณ )
    • ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์œ„ํ•œ ํŠนํ™”๋œ ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™”๊ฐ€ ๋ถˆํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
    • ์ค‘์š”:
      • ํ‘œ์ค€ GPU ๋กœ๋“œ๋Š” ์ด๋ฏธ ๋น„๋™๊ธฐ์ ์ž…๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋“ค์€ ๋” ๋‚˜์€ ๋ฆฌ์†Œ์Šค ํ™œ์šฉ๊ณผ ๋ ˆ์ง€์Šคํ„ฐ ์šฐํšŒ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
      • copy_dram_to_sram_async๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ 1D ์Šค๋ ˆ๋“œ ๋ธ”๋ก(block_dim.y == block_dim.z == 1)์„ ๊ฐ€์ •ํ•˜๋ฉฐ, ๋ณ„๋„ ์ง€์ •์ด ์—†์œผ๋ฉด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณต์‚ฌ์— ์ฐธ์—ฌํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ์„ ์ง€์ •ํ•˜์—ฌ ์ด ๋™์ž‘์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
        • block_dim_count: ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ์ฐจ์› ์ˆ˜ (2D ์Šค๋ ˆ๋“œ ๋ธ”๋ก THREADS_PER_BLOCK_TILED = (TPB, TPB)์˜ ๊ฒฝ์šฐ 2)
        • num_threads: ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB*TPB == 9)
  3. ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ ˆ์ด์•„์›ƒ

    comptime load_a_layout = Layout.row_major(1, TPB)    # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ
    comptime load_b_layout = Layout.row_major(1, TPB)    # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ
    # ์ฐธ๊ณ : ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ์—์„œ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๊ฐ™์€ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉ
    

    ํ˜„์žฌ ๊ตฌํ˜„์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ถ„์„:

    ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์œ„ํ•ด Layout.row_major(1, TPB)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

    • load_a_layout: ์Šค๋ ˆ๋“œ๋“ค์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ–‰๋ ฌ A ํ–‰์˜ ์—ฐ์† ์›์†Œ๋ฅผ ๋กœ๋“œ
    • load_b_layout: ์Šค๋ ˆ๋“œ๋“ค์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ–‰๋ ฌ B ํ–‰์˜ ์—ฐ์† ์›์†Œ๋ฅผ ๋กœ๋“œ
    • ํ•ต์‹ฌ: ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์€ ๋ณต์‚ฌ ์‹œ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ ๋ฐฉ์‹์„ ๊ฒฐ์ •ํ•˜๋ฉฐ, ์ตœ์ข… ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ๊ณผ๋Š” ๋ณ„๊ฐœ์ž…๋‹ˆ๋‹ค

    ์‹ค์ œ ์—ฐ์‚ฐ ํŒจํ„ด (A ร— B์ž„์„ ์ฆ๋ช…):

    # ํ˜„์žฌ ๊ตฌํ˜„์˜ ์‹ค์ œ ์—ฐ์‚ฐ
    acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    # ์ด๊ฒƒ์€ C[i,j] = ฮฃ(A[i,k] * B[k,j])์— ํ•ด๋‹น
    # ์ฆ‰, ํ‘œ์ค€ ํ–‰๋ ฌ ๊ณฑ์…ˆ A ร— B
    

    ๋‘ ํ–‰๋ ฌ์ด ๊ฐ™์€ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ :

    ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ํƒ€์ผ ๋กœ๋”ฉ:
    - Matrix A ํƒ€์ผ: ์Šค๋ ˆ๋“œ๋“ค์ด A[block_row, k], A[block_row, k+1], A[block_row, k+2]... ๋กœ๋“œ (์—ฐ์†)
    - Matrix B ํƒ€์ผ: ์Šค๋ ˆ๋“œ๋“ค์ด B[k, block_col], B[k, block_col+1], B[k, block_col+2]... ๋กœ๋“œ (์—ฐ์†)
    
    Layout.row_major(1, TPB)๋กœ ๋‘ ํŒจํ„ด ๋ชจ๋‘ ๋ณ‘ํ•ฉ
    

    ์„ธ ๊ฐ€์ง€ ๋ณ„๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ๊ณ ๋ ค์‚ฌํ•ญ:

    1. ์ „์—ญโ†’๊ณต์œ  ๋ณ‘ํ•ฉ: Layout.row_major(1, TPB)๋กœ ๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ณด์žฅ
    2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ: a_shared[local_row, k] * b_shared[k, local_col]๋กœ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ
    3. ํ–‰๋ ฌ ์—ฐ์‚ฐ: ์—ฐ์‚ฐ ํŒจํ„ด์ด A ร— B๋ฅผ ๊ฒฐ์ • (A ร— B^T๊ฐ€ ์•„๋‹˜)
  4. ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์œผ๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

    @parameter
    for idx in range(size // TPB):  # ๋‚˜๋จธ์ง€ ์—†๋Š” ๋‚˜๋ˆ—์…ˆ: 9 // 3 = 3
    

    \((9 \times 9)\) ํ–‰๋ ฌ๊ณผ \((3 \times 3)\) ํƒ€์ผ์—์„œ๋Š” ๋ชจ๋“  ํƒ€์ผ์ด ์ •ํ™•ํžˆ ๊ฝ‰ ์ฐจ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค!

  5. ๋ฐฉ์–ด์  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•œ ๊น”๋”ํ•œ ํƒ€์ผ ์ฒ˜๋ฆฌ

    # ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์—์„œ๋„ ๋ฐฉ์–ด์  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํฌํ•จ
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc
    

    \((9 \times 9)\) ์˜ ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์—์„œ๋Š” ์ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๊ธฐ์ˆ ์ ์œผ๋กœ ๋ถˆํ•„์š”ํ•˜์ง€๋งŒ, ๋ฐฉ์–ด์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ค๋ฅธ ํ–‰๋ ฌ ํฌ๊ธฐ์™€์˜ ์ผ๊ด€์„ฑ์„ ์œ„ํ•ด ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ

๊ด€์šฉ์  ๊ตฌํ˜„์€ ํƒ€์ผ๋ง์˜ ์„ฑ๋Šฅ ์ด์ ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋” ๊น”๋”ํ•œ ์ถ”์ƒํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  1. ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ: ํƒ€์ผ๋ง์„ ํ†ตํ•ด ๊ณต๊ฐ„์ , ์‹œ๊ฐ„์  ์ง€์—ญ์„ฑ์„ ํ™œ์šฉ
  2. ๋ณ‘ํ•ฉ ์ ‘๊ทผ: ํŠนํ™”๋œ ๋กœ๋“œ ๋ ˆ์ด์•„์›ƒ์œผ๋กœ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ณด์žฅ
  3. ์—ฐ์‚ฐ-๋ฉ”๋ชจ๋ฆฌ ์ค‘์ฒฉ: ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ํ†ตํ•œ ์ค‘์ฒฉ ๊ฐ€๋Šฅ
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ๋ถˆํ•„์š”ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™” ์—†์Œ
  5. ๋ ˆ์ง€์Šคํ„ฐ ์••๋ ฅ: ์ตœ์ ์˜ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์œ„ํ•œ ๋ˆ„์  ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ

์ด ๊ตฌํ˜„์€ ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋กœ๋„ ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ๋ณต์žกํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ณ ์ˆ˜์ค€์˜ ํ‘œํ˜„๋ ฅ๊ณผ ์ €์ˆ˜์ค€์˜ ์„ฑ๋Šฅ ์ œ์–ด๋ฅผ ๊ฒฐํ•ฉํ•˜๋Š” Mojo์˜ ์ฒ ํ•™์„ ์ž˜ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

์ˆ˜๋™ ํƒ€์ผ๋ง๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด์ 

๊ธฐ๋Šฅ์ˆ˜๋™ Tiling๊ด€์šฉ์  Tiling
๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์žˆ๋Š” ์ง์ ‘ ์ธ๋ฑ์‹ฑLayoutTensor ํƒ€์ผ API
ํƒ€์ผ ๋กœ๋”ฉ์›์†Œ๋ณ„ ๋ช…์‹œ์  ๋ณต์‚ฌ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์˜ ๋ฒŒํฌ ์ „์†ก
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์ˆ˜๋™ ์ดˆ๊ธฐํ™” (๋ฐฉ์–ด์ )๋ณต์‚ฌ ํ•จ์ˆ˜๊ฐ€ ๊ด€๋ฆฌ
์ฝ”๋“œ ๋ณต์žก๋„๋ช…์‹œ์  ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋‹ค์†Œ ์žฅํ™ฉ๊ณ ์ˆ˜์ค€ API๋กœ ๋” ๊ฐ„๊ฒฐ
๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ๋”ฉ๊ณผ ์—ฐ์‚ฐ ์ค‘ ๋‹ค์ˆ˜์˜ ๊ฒ€์‚ฌ์ตœ์ข… ๊ธฐ๋ก ์‹œ ๋‹จ์ผ ๋ฐฉ์–ด์  ๊ฒ€์‚ฌ
ํ–‰๋ ฌ ๋ฐฉํ–ฅA์™€ B ๋ชจ๋‘ ๊ฐ™์€ ๋ฐฉํ–ฅ (ํ‘œ์ค€ A ร— B)A์™€ B ๋ชจ๋‘ ๊ฐ™์€ ๋ฐฉํ–ฅ (ํ‘œ์ค€ A ร— B)
์„ฑ๋Šฅ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์˜ ๋ช…์‹œ์  ์ œ์–ด๋ ˆ์ง€์Šคํ„ฐ ์šฐํšŒ๋ฅผ ํฌํ•จํ•œ ์ตœ์ ํ™”๋œ ๋ ˆ์ด์•„์›ƒ

๊ด€์šฉ์  ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋‹จ์ˆœํžˆ ๋” ๊น”๋”ํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ํŠนํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ๋น„๋™๊ธฐ ์—ฐ์‚ฐ ๋•๋ถ„์— ์„ฑ๋Šฅ๋„ ๋” ์ข‹์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์ „์น˜ ๋กœ๋”ฉ์€ ์–ธ์ œ ์œ ์šฉํ• ๊นŒ?

ํ˜„์žฌ ๊ตฌํ˜„์€ ์ „์น˜ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด ์„น์…˜์€ ๋ ˆ์ด์•„์›ƒ ์‹œ์Šคํ…œ์œผ๋กœ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•œ ๊ต์œก์  ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

ํ˜„์žฌ ๊ตฌํ˜„ ์š”์•ฝ:

  • ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ Layout.row_major(1, TPB) ์‚ฌ์šฉ
  • ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ ์ˆ˜ํ–‰
  • ๋ณต์‚ฌ ์ค‘ ๋ฐ์ดํ„ฐ ์ „์น˜ ์—†์Œ

์ „์น˜ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ต์œก์  ์‹œ๋‚˜๋ฆฌ์˜ค:

์ด ํผ์ฆ์€ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ํ‘œ์ค€ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์ง€๋งŒ, ๋ ˆ์ด์•„์›ƒ ์‹œ์Šคํ…œ์˜ ์œ ์—ฐ์„ฑ์€ ๋‹ค๋ฅธ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ ๊ฐ•๋ ฅํ•œ ์ตœ์ ํ™”๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค:

# ์˜ˆ์‹œ: A ร— B๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์ „ ์ „์น˜๋œ ํ–‰๋ ฌ B^T๋ฅผ ๋กœ๋“œ
# (ํ˜„์žฌ ๊ตฌํ˜„์—์„œ๋Š” ์ด๋ ‡๊ฒŒ ํ•˜์ง€ ์•Š์Œ)
comptime load_b_layout = Layout.row_major(TPB, 1)   # B^T๋ฅผ ๋ณ‘ํ•ฉ ์ ‘๊ทผ์œผ๋กœ ๋กœ๋“œ
comptime store_b_layout = Layout.row_major(1, TPB)  # ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— B๋กœ ์ €์žฅ
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)

์ „์น˜ ๋กœ๋”ฉ์˜ ํ™œ์šฉ ์‚ฌ๋ก€ (์ด ํผ์ฆ์—์„œ๋Š” ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ):

  1. ์ด๋ฏธ ์ „์น˜๋œ ์ž…๋ ฅ ํ–‰๋ ฌ: \(B\) ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ์ „์น˜ ์ƒํƒœ๋กœ ์ €์žฅ๋˜์–ด ์žˆ๋Š” ๊ฒฝ์šฐ
  2. ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜: \(A^T \times B\), \(A \times B^T\), ๋˜๋Š” \(A^T \times B^T\) ๊ณ„์‚ฐ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ณ€ํ™˜: ํ–‰ ์šฐ์„ ๊ณผ ์—ด ์šฐ์„  ๋ ˆ์ด์•„์›ƒ ๊ฐ„ ๋ณ€ํ™˜
  4. ๋ณ„๋„ ์ „์น˜ ์—ฐ์‚ฐ ์—†์ด ๋กœ๋“œ: ํ•„์š”ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ง์ ‘ ๋กœ๋“œ

ํ•ต์‹ฌ ๊ตฌ๋ถ„:

  • ํ˜„์žฌ ๊ตฌํ˜„: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ํ‘œ์ค€ \(A \times B\) ๊ณฑ์…ˆ์— Layout.row_major(1, TPB) ์‚ฌ์šฉ
  • ์ „์น˜ ๋กœ๋”ฉ ์˜ˆ์‹œ: ์ด๋ฏธ ์ „์น˜๋œ ๋ฐ์ดํ„ฐ๋‚˜ ๋‹ค๋ฅธ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ๋‹ค๋ฅธ ๋ ˆ์ด์•„์›ƒ ์‚ฌ์šฉ

์ด๊ฒƒ์€ Mojo์˜ ์ฒ ํ•™์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ์— ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ๋„, ํ•„์š”ํ•  ๋•Œ ์ €์ˆ˜์ค€ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.


์š”์•ฝ: ํ•ต์‹ฌ ์ •๋ฆฌ

๊ด€์šฉ์  ํƒ€์ผ๋ง ๊ตฌํ˜„์ด ์‹ค์ œ๋กœ ํ•˜๋Š” ๊ฒƒ:

  1. ํ–‰๋ ฌ ์—ฐ์‚ฐ: ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ
  2. ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ Layout.row_major(1, TPB)๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
  3. ์—ฐ์‚ฐ ํŒจํ„ด: acc += a_shared[local_row, k] * b_shared[k, local_col]
  4. ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ: ๋กœ๋”ฉ ์‹œ ์ „์น˜ ์—†์Œ

์ด๊ฒƒ์ด ์ตœ์ ์ธ ์ด์œ :

  • ๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: Layout.row_major(1, TPB)๋กœ ํšจ์œจ์ ์ธ ๋กœ๋”ฉ ๋ณด์žฅ
  • ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์ถฉ๋Œ์„ ๋ฐฉ์ง€
  • ํ‘œ์ค€ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ํ–‰๋ ฌ ๊ณฑ์…ˆ ํŒจํ„ด์„ ๊ตฌํ˜„

Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op

MAX ๊ทธ๋ž˜ํ”„๋กœ ํŒŒ์ด์ฌ ์—ฐ๋™ํ•˜๊ธฐ

GPU ํผ์ฆ ์—ฌ์ •์˜ Part IV์— ์ง„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค: MAX ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ…€ Op์œผ๋กœ ํŒŒ์ด์ฌ ์—ฐ๋™ํ•˜๊ธฐ.

์ด์ „ ํผ์ฆ๋“ค์—์„œ๋Š” Mojo๋กœ ํšจ์œจ์ ์ธ GPU ์ปค๋„์„ ์ž‘์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ์ด์ œ๋ถ€ํ„ฐ๋Š” ๋‹ค์Œ์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค:

  • ์ปค๋„์„ ํŒŒ์ด์ฌ์—์„œ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ๋Š” ์ปค์Šคํ…€ ์—ฐ์‚ฐ์œผ๋กœ ํŒจํ‚ค์ง•ํ•˜๊ธฐ
  • MAX ๊ทธ๋ž˜ํ”„ ์‹œ์Šคํ…œ๊ณผ ํ†ตํ•ฉํ•˜์—ฌ ๋จธ์‹ ๋Ÿฌ๋‹์„ ๊ฐ€์†ํ•˜๊ธฐ
  • ํ•˜์ด๋ ˆ๋ฒจ ํŒŒ์ด์ฌ API์™€ ๋กœ์šฐ๋ ˆ๋ฒจ GPU ์ฝ”๋“œ ์‚ฌ์ด์˜ ๊ฐ„๊ทน ๋ฉ”์šฐ๊ธฐ

์ด๋ฅผ ํ†ตํ•ด ์ต์ˆ™ํ•œ ํŒŒ์ด์ฌ ํ™˜๊ฒฝ์—์„œ ์ž‘์—…ํ•˜๋ฉด์„œ๋„ Mojo GPU ์ปค๋„์˜ ์„ฑ๋Šฅ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

Puzzle 13: 1D ํ•ฉ์„ฑ๊ณฑ์—์„œ GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์„ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” ์ด ์ปค๋„์„ MAX ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด ํŒŒ์ด์ฌ์—์„œ ์ง์ ‘ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ๋Š” ์ปค์Šคํ…€ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

์‚ฌ์šฉํ•  1D ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์€ ์ด๋ฏธ ๊ตฌํ˜„๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

comptime TPB = 15
comptime BLOCKS_PER_GRID = (2, 1)


fn conv1d_kernel[
    in_layout: Layout,
    out_layout: Layout,
    conv_layout: Layout,
    input_size: Int,
    conv_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, MutAnyOrigin],
    kernel: LayoutTensor[dtype, conv_layout, MutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    # first: need to account for padding
    shared_a = LayoutTensor[
        dtype,
        Layout.row_major(TPB + conv_size - 1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_b = LayoutTensor[
        dtype,
        Layout.row_major(conv_size),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]
        else:
            shared_a[TPB + local_i] = 0

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: output.element_type = 0

        @parameter
        for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum


์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ์ปค์Šคํ…€ op ๋“ฑ๋ก: @compiler.register ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋ฅผ ํ†ตํ•ด Mojo ํ•จ์ˆ˜๋ฅผ ํŒŒ์ด์ฌ์— ๋…ธ์ถœํ•˜๋Š” ๋ฐฉ๋ฒ• ์ดํ•ดํ•˜๊ธฐ
  2. ์ปค์Šคํ…€ op ํŒจํ‚ค์ง•: Mojo ์ฝ”๋“œ๋ฅผ MAX ๊ทธ๋ž˜ํ”„์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํŒจํ‚ค์ง•ํ•˜๋Š” ๋ฐฉ๋ฒ• ์ตํžˆ๊ธฐ
  3. ํŒŒ์ด์ฌ ํ†ตํ•ฉ: MAX ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด ํŒŒ์ด์ฌ์—์„œ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํ˜ธ์ถœํ•˜๊ธฐ
  4. ํฌ๋กœ์Šค ์–ธ์–ด ๋ฐ์ดํ„ฐ ํ๋ฆ„: ํŒŒ์ด์ฌ๊ณผ GPU ์‚ฌ์ด์˜ ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌํ•˜๊ธฐ

์ด ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ์—์„œ NumPy ๋ฐฐ์—ด์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ
  • ์ด ๋ฐ์ดํ„ฐ๋ฅผ GPU๋กœ ์ „์†กํ•˜๊ธฐ
  • ์ตœ์ ํ™”๋œ ํ•ฉ์„ฑ๊ณฑ ์ปค๋„ ์‹คํ–‰ํ•˜๊ธฐ
  • ๊ฒฐ๊ณผ๋ฅผ ํŒŒ์ด์ฌ์œผ๋กœ ๋ฐ˜ํ™˜ํ•˜๊ธฐ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ฉด ํŒŒ์ด์ฌ์˜ ํ’๋ถ€ํ•œ ์ƒํƒœ๊ณ„์™€ Mojo์˜ ๊ฐ•๋ ฅํ•œ GPU ์„ฑ๋Šฅ์„ ์ž‡๋Š” ๋งค๋„๋Ÿฌ์šด ๋‹ค๋ฆฌ๋ฅผ ๋งŒ๋“ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด conv1d.mojo์—์„œ conv1d_kernel์„ ํ˜ธ์ถœํ•˜๋Š” ํ•œ ์ค„๋งŒ ์ฑ„์šฐ๋ฉด ๋ฉ๋‹ˆ๋‹ค:

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        output: OutputTensor[rank=1],
        input: InputTensor[rank = output.rank],
        kernel: InputTensor[rank = output.rank],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        output_tensor = output.to_layout_tensor()
        input_tensor = input.to_layout_tensor()
        kernel_tensor = kernel.to_layout_tensor()
        comptime in_layout = input_tensor.layout
        comptime out_layout = output_tensor.layout
        comptime conv_layout = kernel_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output_tensor.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p17/op/conv1d.mojo

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p17
pixi run -e amd p17
pixi run -e apple p17
uv run poe p17

์„ฑ๊ณตํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Verification passed: Custom kernel results match NumPy calculation

์ด ์ถœ๋ ฅ์€ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์ด 1D ํ•ฉ์„ฑ๊ณฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ–ˆ์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

์†”๋ฃจ์…˜

์ด ํผ์ฆ์„ ํ’€๋ ค๋ฉด 1D ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์„ MAX ๊ทธ๋ž˜ํ”„ ์‹œ์Šคํ…œ๊ณผ ํ†ตํ•ฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ Conv1DCustomOp ๊ตฌ์กฐ์ฒด์˜ execute ๋ฉ”์„œ๋“œ์—์„œ ์ปค๋„์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ํ˜ธ์ถœํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ’€์ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

            comptime kernel = conv1d_kernel[
                in_layout, out_layout, conv_layout, input_size, conv_size
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                output_tensor,
                input_tensor,
                kernel_tensor,
                grid_dim=BLOCKS_PER_GRID,
                block_dim=(TPB, 1),
            )
์ด ํ•œ ์ค„์ด ์ˆ˜ํ–‰ํ•˜๋Š” ์ค‘์š”ํ•œ ์ž‘์—…๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:
  1. GPU ์ปจํ…์ŠคํŠธ(gpu_ctx์˜ ํƒ€์ž…์€ DeviceContext)์—์„œ enqueue_function์„ ํ˜ธ์ถœํ•˜์—ฌ ์ปค๋„ ์‹คํ–‰ ์˜ˆ์•ฝ
  2. ํ•„์š”ํ•œ ๋ ˆ์ด์•„์›ƒ๊ณผ ํฌ๊ธฐ ์ •๋ณด๋ฅผ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ „๋‹ฌ
  3. ์ถœ๋ ฅ, ์ž…๋ ฅ, ์ปค๋„ ํ…์„œ๋ฅผ ๋Ÿฐํƒ€์ž„ ์ธ์ž๋กœ ์ œ๊ณต
  4. ์ ์ ˆํ•œ ์ฐจ์›์œผ๋กœ ์‹คํ–‰ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ

์ „์ฒด ๋งฅ๋ฝ์—์„œ ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

ํŒŒ์ด์ฌ-Mojo ํ†ตํ•ฉ ํ๋ฆ„

  1. ํŒŒ์ด์ฌ ์ชฝ (problems/p17/p17.py):

    • ์ž…๋ ฅ๊ณผ ์ปค๋„์šฉ NumPy ๋ฐฐ์—ด ์ƒ์„ฑ
    • MAX ๊ทธ๋ž˜ํ”„๋กœ ์—ฐ์‚ฐ์„ ๊ฐ์‹ธ๋Š” conv_1d() ํ•จ์ˆ˜ ํ˜ธ์ถœ
    • NumPy ๋ฐฐ์—ด์„ Buffer.from_numpy(input).to(device)๋กœ MAX driver Buffer๋กœ ๋ณ€ํ™˜
    • custom_extensions=[mojo_kernels]๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํŒจํ‚ค์ง€ ๋กœ๋“œ
  2. ๊ทธ๋ž˜ํ”„ ๊ตฌ์ถ•:

    • TensorType์œผ๋กœ ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ํ…์„œ ํƒ€์ž… ์ •์˜
    • parameters={...}๋ฅผ ํ†ตํ•ด ์—ฐ์‚ฐ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ง€์ •
    • Graph("conv_1d_graph", ...)๋กœ ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„ ์ƒ์„ฑ
    • ops.custom(name="conv1d", ...)๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํ˜ธ์ถœ
  3. ์ปค์Šคํ…€ op ๋“ฑ๋ก:

    • @compiler.register("conv1d") ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๊ฐ€ ์—ฐ์‚ฐ์„ MAX ๊ทธ๋ž˜ํ”„์— ๋…ธ์ถœ. @compiler.register ์ฐธ๊ณ 
    • execute ๋ฉ”์„œ๋“œ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ธํ„ฐํŽ˜์ด์Šค(์ž…๋ ฅ, ์ถœ๋ ฅ, ์ปจํ…์ŠคํŠธ) ์ •์˜
    • ์ž…์ถœ๋ ฅ ํ…์„œ๊ฐ€ ์ปค๋„์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก LayoutTensor๋กœ ๋ณ€ํ™˜
    • Device context๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น๊ณผ ์ปค๋„ ์‹คํ–‰ ๊ด€๋ฆฌ
  4. ์ปค๋„ ์‹คํ–‰:

    • model.execute(...)๊ฐ€ ํ˜ธ์ถœ๋˜๋ฉด conv1d_kernel์ด ๋ฐ์ดํ„ฐ ์ˆ˜์‹ 
    • grid_dim๊ณผ block_dim์œผ๋กœ GPU ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ์„ค์ •
    • result.to(CPU())๋กœ ๊ฒฐ๊ณผ๋ฅผ CPU๋กœ ์ „์†ก
    • NumPy ๊ฒ€์ฆ์œผ๋กœ ๊ธฐ๋Œ€ ์ถœ๋ ฅ๊ณผ ๊ฒฐ๊ณผ ๋น„๊ต

ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ ์ƒ์„ธ

  1. ์ปค์Šคํ…€ Op ๊ตฌ์กฐ์ฒด:

    @compiler.register("conv1d")
    struct Conv1DCustomOp:
        @staticmethod
        fn execute[target: StaticString, input_size: Int, conv_size: Int, dtype: DType = DType.float32](
            output: OutputTensor[rank=1],
            input: InputTensor[dtype = output.dtype, rank = output.rank],
            kernel: InputTensor[dtype = output.dtype, rank = output.rank],
            ctx: DeviceContextPtr,
        ) raises:
            # ๊ตฌํ˜„
    
    • target์€ ๋””๋ฐ”์ด์Šค ํƒ€์ž…(โ€œgpuโ€ ๋˜๋Š” โ€œcpuโ€)์„ ๋‚˜ํƒ€๋ƒ„
    • input_size์™€ conv_size๋Š” ํŒŒ์ด์ฌ์—์„œ ์ „๋‹ฌ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ
    • ํ…์„œ ํƒ€์ž…์ด ์˜ฌ๋ฐ”๋ฅธ shape๊ณผ ํƒ€์ž… ๊ฒ€์‚ฌ ๋ณด์žฅ
    • ๋ฐ˜ํ™˜ ํƒ€์ž…์€ ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ ์œ„ํ•ด raises
  2. ํ…์„œ ๋ณ€ํ™˜:

    output_tensor = output.to_layout_tensor()
    input_tensor = input.to_layout_tensor()
    kernel_tensor = kernel.to_layout_tensor()
    
    • MAX ๊ทธ๋ž˜ํ”„ ํ…์„œ๋ฅผ Mojo LayoutTensor๋กœ ๋ณ€ํ™˜
    • ์ปค๋„์ด ํ…์„œ๋ฅผ ์ง์ ‘ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์คŒ
    • ์ปดํŒŒ์ผ ํƒ€์ž„ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ ์ถ”์ถœ
  3. Device Context ์‚ฌ์šฉ:

    gpu_ctx = ctx.get_device_context()
    gpu_ctx.enqueue_memset(...)  # ์ถœ๋ ฅ ๋ฒ„ํผ ์ดˆ๊ธฐํ™”
    gpu_ctx.enqueue_function[..., ...](...) # ์ปค๋„ ์˜ˆ์•ฝ
    
    • ๋””๋ฐ”์ด์Šค ์ปจํ…์ŠคํŠธ๊ฐ€ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ๊ด€๋ฆฌ
    • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์œผ๋กœ ์˜ฌ๋ฐ”๋ฅธ ๋ฒ„ํผ ์ƒํƒœ๋ฅผ ๋ณด์žฅ
    • ํ•จ์ˆ˜๋ฅผ ํ์— ๋“ฑ๋กํ•˜์—ฌ ์ปค๋„ ์‹คํ–‰์„ ์˜ˆ์•ฝ

์ด ํ’€์ด๋Š” ํŒŒ์ด์ฌ ๋ฐ์ดํ„ฐ๊ฐ€ MAX ๊ทธ๋ž˜ํ”„๋ฅผ ๊ฑฐ์ณ GPU์—์„œ ์‹คํ–‰๋˜๊ณ  ๋‹ค์‹œ ๋Œ์•„์˜ค๋Š” ์ „์ฒด ํ๋ฆ„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. Mojo์˜ ๊ฐ•๋ ฅํ•œ ํƒ€์ž… ์‹œ์Šคํ…œ๊ณผ ๋งค๊ฐœ๋ณ€์ˆ˜ํ™” ํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํšจ์œจ์ ์ด๊ณ  ํƒ€์ž… ์•ˆ์ „ํ•œ ๊ฐ€์† ์—ฐ์‚ฐ์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค.

MAX ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ…€ op ์ดํ•ดํ•˜๊ธฐ

๋” ์ž์„ธํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜ ํŠœํ† ๋ฆฌ์–ผ์„ ์ฐธ๊ณ ํ•˜์„ธ์š”:

์ปค์Šคํ…€ op ๋“ฑ๋ก

์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ๋งŒ๋“œ๋Š” ํ•ต์‹ฌ์€ @compiler.register ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ์™€ ๊ด€๋ จ ๊ตฌ์กฐ์ฒด์ž…๋‹ˆ๋‹ค:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    fn execute[...](
        output: OutputTensor[rank=1],
        input: InputTensor[dtype = output.dtype, rank = output.rank],
        kernel: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # ๊ตฌํ˜„

๋“ฑ๋ก์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ:

  • ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ์— ์ „๋‹ฌํ•˜๋Š” ์ด๋ฆ„("conv1d")์ด ํŒŒ์ด์ฌ ์ฝ”๋“œ์—์„œ ์ด ์—ฐ์‚ฐ์„ ํ˜ธ์ถœํ•  ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ์ด๋ฆ„
  • ๊ตฌ์กฐ์ฒด์—๋Š” ์˜ฌ๋ฐ”๋ฅธ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ๊ฐ€์ง„ execute ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์–ด์•ผ ํ•จ
  • OutputTensor์™€ InputTensor ํƒ€์ž…์ด ํŒŒ์ด์ฌ ๋ฐ์ดํ„ฐ์™€์˜ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ •์˜
  • DeviceContextPtr์ด ์‹คํ–‰ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ์ ‘๊ทผ์„ ์ œ๊ณต

์ปค์Šคํ…€ op ํŒจํ‚ค์ง•

์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ํŒŒ์ด์ฌ์—์„œ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋จผ์ € ํŒจํ‚ค์ง•ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

mojo package op -o op.mojopkg

์ด ๋ช…๋ น์€:

  1. Mojo ์ฝ”๋“œ๋ฅผ ๋ฐฐํฌ ๊ฐ€๋Šฅํ•œ ํŒจํ‚ค์ง€๋กœ ์ปดํŒŒ์ผ
  2. MAX ๊ทธ๋ž˜ํ”„๊ฐ€ ์—ฐ์‚ฐ์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ƒ์„ฑ
  3. ํŒŒ์ด์ฌ์—์„œ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐ”์ด๋„ˆ๋ฆฌ ์•„ํ‹ฐํŒฉํŠธ(op.mojopkg)๋ฅผ ์ƒ์„ฑ

ํŒจํ‚ค์ง€๋Š” MAX ๊ทธ๋ž˜ํ”„๊ฐ€ ์ฐพ์„ ์ˆ˜ ์žˆ๋Š” ์œ„์น˜์— ๋ฐฐ์น˜ํ•ด์•ผ ํ•˜๋ฉฐ, ๋ณดํ†ต ํŒŒ์ด์ฌ ์ฝ”๋“œ์—์„œ ์ ‘๊ทผ ๊ฐ€๋Šฅํ•œ ๋””๋ ‰ํ† ๋ฆฌ์— ๋‘ก๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ํ†ตํ•ฉ

ํŒŒ์ด์ฌ ์ชฝ์—์„œ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

# Mojo ์—ฐ์‚ฐ์ด ํฌํ•จ๋œ ๋””๋ ‰ํ† ๋ฆฌ ๊ฒฝ๋กœ
mojo_kernels = Path(__file__).parent / "op"

# ์ปค์Šคํ…€ conv1d ์—ฐ์‚ฐ์œผ๋กœ ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑ
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # ์ปค์Šคํ…€ op ํŒจํ‚ค์ง€ ๋กœ๋“œ
) as graph:
    # ๊ทธ๋ž˜ํ”„์˜ ์ž…๋ ฅ ์ •์˜
    input_value, kernel_value = graph.inputs

    # ์ด๋ฆ„์œผ๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ์‚ฌ์šฉ
    output = ops.custom(
        name="conv1d",  # @compiler.register์˜ ์ด๋ฆ„๊ณผ ์ผ์น˜ํ•ด์•ผ ํ•จ
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor

ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. custom_extensions๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์˜ ๊ฒฝ๋กœ ์ง€์ •
  2. ๋“ฑ๋ก๋œ ์—ฐ์‚ฐ ์ด๋ฆ„์œผ๋กœ ops.custom ํ˜ธ์ถœ
  3. ์—ฐ์‚ฐ์˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜์— ๋งž๋Š” ์ž…๋ ฅ ๊ฐ’๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌ

Puzzle 18: ์†Œํ”„ํŠธ๋งฅ์Šค Op

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜๋ฅผ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์†Œํ”„ํŠธ๋งฅ์Šค๋Š” ์‹ค์ˆ˜ ๋ฒกํ„ฐ๋ฅผ ๋ฐ›์•„ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ์ •๊ทœํ™”ํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.

์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜๋Š” ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ง€์ˆ˜ ํ•จ์ˆ˜ ์ ์šฉ: ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ๊ฐ ์š”์†Œ์— ์ง€์ˆ˜ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋“  ๊ฐ’์ด ์–‘์ˆ˜๊ฐ€ ๋˜๊ณ  ๊ฐ’ ์‚ฌ์ด์˜ ์ฐจ์ด๊ฐ€ ์ฆํญ๋ฉ๋‹ˆ๋‹ค. ํฐ ์ž…๋ ฅ๊ฐ’์€ ํ›จ์”ฌ ํฐ ์ง€์ˆ˜ ์ถœ๋ ฅ์„ ๋งŒ๋“ค๊ณ , ์ž‘๊ฑฐ๋‚˜ ์Œ์ˆ˜์ธ ๊ฐ’์€ 0์— ๊ฐ€๊นŒ์šด ์ถœ๋ ฅ์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค.

  2. ์ •๊ทœํ™”: ๊ฐ ์ง€์ˆ˜ ๊ฐ’์„ ๋ชจ๋“  ์ง€์ˆ˜ ๊ฐ’์˜ ํ•ฉ์œผ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. ์ด ์ •๊ทœํ™” ๋‹จ๊ณ„๋ฅผ ํ†ตํ•ด ๊ฒฐ๊ณผ๊ฐ’์ด ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ชจ๋“  ๊ฐ’์ด 0๊ณผ 1 ์‚ฌ์ด์ด๊ณ  ํ•ฉ์ด ์ •ํ™•ํžˆ 1์ด ๋ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์ ์œผ๋กœ ์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

์—ฌ๊ธฐ์„œ:

  • \(x_i\)๋Š” ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ \(i\)๋ฒˆ์งธ ์š”์†Œ
  • \(n\)์€ ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ๊ธธ์ด

๊ทธ๋Ÿฌ๋‚˜ ์ด ์ง์ ‘์ ์ธ ๊ตฌํ˜„์€ ๊ฐ’์ด ํด ๋•Œ ์ˆ˜์น˜ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ๋ฌธ์ œ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ˆ˜์น˜์ ์œผ๋กœ ๋” ์•ˆ์ •์ ์ธ ๋ฒ„์ „์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$
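두 공식의 차이는 NumPy로 직접 확인해 볼 수 있습니다. 아래는 직접적인 구현과 안정화된 구현을 비교하는 설명용 스케치입니다(이 문서의 Mojo 커널이 아니라 이해를 돕기 위한 예시입니다):

```python
import numpy as np

def softmax_naive(x):
    # 직접적인 구현: 큰 입력에서 exp(x)가 오버플로우되어 nan이 발생
    e = np.exp(x)
    return e / e.sum()

def softmax_stable(x):
    # 최댓값을 빼면 지수의 최대 인자가 0이 되어 오버플로우가 방지됨
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)   # [nan nan nan]
stable = softmax_stable(x)

print(naive)
print(stable, stable.sum())
```

두 공식은 수학적으로 동치이지만(분자와 분모에 같은 상수 \(e^{-\max(x)}\)를 곱한 것), 안정화된 버전만이 유한한 부동소수점 범위 안에서 계산됩니다.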

GPU ๊ตฌํ˜„์—์„œ๋Š” ์ตœ๋Œ“๊ฐ’ ์ฐพ๊ธฐ์™€ ์ง€์ˆ˜ ํ•ฉ ๊ณ„์‚ฐ ๋ชจ๋‘์— ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ํฐ ๋ฒกํ„ฐ์—์„œ๋„ ๋†’์€ ํšจ์œจ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

  • ํšจ์œจ์ ์ธ ์ตœ๋Œ“๊ฐ’ ๋ฐ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ ๊ธฐ๋ฒ•์„ ํ†ตํ•œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ
  • ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  • ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์˜ ํŒŒ์ด์ฌ ํ†ตํ•ฉ
  • ๋ฐฐ๋ฆฌ์–ด๋ฅผ ํ†ตํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: BLOCK_DIM_X = 1 << log2_ceil(SIZE). ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์ด ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๋ ค๋ฉด BLOCK_DIM_X๊ฐ€ SIZE ์ด์ƒ์ธ ๊ฐ€์žฅ ์ž‘์€ 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(1 \times 1\) ๋ธ”๋ก
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„๋ฅผ ์œ„ํ•œ ๋‘ ๊ฐœ์˜ ๊ณต์œ  ๋ณ€์ˆ˜

๋ ˆ์ด์•„์›ƒ ์„ค์ •:

  • ์ž…๋ ฅ ํ…์„œ: Layout.row_major(SIZE)
  • ์ถœ๋ ฅ ํ…์„œ: Layout.row_major(SIZE)
  • ์ปค์Šคํ…€ op ํŒŒ๋ผ๋ฏธํ„ฐ: {"input_size": input_tensor.shape[0]}

์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ: ์ž ์žฌ์ ์ธ ์ˆ˜์น˜ ๋ฌธ์ œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ• ์ดํ•ดํ•˜๊ธฐ
  2. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ํšจ์œจ์ ์ธ ์ตœ๋Œ“๊ฐ’ ๋ฐ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ
  3. ์ปค์Šคํ…€ op ํ†ตํ•ฉ: Mojo GPU ์ปค๋„์„ ์œ„ํ•œ ํŒŒ์ด์ฌ ์ธํ„ฐํŽ˜์ด์Šค ์™„์„ฑํ•˜๊ธฐ
  4. ํ…Œ์ŠคํŠธ์™€ ๊ฒ€์ฆ: ๊ตฌํ˜„์ด ๊ธฐ๋Œ€ ๊ฒฐ๊ณผ์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ

์†Œํ”„ํŠธ๋งฅ์Šค ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ์—์„œ NumPy ๋ฐฐ์—ด์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ
  • GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ
  • ์ •๊ทœํ™”๋œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๊ธฐ
  • SciPy์˜ ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„ ๊ฒฐ๊ณผ์™€ ์ผ์น˜์‹œํ‚ค๊ธฐ

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด Mojo ํŒŒ์ผ์—์„œ GPU์™€ CPU ์ปค๋„์„ ๋ชจ๋‘ ๊ตฌํ˜„ํ•˜๊ณ , ํŒŒ์ด์ฌ ์ฝ”๋“œ์—์„œ ๊ทธ๋ž˜ํ”„ ์ •์˜๋ฅผ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

1. softmax.mojo์—์„œ GPU ์ปค๋„ ๊ตฌํ˜„ํ•˜๊ธฐ

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from gpu.memory import AddressSpace
from layout import Layout, LayoutTensor
from math import exp
from bit import log2_ceil
from utils.numerics import max_finite, min_finite


comptime SIZE = 128  # This must be equal to INPUT_SIZE in p18.py
comptime layout = Layout.row_major(SIZE)
comptime GRID_DIM_X = 1
# Tree-based reduction requires the number of threads to be the next power of two >= SIZE for correctness.
comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)


fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    # FILL IN (roughly 31 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p18/op/softmax.mojo

ํŒ
  1. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋„๋ก ์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„ ๋ชจ๋‘์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”
  2. ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ ์ ˆํ•œ ์ง€์ ์—์„œ barrier()๋ฅผ ํ˜ธ์ถœํ•˜๋Š” ๊ฒƒ์„ ์žŠ์ง€ ๋งˆ์„ธ์š”
  3. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž…๋ ฅ ๋ฐฐ์—ด์˜ ์ผ๋ถ€๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•˜์„ธ์š”
  4. ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์„ธ์š”
  5. ํŠนํžˆ ํฐ ์ž…๋ ฅ์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ์ฃผ์˜ ๊นŠ๊ฒŒ ์ฒ˜๋ฆฌํ•˜์„ธ์š”
  6. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์„ ์œ„ํ•ด \(e^{x_i}\) ๋Œ€์‹  \(e^{x_i - max}\)๋ฅผ ๊ณ„์‚ฐํ•˜์„ธ์š”

2. softmax.mojo์—์„œ CPU ์ปค๋„ ๊ตฌํ˜„ํ•˜๊ธฐ

fn softmax_cpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    # FILL IN (roughly 10 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p18/op/softmax.mojo

ํŒ
  1. GPU ๋ฒ„์ „๊ณผ ๋™์ผํ•œ ์ˆ˜ํ•™์  ๋‹จ๊ณ„๋ฅผ ๋”ฐ๋ฅด๋Š” ์ˆœ์ฐจ์  ๊ตฌํ˜„์„ ์ž‘์„ฑํ•˜์„ธ์š”
  2. ๋จผ์ € ๋ชจ๋“  ์ž…๋ ฅ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์œผ์„ธ์š”
  3. ๊ทธ๋‹ค์Œ ๊ฐ ์š”์†Œ์— ๋Œ€ํ•ด \(e^{x_i - max}\)๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ํ•ฉ๊ณ„๋ฅผ ๋ˆ„์ ํ•˜์„ธ์š”
  4. ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ฐ ์š”์†Œ๋ฅผ ํ•ฉ๊ณ„๋กœ ๋‚˜๋ˆ  ์ •๊ทœํ™”ํ•˜์„ธ์š”
  5. CPU ๊ตฌํ˜„์—๋Š” ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๊ฐ€ ์—†์œผ๋ฏ€๋กœ ์Šค์นผ๋ผ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์„ธ์š”

CPU์™€ GPU ์ปค๋„ ํ…Œ์ŠคํŠธ

uv run poe p18-test-kernels
pixi run p18-test-kernels

์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

Total Discovered Tests: 1

Passed : 1 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)

3. p18.py์—์„œ ๊ทธ๋ž˜ํ”„ ์ •์˜ ์™„์„ฑํ•˜๊ธฐ

from pathlib import Path
import numpy as np
from max.driver import CPU, Accelerator, Device, Buffer
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import DeviceRef, Graph, TensorType, ops
from numpy.typing import NDArray
from scipy.special import softmax as scipy_softmax


def softmax(
    input: NDArray[np.float32],
    session: InferenceSession,
    device: Device,
) -> Buffer:
    dtype = DType.float32
    input_tensor = Buffer.from_numpy(input).to(device)
    mojo_kernels = Path(__file__).parent / "op"

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        # FILL IN (roughly 4 unformatted lines)
        pass

์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p18/p18.py

ํŒ
  1. graph.inputs[0]์œผ๋กœ ๊ทธ๋ž˜ํ”„์— ์ „๋‹ฌ๋œ ์ž…๋ ฅ ํ…์„œ์— ์ ‘๊ทผํ•˜์„ธ์š”
  2. ๋“ฑ๋กํ•œ ์ปค์Šคํ…€ op ์ด๋ฆ„(โ€œsoftmaxโ€)์œผ๋กœ ops.custom()์„ ํ˜ธ์ถœํ•˜์„ธ์š”
  3. ์ž…๋ ฅ ํ…์„œ๋ฅผ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์˜ ๊ฐ’์œผ๋กœ ์ „๋‹ฌํ•˜์„ธ์š”
  4. ์ž…๋ ฅ shape๊ณผ ์ผ์น˜ํ•˜๋Š” ์ถœ๋ ฅ ํƒ€์ž…์„ ์ง€์ •ํ•˜์„ธ์š”
  5. ์ปค๋„์— ํ•„์š”ํ•œ โ€œinput_sizeโ€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํฌํ•จํ•˜์„ธ์š”
  6. graph.outputs๋ฅผ ์—ฐ์‚ฐ์˜ ์ถœ๋ ฅ ํ…์„œ๊ฐ€ ๋‹ด๊ธด ๋ฆฌ์ŠคํŠธ๋กœ ์„ค์ •ํ•˜์„ธ์š”

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p18
pixi run -e amd p18
pixi run -e apple p18
uv run poe p18

์„ฑ๊ณตํ•˜๋ฉด CPU์™€ GPU์—์„œ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Input shape: (128,)
First few random input values: [ 1.1810775   0.60472375  0.5718309   0.6644599  -0.08899796]
Compiling softmax graph on Device(type=cpu,id=0)
Executing softmax on Device(type=cpu,id=0)
====================================================================================================
Compiling softmax graph on Device(type=gpu,id=0)
Executing softmax on Device(type=gpu,id=0)
====================================================================================================
First few softmax results on CPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few softmax results on GPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few expected results (SciPy calculation): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
Verification passed: Custom kernel results match SciPy calculation
Sum of all probabilities on CPU: 1.0
Sum of all probabilities on GPU: 1.0

์ด ์ถœ๋ ฅ์€ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์ด ์†Œํ”„ํŠธ๋งฅ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜์—ฌ ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ƒ์„ฑํ–ˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์†”๋ฃจ์…˜

์ด ํผ์ฆ์„ ํ’€๋ ค๋ฉด Mojo ์ปค๋„(GPU์™€ CPU)๊ณผ ํŒŒ์ด์ฌ ๊ทธ๋ž˜ํ”„ ์ •์˜๋ฅผ ๋ชจ๋‘ ๊ตฌํ˜„ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op์—์„œ ํ–ˆ๋˜ ๊ฒƒ์ฒ˜๋Ÿผ, ํŒŒ์ด์ฌ์˜ ์ƒํƒœ๊ณ„์™€ Mojo์˜ GPU ๊ฐ€์† ์ปดํ“จํŒ… ์—ญ๋Ÿ‰์„ ์ž‡๋Š” ๋‹ค๋ฆฌ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๊ตฌํ˜„ํ•  ์†Œํ”„ํŠธ๋งฅ์Šค ์—ฐ์‚ฐ์€ ์ˆ˜ํ•™์ ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

ํ•˜์ง€๋งŒ ์ˆ˜์น˜ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋” ์•ˆ์ •์ ์ธ ํ˜•ํƒœ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

GPU ์ปค๋„ ๊ตฌํ˜„

fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    shared_max = LayoutTensor[
        dtype,
        Layout.row_major(BLOCK_DIM_X),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_sum = LayoutTensor[
        dtype,
        Layout.row_major(BLOCK_DIM_X),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    global_i = Int(thread_idx.x)

    # Initialize out-of-bounds (shared_max[global_i], global_i >= input_size) shared memory addresses to the minimum
    # finite value for dtype, ensuring that if these elements are accessed in the parallel max reduction below they
    # do not influence the result (max(min_finite, x) == x for any x).
    var val: Scalar[dtype] = min_finite[dtype]()
    if global_i < input_size:
        val = rebind[Scalar[dtype]](input[global_i])
    shared_max[global_i] = val

    barrier()

    # Parallel reduction to find max similar to reduction we saw before
    stride = BLOCK_DIM_X // 2
    while stride > 0:
        if global_i < stride:
            shared_max[global_i] = max(
                shared_max[global_i], shared_max[global_i + stride]
            )
        barrier()
        stride = stride // 2

    block_max = shared_max[0]

    # Initialize out-of-bounds (shared_sum[global_i], global_i >= input_size) shared memory addresses to 0.0,
    # ensuring that if these elements are accessed in the parallel sum reduction below they
    # do not influence the result (adding 0.0 does not change the sum).
    var exp_val: Scalar[dtype] = 0.0
    if global_i < input_size:
        exp_val = rebind[Scalar[dtype]](exp(val - block_max))
    shared_sum[global_i] = exp_val
    barrier()

    # Parallel reduction for sum similar to reduction we saw before
    stride = BLOCK_DIM_X // 2
    while stride > 0:
        if global_i < stride:
            shared_sum[global_i] += shared_sum[global_i + stride]
        barrier()
        stride = stride // 2

    block_sum = shared_sum[0]

    # Normalize by sum
    if global_i < input_size:
        output[global_i] = exp_val / block_sum


GPU ์ปค๋„์€ ๊ณ ๋„๋กœ ์ตœ์ ํ™”๋œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์†Œํ”„ํŠธ๋งฅ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์ปค๋„์„ ์ƒ์„ธํžˆ ๋ถ„์„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

์ปค๋„ ์‹œ๊ทธ๋‹ˆ์ฒ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ

fn softmax_gpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
)

커널의 파라미터 구성:

  • 입출력 텐서에 공통으로 사용되는 레이아웃 파라미터
  • 정수 파라미터로 지정되는 벡터 크기
  • 기본값이 float32인 설정 가능한 데이터 타입
  • 연산 결과를 직접 저장하는 변경 가능한(MutAnyOrigin) 출력 텐서
  • 변경 불가능한(ImmutAnyOrigin) 입력 텐서

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น

shared_max = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
shared_sum = LayoutTensor[dtype, Layout.row_major(BLOCK_DIM_X), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()

์ปค๋„์€ ๋‘ ๊ฐœ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„ํผ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค:

  • shared_max: ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ํƒ์ƒ‰ ๋ฆฌ๋•์…˜์šฉ
  • shared_sum: ๋ณ‘๋ ฌ ํ•ฉ๊ณ„ ์—ฐ์‚ฐ์šฉ
  • ๋‘˜ ๋‹ค BLOCK_DIM_X = 128 ํฌ๊ธฐ๋ฅผ ์‚ฌ์šฉ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๋น ๋ฅธ ์ ‘๊ทผ์„ ์ œ๊ณต

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

global_i = thread_idx.x

์ด ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„์€ ๋‹จ์ผ 1D ์Šค๋ ˆ๋“œ ๋ธ”๋ก์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ „์—ญ ์ธ๋ฑ์Šค์™€ ๋กœ์ปฌ ์ธ๋ฑ์Šค๊ฐ€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

์ตœ๋Œ“๊ฐ’ ํƒ์ƒ‰ ๋‹จ๊ณ„

var val: Scalar[dtype] = min_finite[dtype]()
if global_i < input_size:
    val = rebind[Scalar[dtype]](input[global_i])

shared_max[global_i] = val
barrier()

๊ฐ ์Šค๋ ˆ๋“œ๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค:

  • ์œ ํšจ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์š”์†Œ์—๋Š” ์ตœ์†Œ ์œ ํ•œ(finite) ๊ฐ’ ํ• ๋‹น
  • ์œ ํšจํ•œ ์š”์†Œ์— ๋งคํ•‘๋˜๋Š” ์Šค๋ ˆ๋“œ์—๋Š” ์‹ค์ œ ์ž…๋ ฅ๊ฐ’ ํ• ๋‹น
  • ๋ฆฌ๋•์…˜ ๊ณผ์ •์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ
  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๋„๋ก ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”

๋ณ‘๋ ฌ max ๋ฆฌ๋•์…˜

stride = BLOCK_DIM_X // 2
while stride > 0:
    if global_i < stride:
        shared_max[global_i] = max(shared_max[global_i], shared_max[global_i + stride])
    barrier()
    stride = stride // 2

๋ณ‘๋ ฌ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  1. stride = 64(BLOCK_DIM_X์˜ ์ ˆ๋ฐ˜)๋กœ ์‹œ์ž‘
  2. ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ stride๋งŒํผ ๋–จ์–ด์ง„ ๋‘ ๊ฐ’ ๋น„๊ต
  3. ๋” ์ž‘์€ ์ธ๋ฑ์Šค์— ์ตœ๋Œ“๊ฐ’ ์ €์žฅ
  4. ๋ฐฐ๋ฆฌ์–ด๋กœ ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  5. Stride๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๊ณ  ๋ฐ˜๋ณต
  6. \(\log_2(BLOCK\_DIM\_X)~\) ๋‹จ๊ณ„ ํ›„ shared_max[0]์— ์ „์ฒด ์ตœ๋Œ“๊ฐ’์ด ๋‹ด๊น€

์ด ๋กœ๊ทธ ๋ฆฌ๋•์…˜์€ ๋Œ€๊ทœ๋ชจ ์ž…๋ ฅ์—์„œ ์„ ํ˜• ์Šค์บ”๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฆ…๋‹ˆ๋‹ค.

์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์ง€์ˆ˜ ํ•จ์ˆ˜ ์ ์šฉ

block_max = shared_max[0]

var exp_val: Scalar[dtype] = 0.0
if global_i < input_size:
    exp_val = rebind[Scalar[dtype]](exp(val - block_max))

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ์ž‘์—…:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ „์ฒด ์ตœ๋Œ“๊ฐ’ ์ฝ์Œ
  2. ์ง€์ˆ˜ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๊ธฐ ์ „์— ์ž…๋ ฅ๊ฐ’์—์„œ ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ
  3. ์ด ์ฐจ๊ฐ์ด ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์˜ ํ•ต์‹ฌ โ€” ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ๋ฐฉ์ง€
  4. ๊ฐ€์žฅ ํฐ ์ง€์ˆ˜๊ฐ€ \(e^0 = 1\)์ด ๋˜๊ณ , ๋‚˜๋จธ์ง€๋Š” ๋ชจ๋‘ \(e^{์Œ์ˆ˜} < 1\)

๋ณ‘๋ ฌ sum ๋ฆฌ๋•์…˜

shared_sum[global_i] = exp_val
barrier()

stride = BLOCK_DIM_X // 2
while stride > 0:
    if global_i < stride:
        shared_sum[global_i] += shared_sum[global_i + stride]
    barrier()
    stride = stride // 2

๋‘ ๋ฒˆ์งธ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„:

  1. ๋ชจ๋“  ์ง€์ˆ˜ ๊ฐ’์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ
  2. max์™€ ๋™์ผํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜ ํŒจํ„ด ์‚ฌ์šฉ
  3. ๋‹จ, ์ตœ๋Œ“๊ฐ’ ๋น„๊ต ๋Œ€์‹  ๋ง์…ˆ ์ˆ˜ํ–‰
  4. \(\log_2(BLOCK\_DIM\_X)~\) ๋‹จ๊ณ„ ํ›„ shared_sum[0]์— ๋ชจ๋“  ์ง€์ˆ˜ ๊ฐ’์˜ ์ดํ•ฉ์ด ๋‹ด๊น€

์ตœ์ข… ์ •๊ทœํ™”

block_sum = shared_sum[0]

if global_i < input_size:
    output[global_i] = exp_val / block_sum

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ์ž‘์—…:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ดํ•ฉ์„ ์ฝ์Œ
  2. ์ž์‹ ์˜ ์ง€์ˆ˜ ๊ฐ’์„ ์ด ์ดํ•ฉ์œผ๋กœ ๋‚˜๋ˆ”
  3. ์ •๊ทœํ™”๋œ ํ™•๋ฅ ์„ ์ถœ๋ ฅ ๋ฒ„ํผ์— ๊ธฐ๋ก
  4. ํ•ฉ์ด 1์ธ ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ ์ƒ์„ฑ

์„ฑ๋Šฅ ํŠน์„ฑ

์ด ๊ตฌํ˜„์€ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ๊ฐ–์Šต๋‹ˆ๋‹ค:

  • ๋ณต์žก๋„: ์ˆœ์ฐจ์  ์ ‘๊ทผ์˜ \(O(n)\)์— ๋น„ํ•ด max์™€ sum ๊ณ„์‚ฐ ๋ชจ๋‘ \(O(\log n)\)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ \(2 \times BLOCK\_DIM\_X~\) ์š”์†Œ๋งŒ ์‚ฌ์šฉ
  • ์ž‘์—… ํšจ์œจ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์•ฝ \(2 \times \log_2(BLOCK\_DIM\_X)~\) ํšŒ ์—ฐ์‚ฐ ์ˆ˜ํ–‰
  • ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์–‘์˜ ์ž‘์—… ์ฒ˜๋ฆฌ
  • ๋™๊ธฐํ™”: ํ•„์š”ํ•œ ๊ณณ์—์„œ๋งŒ ์ตœ์†Œํ•œ์˜ ๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ์ตœ์  ๋Œ€์—ญํญ์„ ์œ„ํ•œ ๋ณ‘ํ•ฉ๋œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ˆ˜์น˜์ ์œผ๋กœ๋„ ๊ฒฌ๊ณ ํ•ฉ๋‹ˆ๋‹ค. ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ์‹ ๊ฒฝ๋ง ํ™œ์„ฑํ™”์—์„œ ํ”ํ•œ ๋„“์€ ๋ฒ”์œ„์˜ ๊ฐ’์—์„œ๋„ ์ •๋ฐ€๋„๋ฅผ ์œ ์ง€ํ•˜๋ฉฐ, ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ/์–ธ๋”ํ”Œ๋กœ์šฐ ๊ฐ€๋Šฅ์„ฑ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

CPU ํด๋ฐฑ ๊ตฌํ˜„

fn softmax_cpu_kernel[
    layout: Layout,
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        output[i] = exp_val
        sum_exp += exp_val

    for i in range(input_size):
        output[i] = output[i] / sum_exp


CPU ๊ตฌํ˜„์€ ๊ฐ™์€ ์ˆ˜ํ•™์  ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฅด๋˜ ๋‹จ์ผ ์Šค๋ ˆ๋“œ ์‹คํ–‰์— ์ตœ์ ํ™”๋œ ์ˆœ์ฐจ์  ํด๋ฐฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„๋ฅผ ๋ถ„์„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:
  1. ์ตœ๋Œ“๊ฐ’ ํƒ์ƒ‰:

    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))
    

    ์ตœ์†Œ ์œ ํ•œ๊ฐ’์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ๋ฐฐ์—ด์„ ์„ ํ˜• ์Šค์บ”ํ•˜๋ฉฐ ๋งŒ๋‚œ ์ตœ๋Œ“๊ฐ’์„ ์ถ”์ ํ•ฉ๋‹ˆ๋‹ค. \(O(n)\) ๋ณต์žก๋„์ด์ง€๋งŒ, ๋ณ‘๋ ฌํ™”ํ•  ์ฝ”์–ด๊ฐ€ ๋งŽ์ง€ ์•Š์€ CPU์—์„œ๋Š” ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

  2. ์ง€์ˆ˜ ํ•จ์ˆ˜ ์ ์šฉ๊ณผ ํ•ฉ์‚ฐ:

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        output[i] = exp_val
        sum_exp += exp_val
    

    ๊ฐ ์š”์†Œ์— ๋Œ€ํ•ด \(e^{x_i - max}\)๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๋ฒ„ํผ์— ์ €์žฅํ•˜๋ฉด์„œ ํ•ฉ๊ณ„ \(\sum_{j=1}^{n} e^{x_j - max}\)๋ฅผ ํ•œ ๋ฒˆ์˜ ์ˆœํšŒ๋กœ ๋ˆ„์ ํ•ฉ๋‹ˆ๋‹ค. ๋ณ„๋„์˜ ๋ฐ˜๋ณต๋ฌธ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์— ๋น„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.

  3. ์ •๊ทœํ™”:

    for i in range(input_size):
        output[i] = output[i] / sum_exp
    

    ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ฐ ์š”์†Œ๋ฅผ ํ•ฉ๊ณ„๋กœ ๋‚˜๋ˆ  ์†Œํ”„ํŠธ๋งฅ์Šค ๊ณต์‹์— ๋”ฐ๋ฅธ ์˜ฌ๋ฐ”๋ฅธ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

    $$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

CPU ๊ตฌํ˜„์€ ๋™์ผํ•œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ๊ธฐ๋ฒ•(์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ)์„ ์‚ฌ์šฉํ•˜๋˜, ๋ณ‘๋ ฌ์ด ์•„๋‹Œ ์ˆœ์ฐจ์  ์—ฐ์‚ฐ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๋ฅผ ๋‹ค๋ฃฐ ํ•„์š”๊ฐ€ ์—†์–ด GPU ๋ฒ„์ „๋ณด๋‹ค ๋‹จ์ˆœํ•˜์ง€๋งŒ, ๋Œ€๊ทœ๋ชจ ์ž…๋ ฅ์—์„œ๋Š” ํšจ์œจ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

๋‘ ๊ตฌํ˜„ ๋ชจ๋‘ @compiler.register("softmax") ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋ฅผ ํ†ตํ•ด MAX ๊ทธ๋ž˜ํ”„์˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ์‹œ์Šคํ…œ์— ๋“ฑ๋ก๋˜๋ฏ€๋กœ, ๊ฐ€์šฉ ์—ฌ๋ถ€์— ๋”ฐ๋ผ ์–ด๋А ๋””๋ฐ”์ด์Šค์—์„œ๋“  ๋งค๋„๋Ÿฝ๊ฒŒ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ํ†ตํ•ฉ

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        input_value = graph.inputs[0]

        # The output shape is the same as the input for softmax
        # Note: the name must match the name used in `@compiler.register("softmax")` in op/softmax.mojo
        output = ops.custom(
            name="softmax",
            values=[input_value],
            device=DeviceRef.from_device(device),
            out_types=[
                TensorType(
                    dtype=input_value.tensor.dtype,
                    shape=input_value.tensor.shape,
                    device=DeviceRef.from_device(device),
                )
            ],
            parameters={
                "target": "gpu" if device == Accelerator() else "cpu",
                "input_size": input_tensor.shape[0],
                "dtype": dtype,
            },
        )[0].tensor
        graph.output(output)

ํŒŒ์ด์ฌ ํ†ตํ•ฉ์€ NumPy ๋ฐฐ์—ด๊ณผ ์ตœ์ ํ™”๋œ Mojo GPU ์ปค๋„ ์‚ฌ์ด์— ๋งค๋„๋Ÿฌ์šด ๋‹ค๋ฆฌ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ๊ตฌํ˜„์€ ์—ฌ๋Ÿฌ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ์ด๋ค„์ ธ ์žˆ์Šต๋‹ˆ๋‹ค:
  1. ๊ทธ๋ž˜ํ”„ ์„ค์ •๊ณผ ๊ตฌ์„ฑ:

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
    

    โ€œsoftmax_graphโ€œ๋ผ๋Š” ์ด๋ฆ„์˜ ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

    • ์ ์ ˆํ•œ dtype๊ณผ shape์œผ๋กœ ์ž…๋ ฅ ํ…์„œ ํƒ€์ž… ์ •์˜
    • ํ…์„œ๋ฅผ ๋Œ€์ƒ ๋””๋ฐ”์ด์Šค(CPU ๋˜๋Š” GPU)์— ๋งคํ•‘
    • ์ง€์ •๋œ ๋””๋ ‰ํ† ๋ฆฌ์—์„œ ์ปค์Šคํ…€ Mojo ์—ฐ์‚ฐ ๋กœ๋“œ
    • custom_extensions ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ Mojo ๊ตฌํ˜„๊ณผ์˜ ์—ฐ๊ฒฐ ํ•ต์‹ฌ
  2. ์ปค์Šคํ…€ ์—ฐ์‚ฐ ๊ตฌ์„ฑ:

    output = ops.custom(
        name="softmax",
        values=[input_value],
        out_types=[
            TensorType(
                dtype=input_value.tensor.dtype,
                shape=input_value.tensor.shape,
                device=DeviceRef.from_device(device),
            )
        ],
        parameters={
            "target": "gpu" if device == Accelerator() else "cpu",
            "input_size": input_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor
    

    ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค:

    • Mojo ์ฝ”๋“œ์˜ @compiler.register("softmax")์™€ ์ผ์น˜ํ•˜๋Š” ์ด๋ฆ„
    • ๋ฆฌ์ŠคํŠธ๋กœ ์ „๋‹ฌ๋˜๋Š” ์ž…๋ ฅ ๊ฐ’
    • ์ž…๋ ฅ shape๊ณผ ํƒ€์ž…์— ๋งž๋Š” ์ถœ๋ ฅ ํƒ€์ž… ์ •์˜
    • ๋Œ€์ƒ ๋””๋ฐ”์ด์Šค, ๋ฒกํ„ฐ ํฌ๊ธฐ, ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ํฌํ•จํ•œ ์ปค๋„ ํ•„์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ
    • [0].tensor๋กœ ์ฒซ ๋ฒˆ์งธ ๋ฐ˜ํ™˜ ์š”์†Œ์—์„œ ํ…์„œ ์ถ”์ถœ
  3. ๊ทธ๋ž˜ํ”„ ์ถœ๋ ฅ ์ •์˜:

    graph.output(output)
    

    ์—ฐ์‚ฐ์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ทธ๋ž˜ํ”„์˜ ์ถœ๋ ฅ์œผ๋กœ ๋“ฑ๋กํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”์ธ ์Šคํฌ๋ฆฝํŠธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ผผ๊ผผํ•œ ๊ฒ€์ฆ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค:

  • ๋žœ๋ค ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ: np.random.randn(INPUT_SIZE).astype(np.float32)
  • SciPy๋กœ ๊ธฐ๋Œ€ ๊ฒฐ๊ณผ ๊ณ„์‚ฐ: scipy_softmax(input_array)
  • ์ˆ˜์น˜ ์ •ํ™•๋„ ๊ฒ€์ฆ: np.testing.assert_allclose(..., rtol=1e-5)
  • ์ถœ๋ ฅ์ด ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ์ธ์ง€ ํ™•์ธ: np.sum(result.to_numpy())

์ด ๊ตฌํ˜„์€ ๊ณ ์„ฑ๋Šฅ Mojo ์ปค๋„๊ณผ ํŒŒ์ด์ฌ์˜ ๊ณผํ•™ ์ปดํ“จํŒ… ์ƒํƒœ๊ณ„๋ฅผ ํ†ตํ•ฉํ•˜๋Š” MAX ๊ทธ๋ž˜ํ”„์˜ ๊ฐ•๋ ฅํ•œ ์—ญ๋Ÿ‰์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ํšจ์œจ์„ฑ๊ณผ ์‚ฌ์šฉ ํŽธ์˜์„ฑ์„ ๋™์‹œ์— ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Puzzle 19: ์–ดํ…์…˜ Op

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์–ดํ…์…˜์€ ํŠธ๋žœ์Šคํฌ๋จธ์™€ ํ•จ๊ป˜ ๋„๋ฆฌ ์•Œ๋ ค์ง„ ํ˜„๋Œ€ ์‹ ๊ฒฝ๋ง์˜ ํ•ต์‹ฌ ์š”์†Œ๋กœ, ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•  ๋•Œ ์ž…๋ ฅ์—์„œ ๊ด€๋ จ๋œ ๋ถ€๋ถ„์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค.

์ˆ˜ํ•™์ ์œผ๋กœ ์–ดํ…์…˜ ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

์—ฌ๊ธฐ์„œ:

  • \(Q\)๋Š” \((d,)~\) ํ˜•ํƒœ์˜ ์ฟผ๋ฆฌ ๋ฒกํ„ฐ - ์ฐพ์œผ๋ ค๋Š” ๋Œ€์ƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
  • \(K\)๋Š” \((\text{seq_len}, d)~\) ํ˜•ํƒœ์˜ ํ‚ค ํ–‰๋ ฌ - ๋งค์นญํ•  ์ˆ˜ ์žˆ๋Š” ๋Œ€์ƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
  • \(V\)๋Š” \((\text{seq_len}, d)~\) ํ˜•ํƒœ์˜ ๊ฐ’ ํ–‰๋ ฌ - ๊ฒ€์ƒ‰ํ•  ์ •๋ณด๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
  • ์ถœ๋ ฅ์€ \((d,)\) ํ˜•ํƒœ์˜ ๊ฐ€์ค‘ํ•ฉ ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค

์—ฐ์‚ฐ์€ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค:

  1. ์–ดํ…์…˜ ์ ์ˆ˜: \(Q \cdot K^T\)๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์ฟผ๋ฆฌ๊ฐ€ ๊ฐ ํ‚ค ๋ฒกํ„ฐ์™€ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งค์นญ๋˜๋Š”์ง€ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค
  2. ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜: ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•˜์—ฌ ์ ์ˆ˜๋ฅผ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค (๊ฐ€์ค‘์น˜์˜ ํ•ฉ = 1)
  3. ๊ฐ€์ค‘ ํ•ฉ: ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ’ ๋ฒกํ„ฐ๋“ค์„ ๊ฒฐํ•ฉํ•ด ์ตœ์ข… ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค

์–ดํ…์…˜ ์ดํ•ดํ•˜๊ธฐ: ๋‹จ๊ณ„๋ณ„ ๋ถ„์„

์–ดํ…์…˜์„ ์Šค๋งˆํŠธ ๊ฒ€์ƒ‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ์ฟผ๋ฆฌ(์ฐพ๊ณ ์ž ํ•˜๋Š” ๊ฒƒ)๊ฐ€ ์ฃผ์–ด์ง€๋ฉด, ์–ดํ…์…˜์€ ํ‚ค-๊ฐ’ ์Œ์˜ ๋ชจ์Œ์—์„œ ๊ฐ€์žฅ ๊ด€๋ จ์„ฑ ๋†’์€ ์ •๋ณด๋ฅผ ์ฐพ์•„๋ƒ…๋‹ˆ๋‹ค:

  1. 1๋‹จ๊ณ„ - ์œ ์‚ฌ๋„ ๋งค์นญ: ์ฟผ๋ฆฌ \(Q\)๋ฅผ ๋ชจ๋“  ํ‚ค \(K\)์™€ ๋น„๊ตํ•˜์—ฌ ์œ ์‚ฌ๋„ ์ ์ˆ˜๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค

    • \(Q \cdot K^T\)๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ \(Q\)๊ฐ€ ๊ฐ ํ‚ค ๋ฒกํ„ฐ์™€ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งค์นญ๋˜๋Š”์ง€ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค
    • ๋†’์€ ์ ์ˆ˜ = ๋” ์ข‹์€ ๋งค์นญ
  2. 2๋‹จ๊ณ„ - ํ™•๋ฅ  ๋ถ„ํฌ: ์›์‹œ ์ ์ˆ˜๋ฅผ ์ •๊ทœํ™”๋œ ๊ฐ€์ค‘์น˜๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค

    • ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•˜์—ฌ ๋ชจ๋“  ๊ฐ€์ค‘์น˜์˜ ํ•ฉ์ด 1.0์ด ๋˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค
    • ์–ด๋–ค ๊ฐ’์— ์ง‘์ค‘ํ• ์ง€์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค
  3. 3๋‹จ๊ณ„ - ๊ฐ€์ค‘ ๊ฒ€์ƒ‰: ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ’๋“ค์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค

    • ๊ฐ ๊ฐ’ ๋ฒกํ„ฐ์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๊ณฑํ•ฉ๋‹ˆ๋‹ค
    • ๋ชจ๋“  ๊ฒƒ์„ ๋”ํ•ด ์ตœ์ข… ์ถœ๋ ฅ์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค

์‹ค์ƒํ™œ ๋น„์œ : ๋„์„œ๊ด€์—์„œ ๊ฒ€์ƒ‰ํ•˜๋Š” ๊ฒƒ์„ ์ƒ์ƒํ•ด ๋ณด์„ธ์š”. ์ฟผ๋ฆฌ๋Š” ์ฐพ๊ณ  ์‹ถ์€ ๊ฒƒ์ด๊ณ , ์ฑ… ์ œ๋ชฉ์€ ํ‚ค์ด๋ฉฐ, ์ฑ… ๋‚ด์šฉ์€ ๊ฐ’์ž…๋‹ˆ๋‹ค. ์–ดํ…์…˜์€ ๊ฐ ์ฑ…์ด ์ฟผ๋ฆฌ์™€ ์–ผ๋งˆ๋‚˜ ๊ด€๋ จ ์žˆ๋Š”์ง€ ๊ณ„์‚ฐํ•œ ๋‹ค์Œ, ๊ด€๋ จ๋„์— ๋”ฐ๋ผ ๊ฐ€์ค‘ ์š”์•ฝ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์—ฐ์‚ฐ ํ๋ฆ„ ์‹œ๊ฐํ™”

Input:  Q(16,)    K(16,16)    V(16,16)
         โ†“           โ†“           โ†“
Step 1: Q(1,16) @ K^T(16,16) โ†’ Scores(1,16)
         โ†“
Step 2: softmax(Scores) โ†’ Weights(1,16)  [sum = 1.0]
         โ†“
Step 3: Weights(1,16) @ V(16,16) โ†’ Output(1,16) โ†’ reshape โ†’ Output(16,)

ํ•ต์‹ฌ ์•„์ด๋””์–ด: ์ฟผ๋ฆฌ ๋ฒกํ„ฐ \(Q\)๋ฅผ \((16,)\)์—์„œ \((1,16)\)์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉด, ๋‚ด์  ๋Œ€์‹  ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋•๋ถ„์— Puzzle 18์˜ ๊ณ ๋„๋กœ ์ตœ์ ํ™”๋œ ํƒ€์ผ๋ง matmul ์ปค๋„์„ ๊ทธ๋Œ€๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

GPU ๊ตฌํ˜„์€ ์ด์ „ ํผ์ฆ์—์„œ ์ตœ์ ํ™”๋œ ์ปค๋„๋“ค์„ ์žฌ์‚ฌ์šฉํ•˜๊ณ  ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ”„ ์ปค๋„ ์žฌ์‚ฌ์šฉ ์ „๋žต: ์ด ํผ์ฆ์€ ์ด์ „ ํผ์ฆ์—์„œ ๊ฒ€์ฆ๋œ ์ตœ์ ํ™” ์ปค๋„๋“ค์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ณต์žกํ•œ ์—ฐ์‚ฐ์„ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋ชจ๋“  ๊ฒƒ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ์ž‘์„ฑํ•˜๋Š” ๋Œ€์‹ , Puzzle 16์˜ matmul_idiomatic_tiled๊ณผ Puzzle 18์˜ softmax_kernel์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋“ˆํ˜• GPU ์ปค๋„ ์„ค๊ณ„์˜ ๊ฐ•๋ ฅํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

  • ์‹œํ€€์Šค ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ฒกํ„ฐ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜
  • ์ปค๋„ ์žฌ์‚ฌ์šฉ: Puzzle 16๊ณผ Puzzle 18์˜ ๊ฒ€์ฆ๋œ ๊ตฌํ˜„ ํ™œ์šฉ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ tiling์„ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ํ–‰๋ ฌ ๊ณฑ์…ˆ
  • ๋ฒ„ํผ ํ• ๋‹น์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ํ…์„œ ํ˜•ํƒœ ๋ณ€ํ™˜
  • ์—ฌ๋Ÿฌ ์ตœ์ ํ™” ์ปค๋„์„ ๋‹จ์ผ ์—ฐ์‚ฐ์œผ๋กœ ํ†ตํ•ฉ
  • ๋‹ค์ค‘ ์ž…๋ ฅ์„ ์ง€์›ํ•˜๋Š” ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ
  • ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•œ CPU ํด๋ฐฑ ๊ตฌํ˜„

์„ค์ •

  • ์‹œํ€€์Šค ๊ธธ์ด: \(\text{SEQ_LEN} = 16~\) - ์‹œํ€€์Šค ๋‚ด ํ‚ค/๊ฐ’ ๋ฒกํ„ฐ์˜ ์ˆ˜
  • ๋ชจ๋ธ ์ฐจ์›: \(\text{D} = 16~\) - ๊ฐ ๋ฒกํ„ฐ(์ฟผ๋ฆฌ, ํ‚ค, ๊ฐ’)์˜ ์ฐจ์›
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: ๊ฐ ์ปค๋„์— ๋งž๊ฒŒ ๊ฐœ๋ณ„ ์ตœ์ ํ™”
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: ๋‹ค์–‘ํ•œ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋„๋ก ๋™์ ์œผ๋กœ ๊ณ„์‚ฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ์ „์น˜, matmul, ์†Œํ”„ํŠธ๋งฅ์Šค ์ปค๋„์—์„œ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ํ™œ์šฉ

๋ ˆ์ด์•„์›ƒ ์„ค์ •:

  • ์ฟผ๋ฆฌ ํ…์„œ: Layout.row_major(d)
  • ํ‚ค ํ…์„œ: Layout.row_major(seq_len, d)
  • ๊ฐ’ ํ…์„œ: Layout.row_major(seq_len, d)
  • ์ถœ๋ ฅ ํ…์„œ: Layout.row_major(d)
  • ์ปค์Šคํ…€ op ํŒŒ๋ผ๋ฏธํ„ฐ: {"seq_len": seq_len, "d": d, "dtype": dtype}

์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๋‹ค์ค‘ ์ปค๋„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜: ์ „์น˜, matmul, ์†Œํ”„ํŠธ๋งฅ์Šค ์—ฐ์‚ฐ์˜ ๊ฒฐํ•ฉ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ํ˜•ํƒœ ๋ณ€ํ™˜ ์—ฐ์‚ฐ๊ณผ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ตœ์†Œํ™”
  3. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ: Puzzle 18์˜ ๊ฒ€์ฆ๋œ ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„ ํ™œ์šฉ
  4. ์„ฑ๋Šฅ ์ตœ์ ํ™”: ๋ชจ๋“  ํ–‰๋ ฌ ์—ฐ์‚ฐ์— Puzzle 16์˜ ํƒ€์ผ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‚ฌ์šฉ
  5. ๋‹ค์ค‘ ์ž…๋ ฅ ์—ฐ์‚ฐ: ๋‹จ์ผ ์ปค์Šคํ…€ op์—์„œ ์„ธ ๊ฐœ์˜ ์ž…๋ ฅ ํ…์„œ(Q, K, V) ์ฒ˜๋ฆฌ

์–ดํ…์…˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ์—์„œ ์ฟผ๋ฆฌ, ํ‚ค, ๊ฐ’ ํ…์„œ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ
  • ์ตœ์ ํ™”๋œ ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์–ดํ…์…˜ ๊ฐ€์ค‘ ์ถœ๋ ฅ ๋ฒกํ„ฐ ๋ฐ˜ํ™˜
  • NumPy ์ฐธ์กฐ ๊ตฌํ˜„ ๊ฒฐ๊ณผ์™€ ์ผ์น˜

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด Puzzle 16์˜ ํƒ€์ผ๋ง matmul ์ปค๋„๊ณผ Puzzle 18์˜ ์†Œํ”„ํŠธ๋งฅ์Šค ์ปค๋„์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Mojo ํŒŒ์ผ์—์„œ ์ „์น˜ ์ปค๋„๋งŒ ๊ตฌํ˜„ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

1. ์ „์น˜ ์ปค๋„ ๊ตฌํ˜„ํ•˜๊ธฐ

fn transpose_kernel[
    layout_in: Layout,  # Layout for input matrix (seq_len, d)
    layout_out: Layout,  # Layout for output matrix (d, seq_len)
    rows: Int,
    cols: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
):
    # FILL ME IN (roughly 18 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p19/op/attention.mojo

ํŒ

์ „์น˜ ์ปค๋„ ๊ตฌํ˜„ ๊ฐ€์ด๋“œ:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •: LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()์„ ์‚ฌ์šฉํ•˜์—ฌ TRANSPOSE_BLOCK_DIM_XY ร— TRANSPOSE_BLOCK_DIM_XY ํฌ๊ธฐ์˜ ์ •์‚ฌ๊ฐํ˜• ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์Šค๋ ˆ๋“œ ๊ฐ„ ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

  2. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: ์Šค๋ ˆ๋“œ๋ฅผ ํ–‰๋ ฌ ์š”์†Œ์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค:

    • local_row = thread_idx.y, local_col = thread_idx.x (๋ธ”๋ก ๋‚ด ์œ„์น˜)
    • global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row (์ „์ฒด ํ–‰๋ ฌ์—์„œ์˜ ์œ„์น˜)
  3. 2๋‹จ๊ณ„ ์—ฐ์‚ฐ:

    • 1๋‹จ๊ณ„: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ผ๋ฐ˜ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • 2๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  4. ํ•„์ˆ˜ ๋™๊ธฐํ™”: ๋กœ๋“œ์™€ ์ €์žฅ ์‚ฌ์ด์— barrier()๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋กœ๋“œ๋ฅผ ์™„๋ฃŒํ•œ ํ›„์—์•ผ ์ €์žฅ์„ ์‹œ์ž‘ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค

  5. ์ „์น˜์˜ ํ•ต์‹ฌ: ์ „์น˜๋Š” ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ์„ ํ†ตํ•ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค: shared_tile[local_row, local_col] ๋Œ€์‹  shared_tile[local_col, local_row]๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค

  6. ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY๋กœ ์ •ํ™•ํžˆ ๋‚˜๋ˆ„์–ด์ง€์ง€ ์•Š๋Š” ํ–‰๋ ฌ์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ฝ๊ธฐ/์“ฐ๊ธฐ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค

  7. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์ด ํŒจํ„ด์€ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋ชจ๋‘ ๋ณ‘ํ•ฉ๋˜๋„๋ก ๋ณด์žฅํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค

2. ์–ดํ…์…˜ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜

            var gpu_ctx = rebind[DeviceContext](ctx[])

            # Define layouts for matrix multiplication
            # Q reshaped to (1, d)
            comptime layout_q_2d = Layout.row_major(1, d)
            # K^T is (d, seq_len)
            comptime layout_k_t = Layout.row_major(d, seq_len)
            # Scores as (1, seq_len)
            comptime layout_scores_2d = Layout.row_major(1, seq_len)
            # Weights as (1, seq_len)
            comptime layout_weights_2d = Layout.row_major(1, seq_len)
            # Result as (1, d)
            comptime layout_result_2d = Layout.row_major(1, d)

            # Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks
            comptime transpose_threads_per_block = (
                TRANSPOSE_BLOCK_DIM_XY,
                TRANSPOSE_BLOCK_DIM_XY,
            )
            # Tile over the K (seq_len, d) matrix
            comptime transpose_blocks_per_grid = (
                (d + TRANSPOSE_BLOCK_DIM_XY - 1) // TRANSPOSE_BLOCK_DIM_XY,
                (seq_len + TRANSPOSE_BLOCK_DIM_XY - 1)
                // TRANSPOSE_BLOCK_DIM_XY,
            )
            # Matmul implementation limited to square (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) thread blocks
            comptime matmul_threads_per_block = (
                MATMUL_BLOCK_DIM_XY,
                MATMUL_BLOCK_DIM_XY,
            )
            # seq_len outputs ( Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len) ) with one thread per output
            comptime scores_blocks_per_grid = (
                seq_len + MATMUL_BLOCK_DIM_XY - 1
            ) // MATMUL_BLOCK_DIM_XY
            comptime softmax_threads = SOFTMAX_BLOCK_DIM_X
            comptime softmax_blocks_per_grid = 1
            # d outputs ( weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d) ) with one thread per output
            comptime result_blocks_per_grid = (
                d + MATMUL_BLOCK_DIM_XY - 1
            ) // MATMUL_BLOCK_DIM_XY

            # Allocate minimal temporary buffers - reuse same buffer for different shapes
            k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len * d
            )  # K^T as (d, seq_len)
            scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len
            )  # Reused for scores and weights

            k_t = LayoutTensor[dtype, layout_k_t, MutAnyOrigin](k_t_buf)

            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            # FILL ME IN 1 line

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            # FILL ME IN 1 function call

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q ยท K^T[i] = Q ยท K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            # FILL ME IN 2 lines

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            # FILL ME IN 1 line

            # Step 5: Apply softmax to get attention weights
            # FILL ME IN 1 function call

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            # FILL ME IN 1 line

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            # FILL ME IN 2 lines

์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p19/op/attention.mojo

์ปค๋„ ํ…Œ์ŠคํŠธ

pixi run p19
pixi run -e amd p19
pixi run -e apple p19
uv run poe p19

์„ฑ๊ณตํ•˜๋ฉด CPU์™€ GPU์—์„œ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Input shapes: Q=(16,), K=(16, 16), V=(16, 16)
Sample Q values: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]
Sample K[0] values: [-0.10128311  0.03142473 -0.09080241 -0.14123037  0.14656489]
Sample V[0] values: [ 0.11631638  0.00102331 -0.09815087  0.04621035  0.01990597]

================================================================================
STEP-BY-STEP VECTOR ATTENTION COMPUTATION DEBUG
================================================================================

1. INPUT SHAPES:
   Q shape: (16,) (query vector)
   K shape: (16, 16) (key matrix)
   V shape: (16, 16) (value matrix)
   Q[:5]: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]

2. ATTENTION SCORES (K[i] ยท Q):
   Scores shape: (16,)
   Scores[:5]: [-0.03479404 -0.01563787  0.04834607  0.06764711  0.04001468]
   Min: -0.061636, Max: 0.067647
   Manual verification:
     Q ยท K[0] = K[0] ยท Q = -0.034794 (computed: -0.034794)
     Q ยท K[1] = K[1] ยท Q = -0.015638 (computed: -0.015638)
     Q ยท K[2] = K[2] ยท Q = 0.048346 (computed: 0.048346)

3. SOFTMAX:
   Max score: 0.067647
   Attention weights shape: (16,)
   Attention weights[:5]: [0.05981331 0.06097015 0.06499878 0.0662655  0.06445949]
   Sum: 1.000000 (should be 1.0)

4. WEIGHTED SUM OF VALUES:
   Output shape: (16,)
   Output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Output norm: 0.092764
   Manual output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Match: True

================================================================================
TESTING INDIVIDUAL OPERATIONS
================================================================================

Test 1: Vector Dot Product
a ยท b = 3.000000

Test 2: Matrix-Vector Multiplication
M @ v = [ 3.  7. 11.]

Test 3: Softmax
Input: [1. 2. 3. 4.]
Softmax: [0.0320586  0.08714432 0.2368828  0.6439143 ]
Sum: 1.000000

================================================================================
TESTING FULL ATTENTION
================================================================================
Compiling attention graph on Device(type=cpu,id=0)
Executing attention on Device(type=cpu,id=0)
====================================================================================================

CPU attention output[:5]: [-0.00935538 -0.02434331  0.00306551  0.02346884  0.019306  ]
CPU matches NumPy: True
Compiling attention graph on Device(type=gpu,id=0)
Executing attention on Device(type=gpu,id=0)
====================================================================================================

GPU attention output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
Expected output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
GPU matches NumPy: True

================================================================================
FINAL VERIFICATION
================================================================================
โœ“ CPU implementation PASSED
โœ“ GPU implementation PASSED

Output vector norms:
  CPU: 0.092764
  GPU: 0.092764
  Expected: 0.092764

์ด ์ถœ๋ ฅ์€ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์ด ์–ดํ…์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜์—ฌ NumPy ์ฐธ์กฐ ๊ตฌํ˜„๊ณผ ์ผ์น˜ํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ–ˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์†”๋ฃจ์…˜

์ด ํผ์ฆ์„ ํ’€๋ ค๋ฉด Mojo์—์„œ ์ „์น˜ ์ปค๋„์„ ๊ตฌํ˜„ํ•˜๊ณ  ์–ดํ…์…˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ํŒŒ์ด์ฌ ๊ทธ๋ž˜ํ”„ ์ •์˜๋ฅผ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ์ด์ „ ํผ์ฆ์˜ ๊ฐœ๋…๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ, Puzzle 16์˜ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ๊ณผ Puzzle 18์˜ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ์™„์ „ํ•œ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

์žฌ์‚ฌ์šฉ ์ปค๋„

๊ตฌํ˜„์—์„œ ๋‹ค์Œ์˜ ๊ฒ€์ฆ๋œ ์ปค๋„๋“ค์„ ์ง์ ‘ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค:

  1. matmul_idiomatic_tiled (Puzzle 16) - \(Q \times K^T\)์™€ \(\text{weights} \times V\) ์—ฐ์‚ฐ ๋ชจ๋‘๋ฅผ ์ˆ˜ํ–‰
  2. softmax_kernel (Puzzle 18) - ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜ ๊ณ„์‚ฐ ์ œ๊ณต

์ด๋Š” ๋ชจ๋“ˆํ˜• GPU ์•„ํ‚คํ…์ฒ˜์˜ ์ข‹์€ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค: ๋‹จ์ผ ๊ตฌํ˜„์ฒด๊ฐ€ ์•„๋‹Œ, ๊ฒ€์ฆ๋œ ์ตœ์ ํ™” ์ปดํฌ๋„ŒํŠธ๋ฅผ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ํ•˜์—ฌ ๋ณต์žกํ•œ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.

์–ดํ…์…˜ ์—ฐ์‚ฐ์€ ํ‘œ์ค€์ ์ธ ์ˆ˜ํ•™์  ์ •์˜๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

์ˆ˜์‹ ๋ถ„์„:

  • \(Q \cdot K^T\): ์ฟผ๋ฆฌ-ํ‚ค ์œ ์‚ฌ๋„ ์ ์ˆ˜, ํ˜•ํƒœ: \((1, \text{seq_len})\)
  • \(\text{softmax}(\cdot)\): ์ ์ˆ˜๋ฅผ ํ™•๋ฅ ๋กœ ์ •๊ทœํ™”, ํ˜•ํƒœ: \((1, \text{seq_len})\)
  • \(\text{weights} \cdot V\): ๊ฐ’์˜ ๊ฐ€์ค‘ ๊ฒฐํ•ฉ, ํ˜•ํƒœ: \((1, d)\)

์ด ๊ณผ์ •์—๋Š” ์ด์ „ ํผ์ฆ์˜ GPU ์ปค๋„์„ ํ™œ์šฉํ•˜์—ฌ ์ตœ์ ํ™”ํ•˜๋Š” ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ ๋‹จ๊ณ„๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

1. ์ „์น˜ ์ปค๋„ ๊ตฌํ˜„

fn transpose_kernel[
    layout_in: Layout,  # Layout for input matrix (seq_len, d)
    layout_out: Layout,  # Layout for output matrix (d, seq_len)
    rows: Int,
    cols: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
):
    """Transpose matrix using shared memory tiling for coalesced access."""
    shared_tile = LayoutTensor[
        dtype,
        Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    local_row = Int(thread_idx.y)
    local_col = Int(thread_idx.x)

    global_row = Int(block_idx.y) * TRANSPOSE_BLOCK_DIM_XY + local_row
    global_col = Int(block_idx.x) * TRANSPOSE_BLOCK_DIM_XY + local_col

    if global_row < rows and global_col < cols:
        shared_tile[local_row, local_col] = inp[global_row, global_col]

    barrier()

    out_row = Int(block_idx.x) * TRANSPOSE_BLOCK_DIM_XY + local_row
    out_col = Int(block_idx.y) * TRANSPOSE_BLOCK_DIM_XY + local_col

    # Store data from shared memory to global memory (coalesced write)
    # Note: we transpose the shared memory access pattern
    if out_row < cols and out_col < rows:
        output[out_row, out_col] = shared_tile[local_col, local_row]


์ „์น˜ ์ปค๋„์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ tiling์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๊ตฌํ˜„ ๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ์ „์น˜ ํŒจํ„ด

# ์ผ๋ฐ˜ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋กœ๋“œ
shared_tile[local_row, local_col] = inp[global_row, global_col]
barrier()
# ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ์œผ๋กœ ์ €์žฅํ•˜์—ฌ ์ „์น˜
output[out_row, out_col] = shared_tile[local_col, local_row]

์ „์น˜๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์—์„œ ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ([local_row, local_col] ๋Œ€์‹  [local_col, local_row])๊ณผ ์ถœ๋ ฅ ์œ„์น˜ ์ง€์ •์„ ์œ„ํ•œ ๋’ค๋ฐ”๊พผ ๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ํ†ตํ•ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋ชจ๋‘ ๋ณ‘ํ•ฉ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์น˜ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.
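์ด ํŒจํ„ด์˜ ์ •ํ™•์„ฑ์€ CPU์—์„œ ๊ฐ„๋‹จํžˆ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•ด ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ปค๋„๊ณผ ๋™์ผํ•œ ์ธ๋ฑ์‹ฑ ๊ทœ์น™์„ ๋”ฐ๋ฅด๋Š” NumPy ์Šค์ผ€์น˜๋กœ, ํƒ€์ผ ํฌ๊ธฐ(TILE)๋Š” ์„ค๋ช…์„ ์œ„ํ•ด ์ž„์˜๋กœ ๊ฐ€์ •ํ•œ ๊ฐ’์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

TILE = 4  # 커널의 TRANSPOSE_BLOCK_DIM_XY에 해당한다고 가정한 예시 값

def tiled_transpose(inp):
    rows, cols = inp.shape
    out = np.zeros((cols, rows), dtype=inp.dtype)
    for by in range(0, rows, TILE):        # block_idx.y에 해당
        for bx in range(0, cols, TILE):    # block_idx.x에 해당
            # 로드: 일반 인덱싱으로 타일(공유 메모리 역할)에 복사
            tile = np.zeros((TILE, TILE), dtype=inp.dtype)
            for lr in range(TILE):
                for lc in range(TILE):
                    gr, gc = by + lr, bx + lc
                    if gr < rows and gc < cols:      # 경계 검사
                        tile[lr, lc] = inp[gr, gc]
            # 저장: 블록 좌표와 로컬 인덱스를 모두 뒤바꿔 기록
            for lr in range(TILE):
                for lc in range(TILE):
                    orow, ocol = bx + lr, by + lc
                    if orow < cols and ocol < rows:  # 경계 검사
                        out[orow, ocol] = tile[lc, lr]
    return out

# 타일 크기로 나누어떨어지지 않는 행렬에서도 경계 검사 덕분에 올바르게 동작
m = np.arange(35, dtype=np.float32).reshape(5, 7)
```

`out[bx+lr, by+lc] = tile[lc, lr] = inp[by+lc, bx+lr]`์ด๋ฏ€๋กœ ๊ฒฐ๊ณผ์ ์œผ๋กœ `out[i, j] = inp[j, i]`, ์ฆ‰ ์ •ํ™•ํ•œ ์ „์น˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.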

2. GPU ์ปค๋„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜


            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            q_2d = q_tensor.reshape[layout_q_2d]()

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            comptime kernel = transpose_kernel[
                layout_k, layout_k_t, seq_len, d, dtype
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                k_t,
                k_tensor,
                grid_dim=transpose_blocks_per_grid,
                block_dim=transpose_threads_per_block,
            )

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q ยท K^T[i] = Q ยท K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            scores_2d = LayoutTensor[dtype, layout_scores_2d, MutAnyOrigin](
                scores_weights_buf
            )
            comptime kernel2 = matmul_idiomatic_tiled[
                layout_q_2d,
                layout_k_t,
                layout_scores_2d,
                1,
                seq_len,
                d,
                dtype,
            ]
            gpu_ctx.enqueue_function[kernel2, kernel2](
                scores_2d,
                q_2d,
                k_t,
                grid_dim=scores_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            weights = scores_2d.reshape[layout_scores]()

            # Step 5: Apply softmax to get attention weights
            comptime kernel3 = softmax_gpu_kernel[layout_scores, seq_len, dtype]
            gpu_ctx.enqueue_function[kernel3, kernel3](
                weights,
                weights,
                grid_dim=softmax_blocks_per_grid,
                block_dim=softmax_threads,
            )

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            weights_2d = weights.reshape[layout_weights_2d]()

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            result_2d = output_tensor.reshape[layout_result_2d]()
            comptime kernel4 = matmul_idiomatic_tiled[
                layout_weights_2d,
                layout_v,
                layout_result_2d,
                1,
                d,
                seq_len,
                dtype,
            ]
            gpu_ctx.enqueue_function[kernel4, kernel4](
                result_2d,
                weights_2d,
                v_tensor,
                grid_dim=result_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

GPU ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜์€ ์ •๊ตํ•œ ์ปค๋„ ์ฒด์ด๋‹๊ณผ ์ œ๋กœ ์นดํ”ผ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์ „๋žต

# ์ œ๋กœ ์นดํ”ผ reshape - ๋ฐ์ดํ„ฐ ์ด๋™ ์—†์ด ํ…์„œ shape๋งŒ ์žฌํ•ด์„
q_2d = q_tensor.reshape[layout_q_2d]()
# ์ ๊ทน์ ์ธ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ - ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ, ๋‹ค๋ฅธ ํ•ด์„
weights = scores_2d.reshape[layout_scores]()

๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ์ตœ๋Œ€ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • ์ œ๋กœ ์นดํ”ผ ํ˜•ํƒœ ๋ณ€ํ™˜: ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋™ํ•˜์ง€ ์•Š๊ณ  ํ…์„œ ํ˜•ํƒœ๋ฅผ ์žฌํ•ด์„
  • ์ง€๋Šฅ์  ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ: ๋™์ผํ•œ scores_weights_buf๊ฐ€ ์ ์ˆ˜ \((1,\text{seq_len})\)์™€ ๊ฐ€์ค‘์น˜ \((\text{seq_len},)\) ์ด์ค‘ ์šฉ๋„๋กœ ํ™œ์šฉ
  • ์ตœ์†Œ ํ• ๋‹น: ๋‹จ 2๊ฐœ์˜ ์ž„์‹œ ๋ฒ„ํผ๋กœ ์ „์ฒด ์–ดํ…์…˜ ์—ฐ์‚ฐ ์ˆ˜ํ–‰
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ๋ชจ๋“  ์—ฐ์‚ฐ์—์„œ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€
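์ œ๋กœ ์นดํ”ผ reshape์™€ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ์˜ ๋™์ž‘ ์›๋ฆฌ๋Š” NumPy์˜ reshape ๋ทฐ๋กœ ๋น„์œ ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. LayoutTensor์˜ ๋™์ž‘๊ณผ ์™„์ „ํžˆ ๊ฐ™๋‹ค๋Š” ๋ณด์žฅ์€ ์—†์ง€๋งŒ, "๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ, ๋‹ค๋ฅธ ํ•ด์„"์ด๋ผ๋Š” ๊ฐœ๋…์€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค:

```python
import numpy as np

buf = np.zeros(16, dtype=np.float32)   # scores_weights_buf에 해당한다고 가정
scores_2d = buf.reshape(1, 16)         # (1, seq_len)로 해석 — 복사 없음
weights_1d = scores_2d.reshape(16)     # (seq_len,)로 재해석 — 역시 복사 없음

scores_2d[0, 3] = 7.0                  # 한쪽 해석으로 쓰면
value = weights_1d[3]                  # 다른 해석에서도 같은 값이 보인다
```

์„ธ ์ด๋ฆ„ ๋ชจ๋‘ ๊ฐ™์€ ๋ฒ„ํผ๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋ฏ€๋กœ, ์ ์ˆ˜๋ฅผ ์“ด ๋’ค ๊ฐ€์ค‘์น˜๋กœ ์žฌํ•ด์„ํ•ด๋„ ๋ฐ์ดํ„ฐ ์ด๋™์ด ์ „ํ˜€ ์ผ์–ด๋‚˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.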

์ „๋žต์  ์ปค๋„ ์žฌ์‚ฌ์šฉ ํŒจํ„ด

  • 3๋‹จ๊ณ„ & 7๋‹จ๊ณ„: ๋‘˜ ๋‹ค Puzzle 16์˜ matmul_idiomatic_tiled ํ™œ์šฉ
    • 3๋‹จ๊ณ„: \(Q \times K^T\) โ†’ ์–ดํ…์…˜ ์ ์ˆ˜ ๊ณ„์‚ฐ \((1,d) \times (d,\text{seq_len}) \rightarrow (1,\text{seq_len})\)
    • 7๋‹จ๊ณ„: \(\text{weights} \times V\) โ†’ ์ตœ์ข… ๊ฐ€์ค‘ ์ถœ๋ ฅ \((1,\text{seq_len}) \times (\text{seq_len},d) \rightarrow (1,d)\)
    • ๋‘ ์—ฐ์‚ฐ ๋ชจ๋‘ ๋‹ค์–‘ํ•œ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํฌํ•จ
  • 5๋‹จ๊ณ„: Puzzle 18์˜ softmax_kernel ์‚ฌ์šฉ
    • ์›์‹œ ์ ์ˆ˜๋ฅผ ์ •๊ทœํ™”๋œ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜
    • ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ๊ณผ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ํ†ตํ•œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ๋ณด์žฅ
    • \(\sum_{i} \text{weights}[i] = 1.0\) ๋ณด์žฅ

์ด๋Š” ๋ชจ๋“ˆํ˜• GPU ์•„ํ‚คํ…์ฒ˜์˜ ์ข‹์€ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค: ๋‹จ์ผ ๊ตฌํ˜„์ฒด๊ฐ€ ์•„๋‹Œ, ๊ฒ€์ฆ๋œ ์ตœ์ ํ™” ์ปค๋„๋“ค์„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ํ•˜์—ฌ ๋ณต์žกํ•œ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค!

ํ•ต์‹ฌ ๊ตฌํ˜„ ์ธ์‚ฌ์ดํŠธ

๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์ „๋žต

์ ๊ทน์ ์ธ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค:

# ์ „์ฒด ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ์ž„์‹œ ๋ฒ„ํผ๋Š” ๋‹จ 2๊ฐœ
k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len * d)
scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len)

ํ•ต์‹ฌ ์ตœ์ ํ™” ํฌ์ธํŠธ:

  • ๋™์ผํ•œ scores_weights_buf๊ฐ€ ํ˜•ํƒœ ๋ณ€ํ™˜ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ์–ดํ…์…˜ ์ ์ˆ˜์™€ ๊ฐ€์ค‘์น˜ ๋ชจ๋‘์— ์žฌ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค
  • ์ œ๋กœ ์นดํ”ผ ํ…์„œ ํ˜•ํƒœ ๋ณ€ํ™˜์œผ๋กœ ๋ถˆํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ์ด๋™์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค

์ปค๋„ ์žฌ์‚ฌ์šฉ ์•„ํ‚คํ…์ฒ˜

์ด ํผ์ฆ์€ ์„ธ ๊ฐ€์ง€ ํŠนํ™”๋œ ์ปค๋„์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ชจ๋“ˆํ˜• ์ปค๋„ ์„ค๊ณ„๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • matmul_idiomatic_tiled (2ํšŒ ์‚ฌ์šฉ) - \(Q \times K^T\)์™€ \(\text{weights} \times V\) ์—ฐ์‚ฐ ๋ชจ๋‘๋ฅผ ์ˆ˜ํ–‰
  • softmax_kernel - ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜ ๊ณ„์‚ฐ
  • transpose_kernel - ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ํšจ์œจ์ ์ธ \(K^T\) ๊ณ„์‚ฐ

์•„ํ‚คํ…์ฒ˜์˜ ์žฅ์ :

  • ์กฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ: ๊ฒ€์ฆ๋œ ์ปดํฌ๋„ŒํŠธ๋กœ ๋ณต์žกํ•œ ์—ฐ์‚ฐ ๊ตฌ์ถ•
  • ์œ ์ง€๋ณด์ˆ˜์„ฑ: ๊ฐ ์ปค๋„์ด ๋ช…ํ™•ํ•˜๊ฒŒ ์ •์˜๋œ ๋‹จ์ผ ์—ญํ•  ์ˆ˜ํ–‰
  • ์„ฑ๋Šฅ: ์ด์ „ ํผ์ฆ์˜ ๊ณ ๋„๋กœ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„ ํ™œ์šฉ
  • ํ™•์žฅ์„ฑ: ๋ชจ๋“ˆํ˜• ์„ค๊ณ„๋กœ ๋” ํฐ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ ํ™•์žฅ ์šฉ์ด

์ด ๊ตฌํ˜„์€ ์ •๊ตํ•œ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ์ด ๋‹จ์ผ ๊ตฌํ˜„์ฒด๊ฐ€ ์•„๋‹Œ, ๋” ๋‹จ์ˆœํ•˜๊ณ  ์ž˜ ๊ฒ€์ฆ๋œ GPU ์ปค๋„๋“ค์„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ํ•˜์—ฌ ๊ตฌ์ถ•ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ณด๋„ˆ์Šค ์ฑŒ๋ฆฐ์ง€

์ฑŒ๋ฆฐ์ง€ I: ๊ณ ๊ธ‰ ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„

์ด ์ฑŒ๋ฆฐ์ง€๋Š” Puzzle 18: ์†Œํ”„ํŠธ๋งฅ์Šค Op์˜ ํ™•์žฅ์ž…๋‹ˆ๋‹ค

์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„์„ ํ™•์žฅํ•˜๋Š” ๊ณ ๊ธ‰ ์ฑŒ๋ฆฐ์ง€๋“ค์ž…๋‹ˆ๋‹ค:

1. ๋Œ€๊ทœ๋ชจ ์†Œํ”„ํŠธ๋งฅ์Šค: TPB < SIZE ์ฒ˜๋ฆฌ

์ž…๋ ฅ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๋ฅผ ์ดˆ๊ณผํ•˜๋ฉด(TPB < SIZE), ๋‹จ์ผ ๋ธ”๋ก์ด ์ „์ฒด ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†์–ด ํ˜„์žฌ ๊ตฌํ˜„์ด ๋™์ž‘ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค:

1.1 ๋ฒ„ํผ ๋ฆฌ๋•์…˜

  • ๋ธ”๋ก ๋‹จ์œ„ ๊ฒฐ๊ณผ(์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„)๋ฅผ ๋””๋ฐ”์ด์Šค ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋‘ ๋ฒˆ์งธ ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋“ค์— ๋Œ€ํ•ด ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ์ „์—ญ ์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ตœ์ข… ์ •๊ทœํ™” ๋‹จ๊ณ„๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค

1.2 2๋‹จ๊ณ„ ์†Œํ”„ํŠธ๋งฅ์Šค

  • 1์ฐจ: ๊ฐ ๋ธ”๋ก์ด ๋กœ์ปฌ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ํ›„ ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • 2์ฐจ: \(e^{x-max}\)์™€ ๋กœ์ปฌ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ํ›„ ์ „์—ญ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์ข…: ์ „์—ญ ํ•ฉ๊ณ„๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค
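์œ„ 2๋‹จ๊ณ„ ์ ‘๊ทผ๋ฒ•์˜ ์ˆ˜์น˜์  ๋กœ์ง์€ Python์œผ๋กœ ๋จผ์ € ๊ฒ€์ฆํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. TPB ๊ฐ’์€ ์„ค๋ช…์„ ์œ„ํ•ด ์ž„์˜๋กœ ๊ฐ€์ •ํ–ˆ์œผ๋ฉฐ, ๋ธ”๋ก ๋ถ„ํ• ์€ GPU ๋ธ”๋ก๋“ค์˜ ์—ญํ• ์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ๊ฒƒ์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

TPB = 4  # 블록당 스레드 수에 해당한다고 가정한 예시 값

def two_pass_softmax(x):
    blocks = [x[i:i + TPB] for i in range(0, len(x), TPB)]
    # 1차: 각 블록이 로컬 최댓값 계산 → 전역 최댓값으로 리덕션
    global_max = max(b.max() for b in blocks)
    # 2차: e^{x - max}와 블록별 로컬 합 계산 → 전역 합으로 리덕션
    global_sum = sum(np.exp(b - global_max).sum() for b in blocks)
    # 최종: 전역 합으로 정규화
    return np.exp(x - global_max) / global_sum

x = np.linspace(-2.0, 3.0, 10)           # TPB보다 긴 입력 (SIZE > TPB)
ref = np.exp(x - x.max()); ref /= ref.sum()  # 단일 패스 참조 구현
result = two_pass_softmax(x)
```

๋ธ”๋ก ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„์–ด ๊ณ„์‚ฐํ•ด๋„ ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” ๋‹จ์ผ ํŒจ์Šค ์†Œํ”„ํŠธ๋งฅ์Šค์™€ ์ •ํ™•ํžˆ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.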

2. ๋ฐฐ์น˜ ์†Œํ”„ํŠธ๋งฅ์Šค

๋ฒกํ„ฐ ๋ฐฐ์น˜(2D ์ž…๋ ฅ ํ…์„œ)์— ๋Œ€ํ•œ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ๋‹ค์Œ ๋ณ€ํ˜•์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  • ํ–‰ ๋‹จ์œ„ ์†Œํ”„ํŠธ๋งฅ์Šค: ๊ฐ ํ–‰์— ๋…๋ฆฝ์ ์œผ๋กœ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ์—ด ๋‹จ์œ„ ์†Œํ”„ํŠธ๋งฅ์Šค: ๊ฐ ์—ด์— ๋…๋ฆฝ์ ์œผ๋กœ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๋‘ ๊ตฌํ˜„ ๊ฐ„์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค
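ํ–‰ ๋‹จ์œ„์™€ ์—ด ๋‹จ์œ„ ์†Œํ”„ํŠธ๋งฅ์Šค์˜ ์ฐจ์ด๋Š” NumPy์˜ axis ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๊ฐ„๋‹จํžˆ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. GPU ๊ตฌํ˜„ ์ „์— ๊ธฐ๋Œ€ ๊ฒฐ๊ณผ๋ฅผ ์ •์˜ํ•ด ๋‘๋Š” ์ฐธ์กฐ์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

def softmax(x, axis):
    # 최댓값 차감으로 수치 안정성 확보
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

batch = np.array([[1.0, 2.0, 3.0],
                  [0.0, 0.0, 0.0]])
row = softmax(batch, axis=1)  # 행 단위: 각 행의 합이 1
col = softmax(batch, axis=0)  # 열 단위: 각 열의 합이 1
```

GPU์—์„œ๋Š” ๊ฐ™์€ ์—ฐ์‚ฐ์ด๋ผ๋„ ํ–‰ ๋‹จ์œ„๋Š” ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ, ์—ด ๋‹จ์œ„๋Š” ํฐ ๋ณดํญ(stride)์˜ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฝ๊ฒŒ ๋˜๋ฏ€๋กœ ์„ฑ๋Šฅ ํŠน์„ฑ์ด ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.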

์ฑŒ๋ฆฐ์ง€ II: ๊ณ ๊ธ‰ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜

์ด ์ฑŒ๋ฆฐ์ง€๋Š” Puzzle 19: ์–ดํ…์…˜ Op์˜ ํ™•์žฅ์ž…๋‹ˆ๋‹ค

๋ฒกํ„ฐ ์–ดํ…์…˜ ๊ตฌํ˜„์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ํ•œ๊ณ„๋ฅผ ๋„“ํ˜€๋ณด๋Š” ๊ณ ๊ธ‰ ์ฑŒ๋ฆฐ์ง€๋“ค์ž…๋‹ˆ๋‹ค:

1. ๋” ๊ธด ์‹œํ€€์Šค ๊ธธ์ด

๊ธฐ์กด ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ ๋” ๊ธด ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

1.1 ์‹œํ€€์Šค ๊ธธ์ด ํ™•์žฅ

  • SEQ_LEN = 32์™€ SEQ_LEN = 64๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ์–ดํ…์…˜ ๊ตฌํ˜„์„ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค
  • TPB(๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ทธ์— ๋งž๊ฒŒ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค
  • ์ „์น˜ ์ปค๋„์ด ๋” ํฐ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค

1.2 ๋™์  ์‹œํ€€์Šค ๊ธธ์ด

  • ๋Ÿฐํƒ€์ž„์— ๊ฐ€๋ณ€ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ์–ดํ…์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค
  • SEQ_LEN๋ณด๋‹ค ์งง์€ ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ปค๋„์— ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค
  • ๊ณ ์ • ์‹œํ€€์Šค ๊ธธ์ด ์ฒ˜๋ฆฌ์™€ ๋™์  ์‹œํ€€์Šค ๊ธธ์ด ์ฒ˜๋ฆฌ์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค

2. ๋ฐฐ์น˜ ๋ฒกํ„ฐ ์–ดํ…์…˜

์—ฌ๋Ÿฌ ์–ดํ…์…˜ ์—ฐ์‚ฐ์„ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

2.1 ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ

  • ์—ฌ๋Ÿฌ ์ฟผ๋ฆฌ ๋ฒกํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋„๋ก ์–ดํ…์…˜ ์—ฐ์‚ฐ์„ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ ํ˜•ํƒœ: Q(batch_size, d), K(seq_len, d), V(seq_len, d)
  • ์ถœ๋ ฅ ํ˜•ํƒœ: (batch_size, d)
  • ์ ์ ˆํ•œ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๊ธฐ์กด ์ปค๋„์„ ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
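๋ฐฐ์น˜ ์–ดํ…์…˜์˜ ๊ธฐ๋Œ€ ๋™์ž‘์€ NumPy๋กœ ๋จผ์ € ์ •์˜ํ•ด ๋‘˜ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์œ„ ์ž…๋ ฅ/์ถœ๋ ฅ ํ˜•ํƒœ๋ฅผ ๋”ฐ๋ฅด๋Š” ์ฐธ์กฐ ์Šค์ผ€์น˜๋กœ, ํฌ๊ธฐ ๊ฐ’๋“ค์€ ์ž„์˜ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

def batched_attention(Q, K, V):
    # Q: (batch_size, d), K: (seq_len, d), V: (seq_len, d)
    scores = Q @ K.T                                    # (batch_size, seq_len)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # 행별 소프트맥스
    return w @ V                                        # (batch_size, d)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
out = batched_attention(Q, K, V)
```

๊ฐ ์ฟผ๋ฆฌ ํ–‰์— ๋Œ€ํ•ด ๋…๋ฆฝ์ ์œผ๋กœ ๋ฒกํ„ฐ ์–ดํ…์…˜์„ ์ ์šฉํ•œ ๊ฒƒ๊ณผ ๊ฒฐ๊ณผ๊ฐ€ ๊ฐ™์•„์•ผ ํ•˜๋ฏ€๋กœ, ๊ธฐ์กด ์ปค๋„์„ ๋ฐฐ์น˜ ์ธ๋ฑ์‹ฑ๋งŒ ์ถ”๊ฐ€ํ•ด ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.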

2.2 ๋ฐฐ์น˜๋ฅผ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”

  • ๋ฐฐ์น˜ ์š”์†Œ ๊ฐ„ ๋ฒ„ํผ๋ฅผ ์žฌ์‚ฌ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค
  • ๋‹ค์–‘ํ•œ ๋ฐฐ์น˜ ํฌ๊ธฐ(2, 4, 8)์—์„œ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ํŒจํ„ด์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค

Puzzle 20: 1D ํ•ฉ์„ฑ๊ณฑ Op

MAX ๊ทธ๋ž˜ํ”„์—์„œ PyTorch ์ปค์Šคํ…€ Op์œผ๋กœ

GPU ํผ์ฆ ์—ฌ์ •์˜ Part V์— ์ง„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค: PyTorch ์ปค์Šคํ…€ Op ํ†ตํ•ฉํ•˜๊ธฐ.

Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op์—์„œ MAX ๊ทธ๋ž˜ํ”„๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Mojo GPU ์ปค๋„์„ ํŒŒ์ด์ฌ๊ณผ ์—ฐ๋™ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ์ด์ œ๋ถ€ํ„ฐ๋Š” ๋‹ค์Œ์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค:

  • ๋™์ผํ•œ Mojo ์ปค๋„์„ PyTorch์˜ CustomOpLibrary๋กœ ์‚ฌ์šฉํ•˜๊ธฐ
  • PyTorch์˜ ํ…์„œ ์‹œ์Šคํ…œ ๋ฐ ์˜คํ† ๊ทธ๋ž˜๋“œ(autograd)์™€ ํ†ตํ•ฉํ•˜๊ธฐ
  • MAX ๊ทธ๋ž˜ํ”„์™€ PyTorch ๋ฐฉ์‹์˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ๋น„๊ตํ•˜๊ธฐ
  • ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น์ด๋ผ๋Š” ํ•ต์‹ฌ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

์ด ์ „ํ™˜์„ ํ†ตํ•ด ๋™์ผํ•œ ์ตœ์ ํ™”๋œ GPU ์ปค๋„์ด ์„œ๋กœ ๋‹ค๋ฅธ ํŒŒ์ด์ฌ ํ†ตํ•ฉ ๋ฐฉ์‹์—์„œ ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op์˜ 1D ํ•ฉ์„ฑ๊ณฑ(convolution) ์ปค๋„์„ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ์™€์„œ, MAX ๊ทธ๋ž˜ํ”„ ๋Œ€์‹  CustomOpLibrary๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ PyTorch์™€ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์„œ ํ•ต์‹ฌ์€ ๋™์ผํ•œ Mojo ์ปค๋„์ด ์ˆ˜์ • ์—†์ด ๊ทธ๋Œ€๋กœ ๋™์ž‘ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. MAX ๊ทธ๋ž˜ํ”„์™€ PyTorch ๋ฐฉ์‹ ์‚ฌ์ด์—์„œ ๋‹ฌ๋ผ์ง€๋Š” ๊ฒƒ์€ ํŒŒ์ด์ฌ ํ†ตํ•ฉ ๋ ˆ์ด์–ด๋ฟ์ž…๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ํ˜ธ์ถœํ•˜๋Š” ํ•œ ์ค„๋งŒ ์ฑ„์šฐ๋ฉด ๋ฉ๋‹ˆ๋‹ค:

from pathlib import Path

import torch
from max.torch import CustomOpLibrary


def conv1d_pytorch(
    input_tensor: torch.Tensor, kernel_tensor: torch.Tensor
) -> torch.Tensor:
    """
    1D convolution using our custom PyTorch operation.

    This demonstrates the transition from MAX Graph (p15) to PyTorch CustomOpLibrary.
    Uses the EXACT same Mojo kernel, but different Python integration!
    """
    # Load our custom operations
    mojo_kernels = Path(__file__).parent / "op"
    ops = CustomOpLibrary(mojo_kernels)

    # Create output tensor with same shape as input
    output_tensor = torch.empty_like(input_tensor)

    # Call our custom conv1d operation with explicit output tensor
    # The Mojo signature expects: (out, input, kernel)
    conv1d = ops.conv1d[
        {
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
        }
    ]

    # FILL IN with 1 line of code

    return output_tensor


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p20/p20.py

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p20
pixi run -e amd p20
uv run poe p20

์„ฑ๊ณตํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Puzzle 20: From MAX Graph to PyTorch Custom Ops
============================================================
Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]

NumPy reference result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]

Testing PyTorch Custom Op (device: cuda)
----------------------------------------
PyTorch custom op result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
โœ… PyTorch custom op verification PASSED

Comparing with MAX Graph approach (like p15)
--------------------------------------------
MAX Graph result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
โœ… MAX Graph verification PASSED
โœ… PyTorch and MAX Graph results MATCH

์†”๋ฃจ์…˜

์ปดํŒŒ์ผ๋œ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ์ ์ ˆํ•œ ์ธ์ž์™€ ํ•จ๊ป˜ ํ˜ธ์ถœํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

    # Call our custom conv1d operation with explicit output tensor
    # The Mojo signature expects: (out, input, kernel)
    conv1d = ops.conv1d[
        {
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
        }
    ]
    torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)

์ด ํ’€์ด๋Š” ๋ช‡ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

1. torch.compile() ํ†ตํ•ฉ

torch.compile ํ†ตํ•ฉ ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)

2. ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น

output_tensor = torch.empty_like(input_tensor)
  • MAX ๊ทธ๋ž˜ํ”„๋Š” ์ถœ๋ ฅ ํ• ๋‹น์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜์ง€๋งŒ
  • PyTorch CustomOpLibrary๋Š” ๋ฏธ๋ฆฌ ํ• ๋‹น๋œ ์ถœ๋ ฅ ํ…์„œ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • Mojo ์—ฐ์‚ฐ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋Š” (out, input, kernel) ์ˆœ์„œ๋ฅผ ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค

3. ํŒŒ๋ผ๋ฏธํ„ฐ ๋”•์…”๋„ˆ๋ฆฌ

ops.conv1d[{"input_size": input_tensor.shape[0], "conv_size": kernel_tensor.shape[0]}]
  • ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋”•์…”๋„ˆ๋ฆฌ ํ˜•ํƒœ๋กœ ์—ฐ์‚ฐ์— ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค
  • ์ด ๊ฐ’๋“ค์€ Mojo ์ปค๋„์˜ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค
  • Mojo @staticmethod fn execute ์‹œ๊ทธ๋‹ˆ์ฒ˜์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ด๋ฆ„๊ณผ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

4. ๊ฐ™์€ ์ปค๋„, ๋‹ค๋ฅธ ํ†ตํ•ฉ ๋ฐฉ์‹

๋‚ด๋ถ€์˜ Mojo ์ปค๋„(conv1d_kernel)์€ Puzzle 17๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค:

  • ๋™์ผํ•œ GPU ์ปค๋„ ์ฝ”๋“œ
  • ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋™์ผํ•œ ์—ฐ์‚ฐ ๋กœ์ง
  • ํŒŒ์ด์ฌ ๋ž˜ํผ ๋ ˆ์ด์–ด๋งŒ ๋‹ฌ๋ผ์ง

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์€ PyTorch ์ปค์Šคํ…€ ์—ฐ์‚ฐ์˜ ์ฃผ์š” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ฐœ๋…MAX ๊ทธ๋ž˜ํ”„ (p15)PyTorch CustomOpLibrary (p18)
์ถœ๋ ฅ ํ• ๋‹น์ž๋™์ˆ˜๋™ (torch.empty_like())
์—ฐ์‚ฐ ํ˜ธ์ถœops.custom(...)torch.compile(op)(...)
ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌparameters={...}op[{...}]
๋””๋ฐ”์ด์Šค ๊ด€๋ฆฌ๋ช…์‹œ์  device contextPyTorch ํ…์„œ์˜ device
๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌMAX ๊ทธ๋ž˜ํ”„ ํ…์„œPyTorch ํ…์„œ

ํ•ต์‹ฌ ํŒจํ„ด: ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น

๊ฐ€์žฅ ์ค‘์š”ํ•œ ์ฐจ์ด์ ์€ PyTorch CustomOpLibrary๊ฐ€ ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น์„ ์š”๊ตฌํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

# โŒ ๋™์ž‘ํ•˜์ง€ ์•Š์Œ - ์ถœ๋ ฅ ํ…์„œ ์—†์Œ
result = torch.compile(conv1d)(input_tensor, kernel_tensor)

# โœ… ๋™์ž‘ํ•จ - ๋ฏธ๋ฆฌ ํ• ๋‹น๋œ ์ถœ๋ ฅ ํ…์„œ
output_tensor = torch.empty_like(input_tensor)
torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)

์ด ํŒจํ„ด์ด ๋ณด์žฅํ•˜๋Š” ๊ฒƒ๋“ค:

  • ์˜ฌ๋ฐ”๋ฅธ ๋””๋ฐ”์ด์Šค์— ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  • ์ถœ๋ ฅ ํ…์„œ์˜ shape๊ณผ dtype์ด ์ •ํ™•
  • Mojo ์ปค๋„์ด ์ถœ๋ ฅ ๋ฒ„ํผ์— ์ง์ ‘ ์“ฐ๊ธฐ ๊ฐ€๋Šฅ

torch.compile() ํ†ตํ•ฉ

torch.compile()์ด ํ•„์ˆ˜์ ์ธ ์ด์œ :

  • PyTorch์™€ Mojo ์‚ฌ์ด์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ณ€ํ™˜ ์ฒ˜๋ฆฌ
  • ๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™” ๊ด€๋ฆฌ (CPU โ†” GPU)
  • ํ…์„œ ํฌ๋งท ๋ณ€ํ™˜ ์ตœ์ ํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ ์ œ๊ณต

์ฐธ๊ณ : torch.compile() ์—†์ด ์‚ฌ์šฉํ•˜๋ฉด std::bad_alloc ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์›์‹œ ์—ฐ์‚ฐ์ด PyTorch์˜ ํ…์„œ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ์ฒ˜๋ฆฌํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

์ปค์Šคํ…€ ์—ฐ์‚ฐ ๋””๋ฒ„๊น…

์ž์ฃผ ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ์™€ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•:

  1. ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์˜ค๋ฅ˜: ํ•ญ์ƒ torch.compile()์„ ์‚ฌ์šฉํ•˜์„ธ์š”
  2. ์ž˜๋ชป๋œ ์ถœ๋ ฅ ํ˜•์ƒ: ์ถœ๋ ฅ ํ…์„œ๊ฐ€ ๊ธฐ๋Œ€ํ•˜๋Š” ์ฐจ์›๊ณผ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  3. ๋””๋ฐ”์ด์Šค ๋ถˆ์ผ์น˜: ๋ชจ๋“  ํ…์„œ๊ฐ€ ๊ฐ™์€ ๋””๋ฐ”์ด์Šค์— ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  4. ํŒŒ๋ผ๋ฏธํ„ฐ ์˜ค๋ฅ˜: ํŒŒ๋ผ๋ฏธํ„ฐ ์ด๋ฆ„์ด Mojo ์—ฐ์‚ฐ ์‹œ๊ทธ๋‹ˆ์ฒ˜์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”

๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•: PyTorch ๊ฒฐ๊ณผ๋ฅผ ๋™์ผํ•œ ์ปค๋„์„ ์‹คํ–‰ํ•˜๋Š” MAX ๊ทธ๋ž˜ํ”„ ๋ ˆํผ๋Ÿฐ์Šค ๊ตฌํ˜„๊ณผ ๋น„๊ตํ•ด ๋ณด์„ธ์š”.

Puzzle 21: ์ž„๋ฒ ๋”ฉ Op

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์„ฑ๋Šฅ

๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ๊ณผ GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”์— ์ดˆ์ ์„ ๋งž์ถฐ Part V๋ฅผ ์ด์–ด๊ฐ‘๋‹ˆ๋‹ค.

Puzzle 20: 1D ํ•ฉ์„ฑ๊ณฑ Op์— ์ด์–ด, ๋™์ผํ•œ ์—ฐ์‚ฐ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ปค๋„ ๊ตฌํ˜„์ด ์„ฑ๋Šฅ์— ์–ผ๋งˆ๋‚˜ ๊ทน์ ์ธ ์ฐจ์ด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋Š”์ง€ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. ๋ฐฐ์šธ ๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • GPU ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ์ด ์ค‘์š”ํ•œ ์ด์œ 
  • ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์ปค๋„์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•
  • ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋”ฉ ์ „๋žต์ด ๊ฐ€์ ธ์˜ค๋Š” ์„ฑ๋Šฅ ์ฐจ์ด

์ด ํผ์ฆ์€ ์–ด๋–ค ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋А๋ƒ๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•˜๋А๋ƒ๊ฐ€ ๋” ์ค‘์š”ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์‹ ๊ฒฝ๋ง์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ์ธ ์ž„๋ฒ ๋”ฉ(embedding) ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋‘ ๊ฐ€์ง€ GPU ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋‘ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

๋น„๊ตํ•  ๋‘ ์ปค๋„:

  • 1D ๋ณ‘ํ•ฉ(coalesced) ์ปค๋„: ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์ตœ์ ํ™”
  • 2D ๋น„๋ณ‘ํ•ฉ(non-coalesced) ์ปค๋„: ๋น„๊ต๋ฅผ ์œ„ํ•œ ์ตœ์ ํ™”๋˜์ง€ ์•Š์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ด ๋น„๊ต๋ฅผ ํ†ตํ•ด GPU ์ปค๋„ ์„ฑ๋Šฅ์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ์ฒด๊ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐฐ๊ฒฝ: ์ž„๋ฒ ๋”ฉ ์—ฐ์‚ฐ

์ž„๋ฒ ๋”ฉ ์—ฐ์‚ฐ์€ ์ด์‚ฐ์ ์ธ ํ† ํฐ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ€์ง‘ ๋ฒกํ„ฐ ํ‘œํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

# Input: token indices
indices = [[1, 5, 2], [7, 1, 9]]           # Shape: [batch_size, seq_len]

# Embedding table (learned parameters)
embedding_table = [                        # Shape: [vocab_size, embed_dim]
    [0.1, 0.2, 0.3, 0.4],  # Token 0
    [0.5, 0.6, 0.7, 0.8],  # Token 1
    [0.9, 1.0, 1.1, 1.2],  # Token 2
    # ... more tokens
]

# Output: embedded vectors
output[0,0] = embedding_table[1]  # [0.5, 0.6, 0.7, 0.8]
output[0,1] = embedding_table[5]  # lookup token 5's embedding
output[0,2] = embedding_table[2]  # [0.9, 1.0, 1.1, 1.2]
# ... and so on

์ด ์—ฐ์‚ฐ์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž…๋‹ˆ๋‹ค. ์„ฑ๋Šฅ์€ ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”์—์„œ ์–ผ๋งˆ๋‚˜ ํšจ์œจ์ ์œผ๋กœ ์ฝ๊ณ  ์ถœ๋ ฅ ํ…์„œ์— ์“ธ ์ˆ˜ ์žˆ๋А๋ƒ์— ๋‹ฌ๋ ค ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต ๊ฒฝ๋กœ

์ด ํผ์ฆ์€ ์ฒด๊ณ„์ ์ธ ์ดํ•ด๋ฅผ ์œ„ํ•ด ๋‘ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ์ปค๋„

์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์‹ค์ œ ํผ์ฆ ์ฝ”๋“œ๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ  ์ปค๋„ ๊ตฌํ˜„์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ํ•˜๊ฒŒ ๋ ๊นŒ์š”:

  • ๋‘ ๊ฐ€์ง€ GPU ์ž„๋ฒ ๋”ฉ ์ปค๋„ ์™„์„ฑ (1D ๋ณ‘ํ•ฉ vs 2D ๋น„๋ณ‘ํ•ฉ)
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ๋ณธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ•™์Šต
  • ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋”ฉ ์ „๋žต์œผ๋กœ ๊ตฌํ˜„ํ•˜๋Š” ์‚ฌ๋ก€ ํ™•์ธ
  • Mojo์—์„œ์˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ๋“ฑ๋ก ์ดํ•ด

์„ฑ๋Šฅ ๋น„๊ต

์ปค๋„ ์„ฑ๋Šฅ์ด ์™œ ๋‹ค๋ฅธ์ง€, ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์˜ ์ด๋ก ์„ ๊นŠ์ด ํŒŒ๊ณ ๋“ญ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ๋ฐฐ์šธ๊นŒ์š”:

  • GPU ์„ฑ๋Šฅ์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์ค‘์š”ํ•œ ์ด์œ 
  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์ด ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ
  • ์‹ ๊ฒฝ๋ง ์ตœ์ ํ™”์— ๋Œ€ํ•œ ์‹ค์ œ ์‹œ์‚ฌ์ 
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ตœ์ ํ™” ์ „๋žต

์‹œ์ž‘ํ•˜๊ธฐ

GPU ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ฅผ ํƒ๊ตฌํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ์ปค๋„์—์„œ ์ฝ”๋“œ๋ฅผ ๊ตฌํ˜„ํ•œ ํ›„, ์„ฑ๋Šฅ ๋น„๊ต๋กœ ๋„˜์–ด๊ฐ€ ์„ฑ๋Šฅ ์ฐจ์ด์˜ ์›์ธ์„ ์ดํ•ดํ•ด ๋ณด์„ธ์š”.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์„œ๋กœ ๋‹ค๋ฅธ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ(1D vs 2D)์ด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ฃผ์˜ ๊นŠ๊ฒŒ ์‚ดํŽด๋ณด์„ธ์š”. ์ด ํ†ต์ฐฐ์€ ์ž„๋ฒ ๋”ฉ์„ ๋„˜์–ด ๋‹ค์–‘ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‹œ๋‚˜๋ฆฌ์˜ค์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ž„๋ฒ ๋”ฉ ์ปค๋„: ๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ

์ด ํผ์ฆ์—์„œ๋Š” ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜์ง€๋งŒ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ๋‘ ๊ฐ€์ง€ GPU ์ž„๋ฒ ๋”ฉ ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. GPU ์„ฑ๋Šฅ์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ์ง์ ‘ ์ฒดํ—˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

1D ๋ณ‘ํ•ฉ ์ปค๋„ (์ตœ์ ํ™”๋œ ์ ‘๊ทผ๋ฒ•)

์ด ์ปค๋„์€ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ์ถœ๋ ฅ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹จ์ˆœํ•œ 1D ๊ทธ๋ฆฌ๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๋‹ฌ์„ฑํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: [(total_elements + 255) // 256] ๋ธ”๋ก(์˜ฌ๋ฆผ ๋‚˜๋ˆ—์…ˆ), ๋ธ”๋ก๋‹น 256 ์Šค๋ ˆ๋“œ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ (batch, seq, embed) ์œ„์น˜ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ์ž„๋ฒ ๋”ฉ ์ฐจ์› ์ ‘๊ทผ

๊ตฌํ˜„ํ•  ๋‚ด์šฉ:

  1. ๋ธ”๋ก ์ธ๋ฑ์Šค์™€ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋กœ๋ถ€ํ„ฐ ์ „์—ญ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ
  2. 1์ฐจ์› ์ธ๋ฑ์Šค๋ฅผ 3D ์ขŒํ‘œ (batch_idx, seq_idx, embed_idx)๋กœ ๋ณ€ํ™˜
  3. indices ํ…์„œ์—์„œ ํ† ํฐ ์ธ๋ฑ์Šค ์กฐํšŒ
  4. ํ•ด๋‹นํ•˜๋Š” ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ ์š”์†Œ๋ฅผ ์ถœ๋ ฅ์— ๋ณต์‚ฌ
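1๋ฒˆ๊ณผ 2๋ฒˆ ๋‹จ๊ณ„์˜ ์ธ๋ฑ์Šค ๋ถ„ํ•ด๋Š” Python์œผ๋กœ ๋จผ์ € ๊ฒ€์ฆํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜ ์Šค์ผ€์น˜์˜ ํฌ๊ธฐ ๊ฐ’๋“ค์€ ์„ค๋ช…์„ ์œ„ํ•œ ์ž„์˜ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค:

```python
seq_len, embed_dim = 3, 4  # 예시 크기 (임의 가정)

def decompose(global_idx):
    # 1차원 전역 인덱스를 (batch, seq, embed) 3D 좌표로 변환
    batch_idx = global_idx // (seq_len * embed_dim)
    remaining = global_idx % (seq_len * embed_dim)
    seq_idx = remaining // embed_dim
    embed_idx = remaining % embed_dim
    return batch_idx, seq_idx, embed_idx

# 왕복 검증: (batch, seq, embed) → 1D 인덱스 → 다시 3D 좌표
for b in range(2):
    for s in range(seq_len):
        for e in range(embed_dim):
            g = (b * seq_len + s) * embed_dim + e
            assert decompose(g) == (b, s, e)
```

์—ฐ์†๋œ global_idx๊ฐ€ ์—ฐ์†๋œ embed_idx๋กœ ๋งคํ•‘๋˜๋ฏ€๋กœ, ์ด์›ƒํ•œ ์Šค๋ ˆ๋“œ๋“ค์ด ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”๊ณผ ์ถœ๋ ฅ์˜ ์—ฐ์†๋œ ์ฃผ์†Œ๋ฅผ ์ฝ๊ณ  ์“ฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ณ‘ํ•ฉ ์ ‘๊ทผ์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค.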

์™„์„ฑํ•  ์ฝ”๋“œ

๋‘ ์ž„๋ฒ ๋”ฉ ์ปค๋„์˜ ๋นˆ ๋ถ€๋ถ„์„ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

comptime THREADS_PER_BLOCK = 256


fn embedding_kernel_coalesced[
    indices_layout: Layout,
    weights_layout: Layout,
    out_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
    weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
):
    """
    Memory-coalescing focused embedding kernel.

    Key insight: The bottleneck is memory access patterns, not computation.
    - Each thread handles one (batch, seq, embed) position
    - Simple 1D grid for maximum simplicity and correctness
    - Focus on getting memory access right first
    """

    # Simple 1D indexing - each thread = one output element
    global_idx = Int(block_idx.x * block_dim.x + thread_idx.x)
    total_elements = batch_size * seq_len * embed_dim

    if global_idx >= total_elements:
        return

    # Convert to (batch, seq, embed) coordinates
    # FILL IN roughly 4 lines

    # Get token index
    # FILL IN 1 line

    # Simple, correct assignment
    # FILL IN 4 lines


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p21/op/embedding.mojo

ํŒ
  • global_idx = block_idx.x * block_dim.x + thread_idx.x๋กœ ์‹œ์ž‘ํ•˜์„ธ์š”
  • ๋‚˜๋ˆ—์…ˆ๊ณผ ๋‚˜๋จธ์ง€ ์—ฐ์‚ฐ์œผ๋กœ 3D ์ขŒํ‘œ๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค: batch_idx = global_idx // (seq_len * embed_dim)
  • remaining = global_idx % (seq_len * embed_dim)์„ ์‚ฌ์šฉํ•˜๋ฉด ์ดํ›„ ๊ณ„์‚ฐ์ด ๊ฐ„๋‹จํ•ด์ง‘๋‹ˆ๋‹ค
  • ํ•ญ์ƒ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ•˜์„ธ์š”: if global_idx >= total_elements: return
  • ์œ ํšจํ•˜์ง€ ์•Š์€ ํ† ํฐ ์ธ๋ฑ์Šค๋Š” ์ถœ๋ ฅ์„ 0์œผ๋กœ ์„ค์ •ํ•˜์„ธ์š”
  • ์ž„๋ฒ ๋”ฉ ์กฐํšŒ: output[batch_idx, seq_idx, embed_idx] = weights[token_idx, embed_idx]

2D ๋น„๋ณ‘ํ•ฉ ์ปค๋„ (๋น„๊ต์šฉ ์ ‘๊ทผ๋ฒ•)

์ด ์ปค๋„์€ X ์ฐจ์›์ด (batch ร— seq) ์œ„์น˜๋ฅผ, Y ์ฐจ์›์ด ์ž„๋ฒ ๋”ฉ ์ฐจ์›์„ ๋‹ด๋‹นํ•˜๋Š” 2D ๊ทธ๋ฆฌ๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋ณ‘ํ•ฉ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: [batch x seq // 16, embed_dim // 16] ๋ธ”๋ก, 16 x 16 ์Šค๋ ˆ๋“œ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: thread_idx.x๋Š” batch/sequence์—, thread_idx.y๋Š” ์ž„๋ฒ ๋”ฉ ์ฐจ์›์— ๋งคํ•‘
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ํฉ์–ด์ง„ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Œ

๊ตฌํ˜„ํ•  ๋‚ด์šฉ:

  1. 2D ๊ทธ๋ฆฌ๋“œ์—์„œ X, Y ์ขŒํ‘œ ๊ณ„์‚ฐ
  2. X ์ขŒํ‘œ๋ฅผ batch ์ธ๋ฑ์Šค์™€ sequence ์ธ๋ฑ์Šค๋กœ ๋ถ„๋ฆฌ
  3. Y ์ขŒํ‘œ๋ฅผ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์œผ๋กœ ์ง์ ‘ ์‚ฌ์šฉ
  4. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ ๋™์ผํ•œ ์ž„๋ฒ ๋”ฉ ์กฐํšŒ ์ˆ˜ํ–‰

์™„์„ฑํ•  ์ฝ”๋“œ

๋‘ ์ž„๋ฒ ๋”ฉ ์ปค๋„์˜ ๋นˆ ๋ถ€๋ถ„์„ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

fn embedding_kernel_2d[
    indices_layout: Layout,
    weights_layout: Layout,
    out_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
    weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
):
    """
    2D grid non-coalesced embedding kernel.

    Non-optimal approach for comparison:
    - 2D grid: (batch*seq, embed_dim)
    - More complex indexing
    - Potentially worse memory access patterns
    """

    # 2D grid indexing
    batch_seq_idx = Int(block_idx.x * block_dim.x + thread_idx.x)
    embed_idx = Int(block_idx.y * block_dim.y + thread_idx.y)
    total_positions = batch_size * seq_len

    if batch_seq_idx >= total_positions or embed_idx >= embed_dim:
        return

    # Convert to (batch, seq) coordinates
    # FILL IN 2 lines

    # Get token index
    # FILL IN 1 line

    # Assignment with 2D grid pattern
    # FILL IN 4 lines


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p21/op/embedding.mojo

ํŒ
  • X, Y ์Šค๋ ˆ๋“œ ์ขŒํ‘œ๋ฅผ ๋ชจ๋‘ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: batch_seq_idx = block_idx.x * block_dim.x + thread_idx.x
  • ๊ทธ๋ฆฌ๊ณ : embed_idx = block_idx.y * block_dim.y + thread_idx.y
  • batch_seq_idx๋ฅผ batch์™€ sequence ์ธ๋ฑ์Šค๋กœ ๋ถ„๋ฆฌํ•ฉ๋‹ˆ๋‹ค: batch_idx = batch_seq_idx // seq_len
  • ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์žŠ์ง€ ๋งˆ์„ธ์š”: if batch_seq_idx >= total_positions or embed_idx >= embed_dim
  • ํ† ํฐ ์กฐํšŒ๋Š” 1D์™€ ๋™์ผํ•˜์ง€๋งŒ, ์Šค๋ ˆ๋“œ๋‹น ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋”ฉ ์ฐจ์›๋งŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ์ด ์ปค๋„์€ ์ „์ฒด ๋ฒกํ„ฐ๊ฐ€ ์•„๋‹Œ ์Šค๋ ˆ๋“œ๋‹น ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์ปค์Šคํ…€ op ๋“ฑ๋ก

์ปค๋„๋“ค์€ PyTorch์™€ ์‰ฝ๊ฒŒ ํ†ตํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก ์ปค์Šคํ…€ ์—ฐ์‚ฐ์œผ๋กœ ๋ž˜ํ•‘๋ฉ๋‹ˆ๋‹ค. ๋“ฑ๋ก ํŒจํ„ด์€ MAX ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ…€ op ์ดํ•ดํ•˜๊ธฐ์—์„œ ์„ค๋ช…ํ•œ MAX ์ปค์Šคํ…€ op๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค:

1D ๋ณ‘ํ•ฉ ์—ฐ์‚ฐ

์ด ์—ฐ์‚ฐ์€ ์ตœ์ ํ™”๋œ 1D ์ž„๋ฒ ๋”ฉ ์ปค๋„์„ "embedding"์œผ๋กœ ๋“ฑ๋กํ•ฉ๋‹ˆ๋‹ค:

import compiler
from runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from memory import UnsafePointer
from gpu.host import DeviceBuffer


@compiler.register("embedding")
struct EmbeddingCustomOp:
    @staticmethod
    fn execute[
        target: StaticString,
        batch_size: Int,
        seq_len: Int,
        vocab_size: Int,
        embed_dim: Int,
    ](
        output: OutputTensor[
            dtype = DType.float32, rank=3
        ],  # [batch_size, seq_len, embed_dim]
        indices: InputTensor[
            dtype = DType.int32, rank=2
        ],  # [batch_size, seq_len]
        weights: InputTensor[
            dtype = output.dtype, rank=2
        ],  # [vocab_size, embed_dim]
        ctx: DeviceContextPtr,
    ) raises:
        output_tensor = output.to_layout_tensor()
        indices_tensor = indices.to_layout_tensor()
        weights_tensor = weights.to_layout_tensor()

        comptime indices_layout = indices_tensor.layout
        comptime weights_layout = weights_tensor.layout
        comptime out_layout = output_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()

            # Zero out output tensor
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    batch_size * seq_len * embed_dim,
                    owning=False,
                ),
                0,
            )

            # Calculate 1D grid dimensions (matching kernel's flat indexing)
            total_elements = batch_size * seq_len * embed_dim
            blocks = max(1, ceildiv(total_elements, THREADS_PER_BLOCK))

            # Compile and launch optimized kernel
            comptime kernel = embedding_kernel_coalesced[
                indices_layout,
                weights_layout,
                out_layout,
                batch_size,
                seq_len,
                vocab_size,
                embed_dim,
                output.dtype,
            ]
            compiled_kernel = gpu_ctx.compile_function[kernel, kernel]()

            gpu_ctx.enqueue_function(
                compiled_kernel,
                output_tensor,
                indices_tensor,
                weights_tensor,
                grid_dim=(blocks,),
                block_dim=(THREADS_PER_BLOCK,),
            )

        elif target == "cpu":
            for batch in range(batch_size):
                for seq in range(seq_len):
                    token_idx_val = Int(indices_tensor[batch, seq])
                    if token_idx_val >= 0 and token_idx_val < vocab_size:
                        for emb in range(embed_dim):
                            output_tensor[batch, seq, emb] = weights_tensor[
                                token_idx_val, emb
                            ]
        else:
            raise Error("Unsupported target: " + target)


๋“ฑ๋ก์˜ ํ•ต์‹ฌ ์š”์†Œ:

  • ๋‹จ์ˆœํ•œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ceildiv(total_elements, THREADS_PER_BLOCK) ๋ธ”๋ก์œผ๋กœ ์ง๊ด€์ ์ธ 1D ๊ทธ๋ฆฌ๋“œ ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ๋‹จ์ผ enqueue_memset ํ˜ธ์ถœ๋กœ ์ถœ๋ ฅ ๋ฒ„ํผ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ดˆ๊ธฐํ™”
  • ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ: ๋ชจ๋“  ํ…์„œ ์ฐจ์›์„ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ „๋‹ฌํ•˜์—ฌ ์ตœ์  ์„ฑ๋Šฅ ๋‹ฌ์„ฑ
  • ๋””๋ฐ”์ด์Šค ์ถ”์ƒํ™”: GPU ์‹คํ–‰๊ณผ CPU ํด๋ฐฑ์„ ๋งค๋„๋Ÿฝ๊ฒŒ ์ฒ˜๋ฆฌ

2D ๋น„๋ณ‘ํ•ฉ ์—ฐ์‚ฐ

์ด ์—ฐ์‚ฐ์€ ๋น„๊ต์šฉ 2D ์ž„๋ฒ ๋”ฉ ์ปค๋„์„ "embedding_2d"๋กœ ๋“ฑ๋กํ•ฉ๋‹ˆ๋‹ค:

@compiler.register("embedding_2d")
struct Embedding2DCustomOp:
    @staticmethod
    fn execute[
        target: StaticString,
        batch_size: Int,
        seq_len: Int,
        vocab_size: Int,
        embed_dim: Int,
    ](
        output: OutputTensor[
            dtype = DType.float32, rank=3
        ],  # [batch_size, seq_len, embed_dim]
        indices: InputTensor[
            dtype = DType.int32, rank=2
        ],  # [batch_size, seq_len]
        weights: InputTensor[
            dtype = output.dtype, rank=2
        ],  # [vocab_size, embed_dim]
        ctx: DeviceContextPtr,
    ) raises:
        output_tensor = output.to_layout_tensor()
        indices_tensor = indices.to_layout_tensor()
        weights_tensor = weights.to_layout_tensor()

        comptime indices_layout = indices_tensor.layout
        comptime weights_layout = weights_tensor.layout
        comptime out_layout = output_tensor.layout

        @parameter
        if target == "gpu":
            gpu_ctx = ctx.get_device_context()

            # Zero out output tensor
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    batch_size * seq_len * embed_dim,
                    owning=False,
                ),
                0,
            )

            # Calculate 2D grid dimensions for non-coalesced access
            total_positions = batch_size * seq_len
            comptime BLOCK_X = 16  # batch*seq dimension
            comptime BLOCK_Y = 16  # embed dimension
            blocks_x = max(1, ceildiv(total_positions, BLOCK_X))
            blocks_y = max(1, ceildiv(embed_dim, BLOCK_Y))

            # Compile and launch 2D kernel
            comptime kernel = embedding_kernel_2d[
                indices_layout,
                weights_layout,
                out_layout,
                batch_size,
                seq_len,
                vocab_size,
                embed_dim,
                output.dtype,
            ]

            compiled_kernel = gpu_ctx.compile_function[kernel, kernel]()

            gpu_ctx.enqueue_function(
                compiled_kernel,
                output_tensor,
                indices_tensor,
                weights_tensor,
                grid_dim=(blocks_x, blocks_y),
                block_dim=(BLOCK_X, BLOCK_Y),
            )

        elif target == "cpu":
            # Same CPU fallback as 1D version
            for batch in range(batch_size):
                for seq in range(seq_len):
                    token_idx_val = Int(indices_tensor[batch, seq])
                    if token_idx_val >= 0 and token_idx_val < vocab_size:
                        for emb in range(embed_dim):
                            output_tensor[batch, seq, emb] = weights_tensor[
                                token_idx_val, emb
                            ]
        else:
            raise Error("Unsupported target: " + target)


1D ์—ฐ์‚ฐ๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด์ :

  • ๋ณต์žกํ•œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: blocks_x์™€ blocks_y๋ฅผ ๋ณ„๋„๋กœ ๊ณ„์‚ฐํ•˜๋Š” 2D ๊ทธ๋ฆฌ๋“œ ์‚ฌ์šฉ
  • ๊ณ ์ • ๋ธ”๋ก ์ฐจ์›: 2D ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์„ ์œ„ํ•ด BLOCK_X = 16, BLOCK_Y = 16์œผ๋กœ ๊ณ ์ •
  • ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ: ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™”์™€ CPU ํด๋ฐฑ ๋กœ์ง์€ ๋™์ผ
  • ๋‹ค๋ฅธ ์ปค๋„ ํ˜ธ์ถœ ๋ฐฉ์‹: 2D ๊ทธ๋ฆฌ๋“œ ์ฐจ์› (blocks_x, blocks_y)๊ณผ ๋ธ”๋ก ์ฐจ์› (BLOCK_X, BLOCK_Y) ์ „๋‹ฌ

๊ณตํ†ต ๋ž˜ํผ ๊ธฐ๋Šฅ

๋‘ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ•„์ˆ˜ ์ธํ”„๋ผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  1. ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ:

    • enqueue_memset์œผ๋กœ ์ถœ๋ ฅ ํ…์„œ 0 ์ดˆ๊ธฐํ™”
    • ์ ์ ˆํ•œ ๋ฒ„ํผ ์ƒ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ
    • ์ž๋™ ์ •๋ฆฌ ๋ฐ ๋ฆฌ์†Œ์Šค ๊ด€๋ฆฌ
  2. ๋””๋ฐ”์ด์Šค ์ถ”์ƒํ™”:

    • ์ตœ์ ํ™”๋œ ์ปค๋„๋กœ GPU ์‹คํ–‰
    • ํ˜ธํ™˜์„ฑ๊ณผ ๋””๋ฒ„๊น…์„ ์œ„ํ•œ CPU ํด๋ฐฑ
    • ์‹คํ–‰ ๋Œ€์ƒ์— ๊ด€๊ณ„์—†์ด ์ผ๊ด€๋œ ์ธํ„ฐํŽ˜์ด์Šค
  3. ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌ:

    • ์ปค๋„ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ปดํŒŒ์ผ ํƒ€์ž„ ํ…์„œ ์ฐจ์›
    • ๋ ˆ์ด์•„์›ƒ ํ…์„œ ๋ณ€ํ™˜์„ ํ†ตํ•œ ๋Ÿฐํƒ€์ž„ ํ…์„œ ๋ฐ์ดํ„ฐ
    • ํƒ€์ž… ์•ˆ์ „ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฒ€์ฆ
  4. ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ:

    • ์ตœ์ ์˜ ๊ทธ๋ฆฌ๋“œ ์ฐจ์› ์ž๋™ ๊ณ„์‚ฐ
    • ๊ฐ ์ปค๋„์˜ ์ ‘๊ทผ ํŒจํ„ด์— ์ตœ์ ํ™”๋œ ์„œ๋กœ ๋‹ค๋ฅธ ์ „๋žต
    • ์ ์ ˆํ•œ ๋ธ”๋ก ์ฐจ์› ๊ด€๋ฆฌ

PyTorch ํ†ตํ•ฉ

๋“ฑ๋ก๋œ ์—ฐ์‚ฐ์€ CustomOpLibrary๋ฅผ ํ†ตํ•ด ํŒŒ์ด์ฌ์—์„œ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# Load the custom operations
ops = CustomOpLibrary(mojo_kernels)

# Call the 1D coalesced version
result_1d = ops.embedding[{"batch_size": B, "seq_len": L, "vocab_size": V, "embed_dim": E}](
    indices, weights
)

# Call the 2D non-coalesced version
result_2d = ops.embedding_2d[{"batch_size": B, "seq_len": L, "vocab_size": V, "embed_dim": E}](
    indices, weights
)

์ด ์ ‘๊ทผ๋ฒ•์˜ ์žฅ์ ์€ ๋™์ผํ•œ ์ปค๋„ ๊ตฌํ˜„์„ ๋‹ค์–‘ํ•œ ํŒŒ์ด์ฌ ํ”„๋ ˆ์ž„์›Œํฌ์—์„œ ์‚ฌ์šฉํ•˜๋ฉด์„œ๋„ ์ตœ์ ์˜ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p21
pixi run -e amd p21
uv run poe p21

์„ฑ๊ณตํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Puzzle 21: Mojo Embedding Kernel Comparison
======================================================================
Configuration: B=8, L=512, V=10000, E=512
------------------------------------------------------------

Testing Correctness...
   1D Coalesced - Max difference: 1.19e-07
   2D Non-coalesced - Max difference: 1.19e-07
   โœ… Both implementations CORRECT

Benchmarking Mojo Kernels...

Performance Results:
   1D Coalesced:     2.145 ms
   2D Non-coalesced: 3.867 ms
   1D is 1.80x faster than 2D

Key Learning Points:
โ€ข Compare different GPU kernel implementations
โ€ข 1D vs 2D grid patterns have different memory access
โ€ข Coalesced memory access should be faster
โ€ข Grid configuration affects GPU utilization

์†”๋ฃจ์…˜

๋‘ ์ปค๋„์˜ ์ขŒํ‘œ ๋ณ€ํ™˜๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

1D ๋ณ‘ํ•ฉ ์ปค๋„

fn embedding_kernel_coalesced[
    indices_layout: Layout,
    weights_layout: Layout,
    out_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
    weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
):
    """
    Memory-coalescing focused embedding kernel.

    Key insight: The bottleneck is memory access patterns, not computation.
    - Each thread handles one (batch, seq, embed) position
    - Simple 1D grid for maximum simplicity and correctness
    - Focus on getting memory access right first
    """

    # Simple 1D indexing - each thread = one output element
    global_idx = Int(block_idx.x * block_dim.x + thread_idx.x)
    total_elements = batch_size * seq_len * embed_dim

    if global_idx >= total_elements:
        return

    # Convert to (batch, seq, embed) coordinates
    batch_idx = global_idx // (seq_len * embed_dim)
    remaining = global_idx % (seq_len * embed_dim)
    seq_idx = remaining // embed_dim
    embed_idx = remaining % embed_dim

    # Get token index
    token_idx_val = Int(indices[batch_idx, seq_idx])

    # Simple, correct assignment
    if token_idx_val >= 0 and token_idx_val < vocab_size:
        output[batch_idx, seq_idx, embed_idx] = weights[
            token_idx_val, embed_idx
        ]
    else:
        output[batch_idx, seq_idx, embed_idx] = 0


2D ๋น„๋ณ‘ํ•ฉ ์ปค๋„

fn embedding_kernel_2d[
    indices_layout: Layout,
    weights_layout: Layout,
    out_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    indices: LayoutTensor[DType.int32, indices_layout, MutAnyOrigin],
    weights: LayoutTensor[dtype, weights_layout, MutAnyOrigin],
):
    """
    2D grid non-coalesced embedding kernel.

    Non-optimal approach for comparison:
    - 2D grid: (batch*seq, embed_dim)
    - More complex indexing
    - Potentially worse memory access patterns
    """

    # 2D grid indexing
    batch_seq_idx = Int(block_idx.x * block_dim.x + thread_idx.x)
    embed_idx = Int(block_idx.y * block_dim.y + thread_idx.y)

    total_positions = batch_size * seq_len

    # Bounds check
    if batch_seq_idx >= total_positions or embed_idx >= embed_dim:
        return

    # Convert to (batch, seq) coordinates
    batch_idx = batch_seq_idx // seq_len
    seq_idx = batch_seq_idx % seq_len

    # Get token index
    token_idx_val = Int(indices[batch_idx, seq_idx])

    # Assignment with 2D grid pattern
    if token_idx_val >= 0 and token_idx_val < vocab_size:
        output[batch_idx, seq_idx, embed_idx] = weights[
            token_idx_val, embed_idx
        ]
    else:
        output[batch_idx, seq_idx, embed_idx] = 0


๋‘ ํ’€์ด ๋ชจ๋‘ ๋™์ผํ•œ ์ž„๋ฒ ๋”ฉ ์กฐํšŒ ๋กœ์ง์„ ๊ตฌํ˜„ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค:

์ฃผ์š” ์ฐจ์ด์ 

  1. ์Šค๋ ˆ๋“œ ๋งคํ•‘:

    • 1D ์ปค๋„: ์ถœ๋ ฅ ์š”์†Œ๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ, ๋‹จ์ˆœํ•œ 1์ฐจ์› ์ธ๋ฑ์‹ฑ
    • 2D ์ปค๋„: (batchร—seq, embed_dim) ์ขŒํ‘œ์— ๋Œ€ํ•œ 2D ๊ทธ๋ฆฌ๋“œ ๋งคํ•‘
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • 1D ์ปค๋„: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์— ์ ‘๊ทผ โ†’ ๋ณ‘ํ•ฉ๋จ
    • 2D ์ปค๋„: ์Šค๋ ˆ๋“œ ์ ‘๊ทผ ํŒจํ„ด์ด ๋ธ”๋ก ๊ตฌ์„ฑ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง โ†’ ๋ณ‘ํ•ฉ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Œ
  3. ์ธ๋ฑ์‹ฑ ๋ณต์žก๋„:

    • 1D ์ปค๋„: ๋‹จ์ผ ๋‚˜๋ˆ—์…ˆ/๋‚˜๋จธ์ง€ ์ฒด์ธ์œผ๋กœ 3D ์ขŒํ‘œ ๊ณ„์‚ฐ
    • 2D ์ปค๋„: X/Y ์ขŒํ‘œ๋ฅผ ๋ณ„๋„๋กœ ๊ณ„์‚ฐ

์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ

1D ์ปค๋„์ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ์ด์œ :

  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผ
  • ๋‹จ์ˆœํ•œ ์ธ๋ฑ์‹ฑ: ์ขŒํ‘œ ๊ณ„์‚ฐ์˜ ์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋‚ฎ์Œ
  • ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

2D ์ปค๋„์˜ ์„ฑ๋Šฅ์ด ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ๋Š” ์ด์œ :

  • ํฉ์–ด์ง„ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ณต์žกํ•œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: 16ร—16 ๋ธ”๋ก์ด ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ์ตœ์ ์œผ๋กœ ๋งž์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Œ
  • ์›Œํ”„ ๋ถ„๊ธฐ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ์‹คํ–‰ ๊ฒฝ๋กœ๋ฅผ ๋”ฐ๋ฅผ ์ˆ˜ ์žˆ์Œ

ํ•ต์‹ฌ ๊ฐœ๋…

๊ฐœ๋…1D ๋ณ‘ํ•ฉ2D ๋น„๋ณ‘ํ•ฉ
์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ1D 1์ฐจ์› ์ธ๋ฑ์‹ฑ2D ๊ทธ๋ฆฌ๋“œ (batchร—seq, embed)
๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์—ฐ์†๋œ ์ฃผ์†Œํฉ์–ด์งˆ ์ˆ˜ ์žˆ์Œ
๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ๋‹จ์ˆœ: [total_elements // 256]๋ณต์žก: [batchร—seq // 16, embed // 16]
์„ฑ๋Šฅ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์ตœ์ ํ™”์ตœ์ ํ™”๋˜์ง€ ์•Š์€ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด
์‚ฌ์šฉ ๋ชฉ์ ํ”„๋กœ๋•์…˜ ์ปค๋„๊ต์œก์šฉ ๋น„๊ต

ํ•ต์‹ฌ ๊ตํ›ˆ: ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์€ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ 2~3๋ฐฐ์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์„ฑ๋Šฅ: ๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ GPU ์„ฑ๋Šฅ ์ตœ์ ํ™”์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ์ด ์„น์…˜์—์„œ๋Š” ์ž„๋ฒ ๋”ฉ ์กฐํšŒ์™€ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์™œ ๋น„๋ณ‘ํ•ฉ ํŒจํ„ด๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š”์ง€ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ๊ธฐ์ดˆ

๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์€ ์›Œํ”„ ๋‚ด ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. GPU๋Š” ์ด๋Ÿฌํ•œ ๊ฐœ๋ณ„ ๋ฉ”๋ชจ๋ฆฌ ์š”์ฒญ์„ ๋” ์ ์€ ์ˆ˜์˜ ๋Œ€์šฉ๋Ÿ‰ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜์œผ๋กœ ๊ฒฐํ•ฉํ•˜์—ฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.

๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ์ ‘๊ทผ

๋ณ‘ํ•ฉ (ํšจ์œจ์ ):

- Thread 0 โ†’ Address 0x1000
- Thread 1 โ†’ Address 0x1004
- Thread 2 โ†’ Address 0x1008
- Thread 3 โ†’ Address 0x100C
- ...

๊ฒฐ๊ณผ: ์›Œํ”„ ์ „์ฒด(32๊ฐœ ์Šค๋ ˆ๋“œ)์— ๋Œ€ํ•ด 1๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

๋น„๋ณ‘ํ•ฉ (๋น„ํšจ์œจ์ ):

- Thread 0 โ†’ Address 0x1000
- Thread 1 โ†’ Address 0x2000
- Thread 2 โ†’ Address 0x3000
- Thread 3 โ†’ Address 0x4000
- ...

๊ฒฐ๊ณผ: ์ตœ๋Œ€ 32๋ฒˆ์˜ ๊ฐœ๋ณ„ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

์ž„๋ฒ ๋”ฉ ์—ฐ์‚ฐ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ธ ์ด์œ 

์ž„๋ฒ ๋”ฉ ์กฐํšŒ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํŠน์„ฑ ๋•Œ๋ฌธ์— ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž…๋‹ˆ๋‹ค:

  • ์ตœ์†Œํ•œ์˜ ์—ฐ์‚ฐ: ํ•˜๋Š” ์ผ์ด๋ผ๊ณค ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ถœ๋ ฅ์œผ๋กœ ๋ณต์‚ฌํ•˜๋Š” ๊ฒƒ๋ฟ
  • ํฐ ๋ฉ”๋ชจ๋ฆฌ ํ’‹ํ”„๋ฆฐํŠธ: ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”์€ ์ˆ˜ ๊ธฐ๊ฐ€๋ฐ”์ดํŠธ์— ๋‹ฌํ•  ์ˆ˜ ์žˆ์Œ
  • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์š”๊ตฌ: ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ ์ „์†ก์ด ํ•„์š”

์ด๋Ÿฌํ•œ ์—ฐ์‚ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ๋ณต์žก๋„๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์ด ์„ฑ๋Šฅ์„ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

์ปค๋„ ๋น„๊ต

1D ๋ณ‘ํ•ฉ ์ปค๋„

  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ: [total_elements // 256] ๋ธ”๋ก, ์ถœ๋ ฅ ์š”์†Œ๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์— ์ ‘๊ทผ
  • ์™œ ๋ณ‘ํ•ฉ๋˜๋Š”๊ฐ€: Thread 0: output[0,0,0], Thread 1: output[0,0,1] โ†’ ์—ฐ์†๋œ ์ฃผ์†Œ

2D ๋น„๋ณ‘ํ•ฉ ์ปค๋„

  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ: [batch*seq // 16, embed_dim // 16] ๋ธ”๋ก, 16ร—16 ์Šค๋ ˆ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Œ
  • ์™œ ๋น„๋ณ‘ํ•ฉ์ธ๊ฐ€: ์Šค๋ ˆ๋“œ ์ ‘๊ทผ ํŒจํ„ด์ด ๋ฉ”๋ชจ๋ฆฌ ์ „์ฒด์— ํฉ์–ด์งˆ ์ˆ˜ ์žˆ์Œ

์„ฑ๋Šฅ ๊ฒฐ๊ณผ

์ผ๋ฐ˜์ ์ธ ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ:

Performance Results:
   1D Coalesced:     2.145 ms
   2D Non-coalesced: 3.867 ms
   1D is 1.80x faster than 2D
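위 수치로 유효 메모리 대역폭을 대략 추정해 볼 수 있습니다. 다음 계산은 가중치 읽기와 출력 쓰기 각 1회만 세고 인덱스 읽기는 무시한 단순화 가정이며, 시간 값은 본문의 예시 수치입니다:

```python
# 구성: B=8, L=512, E=512, float32(4바이트)
B, L, E = 8, 512, 512
bytes_moved = 2 * B * L * E * 4          # 읽기 + 쓰기 (단순화 가정)

t_1d, t_2d = 2.145e-3, 3.867e-3          # 초 단위 (본문 예시 수치)
bw_1d = bytes_moved / t_1d / 1e9         # 약 7.8 GB/s
bw_2d = bytes_moved / t_2d / 1e9         # 약 4.3 GB/s
print(f"1D: {bw_1d:.1f} GB/s, 2D: {bw_2d:.1f} GB/s")
```

두 커널 모두 같은 양의 데이터를 옮기므로, 속도 차이는 곧 유효 대역폭의 차이입니다. 이런 추정치를 GPU의 이론 대역폭과 비교하면 커널이 메모리 바운드인지 바로 확인할 수 있습니다.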

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ๊ฐํ™”

๋ณ‘ํ•ฉ ํŒจํ„ด (1D ์ปค๋„)

output[0,0,0:32]์— ๋Œ€ํ•œ ์›Œํ”„ ์‹คํ–‰:

์š”์†Œ์Šค๋ ˆ๋“œ ID๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ฃผ์†Œ ํŒจํ„ด
output[0,0,0]0[0,0]Base + 0
output[0,0,1]1[0,1]Base + 4
output[0,0,2]2[0,2]Base + 8
output[0,0,3]3[0,3]Base + 12
โ€ฆโ€ฆโ€ฆโ€ฆ
output[0,0,31]31[0,31]Base + 124

๊ฒฐ๊ณผ: ์—ฐ์†๋œ ์ฃผ์†Œ โ†’ ์›Œํ”„ ์ „์ฒด์— ๋Œ€ํ•ด 1๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

๋น„๋ณ‘ํ•ฉ ํŒจํ„ด (2D ์ปค๋„)

16ร—16 ๋ธ”๋ก์˜ ์›Œํ”„ ์‹คํ–‰:

Block organization (16ร—16):
    X-dim: batch*seq positions (0-15)
    Y-dim: embed dimensions (0-15)

Warp threads might access:
    Thread 0:  batch=0, seq=0, embed=0  โ†’ Address A
    Thread 1:  batch=0, seq=1, embed=0  โ†’ Address B (different row)
    Thread 2:  batch=0, seq=2, embed=0  โ†’ Address C (different row)
    ...
    Thread 31: batch=1, seq=15, embed=0 โ†’ Address Z (scattered)

๊ฒฐ๊ณผ: ํฉ์–ด์ง„ ์ฃผ์†Œ โ†’ ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

ํ•ต์‹ฌ ์ตœ์ ํ™” ์ „๋žต

  1. ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ๋Š” ๊ฐ€๋Šฅํ•œ ํ•œ 1D ์ธ๋ฑ์‹ฑ์„ ์„ ํ˜ธํ•˜์„ธ์š”
  2. ๋ณ‘ํ•ฉ์— ์œ ๋ฆฌํ•˜๋„๋ก ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ๋ฅผ ์ •๋ ฌํ•˜์„ธ์š”
  3. ์ปค๋„ ์„ค๊ณ„ ์‹œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ณ ๋ คํ•˜์„ธ์š”
  4. ๋ณ‘๋ชฉ ์ง€์ ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜์„ธ์š”
  5. ์ตœ์ ํ™” ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ๋ฒค์น˜๋งˆํฌ๋ฅผ ํ™œ์šฉํ•˜์„ธ์š”

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํŠนํžˆ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ๋ณต์žก๋„๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด GPU ์„ฑ๋Šฅ์„ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

Puzzle 22: ์ปค๋„ ํ“จ์ „๊ณผ ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค

์ปค๋„ ํ“จ์ „๊ณผ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ

์ปค๋„ ํ“จ์ „ ๊ณผ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์— ์ดˆ์ ์„ ๋งž์ถฐ Part V๋ฅผ ์ด์–ด๊ฐ‘๋‹ˆ๋‹ค.

Puzzle 21: ์ž„๋ฒ ๋”ฉ Op์— ์ด์–ด, ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ ํšจ์œจ์ ์ธ ์ปค๋„๋กœ ๊ฒฐํ•ฉํ•˜๊ณ  ์ด๋ฅผ PyTorch์˜ ์˜คํ† ๊ทธ๋ž˜๋“œ ์‹œ์Šคํ…œ๊ณผ ํ†ตํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. ๋ฐฐ์šธ ๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • ์ปค๋„ ํ“จ์ „์ด ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค(forward pass)์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค(backward pass) ๋ชจ๋‘์—์„œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š” ์›๋ฆฌ
  • ํ“จ์ „ ์—ฐ์‚ฐ์— ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ํ•„์ˆ˜์ ์ธ ์ด์œ 
  • ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์„ ๊ฐ–์ถ˜ ํ“จ์ „ ์ปค๋„ ์„ค๊ณ„ ๋ฐฉ๋ฒ•
  • ์„œ๋กœ ๋‹ค๋ฅธ ํ“จ์ „ ์ „๋žต์ด ๊ฐ€์ ธ์˜ค๋Š” ์„ฑ๋Šฅ ์ฐจ์ด

์ด ํผ์ฆ์€ ์—ฐ์‚ฐ์„ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•˜๋А๋ƒ๊ฐ€ ์–ด๋–ป๊ฒŒ ๊ตฌํ˜„ํ•˜๋А๋ƒ๋งŒํผ ์ค‘์š”ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋ฅผ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” ํ“จ์ „ LayerNorm + Linear ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ํ“จ์ „๊ณผ ์–ธํ“จ์ „ ๊ตฌํ˜„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ์ „๋žต์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

๋น„๊ตํ•  ๋‚ด์šฉ:

  • ์–ธํ“จ์ „ ๋ฐฉ์‹: LayerNorm๊ณผ Linear๋ฅผ ๋ณ„๋„์˜ ์ปค๋„๋กœ ์‹คํ–‰
  • ํ“จ์ „ ์ปค๋„: ํ•˜๋‚˜์˜ ์ปค๋„์—์„œ ๋‘ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์‹คํ–‰
  • ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค: ํ“จ์ „ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ

์ด ๋น„๊ต๋ฅผ ํ†ตํ•ด ๋”ฅ๋Ÿฌ๋‹ ์—ฐ์‚ฐ์—์„œ ์ปค๋„ ํ“จ์ „๊ณผ ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ์ฒด๊ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐฐ๊ฒฝ: LayerNorm + Linear ์—ฐ์‚ฐ

LayerNorm๊ณผ Linear๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์˜ ํ•ต์‹ฌ ์—ฐ์‚ฐ์œผ๋กœ, ํŠนํžˆ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜๊ณผ ํ”ผ๋“œํฌ์›Œ๋“œ ๋„คํŠธ์›Œํฌ์—์„œ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์ธ ์‚ฌ์šฉ ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

import torch
import torch.nn.functional as F

# Input: hidden states
x = torch.randn(batch_size, seq_len, hidden_dim)

# LayerNorm parameters
ln_weight = torch.ones(hidden_dim)  # scale parameter (ฮณ)
ln_bias = torch.zeros(hidden_dim)   # shift parameter (ฮฒ)

# Linear layer parameters
linear_weight = torch.randn(output_dim, hidden_dim)
linear_bias = torch.zeros(output_dim)

# Unfused operations (with autograd)
ln_output = F.layer_norm(x, [hidden_dim], weight=ln_weight, bias=ln_bias)
output = F.linear(ln_output, linear_weight, linear_bias)

# Fused operation (custom implementation)
# This is what you'll implement in this puzzle
output_fused = fused_layernorm_linear(x, ln_weight, ln_bias, linear_weight, linear_bias)

ํ“จ์ „ ์—ฐ์‚ฐ์œผ๋กœ ๊ฒฐํ•ฉํ•˜๋ฉด ํ•˜๋‚˜์˜ ํšจ์œจ์ ์ธ ์ปค๋„์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”
  • ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
  • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ œ๊ฑฐ

์‹ค์ œ๋กœ ์ด๋Ÿฌํ•œ ํ“จ์ „์€ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋ชจ๋‘์—์„œ ์ตœ๋Œ€ 1.5~2๋ฐฐ์˜ ์†๋„ ํ–ฅ์ƒ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ํ•™์Šต ํšจ์œจ์— ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ 

PyTorch์˜ ์˜คํ† ๊ทธ๋ž˜๋“œ ์‹œ์Šคํ…œ์€ ๊ฐœ๋ณ„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ๋ฅผ ์ž๋™์œผ๋กœ ๊ณ„์‚ฐํ•˜์ง€๋งŒ, ํ“จ์ „ ์—ฐ์‚ฐ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์œ ๋กœ ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
  • ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„ ๋ณด์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”
  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ

ํ•™์Šต ๊ฒฝ๋กœ

์ด ํผ์ฆ์€ ์ฒด๊ณ„์ ์ธ ์ดํ•ด๋ฅผ ์œ„ํ•ด ๋‘ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

ํ“จ์ „ vs ์–ธํ“จ์ „ ์ปค๋„

์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ํ“จ์ „ ์ˆœ๋ฐฉํ–ฅ ์ปค๋„์„ ๊ตฌํ˜„ํ•˜๊ณ  ์ปค๋„ ํ“จ์ „์˜ ์ด์ ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ํ•˜๊ฒŒ ๋ ๊นŒ์š”:

  • ์–ธํ“จ์ „๊ณผ ํ“จ์ „ ์ˆœ๋ฐฉํ–ฅ ์ปค๋„ ๋ชจ๋‘ ๊ตฌํ˜„
  • ํ•ต์‹ฌ ์ปค๋„ ํ“จ์ „ ๊ธฐ๋ฒ• ํ•™์Šต
  • ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์„œ๋กœ ๋‹ค๋ฅธ ์ „๋žต์œผ๋กœ ๊ตฌํ˜„ํ•˜๋Š” ์‚ฌ๋ก€ ํ™•์ธ
  • ํ“จ์ „์ด ๊ฐ€์ ธ์˜ค๋Š” ์„ฑ๋Šฅ ์ฐจ์ด ์ดํ•ด
  • ์ตœ์  ์„ฑ๋Šฅ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ•™์Šต

์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค

์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์„ ๊นŠ์ด ํŒŒ๊ณ ๋“ญ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ๋ฐฐ์šธ๊นŒ์š”:

  • ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„ ๋ฐฉ๋ฒ•
  • ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์ด ์ค‘์š”ํ•œ ์ด์œ 
  • ํ•™์Šต ํšจ์œจ์— ๋Œ€ํ•œ ์‹ค์ œ ์‹œ์‚ฌ์ 
  • ์—ญ๋ฐฉํ–ฅ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ตœ์ ํ™” ์ „๋žต
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ
  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
  • ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์—์„œ์˜ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ

์‹œ์ž‘ํ•˜๊ธฐ

์ปค๋„ ํ“จ์ „๊ณผ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์„ ํƒ๊ตฌํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ํ“จ์ „ vs ์–ธํ“จ์ „ ์ปค๋„ ์—์„œ ํ“จ์ „ ์ปค๋„์„ ๊ตฌํ˜„ํ•œ ํ›„, ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋กœ ๋„˜์–ด๊ฐ€ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์„ ์ดํ•ดํ•ด ๋ณด์„ธ์š”.

์ด ํผ์ฆ์—๋Š” ๋‹ค์Œ์„ ๊ฒ€์ฆํ•˜๋Š” ์ข…ํ•ฉ ํ…Œ์ŠคํŠธ ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋ชจ๋‘์—์„œ PyTorch ๊ตฌํ˜„๊ณผ์˜ ์ˆ˜์น˜์  ์ •ํ™•๋„
  • CPU์™€ GPU ๊ตฌํ˜„ ๊ฐ„์˜ ์„ฑ๋Šฅ ๋น„๊ต
  • ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ(์ž…๋ ฅ, LayerNorm ๊ฐ€์ค‘์น˜/๋ฐ”์ด์–ด์Šค, Linear ๊ฐ€์ค‘์น˜/๋ฐ”์ด์–ด์Šค)์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ ์ •ํ™•๋„
  • ์ปค๋„ ํ“จ์ „์„ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ตœ์ ํ™”

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์„œ๋กœ ๋‹ค๋ฅธ ๊ตฌํ˜„ ๋ฐฉ์‹(ํ“จ์ „ vs ์–ธํ“จ์ „)์ด ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์„ฑ๋Šฅ ๋ชจ๋‘์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ฃผ์˜ ๊นŠ๊ฒŒ ์‚ดํŽด๋ณด์„ธ์š”. ์ด ํ†ต์ฐฐ์€ LayerNorm + Linear๋ฅผ ๋„˜์–ด ๋‹ค์–‘ํ•œ ๋”ฅ๋Ÿฌ๋‹ ์—ฐ์‚ฐ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ํŠนํžˆ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์€ ํ•™์Šต ํšจ์œจ๊ณผ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์— ์ง์ ‘์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋ฏ€๋กœ ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

โš›๏ธ ํ“จ์ „ vs ์–ธํ“จ์ „ ์ปค๋„

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” LayerNorm๊ณผ Linear ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๊ตฌํ˜„ํ•˜๊ณ  ๋น„๊ตํ•˜๋ฉฐ, ์ปค๋„ ํ“จ์ „์˜ ์„ฑ๋Šฅ ์ด์ ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค:

  1. ์–ธํ“จ์ „ ๋ฐฉ์‹: LayerNorm๊ณผ Linear๋ฅผ ๋ณ„๋„์˜ ์—ฐ์‚ฐ์œผ๋กœ ์‹คํ–‰
  2. ํ“จ์ „ ์ปค๋„: LayerNorm๊ณผ Linear ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ GPU ์ปค๋„๋กœ ๊ฒฐํ•ฉ

์ด ๋น„๊ต๋ฅผ ํ†ตํ•ด ์ปค๋„ ํ“จ์ „์ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”
  • ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
  • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ œ๊ฑฐ

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜๋Š” ์ปค๋„ ํ“จ์ „ ๊ธฐ๋ฒ•
  • ํ“จ์ „ ์—ฐ์‚ฐ์„ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ตœ์ ํ™”
  • ์„œ๋กœ ๋‹ค๋ฅธ ์ปค๋„ ๊ตฌํ˜„์˜ ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํ‚น
  • ํ“จ์ „ ์—ฐ์‚ฐ์—์„œ์˜ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ
  • PyTorch ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํ†ตํ•ฉ

๊ฒฐํ•ฉํ•  ์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. LayerNorm: \[\Large \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

  2. Linear: \[\Large \text{Linear}(x) = Wx + b \]

ํ“จ์ „ ์—ฐ์‚ฐ์œผ๋กœ ๊ฒฐํ•ฉํ•˜๋ฉด ๋‹ค์Œ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Fused}(x) = W(\gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta) + b \]

LayerNorm ์ดํ•ดํ•˜๊ธฐ

LayerNorm์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต์„ ์•ˆ์ •ํ™”ํ•˜๊ณ  ๊ฐ€์†ํ•˜๋Š” ์ •๊ทœํ™” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ตฌ์„ฑ ์š”์†Œ์™€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

LayerNorm์ด ํ•˜๋Š” ์ผ

  1. ์ •๊ทœํ™”: LayerNorm์€ ๊ฐ ์ƒ˜ํ”Œ์˜ ํŠน์„ฑ(์€๋‹‰ ์ฐจ์›, hidden dimension) ์ „์ฒด์— ๊ฑธ์ณ ํ™œ์„ฑํ™” ๊ฐ’์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ:

    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์—์„œ ์€๋‹‰ ์ฐจ์›์— ๋Œ€ํ•œ ํ†ต๊ณ„๋Ÿ‰์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๋ฐฐ์น˜์˜ ๊ฐ ์ƒ˜ํ”Œ์€ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”๋ฉ๋‹ˆ๋‹ค
    • ๋ฐฐ์น˜ ์ฐจ์›์— ๋Œ€ํ•ด ์ •๊ทœํ™”ํ•˜๋Š” BatchNorm๊ณผ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค
  2. ํŒŒ๋ผ๋ฏธํ„ฐ:

    • \(\gamma\) (scale): ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ ํŠน์„ฑ์˜ ์ตœ์  ์Šค์ผ€์ผ์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฒกํ„ฐ
    • \(\beta\) (shift): ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ ํŠน์„ฑ์˜ ์ตœ์  ์ด๋™๋Ÿ‰์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฒกํ„ฐ
    • \(\epsilon\): 0์œผ๋กœ ๋‚˜๋ˆ„๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋ถ„์‚ฐ์— ๋”ํ•˜๋Š” ์ž‘์€ ์ƒ์ˆ˜ (1e-5)

LayerNorm์˜ ์‹ค์ œ ์—ญํ• 

LayerNorm์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์—์„œ ์—ฌ๋Ÿฌ ์ค‘์š”ํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  1. ํŠน์„ฑ ํ‘œ์ค€ํ™”:

    • ๊ฐ ํŠน์„ฑ์„ ํ‰๊ท  0, ๋ถ„์‚ฐ 1๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค
    • ๋„คํŠธ์›Œํฌ์˜ ํ•™์Šต ๊ณผ์ •์„ ๋” ์•ˆ์ •์ ์œผ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค
    • ํ•™์Šต ์ค‘ ๋ ˆ์ด์–ด ์ž…๋ ฅ์˜ ๋ถ„ํฌ๊ฐ€ ๋ณ€ํ•˜๋Š” โ€œ๋‚ด๋ถ€ ๊ณต๋ณ€๋Ÿ‰ ์ด๋™(internal covariate shift)โ€ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
  2. ๊ธฐ์šธ๊ธฐ ํ๋ฆ„:

    • ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ตํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค
    • ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค/ํญ๋ฐœ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ๋” ๋†’์€ ํ•™์Šต๋ฅ ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์–ด ํ•™์Šต ํšจ์œจ์ด ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค
  3. ์ •๊ทœํ™” ํšจ๊ณผ:

    • ์•”๋ฌต์ ์ธ ์ •๊ทœํ™” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค
    • ํŠน์„ฑ ๋ถ„ํฌ๋ฅผ ์ •๊ทœํ™”ํ•˜์—ฌ ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ์ž…๋ ฅ ๋ณ€๋™์— ๋Œ€ํ•œ ๋„คํŠธ์›Œํฌ์˜ ๊ฐ•๊ฑด์„ฑ์„ ๋†’์ž…๋‹ˆ๋‹ค
  4. ์‹œํ€€์Šค ๋ชจ๋ธ๋ง:

    • ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์—์„œ ํŠนํžˆ ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค
    • ์„œ๋กœ ๋‹ค๋ฅธ ์‹œํ€€์Šค ๊ธธ์ด์—์„œ๋„ ์ผ๊ด€๋œ ์‹ ํ˜ธ ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค๋ฅผ ๋” ์ž˜ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค
  5. ํ•™์Šต ์—ญํ•™:

    • ํ•™์Šต ์ˆ˜๋ ด์„ ๊ฐ€์†ํ•ฉ๋‹ˆ๋‹ค
    • ์„ธ๋ฐ€ํ•œ ํ•™์Šต๋ฅ  ์กฐ์ •์˜ ํ•„์š”์„ฑ์„ ์ค„์ž…๋‹ˆ๋‹ค
    • ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™”์— ๋Œ€ํ•œ ๋„คํŠธ์›Œํฌ์˜ ๋ฏผ๊ฐ๋„๋ฅผ ๋‚ฎ์ถฅ๋‹ˆ๋‹ค

์ˆ˜ํ•™์  ๊ตฌ์„ฑ ์š”์†Œ

  1. ํ‰๊ท  ๊ณ„์‚ฐ (\(\mu\)): \[\Large \mu = \frac{1}{H} \sum_{i=1}^{H} x_i \]

    • ์€๋‹‰ ์ฐจ์›(H)์— ๊ฑธ์ณ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜๋งˆ๋‹ค ๊ณ ์œ ํ•œ ํ‰๊ท ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค
  2. ๋ถ„์‚ฐ ๊ณ„์‚ฐ (\(\sigma^2\)): \[\Large \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 \]

    • ์€๋‹‰ ์ฐจ์›์— ๊ฑธ์ณ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ์ •๊ทœํ™”๋œ ๊ฐ’์˜ ์Šค์ผ€์ผ๋ง์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค
  3. ์ •๊ทœํ™”์™€ ์Šค์ผ€์ผ๋ง: \[\Large \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

    • ๋จผ์ € ์ž…๋ ฅ์„ ํ‰๊ท  0, ๋ถ„์‚ฐ 1๋กœ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค
    • ๊ทธ๋Ÿฐ ๋‹ค์Œ ํ•™์Šต ๊ฐ€๋Šฅํ•œ scale (\(\gamma\))๊ณผ shift (\(\beta\)) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
    • \(\odot\) ๊ธฐํ˜ธ๋Š” ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ(์•„๋‹ค๋งˆ๋ฅด ๊ณฑ)์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
    • ์˜ˆ๋ฅผ ๋“ค์–ด, \(\gamma = [1.2, 0.8, 1.5]\)์ด๊ณ  ์ •๊ทœํ™”๋œ ์ž…๋ ฅ์ด \([0.5, -0.3, 0.7]\)์ด๋ฉด, \(\gamma \odot x = [0.6, -0.24, 1.05]\)์ž…๋‹ˆ๋‹ค

LayerNorm์ด ์ค‘์š”ํ•œ ์ด์œ 

  1. ํ•™์Šต ์•ˆ์ •์„ฑ:

    • ํ™œ์„ฑํ™” ๊ฐ’์ด ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ์ž‘์•„์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ๋„คํŠธ์›Œํฌ ์ „์ฒด์— ๊ฑธ์ณ ์ผ๊ด€๋œ ์‹ ํ˜ธ ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค
  2. ํŠน์„ฑ ํ•™์Šต:

    • scale (\(\gamma\))๊ณผ shift (\(\beta\)) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ†ตํ•ด ์–ด๋–ค ํŠน์„ฑ์ด ์ค‘์š”ํ•œ์ง€ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
    • ํŠน์ • ํŠน์„ฑ์„ ๋ฌด์‹œํ•˜๊ฑฐ๋‚˜ ๊ฐ•์กฐํ•˜๋Š” ๊ฒƒ์„ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  3. ๋…๋ฆฝ์„ฑ:

    • BatchNorm๊ณผ ๋‹ฌ๋ฆฌ, LayerNorm์˜ ํ†ต๊ณ„๋Ÿ‰์€ ๊ฐ ์ƒ˜ํ”Œ์— ๋Œ€ํ•ด ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค
    • ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค์™€ ์ž‘์€ ๋ฐฐ์น˜ ํฌ๊ธฐ์— ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค

๊ตฌ์„ฑ

  • ๋ฐฐ์น˜ ํฌ๊ธฐ: BATCH_SIZE = 4
  • ์‹œํ€€์Šค ๊ธธ์ด: SEQ_LEN = 4
  • ์€๋‹‰ ์ฐจ์›: HIDDEN_DIM = 8
  • ์ถœ๋ ฅ ์ฐจ์›: OUTPUT_DIM = 16
  • ์—ก์‹ค๋ก : EPS = 1e-5
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32

๊ตฌํ˜„ ๋ฐฉ์‹

1. ์–ธํ“จ์ „ ๊ตฌํ˜„

์–ธํ“จ์ „ ๋ฐฉ์‹์€ ์—ฌ๋Ÿฌ ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ด์ „ ์ฑ•ํ„ฐ์—์„œ ์ž‘์„ฑํ•œ ์ปค๋„๋“ค์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ปค๋„

Puzzle 16: ํ–‰๋ ฌ ๊ณฑ์…ˆ (MatMul)์—์„œ ์‚ฌ์šฉํ•œ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ปค๋„์„ ์„ ํ˜• ๋ณ€ํ™˜์— ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์ปค๋„์€ ๋‹ค์–‘ํ•œ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค:

# Idiomatic tiled matmul from p19.mojo
fn matmul_idiomatic_tiled[
    a_layout: Layout,
    b_layout: Layout,
    out_layout: Layout,
    rows: Int,
    cols: Int,
    inner: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, a_layout, MutAnyOrigin],
    b: LayoutTensor[dtype, b_layout, MutAnyOrigin],
):
    """Idiomatic tiled matrix multiplication from p19."""
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = Int(block_idx.y * MATMUL_BLOCK_DIM_XY + local_row)
    tiled_col = Int(block_idx.x * MATMUL_BLOCK_DIM_XY + local_col)

    # Get the tile of the output matrix that this thread block is responsible for
    out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
        Int(block_idx.y), Int(block_idx.x)
    )
    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    var acc: output.element_type = 0

    comptime load_a_layout = Layout.row_major(
        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
    )  # Coalesced loading
    comptime load_b_layout = Layout.row_major(
        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
    )  # Coalesced loading

    @parameter
    for idx in range((inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY):
        # Get tiles from A and B matrices
        a_tile = a.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
            Int(block_idx.y), idx
        )
        b_tile = b.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
            idx, Int(block_idx.x)
        )

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads=MATMUL_NUM_THREADS,
            block_dim_count=MATMUL_BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads=MATMUL_NUM_THREADS,
            block_dim_count=MATMUL_BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        # Wait for all async copies to complete
        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        @parameter
        for k in range(MATMUL_BLOCK_DIM_XY):
            if (
                tiled_row < rows and tiled_col < cols
            ):  # Only perform calculation for valid outputs
                if k < a_tile.dim(
                    1
                ):  # Only perform calculation on valid inputs
                    acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result with bounds checking (needed for variable matrix sizes)
    if tiled_row < rows and tiled_col < cols:
        out_tile[local_row, local_col] = acc
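
์ปค๋„์˜ ํƒ€์ผ ๋ฃจํ”„ ๊ตฌ์กฐ(`@parameter for idx` ๋ฐ˜๋ณต๊ณผ `acc` ๋ˆ„์ )๋ฅผ ๊ฐ™์€ ํ˜•ํƒœ๋กœ ์˜ฎ๊ธด NumPy ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์‹  ๋ฐฐ์—ด ์Šฌ๋ผ์ด์‹ฑ์œผ๋กœ ํƒ€์ผ์„ ํ‘œํ˜„ํ•œ, ๊ฒ€์ฆ ๋ชฉ์ ์˜ ๊ฐ€์ƒ ์˜ˆ์‹œ์ด๋ฉฐ `tiled_matmul`์€ ์ž„์˜๋กœ ๋ถ™์ธ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

def tiled_matmul(a, b, tile=4):
    """ํƒ€์ผ ๋‹จ์œ„๋กœ ๋ถ€๋ถ„๊ณฑ์„ ๋ˆ„์ ํ•˜๋Š” ๋ฐฉ์‹ -- ์ปค๋„์˜ idx ๋ฃจํ”„์™€ ๊ฐ™์€ ๊ตฌ์กฐ."""
    rows, inner = a.shape
    inner2, cols = b.shape
    assert inner == inner2
    out = np.zeros((rows, cols), dtype=a.dtype)
    # (inner + tile - 1) // tile ๋ฒˆ ๋ฐ˜๋ณต: ์ปค๋„์˜ @parameter for idx ๋ฃจํ”„์— ๋Œ€์‘
    for idx in range((inner + tile - 1) // tile):
        k0, k1 = idx * tile, min((idx + 1) * tile, inner)
        # a_tile/b_tile์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ฆฐ ๋’ค ๋ถ€๋ถ„๊ณฑ์„ ๋ˆ„์ ํ•˜๋Š” ๋‹จ๊ณ„์— ํ•ด๋‹น
        out += a[:, k0:k1] @ b[k0:k1, :]
    return out
```

inner๊ฐ€ tile์˜ ๋ฐฐ์ˆ˜๊ฐ€ ์•„๋‹ ๋•Œ ๋งˆ์ง€๋ง‰ ๋ถ€๋ถ„ ํƒ€์ผ์ด ์ž˜๋ ค ๋“ค์–ด๊ฐ€๋Š” ๊ฒƒ์ด, ์ปค๋„์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์š”ํ•œ ์ด์œ ์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.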


์ „์น˜ ์ปค๋„

ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋Š” ์ „์น˜ ์ปค๋„์ž…๋‹ˆ๋‹ค:

fn transpose_kernel[
    layout_in: Layout,
    layout_out: Layout,
    rows: UInt,
    cols: UInt,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
):
    """Transpose matrix using shared memory tiling for coalesced access.
    We will learn more about coalesced access in the next part.
    """
    shared_tile = LayoutTensor[
        dtype,
        Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    local_row = thread_idx.y
    local_col = thread_idx.x

    global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row
    global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col

    if global_row < rows and global_col < cols:
        shared_tile[local_row, local_col] = inp[global_row, global_col]

    barrier()

    out_row = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_row
    out_col = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_col

    # Store data from shared memory to global memory (coalesced write)
    # Note: we transpose the shared memory access pattern
    if out_row < cols and out_col < rows:
        output[out_row, out_col] = shared_tile[local_col, local_row]
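
๋ธ”๋ก ์ขŒํ‘œ(`block_idx.x`, `block_idx.y`)๋ฅผ ๋งž๋ฐ”๊พธ๊ณ  ํƒ€์ผ ๋‚ด๋ถ€๋งŒ ์ „์น˜ํ•˜๋Š” ์œ„ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์€ NumPy๋กœ ๊ทธ๋Œ€๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ�˜ ์žˆ์Šต๋‹ˆ๋‹ค. `tiled_transpose`๋Š” ์„ค๋ช…์šฉ์œผ๋กœ ๊ฐ€์ •ํ•œ ํ•จ์ˆ˜๋ช…์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

def tiled_transpose(inp, tile=2):
    """๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ๋ฐ”๊พธ๊ณ  ํƒ€์ผ ๋‚ด๋ถ€๋ฅผ ์ „์น˜ํ•˜๋Š” ์ปค๋„์˜ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ ์žฌํ˜„."""
    rows, cols = inp.shape
    out = np.zeros((cols, rows), dtype=inp.dtype)
    for by in range((rows + tile - 1) // tile):
        for bx in range((cols + tile - 1) // tile):
            # ํƒ€์ผ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ทธ๋Œ€๋กœ ์ฝ์–ด ๋“ค์ด๊ณ (๋ณ‘ํ•ฉ ์ฝ๊ธฐ)
            shared = inp[by * tile:(by + 1) * tile, bx * tile:(bx + 1) * tile]
            # ๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ๋ฐ”๊พธ์–ด ์“ฐ๋ฉด์„œ ํƒ€์ผ ๋‚ด๋ถ€๋งŒ ์ „์น˜(๋ณ‘ํ•ฉ ์“ฐ๊ธฐ)
            out[bx * tile:bx * tile + shared.shape[1],
                by * tile:by * tile + shared.shape[0]] = shared.T
    return out
```

ํ–‰/์—ด์ด ํƒ€์ผ ํฌ๊ธฐ๋กœ ๋‚˜๋ˆ„์–ด๋–จ์–ด์ง€์ง€ ์•Š๋Š” ํ–‰๋ ฌ์—์„œ๋„ ๊ฒฝ๊ณ„ ํƒ€์ผ์ด ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌ๋˜๋Š”์ง€ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.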


Bias ํ•ฉ์‚ฐ ์ปค๋„

Bias ํ•ญ์„ ๋”ํ•˜๋Š” ๊ฐ„๋‹จํ•œ ์š”์†Œ๋ณ„ ํ•ฉ์‚ฐ ์ปค๋„์ž…๋‹ˆ๋‹ค:

fn add_bias_kernel[
    input_layout: Layout,
    bias_layout: Layout,
    output_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    output_dim: Int,
](
    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, MutAnyOrigin],
    bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
):
    """Simple bias addition."""
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)
    out_idx = Int(thread_idx.x)

    if batch_idx >= batch_size or seq_idx >= seq_len or out_idx >= output_dim:
        return

    output[batch_idx, seq_idx, out_idx] = input[
        batch_idx, seq_idx, out_idx
    ] + rebind[Scalar[dtype]](bias[out_idx])


LayerNorm ์ปค๋„

์ด์ œ ์ด ์ปค๋„์„ ์™„์„ฑํ•˜์—ฌ LayerNorm ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  1. ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์— ๋Œ€ํ•œ ํ‰๊ท  \(\mu\)๊ณผ ๋ถ„์‚ฐ \(\sigma^2\) ๊ณ„์‚ฐ
  2. ์ด ํ†ต๊ณ„๋Ÿ‰์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž…๋ ฅ ์ •๊ทœํ™”
  3. ์Šค์ผ€์ผ \(\gamma\)๊ณผ ์‹œํ”„ํŠธ \(\beta\) ํŒŒ๋ผ๋ฏธํ„ฐ ์ ์šฉ
fn layernorm_kernel[
    input_layout: Layout,
    ln_params_layout: Layout,
    output_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
](
    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
):
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)
    hidden_idx = Int(thread_idx.x)

    if (
        batch_idx >= batch_size
        or seq_idx >= seq_len
        or hidden_idx >= hidden_dim
    ):
        return

    # Compute statistics for this sequence position (redundant but simple)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    # FILL ME IN (roughly 11 lines)


๊ตฌํ˜„ ๋‹จ๊ณ„:

  1. ๋จผ์ €, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  2. ๊ทธ๋Ÿฐ ๋‹ค์Œ, ์ด ํ†ต๊ณ„๋Ÿ‰์œผ๋กœ ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค
  3. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์Šค์ผ€์ผ๊ณผ ์‹œํ”„ํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค

์–ธํ“จ์ „ ๋ฐฉ์‹์˜ ํŠน์„ฑ:

  • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰ (LayerNorm โ†’ MatMul โ†’ Bias)
  • ์—ฐ์‚ฐ ๊ฐ„ ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น
  • ๋ณ„๋„์˜ ํŒจ์Šค๋กœ ์ธํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ฆ๊ฐ€
  • ๊ด€์‹ฌ์‚ฌ ๋ถ„๋ฆฌ๊ฐ€ ๋ช…ํ™•ํ•œ ๊ฐ„๊ฒฐํ•œ ๊ตฌํ˜„
  • ๊ฐ ์—ฐ์‚ฐ์ด ๊ฒฉ๋ฆฌ๋˜์–ด ๋””๋ฒ„๊น…์ด ์šฉ์ด
ํŒ
  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก ์‚ฌ์šฉ (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์€๋‹‰ ์ฐจ์› ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌ
    • ์‹œํ€€์Šค๋‹น ํ†ต๊ณ„๋Ÿ‰์„ ํ•œ ๋ฒˆ๋งŒ ๊ณ„์‚ฐํ•˜์—ฌ ์ค‘๋ณต ์—ฐ์‚ฐ ๋ฐฉ์ง€
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ ํ…์„œ: [batch_idx, seq_idx, hidden_idx]๋กœ ์ ‘๊ทผ
    • ์ถœ๋ ฅ ํ…์„œ: [batch_idx, seq_idx, hidden_idx]๋กœ ์ ‘๊ทผ
    • LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ: [hidden_idx]๋กœ ์ ‘๊ทผ
  3. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ:

    • ์ œ๊ณฑ๊ทผ์„ ์ทจํ•˜๊ธฐ ์ „์— ์—ก์‹ค๋ก (1e-5)์„ ๋”ํ•ฉ๋‹ˆ๋‹ค
    • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ…์„ ์œ„ํ•ด rebind[Scalar[dtype]] ์‚ฌ์šฉ
    • ๋ถ„์‚ฐ์€ (sq_sum / hidden_dim) - (mean * mean)์œผ๋กœ ๊ณ„์‚ฐ
  4. ์„ฑ๋Šฅ:

    • ํ•œ ๋ฒˆ์˜ ํŒจ์Šค๋กœ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๋™์‹œ์— ๊ณ„์‚ฐ
    • ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋Ÿ‰์„ ์‹œํ€€์Šค ๋‚ด ๋ชจ๋“  ์š”์†Œ์— ์žฌ์‚ฌ์šฉ
    • ๋ถˆํ•„์š”ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๋ฐฉ์ง€
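
ํŒ 3๊ณผ 4์—์„œ ๋งํ•˜๋Š” "ํ•œ ๋ฒˆ์˜ ํŒจ์Šค" ๋ถ„์‚ฐ ๊ณ„์‚ฐ์€ ํ•ญ๋“ฑ์‹ \(\text{Var}(x) = E[x^2] - E[x]^2\)์— ๊ธฐ๋Œˆ๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ด ํ•ญ๋“ฑ์‹์„ ์ˆซ์ž๋กœ ํ™•์ธํ•˜๋Š” ์ž‘์€ Python ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

# ๋‘ ๋ฐฉ์‹์ด ์ผ์น˜ํ•จ์„ ํ™•์ธ (๋‹จ, float32์—์„œ ํ‰๊ท ์ด ๋งค์šฐ ํฌ๋ฉด
# E[x^2] - mu^2 ๋ฐฉ์‹์€ ์ •๋ฐ€๋„๊ฐ€ ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ์Œ์— ์œ ์˜)
x = np.array([0.5, -0.3, 0.7, 1.1])
H = x.size
sum_val = x.sum()
sq_sum = (x * x).sum()
mean = sum_val / H
var_one_pass = sq_sum / H - mean * mean      # ์ปค๋„์ด ์“ฐ๋Š” ํ•œ-ํŒจ์Šค ๊ณต์‹
var_two_pass = ((x - mean) ** 2).mean()      # ์ •์˜ ๊ทธ๋Œ€๋กœ์˜ ๋‘-ํŒจ์Šค ๊ณต์‹
assert np.isclose(var_one_pass, var_two_pass)
```

๋•๋ถ„์— ์ปค๋„์€ ์ž…๋ ฅ์„ ํ•œ ๋ฒˆ๋งŒ ์ˆœํšŒํ•˜๋ฉด์„œ sum_val๊ณผ sq_sum์„ ๋™์‹œ์— ๋ชจ์œผ๋Š” ๊ฒƒ์œผ๋กœ ์ถฉ๋ถ„ํ•ฉ๋‹ˆ๋‹ค.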

์ฝ”๋“œ ์‹คํ–‰

์–ธํ“จ์ „ ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p22 --unfused
pixi run -e amd p22 --unfused
uv run poe p22 --unfused

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Testing with dimensions: [4, 4, 8] -> [4, 4, 16]
โœ… Loaded Mojo operations library
============================================================
   Puzzle 22: UNFUSED Algorithm Test & Benchmark
============================================================

๐Ÿงช Correctness Testing for UNFUSED Algorithm
====================================================

Testing Reference PyTorch Implementation
-----------------------------------------------
โœ… Reference PyTorch
   Max difference: 0.00e+00
   Result: โœ… CORRECT

Testing CPU Implementation
---------------------------------
โœ… Using Mojo fused kernel (CPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Testing GPU Unfused Implementation
-----------------------------------------
โœ… Using Mojo unfused kernel (GPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Correctness Summary:
   - Reference:   โœ… CORRECT
   - CPU:         โœ… CORRECT
   - GPU unfused: โœ… CORRECT

   Overall Correctness: โœ… ALL CORRECT

Benchmarking CPU vs GPU UNFUSED
------------------------------------------
   Testing CPU performance...
   CPU: 3173.70ms (50 iterations)
   Testing GPU unfused performance...
   GPU unfused: 3183.57ms (50 iterations)

   GPU unfused vs CPU: 1.00x slower
   CPU wins (GPU overhead > computation benefit)

UNFUSED Algorithm Test Completed!

์†”๋ฃจ์…˜

fn layernorm_kernel[
    input_layout: Layout,
    ln_params_layout: Layout,
    output_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
):
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)
    hidden_idx = Int(thread_idx.x)

    if (
        batch_idx >= batch_size
        or seq_idx >= seq_len
        or hidden_idx >= hidden_dim
    ):
        return

    # Compute statistics for this sequence position (redundant but simple)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    @parameter
    for h in range(hidden_dim):
        val = input[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)

    mean_val = sum_val / hidden_dim
    var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
    inv_std = 1.0 / sqrt(var_val + 1e-5)

    # Apply LayerNorm to this element
    input_val = input[batch_idx, seq_idx, hidden_idx]
    normalized = (input_val - mean_val) * inv_std * rebind[Scalar[dtype]](
        ln_weight[hidden_idx]
    ) + rebind[Scalar[dtype]](ln_bias[hidden_idx])
    output[batch_idx, seq_idx, hidden_idx] = normalized


์–ธํ“จ์ „ ๊ตฌํ˜„์€ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ํ…์„œ์˜ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์ง๊ด€์ ์ธ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ํ•˜๋‚˜์”ฉ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ์™€ ๋ธ”๋ก ๊ตฌ์„ฑ:

    batch_idx = block_idx.x
    seq_idx = block_idx.y
    hidden_idx = thread_idx.x
    
    • ๊ฐ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ๋ฐฐ์น˜ ๋‚ด ํ•˜๋‚˜์˜ ์‹œํ€€์Šค ์œ„์น˜๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

    • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: [batch_size, seq_len]

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์€๋‹‰ ์ฐจ์›์˜ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

    • ์ธ๋ฑ์Šค๊ฐ€ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋ฉด ์กฐ๊ธฐ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

      if (batch_idx >= batch_size or seq_idx >= seq_len or hidden_idx >= hidden_dim):
          return
      
  2. ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ:

    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0
    
    @parameter
    for h in range(hidden_dim):
        val = input[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)
    
    • ํ•œ ๋ฒˆ์˜ ํŒจ์Šค๋กœ ํ•ฉ๊ณ„์™€ ์ œ๊ณฑํ•ฉ์„ ๋™์‹œ์— ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

    • ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ๋ฅผ ์œ„ํ•ด @parameter๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค

    • rebind[Scalar[dtype]]๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

    • ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

      mean_val = sum_val / hidden_dim
      var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
      inv_std = 1.0 / sqrt(var_val + 1e-5)
      
  3. ์ •๊ทœํ™”์™€ ์Šค์ผ€์ผ๋ง:

    input_val = input[batch_idx, seq_idx, hidden_idx]
    normalized = (input_val - mean_val) * inv_std * rebind[Scalar[dtype]](
        ln_weight[hidden_idx]
    ) + rebind[Scalar[dtype]](ln_bias[hidden_idx])
    output[batch_idx, seq_idx, hidden_idx] = normalized
    
    • ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{normalized} = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]
    • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ฮณ (ln_weight)๋กœ ์Šค์ผ€์ผ๋งํ•ฉ๋‹ˆ๋‹ค
    • ํ•™์Šต ๊ฐ€๋Šฅํ•œ bias ฮฒ (ln_bias)๋ฅผ ๋”ํ•ฉ๋‹ˆ๋‹ค
    • ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ํ…์„œ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  4. ์„ฑ๋Šฅ ํŠน์„ฑ:

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ํ†ต๊ณ„๋Ÿ‰์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์—†์Œ (๊ฐ„๋‹จํ•˜์ง€๋งŒ ๋œ ํšจ์œจ์ )
    • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:
      • ์ž…๋ ฅ: [batch_idx, seq_idx, h]
      • ์ถœ๋ ฅ: [batch_idx, seq_idx, hidden_idx]
      • ํŒŒ๋ผ๋ฏธํ„ฐ: [hidden_idx]
    • ๋‹ค์Œ์„ ํ†ตํ•ด ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:
      • ์ œ๊ณฑ๊ทผ ์ „์— ์—ก์‹ค๋ก (1e-5) ์ถ”๊ฐ€
      • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ
      • ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ๋ฐฉ์‹์œผ๋กœ ๋ถ„์‚ฐ ๊ณ„์‚ฐ
  5. ๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ:

    • ํƒ€์ž… ์•ˆ์ „์„ฑ:

      • ์ค‘๊ฐ„ ๊ณ„์‚ฐ์— Scalar[dtype] ์‚ฌ์šฉ
      • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ…์„ ์œ„ํ•ด rebind[Scalar[dtype]] ์‚ฌ์šฉ
      • ์ผ๊ด€๋œ ๋ถ€๋™์†Œ์ˆ˜์  ์ •๋ฐ€๋„ ๋ณด์žฅ
    • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

      • ์ž…๋ ฅ ํ…์„œ์—์„œ ๋ณ‘ํ•ฉ ์ฝ๊ธฐ
      • ์ถœ๋ ฅ ํ…์„œ์— ๋ณ‘ํ•ฉ ์“ฐ๊ธฐ
      • LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ์— ์ˆœ์ฐจ์  ์ ‘๊ทผ
    • ์—ฐ์‚ฐ ํ๋ฆ„:

      • ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ: \[\Large O(H) \text{ operations per thread} \]
      • ์ •๊ทœํ™”: \[\Large O(1) \text{ operations per thread} \]
      • ์ „์ฒด ๋ณต์žก๋„: \[\Large O(H) \text{ per output element} \]
    • ํ•œ๊ณ„์ :

      • ํ†ต๊ณ„๋Ÿ‰์˜ ์ค‘๋ณต ๊ณ„์‚ฐ
      • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—†์Œ
      • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰
      • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰ ํ•„์š”

์ด ๊ตฌํ˜„์€ ์ •ํ™•ํ•˜์ง€๋งŒ ์„ฑ๋Šฅ ๋ฉด์—์„œ ์ตœ์ ์ด ์•„๋‹ˆ๋ฉฐ, ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ์—์„œ CPU ๋ฒ„์ „๋ณด๋‹ค ์•ฝ๊ฐ„ ๋А๋ฆฐ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ“จ์ „ ๊ตฌํ˜„์—์„œ๋Š” ๋‹ค์Œ์„ ํ†ตํ•ด ์ด๋Ÿฌํ•œ ์„ฑ๋Šฅ ํ•œ๊ณ„๋ฅผ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค:

  • ์‹œํ€€์Šค๋‹น ํ†ต๊ณ„๋Ÿ‰์„ ํ•œ ๋ฒˆ๋งŒ ๊ณ„์‚ฐ
  • ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ
  • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ์ œ๊ฑฐ

2. ํ“จ์ „ ์ปค๋„ ๊ตฌํ˜„

ํ“จ์ „ ์ปค๋„์€ LayerNorm๊ณผ Linear ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ GPU ์ปค๋„๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

fn minimal_fused_kernel[
    input_layout: Layout,
    ln_params_layout: Layout,
    weight_layout: Layout,
    bias_layout: Layout,
    output_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
](
    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
    linear_bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
):
    """Minimal fused kernel - one thread per sequence position to avoid redundancy.
    """
    # Grid: (batch_size, seq_len) - one thread block per sequence position
    # Block: (1,) - single thread per sequence position to avoid redundant computation
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    # Step 1: Compute LayerNorm statistics once per sequence position

    # FILL IN roughly 10 lines

    # Step 2: Compute all outputs for this sequence position

    # FILL IN roughly 10 lines


ํ•ต์‹ฌ ์ตœ์ ํ™”:

  • ๋‘ ๋ฒˆ ๋Œ€์‹  ํ•œ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰
  • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
ํŒ
  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ์ค‘๋ณต์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์˜ ๋ชจ๋“  ์ถœ๋ ฅ์„ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ณ„์‚ฐ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ ํ…์„œ: [batch_idx, seq_idx, h]๋กœ ์ ‘๊ทผ
    • ์ถœ๋ ฅ ํ…์„œ: [batch_idx, seq_idx, out_idx]๋กœ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜: ์„ ํ˜• ๋ ˆ์ด์–ด์—์„œ [out_idx, h]๋กœ ์ ‘๊ทผ
  3. ์—ฐ์‚ฐ ํ๋ฆ„:

    • ์‹œํ€€์Šค๋‹น LayerNorm ํ†ต๊ณ„๋Ÿ‰์„ ํ•œ ๋ฒˆ๋งŒ ๊ณ„์‚ฐ
    • ๋ชจ๋“  ์ถœ๋ ฅ ์ฐจ์›์— ์ •๊ทœํ™”๋œ ๊ฐ’์„ ์žฌ์‚ฌ์šฉ
    • ์ •๊ทœํ™”์™€ ์„ ํ˜• ๋ณ€ํ™˜์„ ๊ฒฐํ•ฉ
  4. ์„ฑ๋Šฅ:

    • ํ†ต๊ณ„๋Ÿ‰์˜ ์ค‘๋ณต ๊ณ„์‚ฐ ๋ฐฉ์ง€
    • ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • rebind[Scalar[dtype]]๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ

์ฝ”๋“œ ์‹คํ–‰

ํ“จ์ „ ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p22 --fused
pixi run -e amd p22 --fused
uv run poe p22 --fused

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Testing with dimensions: [4, 4, 8] -> [4, 4, 16]
โœ… Loaded Mojo operations library
============================================================
   Puzzle 22: FUSED Algorithm Test & Benchmark
============================================================

๐Ÿงช Correctness Testing for FUSED Algorithm
==================================================

Testing Reference PyTorch Implementation
-----------------------------------------------
โœ… Reference PyTorch
   Max difference: 0.00e+00
   Result: โœ… CORRECT

Testing CPU Implementation
---------------------------------
โœ… Using Mojo fused kernel (CPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Testing GPU Fused Implementation
---------------------------------------
โœ… Using Mojo fused kernel (GPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Correctness Summary:
   - Reference:   โœ… CORRECT
   - CPU:         โœ… CORRECT
   - GPU fused: โœ… CORRECT

   Overall Correctness: โœ… ALL CORRECT

โšก Benchmarking CPU vs GPU FUSED
----------------------------------------
   Testing CPU performance...
   CPU: 3144.75ms (50 iterations)
   Testing GPU fused performance...
   GPU fused: 3116.11ms (50 iterations)

   GPU fused vs CPU: 1.01x faster
   GPU fused wins!

FUSED Algorithm Test Completed!

์†”๋ฃจ์…˜

fn minimal_fused_kernel[
    input_layout: Layout,
    ln_params_layout: Layout,
    weight_layout: Layout,
    bias_layout: Layout,
    output_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, output_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
    linear_bias: LayoutTensor[dtype, bias_layout, ImmutAnyOrigin],
):
    """Minimal fused kernel - one thread per sequence position to avoid redundancy.
    """
    # Grid: (batch_size, seq_len) - one thread block per sequence position
    # Block: (1,) - single thread per sequence position to avoid redundant computation
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    # Step 1: Compute LayerNorm statistics once per sequence position
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    @parameter
    for h in range(hidden_dim):
        val = input[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)

    mean_val = sum_val / hidden_dim
    var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
    inv_std = 1.0 / sqrt(var_val + 1e-5)

    # Step 2: Compute all outputs for this sequence position
    @parameter
    for out_idx in range(output_dim):
        var acc: Scalar[dtype] = 0

        @parameter
        for h in range(hidden_dim):
            input_val = input[batch_idx, seq_idx, h]
            normalized = (input_val - mean_val) * inv_std * rebind[
                Scalar[dtype]
            ](ln_weight[h]) + rebind[Scalar[dtype]](ln_bias[h])
            acc += rebind[Scalar[dtype]](normalized * linear_weight[out_idx, h])

        output[batch_idx, seq_idx, out_idx] = acc + rebind[Scalar[dtype]](
            linear_bias[out_idx]
        )


ํ“จ์ „ ๊ตฌํ˜„์€ ์—ฐ์‚ฐ๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค: batch_idx = block_idx.x, seq_idx = block_idx.y
  2. LayerNorm ๋‹จ๊ณ„:

    • ์‹œํ€€์Šค ์œ„์น˜์— ๋Œ€ํ•œ ํ•ฉ๊ณ„์™€ ์ œ๊ณฑํ•ฉ ๊ณ„์‚ฐ
    • ํ‰๊ท  ๊ณ„์‚ฐ: \[\Large \mu = \frac{1}{H} \sum_{i=1}^{H} x_i \]
    • ๋ถ„์‚ฐ ๊ณ„์‚ฐ: \[\Large \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 \]
    • ์—ญํ‘œ์ค€ํŽธ์ฐจ ๊ณ„์‚ฐ: \[\Large \text{inv\_std} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \]
  3. Linear ๋‹จ๊ณ„:

    • ๊ฐ ์ถœ๋ ฅ ์ฐจ์›์— ๋Œ€ํ•ด:
      • ์ •๊ทœํ™”๋œ ๊ฐ’ ๊ณ„์‚ฐ: \[\Large \text{normalized} = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]
      • ์„ ํ˜• ๊ฐ€์ค‘์น˜์™€ ๊ณฑํ•˜๊ณ  ๋ˆ„์ : \[\Large \text{acc} = \sum_{h=1}^{H} \text{normalized}_h \cdot W_{out,h} \]
      • ์„ ํ˜• bias ์ถ”๊ฐ€: \[\Large \text{output} = \text{acc} + b_{out} \]
    • ๊ฒฐ๊ณผ๋ฅผ output[batch_idx, seq_idx, out_idx]์— ์ €์žฅ
  4. ์„ฑ๋Šฅ ์ตœ์ ํ™”:

    • ๋‘ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋Ÿ‰ ์žฌ์‚ฌ์šฉ
    • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
    • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ด ๊ตฌํ˜„์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์—ฌ ์–ธํ“จ์ „ ๋ฒ„์ „๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

์ปค๋„ ํ“จ์ „์˜ ์žฅ์ 

์ด ํผ์ฆ์—์„œ LayerNorm + Linear ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฐฉ์‹์„ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค:

  1. ์–ธํ“จ์ „ ๊ตฌํ˜„:

    • LayerNorm๊ณผ Linear๋ฅผ ๋ณ„๋„์˜ ์ปค๋„๋กœ ์‹คํ–‰
    • ๊ตฌํ˜„์ด ๊ฐ„๋‹จํ•˜์ง€๋งŒ ๋œ ํšจ์œจ์ 
    • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰
    • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰
    • ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ: 3183.57ms (GPU)
  2. ํ“จ์ „ ๊ตฌํ˜„:

    • ๋‘ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•œ ๋‹จ์ผ ์ปค๋„
    • ๋” ๋ณต์žกํ•˜์ง€๋งŒ ํ›จ์”ฌ ํšจ์œจ์ 
    • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
    • ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ: 3116.11ms (GPU)

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ œ๊ฑฐ:

    • ์—ฐ์‚ฐ ๊ฐ„ ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ/์“ฐ๊ธฐ ๊ฐ์†Œ
    • ์„ ํ˜• ๋ณ€ํ™˜์„ ์œ„ํ•œ ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
    • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ ˆ๊ฐ๋ฅ : \[\Large \text{reduction} = \frac{\text{unfused_bandwidth} - \text{fused_bandwidth}}{\text{unfused_bandwidth}}\]
  2. ์บ์‹œ ํšจ์œจ:

    • L1/L2 ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
    • ์บ์‹œ ๋ฏธ์Šค ๊ฐ์†Œ
    • ๊ฐœ์„ ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
    • ๋” ๋†’์€ ์‚ฐ์ˆ  ๊ฐ•๋„

์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ

  1. ์ปค๋„ ์‹คํ–‰ ์ตœ์ ํ™”:

    • ์—ฌ๋Ÿฌ ๋ฒˆ ๋Œ€์‹  ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๋“œ๋ผ์ด๋ฒ„ ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ
    • ๋™๊ธฐํ™” ์ง€์  ๊ฐ์†Œ
    • ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ํšŸ์ˆ˜ ๊ฐ์†Œ
  2. ๋ฆฌ์†Œ์Šค ๊ด€๋ฆฌ:

    • ์—ฐ์‚ฐ ๊ฐ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์žฌ์‚ฌ์šฉ
    • ๋ ˆ์ง€์Šคํ„ฐ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
    • ์Šค๋ ˆ๋“œ ์ ์œ ์œจ ๊ฐœ์„ 
    • GPU ํ™œ์šฉ๋ฅ  ํ–ฅ์ƒ

์„ฑ๋Šฅ ํŠน์„ฑ

  1. ํ™•์žฅ์„ฑ:

    • ์ž…๋ ฅ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ํ™•์žฅ์„ฑ ํ–ฅ์ƒ
    • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ๋ณ‘๋ชฉ ๊ฐ์†Œ
    • GPU ๋ฆฌ์†Œ์Šค์˜ ๋” ํšจ์œจ์ ์ธ ํ™œ์šฉ
    • ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์—์„œ ์ฒ˜๋ฆฌ๋Ÿ‰ ํ–ฅ์ƒ
  2. ์ˆ˜์น˜์  ํšจ์œจ:

    • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
    • ๋ฐ˜์˜ฌ๋ฆผ ์˜ค์ฐจ ๊ฐ์†Œ
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ์˜ ์ •๋ฐ€๋„ ํ–ฅ์ƒ
    • ์ตœ์ ํ™”๋œ ์—ฐ์‚ฐ ์ˆœ์„œ

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ปค๋„ ํ“จ์ „์€ ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์˜ LayerNorm + Linear์ฒ˜๋Ÿผ ์‹ ๊ฒฝ๋ง์—์„œ ์ž์ฃผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๋Š” ์—ฐ์‚ฐ์— ํŠนํžˆ ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ ํฌ๊ธฐ๊ฐ€ ํฌ๊ณ  ๋ชจ๋ธ์ด ๋ณต์žกํ• ์ˆ˜๋ก ์„ฑ๋Šฅ ์ด์ ์€ ๋”์šฑ ์ปค์ง‘๋‹ˆ๋‹ค.

โ›“๏ธ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ํ“จ์ „ LayerNorm + Linear ์—ฐ์‚ฐ์˜ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค(backward pass) ๊ตฌํ˜„์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” ๋‹ค์Œ์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

  • ์ž…๋ ฅ ํ…์„œ
  • LayerNorm ์Šค์ผ€์ผ (\(\gamma\))๊ณผ ์‹œํ”„ํŠธ (\(\beta\)) ํŒŒ๋ผ๋ฏธํ„ฐ
  • Linear ๋ ˆ์ด์–ด์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ๊ณผ bias

๊ตฌํ˜„ํ•  ์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค (์œ ๋„ ๊ณผ์ •์˜ ์ƒ์„ธ ๋‚ด์šฉ์€ LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์˜ ์ƒ์„ธ ์œ ๋„ ์ฐธ์กฐ): \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \gamma \odot \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)}) \]

  2. Linear ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค: \[\Large \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}x^T \] \[\Large \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \] \[\Large \frac{\partial L}{\partial x} = W^T\frac{\partial L}{\partial y} \]

  3. ํ“จ์ „ ์—ฐ์‚ฐ์˜ ์—ฐ์‡„ ๋ฒ•์น™: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y_{linear}} \frac{\partial y_{linear}}{\partial y_{norm}} \frac{\partial y_{norm}}{\partial x} \] ์—ฌ๊ธฐ์„œ:

  • \(y_{norm}\)์€ LayerNorm ์ถœ๋ ฅ
  • \(y_{linear}\)์€ Linear ๋ ˆ์ด์–ด ์ถœ๋ ฅ
  • ์—ฐ์‡„ ๋ฒ•์น™์ด ๋‘ ์—ฐ์‚ฐ์„ ํ†ตํ•œ ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์„ ๋ณด์žฅ

ํ•ต์‹ฌ ๊ฐœ๋…

  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ์ค‘๋ณต์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์˜ ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ณ„์‚ฐ
    • ์›์ž์  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ๋ณด์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ ํ…์„œ: [batch_idx, seq_idx, h]๋กœ ์ ‘๊ทผ
    • ์ถœ๋ ฅ ํ…์„œ: [batch_idx, seq_idx, out_idx]๋กœ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜: ์„ ํ˜• ๋ ˆ์ด์–ด์—์„œ [out_idx, h]๋กœ ์ ‘๊ทผ
    • ์›์ž์  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ๋ณด์žฅ
    • ์ž์ฃผ ์ ‘๊ทผํ•˜๋Š” ๋ฐ์ดํ„ฐ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
  • ์—ฐ์‚ฐ ํ๋ฆ„:

    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆœ์„œ๋กœ LayerNorm ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
    • ๋ชจ๋“  ์ถœ๋ ฅ ์ฐจ์›์— ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
    • ์ •๊ทœํ™”์™€ ์„ ํ˜• ๋ณ€ํ™˜ ๊ฒฐํ•ฉ
    • ์ „์ฒด ๊ณผ์ •์—์„œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
    • ์—ฃ์ง€ ์ผ€์ด์Šค๋ฅผ ์ ์ ˆํžˆ ์ฒ˜๋ฆฌ
  • ์„ฑ๋Šฅ:

    • ํ†ต๊ณ„๋Ÿ‰์˜ ์ค‘๋ณต ๊ณ„์‚ฐ ๋ฐฉ์ง€
    • ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • rebind[Scalar[dtype]]๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ
    • ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ๋ณด์žฅ
    • ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์— ์ตœ์ ํ™”

๊ตฌ์„ฑ

  • ๋ฐฐ์น˜ ํฌ๊ธฐ: BATCH_SIZE = 4
  • ์‹œํ€€์Šค ๊ธธ์ด: SEQ_LEN = 4
  • ์€๋‹‰ ์ฐจ์›: HIDDEN_DIM = 8
  • ์ถœ๋ ฅ ์ฐจ์›: OUTPUT_DIM = 16
  • ์—ก์‹ค๋ก : EPS = 1e-5
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32

๊ตฌํ˜„ (๊ณ ๊ธ‰)

ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์ปค๋„์€ LayerNorm๊ณผ Linear์˜ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ GPU ์ปค๋„๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ์‹ ์ค‘ํ•˜๊ฒŒ ๋‹ค๋ค„์•ผ ํ•˜๋Š” ๋„์ „์ ์ธ ๊ณผ์ œ์ž…๋‹ˆ๋‹ค:

  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์—์„œ์˜ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ
  • ํšจ์œจ์ ์ธ GPU ํ™œ์šฉ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์—ฐ์‚ฐ ๊ฐ„ ์ ์ ˆํ•œ ๋™๊ธฐํ™”

fn minimal_fused_kernel_backward[
    grad_output_layout: Layout,
    input_layout: Layout,
    ln_params_layout: Layout,
    weight_layout: Layout,
    grad_input_layout: Layout,
    grad_ln_weight_layout: Layout,
    grad_ln_bias_layout: Layout,
    grad_weight_layout: Layout,
    grad_bias_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
](
    grad_input: LayoutTensor[dtype, grad_input_layout, MutAnyOrigin],
    grad_ln_weight: LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin],
    grad_ln_bias: LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin],
    grad_weight: LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin],
    grad_bias: LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin],
    grad_output: LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
):
    """Fused backward kernel using atomic operations for safe gradient accumulation.
    """
    # Grid: (batch_size, seq_len) - one thread per sequence position
    # Block: (1,) - single thread per sequence position
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    # Initialize gradient tensors to zero (block 0,0 only to avoid UB with atomic ops)
    if batch_idx == 0 and seq_idx == 0:
        # Initialize grad_ln_weight and grad_ln_bias
        @parameter
        for h in range(hidden_dim):
            (grad_ln_weight.ptr + h).init_pointee_copy(0)
            (grad_ln_bias.ptr + h).init_pointee_copy(0)

        # Initialize grad_weight and grad_bias
        @parameter
        for out_idx in range(output_dim):
            (grad_bias.ptr + out_idx).init_pointee_copy(0)

            @parameter
            for h in range(hidden_dim):
                (grad_weight.ptr + out_idx * hidden_dim + h).init_pointee_copy(
                    0
                )

    # Note: We cannot use barrier() here as it only synchronizes within a block.
    # The atomic operations will handle synchronization across blocks.

    # Step 1: Recompute forward pass statistics (needed for gradients)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    # FILL IN roughly 8 lines

    # Step 2: Atomically accumulate gradients w.r.t. linear bias

    # FILL IN roughly 4 lines

    # Step 3: Atomically accumulate gradients w.r.t. linear weight
    # Make sure to use the correct atomic operation to avoid race conditions

    # FILL IN roughly 10 lines

    # Step 4: Atomically accumulate gradients w.r.t. LayerNorm parameters

    # FILL IN roughly 10 lines

    # Step 5: Compute gradients w.r.t. input (LayerNorm backward)
    # Compute sum terms needed for LayerNorm backward
    # Make sure to use the correct atomic operation to avoid race conditions

    # FILL IN roughly 12 lines

    # Compute actual input gradients (no race conditions here - each thread writes to different positions)

    # FILL IN roughly 10 lines


ํ•ต์‹ฌ ์ตœ์ ํ™”:

  • ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
  • ์•ˆ์ „ํ•œ ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
  • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
ํŒ
  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก
    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ณ„์‚ฐ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ/์ถœ๋ ฅ ํ…์„œ์— ๋Œ€ํ•œ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— ๋Œ€ํ•œ stride ์ ‘๊ทผ
    • ์›์ž์  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ์ •๋ ฌ
  3. ์—ฐ์‚ฐ ํ๋ฆ„:

    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆœ์„œ๋กœ ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
    • ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
    • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
  4. ์„ฑ๋Šฅ:

    • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ
    • ์ ์ ˆํ•œ ์ •๋ ฌ ๋ณด์žฅ

์ฝ”๋“œ ์‹คํ–‰

ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p22 --backward
pixi run -e amd p22 --backward
uv run poe p22 --backward

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Testing with dimensions: [4, 4, 8] -> [4, 4, 16]
โœ… Loaded Mojo operations library
============================================================
           Comprehensive Backward Pass Test
           Testing Custom LayerNorm + Linear Gradients
============================================================
Testing with dimensions: [4, 4, 8] -> [4, 4, 16]

Testing CPU Backward Pass:

Testing CPU Backward Implementation - Backward Pass
---------------------------------------------------------
   Computing PyTorch autograd reference...
   Computing Mojo backward implementation (CPU)...
โœ… CPU Backward Implementation backward completed
   Forward max difference: 1.49e-08
   grad_input: 2.98e-08 โœ…
   grad_ln_weight: 5.96e-08 โœ…
   grad_ln_bias: 2.38e-07 โœ…
   grad_linear_weight: 9.54e-07 โœ…
   grad_linear_bias: 0.00e+00 โœ…

   Forward pass: โœ… CORRECT
   Gradients:    โœ… CORRECT
   Overall:      โœ… CORRECT

Testing GPU Backward Pass:

Testing GPU Backward Implementation - Backward Pass
---------------------------------------------------------
   Computing PyTorch autograd reference...
   Computing Mojo backward implementation (GPU)...

โœ… GPU Backward Implementation backward completed
   Forward max difference: 1.86e-08
   grad_input: 4.47e-08 โœ…
   grad_ln_weight: 5.96e-08 โœ…
   grad_ln_bias: 3.58e-07 โœ…
   grad_linear_weight: 9.54e-07 โœ…
   grad_linear_bias: 0.00e+00 โœ…

   Forward pass: โœ… CORRECT
   Gradients:    โœ… CORRECT
   Overall:      โœ… CORRECT

Backward Pass Test Summary:
   - CPU Backward:  โœ… CORRECT
   - GPU Backward:  โœ… CORRECT

   Overall Result: โœ… ALL CORRECT

BACKWARD PASS Test Completed!

์†”๋ฃจ์…˜

fn minimal_fused_kernel_backward[
    grad_output_layout: Layout,
    input_layout: Layout,
    ln_params_layout: Layout,
    weight_layout: Layout,
    grad_input_layout: Layout,
    grad_ln_weight_layout: Layout,
    grad_ln_bias_layout: Layout,
    grad_weight_layout: Layout,
    grad_bias_layout: Layout,
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
    dtype: DType = DType.float32,
](
    grad_input: LayoutTensor[dtype, grad_input_layout, MutAnyOrigin],
    grad_ln_weight: LayoutTensor[dtype, grad_ln_weight_layout, MutAnyOrigin],
    grad_ln_bias: LayoutTensor[dtype, grad_ln_bias_layout, MutAnyOrigin],
    grad_weight: LayoutTensor[dtype, grad_weight_layout, MutAnyOrigin],
    grad_bias: LayoutTensor[dtype, grad_bias_layout, MutAnyOrigin],
    grad_output: LayoutTensor[dtype, grad_output_layout, ImmutAnyOrigin],
    input: LayoutTensor[dtype, input_layout, ImmutAnyOrigin],
    ln_weight: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    ln_bias: LayoutTensor[dtype, ln_params_layout, ImmutAnyOrigin],
    linear_weight: LayoutTensor[dtype, weight_layout, ImmutAnyOrigin],
):
    """Fused backward kernel using atomic operations for safe gradient accumulation.
    """
    # Grid: (batch_size, seq_len) - one thread per sequence position
    # Block: (1,) - single thread per sequence position
    batch_idx = Int(block_idx.x)
    seq_idx = Int(block_idx.y)

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    # Initialize gradient tensors to zero (block 0,0 only to avoid UB with atomic ops)
    if batch_idx == 0 and seq_idx == 0:
        # Initialize grad_ln_weight and grad_ln_bias
        @parameter
        for h in range(hidden_dim):
            (grad_ln_weight.ptr + h).init_pointee_copy(0)
            (grad_ln_bias.ptr + h).init_pointee_copy(0)

        # Initialize grad_weight and grad_bias
        @parameter
        for out_idx in range(output_dim):
            (grad_bias.ptr + out_idx).init_pointee_copy(0)

            @parameter
            for h in range(hidden_dim):
                (grad_weight.ptr + out_idx * hidden_dim + h).init_pointee_copy(
                    0
                )

    # Note: We cannot use barrier() here as it only synchronizes within a block.
    # The atomic operations will handle synchronization across blocks.

    # Step 1: Recompute forward pass statistics (needed for gradients)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    @parameter
    for h in range(hidden_dim):
        val = input[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)

    mean_val = sum_val / hidden_dim
    var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
    inv_std = 1.0 / sqrt(var_val + 1e-5)

    # Step 2: Atomically accumulate gradients w.r.t. linear bias
    @parameter
    for out_idx in range(output_dim):
        grad_bias_ptr = grad_bias.ptr + out_idx
        _ = Atomic[dtype].fetch_add(
            grad_bias_ptr,
            rebind[Scalar[dtype]](grad_output[batch_idx, seq_idx, out_idx]),
        )

    # Step 3: Atomically accumulate gradients w.r.t. linear weight
    @parameter
    for out_idx in range(output_dim):

        @parameter
        for h in range(hidden_dim):
            var input_val = input[batch_idx, seq_idx, h]
            var normalized = (input_val - mean_val) * inv_std
            var ln_output_val = normalized * rebind[Scalar[dtype]](
                ln_weight[h]
            ) + rebind[Scalar[dtype]](ln_bias[h])

            # Atomic gradient accumulation for linear weight
            var grad_w = (
                grad_output[batch_idx, seq_idx, out_idx] * ln_output_val
            )
            var grad_weight_ptr = grad_weight.ptr + out_idx * hidden_dim + h
            _ = Atomic.fetch_add(grad_weight_ptr, rebind[Scalar[dtype]](grad_w))

    # Step 4: Atomically accumulate gradients w.r.t. LayerNorm parameters
    @parameter
    for h in range(hidden_dim):
        input_val = input[batch_idx, seq_idx, h]
        normalized = (input_val - mean_val) * inv_std

        # Compute gradient w.r.t. LayerNorm output for this h
        var grad_ln_out: Scalar[dtype] = 0

        @parameter
        for out_idx in range(output_dim):
            grad_ln_out = grad_ln_out + rebind[Scalar[dtype]](
                grad_output[batch_idx, seq_idx, out_idx]
                * linear_weight[out_idx, h]
            )

        # Atomic accumulation of LayerNorm parameter gradients
        grad_ln_weight_ptr = grad_ln_weight.ptr + h
        grad_ln_bias_ptr = grad_ln_bias.ptr + h
        _ = Atomic[dtype].fetch_add(
            grad_ln_weight_ptr, rebind[Scalar[dtype]](grad_ln_out * normalized)
        )
        _ = Atomic[dtype].fetch_add(
            grad_ln_bias_ptr, rebind[Scalar[dtype]](grad_ln_out)
        )

    # Step 5: Compute gradients w.r.t. input (LayerNorm backward)
    # Compute sum terms needed for LayerNorm backward
    var sum_grad_normalized: Scalar[dtype] = 0
    var sum_grad_normalized_times_normalized: Scalar[dtype] = 0

    @parameter
    for h in range(hidden_dim):
        h_input_val = input[batch_idx, seq_idx, h]
        h_normalized = (h_input_val - mean_val) * inv_std

        var h_grad_ln_out: Scalar[dtype] = 0

        @parameter
        for out_idx in range(output_dim):
            h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]](
                grad_output[batch_idx, seq_idx, out_idx]
                * linear_weight[out_idx, h]
            )

        h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight[h])
        sum_grad_normalized = sum_grad_normalized + rebind[Scalar[dtype]](
            h_grad_norm
        )
        sum_grad_normalized_times_normalized = (
            sum_grad_normalized_times_normalized
            + rebind[Scalar[dtype]](h_grad_norm * h_normalized)
        )

    # Compute actual input gradients (no race conditions here - each thread writes to different positions)
    @parameter
    for h in range(hidden_dim):
        h_input_val = input[batch_idx, seq_idx, h]
        h_normalized = (h_input_val - mean_val) * inv_std

        var h_grad_ln_out: Scalar[dtype] = 0

        @parameter
        for out_idx in range(output_dim):
            h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]](
                grad_output[batch_idx, seq_idx, out_idx]
                * linear_weight[out_idx, h]
            )

        h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight[h])
        grad_input[batch_idx, seq_idx, h] = inv_std * (
            h_grad_norm
            - (sum_grad_normalized / hidden_dim)
            - (h_normalized * sum_grad_normalized_times_normalized / hidden_dim)
        )


ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์€ ์—ฐ์‚ฐ๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ:

    • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: [batch_size, seq_len]์œผ๋กœ ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก
    • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค: batch_idx = block_idx.x, seq_idx = block_idx.y
    • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ:
      • ์ž…๋ ฅ ํ…์„œ: [batch_size, seq_len, hidden_dim]
      • ์ถœ๋ ฅ ํ…์„œ: [batch_size, seq_len, output_dim]
      • ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ: [output_dim, hidden_dim]
      • ๊ธฐ์šธ๊ธฐ: ์ž…๋ ฅ ๊ธฐ์šธ๊ธฐ์šฉ [batch_size, seq_len, hidden_dim]
      • ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ์šธ๊ธฐ: LayerNorm์šฉ [hidden_dim], Linear์šฉ [output_dim, hidden_dim]
  2. LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋‹จ๊ณ„:

    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆœ์„œ๋กœ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค ํ†ต๊ณ„๋Ÿ‰์„ ์žฌ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:
      • ํ‰๊ท : \[\Large \mu = \frac{1}{H} \sum_{i=1}^{H} x_i \]
      • ๋ถ„์‚ฐ: \[\Large \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 \]
      • ์—ญํ‘œ์ค€ํŽธ์ฐจ: \[\Large \mathrm{inv\_std} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \]
    • ์ •๊ทœํ™”๋œ ๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \]
    • ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:
      • ์ž…๋ ฅ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \gamma \odot \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)}) \]
      • ์Šค์ผ€์ผ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{H} \frac{\partial L}{\partial y_i} \odot \hat{x}_i \]
      • ์‹œํ”„ํŠธ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial \beta} = \sum_{i=1}^{H} \frac{\partial L}{\partial y_i} \]
  3. Linear ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋‹จ๊ณ„:

    • ๊ฐ ์ถœ๋ ฅ ์ฐจ์›์— ๋Œ€ํ•ด:
      • Bias ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \]
      • ๊ฐ€์ค‘์น˜ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}x^T \]
      • ์ž…๋ ฅ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial x} = W^T\frac{\partial L}{\partial y} \]
    • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ ์‚ฌ์šฉ:
      • Bias ๊ธฐ์šธ๊ธฐ์— ์ ์ ˆํ•œ ์ •๋ ฌ๋กœ atomic_add ์‚ฌ์šฉ
      • ๊ฐ€์ค‘์น˜ ๊ธฐ์šธ๊ธฐ์— ์ ์ ˆํ•œ ์ •๋ ฌ๋กœ atomic_add ์‚ฌ์šฉ
      • LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ์šธ๊ธฐ์— ์ ์ ˆํ•œ ์ •๋ ฌ๋กœ atomic_add ์‚ฌ์šฉ
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ์ž…๋ ฅ/์ถœ๋ ฅ ํ…์„œ์— ๋Œ€ํ•œ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— ๋Œ€ํ•œ stride ์ ‘๊ทผ
    • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
    • ์ž์ฃผ ์ ‘๊ทผํ•˜๋Š” ๊ฐ’์„ ์œ„ํ•œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ
    • ๋ชจ๋“  ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ
  5. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ:

    • ๋ถ„๋ชจ์˜ ์—ก์‹ค๋ก  ์ฒ˜๋ฆฌ์— ์ฃผ์˜
    • ๊ธฐ์šธ๊ธฐ์˜ ์ ์ ˆํ•œ ์Šค์ผ€์ผ๋ง
    • ์•ˆ์ •์ ์ธ ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
    • rebind[Scalar[dtype]]๋กœ ํƒ€์ž… ์บ์ŠคํŒ…
    • ์—ฃ์ง€ ์ผ€์ด์Šค์˜ ์ ์ ˆํ•œ ์ฒ˜๋ฆฌ
    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์—ฐ์‚ฐ ์ˆœ์„œ ์œ ์ง€
  6. ์„ฑ๋Šฅ ์ตœ์ ํ™”:

    • ๋ชจ๋“  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋Ÿ‰ ์žฌ์‚ฌ์šฉ
    • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
    • ํšจ์œจ์ ์ธ ์Šค๋ ˆ๋“œ ํ™œ์šฉ
    • ๋™๊ธฐํ™” ์ง€์  ๊ฐ์†Œ
    • ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
    • ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ
  7. ๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ:

    • ์ปดํŒŒ์ผ ํƒ€์ž„ ์ƒ์ˆ˜๋ฅผ ์œ„ํ•œ @parameter ์‚ฌ์šฉ
    • ํ…์„œ ์ฐจ์›์˜ ์ ์ ˆํ•œ ์ฒ˜๋ฆฌ
    • ํšจ์œจ์ ์ธ ํƒ€์ž… ์บ์ŠคํŒ…๊ณผ ๋ณ€ํ™˜
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์‹ ์ค‘ํ•œ ๊ด€๋ฆฌ
    • ์—ฐ์‚ฐ ๊ฐ„ ์ ์ ˆํ•œ ๋™๊ธฐํ™”
    • ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ์™€ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
    • PyTorch ์˜คํ† ๊ทธ๋ž˜๋“œ ์‹œ์Šคํ…œ๊ณผ์˜ ํ†ตํ•ฉ
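
Step 4์˜ LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ์šธ๊ธฐ(๊ฐ ์œ„์น˜์—์„œ \(\frac{\partial L}{\partial \gamma} = \text{grad\_ln\_out} \odot \hat{x}\), \(\frac{\partial L}{\partial \beta} = \text{grad\_ln\_out}\)๋ฅผ ์›์ž์ ์œผ๋กœ ๋ˆ„์ )๋Š” ์ˆ˜์น˜ ๋ฏธ๋ถ„์œผ๋กœ ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๋‹จ์ผ ์‹œํ€€์Šค ์œ„์น˜๋ฅผ ๊ฐ€์ •ํ•œ NumPy ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค(H, O ๋“ฑ ์ˆ˜์น˜์™€ ๋ณ€์ˆ˜ ์ด๋ฆ„์€ ์„ค๋ช…์šฉ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค):

```python
import numpy as np

# ๊ฐ€์ •: ๋‹จ์ผ ์‹œํ€€์Šค ์œ„์น˜, H=์€๋‹‰ ์ฐจ์›, O=์ถœ๋ ฅ ์ฐจ์›
H, O, eps = 4, 5, 1e-5
rng = np.random.default_rng(3)
x = rng.standard_normal(H)
gamma, beta = rng.standard_normal(H), rng.standard_normal(H)
W = rng.standard_normal((O, H))
g = rng.standard_normal(O)  # ์ƒ๋ฅ˜ ๊ธฐ์šธ๊ธฐ grad_output

def forward(gamma, beta):
    # LayerNorm + Linear ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค (bias ์—†๋Š” Linear๋กœ ๋‹จ์ˆœํ™”)
    xhat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return W @ (gamma * xhat + beta)

xhat = (x - x.mean()) / np.sqrt(x.var() + eps)
grad_ln_out = W.T @ g            # Linear๋ฅผ ํ†ต๊ณผํ•œ ์—ญ์ „ํŒŒ
grad_gamma = grad_ln_out * xhat  # ์ปค๋„์—์„œ๋Š” ์œ„์น˜๋ณ„๋กœ ์›์ž์  ๋ˆ„์ 
grad_beta = grad_ln_out

# gamma์— ๋Œ€ํ•œ ์ค‘์•™ ์ฐจ๋ถ„ ๊ฒ€์ฆ (์Šค์นผ๋ผ ์†์‹ค = forward ยท g)
h = 1e-6
for i in range(H):
    e = np.zeros(H)
    e[i] = h
    num = ((forward(gamma + e, beta) - forward(gamma - e, beta)) @ g) / (2 * h)
    assert np.isclose(grad_gamma[i], num, atol=1e-4)
```

๊ฐ (batch, seq) ์œ„์น˜๊ฐ€ ์ด์™€ ๊ฐ™์€ ๊ธฐ์—ฌ๋ถ„์„ ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋ฏ€๋กœ, ๋ง์…ˆ์˜ ๊ตํ™˜ ๋ฒ•์น™ ๋•๋ถ„์— ์›์ž์  ๋ˆ„์  ์ˆœ์„œ๋Š” (๋ถ€๋™์†Œ์ˆ˜์  ๋ฐ˜์˜ฌ๋ฆผ ๋ฒ”์œ„ ๋‚ด์—์„œ) ๊ฒฐ๊ณผ์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.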

์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ์–ธํ“จ์ „ ๋ฒ„์ „๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • ์ปค๋„ ํ“จ์ „์„ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”
  • GPU ๋ฆฌ์†Œ์Šค์˜ ํšจ์œจ์  ํ™œ์šฉ
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์˜ ์ ์ ˆํ•œ ์ฒ˜๋ฆฌ
  • ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ๋ณด์žฅ
  • ํšจ์œจ์ ์ธ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ

ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” LayerNorm + Linear ์—ฐ์‚ฐ์ด ์ž์ฃผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์—์„œ ํŠนํžˆ ์ค‘์š”ํ•˜๋ฉฐ, ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
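Step 2~3์—์„œ ๊ฐ (batch, seq) ์œ„์น˜๊ฐ€ ์›์ž์ ์œผ๋กœ ๋”ํ•˜๋Š” Linear bias/๊ฐ€์ค‘์น˜ ๊ธฐ์—ฌ๋ถ„์˜ ์ดํ•ฉ์€ ํ‘œ์ค€ ํ–‰๋ ฌ ๋ฏธ๋ถ„ ๊ฒฐ๊ณผ \(\frac{\partial L}{\partial b} = \sum \frac{\partial L}{\partial y}\), \(\frac{\partial L}{\partial W} = \sum \frac{\partial L}{\partial y} x^T\)์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ด๋ฅผ ํ™•์ธํ•˜๋Š” NumPy ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค(B, S, H, O ๋“ฑ ์ˆ˜์น˜๋Š” ์„ค๋ช…์šฉ ๊ฐ€์ •):

```python
import numpy as np

rng = np.random.default_rng(1)
B, S, H, O = 2, 3, 4, 5
ln_out = rng.standard_normal((B, S, H))  # LayerNorm ์ถœ๋ ฅ (๊ฐ€์ •์šฉ ๋”๋ฏธ ๊ฐ’)
g = rng.standard_normal((B, S, O))       # ์ƒ๋ฅ˜ ๊ธฐ์šธ๊ธฐ grad_output

# ์ปค๋„์ฒ˜๋Ÿผ ์œ„์น˜๋ณ„๋กœ ๋ˆ„์  (GPU์—์„œ๋Š” ๊ฐ ๋ธ”๋ก์ด ์ด ๊ธฐ์—ฌ๋ถ„์„ ์›์ž์ ์œผ๋กœ ๋”ํ•จ)
grad_b = np.zeros(O)
grad_W = np.zeros((O, H))
for b in range(B):
    for s in range(S):
        grad_b += g[b, s]
        grad_W += np.outer(g[b, s], ln_out[b, s])

# ๋‹ซํžŒ ํ˜•ํƒœ์˜ ์ฐธ์กฐ ๊ฐ’๊ณผ ๋น„๊ต
assert np.allclose(grad_b, g.sum(axis=(0, 1)))
assert np.allclose(grad_W, np.einsum("bso,bsh->oh", g, ln_out))
```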

์„ฑ๋Šฅ ๊ณ ๋ ค ์‚ฌํ•ญ

์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์€ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ ํ™”๋œ torch.compile์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

# Compilation configuration
torch._dynamo.config.cache_size_limit = 64  # Increase cache
torch._dynamo.config.suppress_errors = True  # Handle errors gracefully
torch._dynamo.config.automatic_dynamic_shapes = True  # Dynamic shapes

์ด๋Ÿฌํ•œ ์ตœ์ ํ™”๊ฐ€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์—์„œ ํŠนํžˆ ์ค‘์š”ํ•œ ์ด์œ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • ์ž‘์€ ํ…์„œ ์—ฐ์‚ฐ์€ ์ปดํŒŒ์ผ ์บ์‹ฑ์˜ ์ด์ ์„ ๋ฐ›์Šต๋‹ˆ๋‹ค
  • ๋™์  ํ˜•์ƒ์€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์—์„œ ํ”ํ•˜๊ฒŒ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์—๋Š” ๊ฐ•๊ฑดํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์บ์‹œ ํฌ๊ธฐ๋Š” ๋ฐ˜๋ณต์ ์ธ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์—ฐ์‚ฐ์— ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค
  • ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ๋Š” ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์— ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ๋Š” ํ•™์Šต ์‹œ๊ฐ„์— ํฐ ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” ์ •ํ™•์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด reduce-overhead ๋ชจ๋“œ๋กœ ์ปดํŒŒ์ผ๋ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ํŠนํžˆ ์ค‘์š”ํ•œ ์ด์œ ๋Š”:

  • ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” ํ•™์Šต ์ค‘์— ๋นˆ๋ฒˆํ•˜๊ฒŒ ํ˜ธ์ถœ๋ฉ๋‹ˆ๋‹ค
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์€ ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์ตœ์ ํ™”๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ์›์ž์  ์—ฐ์‚ฐ์—๋Š” ์ ์ ˆํ•œ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์ด ํšจ์œจ์ ์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์˜ ์ƒ์„ธ ์œ ๋„

LayerNorm์˜ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ธฐ์šธ๊ธฐ๋Š” ์—ฐ์‡„ ๋ฒ•์น™์„ ์ฃผ์˜ ๊นŠ๊ฒŒ ์ ์šฉํ•˜์—ฌ ์œ ๋„๋ฉ๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„ ์œ ๋„ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค ์—ฐ์‚ฐ

  • ํ‰๊ท : \(\mu = \frac{1}{H} \sum_{i=1}^{H} x_i\)
  • ๋ถ„์‚ฐ: \(\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2\)
  • ์ •๊ทœํ™”๋œ ๊ฐ’: \(\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\)
  • ์ตœ์ข… ์ถœ๋ ฅ: \(y = \gamma \odot \hat{x} + \beta\)

์—ฐ์‡„ ๋ฒ•์น™ ์ ์šฉ

\(\frac{\partial L}{\partial x}\)๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์—ฐ์‡„ ๋ฒ•์น™์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial x}\]

๊ธฐ์šธ๊ธฐ ๊ตฌ์„ฑ ์š”์†Œ

์ถœ๋ ฅ์—์„œ ์ •๊ทœํ™”๋œ ๊ฐ’์œผ๋กœ

  • \(\frac{\partial y}{\partial \hat{x}} = \gamma\) (์š”์†Œ๋ณ„ ๊ณฑ์…ˆ)

์ •๊ทœํ™”๋œ ๊ฐ’์—์„œ ์ž…๋ ฅ์œผ๋กœ

๊ธฐ์šธ๊ธฐ \(\frac{\partial \hat{x}}{\partial x}\)์—๋Š” ์„ธ ๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ถ„์ž๋ฅผ ํ†ตํ•œ ์ง์ ‘์  ํšจ๊ณผ: \(\frac{1}{\sqrt{\sigma^2 + \epsilon}}\)
  • ํ‰๊ท ์„ ํ†ตํ•œ ๊ฐ„์ ‘์  ํšจ๊ณผ: \(-\frac{1}{H} \frac{1}{\sqrt{\sigma^2 + \epsilon}}\)
  • ๋ถ„์‚ฐ์„ ํ†ตํ•œ ๊ฐ„์ ‘์  ํšจ๊ณผ: \(-\frac{(x - \mu)}{H(\sigma^2 + \epsilon)^{3/2}} (x - \mu)\)

ํ•ญ ๊ฒฐํ•ฉ

์ •๊ทœํ™” ํ•ญ์„ ํ†ตํ•œ ๊ธฐ์šธ๊ธฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •๋ฆฌ๋ฉ๋‹ˆ๋‹ค: \[\Large \frac{\partial \hat{x}}{\partial x} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)})\]

์ตœ์ข… ๊ธฐ์šธ๊ธฐ ํ‘œํ˜„์‹

๋ชจ๋“  ํ•ญ์„ ๊ฒฐํ•ฉํ•˜๋ฉด: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \gamma \odot \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)})\]

ํ•ต์‹ฌ ํ†ต์ฐฐ

  • ์—ฐ์‡„ ๋ฒ•์น™์€ x๊ฐ€ ์ถœ๋ ฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๋ชจ๋“  ๊ฒฝ๋กœ๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค
  • ์ •๊ทœํ™” ํ•ญ \(\sqrt{\sigma^2 + \epsilon}\)์€ ๋ถ„์ž์™€ ๋ถ„๋ชจ ๋ชจ๋‘์— ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค
  • ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ ํ•ญ์€ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์˜ ์ถ”๊ฐ€ ๊ฒฝ๋กœ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์ข… ํ‘œํ˜„์‹์€ ๋ชจ๋“  ํšจ๊ณผ๋ฅผ ํ•˜๋‚˜์˜ ํšจ์œจ์ ์ธ ๊ณ„์‚ฐ์œผ๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค

๊ตฌํ˜„ ์‹œ ๊ณ ๋ ค ์‚ฌํ•ญ

  • ๊ธฐ์šธ๊ธฐ๊ฐ€ \(\gamma\)์˜ ์Šค์ผ€์ผ๋ง ํšจ๊ณผ๋ฅผ ์ ์ ˆํžˆ ๋ฐ˜์˜ํ•ฉ๋‹ˆ๋‹ค
  • ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์˜ ์ •๊ทœํ™” ํšจ๊ณผ๊ฐ€ ๋ณด์กด๋ฉ๋‹ˆ๋‹ค
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ํ•ญ \(\epsilon\)์ด ์œ ์ง€๋ฉ๋‹ˆ๋‹ค
  • ๊ธฐ์šธ๊ธฐ๊ฐ€ ์€๋‹‰ ์ฐจ์› H ์ „์ฒด์— ๊ฑธ์ณ ์ ์ ˆํžˆ ์Šค์ผ€์ผ๋ง๋ฉ๋‹ˆ๋‹ค
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์„ ์œ„ํ•ด ์—ฐ์‚ฐ ์ˆœ์„œ๊ฐ€ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค

์ด ์œ ๋„๋ฅผ ํ†ตํ•ด ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆ˜์น˜์  ํŠน์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ํ•„์š”ํ•œ ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Puzzle 23: GPU ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด

๊ฐœ์š”

Part VI: ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” GPU ์—ฐ์‚ฐ์„ ์œ„ํ•œ Mojo์˜ ๊ณ ์ˆ˜์ค€ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ๋ฒกํ„ฐํ™”, ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”, ์„ฑ๋Šฅ ํŠœ๋‹์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ํ•จ์ˆ˜ํ˜• ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋ฐฐ์šฐ๋ฉฐ, ์ˆ˜๋™ GPU ์ปค๋„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์šฐ์•„ํ•จ์„ ํฌ๊ธฐํ•  ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค - Mojo์˜ ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์€ ๋‘ ๊ฐ€์ง€๋ฅผ ๋ชจ๋‘ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

GPU ์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ

GPU ์Šค๋ ˆ๋“œ์™€ SIMD ์—ฐ์‚ฐ ์‚ฌ์ด์˜ ๊ทผ๋ณธ์ ์ธ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU Device
โ”œโ”€โ”€ Grid (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ Block 1 (์Šค๋ ˆ๋“œ ๊ทธ๋ฃน, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
โ”‚   โ”‚   โ”œโ”€โ”€ Warp 1 (32๊ฐœ ์Šค๋ ˆ๋“œ, ๋ฝ์Šคํ…(lockstep) ์‹คํ–‰) --> Part VII์—์„œ ํ•™์Šต
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 1 โ†’ SIMD
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 2 โ†’ SIMD
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ... (์ด 32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ”‚   โ””โ”€โ”€ Warp 2 (32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ Block 2 (๋…๋ฆฝ์ ์ธ ๊ทธ๋ฃน)

Mojo๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ๋“ค:

  • ๊ทธ๋ฆฌ๋“œ/๋ธ”๋ก ๊ตฌ์„ฑ ์ž๋™ ๊ณ„์‚ฐ
  • ์›Œํ”„ ๊ด€๋ฆฌ์˜ ํˆฌ๋ช…ํ•œ ์ฒ˜๋ฆฌ
  • ์Šค๋ ˆ๋“œ ์Šค์ผ€์ค„๋ง ์ž๋™ ์ตœ์ ํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™” ๋‚ด์žฅ

๐Ÿ’ก ์ฐธ๊ณ : ์ด Part๋Š” ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ์œผ๋ฉฐ, ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๊ณ ๊ธ‰ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋Š” Part VII ์—์„œ ์ž์„ธํžˆ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

๋„ค ๊ฐ€์ง€ ๊ธฐ๋ณธ ํŒจํ„ด

GPU ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•ต์‹ฌ ํŒจํ„ด์„ ๋ชจ๋‘ ๋‹ค๋ฃน๋‹ˆ๋‹ค:

  1. Elementwise: ์ž๋™ SIMD ๋ฒกํ„ฐํ™”๋ฅผ ํ†ตํ•œ ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ
  2. Tiled: ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ํ™œ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ด ๋†’์€ ์ฒ˜๋ฆฌ
  3. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”: SIMD ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ œ์–ด
  4. Mojo vectorize: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•œ ์•ˆ์ „ํ•œ ์ž๋™ ๋ฒกํ„ฐํ™”

ํ•œ๋ˆˆ์— ๋ณด๋Š” ์„ฑ๋Šฅ ํŒจํ„ด

๋ฌธ์ œ: 1024๊ฐœ ์š”์†Œ์˜ ๋ฒกํ„ฐ ๋‘ ๊ฐœ ๋”ํ•˜๊ธฐ (SIZE=1024, SIMD_WIDTH=4)

Elementwise:     256 ์Šค๋ ˆ๋“œ ร— 1 SIMD ์—ฐ์‚ฐ   = ๋†’์€ ๋ณ‘๋ ฌ์„ฑ
Tiled:           32 ์Šค๋ ˆ๋“œ  ร— 8 SIMD ์—ฐ์‚ฐ  = ์บ์‹œ ์ตœ์ ํ™”
Manual:          8 ์Šค๋ ˆ๋“œ   ร— 32 SIMD ์—ฐ์‚ฐ = ์ตœ๋Œ€ ์ œ์–ด
Mojo vectorize:  32 ์Šค๋ ˆ๋“œ  ร— 8 SIMD ์—ฐ์‚ฐ  = ์ž๋™ ์•ˆ์ „์„ฑ
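
์œ„ ํ‘œ์˜ ์Šค๋ ˆ๋“œ ์ˆ˜๋Š” ๋‹จ์ˆœ ์‚ฐ์ˆ ๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” SIZE=1024, SIMD_WIDTH=4๋ฅผ ๊ฐ€์ •ํ•œ Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค(์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ ํšŸ์ˆ˜๋Š” ์œ„ ํ‘œ์—์„œ ๊ฐ€์ ธ์˜จ ๊ฐ’):

```python
SIZE, SIMD_WIDTH = 1024, 4

# ์Šค๋ ˆ๋“œ ์ˆ˜ = SIZE รท (์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ ์ˆ˜ ร— SIMD_WIDTH)
patterns = {
    "elementwise": SIZE // (1 * SIMD_WIDTH),   # ์Šค๋ ˆ๋“œ๋‹น 1ํšŒ SIMD ์—ฐ์‚ฐ
    "tiled": SIZE // (8 * SIMD_WIDTH),         # ์Šค๋ ˆ๋“œ๋‹น 8ํšŒ SIMD ์—ฐ์‚ฐ
    "manual": SIZE // (32 * SIMD_WIDTH),       # ์Šค๋ ˆ๋“œ๋‹น 32ํšŒ SIMD ์—ฐ์‚ฐ
    "vectorize": SIZE // (8 * SIMD_WIDTH),     # ์Šค๋ ˆ๋“œ๋‹น 8ํšŒ SIMD ์—ฐ์‚ฐ
}
assert patterns == {"elementwise": 256, "tiled": 32, "manual": 8, "vectorize": 32}
```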

๐Ÿ“Š ์‹ค์ œ ์„ฑ๋Šฅ ๋ถ„์„

์‹ค์ฆ์  ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ๋ฅผ ํ•ด์„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค:

๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ (SIZE=1,048,576):
elementwise:        11.34ms  โ† ๋Œ€๊ทœ๋ชจ์—์„œ ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ์ด ์œ ๋ฆฌ
tiled:              12.04ms  โ† ์ง€์—ญ์„ฑ๊ณผ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ท ํ˜•
manual_vectorized:  15.75ms  โ† ๋‹จ์ˆœ ์—ฐ์‚ฐ์—์„œ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ์ด ๋ถˆ๋ฆฌ
vectorized:         13.38ms  โ† ์ž๋™ ์ตœ์ ํ™” ์˜ค๋ฒ„ํ—ค๋“œ

์„ ์ˆ˜ ์ง€์‹

ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์„ ํ•™์Šตํ•˜๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • ๊ธฐ๋ณธ GPU ๊ฐœ๋…: ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ, ์Šค๋ ˆ๋“œ ์‹คํ–‰, SIMD ์—ฐ์‚ฐ
  • Mojo ๊ธฐ์ดˆ: ํŒŒ๋ผ๋ฏธํ„ฐ ํ•จ์ˆ˜, ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”, ์บก์ฒ˜ ์˜๋ฏธ๋ก 
  • LayoutTensor ์—ฐ์‚ฐ: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘
  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ: ๋ฒ„ํผ ํ• ๋‹น, ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”

ํ•™์Šต ๊ฒฝ๋กœ

1. Elementwise ์—ฐ์‚ฐ

โ†’ elementwise - ๊ธฐ๋ณธ GPU ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ

๊ธฐ์ดˆ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค: ์ž๋™ ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ์™€ SIMD ๋ฒกํ„ฐํ™”.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • elementwise๋ฅผ ํ™œ์šฉํ•œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ
  • GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ ์ž๋™ SIMD ๋ฒกํ„ฐํ™”
  • ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ LayoutTensor ์—ฐ์‚ฐ
  • ์ค‘์ฒฉ ํ•จ์ˆ˜์—์„œ์˜ ์บก์ฒ˜ ์˜๋ฏธ๋ก 

ํ•ต์‹ฌ ํŒจํ„ด:

elementwise[add_function, SIMD_WIDTH, target="gpu"](total_size, ctx)

2. ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ

โ†’ tile - ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ ์ธ ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ

elementwise๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ํƒ€์ผ๋ง ํŒจํ„ด์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ํƒ€์ผ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ํƒ€์ผ ๋‚ด ์ˆœ์ฐจ์  SIMD ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ ์›์น™๊ณผ ์บ์‹œ ์นœํ™”์  ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ-ํƒ€์ผ ๋งคํ•‘ vs ์Šค๋ ˆ๋“œ-์š”์†Œ ๋งคํ•‘

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํƒ€์ผ๋ง์€ ๋ณ‘๋ ฌ ํญ์„ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ๊ณผ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค - ๋” ์ ์€ ์ˆ˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ์œผ๋กœ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

3. ๊ณ ๊ธ‰ ๋ฒกํ„ฐํ™”

โ†’ vectorize - SIMD ์ œ์–ด

์ˆ˜๋™ ์ œ์–ด์™€ ์ž๋™ ๋ฒกํ„ฐํ™” ์ „๋žต์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ์ˆ˜๋™ SIMD ์—ฐ์‚ฐ
  • ์•ˆ์ „ํ•˜๊ณ  ์ž๋™์ ์ธ ๋ฒกํ„ฐํ™”๋ฅผ ์œ„ํ•œ Mojo์˜ vectorize ํ•จ์ˆ˜
  • ์ตœ์ ์˜ SIMD ์ •๋ ฌ์„ ์œ„ํ•œ ์ฒญํฌ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ์ˆ˜๋™ ์ œ์–ด์™€ ์•ˆ์ „์„ฑ ๊ฐ„์˜ ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ๋ฒ•:

  • ์ˆ˜๋™: ์ง์ ‘ ์ œ์–ด, ์ตœ๋Œ€ ์„ฑ๋Šฅ, ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ
  • Mojo vectorize: ์ž๋™ ์ตœ์ ํ™”, ๋‚ด์žฅ ์•ˆ์ „์„ฑ, ๊น”๋”ํ•œ ์ฝ”๋“œ

๐Ÿง  4. ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…

โ†’ GPU ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…

๋ณ‘๋ ฌ์„ฑ ์ˆ˜์ค€ ๊ฐ„์˜ ๊ทผ๋ณธ์ ์ธ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • GPU ์Šค๋ ˆ๋”ฉ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ํ•˜๋“œ์›จ์–ด ๋งคํ•‘
  • GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ SIMD ์—ฐ์‚ฐ
  • ํŒจํ„ด ๋น„๊ต์™€ ์Šค๋ ˆ๋“œ-์ž‘์—… ๋งคํ•‘
  • ์›Œํฌ๋กœ๋“œ์— ๋งž๋Š” ์˜ฌ๋ฐ”๋ฅธ ํŒจํ„ด ์„ ํƒ

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ตฌ์กฐ๋ฅผ ์ œ๊ณตํ•˜๊ณ , SIMD ์—ฐ์‚ฐ์ด ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“Š 5. Mojo ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํ‚น

โ†’ Mojo ๋ฒค์น˜๋งˆํ‚น

GPU ์„ฑ๋Šฅ์„ ๊ณผํ•™์ ์œผ๋กœ ์ธก์ •, ๋ถ„์„, ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • Mojo์˜ ๋‚ด์žฅ ๋ฒค์น˜๋งˆํ‚น ํ”„๋ ˆ์ž„์›Œํฌ
  • GPU ๊ณ ์œ ์˜ ํƒ€์ด๋ฐ ๋ฐ ๋™๊ธฐํ™” ๋ฌธ์ œ
  • ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”๋ฅผ ํ™œ์šฉํ•œ ํŒŒ๋ผ๋ฏธํ„ฐํ™”๋œ ๋ฒค์น˜๋งˆํฌ ํ•จ์ˆ˜
  • ์‹ค์ฆ์  ์„ฑ๋Šฅ ๋ถ„์„๊ณผ ํŒจํ„ด ์„ ํƒ

ํ•ต์‹ฌ ๊ธฐ๋ฒ•: keep()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒค์น˜๋งˆํฌ ์ฝ”๋“œ์˜ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

Elementwise ํŒจํ„ด๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ๊ฐ ์„น์…˜์„ ์ฒด๊ณ„์ ์œผ๋กœ ํ•™์Šตํ•˜์„ธ์š”. ๊ฐ ํผ์ฆ์€ ์ด์ „ ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒˆ๋กœ์šด ์ˆ˜์ค€์˜ ์ •๊ตํ•จ์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ๊ฐ ํŒจํ„ด์˜ ์–ด๋–ป๊ฒŒ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์™œ๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ง‘์ค‘ํ•˜์„ธ์š”. ์—ฌ๊ธฐ์„œ ํ˜•์„ฑํ•˜๋Š” ๊ฐœ๋…์  ํ”„๋ ˆ์ž„์›Œํฌ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ „๋ฐ˜์— ๊ฑธ์ณ ํ™œ์šฉ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Part VI๋ฅผ ๋งˆ์น˜๋ฉด, ์ €์ˆ˜์ค€ GPU ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋Œ€์‹  ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์˜ ๊ด€์ ์—์„œ ์‚ฌ๊ณ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์–ด, ๋” ์œ ์ง€๋ณด์ˆ˜ํ•˜๊ธฐ ์‰ฝ๊ณ , ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚˜๋ฉฐ, ์ด์‹์„ฑ์ด ๋†’์€ GPU ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: elementwise - ๊ธฐ๋ณธ GPU ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ ์—์„œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‹œ์ž‘ํ•˜์„ธ์š”.

elementwise - ๊ธฐ๋ณธ GPU ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ

์ด ํผ์ฆ์€ Mojo์˜ ํ•จ์ˆ˜ํ˜• elementwise ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ ๋ง์…ˆ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž๋™์œผ๋กœ ์—ฌ๋Ÿฌ SIMD ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์–ด๋–ป๊ฒŒ ์ €์ˆ˜์ค€ ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ถ”์ƒํ™”ํ•˜๋ฉด์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: elementwise ํ•จ์ˆ˜๋Š” ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ, SIMD ๋ฒกํ„ฐํ™”, ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • elementwise๋ฅผ ํ™œ์šฉํ•œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ
  • GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ ์ž๋™ SIMD ๋ฒกํ„ฐํ™”
  • ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ LayoutTensor ์—ฐ์‚ฐ
  • GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ vs SIMD ์—ฐ์‚ฐ
  • ์ค‘์ฒฉ ํ•จ์ˆ˜์—์„œ์˜ ์บก์ฒ˜ ์˜๋ฏธ๋ก 

์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‹จ์ˆœํ•œ ์š”์†Œ๋ณ„ ๋ง์…ˆ์ž…๋‹ˆ๋‹ค: \[\Large \text{output}[i] = a[i] + b[i]\]

์ด ๊ตฌํ˜„์€ Mojo์—์„œ์˜ ๋ชจ๋“  GPU ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 1024
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • SIMD ํญ: ํƒ€๊ฒŸ ์˜์กด์  (GPU ์•„ํ‚คํ…์ฒ˜์™€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์— ๋”ฐ๋ผ ๊ฒฐ์ •)
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D ํ–‰ ์šฐ์„ )

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 1024
comptime rank = 1
comptime layout = Layout.row_major(SIZE)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype, target = get_gpu_target()]()


fn elementwise_add[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn add[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        print("idx:", idx)
        # FILL IN (2 to 4 lines)

    elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ํ•จ์ˆ˜ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

elementwise ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ •ํ™•ํ•œ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ๊ฐ€์ง„ ์ค‘์ฒฉ ํ•จ์ˆ˜๋ฅผ ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค:

@parameter
@always_inline
fn your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    # ๊ตฌํ˜„ ์ฝ”๋“œ

๊ฐ ๋ถ€๋ถ„์ด ์ค‘์š”ํ•œ ์ด์œ :

  • @parameter: ์ตœ์ ์˜ GPU ์ฝ”๋“œ ์ƒ์„ฑ์„ ์œ„ํ•œ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”๋ฅผ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค
  • @always_inline: GPU ์ปค๋„์—์„œ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ์ธ๋ผ์ด๋‹์„ ๊ฐ•์ œํ•ฉ๋‹ˆ๋‹ค
  • capturing: ์™ธ๋ถ€ ์Šค์ฝ”ํ”„์˜ ๋ณ€์ˆ˜(์ž…์ถœ๋ ฅ ํ…์„œ)์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค
  • IndexList[rank]: ๋‹ค์ฐจ์› ์ธ๋ฑ์‹ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค (๋ฒกํ„ฐ๋Š” rank=1, ํ–‰๋ ฌ์€ rank=2)

2. ์ธ๋ฑ์Šค ์ถ”์ถœ๊ณผ SIMD ์ฒ˜๋ฆฌ

idx = indices[0]  # 1D ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์„ ํ˜• ์ธ๋ฑ์Šค ์ถ”์ถœ

์ด idx๋Š” ๋‹จ์ผ ์š”์†Œ๊ฐ€ ์•„๋‹Œ SIMD ๋ฒกํ„ฐ์˜ ์‹œ์ž‘ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. SIMD_WIDTH=4 (GPU ์˜์กด์ )์ธ ๊ฒฝ์šฐ:

  • Thread 0์€ idx=0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์š”์†Œ [0, 1, 2, 3]์„ ์ฒ˜๋ฆฌ
  • Thread 1์€ idx=4๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์š”์†Œ [4, 5, 6, 7]์„ ์ฒ˜๋ฆฌ
  • Thread 2๋Š” idx=8๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์š”์†Œ [8, 9, 10, 11]์„ ์ฒ˜๋ฆฌ
  • ์ด๋Ÿฐ ์‹์œผ๋กœ ๊ณ„์†โ€ฆ

3. SIMD ๋กœ๋“œ ํŒจํ„ด

a_simd = a.aligned_load[simd_width](Index(idx))  # ์—ฐ์† float 4๊ฐœ ๋กœ๋“œ (GPU ์˜์กด์ )
b_simd = b.aligned_load[simd_width](Index(idx))  # ์—ฐ์† float 4๊ฐœ ๋กœ๋“œ (GPU ์˜์กด์ )

์ด ์—ฐ์‚ฐ์€ ๋ฒกํ„ฐํ™”๋œ ์ฒญํฌ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์— ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ๋กœ๋“œ๋˜๋Š” ์ •ํ™•ํ•œ ์š”์†Œ ์ˆ˜๋Š” GPU์˜ SIMD ๋Šฅ๋ ฅ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

4. ๋ฒกํ„ฐ ์—ฐ์‚ฐ

result = a_simd + b_simd  # 4๊ฐœ ์š”์†Œ์˜ SIMD ๋ง์…ˆ์„ ๋™์‹œ์— ์ˆ˜ํ–‰ (GPU ์˜์กด์ )

์ „์ฒด SIMD ๋ฒกํ„ฐ์— ๊ฑธ์ณ ์š”์†Œ๋ณ„ ๋ง์…ˆ์„ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค - 4๊ฐœ์˜ ๊ฐœ๋ณ„ ์Šค์นผ๋ผ ๋ง์…ˆ๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฆ…๋‹ˆ๋‹ค.

5. SIMD ์ €์žฅ

output.store[simd_width](Index(idx), result)  # 4๊ฐœ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ์— ์ €์žฅ (GPU ์˜์กด์ )

์ „์ฒด SIMD ๋ฒกํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์˜ ์—ฐ์‚ฐ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ์— ๋‹ค์‹œ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค.

6. elementwise ํ•จ์ˆ˜ ํ˜ธ์ถœ

elementwise[your_function, SIMD_WIDTH, target="gpu"](total_size, ctx)
  • total_size๋Š” ๋ชจ๋“  ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด a.size()๋กœ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • GPU๋Š” ์‹คํ–‰ํ•  ์Šค๋ ˆ๋“œ ์ˆ˜๋ฅผ ์ž๋™์œผ๋กœ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค: total_size // SIMD_WIDTH

7. ๋””๋ฒ„๊น… ํ•ต์‹ฌ ํฌ์ธํŠธ

ํ…œํ”Œ๋ฆฟ์— ์žˆ๋Š” print("idx:", idx)์— ์ฃผ๋ชฉํ•˜์„ธ์š”. ์‹คํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

idx: 0, idx: 4, idx: 8, idx: 12, ...

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ SIMD ์ฒญํฌ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, SIMD_WIDTH (GPU ์˜์กด์ ) ๊ฐ„๊ฒฉ์œผ๋กœ ์ž๋™ ๋ฐฐ์น˜๋จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

ํ’€์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p23 --elementwise
pixi run -e amd p23 --elementwise
pixi run -e apple p23 --elementwise
uv run poe p23 --elementwise

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
...
idx: 404
idx: 408
idx: 412
idx: 416
...

out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

์†”๋ฃจ์…˜

fn elementwise_add[
    layout: Layout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn add[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        # Note: This is thread-local SIMD - each thread processes its own vector of data
        # we'll later better see this hierarchy in Mojo:
        # SIMD within threads, warp across threads, block across warps
        a_simd = a.aligned_load[width=simd_width](Index(idx))
        b_simd = b.aligned_load[width=simd_width](Index(idx))
        ret = a_simd + b_simd
        # print(
        #     "idx:", idx, ", a_simd:", a_simd, ", b_simd:", b_simd, " sum:", ret
        # )
        output.store[simd_width](Index(idx), ret)

    elementwise[add, SIMD_WIDTH, target="gpu"](a.size(), ctx)


Mojo์˜ elementwise ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์€ ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ๋ช‡ ๊ฐ€์ง€ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค:

1. ํ•จ์ˆ˜ํ˜• ์ถ”์ƒํ™” ์ฒ ํ•™

elementwise ํ•จ์ˆ˜๋Š” ๊ธฐ์กด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ์˜ ํŒจ๋Ÿฌ๋‹ค์ž„ ์ „ํ™˜์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

์ „ํ†ต์ ์ธ CUDA/HIP ๋ฐฉ์‹:

# ์ˆ˜๋™ ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ
idx = thread_idx.x + block_idx.x * block_dim.x
if idx < size:
    output[idx] = a[idx] + b[idx]  # ์Šค์นผ๋ผ ์—ฐ์‚ฐ

Mojo ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹:

# ์ž๋™ ๊ด€๋ฆฌ + SIMD ๋ฒกํ„ฐํ™”
elementwise[add_function, simd_width, target="gpu"](size, ctx)

elementwise๊ฐ€ ์ถ”์ƒํ™”ํ•˜๋Š” ๊ฒƒ๋“ค:

  • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๋ธ”๋ก/๊ทธ๋ฆฌ๋“œ ์ฐจ์›์„ ๊ณ„์‚ฐํ•  ํ•„์š” ์—†์Œ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ฐฐ์—ด ๊ฒฝ๊ณ„๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๋‚ด์žฅ
  • SIMD ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜: ๋ฒกํ„ฐํ™”๋ฅผ ํˆฌ๋ช…ํ•˜๊ฒŒ ์ฒ˜๋ฆฌ
  • GPU ํƒ€๊ฒŸ ์„ ํƒ: ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ๋™์ž‘

2. ์‹ฌ์ธต ๋ถ„์„: ์ค‘์ฒฉ ํ•จ์ˆ˜ ์•„ํ‚คํ…์ฒ˜

@parameter
@always_inline
fn add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:

๋งค๊ฐœ๋ณ€์ˆ˜ ๋ถ„์„:

  • @parameter: ์ด ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋Š” ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ณ ์œ ํ•œ simd_width์™€ rank์— ๋Œ€ํ•ด ํ•จ์ˆ˜๊ฐ€ ๋ณ„๋„๋กœ ์ƒ์„ฑ๋˜์–ด ์ ๊ทน์ ์ธ ์ตœ์ ํ™”๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • @always_inline: GPU ์„ฑ๋Šฅ์— ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค - ์ฝ”๋“œ๋ฅผ ์ปค๋„์— ์ง์ ‘ ๋‚ด์žฅํ•˜์—ฌ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.
  • capturing: ๋ ‰์‹œ์ปฌ ์Šค์ฝ”ํ•‘์„ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค - ๋‚ด๋ถ€ ํ•จ์ˆ˜๊ฐ€ ๋ช…์‹œ์  ๋งค๊ฐœ๋ณ€์ˆ˜ ์ „๋‹ฌ ์—†์ด ์™ธ๋ถ€ ์Šค์ฝ”ํ”„์˜ ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • IndexList[rank]: ์ฐจ์› ๋ฌด๊ด€ ์ธ๋ฑ์‹ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค - ๋™์ผํ•œ ํŒจํ„ด์ด 1D ๋ฒกํ„ฐ, 2D ํ–‰๋ ฌ, 3D ํ…์„œ ๋“ฑ์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

3. SIMD ์‹คํ–‰ ๋ชจ๋ธ ์‹ฌ์ธต ๋ถ„์„

idx = indices[0]                                    # ์„ ํ˜• ์ธ๋ฑ์Šค: 0, 4, 8, 12... (GPU ์˜์กด์  ๊ฐ„๊ฒฉ)
a_simd = a.aligned_load[simd_width](Index(idx))     # ๋กœ๋“œ: [a[0:4], a[4:8], a[8:12]...] (๋กœ๋“œ๋‹น 4๊ฐœ ์š”์†Œ)
b_simd = b.aligned_load[simd_width](Index(idx))     # ๋กœ๋“œ: [b[0:4], b[4:8], b[8:12]...] (๋กœ๋“œ๋‹น 4๊ฐœ ์š”์†Œ)
ret = a_simd + b_simd                               # SIMD: 4๊ฐœ ๋ง์…ˆ์„ ๋ณ‘๋ ฌ ์ˆ˜ํ–‰ (GPU ์˜์กด์ )
output.store[simd_width](Index(idx), ret)           # 저장: 4개 결과를 동시 저장 (GPU 의존적)

์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ ์‹œ๊ฐํ™”:

GPU ์•„ํ‚คํ…์ฒ˜:
โ”œโ”€โ”€ Grid (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ Block 1 (์—ฌ๋Ÿฌ Warp)
โ”‚   โ”‚   โ”œโ”€โ”€ Warp 1 (32๊ฐœ ์Šค๋ ˆ๋“œ) --> Warp๋Š” ๋‹ค์Œ Part VI์—์„œ ํ•™์Šต
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 1 โ†’ SIMD[4๊ฐœ ์š”์†Œ]  โ† ํ˜„์žฌ ์ดˆ์  (GPU ์˜์กด์  ํญ)
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 2 โ†’ SIMD[4๊ฐœ ์š”์†Œ]
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ...
โ”‚   โ”‚   โ””โ”€โ”€ Warp 2 (32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ Block 2 (์—ฌ๋Ÿฌ Warp)

SIMD_WIDTH=4์ธ 1024๊ฐœ ์š”์†Œ ๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ (GPU ์˜ˆ์‹œ):

  • ํ•„์š”ํ•œ ์ด SIMD ์—ฐ์‚ฐ ์ˆ˜: 1024 รท 4 = 256
  • GPU ์‹คํ–‰: 256๊ฐœ ์Šค๋ ˆ๋“œ (1024 รท 4)
  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ์–‘: ์ •ํ™•ํžˆ 4๊ฐœ์˜ ์—ฐ์† ์š”์†Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์Šค์นผ๋ผ ์—ฐ์‚ฐ ๋Œ€๋น„ SIMD_WIDTH๋ฐฐ ํ–ฅ์ƒ

์ฐธ๊ณ : SIMD ํญ์€ GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ค๋ฆ…๋‹ˆ๋‹ค (์˜ˆ: ์ผ๋ถ€ GPU๋Š” 4, RTX 4090์€ 8, A100์€ 16).

4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

a.aligned_load[simd_width](Index(idx))  // ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์˜ ์ด์ :

  • ์ˆœ์ฐจ์  ์ ‘๊ทผ: ์Šค๋ ˆ๋“œ๋“ค์ด ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผ
  • ์บ์‹œ ์ตœ์ ํ™”: L1/L2 ์บ์‹œ ํžˆํŠธ์œจ ๊ทน๋Œ€ํ™”
  • ๋Œ€์—ญํญ ํ™œ์šฉ: ์ด๋ก ์  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ๊ทผ์ ‘ํ•˜๋Š” ์„ฑ๋Šฅ ๋‹ฌ์„ฑ
  • ํ•˜๋“œ์›จ์–ด ํšจ์œจ: GPU ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ๊ฐ€ ์ด ํŒจํ„ด์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Œ

SIMD_WIDTH=4 (GPU ์˜์กด์ ) ์˜ˆ์‹œ:

Thread 0: a[0:4] ๋กœ๋“œ   โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ 0-3
Thread 1: a[4:8] ๋กœ๋“œ   โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ 4-7
Thread 2: a[8:12] ๋กœ๋“œ  โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ 8-11
...
๊ฒฐ๊ณผ: ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ํ™œ์šฉ

5. ์„ฑ๋Šฅ ํŠน์„ฑ ๋ฐ ์ตœ์ ํ™”

์‚ฐ์ˆ  ๊ฐ•๋„ ๋ถ„์„ (SIMD_WIDTH=4 ๊ธฐ์ค€):

  • ์‚ฐ์ˆ  ์—ฐ์‚ฐ: 4๊ฐœ ์š”์†Œ๋‹น 1ํšŒ SIMD ๋ง์…ˆ
  • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ: 4๊ฐœ ์š”์†Œ๋‹น 2ํšŒ SIMD ๋กœ๋“œ + 1ํšŒ SIMD ์ €์žฅ
  • ์‚ฐ์ˆ  ๊ฐ•๋„: 1 ๋ง์…ˆ รท 3 ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ = 0.33 (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ)

์ด๊ฒƒ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ธ ์ด์œ :

๋‹จ์ˆœ ์—ฐ์‚ฐ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ >>> ์—ฐ์‚ฐ ๋Šฅ๋ ฅ

์ตœ์ ํ™” ์‹œ์‚ฌ์ :

  • ์‚ฐ์ˆ  ์ตœ์ ํ™”๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ์ง‘์ค‘ํ•ด์•ผ ํ•จ
  • SIMD ๋ฒกํ„ฐํ™”๊ฐ€ ์ฃผ์š” ์„ฑ๋Šฅ ์ด์ ์„ ์ œ๊ณต
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์„ฑ๋Šฅ์— ๋งค์šฐ ์ค‘์š”
  • ์—ฐ์‚ฐ ๋ณต์žก๋„๋ณด๋‹ค ์บ์‹œ ์ง€์—ญ์„ฑ์ด ๋” ์ค‘์š”

6. ํ™•์žฅ์„ฑ๊ณผ ์ ์‘์„ฑ

์ž๋™ ํ•˜๋“œ์›จ์–ด ์ ์‘:

comptime SIMD_WIDTH = simd_width_of[dtype, target = _get_gpu_target()]()
  • GPU๋ณ„ ์ตœ์ ํ™”: SIMD ํญ์ด ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์กฐ์ •๋จ (์˜ˆ: ์ผ๋ถ€ ์นด๋“œ๋Š” 4, RTX 4090์€ 8, A100์€ 16)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž… ์ธ์‹: float32์™€ float16์— ๋Œ€ํ•ด ์„œ๋กœ ๋‹ค๋ฅธ SIMD ํญ ์ ์šฉ
  • ์ปดํŒŒ์ผ ํƒ€์ž„ ์ตœ์ ํ™”: ํ•˜๋“œ์›จ์–ด ๊ฐ์ง€์— ๋Œ€ํ•œ ๋Ÿฐํƒ€์ž„ ์˜ค๋ฒ„ํ—ค๋“œ ์—†์Œ

ํ™•์žฅ์„ฑ ํŠน์„ฑ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜: ๋ฌธ์ œ ํฌ๊ธฐ์— ๋”ฐ๋ผ ์ž๋™ ํ™•์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰: ์ž…๋ ฅ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜์—ฌ ์„ ํ˜• ํ™•์žฅ
  • ์„ฑ๋Šฅ: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํฌํ™” ์‹œ์ ๊นŒ์ง€ ๊ฑฐ์˜ ์„ ํ˜•์ ์ธ ์†๋„ ํ–ฅ์ƒ

7. ๊ณ ๊ธ‰ ์ธ์‚ฌ์ดํŠธ: ์ด ํŒจํ„ด์ด ์ค‘์š”ํ•œ ์ด์œ 

๋ณต์žกํ•œ ์—ฐ์‚ฐ์˜ ๊ธฐ์ดˆ: ์ด elementwise ํŒจํ„ด์€ ๋‹ค์Œ ์—ฐ์‚ฐ๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

  • ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ: ๋Œ€๊ทœ๋ชจ ๋ฐฐ์—ด์—์„œ์˜ ํ•ฉ๊ณ„, ์ตœ๋Œ“๊ฐ’, ์ตœ์†Ÿ๊ฐ’
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ: ์Šค์นผ๋ผ-๋ฒกํ„ฐ ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ๋ณ€ํ™˜: ํ™œ์„ฑํ™” ํ•จ์ˆ˜, ์ •๊ทœํ™”
  • ๋‹ค์ฐจ์› ์—ฐ์‚ฐ: ํ–‰๋ ฌ ์—ฐ์‚ฐ, ํ•ฉ์„ฑ๊ณฑ

์ „ํ†ต์ ์ธ ๋ฐฉ์‹๊ณผ์˜ ๋น„๊ต:

// ์ „ํ†ต์ : ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ, ์žฅํ™ฉํ•จ, ํ•˜๋“œ์›จ์–ด ์ข…์†์ 
__global__ void add_kernel(float* output, float* a, float* b, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        output[idx] = a[idx] + b[idx];  // ๋ฒกํ„ฐํ™” ์—†์Œ
    }
}

// Mojo: ์•ˆ์ „, ๊ฐ„๊ฒฐ, ์ž๋™ ๋ฒกํ„ฐํ™”
elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)

ํ•จ์ˆ˜ํ˜• ์ ‘๊ทผ๋ฒ•์˜ ์ด์ :

  • ์•ˆ์ „์„ฑ: ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ๋ฐฉ์ง€
  • ์ด์‹์„ฑ: ๋™์ผํ•œ ์ฝ”๋“œ๊ฐ€ ๋‹ค์–‘ํ•œ GPU ๋ฒค๋”/์„ธ๋Œ€์—์„œ ๋™์ž‘
  • ์„ฑ๋Šฅ: ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๊ฐ€ ์ˆ˜๋™ ํŠœ๋‹ ์ฝ”๋“œ๋ฅผ ์ข…์ข… ๋Šฅ๊ฐ€
  • ์œ ์ง€๋ณด์ˆ˜์„ฑ: ๊น”๋”ํ•œ ์ถ”์ƒํ™”๋กœ ๋””๋ฒ„๊น… ๋ณต์žก๋„ ๊ฐ์†Œ
  • ์กฐํ•ฉ์„ฑ: ๋‹ค๋ฅธ ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ ๊ฐ€๋Šฅ

์ด ํŒจํ„ด์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋ฏธ๋ž˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค - ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š๋Š” ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋กœ, ์ตœ์ ์˜ ํšจ์œจ์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ GPU ์ปดํ“จํŒ…์„ ๋” ์‰ฝ๊ฒŒ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

Elementwise ์—ฐ์‚ฐ์„ ํ•™์Šตํ–ˆ๋‹ค๋ฉด ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐˆ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: elementwise ํŒจํ„ด์€ Mojo๊ฐ€ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ์šฐ์•„ํ•จ๊ณผ GPU ์„ฑ๋Šฅ์„ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์™„์ „ํ•œ ์ œ์–ด๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ๋ฒกํ„ฐํ™”์™€ ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

tile - ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ ์ธ ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ

๊ฐœ์š”

elementwise ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ด ํผ์ฆ์—์„œ๋Š” ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” GPU์—์„œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์บ์‹œ ํ™œ์šฉ์„ ์ตœ์ ํ™”ํ•˜๋Š” ํ•ต์‹ฌ ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์ฒด ๋ฐฐ์—ด์— ๊ฑธ์ณ ๊ฐœ๋ณ„ SIMD ๋ฒกํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Œ€์‹ , ํƒ€์ผ๋ง์€ ๋ฐ์ดํ„ฐ๋ฅผ ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ์— ๋” ์ž˜ ๋งž๋Š” ์ž‘๊ณ  ๊ด€๋ฆฌ ๊ฐ€๋Šฅํ•œ ์ฒญํฌ๋กœ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Puzzle 16์˜ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ ์—์„œ ์ด๋ฏธ ํƒ€์ผ๋ง์„ ๊ฒฝํ—˜ํ•œ ๋ฐ” ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฑฐ๊ธฐ์„œ๋Š” ํƒ€์ผ์„ ์‚ฌ์šฉํ•ด ๋Œ€๊ทœ๋ชจ ํ–‰๋ ฌ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” ๋™์ผํ•œ ํƒ€์ผ๋ง ์›์น™์„ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์— ์ ์šฉํ•˜์—ฌ, ์ด ๊ธฐ๋ฒ•์ด 2D ํ–‰๋ ฌ์—์„œ 1D ๋ฐฐ์—ด๊นŒ์ง€ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Mojo์˜ ํƒ€์ผ๋ง ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ์ดํ„ฐ์˜ ํƒ€์ผ ์ „์ฒด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ์ด ํŠน์ • ์›Œํฌ๋กœ๋“œ์—์„œ ์–ด๋–ป๊ฒŒ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํƒ€์ผ๋ง์€ ๋ณ‘๋ ฌ ํญ์„ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ๊ณผ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค - ๋” ์ ์€ ์ˆ˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ์œผ๋กœ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ํƒ€์ผ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ํƒ€์ผ ๋‚ด์˜ ์ˆœ์ฐจ์  SIMD ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ ์›์น™๊ณผ ์บ์‹œ ์นœํ™”์  ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ-ํƒ€์ผ ๋งคํ•‘ vs ์Šค๋ ˆ๋“œ-์š”์†Œ ๋งคํ•‘
  • ๋ณ‘๋ ฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ ๊ฐ„์˜ ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

์š”์†Œ๋ณ„ ๋ฐฉ์‹๊ณผ ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ: \[\Large \text{output}[i] = a[i] + b[i]\]

ํ•˜์ง€๋งŒ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์— ์ตœ์ ํ™”๋œ ์™„์ „ํžˆ ๋‹ค๋ฅธ ์‹คํ–‰ ์ „๋žต์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 1024
  • ํƒ€์ผ ํฌ๊ธฐ: TILE_SIZE = 32
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • SIMD ํญ: GPU ์˜์กด์  (ํƒ€์ผ ๋‚ด ์—ฐ์‚ฐ์šฉ)
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D ํ–‰ ์šฐ์„ )

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TILE_SIZE = 32


fn tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn process_tiles[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        output_tile = output.tile[tile_size](tile_id)
        a_tile = a.tile[tile_size](tile_id)
        b_tile = b.tile[tile_size](tile_id)

        # FILL IN (6 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ํƒ€์ผ ๊ตฌ์„ฑ ์ดํ•ดํ•˜๊ธฐ

ํƒ€์ผ๋ง ๋ฐฉ์‹์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ ์ • ํฌ๊ธฐ์˜ ์ฒญํฌ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค:

num_tiles = (size + tile_size - 1) // tile_size  # ์˜ฌ๋ฆผ ๋‚˜๋ˆ—์…ˆ

TILE_SIZE=32์ธ 1024๊ฐœ ์š”์†Œ ๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ: 1024 รท 32 = 32๊ฐœ ํƒ€์ผ์ด ์ •ํ™•ํžˆ ์ƒ๊น๋‹ˆ๋‹ค.

2. ํƒ€์ผ ์ถ”์ถœ ํŒจํ„ด

LayoutTensor .tile ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

tile_id = indices[0]  # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•  ํƒ€์ผ ํ•˜๋‚˜๋ฅผ ๋ฐ›์Œ
out_tile = output.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)

tile[size](id) ๋ฉ”์„œ๋“œ๋Š” id ร— size ์œ„์น˜๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” size๊ฐœ์˜ ์—ฐ์† ์š”์†Œ์— ๋Œ€ํ•œ ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

3. ํƒ€์ผ ๋‚ด ์ˆœ์ฐจ ์ฒ˜๋ฆฌ

์š”์†Œ๋ณ„ ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ, ํƒ€์ผ์„ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

@parameter
for i in range(tile_size):
    # ํ˜„์žฌ ํƒ€์ผ ๋‚ด์˜ ์š”์†Œ i๋ฅผ ์ฒ˜๋ฆฌ

์ด @parameter ๋ฃจํ”„๋Š” ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ๋ฉ๋‹ˆ๋‹ค.

4. ํƒ€์ผ ์š”์†Œ ๋‚ด SIMD ์—ฐ์‚ฐ

a_vec = a_tile.load[simd_width](i, 0)  # ํƒ€์ผ ๋‚ด ์œ„์น˜ i์—์„œ ๋กœ๋“œ
b_vec = b_tile.load[simd_width](i, 0)  # ํƒ€์ผ ๋‚ด ์œ„์น˜ i์—์„œ ๋กœ๋“œ
result = a_vec + b_vec                 # SIMD ๋ง์…ˆ (GPU ์˜์กด์  ํญ)
out_tile.store[simd_width](i, 0, result)  # ํƒ€์ผ ๋‚ด ์œ„์น˜ i์— ์ €์žฅ

5. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์˜ ์ฐจ์ด์ 

elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)

SIMD_WIDTH ๋Œ€์‹  1์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ํƒ€์ผ ์ „์ฒด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

6. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ธ์‚ฌ์ดํŠธ

๊ฐ ์Šค๋ ˆ๋“œ๋Š” ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ธ”๋ก(ํƒ€์ผ)์— ์ ‘๊ทผํ•œ ๋‹ค์Œ, ๋‹ค์Œ ํƒ€์ผ๋กœ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ฐ ์Šค๋ ˆ๋“œ์˜ ์‹คํ–‰ ๋‚ด์—์„œ ์šฐ์ˆ˜ํ•œ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.

7. ๋””๋ฒ„๊น… ํ•ต์‹ฌ ํฌ์ธํŠธ

ํƒ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋ฉด ์Šค๋ ˆ๋“œ ์‹คํ–‰ ์ˆ˜๋Š” ์ค„์–ด๋“ค์ง€๋งŒ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ์š”์†Œ๋ณ„: ~256๊ฐœ ์Šค๋ ˆ๋“œ (SIMD_WIDTH=4 ๊ธฐ์ค€), ๊ฐ๊ฐ 4๊ฐœ ์š”์†Œ ์ฒ˜๋ฆฌ
  • Tiled: ~32๊ฐœ ์Šค๋ ˆ๋“œ, ๊ฐ๊ฐ 32๊ฐœ ์š”์†Œ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ

์ฝ”๋“œ ์‹คํ–‰

ํ’€์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p23 --tiled
pixi run -e amd p23 --tiled
pixi run -e apple p23 --tiled
uv run poe p23 --tiled

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
...
tile_id: 29
tile_id: 30
tile_id: 31
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

์†”๋ฃจ์…˜

comptime TILE_SIZE = 32


fn tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn process_tiles[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]

        output_tile = output.tile[tile_size](tile_id)
        a_tile = a.tile[tile_size](tile_id)
        b_tile = b.tile[tile_size](tile_id)

        @parameter
        for i in range(tile_size):
            a_vec = a_tile.load[simd_width](Index(i))
            b_vec = b_tile.load[simd_width](Index(i))
            ret = a_vec + b_vec
            output_tile.store[simd_width](Index(i), ret)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)


ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ ํŒจํ„ด์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

1. ํƒ€์ผ๋ง ์ฒ ํ•™๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

ํƒ€์ผ๋ง์€ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์— ๋Œ€ํ•œ ์‚ฌ๊ณ  ๋ฐฉ์‹์˜ ๊ทผ๋ณธ์ ์ธ ์ „ํ™˜์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

์š”์†Œ๋ณ„ ๋ฐฉ์‹:

  • ๋„“์€ ๋ณ‘๋ ฌ์„ฑ: ๋งŽ์€ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ๊ฐ ์ตœ์†Œํ•œ์˜ ์ž‘์—… ์ˆ˜ํ–‰
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€ํ•˜: ์Šค๋ ˆ๋“œ๋“ค์ด ์ „์ฒด ๋ฐฐ์—ด์— ๋ถ„์‚ฐ
  • ์บ์‹œ ๋ฏธ์Šค: ์Šค๋ ˆ๋“œ ๊ฒฝ๊ณ„๋ฅผ ๋„˜๋‚˜๋“œ๋Š” ๋‚ฎ์€ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ

ํƒ€์ผ๋ง ๋ฐฉ์‹:

  • ๊นŠ์€ ๋ณ‘๋ ฌ์„ฑ: ๋” ์ ์€ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ๊ฐ ์ƒ๋‹นํ•œ ์ž‘์—… ์ˆ˜ํ–‰
  • ์ง€์—ญํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†์ ์ธ ๋ฐ์ดํ„ฐ์—์„œ ์ž‘์—…
  • ์บ์‹œ ์ตœ์ ํ™”: ์šฐ์ˆ˜ํ•œ ๊ณต๊ฐ„ ๋ฐ ์‹œ๊ฐ„ ์ง€์—ญ์„ฑ

2. ํƒ€์ผ ๊ตฌ์„ฑ๊ณผ ์ธ๋ฑ์‹ฑ

tile_id = indices[0]
out_tile = output.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)

ํƒ€์ผ ๋งคํ•‘ ์‹œ๊ฐํ™” (TILE_SIZE=32):

์›๋ณธ ๋ฐฐ์—ด: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ..., 1023]

Tile 0 (thread 0): [0, 1, 2, ..., 31]      โ† ์š”์†Œ 0-31
Tile 1 (thread 1): [32, 33, 34, ..., 63]   โ† ์š”์†Œ 32-63
Tile 2 (thread 2): [64, 65, 66, ..., 95]   โ† ์š”์†Œ 64-95
...
Tile 31 (thread 31): [992, 993, ..., 1023] โ† ์š”์†Œ 992-1023

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ:

  • tile[size](id)๋Š” ์›๋ณธ ํ…์„œ์— ๋Œ€ํ•œ ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ๋ทฐ๋Š” ์ œ๋กœ ์นดํ”ผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค - ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์‚ฌํ•˜์ง€ ์•Š๊ณ  ํฌ์ธํ„ฐ ์—ฐ์‚ฐ๋งŒ ์ˆ˜ํ–‰
  • ํƒ€์ผ ๊ฒฝ๊ณ„๋Š” ํ•ญ์ƒ tile_size ๋‹จ์œ„๋กœ ์ •๋ ฌ๋ฉ๋‹ˆ๋‹ค

3. ์ˆœ์ฐจ ์ฒ˜๋ฆฌ ์‹ฌ์ธต ๋ถ„์„

@parameter
for i in range(tile_size):
    a_vec = a_tile.load[simd_width](i, 0)
    b_vec = b_tile.load[simd_width](i, 0)
    ret = a_vec + b_vec
    out_tile.store[simd_width](i, 0, ret)

์™œ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ์ธ๊ฐ€?

  • ์บ์‹œ ์ตœ์ ํ™”: ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์บ์‹œ ํžˆํŠธ์œจ์„ ๊ทน๋Œ€ํ™”
  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”: @parameter ๋ฃจํ”„๊ฐ€ ์ปดํŒŒ์ผ ํƒ€์ž„์— ์™„์ „ํžˆ ์ „๊ฐœ๋จ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์ˆœ์ฐจ ์ ‘๊ทผ์ด ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ์„ค๊ณ„์— ๋ถ€ํ•ฉ
  • ์กฐ์ • ๋น„์šฉ ๊ฐ์†Œ: SIMD ๊ทธ๋ฃน ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ๋ถˆํ•„์š”

ํ•˜๋‚˜์˜ ํƒ€์ผ ๋‚ด ์‹คํ–‰ ํŒจํ„ด (TILE_SIZE=32, SIMD_WIDTH=4):

์Šค๋ ˆ๋“œ๊ฐ€ ํƒ€์ผ์„ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ:
Step 0: ์š”์†Œ [0:4]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
Step 1: ์š”์†Œ [4:8]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
Step 2: ์š”์†Œ [8:12]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
...
Step 7: ์š”์†Œ [28:32]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
ํ•ฉ๊ณ„: ์Šค๋ ˆ๋“œ๋‹น 8ํšŒ SIMD ์—ฐ์‚ฐ (32 รท 4 = 8)

4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

์บ์‹œ ๋™์ž‘ ๋น„๊ต:

์š”์†Œ๋ณ„ ํŒจํ„ด:

Thread 0: ๊ธ€๋กœ๋ฒŒ ์œ„์น˜ [0, 4, 8, 12, ...] ์ ‘๊ทผ    โ† Stride = SIMD_WIDTH
Thread 1: ๊ธ€๋กœ๋ฒŒ ์œ„์น˜ [4, 8, 12, 16, ...] ์ ‘๊ทผ   โ† Stride = SIMD_WIDTH
...
๊ฒฐ๊ณผ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์ „์ฒด ๋ฐฐ์—ด์— ๋ถ„์‚ฐ

Tiled ํŒจํ„ด:

Thread 0: ์œ„์น˜ [0:32]๋ฅผ ์ˆœ์ฐจ ์ ‘๊ทผ               โ† ์—ฐ์†์ ์ธ 32๊ฐœ ์š”์†Œ ๋ธ”๋ก
Thread 1: ์œ„์น˜ [32:64]๋ฅผ ์ˆœ์ฐจ ์ ‘๊ทผ             โ† ๋‹ค์Œ ์—ฐ์†์ ์ธ 32๊ฐœ ์š”์†Œ ๋ธ”๋ก
...
๊ฒฐ๊ณผ: ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ์™„๋ฒฝํ•œ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ

์บ์‹œ ํšจ์œจ ์‹œ์‚ฌ์ :

  • L1 ์บ์‹œ: ์ž‘์€ ํƒ€์ผ์ด L1 ์บ์‹œ์— ๋” ์ž˜ ๋งž์•„ ์บ์‹œ ๋ฏธ์Šค ๊ฐ์†Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์ˆœ์ฐจ ์ ‘๊ทผ์ด ์œ ํšจ ๋Œ€์—ญํญ์„ ๊ทน๋Œ€ํ™”
  • TLB ํšจ์œจ: TLB ๋ฏธ์Šค ๊ฐ์†Œ (์—ญ์ฃผ: TLB(Translation Lookaside Buffer)๋Š” ๊ฐ€์ƒ ์ฃผ์†Œ๋ฅผ ๋ฌผ๋ฆฌ ์ฃผ์†Œ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์บ์‹œ๋กœ, ๋ฏธ์Šค๊ฐ€ ์ค„๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋นจ๋ผ์ง‘๋‹ˆ๋‹ค)
  • ํ”„๋ฆฌํŽ˜์นญ: ํ•˜๋“œ์›จ์–ด ํ”„๋ฆฌํŽ˜์ฒ˜๊ฐ€ ์ˆœ์ฐจ ํŒจํ„ด์—์„œ ์ตœ์ ์œผ๋กœ ๋™์ž‘

5. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ์ „๋žต

elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)

์™œ SIMD_WIDTH ๋Œ€์‹  1์ธ๊ฐ€?

  • ์Šค๋ ˆ๋“œ ์ˆ˜: num_tiles ร— SIMD_WIDTH๊ฐ€ ์•„๋‹Œ ์ •ํ™•ํžˆ num_tiles๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋งŒ ์‹คํ–‰
  • ์ž‘์—… ๋ถ„๋ฐฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์™„์ „ํ•œ ํƒ€์ผ์„ ์ฒ˜๋ฆฌ
  • ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ: ์Šค๋ ˆ๋“œ๋‹น ๋” ๋งŽ์€ ์ž‘์—…, ์ „์ฒด์ ์œผ๋กœ ๋” ์ ์€ ์Šค๋ ˆ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ: ๊ฐ ์Šค๋ ˆ๋“œ์˜ ์ž‘์—…์ด ๊ณต๊ฐ„์ ์œผ๋กœ ์ง€์—ญํ™”

์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„:

  • ๋” ์ ์€ ๋…ผ๋ฆฌ์  ์Šค๋ ˆ๋“œ: ๋‚ฎ์€ ์ ์œ ์œจ์—์„œ ๋ชจ๋“  GPU ์ฝ”์–ด๋ฅผ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ์Œ
  • ์Šค๋ ˆ๋“œ๋‹น ๋” ๋งŽ์€ ์ž‘์—…: ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ๊ณผ ์กฐ์ • ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ
  • ์ˆœ์ฐจ ์ ‘๊ทผ: ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ
  • ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ: ์Šค๋ ˆ๋“œ ์‹คํ–‰ ๋ฐ ์กฐ์ • ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ

์ค‘์š” ์ฐธ๊ณ : โ€œ๋” ์ ์€ ์Šค๋ ˆ๋“œโ€œ๋Š” ๋…ผ๋ฆฌ์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. GPU ์Šค์ผ€์ค„๋Ÿฌ๋Š” ์—ฌ๋Ÿฌ ์›Œํ”„๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ ํšจ์œจ์ ์œผ๋กœ ์ „ํ™˜ํ•˜์—ฌ ๋†’์€ ํ•˜๋“œ์›จ์–ด ํ™œ์šฉ๋ฅ ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

6. ์„ฑ๋Šฅ ํŠน์„ฑ

ํƒ€์ผ๋ง์ด ๋„์›€์ด ๋˜๋Š” ๊ฒฝ์šฐ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ๋ณ‘๋ชฉ์ธ ๊ฒฝ์šฐ
  • ์บ์‹œ ๋ฏผ๊ฐ ์›Œํฌ๋กœ๋“œ: ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์˜ ์ด์ ์ด ์žˆ๋Š” ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ์—ฐ์‚ฐ: ์š”์†Œ๋‹น ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์€ ๊ฒฝ์šฐ
  • ์ œํ•œ๋œ ๋ณ‘๋ ฌ์„ฑ: GPU ์ฝ”์–ด๋ณด๋‹ค ์Šค๋ ˆ๋“œ๊ฐ€ ์ ์€ ๊ฒฝ์šฐ

ํƒ€์ผ๋ง์ด ๋ถˆ๋ฆฌํ•œ ๊ฒฝ์šฐ:

  • ๊ณ ๋„๋กœ ๋ณ‘๋ ฌ์ ์ธ ์›Œํฌ๋กœ๋“œ: ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ํ™œ์šฉ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋‹จ์ˆœํ•œ ์—ฐ์‚ฐ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์—ฐ์‚ฐ๋ณด๋‹ค ์ง€๋ฐฐ์ ์ธ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™์  ์ ‘๊ทผ ํŒจํ„ด: ํƒ€์ผ๋ง์ด ์ง€์—ญ์„ฑ์„ ๊ฐœ์„ ํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ

๋‹จ์ˆœ ๋ง์…ˆ ์˜ˆ์‹œ (TILE_SIZE=32):

  • ์Šค๋ ˆ๋“œ ์ˆ˜: 256๊ฐœ ๋Œ€์‹  32๊ฐœ (8๋ฐฐ ์ ์Œ)
  • ์Šค๋ ˆ๋“œ๋‹น ์ž‘์—…๋Ÿ‰: 4๊ฐœ ๋Œ€์‹  32๊ฐœ ์š”์†Œ (8๋ฐฐ ๋งŽ์Œ)
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์ˆœ์ฐจ vs ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ
  • ์บ์‹œ ํ™œ์šฉ: ํ›จ์”ฌ ๋‚˜์€ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ

7. ๊ณ ๊ธ‰ ํƒ€์ผ๋ง ๊ณ ๋ ค ์‚ฌํ•ญ

ํƒ€์ผ ํฌ๊ธฐ ์„ ํƒ:

  • ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด: ์บ์‹œ ํ™œ์šฉ์ด ๋–จ์–ด์ง€๊ณ , ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ฆ๊ฐ€
  • ๋„ˆ๋ฌด ํฌ๋ฉด: ์บ์‹œ์— ๋งž์ง€ ์•Š์„ ์ˆ˜ ์žˆ๊ณ , ๋ณ‘๋ ฌ์„ฑ์ด ๊ฐ์†Œ
  • ์ตœ์  ์ง€์ : L1 ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ณดํ†ต 16-64๊ฐœ ์š”์†Œ
  • ํ˜„์žฌ ์„ ํƒ: 32๊ฐœ ์š”์†Œ๋กœ ์บ์‹œ ํ™œ์šฉ๊ณผ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ท ํ˜• ๋‹ฌ์„ฑ

ํ•˜๋“œ์›จ์–ด ๊ณ ๋ ค ์‚ฌํ•ญ:

  • ์บ์‹œ ํฌ๊ธฐ: ๊ฐ€๋Šฅํ•˜๋ฉด ํƒ€์ผ์ด L1 ์บ์‹œ์— ๋งž์•„์•ผ ํ•จ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ํญ์„ ๊ณ ๋ ค
  • ์ฝ”์–ด ์ˆ˜: ๋ชจ๋“  ์ฝ”์–ด๋ฅผ ํ™œ์šฉํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•œ ํƒ€์ผ ํ™•๋ณด
  • SIMD ํญ: ํƒ€์ผ ํฌ๊ธฐ๋Š” SIMD ํญ์˜ ๋ฐฐ์ˆ˜์—ฌ์•ผ ํ•จ

๋น„๊ต ์š”์•ฝ:

Elementwise: ๋†’์€ ๋ณ‘๋ ฌ์„ฑ, ๋ถ„์‚ฐ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
Tiled:       ์ ๋‹นํ•œ ๋ณ‘๋ ฌ์„ฑ, ์ง€์—ญํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

์š”์†Œ๋ณ„ ํŒจํ„ด๊ณผ ํƒ€์ผ๋ง ํŒจํ„ด ๊ฐ„์˜ ์„ ํƒ์€ ํŠน์ • ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ, ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด, ๋Œ€์ƒ ํ•˜๋“œ์›จ์–ด ๋Šฅ๋ ฅ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

์š”์†Œ๋ณ„ ํŒจํ„ด๊ณผ ํƒ€์ผ๋ง ํŒจํ„ด์„ ๋ชจ๋‘ ์ดํ•ดํ–ˆ๋‹ค๋ฉด:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: ํƒ€์ผ๋ง์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์›์‹œ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ๋Ÿ‰๋ณด๋‹ค ๋” ์ค‘์š”ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ตœ๊ณ ์˜ GPU ์ฝ”๋“œ๋Š” ๋ณ‘๋ ฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”์˜ ๊ท ํ˜•์„ ๋งž์ถฅ๋‹ˆ๋‹ค.

vectorize - SIMD ์ œ์–ด

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์ˆ˜๋™ ๋ฒกํ„ฐํ™”์™€ vectorize๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ GPU ์ปค๋„ ๋‚ด์—์„œ SIMD ์—ฐ์‚ฐ์„ ์ •๋ฐ€ํ•˜๊ฒŒ ์ œ์–ดํ•˜๋Š” ๊ณ ๊ธ‰ ๋ฒกํ„ฐํ™” ๊ธฐ๋ฒ•์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ๋ฒกํ„ฐํ™”๋œ ์—ฐ์‚ฐ์— ๋Œ€ํ•ด ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ์ ‘๊ทผ๋ฒ•์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”: ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ํ†ตํ•œ ์ง์ ‘์ ์ธ SIMD ์ œ์–ด
  2. Mojo์˜ vectorize ํ•จ์ˆ˜: ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•œ ๊ณ ์ˆ˜์ค€ ๋ฒกํ„ฐํ™”

๋‘ ์ ‘๊ทผ๋ฒ• ๋ชจ๋‘ ํƒ€์ผ๋ง ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์ง€๋งŒ, ์ œ์–ด, ์•ˆ์ „์„ฑ, ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ฐ„์˜ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ฒกํ„ฐํ™” ์ „๋žต์€ ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ๊ณผ ๋ณต์žก๋„ ์ˆ˜์ค€์— ๋”ฐ๋ผ ๋‹ฌ๋ฆฌ ์„ ํƒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ์ˆ˜๋™ SIMD ์—ฐ์‚ฐ
  • ์•ˆ์ „ํ•˜๊ณ  ์ž๋™์ ์ธ ๋ฒกํ„ฐํ™”๋ฅผ ์œ„ํ•œ Mojo์˜ vectorize ํ•จ์ˆ˜
  • ์ตœ์ ์˜ SIMD ์ •๋ ฌ์„ ์œ„ํ•œ ์ฒญํฌ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์œ„ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ „๋žต
  • ์ˆ˜๋™ ์ œ์–ด์™€ ์•ˆ์ „์„ฑ ๊ฐ„์˜ ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

์ด์ „๊ณผ ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ: \[\Large \text{output}[i] = a[i] + b[i]\]

ํ•˜์ง€๋งŒ ์ตœ๋Œ€ ์„ฑ๋Šฅ์„ ์œ„ํ•œ ์ •๊ตํ•œ ๋ฒกํ„ฐํ™” ์ „๋žต์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 1024
  • ํƒ€์ผ ํฌ๊ธฐ: TILE_SIZE = 32
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • SIMD ํญ: GPU ์˜์กด์ 
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D ํ–‰ ์šฐ์„ )

1. ์ˆ˜๋™ ๋ฒกํ„ฐํ™” ๋ฐฉ์‹

์™„์„ฑํ•  ์ฝ”๋“œ

fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    comptime chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        print("tile_id:", tile_id)
        output_tile = output.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        # FILL IN (7 lines at most)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ์ฒญํฌ ๊ตฌ์„ฑ ์ดํ•ดํ•˜๊ธฐ

comptime chunk_size = tile_size * simd_width  # 32 * 4 = ์ฒญํฌ๋‹น 128๊ฐœ ์š”์†Œ

๊ฐ ํƒ€์ผ์€ ์ด์ œ ๋‹จ์ˆœํ•œ ์ˆœ์ฐจ ์š”์†Œ๊ฐ€ ์•„๋‹Œ ์—ฌ๋Ÿฌ SIMD ๊ทธ๋ฃน์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

2. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ

global_start = tile_id * chunk_size + i * simd_width

์ฒญํฌ ๋‚ด ๊ฐ SIMD ๋ฒกํ„ฐ์˜ ์ •ํ™•ํ•œ ์ „์—ญ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

3. ํ…์„œ ์ง์ ‘ ์ ‘๊ทผ

a_vec = a.load[simd_width](global_start, 0)     # ์ „์—ญ ํ…์„œ์—์„œ ๋กœ๋“œ
output.store[simd_width](global_start, 0, ret)  # ์ „์—ญ ํ…์„œ์— ์ €์žฅ

์ฐธ๊ณ : ํƒ€์ผ ๋ทฐ๊ฐ€ ์•„๋‹Œ ์›๋ณธ ํ…์„œ์— ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค.

4. ์ฃผ์š” ํŠน์„ฑ

  • ๋” ๋งŽ์€ ์ œ์–ด, ๋” ๋งŽ์€ ๋ณต์žก์„ฑ, ์ „์—ญ ํ…์„œ ์ ‘๊ทผ
  • ํ•˜๋“œ์›จ์–ด์— ๋Œ€ํ•œ ์™„๋ฒฝํ•œ SIMD ์ •๋ ฌ
  • ์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ์‹คํ–‰

pixi run p23 --manual-vectorized
pixi run -e amd p23 --manual-vectorized
pixi run -e apple p23 --manual-vectorized
uv run poe p23 --manual-vectorized

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
tile_id: 4
tile_id: 5
tile_id: 6
tile_id: 7
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ํ’€์ด

fn manual_vectorized_tiled_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    comptime chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    fn process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]

        output_tile = output.tile[chunk_size](tile_id)
        a_tile = a.tile[chunk_size](tile_id)
        b_tile = b.tile[chunk_size](tile_id)

        @parameter
        for i in range(tile_size):
            global_start = tile_id * chunk_size + i * simd_width

            a_vec = a.aligned_load[simd_width](Index(global_start))
            b_vec = b.aligned_load[simd_width](Index(global_start))
            ret = a_vec + b_vec
            # print("tile:", tile_id, "simd_group:", i, "global_start:", global_start, "a_vec:", a_vec, "b_vec:", b_vec, "result:", ret)

            output.store[simd_width](Index(global_start), ret)

    # Number of tiles needed: each tile processes chunk_size elements
    num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


์ˆ˜๋™ ๋ฒกํ„ฐํ™” ์‹ฌ์ธต ๋ถ„์„

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋Š” ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ํ†ตํ•ด SIMD ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ง์ ‘์ ์ธ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ์ฒญํฌ ๊ธฐ๋ฐ˜ ๊ตฌ์„ฑ: chunk_size = tile_size * simd_width
  • ์ „์—ญ ์ธ๋ฑ์‹ฑ: ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์˜ ์ง์ ‘ ๊ณ„์‚ฐ
  • ์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ด€๋ฆฌ: ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์ง์ ‘ ์ฒ˜๋ฆฌ

์•„ํ‚คํ…์ฒ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ:

comptime chunk_size = tile_size * simd_width  # 32 * 4 = 128

์ฒญํฌ ๊ตฌ์„ฑ ์‹œ๊ฐํ™” (TILE_SIZE=32, SIMD_WIDTH=4):

์›๋ณธ ๋ฐฐ์—ด: [0, 1, 2, 3, ..., 1023]

์ฒญํฌ 0 (thread 0): [0:128]    โ† 128๊ฐœ ์š”์†Œ = 4๊ฐœ์”ฉ 32๊ฐœ SIMD ๊ทธ๋ฃน
์ฒญํฌ 1 (thread 1): [128:256]  โ† ๋‹ค์Œ 128๊ฐœ ์š”์†Œ
์ฒญํฌ 2 (thread 2): [256:384]  โ† ๋‹ค์Œ 128๊ฐœ ์š”์†Œ
...
์ฒญํฌ 7 (thread 7): [896:1024] โ† ๋งˆ์ง€๋ง‰ 128๊ฐœ ์š”์†Œ

ํ•˜๋‚˜์˜ ์ฒญํฌ ๋‚ด ์ฒ˜๋ฆฌ:

@parameter
for i in range(tile_size):  # i = 0, 1, 2, ..., 31
    global_start = tile_id * chunk_size + i * simd_width
    # tile_id=0์ผ ๋•Œ: global_start = 0, 4, 8, 12, ..., 124
    # tile_id=1์ผ ๋•Œ: global_start = 128, 132, 136, 140, ..., 252

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜: 8๊ฐœ ์Šค๋ ˆ๋“œ (1024 รท 128 = 8)
  • ์Šค๋ ˆ๋“œ๋‹น ์ž‘์—…๋Ÿ‰: 128๊ฐœ ์š”์†Œ (๊ฐ 4๊ฐœ ์š”์†Œ์˜ SIMD ์—ฐ์‚ฐ 32ํšŒ)
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์™„๋ฒฝํ•œ SIMD ์ •๋ ฌ์„ ๊ฐ–์ถ˜ ๋Œ€ํ˜• ์ฒญํฌ
  • ์˜ค๋ฒ„ํ—ค๋“œ: ์ตœ์†Œ - ํ•˜๋“œ์›จ์–ด์— ์ง์ ‘ ๋งคํ•‘
  • ์•ˆ์ „์„ฑ: ์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”

์ฃผ์š” ์žฅ์ :

  • ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ์ธ๋ฑ์‹ฑ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ๋Œ€ํ•œ ์ •ํ™•ํ•œ ์ œ์–ด
  • ์ตœ์ ์˜ ์ •๋ ฌ: SIMD ์—ฐ์‚ฐ์ด ํ•˜๋“œ์›จ์–ด์— ์™„๋ฒฝํžˆ ์ •๋ ฌ
  • ์ตœ๋Œ€ ์ฒ˜๋ฆฌ๋Ÿ‰: ์•ˆ์ „์„ฑ ๊ฒ€์‚ฌ๋กœ ์ธํ•œ ์˜ค๋ฒ„ํ—ค๋“œ ์—†์Œ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU SIMD ์œ ๋‹›์— ์ง์ ‘ ๋งคํ•‘

์ฃผ์š” ๊ณผ์ œ:

  • ์ธ๋ฑ์Šค ๋ณต์žก์„ฑ: ์ „์—ญ ์œ„์น˜์˜ ์ˆ˜๋™ ๊ณ„์‚ฐ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ์ฑ…์ž„: ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์ง์ ‘ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•จ
  • ๋””๋ฒ„๊น… ๋‚œ์ด๋„: ์ •ํ™•์„ฑ ๊ฒ€์ฆ์ด ๋” ๋ณต์žก

2. Mojo vectorize ๋ฐฉ์‹

์™„์„ฑํ•  ์ฝ”๋“œ

fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start
        print(
            "tile_id:",
            tile_id,
            "tile_start:",
            tile_start,
            "tile_end:",
            tile_end,
            "actual_tile_size:",
            actual_tile_size,
        )

        # FILL IN (9 lines at most)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ํƒ€์ผ ๊ฒฝ๊ณ„ ๊ณ„์‚ฐ

tile_start = tile_id * tile_size
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

๋งˆ์ง€๋ง‰ ํƒ€์ผ์ด tile_size๋ณด๋‹ค ์ž‘์„ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

2. ๋ฒกํ„ฐํ™” ํ•จ์ˆ˜ ํŒจํ„ด

fn vectorized_add[
  width: Int
](i: Int) unified {read tile_start, read a, read b, mut output}:
    global_idx = tile_start + i
    if global_idx + width <= size:  # ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
        # SIMD ์—ฐ์‚ฐ ์ฝ”๋“œ

width ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” vectorize ํ•จ์ˆ˜์— ์˜ํ•ด ์ž๋™์œผ๋กœ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.

3. vectorize ํ˜ธ์ถœ

vectorize[simd_width](actual_tile_size, vectorized_add)

์ œ๊ณต๋œ SIMD ํญ์œผ๋กœ ๋ฒกํ„ฐํ™” ๋ฃจํ”„๋ฅผ ์ž๋™ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

4. ์ฃผ์š” ํŠน์„ฑ

  • ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ, ๋‚ด์žฅ ์•ˆ์ „์„ฑ, ํƒ€์ผ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ
  • ๋ช…์‹œ์  SIMD ํญ ๋งค๊ฐœ๋ณ€์ˆ˜ ์‚ฌ์šฉ
  • ๋‚ด์žฅ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ์ž๋™ ๋‚˜๋จธ์ง€ ์š”์†Œ ์ฒ˜๋ฆฌ

Mojo vectorize ์‹คํ–‰

pixi run p23 --vectorized
uv run poe p23 --vectorized

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0 tile_start: 0 tile_end: 32 actual_tile_size: 32
tile_id: 1 tile_start: 32 tile_end: 64 actual_tile_size: 32
tile_id: 2 tile_start: 64 tile_end: 96 actual_tile_size: 32
tile_id: 3 tile_start: 96 tile_end: 128 actual_tile_size: 32
...
tile_id: 29 tile_start: 928 tile_end: 960 actual_tile_size: 32
tile_id: 30 tile_start: 960 tile_end: 992 actual_tile_size: 32
tile_id: 31 tile_start: 992 tile_end: 1024 actual_tile_size: 32
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Mojo vectorize ํ’€์ด

fn vectorize_within_tiles_elementwise_add[
    layout: Layout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: LayoutTensor[mut=True, dtype, layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    fn process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        tile_id = indices[0]
        tile_start = tile_id * tile_size
        tile_end = min(tile_start + tile_size, size)
        actual_tile_size = tile_end - tile_start

        fn vectorized_add[
            width: Int
        ](i: Int) unified {read tile_start, read a, read b, mut output}:
            global_idx = tile_start + i
            if global_idx + width <= size:
                a_vec = a.aligned_load[width](Index(global_idx))
                b_vec = b.aligned_load[width](Index(global_idx))
                result = a_vec + b_vec
                output.store[width](Index(global_idx), result)

        # Use vectorize within each tile
        vectorize[simd_width](actual_tile_size, vectorized_add)

    num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


Mojo vectorize ์‹ฌ์ธต ๋ถ„์„

Mojo์˜ vectorize ํ•จ์ˆ˜๋Š” ๋‚ด์žฅ ์•ˆ์ „์„ฑ๊ณผ ํ•จ๊ป˜ ์ž๋™ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ๋ช…์‹œ์  SIMD ํญ ๋งค๊ฐœ๋ณ€์ˆ˜: ์‚ฌ์šฉํ•  simd_width๋ฅผ ์ง์ ‘ ์ง€์ •
  • ๋‚ด์žฅ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ๋ฅผ ์ž๋™์œผ๋กœ ๋ฐฉ์ง€
  • ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ: ๋‚จ์€ ์š”์†Œ๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์ค‘์ฒฉ ํ•จ์ˆ˜ ํŒจํ„ด: ๋ฒกํ„ฐํ™” ๋กœ์ง์˜ ๊น”๋”ํ•œ ๋ถ„๋ฆฌ

ํƒ€์ผ ๊ธฐ๋ฐ˜ ๊ตฌ์„ฑ:

tile_start = tile_id * tile_size    # 0, 32, 64, 96, ...
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

์ž๋™ ๋ฒกํ„ฐํ™” ๋ฉ”์ปค๋‹ˆ์ฆ˜:

fn vectorized_add[
  width: Int
](i: Int) unified {read tile_start, read a, read b, mut output}:
    global_idx = tile_start + i
    if global_idx + width <= size:
        # ์ž๋™ SIMD ์ตœ์ ํ™”

vectorize์˜ ๋™์ž‘ ๋ฐฉ์‹:

  • ์ž๋™ ์ฒญํฌ ๋ถ„ํ• : actual_tile_size๋ฅผ ์ง€์ •ํ•œ simd_width์˜ ์ฒญํฌ๋กœ ๋ถ„ํ• 
  • ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ: ๋‚จ์€ ์š”์†Œ๋ฅผ ๋” ์ž‘์€ ํญ์œผ๋กœ ์ž๋™ ์ฒ˜๋ฆฌ
  • ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ: ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ๋ฅผ ์ž๋™์œผ๋กœ ๋ฐฉ์ง€
  • ๋ฃจํ”„ ๊ด€๋ฆฌ: ๋ฒกํ„ฐํ™” ๋ฃจํ”„๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ

์‹คํ–‰ ์‹œ๊ฐํ™” (TILE_SIZE=32, SIMD_WIDTH=4):

Tile 0 ์ฒ˜๋ฆฌ:
  vectorize ํ˜ธ์ถœ 0: ์š”์†Œ [0:4]๋ฅผ SIMD_WIDTH=4๋กœ ์ฒ˜๋ฆฌ
  vectorize ํ˜ธ์ถœ 1: ์š”์†Œ [4:8]๋ฅผ SIMD_WIDTH=4๋กœ ์ฒ˜๋ฆฌ
  ...
  vectorize ํ˜ธ์ถœ 7: ์š”์†Œ [28:32]๋ฅผ SIMD_WIDTH=4๋กœ ์ฒ˜๋ฆฌ
  ํ•ฉ๊ณ„: 8ํšŒ ์ž๋™ SIMD ์—ฐ์‚ฐ

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜: 32๊ฐœ ์Šค๋ ˆ๋“œ (1024 รท 32 = 32)
  • ์Šค๋ ˆ๋“œ๋‹น ์ž‘์—…๋Ÿ‰: 32๊ฐœ ์š”์†Œ (์ž๋™ SIMD ์ฒญํฌ ๋ถ„ํ• )
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์ž๋™ ๋ฒกํ„ฐํ™”๋ฅผ ๊ฐ–์ถ˜ ์ž‘์€ ํƒ€์ผ
  • ์˜ค๋ฒ„ํ—ค๋“œ: ์•ฝ๊ฐ„ - ์ž๋™ ์ตœ์ ํ™” ๋ฐ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ์•ˆ์ „์„ฑ: ๋‚ด์žฅ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ

์„ฑ๋Šฅ ๋น„๊ต์™€ ๋ชจ๋ฒ” ์‚ฌ๋ก€

๊ฐ ์ ‘๊ทผ๋ฒ•์˜ ์„ ํƒ ๊ธฐ์ค€

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋ฅผ ์„ ํƒํ•  ๋•Œ:

  • ์ตœ๋Œ€ ์„ฑ๋Šฅ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ
  • ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ณ  ์ •๋ ฌ๋œ ๋ฐ์ดํ„ฐ ํŒจํ„ด์ด ์žˆ๋Š” ๊ฒฝ์šฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ์ˆ˜๋™์œผ๋กœ ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ
  • ํ•˜๋“œ์›จ์–ด๋ณ„ ์ตœ์ ํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ

Mojo vectorize๋ฅผ ์„ ํƒํ•  ๋•Œ:

  • ๊ฐœ๋ฐœ ์†๋„์™€ ์•ˆ์ „์„ฑ์ด ์šฐ์„ ์ธ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™ํ•˜๊ฑฐ๋‚˜ ๋™์ ์ธ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฒฝ์šฐ
  • ์ˆ˜๋™ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ๊ด€๋ฆฌ ๋Œ€์‹  ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ๋ฅผ ์›ํ•˜๋Š” ๊ฒฝ์šฐ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ณต์žก๋„๊ฐ€ ์˜ค๋ฅ˜๋ฅผ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ
  • ์ˆ˜๋™ ๋ฃจํ”„ ๊ด€๋ฆฌ๋ณด๋‹ค ๊น”๋”ํ•œ ๋ฒกํ„ฐํ™” ํŒจํ„ด์„ ์„ ํ˜ธํ•˜๋Š” ๊ฒฝ์šฐ

๊ณ ๊ธ‰ ์ตœ์ ํ™” ์ธ์‚ฌ์ดํŠธ

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ:

์ˆ˜๋™:      8 ์Šค๋ ˆ๋“œ ร— 32 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ
vectorize: 32 ์Šค๋ ˆ๋“œ ร— 8 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ

๋‘˜ ๋‹ค ๋น„์Šทํ•œ ์ด ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๋‹ฌ์„ฑํ•˜์ง€๋งŒ, ๋ณ‘๋ ฌ์„ฑ ์ „๋žต์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

์บ์‹œ ๋™์ž‘:

  • ์ˆ˜๋™: ๋Œ€ํ˜• ์ฒญํฌ๊ฐ€ L1 ์บ์‹œ๋ฅผ ์ดˆ๊ณผํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์™„๋ฒฝํ•œ ์ˆœ์ฐจ ์ ‘๊ทผ
  • vectorize: ์ž‘์€ ํƒ€์ผ์ด ์บ์‹œ์— ๋” ์ž˜ ๋งž๊ณ , ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ

ํ•˜๋“œ์›จ์–ด ๋งคํ•‘:

  • ์ˆ˜๋™: ์›Œํ”„ ํ™œ์šฉ๊ณผ SIMD ์œ ๋‹› ๋งคํ•‘์— ๋Œ€ํ•œ ์ง์ ‘ ์ œ์–ด
  • vectorize: ์ž๋™ ๋ฃจํ”„ ๋ฐ ๋‚˜๋จธ์ง€ ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ๊ฐ„์†Œํ™”๋œ ๋ฒกํ„ฐํ™”

๋ชจ๋ฒ” ์‚ฌ๋ก€ ์š”์•ฝ

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ๋ชจ๋ฒ” ์‚ฌ๋ก€:

  • ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ํ•ญ์ƒ ์‹ ์ค‘ํ•˜๊ฒŒ ๊ฒ€์ฆ
  • ๊ฐ€๋Šฅํ•˜๋ฉด chunk_size์— ์ปดํŒŒ์ผ ํƒ€์ž„ ์ƒ์ˆ˜ ์‚ฌ์šฉ
  • ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ”„๋กœํŒŒ์ผ๋ง
  • ์ตœ์ ์˜ SIMD ์„ฑ๋Šฅ์„ ์œ„ํ•œ ์ •๋ ฌ ์š”๊ตฌ ์‚ฌํ•ญ ๊ณ ๋ ค

Mojo vectorize ๋ชจ๋ฒ” ์‚ฌ๋ก€:

  • ๋ฐ์ดํ„ฐ์™€ ํ•˜๋“œ์›จ์–ด์— ์ ํ•ฉํ•œ SIMD ํญ ์„ ํƒ
  • ๋ฏธ์„ธ ์ตœ์ ํ™”๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ช…ํ™•์„ฑ์— ์ง‘์ค‘
  • ๊น”๋”ํ•œ ๋ฒกํ„ฐํ™” ๋กœ์ง์„ ์œ„ํ•ด ์ค‘์ฒฉ ํŒŒ๋ผ๋ฏธํ„ฐ ํ•จ์ˆ˜ ์‚ฌ์šฉ
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด์—๋Š” ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ ์‹ ๋ขฐ

๋‘ ์ ‘๊ทผ๋ฒ• ๋ชจ๋‘ GPU ์„ฑ๋Šฅ ์ตœ์ ํ™” ๋„๊ตฌ ๋ชจ์Œ์—์„œ ์œ ํšจํ•œ ์ „๋žต์ž…๋‹ˆ๋‹ค. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋Š” ์ตœ๋Œ€ํ•œ์˜ ์ œ์–ด๋ฅผ, Mojo์˜ vectorize๋Š” ์•ˆ์ „์„ฑ๊ณผ ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ชจ๋‘ ์ดํ•ดํ–ˆ๋‹ค๋ฉด:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: ๋ฒกํ„ฐํ™” ์ „๋žต์€ ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ์— ๋”ฐ๋ผ ๋‹ฌ๋ฆฌ ์„ ํƒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋Š” ์ตœ๋Œ€ํ•œ์˜ ์ œ์–ด๋ฅผ, Mojo์˜ vectorize ํ•จ์ˆ˜๋Š” ์•ˆ์ „์„ฑ๊ณผ ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์ธ ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ๊ณผ ๊ฐœ๋ฐœ ์ œ์•ฝ ์กฐ๊ฑด์— ๋”ฐ๋ผ ์„ ํƒํ•˜์„ธ์š”.

๐Ÿง  GPU ์Šค๋ ˆ๋”ฉ vs SIMD - ์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

๊ฐœ์š”

์š”์†Œ๋ณ„, ํƒ€์ผ๋ง, ๋ฒกํ„ฐํ™” ํŒจํ„ด์„ ํƒ๊ตฌํ•˜๋ฉด์„œ GPU ์—ฐ์‚ฐ์„ ๊ตฌ์„ฑํ•˜๋Š” ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด ์„น์…˜์—์„œ๋Š” GPU ์Šค๋ ˆ๋“œ์™€ SIMD ์—ฐ์‚ฐ ์‚ฌ์ด์˜ ๊ทผ๋ณธ์ ์ธ ๊ด€๊ณ„๋ฅผ ๋ช…ํ™•ํžˆ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‘˜์€ ์„œ๋กœ ๋‹ค๋ฅด์ง€๋งŒ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ๋ณ‘๋ ฌ์„ฑ ์ˆ˜์ค€์œผ๋กœ, ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ํ•จ๊ป˜ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ตฌ์กฐ๋ฅผ ์ œ๊ณตํ•˜๊ณ , SIMD ์—ฐ์‚ฐ์ด ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

GPU ์Šค๋ ˆ๋”ฉ ๊ณ„์ธต ๊ตฌ์กฐ

GPU ์‹คํ–‰์€ ํ•˜๋“œ์›จ์–ด์˜ ๋ณต์žก์„ฑ์„ ์ถ”์ƒํ™”ํ•˜๋Š” ์ž˜ ์ •์˜๋œ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

GPU Device
โ”œโ”€โ”€ Grid (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ Block 1 (์Šค๋ ˆ๋“œ ๊ทธ๋ฃน, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
โ”‚   โ”‚   โ”œโ”€โ”€ ์›Œํ”„ 1 (32๊ฐœ ์Šค๋ ˆ๋“œ, ๋ก์Šคํ… ์‹คํ–‰)
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 1 โ†’ SIMD ์—ฐ์‚ฐ
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 2 โ†’ SIMD ์—ฐ์‚ฐ
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ... (์ด 32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ”‚   โ””โ”€โ”€ ์›Œํ”„ 2 (32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ Block 2 (๋…๋ฆฝ์ ์ธ ๊ทธ๋ฃน)

๐Ÿ’ก ์ฐธ๊ณ : ์ด Part๋Š” ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ์œผ๋ฉฐ, ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๊ณ ๊ธ‰ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋Š” Part VII์—์„œ ์ž์„ธํžˆ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

Mojo๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ๋“ค:

  • ๊ทธ๋ฆฌ๋“œ/๋ธ”๋ก ๊ตฌ์„ฑ: ๋ฌธ์ œ ํฌ๊ธฐ์— ๋”ฐ๋ผ ์ž๋™ ๊ณ„์‚ฐ
  • ์›Œํ”„ ๊ด€๋ฆฌ: ํ•˜๋“œ์›จ์–ด๊ฐ€ 32๊ฐœ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์„ ํˆฌ๋ช…ํ•˜๊ฒŒ ์ฒ˜๋ฆฌ
  • ์Šค๋ ˆ๋“œ ์Šค์ผ€์ค„๋ง: GPU ์Šค์ผ€์ค„๋Ÿฌ๊ฐ€ ์‹คํ–‰์„ ์ž๋™ ๊ด€๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ์— ์ตœ์ ์˜ ์ ‘๊ทผ ํŒจํ„ด ๋‚ด์žฅ

GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ SIMD

๊ฐ GPU ์Šค๋ ˆ๋“œ๋Š” SIMD (Single Instruction, Multiple Data) ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ํ•˜๋‚˜์˜ GPU ์Šค๋ ˆ๋“œ ๋‚ด๋ถ€:
a_simd = a.load[simd_width](idx, 0)      # float 4๊ฐœ๋ฅผ ๋™์‹œ์— ๋กœ๋“œ
b_simd = b.load[simd_width](idx, 0)      # float 4๊ฐœ๋ฅผ ๋™์‹œ์— ๋กœ๋“œ
result = a_simd + b_simd                 # 4์Œ์„ ๋™์‹œ์— ๋ง์…ˆ
output.store[simd_width](idx, 0, result) # ๊ฒฐ๊ณผ 4๊ฐœ๋ฅผ ๋™์‹œ์— ์ €์žฅ

ํŒจํ„ด ๋น„๊ต์™€ ์Šค๋ ˆ๋“œ-์ž‘์—… ๋งคํ•‘

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ: ๋ชจ๋“  ํŒจํ„ด์€ ๋™์ผํ•œ ์ด ์ž‘์—…๋Ÿ‰ - SIMD_WIDTH=4๋กœ 1024๊ฐœ ์š”์†Œ์— ๋Œ€ํ•ด 256ํšŒ์˜ SIMD ์—ฐ์‚ฐ - ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฐจ์ด์ ์€ ์ด ์ž‘์—…์ด GPU ์Šค๋ ˆ๋“œ์— ์–ด๋–ป๊ฒŒ ๋ถ„๋ฐฐ๋˜๋А๋ƒ์— ์žˆ์Šต๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ๋น„๊ต (SIZE=1024, SIMD_WIDTH=4)

| ํŒจํ„ด | ์Šค๋ ˆ๋“œ ์ˆ˜ | ์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ | ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด | ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ |
| --- | --- | --- | --- | --- |
| ์š”์†Œ๋ณ„ | 256 | 1 | ๋ถ„์‚ฐ ์ ‘๊ทผ | ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ, ๋‚ฎ์€ ์ง€์—ญ์„ฑ |
| ํƒ€์ผ๋ง | 32 | 8 | ์†Œํ˜• ๋ธ”๋ก | ๋ณ‘๋ ฌ์„ฑ + ์ง€์—ญ์„ฑ ๊ท ํ˜• |
| ์ˆ˜๋™ ๋ฒกํ„ฐํ™” | 8 | 32 | ๋Œ€ํ˜• ์ฒญํฌ | ๋†’์€ ๋Œ€์—ญํญ, ์ ์€ ์Šค๋ ˆ๋“œ |
| Mojo vectorize | 32 | 8 | ์Šค๋งˆํŠธ ๋ธ”๋ก | ์ž๋™ ์ตœ์ ํ™” |

์ƒ์„ธ ์‹คํ–‰ ํŒจํ„ด

์š”์†Œ๋ณ„ ํŒจํ„ด:

Thread 0: [0,1,2,3] โ†’ Thread 1: [4,5,6,7] โ†’ ... โ†’ Thread 255: [1020,1021,1022,1023]
256 ์Šค๋ ˆ๋“œ ร— 1 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ

ํƒ€์ผ๋ง ํŒจํ„ด:

Thread 0: [0:32] (8 SIMD) โ†’ Thread 1: [32:64] (8 SIMD) โ†’ ... โ†’ Thread 31: [992:1024] (8 SIMD)
32 ์Šค๋ ˆ๋“œ ร— 8 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ํŒจํ„ด:

Thread 0: [0:128] (32 SIMD) โ†’ Thread 1: [128:256] (32 SIMD) โ†’ ... โ†’ Thread 7: [896:1024] (32 SIMD)
8 ์Šค๋ ˆ๋“œ ร— 32 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ

Mojo vectorize ํŒจํ„ด:

Thread 0: [0:32] ์ž๋™ ๋ฒกํ„ฐํ™” โ†’ Thread 1: [32:64] ์ž๋™ ๋ฒกํ„ฐํ™” โ†’ ... โ†’ Thread 31: [992:1024] ์ž๋™ ๋ฒกํ„ฐํ™”
32 ์Šค๋ ˆ๋“œ ร— 8 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ
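๋„ค ํŒจํ„ด์˜ ๊ตฌ์„ฑ์ด ์ •๋ง ๊ฐ™์€ ์ด ์ž‘์—…๋Ÿ‰์ธ์ง€๋Š” ๊ฐ„๋‹จํ•œ ์‚ฐ์ˆ ๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ�˜ ์žˆ์Šต๋‹ˆ๋‹ค:

```python
# ๊ฐ ํŒจํ„ด์˜ (์Šค๋ ˆ๋“œ ์ˆ˜, ์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ)์ด ๋ชจ๋‘ ๊ฐ™์€ ์ด๋Ÿ‰์ธ์ง€ ๊ฒ€์ฆ
SIZE, SIMD_WIDTH = 1024, 4
patterns = {
    "elementwise":       (256, 1),
    "tiled":             (32, 8),
    "manual_vectorized": (8, 32),
    "mojo_vectorize":    (32, 8),
}
for name, (threads, simd_ops) in patterns.items():
    total = threads * simd_ops
    # ์ด SIMD ์—ฐ์‚ฐ ์ˆ˜ = SIZE / SIMD_WIDTH = 256 (๋ชจ๋“  ํŒจํ„ด ๋™์ผ)
    assert total == SIZE // SIMD_WIDTH == 256, name
print("๋ชจ๋“  ํŒจํ„ด: ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ")
```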

์„ฑ๋Šฅ ํŠน์„ฑ๊ณผ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

ํ•ต์‹ฌ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ ์š”์•ฝ

์ธก๋ฉด์Šค๋ ˆ๋“œ ๋งŽ์Œ (์š”์†Œ๋ณ„)์Šค๋ ˆ๋“œ ์ค‘๊ฐ„ (ํƒ€์ผ๋ง/vectorize)์Šค๋ ˆ๋“œ ์ ์Œ (์ˆ˜๋™)
๋ณ‘๋ ฌ์„ฑ์ตœ๋Œ€ ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰๊ท ํ˜• ์žกํžŒ ์ ‘๊ทผ์ตœ์†Œํ•œ์˜ ๋ณ‘๋ ฌ์„ฑ
์บ์‹œ ์ง€์—ญ์„ฑ์Šค๋ ˆ๋“œ ๊ฐ„ ๋‚ฎ์Œํƒ€์ผ ๋‚ด์—์„œ ์–‘ํ˜ธ์ˆœ์ฐจ ์ ‘๊ทผ์œผ๋กœ ์šฐ์ˆ˜
๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์–‘ํ˜ธํ•œ ๋ณ‘ํ•ฉ์–‘ํ˜ธ + ์บ์‹œ ์žฌ์‚ฌ์šฉ์ด๋ก ์  ์ตœ๋Œ“๊ฐ’
๋ณต์žก๋„๊ฐ€์žฅ ๋‹จ์ˆœ๋ณดํ†ต๊ฐ€์žฅ ๋ณต์žก

๊ฐ ํŒจํ„ด์˜ ์„ ํƒ ๊ธฐ์ค€

์š”์†Œ๋ณ„ ํŒจํ„ด์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • ์š”์†Œ๋‹น ์—ฐ์‚ฐ๋Ÿ‰์ด ์ ์€ ๋‹จ์ˆœํ•œ ์—ฐ์‚ฐ
  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰์„ ์œ„ํ•ด ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํฌ๊ธฐ์— ๋Œ€ํ•œ ํ™•์žฅ์„ฑ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ

ํƒ€์ผ๋ง/vectorize๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ:

  • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์˜ ์ด์ ์ด ์žˆ๋Š” ์บ์‹œ ๋ฏผ๊ฐ ์—ฐ์‚ฐ
  • ์„ฑ๋Šฅ๊ณผ ์œ ์ง€๋ณด์ˆ˜์„ฑ์˜ ๊ท ํ˜•์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ์ž๋™ ์ตœ์ ํ™”(vectorize)๊ฐ€ ์„ ํ˜ธ๋˜๋Š” ๊ฒฝ์šฐ

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ:

  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ์ตœ๋Œ€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ
  • ๊ฐœ๋ฐœ ๋ณต์žก๋„๋ฅผ ๊ฐ์ˆ˜ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ

ํ•˜๋“œ์›จ์–ด ๊ณ ๋ ค ์‚ฌํ•ญ

ํ˜„๋Œ€ GPU ์•„ํ‚คํ…์ฒ˜์—๋Š” Mojo๊ฐ€ ์ถ”์ƒํ™”ํ•˜๋Š” ์—ฌ๋Ÿฌ ์ˆ˜์ค€์ด ์žˆ์Šต๋‹ˆ๋‹ค:

ํ•˜๋“œ์›จ์–ด ์‹ค์ œ ๊ตฌ์กฐ:

  • ์›Œํ”„: 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ก์Šคํ…์œผ๋กœ ์‹คํ–‰
  • Streaming Multiprocessor (SM): ์—ฌ๋Ÿฌ ์›Œํ”„๊ฐ€ ๋™์‹œ์— ์‹คํ–‰
  • SIMD ์œ ๋‹›: ๊ฐ SM ๋‚ด์˜ ๋ฒกํ„ฐ ์ฒ˜๋ฆฌ ์œ ๋‹›
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: L1/L2 ์บ์‹œ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ

Mojo ์ถ”์ƒํ™”์˜ ์ด์ :

  • ์›Œํ”„ ์ •๋ ฌ๊ณผ ์Šค์ผ€์ค„๋ง์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํˆฌ๋ช…ํ•˜๊ฒŒ ์ตœ์ ํ™”
  • SM ๊ฐ„ ๋ฆฌ์†Œ์Šค ํ• ๋‹น์„ ๊ด€๋ฆฌ
  • GPU ๋ฒค๋” ๊ฐ„ ์ด์‹ ๊ฐ€๋Šฅํ•œ ์„ฑ๋Šฅ ์ œ๊ณต

์„ฑ๋Šฅ์— ๋Œ€ํ•œ ์‚ฌ๊ณ  ๋ชจ๋ธ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋‘ ๊ฐ€์ง€ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ๋ณ‘๋ ฌ์„ฑ ์œ ํ˜•์„ ๊ด€๋ฆฌํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”:

์Šค๋ ˆ๋“œ ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ:

  • ๋ณ‘๋ ฌ ๊ตฌ์กฐ๋ฅผ ์ œ๊ณต (์‹คํ–‰ ์œ ๋‹›์˜ ์ˆ˜)
  • ๋™์‹œ ์‹คํ–‰์„ ํ†ตํ•œ ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰ ๊ฐ€๋Šฅ
  • GPU ์Šค์ผ€์ค„๋Ÿฌ๊ฐ€ ์ž๋™์œผ๋กœ ๊ด€๋ฆฌ

SIMD ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ:

  • ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณต
  • ์Šค๋ ˆ๋“œ๋‹น ์‚ฐ์ˆ  ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๊ทน๋Œ€ํ™”
  • ๋ฒกํ„ฐ ์ฒ˜๋ฆฌ ์œ ๋‹›์„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉ

์ตœ์  ์„ฑ๋Šฅ ๊ณต์‹:

์„ฑ๋Šฅ = (์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰์„ ์œ„ํ•œ ์ถฉ๋ถ„ํ•œ ์Šค๋ ˆ๋“œ) ร—
       (ํšจ์œจ์ ์ธ SIMD ํ™œ์šฉ) ร—
       (์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด)

ํ™•์žฅ์„ฑ ๊ณ ๋ ค ์‚ฌํ•ญ

๋ฌธ์ œ ํฌ๊ธฐ์ตœ์  ํŒจํ„ด๊ทผ๊ฑฐ
์†Œ๊ทœ๋ชจ (< 1K)ํƒ€์ผ๋ง/vectorize๋‚ฎ์€ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ
์ค‘๊ทœ๋ชจ (1K-1M)๋ชจ๋“  ํŒจํ„ด์œ ์‚ฌํ•œ ์„ฑ๋Šฅ
๋Œ€๊ทœ๋ชจ (> 1M)๋ณดํ†ต ์š”์†Œ๋ณ„๋ณ‘๋ ฌ์„ฑ์ด ์ง€๋ฐฐ์ 

์ตœ์ ์˜ ์„ ํƒ์€ ํŠน์ • ํ•˜๋“œ์›จ์–ด, ์›Œํฌ๋กœ๋“œ ๋ณต์žก๋„, ๊ฐœ๋ฐœ ์ œ์•ฝ ์กฐ๊ฑด์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

GPU ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…์„ ํ™•์‹คํžˆ ์ดํ•ดํ–ˆ๋‹ค๋ฉด:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: GPU ์Šค๋ ˆ๋“œ์™€ SIMD ์—ฐ์‚ฐ์€ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ๋ณ‘๋ ฌ์„ฑ ์ˆ˜์ค€์œผ๋กœ ํ•จ๊ป˜ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‘˜์˜ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๋ฉด ๊ตฌ์ฒด์ ์ธ ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ๊ณผ ์ œ์•ฝ ์กฐ๊ฑด์— ๋งž๋Š” ์˜ฌ๋ฐ”๋ฅธ ํŒจํ„ด์„ ์„ ํƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ“Š Mojo ๋ฒค์น˜๋งˆํ‚น - ์„ฑ๋Šฅ ๋ถ„์„๊ณผ ์ตœ์ ํ™”

๊ฐœ์š”

์š”์†Œ๋ณ„, ํƒ€์ผ๋ง, ์ˆ˜๋™ ๋ฒกํ„ฐํ™”, Mojo vectorize ํŒจํ„ด์„ ํ•™์Šตํ•œ ํ›„, ์ด์ œ ์‹ค์ œ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•  ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค. p21.mojo์— ๋‚ด์žฅ๋œ ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ๋ฒ•์„ ๊ณผํ•™์ ์œผ๋กœ ๋น„๊ตํ•˜๊ณ  ์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋ก ์  ๋ถ„์„์€ ๊ฐ€์น˜ ์žˆ์ง€๋งŒ, ์‹ค์ฆ์  ๋ฒค์น˜๋งˆํ‚น์ด ํŠน์ • ํ•˜๋“œ์›จ์–ด์—์„œ์˜ ์‹ค์ œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ฒค์น˜๋งˆํฌ ์‹คํ–‰

์ „์ฒด ๋ฒค์น˜๋งˆํฌ๋ฅผ ์‹คํ–‰ํ•˜๋ ค๋ฉด:

pixi run p23 --benchmark            # ๊ธฐ๋ณธ ํ™˜๊ฒฝ
pixi run -e amd p23 --benchmark     # AMD GPU ํ™˜๊ฒฝ
pixi run -e apple p23 --benchmark   # Apple GPU ํ™˜๊ฒฝ
uv run poe p23 --benchmark          # uv ์‚ฌ์šฉ ์‹œ

๊ฐ ํŒจํ„ด์— ๋Œ€ํ•œ ์„ฑ๋Šฅ ์ธก์ • ๊ฒฐ๊ณผ๊ฐ€ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
Running P21 GPU Benchmarks...
SIMD width: 4
--------------------------------------------------------------------------------
Testing SIZE=16, TILE=4
Running elementwise_16_4
Running tiled_16_4
Running manual_vectorized_16_4
Running vectorized_16_4
--------------------------------------------------------------------------------
Testing SIZE=128, TILE=16
Running elementwise_128_16
Running tiled_128_16
Running manual_vectorized_128_16
--------------------------------------------------------------------------------
Testing SIZE=128, TILE=16, Vectorize within tiles
Running vectorized_128_16
--------------------------------------------------------------------------------
Testing SIZE=1048576 (1M), TILE=1024
Running elementwise_1M_1024
Running tiled_1M_1024
Running manual_vectorized_1M_1024
Running vectorized_1M_1024
| name                      | met (ms)              | iters |
| ------------------------- | --------------------- | ----- |
| elementwise_16_4          | 0.0033248             | 100   |
| tiled_16_4                | 0.00327392            | 100   |
| manual_vectorized_16_4    | 0.0036169600000000002 | 100   |
| vectorized_16_4           | 0.0037209599999999997 | 100   |
| elementwise_128_16        | 0.00351999            | 100   |
| tiled_128_16              | 0.00370431            | 100   |
| manual_vectorized_128_16  | 0.0043696             | 100   |
| vectorized_128_16         | 0.00378048            | 100   |
| elementwise_1M_1024       | 0.03130143            | 100   |
| tiled_1M_1024             | 0.6892189000000001    | 100   |
| manual_vectorized_1M_1024 | 0.5923888             | 100   |
| vectorized_1M_1024        | 0.1876688             | 100   |

Benchmarks completed!

๋ฒค์น˜๋งˆํฌ ์„ค์ •

๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์€ Mojo์˜ ๋‚ด์žฅ benchmark ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

from benchmark import Bench, BenchConfig, Bencher, BenchId, keep
bench_config = BenchConfig(max_iters=10, num_warmup_iters=1)

  • max_iters=10: ํ†ต๊ณ„์  ์‹ ๋ขฐ์„ฑ์„ ์œ„ํ•ด ์ตœ๋Œ€ 10ํšŒ ๋ฐ˜๋ณต
  • num_warmup_iters=1: ์ธก์ • ์ „ GPU ์›Œ๋ฐ์—…
  • ์ž์„ธํ•œ ๋‚ด์šฉ์€ Benchmark ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”

๋ฒค์น˜๋งˆํ‚น ๊ตฌํ˜„์˜ ํ•ต์‹ฌ

ํ•ต์‹ฌ ์›Œํฌํ”Œ๋กœ์šฐ ํŒจํ„ด

๊ฐ ๋ฒค์น˜๋งˆํฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ฐ„๊ฒฐํ•œ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

@parameter
fn benchmark_pattern_parameterized[test_size: Int, tile_size: Int](mut b: Bencher) raises:
    bench_ctx = DeviceContext()
    # ์…‹์—…: ๋ฒ„ํผ ์ƒ์„ฑ ๋ฐ ๋ฐ์ดํ„ฐ ์ดˆ๊ธฐํ™”
    @parameter
    fn pattern_workflow(ctx: DeviceContext) raises:
      # ์—ฐ์‚ฐ: ์ธก์ • ๋Œ€์ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‹คํ–‰

    b.iter_custom[pattern_workflow](bench_ctx)
    # ์ตœ์ ํ™” ๋ฐฉ์ง€: keep(out.unsafe_ptr())
    # ๋™๊ธฐํ™”: ctx.synchronize()

์ฃผ์š” ๋‹จ๊ณ„:

  1. ์…‹์—…: ๋ฒ„ํผ ํ• ๋‹น ๋ฐ ๋ฐ์ดํ„ฐ ์ดˆ๊ธฐํ™”
  2. ์—ฐ์‚ฐ: ๋ฒค์น˜๋งˆํฌ ๋Œ€์ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‹คํ–‰
  3. ์ตœ์ ํ™” ๋ฐฉ์ง€: ์ •ํ™•ํ•œ ์ธก์ •์„ ์œ„ํ•ด ํ•„์ˆ˜
  4. ๋™๊ธฐํ™”: GPU ์ž‘์—… ์™„๋ฃŒ ํ™•์ธ

์ค‘์š”: keep() ํ•จ์ˆ˜ keep(out.unsafe_ptr())๋Š” ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์—ฐ์‚ฐ ๊ฒฐ๊ณผ๋ฅผ โ€œ์‚ฌ์šฉ๋˜์ง€ ์•Š๋Š” ์ฝ”๋“œโ€๋กœ ๊ฐ„์ฃผํ•ด ์ตœ์ ํ™”๋กœ ์ œ๊ฑฐํ•˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ์—†์œผ๋ฉด ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์•„๋‹ˆ๋ผ ๋นˆ ์ฝ”๋“œ๋ฅผ ์ธก์ •ํ•˜๊ฒŒ ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ๋˜ํ•œ GPU ์ปค๋„์€ ๋น„๋™๊ธฐ์ ์œผ๋กœ ์‹คํ–‰๋˜๋ฏ€๋กœ, ctx.synchronize()๋กœ ์™„๋ฃŒ๋ฅผ ํ™•์ธํ•ด์•ผ ์ •ํ™•ํ•œ GPU ๋ฒค์น˜๋งˆํ‚น์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์ปค์Šคํ…€ ๋ฐ˜๋ณต์ด GPU์— ํ•„์š”ํ•œ ์ด์œ 

์ผ๋ฐ˜์ ์ธ ๋ฒค์น˜๋งˆํ‚น์€ CPU ์Šคํƒ€์ผ์˜ ๋™๊ธฐ ์‹คํ–‰์„ ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. GPU ์ปค๋„์€ ๋น„๋™๊ธฐ์ ์œผ๋กœ ์‹คํ–‰๋˜๋ฏ€๋กœ ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • GPU ์ปจํ…์ŠคํŠธ ๊ด€๋ฆฌ: ์ ์ ˆํ•œ DeviceContext ์ƒ๋ช…์ฃผ๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ: ๋ฐ˜๋ณต ๊ฐ„ ๋ฒ„ํผ ์ •๋ฆฌ
  • ๋™๊ธฐํ™” ์ฒ˜๋ฆฌ: ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์˜ ์ •ํ™•ํ•œ ํƒ€์ด๋ฐ
  • ์˜ค๋ฒ„ํ—ค๋“œ ๋ถ„๋ฆฌ: ์…‹์—… ๋น„์šฉ๊ณผ ์—ฐ์‚ฐ ๋น„์šฉ์˜ ๋ถ„๋ฆฌ
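๊ฐ™์€ ์›Œํฌํ”Œ๋กœ์šฐ(์›Œ๋ฐ์—… โ†’ ๋ฐ˜๋ณต ์ธก์ • โ†’ ๊ฒฐ๊ณผ ์œ ์ง€)๋ฅผ Python์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์ตœ์†Œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. bench๋ผ๋Š” ์ด๋ฆ„๊ณผ ๊ตฌ์กฐ๋Š” ์„ค๋ช…์šฉ ๊ฐ€์ •์ด๋ฉฐ, Mojo benchmark ๋ชจ๋“ˆ์˜ ์‹ค์ œ ๋™์ž‘๊ณผ๋Š” ์„ธ๋ถ€๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค:

```python
import time

def bench(fn, *, warmup: int = 1, iters: int = 10) -> float:
    """์›Œ๋ฐ์—… ํ›„ iters๋ฒˆ ์ธก์ •ํ•ด ์ตœ์†Ÿ๊ฐ’(ms)์„ ๋ฐ˜ํ™˜ํ•˜๋Š” ๋‹จ์ˆœ ํ•˜๋„ค์Šค."""
    for _ in range(warmup):          # ์ธก์ • ์ „ ์›Œ๋ฐ์—… (GPU ์›Œ๋ฐ์—…์— ๋Œ€์‘)
        fn()
    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        result = fn()
        times_ms.append((time.perf_counter() - t0) * 1e3)
        _sink = result               # ๊ฒฐ๊ณผ๋ฅผ ์‚ด๋ ค ๋‘๋Š” ์—ญํ•  (keep()์— ๋Œ€์‘)
    return min(times_ms)

ms = bench(lambda: sum(i * i for i in range(10_000)))
print(f"best: {ms:.3f} ms")
```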

ํ…Œ์ŠคํŠธ ์‹œ๋‚˜๋ฆฌ์˜ค์™€ ์Šค๋ ˆ๋“œ ๋ถ„์„

๋ฒค์น˜๋งˆํฌ ๋ชจ์Œ์€ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ์„ธ ๊ฐ€์ง€ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ํ…Œ์ŠคํŠธํ•ฉ๋‹ˆ๋‹ค:

์Šค๋ ˆ๋“œ ํ™œ์šฉ ์š”์•ฝ

๋ฌธ์ œ ํฌ๊ธฐํŒจํ„ด์Šค๋ ˆ๋“œ ์ˆ˜์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ์ด SIMD ์—ฐ์‚ฐ
SIZE=16์š”์†Œ๋ณ„414
ํƒ€์ผ๋ง414
์ˆ˜๋™144
vectorize414
SIZE=128์š”์†Œ๋ณ„32132
ํƒ€์ผ๋ง8432
์ˆ˜๋™21632
vectorize8432
SIZE=1M์š”์†Œ๋ณ„262,1441262,144
ํƒ€์ผ๋ง1,024256262,144
์ˆ˜๋™2561,024262,144
vectorize1,024256262,144
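ํ‘œ์˜ ์ˆ˜์น˜๋Š” ์Šค๋ ˆ๋“œ๋‹น ๋‹ด๋‹น ์š”์†Œ ์ˆ˜์—์„œ ๋ฐ”๋กœ ๋‚˜์˜ต๋‹ˆ๋‹ค. ์•„๋ž˜ Python ์Šค์ผ€์น˜๋กœ SIZE=1M ํ–‰์„ ๊ฒ€์ฆํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(์ˆ˜๋™ ๋ฒกํ„ฐํ™”๊ฐ€ ์Šค๋ ˆ๋“œ๋‹น 4ร—TILE ์š”์†Œ๋ฅผ ๋งก๋Š”๋‹ค๋Š” ๊ฒƒ์€ ์œ„ ํ‘œ์—์„œ ์—ญ์‚ฐํ•œ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค):

```python
SIMD_WIDTH = 4

def config(size: int, chunk: int):
    """์Šค๋ ˆ๋“œ๋‹น chunk๊ฐœ ์š”์†Œ๋ฅผ ๋งก์„ ๋•Œ (์Šค๋ ˆ๋“œ ์ˆ˜, ์Šค๋ ˆ๋“œ๋‹น SIMD, ์ด SIMD)."""
    threads = size // chunk
    simd_per_thread = chunk // SIMD_WIDTH
    return threads, simd_per_thread, threads * simd_per_thread

# SIZE=1M(= 2^20), TILE=1024 ํ–‰ ๊ฒ€์ฆ
assert config(1 << 20, SIMD_WIDTH) == (262_144, 1, 262_144)    # ์š”์†Œ๋ณ„
assert config(1 << 20, 1024) == (1_024, 256, 262_144)          # ํƒ€์ผ๋ง / vectorize
assert config(1 << 20, 4 * 1024) == (256, 1_024, 262_144)      # ์ˆ˜๋™ ๋ฒกํ„ฐํ™” (๊ฐ€์ •)
print("ํ‘œ์˜ SIZE=1M ํ–‰๊ณผ ์ผ์น˜")
```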

๋ฌธ์ œ ํฌ๊ธฐ๋ณ„ ์„ฑ๋Šฅ ํŠน์„ฑ

์†Œ๊ทœ๋ชจ ๋ฌธ์ œ (SIZE=16):

  • ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ์  (~0.003ms ๊ธฐ์ค€์„ )
  • ์Šค๋ ˆ๋“œ ์ˆ˜ ์ฐจ์ด๋Š” ๊ฑฐ์˜ ๋ฌด์˜๋ฏธ
  • ํƒ€์ผ๋ง/vectorize๊ฐ€ ์•ฝ๊ฐ„ ๋‚ฎ์€ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ๋ณด์ž„

์ค‘๊ทœ๋ชจ ๋ฌธ์ œ (SIZE=128):

  • ์—ฌ์ „ํžˆ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ์  (~0.003ms ์ „ ํŒจํ„ด)
  • ์„ฑ๋Šฅ ์ฐจ์ด๊ฐ€ ๊ฑฐ์˜ ์‚ฌ๋ผ์ง
  • ์˜ค๋ฒ„ํ—ค๋“œ ์ง€๋ฐฐ์—์„œ ์—ฐ์‚ฐ ์ง€๋ฐฐ๋กœ์˜ ์ „ํ™˜ ๊ตฌ๊ฐ„

๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ (SIZE=1M):

  • ์‹ค์งˆ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ฐจ์ด๊ฐ€ ๋“œ๋Ÿฌ๋‚จ
  • ๋น„๋ณ‘ํ•ฉ ๋กœ๋“œ์˜ ์˜ํ–ฅ์ด ๋ช…ํ™•ํ•ด์ง
  • ๋šœ๋ ทํ•œ ์„ฑ๋Šฅ ์ˆœ์œ„๊ฐ€ ๋‚˜ํƒ€๋‚จ

๋ฐ์ดํ„ฐ๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ

๋‹ค์–‘ํ•œ ํ•˜๋“œ์›จ์–ด์—์„œ์˜ ์‹ค์ฆ์  ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ:

์„ฑ๋Šฅ ์ˆœ์œ„ (๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ)

| ์ˆœ์œ„ | ํŒจํ„ด | ์†Œ์š” ์‹œ๊ฐ„ | ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ |
| --- | --- | --- | --- |
| ๐Ÿฅ‡ | ์š”์†Œ๋ณ„ | ~0.03ms | ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ ์Šน๋ฆฌ |
| ๐Ÿฅˆ | Mojo vectorize | ~0.19ms | ๋น„๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์„ฑ๋Šฅ์„ ์ €ํ•˜ |
| ๐Ÿฅ‰ | ์ˆ˜๋™ ๋ฒกํ„ฐํ™” | ~0.59ms | ๋น„๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ์ˆ˜๋™ ์ตœ์ ํ™”๊ฐ€ ์„ฑ๋Šฅ ๊ฐ์†Œ |
| 4์œ„ | ํƒ€์ผ๋ง | ~0.69ms | ๋น„๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ, SIMD ๋กœ๋“œ ์—†๋Š” ์ˆ˜๋™ ์ตœ์ ํ™”๊ฐ€ ์„ฑ๋Šฅ์„ ๋” ์ €ํ•˜ |

ํ•ต์‹ฌ ์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ

๋‹จ์ˆœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์˜ ๊ฒฝ์šฐ: ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ(elementwise)์ด ๋Œ€๊ทœ๋ชจ์—์„œ ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ณด๋‹ค ์šฐ์ˆ˜ํ•ฉ๋‹ˆ๋‹ค.

์š”์†Œ๋ณ„ ํŒจํ„ด์ด ์Šน๋ฆฌํ•˜๋Š” ์ด์œ :

  • 262,144๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ์šฐ์ˆ˜ํ•œ ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰์„ ์ œ๊ณต
  • ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ์ข‹์€ ๋ณ‘ํ•ฉ์„ ๋‹ฌ์„ฑ
  • ์Šค๋ ˆ๋“œ๋‹น ์ตœ์†Œํ•œ์˜ ์˜ค๋ฒ„ํ—ค๋“œ
  • GPU ์ฝ”์–ด ์ˆ˜์— ๋”ฐ๋ผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํ™•์žฅ

ํƒ€์ผ๋ง๊ณผ vectorize๊ฐ€ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์ด์œ :

  • ๋ณ‘๋ ฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ ์‚ฌ์ด์˜ ๊ท ํ˜• ์žกํžŒ ์ ‘๊ทผ
  • ์ž๋™ ์ตœ์ ํ™”(vectorize)๊ฐ€ ์ˆ˜๋™ ํƒ€์ผ๋ง๊ณผ ๊ฑฐ์˜ ๋™๋“ฑํ•œ ์„ฑ๋Šฅ
  • ๊ณผ๋„ํ•œ ๋ณต์žก๋„ ์—†์ด ์–‘ํ˜ธํ•œ ์Šค๋ ˆ๋“œ ํ™œ์šฉ

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๊ฐ€ ๊ณ ์ „ํ•˜๋Š” ์ด์œ :

  • 256๊ฐœ ์Šค๋ ˆ๋“œ๋งŒ์œผ๋กœ๋Š” ๋ณ‘๋ ฌ์„ฑ์ด ์ œํ•œ์ 
  • ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ์ด ์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ถ”๊ฐ€
  • ์Šค๋ ˆ๋“œ๋‹น ๋Œ€ํ˜• ์ฒญํฌ๋กœ ์ธํ•œ ์บ์‹œ ๋ถ€๋‹ด
  • ๋‹จ์ˆœ ์‚ฐ์ˆ ์—์„œ ํšจ๊ณผ ์ฒด๊ฐ

ํ”„๋ ˆ์ž„์›Œํฌ ์ž๋™ํ™” ๊ธฐ๋Šฅ:

  • ์ž๋™ ๋ฐ˜๋ณต ํšŸ์ˆ˜ ์กฐ์ • (91-100ํšŒ ๋ฐ˜๋ณต)
  • ์„œ๋กœ ๋‹ค๋ฅธ ์‹คํ–‰ ์‹œ๊ฐ„์— ๊ฑธ์นœ ํ†ต๊ณ„์  ์‹ ๋ขฐ์„ฑ
  • ๋ฐœ์—ด ์ œํ•œ๊ณผ ์‹œ์Šคํ…œ ๋ณ€๋™์— ๋Œ€์‘

๊ฒฐ๊ณผ ํ•ด์„ํ•˜๊ธฐ

์ถœ๋ ฅ ํ…Œ์ด๋ธ” ์ฝ๊ธฐ

| name                     | met (ms)           | iters |
| ------------------------ | ------------------ | ----- |
| elementwise_1M_1024      | 0.03130143         | 100   |

  • met (ms): ๋ฐ˜๋ณต๋‹น ํ‰๊ท  ์‹คํ–‰ ์‹œ๊ฐ„ (mean execution time)
  • iters: ์ˆ˜ํ–‰๋œ ๋ฐ˜๋ณต ํšŸ์ˆ˜
  • ๋™์ผ ๋ฌธ์ œ ํฌ๊ธฐ ๋‚ด์—์„œ ๋น„๊ต: ๊ฐ™์€ ํฌ๊ธฐ๋ผ๋ฆฌ ๋น„๊ตํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ์˜๋ฏธ ์žˆ์Œ

์ตœ์ ํ™” ์˜์‚ฌ๊ฒฐ์ •

์‹ค์ฆ์  ์ฆ๊ฑฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒจํ„ด์„ ์„ ํƒํ•˜์„ธ์š”:

ํ”„๋กœ๋•์…˜ ์›Œํฌ๋กœ๋“œ์˜ ๊ฒฝ์šฐ:

  • ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ (>100K ์š”์†Œ): ์š”์†Œ๋ณ„ ํŒจํ„ด์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์ตœ์ 
  • ์†Œ๊ทœ๋ชจ/์‹œ์ž‘ ๋ฐ์ดํ„ฐ์…‹ (<1K ์š”์†Œ): ๋‚ฎ์€ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์œ„ํ•ด ํƒ€์ผ๋ง ๋˜๋Š” vectorize
  • ๊ฐœ๋ฐœ ์†๋„ ์šฐ์„ : ์ž๋™ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ Mojo vectorize
  • ์ˆ˜๋™ ๋ฒกํ„ฐํ™” ์ง€์–‘: ๋‹จ์ˆœ ์—ฐ์‚ฐ์—์„œ๋Š” ๋ณต์žก๋„๊ฐ€ ์„ฑ๋Šฅ์œผ๋กœ ๋ณด์ƒ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋“œ๋ฌพ

์„ฑ๋Šฅ ์ตœ์ ํ™” ์›Œํฌํ”Œ๋กœ์šฐ:

  1. ๋จผ์ € ํ”„๋กœํŒŒ์ผ๋ง: ์ตœ์ ํ™”ํ•˜๊ธฐ ์ „์— ์ธก์ •
  2. ๋Œ€๊ทœ๋ชจ์—์„œ ํ…Œ์ŠคํŠธ: ์†Œ๊ทœ๋ชจ ๋ฌธ์ œ๋Š” ์‹ค์ œ ์„ฑ๋Šฅ์— ๋Œ€ํ•ด ์˜คํ•ด๋ฅผ ์ค„ ์ˆ˜ ์žˆ์Œ
  3. ์ด๋น„์šฉ ๊ณ ๋ ค: ๊ฐœ๋ฐœ ๋ฐ ์œ ์ง€๋ณด์ˆ˜ ๋…ธ๋ ฅ์„ ํฌํ•จ
  4. ๊ฐœ์„  ์‚ฌํ•ญ ๊ฒ€์ฆ: ๋Œ€์ƒ ํ•˜๋“œ์›จ์–ด์—์„œ ๋ฒค์น˜๋งˆํฌ๋กœ ํ™•์ธ

๊ณ ๊ธ‰ ๋ฒค์น˜๋งˆํ‚น ๊ธฐ๋ฒ•

์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ ์‹œ๋‚˜๋ฆฌ์˜ค

๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์กฐ๊ฑด์„ ํ…Œ์ŠคํŠธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํฌ๊ธฐ
benchmark_elementwise_parameterized[1024, 32]  # ๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ
benchmark_elementwise_parameterized[64, 8]     # ์†Œ๊ทœ๋ชจ ๋ฌธ์ œ

# ๋‹ค์–‘ํ•œ ํƒ€์ผ ํฌ๊ธฐ
benchmark_tiled_parameterized[256, 8]   # ์ž‘์€ ํƒ€์ผ
benchmark_tiled_parameterized[256, 64]  # ํฐ ํƒ€์ผ

ํ•˜๋“œ์›จ์–ด ๊ณ ๋ ค ์‚ฌํ•ญ

๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค:

  • GPU ์•„ํ‚คํ…์ฒ˜: SIMD ํญ, ์ฝ”์–ด ์ˆ˜, ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  • ์‹œ์Šคํ…œ ๊ตฌ์„ฑ: PCIe ๋Œ€์—ญํญ, CPU ์„ฑ๋Šฅ
  • ์—ด ์ƒํƒœ: GPU ๋ถ€์ŠคํŠธ ํด๋Ÿญ vs ์ง€์† ์„ฑ๋Šฅ
  • ๋™์‹œ ์›Œํฌ๋กœ๋“œ: GPU ํ™œ์šฉ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค

๋ชจ๋ฒ” ์‚ฌ๋ก€ ์š”์•ฝ

๋ฒค์น˜๋งˆํ‚น ์›Œํฌํ”Œ๋กœ์šฐ:

  1. ์ค‘์š”ํ•œ ์ธก์ • ์ „์— GPU ์›Œ๋ฐ์—…
  2. ํ†ต๊ณ„์  ์œ ์˜์„ฑ์„ ์œ„ํ•ด ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณต ์‹คํ–‰
  3. ํ™•์žฅ ํŠน์„ฑ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํฌ๊ธฐ ํ…Œ์ŠคํŠธ
  4. ์ตœ์ ํ™” ์•„ํ‹ฐํŒฉํŠธ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด keep()์„ ์ผ๊ด€๋˜๊ฒŒ ์‚ฌ์šฉ
  5. ๋™์ผ ์กฐ๊ฑด์—์„œ ๋น„๊ต (๊ฐ™์€ ๋ฌธ์ œ ํฌ๊ธฐ, ๊ฐ™์€ ํ•˜๋“œ์›จ์–ด)

์„ฑ๋Šฅ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ:

  • ๋‹จ์ˆœํ•˜๊ฒŒ ์‹œ์ž‘: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—๋Š” ์š”์†Œ๋ณ„ ํŒจํ„ด๋ถ€ํ„ฐ
  • ์ถ”์ธกํ•˜์ง€ ๋ง๊ณ  ์ธก์ •: ์ด๋ก ์  ๋ถ„์„์€ ๋ฐฉํ–ฅ์„, ์‹ค์ฆ์  ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฒฐ์ •์„
  • ๊ทœ๋ชจ๊ฐ€ ์ค‘์š”: ์†Œ๊ทœ๋ชจ ๋ฌธ์ œ์˜ ์„ฑ๋Šฅ์ด ๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ์˜ ๋™์ž‘์„ ์˜ˆ์ธกํ•˜์ง€ ๋ชปํ•จ
  • ์ด๋น„์šฉ ์ตœ์ ํ™”: ๊ฐœ๋ฐœ ์‹œ๊ฐ„ vs ๋Ÿฐํƒ€์ž„ ์„ฑ๋Šฅ์˜ ๊ท ํ˜•

๋‹ค์Œ ๋‹จ๊ณ„

๋ฒค์น˜๋งˆํ‚น ๊ธฐ์ˆ ์„ ๊ฐ–์ถ”์—ˆ๋‹ค๋ฉด:

  • ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ”„๋กœํŒŒ์ผ๋ง: ์ด ํŒจํ„ด๋“ค์„ ์‹ค์ œ ์›Œํฌ๋กœ๋“œ์— ์ ์šฉ
  • ๊ณ ๊ธ‰ GPU ํŒจํ„ด: ๋ฆฌ๋•์…˜, ํ•ฉ์„ฑ๊ณฑ, ํ–‰๋ ฌ ์—ฐ์‚ฐ ํƒ๊ตฌ
  • ๋ฉ€ํ‹ฐ GPU ํ™•์žฅ: ๋ถ„์‚ฐ GPU ์ปดํ“จํŒ… ํŒจํ„ด ์ดํ•ด
  • ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ ๊ธ‰ ์บ์‹ฑ์„ ๋” ๊นŠ์ด ํƒ๊ตฌ

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: ๋ฒค์น˜๋งˆํ‚น์€ ์ด๋ก ์  ์ดํ•ด๋ฅผ ์‹ค์งˆ์ ์ธ ์„ฑ๋Šฅ ์ตœ์ ํ™”๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ฆ์  ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์ • ํ•˜๋“œ์›จ์–ด์™€ ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ์— ๊ฐ€์žฅ ์ ํ•ฉํ•œ ํŒจํ„ด์„ ์„ ํƒํ•˜์„ธ์š”.

์•ž์œผ๋กœ์˜ ๋ฐฉํ–ฅ: ๋” ๋งŽ์€ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•  ๋•Œ

Part VI์˜ ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์€ ๋Œ€๋ถ€๋ถ„์˜ ์›Œํฌ๋กœ๋“œ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•˜์ง€๋งŒ, ์ผ๋ถ€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ง์ ‘์ ์ธ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์œ ์šฉํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ๋ฆฌ๋•์…˜: ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์— ๊ฑธ์นœ ํ•ฉ๊ณ„, ์ตœ๋Œ“๊ฐ’, ์ตœ์†Ÿ๊ฐ’ ์—ฐ์‚ฐ
  • ๋ˆ„์  ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ์ด๋™ ์ตœ๋Œ“๊ฐ’
  • ๋ฐ์ดํ„ฐ ์…”ํ”Œ: ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜
  • ํ˜‘๋ ฅ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์Šค๋ ˆ๋“œ ๊ฐ„ ๊ธด๋ฐ€ํ•œ ์กฐ์ •์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ

์„ฑ๋Šฅ ๋ฏธ๋ฆฌ๋ณด๊ธฐ:

Part VII์—์„œ๋Š” Part III์˜ ์—ฌ๋Ÿฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹ค์‹œ ์‚ดํŽด๋ณด๋ฉฐ ์›Œํ”„ ์—ฐ์‚ฐ์ด ์–ด๋–ป๊ฒŒ:

  • ์ฝ”๋“œ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š”์ง€: ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ๊ฐ์†Œ
  • ์ƒˆ๋กœ์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š”์ง€: ์ˆœ์ˆ˜ ํ•จ์ˆ˜ํ˜• ์ ‘๊ทผ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•œ ํŒจํ„ด์„ ๊ตฌํ˜„

๋‹ค์Œ ๋‚ด์šฉ: Part VII: ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ - Puzzle 14์˜ ๋ˆ„์  ํ•ฉ์„ ์™„์ „ํžˆ ์ƒˆ๋กญ๊ฒŒ ๊ตฌํ˜„ํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ

๊ฐœ์š”

Part VII: ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” GPU์˜ ์›Œํ”„ ๋ ˆ๋ฒจ ๊ธฐ๋ณธ ์š”์†Œ - ์›Œํ”„ ๋‚ด ๋™๊ธฐํ™”๋œ ์Šค๋ ˆ๋“œ ์‹คํ–‰์„ ํ™œ์šฉํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ์œจ์ ์ธ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋‚ด์žฅ ์›Œํ”„ ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ชฉํ‘œ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜๋Š” ํšจ์œจ์ ์ธ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์›Œํ”„๋Š” ๋ก์Šคํ…(lockstep)์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค - Mojo์˜ ์›Œํ”„ ์—ฐ์‚ฐ์€ ์ด ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ๊ฐ•๋ ฅํ•œ ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

GPU ์›Œํ”„ ์‹คํ–‰ ๋ชจ๋ธ

GPU ๋ณ‘๋ ฌ์„ฑ์˜ ๊ธฐ๋ณธ ํ•˜๋“œ์›จ์–ด ๋‹จ์œ„๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ๋ธ”๋ก (์˜ˆ: 256 ์Šค๋ ˆ๋“œ)
โ”œโ”€โ”€ ์›Œํ”„ 0 (32 ์Šค๋ ˆ๋“œ, SIMT ๋ก์Šคํ… ์‹คํ–‰)
โ”‚   โ”œโ”€โ”€ ๋ ˆ์ธ 0  โ”€โ”
โ”‚   โ”œโ”€โ”€ ๋ ˆ์ธ 1   โ”‚ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ช…๋ น์„
โ”‚   โ”œโ”€โ”€ ๋ ˆ์ธ 2   โ”‚ ๋™์‹œ์— ์‹คํ–‰ (SIMT)
โ”‚   โ”‚   ...      โ”‚
โ”‚   โ””โ”€โ”€ ๋ ˆ์ธ 31 โ”€โ”˜
โ”œโ”€โ”€ ์›Œํ”„ 1 (32 ์Šค๋ ˆ๋“œ, ๋…๋ฆฝ์ )
โ”œโ”€โ”€ ์›Œํ”„ 2 (32 ์Šค๋ ˆ๋“œ, ๋…๋ฆฝ์ )
โ””โ”€โ”€ ...

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • NVIDIA GPU์—์„œ ์›Œํ”„๋‹น 32 ์Šค๋ ˆ๋“œ (WARP_SIZE=32)
  • AMD GPU์—์„œ ์›Œํ”„๋‹น 32 ๋˜๋Š” 64 ์Šค๋ ˆ๋“œ (WARP_SIZE=32 or 64)
  • ๋ก์Šคํ… ์‹คํ–‰: ์›Œํ”„ ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ๋น„์šฉ ์ œ๋กœ: ์›Œํ”„ ์—ฐ์‚ฐ์€ ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ์ฆ‰์‹œ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค

Mojo์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์›Œํ”„ ์—ฐ์‚ฐ

gpu.primitives.warp์˜ ํ•ต์‹ฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. sum(value): ์›Œํ”„์˜ ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ํ•ฉ์‚ฐ
  2. shuffle_idx(value, lane): ํŠน์ • ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ
  3. shuffle_down(value, delta): lane+delta ์œ„์น˜์˜ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ
  4. prefix_sum(value): ๋ ˆ์ธ ์ „์ฒด์— ๊ฑธ์ณ ๋ˆ„์  ํ•ฉ ๊ณ„์‚ฐ
  5. lane_id(): ํ˜„์žฌ ์Šค๋ ˆ๋“œ์˜ ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋ฐ˜ํ™˜ (0-31 ๋˜๋Š” 0-63)
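๊ฐ ์—ฐ์‚ฐ์˜ ์˜๋ฏธ๋Š” 32๊ฐœ ๋ ˆ์ธ ๊ฐ’์˜ ๋ฆฌ์ŠคํŠธ๋กœ ํ‰๋‚ด ๋‚ด๋ฉด ์ดํ•ดํ•˜๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์˜๋ฏธ๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•œ Python ๋ชจ๋ธ์ผ ๋ฟ์ด๋ฉฐ, ๊ฒฝ๊ณ„ ๋ ˆ์ธ ์ฒ˜๋ฆฌ ๋ฐฉ์‹๊ณผ prefix_sum์˜ inclusive ์—ฌ๋ถ€๋Š” ๋‹จ์ˆœํ™”ํ•œ ๊ฐ€์ •์ž…๋‹ˆ๋‹ค:

```python
# ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ์˜ '์˜๋ฏธ'๋ฅผ 32๊ฐœ ๊ฐ’์˜ ๋ฆฌ์ŠคํŠธ๋กœ ํ‰๋‚ด ๋‚ธ ๋‹จ์ˆœ ๋ชจ๋ธ
WARP_SIZE = 32
values = list(range(WARP_SIZE))  # ๋ ˆ์ธ i๊ฐ€ ๊ฐ’ i๋ฅผ ๊ฐ–๋Š”๋‹ค๊ณ  ๊ฐ€์ •

warp_total = sum(values)                                # 1. sum(value)
from_lane0 = [values[0]] * WARP_SIZE                    # 2. shuffle_idx(value, 0)
shuffled_down = [values[min(i + 1, WARP_SIZE - 1)]      # 3. shuffle_down(value, 1)
                 for i in range(WARP_SIZE)]             #    (๋ฒ”์œ„ ๋ฐ– ๋ ˆ์ธ์€ ๋‹จ์ˆœํ™”)
inclusive_prefix = [sum(values[: i + 1])                # 4. prefix_sum (inclusive ๊ฐ€์ •)
                    for i in range(WARP_SIZE)]

print(warp_total)             # 496
print(inclusive_prefix[-1])   # 496 (๋งˆ์ง€๋ง‰ ๋ ˆ์ธ์˜ ๋ˆ„์  ํ•ฉ = ์ „์ฒด ํ•ฉ)
```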

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# 1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฆฌ๋•์…˜
# ์•ž์„œ ์‚ดํŽด๋ณธ ๋ณต์žกํ•œ ํŒจํ„ด (p12.mojo):
shared = LayoutTensor[
    dtype,
    Layout.row_major(WARP_SIZE),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
shared[local_i] = partial_product
barrier()

# ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ์•ˆ์ „ํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์€ ๊ฐ ๋‹จ๊ณ„๋งˆ๋‹ค ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:
stride = WARP_SIZE // 2
while stride > 0:
    if local_i < stride:
        shared[local_i] += shared[local_i + stride]

    barrier()
    stride //= 2

# 2. ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ™œ์šฉํ•œ ๋ฆฌ๋•์…˜
# ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ ์•ˆ์ „ํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๊ฐ ๋‹จ๊ณ„์˜ ๋ฐฐ๋ฆฌ์–ด๊ฐ€
# ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
# Mojo์˜ ์›Œํ”„ ๋ ˆ๋ฒจ sum ์—ฐ์‚ฐ์€ ๋‚ด๋ถ€์ ์œผ๋กœ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„
# ์ˆจ๊น๋‹ˆ๋‹ค:
total = sum(partial_product)  # ๋‚ด๋ถ€์ ์œผ๋กœ ๋ฐฐ๋ฆฌ์–ด๋„, ๊ฒฝ์Ÿ ์ƒํƒœ๋„ ์—†์Šต๋‹ˆ๋‹ค!
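๋‘ ๋ฐฉ์‹์ด ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ธ๋‹ค๋Š” ์ ์€ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ‰๋‚ด ๋‚ธ Python ์Šค์ผ€์น˜๋กœ๋„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค(GPU์—์„œ๋Š” ์•ˆ์ชฝ ๋ฃจํ”„๊ฐ€ ์Šค๋ ˆ๋“œ๋“ค์— ์˜ํ•ด ๋™์‹œ์— ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค):

```python
# ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜(๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹)์ด ์ง์ ‘ ํ•ฉ์‚ฐ๊ณผ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ„์„ ๋ณด์ด๋Š” ์Šค์ผ€์น˜
WARP_SIZE = 32
shared = [float(i) for i in range(WARP_SIZE)]  # ๊ฐ ๋ ˆ์ธ์˜ partial_product ๊ฐ€์ •

stride = WARP_SIZE // 2
while stride > 0:
    for local_i in range(stride):      # GPU์—์„œ๋Š” ์Šค๋ ˆ๋“œ๋“ค์ด ๋™์‹œ์— ์ˆ˜ํ–‰
        shared[local_i] += shared[local_i + stride]
    stride //= 2                       # ๊ฐ ๋‹จ๊ณ„ ์‚ฌ์ด๊ฐ€ barrier()์— ํ•ด๋‹นํ•˜๋Š” ์ง€์ 

assert shared[0] == sum(range(WARP_SIZE))  # ์›Œํ”„ ๋ ˆ๋ฒจ sum()๊ณผ ๊ฐ™์€ ๊ฒฐ๊ณผ
print(shared[0])  # 496.0
```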

์›Œํ”„ ์—ฐ์‚ฐ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

๋ฌธ์ œ ๊ทœ๋ชจ              ๊ธฐ์กด ๋ฐฉ์‹        ์›Œํ”„ ์—ฐ์‚ฐ
๋‹จ์ผ ์›Œํ”„ (32)         ๋น ๋ฆ„            ๊ฐ€์žฅ ๋น ๋ฆ„ (๋ฐฐ๋ฆฌ์–ด ์—†์Œ)
์†Œ์ˆ˜ ์›Œํ”„ (128)        ์ข‹์Œ            ์šฐ์ˆ˜ (์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œ)
๋‹ค์ˆ˜ ์›Œํ”„ (1024+)      ์ข‹์Œ            ๋›ฐ์–ด๋‚จ (์„ ํ˜• ํ™•์žฅ)
๋Œ€๊ทœ๋ชจ (16K+)          ๋ณ‘๋ชฉ ๋ฐœ์ƒ        ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ œํ•œ

์„ ์ˆ˜ ์ง€์‹

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • Part VI ํ•จ์ˆ˜ํ˜• ํŒจํ„ด: elementwise, tiled, vectorize ์ ‘๊ทผ ๋ฐฉ์‹
  • GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ: ๋ธ”๋ก, ์›Œํ”„, ์Šค๋ ˆ๋“œ์— ๋Œ€ํ•œ ์ดํ•ด
  • LayoutTensor ์—ฐ์‚ฐ: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…: ๋ฐฐ๋ฆฌ์–ด์™€ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ์™œ ๋ณต์žกํ•œ์ง€

ํ•™์Šต ๊ฒฝ๋กœ

1. SIMT ์‹คํ–‰ ๋ชจ๋ธ

โ†’ ์›Œํ”„ ๋ ˆ์ธ๊ณผ SIMT ์‹คํ–‰

์›Œํ”„ ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ฐ˜์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • SIMT(Single Instruction, Multiple Thread) ์‹คํ–‰ ๋ชจ๋ธ
  • ์›Œํ”„ ๋ถ„๊ธฐ์™€ ์ˆ˜๋ ด ํŒจํ„ด
  • ์›Œํ”„ ๋‚ด ๋ ˆ์ธ ๋™๊ธฐํ™”
  • ํ•˜๋“œ์›จ์–ด vs ์†Œํ”„ํŠธ์›จ์–ด ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์›Œํ”„๋Š” GPU ์‹คํ–‰์˜ ๊ธฐ๋ณธ ๋‹จ์œ„์ž…๋‹ˆ๋‹ค - SIMT๋ฅผ ์ดํ•ดํ•˜๋ฉด ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋ฌธ์ด ์—ด๋ฆฝ๋‹ˆ๋‹ค.

2. ์›Œํ”„ sum ๊ธฐ์ดˆ

โ†’ warp.sum()์˜ ํ•ต์‹ฌ

๋‚ด์  ๊ตฌํ˜„์„ ํ†ตํ•ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์›Œํ”„ ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด๋ฅผ sum()์œผ๋กœ ๋Œ€์ฒด
  • GPU ์•„ํ‚คํ…์ฒ˜ ๊ฐ„ ํ˜ธํ™˜์„ฑ (WARP_SIZE)
  • ์›Œํ”„๋ฅผ ํ™œ์šฉํ•œ ์ปค๋„ vs ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด
  • ๊ธฐ์กด ๋ฐฉ์‹๊ณผ์˜ ์„ฑ๋Šฅ ๋น„๊ต

ํ•ต์‹ฌ ํŒจํ„ด:

partial_result = compute_per_lane_value()
total = sum(partial_result)  # ๋งˆ๋ฒ•์ด ์ผ์–ด๋‚˜๋Š” ๊ณณ!
if lane_id() == 0:
    output[0] = total

3. ์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ

โ†’ ์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ

๋Œ€์•ˆ ๋Œ€๋น„ ์›Œํ”„ ์—ฐ์‚ฐ์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•œ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์›Œํ”„ ์—ฐ์‚ฐ์— ์œ ๋ฆฌํ•œ ๋ฌธ์ œ ํŠน์„ฑ
  • ์›Œํ”„ ์ˆ˜์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ํ™•์žฅ ํŒจํ„ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ vs ์—ฐ์‚ฐ๋Ÿ‰ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„
  • ์›Œํ”„ ์—ฐ์‚ฐ ์„ ํƒ ๊ฐ€์ด๋“œ๋ผ์ธ

์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ: ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ์ด ๋ณ‘๋ชฉ์ด ๋  ๋•Œ, ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋ŒํŒŒ๊ตฌ๋ฅผ ์ œ๊ณตํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

ํ•˜๋“œ์›จ์–ด-์†Œํ”„ํŠธ์›จ์–ด ์ •๋ ฌ

Mojo ์›Œํ”„ ์—ฐ์‚ฐ์ด GPU ํ•˜๋“œ์›จ์–ด์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • SIMT ์‹คํ–‰: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์ผํ•œ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ๋‚ด์žฅ ๋™๊ธฐํ™”: ์›Œํ”„ ๋‚ด์—์„œ ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค
  • ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ์ง€์›: WARP_SIZE๊ฐ€ NVIDIA์™€ AMD์˜ ์ฐจ์ด๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

ํŒจํ„ด ๋ณ€ํ™˜

๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํŒจํ„ด์„ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ โ†’ sum()
  • ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ โ†’ prefix_sum()
  • ๋ฐ์ดํ„ฐ ์…”ํ”Œ โ†’ shuffle_idx(), shuffle_down()

์„ฑ๋Šฅ ํŠน์„ฑ

์›Œํ”„ ์—ฐ์‚ฐ์ด ์ด์ ์„ ์ œ๊ณตํ•˜๋Š” ๊ฒฝ์šฐ๋ฅผ ํŒŒ์•…ํ•ฉ๋‹ˆ๋‹ค:

  • ์†Œ~์ค‘๊ทœ๋ชจ ๋ฌธ์ œ: ๋ฐฐ๋ฆฌ์–ด ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค
  • ๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ: ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ์ค„์ด๊ณ  ์บ์‹œ ํ™œ์šฉ์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ทœ์น™์ ์ธ ํŒจํ„ด: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ์ ‘๊ทผ ํŒจํ„ด์—์„œ ์›Œํ”„ ์—ฐ์‚ฐ์ด ํƒ์›”ํ•ฉ๋‹ˆ๋‹ค

์‹œ์ž‘ํ•˜๊ธฐ

SIMT ์‹คํ–‰ ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‹œ์ž‘ํ•˜์—ฌ, ์‹ค์šฉ์ ์ธ warp.sum ๊ตฌํ˜„์„ ๋‹ค๋ฃจ๊ณ , ์ „๋žต์  ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ๋กœ ๋งˆ๋ฌด๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์›Œํ”„๋ฅผ ๋…๋ฆฝ์ ์ธ ์Šค๋ ˆ๋“œ๊ฐ€ ์•„๋‹Œ ๋™๊ธฐํ™”๋œ ๋ฒกํ„ฐ ์œ ๋‹›์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”. ์ด ๋ฉ˜ํƒˆ ๋ชจ๋ธ์ด ํšจ๊ณผ์ ์ธ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์œผ๋กœ ์•ˆ๋‚ดํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Part VII์„ ๋งˆ์น˜๋ฉด, ์›Œํ”„ ์—ฐ์‚ฐ์ด ๋ณต์žกํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์ธ์‹ํ•˜์—ฌ ๋” ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅธ GPU ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: ์›Œํ”„ ๋ ˆ์ธ๊ณผ SIMT ์‹คํ–‰ ์—์„œ ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํž˜์„ ๋งŒ๋‚˜๋ณด์„ธ์š”!

๐Ÿง  ์›Œํ”„ ๋ ˆ์ธ๊ณผ SIMT ์‹คํ–‰

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ vs SIMD ๋ฉ˜ํƒˆ ๋ชจ๋ธ

์›Œํ”„๋ž€ ๋ฌด์—‡์ธ๊ฐ€?

์›Œํ”„๋Š” 32๊ฐœ(๋˜๋Š” 64๊ฐœ)์˜ GPU ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋™์ผํ•œ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋Š” ๊ทธ๋ฃน์ž…๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ์˜ โ€œ๋ ˆ์ธโ€ ์—ญํ• ์„ ํ•˜๋Š” ๋™๊ธฐํ™”๋œ ๋ฒกํ„ฐ ์œ ๋‹›์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

๊ฐ„๋‹จํ•œ ์˜ˆ์‹œ:

from gpu.primitives.warp import sum
# ์›Œํ”„ ๋‚ด 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰:
var my_value = input[my_thread_id]     # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์˜ด
var warp_total = sum(my_value)         # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ํ•ฉ๊ณ„์— ๊ธฐ์—ฌ

๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚œ ๊ฑธ๊นŒ์š”? 32๊ฐœ์˜ ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณต์žกํ•œ ์กฐ์œจ์„ ํ•˜๋Š” ๋Œ€์‹ , ์›Œํ”„๊ฐ€ ์ž๋™์œผ๋กœ ๋™๊ธฐํ™”ํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋ƒˆ์Šต๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฐ”๋กœ SIMT(Single Instruction, Multiple Thread) ์‹คํ–‰์ž…๋‹ˆ๋‹ค.

SIMT vs SIMD ๋น„๊ต

CPU ๋ฒกํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ(SIMD)์— ์ต์ˆ™ํ•˜๋‹ค๋ฉด, GPU ์›Œํ”„๋Š” ๋น„์Šทํ•˜์ง€๋งŒ ํ•ต์‹ฌ์ ์ธ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

๊ด€์ CPU SIMD (์˜ˆ: AVX)GPU ์›Œํ”„ (SIMT)
ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ๋ช…์‹œ์  ๋ฒกํ„ฐ ์—ฐ์‚ฐ์Šค๋ ˆ๋“œ ๊ธฐ๋ฐ˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ
๋ฐ์ดํ„ฐ ํญ๊ณ ์ • (256/512 ๋น„ํŠธ)์œ ์—ฐ (32/64 ์Šค๋ ˆ๋“œ)
๋™๊ธฐํ™”๋ช…๋ น ๋‚ด ์•”์‹œ์ ์›Œํ”„ ๋‚ด ์•”์‹œ์ 
ํ†ต์‹ ๋ฉ”๋ชจ๋ฆฌ/๋ ˆ์ง€์Šคํ„ฐ ๊ฒฝ์œ ์…”ํ”Œ ์—ฐ์‚ฐ ๊ฒฝ์œ 
๋ถ„๊ธฐ ์ฒ˜๋ฆฌํ•ด๋‹น ์—†์Œํ•˜๋“œ์›จ์–ด ๋งˆ์Šคํ‚น
์˜ˆ์‹œa + bsum(thread_value)

CPU SIMD ๋ฐฉ์‹ (C++ intrinsics):

// ๋ช…์‹œ์  ๋ฒกํ„ฐ ์—ฐ์‚ฐ - 8๊ฐœ์˜ float๋ฅผ ๋ณ‘๋ ฌ๋กœ
__m256 result = _mm256_add_ps(a, b);   // 8์Œ์„ ๋™์‹œ์— ๋ง์…ˆ

CPU SIMD ๋ฐฉ์‹ (Mojo):

# Mojo์—์„œ SIMD๋Š” ์ผ๊ธ‰ ์‹œ๋ฏผ ํƒ€์ž…์ด๋ฏ€๋กœ a, b๊ฐ€ SIMD ํƒ€์ž…์ด๋ฉด
# ๋ง์…ˆ์ด ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค
var result = a + b # 8์Œ์„ ๋™์‹œ์— ๋ง์…ˆ

GPU SIMT ๋ฐฉ์‹ (Mojo):

# ์Šค๋ ˆ๋“œ ๊ธฐ๋ฐ˜ ์ฝ”๋“œ๊ฐ€ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค
from gpu.primitives.warp import sum

var my_data = input[thread_id]         # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž๊ธฐ ์š”์†Œ๋ฅผ ๊ฐ€์ ธ์˜ด
var partial = my_data * coefficient    # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ณ„์‚ฐ
var total = sum(partial)               # ํ•˜๋“œ์›จ์–ด๊ฐ€ ํ•ฉ์‚ฐ์„ ์กฐ์œจ

์›Œํ”„๋ฅผ ๊ฐ•๋ ฅํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ํ•ต์‹ฌ ๊ฐœ๋…

1. ๋ ˆ์ธ ์‹๋ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๋Š” ์‚ฌ์‹ค์ƒ ๋น„์šฉ ์—†์ด ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋Š” โ€œ๋ ˆ์ธ IDโ€ (0~31)๋ฅผ ๊ฐ–์Šต๋‹ˆ๋‹ค

var my_lane = lane_id()  # ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์ฝ์„ ๋ฟ

2. ์•”์‹œ์  ๋™๊ธฐํ™”: ์›Œํ”„ ๋‚ด์—์„œ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค

# ๊ทธ๋ƒฅ ๋™์ž‘ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ž๋™์œผ๋กœ ๋™๊ธฐํ™”
var sum = sum(my_contribution)

3. ํšจ์œจ์ ์ธ ํ†ต์‹ : ๋ฉ”๋ชจ๋ฆฌ ์—†์ด๋„ ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ๊ณต์œ ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค

# ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์œผ๋กœ ์ „๋‹ฌ
var broadcasted = shuffle_idx(my_value, 0)

ํ•ต์‹ฌ ํ†ต์ฐฐ: SIMT๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ž์—ฐ์Šค๋Ÿฌ์šด ์Šค๋ ˆ๋“œ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๋ฉด์„œ๋„ ํšจ์œจ์ ์ธ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์œผ๋กœ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์–ด, ์Šค๋ ˆ๋“œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํŽธ๋ฆฌํ•จ๊ณผ ๋ฒกํ„ฐ ์ฒ˜๋ฆฌ์˜ ์„ฑ๋Šฅ์„ ๋ชจ๋‘ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ ์›Œํ”„์˜ ์œ„์น˜

์›Œํ”„๊ฐ€ ์ „์ฒด GPU ์‹คํ–‰ ๋ชจ๋ธ๊ณผ ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐ๋˜๋Š”์ง€ ์ž์„ธํžˆ ์•Œ์•„๋ณด๋ ค๋ฉด GPU ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…์„ ์ฐธ๊ณ ํ•˜์„ธ์š”. ์›Œํ”„์˜ ์œ„์น˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

GPU ๋””๋ฐ”์ด์Šค
โ”œโ”€โ”€ ๊ทธ๋ฆฌ๋“œ (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ ๋ธ”๋ก 1 (์Šค๋ ˆ๋“œ ๊ทธ๋ฃน, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
โ”‚   โ”‚   โ”œโ”€โ”€ ์›Œํ”„ 1 (32 ์Šค๋ ˆ๋“œ, ๋ก์Šคํ… ์‹คํ–‰) โ† ์ด ๋ ˆ๋ฒจ
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ ์Šค๋ ˆ๋“œ 1 โ†’ SIMD ์—ฐ์‚ฐ
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ ์Šค๋ ˆ๋“œ 2 โ†’ SIMD ์—ฐ์‚ฐ
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ... (์ด 32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ”‚   โ””โ”€โ”€ ์›Œํ”„ 2 (32 ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ ๋ธ”๋ก 2 (๋…๋ฆฝ์ ์ธ ๊ทธ๋ฃน)

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ โ€œ์›Œํ”„ ๋ ˆ๋ฒจโ€œ์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค - ๋‹จ์ผ ์›Œํ”„ ๋‚ด์˜ 32๊ฐœ ์Šค๋ ˆ๋“œ๋ฅผ ๋ชจ๋‘ ์กฐ์œจํ•˜๋Š” ์—ฐ์‚ฐ์„ ๋‹ค๋ฃจ๋ฉฐ, ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ์ด ํ•„์š”ํ•œ sum() ๊ฐ™์€ ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋ฉ˜ํƒˆ ๋ชจ๋ธ์€ ๋ฌธ์ œ๊ฐ€ ์›Œํ”„ ์—ฐ์‚ฐ์— ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋งคํ•‘๋˜๋Š” ๊ฒฝ์šฐ์™€ ๊ธฐ์กด์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ฐ˜

Single Instruction, Multiple Thread(SIMT) ์‹คํ–‰์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ ํšจ๊ณผ์ ์ธ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋‹จ์ˆœํ•œ ์†Œํ”„ํŠธ์›จ์–ด ์ถ”์ƒํ™”๊ฐ€ ์•„๋‹ˆ๋ผ, GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ์‹ค๋ฆฌ์ฝ˜ ์ˆ˜์ค€์—์„œ ์‹ค์ œ๋กœ ์ž‘๋™ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

SIMT ์‹คํ–‰์ด๋ž€?

SIMT๋ž€ ์›Œํ”„ ๋‚ด์—์„œ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๊ฐ™์€ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. ์ด๋Š” ์™„์ „ํžˆ ๋‹ค๋ฅธ ๋ช…๋ น์„ ๋…๋ฆฝ์ ์œผ๋กœ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” CPU ์Šค๋ ˆ๋“œ์™€ ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

CPU vs GPU ์‹คํ–‰ ๋ชจ๋ธ

๊ด€์ CPU (MIMD)GPU ์›Œํ”„ (SIMT)
๋ช…๋ น ๋ชจ๋ธMultiple Instructions, Multiple DataSingle Instruction, Multiple Thread
Core 1add r1, r2add r1, r2
Core 2load r3, [mem]add r1, r2 (๋™์ผ ๋ช…๋ น)
Core 3branch loopadd r1, r2 (๋™์ผ ๋ช…๋ น)
โ€ฆ Core 32๋‹ค๋ฅธ ๋ช…๋ นadd r1, r2 (๋™์ผ ๋ช…๋ น)
์‹คํ–‰ ๋ฐฉ์‹๋…๋ฆฝ์ , ๋น„๋™๊ธฐ๋™๊ธฐํ™”, ๋ก์Šคํ…
์Šค์ผ€์ค„๋ง๋ณต์žก, OS ๊ด€๋ฆฌ๋‹จ์ˆœ, ํ•˜๋“œ์›จ์–ด ๊ด€๋ฆฌ
๋ฐ์ดํ„ฐ๋…๋ฆฝ์ ์ธ ๋ฐ์ดํ„ฐ ์„ธํŠธ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ, ๊ฐ™์€ ์—ฐ์‚ฐ

GPU ์›Œํ”„ ์‹คํ–‰ ํŒจํ„ด:

  • ๋ช…๋ น: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์ผ: add r1, r2
  • ๋ ˆ์ธ 0: Data0์— ์—ฐ์‚ฐ โ†’ Result0
  • ๋ ˆ์ธ 1: Data1์— ์—ฐ์‚ฐ โ†’ Result1
  • ๋ ˆ์ธ 2: Data2์— ์—ฐ์‚ฐ โ†’ Result2
  • โ€ฆ (๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ์‹คํ–‰)
  • ๋ ˆ์ธ 31: Data31์— ์—ฐ์‚ฐ โ†’ Result31

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๊ฐ™์€ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.
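A Python sketch of this lockstep model: one decoded instruction, applied to every lane's private data (the data values are illustrative assumptions):

```python
WARP_SIZE = 32

def warp_execute(instruction, per_lane_data):
    # One instruction decode drives all 32 lanes; each lane applies the
    # same operation to its own data element
    return [instruction(x) for x in per_lane_data]

data = list(range(WARP_SIZE))                    # lane N -> Data N
results = warp_execute(lambda x: x + 10, data)   # every lane runs the same "add"
print(results[0], results[31])  # 10 41
```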

SIMT๊ฐ€ GPU์— ์ ํ•ฉํ•œ ์ด์œ 

GPU๋Š” ์ง€์—ฐ ์‹œ๊ฐ„์ด ์•„๋‹Œ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. SIMT๊ฐ€ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ๋“ค:

  • ํ•˜๋“œ์›จ์–ด ๋‹จ์ˆœํ™”: ํ•˜๋‚˜์˜ ๋ช…๋ น ๋””์ฝ”๋”๊ฐ€ 32๊ฐœ ๋˜๋Š” 64๊ฐœ ์Šค๋ ˆ๋“œ๋ฅผ ์ฒ˜๋ฆฌ
  • ์‹คํ–‰ ํšจ์œจ์„ฑ: ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ณต์žกํ•œ ์Šค์ผ€์ค„๋ง ๋ถˆํ•„์š”
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์ „๋ ฅ ํšจ์œจ์„ฑ: ๋ ˆ์ธ ์ „์ฒด์— ๊ฑธ์ณ ์ œ์–ด ๋กœ์ง ๊ณต์œ 

์›Œํ”„ ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜

๋ ˆ์ธ ๋ฒˆํ˜ธ์™€ ์‹๋ณ„

์›Œํ”„ ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ๋Š” 0๋ถ€ํ„ฐ WARP_SIZE-1๊นŒ์ง€์˜ ๋ ˆ์ธ ID๋ฅผ ๊ฐ–์Šต๋‹ˆ๋‹ค:

from gpu import lane_id
from gpu.primitives.warp import WARP_SIZE

# ์ปค๋„ ํ•จ์ˆ˜ ๋‚ด์—์„œ:
my_lane = lane_id()  # 0-31 (NVIDIA/RDNA) ๋˜๋Š” 0-63 (CDNA) ๋ฐ˜ํ™˜

ํ•ต์‹ฌ ํ†ต์ฐฐ: lane_id()๋Š” ๋น„์šฉ์ด ์—†์Šต๋‹ˆ๋‹ค - ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์ฝ์„ ๋ฟ์ž…๋‹ˆ๋‹ค.

์›Œํ”„ ๋‚ด ๋™๊ธฐํ™”

SIMT์˜ ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ์ธก๋ฉด: ์•”์‹œ์  ๋™๊ธฐํ™”.

# thread_idx.x < WARP_SIZE์ธ ๊ฒฝ์šฐ์˜ ์˜ˆ์‹œ

# 1. ๊ธฐ์กด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹:
shared[thread_idx.x] = partial_result
barrier()  # ๋ช…์‹œ์  ๋™๊ธฐํ™” ํ•„์š”
var total = shared[0] + shared[1] + ... + shared[WARP_SIZE - 1] # ํ•ฉ์‚ฐ ๋ฆฌ๋•์…˜

# 2. ์›Œํ”„ ๋ฐฉ์‹:
from gpu.primitives.warp import sum

var total = sum(partial_result)  # ์•”์‹œ์  ๋™๊ธฐํ™”!

์™œ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š” ์—†์„๊นŒ์š”? ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋ช…๋ น์„ ์ •ํ™•ํžˆ ๊ฐ™์€ ์‹œ์ ์— ์‹คํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. sum()์ด ์‹œ์ž‘๋  ๋•Œ, ๋ชจ๋“  ๋ ˆ์ธ์€ ์ด๋ฏธ partial_result ๊ณ„์‚ฐ์„ ๋งˆ์นœ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

์›Œํ”„ ๋ถ„๊ธฐ์™€ ์ˆ˜๋ ด

์กฐ๊ฑด ์ฝ”๋“œ์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ ๊นŒ?

if lane_id() % 2 == 0:
    # ์ง์ˆ˜ ๋ ˆ์ธ์ด ์ด ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰
    result = compute_even()
else:
    # ํ™€์ˆ˜ ๋ ˆ์ธ์ด ์ด ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰
    result = compute_odd()
# ๋ชจ๋“  ๋ ˆ์ธ์ด ์—ฌ๊ธฐ์„œ ์ˆ˜๋ ด

ํ•˜๋“œ์›จ์–ด ๋™์ž‘ ๋‹จ๊ณ„:

| ๋‹จ๊ณ„ | ํŽ˜์ด์ฆˆ | ํ™œ์„ฑ ๋ ˆ์ธ | ๋Œ€๊ธฐ ๋ ˆ์ธ | ํšจ์œจ | ์„ฑ๋Šฅ ๋น„์šฉ |
| --- | --- | --- | --- | --- | --- |
| 1 | ์กฐ๊ฑด ํ‰๊ฐ€ | 32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€ | ์—†์Œ | 100% | ์ •์ƒ ์†๋„ |
| 2 | ์ง์ˆ˜ ๋ ˆ์ธ ๋ถ„๊ธฐ | ๋ ˆ์ธ 0,2,4โ€ฆ30 (16๊ฐœ) | ๋ ˆ์ธ 1,3,5โ€ฆ31 (16๊ฐœ) | 50% | 2๋ฐฐ ๋А๋ฆผ |
| 3 | ํ™€์ˆ˜ ๋ ˆ์ธ ๋ถ„๊ธฐ | ๋ ˆ์ธ 1,3,5โ€ฆ31 (16๊ฐœ) | ๋ ˆ์ธ 0,2,4โ€ฆ30 (16๊ฐœ) | 50% | 2๋ฐฐ ๋А๋ฆผ |
| 4 | ์ˆ˜๋ ด | 32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€ | ์—†์Œ | 100% | ์ •์ƒ ์†๋„ ๋ณต๊ท€ |

์˜ˆ์‹œ ๋ถ„์„:

  • 2๋‹จ๊ณ„: ์ง์ˆ˜ ๋ ˆ์ธ๋งŒ compute_even()์„ ์‹คํ–‰ํ•˜๊ณ  ํ™€์ˆ˜ ๋ ˆ์ธ์€ ๋Œ€๊ธฐ
  • 3๋‹จ๊ณ„: ํ™€์ˆ˜ ๋ ˆ์ธ๋งŒ compute_odd()๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ์ง์ˆ˜ ๋ ˆ์ธ์€ ๋Œ€๊ธฐ
  • ์ด ์†Œ์š” ์‹œ๊ฐ„: time(compute_even) + time(compute_odd) (์ˆœ์ฐจ ์‹คํ–‰)
  • ๋ถ„๊ธฐ ์—†๋Š” ๊ฒฝ์šฐ: max(time(compute_even), time(compute_odd)) (๋ณ‘๋ ฌ ์‹คํ–‰)

์„ฑ๋Šฅ ์˜ํ–ฅ:

  1. ๋ถ„๊ธฐ: ์›Œํ”„๊ฐ€ ์‹คํ–‰์„ ๋ถ„๋ฆฌ - ์ผ๋ถ€ ๋ ˆ์ธ์€ ํ™œ์„ฑ, ๋‚˜๋จธ์ง€๋Š” ๋Œ€๊ธฐ
  2. ์ˆœ์ฐจ ์‹คํ–‰: ์„œ๋กœ ๋‹ค๋ฅธ ๊ฒฝ๋กœ๊ฐ€ ๋ณ‘๋ ฌ์ด ์•„๋‹Œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰
  3. ์ˆ˜๋ ด: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋‹ค์‹œ ํ•ฉ๋ฅ˜ํ•˜์—ฌ ํ•จ๊ป˜ ์ง„ํ–‰
  4. ๋น„์šฉ: ๋ถ„๊ธฐ๊ฐ€ ์žˆ๋Š” ์›Œํ”„๋Š” ํ†ตํ•ฉ ์‹คํ–‰ ๋Œ€๋น„ 2๋ฐฐ ์ด์ƒ์˜ ์‹œ๊ฐ„ ์†Œ์š”

์›Œํ”„ ํšจ์œจ์„ ์œ„ํ•œ ๋ชจ๋ฒ” ์‚ฌ๋ก€

์›Œํ”„ ํšจ์œจ ํŒจํ„ด

โœ… ์šฐ์ˆ˜: ๊ท ์ผ ์‹คํ–‰ (100% ํšจ์œจ)

# ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ž‘์—… ์ˆ˜ํ–‰ - ๋ถ„๊ธฐ ์—†์Œ
var partial = a[global_i] * b[global_i]
var total = sum(partial)

์„ฑ๋Šฅ: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ ํ™œ์„ฑ

โš ๏ธ ํ—ˆ์šฉ: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๋ถ„๊ธฐ (~95% ํšจ์œจ)

# lane_id() ๊ธฐ๋ฐ˜ ๋ถ„๊ธฐ - ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋จ
if lane_id() == 0:
    output[block_idx] = sum(partial)

์„ฑ๋Šฅ: ๋‹จ์ผ ๋ ˆ์ธ์˜ ์งง์€ ์—ฐ์‚ฐ, ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ํŒจํ„ด

๐Ÿ”ถ ์ฃผ์˜: ๊ตฌ์กฐํ™”๋œ ๋ถ„๊ธฐ (~50-75% ํšจ์œจ)

# ๊ทœ์น™์ ์ธ ํŒจํ„ด์€ ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ตœ์ ํ™” ๊ฐ€๋Šฅ
if (global_i / 4) % 2 == 0:
    result = method_a()
else:
    result = method_b()

์„ฑ๋Šฅ: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๊ทธ๋ฃน, ์ผ๋ถ€ ์ตœ์ ํ™” ๊ฐ€๋Šฅ

โŒ ํšŒํ”ผ: ๋ฐ์ดํ„ฐ ์˜์กด์  ๋ถ„๊ธฐ (~25-50% ํšจ์œจ)

# ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ๋ ˆ์ธ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฒฝ๋กœ๋ฅผ ํƒˆ ์ˆ˜ ์žˆ์Œ
if input[global_i] > threshold:  # ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๋ถ„๊ธฐ
    result = expensive_computation()
else:
    result = simple_computation()

์„ฑ๋Šฅ: ๋ฌด์ž‘์œ„ ๋ถ„๊ธฐ๊ฐ€ ์›Œํ”„ ํšจ์œจ์„ ๋–จ์–ด๋œจ๋ฆผ

๐Ÿ’€ ์ตœ์•…: ์ค‘์ฒฉ๋œ ๋ฐ์ดํ„ฐ ์˜์กด์  ๋ถ„๊ธฐ (~10-25% ํšจ์œจ)

# ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๋ถ„๊ธฐ์˜ ๋‹ค๋‹จ๊ณ„ ์ค‘์ฒฉ
if input[global_i] > threshold1:
    if input[global_i] > threshold2:
        result = very_expensive()
    else:
        result = expensive()
else:
    result = simple()

์„ฑ๋Šฅ: ์›Œํ”„ ํšจ์œจ์ด ์‚ฌ์‹ค์ƒ ๋ฌด๋„ˆ์ง

ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ํ˜ธํ™˜์„ฑ

NVIDIA vs AMD ์›Œํ”„ ํฌ๊ธฐ

from gpu.primitives.warp import WARP_SIZE

# NVIDIA GPUs:     WARP_SIZE = 32
# AMD RDNA GPUs:   WARP_SIZE = 32 (wavefront32 ๋ชจ๋“œ)
# AMD CDNA GPUs:   WARP_SIZE = 64 (์ „ํ†ต์ ์ธ wavefront64)

์™œ ์ค‘์š”ํ• ๊นŒ์š”:

  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ๋ณ‘ํ•ฉ๋œ ์ ‘๊ทผ์ด ์›Œํ”„ ํฌ๊ธฐ์— ์˜์กด
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„: ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ๊ฐ€ ์›Œํ”„ ํฌ๊ธฐ๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•จ
  • ์„ฑ๋Šฅ ํ™•์žฅ: AMD์—์„œ ์›Œํ”„๋‹น ๋ ˆ์ธ์ด 2๋ฐฐ

์ด์‹ ๊ฐ€๋Šฅํ•œ ์›Œํ”„ ์ฝ”๋“œ ์ž‘์„ฑ

์•„ํ‚คํ…์ฒ˜ ์ ์‘ ์ „๋žต

โœ… ์ด์‹ ๊ฐ€๋Šฅ: ํ•ญ์ƒ WARP_SIZE ์‚ฌ์šฉ

comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)  # ์ž๋™์œผ๋กœ ์ ์‘
comptime ELEMENTS_PER_WARP = WARP_SIZE       # ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ํ™•์žฅ

๊ฒฐ๊ณผ: NVIDIA/AMD (32)์™€ AMD (64) ๋ชจ๋‘์—์„œ ์ตœ์ ์œผ๋กœ ๋™์ž‘

โŒ ์ž˜๋ชป๋œ ๋ฐฉ์‹: ์›Œํ”„ ํฌ๊ธฐ๋ฅผ ํ•˜๋“œ์ฝ”๋”ฉํ•˜์ง€ ๋งˆ์„ธ์š”

comptime THREADS_PER_BLOCK = (32, 1)  # AMD GPU์—์„œ ๋™์ž‘ ์•ˆ ํ•จ!
comptime REDUCTION_SIZE = 32          # AMD์—์„œ ์ž˜๋ชป๋œ ๊ฐ’!

๊ฒฐ๊ณผ: AMD์—์„œ ์„ฑ๋Šฅ ์ €ํ•˜, ์ •ํ™•์„ฑ ๋ฌธ์ œ ๊ฐ€๋Šฅ

์‹ค์ œ ํ•˜๋“œ์›จ์–ด ์˜ํ–ฅ

GPU ์•„ํ‚คํ…์ฒ˜WARP_SIZE์›Œํ”„๋‹น ๋ฉ”๋ชจ๋ฆฌ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„๋ ˆ์ธ ํŒจํ„ด
NVIDIA/AMD RDNA32128 bytes (4ร—32)5๋‹จ๊ณ„: 32โ†’16โ†’8โ†’4โ†’2โ†’1๋ ˆ์ธ 0-31
AMD CDNA64256 bytes (4ร—64)6๋‹จ๊ณ„: 64โ†’32โ†’16โ†’8โ†’4โ†’2โ†’1๋ ˆ์ธ 0-63

64 vs 32์˜ ์„ฑ๋Šฅ ์ฐจ์ด:

  • CDNA ์žฅ์ : ์›Œํ”„๋‹น 2๋ฐฐ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  • CDNA ์žฅ์ : ์›Œํ”„๋‹น 2๋ฐฐ์˜ ์—ฐ์‚ฐ๋Ÿ‰
  • NVIDIA/RDNA ์žฅ์ : ๋ธ”๋ก๋‹น ๋” ๋งŽ์€ ์›Œํ”„ (๋” ๋†’์€ ์ ์œ ์œจ)
  • ์ฝ”๋“œ ์ด์‹์„ฑ: ๊ฐ™์€ ์†Œ์Šค ์ฝ”๋“œ๋กœ ์–‘์ชฝ ๋ชจ๋‘ ์ตœ์  ์„ฑ๋Šฅ

์›Œํ”„์™€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

โœ… ์™„๋ฒฝ: ๋ณ‘ํ•ฉ๋œ ์ ‘๊ทผ (100% ๋Œ€์—ญํญ ํ™œ์šฉ)

# ์ธ์ ‘ ๋ ˆ์ธ โ†’ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ
var value = input[global_i]  # ๋ ˆ์ธ 0โ†’input[0], ๋ ˆ์ธ 1โ†’input[1], ๋“ฑ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

| ์ ‘๊ทผ ํŒจํ„ด | NVIDIA/RDNA (32 ๋ ˆ์ธ) | CDNA (64 ๋ ˆ์ธ) | ๋Œ€์—ญํญ ํ™œ์šฉ | ์„ฑ๋Šฅ |
| --- | --- | --- | --- | --- |
| โœ… ๋ณ‘ํ•ฉ | ๋ ˆ์ธ N โ†’ ์ฃผ์†Œ 4ร—N | ๋ ˆ์ธ N โ†’ ์ฃผ์†Œ 4ร—N | 100% | ์ตœ์  |
| | 1ํšŒ ํŠธ๋žœ์žญ์…˜: 128 bytes | 1ํšŒ ํŠธ๋žœ์žญ์…˜: 256 bytes | ์ „์ฒด ๋ฒ„์Šค ํญ | ๋น ๋ฆ„ |
| โŒ ๋ถ„์‚ฐ | ๋ ˆ์ธ N โ†’ ์ž„์˜ ์ฃผ์†Œ | ๋ ˆ์ธ N โ†’ ์ž„์˜ ์ฃผ์†Œ | ~6% | ์ตœ์•… |
| | 32ํšŒ ๊ฐœ๋ณ„ ํŠธ๋žœ์žญ์…˜ | 64ํšŒ ๊ฐœ๋ณ„ ํŠธ๋žœ์žญ์…˜ | ๋Œ€๋ถ€๋ถ„ ์œ ํœด ๋ฒ„์Šค | 32๋ฐฐ ๋А๋ฆผ |

์ฃผ์†Œ ์˜ˆ์‹œ:

  • ๋ณ‘ํ•ฉ: ๋ ˆ์ธ 0โ†’0, ๋ ˆ์ธ 1โ†’4, ๋ ˆ์ธ 2โ†’8, ๋ ˆ์ธ 3โ†’12, โ€ฆ
  • ๋ถ„์‚ฐ: ๋ ˆ์ธ 0โ†’1000, ๋ ˆ์ธ 1โ†’52, ๋ ˆ์ธ 2โ†’997, ๋ ˆ์ธ 3โ†’8, โ€ฆ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ

๋ฑ…ํฌ ์ถฉ๋Œ์ด๋ž€?

GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋™์‹œ ์ ‘๊ทผ์ด ๊ฐ€๋Šฅํ•œ 32๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋ฑ…ํฌ๋กœ ๋‚˜๋‰˜์–ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„ ๋‚ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ๋™์‹œ์— ์ ‘๊ทผํ•˜๋ ค ํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ ‘๊ทผ์„ ์ง๋ ฌํ™”ํ•ด์•ผ ํ•˜๋ฏ€๋กœ, ๋‹จ์ผ ์‚ฌ์ดํด์ด์–ด์•ผ ํ•  ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ ์‚ฌ์ดํด๋กœ ๋Š˜์–ด๋‚ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์ถฉ๋Œ ์—†์Œ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผ โ†’ ๋ชจ๋“  ์ ‘๊ทผ์ด ๋™์‹œ์— ๋ฐœ์ƒ (1 ์‚ฌ์ดํด)
  • ๋ฑ…ํฌ ์ถฉ๋Œ: ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผ โ†’ ์ ‘๊ทผ์ด ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฐœ์ƒ (N๊ฐœ ์Šค๋ ˆ๋“œ์— N ์‚ฌ์ดํด)
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ์ฃผ์†Œ์— ์ ‘๊ทผ โ†’ ํ•˜๋“œ์›จ์–ด๊ฐ€ 1 ์‚ฌ์ดํด๋กœ ์ตœ์ ํ™”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ๊ตฌ์„ฑ:

๋ฑ…ํฌ์ฃผ์†Œ (๋ฐ”์ดํŠธ ์˜คํ”„์…‹)์˜ˆ์‹œ ๋ฐ์ดํ„ฐ (float32)
๋ฑ…ํฌ 00, 128, 256, 384, โ€ฆshared[0], shared[32], shared[64], โ€ฆ
๋ฑ…ํฌ 14, 132, 260, 388, โ€ฆshared[1], shared[33], shared[65], โ€ฆ
๋ฑ…ํฌ 28, 136, 264, 392, โ€ฆshared[2], shared[34], shared[66], โ€ฆ
โ€ฆโ€ฆโ€ฆ
๋ฑ…ํฌ 31124, 252, 380, 508, โ€ฆshared[31], shared[63], shared[95], โ€ฆ

๋ฑ…ํฌ ์ถฉ๋Œ ์˜ˆ์‹œ:

์ ‘๊ทผ ํŒจํ„ด๋ฑ…ํฌ ์‚ฌ์šฉ์‚ฌ์ดํด์„ฑ๋Šฅ์„ค๋ช…
โœ… ์ˆœ์ฐจ์ shared[thread_idx.x]1 ์‚ฌ์ดํด100%๊ฐ ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ฑ…ํฌ ์ ‘๊ทผ
๋ ˆ์ธ 0โ†’๋ฑ…ํฌ 0, ๋ ˆ์ธ 1โ†’๋ฑ…ํฌ 1, โ€ฆ์ตœ์ ์ถฉ๋Œ ์—†์Œ
โœ… ๋™์ผ ์ธ๋ฑ์Šคshared[0]1 ์‚ฌ์ดํด100%๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ฃผ์†Œ์—์„œ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€โ†’๋ฑ…ํฌ 0 (๊ฐ™์€ ์ฃผ์†Œ)์ตœ์ ์ถฉ๋Œ ์—†์Œ
โŒ ์ŠคํŠธ๋ผ์ด๋“œ 2shared[thread_idx.x * 2]2 ์‚ฌ์ดํด50%๋ฑ…ํฌ๋‹น 2๊ฐœ ๋ ˆ์ธ
๋ ˆ์ธ 0,16โ†’๋ฑ…ํฌ 0; ๋ ˆ์ธ 1,17โ†’๋ฑ…ํฌ 12๋ฐฐ ๋А๋ฆผ์ง๋ ฌํ™”๋œ ์ ‘๊ทผ
๐Ÿ’€ ์ŠคํŠธ๋ผ์ด๋“œ 32shared[thread_idx.x * 32]32 ์‚ฌ์ดํด3%๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ๋ฑ…ํฌ ์ ‘๊ทผ
32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€โ†’๋ฑ…ํฌ 0 (๋‹ค๋ฅธ ์ฃผ์†Œ)32๋ฐฐ ๋А๋ฆผ์™„์ „ํžˆ ์ง๋ ฌํ™”

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ์‹ค์ „ ํ™œ์šฉ

์›Œํ”„ ์—ฐ์‚ฐ์ด ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ๊ฒฝ์šฐ

  1. ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ: sum(), max() ๋“ฑ
  2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ: shuffle_idx()๋กœ ๊ฐ’ ๊ณต์œ 
  3. ์ด์›ƒ ํ†ต์‹ : shuffle_down()์œผ๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
  4. ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ: prefix_sum()์œผ๋กœ scan ์•Œ๊ณ ๋ฆฌ์ฆ˜

์„ฑ๋Šฅ ํŠน์„ฑ

| ์—ฐ์‚ฐ ์œ ํ˜• | ๊ธฐ์กด ๋ฐฉ์‹ | ์›Œํ”„ ์—ฐ์‚ฐ |
| --- | --- | --- |
| ๋ฆฌ๋•์…˜ (32๊ฐœ ์š”์†Œ) | ~20๊ฐœ ๋ช…๋ น | 10๊ฐœ ๋ช…๋ น |
| ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ | ๋†’์Œ | ์ตœ์†Œ |
| ๋™๊ธฐํ™” ๋น„์šฉ | ๋น„์šฉ ๋†’์Œ | ๋ฌด๋ฃŒ |
| ์ฝ”๋“œ ๋ณต์žก๋„ | ๋†’์Œ | ๋‚ฎ์Œ |

๋‹ค์Œ ๋‹จ๊ณ„

SIMT์˜ ๊ธฐ๋ฐ˜์„ ์ดํ•ดํ–ˆ์œผ๋‹ˆ, ์ด ๊ฐœ๋…์ด ์–ด๋–ป๊ฒŒ ๊ฐ•๋ ฅํ•œ ์›Œํ”„ ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š”์ง€ ์•Œ์•„๋ณผ ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค. ๋‹ค์Œ ์„น์…˜์—์„œ๋Š” sum()์ด ๋ณต์žกํ•œ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ์œจ์ ์ธ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

โ†’ ๋‹ค์Œ: warp.sum()์˜ ํ•ต์‹ฌ

warp.sum()์˜ ํ•ต์‹ฌ - ์›Œํ”„ ๋ ˆ๋ฒจ ๋‚ด์ 

Puzzle 12์—์„œ ์‚ดํŽด๋ณธ ๋‚ด์ ์„ Mojo์˜ ์›Œํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•œ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์›Œํ”„ ๋ ˆ์ธ์ด ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  warp.sum()์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ์ž๋™์œผ๋กœ ํ•ฉ์‚ฐํ•˜์—ฌ, ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด GPU ๋™๊ธฐํ™”๋ฅผ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: warp.sum() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ๋‹จ์ผ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ช…๋ น์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • warp.sum()์„ ํ™œ์šฉํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜
  • SIMT ์‹คํ–‰ ๋ชจ๋ธ๊ณผ ๋ ˆ์ธ ๋™๊ธฐํ™”
  • WARP_SIZE๋ฅผ ํ™œ์šฉํ•œ ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ํ˜ธํ™˜์„ฑ
  • ๋ณต์žกํ•œ ํŒจํ„ด์—์„œ ๊ฐ„๋‹จํ•œ ํŒจํ„ด์œผ๋กœ์˜ ์„ฑ๋Šฅ ๋ณ€ํ™˜
  • ๋ ˆ์ธ ID ๊ด€๋ฆฌ์™€ ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ

์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‚ด์ ์ž…๋‹ˆ๋‹ค: \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]

ํ•˜์ง€๋งŒ ๊ตฌํ˜„ ๊ณผ์ •์—์„œ Mojo์˜ ๋ชจ๋“  ์›Œํ”„ ๋ ˆ๋ฒจ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ์šฉ๋˜๋Š” ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.
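As a sanity check on the arithmetic: if the vectors were initialized as a[i] = b[i] = i (an assumption for illustration; the actual initialization lives in the puzzle's host code), the dot product for SIZE = 32 works out to 10416.0:

```python
SIZE = 32
a = [float(i) for i in range(SIZE)]  # assumed initialization: a[i] = i
b = [float(i) for i in range(SIZE)]  # assumed initialization: b[i] = i
expected = sum(x * y for x, y in zip(a, b))
print(expected)  # 10416.0
```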

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D ํ–‰ ์šฐ์„ )

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ณต์žก์„ฑ (Puzzle 12์—์„œ)

solutions/p12/p12.mojo์˜ ๋ณต์žกํ•œ ๋ฐฉ์‹์„ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค:

comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype]()
comptime in_layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(1)


fn traditional_dot_product_p12_style[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    """
    This is the complex approach from p12_layout_tensor.mojo - kept for comparison.
    """
    shared = LayoutTensor[
        dtype,
        Layout.row_major(WARP_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    if global_i < size:
        shared[local_i] = (a[global_i] * b[global_i]).reduce_add()
    else:
        shared[local_i] = 0.0

    barrier()

    stride = WARP_SIZE // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    if local_i == 0:
        output[global_i // WARP_SIZE] = shared[0]


์ด ๋ฐฉ์‹์ด ๋ณต์žกํ•œ ์ด์œ :

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ธ”๋ก ๋‚ด์—์„œ ์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด: ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•œ barrier() ํ˜ธ์ถœ
  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ณต์žกํ•œ ๋ฃจํ”„
  • ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก

๋™์ž‘์€ ํ•˜์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์žฅํ™ฉํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ GPU ๋™๊ธฐํ™”์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ์กด ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p24 --traditional
pixi run -e amd p24 --traditional
pixi run -e apple p24 --traditional
uv run poe p24 --traditional

์™„์„ฑํ•  ์ฝ”๋“œ

1. ๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„ ๋ฐฉ์‹

๋ณต์žกํ•œ ๊ธฐ์กด ๋ฐฉ์‹์„ warp_sum()์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    # FILL IN (6 lines at most)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p24/p24.mojo

ํŒ

1. ๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

simple_warp_dot_product ํ•จ์ˆ˜๋ฅผ 6์ค„ ์ด๋‚ด๋กœ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

fn simple_warp_dot_product[...](output, a, b):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # ์—ฌ๊ธฐ๋ฅผ ์ฑ„์šฐ์„ธ์š” (์ตœ๋Œ€ 6์ค„)

๋”ฐ๋ผ์•ผ ํ•  ํŒจํ„ด:

  1. ์ด ์Šค๋ ˆ๋“œ์˜ ์š”์†Œ์— ๋Œ€ํ•œ ๋ถ€๋ถ„๊ณฑ ๊ณ„์‚ฐ
  2. warp_sum()์œผ๋กœ ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์˜ ๊ฐ’์„ ํ•ฉ์‚ฐ
  3. ๋ ˆ์ธ 0์ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก

2. ๋ถ€๋ถ„๊ณฑ ๊ณ„์‚ฐํ•˜๊ธฐ

var partial_product: Scalar[dtype] = 0
if global_i < size:
    partial_product = (a[global_i] * b[global_i]).reduce_add()

.reduce_add()๊ฐ€ ํ•„์š”ํ•œ ์ด์œ : Mojo์˜ ๊ฐ’์€ SIMD ๊ธฐ๋ฐ˜์ด๋ฏ€๋กœ a[global_i] * b[global_i]๋Š” SIMD ๋ฒกํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. .reduce_add()๋กœ ๋ฒกํ„ฐ๋ฅผ ์Šค์นผ๋ผ ๊ฐ’์œผ๋กœ ํ•ฉ์‚ฐํ•ฉ๋‹ˆ๋‹ค.

๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ ํšจํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š์„ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

3. ์›Œํ”„ ๋ฆฌ๋•์…˜์˜ ๋งˆ๋ฒ•

total = warp_sum(partial_product)

warp_sum()์ด ํ•˜๋Š” ์ผ:

  • ๊ฐ ๋ ˆ์ธ์˜ partial_product ๊ฐ’์„ ๊ฐ€์ ธ์˜ด
  • ์›Œํ”„ ๋‚ด ๋ชจ๋“  ๋ ˆ์ธ์˜ ๊ฐ’์„ ํ•ฉ์‚ฐ (ํ•˜๋“œ์›จ์–ด ๊ฐ€์†)
  • ๋ชจ๋“  ๋ ˆ์ธ์— ๊ฐ™์€ ํ•ฉ๊ณ„๋ฅผ ๋ฐ˜ํ™˜ (๋ ˆ์ธ 0๋งŒ์ด ์•„๋‹˜)
  • ๋ช…์‹œ์  ๋™๊ธฐํ™”๊ฐ€ ์ „ํ˜€ ํ•„์š” ์—†์Œ (SIMT๊ฐ€ ์ฒ˜๋ฆฌ)

4. ๊ฒฐ๊ณผ ๊ธฐ๋กํ•˜๊ธฐ

if lane_id() == 0:
    output[global_i // WARP_SIZE] = total

์™œ ๋ ˆ์ธ 0๋งŒ? warp_sum() ์ดํ›„ ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ total ๊ฐ’์„ ๊ฐ–์ง€๋งŒ, ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ•œ ๋ฒˆ๋งŒ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค.

์™œ output[0]์— ์ง์ ‘ ์“ฐ์ง€ ์•Š์„๊นŒ? ์œ ์—ฐ์„ฑ์„ ์œ„ํ•ด์„œ์ž…๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์›Œํ”„๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์ธ ๊ฒฝ์šฐ์—๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๊ฐ ์›Œํ”„์˜ ๊ฒฐ๊ณผ๊ฐ€ global_i // WARP_SIZE ์œ„์น˜์— ๊ธฐ๋ก๋ฉ๋‹ˆ๋‹ค.

lane_id(): 0-31 (NVIDIA/AMD RDNA) ๋˜๋Š” 0-63 (AMD CDNA)์„ ๋ฐ˜ํ™˜ - ์›Œํ”„ ๋‚ด์—์„œ ์–ด๋А ๋ ˆ์ธ์ธ์ง€ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„ ํ…Œ์ŠคํŠธ:

uv run poe p24 --kernel
pixi run p24 --kernel

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
๐Ÿš€ Notice how simple the warp version is compared to p12.mojo!
   Same kernel structure, but warp_sum() replaces all the complexity!

ํ’€์ด

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        output[global_i // WARP_SIZE] = total


๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„์€ ๋ณต์žกํ•œ ๋™๊ธฐํ™”์—์„œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋กœ์˜ ๊ทผ๋ณธ์ ์ธ ๋ณ€ํ™˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ธฐ์กด ๋ฐฉ์‹์—์„œ ์‚ฌ๋ผ์ง„ ๊ฒƒ๋“ค:

  • 15์ค„ ์ด์ƒ โ†’ 6์ค„: ํš๊ธฐ์ ์ธ ์ฝ”๋“œ ์ถ•์†Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ๋ถˆํ•„์š”
  • 3ํšŒ ์ด์ƒ์˜ barrier() ํ˜ธ์ถœ: ๋ช…์‹œ์  ๋™๊ธฐํ™” ์ œ๋กœ
  • ๋ณต์žกํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ: ์™„์ „ํžˆ ์ œ๊ฑฐ

SIMT ์‹คํ–‰ ๋ชจ๋ธ:

์›Œํ”„ ๋ ˆ์ธ (SIMT ์‹คํ–‰):
๋ ˆ์ธ 0: partial_product = a[0] * b[0]    = 0.0
๋ ˆ์ธ 1: partial_product = a[1] * b[1]    = 4.0
๋ ˆ์ธ 2: partial_product = a[2] * b[2]    = 16.0
...
๋ ˆ์ธ 31: partial_product = a[31] * b[31] = 3844.0

warp_sum() ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ:
๋ชจ๋“  ๋ ˆ์ธ โ†’ 0.0 + 4.0 + 16.0 + ... + 3844.0 = 10416.0
๋ชจ๋“  ๋ ˆ์ธ์ด ์ˆ˜์‹  โ†’ total = 10416.0 (๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฒฐ๊ณผ)

๋ฐฐ๋ฆฌ์–ด ์—†์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. SIMT ์‹คํ–‰: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋ช…๋ น ๋™์‹œ ์‹คํ–‰
  2. ํ•˜๋“œ์›จ์–ด ๋™๊ธฐํ™”: warp_sum()์ด ์‹œ์ž‘๋  ๋•Œ ๋ชจ๋“  ๋ ˆ์ธ์ด ์ด๋ฏธ partial_product ๊ณ„์‚ฐ ์™„๋ฃŒ
  3. ๋‚ด์žฅ ํ†ต์‹ : GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ
  4. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ total ๊ฐ’ ์ˆ˜์‹ 

2. ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹

์ด๋ฒˆ์—๋Š” Mojo์˜ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ™์€ ์›Œํ”„ ๋‚ด์ ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

fn functional_warp_dot_product[
    layout: Layout,
    out_layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
](
    output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn compute_dot_product[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]
        print("idx:", idx)
        # FILL IN (10 lines at most)

    # Launch exactly size == WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](size, ctx)


ํŒ

1. ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์˜ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

compute_dot_product ํ•จ์ˆ˜๋ฅผ 10์ค„ ์ด๋‚ด๋กœ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

@parameter
@always_inline
fn compute_dot_product[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    idx = indices[0]
    # ์—ฌ๊ธฐ๋ฅผ ์ฑ„์šฐ์„ธ์š” (์ตœ๋Œ€ 10์ค„)

ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์˜ ์ฐจ์ด์ :

  • elementwise๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •ํ™•ํžˆ WARP_SIZE๊ฐœ์˜ ์Šค๋ ˆ๋“œ ์‹คํ–‰
  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ idx๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋‚˜์˜ ์š”์†Œ ์ฒ˜๋ฆฌ
  • ๊ฐ™์€ ์›Œํ”„ ์—ฐ์‚ฐ, ๋‹ค๋ฅธ ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜

2. ๋ถ€๋ถ„๊ณฑ ๊ณ„์‚ฐํ•˜๊ธฐ

var partial_product: Scalar[dtype] = 0.0
if idx < size:
    a_val = a.load[1](idx, 0)
    b_val = b.load[1](idx, 0)
    partial_product = (a_val * b_val).reduce_add()
else:
    partial_product = 0.0

๋กœ๋”ฉ ํŒจํ„ด: a.load[1](idx, 0)์€ ์œ„์น˜ idx์—์„œ ์ •ํ™•ํžˆ 1๊ฐœ ์š”์†Œ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค (SIMD ๋ฒกํ„ฐํ™” ์—†์Œ).

๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์Šค๋ ˆ๋“œ์˜ partial_product๋ฅผ 0.0์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ํ•ฉ์‚ฐ์— ๊ธฐ์—ฌํ•˜์ง€ ์•Š๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

3. ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์ €์žฅ

total = warp_sum(partial_product)

if lane_id() == 0:
    output.store[1](Index(idx // WARP_SIZE), total)

์ €์žฅ ํŒจํ„ด: output.store[1](Index(idx // WARP_SIZE), total)์€ ์ถœ๋ ฅ ํ…์„œ์˜ ์œ„์น˜ idx // WARP_SIZE์— 1๊ฐœ ์š”์†Œ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

๋™์ผํ•œ ์›Œํ”„ ๋กœ์ง: warp_sum()๊ณผ ๋ ˆ์ธ 0์˜ ๊ธฐ๋ก ๋กœ์ง์€ ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์—์„œ๋„ ๋™์ผํ•˜๊ฒŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

4. import์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•จ์ˆ˜๋“ค

from gpu import lane_id
from gpu.primitives.warp import sum as warp_sum, WARP_SIZE

# ํ•จ์ˆ˜ ๋‚ด์—์„œ:
my_lane = lane_id()           # 0 ~ WARP_SIZE-1
total = warp_sum(my_value)    # ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ฆฌ๋•์…˜
warp_size = WARP_SIZE         # 32 (NVIDIA) ๋˜๋Š” 64 (AMD)

ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

uv run poe p24 --functional
pixi run p24 --functional

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
๐Ÿ”ง Functional approach shows modern Mojo style with warp operations!
   Clean, composable, and still leverages warp hardware primitives!

ํ’€์ด

fn functional_warp_dot_product[
    layout: Layout,
    out_layout: Layout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
](
    output: LayoutTensor[mut=True, dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    b: LayoutTensor[mut=False, dtype, layout, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    fn compute_dot_product[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        idx = indices[0]

        # Each thread computes one partial product
        var partial_product: Scalar[dtype] = 0.0
        if idx < size:
            a_val = a.load[1](Index(idx))
            b_val = b.load[1](Index(idx))
            partial_product = a_val * b_val
        else:
            partial_product = 0.0

        # Warp magic - combines all WARP_SIZE partial products!
        total = warp_sum(partial_product)

        # Only lane 0 writes the result (all lanes have the same total)
        if lane_id() == 0:
            output.store[1](Index(idx // WARP_SIZE), total)

    # Launch exactly size == WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](size, ctx)


ํ•จ์ˆ˜ํ˜• ์›Œํ”„ ๋ฐฉ์‹์€ ์›Œํ”„ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•œ ํ˜„๋Œ€์ ์ธ Mojo ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์˜ ํŠน์ง•:

elementwise[compute_dot_product, 1, target="gpu"](size, ctx)

์žฅ์ :

  • ํƒ€์ž… ์•ˆ์ „์„ฑ: ์ปดํŒŒ์ผ ํƒ€์ž„ ํ…์„œ ๋ ˆ์ด์•„์›ƒ ๊ฒ€์‚ฌ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ: ๋‹ค๋ฅธ ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ํ†ตํ•ฉ
  • ํ˜„๋Œ€์  ํŒจํ„ด: Mojo์˜ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋Šฅ ํ™œ์šฉ
  • ์ž๋™ ์ตœ์ ํ™”: ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ๊ณ ์ˆ˜์ค€ ์ตœ์ ํ™”๋ฅผ ์ ์šฉ ๊ฐ€๋Šฅ

์ปค๋„ ๋ฐฉ์‹๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด:

  • ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜: enqueue_function ๋Œ€์‹  elementwise ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: .load[1]()๊ณผ .store[1]() ํŒจํ„ด ์‚ฌ์šฉ
  • ํ†ตํ•ฉ์„ฑ: ๋‹ค๋ฅธ ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ๊ณผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๊ฒฐํ•ฉ

๋™์ผํ•œ ์›Œํ”„์˜ ์ด์ :

  • ๋™๊ธฐํ™” ์ œ๋กœ: warp_sum()์ด ๋™์ผํ•˜๊ฒŒ ๋™์ž‘
  • ํ•˜๋“œ์›จ์–ด ๊ฐ€์†: ์ปค๋„ ๋ฐฉ์‹๊ณผ ๊ฐ™์€ ์„ฑ๋Šฅ
  • ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜: WARP_SIZE๊ฐ€ ์ž๋™์œผ๋กœ ์ ์‘

๋ฒค์น˜๋งˆํฌ๋ฅผ ํ†ตํ•œ ์„ฑ๋Šฅ ๋น„๊ต

์ข…ํ•ฉ ๋ฒค์น˜๋งˆํฌ๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์›Œํ”„ ์—ฐ์‚ฐ์˜ ํ™•์žฅ์„ฑ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

uv run poe p24 --benchmark
pixi run p24 --benchmark

์ „์ฒด ๋ฒค์น˜๋งˆํฌ ์‹คํ–‰ ๊ฒฐ๊ณผ์˜ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
--------------------------------------------------------------------------------
Testing SIZE=1 x WARP_SIZE, BLOCKS=1
Running traditional_1x
Running simple_warp_1x
Running functional_warp_1x
--------------------------------------------------------------------------------
Testing SIZE=4 x WARP_SIZE, BLOCKS=4
Running traditional_4x
Running simple_warp_4x
Running functional_warp_4x
--------------------------------------------------------------------------------
Testing SIZE=32 x WARP_SIZE, BLOCKS=32
Running traditional_32x
Running simple_warp_32x
Running functional_warp_32x
--------------------------------------------------------------------------------
Testing SIZE=256 x WARP_SIZE, BLOCKS=256
Running traditional_256x
Running simple_warp_256x
Running functional_warp_256x
--------------------------------------------------------------------------------
Testing SIZE=2048 x WARP_SIZE, BLOCKS=2048
Running traditional_2048x
Running simple_warp_2048x
Running functional_warp_2048x
--------------------------------------------------------------------------------
Testing SIZE=16384 x WARP_SIZE, BLOCKS=16384 (Large Scale)
Running traditional_16384x
Running simple_warp_16384x
Running functional_warp_16384x
--------------------------------------------------------------------------------
Testing SIZE=65536 x WARP_SIZE, BLOCKS=65536 (Massive Scale)
Running traditional_65536x
Running simple_warp_65536x
Running functional_warp_65536x
| name                   | met (ms)              | iters |
| ---------------------- | --------------------- | ----- |
| traditional_1x         | 0.00460128            | 100   |
| simple_warp_1x         | 0.00574047            | 100   |
| functional_warp_1x     | 0.00484192            | 100   |
| traditional_4x         | 0.00492671            | 100   |
| simple_warp_4x         | 0.00485247            | 100   |
| functional_warp_4x     | 0.00587679            | 100   |
| traditional_32x        | 0.0062406399999999996 | 100   |
| simple_warp_32x        | 0.0054918400000000004 | 100   |
| functional_warp_32x    | 0.00552447            | 100   |
| traditional_256x       | 0.0050614300000000004 | 100   |
| simple_warp_256x       | 0.00488768            | 100   |
| functional_warp_256x   | 0.00461472            | 100   |
| traditional_2048x      | 0.01120031            | 100   |
| simple_warp_2048x      | 0.00884383            | 100   |
| functional_warp_2048x  | 0.007038720000000001  | 100   |
| traditional_16384x     | 0.038533750000000005  | 100   |
| simple_warp_16384x     | 0.0323264             | 100   |
| functional_warp_16384x | 0.01674271            | 100   |
| traditional_65536x     | 0.19784991999999998   | 100   |
| simple_warp_65536x     | 0.12870176            | 100   |
| functional_warp_65536x | 0.048680310000000004  | 100   |

Benchmarks completed!

WARP OPERATIONS PERFORMANCE ANALYSIS:
   GPU Architecture: NVIDIA (WARP_SIZE=32) vs AMD (WARP_SIZE=64)
   - 1,...,256 x WARP_SIZE: Grid size too small to benchmark
   - 2048 x WARP_SIZE: Warp primative benefits emerge
   - 16384 x WARP_SIZE: Large scale (512K-1M elements)
   - 65536 x WARP_SIZE: Massive scale (2M-4M elements)

   Expected Results at Large Scales:
   โ€ข Traditional: Slower due to more barrier overhead
   โ€ข Warp operations: Faster, scale better with problem size
   โ€ข Memory bandwidth becomes the limiting factor

์ด ์˜ˆ์‹œ์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ:

  • ์†Œ๊ทœ๋ชจ (1x-4x): ์›Œํ”„ ์—ฐ์‚ฐ์ด ์†Œํญ์˜ ๊ฐœ์„ ์„ ๋ณด์ž„ (~10-15% ๋น ๋ฆ„)
  • ์ค‘๊ทœ๋ชจ (32x-256x): ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  • ๋Œ€๊ทœ๋ชจ (16K-65K): ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ง€๋ฐฐ์ ์ด ๋˜๋ฉด์„œ ๋ชจ๋“  ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ์ด ์ˆ˜๋ ด
  • ๋ณ€๋™์„ฑ: ์„ฑ๋Šฅ์€ ํŠน์ • GPU ์•„ํ‚คํ…์ฒ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ์„œ๋ธŒ์‹œ์Šคํ…œ์— ํฌ๊ฒŒ ์˜์กด

์ฐธ๊ณ : ํ•˜๋“œ์›จ์–ด(GPU ๋ชจ๋ธ, ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ, WARP_SIZE)์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๊ฐ€ ํฌ๊ฒŒ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ ์ ˆ๋Œ€์ ์ธ ์ˆ˜์น˜๋ณด๋‹ค ์ƒ๋Œ€์ ์ธ ์„ฑ๋Šฅ ์ถ”์„ธ๋ฅผ ๊ด€์ฐฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

warp.sum ์—ฐ์‚ฐ์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ: ์›Œํ”„ vs ๊ธฐ์กด ๋ฐฉ์‹์— ๋Œ€ํ•œ ์ „๋žต์  ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ
  • ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ: ๋ณต์žกํ•œ ํ†ต์‹  ํŒจํ„ด์„ ์œ„ํ•œ shuffle_idx(), shuffle_down(), prefix_sum()
  • ๋ฉ€ํ‹ฐ ์›Œํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋™๊ธฐํ™”์˜ ๊ฒฐํ•ฉ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ์ตœ์ ํ™”: ์ตœ๋Œ€ ๋Œ€์—ญํญ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ์›Œํ”„ ์—ฐ์‚ฐ์€ ๋ณต์žกํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋Œ€์ฒดํ•˜์—ฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์‹คํ–‰ ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๋ฉด ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š๊ณ ๋„ ํš๊ธฐ์ ์ธ ๋‹จ์ˆœํ™”๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ

๋น ๋ฅธ ํŒ๋‹จ ๊ฐ€์ด๋“œ

โœ… ์›Œํ”„ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • 32๊ฐœ ์ด์ƒ์˜ ์š”์†Œ์— ๋Œ€ํ•œ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ (sum, max, min)
  • ๊ทœ์น™์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด (์ธ์ ‘ ๋ ˆ์ธ โ†’ ์ธ์ ‘ ์ฃผ์†Œ)
  • ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ์ด์‹์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ (NVIDIA/RDNA 32 vs CDNA 64 ์Šค๋ ˆ๋“œ)
  • ๋” ๊ฐ„๋‹จํ•˜๊ณ  ์œ ์ง€๋ณด์ˆ˜ํ•˜๊ธฐ ์‰ฌ์šด ์ฝ”๋“œ๋ฅผ ์›ํ•  ๋•Œ

โŒ ๊ธฐ์กด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • ๋ณต์žกํ•œ ์›Œํ”„ ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™ํ•˜๊ฑฐ๋‚˜ ์‚ฐ๋ฐœ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ๋ณ„ ์ž‘์—…๋Ÿ‰์ด ๋‹ค๋ฅธ ๊ฒฝ์šฐ (์›Œํ”„ ๋ถ„๊ธฐ ๋ฐœ์ƒ)
  • ๋ฌธ์ œ ํฌ๊ธฐ๊ฐ€ size < WARP_SIZE์ธ ๊ฒฝ์šฐ

์„ฑ๋Šฅ ํŠน์„ฑ

๋ฌธ์ œ ํฌ๊ธฐ๋ณ„ ํ™•์žฅ์„ฑ

| ์š”์†Œ ์ˆ˜ | ์›Œํ”„ ์ด์  | ๋น„๊ณ  |
| --- | --- | --- |
| < 32 | ์—†์Œ | ๊ธฐ์กด ๋ฐฉ์‹์ด ์œ ๋ฆฌ |
| 32-1K | 1.2-1.5๋ฐฐ | ์ด์ ์ด ๋‚˜ํƒ€๋‚˜๊ธฐ ์‹œ์ž‘ |
| 1K-32K | 1.5-2.5๋ฐฐ | ์›Œํ”„ ์—ฐ์‚ฐ์ด ํƒ์›” |
| > 32K | ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ | ์–‘์ชฝ ๋ชจ๋‘ ๋Œ€์—ญํญ์— ์˜ํ•ด ์ œํ•œ |

์›Œํ”„์˜ ํ•ต์‹ฌ ์ด์ 

  • ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๋ฐฐ๋ฆฌ์–ด ๋น„์šฉ ์ œ๊ฑฐ
  • ์ตœ์†Œํ•œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  • ์šฐ์ˆ˜ํ•œ ํ™•์žฅ์„ฑ: ์›Œํ”„ ์ˆ˜๊ฐ€ ๋Š˜์–ด๋‚ ์ˆ˜๋ก ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • ๊ฐ„๊ฒฐํ•œ ์ฝ”๋“œ: ๋” ์ ์€ ์ค„ ์ˆ˜, ๋” ์ ์€ ์˜ค๋ฅ˜ ๊ฐ€๋Šฅ์„ฑ

์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณ„ ๊ฐ€์ด๋“œ

| ์•Œ๊ณ ๋ฆฌ์ฆ˜ | ๊ถŒ์žฅ ์‚ฌํ•ญ | ์ด์œ  |
| --- | --- | --- |
| ๋‚ด์  | ์›Œํ”„ ์—ฐ์‚ฐ (1K+ ์š”์†Œ) | ๋‹จ์ผ ๋ฆฌ๋•์…˜, ๊ทœ์น™์  ์ ‘๊ทผ |
| ํ–‰๋ ฌ ํ–‰/์—ด ํ•ฉ๊ณ„ | ์›Œํ”„ ์—ฐ์‚ฐ | ์ž์—ฐ์Šค๋Ÿฌ์šด ๋ฆฌ๋•์…˜ ํŒจํ„ด |
| ๋ˆ„์  ํ•ฉ | ํ•ญ์ƒ prefix_sum() ์‚ฌ์šฉ | ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๊ธฐ๋ณธ ์š”์†Œ |
| ํ’€๋ง (max/min) | ์›Œํ”„ ์—ฐ์‚ฐ (๊ทœ์น™์  ์œˆ๋„์šฐ) | ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๋ฆฌ๋•์…˜ |
| ๊ตฌ๊ฐ„์ด ๋งŽ์€ ํžˆ์Šคํ† ๊ทธ๋žจ | ๊ธฐ์กด ๋ฐฉ์‹ | ๋ถˆ๊ทœ์น™ํ•œ ์“ฐ๊ธฐ, ์›์ž์  ์—…๋ฐ์ดํŠธ |

์ฝ”๋“œ ์˜ˆ์‹œ

โœ… ์›Œํ”„์— ์ ํ•ฉํ•œ ๊ฒฝ์šฐ

# ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ
from gpu.primitives.warp import sum, max
var total = sum(partial_values)
var maximum = max(partial_values)

# ํ†ต์‹  ํŒจํ„ด
from gpu.primitives.warp import shuffle_idx, prefix_sum
var broadcast = shuffle_idx(my_value, 0)
var running_sum = prefix_sum(my_value)
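The semantics of prefix_sum() above can be sketched as a Hillis-Steele inclusive scan; `prefix_sum` below is a Python stand-in for illustration, not the Mojo API:

```python
WARP_SIZE = 32

def prefix_sum(values):
    # Hillis-Steele inclusive scan: at each step, lane i adds the value of
    # lane i - offset (emulating shuffle_up); log2(WARP_SIZE) steps total
    values = list(values)
    offset = 1
    while offset < len(values):
        values = [v + (values[i - offset] if i >= offset else 0)
                  for i, v in enumerate(values)]
        offset *= 2
    return values

running = prefix_sum([1] * WARP_SIZE)  # lane i receives i + 1
print(running[0], running[-1])  # 1 32
```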

โŒ ๊ธฐ์กด ๋ฐฉ์‹์ด ๋‚˜์€ ๊ฒฝ์šฐ

# ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๋™๊ธฐํ™”
stage1_compute()
barrier()  # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
stage2_depends_on_stage1()

# ๋ถˆ๊ทœ์น™ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
var value = input[random_indices[global_i]]  # ์‚ฐ๋ฐœ์  ์ฝ๊ธฐ

# ๋ฐ์ดํ„ฐ ์˜์กด์  ์ž‘์—…
if input[global_i] > threshold:
    result = expensive_computation()  # ์›Œํ”„ ๋ถ„๊ธฐ ๋ฐœ์ƒ

์„ฑ๋Šฅ ์ธก์ •

# ํ•ญ์ƒ ์–‘์ชฝ ๋ฐฉ์‹์„ ๋ฒค์น˜๋งˆํฌํ•˜์„ธ์š”
mojo p22.mojo --benchmark

# ํ™•์žฅ ํŒจํ„ด์„ ํ™•์ธํ•˜์„ธ์š”:
# traditional_1x:  X.XX ms
# warp_1x:         Y.YY ms  # ๋” ๋นจ๋ผ์•ผ ํ•จ
# warp_32x:        Z.ZZ ms  # ์ด์ ์ด ์ปค์ ธ์•ผ ํ•จ

์š”์•ฝ

์›Œํ”„ ์—ฐ์‚ฐ์œผ๋กœ ์‹œ์ž‘ํ•˜์„ธ์š”:

  • ๊ทœ์น™์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ฐ€์ง„ ๋ฆฌ๋•์…˜
  • ๋ฌธ์ œ โ‰ฅ 1 ์›Œํ”„ ํฌ๊ธฐ
  • ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ

๊ธฐ์กด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์„ธ์š”:

  • ๋ณต์žกํ•œ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด
  • ์ž‘์€ ๋ฌธ์ œ ๋˜๋Š” ์‹ฌํ•œ ๋ถ„๊ธฐ

ํŒ๋‹จ์ด ์–ด๋ ค์šธ ๋•Œ: ์–‘์ชฝ ๋ชจ๋‘ ๊ตฌํ˜„ํ•˜๊ณ  ๋ฒค์น˜๋งˆํฌํ•˜์„ธ์š”. ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด๋ฉด ๋‹ต์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.

Puzzle 25: ์›Œํ”„ ํ†ต์‹ 

๊ฐœ์š”

Puzzle 25: ์›Œํ”„ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ์—์„œ๋Š” ๊ณ ๊ธ‰ GPU ์›Œํ”„ ๋ ˆ๋ฒจ ํ†ต์‹  ์—ฐ์‚ฐ - ์›Œํ”„ ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜๊ณผ ์กฐ์ • ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. shuffle_down๊ณผ broadcast๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด ์—†์ด ์ด์›ƒ ํ†ต์‹ ๊ณผ ์ง‘ํ•ฉ ์กฐ์ •์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

Part VII: GPU ์›Œํ”„ ํ†ต์‹ ์—์„œ๋Š” ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน ๋‚ด ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฐ์ดํ„ฐ ์ด๋™ ์—ฐ์‚ฐ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ์ธ๋ฑ์‹ฑ + ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๋ฐ์ดํ„ฐ ์ด๋™์„ ํ™œ์šฉํ•˜๋Š” ํšจ์œจ์ ์ธ ์›Œํ”„ ํ†ต์‹  ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์›Œํ”„๋Š” ๋ก์Šคํ…์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค - Mojo์˜ ์›Œํ”„ ํ†ต์‹  ์—ฐ์‚ฐ์€ ์ด ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์™€ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

์›Œํ”„ ํ†ต์‹  ๋ชจ๋ธ

GPU ์›Œํ”„ ๋‚ด ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ์›Œํ”„ (32 ์Šค๋ ˆ๋“œ, SIMT ๋ก์Šคํ… ์‹คํ–‰)
โ”œโ”€โ”€ ๋ ˆ์ธ 0  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 1  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 2
โ”œโ”€โ”€ ๋ ˆ์ธ 1  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 2  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 3
โ”œโ”€โ”€ ๋ ˆ์ธ 2  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 3  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 4
โ”‚   ...
โ””โ”€โ”€ ๋ ˆ์ธ 31 โ”€โ”€shuffle_downโ”€โ”€> undefined (๊ฒฝ๊ณ„)

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด:
๋ ˆ์ธ 0 โ”€โ”€broadcastโ”€โ”€> ๋ชจ๋“  ๋ ˆ์ธ (0, 1, 2, ..., 31)

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ : ๋ฐ์ดํ„ฐ๊ฐ€ ์Šค๋ ˆ๋“œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์ด๋ฅผ ์ง์ ‘ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค
  • ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ํ•˜๋“œ์›จ์–ด๊ฐ€ ์›Œํ”„ ๊ฒฝ๊ณ„์˜ ์˜ˆ์™ธ ์ƒํ™ฉ์„ ๊ด€๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ์ผ ์‚ฌ์ดํด ์—ฐ์‚ฐ: ํ•˜๋‚˜์˜ ๋ช…๋ น ์‚ฌ์ดํด์—์„œ ํ†ต์‹ ์ด ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค

Mojo์˜ ์›Œํ”„ ํ†ต์‹  ์—ฐ์‚ฐ

gpu.primitives.warp์˜ ํ•ต์‹ฌ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. shuffle_down(value, offset): ๋” ๋†’์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ (์ด์›ƒ ์ ‘๊ทผ)
  2. broadcast(value): ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ณต์œ  (์ผ๋Œ€๋‹ค)
  3. shuffle_idx(value, lane): ํŠน์ • ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ (์ž„์˜ ์ ‘๊ทผ)
  4. shuffle_up(value, offset): ๋” ๋‚ฎ์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ (์—ญ๋ฐฉํ–ฅ ์ด์›ƒ)

์ฐธ๊ณ : ์ด ํผ์ฆ์€ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ํ†ต์‹  ํŒจํ„ด์ธ shuffle_down()๊ณผ broadcast()์— ์ดˆ์ ์„ ๋งž์ถฅ๋‹ˆ๋‹ค. ๋ชจ๋“  ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ „์ฒด ๋‚ด์šฉ์€ Mojo GPU ์›Œํ”„ ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# ๋ณต์žกํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด (๊ธฐ์กด ๋ฐฉ์‹):
shared = LayoutTensor[
    dtype,
    Layout.row_major(WARP_SIZE),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
shared[local_i] = input[global_i]
barrier()
if local_i < WARP_SIZE - 1:
    next_value = shared[local_i + 1]  # ์ด์›ƒ ์ ‘๊ทผ
    result = next_value - shared[local_i]
else:
    result = 0  # ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
barrier()

# ์›Œํ”„ ํ†ต์‹ ์€ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:
current_val = input[global_i]
next_val = shuffle_down(current_val, 1)  # ์ด์›ƒ์— ์ง์ ‘ ์ ‘๊ทผ
if lane < WARP_SIZE - 1:
    result = next_val - current_val
else:
    result = 0

์›Œํ”„ ํ†ต์‹ ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

ํ†ต์‹  ํŒจํ„ด๊ธฐ์กด ๋ฐฉ์‹์›Œํ”„ ์—ฐ์‚ฐ
์ด์›ƒ ์ ‘๊ทผ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
์Šคํ…์‹ค ์—ฐ์‚ฐ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ฐ„๋‹จํ•œ ์…”ํ”Œ ํŒจํ„ด
๋ธ”๋ก ์กฐ์ •๋ฐฐ๋ฆฌ์–ด + ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‹จ์ผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์ˆ˜๋™ ๊ฒ€์‚ฌํ•˜๋“œ์›จ์–ด ์ž๋™ ์ฒ˜๋ฆฌ

์„ ์ˆ˜ ์ง€์‹

์›Œํ”„ ํ†ต์‹ ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • Part VII ์›Œํ”„ ๊ธฐ์ดˆ: SIMT ์‹คํ–‰๊ณผ ๊ธฐ๋ณธ ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด (Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ ์ฐธ๊ณ )
  • GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ: ๋ธ”๋ก, ์›Œํ”„, ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ
  • LayoutTensor ์—ฐ์‚ฐ: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ: ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฐ€์žฅ์ž๋ฆฌ ์ผ€์ด์Šค ๊ด€๋ฆฌ

ํ•™์Šต ๊ฒฝ๋กœ

1. shuffle_down์„ ์ด์šฉํ•œ ์ด์›ƒ ํ†ต์‹ 

โ†’ warp.shuffle_down()

์Šคํ…์‹ค ์—ฐ์‚ฐ๊ณผ ์œ ํ•œ ์ฐจ๋ถ„์„ ์œ„ํ•œ ์ด์›ƒ ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_down()์œผ๋กœ ์ธ์ ‘ ๋ ˆ์ธ ๋ฐ์ดํ„ฐ ์ ‘๊ทผํ•˜๊ธฐ
  • ์œ ํ•œ ์ฐจ๋ถ„๊ณผ ์ด๋™ ํ‰๊ท  ๊ตฌํ˜„
  • ์›Œํ”„ ๊ฒฝ๊ณ„ ์ž๋™ ์ฒ˜๋ฆฌ
  • ํ™•์žฅ๋œ ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•œ ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ

ํ•ต์‹ฌ ํŒจํ„ด:

current_val = input[global_i]
next_val = shuffle_down(current_val, 1)
if lane < WARP_SIZE - 1:
    result = compute_with_neighbors(current_val, next_val)

2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ฅผ ์ด์šฉํ•œ ์ง‘ํ•ฉ ์กฐ์ •

โ†’ warp.broadcast()

๋ธ”๋ก ๋ ˆ๋ฒจ ์กฐ์ •๊ณผ ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •์„ ์œ„ํ•œ ์ผ๋Œ€๋‹ค ํ†ต์‹  ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • broadcast()๋กœ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ณต์œ 
  • ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต๊ณ„์™€ ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ • ๊ตฌํ˜„
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์™€ ์กฐ๊ฑด๋ถ€ ๋กœ์ง ๊ฒฐํ•ฉ
  • ๊ณ ๊ธ‰ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-์…”ํ”Œ ์กฐ์ • ํŒจํ„ด

ํ•ต์‹ฌ ํŒจํ„ด:

var shared_value = 0.0
if lane == 0:
    shared_value = compute_block_statistic()
shared_value = broadcast(shared_value)
result = use_shared_value(shared_value, local_data)

ํ•ต์‹ฌ ๊ฐœ๋…

ํ†ต์‹  ํŒจํ„ด

์›Œํ”„ ํ†ต์‹ ์˜ ๊ธฐ๋ณธ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • ์ด์›ƒ ํ†ต์‹ : ๋ ˆ์ธ ๊ฐ„ ์ธ์ ‘ ๋ฐ์ดํ„ฐ ๊ตํ™˜
  • ์ง‘ํ•ฉ ์กฐ์ •: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์œผ๋กœ ์ •๋ณด ๊ณต์œ 
  • ์Šคํ…์‹ค ์—ฐ์‚ฐ: ๊ณ ์ •๋œ ํŒจํ„ด์œผ๋กœ ์ด์›ƒ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์›Œํ”„ ๊ฐ€์žฅ์ž๋ฆฌ์—์„œ์˜ ํ†ต์‹  ๊ด€๋ฆฌ

ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”

์›Œํ”„ ํ†ต์‹ ์ด GPU ํ•˜๋“œ์›จ์–ด์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ ํ†ต์‹ : ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ ˆ์ง€์Šคํ„ฐ ์ง์ ‘ ์ ‘๊ทผ
  • SIMT ์‹คํ–‰: ๋ชจ๋“  ๋ ˆ์ธ์ด ํ†ต์‹ ์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ์ œ๋กœ ์ง€์—ฐ ์‹œ๊ฐ„: ์‹คํ–‰ ์œ ๋‹› ๋‚ด์—์„œ ํ†ต์‹ ์ด ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค
  • ์ž๋™ ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ€ํ™˜

๊ธฐ์กด ๋ณ‘๋ ฌ ํŒจํ„ด์„ ์›Œํ”„ ํ†ต์‹ ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฐฐ์—ด ์ด์›ƒ ์ ‘๊ทผ โ†’ shuffle_down()
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ • โ†’ broadcast()
  • ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ๋กœ์ง โ†’ ํ•˜๋“œ์›จ์–ด ์ž๋™ ์ฒ˜๋ฆฌ
  • ๋‹ค๋‹จ๊ณ„ ๋™๊ธฐํ™” โ†’ ๋‹จ์ผ ํ†ต์‹  ์—ฐ์‚ฐ

์‹œ์ž‘ํ•˜๊ธฐ

์ด์›ƒ ๊ธฐ๋ฐ˜ ์…”ํ”Œ ์—ฐ์‚ฐ์œผ๋กœ ๊ธฐ์ดˆ๋ฅผ ๋‹ค์ง„ ๋‹ค์Œ, ๊ณ ๊ธ‰ ์กฐ์ •์„ ์œ„ํ•œ ์ง‘ํ•ฉ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์›Œํ”„ ํ†ต์‹ ์„ ๊ฐ™์€ ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ ๊ฐ„์˜ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”. ์ด ๋ฉ˜ํƒˆ ๋ชจ๋ธ์ด GPU์˜ SIMT ์•„ํ‚คํ…์ฒ˜๋ฅผ ํ™œ์šฉํ•˜๋Š” ํšจ์œจ์ ์ธ ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ์•ˆ๋‚ดํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Puzzle 25๋ฅผ ๋งˆ์น˜๋ฉด, ์›Œํ”„ ํ†ต์‹ ์ด ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์ธ์‹ํ•˜์—ฌ ๋” ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅธ ์ด์›ƒ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: warp.shuffle_down() ์—์„œ ์ด์›ƒ ํ†ต์‹ ์„ ๋ฐฐ์šด ๋‹ค์Œ, warp.broadcast() ์—์„œ ์ง‘ํ•ฉ ์กฐ์ • ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ€์„ธ์š”.

warp.shuffle_down() ์ผ๋Œ€์ผ ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ์ด์›ƒ ํ†ต์‹ ์—์„œ๋Š” shuffle_down()์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๋‚ด ์ธ์ ‘ ๋ ˆ์ธ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ์œ ํ•œ ์ฐจ๋ถ„, ์ด๋™ ํ‰๊ท , ์ด์›ƒ ๊ธฐ๋ฐ˜ ๊ณ„์‚ฐ์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: shuffle_down() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐ ๋ ˆ์ธ์ด ๊ฐ™์€ ์›Œํ”„ ๋‚ด ์ด์›ƒ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋ฉฐ, ํšจ์œจ์ ์ธ ์Šคํ…์‹ค ํŒจํ„ด๊ณผ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

์Šคํ…์‹ค ์—ฐ์‚ฐ์ด๋ž€? ์Šคํ…์‹ค ์—ฐ์‚ฐ์€ ๊ฐ ์ถœ๋ ฅ ์š”์†Œ๊ฐ€ ์ด์›ƒ ์ž…๋ ฅ ์š”์†Œ์˜ ๊ณ ์ •๋œ ํŒจํ„ด์— ์˜์กดํ•˜๋Š” ๊ณ„์‚ฐ์ž…๋‹ˆ๋‹ค. ๋Œ€ํ‘œ์ ์ธ ์˜ˆ๋กœ ์œ ํ•œ ์ฐจ๋ถ„(๋„ํ•จ์ˆ˜), ํ•ฉ์„ฑ๊ณฑ, ์ด๋™ ํ‰๊ท ์ด ์žˆ์Šต๋‹ˆ๋‹ค. โ€œ์Šคํ…์‹คโ€œ์€ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค - ์˜ˆ๋ฅผ ๋“ค์–ด [i-1, i, i+1]์„ ์ฝ๋Š” 3์  ์Šคํ…์‹ค์ด๋‚˜ [i-2, i-1, i, i+1, i+2]๋ฅผ ์ฝ๋Š” 5์  ์Šคํ…์‹ค์ด ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_down()์„ ํ™œ์šฉํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฐ์ดํ„ฐ ์…”ํ”Œ
  • ์Šคํ…์‹ค ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
  • ์›Œํ”„ ๊ฐ€์žฅ์ž๋ฆฌ์—์„œ์˜ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
  • ํ™•์žฅ๋œ ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•œ ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ
  • ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ์˜ ์›Œํ”„ ๊ฐ„ ์กฐ์ •

shuffle_down ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด ๋” ๋†’์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{shuffle_down}(\text{value}, \text{offset}) = \text{value_from_lane}(\text{lane_id} + \text{offset})\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด์ด ๊ฐ„๋‹จํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์‹ฑ ์—†์ด ํšจ์œจ์ ์ธ ์Šคํ…์‹ค ๊ณ„์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ๊ธฐ๋ณธ ์ด์›ƒ ์ฐจ๋ถ„

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)

shuffle_down ๊ฐœ๋…

๊ธฐ์กด ์ด์›ƒ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์›€
if global_i < size - 1:
    next_value = input[global_i + 1]  # ๋ฒ”์œ„ ์ดˆ๊ณผ ๊ฐ€๋Šฅ์„ฑ
    result = next_value - current_value

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ฐฐ์—ด ๊ฒฝ๊ณ„๋ฅผ ์ˆ˜๋™์œผ๋กœ ํ™•์ธํ•ด์•ผ ํ•จ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋ณ„๋„์˜ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ๊ฐ€ ํ•„์š”
  • ๋™๊ธฐํ™”: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์—์„œ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ณต์žกํ•œ ๋กœ์ง: ๊ฒฝ๊ณ„์˜ ์˜ˆ์™ธ ์ƒํ™ฉ ์ฒ˜๋ฆฌ๊ฐ€ ์žฅํ™ฉํ•ด์ง

shuffle_down()์„ ์‚ฌ์šฉํ•˜๋ฉด ์ด์›ƒ ์ ‘๊ทผ์ด ๊ฐ„๊ฒฐํ•ด์ง‘๋‹ˆ๋‹ค:

# ์›Œํ”„ ์…”ํ”Œ ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ์•ˆ์ „
current_val = input[global_i]
next_val = shuffle_down(current_val, 1)  # lane+1์—์„œ ๊ฐ’ ๊ฐ€์ ธ์˜ค๊ธฐ
if lane < WARP_SIZE - 1:
    result = next_val - current_val

shuffle_down์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ถˆํ•„์š”
  • ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ํ•˜๋“œ์›จ์–ด๊ฐ€ ์›Œํ”„ ๊ฒฝ๊ณ„๋ฅผ ๊ด€๋ฆฌ
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ: ๋‹ค๋ฅธ ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_down()์œผ๋กœ ๋‹ค์Œ ์š”์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ์œ ํ•œ ์ฐจ๋ถ„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ๊ฐ ์š”์†Œ์˜ ์ด์‚ฐ ๋„ํ•จ์ˆ˜(์œ ํ•œ ์ฐจ๋ถ„)๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \text{input}[i+1] - \text{input}[i]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [0, 1, 4, 9, 16, 25, ...] (์ œ๊ณฑ์ˆ˜: i * i)๋ฅผ ์ฐจ๋ถ„๊ฐ’ [1, 3, 5, 7, 9, ...] (ํ™€์ˆ˜)๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ, ์ด์ฐจ ํ•จ์ˆ˜์˜ ์ด์‚ฐ ๋„ํ•จ์ˆ˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn neighbor_difference[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute finite differences: output[i] = input[i+1] - input[i]
    Uses shuffle_down(val, 1) to get the next neighbor's value.
    Works across multiple blocks, each processing one warp worth of data.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    # FILL IN (roughly 7 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p25/p25.mojo

ํŒ

1. shuffle_down ์ดํ•ดํ•˜๊ธฐ

shuffle_down(value, offset) ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด ๋” ๋†’์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ ์—†์ด ์ด์›ƒ ์š”์†Œ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด์„ธ์š”.

shuffle_down(val, 1)์ด ํ•˜๋Š” ์ผ:

  • ๋ ˆ์ธ 0์ด ๋ ˆ์ธ 1์˜ ๊ฐ’์„ ๋ฐ›์Œ
  • ๋ ˆ์ธ 1์ด ๋ ˆ์ธ 2์˜ ๊ฐ’์„ ๋ฐ›์Œ
  • โ€ฆ
  • ๋ ˆ์ธ 30์ด ๋ ˆ์ธ 31์˜ ๊ฐ’์„ ๋ฐ›์Œ
  • ๋ ˆ์ธ 31์€ ๋ฏธ์ •์˜ ๊ฐ’์„ ๋ฐ›์Œ (๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ์ฒ˜๋ฆฌ)

2. ์›Œํ”„ ๊ฒฝ๊ณ„ ๊ณ ๋ ค์‚ฌํ•ญ

์›Œํ”„์˜ ๊ฐ€์žฅ์ž๋ฆฌ์—์„œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ์ผ๋ถ€ ๋ ˆ์ธ์€ ์…”ํ”Œ ์—ฐ์‚ฐ์œผ๋กœ ์ ‘๊ทผํ•  ์œ ํšจํ•œ ์ด์›ƒ์ด ์—†์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณผ์ œ: ์›Œํ”„ ๊ฒฝ๊ณ„์—์„œ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋ฏธ์ •์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜์„ธ์š”.

WARP_SIZE = 32์—์„œ์˜ ์ด์›ƒ ์ฐจ๋ถ„:

  • ์œ ํšจํ•œ ์ฐจ๋ถ„ (lane < WARP_SIZE - 1): ๋ ˆ์ธ 0-30 (31๊ฐœ ๋ ˆ์ธ)

    • ์กฐ๊ฑด: \(\text{lane_id}() \in {0, 1, \cdots, 30}\)
    • ์ด์œ : shuffle_down(current_val, 1)์ด ๋‹ค์Œ ์ด์›ƒ์˜ ๊ฐ’์„ ์„ฑ๊ณต์ ์œผ๋กœ ๊ฐ€์ ธ์˜ด
    • ๊ฒฐ๊ณผ: output[i] = input[i+1] - input[i] (์œ ํ•œ ์ฐจ๋ถ„)
  • ๊ฒฝ๊ณ„ ์ผ€์ด์Šค (else): ๋ ˆ์ธ 31 (1๊ฐœ ๋ ˆ์ธ)

    • ์กฐ๊ฑด: \(\text{lane_id}() = 31\)
    • ์ด์œ : shuffle_down(current_val, 1)์ด ๋ฏธ์ •์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ (๋ ˆ์ธ 32๊ฐ€ ์—†์Œ)
    • ๊ฒฐ๊ณผ: output[i] = 0 (์ฐจ๋ถ„ ๊ณ„์‚ฐ ๋ถˆ๊ฐ€)

3. ๋ ˆ์ธ ์‹๋ณ„

lane = lane_id()  # 0๋ถ€ํ„ฐ WARP_SIZE-1๊นŒ์ง€ ๋ฐ˜ํ™˜

๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ: ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ๋ ˆ์ธ์€ 0, 1, 2, โ€ฆ, WARP_SIZE-1๋กœ ๋ฒˆํ˜ธ๊ฐ€ ๋งค๊ฒจ์ง‘๋‹ˆ๋‹ค

์ด์›ƒ ์ฐจ๋ถ„ ํ…Œ์ŠคํŠธ:

pixi run p25 --neighbor
pixi run -e amd p25 --neighbor
pixi run -e apple p25 --neighbor
uv run poe p25 --neighbor

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0, 55.0, 57.0, 59.0, 61.0, 0.0]
expected: [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0, 55.0, 57.0, 59.0, 61.0, 0.0]
โœ… Basic neighbor difference test passed!

์†”๋ฃจ์…˜

fn neighbor_difference[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute finite differences: output[i] = input[i+1] - input[i]
    Uses shuffle_down(val, 1) to get the next neighbor's value.
    Works across multiple blocks, each processing one warp worth of data.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    if global_i < size:
        # Get current value
        current_val = input[global_i]

        # Get next neighbor's value using shuffle_down
        next_val = shuffle_down(current_val, 1)

        # Compute difference - valid within warp boundaries
        # Last lane of each warp has no valid neighbor within the warp
        # Note there's only one warp in this test, so we don't need to check global_i < size - 1
        # We'll see how this works with multiple blocks in the next tests
        if lane < WARP_SIZE - 1:
            output[global_i] = next_val - current_val
        else:
            # Last thread in warp or last thread overall, set to 0
            output[global_i] = 0


์ด ์†”๋ฃจ์…˜์€ shuffle_down()์ด ๊ธฐ์กด ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์„ ํšจ์œจ์ ์ธ ์›Œํ”„ ๋ ˆ๋ฒจ ํ†ต์‹ ์œผ๋กœ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]           # ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ฝ์Œ
    next_val = shuffle_down(current_val, 1) # ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™

    if lane < WARP_SIZE - 1:
        output[global_i] = next_val - current_val  # ์ฐจ๋ถ„ ๊ณ„์‚ฐ
    else:
        output[global_i] = 0                       # ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  ๋ ˆ์ธ 0: current_val = input[0] = 0
  ๋ ˆ์ธ 1: current_val = input[1] = 1
  ๋ ˆ์ธ 2: current_val = input[2] = 4
  ...
  ๋ ˆ์ธ 31: current_val = input[31] = 961

์‚ฌ์ดํด 2: shuffle_down(current_val, 1)์ด ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์‹คํ–‰
  ๋ ˆ์ธ 0: ๋ ˆ์ธ 1์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 1
  ๋ ˆ์ธ 1: ๋ ˆ์ธ 2์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 4
  ๋ ˆ์ธ 2: ๋ ˆ์ธ 3์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 9
  ...
  ๋ ˆ์ธ 30: ๋ ˆ์ธ 31์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 961
  ๋ ˆ์ธ 31: ๋ฏธ์ •์˜ ์ˆ˜์‹  (๋ ˆ์ธ 32 ์—†์Œ) โ†’ next_val = ?

์‚ฌ์ดํด 3: ์ฐจ๋ถ„ ๊ณ„์‚ฐ (๋ ˆ์ธ 0-30๋งŒ ํ•ด๋‹น)
  ๋ ˆ์ธ 0: output[0] = 1 - 0 = 1
  ๋ ˆ์ธ 1: output[1] = 4 - 1 = 3
  ๋ ˆ์ธ 2: output[2] = 9 - 4 = 5
  ...
  ๋ ˆ์ธ 31: output[31] = 0 (๊ฒฝ๊ณ„ ์กฐ๊ฑด)

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์ด์‚ฐ ๋„ํ•จ์ˆ˜ ์—ฐ์‚ฐ์ž \(D\)๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large D\lbrack f\rbrack(i) = f(i+1) - f(i)\]

์ด์ฐจ ์ž…๋ ฅ \(f(i) = i^2\)์— ๋Œ€ํ•ด: \[\Large D[i^2] = (i+1)^2 - i^2 = i^2 + 2i + 1 - i^2 = 2i + 1\]

shuffle_down์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ธฐ์กด ๋ฐฉ์‹์€ input[global_i + 1] ๋กœ๋“œ๊ฐ€ ํ•„์š”ํ•˜์—ฌ ์บ์‹œ ๋ฏธ์Šค๋ฅผ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ์Œ
  2. ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ์œ„ํ—˜์ด ์—†์Œ - ํ•˜๋“œ์›จ์–ด๊ฐ€ ์›Œํ”„ ๊ฒฝ๊ณ„๋ฅผ ์ฒ˜๋ฆฌ
  3. SIMT ์ตœ์ ํ™”: ๋‹จ์ผ ๋ช…๋ น์ด ๋ชจ๋“  ๋ ˆ์ธ์„ ๋™์‹œ์— ์ฒ˜๋ฆฌ
  4. ๋ ˆ์ง€์Šคํ„ฐ ํ†ต์‹ : ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ๊ฐ€ ์•„๋‹Œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์ด๋ฅผ ์ด๋™

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์˜ 100+ ์‚ฌ์ดํด ๋Œ€๋น„)
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๊ธฐ์กด ๋ฐฉ์‹์˜ ์Šค๋ ˆ๋“œ๋‹น 4๋ฐ”์ดํŠธ ๋Œ€๋น„)
  • ๋ณ‘๋ ฌ์„ฑ: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ์ฒ˜๋ฆฌ

2. ๋‹ค์ค‘ ์˜คํ”„์…‹ ์ด๋™ ํ‰๊ท 

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE_2 = 64 (๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: BLOCKS_PER_GRID = (2, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: THREADS_PER_BLOCK = (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

์—ฌ๋Ÿฌ shuffle_down ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์—ฌ 3์  ์ด๋™ ํ‰๊ท ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ์„ธ ๊ฐœ์˜ ์—ฐ์† ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \frac{1}{3}\left(\text{input}[i] + \text{input}[i+1] + \text{input}[i+2]\right)\]

๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์›Œํ”„ ๊ฒฝ๊ณ„์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•˜๊ฒŒ ์ ์‘ํ•ฉ๋‹ˆ๋‹ค:

  • 3์  ์ „์ฒด ์œˆ๋„์šฐ: \(\text{output}[i] = \frac{1}{3}\sum_{k=0}^{2} \text{input}[i+k]\) - ๋ชจ๋“  ์ด์›ƒ์ด ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•  ๋•Œ
  • 2์  ์œˆ๋„์šฐ: \(\text{output}[i] = \frac{1}{2}\sum_{k=0}^{1} \text{input}[i+k]\) - ๋‹ค์Œ ์ด์›ƒ๋งŒ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•  ๋•Œ
  • 1์  ์œˆ๋„์šฐ: \(\text{output}[i] = \text{input}[i]\) - ์ด์›ƒ์ด ์‚ฌ์šฉ ๋ถˆ๊ฐ€ํ•  ๋•Œ

์ด๋Š” shuffle_down()์ด ์›Œํ”„ ๋ฒ”์œ„ ๋‚ด์—์„œ ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์™€ ํ•จ๊ป˜ ํšจ์œจ์ ์ธ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = Layout.row_major(SIZE_2)


fn moving_average_3[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
    Uses shuffle_down with offsets 1 and 2 to access neighbors.
    Works within warp boundaries across multiple blocks.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    # FILL IN (roughly 10 lines)


ํŒ

1. ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ ํŒจํ„ด

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ์ด์›ƒ์— ๋™์‹œ์— ์ ‘๊ทผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ์˜คํ”„์…‹์œผ๋กœ ์…”ํ”Œ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • input[i+1]๊ณผ input[i+2]๋ฅผ ์…”ํ”Œ ์—ฐ์‚ฐ์œผ๋กœ ์–ด๋–ป๊ฒŒ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์„๊นŒ์š”?
  • ์…”ํ”Œ ์˜คํ”„์…‹๊ณผ ์ด์›ƒ ๊ฑฐ๋ฆฌ์˜ ๊ด€๊ณ„๋Š” ๋ฌด์—‡์ผ๊นŒ์š”?
  • ๊ฐ™์€ ์†Œ์Šค ๊ฐ’์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ๋ฒˆ ์…”ํ”Œ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

์‹œ๊ฐํ™” ๊ฐœ๋…:

ํ˜„์žฌ ๋ ˆ์ธ์ด ํ•„์š”ํ•œ ๊ฐ’: current_val, next_val, next_next_val
์…”ํ”Œ ์˜คํ”„์…‹:        0 (์ง์ ‘),    1,        2

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ๋ช‡ ๋ฒˆ์˜ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•˜๊ณ , ์–ด๋–ค ์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•ด์•ผ ํ• ๊นŒ์š”?

2. ๋‹จ๊ณ„์  ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

๋‹จ์ˆœํ•œ ์ด์›ƒ ์ฐจ๋ถ„๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ 2๊ฐœ์˜ ์ด์›ƒ์— ์ ‘๊ทผํ•ด์•ผ ํ•˜๋ฏ€๋กœ ์—ฌ๋Ÿฌ ๊ฒฝ๊ณ„ ์‹œ๋‚˜๋ฆฌ์˜ค๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ๊ฒฝ๊ณ„ ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ์ „์ฒด ์œˆ๋„์šฐ: ๋ ˆ์ธ์ด ๋‘ ์ด์›ƒ ๋ชจ๋‘ ์ ‘๊ทผ ๊ฐ€๋Šฅ โ†’ 3๊ฐœ ๊ฐ’ ๋ชจ๋‘ ์‚ฌ์šฉ
  • ๋ถ€๋ถ„ ์œˆ๋„์šฐ: ๋ ˆ์ธ์ด 1๊ฐœ ์ด์›ƒ๋งŒ ์ ‘๊ทผ ๊ฐ€๋Šฅ โ†’ 2๊ฐœ ๊ฐ’ ์‚ฌ์šฉ
  • ์œˆ๋„์šฐ ์—†์Œ: ๋ ˆ์ธ์ด ์ด์›ƒ์— ์ ‘๊ทผ ๋ถˆ๊ฐ€ โ†’ 1๊ฐœ ๊ฐ’ ์‚ฌ์šฉ

๋น„ํŒ์  ์‚ฌ๊ณ :

  • ์–ด๋–ค ๋ ˆ์ธ์ด ๊ฐ ์นดํ…Œ๊ณ ๋ฆฌ์— ํ•ด๋‹นํ• ๊นŒ์š”?
  • ๊ฐ’์ด ์ ์„ ๋•Œ ํ‰๊ท ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ•ด์•ผ ํ• ๊นŒ์š”?
  • ์–ด๋–ค ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ๊ฒ€์‚ฌํ•ด์•ผ ํ• ๊นŒ์š”?

๊ณ ๋ คํ•  ํŒจํ„ด:

if (๋‘ ์ด์›ƒ ๋ชจ๋‘ ์ ‘๊ทผ ๊ฐ€๋Šฅ):
    # 3์  ํ‰๊ท 
elif (ํ•œ ์ด์›ƒ๋งŒ ์ ‘๊ทผ ๊ฐ€๋Šฅ):
    # 2์  ํ‰๊ท 
else:
    # 1์  (ํ‰๊ท  ์—†์Œ)

3. ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•˜๋ฉฐ, ๊ฐ ๋ธ”๋ก์ด ๋ฐ์ดํ„ฐ์˜ ๋‹ค๋ฅธ ์˜์—ญ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

์ค‘์š”ํ•œ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๊ฐ ๋ธ”๋ก์€ ๋ ˆ์ธ 0๋ถ€ํ„ฐ WARP_SIZE-1๊นŒ์ง€์˜ ์ž์ฒด ์›Œํ”„๋ฅผ ๊ฐ€์ง
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด์€ ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ๋…๋ฆฝ์ ์œผ๋กœ ์ ์šฉ
  • ๋ธ”๋ก๋งˆ๋‹ค ๋ ˆ์ธ ๋ฒˆํ˜ธ๊ฐ€ ์ดˆ๊ธฐํ™”๋จ

์ƒ๊ฐํ•ด ๋ณผ ์งˆ๋ฌธ:

  • ๊ฒฝ๊ณ„ ๋กœ์ง์ด ๋ธ”๋ก 0๊ณผ ๋ธ”๋ก 1 ๋ชจ๋‘์—์„œ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๋‚˜์š”?
  • ๋ ˆ์ธ ๊ฒฝ๊ณ„์™€ ์ „์—ญ ๋ฐฐ์—ด ๊ฒฝ๊ณ„๋ฅผ ๋ชจ๋‘ ๊ฒ€์‚ฌํ•˜๊ณ  ์žˆ๋‚˜์š”?
  • ์„œ๋กœ ๋‹ค๋ฅธ ๋ธ”๋ก์—์„œ global_i์™€ lane_id()์˜ ๊ด€๊ณ„๋Š” ์–ด๋–ป๊ฒŒ ๋ ๊นŒ์š”?

๋””๋ฒ„๊น… ํŒ: ๊ฐ ๋ธ”๋ก์˜ ๊ฒฝ๊ณ„ ๋ ˆ์ธ์—์„œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ถ”์ ํ•˜์—ฌ ๋กœ์ง์„ ํ…Œ์ŠคํŠธํ•˜์„ธ์š”.

์ด๋™ ํ‰๊ท  ํ…Œ์ŠคํŠธ:

pixi run p25 --average
pixi run -e amd p25 --average
uv run poe p25 --average

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE_2:  64
output: HostBuffer([3.3333333, 6.3333335, 10.333333, 15.333333, 21.333334, 28.333334, 36.333332, 45.333332, 55.333332, 66.333336, 78.333336, 91.333336, 105.333336, 120.333336, 136.33333, 153.33333, 171.33333, 190.33333, 210.33333, 231.33333, 253.33333, 276.33334, 300.33334, 325.33334, 351.33334, 378.33334, 406.33334, 435.33334, 465.33334, 496.33334, 512.0, 528.0, 595.3333, 630.3333, 666.3333, 703.3333, 741.3333, 780.3333, 820.3333, 861.3333, 903.3333, 946.3333, 990.3333, 1035.3334, 1081.3334, 1128.3334, 1176.3334, 1225.3334, 1275.3334, 1326.3334, 1378.3334, 1431.3334, 1485.3334, 1540.3334, 1596.3334, 1653.3334, 1711.3334, 1770.3334, 1830.3334, 1891.3334, 1953.3334, 2016.3334, 2048.0, 2080.0])
expected: HostBuffer([3.3333333, 6.3333335, 10.333333, 15.333333, 21.333334, 28.333334, 36.333332, 45.333332, 55.333332, 66.333336, 78.333336, 91.333336, 105.333336, 120.333336, 136.33333, 153.33333, 171.33333, 190.33333, 210.33333, 231.33333, 253.33333, 276.33334, 300.33334, 325.33334, 351.33334, 378.33334, 406.33334, 435.33334, 465.33334, 496.33334, 512.0, 528.0, 595.3333, 630.3333, 666.3333, 703.3333, 741.3333, 780.3333, 820.3333, 861.3333, 903.3333, 946.3333, 990.3333, 1035.3334, 1081.3334, 1128.3334, 1176.3334, 1225.3334, 1275.3334, 1326.3334, 1378.3334, 1431.3334, 1485.3334, 1540.3334, 1596.3334, 1653.3334, 1711.3334, 1770.3334, 1830.3334, 1891.3334, 1953.3334, 2016.3334, 2048.0, 2080.0])
โœ… Moving average test passed!

์†”๋ฃจ์…˜

fn moving_average_3[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
    Uses shuffle_down with offsets 1 and 2 to access neighbors.
    Works within warp boundaries across multiple blocks.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    if global_i < size:
        # Get current, next, and next+1 values
        current_val = input[global_i]
        next_val = shuffle_down(current_val, 1)
        next_next_val = shuffle_down(current_val, 2)

        # Compute 3-point average - valid within warp boundaries
        if lane < WARP_SIZE - 2 and global_i < size - 2:
            output[global_i] = (current_val + next_val + next_next_val) / 3.0
        elif lane < WARP_SIZE - 1 and global_i < size - 1:
            # Second-to-last in warp: only current + next available
            output[global_i] = (current_val + next_val) / 2.0
        else:
            # Last thread in warp or boundary cases: only current available
            output[global_i] = current_val


์ด ์†”๋ฃจ์…˜์€ ๋ณต์žกํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ์—ฌ๋Ÿฌ ์…”ํ”Œ๋กœ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ๋ชจ๋‘ ํ™•๋ณด
    current_val = input[global_i]                   # ์ง์ ‘ ์ ‘๊ทผ
    next_val = shuffle_down(current_val, 1)         # ์˜ค๋ฅธ์ชฝ ์ด์›ƒ
    next_next_val = shuffle_down(current_val, 2)    # ์˜ค๋ฅธ์ชฝ+1 ์ด์›ƒ

    # ๋‹จ๊ณ„ 2: ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ฅธ ์ ์‘ํ˜• ๊ณ„์‚ฐ
    if lane < WARP_SIZE - 2 and global_i < size - 2:
        # 3์  ์Šคํ…์‹ค ์ „์ฒด ์‚ฌ์šฉ ๊ฐ€๋Šฅ
        output[global_i] = (current_val + next_val + next_next_val) / 3.0
    elif lane < WARP_SIZE - 1 and global_i < size - 1:
        # 2์  ์Šคํ…์‹ค๋งŒ ์‚ฌ์šฉ ๊ฐ€๋Šฅ (์›Œํ”„ ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜)
        output[global_i] = (current_val + next_val) / 2.0
    else:
        # ์Šคํ…์‹ค ์‚ฌ์šฉ ๋ถˆ๊ฐ€ (์›Œํ”„ ๊ฒฝ๊ณ„)
        output[global_i] = current_val

๋‹ค์ค‘ ์˜คํ”„์…‹ ์‹คํ–‰ ์ถ”์  (WARP_SIZE = 32):

์ดˆ๊ธฐ ์ƒํƒœ (๋ธ”๋ก 0, ์š”์†Œ 0-31):
  ๋ ˆ์ธ 0: current_val = input[0] = 1
  ๋ ˆ์ธ 1: current_val = input[1] = 2
  ๋ ˆ์ธ 2: current_val = input[2] = 4
  ...
  ๋ ˆ์ธ 31: current_val = input[31] = X

์ฒซ ๋ฒˆ์งธ ์…”ํ”Œ: shuffle_down(current_val, 1)
  ๋ ˆ์ธ 0: next_val = input[1] = 2
  ๋ ˆ์ธ 1: next_val = input[2] = 4
  ๋ ˆ์ธ 2: next_val = input[3] = 7
  ...
  ๋ ˆ์ธ 30: next_val = input[31] = X
  ๋ ˆ์ธ 31: next_val = ๋ฏธ์ •์˜

๋‘ ๋ฒˆ์งธ ์…”ํ”Œ: shuffle_down(current_val, 2)
  ๋ ˆ์ธ 0: next_next_val = input[2] = 4
  ๋ ˆ์ธ 1: next_next_val = input[3] = 7
  ๋ ˆ์ธ 2: next_next_val = input[4] = 11
  ...
  ๋ ˆ์ธ 29: next_next_val = input[31] = X
  ๋ ˆ์ธ 30: next_next_val = ๋ฏธ์ •์˜
  ๋ ˆ์ธ 31: next_next_val = ๋ฏธ์ •์˜

๊ณ„์‚ฐ ๋‹จ๊ณ„:
  ๋ ˆ์ธ 0-29: 3์  ์ „์ฒด ํ‰๊ท  โ†’ (current + next + next_next) / 3
  ๋ ˆ์ธ 30:   2์  ํ‰๊ท  โ†’ (current + next) / 2
  ๋ ˆ์ธ 31:   1์  ํ‰๊ท  โ†’ current (๊ทธ๋Œ€๋กœ ์ „๋‹ฌ)

์ˆ˜ํ•™์  ๊ธฐ๋ฐ˜: ๊ฐ€๋ณ€ ํญ ์ด์‚ฐ ํ•ฉ์„ฑ๊ณฑ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large h[i] = \sum_{k=0}^{K(i)-1} w_k^{(i)} \cdot f[i+k]\]

์œ„์น˜์— ๋”ฐ๋ผ ์ปค๋„์ด ์ ์‘ํ•ฉ๋‹ˆ๋‹ค:

  • ๋‚ด๋ถ€ ์ : \(K(i) = 3\), \(\mathbf{w}^{(i)} = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]\)
  • ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜: \(K(i) = 2\), \(\mathbf{w}^{(i)} = [\frac{1}{2}, \frac{1}{2}]\)
  • ๊ฒฝ๊ณ„: \(K(i) = 1\), \(\mathbf{w}^{(i)} = [1]\)

๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •: SIZE_2 = 64์™€ 2๊ฐœ ๋ธ”๋ก:

๋ธ”๋ก 0 (์ „์—ญ ์ธ๋ฑ์Šค 0-31):
  ์ „์—ญ ์ธ๋ฑ์Šค 29, 30, 31์— ๋ ˆ์ธ ๊ฒฝ๊ณ„ ์ ์šฉ

๋ธ”๋ก 1 (์ „์—ญ ์ธ๋ฑ์Šค 32-63):
  ์ „์—ญ ์ธ๋ฑ์Šค 61, 62, 63์— ๋ ˆ์ธ ๊ฒฝ๊ณ„ ์ ์šฉ
  ๋ ˆ์ธ ๋ฒˆํ˜ธ ์ดˆ๊ธฐํ™”: global_i=32 โ†’ lane=0, global_i=63 โ†’ lane=31

์„ฑ๋Šฅ ์ตœ์ ํ™”:

  1. ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ํ™•๋ณด: ๋‘ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋™์‹œ์— ์‹คํ–‰
  2. ์กฐ๊ฑด๋ถ€ ๋ถ„๊ธฐ: GPU๊ฐ€ ํ”„๋ ˆ๋””์ผ€์ด์…˜์„ ํ†ตํ•ด ๋ถ„๊ธฐ ๋ ˆ์ธ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์ˆœ์ฐจ์  ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด GPU์— ์ตœ์ 
  4. ๋ ˆ์ง€์Šคํ„ฐ ์žฌ์‚ฌ์šฉ: ๋ชจ๋“  ์ค‘๊ฐ„ ๊ฐ’์ด ๋ ˆ์ง€์Šคํ„ฐ์— ์œ ์ง€

์‹ ํ˜ธ ์ฒ˜๋ฆฌ ๊ด€์ : ์ด๊ฒƒ์€ ์ž„ํŽ„์Šค ์‘๋‹ต \(h[n] = \frac{1}{3}[\delta[n] + \delta[n-1] + \delta[n-2]]\)๋ฅผ ๊ฐ€์ง„ ์ธ๊ณผ FIR ํ•„ํ„ฐ๋กœ, ์ฐจ๋‹จ ์ฃผํŒŒ์ˆ˜ \(f_c \approx 0.25f_s\)์—์„œ ์Šค๋ฌด๋”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์š”์•ฝ

์ด ์„น์…˜์˜ ํ•ต์‹ฌ ํŒจํ„ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

current_val = input[global_i]
neighbor_val = shuffle_down(current_val, offset)
if lane < WARP_SIZE - offset:
    result = compute(current_val, neighbor_val)

ํ•ต์‹ฌ ์žฅ์ :

  • ํ•˜๋“œ์›จ์–ด ํšจ์œจ์„ฑ: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ: ์ž๋™ ์›Œํ”„ ๋ฒ”์œ„ ์ฒ˜๋ฆฌ
  • SIMT ์ตœ์ ํ™”: ๋‹จ์ผ ๋ช…๋ น, ๋ชจ๋“  ๋ ˆ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ

ํ™œ์šฉ ๋ถ„์•ผ: ์œ ํ•œ ์ฐจ๋ถ„, ์Šคํ…์‹ค ์—ฐ์‚ฐ, ์ด๋™ ํ‰๊ท , ํ•ฉ์„ฑ๊ณฑ.

warp.broadcast() ์ผ๋Œ€๋‹ค ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ์กฐ์ •์—์„œ๋Š” broadcast()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ ˆ์ธ์—์„œ ์›Œํ”„ ๋‚ด ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ๋ธ”๋ก ๋ ˆ๋ฒจ ๊ณ„์‚ฐ, ์กฐ๊ฑด๋ถ€ ๋กœ์ง ์กฐ์ •, ์ผ๋Œ€๋‹ค ํ†ต์‹  ํŒจํ„ด์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: broadcast() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ ˆ์ธ(๋ณดํ†ต ๋ ˆ์ธ 0)์ด ๊ณ„์‚ฐํ•œ ๊ฐ’์„ ๊ฐ™์€ ์›Œํ”„์˜ ๋ชจ๋“  ๋ ˆ์ธ์— ์ „๋‹ฌํ•˜๋ฉฐ, ํšจ์œจ์ ์ธ ์กฐ์ • ํŒจํ„ด๊ณผ ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์ด๋ž€? ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์€ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ๋ฃน ๋‚ด ๋‹ค๋ฅธ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์™€ ๊ณต์œ ํ•˜๋Š” ํ†ต์‹  ํŒจํ„ด์ž…๋‹ˆ๋‹ค. ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต๊ณ„ ๊ณ„์‚ฐ, ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •, ์›Œํ”„ ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ์„ค์ • ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌ ๋“ฑ์˜ ์กฐ์ • ์ž‘์—…์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • broadcast()๋ฅผ ํ™œ์šฉํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
  • ์ผ๋Œ€๋‹ค ํ†ต์‹  ํŒจํ„ด
  • ์ง‘ํ•ฉ ๊ณ„์‚ฐ ์ „๋žต
  • ๋ ˆ์ธ ๊ฐ„ ์กฐ๊ฑด๋ถ€ ์กฐ์ •
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-shuffle ๊ฒฐํ•ฉ ์—ฐ์‚ฐ

broadcast() ์—ฐ์‚ฐ์€ ํ•˜๋‚˜์˜ ๋ ˆ์ธ(๊ธฐ๋ณธ์ ์œผ๋กœ ๋ ˆ์ธ 0)์ด ์ž์‹ ์˜ ๊ฐ’์„ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{broadcast}(\text{value}) = \text{value_from_lane_0_to_all_lanes}\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ์กฐ์ • ํŒจํ„ด์ด ๊ฐ„๋‹จํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ์ง‘ํ•ฉ ๊ณ„์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐœ๋…

๊ธฐ์กด ์กฐ์ • ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์›€
shared_memory[lane] = local_computation()
sync_threads()  # ๋น„์šฉ์ด ํฐ ๋™๊ธฐํ™”
if lane == 0:
    result = compute_from_shared_memory()
sync_threads()  # ๋˜ ๋‹ค๋ฅธ ๋น„์šฉ์ด ํฐ ๋™๊ธฐํ™”
final_result = shared_memory[0]  # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์Œ

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋™๊ธฐํ™”: ๋น„์šฉ์ด ํฐ ๋ฐฐ๋ฆฌ์–ด ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ ๋ฒˆ ํ•„์š”
  • ๋ณต์žกํ•œ ๋กœ์ง: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค์™€ ์ ‘๊ทผ ํŒจํ„ด ๊ด€๋ฆฌ
  • ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ์„ฑ: ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์‰ฝ๊ฒŒ ๋ฐœ์ƒ

broadcast()๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์กฐ์ •์ด ๊ฐ„๊ฒฐํ•ด์ง‘๋‹ˆ๋‹ค:

# ์›Œํ”„ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ์•ˆ์ „
collective_value = 0
if lane == 0:
    collective_value = compute_block_statistic()
collective_value = broadcast(collective_value)  # ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ 
result = use_collective_value(collective_value)

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”
  • ์ž๋™ ๋™๊ธฐํ™”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ๊ฐ„๋‹จํ•œ ํŒจํ„ด: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ๊ณ„์‚ฐํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์ด ์ˆ˜์‹ 
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ: ๋‹ค๋ฅธ ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ

1. ๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ

๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ ํ•˜๋Š” ๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ:

  • ๋ ˆ์ธ 0์ด ํ˜„์žฌ ๋ธ”๋ก์˜ ์ฒ˜์Œ 4๊ฐœ ์š”์†Œ์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ์ด ๊ณ„์‚ฐ๋œ ๊ฐ’์„ broadcast()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„์˜ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ ๋ ˆ์ธ์€ ์ด ๊ณต์œ ๋œ ๊ฐ’์„ ์ž์‹ ์˜ ์ž…๋ ฅ ์š”์†Œ์— ๋”ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: ์ž…๋ ฅ [1, 2, 3, 4, 5, 6, 7, 8, ...]์€ ์ถœ๋ ฅ [11, 12, 13, 14, 15, 16, 17, 18, ...]์„ ์ƒ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

๊ณผ์ œ: ํ•˜๋‚˜์˜ ๋ ˆ์ธ๋งŒ ๋ธ”๋ก ๋ ˆ๋ฒจ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋˜, ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ์ž์‹ ์˜ ๊ฐœ๋ณ„ ์—ฐ์‚ฐ์— ์‚ฌ์šฉํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ•ด์•ผ ํ• ๊นŒ์š”?

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)

์™„์„ฑํ•  ์ฝ”๋“œ

fn basic_broadcast[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
    Each lane then uses this broadcast value in its own computation.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    if global_i < size:
        var broadcast_value: output.element_type = 0.0

        # FILL IN (roughly 10 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p25/p25.mojo

ํŒ

1. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋™์ž‘ ๋ฐฉ์‹ ์ดํ•ดํ•˜๊ธฐ

broadcast(value) ์—ฐ์‚ฐ์€ ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๊ฐ€์ ธ์™€ ์›Œํ”„์˜ ๋ชจ๋“  ๋ ˆ์ธ์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์—์„œ๋Š” ๋ ˆ์ธ 0์˜ ๊ฐ’๋งŒ ์˜๋ฏธ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋ ˆ์ธ์˜ ๊ฐ’์€ ๋ฌด์‹œ๋˜์ง€๋งŒ, ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ์ˆ˜์‹ ํ•ฉ๋‹ˆ๋‹ค.

์‹œ๊ฐํ™”:

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ „: ๋ ˆ์ธ 0์€ valโ‚€, ๋ ˆ์ธ 1์€ valโ‚, ๋ ˆ์ธ 2๋Š” valโ‚‚, ...
๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ›„: ๋ ˆ์ธ 0์€ valโ‚€, ๋ ˆ์ธ 1์€ valโ‚€, ๋ ˆ์ธ 2๋Š” valโ‚€, ...

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ๋ ˆ์ธ 0๋งŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•˜๋ ค๋Š” ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋„๋ก ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

2. ๋ ˆ์ธ๋ณ„ ๊ณ„์‚ฐ

๋ ˆ์ธ 0์ด ํŠน๋ณ„ํ•œ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ๋‹ค๋ฅธ ๋ ˆ์ธ์€ ๋Œ€๊ธฐํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ํŒจํ„ด:

var shared_value = ์ดˆ๊ธฐ๊ฐ’
if lane == 0:
    # ๋ ˆ์ธ 0๋งŒ ๊ณ„์‚ฐ
    shared_value = ํŠน๋ณ„ํ•œ_๊ณ„์‚ฐ()
# ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์— ์ฐธ์—ฌ
shared_value = broadcast(shared_value)

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ „์— ๋‹ค๋ฅธ ๋ ˆ์ธ์˜ ๊ฐ’์€ ์–ด๋–ค ์ƒํƒœ์—ฌ์•ผ ํ• ๊นŒ์š”?
  • ๋ ˆ์ธ 0์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ–๋„๋ก ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

3. ์ง‘ํ•ฉ์  ํ™œ์šฉ

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ›„ ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ๊ฐ’์„ ๊ฐ–๊ฒŒ ๋˜๋ฉฐ, ์ด๋ฅผ ๊ฐ์ž์˜ ๊ฐœ๋ณ„ ๊ณ„์‚ฐ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ๊ฐ ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’๊ณผ ์ž์‹ ์˜ ๋กœ์ปฌ ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ• ๊นŒ์š”?

๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ…Œ์ŠคํŠธ:

pixi run p25 --broadcast-basic
pixi run -e amd p25 --broadcast-basic
pixi run -e apple p25 --broadcast-basic
uv run poe p25 --broadcast-basic

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0])
expected: HostBuffer([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0])
โœ… Basic broadcast test passed!

์†”๋ฃจ์…˜

fn basic_broadcast[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
    Each lane then uses this broadcast value in its own computation.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    if global_i < size:
        # Step 1: Lane 0 computes special value (sum of first 4 elements in this block)
        var broadcast_value: output.element_type = 0.0
        if lane == 0:
            block_start = Int(block_idx.x * block_dim.x)
            var sum: output.element_type = 0.0
            for i in range(4):
                if block_start + i < size:
                    sum += input[block_start + i]
            broadcast_value = sum

        # Step 2: Broadcast lane 0's value to all lanes in this warp
        broadcast_value = broadcast(broadcast_value)

        # Step 3: All lanes use broadcast value in their computation
        output[global_i] = broadcast_value + input[global_i]


์ด ์†”๋ฃจ์…˜์€ ์›Œํ”„ ๋ ˆ๋ฒจ ์กฐ์ •์„ ์œ„ํ•œ ๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ํŠน๋ณ„ํ•œ ๊ฐ’์„ ๊ณ„์‚ฐ
    var broadcast_value: output.element_type = 0.0
    if lane == 0:
        # ๋ ˆ์ธ 0๋งŒ ์ด ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰
        block_start = block_idx.x * block_dim.x
        var sum: output.element_type = 0.0
        for i in range(4):
            if block_start + i < size:
                sum += input[block_start + i]
        broadcast_value = sum

    # ๋‹จ๊ณ„ 2: ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ 
    broadcast_value = broadcast(broadcast_value)

    # ๋‹จ๊ณ„ 3: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์„ ํ™œ์šฉ
    output[global_i] = broadcast_value + input[global_i]

SIMT ์‹คํ–‰ ์ถ”์ :

์‚ฌ์ดํด 1: ๋ ˆ์ธ๋ณ„ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0: input[0] + input[1] + input[2] + input[3] = 1+2+3+4 = 10์„ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 1: broadcast_value๋Š” 0.0 ์œ ์ง€ (๋ ˆ์ธ 0์ด ์•„๋‹˜)
  ๋ ˆ์ธ 2: broadcast_value๋Š” 0.0 ์œ ์ง€ (๋ ˆ์ธ 0์ด ์•„๋‹˜)
  ...
  ๋ ˆ์ธ 31: broadcast_value๋Š” 0.0 ์œ ์ง€ (๋ ˆ์ธ 0์ด ์•„๋‹˜)

์‚ฌ์ดํด 2: broadcast(broadcast_value) ์‹คํ–‰
  ๋ ˆ์ธ 0: ์ž์‹ ์˜ ๊ฐ’ ์œ ์ง€ โ†’ broadcast_value = 10.0
  ๋ ˆ์ธ 1: ๋ ˆ์ธ 0์˜ ๊ฐ’ ์ˆ˜์‹  โ†’ broadcast_value = 10.0
  ๋ ˆ์ธ 2: ๋ ˆ์ธ 0์˜ ๊ฐ’ ์ˆ˜์‹  โ†’ broadcast_value = 10.0
  ...
  ๋ ˆ์ธ 31: ๋ ˆ์ธ 0์˜ ๊ฐ’ ์ˆ˜์‹  โ†’ broadcast_value = 10.0

์‚ฌ์ดํด 3: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์„ ํ™œ์šฉํ•œ ๊ฐœ๋ณ„ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0: output[0] = 10.0 + input[0] = 10.0 + 1.0 = 11.0
  ๋ ˆ์ธ 1: output[1] = 10.0 + input[1] = 10.0 + 2.0 = 12.0
  ๋ ˆ์ธ 2: output[2] = 10.0 + input[2] = 10.0 + 3.0 = 13.0
  ...
  ๋ ˆ์ธ 31: output[31] = 10.0 + input[31] = 10.0 + 32.0 = 42.0

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๊ฐ€ ์šฐ์›”ํ•œ ์ด์œ :

  1. ์กฐ์ • ํšจ์œจ์„ฑ: ๋‹จ์ผ ์—ฐ์‚ฐ์œผ๋กœ ๋ชจ๋“  ๋ ˆ์ธ์„ ์กฐ์ •
  2. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  3. ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ž๋™์œผ๋กœ ์กฐ์ •์„ ์ฒ˜๋ฆฌ
  4. ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ํŒจํ„ด: ์›Œํ”„ ํฌ๊ธฐ์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ๋™์ผํ•˜๊ฒŒ ๋™์ž‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ 1 ์‚ฌ์ดํด
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ )
  • ์กฐ์ •: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ์ž๋™ ๋™๊ธฐํ™”

2. ์กฐ๊ฑด๋ถ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ

๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ์กฐ๊ฑด๋ถ€ ์กฐ์ •์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ:

  • ๋ ˆ์ธ 0์ด ํ˜„์žฌ ๋ธ”๋ก์˜ ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ์ด ์ตœ๋Œ“๊ฐ’์„ broadcast()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์— ์ „๋‹ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ ๋ ˆ์ธ์€ ์กฐ๊ฑด๋ถ€ ๋กœ์ง์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค: ์ž์‹ ์˜ ์š”์†Œ๊ฐ€ ์ตœ๋Œ“๊ฐ’์˜ ์ ˆ๋ฐ˜๋ณด๋‹ค ํฌ๋ฉด 2๋ฐฐ๋กœ, ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ์ ˆ๋ฐ˜์œผ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: ์ž…๋ ฅ [3, 1, 7, 2, 9, 4, 6, 8, ...] (๋ฐ˜๋ณต ํŒจํ„ด)์€ ์ถœ๋ ฅ [1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, ...]์„ ์ƒ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

๊ณผ์ œ: ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ถ„์„๊ณผ ์š”์†Œ๋ณ„ ์กฐ๊ฑด๋ถ€ ๋ณ€ํ™˜์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ฑธ์ณ ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ• ๊นŒ์š”?

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

fn conditional_broadcast[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
    All lanes apply different logic based on the broadcast decision.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    if global_i < size:
        var decision_value: output.element_type = 0.0

        # FILL IN (roughly 10 lines)

        current_input = input[global_i]
        threshold = decision_value / 2.0
        if current_input >= threshold:
            output[global_i] = current_input * 2.0  # Double if >= threshold
        else:
            output[global_i] = current_input / 2.0  # Halve if < threshold


ํŒ

1. ๋ถ„์„๊ณผ ์˜์‚ฌ๊ฒฐ์ •

๋ ˆ์ธ 0์ด ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์˜ ๋™์ž‘์„ ์•ˆ๋‚ดํ•  ๊ฒฐ์ •์„ ๋‚ด๋ ค์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ถ„์„ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?
  • ๋ ˆ์ธ์˜ ๋™์ž‘์„ ์กฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋–ค ์ข…๋ฅ˜์˜ ๊ฒฐ์ •์„ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ• ๊นŒ์š”?
  • ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•  ๋•Œ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์€ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ์š”?

๊ณ ๋ คํ•  ํŒจํ„ด:

var decision = ๊ธฐ๋ณธ๊ฐ’
if lane == 0:
    # ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ๋ถ„์„
    decision = ๋ถ„์„_ํ›„_๊ฒฐ์ •()
decision = broadcast(decision)

2. ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ์กฐ์ •

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ๊ฒฐ์ •์„ ์ˆ˜์‹ ํ•œ ํ›„, ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ทธ ๊ฒฐ์ •์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ๋กœ์ง์„ ์ ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์„ ์‚ฌ์šฉํ•˜์—ฌ ๋กœ์ปฌ ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ๋ฐฉ๋ฒ•์€?
  • ๊ฐ ์กฐ๊ฑด๋ถ€ ๋ถ„๊ธฐ์—์„œ ์–ด๋–ค ์—ฐ์‚ฐ์„ ์ ์šฉํ•ด์•ผ ํ• ๊นŒ์š”?
  • ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์ผ๊ด€๋œ ๋™์ž‘์„ ๋ณด์žฅํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

์กฐ๊ฑด๋ถ€ ํŒจํ„ด:

if (๋กœ์ปฌ_๋ฐ์ดํ„ฐ๊ฐ€ broadcast_๊ธฐ์ค€์„ ์ถฉ์กฑ):
    # ํ•˜๋‚˜์˜ ๋ณ€ํ™˜ ์ ์šฉ
else:
    # ๋‹ค๋ฅธ ๋ณ€ํ™˜ ์ ์šฉ

3. ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ „๋žต

๋ ˆ์ธ 0์ด ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ถ„์„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ณ ๋ คํ•˜์„ธ์š”.

๊ณ ๋ คํ•  ์ ‘๊ทผ๋ฒ•:

  • ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’ ์ฐพ๊ธฐ
  • ํ‰๊ท ์ด๋‚˜ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ
  • ํŒจํ„ด์ด๋‚˜ ์ž„๊ณ„๊ฐ’ ๊ฐ์ง€
  • ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๊ธฐ๋ฐ˜ํ•œ ์ด์ง„ ๊ฒฐ์ •

์กฐ๊ฑด๋ถ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ…Œ์ŠคํŠธ:

pixi run p25 --broadcast-conditional
pixi run -e amd p25 --broadcast-conditional
uv run poe p25 --broadcast-conditional

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0])
expected: HostBuffer([1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0])
โœ… Conditional broadcast test passed!

์†”๋ฃจ์…˜

fn conditional_broadcast[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
    All lanes apply different logic based on the broadcast decision.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    if global_i < size:
        # Step 1: Lane 0 analyzes block-local data and makes decision (find max of first 8 in block)
        var decision_value: output.element_type = 0.0
        if lane == 0:
            block_start = Int(block_idx.x * block_dim.x)
            decision_value = input[block_start] if block_start < size else 0.0
            for i in range(1, min(8, min(WARP_SIZE, size - block_start))):
                if block_start + i < size:
                    current_val = input[block_start + i]
                    if current_val > decision_value:
                        decision_value = current_val

        # Step 2: Broadcast decision to all lanes in this warp
        decision_value = broadcast(decision_value)

        # Step 3: All lanes apply conditional logic based on broadcast decision
        current_input = input[global_i]
        threshold = decision_value / 2.0
        if current_input >= threshold:
            output[global_i] = current_input * 2.0  # Double if >= threshold
        else:
            output[global_i] = current_input / 2.0  # Halve if < threshold


์ด ์†”๋ฃจ์…˜์€ ๋ ˆ์ธ ๊ฐ„ ์กฐ๊ฑด๋ถ€ ์กฐ์ •์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๊ฒฐ์ •์„ ๋‚ด๋ฆผ
    var decision_value: output.element_type = 0.0
    if lane == 0:
        # ๋ธ”๋ก์˜ ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ ์ค‘ ์ตœ๋Œ“๊ฐ’ ์ฐพ๊ธฐ
        block_start = block_idx.x * block_dim.x
        decision_value = input[block_start] if block_start < size else 0.0
        for i in range(1, min(8, min(WARP_SIZE, size - block_start))):
            if block_start + i < size:
                current_val = input[block_start + i]
                if current_val > decision_value:
                    decision_value = current_val

    # ๋‹จ๊ณ„ 2: ๊ฒฐ์ •์„ broadcastํ•˜์—ฌ ๋ชจ๋“  ๋ ˆ์ธ์„ ์กฐ์ •
    decision_value = broadcast(decision_value)

    # ๋‹จ๊ณ„ 3: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•œ ์กฐ๊ฑด๋ถ€ ๋กœ์ง์„ ์ ์šฉ
    current_input = input[global_i]
    threshold = decision_value / 2.0
    if current_input >= threshold:
        output[global_i] = current_input * 2.0  # ์ž„๊ณ„๊ฐ’ ์ด์ƒ์ด๋ฉด 2๋ฐฐ
    else:
        output[global_i] = current_input / 2.0  # ์ž„๊ณ„๊ฐ’ ๋ฏธ๋งŒ์ด๋ฉด ์ ˆ๋ฐ˜

์˜์‚ฌ๊ฒฐ์ • ์‹คํ–‰ ์ถ”์ :

์ž…๋ ฅ ๋ฐ์ดํ„ฐ: [3.0, 1.0, 7.0, 2.0, 9.0, 4.0, 6.0, 8.0, ...]

๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ์˜ ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์Œ
  ๋ ˆ์ธ 0 ๋ถ„์„:
    input[0] = 3.0์œผ๋กœ ์‹œ์ž‘
    input[1] = 1.0๊ณผ ๋น„๊ต โ†’ 3.0 ์œ ์ง€
    input[2] = 7.0๊ณผ ๋น„๊ต โ†’ 7.0์œผ๋กœ ๊ฐฑ์‹ 
    input[3] = 2.0๊ณผ ๋น„๊ต โ†’ 7.0 ์œ ์ง€
    input[4] = 9.0๊ณผ ๋น„๊ต โ†’ 9.0์œผ๋กœ ๊ฐฑ์‹ 
    input[5] = 4.0๊ณผ ๋น„๊ต โ†’ 9.0 ์œ ์ง€
    input[6] = 6.0๊ณผ ๋น„๊ต โ†’ 9.0 ์œ ์ง€
    input[7] = 8.0๊ณผ ๋น„๊ต โ†’ 9.0 ์œ ์ง€
    ์ตœ์ข… decision_value = 9.0

๋‹จ๊ณ„ 2: decision_value = 9.0์„ ๋ชจ๋“  ๋ ˆ์ธ์— broadcast
  ๋ชจ๋“  ๋ ˆ์ธ: decision_value = 9.0, threshold = 4.5

๋‹จ๊ณ„ 3: ๋ ˆ์ธ๋ณ„ ์กฐ๊ฑด๋ถ€ ์‹คํ–‰
  ๋ ˆ์ธ 0: input[0] = 3.0 < 4.5 โ†’ output[0] = 3.0 / 2.0 = 1.5
  ๋ ˆ์ธ 1: input[1] = 1.0 < 4.5 โ†’ output[1] = 1.0 / 2.0 = 0.5
  ๋ ˆ์ธ 2: input[2] = 7.0 โ‰ฅ 4.5 โ†’ output[2] = 7.0 * 2.0 = 14.0
  ๋ ˆ์ธ 3: input[3] = 2.0 < 4.5 โ†’ output[3] = 2.0 / 2.0 = 1.0
  ๋ ˆ์ธ 4: input[4] = 9.0 โ‰ฅ 4.5 โ†’ output[4] = 9.0 * 2.0 = 18.0
  ๋ ˆ์ธ 5: input[5] = 4.0 < 4.5 โ†’ output[5] = 4.0 / 2.0 = 2.0
  ๋ ˆ์ธ 6: input[6] = 6.0 โ‰ฅ 4.5 โ†’ output[6] = 6.0 * 2.0 = 12.0
  ๋ ˆ์ธ 7: input[7] = 8.0 โ‰ฅ 4.5 โ†’ output[7] = 8.0 * 2.0 = 16.0
  ...๋‚˜๋จธ์ง€ ๋ ˆ์ธ์— ํŒจํ„ด ๋ฐ˜๋ณต

์ˆ˜ํ•™์  ๊ธฐ๋ฐ˜: ์ž„๊ณ„๊ฐ’ ๊ธฐ๋ฐ˜ ๋ณ€ํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large f(x) = \begin{cases} 2x & \text{if } x \geq \tau \\ \frac{x}{2} & \text{if } x < \tau \end{cases}\]

์—ฌ๊ธฐ์„œ \(\tau = \frac{\max(\text{block_data})}{2}\)๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ์ž„๊ณ„๊ฐ’์ž…๋‹ˆ๋‹ค.
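์ด ์ˆ˜์‹ ๊ทธ๋Œ€๋กœ์˜ ๋™์ž‘์„ Python์œผ๋กœ ๋ชจ์‚ฌํ•ด ์˜ˆ์ƒ ์ถœ๋ ฅ์„ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹จ์ผ ์›Œํ”„ยท๋‹จ์ผ ๋ธ”๋ก์„ ๊ฐ€์ •ํ•œ ์„ค๋ช…์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
# ๊ฐ€์ •: SIZE == WARP_SIZE == 32, ๋‹จ์ผ ๋ธ”๋ก. ์„ค๋ช…์šฉ Python ๋ชจ์‚ฌ์ž…๋‹ˆ๋‹ค.
SIZE = 32
inp = [3.0, 1.0, 7.0, 2.0, 9.0, 4.0, 6.0, 8.0] * (SIZE // 8)

# ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ์˜ ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์Œ
decision_value = max(inp[:8])  # 9.0

# ๋‹จ๊ณ„ 2: broadcast โ†’ ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ž„๊ณ„๊ฐ’์„ ์‚ฌ์šฉ
threshold = decision_value / 2.0  # 4.5

# ๋‹จ๊ณ„ 3: ๋ ˆ์ธ๋ณ„ ์กฐ๊ฑด๋ถ€ ๋ณ€ํ™˜ (์ž„๊ณ„๊ฐ’ ์ด์ƒ์€ 2๋ฐฐ, ๋ฏธ๋งŒ์€ ์ ˆ๋ฐ˜)
out = [x * 2.0 if x >= threshold else x / 2.0 for x in inp]
print(out[:8])  # [1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0]
```

8๊ฐœ ๋‹จ์œ„ ํŒจํ„ด์ด ๋ฐ˜๋ณต๋˜๋ฏ€๋กœ ์ถœ๋ ฅ ์ „์ฒด๋„ ์ด ํŒจํ„ด์˜ ๋ฐ˜๋ณต์ด ๋ฉ๋‹ˆ๋‹ค.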

์กฐ์ • ํŒจํ„ด์˜ ์žฅ์ :

  1. ์ค‘์•™ํ™”๋œ ๋ถ„์„: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ๋ถ„์„ํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์ด ํ˜œํƒ์„ ๋ฐ›์Œ
  2. ์ผ๊ด€๋œ ๊ฒฐ์ •: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ž„๊ณ„๊ฐ’์„ ์‚ฌ์šฉ
  3. ์ ์‘ํ˜• ๋™์ž‘: ์ž„๊ณ„๊ฐ’์ด ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ผ ์ ์‘
  4. ํšจ์œจ์  ์กฐ์ •: ๋‹จ์ผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋ณต์žกํ•œ ์กฐ๊ฑด๋ถ€ ๋กœ์ง์„ ์กฐ์ •

ํ™œ์šฉ ๋ถ„์•ผ:

  • ์ ์‘ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ผ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •
  • ํ’ˆ์งˆ ๊ด€๋ฆฌ: ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ์ง€ํ‘œ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์ฒ˜๋ฆฌ ์ ์šฉ
  • ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๋ธ”๋ก ๋กœ์ปฌ ๋ณต์žก๋„ ๋ถ„์„์— ๊ธฐ๋ฐ˜ํ•œ ์ž‘์—… ๋ถ„๋ฐฐ

3. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-shuffle ์กฐ์ •

broadcast()์™€ shuffle_down()์„ ๋ชจ๋‘ ๊ฒฐํ•ฉํ•œ ๊ณ ๊ธ‰ ์กฐ์ •์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ:

  • ๋ ˆ์ธ 0์ด ๋ธ”๋ก์˜ ์ฒ˜์Œ 4๊ฐœ ์š”์†Œ์˜ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ด ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๋ชจ๋“  ๋ ˆ์ธ์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ ๋ ˆ์ธ์€ shuffle_down(offset=1)์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ ์ด์›ƒ์˜ ๊ฐ’์„ ๊ฐ€์ ธ์™€์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๋Œ€๋ถ€๋ถ„์˜ ๋ ˆ์ธ: ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ์— (ํ˜„์žฌ_๊ฐ’ + ๋‹ค์Œ_์ด์›ƒ_๊ฐ’)์„ ๊ณฑํ•ฉ๋‹ˆ๋‹ค
  • ์›Œํ”„์˜ ๋งˆ์ง€๋ง‰ ๋ ˆ์ธ: ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ์— ํ˜„์žฌ_๊ฐ’๋งŒ ๊ณฑํ•ฉ๋‹ˆ๋‹ค (์œ ํšจํ•œ ์ด์›ƒ ์—†์Œ)

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: ์ž…๋ ฅ์€ [2, 4, 6, 8, 1, 3, 5, 7, ...] ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค (์ฒ˜์Œ 4๊ฐœ ์š”์†Œ: 2,4,6,8 ์ดํ›„ 1,3,5,7 ๋ฐ˜๋ณต)

  • ๋ ˆ์ธ 0์ด ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๊ณ„์‚ฐ: (2+4+6+8)/4 = 5.0
  • ์˜ˆ์ƒ ์ถœ๋ ฅ: [30.0, 50.0, 70.0, 45.0, 20.0, 40.0, 60.0, 40.0, ...]

๊ณผ์ œ: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์˜ ๊ณ„์‚ฐ์ด ๋ชจ๋“  ๋ ˆ์ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋ฉด์„œ, ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์ด์›ƒ ๋ฐ์ดํ„ฐ์—๋„ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ์ƒํ™ฉ์—์„œ ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ• ๊นŒ์š”?

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

fn broadcast_shuffle_coordination[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Combine broadcast() and shuffle_down() for advanced warp coordination.
    Lane 0 computes block-local scaling factor, broadcasts it to all lanes in the warp.
    Each lane uses shuffle_down() for neighbor access and applies broadcast factor.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())
    if global_i < size:
        var scale_factor: output.element_type = 0.0

        # FILL IN (roughly 14 lines)


ํŒ

1. ๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์กฐ์ •

์ด ํผ์ฆ์€ broadcast์™€ ์…”ํ”Œ ์—ฐ์‚ฐ์„ ์ˆœ์„œ๋Œ€๋กœ ์กฐ์œจํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ๋ฆ„์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  1. ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ์ „์ฒด ์›Œํ”„๋ฅผ ์œ„ํ•œ ๊ฐ’์„ ๊ณ„์‚ฐ
  2. ์ด ๊ฐ’์ด ๋ชจ๋“  ๋ ˆ์ธ์— broadcast๋จ
  3. ๊ฐ ๋ ˆ์ธ์ด ์…”ํ”Œ๋กœ ์ด์›ƒ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผ
  4. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์ด ์ด์›ƒ ๋ฐ์ดํ„ฐ์˜ ์ฒ˜๋ฆฌ ๋ฐฉ์‹์— ์˜ํ–ฅ

์กฐ์ • ํŒจํ„ด:

# ๋‹จ๊ณ„ 1: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์กฐ์ •
var shared_param = lane_0์ด๋ฉด_๊ณ„์‚ฐ()
shared_param = broadcast(shared_param)

# ๋‹จ๊ณ„ 2: ์…”ํ”Œ ์ด์›ƒ ์ ‘๊ทผ
current_val = input[global_i]
neighbor_val = shuffle_down(current_val, offset)

# ๋‹จ๊ณ„ 3: ๊ฒฐํ•ฉ ๊ณ„์‚ฐ
result = ๊ฒฐํ•ฉ(current_val, neighbor_val, shared_param)

2. ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณ„์‚ฐ ์ „๋žต

์ด์›ƒ ์—ฐ์‚ฐ์„ ์Šค์ผ€์ผ๋งํ•˜๋Š” ๋ฐ ์œ ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋ฌด์—‡์ผ์ง€ ๊ณ ๋ คํ•˜์„ธ์š”.

ํƒ๊ตฌํ•  ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ฐ์ดํ„ฐ์—์„œ ์–ด๋–ค ํ†ต๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ด์•ผ ํ• ๊นŒ์š”?
  • ์ด ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ด์›ƒ ๊ธฐ๋ฐ˜ ๊ณ„์‚ฐ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์ณ์•ผ ํ• ๊นŒ์š”?
  • ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํฌํ•จ๋  ๋•Œ ์›Œํ”„ ๊ฒฝ๊ณ„์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ ๊นŒ์š”?

3. ๊ฒฐํ•ฉ ์—ฐ์‚ฐ ์„ค๊ณ„

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ์™€ ์…”ํ”Œ ๊ธฐ๋ฐ˜ ์ด์›ƒ ์ ‘๊ทผ์„ ์˜๋ฏธ ์žˆ๊ฒŒ ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•˜์„ธ์š”.

ํŒจํ„ด ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ž…๋ ฅ, ์ถœ๋ ฅ, ๋˜๋Š” ๊ณ„์‚ฐ์„ ์Šค์ผ€์ผ๋งํ•ด์•ผ ํ• ๊นŒ์š”?
  • ์…”ํ”Œ์ด ๋ฏธ์ •์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒฝ๊ณ„ ์ผ€์ด์Šค๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ์š”?
  • ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ ์ˆœ์„œ๋Š” ๋ฌด์—‡์ผ๊นŒ์š”?

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-shuffle ์กฐ์ • ํ…Œ์ŠคํŠธ:

pixi run p25 --broadcast-shuffle-coordination
pixi run -e amd p25 --broadcast-shuffle-coordination
uv run poe p25 --broadcast-shuffle-coordination

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([30.0, 50.0, 70.0, 45.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 35.0])
expected: HostBuffer([30.0, 50.0, 70.0, 45.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 35.0])
โœ… ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ + ์…”ํ”Œ coordination test passed!

์†”๋ฃจ์…˜

fn broadcast_shuffle_coordination[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Combine broadcast() and shuffle_down() for advanced warp coordination.
    Lane 0 computes block-local scaling factor, broadcasts it to all lanes in the warp.
    Each lane uses shuffle_down() for neighbor access and applies broadcast factor.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = Int(lane_id())

    if global_i < size:
        # Step 1: Lane 0 computes block-local scaling factor
        var scale_factor: output.element_type = 0.0
        if lane == 0:
            # Compute average of first 4 elements in this block's data
            block_start = Int(block_idx.x * block_dim.x)
            var sum: output.element_type = 0.0
            for i in range(4):
                if block_start + i < size:
                    sum += input[block_start + i]
            scale_factor = sum / 4.0

        # Step 2: Broadcast scaling factor to all lanes in this warp
        scale_factor = broadcast(scale_factor)

        # Step 3: Each lane gets current and next values
        current_val = input[global_i]
        next_val = shuffle_down(current_val, 1)

        # Step 4: Apply broadcast factor with neighbor coordination
        if lane < WARP_SIZE - 1 and global_i < size - 1:
            # Combine current + next, then scale by broadcast factor
            output[global_i] = (current_val + next_val) * scale_factor
        else:
            # Last lane in warp or last element: only current value, scaled by broadcast factor
            output[global_i] = current_val * scale_factor


์ด ์†”๋ฃจ์…˜์€ broadcast์™€ ์…”ํ”Œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ฐ€์žฅ ๊ณ ๊ธ‰ ์›Œํ”„ ์กฐ์ • ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋กœ์ปฌ ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๊ณ„์‚ฐ
    var scale_factor: output.element_type = 0.0
    if lane == 0:
        block_start = block_idx.x * block_dim.x
        var sum: output.element_type = 0.0
        for i in range(4):
            if block_start + i < size:
                sum += input[block_start + i]
        scale_factor = sum / 4.0

    # ๋‹จ๊ณ„ 2: ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๋ชจ๋“  ๋ ˆ์ธ์— broadcast
    scale_factor = broadcast(scale_factor)

    # ๋‹จ๊ณ„ 3: ๊ฐ ๋ ˆ์ธ์ด shuffle์„ ํ†ตํ•ด ํ˜„์žฌ ๊ฐ’๊ณผ ๋‹ค์Œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ด
    current_val = input[global_i]
    next_val = shuffle_down(current_val, 1)

    # ๋‹จ๊ณ„ 4: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒฉํ„ฐ๋ฅผ ์ด์›ƒ ์กฐ์ •๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ ์ ์šฉ
    if lane < WARP_SIZE - 1 and global_i < size - 1:
        output[global_i] = (current_val + next_val) * scale_factor
    else:
        output[global_i] = current_val * scale_factor

๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์‹คํ–‰ ์ถ”์ :

์ž…๋ ฅ ๋ฐ์ดํ„ฐ: [2, 4, 6, 8, 1, 3, 5, 7, ...]

๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0 ๊ณ„์‚ฐ: (input[0] + input[1] + input[2] + input[3]) / 4
              = (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5.0
  ๋‹ค๋ฅธ ๋ ˆ์ธ: scale_factor๋Š” 0.0 ์œ ์ง€

๋‹จ๊ณ„ 2: scale_factor = 5.0์„ ๋ชจ๋“  ๋ ˆ์ธ์— broadcast
  ๋ชจ๋“  ๋ ˆ์ธ: scale_factor = 5.0

๋‹จ๊ณ„ 3: ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•œ ์…”ํ”Œ ์—ฐ์‚ฐ
  ๋ ˆ์ธ 0: current_val = input[0] = 2, next_val = shuffle_down(2, 1) = input[1] = 4
  ๋ ˆ์ธ 1: current_val = input[1] = 4, next_val = shuffle_down(4, 1) = input[2] = 6
  ๋ ˆ์ธ 2: current_val = input[2] = 6, next_val = shuffle_down(6, 1) = input[3] = 8
  ๋ ˆ์ธ 3: current_val = input[3] = 8, next_val = shuffle_down(8, 1) = input[4] = 1
  ...
  ๋ ˆ์ธ 31: current_val = input[31], next_val = ๋ฏธ์ •์˜

๋‹จ๊ณ„ 4: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์Šค์ผ€์ผ๋ง๊ณผ ๊ฒฐํ•ฉํ•œ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0: output[0] = (2 + 4) * 5.0 = 6 * 5.0 = 30.0
  ๋ ˆ์ธ 1: output[1] = (4 + 6) * 5.0 = 10 * 5.0 = 50.0
  ๋ ˆ์ธ 2: output[2] = (6 + 8) * 5.0 = 14 * 5.0 = 70.0
  ๋ ˆ์ธ 3: output[3] = (8 + 1) * 5.0 = 9 * 5.0 = 45.0
  ...
  ๋ ˆ์ธ 31: output[31] = 7 * 5.0 = 35.0 (๊ฒฝ๊ณ„ - ์ด์›ƒ ์—†์Œ)

ํ†ต์‹  ํŒจํ„ด ๋ถ„์„: ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ˆ˜์ง ์กฐ์ • (broadcast): ๋ ˆ์ธ 0 โ†’ ๋ชจ๋“  ๋ ˆ์ธ
  2. ์ˆ˜ํ‰ ์กฐ์ • (shuffle): ๋ ˆ์ธ i โ†’ ๋ ˆ์ธ i+1
  3. ๊ฒฐํ•ฉ ๊ณ„์‚ฐ: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ฐ์ดํ„ฐ์™€ ์…”ํ”Œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ํ™œ์šฉ

์ˆ˜ํ•™์  ๊ธฐ๋ฐ˜: \[\Large \text{output}[i] = \begin{cases} (\text{input}[i] + \text{input}[i+1]) \cdot \beta & \text{if lane } i < \text{WARP_SIZE} - 1 \\ \text{input}[i] \cdot \beta & \text{if lane } i = \text{WARP_SIZE} - 1 \end{cases}\]

์—ฌ๊ธฐ์„œ \(\beta = \frac{1}{4}\sum_{k=0}^{3} \text{input}[\text{block_start} + k]\)๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ์ž…๋‹ˆ๋‹ค.
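broadcast(์ˆ˜์ง)์™€ shuffle_down(์ˆ˜ํ‰)์˜ ๊ฒฐํ•ฉ ์—ญ์‹œ Python์œผ๋กœ ๋ชจ์‚ฌํ•ด ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. shuffle_down(offset=1)์„ "๋ ˆ์ธ i๊ฐ€ ๋ ˆ์ธ i+1์˜ ๊ฐ’์„ ๋ฐ›๋Š”๋‹ค"๋กœ ๋‹จ์ˆœํ™”ํ•œ ๋‹จ์ผ ์›Œํ”„ ๊ฐ€์ •์˜ ์„ค๋ช…์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
# ๊ฐ€์ •: SIZE == WARP_SIZE == 32, ๋‹จ์ผ ๋ธ”๋ก. ์„ค๋ช…์šฉ Python ๋ชจ์‚ฌ์ž…๋‹ˆ๋‹ค.
WARP_SIZE = 32
inp = [2.0, 4.0, 6.0, 8.0] + [1.0, 3.0, 5.0, 7.0] * 7  # ๊ธธ์ด 32

# ๋‹จ๊ณ„ 1-2: ๋ ˆ์ธ 0์ด ์ฒ˜์Œ 4๊ฐœ ์š”์†Œ์˜ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•œ ๋’ค broadcast
scale_factor = sum(inp[:4]) / 4.0  # (2+4+6+8)/4 = 5.0

# ๋‹จ๊ณ„ 3: shuffle_down(offset=1) โ†’ ๋ ˆ์ธ i๊ฐ€ ๋ ˆ์ธ i+1์˜ ๊ฐ’์„ ๋ฐ›์Œ
# ๋‹จ๊ณ„ 4: ๋งˆ์ง€๋ง‰ ๋ ˆ์ธ์€ ์œ ํšจํ•œ ์ด์›ƒ์ด ์—†์œผ๋ฏ€๋กœ ํ˜„์žฌ ๊ฐ’๋งŒ ์Šค์ผ€์ผ๋ง
out = []
for lane in range(WARP_SIZE):
    if lane < WARP_SIZE - 1:
        out.append((inp[lane] + inp[lane + 1]) * scale_factor)
    else:
        out.append(inp[lane] * scale_factor)

print(out[:4], out[-1])  # [30.0, 50.0, 70.0, 45.0] 35.0
```

๊ฒฐ๊ณผ๋Š” ์œ„ ์‹คํ–‰ ์ถ”์ ๊ณผ ์˜ˆ์ƒ ์ถœ๋ ฅ HostBuffer์˜ ๊ฐ’๊ณผ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.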

๊ณ ๊ธ‰ ์กฐ์ •์˜ ์žฅ์ :

  1. ๋‹ค๋‹จ๊ณ„ ํ†ต์‹ : ์ „์—ญ(broadcast)๊ณผ ์ง€์—ญ(shuffle) ์กฐ์ •์˜ ๊ฒฐํ•ฉ
  2. ์ ์‘ํ˜• ์Šค์ผ€์ผ๋ง: ๋ธ”๋ก ๋ ˆ๋ฒจ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ด์›ƒ ์—ฐ์‚ฐ์— ์˜ํ–ฅ
  3. ํšจ์œจ์  ๊ตฌ์„ฑ: ๋‘ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋งค๋„๋Ÿฝ๊ฒŒ ํ˜‘๋ ฅ
  4. ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„: ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ

์‹ค์ œ ํ™œ์šฉ ์‚ฌ๋ก€:

  • ์ ์‘ํ˜• ํ•„ํ„ฐ๋ง: ๋ธ”๋ก ๋ ˆ๋ฒจ ๋…ธ์ด์ฆˆ ์ถ”์ •๊ณผ ์ด์›ƒ ๊ธฐ๋ฐ˜ ํ•„ํ„ฐ๋ง
  • ๋™์  ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ์ „์—ญ ์ž‘์—… ๋ถ„๋ฐฐ์™€ ๋กœ์ปฌ ์กฐ์ •
  • ๋‹ค์ค‘ ์Šค์ผ€์ผ ์ฒ˜๋ฆฌ: ์ „์—ญ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋กœ์ปฌ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์ œ์–ด

์š”์•ฝ

์ด ์„น์…˜์˜ ํ•ต์‹ฌ ํŒจํ„ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

var shared_value = initial_value
if lane == 0:
    shared_value = compute_block_statistic()
shared_value = broadcast(shared_value)
result = use_shared_value(shared_value, local_data)

ํ•ต์‹ฌ ์žฅ์ :

  • ์ผ๋Œ€๋‹ค ์กฐ์ •: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ๊ณ„์‚ฐํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์ด ํ˜œํƒ์„ ๋ฐ›์Œ
  • ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: SIMT ์‹คํ–‰์ด ์กฐ์ •์„ ์ฒ˜๋ฆฌ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ํŒจํ„ด: ์…”ํ”Œ๊ณผ ๋‹ค๋ฅธ ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ

ํ™œ์šฉ ๋ถ„์•ผ: ๋ธ”๋ก ํ†ต๊ณ„, ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •, ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต์œ , ์ ์‘ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜.

Puzzle 26: ๊ณ ๊ธ‰ ์›Œํ”„ ํŒจํ„ด

๊ฐœ์š”

Puzzle 26: ๊ณ ๊ธ‰ ์›Œํ”„ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ์—์„œ๋Š” ์ •๊ตํ•œ GPU ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ๊ณผ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ, ์ฆ‰ ์›Œํ”„ ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. shuffle_xor์„ ์‚ฌ์šฉํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ prefix_sum์„ ์‚ฌ์šฉํ•œ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ”์„ ๋ฐฐ์šฐ๋ฉฐ, ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์—†์ด ์ด๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ตํž™๋‹ˆ๋‹ค.

๋‹ฌ์„ฑ ๋ชฉํ‘œ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ๋‹ค๋‹จ๊ณ„ ๋ฆฌ๋•์…˜ ํŒจํ„ด์—์„œ ๋ฒ—์–ด๋‚˜, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ๋ณ‘๋ ฌ ์Šค์บ” ์œ ๋‹›์„ ํ™œ์šฉํ•˜๋Š” ์šฐ์•„ํ•œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์›Œํ”„๋Š” ์ •๊ตํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹ ๊ณผ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์„ ํ•˜๋“œ์›จ์–ด์—์„œ ์ง์ ‘ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Mojo์˜ ๊ณ ๊ธ‰ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ์ „์šฉ ์Šค์บ” ์œ ๋‹›์„ ํ™œ์šฉํ•˜์—ฌ \(O(\log n)\) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ๋ช…๋ น ์ˆ˜์ค€์˜ ๊ฐ„๊ฒฐํ•จ์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

๊ณ ๊ธ‰ ์›Œํ”„ ํ†ต์‹  ๋ชจ๋ธ

GPU ์›Œํ”„ ๋‚ด ์ •๊ตํ•œ ํ†ต์‹  ํŒจํ„ด์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ์›Œํ”„ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ (32 ์Šค๋ ˆ๋“œ, XOR ๊ธฐ๋ฐ˜ ํ†ต์‹ )
Offset 16: Lane 0 โ†” Lane 16, Lane 1 โ†” Lane 17, ..., Lane 15 โ†” Lane 31
Offset 8:  Lane 0 โ†” Lane 8,  Lane 1 โ†” Lane 9,  ..., Lane 23 โ†” Lane 31
Offset 4:  Lane 0 โ†” Lane 4,  Lane 1 โ†” Lane 5,  ..., Lane 27 โ†” Lane 31
Offset 2:  Lane 0 โ†” Lane 2,  Lane 1 โ†” Lane 3,  ..., Lane 29 โ†” Lane 31
Offset 1:  Lane 0 โ†” Lane 1,  Lane 2 โ†” Lane 3,  ..., Lane 30 โ†” Lane 31

ํ•˜๋“œ์›จ์–ด ๋ˆ„์  ํ•ฉ (๋ณ‘๋ ฌ ์Šค์บ” ๊ฐ€์†)
์ž…๋ ฅ:  [1, 2, 3, 4, 5, 6, 7, 8, ...]
์ถœ๋ ฅ: [1, 3, 6, 10, 15, 21, 28, 36, ...] (inclusive scan)

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ: XOR ๊ธฐ๋ฐ˜ ํ†ต์‹ ์ด ์ตœ์ ์˜ ํŠธ๋ฆฌ ํ† ํด๋กœ์ง€๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ์ „์šฉ ์Šค์บ” ์œ ๋‹›: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ
  • ๋กœ๊ทธ ๋ณต์žก๋„: \(O(\log n)\) ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด \(O(n)\) ์ˆœ์ฐจ ํŒจํ„ด์„ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ์ผ ์‚ฌ์ดํด ์—ฐ์‚ฐ: ๋ณต์žกํ•œ ๋ฆฌ๋•์…˜์ด ์ „์šฉ ํ•˜๋“œ์›จ์–ด์—์„œ ์ฒ˜๋ฆฌ๋ฉ๋‹ˆ๋‹ค

Mojo์˜ ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ

gpu.primitives.warp์˜ ์ •๊ตํ•œ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. shuffle_xor(value, mask): ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ XOR ๊ธฐ๋ฐ˜ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 
  2. prefix_sum(value): ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ
  3. ๊ณ ๊ธ‰ ์กฐ์ • ํŒจํ„ด: ์—ฌ๋Ÿฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜

์ฐธ๊ณ : ์ด ๊ธฐ๋ณธ ์š”์†Œ๋“ค์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜, ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, quicksort ํŒŒํ‹ฐ์…”๋‹, FFT ์—ฐ์‚ฐ ๋“ฑ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ • ์ฝ”๋“œ๊ฐ€ ์ˆ˜์‹ญ ์ค„ ํ•„์š”ํ–ˆ์„ ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ (๊ธฐ์กด ๋ฐฉ์‹ - Puzzle 14 ์ฐธ๊ณ ):
shared = LayoutTensor[
    dtype,
    Layout.row_major(WARP_SIZE),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
shared[local_i] = input[global_i]
barrier()
offset = 1
for i in range(Int(log2(Scalar[dtype](WARP_SIZE)))):
    var current_val: output.element_type = 0
    if local_i >= offset and local_i < WARP_SIZE:
        current_val = shared[local_i - offset]
    barrier()
    if local_i >= offset and local_i < WARP_SIZE:
        shared[local_i] += current_val
    barrier()
    offset *= 2

# ๊ณ ๊ธ‰ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:
current_val = input[global_i]
scan_result = prefix_sum[exclusive=False](current_val)  # ๋‹จ์ผ ํ˜ธ์ถœ!
output[global_i] = scan_result

๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

| ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด | ๊ธฐ์กด ๋ฐฉ์‹ | ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ |
|---|---|---|
| ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ | ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด | ๋‹จ์ผ shuffle_xor ํŠธ๋ฆฌ |
| ๋ˆ„์  ํ•ฉ/์Šค์บ” ์—ฐ์‚ฐ | ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ | ํ•˜๋“œ์›จ์–ด prefix_sum |
| ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜ | ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ | prefix_sum + ์กฐ์ • |
| Quicksort ํŒŒํ‹ฐ์…˜ | ์ˆ˜๋™ ์œ„์น˜ ๊ณ„์‚ฐ | ๊ฒฐํ•ฉ๋œ ๊ธฐ๋ณธ ์š”์†Œ |
| ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ | ์žฌ๊ท€์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ | ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  |

์„ ์ˆ˜ ์ง€์‹

๊ณ ๊ธ‰ ์›Œํ”„ ํ†ต์‹ ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • Part VII ์›Œํ”„ ๊ธฐ์ดˆ: SIMT ์‹คํ–‰๊ณผ ๊ธฐ๋ณธ ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด (Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ์™€ Puzzle 25: ์›Œํ”„ ํ†ต์‹  ์ฐธ๊ณ )
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ด๋ก : ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜, ๋ณ‘๋ ฌ ์Šค์บ”, ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ
  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด๊ณผ ๋™๊ธฐํ™” (Puzzle 14: ๋ˆ„์  ํ•ฉ ์ฐธ๊ณ )
  • ์ˆ˜ํ•™ ์—ฐ์‚ฐ: XOR ์—ฐ์‚ฐ๊ณผ ๋กœ๊ทธ ๋ณต์žก๋„์— ๋Œ€ํ•œ ์ดํ•ด

ํ•™์Šต ๊ฒฝ๋กœ

1. shuffle_xor์„ ์ด์šฉํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

โ†’ warp.shuffle_xor()์™€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ

ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ XOR ๊ธฐ๋ฐ˜ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_xor()์œผ๋กœ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€ ๊ตฌ์„ฑํ•˜๊ธฐ
  • ํŠธ๋ฆฌ ํ†ต์‹ ์„ ํ™œ์šฉํ•œ \(O(\log n)\) ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ตฌํ˜„
  • XOR ๊ธฐ๋ฐ˜ ๋ ˆ์ธ ํŽ˜์–ด๋ง๊ณผ ํ†ต์‹  ํŒจํ„ด ์ดํ•ด
  • ๋‹ค์ค‘ ๊ฐ’ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์—ฐ์‚ฐ

ํ•ต์‹ฌ ํŒจํ„ด:

max_val = input[global_i]
offset = WARP_SIZE // 2
while offset > 0:
    max_val = max(max_val, shuffle_xor(max_val, offset))
    offset //= 2
# ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ€์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

2. prefix_sum์„ ์ด์šฉํ•œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”

โ†’ warp.prefix_sum()๊ณผ ์Šค์บ” ์—ฐ์‚ฐ

๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • prefix_sum()์„ ํ™œ์šฉํ•œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ˆ„์  ์—ฐ์‚ฐ
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜๊ณผ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹ ๊ตฌํ˜„
  • prefix_sum๊ณผ shuffle_xor์„ ๊ฒฐํ•ฉํ•œ ๊ณ ๊ธ‰ ์กฐ์ •
  • Inclusive vs exclusive ์Šค์บ” ํŒจํ„ด ์ดํ•ด

ํ•ต์‹ฌ ํŒจํ„ด:

current_val = input[global_i]
scan_result = prefix_sum[exclusive=False](current_val)
output[global_i] = scan_result  # ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ˆ„์  ํ•ฉ
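inclusive/exclusive ์Šค์บ”์˜ ์ฐจ์ด๋Š” Python ํ‘œ์ค€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ itertools.accumulate๋กœ ๋ชจ์‚ฌํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹จ์ผ ์›Œํ”„์˜ prefix_sum ๊ฒฐ๊ณผ๋ฅผ ํ‰๋‚ด ๋‚ธ ์„ค๋ช…์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
# ๊ฐ€์ •: ๋‹จ์ผ ์›Œํ”„์˜ prefix_sum์„ Python์œผ๋กœ ๋ชจ์‚ฌํ•œ ์„ค๋ช…์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค.
from itertools import accumulate

vals = [float(i + 1) for i in range(8)]  # [1, 2, ..., 8]

inclusive = list(accumulate(vals))       # prefix_sum[exclusive=False]์— ๋Œ€์‘
exclusive = [0.0] + inclusive[:-1]       # prefix_sum[exclusive=True]์— ๋Œ€์‘

print(inclusive)  # [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0]
print(exclusive)  # [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0]
```

inclusive ์Šค์บ”์€ ์ž์‹ ์˜ ๊ฐ’์„ ํฌํ•จํ•œ ๋ˆ„์  ํ•ฉ์ด๊ณ , exclusive ์Šค์บ”์€ ์ž์‹  ์ด์ „๊นŒ์ง€์˜ ๋ˆ„์  ํ•ฉ์ž…๋‹ˆ๋‹ค. ์•ž์„œ ๋ณธ ๊ฐœ์š”์˜ ์ž…๋ ฅ [1, 2, 3, ...] โ†’ ์ถœ๋ ฅ [1, 3, 6, 10, ...] ์˜ˆ์‹œ๊ฐ€ ๋ฐ”๋กœ inclusive ์Šค์บ”์ž…๋‹ˆ๋‹ค.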

ํ•ต์‹ฌ ๊ฐœ๋…

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ†ต์‹ 

XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํ† ํด๋กœ์ง€๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • XOR ํŽ˜์–ด๋ง: lane_id โŠ• mask๊ฐ€ ๋Œ€์นญ ํ†ต์‹  ์Œ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ๊ณ„์ธต์  ๋ฐ์ดํ„ฐ ๊ตํ™˜์„ ํ†ตํ•œ ๋กœ๊ทธ ๋ณต์žก๋„
  • ๋ณ‘๋ ฌ ์กฐ์ •: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ฆฌ๋•์…˜์— ๋™์‹œ์— ์ฐธ์—ฌํ•ฉ๋‹ˆ๋‹ค
  • ๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜: 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ WARP_SIZE (32, 64 ๋“ฑ) ์–ด๋””์„œ๋‚˜ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค

ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”

์ „์šฉ ์Šค์บ” ์œ ๋‹›์˜ ๋Šฅ๋ ฅ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ํ™œ์šฉํ•œ ๋ˆ„์  ์—ฐ์‚ฐ
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜
  • ๋‹จ์ผ ํ•จ์ˆ˜ ๊ฐ„๊ฒฐ์„ฑ: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋‹จ์ผ ํ˜ธ์ถœ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ชจ๋“  ์กฐ์ •์„ ๋‚ด๋ถ€์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„ ๋ณ€ํ™˜

๊ธฐ์กด ํŒจํ„ด์„ ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ์ˆœ์ฐจ ๋ฆฌ๋•์…˜ (\(O(n)\)) โ†’ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ (\(O(\log n)\))
  • ๋‹ค๋‹จ๊ณ„ ์Šค์บ” ์•Œ๊ณ ๋ฆฌ์ฆ˜ โ†’ ๋‹จ์ผ ํ•˜๋“œ์›จ์–ด prefix_sum
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด โ†’ ๋ ˆ์ง€์Šคํ„ฐ ์ „์šฉ ์—ฐ์‚ฐ
  • ๋ช…์‹œ์  ๋™๊ธฐํ™” โ†’ ํ•˜๋“œ์›จ์–ด ๊ด€๋ฆฌ ์กฐ์ •

๊ณ ๊ธ‰ ์กฐ์ • ํŒจํ„ด

์—ฌ๋Ÿฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ์ด์ค‘ ๋ฆฌ๋•์…˜: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์„ ํ™œ์šฉํ•œ ๋™์‹œ min/max ์ถ”์ 
  • ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹: quicksort ์Šคํƒ€์ผ ์—ฐ์‚ฐ์„ ์œ„ํ•œ shuffle_xor + prefix_sum
  • ์กฐ๊ฑด๋ถ€ ์—ฐ์‚ฐ: ์ „์—ญ ์กฐ์ •์„ ํ†ตํ•œ ๋ ˆ์ธ ๊ธฐ๋ฐ˜ ์ถœ๋ ฅ ์„ ํƒ
  • ๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ตœ์  ์„ฑ๋Šฅ์˜ ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํŒจํ„ด

์‹œ์ž‘ํ•˜๊ธฐ

๊ณ ๊ธ‰ GPU ์›Œํ”„ ๋ ˆ๋ฒจ ํ†ต์‹ ์„ ํ™œ์šฉํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ์—ฐ์‚ฐ์œผ๋กœ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹ ์„ ์ดํ•ดํ•œ ๋‹ค์Œ, ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”์œผ๋กœ ๋‚˜์•„๊ฐ€ ์ตœ์ ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์„ธ์š”.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ์„ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋นŒ๋”ฉ ๋ธ”๋ก์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”. ์ด ๊ธฐ๋ณธ ์š”์†Œ๋“ค์€ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ „์ฒด ๋ฒ”์ฃผ๋ฅผ ๋‹จ์ผ ์ตœ์ ํ™” ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Puzzle 26์„ ๋งˆ์น˜๋ฉด, ๊ณ ๊ธ‰ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์ธ์‹ํ•˜์—ฌ ํ›จ์”ฌ ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅธ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜, ๋ณ‘๋ ฌ ์Šค์บ”, ์กฐ์ • ํŒจํ„ด์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

시작하기: warp.shuffle_xor()와 버터플라이 네트워크에서 버터플라이 통신을 배운 다음, warp.prefix_sum()과 스캔 연산에서 하드웨어 가속 병렬 스캔 패턴으로 나아가세요!


warp.shuffle_xor() ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์—์„œ๋Š” shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๋‚ด์— ์ •๊ตํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด์„ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜, ์ •๋ ฌ ๋„คํŠธ์›Œํฌ, ๊ณ ๊ธ‰ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: shuffle_xor() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋ฉฐ, ์›Œํ”„ ํฌ๊ธฐ์— ๋Œ€ํ•ด \(O(\log n)\) ๋ณต์žก๋„๋กœ ํ™•์žฅ๋˜๋Š” ํšจ์œจ์ ์ธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋ž€? ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค์ด ์ธ๋ฑ์Šค์˜ XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•˜๋Š” ํ†ต์‹  ํ† ํด๋กœ์ง€์ž…๋‹ˆ๋‹ค. ์ด๋ฆ„์€ ์‹œ๊ฐ์ ์œผ๋กœ ๊ทธ๋ ธ์„ ๋•Œ ๋‚˜๋น„ ๋‚ ๊ฐœ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ์—ฐ๊ฒฐ ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋„คํŠธ์›Œํฌ๋Š” \(O(\log n)\) ํ†ต์‹  ๋ณต์žก๋„๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— FFT, bitonic ์ •๋ ฌ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ฐ™์€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_xor()์„ ํ™œ์šฉํ•œ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€
  • \(O(\log n)\) ๋ณต์žก๋„์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ๊ณ ๊ธ‰ ์กฐ์ •์„ ์œ„ํ•œ ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ

shuffle_xor ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{shuffle_xor}(\text{value}, \text{mask}) = \text{value_from_lane}(\text{lane_id} \oplus \text{mask})\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ์กฐ์ • ์—†์ด ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜๊ณผ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ๊ธฐ๋ณธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)

shuffle_xor ๊ฐœ๋…

๊ธฐ์กด ํŽ˜์–ด ๊ตํ™˜ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ณผ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”
shared_memory[lane] = input[global_i]
barrier()
if lane % 2 == 0:
    partner = lane + 1
else:
    partner = lane - 1
if partner < WARP_SIZE:
    swapped_val = shared_memory[partner]

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”
  • ๋ณต์žกํ•œ ๋กœ์ง: ์ˆ˜๋™ ํŒŒํŠธ๋„ˆ ๊ณ„์‚ฐ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ๋‚ฎ์€ ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ํ†ต์‹ ์„ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•จ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜๋ฉด ํŽ˜์–ด ๊ตํ™˜์ด ์šฐ์•„ํ•ด์ง‘๋‹ˆ๋‹ค:

# ๋ฒ„ํ„ฐํ”Œ๋ผ์ด XOR ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”
current_val = input[global_i]
swapped_val = shuffle_xor(current_val, 1)  # 1๊ณผ XORํ•˜๋ฉด ํŽ˜์–ด๊ฐ€ ์ƒ์„ฑ๋จ
output[global_i] = swapped_val

shuffle_xor์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ชจ๋“  ๋ ˆ์ธ์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์ธ์ ‘ ํŽ˜์–ด ๊ฐ„ ๊ฐ’์„ ๊ตํ™˜ํ•˜๋Š” ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: XOR ํŒจํ„ด์œผ๋กœ ์ธ์ ‘ ํŽ˜์–ด๋ฅผ ๋งŒ๋“ค์–ด ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \text{input}[i \oplus 1]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [0, 1, 2, 3, 4, 5, 6, 7, ...]์„ ํŽ˜์–ด [1, 0, 3, 2, 5, 4, 7, 6, ...]์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๊ฐ ํŽ˜์–ด (i, i+1)์ด XOR ํ†ต์‹ ์œผ๋กœ ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค.


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p26/p26.mojo

ํŒ

1. shuffle_xor ์ดํ•ดํ•˜๊ธฐ

shuffle_xor(value, mask) ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ๋งˆ์Šคํฌ๋งŒํผ ์ฐจ์ด๋‚˜๋Š” ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์Šคํฌ ๊ฐ’์œผ๋กœ ๋ ˆ์ธ ID๋ฅผ XORํ–ˆ์„ ๋•Œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํƒ๊ตฌํ•  ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ๋ ˆ์ธ 1์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?

ํžŒํŠธ: ์ฒ˜์Œ ๋ช‡ ๊ฐœ์˜ ๋ ˆ์ธ ID์— ๋Œ€ํ•ด XOR ์—ฐ์‚ฐ์„ ์ง์ ‘ ํ•ด๋ณด๋ฉด ํŽ˜์–ด๋ง ํŒจํ„ด์„ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2. XOR ํŽ˜์–ด ํŒจํ„ด

๋ ˆ์ธ ID์˜ ์ด์ง„ ํ‘œํ˜„๊ณผ ์ตœํ•˜์œ„ ๋น„ํŠธ๋ฅผ ๋’ค์ง‘์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

๊ณ ๋ คํ•  ์งˆ๋ฌธ:

  • ์ง์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ํ™€์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ์™œ ์ด๊ฒƒ์ด ์™„๋ฒฝํ•œ ํŽ˜์–ด๋ฅผ ๋งŒ๋“œ๋‚˜์š”?

3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

shuffle_down()๊ณผ ๋‹ฌ๋ฆฌ shuffle_xor() ์—ฐ์‚ฐ์€ ์›Œํ”„ ๊ฒฝ๊ณ„ ๋‚ด์—์„œ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ž‘์€ ๋งˆ์Šคํฌ๋กœ์˜ XOR์ด ์ ˆ๋Œ€๋กœ ๋ฒ”์œ„ ๋ฐ–์˜ ๋ ˆ์ธ ID๋ฅผ ๋งŒ๋“ค์ง€ ์•Š๋Š” ์ด์œ ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ์œ ํšจํ•œ ๋ ˆ์ธ ID๋ฅผ 1๊ณผ XORํ–ˆ์„ ๋•Œ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ๋ ˆ์ธ ID๋Š” ์–ผ๋งˆ์ธ๊ฐ€์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜ ํ…Œ์ŠคํŠธ:

pixi run p26 --pair-swap
pixi run -e amd p26 --pair-swap
pixi run -e apple p26 --pair-swap
uv run poe p26 --pair-swap

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
expected: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
โœ… Butterfly pair swap test passed!

์†”๋ฃจ์…˜

fn butterfly_pair_swap[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
    Each thread exchanges its value with its XOR-1 neighbor, creating pairs: (0,1), (2,3), (4,5), etc.
    Uses shuffle_xor(val, 1) to swap values within each pair.
    This is the foundation of butterfly network communication patterns.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        current_val = input[global_i]

        # Exchange with XOR-1 neighbor using butterfly pattern
        # Lane 0 exchanges with lane 1, lane 2 with lane 3, etc.
        swapped_val = shuffle_xor(current_val, 1)

        # For demonstration, we'll store the swapped value
        # In real applications, this might be used for sorting, reduction, etc.
        output[global_i] = swapped_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด XOR ํ†ต์‹  ํŒจํ„ด์„ ํ†ตํ•ด ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ์–ด๋–ป๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]              # ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ฝ์Œ
    swapped_val = shuffle_xor(current_val, 1)  # XOR๋กœ ํŽ˜์–ด ๊ตํ™˜ ์ƒ์„ฑ

    # ๊ตํ™˜๋œ ๊ฐ’์„ ์ €์žฅ
    output[global_i] = swapped_val

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  Lane 0: current_val = input[0] = 0
  Lane 1: current_val = input[1] = 1
  Lane 2: current_val = input[2] = 2
  Lane 3: current_val = input[3] = 3
  ...
  Lane 31: current_val = input[31] = 31

์‚ฌ์ดํด 2: shuffle_xor(current_val, 1)์ด ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์‹คํ–‰
  Lane 0: Lane 1์—์„œ ์ˆ˜์‹  (0โŠ•1=1) โ†’ swapped_val = 1
  Lane 1: Lane 0์—์„œ ์ˆ˜์‹  (1โŠ•1=0) โ†’ swapped_val = 0
  Lane 2: Lane 3์—์„œ ์ˆ˜์‹  (2โŠ•1=3) โ†’ swapped_val = 3
  Lane 3: Lane 2์—์„œ ์ˆ˜์‹  (3โŠ•1=2) โ†’ swapped_val = 2
  ...
  Lane 30: Lane 31์—์„œ ์ˆ˜์‹  (30โŠ•1=31) โ†’ swapped_val = 31
  Lane 31: Lane 30์—์„œ ์ˆ˜์‹  (31โŠ•1=30) โ†’ swapped_val = 30

์‚ฌ์ดํด 3: ๊ฒฐ๊ณผ ์ €์žฅ
  Lane 0: output[0] = 1
  Lane 1: output[1] = 0
  Lane 2: output[2] = 3
  Lane 3: output[3] = 2
  ...

์ˆ˜ํ•™์  ํ†ต์ฐฐ: XOR ์†์„ฑ์„ ํ™œ์šฉํ•œ ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{XOR}(i, 1) = \begin{cases} i + 1 & \text{if } i \bmod 2 = 0 \\ i - 1 & \text{if } i \bmod 2 = 1 \end{cases}\]

shuffle_xor์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ์™„๋ฒฝํ•œ ๋Œ€์นญ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ํŽ˜์–ด์— ์ฐธ์—ฌ
  2. ์กฐ์ • ๋ถˆํ•„์š”: ๋ชจ๋“  ํŽ˜์–ด๊ฐ€ ๋™์‹œ์— ๊ตํ™˜
  3. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์›Œํ”„ ์ „์ฒด์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  4. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜)
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์—†์Œ)
  • ๋ณ‘๋ ฌ์„ฑ: WARP_SIZE๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ๊ตํ™˜
  • ํ™•์žฅ์„ฑ: ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๊ด€๊ณ„์—†์ด \(O(1)\) ๋ณต์žก๋„

2. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

๊ฐ์†Œํ•˜๋Š” offset์œผ๋กœ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด shuffle_xor์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ํ†ตํ•ด ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{max_result} = \max_{i=0}^{\small\text{WARP_SIZE}-1} \text{input}[i]\]

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŒจํ„ด: XOR ์˜คํ”„์…‹์„ WARP_SIZE/2์—์„œ 1๊นŒ์ง€ ์ ˆ๋ฐ˜์”ฉ ์ค„์—ฌ๊ฐ€๋ฉฐ, ํ†ต์‹  ๋ฒ”์œ„๊ฐ€ ๋‹จ๊ณ„๋งˆ๋‹ค ๋ฐ˜์œผ๋กœ ์ข์•„์ง€๋Š” ์ด์ง„ ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • 1๋‹จ๊ณ„: WARP_SIZE/2 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (์›Œํ”„ ์ „์ฒด๋ฅผ ํฌ๊ด„)
  • 2๋‹จ๊ณ„: WARP_SIZE/4 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (๋ฒ”์œ„๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ขํž˜)
  • 3๋‹จ๊ณ„: WARP_SIZE/8 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต
  • 4๋‹จ๊ณ„: offset = 1์ด ๋  ๋•Œ๊นŒ์ง€ ๊ณ„์† ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ž„

\(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„๋ฅผ ๊ฑฐ์น˜๋ฉด ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ–๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.
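단계 수가 정확히 \(\log_2(\text{WARP_SIZE})\)가 된다는 점은 offset 감소 루프를 직접 세어 보면 확인됩니다 (파이썬 스케치).

```python
import math

def butterfly_steps(warp_size):
    """offset을 warp_size // 2에서 1까지 절반씩 줄일 때의 단계 수."""
    steps, offset = 0, warp_size // 2
    while offset > 0:
        steps += 1
        offset //= 2
    return steps

# 2의 거듭제곱 WARP_SIZE에 대해 정확히 log2(WARP_SIZE) 단계
assert butterfly_steps(32) == int(math.log2(32))  # 5단계
assert butterfly_steps(64) == int(math.log2(64))  # 6단계
```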

fn butterfly_parallel_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets starting from WARP_SIZE/2 down to 1.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm that works
    for any WARP_SIZE (32, 64, etc.).
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # FILL ME IN (roughly 7 lines)


ํŒ

1. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ์ดํ•ดํ•˜๊ธฐ

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ ์ด์ง„ ํŠธ๋ฆฌ ํ†ต์‹  ํŒจํ„ด์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„์—์„œ ๋ฌธ์ œ ํฌ๊ธฐ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ์ตœ๋Œ€ ๋ฒ”์œ„๋ฅผ ์ปค๋ฒ„ํ•˜๋ ค๋ฉด ์‹œ์ž‘ offset์ด ์–ผ๋งˆ์—ฌ์•ผ ํ•˜๋‚˜์š”?
  • ๋‹จ๊ณ„ ์‚ฌ์ด์— ์˜คํ”„์…‹์„ ์–ด๋–ป๊ฒŒ ๋ณ€๊ฒฝํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ์–ธ์ œ ๋ฆฌ๋•์…˜์„ ๋ฉˆ์ถฐ์•ผ ํ•˜๋‚˜์š”?

힌트: “버터플라이”라는 이름은 통신 패턴에서 유래합니다 - 작은 예제에 대해 직접 그려보세요.


2. XOR ๋ฆฌ๋•์…˜ ํŠน์„ฑ

XOR์€ ๊ฐ ๋‹จ๊ณ„์—์„œ ๊ฒน์น˜์ง€ ์•Š๋Š” ํ†ต์‹  ํŽ˜์–ด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์—์„œ ์™œ ์ค‘์š”ํ•œ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ์„œ๋กœ ๋‹ค๋ฅธ ์˜คํ”„์…‹์œผ๋กœ์˜ XOR์ด ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ ํ†ต์‹  ํŒจํ„ด์„ ๋งŒ๋“œ๋‚˜์š”?
  • ๊ฐ™์€ ๋‹จ๊ณ„์—์„œ ๋ ˆ์ธ๋“ค์ด ์™œ ์„œ๋กœ ๊ฐ„์„ญํ•˜์ง€ ์•Š๋‚˜์š”?
  • XOR์ด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์— ํŠนํžˆ ์ ํ•ฉํ•œ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

3. ์ตœ๋Œ“๊ฐ’ ๋ˆ„์ 

๊ฐ ๋ ˆ์ธ์€ ์ž์‹ ์˜ โ€œ์˜์—ญโ€œ์—์„œ ์ตœ๋Œ“๊ฐ’์˜ ์ง€์‹์„ ์ ์ง„์ ์œผ๋กœ ์Œ“์•„๊ฐ€์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ:

  • ์ž์‹ ์˜ ๊ฐ’์œผ๋กœ ์‹œ์ž‘
  • ๊ฐ ๋‹จ๊ณ„์—์„œ ์ด์›ƒ์˜ ๊ฐ’๊ณผ ๋น„๊ต
  • ์ตœ๋Œ“๊ฐ’์„ ์œ ์ง€ํ•˜๊ณ  ๊ณ„์† ์ง„ํ–‰

핵심 통찰: 각 단계 후, “지식의 영역”이 두 배로 확장됩니다.

  • ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„ ํ›„: ๊ฐ ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ์•Œ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

4. ์ด ํŒจํ„ด์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ 

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ \(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„ ํ›„์— ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  • ๋ชจ๋“  ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์˜ ๊ฐ’์„ ๊ฐ„์ ‘์ ์œผ๋กœ ํ™•์ธ
  • ์ค‘๋ณต ํ†ต์‹  ์—†์Œ: ๊ฐ ํŽ˜์–ด๊ฐ€ ๋‹จ๊ณ„๋‹น ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ๊ตํ™˜
  • ์ตœ์  ๋ณต์žก๋„: \(O(n)\) ์ˆœ์ฐจ ๋น„๊ต ๋Œ€์‹  \(O(\log n)\) ๋‹จ๊ณ„

์ถ”์  ์˜ˆ์ œ (4๊ฐœ ๋ ˆ์ธ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ: Lane 0=3, Lane 1=1, Lane 2=7, Lane 3=2

1๋‹จ๊ณ„ (offset=2): 0 โ†” 2, 1 โ†” 3
  Lane 0: max(3, 7) = 7
  Lane 1: max(1, 2) = 2
  Lane 2: max(7, 3) = 7
  Lane 3: max(2, 1) = 2

2๋‹จ๊ณ„ (offset=1): 0 โ†” 1, 2 โ†” 3
  Lane 0: max(7, 2) = 7
  Lane 1: max(2, 7) = 7
  Lane 2: max(7, 2) = 7
  Lane 3: max(2, 7) = 7

๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’ = 7์„ ๊ฐ€์ง
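위 추적은 파이썬으로 단계별로 재현해 검증할 수 있습니다 (레인 i가 레인 i ^ offset의 값을 읽는 것으로 모델링).

```python
vals = [3, 1, 7, 2]

# 1단계 (offset=2): 0↔2, 1↔3 교환
step1 = [max(vals[i], vals[i ^ 2]) for i in range(4)]    # [7, 2, 7, 2]
# 2단계 (offset=1): 0↔1, 2↔3 교환
step2 = [max(step1[i], step1[i ^ 1]) for i in range(4)]  # [7, 7, 7, 7]
```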

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --parallel-max
pixi run -e amd p26 --parallel-max
uv run poe p26 --parallel-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
expected: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
โœ… Butterfly parallel max test passed!

์†”๋ฃจ์…˜

fn butterfly_parallel_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets (WARP_SIZE/2, ..., 2, 1) to perform tree-based reduction.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        max_val = input[global_i]

        # Butterfly reduction tree: dynamic for any WARP_SIZE (32, 64, etc.)
        # Start with half the warp size and reduce by half each step
        offset = WARP_SIZE // 2
        while offset > 0:
            max_val = max(max_val, shuffle_xor(max_val, offset))
            offset //= 2

        # All threads now have the maximum value across the entire warp
        output[global_i] = max_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด \(O(\log n)\) ๋ณต์žก๋„์˜ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ๋ฅผ ์–ด๋–ป๊ฒŒ ์ƒ์„ฑํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    max_val = input[global_i]  # ๋กœ์ปฌ ๊ฐ’์œผ๋กœ ์‹œ์ž‘

    # ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ: ๋ชจ๋“  WARP_SIZE์— ๋™์ ์œผ๋กœ ๋Œ€์‘
    offset = WARP_SIZE // 2
    while offset > 0:
        max_val = max(max_val, shuffle_xor(max_val, offset))
        offset //= 2

    output[global_i] = max_val  # ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์‹คํ–‰ ์ถ”์  (8-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [0,2,4,6,8,10,12,1000]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: max_val = 0,    Lane 1: max_val = 2
  Lane 2: max_val = 4,    Lane 3: max_val = 6
  Lane 4: max_val = 8,    Lane 5: max_val = 10
  Lane 6: max_val = 12,   Lane 7: max_val = 1000

1๋‹จ๊ณ„: shuffle_xor(max_val, 4) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”4: max(0,8)=8,     Lane 1โ†”5: max(2,10)=10
  Lane 2โ†”6: max(4,12)=12,   Lane 3โ†”7: max(6,1000)=1000
  Lane 4โ†”0: max(8,0)=8,     Lane 5โ†”1: max(10,2)=10
  Lane 6โ†”2: max(12,4)=12,   Lane 7โ†”3: max(1000,6)=1000

2๋‹จ๊ณ„: shuffle_xor(max_val, 2) - 1/4 ๊ตํ™˜
  Lane 0โ†”2: max(8,12)=12,   Lane 1โ†”3: max(10,1000)=1000
  Lane 2โ†”0: max(12,8)=12,   Lane 3โ†”1: max(1000,10)=1000
  Lane 4โ†”6: max(8,12)=12,   Lane 5โ†”7: max(10,1000)=1000
  Lane 6โ†”4: max(12,8)=12,   Lane 7โ†”5: max(1000,10)=1000

3๋‹จ๊ณ„: shuffle_xor(max_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max(12,1000)=1000,  Lane 1โ†”0: max(1000,12)=1000
  Lane 2โ†”3: max(12,1000)=1000,  Lane 3โ†”2: max(1000,12)=1000
  Lane 4โ†”5: max(12,1000)=1000,  Lane 5โ†”4: max(1000,12)=1000
  Lane 6โ†”7: max(12,1000)=1000,  Lane 7โ†”6: max(1000,12)=1000

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์˜ max_val = 1000

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์œผ๋กœ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ์ž๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Reduce}(\oplus, [a_0, a_1, \ldots, a_{n-1}]) = a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}\]

์—ฌ๊ธฐ์„œ \(\oplus\)๋Š” max ์—ฐ์‚ฐ์ด๋ฉฐ, ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ์ตœ์  \(O(\log n)\) ๋ณต์žก๋„๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ๋กœ๊ทธ ๋ณต์žก๋„: ์ˆœ์ฐจ ๋ฆฌ๋•์…˜์˜ \(O(n)\)์— ๋น„ํ•ด \(O(\log n)\)
  2. ์™„๋ฒฝํ•œ ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋‹จ๊ณ„์—์„œ ๋™๋“ฑํ•˜๊ฒŒ ์ฐธ์—ฌ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ ์—†์Œ: ์ˆœ์ˆ˜ ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ํ†ต์‹ 
  4. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์— ์ง์ ‘ ๋งคํ•‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ๋‹จ๊ณ„ ์ˆ˜: \(\log_2(\text{WARP_SIZE})\) (์˜ˆ: 32-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 5๋‹จ๊ณ„, 64-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 6๋‹จ๊ณ„)
  • ๋‹จ๊ณ„๋‹น ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜ + ๋น„๊ต)
  • ์ด ์ง€์—ฐ ์‹œ๊ฐ„: ์ˆœ์ฐจ ๋ฐฉ์‹์˜ \((\text{WARP_SIZE}-1)\) ์‚ฌ์ดํด ๋Œ€๋น„ \(\log_2(\text{WARP_SIZE})\) ์‚ฌ์ดํด
  • ๋ณ‘๋ ฌ์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „์ฒด์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์ด ํ™œ์„ฑ ์ƒํƒœ

3. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE_2 = 64 (๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: BLOCKS_PER_GRID_2 = (2, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

์ง์ˆ˜ ๋ ˆ์ธ์€ ์ตœ๋Œ“๊ฐ’์„, ํ™€์ˆ˜ ๋ ˆ์ธ์€ ์ตœ์†Ÿ๊ฐ’์„ ์ €์žฅํ•˜๋Š” ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’ ๋ชจ๋‘์— ๋Œ€ํ•ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \begin{cases} \max_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 0 \\ \min_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 1 \end{cases}\]

์ด์ค‘ ๋ฆฌ๋•์…˜ ํŒจํ„ด: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•œ ํ›„, ๋ ˆ์ธ ID ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ๋ณต์žกํ•œ ๋‹ค์ค‘ ๊ฐ’ ๋ฆฌ๋•์…˜์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = Layout.row_major(SIZE_2)


fn butterfly_conditional_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = lane_id()

    if global_i < size:
        current_val = input[global_i]
        min_val = current_val

        # FILL ME IN (roughly 11 lines)


ํŒ

1. ์ด์ค‘ ์ถ”์  ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ํผ์ฆ์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ฆฌ๋•์…˜ ๊ณผ์ •์—์„œ ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋™์‹œ์— ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ์–ด๋–ค ๋ณ€์ˆ˜๋ฅผ ์ถ”์ ํ•ด์•ผ ํ•˜๋‚˜์š”?

2. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ ๋กœ์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์™„๋ฃŒํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ๋ ˆ์ธ์ด ์ง์ˆ˜์ธ์ง€ ํ™€์ˆ˜์ธ์ง€ ์–ด๋–ป๊ฒŒ ํŒ๋ณ„ํ•˜๋‚˜์š”?
  • ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ๋Œ“๊ฐ’์„, ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ์†Ÿ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ๋ ˆ์ธ ID์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•˜๋‚˜์š”?

3. min๊ณผ max ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ๊ณผ์ œ์˜ ํ•ต์‹ฌ์€ ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ min๊ณผ max๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ณ‘๋ ฌ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • min๊ณผ max์— ๋ณ„๋„์˜ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•œ๊ฐ€์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ์ด์›ƒ ๊ฐ’์„ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์™„๋ฃŒ๋˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?

4. ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ณ ๋ ค์‚ฌํ•ญ

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฆฌ๋•์…˜ ๋ฒ”์œ„์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ค‘์š”ํ•œ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๊ฐ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์˜ ๋ฒ”์œ„๋Š” ์–ด๋””๊นŒ์ง€์ธ๊ฐ€์š”?
  • ๋ธ”๋ก ๊ตฌ์กฐ๊ฐ€ ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
  • ์ „์—ญ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”, ๋ธ”๋ก๋ณ„ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --conditional-max
pixi run -e amd p26 --conditional-max
uv run poe p26 --conditional-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE_2:  64
output: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
expected: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
โœ… Butterfly conditional max test passed!

์†”๋ฃจ์…˜

fn butterfly_conditional_max[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    lane = lane_id()

    if global_i < size:
        current_val = input[global_i]
        min_val = current_val

        # Butterfly reduction for both maximum and minimum: dynamic for any WARP_SIZE
        offset = WARP_SIZE // 2
        while offset > 0:
            neighbor_val = shuffle_xor(current_val, offset)
            current_val = max(current_val, neighbor_val)

            min_neighbor_val = shuffle_xor(min_val, offset)
            min_val = min(min_val, min_neighbor_val)

            offset //= 2

        # Conditional output: max for even lanes, min for odd lanes
        if lane % 2 == 0:
            output[global_i] = current_val  # Maximum
        else:
            output[global_i] = min_val  # Minimum


์ด ํ’€์ด๋Š” ์ด์ค‘ ์ถ”์ ๊ณผ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ณ ๊ธ‰ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]
    min_val = current_val  # ์ตœ์†Ÿ๊ฐ’์„ ๋ณ„๋„๋กœ ์ถ”์ 

    # max์™€ min ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ (log_2(WARP_SIZE) ๋‹จ๊ณ„)
    offset = WARP_SIZE // 2
    while offset > 0:
        neighbor_val = shuffle_xor(current_val, offset)
        current_val = max(current_val, neighbor_val)    # Max ๋ฆฌ๋•์…˜

        min_neighbor_val = shuffle_xor(min_val, offset)
        min_val = min(min_val, min_neighbor_val)        # Min ๋ฆฌ๋•์…˜

        offset //= 2

    # ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    if lane % 2 == 0:
        output[global_i] = current_val  # ์ง์ˆ˜ ๋ ˆ์ธ: ์ตœ๋Œ“๊ฐ’
    else:
        output[global_i] = min_val      # ํ™€์ˆ˜ ๋ ˆ์ธ: ์ตœ์†Ÿ๊ฐ’

์ด์ค‘ ๋ฆฌ๋•์…˜ ์‹คํ–‰ ์ถ”์  (4-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: current_val=3, min_val=3
  Lane 1: current_val=1, min_val=1
  Lane 2: current_val=7, min_val=7
  Lane 3: current_val=2, min_val=2

1๋‹จ๊ณ„: shuffle_xor(current_val, 2)์™€ shuffle_xor(min_val, 2) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”2: max_neighbor=7, min_neighbor=7 โ†’ current_val=max(3,7)=7, min_val=min(3,7)=3
  Lane 1โ†”3: max_neighbor=2, min_neighbor=2 โ†’ current_val=max(1,2)=2, min_val=min(1,2)=1
  Lane 2โ†”0: max_neighbor=3, min_neighbor=3 โ†’ current_val=max(7,3)=7, min_val=min(7,3)=3
  Lane 3โ†”1: max_neighbor=1, min_neighbor=1 โ†’ current_val=max(2,1)=2, min_val=min(2,1)=1

2๋‹จ๊ณ„: shuffle_xor(current_val, 1)์™€ shuffle_xor(min_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 1โ†”0: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1
  Lane 2โ†”3: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 3โ†”2: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด current_val=7 (์ „์—ญ max)๊ณผ min_val=1 (์ „์—ญ min)์„ ๊ฐ€์ง
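이 이중 추적 역시 파이썬으로 단계별로 재현해 볼 수 있습니다.

```python
cur = [3, 1, 7, 2]  # current_val (max 추적)
mn = [3, 1, 7, 2]   # min_val (min 추적)

# 1단계 (offset=2)
cur1 = [max(cur[i], cur[i ^ 2]) for i in range(4)]  # [7, 2, 7, 2]
mn1 = [min(mn[i], mn[i ^ 2]) for i in range(4)]     # [3, 1, 3, 1]

# 2단계 (offset=1)
cur2 = [max(cur1[i], cur1[i ^ 1]) for i in range(4)]  # 모든 레인 7
mn2 = [min(mn1[i], mn1[i ^ 1]) for i in range(4)]     # 모든 레인 1
```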

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๋ชจ๋“  WARP_SIZE์—์„œ ๋™์ž‘):

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = max(current_val, neighbor_val)

    min_neighbor_val = shuffle_xor(min_val, offset)
    min_val = min(min_val, min_neighbor_val)

    offset //= 2

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์กฐ๊ฑด๋ถ€ ๋””๋ฉ€ํ‹ฐํ”Œ๋ ‰์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์ค‘ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \begin{align} \text{max_result} &= \max_{i=0}^{n-1} \text{input}[i] \\ \text{min_result} &= \min_{i=0}^{n-1} \text{input}[i] \\ \text{output}[i] &= \text{lane_parity}(i) \; \text{?} \; \text{min_result} : \text{max_result} \end{align}\]

์ด์ค‘ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ๋…๋ฆฝ์  ๋ฆฌ๋•์…˜: Max์™€ min ๋ฆฌ๋•์…˜์€ ์ˆ˜ํ•™์ ์œผ๋กœ ๋…๋ฆฝ
  2. ๋ณ‘๋ ฌ ์‹คํ–‰: ๋‘˜ ๋‹ค ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  3. ํ†ต์‹  ๊ณต์œ : ๊ฐ™์€ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘์— ํ™œ์šฉ
  4. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ: ๋ ˆ์ธ ํ™€์ง์ด ์–ด๋–ค ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ• ์ง€ ๊ฒฐ์ •

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ํ†ต์‹  ๋‹จ๊ณ„: \(\log_2(\text{WARP_SIZE})\) (๋‹จ์ผ ๋ฆฌ๋•์…˜๊ณผ ๋™์ผ)
  • ๋‹จ๊ณ„๋‹น ์—ฐ์‚ฐ: ๋‹จ์ผ ๋ฆฌ๋•์…˜์˜ 1๊ฐœ ๋Œ€๋น„ 2๊ฐœ ์—ฐ์‚ฐ (max + min)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ ๋Œ€๋น„ ์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ 2๊ฐœ
  • ์ถœ๋ ฅ ์œ ์—ฐ์„ฑ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ฆฌ๋•์…˜ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๊ฐ€๋Šฅ

์š”์•ฝ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ†ตํ•ด ๋‹ค์Œ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด

  1. ํŽ˜์–ด ๊ตํ™˜ (shuffle_xor(value, 1)):

    • ์™„๋ฒฝํ•œ ์ธ์ ‘ ํŽ˜์–ด ์ƒ์„ฑ: (0,1), (2,3), (4,5), โ€ฆ
    • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ์˜ \(O(1)\) ๋ณต์žก๋„
    • ์ •๋ ฌ ๋„คํŠธ์›Œํฌ์™€ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜์˜ ๊ธฐ๋ฐ˜
  2. ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ (๋™์  offset: WARP_SIZE/2 โ†’ 1):

    • ๋กœ๊ทธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ์ˆœ์ฐจ์˜ \(O(n)\) ๋Œ€๋น„ \(O(\log n)\)
    • ๋ชจ๋“  ๊ฒฐํ•ฉ ์—ฐ์‚ฐ์— ์ ์šฉ ๊ฐ€๋Šฅ (max, min, sum ๋“ฑ)
    • ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์— ๊ฑธ์ณ ์ตœ์ ์˜ ๋ถ€ํ•˜ ๋ถ„์‚ฐ
  3. ์กฐ๊ฑด๋ถ€ ๋‹ค์ค‘ ๋ฆฌ๋•์…˜ (์ด์ค‘ ์ถ”์  + ๋ ˆ์ธ ํ™€์ง):

    • ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋™์‹œ์— ๋ณ‘๋ ฌ ์ˆ˜ํ–‰
    • ์Šค๋ ˆ๋“œ ํŠน์„ฑ์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    • ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†๋Š” ๊ณ ๊ธ‰ ์กฐ์ •

ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ†ต์ฐฐ

XOR ํ†ต์‹  ํŠน์„ฑ:

  • shuffle_xor(value, mask)๊ฐ€ ๋Œ€์นญ์ ์ด๊ณ  ๊ฒน์น˜์ง€ ์•Š๋Š” ํŽ˜์–ด๋ฅผ ์ƒ์„ฑ
  • ๊ฐ ๋งˆ์Šคํฌ๊ฐ€ ๊ณ ์œ ํ•œ ํ†ต์‹  ํ† ํด๋กœ์ง€๋ฅผ ์ƒ์„ฑ
  • ์ด์ง„ XOR ํŒจํ„ด์—์„œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๊ฐ€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋„์ถœ

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„:

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = operation(current_val, neighbor_val)
    offset //= 2
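이 일반 패턴은 결합적이고 교환 가능한 어떤 연산에도 적용됩니다. 파이썬 스케치로 일반화하면 다음과 같습니다.

```python
def butterfly_reduce_sim(values, operation):
    """임의의 결합·교환 연산(max, min, 덧셈 등)에 대한 버터플라이 리덕션 스케치."""
    vals = list(values)
    n = len(vals)  # 2의 거듭제곱 가정
    offset = n // 2
    while offset > 0:
        vals = [operation(vals[i], vals[i ^ offset]) for i in range(n)]
        offset //= 2
    return vals  # 모든 레인이 전체 리덕션 결과를 가짐
```

butterfly_reduce_sim([1, 2, 3, 4], lambda a, b: a + b)는 [10, 10, 10, 10]을 돌려줍니다 - 각 단계에서 레인이 기여받는 원소 집합이 겹침 없이 두 배씩 커지기 때문입니다.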

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ณต์žก๋„: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ \(O(\log n)\)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”

์‹ค์šฉ์  ํ™œ์šฉ

์ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ํ•ฉ๊ณ„, max, min, ๋…ผ๋ฆฌ ์—ฐ์‚ฐ
  • ๋ˆ„์  ํ•ฉ/์Šค์บ” ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ๋ณ‘๋ ฌ ์ •๋ ฌ
  • FFT ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ํ•ฉ์„ฑ๊ณฑ
  • Bitonic ์ •๋ ฌ: ๋ณ‘๋ ฌ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ
  • ๊ทธ๋ž˜ํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ํŠธ๋ฆฌ ์ˆœํšŒ์™€ ์—ฐ๊ฒฐ์„ฑ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์กฐ์ •์„ ์šฐ์•„ํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ํšจ์œจ์ ์œผ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.

warp.prefix_sum() ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ”

์›Œํ”„ ๋ ˆ๋ฒจ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์—์„œ๋Š” prefix_sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ์ˆ˜์‹ญ ์ค„์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ ๋™๊ธฐํ™” ์ฝ”๋“œ๊ฐ€ ํ•„์š”ํ–ˆ์„ ํšจ์œจ์ ์ธ ๋ˆ„์  ๊ณ„์‚ฐ, ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹, ๊ณ ๊ธ‰ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: prefix_sum() ์—ฐ์‚ฐ์€ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”์„ ํ™œ์šฉํ•˜์—ฌ ์›Œํ”„ ๋ ˆ์ธ์— ๊ฑธ์ณ \(O(\log n)\) ๋ณต์žก๋„๋กœ ๋ˆ„์  ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

๋ณ‘๋ ฌ ์Šค์บ”์ด๋ž€? ๋ณ‘๋ ฌ ์Šค์บ” (๋ˆ„์  ํ•ฉ)์€ ๋ฐ์ดํ„ฐ ์š”์†Œ์— ๊ฑธ์ณ ๋ˆ„์  ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ๋ณธ์ ์ธ ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ์ž…๋‹ˆ๋‹ค. ๋ง์…ˆ์˜ ๊ฒฝ์šฐ [a, b, c, d]๋ฅผ [a, a+b, a+b+c, a+b+c+d]๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์—ฐ์‚ฐ์€ ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, quicksort ํŒŒํ‹ฐ์…”๋‹, ๋ณ‘๋ ฌ ์ •๋ ฌ ๊ฐ™์€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • prefix_sum()์„ ํ™œ์šฉํ•œ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ”
  • ํฌํ•จ(inclusive) vs ๋น„ํฌํ•จ(exclusive) ๋ˆ„์  ํ•ฉ ํŒจํ„ด
  • ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜๋ฅผ ์œ„ํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜
  • ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ๋‹จ์ผ ์›Œํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ตœ์ ํ™”

์ด๋ฅผ ํ†ตํ•ด ๋‹ค๋‹จ๊ณ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ์›Œํ”„ ํฌํ•จ ๋ˆ„์  ํ•ฉ

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)

prefix_sum์˜ ์ด์ 

๊ธฐ์กด ๋ˆ„์  ํ•ฉ์€ ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. Puzzle 14: ๋ˆ„์  ํ•ฉ์—์„œ๋Š” ๋ช…์‹œ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋กœ ์ด๋ฅผ ํž˜๋“ค๊ฒŒ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค:

fn prefix_sum_simple[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: UInt,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    offset = UInt(1)
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        var current_val: output.element_type = 0
        if local_i >= offset and local_i < size:
            current_val = shared[local_i - offset]  # read

        barrier()
        if local_i >= offset and local_i < size:
            shared[local_i] += current_val

        barrier()
        offset *= 2

    if global_i < size:
        output[global_i] = shared[local_i]
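์œ„ ์ปค๋„์˜ ๋กœ๊ทธ ๋‹จ๊ณ„ ๋ฃจํ”„(Hillis-Steele ๋ฐฉ์‹ ์Šค์บ”)๋Š” CPU์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‹œ๋ฎฌ๋ ˆ์ด์…˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. TPB๋ฅผ 8๋กœ ๊ฐ€์ •ํ•œ Python ์Šค์ผ€์น˜์ด๋ฉฐ, barrier() ์‚ฌ์ด์˜ ํ•œ ๋‹จ๊ณ„๋ฅผ "๋ชจ๋‘ ์ฝ๊ณ  ๋‚œ ๋’ค ๋ชจ๋‘ ์“ด๋‹ค"๋กœ ํ‘œํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค.

```python
# prefix_sum_simple์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฃจํ”„๋ฅผ CPU์—์„œ ์žฌํ˜„ (๊ฐ€์ •: TPB=8, ์ž…๋ ฅ [1..8])
TPB = 8
shared = [float(i + 1) for i in range(TPB)]

offset = 1
while offset < TPB:  # log2(TPB)๋ฒˆ ๋ฐ˜๋ณต
    # 1๋‹จ๊ณ„(read): ๋ชจ๋“  "์Šค๋ ˆ๋“œ"๊ฐ€ offset๋งŒํผ ์™ผ์ชฝ ๊ฐ’์„ ๋จผ์ € ์ฝ์Œ
    reads = [shared[i - offset] if i >= offset else 0.0 for i in range(TPB)]
    # 2๋‹จ๊ณ„(write): barrier ์ดํ›„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ
    shared = [shared[i] + reads[i] for i in range(TPB)]
    offset *= 2

print(shared)  # [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0]
```

read์™€ write๋ฅผ ๋‘ ๋‹จ๊ณ„๋กœ ๋‚˜๋ˆˆ ๊ฒƒ์ด ์ปค๋„์˜ ๋‘ barrier()์— ๋Œ€์‘ํ•˜๋ฉฐ, ์ด ๋ถ„๋ฆฌ๊ฐ€ ์—†์œผ๋ฉด ์ฝ๊ธฐ-์“ฐ๊ธฐ ๊ฒฝ์Ÿ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.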


๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋‹ค์ค‘ ๋ฐฐ๋ฆฌ์–ด: ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๋™๊ธฐํ™”
  • ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์ŠคํŠธ๋ผ์ด๋“œ ๊ณ„์‚ฐ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ๋‚ฎ์€ ํ™•์žฅ์„ฑ: ๊ฐ ๋‹จ๊ณ„ ์‚ฌ์ด์— ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•œ \(O(\log n)\) ๋‹จ๊ณ„

prefix_sum()์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ณ‘๋ ฌ ์Šค์บ”์ด ๊ฐ„๋‹จํ•ด์ง‘๋‹ˆ๋‹ค:

# ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ฐฉ์‹ - ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ!
current_val = input[global_i]
scan_result = prefix_sum[exclusive=False](current_val)
output[global_i] = scan_result

prefix_sum์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: ๋‹จ์ผ ์•„ํ† ๋ฏน ์—ฐ์‚ฐ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์ „์šฉ ์Šค์บ” ์œ ๋‹› ํ™œ์šฉ
  • ์™„๋ฒฝํ•œ ํ™•์žฅ์„ฑ: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘

์™„์„ฑํ•  ์ฝ”๋“œ

ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” prefix_sum() ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํฌํ•จ ๋ˆ„์  ํ•ฉ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์œ„์น˜๊นŒ์ง€ ๋ชจ๋“  ์š”์†Œ์˜ ํ•ฉ์„ ํฌํ•จํ•˜๋Š” ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \sum_{j=0}^{i} \text{input}[j]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [1, 2, 3, 4, 5, ...]๋ฅผ ๋ˆ„์  ํ•ฉ [1, 3, 6, 10, 15, ...]์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๊ฐ ์œ„์น˜์— ์ด์ „ ๋ชจ๋“  ์š”์†Œ์™€ ์ž๊ธฐ ์ž์‹ ์˜ ํ•ฉ์ด ๋‹ด๊น๋‹ˆ๋‹ค.

fn warp_inclusive_prefix_sum[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Inclusive prefix sum using warp primitive:
    Each thread gets sum of all elements up to and including its position.
    Compare this to Puzzle 12's complex shared memory + barrier approach.

    Puzzle 12 approach:
    - Shared memory allocation
    - Multiple barrier synchronizations
    - Log(n) iterations with manual tree reduction
    - Complex multi-phase algorithm

    Warp prefix_sum approach:
    - Single function call!
    - Hardware-optimized parallel scan
    - Automatic synchronization
    - O(log n) complexity, but implemented in hardware.

    NOTE: This implementation only works correctly within a single warp (WARP_SIZE threads).
    For multi-warp scenarios, additional coordination would be needed.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # FILL ME IN (roughly 4 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p26/p26.mojo

ํŒ

1. prefix_sum ๋งค๊ฐœ๋ณ€์ˆ˜ ์ดํ•ดํ•˜๊ธฐ

prefix_sum() ํ•จ์ˆ˜์—๋Š” ์Šค์บ” ์œ ํ˜•์„ ์ œ์–ดํ•˜๋Š” ์ค‘์š”ํ•œ ํ…œํ”Œ๋ฆฟ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ํฌํ•จ ๋ˆ„์  ํ•ฉ๊ณผ ๋น„ํฌํ•จ ๋ˆ„์  ํ•ฉ์˜ ์ฐจ์ด๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  • ์–ด๋–ค ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์ด ๋™์ž‘์„ ์ œ์–ดํ•˜๋‚˜์š”?
  • ํฌํ•จ ์Šค์บ”์—์„œ ๊ฐ ๋ ˆ์ธ์€ ๋ฌด์—‡์„ ์ถœ๋ ฅํ•ด์•ผ ํ•˜๋‚˜์š”?

ํžŒํŠธ: ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ๋ณด๊ณ  ๋ˆ„์  ์—ฐ์‚ฐ์—์„œ โ€œํฌํ•จ(inclusive)โ€œ์ด ๋ฌด์—‡์„ ์˜๋ฏธํ•˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

2. ๋‹จ์ผ ์›Œํ”„ ์ œํ•œ

์ด ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ œํ•œ์˜ ์˜๋ฏธ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ์—ฌ๋Ÿฌ ์›Œํ”„๊ฐ€ ์žˆ์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ์ด ์ œํ•œ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์™œ ์ค‘์š”ํ•œ๊ฐ€์š”?
  • ๋ฉ€ํ‹ฐ ์›Œํ”„ ์‹œ๋‚˜๋ฆฌ์˜ค๋กœ ํ™•์žฅํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?

3. ๋ฐ์ดํ„ฐ ํƒ€์ž… ๊ณ ๋ ค์‚ฌํ•ญ

prefix_sum ํ•จ์ˆ˜๋Š” ์ตœ์  ์„ฑ๋Šฅ์„ ์œ„ํ•ด ํŠน์ • ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์š”๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ์ž…๋ ฅ์ด ์–ด๋–ค ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์‚ฌ์šฉํ•˜๋‚˜์š”?
  • prefix_sum์ด ํŠน์ • ์Šค์นผ๋ผ ํƒ€์ž…์„ ๊ธฐ๋Œ€ํ•˜๋‚˜์š”?
  • ํ•„์š”ํ•œ ๊ฒฝ์šฐ ํƒ€์ž… ๋ณ€ํ™˜์„ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋‚˜์š”?

์›Œํ”„ ํฌํ•จ ๋ˆ„์  ํ•ฉ ํ…Œ์ŠคํŠธ:

pixi run p26 --prefix-sum
pixi run -e amd p26 --prefix-sum
pixi run -e apple p26 --prefix-sum
uv run poe p26 --prefix-sum

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0, 120.0, 136.0, 153.0, 171.0, 190.0, 210.0, 231.0, 253.0, 276.0, 300.0, 325.0, 351.0, 378.0, 406.0, 435.0, 465.0, 496.0, 528.0]
expected: [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0, 120.0, 136.0, 153.0, 171.0, 190.0, 210.0, 231.0, 253.0, 276.0, 300.0, 325.0, 351.0, 378.0, 406.0, 435.0, 465.0, 496.0, 528.0]
โœ… Warp inclusive prefix sum test passed!

์†”๋ฃจ์…˜

fn warp_inclusive_prefix_sum[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    """
    Inclusive prefix sum using warp primitive: Each thread gets sum of all elements up to and including its position.
    Compare this to Puzzle 12's complex shared memory + barrier approach.

    Puzzle 12 approach:
    - Shared memory allocation
    - Multiple barrier synchronizations
    - Log(n) iterations with manual tree reduction
    - Complex multi-phase algorithm

    Warp prefix_sum approach:
    - Single function call!
    - Hardware-optimized parallel scan
    - Automatic synchronization
    - O(log n) complexity, but implemented in hardware.

    NOTE: This implementation only works correctly within a single warp (WARP_SIZE threads).
    For multi-warp scenarios, additional coordination would be needed.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        current_val = input[global_i]

        # This one call replaces ~30 lines of complex shared memory logic from Puzzle 12!
        # But it only works within the current warp (WARP_SIZE threads)
        scan_result = prefix_sum[exclusive=False](
            rebind[Scalar[dtype]](current_val)
        )

        output[global_i] = scan_result


์ด ์†”๋ฃจ์…˜์€ prefix_sum()์ด ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ์–ด๋–ป๊ฒŒ ๋Œ€์ฒดํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]

    # ์ด ํ•œ ์ค„์ด Puzzle 14์˜ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ์ง ~30์ค„์„ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค!
    # ๋‹จ, ํ˜„์žฌ ์›Œํ”„ (WARP_SIZE ์Šค๋ ˆ๋“œ) ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค
    scan_result = prefix_sum[exclusive=False](
        rebind[Scalar[dtype]](current_val)
    )

    output[global_i] = scan_result

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์ž…๋ ฅ: [1, 2, 3, 4, 5, 6, 7, 8, ...]

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  Lane 0: current_val = 1
  Lane 1: current_val = 2
  Lane 2: current_val = 3
  Lane 3: current_val = 4
  ...
  Lane 31: current_val = 32

์‚ฌ์ดํด 2: prefix_sum[exclusive=False] ์‹คํ–‰ (ํ•˜๋“œ์›จ์–ด ๊ฐ€์†)
  Lane 0: scan_result = 1 (์š”์†Œ 0~0์˜ ํ•ฉ)
  Lane 1: scan_result = 3 (์š”์†Œ 0~1์˜ ํ•ฉ: 1+2)
  Lane 2: scan_result = 6 (์š”์†Œ 0~2์˜ ํ•ฉ: 1+2+3)
  Lane 3: scan_result = 10 (์š”์†Œ 0~3์˜ ํ•ฉ: 1+2+3+4)
  ...
  Lane 31: scan_result = 528 (์š”์†Œ 0~31์˜ ํ•ฉ)

์‚ฌ์ดํด 3: ๊ฒฐ๊ณผ ์ €์žฅ
  Lane 0: output[0] = 1
  Lane 1: output[1] = 3
  Lane 2: output[2] = 6
  Lane 3: output[3] = 10
  ...

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ํฌํ•จ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \sum_{j=0}^{i} \text{input}[j]\]

Puzzle 14 ๋ฐฉ์‹๊ณผ์˜ ๋น„๊ต:

  • Puzzle 14: ๋ˆ„์  ํ•ฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ~30์ค„ + ๋‹ค์ค‘ ๋ฐฐ๋ฆฌ์–ด + ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ
  • ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์˜ ํ•จ์ˆ˜ ํ˜ธ์ถœ 1๊ฐœ
  • ์„ฑ๋Šฅ: ๊ฐ™์€ \(O(\log n)\) ๋ณต์žก๋„์ด์ง€๋งŒ, ์ „์šฉ ํ•˜๋“œ์›จ์–ด์—์„œ ๊ตฌํ˜„
  • ๋ฉ”๋ชจ๋ฆฌ: ๋ช…์‹œ์  ํ• ๋‹น ๋Œ€๋น„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ œ๋กœ

Puzzle 12์—์„œ์˜ ๋ฐœ์ „: ํ˜„๋Œ€ GPU ์•„ํ‚คํ…์ฒ˜์˜ ๊ฐ•๋ ฅํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - Puzzle 12์—์„œ ์‹ ์ค‘ํ•œ ์ˆ˜๋™ ๊ตฌํ˜„์ด ํ•„์š”ํ–ˆ๋˜ ๊ฒƒ์ด ์ด์ œ๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ ํ•˜๋‚˜๋กœ ํ•ด๊ฒฐ๋ฉ๋‹ˆ๋‹ค. ์›Œํ”„ ๋ ˆ๋ฒจ prefix_sum()์€ ๊ตฌํ˜„ ๋ณต์žก๋„ ์ œ๋กœ๋กœ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์  ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

prefix_sum์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ํ•˜๋“œ์›จ์–ด ๊ฐ€์†: ํ˜„๋Œ€ GPU์˜ ์ „์šฉ ์Šค์บ” ์œ ๋‹›
  2. ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  3. ์ž๋™ ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”
  4. ์™„๋ฒฝํ•œ ํ™•์žฅ์„ฑ: ๋ชจ๋“  WARP_SIZE์—์„œ ์ตœ์ ์œผ๋กœ ๋™์ž‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: ~1-2 ์‚ฌ์ดํด (ํ•˜๋“œ์›จ์–ด ์Šค์บ” ์œ ๋‹›)
  • ๋Œ€์—ญํญ: ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ œ๋กœ (๋ ˆ์ง€์Šคํ„ฐ ์ „์šฉ ์—ฐ์‚ฐ)
  • ๋ณ‘๋ ฌ์„ฑ: WARP_SIZE๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ์ฐธ์—ฌ
  • ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ๋™๋ฐ˜ํ•œ \(O(\log n)\) ๋ณต์žก๋„

์ค‘์š”ํ•œ ์ œํ•œ์‚ฌํ•ญ: ์ด ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ๋ฉ€ํ‹ฐ ์›Œํ”„ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” ์›Œํ”„ ๊ฐ„ ์ถ”๊ฐ€ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

2. ์›Œํ”„ ํŒŒํ‹ฐ์…˜

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_xor๊ณผ prefix_sum ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜์—ฌ ๋‹จ์ผ ์›Œํ”„ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ํ”ผ๋ฒ— ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์š”์†Œ๋ฅผ ๋ถ„ํ• ํ•˜์—ฌ, < pivot์ธ ์š”์†Œ๋Š” ์™ผ์ชฝ์—, >= pivot์ธ ์š”์†Œ๋Š” ์˜ค๋ฅธ์ชฝ์— ๋ฐฐ์น˜ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output} = [\text{elements} < \text{pivot}] \,|\, [\text{elements} \geq \text{pivot}]\]

๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‘ ๊ฐ€์ง€ ์ •๊ตํ•œ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

  1. shuffle_xor(): ์™ผ์ชฝ ์š”์†Œ ๊ฐœ์ˆ˜๋ฅผ ์„ธ๊ธฐ ์œ„ํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
  2. prefix_sum(): ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด ์œ„์น˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋น„ํฌํ•จ ์Šค์บ”

์ด๋Š” ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๋Š” ๊ฐ•๋ ฅํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

fn warp_partition[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    pivot: Float32,
):
    """
    Single-warp parallel partitioning using BOTH shuffle_xor AND prefix_sum.
    This implements a warp-level quicksort partition step that places elements < pivot
    on the left and elements >= pivot on the right.

    ALGORITHM COMPLEXITY - combines two advanced warp primitives:
    1. shuffle_xor(): Butterfly pattern for warp-level reductions
    2. prefix_sum(): Warp-level exclusive scan for position calculation.

    This demonstrates the power of warp primitives for sophisticated parallel algorithms
    within a single warp (works for any WARP_SIZE: 32, 64, etc.).

    Example with pivot=5:
    Input:  [3, 7, 1, 8, 2, 9, 4, 6]
    Result: [3, 1, 2, 4, 7, 8, 9, 6] (< pivot | >= pivot).
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        current_val = input[global_i]

        # FILL ME IN (roughly 13 lines)


ํŒ

1. ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์—ฌ๋Ÿฌ ์กฐ์ •๋œ ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํŒŒํ‹ฐ์…”๋‹์— ํ•„์š”ํ•œ ๋…ผ๋ฆฌ์  ๋‹จ๊ณ„๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

๊ณ ๋ คํ•  ํ•ต์‹ฌ ๋‹จ๊ณ„:

  • ์–ด๋–ค ์š”์†Œ๊ฐ€ ์–ด๋А ํŒŒํ‹ฐ์…˜์— ์†ํ•˜๋Š”์ง€ ์–ด๋–ป๊ฒŒ ์‹๋ณ„ํ•˜๋‚˜์š”?
  • ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด์—์„œ ์œ„์น˜๋ฅผ ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋‚˜์š”?
  • ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜์˜ ์ „์ฒด ํฌ๊ธฐ๋ฅผ ์–ด๋–ป๊ฒŒ ์•Œ ์ˆ˜ ์žˆ๋‚˜์š”?
  • ์ตœ์ข… ์œ„์น˜์— ์š”์†Œ๋ฅผ ์–ด๋–ป๊ฒŒ ๊ธฐ๋กํ•˜๋‚˜์š”?

2. ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ

์–ด๋А ํŒŒํ‹ฐ์…˜์— ์†ํ•˜๋Š”์ง€ ํŒ๋ณ„ํ•˜๋Š” ๋ถˆ๋ฆฌ์–ธ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋ฅผ ๋งŒ๋“ค์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • โ€œ์ด ์š”์†Œ๋Š” ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜์— ์†ํ•œ๋‹คโ€œ๋ฅผ ์–ด๋–ป๊ฒŒ ํ‘œํ˜„ํ•˜๋‚˜์š”?
  • โ€œ์ด ์š”์†Œ๋Š” ์˜ค๋ฅธ์ชฝ ํŒŒํ‹ฐ์…˜์— ์†ํ•œ๋‹คโ€œ๋ฅผ ์–ด๋–ป๊ฒŒ ํ‘œํ˜„ํ•˜๋‚˜์š”?
  • prefix_sum์— ์ „๋‹ฌํ•  ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋Š” ์–ด๋–ค ๋ฐ์ดํ„ฐ ํƒ€์ž…์ด์–ด์•ผ ํ•˜๋‚˜์š”?

3. shuffle_xor๊ณผ prefix_sum ๊ฒฐํ•ฉ

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‘ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์„œ๋กœ ๋‹ค๋ฅธ ๋ชฉ์ ์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ์ด ๋งฅ๋ฝ์—์„œ shuffle_xor์€ ๋ฌด์—‡์— ์‚ฌ์šฉ๋˜๋‚˜์š”?
  • ์ด ๋งฅ๋ฝ์—์„œ prefix_sum์€ ๋ฌด์—‡์— ์‚ฌ์šฉ๋˜๋‚˜์š”?
  • ์ด ๋‘ ์—ฐ์‚ฐ์ด ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ๋™์ž‘ํ•˜๋‚˜์š”?

4. ์œ„์น˜ ๊ณ„์‚ฐ

๊ฐ€์žฅ ๊นŒ๋‹ค๋กœ์šด ๋ถ€๋ถ„์€ ๊ฐ ์š”์†Œ๊ฐ€ ์ถœ๋ ฅ์—์„œ ์–ด๋””์— ๊ธฐ๋ก๋˜์–ด์•ผ ํ•˜๋Š”์ง€ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜ ์š”์†Œ: ์ตœ์ข… ์œ„์น˜๋ฅผ ๋ฌด์—‡์ด ๊ฒฐ์ •ํ•˜๋‚˜์š”?
  • ์˜ค๋ฅธ์ชฝ ํŒŒํ‹ฐ์…˜ ์š”์†Œ: ์˜คํ”„์…‹์„ ์–ด๋–ป๊ฒŒ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ ์šฉํ•˜๋‚˜์š”?
  • ๋กœ์ปฌ ์œ„์น˜์™€ ํŒŒํ‹ฐ์…˜ ๊ฒฝ๊ณ„๋ฅผ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•˜๋‚˜์š”?

์›Œํ”„ ํŒŒํ‹ฐ์…˜ ํ…Œ์ŠคํŠธ:

uv run poe p26 --partition
pixi run p26 --partition

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0])
pivot: 5.0
โœ… Warp partition test passed!

์†”๋ฃจ์…˜

fn warp_partition[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    pivot: Float32,
):
    """
    Single-warp parallel partitioning using BOTH shuffle_xor AND prefix_sum.
    This implements a warp-level quicksort partition step that places elements < pivot
    on the left and elements >= pivot on the right.

    ALGORITHM COMPLEXITY - combines two advanced warp primitives:
    1. shuffle_xor(): Butterfly pattern for warp-level reductions
    2. prefix_sum(): Warp-level exclusive scan for position calculation.

    This demonstrates the power of warp primitives for sophisticated parallel algorithms
    within a single warp (works for any WARP_SIZE: 32, 64, etc.).

    Example with pivot=5:
    Input:  [3, 7, 1, 8, 2, 9, 4, 6]
    Result: [3, 1, 2, 4, 7, 8, 9, 6] (< pivot | >= pivot).
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    if global_i < size:
        current_val = input[global_i]

        # Phase 1: Create warp-level predicates
        predicate_left = Float32(1.0) if current_val < pivot else Float32(0.0)
        predicate_right = Float32(1.0) if current_val >= pivot else Float32(0.0)

        # Phase 2: Warp-level prefix sum to get positions within warp
        warp_left_pos = prefix_sum[exclusive=True](predicate_left)
        warp_right_pos = prefix_sum[exclusive=True](predicate_right)

        # Phase 3: Get total left count using shuffle_xor reduction
        warp_left_total = predicate_left

        # Butterfly reduction to get total across the warp: dynamic for any WARP_SIZE
        offset = WARP_SIZE // 2
        while offset > 0:
            warp_left_total += shuffle_xor(warp_left_total, offset)
            offset //= 2

        # Phase 4: Write to output positions
        if current_val < pivot:
            # Left partition: use warp-level position
            output[Int(warp_left_pos)] = current_val
        else:
            # Right partition: offset by total left count + right position
            output[Int(warp_left_total + warp_right_pos)] = current_val


์ด ์†”๋ฃจ์…˜์€ ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ ๊ฐ„์˜ ๊ณ ๊ธ‰ ์กฐ์ •์„ ํ†ตํ•ด ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]

    # 1๋‹จ๊ณ„: ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
    predicate_left = Float32(1.0) if current_val < pivot else Float32(0.0)
    predicate_right = Float32(1.0) if current_val >= pivot else Float32(0.0)

    # 2๋‹จ๊ณ„: ์›Œํ”„ ๋ ˆ๋ฒจ ๋ˆ„์  ํ•ฉ์œผ๋กœ ์›Œํ”„ ๋‚ด ์œ„์น˜ ๊ณ„์‚ฐ
    warp_left_pos = prefix_sum[exclusive=True](predicate_left)
    warp_right_pos = prefix_sum[exclusive=True](predicate_right)

    # 3๋‹จ๊ณ„: shuffle_xor ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์œผ๋กœ ์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜ ๊ตฌํ•˜๊ธฐ
    warp_left_total = predicate_left

    # ์›Œํ”„ ์ „์ฒด์˜ ํ•ฉ์‚ฐ์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜: ๋ชจ๋“  WARP_SIZE์— ๋™์  ๋Œ€์‘
    offset = WARP_SIZE // 2
    while offset > 0:
        warp_left_total += shuffle_xor(warp_left_total, offset)
        offset //= 2

    # 4๋‹จ๊ณ„: ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก
    if current_val < pivot:
        # ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜: ์›Œํ”„ ๋ ˆ๋ฒจ ์œ„์น˜ ์‚ฌ์šฉ
        output[Int(warp_left_pos)] = current_val
    else:
        # ์˜ค๋ฅธ์ชฝ ํŒŒํ‹ฐ์…˜: ์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜ + ์˜ค๋ฅธ์ชฝ ์œ„์น˜๋กœ offset
        output[Int(warp_left_total + warp_right_pos)] = current_val

๋‹ค๋‹จ๊ณ„ ์‹คํ–‰ ์ถ”์  (8-๋ ˆ์ธ ์˜ˆ์ œ, pivot=5, ๊ฐ’ [3,7,1,8,2,9,4,6]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: current_val=3 (< 5)  Lane 1: current_val=7 (>= 5)
  Lane 2: current_val=1 (< 5)  Lane 3: current_val=8 (>= 5)
  Lane 4: current_val=2 (< 5)  Lane 5: current_val=9 (>= 5)
  Lane 6: current_val=4 (< 5)  Lane 7: current_val=6 (>= 5)

1๋‹จ๊ณ„: ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
  Lane 0: predicate_left=1.0, predicate_right=0.0
  Lane 1: predicate_left=0.0, predicate_right=1.0
  Lane 2: predicate_left=1.0, predicate_right=0.0
  Lane 3: predicate_left=0.0, predicate_right=1.0
  Lane 4: predicate_left=1.0, predicate_right=0.0
  Lane 5: predicate_left=0.0, predicate_right=1.0
  Lane 6: predicate_left=1.0, predicate_right=0.0
  Lane 7: predicate_left=0.0, predicate_right=1.0

2๋‹จ๊ณ„: ์œ„์น˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋น„ํฌํ•จ ๋ˆ„์  ํ•ฉ
  warp_left_pos:  [0, 0, 1, 1, 2, 2, 3, 3]
  warp_right_pos: [0, 0, 0, 1, 1, 2, 2, 3]

3๋‹จ๊ณ„: ์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜๋ฅผ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
  ์ดˆ๊ธฐ๊ฐ’: [1, 0, 1, 0, 1, 0, 1, 0]
  ๋ฆฌ๋•์…˜ ํ›„: ๋ชจ๋“  ๋ ˆ์ธ์ด warp_left_total = 4๋ฅผ ๊ฐ€์ง

4๋‹จ๊ณ„: ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก
  Lane 0: current_val=3 < pivot โ†’ output[0] = 3
  Lane 1: current_val=7 >= pivot โ†’ output[4+0] = output[4] = 7
  Lane 2: current_val=1 < pivot โ†’ output[1] = 1
  Lane 3: current_val=8 >= pivot โ†’ output[4+1] = output[5] = 8
  Lane 4: current_val=2 < pivot โ†’ output[2] = 2
  Lane 5: current_val=9 >= pivot โ†’ output[4+2] = output[6] = 9
  Lane 6: current_val=4 < pivot โ†’ output[3] = 4
  Lane 7: current_val=6 >= pivot โ†’ output[4+3] = output[7] = 6

์ตœ์ข… ๊ฒฐ๊ณผ: [3, 1, 2, 4, 7, 8, 9, 6] (< pivot | >= pivot)
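์œ„ ์‹คํ–‰ ์ถ”์ ์€ CPU์—์„œ ๊ทธ๋Œ€๋กœ ์žฌํ˜„ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜ Python ์Šค์ผ€์น˜๋Š” 8๊ฐœ "๋ ˆ์ธ"์„ ๊ฐ€์ •ํ•˜๊ณ , ๋น„ํฌํ•จ prefix_sum๊ณผ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์ˆœ์ฐจ ๊ณ„์‚ฐ์œผ๋กœ ๋Œ€์‹ ํ•œ ๊ฐœ๋… ํ™•์ธ์šฉ ์˜ˆ์ œ์ž…๋‹ˆ๋‹ค (exclusive_scan์€ ์„ค๋ช…์„ ์œ„ํ•ด ๋„์ž…ํ•œ ๋ณด์กฐ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค).

```python
# ์›Œํ”„ ํŒŒํ‹ฐ์…˜์˜ CPU ์‹œ๋ฎฌ๋ ˆ์ด์…˜ (๊ฐ€์ •: 8๊ฐœ ๋ ˆ์ธ, pivot=5)
vals = [3, 7, 1, 8, 2, 9, 4, 6]
pivot = 5

# 1๋‹จ๊ณ„: ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
pred_left = [1 if v < pivot else 0 for v in vals]
pred_right = [1 if v >= pivot else 0 for v in vals]

def exclusive_scan(xs):
    """prefix_sum[exclusive=True]์˜ ์ˆœ์ฐจ ๋ฒ„์ „: ์ž๊ธฐ ์ด์ „ ์š”์†Œ๋“ค์˜ ํ•ฉ."""
    out, total = [], 0
    for x in xs:
        out.append(total)
        total += x
    return out

# 2๋‹จ๊ณ„: ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด ์œ„์น˜
left_pos = exclusive_scan(pred_left)
right_pos = exclusive_scan(pred_right)

# 3๋‹จ๊ณ„: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ๊ณ„์‚ฐํ•˜๋Š” ๊ฐ’ = ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜ ์ด ๊ฐœ์ˆ˜
left_total = sum(pred_left)

# 4๋‹จ๊ณ„: ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก
output = [0] * len(vals)
for i, v in enumerate(vals):
    if v < pivot:
        output[left_pos[i]] = v
    else:
        output[left_total + right_pos[i]] = v

print(output)  # [3, 1, 2, 4, 7, 8, 9, 6]
```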

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์ด์ค‘ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \begin{align} \text{left_pos}[i] &= \text{prefix_sum}_{\text{exclusive}}(\text{predicate_left}[i]) \\ \text{right_pos}[i] &= \text{prefix_sum}_{\text{exclusive}}(\text{predicate_right}[i]) \\ \text{left_total} &= \text{butterfly_reduce}(\text{predicate_left}) \\ \text{final_pos}[i] &= \begin{cases} \text{left_pos}[i] & \text{if } \text{input}[i] < \text{pivot} \\ \text{left_total} + \text{right_pos}[i] & \text{if } \text{input}[i] \geq \text{pivot} \end{cases} \end{align}\]

๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์ ‘๊ทผ ๋ฐฉ์‹์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ: ๊ฐ ์š”์†Œ์˜ ํŒŒํ‹ฐ์…˜ ์†Œ์†์„ ์‹๋ณ„
  2. ๋น„ํฌํ•จ ๋ˆ„์  ํ•ฉ: ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด ์ƒ๋Œ€์  ์œ„์น˜๋ฅผ ๊ณ„์‚ฐ
  3. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜: ํŒŒํ‹ฐ์…˜ ๊ฒฝ๊ณ„ (์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜)๋ฅผ ์‚ฐ์ถœ
  4. ์กฐ์ •๋œ ๊ธฐ๋ก: ๋กœ์ปฌ ์œ„์น˜์™€ ์ „์—ญ ํŒŒํ‹ฐ์…˜ ๊ตฌ์กฐ๋ฅผ ๊ฒฐํ•ฉ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„:

  • 1๋‹จ๊ณ„: \(O(1)\) - ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
  • 2๋‹จ๊ณ„: \(O(\log n)\) - ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ˆ„์  ํ•ฉ
  • 3๋‹จ๊ณ„: \(O(\log n)\) - shuffle_xor์„ ํ™œ์šฉํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
  • 4๋‹จ๊ณ„: \(O(1)\) - ์กฐ์ •๋œ ๊ธฐ๋ก
  • ์ „์ฒด: ์šฐ์ˆ˜ํ•œ ์ƒ์ˆ˜๋ฅผ ๊ฐ€์ง„ \(O(\log n)\)

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ํ†ต์‹  ๋‹จ๊ณ„: \(2 \times \log_2(\text{WARP_SIZE})\) (๋ˆ„์  ํ•ฉ + ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œ๋กœ, ๋ชจ๋‘ ๋ ˆ์ง€์Šคํ„ฐ ๊ธฐ๋ฐ˜
  • ๋ณ‘๋ ฌ์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „์ฒด์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์ด ํ™œ์„ฑ ์ƒํƒœ
  • ํ™•์žฅ์„ฑ: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘
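์œ„ ํ†ต์‹  ๋‹จ๊ณ„ ์ˆ˜๋Š” ๊ฐ„๋‹จํžˆ ๊ณ„์‚ฐํ•ด ๋ณผ ์ˆ�˜ ์žˆ์Šต๋‹ˆ๋‹ค:

```python
# ํ†ต์‹  ๋‹จ๊ณ„ ์ˆ˜ = 2 * log2(WARP_SIZE): ๋ˆ„์  ํ•ฉ + ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
import math

steps = {ws: 2 * int(math.log2(ws)) for ws in (32, 64)}
print(steps)  # {32: 10, 64: 12}
```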

์‹ค์šฉ์  ํ™œ์šฉ: ์ด ํŒจํ„ด์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • Quicksort ํŒŒํ‹ฐ์…”๋‹: ๋ณ‘๋ ฌ ์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ ๋‹จ๊ณ„
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์—์„œ null/๋ฌดํšจ ์š”์†Œ ์ œ๊ฑฐ
  • ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง: ๋ณต์žกํ•œ ํ”„๋ ˆ๋””์ผ€์ดํŠธ์— ๋”ฐ๋ฅธ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ
  • ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ์—ฐ์‚ฐ ์š”๊ตฌ๋Ÿ‰์— ๋”ฐ๋ฅธ ์ž‘์—… ์žฌ๋ถ„๋ฐฐ

์š”์•ฝ

prefix_sum() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ†ตํ•ด ๋‹ค์Œ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๋ˆ„์  ํ•ฉ ํŒจํ„ด

  1. ํฌํ•จ ๋ˆ„์  ํ•ฉ (prefix_sum[exclusive=False]):

    • ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ˆ„์  ์—ฐ์‚ฐ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ”๋“œ ~30์ค„์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
    • ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ๋™๋ฐ˜ํ•œ \(O(\log n)\) ๋ณต์žก๋„
  2. ๊ณ ๊ธ‰ ๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์กฐ์ • (prefix_sum + shuffle_xor ๊ฒฐํ•ฉ):

    • ๋‹จ์ผ ์›Œํ”„ ๋‚ด ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜
    • ์œ„์น˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋น„ํฌํ•จ ์Šค์บ” + ์ดํ•ฉ์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
    • ์ตœ์ ์˜ ๋ณ‘๋ ฌ ํšจ์œจ์„ฑ์„ ๊ฐ€์ง„ ๋ณต์žกํ•œ ํŒŒํ‹ฐ์…”๋‹ ์—ฐ์‚ฐ

ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ†ต์ฐฐ

ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์˜ ์ด์ :

  • prefix_sum()์ด ํ˜„๋Œ€ GPU์˜ ์ „์šฉ ์Šค์บ” ์œ ๋‹›์„ ํ™œ์šฉ
  • ๊ธฐ์กด ๋ฐฉ์‹ ๋Œ€๋น„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ์—†๋Š” ์ž๋™ ๋™๊ธฐํ™”

๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์กฐ์ •:

# 1๋‹จ๊ณ„: ํŒŒํ‹ฐ์…˜ ์†Œ์†์„ ์œ„ํ•œ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
predicate = 1.0 if condition else 0.0

# 2๋‹จ๊ณ„: ๋กœ์ปฌ ์œ„์น˜๋ฅผ ์œ„ํ•œ prefix_sum ์‚ฌ์šฉ
local_pos = prefix_sum[exclusive=True](predicate)

# 3๋‹จ๊ณ„: ์ „์—ญ ์ดํ•ฉ์„ ์œ„ํ•œ shuffle_xor ์‚ฌ์šฉ
global_total = butterfly_reduce(predicate)

# 4๋‹จ๊ณ„: ์ตœ์ข… ์œ„์น˜ ๊ฒฐ์ •์„ ์œ„ํ•œ ๊ฒฐํ•ฉ
final_pos = local_pos + partition_offset

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์†Œํ”„ํŠธ์›จ์–ด ๊ตฌํ˜„ ๋Œ€๋น„ ์ „์šฉ ์Šค์บ” ์œ ๋‹›
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋Œ€๋น„ ๋ ˆ์ง€์Šคํ„ฐ ์ „์šฉ ์—ฐ์‚ฐ
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ณต์žก๋„: ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ๋™๋ฐ˜ํ•œ \(O(\log n)\)
  • ๋‹จ์ผ ์›Œํ”„ ์ตœ์ ํ™”: WARP_SIZE ํ•œ๋„ ๋‚ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์ตœ์ 

์‹ค์šฉ์  ํ™œ์šฉ

์ด ๋ˆ„์  ํ•ฉ ํŒจํ„ด๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ๋ˆ„์  ๊ณฑ, min/max ์Šค์บ”
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜
  • Quicksort ํŒŒํ‹ฐ์…”๋‹: ๋ณ‘๋ ฌ ์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ ๋นŒ๋”ฉ ๋ธ”๋ก
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋ถ€ํ•˜ ๋ถ„์‚ฐ, ์ž‘์—… ๋ถ„๋ฐฐ, ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์กฐํ™”

prefix_sum()๊ณผ shuffle_xor()์˜ ๊ฒฐํ•ฉ์€ ํ˜„๋Œ€ GPU ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์ตœ์†Œํ•œ์˜ ์ฝ”๋“œ ๋ณต์žก๋„์™€ ์ตœ์ ์˜ ์„ฑ๋Šฅ ํŠน์„ฑ์œผ๋กœ ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์–ด๋–ป๊ฒŒ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 27: ๋ธ”๋ก ์ „์ฒด ํŒจํ„ด

๊ฐœ์š”

Puzzle 27: ๋ธ”๋ก ์ „์ฒด ํŒจํ„ด์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ํผ์ฆ์€ GPU ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ์ธ ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ „์ฒด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์นœ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ํ†ต์‹  ํŒจํ„ด์„ ํƒ๊ตฌํ•˜๋ฉฐ, ๋ณต์žกํ•œ ์ˆ˜๋™ ๋™๊ธฐํ™”๋ฅผ ๊ฐ„๊ฒฐํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด์— ์ตœ์ ํ™”๋œ ์—ฐ์‚ฐ์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

๋ชฉํ‘œ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด(Puzzle 12)์—์„œ ๋ฒ—์–ด๋‚˜, ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ธ”๋ก ์ „์ฒด ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ฐ„๊ฒฐํ•œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์Šค๋ ˆ๋“œ ๋ธ”๋ก์€ ์ •๊ตํ•œ ํ•˜๋“œ์›จ์–ด ์กฐ์œจ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค - Mojo์˜ ๋ธ”๋ก ์—ฐ์‚ฐ์€ ํฌ๋กœ์Šค ์›Œํ”„ ํ†ต์‹ ๊ณผ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์œ ๋‹›์„ ํ™œ์šฉํ•˜์—ฌ ์™„๋ฒฝํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋นŒ๋”ฉ ๋ธ”๋ก์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค: ๋ฆฌ๋•์…˜(์ „์ฒดโ†’ํ•˜๋‚˜), ์Šค์บ”(์ „์ฒดโ†’๊ฐ๊ฐ), ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(ํ•˜๋‚˜โ†’์ „์ฒด).

๋ฐฐ์šธ ๋‚ด์šฉ

๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต์‹  ๋ชจ๋ธ

GPU ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋‚ด ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ์Šค๋ ˆ๋“œ ๋ธ”๋ก (128 ์Šค๋ ˆ๋“œ, 4๊ฐœ ๋˜๋Š” 2๊ฐœ ์›Œํ”„, ํ•˜๋“œ์›จ์–ด ์กฐ์œจ)
์ „์ฒดโ†’ํ•˜๋‚˜ (Reduction):     ๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ ์Šค๋ ˆ๋“œ 0์— ๋‹จ์ผ ๊ฒฐ๊ณผ
์ „์ฒดโ†’๊ฐ๊ฐ (Scan):         ๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์  ์œ„์น˜๋ฅผ ๋ฐ›์Œ
ํ•˜๋‚˜โ†’์ „์ฒด (Broadcast):     ์Šค๋ ˆ๋“œ 0 โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ๋ฐ›์Œ

ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ:
โ”œโ”€โ”€ ์›Œํ”„ 0 (์Šค๋ ˆ๋“œ 0-31)   โ”€โ”€block.sum()โ”€โ”€โ”
โ”œโ”€โ”€ ์›Œํ”„ 1 (์Šค๋ ˆ๋“œ 32-63)  โ”€โ”€block.sum()โ”€โ”€โ”ผโ†’ ์Šค๋ ˆ๋“œ 0 ๊ฒฐ๊ณผ
โ”œโ”€โ”€ ์›Œํ”„ 2 (์Šค๋ ˆ๋“œ 64-95)  โ”€โ”€block.sum()โ”€โ”€โ”ค
โ””โ”€โ”€ ์›Œํ”„ 3 (์Šค๋ ˆ๋“œ 96-127) โ”€โ”€block.sum()โ”€โ”€โ”˜

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • ํฌ๋กœ์Šค ์›Œํ”„ ๋™๊ธฐํ™”: ๋ธ”๋ก ๋‚ด ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ฐ„ ์ž๋™ ์กฐ์œจ
  • ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์œ ๋‹›: ํŠนํ™”๋œ ์Šค์บ” ์œ ๋‹›๊ณผ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ๋„คํŠธ์›Œํฌ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”: ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ชจ๋“  ๋™๊ธฐํ™”๋ฅผ ๋‚ด๋ถ€์ ์œผ๋กœ ๊ด€๋ฆฌ
  • ๋กœ๊ทธ ๋ณต์žก๋„: \(O(\log n)\) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ๋ช…๋ น์˜ ๋‹จ์ˆœํ•จ์œผ๋กœ

Mojo์˜ ๋ธ”๋ก ์—ฐ์‚ฐ

gpu.primitives.block์˜ ์™„์ „ํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋„๊ตฌ ๋ชจ์Œ์„ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. block.sum(value): ํ•ฉ๊ณ„, ํ‰๊ท , ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’์„ ์œ„ํ•œ ์ „์ฒดโ†’ํ•˜๋‚˜ ๋ฆฌ๋•์…˜
  2. block.prefix_sum(value): ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ์„ ์œ„ํ•œ ์ „์ฒดโ†’๊ฐ๊ฐ ์Šค์บ”
  3. block.broadcast(value): ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ ์™€ ์กฐ์œจ์„ ์œ„ํ•œ ํ•˜๋‚˜โ†’์ „์ฒด ๋ถ„๋ฐฐ
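์„ธ ๊ฐ€์ง€ ํŒจํ„ด์˜ ์˜๋ฏธ๋Š” CPU์—์„œ ๊ฐ„๋‹จํžˆ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜ Python ์Šค์ผ€์น˜๋Š” 8๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ์ด๋ฃจ์–ด์ง„ "๋ธ”๋ก"์„ ๋ฆฌ์ŠคํŠธ ํ•˜๋‚˜๋กœ ๊ฐ€์ •ํ•œ ๊ฐœ๋… ํ™•์ธ์šฉ ์˜ˆ์ œ์ด๋ฉฐ, ์‹ค์ œ ๋ธ”๋ก ์—ฐ์‚ฐ API์™€๋Š” ๋ฌด๊ด€ํ•ฉ๋‹ˆ๋‹ค.

```python
# ์„ธ ๊ฐ€์ง€ ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต์‹  ํŒจํ„ด์˜ CPU ์‹œ๋ฎฌ๋ ˆ์ด์…˜ (๊ฐ€์ •: 8๊ฐœ ์Šค๋ ˆ๋“œ)
import itertools

block_vals = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]

# ์ „์ฒดโ†’ํ•˜๋‚˜ (block.sum): ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ, ๋‹จ์ผ ๊ฒฐ๊ณผ
total = sum(block_vals)                              # 36.0

# ์ „์ฒดโ†’๊ฐ๊ฐ (block.prefix_sum): ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๋ˆ„์  ์œ„์น˜๋ฅผ ๋ฐ›์Œ
scan = list(itertools.accumulate(block_vals))        # [2.0, 6.0, 12.0, ...]

# ํ•˜๋‚˜โ†’์ „์ฒด (block.broadcast): ์Šค๋ ˆ๋“œ 0์˜ ๊ฐ’์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ›์Œ
broadcast = [block_vals[0]] * len(block_vals)        # [2.0, 2.0, ...]
```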

์ฐธ๊ณ : ์ด ๊ธฐ๋ณธ ์š”์†Œ๋“ค์€ ํ†ต๊ณ„ ์—ฐ์‚ฐ, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜, ์ •๊ทœํ™” ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๊ฐ™์€ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ธฐ๋ณธ ์š”์†Œ ์—†์ด ๊ตฌํ˜„ํ•˜๋ ค๋ฉด ์ˆ˜์‹ญ ์ค„์˜ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ ์ฝ”๋“œ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# ๋ณต์žกํ•œ ๋ธ”๋ก ์ „์ฒด ๋ฆฌ๋•์…˜ (๊ธฐ์กด ๋ฐฉ์‹ - Puzzle 12์—์„œ):
shared_memory[local_i] = my_value
barrier()
for stride in range(64, 0, -1):
    if local_i < stride:
        shared_memory[local_i] += shared_memory[local_i + stride]
    barrier()
if local_i == 0:
    output[block_idx.x] = shared_memory[0]

# ๋ธ”๋ก ์—ฐ์‚ฐ์œผ๋กœ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐ:
my_partial = compute_local_contribution()
total = block.sum[block_size=128, broadcast=False](my_partial)  # ํ•œ ์ค„์ด๋ฉด ๋!
if local_i == 0:
    output[block_idx.x] = total[0]

๋ธ”๋ก ์—ฐ์‚ฐ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด        | ๊ธฐ์กด ๋ฐฉ์‹             | ๋ธ”๋ก ์—ฐ์‚ฐ
---------------------|----------------------|---------------------------
๋ธ”๋ก ์ „์ฒด ๋ฆฌ๋•์…˜     | ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด  | ๋‹จ์ผ block.sum ํ˜ธ์ถœ
๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง          | ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ         | block.prefix_sum ์กฐ์œจ
๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ         | ์ˆ˜๋™ ๋™๊ธฐํ™”           | ๋‹จ์ผ block.broadcast ํ˜ธ์ถœ
ํฌ๋กœ์Šค ์›Œํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ | ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ   | ํ•˜๋“œ์›จ์–ด ๊ด€๋ฆฌ ์กฐ์œจ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์˜ ์ง„ํ™”

์ถœ๋ฐœ์ : ์ˆ˜๋™ ์กฐ์œจ (Puzzle 12)

๋ณต์žกํ•˜์ง€๋งŒ ๊ต์œก์  - ๋ช…์‹œ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜:

# ์ˆ˜๋™ ๋ฐฉ์‹: 15์ค„ ์ด์ƒ์˜ ๋ณต์žกํ•œ ๋™๊ธฐํ™”
shared_memory[local_i] = my_value
barrier()
# ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜...
for stride in range(64, 0, -1):
    if local_i < stride:
        shared_memory[local_i] += shared_memory[local_i + stride]
    barrier()
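์ด ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ์‹ค์ œ๋กœ ์ดํ•ฉ์„ ๊ณ„์‚ฐํ•˜๋Š”์ง€๋Š” CPU์—์„œ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” 128 ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ๊ฐ€์ •ํ•˜๊ณ  ์ŠคํŠธ๋ผ์ด๋“œ๋ฅผ ์ ˆ๋ฐ˜์”ฉ ์ค„์ด๋Š” ํ‘œ์ค€ ํŒจํ„ด์„ ์ˆœ์ฐจ ๋ฃจํ”„๋กœ ํ‰๋‚ด ๋‚ธ Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (์“ฐ๊ธฐ ์˜์—ญ [0, stride)๊ณผ ์ฝ๊ธฐ ์˜์—ญ [stride, 2*stride)์ด ๊ฒน์น˜์ง€ ์•Š์•„ ์ˆœ์ฐจ ์‹คํ–‰์ด ๋ณ‘๋ ฌ ์‹คํ–‰๊ณผ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค).

```python
# ์ŠคํŠธ๋ผ์ด๋“œ ๋ฐ˜๊ฐ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์˜ CPU ์‹œ๋ฎฌ๋ ˆ์ด์…˜ (๊ฐ€์ •: 128 ์Šค๋ ˆ๋“œ ๋ธ”๋ก)
N = 128
shared = [float(i) for i in range(N)]  # ๊ฐ ์Šค๋ ˆ๋“œ์˜ ๊ธฐ์—ฌ๊ฐ’
reference = sum(shared)                # ๊ธฐ๋Œ€๊ฐ’: 0+1+...+127 = 8128.0

stride = N // 2
while stride > 0:
    # barrier() ์‚ฌ์ด์˜ ํ•œ ๋‹จ๊ณ„: ์•ž์ชฝ ์ ˆ๋ฐ˜์ด ๋’ค์ชฝ ์ ˆ๋ฐ˜์„ ๋”ํ•จ
    for local_i in range(stride):
        shared[local_i] += shared[local_i + stride]
    stride //= 2

print(shared[0])  # 8128.0, reference์™€ ์ผ์น˜
```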

์ค‘๊ฐ„ ๋‹จ๊ณ„: ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 24)

ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์ด์ง€๋งŒ ๋ฒ”์œ„๊ฐ€ ์ œํ•œ์  - 32 ์Šค๋ ˆ๋“œ ์›Œํ”„ ๋‚ด์˜ warp.sum():

# ์›Œํ”„ ๋ฐฉ์‹: 1์ค„์ด์ง€๋งŒ ๋‹จ์ผ ์›Œํ”„๋งŒ
total = warp.sum[warp_size=WARP_SIZE](val=partial_product)

์ตœ์ข… ๋ชฉ์ ์ง€: ๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ (์ด๋ฒˆ ํผ์ฆ)

์™„์ „ํ•œ ๋„๊ตฌ ๋ชจ์Œ - ์ „์ฒด ๋ธ”๋ก์— ๊ฑธ์นœ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ:

# ๋ธ”๋ก ๋ฐฉ์‹: ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ 1์ค„ (128+ ์Šค๋ ˆ๋“œ)
total = block.sum[block_size=128, broadcast=False](val=partial_product)

์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด

๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ๋ชจ๋“  ๋ณ‘๋ ฌ ํ†ต์‹  ์š”๊ตฌ๋ฅผ ์ถฉ์กฑํ•˜๋Š” ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

1. ์ „์ฒดโ†’ํ•˜๋‚˜: ๋ฆฌ๋•์…˜ (block.sum())

  • ํŒจํ„ด: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ โ†’ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์Œ
  • ์šฉ๋„: ํ•ฉ๊ณ„, ํ‰๊ท , ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’ ๊ณ„์‚ฐ
  • ์˜ˆ์‹œ: ๋‚ด์ , ํ†ต๊ณ„ ์ง‘๊ณ„
  • ํ•˜๋“œ์›จ์–ด: ์ž๋™ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํฌํ•จ๋œ ํฌ๋กœ์Šค ์›Œํ”„ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

2. ์ „์ฒดโ†’๊ฐ๊ฐ: ์Šค์บ” (block.prefix_sum())

  • ํŒจํ„ด: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ โ†’ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์  ์œ„์น˜๋ฅผ ๋ฐ›์Œ
  • ์šฉ๋„: ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง, ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜
  • ์˜ˆ์‹œ: ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ์ถ”์ถœ์„ ์œ„ํ•œ ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ
  • ํ•˜๋“œ์›จ์–ด: ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ์„ ํฌํ•จํ•œ ๋ณ‘๋ ฌ ์Šค์บ”

3. ํ•˜๋‚˜โ†’์ „์ฒด: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ (block.broadcast())

  • ํŒจํ„ด: ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์ œ๊ณต โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ๋ฐ›์Œ
  • ์šฉ๋„: ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ , ์„ค์ •๊ฐ’ ๋ถ„๋ฐฐ
  • ์˜ˆ์‹œ: ์ •๊ทœํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๊ณ„์‚ฐ๋œ ํ‰๊ท  ๊ณต์œ 
  • ํ•˜๋“œ์›จ์–ด: ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ ์ตœ์ ํ™”๋œ ๋ถ„๋ฐฐ

ํ•™์Šต ๊ฒฝ๋กœ

์„ธ ๋‹จ๊ณ„๋กœ ์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ฉฐ, ๋‹จ์ˆœํ•œ ๊ฒƒ์—์„œ ๋ณต์žกํ•œ ๊ฒƒ์œผ๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

Part 1: block.sum()์˜ ํ•ต์‹ฌ

๋ณต์žกํ•œ ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ•œ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋ณ€ํ™˜

block.sum()์œผ๋กœ ๋‚ด์ ์„ ๊ตฌํ˜„ํ•˜๋ฉฐ ๋ธ”๋ก ๋ฆฌ๋•์…˜์˜ ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ๋ธ”๋ก ์—ฐ์‚ฐ์ด 15์ค„ ์ด์ƒ์˜ ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ๋‹จ์ผ ์ตœ์ ํ™” ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ ๋ธ”๋ก ์ „์ฒด ๋™๊ธฐํ™”
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ฆฌ๋•์…˜ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ 0 ๊ฒฐ๊ณผ ๊ด€๋ฆฌ
  • ๊ธฐ์กด ๋ฐฉ์‹๊ณผ์˜ ์„ฑ๋Šฅ ๋น„๊ต

ํ•™์Šต ๋ชฉํ‘œ: block.sum()์ด ๋ธ”๋ก ๊ทœ๋ชจ์—์„œ warp.sum()์˜ ๋‹จ์ˆœํ•จ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.


Part 2: block.prefix_sum()๊ณผ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜

๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ

ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•ด block.prefix_sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. ๋ˆ„์  ํ•ฉ์ด ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜์œผ๋กœ๋Š” ๊ตฌํ˜„ํ•˜๊ธฐ ์–ด๋ ค์šด ๋ณต์žกํ•œ ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋ฅผ ์ด์šฉํ•œ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง
  • ์กฐ์œจ๋œ ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ
  • ๊ณ ๊ธ‰ ํŒŒํ‹ฐ์…”๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ํฌ๋กœ์Šค ์Šค๋ ˆ๋“œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ ํŒจํ„ด

ํ•™์Šต ๋ชฉํ‘œ: block.prefix_sum()์ด ๋‹จ์ˆœํ•œ ์ง‘๊ณ„๋ฅผ ๋„˜์–ด์„œ๋Š” ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.


Part 3: block.broadcast()์™€ ๋ฒกํ„ฐ ์ •๊ทœํ™”

๋ชจ๋“  ํŒจํ„ด์„ ๊ฒฐํ•ฉํ•˜๋Š” ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ

๋ธ”๋ก ์—ฐ์‚ฐ ๋„๊ตฌ ๋ชจ์Œ ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ์ž‘๋™ํ•˜์—ฌ ์ˆ˜ํ•™์  ์ •ํ™•์„ฑ์„ ๊ฐ–์ถ˜ ์‹ค์ œ ์—ฐ์‚ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹  ํŒจํ„ด
  • ์กฐ์œจ๋œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ
  • ์‹ค์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„

ํ•™์Šต ๋ชฉํ‘œ: ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•ด ๋ธ”๋ก ์—ฐ์‚ฐ์„ ์กฐํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ธ”๋ก ์—ฐ์‚ฐ์ด ์ค‘์š”ํ•œ ์ด์œ 

์ฝ”๋“œ ๋‹จ์ˆœํ™” ๋ณ€ํ™˜:

๊ธฐ์กด ๋ฐฉ์‹:     20์ค„ ์ด์ƒ์˜ ๋ฐฐ๋ฆฌ์–ด, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ
๋ธ”๋ก ์—ฐ์‚ฐ:     3-5์ค„์˜ ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ์•„ํ‚คํ…์ฒ˜๋ณ„ ์ตœ์ ํ™”๋ฅผ ํ™œ์šฉ
  • ์ž๋™ ๋™๊ธฐํ™”: ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜ ์˜ค๋ฅ˜ ์ œ๊ฑฐ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ: ์—ฐ์‚ฐ๋“ค์ด ๋งค๋„๋Ÿฝ๊ฒŒ ํ•จ๊ป˜ ๋™์ž‘
  • ์ด์‹์„ฑ: ๋™์ผํ•œ ์ฝ”๋“œ๊ฐ€ ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ์ž‘๋™

๊ต์œก์  ๊ฐ€์น˜:

  • ๊ฐœ๋…์  ๋ช…ํ™•์„ฑ: ๊ฐ ์—ฐ์‚ฐ์ด ๋ช…ํ™•ํ•œ ํ†ต์‹  ๋ชฉ์ ์„ ๊ฐ€์ง
  • ์ ์ง„์  ๋ณต์žก์„ฑ: ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜์—์„œ ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ฐœ์ „
  • ์‹ค์ œ ์‘์šฉ: ๊ณผํ•™ ์—ฐ์‚ฐ, ๊ทธ๋ž˜ํ”ฝ, AI์—์„œ ๊ด‘๋ฒ”์œ„ํ•˜๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ํŒจํ„ด

์„ ์ˆ˜ ์ง€์‹

์ด ํผ์ฆ์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋‹ค์Œ์„ ์™„๋ฃŒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

ํ•™์Šต ์„ฑ๊ณผ

์„ธ ํŒŒํŠธ๋ฅผ ๋ชจ๋‘ ์™„๋ฃŒํ•˜๋ฉด ๋‹ค์Œ์„ ์ดํ•ดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  1. ๊ฐ ๋ธ”๋ก ์—ฐ์‚ฐ์˜ ์šฉ๋„ - ๋‹ค์–‘ํ•œ ๋ณ‘๋ ฌ ํ†ต์‹  ์š”๊ตฌ์— ๋งž๋Š” ์„ ํƒ
  2. ์—ฐ์‚ฐ ์กฐํ•ฉ ๋ฐฉ๋ฒ• - ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์ถ•
  3. ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ - ์ˆ˜๋™ ๋ฐฉ์‹๊ณผ ์ž๋™ํ™” ๋ฐฉ์‹ ๊ฐ„์˜ ๋น„๊ต
  4. ์‹ค์ œ ์‘์šฉ - ๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์˜ ํ™œ์šฉ
  5. ์•„ํ‚คํ…์ฒ˜ ๋…๋ฆฝ์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ - ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ ํ™œ์šฉ

์‹œ์ž‘ํ•˜๊ธฐ

๊ถŒ์žฅ ์ˆœ์„œ: ๊ฐ ํŒŒํŠธ๊ฐ€ ์ด์ „ ํŒŒํŠธ์˜ ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฏ€๋กœ ์ˆœ์„œ๋Œ€๋กœ ์™„์„ฑํ•˜์„ธ์š”. ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜ โ†’ ๊ณ ๊ธ‰ ํŒŒํ‹ฐ์…”๋‹ โ†’ ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ๋กœ ์ด์–ด์ง€๋Š” ์ง„ํ–‰์ด ๋ธ”๋ก ๋ ˆ๋ฒจ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์ดํ•ดํ•˜๋Š” ์ตœ์ ์˜ ํ•™์Šต ๊ฒฝ๋กœ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ธ”๋ก ์—ฐ์‚ฐ์€ ํ”„๋กœ๊ทธ๋ž˜๋จธ ์ƒ์‚ฐ์„ฑ๊ณผ ํ•˜๋“œ์›จ์–ด ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ์ตœ์  ์ง€์ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค - ๊ณ ์ˆ˜์ค€ ์—ฐ์‚ฐ์˜ ๋‹จ์ˆœํ•จ๊ณผ ์„ธ์‹ฌํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ์ €์ˆ˜์ค€ ๊ตฌํ˜„์˜ ํšจ์œจ์„ฑ์„ ๋™์‹œ์— ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ํ•ฉํ•œ ์ถ”์ƒํ™” ์ˆ˜์ค€์—์„œ ์‚ฌ๊ณ ํ•˜๋Š” ๋ฒ•์„ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค.

block.sum()์˜ ํ•ต์‹ฌ - ๋ธ”๋ก ๋ ˆ๋ฒจ ๋‚ด์ 

Puzzle 12์—์„œ ์‚ดํŽด๋ณธ ๋‚ด์ ์„ ๋ธ”๋ก ๋ ˆ๋ฒจ sum ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•œ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. ๋ธ”๋ก ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  block.sum()์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ์ž๋™์œผ๋กœ ํ•ฉ์‚ฐํ•˜์—ฌ, ๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์ „์ฒด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์นœ GPU ๋™๊ธฐํ™”๋ฅผ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.sum() ์—ฐ์‚ฐ์€ ๋ธ”๋ก ์ „์ฒด ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฑธ์ณ ์›Œํ”„ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ์ •๊ตํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. LLVM ๋ถ„์„์€ ๊ธฐ์ˆ  ๋ถ„์„์„ ์ฐธ๊ณ ํ•˜์„ธ์š”.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • block.sum()์„ ํ™œ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜
  • ๋ธ”๋ก ์ „์ฒด ๋™๊ธฐํ™”์™€ ์Šค๋ ˆ๋“œ ์กฐ์œจ
  • ๋‹จ์ผ ๋ธ”๋ก ๋‚ด ํฌ๋กœ์Šค ์›Œํ”„ ํ†ต์‹ 
  • ๋ณต์žกํ•œ ํŒจํ„ด์—์„œ ๊ฐ„๋‹จํ•œ ํŒจํ„ด์œผ๋กœ์˜ ์„ฑ๋Šฅ ๋ณ€ํ™˜
  • ์Šค๋ ˆ๋“œ 0 ๊ฒฐ๊ณผ ๊ด€๋ฆฌ์™€ ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ

์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‚ด์ ์ž…๋‹ˆ๋‹ค: \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]

ํ•˜์ง€๋งŒ ๊ตฌํ˜„ ๊ณผ์ •์—์„œ Mojo์˜ ๋ชจ๋“  ๋ธ”๋ก ๋ ˆ๋ฒจ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ์šฉ๋˜๋Š” ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128 ์š”์†Œ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (128, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB = 128)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)
  • ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: 128 / WARP_SIZE (NVIDIA์—์„œ 4๊ฐœ, AMD์—์„œ 2๊ฐœ ๋˜๋Š” 4๊ฐœ)

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ณต์žก์„ฑ (Puzzle 12์—์„œ)

Puzzle 12์˜ ๋ณต์žกํ•œ ๋ฐฉ์‹์„ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค:

fn traditional_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Traditional dot product using shared memory + barriers + tree reduction.
    Educational but complex - shows the manual coordination needed."""

    shared = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Each thread computes partial product
    if global_i < size:
        a_val = rebind[Scalar[dtype]](a[global_i])
        b_val = rebind[Scalar[dtype]](b[global_i])
        shared[local_i] = a_val * b_val

    barrier()

    # Tree reduction in shared memory - complex but educational
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    # Only thread 0 writes final result
    if local_i == 0:
        output[0] = shared[0]


์ด ๋ฐฉ์‹์ด ๋ณต์žกํ•œ ์ด์œ :

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ธ”๋ก ๋‚ด์—์„œ ์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด: ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๊ธฐ ์œ„ํ•œ barrier() ํ˜ธ์ถœ
  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ณต์žกํ•œ ๋ฃจํ”„ (64โ†’32โ†’16โ†’8โ†’4โ†’2โ†’1)
  • ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ: ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”
  • ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก

์ด ๋ฐฉ์‹์€ ์ „์ฒด ๋ธ”๋ก(GPU์— ๋”ฐ๋ผ 2๊ฐœ ๋˜๋Š” 4๊ฐœ ์›Œํ”„์— ๊ฑธ์นœ 128 ์Šค๋ ˆ๋“œ)์—์„œ ๋™์ž‘ํ•˜์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์žฅํ™ฉํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ ๋ธ”๋ก ๋ ˆ๋ฒจ GPU ๋™๊ธฐํ™”์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์›Œํ”„ ๋ ˆ๋ฒจ ๊ฐœ์„  (Puzzle 24์—์„œ)

๋ธ”๋ก ๋ ˆ๋ฒจ ์—ฐ์‚ฐ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, Puzzle 24์—์„œ warp.sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹จ์ผ ์›Œํ”„ ๋‚ด ๋ฆฌ๋•์…˜์„ ์–ด๋–ป๊ฒŒ ๋‹จ์ˆœํ™”ํ–ˆ๋Š”์ง€ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค:

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
):
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        output[global_i // WARP_SIZE] = total


warp.sum()์ด ๋‹ฌ์„ฑํ•œ ๊ฒƒ:

  • ๋‹จ์ผ ์›Œํ”„ ๋ฒ”์œ„: 32 ์Šค๋ ˆ๋“œ(NVIDIA) ๋˜๋Š” 32/64 ์Šค๋ ˆ๋“œ(AMD) ๋‚ด์—์„œ ๋™์ž‘
  • ํ•˜๋“œ์›จ์–ด ์…”ํ”Œ: ํšจ์œจ์ ์ธ shfl.sync.bfly.b32 ๋ช…๋ น ์‚ฌ์šฉ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”: ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ์—†์Œ
  • ํ•œ ์ค„ ๋ฆฌ๋•์…˜: total = warp_sum[warp_size=WARP_SIZE](val=partial_product)

๊ทธ๋Ÿฌ๋‚˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค: warp.sum()์€ ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ์›Œํ”„๊ฐ€ ํ•„์š”ํ•œ ๋ฌธ์ œ(์˜ˆ: 128 ์Šค๋ ˆ๋“œ ๋ธ”๋ก)์—์„œ๋Š” ์—ฌ์ „ํžˆ ์›Œํ”„ ๊ฐ„ ์กฐ์œจ์„ ์œ„ํ•ด ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด ๋ฐฉ์‹์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ์กด ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p27 --traditional-dot-product
pixi run -e amd p27 --traditional-dot-product
pixi run -e apple p27 --traditional-dot-product
uv run poe p27 --traditional-dot-product

์™„์„ฑํ•  ์ฝ”๋“œ

block.sum() ๋ฐฉ์‹

๋ณต์žกํ•œ ๊ธฐ์กด ๋ฐฉ์‹์„ block.sum()์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ„๋‹จํ•œ ๋ธ”๋ก ์ปค๋„๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

comptime SIZE = 128
comptime TPB = 128
comptime NUM_BINS = 8
comptime in_layout = Layout.row_major(SIZE)
comptime out_layout = Layout.row_major(1)
comptime dtype = DType.float32


fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # FILL IN (roughly 6 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p27/p27.mojo

pixi run p27 --block-sum-dot-product
pixi run -e amd p27 --block-sum-dot-product
uv run poe p27 --block-sum-dot-product

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 128
TPB: 128
Expected result: 1381760.0
Block.sum result: 1381760.0
Block.sum() gives identical results!
Compare the code: 15+ lines of barriers โ†’ 1 line of block.sum()!
Just like warp.sum() but for the entire block
ํŒ

1. ์„ธ ๋‹จ๊ณ„ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

๋ชจ๋“  ๋ธ”๋ก ๋ฆฌ๋•์…˜์€ ๋™์ผํ•œ ๊ฐœ๋…์  ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๋กœ์ปฌ ๊ธฐ์—ฌ๋ถ„์„ ๊ณ„์‚ฐ
  2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธ”๋ก ์ „์ฒด ๋ฆฌ๋•์…˜์— ์ฐธ์—ฌ
  3. ์ง€์ •๋œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ฒ˜๋ฆฌ

2. ๋‚ด์  ์ˆ˜ํ•™ ๊ธฐ์–ตํ•˜๊ธฐ

๊ฐ ์Šค๋ ˆ๋“œ๋Š” ๋ฒกํ„ฐ a์™€ b์—์„œ ํ•˜๋‚˜์˜ ์š”์†Œ ์Œ์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋“ค์„ ์Šค๋ ˆ๋“œ ๊ฐ„์— ํ•ฉ์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” โ€œ๋ถ€๋ถ„ ๊ฒฐ๊ณผโ€œ๋กœ ํ•ฉ์น˜๋Š” ์—ฐ์‚ฐ์€ ๋ฌด์—‡์ผ๊นŒ์š”?

3. LayoutTensor ์ธ๋ฑ์‹ฑ ํŒจํ„ด

LayoutTensor ์š”์†Œ์— ์ ‘๊ทผํ•  ๋•Œ, ์ธ๋ฑ์‹ฑ์ด SIMD ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค๋Š” ์ ์„ ๊ธฐ์–ตํ•˜์„ธ์š”. ์‚ฐ์ˆ  ์—ฐ์‚ฐ์„ ์œ„ํ•ด ์Šค์นผ๋ผ ๊ฐ’์„ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

4. block.sum() API ๊ฐœ๋…

ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” - ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ธ”๋ก ํฌ๊ธฐ๋ฅผ ์ง€์ •ํ•˜๋Š” ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ
  • ๊ฒฐ๊ณผ ๋ถ„๋ฐฐ ๋ฐฉ์‹์„ ์œ„ํ•œ ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ (broadcast)
  • ๋ฆฌ๋“€์Šคํ•  ๊ฐ’์„ ๋‹ด์€ ๋Ÿฐํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ

5. ์Šค๋ ˆ๋“œ ์กฐ์œจ ์›์น™

  • ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•  ์œ ํšจํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์„๊นŒ์š”? (ํžŒํŠธ: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ)
  • ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋กํ•ด์•ผ ํ• ๊นŒ์š”? (ํžŒํŠธ: ์ผ๊ด€๋œ ์„ ํƒ)
  • ๊ทธ ํŠน์ • ์Šค๋ ˆ๋“œ๋ฅผ ์–ด๋–ป๊ฒŒ ์‹๋ณ„ํ• ๊นŒ์š”? (ํžŒํŠธ: ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ)

์†”๋ฃจ์…˜

fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    a: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Each thread computes partial product
    var partial_product: Scalar[dtype] = 0.0
    if global_i < size:
        # LayoutTensor indexing `[0]` returns the underlying SIMD value
        partial_product = a[global_i][0] * b[global_i][0]

    # The magic: block.sum() replaces 15+ lines of manual reduction!
    # Just like warp.sum() but for the entire block
    total = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](partial_product)
    )

    # Only thread 0 writes the result
    if local_i == 0:
        output[0] = total[0]


block.sum() ์ปค๋„์€ ๋ณต์žกํ•œ ๋ธ”๋ก ๋™๊ธฐํ™”์—์„œ ์ •๊ตํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์œผ๋กœ์˜ ๊ทผ๋ณธ์ ์ธ ๋ณ€ํ™˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ธฐ์กด ๋ฐฉ์‹์—์„œ ์‚ฌ๋ผ์ง„ ๊ฒƒ๋“ค:

  • 15์ค„ ์ด์ƒ โ†’ 8์ค„: ํš๊ธฐ์ ์ธ ์ฝ”๋“œ ์ถ•์†Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ๋ถˆํ•„์š”
  • 7ํšŒ ์ด์ƒ์˜ barrier() ํ˜ธ์ถœ: ๋ช…์‹œ์  ๋™๊ธฐํ™” ์ œ๋กœ
  • ๋ณต์žกํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ: ์™„์ „ํžˆ ์ œ๊ฑฐ
  • ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ: ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์ด ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ

๋ธ”๋ก ์ „์ฒด ์‹คํ–‰ ๋ชจ๋ธ:

๋ธ”๋ก ์Šค๋ ˆ๋“œ (128 ์Šค๋ ˆ๋“œ, 4๊ฐœ ์›Œํ”„):
์›Œํ”„ 0 (์Šค๋ ˆ๋“œ 0-31):
  ์Šค๋ ˆ๋“œ 0: partial_product = a[0] * b[0] = 0.0
  ์Šค๋ ˆ๋“œ 1: partial_product = a[1] * b[1] = 2.0
  ...
  ์Šค๋ ˆ๋“œ 31: partial_product = a[31] * b[31] = 1922.0

์›Œํ”„ 1 (์Šค๋ ˆ๋“œ 32-63):
  ์Šค๋ ˆ๋“œ 32: partial_product = a[32] * b[32] = 2048.0
  ...

์›Œํ”„ 2 (์Šค๋ ˆ๋“œ 64-95):
  ์Šค๋ ˆ๋“œ 64: partial_product = a[64] * b[64] = 8192.0
  ...

์›Œํ”„ 3 (์Šค๋ ˆ๋“œ 96-127):
  ์Šค๋ ˆ๋“œ 96: partial_product = a[96] * b[96] = 18432.0
  ์Šค๋ ˆ๋“œ 127: partial_product = a[127] * b[127] = 32258.0

block.sum() ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ:
๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ 0.0 + 2.0 + 1922.0 + 2048.0 + ... + 32258.0 = 1381760.0
์Šค๋ ˆ๋“œ 0์ด ์ˆ˜์‹  โ†’ total = 1381760.0 (broadcast=False์ผ ๋•Œ)

๋ฐฐ๋ฆฌ์–ด ์—†์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ๋ธ”๋ก ์ „์ฒด ์‹คํ–‰: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ๋‚ด์—์„œ ๋ก์Šคํ…์œผ๋กœ ๊ฐ ๋ช…๋ น์„ ์‹คํ–‰
  2. ๋‚ด์žฅ ๋™๊ธฐํ™”: block.sum() ๊ตฌํ˜„์ด ๋™๊ธฐํ™”๋ฅผ ๋‚ด๋ถ€์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  3. ํฌ๋กœ์Šค ์›Œํ”„ ํ†ต์‹ : ๋ธ”๋ก ๋‚ด ์›Œํ”„ ๊ฐ„ ์ตœ์ ํ™”๋œ ํ†ต์‹ 
  4. ์กฐ์œจ๋œ ๊ฒฐ๊ณผ ์ „๋‹ฌ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 

warp.sum() (Puzzle 24)๊ณผ์˜ ๋น„๊ต:

  • ์›Œํ”„ ๋ฒ”์œ„: warp.sum()์€ 32/64 ์Šค๋ ˆ๋“œ(๋‹จ์ผ ์›Œํ”„) ๋‚ด์—์„œ ๋™์ž‘
  • ๋ธ”๋ก ๋ฒ”์œ„: block.sum()์€ ์ „์ฒด ๋ธ”๋ก(์—ฌ๋Ÿฌ ์›Œํ”„)์— ๊ฑธ์ณ ๋™์ž‘
  • ๋™์ผํ•œ ๋‹จ์ˆœํ•จ: ๋‘˜ ๋‹ค ๋ณต์žกํ•œ ์ˆ˜๋™ ๋ฆฌ๋•์…˜์„ ํ•œ ์ค„ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์ž๋™ ์กฐ์œจ: block.sum()์€ warp.sum()์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†๋Š” ํฌ๋กœ์Šค ์›Œํ”„ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ

๊ธฐ์ˆ  ๋ถ„์„: block.sum()์€ ์‹ค์ œ๋กœ ๋ฌด์—‡์œผ๋กœ ์ปดํŒŒ์ผ๋ ๊นŒ?

block.sum()์ด ์‹ค์ œ๋กœ ๋ฌด์—‡์„ ์ƒ์„ฑํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด, ๋””๋ฒ„๊ทธ ์ •๋ณด์™€ ํ•จ๊ป˜ ํผ์ฆ์„ ์ปดํŒŒ์ผํ–ˆ์Šต๋‹ˆ๋‹ค:

pixi run mojo build --emit llvm --debug-level=line-tables solutions/p27/p27.mojo -o solutions/p27/p27.ll

์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ LLVM ํŒŒ์ผ solutions/p27/p27.ll์—๋Š”, ํ˜ธํ™˜ NVIDIA GPU์—์„œ ์‹ค์ œ GPU ๋ช…๋ น์„ ๋ณด์—ฌ์ฃผ๋Š” PTX ์–ด์…ˆ๋ธ”๋ฆฌ๊ฐ€ ๋‚ด์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๋ฐœ๊ฒฌ 1: ๋‹จ์ผ ๋ช…๋ น์ด ์•„๋‹ˆ๋‹ค

block.sum()์€ ์•ฝ 20๊ฐœ ์ด์ƒ์˜ PTX ๋ช…๋ น์œผ๋กœ ์ปดํŒŒ์ผ๋˜๋ฉฐ, 2๋‹จ๊ณ„ ๋ฆฌ๋•์…˜์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜ (๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ)

shfl.sync.bfly.b32 %r23, %r46, 16, 31, -1;   // ์˜คํ”„์…‹ 16์œผ๋กœ ์…”ํ”Œ
add.f32            %r24, %r46, %r23;         // ์…”ํ”Œ๋œ ๊ฐ’์„ ํ•ฉ์‚ฐ
shfl.sync.bfly.b32 %r25, %r24, 8, 31, -1;    // ์˜คํ”„์…‹ 8๋กœ ์…”ํ”Œ
add.f32            %r26, %r24, %r25;         // ์…”ํ”Œ๋œ ๊ฐ’์„ ํ•ฉ์‚ฐ
// ... ์˜คํ”„์…‹ 4, 2, 1์— ๋Œ€ํ•ด ๊ณ„์†

2๋‹จ๊ณ„: ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ

shr.u32            %r32, %r1, 5;             // ์›Œํ”„ ID๋ฅผ ๊ณ„์‚ฐ
mov.b32            %r34, _global_alloc_$__gpu_shared_mem; // ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
bar.sync           0;                        // ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”
// ... ํฌ๋กœ์Šค ์›Œํ”„ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๋˜ ๋‹ค๋ฅธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ ์‹œํ€€์Šค

๋ฐœ๊ฒฌ 2: ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ตฌํ˜„

  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜๋ณด๋‹ค ํšจ์œจ์ 
  • ์ž๋™ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜: ํฌ๋กœ์Šค ์›Œํ”„ ๋™๊ธฐํ™”๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ „๋žต์ ์œผ๋กœ ์‚ฌ์šฉ
  • ์•„ํ‚คํ…์ฒ˜ ์ธ์‹: ๋™์ผํ•œ API๊ฐ€ NVIDIA(32 ์Šค๋ ˆ๋“œ ์›Œํ”„)์™€ AMD(32 ๋˜๋Š” 64 ์Šค๋ ˆ๋“œ ์›Œํ”„)์—์„œ ๋™์ž‘

๋ฐœ๊ฒฌ 3: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„ ๋ถ„์„

๋ถ„์„ ์ ‘๊ทผ ๋ฐฉ์‹:

  1. ๋ฐ”์ด๋„ˆ๋ฆฌ ELF ์„น์…˜(.nv_debug_ptx_txt)์—์„œ PTX ์–ด์…ˆ๋ธ”๋ฆฌ๋ฅผ ํ™•์ธ
  2. ๊ฐœ๋ณ„ ๋ช…๋ น ์ˆ˜๋ฅผ ์„ธ๊ธฐ๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜์  ์ฐจ์ด๋ฅผ ์‹๋ณ„

๊ด€์ฐฐ๋œ ์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ฐจ์ด:

  • ๊ธฐ์กด ๋ฐฉ์‹: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ + ๋‹ค์ˆ˜์˜ bar.sync ํ˜ธ์ถœ
  • block.sum(): ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ ํŒจํ„ด + ์ตœ์ ํ™”๋œ ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ

์„ฑ๋Šฅ ์ด์ ์€ ๋ช…๋ น ์ˆ˜๋‚˜ ๋งˆ๋ฒ• ๊ฐ™์€ ํ•˜๋“œ์›จ์–ด๊ฐ€ ์•„๋‹ˆ๋ผ ์ •๊ตํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํƒ(๋ฒ„ํ„ฐํ”Œ๋ผ์ด > ํŠธ๋ฆฌ)์—์„œ ๋น„๋กฏ๋ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ Mojo gpu ๋ชจ๋“ˆ์˜ block.mojo๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ

block.sum() vs ๊ธฐ์กด ๋ฐฉ์‹:

  • ์ฝ”๋“œ ๋‹จ์ˆœํ•จ: ๋ฆฌ๋•์…˜ ๋ถ€๋ถ„์ด 15์ค„ ์ด์ƒ โ†’ 1์ค„๋กœ
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  • ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”
  • ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ํ•œ๋„ ๋‚ด์—์„œ ๋ชจ๋“  ๋ธ”๋ก ํฌ๊ธฐ์— ๋™์ž‘

block.sum() vs warp.sum():

  • ๋ฒ”์œ„: ๋ธ”๋ก ์ „์ฒด(128 ์Šค๋ ˆ๋“œ) vs ์›Œํ”„ ์ „์ฒด(32 ์Šค๋ ˆ๋“œ)
  • ์šฉ๋„: ์ „์ฒด ๋ธ”๋ก์— ๊ฑธ์นœ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ•  ๋•Œ
  • ํŽธ์˜์„ฑ: ๋™์ผํ•œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ, ๋‹ค๋ฅธ ๊ทœ๋ชจ

block.sum()์„ ์‚ฌ์šฉํ•ด์•ผ ํ•  ๋•Œ:

  • ๋‹จ์ผ ๋ธ”๋ก ๋ฌธ์ œ: ๋ชจ๋“  ๋ฐ์ดํ„ฐ๊ฐ€ ํ•˜๋‚˜์˜ ๋ธ”๋ก์— ๋“ค์–ด๊ฐˆ ๋•Œ
  • ๋ธ”๋ก ๋ ˆ๋ฒจ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
  • ํ™•์žฅ์„ฑ๋ณด๋‹ค ํŽธ์˜์„ฑ: ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ฐฉ์‹๋ณด๋‹ค ๋‹จ์ˆœ

์ด์ „ ํผ์ฆ๊ณผ์˜ ๊ด€๊ณ„

Puzzle 12 (๊ธฐ์กด ๋ฐฉ์‹)์—์„œ:

๋ณต์žกํ•จ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜
โ†“
๋‹จ์ˆœํ•จ: block.sum() ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ณธ ์š”์†Œ

Puzzle 24 (warp.sum())์—์„œ:

์›Œํ”„ ๋ ˆ๋ฒจ: warp.sum() - 32 ์Šค๋ ˆ๋“œ (๋‹จ์ผ ์›Œํ”„)
โ†“
๋ธ”๋ก ๋ ˆ๋ฒจ: block.sum() - 128 ์Šค๋ ˆ๋“œ (์—ฌ๋Ÿฌ ์›Œํ”„)

3๋‹จ๊ณ„ ์ง„ํ–‰:

  1. ์ˆ˜๋™ ๋ฆฌ๋•์…˜ (Puzzle 12): ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜
  2. ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ (Puzzle 24): warp.sum() - ๋‹จ์ˆœํ•˜์ง€๋งŒ ๋‹จ์ผ ์›Œํ”„๋กœ ์ œํ•œ
  3. ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ (Puzzle 27): block.sum() - ์›Œํ”„์˜ ๋‹จ์ˆœํ•จ์„ ์—ฌ๋Ÿฌ ์›Œํ”„๋กœ ํ™•์žฅ

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.sum()์€ warp.sum()์˜ ๋‹จ์ˆœํ•จ์„ ์ œ๊ณตํ•˜๋ฉด์„œ ์ „์ฒด ๋ธ”๋ก์œผ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค. ์ˆ˜๋™์œผ๋กœ ๊ตฌํ˜„ํ•ด์•ผ ํ–ˆ๋˜ ๋ณต์žกํ•œ ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

block.sum() ์—ฐ์‚ฐ์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ๋ธ”๋ก ์—ฐ์‚ฐ์€ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์„ ์ „์ฒด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์œผ๋กœ ํ™•์žฅํ•˜์—ฌ, ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์ณ ๋™์‹œ์— ๋™์ž‘ํ•˜๋ฉด์„œ ๋ณต์žกํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋Œ€์ฒดํ•˜๋Š” ์ตœ์ ํ™”๋œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. warp.sum()์ด ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ™”ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ, block.sum()์€ ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š๊ณ  ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ™”ํ•ฉ๋‹ˆ๋‹ค.

block.prefix_sum()๊ณผ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜

์ด ํผ์ฆ์€ ๋ธ”๋ก ๋ ˆ๋ฒจ block.prefix_sum ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ์„ ์œ„ํ•œ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ์š”์†Œ๊ฐ€ ์†ํ•  ๋Œ€์ƒ ๊ตฌ๊ฐ„์„ ๊ฒฐ์ •ํ•œ ๋‹ค์Œ, block.prefix_sum()์„ ์ ์šฉํ•˜์—ฌ ํŠน์ • ๊ตฌ๊ฐ„์˜ ์š”์†Œ๋ฅผ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•œ ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๋ˆ„์  ํ•ฉ์ด ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜์„ ๋„˜์–ด ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.prefix_sum() ์—ฐ์‚ฐ์€ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฑธ์ณ ์ผ์น˜ํ•˜๋Š” ์š”์†Œ์˜ ๋ˆ„์  ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • block.prefix_sum()์„ ํ™œ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ˆ„์  ํ•ฉ
  • ๋ˆ„์  ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ
  • ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๋ธ”๋ก ์ „์ฒด ์กฐ์œจ์„ ํ†ตํ•œ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜
  • ๋น„ํฌํ•จ(exclusive) vs ํฌํ•จ(inclusive) ๋ˆ„์  ํ•ฉ ํŒจํ„ด

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํŠน์ • ๊ฐ’ ๋ฒ”์œ„(๊ตฌ๊ฐ„)์— ์†ํ•˜๋Š” ์š”์†Œ๋ฅผ ์ถ”์ถœํ•˜์—ฌ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Bin}_k = \{x_i : k/N \leq x_i < (k+1)/N\}\]

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ์š”์†Œ๊ฐ€ ์†ํ•˜๋Š” ๊ตฌ๊ฐ„์„ ๊ฒฐ์ •ํ•˜๊ณ , block.prefix_sum()์ด ๋ณ‘๋ ฌ ์ถ”์ถœ์„ ์กฐ์œจํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128 ์š”์†Œ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (128, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB = 128)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๊ตฌ๊ฐ„ ์ˆ˜: NUM_BINS = 8 (๋ฒ”์œ„ [0.0, 0.125), [0.125, 0.25) ๋“ฑ)
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (1D row-major)
  • ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: 128 / WARP_SIZE (GPU์— ๋”ฐ๋ผ 2๊ฐœ ๋˜๋Š” 4๊ฐœ)

๋„์ „ ๊ณผ์ œ: ๋ณ‘๋ ฌ ๊ตฌ๊ฐ„ ์ถ”์ถœ

๊ธฐ์กด์˜ ์ˆœ์ฐจ์  ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ์„ฑ์€ ์š”์†Œ๋ฅผ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

# ์ˆœ์ฐจ์  ๋ฐฉ์‹ - ๋ณ‘๋ ฌํ™”๊ฐ€ ์–ด๋ ค์›€
histogram = [[] for _ in range(NUM_BINS)]
for element in data:
    bin_id = int(element * NUM_BINS)  # ๊ตฌ๊ฐ„ ๊ฒฐ์ •
    histogram[bin_id].append(element)  # ์ˆœ์ฐจ์  ์ถ”๊ฐ€

๋‹จ์ˆœํ•œ GPU ๋ณ‘๋ ฌํ™”์˜ ๋ฌธ์ œ์ :

  • ๊ฒฝ์Ÿ ์ƒํƒœ: ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ตฌ๊ฐ„์— ๋™์‹œ์— ์“ฐ๊ธฐ
  • ๋น„์ •๋ ฌ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผ
  • ๋ถ€ํ•˜ ๋ถˆ๊ท ํ˜•: ์ผ๋ถ€ ๊ตฌ๊ฐ„์— ํ›จ์”ฌ ๋งŽ์€ ์š”์†Œ๊ฐ€ ๋ชฐ๋ฆด ์ˆ˜ ์žˆ์Œ
  • ๋ณต์žกํ•œ ๋™๊ธฐํ™”: ๋ฐฐ๋ฆฌ์–ด์™€ ์›์ž์  ์—ฐ์‚ฐ์ด ํ•„์š”

๊ณ ๊ธ‰ ๋ฐฉ์‹: block.prefix_sum() ์กฐ์œจ

๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ์กฐ์œจ๋œ ์ถ”์ถœ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

์™„์„ฑํ•  ์ฝ”๋“œ

block.prefix_sum() ๋ฐฉ์‹

block.prefix_sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

comptime bin_layout = Layout.row_major(SIZE)  # Max SIZE elements per bin


fn block_histogram_bin_extract[
    in_layout: Layout, bin_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    bin_output: LayoutTensor[dtype, bin_layout, MutAnyOrigin],
    count_output: LayoutTensor[DType.int32, out_layout, MutAnyOrigin],
    size: Int,
    target_bin: Int,
    num_bins: Int,
):
    """Parallel histogram using block.prefix_sum() for bin extraction.

    This demonstrates advanced parallel filtering and extraction:
    1. Each thread determines which bin its element belongs to
    2. Use block.prefix_sum() to compute write positions for target_bin elements
    3. Extract and pack only elements belonging to target_bin
    """

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Step 1: Each thread determines its bin and element value

    # FILL IN (roughly 9 lines)

    # Step 2: Create predicate for target bin extraction

    # FILL IN (roughly 3 lines)

    # Step 3: Use block.prefix_sum() for parallel bin extraction!
    # This computes where each thread should write within the target bin

    # FILL IN (1 line)

    # Step 4: Extract and pack elements belonging to target_bin

    # FILL IN (roughly 2 lines)

    # Step 5: Final thread computes total count for this bin

    # FILL IN (roughly 3 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p27/p27.mojo

ํŒ

1. ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ (์ด์ „ ํผ์ฆ์—์„œ ์ ์šฉ)

block_sum_dot_product์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋‹ค์Œ ํ•ต์‹ฌ ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x

ํ•จ์ˆ˜๋Š” 5๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„(์ด ์•ฝ 15-20์ค„)๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

  1. ์š”์†Œ๋ฅผ ๋กœ๋“œํ•˜๊ณ  ๊ตฌ๊ฐ„์„ ๊ฒฐ์ •
  2. ๋Œ€์ƒ ๊ตฌ๊ฐ„์— ๋Œ€ํ•œ ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
  3. ํ”„๋ ˆ๋””์ผ€์ดํŠธ์— block.prefix_sum() ์‹คํ–‰
  4. ๊ณ„์‚ฐ๋œ ์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ
  5. ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๊ฐ€ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐ

2. ๊ตฌ๊ฐ„ ๊ณ„์‚ฐ (math.floor ์‚ฌ์šฉ)

Float32 ๊ฐ’์„ ๊ตฌ๊ฐ„์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๋ ค๋ฉด:

my_value = input_data[global_i][0]  # ๋‚ด์ ์—์„œ์ฒ˜๋Ÿผ SIMD ์ถ”์ถœ
bin_number = Int(floor(my_value * num_bins))

๊ฒฝ๊ณ„ ์‚ฌ๋ก€ ์ฒ˜๋ฆฌ: ์ •ํ™•ํžˆ 1.0์ธ ๊ฐ’์€ ๊ตฌ๊ฐ„ NUM_BINS์— ๋“ค์–ด๊ฐ€์ง€๋งŒ, ์‹ค์ œ ๊ตฌ๊ฐ„์€ 0๋ถ€ํ„ฐ NUM_BINS-1๊นŒ์ง€์ž…๋‹ˆ๋‹ค. if ๋ฌธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ๋Œ€ ๊ตฌ๊ฐ„์„ ์ œํ•œํ•˜์„ธ์š”.

3. ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ

์ด ์Šค๋ ˆ๋“œ์˜ ์š”์†Œ๊ฐ€ target_bin์— ์†ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ •์ˆ˜ ๋ณ€์ˆ˜(0 ๋˜๋Š” 1)๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค:

var belongs_to_target: Int = 0
if (thread_has_valid_element) and (my_bin == target_bin):
    belongs_to_target = 1

์ด๊ฒƒ์ด ํ•ต์‹ฌ ํ†ต์ฐฐ์ž…๋‹ˆ๋‹ค: ๋ˆ„์  ํ•ฉ์ด ์ด ์ด์ง„ ํ”Œ๋ž˜๊ทธ์— ์ž‘์šฉํ•˜์—ฌ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค!

4. block.prefix_sum() ํ˜ธ์ถœ ํŒจํ„ด

๋ฌธ์„œ์— ๋”ฐ๋ฅด๋ฉด ํ˜ธ์ถœ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

offset = block.prefix_sum[
    dtype=DType.int32,         # ์ •์ˆ˜ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋กœ ์ž‘์—…
    block_size=tpb,            # block.sum()๊ณผ ๋™์ผ
    exclusive=True             # ํ•ต์‹ฌ: ๊ฐ ์Šค๋ ˆ๋“œ ์ด์ „์˜ ์œ„์น˜๋ฅผ ์ œ๊ณต
](val=SIMD[DType.int32, 1](my_predicate_value))

์™œ ๋น„ํฌํ•จ(exclusive)์ธ๊ฐ€? ์œ„์น˜ 5์—์„œ ํ”„๋ ˆ๋””์ผ€์ดํŠธ=1์ธ ์Šค๋ ˆ๋“œ๋Š”, ์ž์‹  ์•ž์— 4๊ฐœ์˜ ์š”์†Œ๊ฐ€ ์žˆ์—ˆ๋‹ค๋ฉด output[4]์— ์จ์•ผ ํ•ฉ๋‹ˆ๋‹ค.

5. ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ ํŒจํ„ด

belongs_to_target == 1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ธฐ๋กํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

if belongs_to_target == 1:
    bin_output[Int(offset[0])] = my_value  # ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด SIMD๋ฅผ Int๋กœ ๋ณ€ํ™˜

์ด๊ฒƒ์€ Puzzle 12์˜ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํŒจํ„ด๊ณผ ๋™์ผํ•˜์ง€๋งŒ, ์กฐ๊ฑด์ด โ€œ๋Œ€์ƒ ๊ตฌ๊ฐ„์— ์†ํ•˜๋Š”์ง€โ€œ๋กœ ๋ฐ”๋€Œ์—ˆ์Šต๋‹ˆ๋‹ค.

6. ์ตœ์ข… ๊ฐœ์ˆ˜ ๊ณ„์‚ฐ

๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ(์Šค๋ ˆ๋“œ 0์ด ์•„๋‹˜!)๊ฐ€ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

if local_i == tpb - 1:  # ๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ
    total_count = offset[0] + belongs_to_target  # ํฌํ•จ = ๋น„ํฌํ•จ + ์ž์‹ ์˜ ๊ธฐ์—ฌ๋ถ„
    count_output[0] = total_count

์™œ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ์ธ๊ฐ€? ๊ฐ€์žฅ ๋†’์€ offset ๊ฐ’์„ ๊ฐ€์ง€๋ฏ€๋กœ, offset + ๊ธฐ์—ฌ๋ถ„์ด ์ด ๊ฐœ์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

7. ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๋ณ€ํ™˜

์ด์ „ ํผ์ฆ์˜ ํŒจํ„ด์„ ๊ธฐ์–ตํ•˜์„ธ์š”:

  • LayoutTensor ์ธ๋ฑ์‹ฑ์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: input_data[i][0]
  • block.prefix_sum()์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: offset[0]์œผ๋กœ ์ถ”์ถœ
  • ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์€ Int๊ฐ€ ํ•„์š”: bin_output[...]์— Int(offset[0])

block.prefix_sum() ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p27 --histogram
pixi run -e amd p27 --histogram
pixi run -e apple p27 --histogram
uv run poe p27 --histogram

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 128
TPB: 128
NUM_BINS: 8

Input sample: 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 ...

=== Processing Bin 0 (range [ 0.0 , 0.125 )) ===
Bin 0 count: 26
Bin 0 extracted elements: 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 ...

=== Processing Bin 1 (range [ 0.125 , 0.25 )) ===
Bin 1 count: 24
Bin 1 extracted elements: 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 ...

=== Processing Bin 2 (range [ 0.25 , 0.375 )) ===
Bin 2 count: 26
Bin 2 extracted elements: 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 ...

=== Processing Bin 3 (range [ 0.375 , 0.5 )) ===
Bin 3 count: 22
Bin 3 extracted elements: 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 ...

=== Processing Bin 4 (range [ 0.5 , 0.625 )) ===
Bin 4 count: 13
Bin 4 extracted elements: 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 ...

=== Processing Bin 5 (range [ 0.625 , 0.75 )) ===
Bin 5 count: 12
Bin 5 extracted elements: 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 ...

=== Processing Bin 6 (range [ 0.75 , 0.875 )) ===
Bin 6 count: 5
Bin 6 extracted elements: 0.75 0.76 0.77 0.78 0.79

=== Processing Bin 7 (range [ 0.875 , 1.0 )) ===
Bin 7 count: 0
Bin 7 extracted elements:

์†”๋ฃจ์…˜

fn block_histogram_bin_extract[
    in_layout: Layout, bin_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    bin_output: LayoutTensor[dtype, bin_layout, MutAnyOrigin],
    count_output: LayoutTensor[DType.int32, out_layout, MutAnyOrigin],
    size: Int,
    target_bin: Int,
    num_bins: Int,
):
    """Parallel histogram using block.prefix_sum() for bin extraction.

    This demonstrates advanced parallel filtering and extraction:
    1. Each thread determines which bin its element belongs to
    2. Use block.prefix_sum() to compute write positions for target_bin elements
    3. Extract and pack only elements belonging to target_bin
    """

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Step 1: Each thread determines its bin and element value
    var my_value: Scalar[dtype] = 0.0
    var my_bin: Int = -1

    if global_i < size:
        # `[0]` returns the underlying SIMD value
        my_value = input_data[global_i][0]
        # Bin values [0.0, 1.0) into num_bins buckets
        my_bin = Int(floor(my_value * num_bins))
        # Clamp to valid range
        if my_bin >= num_bins:
            my_bin = num_bins - 1
        if my_bin < 0:
            my_bin = 0

    # Step 2: Create predicate for target bin extraction
    var belongs_to_target: Int = 0
    if global_i < size and my_bin == target_bin:
        belongs_to_target = 1

    # Step 3: Use block.prefix_sum() for parallel bin extraction!
    # This computes where each thread should write within the target bin
    write_offset = block.prefix_sum[
        dtype = DType.int32, block_size=tpb, exclusive=True
    ](val=SIMD[DType.int32, 1](belongs_to_target))

    # Step 4: Extract and pack elements belonging to target_bin
    if belongs_to_target == 1:
        bin_output[Int(write_offset[0])] = my_value

    # Step 5: Final thread computes total count for this bin
    if local_i == tpb - 1:
        # Inclusive sum = exclusive sum + my contribution
        total_count = write_offset[0] + belongs_to_target
        count_output[0] = total_count


block.prefix_sum() ์ปค๋„์€ ์ด์ „ ํผ์ฆ์˜ ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์กฐ์œจ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋‹จ๊ณ„๋ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

1๋‹จ๊ณ„: ์š”์†Œ ์ฒ˜๋ฆฌ (Puzzle 12 ๋‚ด์ ๊ณผ ์œ ์‚ฌ)

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ (์ต์ˆ™ํ•œ ํŒจํ„ด):
  global_i = block_dim.x * block_idx.x + thread_idx.x  // ์ „์—ญ ์š”์†Œ ์ธ๋ฑ์Šค
  local_i = thread_idx.x                               // ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค

์š”์†Œ ๋กœ๋”ฉ (LayoutTensor ํŒจํ„ด๊ณผ ๋™์ผ):
  ์Šค๋ ˆ๋“œ 0:  my_value = input_data[0][0] = 0.00
  ์Šค๋ ˆ๋“œ 1:  my_value = input_data[1][0] = 0.01
  ์Šค๋ ˆ๋“œ 13: my_value = input_data[13][0] = 0.13
  ์Šค๋ ˆ๋“œ 25: my_value = input_data[25][0] = 0.25
  ...

2๋‹จ๊ณ„: ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜ (์ƒˆ๋กœ์šด ๊ฐœ๋…)

floor ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•œ ๊ตฌ๊ฐ„ ๊ณ„์‚ฐ:
  ์Šค๋ ˆ๋“œ 0:  my_bin = Int(floor(0.00 * 8)) = 0  // ๊ฐ’ [0.000, 0.125) โ†’ ๊ตฌ๊ฐ„ 0
  ์Šค๋ ˆ๋“œ 1:  my_bin = Int(floor(0.01 * 8)) = 0  // ๊ฐ’ [0.000, 0.125) โ†’ ๊ตฌ๊ฐ„ 0
  ์Šค๋ ˆ๋“œ 13: my_bin = Int(floor(0.13 * 8)) = 1  // ๊ฐ’ [0.125, 0.250) โ†’ ๊ตฌ๊ฐ„ 1
  ์Šค๋ ˆ๋“œ 25: my_bin = Int(floor(0.25 * 8)) = 2  // ๊ฐ’ [0.250, 0.375) โ†’ ๊ตฌ๊ฐ„ 2
  ...

3๋‹จ๊ณ„: ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ (ํ•„ํ„ฐ๋ง ํŒจํ„ด)

target_bin=0์— ๋Œ€ํ•ด ์ถ”์ถœ ๋งˆ์Šคํฌ ์ƒ์„ฑ:
  ์Šค๋ ˆ๋“œ 0:  belongs_to_target = 1  (๊ตฌ๊ฐ„ 0 == ๋Œ€์ƒ 0)
  ์Šค๋ ˆ๋“œ 1:  belongs_to_target = 1  (๊ตฌ๊ฐ„ 0 == ๋Œ€์ƒ 0)
  ์Šค๋ ˆ๋“œ 13: belongs_to_target = 0  (๊ตฌ๊ฐ„ 1 != ๋Œ€์ƒ 0)
  ์Šค๋ ˆ๋“œ 25: belongs_to_target = 0  (๊ตฌ๊ฐ„ 2 != ๋Œ€์ƒ 0)
  ...

์ด์ง„ ๋ฐฐ์—ด ์ƒ์„ฑ: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...]

4๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ˆ„์  ํ•ฉ (๋งˆ๋ฒ•์ด ์ผ์–ด๋‚˜๋Š” ๊ณณ!)

ํ”„๋ ˆ๋””์ผ€์ดํŠธ์— block.prefix_sum[exclusive=True] ์ ์šฉ:
์ž…๋ ฅ:      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...]
๋น„ํฌํ•จ:    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12, -, -, -, ...]
                                                      ^
                                                 ์ค‘์š”ํ•˜์ง€ ์•Š์Œ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ๋ฐฐ์—ด์—์„œ ์ž์‹ ์˜ ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ๋ฐ›์Šต๋‹ˆ๋‹ค!

5๋‹จ๊ณ„: ์กฐ์œจ๋œ ์ถ”์ถœ (์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ)

belongs_to_target=1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ธฐ๋ก:
  ์Šค๋ ˆ๋“œ 0:  bin_output[0] = 0.00   // write_offset[0] = 0 ์‚ฌ์šฉ
  ์Šค๋ ˆ๋“œ 1:  bin_output[1] = 0.01   // write_offset[1] = 1 ์‚ฌ์šฉ
  ์Šค๋ ˆ๋“œ 12: bin_output[12] = 0.12  // write_offset[12] = 12 ์‚ฌ์šฉ
  ์Šค๋ ˆ๋“œ 13: (๊ธฐ๋ก ์•ˆ ํ•จ)             // belongs_to_target = 0
  ์Šค๋ ˆ๋“œ 25: (๊ธฐ๋ก ์•ˆ ํ•จ)             // belongs_to_target = 0
  ...

๊ฒฐ๊ณผ: [0.00, 0.01, 0.02, ..., 0.12, ???, ???, ...] // ๋นˆํ‹ˆ์—†์ด ์ฑ„์›Œ์ง!

6๋‹จ๊ณ„: ๊ฐœ์ˆ˜ ๊ณ„์‚ฐ (block.sum() ํŒจํ„ด๊ณผ ์œ ์‚ฌ)

๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๊ฐ€ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ 0์ด ์•„๋‹˜!):
  if local_i == tpb - 1:  // ์ด ๊ฒฝ์šฐ ์Šค๋ ˆ๋“œ 127
      total = write_offset[0] + belongs_to_target  // ํฌํ•จ ํ•ฉ ๊ณต์‹
      count_output[0] = total
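위 1~6단계 전체를 CPU에서 순차적으로 재현한 파이썬 스케치입니다. 함수 이름 `extract_bin`은 예시용이며, 단일 블록에 모든 데이터가 들어가고 입력이 음수가 아니라고 가정합니다.

```python
import math

# 구간 추출 알고리즘(구간 판정 -> 프레디케이트 -> 비포함 누적 합 -> 조건부 쓰기 -> 개수)의
# CPU 시뮬레이션입니다 (가정: 단일 블록, 음수 없는 입력).
def extract_bin(data, target_bin, num_bins=8):
    # 1~2단계: 구간 판정 + 이진 프레디케이트
    def bin_of(x):
        return min(int(math.floor(x * num_bins)), num_bins - 1)
    predicate = [1 if bin_of(x) == target_bin else 0 for x in data]

    # 3단계: 비포함 누적 합 -> 각 스레드의 쓰기 위치
    offsets, running = [], 0
    for p in predicate:
        offsets.append(running)
        running += p

    # 4단계: 조건부 쓰기 (고유한 위치이므로 경쟁 상태 없이 빈틈없이 채워짐)
    bin_output = [None] * len(data)
    for i, (p, off) in enumerate(zip(predicate, offsets)):
        if p == 1:
            bin_output[off] = data[i]

    # 5~6단계: 마지막 "스레드"가 총 개수 계산 (포함 = 비포함 + 자신의 기여분)
    count = offsets[-1] + predicate[-1]
    return bin_output[:count], count

data = [i / 100.0 for i in range(128)]  # 0.00, 0.01, ..., 1.27 (예시 데이터)
elements, count = extract_bin(data, target_bin=0)
print(count)          # 13 (0.00 ~ 0.12 가 구간 0에 해당)
print(elements[:4])   # [0.0, 0.01, 0.02, 0.03]
```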

์ด ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

Puzzle 12 (๊ธฐ์กด ๋‚ด์ )๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ๋™์ผํ•œ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: global_i์™€ local_i ํŒจํ„ด
  • ๋™์ผํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: if global_i < size ๊ฒ€์ฆ
  • ๋™์ผํ•œ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: [0]์„ ์‚ฌ์šฉํ•œ LayoutTensor SIMD ์ถ”์ถœ

block.sum() (์ด ํผ์ฆ์˜ ์•ž๋ถ€๋ถ„)๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ๋™์ผํ•œ ๋ธ”๋ก ์ „์ฒด ์—ฐ์‚ฐ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ์— ์ฐธ์—ฌ
  • ๋™์ผํ•œ ๊ฒฐ๊ณผ ์ฒ˜๋ฆฌ: ํŠน์ • ์Šค๋ ˆ๋“œ(์ฒซ ๋ฒˆ์งธ ๋Œ€์‹  ๋งˆ์ง€๋ง‰)๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ ์ฒ˜๋ฆฌ
  • ๋™์ผํ•œ SIMD ๋ณ€ํ™˜: ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•œ Int(result[0]) ํŒจํ„ด

block.prefix_sum()๋งŒ์˜ ๊ณ ๊ธ‰ ๊ฐœ๋…:

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์Œ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ค‘์š”ํ•œ block.sum()๊ณผ ๋‹ฌ๋ฆฌ
  • ์กฐ์œจ๋œ ์“ฐ๊ธฐ ์œ„์น˜: ๋ˆ„์  ํ•ฉ์ด ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ž๋™์œผ๋กœ ์ œ๊ฑฐ
  • ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง: ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๊ฐ€ ๊ณ ๊ธ‰ ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ

๋‹จ์ˆœํ•œ ๋ฐฉ์‹ ๋Œ€๋น„ ์„ฑ๋Šฅ ์ด์ :

vs. ์›์ž์  ์—ฐ์‚ฐ:

  • ๊ฒฝ์Ÿ ์ƒํƒœ ์—†์Œ: ๋ˆ„์  ํ•ฉ์ด ๊ณ ์œ ํ•œ ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ์ œ๊ณต
  • ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ: ์ˆœ์ฐจ์  ์“ฐ๊ธฐ๊ฐ€ ์บ์‹œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ
  • ์ง๋ ฌํ™” ์—†์Œ: ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰

vs. ๋‹ค์ค‘ ํŒจ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ๋‹จ์ผ ์ปค๋„: ํ•œ ๋ฒˆ์˜ GPU ์‹คํ–‰์œผ๋กœ ํžˆ์Šคํ† ๊ทธ๋žจ ์ถ”์ถœ ์™„๋ฃŒ
  • ์™„์ „ ํ™œ์šฉ: ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์— ๊ด€๊ณ„์—†์ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ž‘์—…
  • ์ตœ์  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์— ์ตœ์ ํ™”๋œ ํŒจํ„ด

์ด๊ฒƒ์€ block.prefix_sum()์ด block.sum() ๊ฐ™์€ ๋‹จ์ˆœํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋กœ๋Š” ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์–ด๋–ป๊ฒŒ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ

block.prefix_sum() vs ๊ธฐ์กด ๋ฐฉ์‹:

  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ •๊ตํ•จ: ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹ vs ์ˆœ์ฐจ์  ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ๋ณ‘ํ•ฉ๋œ ์“ฐ๊ธฐ vs ๋ถ„์‚ฐ๋œ ๋ฌด์ž‘์œ„ ์ ‘๊ทผ
  • ๋™๊ธฐํ™”: ๋‚ด์žฅ ์กฐ์œจ vs ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด์™€ ์›์ž์  ์—ฐ์‚ฐ
  • ํ™•์žฅ์„ฑ: ๋ชจ๋“  ๋ธ”๋ก ํฌ๊ธฐ์™€ ๊ตฌ๊ฐ„ ์ˆ˜์— ๋™์ž‘

block.prefix_sum() vs block.sum():

  • ๋ฒ”์œ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์Œ vs ์Šค๋ ˆ๋“œ 0๋งŒ
  • ์šฉ๋„: ๋ณต์žกํ•œ ํŒŒํ‹ฐ์…”๋‹ vs ๋‹จ์ˆœํ•œ ์ง‘๊ณ„
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์œ ํ˜•: ๋ณ‘๋ ฌ ์Šค์บ” ๊ธฐ๋ณธ ์š”์†Œ vs ๋ฆฌ๋•์…˜ ๊ธฐ๋ณธ ์š”์†Œ
  • ์ถœ๋ ฅ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋ณ„ ์œ„์น˜ vs ๋‹จ์ผ ํ•ฉ๊ณ„

block.prefix_sum()์„ ์‚ฌ์šฉํ•ด์•ผ ํ•  ๋•Œ:

  • ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง: ์กฐ๊ฑด์— ๋งž๋Š” ์š”์†Œ ์ถ”์ถœ
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ถˆํ•„์š”ํ•œ ์š”์†Œ ์ œ๊ฑฐ
  • ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹: ๋ฐ์ดํ„ฐ๋ฅผ ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„๋กœ ๋ถ„๋ฆฌ
  • ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋ถ€ํ•˜ ๋ถ„์‚ฐ, ์ •๋ ฌ, ๊ทธ๋ž˜ํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜

๋‹ค์Œ ๋‹จ๊ณ„

block.prefix_sum() ์—ฐ์‚ฐ์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • block.broadcast()์™€ ๋ฒกํ„ฐ ์ •๊ทœํ™”: ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฐ’์„ ๊ณต์œ 
  • ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋” ํฐ ๋ฌธ์ œ๋ฅผ ์œ„ํ•œ ์—ฌ๋Ÿฌ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ
  • ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ •๋ ฌ, ๊ทธ๋ž˜ํ”„ ํƒ์ƒ‰, ๋™์  ๋ถ€ํ•˜ ๋ถ„์‚ฐ
  • ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ๋ธ”๋ก ์—ฐ์‚ฐ๊ณผ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์˜ ๊ฒฐํ•ฉ

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ๋ธ”๋ก ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ์—์„œ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. block.sum()์ด ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ™”ํ–ˆ๋‹ค๋ฉด, block.prefix_sum()์€ ๊ณ ์„ฑ๋Šฅ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ธ ๊ณ ๊ธ‰ ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

block.broadcast()์™€ ๋ฒกํ„ฐ ์ •๊ทœํ™”

block.sum๊ณผ block.broadcast ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ , ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต์‹  ์›Œํฌํ”Œ๋กœ์šฐ์˜ ์ „์ฒด ๋ชจ์Šต์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ‰๊ท  ๊ณ„์‚ฐ์— ๊ธฐ์—ฌํ•œ ๋’ค ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ๋ฐ›์•„ ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ •๊ทœํ™”ํ•˜๋Š” ๊ณผ์ •์„ ํ†ตํ•ด, ๋ธ”๋ก ์—ฐ์‚ฐ๋“ค์ด ์‹ค์ œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ๋™์ž‘ํ•˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ�˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.broadcast() ์—ฐ์‚ฐ์€ ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹ ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜์—ฌ, ๊ธฐ๋ณธ ๋ธ”๋ก ํ†ต์‹  ํŒจํ„ด์„ ์™„์„ฑํ•ฉ๋‹ˆ๋‹ค: ๋ฆฌ๋•์…˜(์ „์ฒดโ†’ํ•˜๋‚˜), ์Šค์บ”(์ „์ฒดโ†’๊ฐ๊ฐ), ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(ํ•˜๋‚˜โ†’์ „์ฒด).

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • block.broadcast()๋ฅผ ํ™œ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
  • ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹  ํŒจํ„ด
  • ์†Œ์Šค ์Šค๋ ˆ๋“œ ์ง€์ •๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ œ์–ด
  • ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜๋Š” ์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ
  • ์กฐ์œจ๋œ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ ์‹ค์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \frac{\text{input}[i]}{\frac{1}{N}\sum_{j=0}^{N-1} \text{input}[j]}\]

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ‰๊ท  ๊ณ„์‚ฐ์— ๊ธฐ์—ฌํ•œ ๋‹ค์Œ, ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ๋ฐ›์•„ ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128 ์š”์†Œ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (128, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB = 128)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(SIZE) (์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๋ชจ๋‘ 1D row-major)
  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: 1-8 ๋ฐ˜๋ณต ๊ฐ’, ํ‰๊ท  = 4.5
  • ์˜ˆ์ƒ ์ถœ๋ ฅ: ํ‰๊ท ์ด 1.0์ธ ์ •๊ทœํ™”๋œ ๋ฒกํ„ฐ

๋„์ „ ๊ณผ์ œ: ๋ธ”๋ก ์ „์ฒด ๊ณ„์‚ฐ๊ณผ ๋ถ„๋ฐฐ์˜ ์กฐ์œจ

๊ธฐ์กด์˜ ํ‰๊ท  ์ •๊ทœํ™” ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ์ˆœ์ฐจ์  ๋ฐฉ์‹ - ๋ณ‘๋ ฌ์„ฑ์„ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•จ
total = sum(input_array)
mean = total / len(input_array)
output_array = [x / mean for x in input_array]

๋‹จ์ˆœํ•œ GPU ๋ณ‘๋ ฌํ™”์˜ ๋ฌธ์ œ์ :

  • ๋‹ค์ค‘ ์ปค๋„ ์‹คํ–‰: ํ‰๊ท  ๊ณ„์‚ฐ๊ณผ ์ •๊ทœํ™”์— ๊ฐ๊ฐ ๋ณ„๋„์˜ ํŒจ์Šค๊ฐ€ ํ•„์š”
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์™•๋ณต: ํ‰๊ท ์„ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅํ–ˆ๋‹ค๊ฐ€ ๋‚˜์ค‘์— ๋‹ค์‹œ ์ฝ๊ธฐ
  • ๋™๊ธฐํ™” ๋ณต์žก์„ฑ: ๊ณ„์‚ฐ ๋‹จ๊ณ„ ๊ฐ„์— ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ์ž‘์—…์„ ์ˆ˜ํ–‰

๊ธฐ์กด GPU ํ’€์ด์˜ ๋ณต์žก์„ฑ:

# 1๋‹จ๊ณ„: ํ•ฉ๊ณ„๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•œ ๋ฆฌ๋•์…˜ (๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด)
shared_sum[local_i] = my_value
barrier()
# ์—ฌ๋Ÿฌ barrier() ํ˜ธ์ถœ์ด ํ•„์š”ํ•œ ์ˆ˜๋™ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜...

# 2๋‹จ๊ณ„: ์Šค๋ ˆ๋“œ 0์ด ํ‰๊ท ์„ ๊ณ„์‚ฐ
if local_i == 0:
    mean = shared_sum[0] / size
    shared_mean[0] = mean

barrier()

# 3๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ‰๊ท ์„ ์ฝ๊ณ  ์ •๊ทœํ™”
mean = shared_mean[0]  # ๋ชจ๋‘๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ์ฝ์Œ
output[global_i] = my_value / mean

๊ณ ๊ธ‰ ๋ฐฉ์‹: block.sum() + block.broadcast() ์กฐ์œจ

๋‹ค๋‹จ๊ณ„ ์กฐ์œจ์„ ๊ฐ„๊ฒฐํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

์™„์„ฑํ•  ์ฝ”๋“œ

์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ

๋ธ”๋ก ์—ฐ์‚ฐ ๋„๊ตฌ ๋ชจ์Œ ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ๊ธ‰ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:


comptime vector_layout = Layout.row_major(SIZE)


fn block_normalize_vector[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all โ†’ one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one โ†’ all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Step 1: Each thread loads its element

    # FILL IN (roughly 3 lines)

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)

    # FILL IN (1 line)

    # Step 3: Thread 0 computes mean value

    # FILL IN (roughly 4 lines)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration

    # FILL IN (1 line)

    # Step 5: Each thread normalizes by the mean

    # FILL IN (roughly 3 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p27/p27.mojo

ํŒ

1. ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ ๊ตฌ์กฐ (๋ชจ๋“  ์ด์ „ ์—ฐ์‚ฐ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์ถ•)

์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์™„๋ฒฝํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ๋กœ๋“œ (๋ชจ๋“  ์ด์ „ ํผ์ฆ์—์„œ ์ต์ˆ™ํ•œ ํŒจํ„ด)
  2. block.sum()์œผ๋กœ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐ (์ด ํผ์ฆ์˜ ์•ž๋ถ€๋ถ„์—์„œ ๋ฐฐ์šด ๋‚ด์šฉ)
  3. ์Šค๋ ˆ๋“œ 0์ด ํ•ฉ๊ณ„๋กœ๋ถ€ํ„ฐ ํ‰๊ท ์„ ๊ณ„์‚ฐ
  4. block.broadcast()๋กœ ํ‰๊ท ์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ณต์œ  (์ƒˆ๋กœ์šด ๋‚ด์šฉ!)
  5. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์œผ๋กœ ์ •๊ทœํ™”

2. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ๊ณผ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ (์ต์ˆ™ํ•œ ํŒจํ„ด)

๊ธฐ์กด LayoutTensor ํŒจํ„ด์œผ๋กœ ์š”์†Œ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค:

var my_value: Scalar[dtype] = 0.0
if global_i < size:
    my_value = input_data[global_i][0]  # SIMD ์ถ”์ถœ

๊ทธ๋Ÿฐ ๋‹ค์Œ ์•ž์„œ ๋ฐฐ์šด ๋‚ด์ ๊ณผ ๋™์ผํ•˜๊ฒŒ block.sum()์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

total_sum = block.sum[block_size=tpb, broadcast=False](...)

3. ํ‰๊ท  ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ 0๋งŒ)

์Šค๋ ˆ๋“œ 0๋งŒ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

var mean_value: Scalar[dtype] = 1.0  # ์•ˆ์ „ํ•œ ๊ธฐ๋ณธ๊ฐ’
if local_i == 0:
    # total_sum๊ณผ size๋กœ ํ‰๊ท  ๊ณ„์‚ฐ

์™œ ์Šค๋ ˆ๋“œ 0์ธ๊ฐ€? block.sum() ํŒจํ„ด์—์„œ ์Šค๋ ˆ๋“œ 0์ด ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›๋Š” ๊ฒƒ๊ณผ ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

4. block.broadcast() API ๊ฐœ๋…

ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” - ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ: dtype, width, block_size
  • ๋Ÿฐํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ: val (๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  SIMD ๊ฐ’), src_thread (๊ธฐ๋ณธ๊ฐ’=0)

ํ˜ธ์ถœ ํŒจํ„ด์€ ๊ธฐ์กด ํ…œํ”Œ๋ฆฟ ์Šคํƒ€์ผ์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

result = block.broadcast[
    dtype = DType.float32,
    width = 1,
    block_size = tpb
](val=SIMD[DType.float32, 1](value_to_broadcast), src_thread=UInt(0))

5. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.broadcast()๋Š” ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์™€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ 0์ด ๊ณ„์‚ฐ๋œ ํ‰๊ท ๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ
  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ํ‰๊ท ๊ฐ’์ด ํ•„์š”
  • block.broadcast() ๊ฐ€ ์Šค๋ ˆ๋“œ 0์˜ ๊ฐ’์„ ๋ชจ๋‘์—๊ฒŒ ๋ณต์‚ฌ

์ด๊ฒƒ์€ block.sum()(์ „์ฒดโ†’ํ•˜๋‚˜)์˜ ๋ฐ˜๋Œ€์ด๋ฉฐ, block.prefix_sum()(์ „์ฒดโ†’๊ฐ๊ฐ ์œ„์น˜)๊ณผ๋„ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

6. ์ตœ์ข… ์ •๊ทœํ™” ๋‹จ๊ณ„

๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ๋ฐ›์œผ๋ฉด, ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค:

if global_i < size:
    normalized_value = my_value / broadcasted_mean[0]  # SIMD ์ถ”์ถœ
    output_data[global_i] = normalized_value

SIMD ์ถ”์ถœ: block.broadcast()๊ฐ€ SIMD๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋ฏ€๋กœ [0]์œผ๋กœ ์Šค์นผ๋ผ๋ฅผ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

7. ์ด์ „ ํผ์ฆ์—์„œ์˜ ํŒจํ„ด ์ธ์‹

  • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: ํ•ญ์ƒ ๋™์ผํ•œ global_i, local_i ํŒจํ„ด
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋™์ผํ•œ if global_i < size ๊ฒ€์ฆ
  • SIMD ์ฒ˜๋ฆฌ: ๋™์ผํ•œ [0] ์ถ”์ถœ ํŒจํ„ด
  • ๋ธ”๋ก ์—ฐ์‚ฐ: block.sum()๊ณผ ๋™์ผํ•œ ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ ์Šคํƒ€์ผ

๊ฐ ๋ธ”๋ก ์—ฐ์‚ฐ์ด ์ผ๊ด€๋œ ํŒจํ„ด์„ ๋”ฐ๋ฅด๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค!

block.broadcast() ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p27 --normalize
pixi run -e amd p27 --normalize
pixi run -e apple p27 --normalize
uv run poe p27 --normalize

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 128
TPB: 128

Input sample: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 ...
Sum value: 576.0
Mean value: 4.5

Mean Normalization Results:
Normalized sample: 0.22222222 0.44444445 0.6666667 0.8888889 1.1111112 1.3333334 1.5555556 1.7777778 ...

Output sum: 128.0
Output mean: 1.0
โœ… Success: Output mean is 1.0 (should be close to 1.0)

์†”๋ฃจ์…˜

fn block_normalize_vector[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    output_data: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all โ†’ one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one โ†’ all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Step 1: Each thread loads its element
    var my_value: Scalar[dtype] = 0.0
    if global_i < size:
        my_value = input_data[global_i][0]  # Extract SIMD value

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)
    total_sum = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](my_value)
    )

    # Step 3: Thread 0 computes mean value
    var mean_value: Scalar[dtype] = 1.0  # Default to avoid division by zero
    if local_i == 0:
        if total_sum[0] > 0.0:
            mean_value = total_sum[0] / Float32(size)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration
    broadcasted_mean = block.broadcast[
        dtype = DType.float32, width=1, block_size=tpb
    ](val=SIMD[DType.float32, 1](mean_value), src_thread=UInt(0))

    # Step 5: Each thread normalizes by the mean
    if global_i < size:
        normalized_value = my_value / broadcasted_mean[0]
        output_data[global_i] = normalized_value


block.broadcast() ์ปค๋„์€ ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด์„ ๋ชจ๋‘ ๊ฒฐํ•ฉํ•˜์—ฌ ์ˆ˜ํ•™์ ์œผ๋กœ ๊ฒ€์ฆ ๊ฐ€๋Šฅํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์‹ค์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ตฌ์ฒด์ ์ธ ์‹คํ–‰์„ ํ†ตํ•œ ์™„์ „ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

1๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ (๋ชจ๋“  ์ด์ „ ํผ์ฆ์—์„œ ํ™•๋ฆฝ๋œ ํŒจํ„ด)

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ (๋ชจ๋“  ํผ์ฆ์—์„œ ์ผ๊ด€๋จ):
  global_i = block_dim.x * block_idx.x + thread_idx.x  // ์ž…๋ ฅ ๋ฐฐ์—ด ์œ„์น˜์— ๋งคํ•‘
  local_i = thread_idx.x                              // ๋ธ”๋ก ๋‚ด ์œ„์น˜ (0-127)

LayoutTensor ํŒจํ„ด์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ์š”์†Œ ๋กœ๋”ฉ:
  ์Šค๋ ˆ๋“œ 0:   my_value = input_data[0][0] = 1.0    // ์ฒซ ๋ฒˆ์งธ ์ˆœํ™˜ ๊ฐ’
  ์Šค๋ ˆ๋“œ 1:   my_value = input_data[1][0] = 2.0    // ๋‘ ๋ฒˆ์งธ ์ˆœํ™˜ ๊ฐ’
  ์Šค๋ ˆ๋“œ 7:   my_value = input_data[7][0] = 8.0    // ๋งˆ์ง€๋ง‰ ์ˆœํ™˜ ๊ฐ’
  ์Šค๋ ˆ๋“œ 8:   my_value = input_data[8][0] = 1.0    // ์ˆœํ™˜ ๋ฐ˜๋ณต: 1,2,3,4,5,6,7,8,1,2...
  ์Šค๋ ˆ๋“œ 15:  my_value = input_data[15][0] = 8.0   // 15 % 8 = 7, 8๋ฒˆ์งธ ๊ฐ’
  ์Šค๋ ˆ๋“œ 127: my_value = input_data[127][0] = 8.0  // 127 % 8 = 7, 8๋ฒˆ์งธ ๊ฐ’

128๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๋กœ๋“œ - ์™„๋ฒฝํ•œ ๋ณ‘๋ ฌ ํšจ์œจ!

2๋‹จ๊ณ„: ๋ธ”๋ก ์ „์ฒด ํ•ฉ๊ณ„ ๋ฆฌ๋•์…˜ (์•ž์„œ ๋ฐฐ์šด block.sum() ์ง€์‹ ํ™œ์šฉ)

128๊ฐœ ์Šค๋ ˆ๋“œ์— ๊ฑธ์นœ block.sum() ์กฐ์œจ:
  ๊ธฐ์—ฌ๋ถ„ ๋ถ„์„:
    - ๊ฐ’ 1,2,3,4,5,6,7,8์ด ๊ฐ๊ฐ 16๋ฒˆ ๋ฐ˜๋ณต (128/8 = 16)
    - ์Šค๋ ˆ๋“œ ๊ธฐ์—ฌ๋ถ„: 16ร—1 + 16ร—2 + 16ร—3 + 16ร—4 + 16ร—5 + 16ร—6 + 16ร—7 + 16ร—8
    - ์ˆ˜ํ•™์  ํ•ฉ๊ณ„: 16 ร— (1+2+3+4+5+6+7+8) = 16 ร— 36 = 576.0

block.sum() ํ•˜๋“œ์›จ์–ด ์‹คํ–‰:
  ๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ [๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ] โ†’ ์Šค๋ ˆ๋“œ 0
  total_sum = SIMD[DType.float32, 1](576.0)  // ์Šค๋ ˆ๋“œ 0๋งŒ ์ด ๊ฐ’์„ ์ˆ˜์‹ 

์Šค๋ ˆ๋“œ 1-127: total_sum์— ์ ‘๊ทผ ๋ถˆ๊ฐ€ (block.sum์—์„œ broadcast=False)

3๋‹จ๊ณ„: ๋…์ ์  ํ‰๊ท  ๊ณ„์‚ฐ (๋‹จ์ผ ์Šค๋ ˆ๋“œ ์ฒ˜๋ฆฌ)

์Šค๋ ˆ๋“œ 0์ด ํ•ต์‹ฌ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰:
  ์ž…๋ ฅ: total_sum[0] = 576.0, size = 128
  ๊ณ„์‚ฐ: mean_value = 576.0 / 128.0 = 4.5

  ๊ฒ€์ฆ: ๊ธฐ๋Œ€ ํ‰๊ท  = (1+2+3+4+5+6+7+8)/8 = 36/8 = 4.5 โœ“

๋‹ค๋ฅธ ๋ชจ๋“  ์Šค๋ ˆ๋“œ (1-127):
  mean_value = 1.0 (๊ธฐ๋ณธ ์•ˆ์ „ ๊ฐ’)
  ์ด ๊ฐ’๋“ค์€ ๋ฌด๊ด€ - ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด ์‹œ์ ์—์„œ ์˜ฌ๋ฐ”๋ฅธ ํ‰๊ท ๊ฐ’์„ ๊ฐ€์ง„ ๊ฒƒ์€ ์Šค๋ ˆ๋“œ 0๋ฟ์ž…๋‹ˆ๋‹ค!

4๋‹จ๊ณ„: ๋ธ”๋ก ์ „์ฒด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ถ„๋ฐฐ (ํ•˜๋‚˜ โ†’ ์ „์ฒด ํ†ต์‹ )

block.broadcast() API ์‹คํ–‰:
  ์†Œ์Šค: src_thread = UInt(0) โ†’ ์Šค๋ ˆ๋“œ 0์˜ mean_value = 4.5
  ๋Œ€์ƒ: ๋ธ”๋ก ๋‚ด ๋ชจ๋“  128 ์Šค๋ ˆ๋“œ

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ „:
  ์Šค๋ ˆ๋“œ 0:   mean_value = 4.5  โ† ์ง„์‹ค์˜ ์›์ฒœ
  ์Šค๋ ˆ๋“œ 1:   mean_value = 1.0  โ† ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •
  ์Šค๋ ˆ๋“œ 2:   mean_value = 1.0  โ† ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •
  ...
  ์Šค๋ ˆ๋“œ 127: mean_value = 1.0  โ† ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •

block.broadcast() ์‹คํ–‰ ํ›„:
  ์Šค๋ ˆ๋“œ 0:   broadcasted_mean[0] = 4.5  โ† ์ž์‹ ์˜ ๊ฐ’์„ ๋‹ค์‹œ ์ˆ˜์‹ 
  ์Šค๋ ˆ๋“œ 1:   broadcasted_mean[0] = 4.5  โ† ์ด์ œ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ€์ง!
  ์Šค๋ ˆ๋“œ 2:   broadcasted_mean[0] = 4.5  โ† ์ด์ œ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ€์ง!
  ...
  ์Šค๋ ˆ๋“œ 127: broadcasted_mean[0] = 4.5  โ† ์ด์ œ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ€์ง!

๊ฒฐ๊ณผ: ์™„๋ฒฝํ•œ ๋™๊ธฐํ™” - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ํ‰๊ท ๊ฐ’์„ ๊ฐ€์ง!

5๋‹จ๊ณ„: ๋ณ‘๋ ฌ ํ‰๊ท  ์ •๊ทœํ™” (์กฐ์œจ๋œ ์ฒ˜๋ฆฌ)

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”:
  ์Šค๋ ˆ๋“œ 0:   normalized = 1.0 / 4.5 = 0.22222222...
  ์Šค๋ ˆ๋“œ 1:   normalized = 2.0 / 4.5 = 0.44444444...
  ์Šค๋ ˆ๋“œ 2:   normalized = 3.0 / 4.5 = 0.66666666...
  ์Šค๋ ˆ๋“œ 7:   normalized = 8.0 / 4.5 = 1.77777777...
  ์Šค๋ ˆ๋“œ 8:   normalized = 1.0 / 4.5 = 0.22222222...  (ํŒจํ„ด ๋ฐ˜๋ณต)
  ...

์ˆ˜ํ•™์  ๊ฒ€์ฆ:
  ์ถœ๋ ฅ ํ•ฉ๊ณ„ = (0.222... + 0.444... + ... + 1.777...) ร— 16 = (36 / 4.5) ร— 16 = 8 ร— 16 = 128.0
  ์ถœ๋ ฅ ํ‰๊ท  = 128.0 / 128 = 1.0  ์™„๋ฒฝํ•œ ์ •๊ทœํ™”!

๊ฐ ๊ฐ’์„ ์›๋ž˜ ํ‰๊ท ์œผ๋กœ ๋‚˜๋ˆ„๋ฉด ํ‰๊ท ์ด 1.0์ธ ์ถœ๋ ฅ์„ ์ƒ์„ฑ

6๋‹จ๊ณ„: ์ •ํ™•์„ฑ ๊ฒ€์ฆ

์ž…๋ ฅ ๋ถ„์„:
  - ํ•ฉ๊ณ„: 576.0, ํ‰๊ท : 4.5
  - ์ตœ๋Œ“๊ฐ’: 8.0, ์ตœ์†Ÿ๊ฐ’: 1.0
  - ๋ฒ”์œ„: [1.0, 8.0]

์ถœ๋ ฅ ๋ถ„์„:
  - ํ•ฉ๊ณ„: 128.0, ํ‰๊ท : 1.0 โœ“
  - ์ตœ๋Œ“๊ฐ’: 1.777..., ์ตœ์†Ÿ๊ฐ’: 0.222...
  - ๋ฒ”์œ„: [0.222, 1.777] (๋ชจ๋“  ๊ฐ’์ด 1/4.5 ๋น„์œจ๋กœ ์Šค์ผ€์ผ๋ง)

๋น„๋ก€ ๊ด€๊ณ„ ๋ณด์กด:
  - ์›๋ž˜ 8:1 ๋น„์œจ์ด 1.777:0.222 = 8:1๋กœ ์œ ์ง€ โœ“
  - ๋ชจ๋“  ์ƒ๋Œ€์  ํฌ๊ธฐ๊ฐ€ ์™„๋ฒฝํ•˜๊ฒŒ ์œ ์ง€

์ด ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ๊ฐ€ ์ˆ˜ํ•™์ ยท๊ณ„์‚ฐ์ ์œผ๋กœ ์šฐ์ˆ˜ํ•œ ์ด์œ :

๊ธฐ์ˆ ์  ์ •ํ™•์„ฑ๊ณผ ๊ฒ€์ฆ:

์ˆ˜ํ•™์  ์ •ํ™•์„ฑ ์ฆ๋ช…:
  ์ž…๋ ฅ: xโ‚, xโ‚‚, ..., xโ‚™ (n = 128)
  ํ‰๊ท : ฮผ = (โˆ‘xแตข)/n = 576/128 = 4.5

  ์ •๊ทœํ™”: yแตข = xแตข/ฮผ
  ์ถœ๋ ฅ ํ‰๊ท : (โˆ‘yแตข)/n = (โˆ‘xแตข/ฮผ)/n = (1/ฮผ)(โˆ‘xแตข)/n = (1/ฮผ)ฮผ = 1 โœ“

์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ฆ๋ช… ๊ฐ€๋Šฅํ•˜๊ฒŒ ์˜ฌ๋ฐ”๋ฅธ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Puzzle 12 (๊ธฐ์ดˆ ํŒจํ„ด)๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ์Šค๋ ˆ๋“œ ์กฐ์œจ์˜ ์ง„ํ™”: ๋™์ผํ•œ global_i, local_i ํŒจํ„ด์ด์ง€๋งŒ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด: ๋™์ผํ•œ LayoutTensor SIMD ์ถ”์ถœ [0]์ด์ง€๋งŒ ์ตœ์ ํ™”๋œ ์›Œํฌํ”Œ๋กœ์šฐ
  • ๋ณต์žก์„ฑ ์ œ๊ฑฐ: 20์ค„ ์ด์ƒ์˜ ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด๋ฅผ 2๊ฐœ์˜ ๋ธ”๋ก ์—ฐ์‚ฐ์œผ๋กœ ๋Œ€์ฒด
  • ๊ต์œก์  ์ง„ํ–‰: ์ˆ˜๋™ โ†’ ์ž๋™, ๋ณต์žก โ†’ ๋‹จ์ˆœ, ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ โ†’ ์‹ ๋ขฐ์„ฑ

block.sum() (์™„๋ฒฝํ•œ ํ†ตํ•ฉ)๊ณผ์˜ ์—ฐ๊ฒฐ:

  • API ์ผ๊ด€์„ฑ: ๋™์ผํ•œ ํ…œํ”Œ๋ฆฟ ๊ตฌ์กฐ [block_size=tpb, broadcast=False]
  • ๊ฒฐ๊ณผ ํ๋ฆ„ ์„ค๊ณ„: ์Šค๋ ˆ๋“œ 0์ด ํ•ฉ๊ณ„๋ฅผ ์ˆ˜์‹ ํ•˜๊ณ , ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํŒŒ์ƒ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ณ„์‚ฐ
  • ๋งค๋„๋Ÿฌ์šด ์กฐํ•ฉ: block.sum()์˜ ์ถœ๋ ฅ์ด ๊ณ„์‚ฐ + ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์˜ ์ž…๋ ฅ์ด ๋จ
  • ์„ฑ๋Šฅ ์ตœ์ ํ™”: ๋‹จ์ผ ์ปค๋„ ์›Œํฌํ”Œ๋กœ์šฐ vs ๋‹ค์ค‘ ํŒจ์Šค ๋ฐฉ์‹

block.prefix_sum() (์ƒ๋ณด์  ํ†ต์‹ )๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ๋ถ„๋ฐฐ ํŒจํ„ด: prefix_sum์€ ๊ณ ์œ ํ•œ ์œ„์น˜๋ฅผ, broadcast๋Š” ๊ณต์œ  ๊ฐ’์„ ์ œ๊ณต

  • ์‚ฌ์šฉ ์‹œ๋‚˜๋ฆฌ์˜ค: prefix_sum์€ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์šฉ, broadcast๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ ์šฉ

  • ํ…œํ”Œ๋ฆฟ ์ผ๊ด€์„ฑ: ๋ชจ๋“  ์—ฐ์‚ฐ์—์„œ ๋™์ผํ•œ dtype, block_size ํŒŒ๋ผ๋ฏธํ„ฐ ํŒจํ„ด

  • SIMD ์ฒ˜๋ฆฌ ํ†ต์ผ์„ฑ: ๋ชจ๋“  ๋ธ”๋ก ์—ฐ์‚ฐ์ด [0] ์ถ”์ถœ์ด ํ•„์š”ํ•œ SIMD๋ฅผ ๋ฐ˜ํ™˜

๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ธ์‚ฌ์ดํŠธ:

ํ†ต์‹  ํŒจํ„ด ๋น„๊ต:
  ๊ธฐ์กด ๋ฐฉ์‹:
    1. ์ˆ˜๋™ ๋ฆฌ๋•์…˜:         O(log n), ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ํ•„์š”
    2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ:     O(1), ๋™๊ธฐํ™” ํ•„์š”
    3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ:     O(1), ๋ฑ…ํฌ ์ถฉ๋Œ ๊ฐ€๋Šฅ์„ฑ
    ์ดํ•ฉ: ๋‹ค์ˆ˜์˜ ๋™๊ธฐํ™” ์ง€์ , ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ

  ๋ธ”๋ก ์—ฐ์‚ฐ ๋ฐฉ์‹:
    1. block.sum():        O(log n), ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”, ์ž๋™ ๋ฐฐ๋ฆฌ์–ด
    2. ๊ณ„์‚ฐ:                O(1), ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    3. block.broadcast():  O(log n), ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”, ์ž๋™ ๋ถ„๋ฐฐ
    ์ดํ•ฉ: ๋‘ ๊ฐœ์˜ ๊ธฐ๋ณธ ์š”์†Œ, ์ž๋™ ๋™๊ธฐํ™”, ์ฆ๋ช…๋œ ์ •ํ™•์„ฑ

์‹ค์ œ ์‘์šฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด:

์ผ๋ฐ˜์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ:
  1๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ        โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ
  2๋‹จ๊ณ„: ์ „์—ญ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณ„์‚ฐ      โ†’ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ
  3๋‹จ๊ณ„: ํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„๋ฐฐ          โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ˆ˜์‹ 
  4๋‹จ๊ณ„: ์กฐ์œจ๋œ ๋ณ‘๋ ฌ ์ถœ๋ ฅ        โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌ

์ด ์ •ํ™•ํ•œ ํŒจํ„ด์ด ๋“ฑ์žฅํ•˜๋Š” ๋ถ„์•ผ:
  - ๋ฐฐ์น˜ ์ •๊ทœํ™” (๋”ฅ๋Ÿฌ๋‹)
  - ํžˆ์Šคํ† ๊ทธ๋žจ ๊ท ๋“ฑํ™” (์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ)
  - ๋ฐ˜๋ณต์  ์ˆ˜์น˜ ํ•ด๋ฒ• (๊ณผํ•™ ์—ฐ์‚ฐ)
  - ์กฐ๋ช… ๊ณ„์‚ฐ (์ปดํ“จํ„ฐ ๊ทธ๋ž˜ํ”ฝ)

ํ‰๊ท  ์ •๊ทœํ™”๋Š” ์ด ๊ทผ๋ณธ์ ์ธ ํŒจํ„ด์˜ ์™„๋ฒฝํ•œ ๊ต์œก ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค.

๋ธ”๋ก ์—ฐ์‚ฐ 3๋ถ€์ž‘ ์™„์„ฑ:

1. block.sum() - ์ „์ฒดโ†’ํ•˜๋‚˜ (Reduction)

  • ์ž…๋ ฅ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ์ œ๊ณต
  • ์ถœ๋ ฅ: ์Šค๋ ˆ๋“œ 0์ด ์ง‘๊ณ„๋œ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 
  • ์šฉ๋„: ํ•ฉ๊ณ„, ์ตœ๋Œ“๊ฐ’ ๊ณ„์‚ฐ ๋“ฑ

2. block.prefix_sum() - ์ „์ฒดโ†’๊ฐ๊ฐ (Scan)

  • ์ž…๋ ฅ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ์ œ๊ณต
  • ์ถœ๋ ฅ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์  ์œ„์น˜๋ฅผ ์ˆ˜์‹ 
  • ์šฉ๋„: ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ, ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹

3. block.broadcast() - ํ•˜๋‚˜โ†’์ „์ฒด (Broadcast)

  • ์ž…๋ ฅ: ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ์ œ๊ณต (์ผ๋ฐ˜์ ์œผ๋กœ ์Šค๋ ˆ๋“œ 0)
  • ์ถœ๋ ฅ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ์ˆ˜์‹ 
  • ์šฉ๋„: ๊ณ„์‚ฐ๋œ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ , ์„ค์ •๊ฐ’ ๋ถ„๋ฐฐ

์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์ง„ํ–‰:

  1. ์ˆ˜๋™ ์กฐ์œจ (Puzzle 12): ๋ณ‘๋ ฌ ๊ธฐ์ดˆ ์ดํ•ด
  2. ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ (Puzzle 24): ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํŒจํ„ด ํ•™์Šต
  3. ๋ธ”๋ก ๋ฆฌ๋•์…˜ (block.sum()): ์ „์ฒดโ†’ํ•˜๋‚˜ ํ†ต์‹  ํ•™์Šต
  4. ๋ธ”๋ก ์Šค์บ” (block.prefix_sum()): ์ „์ฒดโ†’๊ฐ๊ฐ ํ†ต์‹  ํ•™์Šต
  5. ๋ธ”๋ก ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ (block.broadcast()): ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹  ํ•™์Šต

์ „์ฒด ๊ทธ๋ฆผ: ๋ธ”๋ก ์—ฐ์‚ฐ์€ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๊ธฐ๋ณธ ํ†ต์‹  ๋นŒ๋”ฉ ๋ธ”๋ก์„ ์ œ๊ณตํ•˜๋ฉฐ, ๋ณต์žกํ•œ ์ˆ˜๋™ ์กฐ์œจ์„ ๊น”๋”ํ•˜๊ณ  ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ์™€ ๊ธฐ์ˆ  ๋ถ„์„

์ •๋Ÿ‰์  ์„ฑ๋Šฅ ๋น„๊ต:

block.broadcast() vs ๊ธฐ์กด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ (์ฐธ๊ณ ์šฉ):

๊ธฐ์กด ์ˆ˜๋™ ๋ฐฉ์‹:

1๋‹จ๊ณ„: ์ˆ˜๋™ ๋ฆฌ๋•์…˜
  โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ~5 ์‚ฌ์ดํด
  โ€ข ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”: ~10 ์‚ฌ์ดํด
  โ€ข ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ๋ฃจํ”„: ~15 ์‚ฌ์ดํด
  โ€ข ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅํ•œ ์ˆ˜๋™ ์ธ๋ฑ์‹ฑ

2๋‹จ๊ณ„: ํ‰๊ท  ๊ณ„์‚ฐ: ~2 ์‚ฌ์ดํด

3๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
  โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ˆ˜๋™ ์“ฐ๊ธฐ: ~2 ์‚ฌ์ดํด
  โ€ข ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”: ~10 ์‚ฌ์ดํด
  โ€ข ๋ชจ๋“  ์Šค๋ ˆ๋“œ ์ฝ๊ธฐ: ~3 ์‚ฌ์ดํด

์ดํ•ฉ: ~47 ์‚ฌ์ดํด
  + ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  + ๊ฒฝ์Ÿ ์ƒํƒœ ๊ฐ€๋Šฅ์„ฑ
  + ์ˆ˜๋™ ์˜ค๋ฅ˜ ๋””๋ฒ„๊น…

๋ธ”๋ก ์—ฐ์‚ฐ ๋ฐฉ์‹:

1๋‹จ๊ณ„: block.sum()
  โ€ข ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ~3 ์‚ฌ์ดํด
  โ€ข ์ž๋™ ๋ฐฐ๋ฆฌ์–ด: ๋ช…์‹œ์  ๋น„์šฉ 0
  โ€ข ์ตœ์ ํ™”๋œ ๋ฆฌ๋•์…˜: ~8 ์‚ฌ์ดํด
  โ€ข ๊ฒ€์ฆ๋œ ์˜ฌ๋ฐ”๋ฅธ ๊ตฌํ˜„

2๋‹จ๊ณ„: ํ‰๊ท  ๊ณ„์‚ฐ: ~2 ์‚ฌ์ดํด

3๋‹จ๊ณ„: block.broadcast()
  โ€ข ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ~4 ์‚ฌ์ดํด
  โ€ข ์ž๋™ ๋ถ„๋ฐฐ: ๋ช…์‹œ์  ๋น„์šฉ 0
  โ€ข ๊ฒ€์ฆ๋œ ์˜ฌ๋ฐ”๋ฅธ ๊ตฌํ˜„

์ดํ•ฉ: ~17 ์‚ฌ์ดํด
  + ์ž๋™ ์ตœ์ ํ™”
  + ๋ณด์žฅ๋œ ์ •ํ™•์„ฑ
  + ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ์„ค๊ณ„

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ด์ :

์บ์‹œ ํšจ์œจ:

  • block.sum(): ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์บ์‹œ ๋ฏธ์Šค ๊ฐ์†Œ
  • block.broadcast(): ํšจ์œจ์ ์ธ ๋ถ„๋ฐฐ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ ์ตœ์†Œํ™”
  • ๊ฒฐํ•ฉ ์›Œํฌํ”Œ๋กœ์šฐ: ๋‹จ์ผ ์ปค๋„์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์™•๋ณต์„ 100% ๊ฐ์†Œ

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ:

๊ธฐ์กด ๋ฉ€ํ‹ฐ ์ปค๋„ ๋ฐฉ์‹:
  ์ปค๋„ 1: ์ž…๋ ฅ โ†’ ๋ฆฌ๋•์…˜ โ†’ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ
  ์ปค๋„ 2: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ โ†’ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ โ†’ ์ถœ๋ ฅ
  ์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก: ๋ฐฐ์—ด ํฌ๊ธฐ์˜ 3๋ฐฐ

๋ธ”๋ก ์—ฐ์‚ฐ ๋‹จ์ผ ์ปค๋„:
  ์ž…๋ ฅ โ†’ block.sum() โ†’ block.broadcast() โ†’ ์ถœ๋ ฅ
  ์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก: ๋ฐฐ์—ด ํฌ๊ธฐ์˜ 2๋ฐฐ (33% ๊ฐœ์„ )

๊ฐ ๋ธ”๋ก ์—ฐ์‚ฐ์˜ ์ตœ์  ์‚ฌ์šฉ ์‹œ๋‚˜๋ฆฌ์˜ค:

block.sum() ์ตœ์  ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ๋ฐ์ดํ„ฐ ์ง‘๊ณ„: ํ•ฉ๊ณ„, ํ‰๊ท , ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’ ๊ณ„์‚ฐ
  • ๋ฆฌ๋•์…˜ ํŒจํ„ด: ์ „์ฒดโ†’ํ•˜๋‚˜ ํ†ต์‹ ์ด ํ•„์š”ํ•œ ๋ชจ๋“  ๊ฒฝ์šฐ
  • ํ†ต๊ณ„ ์—ฐ์‚ฐ: ํ‰๊ท , ๋ถ„์‚ฐ, ์ƒ๊ด€๊ด€๊ณ„ ๊ณ„์‚ฐ

block.prefix_sum() ์ตœ์  ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹: ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜
  • ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ: ๋ณ‘๋ ฌ ์ถœ๋ ฅ ์ƒ์„ฑ
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ •๋ ฌ, ๊ฒ€์ƒ‰, ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ

block.broadcast() ์ตœ์  ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ถ„๋ฐฐ: ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ณต์œ 
  • ์„ค์ • ์ „ํŒŒ: ๋ชจ๋“œ ํ”Œ๋ž˜๊ทธ, ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ, ์ž„๊ณ„๊ฐ’
  • ์กฐ์œจ๋œ ์ฒ˜๋ฆฌ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๊ณ„์‚ฐ๋œ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•  ๋•Œ

์กฐํ•ฉ์˜ ์ด์ :

๊ฐœ๋ณ„ ์—ฐ์‚ฐ:   ์ข‹์€ ์„ฑ๋Šฅ, ์ œํ•œ๋œ ๋ฒ”์œ„
๊ฒฐํ•ฉ ์—ฐ์‚ฐ:   ํƒ์›”ํ•œ ์„ฑ๋Šฅ, ํฌ๊ด„์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜

์‹ค์ œ ์‘์šฉ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์กฐํ•ฉ ์˜ˆ์‹œ:
โ€ข block.sum() + block.broadcast():       ์ •๊ทœํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜
โ€ข block.prefix_sum() + block.sum():      ๊ณ ๊ธ‰ ํŒŒํ‹ฐ์…”๋‹
โ€ข ์„ธ ๊ฐ€์ง€ ๋ชจ๋‘ ๊ฒฐํ•ฉ:                      ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜
โ€ข ๊ธฐ์กด ํŒจํ„ด๊ณผ ํ•จ๊ป˜:                       ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์ตœ์ ํ™” ์ „๋žต

๋‹ค์Œ ๋‹จ๊ณ„

์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ 3๋ถ€์ž‘์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์นœ ์—ฐ์‚ฐ ์กฐ์œจ
  • ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒจํ„ด: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ๊ฒฐํ•ฉ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”: ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ์ด๋™ ํŒจํ„ด
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„: ๋ธ”๋ก ์—ฐ์‚ฐ ๋นŒ๋”ฉ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐํ™”
  • ์„ฑ๋Šฅ ์ตœ์ ํ™”: ์ตœ์ ์˜ ๋ธ”๋ก ํฌ๊ธฐ์™€ ์—ฐ์‚ฐ ์กฐํ•ฉ ์„ ํƒ

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ๋ธ”๋ก ์—ฐ์‚ฐ 3๋ถ€์ž‘(sum, prefix_sum, broadcast)์€ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ์™„์ „ํ•œ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ์—ฐ์‚ฐ๋“ค์„ ์กฐํ•ฉํ•˜๋ฉด GPU ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊น”๋”ํ•˜๊ณ  ์œ ์ง€๋ณด์ˆ˜ํ•˜๊ธฐ ์‰ฌ์šด ์ฝ”๋“œ๋กœ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ‰๊ท  ์ •๊ทœํ™”๋Š” ์ด ์—ฐ์‚ฐ๋“ค์ด ํ•จ๊ป˜ ์ž‘๋™ํ•˜์—ฌ ์‹ค์ œ ์—ฐ์‚ฐ ๋ฌธ์ œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 28: ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ๊ณผ ๋ณต์‚ฌ ์ค‘์ฒฉ

GPU ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ ํ˜„์ƒ: ์‹ค์ œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋Œ€๋ถ€๋ถ„์€ ์ขŒ์ ˆ์Šค๋Ÿฌ์šด ๋ฒฝ์— ๋ถ€๋”ชํž™๋‹ˆ๋‹ค - ์—ฐ์‚ฐ ๋Šฅ๋ ฅ์ด ์•„๋‹ˆ๋ผ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์˜ํ•ด ์ œํ•œ๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋น„์‹ผ GPU ์ฝ”์–ด๊ฐ€ ๋А๋ฆฐ DRAM์—์„œ ๋ฐ์ดํ„ฐ๊ฐ€ ๋„์ฐฉํ•˜๊ธฐ๋ฅผ ๊ธฐ๋‹ค๋ฆฌ๋ฉฐ ๋†€๊ณ  ์žˆ๋Š” ๊ฒƒ์ด์ฃ .

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ํ”ํžˆ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

# ์„ฑ๋Šฅ์˜ ์  - ์ˆœ์ฐจ์  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
load_input_tile()     # โ† DRAM ๋Œ€๊ธฐ 500 ์‚ฌ์ดํด
load_kernel_data()    # โ† ๋˜ 100 ์‚ฌ์ดํด ๋Œ€๊ธฐ
barrier()             # โ† ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ ํœด ๋Œ€๊ธฐ
compute()             # โ† ๋“œ๋””์–ด ์‹ค์ œ ์—ฐ์‚ฐ 50 ์‚ฌ์ดํด
# ์ด: 650 ์‚ฌ์ดํด, ์—ฐ์‚ฐ ํ™œ์šฉ๋ฅ  ๊ฒจ์šฐ 7.7%!

์ด๋ ‡๊ฒŒ ํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด ์–ด๋–จ๊นŒ์š”?

# ์„ฑ๋Šฅ ๊ฐœ์„  - ์ค‘์ฒฉ ์—ฐ์‚ฐ
launch_async_load()   # โ† ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ 500 ์‚ฌ์ดํด ์ „์†ก ์‹œ์ž‘
load_small_data()     # โ† ๋Œ€๊ธฐ ์ค‘ ์œ ์šฉํ•œ ์ž‘์—… 100 ์‚ฌ์ดํด
wait_and_compute()    # โ† ๋‚˜๋จธ์ง€ ~400 ์‚ฌ์ดํด๋งŒ ๋Œ€๊ธฐ ํ›„ ์—ฐ์‚ฐ
# ์ด: ~550 ์‚ฌ์ดํด, 45% ํ–ฅ์ƒ!

์ด๊ฒƒ์ด ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์˜ ์œ„๋ ฅ์ž…๋‹ˆ๋‹ค - ๋А๋ฆฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ GPU์˜ ์ž ์žฌ๋ ฅ์„ ์ตœ๋Œ€ํ•œ ๋ฐœํœ˜ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด ๋ƒ…๋‹ˆ๋‹ค.

์™œ ์ค‘์š”ํ•œ๊ฐ€

์ด ํผ์ฆ์—์„œ๋Š” Puzzle 13์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ 1D ํ•ฉ์„ฑ๊ณฑ์„ ์—ฐ์‚ฐ ๋’ค์— ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๊ณ ์„ฑ๋Šฅ ๊ตฌํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ํ•™์ˆ ์  ์—ฐ์Šต์ด ์•„๋‹™๋‹ˆ๋‹ค - ์ด ํŒจํ„ด๋“ค์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค:

  • ๋”ฅ๋Ÿฌ๋‹: ๊ฐ€์ค‘์น˜์™€ ํ™œ์„ฑํ™”๊ฐ’์˜ ํšจ์œจ์  ๋กœ๋”ฉ
  • ๊ณผํ•™ ์—ฐ์‚ฐ: ์Šคํ…์‹ค ์—ฐ์‚ฐ์—์„œ ๋ฐ์ดํ„ฐ ์ „์†ก ์ค‘์ฒฉ
  • ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ: ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ํ†ตํ•œ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ์ŠคํŠธ๋ฆฌ๋ฐ
  • ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ƒ์‚ฐ์ ์ธ ์ž‘์—…์œผ๋กœ ์ „ํ™˜

์‚ฌ์ „ ์ค€๋น„

์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์„ ํ™•์‹คํžˆ ์ดํ•ดํ•˜๊ณ  ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

ํ•„์ˆ˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 8, Puzzle 16) - matmul ํŒจํ„ด์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ(coalescing) (Puzzle 21) - ์ตœ์ ์˜ ๋น„๋™๊ธฐ ์ „์†ก์— ํ•„์ˆ˜
  • ํƒ€์ผ ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ (Puzzle 23) - ์ด ์ตœ์ ํ™”์˜ ๊ธฐ๋ฐ˜

ํ•˜๋“œ์›จ์–ด ์ดํ•ด:

  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ (DRAM โ†’ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ ˆ์ง€์Šคํ„ฐ)
  • ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๊ตฌ์„ฑ๊ณผ ๋™๊ธฐํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„ vs. ๋Œ€์—ญํญ์— ๋Œ€ํ•œ ๊ธฐ๋ณธ ์ดํ•ด

API ์ˆ™์ง€: Mojo GPU Memory Operations

โš ๏ธ ํ•˜๋“œ์›จ์–ด ํ˜ธํ™˜์„ฑ ์ฐธ๊ณ : ์ด ํผ์ฆ์€ ์ตœ์‹  GPU ์•„ํ‚คํ…์ฒ˜๊ฐ€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ๋Š” ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ(copy_dram_to_sram_async, async_copy_wait_all)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. .async ์ˆ˜์ •์ž๋‚˜ ์ง€์›๋˜์ง€ ์•Š๋Š” ์—ฐ์‚ฐ ๊ด€๋ จ ์ปดํŒŒ์ผ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ํ•ด๋‹น GPU๊ฐ€ ์ด ๊ธฐ๋Šฅ์„ ์ง€์›ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜๋„ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ํŒจํ„ด์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ๊ฐœ๋…์€ ์—ฌ์ „ํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

GPU ์ปดํ“จํŒ… ๋Šฅ๋ ฅ ํ™•์ธ:

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader,nounits
  • SM_70 ์ด์ƒ (์˜ˆ: V100, T4, A10G, RTX 20+ ์‹œ๋ฆฌ์ฆˆ): ๊ธฐ๋ณธ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ง€์›
  • SM_80 ์ด์ƒ (์˜ˆ: A100, RTX 30+ ์‹œ๋ฆฌ์ฆˆ): ์ „์ฒด ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๊ธฐ๋Šฅ
  • SM_90 ์ด์ƒ (์˜ˆ: H100, RTX 40+ ์‹œ๋ฆฌ์ฆˆ): ๊ณ ๊ธ‰ TMA ์—ฐ์‚ฐ ์ง€์›

ํ•™์Šต ๋‚ด์šฉ

์ด ํผ์ฆ์„ ๋งˆ์น˜๋ฉด ๋‹ค์Œ์„ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๊ธฐ๋ฒ•

  • ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๊ธฐ๋ณธ ์š”์†Œ: ๋ฐฑ๊ทธ๋ผ์šด๋“œ DRAMโ†’SRAM ์ „์†ก ์‹œ์ž‘
  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ(latency hiding): ๋น„์šฉ์ด ํฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์œ ์šฉํ•œ ์—ฐ์‚ฐ๊ณผ ์ค‘์ฒฉ
  • ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด์— ๋งž์ถ”๊ธฐ
  • ํŒŒ์ดํ”„๋ผ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ์„ ๊ทน๋Œ€ํ™”ํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐํ™”

์ฃผ์š” API

Puzzle 16์˜ ๊ด€์šฉ์  matmul์—์„œ ์†Œ๊ฐœํ•œ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ด์ œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์ž ์žฌ๋ ฅ์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค:

  • copy_dram_to_sram_async(): ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐฑ๊ทธ๋ผ์šด๋“œ DRAMโ†’SRAM ์ „์†ก ์‹œ์ž‘
  • async_copy_wait_all(): ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ „ ์ „์†ก ์™„๋ฃŒ ๋™๊ธฐํ™”

Puzzle 16๊ณผ ๋‹ค๋ฅธ ์ ์€? Puzzle 16์—์„œ๋Š” matmul์˜ ๊น”๋”ํ•œ ํƒ€์ผ ๋กœ๋”ฉ์„ ์œ„ํ•ด ๋น„๋™๊ธฐ ๋ณต์‚ฌ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค๋ฉด, ์ด ํผ์ฆ์€ ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค - ๋น„์šฉ์ด ํฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ๊ณผ ์œ ์šฉํ•œ ์—ฐ์‚ฐ ์ž‘์—…์„ ์ค‘์ฒฉํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌ์กฐํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ํšจ๊ณผ

์ด ๊ธฐ๋ฒ•๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค:

  • DRAM ์ง€์—ฐ ์‹œ๊ฐ„ ์ˆจ๊ธฐ๊ธฐ: ์œ ํœด ๋Œ€๊ธฐ๋ฅผ ์ƒ์‚ฐ์ ์ธ ์—ฐ์‚ฐ ์‹œ๊ฐ„์œผ๋กœ ์ „ํ™˜
  • ๋Œ€์—ญํญ ๊ทน๋Œ€ํ™”: ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์บ์‹œ ๋ฏธ์Šค ๋ฐฉ์ง€
  • ํŒŒ์ดํ”„๋ผ์ธ ํšจ์œจ: ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์ด ๋ณ‘๋ ฌ๋กœ ์ผ์–ด๋‚˜๋Š” ๋™์•ˆ ์—ฐ์‚ฐ ์œ ๋‹›์„ ๋ฐ”์˜๊ฒŒ ์œ ์ง€

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์ด๋ž€? ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์€ GPU ๋ธ”๋ก์ด ๋‹ค๋ฅธ ์ž‘์—…์„ ๊ณ„์†ํ•˜๋Š” ๋™์•ˆ ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์‹คํ–‰๋˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ด๋™์„ ์ค‘์ฒฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ทผ๋ณธ์ ์ธ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์ด๊ฒƒ์„ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์œ„ํ•œ ํŒŒ์ดํ”„๋ผ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š” - ๋‹จ๊ณ„๋ฅผ ์ค‘์ฒฉํ•˜๊ณ , ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ณ , ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๊ทน๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค. ๋ชฉํ‘œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ด๋™ํ•˜๋Š” ๋™์•ˆ ๋น„์‹ผ ์—ฐ์‚ฐ ์œ ๋‹›์„ ๋ฐ”์˜๊ฒŒ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ—ค์ผ๋กœ ์˜์—ญ ์ดํ•ดํ•˜๊ธฐ

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์œผ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ํ•ฉ์„ฑ๊ณฑ๊ณผ ๊ฐ™์€ ์Šคํ…์‹ค ์—ฐ์‚ฐ์˜ ํƒ€์ผ ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ์— ํ•„์ˆ˜์ ์ธ ํ—ค์ผ๋กœ ์˜์—ญ(ghost cell ๋˜๋Š” guard cell์ด๋ผ๊ณ ๋„ ํ•จ)์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ—ค์ผ๋กœ ์˜์—ญ์ด๋ž€?

ํ—ค์ผ๋กœ ์˜์—ญ์€ ์Šคํ…์‹ค ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ์ด์›ƒ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ์œ„ํ•ด ์ฒ˜๋ฆฌ ํƒ€์ผ์˜ ๊ฒฝ๊ณ„๋ฅผ ๋„˜์–ด ํ™•์žฅ๋˜๋Š” ์ถ”๊ฐ€ ์š”์†Œ์ž…๋‹ˆ๋‹ค. ํƒ€์ผ ๊ฐ€์žฅ์ž๋ฆฌ ๊ทผ์ฒ˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ, ์Šคํ…์‹ค ์—ฐ์‚ฐ์€ ์ธ์ ‘ ํƒ€์ผ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ—ค์ผ๋กœ ์˜์—ญ์ด ํ•„์š”ํ•œ ์ด์œ 

ํƒ€์ผ์—์„œ 5์  ์ปค๋„์„ ์‚ฌ์šฉํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ ์ƒ๊ฐํ•ด ๋ด…์‹œ๋‹ค:

์›๋ณธ ๋ฐ์ดํ„ฐ:      [... | a b c d e f g h i j k l m n o | ...]
์ฒ˜๋ฆฌ ํƒ€์ผ:              [c d e f g h i j k l m n o]
                            ^                 ^
                      ์™ผ์ชฝ ํƒ€์ผ์—์„œ        ์˜ค๋ฅธ์ชฝ ํƒ€์ผ์—์„œ
                      ์ด์›ƒ ํ•„์š”           ์ด์›ƒ ํ•„์š”

ํ—ค์ผ๋กœ ํฌํ•จ:       [a b | c d e f g h i j k l m n o | p q]
                 ^^^                               ^^^
                 ์™ผ์ชฝ ํ—ค์ผ๋กœ                     ์˜ค๋ฅธ์ชฝ ํ—ค์ผ๋กœ

์ฃผ์š” ํŠน์„ฑ:

  • ํ—ค์ผ๋กœ ํฌ๊ธฐ: ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ ์ธก๋ฉด์— KERNEL_SIZE // 2๊ฐœ ์š”์†Œ
  • ๋ชฉ์ : ํƒ€์ผ ๊ฒฝ๊ณ„์—์„œ ์ •ํ™•ํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ ๊ฐ€๋Šฅ
  • ๋‚ด์šฉ: ์ด์›ƒ ํƒ€์ผ์˜ ๋ฐ์ดํ„ฐ ๋ณต์‚ฌ๋ณธ ๋˜๋Š” ๊ฒฝ๊ณ„ ์กฐ๊ฑด
  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ํฐ ์—ฐ์‚ฐ ์ด์ ์„ ์œ„ํ•œ ์ ์€ ์ถ”๊ฐ€ ์ €์žฅ ๊ณต๊ฐ„

ํ•ฉ์„ฑ๊ณฑ์—์„œ์˜ ํ—ค์ผ๋กœ ์˜์—ญ

5์  ํ•ฉ์„ฑ๊ณฑ ์ปค๋„ \([k_0, k_1, k_2, k_3, k_4]\)์˜ ๊ฒฝ์šฐ:

  • ์ค‘์‹ฌ ์š”์†Œ: \(k_2\)๊ฐ€ ํ˜„์žฌ ์ฒ˜๋ฆฌ ์š”์†Œ์™€ ์ •๋ ฌ
  • ์™ผ์ชฝ ์ด์›ƒ: \(k_0, k_1\)์€ ์™ผ์ชฝ 2๊ฐœ ์š”์†Œ ํ•„์š”
  • ์˜ค๋ฅธ์ชฝ ์ด์›ƒ: \(k_3, k_4\)์€ ์˜ค๋ฅธ์ชฝ 2๊ฐœ ์š”์†Œ ํ•„์š”
  • ํ—ค์ผ๋กœ ํฌ๊ธฐ: ๊ฐ ์ธก๋ฉด์— HALO_SIZE = 5 // 2 = 2๊ฐœ ์š”์†Œ

ํ—ค์ผ๋กœ ์˜์—ญ ์—†์ด:

  • ํƒ€์ผ ๊ฒฝ๊ณ„ ์š”์†Œ์—์„œ ์ „์ฒด ํ•ฉ์„ฑ๊ณฑ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์—†์Œ
  • ์ž˜๋ชป๋œ ์ถœ๋ ฅ์ด๋‚˜ ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ๋กœ์ง์ด ํ•„์š”
  • ๋ถ„์‚ฐ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์„ฑ๋Šฅ ์ €ํ•˜

ํ—ค์ผ๋กœ ์˜์—ญ ์‚ฌ์šฉ ์‹œ:

  • ๋ชจ๋“  ํƒ€์ผ ์š”์†Œ๊ฐ€ ๋กœ์ปฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ฒด ํ•ฉ์„ฑ๊ณฑ ์ˆ˜ํ–‰ ๊ฐ€๋Šฅ
  • ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ๊ฐ„๊ฒฐํ•˜๊ณ  ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ
  • ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ

์ด ๊ฐœ๋…์€ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•  ๋•Œ ํŠนํžˆ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ํ—ค์ผ๋กœ ์˜์—ญ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋กœ๋”ฉํ•˜๊ณ  ๋™๊ธฐํ™”ํ•ด์•ผ ์—ฌ๋Ÿฌ ํƒ€์ผ์— ๊ฑธ์นœ ์ •ํ™•ํ•œ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ค‘์ฒฉ์„ ํ™œ์šฉํ•œ 1D ํ•ฉ์„ฑ๊ณฑ

Puzzle 13 ๊ธฐ๋ฐ˜: ์ด ํผ์ฆ์€ Puzzle 13์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๋‹ค์‹œ ๋‹ค๋ฃจ์ง€๋งŒ, ์ด๋ฒˆ์—๋Š” ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์—ฐ์‚ฐ ๋’ค์— ์ˆจ๊ธฐ๋Š” ์ตœ์ ํ™”๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ๋™๊ธฐ์‹ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋Œ€์‹ , ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ์‚ฌ์šฉํ•˜์—ฌ ๋น„์šฉ์ด ํฐ DRAM ์ „์†ก๊ณผ ์œ ์šฉํ•œ ์ž‘์—…์„ ์ค‘์ฒฉํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: VECTOR_SIZE = 16384 (์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์นœ 16K ์š”์†Œ)
  • ํƒ€์ผ ํฌ๊ธฐ: CONV_TILE_SIZE = 256 (์ฒ˜๋ฆฌ ํƒ€์ผ ํฌ๊ธฐ)
  • ๋ธ”๋ก ๊ตฌ์„ฑ: ๋ธ”๋ก๋‹น (256, 1) ์Šค๋ ˆ๋“œ
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๊ทธ๋ฆฌ๋“œ๋‹น (VECTOR_SIZE // CONV_TILE_SIZE, 1) ๋ธ”๋ก (64๊ฐœ ๋ธ”๋ก)
  • ์ปค๋„ ํฌ๊ธฐ: KERNEL_SIZE = 5 (Puzzle 13๊ณผ ๋™์ผํ•œ ๊ฐ„๋‹จํ•œ 1D ํ•ฉ์„ฑ๊ณฑ)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: Layout.row_major(VECTOR_SIZE) (1D row-major)

๋น„๋™๊ธฐ ๋ณต์‚ฌ์˜ ๊ธฐํšŒ

Puzzle 16 ๊ธฐ๋ฐ˜: matmul์—์„œ ๊น”๋”ํ•œ ํƒ€์ผ ๋กœ๋”ฉ์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ์ด๋ฏธ ๋ณด์…จ์Šต๋‹ˆ๋‹ค. ์ด์ œ ๊ณ ์„ฑ๋Šฅ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ์ธ ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ ๊ธฐ๋Šฅ์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ์กด์˜ ๋™๊ธฐ์‹ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ์€ ์ „์†ก ์ค‘ ์—ฐ์‚ฐ ์œ ๋‹›์„ ์œ ํœด ์ƒํƒœ๋กœ ๋Œ€๊ธฐํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์€ ์ „์†ก๊ณผ ์œ ์šฉํ•œ ์ž‘์—…์˜ ์ค‘์ฒฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค:

# ๋™๊ธฐ์‹ ์ ‘๊ทผ - ๋น„ํšจ์œจ์ :
for i in range(CONV_TILE_SIZE):
    input_shared[i] = input[base_idx + i]  # ๊ฐ ๋กœ๋“œ๊ฐ€ DRAM์„ ๊ธฐ๋‹ค๋ฆผ
for i in range(KERNEL_SIZE):
    kernel_shared[i] = kernel[i]           # DRAM ์ถ”๊ฐ€ ๋Œ€๊ธฐ
barrier()  # ์—ฐ์‚ฐ ์‹œ์ž‘ ์ „ ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋Œ€๊ธฐ
# โ†‘ ์ด ์‹œ๊ฐ„ = input_transfer_time + kernel_transfer_time

# ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ ‘๊ทผ - ํšจ์œจ์ :
copy_dram_to_sram_async[thread_layout](input_shared, input_tile)  # ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก ์‹œ์ž‘
# ์ž…๋ ฅ์ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ „์†ก๋˜๋Š” ๋™์•ˆ, ์ปค๋„์„ ๋™๊ธฐ์‹์œผ๋กœ ๋กœ๋”ฉ
for i in range(KERNEL_SIZE):
    kernel_shared[i] = kernel[i]  # ๋น„๋™๊ธฐ ์ž…๋ ฅ ์ „์†ก๊ณผ ์ค‘์ฒฉ
async_copy_wait_all()  # ๋‘ ์—ฐ์‚ฐ์ด ๋ชจ๋‘ ์™„๋ฃŒ๋  ๋•Œ๋งŒ ๋Œ€๊ธฐ
# โ†‘ ์ด ์‹œ๊ฐ„ = MAX(input_transfer_time, kernel_transfer_time)

๋น„๋™๊ธฐ ๋ณต์‚ฌ๊ฐ€ ์ž˜ ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  • ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„: ์ตœ์‹  GPU๋Š” ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์šฐํšŒํ•˜๊ณ  ์ง„์ •ํ•œ ์—ฐ์‚ฐ-๋ฉ”๋ชจ๋ฆฌ ์ค‘์ฒฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค (Puzzle 16์—์„œ ์„ค๋ช…)
  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค
  • ์ตœ์ ์˜ ๋ณ‘ํ•ฉ: ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์ด ํšจ์œจ์ ์ธ DRAM ์ ‘๊ทผ ํŒจํ„ด์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ฆฌ์†Œ์Šค ํ™œ์šฉ: ์—ฐ์‚ฐ ์œ ๋‹›์ด ์œ ํœด ๋Œ€๊ธฐ ๋Œ€์‹  ๊ณ„์† ๋ฐ”์˜๊ฒŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค

์™„์„ฑํ•  ์ฝ”๋“œ

Puzzle 16์˜ matmul ๊ตฌํ˜„ ํŒจํ„ด์„ ๋”ฐ๋ผ, ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก๊ณผ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ๋น„๋™๊ธฐ ๋ณต์‚ฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ๋ฒกํ„ฐ์— ๋Œ€ํ•œ 1D ํ•ฉ์„ฑ๊ณฑ์„ ํšจ์œจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\text{output}[i] = \sum_{k=0}^{\text{KERNEL_SIZE}-1} \text{input}[i+k-\text{HALO_SIZE}] \times \text{kernel}[k]\]

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  1. ๋น„๋™๊ธฐ ํƒ€์ผ ๋กœ๋”ฉ: ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ๋ฐฑ๊ทธ๋ผ์šด๋“œ DRAMโ†’SRAM ์ „์†ก ์‹œ์ž‘
  2. ์ค‘์ฒฉ ์—ฐ์‚ฐ: ์ž…๋ ฅ ์ „์†ก ์ค‘ ์ž‘์€ ์ปค๋„ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
  3. ๋™๊ธฐํ™”: ์ „์†ก ์™„๋ฃŒ ๋Œ€๊ธฐ ํ›„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ
comptime VECTOR_SIZE = 16384
comptime CONV_TILE_SIZE = 256
comptime KERNEL_SIZE = 5
comptime HALO_SIZE = KERNEL_SIZE // 2  # Halo elements needed for boundary
comptime BUFFER_SIZE = CONV_TILE_SIZE + 2 * HALO_SIZE  # Include halo for boundary conditions
comptime BLOCKS_PER_GRID_ASYNC = (
    VECTOR_SIZE + CONV_TILE_SIZE - 1
) // CONV_TILE_SIZE
comptime THREADS_PER_BLOCK_ASYNC = 256
comptime dtype = DType.float32
comptime layout_async = Layout.row_major(VECTOR_SIZE)


fn async_copy_overlap_convolution[
    dtype: DType, layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    kernel: LayoutTensor[dtype, Layout.row_major(KERNEL_SIZE), ImmutAnyOrigin],
):
    """Demonstrates async copy operations building on p14 patterns.

    This shows how to use copy_dram_to_sram_async and async_copy_wait_all
    for efficient memory transfers, extending the patterns from p14 matmul.
    """

    # Shared memory buffers (like p14, but without .fill(0) to avoid race)
    input_shared = LayoutTensor[
        dtype,
        Layout.row_major(CONV_TILE_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    kernel_shared = LayoutTensor[
        dtype,
        Layout.row_major(KERNEL_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # FILL IN HERE (roughly 19 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p28/p28.mojo

ํŒ

1. ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์ดํ•ด

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์€ ๋ธ”๋ก์ด ๋‹ค๋ฅธ ์ฝ”๋“œ๋ฅผ ๊ณ„์† ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

ํƒ๊ตฌํ•  ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • DRAM์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์–ด๋–ค ๋ฐ์ดํ„ฐ๋ฅผ ์ „์†กํ•ด์•ผ ํ•˜๋Š”๊ฐ€?
  • ์ „์†ก์ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ผ์–ด๋‚˜๋Š” ๋™์•ˆ ์–ด๋–ค ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?
  • ํ•˜๋“œ์›จ์–ด๊ฐ€ ์—ฌ๋Ÿฌ ๋™์‹œ ์—ฐ์‚ฐ์„ ์–ด๋–ป๊ฒŒ ์กฐ์œจํ•˜๋Š”๊ฐ€?

์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๋ธ”๋ก์—๋Š” THREADS_PER_BLOCK_ASYNC = 256๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค
  • ํƒ€์ผ์—๋Š” CONV_TILE_SIZE = 256๊ฐœ์˜ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค
  • ์–ด๋–ค ๋ ˆ์ด์•„์›ƒ ํŒจํ„ด์ด ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๋ณด์žฅํ•˜๋Š”๊ฐ€?

2. ์ค‘์ฒฉ ๊ธฐํšŒ ํŒŒ์•…

๋ชฉํ‘œ๋Š” ์œ ์šฉํ•œ ์—ฐ์‚ฐ ๋’ค์— ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋ถ„์„ ์ ‘๊ทผ๋ฒ•:

  • ์–ด๋–ค ์—ฐ์‚ฐ์ด ์ˆœ์ฐจ์ ์œผ๋กœ vs. ๋ณ‘๋ ฌ๋กœ ์ผ์–ด๋‚˜์•ผ ํ•˜๋Š”๊ฐ€?
  • ์–ด๋–ค ๋ฐ์ดํ„ฐ ์ „์†ก์ด ํฐ(๋น„์šฉ์ด ๋†’์€) vs. ์ž‘์€(๋น„์šฉ์ด ๋‚ฎ์€)๊ฐ€?
  • ๋ณ‘๋ ฌ ์‹คํ–‰์„ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์–ด๋–ป๊ฒŒ ๊ตฌ์กฐํ™”ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ํฐ ์ž…๋ ฅ ํƒ€์ผ: 256 ์š”์†Œ ร— 4 ๋ฐ”์ดํŠธ = 1KB ์ „์†ก
  • ์ž‘์€ ์ปค๋„: 5 ์š”์†Œ ร— 4 ๋ฐ”์ดํŠธ = 20 ๋ฐ”์ดํŠธ
  • ์–ด๋–ค ์ „์†ก์ด ๋น„๋™๊ธฐ ์ตœ์ ํ™”์˜ ์ด์ ์„ ๊ฐ€์žฅ ๋งŽ์ด ๋ฐ›๋Š”๊ฐ€?

3. ๋™๊ธฐํ™” ์ „๋žต

์ ์ ˆํ•œ ๋™๊ธฐํ™”๋Š” ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š์œผ๋ฉด์„œ ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

ํƒ€์ด๋ฐ ๋ถ„์„:

  • ๊ฐ ์—ฐ์‚ฐ์ด ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์ค€๋น„๋˜์–ด์•ผ ํ•˜๋Š” ์‹œ์ ์€ ์–ธ์ œ์ธ๊ฐ€?
  • ์ •ํ™•์„ฑ์„ ์œ„ํ•ด ํ•„์š”ํ•œ ์ตœ์†Œํ•œ์˜ ๋™๊ธฐํ™”๋Š” ๋ฌด์—‡์ธ๊ฐ€?
  • ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋ถˆํ•„์š”ํ•œ ์ •์ฒด๋ฅผ ์–ด๋–ป๊ฒŒ ํ”ผํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€:

  • ์ „์†ก์ด ์™„๋ฃŒ๋˜๊ธฐ ์ „์— ์—ฐ์‚ฐ์ด ์‹œ์ž‘๋˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”๊ฐ€?
  • ๋ฉ”๋ชจ๋ฆฌ ํŽœ์Šค์™€ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์–ด๋–ป๊ฒŒ ์กฐ์œจํ•˜๋Š”๊ฐ€?

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ค‘์ฒฉ ํ…Œ์ŠคํŠธ:

pixi run p28
pixi run -e amd p28
pixi run -e apple p28
uv run poe p28

์†”๋ฃจ์…˜

์ƒ์„ธ ์„ค๋ช…์ด ํฌํ•จ๋œ ์ „์ฒด ์†”๋ฃจ์…˜

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ค‘์ฒฉ ์†”๋ฃจ์…˜๋Š” ๋น„์šฉ์ด ํฐ DRAM ์ „์†ก๊ณผ ์œ ์šฉํ•œ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

fn async_copy_overlap_convolution[
    dtype: DType, layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    kernel: LayoutTensor[dtype, Layout.row_major(KERNEL_SIZE), ImmutAnyOrigin],
):
    """Demonstrates async copy operations building on p14 patterns.

    This shows how to use copy_dram_to_sram_async and async_copy_wait_all
    for efficient memory transfers, extending the patterns from p14 matmul.
    """

    # Shared memory buffers (like p14, but without .fill(0) to avoid race)
    input_shared = LayoutTensor[
        dtype,
        Layout.row_major(CONV_TILE_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    kernel_shared = LayoutTensor[
        dtype,
        Layout.row_major(KERNEL_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    local_i = Int(thread_idx.x)

    # Phase 1: Launch async copy for input tile
    # Note: tile() does NOT perform bounds checking - ensure valid tile bounds
    input_tile = input.tile[CONV_TILE_SIZE](Int(block_idx.x))

    # Use async copy with thread layout matching p14 pattern
    comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC)
    copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile)

    # Phase 2: Load kernel synchronously (small data)
    if local_i < KERNEL_SIZE:
        kernel_shared[local_i] = kernel[local_i]

    # Phase 3: Wait for async copy to complete
    async_copy_wait_all()  # Always wait since we always do async copy
    barrier()  # Sync all threads

    # Phase 4: Compute convolution
    global_i = Int(block_idx.x) * CONV_TILE_SIZE + local_i
    if local_i < CONV_TILE_SIZE and global_i < output.shape[0]():
        var result: output.element_type = 0

        # Simple convolution avoiding boundary issues
        if local_i >= HALO_SIZE and local_i < CONV_TILE_SIZE - HALO_SIZE:
            # Full convolution for center elements
            for k in range(KERNEL_SIZE):
                input_idx = local_i + k - HALO_SIZE
                if input_idx >= 0 and input_idx < CONV_TILE_SIZE:
                    result += input_shared[input_idx] * kernel_shared[k]
        else:
            # For boundary elements, just copy input (no convolution)
            result = input_shared[local_i]

        output[global_i] = result


๋‹จ๊ณ„๋ณ„ ๋ถ„์„

Phase 1: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‹œ์ž‘

# Phase 1: Launch async copy for input tile
input_tile = input.tile[CONV_TILE_SIZE](Int(block_idx.x))
comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC)
copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile)
  • ํƒ€์ผ ์ƒ์„ฑ: input.tile[CONV_TILE_SIZE](block_idx.x)๋Š” block_idx.x * 256์—์„œ ์‹œ์ž‘ํ•˜๋Š” 256๊ฐœ ์š”์†Œ์˜ ์ž…๋ ฅ ๋ฐฐ์—ด ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. Mojo์˜ tile ๋ฉ”์„œ๋“œ๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋‚˜ ์ œ๋กœ ํŒจ๋”ฉ์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ธ๋ฑ์Šค ์ ‘๊ทผ์€ ๋ฏธ์ •์˜ ๋™์ž‘์„ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„์—์„œ ํƒ€์ผ ํฌ๊ธฐ์™€ offset์ด ์œ ํšจํ•œ ๋ฐฐ์—ด ๋ฒ”์œ„ ๋‚ด์— ์žˆ๋Š”์ง€ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  • ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ: Layout.row_major(THREADS_PER_BLOCK_ASYNC, 1)๋Š” ๋ธ”๋ก ๊ตฌ์„ฑ๊ณผ ์ผ์น˜ํ•˜๋Š” 256 x 1 ๋ ˆ์ด์•„์›ƒ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ํ•„์ˆ˜์ž…๋‹ˆ๋‹ค - ์ตœ์ ์˜ ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ์ด ๋ฌผ๋ฆฌ์  ์Šค๋ ˆ๋“œ ๋ฐฐ์น˜์™€ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ ˆ์ด์•„์›ƒ์ด ์ผ์น˜ํ•˜์ง€ ์•Š์œผ๋ฉด ์Šค๋ ˆ๋“œ๊ฐ€ ๋น„์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ๋ณ‘ํ•ฉ์ด ๊นจ์ง€๊ณ  ์„ฑ๋Šฅ์ด ์‹ฌ๊ฐํ•˜๊ฒŒ ์ €ํ•˜๋ฉ๋‹ˆ๋‹ค.

  • ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‹œ์ž‘: copy_dram_to_sram_async๋Š” DRAM์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ์˜ ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋“œ์›จ์–ด๊ฐ€ 256๊ฐœ์˜ float(1KB)๋ฅผ ๋ณต์‚ฌํ•˜๋Š” ๋™์•ˆ ๋ธ”๋ก์€ ๊ณ„์† ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค.

Phase 2: ์ค‘์ฒฉ ์—ฐ์‚ฐ

# Phase 2: Load kernel synchronously (small data)
if local_i < KERNEL_SIZE:
    kernel_shared[local_i] = kernel[local_i]
  • ๋™์‹œ ์‹คํ–‰: 1KB ์ž…๋ ฅ ํƒ€์ผ์ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ „์†ก๋˜๋Š” ๋™์•ˆ, ์Šค๋ ˆ๋“œ๋“ค์€ ์ž‘์€ 20๋ฐ”์ดํŠธ ์ปค๋„์„ ๋™๊ธฐ์‹์œผ๋กœ ๋กœ๋”ฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์ค‘์ฒฉ์ด ํ•ต์‹ฌ ์ตœ์ ํ™”์ž…๋‹ˆ๋‹ค.

  • ํฌ๊ธฐ ๊ธฐ๋ฐ˜ ์ „๋žต: ํฐ ์ „์†ก(์ž…๋ ฅ ํƒ€์ผ)์€ ๋น„๋™๊ธฐ ๋ณต์‚ฌ๋ฅผ, ์ž‘์€ ์ „์†ก(์ปค๋„)์€ ๋™๊ธฐ์‹ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ณต์žก์„ฑ๊ณผ ์„ฑ๋Šฅ ์ด์ ์˜ ๊ท ํ˜•์„ ๋งž์ถฅ๋‹ˆ๋‹ค.

Phase 3: ๋™๊ธฐํ™”

# Phase 3: Wait for async copy to complete
async_copy_wait_all()  # Always wait since we always do async copy
barrier()  # Sync all threads
  • ์ „์†ก ์™„๋ฃŒ: async_copy_wait_all()์€ ๋ชจ๋“  ๋น„๋™๊ธฐ ์ „์†ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค. input_shared์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: barrier()๋Š” ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์‚ฐ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์— ์™„๋ฃŒ๋œ ์ „์†ก์„ ํ™•์ธํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

Phase 4: ์—ฐ์‚ฐ

# Phase 4: Compute convolution
global_i = Int(block_idx.x) * CONV_TILE_SIZE + local_i
if local_i < CONV_TILE_SIZE and global_i < output.shape[0]():
    var result: output.element_type = 0

    if local_i >= HALO_SIZE and local_i < CONV_TILE_SIZE - HALO_SIZE:
        # Full convolution for center elements
        for k in range(KERNEL_SIZE):
            input_idx = local_i + k - HALO_SIZE
            if input_idx >= 0 and input_idx < CONV_TILE_SIZE:
                result += input_shared[input_idx] * kernel_shared[k]
    else:
        # For boundary elements, just copy input (no convolution)
        result = input_shared[local_i]

    output[global_i] = result
  • ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋ชจ๋“  ์—ฐ์‚ฐ์ด ๋ฏธ๋ฆฌ ๋กœ๋“œ๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, ์—ฐ์‚ฐ ์ง‘์•ฝ์ ์ธ ํ•ฉ์„ฑ๊ณฑ ๋ฃจํ”„์—์„œ ๋А๋ฆฐ DRAM ์ ‘๊ทผ์„ ํ”ผํ•ฉ๋‹ˆ๋‹ค.

  • ๋‹จ์ˆœํ™”๋œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ด ๊ตฌํ˜„์€ ํƒ€์ผ ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์‹ค์šฉ์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

    • ์ค‘์‹ฌ ์š”์†Œ (local_i >= HALO_SIZE์ด๊ณ  local_i < CONV_TILE_SIZE - HALO_SIZE): ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ฒด 5์  ํ•ฉ์„ฑ๊ณฑ ์ ์šฉ
    • ๊ฒฝ๊ณ„ ์š”์†Œ (๊ฐ ํƒ€์ผ์˜ ์ฒ˜์Œ 2๊ฐœ์™€ ๋งˆ์ง€๋ง‰ 2๊ฐœ ์š”์†Œ): ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ๋กœ์ง์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ•ฉ์„ฑ๊ณฑ ์—†์ด ์ž…๋ ฅ์„ ์ง์ ‘ ๋ณต์‚ฌ

    ๊ต์œก์  ๊ทผ๊ฑฐ: ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ๋ณด๋‹ค ๋น„๋™๊ธฐ ๋ณต์‚ฌ ํŒจํ„ด ์‹œ์—ฐ์„ ์šฐ์„ ์‹œํ•ฉ๋‹ˆ๋‹ค. HALO_SIZE = 2์ธ 256๊ฐœ ์š”์†Œ ํƒ€์ผ์—์„œ, ์š”์†Œ 0-1๊ณผ 254-255๋Š” ์ž…๋ ฅ ๋ณต์‚ฌ๋ฅผ, ์š”์†Œ 2-253์€ ์ „์ฒด ํ•ฉ์„ฑ๊ณฑ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋™์ž‘ํ•˜๋Š” ๊ตฌํ˜„์„ ์ œ๊ณตํ•˜๋ฉด์„œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”์— ์ดˆ์ ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ถ„์„

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—†์ด (๋™๊ธฐ์‹):

Total Time = Input_Transfer_Time + Kernel_Transfer_Time + Compute_Time
           = Large_DRAM_transfer + Small_DRAM_transfer + convolution
           = Major_latency + Minor_latency + computation_work

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‚ฌ์šฉ (์ค‘์ฒฉ):

Total Time = MAX(Input_Transfer_Time, Kernel_Transfer_Time) + Compute_Time
           = MAX(Major_latency, Minor_latency) + computation_work
           = Major_latency + computation_work

์„ฑ๋Šฅ ํ–ฅ์ƒ: ๋” ํฐ ์ž…๋ ฅ ์ „์†ก ๋’ค์— ๋” ์ž‘์€ ์ปค๋„ ์ „์†ก์˜ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊น€์œผ๋กœ์จ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค. ์‹ค์ œ ์„ฑ๋Šฅ ํ–ฅ์ƒ ํญ์€ ์ „์†ก์˜ ์ƒ๋Œ€์  ํฌ๊ธฐ์™€ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ๋” ํฐ ์ค‘์ฒฉ์ด ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ํ›จ์”ฌ ํด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ

  1. ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ๋งค์นญ: Layout.row_major(256) ๋ ˆ์ด์•„์›ƒ์ด ๋ธ”๋ก์˜ (256, 1) ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ๊ณผ ์ •ํ™•ํžˆ ์ผ์น˜ํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

  2. ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ์ ์ ˆํ•œ ์ˆœ์„œ ์ง€์ •(๋น„๋™๊ธฐ ๋ณต์‚ฌ โ†’ ์ปค๋„ ๋กœ๋“œ โ†’ ๋Œ€๊ธฐ โ†’ ๋ฐฐ๋ฆฌ์–ด โ†’ ์—ฐ์‚ฐ)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

  3. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์ตœ์‹  GPU๋Š” ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์–ด, ๋ฉ”๋ชจ๋ฆฌ ์œ ๋‹›๊ณผ ์—ฐ์‚ฐ ์œ ๋‹› ์‚ฌ์ด์˜ ์ง„์ •ํ•œ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

  4. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ํ™œ์šฉ: ์ด ํŒจํ„ด์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ํ†ตํ•ด ํšจ์œจ์ ์œผ๋กœ ์ด๋™์‹œํ‚ต๋‹ˆ๋‹ค: DRAM โ†’ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ ˆ์ง€์Šคํ„ฐ โ†’ ์—ฐ์‚ฐ.

  5. ํ…Œ์ŠคํŠธ-๊ตฌํ˜„ ์ผ๊ด€์„ฑ: ํ…Œ์ŠคํŠธ ๊ฒ€์ฆ ๋กœ์ง์€ local_i_in_tile = i % CONV_TILE_SIZE๋ฅผ ๊ฒ€์‚ฌํ•˜์—ฌ ๊ฐ ์š”์†Œ๊ฐ€ ํ•ฉ์„ฑ๊ณฑ ๊ฒฐ๊ณผ(์ค‘์‹ฌ ์š”์†Œ)๋ฅผ ๊ธฐ๋Œ€ํ•ด์•ผ ํ•˜๋Š”์ง€ ์ž…๋ ฅ ๋ณต์‚ฌ(๊ฒฝ๊ณ„ ์š”์†Œ)๋ฅผ ๊ธฐ๋Œ€ํ•ด์•ผ ํ•˜๋Š”์ง€ ํŒ๋ณ„ํ•˜๋ฉฐ, ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ์ „๋žต๊ณผ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋‹จ์ˆœํ™”๋œ ๊ฒฝ๊ณ„ ์ ‘๊ทผ ๋ฐฉ์‹์˜ ์ •ํ™•ํ•œ ๊ฒ€์ฆ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

์ด ์†”๋ฃจ์…˜์€ ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ํ•ฉ์„ฑ๊ณฑ์„ ์œ ์šฉํ•œ ์ž‘์—… ๋’ค์— ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ, ๊ณ ์„ฑ๋Šฅ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ๋ณธ ์›๋ฆฌ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 29: GPU ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ

๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๋„˜์–ด์„œ

์ด ์žฅ์—์„œ๋Š” ์Šค๋ ˆ๋“œ ๊ฐ„ ์ •๋ฐ€ํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•œ ๋ณต์žกํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋™๊ธฐํ™” ํŒจํ„ด์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์— ์ดˆ์ ์„ ๋งž์ถ˜ ์ด์ „ ํผ์ฆ๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ์ด ์ฑŒ๋ฆฐ์ง€๋“ค์€ ์‹ค์ œ GPU ์†Œํ”„ํŠธ์›จ์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์•„ํ‚คํ…์ฒ˜ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ํŠนํ™”: ํ•˜๋‚˜์˜ ๋ธ”๋ก ์•ˆ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ๊ฐ๊ฐ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰
  • ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ํŒŒ์ดํ”„๋ผ์ธ: ๋ช…์‹œ์  ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์„ ๊ฐ€์ง„ ๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ
  • ๊ณ ๊ธ‰ ๋ฐฐ๋ฆฌ์–ด API: ๊ธฐ๋ณธ barrier() ํ˜ธ์ถœ์„ ๋„˜์–ด์„  ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •: ๋ฉ”๋ชจ๋ฆฌ ๊ฐ€์‹œ์„ฑ๊ณผ ์ˆœ์„œ์— ๋Œ€ํ•œ ๋ช…์‹œ์  ์ œ์–ด
  • ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด: ๋ณต์žกํ•œ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง๊ณผ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •

์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋Œ€๋ถ€๋ถ„์˜ GPU ํŠœํ† ๋ฆฌ์–ผ์€ ๋‹จ์ˆœํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ํŒจํ„ด์„ ๊ฐ€๋ฅด์น˜์ง€๋งŒ, ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๊ฐ„์˜ ์ •๊ตํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ๋“ค์€ ํ•™์ˆ ์  ์˜ˆ์ œ์™€ ์‹ค์ œ GPU ์ปดํ“จํŒ… ์‚ฌ์ด์˜ ๊ฐ„๊ทน์„ ๋ฉ”์›Œ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

GPU ๋™๊ธฐํ™”๋Š” ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์˜ฌ๋ฐ”๋ฅด๊ณ  ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•˜๊ฒŒ ํ•˜๋Š” ํ† ๋Œ€์ž…๋‹ˆ๋‹ค. ์ด ์žฅ์—์„œ๋Š” ๊ณ ์„ฑ๋Šฅ GPU ์ปดํ“จํŒ… ์ „๋ฐ˜์— ๊ฑธ์ณ ๋‚˜ํƒ€๋‚˜๋Š” ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์ธ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •, ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ, ์ŠคํŠธ๋ฆฌ๋ฐ ์—ฐ์‚ฐ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต ๋ชฉํ‘œ:

  • ์„œ๋กœ ๋‹ค๋ฅธ ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์–ธ์ œ, ์™œ ํ•„์š”ํ•œ์ง€ ์ดํ•ด
  • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”๋ฅผ ํ†ตํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„
  • ์ •๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •์ด ํ•„์š”ํ•œ ๋ฐ˜๋ณต ํŒจํ„ด ๊ตฌํ˜„
  • ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•˜๋ฉด์„œ ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์ ํ™”

์•„ํ‚คํ…์ฒ˜ ์ง„ํ–‰ ๊ตฌ์กฐ: ์ด ํผ์ฆ๋“ค์€ ๊ธฐ๋ณธ์ ์ธ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •๋ถ€ํ„ฐ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ๊นŒ์ง€, ๊ทธ๋ฆฌ๊ณ  ์ตœ์ข…์ ์œผ๋กœ ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ŠคํŠธ๋ฆฌ๋ฐ ์—ฐ์‚ฐ ํŒจํ„ด๊นŒ์ง€ ๋‹จ๊ณ„์ ์œผ๋กœ ์ง„ํ–‰๋˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์Šค๋ ˆ๋“œ ์กฐ์œจ ํŒจ๋Ÿฌ๋‹ค์ž„:

  • ๋‹จ์ˆœ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ (์ด์ „ ํผ์ฆ๋“ค)
  • ํŠนํ™” ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ๊ฐ๊ฐ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ˆ˜ํ–‰ (์ด ์žฅ)
  • ํŒŒ์ดํ”„๋ผ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง„ ์ˆœ์ฐจ์  ๋‹จ๊ณ„
  • ๋ฐ˜๋ณต ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์‹ ์ค‘ํ•œ ๋ฒ„ํผ ๊ด€๋ฆฌ๋ฅผ ์ˆ˜๋ฐ˜ํ•˜๋Š” ๋‹ค์ค‘ ํŒจ์Šค

๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ์˜ ๊ณ„์ธต ๊ตฌ์กฐ:

  • ๊ธฐ๋ณธ barrier(): ๋ธ”๋ก ๋‚ด ๋‹จ์ˆœ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  • ๊ณ ๊ธ‰ mbarrier API: ์ƒํƒœ ์ถ”์ ์„ ์ง€์›ํ•˜๋Š” ์„ธ๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์ œ์–ด
  • ์ŠคํŠธ๋ฆฌ๋ฐ ์กฐ์ •: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๋ฐ ๋Œ€๋Ÿ‰ ์ „์†ก ๋™๊ธฐํ™”

๋ฉ”๋ชจ๋ฆฌ ์ผ๊ด€์„ฑ ๋ชจ๋ธ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •: ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๋น ๋ฅธ ์˜จ์นฉ ๋ฉ”๋ชจ๋ฆฌ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ˆœ์„œ ๋ณด์žฅ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์— ๊ฑธ์ณ ์“ฐ๊ธฐ์˜ ๊ฐ€์‹œ์„ฑ ๋ณด์žฅ
  • ๋ฒ„ํผ ๊ด€๋ฆฌ: ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง๊ณผ ํ•‘ํ ํŒจํ„ด

๊ตฌ์„ฑ

์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜:

  • ๋ธ”๋ก ํฌ๊ธฐ: ์ตœ์ ์˜ ์ ์œ ์œจ์„ ์œ„ํ•ด ๋ธ”๋ก๋‹น TPB = 256 ์Šค๋ ˆ๋“œ
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๊ฐ๊ฐ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ํƒ€์ผ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹ค์ˆ˜์˜ ๋ธ”๋ก
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ ˆ์ง€์Šคํ„ฐ, ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์˜ ์ „๋žต์  ํ™œ์šฉ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: ์ˆ˜์น˜ ์—ฐ์‚ฐ์„ ์œ„ํ•œ DType.float32

๋‹ค๋ฃจ๋Š” ๋™๊ธฐํ™” ํŒจํ„ด:

  1. ๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ: ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์„ ํ™œ์šฉํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”
  2. ๋”๋ธ” ๋ฒ„ํผ๋ง ๋ฐ˜๋ณต: ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ
  3. ์ŠคํŠธ๋ฆฌ๋ฐ ์—ฐ์‚ฐ: ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์กฐ์ •

์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐฐ๋ฆฌ์–ด ์œ ํ˜•์˜ ๋น„์šฉ ์ดํ•ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์ตœ๋Œ€ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์œ„ํ•œ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”
  • ์Šค๋ ˆ๋“œ ํ™œ์šฉ๋„: ํŠนํ™”๋œ ์—ญํ• ๊ณผ ์ „์ฒด ํšจ์œจ์„ฑ ๊ฐ„์˜ ๊ท ํ˜•

ํผ์ฆ ๊ตฌ์„ฑ

์ด ์žฅ์—๋Š” ์„œ๋กœ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฐœ์ „ํ•˜๋Š” ์„ธ ๊ฐœ์˜ ์—ฐ๊ฒฐ๋œ ํผ์ฆ์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •

์ดˆ์ : ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜

ํ•˜๋‚˜์˜ ๋ธ”๋ก ์•ˆ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” GPU ์ปค๋„์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ๋Š” ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„์™€ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๊ฐ„์˜ ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์Šค๋ ˆ๋“œ ์—ญํ•  ํŠนํ™” (Stage 1: ๋กœ๋“œ, Stage 2: ์ฒ˜๋ฆฌ, Stage 3: ์ถœ๋ ฅ)
  • ์ฒ˜๋ฆฌ ๋‹จ๊ณ„ ๊ฐ„ ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๋ฐ์ดํ„ฐ ํ๋ฆ„
  • ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‚ฌ์ด์˜ ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ: ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ, ๋‹ค๋‹จ๊ณ„ ๊ณผํ•™ ์—ฐ์‚ฐ, ์‹ ๊ฒฝ๋ง ๋ ˆ์ด์–ด ์กฐ์ •

๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์—ฐ์‚ฐ

์ดˆ์ : ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด API์™€ ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ

์ •๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •์ด ํ•„์š”ํ•œ ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•ด mbarrier API๋ฅผ ์‚ฌ์šฉํ•œ ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด๋ฅผ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ๋ฐ˜๋ณต๋ฒ•๊ณผ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ธ ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ๊ณ ๊ธ‰ mbarrier API vs ๊ธฐ๋ณธ barrier()
  • ์ฝ๊ธฐ/์“ฐ๊ธฐ ๋ฒ„ํผ ์—ญํ• ์„ ๊ต๋Œ€ํ•˜๋Š” ๋”๋ธ” ๋ฒ„ํผ๋ง
  • ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์‚ฌ์šฉํ•œ ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์กฐ์ •

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ: ๋ฐ˜๋ณต๋ฒ• (Jacobi, Gauss-Seidel), ์…€๋ฃฐ๋Ÿฌ ์˜คํ† ๋งˆํƒ€, ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์‹œ๊ฐ„ ์Šคํ…

์‹œ์ž‘ํ•˜๊ธฐ

๊ถŒ์žฅ ํ•™์Šต ์ˆœ์„œ:

  1. ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •๋ถ€ํ„ฐ ์‹œ์ž‘: ์Šค๋ ˆ๋“œ ํŠนํ™”์˜ ๊ธฐ์ดˆ ์ดํ•ด
  2. ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋กœ ์ง„ํ–‰: ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด ํ•™์Šต
  3. ์ŠคํŠธ๋ฆฌ๋ฐ ํŒจํ„ด์— ์ ์šฉ: ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ์œ„ํ•œ ๊ฐœ๋… ๊ฒฐํ•ฉ

์‚ฌ์ „ ์ค€๋น„:

  • ๊ธฐ๋ณธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋… (์Šค๋ ˆ๋“œ, ๋ธ”๋ก, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)์— ๋Œ€ํ•œ ์ดํ•ด
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ์ ‘๊ทผ ํŒจํ„ด์— ๋Œ€ํ•œ ์ดํ•ด
  • ์ด์ „ ํผ์ฆ์—์„œ ๋ฐฐ์šด ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์— ๋Œ€ํ•œ ์นœ์ˆ™ํ•จ

ํ•™์Šต ์„ฑ๊ณผ: ์ด ์žฅ์„ ์™„๋ฃŒํ•˜๋ฉด, ์ •๋ฐ€ํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•œ ์ •๊ตํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๊ณ  ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ํ† ๋Œ€๋ฅผ ๊ฐ–์ถ”๊ฒŒ ๋˜์–ด, ์‹ค์ œ GPU ์ปดํ“จํŒ… ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ๋งˆ์ฃผํ•˜๋Š” ์•„ํ‚คํ…์ฒ˜์  ๋ณต์žก์„ฑ์— ๋Œ€๋น„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ • ์—์„œ ์Šค๋ ˆ๋“œ ํŠนํ™”์˜ ๊ธฐ๋ณธ์„ ๋ฐฐ์šด ๋‹ค์Œ, ๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์—ฐ์‚ฐ ์œผ๋กœ ๋‚˜์•„๊ฐ€ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ธฐ๋ฒ•์„ ํƒ๊ตฌํ•ด ๋ณด์„ธ์š”.

๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •

๊ฐœ์š”

์กฐ์œจ๋œ 3๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ์„ ํ†ตํ•ด ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ํŠนํ™”๋œ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ๋‹ด๋‹นํ•˜๊ณ , ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๋กœ ๋™๊ธฐํ™”๋ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์—ญํ• ์ด ํŠนํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค: Stage 1 (์Šค๋ ˆ๋“œ 0-127)์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•˜๊ณ  ์ „์ฒ˜๋ฆฌํ•˜๋ฉฐ, Stage 2 (์Šค๋ ˆ๋“œ 128-255)๋Š” ๋ธ”๋Ÿฌ ์—ฐ์‚ฐ์„ ์ ์šฉํ•˜๊ณ , Stage 3 (์ „์ฒด ์Šค๋ ˆ๋“œ)๋Š” ์ตœ์ข… ์Šค๋ฌด๋”ฉ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ์•„ํ‚คํ…์ฒ˜: ์ด ํผ์ฆ์€ ํ•˜๋‚˜์˜ GPU ๋ธ”๋ก ์•ˆ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” ์ „ํ†ต์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ฌ๋ฆฌ, ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ์Šค๋ ˆ๋“œ๋ฅผ ๊ธฐ๋Šฅ๋ณ„๋กœ ํŠนํ™”ํ•˜์—ฌ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.

ํŒŒ์ดํ”„๋ผ์ธ ๊ฐœ๋…: ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์„ธ ๊ฐœ์˜ ๊ตฌ๋ถ„๋œ ๋‹จ๊ณ„๋ฅผ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ๊ฐ ๋‹จ๊ณ„์—๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” ํŠนํ™”๋œ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„๋Š” ๋‹ค์Œ ๋‹จ๊ณ„๊ฐ€ ์†Œ๋น„ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์—ฌ, ๋ฐฐ๋ฆฌ์–ด๋กœ ์‹ ์ค‘ํ•˜๊ฒŒ ๋™๊ธฐํ™”ํ•ด์•ผ ํ•˜๋Š” ๋ช…์‹œ์  ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์˜์กด์„ฑ๊ณผ ๋™๊ธฐํ™”: ๊ฐ ๋‹จ๊ณ„๋Š” ๋‹ค์Œ ๋‹จ๊ณ„๊ฐ€ ์†Œ๋น„ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • Stage 1 โ†’ Stage 2: ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„๊ฐ€ ๋ธ”๋Ÿฌ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑ
  • Stage 2 โ†’ Stage 3: ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„๊ฐ€ ์ตœ์ข… ์Šค๋ฌด๋”ฉ์„ ์œ„ํ•œ ๋ธ”๋Ÿฌ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑ
  • ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€: ์˜์กดํ•˜๋Š” ๋‹จ๊ณ„๊ฐ€ ์‹œ์ž‘๋˜๊ธฐ ์ „์— ํ•ด๋‹น ๋‹จ๊ณ„๊ฐ€ ์™„์ „ํžˆ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ

๊ตฌ์ฒด์ ์œผ๋กœ, ๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ์€ ์„ธ ๊ฐ€์ง€ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์กฐ์œจ๋œ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

Stage 1 - ์ „์ฒ˜๋ฆฌ ๊ฐ•ํ™”:

\[P[i] = I[i] \times 1.1\]

์—ฌ๊ธฐ์„œ \(P[i]\)๋Š” ์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ์ด๊ณ  \(I[i]\)๋Š” ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์ž…๋‹ˆ๋‹ค.

Stage 2 - ์ˆ˜ํ‰ ๋ธ”๋Ÿฌ ํ•„ํ„ฐ:

\[B[i] = \frac{1}{N_i} \sum_{k=-2}^{2} P[i+k] \quad \text{where } i+k \in [0, 255]\]

์—ฌ๊ธฐ์„œ \(B[i]\)๋Š” ๋ธ”๋Ÿฌ ๊ฒฐ๊ณผ์ด๊ณ , \(N_i\)๋Š” ํƒ€์ผ ๊ฒฝ๊ณ„ ๋‚ด์˜ ์œ ํšจํ•œ ์ด์›ƒ ์ˆ˜์ž…๋‹ˆ๋‹ค.

Stage 3 - ์—ฐ์‡„์  ์ด์›ƒ ์Šค๋ฌด๋”ฉ:

\[F[i] = \begin{cases} (B[i] + B[i+1]) \times 0.6 & \text{if } i = 0 \\ ((B[i] + B[i-1]) \times 0.6 + B[i+1]) \times 0.6 & \text{if } 0 < i < 255 \\ (B[i] + B[i-1]) \times 0.6 & \text{if } i = 255 \end{cases}\]

์—ฌ๊ธฐ์„œ \(F[i]\)๋Š” ์—ฐ์‡„์  ์Šค๋ฌด๋”ฉ์ด ์ ์šฉ๋œ ์ตœ์ข… ์ถœ๋ ฅ์ž…๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ํŠนํ™”:

  • ์Šค๋ ˆ๋“œ 0-127: \(i \in \{0, 1, 2, \ldots, 255\}\)์— ๋Œ€ํ•ด \(P[i]\) ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ๋‹น 2๊ฐœ ์š”์†Œ)
  • ์Šค๋ ˆ๋“œ 128-255: \(i \in \{0, 1, 2, \ldots, 255\}\)์— ๋Œ€ํ•ด \(B[i]\) ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ๋‹น 2๊ฐœ ์š”์†Œ)
  • ์ „์ฒด 256๊ฐœ ์Šค๋ ˆ๋“œ: \(i \in \{0, 1, 2, \ldots, 255\}\)์— ๋Œ€ํ•ด \(F[i]\) ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ๋‹น 1๊ฐœ ์š”์†Œ)

๋™๊ธฐํ™” ์ง€์ :

\[\text{barrier}_1 \Rightarrow P[i] \text{ complete} \Rightarrow \text{barrier}_2 \Rightarrow B[i] \text{ complete} \Rightarrow \text{barrier}_3 \Rightarrow F[i] \text{ complete}\]
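์œ„ ์„ธ ๋‹จ๊ณ„ ์ˆ˜์‹์„ ํ˜ธ์ŠคํŠธ์—์„œ ๊ทธ๋Œ€๋กœ ์ˆœ์ฐจ ๊ณ„์‚ฐํ•ด ๋ณด๋Š” Python ์ฐธ์กฐ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ํƒ€์ผ ํฌ๊ธฐ 256์„ ์ž„์˜ ๊ธธ์ด n์œผ๋กœ ์ผ๋ฐ˜ํ™”ํ•œ ๊ฒƒ์€ ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ€์ •์ด๋ฉฐ, ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ๋ฐฐ๋ฆฌ์–ด ๋“ฑ GPU ์ปค๋„ ๊ตฌํ˜„์€ ํฌํ•จํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ธฐ๋Œ€ ์ถœ๋ ฅ์„ ํ™•์ธํ•˜๋Š” ์šฉ๋„๋กœ๋งŒ ์ฐธ๊ณ ํ•˜์„ธ์š”:

```python
def pipeline(inp):
    """๋ณธ๋ฌธ์˜ Stage 1~3 ์ˆ˜์‹์„ ์ˆœ์ฐจ์ ์œผ๋กœ ๊ณ„์‚ฐ (ํƒ€์ผ ๊ธธ์ด n์œผ๋กœ ์ผ๋ฐ˜ํ™”ํ•œ ๊ฐ€์ •)."""
    n = len(inp)
    P = [x * 1.1 for x in inp]  # Stage 1: ์ „์ฒ˜๋ฆฌ ๊ฐ•ํ™”
    B = []
    for i in range(n):          # Stage 2: ์œ ํšจ ์ด์›ƒ ํ‰๊ท ์˜ ์ˆ˜ํ‰ ๋ธ”๋Ÿฌ
        neigh = [P[i + k] for k in range(-2, 3) if 0 <= i + k < n]
        B.append(sum(neigh) / len(neigh))
    F = []
    for i in range(n):          # Stage 3: ์—ฐ์‡„์  ์ด์›ƒ ์Šค๋ฌด๋”ฉ (์„ธ ๊ฐ€์ง€ ๊ฒฝ์šฐ)
        if i == 0:
            F.append((B[0] + B[1]) * 0.6)
        elif i == n - 1:
            F.append((B[i] + B[i - 1]) * 0.6)
        else:
            F.append(((B[i] + B[i - 1]) * 0.6 + B[i + 1]) * 0.6)
    return P, B, F

P, B, F = pipeline([1.0] * 8)
```

์ž…๋ ฅ์ด ๋ชจ๋‘ 1.0์ด๋ฉด P์™€ B๋Š” ์ „๋ถ€ 1.1 ๊ทผ์ฒ˜ ๊ฐ’์ด ๋˜๊ณ , F๋Š” ๊ฒฝ๊ณ„์™€ ๋‚ด๋ถ€์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ ธ ์„ธ ๊ฒฝ์šฐ์˜ ๋ถ„๊ธฐ๋ฅผ ์‰ฝ๊ฒŒ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.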

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ๋Š” ๋‹ค์Œ์„ ๋ฐฐ์›๋‹ˆ๋‹ค:

  • ํ•˜๋‚˜์˜ GPU ๋ธ”๋ก ์•ˆ์—์„œ ์Šค๋ ˆ๋“œ ์—ญํ•  ํŠนํ™” ๊ตฌํ˜„
  • ์ฒ˜๋ฆฌ ๋‹จ๊ณ„ ๊ฐ„ ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„ ์กฐ์œจ
  • ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐ„์˜ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•œ ๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ (๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด๋ถ€๋ฟ ์•„๋‹ˆ๋ผ)

ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋ฉด์„œ ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜๋ฅผ ํ†ตํ•ด ์กฐ์œจ๋˜๋Š” ๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ์„ ์–ด๋–ป๊ฒŒ ์„ค๊ณ„ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋Œ€๋ถ€๋ถ„์˜ GPU ํŠœํ† ๋ฆฌ์–ผ์€ ๋‹จ์ผ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด์—์„œ์˜ ๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ๋ฒ•, ์ฆ‰ ๋ฆฌ๋•์…˜์ด๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ ์ค‘ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•๋งŒ์„ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์‹ค์ œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ๋Š” ์‹ ์ค‘ํ•˜๊ฒŒ ์กฐ์œจํ•ด์•ผ ํ•˜๋Š” ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ตฌ๋ถ„๋œ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ํฌํ•จํ•˜๋Š” ์•„ํ‚คํ…์ฒ˜์  ๋ณต์žก์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ๋‹จ์ผ์ฒด์  ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํŠนํ™”๋˜๊ณ  ์กฐ์œจ๋œ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ด์ „ ํผ์ฆ๊ณผ ํ˜„์žฌ์˜ ๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ ๋น„๊ต:

  • ์ด์ „ ํผ์ฆ (P8, P12, P15): ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๊ณ , ๋ฐฐ๋ฆฌ์–ด๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๋‚ด์—์„œ ๋™๊ธฐํ™”
  • ์ด ํผ์ฆ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๊ณ , ๋ฐฐ๋ฆฌ์–ด๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐ„์˜ ์กฐ์œจ

์Šค๋ ˆ๋“œ ํŠนํ™” ์•„ํ‚คํ…์ฒ˜: ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ์ดํ„ฐ ์ธ๋ฑ์Šค๋งŒ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์™€ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ์˜ ์—ญํ• ์— ๋”ฐ๋ผ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฅธ ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

์‹œ์Šคํ…œ ๋งค๊ฐœ๋ณ€์ˆ˜:

  • ์ด๋ฏธ์ง€ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ (๊ฐ„์†Œํ™”๋ฅผ ์œ„ํ•ด 1D)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 256 ์Šค๋ ˆ๋“œ, (256, 1) ๋ธ”๋ก ์ฐจ์›์œผ๋กœ ๊ตฌ์„ฑ
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ ํƒ€์ผ ๋‹จ์œ„๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ (4, 1) ๋ธ”๋ก (์ด 4๊ฐœ ๋ธ”๋ก)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: ๋ชจ๋“  ์—ฐ์‚ฐ์— DType.float32

์Šค๋ ˆ๋“œ ํŠนํ™” ์•„ํ‚คํ…์ฒ˜:

  • Stage 1 ์Šค๋ ˆ๋“œ: STAGE1_THREADS = 128 (์Šค๋ ˆ๋“œ 0-127, ๋ธ”๋ก์˜ ์ „๋ฐ˜๋ถ€)

    • ์—ญํ• : ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•˜๊ณ  ์ „์ฒ˜๋ฆฌ ์ ์šฉ
    • ์ž‘์—… ๋ถ„๋ฐฐ: ํšจ์œจ์ ์ธ ๋ถ€ํ•˜ ๊ท ํ˜•์„ ์œ„ํ•ด ์Šค๋ ˆ๋“œ๋‹น 2๊ฐœ ์š”์†Œ ์ฒ˜๋ฆฌ
    • ์ถœ๋ ฅ: input_shared[256]์— ์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ ์ฑ„์šฐ๊ธฐ
  • Stage 2 ์Šค๋ ˆ๋“œ: STAGE2_THREADS = 128 (์Šค๋ ˆ๋“œ 128-255, ๋ธ”๋ก์˜ ํ›„๋ฐ˜๋ถ€)

    • ์—ญํ• : ์ „์ฒ˜๋ฆฌ๋œ ๋ฐ์ดํ„ฐ์— ์ˆ˜ํ‰ ๋ธ”๋Ÿฌ ํ•„ํ„ฐ ์ ์šฉ
    • ์ž‘์—… ๋ถ„๋ฐฐ: ์Šค๋ ˆ๋“œ๋‹น 2๊ฐœ์˜ ๋ธ”๋Ÿฌ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ
    • ์ถœ๋ ฅ: blur_shared[256]์— ๋ธ”๋Ÿฌ ๊ฒฐ๊ณผ ์ฑ„์šฐ๊ธฐ
  • Stage 3 ์Šค๋ ˆ๋“œ: ์ „์ฒด 256๊ฐœ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ

    • ์—ญํ• : ์ตœ์ข… ์Šค๋ฌด๋”ฉ ๋ฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ถœ๋ ฅ
    • ์ž‘์—… ๋ถ„๋ฐฐ: ์ผ๋Œ€์ผ ๋งคํ•‘ (์Šค๋ ˆ๋“œ i๊ฐ€ ์š”์†Œ i๋ฅผ ์ฒ˜๋ฆฌ)
    • ์ถœ๋ ฅ: ๊ธ€๋กœ๋ฒŒ output ๋ฐฐ์—ด์— ์ตœ์ข… ๊ฒฐ๊ณผ ๊ธฐ๋ก

์™„์„ฑํ•  ์ฝ”๋“œ


comptime TPB = 256  # Threads per block for pipeline stages
comptime SIZE = 1024  # Image size (1D for simplicity)
comptime BLOCKS_PER_GRID = (4, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)

# Multi-stage processing configuration
comptime STAGE1_THREADS = TPB // 2
comptime STAGE2_THREADS = TPB // 2
comptime BLUR_RADIUS = 2


fn multi_stage_image_blur_pipeline[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Multi-stage image blur pipeline with barrier coordination.

    Stage 1 (threads 0-127): Load input data and apply 1.1x preprocessing
    Stage 2 (threads 128-255): Apply 5-point blur with BLUR_RADIUS=2
    Stage 3 (all threads): Final neighbor smoothing and output
    """

    # Shared memory buffers for pipeline stages
    input_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    blur_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Stage 1: Load and preprocess (threads 0-127)

    # FILL ME IN (roughly 10 lines)

    barrier()  # Wait for Stage 1 completion

    # Stage 2: Apply blur (threads 128-255)

    # FILL ME IN (roughly 25 lines)

    barrier()  # Wait for Stage 2 completion

    # Stage 3: Final smoothing (all threads)

    # FILL ME IN (roughly 7 lines)

    barrier()  # Ensure all writes complete


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p29/p29.mojo

ํŒ

์Šค๋ ˆ๋“œ ์—ญํ•  ์‹๋ณ„

  • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๋น„๊ต๋ฅผ ํ†ตํ•ด ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋–ค ๋‹จ๊ณ„๋ฅผ ์‹คํ–‰ํ•ด์•ผ ํ•˜๋Š”์ง€ ๊ฒฐ์ •
  • Stage 1: ์ „๋ฐ˜๋ถ€ ์Šค๋ ˆ๋“œ (์Šค๋ ˆ๋“œ 0-127)
  • Stage 2: ํ›„๋ฐ˜๋ถ€ ์Šค๋ ˆ๋“œ (์Šค๋ ˆ๋“œ 128-255)
  • Stage 3: ๋ชจ๋“  ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ

Stage 1 ์ ‘๊ทผ ๋ฐฉ์‹

  • ์ ์ ˆํ•œ ์ธ๋ฑ์Šค ๋น„๊ต๋ฅผ ํ†ตํ•ด Stage 1 ์Šค๋ ˆ๋“œ ์‹๋ณ„
  • ๋ถ€ํ•˜ ๊ท ํ˜•์„ ์œ„ํ•ด ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌ
  • ์ „์ฒ˜๋ฆฌ ๊ฐ•ํ™” ๊ณ„์ˆ˜ ์ ์šฉ
  • ์ œ๋กœ ํŒจ๋”ฉ์„ ์‚ฌ์šฉํ•œ ์ ์ ˆํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ๊ตฌํ˜„

Stage 2 ์ ‘๊ทผ ๋ฐฉ์‹

  • Stage 2 ์Šค๋ ˆ๋“œ๋ฅผ ์‹๋ณ„ํ•˜๊ณ  ์ธ๋ฑ์Šค๋ฅผ ์ฒ˜๋ฆฌ ๋ฒ”์œ„์— ๋งคํ•‘
  • ์ด์›ƒ ์š”์†Œ์˜ ํ‰๊ท ์„ ๊ตฌํ•˜๋Š” ๋ธ”๋Ÿฌ ์ปค๋„ ๊ตฌํ˜„
  • ์œ ํšจํ•œ ์ด์›ƒ๋งŒ ํฌํ•จํ•˜์—ฌ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ
  • ํšจ์œจ์„ฑ์„ ์œ„ํ•ด ์Šค๋ ˆ๋“œ๋‹น ์—ฌ๋Ÿฌ ์š”์†Œ ์ฒ˜๋ฆฌ

Stage 3 ์ ‘๊ทผ ๋ฐฉ์‹

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ์ฒ˜๋ฆฌ์— ์ฐธ์—ฌ
  • ์ง€์ •๋œ ์Šค์ผ€์ผ๋ง ๊ณ„์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ์ด์›ƒ ์Šค๋ฌด๋”ฉ ์ ์šฉ
  • ์ด์›ƒ์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ์˜ ์—ฃ์ง€ ์ผ€์ด์Šค ์ฒ˜๋ฆฌ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•ด ๊ธ€๋กœ๋ฒŒ ์ถœ๋ ฅ์— ๊ฒฐ๊ณผ ๊ธฐ๋ก

๋™๊ธฐํ™” ์ „๋žต

  • ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋‹จ๊ณ„ ์‚ฌ์ด์— ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜
  • ์˜์กดํ•˜๋Š” ๋‹จ๊ณ„๊ฐ€ ์‹œ์ž‘๋˜๊ธฐ ์ „์— ๊ฐ ๋‹จ๊ณ„๊ฐ€ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ
  • ๋ธ”๋ก ์ข…๋ฃŒ ์ „ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ข… ๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค:

pixi run p29 --multi-stage
pixi run -e amd p29 --multi-stage
uv run poe p29 --multi-stage

ํผ์ฆ์„ ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒํ•˜๋ฉด ๋‹ค์Œ๊ณผ ์œ ์‚ฌํ•œ ์ถœ๋ ฅ์ด ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค:

Puzzle 29: GPU Synchronization Primitives
==================================================
TPB: 256
SIZE: 1024
STAGE1_THREADS: 128
STAGE2_THREADS: 128
BLUR_RADIUS: 2

Testing Puzzle 29A: Multi-Stage Pipeline Coordination
============================================================
Multi-stage pipeline blur completed
Input sample: 0.0 1.01 2.02
Output sample: 1.6665002 2.3331003 3.3996604
โœ… Multi-stage pipeline coordination test PASSED!

์†”๋ฃจ์…˜

fn multi_stage_image_blur_pipeline[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Multi-stage image blur pipeline with barrier coordination.

    Stage 1 (threads 0-127): Load input data and apply 1.1x preprocessing
    Stage 2 (threads 128-255): Apply 5-point blur with BLUR_RADIUS=2
    Stage 3 (all threads): Final neighbor smoothing and output
    """

    # Shared memory buffers for pipeline stages
    input_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    blur_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Stage 1: Load and preprocess (threads 0-127)
    if local_i < STAGE1_THREADS:
        if global_i < size:
            input_shared[local_i] = input[global_i] * 1.1
            # Each thread loads 2 elements
            if local_i + STAGE1_THREADS < size:
                input_shared[local_i + STAGE1_THREADS] = (
                    input[global_i + STAGE1_THREADS] * 1.1
                )
        else:
            # Zero-padding for out-of-bounds
            input_shared[local_i] = 0.0
            if local_i + STAGE1_THREADS < TPB:
                input_shared[local_i + STAGE1_THREADS] = 0.0

    barrier()  # Wait for Stage 1 completion

    # Stage 2: Apply blur (threads 128-255)
    if local_i >= STAGE1_THREADS:
        blur_idx = local_i - STAGE1_THREADS
        var blur_sum: Scalar[dtype] = 0.0
        blur_count = 0

        # 5-point blur kernel
        for offset in range(-BLUR_RADIUS, BLUR_RADIUS + 1):
            sample_idx = blur_idx + offset
            if sample_idx >= 0 and sample_idx < TPB:
                blur_sum += rebind[Scalar[dtype]](input_shared[sample_idx])
                blur_count += 1

        if blur_count > 0:
            blur_shared[blur_idx] = blur_sum / blur_count
        else:
            blur_shared[blur_idx] = 0.0

        # Process second element
        second_idx = blur_idx + STAGE1_THREADS
        if second_idx < TPB:
            blur_sum = 0.0
            blur_count = 0
            for offset in range(-BLUR_RADIUS, BLUR_RADIUS + 1):
                sample_idx = second_idx + offset
                if sample_idx >= 0 and sample_idx < TPB:
                    blur_sum += rebind[Scalar[dtype]](input_shared[sample_idx])
                    blur_count += 1

            if blur_count > 0:
                blur_shared[second_idx] = blur_sum / blur_count
            else:
                blur_shared[second_idx] = 0.0

    barrier()  # Wait for Stage 2 completion

    # Stage 3: Final smoothing (all threads)
    if global_i < size:
        final_value = blur_shared[local_i]

        # Neighbor smoothing with 0.6 scaling
        if local_i > 0:
            final_value = (final_value + blur_shared[local_i - 1]) * 0.6
        if local_i < TPB - 1:
            final_value = (final_value + blur_shared[local_i + 1]) * 0.6

        output[global_i] = final_value

    barrier()  # Ensure all writes complete


ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ด๊ฒƒ์ด ์Šค๋ ˆ๋“œ ์—ญํ•  ํŠนํ™”๋ฅผ ๊ฐ€์ง„ ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜ ๋ฌธ์ œ์ž„์„ ์ธ์‹ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. ๋‹จ๊ณ„๋ณ„ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน ์„ค๊ณ„: ๋ฐ์ดํ„ฐ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ธฐ๋Šฅ๋ณ„๋กœ ์Šค๋ ˆ๋“œ๋ฅผ ๋ถ„ํ• 
  2. ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ์ฒด์ธ ๊ตฌํ˜„: Stage 1์ด Stage 2๋ฅผ ์œ„ํ•ด ์ƒ์‚ฐํ•˜๊ณ , Stage 2๊ฐ€ Stage 3์„ ์œ„ํ•ด ์ƒ์‚ฐ
  3. ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜: ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด๊ฐ€ ์•„๋‹ˆ๋ผ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐ„์˜ ๋™๊ธฐํ™”
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”: ๋ณ‘ํ•ฉ๋œ ์ฝ๊ธฐ์™€ ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ๋ณด์žฅ

์ƒ์„ธ ์„ค๋ช…์ด ํฌํ•จ๋œ ์ „์ฒด ์†”๋ฃจ์…˜

๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์†”๋ฃจ์…˜์€ ์ •๊ตํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ์ „ํ†ต์ ์ธ ๋‹จ์ผ์ฒด์  GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํŠนํ™”๋˜๊ณ  ์กฐ์œจ๋œ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„

์ด ํผ์ฆ์˜ ๊ทผ๋ณธ์ ์ธ ๋ŒํŒŒ๊ตฌ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹Œ ์—ญํ• ์— ์˜ํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”์ž…๋‹ˆ๋‹ค:

์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ (๋ฆฌ๋•์…˜์ด๋‚˜ ํ–‰๋ ฌ ์—ฐ์‚ฐ ๋“ฑ)
  • ๋ฐฐ๋ฆฌ์–ด๋Š” ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๋‚ด์—์„œ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”
  • ์Šค๋ ˆ๋“œ ์—ญํ• ์€ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ธ๋ฑ์Šค๋งŒ ๋‹ค๋ฆ„

์ด ํผ์ฆ์˜ ํ˜์‹ : ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰

  • ์Šค๋ ˆ๋“œ 0-127์ด ๋กœ๋”ฉ ๋ฐ ์ „์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰
  • ์Šค๋ ˆ๋“œ 128-255๊ฐ€ ๋ธ”๋Ÿฌ ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰
  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ์Šค๋ฌด๋”ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ˜‘๋ ฅ
  • ๋ฐฐ๋ฆฌ์–ด๋Š” ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด๊ฐ€ ์•„๋‹ˆ๋ผ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐ„์˜ ์กฐ์œจ

์ƒ์‚ฐ์ž-์†Œ๋น„์ž ์กฐ์ •

์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด์—์„œ ๋™๋“ฑํ•œ ์—ญํ• ์„ ํ•˜๋˜ ์ด์ „ ํผ์ฆ๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ ๋ช…์‹œ์ ์ธ ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค:

  • Stage 1: ์ƒ์‚ฐ์ž (Stage 2๋ฅผ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ)
  • Stage 2: ์†Œ๋น„์ž (Stage 1์˜ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ) + ์ƒ์‚ฐ์ž (Stage 3์„ ์œ„ํ•œ ๋ธ”๋Ÿฌ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ)
  • Stage 3: ์†Œ๋น„์ž (Stage 2์˜ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ)

์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜

๋ฐฐ๋ฆฌ์–ด๊ฐ€ ์–ธ์ œ ํ•„์š”ํ•˜๊ณ  ์–ธ์ œ ๋‚ญ๋น„์ ์ธ์ง€ ์ดํ•ดํ•˜๊ธฐ:

  • ํ•„์š”ํ•œ ๊ฒฝ์šฐ: ์˜์กด์ ์ธ ๋‹จ๊ณ„ ์‚ฌ์ด์—์„œ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด
  • ๋‚ญ๋น„์ ์ธ ๊ฒฝ์šฐ: ๊ฐ™์€ ๋‹จ๊ณ„์˜ ๋…๋ฆฝ์ ์ธ ์—ฐ์‚ฐ ๋‚ด์—์„œ
  • ์„ฑ๋Šฅ ํ†ต์ฐฐ: ๊ฐ ๋ฐฐ๋ฆฌ์–ด์—๋Š” ๋น„์šฉ์ด ์žˆ์œผ๋ฏ€๋กœ ์ „๋žต์ ์œผ๋กœ ์‚ฌ์šฉ

ํ•ต์‹ฌ ๋™๊ธฐํ™” ์ง€์ :

  1. Stage 1 ์ดํ›„: Stage 2๊ฐ€ ๋ถˆ์™„์ „ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
  2. Stage 2 ์ดํ›„: Stage 3์ด ๋ถˆ์™„์ „ํ•œ ๋ธ”๋Ÿฌ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
  3. Stage 3 ์ดํ›„: ๋ธ”๋ก ์ข…๋ฃŒ ์ „ ๋ชจ๋“  ์ถœ๋ ฅ ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ

์Šค๋ ˆ๋“œ ํ™œ์šฉ ํŒจํ„ด

  • Stage 1: 50% ํ™œ์šฉ (256๊ฐœ ์ค‘ 128๊ฐœ ์Šค๋ ˆ๋“œ ํ™œ์„ฑ, 128๊ฐœ ์œ ํœด)
  • Stage 2: 50% ํ™œ์šฉ (128๊ฐœ ํ™œ์„ฑ, 128๊ฐœ ์œ ํœด)
  • Stage 3: 100% ํ™œ์šฉ (์ „์ฒด 256๊ฐœ ์Šค๋ ˆ๋“œ ํ™œ์„ฑ)

์ด๊ฒƒ์€ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์กฐ์œจ๋œ ํŒŒ์ดํ”„๋ผ์ธ ๋‚ด์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ ์ž‘์—…์— ํŠนํ™”๋˜๋Š” ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹จ์ˆœํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๋„˜์–ด ์‹ค์ œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์š”ํ•œ ์•„ํ‚คํ…์ฒ˜์  ์‚ฌ๊ณ ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜:

  • ๋‘ ๊ฐœ์˜ ํŠนํ™”๋œ ๋ฒ„ํผ๊ฐ€ ๋‹จ๊ณ„ ๊ฐ„ ๋ฐ์ดํ„ฐ ํ๋ฆ„์„ ์ฒ˜๋ฆฌ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์€ ๊ฒฝ๊ณ„ ์—ฐ์‚ฐ์—๋งŒ ์ตœ์†Œํ™”
  • ๋ชจ๋“  ์ค‘๊ฐ„ ์ฒ˜๋ฆฌ์— ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ

์ ‘๊ทผ ํŒจํ„ด์˜ ์ด์ :

  • Stage 1: ์ž…๋ ฅ ๋กœ๋”ฉ์„ ์œ„ํ•œ ๋ณ‘ํ•ฉ๋œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ
  • Stage 2: ๋ธ”๋Ÿฌ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ
  • Stage 3: ์ถœ๋ ฅ์„ ์œ„ํ•œ ๋ณ‘ํ•ฉ๋œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ

์ด ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜ ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ:

  • ๋‹ค๋‹จ๊ณ„ ํ•„ํ„ฐ (๋ธ”๋Ÿฌ, ์„ ๋ช…ํ™”, ์—ฃ์ง€ ๊ฒ€์ถœ์„ ์ˆœ์ฐจ์ ์œผ๋กœ)
  • ์ƒ‰ ๊ณต๊ฐ„ ๋ณ€ํ™˜ (RGB โ†’ HSV โ†’ ์ฒ˜๋ฆฌ โ†’ RGB)
  • ๋‹ค์ค‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจ์Šค๋ฅผ ์‚ฌ์šฉํ•œ ๋…ธ์ด์ฆˆ ๊ฐ์†Œ

๊ณผํ•™ ์—ฐ์‚ฐ:

  • ๋‹ค๋‹จ๊ณ„ ์œ ํ•œ ์ฐจ๋ถ„ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ
  • ํ•„ํ„ฐ๋ง, ๋ณ€ํ™˜, ๋ถ„์„ ํŒŒ์ดํ”„๋ผ์ธ์„ ์‚ฌ์šฉํ•œ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ
  • ๋‹ค๋‹จ๊ณ„ ์†”๋ฒ„ ๋ฐ˜๋ณต์„ ์‚ฌ์šฉํ•œ ์ „์‚ฐ ์œ ์ฒด ์—ญํ•™

๋จธ์‹ ๋Ÿฌ๋‹:

  • ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์„ ์œ„ํ•ด ํŠนํ™”๋œ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์„ ๊ฐ€์ง„ ์‹ ๊ฒฝ๋ง ๋ ˆ์ด์–ด
  • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ (์กฐ์œจ๋œ ๋‹จ๊ณ„์—์„œ ๋กœ๋“œ, ์ •๊ทœํ™”, ์ฆ๊ฐ•)
  • ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ

ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ vs. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ:

  • ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์š”์†Œ์— ๋™์ผํ•œ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์Šค๋ ˆ๋“œ๊ฐ€ ํŠนํ™”๋œ ์—ญํ• ์— ๋”ฐ๋ผ ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰

๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ ์ฒ ํ•™:

  • ์ „๋žต์  ๋ฐฐ์น˜: ์˜์กด์ ์ธ ๋‹จ๊ณ„ ๊ฐ„์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ๊ณณ์—๋งŒ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜
  • ์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ: ๊ฐ ๋ฐฐ๋ฆฌ์–ด์—๋Š” ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฏ€๋กœ ์ •ํ™•ํ•˜์ง€๋งŒ ์ ˆ์ œ๋œ ์‚ฌ์šฉ
  • ์ •ํ™•์„ฑ ๋ณด์žฅ: ์ ์ ˆํ•œ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜๋กœ ์Šค๋ ˆ๋“œ ์‹คํ–‰ ํƒ€์ด๋ฐ์— ๊ด€๊ณ„์—†์ด ๊ฒฐ์ •์  ๊ฒฐ๊ณผ๋ฅผ ๋ณด์žฅ

์Šค๋ ˆ๋“œ ํŠนํ™”์˜ ์ด์ :

  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ตœ์ ํ™”: ๊ฐ ๋‹จ๊ณ„๋ฅผ ํ•ด๋‹น ์—ฐ์‚ฐ ํŒจํ„ด์— ๋งž๊ฒŒ ์ตœ์ ํ™” ๊ฐ€๋Šฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”: ์„œ๋กœ ๋‹ค๋ฅธ ๋‹จ๊ณ„์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ „๋žต ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • ๋ฆฌ์†Œ์Šค ํ™œ์šฉ: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํŠนํ™”๋˜๊ณ  ํšจ์œจ์ ์ธ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ๋ถ„ํ•ด ๊ฐ€๋Šฅ

์ด ์†”๋ฃจ์…˜์€ ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์—ฐ์‚ฐ์„ ์œ„ํ•ด ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ์ „๋žต์  ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜๋Š” ์ •๊ตํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ๋ฃจํ”„๋ฅผ ๋„˜์–ด ์‹ค์ œ GPU ์†Œํ”„ํŠธ์›จ์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์•„ํ‚คํ…์ฒ˜์  ์ ‘๊ทผ ๋ฐฉ์‹์œผ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์—ฐ์‚ฐ

๐Ÿ”ฌ ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™”: mbarrier vs barrier()

์ด ํผ์ฆ์€ ์ด์ „ ํผ์ฆ์—์„œ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ barrier() ํ•จ์ˆ˜๋ณด๋‹ค ํ›จ์”ฌ ๊ฐ•๋ ฅํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด API๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ณธ barrier()์˜ ํ•œ๊ณ„:

  • ์ผํšŒ์„ฑ ์‚ฌ์šฉ: ์ƒํƒœ ์ถ”์  ์—†์ด ๋‹จ์ผ ๋™๊ธฐํ™” ์ง€์ ๋งŒ ์ œ๊ณต
  • ๋ธ”๋ก ์ „์ฒด ์ „์šฉ: ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์ฐธ์—ฌํ•ด์•ผ ํ•จ
  • ์žฌ์‚ฌ์šฉ ๋ถˆ๊ฐ€: ๋งค barrier() ํ˜ธ์ถœ์ด ์ƒˆ๋กœ์šด ๋™๊ธฐํ™” ์ด๋ฒคํŠธ๋ฅผ ์ƒ์„ฑ
  • ์„ธ๋ฐ€๋„ ๋ถ€์กฑ: ๋ฉ”๋ชจ๋ฆฌ ์ˆœ์„œ์™€ ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ œํ•œ์  ์ œ์–ด
  • ์ •์  ์กฐ์ •: ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ ํŒจํ„ด์˜ ๋ณ€ํ™”์— ์ ์‘ ๋ถˆ๊ฐ€

๊ณ ๊ธ‰ mbarrier API์˜ ๊ธฐ๋Šฅ:

  • ์ •๋ฐ€ํ•œ ์ œ์–ด: mbarrier_init()๋กœ ํŠน์ • ์Šค๋ ˆ๋“œ ์ˆ˜๋ฅผ ์ง€์ •ํ•˜์—ฌ ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐฐ๋ฆฌ์–ด ๊ฐ์ฒด๋ฅผ ์„ค์ •
  • ์ƒํƒœ ์ถ”์ : mbarrier_arrive()๋กœ ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฌ๊ณ  ๋„์ฐฉ ํšŸ์ˆ˜๋ฅผ ์œ ์ง€
  • ์œ ์—ฐํ•œ ๋Œ€๊ธฐ: mbarrier_test_wait()๋กœ ํŠน์ • ์™„๋ฃŒ ์ƒํƒœ๋ฅผ ๊ธฐ๋‹ค๋ฆด ์ˆ˜ ์žˆ์Œ
  • ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๊ฐ์ฒด: ๋™์ผํ•œ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์—ฌ๋Ÿฌ ๋ฐ˜๋ณต์— ๊ฑธ์ณ ์žฌ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • ๋‹ค์ค‘ ๋ฐฐ๋ฆฌ์–ด: ์„œ๋กœ ๋‹ค๋ฅธ ๋™๊ธฐํ™” ์ง€์ (์ดˆ๊ธฐํ™”, ๋ฐ˜๋ณต, ๋งˆ๋ฌด๋ฆฌ)์— ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐฐ๋ฆฌ์–ด ๊ฐ์ฒด ์‚ฌ์šฉ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ํ•˜๋“œ์›จ์–ด ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ์— ์ง์ ‘ ๋งคํ•‘ํ•˜์—ฌ ๋” ๋‚˜์€ ์„ฑ๋Šฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์˜๋ฏธ๋ก : ๋ฉ”๋ชจ๋ฆฌ ๊ฐ€์‹œ์„ฑ๊ณผ ์ˆœ์„œ ๋ณด์žฅ์— ๋Œ€ํ•œ ๋ช…์‹œ์  ์ œ์–ด

๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์—์„œ๋Š” ๋ฒ„ํผ ๊ต์ฒด ๋‹จ๊ณ„ ๊ฐ„์˜ ์ •๋ฐ€ํ•œ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ barrier()๋กœ๋Š” ๋‹ค์Œ์— ํ•„์š”ํ•œ ์„ธ๋ฐ€ํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค:

  • ๋ฒ„ํผ ์—ญํ•  ๊ต๋Œ€: buffer_A์— ๋Œ€ํ•œ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ buffer_A์—์„œ ์ฝ๊ธฐ ์‹œ์ž‘๋˜๋„๋ก ๋ณด์žฅ
  • ๋ฐ˜๋ณต ๊ฒฝ๊ณ„: ๋‹จ์ผ ์ปค๋„ ๋‚ด์—์„œ ์—ฌ๋Ÿฌ ๋™๊ธฐํ™” ์ง€์  ์กฐ์œจ
  • ์ƒํƒœ ๊ด€๋ฆฌ: ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋–ค ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ–ˆ๋Š”์ง€ ์ถ”์ 
  • ์„ฑ๋Šฅ ์ตœ์ ํ™”: ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐฐ๋ฆฌ์–ด ๊ฐ์ฒด๋ฅผ ํ†ตํ•ด ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”

์ด ํผ์ฆ์€ ๋ฐ˜๋ณต๋ฒ•, ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ”„๋ ˆ์ž„์›Œํฌ, ๊ณ ์„ฑ๋Šฅ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ ๋“ฑ ์‹ค์ œ GPU ์ปดํ“จํŒ… ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

๋”๋ธ” ๋ฒ„ํผ๋ง ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ˜๋ณต ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ณต ๊ฐ„ ์•ˆ์ „ํ•œ ๋ฒ„ํผ ๊ต์ฒด๋ฅผ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋กœ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์Šคํ…์‹ค ์—ฐ์‚ฐ์€ ๋ฐฐ์—ด์˜ ๊ฐ ์š”์†Œ ๊ฐ’์„ ์ด์›ƒ ์š”์†Œ๋“ค์˜ ๊ณ ์ •๋œ ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ์—ฐ์‚ฐ ํŒจํ„ด์ž…๋‹ˆ๋‹ค.

์ฐธ๊ณ : ๋ฒ„ํผ ์—ญํ• ์ด ๊ต๋Œ€ํ•ฉ๋‹ˆ๋‹ค: buffer_A์™€ buffer_B๊ฐ€ ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ์—ฐ์‚ฐ์„ ๊ต๋Œ€ํ•˜๋ฉฐ, mbarrier ๋™๊ธฐํ™”๊ฐ€ ๋ฒ„ํผ ๊ต์ฒด ์ „์— ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ์•„ํ‚คํ…์ฒ˜: ์ด ํผ์ฆ์€ ๋‘ ๊ฐœ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„ํผ๊ฐ€ ์—ฌ๋Ÿฌ ๋ฐ˜๋ณต์— ๊ฑธ์ณ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋Œ€์ƒ์˜ ์—ญํ• ์„ ๊ต๋Œ€ํ•˜๋Š” ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ๋งŒ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹จ์ˆœํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ๊ณผ ๋‹ฌ๋ฆฌ, ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฒ„ํผ ์ „ํ™˜ ์ค‘ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•œ ์„ธ์‹ฌํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •๊ณผ ํ•จ๊ป˜ ๋ฐ˜๋ณต์  ๊ฐœ์„ ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

ํŒŒ์ดํ”„๋ผ์ธ ๊ฐœ๋…: ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ฐ˜๋ณต์  ์Šคํ…์‹ค ๊ฐœ์„ ์„ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋ฐ˜๋ณต์€ ํ•˜๋‚˜์˜ ๋ฒ„ํผ์—์„œ ์ฝ๊ณ  ๋‹ค๋ฅธ ๋ฒ„ํผ์— ์“ฐ๋ฉฐ, ๋ฒ„ํผ๋“ค์€ ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ์—ญํ• ์„ ๊ต๋Œ€ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์†์ƒ ์—†์ด ์—ฐ์† ์ฒ˜๋ฆฌ๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•‘ํ ํŒจํ„ด์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์˜์กด์„ฑ๊ณผ ๋™๊ธฐํ™”: ๊ฐ ๋ฐ˜๋ณต์€ ์ด์ „ ๋ฐ˜๋ณต์˜ ์™„์„ฑ๋œ ๊ฒฐ๊ณผ์— ์˜์กดํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฐ˜๋ณต N โ†’ ๋ฐ˜๋ณต N+1: ํ˜„์žฌ ๋ฐ˜๋ณต์ด ๋‹ค์Œ ๋ฐ˜๋ณต์ด ์†Œ๋น„ํ•˜๋Š” ๊ฐœ์„ ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑ
  • ๋ฒ„ํผ ์กฐ์ •: ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋ฒ„ํผ๊ฐ€ ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ์—ญํ• ์„ ๊ตํ™˜
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€: ์ƒˆ๋กœ ๊ธฐ๋ก๋œ ๋ฒ„ํผ์—์„œ ์ฝ๊ธฐ๋ฅผ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ

๊ตฌ์ฒด์ ์œผ๋กœ, ๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค์€ ์„ธ ๊ฐ€์ง€ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌ์„ฑ๋œ ๋ฐ˜๋ณต์  ์Šค๋ฌด๋”ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

๋ฐ˜๋ณต ํŒจํ„ด - ๋ฒ„ํผ ๊ต๋Œ€:

\[\text{Iteration } i: \begin{cases} \text{Read from buffer\_A, Write to buffer\_B} & \text{if } i \bmod 2 = 0 \\ \text{Read from buffer\_B, Write to buffer\_A} & \text{if } i \bmod 2 = 1 \end{cases}\]

์Šคํ…์‹ค ์—ฐ์‚ฐ - 3์  ํ‰๊ท :

\[S^{(i+1)}[j] = \frac{1}{N_j} \sum_{k=-1}^{1} S^{(i)}[j+k] \quad \text{where } j+k \in [0, 255]\]

์—ฌ๊ธฐ์„œ \(S^{(i)}[j]\)๋Š” ๋ฐ˜๋ณต \(i\) ์ดํ›„ ์œ„์น˜ \(j\)์—์„œ์˜ ์Šคํ…์‹ค ๊ฐ’์ด๊ณ , \(N_j\)๋Š” ์œ ํšจํ•œ ์ด์›ƒ ์ˆ˜์ž…๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •:

\[\text{mbarrier\_arrive}() \Rightarrow \text{mbarrier\_test\_wait}() \Rightarrow \text{buffer swap} \Rightarrow \text{next iteration}\]

์ตœ์ข… ์ถœ๋ ฅ ์„ ํƒ:

\[\text{Output}[j] = \begin{cases} \text{buffer\_A}[j] & \text{if STENCIL\_ITERATIONS } \bmod 2 = 0 \\ \text{buffer\_B}[j] & \text{if STENCIL\_ITERATIONS } \bmod 2 = 1 \end{cases}\]
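위 수식들(버퍼 교대, 3점 평균, 최종 버퍼 선택)을 그대로 CPU로 옮긴 파이썬 스케치입니다. 상수 입력은 평균 연산에 의해 변하지 않으므로, 샘플 출력처럼 1.0 입력이 1.0으로 유지되는지 확인할 수 있습니다.

```python
# CPU sketch of the double-buffered 3-point stencil defined by the formulas above.
TPB = 256
STENCIL_ITERATIONS = 3

def double_buffered_stencil(data):
    buffer_a = list(data)      # initialized with input
    buffer_b = [0.0] * TPB
    for it in range(STENCIL_ITERATIONS):
        # Even iteration: read A, write B; odd iteration: read B, write A
        src, dst = (buffer_a, buffer_b) if it % 2 == 0 else (buffer_b, buffer_a)
        for i in range(TPB):
            total, count = 0.0, 0
            for j in (i - 1, i, i + 1):    # 3-point stencil, in-bounds only
                if 0 <= j < TPB:
                    total += src[j]
                    count += 1
            dst[i] = total / count
        # (On the GPU, the mbarrier arrive/wait pair sits here, before roles swap.)
    # Final output selection by parity of the iteration count
    return buffer_a if STENCIL_ITERATIONS % 2 == 0 else buffer_b

print(double_buffered_stencil([1.0] * TPB)[:3])  # [1.0, 1.0, 1.0]
```

직렬 CPU 코드에서는 반복 사이의 배리어가 필요 없지만, GPU에서는 256개 스레드가 각자 `dst[i]` 하나씩을 쓰기 때문에 버퍼 역할을 교대하기 전에 모든 쓰기의 완료를 보장하는 동기화가 필요합니다.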

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ๋Š” ๋‹ค์Œ์„ ๋ฐฐ์›๋‹ˆ๋‹ค:

  • ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด ๊ตฌํ˜„
  • mbarrier API๋ฅผ ์‚ฌ์šฉํ•œ ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •
  • ๋ฐ˜๋ณต์— ๊ฑธ์ณ ๊ต๋Œ€ํ•˜๋Š” ์ฝ๊ธฐ/์“ฐ๊ธฐ ๋ฒ„ํผ ์—ญํ•  ๊ด€๋ฆฌ

ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ์—ฐ์‚ฐ ์‚ฌ์ด์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์ ์ ˆํžˆ ๋™๊ธฐํ™”๋˜์ง€ ์•Š์œผ๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ๋ฒ„ํผ ๊ต์ฒด๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์กฐ์œจํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋Œ€๋ถ€๋ถ„์˜ GPU ํŠœํ† ๋ฆฌ์–ผ์€ ๋‹จ์ˆœํ•œ ๋‹จ์ผ ํŒจ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ๋Š” ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋‹ค์ค‘ ํŒจ์Šค๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐ˜๋ณต์  ๊ฐœ์„ ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ๋”๋ธ” ๋ฒ„ํผ๋ง์€ ๊ฐ ๋ฐ˜๋ณต์ด ์ด์ „ ๋ฐ˜๋ณต์˜ ์™„์„ฑ๋œ ๊ฒฐ๊ณผ์— ์˜์กดํ•˜๋Š” ๋ฐ˜๋ณต๋ฒ•, ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํ•„ํ„ฐ, ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์—…๋ฐ์ดํŠธ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

์ด์ „ ํผ์ฆ๊ณผ ํ˜„์žฌ์˜ ๋™๊ธฐํ™” ๋น„๊ต:

  • ์ด์ „ ํผ์ฆ (P8, P12, P15): ๋‹จ์ผ ํŒจ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋‹จ์ˆœ barrier() ํ˜ธ์ถœ
  • ์ด ํผ์ฆ: ๋ฒ„ํผ ๊ต์ฒด ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ์ œ์–ด๋ฅผ ์œ„ํ•œ ๋ช…์‹œ์  mbarrier API

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ํŠนํ™”: ๊ธฐ๋ณธ์ ์ธ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”์™€ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด ์–ธ์ œ ์™„๋ฃŒ๋˜๋Š”์ง€์— ๋Œ€ํ•œ ์„ธ๋ฐ€ํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋ฉฐ, ์ด๋Š” ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

์‹œ์Šคํ…œ ๋งค๊ฐœ๋ณ€์ˆ˜:

  • ์ด๋ฏธ์ง€ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ (๊ฐ„์†Œํ™”๋ฅผ ์œ„ํ•ด 1D)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 256 ์Šค๋ ˆ๋“œ, (256, 1) ๋ธ”๋ก ์ฐจ์›์œผ๋กœ ๊ตฌ์„ฑ
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ์ „์ฒด ์ด๋ฏธ์ง€๋ฅผ ํƒ€์ผ ๋‹จ์œ„๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ (4, 1) ๋ธ”๋ก (์ด 4๊ฐœ ๋ธ”๋ก)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: ๋ชจ๋“  ์—ฐ์‚ฐ์— DType.float32

๋ฐ˜๋ณต ๋งค๊ฐœ๋ณ€์ˆ˜:

  • ์Šคํ…์‹ค ๋ฐ˜๋ณต ํšŸ์ˆ˜: STENCIL_ITERATIONS = 3 ๊ฐœ์„  ํŒจ์Šค
  • ๋ฒ„ํผ ์ˆ˜: BUFFER_COUNT = 2 (๋”๋ธ” ๋ฒ„ํผ๋ง)
  • ์Šคํ…์‹ค ์ปค๋„: ๋ฐ˜์ง€๋ฆ„ 1์˜ 3์  ํ‰๊ท 

๋ฒ„ํผ ์•„ํ‚คํ…์ฒ˜:

  • buffer_A: ์ฃผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„ํผ ([256] ์š”์†Œ)
  • buffer_B: ๋ณด์กฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„ํผ ([256] ์š”์†Œ)
  • ์—ญํ•  ๊ต๋Œ€: ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ๋ฒ„ํผ๊ฐ€ ์ฝ๊ธฐ ์†Œ์Šค์™€ ์“ฐ๊ธฐ ๋Œ€์ƒ ์‚ฌ์ด๋ฅผ ๊ต์ฒด

์ฒ˜๋ฆฌ ์š”๊ตฌ์‚ฌํ•ญ:

์ดˆ๊ธฐํ™” ๋‹จ๊ณ„:

  • ๋ฒ„ํผ ์„ค์ •: buffer_A๋ฅผ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋กœ, buffer_B๋ฅผ 0์œผ๋กœ ์ดˆ๊ธฐํ™”
  • ๋ฐฐ๋ฆฌ์–ด ์ดˆ๊ธฐํ™”: ๋™๊ธฐํ™” ์ง€์ ์„ ์œ„ํ•œ mbarrier ๊ฐ์ฒด ์„ค์ •
  • ์Šค๋ ˆ๋“œ ์กฐ์ •: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ดˆ๊ธฐํ™”์— ์ฐธ์—ฌ

๋ฐ˜๋ณต ์ฒ˜๋ฆฌ:

  • ์ง์ˆ˜ ๋ฐ˜๋ณต (0, 2, 4โ€ฆ): buffer_A์—์„œ ์ฝ๊ณ  buffer_B์— ์“ฐ๊ธฐ
  • ํ™€์ˆ˜ ๋ฐ˜๋ณต (1, 3, 5โ€ฆ): buffer_B์—์„œ ์ฝ๊ณ  buffer_A์— ์“ฐ๊ธฐ
  • ์Šคํ…์‹ค ์—ฐ์‚ฐ: 3์  ํ‰๊ท  \((\text{left} + \text{center} + \text{right}) / 3\)
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ๋ฒ„ํผ ๊ฐ€์žฅ์ž๋ฆฌ์˜ ์š”์†Œ์— ๋Œ€ํ•ด ์ ์‘์  ํ‰๊ท  ์‚ฌ์šฉ

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •:

  • mbarrier_arrive(): ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์“ฐ๊ธฐ ๋‹จ๊ณ„ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆผ
  • mbarrier_test_wait(): ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์“ฐ๊ธฐ๋ฅผ ์™„๋ฃŒํ•  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
  • ๋ฒ„ํผ ๊ต์ฒด ์•ˆ์ „์„ฑ: ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์•„์ง ์“ฐ๊ณ  ์žˆ๋Š” ๋™์•ˆ ๋ฒ„ํผ์—์„œ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
  • ๋ฐฐ๋ฆฌ์–ด ์žฌ์ดˆ๊ธฐํ™”: ๋ฐ˜๋ณต ๊ฐ„์— ๋ฐฐ๋ฆฌ์–ด ์ƒํƒœ๋ฅผ ์žฌ์„ค์ •

์ถœ๋ ฅ ๋‹จ๊ณ„:

  • ์ตœ์ข… ๋ฒ„ํผ ์„ ํƒ: ๋ฐ˜๋ณต ํšŸ์ˆ˜์˜ ํ™€์ง์— ๋”ฐ๋ผ ํ™œ์„ฑ ๋ฒ„ํผ ์„ ํƒ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ: ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๋ฐฐ์—ด์— ๋ณต์‚ฌ
  • ์™„๋ฃŒ ๋ฐฐ๋ฆฌ์–ด: ๋ธ”๋ก ์ข…๋ฃŒ ์ „ ๋ชจ๋“  ์“ฐ๊ธฐ ์™„๋ฃŒ ๋ณด์žฅ

์™„์„ฑํ•  ์ฝ”๋“œ


# Double-buffered stencil configuration
comptime STENCIL_ITERATIONS = 3
comptime BUFFER_COUNT = 2


fn double_buffered_stencil_computation[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Double-buffered stencil computation with memory barrier coordination.

    Iteratively applies 3-point stencil using alternating buffers.
    Uses mbarrier APIs for precise buffer swap coordination.
    """

    # Double-buffering: Two shared memory buffers
    buffer_A = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    buffer_B = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Memory barriers for coordinating buffer swaps
    init_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    iter_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    final_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Initialize barriers (only thread 0)
    if local_i == 0:
        mbarrier_init(init_barrier.ptr, TPB)
        mbarrier_init(iter_barrier.ptr, TPB)
        mbarrier_init(final_barrier.ptr, TPB)

    # Initialize buffer_A with input data

    # FILL ME IN (roughly 4 lines)

    # Wait for buffer_A initialization
    _ = mbarrier_arrive(init_barrier.ptr)
    _ = mbarrier_test_wait(init_barrier.ptr, TPB)

    # Iterative stencil processing with double-buffering
    @parameter
    for iteration in range(STENCIL_ITERATIONS):

        @parameter
        if iteration % 2 == 0:
            # Even iteration: Read from A, Write to B

            # FILL ME IN (roughly 12 lines)
            ...

        else:
            # Odd iteration: Read from B, Write to A

            # FILL ME IN (roughly 12 lines)
            ...

        # Memory barrier: wait for all writes before buffer swap
        _ = mbarrier_arrive(iter_barrier.ptr)
        _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

        # Reinitialize barrier for next iteration
        if local_i == 0:
            mbarrier_init(iter_barrier.ptr, TPB)

    # Write final results from active buffer
    if local_i < TPB and global_i < size:

        @parameter
        if STENCIL_ITERATIONS % 2 == 0:
            # Even iterations end in buffer_A
            output[global_i] = buffer_A[local_i]
        else:
            # Odd iterations end in buffer_B
            output[global_i] = buffer_B[local_i]

    # Final barrier
    _ = mbarrier_arrive(final_barrier.ptr)
    _ = mbarrier_test_wait(final_barrier.ptr, TPB)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p29/p29.mojo

ํŒ

๋ฒ„ํผ ์ดˆ๊ธฐํ™”

  • buffer_A๋ฅผ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ์ดˆ๊ธฐํ™”ํ•˜๊ณ , buffer_B๋Š” ๋นˆ ์ƒํƒœ๋กœ ์‹œ์ž‘ ๊ฐ€๋Šฅ
  • ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์š”์†Œ์— ๋Œ€ํ•ด ์ œ๋กœ ํŒจ๋”ฉ์„ ์‚ฌ์šฉํ•œ ์ ์ ˆํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ์Šค๋ ˆ๋“œ 0๋งŒ mbarrier ๊ฐ์ฒด๋ฅผ ์ดˆ๊ธฐํ™”ํ•ด์•ผ ํ•จ
  • ์„œ๋กœ ๋‹ค๋ฅธ ๋™๊ธฐํ™” ์ง€์ ์— ๋ณ„๋„์˜ ๋ฐฐ๋ฆฌ์–ด ์„ค์ •

๋ฐ˜๋ณต ์ œ์–ด

  • ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ๋ฅผ ์œ„ํ•ด @parameter for iteration in range(STENCIL_ITERATIONS) ์‚ฌ์šฉ
  • iteration % 2๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ฝ๊ธฐ/์“ฐ๊ธฐ ํ• ๋‹น์„ ๊ต๋Œ€ํ•˜๋ฉด์„œ ๋ฒ„ํผ ์—ญํ•  ๊ฒฐ์ •
  • ์ด์›ƒ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•ด ์œ ํšจํ•œ ๋ฒ”์œ„ ๋‚ด์—์„œ๋งŒ ์Šคํ…์‹ค ์—ฐ์‚ฐ ์ ์šฉ

์Šคํ…์‹ค ์—ฐ์‚ฐ

  • 3์  ํ‰๊ท  ๊ตฌํ˜„: (left + center + right) / 3
  • ์œ ํšจํ•œ ์ด์›ƒ๋งŒ ํ‰๊ท ์— ํฌํ•จํ•˜์—ฌ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ
  • ์—ฃ์ง€ ์ผ€์ด์Šค๋ฅผ ๋งค๋„๋Ÿฝ๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ ์‘์  ์นด์šดํŒ… ์‚ฌ์šฉ

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •

  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์“ฐ๊ธฐ ์—ฐ์‚ฐ์„ ์™„๋ฃŒํ•œ ํ›„ mbarrier_arrive() ํ˜ธ์ถœ
  • ๋ฒ„ํผ ๊ต์ฒด ์ „ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์™„๋ฃŒํ•˜๋„๋ก mbarrier_test_wait() ์‚ฌ์šฉ
  • ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๋ฐ˜๋ณต ๊ฐ„์— ๋ฐฐ๋ฆฌ์–ด ์žฌ์ดˆ๊ธฐํ™”: mbarrier_init()
  • ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ์Šค๋ ˆ๋“œ 0๋งŒ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์žฌ์ดˆ๊ธฐํ™”

์ถœ๋ ฅ ์„ ํƒ

  • STENCIL_ITERATIONS % 2๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ตœ์ข… ํ™œ์„ฑ ๋ฒ„ํผ ์„ ํƒ
  • ์ง์ˆ˜ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋Š” buffer_A์— ๋ฐ์ดํ„ฐ๊ฐ€ ๋‚จ์Œ
  • ํ™€์ˆ˜ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋Š” buffer_B์— ๋ฐ์ดํ„ฐ๊ฐ€ ๋‚จ์Œ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธ€๋กœ๋ฒŒ ์ถœ๋ ฅ์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค:

pixi run p29 --double-buffer
pixi run -e amd p29 --double-buffer
uv run poe p29 --double-buffer

ํผ์ฆ์„ ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒํ•˜๋ฉด ๋‹ค์Œ๊ณผ ์œ ์‚ฌํ•œ ์ถœ๋ ฅ์ด ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค:

Puzzle 29: GPU Synchronization Primitives
==================================================
TPB: 256
SIZE: 1024
STENCIL_ITERATIONS: 3
BUFFER_COUNT: 2

Testing Puzzle 29B: Double-Buffered Stencil Computation
============================================================
Double-buffered stencil completed
Input sample: 1.0 1.0 1.0
GPU output sample: 1.0 1.0 1.0
โœ… Double-buffered stencil test PASSED!

์†”๋ฃจ์…˜

fn double_buffered_stencil_computation[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Double-buffered stencil computation with memory barrier coordination.

    Iteratively applies 3-point stencil using alternating buffers.
    Uses mbarrier APIs for precise buffer swap coordination.
    """

    # Double-buffering: Two shared memory buffers
    buffer_A = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    buffer_B = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Memory barriers for coordinating buffer swaps
    init_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    iter_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    final_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # Initialize barriers (only thread 0)
    if local_i == 0:
        mbarrier_init(init_barrier.ptr, TPB)
        mbarrier_init(iter_barrier.ptr, TPB)
        mbarrier_init(final_barrier.ptr, TPB)

    # Initialize buffer_A with input data
    if local_i < TPB and global_i < size:
        buffer_A[local_i] = input[global_i]
    else:
        buffer_A[local_i] = 0.0

    # Wait for buffer_A initialization
    _ = mbarrier_arrive(init_barrier.ptr)
    _ = mbarrier_test_wait(init_barrier.ptr, TPB)

    # Iterative stencil processing with double-buffering
    @parameter
    for iteration in range(STENCIL_ITERATIONS):

        @parameter
        if iteration % 2 == 0:
            # Even iteration: Read from A, Write to B
            if local_i < TPB:
                var stencil_sum: Scalar[dtype] = 0.0
                var stencil_count: Int = 0

                # 3-point stencil: [i-1, i, i+1]
                for offset in range(-1, 2):
                    sample_idx = local_i + offset
                    if sample_idx >= 0 and sample_idx < TPB:
                        stencil_sum += rebind[Scalar[dtype]](
                            buffer_A[sample_idx]
                        )
                        stencil_count += 1

                if stencil_count > 0:
                    buffer_B[local_i] = stencil_sum / stencil_count
                else:
                    buffer_B[local_i] = buffer_A[local_i]

        else:
            # Odd iteration: Read from B, Write to A
            if local_i < TPB:
                var stencil_sum: Scalar[dtype] = 0.0
                var stencil_count: Int = 0

                # 3-point stencil: [i-1, i, i+1]
                for offset in range(-1, 2):
                    sample_idx = local_i + offset
                    if sample_idx >= 0 and sample_idx < TPB:
                        stencil_sum += rebind[Scalar[dtype]](
                            buffer_B[sample_idx]
                        )
                        stencil_count += 1

                if stencil_count > 0:
                    buffer_A[local_i] = stencil_sum / stencil_count
                else:
                    buffer_A[local_i] = buffer_B[local_i]

        # Memory barrier: wait for all writes before buffer swap
        _ = mbarrier_arrive(iter_barrier.ptr)
        _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

        # Reinitialize barrier for next iteration
        if local_i == 0:
            mbarrier_init(iter_barrier.ptr, TPB)

    # Write final results from active buffer
    if local_i < TPB and global_i < size:

        @parameter
        if STENCIL_ITERATIONS % 2 == 0:
            # Even iterations end in buffer_A
            output[global_i] = buffer_A[local_i]
        else:
            # Odd iterations end in buffer_B
            output[global_i] = buffer_B[local_i]

    # Final barrier
    _ = mbarrier_arrive(final_barrier.ptr)
    _ = mbarrier_test_wait(final_barrier.ptr, TPB)


ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ด๊ฒƒ์ด ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์„ ์‚ฌ์šฉํ•˜๋Š” ๋”๋ธ” ๋ฒ„ํผ๋ง ์•„ํ‚คํ…์ฒ˜ ๋ฌธ์ œ์ž„์„ ์ธ์‹ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. ๊ต๋Œ€ํ•˜๋Š” ๋ฒ„ํผ ์—ญํ•  ์„ค๊ณ„: ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ์ฝ๊ธฐ/์“ฐ๊ธฐ ์ฑ…์ž„์„ ๊ตํ™˜
  2. ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ตฌํ˜„: ์ •๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด๋ฅผ ์œ„ํ•ด mbarrier API ์‚ฌ์šฉ
  3. ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ ์กฐ์œจ: ๋ฒ„ํผ ๊ต์ฒด ์ „ ๋ฐ˜๋ณต ๊ฒฐ๊ณผ๊ฐ€ ์™„์ „ํžˆ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”: ๋ชจ๋“  ์ฒ˜๋ฆฌ๋ฅผ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ˆ˜ํ–‰

์ƒ์„ธ ์„ค๋ช…์ด ํฌํ•จ๋œ ์ „์ฒด ์†”๋ฃจ์…˜

๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์†”๋ฃจ์…˜์€ ์ •๊ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •๊ณผ ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ์•ˆ์ „ํ•œ ๋ฐ˜๋ณต์  ๊ฐœ์„  ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋”๋ธ” ๋ฒ„ํผ๋ง ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„

์ด ํผ์ฆ์˜ ๊ทผ๋ณธ์ ์ธ ๋ŒํŒŒ๊ตฌ๋Š” ๋‹จ์ˆœํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๊ฐ€ ์•„๋‹Œ ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์ œ์–ด์ž…๋‹ˆ๋‹ค:

์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹: ๋‹จ์ˆœํ•œ ์Šค๋ ˆ๋“œ ์กฐ์ •์„ ์œ„ํ•ด ๊ธฐ๋ณธ barrier() ์‚ฌ์šฉ

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์‹คํ–‰
  • ๋‹จ์ผ ๋ฐฐ๋ฆฌ์–ด ํ˜ธ์ถœ๋กœ ์Šค๋ ˆ๋“œ ์™„๋ฃŒ๋ฅผ ๋™๊ธฐํ™”
  • ํŠน์ • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ œ์–ด ์—†์Œ

์ด ํผ์ฆ์˜ ํ˜์‹ : ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋กœ ์กฐ์ •๋˜๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฒ„ํผ ์—ญํ• 

  • buffer_A์™€ buffer_B๊ฐ€ ์ฝ๊ธฐ ์†Œ์Šค์™€ ์“ฐ๊ธฐ ๋Œ€์ƒ ์‚ฌ์ด๋ฅผ ๊ต๋Œ€
  • mbarrier API๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ ์™„๋ฃŒ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณต
  • ๋ช…์‹œ์  ์กฐ์ •์œผ๋กœ ๋ฒ„ํผ ์ „ํ™˜ ์ค‘ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€

๋ฐ˜๋ณต ์ฒ˜๋ฆฌ ์กฐ์œจ

๋‹จ์ผ ํŒจ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ ์‹ ์ค‘ํ•œ ๋ฒ„ํผ ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฐ˜๋ณต์  ๊ฐœ์„ ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฐ˜๋ณต 0: buffer_A์—์„œ ์ฝ๊ธฐ (์ž…๋ ฅ์œผ๋กœ ์ดˆ๊ธฐํ™”๋จ), buffer_B์— ์“ฐ๊ธฐ
  • ๋ฐ˜๋ณต 1: buffer_B์—์„œ ์ฝ๊ธฐ (์ด์ „ ๊ฒฐ๊ณผ), buffer_A์— ์“ฐ๊ธฐ
  • ๋ฐ˜๋ณต 2: buffer_A์—์„œ ์ฝ๊ธฐ (์ด์ „ ๊ฒฐ๊ณผ), buffer_B์— ์“ฐ๊ธฐ
  • ๊ต๋Œ€ ๊ณ„์†: ๊ฐ ๋ฐ˜๋ณต์ด ์ด์ „ ๋ฐ˜๋ณต์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ฐœ์„ 

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด API ์‚ฌ์šฉ๋ฒ•

mbarrier ์กฐ์ • ํŒจํ„ด์˜ ์ดํ•ด:

  • mbarrier_init(): ํŠน์ • ์Šค๋ ˆ๋“œ ์ˆ˜(TPB)๋ฅผ ์ง€์ •ํ•˜์—ฌ ๋ฐฐ๋ฆฌ์–ด ์ดˆ๊ธฐํ™”
  • mbarrier_arrive(): ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ ๋‹จ๊ณ„ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆผ
  • mbarrier_test_wait(): ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆด ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
  • ์žฌ์ดˆ๊ธฐํ™”: ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๋ฐ˜๋ณต ๊ฐ„์— ๋ฐฐ๋ฆฌ์–ด ์ƒํƒœ๋ฅผ ์žฌ์„ค์ •

ํ•ต์‹ฌ ํƒ€์ด๋ฐ ์ˆœ์„œ:

  1. ๋ชจ๋“  ์Šค๋ ˆ๋“œ ์“ฐ๊ธฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ• ๋‹น๋œ ๋ฒ„ํผ ์š”์†Œ๋ฅผ ์—…๋ฐ์ดํŠธ
  2. ์™„๋ฃŒ ์•Œ๋ฆผ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ mbarrier_arrive() ํ˜ธ์ถœ
  3. ์ „์ฒด ๋Œ€๊ธฐ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ mbarrier_test_wait() ํ˜ธ์ถœ
  4. ์ง„ํ–‰ ์•ˆ์ „: ์ด์ œ ๋‹ค์Œ ๋ฐ˜๋ณต์„ ์œ„ํ•ด ๋ฒ„ํผ ์—ญํ• ์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ต์ฒด ๊ฐ€๋Šฅ
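์œ„ ํƒ€์ด๋ฐ ์ˆœ์„œ๋ฅผ Python์˜ threading.Barrier๋กœ ํ‰๋‚ด ๋‚ธ ๊ฐœ๋… ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. Barrier.wait() ํ•œ ๋ฒˆ์ด mbarrier_arrive()์™€ mbarrier_test_wait()์˜ ์—ญํ• ์„ ํ•จ๊ป˜ ์ˆ˜ํ–‰ํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ ๋น„์œ ์ผ ๋ฟ, ์‹ค์ œ GPU API๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค:

```python
import threading

TPB = 4  # ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (์˜ˆ์‹œ์šฉ ์ž‘์€ ๊ฐ’)
buffer_a = [1.0, 2.0, 3.0, 4.0]   # ์ฝ๊ธฐ ์†Œ์Šค
buffer_b = [0.0] * TPB            # ์“ฐ๊ธฐ ๋Œ€์ƒ
results = [0.0] * TPB
barrier = threading.Barrier(TPB)  # mbarrier_init(barrier, TPB)์— ํ•ด๋‹น

def worker(tid: int) -> None:
    # 1๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ• ๋‹น๋œ ๋ฒ„ํผ ์š”์†Œ๋ฅผ ์—…๋ฐ์ดํŠธ
    buffer_b[tid] = buffer_a[tid] * 2.0
    # 2~3๋‹จ๊ณ„: ์™„๋ฃŒ ์•Œ๋ฆผ + ์ „์ฒด ๋Œ€๊ธฐ (arrive + test_wait์˜ ๋น„์œ )
    barrier.wait()
    # 4๋‹จ๊ณ„: ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚ฌ์œผ๋ฏ€๋กœ ์ด์›ƒ ์š”์†Œ๋ฅผ ์ฝ์–ด๋„ ์•ˆ์ „
    left = buffer_b[(tid - 1) % TPB]
    results[tid] = left + buffer_b[tid]

threads = [threading.Thread(target=worker, args=(t,)) for t in range(TPB)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [10.0, 6.0, 10.0, 14.0]
```

barrier.wait()๊ฐ€ ์—†๋‹ค๋ฉด ์–ด๋–ค ์Šค๋ ˆ๋“œ๋Š” ์ด์›ƒ์˜ ์“ฐ๊ธฐ๊ฐ€ ๋๋‚˜๊ธฐ ์ „์— buffer_b๋ฅผ ์ฝ์„ ์ˆ˜ ์žˆ์–ด ๊ฒฐ๊ณผ๊ฐ€ ๋น„๊ฒฐ์ •์ ์ด ๋ฉ๋‹ˆ๋‹ค.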

์Šคํ…์‹ค ์—ฐ์‚ฐ ๋ฉ”์ปค๋‹ˆ์ฆ˜

์ ์‘์  ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ๋ฅผ ํฌํ•จํ•œ 3์  ์Šคํ…์‹ค ์—ฐ์‚ฐ:

๋‚ด๋ถ€ ์š”์†Œ (์ธ๋ฑ์Šค 1๋ถ€ํ„ฐ 254):

# ์™ผ์ชฝ, ์ค‘์‹ฌ, ์˜ค๋ฅธ์ชฝ ์ด์›ƒ๊ณผ์˜ ํ‰๊ท 
stencil_sum = buffer[i-1] + buffer[i] + buffer[i+1]
result[i] = stencil_sum / 3.0

๊ฒฝ๊ณ„ ์š”์†Œ (์ธ๋ฑ์Šค 0๊ณผ 255):

# ์œ ํšจํ•œ ์ด์›ƒ๋งŒ ํ‰๊ท ์— ํฌํ•จ
stencil_count = 0
for neighbor in valid_neighbors:
    stencil_sum += buffer[neighbor]
    stencil_count += 1
result[i] = stencil_sum / stencil_count
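๋‚ด๋ถ€/๊ฒฝ๊ณ„ ๋‘ ๊ฒฝ์šฐ๋ฅผ ํ•ฉ์นœ 3์  ์Šคํ…์‹ค ํ•œ ๋ฐ˜๋ณต์˜ ์ˆœ์ˆ˜ Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (GPU ์ปค๋„์ด ์•„๋‹Œ ์ฐธ์กฐ ๊ตฌํ˜„์ด๋ผ๋Š” ๊ฐ€์ • ํ•˜์— ์ž‘์„ฑ):

```python
def stencil_step(buffer: list[float]) -> list[float]:
    n = len(buffer)
    result = [0.0] * n
    for i in range(n):
        total, count = 0.0, 0
        for j in (i - 1, i, i + 1):   # ์™ผ์ชฝ, ์ค‘์‹ฌ, ์˜ค๋ฅธ์ชฝ ์ด์›ƒ
            if 0 <= j < n:            # ์œ ํšจํ•œ ์ด์›ƒ๋งŒ ํฌํ•จ
                total += buffer[j]
                count += 1
        result[i] = total / count     # ๋‚ด๋ถ€๋Š” 3์œผ๋กœ, ๊ฒฝ๊ณ„๋Š” 2๋กœ ๋‚˜๋ˆ”
    return result

print(stencil_step([0.0, 3.0, 6.0, 9.0]))  # [1.5, 3.0, 6.0, 7.5]
```

๊ฒฝ๊ณ„(์ธ๋ฑ์Šค 0๊ณผ n-1)์—์„œ๋Š” ์ด์›ƒ์ด 2๊ฐœ๋ฟ์ด๋ฏ€๋กœ ๋ถ„๋ชจ๊ฐ€ ์ž๋™์œผ๋กœ 2๊ฐ€ ๋˜๋Š” ๊ฒƒ์ด ์ ์‘์  ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค.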

๋ฒ„ํผ ์—ญํ•  ๊ต๋Œ€

ํ•‘ํ ๋ฒ„ํผ ํŒจํ„ด์ด ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

์ง์ˆ˜ ๋ฐ˜๋ณต (0, 2, 4โ€ฆ):

  • ์ฝ๊ธฐ ์†Œ์Šค: buffer_A์— ํ˜„์žฌ ๋ฐ์ดํ„ฐ ํฌํ•จ
  • ์“ฐ๊ธฐ ๋Œ€์ƒ: buffer_B๊ฐ€ ์—…๋ฐ์ดํŠธ๋œ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 
  • ๋ฉ”๋ชจ๋ฆฌ ํ๋ฆ„: buffer_A โ†’ ์Šคํ…์‹ค ์—ฐ์‚ฐ โ†’ buffer_B

ํ™€์ˆ˜ ๋ฐ˜๋ณต (1, 3, 5โ€ฆ):

  • ์ฝ๊ธฐ ์†Œ์Šค: buffer_B์— ํ˜„์žฌ ๋ฐ์ดํ„ฐ ํฌํ•จ
  • ์“ฐ๊ธฐ ๋Œ€์ƒ: buffer_A๊ฐ€ ์—…๋ฐ์ดํŠธ๋œ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 
  • ๋ฉ”๋ชจ๋ฆฌ ํ๋ฆ„: buffer_B โ†’ ์Šคํ…์‹ค ์—ฐ์‚ฐ โ†’ buffer_A

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ์—ฌ๋Ÿฌ ์œ ํ˜•์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:

๋ฐฐ๋ฆฌ์–ด ์—†์ด (์ž˜๋ชป๋œ ๊ฒฝ์šฐ):

# ์Šค๋ ˆ๋“œ A๊ฐ€ buffer_B[10]์— ์“ฐ๊ธฐ
buffer_B[10] = stencil_result_A

# ์Šค๋ ˆ๋“œ B๊ฐ€ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์œ„ํ•ด buffer_B[10]์„ ์ฆ‰์‹œ ์ฝ๊ธฐ
# ๊ฒฝ์Ÿ ์ƒํƒœ: ์Šค๋ ˆ๋“œ B๊ฐ€ ์Šค๋ ˆ๋“œ A์˜ ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๊ธฐ ์ „์— ์ด์ „ ๊ฐ’์„ ์ฝ์„ ์ˆ˜ ์žˆ์Œ
stencil_input = buffer_B[10]  # ๋ฏธ์ •์˜ ๋™์ž‘!

๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ (์˜ฌ๋ฐ”๋ฅธ ๊ฒฝ์šฐ):

# ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ์“ฐ๊ธฐ
buffer_B[local_i] = stencil_result

# ์“ฐ๊ธฐ ์™„๋ฃŒ ์•Œ๋ฆผ
mbarrier_arrive(barrier)

# ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ ์™„๋ฃŒ๊นŒ์ง€ ๋Œ€๊ธฐ
mbarrier_test_wait(barrier, TPB)

# ์ด์ œ ์ฝ๊ธฐ ์•ˆ์ „ - ๋ชจ๋“  ์“ฐ๊ธฐ ์™„๋ฃŒ ๋ณด์žฅ
stencil_input = buffer_B[neighbor_index]  # ํ•ญ์ƒ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ์ฝ์Œ

์ถœ๋ ฅ ๋ฒ„ํผ ์„ ํƒ

์ตœ์ข… ๊ฒฐ๊ณผ ์œ„์น˜๋Š” ๋ฐ˜๋ณต ํšŸ์ˆ˜์˜ ํ™€์ง์— ๋”ฐ๋ผ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค:

์ˆ˜ํ•™์  ๊ฒฐ์ •:

  • STENCIL_ITERATIONS = 3 (ํ™€์ˆ˜)
  • ์ตœ์ข… ํ™œ์„ฑ ๋ฒ„ํผ: ๋ฐ˜๋ณต 2๊ฐ€ buffer_B์— ์“ฐ๊ธฐ
  • ์ถœ๋ ฅ ์†Œ์Šค: buffer_B์—์„œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋ณต์‚ฌ

๊ตฌํ˜„ ํŒจํ„ด:

@parameter
if STENCIL_ITERATIONS % 2 == 0:
    # ์ง์ˆ˜ ์ด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋Š” buffer_A์—์„œ ์ข…๋ฃŒ
    output[global_i] = buffer_A[local_i]
else:
    # ํ™€์ˆ˜ ์ด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋Š” buffer_B์—์„œ ์ข…๋ฃŒ
    output[global_i] = buffer_B[local_i]
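ํ•‘ํ ๊ต๋Œ€์™€ ํ™€์ง์— ๋”ฐ๋ฅธ ์ตœ์ข… ๋ฒ„ํผ ์„ ํƒ์„ ์ˆœ์ฐจ Python์œผ๋กœ ๊ฒ€์ฆํ•˜๋Š” ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. smooth()๋Š” ์ž„์˜์˜ ํ•œ ๋ฐ˜๋ณต ์—ฐ์‚ฐ์„ ๋Œ€์‹ ํ•˜๋Š” ๊ฐ€์ƒ์˜ ๋”๋ฏธ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค:

```python
STENCIL_ITERATIONS = 3  # ๋ณธ๋ฌธ๊ณผ ๊ฐ™์€ ํ™€์ˆ˜ ๋ฐ˜๋ณต ํšŸ์ˆ˜

def smooth(src: list[float]) -> list[float]:
    # ๊ฐ€์ƒ์˜ ์˜ˆ์‹œ ์—ฐ์‚ฐ: ๋ช‡ ๋ฒˆ ๋ฐ˜๋ณต๋๋Š”์ง€ ์ถ”์ ํ•˜๊ธฐ ์‰ฝ๋„๋ก ๊ฐ ์š”์†Œ์— 1์„ ๋”ํ•จ
    return [x + 1.0 for x in src]

buffer_a = [0.0, 0.0]  # ์ž…๋ ฅ์œผ๋กœ ์ดˆ๊ธฐํ™”๋œ ๋ฒ„ํผ
buffer_b = [0.0, 0.0]
for it in range(STENCIL_ITERATIONS):
    if it % 2 == 0:                    # ์ง์ˆ˜ ๋ฐ˜๋ณต: A -> B
        buffer_b = smooth(buffer_a)
    else:                              # ํ™€์ˆ˜ ๋ฐ˜๋ณต: B -> A
        buffer_a = smooth(buffer_b)

# ์ด ๋ฐ˜๋ณต ํšŸ์ˆ˜๊ฐ€ ์ง์ˆ˜๋ฉด buffer_A, ํ™€์ˆ˜๋ฉด buffer_B์— ์ตœ์ข… ๊ฒฐ๊ณผ๊ฐ€ ๋‚จ์Œ
final = buffer_a if STENCIL_ITERATIONS % 2 == 0 else buffer_b
print(final)  # [3.0, 3.0]: ์„ธ ๋ฒˆ ๋ฐ˜๋ณต๋œ ๊ฒฐ๊ณผ๊ฐ€ buffer_b์— ์žˆ์Œ
```

@parameter if๋Š” ์ด ํ™€์ง ํŒ๋‹จ์„ ์ปดํŒŒ์ผ ํƒ€์ž„์— ํ•ด์„ํ•˜๋ฏ€๋กœ ๋Ÿฐํƒ€์ž„ ๋ถ„๊ธฐ ๋น„์šฉ์ด ์—†๋‹ค๋Š” ์ ๋งŒ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.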

์„ฑ๋Šฅ ํŠน์„ฑ

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: ์ž…๋ ฅ ๋กœ๋”ฉ๊ณผ ์ตœ์ข… ์ถœ๋ ฅ์—๋งŒ ์ ‘๊ทผ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ชจ๋“  ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ์— ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
  • ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ค‘์‹ฌ์œผ๋กœ ์ตœ์†Œํ™”

๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ:

  • mbarrier ๋น„์šฉ: ๊ธฐ๋ณธ barrier()๋ณด๋‹ค ๋†’์ง€๋งŒ ํ•„์ˆ˜์ ์ธ ์ œ์–ด๋ฅผ ์ œ๊ณต
  • ๋ฐ˜๋ณต ํ™•์žฅ์„ฑ: ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐ˜๋ณต ํšŸ์ˆ˜์— ๋น„๋ก€ํ•˜์—ฌ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€
  • ์Šค๋ ˆ๋“œ ํšจ์œจ์„ฑ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌ ์ „๋ฐ˜์— ๊ฑธ์ณ ํ™œ์„ฑ ์ƒํƒœ ์œ ์ง€

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ

์ด ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

๋ฐ˜๋ณต๋ฒ•:

  • ์„ ํ˜• ์‹œ์Šคํ…œ์„ ์œ„ํ•œ Gauss-Seidel ๋ฐ Jacobi ๋ฐฉ๋ฒ•
  • ์ˆ˜์น˜ ์ •ํ™•๋„๋ฅผ ์œ„ํ•œ ๋ฐ˜๋ณต์  ๊ฐœ์„ 
  • ๋ ˆ๋ฒจ๋ณ„ ์ฒ˜๋ฆฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‹ค์ค‘ ๊ทธ๋ฆฌ๋“œ ๋ฐฉ๋ฒ•

์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ:

  • ๋‹ค์ค‘ ํŒจ์Šค ํ•„ํ„ฐ (์–‘์ธก, ์œ ๋„, ์—ฃ์ง€ ๋ณด์กด)
  • ๋ฐ˜๋ณต์  ๋””๋…ธ์ด์ง• ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ์—ด ํ™•์‚ฐ๊ณผ ์ด๋ฐฉ์„ฑ ์Šค๋ฌด๋”ฉ

์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ์ƒํƒœ ์ง„ํ™”๋ฅผ ๊ฐ€์ง„ ์…€๋ฃฐ๋Ÿฌ ์˜คํ† ๋งˆํƒ€
  • ์œ„์น˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์ˆ˜๋ฐ˜ํ•˜๋Š” ์ž…์ž ์‹œ์Šคํ…œ
  • ๋ฐ˜๋ณต์  ์••๋ ฅ ์†”๋น™์„ ์‚ฌ์šฉํ•œ ์œ ์ฒด ์—ญํ•™

ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์ฒ ํ•™:

  • ๋ช…์‹œ์  ์ œ์–ด: ์ž๋™ ๋™๊ธฐํ™” ๋Œ€๋น„ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ํƒ€์ด๋ฐ ์ œ์–ด
  • ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ๊ต๋Œ€ํ•˜๋Š” ์ฝ๊ธฐ/์“ฐ๊ธฐ ํŒจํ„ด์„ ๊ฐ€์ง„ ๋ชจ๋“  ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜
  • ์„ฑ๋Šฅ ์ ˆ์ถฉ: ๋ณด์žฅ๋œ ์ •ํ™•์„ฑ์„ ์œ„ํ•œ ๋” ๋†’์€ ๋™๊ธฐํ™” ๋น„์šฉ

๋”๋ธ” ๋ฒ„ํผ๋ง์˜ ์ด์ :

  • ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ: ์“ฐ๊ธฐ ์ค‘ ์ฝ๊ธฐ hazard ์ œ๊ฑฐ
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ช…ํ™•์„ฑ: ํ˜„์žฌ์™€ ๋‹ค์Œ ๋ฐ˜๋ณต ์ƒํƒœ ๊ฐ„์˜ ๊น”๋”ํ•œ ๋ถ„๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ค‘๊ฐ„ ์ €์žฅ์†Œ ๋ถˆํ•„์š”

๋ฐ˜๋ณต ๊ด€๋ฆฌ:

  • ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ: @parameter for๊ฐ€ ์ตœ์ ํ™” ๊ธฐํšŒ๋ฅผ ์ œ๊ณต
  • ์ƒํƒœ ์ถ”์ : ๋ฒ„ํผ ์—ญํ•  ๊ต๋Œ€๊ฐ€ ๊ฒฐ์ •์ ์ด์–ด์•ผ ํ•จ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ ์‘์  ์Šคํ…์‹ค ์—ฐ์‚ฐ์ด ์—ฃ์ง€ ์ผ€์ด์Šค๋ฅผ ๋งค๋„๋Ÿฝ๊ฒŒ ์ฒ˜๋ฆฌ

์ด ์†”๋ฃจ์…˜์€ ์ •๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ๋ฐ˜๋ณต GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ๋ฃจํ”„๋ฅผ ๋„˜์–ด ์‹ค์ œ ์ˆ˜์น˜ ์†Œํ”„ํŠธ์›จ์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ •๊ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

Puzzle 30: GPU ํ”„๋กœํŒŒ์ผ๋ง

์˜ฌ๋ฐ”๋ฅธ ์ฝ”๋“œ, ๊ทธ ๋„ˆ๋จธ๋กœ

์ฐธ๊ณ : ์ด ํŒŒํŠธ๋Š” ํ˜ธํ™˜๋˜๋Š” NVIDIA GPU ์ „์šฉ์ž…๋‹ˆ๋‹ค

์ด ์ฑ•ํ„ฐ์—์„œ๋Š” ๋™์ž‘ํ•˜๋Š” GPU ์ฝ”๋“œ๋ฅผ ๊ณ ์„ฑ๋Šฅ ์ฝ”๋“œ๋กœ ํƒˆ๋ฐ”๊ฟˆ์‹œํ‚ค๋Š” ์ฒด๊ณ„์  ์„ฑ๋Šฅ ๋ถ„์„์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ •ํ™•์„ฑ๊ณผ GPU ๊ธฐ๋Šฅ์— ์ง‘์ค‘ํ–ˆ๋˜ ์ด์ „ ํผ์ฆ๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ์—ฌ๊ธฐ์„œ๋Š” ์‹ค๋ฌด GPU ์†Œํ”„ํŠธ์›จ์–ด ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ:

  • ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ: ์ข…ํ•ฉ์ ์ธ ์„ฑ๋Šฅ ๋ถ„์„์„ ์œ„ํ•œ NSight Systems์™€ NSight Compute
  • ์„ฑ๋Šฅ ํƒ์ • ์ž‘์—…: ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ชฉ๊ณผ ์ตœ์ ํ™” ๊ธฐํšŒ ํŒŒ์•…
  • ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ํ†ต์ฐฐ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ๊ทน์ ์ธ ์˜ํ–ฅ ์ดํ•ด
  • ๋ฐ˜์ง๊ด€์ ์ธ ๋ฐœ๊ฒฌ: โ€œ์ข‹์•„ ๋ณด์ด๋Š”โ€ ์ง€ํ‘œ๊ฐ€ ์˜คํžˆ๋ ค ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋Š” ๊ฒฝ์šฐ
  • ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”: ๊ฐ€์ •์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ์ ํ™” ํŒ๋‹จ

์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋Œ€๋ถ€๋ถ„์˜ GPU ํŠœํ† ๋ฆฌ์–ผ์€ ๊ธฐ๋ณธ์ ์ธ ์„ฑ๋Šฅ ๊ฐœ๋…๋งŒ ๊ฐ€๋ฅด์น˜์ง€๋งŒ, ์‹ค์ œ GPU ๊ฐœ๋ฐœ์—์„œ๋Š” ์‹ค์งˆ์ ์ธ ๋ณ‘๋ชฉ์„ ์ฐพ์•„๋‚ด๊ณ , ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ๋™์ž‘์„ ์ดํ•ดํ•˜๋ฉฐ, ๊ทผ๊ฑฐ ์žˆ๋Š” ์ตœ์ ํ™” ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์—ญ๋Ÿ‰์ด ํ•™์ˆ ์  ์˜ˆ์ œ์™€ ์‹ค๋ฌด GPU ์ปดํ“จํŒ… ์‚ฌ์ด์˜ ๊ฒฉ์ฐจ๋ฅผ ๋ฉ”์›Œ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

GPU ์„ฑ๋Šฅ ํ”„๋กœํŒŒ์ผ๋ง์€ ์ฒด๊ณ„์  ๋ถ„์„์„ ํ†ตํ•ด ์˜ฌ๋ฐ”๋ฅธ ์ฝ”๋“œ๋ฅผ ๊ณ ์„ฑ๋Šฅ ์ฝ”๋“œ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ฑ•ํ„ฐ์—์„œ๋Š” ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ์™€ ํƒ์ • ๋ฐฉ๋ฒ•๋ก ์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต ๋ชฉํ‘œ:

  • ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ ์„ ํƒ๋ฒ• ํ•™์Šต - NSight Systems์™€ NSight Compute๋ฅผ ์–ธ์ œ ์‚ฌ์šฉํ•˜๋Š”์ง€ ์ดํ•ด
  • ์„ฑ๋Šฅ ํƒ์ • ๋Šฅ๋ ฅ ๊ฐœ๋ฐœ - ์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ถœ๋ ฅ์„ ํ™œ์šฉํ•˜์—ฌ ๋ณ‘๋ชฉ ์‹๋ณ„
  • ๋ฐ˜์ง๊ด€์ ์ธ ํ†ต์ฐฐ ๋ฐœ๊ฒฌ - GPU ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ๊ณผ ์บ์‹ฑ ๋™์ž‘์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด ์‹œ๊ฐ
  • ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™” ํ•™์Šต - ๊ฐ€์ •์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ์ ํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ:

  • NSight Systems (nsys): CPU-GPU ์กฐ์œจ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์œ„ํ•œ ์‹œ์Šคํ…œ ์ „์ฒด ํƒ€์ž„๋ผ์ธ ๋ถ„์„
  • NSight Compute (ncu): ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ๊ณผ ์—ฐ์‚ฐ ํ™œ์šฉ๋„๋ฅผ ์œ„ํ•œ ์ƒ์„ธ ์ปค๋„ ๋ถ„์„
  • ์ฒด๊ณ„์  ๋ฐฉ๋ฒ•๋ก : ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ๋ณ‘๋ชฉ ์‹๋ณ„๊ณผ ์ตœ์ ํ™” ๊ฒ€์ฆ

๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋  ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๋ฐ˜์ง๊ด€์  ๋™์ž‘: ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์ด ์‹ค์ œ๋กœ๋Š” ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒฝ์šฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด: ๋ณ‘ํ•ฉ์ด ๋Œ€์—ญํญ ํ™œ์šฉ์— ๋ฏธ์น˜๋Š” ๊ทน์ ์ธ ์˜ํ–ฅ
  • ๋„๊ตฌ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”: ์„ฑ๋Šฅ ๊ฐ€์ •์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์˜์‚ฌ๊ฒฐ์ •

๊ตฌ์„ฑ

์š”๊ตฌ ์‚ฌํ•ญ:

  • NVIDIA GPU: ํ”„๋กœํŒŒ์ผ๋ง์ด ํ™œ์„ฑํ™”๋œ CUDA ํ˜ธํ™˜ ํ•˜๋“œ์›จ์–ด
  • CUDA Toolkit: NSight Systems ๋ฐ NSight Compute ๋„๊ตฌ
  • ๋นŒ๋“œ ์„ค์ •: ๋””๋ฒ„๊ทธ ์ •๋ณด๊ฐ€ ํฌํ•จ๋œ ์ตœ์ ํ™” ์ฝ”๋“œ (--debug-level=full)

๋ฐฉ๋ฒ•๋ก :

  1. NSight Systems๋ฅผ ํ™œ์šฉํ•œ ์‹œ์Šคํ…œ ์ „์ฒด ๋ถ„์„์œผ๋กœ ์ฃผ์š” ๋ณ‘๋ชฉ ์‹๋ณ„
  2. NSight Compute๋ฅผ ํ™œ์šฉํ•œ ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ๋ถ„์„
  3. ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ๊ฒฐ๋ก ์œผ๋กœ ์ตœ์ ํ™” ๋ฐฉํ–ฅ ๋„์ถœ

ํผ์ฆ ๊ตฌ์„ฑ

์ด ์ฑ•ํ„ฐ๋Š” ์„œ๋กœ ์—ฐ๊ฒฐ๋˜์–ด ์ ์ง„์ ์œผ๋กœ ๋ฐœ์ „ํ•˜๋Š” ๋‘ ๊ฐœ์˜ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค:

NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ

์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•œ ์‹ค์Šต ์˜ˆ์ œ๋ฅผ ํ†ตํ•ด NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ์ƒํƒœ๊ณ„์˜ ํ•ต์‹ฌ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ:

  • ์‹œ์Šคํ…œ ์ „์ฒด ํƒ€์ž„๋ผ์ธ ๋ถ„์„๊ณผ ๋ณ‘๋ชฉ ์‹๋ณ„์„ ์œ„ํ•œ NSight Systems
  • ์ƒ์„ธ ์ปค๋„ ๋ถ„์„๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ํ†ต์ฐฐ์„ ์œ„ํ•œ NSight Compute
  • ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๋ชจ๋ฒ” ์‚ฌ๋ก€

์บ์‹œ ํžˆํŠธ์˜ ์—ญ์„ค

๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์ปค๋„ ์„ธ ๊ฐœ๊ฐ€ ๊ทน์ ์œผ๋กœ ๋‹ค๋ฅธ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ํ’€์–ด๋ด…๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: ์บ์‹œ ํžˆํŠธ์œจ์ด ๊ฐ€์žฅ ๋†’์€ ์ปค๋„์ด ์„ฑ๋Šฅ์€ ๊ฐ€์žฅ ๋‚ฎ์€ ์ด์œ ๋ฅผ ๋ฐํ˜€๋‚ด์„ธ์š” - CPU ์ค‘์‹ฌ์˜ ์ „ํ†ต์ ์ธ ์„ฑ๋Šฅ ๊ด€๋…์„ ๋’ค์ง‘๋Š” ๋ฐ˜์ง๊ด€์  ํ†ต์ฐฐ์ž…๋‹ˆ๋‹ค.

ํƒ์ • ๋Šฅ๋ ฅ: ์‹ค์ œ NSight Systems์™€ NSight Compute ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ํšจ๊ณผ์™€ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

ํ•™์Šต ๊ฒฝ๋กœ:

  1. NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ - NSight Systems์™€ NSight Compute ํ•™์Šต
  2. ์บ์‹œ ํžˆํŠธ์˜ ์—ญ์„ค - ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ ํ’€๊ธฐ์— ๋Šฅ๋ ฅ ์ ์šฉ

์‚ฌ์ „ ์ค€๋น„:

  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ์ ‘๊ทผ ํŒจํ„ด
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ์ดˆ (์Šค๋ ˆ๋“œ, ๋ธ”๋ก, ์›Œํ”„, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
  • ์ปค๋งจ๋“œ๋ผ์ธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ ์‚ฌ์šฉ ๊ฒฝํ—˜

ํ•™์Šต ์„ฑ๊ณผ: ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ฒด๊ณ„์  ๋ณ‘๋ชฉ ์‹๋ณ„๊ณผ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ํ”„๋กœํŒŒ์ผ๋ง ์—ญ๋Ÿ‰.

์ด ์ฑ•ํ„ฐ๋Š” ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง์ด ์ง๊ด€์ด ๋†“์น˜๋Š” ์ง„์‹ค์„ ๋“œ๋Ÿฌ๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๋ ค์ค๋‹ˆ๋‹ค - GPU ์„ฑ๋Šฅ ์ตœ์ ํ™”๋Š” ๊ฐ€์ •์ด ์•„๋‹Œ ๋„๊ตฌ ๊ธฐ๋ฐ˜์˜ ๋ฐœ๊ฒฌ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์ถ”๊ฐ€ ์ž๋ฃŒ:

๐Ÿ“š NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ

๊ฐœ์š”

์ง€๊ธˆ๊นŒ์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ์ดˆ์™€ ๊ณ ๊ธ‰ ํŒจํ„ด์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. Part II์—์„œ๋Š” compute-sanitizer์™€ cuda-gdb๋ฅผ ์‚ฌ์šฉํ•œ ์ •ํ™•์„ฑ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•์„ ๋‹ค๋ค˜๊ณ , ๋‹ค๋ฅธ ํŒŒํŠธ์—์„œ๋Š” ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ, ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ, ๋ธ”๋ก ๋ ˆ๋ฒจ ์—ฐ์‚ฐ ๋“ฑ ๋‹ค์–‘ํ•œ GPU ๊ธฐ๋Šฅ์„ ๋‹ค๋ค˜์Šต๋‹ˆ๋‹ค. ์ปค๋„์ด ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๊ธด ํ•ฉ๋‹ˆ๋‹ค - ํ•˜์ง€๋งŒ ๋น ๋ฅด๊ธฐ๋„ ํ• ๊นŒ์š”?

์ด ํŠœํ† ๋ฆฌ์–ผ์€ CUDA Best Practices Guide์—์„œ ๊ถŒ์žฅํ•˜๋Š” NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์˜ฌ๋ฐ”๋ฅธ ์ปค๋„์ด๋ผ๋„ ์ตœ์ ์˜ ์„ฑ๋Šฅ๋ณด๋‹ค ์ˆ˜์‹ญ ๋ฐฐ๋‚˜ ๋А๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋ง์€ ๋™์ž‘ํ•˜๋Š” ์ฝ”๋“œ์™€ ๊ณ ์„ฑ๋Šฅ ์ฝ”๋“œ ์‚ฌ์ด์˜ ๊ฒฉ์ฐจ๋ฅผ ์ขํž™๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ ๋ชจ์Œ

pixi๋ฅผ ํ†ตํ•ด cuda-toolkit์ด ์„ค์น˜๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ, NVIDIA์˜ ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

NSight Systems (nsys) - โ€œ์ „์ฒด ๊ทธ๋ฆผโ€ ๋„๊ตฌ

์šฉ๋„: ์‹œ์Šคํ…œ ์ „์ฒด ์„ฑ๋Šฅ ๋ถ„์„ (NSight Systems ๋ฌธ์„œ)

  • CPU-GPU ์ƒํ˜ธ์ž‘์šฉ์˜ ํƒ€์ž„๋ผ์ธ ๋ทฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋ณ‘๋ชฉ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ
  • ๋ฉ€ํ‹ฐ GPU ์กฐ์œจ
  • API ํ˜ธ์ถœ ์ถ”์ 

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ธํ„ฐํŽ˜์ด์Šค: ์ปค๋งจ๋“œ๋ผ์ธ (nsys) ๋ฐ GUI (nsys-ui)

์‚ฌ์šฉ ์‹œ์ :

  • ์ „์ฒด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ๋ฆ„ ํŒŒ์•…
  • CPU-GPU ๋™๊ธฐํ™” ๋ฌธ์ œ ์‹๋ณ„
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ํŒจํ„ด ๋ถ„์„
  • ์ปค๋„ ์‹คํ–‰ ๋ณ‘๋ชฉ ๋ฐœ๊ฒฌ
# ๋„์›€๋ง ๋ณด๊ธฐ
pixi run nsys --help

# ๊ธฐ๋ณธ ์‹œ์Šคํ…œ ์ „์ฒด ํ”„๋กœํŒŒ์ผ๋ง
pixi run nsys profile --trace=cuda,nvtx --output=timeline mojo your_program.mojo

# ๋Œ€ํ™”ํ˜• ๋ถ„์„
pixi run nsys stats --force-export=true timeline.nsys-rep

NSight Compute (ncu) - โ€œ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„โ€ ๋„๊ตฌ

์šฉ๋„: ์ƒ์„ธํ•œ ๋‹จ์ผ ์ปค๋„ ์„ฑ๋Šฅ ๋ถ„์„ (NSight Compute ๋ฌธ์„œ)

  • ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ ๋ถ„์„
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ํ™œ์šฉ๋„
  • ์›Œํ”„ ์‹คํ–‰ ํšจ์œจ
  • ๋ ˆ์ง€์Šคํ„ฐ/๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰
  • ์—ฐ์‚ฐ ์œ ๋‹› ํ™œ์šฉ๋„

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ธํ„ฐํŽ˜์ด์Šค: ์ปค๋งจ๋“œ๋ผ์ธ (ncu) ๋ฐ GUI (ncu-ui)

์‚ฌ์šฉ ์‹œ์ :

  • ํŠน์ • ์ปค๋„ ์„ฑ๋Šฅ ์ตœ์ ํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํŒŒ์•…
  • ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ vs ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„ ๋ถ„์„
  • ์›Œํ”„ ๋ถ„๊ธฐ ๋ฌธ์ œ ์‹๋ณ„
# ๋„์›€๋ง ๋ณด๊ธฐ
pixi run ncu --help

# ์ƒ์„ธ ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง
pixi run ncu --set full --output kernel_profile mojo your_program.mojo

# ํŠน์ • ์ปค๋„์— ์ง‘์ค‘
pixi run ncu --kernel-name regex:your_kernel_name mojo your_program.mojo

๋„๊ตฌ ์„ ํƒ ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ

์„ฑ๋Šฅ ๋ฌธ์ œ ๋ฐœ์ƒ
      |
      v
์–ด๋–ค ์ปค๋„์ธ์ง€ ์•„๋Š”๊ฐ€?
    |           |
  ์•„๋‹ˆ์˜ค         ์˜ˆ
    |           |
    v           v
NSight    ์ปค๋„ ๊ณ ์œ ์˜ ๋ฌธ์ œ์ธ๊ฐ€?
Systems       |         |
    |       ์•„๋‹ˆ์˜ค       ์˜ˆ
    v         |         |
ํƒ€์ž„๋ผ์ธ        |         v
๋ถ„์„    <------+   NSight Compute
                        |
                        v
                   ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„

๋น ๋ฅธ ์˜์‚ฌ๊ฒฐ์ • ๊ฐ€์ด๋“œ:

  • ๋ณ‘๋ชฉ์ด ์–ด๋””์ธ์ง€ ๋ชจ๋ฅด๊ฒ ์œผ๋ฉด NSight Systems (nsys)๋ถ€ํ„ฐ ์‹œ์ž‘
  • ์ตœ์ ํ™”ํ•  ์ปค๋„์„ ์ •ํ™•ํžˆ ์•Œ๋ฉด NSight Compute (ncu) ์‚ฌ์šฉ
  • ์ข…ํ•ฉ์ ์ธ ๋ถ„์„์ด ํ•„์š”ํ•˜๋ฉด ๋‘˜ ๋‹ค ์‚ฌ์šฉ (์ผ๋ฐ˜์ ์ธ ์›Œํฌํ”Œ๋กœ์šฐ)

์‹ค์Šต: NSight Systems๋กœ ์‹œ์Šคํ…œ ์ „์ฒด ํ”„๋กœํŒŒ์ผ๋ง

Puzzle 16์˜ ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ตฌํ˜„๋“ค์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜์—ฌ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ํŒŒ์•…ํ•ด ๋ด…์‹œ๋‹ค.

GUI ์ฐธ๊ณ : NSight Systems์™€ Compute GUI (nsys-ui, ncu-ui)๋Š” ๋””์Šคํ”Œ๋ ˆ์ด์™€ OpenGL ์ง€์›์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. X11 ํฌ์›Œ๋”ฉ์ด ์—†๋Š” ํ—ค๋“œ๋ฆฌ์Šค ์„œ๋ฒ„๋‚˜ ์›๊ฒฉ ์‹œ์Šคํ…œ์—์„œ๋Š” ์ปค๋งจ๋“œ๋ผ์ธ ๋ฒ„์ „ (nsys, ncu)์„ ์‚ฌ์šฉํ•˜์—ฌ nsys stats์™€ ncu --import --page details๋กœ ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜ ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜์„ธ์š”. .nsys-rep์™€ .ncu-rep ํŒŒ์ผ์„ ๋กœ์ปฌ ๋จธ์‹ ์œผ๋กœ ์ „์†กํ•˜์—ฌ GUI๋กœ ๋ถ„์„ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

Step 1: ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ์ฝ”๋“œ ์ค€๋น„

์ค‘์š”: ์ •ํ™•ํ•œ ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•ด ์ตœ์ ํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜์—ฌ ๋นŒ๋“œํ•ฉ๋‹ˆ๋‹ค:

pixi shell -e nvidia
# ์ตœ์ ํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด ํฌํ•จ ๋นŒ๋“œ (ํฌ๊ด„์ ์ธ ์†Œ์Šค ๋งคํ•‘์šฉ)
mojo build --debug-level=full solutions/p16/p16.mojo -o solutions/p16/p16_optimized

# ์ตœ์ ํ™” ๋นŒ๋“œ ํ…Œ์ŠคํŠธ
./solutions/p16/p16_optimized --naive

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด: ํ”„๋กœํŒŒ์ผ๋Ÿฌ๋ฅผ ์œ„ํ•œ ์™„์ „ํ•œ ์‹ฌ๋ณผ ํ…Œ์ด๋ธ”, ๋ณ€์ˆ˜๋ช…, ์†Œ์Šค ๋ผ์ธ ๋งคํ•‘ ์ œ๊ณต
  • ํฌ๊ด„์  ๋ถ„์„: NSight ๋„๊ตฌ๊ฐ€ ์„ฑ๋Šฅ ๋ฐ์ดํ„ฐ๋ฅผ ํŠน์ • ์ฝ”๋“œ ์œ„์น˜์™€ ์—ฐ๊ฒฐ ๊ฐ€๋Šฅ
  • ์ตœ์ ํ™” ์œ ์ง€: ํ”„๋กœ๋•์…˜ ๋นŒ๋“œ์™€ ์ผ์น˜ํ•˜๋Š” ํ˜„์‹ค์ ์ธ ์„ฑ๋Šฅ ์ธก์ • ๋ณด์žฅ

Step 2: ์‹œ์Šคํ…œ ์ „์ฒด ํ”„๋กœํŒŒ์ผ ์ˆ˜์ง‘

# ํฌ๊ด„์  ์ถ”์ ์œผ๋กœ ์ตœ์ ํ™” ๋นŒ๋“œ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile \
  --trace=cuda,nvtx \
  --output=matmul_naive \
  --force-overwrite=true \
  ./solutions/p16/p16_optimized --naive

๋ช…๋ น์–ด ๋ถ„์„:

  • --trace=cuda,nvtx: CUDA API ํ˜ธ์ถœ ๋ฐ ์ปค์Šคํ…€ ์–ด๋…ธํ…Œ์ด์…˜ ์บก์ฒ˜
  • --output=matmul_naive: ํ”„๋กœํŒŒ์ผ์„ matmul_naive.nsys-rep๋กœ ์ €์žฅ
  • --force-overwrite=true: ๊ธฐ์กด ํ”„๋กœํŒŒ์ผ ๋ฎ์–ด์“ฐ๊ธฐ
  • ๋งˆ์ง€๋ง‰ ์ธ์ˆ˜: Mojo ํ”„๋กœ๊ทธ๋žจ

Step 3: ํƒ€์ž„๋ผ์ธ ๋ถ„์„

# ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜ ํ†ต๊ณ„ ์ƒ์„ฑ
nsys stats --force-export=true matmul_naive.nsys-rep

# ์ฃผ์š” ์ง€ํ‘œ ํ™•์ธ:
# - GPU ํ™œ์šฉ๋ฅ 
# - ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ์‹œ๊ฐ„
# - ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„
# - CPU-GPU ๋™๊ธฐํ™” ๊ฐ„๊ฒฉ

ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฐ๊ณผ (2ร—2 ํ–‰๋ ฌ ๊ณฑ์…ˆ์˜ ์‹ค์ œ ์ถœ๋ ฅ):

** CUDA API Summary (cuda_api_sum):
 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)          Name
 --------  ---------------  ---------  ---------  --------  --------  --------  -----------  --------------------
     81.9          8617962          3  2872654.0    2460.0      1040   8614462    4972551.6  cuMemAllocAsync
     15.1          1587808          4   396952.0    5965.5      3810   1572067     783412.3  cuMemAllocHost_v2
      0.6            67152          1    67152.0   67152.0     67152     67152          0.0  cuModuleLoadDataEx
      0.4            44961          1    44961.0   44961.0     44961     44961          0.0  cuLaunchKernelEx

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                    Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------
    100.0             1920          1    1920.0    1920.0      1920      1920          0.0  p16_naive_matmul_Layout_Int6A6AcB6A6AsA6A6A

** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):
 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ----------------------------
     49.4             4224      3    1408.0    1440.0      1312      1472         84.7  [CUDA memcpy Device-to-Host]
     36.0             3072      4     768.0     528.0       416      1600        561.0  [CUDA memset]
     14.6             1248      3     416.0     416.0       416       416          0.0  [CUDA memcpy Host-to-Device]

์ฃผ์š” ์„ฑ๋Šฅ ํ†ต์ฐฐ:

  • ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ์ง€๋ฐฐ์ : ์ „์ฒด ์‹œ๊ฐ„์˜ 81.9%๊ฐ€ cuMemAllocAsync์— ์†Œ๋น„
  • ์ปค๋„์€ ๋ฒˆ๊ฐœ์ฒ˜๋Ÿผ ๋น ๋ฆ„: ์‹คํ–‰ ์‹œ๊ฐ„ 1,920 ns (0.000001920์ดˆ)์— ๋ถˆ๊ณผ
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋‚ด์—ญ: 49.4% Deviceโ†’Host, 36.0% memset, 14.6% Hostโ†’Device
  • ์•„์ฃผ ์ž‘์€ ๋ฐ์ดํ„ฐ: ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด 0.001 MB ๋ฏธ๋งŒ (float32 4๊ฐœ = 16๋ฐ”์ดํŠธ)

Step 4: ๊ตฌํ˜„ ๋น„๊ต

๋‹ค๋ฅธ ๋ฒ„์ „๋“ค์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜๊ณ  ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค:

# pixi shell ์ƒํƒœ๋ฅผ ์œ ์ง€ํ•˜์„ธ์š” `pixi run -e nvidia`

# ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_shared ./solutions/p16/p16_optimized --single-block

# Tiled ๋ฒ„์ „ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_tiled ./solutions/p16/p16_optimized --tiled

# ๊ด€์šฉ์  Tiled ๋ฒ„์ „ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_idiomatic_tiled ./solutions/p16/p16_optimized --idiomatic-tiled

# ๊ฐ ๊ตฌํ˜„์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ๋ถ„์„ (nsys stats๋Š” ํ•œ ๋ฒˆ์— ํ•˜๋‚˜์˜ ํŒŒ์ผ๋งŒ ์ฒ˜๋ฆฌ)
nsys stats --force-export=true matmul_shared.nsys-rep
nsys stats --force-export=true matmul_tiled.nsys-rep
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep

๊ฒฐ๊ณผ ๋น„๊ต ๋ฐฉ๋ฒ•:

  1. GPU Kernel Summary ํ™•์ธ - ๊ตฌํ˜„ ๊ฐ„ ์‹คํ–‰ ์‹œ๊ฐ„ ๋น„๊ต
  2. Memory Operations ํ™•์ธ - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ์ค„์ด๋Š”์ง€ ํ™•์ธ
  3. API ์˜ค๋ฒ„ํ—ค๋“œ ๋น„๊ต - ๋ชจ๋‘ ๋น„์Šทํ•œ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ํŒจํ„ด์„ ๊ฐ€์ ธ์•ผ ํ•จ

์ˆ˜๋™ ๋น„๊ต ์›Œํฌํ”Œ๋กœ์šฐ:

# ๊ฐ ๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•˜์—ฌ ๋น„๊ต
nsys stats --force-export=true matmul_naive.nsys-rep > naive_stats.txt
nsys stats --force-export=true matmul_shared.nsys-rep > shared_stats.txt
nsys stats --force-export=true matmul_tiled.nsys-rep > tiled_stats.txt
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep > idiomatic_tiled_stats.txt

๊ณต์ •ํ•œ ๋น„๊ต ๊ฒฐ๊ณผ (์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋ง ์ถœ๋ ฅ):

๋น„๊ต 1: 2 x 2 ํ–‰๋ ฌ

๊ตฌํ˜„                       ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น                ์ปค๋„ ์‹คํ–‰       ์„ฑ๋Šฅ
Naive                     81.9% cuMemAllocAsync    โœ… 1,920 ns    ๊ธฐ์ค€์„ 
Shared (--single-block)   81.8% cuMemAllocAsync    โœ… 1,984 ns    +3.3% ๋А๋ฆผ

๋น„๊ต 2: 9 x 9 ํ–‰๋ ฌ

๊ตฌํ˜„               ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น                ์ปค๋„ ์‹คํ–‰       ์„ฑ๋Šฅ
Tiled (์ˆ˜๋™)       81.1% cuMemAllocAsync    โœ… 2,048 ns    ๊ธฐ์ค€์„ 
Idiomatic Tiled   81.6% cuMemAllocAsync    โœ… 2,368 ns    +15.6% ๋А๋ฆผ
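ํ‘œ์˜ "๋А๋ฆผ" ๋น„์œจ์€ ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„์—์„œ ์ง์ ‘ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ˆ˜์น˜๋ฅผ ์ง์ ‘ ๊ฒ€์ฆํ•ด ๋ณด๋Š” ์ž‘์€ ๊ณ„์‚ฐ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
def slowdown_pct(baseline_ns: float, variant_ns: float) -> float:
    # ๊ธฐ์ค€์„  ๋Œ€๋น„ ์ถ”๊ฐ€๋กœ ๊ฑธ๋ฆฐ ์‹œ๊ฐ„์˜ ๋ฐฑ๋ถ„์œจ
    return (variant_ns - baseline_ns) / baseline_ns * 100.0

print(round(slowdown_pct(1920, 1984), 1))  # Shared vs Naive: 3.3
print(round(slowdown_pct(2048, 2368), 1))  # Idiomatic vs Tiled: 15.6
```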

๊ณต์ • ๋น„๊ต์—์„œ ์–ป์€ ํ•ต์‹ฌ ํ†ต์ฐฐ:

๋‘ ํ–‰๋ ฌ ํฌ๊ธฐ ๋ชจ๋‘ GPU ์ž‘์—…์—๋Š” ๋„ˆ๋ฌด ์ž‘์Œ!:

  • 2ร—2 ํ–‰๋ ฌ: 4๊ฐœ ์š”์†Œ - ์™„์ „ํžˆ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ
  • 9ร—9 ํ–‰๋ ฌ: 81๊ฐœ ์š”์†Œ - ์—ฌ์ „ํžˆ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ
  • ์‹ค์ œ GPU ์›Œํฌ๋กœ๋“œ: ์ฐจ์›๋‹น ์ˆ˜์ฒœ~์ˆ˜๋ฐฑ๋งŒ ๊ฐœ ์š”์†Œ

์ด ๊ฒฐ๊ณผ๊ฐ€ ์‹ค์ œ๋กœ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ:

  • ๋ชจ๋“  ๋ณ€ํ˜•์ด ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์— ์ง€๋ฐฐ๋จ (์‹œ๊ฐ„์˜ 81% ์ด์ƒ)
  • ์ปค๋„ ์‹คํ–‰์€ ์˜๋ฏธ ์—†์Œ - ์„ค์ • ๋น„์šฉ์— ๋น„ํ•˜๋ฉด ๋ฏธ๋ฏธ
  • โ€œ์ตœ์ ํ™”โ€œ๊ฐ€ ์˜คํžˆ๋ ค ํ•ด๋กœ์šธ ์ˆ˜ ์žˆ์Œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ 3.3%, async_copy๊ฐ€ 15.6% ์˜ค๋ฒ„ํ—ค๋“œ ์ถ”๊ฐ€
  • ์ง„์งœ ๊ตํ›ˆ: ์ž‘์€ ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํƒ์ด ๋ฌด์˜๋ฏธ - ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ชจ๋“  ๊ฒƒ์„ ์••๋„

์ด๋Ÿฐ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๋Š” ์ด์œ :

  • GPU ์„ค์ • ๋น„์šฉ(๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ์ปค๋„ ์‹คํ–‰)์€ ๋ฌธ์ œ ํฌ๊ธฐ์— ๊ด€๊ณ„์—†์ด ๊ณ ์ •
  • ์ž‘์€ ๋ฌธ์ œ์—์„œ๋Š” ์ด ๊ณ ์ • ๋น„์šฉ์ด ์—ฐ์‚ฐ ์‹œ๊ฐ„์„ ๋ฌด์ƒ‰ํ•˜๊ฒŒ ๋งŒ๋“ฆ
  • ํฐ ๋ฌธ์ œ๋ฅผ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ตœ์ ํ™”๊ฐ€ ์ž‘์€ ๋ฌธ์ œ์—์„œ๋Š” ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋จ

์‹ค๋ฌด ํ”„๋กœํŒŒ์ผ๋ง ๊ตํ›ˆ:

  • ๋ฌธ์ œ ํฌ๊ธฐ ๋งฅ๋ฝ์ด ์ค‘์š”: 2ร—2์™€ 9ร—9 ๋ชจ๋‘ GPU์—๊ฒŒ๋Š” ์ž‘์Œ
  • ๊ณ ์ • ๋น„์šฉ์ด ์ž‘์€ ๋ฌธ์ œ๋ฅผ ์ง€๋ฐฐ: ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ
  • โ€œ์ตœ์ ํ™”โ€œ๊ฐ€ ์ž‘์€ ์›Œํฌ๋กœ๋“œ์— ํ•ด๋กœ์šธ ์ˆ˜ ์žˆ์Œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์ด ์˜ค๋ฒ„ํ—ค๋“œ ์ถ”๊ฐ€
  • ์ž‘์€ ๋ฌธ์ œ๋ฅผ ์ตœ์ ํ™”ํ•˜์ง€ ๋ง ๊ฒƒ: ์‹ค์ œ ์›Œํฌ๋กœ๋“œ๋กœ ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์ง‘์ค‘
  • ํ•ญ์ƒ ๋ฒค์น˜๋งˆํ‚นํ•  ๊ฒƒ: โ€œ๋” ์ข‹์€โ€ ์ฝ”๋“œ์— ๋Œ€ํ•œ ๊ฐ€์ •์€ ํ”ํžˆ ํ‹€๋ฆผ

์ž‘์€ ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง์˜ ์ดํ•ด: ์ด 2ร—2 ํ–‰๋ ฌ ์˜ˆ์ œ๋Š” ์ „ํ˜•์ ์ธ ์ž‘์€ ์ปค๋„ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • ์‹ค์ œ ์—ฐ์‚ฐ(ํ–‰๋ ฌ ๊ณฑ์…ˆ)์€ ๊ทนํžˆ ๋น ๋ฆ„ (1,920 ns)
  • ๋ฉ”๋ชจ๋ฆฌ ์„ค์ • ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ „์ฒด ์‹œ๊ฐ„์„ ์ง€๋ฐฐ (์‹คํ–‰์˜ 97% ์ด์ƒ)
  • ์ด๊ฒƒ์ด ์‹ค๋ฌด GPU ์ตœ์ ํ™”๊ฐ€ ๋‹ค์Œ์— ์ง‘์ค‘ํ•˜๋Š” ์ด์œ ์ž…๋‹ˆ๋‹ค:
    • ์—ฐ์‚ฐ ์ผ๊ด„ ์ฒ˜๋ฆฌ๋กœ ์„ค์ • ๋น„์šฉ ๋ถ„์‚ฐ
    • ๋ฉ”๋ชจ๋ฆฌ ์žฌ์‚ฌ์šฉ์œผ๋กœ ํ• ๋‹น ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ
    • ์—ฐ์‚ฐ์ด ๋ณ‘๋ชฉ์ด ๋˜๋Š” ๋” ํฐ ๋ฌธ์ œ ํฌ๊ธฐ

์‹ค์Šต: NSight Compute๋กœ ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„

์ด์ œ ํŠน์ • ์ปค๋„์˜ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ์‹ฌ์ธต์ ์œผ๋กœ ๋“ค์—ฌ๋‹ค๋ด…์‹œ๋‹ค.

Step 1: ํŠน์ • ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง

# ํ™œ์„ฑ shell ์ƒํƒœ์ธ์ง€ ํ™•์ธ
pixi shell -e nvidia

# Naive MatMul ์ปค๋„์„ ์ƒ์„ธ ํ”„๋กœํŒŒ์ผ๋ง (์ตœ์ ํ™” ๋นŒ๋“œ ์‚ฌ์šฉ)
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

ํ”ํ•œ ๋ฌธ์ œ: ๊ถŒํ•œ ์˜ค๋ฅ˜

ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ๋‹ค์Œ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•์„ ์‹œ๋„ํ•˜์„ธ์š”:

# NVIDIA ๋“œ๋ผ์ด๋ฒ„ ์˜ต์…˜ ์ถ”๊ฐ€ (rmmod๋ณด๋‹ค ์•ˆ์ „)
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

# ์ปค๋„ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •
sudo sysctl -w kernel.perf_event_paranoid=0

# ์˜๊ตฌ ์ ์šฉ
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf

# ๋“œ๋ผ์ด๋ฒ„ ๋ณ€๊ฒฝ ์‚ฌํ•ญ ์ ์šฉ์„ ์œ„ํ•ด ์žฌ๋ถ€ํŒ… ํ•„์š”
sudo reboot

# ๊ทธ๋Ÿฐ ๋‹ค์Œ ncu ๋ช…๋ น์„ ๋‹ค์‹œ ์‹คํ–‰
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

Step 2: ์ฃผ์š” ์ง€ํ‘œ ๋ถ„์„

# ์ƒ์„ธ ๋ณด๊ณ ์„œ ์ƒ์„ฑ (์˜ฌ๋ฐ”๋ฅธ ๊ตฌ๋ฌธ)
ncu --import kernel_analysis.ncu-rep --page details

์‹ค์ œ NSight Compute ์ถœ๋ ฅ (2ร—2 Naive MatMul):

GPU Speed Of Light Throughput
----------------------- ----------- ------------
DRAM Frequency              Ghz         6.10
SM Frequency                Ghz         1.30
Elapsed Cycles            cycle         3733
Memory Throughput             %         1.02
DRAM Throughput               %         0.19
Duration                     us         2.88
Compute (SM) Throughput       %         0.00
----------------------- ----------- ------------

Launch Statistics
-------------------------------- --------------- ---------------
Block Size                                                     9
Grid Size                                                      1
Threads                           thread               9
Waves Per SM                                                0.00
-------------------------------- --------------- ---------------

Occupancy
------------------------------- ----------- ------------
Theoretical Occupancy                 %        33.33
Achieved Occupancy                    %         2.09
------------------------------- ----------- ------------

์‹ค์ œ ๋ฐ์ดํ„ฐ์—์„œ ์–ป์€ ํ•ต์‹ฌ ํ†ต์ฐฐ:

์„ฑ๋Šฅ ๋ถ„์„ - ๋ƒ‰ํ˜นํ•œ ํ˜„์‹ค

  • Compute Throughput: 0.00% - GPU๊ฐ€ ์—ฐ์‚ฐ์ ์œผ๋กœ ์™„์ „ํžˆ ์œ ํœด ์ƒํƒœ
  • Memory Throughput: 1.02% - ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ๊ฑฐ์˜ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ
  • Achieved Occupancy: 2.09% - GPU ๋Šฅ๋ ฅ์˜ 2%๋งŒ ์‚ฌ์šฉ ์ค‘
  • Grid Size: 1 ๋ธ”๋ก - 80๊ฐœ ๋ฉ€ํ‹ฐํ”„๋กœ์„ธ์„œ๋ฅผ ์™„์ „ํžˆ ๋‚ญ๋น„!

์„ฑ๋Šฅ์ด ์ด๋ ‡๊ฒŒ ๋‚ฎ์€ ์ด์œ 

  • ์ž‘์€ ๋ฌธ์ œ ํฌ๊ธฐ: 2ร—2 ํ–‰๋ ฌ = ์ด 4๊ฐœ ์š”์†Œ
  • ์ž˜๋ชป๋œ ์‹คํ–‰ ๊ตฌ์„ฑ: 1๊ฐœ ๋ธ”๋ก์— 9๊ฐœ ์Šค๋ ˆ๋“œ (32์˜ ๋ฐฐ์ˆ˜์—ฌ์•ผ ํ•จ)
  • ์‹ฌ๊ฐํ•œ ๊ณผ์†Œ ํ™œ์šฉ: SM๋‹น 0.00 wave (ํšจ์œจ์„ ์œ„ํ•ด ์ˆ˜์ฒœ ๊ฐœ ํ•„์š”)

NSight Compute์˜ ํ•ต์‹ฌ ์ตœ์ ํ™” ๊ถŒ๊ณ ์‚ฌํ•ญ

  • โ€œEst. Speedup: 98.75%โ€ - 80๊ฐœ SM์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜๋„๋ก ๊ทธ๋ฆฌ๋“œ ํฌ๊ธฐ ์ฆ๊ฐ€
  • โ€œEst. Speedup: 71.88%โ€ - ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ 32์˜ ๋ฐฐ์ˆ˜๋กœ ์‚ฌ์šฉ
  • โ€œKernel grid is too smallโ€ - GPU ํšจ์œจ์„ ์œ„ํ•ด ํ›จ์”ฌ ํฐ ๋ฌธ์ œ ํ•„์š”

Step 3: ํ˜„์‹ค ์ง์‹œ

์ด ํ”„๋กœํŒŒ์ผ๋ง ๋ฐ์ดํ„ฐ๊ฐ€ ์•Œ๋ ค์ฃผ๋Š” ๊ฒƒ:

  1. ์ž‘์€ ๋ฌธ์ œ๋Š” GPU์—๊ฒŒ ๋…: 2ร—2 ํ–‰๋ ฌ์€ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ์™„์ „ํžˆ ๋‚ญ๋น„
  2. ์‹คํ–‰ ๊ตฌ์„ฑ์ด ์ค‘์š”: ์ž˜๋ชป๋œ ์Šค๋ ˆ๋“œ/๋ธ”๋ก ํฌ๊ธฐ๊ฐ€ ์„ฑ๋Šฅ์„ ์ฃฝ์ž„
  3. ๊ทœ๋ชจ๊ฐ€ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณด๋‹ค ์ค‘์š”: ๊ทผ๋ณธ์ ์œผ๋กœ ์ž‘์€ ๋ฌธ์ œ๋Š” ์–ด๋–ค ์ตœ์ ํ™”๋กœ๋„ ํ•ด๊ฒฐ ๋ถˆ๊ฐ€
  4. NSight Compute๋Š” ์ •์งํ•จ: ์ปค๋„ ์„ฑ๋Šฅ์ด ๋‚ฎ์„ ๋•Œ ๊ทธ๋Œ€๋กœ ์•Œ๋ ค์คŒ

์ง„์งœ ๊ตํ›ˆ:

  • ํ† ์ด ๋ฌธ์ œ๋ฅผ ์ตœ์ ํ™”ํ•˜์ง€ ๋ง ๊ฒƒ - ์‹ค์ œ GPU ์›Œํฌ๋กœ๋“œ๋ฅผ ๋Œ€ํ‘œํ•˜์ง€ ์•Š์Œ
  • ํ˜„์‹ค์ ์ธ ์›Œํฌ๋กœ๋“œ์— ์ง‘์ค‘ - ์ตœ์ ํ™”๊ฐ€ ์‹ค์ œ๋กœ ์˜๋ฏธ ์žˆ๋Š” 1000ร—1000+ ํ–‰๋ ฌ
  • ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ์ตœ์ ํ™”๋ฅผ ์•ˆ๋‚ด - ๋‹จ, ์ตœ์ ํ™”ํ•  ๊ฐ€์น˜๊ฐ€ ์žˆ๋Š” ๋ฌธ์ œ์—๋งŒ

2ร—2 ์˜ˆ์ œ์˜ ๊ฒฝ์šฐ: ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜(๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, tiling)์ด ์ด๋ฏธ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ์ ์ธ ์›Œํฌ๋กœ๋“œ์— ์˜ค๋ฒ„ํ—ค๋“œ๋งŒ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ถœ๋ ฅ์„ ์„ฑ๋Šฅ ํƒ์ •์ฒ˜๋Ÿผ ์ฝ๊ธฐ

์ž์ฃผ ๋‚˜ํƒ€๋‚˜๋Š” ์„ฑ๋Šฅ ํŒจํ„ด

ํŒจํ„ด 1: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„

NSight Systems๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ: ๊ธด ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ์‹œ๊ฐ„ NSight Compute๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ: ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰, ๋‚ฎ์€ ์—ฐ์‚ฐ ํ™œ์šฉ๋„ ํ•ด๊ฒฐ์ฑ…: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ

ํŒจํ„ด 2: ๋‚ฎ์€ ์ ์œ ์œจ

NSight Systems๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ: ์งง์€ ์ปค๋„ ์‹คํ–‰๊ณผ ๊ฐ„๊ฒฉ NSight Compute๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ: ์‹ค์ œ ์ ์œ ์œจ์ด ๋‚ฎ์Œ ํ•ด๊ฒฐ์ฑ…: ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰ ์ค„์ด๊ธฐ, ๋ธ”๋ก ํฌ๊ธฐ ์ตœ์ ํ™”

ํŒจํ„ด 3: ์›Œํ”„ ๋ถ„๊ธฐ

NSight Systems๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ: ๋ถˆ๊ทœ์น™ํ•œ ์ปค๋„ ์‹คํ–‰ ํŒจํ„ด NSight Compute๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ: ๋‚ฎ์€ ์›Œํ”„ ์‹คํ–‰ ํšจ์œจ ํ•ด๊ฒฐ์ฑ…: ์กฐ๊ฑด ๋ถ„๊ธฐ ์ตœ์†Œํ™”, ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์žฌ๊ตฌ์„ฑ

ํ”„๋กœํŒŒ์ผ๋ง ํƒ์ • ์›Œํฌํ”Œ๋กœ์šฐ

์„ฑ๋Šฅ ๋ฌธ์ œ ๋ฐœ์ƒ
     |
     v
NSight Systems: ์ „์ฒด ๊ทธ๋ฆผ
        |
        v
GPU๋ฅผ ์ž˜ ํ™œ์šฉํ•˜๊ณ  ์žˆ๋Š”๊ฐ€?
    |             |
  ์•„๋‹ˆ์˜ค           ์˜ˆ
    |             |
    v             v
CPU-GPU    NSight Compute: ์ปค๋„ ์ƒ์„ธ
ํŒŒ์ดํ”„๋ผ์ธ          |
์ˆ˜์ •               v
        ๋ฉ”๋ชจ๋ฆฌ ๋˜๋Š” ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ธ๊ฐ€?
          |       |       |
         ๋ฉ”๋ชจ๋ฆฌ   ์—ฐ์‚ฐ    ๋‘˜ ๋‹ค ์•„๋‹˜
          |       |       |
          v       v       v
        ๋ฉ”๋ชจ๋ฆฌ    ์‚ฐ์ˆ      ์ ์œ ์œจ
        ์ ‘๊ทผ     ์ตœ์ ํ™”    ํ™•์ธ
        ์ตœ์ ํ™”

ํ”„๋กœํŒŒ์ผ๋ง ๋ชจ๋ฒ” ์‚ฌ๋ก€

ํฌ๊ด„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ์ง€์นจ์€ Best Practices Guide - Performance Metrics๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์ด๋ ‡๊ฒŒ ํ•˜์„ธ์š”

  1. ๋Œ€ํ‘œ์ ์ธ ์›Œํฌ๋กœ๋“œ๋ฅผ ํ”„๋กœํŒŒ์ผ๋ง: ํ˜„์‹ค์ ์ธ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์™€ ํŒจํ„ด ์‚ฌ์šฉ
  2. ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ๋นŒ๋“œ: ์ตœ์ ํ™”์™€ ํ•จ๊ป˜ ํฌ๊ด„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ๋ฐ์ดํ„ฐ ๋ฐ ์†Œ์Šค ๋งคํ•‘์„ ์œ„ํ•ด --debug-level=full ์‚ฌ์šฉ
  3. GPU ์›Œ๋ฐ์—…: ์ปค๋„์„ ์—ฌ๋Ÿฌ ๋ฒˆ ์‹คํ–‰ํ•œ ํ›„ ํ›„๋ฐ˜ ๋ฐ˜๋ณต์„ ํ”„๋กœํŒŒ์ผ๋ง
  4. ๋Œ€์•ˆ ๋น„๊ต: ํ•ญ์ƒ ์—ฌ๋Ÿฌ ๊ตฌํ˜„์„ ํ”„๋กœํŒŒ์ผ๋ง
  5. ํ•ซ์ŠคํŒŸ์— ์ง‘์ค‘: ๊ฐ€์žฅ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š” ์ปค๋„์„ ์ตœ์ ํ™”

์ด๋ ‡๊ฒŒ ํ•˜์ง€ ๋งˆ์„ธ์š”

  1. ๋””๋ฒ„๊ทธ ์ •๋ณด ์—†์ด ํ”„๋กœํŒŒ์ผ๋งํ•˜์ง€ ๋ง ๊ฒƒ: ์„ฑ๋Šฅ์„ ์†Œ์Šค ์ฝ”๋“œ์— ๋งคํ•‘ํ•  ์ˆ˜ ์—†์Œ (mojo build --help)
  2. ๋‹จ์ผ ์‹คํ–‰๋งŒ ํ”„๋กœํŒŒ์ผ๋งํ•˜์ง€ ๋ง ๊ฒƒ: GPU ์„ฑ๋Šฅ์€ ์‹คํ–‰๋งˆ๋‹ค ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ
  3. ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ๋ฌด์‹œํ•˜์ง€ ๋ง ๊ฒƒ: CPU-GPU ์ „์†ก์ด ํ”ํžˆ ์ง€๋ฐฐ์ 
  4. ์„ฃ๋ถˆ๋ฆฌ ์ตœ์ ํ™”ํ•˜์ง€ ๋ง ๊ฒƒ: ๋จผ์ € ํ”„๋กœํŒŒ์ผ๋ง, ๊ทธ๋‹ค์Œ ์ตœ์ ํ™”

ํ”ํ•œ ํ•จ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

ํ•จ์ • 1: ์ฝœ๋“œ ์Šคํƒ€ํŠธ ํšจ๊ณผ

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ์ฒซ ๋ฒˆ์งธ ์‹คํ–‰์„ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile mojo your_program.mojo

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ์›Œ๋ฐ์—… ํ›„ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --delay=5 mojo your_program.mojo  # GPU ์›Œ๋ฐ์—… ๋Œ€๊ธฐ

ํ•จ์ • 2: ์ž˜๋ชป๋œ ๋นŒ๋“œ ๊ตฌ์„ฑ

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ์ „์ฒด ๋””๋ฒ„๊ทธ ๋นŒ๋“œ (์ตœ์ ํ™” ๋น„ํ™œ์„ฑํ™”) ์ฆ‰, `--no-optimization`
mojo build -O0 your_program.mojo -o your_program

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ๋””๋ฒ„๊ทธ ์ •๋ณด ์—†์Œ (์†Œ์Šค ๋งคํ•‘ ๋ถˆ๊ฐ€)
mojo build your_program.mojo -o your_program

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด ํฌํ•จ ์ตœ์ ํ™” ๋นŒ๋“œ
mojo build --debug-level=full your_program.mojo -o optimized_program
nsys profile ./optimized_program

ํ•จ์ • 3: ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋ฌด์‹œ

# NSight Systems์—์„œ ์ด ํŒจํ„ด์„ ์ฐพ์•„๋ณด์„ธ์š”:
CPU -> GPU transfer: 50ms
Kernel execution: 2ms
GPU -> CPU transfer: 48ms
# ์ด: 100ms (์ปค๋„์€ ๊ฒจ์šฐ 2%!)

ํ•ด๊ฒฐ์ฑ…: ์ „์†ก๊ณผ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜๊ณ  ์ „์†ก ๋นˆ๋„๋ฅผ ์ค„์ด๊ธฐ (Part IX์—์„œ ๋‹ค๋ฃธ)

ํ•จ์ • 4: ๋‹จ์ผ ์ปค๋„์—๋งŒ ์ง‘์ค‘

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: "๋А๋ฆฐ" ์ปค๋„๋งŒ ํ”„๋กœํŒŒ์ผ๋ง
ncu --kernel-name regex:slow_kernel program

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ๋จผ์ € ์ „์ฒด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile mojo program.mojo  # ์‹ค์ œ ๋ณ‘๋ชฉ ์ฐพ๊ธฐ

๋ชจ๋ฒ” ์‚ฌ๋ก€์™€ ๊ณ ๊ธ‰ ์˜ต์…˜

๊ณ ๊ธ‰ NSight Systems ํ”„๋กœํŒŒ์ผ๋ง

ํฌ๊ด„์ ์ธ ์‹œ์Šคํ…œ ์ „์ฒด ๋ถ„์„์„ ์œ„ํ•ด ๋‹ค์Œ ๊ณ ๊ธ‰ nsys ํ”Œ๋ž˜๊ทธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

# ํ”„๋กœ๋•์…˜๊ธ‰ ํ”„๋กœํŒŒ์ผ๋ง ๋ช…๋ น
nsys profile \
  --gpu-metrics-devices=all \
  --trace=cuda,osrt,nvtx \
  --trace-fork-before-exec=true \
  --cuda-memory-usage=true \
  --cuda-um-cpu-page-faults=true \
  --cuda-um-gpu-page-faults=true \
  --opengl-gpu-workload=false \
  --delay=2 \
  --duration=30 \
  --sample=cpu \
  --cpuctxsw=process-tree \
  --output=comprehensive_profile \
  --force-overwrite=true \
  ./your_program

ํ”Œ๋ž˜๊ทธ ์„ค๋ช…:

  • --gpu-metrics-devices=all: ๋ชจ๋“  ๋””๋ฐ”์ด์Šค์—์„œ GPU ์ง€ํ‘œ ์ˆ˜์ง‘
  • --trace=cuda,osrt,nvtx: ํฌ๊ด„์  API ์ถ”์ 
  • --cuda-memory-usage=true: ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น/ํ•ด์ œ ์ถ”์ 
  • --cuda-um-cpu/gpu-page-faults=true: Unified Memory ํŽ˜์ด์ง€ ํดํŠธ ๋ชจ๋‹ˆํ„ฐ๋ง
  • --delay=2: ํ”„๋กœํŒŒ์ผ๋ง ์ „ 2์ดˆ ๋Œ€๊ธฐ (์ฝœ๋“œ ์Šคํƒ€ํŠธ ํšŒํ”ผ)
  • --duration=30: ์ตœ๋Œ€ 30์ดˆ๊ฐ„ ํ”„๋กœํŒŒ์ผ๋ง
  • --sample=cpu: ํ•ซ์ŠคํŒŸ ๋ถ„์„์„ ์œ„ํ•œ CPU ์ƒ˜ํ”Œ๋ง ํฌํ•จ
  • --cpuctxsw=process-tree: CPU ์ปจํ…์ŠคํŠธ ์Šค์œ„์น˜ ์ถ”์ 

๊ณ ๊ธ‰ NSight Compute ํ”„๋กœํŒŒ์ผ๋ง

ํฌ๊ด„์  ์ง€ํ‘œ๋ฅผ ํฌํ•จํ•œ ์ƒ์„ธ ์ปค๋„ ๋ถ„์„:

# ๋ชจ๋“  ์ง€ํ‘œ ์„ธํŠธ๋กœ ์ „์ฒด ์ปค๋„ ๋ถ„์„
ncu \
  --set full \
  --import-source=on \
  --kernel-id=:::1 \
  --launch-skip=0 \
  --launch-count=1 \
  --target-processes=all \
  --replay-mode=kernel \
  --cache-control=all \
  --clock-control=base \
  --apply-rules=yes \
  --check-exit-code=yes \
  --export=detailed_analysis \
  --force-overwrite \
  ./your_program

# ํŠน์ • ์„ฑ๋Šฅ ์ธก๋ฉด์— ์ง‘์ค‘
ncu \
  --set=@roofline \
  --section=InstructionStats \
  --section=LaunchStats \
  --section=Occupancy \
  --section=SpeedOfLight \
  --section=WarpStateStats \
  --metrics=sm__cycles_elapsed.avg,dram__throughput.avg.pct_of_peak_sustained_elapsed \
  --kernel-name regex:your_kernel_.* \
  --export=targeted_analysis \
  ./your_program

์ฃผ์š” NSight Compute ํ”Œ๋ž˜๊ทธ:

  • --set full: ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ์ง€ํ‘œ ์ˆ˜์ง‘ (ํฌ๊ด„์ ์ด์ง€๋งŒ ๋А๋ฆผ)
  • --set @roofline: ๋ฃจํ”„๋ผ์ธ ๋ถ„์„์— ์ตœ์ ํ™”๋œ ์„ธํŠธ
  • --import-source=on: ๊ฒฐ๊ณผ๋ฅผ ์†Œ์Šค ์ฝ”๋“œ์— ๋งคํ•‘
  • --replay-mode=kernel: ์ •ํ™•ํ•œ ์ธก์ •์„ ์œ„ํ•ด ์ปค๋„ ๋ฆฌํ”Œ๋ ˆ์ด
  • --cache-control=all: ์ผ๊ด€๋œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ GPU ์บ์‹œ ์ œ์–ด
  • --clock-control=base: ๊ธฐ๋ณธ ์ฃผํŒŒ์ˆ˜๋กœ ํด๋Ÿญ ๊ณ ์ •
  • --section=SpeedOfLight: Speed of Light ๋ถ„์„ ํฌํ•จ
  • --metrics=...: ํŠน์ • ์ง€ํ‘œ๋งŒ ์ˆ˜์ง‘
  • --kernel-name regex:pattern: ์ •๊ทœ์‹ ํŒจํ„ด์œผ๋กœ ์ปค๋„ ์ง€์ • (--kernel-regex๊ฐ€ ์•„๋‹˜)

ํ”„๋กœํŒŒ์ผ๋ง ์›Œํฌํ”Œ๋กœ์šฐ ๋ชจ๋ฒ” ์‚ฌ๋ก€

1. ์ ์ง„์  ํ”„๋กœํŒŒ์ผ๋ง ์ „๋žต

# Step 1: ๋น ๋ฅธ ๊ฐœ์š” (๋น ๋ฆ„)
nsys profile --trace=cuda --duration=10 --output=quick_look ./program

# Step 2: ์ƒ์„ธ ์‹œ์Šคํ…œ ๋ถ„์„ (์ค‘๊ฐ„)
nsys profile --trace=cuda,osrt,nvtx --cuda-memory-usage=true --output=detailed ./program

# Step 3: ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„ (๋А๋ฆฌ์ง€๋งŒ ํฌ๊ด„์ )
ncu --set=@roofline --kernel-name regex:hotspot_kernel ./program

2. ์‹ ๋ขฐ์„ฑ์„ ์œ„ํ•œ ๋‹ค์ค‘ ์‹คํ–‰ ๋ถ„์„

# ์—ฌ๋Ÿฌ ๋ฒˆ ํ”„๋กœํŒŒ์ผ๋งํ•˜๊ณ  ๋น„๊ต
for i in {1..5}; do
  nsys profile --output=run_${i} ./program
  nsys stats run_${i}.nsys-rep > stats_${i}.txt
done

# ๊ฒฐ๊ณผ ๋น„๊ต
diff stats_1.txt stats_2.txt
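diff๋กœ ๋‘ ์‹คํ–‰์˜ ํ…์ŠคํŠธ๋ฅผ ์ง์ ‘ ๋น„๊ตํ•˜๋Š” ๋Œ€์‹ , ์—ฌ๋Ÿฌ ์‹คํ–‰์˜ ์ธก์ •๊ฐ’์ด ์–ผ๋งˆ๋‚˜ ํ”๋“ค๋ฆฌ๋Š”์ง€๋ฅผ ์ˆ˜์น˜๋กœ ํ™•์ธํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๊ฐ€์ƒ์˜ ์ธก์ •๊ฐ’ ๋ชฉ๋ก์œผ๋กœ ํ‰๊ท ๊ณผ ๋ณ€๋™ ๊ณ„์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ํŒŒ์ด์ฌ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (times_ms ๊ฐ’์€ ์˜ˆ์‹œ์šฉ ๊ฐ€์ •์น˜์ž…๋‹ˆ๋‹ค):

```python
# ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ํ”„๋กœํŒŒ์ผ๋ง ์‹คํ–‰์—์„œ ์–ป์€ ์ปค๋„ ์‹œ๊ฐ„(ms)์ด ์•ˆ์ •์ ์ธ์ง€ ํŒ๋‹จํ•˜๋Š” ์Šค์ผ€์น˜
from statistics import mean, stdev

def is_stable(times, cv_threshold=0.05):
    """๋ณ€๋™ ๊ณ„์ˆ˜(ํ‘œ์ค€ํŽธ์ฐจ/ํ‰๊ท )๊ฐ€ ์ž„๊ณ„๊ฐ’ ์ดํ•˜๋ฉด ์ธก์ •์ด ์•ˆ์ •์ ์ด๋ผ๊ณ  ๋ด„."""
    m = mean(times)
    cv = stdev(times) / m
    return cv <= cv_threshold, m, cv

times_ms = [171.9, 172.3, 171.5, 173.0, 172.1]  # ๊ฐ€์ƒ์˜ 5ํšŒ ์ธก์ •๊ฐ’
stable, avg, cv = is_stable(times_ms)
print(f"mean={avg:.1f} ms, cv={cv:.3%}, stable={stable}")
```

๋ณ€๋™ ๊ณ„์ˆ˜๊ฐ€ ํฌ๋ฉด (์˜ˆ: 5% ์ด์ƒ) ํด๋Ÿญ ๊ณ ์ •์ด๋‚˜ ์›Œ๋ฐ์—… ๋“ฑ ํ™˜๊ฒฝ ์„ค์ •์„ ๋จผ์ € ์ ๊ฒ€ํ•œ ๋’ค ๋น„๊ตํ•˜๋Š” ๊ฒƒ์ด ์•ˆ์ „ํ•ฉ๋‹ˆ๋‹ค.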

3. ํƒ€๊ฒŸ ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง

# ๋จผ์ € ํ•ซ์ŠคํŒŸ ์ปค๋„ ์‹๋ณ„
nsys profile --trace=cuda,nvtx --output=overview ./program
nsys stats overview.nsys-rep | grep -A 10 "GPU Kernel Summary"

# ๊ทธ๋Ÿฐ ๋‹ค์Œ ํŠน์ • ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง
ncu --kernel-name="identified_hotspot_kernel" --set full ./program

ํ™˜๊ฒฝ ๋ฐ ๋นŒ๋“œ ๋ชจ๋ฒ” ์‚ฌ๋ก€

์ตœ์  ๋นŒ๋“œ ๊ตฌ์„ฑ

# ํ”„๋กœํŒŒ์ผ๋ง์šฉ: ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด ํฌํ•จ ์ตœ์ ํ™” ๋นŒ๋“œ
mojo build --debug-level=full --optimization-level=3 program.mojo -o program_profile

# ๋นŒ๋“œ ์„ค์ • ํ™•์ธ
mojo build --help | grep -E "(debug|optimization)"

ํ”„๋กœํŒŒ์ผ๋ง ํ™˜๊ฒฝ ์„ค์ •

# ์ผ๊ด€๋œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด GPU ๋ถ€์ŠคํŠธ ๋น„ํ™œ์„ฑํ™”
sudo nvidia-smi -ac 1215,1410  # ๋ฉ”๋ชจ๋ฆฌ ๋ฐ GPU ํด๋Ÿญ ๊ณ ์ •

# ๊ฒฐ์ •๋ก ์  ๋™์ž‘ ์„ค์ •
export CUDA_LAUNCH_BLOCKING=1  # ์ •ํ™•ํ•œ ํƒ€์ด๋ฐ์„ ์œ„ํ•œ ๋™๊ธฐ์‹ ์‹คํ–‰

# ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ๋“œ๋ผ์ด๋ฒ„ ์ œํ•œ ์™„ํ™”
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

๋ฉ”๋ชจ๋ฆฌ ๋ฐ ์„ฑ๋Šฅ ๊ฒฉ๋ฆฌ

# ํ”„๋กœํŒŒ์ผ๋ง ์ „ GPU ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™”
nvidia-smi --gpu-reset

# ๋‹ค๋ฅธ GPU ํ”„๋กœ์„ธ์Šค ๋น„ํ™œ์„ฑํ™”
sudo fuser -v /dev/nvidia*  # GPU ์‚ฌ์šฉ ์ค‘์ธ ํ”„๋กœ์„ธ์Šค ํ™•์ธ
sudo pkill -f cuda  # ํ•„์š”์‹œ CUDA ํ”„๋กœ์„ธ์Šค ์ข…๋ฃŒ

# ๋†’์€ ์šฐ์„ ์ˆœ์œ„๋กœ ์‹คํ–‰
sudo nice -n -20 nsys profile ./program

๋ถ„์„ ๋ฐ ๋ณด๊ณ  ๋ชจ๋ฒ” ์‚ฌ๋ก€

์ข…ํ•ฉ ๋ณด๊ณ ์„œ ์ƒ์„ฑ

# ์—ฌ๋Ÿฌ ๋ณด๊ณ ์„œ ํ˜•์‹ ์ƒ์„ฑ
nsys stats --report=cuda_api_sum,cuda_gpu_kern_sum,cuda_gpu_mem_time_sum --format=csv --output=. profile.nsys-rep

# ์™ธ๋ถ€ ๋ถ„์„์„ ์œ„ํ•ด ๋‚ด๋ณด๋‚ด๊ธฐ
nsys export --type=sqlite profile.nsys-rep
nsys export --type=json profile.nsys-rep

# ๋น„๊ต ๋ณด๊ณ ์„œ ์ƒ์„ฑ
nsys stats --report=cuda_gpu_kern_sum baseline.nsys-rep > baseline_kernels.txt
nsys stats --report=cuda_gpu_kern_sum optimized.nsys-rep > optimized_kernels.txt
diff -u baseline_kernels.txt optimized_kernels.txt

์„ฑ๋Šฅ ํšŒ๊ท€ ํ…Œ์ŠคํŠธ

#!/bin/bash
# CI/CD์šฉ ์ž๋™ํ™” ํ”„๋กœํŒŒ์ผ๋ง ์Šคํฌ๋ฆฝํŠธ
BASELINE_TIME=$(nsys stats baseline.nsys-rep | grep "Total Time" | awk '{print $3}')
CURRENT_TIME=$(nsys stats current.nsys-rep | grep "Total Time" | awk '{print $3}')

REGRESSION_THRESHOLD=1.10  # 10% ์„ฑ๋Šฅ ์ €ํ•˜ ์ž„๊ณ„๊ฐ’
if (( $(echo "$CURRENT_TIME > $BASELINE_TIME * $REGRESSION_THRESHOLD" | bc -l) )); then
    echo "Performance regression detected: ${CURRENT_TIME}ns vs ${BASELINE_TIME}ns"
    exit 1
fi

๋‹ค์Œ ๋‹จ๊ณ„

ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ๋ฅผ ์ดํ•ดํ–ˆ์œผ๋‹ˆ:

  1. ๊ธฐ์กด ์ปค๋„๋กœ ์—ฐ์Šต: ์ด๋ฏธ ํ’€์—ˆ๋˜ ํผ์ฆ๋“ค์„ ํ”„๋กœํŒŒ์ผ๋งํ•ด ๋ณด์„ธ์š”
  2. ์ตœ์ ํ™” ์ค€๋น„: Puzzle 31์—์„œ ์ด ํ†ต์ฐฐ์„ ์ ์œ ์œจ ์ตœ์ ํ™”์— ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค
  3. ๋„๊ตฌ ์ตํžˆ๊ธฐ: ๋‹ค์–‘ํ•œ NSight Systems์™€ NSight Compute ์˜ต์…˜์„ ์‹คํ—˜ํ•ด ๋ณด์„ธ์š”

๊ธฐ์–ตํ•˜์„ธ์š”: ํ”„๋กœํŒŒ์ผ๋ง์€ ๋‹จ์ˆœํžˆ ๋А๋ฆฐ ์ฝ”๋“œ๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์•„๋‹™๋‹ˆ๋‹ค - ํ”„๋กœ๊ทธ๋žจ์˜ ๋™์ž‘์„ ์ดํ•ดํ•˜๊ณ  ๊ทผ๊ฑฐ ์žˆ๋Š” ์ตœ์ ํ™” ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ถ”๊ฐ€ ํ”„๋กœํŒŒ์ผ๋ง ์ž๋ฃŒ:

๐Ÿ•ต ์บ์‹œ ํžˆํŠธ์˜ ์—ญ์„ค

๊ฐœ์š”

์ฒซ ๋ฒˆ์งธ ํ”„๋กœํŒŒ์ผ๋ง ํƒ์ • ์‚ฌ๊ฑด์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์„ธ ๊ฐœ์˜ GPU ์ปค๋„์ด ๋ชจ๋‘ ๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ output[i] = a[i] + b[i]์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋‹น์—ฐํžˆ ์„ฑ๋Šฅ๋„ ๊ฐ™๊ฒ ์ฃ ?

์•„๋‹™๋‹ˆ๋‹ค! ์ด ์ปค๋„๋“ค์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋Š” ๊ทน์ ์ž…๋‹ˆ๋‹ค - ํ•˜๋‚˜๋Š” ๋‚˜๋จธ์ง€๋ณด๋‹ค ์ˆ˜์‹ญ ๋ฐฐ๋‚˜ ๋А๋ฆฝ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ๋ถ„์˜ ์ž„๋ฌด: ๋ฐฉ๊ธˆ ๋ฐฐ์šด ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์™œ ๊ทธ๋Ÿฐ์ง€ ๋ฐํ˜€๋‚ด์„ธ์š”.

๋„์ „ ๊ณผ์ œ

GPU ์ตœ์ ํ™”์— ๋Œ€ํ•œ ๊ธฐ์กด ์ƒ์‹์„ ์™„์ „ํžˆ ๋’ค์ง‘๋Š” ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ๋ˆˆ์•ž์—๋Š” ๊ฒ‰๋ณด๊ธฐ์— ๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์ปค๋„ ์„ธ ๊ฐœ๊ฐ€ ์žˆ๊ณ , ๋ชจ๋‘ ์ •ํ™•ํžˆ ๊ฐ™์€ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

output[i] = a[i] + b[i]  # ๋‹จ์ˆœํ•œ ์‚ฐ์ˆ  ์—ฐ์‚ฐ - ๋ญ๊ฐ€ ์ž˜๋ชป๋  ์ˆ˜ ์žˆ์„๊นŒ?

์ถฉ๊ฒฉ์ ์ธ ํ˜„์‹ค:

  • ์„ธ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•˜๊ณ  ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ํ•˜๋‚˜์˜ ์ปค๋„์ด ๋‚˜๋จธ์ง€๋ณด๋‹ค ~50๋ฐฐ ๋А๋ฆฝ๋‹ˆ๋‹ค
  • ๊ฐ€์žฅ ๋А๋ฆฐ ์ปค๋„์ด ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ž…๋‹ˆ๋‹ค (์˜ˆ์ƒ๊ณผ ์ •๋ฐ˜๋Œ€!)
  • ์ผ๋ฐ˜์ ์ธ ์„ฑ๋Šฅ ์ง๊ด€์ด ์™„์ „ํžˆ ๋น—๋‚˜๊ฐ‘๋‹ˆ๋‹ค

ํƒ์ • ์ž„๋ฌด:

  1. ์„ฑ๋Šฅ ๋ฒ”์ธ ์‹๋ณ„ - ์–ด๋–ค ์ปค๋„์ด ์น˜๋ช…์ ์œผ๋กœ ๋А๋ฆฐ๊ฐ€?
  2. ์บ์‹œ์˜ ์—ญ์„ค ๊ทœ๋ช… - ๋†’์€ ์บ์‹œ ํžˆํŠธ๊ฐ€ ์™œ ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ์˜๋ฏธํ•˜๋Š”๊ฐ€?
  3. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ•ด๋… - ๋™์ผํ•œ ์—ฐ์‚ฐ์ด ์–ด๋–ป๊ฒŒ ์ด๋ ‡๊ฒŒ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๋Š”๊ฐ€?
  4. ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก  ํ•™์Šต - ์ถ”์ธก์ด ์•„๋‹Œ NSight ๋„๊ตฌ๋กœ ๊ทผ๊ฑฐ๋ฅผ ํ™•๋ณดํ•˜๋ผ

์™œ ์ค‘์š”ํ•œ๊ฐ€: ์ด ํผ์ฆ์€ CPU ๊ธฐ๋ฐ˜ ์ง๊ด€์— ๋„์ „ํ•˜๋Š” GPU ์„ฑ๋Šฅ์˜ ๊ทผ๋ณธ ์›๋ฆฌ๋ฅผ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๊ธฐ๋ฅด๋Š” ์—ญ๋Ÿ‰์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„๋ณด๋‹ค ์ค‘์š”ํ•œ ์‹ค๋ฌด GPU ์ตœ์ ํ™”์— ์ง์ ‘ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

๋ฐ˜์ „: ์ด ๊ณผ์ •์€ ํ”„๋กœ๋•์…˜ ์„ฑ๋Šฅ ์ด์Šˆ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋“ฏ์ด, ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ๋จผ์ € ๋ณด์ง€ ์•Š๊ณ  ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋งŒ์œผ๋กœ ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ๋ฅผ ์–ป์€ ํ›„์— ์ฝ”๋“œ๋ฅผ ๋“ค์—ฌ๋‹ค๋ด…๋‹ˆ๋‹ค.

ํƒ์ • ๋„๊ตฌ ๋ชจ์Œ

ํ”„๋กœํŒŒ์ผ๋ง ํŠœํ† ๋ฆฌ์–ผ์—์„œ ๋ฐฐ์šด ๋„๊ตฌ๋“ค:

  • NSight Systems (nsys) - ์–ด๋–ค ์ปค๋„์ด ๋А๋ฆฐ์ง€ ์ฐพ๊ธฐ
  • NSight Compute (ncu) - ์ปค๋„์ด ์™œ ๋А๋ฆฐ์ง€ ๋ถ„์„ํ•˜๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ ์ง€ํ‘œ - ๋น„ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด ํƒ์ง€

์‹œ์ž‘ํ•˜๊ธฐ

Step 1: ๋ฒค์น˜๋งˆํฌ ์‹คํ–‰

pixi shell -e nvidia
mojo problems/p30/p30.mojo --benchmark

์ปค๋„ ๊ฐ„์— ๊ทน์ ์ธ ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ํ•˜๋‚˜์˜ ์ปค๋„์ด ๋‚˜๋จธ์ง€๋ณด๋‹ค ํ›จ์”ฌ ๋А๋ฆฝ๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋งŒ์œผ๋กœ ์›์ธ์„ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค.

์ถœ๋ ฅ ์˜ˆ์‹œ:

| name    | met (ms)  | iters |
| ------- | --------- | ----- |
| kernel1 | 171.85    | 11    |
| kernel2 | 1546.68   | 11    |  <- ์ด๊ฒƒ๋งŒ ์œ ๋… ๋А๋ฆฌ๋‹ค!
| kernel3 | 172.18    | 11    |

Step 2: ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ๋นŒ๋“œ ์ค€๋น„

ํ•„์ˆ˜: ์ •ํ™•ํ•œ ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•ด ์ตœ์ ํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜์—ฌ ๋นŒ๋“œํ•ฉ๋‹ˆ๋‹ค:

mojo build --debug-level=full problems/p30/p30.mojo -o problems/p30/p30_profiler

์ค‘์š”ํ•œ ์ด์œ :

  • ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด: ํ”„๋กœํŒŒ์ผ๋Ÿฌ์— ์™„์ „ํ•œ ์‹ฌ๋ณผ ํ…Œ์ด๋ธ”, ๋ณ€์ˆ˜๋ช…, ์†Œ์Šค ๋ผ์ธ ๋งคํ•‘์„ ์ œ๊ณต
  • ์ข…ํ•ฉ ๋ถ„์„: NSight ๋„๊ตฌ๊ฐ€ ์„ฑ๋Šฅ ๋ฐ์ดํ„ฐ๋ฅผ ํŠน์ • ์ฝ”๋“œ ์œ„์น˜์™€ ์—ฐ๊ด€ ์ง“๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅ
  • ์ตœ์ ํ™” ์œ ์ง€: ํ”„๋กœ๋•์…˜ ๋นŒ๋“œ์™€ ๋™์ผํ•œ ํ˜„์‹ค์ ์ธ ์„ฑ๋Šฅ ์ธก์ • ๋ณด์žฅ

Step 3: ์‹œ์Šคํ…œ ์ „์ฒด ์กฐ์‚ฌ (NSight Systems)

๊ฐ ์ปค๋„์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜์—ฌ ์ „์ฒด ๊ทธ๋ฆผ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

# ์ตœ์ ํ™” ๋นŒ๋“œ๋กœ ๊ฐ ์ปค๋„์„ ๊ฐœ๋ณ„ ํ”„๋กœํŒŒ์ผ๋ง (์ฝœ๋“œ ์Šคํƒ€ํŠธ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ์›Œ๋ฐ์—… ํฌํ•จ)
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel1_profile ./problems/p30/p30_profiler --kernel1
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel2_profile ./problems/p30/p30_profiler --kernel2
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel3_profile ./problems/p30/p30_profiler --kernel3

# ๊ฒฐ๊ณผ ๋ถ„์„
nsys stats --force-export=true ./problems/p30/kernel1_profile.nsys-rep > ./problems/p30/kernel1_profile.txt
nsys stats --force-export=true ./problems/p30/kernel2_profile.nsys-rep > ./problems/p30/kernel2_profile.txt
nsys stats --force-export=true ./problems/p30/kernel3_profile.nsys-rep > ./problems/p30/kernel3_profile.txt

ํ™•์ธํ•  ์‚ฌํ•ญ:

  • GPU ์ปค๋„ ์š”์•ฝ - ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š”๊ฐ€?
  • ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„ - ์ฐจ์ด๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋‚˜๋Š”๊ฐ€?
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ํŒจํ„ด - ๊ตฌํ˜„ ๊ฐ„์— ๋น„์Šทํ•œ๊ฐ€?

Step 4: ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„ (NSight Compute)

๋А๋ฆฐ ์ปค๋„์„ ์‹๋ณ„ํ•œ ํ›„, NSight Compute๋กœ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค:

# ์ตœ์ ํ™” ๋นŒ๋“œ๋กœ ๊ฐ ์ปค๋„์˜ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด ์‹ฌ์ธต ๋ถ„์„
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel1_analysis ./problems/p30/p30_profiler --kernel1
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel2_analysis ./problems/p30/p30_profiler --kernel2
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel3_analysis ./problems/p30/p30_profiler --kernel3

# ๊ฒฐ๊ณผ ํ™•์ธ
ncu --import ./problems/p30/kernel1_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel2_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel3_analysis.ncu-rep --page details

์œ„ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

Kernel1: Memory Throughput: ~308 Gbyte/s, Max Bandwidth: ~51%
Kernel2: Memory Throughput: ~6 Gbyte/s,   Max Bandwidth: ~12%
Kernel3: Memory Throughput: ~310 Gbyte/s, Max Bandwidth: ~52%

์ฃผ์š” ์กฐ์‚ฌ ์ง€ํ‘œ:

  • Memory Throughput (Gbyte/s) - ์‹ค์ œ ๋‹ฌ์„ฑํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  • Max Bandwidth (%) - ์ด๋ก ์  ์ตœ๋Œ€ ๋Œ€์—ญํญ ๋Œ€๋น„ ํ™œ์šฉ๋ฅ 
  • L1/TEX Hit Rate (%) - L1 ์บ์‹œ ํšจ์œจ
  • L2 Hit Rate (%) - L2 ์บ์‹œ ํšจ์œจ

๐Ÿค” ๋ฐ˜์ง๊ด€์ ์ธ ๊ฒฐ๊ณผ: Kernel2๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ด๋ฉด์„œ ๊ฐ€์žฅ ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด ํ’€์–ด์•ผ ํ•  ํ•ต์‹ฌ ๋ฏธ์Šคํ„ฐ๋ฆฌ์ž…๋‹ˆ๋‹ค.

Step 5: ํƒ์ • ์งˆ๋ฌธ

ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ปค๋„ ์ฝ”๋“œ problems/p30/p30.mojo๋ฅผ ์‚ดํŽด๋ณด๋ฉฐ ๋‹ค์Œ ์งˆ๋ฌธ์— ๋‹ตํ•ด ๋ณด์„ธ์š”:

์„ฑ๋Šฅ ๋ถ„์„

  1. ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋†’์€ Memory Throughput์„ ๋‹ฌ์„ฑํ•˜๋Š”๊ฐ€? (Gbyte/s ๊ฐ’ ํ™•์ธ)
  2. ์–ด๋–ค ์ปค๋„์˜ Max Bandwidth ํ™œ์šฉ๋ฅ ์ด ๊ฐ€์žฅ ๋‚ฎ์€๊ฐ€? (๋ฐฑ๋ถ„์œจ ๋น„๊ต)
  3. ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์˜ ์„ฑ๋Šฅ ๊ฒฉ์ฐจ๋Š” ์–ผ๋งˆ์ธ๊ฐ€? (๊ฐ€์žฅ ๋น ๋ฅธ ๊ฒƒ๊ณผ ๊ฐ€์žฅ ๋А๋ฆฐ ๊ฒƒ์˜ ๋ฐฐ์ˆ˜ ์ฐจ์ด)

์บ์‹œ์˜ ์—ญ์„ค

  1. ์–ด๋–ค ์ปค๋„์˜ L1/TEX Hit Rate๊ฐ€ ๊ฐ€์žฅ ๋†’์€๊ฐ€?
  2. ์–ด๋–ค ์ปค๋„์˜ L2 Hit Rate๊ฐ€ ๊ฐ€์žฅ ๋†’์€๊ฐ€?
  3. ๐Ÿคฏ ์บ์‹œ ํžˆํŠธ์œจ์ด ๊ฐ€์žฅ ๋†’์€ ์ปค๋„์ด ์™œ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ๋‚˜์œ๊ฐ€?

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํƒ๊ตฌ

  1. ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์ด ์‹ค์ œ๋กœ ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋Š”๊ฐ€?
  2. ์–ด๋–ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๋†’์€ ์บ์‹œ ํžˆํŠธ์™€ ๋‚ฎ์€ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๋™์‹œ์— ์œ ๋ฐœํ•˜๋Š”๊ฐ€?
  3. ์™œ "ํšจ์œจ์ ์ธ ์บ์‹ฑ"์ด "๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ"์˜ ์ฆ์ƒ์ผ ์ˆ˜ ์žˆ๋Š”๊ฐ€?

โ€œ์•„ํ•˜!โ€ ์ˆœ๊ฐ„

  1. ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ, ์ด ์‚ฌ๋ก€๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” GPU ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ทผ๋ณธ ์›๋ฆฌ๋Š” ๋ฌด์—‡์ธ๊ฐ€?

๋ฐœ๊ฒฌํ•  ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋•Œ๋กœ๋Š” ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์ด ์„ฑ๋Šฅ ์Šน๋ฆฌ๊ฐ€ ์•„๋‹ˆ๋ผ ์œ„ํ—˜ ์‹ ํ˜ธ์ž…๋‹ˆ๋‹ค!

์†”๋ฃจ์…˜

์ด ๋ฏธ์Šคํ„ฐ๋ฆฌ๋Š” GPU ์„ฑ๋Šฅ์˜ ๊ทผ๋ณธ ์›๋ฆฌ๋ฅผ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค: ์ปค๋„์ด ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋”๋ผ๋„ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์„ฑ๋Šฅ์„ ์ง€๋ฐฐํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ๊ฐ€ ๋ฐํžˆ๋Š” ๊ฒƒ:

  1. ์„ฑ๋Šฅ ์œ„๊ณ„: Kernel1๊ณผ Kernel3์€ ๋น ๋ฅด๊ณ , Kernel2๋Š” ์น˜๋ช…์ ์œผ๋กœ ๋А๋ฆผ (์ˆ˜์‹ญ ๋ฐฐ ์ฐจ์ด)
  2. ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์ด ๋‹ต์„ ๋งํ•ด์ค€๋‹ค: ๋น ๋ฅธ ์ปค๋„์€ ๋†’์€ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์„ ๋‹ฌ์„ฑํ•˜๊ณ , ๋А๋ฆฐ ์ปค๋„์€ ์ตœ์†Œํ•œ์˜ ํ™œ์šฉ๋ฅ ๋งŒ ๋‹ฌ์„ฑ
  3. ์บ์‹œ์˜ ์—ญ์„ค: ๊ฐ€์žฅ ๋А๋ฆฐ ์ปค๋„์ด ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ž„ - ๋†’์€ ์บ์‹œ ํžˆํŠธ๊ฐ€ ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌ
  4. ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ GPU ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„๋ณด๋‹ค ์ค‘์š”

์ƒ์„ธ ์†”๋ฃจ์…˜๊ณผ ์‹ฌ์ธต ์„ค๋ช…

์ด ํ”„๋กœํŒŒ์ผ๋ง ํƒ์ • ์‚ฌ๊ฑด์€ ์ปค๋„์ด ๋™์ผํ•œ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋”๋ผ๋„ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์–ด๋–ป๊ฒŒ ์ˆ˜์‹ญ ๋ฐฐ์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ํ™•์ธํ•œ ์„ฑ๋Šฅ ๊ทผ๊ฑฐ

NSight Systems ํƒ€์ž„๋ผ์ธ ๋ถ„์„:

  • Kernel 1: ์งง์€ ์‹คํ–‰ ์‹œ๊ฐ„ - ํšจ์œจ์ 
  • Kernel 3: Kernel 1๊ณผ ์œ ์‚ฌ - ํšจ์œจ์ 
  • Kernel 2: ๊ทน์ ์œผ๋กœ ๊ธด ์‹คํ–‰ ์‹œ๊ฐ„ - ๋น„ํšจ์œจ์ 

NSight Compute ๋ฉ”๋ชจ๋ฆฌ ๋ถ„์„ (ํ•˜๋“œ์›จ์–ด ๋ฌด๊ด€ํ•œ ํŒจํ„ด):

  • ํšจ์œจ์ ์ธ ์ปค๋„ (1 & 3): ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰, ์–‘ํ˜ธํ•œ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ , ๋ณดํ†ต ์ˆ˜์ค€์˜ ์บ์‹œ ํžˆํŠธ์œจ
  • ๋น„ํšจ์œจ์ ์ธ ์ปค๋„ (2): ๋งค์šฐ ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰, ์—ด์•…ํ•œ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ , ๊ทน๋„๋กœ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ

์บ์‹œ์˜ ์—ญ์„ค ๊ทœ๋ช…

๐Ÿคฏ ๋ฐ˜์ง๊ด€์ ์ธ ๋ฐœ๊ฒฌ:

  • Kernel2๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ด๋ฉด์„œ ์„ฑ๋Šฅ์€ ์ตœ์•…
  • ๊ธฐ์กด ์ƒ์‹์— ๋Œ€ํ•œ ๋„์ „: โ€œ๋†’์€ ์บ์‹œ ํžˆํŠธ = ์ข‹์€ ์„ฑ๋Šฅโ€
  • ์ง„์‹ค: ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์€ ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์˜ ์ฆ์ƒ์ผ ์ˆ˜ ์žˆ์Œ

์บ์‹œ์˜ ์—ญ์„ค์ด ๋ฐœ์ƒํ•˜๋Š” ์ด์œ :

์ „ํ†ต์ ์ธ CPU ์ง๊ด€ (GPU์—์„œ๋Š” ํ‹€๋ฆผ):

  • ์บ์‹œ ํžˆํŠธ์œจ์ด ๋†’์„์ˆ˜๋ก ํ•ญ์ƒ ์„ฑ๋Šฅ์ด ์ข‹๋‹ค
  • ์บ์‹œ ํžˆํŠธ๋Š” ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ์ค„์—ฌ ํšจ์œจ์„ ๋†’์ธ๋‹ค

GPU ๋ฉ”๋ชจ๋ฆฌ์˜ ํ˜„์‹ค (์˜ฌ๋ฐ”๋ฅธ ์ดํ•ด):

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ๋ณ‘ํ•ฉ์ด ์บ์‹ฑ๋ณด๋‹ค ์ค‘์š”
  • ๋น„ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์€ ์ธ์œ„์ ์œผ๋กœ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ถ€ํ’€๋ฆด ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์ง„์ •ํ•œ ์„ฑ๋Šฅ ์ง€ํ‘œ

๊ทผ๋ณธ ์›์ธ ๋ถ„์„ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

p30.mojo์˜ ์‹ค์ œ ์ปค๋„ ๊ตฌํ˜„:

Kernel 1 - ํšจ์œจ์ ์ธ ๋ณ‘ํ•ฉ ์ ‘๊ทผ:

fn kernel1[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    i = Int(block_dim.x * block_idx.x + thread_idx.x)
    if i < size:
        output[i] = a[i] + b[i]


ํ‘œ์ค€ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ - ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ

Kernel 2 - ๋น„ํšจ์œจ์ ์ธ ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ:

fn kernel2[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    tid = Int(block_idx.x * block_dim.x + thread_idx.x)
    stride = 512

    i = tid
    while i < size:
        output[i] = a[i] + b[i]
        i += stride


ํฐ stride=512๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ฐ„๊ฒฉ ๋ฐœ์ƒ - ๋™์ผํ•œ ์—ฐ์‚ฐ์ด์ง€๋งŒ ํฉ์–ด์ง„ ์ ‘๊ทผ

Kernel 3 - ํšจ์œจ์ ์ธ ์—ญ์ˆœ ์ ‘๊ทผ:

fn kernel3[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    tid = Int(block_idx.x * block_dim.x + thread_idx.x)
    total_threads = (SIZE // 1024) * 1024

    for step in range(0, size, total_threads):
        forward_i = step + tid
        if forward_i < size:
            reverse_i = size - 1 - forward_i
            output[reverse_i] = a[reverse_i] + b[reverse_i]


์—ญ์ˆœ ์ธ๋ฑ์‹ฑ์ด์ง€๋งŒ ์—ฌ์ „ํžˆ ์˜ˆ์ธก ๊ฐ€๋Šฅ - ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ์ฃผ์†Œ์— ์ ‘๊ทผ (๋ฐฉํ–ฅ๋งŒ ๋ฐ˜๋Œ€)

ํŒจํ„ด ๋ถ„์„:

  • Kernel 1: ์ „ํ˜•์ ์ธ ๋ณ‘ํ•ฉ ์ ‘๊ทผ - ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
  • Kernel 2: ์น˜๋ช…์ ์ธ ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ - ์Šค๋ ˆ๋“œ๊ฐ€ 512๊ฐœ ์š”์†Œ์”ฉ ๊ฑด๋„ˆ๋œ€
  • Kernel 3: ์—ญ์ˆœ์ด์ง€๋งŒ ์›Œํ”„ ๋‚ด์—์„œ๋Š” ๋ณ‘ํ•ฉ ์œ ์ง€ - ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ํŒจํ„ด

๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ์ดํ•ด

GPU ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ ๊ธฐ์ดˆ:

  • ์›Œํ”„ ์‹คํ–‰: 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•จ๊ป˜ ์‹คํ–‰
  • ์บ์‹œ ๋ผ์ธ ํฌ๊ธฐ: 128๋ฐ”์ดํŠธ (float32 ๊ฐ’ 32๊ฐœ)
  • ๋ณ‘ํ•ฉ ์š”๊ฑด: ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•ด์•ผ ํ•จ

p30.mojo ์„ค์ • ์ƒ์„ธ:

comptime SIZE = 16 * 1024 * 1024          # 16M ์š”์†Œ (float32 ๋ฐ์ดํ„ฐ 64MB)
comptime THREADS_PER_BLOCK = (1024, 1)    # ๋ธ”๋ก๋‹น 1024 ์Šค๋ ˆ๋“œ
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)  # ์ด 16,384 ๋ธ”๋ก
comptime dtype = DType.float32             # ์š”์†Œ๋‹น 4๋ฐ”์ดํŠธ

์ด ์„ค์ •์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹ (16M): ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์˜ ์ฐจ์ด๊ฐ€ ๋ช…ํ™•ํ•˜๊ฒŒ ๋“œ๋Ÿฌ๋‚จ
  • ๋ธ”๋ก๋‹น 1024 ์Šค๋ ˆ๋“œ: CUDA ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ธ”๋ก๋‹น 32๊ฐœ ์›Œํ”„: ๊ฐ ๋ธ”๋ก์— 32๊ฐœ์˜ ์›Œํ”„(๊ฐ 32 ์Šค๋ ˆ๋“œ)๊ฐ€ ํฌํ•จ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ ์‹œ๊ฐํ™”:

KERNEL 1 (๋ณ‘ํ•ฉ):                KERNEL 2 (stride 512):
์›Œํ”„ ์Šค๋ ˆ๋“œ 0-31:               ์›Œํ”„ ์Šค๋ ˆ๋“œ 0-31:
  Thread 0: Memory[0]            Thread 0: Memory[0]
  Thread 1: Memory[1]            Thread 1: Memory[512]
  Thread 2: Memory[2]            Thread 2: Memory[1024]
  ...                           ...
  Thread 31: Memory[31]          Thread 31: Memory[15872]

๊ฒฐ๊ณผ: ์บ์‹œ ๋ผ์ธ 1ํšŒ fetch          ๊ฒฐ๊ณผ: ๋ณ„๋„์˜ ์บ์‹œ ๋ผ์ธ 32ํšŒ fetch
์ƒํƒœ: ~308 GB/s ์ฒ˜๋ฆฌ๋Ÿ‰            ์ƒํƒœ: ~6 GB/s ์ฒ˜๋ฆฌ๋Ÿ‰
์บ์‹œ: ํšจ์œจ์  ํ™œ์šฉ                  ์บ์‹œ: ๊ฐ™์€ ๋ผ์ธ์„ ๋ฐ˜๋ณต ํžˆํŠธ!

KERNEL 3 (์—ญ์ˆœ์ด์ง€๋งŒ ๋ณ‘ํ•ฉ):

์›Œํ”„ ์Šค๋ ˆ๋“œ 0-31 (์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต):
  Thread 0: Memory[SIZE-1]     (reverse_i = SIZE-1-0)
  Thread 1: Memory[SIZE-2]     (reverse_i = SIZE-1-1)
  Thread 2: Memory[SIZE-3]     (reverse_i = SIZE-1-2)
  ...
  Thread 31: Memory[SIZE-32]   (reverse_i = SIZE-1-31)

๊ฒฐ๊ณผ: ์ธ์ ‘ํ•œ ์ฃผ์†Œ (๋ฐฉํ–ฅ๋งŒ ๋ฐ˜๋Œ€)
์ƒํƒœ: ~310 GB/s ์ฒ˜๋ฆฌ๋Ÿ‰ (Kernel 1๊ณผ ๊ฑฐ์˜ ๋™์ผ)
์บ์‹œ: ์—ญ์ˆœ์ž„์—๋„ ํšจ์œจ์  ํ™œ์šฉ

์บ์‹œ์˜ ์—ญ์„ค ์„ค๋ช…

Kernel2 (stride=512)๊ฐ€ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์—๋„ ์„ฑ๋Šฅ์ด ๋‚˜์œ ์ด์œ :

stride=512์˜ ์žฌ์•™ ์„ค๋ช…:

# ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํฐ ๊ฐ„๊ฒฉ์œผ๋กœ ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌ:
Thread 0: elements [0, 512, 1024, 1536, 2048, ...]
Thread 1: elements [1, 513, 1025, 1537, 2049, ...]
Thread 2: elements [2, 514, 1026, 1538, 2050, ...]
...

์ด๊ฒƒ์ด ์บ์‹œ์˜ ์—ญ์„ค์„ ๋งŒ๋“œ๋Š” ์ด์œ :

  1. ์บ์‹œ ๋ผ์ธ ๋ฐ˜๋ณต: 512๊ฐœ ์š”์†Œ๋ฅผ ๊ฑด๋„ˆ๋›ฐ์–ด๋„ ๊ฒน์น˜๋Š” ์บ์‹œ ๋ผ์ธ ์˜์—ญ ์•ˆ์— ๋จธ๋ฌด๋ฆ„
  2. ๊ฑฐ์ง“ ํšจ์œจ์˜ ํ™˜์ƒ: ๊ฐ™์€ ์บ์‹œ ๋ผ์ธ์— ๋ฐ˜๋ณต ์ ‘๊ทผ = ์ธ์œ„์ ์œผ๋กœ ๋†’์€ โ€œํžˆํŠธ์œจโ€
  3. ๋Œ€์—ญํญ ์žฌ์•™: 32๊ฐœ ์Šค๋ ˆ๋“œ ร— 32๊ฐœ ๋ณ„๋„ ์บ์‹œ ๋ผ์ธ = ๋ง‰๋Œ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ
  4. ์›Œํ”„ ์‹คํ–‰ ๋ถˆ์ผ์น˜: GPU๋Š” ๋ณ‘ํ•ฉ ์ ‘๊ทผ์— ๋งž๊ฒŒ ์„ค๊ณ„๋˜์—ˆ์ง€๋งŒ, ํฉ์–ด์ง„ ์ ‘๊ทผ์„ ๋ฐ›์Œ

float32 (๊ฐ 4๋ฐ”์ดํŠธ) ๊ตฌ์ฒด ์˜ˆ์‹œ:

  • ์บ์‹œ ๋ผ์ธ: 128๋ฐ”์ดํŠธ = float32 ๊ฐ’ 32๊ฐœ
  • stride 512: ์Šค๋ ˆ๋“œ๊ฐ€ 512ร—4 = 2048๋ฐ”์ดํŠธ = 16 ์บ์‹œ ๋ผ์ธ ๊ฐ„๊ฒฉ์œผ๋กœ ์ ํ”„!
  • ์›Œํ”„ ์˜ํ–ฅ: 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ 1๊ฐœ ๋Œ€์‹  32๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์บ์‹œ ๋ผ์ธ์„ ํ•„์š”๋กœ ํ•จ

ํ•ต์‹ฌ ํ†ต์ฐฐ: Kernel2์˜ ๋†’์€ ์บ์‹œ ํžˆํŠธ๋Š” ๋น„ํšจ์œจ์ ์œผ๋กœ ๊ฐ€์ ธ์˜จ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋ฐ˜๋ณต ์ ‘๊ทผ์ด์ง€, ํ˜„๋ช…ํ•œ ์บ์‹ฑ์ด ์•„๋‹™๋‹ˆ๋‹ค!

ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก  ํ†ต์ฐฐ

์ฒด๊ณ„์  ํƒ์ • ์ ‘๊ทผ๋ฒ•:

1๋‹จ๊ณ„: NSight Systems (์ „์ฒด ๊ทธ๋ฆผ)

  • ์–ด๋–ค ์ปค๋„์ด ๋А๋ฆฐ์ง€ ์‹๋ณ„
  • ๋ช…๋ฐฑํ•œ ๋ณ‘๋ชฉ ๋ฐฐ์ œ (๋ฉ”๋ชจ๋ฆฌ ์ „์†ก, API ์˜ค๋ฒ„ํ—ค๋“œ)
  • ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด์— ์ง‘์ค‘

2๋‹จ๊ณ„: NSight Compute (์‹ฌ์ธต ๋ถ„์„)

  • ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰ ์ง€ํ‘œ ๋ถ„์„
  • ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ  ๋ฐฑ๋ถ„์œจ ๋น„๊ต
  • ์บ์‹œ ํžˆํŠธ์œจ๊ณผ ํŒจํ„ด ์กฐ์‚ฌ

3๋‹จ๊ณ„: ๊ทผ๊ฑฐ๋ฅผ ์ด๋ก ์œผ๋กœ ์—ฐ๊ฒฐ

ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ โ†’ ์ฝ”๋“œ ๋ถ„์„:

NSight Compute ๊ฒฐ๊ณผ:              ์‹ค์ œ ์ฝ”๋“œ ํŒจํ„ด:
- Kernel1: ~308 GB/s            โ†’ i = block_idx*block_dim + thread_idx (๋ณ‘ํ•ฉ)
- Kernel2: ~6 GB/s, 99% L2 hits โ†’ i += 512 (์น˜๋ช…์  stride)
- Kernel3: ~310 GB/s            โ†’ reverse_i = size-1-forward_i (์—ญ์ˆœ ๋ณ‘ํ•ฉ)

ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์„ ์ง์ ‘ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค!

๊ทผ๊ฑฐ์—์„œ ์ฝ”๋“œ๋กœ์˜ ์—ฐ๊ฒฐ:

  • ๋†’์€ ์ฒ˜๋ฆฌ๋Ÿ‰ + ๋ณดํ†ต ์บ์‹œ ํžˆํŠธ์œจ = ๋ณ‘ํ•ฉ ์ ‘๊ทผ (Kernel 1 & 3)
  • ๋‚ฎ์€ ์ฒ˜๋ฆฌ๋Ÿ‰ + ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ = ๋น„ํšจ์œจ์  ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ (Kernel 2)
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์บ์‹œ ํ†ต๊ณ„์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ์ง„์ •ํ•œ ํšจ์œจ์„ ๋“œ๋Ÿฌ๋ƒ„

์‹ค๋ฌด ์„ฑ๋Šฅ ์‹œ์‚ฌ์ 

์ด ํŒจํ„ด์ด ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” GPU ์‘์šฉ ๋ถ„์•ผ:

๊ณผํ•™ ์ปดํ“จํŒ…:

  • ์Šคํ…์‹ค ์—ฐ์‚ฐ: ๊ทธ๋ฆฌ๋“œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ์˜ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
  • ์„ ํ˜• ๋Œ€์ˆ˜: ํ–‰๋ ฌ ์ˆœํšŒ ์ˆœ์„œ (ํ–‰ ์šฐ์„  vs ์—ด ์šฐ์„ )
  • ํŽธ๋ฏธ๋ถ„ ๋ฐฉ์ •์‹ ํ’€์ด: ์œ ํ•œ ์ฐจ๋ถ„๋ฒ•์—์„œ์˜ ๊ฒฉ์ž์  ์ ‘๊ทผ ํŒจํ„ด

๊ทธ๋ž˜ํ”ฝ์Šค ๋ฐ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ:

  • ํ…์Šค์ฒ˜ ํ•„ํ„ฐ๋ง: ์…ฐ์ด๋”์—์„œ์˜ ์ƒ˜ํ”Œ ์ ‘๊ทผ ํŒจํ„ด
  • ์ด๋ฏธ์ง€ ํ•ฉ์„ฑ๊ณฑ: ํ•„ํ„ฐ ์ปค๋„์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
  • ์ƒ‰ ๊ณต๊ฐ„ ๋ณ€ํ™˜: ์ฑ„๋„ ์ธํ„ฐ๋ฆฌ๋น™ ์ „๋žต

๋จธ์‹ ๋Ÿฌ๋‹:

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ: GEMM์—์„œ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”
  • ํ…์„œ ์ถ•์•ฝ: ๋‹ค์ฐจ์› ๋ฐฐ์—ด ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ์™€ ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ

GPU ์ตœ์ ํ™”์˜ ๊ทผ๋ณธ ์›์น™

๋ฉ”๋ชจ๋ฆฌ ์šฐ์„  ์ตœ์ ํ™” ์ „๋žต:

  1. ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ์ง€๋ฐฐ: ์ ‘๊ทผ ํŒจํ„ด์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„๋ณด๋‹ค ๋” ์ค‘์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  2. ๋ณ‘ํ•ฉ์ด ํ•ต์‹ฌ: ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋„๋ก ์„ค๊ณ„
  3. ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ  ์ธก์ •: ์บ์‹œ ํ†ต๊ณ„๊ฐ€ ์•„๋‹Œ ์‹ค์ œ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์ง‘์ค‘
  4. ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง: NSight ๋„๊ตฌ๋กœ ์‹ค์ œ ๋ณ‘๋ชฉ์„ ํŒŒ์•…

ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ†ต์ฐฐ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ: ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์„ฑ๋Šฅ์„ ๊ฒฐ์ •
  • ์บ์‹œ ์ง€ํ‘œ์˜ ํ•จ์ •: ๋†’์€ ํžˆํŠธ์œจ์ด ํ•ญ์ƒ ํšจ์œจ์„ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š์Œ
  • ์›Œํ”„ ๋ ˆ๋ฒจ ์‚ฌ๊ณ : 32๊ฐœ ์Šค๋ ˆ๋“œ ์‹คํ–‰ ๊ทธ๋ฃน์„ ์œ„ํ•œ ์ ‘๊ทผ ํŒจํ„ด ์„ค๊ณ„
  • ํ•˜๋“œ์›จ์–ด ์ธ์‹ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ดํ•ด๊ฐ€ ํ•„์ˆ˜

ํ•ต์‹ฌ ๊ตํ›ˆ

์ด๋ฒˆ์— ํƒ๊ตฌํ•œ ์‚ฌ๋ก€๋Š” GPU ์„ฑ๋Šฅ ์ตœ์ ํ™”๊ฐ€ CPU ์ง๊ด€์„ ๋ฒ„๋ฆฌ๊ณ  ๋ฉ”๋ชจ๋ฆฌ ์ค‘์‹ฌ ์‚ฌ๊ณ ๋กœ ์ „ํ™˜ํ•  ๊ฒƒ์„ ์š”๊ตฌํ•œ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์€ ์ข‹์€ ์„ฑ๋Šฅ์ด ์•„๋‹ˆ๋ผ ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์บ์‹œ ํ†ต๊ณ„๋ณด๋‹ค ์ค‘์š”
  • ๋‹จ์ˆœํ•œ ๋ณ‘ํ•ฉ ํŒจํ„ด์ด ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณด๋‹ค ๋” ๋น ๋ฅธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  • ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๊ฐ€ ์ง๊ด€์œผ๋กœ๋Š” ์•Œ ์ˆ˜ ์—†๋Š” ์„ฑ๋Šฅ์˜ ์ง„์‹ค์„ ๋“œ๋Ÿฌ๋ƒ„

์‹ค์ „ ๋ฐฉ๋ฒ•๋ก :

  • NSight Systems์™€ NSight Compute๋กœ ์ฒด๊ณ„์ ์œผ๋กœ ํ”„๋กœํŒŒ์ผ๋ง
  • ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋„๋ก ์„ค๊ณ„ (๋ณ‘ํ•ฉ)
  • ์ง๊ด€์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๊ทผ๊ฑฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ตœ์ ํ™” ๊ฒฐ์ •

์บ์‹œ์˜ ์—ญ์„ค์€ ์•„ํ‚คํ…์ฒ˜์— ๋Œ€ํ•œ ์ดํ•ด ์—†์ด ๊ณ ์ˆ˜์ค€ ์ง€ํ‘œ์— ์˜์กดํ•˜๋ฉด ์ž˜๋ชป๋œ ๊ฒฐ๋ก ์— ์ด๋ฅผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋„˜์–ด ๋‘๋ฃจ ์ ์šฉ๋˜๋Š” ๊ตํ›ˆ์ž…๋‹ˆ๋‹ค.

Puzzle 31: ์ ์œ ์œจ ์ตœ์ ํ™”

์ด ํผ์ฆ์ด ์ค‘์š”ํ•œ ์ด์œ 

Puzzle 30์˜ ์—ฐ์žฅ์„ : GPU ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ๋ฐฐ์šฐ๊ณ , ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์–ด๋–ป๊ฒŒ ๊ทน์ ์ธ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š”์ง€ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ๋‚˜์•„๊ฐˆ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค: ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™”.

ํ•™์Šต ์—ฌ์ •:

  • Puzzle 30์—์„œ๋Š” NSight ํ”„๋กœํŒŒ์ผ๋ง(nsys์™€ ncu)์„ ํ†ตํ•ด ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ์ง„๋‹จํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  • Puzzle 31์—์„œ๋Š” ๋ฆฌ์†Œ์Šค ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ์˜ˆ์ธกํ•˜๊ณ  ์ œ์–ดํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค
  • ๋‘˜์„ ํ•ฉ์น˜๋ฉด GPU ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์™„์ „ํ•œ ๋„๊ตฌ ์„ธํŠธ๋ฅผ ๊ฐ–์ถ”๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋  ๊ฒƒ: GPU ์„ฑ๋Šฅ์€ ๋‹จ์ˆœํžˆ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํšจ์œจ์˜ ๋ฌธ์ œ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค - ์ฝ”๋“œ๊ฐ€ ํ•œ์ •๋œ ํ•˜๋“œ์›จ์–ด ๋ฆฌ์†Œ์Šค๋ฅผ ์–ด๋–ป๊ฒŒ ํ™œ์šฉํ•˜๋А๋ƒ๊ฐ€ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ๋ชจ๋“  GPU๋Š” ์œ ํ•œํ•œ ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์‹คํ–‰ ์œ ๋‹›์„ ๊ฐ–๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์ ์œ ์œจ(occupancy) - SM๋‹น ํ™œ์„ฑ ์›Œํ”„ ์ˆ˜ ๋Œ€๋น„ ์ตœ๋Œ€ ๊ฐ€๋Šฅ ์›Œํ”„ ์ˆ˜์˜ ๋น„์œจ - ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์œ ๋กœ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€๊ธฐ ์‹œ๊ฐ„ ๋™์•ˆ GPU๊ฐ€ ์œ ํœด ์ƒํƒœ์— ๋น ์ง€์ง€ ์•Š๋„๋ก ์œ ์ง€
  • ๋ฆฌ์†Œ์Šค ํ• ๋‹น: ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๊ฐ„์˜ ๊ท ํ˜• ์กฐ์ ˆ
  • ์„ฑ๋Šฅ ์˜ˆ์ธก: ๋ณ‘๋ชฉ์ด ๋ฐœ์ƒํ•˜๊ธฐ ์ „์— ๋ฏธ๋ฆฌ ํŒŒ์•…
  • ์ตœ์ ํ™” ์ „๋žต: ์ ์œ ์œจ์— ์ง‘์ค‘ํ•ด์•ผ ํ•  ๋•Œ์™€ ๋‹ค๋ฅธ ์š”์†Œ์— ์ง‘์ค‘ํ•ด์•ผ ํ•  ๋•Œ ํŒ๋‹จ

GPU๋ฅผ ๋„˜์–ด์„œ ์ ์šฉ๋˜๋Š” ์›๋ฆฌ: ์—ฌ๊ธฐ์„œ ๋ฐฐ์šฐ๋Š” ์›๋ฆฌ๋Š” ๋ฆฌ์†Œ์Šค๋ฅผ ์—ฌ๋Ÿฌ ์‹คํ–‰ ์œ ๋‹›์ด ๊ณต์œ ํ•˜๋Š” ๋ชจ๋“  ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค - ํ•˜์ดํผ์Šค๋ ˆ๋”ฉ์„ ์‚ฌ์šฉํ•˜๋Š” CPU๋ถ€ํ„ฐ ๋ถ„์‚ฐ ์ปดํ“จํŒ… ํด๋Ÿฌ์Šคํ„ฐ๊นŒ์ง€.

๊ฐœ์š”

GPU ์ ์œ ์œจ์€ SM๋‹น ํ™œ์„ฑ ์›Œํ”„ ์ˆ˜ ๋Œ€๋น„ ์ตœ๋Œ€ ๊ฐ€๋Šฅ ์›Œํ”„ ์ˆ˜์˜ ๋น„์œจ์ž…๋‹ˆ๋‹ค. GPU๊ฐ€ ์›Œํ”„ ์ „ํ™˜์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์–ผ๋งˆ๋‚˜ ํšจ๊ณผ์ ์œผ๋กœ ์ˆจ๊ธธ ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

SAXPY๋Š” Single-precision Alpha times X plus Y์˜ ์•ฝ์ž์ž…๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ๋Š” ์ˆ˜ํ•™์ ์œผ๋กœ ๋™์ผํ•˜์ง€๋งŒ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๋‹ค๋ฅธ ์„ธ ๊ฐ€์ง€ SAXPY ์ปค๋„(y[i] = alpha * x[i] + y[i])์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค:

comptime SIZE = 32 * 1024 * 1024  # 32M elements - larger workload to show occupancy effects
comptime THREADS_PER_BLOCK = (1024, 1)
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)
comptime ALPHA = Float32(2.5)  # SAXPY coefficient


fn minimal_kernel[
    layout: Layout
](
    y: LayoutTensor[dtype, layout, MutAnyOrigin],
    x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    alpha: Float32,
    size: Int,
):
    """Minimal SAXPY kernel - simple and register-light for high occupancy."""
    i = Int(block_dim.x * block_idx.x + thread_idx.x)
    if i < size:
        # Direct computation: y[i] = alpha * x[i] + y[i]
        # Uses minimal registers (~8), no shared memory
        y[i] = alpha * x[i] + y[i]


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p31/p31.mojo

fn sophisticated_kernel[
    layout: Layout
](
    y: LayoutTensor[dtype, layout, MutAnyOrigin],
    x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    alpha: Float32,
    size: Int,
):
    """Sophisticated SAXPY kernel - over-engineered with excessive resource usage.
    """
    # Maximum shared memory allocation (close to 48KB limit)
    shared_cache = LayoutTensor[
        dtype,
        Layout.row_major(1024 * 12),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()  # 48KB

    i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    if i < size:
        # REAL computational work that can't be optimized away - affects final result
        base_x = x[i]
        base_y = y[i]

        # Simulate "precision enhancement" - multiple small adjustments that add up
        # Each computation affects the final result so compiler can't eliminate them
        # But artificially increases register pressure
        precision_x1 = base_x * 1.0001
        precision_x2 = precision_x1 * 0.9999
        precision_x3 = precision_x2 * 1.000001
        precision_x4 = precision_x3 * 0.999999

        precision_y1 = base_y * 1.000005
        precision_y2 = precision_y1 * 0.999995
        precision_y3 = precision_y2 * 1.0000001
        precision_y4 = precision_y3 * 0.9999999

        # Multiple alpha computations for "stability" - should equal alpha
        alpha1 = alpha * 1.00001 * 0.99999
        alpha2 = alpha1 * 1.000001 * 0.999999
        alpha3 = alpha2 * 1.0000001 * 0.9999999
        alpha4 = alpha3 * 1.00000001 * 0.99999999

        # Complex polynomial "optimization" - creates register pressure
        x_power2 = precision_x4 * precision_x4
        x_power3 = x_power2 * precision_x4
        x_power4 = x_power3 * precision_x4
        x_power5 = x_power4 * precision_x4
        x_power6 = x_power5 * precision_x4
        x_power7 = x_power6 * precision_x4
        x_power8 = x_power7 * precision_x4

        # "Advanced" mathematical series that contributes tiny amount to result
        series_term1 = x_power2 * 0.0000001  # x^2/10M
        series_term2 = x_power4 * 0.00000001  # x^4/100M
        series_term3 = x_power6 * 0.000000001  # x^6/1B
        series_term4 = x_power8 * 0.0000000001  # x^8/10B
        series_correction = (
            series_term1 - series_term2 + series_term3 - series_term4
        )

        # Over-engineered shared memory usage with multiple caching strategies
        if local_i < 1024:
            shared_cache[local_i] = precision_x4
            shared_cache[local_i + 1024] = precision_y4
            shared_cache[local_i + 2048] = alpha4
            shared_cache[local_i + 3072] = series_correction
        barrier()

        # Load from shared memory for "optimization"
        cached_x = shared_cache[local_i] if local_i < 1024 else precision_x4
        cached_y = (
            shared_cache[local_i + 1024] if local_i < 1024 else precision_y4
        )
        cached_alpha = (
            shared_cache[local_i + 2048] if local_i < 1024 else alpha4
        )
        cached_correction = (
            shared_cache[local_i + 3072] if local_i < 1024
            else series_correction
        )

        # Final "high precision" computation - all work contributes to result
        high_precision_result = (
            cached_alpha * cached_x + cached_y + cached_correction
        )

        # Over-engineered result with massive resource usage but mathematically ~= alpha*x + y
        y[i] = high_precision_result


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p31/p31.mojo

fn balanced_kernel[
    layout: Layout
](
    y: LayoutTensor[dtype, layout, MutAnyOrigin],
    x: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    alpha: Float32,
    size: Int,
):
    """Balanced SAXPY kernel - efficient optimization with moderate resources.
    """
    # Reasonable shared memory usage for effective caching (16KB)
    shared_cache = LayoutTensor[
        dtype,
        Layout.row_major(1024 * 4),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()  # 16KB total

    i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    if i < size:
        # Moderate computational work that contributes to result
        base_x = x[i]
        base_y = y[i]

        # Light precision enhancement - less than sophisticated kernel
        enhanced_x = base_x * 1.00001 * 0.99999
        enhanced_y = base_y * 1.00001 * 0.99999
        stable_alpha = alpha * 1.000001 * 0.999999

        # Moderate computational optimization
        x_squared = enhanced_x * enhanced_x
        optimization_hint = x_squared * 0.000001

        # Efficient shared memory caching - only what we actually need
        if local_i < 1024:
            shared_cache[local_i] = enhanced_x
            shared_cache[local_i + 1024] = enhanced_y
        barrier()

        # Use cached values efficiently
        cached_x = shared_cache[local_i] if local_i < 1024 else enhanced_x
        cached_y = (
            shared_cache[local_i + 1024] if local_i < 1024 else enhanced_y
        )

        # Balanced computation - moderate work, good efficiency
        result = stable_alpha * cached_x + cached_y + optimization_hint

        # Balanced result with moderate resource usage (~15 registers, 16KB shared)
        y[i] = result


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p31/p31.mojo

๋„์ „ ๊ณผ์ œ

ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์„ธ ์ปค๋„์„ ์กฐ์‚ฌํ•˜๊ณ , ์ ์œ ์œจ ์ตœ์ ํ™”์— ๋Œ€ํ•œ ๋ถ„์„ ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”. ์ปค๋„๋“ค์€ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•˜์ง€๋งŒ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๊ทน์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค - ์„ฑ๋Šฅ๊ณผ ์ ์œ ์œจ์ด ์™œ ์ง๊ด€์— ์–ด๊ธ‹๋‚˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋™์ž‘ํ•˜๋Š”์ง€ ๋ฐœ๊ฒฌํ•˜๋Š” ๊ฒƒ์ด ์—ฌ๋Ÿฌ๋ถ„์˜ ์ž„๋ฌด์ž…๋‹ˆ๋‹ค!

์ด ํผ์ฆ์— ํ‘œ์‹œ๋œ ๊ตฌ์ฒด์ ์ธ ์ˆ˜์น˜ ๊ฒฐ๊ณผ๋Š” NVIDIA A10G (Ampere 8.6) ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” GPU ์ œ์กฐ์‚ฌ์™€ ์•„ํ‚คํ…์ฒ˜(NVIDIA: Pascal/Turing/Ampere/Ada/Hopper, AMD: RDNA/GCN, Apple: M1/M2/M3/M4/M5)์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€์ง€๋งŒ, ๊ธฐ๋ณธ ๊ฐœ๋…, ๋ฐฉ๋ฒ•๋ก , ํ†ต์ฐฐ์€ ๋ชจ๋“  ์ตœ์‹  GPU์— ๋ณดํŽธ์ ์œผ๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. pixi run gpu-specs๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ํ•˜๋“œ์›จ์–ด๋ณ„ ์ˆ˜์น˜๋ฅผ ํ™•์ธํ•˜์„ธ์š”.

๊ตฌ์„ฑ

์š”๊ตฌ ์‚ฌํ•ญ:

  • CUDA ํˆดํ‚ท์ด ์„ค์น˜๋œ NVIDIA GPU
  • Puzzle 30์˜ NSight Compute

โš ๏ธ GPU ํ˜ธํ™˜์„ฑ ์ฐธ๊ณ : ๊ธฐ๋ณธ ์„ค์ •์€ ๊ณต๊ฒฉ์ ์ธ ๊ฐ’์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ๊ตฌํ˜•์ด๋‚˜ ์ €์‚ฌ์–‘ GPU์—์„œ๋Š” ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

comptime SIZE = 32 * 1024 * 1024  # 32M ์š”์†Œ (๋ฐฐ์—ด๋‹น ~256MB ๋ฉ”๋ชจ๋ฆฌ)
comptime THREADS_PER_BLOCK = (1024, 1)  # ๋ธ”๋ก๋‹น 1024 ์Šค๋ ˆ๋“œ
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)  # 32768 ๋ธ”๋ก

์‹คํ–‰ ์‹คํŒจ ์‹œ problems/p31/p31.mojo์—์„œ ๋‹ค์Œ ๊ฐ’์„ ์ค„์ด์„ธ์š”:

  • ๊ตฌํ˜• GPU (Compute Capability < 3.0): THREADS_PER_BLOCK = (512, 1), SIZE = 16 * 1024 * 1024 ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ GPU (< 2GB): SIZE = 8 * 1024 * 1024 ๋˜๋Š” SIZE = 4 * 1024 * 1024 ์‚ฌ์šฉ
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์› ์ œํ•œ: BLOCKS_PER_GRID๋Š” SIZE์— ๋งž์ถฐ ์ž๋™ ์กฐ์ •๋ฉ๋‹ˆ๋‹ค

์ ์œ ์œจ ๊ณต์‹:

์ด๋ก ์  ์ ์œ ์œจ = min(
    SM๋‹น ๋ ˆ์ง€์Šคํ„ฐ ์ˆ˜ / (์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ ์ˆ˜ ร— ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜),
    SM๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ / ๋ธ”๋ก๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ,
    SM๋‹น ์ตœ๋Œ€ ๋ธ”๋ก ์ˆ˜
) ร— ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ / SM๋‹น ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ์ˆ˜

์กฐ์‚ฌ ๊ณผ์ •

Step 1: ์ปค๋„ ํ…Œ์ŠคํŠธ

pixi shell -e nvidia
mojo problems/p31/p31.mojo --all

์„ธ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฏธ์Šคํ„ฐ๋ฆฌ: ์™œ ์„ฑ๋Šฅ์€ ๋‹ค๋ฅผ๊นŒ์š”?

Step 2: ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํฌ

mojo problems/p31/p31.mojo --benchmark

์„ธ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฏธ์Šคํ„ฐ๋ฆฌ: ์™œ ์„ฑ๋Šฅ์€ ๋‹ค๋ฅผ๊นŒ์š”?

Step 3: ํ”„๋กœํŒŒ์ผ๋ง์šฉ ๋นŒ๋“œ

mojo build --debug-level=full problems/p31/p31.mojo -o problems/p31/p31_profiler

Step 4: ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋ง

# ๊ฐ ์ปค๋„์˜ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋ง
ncu --set=@occupancy --section=LaunchStats problems/p31/p31_profiler --minimal
ncu --set=@occupancy --section=LaunchStats problems/p31/p31_profiler --sophisticated
ncu --set=@occupancy --section=LaunchStats problems/p31/p31_profiler --balanced

์ ์œ ์œจ ๋ถ„์„์„ ์œ„ํ•ด ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰์„ ๊ธฐ๋กํ•˜์„ธ์š”.

Step 5: ์ด๋ก ์  ์ ์œ ์œจ ๊ณ„์‚ฐ

๋จผ์ € GPU ์•„ํ‚คํ…์ฒ˜์™€ ์„ธ๋ถ€ ์ŠคํŽ™์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

pixi run gpu-specs

์ฐธ๊ณ : gpu-specs๋Š” GPU ์ œ์กฐ์‚ฌ(NVIDIA/AMD/Apple)๋ฅผ ์ž๋™ ๊ฐ์ง€ํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด์—์„œ ํŒŒ์ƒ๋œ ๋ชจ๋“  ์•„ํ‚คํ…์ฒ˜ ์„ธ๋ถ€ ์ •๋ณด๋ฅผ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค - ๋ณ„๋„์˜ ์ฐธ์กฐํ‘œ๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค!

์ฃผ์š” ์•„ํ‚คํ…์ฒ˜ ์ŠคํŽ™ (์ฐธ๊ณ ์šฉ):

์•„ํ‚คํ…์ฒ˜Compute Cap๋ ˆ์ง€์Šคํ„ฐ/SM๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ/SM์ตœ๋Œ€ ์Šค๋ ˆ๋“œ/SM์ตœ๋Œ€ ๋ธ”๋ก/SM
Hopper (H100)9.065,536228KB2,04832
Ada (RTX 40xx)8.965,536128KB2,04832
Ampere (RTX 30xx, A100, A10G)8.0, 8.665,536164KB2,04832
Turing (RTX 20xx)7.565,53696KB1,02416
Pascal (GTX 10xx)6.165,53696KB2,04832

๐Ÿ“š ๊ณต์‹ ๋ฌธ์„œ:

โš ๏ธ ์ฐธ๊ณ : ์ด ๊ฐ’๋“ค์€ ์ด๋ก ์  ์ตœ๋Œ€์น˜์ž…๋‹ˆ๋‹ค. ์‹ค์ œ ์ ์œ ์œจ์€ ํ•˜๋“œ์›จ์–ด ์Šค์ผ€์ค„๋ง ์ œ์•ฝ, ๋“œ๋ผ์ด๋ฒ„ ์˜ค๋ฒ„ํ—ค๋“œ ๋“ฑ์˜ ์š”์ธ์œผ๋กœ ๋” ๋‚ฎ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ์ŠคํŽ™๊ณผ ์ ์œ ์œจ ๊ณต์‹์„ ์‚ฌ์šฉํ•˜์—ฌ:

  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: 1024 (์ปค๋„ ์„ค์ •๊ฐ’)

์ ์œ ์œจ ๊ณต์‹๊ณผ ํ•˜๋“œ์›จ์–ด ์ŠคํŽ™์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ปค๋„์˜ ์ด๋ก ์  ์ ์œ ์œจ์„ ์˜ˆ์ธกํ•˜์„ธ์š”.

Step 6: ์‹ค์ œ ์ ์œ ์œจ ์ธก์ •

# ๊ฐ ์ปค๋„์˜ ์‹ค์ œ ์ ์œ ์œจ ์ธก์ •
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active problems/p31/p31_profiler --minimal
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active problems/p31/p31_profiler --sophisticated
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active problems/p31/p31_profiler --balanced

์ด๋ก ์  ๊ณ„์‚ฐ๊ณผ ์‹ค์ œ ์ธก์ •๋œ ์ ์œ ์œจ์„ ๋น„๊ตํ•˜์„ธ์š” - ๋ฏธ์Šคํ„ฐ๋ฆฌ๊ฐ€ ๋“œ๋Ÿฌ๋‚˜๋Š” ์ˆœ๊ฐ„์ž…๋‹ˆ๋‹ค!

ํ•ต์‹ฌ ํ†ต์ฐฐ

๐Ÿ’ก ์ ์œ ์œจ ์ž„๊ณ„๊ฐ’: ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ์— ์ถฉ๋ถ„ํ•œ ์ ์œ ์œจ(~25-50%)์„ ํ™•๋ณดํ•˜๋ฉด, ๊ทธ ์ด์ƒ์˜ ์ ์œ ์œจ์€ ์ˆ˜ํ™• ์ฒด๊ฐ ํšจ๊ณผ๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ vs ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ: SAXPY๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž…๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ ์œ ์œจ๋ณด๋‹ค ๋” ์ค‘์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก ๋ฆฌ์†Œ์Šค ํšจ์œจ: ์ตœ์‹  GPU๋Š” ์ ๋‹นํ•œ ์ˆ˜์ค€์˜ ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ•(์Šค๋ ˆ๋“œ๋‹น 20-40๊ฐœ)์„ ์ ์œ ์œจ์˜ ๊ทน์ ์ธ ๊ฐ์†Œ ์—†์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: ๋‹ค์Œ ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”

์œ„์˜ ์กฐ์‚ฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ•œ ํ›„, ๋‹ค์Œ ๋ถ„์„ ์งˆ๋ฌธ์— ๋‹ตํ•˜์—ฌ ์ ์œ ์œจ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ณด์„ธ์š”:

์„ฑ๋Šฅ ๋ถ„์„ (Step 2):

  1. ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋น ๋ฅด๊ณ , ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋А๋ฆฐ๊ฐ€์š”? ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด๋ฅผ ๊ธฐ๋กํ•˜์„ธ์š”.

๋ฆฌ์†Œ์Šค ํ”„๋กœํŒŒ์ผ๋ง (Step 4):

  1. ๊ฐ ์ปค๋„์˜ ์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ ์ˆ˜, ๋ธ”๋ก๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, SM๋‹น ์›Œํ”„ ์ˆ˜๋ฅผ ๊ธฐ๋กํ•˜์„ธ์š”.

์ด๋ก ์  ๊ณ„์‚ฐ (Step 5):

  1. GPU ์ŠคํŽ™๊ณผ ์ ์œ ์œจ ๊ณต์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ปค๋„์˜ ์ด๋ก ์  ์ ์œ ์œจ์„ ๊ณ„์‚ฐํ•˜์„ธ์š”. ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋†’๊ณ /๋‚ฎ์•„์•ผ ํ•˜๋‚˜์š”?

์ธก์ •๋œ ์ ์œ ์œจ (Step 6):

  1. ์ธก์ •๋œ ์ ์œ ์œจ ๊ฐ’์ด ๊ณ„์‚ฐ ๊ฒฐ๊ณผ์™€ ์–ด๋–ป๊ฒŒ ๋น„๊ต๋˜๋‚˜์š”?

์ ์œ ์œจ ๋ฏธ์Šคํ„ฐ๋ฆฌ:

  1. ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๊ทน์ ์œผ๋กœ ๋‹ค๋ฅธ๋ฐ๋„ ์„ธ ์ปค๋„ ๋ชจ๋‘ ๋น„์Šทํ•œ ์ ์œ ์œจ(~64-66%, GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ)๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  2. ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๊ทน์ ์œผ๋กœ ์ฐจ์ด๋‚˜๋Š”๋ฐ(19 vs 40 ๋ ˆ์ง€์Šคํ„ฐ, 0KB vs 49KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ) ์„ฑ๋Šฅ์ด ๊ฑฐ์˜ ๋™์ผํ•œ(<2% ์ฐจ์ด) ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  3. ์ด๋ก ์  ์ ์œ ์œจ ๊ณ„์‚ฐ๊ณผ ์‹ค์ œ GPU ๋™์ž‘ ์‚ฌ์ด์˜ ๊ด€๊ณ„์— ๋Œ€ํ•ด ๋ฌด์—‡์„ ์•Œ ์ˆ˜ ์žˆ๋‚˜์š”?
  4. ์ด SAXPY ์›Œํฌ๋กœ๋“œ์˜ ์‹ค์ œ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์ด ์ ์œ ์œจ์ด ์•„๋‹ˆ๋ผ๋ฉด ๋ฌด์—‡์ธ๊ฐ€์š”?

팁

ํƒ์ • ๋„๊ตฌ ๋ชจ์Œ:

  • NSight Compute (ncu) - ์ ์œ ์œจ๊ณผ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ์ธก์ •
  • GPU ์•„ํ‚คํ…์ฒ˜ ์ŠคํŽ™ - pixi run gpu-specs๋ฅผ ์‚ฌ์šฉํ•œ ์ด๋ก ์  ํ•œ๊ณ„ ๊ณ„์‚ฐ
  • ์ ์œ ์œจ ๊ณต์‹ - ๋ฆฌ์†Œ์Šค ๋ณ‘๋ชฉ ์˜ˆ์ธก
  • ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํฌ - ์ด๋ก ์  ๋ถ„์„ ๊ฒ€์ฆ

ํ•ต์‹ฌ ์ตœ์ ํ™” ์›์น™:

  • ์ตœ์ ํ™” ์ „์— ๊ณ„์‚ฐํ•˜๊ธฐ: ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์ „์— ์ ์œ ์œจ ๊ณต์‹์œผ๋กœ ๋ฆฌ์†Œ์Šค ํ•œ๊ณ„๋ฅผ ์˜ˆ์ธก
  • ์ธก์ •์œผ๋กœ ๊ฒ€์ฆํ•˜๊ธฐ: ์ด๋ก ์  ๊ณ„์‚ฐ์€ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”์™€ ํ•˜๋“œ์›จ์–ด ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•จ
  • ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ ๊ณ ๋ คํ•˜๊ธฐ: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ๋Š” ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๋ณด๋‹ค ์ ์œ ์œจ์ด ๋œ ํ•„์š”
  • ์ตœ๋Œ€ ์ ์œ ์œจ์„ ๋ชฉํ‘œ๋กœ ํ•˜์ง€ ์•Š๊ธฐ: ์ถฉ๋ถ„ํ•œ ์ ์œ ์œจ + ๋‹ค๋ฅธ ์„ฑ๋Šฅ ์š”์†Œ๋ฅผ ์ตœ์ ํ™”
  • ์ž„๊ณ„๊ฐ’ ๊ด€์ ์œผ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ: 25-50% ์ ์œ ์œจ์ด๋ฉด ๋Œ€๋ถ€๋ถ„ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ์— ์ถฉ๋ถ„
  • ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋งํ•˜๊ธฐ: NSight Compute๋กœ ์‹ค์ œ ๋ ˆ์ง€์Šคํ„ฐ์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋น„๋Ÿ‰ ํŒŒ์•…

์กฐ์‚ฌ ์ ‘๊ทผ๋ฒ•:

  1. ๋ฒค์น˜๋งˆํ‚น๋ถ€ํ„ฐ ์‹œ์ž‘ - ๋จผ์ € ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ํ™•์ธ
  2. NSight Compute๋กœ ํ”„๋กœํŒŒ์ผ๋ง - ์‹ค์ œ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ ์œ ์œจ ๋ฐ์ดํ„ฐ ํ™•๋ณด
  3. ์ด๋ก ์  ์ ์œ ์œจ ๊ณ„์‚ฐ - GPU ์ŠคํŽ™๊ณผ ์ ์œ ์œจ ๊ณต์‹ ํ™œ์šฉ
  4. ์ด๋ก ๊ณผ ํ˜„์‹ค ๋น„๊ต - ๋ฏธ์Šคํ„ฐ๋ฆฌ๊ฐ€ ๋“œ๋Ÿฌ๋‚˜๋Š” ์ˆœ๊ฐ„!
  5. ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ ๊ณ ์ฐฐ - ์ด๋ก ๊ณผ ์‹ค์ œ๊ฐ€ ์™œ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋Š”์ง€ ์ƒ๊ฐํ•ด๋ณด๊ธฐ

์†”๋ฃจ์…˜

์‹ฌ์ธต ํ•ด์„ค์ด ํฌํ•จ๋œ ์™„์ „ํ•œ ํ’€์ด

์ด ์ ์œ ์œจ ํƒ์ • ์‚ฌ๊ฑด์€ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด GPU ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๊ณ , ์ด๋ก ์  ์ ์œ ์œจ๊ณผ ์‹ค์ œ ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ๋ณต์žกํ•œ ๊ด€๊ณ„๋ฅผ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค.

์•„๋ž˜ ๊ตฌ์ฒด์ ์ธ ๊ณ„์‚ฐ์€ NVIDIA A10G (Ampere 8.6) - ํ…Œ์ŠคํŠธ์— ์‚ฌ์šฉ๋œ GPU - ๊ธฐ์ค€์ž…๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€์ง€๋งŒ, ๋ฐฉ๋ฒ•๋ก ๊ณผ ํ†ต์ฐฐ์€ ๋ณดํŽธ์ ์œผ๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. pixi run gpu-specs๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ํ•˜๋“œ์›จ์–ด๋ณ„ ์ˆ˜์น˜๋ฅผ ํ™•์ธํ•˜์„ธ์š”.

๋ฆฌ์†Œ์Šค ๋ถ„์„์„ ํ†ตํ•œ ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ

NSight Compute ๋ฆฌ์†Œ์Šค ๋ถ„์„:

์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ (NVIDIA A10G - GPU์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ):

  • Minimal: 19 ๋ ˆ์ง€์Šคํ„ฐ, ~0KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ์ ์œ ์œจ 63.87%, 327.7ms
  • Balanced: 25 ๋ ˆ์ง€์Šคํ„ฐ, 16.4KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ์ ์œ ์œจ 65.44%, 329.4ms
  • Sophisticated: 40 ๋ ˆ์ง€์Šคํ„ฐ, 49.2KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ์ ์œ ์œจ 65.61%, 330.9ms

๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ ๊ทผ๊ฑฐ:

  • ์„ธ ์ปค๋„ ๋ชจ๋‘ ๊ฑฐ์˜ ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„ (~327-331ms, <2% ์ฐจ์ด)
  • ๋ฆฌ์†Œ์Šค ์ฐจ์ด๊ฐ€ ํฌ์ง€๋งŒ ๋ชจ๋‘ ๋น„์Šทํ•œ ์ ์œ ์œจ์„ ๋‹ฌ์„ฑ (~64-66%)
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ œํ•œ ์š”์ธ์œผ๋กœ ์ž‘์šฉ

์ ์œ ์œจ ๊ณ„์‚ฐ์˜ ์‹ค์ฒด

์ด๋ก ์  ์ ์œ ์œจ ๋ถ„์„ (NVIDIA A10G, Ampere 8.6):

GPU ์ŠคํŽ™ (pixi run gpu-specs ์ถœ๋ ฅ):

  • SM๋‹น ๋ ˆ์ง€์Šคํ„ฐ: 65,536
  • SM๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: 164KB (์•„ํ‚คํ…์ฒ˜ ์ตœ๋Œ€์น˜)
  • SM๋‹น ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ: 1,536 (A10G ํ•˜๋“œ์›จ์–ด ์ œํ•œ)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ: 1,024 (์ปค๋„ ์„ค์ •๊ฐ’)
  • SM๋‹น ์ตœ๋Œ€ ๋ธ”๋ก: 32

Minimal ์ปค๋„ ๊ณ„์‚ฐ:

๋ ˆ์ง€์Šคํ„ฐ ์ œํ•œ = 65,536 / (19 ร— 1,024) = 3.36 ๋ธ”๋ก/SM
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ = 164KB / 0KB = โˆž ๋ธ”๋ก/SM
ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก ์ œํ•œ = 32 ๋ธ”๋ก/SM

์Šค๋ ˆ๋“œ ์ œํ•œ = 1,536 / 1,024 = 1 ๋ธ”๋ก/SM (๋‚ด๋ฆผ)
์‹ค์ œ ๋ธ”๋ก = min(3, โˆž, 1) = 1 ๋ธ”๋ก/SM
์ด๋ก ์  ์ ์œ ์œจ = (1 ร— 1,024) / 1,536 = 66.7%

Balanced ์ปค๋„ ๊ณ„์‚ฐ:

๋ ˆ์ง€์Šคํ„ฐ ์ œํ•œ = 65,536 / (25 ร— 1,024) = 2.56 ๋ธ”๋ก/SM
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ = 164KB / 16.4KB = 10 ๋ธ”๋ก/SM
ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก ์ œํ•œ = 32 ๋ธ”๋ก/SM

์Šค๋ ˆ๋“œ ์ œํ•œ = 1,536 / 1,024 = 1 ๋ธ”๋ก/SM (๋‚ด๋ฆผ)
์‹ค์ œ ๋ธ”๋ก = min(2, 10, 1) = 1 ๋ธ”๋ก/SM
์ด๋ก ์  ์ ์œ ์œจ = (1 ร— 1,024) / 1,536 = 66.7%

Sophisticated ์ปค๋„ ๊ณ„์‚ฐ:

레지스터 제한 = 65,536 / (40 × 1,024) = 1.6 블록/SM
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ = 164KB / 49.2KB = 3.33 ๋ธ”๋ก/SM
ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก ์ œํ•œ = 32 ๋ธ”๋ก/SM

์Šค๋ ˆ๋“œ ์ œํ•œ = 1,536 / 1,024 = 1 ๋ธ”๋ก/SM (๋‚ด๋ฆผ)
์‹ค์ œ ๋ธ”๋ก = min(1, 3, 1) = 1 ๋ธ”๋ก/SM
์ด๋ก ์  ์ ์œ ์œจ = (1 ร— 1,024) / 1,536 = 66.7%

ํ•ต์‹ฌ ๋ฐœ๊ฒฌ: ์ด๋ก ๊ณผ ํ˜„์‹ค์ด ์ผ์น˜ํ•œ๋‹ค!

  • ์ด๋ก ์ : ๋ชจ๋“  ์ปค๋„ ~66.7% (A10G์˜ ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰์— ์˜ํ•ด ์ œํ•œ)
  • ์‹ค์ธก: ๋ชจ๋‘ ~64-66% (๋งค์šฐ ๊ทผ์ ‘ํ•œ ๊ฒฐ๊ณผ!)

์ด๋Š” A10G์˜ ์Šค๋ ˆ๋“œ ์ œํ•œ์ด ์ง€๋ฐฐ์ ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - SM๋‹น ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ๊ฐ€ 1,536๊ฐœ์ด๋ฏ€๋กœ 1,024 ์Šค๋ ˆ๋“œ ๋ธ”๋ก์€ 1๊ฐœ๋งŒ ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค. ์ด๋ก (66.7%)๊ณผ ์‹ค์ธก(~65%) ์‚ฌ์ด์˜ ์ž‘์€ ์ฐจ์ด๋Š” ํ•˜๋“œ์›จ์–ด ์Šค์ผ€์ค„๋ง ์˜ค๋ฒ„ํ—ค๋“œ์™€ ๋“œ๋ผ์ด๋ฒ„ ์ œ์•ฝ์—์„œ ๋น„๋กฏ๋ฉ๋‹ˆ๋‹ค.

์ด๋ก ๊ณผ ํ˜„์‹ค์ด ๊ทผ์ ‘ํ•œ ์ด์œ 

์ด๋ก ์ (66.7%)๊ณผ ์‹ค์ธก(~65%) ์ ์œ ์œจ ์‚ฌ์ด ์ž‘์€ ์ฐจ์ด์˜ ์›์ธ:

  1. ํ•˜๋“œ์›จ์–ด ์Šค์ผ€์ค„๋ง ์˜ค๋ฒ„ํ—ค๋“œ: ์‹ค์ œ ์›Œํ”„ ์Šค์ผ€์ค„๋Ÿฌ๋Š” ์ด๋ก ์  ๊ณ„์‚ฐ์„ ๋„˜์–ด์„œ๋Š” ์‹ค์งˆ์  ์ œ์•ฝ์ด ์žˆ์Œ
  2. CUDA ๋Ÿฐํƒ€์ž„ ์˜ˆ์•ฝ: ๋“œ๋ผ์ด๋ฒ„์™€ ๋Ÿฐํƒ€์ž„ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๊ฐ€์šฉ SM ๋ฆฌ์†Œ์Šค๋ฅผ ์•ฝ๊ฐ„ ์ค„์ž„
  3. ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ์••๋ฐ•: A10G์˜ ๋ฉ”๋ชจ๋ฆฌ ์„œ๋ธŒ์‹œ์Šคํ…œ์ด ์•ฝ๊ฐ„์˜ ์Šค์ผ€์ค„๋ง ์ œ์•ฝ์„ ๋งŒ๋“ฆ
  4. ์ „๋ ฅ ๋ฐ ์—ด ๊ด€๋ฆฌ: ๋™์  ์ฃผํŒŒ์ˆ˜ ์กฐ์ ˆ์ด ์ตœ๋Œ€ ์„ฑ๋Šฅ์— ์˜ํ–ฅ
  5. ๋ช…๋ น์–ด ์บ์‹œ ํšจ๊ณผ: ์‹ค์ œ ์ปค๋„์€ ์ ์œ ์œจ ๊ณ„์‚ฐ์— ํฌ์ฐฉ๋˜์ง€ ์•Š๋Š” ๋ช…๋ น์–ด ํŽ˜์น˜ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์žˆ์Œ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋ก ๊ณผ ์‹ค์ธก์ด ๊ทผ์ ‘ํ•˜๋‹ค๋Š” ๊ฒƒ(66.7% vs ~65%)์€ ๋ ˆ์ง€์Šคํ„ฐ์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฐจ์ด์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ A10G์˜ ์Šค๋ ˆ๋“œ ์ œํ•œ์ด ์„ธ ์ปค๋„ ๋ชจ๋‘๋ฅผ ์ง€๋ฐฐํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. ์ง„์งœ ๋ณ‘๋ชฉ์„ ์ •ํ™•ํžˆ ์งš์–ด๋‚ธ ์ข‹์€ ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค!

์ ์œ ์œจ ๋ฏธ์Šคํ„ฐ๋ฆฌ ํ•ด์„ค

๋ฏธ์Šคํ„ฐ๋ฆฌ์˜ ์ง„์งœ ์ •์ฒด:

  • ๋ฆฌ์†Œ์Šค ์ฐจ์ด๊ฐ€ ๊ทน์ ์ธ๋ฐ๋„ ์„ธ ์ปค๋„ ๋ชจ๋‘ ๊ฑฐ์˜ ๋™์ผํ•œ ์ ์œ ์œจ์„ ๋‹ฌ์„ฑ (~64-66%)
  • ์„ฑ๋Šฅ์ด ๋ณธ์งˆ์ ์œผ๋กœ ๋™์ผ (์„ธ ์ปค๋„ ๋ชจ๋‘ <2% ๋ณ€๋™)
  • ์ด๋ก ์ด ์ ์œ ์œจ์„ ์ •ํ™•ํžˆ ์˜ˆ์ธก (66.7% ์ด๋ก  โ‰ˆ 65% ์‹ค์ธก)
  • ๋ฏธ์Šคํ„ฐ๋ฆฌ๋Š” ์ ์œ ์œจ ๋ถˆ์ผ์น˜๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค - ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ํฌ๊ฒŒ ๋‹ค๋ฅธ๋ฐ๋„ ์™œ ์ ์œ ์œจ๊ณผ ์„ฑ๋Šฅ์ด ๋™์ผํ•œ์ง€๊ฐ€ ์ง„์งœ ๋ฏธ์Šคํ„ฐ๋ฆฌ์ž…๋‹ˆ๋‹ค!

๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๋‹ค๋ฅธ๋ฐ ์„ฑ๋Šฅ์ด ๋™์ผํ•œ ์ด์œ :

SAXPY ์›Œํฌ๋กœ๋“œ์˜ ํŠน์„ฑ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ: ๊ฐ ์Šค๋ ˆ๋“œ์˜ ์—ฐ์‚ฐ๋Ÿ‰์ด ๊ทนํžˆ ์ ์Œ (y[i] = alpha * x[i] + y[i])
  • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ: ์Šค๋ ˆ๋“œ๋‹น 2๊ฐœ ๊ฐ’ ์ฝ๊ธฐ, 1๊ฐœ ๊ฐ’ ์“ฐ๊ธฐ
  • ๋‚ฎ์€ ์‚ฐ์ˆ  ๊ฐ•๋„: 12๋ฐ”์ดํŠธ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ๋‹น 2 FLOPS๋งŒ ์ˆ˜ํ–‰

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ๋ถ„์„ (A10G):

๋‹จ์ผ ์ปค๋„ ํŒจ์Šค ๋ถ„์„:
- ์ž…๋ ฅ ๋ฐฐ์—ด: 32M ร— 4๋ฐ”์ดํŠธ ร— 2 ๋ฐฐ์—ด = 256MB ์ฝ๊ธฐ
- ์ถœ๋ ฅ ๋ฐฐ์—ด: 32M ร— 4๋ฐ”์ดํŠธ ร— 1 ๋ฐฐ์—ด = 128MB ์“ฐ๊ธฐ
- ์ปค๋„๋‹น ์ด๋Ÿ‰: 384MB ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ

์ตœ๋Œ€ ๋Œ€์—ญํญ (A10G): 600 GB/s
๋‹จ์ผ ํŒจ์Šค ์‹œ๊ฐ„: 384MB / 600 GB/s โ‰ˆ 0.64ms ์ด๋ก ์  ์ตœ์†Œ์น˜
๋ฒค์น˜๋งˆํฌ ์‹œ๊ฐ„: ~328ms (์—ฌ๋Ÿฌ ๋ฐ˜๋ณต + ์˜ค๋ฒ„ํ—ค๋“œ ํฌํ•จ)

์‹ค์ œ ์„ฑ๋Šฅ ๊ฒฐ์ • ์š”์ธ:

  1. ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ: ๋ชจ๋“  ์ปค๋„์ด ๊ฐ€์šฉ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํฌํ™”์‹œํ‚ด
  2. ์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ: Sophisticated ์ปค๋„์ด ์ถ”๊ฐ€ ์ž‘์—…์„ ์ˆ˜ํ–‰ (๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ• ํšจ๊ณผ)
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ด์ : Balanced ์ปค๋„์ด ์ผ๋ถ€ ์บ์‹ฑ ์ด์ ์„ ์–ป์Œ
  4. ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”: ์ตœ์‹  ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ๊ฐ€๋Šฅํ•œ ํ•œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ์„ ์ตœ์†Œํ™”

์ ์œ ์œจ ์ž„๊ณ„๊ฐ’ ๊ฐœ๋… ์ดํ•ดํ•˜๊ธฐ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ ์œ ์œจ์€ โ€œ์ตœ๋Œ€โ€œ๊ฐ€ ์•„๋‹Œ โ€œ์ถฉ๋ถ„ํ•จโ€œ์˜ ๋ฌธ์ œ

๋Œ€๊ธฐ ์‹œ๊ฐ„ ์€๋‹‰ ์š”๊ตฌ ์‚ฌํ•ญ:

  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„: ์ตœ์‹  GPU์—์„œ ~500-800 ์‚ฌ์ดํด
  • ์›Œํ”„ ์Šค์ผ€์ค„๋ง: GPU๋Š” ์ด ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ ์œ„ํ•ด ์ถฉ๋ถ„ํ•œ ์›Œํ”„๊ฐ€ ํ•„์š”
  • ์ถฉ๋ถ„ํ•œ ์ž„๊ณ„๊ฐ’: ๋ณดํ†ต 25-50% ์ ์œ ์œจ์ด๋ฉด ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ํšจ๊ณผ์ ์œผ๋กœ ์ˆจ๊ธธ ์ˆ˜ ์žˆ์Œ

๋†’์€ ์ ์œ ์œจ์ด ํ•ญ์ƒ ๋„์›€์ด ๋˜์ง€ ์•Š๋Š” ์ด์œ :

๋ฆฌ์†Œ์Šค ๊ฒฝ์Ÿ:

  • ๋” ๋งŽ์€ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ๋†“๊ณ  ๊ฒฝ์Ÿ
  • ๋™์‹œ ์ ‘๊ทผ์ด ๋งŽ์•„์ง€๋ฉด ์บ์‹œ ์••๋ฐ•์ด ์ฆ๊ฐ€
  • ๋ ˆ์ง€์Šคํ„ฐ/๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์••๋ฐ•์ด ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ ์„ฑ๋Šฅ์„ ์ €ํ•˜์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ

์›Œํฌ๋กœ๋“œ๋ณ„ ์ตœ์ ํ™”:

  • ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ: ๋†’์€ ์ ์œ ์œจ์ด ALU ํŒŒ์ดํ”„๋ผ์ธ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๋ฐ ๋„์›€
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ: ์ ์œ ์œจ๊ณผ ๋ฌด๊ด€ํ•˜๊ฒŒ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์„ฑ๋Šฅ์„ ์ œํ•œ
  • ํ˜ผํ•ฉ ์›Œํฌ๋กœ๋“œ: ์ ์œ ์œจ๊ณผ ๋‹ค๋ฅธ ์ตœ์ ํ™” ์š”์†Œ ์‚ฌ์ด์—์„œ ๊ท ํ˜• ํ•„์š”

์‹ค์ „ ์ ์œ ์œจ ์ตœ์ ํ™” ์›์น™

์ฒด๊ณ„์  ์ ์œ ์œจ ๋ถ„์„ ์ ‘๊ทผ๋ฒ•:

1๋‹จ๊ณ„: ์ด๋ก ์  ํ•œ๊ณ„ ๊ณ„์‚ฐ

# GPU ์ŠคํŽ™ ํ™•์ธ
pixi run gpu-specs

2๋‹จ๊ณ„: ์‹ค์ œ ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋ง

# ๋ฆฌ์†Œ์Šค ์†Œ๋น„๋Ÿ‰ ์ธก์ •
ncu --set=@occupancy --section=LaunchStats your_kernel

# ๋‹ฌ์„ฑ๋œ ์ ์œ ์œจ ์ธก์ •
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active your_kernel

3๋‹จ๊ณ„: ์„ฑ๋Šฅ ๊ฒ€์ฆ

# ํ•ญ์ƒ ์‹ค์ œ ์„ฑ๋Šฅ ์ธก์ •์œผ๋กœ ๊ฒ€์ฆ
ncu --set=@roofline --section=MemoryWorkloadAnalysis your_kernel

๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ:

์ ์œ ์œจ ๋ถ„์„ โ†’ ์ตœ์ ํ™” ์ „๋žต:

๋†’์€ ์ ์œ ์œจ (>70%) + ์ข‹์€ ์„ฑ๋Šฅ:
โ†’ ์ ์œ ์œจ์€ ์ถฉ๋ถ„, ๋‹ค๋ฅธ ๋ณ‘๋ชฉ์— ์ง‘์ค‘

๋‚ฎ์€ ์ ์œ ์œจ (<30%) + ๋‚˜์œ ์„ฑ๋Šฅ:
โ†’ ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™”๋ฅผ ํ†ตํ•ด ์ ์œ ์œจ ํ–ฅ์ƒ ํ•„์š”

์ ๋‹นํ•œ ์ ์œ ์œจ (50-70%) + ๋‚˜์œ ์„ฑ๋Šฅ:
โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ, ์บ์‹œ, ์—ฐ์‚ฐ ๋ณ‘๋ชฉ ์กฐ์‚ฌ ํ•„์š”

๋‚ฎ์€ ์ ์œ ์œจ (<30%) + ์ข‹์€ ์„ฑ๋Šฅ:
โ†’ ์›Œํฌ๋กœ๋“œ๊ฐ€ ๋†’์€ ์ ์œ ์œจ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์Œ (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ)

์‹ค์šฉ์ ์ธ ์ ์œ ์œจ ์ตœ์ ํ™” ๊ธฐ๋ฒ•

๋ ˆ์ง€์Šคํ„ฐ ์ตœ์ ํ™”:

  • ์ ์ ˆํ•œ ๋ฐ์ดํ„ฐ ํƒ€์ž… ์‚ฌ์šฉ: float32 vs float64, int32 vs int64
  • ์ค‘๊ฐ„ ๋ณ€์ˆ˜ ์ตœ์†Œํ™”: ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ž„์‹œ ์ €์žฅ์†Œ๋ฅผ ์ตœ์ ํ™”ํ•˜๋„๋ก ๋งก๊ธฐ๊ธฐ
  • ๋ฃจํ”„ ์ „๊ฐœ ๊ณ ๋ ค: ์ ์œ ์œจ๊ณผ ๋ช…๋ น์–ด ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ ์‚ฌ์ด์˜ ๊ท ํ˜•

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”:

  • ํ•„์š”ํ•œ ํฌ๊ธฐ ๊ณ„์‚ฐ: ๊ณผ๋‹ค ํ• ๋‹น ๋ฐฉ์ง€
  • ํƒ€์ผ๋ง ์ „๋žต ๊ณ ๋ ค: ์ ์œ ์œจ๊ณผ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ์‚ฌ์ด์˜ ๊ท ํ˜•
  • ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ: ์ถฉ๋Œ ์—†๋Š” ์ ‘๊ทผ ํŒจํ„ด ์„ค๊ณ„

๋ธ”๋ก ํฌ๊ธฐ ํŠœ๋‹:

  • ์—ฌ๋Ÿฌ ์„ค์ • ํ…Œ์ŠคํŠธ: ๋ธ”๋ก๋‹น 256, 512, 1024 ์Šค๋ ˆ๋“œ
  • ์›Œํ”„ ํ™œ์šฉ ๊ณ ๋ ค: ๊ฐ€๋Šฅํ•˜๋ฉด ๋ถˆ์™„์ „ํ•œ ์›Œํ”„ ๋ฐฉ์ง€
  • ์ ์œ ์œจ๊ณผ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์˜ ๊ท ํ˜•: ๋ธ”๋ก์ด ํด์ˆ˜๋ก ๋ฆฌ์†Œ์Šค ํ•œ๊ณ„์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ์Œ

ํ•ต์‹ฌ ์ •๋ฆฌ: A10G ๋ฏธ์Šคํ„ฐ๋ฆฌ์—์„œ ๋ณดํŽธ์  ์›์น™์œผ๋กœ

์ด A10G ์ ์œ ์œจ ์กฐ์‚ฌ๋Š” ๋ชจ๋“  GPU ์ตœ์ ํ™”์— ์ ์šฉ๋˜๋Š” ๋ช…ํ™•ํ•œ ํ†ต์ฐฐ์˜ ์ง„ํ–‰์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

A10G ๋ฐœ๊ฒฌ ๊ณผ์ •:

  1. ์Šค๋ ˆ๋“œ ์ œํ•œ์ด ๋ชจ๋“  ๊ฒƒ์„ ์ง€๋ฐฐ - 19 vs 40 ๋ ˆ์ง€์Šคํ„ฐ, 0KB vs 49KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฐจ์ด์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , A10G์˜ 1,536 ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰ ๋•Œ๋ฌธ์— ๋ชจ๋“  ์ปค๋„์ด SM๋‹น 1๋ธ”๋ก์ด๋ผ๋Š” ๋™์ผํ•œ ์ œํ•œ์— ๊ฑธ๋ฆผ
  2. ์ด๋ก ์ด ํ˜„์‹ค๊ณผ ๊ทผ์ ‘ํ•˜๊ฒŒ ์ผ์น˜ - 66.7% ์ด๋ก  vs ~65% ์‹ค์ธก ์ ์œ ์œจ์€ ์˜ฌ๋ฐ”๋ฅธ ๋ณ‘๋ชฉ์„ ์‹๋ณ„ํ–ˆ์„ ๋•Œ ๊ณ„์‚ฐ์ด ์œ ํšจํ•จ์„ ๋ณด์—ฌ์คŒ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์„ฑ๋Šฅ์„ ์ง€๋ฐฐ - ๋™์ผํ•œ 66.7% ์ ์œ ์œจ์—์„œ, SAXPY์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ํŠน์„ฑ(600 GB/s ํฌํ™”)์ด ๋ฆฌ์†Œ์Šค ์ฐจ์ด์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ์„ค๋ช…

๋ณดํŽธ์ ์ธ GPU ์ตœ์ ํ™” ์›์น™:

์ง„์งœ ๋ณ‘๋ชฉ ์‹๋ณ„ํ•˜๊ธฐ:

  • ๋ชจ๋“  ๋ฆฌ์†Œ์Šค์—์„œ ์ ์œ ์œจ ์ œํ•œ์„ ๊ณ„์‚ฐ: ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰
  • ๊ฐ€์žฅ ์ œํ•œ์ ์ธ ์š”์†Œ๊ฐ€ ๊ฒฐ์ •์  - ๋ ˆ์ง€์Šคํ„ฐ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ๋ณ‘๋ชฉ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜์ง€ ๋ง ๊ฒƒ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ(SAXPY ๊ฐ™์€)๋Š” ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธธ ๋งŒํผ ์ถฉ๋ถ„ํ•œ ์Šค๋ ˆ๋“œ๋งŒ ํ™•๋ณด๋˜๋ฉด ์ ์œ ์œจ์ด ์•„๋‹Œ ๋Œ€์—ญํญ์ด ์ œํ•œ ์š”์ธ

์ ์œ ์œจ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ vs ์ค‘์š”ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ:

  • ๋†’์€ ์ ์œ ์œจ์ด ์ค‘์š”: ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„(GEMM, ๊ณผํ•™ ์‹œ๋ฎฌ๋ ˆ์ด์…˜)์—์„œ ALU ํŒŒ์ดํ”„๋ผ์ธ์ด ๋ฉˆ์ถ”๋Š” ์‹œ๊ฐ„์„ ๋‹ค๋ฅธ ์›Œํ”„ ์‹คํ–‰์œผ๋กœ ์ˆจ๊ฒจ์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ
  • ์ ์œ ์œจ์ด ๋œ ์ค‘์š”: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ(BLAS Level 1, ๋ฉ”๋ชจ๋ฆฌ ๋ณต์‚ฌ)์—์„œ ์ ์œ ์œจ์ด ์ œํ•œ ์š”์ธ์ด ๋˜๊ธฐ ์ „์— ๋Œ€์—ญํญ์ด ํฌํ™”๋˜๋Š” ๊ฒฝ์šฐ
  • ์ ์ • ์ˆ˜์ค€: 60-70% ์ ์œ ์œจ์ด๋ฉด ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ์— ์ถฉ๋ถ„ - ๊ทธ ์ด์ƒ์€ ์ง„์งœ ๋ณ‘๋ชฉ์— ์ง‘์ค‘

์‹ค์ „ ์ตœ์ ํ™” ์›Œํฌํ”Œ๋กœ์šฐ:

  1. ๋จผ์ € ํ”„๋กœํŒŒ์ผ๋ง (ncu --set=@occupancy) - ์‹ค์ œ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ ์œ ์œจ ์ธก์ •
  2. ์ด๋ก ์  ํ•œ๊ณ„ ๊ณ„์‚ฐ - GPU ์ŠคํŽ™ ํ™œ์šฉ (pixi run gpu-specs)
  3. ์ง€๋ฐฐ์  ์ œ์•ฝ ์‹๋ณ„ - ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰, ๋˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  4. ๋ณ‘๋ชฉ ์ตœ์ ํ™” - ์ œํ•œ ์š”์ธ์ด ์•„๋‹Œ ๋ฆฌ์†Œ์Šค์— ์‹œ๊ฐ„ ๋‚ญ๋น„ํ•˜์ง€ ์•Š๊ธฐ
  5. ์ข…๋‹จ๊ฐ„ ์„ฑ๋Šฅ์œผ๋กœ ๊ฒ€์ฆ - ์ ์œ ์œจ์€ ์„ฑ๋Šฅ์„ ์œ„ํ•œ ์ˆ˜๋‹จ์ด์ง€ ๋ชฉํ‘œ๊ฐ€ ์•„๋‹˜

A10G ์‚ฌ๋ก€๋Š” ์ฒด๊ณ„์  ๋ณ‘๋ชฉ ๋ถ„์„์ด ์ง๊ด€๋ณด๋‹ค ๋‚ซ๋‹ค๋Š” ๊ฒƒ์„ ์™„๋ฒฝํ•˜๊ฒŒ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰์ด ์ง€๋ฐฐ์ ์ด์—ˆ๊ธฐ์— Sophisticated ์ปค๋„์˜ ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ•์€ ๋ฌด๊ด€ํ–ˆ๊ณ , ๋™์ผํ•œ ์ ์œ ์œจ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํฌํ™”๊ฐ€ ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ์™„์ „ํžˆ ์„ค๋ช…ํ•ด์ค๋‹ˆ๋‹ค.

Puzzle 32: ๋ฑ…ํฌ ์ถฉ๋Œ

์ด ํผ์ฆ์ด ์ค‘์š”ํ•œ ์ด์œ 

์„ฑ๋Šฅ ์ตœ์ ํ™” 3๋ถ€์ž‘์˜ ์™„๊ฒฐ: Puzzle 30์—์„œ GPU ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ๋ฐฐ์šฐ๊ณ , Puzzle 31์—์„œ ์ ์œ ์œจ ์ตœ์ ํ™”๋ฅผ ์ดํ•ดํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ์„ฑ๋Šฅ ์ตœ์ ํ™” ํผ์ฆ์˜ ๋งˆ์ง€๋ง‰ ์กฐ๊ฐ์„ ๋งž์ถœ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ.

์ˆจ๊ฒจ์ง„ ์„ฑ๋Šฅ ํ•จ์ •: ์™„๋ฒฝํ•œ ์ ์œ ์œจ, ์ตœ์ ์˜ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ, ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ์„ ๊ฐ–์ถ˜ GPU ์ปค๋„์„ ์ž‘์„ฑํ•˜๊ณ ๋„ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹ ๋•Œ๋ฌธ์— ๊ทน์ ์ธ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๊ฒฝํ—˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๊ฐ€์žฅ ๋ฏธ๋ฌ˜ํ•˜๋ฉด์„œ๋„ ์˜ํ–ฅ๋ ฅ์ด ํฐ ์„ฑ๋Šฅ ํ•จ์ • ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ์—ฌ์ •:

  • Puzzle 30์—์„œ๋Š” NSight ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๊ณ  ์ง„๋‹จํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  • Puzzle 31์—์„œ๋Š” ์ ์œ ์œจ ๋ถ„์„์„ ํ†ตํ•ด ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์„ ์˜ˆ์ธกํ•˜๊ณ  ์ œ์–ดํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  • Puzzle 32์—์„œ๋Š” ์ตœ๋Œ€ ํšจ์œจ์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค

GPU๋ฅผ ๋„˜์–ด์„œ ์ ์šฉ๋˜๋Š” ์›๋ฆฌ: ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น, ์ถฉ๋Œ ๊ฐ์ง€, ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”์˜ ์›๋ฆฌ๋Š” CPU ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ๋ถ€ํ„ฐ ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์ด ํผ์ฆ์€ NVIDIA GPU ์ „์šฉ์ž…๋‹ˆ๋‹ค

๋ฑ…ํฌ ์ถฉ๋Œ ๋ถ„์„์€ NVIDIA์˜ 32-๋ฑ…ํฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์™€ NSight Compute ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ตœ์ ํ™” ์›๋ฆฌ๋Š” ๋„๋ฆฌ ์ ์šฉ๋˜์ง€๋งŒ, ๊ตฌ์ฒด์ ์ธ ๊ธฐ๋ฒ•๊ณผ ์ธก์ • ๋ฐฉ๋ฒ•์€ NVIDIA CUDA์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„ ๋‚ด์˜ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ๋™์‹œ์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•˜๋ฉฐ, ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ์„ ์ง๋ ฌํ™”ํ•˜๋„๋ก ๊ฐ•์ œํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ผ ์‚ฌ์ดํด ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด์–ด์•ผ ํ•  ๊ฒƒ์ด ์—ฌ๋Ÿฌ ์‚ฌ์ดํด์˜ ์ง๋ ฌํ™”๋œ ์ ‘๊ทผ์œผ๋กœ ๋ฐ”๋€” ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋  ๊ฒƒ:

  • ํ•˜๋“œ์›จ์–ด ์ˆ˜์ค€์—์„œ GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น์ด ์ž‘๋™ํ•˜๋Š” ๋ฐฉ์‹
  • ๋™์ผํ•œ ์ปค๋„์ด ์™œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์—์„œ ํฌ๊ฒŒ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋Š”์ง€
  • ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๊ธฐ ์ „์— ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์˜ˆ์ธกํ•˜๊ณ  ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•
  • ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๊ธฐ ์œ„ํ•œ ์ „๋ฌธ์ ์ธ ์ตœ์ ํ™” ์ „๋žต

ํƒ์ • ๋ฐฉ๋ฒ•๋ก : ์ด ํผ์ฆ์€ ์ด์ „ ์„ฑ๋Šฅ ํผ์ฆ๊ณผ ๋™์ผํ•œ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค - ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋กœ ์ˆจ๊ฒจ์ง„ ๋น„ํšจ์œจ์„ ๋ฐํ˜€๋‚ธ ๋‹ค์Œ, ์ฒด๊ณ„์ ์ธ ์ตœ์ ํ™” ์›์น™์„ ์ ์šฉํ•˜์—ฌ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์˜ ๊ธฐ์ดˆ:

  • 32-๋ฑ…ํฌ ์„ค๊ณ„: NVIDIA GPU๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ 32๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋ฑ…ํฌ๋กœ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ์ถฉ๋Œ ์œ ํ˜•: ์ถฉ๋Œ ์—†์Œ(์ตœ์ ), N-way ์ถฉ๋Œ(์ง๋ ฌํ™”), ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(์ตœ์ ํ™”)
  • ์ ‘๊ทผ ํŒจํ„ด ์ˆ˜ํ•™: ๋ฑ…ํฌ ํ• ๋‹น ๊ณต์‹๊ณผ ์ถฉ๋Œ ์˜ˆ์ธก
  • ์„ฑ๋Šฅ ์˜ํ–ฅ: ์ตœ์ ์˜ 1์‚ฌ์ดํด ์ ‘๊ทผ๋ถ€ํ„ฐ ์ตœ์•…์˜ 32์‚ฌ์ดํด ์ง๋ ฌํ™”๊นŒ์ง€

์ „๋ฌธ์ ์ธ ์ตœ์ ํ™” ๊ธฐ์ˆ :

  • ํŒจํ„ด ๋ถ„์„: ๋ฑ…ํ‚น ๋™์ž‘์˜ ์ˆ˜ํ•™์  ์˜ˆ์ธก
  • ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก : ์ถฉ๋Œ ์ธก์ •์„ ์œ„ํ•œ NSight Compute ๋ฉ”ํŠธ๋ฆญ
  • ์„ค๊ณ„ ์›์น™: ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด๊ณผ ์˜ˆ๋ฐฉ ์ „๋žต
  • ์„ฑ๋Šฅ ๊ฒ€์ฆ: ์ฒด๊ณ„์ ์ธ ์ธก์ •์„ ํ†ตํ•œ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”

ํผ์ฆ ๊ตฌ์„ฑ

์ด ํผ์ฆ์€ ์ „๋ฌธ์„ฑ์„ ์ ์ง„์ ์œผ๋กœ ์Œ“์•„๊ฐ€๋Š” ๋‘ ๊ฐœ์˜ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ์„น์…˜์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ“š ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ดํ•ดํ•˜๊ธฐ

๋ช…ํ™•ํ•œ ์„ค๋ช…๊ณผ ์‹ค์šฉ์ ์ธ ์˜ˆ์ œ๋ฅผ ํ†ตํ•ด GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น์˜ ์ด๋ก ์  ๊ธฐ์ดˆ๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šฐ๊ฒŒ ๋  ๊ฒƒ:

  • NVIDIA์˜ 32-๋ฑ…ํฌ ์•„ํ‚คํ…์ฒ˜๊ฐ€ ๋ณ‘๋ ฌ ์ ‘๊ทผ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ์‹
  • ๋ฑ…ํฌ ํ• ๋‹น๊ณผ ์ถฉ๋Œ ์˜ˆ์ธก์˜ ์ˆ˜ํ•™
  • ์ถฉ๋Œ ์œ ํ˜•๊ณผ ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ
  • ์ด์ „ ๊ฐœ๋…๊ณผ์˜ ์—ฐ๊ฒฐ (์›Œํ”„ ์‹คํ–‰, ์ ์œ ์œจ, ํ”„๋กœํŒŒ์ผ๋ง)

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํ•˜๋“œ์›จ์–ด๋ฅผ ์ดํ•ดํ•˜๋ฉด ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์ „์— ์„ฑ๋Šฅ์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด

๋ฑ…ํ‚น ์ง€์‹์„ ํ™œ์šฉํ•˜์—ฌ ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ด…๋‹ˆ๋‹ค.

ํƒ์ • ๋„์ „ ๊ณผ์ œ: ๋‘ ์ปค๋„์ด ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•˜์ง€๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์€ ๊ทน์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. NSight Compute๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•œ ์ปค๋„์€ ์ฒด๊ณ„์ ์ธ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ๊ฒช๊ณ  ๋‹ค๋ฅธ ์ปค๋„์€ ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜๋Š” ์ด์œ ๋ฅผ ๋ฐํ˜€๋‚ด์„ธ์š”.

๊ธธ๋Ÿฌ์ง€๋Š” ์—ญ๋Ÿ‰: ํŒจํ„ด ๋ถ„์„, ์ถฉ๋Œ ์ธก์ •, ์ฒด๊ณ„์  ์ตœ์ ํ™”, ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์„ฑ๋Šฅ ๊ฐœ์„ .

์‹œ์ž‘ํ•˜๊ธฐ

ํ•™์Šต ๊ฒฝ๋กœ:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ดํ•ดํ•˜๊ธฐ - ์ด๋ก ์  ๊ธฐ์ดˆ ์Œ“๊ธฐ
  2. ์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด - ์‹ค์ „ ์ตœ์ ํ™”์— ํƒ์ • ์—ญ๋Ÿ‰ ์ ์šฉํ•˜๊ธฐ

์„ ์ˆ˜ ์กฐ๊ฑด:

  • Puzzle 30์—์„œ ์ตํžŒ GPU ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฝํ—˜
  • Puzzle 31์—์„œ ์ตํžŒ ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™” ์ดํ•ด
  • Puzzle 8๊ณผ Puzzle 16์—์„œ ์ตํžŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฒฝํ—˜

ํ•˜๋“œ์›จ์–ด ์š”๊ตฌ ์‚ฌํ•ญ:

  • CUDA 툴킷(NSight Compute 포함)이 설치된 NVIDIA GPU

์ตœ์ ํ™”์˜ ํšจ๊ณผ

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋Š” ํ–‰๋ ฌ ๊ณฑ์…ˆ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์บ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์Šคํ…์‹ค ์—ฐ์‚ฐ
  • ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜

์ „๋ฌธ ์—ญ๋Ÿ‰ ๊ฐœ๋ฐœ:

  • ์ฒด๊ณ„์  ์ตœ์ ํ™”: ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์„ฑ๋Šฅ ๊ฐœ์„  ๋ฐฉ๋ฒ•๋ก 
  • ํ•˜๋“œ์›จ์–ด ์ธ์‹: ์†Œํ”„ํŠธ์›จ์–ด๊ฐ€ ํ•˜๋“œ์›จ์–ด ์ œ์•ฝ์— ์–ด๋–ป๊ฒŒ ๋งคํ•‘๋˜๋Š”์ง€ ์ดํ•ด
  • ํŒจํ„ด ์ธ์‹: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„์—์„œ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ ‘๊ทผ ํŒจํ„ด ์‹๋ณ„

ํ•™์Šต ์„ฑ๊ณผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์„ค๊ณ„, ์ธก์ •, ์ตœ์ ํ™”ํ•˜๋Š” ์—ญ๋Ÿ‰๊นŒ์ง€ ๊ฐ–์ถ”๋ฉด GPU ์„ฑ๋Šฅ ์ตœ์ ํ™” ๋„๊ตฌ ์„ธํŠธ๊ฐ€ ์™„์„ฑ๋ฉ๋‹ˆ๋‹ค - ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ๋งˆ์ง€๋ง‰ ํผ์ฆ ์กฐ๊ฐ์ž…๋‹ˆ๋‹ค.

์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์—์„œ ์ ์œ ์œจ ๊ด€๋ฆฌ๋ฅผ ๊ฑฐ์ณ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น ํšจ์œจ๊นŒ์ง€, ์ด ํผ์ฆ์€ ์ตœ์ ์˜ GPU ์„ฑ๋Šฅ์„ ์œ„ํ•ด์„œ๋Š” ์—ฌ๋Ÿฌ ์ˆ˜์ค€์—์„œ ํ•˜๋“œ์›จ์–ด๋ฅผ ์ดํ•ดํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๐Ÿ“š ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ดํ•ดํ•˜๊ธฐ

์ง€๊ธˆ๊นŒ์ง€ ๋ฐฐ์šด ๊ฒƒ์„ ๋ฐ”ํƒ•์œผ๋กœ

GPU ์ตœ์ ํ™” ์—ฌ์ •์—์„œ ์ด๋ฏธ ๋งŽ์€ ๊ธธ์„ ๊ฑธ์–ด์™”์Šต๋‹ˆ๋‹ค. Puzzle 8์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฅธ ๋ธ”๋ก ๋‚ด๋ถ€ ์ €์žฅ์†Œ๋ฅผ ์ œ๊ณตํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. Puzzle 16์—์„œ๋Š” ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ปค๋„์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ํƒ€์ผ์„ ์บ์‹ฑํ•˜๊ณ , ๋น„์šฉ์ด ํฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—๋Š” ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ์ง๋ ฌํ™”์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ˆจ๊ฒจ์ง„ ์„ฑ๋Šฅ ํ•จ์ •์ด ๋„์‚ฌ๋ฆฌ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค: ๋ฑ…ํฌ ์ถฉ๋Œ.

์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ: ๊ฒ‰๋ณด๊ธฐ์— ๋™์ผํ•œ ๋ฐฉ์‹์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋‘ ์ปค๋„์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค - ๋‘˜ ๋‹ค ๊ฐ™์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , ์™„๋ฒฝํ•œ ์ ์œ ์œจ์„ ๊ฐ€์ง€๋ฉฐ, ๊ฒฝ์Ÿ ์ƒํƒœ๋„ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ํ•˜๋‚˜๊ฐ€ ๋‹ค๋ฅธ ๊ฒƒ๋ณด๋‹ค 32๋ฐฐ ๋А๋ฆฝ๋‹ˆ๋‹ค. ๋ฒ”์ธ์€? ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ๋ž€?

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋ฑ…ํฌ๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” 32๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์œ ๋‹›์˜ ์ง‘ํ•ฉ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜์„ธ์š”. ๊ฐ ๋ฑ…ํฌ๋Š” ํด๋ก ์‚ฌ์ดํด๋‹น ํ•˜๋‚˜์˜ ๋ฉ”๋ชจ๋ฆฌ ์š”์ฒญ์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฑ…ํ‚น ์‹œ์Šคํ…œ์ด ์กด์žฌํ•˜๋Š” ๊ทผ๋ณธ์ ์ธ ์ด์œ ๋Š” ํ•˜๋“œ์›จ์–ด ๋ณ‘๋ ฌ์„ฑ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

32๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ๊ตฌ์„ฑ๋œ ์›Œํ”„๊ฐ€ ๋™์‹œ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•ด์•ผ ํ•  ๋•Œ, ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•œ๋‹ค๋ฉด GPU๋Š” 32๊ฐœ์˜ ์š”์ฒญ์„ ๋ชจ๋‘ ๋ณ‘๋ ฌ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋ ค ํ•˜๋ฉด ํ•˜๋“œ์›จ์–ด๋Š” ์ด๋ฅผ ์ง๋ ฌํ™”ํ•ด์•ผ ํ•˜๋ฏ€๋กœ, 1์‚ฌ์ดํด์ด๋ฉด ๋  ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ ์‚ฌ์ดํด๋กœ ๋Š˜์–ด๋‚ฉ๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ฃผ์†Œ ๋งคํ•‘

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ฐ 4๋ฐ”์ดํŠธ ์›Œ๋“œ๋Š” ๋‹ค์Œ ๊ณต์‹์— ๋”ฐ๋ผ ํŠน์ • ๋ฑ…ํฌ์— ๋ฐฐ์ •๋ฉ๋‹ˆ๋‹ค:

bank_id = (byte_address / 4) % 32

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์ฒ˜์Œ 128๋ฐ”์ดํŠธ๊ฐ€ ๋ฑ…ํฌ์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

| Address Range | Bank ID | Example float32 Elements |
|---|---|---|
| 0-3 bytes | Bank 0 | shared[0] |
| 4-7 bytes | Bank 1 | shared[1] |
| 8-11 bytes | Bank 2 | shared[2] |
| … | … | … |
| 124-127 bytes | Bank 31 | shared[31] |
| 128-131 bytes | Bank 0 | shared[32] |
| 132-135 bytes | Bank 1 | shared[33] |

ํ•ต์‹ฌ ํ†ต์ฐฐ: float32 ๋ฐฐ์—ด์—์„œ ๋ฑ…ํ‚น ํŒจํ„ด์€ 32๊ฐœ ์š”์†Œ๋งˆ๋‹ค ๋ฐ˜๋ณต๋˜๋ฉฐ, ์ด๋Š” 32๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ๊ตฌ์„ฑ๋œ ์›Œํ”„ ํฌ๊ธฐ์™€ ์ •ํ™•ํžˆ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ์šฐ์—ฐ์ด ์•„๋‹™๋‹ˆ๋‹ค - ์ตœ์ ์˜ ๋ณ‘๋ ฌ ์ ‘๊ทผ์„ ์œ„ํ•ด ์„ค๊ณ„๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
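
뱅크 배정 공식을 코드로 확인해 보면, float32 배열에서 뱅킹 패턴이 32개 요소마다 반복됨을 바로 볼 수 있습니다 (공식 자체 외의 구성은 예시입니다):

```python
def bank_id(byte_address: int) -> int:
    """NVIDIA 32-뱅크 구성에서 4바이트 워드가 배정되는 뱅크 번호."""
    return (byte_address // 4) % 32

# float32 배열 인덱스 i의 바이트 주소는 4 * i
assert bank_id(4 * 0) == 0    # shared[0]  → Bank 0
assert bank_id(4 * 31) == 31  # shared[31] → Bank 31
assert bank_id(4 * 32) == 0   # shared[32] → 다시 Bank 0
# 패턴은 32개 요소(128바이트)마다 반복
assert all(bank_id(4 * i) == bank_id(4 * (i + 32)) for i in range(256))
```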

๋ฑ…ํฌ ์ถฉ๋Œ์˜ ์œ ํ˜•

์ถฉ๋Œ ์—†์Œ: ์ด์ƒ์ ์ธ ๊ฒฝ์šฐ

์›Œํ”„ ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋ฉด 32๊ฐœ์˜ ์ ‘๊ทผ์ด ๋ชจ๋‘ 1์‚ฌ์ดํด์— ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค:

# Perfect case: each thread accesses a different bank
shared[thread_idx.x]  # Thread 0โ†’Bank 0, Thread 1โ†’Bank 1, ..., Thread 31โ†’Bank 31

๊ฒฐ๊ณผ: 32๊ฐœ ๋ณ‘๋ ฌ ์ ‘๊ทผ, ์ด 1์‚ฌ์ดํด

N-way ๋ฑ…ํฌ ์ถฉ๋Œ

N๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ์ ‘๊ทผํ•˜๋ฉด ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ ‘๊ทผ์„ ์ง๋ ฌํ™”ํ•ฉ๋‹ˆ๋‹ค:

# 2-way conflict: stride-2 access pattern
shared[thread_idx.x * 2]  # Thread 0,16โ†’Bank 0; Thread 1,17โ†’Bank 1; etc.

๊ฒฐ๊ณผ: ๋ฑ…ํฌ๋‹น 2ํšŒ ์ ‘๊ทผ, ์ด 2์‚ฌ์ดํด (ํšจ์œจ 50%)

# Worst case: all threads access different addresses in Bank 0
shared[thread_idx.x * 32]  # All threadsโ†’Bank 0

๊ฒฐ๊ณผ: 32ํšŒ ์ง๋ ฌํ™”๋œ ์ ‘๊ทผ, ์ด 32์‚ฌ์ดํด (ํšจ์œจ 3%)

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์˜ˆ์™ธ

์ถฉ๋Œ ๊ทœ์น™์—๋Š” ํ•œ ๊ฐ€์ง€ ์ค‘์š”ํ•œ ์˜ˆ์™ธ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ ‘๊ทผ. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์ฃผ์†Œ๋ฅผ ์ฝ์œผ๋ฉด ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ด๋ฅผ ๋‹จ์ผ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค:

# Broadcast: all threads read the same value
constant = shared[0]  # All threads read shared[0]

๊ฒฐ๊ณผ: 1ํšŒ ์ ‘๊ทผ์œผ๋กœ 32๊ฐœ ์Šค๋ ˆ๋“œ์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ, ์ด 1์‚ฌ์ดํด

์ด ์ตœ์ ํ™”๊ฐ€ ์กด์žฌํ•˜๋Š” ์ด์œ ๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๊ฐ€ ํ”ํ•œ ํŒจํ„ด(์ƒ์ˆ˜ ๋กœ๋”ฉ, ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ ๋“ฑ)์ด๊ณ , ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์—†์ด ๋‹จ์ผ ๊ฐ’์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๋ณต์ œํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ์ด ์ค‘์š”ํ•œ ์ด์œ 

์„ฑ๋Šฅ ์˜ํ–ฅ

๋ฑ…ํฌ ์ถฉ๋Œ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ๊ฐ„์„ ์ง์ ‘์ ์œผ๋กœ ๋ฐฐ๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค:

์ถฉ๋Œ ์œ ํ˜•์ ‘๊ทผ ์‹œ๊ฐ„ํšจ์œจ์„ฑ๋Šฅ ์˜ํ–ฅ
์ถฉ๋Œ ์—†์Œ1์‚ฌ์ดํด100%๊ธฐ์ค€์„ 
2-way conflict2์‚ฌ์ดํด50%2๋ฐฐ ๋А๋ฆผ
4-way conflict4์‚ฌ์ดํด25%4๋ฐฐ ๋А๋ฆผ
32-way conflict32์‚ฌ์ดํด3%32๋ฐฐ ๋А๋ฆผ

์‹ค์ „ ๋งฅ๋ฝ

Puzzle 30์—์„œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๊ทน์ ์ธ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์ด ์›๋ฆฌ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ˆ˜์ค€์—์„œ ์ž‘๋™ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค.

์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด DRAM ๋Œ€์—ญํญ ํ™œ์šฉ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, ๋ฑ…ํฌ ์ถฉ๋Œ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์˜ํ–ฅ์„ ์ค๋‹ˆ๋‹ค. ์ฐจ์ด๋Š” ๊ทœ๋ชจ์— ์žˆ์Šต๋‹ˆ๋‹ค: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์€ ์ˆ˜๋ฐฑ ์‚ฌ์ดํด์ด์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋Œ์€ ์ ‘๊ทผ๋‹น ๋ช‡ ์‚ฌ์ดํด๋งŒ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ง‘์ค‘์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„์—์„œ๋Š” ์ด โ€œ๋ช‡ ์‚ฌ์ดํดโ€œ์ด ๋น ๋ฅด๊ฒŒ ๋ˆ„์ ๋ฉ๋‹ˆ๋‹ค.

์›Œํ”„ ์‹คํ–‰๊ณผ์˜ ๊ด€๊ณ„

Puzzle 24์—์„œ ์›Œํ”„๊ฐ€ SIMT(Single Instruction, Multiple Thread) ๋ฐฉ์‹์œผ๋กœ ์‹คํ–‰๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ์›Œํ”„๊ฐ€ ๋ฑ…ํฌ ์ถฉ๋Œ์— ๋ถ€๋”ชํžˆ๋ฉด ์ง๋ ฌํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ 32๊ฐœ ์Šค๋ ˆ๋“œ ๋ชจ๋‘๊ฐ€ ๋Œ€๊ธฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋Œ€๊ธฐ ์‹œ๊ฐ„์€ ์ถฉ๋Œ์„ ์ผ์œผํ‚จ ์Šค๋ ˆ๋“œ๋งŒ์ด ์•„๋‹ˆ๋ผ ์›Œํ”„ ์ „์ฒด์˜ ์ง„ํ–‰์— ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค.

์ด๋Š” Puzzle 31์˜ ์ ์œ ์œจ ๊ฐœ๋…๊ณผ ์—ฐ๊ฒฐ๋ฉ๋‹ˆ๋‹ค: ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ํšจ๊ณผ์ ์œผ๋กœ ์ˆจ๊ธฐ๋Š” ๊ฒƒ์„ ๋ฐฉํ•ดํ•˜์—ฌ, ๋†’์€ ์ ์œ ์œจ์˜ ์‹ค์งˆ์ ์ธ ์ด์ ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ ๊ฐ์ง€ํ•˜๊ธฐ

์‹œ๊ฐ์  ํŒจํ„ด ์ธ์‹

์ ‘๊ทผ ํŒจํ„ด์„ ๋ถ„์„ํ•˜๋ฉด ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค:

์ˆœ์ฐจ ์ ‘๊ทผ (์ถฉ๋Œ ์—†์Œ):

# Thread ID:  0  1  2  3  ...  31
# Address:    0  4  8 12  ... 124
# Bank:       0  1  2  3  ...  31  โœ… All different banks

Stride-2 ์ ‘๊ทผ (2-way conflict):

# Thread ID:  0  1  2  3  ...  15  16  17  18 ...  31
# Address:    0  8 16 24  ... 120 128 136 144 ... 248
# Bank:       0  2  4  6  ...  30   0   2   4 ...  30
# Conflict:   Banks 0,2,4...30 each have 2 threads  ❌

Stride-32 ์ ‘๊ทผ (32-way conflict):

# Thread ID:  0   1   2   3  ...  31
# Address:    0  128 256 384 ... 3968
# Bank:       0   0   0   0  ...   0  โŒ All threadsโ†’Bank 0
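
위 세 패턴의 충돌 정도(가장 붐비는 뱅크에 몰린 스레드 수)는 간단한 시뮬레이션으로 예측할 수 있습니다. 모든 스레드가 같은 주소를 읽는 브로드캐스트 예외는 고려하지 않은 스케치입니다:

```python
from collections import Counter

def conflict_degree(stride: int, warp_size: int = 32) -> int:
    """워프가 shared[t * stride] (float32)에 접근할 때의 N-way 충돌 정도."""
    banks = Counter()
    for t in range(warp_size):
        addr = t * stride * 4          # float32 → 4바이트
        banks[(addr // 4) % 32] += 1   # bank_id 공식
    return max(banks.values())         # 가장 붐비는 뱅크의 접근 수

print(conflict_degree(1))   # 1  (충돌 없음)
print(conflict_degree(2))   # 2  (2-way conflict)
print(conflict_degree(32))  # 32 (32-way: 모든 스레드가 Bank 0)
```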

NSight Compute(ncu)๋ฅผ ์‚ฌ์šฉํ•œ ํ”„๋กœํŒŒ์ผ๋ง

Puzzle 30์—์„œ ๋ฐฐ์šด ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ๋ฐ”ํƒ•์œผ๋กœ, ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์ •๋Ÿ‰์ ์œผ๋กœ ์ธก์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# Key metrics for shared memory bank conflicts
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st your_kernel

# Additional context metrics
ncu --metrics=smsp__sass_average_branch_targets_threads_uniform.pct your_kernel
ncu --metrics=smsp__warps_issue_stalled_membar_per_warp_active.pct your_kernel

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld์™€ l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st ๋ฉ”ํŠธ๋ฆญ์€ ์ปค๋„ ์‹คํ–‰ ์ค‘ ๋กœ๋“œ ๋ฐ ์Šคํ† ์–ด ์—ฐ์‚ฐ์˜ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŸ์ˆ˜๋ฅผ ์ง์ ‘ ์นด์šดํŠธํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšŸ์ˆ˜์™€ ๊ฒฐํ•ฉํ•˜๋ฉด ์ถฉ๋Œ ๋น„์œจ์„ ๊ตฌํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ํ•ต์‹ฌ์ ์ธ ์„ฑ๋Šฅ ์ง€ํ‘œ์ž…๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ

์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„

๋ฑ…ํฌ ์ถฉ๋Œ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ปค๋„์—์„œ ๊ฐ€์žฅ ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค:

  • ํƒ€์ดํŠธํ•œ ๋ฃจํ”„ ์•ˆ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ž์ฃผ ์ ‘๊ทผํ•˜๋Š” ๊ฒฝ์šฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๋‹น ์—ฐ์‚ฐ๋Ÿ‰์ด ์ ์€ ๊ฒฝ์šฐ
  • ์ปค๋„์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ธ ๊ฒฝ์šฐ

๋Œ€ํ‘œ์ ์ธ ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ํ–‰๋ ฌ ๊ณฑ์…ˆ ๋‚ด๋ถ€ ๋ฃจํ”„ (Puzzle 16์˜ ํƒ€์ผ๋ง ๋ฒ„์ „๊ณผ ๊ฐ™์€)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์บ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์Šคํ…์‹ค ์—ฐ์‚ฐ
  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ

๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ vs ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

Puzzle 31์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ์ ์œ ์œจ์ด ๋œ ์ค‘์š”ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์•˜๋“ฏ์ด, ์ปค๋„์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ๋ณ‘๋ชฉ์ด ๊ฑธ๋ฆฌ๊ฑฐ๋‚˜ ์‚ฐ์ˆ  ๊ฐ•๋„๊ฐ€ ๋งค์šฐ ๋‚ฎ์€ ๊ฒฝ์šฐ์—๋Š” ๋ฑ…ํฌ ์ถฉ๋Œ์˜ ์˜ํ–ฅ๋„ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋งŽ์€ ์ปค๋„์€ ๋ฐ”๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์—์„œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๋กœ ์ „ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๊ฒฝ์šฐ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์• ์ดˆ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋„์ž…ํ•œ ์ด์œ ์˜€๋˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•˜์ง€ ๋ชปํ•˜๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์•ž์œผ๋กœ์˜ ๋ฐฉํ–ฅ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น์„ ์ดํ•ดํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ธฐ์ดˆ๋ฅผ ๊ฐ–์ถ”๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  1. ์ ‘๊ทผ ํŒจํ„ด์„ ๋ถ„์„ํ•˜์—ฌ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์ „์— ์„ฑ๋Šฅ์„ ์˜ˆ์ธก
  2. ์ฒด๊ณ„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ์ ‘๊ทผ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ์ง„๋‹จ
  3. ๋†’์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์œ ์ง€ํ•˜๋Š” ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„
  4. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„์™€ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ ์‚ฌ์ด์˜ ๊ท ํ˜• ์žกํžŒ ํŒ๋‹จ

๋‹ค์Œ ์„น์…˜์—์„œ๋Š” ์ด ์ง€์‹์„ ์‹ค์Šต์— ์ ์šฉํ•˜์—ฌ ์ผ๋ฐ˜์ ์ธ ์ถฉ๋Œ ํŒจํ„ด๊ณผ ํ•ด๊ฒฐ์ฑ…์„ ์ง์ ‘ ๋‹ค๋ค„๋ด…๋‹ˆ๋‹ค - ์ด๋ก ์  ์ดํ•ด๋ฅผ ์‹ค์ „ ์ตœ์ ํ™” ์—ญ๋Ÿ‰์œผ๋กœ ๋ฐ”๊พธ๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค.

์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด

์ฐธ๊ณ : ์ด ์„น์…˜์€ NVIDIA GPU ์ „์šฉ์ž…๋‹ˆ๋‹ค

์—ฌ๊ธฐ์„œ ๋‹ค๋ฃจ๋Š” ๋ฑ…ํฌ ์ถฉ๋Œ ๋ถ„์„๊ณผ ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋ฒ•์€ NVIDIA GPU์— ํŠนํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋ง ๋ช…๋ น์€ NVIDIA CUDA ํˆดํ‚ท์— ํฌํ•จ๋œ NSight Compute ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง ์—ญ๋Ÿ‰์„ ๋ฐ”ํƒ•์œผ๋กœ

Puzzle 30์—์„œ GPU ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ๋ฅผ ๋ฐฐ์šฐ๊ณ , Puzzle 31์—์„œ ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™”๋ฅผ ์ดํ•ดํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ๋ฐฐ์šด ํƒ์ • ๊ธฐ์ˆ ์„ ์ƒˆ๋กœ์šด ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ์— ์ ์šฉํ•  ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ.

ํƒ์ • ๋„์ „ ๊ณผ์ œ: ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ((input + 10) * 2)์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‘ GPU ์ปค๋„์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋‘˜ ๋‹ค ์ •ํ™•ํžˆ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค. ๊ฐ™์€ ์–‘์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ ์œ ์œจ๋„ ๋™์ผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ํ•˜๋‚˜๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹ ๋•Œ๋ฌธ์— ์ฒด๊ณ„์ ์ธ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๊ฒช์Šต๋‹ˆ๋‹ค.

์—ฌ๋Ÿฌ๋ถ„์˜ ์ž„๋ฌด: ์ง€๊ธˆ๊นŒ์ง€ ๋ฐฐ์šด ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์œผ๋กœ ์ด ์ˆจ๊ฒจ์ง„ ์„ฑ๋Šฅ ํ•จ์ •์„ ๋ฐํ˜€๋‚ด๊ณ , ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์–ธ์ œ ์ค‘์š”ํ•œ์ง€ ์ดํ•ดํ•˜์„ธ์š”.

๊ฐœ์š”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„ ๋‚ด์˜ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ๋™์‹œ์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด ํƒ์ • ์‚ฌ๊ฑด์—์„œ๋Š” ๋Œ€์กฐ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ฐ€์ง„ ๋‘ ์ปค๋„์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค:

comptime SIZE = 8 * 1024  # 8K elements - small enough to focus on shared memory patterns
comptime TPB = 256  # Threads per block - divisible by 32 (warp size)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime BLOCKS_PER_GRID = (SIZE // TPB, 1)
comptime dtype = DType.float32
comptime layout = Layout.row_major(SIZE)


fn no_conflict_kernel[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Perfect shared memory access - no bank conflicts.

    Each thread accesses a different bank: thread_idx.x maps to bank thread_idx.x % 32.
    This achieves optimal shared memory bandwidth utilization.
    """

    # Shared memory buffer - each thread loads one element
    shared_buf = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Load from global memory to shared memory - no conflicts
    if global_i < size:
        shared_buf[local_i] = (
            input[global_i] + 10.0
        )  # Add 10 as simple operation

    barrier()  # Synchronize shared memory writes

    # Read back from shared memory and write to output - no conflicts
    if global_i < size:
        output[global_i] = shared_buf[local_i] * 2.0  # Multiply by 2

    barrier()  # Ensure completion


fn two_way_conflict_kernel[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Stride-2 shared memory access - creates 2-way bank conflicts.

    Threads 0,16 → Bank 0, Threads 1,17 → Bank 2, etc.
    Each bank serves 2 threads, doubling access time.
    """

    # Shared memory buffer - stride-2 access pattern creates conflicts
    shared_buf = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # CONFLICT: stride-2 access creates 2-way bank conflicts
    conflict_index = (local_i * 2) % TPB

    # Load with bank conflicts
    if global_i < size:
        shared_buf[conflict_index] = (
            input[global_i] + 10.0
        )  # Same operation as no-conflict

    barrier()  # Synchronize shared memory writes

    # Read back with same conflicts
    if global_i < size:
        output[global_i] = (
            shared_buf[conflict_index] * 2.0
        )  # Same operation as no-conflict

    barrier()  # Ensure completion


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p32/p32.mojo

๋ฏธ์Šคํ„ฐ๋ฆฌ: ์ด ์ปค๋„๋“ค์€ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•˜์ง€๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์€ ๊ทน์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์ฒด๊ณ„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ๋ถ„์„์„ ํ†ตํ•ด ๊ทธ ์ด์œ ๋ฅผ ๋ฐํ˜€๋‚ด๋Š” ๊ฒƒ์ด ์ž„๋ฌด์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

์š”๊ตฌ ์‚ฌํ•ญ:

  • Puzzle 30์˜ CUDA ํˆดํ‚ท๊ณผ NSight Compute๊ฐ€ ์„ค์น˜๋œ NVIDIA GPU
  • ์ด์ „ ์„น์…˜์—์„œ ๋‹ค๋ฃฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น ๊ฐœ๋…์— ๋Œ€ํ•œ ์ดํ•ด

์ปค๋„ ์„ค์ •:

comptime SIZE = 8 * 1024      # 8K elements - focus on shared memory patterns
comptime TPB = 256            # 256 threads per block (8 warps)
comptime BLOCKS_PER_GRID = (SIZE // TPB, 1)  # 32 blocks

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ œํ•œ์ด ์•„๋‹Œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ๊ณผ๋ฅผ ๋ถ€๊ฐํ•˜๊ธฐ ์œ„ํ•ด ๋ฌธ์ œ ํฌ๊ธฐ๋ฅผ ์˜๋„์ ์œผ๋กœ ์ด์ „ ํผ์ฆ๋ณด๋‹ค ์ž‘๊ฒŒ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

์กฐ์‚ฌ ๊ณผ์ •

Step 1: ์ •ํ™•์„ฑ ๊ฒ€์ฆ

pixi shell -e nvidia
mojo problems/p32/p32.mojo --test

๋‘ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์ •ํ™•์„ฑ์ด ์•„๋‹Œ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

Step 2: ์„ฑ๋Šฅ ๊ธฐ์ค€์„  ๋ฒค์น˜๋งˆํฌ

mojo problems/p32/p32.mojo --benchmark

์‹คํ–‰ ์‹œ๊ฐ„์„ ๊ธฐ๋กํ•˜์„ธ์š”. ์›Œํฌ๋กœ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— ์˜ํ•ด ์ง€๋ฐฐ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋น„์Šทํ•œ ์„ฑ๋Šฅ์ด ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์ง€๋งŒ, ๋ฑ…ํฌ ์ถฉ๋Œ์€ ํ”„๋กœํŒŒ์ผ๋ง ๋ฉ”ํŠธ๋ฆญ์„ ํ†ตํ•ด ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค.

Step 3: ํ”„๋กœํŒŒ์ผ๋ง์šฉ ๋นŒ๋“œ

mojo build --debug-level=full problems/p32/p32.mojo -o problems/p32/p32_profiler

Step 4: ๋ฑ…ํฌ ์ถฉ๋Œ ํ”„๋กœํŒŒ์ผ๋ง

NSight Compute๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์ •๋Ÿ‰์ ์œผ๋กœ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค:

# Profile no-conflict kernel
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st problems/p32/p32_profiler --no-conflict

๊ทธ๋ฆฌ๊ณ 

# Profile two-way conflict kernel
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st problems/p32/p32_profiler --two-way

๊ธฐ๋กํ•  ํ•ต์‹ฌ ๋ฉ”ํŠธ๋ฆญ:

  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum - ๋กœ๋“œ ์ถฉ๋Œ
  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum - ์Šคํ† ์–ด ์ถฉ๋Œ

Step 5: ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ˆ˜ํ•™์  ์ ‘๊ทผ ํŒจํ„ด์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค:

์ถฉ๋Œ ์—†๋Š” ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด:

# Thread mapping: thread_idx.x directly maps to shared memory index
shared_buf[thread_idx.x]  # Thread 0โ†’Index 0, Thread 1โ†’Index 1, etc.
# Bank mapping: Index % 32 = Bank ID
# Result: Thread 0โ†’Bank 0, Thread 1โ†’Bank 1, ..., Thread 31โ†’Bank 31

2-way ์ถฉ๋Œ ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด:

# Thread mapping with stride-2 modulo operation
shared_buf[(thread_idx.x * 2) % TPB]
# For threads 0-31: Index 0,2,4,...,62; since Bank = Index % 32,
# threads 16-31 land on the same banks as threads 0-15
# Bank mapping examples:
# Thread 0  โ†’ Index 0   โ†’ Bank 0
# Thread 16 โ†’ Index 32  โ†’ Bank 0  (conflict!)
# Thread 1  โ†’ Index 2   โ†’ Bank 2
# Thread 17 โ†’ Index 34  โ†’ Bank 2  (conflict!)
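위 매핑은 커널의 인덱스 식을 그대로 옮긴 파이썬 스케치로 전수 확인할 수 있습니다 (TPB=256, 첫 워프의 스레드 0–31 기준, 예시용 가정):

```python
# 두 커널의 인덱스 식을 재현해 뱅크 분포를 확인하는 예시용 스케치
from collections import Counter

TPB = 256       # 커널 설정과 동일한 블록당 스레드 수
NUM_BANKS = 32  # 4바이트 폭 뱅크 32개

# 충돌 없는 커널: shared_buf[thread_idx.x]
seq_banks = [tid % NUM_BANKS for tid in range(32)]
assert max(Counter(seq_banks).values()) == 1   # 모든 뱅크가 스레드 1개씩 담당

# 2-way 충돌 커널: shared_buf[(thread_idx.x * 2) % TPB]
conflict_banks = [((tid * 2) % TPB) % NUM_BANKS for tid in range(32)]
assert max(Counter(conflict_banks).values()) == 2  # 사용되는 짝수 뱅크마다 스레드 2개

# 위 표와 동일하게 Thread 0과 Thread 16이 Bank 0에서 충돌
assert conflict_banks[0] == conflict_banks[16] == 0
```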

๋„์ „ ๊ณผ์ œ: ๋ฑ…ํฌ ์ถฉ๋Œ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ณด์„ธ์š”

์œ„์˜ ์กฐ์‚ฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ•œ ํ›„, ๋‹ค์Œ ๋ถ„์„ ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”:

์„ฑ๋Šฅ ๋ถ„์„ (Step 1-2)

  1. ๋‘ ์ปค๋„์ด ๋™์ผํ•œ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋‚˜์š”?
  2. ์ปค๋„ ๊ฐ„ ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด๊ฐ€ ์žˆ๋‚˜์š”?
  3. ์ ‘๊ทผ ํŒจํ„ด์ด ๋‹ค๋ฅธ๋ฐ๋„ ์„ฑ๋Šฅ์ด ๋น„์Šทํ•  ์ˆ˜ ์žˆ๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

๋ฑ…ํฌ ์ถฉ๋Œ ํ”„๋กœํŒŒ์ผ๋ง (Step 4)

  1. ์ถฉ๋Œ ์—†๋Š” ์ปค๋„์€ ๋กœ๋“œ์™€ ์Šคํ† ์–ด์—์„œ ๋ช‡ ๊ฑด์˜ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ๋ฐœ์ƒ์‹œํ‚ค๋‚˜์š”?
  2. 2-way ์ถฉ๋Œ ์ปค๋„์€ ๋กœ๋“œ์™€ ์Šคํ† ์–ด์—์„œ ๋ช‡ ๊ฑด์˜ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ๋ฐœ์ƒ์‹œํ‚ค๋‚˜์š”?
  3. ๋‘ ์ปค๋„ ๊ฐ„ ์ด ์ถฉ๋Œ ํšŸ์ˆ˜ ์ฐจ์ด๋Š” ์–ผ๋งˆ์ธ๊ฐ€์š”?

์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„ (Step 5)

  1. ์ถฉ๋Œ ์—†๋Š” ์ปค๋„์—์„œ Thread 0์€ ์–ด๋–ค ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋‚˜์š”? Thread 31์€?
  2. 2-way ์ถฉ๋Œ ์ปค๋„์—์„œ Bank 0์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ๋Š”? Bank 2์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ๋Š”?
  3. ์ถฉ๋Œ ์ปค๋„์—์„œ ๊ฐ™์€ ๋ฑ…ํฌ๋ฅผ ๋†“๊ณ  ๊ฒฝ์Ÿํ•˜๋Š” ์Šค๋ ˆ๋“œ๋Š” ๋ช‡ ๊ฐœ์ธ๊ฐ€์š”?

๋ฑ…ํฌ ์ถฉ๋Œ ํƒ์ • ์ž‘์—…

  1. ์ถฉ๋Œ ์—†๋Š” ์ปค๋„์€ ์ถฉ๋Œ์ด 0์ธ๋ฐ, 2-way ์ถฉ๋Œ ์ปค๋„์—์„œ๋Š” ์ธก์ • ๊ฐ€๋Šฅํ•œ ์ถฉ๋Œ์ด ๋‚˜ํƒ€๋‚˜๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  2. stride-2 ์ ‘๊ทผ ํŒจํ„ด (thread_idx.x * 2) % TPB๋Š” ์–ด๋–ป๊ฒŒ ์ฒด๊ณ„์ ์ธ ์ถฉ๋Œ์„ ๋งŒ๋“ค์–ด๋‚ด๋‚˜์š”?
  3. ๋ฑ…ํฌ ์ถฉ๋Œ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„๋ณด๋‹ค ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„์—์„œ ๋” ์ค‘์š”ํ•œ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

์‹ค์ „ ์‹œ์‚ฌ์ 

  1. ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋˜๋Š” ๊ฒฝ์šฐ๋Š” ์–ธ์ œ์ธ๊ฐ€์š”?
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์ „์— ๋ฑ…ํฌ ์ถฉ๋Œ ํŒจํ„ด์„ ์–ด๋–ป๊ฒŒ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  3. ํ–‰๋ ฌ ์—ฐ์‚ฐ๊ณผ ์Šคํ…์‹ค ์—ฐ์‚ฐ์—์„œ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ํ”ผํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ์„ค๊ณ„ ์›์น™์€ ๋ฌด์—‡์ธ๊ฐ€์š”?
ํŒ

๋ฑ…ํฌ ์ถฉ๋Œ ํƒ์ • ๋„๊ตฌ ๋ชจ์Œ:

  • NSight Compute ๋ฉ”ํŠธ๋ฆญ - ์ •๋ฐ€ํ•œ ์ธก์ •์œผ๋กœ ์ถฉ๋Œ์„ ์ •๋Ÿ‰ํ™”
  • ์ ‘๊ทผ ํŒจํ„ด ์‹œ๊ฐํ™” - ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๋ฑ…ํฌ์— ์ฒด๊ณ„์ ์œผ๋กœ ๋งคํ•‘
  • ์ˆ˜ํ•™์  ๋ถ„์„ - ๋ชจ๋“ˆ๋กœ ์—ฐ์‚ฐ์œผ๋กœ ์ถฉ๋Œ ์˜ˆ์ธก
  • ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ - ์ถฉ๋Œ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ์™€ ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ ์ดํ•ด

ํ•ต์‹ฌ ์กฐ์‚ฌ ์›์น™:

  • ์ฒด๊ณ„์ ์œผ๋กœ ์ธก์ •ํ•˜๊ธฐ: ์ถฉ๋Œ์„ ์ถ”์ธกํ•˜์ง€ ๋ง๊ณ  ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉ
  • ์ ‘๊ทผ ํŒจํ„ด ์‹œ๊ฐํ™”ํ•˜๊ธฐ: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์Šค๋ ˆ๋“œ-๋ฑ…ํฌ ๋งคํ•‘์„ ๊ทธ๋ ค๋ณด๊ธฐ
  • ์›Œํฌ๋กœ๋“œ ๋งฅ๋ฝ ๊ณ ๋ คํ•˜๊ธฐ: ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์—ฐ์‚ฐ ์ง‘์•ฝ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ๊ฐ€์žฅ ์ค‘์š”
  • ์˜ˆ๋ฐฉ์ ์œผ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ: ์ฒ˜์Œ๋ถ€ํ„ฐ ์ถฉ๋Œ ์—†๋Š” ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„

์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„ ๋ฐฉ๋ฒ•:

  1. ์Šค๋ ˆ๋“œ๋ฅผ ์ธ๋ฑ์Šค์— ๋งคํ•‘: ์ˆ˜ํ•™์  ์ฃผ์†Œ ๊ณ„์‚ฐ์„ ์ดํ•ด
  2. ๋ฑ…ํฌ ํ• ๋‹น ๊ณ„์‚ฐ: ๊ณต์‹ bank_id = (address / 4) % 32 ์‚ฌ์šฉ
  3. ์ถฉ๋Œ ์‹๋ณ„: ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์ธ์ง€ ํ™•์ธ
  4. ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ๊ฒ€์ฆ: NSight Compute ์ธก์ •์œผ๋กœ ์ด๋ก ์  ๋ถ„์„ ํ™•์ธ

์ผ๋ฐ˜์ ์ธ ์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด:

  • ์ˆœ์ฐจ ์ ‘๊ทผ: shared[thread_idx.x] - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผ
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ ‘๊ทผ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ shared[0] - ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”
  • 홀수 스트라이드: 홀수 스트라이드는 32개 뱅크에 고르게 분산됨 (반면 stride-32 같은 2의 거듭제곱 스트라이드는 위에서 본 것처럼 충돌을 유발)
  • ํŒจ๋”ฉ๋œ ๋ฐฐ์—ด: ํŒจ๋”ฉ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ ‘๊ทผ ํŒจํ„ด์„ ์ด๋™

์†”๋ฃจ์…˜

๋ฑ…ํฌ ์ถฉ๋Œ ๋ถ„์„์ด ํฌํ•จ๋œ ์™„์ „ํ•œ ํ’€์ด

์ด ๋ฑ…ํฌ ์ถฉ๋Œ ํƒ์ • ์‚ฌ๊ฑด์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด GPU ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง์˜ ์ค‘์š”์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง์„ ํ†ตํ•œ ์กฐ์‚ฌ ๊ฒฐ๊ณผ

Step 1: ์ •ํ™•์„ฑ ๊ฒ€์ฆ ๋‘ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค:

โœ… No-conflict kernel: PASSED
โœ… Two-way conflict kernel: PASSED
โœ… Both kernels produce identical results

Step 2: ์„ฑ๋Šฅ ๊ธฐ์ค€์„  ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ๋Š” ๋น„์Šทํ•œ ์‹คํ–‰ ์‹œ๊ฐ„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

| name             | met (ms)           | iters |
| ---------------- | ------------------ | ----- |
| no_conflict      | 2.1930616745886655 | 547   |
| two_way_conflict | 2.1978922967032966 | 546   |

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์„ฑ๋Šฅ์ด ๊ฑฐ์˜ ๋™์ผํ•œ ์ด์œ (~2.19ms vs ~2.20ms)๋Š” ์ด ์›Œํฌ๋กœ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์‹คํ–‰ ์‹œ๊ฐ„์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋ง ๋ฉ”ํŠธ๋ฆญ์„ ํ†ตํ•ด ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ

์ถฉ๋Œ ์—†๋Š” ์ปค๋„ (์ตœ์  ์ ‘๊ทผ ํŒจํ„ด):

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum    0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum    0

๊ฒฐ๊ณผ: ๋กœ๋“œ์™€ ์Šคํ† ์–ด ๋ชจ๋‘ ์ถฉ๋Œ 0๊ฑด - ์™„๋ฒฝํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ.

2-Way ์ถฉ๋Œ ์ปค๋„ (๋ฌธ์ œ ์žˆ๋Š” ์ ‘๊ทผ ํŒจํ„ด):

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum    256
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum    256

๊ฒฐ๊ณผ: ๋กœ๋“œ์™€ ์Šคํ† ์–ด ๊ฐ๊ฐ 256๊ฑด์˜ ์ถฉ๋Œ - ์ฒด๊ณ„์ ์ธ ๋ฑ…ํ‚น ๋ฌธ์ œ์˜ ๋ช…ํ™•ํ•œ ๊ทผ๊ฑฐ.

์ด ์ถฉ๋Œ ์ฐจ์ด: 512๊ฑด์˜ ์ถฉ๋Œ(256 + 256)์ด ์ธก์ • ๊ฐ€๋Šฅํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋น„ํšจ์œจ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ ‘๊ทผ ํŒจํ„ด ์ˆ˜ํ•™์  ๋ถ„์„

์ถฉ๋Œ ์—†๋Š” ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด

์Šค๋ ˆ๋“œ-์ธ๋ฑ์Šค ๋งคํ•‘:

shared_buf[thread_idx.x]

๋ฑ…ํฌ ํ• ๋‹น ๋ถ„์„:

Thread 0  โ†’ Index 0   โ†’ Bank 0 % 32 = 0
Thread 1  โ†’ Index 1   โ†’ Bank 1 % 32 = 1
Thread 2  โ†’ Index 2   โ†’ Bank 2 % 32 = 2
...
Thread 31 โ†’ Index 31  โ†’ Bank 31 % 32 = 31

๊ฒฐ๊ณผ: ์™„๋ฒฝํ•œ ๋ฑ…ํฌ ๋ถ„๋ฐฐ - ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜์—ฌ ๋ณ‘๋ ฌ ์ ‘๊ทผ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

2-way ์ถฉ๋Œ ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด

์Šค๋ ˆ๋“œ-์ธ๋ฑ์Šค ๋งคํ•‘:

shared_buf[(thread_idx.x * 2) % TPB]  # TPB = 256

์ฒซ ๋ฒˆ์งธ ์›Œํ”„(์Šค๋ ˆ๋“œ 0-31)์˜ ๋ฑ…ํฌ ํ• ๋‹น ๋ถ„์„:

Thread 0  โ†’ Index (0*2)%256 = 0   โ†’ Bank 0
Thread 1  โ†’ Index (1*2)%256 = 2   โ†’ Bank 2
Thread 2  โ†’ Index (2*2)%256 = 4   โ†’ Bank 4
...
Thread 16 โ†’ Index (16*2)%256 = 32 โ†’ Bank 0  โ† Thread 0๊ณผ ์ถฉ๋Œ
Thread 17 โ†’ Index (17*2)%256 = 34 โ†’ Bank 2  โ† Thread 1๊ณผ ์ถฉ๋Œ
Thread 18 โ†’ Index (18*2)%256 = 36 โ†’ Bank 4  โ† Thread 2์™€ ์ถฉ๋Œ
...

์ถฉ๋Œ ํŒจํ„ด: ๊ฐ ๋ฑ…ํฌ๊ฐ€ ์ •ํ™•ํžˆ 2๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์ฒ˜๋ฆฌํ•˜์—ฌ 32๊ฐœ ๋ฑ…ํฌ ์ „์ฒด์—์„œ ์ฒด๊ณ„์ ์ธ 2-way ์ถฉ๋Œ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์„ค๋ช…: stride-2 ํŒจํ„ด๊ณผ ๋ชจ๋“ˆ๋กœ 256์˜ ์กฐํ•ฉ์ด ๋ฐ˜๋ณต์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ 0-15๋Š” ๋ฑ…ํฌ 0,2,4,โ€ฆ,30์— ์ ‘๊ทผ
  • ์Šค๋ ˆ๋“œ 16-31์€ ๋™์ผํ•œ ๋ฑ…ํฌ 0,2,4,โ€ฆ,30์— ์ ‘๊ทผ
  • ๊ฐ ๋ฑ…ํฌ ์ถฉ๋Œ๋งˆ๋‹ค ํ•˜๋“œ์›จ์–ด ์ง๋ ฌํ™”๊ฐ€ ํ•„์š”

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ์›Œํฌ๋กœ๋“œ ๋งฅ๋ฝ ๋ถ„์„

๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ vs ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ์‹œ์‚ฌ์ 

์ด ์›Œํฌ๋กœ๋“œ์˜ ํŠน์„ฑ:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ง€๋ฐฐ์ : ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋Œ€๋น„ ์ตœ์†Œํ•œ์˜ ์—ฐ์‚ฐ๋งŒ ์ˆ˜ํ–‰
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” ๋ถ€์ฐจ์ : ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ถ”๊ฐ€ํ•˜์ง€๋งŒ ์ „์ฒด ์‹คํ–‰ ์‹œ๊ฐ„์„ ์ง€๋ฐฐํ•˜์ง€๋Š” ์•Š์Œ
  • ๋™์ผํ•œ ์„ฑ๋Šฅ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํฌํ™”๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋น„ํšจ์œจ์„ ๊ฐ€๋ฆผ

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ:

  1. ์—ฐ์‚ฐ ์ง‘์•ฝ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ - ํ–‰๋ ฌ ๊ณฑ์…ˆ, ์Šคํ…์‹ค ์—ฐ์‚ฐ, FFT
  2. ํƒ€์ดํŠธํ•œ ์—ฐ์‚ฐ ๋ฃจํ”„ - ๋‚ด๋ถ€ ๋ฃจํ”„ ์•ˆ์—์„œ ๋ฐ˜๋ณต์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
  3. ๋†’์€ ์‚ฐ์ˆ  ๊ฐ•๋„ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๋‹น ์ƒ๋‹นํ•œ ์—ฐ์‚ฐ๋Ÿ‰
  4. ๋Œ€๊ทœ๋ชจ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ž‘์—… ์„ธํŠธ - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์บ์‹ฑ์„ ์ง‘์ค‘์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜

์‹ค์ „ ์„ฑ๋Šฅ ์‹œ์‚ฌ์ 

๋ฑ…ํฌ ์ถฉ๋Œ์ด ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜:

ํ–‰๋ ฌ ๊ณฑ์…ˆ:

# Problematic: All threads in warp access same column
for k in range(tile_size):
    acc += a_shared[local_row, k] * b_shared[k, local_col]  # b_shared[k, 0] conflicts

์Šคํ…์‹ค ์—ฐ์‚ฐ:

# Problematic: Stride access in boundary handling
shared_buf[thread_idx.x * stride]  # Creates systematic conflicts

๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜:

# Problematic: Power-of-2 stride patterns
if thread_idx.x < stride:
    shared_buf[thread_idx.x] += shared_buf[thread_idx.x + stride]  # Conflict potential

์ถฉ๋Œ ์—†๋Š” ์„ค๊ณ„ ์›์น™

์˜ˆ๋ฐฉ ์ „๋žต

1. ์ˆœ์ฐจ ์ ‘๊ทผ ํŒจํ„ด:

shared[thread_idx.x]  # Optimal - each thread different bank

2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ตœ์ ํ™”:

constant = shared[0]  # All threads read same address - hardware optimized

3. ํŒจ๋”ฉ ๊ธฐ๋ฒ•:

shared = LayoutTensor[dtype, Layout.row_major(TPB + 1), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()  # Shift access patterns

4. ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„:

  • ๊ตฌํ˜„ ์ „์— ๋ฑ…ํฌ ํ• ๋‹น์„ ๊ณ„์‚ฐ
  • ๋ชจ๋“ˆ๋กœ ์—ฐ์‚ฐ ์‚ฌ์šฉ: bank_id = (address_bytes / 4) % 32
  • ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์Šค๋ ˆ๋“œ-๋ฑ…ํฌ ๋งคํ•‘์„ ์‹œ๊ฐํ™”
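위 3번 패딩 기법이 효과를 내는 이유는 2차원 타일의 열(column) 접근에서 가장 뚜렷하게 드러납니다. 아래는 row-major 32×32 타일과 한 열을 패딩한 32×33 타일의 열 접근 뱅크 분포를 비교하는 예시용 파이썬 스케치입니다 (4바이트 요소, bank = index % 32 가정):

```python
# 패딩 효과 확인: row-major 타일에서 한 열을 따라 내려가며 접근할 때의 뱅크 분포
NUM_BANKS = 32
TILE = 32

def column_banks(width: int, col: int = 0) -> list:
    """스레드 0..31이 shared[row][col]을 읽을 때 닿는 뱅크 목록."""
    return [(row * width + col) % NUM_BANKS for row in range(32)]

print(len(set(column_banks(TILE))))      # → 1  (패딩 없음: 32-way 충돌)
print(len(set(column_banks(TILE + 1))))  # → 32 (패딩 1열 추가: 충돌 없음)
```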

์ฒด๊ณ„์  ์ตœ์ ํ™” ์›Œํฌํ”Œ๋กœ์šฐ

์„ค๊ณ„ ๋‹จ๊ณ„:

  1. ์ ‘๊ทผ ํŒจํ„ด ๊ณ„ํš - ์Šค๋ ˆ๋“œ-๋ฉ”๋ชจ๋ฆฌ ๋งคํ•‘์„ ์Šค์ผ€์น˜
  2. ๋ฑ…ํฌ ํ• ๋‹น ๊ณ„์‚ฐ - ์ˆ˜ํ•™์  ๋ถ„์„ ํ™œ์šฉ
  3. ์ถฉ๋Œ ์˜ˆ์ธก - ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ ‘๊ทผ ํŒจํ„ด ์‹๋ณ„
  4. ๋Œ€์•ˆ ์„ค๊ณ„ - ํŒจ๋”ฉ, ์ „์น˜, ๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ€๊ฒฝ ๊ณ ๋ ค

๊ตฌํ˜„ ๋‹จ๊ณ„:

  1. ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง - NSight Compute ์ถฉ๋Œ ๋ฉ”ํŠธ๋ฆญ ์‚ฌ์šฉ
  2. ์˜ํ–ฅ ์ธก์ • - ๊ตฌํ˜„ ๊ฐ„ ์ถฉ๋Œ ํšŸ์ˆ˜ ๋น„๊ต
  3. ์„ฑ๋Šฅ ๊ฒ€์ฆ - ์ตœ์ ํ™”๊ฐ€ ์ข…๋‹จ๊ฐ„ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š”์ง€ ํ™•์ธ
  4. ํŒจํ„ด ๋ฌธ์„œํ™” - ์„ฑ๊ณต์ ์ธ ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๊ธฐ๋ก

ํ•ต์‹ฌ ์ •๋ฆฌ: ํƒ์ • ์ž‘์—…์—์„œ ์ตœ์ ํ™” ์ „๋ฌธ์„ฑ์œผ๋กœ

๋ฑ…ํฌ ์ถฉ๋Œ ์กฐ์‚ฌ์—์„œ ๋ฐํ˜€์ง„ ๊ฒƒ:

  1. ์ธก์ •์ด ์ง๊ด€๋ณด๋‹ค ๋‚ซ๋‹ค - ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๊ฐ€ ์„ฑ๋Šฅ ํƒ€์ด๋ฐ์œผ๋กœ๋Š” ๋ณด์ด์ง€ ์•Š๋Š” ์ถฉ๋Œ์„ ๋“œ๋Ÿฌ๋ƒ„
  2. ํŒจํ„ด ๋ถ„์„์ด ์œ ํšจํ•˜๋‹ค - ์ˆ˜ํ•™์  ์˜ˆ์ธก์ด NSight Compute ๊ฒฐ๊ณผ์™€ ์ •ํ™•ํžˆ ์ผ์น˜
  3. ๋งฅ๋ฝ์ด ์ค‘์š”ํ•˜๋‹ค - ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์—ฐ์‚ฐ ์ง‘์•ฝ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์›Œํฌ๋กœ๋“œ์—์„œ ๊ฐ€์žฅ ์ค‘์š”
  4. ์˜ˆ๋ฐฉ์ด ์ˆ˜์ •๋ณด๋‹ค ๋‚ซ๋‹ค - ์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด์„ ์„ค๊ณ„ํ•˜๋Š” ๊ฒƒ์ด ์‚ฌํ›„ ์ตœ์ ํ™”๋ณด๋‹ค ์‰ฌ์›€

๋ณดํŽธ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์›์น™:

๋ฑ…ํฌ ์ถฉ๋Œ์— ์ฃผ์˜ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ:

  • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„
  • ํƒ€์ดํŠธํ•œ ๋ฃจํ”„์—์„œ ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๋ชจ๋“  ์‚ฌ์ดํด์ด ์ค‘์š”ํ•œ ์„ฑ๋Šฅ ํ•ต์‹ฌ ์ฝ”๋“œ
  • ๋Œ€์—ญํญ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ง‘์•ฝ์  ์—ฐ์‚ฐ

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๋œ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์„ฑ๋Šฅ์„ ์ง€๋ฐฐํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์žฌ์‚ฌ์šฉ์ด ์ตœ์†Œ์ธ ๋‹จ์ˆœ ์บ์‹ฑ ์‹œ๋‚˜๋ฆฌ์˜ค
  • ๋ฐ˜๋ณต์ ์ธ ์ถฉ๋Œ ๋ฐœ์ƒ ์—ฐ์‚ฐ์ด ์—†๋Š” ์ผํšŒ์„ฑ ์ ‘๊ทผ ํŒจํ„ด

์ „๋ฌธ์  ๊ฐœ๋ฐœ ๋ฐฉ๋ฒ•๋ก :

  1. ์ตœ์ ํ™” ์ „์— ํ”„๋กœํŒŒ์ผ๋ง - NSight Compute๋กœ ์ถฉ๋Œ์„ ์ •๋Ÿ‰์ ์œผ๋กœ ์ธก์ •
  2. ์ ‘๊ทผ ์ˆ˜ํ•™ ์ดํ•ด - ๋ฑ…ํฌ ํ• ๋‹น ๊ณต์‹์œผ๋กœ ๋ฌธ์ œ๋ฅผ ์˜ˆ์ธก
  3. ์ฒด๊ณ„์ ์œผ๋กœ ์„ค๊ณ„ - ๋ฑ…ํ‚น์„ ์‚ฌํ›„ ๊ณ ๋ ค๊ฐ€ ์•„๋‹Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„ ๋‹จ๊ณ„์—์„œ ๊ณ ๋ ค
  4. ์ตœ์ ํ™” ๊ฒ€์ฆ - ์ถฉ๋Œ ๊ฐ์†Œ๊ฐ€ ์‹ค์ œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š”์ง€ ํ™•์ธ

์ด ํƒ์ • ์‚ฌ๊ฑด์€ ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง์ด ์„ฑ๋Šฅ ํƒ€์ด๋ฐ๋งŒ์œผ๋กœ๋Š” ๋ณด์ด์ง€ ์•Š๋Š” ์ตœ์ ํ™” ๊ธฐํšŒ๋ฅผ ๋“œ๋Ÿฌ๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์ธก์ • ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๊ฐ€ ์ถ”์ธก๋ณด๋‹ค ๋‚˜์€ ๋Œ€ํ‘œ์ ์ธ ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค.

Puzzle 33: ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ

์†Œ๊ฐœ

GPU ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ตœ์ ํ™”์˜ ์ตœ์ „์„ ์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ํผ์ฆ์—์„œ๋Š” ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ „๋ก€ ์—†๋Š” ์†๋„๋กœ ๊ฐ€์†ํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์œ ๋‹›์ธ ํ…์„œ ์ฝ”์–ด๋ฅผ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

์ง€๊ธˆ๊นŒ์ง€ ๋ฐฐ์šด ๋ชจ๋“  ๊ฒƒ, ํŠนํžˆ Puzzle 16์˜ ๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ตœ์‹  GPU๊ฐ€ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ๊ทน์ ์œผ๋กœ ๋น ๋ฅด๊ฒŒ ๋งŒ๋“œ๋Š” ์ „์šฉ ์‹ค๋ฆฌ์ฝ˜์„ ์–ด๋–ป๊ฒŒ ์ œ๊ณตํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

ํ…์„œ ์ฝ”์–ด๋ž€?

ํ…์„œ ์ฝ”์–ด(AMD ํ•˜๋“œ์›จ์–ด์—์„œ๋Š” Matrix Core๋ผ๊ณ ๋„ ํ•จ)๋Š” ๋‹จ์ผ ๋ช…๋ น์–ด๋กœ ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ํ–‰๋ ฌ-ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์ „์šฉ ํ”„๋กœ์„ธ์‹ฑ ์œ ๋‹›์ž…๋‹ˆ๋‹ค. ์ด ์œ ๋‹›์€ ์ตœ์‹  GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • NVIDIA: Tensor Cores (Volta, Turing, Ampere, Hopper)
  • AMD: Matrix Cores (CDNA/CDNA2/CDNA3 ์•„ํ‚คํ…์ฒ˜)

GPU์— ์ง์ ‘ ๋‚ด์žฅ๋œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† GEMM(์—ญ์ฃผ: General Matrix Multiply, ๋ฒ”์šฉ ํ–‰๋ ฌ ๊ณฑ์…ˆ) ์—”์ง„์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํŠน์ง•

  • ์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ: ๊ฐ ๋ช…๋ น์–ด๊ฐ€ ์ „์ฒด ์›Œํ”„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋Œ€์ƒ์œผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค (NVIDIA์—์„œ 32๊ฐœ ์Šค๋ ˆ๋“œ, AMD์—์„œ 32 ๋˜๋Š” 64๊ฐœ)
  • ๊ณ ์ • ํƒ€์ผ ํฌ๊ธฐ: ์—ฐ์‚ฐ์ด ํŠน์ • ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ํฌ๊ธฐ์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค (์˜ˆ: FP32์˜ ๊ฒฝ์šฐ 16ร—8ร—8)
  • ํ˜ผํ•ฉ ์ •๋ฐ€๋„: ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ์ •๋ฐ€๋„๋ฅผ ํ˜ผํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ๋Œ€๊ทœ๋ชจ ์ฒ˜๋ฆฌ๋Ÿ‰: ํ–‰๋ ฌ ์—ฐ์‚ฐ์—์„œ ์ผ๋ฐ˜ ์ปดํ“จํŠธ ์ฝ”์–ด ๋Œ€๋น„ 10~100๋ฐฐ ์†๋„ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

ํƒ€์ผ๋ง์—์„œ ํ…์„œ ์ฝ”์–ด๋กœ

๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์—์„œ ํ…์„œ ์ฝ”์–ด๊นŒ์ง€์˜ ์—ฌ์ •์„ ๋Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  1. Puzzle 16: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค
  3. ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ: ๋ฐฐ๋ฆฌ์–ด์™€ ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ ์›Œํ”„๋ฅผ ์กฐ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค
  4. ์ง€๊ธˆ: ํ•ต์‹ฌ ์—ฐ์‚ฐ์„ ๊ฐ€์†ํ•˜๊ธฐ ์œ„ํ•ด ์ „์šฉ ํ•˜๋“œ์›จ์–ด(ํ…์„œ ์ฝ”์–ด)๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค

ํ…์„œ ์ฝ”์–ด ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

ํ…์„œ ์ฝ”์–ด๋Š” ๊ธฐ์กด๊ณผ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ์กด ์ปดํ“จํŠธ ์ฝ”์–ด ๋ฐฉ์‹

# Each thread computes one element
acc += a_shared[local_row, k] * b_shared[k, local_col]

ํ…์„œ ์ฝ”์–ด ๋ฐฉ์‹

# Entire warp cooperates on matrix fragments
a_reg = mma_op.load_a(A_mma_tile)           # Load 16ร—8 fragment
b_reg = mma_op.load_b(B_mma_tile)           # Load 8ร—8 fragment
c_reg = mma_op.load_c(C_mma_tile)           # Load 16ร—8 accumulator
d_reg = mma_op.mma_op(a_reg, b_reg, c_reg)  # D = Aร—B + C
mma_op.store_d(C_mma_tile, d_reg)           # Store result

Mojo์˜ ํ…์„œ ์ฝ”์–ด API

Mojo๋Š” TensorCore ํƒ€์ž…์„ ํ†ตํ•ด ํ…์„œ ์ฝ”์–ด์— ๋Œ€ํ•œ ๊น”๋”ํ•œ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

from layout.tensor_core import TensorCore

# Create a Tensor Core operator for specific tile sizes
mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

# Core operations:
# - load_a(): Load matrix A fragment from shared memory
# - load_b(): Load matrix B fragment from shared memory
# - load_c(): Load matrix C fragment (accumulator)
# - mma_op(): Perform D = Aร—B + C operation
# - store_d(): Store result fragment to memory

๊ณ ๊ธ‰ ๊ธฐ๋Šฅ: TensorCore API๋Š” ์–‘์žํ™” ์—ฐ์‚ฐ, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ์Šค์œ„์ฆ ํŒจํ„ด(์—ญ์ฃผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์ฃผ์†Œ๋ฅผ ๋น„ํŠธ ์—ฐ์‚ฐ์œผ๋กœ ์žฌ๋ฐฐ์น˜ํ•˜๋Š” ๊ธฐ๋ฒ•), ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ์—ฐ์‚ฐ๋„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ๋ชจ๋“  ํ˜•ํƒœ, ๋ฐ์ดํ„ฐ ํƒ€์ž…, ๋ฉ”์„œ๋“œ์— ๋Œ€ํ•œ ์ „์ฒด ๋ฌธ์„œ๋Š” ๊ณต์‹ TensorCore API ๋ ˆํผ๋Ÿฐ์Šค๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ํฌ๊ธฐ

TensorCore API๋Š” GPU ํ•˜๋“œ์›จ์–ด์— ๋”ฐ๋ผ ๋‹ค์–‘ํ•œ ํ˜•ํƒœ์™€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค:

NVIDIA GPU:

  • float32: 16ร—8ร—8 ๋˜๋Š” 16ร—8ร—4
  • half-precision: 16ร—8ร—16
  • float8: 16ร—8ร—32

AMD GPU:

  • float32: 16ร—16ร—4
  • half-precision: 16ร—16ร—16 ๋˜๋Š” 32ร—32ร—8

์ด ํผ์ฆ์—์„œ๋Š” FP32์™€ 16ร—8ร—8 ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • MMA_M = 16: ํ–‰๋ ฌ A์˜ ๋†’์ด (์ถœ๋ ฅ ๋†’์ด์™€ ๋™์ผ)
  • MMA_N = 8: ํ–‰๋ ฌ B์˜ ๋„ˆ๋น„ (์ถœ๋ ฅ ๋„ˆ๋น„์™€ ๋™์ผ)
  • MMA_K = 8: ๋‚ด๋ถ€ ์ฐจ์› (A์˜ ๋„ˆ๋น„ = B์˜ ๋†’์ด)

MMA๋ž€? MMA๋Š” โ€œMixed-precision Matrix-Multiply-Accumulateโ€œ์˜ ์•ฝ์ž๋กœ, ํ…์„œ ์ฝ”์–ด๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ๋ณธ ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ๊ฐ MMA ๋ช…๋ น์–ด๋Š” D = A ร— B + C๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉฐ, ์—ฌ๊ธฐ์„œ A, B, C, D๋Š” ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ์ž…๋‹ˆ๋‹ค.

ํ”„๋ž˜๊ทธ๋จผํŠธ ์‹œ๊ฐํ™”:

A fragment (16ร—8)  ร—  B fragment (8ร—8)  +  C fragment (16ร—8)  =  D fragment (16ร—8)

    16 rows             8 rows               16 rows              16 rows
    8 cols              8 cols               8 cols               8 cols
      |                   |                    |                    |
   [A data]         ร—   [B data]         +   [C data]         =  [D result]

์ฆ‰, ๊ฐ ํ…์„œ ์ฝ”์–ด ๋ช…๋ น์–ด๋Š” A์˜ 16ร—8 ํƒ€์ผ๊ณผ B์˜ 8ร—8 ํƒ€์ผ์„ ๊ณฑํ•œ ๋’ค ๊ธฐ์กด 16ร—8 ๋ˆ„์‚ฐ๊ธฐ์— ๋”ํ•˜์—ฌ 16ร—8 ์ถœ๋ ฅ ํƒ€์ผ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
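하드웨어가 수행하는 연산 자체는 순수 파이썬으로 그대로 따라 해 볼 수 있습니다. 아래는 16×8×8 프래그먼트 한 개에 대한 D = A×B + C 참조 계산이며, 데이터 값은 예시용 가정입니다:

```python
# 한 번의 MMA 명령어가 계산하는 D = A × B + C를 16×8×8 크기로 재현한 참조 구현
M, N, K = 16, 8, 8  # MMA_M, MMA_N, MMA_K

A = [[float(i + j) for j in range(K)] for i in range(M)]        # 16×8 프래그먼트
B = [[float((i * j) % 5) for j in range(N)] for i in range(K)]  # 8×8 프래그먼트
C = [[1.0] * N for _ in range(M)]                               # 16×8 누산기

D = [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(K)) for j in range(N)]
     for i in range(M)]

assert len(D) == M and len(D[0]) == N  # 출력도 16×8 프래그먼트
print(D[0][0])  # → 1.0 (B의 0번 열이 모두 0이므로 누산기 값만 남음)
```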

ํ…์„œ ์ฝ”์–ด๋ฅผ ์œ„ํ•œ ์›Œํ”„ ๊ตฌ์„ฑ

์›Œํ”„๋ž€? ์›Œํ”„๋Š” ๋ก์Šคํ…์œผ๋กœ ๋ช…๋ น์–ด๋ฅผ ํ•จ๊ป˜ ์‹คํ–‰ํ•˜๋Š” ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน(NVIDIA์—์„œ 32๊ฐœ, AMD์—์„œ 32 ๋˜๋Š” 64๊ฐœ)์ž…๋‹ˆ๋‹ค. ํ…์„œ ์ฝ”์–ด๋Š” ๋‹จ์ผ ํ–‰๋ ฌ ์—ฐ์‚ฐ์— ์›Œํ”„ ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ˜‘๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์™œ ์›Œํ”„ ์ˆ˜์ค€์ผ๊นŒ? ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ์ผ๋ฐ˜ ์—ฐ์‚ฐ๊ณผ ๋‹ฌ๋ฆฌ, ํ…์„œ ์ฝ”์–ด๋Š” ์ „์ฒด ์›Œํ”„๊ฐ€ ํ•จ๊ป˜ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋กœ๋“œํ•˜๊ณ , MMA ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ…์„œ ์ฝ”์–ด๊ฐ€ ์›Œํ”„ ์ˆ˜์ค€์—์„œ ๋™์ž‘ํ•˜๋ฏ€๋กœ, ์Šค๋ ˆ๋“œ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ๊ตฌ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

# Calculate warp coordinates within the block
warp_id = thread_idx.x // WARP_SIZE
warps_in_n = BN // WN  # Number of warps along N dimension
warps_in_m = BM // WM  # Number of warps along M dimension
warp_y = warp_id // warps_in_n  # Warp's row
warp_x = warp_id % warps_in_n   # Warp's column

# Each warp handles a WMร—WN tile of the output
C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

์›Œํ”„ ๊ตฌ์„ฑ ์˜ˆ์‹œ (BM=128, BN=64, WM=32, WN=32์ธ ๊ฒฝ์šฐ):

Block (128ร—64) contains 8 warps arranged as:

    32 cols    32 cols
     |          |
[  Warp 0  ][  Warp 1  ]  โ† 32 rows each
[  Warp 2  ][  Warp 3  ]  โ† 32 rows each
[  Warp 4  ][  Warp 5  ]  โ† 32 rows each
[  Warp 6  ][  Warp 7  ]  โ† 32 rows each

Total: 4ร—2 = 8 warps, each handling 32ร—32 output region
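위 배치는 커널의 워프 좌표 계산식을 그대로 따라 하면 확인할 수 있습니다 (BM=128, BN=64, WM=WN=32 가정, 예시용 파이썬 스케치):

```python
# 커널과 동일한 식으로 warp_id → (warp_y, warp_x) 좌표를 계산해 보는 스케치
BM, BN = 128, 64   # 블록 타일
WM, WN = 32, 32    # 워프 타일

warps_in_n = BN // WN              # N 방향 워프 수 = 2
warps_in_m = BM // WM              # M 방향 워프 수 = 4
num_warps = warps_in_m * warps_in_n

layout = {w: (w // warps_in_n, w % warps_in_n) for w in range(num_warps)}

print(num_warps)   # → 8
print(layout[0])   # → (0, 0)  좌상단 (Warp 0)
print(layout[1])   # → (0, 1)  우상단 (Warp 1)
print(layout[7])   # → (3, 1)  우하단 (Warp 7)
```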

ํ…์„œ ์ฝ”์–ด์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

ํ…์„œ ์ฝ”์–ด๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”์— ํ•œ ๋‹จ๊ณ„๋ฅผ ๋” ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ โ†’ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: copy_dram_to_sram_async ์‚ฌ์šฉ (Puzzle 16์—์„œ ๋ฐฐ์šด ๊ฒƒ)
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ: mma_op.load_a/load_b ์‚ฌ์šฉ
  3. ์—ฐ์‚ฐ: ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ์—์„œ mma_op.mma_op ์‚ฌ์šฉ
  4. ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ โ†’ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: mma_op.store_d ์‚ฌ์šฉ

๋„์ „ ๊ณผ์ œ

tensor_core_matrix_multiplication ํ•จ์ˆ˜๋ฅผ ์™„์„ฑํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค. ์Šค์ผˆ๋ ˆํ†ค ์ฝ”๋“œ๋Š” ํƒ€์ผ๋ง ๋ฐฉ์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋˜ ์‹ค์ œ ํ…์„œ ์ฝ”์–ด ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์š”๊ตฌ์‚ฌํ•ญ

  1. ์‹ค์ œ ํ…์„œ ์ฝ”์–ด API ์‚ฌ์šฉ: ์‹œ๋ฎฌ๋ ˆ์ด์…˜์ด ์•„๋‹Œ ์‹ค์ œ mma_op.load_a(), mma_op.mma_op() ๋“ฑ์„ ์‚ฌ์šฉํ•˜์„ธ์š”
  2. ์ •ํ™•์„ฑ ์œ ์ง€: ๊ฒฐ๊ณผ๊ฐ€ CPU ์ฐธ์กฐ ๊ตฌํ˜„๊ณผ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  3. ์˜ฌ๋ฐ”๋ฅธ ์›Œํ”„ ์กฐ์ •: ๋ธ”๋ก๋‹น ์—ฌ๋Ÿฌ ์›Œํ”„๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค (NVIDIA์™€ AMD ๋ชจ๋‘์—์„œ ๋™์ž‘)
  4. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: Puzzle 16์—์„œ ๋ฐฐ์šด ๋น„๋™๊ธฐ ๋ณต์‚ฌ ํŒจํ„ด์„ ๋™์ผํ•˜๊ฒŒ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  5. ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ: ํƒ€์ผ๋ง ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ WARP_SIZE์˜ ๋ฐฐ์ˆ˜์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค

์„ค์ •

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE} = 1024\)
  • ๋ธ”๋ก ํƒ€์ผ๋ง: \(\text{BM} = 128, \text{BN} = 64, \text{BK} = 32\)
  • ์›Œํ”„ ํƒ€์ผ๋ง: \(\text{WM} = 32, \text{WN} = 32\) (WARP_SIZE์˜ ๋ฐฐ์ˆ˜)
  • MMA ํ”„๋ž˜๊ทธ๋จผํŠธ: \(16 \times 8 \times 8\) (FP32)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(8 \times \text{WARP_SIZE}\) (๋ธ”๋ก๋‹น 8๊ฐœ ์›Œํ”„)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: ๋ธ”๋ก ํƒ€์ผ๋กœ ์ „์ฒด ํ–‰๋ ฌ์„ ์ปค๋ฒ„

๋ ˆ์ด์•„์›ƒ ์„ค์ •:

  • ์ž…๋ ฅ A: Layout.row_major(SIZE, SIZE)
  • ์ž…๋ ฅ B: Layout.row_major(SIZE, SIZE)
  • ์ถœ๋ ฅ C: Layout.row_major(SIZE, SIZE)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ธ”๋ก ํฌ๊ธฐ ํƒ€์ผ
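위 설정으로부터 그리드 크기는 행렬을 블록 타일 크기로 나누어 구합니다. 아래는 커널 코드의 BLOCKS_PER_GRID_TENSOR_CORE 식을 그대로 옮긴 예시용 파이썬 스케치입니다:

```python
# 그리드 크기 계산: x축은 N(열)을 BN으로, y축은 M(행)을 BM으로 나눠 커버
SIZE = 1024
BM, BN = 128, 64

grid_x = (SIZE + BN - 1) // BN  # 올림 나눗셈: N 방향 블록 수
grid_y = (SIZE + BM - 1) // BM  # 올림 나눗셈: M 방향 블록 수

print(grid_x, grid_y)  # → 16 8  (총 128개 블록)
```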

๋„์ „ ๊ณผ์ œ

์ด ํผ์ฆ์—์„œ๋Š” Puzzle 16์˜ ๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ํ…์„œ ์ฝ”์–ด ๊ตฌํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ํƒ€์ผ๋ง ๊ธฐ๋ณธ ๊ตฌํ˜„ ์ดํ•ดํ•˜๊ธฐ

ํผ์ฆ์€ ์ฐธ์กฐ์šฉ์œผ๋กœ ์™„์„ฑ๋œ ๊ด€์šฉ์  ํƒ€์ผ๋ง ๊ตฌํ˜„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

fn matmul_idiomatic_tiled[
    layout: Layout, size: Int
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    a: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout, ImmutAnyOrigin],
):
    # Use block_dim to get actual tile size dynamically
    var tile_size_x = block_dim.x
    var tile_size_y = block_dim.y

    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = Int(block_idx.y * tile_size_y + local_row)
    tiled_col = Int(block_idx.x * tile_size_x + local_col)

    # Get the tile of the output matrix that this thread block is responsible for
    out_tile = output.tile[TILE_SIZE, TILE_SIZE](
        Int(block_idx.y), Int(block_idx.x)
    )
    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(TILE_SIZE, TILE_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(TILE_SIZE, TILE_SIZE),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    var acc: output.element_type = 0

    comptime load_a_layout = Layout.row_major(1, TILE_SIZE)  # Coalesced loading
    comptime load_b_layout = Layout.row_major(1, TILE_SIZE)  # Coalesced loading
    # Note: Both matrices stored in same orientation for correct matrix multiplication
    # Transposed loading would be useful if B were pre-transposed in global memory

    for idx in range(size // TILE_SIZE):  # Iterate over K tiles
        # Get tiles from A and B matrices
        a_tile = a.tile[TILE_SIZE, TILE_SIZE](Int(block_idx.y), idx)
        b_tile = b.tile[TILE_SIZE, TILE_SIZE](idx, Int(block_idx.x))

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads = TILE_SIZE * TILE_SIZE,
            block_dim_count=BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads = TILE_SIZE * TILE_SIZE,
            block_dim_count=BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        for k in range(TILE_SIZE):
            if (
                local_row < TILE_SIZE
                and local_col < TILE_SIZE
                and k < TILE_SIZE
            ):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result to output tile
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc


์ด ๊ธฐ๋ณธ ๊ตฌํ˜„์ด ํ•˜๋Š” ์ผ:

  • ์ •ํ™•์„ฑ: ์ด ๊ตฌํ˜„์€ ์™„๋ฒฝํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋ฉฐ ๋ชจ๋“  ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•ฉ๋‹ˆ๋‹ค
  • ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ: ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ฐฐ๋ฆฌ์–ด์™€ ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ ์Šค๋ ˆ๋“œ๋ฅผ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค
  • ํƒ€์ผ๋ง ์—ฐ์‚ฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ์ถœ๋ ฅ ์š”์†Œ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

2๋‹จ๊ณ„: ํ…์„œ ์ฝ”์–ด ๋ฏธ์…˜

์œ„ ๋ฐฉ์‹์„ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ํ™œ์šฉํ•˜๋„๋ก ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ๊ธฐ์กด: ์Šค๋ ˆ๋“œ ์ˆ˜์ค€ ์—ฐ์‚ฐ โ†’ ๋ณ€ํ™˜ ํ›„: ์›Œํ”„ ์ˆ˜์ค€ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ
  • ๊ธฐ์กด: ํ‘œ์ค€ FP32 ์‚ฐ์ˆ  โ†’ ๋ณ€ํ™˜ ํ›„: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† GEMM ์—ฐ์‚ฐ
  • ๊ธฐ์กด: ๊ฐœ๋ณ„ ์š”์†Œ ๊ฒฐ๊ณผ โ†’ ๋ณ€ํ™˜ ํ›„: 16ร—8 ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ๊ฒฐ๊ณผ

3๋‹จ๊ณ„: ์„ค์ • ์ดํ•ดํ•˜๊ธฐ

ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „์€ ํ•˜๋“œ์›จ์–ด์— ์ตœ์ ํ™”๋œ ๋‹ค๋ฅธ ํƒ€์ผ๋ง ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • ๋ธ”๋ก ํƒ€์ผ๋ง: BM=128, BN=64, BK=32 (๋” ๋‚˜์€ ์ ์œ ์œจ์„ ์œ„ํ•ด ๋” ํฐ ๋ธ”๋ก)
  • ์›Œํ”„ ํƒ€์ผ๋ง: WM=32, WN=32 (๊ฐ ์›Œํ”„๊ฐ€ 32ร—32 ์ถœ๋ ฅ ์˜์—ญ์„ ๋‹ด๋‹น)
  • MMA ํ”„๋ž˜๊ทธ๋จผํŠธ: 16ร—8ร—8 (ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ •์˜ํ•œ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ํฌ๊ธฐ)
  • ๋ธ”๋ก๋‹น ์›Œํ”„: 8๊ฐœ (BMร—BN ๋ธ”๋ก ๋‚ด์—์„œ 4ร—2๋กœ ๋ฐฐ์น˜)

์™œ ์ด ํŠน์ • ํฌ๊ธฐ์ธ๊ฐ€?

  • BM=128, BN=64: ํ…์„œ ์ฝ”์–ด๋ฅผ ๋” ์ž˜ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํƒ€์ผ๋ง ๋ฒ„์ „(32ร—32)๋ณด๋‹ค ํฝ๋‹ˆ๋‹ค
  • WM=WN=32: WARP_SIZE์˜ ๋ฐฐ์ˆ˜์ด๋ฉฐ 2ร—4=8๊ฐœ์˜ MMA ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค (32รท16=2, 32รท8=4)
  • MMA 16ร—8ร—8: ํ•˜๋“œ์›จ์–ด์— ์˜ํ•ด ๊ณ ์ •๋จ - ํ…์„œ ์ฝ”์–ด๊ฐ€ ๋ฌผ๋ฆฌ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ํฌ๊ธฐ์ž…๋‹ˆ๋‹ค
  • 8 ์›Œํ”„: BMรทWM ร— BNรทWN = 128รท32 ร— 64รท32 = 4ร—2 = ๋ธ”๋ก๋‹น 8๊ฐœ ์›Œํ”„

์›Œํ”„๊ฐ€ MMA ํ”„๋ž˜๊ทธ๋จผํŠธ์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹:

각 32×32 워프 타일은 여러 개의 16×8 MMA 프래그먼트로 나뉩니다:

       8열        8열        8열        8열
[ MMA 0,0 ][ MMA 0,1 ][ MMA 0,2 ][ MMA 0,3 ]  ← 각 16행 (세로 32÷16=2개)
[ MMA 1,0 ][ MMA 1,1 ][ MMA 1,2 ][ MMA 1,3 ]  ← 각 16행

세로 2개(32÷16) × 가로 4개(32÷8) = 워프 출력당 8개의 MMA 프래그먼트이며, 여기에 K-슬라이스 4개를 곱하면 K-타일당 워프당 32개의 MMA 연산이 됩니다

4๋‹จ๊ณ„: ์™„์„ฑํ•  ์ฝ”๋“œ

# Block and warp tiling sizes
comptime BM = 4 * WARP_SIZE  # Block tile M (4 warps along M)
comptime BN = 2 * WARP_SIZE  # Block tile N (2 warps along N)
comptime BK = WARP_SIZE  # Block tile K (stay within SMEM limit)
comptime WM = WARP_SIZE  # Warp tile M
comptime WN = WARP_SIZE  # Warp tile N

# MMA tile sizes for tensor cores
comptime MMA_M = 16
comptime MMA_N = 8
comptime MMA_K = 8

comptime THREADS_PER_BLOCK_TENSOR_CORE = (8 * WARP_SIZE, 1)  # 8 warps per block
# grid_dim is (x, y). We want x to sweep N (columns) and y to sweep M (rows)
comptime BLOCKS_PER_GRID_TENSOR_CORE = (
    (SIZE + BN - 1) // BN,
    (SIZE + BM - 1) // BM,
)


fn tensor_core_matrix_multiplication[
    dtype: DType,
    layout_a: Layout,
    layout_b: Layout,
    layout_c: Layout,
    BM: Int,
    BN: Int,
    BK: Int,
    WM: Int,
    WN: Int,
    MMA_M: Int,
    MMA_N: Int,
    MMA_K: Int,
](
    A: LayoutTensor[dtype, layout_a, ImmutAnyOrigin],
    B: LayoutTensor[dtype, layout_b, ImmutAnyOrigin],
    C: LayoutTensor[dtype, layout_c, MutAnyOrigin],
):
    comptime M = C.shape[0]()
    comptime N = C.shape[1]()
    comptime K = A.shape[1]()

    warp_id = Int(thread_idx.x) // WARP_SIZE
    warps_in_n = BN // WN
    warps_in_m = BM // WM
    warp_y = warp_id // warps_in_n
    warp_x = warp_id % warps_in_n

    warp_is_active = warp_y < warps_in_m

    C_block_tile = C.tile[BM, BN](Int(block_idx.y), Int(block_idx.x))
    C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

    mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

    # Shared SRAM tiles (no padding to stay under shared memory limit)
    A_sram_tile = LayoutTensor[
        A.dtype,
        Layout.row_major(BM, BK),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    B_sram_tile = LayoutTensor[
        B.dtype,
        Layout.row_major(BK, BN),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # One per-warp accumulator tile of shape [WM, WN]
    C_warp_accum = LayoutTensor[
        C.dtype,
        Layout.row_major(WM, WN),
        MutAnyOrigin,
        address_space = AddressSpace.GENERIC,
    ].stack_allocation()

    # Zero initialize accumulator (only for active warps)
    if warp_is_active:

        @parameter
        for i in range(WM):

            @parameter
            for j in range(WN):
                C_warp_accum[i, j] = 0.0

    # Sweep across K in BK chunks (single-buffered)
    for k_i in range(K // BK):
        barrier()

        A_dram_tile = A.tile[BM, BK](Int(block_idx.y), k_i)
        B_dram_tile = B.tile[BK, BN](k_i, Int(block_idx.x))

        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())

        async_copy_wait_all()
        barrier()

        if warp_is_active:
            A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
            B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

            @parameter
            for mma_k in range(BK // MMA_K):

                @parameter
                for mma_m in range(WM // MMA_M):

                    @parameter
                    for mma_n in range(WN // MMA_N):
                        # FILL IN (roughly 8 lines)
                        ...

    # Store the final per-warp accumulation to the output warp tile
    if warp_is_active:

        @parameter
        for mma_m in range(WM // MMA_M):

            @parameter
            for mma_n in range(WN // MMA_N):
                var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
                Acc_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](mma_m, mma_n)
                frag = mma_op.load_c(Acc_mma_tile)
                mma_op.store_d(C_mma_tile, frag)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p33/p33.mojo

ํ•  ์ผ: ์„ธ ๊ฒน์˜ ์ค‘์ฒฉ ๋ฃจํ”„ ์•ˆ์— ์žˆ๋Š” ๋นˆ ๋ถ€๋ถ„(# FILL IN (roughly 8 lines)์œผ๋กœ ํ‘œ์‹œ๋จ)์„ ์™„์„ฑํ•˜์„ธ์š”.

์ดํ•ดํ•ด์•ผ ํ•  ๊ฒƒ:

  • ์Šค์ผˆ๋ ˆํ†ค์ด ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ์›Œํ”„ ๊ตฌ์„ฑ, ๋™๊ธฐํ™”๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ํ•ต์‹ฌ ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ๋งŒ ๊ตฌํ˜„ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค
  • ๋ฃจํ”„๋Š” MMA ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค: mma_k, mma_m, mma_n
  • ๊ฐ ๋ฐ˜๋ณต์—์„œ ํ•˜๋‚˜์˜ 16ร—8ร—8 ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์„ธ ๊ฒน ์ค‘์ฒฉ ๋ฃจํ”„ ์ดํ•ดํ•˜๊ธฐ:

@parameter
for mma_k in range(BK // MMA_K):     # 32รท8 = 4 iterations (K dimension)
    @parameter
    for mma_m in range(WM // MMA_M): # 32รท16 = 2 iterations (M dimension)
        @parameter
        for mma_n in range(WN // MMA_N): # 32รท8 = 4 iterations (N dimension)
            # YOUR CODE HERE: Process one 16ร—8ร—8 MMA fragment

๊ฐ ๋ฃจํ”„๊ฐ€ ํ•˜๋Š” ์ผ:

  • mma_k: ํ˜„์žฌ K-ํƒ€์ผ์˜ K-์Šฌ๋ผ์ด์Šค๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๊ฐ 8๊ฐœ ์š”์†Œ์˜ 4๊ฐœ ์Šฌ๋ผ์ด์Šค)
  • mma_m: ์›Œํ”„ ์ถœ๋ ฅ์˜ M-์Šฌ๋ผ์ด์Šค๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๊ฐ 16ํ–‰์˜ 2๊ฐœ ์Šฌ๋ผ์ด์Šค)
  • mma_n: ์›Œํ”„ ์ถœ๋ ฅ์˜ N-์Šฌ๋ผ์ด์Šค๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๊ฐ 8์—ด์˜ 4๊ฐœ ์Šฌ๋ผ์ด์Šค)
  • ํ•ฉ๊ณ„: 4ร—2ร—4 = K-ํƒ€์ผ๋‹น ์›Œํ”„๋‹น 32๊ฐœ MMA ์—ฐ์‚ฐ
ํŒ

ํ…์„œ ์ฝ”์–ด ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ํ•„์š”ํ•œ ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ์˜ฌ๋ฐ”๋ฅธ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ์ถ”์ถœํ•˜๊ธฐ:

    • ์›Œํ”„ ํƒ€์ผ(A_warp_tile, B_warp_tile, C_warp_accum)์—์„œ MMA ํฌ๊ธฐ์˜ ํŠน์ • ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค
    • ๋ฃจํ”„ ์ธ๋ฑ์Šค(mma_m, mma_k, mma_n)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ฌ๋ฐ”๋ฅธ ํƒ€์ผ ์ขŒํ‘œ๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค
    • ๊ธฐ์–ตํ•˜์„ธ์š”: A๋Š” [MMA_M, MMA_K], B๋Š” [MMA_K, MMA_N], C๋Š” [MMA_M, MMA_N]์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  2. ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ํ…์„œ ์ฝ”์–ด ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•˜๊ธฐ:

    • mma_op ๊ฐ์ฒด์—๋Š” ๊ฐ ํ–‰๋ ฌ ํƒ€์ž…์„ ๋กœ๋“œํ•˜๋Š” ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค
    • ๊ฐ ๋กœ๋“œ ๋ฉ”์„œ๋“œ๋Š” ํƒ€์ผ์„ ๋ฐ›์•„์„œ ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค
    • ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: load_a(), load_b(), load_c() - ๊ฐ๊ฐ ๋ฌด์—‡์„ ๋ฐ›์„๊นŒ์š”?
  3. ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ ์ €์žฅํ•˜๊ธฐ:

    • MMA ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๊ฒฐ๊ณผ๋ฅผ ๋ˆ„์‚ฐ๊ธฐ ํƒ€์ผ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
    • ์—ฐ์‚ฐ ํŒจํ„ด: result = A ร— B + C

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ: 128๊ฐœ์˜ ๊ฐœ๋ณ„ ๊ณฑ์…ˆ-๋ง์…ˆ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ ํ•˜๋“œ์›จ์–ด ๋ช…๋ น์–ด๋กœ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค!

๋””๋ฒ„๊น… ํŒ: ์ฐจ์› ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ํƒ€์ผ ์ธ๋ฑ์‹ฑ์„ ๋‹ค์‹œ ํ™•์ธํ•˜์„ธ์š” - mma_m, mma_k, mma_n์˜ ์ˆœ์„œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅธ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๋ฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

ํ’€์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p33 --test
uv run poe p33 --test

์™„์„ฑํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ •ํ™•๋„ ํ…Œ์ŠคํŠธ ๊ฒฐ๊ณผ๊ฐ€ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

=== Running All Accuracy Tests ===
--- Test 1: Tensor Core vs CPU Reference ---
โœ… TENSOR CORE ACCURACY TEST PASSED!
--- Test 2: Idiomatic Tiled vs CPU Reference ---
โœ… IDIOMATIC TILED ACCURACY TEST PASSED!
ALL TESTS PASSED!

์†”๋ฃจ์…˜

fn tensor_core_matrix_multiplication[
    dtype: DType,
    layout_a: Layout,
    layout_b: Layout,
    layout_c: Layout,
    BM: Int,
    BN: Int,
    BK: Int,
    WM: Int,
    WN: Int,
    MMA_M: Int,
    MMA_N: Int,
    MMA_K: Int,
](
    A: LayoutTensor[dtype, layout_a, ImmutAnyOrigin],
    B: LayoutTensor[dtype, layout_b, ImmutAnyOrigin],
    C: LayoutTensor[dtype, layout_c, MutAnyOrigin],
):
    comptime M = C.shape[0]()
    comptime N = C.shape[1]()
    comptime K = A.shape[1]()

    warp_id = Int(thread_idx.x) // WARP_SIZE
    warps_in_n = BN // WN
    warps_in_m = BM // WM
    warp_y = warp_id // warps_in_n
    warp_x = warp_id % warps_in_n

    warp_is_active = warp_y < warps_in_m

    C_block_tile = C.tile[BM, BN](Int(block_idx.y), Int(block_idx.x))
    C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

    mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

    # Shared SRAM tiles (no padding to stay under shared memory limit)
    A_sram_tile = LayoutTensor[
        A.dtype,
        Layout.row_major(BM, BK),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    B_sram_tile = LayoutTensor[
        B.dtype,
        Layout.row_major(BK, BN),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # One per-warp accumulator tile of shape [WM, WN]
    C_warp_accum = LayoutTensor[
        C.dtype,
        Layout.row_major(WM, WN),
        MutAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Zero initialize accumulator (only for active warps)
    if warp_is_active:

        @parameter
        for i in range(WM):

            @parameter
            for j in range(WN):
                C_warp_accum[i, j] = 0.0

    # (Removed shared C accumulator to reduce shared usage)

    # Sweep across K in BK chunks (single-buffered)
    for k_i in range(K // BK):
        barrier()

        A_dram_tile = A.tile[BM, BK](Int(block_idx.y), k_i)
        B_dram_tile = B.tile[BK, BN](k_i, Int(block_idx.x))

        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())

        async_copy_wait_all()
        barrier()

        if warp_is_active:
            A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
            B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

            @parameter
            for mma_k in range(BK // MMA_K):

                @parameter
                for mma_m in range(WM // MMA_M):

                    @parameter
                    for mma_n in range(WN // MMA_N):
                        A_mma_tile = A_warp_tile.tile[MMA_M, MMA_K](
                            mma_m, mma_k
                        )
                        B_mma_tile = B_warp_tile.tile[MMA_K, MMA_N](
                            mma_k, mma_n
                        )
                        C_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](
                            mma_m, mma_n
                        )

                        a_reg = mma_op.load_a(A_mma_tile)
                        b_reg = mma_op.load_b(B_mma_tile)
                        c_reg = mma_op.load_c(C_mma_tile)
                        d_reg = mma_op.mma_op(a_reg, b_reg, c_reg)
                        mma_op.store_d(C_mma_tile, d_reg)

    # Store the final per-warp accumulation to the output warp tile
    if warp_is_active:

        @parameter
        for mma_m in range(WM // MMA_M):

            @parameter
            for mma_n in range(WN // MMA_N):
                var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
                Acc_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](mma_m, mma_n)
                frag = mma_op.load_c(Acc_mma_tile)
                mma_op.store_d(C_mma_tile, frag)


์ด ํ’€์ด๋Š” ํ…์„œ ์ฝ”์–ด ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ์›Œํ”„ ๊ตฌ์„ฑ

    • warp_id = thread_idx.x // WARP_SIZE๋กœ ๋ธ”๋ก ๋‚ด ์›Œํ”„ ์ขŒํ‘œ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ์›Œํ”„๋ฅผ ์ถœ๋ ฅ ํƒ€์ผ์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค: ๊ฐ ์›Œํ”„๊ฐ€ WMร—WN ์˜์—ญ์„ ๋‹ด๋‹นํ•ฉ๋‹ˆ๋‹ค
    • ์˜ˆ์ƒ๋ณด๋‹ค ์ ์€ ์ˆ˜์˜ ์›Œํ”„๊ฐ€ ์žˆ๋Š” ๋ธ”๋ก์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด warp_is_active ๊ฐ€๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  2. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”

    • ๊ธ€๋กœ๋ฒŒ โ†’ ๊ณต์œ : ํšจ์œจ์ ์ธ ๋ธ”๋ก ์ˆ˜์ค€ ์ „์†ก์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ๊ณต์œ  โ†’ ๋ ˆ์ง€์Šคํ„ฐ: ์›Œํ”„ ์ˆ˜์ค€ ํ”„๋ž˜๊ทธ๋จผํŠธ ๋กœ๋”ฉ์„ ์œ„ํ•ด mma_op.load_a/load_b๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ๋ ˆ์ง€์Šคํ„ฐ ์—ฐ์‚ฐ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์œ„ํ•ด mma_op.mma_op๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ๋ ˆ์ง€์Šคํ„ฐ โ†’ ๊ธ€๋กœ๋ฒŒ: ํšจ์œจ์ ์ธ ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•ด mma_op.store_d๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  3. ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ

    • load_a(A_mma_tile): 16ร—8 ํ–‰๋ ฌ A ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • load_b(B_mma_tile): 8ร—8 ํ–‰๋ ฌ B ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • load_c(C_mma_tile): 16ร—8 ๋ˆ„์‚ฐ๊ธฐ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • mma_op(a_reg, b_reg, c_reg): ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ D = Aร—B + C๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • store_d(C_mma_tile, d_reg): 16ร—8 ๊ฒฐ๊ณผ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  4. ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ

    • ๋ชจ๋“  ํƒ€์ผ๋ง ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ WARP_SIZE์˜ ๋ฐฐ์ˆ˜์ž…๋‹ˆ๋‹ค (NVIDIA์—์„œ 32, AMD์—์„œ 64)
    • Mojo๋Š” TensorCore ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ํ†ตํ•ด ํ•˜๋“œ์›จ์–ด ์ฐจ์ด๋ฅผ ์ถ”์ƒํ™”ํ•ฉ๋‹ˆ๋‹ค
    • ๋™์ผํ•œ ์ฝ”๋“œ๊ฐ€ NVIDIA ํ…์„œ ์ฝ”์–ด์™€ AMD Matrix Core ๋ชจ๋‘์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ๋Š” ํ…์„œ ์ฝ”์–ด๊ฐ€ ์Šค๋ ˆ๋“œ ์ˆ˜์ค€์˜ ๊ฐœ๋ณ„ ์š”์†Œ๊ฐ€ ์•„๋‹Œ ์›Œํ”„ ์ˆ˜์ค€์˜ ์ „์ฒด ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ๋‹จ์œ„๋กœ ๋™์ž‘ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์™€ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์ด ๊ฐ€๋Šฅํ•ด์ง‘๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ถ„์„: ์ด๊ฒƒ์œผ๋กœ ๋์ผ๊นŒ?

์ด์ œ ํ…์„œ ์ฝ”์–ด๊ฐ€ ๊ด€์šฉ์  ํƒ€์ผ๋ง ๋ฐฉ์‹ ๋Œ€๋น„ ์•ฝ์†๋œ ์„ฑ๋Šฅ ์šฐ์œ„๋ฅผ ์‹ค์ œ๋กœ ์ œ๊ณตํ•˜๋Š”์ง€ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง์šฉ ๋นŒ๋“œ

uv run mojo build problems/p33/p33.mojo -o problems/p33/p33_profiler
pixi run mojo build problems/p33/p33.mojo -o problems/p33/p33_profiler

NVIDIA Nsight Compute๋กœ ํ”„๋กœํŒŒ์ผ๋ง (NVIDIA ์ „์šฉ)

๋จผ์ € ncu์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด CUDA ํ™˜๊ฒฝ์— ์ง„์ž…ํ•ฉ๋‹ˆ๋‹ค:

# Enter CUDA environment
pixi shell -e nvidia

# Profile tensor core version
ncu --set full --metrics sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed,smsp__inst_executed_pipe_tensor_op_hmma.sum ./problems/p33/p33_profiler --tensor-core

# Profile tiled version for comparison
ncu --set full --metrics sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed ./problems/p33/p33_profiler --tiled

๋น„๊ตํ•  ํ•ต์‹ฌ ๋ฉ”ํŠธ๋ฆญ

์„ฑ๋Šฅ ๋ฉ”ํŠธ๋ฆญ:

  • Duration: ์ „์ฒด kernel ์‹คํ–‰ ์‹œ๊ฐ„ (๋‚ฎ์„์ˆ˜๋ก ์ข‹์Œ)
  • SM Active %: SM ํ™œ์šฉ๋ฅ  (๋†’์„์ˆ˜๋ก ์ข‹์Œ)
  • DRAM Throughput: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ  (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฌ๋ถ€๋ฅผ ๋ณด์—ฌ์คŒ)
  • Tensor Op Instructions: ์‹ค์ œ ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ ํšŸ์ˆ˜ (ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „์—๋งŒ ํ•ด๋‹น)

์ผ๋ฐ˜์ ์ธ ๊ฒฐ๊ณผ:

ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „ (๋” ๋А๋ฆผ):

  • Duration: ~13.9 ms (ํ›จ์”ฌ ๋А๋ฆผ!)
  • SM Active: 83.7% (์ข‹์€ ํ™œ์šฉ๋ฅ )
  • DRAM Throughput: 72.5% (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ!)
  • Occupancy: 26.3% (๋‚˜์จ - ๋ ˆ์ง€์Šคํ„ฐ์— ์˜ํ•ด ์ œํ•œ๋จ)
  • Tensor Op Instructions: 1,048,576 (ํ…์„œ ์ฝ”์–ด๊ฐ€ ๋™์ž‘ ์ค‘์ž„์„ ํ™•์ธ)

ํƒ€์ผ๋ง ๋ฒ„์ „ (๋” ๋น ๋ฆ„):

  • Duration: ~1.62 ms (8.6๋ฐฐ ๋น ๋ฆ„!)
  • SM Active: 98.0% (ํƒ์›”ํ•œ ํ™œ์šฉ๋ฅ )
  • DRAM Throughput: 1.7% (์˜ˆ์ƒ๋Œ€๋กœ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ)
  • Occupancy: 66.7% (ํ›จ์”ฌ ๋‚˜์Œ)
  • L2 Hit Rate: 96.9% vs 29.7% (ํ›จ์”ฌ ๋‚˜์€ ์บ์‹œ ์ง€์—ญ์„ฑ)

์™œ ํ…์„œ ์ฝ”์–ด๊ฐ€ ๋” ๋А๋ฆด๊นŒ?

  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ: 72% DRAM ์‚ฌ์šฉ๋Ÿ‰์€ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๋‚ฎ์€ ์ ์œ ์œจ: 26% vs 67% - ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰(์Šค๋ ˆ๋“œ๋‹น 68 vs 38)์ด ๋™์‹œ ์›Œํ”„ ์ˆ˜๋ฅผ ์ œํ•œํ•ฉ๋‹ˆ๋‹ค
  • ์บ์‹œ ๋ฏธ์Šค: 29% L2 ์ ์ค‘๋ฅ  vs 97%๋Š” ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋Œ: ์ตœ์ ํ™”๋˜์ง€ ์•Š์€ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์ธํ•œ ๋ฑ…ํฌ ์ถฉ๋Œ
  • ์‹คํ–‰ ์„ค์ •: ์ด ๋ฌธ์ œ ํฌ๊ธฐ์— ๋Œ€ํ•ด ์ตœ์ ์ด ์•„๋‹Œ ๋ธ”๋ก/์›Œํ”„ ๊ตฌ์„ฑ

์„ฑ๋Šฅ์˜ ํ˜„์‹ค

프로파일링 결과에서 볼 수 있듯이, “전용 하드웨어”가 자동으로 빨라지는 것은 아닙니다! 텐서 코어 버전은 단순한 타일링 방식보다 상당히 느립니다(~8.6배). 이는 GPU 최적화에서 흔히 볼 수 있는 현실입니다. 하드웨어의 원시 성능이 곧 더 나은 성능을 보장하지는 않습니다.

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ: 72% DRAM ์‚ฌ์šฉ๋Ÿ‰์€ ํ…์„œ ์ฝ”์–ด๊ฐ€ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๋‚ฎ์€ ์ ์œ ์œจ: ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์œผ๋กœ ์ธํ•ด 26% vs 67%๋กœ ๋™์‹œ ์›Œํ”„ ์ˆ˜๊ฐ€ ์ œํ•œ๋ฉ๋‹ˆ๋‹ค
  • ์บ์‹œ ๋ฏธ์Šค: 29% vs 97% L2 ์ ์ค‘๋ฅ ์€ ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๋ฆฌ์†Œ์Šค ๋‚ญ๋น„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ๊ณผ ์ตœ์ ์ด ์•„๋‹Œ ์‹คํ–‰ ์„ค์ •

๊ตํ›ˆ: ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์„ ์ดํ•ดํ•˜๊ณ  ์ฒด๊ณ„์ ์œผ๋กœ ์ตœ์ ํ™”ํ•˜๋Š” ๊ฒƒ์ด โ€œ์ตœ์‹ ์˜ ๊ฐ€์žฅ ๋›ฐ์–ด๋‚œโ€ API๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋“œ์›จ์–ด ๊ธฐ๋Šฅ์€ ์„ธ์‹ฌํ•œ ํŠœ๋‹์ด ํ•„์š”ํ•œ ๋„๊ตฌ์ด์ง€, ๋งˆ๋ฒ•์˜ ์€ํƒ„ํ™˜์ด ์•„๋‹™๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

보람 있는 GPU 최적화 도전을 할 준비가 되셨나요? 🎯 성능 보너스 챌린지로 이동하여, 메모리 바운드 상태인 텐서 코어 구현을 단순 타일링 버전을 실제로 능가하도록 만드는 방법을 배워보세요!

๐ŸŽฏ ์„ฑ๋Šฅ ๋ณด๋„ˆ์Šค ์ฑŒ๋ฆฐ์ง€

๋ฐœ๊ฒฌ

Puzzle 33์„ ์™„๋ฃŒํ•˜๊ณ  Mojo์˜ TensorCore API๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹ค์ œ ํ…์„œ ์ฝ”์–ด ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ตฌํ˜„์€ ์ •ํ™•ํ•˜๊ฒŒ ๋™์ž‘ํ•˜๊ณ , ๋ชจ๋“  ์ •ํ™•๋„ ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๋ฉฐ, ์‹ค์ œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ Puzzle 16์˜ ํƒ€์ผ๋ง ๋ฒ„์ „๊ณผ ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ๋น„๊ตํ•˜๋ฉดโ€ฆ

“전용 하드웨어”가 엄청나게 더 느립니다!

๋ฌด์—‡์ด ์ž˜๋ชป๋œ ๊ฑธ๊นŒ?

(NVIDIA ์ „์šฉ) ncu๋ฅผ ์‚ฌ์šฉํ•œ ํ”„๋กœํŒŒ์ผ๋ง์ด ๋ƒ‰ํ˜นํ•œ ํ˜„์‹ค์„ ๋“œ๋Ÿฌ๋ƒˆ์Šต๋‹ˆ๋‹ค (ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋ฒ•์„ ๋ณต์Šตํ•˜๋ ค๋ฉด Puzzle 10์˜ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜ ํƒ์ง€์™€ Puzzle 30์˜ GPU ํ”„๋กœํŒŒ์ผ๋ง์„ ์ฐธ๊ณ ํ•˜์„ธ์š”):

ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „ (๊ธฐ๋Œ€์— ๋ชป ๋ฏธ์นจ):

  • Duration: ~13.9 ms
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ: 72.5% DRAM ์ฒ˜๋ฆฌ๋Ÿ‰ (์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์—ฌ์•ผ ํ•˜๋Š”๋ฐ!)
  • ๋‚ฎ์€ ์ ์œ ์œจ: 26.3% (ํ•˜๋“œ์›จ์–ด ๋‚ญ๋น„)
  • ์บ์‹œ ์žฌ์•™: 29.7% L2 ์ ์ค‘๋ฅ 
  • ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ•: ์Šค๋ ˆ๋“œ๋‹น 68๊ฐœ ๋ ˆ์ง€์Šคํ„ฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋Œ: ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์„ฑ๋Šฅ์„ ํŒŒ๊ดด

ํƒ€์ผ๋ง ๋ฒ„์ „ (์Šน์ž):

  • Duration: ~1.62 ms (8.6๋ฐฐ ๋น ๋ฆ„!)
  • ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ: 1.7% DRAM ์ฒ˜๋ฆฌ๋Ÿ‰ (์˜ˆ์ƒ๋Œ€๋กœ)
  • ํƒ์›”ํ•œ ์ ์œ ์œจ: 66.7%
  • ์บ์‹œ ์นœํ™”์ : 96.9% L2 ์ ์ค‘๋ฅ 
  • ํšจ์œจ์ : ์Šค๋ ˆ๋“œ๋‹น 38๊ฐœ ๋ ˆ์ง€์Šคํ„ฐ
  • ๊น”๋”ํ•œ ๋ฉ”๋ชจ๋ฆฌ: ์œ ์˜๋ฏธํ•œ ๋ฑ…ํฌ ์ถฉ๋Œ ์—†์Œ

๋ƒ‰ํ˜นํ•œ ํ˜„์‹ค

์ด๋Š” GPU ์ตœ์ ํ™”์—์„œ ํ”ํ•œ ์ด์•ผ๊ธฐ์ž…๋‹ˆ๋‹ค: ํ•˜๋“œ์›จ์–ด์˜ ์›์‹œ ์„ฑ๋Šฅ โ‰  ์‹ค์ œ ์„ฑ๋Šฅ. ํ…์„œ ์ฝ”์–ด๋Š” ๋†€๋ž๋„๋ก ๊ฐ•๋ ฅํ•˜์ง€๋งŒ, ๋™์‹œ์— ์š”๊ตฌ์‚ฌํ•ญ๋„ ๋†€๋ž๋„๋ก ๊นŒ๋‹ค๋กญ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฒฝ: ์—ฐ์‚ฐ์ด ๋„ˆ๋ฌด ๋นจ๋ผ์„œ ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ์ด ๋“œ๋Ÿฌ๋‚จ
  • ๋ฆฌ์†Œ์Šค ํƒ์‹: ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์ด ์ ์œ ์œจ์„ ์ €ํ•˜์‹œํ‚ด
  • ์ ‘๊ทผ ํŒจํ„ด ๋ฏผ๊ฐ: ๋‚˜์œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ์บ์‹œ ๋™์ž‘์„ ํŒŒ๊ดดํ•จ
  • ์„ค์ •์ด ํ•ต์‹ฌ: ์‹คํ–‰ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์™„๋ฒฝํ•˜๊ฒŒ ํŠœ๋‹ํ•ด์•ผ ํ•จ

๋ฏธ์…˜: ํ…์„œ ์ฝ”์–ด ์„ฑ๋Šฅ ๊ฐœ์„ ํ•˜๊ธฐ

도전 과제: 메모리 바운드이면서 점유율이 낮은 텐서 코어 구현을, 단순 타일링 버전을 실제로 능가하는 구현으로 바꾸세요.

์ด๊ฒจ์•ผ ํ•  ๊ธฐ์ค€:

  • ๋ชฉํ‘œ Duration: < 1.62 ms
  • ์ ์œ ์œจ: > 26.3% ๊ธฐ์ค€์„ 
  • DRAM ๋ถ€ํ•˜: < 72.5% ๊ธฐ์ค€์„ 
  • ์บ์‹œ ์„ฑ๋Šฅ: > 29.7% L2 ์ ์ค‘๋ฅ  ๊ธฐ์ค€์„ 

ํƒ๊ตฌํ•  ์ตœ์ ํ™” ์ „๋žต:

  1. ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ• ์ค„์ด๊ธฐ

    • ๋” ์ž‘์€ ๋ˆ„์‚ฐ๊ธฐ ํƒ€์ผ ์‚ฌ์šฉ
    • ์ค‘๊ฐ„ ์ €์žฅ ๊ณต๊ฐ„ ์ตœ์†Œํ™”
    • ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ๊ณ ๋ ค
    • ํšจ์œจ์ ์ธ ๋ˆ„์  ํŒจํ„ด์€ Puzzle 16์˜ ํƒ€์ผ๋ง ๋ฐฉ์‹ ์ฐธ๊ณ 
  2. ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด ์ตœ์ ํ™”

    • ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจ๋”ฉ ์ถ”๊ฐ€ (๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋… ์ฐธ๊ณ )
    • copy_dram_to_sram_async ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”
    • ๋ณ‘ํ•ฉ ํŒจํ„ด ๊ฐœ์„  (์ดˆ๋ฐ˜ ํผ์ฆ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ธฐ์ดˆ ์ฐธ๊ณ )
  3. ์ ์œ ์œจ ๊ฐœ์„ 

    • ๋” ๋‚˜์€ ์›Œํ”„ ํ™œ์šฉ์„ ์œ„ํ•œ ๋ธ”๋ก ํฌ๊ธฐ ํŠœ๋‹
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ vs ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰ ๊ท ํ˜• ๋งž์ถ”๊ธฐ
    • ์›Œํ”„-SM ๋งคํ•‘ ์ตœ์ ํ™”
    • Puzzle 11-20 ์‹œ๋ฆฌ์ฆˆ์˜ ์Šค๋ ˆ๋“œ ์กฐ์ • ๊ตํ›ˆ ์ ์šฉ
  4. ์บ์‹œ ์ตœ์ ํ™”

    • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ํŒจํ„ด ๊ฐœ์„ 
    • ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์— ๋งž๋Š” ํƒ€์ผ ํฌ๊ธฐ ์ตœ์ ํ™”
    • ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ ๋ณ€ํ™˜ ๊ณ ๋ ค
    • ์ด์ „ ํผ์ฆ ๊ณผ์ •์˜ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ๊ฐœ๋… ํ™œ์šฉ
  5. ๊ณ ๊ธ‰ ๊ธฐ๋ฒ•

    • ๋ฉ”๋ชจ๋ฆฌ์™€ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜๊ธฐ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง ๊ตฌํ˜„
    • ์†Œํ”„ํŠธ์›จ์–ด ํŒŒ์ดํ”„๋ผ์ด๋‹ ์‚ฌ์šฉ
    • ๋น„๋™๊ธฐ ์‹คํ–‰ ํŒจํ„ด ํƒ๊ตฌ
    • ์ƒˆ๋‹ˆํƒ€์ด์ € ํผ์ฆ์˜ ๊ณ ๊ธ‰ ์กฐ์ • ๊ธฐ๋ฒ• ์ ์šฉ

์„ฑ๊ณต ๊ธฐ์ค€

  • ์ •ํ™•์„ฑ: ๋ชจ๋“  ์ •ํ™•๋„ ํ…Œ์ŠคํŠธ๊ฐ€ ์—ฌ์ „ํžˆ ํ†ต๊ณผ
  • ์„ฑ๋Šฅ: ํ…์„œ ์ฝ”์–ด Duration < 1.62 ms
  • ํšจ์œจ์„ฑ: ๋” ๋†’์€ ์ ์œ ์œจ (>26.3%)
  • ๋ฉ”๋ชจ๋ฆฌ: ๋” ๋‚ฎ์€ DRAM ๋ถ€ํ•˜ (<72.5%)
  • ์บ์‹œ: ๋” ๋†’์€ ์ ์ค‘๋ฅ  (>29.7% L2)

๋” ๊นŠ์€ ๊ตํ›ˆ

์ด ๋ณด๋„ˆ์Šค ์ฑŒ๋ฆฐ์ง€๋Š” GPU ์ตœ์ ํ™”์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ตํ›ˆ์„ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค: ๋ณ‘๋ชฉ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ตœ์‹  API๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

๋ชฉํ‘œ๋Š” ๋‹จ์ˆœํžˆ ํ…์„œ ์ฝ”์–ด๋ฅผ ๋” ๋น ๋ฅด๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ์•„๋‹™๋‹ˆ๋‹ค - ํ…์„œ ์ฝ”์–ด๊ฐ€ ์™œ ๋” ๋А๋ ค์งˆ ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•˜๊ณ , ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ง„๋‹จํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์šฐ๊ณ , ์›์น™์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด ์ฑŒ๋ฆฐ์ง€๋ฅผ ์™„์ˆ˜ํ•˜๋ฉด, ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•˜๋“œ์›จ์–ด ๊ธฐ๋Šฅ๊ณผ ๊ด€๊ณ„์—†์ด ์–ด๋–ค GPU ์›Œํฌ๋กœ๋“œ๋“  ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ์—ญ๋Ÿ‰์„ ๊ฐ–์ถ”๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Puzzle 34: GPU ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (SM90+)

์†Œ๊ฐœ

ํ•˜๋“œ์›จ์–ด ์š”๊ตฌ์‚ฌํ•ญ: โš ๏ธ NVIDIA SM90+ ์ „์šฉ

์ด ํผ์ฆ์€ SM90+ ์ปดํ“จํŠธ ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ˜ NVIDIA Hopper ์•„ํ‚คํ…์ฒ˜ (H100, H200) ์ด์ƒ์˜ GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ API๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ฐ˜์ด๋ฉฐ, ์ง€์›ํ•˜์ง€ ์•Š๋Š” ํ•˜๋“œ์›จ์–ด์—์„œ๋Š” ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์šฉ ์ค‘์ธ ์•„ํ‚คํ…์ฒ˜๊ฐ€ ํ™•์‹คํ•˜์ง€ ์•Š๋‹ค๋ฉด pixi run gpu-specs๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์ตœ์†Œ Compute Cap: 9.0 ์ด์ƒ์ธ์ง€ ํ™•์ธํ•˜์„ธ์š” (ํ•˜๋“œ์›จ์–ด ์‹๋ณ„์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”)

์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 24-26) ์—์„œ ๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 27) ๊นŒ์ง€์˜ ์—ฌ์ •์„ ์ด์–ด, ์ด์ œ ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค - ๋‹จ์ผ ๋ธ”๋ก์˜ ํ•œ๊ณ„๋ฅผ ๋„˜์–ด์„œ๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ์กฐ์ •ํ•˜๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๋ธ”๋ก ํด๋Ÿฌ์Šคํ„ฐ๋ž€?

์Šค๋ ˆ๋“œ ๋ธ”๋ก ํด๋Ÿฌ์Šคํ„ฐ๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋™๊ธฐํ™” ๋ฐ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ•˜๋‚˜์˜ ์—ฐ์‚ฐ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ํ˜์‹ ์ ์ธ SM90+ ๊ธฐ๋Šฅ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ธฐ๋Šฅ:

  • ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”: cluster_sync, cluster_arrive, cluster_wait๋กœ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค
  • ๋ธ”๋ก ์‹๋ณ„: block_rank_in_cluster๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ์œ ํ•œ ๋ธ”๋ก ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ํšจ์œจ์ ์ธ ์กฐ์ •: elect_one_sync๋กœ ์ตœ์ ํ™”๋œ ์›Œํ”„ ์ˆ˜์ค€ ํ˜‘๋ ฅ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค
  • ๊ณ ๊ธ‰ ํŒจํ„ด: cluster_mask_base๋กœ ์„ ํƒ์  ๋ธ”๋ก ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

๊ธฐ์กด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ

Grid (Multiple Blocks)
โ”œโ”€โ”€ Block (Multiple Warps) - barrier() synchronization
    โ”œโ”€โ”€ Warp (32 Threads) - SIMT lockstep execution
    โ”‚   โ”œโ”€โ”€ Lane 0  โ”€โ”
    โ”‚   โ”œโ”€โ”€ Lane 1   โ”‚ All execute same instruction
    โ”‚   โ”œโ”€โ”€ Lane 2   โ”‚ at same time (SIMT)
    โ”‚   โ”‚   ...      โ”‚ warp.sum(), warp.broadcast()
    โ”‚   โ””โ”€โ”€ Lane 31 โ”€โ”˜
        โ””โ”€โ”€ Thread (SIMD operations within each thread)

새로운 계층이 추가된 클러스터 프로그래밍 계층 구조:

Grid (Multiple Clusters)
โ”œโ”€โ”€ ๐Ÿ†• Cluster (Multiple Blocks) - cluster_sync(), cluster_arrive()
    โ”œโ”€โ”€ Block (Multiple Warps) - barrier() synchronization
        โ”œโ”€โ”€ Warp (32 Threads) - SIMT lockstep execution
        โ”‚   โ”œโ”€โ”€ Lane 0  โ”€โ”
        โ”‚   โ”œโ”€โ”€ Lane 1   โ”‚ All execute same instruction
        โ”‚   โ”œโ”€โ”€ Lane 2   โ”‚ at same time (SIMT)
        โ”‚   โ”‚   ...      โ”‚ warp.sum(), warp.broadcast()
        โ”‚   โ””โ”€โ”€ Lane 31 โ”€โ”˜
            โ””โ”€โ”€ Thread (SIMD operations within each thread)

์‹คํ–‰ ๋ชจ๋ธ ์ƒ์„ธ:

  • ์Šค๋ ˆ๋“œ ๋ ˆ๋ฒจ: ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ์˜ SIMD ์—ฐ์‚ฐ
  • ์›Œํ”„ ๋ ˆ๋ฒจ: SIMT ์‹คํ–‰ - 32๊ฐœ ์Šค๋ ˆ๋“œ์˜ ๋ก์Šคํ… ์กฐ์ •
  • ๋ธ”๋ก ๋ ˆ๋ฒจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ํ™œ์šฉํ•œ ๋ฉ€ํ‹ฐ ์›Œํ”„ ์กฐ์ •
  • ๐Ÿ†• ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: SM90+ ํด๋Ÿฌ์Šคํ„ฐ API๋ฅผ ํ™œ์šฉํ•œ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •

ํ•™์Šต ๋‹จ๊ณ„

์ด ํผ์ฆ์€ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ญ๋Ÿ‰์„ ์ฒด๊ณ„์ ์œผ๋กœ ์Œ“์•„๊ฐ€๋Š” 3๋‹จ๊ณ„ ๊ตฌ์„ฑ์œผ๋กœ ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ”ฐ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ • ๊ธฐ์ดˆ

ํ•ต์‹ฌ: ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด์˜ ๊ธฐ๋ณธ ์ดํ•ด

์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด cluster_arrive()์™€ cluster_wait()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ๋ณธ์ ์ธ ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ๊ณผ ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๋ฅผ ์œ„ํ•ด ์‹คํ–‰์„ ์กฐ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

์ฃผ์š” API: block_rank_in_cluster(), cluster_arrive(), cluster_wait()


โ˜ธ๏ธ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ

ํ•ต์‹ฌ: ๋ธ”๋ก ๋ ˆ๋ฒจ ํŒจํ„ด์„ ํด๋Ÿฌ์Šคํ„ฐ ๊ทœ๋ชจ๋กœ ํ™•์žฅ

์ต์ˆ™ํ•œ block.sum() ๊ฐœ๋…์„ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์ณ ํ™•์žฅํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ์—ฐ์‚ฐ์„ ์กฐ์ •ํ•˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ๋ฆฌ๋•์…˜๊ณผ ์ง‘ํ•ฉ ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

์ฃผ์š” API: cluster_sync(), ํšจ์œจ์ ์ธ ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ์œ„ํ•œ elect_one_sync()


๐Ÿง  ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜

ํ•ต์‹ฌ: ํ”„๋กœ๋•์…˜ ์ˆ˜์ค€์˜ ๋‹ค๋‹จ๊ณ„ ์กฐ์ • ํŒจํ„ด

GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ณ  ๋ณต์žกํ•œ ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์›Œํ”„ ๋ ˆ๋ฒจ, ๋ธ”๋ก ๋ ˆ๋ฒจ, ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ์˜ ์กฐ์ •์„ ๊ฒฐํ•ฉํ•˜๋Š” ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์š” API: elect_one_sync(), cluster_arrive(), ๊ณ ๊ธ‰ ์กฐ์ • ํŒจํ„ด

ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์ค‘์š”ํ•œ ์ด์œ 

๋ฌธ์ œ ๊ทœ๋ชจ: ํ˜„๋Œ€ AI ๋ฐ ๊ณผํ•™ ์›Œํฌ๋กœ๋“œ๋Š” ๋‹จ์ผ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ๋Šฅ๋ ฅ์„ ์ดˆ๊ณผํ•˜๋Š” ์—ฐ์‚ฐ์„ ํ•„์š”๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค:

ํ•˜๋“œ์›จ์–ด ๋ฐœ์ „: GPU๊ฐ€ ๋” ๋งŽ์€ ์—ฐ์‚ฐ ์œ ๋‹›์„ ๊ฐ–์ถ”๊ฒŒ ๋จ์— ๋”ฐ๋ผ (Puzzle 30์˜ GPU ์•„ํ‚คํ…์ฒ˜ ํ”„๋กœํŒŒ์ผ๋ง ์ฐธ๊ณ ), ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ฐจ์„ธ๋Œ€ ํ•˜๋“œ์›จ์–ด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ๋ฐ ํ•„์ˆ˜์ ์ด ๋ฉ๋‹ˆ๋‹ค.

๊ต์œก์  ๊ฐ€์น˜

์ด ํผ์ฆ์„ ์™„๋ฃŒํ•˜๋ฉด ์™„์ „ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

์ด ๊ณผ์ •์€ Puzzle 30-32์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ฐจ์„ธ๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ๋„์ „์— ๋Œ€๋น„ํ•  ์ˆ˜ ์žˆ๋„๋ก ์ค€๋น„์‹œ์ผœ ์ค๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

์„ ์ˆ˜ ์กฐ๊ฑด:

๊ถŒ์žฅ ํ•™์Šต ๋ฐฉ๋ฒ•: 3๋‹จ๊ณ„ ๊ตฌ์„ฑ์„ ์ˆœ์„œ๋Œ€๋กœ ๋”ฐ๋ผ๊ฐ€์„ธ์š”. ๊ฐ ๋‹จ๊ณ„๊ฐ€ ๋‹ค์Œ ๋‹จ๊ณ„์˜ ๋ณต์žก์„ฑ์„ ์œ„ํ•œ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.

ํ•˜๋“œ์›จ์–ด ์ฐธ๊ณ : SM90+ ์ด์™ธ์˜ ํ•˜๋“œ์›จ์–ด์—์„œ ์‹คํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ, ์ด ํผ์ฆ์€ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…๊ณผ API ์‚ฌ์šฉ ํŒจํ„ด์˜ ๊ต์œก์  ์˜ˆ์ œ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋ฏธ๋ž˜๋ฅผ ๋ฐฐ์šธ ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ • ๊ธฐ์ดˆ ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ๊ธฐ๋ณธ์ ์ธ ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ฐฐ์›Œ๋ณด์„ธ์š”!

๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ • ๊ธฐ์ดˆ

๊ฐœ์š”

์ฒซ ๋ฒˆ์งธ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋„์ „์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ์„น์…˜์—์„œ๋Š” SM90+ ํด๋Ÿฌ์Šคํ„ฐ API๋ฅผ ์‚ฌ์šฉํ•œ ๋ธ”๋ก ๊ฐ„ ์กฐ์ •์˜ ๊ธฐ๋ณธ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: 4๊ฐœ์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ์กฐ์ •ํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ  ์ถœ๋ ฅ ๋ฐฐ์—ด์— ์ €์žฅํ•˜๋Š” ๋ฉ€ํ‹ฐ ๋ธ”๋ก ํžˆ์Šคํ† ๊ทธ๋žจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต: cluster_arrive() โ†’ ์ฒ˜๋ฆฌ โ†’ cluster_wait()๋ผ๋Š” ํ•„์ˆ˜์ ์ธ ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค. Puzzle 29์˜ barrier()์—์„œ ๋ฐฐ์šด ๋™๊ธฐํ™” ๊ฐœ๋…์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฌธ์ œ: ๋ฉ€ํ‹ฐ ๋ธ”๋ก ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜

Puzzle 27๊ณผ ๊ฐ™์€ ๊ธฐ์กด์˜ ๋‹จ์ผ ๋ธ”๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ•˜๋‚˜์˜ ๋ธ”๋ก์ด ๊ฐ€์ง„ ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰(์˜ˆ: 256๊ฐœ ์Šค๋ ˆ๋“œ) ๋‚ด์— ๋“ค์–ด์˜ค๋Š” ๋ฐ์ดํ„ฐ๋งŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์„ ์ดˆ๊ณผํ•˜๋Š” ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฝ์šฐ, ์—ฌ๋Ÿฌ ๋ธ”๋ก์ด ํ˜‘๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ: 4๊ฐœ ๋ธ”๋ก ๊ฐ๊ฐ์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , ๊ณ ์œ ํ•œ ๋ธ”๋ก ์ˆœ์œ„๋กœ ๊ฐ’์„ ์Šค์ผ€์ผ๋งํ•˜๋ฉฐ, Puzzle 29์˜ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ๋ธ”๋ก๋“ค๊ณผ ์กฐ์ •ํ•จ์œผ๋กœ์จ ๋ชจ๋“  ๋ธ”๋ก์˜ ์ฒ˜๋ฆฌ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ฝ์„ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

๋ฌธ์ œ ๋ช…์„ธ

๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ:

  • Block 0: ์š”์†Œ 0-255๋ฅผ ์ฒ˜๋ฆฌ, 1๋ฐฐ ์Šค์ผ€์ผ๋ง
  • Block 1: ์š”์†Œ 256-511์„ ์ฒ˜๋ฆฌ, 2๋ฐฐ ์Šค์ผ€์ผ๋ง
  • Block 2: ์š”์†Œ 512-767์„ ์ฒ˜๋ฆฌ, 3๋ฐฐ ์Šค์ผ€์ผ๋ง
  • Block 3: ์š”์†Œ 768-1023์„ ์ฒ˜๋ฆฌ, 4๋ฐฐ ์Šค์ผ€์ผ๋ง

์กฐ์ • ์š”๊ตฌ์‚ฌํ•ญ:

  1. ๊ฐ ๋ธ”๋ก์€ cluster_arrive()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ ค์•ผ ํ•ฉ๋‹ˆ๋‹ค
  2. ๋ชจ๋“  ๋ธ”๋ก์€ cluster_wait()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ๋ธ”๋ก์„ ๊ธฐ๋‹ค๋ ค์•ผ ํ•ฉ๋‹ˆ๋‹ค
  3. ์ตœ์ข… ์ถœ๋ ฅ์€ ๊ฐ ๋ธ”๋ก์˜ ์ฒ˜๋ฆฌ๋œ ํ•ฉ๊ณ„๋ฅผ 4๊ฐœ ์š”์†Œ ๋ฐฐ์—ด๋กœ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค

์„ค์ •

  • ๋ฌธ์ œ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ (1D ๋ฐฐ์—ด)
  • ๋ธ”๋ก ์„ค์ •: TPB = 256 ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (256, 1)
  • ๊ทธ๋ฆฌ๋“œ ์„ค์ •: CLUSTER_SIZE = 4 ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜ (4, 1)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ์ž…๋ ฅ Layout.row_major(SIZE), ์ถœ๋ ฅ Layout.row_major(CLUSTER_SIZE)

์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋ถ„๋ฐฐ:

  • Block 0: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 0-255
  • Block 1: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 256-511
  • Block 2: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 512-767
  • Block 3: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 768-1023

์™„์„ฑํ•  ์ฝ”๋“œ

fn cluster_coordination_basics[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Real cluster coordination using SM90+ cluster APIs."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Check what's happening with cluster ranks
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    shared_data = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # FIX: Use block_idx.x for data distribution instead of cluster rank
    # Each block should process different portions of the data
    var data_scale = Float32(
        block_id + 1
    )  # Use block_idx instead of cluster rank

    # Phase 1: Each block processes its portion
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Phase 2: Use cluster_arrive() for inter-block coordination
    # Signal this block has completed processing

    # FILL IN 1 line here

    # Block-level aggregation (only thread 0)
    if local_i == 0:
        # FILL IN 4 line here
        ...

    # Wait for all blocks in cluster to complete

    # FILL IN 1 line here


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p34/p34.mojo

ํŒ

๋ธ”๋ก ์‹๋ณ„ ํŒจํ„ด

  • block_rank_in_cluster()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ ์ˆœ์œ„(0-3)๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค
  • ๊ทธ๋ฆฌ๋“œ ์‹คํ–‰์—์„œ ์•ˆ์ •์ ์ธ ๋ธ”๋ก ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด Int(block_idx.x)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๋ธ”๋ก ์œ„์น˜์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๊ณ ์œ ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •

  • LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค (Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ ์ฐธ๊ณ )
  • block_id + 1๋กœ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๋ธ”๋ก๋งˆ๋‹ค ๊ณ ์œ ํ•œ ์Šค์ผ€์ผ๋ง์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ์‹œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (Puzzle 3์˜ ๊ฐ€๋“œ ํŒจํ„ด)

ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด

  1. ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ ์˜์—ญ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  2. ์‹ ํ˜ธ: cluster_arrive()๋กœ ์ฒ˜๋ฆฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  3. ์—ฐ์‚ฐ: ๋ธ”๋ก ๋‚ด๋ถ€ ์—ฐ์‚ฐ (๋ฆฌ๋•์…˜, ์ง‘๊ณ„)
  4. ๋Œ€๊ธฐ: cluster_wait()๋กœ ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค

๋ธ”๋ก ๋‚ด๋ถ€ ์Šค๋ ˆ๋“œ ์กฐ์ •

  • ํด๋Ÿฌ์Šคํ„ฐ ์—ฐ์‚ฐ ์ „์— ๋ธ”๋ก ๋‚ด๋ถ€ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด barrier()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (Puzzle 29์˜ ๋ฐฐ๋ฆฌ์–ด ๊ฐœ๋…)
  • ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋กํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด)
  • ์•ˆ์ •์ ์ธ ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด ๊ฒฐ๊ณผ๋ฅผ output[block_id]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

์ฝ”๋“œ ์‹คํ–‰

pixi run p34 --coordination
uv run poe p34 --coordination

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Multi-Block Coordination
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Block coordination results:
  Block 0 : 127.5
  Block 1 : 255.0
  Block 2 : 382.5
  Block 3 : 510.0
โœ… Multi-block coordination tests passed!

์„ฑ๊ณต ๊ธฐ์ค€:

  • 4๊ฐœ ๋ธ”๋ก ๋ชจ๋‘ 0์ด ์•„๋‹Œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฐ๊ณผ๊ฐ€ ์Šค์ผ€์ผ๋ง ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: Block 1 > Block 0, Block 2 > Block 1 ๋“ฑ
  • ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ์กฐ์ • ์‹คํŒจ๊ฐ€ ์—†์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

์†”๋ฃจ์…˜

fn cluster_coordination_basics[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Real cluster coordination using SM90+ cluster APIs."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = thread_idx.x

    # Check what's happening with cluster ranks
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    shared_data = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # FIX: Use block_idx.x for data distribution instead of cluster rank
    # Each block should process different portions of the data
    var data_scale = Float32(
        block_id + 1
    )  # Use block_idx instead of cluster rank

    # Phase 1: Each block processes its portion
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Phase 2: Use cluster_arrive() for inter-block coordination
    cluster_arrive()  # Signal this block has completed processing

    # Block-level aggregation (only thread 0)
    if local_i == 0:
        var block_sum: Float32 = 0.0
        for i in range(tpb):
            block_sum += shared_data[i][0]
        # FIX: Store result at block_idx position (guaranteed unique per block)
        output[block_id] = block_sum

    # Wait for all blocks in cluster to complete
    cluster_wait()


ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ํ’€์ด๋Š” ์‹ ์ค‘ํ•˜๊ฒŒ ์„ค๊ณ„๋œ 2๋‹จ๊ณ„ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํ†ตํ•ด ๊ธฐ๋ณธ์ ์ธ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ๋…๋ฆฝ์  ๋ธ”๋ก ์ฒ˜๋ฆฌ

์Šค๋ ˆ๋“œ ๋ฐ ๋ธ”๋ก ์‹๋ณ„:

global_i = block_dim.x * block_idx.x + thread_idx.x  # Global thread index
local_i = thread_idx.x                               # Local thread index within block
my_block_rank = Int(block_rank_in_cluster())         # Cluster rank (0-3)
block_id = Int(block_idx.x)                          # Block index for reliable addressing

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ:

  • ๊ฐ ๋ธ”๋ก์ด ์ž์ฒด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ž‘์—… ๊ณต๊ฐ„์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค: LayoutTensor[dtype, Layout.row_major(tpb), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()
  • ์Šค์ผ€์ผ๋ง ์ „๋žต: data_scale = Float32(block_id + 1)๋กœ ๊ฐ ๋ธ”๋ก์ด ๋‹ค๋ฅด๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค
    • Block 0: 1.0๋ฐฐ, Block 1: 2.0๋ฐฐ, Block 2: 3.0๋ฐฐ, Block 3: 4.0๋ฐฐ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: if global_i < size:๋กœ ๋ฒ”์œ„ ๋ฐ– ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: shared_data[local_i] = input[global_i] * data_scale๋กœ ๋ธ”๋ก๋ณ„ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์Šค์ผ€์ผ๋งํ•ฉ๋‹ˆ๋‹ค

๋ธ”๋ก ๋‚ด๋ถ€ ๋™๊ธฐํ™”:

  • barrier()๋Š” ๊ฐ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์„ ์™„๋ฃŒํ•œ ํ›„์—์•ผ ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ์ง„ํ–‰ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ๊ณผ ์ดํ›„์˜ ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ์‚ฌ์ด์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค

2๋‹จ๊ณ„: ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •

๋ธ”๋ก ๊ฐ„ ์‹ ํ˜ธ:

  • cluster_arrive()๋Š” ์ด ๋ธ”๋ก์ด ๋กœ์ปฌ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ–ˆ์Œ์„ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ํ•˜๋“œ์›จ์–ด์— ์™„๋ฃŒ๋ฅผ ๋“ฑ๋กํ•˜๋Š” ๋…ผ๋ธ”๋กœํ‚น ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค

๋กœ์ปฌ ์ง‘๊ณ„ (์Šค๋ ˆ๋“œ 0๋งŒ):

if local_i == 0:
    var block_sum: Float32 = 0.0
    for i in range(tpb):
        block_sum += shared_data[i][0]  # Sum all elements in shared memory
    output[block_id] = block_sum        # Store result at unique block position

  • 경쟁 상태를 피하기 위해 스레드 0만 합산을 수행합니다
  • output[block_id]์— ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•˜์—ฌ ๊ฐ ๋ธ”๋ก์ด ๊ณ ์œ ํ•œ ์œ„์น˜์— ๊ธฐ๋กํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค

์ตœ์ข… ๋™๊ธฐํ™”:

  • cluster_wait()๋Š” ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๋ชจ๋“  ๋ธ”๋ก์ด ์ž‘์—…์„ ์™„๋ฃŒํ•  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค
  • ์ด๋ฅผ ํ†ตํ•ด ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ์— ๊ฑธ์ณ ๊ฒฐ์ •๋ก ์  ์™„๋ฃŒ ์ˆœ์„œ๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ๊ธฐ์ˆ  ์ธ์‚ฌ์ดํŠธ

์™œ my_block_rank ๋Œ€์‹  block_id๋ฅผ ์‚ฌ์šฉํ• ๊นŒ?

  • block_idx.x๋Š” ์•ˆ์ •์ ์ธ ๊ทธ๋ฆฌ๋“œ ์‹คํ–‰ ์ธ๋ฑ์‹ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค (0, 1, 2, 3)
  • block_rank_in_cluster()๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์„ค์ •์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • block_id๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ ๋ธ”๋ก์ด ๊ณ ์œ ํ•œ ๋ฐ์ดํ„ฐ ์˜์—ญ๊ณผ ์ถœ๋ ฅ ์œ„์น˜๋ฅผ ํ™•๋ณดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ input[global_i]๋ฅผ ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ์ฝ์Šต๋‹ˆ๋‹ค
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก ๋‚ด๋ถ€ ํ†ต์‹ ๊ณผ ์ง‘๊ณ„์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค
  • ์ถœ๋ ฅ ๋ฉ”๋ชจ๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด output[block_id]์— ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค

๋™๊ธฐํ™” ๊ณ„์ธต ๊ตฌ์กฐ:

  1. barrier(): ๊ฐ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ๋‚ด๋ถ€)
  2. cluster_arrive(): ๋‹ค๋ฅธ ๋ธ”๋ก์— ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค (๋ธ”๋ก ๊ฐ„, ๋…ผ๋ธ”๋กœํ‚น)
  3. cluster_wait(): ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ๊ฐ„, ๋ธ”๋กœํ‚น)

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์—ฐ์‚ฐ ๋ณต์žก๋„: ๋ธ”๋ก๋‹น ๋กœ์ปฌ ํ•ฉ์‚ฐ์— O(TPB), ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์— O(1)
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ๊ฐ ์ž…๋ ฅ ์š”์†Œ๋ฅผ ํ•œ ๋ฒˆ๋งŒ ์ฝ์œผ๋ฉฐ, ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์€ ์ตœ์†Œํ™”
  • ํ™•์žฅ์„ฑ: ํŒจํ„ด์ด ๋” ํฐ ํด๋Ÿฌ์Šคํ„ฐ ํฌ๊ธฐ์—๋„ ์ตœ์†Œํ•œ์˜ ์˜ค๋ฒ„ํ—ค๋“œ๋กœ ํ™•์žฅ ๊ฐ€๋Šฅ

ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์˜ ํ•ต์‹ฌ ํŒจํ„ด์€ ๋‹จ์ˆœํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. 1๋‹จ๊ณ„: ๊ฐ ๋ธ”๋ก์ด ํ• ๋‹น๋œ ๋ฐ์ดํ„ฐ ์˜์—ญ์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  2. ์‹ ํ˜ธ: cluster_arrive()๋กœ ์ฒ˜๋ฆฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  3. 2๋‹จ๊ณ„: ๋‹ค๋ฅธ ๋ธ”๋ก์˜ ๊ฒฐ๊ณผ์— ์˜์กดํ•˜๋Š” ์—ฐ์‚ฐ์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  4. ๋™๊ธฐํ™”: cluster_wait()๋กœ ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋œ ํ›„ ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค

๋‹ค์Œ ๋‹จ๊ณ„: ๋” ๊ณ ๊ธ‰ ์กฐ์ •์„ ๋ฐฐ์šธ ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ ์œผ๋กœ ์ด๋™ํ•˜์—ฌ Puzzle 27์˜ block.sum() ํŒจํ„ด์„ ํด๋Ÿฌ์Šคํ„ฐ ๊ทœ๋ชจ๋กœ ํ™•์žฅํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›Œ๋ณด์„ธ์š”. Puzzle 24์˜ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค!

โ˜ธ๏ธ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ

๊ฐœ์š”

์ด์ „ ์„น์…˜์˜ ๊ธฐ๋ณธ ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ๋ฐ”ํƒ•์œผ๋กœ, ์ด ๋„์ „์—์„œ๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค - Puzzle 27์—์„œ ์ตํžŒ block.sum ํŒจํ„ด์„ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์ณ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: 4๊ฐœ์˜ ์กฐ์ •๋œ ๋ธ”๋ก์— ๊ฑธ์ณ 1024๊ฐœ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , ๊ฐ ๋ธ”๋ก์˜ ๊ฐœ๋ณ„ ๋ฆฌ๋•์…˜์„ ํ•˜๋‚˜์˜ ์ „์—ญ ๊ฒฐ๊ณผ๋กœ ํ•ฉ์น˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต: ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ์œ„ํ•œ cluster_sync()์™€ ํšจ์œจ์ ์ธ ์ตœ์ข… ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ elect_one_sync()๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฌธ์ œ: ๋Œ€๊ทœ๋ชจ ์ „์—ญ ํ•ฉ์‚ฐ

๋‹จ์ผ ๋ธ”๋ก์€ (Puzzle 27์—์„œ ๋ฐฐ์› ๋“ฏ์ด) ์Šค๋ ˆ๋“œ ์ˆ˜์™€ Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์— ์˜ํ•ด ์ œํ•œ๋ฉ๋‹ˆ๋‹ค. ๋‹จ์ผ ๋ธ”๋ก ๋ฆฌ๋•์…˜์„ ๋„˜์–ด์„œ๋Š” ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์˜ ์ „์—ญ ํ†ต๊ณ„(ํ‰๊ท , ๋ถ„์‚ฐ, ํ•ฉ๊ณ„)๋ฅผ ๊ตฌํ•˜๋ ค๋ฉด ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ: ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ํ•ฉ์‚ฐ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•˜์„ธ์š”:

  1. ๊ฐ ๋ธ”๋ก์ด ๋กœ์ปฌ ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค (Puzzle 27์˜ block.sum()๊ณผ ์œ ์‚ฌ)
  2. Puzzle 29์˜ ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ธ”๋ก๋“ค์ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นฉ๋‹ˆ๋‹ค
  3. ์„ ์ถœ๋œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ์„ ์ถœ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ข… ์ „์—ญ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

๋ฌธ์ œ ๋ช…์„ธ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ๋ฆ„:

1๋‹จ๊ณ„ - ๋กœ์ปฌ ๋ฆฌ๋•์…˜ (๊ฐ ๋ธ”๋ก ๋‚ด๋ถ€): \[R_i = \sum_{j=0}^{TPB-1} input[i \times TPB + j] \quad \text{for block } i\]

2๋‹จ๊ณ„ - ์ „์—ญ ์ง‘๊ณ„ (ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด): \[\text{Global Sum} = \sum_{i=0}^{\text{CLUSTER_SIZE}-1} R_i\]

์กฐ์ • ์š”๊ตฌ์‚ฌํ•ญ:

  1. ๋กœ์ปฌ ๋ฆฌ๋•์…˜: ๊ฐ ๋ธ”๋ก์ด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์œผ๋กœ ๋ถ€๋ถ„ ํ•ฉ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  2. ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”: cluster_sync()๋กœ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๊ฐ€ ์ค€๋น„๋˜์—ˆ๋Š”์ง€ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  3. ์ตœ์ข… ์ง‘๊ณ„: ์„ ์ถœ๋œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นฉ๋‹ˆ๋‹ค

์„ค์ •

  • ๋ฌธ์ œ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ
  • ๋ธ”๋ก ์„ค์ •: TPB = 256 ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (256, 1)
  • ๊ทธ๋ฆฌ๋“œ ์„ค์ •: CLUSTER_SIZE = 4 ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜ (4, 1)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ์ž…๋ ฅ Layout.row_major(SIZE), ์ถœ๋ ฅ Layout.row_major(1)
  • ์ž„์‹œ ์ €์žฅ์†Œ: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ Layout.row_major(CLUSTER_SIZE)

์˜ˆ์ƒ ๊ฒฐ๊ณผ: ์ˆ˜์—ด 0, 0.01, 0.02, ..., 10.23์˜ ํ•ฉ = 523,776

์™„์„ฑํ•  ์ฝ”๋“œ

fn cluster_collective_operations[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    temp_storage: LayoutTensor[
        dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
    ],
    size: Int,
):
    """Cluster-wide collective operations using real cluster APIs."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # FILL IN (roughly 24 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p34/p34.mojo

ํŒ

๋กœ์ปฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด

ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ์ „๋žต

  • ์•ˆ์ •์ ์ธ ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ temp_storage[block_id]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด cluster_sync()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (arrive/wait๋ณด๋‹ค ๊ฐ•๋ ฅ)
  • ์ตœ์ข… ์ „์—ญ ์ง‘๊ณ„๋Š” ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

ํšจ์œจ์ ์ธ ์„ ์ถœ ํŒจํ„ด

  • ์ฒซ ๋ฒˆ์งธ ๋ธ”๋ก(my_block_rank == 0) ๋‚ด์—์„œ elect_one_sync()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํŒจํ„ด)
  • ์ค‘๋ณต ์—ฐ์‚ฐ์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ์ตœ์ข… ํ•ฉ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๊ฐ€ temp_storage์—์„œ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์ฝ์Šต๋‹ˆ๋‹ค (Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ์œ ์‚ฌ)

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ input[global_i]๋ฅผ ์ฝ์Šต๋‹ˆ๋‹ค (Puzzle 3์˜ ๊ฐ€๋“œ)
  • ๋ธ”๋ก ๋‚ด๋ถ€ ๋ฆฌ๋•์…˜์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ temp_storage[block_id]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” output[0]์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ์กฐ์ •์˜ ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด)

ํด๋Ÿฌ์Šคํ„ฐ API ์ฐธ์กฐ

gpu.primitives.cluster ๋ชจ๋“ˆ:

  • cluster_sync(): ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” - arrive/wait ํŒจํ„ด๋ณด๋‹ค ๊ฐ•๋ ฅ
  • elect_one_sync(): ํšจ์œจ์ ์ธ ์กฐ์ •์„ ์œ„ํ•ด ์›Œํ”„ ๋‚ด์—์„œ ๋‹จ์ผ ์Šค๋ ˆ๋“œ๋ฅผ ์„ ์ถœ
  • block_rank_in_cluster(): ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๊ณ ์œ ํ•œ ๋ธ”๋ก ์‹๋ณ„์ž๋ฅผ ๋ฐ˜ํ™˜

ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด

Puzzle 27์˜ ์ „ํ†ต์ ์ธ ๋‚ด์ ์—์„œ ๋ฐฐ์šด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๋– ์˜ฌ๋ ค ๋ณด์„ธ์š”:

Stride 128: [T0] += [T128], [T1] += [T129], [T2] += [T130], ...
Stride 64:  [T0] += [T64],  [T1] += [T65],  [T2] += [T66],  ...
Stride 32:  [T0] += [T32],  [T1] += [T33],  [T2] += [T34],  ...
Stride 16:  [T0] += [T16],  [T1] += [T17],  [T2] += [T18],  ...
...
Stride 1:   [T0] += [T1] โ†’ Final result at T0

์ด์ œ ์ด ํŒจํ„ด์„ ํด๋Ÿฌ์Šคํ„ฐ ๊ทœ๋ชจ๋กœ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค - ๊ฐ ๋ธ”๋ก์ด ํ•˜๋‚˜์˜ ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•œ ๋’ค, ๋ธ”๋ก ๊ฐ„์— ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

pixi run p34 --reduction
uv run poe p34 --reduction

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Cluster-Wide Reduction
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Expected sum: 523776.0
Cluster reduction result: 523776.0
Expected: 523776.0
Error: 0.0
โœ… Passed: Cluster reduction accuracy test
โœ… Cluster-wide collective operations tests passed!

์„ฑ๊ณต ๊ธฐ์ค€:

  • ์™„๋ฒฝํ•œ ์ •ํ™•๋„: ๊ฒฐ๊ณผ๊ฐ€ ์˜ˆ์ƒ ํ•ฉ๊ณ„(523,776)์™€ ์ •ํ™•ํžˆ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •: 4๊ฐœ ๋ธ”๋ก ๋ชจ๋‘๊ฐ€ ๋ถ€๋ถ„ ํ•ฉ์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค
  • ํšจ์œจ์ ์ธ ์ตœ์ข… ๋ฆฌ๋•์…˜: ์„ ์ถœ๋œ ๋‹จ์ผ ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

์†”๋ฃจ์…˜

fn cluster_collective_operations[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    temp_storage: LayoutTensor[
        dtype, Layout.row_major(CLUSTER_SIZE), MutAnyOrigin
    ],
    size: Int,
):
    """Cluster-wide collective operations using real cluster APIs."""
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    # Each thread accumulates its data
    var my_value: Float32 = 0.0
    if global_i < size:
        my_value = input[global_i][0]

    # Block-level reduction using shared memory
    shared_mem = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    shared_mem[local_i] = my_value
    barrier()

    # Tree reduction within block
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride and local_i + stride < tpb:
            shared_mem[local_i] += shared_mem[local_i + stride]
        barrier()
        stride = stride // 2

    # FIX: Store block result using block_idx for reliable indexing
    if local_i == 0:
        temp_storage[block_id] = shared_mem[0]

    # Use cluster_sync() for full cluster synchronization
    cluster_sync()

    # Final cluster reduction (elect one thread to do the final work)
    if elect_one_sync() and my_block_rank == 0:
        var total: Float32 = 0.0
        for i in range(CLUSTER_SIZE):
            total += temp_storage[i][0]
        output[0] = total


ํด๋Ÿฌ์Šคํ„ฐ ์ง‘ํ•ฉ ์—ฐ์‚ฐ ํ’€์ด๋Š” ๋ถ„์‚ฐ ์ปดํ“จํŒ…์˜ ๊ณ ์ „์ ์ธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: ๋กœ์ปฌ ๋ฆฌ๋•์…˜ โ†’ ์ „์—ญ ์กฐ์ • โ†’ ์ตœ์ข… ์ง‘๊ณ„:

1๋‹จ๊ณ„: ๋กœ์ปฌ ๋ธ”๋ก ๋ฆฌ๋•์…˜ (์ „ํ†ต์  ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜)

๋ฐ์ดํ„ฐ ๋กœ๋”ฉ ๋ฐ ์ดˆ๊ธฐํ™”:

var my_value: Float32 = 0.0
if global_i < size:
    my_value = input[global_i][0]  # Load with bounds checking
shared_mem[local_i] = my_value     # Store in shared memory
barrier()                          # Ensure all threads complete loading

ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

var stride = tpb // 2  # Start with half the threads (128)
while stride > 0:
    if local_i < stride and local_i + stride < tpb:
        shared_mem[local_i] += shared_mem[local_i + stride]
    barrier()          # Synchronize after each reduction step
    stride = stride // 2

ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ์‹œ๊ฐํ™” (TPB=256):

Step 1: stride=128  [T0]+=T128, [T1]+=T129, ..., [T127]+=T255
Step 2: stride=64   [T0]+=T64,  [T1]+=T65,  ..., [T63]+=T127
Step 3: stride=32   [T0]+=T32,  [T1]+=T33,  ..., [T31]+=T63
Step 4: stride=16   [T0]+=T16,  [T1]+=T17,  ..., [T15]+=T31
Step 5: stride=8    [T0]+=T8,   [T1]+=T9,   ..., [T7]+=T15
Step 6: stride=4    [T0]+=T4,   [T1]+=T5,   [T2]+=T6,  [T3]+=T7
Step 7: stride=2    [T0]+=T2,   [T1]+=T3
Step 8: stride=1    [T0]+=T1    โ†’ Final result at shared_mem[0]
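위 단계들이 실제로 전체 합과 같은 결과를 내는지는, 솔루션의 while-stride 루프를 그대로 옮긴 CPU 시뮬레이션으로 확인할 수 있습니다 (GPU에서는 안쪽 루프가 스레드들에 의해 병렬 수행된다는 점만 다릅니다):

```python
def tree_reduce(values):
    """솔루션의 while-stride 트리 리덕션을 순차적으로 재현한 스케치."""
    shared = list(values)          # shared_mem 역할
    tpb = len(shared)
    stride = tpb // 2
    while stride > 0:
        for local_i in range(stride):       # GPU에서는 스레드들이 병렬 수행
            if local_i + stride < tpb:
                shared[local_i] += shared[local_i + stride]
        stride //= 2                        # 각 단계 후 절반으로 축소
    return shared[0]                        # 최종 결과는 shared_mem[0]

data = list(range(256))                     # TPB=256개의 예시 값
assert tree_reduce(data) == sum(data) == 32640
```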

๋ถ€๋ถ„ ๊ฒฐ๊ณผ ์ €์žฅ:

  • ์Šค๋ ˆ๋“œ 0๋งŒ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค: temp_storage[block_id] = shared_mem[0]
  • ๊ฐ ๋ธ”๋ก์ด ์ž์‹ ์˜ ํ•ฉ๊ณ„๋ฅผ temp_storage[0], temp_storage[1], temp_storage[2], temp_storage[3]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

2๋‹จ๊ณ„: ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”

์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ๋ฐฐ๋ฆฌ์–ด:

  • cluster_sync()๋Š” cluster_arrive()/cluster_wait()๋ณด๋‹ค ๋” ๊ฐ•๋ ฅํ•œ ๋ณด์žฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
  • ์–ด๋–ค ๋ธ”๋ก์ด๋“  ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•˜๊ธฐ ์ „์— ๋ชจ๋“  ๋ธ”๋ก์ด ๋กœ์ปฌ ๋ฆฌ๋•์…˜์„ ์™„๋ฃŒํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๋ชจ๋“  ๋ธ”๋ก์— ๊ฑธ์นœ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋™๊ธฐํ™”์ž…๋‹ˆ๋‹ค

3๋‹จ๊ณ„: ์ตœ์ข… ์ „์—ญ ์ง‘๊ณ„

ํšจ์œจ์ ์ธ ์Šค๋ ˆ๋“œ ์„ ์ถœ:

if elect_one_sync() and my_block_rank == 0:
    var total: Float32 = 0.0
    for i in range(CLUSTER_SIZE):
        total += temp_storage[i][0]  # Sum: temp[0] + temp[1] + temp[2] + temp[3]
    output[0] = total

์™œ ์ด ์„ ์ถœ ์ „๋žต์„ ์‚ฌ์šฉํ• ๊นŒ?

  • elect_one_sync(): ์›Œํ”„๋‹น ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์„ ํƒํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ณธ ์š”์†Œ์ž…๋‹ˆ๋‹ค
  • my_block_rank == 0: ๋‹จ์ผ ์“ฐ๊ธฐ๋ฅผ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ฒซ ๋ฒˆ์งธ ๋ธ”๋ก์—์„œ๋งŒ ์„ ์ถœํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฐ๊ณผ: ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ๋‹จ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ์ตœ์ข… ํ•ฉ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ํšจ์œจ์„ฑ: 1024๊ฐœ ์ „์ฒด ์Šค๋ ˆ๋“œ์— ๊ฑธ์นœ ์ค‘๋ณต ์—ฐ์‚ฐ์„ ํ”ผํ•ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ๊ธฐ์ˆ  ์ธ์‚ฌ์ดํŠธ

3๋‹จ๊ณ„ ๋ฆฌ๋•์…˜ ๊ณ„์ธต ๊ตฌ์กฐ:

  1. ์Šค๋ ˆ๋“œ โ†’ ์›Œํ”„: ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ถ€๋ถ„ ํ•ฉ์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค
  2. ์›Œํ”„ โ†’ ๋ธ”๋ก: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ์›Œํ”„๋“ค์„ ํ•˜๋‚˜์˜ ๋ธ”๋ก ๊ฒฐ๊ณผ๋กœ ํ•ฉ์นฉ๋‹ˆ๋‹ค (256 โ†’ 1)
  3. ๋ธ”๋ก โ†’ ํด๋Ÿฌ์Šคํ„ฐ: ๋‹จ์ˆœ ๋ฃจํ”„๊ฐ€ ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ ์ตœ์ข… ํ•ฉ๊ณ„๋กœ ํ•ฉ์นฉ๋‹ˆ๋‹ค (4 โ†’ 1)

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

  • ์ž…๋ ฅ: ๊ฐ ์š”์†Œ๋ฅผ ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ์ฝ์Šต๋‹ˆ๋‹ค (input[global_i])
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก ๋‚ด๋ถ€ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณ ์† ์ž‘์—… ๊ณต๊ฐ„
  • ์ž„์‹œ ์ €์žฅ์†Œ: ์ €๋น„์šฉ ๋ธ”๋ก ๊ฐ„ ํ†ต์‹  (4๊ฐœ ๊ฐ’๋งŒ)
  • ์ถœ๋ ฅ: ๋‹จ์ผ ์ „์—ญ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ ๊ธฐ๋ก

๋™๊ธฐํ™” ๋ณด์žฅ:

  • barrier(): ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • cluster_sync(): ์ „์—ญ ๋ฐฐ๋ฆฌ์–ด - ๋ชจ๋“  ๋ธ”๋ก์ด ๋™์ผํ•œ ์‹คํ–‰ ์ง€์ ์— ๋„๋‹ฌํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ์ผ ์“ฐ๊ธฐ: ์„ ์ถœ์„ ํ†ตํ•ด ์ตœ์ข… ์ถœ๋ ฅ์— ๋Œ€ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„ ๋ถ„์„:

  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: O(logโ‚‚ TPB) = O(logโ‚‚ 256) = ๋ธ”๋ก๋‹น 8๋‹จ๊ณ„
  • ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •: O(1) ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  • ์ตœ์ข… ์ง‘๊ณ„: O(CLUSTER_SIZE) = O(4) ๋‹จ์ˆœ ๋ง์…ˆ
  • ์ „์ฒด: ๋ธ”๋ก ๋‚ด๋ถ€๋Š” ๋กœ๊ทธ, ๋ธ”๋ก ๊ฐ„์€ ์„ ํ˜•

ํ™•์žฅ์„ฑ ํŠน์„ฑ:

  • ๋ธ”๋ก ๋ ˆ๋ฒจ: ๋กœ๊ทธ ๋ณต์žก๋„๋กœ ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊นŒ์ง€ ํ™•์žฅ ๊ฐ€๋Šฅ
  • ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: ์„ ํ˜• ๋ณต์žก๋„๋กœ ์ˆ˜์‹ญ ๊ฐœ์˜ ๋ธ”๋ก๊นŒ์ง€ ํ™•์žฅ ๊ฐ€๋Šฅ
  • ๋ฉ”๋ชจ๋ฆฌ: ์ž„์‹œ ์ €์žฅ์†Œ ์š”๊ตฌ๋Ÿ‰์ด ํด๋Ÿฌ์Šคํ„ฐ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜์—ฌ ์„ ํ˜• ์ฆ๊ฐ€
  • ํ†ต์‹ : ์ตœ์†Œํ•œ์˜ ๋ธ”๋ก ๊ฐ„ ๋ฐ์ดํ„ฐ ์ด๋™ (๋ธ”๋ก๋‹น ํ•˜๋‚˜์˜ ๊ฐ’)

์ง‘ํ•ฉ ์—ฐ์‚ฐ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

์ด ํผ์ฆ์€ ๋ถ„์‚ฐ ์ปดํ“จํŒ…์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ณ ์ „์ ์ธ 2๋‹จ๊ณ„ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ๋กœ์ปฌ ์ง‘๊ณ„: ๊ฐ ์ฒ˜๋ฆฌ ๋‹จ์œ„(๋ธ”๋ก)๊ฐ€ ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ ์˜์—ญ์„ ๋ฆฌ๋•์…˜ํ•ฉ๋‹ˆ๋‹ค
  2. ์ „์—ญ ์กฐ์ •: ์ฒ˜๋ฆฌ ๋‹จ์œ„๋“ค์ด ๋™๊ธฐํ™”ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค
  3. ์ตœ์ข… ๋ฆฌ๋•์…˜: ์„ ์ถœ๋œ ํ•˜๋‚˜์˜ ๋‹จ์œ„๊ฐ€ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นฉ๋‹ˆ๋‹ค

๋‹จ์ผ ๋ธ”๋ก ๋ฐฉ์‹๊ณผ์˜ ๋น„๊ต:

  • ๊ธฐ์กด block.sum(): ์ตœ๋Œ€ 256๊ฐœ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ์ง‘ํ•ฉ ์—ฐ์‚ฐ: ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์ณ 1000๊ฐœ ์ด์ƒ์˜ ์Šค๋ ˆ๋“œ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค
  • ๋™์ผํ•œ ์ •ํ™•๋„: ๋‘˜ ๋‹ค ๋™์ผํ•œ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ๋‹ค๋ฅธ ๊ทœ๋ชจ: ํด๋Ÿฌ์Šคํ„ฐ ๋ฐฉ์‹์ด ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์„ฑ๋Šฅ ์ด์ :

  • ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹: ๋‹จ์ผ ๋ธ”๋ก ์šฉ๋Ÿ‰์„ ์ดˆ๊ณผํ•˜๋Š” ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋” ๋‚˜์€ ํ™œ์šฉ๋ฅ : ๋” ๋งŽ์€ GPU ์—ฐ์‚ฐ ์œ ๋‹›์„ ๋™์‹œ์— ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ํŒจํ„ด: ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค

๋‹ค์Œ ๋‹จ๊ณ„: ์ตœ์ข… ๋„์ „์„ ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์œผ๋กœ ์ด๋™ํ•˜์—ฌ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ+๋ธ”๋ก ์กฐ์ •+ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ณ„์ธต์  ํŒจํ„ด์„ ๋ฐฐ์›Œ๋ณด์„ธ์š”. ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค!

๐Ÿง  ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜

๊ฐœ์š”

์ด ๋งˆ์ง€๋ง‰ ๋„์ „์—์„œ๋Š” ์›Œํ”„ ๋ ˆ๋ฒจ (Puzzle 24-26), ๋ธ”๋ก ๋ ˆ๋ฒจ (Puzzle 27), ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์— ์ด๋ฅด๊ธฐ๊นŒ์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ์˜ ๋ชจ๋“  ๋ ˆ๋ฒจ์„ ๊ฒฐํ•ฉํ•˜์—ฌ GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ์ •๊ตํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: ์›Œํ”„ ๋ ˆ๋ฒจ ์ตœ์ ํ™” (elect_one_sync()), ๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„, ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ์กฐ์ •์„ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ ํŒจํ„ด์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ณ„์ธต์  ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต: ๊ณ ๊ธ‰ ์—ฐ์‚ฐ ์›Œํฌ๋กœ๋“œ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํ”„๋กœ๋•์…˜ ์ˆ˜์ค€์˜ ์กฐ์ • ํŒจํ„ด๊ณผ ํ•จ๊ป˜ ์™„์ „ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์Šคํƒ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฌธ์ œ: ๋‹ค๋‹จ๊ณ„ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ

์‹ค์ œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ GPU ๊ณ„์ธต ๊ตฌ์กฐ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ๋ฒจ(Puzzle 24์˜ ์›Œํ”„, Puzzle 27์˜ ๋ธ”๋ก, ํด๋Ÿฌ์Šคํ„ฐ)์ด ์กฐ์ •๋œ ์—ฐ์‚ฐ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ๊ฐ๊ฐ ์ „๋ฌธํ™”๋œ ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ณ„์ธต์  ์กฐ์ •์„ ํ•„์š”๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์œผ๋ฉฐ, ์ด๋Š” Puzzle 29์˜ ๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ๋ฅผ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ: ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜์„ธ์š”:

  1. ์›Œํ”„ ๋ ˆ๋ฒจ: ํšจ์œจ์ ์ธ ์›Œํ”„ ๋‚ด๋ถ€ ์กฐ์ •์„ ์œ„ํ•ด elect_one_sync()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (SIMT ์‹คํ–‰)
  2. ๋ธ”๋ก ๋ ˆ๋ฒจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๊ฒฐ๊ณผ๋ฅผ ์ง‘๊ณ„ํ•ฉ๋‹ˆ๋‹ค
  3. ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: cluster_arrive() / cluster_wait() Puzzle 29์˜ ๋‹จ๊ณ„์  ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ธ”๋ก ๊ฐ„ ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ช…์„ธ

๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ:

  1. 1๋‹จ๊ณ„ (์›Œํ”„ ๋ ˆ๋ฒจ): ๊ฐ ์›Œํ”„๊ฐ€ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์„ ์ถœํ•˜์—ฌ 32๊ฐœ์˜ ์—ฐ์† ์š”์†Œ๋ฅผ ํ•ฉ์‚ฐํ•ฉ๋‹ˆ๋‹ค
  2. 2๋‹จ๊ณ„ (๋ธ”๋ก ๋ ˆ๋ฒจ): ๊ฐ ๋ธ”๋ก ๋‚ด์˜ ๋ชจ๋“  ์›Œํ”„ ํ•ฉ๊ณ„๋ฅผ ์ง‘๊ณ„ํ•ฉ๋‹ˆ๋‹ค
  3. 3๋‹จ๊ณ„ (ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ): cluster_arrive() / cluster_wait()๋กœ ๋ธ”๋ก ๊ฐ„ ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

์ž…๋ ฅ: ํ…Œ์ŠคํŠธ๋ฅผ ์œ„ํ•œ (i % 50) * 0.02 ํŒจํ„ด์˜ 1024๊ฐœ float ๊ฐ’ ์ถœ๋ ฅ: ๊ณ„์ธต์  ์ฒ˜๋ฆฌ ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” 4๊ฐœ ๋ธ”๋ก ๊ฒฐ๊ณผ

์„ค์ •

  • ๋ฌธ์ œ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ
  • ๋ธ”๋ก ์„ค์ •: TPB = 256 ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (256, 1)
  • ๊ทธ๋ฆฌ๋“œ ์„ค์ •: CLUSTER_SIZE = 4 ๋ธ”๋ก (4, 1)
  • ์›Œํ”„ ํฌ๊ธฐ: WARP_SIZE = 32 ์›Œํ”„๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (NVIDIA ํ‘œ์ค€)
  • ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: TPB / WARP_SIZE = 8 ์›Œํ”„
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ์ž…๋ ฅ Layout.row_major(SIZE), ์ถœ๋ ฅ Layout.row_major(CLUSTER_SIZE)

์ฒ˜๋ฆฌ ๋ถ„๋ฐฐ:

  • Block 0: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 0-255
  • Block 1: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 256-511
  • Block 2: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 512-767
  • Block 3: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 768-1023

์™„์„ฑํ•  ์ฝ”๋“œ

fn advanced_cluster_patterns[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Advanced cluster programming using cluster masks and relaxed synchronization.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)

    # FILL IN (roughly 26 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p34/p34.mojo

ํŒ

์›Œํ”„ ๋ ˆ๋ฒจ ์ตœ์ ํ™” ํŒจํ„ด

๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„ ์ „๋žต

  • ์›Œํ”„ ์ฒ˜๋ฆฌ ํ›„ ๋ชจ๋“  ์›Œํ”„ ๊ฒฐ๊ณผ๋ฅผ ์ง‘๊ณ„ํ•ฉ๋‹ˆ๋‹ค (Puzzle 27์˜ ๋ธ”๋ก ์กฐ์ • ํ™•์žฅ)
  • ์„ ์ถœ๋œ ์œ„์น˜์—์„œ ์ฝ์Šต๋‹ˆ๋‹ค: ์ธ๋ฑ์Šค 0, 32, 64, 96, 128, 160, 192, 224
  • for i in range(0, tpb, 32) ๋ฃจํ”„๋กœ ์›Œํ”„ ๋ฆฌ๋”๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํŒจํ„ด)
  • ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค (๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์˜ ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด)

ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ํ๋ฆ„

  1. ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ๊ณ„์ธต์  ์›Œํ”„ ์ตœ์ ํ™”๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  2. ์‹ ํ˜ธ: cluster_arrive()๋กœ ๋กœ์ปฌ ์ฒ˜๋ฆฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  3. ์ €์žฅ: ์Šค๋ ˆ๋“œ 0์ด ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค
  4. ๋Œ€๊ธฐ: cluster_wait()๋กœ ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค

๋ฐ์ดํ„ฐ ์Šค์ผ€์ผ๋ง ๋ฐ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

  • Float32(block_id + 1)๋กœ ์ž…๋ ฅ์„ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๋ธ”๋ก๋ณ„ ๊ณ ์œ  ํŒจํ„ด์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ์„ ์ฝ๊ธฐ ์ „์— ํ•ญ์ƒ global_i < size๋ฅผ ๊ฒ€์‚ฌํ•ฉ๋‹ˆ๋‹ค (Puzzle 3์˜ ๊ฐ€๋“œ)
  • ๋ธ”๋ก ๋‚ด ์ฒ˜๋ฆฌ ๋‹จ๊ณ„ ์‚ฌ์ด์— barrier()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (๋™๊ธฐํ™” ํŒจํ„ด)
  • ๋ฃจํ”„์—์„œ ์›Œํ”„ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์ฃผ์˜ ๊นŠ๊ฒŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค (์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ณ ๋ ค์‚ฌํ•ญ)

๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ API

gpu.primitives.cluster ๋ชจ๋“ˆ:

  • elect_one_sync(): ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์Šค๋ ˆ๋“œ ์„ ์ถœ
  • cluster_arrive(): ๋‹จ๊ณ„์  ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ์œ„ํ•œ ์™„๋ฃŒ ์‹ ํ˜ธ
  • cluster_wait(): ๋ชจ๋“  ๋ธ”๋ก์ด ๋™๊ธฐํ™” ์ง€์ ์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
  • block_rank_in_cluster(): ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๊ณ ์œ ํ•œ ๋ธ”๋ก ์‹๋ณ„์ž ๋ฐ˜ํ™˜

๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด

์ด ํผ์ฆ์€ 3๋‹จ๊ณ„ ์กฐ์ • ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋ ˆ๋ฒจ 1: ์›Œํ”„ ์กฐ์ • (Puzzle 24)

Warp (32 threads) โ†’ elect_one_sync() โ†’ 1 elected thread โ†’ processes 32 elements

๋ ˆ๋ฒจ 2: ๋ธ”๋ก ์กฐ์ • (Puzzle 27)

Block (8 warps) โ†’ aggregate warp results โ†’ 1 block total

๋ ˆ๋ฒจ 3: ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • (์ด ํผ์ฆ)

Cluster (4 blocks) โ†’ cluster_arrive/wait โ†’ synchronized completion

๊ฒฐํ•ฉ ํšจ๊ณผ: 1024๊ฐœ ์Šค๋ ˆ๋“œ โ†’ 32๊ฐœ ์›Œํ”„ ๋ฆฌ๋” โ†’ 4๊ฐœ ๋ธ”๋ก ๊ฒฐ๊ณผ โ†’ ์กฐ์ •๋œ ํด๋Ÿฌ์Šคํ„ฐ ์™„๋ฃŒ

์ฝ”๋“œ ์‹คํ–‰

pixi run p34 --advanced
uv run poe p34 --advanced

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Advanced Cluster Algorithms
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Advanced cluster algorithm results:
  Block 0 : 122.799995
  Block 1 : 247.04001
  Block 2 : 372.72
  Block 3 : 499.83997
โœ… Advanced cluster patterns tests passed!

์„ฑ๊ณต ๊ธฐ์ค€:

  • ๊ณ„์ธต์  ์Šค์ผ€์ผ๋ง: ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋‹จ๊ณ„ ์กฐ์ • ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ์›Œํ”„ ์ตœ์ ํ™”: elect_one_sync()๊ฐ€ ์ค‘๋ณต ์—ฐ์‚ฐ์„ ์ค„์ž…๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •: ๋ชจ๋“  ๋ธ”๋ก์ด ์ฒ˜๋ฆฌ๋ฅผ ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒํ•ฉ๋‹ˆ๋‹ค
  • ์„ฑ๋Šฅ ํŒจํ„ด: ๋” ๋†’์€ ๋ธ”๋ก ID๊ฐ€ ๋น„๋ก€์ ์œผ๋กœ ๋” ํฐ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค

์†”๋ฃจ์…˜

fn advanced_cluster_patterns[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[dtype, out_layout, MutAnyOrigin],
    input: LayoutTensor[dtype, in_layout, ImmutAnyOrigin],
    size: Int,
):
    """Advanced cluster programming using cluster masks and relaxed synchronization.
    """
    global_i = Int(block_dim.x * block_idx.x + thread_idx.x)
    local_i = Int(thread_idx.x)
    my_block_rank = Int(block_rank_in_cluster())
    block_id = Int(block_idx.x)

    shared_data = LayoutTensor[
        dtype,
        Layout.row_major(tpb),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Compute cluster mask for advanced coordination
    # base_mask = cluster_mask_base()  # Requires cluster_shape parameter

    # FIX: Process data with block_idx-based scaling for guaranteed uniqueness
    var data_scale = Float32(block_id + 1)
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Advanced pattern: Use elect_one_sync for efficient coordination
    if elect_one_sync():  # Only one thread per warp does this work
        var warp_sum: Float32 = 0.0
        var warp_start = (local_i // 32) * 32  # Get warp start index
        for i in range(32):  # Sum across warp
            if warp_start + i < tpb:
                warp_sum += shared_data[warp_start + i][0]
        shared_data[local_i] = warp_sum

    barrier()

    # Use cluster_arrive for staged synchronization in sm90+
    cluster_arrive()

    # Only first thread in each block stores result
    if local_i == 0:
        var block_total: Float32 = 0.0
        for i in range(0, tpb, 32):  # Sum warp results
            if i < tpb:
                block_total += shared_data[i][0]
        output[block_id] = block_total

    # Wait for all blocks to complete their calculations in sm90+
    cluster_wait()


๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ํŒจํ„ด ํ’€์ด๋Š” GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์›Œํ”„, ๋ธ”๋ก, ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ๊ฒฐํ•ฉํ•˜๋Š” ์ •๊ตํ•œ 3๋‹จ๊ณ„ ๊ณ„์ธต์  ์ตœ์ ํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋ ˆ๋ฒจ 1: ์›Œํ”„ ๋ ˆ๋ฒจ ์ตœ์ ํ™” (์Šค๋ ˆ๋“œ ์„ ์ถœ)

๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ์Šค์ผ€์ผ๋ง:

var data_scale = Float32(block_id + 1)  # Block-specific scaling factor
if global_i < size:
    shared_data[local_i] = input[global_i] * data_scale
else:
    shared_data[local_i] = 0.0  # Zero-pad for out-of-bounds
barrier()  # Ensure all threads complete data loading

์›Œํ”„ ๋ ˆ๋ฒจ ์Šค๋ ˆ๋“œ ์„ ์ถœ:

if elect_one_sync():  # Hardware elects exactly 1 thread per warp
    var warp_sum: Float32 = 0.0
    var warp_start = (local_i // 32) * 32  # Calculate warp boundary
    for i in range(32):  # Process entire warp's data
        if warp_start + i < tpb:
            warp_sum += shared_data[warp_start + i][0]
    shared_data[local_i] = warp_sum  # Store result at elected thread's position

์›Œํ”„ ๊ฒฝ๊ณ„ ๊ณ„์‚ฐ ์„ค๋ช…:

  • ์Šค๋ ˆ๋“œ 37 (์›Œํ”„ 1): warp_start = (37 // 32) * 32 = 1 * 32 = 32
  • ์Šค๋ ˆ๋“œ 67 (์›Œํ”„ 2): warp_start = (67 // 32) * 32 = 2 * 32 = 64
  • ์Šค๋ ˆ๋“œ 199 (์›Œํ”„ 6): warp_start = (199 // 32) * 32 = 6 * 32 = 192

์„ ์ถœ ํŒจํ„ด ์‹œ๊ฐํ™” (TPB=256, 8 ์›Œํ”„):

Warp 0 (threads 0-31):   elect_one_sync() โ†’ Thread 0   processes elements 0-31
Warp 1 (threads 32-63):  elect_one_sync() โ†’ Thread 32  processes elements 32-63
Warp 2 (threads 64-95):  elect_one_sync() โ†’ Thread 64  processes elements 64-95
Warp 3 (threads 96-127): elect_one_sync() โ†’ Thread 96  processes elements 96-127
Warp 4 (threads 128-159):elect_one_sync() โ†’ Thread 128 processes elements 128-159
Warp 5 (threads 160-191):elect_one_sync() โ†’ Thread 160 processes elements 160-191
Warp 6 (threads 192-223):elect_one_sync() โ†’ Thread 192 processes elements 192-223
Warp 7 (threads 224-255):elect_one_sync() โ†’ Thread 224 processes elements 224-255

๋ ˆ๋ฒจ 2: ๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„ (์›Œํ”„ ๋ฆฌ๋” ์กฐ์ •)

์›Œํ”„ ๊ฐ„ ๋™๊ธฐํ™”:

barrier()  # Ensure all warps complete their elected computations

์›Œํ”„ ๋ฆฌ๋” ์ง‘๊ณ„ (์Šค๋ ˆ๋“œ 0๋งŒ):

if local_i == 0:
    var block_total: Float32 = 0.0
    for i in range(0, tpb, 32):  # Iterate through warp leader positions
        if i < tpb:
            block_total += shared_data[i][0]  # Sum warp results
    output[block_id] = block_total

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

  • ์Šค๋ ˆ๋“œ 0์ด ๋‹ค์Œ ์œ„์น˜์—์„œ ์ฝ์Šต๋‹ˆ๋‹ค: shared_data[0], shared_data[32], shared_data[64], shared_data[96], shared_data[128], shared_data[160], shared_data[192], shared_data[224]
  • ์ด ์œ„์น˜๋“ค์—๋Š” ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐํ•œ ์›Œํ”„ ํ•ฉ๊ณ„๊ฐ€ ์ €์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค
  • ๊ฒฐ๊ณผ: 8๊ฐœ ์›Œํ”„ ํ•ฉ๊ณ„ โ†’ 1๊ฐœ ๋ธ”๋ก ํ•ฉ๊ณ„

๋ ˆ๋ฒจ 3: ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ๋‹จ๊ณ„์  ๋™๊ธฐํ™”

๋‹จ๊ณ„์  ๋™๊ธฐํ™” ์ ‘๊ทผ:

cluster_arrive()  # Non-blocking: signal this block's completion
# ... Thread 0 computes and stores block result ...
cluster_wait()    # Blocking: wait for all blocks to complete

์™œ ๋‹จ๊ณ„์  ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ• ๊นŒ?

  • cluster_arrive() ๋ฅผ ์ตœ์ข… ์—ฐ์‚ฐ ์ด์ „์— ํ˜ธ์ถœํ•˜๋ฉด ์ž‘์—… ์ค‘์ฒฉ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค
  • ๋‹ค๋ฅธ ๋ธ”๋ก์ด ์•„์ง ์ฒ˜๋ฆฌ ์ค‘์ธ ๋™์•ˆ์—๋„ ๋ธ”๋ก์ด ์ž์ฒด ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • cluster_wait() ๋กœ ๊ฒฐ์ •๋ก ์  ์™„๋ฃŒ ์ˆœ์„œ๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋…๋ฆฝ์ ์ธ ๋ธ”๋ก ์—ฐ์‚ฐ์˜ ๊ฒฝ์šฐ cluster_sync()๋ณด๋‹ค ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค
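The arrive/wait split can be illustrated with CPU threads. This is only an analogy: `PhasedBarrier` is a hypothetical helper written for this sketch, whereas the real cluster_arrive()/cluster_wait() are SM90+ hardware primitives; the work each "block" does here is a placeholder.

```python
import threading

class PhasedBarrier:
    """A barrier that splits non-blocking arrive from blocking wait."""
    def __init__(self, n_blocks: int):
        self.n = n_blocks
        self.count = 0
        self.cond = threading.Condition()

    def arrive(self):  # analogous to cluster_arrive(): signal and return
        with self.cond:
            self.count += 1
            self.cond.notify_all()

    def wait(self):    # analogous to cluster_wait(): block until all arrive
        with self.cond:
            while self.count < self.n:
                self.cond.wait()

results = [0] * 4
barrier = PhasedBarrier(4)

def block(block_id: int):
    barrier.arrive()                    # non-blocking completion signal
    results[block_id] = block_id * 10   # independent work overlaps with others
    barrier.wait()                      # ensure every block has arrived

threads = [threading.Thread(target=block, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [0, 10, 20, 30]
```

Because arrive() returns immediately, each thread keeps working between the signal and the wait, which is the overlap the bullet points above describe.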

Advanced pattern characteristics

Hierarchical computation reduction:

  1. 256 threads → 8 elected threads (32× reduction per block)
  2. 8 warp sums → 1 block sum (8× reduction per block)
  3. 4 blocks → phased completion (synchronized termination)
  4. Overall effect: 256× less redundant computation per block
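The reduction factors listed above are easy to verify numerically (TPB=256, WARP_SIZE=32, and 4 blocks are the puzzle's assumptions):

```python
TPB, WARP_SIZE, NUM_BLOCKS = 256, 32, 4

elected = TPB // WARP_SIZE                 # 256 threads -> 8 elected threads
warp_factor = WARP_SIZE                    # each elected thread covers 32 elements
block_factor = elected                     # 8 warp sums -> 1 block sum
total_factor = warp_factor * block_factor  # 32 * 8 = 256x per block

print(elected, warp_factor, block_factor, total_factor)  # 8 32 8 256
```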

Memory access optimization:

  • Level 1: coalesced reads from input[global_i], strided writes to shared memory
  • Level 2: elected threads perform the warp-level aggregation (8 operations instead of 256)
  • Level 3: thread 0 performs the block-level aggregation (1 operation instead of 8)
  • Result: hierarchical reduction minimizes memory bandwidth usage

Synchronization hierarchy:

  1. barrier(): intra-block thread synchronization (after data loading and warp processing)
  2. cluster_arrive(): inter-block signaling (non-blocking, allows work overlap)
  3. cluster_wait(): inter-block synchronization (blocking, guarantees completion order)

Why "advanced":

  • Multi-level optimization: combines warp, block, and cluster programming techniques
  • Hardware efficiency: uses elect_one_sync() to optimize warp utilization
  • Phased coordination: implements flexible synchronization with advanced cluster APIs
  • Production-grade: demonstrates patterns used in real GPU libraries

Real performance benefits:

  • Reduced memory pressure: fewer threads access shared memory at the same time
  • Better warp utilization: the elected threads perform the concentrated computation
  • Scalable coordination: phased synchronization handles larger cluster sizes
  • Algorithmic flexibility: a foundation for complex multi-stage processing pipelines

Complexity analysis:

  • Warp level: O(32) operations per elected thread = O(256) total per block
  • Block level: O(8) aggregation operations per block
  • Cluster level: O(1) synchronization overhead per block
  • Overall: linear complexity with massive parallelism

The complete GPU hierarchy

Congratulations! By completing this puzzle you have covered the full GPU programming stack:

✅ Thread-level programming: individual execution units
✅ Warp-level programming: 32-thread SIMT coordination
✅ Block-level programming: multi-warp coordination and shared memory
✅ 🆕 Cluster-level programming: multi-block coordination with SM90+ APIs
✅ Coordinated multiple thread blocks with cluster synchronization primitives
✅ Scaled algorithms beyond single-block limits using cluster APIs
✅ Implemented hierarchical algorithms combining warp + block + cluster coordination
✅ Leveraged next-generation GPU hardware through SM90+ cluster programming

์‹ค์ „ ์‘์šฉ

์ด ํผ์ฆ์˜ ๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ…:

  • ๋ฉ€ํ‹ฐ ๊ทธ๋ฆฌ๋“œ ๊ธฐ๋ฒ•: ๊ฐ ๋ ˆ๋ฒจ์ด ์„œ๋กœ ๋‹ค๋ฅธ ํ•ด์ƒ๋„์˜ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋„๋ฉ”์ธ ๋ถ„ํ•ด: ๋ฌธ์ œ์˜ ํ•˜์œ„ ๋„๋ฉ”์ธ์— ๊ฑธ์นœ ๊ณ„์ธต์  ์กฐ์ •
  • ๋ณ‘๋ ฌ ๋ฐ˜๋ณต๋ฒ•: ์›Œํ”„ ๋ ˆ๋ฒจ์˜ ๋กœ์ปฌ ์—ฐ์‚ฐ๊ณผ ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ์˜ ์ „์—ญ ํ†ต์‹ 

๋”ฅ๋Ÿฌ๋‹:

  • ๋ชจ๋ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ๋ชจ๋ธ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ํŒŒ์ดํ”„๋ผ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์—ฌ๋Ÿฌ ํŠธ๋žœ์Šคํฌ๋จธ ๋ ˆ์ด์–ด์— ๊ฑธ์นœ ๋‹จ๊ณ„์  ์ฒ˜๋ฆฌ
  • ๊ธฐ์šธ๊ธฐ ์ง‘๊ณ„: ๋ถ„์‚ฐ ํ•™์Šต ๋…ธ๋“œ์— ๊ฑธ์นœ ๊ณ„์ธต์  ๋ฆฌ๋•์…˜

๊ทธ๋ž˜ํ”ฝ์Šค ๋ฐ ์‹œ๊ฐํ™”:

  • ๋ฉ€ํ‹ฐ ํŒจ์Šค ๋ Œ๋”๋ง: ๋ณต์žกํ•œ ์‹œ๊ฐ ํšจ๊ณผ๋ฅผ ์œ„ํ•œ ๋‹จ๊ณ„์  ์ฒ˜๋ฆฌ
  • ๊ณ„์ธต์  ์ปฌ๋ง: ๊ฐ ๋ ˆ๋ฒจ์ด ์„œ๋กœ ๋‹ค๋ฅธ ์„ธ๋ถ„๋„์—์„œ ์ปฌ๋งํ•ฉ๋‹ˆ๋‹ค
  • ๋ณ‘๋ ฌ ์ง€์˜ค๋ฉ”ํŠธ๋ฆฌ ์ฒ˜๋ฆฌ: ์กฐ์ •๋œ ๋ณ€ํ™˜ ํŒŒ์ดํ”„๋ผ์ธ

๋‹ค์Œ ๋‹จ๊ณ„

์ด์ œ ์ตœ์‹  ํ•˜๋“œ์›จ์–ด์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ตœ์ฒจ๋‹จ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค!

๋” ๋งŽ์€ ๋„์ „์„ ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋‹ค๋ฅธ ๊ณ ๊ธ‰ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ฃผ์ œ๋ฅผ ํƒ๊ตฌํ•˜๊ณ , Puzzle 30-32์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๋ณต์Šตํ•˜๊ณ , NVIDIA ๋„๊ตฌ์˜ ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ์ ์šฉํ•˜๊ฑฐ๋‚˜, ์ด ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž์‹ ๋งŒ์˜ ์—ฐ์‚ฐ ์›Œํฌ๋กœ๋“œ๋ฅผ ๊ตฌ์ถ•ํ•ด ๋ณด์„ธ์š”!