Mojo ๐Ÿ”ฅ GPU Puzzles, Edition 1

โ€œ์šฐ๋ฆฌ๊ฐ€ ํ•  ์ˆ˜ ์žˆ๊ธฐ ์ „์— ๋ฐฐ์›Œ์•ผ ํ•˜๋Š” ๊ฒƒ๋“ค์€, ํ•˜๋ฉด์„œ ๋ฐฐ์šด๋‹ค.โ€ ์•„๋ฆฌ์Šคํ† ํ…”๋ ˆ์Šค (๋‹ˆ์ฝ”๋งˆ์ฝ”์Šค ์œค๋ฆฌํ•™)

Mojo ๐Ÿ”ฅ๋ฅผ ์‚ฌ์šฉํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‹ค์Šต ๊ฐ€์ด๋“œ์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค. Mojo๋Š” ํŒŒ์ด์ฌ ๋ฌธ๋ฒ•๊ณผ ์‹œ์Šคํ…œ ์ˆ˜์ค€์˜ ์„ฑ๋Šฅ์„ ๊ฒฐํ•ฉํ•œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์–ธ์–ด์ž…๋‹ˆ๋‹ค.

์•„๋ž˜ ๊ฐœ์š” ์˜์ƒ์„ ๋จผ์ € ์‹œ์ฒญํ•˜๊ฑฐ๋‚˜, ๊ณ„์† ์ฝ์–ด์ฃผ์„ธ์š”.

์™œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ธ๊ฐ€?

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ „๋ฌธ ๊ธฐ์ˆ ์—์„œ ํ˜„๋Œ€ ์ปดํ“จํŒ…์˜ ํ•ต์‹ฌ ์ธํ”„๋ผ๋กœ ๋ฐœ์ „ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ˆ˜์‹ญ์–ต ๊ฐœ์˜ ๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ๋ถ€ํ„ฐ ์‹ค์‹œ๊ฐ„ ์˜์ƒ ์ŠคํŠธ๋ฆผ์„ ๋ถ„์„ํ•˜๋Š” ์ปดํ“จํ„ฐ ๋น„์ „ ์‹œ์Šคํ…œ๊นŒ์ง€, GPU ๊ฐ€์†์ด ์˜ค๋Š˜๋‚ ์˜ ์—ฐ์‚ฐ ํ˜์‹ ์„ ์ด๋Œ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐํ›„ ๋ชจ๋ธ๋ง, ์‹ ์•ฝ ๋ฐœ๊ฒฌ, ์–‘์ž ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ๋“ฑ ๊ณผํ•™์  ๋ฐœ์ „์€ GPU๋งŒ์ด ์ œ๊ณตํ•˜๋Š” ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ๋Šฅ๋ ฅ์— ์˜์กดํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธˆ์œต ๊ธฐ๊ด€์€ ์‹ค์‹œ๊ฐ„ ๋ฆฌ์Šคํฌ ๋ถ„์„๊ณผ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŠธ๋ ˆ์ด๋”ฉ์— GPU ์ปดํ“จํŒ…์„ ํ™œ์šฉํ•˜๋ฉฐ, ์ž์œจ์ฃผํ–‰ ์ฐจ๋Ÿ‰์€ GPU ๊ฐ€์† ์‹ ๊ฒฝ๋ง์„ ํ†ตํ•ด ์„ผ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜์—ฌ ์ค‘์š”ํ•œ ์˜์‚ฌ๊ฒฐ์ •์„ ๋‚ด๋ฆฝ๋‹ˆ๋‹ค.

๊ฒฝ์ œ์  ํŒŒ๊ธ‰๋ ฅ๋„ ์ƒ๋‹นํ•ฉ๋‹ˆ๋‹ค. GPU ์ปดํ“จํŒ…์„ ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ์กฐ์ง์€ ๊ฐœ๋ฐœ ์ฃผ๊ธฐ ๋‹จ์ถ•, ์—ฐ์‚ฐ ๋น„์šฉ ์ ˆ๊ฐ, ๊ทธ๋ฆฌ๊ณ  ์ด์ „์—๋Š” ํ’€๊ธฐ ์–ด๋ ค์› ๋˜ ๊ณ„์‚ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ ๋“ฑ ์ƒ๋‹นํ•œ ๊ฒฝ์Ÿ ์šฐ์œ„๋ฅผ ํ™•๋ณดํ•ฉ๋‹ˆ๋‹ค. ๊ณ„์‚ฐ ๋Šฅ๋ ฅ์ด ๋น„์ฆˆ๋‹ˆ์Šค ๊ฐ€์น˜์™€ ์ง๊ฒฐ๋˜๋Š” ์‹œ๋Œ€์—, GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ญ๋Ÿ‰์€ ์—”์ง€๋‹ˆ์–ด, ์—ฐ๊ตฌ์ž, ์กฐ์ง์—๊ฒŒ ์ „๋žต์  ์ฐจ๋ณ„ํ™” ์š”์†Œ์ž…๋‹ˆ๋‹ค.

์™œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— Mojo๐Ÿ”ฅ๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๊ฐ€?

์ปดํ“จํŒ… ์‚ฐ์—…์€ ์ค‘๋Œ€ํ•œ ์ „ํ™˜์ ์— ๋„๋‹ฌํ–ˆ์Šต๋‹ˆ๋‹ค. CPU ์„ฑ๋Šฅ์€ ์ „๋ ฅ๊ณผ ๋ฐœ์—ด ์ œ์•ฝ์œผ๋กœ ์ธํ•ด ํด๋Ÿญ ์†๋„ ํ–ฅ์ƒ๋งŒ์œผ๋กœ๋Š” ํ•œ๊ณ„์— ์ด๋ฅด๋ €์Šต๋‹ˆ๋‹ค. ์ด์— ๋”ฐ๋ผ ํ•˜๋“œ์›จ์–ด ์ œ์กฐ์‚ฌ๋“ค์€ ๋ฌผ๋ฆฌ์  ์ฝ”์–ด ์ˆ˜๋ฅผ ๋Š˜๋ฆฌ๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๋‚˜์•„๊ฐ”๊ณ , ์ด๋Ÿฌํ•œ ๋ฉ€ํ‹ฐ์ฝ”์–ด ์ ‘๊ทผ ๋ฐฉ์‹์˜ ์ •์ ์ด ๋ฐ”๋กœ ์ˆ˜์ฒœ ๊ฐœ์˜ ์ฝ”์–ด๊ฐ€ ๋ณ‘๋ ฌ๋กœ ๋™์ž‘ํ•˜๋Š” ํ˜„๋Œ€ GPU์ž…๋‹ˆ๋‹ค. NVIDIA H100์„ ์˜ˆ๋กœ ๋“ค๋ฉด, ๋‹จ์ผ ํด๋Ÿญ ์‚ฌ์ดํด์— 16,896๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋ฉด์„œ 270,000๊ฐœ ์ด์ƒ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋Œ€๊ธฐ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Mojo๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ๋Œ€ํ•œ ์‹ค์šฉ์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ์ œ๊ณตํ•˜์—ฌ, ์ด๋Ÿฌํ•œ ๋ณ‘๋ ฌ์„ฑ์„ ๋” ์‰ฝ๊ฒŒ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ ์Šคํƒ€์ผ ๋ฌธ๋ฒ•์œผ๋กœ ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊นŒ์ง€
  • ์ถ”์ƒํ™”ํ•ด๋„ ์„ฑ๋Šฅ ์†์‹ค ์—†์ด ๋จธ์‹  ์ฝ”๋“œ๋กœ ์ปดํŒŒ์ผ๋˜๋Š” ์ œ๋กœ ์ฝ”์ŠคํŠธ ์ถ”์ƒํ™”
  • ์ปดํŒŒ์ผ ํƒ€์ž„์— ์˜ค๋ฅ˜๋ฅผ ์žก๋Š” ๊ฐ•๋ ฅํ•œ ํƒ€์ž… ์‹œ์Šคํ…œ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ๊ณ ๋ คํ•œ ํ…์„œ ๊ธฐ๋ณธ ์ง€์›
  • CPUยทGPU ๋‚ด์žฅ ํ•จ์ˆ˜๋ฅผ ์ง์ ‘ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ๋Š” ํ•˜๋“œ์›จ์–ด ์ง์ ‘ ์ œ์–ด
  • CPU์™€ GPU ๋ชจ๋‘์—์„œ ๋™์ž‘ํ•˜๋Š” ํฌ๋กœ์Šค ํ•˜๋“œ์›จ์–ด ์ด์‹์„ฑ
  • C/C++ ๋Œ€๋น„ ํ–ฅ์ƒ๋œ ์•ˆ์ „์„ฑ
  • ๋‚ฎ์€ ์ง„์ž… ์žฅ๋ฒฝ์œผ๋กœ ๋” ๋งŽ์€ ํ”„๋กœ๊ทธ๋ž˜๋จธ๊ฐ€ GPU ์„ฑ๋Šฅ์„ ํ™œ์šฉ

Mojo🔥는 GPU 프로그래밍을 누구나 할 수 있도록 만들어 혁신을 이끌고자 합니다. 익숙한 파이썬 문법을 바탕으로 GPU에 직접 접근할 수 있어, 깊은 전문 지식 없이도 CPU와 GPU를 함께 활용하는 고성능 애플리케이션을 만들 수 있습니다.

์™œ ํผ์ฆ๋กœ ๋ฐฐ์šฐ๋Š”๊ฐ€?

๋Œ€๋ถ€๋ถ„์˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ž๋ฃŒ๋Š” ์‹ค์Šต์— ์•ž์„œ ๋ฐฉ๋Œ€ํ•œ ์ด๋ก ์„ ๋จผ์ € ๋‹ค๋ฃน๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ง์ ‘ ํ•ด๋ด์•ผ ์ดํ•ด๋˜๋Š” ์ถ”์ƒ์  ๊ฐœ๋…๋“ค์€ ์ž…๋ฌธ์ž์—๊ฒŒ ๋ถ€๋‹ด์ด ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ์ฑ…์€ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํƒํ•ฉ๋‹ˆ๋‹ค. ์‹ค์ „ ๋ฌธ์ œ์— ๋ฐ”๋กœ ๋›ฐ์–ด๋“ค์–ด, ๋‹จ๊ณ„์ ์œผ๋กœ ๊ฐœ๋…์„ ๋ฐœ๊ฒฌํ•ด ๋‚˜๊ฐ‘๋‹ˆ๋‹ค.

ํผ์ฆ ๊ธฐ๋ฐ˜ ํ•™์Šต์˜ ์žฅ์ :

  • ์ง์ ‘ ์ฒดํ—˜: GPU์—์„œ ๋ฐ”๋กœ ์‹คํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ์ ์ง„์  ๋ณต์žก๋„: ๊ฐ ํผ์ฆ์ด ์ด์ „์— ๋ฐฐ์šด ๊ฐœ๋… ์œ„์— ์Œ“์—ฌ๊ฐ‘๋‹ˆ๋‹ค
  • ์‹ค์šฉ์  ์ดˆ์ : ์‹ค์ œ ๊ณ„์‚ฐ ๋ฌธ์ œ๋ฅผ ๋ฐ˜์˜ํ•œ ํผ์ฆ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค
  • ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ: ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ์—ฐ์Šต์„ ํ†ตํ•ด ๋ฌธ์ œ ํ•ด๊ฒฐ ๊ฐ๊ฐ์„ ํ‚ค์›๋‹ˆ๋‹ค
  • ์ง€์‹ ์ •์ฐฉ: ์ง์ ‘ ํ’€์–ด๋ณด๋Š” ๊ฒƒ์ด ์ฝ๊ธฐ๋งŒ ํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ดํ•ด๊ฐ€ ๋” ๊นŠ์–ด์ง‘๋‹ˆ๋‹ค

์•”๊ธฐ๊ฐ€ ์•„๋‹Œ ๋ฐœ๊ฒฌ์— ์ค‘์ ์„ ๋‘ก๋‹ˆ๋‹ค. ์ง์ ‘ ์‹คํ—˜ํ•˜๋ฉด์„œ ๊ฐœ๋…์„ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ตํžˆ๊ณ , ๊นŠ์€ ์ดํ•ด์™€ ์‹ค์ „ ์—ญ๋Ÿ‰์„ ํ•จ๊ป˜ ์Œ“์•„๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐ์‚ฌ์˜ ๋ง: ์ด ์ฑ…์˜ Part I๊ณผ III์€ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ NVIDIA GPU ํ•™์Šต ํ”„๋กœ์ ํŠธ์ธ GPU Puzzles์—์„œ ํฐ ์˜๊ฐ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ์ด ์ฑ…์€ ํ•ด๋‹น ๊ฐœ๋…๋“ค์„ Mojo์˜ ์ถ”์ƒํ™”์™€ ์„ฑ๋Šฅ์„ ํ™œ์šฉํ•˜์—ฌ ์žฌ๊ตฌํ˜„ํ•˜๊ณ , Mojo์— ํŠนํ™”๋œ ์ตœ์ ํ™”๋กœ ๊ณ ๊ธ‰ ์ฃผ์ œ๋ฅผ ๋” ๋„“๊ฒŒ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‚ฌ๊ณ ๋ฐฉ์‹

ํšจ๊ณผ์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•ด์„œ๋Š” ๊ณ„์‚ฐ์„ ๋ฐ”๋ผ๋ณด๋Š” ๋ฐฉ์‹ ์ž์ฒด๋ฅผ ๋ฐ”๊ฟ”์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์•ž์œผ๋กœ์˜ ํ•™์Šต์— ๊ธธ์žก์ด๊ฐ€ ๋  ํ•ต์‹ฌ ์‚ฌ๊ณ  ๋ชจ๋ธ์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค:

์ˆœ์ฐจ์—์„œ ๋ณ‘๋ ฌ๋กœ: ๋ฐ˜๋ณต๋ฌธ์„ ์Šค๋ ˆ๋“œ๋กœ ๋Œ€์ฒด

๊ธฐ์กด CPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ๋ฐ˜๋ณต๋ฌธ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ์ˆœ์„œ๋Œ€๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

# CPU ๋ฐฉ์‹
for i in range(data_size):
    result[i] = process(data[i])

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ด ๋ฐฉ์‹์„ ์™„์ „ํžˆ ๋’ค์ง‘์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ์ˆœํšŒํ•˜๋Š” ๋Œ€์‹ , ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ํ• ๋‹นํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

# GPU ๋ฐฉ์‹ (๊ฐœ๋…์ )
thread_id = get_global_id()
if thread_id < data_size:
    result[thread_id] = process(data[thread_id])

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋งก์•„ ์ฒ˜๋ฆฌํ•˜๋ฏ€๋กœ, ๋ช…์‹œ์ ์ธ ๋ฐ˜๋ณต๋ฌธ์ด ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์‹คํ–‰์œผ๋กœ ๋ฐ”๋€๋‹ˆ๋‹ค. ์ˆœ์ฐจ ์ฒ˜๋ฆฌ์—์„œ ๋™์‹œ ์‹คํ–‰์œผ๋กœ์˜ ์ „ํ™˜์ด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์ž…๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์œ„์— ์—ฐ์‚ฐ ๊ทธ๋ฆฌ๋“œ ๋งž์ถ”๊ธฐ

๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ์กฐํ™”๋œ ๊ทธ๋ฆฌ๋“œ๋กœ, GPU ์Šค๋ ˆ๋“œ๊ฐ€ ์ด์— ๋Œ€์‘ํ•˜๋Š” ์—ฐ์‚ฐ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ˜•์„ฑํ•œ๋‹ค๊ณ  ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ํšจ๊ณผ์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ด ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์„ ์ž˜ ์„ค๊ณ„ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ๊ณต๊ฐ„์„ ์ตœ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ: ๊ฐ๊ฐ ํŠน์ • ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋‹ด๋‹นํ•˜๋Š” ๊ฐœ๋ณ„ ์ฒ˜๋ฆฌ ๋‹จ์œ„
  • ๋ธ”๋ก: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ๋™๊ธฐํ™” ๊ธฐ๋Šฅ์„ ๊ฐ–์ถ˜ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน
  • ๊ทธ๋ฆฌ๋“œ: ์ „์ฒด ๊ณ„์‚ฐ ๋ฌธ์ œ๋ฅผ ์•„์šฐ๋ฅด๋Š” ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์ž˜ํ•˜๋ ค๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ๋™๊ธฐํ™” ์š”๊ตฌ์‚ฌํ•ญ์„ ๊ด€๋ฆฌํ•˜๋ฉด์„œ ๋ณ‘๋ ฌ ํšจ์œจ์„ ์ตœ๋Œ€ํ•œ ๋Œ์–ด์˜ฌ๋ฆฌ๋„๋ก ์ด ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์˜ ๊ท ํ˜•์„ ์žก์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค.
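블록과 스레드 인덱스가 전역 인덱스로 합쳐져 데이터 전체를 빠짐없이 덮는 방식을 보여주는 파이썬 스케치입니다. CUDA 스타일의 1D 인덱싱 관례를 가정한 설명용 코드입니다:

```python
# 블록/스레드 계층 → 전역 인덱스 변환 스케치 (1D 기준 가정)
THREADS_PER_BLOCK = 4
data_size = 10

# 데이터 전체를 덮는 데 필요한 블록 수 (올림 나눗셈)
num_blocks = (data_size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK

covered = []
for block_idx in range(num_blocks):
    for thread_idx in range(THREADS_PER_BLOCK):
        # 전역 인덱스 = 블록 인덱스 * 블록 크기 + 블록 내 스레드 인덱스
        i = block_idx * THREADS_PER_BLOCK + thread_idx
        if i < data_size:  # 가드: 마지막 블록의 남는 스레드 걸러내기
            covered.append(i)

print(num_blocks, covered)  # 3개의 블록이 0..9를 정확히 한 번씩 덮음
```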

๋ฐ์ดํ„ฐ ์ด๋™ vs. ์—ฐ์‚ฐ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ์ž์ฒด๋ณด๋‹ค ๋ฐ์ดํ„ฐ๋ฅผ ์˜ฎ๊ธฐ๋Š” ๋น„์šฉ์ด ๋” ํด ๋•Œ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค:

  • CPU์™€ GPU ๊ฐ„ ๋ฐ์ดํ„ฐ ์ด๋™์€ ๋А๋ฆฝ๋‹ˆ๋‹ค
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ์˜ ์ด๋™์€ ๊ทธ๋ณด๋‹ค ๋น ๋ฆ…๋‹ˆ๋‹ค
  • ๋ ˆ์ง€์Šคํ„ฐ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ด๋ฏธ ์žˆ๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฒƒ์€ ๋งค์šฐ ๋น ๋ฆ…๋‹ˆ๋‹ค

์ด๋Š” ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ํ”ํžˆ ๊ฐ€์ง€๋Š” ๊ฐ€์ •์„ ๋’ค์ง‘์Šต๋‹ˆ๋‹ค. ๋ณ‘๋ชฉ์€ ์—ฐ์‚ฐ์ด ์•„๋‹ˆ๋ผ ๋ฐ์ดํ„ฐ ์ด๋™์ž…๋‹ˆ๋‹ค.
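간단한 어림 계산으로도 이 점을 확인해 볼 수 있습니다. 아래는 벡터 덧셈에서 데이터 이동 시간과 연산 시간을 비교하는 파이썬 스케치로, 대역폭과 연산 속도 수치는 설명을 위한 가정값입니다(실제 GPU마다 다릅니다):

```python
# 벡터 덧셈(c = a + b)의 "연산 대비 데이터 이동" 어림 계산 스케치
n = 1_000_000
bytes_moved = 3 * n * 4        # float32 배열 2개 읽기 + 1개 쓰기
flops = n                      # 요소당 덧셈 1회

mem_bandwidth = 1e12           # 가정: 초당 1 TB 메모리 대역폭
compute_rate = 10e12           # 가정: 초당 10 TFLOPs

time_memory = bytes_moved / mem_bandwidth
time_compute = flops / compute_rate

# 메모리 시간이 연산 시간보다 훨씬 큼 → 병목은 데이터 이동
print(time_memory > time_compute)  # True
```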

์ด ์ฑ…์˜ ํผ์ฆ๋“ค์„ ํ’€์–ด๊ฐ€๋ฉด์„œ ์ด๋Ÿฌํ•œ ์›์น™์„ ์ง๊ด€์ ์œผ๋กœ ์ฒด๋“ํ•˜๊ณ , ๊ณ„์‚ฐ ๋ฌธ์ œ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹์„ ๋ฐ”๊ฟ” ๋‚˜๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ

์ด ์ฑ…์€ ๊ธฐ์ดˆ ์›๋ฆฌ๋ถ€ํ„ฐ ๊ณ ๊ธ‰ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋ฒ•๊นŒ์ง€ ๋‹ค๋ฃน๋‹ˆ๋‹ค. GPU๋ฅผ ์•Œ ์ˆ˜ ์—†๋Š” ๋ธ”๋ž™๋ฐ•์Šค๋กœ ๋‘์ง€ ์•Š๊ณ , ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ์˜ ๋™์ž‘๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊นŒ์ง€ ๋‹จ๊ณ„๋ณ„๋กœ ์ดํ•ด๋ฅผ ์Œ“์•„๊ฐ‘๋‹ˆ๋‹ค. ์ €์ˆ˜์ค€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์™€ ๊ณ ์ˆ˜์ค€ ํ…์„œ ์ถ”์ƒํ™”๋ฅผ ๋ชจ๋‘ ๋ฐฐ์›€์œผ๋กœ์จ, ์–ด๋–ค GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณผ์ œ์—๋„ ์œ ์—ฐํ•˜๊ฒŒ ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

ํ˜„์žฌ ํ•™์Šต ๊ณผ์ •

| 핵심 기술 | 상태 | 퍼즐 |
| --- | --- | --- |
| 스레드/블록 기초 | ✅ 제공 중 | Part I (1-8) |
| GPU 프로그램 디버깅 | ✅ 제공 중 | Part II (9-10) |
| 핵심 알고리즘 | ✅ 제공 중 | Part III (11-16) |
| MAX 그래프 통합 | ✅ 제공 중 | Part IV (17-19) |
| PyTorch 통합 | ✅ 제공 중 | Part V (20-22) |
| 함수형 패턴 및 벤치마킹 | ✅ 제공 중 | Part VI (23) |
| 워프 프로그래밍 | ✅ 제공 중 | Part VII (24-26) |
| 블록 수준 프로그래밍 | ✅ 제공 중 | Part VIII (27) |
| 고급 메모리 연산 | ✅ 제공 중 | Part IX (28-29) |
| 성능 분석 | ✅ 제공 중 | Part X (30-32) |
| 최신 GPU 기능 | ✅ 제공 중 | Part XI (33-34) |

์ƒ์„ธ ํ•™์Šต ๋ชฉํ‘œ

Part I: GPU ๊ธฐ์ดˆ (ํผ์ฆ 1-8) โœ…

  • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ๊ณผ ๋ธ”๋ก ๊ตฌ์„ฑ ๋ฐฐ์šฐ๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ๊ฐ€๋“œ ์ดํ•ดํ•˜๊ธฐ
  • ์›์‹œ ํฌ์ธํ„ฐ์™€ TileTensor ์ถ”์ƒํ™” ๋ชจ๋‘ ๋‹ค๋ค„๋ณด๊ธฐ
  • ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ ์ตํžˆ๊ธฐ

Part II: GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… (ํผ์ฆ 9-10) โœ…

  • GPU ๋””๋ฒ„๊ฑฐ์™€ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ• ๋ฐฐ์šฐ๊ธฐ
  • ์ƒˆ๋‹ˆํƒ€์ด์ €๋กœ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜์™€ ๊ฒฝ์Ÿ ์ƒํƒœ ์ฐพ๊ธฐ
  • GPU ๋ฒ„๊ทธ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์‹๋ณ„ํ•˜๊ณ  ์ˆ˜์ •ํ•˜๊ธฐ
  • ๋ณต์žกํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณผ์ œ์— ๋„์ „ํ•  ์ž์‹ ๊ฐ ์Œ“๊ธฐ

์ฐธ๊ณ : ๋””๋ฒ„๊น… ํผ์ฆ์„ ์‹คํ–‰ํ•˜๋ ค๋ฉด NVIDIA GPU ๋””๋ฒ„๊น… ๋„๊ตฌ ์ ‘๊ทผ์„ ์œ„ํ•œ pixi๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. CUDA๋ฅผ ์ง€์›ํ•˜๋Š” NVIDIA GPU์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

Part III: GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜ (ํผ์ฆ 11-16) โœ…

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜๊ณผ ํ’€๋ง ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • ํšจ์œจ์ ์ธ ํ•ฉ์„ฑ๊ณฑ ์ปค๋„ ๋งŒ๋“ค๊ธฐ
  • ๋ˆ„์  ํ•ฉ(์Šค์บ”) ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฐฐ์šฐ๊ธฐ
  • ํƒ€์ผ๋ง ์ „๋žต์œผ๋กœ ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ตœ์ ํ™”ํ•˜๊ธฐ

Part IV: MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ (ํผ์ฆ 17-19) โœ…

  • ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ ๋งŒ๋“ค๊ธฐ
  • GPU ์ปค๋„๊ณผ ํŒŒ์ด์ฌ ์ฝ”๋“œ ์—ฐ๊ฒฐํ•˜๊ธฐ
  • ์†Œํ”„ํŠธ๋งฅ์Šค, ์–ดํ…์…˜ ๊ฐ™์€ ํ”„๋กœ๋•์…˜ ์ˆ˜์ค€์˜ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ

Part V: PyTorch ํ†ตํ•ฉ (ํผ์ฆ 20-22) โœ…

  • Mojo GPU ์ปค๋„๊ณผ PyTorch ํ…์„œ ์—ฐ๊ฒฐํ•˜๊ธฐ
  • CustomOpLibrary๋กœ ํ…์„œ ๋งˆ์ƒฌ๋ง์„ ๋งค๋„๋Ÿฝ๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ธฐ
  • torch.compile๊ณผ ํ†ตํ•ฉํ•˜์—ฌ ์‹คํ–‰ ์ตœ์ ํ™”ํ•˜๊ธฐ
  • ์ปค๋„ ํ“จ์ „๊ณผ ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋ฐฐ์šฐ๊ธฐ

Part VI: Mojo ํ•จ์ˆ˜ํ˜• ํŒจํ„ด ๋ฐ ๋ฒค์น˜๋งˆํ‚น (ํผ์ฆ 23) โœ…

  • ํ•จ์ˆ˜ํ˜• ํŒจํ„ด ๋ฐฐ์šฐ๊ธฐ: elementwise, tiled ์ฒ˜๋ฆฌ, vectorization
  • ์ฒด๊ณ„์ ์ธ ์„ฑ๋Šฅ ์ตœ์ ํ™”์™€ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ ์ตํžˆ๊ธฐ
  • ์ •๋Ÿ‰์  ๋ฒค์น˜๋งˆํ‚น์œผ๋กœ ์„ฑ๋Šฅ ๋ถ„์„ํ•˜๊ธฐ
  • GPU ์Šค๋ ˆ๋”ฉ vs SIMD ์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

Part VII: ์›Œํ”„ ์ˆ˜์ค€ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (ํผ์ฆ 24-26) โœ…

  • ์›Œํ”„ ๊ธฐ์ดˆ์™€ SIMT ์‹คํ–‰ ๋ชจ๋ธ ๋ฐฐ์šฐ๊ธฐ
  • ํ•ต์‹ฌ ์›Œํ”„ ์—ฐ์‚ฐ ์ตํžˆ๊ธฐ: sum, shuffle_down, broadcast
  • shuffle_xor์™€ prefix_sum์œผ๋กœ ๊ณ ๊ธ‰ ํŒจํ„ด ๊ตฌํ˜„ํ•˜๊ธฐ
  • ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์„ ํšจ๊ณผ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•˜๊ธฐ
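shuffle_down 기반 워프 리덕션의 동작 원리를 파이썬으로 흉내 낸 스케치입니다. 일반적인 32-레인 워프를 가정한 설명용 코드이며, 실제 워프 연산 API는 해당 퍼즐에서 다룹니다:

```python
# shuffle_down 합산을 흉내 낸 개념 스케치 (32-레인 워프 가정)
WARP_SIZE = 32
lane_vals = [float(i) for i in range(WARP_SIZE)]  # 각 레인이 가진 값

offset = WARP_SIZE // 2
while offset > 0:
    # shuffle_down(v, offset): lane i가 lane i+offset의 값을 받아옵니다.
    # 하드웨어처럼 모든 레인이 '이전 스냅샷'을 동시에 읽습니다.
    shuffled = [lane_vals[i + offset] if i + offset < WARP_SIZE else 0.0
                for i in range(WARP_SIZE)]
    lane_vals = [lane_vals[i] + shuffled[i] for i in range(WARP_SIZE)]
    offset //= 2

# 다른 레인에는 부분합이 남지만, lane 0에는 전체 합이 모입니다
print(lane_vals[0])  # 496.0 == 0+1+...+31
```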

Part VIII: ๋ธ”๋ก ์ˆ˜์ค€ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (ํผ์ฆ 27) โœ…

  • block.sum()๊ณผ block.max()๋กœ ๋ธ”๋ก ๋‹จ์œ„ ๋ฆฌ๋•์…˜ ๋ฐฐ์šฐ๊ธฐ
  • ๋ธ”๋ก ์ˆ˜์ค€ ๋ˆ„์  ํ•ฉ ํŒจํ„ด๊ณผ ํ†ต์‹  ์ตํžˆ๊ธฐ
  • block.broadcast()๋กœ ๋ธ”๋ก ๋‚ด ์กฐ์œจ ํšจ์œจ์ ์œผ๋กœ ๊ตฌํ˜„ํ•˜๊ธฐ

Part IX: ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ (ํผ์ฆ 28-29) โœ…

  • ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ํŒจํ„ด ๊ตฌํ˜„ํ•˜๊ธฐ
  • ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์œผ๋กœ ์—ฐ์‚ฐ๊ณผ ์ „์†ก์„ ๊ฒน์ณ ์ง€์—ฐ ์‹œ๊ฐ„ ์ˆจ๊ธฐ๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ํŽœ์Šค์™€ ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ ๋ฐฐ์šฐ๊ธฐ
  • ํ”„๋ฆฌํŽ˜์นญ๊ณผ ์บ์‹œ ์ตœ์ ํ™” ์ „๋žต ์ตํžˆ๊ธฐ

Part X: ์„ฑ๋Šฅ ๋ถ„์„ ๋ฐ ์ตœ์ ํ™” (ํผ์ฆ 30-32) โœ…

  • GPU ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ๋ณ‘๋ชฉ ์ง€์  ์ฐพ๊ธฐ
  • ์ ์œ ์œจ๊ณผ ๋ฆฌ์†Œ์Šค ํ™œ์šฉ๋„ ์ตœ์ ํ™”ํ•˜๊ธฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ ์ œ๊ฑฐํ•˜๊ธฐ

Part XI: ๊ณ ๊ธ‰ GPU ๊ธฐ๋Šฅ (ํผ์ฆ 33-34) โœ…

  • AI ์›Œํฌ๋กœ๋“œ๋ฅผ ์œ„ํ•œ ํ…์„œ ์ฝ”์–ด ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฐฐ์šฐ๊ธฐ
  • ํ˜„๋Œ€ GPU์˜ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฐฐ์šฐ๊ธฐ

์ด ์ฑ…์€ ๊ธฐ์กด ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ, ๋จผ์ € ์ €์ˆ˜์ค€ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ž‘์œผ๋กœ ์ดํ•ด๋ฅผ ์Œ“์€ ๋’ค ์ ์ง„์ ์œผ๋กœ Mojo์˜ TileTensor ์ถ”์ƒํ™”๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด GPU ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด์™€ ํ˜„๋Œ€์  ํ…์„œ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์˜ ์‹ค์šฉ์  ์ง€์‹์„ ๋ชจ๋‘ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”?

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์™œ ์ค‘์š”ํ•œ์ง€, Mojo๊ฐ€ ์™œ ์ ํ•ฉํ•œ์ง€, ๊ทธ๋ฆฌ๊ณ  ํผ์ฆ๋กœ ์–ด๋–ป๊ฒŒ ๋ฐฐ์šฐ๋Š”์ง€ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด์ œ ์‹œ์ž‘ํ•ด ๋ด…์‹œ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: ํผ์ฆ ์‚ฌ์šฉ ๊ฐ€์ด๋“œ์—์„œ ํ™˜๊ฒฝ ์„ค์ •, ์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ, ์ฒซ ๋ฒˆ์งธ ํผ์ฆ ์‹คํ–‰ ๋ฐฉ๋ฒ•์„ ํ™•์ธํ•˜์„ธ์š”.

ํผ์ฆ ์‚ฌ์šฉ ๊ฐ€์ด๋“œ

๊ฐ ํผ์ฆ์€ ๋‹จ๊ณ„์ ์œผ๋กœ ์‹ค๋ ฅ์„ ์Œ“์„ ์ˆ˜ ์žˆ๋„๋ก ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ๊ด€๋œ ๊ตฌ์กฐ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๊ฐœ์š”: ๋ฌธ์ œ ์ •์˜์™€ ํ•ต์‹ฌ ๊ฐœ๋… ์†Œ๊ฐœ
  • ๊ตฌ์„ฑ: ๊ธฐ์ˆ ์  ์„ค์ •๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ ์„ค๋ช…
  • ์™„์„ฑํ•  ์ฝ”๋“œ: problems/pXX/์— ์ฑ„์›Œ์•ผ ํ•  ๋ถ€๋ถ„์ด ํ‘œ์‹œ๋œ ๊ตฌํ˜„ ํ…œํ”Œ๋ฆฟ
  • ํžŒํŠธ: ํ•„์š”ํ•  ๋•Œ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ๋Š” ์ „๋žต์  ํžŒํŠธ๋กœ, ์ •๋‹ต์„ ์ง์ ‘ ์•Œ๋ ค์ฃผ์ง€ ์•Š์Šต๋‹ˆ๋‹ค
  • ํ’€์ด: ์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ๊ณผ ๊ฐœ๋… ์„ค๋ช…์„ ํฌํ•จํ•œ ์ข…ํ•ฉ ๋ถ„์„

ํผ์ฆ์€ ์ด์ „์— ๋ฐฐ์šด ๊ฐœ๋… ์œ„์— ์ƒˆ๋กœ์šด ๊ฐœ๋…์„ ์Œ“์•„๊ฐ€๋ฉฐ ์ ์ฐจ ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค. ๊ณ ๊ธ‰ ํผ์ฆ์€ ์•ž์„  ํผ์ฆ์˜ ๊ฐœ๋…์„ ์•Œ๊ณ  ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋ฏ€๋กœ, ์ˆœ์„œ๋Œ€๋กœ ํ’€์–ด๋‚˜๊ฐ€๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰ํ•˜๊ธฐ

๋ชจ๋“  ํผ์ฆ์—๋Š” ๊ตฌํ˜„ ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ƒ ๊ฒฐ๊ณผ์™€ ๋น„๊ตํ•ด์ฃผ๋Š” ํ…Œ์ŠคํŠธ ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ํผ์ฆ๋ณ„๋กœ ์‹คํ–‰ ๋ฐฉ๋ฒ•๊ณผ ๊ฒ€์ฆ ์ ˆ์ฐจ๊ฐ€ ์•ˆ๋‚ด๋ฉ๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„

์‹œ์Šคํ…œ ์š”๊ตฌ์‚ฌํ•ญ

먼저 사용 중인 시스템이 아래의 시스템 요구사항을 충족하는지 확인하세요.

์ง€์›๋˜๋Š” GPU

ํผ์ฆ์„ ์‹คํ–‰ํ•˜๋ ค๋ฉด ์ง€์›๋˜๋Š” GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํ™˜๊ฒฝ ์„ค์ •์„ ๋งˆ์นœ ๋’ค ์•„๋ž˜ ํ™˜๊ฒฝ ์„ค์ •์˜ gpu-specs ๋ช…๋ น์–ด๋กœ GPU ํ˜ธํ™˜์„ฑ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์šด์˜์ฒด์ œ

[!NOTE] ์šด์˜์ฒด์ œ๋ณ„ GPU ์ง€์› ์„ค์ • ๋ฐฉ๋ฒ•์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค.

Windows WSL2 with NVIDIA

Windows Subsystem for Linux(WSL2, ์˜ˆ: Ubuntu)์—์„œ NVIDIA GPU๋ฅผ ์„ค์ •ํ•˜๋ ค๋ฉด NVIDIA CUDA on WSL ๊ฐ€์ด๋“œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

ํ•ต์‹ฌ์€ Windows์šฉ NVIDIA CUDA ๋“œ๋ผ์ด๋ฒ„๋ฅผ ์„ค์น˜ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด ๋“œ๋ผ์ด๋ฒ„๊ฐ€ WSL2๋ฅผ ์™„๋ฒฝํžˆ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. Windows์— NVIDIA GPU ๋“œ๋ผ์ด๋ฒ„๋ฅผ ์„ค์น˜ํ•˜๋ฉด WSL 2 ์•ˆ์—์„œ CUDA๋ฅผ ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Windows ํ˜ธ์ŠคํŠธ์˜ CUDA ๋“œ๋ผ์ด๋ฒ„๊ฐ€ WSL 2 ๋‚ด๋ถ€์—์„œ libcuda.so๋กœ ์Šคํ…(stub) ์ฒ˜๋ฆฌ๋˜๋ฏ€๋กœ, WSL 2 ์•ˆ์— ๋ณ„๋„์˜ NVIDIA GPU Linux ๋“œ๋ผ์ด๋ฒ„๋ฅผ ์„ค์น˜ํ•ด์„œ๋Š” ์•ˆ ๋ฉ๋‹ˆ๋‹ค.

๋“œ๋ผ์ด๋ฒ„ ์„ค์น˜ ํ›„ ์ •์ƒ ๋™์ž‘์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

Windows์—์„œ ํ™•์ธ: PowerShell์„ ์—ฝ๋‹ˆ๋‹ค (WSL์ด ์•„๋‹™๋‹ˆ๋‹ค)

nvidia-smi

WSL ๋‚ด๋ถ€์—์„œ ํ™•์ธ: (๋จผ์ € WSL์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ: wsl -d Ubuntu)

ls -l /usr/lib/wsl/lib/nvidia-smi
/usr/lib/wsl/lib/nvidia-smi

Pixi์—์„œ ์„ค์ •์„ ํ™•์ธํ•˜๊ณ , ํ•„์š”์‹œ ๋ˆ„๋ฝ๋œ ์š”๊ตฌ์‚ฌํ•ญ์„ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค (์˜ˆ: cuda-gdb ๋””๋ฒ„๊น…์šฉ)

pixi run nvidia-smi
pixi run setup-cuda-gdb
pixi run mojo debug --help
pixi run cuda-gdb --version

WSL์—์„œ๋Š” VS Code๋ฅผ ์—๋””ํ„ฐ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • Windows์—์„œ https://code.visualstudio.com/์„ ํ†ตํ•ด VS Code๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋Ÿฐ ๋‹ค์Œ Remote - WSL ํ™•์žฅ์„ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค.

[!NOTE] ํผ์ฆ 1-15๋Š” ๋ชจ๋‘ WSL๊ณผ Linux์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

Linux native with NVIDIA

๋จผ์ € GPU์™€ Ubuntu ๋ฒ„์ „์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค (์ง€์›๋˜๋Š” Ubuntu LTS: 20.04, 22.04, 24.04)

lspci | grep -i nvidia
lsb_release -a

NVIDIA ๋“œ๋ผ์ด๋ฒ„๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค (ํ•„์ˆ˜)

sudo ubuntu-drivers devices
sudo ubuntu-drivers autoinstall
sudo reboot

Linux์—์„œ๋Š” VS Code๋ฅผ ์—๋””ํ„ฐ๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. VS Code APT ์ €์žฅ์†Œ๋ฅผ ํ†ตํ•ด ์„ค์น˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Microsoft GPG ํ‚ค ๊ฐ€์ ธ์˜ค๊ธฐ

wget -qO- https://packages.microsoft.com/keys/microsoft.asc \
  | gpg --dearmor \
  | sudo tee /usr/share/keyrings/packages.microsoft.gpg > /dev/null

VS Code APT ์ €์žฅ์†Œ ์ถ”๊ฐ€

echo "deb [arch=amd64 signed-by=/usr/share/keyrings/packages.microsoft.gpg] \
https://packages.microsoft.com/repos/code stable main" \
| sudo tee /etc/apt/sources.list.d/vscode.list

VS Code ์„ค์น˜ ๋ฐ ํ™•์ธ

sudo apt update
sudo apt install code
code --version

[!NOTE] ํผ์ฆ 1-15๋Š” ๋ชจ๋‘ Linux์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

macOS Apple Silicon

osx-arm64 ์‚ฌ์šฉ์ž๋Š” ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • macOS 15.0 ์ด์ƒ โ€” ์ตœ์  ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•ด ๊ถŒ์žฅ๋ฉ๋‹ˆ๋‹ค. pixi run check-macos๋กœ ํ™•์ธํ•˜๊ณ , ์‹คํŒจํ•˜๋ฉด ์—…๊ทธ๋ ˆ์ด๋“œํ•˜์„ธ์š”.
  • Xcode 16 ์ด์ƒ โ€” ์ตœ์†Œ ์š”๊ตฌ์‚ฌํ•ญ์ž…๋‹ˆ๋‹ค. xcodebuild -version์œผ๋กœ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

xcrun -sdk macosx metal ์‹คํ–‰ ์‹œ cannot execute tool 'metal' due to missing Metal toolchain ์˜ค๋ฅ˜๊ฐ€ ๋‚˜ํƒ€๋‚˜๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

xcodebuild -downloadComponent MetalToolchain

이후 xcrun -sdk macosx metal을 다시 실행했을 때 no input files 오류가 나타나면 정상적으로 설치된 것입니다.

[!NOTE] ํ˜„์žฌ ํผ์ฆ 1-8๊ณผ 11-15๊ฐ€ macOS์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋” ๋งŽ์€ ํผ์ฆ ์ง€์›์„ ์ค€๋น„ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค!

ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ง€์‹

๋‹ค์Œ์— ๋Œ€ํ•œ ๊ธฐ๋ณธ์ ์ธ ์ดํ•ด๊ฐ€ ์žˆ์œผ๋ฉด ์ข‹์Šต๋‹ˆ๋‹ค:

  • ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ์ดˆ (๋ณ€์ˆ˜, ๋ฐ˜๋ณต๋ฌธ, ์กฐ๊ฑด๋ฌธ, ํ•จ์ˆ˜)
  • ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ๊ฐœ๋… (์Šค๋ ˆ๋“œ, ๋™๊ธฐํ™”, ๊ฒฝ์Ÿ ์ƒํƒœ)
  • Mojo ๊ธฐ๋ณธ ๋ฌธ๋ฒ• (ํฌ์ธํ„ฐ ์ž…๋ฌธ ์„น์…˜ ํฌํ•จ)
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ์ดˆ๋ฅผ ๋ฏธ๋ฆฌ ์ฝ์–ด๋‘๋ฉด ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค!

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฒฝํ—˜์ด ์—†์–ด๋„ ๊ดœ์ฐฎ์Šต๋‹ˆ๋‹ค! ํผ์ฆ์„ ํ’€์–ด๊ฐ€๋ฉฐ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์ตํž ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Mojo๐Ÿ”ฅ์™€ ํ•จ๊ป˜ GPU ์ปดํ“จํŒ…์˜ ์„ธ๊ณ„๋กœ ๋– ๋‚˜๋ด…์‹œ๋‹ค!

ํ™˜๊ฒฝ ์„ค์ •ํ•˜๊ธฐ

  1. GitHub ์ €์žฅ์†Œ๋ฅผ ํด๋ก ํ•˜๊ณ  ํ•ด๋‹น ๋””๋ ‰ํ† ๋ฆฌ๋กœ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค:

    # ์ €์žฅ์†Œ ํด๋ก 
    git clone https://github.com/modular/mojo-gpu-puzzles
    cd mojo-gpu-puzzles
    
  2. Mojo๐Ÿ”ฅ ํ”„๋กœ๊ทธ๋žจ์„ ์‹คํ–‰ํ•˜๊ธฐ ์œ„ํ•œ ํŒจํ‚ค์ง€ ๋งค๋‹ˆ์ €๋ฅผ ์„ค์น˜ํ•ฉ๋‹ˆ๋‹ค:

์˜ต์…˜ 1 (๊ฐ•๋ ฅ ์ถ”์ฒœ): pixi

์ด ํ”„๋กœ์ ํŠธ์—์„œ `pixi`๋ฅผ **๊ถŒ์žฅํ•˜๋Š” ์ด์œ **๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:
  • Modular์˜ MAX/Mojo ํŒจํ‚ค์ง€์— ์‰ฝ๊ฒŒ ์ ‘๊ทผ ๊ฐ€๋Šฅ
  • GPU ์˜์กด์„ฑ์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • conda + PyPI ์ƒํƒœ๊ณ„๋ฅผ ๋ชจ๋‘ ์ง€์›
> **์ฐธ๊ณ : ์ผ๋ถ€ ํผ์ฆ์€ `pixi`์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค**

**์„ค์น˜:**

```bash
curl -fsSL https://pixi.sh/install.sh | sh
```

**업데이트:**

```bash
pixi self-update
```

์˜ต์…˜ 2: uv

**์„ค์น˜:**

```bash
curl -fsSL https://astral.sh/uv/install.sh | sh
```

**์—…๋ฐ์ดํŠธ:**

```bash
uv self update
```

**๊ฐ€์ƒ ํ™˜๊ฒฝ ์ƒ์„ฑ:**

```bash
uv venv && source .venv/bin/activate
```
  1. ์„ค์ •์„ ํ™•์ธํ•˜๊ณ  ์ฒซ ๋ฒˆ์งธ ํผ์ฆ์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค:
# GPU ์‚ฌ์–‘ ํ™•์ธ
pixi run gpu-specs

# ์ฒซ ๋ฒˆ์งธ ํผ์ฆ ์‹คํ–‰
# ์•„์ง ๊ตฌํ˜„ ์ „์ด๋ฏ€๋กœ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค! ๋ณธ๋ฌธ์„ ๋”ฐ๋ผ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”
pixi run p01
# GPU ์‚ฌ์–‘ ํ™•์ธ
pixi run gpu-specs

# ์ฒซ ๋ฒˆ์งธ ํผ์ฆ ์‹คํ–‰
# ์•„์ง ๊ตฌํ˜„ ์ „์ด๋ฏ€๋กœ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค! ๋ณธ๋ฌธ์„ ๋”ฐ๋ผ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”
pixi run -e amd p01
# GPU ์‚ฌ์–‘ ํ™•์ธ
pixi run gpu-specs

# ์ฒซ ๋ฒˆ์งธ ํผ์ฆ ์‹คํ–‰
# ์•„์ง ๊ตฌํ˜„ ์ „์ด๋ฏ€๋กœ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค! ๋ณธ๋ฌธ์„ ๋”ฐ๋ผ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”
pixi run -e apple p01
# GPU๋ณ„ ์˜์กด์„ฑ ์„ค์น˜
uv pip install -e ".[nvidia]"  # NVIDIA GPU์šฉ
# ๋˜๋Š”
uv pip install -e ".[amd]"     # AMD GPU์šฉ

# GPU ์‚ฌ์–‘ ํ™•์ธ
uv run poe gpu-specs

# ์ฒซ ๋ฒˆ์งธ ํผ์ฆ ์‹คํ–‰
# ์•„์ง ๊ตฌํ˜„ ์ „์ด๋ฏ€๋กœ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค! ๋ณธ๋ฌธ์„ ๋”ฐ๋ผ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”
uv run poe p01

ํผ์ฆ ํ’€๊ธฐ

ํ”„๋กœ์ ํŠธ ๊ตฌ์กฐ

  • problems/: ํ’€์ด๋ฅผ ์ง์ ‘ ๊ตฌํ˜„ํ•˜๋Š” ๊ณณ์ž…๋‹ˆ๋‹ค (์—ฌ๊ธฐ์„œ ์ž‘์—…ํ•ฉ๋‹ˆ๋‹ค!)
  • solutions/: ๋น„๊ต์™€ ํ•™์Šต์„ ์œ„ํ•œ ์ฐธ๊ณ  ํ’€์ด์ž…๋‹ˆ๋‹ค. ์ฑ… ์ „๋ฐ˜์— ๊ฑธ์ณ ํ™œ์šฉ๋ฉ๋‹ˆ๋‹ค

์ž‘์—… ํ๋ฆ„

  1. problems/pXX/์—์„œ ํผ์ฆ ํ…œํ”Œ๋ฆฟ์„ ์—ฝ๋‹ˆ๋‹ค
  2. ์ œ๊ณต๋œ ํ”„๋ ˆ์ž„์›Œํฌ ์•ˆ์— ํ’€์ด๋ฅผ ์ž‘์„ฑํ•ฉ๋‹ˆ๋‹ค
  3. ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•ฉ๋‹ˆ๋‹ค: pixi run pXX ๋˜๋Š” uv run poe pXX (ํ”Œ๋žซํผ์— ๋”ฐ๋ผ -e platform์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ: -e amd)
  4. solutions/pXX/์˜ ์ฐธ๊ณ  ํ’€์ด์™€ ๋น„๊ตํ•˜๋ฉฐ ๋‹ค๋ฅธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋ฐฐ์›๋‹ˆ๋‹ค

์ฃผ์š” ๋ช…๋ น์–ด

# ํผ์ฆ ์‹คํ–‰ (ํ•„์š”์‹œ -e๋กœ ํ”Œ๋žซํผ ์ง€์ •)
pixi run pXX             # NVIDIA (๊ธฐ๋ณธ๊ฐ’) `pixi run -e nvidia pXX`์™€ ๋™์ผ
pixi run -e amd pXX      # AMD GPU
pixi run -e apple pXX    # Apple GPU

# ํ’€์ด ํ…Œ์ŠคํŠธ
pixi run tests           # ๋ชจ๋“  ํ’€์ด ํ…Œ์ŠคํŠธ
pixi run tests pXX       # ํŠน์ • ํผ์ฆ ํ…Œ์ŠคํŠธ

# ์ˆ˜๋™ ์‹คํ–‰
pixi run mojo problems/pXX/pXX.mojo     # ๋‚ด ๊ตฌํ˜„
pixi run mojo solutions/pXX/pXX.mojo    # ์ฐธ๊ณ  ํ’€์ด

# ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ ์…ธ
pixi shell               # ํ™˜๊ฒฝ ์ง„์ž…
mojo problems/p01/p01.mojo              # ์ง์ ‘ ์‹คํ–‰
exit                     # ์…ธ ์ข…๋ฃŒ

# ๊ฐœ๋ฐœ
pixi run format         # ์ฝ”๋“œ ํฌ๋งทํŒ…
pixi task list          # ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ช…๋ น์–ด
# ์ฐธ๊ณ : uv๋Š” ์ œํ•œ์ ์ด๋ฉฐ ์ผ๋ถ€ ์ฑ•ํ„ฐ๋Š” pixi๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
# GPU๋ณ„ ์˜์กด์„ฑ ์„ค์น˜:
uv pip install -e ".[nvidia]"  # NVIDIA GPU์šฉ
uv pip install -e ".[amd]"     # AMD GPU์šฉ

# ํ’€์ด ํ…Œ์ŠคํŠธ
uv run poe tests        # ๋ชจ๋“  ํ’€์ด ํ…Œ์ŠคํŠธ
uv run poe tests pXX    # ํŠน์ • ํผ์ฆ ํ…Œ์ŠคํŠธ

# ์ˆ˜๋™ ์‹คํ–‰
uv run mojo problems/pXX/pXX.mojo      # ๋‚ด ๊ตฌํ˜„
uv run mojo solutions/pXX/pXX.mojo     # ์ฐธ๊ณ  ํ’€์ด

GPU ์ง€์› ํ˜„ํ™ฉ

์•„๋ž˜ ํ‘œ๋Š” ํผ์ฆ๋ณ„ GPU ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํผ์ฆ์— ๋”ฐ๋ผ ํ•„์š”ํ•œ GPU ๊ธฐ๋Šฅ๊ณผ ๋ฒค๋”๋ณ„ ๋„๊ตฌ๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

ํผ์ฆNVIDIA GPUAMD GPUApple GPU๋น„๊ณ 
Part I: GPU ๊ธฐ์ดˆ
1 - Mapโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
2 - Zipโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
3 - ๊ฐ€๋“œโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
4 - Map 2Dโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
5 - ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
6 - ๋ธ”๋กโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
7 - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
8 - ์Šคํ…์‹คโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
Part II: ๋””๋ฒ„๊น…
9 - GPU ๋””๋ฒ„๊ฑฐโœ…โŒโŒNVIDIA ์ „์šฉ ๋””๋ฒ„๊น… ๋„๊ตฌ
10 - ์ƒˆ๋‹ˆํƒ€์ด์ €โœ…โŒโŒNVIDIA ์ „์šฉ ๋””๋ฒ„๊น… ๋„๊ตฌ
Part III: GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜
11 - ๋ฆฌ๋•์…˜โœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
12 - ์Šค์บ”โœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
13 - ํ’€๋งโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
14 - ํ•ฉ์„ฑ๊ณฑโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
15 - ํ–‰๋ ฌ ๊ณฑ์…ˆโœ…โœ…โœ…๊ธฐ๋ณธ GPU ์ปค๋„
16 - Flashdotโœ…โœ…โœ…๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด
Part IV: MAX ๊ทธ๋ž˜ํ”„
17 - ์ปค์Šคํ…€ Opโœ…โœ…โœ…MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ
18 - ์†Œํ”„ํŠธ๋งฅ์Šคโœ…โœ…โœ…MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ
19 - ์–ดํ…์…˜โœ…โœ…โœ…MAX ๊ทธ๋ž˜ํ”„ ํ†ตํ•ฉ
Part V: PyTorch ํ†ตํ•ฉ
20 - Torch ๋ธŒ๋ฆฟ์ง€โœ…โœ…โŒPyTorch ํ†ตํ•ฉ
21 - ์˜คํ† ๊ทธ๋ž˜๋“œโœ…โœ…โŒPyTorch ํ†ตํ•ฉ
22 - ํ“จ์ „โœ…โœ…โŒPyTorch ํ†ตํ•ฉ
Part VI: ํ•จ์ˆ˜ํ˜• ํŒจํ„ด
23 - ํ•จ์ˆ˜ํ˜•โœ…โœ…โœ…๊ณ ๊ธ‰ Mojo ํŒจํ„ด
Part VII: ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ
24 - ์›Œํ”„ ํ•ฉ๊ณ„โœ…โœ…โœ…์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ
25 - ์›Œํ”„ ํ†ต์‹ โœ…โœ…โœ…์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ
26 - ๊ณ ๊ธ‰ ์›Œํ”„โœ…โœ…โœ…์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ
Part VIII: ๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ
27 - ๋ธ”๋ก ์—ฐ์‚ฐโœ…โœ…โœ…๋ธ”๋ก ๋‹จ์œ„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด
Part IX: ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ
28 - ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌโœ…โœ…โœ…๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
29 - ๋ฐฐ๋ฆฌ์–ดโœ…โŒโŒNVIDIA ์ „์šฉ ๊ณ ๊ธ‰ ๋™๊ธฐํ™”
Part X: ์„ฑ๋Šฅ ๋ถ„์„
30 - ํ”„๋กœํŒŒ์ผ๋งโœ…โŒโŒNVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ (NSight)
31 - ์ ์œ ์œจโœ…โŒโŒNVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ
32 - ๋ฑ…ํฌ ์ถฉ๋Œโœ…โŒโŒNVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ
Part XI: ์ตœ์‹  GPU ๊ธฐ๋Šฅ
33 - ํ…์„œ ์ฝ”์–ดโœ…โŒโŒNVIDIA ํ…์„œ ์ฝ”์–ด ์ „์šฉ
34 - ํด๋Ÿฌ์Šคํ„ฐโœ…โŒโŒNVIDIA ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ

๋ฒ”๋ก€

  • โœ… ์ง€์›: ํ•ด๋‹น ํ”Œ๋žซํผ์—์„œ ํผ์ฆ์ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค
  • โŒ ๋ฏธ์ง€์›: ํ”Œ๋žซํผ๋ณ„ ๊ณ ์œ  ๊ธฐ๋Šฅ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค

ํ”Œ๋žซํผ๋ณ„ ์ฐธ๊ณ ์‚ฌํ•ญ

NVIDIA GPU (์ „์ฒด ์ง€์›)

  • ๋ชจ๋“  ํผ์ฆ(1-34)์ด CUDA๋ฅผ ์ง€์›ํ•˜๋Š” NVIDIA GPU์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค
  • CUDA ํˆดํ‚ท๊ณผ ํ˜ธํ™˜ ๋“œ๋ผ์ด๋ฒ„๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ๋ชจ๋“  ๊ธฐ๋Šฅ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์–ด ๊ฐ€์žฅ ์™„์ „ํ•œ ํ•™์Šต ๊ฒฝํ—˜์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค

AMD GPU (ํญ๋„“์€ ์ง€์›)

  • ๋Œ€๋ถ€๋ถ„์˜ ํผ์ฆ(1-8, 11-29)์ด ROCm์„ ํ†ตํ•ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฏธ์ง€์›: ๋””๋ฒ„๊น… ๋„๊ตฌ(9-10), ํ”„๋กœํŒŒ์ผ๋ง(30-32), ํ…์„œ ์ฝ”์–ด(33-34)
  • ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด๊นŒ์ง€ ํฌํ•จํ•˜์—ฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ํญ๋„“๊ฒŒ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

Apple GPU (๊ธฐ๋ณธ ์ง€์›)

  • ๊ธฐ์ดˆ(1-8, 11-18) ๋ฐ ๊ณ ๊ธ‰(23-27) ํผ์ฆ ์ผ๋ถ€๋ฅผ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฏธ์ง€์›: ๊ณ ๊ธ‰ ๊ธฐ๋Šฅ ์ „๋ฐ˜, ๋””๋ฒ„๊น…, ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ๋ณธ ํŒจํ„ด์„ ์ตํžˆ๊ธฐ์— ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค

ํ–ฅํ›„ ์ง€์› ๊ณ„ํš: AMD ๋ฐ Apple GPU์— ๋Œ€ํ•œ ๋„๊ตฌ์™€ ํ”Œ๋žซํผ ์ง€์›์„ ๊พธ์ค€ํžˆ ํ™•๋Œ€ํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋””๋ฒ„๊น… ๋„๊ตฌ, ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋Šฅ, ๊ณ ๊ธ‰ GPU ์—ฐ์‚ฐ ๋“ฑ ์•„์ง ์ง€์›๋˜์ง€ ์•Š๋Š” ๊ธฐ๋Šฅ์€ ํ–ฅํ›„ ๋ฆด๋ฆฌ์Šค์— ํฌํ•จ๋  ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ์„ ๊ณ„์† ๊ฐœ์„ ํ•˜๊ณ  ์žˆ์œผ๋‹ˆ ์—…๋ฐ์ดํŠธ๋ฅผ ํ™•์ธํ•ด ์ฃผ์„ธ์š”.

GPU ๋ฆฌ์†Œ์Šค

๋ฌด๋ฃŒ ํด๋ผ์šฐ๋“œ GPU ํ”Œ๋žซํผ

๋กœ์ปฌ GPU๊ฐ€ ์—†๋‹ค๋ฉด, ๋ฌด๋ฃŒ๋กœ GPU๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํด๋ผ์šฐ๋“œ ํ”Œ๋žซํผ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Google Colab

Google Colab์€ ๋ฌด๋ฃŒ GPU ์ ‘๊ทผ์„ ์ œ๊ณตํ•˜์ง€๋งŒ, Mojo GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—๋Š” ์ผ๋ถ€ ์ œํ•œ์ด ์žˆ์Šต๋‹ˆ๋‹ค:

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ GPU:

  • Tesla T4 (๊ตฌ์„ธ๋Œ€ Turing ์•„ํ‚คํ…์ฒ˜)
  • Tesla V100 (์ œํ•œ์  ๊ฐ€์šฉ)

Mojo GPU Puzzles ์‚ฌ์šฉ ์‹œ ์ œํ•œ์‚ฌํ•ญ:

  • ๊ตฌ์„ธ๋Œ€ GPU ์•„ํ‚คํ…์ฒ˜: T4 GPU๋Š” ๊ณ ๊ธ‰ Mojo GPU ๊ธฐ๋Šฅ๊ณผ ํ˜ธํ™˜๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ์„ธ์…˜ ์‹œ๊ฐ„ ์ œํ•œ: ์ตœ๋Œ€ 12์‹œ๊ฐ„ ์‹คํ–‰ ํ›„ ์ž๋™์œผ๋กœ ์—ฐ๊ฒฐ์ด ๋Š๊น๋‹ˆ๋‹ค
  • ์ œํ•œ์  ๋””๋ฒ„๊น… ์ง€์›: NVIDIA ๋””๋ฒ„๊น… ๋„๊ตฌ(ํผ์ฆ 9-10)๋ฅผ ์™„์ „ํžˆ ์‚ฌ์šฉํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ํŒจํ‚ค์ง€ ์„ค์น˜ ์ œํ•œ: Mojo/MAX ์„ค์น˜ ์‹œ ์šฐํšŒ ๋ฐฉ๋ฒ•์ด ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ์„ฑ๋Šฅ ์ œํ•œ: ๊ณต์œ  ์ธํ”„๋ผ ํŠน์„ฑ์ƒ ์ผ๊ด€๋œ ๋ฒค์น˜๋งˆํ‚น์ด ์–ด๋ ต์Šต๋‹ˆ๋‹ค

์ถ”์ฒœ ์šฉ๋„: ๊ธฐ๋ณธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…(ํผ์ฆ 1-8, 11-15)๊ณผ ๊ธฐ์ดˆ ํŒจํ„ด ํ•™์Šต.

Kaggle Notebooks

Kaggle์€ Colab๋ณด๋‹ค ๋„‰๋„‰ํ•œ ๋ฌด๋ฃŒ GPU ์‚ฌ์šฉ ์‹œ๊ฐ„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ GPU:

  • Tesla T4 (์ฃผ๋‹น 30์‹œ๊ฐ„ ๋ฌด๋ฃŒ)
  • P100 (์ œํ•œ์  ๊ฐ€์šฉ)

Colab ๋Œ€๋น„ ์žฅ์ :

  • ๋„‰๋„‰ํ•œ ์‹œ๊ฐ„: Colab์˜ ์ผ์ผ ์„ธ์…˜ ์ œํ•œ๊ณผ ๋‹ฌ๋ฆฌ ์ฃผ๋‹น 30์‹œ๊ฐ„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • ์ž๋™ ์ €์žฅ: ๋…ธํŠธ๋ถ์ด ์ž๋™์œผ๋กœ ์ €์žฅ๋ฉ๋‹ˆ๋‹ค
  • ์•ˆ์ •์ ์ธ ํ™˜๊ฒฝ: ํŒจํ‚ค์ง€ ์„ค์น˜๊ฐ€ ๋” ์•ˆ์ •์ ์ž…๋‹ˆ๋‹ค

Mojo GPU Puzzles ์‚ฌ์šฉ ์‹œ ์ œํ•œ์‚ฌํ•ญ:

  • GPU ์•„ํ‚คํ…์ฒ˜ ์ œ์•ฝ: T4์˜ ๊ณ ๊ธ‰ ๊ธฐ๋Šฅ ํ˜ธํ™˜์„ฑ ๋ฌธ์ œ๋Š” Colab๊ณผ ๋™์ผ
  • ์ œํ•œ์  ๋””๋ฒ„๊น… ๋„๊ตฌ: NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋ฐ ๋””๋ฒ„๊น… ๋„๊ตฌ(ํผ์ฆ 9-10, 30-32) ์‚ฌ์šฉ ๋ถˆ๊ฐ€
  • Mojo ์„ค์น˜ ๋ณต์žก์„ฑ: Mojo ํ™˜๊ฒฝ์„ ์ˆ˜๋™์œผ๋กœ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฏธ์ง€์›: ๊ณ ๊ธ‰ ํผ์ฆ(33-34) ์ž‘๋™ ๋ถˆ๊ฐ€

์ถ”์ฒœ ์šฉ๋„: ๊ธฐ๋ณธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ(ํผ์ฆ 1-16)์„ ์žฅ์‹œ๊ฐ„์— ๊ฑธ์ณ ํ•™์Šตํ•  ๋•Œ ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค.

๊ถŒ์žฅ ์‚ฌํ•ญ

  • ์ „์ฒด ํ•™์Šต ๊ณผ์ •: NVIDIA GPU๊ฐ€ ์žˆ์œผ๋ฉด ๋ชจ๋“  ํผ์ฆ์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (์ „์ฒด 34๊ฐœ)
  • ํญ๋„“์€ ํ•™์Šต: AMD GPU๋กœ๋„ ๋Œ€๋ถ€๋ถ„์˜ ๋‚ด์šฉ์„ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (34๊ฐœ ์ค‘ 27๊ฐœ)
  • ๊ธฐ์ดˆ ํ•™์Šต: Apple GPU๋กœ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ์ตํž ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (34๊ฐœ ์ค‘ 13๊ฐœ)
  • ๋ฌด๋ฃŒ ํ”Œ๋žซํผ ํ•™์Šต: Google Colab/Kaggle๋กœ ๊ธฐ์ดˆ~์ค‘๊ธ‰ ๊ฐœ๋…๊นŒ์ง€ ํ•™์Šต ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค (ํผ์ฆ 1-16)
  • ๋””๋ฒ„๊น… ๋ฐ ํ”„๋กœํŒŒ์ผ๋ง: ๋””๋ฒ„๊น… ๋„๊ตฌ์™€ ์„ฑ๋Šฅ ๋ถ„์„์—๋Š” NVIDIA GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์‹  GPU ๊ธฐ๋Šฅ: ํ…์„œ ์ฝ”์–ด์™€ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—๋Š” NVIDIA GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค

๊ฐœ๋ฐœ

์ž์„ธํ•œ ๋‚ด์šฉ์€ README๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์ปค๋ฎค๋‹ˆํ‹ฐ ์ฐธ์—ฌํ•˜๊ธฐ

์—…๋ฐ์ดํŠธ ๊ตฌ๋… Modular ํฌ๋Ÿผ Discord

์ปค๋ฎค๋‹ˆํ‹ฐ์—์„œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ๋Œ€ํ•ด ์ด์•ผ๊ธฐํ•˜๊ณ , ํ’€์ด๋ฅผ ๊ณต์œ ํ•˜๊ณ , ์„œ๋กœ ๋„์›€์„ ์ฃผ๊ณ ๋ฐ›์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ† ๋ณด์ƒ์„ ๋ฐ›์•„๊ฐ€์„ธ์š”

ํผ์ฆ์„ ๋ชจ๋‘ ํ’€์–ด๋ณด์…จ๋‚˜์š”? ์—ฌ๋Ÿฌ๋ถ„์˜ ๋„์ „์„ ์ถ•ํ•˜ํ•˜๋ฉฐ ๋ฌด๋ฃŒ ์Šคํ‹ฐ์ปค ํŒฉ์„ ์„ ๋ฌผ๋กœ ๋“œ๋ ค์š”!

๋ฌด๋ฃŒ ์Šคํ‹ฐ์ปค๋ฅผ ๋ฐ›๋Š” ๋ฐฉ๋ฒ•:

  1. GitHub ์ €์žฅ์†Œ https://github.com/modular/mojo-gpu-puzzles๋ฅผ Forkํ•ฉ๋‹ˆ๋‹ค
  2. ํผ์ฆ ์†”๋ฃจ์…˜์„ ์ž‘์„ฑํ•ด์„œ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค
  3. ์ด ์–‘์‹์œผ๋กœ ์ œ์ถœํ•˜๋ฉด Modular ํ•œ์ • ์Šคํ‹ฐ์ปค๋ฅผ ๋ณด๋‚ด๋“œ๋ ค์š”!

ํ˜„์žฌ๋Š” ๋ถ๋ฏธ ์ง€์—ญ์œผ๋กœ๋งŒ ๋ฐฐ์†ก์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์ง€์—ญ์— ๊ณ„์‹  ๋ถ„๋“ค๋„ ์†”๋ฃจ์…˜์„ ์ œ์ถœํ•ด ์ฃผ์„ธ์š” โ€“ ๋ฐฐ์†ก ๋ฒ”์œ„๋ฅผ ๋„“ํ˜€๊ฐ€๊ณ  ์žˆ์œผ๋‹ˆ, ๊ฐ€๋Šฅํ•ด์ง€๋ฉด ๊ผญ ๋ณด์ƒ์„ ๋ณด๋‚ด๋“œ๋ฆด๊ฒŒ์š”.

Puzzle 1: Map

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” GPU ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์˜ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ์ดํ„ฐ ์š”์†Œ ํ•˜๋‚˜๋ฅผ ๋งก์•„ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ๋ฒกํ„ฐ a์˜ ๊ฐ ์š”์†Œ์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ๋ฐฐ์ •๋ฉ๋‹ˆ๋‹ค.


ํ•ต์‹ฌ ๊ฐœ๋…

  • GPU ์ปค๋„์˜ ๊ธฐ๋ณธ ๊ตฌ์กฐ
  • ์Šค๋ ˆ๋“œ์™€ ๋ฐ์ดํ„ฐ ๊ฐ„ ์ผ๋Œ€์ผ ๋งคํ•‘
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • GPU์—์„œ์˜ ๋ฐฐ์—ด ์—ฐ์‚ฐ

๊ฐ ์œ„์น˜ \(i\)์— ๋Œ€ํ•ด: \[\Large output[i] = a[i] + 10\]

๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์ง์ ‘ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋‹ค๋ฃจ๋ฉฐ GPU์˜ ๊ธฐ๋ณธ ์›๋ฆฌ๋ฅผ ์ตํž™๋‹ˆ๋‹ค.

๐Ÿ’ก ๋ฏธ๋ฆฌ๋ณด๊ธฐ: TileTensor๋ฅผ ํ™œ์šฉํ•œ ํ˜„๋Œ€์  ๋ฐฉ์‹

TileTensor๊ฐ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์–ด๋–ป๊ฒŒ ๋‹จ์ˆœํ™”ํ•˜๋Š”์ง€ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ๋” ์•ˆ์ „ํ•˜๊ณ  ๊น”๋”ํ•œ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก ํŒ: ๋‘ ๋ฐฉ์‹์„ ๋ชจ๋‘ ์ตํžˆ๋ฉด ํ˜„๋Œ€์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๋” ๊นŠ์ด ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šฐ๋Š” ๋‚ด์šฉ:

  • ๊ธฐ๋ณธ GPU ์ปค๋„ ๊ตฌ์กฐ

  • thread_idx.x๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

  • ๊ฐ„๋‹จํ•œ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ

  • ๋ณ‘๋ ฌ์„ฑ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค

  • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: i = thread_idx.x ์œ„์น˜์˜ ์š”์†Œ์— ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค

  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: a[i]์—์„œ ์ฝ๊ณ  output[i]์— ์”๋‹ˆ๋‹ค

  • ๋ฐ์ดํ„ฐ ๋…๋ฆฝ์„ฑ: ๊ฐ ์ถœ๋ ฅ์€ ํ•ด๋‹น ์ž…๋ ฅ์—๋งŒ ์˜์กดํ•ฉ๋‹ˆ๋‹ค

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32


def add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    var i = thread_idx.x
    # FILL ME IN (roughly 1 line)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p01/p01.mojo

ํŒ
  1. thread_idx.x๋ฅผ i์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  2. a[i]์— 10์„ ๋”ํ•ฉ๋‹ˆ๋‹ค
  3. ๊ฒฐ๊ณผ๋ฅผ output[i]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p01
pixi run -e amd p01
pixi run -e apple p01
uv run poe p01

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

def add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    var i = thread_idx.x
    output[i] = a[i] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  • i = thread_idx.x๋กœ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค
  • ์ž…๋ ฅ๊ฐ’์— 10์„ ๋”ํ•ฉ๋‹ˆ๋‹ค: output[i] = a[i] + 10.0

์™œ TileTensor๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ• ๊นŒ์š”?

์•„๋ž˜ ๊ธฐ์กด ๊ตฌํ˜„์„ ๋ณด๋ฉด ๋ช‡ ๊ฐ€์ง€ ์ž ์žฌ์ ์ธ ๋ฌธ์ œ๋ฅผ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

ํ˜„์žฌ ๋ฐฉ์‹

i = thread_idx.x
output[i] = a[i] + 10.0

1D ๋ฐฐ์—ด์—์„œ๋Š” ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ƒํ™ฉ์—์„œ๋Š” ์–ด๋–จ๊นŒ์š”?

  • 2D๋‚˜ 3D ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค๋ค„์•ผ ํ•  ๋•Œ
  • ๋‹ค์–‘ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•  ๋•Œ
  • ๋ณ‘ํ•ฉ(coalesced) ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ๋ณด์žฅํ•ด์•ผ ํ•  ๋•Œ

์•ž์œผ๋กœ์˜ ๋„์ „ ๋ฏธ๋ฆฌ๋ณด๊ธฐ

ํผ์ฆ์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์€ ์ ์  ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค:

# ์ดํ›„ ํผ์ฆ์—์„œ ๋‹ค๋ฃฐ 2D ์ธ๋ฑ์‹ฑ
idx = row * WIDTH + col

# 3D ์ธ๋ฑ์‹ฑ
idx = (batch * HEIGHT + row) * WIDTH + col

# ํŒจ๋”ฉ์ด ์žˆ๋Š” ๊ฒฝ์šฐ
idx = (batch * padded_height + row) * padded_width + col
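위 인덱싱 공식이 실제로 모든 요소를 겹침 없이 정확히 한 번씩 가리키는지는 다음과 같이 확인해 볼 수 있습니다. 크기 값은 설명을 위한 임의의 가정입니다:

```python
# 평탄화(flattened) 3D 인덱싱 공식 검증 스케치 (행 우선 저장 가정)
WIDTH, HEIGHT, BATCH = 4, 3, 2

flat = list(range(BATCH * HEIGHT * WIDTH))  # 1차원 버퍼

def idx3d(batch, row, col):
    # 본문에 나온 3D 인덱싱 공식
    return (batch * HEIGHT + row) * WIDTH + col

# 모든 (batch, row, col) 조합이 서로 다른 위치에 정확히 한 번씩 대응하는지 확인
seen = set()
for b in range(BATCH):
    for r in range(HEIGHT):
        for c in range(WIDTH):
            i = idx3d(b, r, c)
            assert 0 <= i < len(flat) and i not in seen
            seen.add(i)

assert len(seen) == len(flat)  # 버퍼 전체를 빠짐없이 덮음
```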

TileTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ

TileTensor๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ฅผ ํ›จ์”ฌ ๊น”๋”ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ๋ฏธ๋ฆฌ๋ณด๊ธฐ - ์ง€๊ธˆ์€ ์ด ๋ฌธ๋ฒ•์„ ๋ชฐ๋ผ๋„ ๊ดœ์ฐฎ์Šต๋‹ˆ๋‹ค!
output[i, j] = a[i, j] + 10.0  # 2D ์ธ๋ฑ์‹ฑ
output[b, i, j] = a[b, i, j] + 10.0  # 3D ์ธ๋ฑ์‹ฑ

Puzzle 4์—์„œ TileTensor๋ฅผ ์ž์„ธํžˆ ๋ฐฐ์šธ ์˜ˆ์ •์ž…๋‹ˆ๋‹ค. ๊ทธ๋•Œ ์ด ๊ฐœ๋…๋“ค์ด ํ•„์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ง€๊ธˆ์€ ๋‹ค์Œ ๋‚ด์šฉ์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ง‘์ค‘ํ•˜์„ธ์š”:

  • ๊ธฐ๋ณธ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
  • ๊ฐ„๋‹จํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ์™€ ๋ฐ์ดํ„ฐ์˜ ์ผ๋Œ€์ผ ๋งคํ•‘

๐Ÿ’ก ํ•ต์‹ฌ ํฌ์ธํŠธ: ์ง์ ‘ ์ธ๋ฑ์‹ฑ์€ ๊ฐ„๋‹จํ•œ ๊ฒฝ์šฐ์— ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๋ณต์žกํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์—์„œ๋Š” ๊ณง ๋” ์ •๊ตํ•œ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•ด์ง‘๋‹ˆ๋‹ค.

Puzzle 2: Zip

๊ฐœ์š”

๋ฒกํ„ฐ a์™€ ๋ฒกํ„ฐ b์˜ ๊ฐ ์œ„์น˜๋ฅผ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ๋ฐฐ์ •๋ฉ๋‹ˆ๋‹ค.


ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šฐ๋Š” ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ์ž…๋ ฅ ๋ฐฐ์—ด์˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ
  • ์—ฌ๋Ÿฌ ์ž…๋ ฅ์— ๋Œ€ํ•œ ์š”์†Œ๋ณ„ ์—ฐ์‚ฐ
  • ๋ฐฐ์—ด ๊ฐ„ ์Šค๋ ˆ๋“œ-๋ฐ์ดํ„ฐ ๋งคํ•‘
  • ์—ฌ๋Ÿฌ ๋ฐฐ์—ด์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

๊ฐ ์Šค๋ ˆ๋“œ \(i\)์— ๋Œ€ํ•ด: \[\Large output[i] = a[i] + b[i]\]

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

Thread 0:  a[0] + b[0] โ†’ output[0]
Thread 1:  a[1] + b[1] โ†’ output[1]
Thread 2:  a[2] + b[2] โ†’ output[2]
...

๐Ÿ’ก ์ฐธ๊ณ : ์ด์ œ ์ปค๋„์—์„œ ์„ธ ๊ฐœ์˜ ๋ฐฐ์—ด(a, b, output)์„ ๋‹ค๋ฃจ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ์—ฐ์‚ฐ์ด ๋ณต์žกํ•ด์งˆ์ˆ˜๋ก ์—ฌ๋Ÿฌ ๋ฐฐ์—ด์— ๋Œ€ํ•œ ์ ‘๊ทผ์„ ๊ด€๋ฆฌํ•˜๊ธฐ๊ฐ€ ์ ์  ์–ด๋ ค์›Œ์ง‘๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = SIZE
comptime dtype = DType.float32


def add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    var i = thread_idx.x
    # FILL ME IN (roughly 1 line)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p02/p02.mojo

ํŒ
  1. thread_idx.x๋ฅผ i์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  2. a[i]์™€ b[i]๋ฅผ ๋”ํ•ฉ๋‹ˆ๋‹ค
  3. ๊ฒฐ๊ณผ๋ฅผ output[i]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p02
pixi run -e amd p02
pixi run -e apple p02
uv run poe p02

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 2.0, 4.0, 6.0])

์†”๋ฃจ์…˜

def add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    var i = thread_idx.x
    output[i] = a[i] + b[i]


์ด ์†”๋ฃจ์…˜์€:

  • i = thread_idx.x๋กœ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค
  • ๋‘ ๋ฐฐ์—ด์˜ ๊ฐ’์„ ๋”ํ•ฉ๋‹ˆ๋‹ค: output[i] = a[i] + b[i]

์•ž์œผ๋กœ ๋‹ค๋ฃฐ ๋‚ด์šฉ

์ง์ ‘ ์ธ๋ฑ์‹ฑ์€ ๊ฐ„๋‹จํ•œ ์š”์†Œ๋ณ„ ์—ฐ์‚ฐ์—์„œ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๋‹ค์Œ ์ƒํ™ฉ์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ๋ฐฐ์—ด์˜ ๋ ˆ์ด์•„์›ƒ์ด ์„œ๋กœ ๋‹ค๋ฅด๋‹ค๋ฉด?
  • ํ•œ ๋ฐฐ์—ด์„ ๋‹ค๋ฅธ ๋ฐฐ์—ด์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ•œ๋‹ค๋ฉด?
  • ์—ฌ๋Ÿฌ ๋ฐฐ์—ด์—์„œ ๋ณ‘ํ•ฉ(coalesced) ์ ‘๊ทผ์„ ์–ด๋–ป๊ฒŒ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์„๊นŒ?

์ด๋Ÿฌํ•œ ์งˆ๋ฌธ๋“ค์€ Puzzle 4์˜ TileTensor ์•Œ์•„๋ณด๊ธฐ์—์„œ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

Puzzle 3: ๊ฐ€๋“œ

๊ฐœ์š”

๋ฒกํ„ฐ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜๋ณด๋‹ค ๋งŽ์•„์„œ, ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๋Š” ์ฒ˜๋ฆฌํ•  ๋ฐ์ดํ„ฐ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜์ง€ ์•Š๋„๋ก ๋ฐฉ์ง€ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Guard ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜์™€ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ ๋ถˆ์ผ์น˜ ์ฒ˜๋ฆฌ
  • ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ฐฉ์ง€
  • GPU ์ปค๋„์—์„œ ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ์‚ฌ์šฉ
  • ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ˆ˜ํ•™์  ํ‘œํ˜„

๊ฐ ์Šค๋ ˆ๋“œ \(i\)์— ๋Œ€ํ•ด: \[\Large \text{if}\ i < \text{size}: output[i] = a[i] + 10\]

๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „ ํŒจํ„ด

Thread 0 (i=0):  if 0 < size:  output[0] = a[0] + 10  โœ“ Valid
Thread 1 (i=1):  if 1 < size:  output[1] = a[1] + 10  โœ“ Valid
Thread 2 (i=2):  if 2 < size:  output[2] = a[2] + 10  โœ“ Valid
Thread 3 (i=3):  if 3 < size:  output[3] = a[3] + 10  โœ“ Valid
Thread 4 (i=4):  if 4 < size:  โŒ Skip (out of bounds)
Thread 5 (i=5):  if 5 < size:  โŒ Skip (out of bounds)

๐Ÿ’ก ์ฐธ๊ณ : ๋‹ค์Œ ์ƒํ™ฉ์—์„œ ๊ฒฝ๊ณ„(boundary) ๊ฒ€์‚ฌ๋Š” ์ ์  ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค:

  • ๋‹ค์ฐจ์› ๋ฐฐ์—ด
  • ๋‹ค์–‘ํ•œ ๋ฐฐ์—ด ํ˜•ํƒœ
  • ๋ณต์žกํ•œ ์ ‘๊ทผ ํŒจํ„ด

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 4
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = 8
comptime dtype = DType.float32


def add_10_guard(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p03/p03.mojo

ํŒ
  1. thread_idx.x๋ฅผ i์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if i < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: output[i] = a[i] + 10.0

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p03
pixi run -e amd p03
pixi run -e apple p03
uv run poe p03

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

def add_10_guard(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  • i = thread_idx.x๋กœ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ต๋‹ˆ๋‹ค
  • if i < size๋กœ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ€๋“œ ๋‚ด๋ถ€: ์ž…๋ ฅ๊ฐ’์— 10์„ ๋”ํ•ฉ๋‹ˆ๋‹ค
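๊ฐ€๋“œ์˜ ํšจ๊ณผ๋Š” CPU ์Šค์ผ€์น˜๋กœ๋„ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๊ฐœ๋… ๊ฒ€์ฆ์šฉ ํŒŒ์ด์ฌ ์ฝ”๋“œ์ด๋ฉฐ, ์ž…๋ ฅ๊ฐ’์€ ์œ„ ์˜ˆ์ƒ ์ถœ๋ ฅ๊ณผ ๋งž๋„๋ก `[0, 1, 2, 3]`์œผ๋กœ ๊ฐ€์ •ํ–ˆ์Šต๋‹ˆ๋‹ค:

```python
# Puzzle 3 ๊ฐ€๋“œ ๋™์ž‘์„ ํ‰๋‚ด ๋‚ธ CPU ์Šค์ผ€์น˜: ์Šค๋ ˆ๋“œ 8๊ฐœ, ๋ฐ์ดํ„ฐ 4๊ฐœ
SIZE = 4
THREADS = 8
a = [0.0, 1.0, 2.0, 3.0]  # ๊ฐ€์ •ํ•œ ์ž…๋ ฅ๊ฐ’
output = [0.0] * SIZE

for i in range(THREADS):   # i = thread_idx.x
    if i < SIZE:           # ๊ฐ€๋“œ: ๋ฒ”์œ„ ๋ฐ– ์Šค๋ ˆ๋“œ(i=4..7)๋Š” ์•„๋ฌด๊ฒƒ๋„ ํ•˜์ง€ ์•Š์Œ
        output[i] = a[i] + 10.0

print(output)  # [10.0, 11.0, 12.0, 13.0]
```

๊ฐ€๋“œ๊ฐ€ ์—†๋‹ค๋ฉด i=4..7 ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฒ”์œ„ ๋ฐ– ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๊ฒŒ ๋˜๊ณ , ์ด๋Š” GPU์—์„œ ๋ฏธ์ •์˜ ๋™์ž‘์œผ๋กœ ์ด์–ด์ง‘๋‹ˆ๋‹ค.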

๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์—†์ด๋„ ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผ๋˜๋Š” ์ด์œ ๊ฐ€ ๊ถ๊ธˆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ํ…Œ์ŠคํŠธ ํ†ต๊ณผ๊ฐ€ ์ฝ”๋“œ์˜ ์•ˆ์ „์„ฑ์ด๋‚˜ ๋ฏธ์ •์˜ ๋™์ž‘(Undefined Behavior) ๋ถ€์žฌ๋ฅผ ๋ณด์žฅํ•˜์ง€๋Š” ์•Š๋Š”๋‹ค๋Š” ์ ์„ ํ•ญ์ƒ ๊ธฐ์–ตํ•˜์„ธ์š”. Puzzle 10์—์„œ ์ด๋Ÿฐ ๊ฒฝ์šฐ๋ฅผ ์‚ดํŽด๋ณด๊ณ , ์•ˆ์ „์„ฑ ๋ฒ„๊ทธ๋ฅผ ์žก๋Š” ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ด ๋ด…๋‹ˆ๋‹ค.

์•ž์œผ๋กœ ๋‹ค๋ฃฐ ๋‚ด์šฉ

๊ฐ„๋‹จํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ๊ธฐ์„œ ์ž˜ ์ž‘๋™ํ•˜์ง€๋งŒ, ๋‹ค์Œ ์ƒํ™ฉ์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • 2D/3D ๋ฐฐ์—ด์˜ ๊ฒฝ๊ณ„๋Š” ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ?
  • ๋‹ค์–‘ํ•œ ํ˜•ํƒœ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋ ค๋ฉด?
  • ํŒจ๋”ฉ์ด๋‚˜ ๊ฐ€์žฅ์ž๋ฆฌ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋ฉด?

๋ณต์žก๋„๊ฐ€ ์ฆ๊ฐ€ํ•˜๋Š” ์˜ˆ์‹œ:

# ํ˜„์žฌ: 1D ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
if i < size: ...

# ๊ณง ๋‹ค๋ฃฐ ๋‚ด์šฉ: 2D ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
if i < height and j < width: ...

# ์ดํ›„: ํŒจ๋”ฉ์ด ์žˆ๋Š” 3D (์—ฌ๋Ÿฌ ์ค„ ์กฐ๊ฑด์€ ๊ด„ํ˜ธ๋กœ ๋ฌถ์Œ)
if (i < height and j < width and k < depth
    and i >= padding and j >= padding): ...

์ด๋Ÿฐ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ํŒจํ„ด์€ Puzzle 4์˜ TileTensor ์•Œ์•„๋ณด๊ธฐ์—์„œ ๋ฐฐ์šฐ๋ฉด ํ›จ์”ฌ ๊น”๋”ํ•ด์ง‘๋‹ˆ๋‹ค. TileTensor๋Š” ํ˜•ํƒœ ๊ด€๋ฆฌ ๊ธฐ๋Šฅ์„ ๊ธฐ๋ณธ์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Puzzle 4: 2D Map

๊ฐœ์š”

2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

2D ํ–‰๋ ฌ ๋งคํ•‘

ํ•ต์‹ฌ ๊ฐœ๋…

  • 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
  • GPU์—์„œ์˜ ํ–‰๋ ฌ ์—ฐ์‚ฐ
  • ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ํŒจํ„ด

๊ฐ ์œ„์น˜ \((i,j)\)์— ๋Œ€ํ•ด: \[\Large output[i,j] = a[i,j] + 10\]

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ ๊ทœ์น™

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ 2D ํ–‰๋ ฌ์„ ๋‹ค๋ฃฐ ๋•Œ๋Š” ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค์™€ ํ–‰๋ ฌ ์ขŒํ‘œ ์‚ฌ์ด์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ๋งคํ•‘์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  • thread_idx.y๋Š” ํ–‰(row) ์ธ๋ฑ์Šค
  • thread_idx.x๋Š” ์—ด(column) ์ธ๋ฑ์Šค
2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

์ด ๊ทœ์น™์€ ๋‹ค์Œ๊ณผ ์ž˜ ๋งž์Šต๋‹ˆ๋‹ค:

  1. ํ–‰๋ ฌ ์œ„์น˜๋ฅผ (row, column)์œผ๋กœ ์“ฐ๋Š” ํ‘œ์ค€ ์ˆ˜ํ•™ ํ‘œ๊ธฐ๋ฒ•
  2. ํ–‰์€ ์œ„์—์„œ ์•„๋ž˜๋กœ(y์ถ•), ์—ด์€ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ(x์ถ•) ๊ฐ€๋Š” ํ–‰๋ ฌ์˜ ์‹œ๊ฐ์  ๊ตฌ์กฐ
  3. ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ํ–‰๋ ฌ ๊ตฌ์กฐ์— ๋งž์ถฐ 2D ๊ทธ๋ฆฌ๋“œ๋กœ ๊ตฌ์„ฑํ•˜๋Š” ์ผ๋ฐ˜์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด

์—ญ์‚ฌ์  ๋ฐฐ๊ฒฝ

๊ทธ๋ž˜ํ”ฝ์ด๋‚˜ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ์—์„œ๋Š” ๋ณดํ†ต \((x,y)\) ์ขŒํ‘œ๋ฅผ ์“ฐ์ง€๋งŒ, ํ–‰๋ ฌ ์—ฐ์‚ฐ์—์„œ๋Š” ์ „ํ†ต์ ์œผ๋กœ (row, column) ์ธ๋ฑ์‹ฑ์„ ์จ์™”์Šต๋‹ˆ๋‹ค. ์ดˆ๊ธฐ ์ปดํ“จํ„ฐ๊ฐ€ 2D ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•˜๊ณ  ์ฒ˜๋ฆฌํ•˜๋˜ ๋ฐฉ์‹์—์„œ ๋น„๋กฏ๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค: ์œ„์—์„œ ์•„๋ž˜๋กœ ํ•œ ์ค„์”ฉ, ๊ฐ ์ค„์€ ์™ผ์ชฝ์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ฝ์—ˆ์ฃ . ์ด๋Ÿฐ ํ–‰ ์šฐ์„ (row-major) ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹๊ณผ ๋งž์•„์„œ CPU์™€ GPU ๋ชจ๋‘์—์„œ ํšจ์œจ์ ์ž„์ด ์ž…์ฆ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์šฉ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ๋„์ž…๋์„ ๋•Œ, thread_idx.y๋ฅผ ํ–‰์—, thread_idx.x๋ฅผ ์—ด์— ๋งคํ•‘ํ•œ ๊ฑด ๊ธฐ์กด์— ํ™•๋ฆฝ๋œ ํ–‰๋ ฌ ์ธ๋ฑ์‹ฑ ๊ทœ์น™๊ณผ ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•˜๋ ค๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์„ ํƒ์ด์—ˆ์Šต๋‹ˆ๋‹ค.

๊ตฌํ˜„ ๋ฐฉ์‹

๐Ÿ”ฐ ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹

์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌํ•˜๋ฉด์„œ 2D ์ธ๋ฑ์‹ฑ์ด ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

๐Ÿ“š TileTensor ์•Œ์•„๋ณด๊ธฐ

GPU์—์„œ ๋‹ค์ฐจ์› ๋ฐฐ์—ด ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„ํŽธํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” ๊ฐ•๋ ฅํ•œ ์ถ”์ƒํ™”๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๐Ÿš€ ํ˜„๋Œ€์  2D ์—ฐ์‚ฐ

์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ๊ณผ ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ๊ฐ–์ถ˜ TileTensor๋ฅผ ์ง์ ‘ ์จ๋ด…๋‹ˆ๋‹ค.

๐Ÿ’ก ์ฐธ๊ณ : ์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” ๋” ๊น”๋”ํ•˜๊ณ  ์•ˆ์ „ํ•œ GPU ์ฝ”๋“œ๋ฅผ ์œ„ํ•ด TileTensor๋ฅผ ์ฃผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฐœ์š”

2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D ์ •์‚ฌ๊ฐ ํ–‰๋ ฌ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๋‹ค๋ฃจ๊ธฐ (thread_idx.x, thread_idx.y)
  • 2D ์ขŒํ‘œ๋ฅผ 1D ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋กœ ๋ณ€ํ™˜ํ•˜๊ธฐ
  • 2์ฐจ์›์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ 2D ์Šค๋ ˆ๋“œ ์ขŒํ‘œ \((i,j)\)๋ฅผ ํฌ๊ธฐ \(n \times n\)์ธ ํ–‰ ์šฐ์„  ํ–‰๋ ฌ์˜ ์›์†Œ๋กœ ๋งคํ•‘ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋™์‹œ์— ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๊ฐ€ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜์ง€ ์•Š๋Š”์ง€๋„ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  • 2D ์ธ๋ฑ์‹ฑ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ ์œ ํ•œ \((i,j)\) ์œ„์น˜๋ฅผ ๊ฐ€์ง
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ํ–‰ ์šฐ์„  ์ˆœ์„œ๋กœ 2D๋ฅผ 1D ๋ฉ”๋ชจ๋ฆฌ์— ๋งคํ•‘
  • ๊ฐ€๋“œ ์กฐ๊ฑด: ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: ์Šค๋ ˆ๋“œ \((3 \times 3)\)๊ฐ€ ํ–‰๋ ฌ ์›์†Œ \((2 \times 2)\)๋ณด๋‹ค ๋งŽ์Œ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32


def add_10_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p04/p04.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€์—์„œ ํ–‰ ์šฐ์„  ๋ฐฉ์‹์œผ๋กœ 10 ๋”ํ•˜๊ธฐ!

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p04
pixi run -e amd p04
pixi run -e apple p04
uv run poe p04

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

def add_10_2d(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[row * size + col] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: output[row * size + col] = a[row * size + col] + 10.0
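3ร—3 ์Šค๋ ˆ๋“œ๊ฐ€ 2ร—2 ํ–‰๋ ฌ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ณผ์ •์€ ์•„๋ž˜ ํŒŒ์ด์ฌ ์Šค์ผ€์น˜๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐœ๋… ๊ฒ€์ฆ์šฉ CPU ์ฝ”๋“œ์ด๋ฉฐ, ์ž…๋ ฅ๊ฐ’์€ ์œ„ ์˜ˆ์ƒ ์ถœ๋ ฅ๊ณผ ๋งž๋„๋ก `[0, 1, 2, 3]`์œผ๋กœ ๊ฐ€์ •ํ–ˆ์Šต๋‹ˆ๋‹ค:

```python
# Puzzle 4: 3x3 ์Šค๋ ˆ๋“œ๋กœ 2x2 ํ–‰๋ ฌ ์ฒ˜๋ฆฌ - CPU ์Šค์ผ€์น˜
SIZE = 2
a = [0.0, 1.0, 2.0, 3.0]       # 2x2 ํ–‰๋ ฌ (ํ–‰ ์šฐ์„ , ๊ฐ€์ •ํ•œ ์ž…๋ ฅ๊ฐ’)
output = [0.0] * (SIZE * SIZE)

for row in range(3):           # thread_idx.y
    for col in range(3):       # thread_idx.x
        if row < SIZE and col < SIZE:   # ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
            idx = row * SIZE + col      # ํ–‰ ์šฐ์„  ์˜คํ”„์…‹ ๊ณ„์‚ฐ
            output[idx] = a[idx] + 10.0

print(output)  # [10.0, 11.0, 12.0, 13.0]
```

9๊ฐœ ์Šค๋ ˆ๋“œ ์ค‘ 4๊ฐœ๋งŒ ์‹ค์ œ ์ž‘์—…์„ ํ•˜๊ณ , ๋‚˜๋จธ์ง€ 5๊ฐœ๋Š” ๊ฐ€๋“œ์—์„œ ๊ฑธ๋Ÿฌ์ง‘๋‹ˆ๋‹ค.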

TileTensor ์•Œ์•„๋ณด๊ธฐ

ํผ์ฆ ํ’€์ด๋ฅผ ์ž ์‹œ ๋ฉˆ์ถ”๊ณ , GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋” ์ฆ๊ฒ๊ฒŒ ๋งŒ๋“ค์–ด์ค„ ๊ฐ•๋ ฅํ•œ ์ถ”์ƒํ™”๋ฅผ ๋ฏธ๋ฆฌ ์‚ดํŽด๋ด…์‹œ๋‹ค: ๐Ÿฅโ€ฆ ๋ฐ”๋กœ TileTensor ์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก TileTensor๊ฐ€ ์–ด๋–ค ์ผ์„ ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ง›๋ณด๊ธฐ๋กœ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ๊ฑธ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์–ด์š” - ํผ์ฆ์„ ์ง„ํ–‰ํ•˜๋ฉด์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ์ž์„ธํžˆ ์•Œ์•„๋ณผ ๊ฒ๋‹ˆ๋‹ค.

๋ฌธ์ œ: ์ ์  ๋ณต์žกํ•ด์ง€๋Š” ์ฝ”๋“œ

์ง€๊ธˆ๊นŒ์ง€ ๊ฒช์€ ์–ด๋ ค์›€์„ ์‚ดํŽด๋ด…์‹œ๋‹ค:

# Puzzle 1: ๋‹จ์ˆœ ์ธ๋ฑ์‹ฑ
output[i] = a[i] + 10.0

# Puzzle 2: ์—ฌ๋Ÿฌ ๋ฐฐ์—ด ๊ด€๋ฆฌ
output[i] = a[i] + b[i]

# Puzzle 3: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
if i < size:
    output[i] = a[i] + 10.0

์ฐจ์›์ด ๋Š˜์–ด๋‚˜๋ฉด ์ฝ”๋“œ๋Š” ๋” ๋ณต์žกํ•ด์ง‘๋‹ˆ๋‹ค:

# ์ „ํ†ต์ ์ธ 2D ์ธ๋ฑ์‹ฑ (ํ–‰ ์šฐ์„  2D ํ–‰๋ ฌ)
idx = row * WIDTH + col
if row < height and col < width:
    output[idx] = a[idx] + 10.0

ํ•ด๊ฒฐ์ฑ…: TileTensor ๋ฏธ๋ฆฌ๋ณด๊ธฐ

TileTensor๋Š” ์ด๋Ÿฐ ๋ฌธ์ œ๋“ค์„ ๊น”๋”ํ•˜๊ฒŒ ํ•ด๊ฒฐํ•ด์ค๋‹ˆ๋‹ค. ์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ์„ ์‚ด์ง ์—ฟ๋ณด๋ฉด:

  1. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  tensor[i, j] ์‚ฌ์šฉ
  2. ์œ ์—ฐํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ํ–‰ ์šฐ์„ , ์—ด ์šฐ์„ , ํƒ€์ผ ๊ตฌ์„ฑ ์ง€์›
  3. ์„ฑ๋Šฅ ์ตœ์ ํ™”: GPU์— ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์•ž์œผ๋กœ ๋ฐฐ์šธ ๋‚ด์šฉ ๋ง›๋ณด๊ธฐ

TileTensor๊ฐ€ ํ•  ์ˆ˜ ์žˆ๋Š” ์ผ์„ ๋ช‡ ๊ฐ€์ง€ ์˜ˆ์‹œ๋กœ ์‚ดํŽด๋ด…์‹œ๋‹ค. ์ง€๊ธˆ ๋ชจ๋“  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ดํ•ดํ•  ํ•„์š”๋Š” ์—†์Šต๋‹ˆ๋‹ค - ์•ž์œผ๋กœ ๋‚˜์˜ฌ ํผ์ฆ์—์„œ ๊ฐ ๊ธฐ๋Šฅ์„ ๊ผผ๊ผผํžˆ ๋‹ค๋ฃฐ ๊ฑฐ์˜ˆ์š”.

๊ธฐ๋ณธ ์‚ฌ์šฉ ์˜ˆ์‹œ

from layout import TileTensor
from layout.tile_layout import row_major

# ๋ ˆ์ด์•„์›ƒ ์ •์˜
comptime HEIGHT = 2
comptime WIDTH = 3
comptime layout = row_major[HEIGHT, WIDTH]()
comptime LayoutType = type_of(layout)

# ํ…์„œ ์ƒ์„ฑ
tensor = TileTensor(buffer, layout)

# ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ์š”์†Œ ์ ‘๊ทผ
tensor[0, 0] = 1.0  # ์ฒซ ๋ฒˆ์งธ ์š”์†Œ
tensor[1, 2] = 2.0  # ๋งˆ์ง€๋ง‰ ์š”์†Œ

Layout๊ณผ TileTensor์— ๋Œ€ํ•ด ๋” ์•Œ์•„๋ณด๋ ค๋ฉด Mojo ๋งค๋‰ด์–ผ์˜ ๊ฐ€์ด๋“œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

๊ฐ„๋‹จํ•œ ์˜ˆ์ œ

TileTensor์˜ ๊ธฐ๋ณธ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋กœ ๋ชจ๋“  ๊ฒƒ์„ ์ •๋ฆฌํ•ด๋ด…์‹œ๋‹ค:

# ===----------------------------------------------------------------------=== #
#
# This file is Modular Inc proprietary.
#
# ===----------------------------------------------------------------------=== #
from std.gpu.host import DeviceContext
from layout import TileTensor
from layout.tile_layout import row_major

comptime HEIGHT = 2
comptime WIDTH = 3
comptime dtype = DType.float32
comptime layout = row_major[HEIGHT, WIDTH]()
comptime LayoutType = type_of(layout)


def kernel(
    tensor: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)


def main() raises:
    ctx = DeviceContext()

    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH)
    a.enqueue_fill(0)
    tensor = TileTensor(a, layout)
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel, kernel](tensor, grid_dim=1, block_dim=1)

    ctx.synchronize()

๋‹ค์Œ ๋ช…๋ น์–ด๋กœ ์ด ์ฝ”๋“œ๋ฅผ ์‹คํ–‰ํ•˜๋ฉด:

pixi run tile_tensor_intro
pixi run -e amd tile_tensor_intro
pixi run -e apple tile_tensor_intro
uv run poe tile_tensor_intro

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0

๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์‚ดํŽด๋ด…์‹œ๋‹ค:

  1. ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ์œผ๋กœ 2 x 3 ํ…์„œ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  2. ์ฒ˜์Œ์—๋Š” ๋ชจ๋“  ์š”์†Œ๊ฐ€ 0์ž…๋‹ˆ๋‹ค
  3. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค
  4. ๋ณ€๊ฒฝ ์‚ฌํ•ญ์ด ์ถœ๋ ฅ์— ๋ฐ˜์˜๋ฉ๋‹ˆ๋‹ค

์ด ๊ฐ„๋‹จํ•œ ์˜ˆ์ œ๋Š” TileTensor์˜ ํ•ต์‹ฌ ์žฅ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • ํ…์„œ ์ƒ์„ฑ๊ณผ ์ ‘๊ทผ์„ ์œ„ํ•œ ๊น”๋”ํ•œ ๋ฌธ๋ฒ•
  • ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ
  • ์ž์—ฐ์Šค๋Ÿฌ์šด ๋‹ค์ฐจ์› ์ธ๋ฑ์‹ฑ

์ด ์˜ˆ์ œ๋Š” ๊ฐ„๋‹จํ•˜์ง€๋งŒ, ๊ฐ™์€ ํŒจํ„ด์ด ์•ž์œผ๋กœ ๋‚˜์˜ฌ ํผ์ฆ์˜ ๋ณต์žกํ•œ GPU ์—ฐ์‚ฐ์—๋„ ๊ทธ๋Œ€๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๊ธฐ๋ณธ ๊ฐœ๋…์ด ๋‹ค์Œ์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€ ๋ณด๊ฒŒ ๋  ๊ฑฐ์˜ˆ์š”:

  • ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ GPU ์—ฐ์‚ฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”
  • ๋ณต์žกํ•œ ํƒ€์ผ๋ง ์ „๋žต
  • ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ

TileTensor์™€ ํ•จ๊ป˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ฌ์ •์„ ์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋๋‚˜์š”? ํผ์ฆ๋กœ ๋“ค์–ด๊ฐ€๋ด…์‹œ๋‹ค!

๐Ÿ’ก ํŒ: ์ง„ํ–‰ํ•˜๋ฉด์„œ ์ด ์˜ˆ์ œ๋ฅผ ๊ธฐ์–ตํ•ด๋‘์„ธ์š” - ์ด ๊ธฐ๋ณธ ๊ฐœ๋…์„ ๋ฐ”ํƒ•์œผ๋กœ ์ ์  ๋” ์ •๊ตํ•œ GPU ํ”„๋กœ๊ทธ๋žจ์„ ๋งŒ๋“ค์–ด๊ฐˆ ๊ฒ๋‹ˆ๋‹ค.

TileTensor ๋ฒ„์ „

๊ฐœ์š”

2D TileTensor a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • 2D ๋ฐฐ์—ด ์ ‘๊ทผ์— TileTensor ์‚ฌ์šฉํ•˜๊ธฐ
  • tensor[i, j]๋กœ ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑํ•˜๊ธฐ
  • TileTensor์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•˜์—ฌ ๋‚ด๋ถ€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์ถ”์ƒํ™”ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

  • 2D ์ ‘๊ทผ: TileTensor๋กœ ์ž์—ฐ์Šค๋Ÿฌ์šด \((i,j)\) ์ธ๋ฑ์‹ฑ
  • ๋ฉ”๋ชจ๋ฆฌ ์ถ”์ƒํ™”: ์ˆ˜๋™ ํ–‰ ์šฐ์„  ๊ณ„์‚ฐ ๋ถˆํ•„์š”
  • ๊ฐ€๋“œ ์กฐ๊ฑด: ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: ์Šค๋ ˆ๋“œ \((3 \times 3)\)๊ฐ€ ํ…์„œ ์›์†Œ \((2 \times 2)\)๋ณด๋‹ค ๋งŽ์Œ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime layout = row_major[SIZE, SIZE]()
comptime LayoutType = type_of(layout)


def add_10_2d(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p04/p04_tile_tensor.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€์—์„œ a[row, col]์— 10 ๋”ํ•˜๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p04_tile_tensor
pixi run -e amd p04_tile_tensor
pixi run -e apple p04_tile_tensor
uv run poe p04_tile_tensor

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

์†”๋ฃจ์…˜

def add_10_2d(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    if col < size and row < size:
        output[row, col] = a[row, col] + 10.0


์ด ์†”๋ฃจ์…˜์€:

  • row = thread_idx.y, col = thread_idx.x๋กœ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฐ€์ ธ์˜ด
  • if row < size and col < size๋กœ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ ๋ฐฉ์ง€
  • TileTensor์˜ 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: output[row, col] = a[row, col] + 10.0

Puzzle 5: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ

๊ฐœ์š”

1D TileTensor a์™€ b๋ฅผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋”ํ•ด 2D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ์œ„์น˜ ์ˆ˜๋ณด๋‹ค ๋งŽ์Šต๋‹ˆ๋‹ค.

Broadcast ์‹œ๊ฐํ™” Broadcast ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์— TileTensor ์‚ฌ์šฉํ•˜๊ธฐ
  • ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ ๋‹ค๋ฃจ๊ธฐ
  • TileTensor๋กœ 2D ์ธ๋ฑ์‹ฑ ์ฒ˜๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ํ…์„œ ํฌ๊ธฐ \((1, n)\)์™€ \((n, 1)\)์„ \((n,n)\)์œผ๋กœ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋ฉด์„œ๋„ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

  • ํ…์„œ ํฌ๊ธฐ: ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋Š” \((1, n)\)๊ณผ \((n, 1)\)
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๋‘ ์ฐจ์›์„ ๊ฒฐํ•ฉํ•ด \((n,n)\) ์ถœ๋ ฅ ์ƒ์„ฑ
  • ๊ฐ€๋“œ ์กฐ๊ฑด: ์ถœ๋ ฅ ํฌ๊ธฐ์— ๋Œ€ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋Š” ์—ฌ์ „ํžˆ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: ํ…์„œ ์›์†Œ \((2 \times 2)\)๋ณด๋‹ค ์Šค๋ ˆ๋“œ \((3 \times 3)\)๊ฐ€ ๋งŽ์Œ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime out_layout = row_major[SIZE, SIZE]()
comptime a_layout = row_major[1, SIZE]()
comptime b_layout = row_major[SIZE, 1]()
comptime OutLayout = type_of(out_layout)
comptime ALayout = type_of(a_layout)
comptime BLayout = type_of(b_layout)


def broadcast_add(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, BLayout, ImmutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p05/p05.mojo

ํŒ
  1. 2D ์ธ๋ฑ์Šค ๊ฐ€์ ธ์˜ค๊ธฐ: row = thread_idx.y, col = thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: TileTensor๋กœ a์™€ b ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ• ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p05
pixi run -e amd p05
pixi run -e apple p05
uv run poe p05

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 2.0, 11.0, 12.0])

์†”๋ฃจ์…˜

def broadcast_add(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, BLayout, ImmutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    if row < size and col < size:
        output[row, col] = a[0, col] + b[row, 0]


TileTensor ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์™€ GPU ์Šค๋ ˆ๋“œ ๋งคํ•‘์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ์—์„œ ํ–‰๋ ฌ๋กœ ๋งคํ•‘

    • thread_idx.y๋กœ ํ–‰, thread_idx.x๋กœ ์—ด์— ์ ‘๊ทผ
    • ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ์ด ์ถœ๋ ฅ ํ–‰๋ ฌ ๊ตฌ์กฐ์™€ ์ผ์น˜
    • ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ(3ร—3 ๊ทธ๋ฆฌ๋“œ)๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ์ฒ˜๋ฆฌ
  2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ž‘๋™ ๋ฐฉ์‹

    • ์ž…๋ ฅ a์˜ ํฌ๊ธฐ๋Š” (1,n): a[0,col]์ด ํ–‰์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
    • ์ž…๋ ฅ b์˜ ํฌ๊ธฐ๋Š” (n,1): b[row,0]์ด ์—ด์„ ๊ฐ€๋กœ์งˆ๋Ÿฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
    • ์ถœ๋ ฅ์˜ ํฌ๊ธฐ๋Š” (n,n): ๊ฐ ์›์†Œ๋Š” ํ•ด๋‹น ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’๋“ค์˜ ํ•ฉ
    [ a0 a1 ]  +  [ b0 ]  =  [ a0+b0  a1+b0 ]
                  [ b1 ]     [ a0+b1  a1+b1 ]
    
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ ์กฐ๊ฑด row < size and col < size๋กœ ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ๋ฐฉ์ง€
    • ํ–‰๋ ฌ ๋ฒ”์œ„์™€ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
    • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋•๋ถ„์— a์™€ b์— ๋Œ€ํ•œ ๋ณ„๋„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

์ด ํŒจํ„ด์€ ์ดํ›„ ํผ์ฆ์—์„œ ๋‹ค๋ฃฐ ๋” ๋ณต์žกํ•œ ํ…์„œ ์—ฐ์‚ฐ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

Puzzle 6: ๋ธ”๋ก

๊ฐœ์š”

๋ฒกํ„ฐ a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค.

๋ธ”๋ก ์‹œ๊ฐํ™” ๋ธ”๋ก ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  • ์—ฌ๋Ÿฌ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ ์กฐ์œจ
  • ์ „์—ญ ์Šค๋ ˆ๋“œ ์œ„์น˜ ๊ณ„์‚ฐ

์—ฌ๊ธฐ์„œ ํ•ต์‹ฌ์€ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ํ˜‘๋ ฅํ•˜์—ฌ ๋‹จ์ผ ๋ธ”๋ก ์šฉ๋Ÿ‰๋ณด๋‹ค ํฐ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉด์„œ๋„, ์š”์†Œ์™€ ์Šค๋ ˆ๋“œ ๊ฐ„ ์˜ฌ๋ฐ”๋ฅธ ๋งคํ•‘์„ ์œ ์ง€ํ•˜๋Š” ์›๋ฆฌ๋ฅผ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 9
comptime BLOCKS_PER_GRID = (3, 1)
comptime THREADS_PER_BLOCK = (4, 1)
comptime dtype = DType.float32


def add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p06/p06.mojo

์ฐธ๊ณ : ์ด ํผ์ฆ์˜ TileTensor ๋ฒ„์ „์€ ๊ฑฐ์˜ ๋™์ผํ•˜๋ฏ€๋กœ ๋…์ž์—๊ฒŒ ๋งก๊น๋‹ˆ๋‹ค.

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: i = block_dim.x * block_idx.x + thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if i < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: output[i] = a[i] + 10.0

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p06
pixi run -e amd p06
pixi run -e apple p06
uv run poe p06

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

์†”๋ฃจ์…˜

def add_10_blocks(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    var i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + 10.0


์ด ์†”๋ฃจ์…˜์€ ๋ธ”๋ก ๊ธฐ๋ฐ˜ GPU ์ฒ˜๋ฆฌ์˜ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค:

  1. ์ „์—ญ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

    • ๋ธ”๋ก ์ธ๋ฑ์Šค์™€ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฒฐํ•ฉ: block_dim.x * block_idx.x + thread_idx.x

    • ๊ฐ ์Šค๋ ˆ๋“œ๋ฅผ ๊ณ ์œ ํ•œ ์ „์—ญ ์œ„์น˜์— ๋งคํ•‘

    • ๋ธ”๋ก๋‹น 3๊ฐœ ์Šค๋ ˆ๋“œ ์˜ˆ์‹œ:

      Block 0: [0 1 2]
      Block 1: [3 4 5]
      Block 2: [6 7 8]
      
  2. ๋ธ”๋ก ์กฐ์œจ

    • ๊ฐ ๋ธ”๋ก์€ ์—ฐ์†๋œ ๋ฐ์ดํ„ฐ ์ฒญํฌ๋ฅผ ์ฒ˜๋ฆฌ

    • ๋ธ”๋ก ํฌ๊ธฐ(3) < ๋ฐ์ดํ„ฐ ํฌ๊ธฐ(9)์ด๋ฏ€๋กœ ์—ฌ๋Ÿฌ ๋ธ”๋ก ํ•„์š”

    • ๋ธ”๋ก ๊ฐ„ ์ž๋™ ์ž‘์—… ๋ถ„๋ฐฐ:

      Data:    [0 1 2 3 4 5 6 7 8]
      Block 0: [0 1 2]
      Block 1:       [3 4 5]
      Block 2:             [6 7 8]
      
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ ์กฐ๊ฑด i < size๋กœ ๊ฒฝ๊ณ„ ์ผ€์ด์Šค ์ฒ˜๋ฆฌ
    • ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋กœ ๋‚˜๋ˆ„์–ด ๋–จ์–ด์ง€์ง€ ์•Š์„ ๋•Œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ ๋ฐฉ์ง€
    • ๋ฐ์ดํ„ฐ ๋๋ถ€๋ถ„์˜ ๋ถˆ์™„์ „ํ•œ ๋ธ”๋ก ์ฒ˜๋ฆฌ์— ํ•„์ˆ˜
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ๋ณ‘ํ•ฉ(coalesced) ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์š”์†Œ ์ฒ˜๋ฆฌ: output[i] = a[i] + 10.0
    • ๋ธ”๋ก ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉ

์ด ํŒจํ„ด์€ ๋‹จ์ผ ์Šค๋ ˆ๋“œ ๋ธ”๋ก ํฌ๊ธฐ๋ฅผ ์ดˆ๊ณผํ•˜๋Š” ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ์ฒ˜๋ฆฌ์˜ ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

Puzzle 7: 2D ๋ธ”๋ก

๊ฐœ์š”

2D TileTensor a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 2D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํ–‰๊ณผ ์—ด ํฌ๊ธฐ๋ณด๋‹ค ๋ชจ๋‘ ์ž‘์Šต๋‹ˆ๋‹ค.

2D Blocks ์‹œ๊ฐํ™” 2D Blocks ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ๋ธ”๋ก๊ณผ ํ•จ๊ป˜ TileTensor ์‚ฌ์šฉํ•˜๊ธฐ
  • 2D ๋ธ”๋ก ๊ตฌ์„ฑ์œผ๋กœ ํฐ ํ–‰๋ ฌ ์ฒ˜๋ฆฌํ•˜๊ธฐ
  • ๋ธ”๋ก ์ธ๋ฑ์‹ฑ๊ณผ TileTensor ์ ‘๊ทผ ๊ฒฐํ•ฉํ•˜๊ธฐ

ํ•ต์‹ฌ์€ TileTensor๊ฐ€ 2D ์ธ๋ฑ์‹ฑ์„ ๋‹จ์ˆœํ™”ํ•ด ์ฃผ์ง€๋งŒ, ํฐ ํ–‰๋ ฌ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

๐Ÿ”‘ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ ๋ฐฉ์‹

Puzzle 4: 2D Map์˜ ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ 2D๋กœ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

์ „์—ญ ์œ„์น˜ ๊ณ„์‚ฐ:
row = block_dim.y * block_idx.y + thread_idx.y
col = block_dim.x * block_idx.x + thread_idx.x

์˜ˆ๋ฅผ ๋“ค์–ด, 4ร—4 ๊ทธ๋ฆฌ๋“œ์—์„œ 2ร—2 ๋ธ”๋ก์„ ์‚ฌ์šฉํ•˜๋ฉด:

Block (0,0):   Block (1,0):
[0,0  0,1]     [0,2  0,3]
[1,0  1,1]     [1,2  1,3]

Block (0,1):   Block (1,1):
[2,0  2,1]     [2,2  2,3]
[3,0  3,1]     [3,2  3,3]

๊ฐ ์œ„์น˜๋Š” ํ•ด๋‹น ์Šค๋ ˆ๋“œ์˜ ์ „์—ญ ์ธ๋ฑ์Šค (row, col)๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. ๋ธ”๋ก ์ฐจ์›๊ณผ ์ธ๋ฑ์Šค๊ฐ€ ํ•จ๊ป˜ ์ž‘๋™ํ•˜์—ฌ ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  • 2D ๊ณต๊ฐ„ ์ „์ฒด๋ฅผ ๋นˆํ‹ˆ์—†์ด ์ฒ˜๋ฆฌ
  • ๋ธ”๋ก ๊ฐ„ ๊ฒน์นจ ์—†์Œ
  • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(5 \times 5\) ์›์†Œ
  • ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ: TileTensor๊ฐ€ ํ–‰ ์šฐ์„  ๊ตฌ์„ฑ ๊ด€๋ฆฌ
  • ๋ธ”๋ก ์กฐ์œจ: ์—ฌ๋Ÿฌ ๋ธ”๋ก์œผ๋กœ ์ „์ฒด ํ–‰๋ ฌ ์ปค๋ฒ„
  • 2D ์ธ๋ฑ์‹ฑ: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ ์ž์—ฐ์Šค๋Ÿฌ์šด \((i,j)\) ์ ‘๊ทผ
  • ์ด ์Šค๋ ˆ๋“œ ์ˆ˜: \(25\)๊ฐœ ์›์†Œ์— ๋Œ€ํ•ด \(36\)๊ฐœ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ–‰๋ ฌ ์›์†Œ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 5
comptime BLOCKS_PER_GRID = (2, 2)
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime out_layout = row_major[SIZE, SIZE]()
comptime a_layout = row_major[SIZE, SIZE]()
comptime OutLayout = type_of(out_layout)
comptime ALayout = type_of(a_layout)


def add_10_blocks_2d(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
    size: Int,
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p07/p07.mojo

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ: row = block_dim.y * block_idx.y + thread_idx.y, col = block_dim.x * block_idx.x + thread_idx.x
  2. ๊ฐ€๋“œ ์ถ”๊ฐ€: if row < size and col < size
  3. ๊ฐ€๋“œ ๋‚ด๋ถ€: 2D TileTensor์— 10์„ ๋”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p07
pixi run -e amd p07
pixi run -e apple p07
uv run poe p07

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, ... , 34.0])

์†”๋ฃจ์…˜

def add_10_blocks_2d(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, ALayout, ImmutAnyOrigin],
    size: Int,
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x
    if row < size and col < size:
        output[row, col] = a[row, col] + 10.0


TileTensor๊ฐ€ 2D ๋ธ”๋ก ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

    • ์ „์—ญ ํ–‰(row): block_dim.y * block_idx.y + thread_idx.y

    • ์ „์—ญ ์—ด(col): block_dim.x * block_idx.x + thread_idx.x

    • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ๋ฅผ ํ…์„œ ์›์†Œ์— ๋งคํ•‘:

      3ร—3 ๋ธ”๋ก์œผ๋กœ ๊ตฌ์„ฑ๋œ 5ร—5 ํ…์„œ:
      
      Block (0,0)         Block (1,0)
      [(0,0) (0,1) (0,2)] [(0,3) (0,4)    *  ]
      [(1,0) (1,1) (1,2)] [(1,3) (1,4)    *  ]
      [(2,0) (2,1) (2,2)] [(2,3) (2,4)    *  ]
      
      Block (0,1)         Block (1,1)
      [(3,0) (3,1) (3,2)] [(3,3) (3,4)    *  ]
      [(4,0) (4,1) (4,2)] [(4,3) (4,4)    *  ]
      [  *     *     *  ] [  *     *      *  ]
      

      (* = ์Šค๋ ˆ๋“œ๋Š” ์กด์žฌํ•˜์ง€๋งŒ ํ…์„œ ๊ฒฝ๊ณ„ ๋ฐ–)

  2. TileTensor์˜ ์žฅ์ 

    • ์ž์—ฐ์Šค๋Ÿฌ์šด 2D ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์˜คํ”„์…‹ ๊ณ„์‚ฐ ๋Œ€์‹  tensor[row, col] ์‚ฌ์šฉ

    • ์ž๋™ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”

    • ์ ‘๊ทผ ํŒจํ„ด ์˜ˆ์‹œ:

      ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ:          TileTensor:
      row * size + col    tensor[row, col]
      (2,1) -> 11        (2,1) -> ๊ฐ™์€ ์›์†Œ
      
  3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

    • ๊ฐ€๋“œ row < size and col < size๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ์ƒํ™ฉ:
      • ๋ถ€๋ถ„ ๋ธ”๋ก์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋Š” ์Šค๋ ˆ๋“œ
      • ํ…์„œ ๊ฒฝ๊ณ„์˜ ์—ฃ์ง€ ์ผ€์ด์Šค
      • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์€ TileTensor๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
      • 25๊ฐœ ์›์†Œ๋ฅผ 36๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ์ฒ˜๋ฆฌ (3ร—3 ๋ธ”๋ก์˜ 2ร—2 ๊ทธ๋ฆฌ๋“œ)
  4. ๋ธ”๋ก ์กฐ์œจ

    • ๊ฐ 3ร—3 ๋ธ”๋ก์ด 5ร—5 ํ…์„œ์˜ ์ผ๋ถ€๋ถ„์„ ๋‹ด๋‹น
    • TileTensor๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ถ€๋ถ„:
      • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”
      • ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด
      • ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ฐ„ ์กฐ์œจ
      • ์บ์‹œ ์นœํ™”์  ๋ฐ์ดํ„ฐ ์ ‘๊ทผ

์ด ํŒจํ„ด์€ TileTensor๊ฐ€ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ์กฐ์œจ์„ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ 2D ๋ธ”๋ก ์ฒ˜๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 8: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ

๊ฐœ์š”

1D TileTensor a์˜ ๊ฐ ์œ„์น˜์— 10์„ ๋”ํ•ด 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ด ๋ณด์„ธ์š”.

์ฐธ๊ณ : ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ a์˜ ํฌ๊ธฐ๋ณด๋‹ค ์ž‘์Šต๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ๊ฐํ™” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • address_space๋ฅผ ํ™œ์šฉํ•œ TileTensor์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ๋Šฅ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ์˜ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  • TileTensor๋กœ ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ๊ด€๋ฆฌํ•˜๊ธฐ

ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ์˜ ์„ฑ๋Šฅ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8 ์›์†Œ
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 4
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น TPB๊ฐœ ์›์†Œ

๊ฒฝ๊ณ : ๊ฐ ๋ธ”๋ก์—๋Š” ํ•ด๋‹น ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ๋“ค์ด ์ฝ๊ณ  ์“ธ ์ˆ˜ ์žˆ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์–‘์ด _์ƒ์ˆ˜_๋กœ ๊ณ ์ •๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ’์€ ํŒŒ์ด์ฌ ๋ฆฌํ„ฐ๋Ÿด ์ƒ์ˆ˜์—ฌ์•ผ ํ•˜๋ฉฐ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์“ด ํ›„์—๋Š” barrier๋ฅผ ํ˜ธ์ถœํ•ด ์Šค๋ ˆ๋“œ๋“ค์ด ๊ต์ฐจํ•˜์ง€ ์•Š๋„๋ก ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ์ฐธ๊ณ : ์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์—๋งŒ ์ ‘๊ทผํ•˜๋ฏ€๋กœ barrier()๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋” ๋ณต์žกํ•œ ์ƒํ™ฉ์—์„œ ํ•„์š”ํ•œ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.
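๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ "๋กœ๋“œ โ†’ ๋™๊ธฐํ™” โ†’ ์ฒ˜๋ฆฌ" ํ๋ฆ„์€ ์•„๋ž˜ ํŒŒ์ด์ฌ ์Šค์ผ€์น˜๋กœ ํ‰๋‚ด ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐœ๋… ๊ฒ€์ฆ์šฉ CPU ์ฝ”๋“œ์ด๋ฉฐ, ์ž…๋ ฅ๊ฐ’์€ ์•„๋ž˜ ์˜ˆ์ƒ ์ถœ๋ ฅ(๋ชจ๋‘ 11.0)๊ณผ ๋งž๋„๋ก ๋ชจ๋‘ 1.0์œผ๋กœ ๊ฐ€์ •ํ–ˆ์Šต๋‹ˆ๋‹ค:

```python
# Puzzle 8 ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ํ‰๋‚ด ๋‚ธ CPU ์Šค์ผ€์น˜:
# ๋ธ”๋ก๋งˆ๋‹ค TPB ํฌ๊ธฐ์˜ ๋ณ„๋„ shared ๋ฐฐ์—ด์„ ๋‘๊ณ  ๋กœ๋“œ -> (barrier) -> ์ฒ˜๋ฆฌ ์ˆœ์„œ๋กœ ์ง„ํ–‰
TPB = 4
SIZE = 8
NUM_BLOCKS = 2
a = [1.0] * SIZE                  # ๊ฐ€์ •ํ•œ ์ž…๋ ฅ๊ฐ’
output = [0.0] * SIZE

for block_idx in range(NUM_BLOCKS):
    shared = [0.0] * TPB          # ๋ธ”๋ก ๋กœ์ปฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ (๋ธ”๋ก๋งˆ๋‹ค ๋…๋ฆฝ)
    # 1๋‹จ๊ณ„: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ -> ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ
    for local_i in range(TPB):
        global_i = TPB * block_idx + local_i
        if global_i < SIZE:
            shared[local_i] = a[global_i]
    # (์ด ์ง€์ ์ด barrier(): ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ)
    # 2๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ฝ์–ด ๊ฒฐ๊ณผ ๊ณ„์‚ฐ
    for local_i in range(TPB):
        global_i = TPB * block_idx + local_i
        if global_i < SIZE:
            output[global_i] = shared[local_i] + 10.0

print(output)  # [11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0]
```

์ˆœ์ฐจ ์‹คํ–‰์ธ CPU์™€ ๋‹ฌ๋ฆฌ GPU์—์„œ๋Š” ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ๋™์‹œ์— ์‹คํ–‰๋˜๋ฏ€๋กœ, ๋‘ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ barrier๊ฐ€ ์‹ค์ œ ๋™๊ธฐํ™” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค.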

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 4
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (2, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)


def add_10_shared(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    # Allocate shared memory using stack_allocation
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


์ „์ฒด ์ฝ”๋“œ ๋ณด๊ธฐ: problems/p08/p08.mojo

ํŒ
  1. address_space ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ TileTensor ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ
  2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: shared[local_i] = a[global_i]
  3. barrier()๋กœ ๋™๊ธฐํ™” (ํ•™์Šต์šฉ - ์—ฌ๊ธฐ์„œ๋Š” ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Œ)
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋กœ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  5. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•˜๋Š” ๊ฐ€๋“œ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p08
pixi run -e amd p08
pixi run -e apple p08
uv run poe p08

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

์†”๋ฃจ์…˜

def add_10_shared_tile_tensor(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    # Allocate shared memory using stack_allocation
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    # Note: barrier is not strictly needed here since each thread only accesses
    # its own shared memory location. However, it's included to teach proper
    # shared memory synchronization patterns for more complex scenarios where
    # threads need to coordinate access to shared data.
    barrier()

    if global_i < size:
        output[global_i] = shared[local_i] + 10


TileTensor๊ฐ€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ์„ ์–ผ๋งˆ๋‚˜ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค:

  1. TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

    • ์ „์—ญ ํ…์„œ: a์™€ output (๋А๋ฆผ, ๋ชจ๋“  ๋ธ”๋ก์—์„œ ๋ณด์ž„)

    • ๊ณต์œ  ํ…์„œ: shared (๋น ๋ฆ„, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋กœ์ปฌ)

    • ๋ธ”๋ก๋‹น 4๊ฐœ ์Šค๋ ˆ๋“œ๋กœ 8๊ฐœ ์›์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์˜ˆ์‹œ:

      ์ „์—ญ ํ…์„œ a: [1 1 1 1 | 1 1 1 1]  # ์ž…๋ ฅ: ๋ชจ๋‘ 1
      
      Block (0):         Block (1):
      shared[0..3]       shared[0..3]
      [1 1 1 1]          [1 1 1 1]
      
  2. ์Šค๋ ˆ๋“œ ์กฐ์œจ

    • ๋กœ๋“œ ๋‹จ๊ณ„ (์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ):

      Thread 0: shared[0] = a[0]=1    Thread 2: shared[2] = a[2]=1
      Thread 1: shared[1] = a[1]=1    Thread 3: shared[3] = a[3]=1
      barrier()    โ†“         โ†“        โ†“         โ†“   # ๋ชจ๋“  ๋กœ๋“œ ์™„๋ฃŒ ๋Œ€๊ธฐ
      
    • ์ฒ˜๋ฆฌ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ํ…์„œ ๊ฐ’์— 10์„ ๋”ํ•จ

    • ๊ฒฐ๊ณผ: output[global_i] = shared[local_i] + 10 = 11

์ฐธ๊ณ : ์ด ๊ฒฝ์šฐ์—๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜(shared[local_i])์—๋งŒ ์“ฐ๊ณ  ์ฝ์œผ๋ฏ€๋กœ barrier()๊ฐ€ ์—„๋ฐ€ํžˆ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•˜๋Š” ์ƒํ™ฉ์—์„œ ํ•„์ˆ˜์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์ตํžˆ๊ธฐ ์œ„ํ•ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

  1. TileTensor์˜ ์žฅ์ 

    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น:

      # address_space๋ฅผ ์‚ฌ์šฉํ•œ ๊น”๋”ํ•œ TileTensor API
      shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())
      
    • ์ „์—ญ๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋‘ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ:

      Block 0 ์ถœ๋ ฅ: [11 11 11 11]
      Block 1 ์ถœ๋ ฅ: [11 11 11 11]
      
    • ๋‚ด์žฅ๋œ ๋ ˆ์ด์•„์›ƒ ๊ด€๋ฆฌ์™€ ํƒ€์ž… ์•ˆ์ „์„ฑ

  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ๋กœ๋“œ: ์ „์—ญ ํ…์„œ โ†’ ๊ณต์œ  ํ…์„œ (์ตœ์ ํ™”๋จ)
    • ๋™๊ธฐํ™”: ์›์‹œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „๊ณผ ๋™์ผํ•œ barrier() ํ•„์š”
    • ์ฒ˜๋ฆฌ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐ’์— 10 ๋”ํ•˜๊ธฐ
    • ์ €์žฅ: ๊ฒฐ๊ณผ(11)๋ฅผ ์ „์—ญ ํ…์„œ์— ์“ฐ๊ธฐ

์ด ํŒจํ„ด์€ TileTensor๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ ์ด์ ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋” ํŽธ๋ฆฌํ•œ API์™€ ๋‚ด์žฅ ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 9: GPU ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ

โš ๏ธ ์ด ํผ์ฆ์€ ํ˜ธํ™˜๋˜๋Š” NVIDIA GPU์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ GPU ๋ฒค๋” ์ง€์›์„ ์œ„ํ•œ ๋„๊ตฌ ๊ฐœ๋ฐœ์ด ์ง„ํ–‰ ์ค‘์ž…๋‹ˆ๋‹ค.

GPU ํ”„๋กœ๊ทธ๋žจ์ด ์‹คํŒจํ•  ๋•Œ

์ง€๊ธˆ๊นŒ์ง€ GPU ์ปค๋„์„ ์ž‘์„ฑํ•˜๊ณ , ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋‹ค๋ฃจ๊ณ , ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋ฅผ ์กฐ์œจํ•ด ์™”์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ๊ฐ€ ์ปดํŒŒ์ผ๋ฉ๋‹ˆ๋‹ค. ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋Œ€ํ•˜๋ฉฐ ์‹คํ–‰ํ•˜๋ฉด:

  • ํฌ๋ž˜์‹œ
  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ
  • ๋ฌดํ•œ ์ •์ง€

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ˜„์‹ค์ด ๋ฐ”๋กœ ์ด๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ์—์„œ ๋™์‹œ์— ์‹คํ–‰๋˜๋Š” ๋ณ‘๋ ฌ ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•ด์•ผ ํ•˜์ฃ . ์ด๋ก ๊ณผ ์‹ค์ „์ด ๋งŒ๋‚˜๊ณ , ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ง€์‹๊ณผ ์กฐ์‚ฌ ๋Šฅ๋ ฅ์ด ๊ต์ฐจํ•˜๋Š” ์˜์—ญ์ž…๋‹ˆ๋‹ค.

GPU ๋””๋ฒ„๊น…์ด ์–ด๋ ค์šด ์ด์œ 

๋‹จ์ผ ์Šค๋ ˆ๋“œ์˜ ์ˆœ์ฐจ ์‹คํ–‰์„ ๋”ฐ๋ผ๊ฐ€๋Š” ์ „ํ†ต์ ์ธ CPU ๋””๋ฒ„๊น…๊ณผ ๋‹ฌ๋ฆฌ, GPU ๋””๋ฒ„๊น…์€ ๋‹ค์Œ์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค:

  • ๋ณ‘๋ ฌ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ: ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋˜๋ฉฐ, ๊ฐ๊ฐ ๋‹ค๋ฅธ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Œ
  • ์—ฌ๋Ÿฌ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„ ํƒ์ƒ‰: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ ˆ์ง€์Šคํ„ฐ, ์ƒ์ˆ˜ ๋ฉ”๋ชจ๋ฆฌ
  • ์กฐ์œจ ์‹คํŒจ ์ฒ˜๋ฆฌ: ๊ฒฝ์Ÿ ์ƒํƒœ, ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์œ„๋ฐ˜
  • ์ตœ์ ํ™”๋œ ์ฝ”๋“œ ๋””๋ฒ„๊น…: JIT ์ปดํŒŒ์ผ, ๋ณ€์ˆ˜ ์ตœ์ ํ™”, ์ œํ•œ๋œ ์‹ฌ๋ณผ ์ •๋ณด
  • ์ „๋ฌธ ๋„๊ตฌ ์‚ฌ์šฉ: ์ปค๋„ ๊ฒ€์‚ฌ, ์Šค๋ ˆ๋“œ ํƒ์ƒ‰, ๋ณ‘๋ ฌ ์ƒํƒœ ๋ถ„์„์„ ์œ„ํ•œ CUDA-GDB

GPU ๋””๋ฒ„๊น…์„ ์ตํžˆ๋ฉด ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์˜ ๊ธฐ์ดˆ๋ฅผ ๊นŠ์ด ์ดํ•ดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ

์ด ํผ์ฆ์—์„œ๋Š” GPU ์ฝ”๋“œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋งค์ผ ์‚ฌ์šฉํ•˜๋Š” ์ ‘๊ทผ๋ฒ•, ๋„๊ตฌ, ๊ธฐ๋ฒ•์„ ์ตํžˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ตํžˆ๊ฒŒ ๋  ํ•ต์‹ฌ ๊ธฐ์ˆ 

  1. ์ „๋ฌธ์ ์ธ ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ - ์ „๋ฌธ๊ฐ€๋“ค์ด ์‚ฌ์šฉํ•˜๋Š” ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•
  2. ๋„๊ตฌ ์ˆ™๋ จ๋„ - ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ์šฉ LLDB, GPU ์ปค๋„์šฉ CUDA-GDB
  3. ํŒจํ„ด ์ธ์‹ - ํ”ํ•œ GPU ๋ฒ„๊ทธ ์œ ํ˜•๊ณผ ์ฆ์ƒ
  4. ์กฐ์‚ฌ ๊ธฐ๋ฒ• - ๋ณ€์ˆ˜๊ฐ€ ์ตœ์ ํ™”๋กœ ์ œ๊ฑฐ๋˜์—ˆ์„ ๋•Œ ๊ทผ๋ณธ ์›์ธ ์ฐพ๊ธฐ
  5. ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น… - ๊ณ ๊ธ‰ GPU ๋””๋ฒ„๊น… ๊ธฐ์ˆ 

์‹ค์ œ ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค

๊ฐ€์žฅ ํ”ํ•œ ์„ธ ๊ฐ€์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‹คํŒจ ์ƒํ™ฉ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ - Null ํฌ์ธํ„ฐ, ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ, ์„ธ๊ทธ๋ฉ˜ํ…Œ์ด์…˜ ํดํŠธ
  • ๋กœ์ง ๋ฒ„๊ทธ - ์ •์ƒ ์‹คํ–‰๋˜์ง€๋งŒ ๊ฒฐ๊ณผ๊ฐ€ ํ‹€๋ฆผ, ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜
  • ์กฐ์œจ ๊ต์ฐฉ ์ƒํƒœ - ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™” ์‹คํŒจ, ๋ฌดํ•œ ์ •์ง€

๊ฐ ์‹œ๋‚˜๋ฆฌ์˜ค๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์กฐ์‚ฌ ๊ธฐ๋ฒ•์„ ๊ฐ€๋ฅด์น˜๊ณ  ๋””๋ฒ„๊น… ๊ฐ๊ฐ์„ ๊ธธ๋Ÿฌ์ค๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์—ฌ์ •

์ด ํผ์ฆ์€ ๊ธฐ๋ณธ ๋””๋ฒ„๊น… ๊ฐœ๋…๋ถ€ํ„ฐ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์กฐ์œจ ์‹คํŒจ๊นŒ์ง€, ์ฒด๊ณ„์ ์œผ๋กœ ์„ค๊ณ„๋œ ๊ณผ์ •์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ“š Step 1: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ

๊ธฐ์ดˆ ๋‹ค์ง€๊ธฐ - ๋„๊ตฌ์™€ ์›Œํฌํ”Œ๋กœ์šฐ ๋ฐฐ์šฐ๊ธฐ

  • pixi์™€ CUDA-GDB๋กœ ๋””๋ฒ„๊น… ํ™˜๊ฒฝ ์„ค์ •
  • ๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ๋ฐฐ์šฐ๊ธฐ: JIT vs ๋ฐ”์ด๋„ˆ๋ฆฌ, CPU vs GPU
  • GPU ์ปค๋„ ๊ฒ€์‚ฌ๋ฅผ ์œ„ํ•œ ํ•„์ˆ˜ CUDA-GDB ๋ช…๋ น์–ด ํ•™์Šต
  • ์ด์ „ ํผ์ฆ์˜ ์ต์ˆ™ํ•œ ์ฝ”๋“œ๋กœ ์‹ค์Šต
  • ๊ฐ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•์„ ์–ธ์ œ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š”์ง€ ์ดํ•ด

๋ชฉํ‘œ: ์ „๋ฌธ์ ์ธ ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๋„๊ตฌ ์ˆ™๋ จ๋„

๐Ÿง Step 2: ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€

๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ ์กฐ์‚ฌ - ํฌ๋ž˜์‹œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น…

  • CUDA_ERROR_ILLEGAL_ADDRESS ํฌ๋ž˜์‹œ ์กฐ์‚ฌ
  • ์ฒด๊ณ„์ ์ธ ํฌ์ธํ„ฐ ๊ฒ€์‚ฌ ๊ธฐ๋ฒ• ํ•™์Šต
  • Null ํฌ์ธํ„ฐ ํƒ์ง€ ๋ฐ ๊ฒ€์ฆ ํ•™์Šต
  • ์ „๋ฌธ์ ์ธ ํฌ๋ž˜์‹œ ๋ถ„์„ ์›Œํฌํ”Œ๋กœ์šฐ ์‹ค์Šต
  • GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹คํŒจ ์ดํ•ด

๋ชฉํ‘œ: GPU ๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ์™€ ํฌ์ธํ„ฐ ๋ฌธ์ œ ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ

๐Ÿ” Step 3: ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๋กœ์ง ๋ฒ„๊ทธ ์กฐ์‚ฌ - ๊ฒฐ๊ณผ๊ฐ€ ํ‹€๋ฆฐ ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น…

  • TileTensor ๊ธฐ๋ฐ˜์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • ์ตœ์ ํ™”๋กœ ๋ณ€์ˆ˜๊ฐ€ ์‚ฌ๋ผ์กŒ์„ ๋•Œ ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„ํ•˜๊ธฐ
  • ๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„์™€ ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ถ„์„ํ•˜๊ธฐ
  • ํ‹€๋ฆฐ ๊ฒฐ๊ณผ์—์„œ ํŒจํ„ด ์ฐพ์•„๋‚ด๊ธฐ
  • ๋ณ€์ˆ˜๋ฅผ ์ง์ ‘ ํ™•์ธํ•˜์ง€ ์•Š๊ณ  ๋””๋ฒ„๊น…ํ•˜๊ธฐ

๋ชฉํ‘œ: GPU ์ปค๋„์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜์™€ ๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ

๐Ÿ•ต๏ธ Step 4: ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ์กฐ์‚ฌ - ์˜์›ํžˆ ๋ฉˆ์ถ”๋Š” ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น…

  • ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™” ์‹คํŒจ ์กฐ์‚ฌ
  • ๋ณ‘๋ ฌ ์‹คํ–‰ ์ „๋ฐ˜์˜ ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„ ํ•™์Šต
  • ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๊ฒฝ๋กœ ์ถ”์  ํ•™์Šต
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น… ์‹ค์Šต
  • ๊ฐ€์žฅ ์–ด๋ ค์šด GPU ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค ์ดํ•ด

๋ชฉํ‘œ: ๊ณ ๊ธ‰ ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น… - GPU ๋””๋ฒ„๊น… ๊ธฐ์ˆ ์˜ ์ •์ 

ํƒ์ •์˜ ๋งˆ์ธ๋“œ์…‹

GPU ๋””๋ฒ„๊น…์€ ์ผ๋ฐ˜์ ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ค๋ฅธ ์‚ฌ๊ณ ๋ฐฉ์‹์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ๋ถ„์€ ๋ฒ”์ฃ„ ํ˜„์žฅ์„ ์กฐ์‚ฌํ•˜๋Š” ํƒ์ •์ด ๋ฉ๋‹ˆ๋‹ค:

  • ๋‹จ์„œ๊ฐ€ ๋ถ€์กฑํ•จ - ๋ณ€์ˆ˜๋Š” ์ตœ์ ํ™”๋กœ ์‚ฌ๋ผ์ง€๊ณ , ์‹ฌ๋ณผ๋ช…์€ ์•Œ์•„๋ณด๊ธฐ ์–ด๋ ค์›€
  • ์šฉ์˜์ž๊ฐ€ ๋„˜์นจ - ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ, ๋ˆ„๊ตฌ๋“  ๋ฒ”์ธ์ผ ์ˆ˜ ์žˆ์Œ
  • ํƒ€์ž„๋ผ์ธ์ด ๋ณต์žกํ•จ - ๋ณ‘๋ ฌ ์‹คํ–‰, ๊ฒฝ์Ÿ ์ƒํƒœ, ํƒ€์ด๋ฐ ์˜์กด์„ฑ
  • ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•จ - CUDA-GDB, ์Šค๋ ˆ๋“œ ํƒ์ƒ‰, GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ

ํ•˜์ง€๋งŒ ํ›Œ๋ฅญํ•œ ํƒ์ •์ด ๊ทธ๋ ‡๋“ฏ, ์—ฌ๋Ÿฌ๋ถ„๋„ ๋‹ค์Œ์„ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  • ๋‹จ์„œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ถ”์  - ์—๋Ÿฌ ๋ฉ”์‹œ์ง€, ํฌ๋ž˜์‹œ ํŒจํ„ด, ์Šค๋ ˆ๋“œ ์ƒํƒœ
  • ๊ฐ€์„ค ์ˆ˜๋ฆฝ - ์ด ๋™์ž‘์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์›์ธ์€ ๋ฌด์—‡์ผ๊นŒ?
  • ์ด๋ก  ๊ฒ€์ฆ - ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋กœ ์•„์ด๋””์–ด๋ฅผ ํ™•์ธํ•˜๊ฑฐ๋‚˜ ๋ฐ˜์ฆ
  • ๊ทผ๋ณธ ์›์ธ ์ถ”์  - ์ฆ์ƒ์—์„œ ์‹ค์ œ ๋ฌธ์ œ์˜ ์›์ธ๊นŒ์ง€

์‹œ์ž‘ํ•˜๊ธฐ ์ „์—

์•Œ์•„์•ผ ํ•  ๊ฒƒ:

  • Puzzle 1-8์—์„œ ๋‹ค๋ฃฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋… (์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ, ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด)
  • ๊ธฐ๋ณธ์ ์ธ ๋ช…๋ น์ค„ ์‚ฌ์šฉ์— ์ต์ˆ™ํ•จ (ํ„ฐ๋ฏธ๋„ ๊ธฐ๋ฐ˜ ๋””๋ฒ„๊น… ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค)
  • ์ธ๋‚ด์‹ฌ๊ณผ ์ฒด๊ณ„์  ์‚ฌ๊ณ  (GPU ๋””๋ฒ„๊น…์€ ๊ผผ๊ผผํ•œ ์กฐ์‚ฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค)

๋ชฉํ‘œ:

  • GPU ๊ฐœ๋ฐœํŒ€์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ „๋ฌธ ๋””๋ฒ„๊น… ๊ธฐ์ˆ 
  • ์Šค๋ ˆ๋“œ ์ˆ˜์ค€์˜ ์‹คํ–‰์„ ๊ด€์ฐฐํ•˜๋ฉฐ ์–ป๋Š” ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด
  • ๊ฐ€์žฅ ๊นŒ๋‹ค๋กœ์šด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ƒํ™ฉ์—์„œ๋„ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ž์‹ ๊ฐ
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ปค๋ฆฌ์–ด ์ „๋ฐ˜์— ๋„์›€์ด ๋  ๋„๊ตฌ ์ˆ™๋ จ๋„

์‹œ์ž‘ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”?

GPU ๋””๋ฒ„๊น…์€ GPU ํ”„๋กœ๊ทธ๋žจ์„ ์ž‘์„ฑํ•˜๋Š” ๊ฒƒ์—์„œ ๊นŠ์ด ์ดํ•ดํ•˜๋Š” ๊ฒƒ์œผ๋กœ ๋‚˜์•„๊ฐ€๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค. ์ „๋ฌธ GPU ๊ฐœ๋ฐœ์ž๋ผ๋ฉด ๋ˆ„๊ตฌ๋‚˜ ๋ณ‘๋ ฌ ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๊ณ , ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋กœ ๋™์‹œ์— ์‚ฌ๊ณ ํ•˜๋Š” ๋ฒ•์„ ์ตํžˆ๊ณ , ๋ณต์žกํ•œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ๋ˆ๊ธฐ ์žˆ๊ฒŒ ์กฐ์‚ฌํ•˜๋ฉฐ ์ˆ˜๋งŽ์€ ์‹œ๊ฐ„์„ ๋ณด๋ƒˆ์Šต๋‹ˆ๋‹ค.

์ง€๊ธˆ์ด ๋ฐ”๋กœ ๊ทธ ์ „๋ฌธ๊ฐ€ ๊ทธ๋ฃน์— ํ•ฉ๋ฅ˜ํ•  ๊ธฐํšŒ์ž…๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์—ฌ์ • ์‹œ์ž‘ํ•˜๊ธฐ: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ


โ€œ๋””๋ฒ„๊น…์€ ์ฝ”๋“œ ์ž‘์„ฑ๋ณด๋‹ค ๋‘ ๋ฐฐ๋Š” ์–ด๋ ต๋‹ค. ๋”ฐ๋ผ์„œ ์ตœ๋Œ€ํ•œ ์˜๋ฆฌํ•˜๊ฒŒ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ๋‹ค๋ฉด, ์ •์˜์ƒ ๊ทธ๊ฒƒ์„ ๋””๋ฒ„๊น…ํ•  ๋งŒํผ ๋˜‘๋˜‘ํ•˜์ง€ ์•Š๋‹ค๋Š” ๋œป์ด๋‹ค.โ€ - Brian Kernighan

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” ์ด ๋ง์ด ์ˆ˜์ฒœ ๋ฐฐ๋กœ ์™€๋‹ฟ์Šต๋‹ˆ๋‹ค. ๋™์‹œ์— ๋””๋ฒ„๊น…ํ•ด์•ผ ํ•  ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ์ˆ˜๋งŒํผ์š”.

๐Ÿ“š Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ

GPU ๋””๋ฒ„๊น…์˜ ์„ธ๊ณ„์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! Puzzle 1-8์„ ํ†ตํ•ด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์„ ๋ฐฐ์› ์œผ๋‹ˆ, ์ด์ œ ๋ชจ๋“  GPU ํ”„๋กœ๊ทธ๋ž˜๋จธ์—๊ฒŒ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ธฐ์ˆ ์„ ๋ฐฐ์šธ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค: ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ–ˆ์„ ๋•Œ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ๋ฒ•.

GPU ๋””๋ฒ„๊น…์€ ์ฒ˜์Œ์—๋Š” ์–ด๋ ค์›Œ ๋ณด์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋˜๊ณ , ๋‹ค์–‘ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์ด ์žˆ์œผ๋ฉฐ, ํ•˜๋“œ์›จ์–ด๋ณ„ ๋™์ž‘๋„ ๋‹ค๋ฃจ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ์ ์ ˆํ•œ ๋„๊ตฌ์™€ ์›Œํฌํ”Œ๋กœ์šฐ๋งŒ ์žˆ์œผ๋ฉด GPU ์ฝ”๋“œ ๋””๋ฒ„๊น…๋„ ์ฒด๊ณ„์ ์œผ๋กœ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ(GPU ์ž‘์—…์„ ์„ค์ •ํ•˜๋Š” ๋ถ€๋ถ„)์™€ GPU ์ปค๋„ ์ฝ”๋“œ(๋ณ‘๋ ฌ ์—ฐ์‚ฐ์ด ์‹คํ–‰๋˜๋Š” ๋ถ€๋ถ„) ๋ชจ๋‘๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์‹ค์ œ ์˜ˆ์ œ, ์‹ค์ œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ, ๊ทธ๋ฆฌ๊ณ  ์—ฌ๋Ÿฌ๋ถ„์˜ ํ”„๋กœ์ ํŠธ์— ๋ฐ”๋กœ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋‹จ๊ณ„๋ณ„ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ๋‹ค์Œ ๋‚ด์šฉ์€ ๋ฒ”์šฉ IDE ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•ด ๋ช…๋ น์ค„ ๋””๋ฒ„๊น…์— ์ดˆ์ ์„ ๋งž์ถฅ๋‹ˆ๋‹ค. VS Code ๋””๋ฒ„๊น…์„ ์„ ํ˜ธํ•œ๋‹ค๋ฉด Mojo ๋””๋ฒ„๊น… ๋ฌธ์„œ์—์„œ VS Code ์ „์šฉ ์„ค์ •๊ณผ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

GPU ๋””๋ฒ„๊น…์ด ๋‹ค๋ฅธ ์ด์œ 

๋„๊ตฌ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, GPU ๋””๋ฒ„๊น…์ด ํŠน๋ณ„ํ•œ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  • ์ „ํ†ต์ ์ธ CPU ๋””๋ฒ„๊น…: ๋‹จ์ผ ์Šค๋ ˆ๋“œ, ์ˆœ์ฐจ ์‹คํ–‰, ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋ธ
  • GPU ๋””๋ฒ„๊น…: ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ, ๋ณ‘๋ ฌ ์‹คํ–‰, ์—ฌ๋Ÿฌ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„, ๊ฒฝ์Ÿ ์ƒํƒœ

์ด๋Š” ๋‹ค์Œ์„ ํ•  ์ˆ˜ ์žˆ๋Š” ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค:

  • ์„œ๋กœ ๋‹ค๋ฅธ GPU ์Šค๋ ˆ๋“œ ๊ฐ„ ์ „ํ™˜
  • ์Šค๋ ˆ๋“œ๋ณ„ ๋ณ€์ˆ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ
  • ๋ณ‘๋ ฌ ์‹คํ–‰์˜ ๋ณต์žก์„ฑ ์ฒ˜๋ฆฌ
  • CPU ์„ค์ • ์ฝ”๋“œ์™€ GPU ์ปค๋„ ์ฝ”๋“œ ๋ชจ๋‘ ๋””๋ฒ„๊น…

๋””๋ฒ„๊น… ๋„๊ตฌ ๋ชจ์Œ

Mojo์˜ GPU ๋””๋ฒ„๊น… ๊ธฐ๋Šฅ์€ ํ˜„์žฌ NVIDIA GPU๋กœ ์ œํ•œ๋ฉ๋‹ˆ๋‹ค. Mojo ๋””๋ฒ„๊น… ๋ฌธ์„œ์— ๋”ฐ๋ฅด๋ฉด Mojo ํŒจํ‚ค์ง€์—๋Š” ๋‹ค์Œ์ด ํฌํ•จ๋ฉ๋‹ˆ๋‹ค:

  • CPU ์ธก ๋””๋ฒ„๊น…์„ ์œ„ํ•œ Mojo ํ”Œ๋Ÿฌ๊ทธ์ธ์ด ํฌํ•จ๋œ LLDB ๋””๋ฒ„๊ฑฐ
  • GPU ์ปค๋„ ๋””๋ฒ„๊น…์„ ์œ„ํ•œ CUDA-GDB ํ†ตํ•ฉ
  • ๋ฒ”์šฉ IDE ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•œ mojo debug๋ฅผ ํ†ตํ•œ ๋ช…๋ น์ค„ ์ธํ„ฐํŽ˜์ด์Šค

GPU ์ „์šฉ ๋””๋ฒ„๊น…์— ๋Œ€ํ•ด์„œ๋Š” Mojo GPU ๋””๋ฒ„๊น… ๊ฐ€์ด๋“œ์—์„œ ์ถ”๊ฐ€ ๊ธฐ์ˆ  ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์ด ์•„ํ‚คํ…์ฒ˜๋Š” ์ต์ˆ™ํ•œ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด์™€ GPU ์ „์šฉ ๊ธฐ๋Šฅ, ๋‘ ๊ฐ€์ง€ ์žฅ์ ์„ ๋ชจ๋‘ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ: ๋ฌธ์ œ์—์„œ ํ•ด๊ฒฐ๊นŒ์ง€

GPU ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜๊ฑฐ๋‚˜, ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๊ฑฐ๋‚˜, ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋™์ž‘์„ ํ•  ๋•Œ ๋‹ค์Œ์˜ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ๋”ฐ๋ฅด์„ธ์š”:

  1. ๋””๋ฒ„๊น…์„ ์œ„ํ•œ ์ฝ”๋“œ ์ค€๋น„ (์ตœ์ ํ™” ๋น„ํ™œ์„ฑํ™”, ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ์ถ”๊ฐ€)
  2. ์ ์ ˆํ•œ ๋””๋ฒ„๊ฑฐ ์„ ํƒ (CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ vs GPU ์ปค๋„ ๋””๋ฒ„๊น…)
  3. ์ „๋žต์  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ • (๋ฌธ์ œ๊ฐ€ ์˜์‹ฌ๋˜๋Š” ์œ„์น˜์—)
  4. ์‹คํ–‰ ๋ฐ ๊ฒ€์‚ฌ (์ฝ”๋“œ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•˜๋ฉฐ ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ)
  5. ํŒจํ„ด ๋ถ„์„ (๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ, ์Šค๋ ˆ๋“œ ๋™์ž‘, ๊ฒฝ์Ÿ ์ƒํƒœ)

์ด ์›Œํฌํ”Œ๋กœ์šฐ๋Š” Puzzle 01์˜ ๊ฐ„๋‹จํ•œ ๋ฐฐ์—ด ์—ฐ์‚ฐ์ด๋“  Puzzle 08์˜ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ”๋“œ๋“  ์ƒ๊ด€์—†์ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

Step 1: ๋””๋ฒ„๊น…์„ ์œ„ํ•œ ์ฝ”๋“œ ์ค€๋น„

๐Ÿฅ‡ ์ฒ ์น™: ์ตœ์ ํ™”๋œ ์ฝ”๋“œ๋Š” ์ ˆ๋Œ€ ๋””๋ฒ„๊น…ํ•˜์ง€ ๋งˆ์„ธ์š”. ์ตœ์ ํ™”๋Š” ๋ช…๋ น์–ด ์ˆœ์„œ๋ฅผ ๋ฐ”๊พธ๊ณ , ๋ณ€์ˆ˜๋ฅผ ์ œ๊ฑฐํ•˜๊ณ , ํ•จ์ˆ˜๋ฅผ ์ธ๋ผ์ธํ™”ํ•˜์—ฌ ๋””๋ฒ„๊น…์„ ๊ฑฐ์˜ ๋ถˆ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ๋นŒ๋“œํ•˜๊ธฐ

๋””๋ฒ„๊น…์šฉ Mojo ํ”„๋กœ๊ทธ๋žจ์„ ๋นŒ๋“œํ•  ๋•Œ๋Š” ํ•ญ์ƒ ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ์„ ํฌํ•จํ•˜์„ธ์š”:

# ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ๋นŒ๋“œ
mojo build -O0 -g your_program.mojo -o your_program_debug

์ด ํ”Œ๋ž˜๊ทธ๋“ค์ด ํ•˜๋Š” ์ผ:

  • -O0: ๋ชจ๋“  ์ตœ์ ํ™”๋ฅผ ๋น„ํ™œ์„ฑํ™”ํ•˜์—ฌ ์›๋ž˜ ์ฝ”๋“œ ๊ตฌ์กฐ๋ฅผ ๋ณด์กด
  • -g: ๋””๋ฒ„๊ฑฐ๊ฐ€ ๋จธ์‹  ์ฝ”๋“œ๋ฅผ Mojo ์†Œ์Šค์— ๋งคํ•‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ํฌํ•จ
  • -o: ์‰ฌ์šด ์‹๋ณ„์„ ์œ„ํ•ด ๋ช…๋ช…๋œ ์ถœ๋ ฅ ํŒŒ์ผ ์ƒ์„ฑ

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ 

๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ์—†์ด๋Š” ๋””๋ฒ„๊น… ์„ธ์…˜์ด ์ด๋ ‡๊ฒŒ ๋ณด์ž…๋‹ˆ๋‹ค:

(lldb) print my_variable
error: use of undeclared identifier 'my_variable'

๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ์ด ์žˆ์œผ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ฉ๋‹ˆ๋‹ค:

(lldb) print my_variable
(int) $0 = 42

Step 2: ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ์„ ํƒ

์—ฌ๊ธฐ์„œ GPU ๋””๋ฒ„๊น…์ด ํฅ๋ฏธ๋กœ์›Œ์ง‘๋‹ˆ๋‹ค. ๋„ค ๊ฐ€์ง€ ๋‹ค๋ฅธ ์กฐํ•ฉ ์ค‘์—์„œ ์„ ํƒํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ ์ ˆํ•œ ๊ฒƒ์„ ๊ณ ๋ฅด๋ฉด ์‹œ๊ฐ„์„ ์ ˆ์•ฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์กฐํ•ฉ

๋น ๋ฅธ ์ฐธ์กฐ:

# 1. JIT + LLDB: ์†Œ์Šค์—์„œ ์ง์ ‘ CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ ๋””๋ฒ„๊น…
pixi run mojo debug your_gpu_program.mojo

# 2. JIT + CUDA-GDB: ์†Œ์Šค์—์„œ ์ง์ ‘ GPU ์ปค๋„ ๋””๋ฒ„๊น…
pixi run mojo debug --cuda-gdb --break-on-launch your_gpu_program.mojo

# 3. ๋ฐ”์ด๋„ˆ๋ฆฌ + LLDB: ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ ๋””๋ฒ„๊น…
pixi run mojo build -O0 -g your_gpu_program.mojo -o your_program_debug
pixi run mojo debug your_program_debug

# 4. ๋ฐ”์ด๋„ˆ๋ฆฌ + CUDA-GDB: ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ GPU ์ปค๋„ ๋””๋ฒ„๊น…
pixi run mojo debug --cuda-gdb --break-on-launch your_program_debug

๊ฐ ์ ‘๊ทผ๋ฒ•์„ ์–ธ์ œ ์‚ฌ์šฉํ• ๊นŒ

ํ•™์Šต๊ณผ ๋น ๋ฅธ ์‹คํ—˜์šฉ:

  • JIT ๋””๋ฒ„๊น… ์‚ฌ์šฉ - ๋นŒ๋“œ ๋‹จ๊ณ„๊ฐ€ ํ•„์š” ์—†์–ด ๋” ๋น ๋ฅด๊ฒŒ ๋ฐ˜๋ณต ๊ฐ€๋Šฅ

๋ณธ๊ฒฉ์ ์ธ ๋””๋ฒ„๊น… ์„ธ์…˜์šฉ:

  • ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น… ์‚ฌ์šฉ - ๋” ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ณ  ๊น”๋”ํ•œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ

CPU ์ธก ๋ฌธ์ œ์šฉ (๋ฒ„ํผ ํ• ๋‹น, ํ˜ธ์ŠคํŠธ ๋ฉ”๋ชจ๋ฆฌ, ํ”„๋กœ๊ทธ๋žจ ๋กœ์ง):

  • LLDB ๋ชจ๋“œ ์‚ฌ์šฉ - main() ํ•จ์ˆ˜์™€ ์„ค์ • ์ฝ”๋“œ ๋””๋ฒ„๊น…์— ์ ํ•ฉ

GPU ์ปค๋„ ๋ฌธ์ œ์šฉ (์Šค๋ ˆ๋“œ ๋™์ž‘, GPU ๋ฉ”๋ชจ๋ฆฌ, ์ปค๋„ ํฌ๋ž˜์‹œ):

  • CUDA-GDB ๋ชจ๋“œ ์‚ฌ์šฉ - ๊ฐœ๋ณ„ GPU ์Šค๋ ˆ๋“œ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ์œ ์ผํ•œ ๋ฐฉ๋ฒ•

์žฅ์ ์€ ๋‹ค์–‘ํ•˜๊ฒŒ ์กฐํ•ฉํ•ด์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. JIT + LLDB๋กœ ์„ค์ • ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•œ ๋‹ค์Œ, JIT + CUDA-GDB๋กœ ์ „ํ™˜ํ•ด์„œ ์‹ค์ œ ์ปค๋„์„ ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.


CUDA-GDB๋กœ GPU ์ปค๋„ ๋””๋ฒ„๊น… ์ดํ•ดํ•˜๊ธฐ

์ด์ œ GPU ์ปค๋„ ๋””๋ฒ„๊น…์ž…๋‹ˆ๋‹ค - ๋””๋ฒ„๊น… ๋„๊ตฌ ๋ชจ์Œ์—์„œ ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•˜๋ฉด์„œ๋„ ๋ณต์žกํ•œ ๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค.

--cuda-gdb๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด Mojo๋Š” NVIDIA์˜ CUDA-GDB ๋””๋ฒ„๊ฑฐ์™€ ํ†ตํ•ฉ๋ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋‹จ์ˆœํ•œ ๋””๋ฒ„๊ฑฐ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค - GPU ์ปดํ“จํŒ…์˜ ๋ณ‘๋ ฌ ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋“œ ์„ธ๊ณ„๋ฅผ ์œ„ํ•ด ํŠน๋ณ„ํžˆ ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

CUDA-GDB๊ฐ€ ํŠน๋ณ„ํ•œ ์ด์œ 

์ผ๋ฐ˜ GDB๋Š” ํ•œ ๋ฒˆ์— ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋ฉฐ ์ˆœ์ฐจ ์ฝ”๋“œ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค. CUDA-GDB๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ GPU ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ๋””๋ฒ„๊น…ํ•˜๋ฉฐ, ๊ฐ๊ฐ์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๋Š” ๋‹ค์Œ์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค:

  • GPU ์ปค๋„ ๋‚ด๋ถ€์— ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ • - ์–ด๋–ค ์Šค๋ ˆ๋“œ๋“  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์— ๋„๋‹ฌํ•˜๋ฉด ์‹คํ–‰์„ ์ผ์‹œ ์ •์ง€
  • GPU ์Šค๋ ˆ๋“œ ๊ฐ„ ์ „ํ™˜ - ๊ฐ™์€ ์ˆœ๊ฐ„์— ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฌด์—‡์„ ํ•˜๋Š”์ง€ ๊ฒ€์‚ฌ
  • ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ - ๊ฐ™์€ ๋ณ€์ˆ˜๊ฐ€ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง€๋Š” ๊ฒƒ์„ ํ™•์ธ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋””๋ฒ„๊น… - ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ, ๊ฒฝ์Ÿ ์ƒํƒœ, ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ ํฌ์ฐฉ (์ด๋Ÿฐ ๋ฌธ์ œ ๊ฐ์ง€์— ๋Œ€ํ•ด์„œ๋Š” Puzzle 10์—์„œ ๋” ์ž์„ธํžˆ)
  • ๋ณ‘๋ ฌ ์‹คํ–‰ ๋ถ„์„ - ์Šค๋ ˆ๋“œ๋“ค์ด ์–ด๋–ป๊ฒŒ ์ƒํ˜ธ์ž‘์šฉํ•˜๊ณ  ๋™๊ธฐํ™”ํ•˜๋Š”์ง€ ์ดํ•ด

์ด์ „ ํผ์ฆ์˜ ๊ฐœ๋…๊ณผ ์—ฐ๊ฒฐ

Puzzle 1-8์—์„œ ๋ฐฐ์šด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์„ ๊ธฐ์–ตํ•˜์‹œ๋‚˜์š”? CUDA-GDB๋กœ ๋Ÿฐํƒ€์ž„์— ๋ชจ๋“  ๊ฒƒ์„ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ ๋””๋ฒ„๊น…

Puzzle 1-8์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค:

# Puzzle 1์—์„œ: ๊ธฐ๋ณธ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
i = thread_idx.x  # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ ์œ ํ•œ ์ธ๋ฑ์Šค๋ฅผ ์–ป์Œ

# Puzzle 7์—์„œ: 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ
row = thread_idx.y  # 2D ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ
col = thread_idx.x

CUDA-GDB๋กœ ์ด ์Šค๋ ˆ๋“œ ์ขŒํ‘œ๋“ค์ด ์‹ค์ œ๋กœ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

(cuda-gdb) info cuda threads

์ถœ๋ ฅ:

  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffcf26fed0 /home/ubuntu/workspace/mojo-gpu-puzzles/solutions/p01/p01.mojo    13

๊ทธ๋ฆฌ๊ณ  ํŠน์ • ์Šค๋ ˆ๋“œ๋กœ ์ด๋™ํ•ด์„œ ๋ฌด์—‡์„ ํ•˜๋Š”์ง€ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

(cuda-gdb) cuda thread (1,0,0)

์ถœ๋ ฅ:

[Switching to CUDA thread (1,0,0)]

์ •๋ง ๊ฐ•๋ ฅํ•œ ๊ธฐ๋Šฅ์ž…๋‹ˆ๋‹ค - ๋ง ๊ทธ๋Œ€๋กœ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ์—์„œ ์‹คํ–‰๋˜๋Š” ๊ฒƒ์„ ์ง์ ‘ ์ง€์ผœ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„ ๋””๋ฒ„๊น…

๋‹ค์–‘ํ•œ ์œ ํ˜•์˜ GPU ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•ด ๋ฐฐ์šด Puzzle 8์„ ๊ธฐ์–ตํ•˜์‹œ๋‚˜์š”? CUDA-GDB๋กœ ๋ชจ๋“  ๊ฒƒ์„ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ (Puzzle 1-5์˜ ๋ฐฐ์—ด๋“ค)
(cuda-gdb) print input_array[0]@4
$1 = {{1}, {2}, {3}, {4}}   # Mojo ์Šค์นผ๋ผ ํ˜•์‹

# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ (thread_idx.x๋Š” ์ž‘๋™ํ•˜์ง€ ์•Š์Œ)
(cuda-gdb) print shared_data[i]   # thread_idx.x ๋Œ€์‹  ๋กœ์ปฌ ๋ณ€์ˆ˜ 'i' ์‚ฌ์šฉ
$2 = {42}

๋””๋ฒ„๊ฑฐ๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ •ํ™•ํžˆ ๋ฌด์—‡์„ ๋ณด๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ฒ„๊ทธ๋ฅผ ์žก๊ธฐ์— ์™„๋ฒฝํ•ฉ๋‹ˆ๋‹ค.

์ „๋žต์  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ๋ฐฐ์น˜

CUDA-GDB ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋Š” ๋ณ‘๋ ฌ ์‹คํ–‰๊ณผ ํ•จ๊ป˜ ์ž‘๋™ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ผ๋ฐ˜ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋ณด๋‹ค ํ›จ์”ฌ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค:

# ์–ด๋–ค ์Šค๋ ˆ๋“œ๋“  ์ปค๋„์— ์ง„์ž…ํ•  ๋•Œ ์ค‘๋‹จ
(cuda-gdb) break add_kernel

# ํŠน์ • ์Šค๋ ˆ๋“œ์— ๋Œ€ํ•ด์„œ๋งŒ ์ค‘๋‹จ (๋ฌธ์ œ ๊ฒฉ๋ฆฌ์— ์ข‹์Œ)
(cuda-gdb) break add_kernel if thread_idx.x == 0

# ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์œ„๋ฐ˜ ์‹œ ์ค‘๋‹จ
(cuda-gdb) watch input_array[thread_idx.x]

# ํŠน์ • ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์—์„œ ์ค‘๋‹จ
(cuda-gdb) break add_kernel if input_array[thread_idx.x] > 100.0

์ด๋ฅผ ํ†ตํ•ด ์ˆ˜์ฒœ ๊ฐœ ์Šค๋ ˆ๋“œ์˜ ์ถœ๋ ฅ์— ํŒŒ๋ฌปํžˆ์ง€ ์•Š๊ณ  ์ •ํ™•ํžˆ ๊ด€์‹ฌ ์žˆ๋Š” ์Šค๋ ˆ๋“œ์™€ ์กฐ๊ฑด์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
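์กฐ๊ฑด๋ถ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๊ฐ€ ์–ด๋–ค ์Šค๋ ˆ๋“œ๋ฅผ ๋ฉˆ์ถฐ ์„ธ์šฐ๋Š”์ง€๋ฅผ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์ž…๋ ฅ ๊ฐ’๋“ค์€ ์„ค๋ช…์šฉ์œผ๋กœ ์ง€์–ด๋‚ธ ๊ฐ€์ƒ์˜ ๋ฐ์ดํ„ฐ์ž…๋‹ˆ๋‹ค:

```python
# ์Šค๋ ˆ๋“œ tid๋งˆ๋‹ค input_array[tid]๋ฅผ ์ฒ˜๋ฆฌํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ ํ‰๋‚ด
input_array = [5.0, 250.0, 42.0, 101.0]

# break add_kernel if thread_idx.x == 0 ์— ํ•ด๋‹น:
stopped = [tid for tid in range(len(input_array)) if tid == 0]
print(stopped)  # [0]

# break add_kernel if input_array[thread_idx.x] > 100.0 ์— ํ•ด๋‹น:
stopped = [tid for tid, v in enumerate(input_array) if v > 100.0]
print(stopped)  # [1, 3]
```

์กฐ๊ฑด์ด ์ฐธ์ธ ์Šค๋ ˆ๋“œ๋งŒ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์— ๊ฑธ๋ฆฌ๋ฏ€๋กœ, ์ˆ˜์ฒœ ๊ฐœ ์Šค๋ ˆ๋“œ ์ค‘ ๋ฌธ์ œ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ์ง€๋Š” ์†Œ์ˆ˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฉ๋ฆฌํ•ด ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.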


ํ™˜๊ฒฝ ์ค€๋น„ํ•˜๊ธฐ

๋””๋ฒ„๊น…์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๊ฐœ๋ฐœ ํ™˜๊ฒฝ์ด ์ œ๋Œ€๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”. ์ด์ „ ํผ์ฆ๋“ค์„ ์ง„ํ–‰ํ•ด์™”๋‹ค๋ฉด ๋Œ€๋ถ€๋ถ„ ์ด๋ฏธ ์„ค์ •๋˜์–ด ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค!

์ฐธ๊ณ : pixi ์—†์ด๋Š” NVIDIA ๊ณต์‹ ๋ฆฌ์†Œ์Šค์—์„œ CUDA Toolkit์„ ์ˆ˜๋™์œผ๋กœ ์„ค์น˜ํ•˜๊ณ , ๋“œ๋ผ์ด๋ฒ„ ํ˜ธํ™˜์„ฑ์„ ๊ด€๋ฆฌํ•˜๊ณ , ํ™˜๊ฒฝ ๋ณ€์ˆ˜๋ฅผ ๊ตฌ์„ฑํ•˜๊ณ , ์ปดํฌ๋„ŒํŠธ ๊ฐ„ ๋ฒ„์ „ ์ถฉ๋Œ์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. pixi๋Š” ๋ชจ๋“  CUDA ์˜์กด์„ฑ, ๋ฒ„์ „, ํ™˜๊ฒฝ ๊ตฌ์„ฑ์„ ์ž๋™์œผ๋กœ ๊ด€๋ฆฌํ•˜์—ฌ ์ด ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

pixi๊ฐ€ ๋””๋ฒ„๊น…์— ์ค‘์š”ํ•œ ์ด์œ 

๋ฌธ์ œ์ : GPU ๋””๋ฒ„๊น…์€ CUDA ํˆดํ‚ท, GPU ๋“œ๋ผ์ด๋ฒ„, Mojo ์ปดํŒŒ์ผ๋Ÿฌ, ๋””๋ฒ„๊ฑฐ ์ปดํฌ๋„ŒํŠธ ๊ฐ„์˜ ์ •๋ฐ€ํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๋ฒ„์ „ ๋ถˆ์ผ์น˜๋Š” โ€œ๋””๋ฒ„๊ฑฐ๋ฅผ ์ฐพ์„ ์ˆ˜ ์—†์Œโ€ ์˜ค๋ฅ˜๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ด๊ฒฐ์ฑ…: pixi๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด ๋ชจ๋“  ์ปดํฌ๋„ŒํŠธ๊ฐ€ ์กฐํ™”๋กญ๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. pixi run mojo debug --cuda-gdb๋ฅผ ์‹คํ–‰ํ•˜๋ฉด pixi๊ฐ€ ์ž๋™์œผ๋กœ:

  • CUDA ํˆดํ‚ท ๊ฒฝ๋กœ ์„ค์ •
  • ์˜ฌ๋ฐ”๋ฅธ GPU ๋“œ๋ผ์ด๋ฒ„ ๋กœ๋“œ
  • Mojo ๋””๋ฒ„๊น… ํ”Œ๋Ÿฌ๊ทธ์ธ ๊ตฌ์„ฑ
  • ํ™˜๊ฒฝ ๋ณ€์ˆ˜๋ฅผ ์ผ๊ด€๋˜๊ฒŒ ๊ด€๋ฆฌ

์„ค์ • ํ™•์ธ

๋ชจ๋“  ๊ฒƒ์ด ์ž‘๋™ํ•˜๋Š”์ง€ ํ™•์ธํ•ด ๋ด…์‹œ๋‹ค:

# 1. GPU ํ•˜๋“œ์›จ์–ด ์ ‘๊ทผ ๊ฐ€๋Šฅ ์—ฌ๋ถ€ ํ™•์ธ
pixi run nvidia-smi
# GPU์™€ ๋“œ๋ผ์ด๋ฒ„ ๋ฒ„์ „์ด ํ‘œ์‹œ๋˜์–ด์•ผ ํ•จ

# 2. CUDA-GDB ํ†ตํ•ฉ ์„ค์ • (GPU ๋””๋ฒ„๊น…์— ํ•„์š”)
pixi run setup-cuda-gdb
# ์‹œ์Šคํ…œ CUDA-GDB ๋ฐ”์ด๋„ˆ๋ฆฌ๋ฅผ conda ํ™˜๊ฒฝ์— ๋งํฌ

# 3. Mojo ๋””๋ฒ„๊ฑฐ ์‚ฌ์šฉ ๊ฐ€๋Šฅ ์—ฌ๋ถ€ ํ™•์ธ
pixi run mojo debug --help
# --cuda-gdb๋ฅผ ํฌํ•จํ•œ ๋””๋ฒ„๊น… ์˜ต์…˜์ด ํ‘œ์‹œ๋˜์–ด์•ผ ํ•จ

# 4. CUDA-GDB ํ†ตํ•ฉ ํ…Œ์ŠคํŠธ
pixi run cuda-gdb --version
# NVIDIA CUDA-GDB ๋ฒ„์ „ ์ •๋ณด๊ฐ€ ํ‘œ์‹œ๋˜์–ด์•ผ ํ•จ

์ด ๋ช…๋ น์–ด ์ค‘ ํ•˜๋‚˜๋ผ๋„ ์‹คํŒจํ•˜๋ฉด pixi.toml ๊ตฌ์„ฑ์„ ๋‹ค์‹œ ํ™•์ธํ•˜๊ณ  CUDA ํˆดํ‚ท ๊ธฐ๋Šฅ์ด ํ™œ์„ฑํ™”๋˜์–ด ์žˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”.

์ค‘์š”: conda์˜ cuda-gdb ํŒจํ‚ค์ง€๋Š” ๋ž˜ํผ ์Šคํฌ๋ฆฝํŠธ๋งŒ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์— pixi run setup-cuda-gdb ๋ช…๋ น์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ช…๋ น์€ ์‹œ์Šคํ…œ CUDA ์„ค์น˜์—์„œ ์‹ค์ œ CUDA-GDB ๋ฐ”์ด๋„ˆ๋ฆฌ๋ฅผ ์ž๋™ ๊ฐ์ง€ํ•˜๊ณ  conda ํ™˜๊ฒฝ์— ๋งํฌํ•˜์—ฌ ์ „์ฒด GPU ๋””๋ฒ„๊น… ๊ธฐ๋Šฅ์„ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค.

์ด ๋ช…๋ น์ด ํ•˜๋Š” ์ผ:

์Šคํฌ๋ฆฝํŠธ๋Š” ์—ฌ๋Ÿฌ ์ผ๋ฐ˜์ ์ธ ์œ„์น˜์—์„œ CUDA๋ฅผ ์ž๋™ ๊ฐ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • $CUDA_HOME ํ™˜๊ฒฝ ๋ณ€์ˆ˜
  • /usr/local/cuda (Ubuntu/Debian ๊ธฐ๋ณธ๊ฐ’)
  • /opt/cuda (ArchLinux ๋ฐ ๊ธฐํƒ€ ๋ฐฐํฌํŒ)
  • ์‹œ์Šคํ…œ PATH (which cuda-gdb ํ†ตํ•ด)

๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ์€ scripts/setup-cuda-gdb.sh๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.

WSL ์‚ฌ์šฉ์ž๋ฅผ ์œ„ํ•œ ํŠน๋ณ„ ์ฐธ๊ณ ์‚ฌํ•ญ: Part II์—์„œ ์‚ฌ์šฉํ•  ๋‘ ๊ฐ€์ง€ ๋””๋ฒ„๊ทธ ๋„๊ตฌ(cuda-gdb์™€ compute-sanitizer)๋Š” WSL์—์„œ CUDA ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ๋””๋ฒ„๊น…์„ ์ง€์›ํ•˜์ง€๋งŒ, ๋ ˆ์ง€์ŠคํŠธ๋ฆฌ ํ‚ค HKEY_LOCAL_MACHINE\SOFTWARE\NVIDIA Corporation\GPUDebugger\EnableInterface๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  (DWORD) 1๋กœ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ํ”Œ๋žซํผ๊ณผ OS๋ณ„ ๋™์ž‘์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ cuda-gdb์™€ compute-sanitizer๋ฅผ ์ฐธ์กฐํ•˜์„ธ์š”.


์‹ค์Šต ํŠœํ† ๋ฆฌ์–ผ: ์ฒซ GPU ๋””๋ฒ„๊น… ์„ธ์…˜

์ด๋ก ๋„ ์ข‹์ง€๋งŒ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋Š” ๊ฒƒ๋งŒ ํ•œ ๊ฒŒ ์—†์Šต๋‹ˆ๋‹ค. Puzzle 01 - ์—ฌ๋Ÿฌ๋ถ„์ด ์ž˜ ์•„๋Š” ๊ฐ„๋‹จํ•œ โ€œ๋ฐฐ์—ด ๊ฐ ์š”์†Œ์— 10 ๋”ํ•˜๊ธฐโ€ ์ปค๋„์„ ์‚ฌ์šฉํ•ด์„œ ์‹ค์ œ ํ”„๋กœ๊ทธ๋žจ์„ ๋””๋ฒ„๊น…ํ•ด ๋ด…์‹œ๋‹ค.

์™œ Puzzle 01์ธ๊ฐ€? ๋‹ค์Œ ์ด์œ ๋กœ ์™„๋ฒฝํ•œ ๋””๋ฒ„๊น… ํŠœํ† ๋ฆฌ์–ผ์ž…๋‹ˆ๋‹ค:

  • ์ถฉ๋ถ„ํžˆ ๋‹จ์ˆœํ•ด์„œ ๋ฌด์—‡์ด ์ผ์–ด๋‚˜์•ผ ํ•˜๋Š”์ง€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Œ
  • ์‹ค์ œ ์ปค๋„ ์‹คํ–‰์ด ์žˆ๋Š” ์ง„์งœ GPU ์ฝ”๋“œ
  • CPU ์„ค์ • ์ฝ”๋“œ์™€ GPU ์ปค๋„ ์ฝ”๋“œ ๋ชจ๋‘ ํฌํ•จ
  • ์งง์€ ์‹คํ–‰ ์‹œ๊ฐ„์œผ๋กœ ๋น ๋ฅธ ๋ฐ˜๋ณต ๊ฐ€๋Šฅ

์ด ํŠœํ† ๋ฆฌ์–ผ์ด ๋๋‚˜๋ฉด ๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ๋ชจ๋‘๋กœ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ์„ ๋””๋ฒ„๊น…ํ•˜๊ณ , ์‹ค์ œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ์„ ๋ณด๊ณ , ๋งค์ผ ์‚ฌ์šฉํ•  ํ•„์ˆ˜ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋ฅผ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ• ํ•™์Šต ๊ฒฝ๋กœ

Puzzle 01์„ ์˜ˆ์ œ๋กœ ๋„ค ๊ฐ€์ง€ ๋””๋ฒ„๊น… ์กฐํ•ฉ์„ ํƒ์ƒ‰ํ•ฉ๋‹ˆ๋‹ค. ํ•™์Šต ๊ฒฝ๋กœ: JIT + LLDB(๊ฐ€์žฅ ์‰ฌ์›€)๋กœ ์‹œ์ž‘ํ•ด์„œ CUDA-GDB(๊ฐ€์žฅ ๊ฐ•๋ ฅํ•จ)๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

โš ๏ธ GPU ๋””๋ฒ„๊น… ์‹œ ์ค‘์š”์‚ฌํ•ญ:

  • --break-on-launch ํ”Œ๋ž˜๊ทธ๋Š” CUDA-GDB ์ ‘๊ทผ๋ฒ•์—์„œ ํ•„์ˆ˜
  • ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ (์ ‘๊ทผ๋ฒ• 3 & 4)๋Š” ๋””๋ฒ„๊น…์„ ์œ„ํ•ด i ๊ฐ™์€ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ๋ณด์กด
  • JIT ์ปดํŒŒ์ผ (์ ‘๊ทผ๋ฒ• 1 & 2)์€ ๋Œ€๋ถ€๋ถ„์˜ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์ตœ์ ํ™”๋กœ ์ œ๊ฑฐ
  • ๋ณธ๊ฒฉ์ ์ธ GPU ๋””๋ฒ„๊น…์—๋Š” ์ ‘๊ทผ๋ฒ• 4 (๋ฐ”์ด๋„ˆ๋ฆฌ + CUDA-GDB) ์‚ฌ์šฉ

ํŠœํ† ๋ฆฌ์–ผ Step 1: LLDB๋กœ CPU ๋””๋ฒ„๊น…

๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค๋กœ ์‹œ์ž‘ํ•ฉ์‹œ๋‹ค: ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜๊ฑฐ๋‚˜ ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋™์ž‘์„ ํ•ด์„œ main() ํ•จ์ˆ˜์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ๋ด์•ผ ํ•  ๋•Œ.

๋ฏธ์…˜: Puzzle 01์˜ CPU ์ธก ์„ค์ • ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜์—ฌ Mojo๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ์ปค๋„์„ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ํŒŒ์•…ํ•ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊ฑฐ ์‹คํ–‰

JIT ์ปดํŒŒ์ผ๋กœ LLDB ๋””๋ฒ„๊ฑฐ๋ฅผ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค:

# ํ•œ ๋‹จ๊ณ„๋กœ p01.mojo๋ฅผ ์ปดํŒŒ์ผํ•˜๊ณ  ๋””๋ฒ„๊น…
pixi run mojo debug solutions/p01/p01.mojo

LLDB ํ”„๋กฌํ”„ํŠธ๊ฐ€ ๋ณด์ž…๋‹ˆ๋‹ค: (lldb). ์ด์ œ ๋””๋ฒ„๊ฑฐ ์•ˆ์—์„œ ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰์„ ๊ฒ€์‚ฌํ•  ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค!

์ฒซ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋“ค

Puzzle 01์ด ์‹คํ–‰๋  ๋•Œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ถ”์ ํ•ด ๋ด…์‹œ๋‹ค. ๋ณด์—ฌ๋“œ๋ฆฐ ๋Œ€๋กœ ์ •ํ™•ํžˆ ์ด ๋ช…๋ น์–ด๋“ค์„ ์ž…๋ ฅํ•˜๊ณ  ์ถœ๋ ฅ์„ ๊ด€์ฐฐํ•˜์„ธ์š”:

Step 1: main ํ•จ์ˆ˜์— ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •

(lldb) br set -n main

์ถœ๋ ฅ:

Breakpoint 1: where = mojo`main, address = 0x00000000027d7530

๋””๋ฒ„๊ฑฐ๊ฐ€ main ํ•จ์ˆ˜๋ฅผ ์ฐพ์•˜๊ณ  ๊ฑฐ๊ธฐ์„œ ์‹คํ–‰์„ ์ผ์‹œ ์ •์ง€ํ•ฉ๋‹ˆ๋‹ค.

Step 2: ํ”„๋กœ๊ทธ๋žจ ์‹œ์ž‘

(lldb) run

์ถœ๋ ฅ:

Process 186951 launched: '/home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/default/bin/mojo' (x86_64)
Process 186951 stopped
* thread #1, name = 'mojo', stop reason = breakpoint 1.1
    frame #0: 0x0000555557d2b530 mojo`main
mojo`main:
->  0x555557d2b530 <+0>: pushq  %rbp
    0x555557d2b531 <+1>: movq   %rsp, %rbp
    ...

ํ”„๋กœ๊ทธ๋žจ์ด ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์—์„œ ๋ฉˆ์ท„์Šต๋‹ˆ๋‹ค. ํ˜„์žฌ ์–ด์…ˆ๋ธ”๋ฆฌ ์ฝ”๋“œ๋ฅผ ๋ณด๊ณ  ์žˆ๋Š”๋ฐ ์ด๋Š” ์ •์ƒ์ž…๋‹ˆ๋‹ค - ๋””๋ฒ„๊ฑฐ๊ฐ€ ๊ณ ์ˆ˜์ค€ Mojo ์†Œ์Šค์— ๋„๋‹ฌํ•˜๊ธฐ ์ „์— ์ €์ˆ˜์ค€ ๋จธ์‹  ์ฝ”๋“œ์—์„œ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

Step 3: ์‹œ์ž‘ ๊ณผ์ • ํƒ์ƒ‰

# ๋ช…๋ น์–ด ํ•˜๋‚˜๋ฅผ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ ์‹œ๋„
(lldb) next

์ถœ๋ ฅ:

Process 186951 stopped
* thread #1, name = 'mojo', stop reason = instruction step over
    frame #0: 0x0000555557d2b531 mojo`main + 1
mojo`main:
->  0x555557d2b531 <+1>: movq   %rsp, %rbp
    0x555557d2b534 <+4>: pushq  %r15
    ...

์–ด์…ˆ๋ธ”๋ฆฌ๋ฅผ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•˜๋Š” ๊ฒƒ์€ ์ง€๋ฃจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋” ๊ด€๋ จ ์žˆ๋Š” ๋ถ€๋ถ„์œผ๋กœ ์ง„ํ–‰ํ•ฉ์‹œ๋‹ค.

Step 4: Mojo ์†Œ์Šค ์ฝ”๋“œ์— ๋„๋‹ฌํ•˜๊ธฐ ์œ„ํ•ด ๊ณ„์†

# ์‹œ์ž‘ ์–ด์…ˆ๋ธ”๋ฆฌ๋ฅผ ๊ฑด๋„ˆ๋›ฐ์–ด ์‹ค์ œ ์ฝ”๋“œ๋กœ ์ด๋™
(lldb) continue

์ถœ๋ ฅ:

Process 186951 resuming
Process 186951 stopped and restarted: thread 1 received signal: SIGCHLD
2 locations added to breakpoint 1
Process 186951 stopped
* thread #1, name = 'mojo', stop reason = breakpoint 1.3
    frame #0: 0x00007fff5c01e841 JIT(0x7fff5c075000)`stdlib::builtin::_startup::__mojo_main_prototype(argc=([0] = 1), argv=0x00007fffffffa858) at _startup.mojo:95:4

Mojo์˜ ๋Ÿฐํƒ€์ž„์ด ์ดˆ๊ธฐํ™” ์ค‘์ž…๋‹ˆ๋‹ค. _startup.mojo๋Š” Mojo์˜ ๋‚ด๋ถ€ ์‹œ์ž‘ ์ฝ”๋“œ๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. SIGCHLD ์‹œ๊ทธ๋„์€ ์ •์ƒ์ž…๋‹ˆ๋‹ค - Mojo๊ฐ€ ๋‚ด๋ถ€ ํ”„๋กœ์„ธ์Šค๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

Step 5: ์‹ค์ œ ์ฝ”๋“œ๋กœ ๊ณ„์†

# ํ•œ ๋ฒˆ ๋” continueํ•ด์„œ p01.mojo ์ฝ”๋“œ์— ๋„๋‹ฌ!
(lldb) continue

์ถœ๋ ฅ:

Process 186951 resuming
Process 186951 stopped
* thread #1, name = 'mojo', stop reason = breakpoint 1.2
    frame #0: 0x00007fff5c014040 JIT(0x7fff5c075000)`p01::main(__error__=<unavailable>) at p01.mojo:24:23
   21
   22
   23   def main():
-> 24       with DeviceContext() as ctx:
   25           out = ctx.enqueue_create_buffer[dtype](SIZE)
   26           out.enqueue_fill(0)
   27           a = ctx.enqueue_create_buffer[dtype](SIZE)

์ด์ œ ์‹ค์ œ Mojo ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ฃผ๋ชฉํ•  ์ :

  • p01.mojo ํŒŒ์ผ์˜ 21-27๋ฒˆ ์ค„
  • ํ˜„์žฌ ์ค„ 24: with DeviceContext() as ctx:
  • JIT ์ปดํŒŒ์ผ: JIT(0x7fff5c075000)์€ Mojo๊ฐ€ ์ฝ”๋“œ๋ฅผ ์ฆ‰์„์—์„œ ์ปดํŒŒ์ผํ–ˆ์Œ์„ ๋‚˜ํƒ€๋ƒ„

Step 6: ํ”„๋กœ๊ทธ๋žจ ์™„๋ฃŒ

# ํ”„๋กœ๊ทธ๋žจ์„ ์™„๋ฃŒ๊นŒ์ง€ ์‹คํ–‰
(lldb) continue

์ถœ๋ ฅ:

Process 186951 resuming
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
Process 186951 exited with status = 0 (0x00000000)

๋ฐฐ์šด ๋‚ด์šฉ

๐ŸŽ“ ์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ์ฒซ GPU ํ”„๋กœ๊ทธ๋žจ ๋””๋ฒ„๊น… ์„ธ์…˜์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ฌด์Šจ ์ผ์ด ์žˆ์—ˆ๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๊ฑฐ์ณ์˜จ ๋””๋ฒ„๊น… ์—ฌ์ •:

  1. ์–ด์…ˆ๋ธ”๋ฆฌ๋กœ ์‹œ์ž‘ - ์ €์ˆ˜์ค€ ๋””๋ฒ„๊น…์—์„œ๋Š” ์ •์ƒ์ ์ธ ํ˜„์ƒ์ด๋ฉฐ, ๋””๋ฒ„๊ฑฐ๊ฐ€ ๋จธ์‹  ์ˆ˜์ค€์—์„œ ์–ด๋–ป๊ฒŒ ์ž‘๋™ํ•˜๋Š”์ง€ ๋ณด์—ฌ์คŒ
  2. Mojo ์‹œ์ž‘ ๊ณผ์ • ํƒ์ƒ‰ - Mojo์— ๋‚ด๋ถ€ ์ดˆ๊ธฐํ™” ์ฝ”๋“œ๊ฐ€ ์žˆ์Œ์„ ํ•™์Šต
  3. ์†Œ์Šค ์ฝ”๋“œ ๋„๋‹ฌ - ๊ตฌ๋ฌธ ๊ฐ•์กฐ๊ฐ€ ๋œ ์‹ค์ œ p01.mojo 21-27๋ฒˆ ์ค„ ํ™•์ธ
  4. JIT ์ปดํŒŒ์ผ ๊ด€์ฐฐ - Mojo๊ฐ€ ์ฝ”๋“œ๋ฅผ ์ฆ‰์„์—์„œ ์ปดํŒŒ์ผํ•˜๋Š” ๊ฒƒ์„ ๊ด€์ฐฐ
  5. ์„ฑ๊ณต์ ์ธ ์‹คํ–‰ ํ™•์ธ - ํ”„๋กœ๊ทธ๋žจ์ด ์˜ˆ์ƒ๋œ ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•จ์„ ํ™•์ธ

LLDB ๋””๋ฒ„๊น…์ด ์ œ๊ณตํ•˜๋Š” ๊ฒƒ:

  • โœ… CPU ์ธก ๊ฐ€์‹œ์„ฑ: main() ํ•จ์ˆ˜, ๋ฒ„ํผ ํ• ๋‹น, ๋ฉ”๋ชจ๋ฆฌ ์„ค์ • ํ™•์ธ
  • โœ… ์†Œ์Šค ์ฝ”๋“œ ๊ฒ€์‚ฌ: ์ค„ ๋ฒˆํ˜ธ๊ฐ€ ์žˆ๋Š” ์‹ค์ œ Mojo ์ฝ”๋“œ ๋ณด๊ธฐ
  • โœ… ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ: ํ˜ธ์ŠคํŠธ ์ธก ๋ณ€์ˆ˜(CPU ๋ฉ”๋ชจ๋ฆฌ) ๊ฐ’ ํ™•์ธ
  • โœ… ํ”„๋กœ๊ทธ๋žจ ํ๋ฆ„ ์ œ์–ด: ์„ค์ • ๋กœ์ง์„ ์ค„ ๋‹จ์œ„๋กœ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰
  • โœ… ์˜ค๋ฅ˜ ์กฐ์‚ฌ: ์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋“ฑ์˜ ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…

LLDB๊ฐ€ ํ•  ์ˆ˜ ์—†๋Š” ๊ฒƒ:

  • โŒ GPU ์ปค๋„ ๊ฒ€์‚ฌ: add_10 ํ•จ์ˆ˜ ์‹คํ–‰ ๋‚ด๋ถ€๋กœ ์ง„์ž… ๋ถˆ๊ฐ€๋Šฅ
  • โŒ ์Šค๋ ˆ๋“œ ์ˆ˜์ค€ ๋””๋ฒ„๊น…: ๊ฐœ๋ณ„ GPU ์Šค๋ ˆ๋“œ ๋™์ž‘ ํ™•์ธ ๋ถˆ๊ฐ€
  • โŒ GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณด๋Š” ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ ๋ถˆ๊ฐ€
  • โŒ ๋ณ‘๋ ฌ ์‹คํ–‰ ๋ถ„์„: ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ๋™๊ธฐํ™” ๋””๋ฒ„๊น… ๋ถˆ๊ฐ€

LLDB ๋””๋ฒ„๊น…์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • GPU ์ฝ”๋“œ๊ฐ€ ์‹คํ–‰๋˜๊ธฐ ์ „์— ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•  ๋•Œ
  • ๋ฒ„ํผ ํ• ๋‹น์ด๋‚˜ ๋ฉ”๋ชจ๋ฆฌ ์„ค์ • ๋ฌธ์ œ
  • ํ”„๋กœ๊ทธ๋žจ ์ดˆ๊ธฐํ™”์™€ ํ๋ฆ„ ์ดํ•ด
  • Mojo ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด ์–ด๋–ป๊ฒŒ ์‹œ์ž‘๋˜๋Š”์ง€ ํ•™์Šต
  • ๋น ๋ฅธ ํ”„๋กœํ† ํƒ€์ดํ•‘๊ณผ ์ฝ”๋“œ ๋ณ€๊ฒฝ ์‹คํ—˜

ํ•ต์‹ฌ ํ†ต์ฐฐ: LLDB๋Š” ํ˜ธ์ŠคํŠธ ์ธก ๋””๋ฒ„๊น…์— ์™„๋ฒฝํ•ฉ๋‹ˆ๋‹ค - GPU ์‹คํ–‰ ์ „ํ›„์— CPU์—์„œ ์ผ์–ด๋‚˜๋Š” ๋ชจ๋“  ๊ฒƒ. ์‹ค์ œ GPU ์ปค๋„ ๋””๋ฒ„๊น…์—๋Š” ๋‹ค์Œ ์ ‘๊ทผ๋ฒ•์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹คโ€ฆ

ํŠœํ† ๋ฆฌ์–ผ Step 2: ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…

JIT ๋””๋ฒ„๊น…์„ ๋ฐฐ์› ์œผ๋‹ˆ ์ด์ œ ํ”„๋กœ๋•์…˜ ํ™˜๊ฒฝ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ์ „๋ฌธ์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ํƒ์ƒ‰ํ•ฉ์‹œ๋‹ค.

์‹œ๋‚˜๋ฆฌ์˜ค: ์—ฌ๋Ÿฌ ํŒŒ์ผ์ด ์žˆ๋Š” ๋ณต์žกํ•œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ๋””๋ฒ„๊น…ํ•˜๊ฑฐ๋‚˜ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ๋””๋ฒ„๊น…ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋จผ์ € ๋ฐ”์ด๋„ˆ๋ฆฌ๋ฅผ ๋นŒ๋“œํ•˜๋ฉด ๋” ๋งŽ์€ ์ œ์–ด์™€ ๋น ๋ฅธ ๋””๋ฒ„๊น… ๋ฐ˜๋ณต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

๋””๋ฒ„๊ทธ ๋ฐ”์ด๋„ˆ๋ฆฌ ๋นŒ๋“œ

Step 1: ๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ์ปดํŒŒ์ผ

# ๋””๋ฒ„๊ทธ ๋นŒ๋“œ ์ƒ์„ฑ (๋ช…ํ™•ํ•œ ๋ช…๋ช…์— ์ฃผ๋ชฉ)
pixi run mojo build -O0 -g solutions/p01/p01.mojo -o solutions/p01/p01_debug

์—ฌ๊ธฐ์„œ ์ผ์–ด๋‚˜๋Š” ์ผ:

  • ๐Ÿ”ง -O0: ์ตœ์ ํ™” ๋น„ํ™œ์„ฑํ™” (์ •ํ™•ํ•œ ๋””๋ฒ„๊น…์— ๋ฐ˜๋“œ์‹œ ํ•„์š”)
  • ๐Ÿ” -g: ๋จธ์‹  ์ฝ”๋“œ๋ฅผ ์†Œ์Šค ์ฝ”๋“œ์— ๋งคํ•‘ํ•˜๋Š” ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ ํฌํ•จ
  • ๐Ÿ“ -o p01_debug: ๋ช…ํ™•ํ•˜๊ฒŒ ์ด๋ฆ„ ์ง€์€ ๋””๋ฒ„๊ทธ ๋ฐ”์ด๋„ˆ๋ฆฌ ์ƒ์„ฑ

Step 2: ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…

# ๋ฏธ๋ฆฌ ๋นŒ๋“œ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…
pixi run mojo debug solutions/p01/p01_debug

๋ฌด์—‡์ด ๋‹ค๋ฅธ๊ฐ€ (๊ทธ๋ฆฌ๊ณ  ๋” ๋‚˜์€๊ฐ€)

์‹œ์ž‘ ๋น„๊ต:

JIT ๋””๋ฒ„๊น…๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น…
ํ•œ ๋‹จ๊ณ„๋กœ ์ปดํŒŒ์ผ + ๋””๋ฒ„๊น…ํ•œ ๋ฒˆ ๋นŒ๋“œ, ์—ฌ๋Ÿฌ ๋ฒˆ ๋””๋ฒ„๊น…
๋А๋ฆฐ ์‹œ์ž‘ (์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ)๋น ๋ฅธ ์‹œ์ž‘
์ปดํŒŒ์ผ ๋ฉ”์‹œ์ง€๊ฐ€ ๋””๋ฒ„๊ทธ ์ถœ๋ ฅ๊ณผ ์„ž์ž„๊น”๋”ํ•œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ
๋””๋ฒ„๊น… ์ค‘ ์ƒ์„ฑ๋˜๋Š” ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ๊ณ ์ •๋œ ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ

๊ฐ™์€ LLDB ๋ช…๋ น์–ด(br set -n main, run, continue)๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ฐจ์ด๋ฅผ ๋А๋‚„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋น ๋ฅธ ์‹œ์ž‘ - ์ปดํŒŒ์ผ ์ง€์—ฐ ์—†์Œ
  • ๊น”๋”ํ•œ ์ถœ๋ ฅ - JIT ์ปดํŒŒ์ผ ๋ฉ”์‹œ์ง€ ์—†์Œ
  • ๋” ์˜ˆ์ธก ๊ฐ€๋Šฅ - ๋””๋ฒ„๊ทธ ์‹ฌ๋ณผ์ด ์‹คํ–‰ ๊ฐ„์— ๋ณ€ํ•˜์ง€ ์•Š์Œ
  • ์ „๋ฌธ์ ์ธ ์›Œํฌํ”Œ๋กœ์šฐ - ํ”„๋กœ๋•์…˜ ๋””๋ฒ„๊น…์ด ์ด๋ ‡๊ฒŒ ์ž‘๋™ํ•จ

ํŠœํ† ๋ฆฌ์–ผ Step 3: GPU ์ปค๋„ ๋””๋ฒ„๊น…

์ง€๊ธˆ๊นŒ์ง€๋Š” CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ - ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ์ดˆ๊ธฐํ™”๋ฅผ ๋””๋ฒ„๊น…ํ–ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์ด ์ผ์–ด๋‚˜๋Š” ์‹ค์ œ GPU ์ปค๋„์€ ์–ด๋–จ๊นŒ์š”?

๋ฌธ์ œ์ : add_10 ์ปค๋„์€ ์ž ์žฌ์ ์œผ๋กœ ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋˜๋Š” GPU์—์„œ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. LLDB๋Š” GPU์˜ ๋ณ‘๋ ฌ ์‹คํ–‰ ํ™˜๊ฒฝ์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

ํ•ด๊ฒฐ์ฑ…: CUDA-GDB - GPU ์Šค๋ ˆ๋“œ, GPU ๋ฉ”๋ชจ๋ฆฌ, ๋ณ‘๋ ฌ ์‹คํ–‰์„ ์ดํ•ดํ•˜๋Š” ์ „๋ฌธ ๋””๋ฒ„๊ฑฐ์ž…๋‹ˆ๋‹ค.

CUDA-GDB๊ฐ€ ํ•„์š”ํ•œ ์ด์œ 

GPU ๋””๋ฒ„๊น…์ด ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฅธ ์ด์œ ๋ฅผ ์ดํ•ดํ•ฉ์‹œ๋‹ค:

CPU ๋””๋ฒ„๊น… (LLDB):

  • ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋˜๋Š” ๋‹จ์ผ ์Šค๋ ˆ๋“œ
  • ์ถ”์ ํ•  ์ฝœ ์Šคํƒ์ด ํ•˜๋‚˜๋ฟ
  • ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ชจ๋ธ
  • ๋ณ€์ˆ˜๊ฐ€ ๋‹จ์ผ ๊ฐ’์„ ๊ฐ€์ง

GPU ๋””๋ฒ„๊น… (CUDA-GDB):

  • ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰๋˜๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ
  • ์—ฌ๋Ÿฌ ์ฝœ ์Šคํƒ (์Šค๋ ˆ๋“œ๋‹น ํ•˜๋‚˜)
  • ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ (์ „์—ญ, ๊ณต์œ , ๋กœ์ปฌ, ๋ ˆ์ง€์Šคํ„ฐ)
  • ๊ฐ™์€ ๋ณ€์ˆ˜๊ฐ€ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง

์‹ค์ œ ์˜ˆ: add_10 ์ปค๋„์—์„œ thread_idx.x ๋ณ€์ˆ˜๋Š” ๊ฐ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค - ์Šค๋ ˆ๋“œ 0์€ 0์„, ์Šค๋ ˆ๋“œ 1์€ 1์„ ๋ณด๋Š” ์‹์ž…๋‹ˆ๋‹ค. CUDA-GDB๋งŒ์ด ์ด ๋ณ‘๋ ฌ ํ˜„์‹ค์„ ๋ณด์—ฌ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

CUDA-GDB ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

Step 1: GPU ์ปค๋„ ๋””๋ฒ„๊น… ์‹œ์ž‘

์ ‘๊ทผ๋ฒ•์„ ์„ ํƒํ•˜์„ธ์š”:

# ์ด๋ฏธ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธ (ํ•œ ๋ฒˆ์ด๋ฉด ์ถฉ๋ถ„)
pixi run setup-cuda-gdb

# JIT + CUDA-GDB ์‚ฌ์šฉ (์œ„์˜ ์ ‘๊ทผ๋ฒ• 2)
pixi run mojo debug --cuda-gdb --break-on-launch solutions/p01/p01.mojo

ํ•™์Šต๊ณผ ๋น ๋ฅธ ๋ฐ˜๋ณต์— ์ ํ•ฉํ•œ JIT + CUDA-GDB ์ ‘๊ทผ๋ฒ•์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

Step 2: ์‹คํ–‰ํ•˜๊ณ  GPU ์ปค๋„ ์ง„์ž… ์‹œ ์ž๋™ ์ •์ง€

CUDA-GDB ํ”„๋กฌํ”„ํŠธ๋Š” ์ด๋ ‡๊ฒŒ ๋ณด์ž…๋‹ˆ๋‹ค: (cuda-gdb). ํ”„๋กœ๊ทธ๋žจ์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค:

# ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ - GPU ์ปค๋„์ด ์‹คํ–‰๋  ๋•Œ ์ž๋™์œผ๋กœ ์ •์ง€
(cuda-gdb) run

์ถœ๋ ฅ:

Starting program: /home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/default/bin/mojo...
[Thread debugging using libthread_db enabled]
...
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0)]

CUDA thread hit application kernel entry function breakpoint, p01_add_10_UnsafePointer...
   <<<(1,1,1),(4,1,1)>>> (output=0x302000000, a=0x302000200) at p01.mojo:16
16          i = thread_idx.x

์„ฑ๊ณต! GPU ์ปค๋„ ๋‚ด๋ถ€์—์„œ ์ž๋™์œผ๋กœ ์ •์ง€ํ–ˆ์Šต๋‹ˆ๋‹ค! --break-on-launch ํ”Œ๋ž˜๊ทธ๊ฐ€ ์ปค๋„ ์‹คํ–‰์„ ๊ฐ์ง€ํ–ˆ๊ณ  ์ด์ œ i = thread_idx.x๊ฐ€ ์‹คํ–‰๋˜๋Š” 16๋ฒˆ ์ค„์— ์žˆ์Šต๋‹ˆ๋‹ค.

์ค‘์š”: break add_10์ฒ˜๋Ÿผ ์ˆ˜๋™์œผ๋กœ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋ฅผ ์„ค์ •ํ•  ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค

  • ์ปค๋„ ์ง„์ž… ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋Š” ์ž๋™์ž…๋‹ˆ๋‹ค. GPU ์ปค๋„ ํ•จ์ˆ˜๋Š” CUDA-GDB์—์„œ ๋งน๊ธ€๋ง๋œ ์ด๋ฆ„(p01_add_10_UnsafePointer... ๊ฐ™์€)์„ ๊ฐ€์ง€์ง€๋งŒ, ์ด๋ฏธ ์ปค๋„ ์•ˆ์— ์žˆ์œผ๋ฏ€๋กœ ๋ฐ”๋กœ ๋””๋ฒ„๊น…์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Step 3: ๋ณ‘๋ ฌ ์‹คํ–‰ ํƒ์ƒ‰

# ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์—์„œ ์ผ์‹œ ์ •์ง€๋œ ๋ชจ๋“  GPU ์Šค๋ ˆ๋“œ ๋ณด๊ธฐ
(cuda-gdb) info cuda threads

์ถœ๋ ฅ:

  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffd326fb70 /home/ubuntu/workspace/mojo-gpu-puzzles/solutions/p01/p01.mojo    16

์™„๋ฒฝํ•ฉ๋‹ˆ๋‹ค! Puzzle 01์˜ ๋ชจ๋“  4๊ฐœ ๋ณ‘๋ ฌ GPU ์Šค๋ ˆ๋“œ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • *๊ฐ€ ํ˜„์žฌ ์Šค๋ ˆ๋“œ ํ‘œ์‹œ: (0,0,0) - ๋””๋ฒ„๊น… ์ค‘์ธ ์Šค๋ ˆ๋“œ
  • ์Šค๋ ˆ๋“œ ๋ฒ”์œ„: (0,0,0)์—์„œ (3,0,0)๊นŒ์ง€ - ๋ธ”๋ก์˜ ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ
  • Count: 4 - ์ฝ”๋“œ์˜ THREADS_PER_BLOCK = 4์™€ ์ผ์น˜
  • ๊ฐ™์€ ์œ„์น˜: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ p01.mojo์˜ 16๋ฒˆ ์ค„์—์„œ ์ผ์‹œ ์ •์ง€

Step 4: ์ปค๋„์„ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ํ•˜๊ณ  ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

# 'next'๋กœ ์ฝ”๋“œ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ ('step'์€ ๋‚ด๋ถ€๋กœ ๋“ค์–ด๊ฐ)
(cuda-gdb) next

์ถœ๋ ฅ:

p01_add_10_UnsafePointer... at p01.mojo:17
17          output[i] = a[i] + 10.0
# ๋กœ์ปฌ ๋ณ€์ˆ˜๋Š” ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ ์ž‘๋™!
(cuda-gdb) print i

์ถœ๋ ฅ:

$1 = 0                    # ์ด ์Šค๋ ˆ๋“œ์˜ ์ธ๋ฑ์Šค (thread_idx.x ๊ฐ’ ์บก์ฒ˜)
# GPU ๋‚ด์žฅ ๋ณ€์ˆ˜๋Š” ์ž‘๋™ํ•˜์ง€ ์•Š์ง€๋งŒ ํ•„์š” ์—†์Œ
(cuda-gdb) print thread_idx.x

์ถœ๋ ฅ:

No symbol "thread_idx" in current context.
# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
(cuda-gdb) print a[i]     # ์ด ์Šค๋ ˆ๋“œ์˜ ์ž…๋ ฅ: a[0]

์ถœ๋ ฅ:

$2 = {0}                  # ์ž…๋ ฅ ๊ฐ’ (Mojo ์Šค์นผ๋ผ ํ˜•์‹)
(cuda-gdb) print output[i] # ์—ฐ์‚ฐ ์ „ ์ด ์Šค๋ ˆ๋“œ์˜ ์ถœ๋ ฅ

์ถœ๋ ฅ:

$3 = {0}                  # ์•„์ง 0 - ์—ฐ์‚ฐ์ด ์•„์ง ์‹คํ–‰๋˜์ง€ ์•Š์Œ!
# ์—ฐ์‚ฐ ์ค„ ์‹คํ–‰
(cuda-gdb) next

์ถœ๋ ฅ:

13      fn add_10(         # ์—ฐ์‚ฐ ํ›„ ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜ ์ค„๋กœ ์ด๋™
# ์ด์ œ ๊ฒฐ๊ณผ ํ™•์ธ
(cuda-gdb) print output[i]

์ถœ๋ ฅ:

$4 = {10}                 # ์ด์ œ ๊ณ„์‚ฐ๋œ ๊ฒฐ๊ณผ ํ‘œ์‹œ: 0 + 10 = 10
# ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ์—ฌ์ „ํžˆ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
(cuda-gdb) print a

์ถœ๋ ฅ:

$5 = (!pop.scalar<f32> * @register) 0x302000200

Step 5: ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ๊ฐ„ ์ด๋™

# ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋กœ ์ „ํ™˜ํ•ด์„œ ์‹คํ–‰ ํ™•์ธ
(cuda-gdb) cuda thread (1,0,0)

์ถœ๋ ฅ:

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
13      fn add_10(         # ์Šค๋ ˆ๋“œ 1๋„ ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜์— ์žˆ์Œ
# ์Šค๋ ˆ๋“œ์˜ ๋กœ์ปฌ ๋ณ€์ˆ˜ ํ™•์ธ
(cuda-gdb) print i

์ถœ๋ ฅ:

$5 = 1                    # ์Šค๋ ˆ๋“œ 1์˜ ์ธ๋ฑ์Šค (์Šค๋ ˆ๋“œ 0๊ณผ ๋‹ค๋ฆ„!)
# ์ด ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ ๊ฒ€์‚ฌ
(cuda-gdb) print a[i]     # ์ด ์Šค๋ ˆ๋“œ์˜ ์ž…๋ ฅ: a[1]

์ถœ๋ ฅ:

$6 = {1}                  # ์Šค๋ ˆ๋“œ 1์˜ ์ž…๋ ฅ ๊ฐ’
# ์Šค๋ ˆ๋“œ 1์˜ ์—ฐ์‚ฐ์€ ์ด๋ฏธ ์™„๋ฃŒ (๋ณ‘๋ ฌ ์‹คํ–‰!)
(cuda-gdb) print output[i] # ์ด ์Šค๋ ˆ๋“œ์˜ ์ถœ๋ ฅ: output[1]

์ถœ๋ ฅ:

$7 = {11}                 # 1 + 10 = 11 (์ด๋ฏธ ๊ณ„์‚ฐ๋จ)
# ์ตœ๊ณ ์˜ ๊ธฐ๋ฒ•: ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ์— ๋ณด๊ธฐ
(cuda-gdb) print output[0]@4

์ถœ๋ ฅ:

$8 = {{10}, {11}, {12}, {13}}     # ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ช…๋ น์–ด๋กœ!
(cuda-gdb) print a[0]@4

์ถœ๋ ฅ:

$9 = {{0}, {1}, {2}, {3}}         # ๋น„๊ต๋ฅผ ์œ„ํ•œ ๋ชจ๋“  ์ž…๋ ฅ ๊ฐ’
# ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ง„ํ–‰ํ•˜๋ฉด CUDA ์ปจํ…์ŠคํŠธ๋ฅผ ์žƒ์Šต๋‹ˆ๋‹ค
(cuda-gdb) next

์ถœ๋ ฅ:

[Switching to Thread 0x7ffff7e25840 (LWP 306942)]  # ํ˜ธ์ŠคํŠธ ์Šค๋ ˆ๋“œ๋กœ ๋ณต๊ท€
0x00007fffeca3f831 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) print output[i]

์ถœ๋ ฅ:

No symbol "output" in current context.  # GPU ์ปจํ…์ŠคํŠธ๋ฅผ ์žƒ์Œ!

์ด ๋””๋ฒ„๊น… ์„ธ์…˜์˜ ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๐Ÿคฏ ๋ณ‘๋ ฌ ์‹คํ–‰์€ ์ง„์งœ์ž…๋‹ˆ๋‹ค - ์Šค๋ ˆ๋“œ (1,0,0)์œผ๋กœ ์ „ํ™˜ํ•˜๋ฉด ์ด๋ฏธ ์—ฐ์‚ฐ์ด ์™„๋ฃŒ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค!
  • ๊ฐ ์Šค๋ ˆ๋“œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๋ด…๋‹ˆ๋‹ค - i=0 vs i=1, a[i]={0} vs a[i]={1}, output[i]={10} vs output[i]={11}
  • ๋ฐฐ์—ด ๊ฒ€์‚ฌ๊ฐ€ ๊ฐ•๋ ฅํ•ฉ๋‹ˆ๋‹ค - print output[0]@4๋กœ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: {{10}, {11}, {12}, {13}}
  • GPU ์ปจํ…์ŠคํŠธ๋Š” ๊นจ์ง€๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค - ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ง„ํ–‰ํ•˜๋ฉด ํ˜ธ์ŠคํŠธ ์Šค๋ ˆ๋“œ๋กœ ๋Œ์•„๊ฐ€ GPU ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

์ด๊ฒƒ์ด ๋ฐ”๋กœ ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์˜ ๋ณธ์งˆ์ž…๋‹ˆ๋‹ค: ๊ฐ™์€ ์ฝ”๋“œ, ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ, ๋™์‹œ ์‹คํ–‰.

CUDA-GDB๋กœ ๋ฐฐ์šด ๋‚ด์šฉ

๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ๋กœ GPU ์ปค๋„ ์‹คํ–‰ ๋””๋ฒ„๊น…์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ์€ ์‹ค์ œ๋กœ ์ž‘๋™ํ•˜๋Š” ๊ธฐ๋Šฅ๋“ค์ž…๋‹ˆ๋‹ค:

์Šต๋“ํ•œ GPU ๋””๋ฒ„๊น… ๋Šฅ๋ ฅ:

  • โœ… GPU ์ปค๋„ ์ž๋™ ๋””๋ฒ„๊น… - --break-on-launch๊ฐ€ ์ปค๋„ ์ง„์ž… ์‹œ์ ์—์„œ ์ •์ง€ํ•ฉ๋‹ˆ๋‹ค
  • โœ… GPU ์Šค๋ ˆ๋“œ ๊ฐ„ ์ด๋™ - cuda thread๋กœ ์ปจํ…์ŠคํŠธ๋ฅผ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค
  • โœ… ๋กœ์ปฌ ๋ณ€์ˆ˜ ์ ‘๊ทผ - -O0 -g๋กœ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ์—์„œ print i๊ฐ€ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค
  • โœ… ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ i, a[i], output[i] ๊ฐ’์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • โœ… ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๊ฒฐ๊ณผ ๋ณด๊ธฐ - print output[0]@4๋กœ {{10}, {11}, {12}, {13}}์„ ํ•œ ๋ฒˆ์— ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค
  • โœ… GPU ์ฝ”๋“œ ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰ - next๊ฐ€ ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • โœ… ๋ณ‘๋ ฌ ์‹คํ–‰ ํ™•์ธ - ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค (์ „ํ™˜ํ•˜๋ฉด ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋Š” ์ด๋ฏธ ๊ณ„์‚ฐ ์™„๋ฃŒ)
  • โœ… ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ ‘๊ทผ - output๊ณผ a ํฌ์ธํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • โŒ GPU ๋‚ด์žฅ ๋ณ€์ˆ˜ ์‚ฌ์šฉ ๋ถˆ๊ฐ€ - thread_idx.x, blockIdx.x ๋“ฑ์€ ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค (ํ•˜์ง€๋งŒ ๋กœ์ปฌ ๋ณ€์ˆ˜๋Š” ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค!)
  • ๐Ÿ“Š Mojo ์Šค์นผ๋ผ ํ˜•์‹ - ๊ฐ’์ด 10.0 ๋Œ€์‹  {10}์œผ๋กœ ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค
  • โš ๏ธ ๊นจ์ง€๊ธฐ ์‰ฌ์šด GPU ์ปจํ…์ŠคํŠธ - ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ง„ํ–‰ํ•˜๋ฉด GPU ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๋ฏธ๋ฆฌ ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ (mojo build -O0 -g)๋Š” ํ•„์ˆ˜์ž…๋‹ˆ๋‹ค - ๋กœ์ปฌ ๋ณ€์ˆ˜๊ฐ€ ๋ณด์กด๋ฉ๋‹ˆ๋‹ค
  • @N์„ ์‚ฌ์šฉํ•œ ๋ฐฐ์—ด ๊ฒ€์‚ฌ - ๋ชจ๋“  ๋ณ‘๋ ฌ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ์— ๋ณด๋Š” ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค
  • GPU ๋‚ด์žฅ ๋ณ€์ˆ˜๋Š” ์—†์Šต๋‹ˆ๋‹ค - ํ•˜์ง€๋งŒ i ๊ฐ™์€ ๋กœ์ปฌ ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค
  • Mojo๋Š” {value} ํ˜•์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค - ์Šค์นผ๋ผ๊ฐ€ 10.0 ๋Œ€์‹  {10}์œผ๋กœ ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค
  • ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰์— ์ฃผ์˜ํ•˜์„ธ์š” - GPU ์ปจํ…์ŠคํŠธ๋ฅผ ์žƒ๊ณ  ํ˜ธ์ŠคํŠธ ์Šค๋ ˆ๋“œ๋กœ ๋Œ์•„๊ฐ€๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค

์‹ค์ œ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•๋“ค

์ด์ œ ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งˆ์ฃผ์น˜๊ฒŒ ๋  ์‹ค์šฉ์ ์ธ ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ์‚ดํŽด๋ด…์‹œ๋‹ค:

๊ธฐ๋ฒ• 1: ์Šค๋ ˆ๋“œ ๊ฒฝ๊ณ„ ํ™•์ธ

# ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ณ„์‚ฐํ–ˆ๋Š”์ง€ ํ™•์ธ
(cuda-gdb) print output[0]@4

์ถœ๋ ฅ:

$8 = {{10}, {11}, {12}, {13}}    # ๋ชจ๋“  4๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ณ„์‚ฐ
# ์œ ํšจ ๋ฒ”์œ„๋ฅผ ๋„˜์–ด ํ™•์ธํ•˜์—ฌ ๋ฒ”์œ„ ์ดˆ๊ณผ ๋ฌธ์ œ ๊ฐ์ง€
(cuda-gdb) print output[0]@5

์ถœ๋ ฅ:

$9 = {{10}, {11}, {12}, {13}, {0}}  # ์š”์†Œ 4๋Š” ์ดˆ๊ธฐํ™”๋˜์ง€ ์•Š์Œ (์ข‹์Œ!)
# ์ž…๋ ฅ๊ณผ ๋น„๊ตํ•˜์—ฌ ์—ฐ์‚ฐ ๊ฒ€์ฆ
(cuda-gdb) print a[0]@4

์ถœ๋ ฅ:

$10 = {{0}, {1}, {2}, {3}}       # ์ž…๋ ฅ ๊ฐ’: 0+10=10, 1+10=11 ๋“ฑ

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์€ GPU ํฌ๋ž˜์‹œ์˜ ๊ฐ€์žฅ ํ”ํ•œ ์›์ธ์ž…๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๋””๋ฒ„๊น… ๋‹จ๊ณ„๋กœ ์ผ์ฐ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ธฐ๋ฒ• 2: ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ์ดํ•ด

# ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธ”๋ก์œผ๋กœ ์–ด๋–ป๊ฒŒ ๊ตฌ์„ฑ๋˜๋Š”์ง€ ๋ณด๊ธฐ
(cuda-gdb) info cuda blocks

์ถœ๋ ฅ:

  BlockIdx To BlockIdx Count   State
Kernel 0
*  (0,0,0)     (0,0,0)     1 running
# ํ˜„์žฌ ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋ณด๊ธฐ
(cuda-gdb) info cuda threads

์ถœ๋ ฅ์€ ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ํ™œ์„ฑ ์ƒํƒœ์ธ์ง€, ์ •์ง€๋˜์—ˆ๋Š”์ง€, ์˜ค๋ฅ˜๊ฐ€ ์žˆ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๊ตฌ์„ฑ์„ ์ดํ•ดํ•˜๋ฉด ๋™๊ธฐํ™”์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ฒ• 3: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

# GPU ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ ํ™•์ธ:
(cuda-gdb) print a               # ์ž…๋ ฅ ๋ฐฐ์—ด GPU ํฌ์ธํ„ฐ

์ถœ๋ ฅ:

$9 = (!pop.scalar<f32> * @register) 0x302000200
(cuda-gdb) print output          # ์ถœ๋ ฅ ๋ฐฐ์—ด GPU ํฌ์ธํ„ฐ

์ถœ๋ ฅ:

$10 = (!pop.scalar<f32> * @register) 0x302000000
# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ™•์ธ:
(cuda-gdb) print a[i]            # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ 'i'๋ฅผ ์‚ฌ์šฉํ•ด ์ž์‹ ์˜ ์š”์†Œ์— ์ ‘๊ทผ

์ถœ๋ ฅ:

$11 = {0}                        # ์Šค๋ ˆ๋“œ์˜ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์€ ์„ฑ๋Šฅ๊ณผ ์ •ํ™•์„ฑ์— ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค. ์ž˜๋ชป๋œ ํŒจํ„ด์€ ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ํฌ๋ž˜์‹œ๋ฅผ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ฒ• 4: ๊ฒฐ๊ณผ ๊ฒ€์ฆ ๋ฐ ์™„๋ฃŒ

# ์ปค๋„ ์‹คํ–‰์„ ๋‹จ๊ณ„๋ณ„๋กœ ์‹คํ–‰ํ•œ ํ›„ ์ตœ์ข… ๊ฒฐ๊ณผ ํ™•์ธ
(cuda-gdb) print output[0]@4

์ถœ๋ ฅ:

$11 = {{10}, {11}, {12}, {13}}    # 완벽! 각 요소가 10 증가
# ํ”„๋กœ๊ทธ๋žจ์„ ์ •์ƒ์ ์œผ๋กœ ์™„๋ฃŒ
(cuda-gdb) continue

์ถœ๋ ฅ:

...ํ”„๋กœ๊ทธ๋žจ ์ถœ๋ ฅ์ด ์„ฑ๊ณต ํ‘œ์‹œ...
# ๋””๋ฒ„๊ฑฐ ์ข…๋ฃŒ
(cuda-gdb) exit

์„ค์ •๋ถ€ํ„ฐ ๊ฒฐ๊ณผ๊นŒ์ง€ GPU ์ปค๋„ ์‹คํ–‰ ๋””๋ฒ„๊น…์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค.

GPU ๋””๋ฒ„๊น… ์—ฌ์ •: ํ•ต์‹ฌ ํ†ต์ฐฐ

ํฌ๊ด„์ ์ธ GPU ๋””๋ฒ„๊น… ํŠœํ† ๋ฆฌ์–ผ์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ณ‘๋ ฌ ์ปดํ“จํŒ…์— ๋Œ€ํ•ด ๋ฐœ๊ฒฌํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค:

๋ณ‘๋ ฌ ์‹คํ–‰์— ๋Œ€ํ•œ ๊นŠ์€ ํ†ต์ฐฐ

  1. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ์˜ ์‹ค์ œ: thread_idx.x๊ฐ€ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฐ’(0, 1, 2, 3โ€ฆ)์„ ๊ฐ–๋Š” ๊ฒƒ์„ ์ด๋ก ์ด ์•„๋‹Œ ์ง์ ‘ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค

  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํŒŒ์•…: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ a[thread_idx.x]์—์„œ ์ฝ๊ณ  output[thread_idx.x]์— ์“ฐ๋ฉฐ, ์ถฉ๋Œ ์—†์ด ์™„๋ฒฝํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ์„ฑ์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค

  3. ๋ณ‘๋ ฌ ์‹คํ–‰์˜ ์ดํ•ด: ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์ปค๋„ ์ฝ”๋“œ๋ฅผ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋ฉด์„œ ๊ฐ๊ฐ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

  4. GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ๋ฐฐ์—ด์€ ์ „์—ญ GPU ๋ฉ”๋ชจ๋ฆฌ์— ์žˆ์–ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์Šค๋ ˆ๋“œ๋ณ„ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค

๋ชจ๋“  ํผ์ฆ์— ์ ์šฉ๋˜๋Š” ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•

Puzzle 01๋ถ€ํ„ฐ Puzzle 08, ๊ทธ๋ฆฌ๊ณ  ๊ทธ ์ดํ›„๊นŒ์ง€ ๋ณดํŽธ์ ์œผ๋กœ ์ ์šฉ๋˜๋Š” ๊ธฐ๋ฒ•์„ ์Šต๋“ํ–ˆ์Šต๋‹ˆ๋‹ค:

  • CPU ์ธก ๋ฌธ์ œ(์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น)๋Š” LLDB๋กœ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค
  • GPU ์ปค๋„ ๋ฌธ์ œ(์Šค๋ ˆ๋“œ ๋™์ž‘, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ)๋Š” CUDA-GDB๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค
  • ํŠน์ • ์Šค๋ ˆ๋“œ๋‚˜ ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์— ์ง‘์ค‘ํ•˜๋ ค๋ฉด ์กฐ๊ฑด๋ถ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๋ณ‘๋ ฌ ์‹คํ–‰ ํŒจํ„ด์„ ์ดํ•ดํ•˜๋ ค๋ฉด ์Šค๋ ˆ๋“œ ๊ฐ„ ์ด๋™์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฝ์Ÿ ์ƒํƒœ์™€ ๋ฒ”์œ„ ์ดˆ๊ณผ ์˜ค๋ฅ˜๋ฅผ ์žก์œผ๋ ค๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค

ํ™•์žฅ์„ฑ: ์ด ๊ธฐ๋ฒ•๋“ค์€ ๋‹ค์Œ ๋ชจ๋“  ์ƒํ™ฉ์—์„œ ๋™์ผํ•˜๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค:

  • Puzzle 01: ๊ฐ„๋‹จํ•œ ๋ง์…ˆ์„ ํ•˜๋Š” 4๊ฐœ ์š”์†Œ ๋ฐฐ์—ด
  • Puzzle 08: ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•œ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
  • ํ”„๋กœ๋•์…˜ ์ฝ”๋“œ: ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฑ๋งŒ ๊ฐœ ์š”์†Œ ๋ฐฐ์—ด

ํ•„์ˆ˜ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด ์ฐธ์กฐ

๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๋ฐฐ์› ์œผ๋‹ˆ, ์ผ์ƒ์ ์ธ ๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ ์“ธ ๋น ๋ฅธ ์ฐธ์กฐ ๊ฐ€์ด๋“œ๋ฅผ ๋“œ๋ฆฝ๋‹ˆ๋‹ค. ์ด ์„น์…˜์„ ๋ถ๋งˆํฌํ•˜์„ธ์š”!

GDB ๋ช…๋ น์–ด ์•ฝ์–ด (์‹œ๊ฐ„ ์ ˆ์•ฝ!)

๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉํ•˜๋Š” ๋‹จ์ถ•ํ‚ค๋กœ ๋” ๋น ๋ฅธ ๋””๋ฒ„๊น…:

์•ฝ์–ด์ „์ฒด ๋ช…๋ น์–ด๊ธฐ๋Šฅ
rrunํ”„๋กœ๊ทธ๋žจ ์‹œ์ž‘/์‹คํ–‰
ccontinue์‹คํ–‰ ์žฌ๊ฐœ
nnext์Šคํ… ์˜ค๋ฒ„ (๊ฐ™์€ ๋ ˆ๋ฒจ)
sstepํ•จ์ˆ˜ ๋‚ด๋ถ€๋กœ ์ง„์ž…
bbreak๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •
pprint๋ณ€์ˆ˜ ๊ฐ’ ์ถœ๋ ฅ
llist์†Œ์Šค ์ฝ”๋“œ ํ‘œ์‹œ
qquit๋””๋ฒ„๊ฑฐ ์ข…๋ฃŒ

์˜ˆ์‹œ:

(cuda-gdb) r                    # 'run' ๋Œ€์‹ 
(cuda-gdb) b 39                 # 'break 39' ๋Œ€์‹ 
(cuda-gdb) p thread_id          # 'print thread_id' ๋Œ€์‹ 
(cuda-gdb) n                    # 'next' ๋Œ€์‹ 
(cuda-gdb) c                    # 'continue' ๋Œ€์‹ 

โšก Pro ํŒ: ์•ฝ์–ด๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์†๋„๊ฐ€ 3-5๋ฐฐ ๋นจ๋ผ์ง‘๋‹ˆ๋‹ค!

LLDB ๋ช…๋ น์–ด (CPU ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ ๋””๋ฒ„๊น…)

์–ธ์ œ ์‚ฌ์šฉ: ์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ํ”„๋กœ๊ทธ๋žจ ํ๋ฆ„, ํ˜ธ์ŠคํŠธ ์ธก ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…

์‹คํ–‰ ์ œ์–ด

(lldb) run                   # ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰
(lldb) continue              # ์‹คํ–‰ ์žฌ๊ฐœ (๋ณ„์นญ: c)
(lldb) step                  # ํ•จ์ˆ˜ ๋‚ด๋ถ€๋กœ ์ง„์ž… (์†Œ์Šค ๋ ˆ๋ฒจ)
(lldb) next                  # ํ•จ์ˆ˜ ๊ฑด๋„ˆ๋›ฐ๊ธฐ (์†Œ์Šค ๋ ˆ๋ฒจ)
(lldb) finish                # ํ˜„์žฌ ํ•จ์ˆ˜์—์„œ ๋‚˜๊ฐ€๊ธฐ

๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ๊ด€๋ฆฌ

(lldb) br set -n main        # main ํ•จ์ˆ˜์— ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •
(lldb) br set -n function_name     # ์–ด๋–ค ํ•จ์ˆ˜์—๋“  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์„ค์ •
(lldb) br list               # ๋ชจ๋“  ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ํ‘œ์‹œ
(lldb) br delete 1           # ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ #1 ์‚ญ์ œ
(lldb) br disable 1          # ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ #1 ์ž„์‹œ ๋น„ํ™œ์„ฑํ™”

๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

(lldb) print variable_name   # ๋ณ€์ˆ˜ ๊ฐ’ ํ‘œ์‹œ
(lldb) print pointer[offset]        # ํฌ์ธํ„ฐ ์—ญ์ฐธ์กฐ
(lldb) print array[0]@4      # ์ฒซ 4๊ฐœ ๋ฐฐ์—ด ์š”์†Œ ํ‘œ์‹œ

CUDA-GDB ๋ช…๋ น์–ด (GPU ์ปค๋„ ๋””๋ฒ„๊น…)

์–ธ์ œ ์‚ฌ์šฉ: GPU ์ปค๋„, ์Šค๋ ˆ๋“œ ๋™์ž‘, ๋ณ‘๋ ฌ ์‹คํ–‰, GPU ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ ๋””๋ฒ„๊น…

GPU ์ƒํƒœ ๊ฒ€์‚ฌ

(cuda-gdb) info cuda threads    # ๋ชจ๋“  GPU ์Šค๋ ˆ๋“œ์™€ ์ƒํƒœ ํ‘œ์‹œ
(cuda-gdb) info cuda blocks     # ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋ธ”๋ก ํ‘œ์‹œ
(cuda-gdb) cuda kernel          # ํ™œ์„ฑ GPU ์ปค๋„ ๋‚˜์—ด

์Šค๋ ˆ๋“œ ํƒ์ƒ‰ (๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ๊ธฐ๋Šฅ!)

(cuda-gdb) cuda thread (0,0,0)  # ํŠน์ • ์Šค๋ ˆ๋“œ ์ขŒํ‘œ๋กœ ์ „ํ™˜
(cuda-gdb) cuda block (0,0)     # ํŠน์ • ๋ธ”๋ก์œผ๋กœ ์ „ํ™˜
(cuda-gdb) cuda thread          # ํ˜„์žฌ ์Šค๋ ˆ๋“œ ์ขŒํ‘œ ํ‘œ์‹œ

์Šค๋ ˆ๋“œ๋ณ„ ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

# ๋กœ์ปฌ ๋ณ€์ˆ˜์™€ ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ:
(cuda-gdb) print i              # ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๋ณ€์ˆ˜
(cuda-gdb) print output         # ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํฌ์ธํ„ฐ
(cuda-gdb) print a              # ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํฌ์ธํ„ฐ

GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

# ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ๋ฐฐ์—ด ๊ฒ€์‚ฌ (์‹ค์ œ๋กœ ์ž‘๋™ํ•˜๋Š” ๊ฒƒ):
(cuda-gdb) print array[i]       # ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐฐ์—ด ์ ‘๊ทผ
(cuda-gdb) print array[0]@4     # ์—ฌ๋Ÿฌ ์š”์†Œ ๋ณด๊ธฐ: {{val1}, {val2}, {val3}, {val4}}

๊ณ ๊ธ‰ GPU ๋””๋ฒ„๊น…

# ๋ฉ”๋ชจ๋ฆฌ ๊ฐ์‹œ
(cuda-gdb) watch array[i]     # ๋ฉ”๋ชจ๋ฆฌ ๋ณ€๊ฒฝ ์‹œ ์ค‘๋‹จ
(cuda-gdb) rwatch array[i]    # ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ์‹œ ์ค‘๋‹จ

๋น ๋ฅธ ์ฐธ์กฐ: ๋””๋ฒ„๊น… ๊ฒฐ์ • ํŠธ๋ฆฌ

๐Ÿค” ์–ด๋–ค ์œ ํ˜•์˜ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๊ณ  ์žˆ๋‚˜์š”?

GPU ์ฝ”๋“œ ์‹คํ–‰ ์ „์— ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œ

โ†’ LLDB ๋””๋ฒ„๊น… ์‚ฌ์šฉ

pixi run mojo debug your_program.mojo

GPU ์ปค๋„์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ƒ์„ฑ

โ†’ ์กฐ๊ฑด๋ถ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์™€ ํ•จ๊ป˜ CUDA-GDB ์‚ฌ์šฉ

pixi run mojo debug --cuda-gdb --break-on-launch your_program.mojo

์„ฑ๋Šฅ ๋ฌธ์ œ๋‚˜ ๊ฒฝ์Ÿ ์ƒํƒœ

โ†’ ์žฌํ˜„์„ฑ์„ ์œ„ํ•ด ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น… ์‚ฌ์šฉ

pixi run mojo build -O0 -g your_program.mojo -o debug_binary
pixi run mojo debug --cuda-gdb --break-on-launch debug_binary

GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค

GPU ๋””๋ฒ„๊น… ๊ธฐ์ดˆ์— ๋Œ€ํ•œ ํฌ๊ด„์ ์ธ ํŠœํ† ๋ฆฌ์–ผ์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค. ๋‹ค์Œ์€ ๋‹ฌ์„ฑํ•œ ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค:

์Šต๋“ํ•œ ๊ธฐ์ˆ 

๋‹ค์ค‘ ๋ ˆ๋ฒจ ๋””๋ฒ„๊น… ์ง€์‹:

  • โœ… LLDB๋กœ CPU ํ˜ธ์ŠคํŠธ ๋””๋ฒ„๊น… - ์žฅ์น˜ ์„ค์ •, ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ํ”„๋กœ๊ทธ๋žจ ํ๋ฆ„ ๋””๋ฒ„๊น…
  • โœ… CUDA-GDB๋กœ GPU ์ปค๋„ ๋””๋ฒ„๊น… - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ, GPU ๋ฉ”๋ชจ๋ฆฌ, ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…
  • โœ… JIT vs ๋ฐ”์ด๋„ˆ๋ฆฌ ๋””๋ฒ„๊น… - ์ƒํ™ฉ์— ๋งž๋Š” ์ ‘๊ทผ๋ฒ• ์„ ํƒ
  • โœ… pixi๋กœ ํ™˜๊ฒฝ ๊ด€๋ฆฌ - ์ผ๊ด€๋˜๊ณ  ์‹ ๋ขฐํ•  ์ˆ˜ ์žˆ๋Š” ๋””๋ฒ„๊น… ์„ค์ • ๋ณด์žฅ

์‹ค์ œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํ†ต์ฐฐ:

  • ์Šค๋ ˆ๋“œ์˜ ์‹ค์ œ ๋™์ž‘ ํ™•์ธ - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋งˆ๋‹ค thread_idx.x๊ฐ€ ๋‹ค๋ฅธ ๊ฐ’์„ ๊ฐ–๋Š” ๊ฒƒ์„ ์ง์ ‘ ๋ชฉ๊ฒฉํ–ˆ์Šต๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ดํ•ด - ์ „์—ญ GPU ๋ฉ”๋ชจ๋ฆฌ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ๋””๋ฒ„๊น…ํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์Šค๋ ˆ๋“œ ํƒ์ƒ‰ ํ•™์Šต - ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ ์‚ฌ์ด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ด๋™ํ–ˆ์Šต๋‹ˆ๋‹ค

์ด๋ก ์—์„œ ์‹ค์ „์œผ๋กœ

GPU ๋””๋ฒ„๊น…์— ๋Œ€ํ•ด ์ฝ๊ธฐ๋งŒ ํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ๊ฒฝํ—˜ํ–ˆ์Šต๋‹ˆ๋‹ค:

  • ์‹ค์ œ ์ฝ”๋“œ ๋””๋ฒ„๊น…: ์‹ค์ œ GPU ์‹คํ–‰์œผ๋กœ Puzzle 01์˜ add_10 ์ปค๋„์„ ๋””๋ฒ„๊น…ํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์‹ค์ œ ๋””๋ฒ„๊ฑฐ ์ถœ๋ ฅ ํ™•์ธ: LLDB ์–ด์…ˆ๋ธ”๋ฆฌ, CUDA-GDB ์Šค๋ ˆ๋“œ ์ƒํƒœ, ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ๋ฅผ ์ง์ ‘ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์ „๋ฌธ ๋„๊ตฌ ์‚ฌ์šฉ: ํ”„๋กœ๋•์…˜ GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๊ณผ ๋™์ผํ•œ CUDA-GDB๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค
  • ์‹ค์ œ ์‹œ๋‚˜๋ฆฌ์˜ค ํ•ด๊ฒฐ: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ, ๊ฒฝ์Ÿ ์ƒํƒœ, ์ปค๋„ ์‹คํ–‰ ์‹คํŒจ ๋ฌธ์ œ๋ฅผ ๋‹ค๋ค˜์Šต๋‹ˆ๋‹ค

๋””๋ฒ„๊น… ๋„๊ตฌ ๋ชจ์Œ

๋น ๋ฅธ ๊ฒฐ์ • ๊ฐ€์ด๋“œ (ํ•ญ์ƒ ๊ฐ€๊นŒ์ด ๋‘์„ธ์š”!):

๋ฌธ์ œ ์œ ํ˜•๋„๊ตฌ๋ช…๋ น์–ด
GPU ์ „์— ํ”„๋กœ๊ทธ๋žจ ํฌ๋ž˜์‹œLLDBpixi run mojo debug program.mojo
GPU ์ปค๋„ ๋ฌธ์ œCUDA-GDBpixi run mojo debug --cuda-gdb --break-on-launch program.mojo
๊ฒฝ์Ÿ ์ƒํƒœCUDA-GDB + ์Šค๋ ˆ๋“œ ํƒ์ƒ‰(cuda-gdb) cuda thread (0,0,0)

ํ•„์ˆ˜ ๋ช…๋ น์–ด (์ผ์ƒ ๋””๋ฒ„๊น…์šฉ):

# GPU ์Šค๋ ˆ๋“œ ๊ฒ€์‚ฌ
(cuda-gdb) info cuda threads          # ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋ณด๊ธฐ
(cuda-gdb) cuda thread (0,0,0)        # ์Šค๋ ˆ๋“œ ์ „ํ™˜
(cuda-gdb) print i                    # ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค (thread_idx.x ๋“ฑ๊ฐ€)

# ์Šค๋งˆํŠธ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ (GPU ๋‚ด์žฅ ๋ณ€์ˆ˜๊ฐ€ ์ž‘๋™ํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ๋กœ์ปฌ ๋ณ€์ˆ˜ ์‚ฌ์šฉ)
(cuda-gdb) break kernel if i == 0      # ์Šค๋ ˆ๋“œ 0์— ์ง‘์ค‘
(cuda-gdb) break kernel if array[i] > 100  # ๋ฐ์ดํ„ฐ ์กฐ๊ฑด์— ์ง‘์ค‘

# ๋ฉ”๋ชจ๋ฆฌ ๋””๋ฒ„๊น…
(cuda-gdb) print array[i]              # ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ๋ณ„ ๋ฐ์ดํ„ฐ
(cuda-gdb) print array[0]@4            # ๋ฐฐ์—ด ์„ธ๊ทธ๋จผํŠธ: {{val1}, {val2}, {val3}, {val4}}

์š”์•ฝ

GPU ๋””๋ฒ„๊น…์—๋Š” ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ, ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ, ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ๊ด€์—ฌํ•ฉ๋‹ˆ๋‹ค. ์ด์ œ ๋‹ค์Œ์„ ๊ฐ–์ถ”๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

  • ์–ด๋–ค GPU ํ”„๋กœ๊ทธ๋žจ์—๋„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์ฒด๊ณ„์ ์ธ ์›Œํฌํ”Œ๋กœ์šฐ
  • LLDB์™€ CUDA-GDB ์ „๋ฌธ ๋„๊ตฌ์— ๋Œ€ํ•œ ์นœ์ˆ™ํ•จ
  • ์‹ค์ œ ๋ณ‘๋ ฌ ์ฝ”๋“œ๋ฅผ ๋””๋ฒ„๊น…ํ•œ ์‹ค์ „ ๊ฒฝํ—˜
  • ๋ณต์žกํ•œ ์ƒํ™ฉ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ์‹ค์šฉ์ ์ธ ์ „๋žต
  • GPU ๋””๋ฒ„๊น… ๊ณผ์ œ๋ฅผ ํ•ด๊ฒฐํ•  ๊ธฐ์ดˆ

์ถ”๊ฐ€ ์ž๋ฃŒ

์ฐธ๊ณ : GPU ๋””๋ฒ„๊น…์—๋Š” ์ธ๋‚ด์‹ฌ๊ณผ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃฌ ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๋ช…๋ น์–ด๋Š” ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ๋งˆ์ฃผ์น˜๊ฒŒ ๋  ๋ณต์žกํ•œ GPU ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

๐Ÿง ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

์ด๋ฒˆ ํผ์ฆ์—์„œ๋Š” ํฌ๋ž˜์‹œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š” GPU ํ”„๋กœ๊ทธ๋žจ์ด ์ฃผ์–ด์ง‘๋‹ˆ๋‹ค. ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  (cuda-gdb) ๋””๋ฒ„๊น… ๋„๊ตฌ๋งŒ์œผ๋กœ ๋ฌธ์ œ๋ฅผ ์ฐพ์•„๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋””๋ฒ„๊น… ์Šคํ‚ฌ์„ ๋ฐœํœ˜ํ•ด ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ณด์„ธ์š”!

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ์„ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์„ค์ •๊ณผ ๊ธฐ๋ณธ ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋ฅผ ์ตํ˜€๋‘์„ธ์š”. ์•„๋ž˜ ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

์ด ๋ช…๋ น์€ ์‹œ์Šคํ…œ์˜ CUDA ์„ค์น˜๋ฅผ ์ž๋™์œผ๋กœ ๊ฐ์ง€ํ•˜๊ณ  GPU ๋””๋ฒ„๊น…์— ํ•„์š”ํ•œ ๋งํฌ๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น…: ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋ฅผ ๋‹จ์„œ ์‚ผ์•„ ๊ทผ๋ณธ ์›์ธ ์ฐพ๊ธฐ
  • ์˜ค๋ฅ˜ ๋ถ„์„: ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€์™€ ์Šคํƒ ์ถ”์ (stack trace) ํ•ด์„ํ•˜๊ธฐ
  • ๊ฐ€์„ค ์ˆ˜๋ฆฝ: ๋ฌธ์ œ์— ๋Œ€ํ•œ ํ•ฉ๋ฆฌ์ ์ธ ์ถ”์ธก ์„ธ์šฐ๊ธฐ
  • ๋””๋ฒ„๊น… ์›Œํฌํ”Œ๋กœ์šฐ: ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ ๊ณผ์ • ์ตํžˆ๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

def add_10(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
):
    var i = thread_idx.x
    output[i] = a[i] + 10.0


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --first-case

ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

First Case: Try to identify what's wrong without looking at the code!

stack trace was not collected. Enable stack trace collection with environment variable `MOJO_ENABLE_STACK_TRACE_ON_ERROR`
Unhandled exception caught during execution: At open-source/max/mojo/stdlib/stdlib/gpu/host/device_context.mojo:2082:17: CUDA call failed: CUDA_ERROR_INVALID_IMAGE (device kernel image is invalid)
To get more accurate error information, set MODULAR_DEVICE_CONTEXT_SYNC_MODE=true.
/home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/nvidia/bin/mojo: error: execution exited with a non-zero result: 1

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ํฌ๋ž˜์‹œ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ๋””๋ฒ„๊น… ์ „๋žต์€ ๋ฌด์—‡์ผ๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --first-case
ํŒ
  1. ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€๋ฅผ ๊ผผ๊ผผํžˆ ์ฝ๊ธฐ - CUDA_ERROR_ILLEGAL_ADDRESS๋Š” GPU๊ฐ€ ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋ ค ํ–ˆ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค
  2. ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ •๋ณด ํ™•์ธ - CUDA-GDB๊ฐ€ ๋ฉˆ์ถœ ๋•Œ ํ‘œ์‹œ๋˜๋Š” ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ดํŽด๋ณด์„ธ์š”
  3. ๋ชจ๋“  ํฌ์ธํ„ฐ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๊ฒ€์‚ฌ - print๋กœ ๊ฐ ํฌ์ธํ„ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ™•์ธํ•˜์„ธ์š”
  4. ์ˆ˜์ƒํ•œ ์ฃผ์†Œ ์ฐพ๊ธฐ - ์œ ํšจํ•œ GPU ์ฃผ์†Œ๋Š” ๋ณดํ†ต ํฐ 16์ง„์ˆ˜์ž…๋‹ˆ๋‹ค (0x0์€ ๋ฌด์—‡์„ ์˜๋ฏธํ• ๊นŒ์š”?)
  5. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํ…Œ์ŠคํŠธ - ๊ฐ ํฌ์ธํ„ฐ๋กœ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•ด์„œ ์–ด๋А ๊ฒƒ์ด ์‹คํŒจํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  6. ์ฒด๊ณ„์ ์œผ๋กœ ์ ‘๊ทผ - ํƒ์ •์ฒ˜๋Ÿผ ์ฆ๊ฑฐ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋ฉฐ ์ฆ์ƒ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ ํ•˜์„ธ์š”
  7. ์œ ํšจํ•œ ํŒจํ„ด๊ณผ ๊ทธ๋ ‡์ง€ ์•Š์€ ํŒจํ„ด ๋น„๊ต - ํ•œ ํฌ์ธํ„ฐ๊ฐ€ ์ž‘๋™ํ•˜๊ณ  ๋‹ค๋ฅธ ๊ฑด ์•ˆ ๋œ๋‹ค๋ฉด, ๋ฌธ์ œ๊ฐ€ ์žˆ๋Š” ์ชฝ์— ์ง‘์ค‘ํ•˜์„ธ์š”
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --first-case

๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ •๋ณด ํ™•์ธ

CUDA-GDB๊ฐ€ ๋ฉˆ์ถ”๋ฉด ๋ฐ”๋กœ ์œ ์šฉํ•œ ๋‹จ์„œ๊ฐ€ ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

(cuda-gdb) run
CUDA thread hit breakpoint, p09_add_10_... (output=0x302000000, a=0x0)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:31
31          i = thread_idx.x

๐Ÿ” ์ฒซ ๋ฒˆ์งธ ๋‹จ์„œ: ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜์— (output=0x302000000, a=0x0)์ด ๋ณด์ž…๋‹ˆ๋‹ค

  • output์€ ์œ ํšจํ•œ GPU ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค
  • a๋Š” 0x0 - null ํฌ์ธํ„ฐ์ž…๋‹ˆ๋‹ค!

์ฒด๊ณ„์ ์ธ ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ

(cuda-gdb) next
32          output[i] = a[i] + 10.0
(cuda-gdb) print i
$1 = 0
(cuda-gdb) print output
$2 = (!pop.scalar<f32> * @register) 0x302000000
(cuda-gdb) print a
$3 = (!pop.scalar<f32> * @register) 0x0

์ฆ๊ฑฐ ์ˆ˜์ง‘:

  • โœ… ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค i=0์€ ์œ ํšจํ•ฉ๋‹ˆ๋‹ค
  • โœ… ๊ฒฐ๊ณผ ํฌ์ธํ„ฐ 0x302000000์€ ์˜ฌ๋ฐ”๋ฅธ GPU ์ฃผ์†Œ์ž…๋‹ˆ๋‹ค
  • โŒ ์ž…๋ ฅ ํฌ์ธํ„ฐ 0x0์€ null์ž…๋‹ˆ๋‹ค

๋ฌธ์ œ ํ™•์ธ

(cuda-gdb) print a[i]
Cannot access memory at address 0x0

๊ฒฐ์ •์  ์ฆ๊ฑฐ: null ์ฃผ์†Œ์˜ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค - ๋ฐ”๋กœ ์ด๊ฒƒ์ด ํฌ๋ž˜์‹œ์˜ ์›์ธ์ž…๋‹ˆ๋‹ค!

๊ทผ๋ณธ ์›์ธ ๋ถ„์„

๋ฌธ์ œ์ : ์ด์ œ --first-crash์˜ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด, ํ˜ธ์ŠคํŠธ ์ฝ”๋“œ๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ œ๋Œ€๋กœ ํ• ๋‹นํ•˜์ง€ ์•Š๊ณ  null ํฌ์ธํ„ฐ๋ฅผ ๋งŒ๋“ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

input_buf = ctx.enqueue_create_buffer[dtype](0)  # 0개 요소의 `DeviceBuffer` 생성 - 메모리가 할당되지 않아 null 포인터가 됩니다!

์™œ ํฌ๋ž˜์‹œ๊ฐ€ ๋ฐœ์ƒํ•˜๋Š”๊ฐ€:

  1. ctx.enqueue_create_buffer[dtype](0)์€ 0๊ฐœ ์š”์†Œ๋ฅผ ๊ฐ€์ง„ DeviceBuffer๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
  2. ํ• ๋‹นํ•  ์š”์†Œ๊ฐ€ ์—†์œผ๋‹ˆ null ํฌ์ธํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค.
  3. ์ด null ํฌ์ธํ„ฐ๊ฐ€ GPU ์ปค๋„๋กœ ์ „๋‹ฌ๋ฉ๋‹ˆ๋‹ค.
  4. ์ปค๋„์ด a[i]์— ์ ‘๊ทผํ•˜๋ ค ํ•  ๋•Œ null์„ ์—ญ์ฐธ์กฐ โ†’ CUDA_ERROR_ILLEGAL_ADDRESS

์ˆ˜์ • ๋ฐฉ๋ฒ•

Null ํฌ์ธํ„ฐ ์ƒ์„ฑ์„ ์ ์ ˆํ•œ ๋ฒ„ํผ ํ• ๋‹น์œผ๋กœ ๊ต์ฒดํ•ฉ๋‹ˆ๋‹ค:

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: Null ํฌ์ธํ„ฐ ์ƒ์„ฑ
input_buf = ctx.enqueue_create_buffer[dtype](0)

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ์•ˆ์ „ํ•œ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•ด ์‹ค์ œ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๊ณ  ์ดˆ๊ธฐํ™”
input_buf = ctx.enqueue_create_buffer[dtype](SIZE)
input_buf.enqueue_fill(0)

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

ํŒจํ„ด ์ธ์‹:

  • 0x0 ์ฃผ์†Œ๋Š” ํ•ญ์ƒ null ํฌ์ธํ„ฐ์ž…๋‹ˆ๋‹ค
  • ์œ ํšจํ•œ GPU ์ฃผ์†Œ๋Š” ํฐ 16์ง„์ˆ˜์ž…๋‹ˆ๋‹ค (์˜ˆ: 0x302000000)

๋””๋ฒ„๊น… ์ „๋žต:

  1. ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€ ์ฝ๊ธฐ - ๋Œ€์ฒด๋กœ ๋ฌธ์ œ ์œ ํ˜•์— ๋Œ€ํ•œ ํžŒํŠธ๋ฅผ ์ค๋‹ˆ๋‹ค
  2. ํ•จ์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ ํ™•์ธ - CUDA-GDB๊ฐ€ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ง„์ž… ์‹œ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  3. ๋ชจ๋“  ํฌ์ธํ„ฐ ๊ฒ€์‚ฌ - ์ฃผ์†Œ๋ฅผ ๋น„๊ตํ•ด์„œ null์ด๋‚˜ ์ž˜๋ชป๋œ ๊ฒƒ์„ ์ฐพ์Šต๋‹ˆ๋‹ค
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํ…Œ์ŠคํŠธ - ์ˆ˜์ƒํ•œ ํฌ์ธํ„ฐ๋ฅผ ์—ญ์ฐธ์กฐํ•ด ๋ด…๋‹ˆ๋‹ค
  5. ํ• ๋‹น ์ง€์ ๊นŒ์ง€ ์ถ”์  - ๋ฌธ์ œ์˜ ํฌ์ธํ„ฐ๊ฐ€ ์–ด๋””์„œ ์ƒ์„ฑ๋˜์—ˆ๋Š”์ง€ ์ฐพ์Šต๋‹ˆ๋‹ค

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋Ÿฐ ์œ ํ˜•์˜ null ํฌ์ธํ„ฐ ๋ฒ„๊ทธ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งค์šฐ ํ”ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ CUDA-GDB ์กฐ์‚ฌ ๋ฐฉ๋ฒ•์€ ๋‹ค๋ฅธ ๋งŽ์€ GPU ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ, ๊ฒฝ์Ÿ ์ƒํƒœ, ์ปค๋„ ํฌ๋ž˜์‹œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ๋•Œ๋„ ๊ทธ๋Œ€๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: ํฌ๋ž˜์‹œ์—์„œ ์กฐ์šฉํ•œ ๋ฒ„๊ทธ๋กœ

ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค! ์ด์ œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋ฅผ ๋‹จ์„œ๋กœ GPU ํฌ๋ž˜์‹œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์กฐ์‚ฌ
  • ํฌ์ธํ„ฐ ์ฃผ์†Œ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•ด null ํฌ์ธํ„ฐ ๋ฒ„๊ทธ ์‹๋ณ„
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ๋””๋ฒ„๊น…์— CUDA-GDB๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์‚ฌ์šฉ

๋‹ค์Œ ๋„์ „: ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ทธ๋Ÿฐ๋ฐ ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜์ง€ ์•Š๋Š”๋‹ค๋ฉด์š”? ์™„๋ฒฝํ•˜๊ฒŒ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜จ๋‹ค๋ฉด?

๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€๋Š” ์ „ํ˜€ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ๋””๋ฒ„๊น… ๋„์ „์ž…๋‹ˆ๋‹ค:

  • ๊ธธ์žก์ด๊ฐ€ ๋˜์–ด์ค„ ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค
  • ์กฐ์‚ฌํ•  ๋šœ๋ ทํ•œ ํฌ์ธํ„ฐ ๋ฌธ์ œ๋„ ์—†์Šต๋‹ˆ๋‹ค
  • ๋ฌธ์ œ๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋Š” ์Šคํƒ ์ถ”์ ๋„ ์—†์Šต๋‹ˆ๋‹ค
  • ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ๊ฐ€ ํ•„์š”ํ•œ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์Šต๋‹ˆ๋‹ค

์ƒˆ๋กญ๊ฒŒ ์ตํžˆ๊ฒŒ ๋  ์Šคํ‚ฌ:

  • ๋กœ์ง ๋ฒ„๊ทธ ํƒ์ง€ - ํฌ๋ž˜์‹œ ์—†์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์ฐพ๊ธฐ
  • ํŒจํ„ด ๋ถ„์„ - ์ž˜๋ชป๋œ ์ถœ๋ ฅ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๊ธฐ
  • ์‹คํ–‰ ํ๋ฆ„ ๋””๋ฒ„๊น… - ์ตœ์ ํ™” ๋•Œ๋ฌธ์— ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์•ˆ ๋  ๋•Œ ๋Œ€์ฒ˜ํ•˜๊ธฐ

์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๋ฐฉ๋ฒ• - ๋‹จ์„œ ์ฝ๊ธฐ, ๊ฐ€์„ค ์„ธ์šฐ๊ธฐ, ์ฒด๊ณ„์ ์œผ๋กœ ํ…Œ์ŠคํŠธํ•˜๊ธฐ - ์€ ์•ž์œผ๋กœ ๋งˆ์ฃผํ•  ๋” ๋ฏธ๋ฌ˜ํ•œ ๋กœ์ง ์˜ค๋ฅ˜๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋Š” ๊ธฐ์ดˆ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

๐Ÿ” ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ ์ตํžŒ ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น… ์Šคํ‚ฌ์„ ๋ฐ”ํƒ•์œผ๋กœ, ์ด๋ฒˆ์—๋Š” ์ „ํ˜€ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ๋„์ „์„ ๋งˆ์ฃผํ•ฉ๋‹ˆ๋‹ค: ํฌ๋ž˜์‹œ ์—†์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๋กœ์ง ๋ฒ„๊ทธ์ž…๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ๊ด€์ ์˜ ์ „ํ™˜:

  • ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ๋ช…ํ™•ํ•œ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ(CUDA_ERROR_ILLEGAL_ADDRESS)๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•จ
  • ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํฌ๋ž˜์‹œ๋„ ์—†๊ณ  ์—๋Ÿฌ ๋ฉ”์‹œ์ง€๋„ ์—†์Œ - ํƒ์ •์ฒ˜๋Ÿผ ํŒŒํ—ค์ณ์•ผ ํ•˜๋Š” ๋ฏธ๋ฌ˜ํ•˜๊ฒŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋งŒ ์žˆ์Œ

์ด๋ฒˆ ์ค‘๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” TileTensor ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœ๊ทธ๋žจ์€ ์„ฑ๊ณต์ ์œผ๋กœ ์‹คํ–‰๋˜์ง€๋งŒ ์ž˜๋ชป๋œ ์ถœ๋ ฅ์„ ๋‚ด๋Š”๋ฐ, ์‹ค์ œ ๊ฐœ๋ฐœ์—์„œ ํ›จ์”ฌ ํ”ํ•˜๋ฉด์„œ๋„ ๊นŒ๋‹ค๋กœ์šด ๋””๋ฒ„๊น… ์‹œ๋‚˜๋ฆฌ์˜ค์ž…๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ๊ณผ ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ์™€ ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•์„ ์ตํ˜€๋‘์„ธ์š”. ์•„๋ž˜ ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • TileTensor ๋””๋ฒ„๊น…: ๊ตฌ์กฐํ™”๋œ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด ์กฐ์‚ฌํ•˜๊ธฐ
  • ๋กœ์ง ๋ฒ„๊ทธ ํƒ์ง€: ํฌ๋ž˜์‹œํ•˜์ง€ ์•Š๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์ฐพ๊ธฐ
  • ๋ฐ˜๋ณต๋ฌธ ๊ฒฝ๊ณ„ ๋ถ„์„: ๋ฐ˜๋ณต ํšŸ์ˆ˜ ๋ฌธ์ œ ์ดํ•ดํ•˜๊ธฐ
  • ๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„: ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ๋กœ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ๊ฑฐ์Šฌ๋Ÿฌ ์˜ฌ๋ผ๊ฐ€๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

def process_sliding_window(
    output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin],
):
    var thread_id = thread_idx.x

    # Each thread processes a sliding window of 3 elements
    var window_sum = Scalar[dtype](0.0)

    # Sum elements in sliding window: [i-1, i, i+1]
    for offset in range(ITER):
        var idx = Int(thread_id) + offset - 1
        if 0 <= idx < SIZE:
            var value = rebind[Scalar[dtype]](a[idx])
            window_sum += value

    output[thread_id] = window_sum


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --second-case

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค - ํฌ๋ž˜์‹œ ์—†์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ:

This program computes sliding window sums for each position...

Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]
[ํฌ๋ž˜์‹œ ์—†์Œ - ๊ณ„์‚ฐ์€ ์™„๋ฃŒ๋˜์ง€๋งŒ ๊ฒฐ๊ณผ๊ฐ€ ํ‹€๋ฆผ: ์‹ค์ œ [0.0, 1.0, 3.0, 5.0], ๊ธฐ๋Œ€๊ฐ’ [1.0, 3.0, 6.0, 5.0]]

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ํ”„๋กœ๊ทธ๋žจ์€ ํฌ๋ž˜์‹œ ์—†์ด ์‹คํ–‰๋˜์ง€๋งŒ ์ผ์ •ํ•œ ํŒจํ„ด์œผ๋กœ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ๋กœ์ง ๋ฒ„๊ทธ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฌด์—‡์ผ๊นŒ์š”?

์ƒ๊ฐํ•ด ๋ณผ ์ :

  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์—์„œ ์–ด๋–ค ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?
  • ์ œ๋Œ€๋กœ ๋Œ์ง€ ์•Š๋Š” ๊ฒƒ ๊ฐ™์€ ๋ฐ˜๋ณต๋ฌธ์€ ์–ด๋–ป๊ฒŒ ์กฐ์‚ฌํ•  ๊ฑด๊ฐ€์š”?
  • ๋ณ€์ˆ˜๋ฅผ ์ง์ ‘ ๊ฒ€์‚ฌํ•  ์ˆ˜ ์—†์„ ๋•Œ ์–ด๋–ค ๋””๋ฒ„๊น… ์ „๋žต์ด ํšจ๊ณผ์ ์ผ๊นŒ์š”?
  • ์กฐ์‚ฌ๋ฅผ ์•ˆ๋‚ดํ•ด ์ค„ ํฌ๋ž˜์‹œ ์‹ ํ˜ธ๊ฐ€ ์—†์„ ๋•Œ, ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์˜ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๋ฐฉ๋ฒ•์„ ์–ด๋–ป๊ฒŒ ์ ์šฉํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

GDB ๋ช…๋ น์–ด ๋‹จ์ถ•ํ‚ค (๋น ๋ฅธ ๋””๋ฒ„๊น…)

์ด ๋‹จ์ถ•ํ‚ค๋“ค์„ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์„ธ์…˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๋‹จ์ถ•  ์ „์ฒด      ์‚ฌ์šฉ ์˜ˆ์‹œ
r     run       (cuda-gdb) r
n     next      (cuda-gdb) n
c     continue  (cuda-gdb) c
b     break     (cuda-gdb) b 39
p     print     (cuda-gdb) p thread_id
q     quit      (cuda-gdb) q

์•„๋ž˜ ๋ชจ๋“  ๋””๋ฒ„๊น… ๋ช…๋ น์–ด๋Š” ํšจ์œจ์„ ์œ„ํ•ด ์ด ๋‹จ์ถ•ํ‚ค๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค!

ํŒ
  1. ํŒจํ„ด ๋ถ„์„๋ถ€ํ„ฐ - ๊ธฐ๋Œ€๊ฐ’๊ณผ ์‹ค์ œ ๊ฒฐ๊ณผ์˜ ๊ด€๊ณ„๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” (์ฐจ์ด์— ์–ด๋–ค ์ˆ˜ํ•™์  ํŒจํ„ด์ด ์žˆ๋‚˜์š”?)
  2. ์‹คํ–‰ ํ๋ฆ„์— ์ง‘์ค‘ - ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์œผ๋ฉด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์„ธ์–ด๋ณด์„ธ์š”
  3. ๋‹จ์ˆœํ•œ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์‚ฌ์šฉ - ์ตœ์ ํ™”๋œ ์ฝ”๋“œ์—์„œ๋Š” ๋ณต์žกํ•œ ๋””๋ฒ„๊น… ๋ช…๋ น์ด ์‹คํŒจํ•˜๊ธฐ ์‰ฝ์Šต๋‹ˆ๋‹ค
  4. ์ˆ˜ํ•™์  ์ถ”๋ก  - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ๋”ฐ์ ธ๋ณด์„ธ์š”
  5. ๋ˆ„๋ฝ๋œ ๋ฐ์ดํ„ฐ ์กฐ์‚ฌ - ๊ฒฐ๊ณผ๊ฐ€ ์ผ๊ด€๋˜๊ฒŒ ๊ธฐ๋Œ€๋ณด๋‹ค ์ž‘๋‹ค๋ฉด, ๋ฌด์—‡์ด ๋น ์กŒ์„๊นŒ์š”?
  6. ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ๊ฒ€์ฆ - ์ตœ์ข… ๊ฒฐ๊ณผ์—์„œ ๋ฒ„๊ทธ์˜ ํŒจํ„ด์ด ๋“œ๋Ÿฌ๋‚˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค
  7. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฒฝ๊ณ„ ๋ถ„์„ - ๋ฐ˜๋ณต๋ฌธ์ด ์˜ฌ๋ฐ”๋ฅธ ๊ฐœ์ˆ˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  8. ์ž‘๋™ํ•˜๋Š” ์ผ€์ด์Šค์™€ ๊ต์ฐจ ๊ฒ€์ฆ - ์Šค๋ ˆ๋“œ 3์€ ์ •ํ™•ํ•˜๊ฒŒ ์ž‘๋™ํ•˜๋Š”๋ฐ ๋‹ค๋ฅธ ๊ฒƒ๋“ค์€ ์™œ ์•ˆ ๋ ๊นŒ์š”?
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

1๋‹จ๊ณ„: ์‹คํ–‰๊ณผ ์ดˆ๊ธฐ ๋ถ„์„

Step 1: ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

Step 2: ์ฆ์ƒ๋ถ€ํ„ฐ ๋ถ„์„

๋””๋ฒ„๊ฑฐ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

์‹ค์ œ ๊ฒฐ๊ณผ: [0.0, 1.0, 3.0, 5.0]
๊ธฐ๋Œ€๊ฐ’: [1.0, 3.0, 6.0, 5.0]

๐Ÿ” ํŒจํ„ด ์ธ์‹:

  • ์Šค๋ ˆ๋“œ 0: 0.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 1.0 โ†’ 1.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 1: 1.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 3.0 โ†’ 2.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 2: 3.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 6.0 โ†’ 3.0 ๋ˆ„๋ฝ
  • ์Šค๋ ˆ๋“œ 3: 5.0 ์–ป์Œ, ๊ธฐ๋Œ€๊ฐ’ 5.0 โ†’ โœ… ์ •ํ™•

์ดˆ๊ธฐ ๊ฐ€์„ค: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ผ๋ถ€ ๋ฐ์ดํ„ฐ๋ฅผ ๋ˆ„๋ฝํ•˜๊ณ  ์žˆ๋Š”๋ฐ, ์Šค๋ ˆ๋“œ 3๋งŒ ์ •ํ™•ํ•˜๊ฒŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.
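์ด ํŒจํ„ด์€ ๊ฐ„๋‹จํ•œ ์‚ฐ์ˆ ๋กœ๋„ ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์œ„ ๊ด€์ฐฐ๊ฐ’์„ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ฒจ, ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„๋ฝํ•œ ๊ฐ’์ด ์ •ํ™•ํžˆ a[t + 1]๊ณผ ์ผ์น˜ํ•จ์„ ๋ณด์—ฌ์ฃผ๋Š” ํŒŒ์ด์ฌ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (์Šค๋ ˆ๋“œ 3์€ t + 1์ด ๊ฒฝ๊ณ„ ๋ฐ–์ด๋ผ ๋ˆ„๋ฝ๋„ ์—†์Šต๋‹ˆ๋‹ค). Mojo ์ฝ”๋“œ๊ฐ€ ์•„๋‹ˆ๋ผ ํŒจํ„ด ๋ถ„์„์„ ์œ„ํ•œ ๊ณ„์‚ฐ ๋ณด์กฐ ๋„๊ตฌ์ผ ๋ฟ์ž…๋‹ˆ๋‹ค.

```python
actual   = [0.0, 1.0, 3.0, 5.0]   # ๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ ๊ด€์ฐฐ๋œ ์ถœ๋ ฅ
expected = [1.0, 3.0, 6.0, 5.0]   # ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ์˜ ๊ธฐ๋Œ€๊ฐ’
a        = [0.0, 1.0, 2.0, 3.0]   # ์ž…๋ ฅ ๋ฐฐ์—ด

for t in range(4):
    missing = expected[t] - actual[t]
    # ๊ฐ ์Šค๋ ˆ๋“œ์˜ ๋ˆ„๋ฝ๋Ÿ‰์€ ๊ฒฝ๊ณ„ ์•ˆ์ด๋ฉด ์˜ค๋ฅธ์ชฝ ์ด์›ƒ a[t + 1]๊ณผ ์ •ํ™•ํžˆ ์ผ์น˜
    neighbor = a[t + 1] if t + 1 < 4 else 0.0
    print(t, missing, neighbor, missing == neighbor)
```

์ถœ๋ ฅ์„ ๋ณด๋ฉด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ "์˜ค๋ฅธ์ชฝ ์ด์›ƒ ํ•˜๋‚˜"๋ฅผ ๋นผ๋จน๊ณ  ์žˆ์Œ์ด ๋“œ๋Ÿฌ๋‚˜๊ณ , ์ด๋Š” ๋ฐ˜๋ณต๋ฌธ์˜ ๋งˆ์ง€๋ง‰ ๋ฐ˜๋ณต์ด ๋น ์กŒ๋‹ค๋Š” ๊ฐ€์„ค๋กœ ์ด์–ด์ง‘๋‹ˆ๋‹ค.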

2๋‹จ๊ณ„: ์ปค๋„ ์ง„์ž…

Step 3: ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ ์ง„์ž… ํ™•์ธ

์‹ค์ œ ๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค:

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --second-case

This program computes sliding window sums for each position...
Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30
30          input: TileTensor[mut=False, dtype, vector_layout],

Step 4: ๋ฉ”์ธ ๋กœ์ง์œผ๋กœ ์ด๋™

(cuda-gdb) n
29          output: TileTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
32          thread_id = thread_idx.x
(cuda-gdb) n
38          for offset in range(ITER):

Step 5: ๋ณ€์ˆ˜ ์ ‘๊ทผ์„ฑ ํ…Œ์ŠคํŠธ - ์ค‘์š”ํ•œ ๋ฐœ๊ฒฌ

(cuda-gdb) p thread_id
$1 = 0

โœ… ์ข‹์Œ: Thread ID์— ์ ‘๊ทผ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

(cuda-gdb) p window_sum
Cannot access memory at address 0x0

โŒ ๋ฌธ์ œ: window_sum์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

(cuda-gdb) p a[0]
Attempt to take address of value not located in memory.

โŒ ๋ฌธ์ œ: TileTensor ์ง์ ‘ ์ธ๋ฑ์‹ฑ์ด ์ž‘๋™ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

(cuda-gdb) p a.ptr[0]
$2 = {0}
(cuda-gdb) p a.ptr[0]@4
$3 = {{0}, {1}, {2}, {3}}

๐ŸŽฏ ๋ŒํŒŒ๊ตฌ: a.ptr[0]@4๋กœ ์ „์ฒด ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด TileTensor ๋ฐ์ดํ„ฐ๋ฅผ ๊ฒ€์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ํ•ต์‹ฌ ๋ฐ˜๋ณต๋ฌธ ์กฐ์‚ฌ

Step 6: ๋ฐ˜๋ณต๋ฌธ ๋ชจ๋‹ˆํ„ฐ๋ง ์„ค์ •

(cuda-gdb) b 42
Breakpoint 1 at 0x7fffd326ffd0: file problems/p09/p09.mojo, line 42.
(cuda-gdb) c
Continuing.

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:42
42              idx = thread_id + offset - 1

๐Ÿ” ์ด์ œ ๋ฐ˜๋ณต๋ฌธ ๋ณธ๋ฌธ ์•ˆ์— ์žˆ์Šต๋‹ˆ๋‹ค. ์ง์ ‘ ๋ฐ˜๋ณต ํšŸ์ˆ˜๋ฅผ ์„ธ์–ด๋ด…์‹œ๋‹ค.

Step 7: ์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 0)

(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
41          for offset in range(ITER):

์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต ์™„๋ฃŒ: ๋ฐ˜๋ณต๋ฌธ์ด 42๋ฒˆ ์ค„ โ†’ 43๋ฒˆ ์ค„ โ†’ 41๋ฒˆ ์ค„๋กœ ๋Œ์•„์™”์Šต๋‹ˆ๋‹ค. ๋ฐ˜๋ณต๋ฌธ์ด ๊ณ„์†๋ฉ๋‹ˆ๋‹ค.

Step 8: ๋‘ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 1)

(cuda-gdb) n

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
42              idx = thread_id + offset - 1
(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
44                  value = rebind[Scalar[dtype]](input[idx])
(cuda-gdb) n
45                  window_sum += value
(cuda-gdb) n
43              if 0 <= idx < SIZE:
(cuda-gdb) n
41          for offset in range(ITER):

๋‘ ๋ฒˆ์งธ ๋ฐ˜๋ณต ์™„๋ฃŒ: ์ด๋ฒˆ์—๋Š” if ๋ธ”๋ก(44-45๋ฒˆ ์ค„)์„ ํ†ต๊ณผํ–ˆ์Šต๋‹ˆ๋‹ค.

Step 9: ์„ธ ๋ฒˆ์งธ ๋ฐ˜๋ณต ํ…Œ์ŠคํŠธ

(cuda-gdb) n
47          output[thread_id] = window_sum

๊ฒฐ์ •์  ๋ฐœ๊ฒฌ: ๋ฐ˜๋ณต๋ฌธ์ด 2๋ฒˆ๋งŒ ๋Œ๊ณ  ์ข…๋ฃŒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค! 42๋ฒˆ ์ค„์˜ ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ์— ๋‹ค์‹œ ๊ฑธ๋ฆฌ์ง€ ์•Š๊ณ  47๋ฒˆ ์ค„๋กœ ๋ฐ”๋กœ ๋„˜์–ด๊ฐ”์Šต๋‹ˆ๋‹ค.

๊ฒฐ๋ก : ๋ฐ˜๋ณต๋ฌธ์ด ์ •ํ™•ํžˆ 2๋ฒˆ ๋Œ๊ณ  ์ข…๋ฃŒ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Step 10: ์ปค๋„ ์‹คํ–‰ ์™„๋ฃŒ์™€ ์ปจํ…์ŠคํŠธ ์†์‹ค

(cuda-gdb) n
31      fn process_sliding_window(
(cuda-gdb) n
[Switching to Thread 0x7ffff7cc0e00 (LWP 110927)]
0x00007ffff064f84a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) p output.ptr[0]@4
No symbol "output" in current context.
(cuda-gdb) p offset
No symbol "offset" in current context.

๐Ÿ” ์ปจํ…์ŠคํŠธ ์†์‹ค: ์ปค๋„ ์‹คํ–‰์ด ๋๋‚˜๋ฉด ์ปค๋„ ๋ณ€์ˆ˜์— ๋” ์ด์ƒ ์ ‘๊ทผํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ์ •์ƒ์ ์ธ ๋™์ž‘์ž…๋‹ˆ๋‹ค.

4๋‹จ๊ณ„: ๊ทผ๋ณธ ์›์ธ ๋ถ„์„

Step 11: ๊ด€์ฐฐ๋œ ์‹คํ–‰์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„

๋””๋ฒ„๊น… ์„ธ์…˜์—์„œ ๊ด€์ฐฐํ•œ ๊ฒƒ:

  1. ๋ฐ˜๋ณต ํšŸ์ˆ˜: 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต (offset = 0, offset = 1)
  2. ๊ธฐ๋Œ€๊ฐ’: ํฌ๊ธฐ 3์˜ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ๋Š” 3๋ฒˆ ๋ฐ˜๋ณตํ•ด์•ผ ํ•จ (offset = 0, 1, 2)
  3. ๋ˆ„๋ฝ: ์„ธ ๋ฒˆ์งธ ๋ฐ˜๋ณต (offset = 2)

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐํ•ด์•ผ ํ•  ๊ฒƒ:

  • ์Šค๋ ˆ๋“œ 0: window_sum = input[-1] + input[0] + input[1] = (๊ฒฝ๊ณ„) + 0 + 1 = 1.0
  • ์Šค๋ ˆ๋“œ 1: window_sum = input[0] + input[1] + input[2] = 0 + 1 + 2 = 3.0
  • ์Šค๋ ˆ๋“œ 2: window_sum = input[1] + input[2] + input[3] = 1 + 2 + 3 = 6.0
  • ์Šค๋ ˆ๋“œ 3: window_sum = input[2] + input[3] + input[4] = 2 + 3 + (๊ฒฝ๊ณ„) = 5.0

Step 12: ์Šค๋ ˆ๋“œ 0์˜ ์‹ค์ œ ์‹คํ–‰ ์ถ”์ 

2๋ฒˆ๋งŒ ๋ฐ˜๋ณตํ•  ๊ฒฝ์šฐ (offset = 0, 1):

๋ฐ˜๋ณต 1 (offset = 0):

  • idx = thread_id + offset - 1 = 0 + 0 - 1 = -1
  • if 0 <= idx < SIZE: โ†’ if 0 <= -1 < 4: โ†’ False
  • ํ•ฉ์‚ฐ ์—ฐ์‚ฐ ๊ฑด๋„ˆ๋œ€

๋ฐ˜๋ณต 2 (offset = 1):

  • idx = thread_id + offset - 1 = 0 + 1 - 1 = 0
  • if 0 <= idx < SIZE: โ†’ if 0 <= 0 < 4: โ†’ True
  • window_sum += input[0] โ†’ window_sum += 0

๋ˆ„๋ฝ๋œ ๋ฐ˜๋ณต 3 (offset = 2):

  • idx = thread_id + offset - 1 = 0 + 2 - 1 = 1
  • if 0 <= idx < SIZE: โ†’ if 0 <= 1 < 4: โ†’ True
  • window_sum += input[1] โ†’ window_sum += 1 โ† ์ด ์—ฐ์‚ฐ์ด ์‹คํ–‰๋˜์ง€ ์•Š์Œ

๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 0์€ window_sum = 0 + 1 = 1 ๋Œ€์‹  window_sum = 0์„ ์–ป์Šต๋‹ˆ๋‹ค

5๋‹จ๊ณ„: ๋ฒ„๊ทธ ํ™•์ธ

๋ฌธ์ œ ์ฝ”๋“œ๋ฅผ ๋ณด๋ฉด:

comptime ITER = 2                       # โ† ๋ฒ„๊ทธ: 3์ด์–ด์•ผ ํ•จ!

for offset in range(ITER):           # โ† 2๋ฒˆ๋งŒ ๋ฐ˜๋ณต: [0, 1]
    idx = Int(thread_id) + offset - 1     # โ† offset = 2 ๋ˆ„๋ฝ
    if 0 <= idx < SIZE:
        value = rebind[Scalar[dtype]](a[idx])
        window_sum += value

๐ŸŽฏ ๊ทผ๋ณธ ์›์ธ ํ™•์ธ: ํฌ๊ธฐ 3์˜ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ๋ฅผ ์œ„ํ•ด ITER = 2๊ฐ€ ITER = 3์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜์ • ๋ฐฉ๋ฒ•: ์†Œ์Šค ์ฝ”๋“œ์—์„œ comptime ITER = 2๋ฅผ comptime ITER = 3์œผ๋กœ ๋ณ€๊ฒฝํ•ฉ๋‹ˆ๋‹ค.
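์ด ์ˆ˜์ •์ด ๊ด€์ฐฐ๋œ ์ฆ์ƒ์„ ์ •ํ™•ํžˆ ์„ค๋ช…ํ•˜๋Š”์ง€๋Š” ๊ฐ„๋‹จํ•œ ์ˆœ์ฐจ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์œผ๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ปค๋„ ๋กœ์ง์„ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜(๊ฐ€์ •: GPU ์Šค๋ ˆ๋“œ๋ฅผ ์ˆœ์ฐจ ๋ฃจํ”„๋กœ ๋Œ€์ฒด)๋กœ, ITER = 2๋Š” ๊ด€์ฐฐ๋œ ์ž˜๋ชป๋œ ์ถœ๋ ฅ์„, ITER = 3์€ ๊ธฐ๋Œ€๊ฐ’์„ ์žฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

```python
def kernel_sim(a, iter_count):
    # ์ปค๋„์˜ ์Šค๋ ˆ๋“œ๋ณ„ ๋กœ์ง์„ ์ˆœ์ฐจ ์‹คํ–‰: idx = thread_id + offset - 1
    size = len(a)
    out = []
    for t in range(size):                 # GPU ์Šค๋ ˆ๋“œ ํ•˜๋‚˜๋‹น ์ถœ๋ ฅ ํ•˜๋‚˜
        window_sum = 0.0
        for offset in range(iter_count):  # ๋ฒ„๊ทธ: iter_count=2๋ฉด ๋งˆ์ง€๋ง‰ ์ด์›ƒ ๋ˆ„๋ฝ
            idx = t + offset - 1
            if 0 <= idx < size:           # ์ปค๋„๊ณผ ๋™์ผํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
                window_sum += a[idx]
        out.append(window_sum)
    return out

a = [0, 1, 2, 3]
print(kernel_sim(a, 2))  # [0.0, 1.0, 3.0, 5.0] - ๊ด€์ฐฐ๋œ ๋ฒ„๊ทธ ์ถœ๋ ฅ
print(kernel_sim(a, 3))  # [1.0, 3.0, 6.0, 5.0] - ์ˆ˜์ • ํ›„ ๊ธฐ๋Œ€๊ฐ’
```

ITER ๊ฐ’ ํ•˜๋‚˜๋งŒ ๋ฐ”๊ฟ”๋„ ๋‘ ์ถœ๋ ฅ์ด ์ •ํ™•ํžˆ ์žฌํ˜„๋˜๋ฏ€๋กœ, ๋””๋ฒ„๊น…์œผ๋กœ ์„ธ์šด ๊ฐ€์„ค์ด ๊ทผ๋ณธ ์›์ธ๊ณผ ์ผ์น˜ํ•จ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.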

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์—†์„ ๋•Œ:

  1. ์‹คํ–‰ ํ๋ฆ„์— ์ง‘์ค‘ - ๋ธŒ๋ ˆ์ดํฌํฌ์ธํŠธ๊ฐ€ ๋ช‡ ๋ฒˆ ๊ฑธ๋ฆฌ๋Š”์ง€, ๋ฐ˜๋ณต์ด ๋ช‡ ๋ฒˆ ๋„๋Š”์ง€ ์„ธ์–ด๋ณด์„ธ์š”
  2. ์ˆ˜ํ•™์  ์ถ”๋ก  ์‚ฌ์šฉ - ์ผ์–ด๋‚˜์•ผ ํ•  ์ผ๊ณผ ์‹ค์ œ๋กœ ์ผ์–ด๋‚˜๋Š” ์ผ์„ ๋”ฐ์ ธ๋ณด์„ธ์š”
  3. ํŒจํ„ด ๋ถ„์„ - ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๊ฐ€ ์กฐ์‚ฌ๋ฅผ ์ด๋Œ๋„๋ก ํ•˜์„ธ์š”
  4. ๊ต์ฐจ ๊ฒ€์ฆ - ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•ด ๊ฐ€์„ค์„ ํ…Œ์ŠคํŠธํ•˜์„ธ์š”

์ „๋ฌธ์ ์ธ GPU ๋””๋ฒ„๊น…์˜ ํ˜„์‹ค:

  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™” ๋•Œ๋ฌธ์— ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์‹คํŒจํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค
  • ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„์ด ๋ฐ์ดํ„ฐ ๊ฒ€์‚ฌ๋ณด๋‹ค ๋” ์‹ ๋ขฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ํ˜ธ์ŠคํŠธ ์ถœ๋ ฅ ํŒจํ„ด์ด ์ค‘์š”ํ•œ ๋””๋ฒ„๊น… ๋‹จ์„œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
  • ์†Œ์Šค ์ฝ”๋“œ ์ถ”๋ก ์ด ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ๊ธฐ๋Šฅ์„ ๋ณด์™„ํ•ฉ๋‹ˆ๋‹ค

TileTensor ๋””๋ฒ„๊น…:

  • TileTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•ด๋„ ๊ทผ๋ณธ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ฒ„๊ทธ๋Š” ๊ทธ๋Œ€๋กœ ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค
  • ํ…์„œ ๋‚ด์šฉ์„ ๊ฒ€์‚ฌํ•˜๋ ค ํ•˜๊ธฐ๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋กœ์ง์— ์ง‘์ค‘ํ•˜์„ธ์š”
  • ์ฒด๊ณ„์ ์ธ ์ถ”๋ก ์œผ๋กœ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹ค์ œ๋กœ ์ ‘๊ทผํ•˜๋Š” ๊ฒƒ์„ ์ถ”์ ํ•˜์„ธ์š”

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋Ÿฐ ์œ ํ˜•์˜ off-by-one (์—ญ์ฃผ: ๊ฒฝ๊ณ„๊ฐ’์ด 1๋งŒํผ ์–ด๊ธ‹๋‚˜๋Š” ์˜ค๋ฅ˜) ๋ฐ˜๋ณต๋ฌธ ๋ฒ„๊ทธ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งค์šฐ ํ”ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ• - ์ œํ•œ๋œ ๋””๋ฒ„๊ฑฐ ์ •๋ณด์— ์ˆ˜ํ•™์  ๋ถ„์„๊ณผ ํŒจํ„ด ์ธ์‹์„ ๊ฒฐํ•ฉํ•˜๋Š” ๊ฒƒ - ์€ ๋„๊ตฌ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ „๋ฌธ GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ๋””๋ฒ„๊น…ํ•˜๋Š” ๋ฐฉ์‹ ๊ทธ๋Œ€๋กœ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: ๋กœ์ง ๋ฒ„๊ทธ์—์„œ ๊ต์ฐฉ ์ƒํƒœ๋กœ

๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค! ์ด์ œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • โœ… ํฌ๋ž˜์‹œ๋‚˜ ๋šœ๋ ทํ•œ ์ฆ์ƒ ์—†์ด๋„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • โœ… ํŒจํ„ด ๋ถ„์„์œผ๋กœ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์—์„œ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ 
  • โœ… ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„์œผ๋กœ ๋ณ€์ˆ˜ ์ ‘๊ทผ์ด ์ œํ•œ๋œ ์ƒํ™ฉ์—์„œ ๋””๋ฒ„๊น…
  • โœ… ๋””๋ฒ„๊ฑฐ ๋„๊ตฌ์— ํ•œ๊ณ„๊ฐ€ ์žˆ์„ ๋•Œ ์ˆ˜ํ•™์  ์ถ”๋ก  ์ ์šฉ

๋งˆ์ง€๋ง‰ ๋„์ „: ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ทธ๋Ÿฐ๋ฐ ํ”„๋กœ๊ทธ๋žจ์ด ํฌ๋ž˜์‹œํ•˜์ง€๋„ ์•Š๊ณ  ๋๋‚˜์ง€๋„ ์•Š๋Š”๋‹ค๋ฉด์š”? ๊ทธ๋ƒฅ ์˜์›ํžˆ ๋ฉˆ์ถฐ๋ฒ„๋ฆฐ๋‹ค๋ฉด์š”?

์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€๋Š” ๊ถ๊ทน์˜ ๋””๋ฒ„๊น… ๋„์ „์„ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค:

  • โŒ ํฌ๋ž˜์‹œ ๋ฉ”์‹œ์ง€ ์—†์Œ (์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์ฒ˜๋Ÿผ)
  • โŒ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์—†์Œ (๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€์ฒ˜๋Ÿผ)
  • โŒ ์™„๋ฃŒ ์ž์ฒด๊ฐ€ ์—†์Œ - ๊ทธ๋ƒฅ ๋ฌดํ•œํžˆ ๋ฉˆ์ถค
  • โœ… ๊ณ ๊ธ‰ ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋ถ„์„์ด ํ•„์š”ํ•œ ์กฐ์šฉํ•œ ๊ต์ฐฉ ์ƒํƒœ

์ƒˆ๋กญ๊ฒŒ ์ตํžˆ๊ฒŒ ๋  ์Šคํ‚ฌ:

  • ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€ - ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ์—์„œ ์กฐ์œจ ์‹คํŒจ ์ฐพ๊ธฐ
  • ๋ฉ€ํ‹ฐ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ๋™์‹œ์— ๊ฒ€์‚ฌํ•˜๊ธฐ
  • ๋™๊ธฐํ™” ๋””๋ฒ„๊น… - ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ์‹คํŒจ ์ดํ•ดํ•˜๊ธฐ

๋””๋ฒ„๊น… ์ง„ํ™”:

  1. ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํฌ๋ž˜์‹œ ์‹ ํ˜ธ ๋”ฐ๋ผ๊ฐ€๊ธฐ โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ์ฐพ๊ธฐ
  2. ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ๊ฒฐ๊ณผ ํŒจํ„ด ๋ถ„์„ํ•˜๊ธฐ โ†’ ๋กœ์ง ๋ฒ„๊ทธ ์ฐพ๊ธฐ
  3. ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€: ์Šค๋ ˆ๋“œ ์ƒํƒœ ์กฐ์‚ฌํ•˜๊ธฐ โ†’ ์กฐ์œจ ๋ฒ„๊ทธ ์ฐพ๊ธฐ

์ด์ „ ๋‘ ์‚ฌ๋ก€์—์„œ ๋ฐฐ์šด ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ์Šคํ‚ฌ - ๊ฐ€์„ค ์ˆ˜๋ฆฝ, ์ฆ๊ฑฐ ์ˆ˜์ง‘, ํŒจํ„ด ๋ถ„์„ - ์€ ๊ฐ€์žฅ ์–ด๋ ค์šด GPU ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ๋•Œ ํ•ต์‹ฌ์ด ๋ฉ๋‹ˆ๋‹ค: ์กฐ์œจ์ด ์–ด๊ธ‹๋‚˜ ์˜์›ํžˆ ์„œ๋กœ๋ฅผ ๊ธฐ๋‹ค๋ฆฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค.

๐Ÿ•ต ํƒ์ • ์ˆ˜์‚ฌ: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€

๊ฐœ์š”

๋ฉ”๋ชจ๋ฆฌ ํฌ๋ž˜์‹œ์™€ ๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค. ์ด์ œ GPU ๋””๋ฒ„๊น…์˜ ์ตœ์ข… ๋ณด์Šค์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค: ํ”„๋กœ๊ทธ๋žจ์ด ๋ฌดํ•œ์ • ๋ฉˆ์ถฐ๋ฒ„๋ฆฌ๋Š” ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ. ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋„, ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋„ ์—†์ด - ๊ทธ์ € ๋์—†๋Š” ์นจ๋ฌต๋งŒ ์žˆ์Šต๋‹ˆ๋‹ค.

๋””๋ฒ„๊น… ์—ฌ์ •์˜ ์™„๊ฒฐ:

  • ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํ”„๋กœ๊ทธ๋žจ ํฌ๋ž˜์‹œ โ†’ ์˜ค๋ฅ˜ ์‹ ํ˜ธ ์ถ”์  โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ
  • ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€: ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ถœ๋ ฅ โ†’ ํŒจํ„ด ๋ถ„์„ โ†’ ๋กœ์ง ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ
  • ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€: ํ”„๋กœ๊ทธ๋žจ ๋ฌดํ•œ ์ •์ง€ โ†’ ์Šค๋ ˆ๋“œ ์ƒํƒœ ์กฐ์‚ฌ โ†’ ์กฐ์œจ ๋ฒ„๊ทธ ๋ฐœ๊ฒฌ

์ด ๊ณ ๊ธ‰ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, TileTensor ์—ฐ์‚ฐ, ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”๊ฐ€ ์–ฝํžŒ ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ์กฐ์‚ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค - ์ด์ „ ์‚ฌ๋ก€๋“ค์—์„œ ์ตํžŒ ์ฒด๊ณ„์ ์ธ ์กฐ์‚ฌ ๊ธฐ์ˆ ์„ ์ด๋™์›ํ•ฉ๋‹ˆ๋‹ค.

์‚ฌ์ „ ์ค€๋น„: Mojo GPU ๋””๋ฒ„๊น…์˜ ํ•ต์‹ฌ, ํƒ์ • ์ˆ˜์‚ฌ: ์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€, ํƒ์ • ์ˆ˜์‚ฌ: ๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€๋ฅผ ๋จผ์ € ์™„๋ฃŒํ•ด์„œ CUDA-GDB ์›Œํฌํ”Œ๋กœ์šฐ, ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ์˜ ํ•œ๊ณ„, ์ฒด๊ณ„์ ์ธ ๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•์„ ์ดํ•ดํ•˜์„ธ์š”. ์•„๋ž˜ ์„ค์ • ๋ช…๋ น์„ ์‹คํ–‰ํ–ˆ๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”:

pixi run -e nvidia setup-cuda-gdb

ํ•ต์‹ฌ ๊ฐœ๋…

์ด๋ฒˆ ๋””๋ฒ„๊น… ์ฑŒ๋ฆฐ์ง€์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€: ์Šค๋ ˆ๋“œ๋“ค์ด ๋™๊ธฐํ™” ์ง€์ ์—์„œ ์˜์›ํžˆ ๊ธฐ๋‹ค๋ฆฌ๊ฒŒ ๋˜๋Š” ์ƒํ™ฉ ์‹๋ณ„ํ•˜๊ธฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ: TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ
  • ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๋ถ„์„: ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ํƒˆ ๋•Œ ๋””๋ฒ„๊น…ํ•˜๊ธฐ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น…: CUDA-GDB๋กœ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์‹คํŒจ ๋ถ„์„ํ•˜๊ธฐ

์ฝ”๋“œ ์‹คํ–‰

๋จผ์ € ์ „์ฒด ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ์ปค๋„๋งŒ ์‚ดํŽด๋ด…์‹œ๋‹ค:

def collaborative_filter(
    output: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, VectorLayout, ImmutAnyOrigin],
):
    var thread_id = thread_idx.x

    # Shared memory workspace for collaborative processing
    var shared_workspace = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[SIZE - 1]())

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](a[thread_id])
    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        # Apply collaborative filter with neighbors
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
        barrier()

    # Phase 3: Final synchronization and output
    barrier()

    # Write filtered results back to output
    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](a[thread_id])


๋ฒ„๊ทธ๋ฅผ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š” (pixi ์ „์šฉ):

pixi run -e nvidia p09 --third-case

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค - ํ”„๋กœ๊ทธ๋žจ์ด ๋ฌดํ•œ์ • ๋ฉˆ์ถฅ๋‹ˆ๋‹ค:

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...
[HANGS FOREVER - Use Ctrl+C to stop]

โš ๏ธ ๊ฒฝ๊ณ : ์ด ํ”„๋กœ๊ทธ๋žจ์€ ๋ฉˆ์ถฐ์„œ ์™„๋ฃŒ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. Ctrl+C๋กœ ์ค‘๋‹จํ•˜์„ธ์š”.

๊ณผ์ œ: ํƒ์ • ์ˆ˜์‚ฌ

๋„์ „: ํ”„๋กœ๊ทธ๋žจ์ด ์ •์ƒ์ ์œผ๋กœ ์‹œ์ž‘๋˜์ง€๋งŒ GPU ์—ฐ์‚ฐ ์ค‘์— ๋ฉˆ์ถฐ์„œ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ˜ํ™˜ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š์€ ์ƒํƒœ์—์„œ, ์ด ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ์กฐ์‚ฌํ•˜๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•์€ ๋ฌด์—‡์ผ๊นŒ์š”?

์ƒ๊ฐํ•ด๋ณผ ์ :

  • GPU ์ปค๋„์ด ์˜์˜ ์™„๋ฃŒ๋˜์ง€ ์•Š๊ฒŒ ๋งŒ๋“œ๋Š” ์›์ธ์€ ๋ฌด์—‡์ผ๊นŒ์š”?
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋ฌธ์ œ๋ฅผ ์–ด๋–ป๊ฒŒ ์กฐ์‚ฌํ•˜์‹œ๊ฒ ์Šต๋‹ˆ๊นŒ?
  • ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ๊ทธ๋ƒฅ โ€œ๋ฉˆ์ถฐ๋ฒ„๋ฆดโ€ ๋•Œ ์–ด๋–ค ๋””๋ฒ„๊น… ์ „๋žต์ด ํ†ตํ• ๊นŒ์š”?
  • ์Šค๋ ˆ๋“œ๋“ค์ด ์ œ๋Œ€๋กœ ํ˜‘๋ ฅํ•˜์ง€ ์•Š์„ ์ˆ˜๋„ ์žˆ๋‹ค๋ฉด ์–ด๋–ป๊ฒŒ ๋””๋ฒ„๊น…ํ• ๊นŒ์š”?
  • ์ฒด๊ณ„์  ์กฐ์‚ฌ(์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€)์™€ ์‹คํ–‰ ํ๋ฆ„ ๋ถ„์„(๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€)์„ ๊ฒฐํ•ฉํ•ด์„œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ์–ด๋–ป๊ฒŒ ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”:

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

GDB ๋ช…๋ น์–ด ๋‹จ์ถ•ํ‚ค (๋น ๋ฅธ ๋””๋ฒ„๊น…)

์ด ๋‹จ์ถ•ํ‚ค๋“ค์„ ์‚ฌ์šฉํ•˜๋ฉด ๋””๋ฒ„๊น… ์„ธ์…˜ ์†๋„๋ฅผ ๋†’์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๋‹จ์ถ•  ์ „์ฒด      ์‚ฌ์šฉ ์˜ˆ์‹œ
r     run       (cuda-gdb) r
n     next      (cuda-gdb) n
c     continue  (cuda-gdb) c
b     break     (cuda-gdb) b 62
p     print     (cuda-gdb) p thread_id
q     quit      (cuda-gdb) q

์•„๋ž˜ ๋ชจ๋“  ๋””๋ฒ„๊น… ๋ช…๋ น์€ ํšจ์œจ์„ฑ์„ ์œ„ํ•ด ๋‹จ์ถ•ํ‚ค๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค!

ํŒ
  1. ์†Œ๋ฆฌ ์—†๋Š” ๋ฉˆ์ถค ์กฐ์‚ฌ - ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ๋ฉˆ์ถฐ๋ฒ„๋ฆด ๋•Œ, GPU์˜ ์–ด๋–ค ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋ฌดํ•œ ๋Œ€๊ธฐ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์„๊นŒ์š”?
  2. ์Šค๋ ˆ๋“œ ์ƒํƒœ ๊ฒ€์‚ฌ - info cuda threads๋กœ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋“ค์ด ์–ด๋””์„œ ๋ฉˆ์ท„๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  3. ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๋ถ„์„ - ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋–ค ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š” (๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฒฝ๋กœ๋ฅผ ๋”ฐ๋ฅด๋‚˜์š”?)
  4. ๋™๊ธฐํ™” ์ง€์  ์กฐ์‚ฌ - ์Šค๋ ˆ๋“œ๋“ค์ด ์กฐ์œจํ•ด์•ผ ํ•  ์ˆ˜๋„ ์žˆ๋Š” ์ง€์ ์„ ์ฐพ์œผ์„ธ์š”
  5. ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ํƒ์ง€ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ํ”„๋กœ๊ทธ๋žจ ์œ„์น˜์— ์žˆ๋‚˜์š”, ์•„๋‹ˆ๋ฉด ์ผ๋ถ€๋Š” ๋‹ค๋ฅธ ๊ณณ์— ์žˆ๋‚˜์š”?
  6. ์กฐ์œจ ๊ธฐ๋ณธ ์š”์†Œ ๋ถ„์„ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋™๊ธฐํ™” ์—ฐ์‚ฐ์— ์ฐธ์—ฌํ•˜์ง€ ์•Š์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋ ๊นŒ์š”?
  7. ์‹คํ–‰ ํ๋ฆ„ ์ถ”์  - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์กฐ๊ฑด๋ฌธ์„ ํ†ตํ•ด ์–ด๋–ค ๊ฒฝ๋กœ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋Š”์ง€ ์ถ”์ ํ•˜์„ธ์š”
  8. ์Šค๋ ˆ๋“œ ID ์˜ํ–ฅ ๋ถ„์„ - ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ID๊ฐ€ ์–ด๋–ค ์ฝ”๋“œ ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰ํ• ์ง€ ์–ด๋–ป๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
๐Ÿ’ก ์กฐ์‚ฌ ๊ณผ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

CUDA-GDB๋กœ ๋‹จ๊ณ„๋ณ„ ์กฐ์‚ฌ

1๋‹จ๊ณ„: ์‹คํ–‰๊ณผ ์ดˆ๊ธฐ ์„ค์ •

Step 1: ๋””๋ฒ„๊ฑฐ ์‹คํ–‰

pixi run -e nvidia mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

Step 2: ์ •์ง€ ํ˜„์ƒ ๋ถ„์„

๋””๋ฒ„๊น…์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ์•Œ๊ณ  ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ •๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ๋Œ€๊ฐ’: ํ”„๋กœ๊ทธ๋žจ์ด ์™„๋ฃŒ๋˜๊ณ  ํ•„ํ„ฐ๋ง๋œ ๊ฒฐ๊ณผ ํ‘œ์‹œ
์‹ค์ œ: "Waiting for GPU computation to complete..."์—์„œ ๋ฉˆ์ถค

๐Ÿ” ์ดˆ๊ธฐ ๊ฐ€์„ค: GPU ์ปค๋„์ด ๊ต์ฐฉ ์ƒํƒœ์— ๋น ์ง - ์–ด๋–ค ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์Šค๋ ˆ๋“œ๋“ค์„ ์˜์›ํžˆ ๋Œ€๊ธฐ์‹œํ‚ค๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

2๋‹จ๊ณ„: ์ปค๋„ ์ง„์ž…

Step 3: ์‹คํ–‰ ๋ฐ ์ปค๋„ ์ง„์ž… ๊ด€์ฐฐ

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --third-case

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> (
    output=..., a=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:56
56          a: TileTensor[mut=False, dtype, vector_layout],

๐Ÿ” ์ฃผ์š” ๊ด€์ฐฐ:

  • Grid: (1,1,1) - ๋‹จ์ผ ๋ธ”๋ก
  • Block: (4,1,1) - ์ด 4๊ฐœ ์Šค๋ ˆ๋“œ (0, 1, 2, 3)
  • ํ˜„์žฌ ์Šค๋ ˆ๋“œ: (0,0,0) - ์Šค๋ ˆ๋“œ 0 ๋””๋ฒ„๊น… ์ค‘
  • ํ•จ์ˆ˜: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” collaborative_filter

Step 4: ์ดˆ๊ธฐํ™” ๊ณผ์ • ํƒ์ƒ‰

(cuda-gdb) n
55          output: TileTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
58          thread_id = thread_idx.x
(cuda-gdb) n
66          ].stack_allocation()
(cuda-gdb) n
69          if thread_id < SIZE - 1:
(cuda-gdb) p thread_id
$1 = 0

โœ… ์Šค๋ ˆ๋“œ 0 ์ƒํƒœ: thread_id = 0, ์กฐ๊ฑด 0 < 3 ๊ฒ€์‚ฌ ์ง์ „ โ†’ True

Step 5: 1๋‹จ๊ณ„ ์ถ”์ 

(cuda-gdb) n
70              shared_workspace[thread_id] = rebind[Scalar[dtype]](a[thread_id])
(cuda-gdb) n
69          if thread_id < SIZE - 1:
(cuda-gdb) n
71          barrier()

1๋‹จ๊ณ„ ์™„๋ฃŒ: ์Šค๋ ˆ๋“œ 0์ด ์ดˆ๊ธฐํ™”๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ–ˆ์Šต๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ๊ฒฐ์ •์ ์ธ ๋ฐฐ๋ฆฌ์–ด ์กฐ์‚ฌ

Step 6: ์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด ๊ฒ€์‚ฌ

(cuda-gdb) n
74          if thread_id < SIZE - 1:
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffd3272180 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    74

โœ… ์ •์ƒ: 4๊ฐœ ์Šค๋ ˆ๋“œ ๋ชจ๋‘ 74๋ฒˆ ์ค„(์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด ํ†ต๊ณผ ํ›„)์— ์žˆ์Šต๋‹ˆ๋‹ค. ์ฒซ ๋ฒˆ์งธ ๋ฐฐ๋ฆฌ์–ด๋Š” ์ •์ƒ ์ž‘๋™ํ–ˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ” ๊ฒฐ์ •์  ์ง€์ : ์ด์ œ ๋˜ ๋‹ค๋ฅธ ์กฐ๊ฑด๋ฌธ์ด ์žˆ๋Š” 2๋‹จ๊ณ„์— ์ง„์ž…ํ•ฉ๋‹ˆ๋‹ค.

Step 7: 2๋‹จ๊ณ„ ์ถ”์  - ์Šค๋ ˆ๋“œ 0 ๊ด€์ 

(cuda-gdb) n
76              if thread_id > 0:

์Šค๋ ˆ๋“œ 0 ๋ถ„์„: 0 < 3 โ†’ True โ†’ ์Šค๋ ˆ๋“œ 0์ด 2๋‹จ๊ณ„ ๋ธ”๋ก์— ์ง„์ž…

(cuda-gdb) n
78              barrier()

์Šค๋ ˆ๋“œ 0 ๊ฒฝ๋กœ: 0 > 0 โ†’ False โ†’ ์Šค๋ ˆ๋“œ 0์ด ๋‚ด๋ถ€ ์—ฐ์‚ฐ์€ ๊ฑด๋„ˆ๋›ฐ์ง€๋งŒ 78๋ฒˆ ์ค„์˜ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌ

๊ฒฐ์ •์  ์ˆœ๊ฐ„: ์Šค๋ ˆ๋“œ 0์ด ์ด์ œ 78๋ฒˆ ์ค„์˜ ๋ฐฐ๋ฆฌ์–ด์—์„œ ๋Œ€๊ธฐ ์ค‘์ž…๋‹ˆ๋‹ค.

(cuda-gdb) n # <-- ์‹คํ–‰ํ•˜๋ฉด ํ”„๋กœ๊ทธ๋žจ์ด ๋ฉˆ์ถฅ๋‹ˆ๋‹ค!
[HANGS HERE - ํ”„๋กœ๊ทธ๋žจ์ด ์ด ์ง€์ ์„ ๋„˜์–ด๊ฐ€์ง€ ๋ชปํ•จ]

Step 8: ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ์กฐ์‚ฌ

(cuda-gdb) cuda thread (1,0,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
78              barrier()
(cuda-gdb) p thread_id
$2 = 1
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (2,0,0)     3 0x00007fffd3273aa0 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    78
   (0,0,0)   (3,0,0)     (0,0,0)      (3,0,0)     1 0x00007fffd3273b10 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    81

๊ฒฐ์ •์  ์ฆ๊ฑฐ ๋ฐœ๊ฒฌ:

  • ์Šค๋ ˆ๋“œ 0, 1, 2: 78๋ฒˆ ์ค„์—์„œ ๋ชจ๋‘ ๋Œ€๊ธฐ ์ค‘ (์กฐ๊ฑด ๋ธ”๋ก ์•ˆ์˜ ๋ฐฐ๋ฆฌ์–ด)
  • ์Šค๋ ˆ๋“œ 3: 81๋ฒˆ ์ค„์— ์žˆ์Œ (์กฐ๊ฑด ๋ธ”๋ก์„ ์ง€๋‚˜์ณค๊ณ , ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•œ ์  ์—†์Œ!)

Step 9: ์Šค๋ ˆ๋“œ 3์˜ ์‹คํ–‰ ๊ฒฝ๋กœ ๋ถ„์„

๐Ÿ” info ์ถœ๋ ฅ์œผ๋กœ ๋ณธ ์Šค๋ ˆ๋“œ 3 ๋ถ„์„:

  • ์Šค๋ ˆ๋“œ 3: 81๋ฒˆ ์ค„์— ์œ„์น˜ (PC: 0x00007fffd3273b10)
  • 2๋‹จ๊ณ„ ์กฐ๊ฑด: thread_id < SIZE - 1 โ†’ 3 < 3 โ†’ False
  • ๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 3์€ 2๋‹จ๊ณ„ ๋ธ”๋ก(74-78๋ฒˆ ์ค„)์— ์ง„์ž…ํ•˜์ง€ ์•Š์Œ
  • ๊ฒฐ๊ณผ: ์Šค๋ ˆ๋“œ 3์€ 78๋ฒˆ ์ค„์˜ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•œ ์  ์—†์Œ
  • ํ˜„์žฌ ์ƒํƒœ: ์Šค๋ ˆ๋“œ 3์€ 81๋ฒˆ ์ค„(๋งˆ์ง€๋ง‰ ๋ฐฐ๋ฆฌ์–ด)์— ์žˆ๊ณ , ์Šค๋ ˆ๋“œ 0,1,2๋Š” 78๋ฒˆ ์ค„์—์„œ ๊ฐ‡ํ˜€ ์žˆ์Œ

4๋‹จ๊ณ„: ๊ทผ๋ณธ ์›์ธ ๋ถ„์„

Step 10: ๊ต์ฐฉ ์ƒํƒœ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์‹๋ณ„

# 2๋‹จ๊ณ„: ํ˜‘๋ ฅ์  ์ฒ˜๋ฆฌ
if thread_id < SIZE - 1:        # โ† ์Šค๋ ˆ๋“œ 0, 1, 2๋งŒ ์ด ๋ธ”๋ก์— ์ง„์ž…
    # ์ด์›ƒ๊ณผ ํ˜‘๋ ฅ ํ•„ํ„ฐ ์ ์šฉ
    if thread_id > 0:
        shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    barrier()                   # โ† ๊ต์ฐฉ ์ƒํƒœ: 4๊ฐœ ์ค‘ 3๊ฐœ ์Šค๋ ˆ๋“œ๋งŒ ์—ฌ๊ธฐ์— ๋„๋‹ฌ!

๐Ÿ’€ ๊ต์ฐฉ ์ƒํƒœ ๋ฉ”์ปค๋‹ˆ์ฆ˜:

  1. ์Šค๋ ˆ๋“œ 0: 0 < 3 → True → ๋ธ”๋ก ์ง„์ž… → ๋ฐฐ๋ฆฌ์–ด์—์„œ ๋Œ€๊ธฐ (78๋ฒˆ ์ค„)
  2. ์Šค๋ ˆ๋“œ 1: 1 < 3 → True → ๋ธ”๋ก ์ง„์ž… → ๋ฐฐ๋ฆฌ์–ด์—์„œ ๋Œ€๊ธฐ (78๋ฒˆ ์ค„)
  3. ์Šค๋ ˆ๋“œ 2: 2 < 3 → True → ๋ธ”๋ก ์ง„์ž… → ๋ฐฐ๋ฆฌ์–ด์—์„œ ๋Œ€๊ธฐ (78๋ฒˆ ์ค„)
  4. ์Šค๋ ˆ๋“œ 3: 3 < 3 → False → ๋ธ”๋ก์— ์ง„์ž… ์•ˆ ํ•จ → 81๋ฒˆ ์ค„๋กœ ๊ณ„์† ์ง„ํ–‰

๊ฒฐ๊ณผ: 3๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ 4๋ฒˆ์งธ ์Šค๋ ˆ๋“œ๋ฅผ ์˜์›ํžˆ ๊ธฐ๋‹ค๋ฆฌ์ง€๋งŒ, ์Šค๋ ˆ๋“œ 3์€ ๊ทธ ๋ฐฐ๋ฆฌ์–ด์— ์ ˆ๋Œ€ ๋„์ฐฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
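์ด ๊ต์ฐฉ ์ƒํƒœ ๋ฉ”์ปค๋‹ˆ์ฆ˜์€ ํŒŒ์ด์ฌ threading.Barrier๋กœ๋„ ์žฌํ˜„ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. GPU barrier()์˜ ๊ณต์‹ ๋ชจ๋ธ์ด ์•„๋‹ˆ๋ผ ๋‹จ์ˆœํ™”๋œ ์Šค์ผ€์น˜์ด๋ฉฐ (์‹ค์ œ ์ปค๋„์—์„œ๋Š” ์Šค๋ ˆ๋“œ 3๋„ ๋งˆ์ง€๋ง‰ ๋ฐฐ๋ฆฌ์–ด์—์„œ ํ•จ๊ป˜ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค), ๋ฌดํ•œ ๋Œ€๊ธฐ ๋Œ€์‹  ์งง์€ ํƒ€์ž„์•„์›ƒ์œผ๋กœ ๊ต์ฐฉ์„ ๊ฐ์ง€ํ•˜๋„๋ก ํ–ˆ์Šต๋‹ˆ๋‹ค.

```python
import threading

SIZE = 4

def run(barrier_outside_if):
    barrier = threading.Barrier(SIZE)  # GPU barrier()์ฒ˜๋Ÿผ ๋ธ”๋ก์˜ 4๊ฐœ ์Šค๋ ˆ๋“œ ์ „์›์„ ๊ธฐ๋‹ค๋ฆผ
    results = {}

    def worker(tid):
        try:
            if tid < SIZE - 1:              # ์Šค๋ ˆ๋“œ 0, 1, 2๋งŒ ์ง„์ž…ํ•˜๋Š” ์กฐ๊ฑด ๋ธ”๋ก
                # ... ํ˜‘๋ ฅ ์—ฐ์‚ฐ ...
                if not barrier_outside_if:  # ๋ฒ„๊ทธ: ์กฐ๊ฑด๋ฌธ ์•ˆ์˜ ๋ฐฐ๋ฆฌ์–ด
                    barrier.wait(timeout=0.5)
            if barrier_outside_if:          # ์ˆ˜์ •: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋„๋‹ฌํ•˜๋Š” ๋ฐฐ๋ฆฌ์–ด
                barrier.wait(timeout=0.5)
            results[tid] = "ํ†ต๊ณผ"
        except threading.BrokenBarrierError:
            results[tid] = "๊ต์ฐฉ"            # ํƒ€์ž„์•„์›ƒ → ๊ต์ฐฉ ์ƒํƒœ ๊ฐ์ง€

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run(barrier_outside_if=False))  # ์Šค๋ ˆ๋“œ 0-2๋Š” ๊ต์ฐฉ, ์Šค๋ ˆ๋“œ 3๋งŒ ํ†ต๊ณผ
print(run(barrier_outside_if=True))   # 4๊ฐœ ์Šค๋ ˆ๋“œ ๋ชจ๋‘ ํ†ต๊ณผ
```

๋ฒ„๊ทธ ๋ฒ„์ „์—์„œ๋Š” ๋ฐฐ๋ฆฌ์–ด์— 3๊ฐœ ์Šค๋ ˆ๋“œ๋งŒ ๋„๋‹ฌํ•˜๋ฏ€๋กœ ๋Œ€๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜์ง€ ์•Š๊ณ , ์ˆ˜์ • ๋ฒ„์ „์—์„œ๋Š” 4๊ฐœ ์ „์›์ด ๋„๋‹ฌํ•ด ์ •์ƒ ์ง„ํ–‰๋ฉ๋‹ˆ๋‹ค.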

5๋‹จ๊ณ„: ๋ฒ„๊ทธ ํ™•์ธ๊ณผ ํ•ด๊ฒฐ์ฑ…

Step 11: ๊ทผ๋ณธ์ ์ธ ๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™ ์œ„๋ฐ˜

GPU ๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™: ๋™๊ธฐํ™”๊ฐ€ ์™„๋ฃŒ๋˜๋ ค๋ฉด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋ฌด์—‡์ด ์ž˜๋ชป๋˜์—ˆ๋‚˜:

# โŒ ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ์กฐ๊ฑด๋ฌธ ์•ˆ์— ๋ฐฐ๋ฆฌ์–ด
if thread_id < SIZE - 1:    # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ง„์ž…ํ•˜์ง€ ์•Š์Œ
    # ... ์—ฐ์‚ฐ ...
    barrier()               # ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๋งŒ ์—ฌ๊ธฐ์— ๋„๋‹ฌ

# โœ… ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ์กฐ๊ฑด๋ฌธ ๋ฐ–์— ๋ฐฐ๋ฆฌ์–ด
if thread_id < SIZE - 1:    # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ง„์ž…ํ•˜์ง€ ์•Š์Œ
    # ... ์—ฐ์‚ฐ ...
barrier()                   # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฌ๊ธฐ์— ๋„๋‹ฌ

์ˆ˜์ • ๋ฐฉ๋ฒ•: ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์กฐ๊ฑด ๋ธ”๋ก ๋ฐ–์œผ๋กœ ์ด๋™:

def collaborative_filter(
    output: TileTensor[mut=True, dtype, vector_layout],
    a: TileTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x
    shared_workspace = TileTensor[
        dtype,
        row_major[SIZE-1](),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # 1๋‹จ๊ณ„: ๊ณต์œ  ์ž‘์—…๊ณต๊ฐ„ ์ดˆ๊ธฐํ™” (๋ชจ๋“  ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](a[thread_id])
    barrier()

    # 2๋‹จ๊ณ„: ํ˜‘๋ ฅ์  ์ฒ˜๋ฆฌ
    if thread_id < SIZE - 1:
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    # โœ… ์ˆ˜์ •: ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์กฐ๊ฑด๋ฌธ ๋ฐ–์œผ๋กœ ์ด๋™ํ•ด์„œ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋„๋‹ฌํ•˜๋„๋ก
    barrier()

    # 3๋‹จ๊ณ„: ์ตœ์ข… ๋™๊ธฐํ™”์™€ ์ถœ๋ ฅ
    barrier()

    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](a[thread_id])

ํ•ต์‹ฌ ๋””๋ฒ„๊น… ๊ตํ›ˆ

๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ํƒ์ง€:

  1. info cuda threads ์‚ฌ์šฉ - ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋А ์ค„์— ์žˆ๋Š”์ง€ ๋ณด์—ฌ์คŒ
  2. ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„๊ธฐ ์ฐพ๊ธฐ - ์ผ๋ถ€ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋žจ ์œ„์น˜์— ์žˆ์Œ
  3. ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ๊ฒฝ๋กœ ์ถ”์  - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•˜๋Š”์ง€ ํ™•์ธ
  4. ๋ฐฐ๋ฆฌ์–ด ๋„๋‹ฌ ๊ฐ€๋Šฅ์„ฑ ๊ฒ€์ฆ - ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๋“ค์ด ๋„๋‹ฌํ•˜๋Š” ๋ฐฐ๋ฆฌ์–ด๋ฅผ ๊ฑด๋„ˆ๋›ฐ๋Š” ์Šค๋ ˆ๋“œ๊ฐ€ ์—†๋Š”์ง€ ํ™•์ธ

์‹ค๋ฌด GPU ๋””๋ฒ„๊น…์˜ ํ˜„์‹ค:

  • ๊ต์ฐฉ ์ƒํƒœ๋Š” ์†Œ๋ฆฌ ์—†๋Š” ์‚ด์ธ์ž - ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€ ์—†์ด ํ”„๋กœ๊ทธ๋žจ์ด ๊ทธ๋ƒฅ ๋ฉˆ์ถค
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ๋””๋ฒ„๊น…์€ ์ธ๋‚ด๊ฐ€ ํ•„์š” - ๊ฐ ์Šค๋ ˆ๋“œ ๊ฒฝ๋กœ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ๋ถ„์„ํ•ด์•ผ ํ•จ
  • ์กฐ๊ฑด๋ถ€ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ๊ต์ฐฉ ์ƒํƒœ์˜ 1์ˆœ์œ„ ์›์ธ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋™๊ธฐํ™” ์ง€์ ์— ๋„๋‹ฌํ•˜๋Š”์ง€ ํ•ญ์ƒ ํ™•์ธ
  • CUDA-GDB ์Šค๋ ˆ๋“œ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์ˆ˜ - ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์œ ์ผํ•œ ๋ฐฉ๋ฒ•

๊ณ ๊ธ‰ GPU ๋™๊ธฐํ™”:

  • ๋ฐฐ๋ฆฌ์–ด ๊ทœ์น™: ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฐฐ๋ฆฌ์–ด์— ๋„๋‹ฌํ•ด์•ผ ํ•จ
  • ์กฐ๊ฑด๋ถ€ ์‹คํ–‰์˜ ํ•จ์ •: ์–ด๋–ค if๋ฌธ์ด๋“  ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ: ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜์— ์ฃผ์˜ ํ•„์š”
  • TileTensor๊ฐ€ ๊ต์ฐฉ ์ƒํƒœ๋ฅผ ๋ง‰์•„์ฃผ์ง€ ์•Š์Œ: ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ผ๋„ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋Š” ์—ฌ์ „ํžˆ ํ•„์š”

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ๋Š” GPU ๋ฒ„๊ทธ ์ค‘ ๋””๋ฒ„๊น…ํ•˜๊ธฐ ๊ฐ€์žฅ ์–ด๋ ค์šด ์œ ํ˜•์— ์†ํ•ฉ๋‹ˆ๋‹ค:

  • ์˜ค๋ฅ˜๊ฐ€ ๋ณด์ด์ง€ ์•Š์Œ - ๊ทธ์ € ๋ฌดํ•œ ๋Œ€๊ธฐ
  • ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ๋ถ„์„ ํ•„์š” - ์Šค๋ ˆ๋“œ ํ•˜๋‚˜๋งŒ ๋ด์„œ๋Š” ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์—†์Œ
  • ์กฐ์šฉํ•œ ์‹คํŒจ ๋ชจ๋“œ - ์ •ํ™•์„ฑ ๋ฒ„๊ทธ๊ฐ€ ์•„๋‹Œ ์„ฑ๋Šฅ ๋ฌธ์ œ์ฒ˜๋Ÿผ ๋ณด์ž„
  • ๋ณต์žกํ•œ ์Šค๋ ˆ๋“œ ์กฐ์œจ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฑธ์ณ ์‹คํ–‰ ๊ฒฝ๋กœ๋ฅผ ์ถ”์ ํ•ด์•ผ ํ•จ

CUDA-GDB๋กœ ์Šค๋ ˆ๋“œ ์ƒํƒœ๋ฅผ ๋ถ„์„ํ•˜๊ณ , ๋ถ„๊ธฐ๋œ ์‹คํ–‰ ๊ฒฝ๋กœ๋ฅผ ์‹๋ณ„ํ•˜๊ณ , ๋ฐฐ๋ฆฌ์–ด ๋„๋‹ฌ ๊ฐ€๋Šฅ์„ฑ์„ ๊ฒ€์ฆํ•˜๋Š” ์ด ๋””๋ฒ„๊น… ๋ฐฉ์‹์€ ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ์šด์˜ ์‹œ์Šคํ…œ์—์„œ ๊ต์ฐฉ ์ƒํƒœ ๋ฌธ์ œ์— ๋งž๋‹ฅ๋œจ๋ ธ์„ ๋•Œ ์“ฐ๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ •ํ™•ํžˆ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„: GPU ๋””๋ฒ„๊น… ์Šคํ‚ฌ ์™„์„ฑ

GPU ๋””๋ฒ„๊น… ์‚ผ๋ถ€์ž‘์„ ์™„๋ฃŒํ–ˆ์Šต๋‹ˆ๋‹ค!

์™„์„ฑ๋œ GPU ๋””๋ฒ„๊น… ๋ฌด๊ธฐ๊ณ 

์ฒซ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ - ํฌ๋ž˜์‹œ ๋””๋ฒ„๊น…:

  • โœ… ์˜ค๋ฅ˜ ๋ฉ”์‹œ์ง€๋ฅผ ๊ฐ€์ด๋“œ ์‚ผ์•„ ์ฒด๊ณ„์ ์ธ ํฌ๋ž˜์‹œ ์กฐ์‚ฌ
  • โœ… ํฌ์ธํ„ฐ ์ฃผ์†Œ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ ํƒ์ง€
  • โœ… ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ๋ฌธ์ œ๋ฅผ ์œ„ํ•œ CUDA-GDB ๊ธฐ์ดˆ

๋‘ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ - ๋กœ์ง ๋ฒ„๊ทธ ๋””๋ฒ„๊น…:

  • โœ… ๋šœ๋ ทํ•œ ์ฆ์ƒ ์—†์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์˜ค๋ฅ˜ ์กฐ์‚ฌ
  • โœ… ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ๊ทผ๋ณธ ์›์ธ๊นŒ์ง€ ์ถ”์ ํ•˜๋Š” ํŒจํ„ด ๋ถ„์„ ๊ธฐ๋ฒ•
  • โœ… ๋ณ€์ˆ˜ ๊ฒ€์‚ฌ๊ฐ€ ์•ˆ ๋  ๋•Œ ์‹คํ–‰ ํ๋ฆ„ ๋””๋ฒ„๊น…

์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€์—์„œ - ์กฐ์œจ ๋””๋ฒ„๊น…:

  • โœ… ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ๋ฅผ ์œ„ํ•œ ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ ์กฐ์‚ฌ
  • โœ… ๊ณ ๊ธ‰ CUDA-GDB ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•œ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์ƒํƒœ ๋ถ„์„
  • โœ… ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ์„ ์œ„ํ•œ ๋™๊ธฐํ™” ๊ฒ€์ฆ

์ „๋ฌธ๊ฐ€์˜ GPU ๋””๋ฒ„๊น… ๋ฐฉ๋ฒ•๋ก 

์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์ž๋“ค์ด ์‚ฌ์šฉํ•˜๋Š” ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ๋ฒ•์„ ์ตํ˜”์Šต๋‹ˆ๋‹ค:

  1. ์ฆ์ƒ ์ฝ๊ธฐ - ํฌ๋ž˜์‹œ์ธ๊ฐ€? ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ์ธ๊ฐ€? ๋ฌดํ•œ ์ •์ง€์ธ๊ฐ€?
  2. ๊ฐ€์„ค ์ˆ˜๋ฆฝ - ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ? ๋กœ์ง ์˜ค๋ฅ˜? ์กฐ์œจ ๋ฌธ์ œ?
  3. ์ฆ๊ฑฐ ์ˆ˜์ง‘ - ๋ฒ„๊ทธ ์œ ํ˜•์— ๋งž์ถฐ CUDA-GDB๋ฅผ ์ „๋žต์ ์œผ๋กœ ํ™œ์šฉ
  4. ์ฒด๊ณ„์ ์œผ๋กœ ํ…Œ์ŠคํŠธ - ๋ชฉํ‘œ ์ง€ํ–ฅ์  ์กฐ์‚ฌ๋ฅผ ํ†ตํ•ด ๊ฐ ๊ฐ€์„ค ๊ฒ€์ฆ
  5. ๊ทผ๋ณธ ์›์ธ ์ถ”์  - ์ฆ๊ฑฐ์˜ ์—ฐ๊ฒฐ ๊ณ ๋ฆฌ๋ฅผ ๋”ฐ๋ผ ์›์ฒœ๊นŒ์ง€

์—…์  ๋‹ฌ์„ฑ: ์ด์ œ ๊ฐ€์žฅ ํ”ํ•œ ์„ธ ๊ฐ€์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ฌธ์ œ๋ฅผ ๋””๋ฒ„๊น…ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Puzzle 10: ์ƒˆ๋‹ˆํƒ€์ด์ €๋กœ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜์™€ ๊ฒฝ์Ÿ ์ƒํƒœ ์ฐพ๊ธฐ

โš ๏ธ ์ด ํผ์ฆ์€ ํ˜ธํ™˜๋˜๋Š” NVIDIA GPU์—์„œ๋งŒ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ GPU ๋ฒค๋” ์ง€์›์„ ์œ„ํ•œ ๋„๊ตฌ ๊ฐœ๋ฐœ์ด ์ง„ํ–‰ ์ค‘์ž…๋‹ˆ๋‹ค.

๋ชจ๋“  GPU ๊ฐœ๋ฐœ์ž๊ฐ€ ๋‘๋ ค์›Œํ•˜๋Š” ์ˆœ๊ฐ„

์™„๋ฒฝํ•ด ๋ณด์ด๋Š” GPU ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ •ํ™•ํ•˜๊ณ , ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋„ ์˜ฌ๋ฐ”๋ฅธ ๊ฒƒ ๊ฐ™๊ณ , ์Šค๋ ˆ๋“œ ์กฐ์œจ๋„ ํ ์žก์„ ๋ฐ ์—†์–ด ๋ณด์ž…๋‹ˆ๋‹ค. ์ž์‹  ์žˆ๊ฒŒ ํ…Œ์ŠคํŠธ๋ฅผ ์‹คํ–‰ํ•˜๋ฉดโ€ฆ

  • โœ… ๋ชจ๋“  ํ…Œ์ŠคํŠธ ํ†ต๊ณผ
  • โœ… ์„ฑ๋Šฅ๋„ ํ›Œ๋ฅญํ•จ
  • โœ… ์ถœ๋ ฅ์ด ์˜ˆ์ƒ ๊ฒฐ๊ณผ์™€ ์ผ์น˜

๋ฟŒ๋“ฏํ•˜๊ฒŒ ์ฝ”๋“œ๋ฅผ ํ”„๋กœ๋•์…˜์— ๋ฐฐํฌํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๋ช‡ ์ฃผ ํ›„, ์—ฐ๋ฝ์ด ์˜ต๋‹ˆ๋‹ค:

  • โ€œํ”„๋กœ๋•์…˜์—์„œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์ด ํฌ๋ž˜์‹œ๋์–ด์š”โ€
  • โ€œ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์š”โ€
  • โ€œ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ์ด ๊ฐ์ง€๋์–ด์š”โ€

์กฐ์šฉํžˆ ์ˆจ์–ด๋“œ๋Š” GPU ๋ฒ„๊ทธ์˜ ์„ธ๊ณ„์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค. ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์˜ ๊ทธ๋Š˜์— ์ˆจ์–ด ์žˆ๋‹ค๊ฐ€ ๊ฐ€์žฅ ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ์ˆœ๊ฐ„์— ํŠ€์–ด๋‚˜์˜ค๋Š” ์˜ค๋ฅ˜๋“ค์ด์ฃ . ์ด๋Ÿฐ ๋ฒ„๊ทธ๋“ค์€ ๋ชจ๋“  ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๊ณ , 99%์˜ ๊ฒฝ์šฐ ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋‹ค๊ฐ€, ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์ˆœ๊ฐ„์— ์น˜๋ช…์ ์œผ๋กœ ์‹คํŒจํ•ฉ๋‹ˆ๋‹ค.

์ค‘์š”: ์ด ํผ์ฆ์€ NVIDIA GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ํ•„์š”ํ•˜๋ฉฐ, compute-sanitizer๊ฐ€ NVIDIA CUDA toolkit์— ํฌํ•จ๋˜์–ด ์žˆ์–ด pixi๋ฅผ ํ†ตํ•ด์„œ๋งŒ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ๋ฒ„๊ทธ๊ฐ€ ์œ ๋‚œํžˆ ๊ตํ™œํ•œ ์ด์œ 

CPU ํ”„๋กœ๊ทธ๋žจ์—์„œ๋Š” ๋ฒ„๊ทธ๊ฐ€ ๋ณดํ†ต ์ฆ‰๊ฐ์ ์ธ ํฌ๋ž˜์‹œ๋‚˜ ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋กœ ์ž์‹ ์˜ ์กด์žฌ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ GPU ๋ฒ„๊ทธ๋Š” ์ˆจ๊ธฐ์˜ ๋‹ฌ์ธ์ž…๋‹ˆ๋‹ค:

์กฐ์šฉํžˆ ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ค๋Š” ํŒจํ„ด:

  • ํฌ๋ž˜์‹œ ์—†๋Š” ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜: ์šฐ์—ฐํžˆ ์œ ํšจํ•œ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜๋ฅผ ๊ฑด๋“œ๋ฆฌ๋Š” ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ
  • โ€œ๋Œ€๋ถ€๋ถ„์€ ์ž˜ ๋™์ž‘ํ•˜๋Š”โ€ ๊ฒฝ์Ÿ ์ƒํƒœ: ํƒ€์ด๋ฐ์— ๋”ฐ๋ผ ๋ฌด์ž‘์œ„์ฒ˜๋Ÿผ ๋‚˜ํƒ€๋‚˜๋Š” ๋ฒ„๊ทธ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ ์‹คํŒจ: ํŠน์ • ๋ถ€ํ•˜ ์กฐ๊ฑด์—์„œ๋งŒ ๋ฐœ์ƒํ•˜๋Š” ๊ต์ฐฉ ์ƒํƒœ
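"๋Œ€๋ถ€๋ถ„์€ ์ž˜ ๋™์ž‘ํ•˜๋Š”" ๊ฒฝ์Ÿ ์ƒํƒœ์˜ ํ•ต์‹ฌ์ธ ๊ฐฑ์‹  ์†์‹ค(lost update)์€ ์ธํ„ฐ๋ฆฌ๋น™์„ ์†์œผ๋กœ ์žฌํ˜„ํ•˜๋ฉด ๊ฒฐ์ •์ ์œผ๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๋‘ ์Šค๋ ˆ๋“œ์˜ ์ฝ๊ธฐ-์ˆ˜์ •-์“ฐ๊ธฐ๊ฐ€ ๊ฒน์น  ๋•Œ ์ฆ๊ฐ€ ์—ฐ์‚ฐ ํ•˜๋‚˜๊ฐ€ ์†Œ๋ฆฌ ์—†์ด ์‚ฌ๋ผ์ง€๋Š” ๊ณผ์ •์„ ์ˆœ์ฐจ์ ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ๊ฐœ๋… ์„ค๋ช…์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (์‹ค์ œ GPU ์Šค๋ ˆ๋“œ๊ฐ€ ์•„๋‹Œ ์ตœ์•…์˜ ์ธํ„ฐ๋ฆฌ๋น™ ํ•œ ๊ฐ€์ง€๋ฅผ ๊ณ ์ •ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค).

```python
# ๋‘ "์Šค๋ ˆ๋“œ"๊ฐ€ ๋™์‹œ์— counter += 1์„ ์ˆ˜ํ–‰ํ•˜๋Š” ์ƒํ™ฉ์˜
# ์ตœ์•…์˜ ์ธํ„ฐ๋ฆฌ๋น™์„ ๋‹จ๊ณ„๋ณ„๋กœ ์žฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.
counter = 0

t1_read = counter        # ์Šค๋ ˆ๋“œ 1: 0์„ ์ฝ์Œ
t2_read = counter        # ์Šค๋ ˆ๋“œ 2: ์Šค๋ ˆ๋“œ 1์ด ์“ฐ๊ธฐ ์ „์— ์—ญ์‹œ 0์„ ์ฝ์Œ
counter = t1_read + 1    # ์Šค๋ ˆ๋“œ 1: 1์„ ์”€
counter = t2_read + 1    # ์Šค๋ ˆ๋“œ 2: 1์„ ์จ์„œ ์Šค๋ ˆ๋“œ 1์˜ ๊ฐฑ์‹ ์„ ๋ฎ์–ด์”€

print(counter)  # 1 - ์ฆ๊ฐ€ ๋‘ ๋ฒˆ ์ค‘ ํ•˜๋‚˜๊ฐ€ ์‚ฌ๋ผ์ง
```

ํƒ€์ด๋ฐ์— ๋”ฐ๋ผ ์ด ์ธํ„ฐ๋ฆฌ๋น™์ด ๋‚˜์˜ฌ ์ˆ˜๋„, ์•ˆ ๋‚˜์˜ฌ ์ˆ˜๋„ ์žˆ๋‹ค๋Š” ์ ์ด ๋ฐ”๋กœ ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๊ณ ๋„ ํ”„๋กœ๋•์…˜์—์„œ ํ„ฐ์ง€๋Š” ์ด์œ ์ด๋ฉฐ, ์ƒˆ๋‹ˆํƒ€์ด์ €๊ฐ€ ์ด๋Ÿฐ hazard๋ฅผ ์ •์ ์œผ๋กœ ์žก์•„๋‚ด๋Š” ์ด์œ ์ž…๋‹ˆ๋‹ค.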

๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์—์„œ ์ฆํญ๋˜๋Š” ๋ฌธ์ œ:

  • ํ•œ ์Šค๋ ˆ๋“œ์˜ ๋ฒ„๊ทธ๊ฐ€ ์ˆ˜์ฒœ ๊ฐœ์— ์˜ํ–ฅ: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํ•˜๋‚˜๊ฐ€ ์ „์ฒด ์›Œํ”„๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ๊ฒฝ์Ÿ ์ƒํƒœ์˜ ๊ธฐํ•˜๊ธ‰์ˆ˜์  ์ฆ๊ฐ€: ์Šค๋ ˆ๋“œ๊ฐ€ ๋งŽ์„์ˆ˜๋ก ์†์ƒ ๊ฐ€๋Šฅ์„ฑ๋„ ์ปค์ง
  • ํ•˜๋“œ์›จ์–ด ์ฐจ์ด๊ฐ€ ๋ฌธ์ œ๋ฅผ ์€ํ: ๊ฐ™์€ ๋ฒ„๊ทธ๊ฐ€ GPU ์•„ํ‚คํ…์ฒ˜๋งˆ๋‹ค ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘

ํ•˜์ง€๋งŒ ํฌ์†Œ์‹์ด ์žˆ์Šต๋‹ˆ๋‹ค: GPU ๊ฒ€์‚ฌ ๋„๊ตฌ๋ฅผ ์ตํžˆ๋ฉด, ์ด๋ ‡๊ฒŒ ์ฐพ๊ธฐ ์–ด๋ ค์šด ๋ฒ„๊ทธ๋“ค์„ ํ”„๋กœ๋•์…˜์— ๋„๋‹ฌํ•˜๊ธฐ ์ „์— ์žก์•„๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ƒˆ๋‹ˆํƒ€์ด์ € ๋„๊ตฌ ๋ชจ์Œ: NVIDIA compute-sanitizer

NVIDIA compute-sanitizer๋Š” GPU ๋ฒ„๊ทธ์— ๋งž์„œ ์‹ธ์šฐ๋Š” ์—ฌ๋Ÿฌ๋ถ„์˜ ๋น„๋ฐ€ ๋ฌด๊ธฐ์ž…๋‹ˆ๋‹ค. ๋‹ค์Œ์„ ํƒ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ, ์ž˜๋ชป๋œ ํฌ์ธํ„ฐ, ๋ฉ”๋ชจ๋ฆฌ ๋ˆ„์ˆ˜
  • ๊ฒฝ์Ÿ ์ƒํƒœ: ์Šค๋ ˆ๋“œ ๊ฐ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ hazard
  • ๋™๊ธฐํ™” ๋ฒ„๊ทธ: ๊ต์ฐฉ ์ƒํƒœ, barrier ์˜ค์šฉ, ๋ถ€์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ์กฐ์œจ
  • ๊ทธ ์™ธ: pixi run compute-sanitizer --help๋กœ ํ™•์ธ

๐Ÿ“– ๊ณต์‹ ๋ฌธ์„œ: NVIDIA Compute Sanitizer User Guide

GPU ํ”„๋กœ๊ทธ๋žจ์˜ X-ray๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜ ํ…Œ์ŠคํŠธ๋กœ๋Š” ๋ณผ ์ˆ˜ ์—†๋Š” ์ˆจ๊ฒจ์ง„ ๋ฌธ์ œ๊นŒ์ง€ ๋“œ๋Ÿฌ๋‚ด ์ค๋‹ˆ๋‹ค.

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ

์ด ํผ์ฆ์—์„œ๋Š” ๊ฐ€์žฅ ์ฐพ๊ธฐ ์–ด๋ ค์šด GPU ๋ฒ„๊ทธ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ฐพ์•„ ์ˆ˜์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์œ ๋Šฅํ•œ GPU ๊ฐœ๋ฐœ์ž์™€ ๋›ฐ์–ด๋‚œ ๊ฐœ๋ฐœ์ž๋ฅผ ๊ตฌ๋ถ„ ์ง“๋Š” ํƒ์ • ๊ธฐ์ˆ ์„ ์ตํžˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์ตํžˆ๊ฒŒ ๋  ํ•ต์‹ฌ ๊ธฐ์ˆ 

  1. ์ˆจ์€ ๋ฒ„๊ทธ ์ฐพ๊ธฐ - ํ…Œ์ŠคํŠธ๋กœ๋Š” ์žกํžˆ์ง€ ์•Š๋Š” ๋ฌธ์ œ ๋ฐœ๊ฒฌ
  2. ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ ์กฐ์‚ฌ - ํ”ผํ•ด๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์ „์— ๋ฏธ์ •์˜ ๋™์ž‘ ์ถ”์ 
  3. ๊ฒฝ์Ÿ ์ƒํƒœ ํƒ์ง€ - ๋™์‹œ์„ฑ ์œ„ํ—˜ ์š”์†Œ๋ฅผ ์ฐพ์•„๋‚ด๊ณ  ์ œ๊ฑฐ
  4. ๋„๊ตฌ ์„ ํƒ ๋Šฅ๋ ฅ - ์ƒํ™ฉ์— ๋งž๋Š” ์ƒˆ๋‹ˆํƒ€์ด์ € ์„ ํƒ
  5. ํ”„๋กœ๋•์…˜ ๋””๋ฒ„๊น… ์ž์‹ ๊ฐ - ์‚ฌ์šฉ์ž์—๊ฒŒ ๋„๋‹ฌํ•˜๊ธฐ ์ „์— ๋ฒ„๊ทธ ํฌ์ฐฉ

์‹ค์ „ ๋ฒ„๊ทธ ์‚ฌ๋ƒฅ ์‹œ๋‚˜๋ฆฌ์˜ค

๊ฐ€์žฅ ์œ„ํ—˜ํ•œ ๋‘ ์ข…๋ฅ˜์˜ GPU ๋ฒ„๊ทธ๋ฅผ ์กฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ - ๊ฒฝ๊ณ  ์—†์ด ๋ฐ์ดํ„ฐ๋ฅผ ๋ง๊ฐ€๋œจ๋ฆฌ๋Š” ์กฐ์šฉํ•œ ์•”์‚ด์ž
  • ๊ฒฝ์Ÿ ์ƒํƒœ - ๊ฒฐ๊ณผ๋ฅผ ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ํ˜ผ๋ˆ์˜ ์”จ์•—

๊ฐ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ ์ผ๋ฐ˜ ํ…Œ์ŠคํŠธ๋กœ๋Š” ๋ณด์ด์ง€ ์•Š๋Š” ๋‹จ์„œ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋ฉฐ, GPU ๋ฒ„๊ทธ ํƒ์ •์ฒ˜๋Ÿผ ์‚ฌ๊ณ ํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฒ„๊ทธ ์‚ฌ๋ƒฅ ์—ฌ์ •

์ด ํผ์ฆ์€ ์กฐ์šฉํ•œ ์†์ƒ์„ ๋ฐœ๊ฒฌํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ๋ณ‘๋ ฌ ๋””๋ฒ„๊น…์„ ๋ฐฐ์šฐ๋Š” ๊ฒƒ๊นŒ์ง€, ์ฒด๊ณ„์ ์œผ๋กœ ์„ค๊ณ„๋œ ๊ณผ์ •์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ‘ฎ๐Ÿผโ€โ™‚๏ธ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํƒ์ง€

๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ์กฐ์‚ฌ - ํ…Œ์ŠคํŠธ๋Š” ํ†ต๊ณผํ•ด๋„ ๋ฉ”๋ชจ๋ฆฌ๋Š” ๊ฑฐ์ง“๋ง์„ ํ•  ๋•Œ

  • ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๋ฉด์„œ๋„ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ”์ฃ„๋ฅผ ์ €์ง€๋ฅด๋Š” ํ”„๋กœ๊ทธ๋žจ ์กฐ์‚ฌ
  • ๋ฏธ์ •์˜ ๋™์ž‘(UB)์˜ ์ง•ํ›„๋ฅผ ์•Œ์•„๋ณด๋Š” ๋ฒ• ์ตํžˆ๊ธฐ
  • memcheck ํ•™์Šต - ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์„ ์žก์•„๋‚ด๋Š” ํƒ์ง€๊ธฐ
  • GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜๋ฅผ ์ˆจ๊ธฐ๋Š” ์ด์œ  ์ดํ•ด
  • ์ฒด๊ณ„์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ฒ€์ฆ ์‹ค์Šต

๋ชฉํ‘œ: ๋ฐฉ์น˜ํ•˜๋ฉด ํ”„๋กœ๋•์…˜๊นŒ์ง€ ๋ฐœ๊ฒฌ๋˜์ง€ ์•Š์•˜์„ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํƒ์ง€ ๋Šฅ๋ ฅ

๐Ÿ ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…

๋™์‹œ์„ฑ ๋ฒ„๊ทธ ์กฐ์‚ฌ - ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋ฐœ๋ชฉ์„ ์žก์„ ๋•Œ

  • ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ ๋•Œ๋ฌธ์— ๋ฌด์ž‘์œ„๋กœ ์‹คํŒจํ•˜๋Š” ํ”„๋กœ๊ทธ๋žจ ์กฐ์‚ฌ
  • ๋ฐ์ดํ„ฐ๊ฐ€ ์†์ƒ๋˜๊ธฐ ์ „์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„ํ—˜ ์š”์†Œ ์‹๋ณ„๋ฒ• ์ตํžˆ๊ธฐ
  • racecheck ํ•™์Šต - ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์žก์•„๋‚ด๋Š” ํƒ์ง€๊ธฐ
  • ๋‹ค์–‘ํ•œ ๋™์‹œ์„ฑ ๋ฒ„๊ทธ์— ๋Œ€ํ•ด racecheck vs synccheck ๋น„๊ต
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์ „๋žต ์‹ค์Šต

๋ชฉํ‘œ: ๊ณ ๊ธ‰ ๋™์‹œ์„ฑ ๋””๋ฒ„๊น… - ์ˆ˜์ฒœ ๊ฐœ์˜ ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๋ฅผ ๊ธธ๋“ค์ด๋Š” ๋Šฅ๋ ฅ

GPU ํƒ์ • ๋งˆ์ธ๋“œ์…‹

GPU ๊ฒ€์‚ฌ๋ฅผ ํ•˜๋ ค๋ฉด ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋žจ ํƒ์ •์ด ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‚ฌ๊ฑด์„ ์กฐ์‚ฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  • ์ฆ๊ฑฐ๊ฐ€ ์ˆจ๊ฒจ์ ธ ์žˆ๋‹ค - ์ง์ ‘ ๊ด€์ฐฐํ•  ์ˆ˜ ์—†๋Š” ๋ณ‘๋ ฌ ์‹คํ–‰ ์†์—์„œ ๋ฒ„๊ทธ๊ฐ€ ๋ฐœ์ƒ
  • ์šฉ์˜์ž๊ฐ€ ์ˆ˜์—†์ด ๋งŽ๋‹ค - ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ ์ค‘ ์–ด๋–ค ์กฐํ•ฉ์ด๋“  ๋ฒ”์ธ์ผ ์ˆ˜ ์žˆ์Œ
  • ๋ฒ”ํ–‰์ด ๊ฐ„ํ—์ ์ด๋‹ค - ๊ฒฝ์Ÿ ์ƒํƒœ์™€ ํƒ€์ด๋ฐ์— ๋”ฐ๋ฅธ ์‹คํŒจ
  • ์ „๋ฌธ ๋„๊ตฌ๊ฐ€ ํ•„์š”ํ•˜๋‹ค - ์ผ๋ฐ˜ ๋””๋ฒ„๊น…์œผ๋กœ๋Š” ๋ณผ ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ์ƒˆ๋‹ˆํƒ€์ด์ €๊ฐ€ ๋ณด์—ฌ์คŒ

ํ•˜์ง€๋งŒ ํ›Œ๋ฅญํ•œ ํƒ์ •์ฒ˜๋Ÿผ, ์—ฌ๋Ÿฌ๋ถ„๋„ ๋‹ค์Œ์„ ๋ฐฐ์šฐ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  • ๋ณด์ด์ง€ ์•Š๋Š” ๋‹จ์„œ ๋”ฐ๋ผ๊ฐ€๊ธฐ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ, ๋™๊ธฐํ™” ์ง€์ 
  • ๋ณ‘๋ ฌ์ ์œผ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ - ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์–ด๋–ป๊ฒŒ ์ƒํ˜ธ์ž‘์šฉํ•˜๋Š”์ง€ ๊ณ ๋ ค
  • ๋ฏธ๋ž˜์˜ ๋ฒ”์ฃ„ ์˜ˆ๋ฐฉํ•˜๊ธฐ - ๊ฐœ๋ฐœ ์›Œํฌํ”Œ๋กœ์šฐ์— ๊ฒ€์‚ฌ ๋„๊ตฌ ํ†ตํ•ฉ
  • ๋„๊ตฌ ๋ฏฟ๊ธฐ - ์ˆ˜๋™ ํ…Œ์ŠคํŠธ๋กœ๋Š” ๋“œ๋Ÿฌ๋‚ผ ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ์ƒˆ๋‹ˆํƒ€์ด์ €์— ๋งก๊ธฐ๊ธฐ

์‹œ์ž‘ํ•˜๊ธฐ ์ „์—

์•Œ์•„์•ผ ํ•  ๊ฒƒ:

  • Puzzle 1-8์—์„œ ๋‹ค๋ฃฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋… (๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ์Šค๋ ˆ๋“œ ์กฐ์œจ, ๋ฐฐ๋ฆฌ์–ด)
  • ํ˜ธํ™˜ NVIDIA GPU ํ•˜๋“œ์›จ์–ด
  • compute-sanitizer ์ ‘๊ทผ์„ ์œ„ํ•œ pixi ํŒจํ‚ค์ง€ ๋งค๋‹ˆ์ € ํ™˜๊ฒฝ ์„ค์ •
  • ์„ ํ–‰ ํผ์ฆ: Puzzle 4์™€ Puzzle 8 ์ˆ™์ง€ ๊ถŒ์žฅ

๋ชฉํ‘œ:

  • ์ „๋ฌธ GPU ๊ฐœ๋ฐœํŒ€์—์„œ ์‚ฌ์šฉํ•˜๋Š” ํ”„๋กœ๋•์…˜๊ธ‰ ๋””๋ฒ„๊น… ๊ธฐ์ˆ 
  • ๋น„์šฉ์ด ํฐ ํ”„๋กœ๋•์…˜ ์žฅ์• ๋ฅผ ์˜ˆ๋ฐฉํ•˜๋Š” ์ˆจ์€ ๋ฒ„๊ทธ ํƒ์ง€ ๊ธฐ์ˆ 
  • ๊ฐ€์žฅ ๊นŒ๋‹ค๋กœ์šด ๋™์‹œ์„ฑ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋„ ๋ณ‘๋ ฌ ๋””๋ฒ„๊น… ์ž์‹ ๊ฐ
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ปค๋ฆฌ์–ด ์ „๋ฐ˜์— ๋„์›€์ด ๋  ๋„๊ตฌ ์ „๋ฌธ์„ฑ

๐Ÿ‘ฎ๐Ÿผโ€โ™‚๏ธ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ํƒ์ง€

๊ฐœ์š”

ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์—ฌ๋„ GPU ํ”„๋กœ๊ทธ๋žจ์„ ์กฐ์šฉํžˆ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์„ ํƒ์ง€ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. NVIDIA์˜ compute-sanitizer(pixi๋ฅผ ํ†ตํ•ด ์‚ฌ์šฉ ๊ฐ€๋Šฅ)์™€ memcheck ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, GPU ์ฝ”๋“œ์—์„œ ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๋™์ž‘์„ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ˆจ์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ๋ฅผ ๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ํ”„๋กœ๊ทธ๋žจ์€ ๋ถˆ๋ฒ•์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ๋„ ๋™์‹œ์— โ€œ์˜ฌ๋ฐ”๋ฅธโ€ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์„ ํ–‰ ํ•™์Šต: Puzzle 4 TileTensor์™€ ๊ธฐ๋ณธ์ ์ธ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์กฐ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„๊ทธ์˜ ๋ฐœ๊ฒฌ

ํ…Œ์ŠคํŠธ๋Š” ํ†ต๊ณผํ–ˆ์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์ •๋ง ์˜ฌ๋ฐ”๋ฅธ ๊ฑธ๊นŒ?

์–ผํ• ๋ฌดํ•ดํ•ด ๋ณด์ด๊ณ  ์™„๋ฒฝํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋Š” ๋“ฏํ•œ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ์‹œ์ž‘ํ•ด ๋ด…์‹œ๋‹ค (๊ฐ€๋“œ๊ฐ€ ์—†๋Š” Puzzle 04์ž…๋‹ˆ๋‹ค):

def add_10_2d(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    # ๋ฒ„๊ทธ: row/col์ด ํ…์„œ ์ฐจ์›(size)์„ ๋ฒ—์–ด๋‚˜๋„ ๊ฒ€์‚ฌ ์—†์ด ๊ทธ๋Œ€๋กœ ์ ‘๊ทผ
    output[row, col] = a[row, col] + 10.0


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p10/p10.mojo

์ด ํ”„๋กœ๊ทธ๋žจ์„ ์ผ๋ฐ˜์ ์œผ๋กœ ์‹คํ–‰ํ•˜๋ฉด, ๋ชจ๋“  ๊ฒƒ์ด ์ •์ƒ์œผ๋กœ ๋ณด์ž…๋‹ˆ๋‹ค:

pixi run p10 --memory-bug
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
โœ… Memory test PASSED! (memcheck may find bounds violations)

โœ… ํ…Œ์ŠคํŠธ ํ†ต๊ณผ! ์ถœ๋ ฅ์ด ์˜ˆ์ƒ ๊ฒฐ๊ณผ์™€ ์™„๋ฒฝํ•˜๊ฒŒ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ๊ฑด ์ข…๊ฒฐ, ๋งž์ฃ ?

์•„๋‹™๋‹ˆ๋‹ค! compute-sanitizer๊ฐ€ ๋ฌด์—‡์„ ๋ณด์—ฌ์ฃผ๋Š”์ง€ ๋ด…์‹œ๋‹ค:

MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug

์ฐธ๊ณ : MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0์€ ๋””๋ฐ”์ด์Šค ์ปจํ…์ŠคํŠธ์˜ ๋ฒ„ํผ ์บ์‹œ๋ฅผ ๋น„ํ™œ์„ฑํ™”ํ•˜๋Š” ๋ช…๋ น์ค„ ํ™˜๊ฒฝ ๋ณ€์ˆ˜ ์„ค์ •์ž…๋‹ˆ๋‹ค. ์ด ์„ค์ •์€ ์ผ๋ฐ˜์ ์ธ ์บ์‹ฑ ๋™์ž‘์— ์˜ํ•ด ์ˆจ๊ฒจ์ง€๋˜ ๊ฒฝ๊ณ„ ์œ„๋ฐ˜ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฌธ์ œ๋ฅผ ๋“œ๋Ÿฌ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. (์—ญ์ฃผ: ๋ฒ„ํผ ์บ์‹œ๊ฐ€ ํ™œ์„ฑํ™”๋˜๋ฉด ํ•ด์ œ๋œ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฆ‰์‹œ ๋ฐ˜ํ™˜ํ•˜์ง€ ์•Š๊ณ  ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๋ณด๊ด€ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋•Œ๋ฌธ์— ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์ด ์•„์ง ์œ ํšจํ•œ ์บ์‹œ ์˜์—ญ์— ๋‹ฟ์•„ ์˜ค๋ฅ˜๊ฐ€ ๋“œ๋Ÿฌ๋‚˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋น„ํ™œ์„ฑํ™”ํ•˜๋ฉด ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ฆ‰์‹œ ๋ฐ˜ํ™˜๋˜์–ด ์œ„๋ฐ˜์ด ๊ฐ์ง€๋ฉ๋‹ˆ๋‹ค.)

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (2,1,0) in block (0,0,0)
=========     Access at 0xe0c000210 is out of bounds
=========     and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (0,2,0) in block (0,0,0)
=========     Access at 0xe0c000210 is out of bounds
=========     and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (1,2,0) in block (0,0,0)
=========     Access at 0xe0c000214 is out of bounds
=========     and is 517 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (2,2,0) in block (0,0,0)
=========     Access at 0xe0c000218 is out of bounds
=========     and is 521 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuStreamSynchronize.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuEventCreate.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuMemFreeAsync.

========= ERROR SUMMARY: 7 errors

๋ชจ๋“  ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ–ˆ์Œ์—๋„ ํ”„๋กœ๊ทธ๋žจ์—๋Š” ์ด 7๊ฐœ์˜ ์˜ค๋ฅ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • 4๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ (Invalid __global__ read)
  • 3๊ฐœ์˜ ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜ (๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์œผ๋กœ ์ธํ•ด ๋ฐœ์ƒ)

์ˆจ๊ฒจ์ง„ ๋ฒ„๊ทธ ์ดํ•ดํ•˜๊ธฐ

๊ทผ๋ณธ ์›์ธ ๋ถ„์„

๋ฌธ์ œ:

  • ํ…์„œ ํฌ๊ธฐ: 2ร—2 (์œ ํšจํ•œ ์ธ๋ฑ์Šค: 0, 1)
  • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ: 3ร—3 (์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค: 0, 1, 2)
  • ๋ฒ”์œ„ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ: (2,1), (0,2), (1,2), (2,2)๊ฐ€ ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ˆ„๋ฝ: ํ…์„œ ์ฐจ์›์— ๋Œ€ํ•œ thread_idx ๊ฒ€์ฆ์ด ์—†์Œ

7๊ฐœ ์˜ค๋ฅ˜ ์ „์ฒด ์ดํ•ดํ•˜๊ธฐ

4๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜:

  • ๊ฐ ๋ฒ”์œ„ ์ดˆ๊ณผ ์Šค๋ ˆ๋“œ (2,1), (0,2), (1,2), (2,2)๊ฐ€ Invalid __global__ read๋ฅผ ๋ฐœ์ƒ์‹œํ‚ด

3๊ฐœ์˜ CUDA ๋Ÿฐํƒ€์ž„ ์˜ค๋ฅ˜:

  • ์ปค๋„ ์‹คํ–‰ ์‹คํŒจ๋กœ ์ธํ•ด cuStreamSynchronize ์‹คํŒจ
  • ์ •๋ฆฌ ๊ณผ์ •์—์„œ cuEventCreate ์‹คํŒจ
  • ๋ฉ”๋ชจ๋ฆฌ ํ•ด์ œ ๊ณผ์ •์—์„œ cuMemFreeAsync ์‹คํŒจ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์€ ์—ฐ์‡„ ํšจ๊ณผ๋ฅผ ์ผ์œผํ‚ต๋‹ˆ๋‹ค - ํ•˜๋‚˜์˜ ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์—ฌ๋Ÿฌ ํ›„์† CUDA API ์‹คํŒจ๋ฅผ ์•ผ๊ธฐํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿผ์—๋„ ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผํ•œ ์ด์œ :

  • ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ (0,0), (0,1), (1,0), (1,1)์ด ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋กํ•จ
  • ํ…Œ์ŠคํŠธ๊ฐ€ ์œ ํšจํ•œ ์ถœ๋ ฅ ์œ„์น˜๋งŒ ๊ฒ€์‚ฌํ•จ
  • ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์ด ํ”„๋กœ๊ทธ๋žจ์„ ์ฆ‰์‹œ ํฌ๋ž˜์‹œ์‹œํ‚ค์ง€ ์•Š์Œ

๋ฏธ์ •์˜ ๋™์ž‘ ์ดํ•ดํ•˜๊ธฐ

๋ฏธ์ •์˜ ๋™์ž‘์ด๋ž€?

๋ฏธ์ •์˜ ๋™์ž‘(Undefined Behavior, UB) ์€ ํ”„๋กœ๊ทธ๋žจ์ด ์–ธ์–ด ๋ช…์„ธ์ƒ ์ •์˜๋˜์ง€ ์•Š์€ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ๋ฒ”์œ„ ์ดˆ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋Œ€ํ‘œ์ ์ธ ์˜ˆ์ž…๋‹ˆ๋‹ค.

๋ฏธ์ •์˜ ๋™์ž‘์˜ ์ฃผ์š” ํŠน์„ฑ:

  • ํ”„๋กœ๊ทธ๋žจ์ด ๋ง ๊ทธ๋Œ€๋กœ ๋ฌด์Šจ ์ง“์ด๋“  ํ•  ์ˆ˜ ์žˆ์Œ: ํฌ๋ž˜์‹œ, ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ, ์ •์ƒ ๋™์ž‘ํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด๊ธฐ, ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ
  • ์–ด๋–ค ๋ณด์žฅ๋„ ์—†์Œ: ์ปดํŒŒ์ผ๋Ÿฌ, ํ•˜๋“œ์›จ์–ด, ๋“œ๋ผ์ด๋ฒ„, ์‹ฌ์ง€์–ด ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๋™์ž‘์ด ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ

๋ฏธ์ •์˜ ๋™์ž‘์ด ํŠนํžˆ ์œ„ํ—˜ํ•œ ์ด์œ 

์ •ํ™•์„ฑ ๋ฌธ์ œ:

  • ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ฒฐ๊ณผ: ํ…Œ์ŠคํŠธ ์ค‘์—๋Š” ๋™์ž‘ํ•˜๋‹ค๊ฐ€ ํ”„๋กœ๋•์…˜์—์„œ ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Œ
  • ๋น„๊ฒฐ์ •์  ๋™์ž‘: ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ๋‹ค๋ฅธ ์‹คํ–‰์—์„œ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์Œ
  • ์กฐ์šฉํ•œ ์†์ƒ: ๋ฏธ์ •์˜ ๋™์ž‘์€ ๊ฐ€์‹œ์ ์ธ ์˜ค๋ฅ˜ ์—†์ด ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”: ์ปดํŒŒ์ผ๋Ÿฌ๋Š” ๋ฏธ์ •์˜ ๋™์ž‘์ด ์—†๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๊ณ  ์˜ˆ์ƒ์น˜ ๋ชปํ•œ ๋ฐฉ์‹์œผ๋กœ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ์Œ

๋ณด์•ˆ ์ทจ์•ฝ์ :

  • ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ: ์‹œ์Šคํ…œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ณด์•ˆ ๊ณต๊ฒฉ์˜ ๊ณ ์ „์ ์ธ ์›์ธ
  • ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ: ๊ถŒํ•œ ์ƒ์Šน์ด๋‚˜ ์ฝ”๋“œ ์ธ์ ์…˜ ๊ณต๊ฒฉ์œผ๋กœ ์ด์–ด์งˆ ์ˆ˜ ์žˆ์Œ
  • ์ •๋ณด ์œ ์ถœ: ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ฝ๊ธฐ๋กœ ๋ฏผ๊ฐํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ๋…ธ์ถœ๋  ์ˆ˜ ์žˆ์Œ
  • ์ œ์–ด ํ๋ฆ„ ํ•˜์ด์žฌํ‚น: ๋ฏธ์ •์˜ ๋™์ž‘์„ ์•…์šฉํ•ด ํ”„๋กœ๊ทธ๋žจ ์‹คํ–‰ ํ๋ฆ„์„ ํƒˆ์ทจํ•  ์ˆ˜ ์žˆ์Œ

GPU ํŠน์œ ์˜ ๋ฏธ์ •์˜ ๋™์ž‘ ์œ„ํ—˜์„ฑ

๋Œ€๊ทœ๋ชจ ์˜ํ–ฅ:

  • ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ: ํ•œ ์Šค๋ ˆ๋“œ์˜ ๋ฏธ์ •์˜ ๋™์ž‘์ด ์ „์ฒด ์›Œํ”„(32๊ฐœ ์Šค๋ ˆ๋“œ)์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์ด ์ธ์ ‘ ์Šค๋ ˆ๋“œ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ
  • ์ปค๋„ ์‹คํŒจ: ๋ฏธ์ •์˜ ๋™์ž‘์ด GPU ์ปค๋„ ์ „์ฒด๋ฅผ ์™„์ „ํžˆ ๋ง๊ฐ€๋œจ๋ฆด ์ˆ˜ ์žˆ์Œ

ํ•˜๋“œ์›จ์–ด ์ฐจ์ด:

  • ๋‹ค๋ฅธ GPU ์•„ํ‚คํ…์ฒ˜: ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋‹ค๋ฅธ GPU ๋ชจ๋ธ์—์„œ ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ
  • ๋“œ๋ผ์ด๋ฒ„ ์ฐจ์ด: ๊ฐ™์€ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋“œ๋ผ์ด๋ฒ„ ๋ฒ„์ „์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ณ€๊ฒฝ: GPU ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ํŒจํ„ด์— ๋”ฐ๋ผ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ

๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜ ์ˆ˜์ •ํ•˜๊ธฐ

ํ•ด๊ฒฐ์ฑ…

Puzzle 04์—์„œ ๋ณธ ๊ฒƒ์ฒ˜๋Ÿผ, ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

def add_10_2d(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    if col < size and row < size:
        output[row, col] = a[row, col] + 10.0


ํ•ด๊ฒฐ์ฑ…์€ ๊ฐ„๋‹จํ•ฉ๋‹ˆ๋‹ค: ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ํ•ญ์ƒ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ์ดํ„ฐ ์ฐจ์›์— ๋Œ€ํ•ด ๊ฒ€์ฆํ•˜์„ธ์š”.

compute-sanitizer๋กœ ๊ฒ€์ฆ

# p10.mojo ๋ณต์‚ฌ๋ณธ์—์„œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ˆ˜์ •ํ•œ ํ›„ ์‹คํ–‰:
MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug
========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
โœ… Memory test PASSED! (memcheck may find bounds violations)
========= ERROR SUMMARY: 0 errors

โœ… ์„ฑ๊ณต: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์ด ํƒ์ง€๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค!

ํ•ต์‹ฌ ํ•™์Šต ํฌ์ธํŠธ

์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ 

  1. ๋ช…ํ™•์„ฑ: ์ฝ”๋“œ์—์„œ ์•ˆ์ „ ์š”๊ตฌ์‚ฌํ•ญ์„ ๋ช…์‹œ์ ์œผ๋กœ ํ‘œํ˜„
  2. ์ œ์–ด: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ผ€์ด์Šค์—์„œ ์ •ํ™•ํžˆ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚ ์ง€ ์ง์ ‘ ๊ฒฐ์ •
  3. ๋””๋ฒ„๊น…: ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์ด ๋ฐœ์ƒํ•  ๋•Œ ์ถ”๋ก ํ•˜๊ธฐ ์‰ฌ์›€

GPU ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „ ๊ทœ์น™

  1. ํ•ญ์ƒ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๊ฒ€์ฆํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์ฐจ์›๊ณผ ๋น„๊ต
  2. ๋ฏธ์ •์˜ ๋™์ž‘์„ ์–ด๋–ค ๋Œ€๊ฐ€๋ฅผ ์น˜๋ฅด๋”๋ผ๋„ ํ”ผํ•˜๊ธฐ - ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ์€ ๋ฏธ์ •์˜ ๋™์ž‘์ด๋ฉฐ ๋ชจ๋“  ๊ฒƒ์„ ๋ง๊ฐ€๋œจ๋ฆด ์ˆ˜ ์žˆ์Œ
  3. ๊ฐœ๋ฐœ๊ณผ ํ…Œ์ŠคํŠธ ์ค‘ compute-sanitizer ์‚ฌ์šฉ
  4. ๋ฉ”๋ชจ๋ฆฌ ๊ฒ€์‚ฌ ์—†์ด โ€œ๋™์ž‘ํ•œ๋‹คโ€œ๊ณ  ์ ˆ๋Œ€ ๊ฐ€์ •ํ•˜์ง€ ์•Š๊ธฐ
  5. ๋‹ค์–‘ํ•œ ๊ทธ๋ฆฌ๋“œ/๋ธ”๋ก ๊ตฌ์„ฑ์œผ๋กœ ํ…Œ์ŠคํŠธํ•˜์—ฌ ์ผ๊ด€์„ฑ ์—†์ด ๋‚˜ํƒ€๋‚˜๋Š” ๋ฏธ์ •์˜ ๋™์ž‘ ํฌ์ฐฉ

compute-sanitizer ๋ชจ๋ฒ” ์‚ฌ๋ก€

MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo your_code.mojo

์ฐธ๊ณ : ์ƒˆ๋‹ˆํƒ€์ด์ € ์ถœ๋ ฅ์—์„œ Mojo ๋Ÿฐํƒ€์ž„ ๊ฒฝ๊ณ ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‹ค์ œ ๋ฉ”๋ชจ๋ฆฌ ์œ„๋ฐ˜์„ ํ™•์ธํ•˜๋ ค๋ฉด ========= Invalid์™€ ========= ERROR SUMMARY ๋ผ์ธ์— ์ง‘์ค‘ํ•˜์„ธ์š”.

๐Ÿ ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…

๊ฐœ์š”

NVIDIA compute-sanitizer๋ฅผ ์‚ฌ์šฉํ•ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ๋ฅผ ์ผ์œผํ‚ค๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์‹๋ณ„ํ•˜๋ฉด์„œ ์‹คํŒจํ•˜๋Š” GPU ํ”„๋กœ๊ทธ๋žจ์„ ๋””๋ฒ„๊น…ํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์—์„œ ๋™์‹œ์„ฑ ๋ฒ„๊ทธ๋ฅผ ์ฐพ๋Š” racecheck ๋„๊ตฌ ์‚ฌ์šฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ์˜ ๊ฐ’์„ ๋ˆ„์ ํ•ด์•ผ ํ•˜๋Š” GPU ์ปค๋„์ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ…Œ์ŠคํŠธ๋Š” ์‹คํŒจํ•˜๋Š”๋ฐ, ๋กœ์ง์€ ์˜ฌ๋ฐ”๋ฅธ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ๋‹น์‹ ์˜ ๊ณผ์ œ๋Š” ์‹คํŒจ๋ฅผ ์ผ์œผํ‚ค๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ฐพ์•„ ์ˆ˜์ •ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)  # 9๊ฐœ ์Šค๋ ˆ๋“œ ์ค‘ 4๊ฐœ๋งŒ ํ™œ์„ฑํ™”
comptime dtype = DType.float32

์‹คํŒจํ•˜๋Š” ์ปค๋„


comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime layout = row_major[SIZE, SIZE]()
comptime LayoutType = type_of(layout)


def shared_memory_race(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x

    var shared_sum = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[1]())

    if row < size and col < size:
        shared_sum[0] += a[row, col]

    barrier()

    if row < size and col < size:
        output[row, col] = shared_sum[0]


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p10/p10.mojo

์ฝ”๋“œ ์‹คํ–‰

pixi run p10 --race-condition

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

out shape: 2 x 2
Running race condition example...
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
stack trace was not collected. Enable stack trace collection with environment variable `MOJO_ENABLE_STACK_TRACE_ON_ERROR`
Unhandled exception caught during execution: At /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p10/p10.mojo:122:33: AssertionError: `left == right` comparison failed:
   left: 0.0
  right: 6.0

compute-sanitizer๊ฐ€ GPU ์ฝ”๋“œ์˜ ๋ฌธ์ œ๋ฅผ ์–ด๋–ป๊ฒŒ ์ฐพ์•„๋‚ด๋Š”์ง€ ์‚ดํŽด๋ด…์‹œ๋‹ค.

compute-sanitizer๋กœ ๋””๋ฒ„๊น…ํ•˜๊ธฐ

1๋‹จ๊ณ„: racecheck๋กœ ๊ฒฝ์Ÿ ์ƒํƒœ ์‹๋ณ„

compute-sanitizer์™€ racecheck ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค:

pixi run compute-sanitizer --tool racecheck mojo problems/p10/p10.mojo --race-condition

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running race condition example...
========= Error: Race reported between Write access at p10_shared_memory_race_...+0x140
=========     and Read access at p10_shared_memory_race_...+0xe0 [4 hazards]
=========     and Write access at p10_shared_memory_race_...+0x140 [5 hazards]
=========
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
AssertionError: `left == right` comparison failed:
  left: 0.0
  right: 6.0
========= RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)

๋ถ„์„: ํ”„๋กœ๊ทธ๋žจ์— 1๊ฐœ์˜ ๊ฒฝ์Ÿ ์ƒํƒœ์™€ 9๊ฐœ์˜ ๊ฐœ๋ณ„ ์œ„ํ—˜ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • 4๊ฐœ์˜ read-after-write ์œ„ํ—˜ ์š”์†Œ (๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์“ฐ๋Š” ๋™์•ˆ ์ฝ๊ธฐ)
  • 5๊ฐœ์˜ write-after-write ์œ„ํ—˜ ์š”์†Œ (์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์“ฐ๊ธฐ)

2๋‹จ๊ณ„: synccheck์™€ ๋น„๊ต

๋™๊ธฐํ™” ๋ฌธ์ œ๊ฐ€ ์•„๋‹Œ ๊ฒฝ์Ÿ ์ƒํƒœ์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

pixi run compute-sanitizer --tool synccheck mojo problems/p10/p10.mojo --race-condition

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running race condition example...
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
AssertionError: `left == right` comparison failed:
  left: 0.0
  right: 6.0
========= ERROR SUMMARY: 0 errors

ํ•ต์‹ฌ ํ†ต์ฐฐ: synccheck๊ฐ€ 0๊ฐœ์˜ ์˜ค๋ฅ˜๋ฅผ ์ฐพ์•˜์Šต๋‹ˆ๋‹ค - ๊ต์ฐฉ ์ƒํƒœ ๊ฐ™์€ ๋™๊ธฐํ™” ๋ฌธ์ œ๋Š” ์—†์Šต๋‹ˆ๋‹ค. ๋ฌธ์ œ๋Š” ๋™๊ธฐํ™” ๋ฒ„๊ทธ๊ฐ€ ์•„๋‹Œ ๊ฒฝ์Ÿ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

๊ต์ฐฉ ์ƒํƒœ vs ๊ฒฝ์Ÿ ์ƒํƒœ: ์ฐจ์ด์  ์ดํ•ดํ•˜๊ธฐ

์ธก๋ฉด๊ต์ฐฉ ์ƒํƒœ๊ฒฝ์Ÿ ์ƒํƒœ
์ฆ์ƒํ”„๋กœ๊ทธ๋žจ์ด ์˜์›ํžˆ ๋ฉˆ์ถคํ”„๋กœ๊ทธ๋žจ์ด ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ ์ƒ์„ฑ
์‹คํ–‰์™„๋ฃŒ๋˜์ง€ ์•Š์Œ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒ๋จ
ํƒ€์ด๋ฐ๊ฒฐ์ •์ ์œผ๋กœ ๋ฉˆ์ถค๋น„๊ฒฐ์ •์  ๊ฒฐ๊ณผ
๊ทผ๋ณธ ์›์ธ๋™๊ธฐํ™” ๋กœ์ง ์˜ค๋ฅ˜๋™๊ธฐํ™”๋˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
ํƒ์ง€ ๋„๊ตฌsynccheckracecheck
์˜ˆ์‹œPuzzle 09: ์„ธ ๋ฒˆ์งธ ์‚ฌ๋ก€ ๋ฐฐ๋ฆฌ์–ด ๊ต์ฐฉ ์ƒํƒœ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ += ์—ฐ์‚ฐ

์šฐ๋ฆฌ ์‚ฌ๋ก€์—์„œ:

  • ํ”„๋กœ๊ทธ๋žจ ์™„๋ฃŒ๋จ โ†’ ๊ต์ฐฉ ์ƒํƒœ ์—†์Œ (์Šค๋ ˆ๋“œ๊ฐ€ ๋ฉˆ์ถ”์ง€ ์•Š์Œ)
  • ์ž˜๋ชป๋œ ๊ฒฐ๊ณผ โ†’ ๊ฒฝ์Ÿ ์ƒํƒœ (์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์†์ƒ)
  • ๋„๊ตฌ ํ™•์ธ โ†’ synccheck๋Š” 0๊ฐœ ์˜ค๋ฅ˜, racecheck๋Š” 9๊ฐœ ์œ„ํ—˜ ์š”์†Œ ๋ณด๊ณ 

๋””๋ฒ„๊น…์—์„œ ์ด ๊ตฌ๋ถ„์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ๊ต์ฐฉ ์ƒํƒœ ๋””๋ฒ„๊น…: ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜, ์กฐ๊ฑด๋ถ€ ๋™๊ธฐํ™”, ์Šค๋ ˆ๋“œ ์กฐ์œจ์— ์ง‘์ค‘
  • ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์›์ž์  ์—ฐ์‚ฐ (์—ญ์ฃผ: ์ค‘๊ฐ„ ์ƒํƒœ ์—†์ด ์™„์ „ํžˆ ์‹คํ–‰๋˜๊ฑฐ๋‚˜ ์ „ํ˜€ ์‹คํ–‰๋˜์ง€ ์•Š๋Š” ์—ฐ์‚ฐ), ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์— ์ง‘์ค‘

๋„์ „ ๊ณผ์ œ

์ด ๋„๊ตฌ๋“ค์„ ํ™œ์šฉํ•˜์—ฌ ์‹คํŒจํ•˜๋Š” ์ปค๋„์„ ์ˆ˜์ •ํ•˜์„ธ์š”.

ํŒ

์œ„ํ—˜ ์š”์†Œ ๋ถ„์„

shared_sum[0] += a[row, col] ์—ฐ์‚ฐ์ด ์œ„ํ—˜ํ•œ ์ด์œ ๋Š” ์‹ค์ œ๋กœ ์„ธ ๊ฐœ์˜ ๋ณ„๋„ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค:

  1. shared_sum[0] ์ฝ๊ธฐ
  2. ์ฝ์€ ๊ฐ’์— a[row, col] ๋”ํ•˜๊ธฐ
  3. ๊ฒฐ๊ณผ๋ฅผ shared_sum[0]์— ๋‹ค์‹œ ์“ฐ๊ธฐ

4๊ฐœ์˜ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ(์œ„์น˜ (0,0), (0,1), (1,0), (1,1))์—์„œ ์ด ์—ฐ์‚ฐ๋“ค์ด ๊ฒน์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ ์ค‘์ฒฉ โ†’ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ์ดˆ๊ธฐ๊ฐ’(0.0)์„ ์ฝ์Œ
  • ์—…๋ฐ์ดํŠธ ์†์‹ค โ†’ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ 0.0 + ์ž์‹ ์˜_๊ฐ’์„ ์จ์„œ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ์˜ ์ž‘์—…์„ ๋ฎ์–ด์”€
  • ๋น„์›์ž์  ์—ฐ์‚ฐ โ†’ += ๋ณตํ•ฉ ๋Œ€์ž…์€ GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์›์ž์ ์ด์ง€ ์•Š์Œ (์—ญ์ฃผ: ์‹คํ–‰ ๋„์ค‘ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ผ์–ด๋“ค ์ˆ˜ ์žˆ์–ด ์ค‘๊ฐ„ ์ƒํƒœ๊ฐ€ ๋…ธ์ถœ๋จ)

์ •ํ™•ํžˆ 9๊ฐœ์˜ ์œ„ํ—˜ ์š”์†Œ๊ฐ€ ๋‚˜์˜ค๋Š” ์ด์œ :

  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ read-modify-write๋ฅผ ์‹œ๋„
  • 4๊ฐœ ์Šค๋ ˆ๋“œ ร— ์Šค๋ ˆ๋“œ๋‹น 2-3๊ฐœ ์œ„ํ—˜ ์š”์†Œ = ์ด 9๊ฐœ ์œ„ํ—˜ ์š”์†Œ
  • compute-sanitizer๊ฐ€ ๋ชจ๋“  ์ถฉ๋Œํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์Œ์„ ์ถ”์ 
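+= ๊ฐ€ ์„ธ ๋‹จ๊ณ„๋กœ ์ชผ๊ฐœ์ง€๋Š” ๋ฐ”๋žŒ์— ์—…๋ฐ์ดํŠธ๊ฐ€ ์†์‹ค๋˜๋Š” ๊ณผ์ •์€ GPU ์—†์ด๋„ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (๊ฐœ๋… ํ™•์ธ์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค). ๋„ค ์Šค๋ ˆ๋“œ๊ฐ€ ๋ชจ๋‘ ๊ฐ™์€ ์ดˆ๊ธฐ๊ฐ’ 0.0์„ ๋จผ์ € ์ฝ์€ ๋’ค ๊ฐ์ž ์“ฐ๋Š”, ์ตœ์•…์˜ ๊ต์ฐจ ์‹คํ–‰์„ ๊ฐ•์ œ๋กœ ์žฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

```python
# ๊ณต์œ  ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ๋น„์›์ž์  +=๊ฐ€ ์—…๋ฐ์ดํŠธ๋ฅผ ์žƒ๋Š” ๊ณผ์ • ์‹œ๋ฎฌ๋ ˆ์ด์…˜
a = [0.0, 1.0, 2.0, 3.0]  # 2x2 ์ž…๋ ฅ์„ ํ‰ํƒ„ํ™”ํ•œ ๊ฐ’

# ์ตœ์•…์˜ ํƒ€์ด๋ฐ: ๋„ค ์Šค๋ ˆ๋“œ๊ฐ€ ๋ชจ๋‘ ๋จผ์ € ์ฝ๊ณ (1๋‹จ๊ณ„), ๋‚˜์ค‘์— ๊ฐ์ž ์“ด๋‹ค(3๋‹จ๊ณ„)
shared_sum = 0.0
reads = [shared_sum for _ in a]   # ๋ชจ๋‘ ๊ฐ™์€ ์ดˆ๊ธฐ๊ฐ’ 0.0์„ ์ฝ์Œ
for old, val in zip(reads, a):
    shared_sum = old + val        # ์•ž์„  ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ๋ฅผ ๋ฎ์–ด์”€
print(shared_sum)  # 3.0: ๋งˆ์ง€๋ง‰ ์“ฐ๊ธฐ๋งŒ ์‚ด์•„๋‚จ (๊ธฐ๋Œ€๊ฐ’ 6.0)

# ์ˆœ์ฐจ ์‹คํ–‰(๋‹จ์ผ ์Šค๋ ˆ๋“œ)์ด๋ผ๋ฉด ์˜ฌ๋ฐ”๋ฅธ ํ•ฉ์ด ๋‚˜์˜จ๋‹ค
correct = 0.0
for val in a:
    correct += val
print(correct)  # 6.0
```

์‹ค์ œ GPU ์‹คํ–‰์—์„œ๋Š” ์Šค๋ ˆ๋“œ ํƒ€์ด๋ฐ์— ๋”ฐ๋ผ 0.0์„ ํฌํ•จํ•œ ์—ฌ๋Ÿฌ ๊ฐ’์ด ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋ฐ”๋กœ ์ด ๋น„๊ฒฐ์ •์„ฑ์ด ๊ฒฝ์Ÿ ์ƒํƒœ์˜ ํŠน์ง•์ž…๋‹ˆ๋‹ค.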

๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น… ํŒ

  1. ๋ฐ์ดํ„ฐ ๊ฒฝ์Ÿ์—๋Š” racecheck ์‚ฌ์šฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„ํ—˜ ์š”์†Œ์™€ ๋ฐ์ดํ„ฐ ์†์ƒ ํƒ์ง€
  2. ๊ต์ฐฉ ์ƒํƒœ์—๋Š” synccheck ์‚ฌ์šฉ: ๋™๊ธฐํ™” ๋ฒ„๊ทธ(๋ฐฐ๋ฆฌ์–ด ๋ฌธ์ œ, ๊ต์ฐฉ ์ƒํƒœ) ํƒ์ง€
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— ์ง‘์ค‘: ๊ณต์œ  ๋ณ€์ˆ˜์— ๋Œ€ํ•œ ๋™๊ธฐํ™”๋˜์ง€ ์•Š์€ +=, = ์—ฐ์‚ฐ ์ฐพ๊ธฐ
  4. ํŒจํ„ด ์‹๋ณ„: read-modify-write ์—ฐ์‚ฐ์ด ํ”ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ ์›์ธ
  5. ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜ ํ™•์ธ: ๋ฐฐ๋ฆฌ์–ด๋Š” ์ถฉ๋Œ ์—ฐ์‚ฐ ์ด์ „์— ๋ฐฐ์น˜ํ•ด์•ผ ํ•จ, ์ดํ›„๊ฐ€ ์•„๋‹˜

๋””๋ฒ„๊น…์—์„œ ์ด ๊ตฌ๋ถ„์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ๊ต์ฐฉ ์ƒํƒœ ๋””๋ฒ„๊น…: ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜, ์กฐ๊ฑด๋ถ€ ๋™๊ธฐํ™”, ์Šค๋ ˆ๋“œ ์กฐ์œจ์— ์ง‘์ค‘
  • ๊ฒฝ์Ÿ ์ƒํƒœ ๋””๋ฒ„๊น…: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์›์ž์  ์—ฐ์‚ฐ, ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์— ์ง‘์ค‘

ํ”ผํ•ด์•ผ ํ•  ํ”ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ ํŒจํ„ด:

  • ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์“ฐ๊ธฐ
  • ๋™๊ธฐํ™”๋˜์ง€ ์•Š์€ read-modify-write ์—ฐ์‚ฐ (+=, ++ ๋“ฑ)
  • ๊ฒฝ์Ÿ ์ƒํƒœ ์ด์ „์ด ์•„๋‹Œ ์ดํ›„์— ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜

์†”๋ฃจ์…˜


comptime SIZE = 2
comptime BLOCKS_PER_GRID = 1
comptime THREADS_PER_BLOCK = (3, 3)
comptime dtype = DType.float32
comptime layout = row_major[SIZE, SIZE]()
comptime LayoutType = type_of(layout)


def shared_memory_race(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Fixed: sequential access with barriers eliminates race conditions."""
    var row = thread_idx.y
    var col = thread_idx.x

    var shared_sum = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[1]())

    # Only thread 0 does all the accumulation work to prevent races
    if row == 0 and col == 0:
        # Use local accumulation first, then single write to shared memory
        var local_sum = Scalar[dtype](0.0)
        for r in range(size):
            for c in range(size):
                local_sum += rebind[Scalar[dtype]](a[r, c])

        shared_sum[0] = local_sum  # Single write operation

    barrier()  # Ensure thread 0 completes before others read

    # All threads read the safely accumulated result after synchronization
    if row < size and col < size:
        output[row, col] = shared_sum[0]


๋ฌด์—‡์ด ์ž˜๋ชป๋˜์—ˆ๋Š”์ง€ ์ดํ•ดํ•˜๊ธฐ

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฌธ์ œ ํŒจํ„ด

์›๋ž˜ ์‹คํŒจํ•˜๋Š” ์ฝ”๋“œ์—๋Š” ์ด ํ•ต์‹ฌ์ ์ธ ์ค„์ด ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค:

shared_sum[0] += a[row, col]  # ๊ฒฝ์Ÿ ์ƒํƒœ!

์ด ํ•œ ์ค„์ด 4๊ฐœ์˜ ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ ์‚ฌ์ด์—์„œ ์—ฌ๋Ÿฌ ์œ„ํ—˜ ์š”์†Œ๋ฅผ ์ผ์œผํ‚ต๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ (0,0)์ด ์ฝ์Œ shared_sum[0] (๊ฐ’: 0.0)
  2. ์Šค๋ ˆ๋“œ (0,1)์ด ์ฝ์Œ shared_sum[0] (๊ฐ’: 0.0) โ† Read-after-write ์œ„ํ—˜!
  3. ์Šค๋ ˆ๋“œ (0,0)์ด ์”€ 0.0 + 0
  4. ์Šค๋ ˆ๋“œ (1,0)์ด ์”€ 0.0 + 2 โ† Write-after-write ์œ„ํ—˜!

ํ…Œ์ŠคํŠธ๊ฐ€ ์‹คํŒจํ•œ ์ด์œ 

  • += ์—ฐ์‚ฐ ์ค‘ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ์˜ ์“ฐ๊ธฐ๋ฅผ ์†์ƒ์‹œํ‚ด
  • += ์—ฐ์‚ฐ์ด ์ค‘๋‹จ๋˜์–ด ์—…๋ฐ์ดํŠธ ์†์‹ค ๋ฐœ์ƒ
  • ์˜ˆ์ƒ ํ•ฉ๊ณ„ 6.0 (0+1+2+3)์ด์ง€๋งŒ, ๊ฒฝ์Ÿ ์ƒํƒœ๋กœ ์ธํ•ด 0.0์ด ๋จ
  • barrier()๊ฐ€ ๋„ˆ๋ฌด ๋Šฆ๊ฒŒ ์˜ด - ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์ด๋ฏธ ๋ฐœ์ƒํ•œ ํ›„

๊ฒฝ์Ÿ ์ƒํƒœ๋ž€?

๊ฒฝ์Ÿ ์ƒํƒœ๋Š” ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฐ์ดํ„ฐ์— ๋™์‹œ์— ์ ‘๊ทผํ•˜๊ณ , ๊ฒฐ๊ณผ๊ฐ€ ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ์Šค๋ ˆ๋“œ ์‹คํ–‰ ํƒ€์ด๋ฐ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์งˆ ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์š” ํŠน์„ฑ:

  • ๋น„๊ฒฐ์ •์  ๋™์ž‘: ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ๋‹ค๋ฅธ ์‹คํ–‰์—์„œ ๋‹ค๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์Œ
  • ํƒ€์ด๋ฐ ์˜์กด์ : ๊ฒฐ๊ณผ๊ฐ€ ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ โ€œ๊ฒฝ์Ÿ์—์„œ ์ด๊ธฐ๋Š”์ง€โ€œ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง
  • ์žฌํ˜„ํ•˜๊ธฐ ์–ด๋ ค์›€: ํŠน์ • ์กฐ๊ฑด์ด๋‚˜ ํ•˜๋“œ์›จ์–ด์—์„œ๋งŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ

GPU ํŠน์œ ์˜ ์œ„ํ—˜์„ฑ

๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์˜ ์˜ํ–ฅ:

  • ์›Œํ”„ ์ˆ˜์ค€ ์†์ƒ: ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์ „์ฒด ์›Œํ”„(32๊ฐœ ์Šค๋ ˆ๋“œ)์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ๋ฌธ์ œ: ๊ฒฝ์Ÿ์œผ๋กœ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๊นจ์งˆ ์ˆ˜ ์žˆ์Œ
  • ์ปค๋„ ์ „์ฒด ์‹คํŒจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์†์ƒ์ด ์ „์ฒด GPU ์ปค๋„์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Œ

ํ•˜๋“œ์›จ์–ด ์ฐจ์ด:

  • ๋‹ค๋ฅธ GPU ์•„ํ‚คํ…์ฒ˜: ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ GPU ๋ชจ๋ธ๋งˆ๋‹ค ๋‹ค๋ฅด๊ฒŒ ๋‚˜ํƒ€๋‚  ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต: L1 ์บ์‹œ, L2 ์บ์‹œ, ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๊ฐ๊ฐ ๋‹ค๋ฅธ ๊ฒฝ์Ÿ ๋™์ž‘์„ ๋ณด์ผ ์ˆ˜ ์žˆ์Œ
  • ์›Œํ”„ ์Šค์ผ€์ค„๋ง: ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ์Šค์ผ€์ค„๋ง์ด ๋‹ค๋ฅธ ๊ฒฝ์Ÿ ์ƒํƒœ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ๋…ธ์ถœ์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ

์ „๋žต: ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด

ํ•ต์‹ฌ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•œ ๋™์‹œ ์“ฐ๊ธฐ๋ฅผ ์—†์• ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. Single writer: ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ(์œ„์น˜ (0,0))๋งŒ ๋ชจ๋“  ๋ˆ„์  ์ž‘์—… ์ˆ˜ํ–‰
  2. ๋กœ์ปฌ ๋ˆ„์ : ์œ„์น˜ (0,0) ์Šค๋ ˆ๋“œ๊ฐ€ ๋กœ์ปฌ ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๋ฐ˜๋ณต์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ํ”ผํ•จ
  3. ๋‹จ์ผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ: ๋‹จ์ผ ์“ฐ๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ write-write ๊ฒฝ์Ÿ ์ œ๊ฑฐ
  4. ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”: writer๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ๋„๋ก ๋ณด์žฅ
  5. ๋‹ค์ค‘ ์ฝ๊ธฐ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์•ˆ์ „ํ•˜๊ฒŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ฝ์Œ

๋‹จ๊ณ„๋ณ„ ์†”๋ฃจ์…˜ ๋ถ„์„

1๋‹จ๊ณ„: ์Šค๋ ˆ๋“œ ์‹๋ณ„

if row == 0 and col == 0:

์ง์ ‘ ์ขŒํ‘œ ๊ฒ€์‚ฌ๋กœ ์œ„์น˜ (0,0)์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

2๋‹จ๊ณ„: ๋‹จ์ผ ์Šค๋ ˆ๋“œ ๋ˆ„์ 

if row == 0 and col == 0:
    local_sum = Scalar[dtype](0.0)
    for r in range(size):
        for c in range(size):
            local_sum += rebind[Scalar[dtype]](a[r, c])
    shared_sum[0] = local_sum  # ๋‹จ์ผ ์“ฐ๊ธฐ ์—ฐ์‚ฐ

์œ„์น˜ (0,0)์˜ ์Šค๋ ˆ๋“œ๋งŒ ๋ชจ๋“  ๋ˆ„์  ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

3๋‹จ๊ณ„: ๋™๊ธฐํ™” ๋ฐฐ๋ฆฌ์–ด

barrier()  # ์Šค๋ ˆ๋“œ (0,0)์ด ์™„๋ฃŒํ•œ ํ›„ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ๋„๋ก ๋ณด์žฅ

๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ„์น˜ (0,0)์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์ ์„ ๋งˆ์น  ๋•Œ๊นŒ์ง€ ๊ธฐ๋‹ค๋ฆฝ๋‹ˆ๋‹ค.

4๋‹จ๊ณ„: ์•ˆ์ „ํ•œ ๋ณ‘๋ ฌ ์ฝ๊ธฐ

if row < size and col < size:
    output[row, col] = shared_sum[0]

๋™๊ธฐํ™” ํ›„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์•ˆ์ „ํ•˜๊ฒŒ ๊ฒฐ๊ณผ๋ฅผ ์ฝ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํšจ์œจ์„ฑ์— ๊ด€ํ•œ ์ค‘์š” ์‚ฌํ•ญ

์ด ์†”๋ฃจ์…˜์€ ํšจ์œจ์„ฑ๋ณด๋‹ค ์ •ํ™•์„ฑ์„ ์šฐ์„ ํ•ฉ๋‹ˆ๋‹ค. ๊ฒฝ์Ÿ ์ƒํƒœ๋Š” ์ œ๊ฑฐํ•˜์ง€๋งŒ, ์œ„์น˜ (0,0) ์Šค๋ ˆ๋“œ๋งŒ ๋ˆ„์ ์— ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€ GPU ์„ฑ๋Šฅ์— ์ตœ์ ์ด ์•„๋‹™๋‹ˆ๋‹ค - ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์žฅ์น˜์—์„œ ์‚ฌ์‹ค์ƒ ์ง๋ ฌ ๊ณ„์‚ฐ์„ ํ•˜๋Š” ์…ˆ์ž…๋‹ˆ๋‹ค.

์ด์–ด์„œ Puzzle 11: ํ’€๋ง์—์„œ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ํ™œ์šฉํ•ด ๊ณ ์„ฑ๋Šฅ ํ•ฉ์‚ฐ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ๋„ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๋Š” ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ์ •ํ™•์„ฑ ์šฐ์„ ์˜ ๊ธฐ์ดˆ๋ฅผ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค - ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๊ณ  ๋‚˜๋ฉด, Puzzle 11์—์„œ ์ •ํ™•์„ฑ๊ณผ ์„ฑ๋Šฅ ๋ชจ๋‘๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

๊ฒ€์ฆ

pixi run compute-sanitizer --tool racecheck mojo solutions/p10/p10.mojo --race-condition

์˜ˆ์ƒ ์ถœ๋ ฅ:

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running race condition example...
out: HostBuffer([6.0, 6.0, 6.0, 6.0])
expected: HostBuffer([6.0, 6.0, 6.0, 6.0])
โœ… Race condition test PASSED! (racecheck will find hazards)
========= RACECHECK SUMMARY: 0 hazards displayed (0 errors, 0 warnings)

โœ… ์„ฑ๊ณต: ํ…Œ์ŠคํŠธ๊ฐ€ ํ†ต๊ณผํ•˜๊ณ  ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ํƒ์ง€๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค!

Puzzle 11: ํ’€๋ง

๊ฐœ์š”

1D TileTensor a์—์„œ ๊ฐ ์œ„์น˜์˜ ์ง์ „ 3๊ฐœ ๊ฐ’์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

Pooling ์‹œ๊ฐํ™” Pooling ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • TileTensor๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • Puzzle 8์—์„œ ๋‹ค๋ฃฌ TileTensor ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌํ•˜๊ธฐ
  • ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ

ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๊ธฐ๋ฐ˜ ์—ฐ์‚ฐ์€ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ์œˆ๋„์šฐ ํฌ๊ธฐ: 3
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • TileTensor ํ• ๋‹น: stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]()) ์‚ฌ์šฉ
  • ์œˆ๋„์šฐ ์ ‘๊ทผ: 3๊ฐœ์งœ๋ฆฌ ์œˆ๋„์šฐ์— ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ฒ˜์Œ ๋‘ ์œ„์น˜๋Š” ํŠน์ˆ˜ ์ผ€์ด์Šค
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)


def pooling(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    # Allocate shared memory using stack_allocation
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FIX ME IN (roughly 10 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p11/p11.mojo

ํŒ
  1. TileTensor์™€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ
  2. ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: shared[local_i] = a[global_i]
  3. ์ฒ˜์Œ ๋‘ ์œ„์น˜๋ฅผ ํŠน์ˆ˜ ์ผ€์ด์Šค๋กœ ์ฒ˜๋ฆฌ
  4. ์œˆ๋„์šฐ ์—ฐ์‚ฐ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  5. ๊ฒฝ๊ณ„ ์ดˆ๊ณผ ์ ‘๊ทผ์— ๊ฐ€๋“œ ์ถ”๊ฐ€

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p11
pixi run -e amd p11
pixi run -e apple p11
uv run poe p11

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

์†”๋ฃจ์…˜

def pooling(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    # Allocate shared memory using stack_allocation
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        output[0] = shared[0]
    elif global_i == 1:
        output[1] = shared[0] + shared[1]
    # Handle general case
    elif 1 < global_i < size:
        output[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


TileTensor๋ฅผ ํ™œ์šฉํ•œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ฃผ์š” ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •

    • TileTensor๊ฐ€ ์ฃผ์†Œ ๊ณต๊ฐ„(address_space)์œผ๋กœ ๋ธ”๋ก ๋กœ์ปฌ ์ €์žฅ์†Œ๋ฅผ ์ƒ์„ฑ:

      shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]())
      
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ:

      Input array:  [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      Block shared: [0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0]
      
    • barrier()๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

  2. ๊ฒฝ๊ณ„ ์ผ€์ด์Šค

    • ์œ„์น˜ 0: ์ฒซ ๋ฒˆ์งธ ๊ฐ’ ํ•˜๋‚˜๋งŒ

      output[0] = shared[0] = 0.0
      
    • ์œ„์น˜ 1: ์ฒ˜์Œ ๋‘ ๊ฐ’์˜ ํ•ฉ

      output[1] = shared[0] + shared[1] = 0.0 + 1.0 = 1.0
      
  3. ๋ฉ”์ธ ์œˆ๋„์šฐ ์—ฐ์‚ฐ

    • ์œ„์น˜ 2 ์ดํ›„:

      Position 2: shared[0] + shared[1] + shared[2] = 0.0 + 1.0 + 2.0 = 3.0
      Position 3: shared[1] + shared[2] + shared[3] = 1.0 + 2.0 + 3.0 = 6.0
      Position 4: shared[2] + shared[3] + shared[4] = 2.0 + 3.0 + 4.0 = 9.0
      ...
      
    • TileTensor์˜ ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ:

      # 3๊ฐœ์งœ๋ฆฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
      window_sum = shared[i-2] + shared[i-1] + shared[i]
      
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

    • ์Šค๋ ˆ๋“œ๋งˆ๋‹ค ๊ณต์œ  ํ…์„œ๋กœ ์ „์—ญ ์ฝ๊ธฐ 1ํšŒ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์ด์›ƒ ์ ‘๊ทผ
    • TileTensor์˜ ์žฅ์ :
      • ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
      • ์ž์—ฐ์Šค๋Ÿฌ์šด ์œˆ๋„์šฐ ์ธ๋ฑ์‹ฑ
      • ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
      • ์ „ ๊ณผ์ •์— ๊ฑธ์นœ ํƒ€์ž… ์•ˆ์ „์„ฑ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์„ฑ๋Šฅ๊ณผ TileTensor์˜ ์•ˆ์ „์„ฑ ๋ฐ ํŽธ์˜์„ฑ์„ ๊ฒฐํ•ฉํ•œ ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™”
  • ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ฐ„์†Œํ™”
  • ๊น”๋”ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
  • ๋ณ‘ํ•ฉ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€

์ตœ์ข… ์ถœ๋ ฅ์€ ๋ˆ„์  ์œˆ๋„์šฐ ํ•ฉ๊ณ„์ž…๋‹ˆ๋‹ค:

[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
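์œ„ ์ปค๋„์ด ๊ณ„์‚ฐํ•˜๋Š” ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ•ฉ๊ณ„๋ฅผ, GPU ์ปค๋„์ด ์•„๋‹Œ ์ˆœ์ฐจ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ๊ฐœ๋… ํ™•์ธ์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (`sliding_window_sum`์€ ์„ค๋ช…์„ ์œ„ํ•ด ์ž„์˜๋กœ ์ง€์€ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค):

```python
def sliding_window_sum(a):
    # ๊ฐ ์ถœ๋ ฅ ์œ„์น˜ i์— ๋Œ€ํ•ด ์ง์ „ ์ตœ๋Œ€ 3๊ฐœ ์›์†Œ(i-2, i-1, i)์˜ ํ•ฉ์„ ๊ณ„์‚ฐ
    # ์œ„์น˜ 0๊ณผ 1์€ ์œˆ๋„์šฐ๊ฐ€ ์งง์•„์ง€๋Š” ํŠน์ˆ˜ ์ผ€์ด์Šค์— ํ•ด๋‹น
    out = []
    for i in range(len(a)):
        out.append(sum(a[max(0, i - 2): i + 1]))
    return out

print(sliding_window_sum([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]))
# [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```

GPU ๋ฒ„์ „์€ ๋™์ผํ•œ ๊ณ„์‚ฐ์„ ์Šค๋ ˆ๋“œ 8๊ฐœ๊ฐ€ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•˜๊ณ , ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋•๋ถ„์— ๊ฐ ์ž…๋ ฅ์„ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ํ•œ ๋ฒˆ๋งŒ ์ฝ์Šต๋‹ˆ๋‹ค.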

Puzzle 12: ๋‚ด์ 

๊ฐœ์š”

1D TileTensor a์™€ 1D TileTensor b์˜ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor output(๋‹จ์ผ ๊ฐ’)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ๋‚ด์ ์€ ํฌ๊ธฐ๊ฐ€ ๊ฐ™์€ ๋‘ ๋ฒกํ„ฐ์—์„œ ๋Œ€์‘ํ•˜๋Š” ์›์†Œ๋ผ๋ฆฌ ๊ณฑํ•œ ๋’ค, ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ๋ชจ๋‘ ๋”ํ•ด ํ•˜๋‚˜์˜ ์ˆซ์ž(์Šค์นผ๋ผ)๋ฅผ ๊ตฌํ•˜๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, ๋‘ ๋ฒกํ„ฐ๊ฐ€ ๋‹ค์Œ๊ณผ ๊ฐ™์„ ๋•Œ:

\[a = [a_{1}, a_{2}, โ€ฆ, a_{n}] \] \[b = [b_{1}, b_{2}, โ€ฆ, b_{n}] \]

๋‚ด์ ์€ ์ด๋ ‡๊ฒŒ ๊ตฌํ•ฉ๋‹ˆ๋‹ค: \[a \cdot b = a_{1}b_{1} + a_{2}b_{2} + โ€ฆ + a_{n}b_{n}\]

์ฐธ๊ณ : ๊ฐ ์œ„์น˜๋งˆ๋‹ค ์Šค๋ ˆ๋“œ 1๊ฐœ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ๋ธ”๋ก๋‹น ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๋‚ด์  ์‹œ๊ฐํ™” ๋‚ด์  ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • Puzzle 8, Puzzle 11์—์„œ ์ด์–ด์ง€๋Š” TileTensor ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • address_space๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ
  • ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ํ˜‘๋ ฅํ•ด ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๊ฐ€๋Š” ๊ณผ์ •
  • ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํ…์„œ ์—ฐ์‚ฐ

ํ•ต์‹ฌ์€ TileTensor๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋ฉด์„œ๋„, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์‚ด๋ฆฌ๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ์ถœ๋ ฅ ํฌ๊ธฐ: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ

์ฐธ๊ณ :

  • TileTensor ํ• ๋‹น: stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB]()) ์‚ฌ์šฉ
  • ์š”์†Œ ์ ‘๊ทผ: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋Š” ์ž์—ฐ์Šค๋Ÿฌ์šด ์ธ๋ฑ์‹ฑ
  • ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ: ์ž…๋ ฅ์šฉ๊ณผ ์ถœ๋ ฅ์šฉ ๋ ˆ์ด์•„์›ƒ์„ ๋”ฐ๋กœ ๊ตฌ์„ฑ
  • ์Šค๋ ˆ๋“œ ์กฐ์œจ: ๋™์ผํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์œผ๋กœ barrier() ์‚ฌ์šฉ

์™„์„ฑํ•  ์ฝ”๋“œ

from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace
from layout import TileTensor
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation


comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime out_layout = row_major[1]()
comptime LayoutType = type_of(layout)
comptime OutLayout = type_of(out_layout)


def dot_product(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p12/p12.mojo

ํŒ
  1. TileTensor์™€ address_space๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒ์„ฑ
  2. shared[local_i]์— a[global_i] * b[global_i]๋ฅผ ์ €์žฅ
  3. barrier()์™€ ํ•จ๊ป˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด ์ ์šฉ
  4. ์Šค๋ ˆ๋“œ 0์ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ output[0]์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p12
pixi run -e amd p12
pixi run -e apple p12
uv run poe p12

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

์†”๋ฃจ์…˜

def dot_product(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Compute element-wise multiplication into shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Synchronize threads within block
    barrier()

    # Parallel reduction in shared memory
    var stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        output[0] = shared[0]


TileTensor๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์œผ๋กœ ๋‚ด์ ์„ ๊ณ„์‚ฐํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ง๊ด€์ ์ธ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๊ณฑ์…ˆ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

shared[local_i] = a[global_i] * b[global_i]

2๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜

๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹ํ•˜๋Š” ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์ž…๋‹ˆ๋‹ค:

์ดˆ๊ธฐ๊ฐ’:    [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]
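์œ„ ํŠธ๋ ˆ์ด์Šค์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์„ ์ˆœ์ฐจ ํŒŒ์ด์ฌ์œผ๋กœ ์žฌํ˜„ํ•œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ๊ฐ while ๋ฐ˜๋ณต์ด ์ปค๋„์˜ ํ•œ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„(barrier ์‚ฌ์ด ๊ตฌ๊ฐ„)์— ๋Œ€์‘ํ•ฉ๋‹ˆ๋‹ค:

```python
def tree_reduce(values, tpb=8):
    shared = list(values)
    stride = tpb // 2
    while stride > 0:
        # ๊ฐ ๋‹จ๊ณ„: ์•ž์ชฝ stride๊ฐœ ์œ„์น˜๊ฐ€ stride๋งŒํผ ๋–จ์–ด์ง„ ๊ฐ’์„ ๋”ํ•จ
        for local_i in range(stride):
            shared[local_i] += shared[local_i + stride]
        stride //= 2
    return shared[0]  # ์ตœ์ข… ํ•ฉ๊ณ„๋Š” ์œ„์น˜ 0์— ๋ชจ์ž„

products = [i * i for i in range(8)]  # [0, 1, 4, 9, 16, 25, 36, 49]
print(tree_reduce(products))  # 140
```

GPU์—์„œ๋Š” ์•ˆ์ชฝ for ๋ฃจํ”„์˜ ๋ฐ˜๋ณต๋“ค์ด ์Šค๋ ˆ๋“œ๋ณ„๋กœ ๋™์‹œ์— ์‹คํ–‰๋˜๋ฏ€๋กœ, ๋‹จ๊ณ„ ์ˆ˜๊ฐ€ \(O(\log n)\)์ด ๋ฉ๋‹ˆ๋‹ค.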

๊ตฌํ˜„์˜ ํ•ต์‹ฌ ํŠน์ง•

  1. ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ:

    • address_space ํŒŒ๋ผ๋ฏธํ„ฐ ํ•˜๋‚˜๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊น”๋”ํ•˜๊ฒŒ ํ• ๋‹น
    • ํƒ€์ž… ์•ˆ์ „ํ•œ ์—ฐ์‚ฐ์ด ๋ณด์žฅ๋˜๊ณ 
    • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์ž๋™์œผ๋กœ ๋”ฐ๋ผ์˜ค๋ฉฐ
    • ์ธ๋ฑ์‹ฑ๋„ ๋ ˆ์ด์•„์›ƒ์„ ์ธ์‹
  2. ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”:

    • ์ดˆ๊ธฐ ๊ณฑ์…ˆ์ด ๋๋‚˜๋ฉด barrier()
    • ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด๋งˆ๋‹ค barrier()
    • ์Šค๋ ˆ๋“œ ๊ฐ„ ์•ˆ์ „ํ•œ ์กฐ์œจ ๋ณด์žฅ
  3. ๋ฆฌ๋•์…˜ ๋กœ์ง:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
  4. ์„ฑ๋Šฅ์ƒ ์ด์ :

    • \(O(\log n)\) ์‹œ๊ฐ„ ๋ณต์žก๋„
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
    • ์ตœ์†Œํ•œ์˜ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ

TileTensor ๋ฒ„์ „์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์˜ ํšจ์œจ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ, ์—ฌ๊ธฐ์— ๋”ํ•ด:

  • ํƒ€์ž… ์•ˆ์ „์„ฑ์ด ํ•œ์ธต ๊ฐ•ํ™”๋˜๊ณ 
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๊ฐ€ ๋” ๊น”๋”ํ•ด์ง€๋ฉฐ
  • ๋ ˆ์ด์•„์›ƒ์„ ์ž๋™์œผ๋กœ ์ธ์‹ํ•˜๊ณ 
  • ์ธ๋ฑ์‹ฑ ๋ฌธ๋ฒ•๋„ ์ž์—ฐ์Šค๋Ÿฌ์›Œ์ง‘๋‹ˆ๋‹ค

๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”์˜ ์ค‘์š”์„ฑ

๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ ์‚ฌ์ด์˜ barrier()๋Š” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ์ด์œ ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

barrier()๊ฐ€ ์—†์œผ๋ฉด ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค:

์ดˆ๊ธฐ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 ์ฝ๊ธฐ: shared[0] = 0, shared[4] = 16
Thread 1 ์ฝ๊ธฐ: shared[1] = 1, shared[5] = 25
Thread 2 ์ฝ๊ธฐ: shared[2] = 4, shared[6] = 36
Thread 3 ์ฝ๊ธฐ: shared[3] = 9, shared[7] = 49

barrier ์—†์ด:
- Thread 0 ์“ฐ๊ธฐ: shared[0] = 0 + 16 = 16
- Thread 1์ด Thread 0๋ณด๋‹ค ๋จผ์ € ๋‹ค์Œ ๋‹จ๊ณ„(stride = 2)๋กœ ๋„˜์–ด๊ฐ€์„œ
  16์ด ์•„๋‹Œ ์ด์ „ ๊ฐ’ shared[0] = 0์„ ์ฝ์–ด๋ฒ„๋ฆฝ๋‹ˆ๋‹ค!

barrier()๊ฐ€ ์žˆ์œผ๋ฉด:

Step 1 (stride = 4):
๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ ๊ธฐ๋ก:
[16 26 40 58 16 25 36 49]
barrier()๊ฐ€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์—๊ฒŒ ์ด ๊ฐ’๋“ค์ด ๋ณด์ด๋„๋ก ๋ณด์žฅ

Step 2 (stride = 2):
์ด์ œ ์—…๋ฐ์ดํŠธ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

barrier()๋Š” ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  1. ํ˜„์žฌ ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚œ ๋’ค์—์•ผ ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐ
  2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์‹  ๊ฐ’์„ ๋ณผ ์ˆ˜ ์žˆ์Œ
  3. ์–ด๋–ค ์Šค๋ ˆ๋“œ๋„ ์•ž์„œ ๋‚˜๊ฐ€์ง€ ์•Š์Œ
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ์ผ๊ด€๋œ ์ƒํƒœ๋ฅผ ์œ ์ง€

์ด๋Ÿฐ ๋™๊ธฐํ™” ์ง€์ ์ด ์—†์œผ๋ฉด:

  • ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ 
  • ์Šค๋ ˆ๋“œ๊ฐ€ ์ด๋ฏธ ์ง€๋‚œ ๊ฐ’์„ ์ฝ๊ฒŒ ๋˜๋ฉฐ
  • ์‹คํ–‰ํ•  ๋•Œ๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์ง€๊ณ 
  • ์ตœ์ข… ํ•ฉ๊ณ„๊ฐ€ ํ‹€์–ด์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
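barrier()์˜ ์—ญํ• ์„ CPU ์Šค๋ ˆ๋“œ๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ํŒŒ์ด์ฌ์˜ threading.Barrier๊ฐ€ ์ปค๋„์˜ barrier()์™€ ๊ฐ™์€ ์—ญํ• ์„ ํ•˜์—ฌ, ๋ชจ๋“  "์Šค๋ ˆ๋“œ"๊ฐ€ ํ˜„์žฌ ๋‹จ๊ณ„์˜ ์“ฐ๊ธฐ๋ฅผ ๋๋‚ธ ๋’ค์—์•ผ ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ๋„˜์–ด๊ฐ‘๋‹ˆ๋‹ค (GPU ์‹คํ–‰ ๋ชจ๋ธ๊ณผ ์™„์ „ํžˆ ๊ฐ™์ง€๋Š” ์•Š์€, ๊ฐœ๋… ์„ค๋ช…์šฉ ๋น„์œ ์ž…๋‹ˆ๋‹ค):

```python
import threading

def parallel_reduce(values, tpb=8):
    shared = list(values)
    barrier = threading.Barrier(tpb)  # ์ปค๋„์˜ barrier()์— ๋Œ€์‘

    def worker(local_i):
        stride = tpb // 2
        while stride > 0:
            if local_i < stride:
                shared[local_i] += shared[local_i + stride]
            # ์ด ๋‹จ๊ณ„์˜ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋๋‚  ๋•Œ๊นŒ์ง€ ์ „์›์ด ๋Œ€๊ธฐ -> ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(tpb)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]

print(parallel_reduce([i * i for i in range(8)]))  # 140
```

barrier.wait()๋ฅผ ์ œ๊ฑฐํ•˜๋ฉด ๋ณธ๋ฌธ์—์„œ ์„ค๋ช…ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์•„์ง ๊ฐฑ์‹ ๋˜์ง€ ์•Š์€ ๊ฐ’์„ ์ฝ์„ ์ˆ˜ ์žˆ์–ด, ์‹คํ–‰๋งˆ๋‹ค ๊ฒฐ๊ณผ๊ฐ€ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.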

Puzzle 13: 1D ํ•ฉ์„ฑ๊ณฑ

TileTensor๋กœ ์ „ํ™˜ํ•˜๊ธฐ

์ง€๊ธˆ๊นŒ์ง€ GPU ํผ์ฆ ์—ฌ์ •์—์„œ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ์— ๋Œ€ํ•œ ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํ•จ๊ป˜ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค:

  1. UnsafePointer๋ฅผ ์‚ฌ์šฉํ•œ ํฌ์ธํ„ฐ ์ง์ ‘ ์กฐ์ž‘ ๋ฐฉ์‹์˜ raw ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ
  2. ๊ฐ•๋ ฅํ•œ address_space ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•˜๋Š”, ๋ณด๋‹ค ๊ตฌ์กฐํ™”๋œ TileTensor

์ด ํผ์ฆ๋ถ€ํ„ฐ๋Š” TileTensor๋กœ ์™„์ „ํžˆ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ถ”์ƒํ™”๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ํƒ€์ž… ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ์˜ ๋ช…ํ™•ํ•œ ํ‘œํ˜„
  • ์ฝ”๋“œ ์œ ์ง€๋ณด์ˆ˜์„ฑ ํ–ฅ์ƒ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ จ ๋ฒ„๊ทธ ๋ฐœ์ƒ ๊ฐ€๋Šฅ์„ฑ ๊ฐ์†Œ
  • ๋‚ด๋ถ€ ์—ฐ์‚ฐ์˜ ์˜๋„๋ฅผ ๋” ์ž˜ ๋“œ๋Ÿฌ๋‚ด๋Š” ํ‘œํ˜„๋ ฅ ์žˆ๋Š” ์ฝ”๋“œ
  • ์•ž์œผ๋กœ ์ฐจ์ฐจ ์•Œ์•„๊ฐˆ ๋” ๋งŽ์€ ๊ฒƒ๋“ค!

์ด๋Ÿฌํ•œ ์ „ํ™˜์€ Mojo ๐Ÿ”ฅ์˜ ํ˜„๋Œ€์  GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ฒ” ์‚ฌ๋ก€์™€ ๋งž๋‹ฟ์•„ ์žˆ์Šต๋‹ˆ๋‹ค. ๋†’์€ ์ˆ˜์ค€์˜ ์ถ”์ƒํ™”๋กœ ๋ณต์žก์„ฑ์„ ๊ด€๋ฆฌํ•˜๋ฉด์„œ๋„ ์„ฑ๋Šฅ์€ ๊ทธ๋Œ€๋กœ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ์ด๋ฏธ์ง€ ๋ถ„์„์—์„œ ํ•ฉ์„ฑ๊ณฑ(convolution)์€ ๋‘ ์‹œํ€€์Šค๋ฅผ ๊ฒฐํ•ฉํ•ด ์ƒˆ๋กœ์šด ์‹œํ€€์Šค๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ํ•ต์‹ฌ ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ๋Š” ์ž…๋ ฅ ๋ฐฐ์—ด ์œ„๋กœ ์ปค๋„์„ ์Šฌ๋ผ์ด๋”ฉํ•˜๋ฉด์„œ ๊ฐ ์ถœ๋ ฅ ์›์†Œ๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ GPU์—์„œ ๊ตฌํ˜„ํ•ด ๋ด…๋‹ˆ๋‹ค.

TileTensor ์ถ”์ƒํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ a์™€ ๋ฒกํ„ฐ b์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

1D ํ•ฉ์„ฑ๊ณฑ ์‹œ๊ฐํ™” 1D ํ•ฉ์„ฑ๊ณฑ ์‹œ๊ฐํ™”

ํ•ฉ์„ฑ๊ณฑ์ด ์ฒ˜์Œ์ด๋ผ๋ฉด, ๊ฐ€์ค‘์น˜๊ฐ€ ์ ์šฉ๋œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค. ๊ฐ ์œ„์น˜์—์„œ ์ปค๋„ ๊ฐ’๊ณผ ๋Œ€์‘ํ•˜๋Š” ์ž…๋ ฅ ๊ฐ’์„ ๊ณฑํ•œ ๋’ค ํ•ฉ์‚ฐํ•ฉ๋‹ˆ๋‹ค. ์ˆ˜ํ•™์  ํ‘œ๊ธฐ๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

\[\Large output[i] = \sum_{j=0}^{\text{CONV}-1} a[i+j] \cdot b[j] \]

์˜์‚ฌ ์ฝ”๋“œ๋กœ ํ‘œํ˜„ํ•œ 1D ํ•ฉ์„ฑ๊ณฑ:

for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:
            ret[i] += a_host[i + j] * b_host[j]
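์œ„ ์˜์‚ฌ ์ฝ”๋“œ๋ฅผ ์ด ํผ์ฆ์˜ ๊ธฐ๋ณธ ๋ฒ„์ „ ๊ตฌ์„ฑ(SIZE = 6, CONV = 3, ์ž…๋ ฅ๊ณผ ์ปค๋„ ๋ชจ๋‘ 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ์ˆ˜์—ด)์œผ๋กœ ์‹คํ–‰ํ•ด ๋ณด๋Š” ํŒŒ์ด์ฌ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
SIZE, CONV = 6, 3
a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
b = [0.0, 1.0, 2.0]

out = [0.0] * SIZE
for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:  # ๋ฐฐ์—ด ๋ ๋ฐ”๊นฅ์€ 0์œผ๋กœ ์ทจ๊ธ‰ (์ œ๋กœ ํŒจ๋”ฉ)
            out[i] += a[i + j] * b[j]

print(out)  # [5.0, 8.0, 11.0, 14.0, 5.0, 0.0]
```

์ถœ๋ ฅ์ด ๋’ค์— ๋‚˜์˜ค๋Š” ๊ธฐ๋Œ€๊ฐ’ HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])๊ณผ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.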

์ด ํผ์ฆ์€ ๋‹จ๊ณ„์ ์œผ๋กœ ์ดํ•ด๋ฅผ ์Œ“์•„๊ฐˆ ์ˆ˜ ์žˆ๋„๋ก ๋‘ ํŒŒํŠธ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค:

  • ๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „ ์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์„ธ์š”. ๋‹จ์ผ ๋ธ”๋ก์—์„œ TileTensor์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ํ•ฉ์„ฑ๊ณฑ ๊ตฌํ˜„์˜ ๊ธฐ์ดˆ๋ฅผ ์ตํž™๋‹ˆ๋‹ค.

  • โญ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „ ์ด์–ด์„œ ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜์–ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ ํ•ด์•ผ ํ•˜๋Š” ๋” ๊นŒ๋‹ค๋กœ์šด ๊ฒฝ์šฐ์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค. TileTensor์˜ ๊ธฐ๋Šฅ์„ ๋ณธ๊ฒฉ์ ์œผ๋กœ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฐ ๋ฒ„์ „์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ ์ธก๋ฉด์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋„์ „ ๊ณผ์ œ๋ฅผ ์ œ์‹œํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ์˜ ์›๋ฆฌ๋ฅผ ์ตํžŒ ๋‹ค์Œ, ๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „์—์„œ๋Š” ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋งˆ์ฃผ์น˜๋Š” ๋ณต์žกํ•œ ์ƒํ™ฉ์„ ๋‹ค๋ฃจ๋Š” ๋Šฅ๋ ฅ์„ ์‹œํ—˜ํ•ด ๋ด…๋‹ˆ๋‹ค.

๋‹จ์ผ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „

1D TileTensor a์™€ 1D TileTensor b์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • GPU์—์„œ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ ๊ตฌํ˜„ํ•˜๊ธฐ
  • ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ ๊ด€๋ฆฌํ•˜๊ธฐ
  • ๊ฒน์น˜๋Š” ์˜์—ญ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉํ•˜๊ธฐ

ํ•ต์‹ฌ์€ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์œ ์ง€ํ•˜๋ฉด์„œ๋„ ๊ฒน์น˜๋Š” ์›์†Œ์— ํšจ์œจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ์ž…๋ ฅ ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 6
  • ์ปค๋„ ํฌ๊ธฐ: CONV = 3
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: SIZE์™€ CONV ํฌ๊ธฐ์˜ ๋ฐฐ์—ด 2๊ฐœ

์ฐธ๊ณ :

  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ์ปค๋„์—์„œ ์›์†Œ๋ฅผ ํ•˜๋‚˜์”ฉ ๋กœ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์„ ์ €์žฅํ•˜๋Š” ๊ณต์œ  ๋ฐฐ์—ด
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: ์—ฐ์‚ฐ ์‹œ์ž‘ ์ „ ์Šค๋ ˆ๋“œ ๊ฐ„ ์กฐ์œจ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 6
comptime CONV = 3
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime in_layout = row_major[SIZE]()
comptime InLayout = type_of(in_layout)
comptime out_layout = row_major[SIZE]()
comptime OutLayout = type_of(out_layout)
comptime conv_layout = row_major[CONV]()
comptime ConvLayout = type_of(conv_layout)


def conv_1d_simple(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FILL ME IN (roughly 14 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p13/p13.mojo

ํŒ
  1. stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[SIZE]())์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  (์ปค๋„์šฉ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋„ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ row_major[CONV]() ํฌ๊ธฐ๋กœ ํ• ๋‹น)
  2. ์ž…๋ ฅ์„ shared_a[local_i]์—, ์ปค๋„์„ shared_b[local_i]์— ๋กœ๋“œ
  3. ๋ฐ์ดํ„ฐ ๋กœ๋“œ ํ›„ barrier() ํ˜ธ์ถœ
  4. ๊ฒฝ๊ณ„ ์•ˆ์—์„œ ๊ณฑ์„ ํ•ฉ์‚ฐ: if local_i + j < SIZE
  5. global_i < SIZE์ผ ๋•Œ๋งŒ ๊ฒฐ๊ณผ ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p13 --simple
pixi run -e amd p13 --simple
pixi run -e apple p13 --simple
uv run poe p13 --simple

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])

์†”๋ฃจ์…˜

def conv_1d_simple(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var shared_a = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[SIZE]())
    var shared_b = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[CONV]())
    if global_i < SIZE:
        shared_a[local_i] = a[global_i]

    if global_i < CONV:
        shared_b[local_i] = b[global_i]

    barrier()

    # Note: this is unsafe since there is no guard, so it could read `shared_a` beyond its bounds
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]

    # if global_i < SIZE:
    #     output[global_i] = local_sum

    # Safe and correct:
    if global_i < SIZE:
        # Note: using `var` with an explicit annotation lets us name the element type
        # `output.ElementType` is available on TileTensor
        var local_sum: output.ElementType = 0

        # Note: `comptime for` unrolls the loop at compile time given `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        comptime for j in range(CONV):
            # Bonus: do we need this check for this specific example with fixed SIZE, CONV
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum


๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•ด ๊ฒน์น˜๋Š” ์›์†Œ์— ํšจ์œจ์ ์œผ๋กœ ์ ‘๊ทผํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ

์ž…๋ ฅ ๋ฐฐ์—ด a:       [0  1  2  3  4  5]
์ปค๋„ b:          [0  1  2]

์—ฐ์‚ฐ ๊ณผ์ •

  1. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ:

    shared_a: [0  1  2  3  4  5]  // ์ž…๋ ฅ ๋ฐฐ์—ด
    shared_b: [0  1  2]           // ํ•ฉ์„ฑ๊ณฑ ์ปค๋„
    
  2. ๊ฐ ์œ„์น˜ i์— ๋Œ€ํ•œ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ:

    output[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
    output[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
    output[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
    output[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
    output[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
    output[5] = a[5]*b[0] + 0*b[1]   + 0*b[2]     = 5*0 + 0*1 + 0*2 = 0
    

๊ตฌํ˜„ ์ƒ์„ธ

  1. ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ ๋ฒ”์œ„์™€ ํšจ์œจ์„ฑ:
    • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๊ฐ€๋“œ๊ฐ€ ์—†๋Š” ๋น„ํšจ์œจ์  ์ ‘๊ทผ:

      # ๋น„ํšจ์œจ์  ๋ฒ„์ „ - ๊ฒฐ๊ณผ๊ฐ€ ์‚ฌ์šฉ๋˜์ง€ ์•Š์„ ์Šค๋ ˆ๋“œ๋„ ๋ชจ๋‘ ์—ฐ์‚ฐ ์ˆ˜ํ–‰
      local_sum = Scalar[dtype](0)
      for j in range(CONV):
          if local_i + j < SIZE:
              local_sum += shared_a[local_i + j] * shared_b[j]
      # ๋งˆ์ง€๋ง‰ ์“ฐ๊ธฐ๋งŒ ๊ฐ€๋“œ
      if global_i < SIZE:
          output[global_i] = local_sum
      
    • ํšจ์œจ์ ์ด๊ณ  ์˜ฌ๋ฐ”๋ฅธ ๊ตฌํ˜„:

      if global_i < SIZE:
          var local_sum: output.ElementType = 0  # var๋กœ ํƒ€์ž… ์ถ”๋ก  ํ™œ์šฉ
          # CONV๊ฐ€ ์ƒ์ˆ˜์ด๋ฏ€๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„์— ๋ฃจํ”„ ์ „๊ฐœ
          comptime for j in range(CONV):
              if local_i + j < SIZE:
                  local_sum += shared_a[local_i + j] * shared_b[j]
          output[global_i] = local_sum
      

ํ•ต์‹ฌ์ ์ธ ์ฐจ์ด๋Š” ๊ฐ€๋“œ์˜ ์œ„์น˜์ž…๋‹ˆ๋‹ค. ๋น„ํšจ์œจ์  ๋ฒ„์ „์€ global_i >= SIZE์ธ ์Šค๋ ˆ๋“œ๋ฅผ ํฌํ•จํ•ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•œ ๋’ค, ๋งˆ์ง€๋ง‰ ์“ฐ๊ธฐ์—์„œ๋งŒ ๊ฐ€๋“œ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋กœ ์ธํ•ด:

  • ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ: ์œ ํšจ ๋ฒ”์œ„ ๋ฐ–์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์“ธ๋ชจ์—†๋Š” ์ž‘์—…์„ ์ˆ˜ํ–‰
  • ํšจ์œจ ์ €ํ•˜: ์‚ฌ์šฉ๋˜์ง€ ์•Š์„ ์—ฐ์‚ฐ์— ์ž์› ์†Œ๋น„
  • GPU ํ™œ์šฉ๋„ ์ €ํ•˜: ์˜๋ฏธ ์—†๋Š” ๊ณ„์‚ฐ์— GPU ์ฝ”์–ด๋ฅผ ๋‚ญ๋น„

ํšจ์œจ์  ๋ฒ„์ „์€ ์œ ํšจํ•œ global_i ๊ฐ’์„ ๊ฐ€์ง„ ์Šค๋ ˆ๋“œ๋งŒ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ฏ€๋กœ GPU ์ž์›์„ ๋” ์ž˜ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.

  1. ์ฃผ์š” ๊ตฌํ˜„ ํŠน์ง•:

    • var์™€ output.ElementType์œผ๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์ถ”๋ก 
    • comptime for๋กœ ํ•ฉ์„ฑ๊ณฑ ๋ฃจํ”„๋ฅผ ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ
    • ์—„๊ฒฉํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ ํ™•๋ณด
    • TileTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ์œผ๋กœ ์ฝ”๋“œ ์•ˆ์ „์„ฑ ํ–ฅ์ƒ
  2. ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ:

    • ์ž…๋ ฅ ๋ฐฐ์—ด๊ณผ ์ปค๋„ ๋ชจ๋‘ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
    • ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ 1ํšŒ ๋กœ๋“œ
    • ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์  ์žฌ์‚ฌ์šฉ
  3. ์Šค๋ ˆ๋“œ ์กฐ์œจ:

    • barrier()๋กœ ๋ชจ๋“  ๋ฐ์ดํ„ฐ ๋กœ๋“œ๊ฐ€ ๋๋‚œ ํ›„ ์—ฐ์‚ฐ ์‹œ์ž‘์„ ๋ณด์žฅ
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐ
    • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€
  4. ์„ฑ๋Šฅ ์ตœ์ ํ™”:

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์†Œํ™”
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋น ๋ฅธ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
    • ๋ฉ”์ธ ์—ฐ์‚ฐ ๋ฃจํ”„์—์„œ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ํšŒํ”ผ
    • @parameter ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋ฅผ ํ†ตํ•œ ๋ฃจํ”„ ์ „๊ฐœ

๋ธ”๋ก ๊ฒฝ๊ณ„ ๋ฒ„์ „

1D TileTensor a์™€ 1D TileTensor b์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ณ„์‚ฐํ•˜์—ฌ 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ์ฝ๊ธฐ 2ํšŒ, ์ „์—ญ ์“ฐ๊ธฐ 1ํšŒ๋งŒ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ์ž…๋ ฅ ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE_2 = 15
  • ์ปค๋„ ํฌ๊ธฐ: CONV_2 = 4
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ์ž…๋ ฅ์šฉ TPB + CONV_2 - 1๊ฐœ

์ฐธ๊ณ :

  • ํ™•์žฅ ๋กœ๋”ฉ: ๊ฒฝ๊ณ„ ๊ฒน์นจ ์˜์—ญ์„ ๊ณ ๋ ค
  • ๋ธ”๋ก ๊ฐ€์žฅ์ž๋ฆฌ: ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜๋Š” ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ํšจ์œจ์  ํ™œ์šฉ
  • ๋™๊ธฐํ™”: ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๊ฐ„ ์กฐ์œจ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE_2 = 15
comptime CONV_2 = 4
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime in_2_layout = row_major[SIZE_2]()
comptime In2Layout = type_of(in_2_layout)
comptime out_2_layout = row_major[SIZE_2]()
comptime Out2Layout = type_of(out_2_layout)
comptime conv_2_layout = row_major[CONV_2]()
comptime Conv2Layout = type_of(conv_2_layout)


def conv_1d_block_boundary(
    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p13/p13.mojo

ํŒ
  1. stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]())์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  2. ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: shared_a[local_i] = a[global_i]
  3. ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: if local_i < CONV_2 - 1์ผ ๋•Œ ๋‹ค์Œ ๋ธ”๋ก์˜ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ
  4. ์ปค๋„ ๋กœ๋“œ: shared_b[local_i] = b[local_i]
  5. ์ž…๋ ฅ ๋ฒ”์œ„ ์•ˆ์—์„œ ํ•ฉ์‚ฐ: if global_i + j < SIZE_2

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p13 --block-boundary
pixi run -e amd p13 --block-boundary
pixi run -e apple p13 --block-boundary
uv run poe p13 --block-boundary

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])

์†”๋ฃจ์…˜

def conv_1d_block_boundary(
    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # first: need to account for padding
    var shared_a = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB + CONV_2 - 1]())
    var shared_b = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[CONV_2]())
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        var next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.ElementType = 0

        comptime for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum


ํ™•์žฅ๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ๋„˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ์†”๋ฃจ์…˜์ž…๋‹ˆ๋‹ค. ์ž์„ธํžˆ ๋ถ„์„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ํฌ๊ธฐ ๊ณ„์‚ฐ

ํ…Œ์ŠคํŠธ ๊ตฌ์„ฑ:
- ์ „์ฒด ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE_2 = 15
- ๊ทธ๋ฆฌ๋“œ: 2 ๋ธ”๋ก ร— 8 ์Šค๋ ˆ๋“œ
- ํ•ฉ์„ฑ๊ณฑ ์ปค๋„: CONV_2 = 4

Block 0 ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ:  [0 1 2 3 4 5 6 7|8 9 10]  // TPB(8) + (CONV_2-1)(3) ํŒจ๋”ฉ
Block 1 ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ:  [8 9 10 11 12 13 14 0|0 0 0]  // ๋‘ ๋ฒˆ์งธ ๋ธ”๋ก. ๋ฐ์ดํ„ฐ(7) + ๊ทธ๋ฆฌ๋“œ ์ฑ„์›€์šฉ ํŒจ๋”ฉ(1) + (CONV_2-1)(3) ํŒจ๋”ฉ

ํฌ๊ธฐ ๊ณ„์‚ฐ:
- ๋ฉ”์ธ ๋ฐ์ดํ„ฐ: TPB๊ฐœ (8)
- ๊ฒน์นจ ์˜์—ญ: CONV_2 - 1๊ฐœ (4 - 1 = 3)
- ํ•ฉ๊ณ„: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11๊ฐœ
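์œ„ ๋ ˆ์ด์•„์›ƒ์„ ์ˆœ์ฐจ ํŒŒ์ด์ฌ์œผ๋กœ ํ‰๋‚ด ๋‚ธ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ๊ฐ ๋ธ”๋ก์ด TPB + CONV_2 - 1 ํฌ๊ธฐ์˜ ๊ณต์œ  ๋ฐฐ์—ด์— ๋ฉ”์ธ ๋ฐ์ดํ„ฐ์™€ ๊ฒน์นจ ์˜์—ญ์„ ํ•จ๊ป˜ ๋‹ด์•„ ๋ธ”๋ก ๊ฒฝ๊ณ„๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค (GPU ์ปค๋„์ด ์•„๋‹ˆ๋ผ ๋™์ž‘ ํ™•์ธ์šฉ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค):

```python
SIZE_2, CONV_2, TPB = 15, 4, 8
a = [float(i) for i in range(SIZE_2)]
b = [0.0, 1.0, 2.0, 3.0]

out = [0.0] * SIZE_2
for block in range(2):
    base = block * TPB
    # ๋ฉ”์ธ ๋ฐ์ดํ„ฐ(TPB) + ๊ฒน์นจ ์˜์—ญ(CONV_2 - 1)์„ ๋‹ด๋Š” ๊ณต์œ  ๋ฐฐ์—ด
    shared_a = [0.0] * (TPB + CONV_2 - 1)
    for k in range(TPB + CONV_2 - 1):
        if base + k < SIZE_2:
            shared_a[k] = a[base + k]  # ๋ฒ”์œ„ ๋ฐ–์€ 0์œผ๋กœ ์œ ์ง€ (์ œ๋กœ ํŒจ๋”ฉ)
    # ๊ฐ "์Šค๋ ˆ๋“œ"๊ฐ€ ์ถœ๋ ฅ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐ
    for local_i in range(TPB):
        global_i = base + local_i
        if global_i < SIZE_2:
            s = 0.0
            for j in range(CONV_2):
                if global_i + j < SIZE_2:  # ์‹ค์ œ ์ž…๋ ฅ ๋ฒ”์œ„ ๊ธฐ์ค€ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
                    s += shared_a[local_i + j] * b[j]
            out[global_i] = s

print(out)
# [14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0]
```

๊ฒฐ๊ณผ๊ฐ€ ์•ž์„œ ๋‚˜์˜จ ๊ธฐ๋Œ€๊ฐ’ HostBuffer์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. GPU ๋ฒ„์ „์—์„œ๋Š” ๊ฒน์นจ ์˜์—ญ ๋กœ๋”ฉ์„ local_i < CONV_2 - 1์ธ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‚˜๋ˆ„์–ด ๋‹ด๋‹นํ•ฉ๋‹ˆ๋‹ค.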

๊ตฌํ˜„ ์ƒ์„ธ

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น:

    # ํ•ฉ์„ฑ๊ณฑ ์œˆ๋„์šฐ์— ํ•„์š”ํ•œ ํŒจ๋”ฉ์„ ๋จผ์ € ๊ณ ๋ ค
    shared_a = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]())
    shared_b = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[CONV_2]())
    

    ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๋ธ”๋ก ๋ฐ์ดํ„ฐ์™€ ๊ฒน์นจ ์˜์—ญ์„ ๋ชจ๋‘ ๋‹ด๊ธฐ์— ์ถฉ๋ถ„ํ•œ ๊ณต๊ฐ„์ด ํ™•๋ณด๋ฉ๋‹ˆ๋‹ค.

  2. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ ์ „๋žต:

    # ๋ฉ”์ธ ๋ธ”๋ก ๋ฐ์ดํ„ฐ
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0
    
    # ๋‹ค์Œ ๋ธ”๋ก์˜ ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ
    if local_i < CONV_2 - 1:
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # ๋ฒ”์œ„ ๋ฐ– ์›์†Œ๋ฅผ 0์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜์—ฌ
            # ๋ฏธ์ •์˜ ๋™์ž‘์„ ์œ ๋ฐœํ•˜๋Š” ์ดˆ๊ธฐํ™”๋˜์ง€ ์•Š์€ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ๋ฅผ ๋ฐฉ์ง€
            shared_a[TPB + local_i] = 0
    
    • local_i < CONV_2 - 1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œ
    • ๋ถˆํ•„์š”ํ•œ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ๋ฐฉ์ง€
    • ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ์œ ์ง€
    • ๋ฒ”์œ„ ๋ฐ– ์›์†Œ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ 0์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜์—ฌ ๋ฏธ์ •์˜ ๋™์ž‘ ๋ฐฉ์ง€
  3. ์ปค๋„ ๋กœ๋”ฉ:

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]
    
    • ์Šค๋ ˆ๋“œ๋‹น 1ํšŒ ๋กœ๋“œ
    • ์ปค๋„ ํฌ๊ธฐ๋กœ ๋ฒ”์œ„ ์ œํ•œ
  4. ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ:

    if global_i < SIZE_2:
        var local_sum: output.ElementType = 0
        comptime for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]
    
    • comptime for๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ
    • output.ElementType์œผ๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์ถ”๋ก 
    • ์˜๋ฏธ์ ์œผ๋กœ ์˜ฌ๋ฐ”๋ฅธ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ์œ ํšจํ•œ ์ž…๋ ฅ ์œ„์น˜์—์„œ๋งŒ ํ•ฉ์„ฑ๊ณฑ ๊ณ„์‚ฐ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

  1. Block 0 ์ ‘๊ทผ ํŒจํ„ด:

    Thread 0: [0 1 2 3] ร— [0 1 2 3]
    Thread 1: [1 2 3 4] ร— [0 1 2 3]
    Thread 2: [2 3 4 5] ร— [0 1 2 3]
    ...
    Thread 7: [7 8 9 10] ร— [0 1 2 3]  // ๊ฒน์นจ ์˜์—ญ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ
    
  2. Block 1 ์ ‘๊ทผ ํŒจํ„ด: Thread 4๋ถ€ํ„ฐ๋Š” global_i + j < SIZE_2๊ฐ€ False๊ฐ€ ๋˜์–ด ํ•ด๋‹น ๋ฐ˜๋ณต์„ ๊ฑด๋„ˆ๋›ฐ๋Š” ์ ์— ์ฃผ๋ชฉํ•˜์„ธ์š”.

    Thread 0: [8  9 10 11] ร— [0 1 2 3]
    Thread 1: [9 10 11 12] ร— [0 1 2 3]
    ...
    Thread 4: [12 13 14] ร— [0 1 2]       // ๋๋ถ€๋ถ„ ์ œ๋กœ ํŒจ๋”ฉ
    Thread 5: [13 14]    ร— [0 1]
    Thread 6: [14]       ร— [0]
    Thread 7: ๊ฑด๋„ˆ๋œ€                      // ๋ชจ๋“  j์— ๋Œ€ํ•ด global_i + j < SIZE_2๊ฐ€ false, ์—ฐ์‚ฐ ์—†์Œ
    

์„ฑ๋Šฅ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ:

    • ๋ฉ”์ธ ๋ฐ์ดํ„ฐ ๋กœ๋“œ: ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
    • ๊ฒฝ๊ณ„ ๋ฐ์ดํ„ฐ: ํ•„์š”ํ•œ ์Šค๋ ˆ๋“œ๋งŒ ์ฐธ์—ฌ
    • ๋‹จ์ผ ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™” ์ง€์ 
  2. ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ์ตœ์†Œํ™”:

    • ๋ฉ”์ธ ๋กœ๋”ฉ๊ณผ ๊ฒฝ๊ณ„ ๋กœ๋”ฉ์˜ ๊น”๋”ํ•œ ๋ถ„๋ฆฌ
    • ์›Œํ”„ ๋‚ด ๊ท ์ผํ•œ ์—ฐ์‚ฐ ํŒจํ„ด
    • ํšจ์œจ์ ์ธ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ:

    • ๋ธ”๋ก ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์— ์ตœ์ ํ™”๋œ ํฌ๊ธฐ ์„ค์ •
    • ์ ‘๊ทผ ํŒจํ„ด์—์„œ ๋ฑ…ํฌ ์ถฉ๋Œ ์—†์Œ
    • ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์  ์žฌ์‚ฌ์šฉ
  4. ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ:

    • ๋ฒ”์œ„ ๋ฐ– ์›์†Œ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ 0์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ์ดˆ๊ธฐํ™”๋˜์ง€ ์•Š์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ ๋ฐฉ์ง€
    • global_i + j < SIZE_2๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์•„๋‹Œ ์‹ค์ œ ์ž…๋ ฅ ๋ฒ”์œ„ ๊ธฐ์ค€์˜ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
    • ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ ์—†์ด ์ ์ ˆํ•œ ์—ฃ์ง€ ์ผ€์ด์Šค ์ฒ˜๋ฆฌ

๊ฒฝ๊ณ„ ์กฐ๊ฑด ๊ฐœ์„ 

์ด ์†”๋ฃจ์…˜์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ”์œ„๋ฅผ ํ™•์ธํ•˜๋Š” ๋Œ€์‹  if global_i + j < SIZE_2:๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ํŒจํ„ด์€:

  • ์ˆ˜ํ•™์ ์œผ๋กœ ์ •ํ™•: ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๊ฐ€ ์‹ค์ œ๋กœ ์กด์žฌํ•˜๋Š” ์œ„์น˜์—์„œ๋งŒ ํ•ฉ์„ฑ๊ณฑ ๊ณ„์‚ฐ
  • ๋” ํšจ์œจ์ : ์ž…๋ ฅ ๋ฐฐ์—ด์„ ๋„˜์–ด์„  ์œ„์น˜์— ๋Œ€ํ•œ ๋ถˆํ•„์š”ํ•œ ์—ฐ์‚ฐ ํšŒํ”ผ
  • ๋” ์•ˆ์ „: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์ œ๋กœ ํŒจ๋”ฉ ๋™์ž‘์— ์˜์กดํ•˜์ง€ ์•Š์Œ

์ด ๊ตฌํ˜„์€ ๋ธ”๋ก ๊ฐ„ ํ•ฉ์„ฑ๊ณฑ์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋ฉด์„œ ๋‹ค์Œ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • ์ ์ ˆํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ
  • ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ํ†ตํ•œ ๋†’์€ ์„ฑ๋Šฅ
  • TileTensor ์ถ”์ƒํ™”๋ฅผ ํ™œ์šฉํ•œ ๊น”๋”ํ•œ ์ฝ”๋“œ ๊ตฌ์กฐ
  • ์ตœ์†Œํ•œ์˜ ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  • ์ˆ˜ํ•™์ ์œผ๋กœ ๊ฑด์ „ํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

Puzzle 14: ๋ˆ„์  ํ•ฉ

๊ฐœ์š”

๋ˆ„์  ํ•ฉ(prefix sum, scan ์ด๋ผ๊ณ ๋„ ํ•ฉ๋‹ˆ๋‹ค)์€ ์‹œํ€€์Šค์˜ ๊ฐ’์„ ์ฐจ๋ก€๋กœ ๋”ํ•ด ๋‚˜๊ฐ€๋Š” ๊ธฐ๋ณธ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž…๋‹ˆ๋‹ค. ์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ถ€ํ„ฐ ๊ณผํ•™ ์‹œ๋ฎฌ๋ ˆ์ด์…˜๊นŒ์ง€ ์ˆ˜๋งŽ์€ ๋ณ‘๋ ฌ ์‘์šฉ์˜ ํ•ต์‹ฌ์— ์ž๋ฆฌํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์ˆซ์ž ์‹œํ€€์Šค๋ฅผ ๋ˆ„์  ํ•ฉ๊ณ„๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค. ์ˆœ์ฐจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๊ธฐ๋Š” ๊ฐ„๋‹จํ•˜์ง€๋งŒ, GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ๋งŒ๋“ค๋ ค๋ฉด ๊ธฐ๋ฐœํ•œ ๋ณ‘๋ ฌ์  ์‚ฌ๊ณ ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค!

1D TileTensor a์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : a์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ๊ฐ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋งŒ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ˆ„์  ํ•ฉ ์‹œ๊ฐํ™” ๋ˆ„์  ํ•ฉ ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋กœ๊ทธ ๋ณต์žก๋„๋ฅผ ๊ฐ€์ง„ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ˜‘๋ ฅ ํŒจํ„ด
  • ๋‹ค๋‹จ๊ณ„ ์—ฐ์‚ฐ ์ „๋žต

ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ˆœ์ฐจ ์—ฐ์‚ฐ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, ์ž…๋ ฅ ์‹œํ€€์Šค \([3, 1, 4, 1, 5, 9]\) ๊ฐ€ ์ฃผ์–ด์ง€๋ฉด, ๋ˆ„์  ํ•ฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค:

  • \([3]\) (์ฒซ ๋ฒˆ์งธ ์›์†Œ ๊ทธ๋Œ€๋กœ)
  • \([3, 4]\) (3 + 1)
  • \([3, 4, 8]\) (์ด์ „ ํ•ฉ + 4)
  • \([3, 4, 8, 9]\) (์ด์ „ ํ•ฉ + 1)
  • \([3, 4, 8, 9, 14]\) (์ด์ „ ํ•ฉ + 5)
  • \([3, 4, 8, 9, 14, 23]\) (์ด์ „ ํ•ฉ + 9)

์ˆ˜ํ•™์ ์œผ๋กœ, ์‹œํ€€์Šค \([x_0, x_1, โ€ฆ, x_n]\) ์˜ ๋ˆ„์  ํ•ฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: \[ [x_0, x_0+x_1, x_0+x_1+x_2, โ€ฆ, \sum_{i=0}^n x_i] \]

์ˆœ์ฐจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋ผ๋ฉด \(O(n)\) ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•˜๊ฒ ์ง€๋งŒ, ์—ฌ๊ธฐ์„œ๋Š” ์˜๋ฆฌํ•œ 2๋‹จ๊ณ„ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ \(O(\log n)\) ๋‹จ๊ณ„๋งŒ์— ์™„๋ฃŒํ•ฉ๋‹ˆ๋‹ค! ์œ„์˜ ์• ๋‹ˆ๋ฉ”์ด์…˜์—์„œ ์ด ๊ณผ์ •์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ํผ์ฆ์€ ๊ฐœ๋…์„ ๋‹จ๊ณ„์ ์œผ๋กœ ์ตํž ์ˆ˜ ์žˆ๋„๋ก ๋‘ ํŒŒํŠธ๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค:

  • ๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋“ค์–ด๊ฐ€๋Š” ๋‹จ์ผ ๋ธ”๋ก ๊ตฌํ˜„๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์›๋ฆฌ๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์ข‹์Šต๋‹ˆ๋‹ค.

  • โญ ์™„์„ฑ ๋ฒ„์ „ ์ด์–ด์„œ ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์น˜๋Š” ํฐ ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋” ๊นŒ๋‹ค๋กœ์šด ๊ฒฝ์šฐ์— ๋„์ „ํ•ฉ๋‹ˆ๋‹ค. ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ฐ ๋ฒ„์ „์€ ์ด์ „ ๋ฒ„์ „ ์œ„์— ์Œ“์•„ ์˜ฌ๋ฆฌ๋Š” ๋ฐฉ์‹์œผ๋กœ, ๋ณ‘๋ ฌ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด๋ฅผ ๊นŠ์ด ์žˆ๊ฒŒ ๋ฐœ์ „์‹œ์ผœ ์ค๋‹ˆ๋‹ค. ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹ค์ง€๊ณ , ์™„์„ฑ ๋ฒ„์ „์—์„œ๋Š” ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ™•์žฅํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค โ€” ์‹ค์ œ GPU ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์ž์ฃผ ๋งˆ์ฃผ์น˜๋Š” ๊ณผ์ œ์ž…๋‹ˆ๋‹ค.

๊ธฐ๋ณธ ๋ฒ„์ „

1D TileTensor a์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : a์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ๊ฐ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋งŒ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE = 8
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 1
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB๊ฐœ ์›์†Œ

์ฐธ๊ณ :

  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ TileTensor ์ ‘๊ทผ์„ ํ†ตํ•ด ์›์†Œ ํ•˜๋‚˜๋ฅผ ๋กœ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: address_space๋ฅผ ์ง€์ •ํ•œ TileTensor๋กœ ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: ์—ฐ์‚ฐ ๋‹จ๊ณ„ ๊ฐ„ ์กฐ์œจ
  • ์ ‘๊ทผ ํŒจํ„ด: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ
  • ํƒ€์ž… ์•ˆ์ „์„ฑ: TileTensor์˜ ํƒ€์ž… ์‹œ์Šคํ…œ ํ™œ์šฉ

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TPB = 8
comptime SIZE = 8
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)


def prefix_sum_simple(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p14/p14.mojo

ํŒ
  1. ๋ฐ์ดํ„ฐ๋ฅผ shared[local_i]์— ๋กœ๋“œ
  2. offset = 1์—์„œ ์‹œ์ž‘ํ•ด ๋งค ๋‹จ๊ณ„๋งˆ๋‹ค 2๋ฐฐ๋กœ ์ฆ๊ฐ€
  3. local_i >= offset์ธ ์›์†Œ์— ๋Œ€ํ•ด ๋ง์…ˆ ์ˆ˜ํ–‰
  4. ๊ฐ ๋‹จ๊ณ„ ์‚ฌ์ด์— barrier() ํ˜ธ์ถœ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p14 --simple
pixi run -e amd p14 --simple
pixi run -e apple p14 --simple
uv run poe p14 --simple

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])

์†”๋ฃจ์…˜

def prefix_sum_simple(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    var offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        var current_val: output.ElementType = 0
        if local_i >= offset and local_i < size:
            current_val = shared[local_i - offset]  # read

        barrier()
        if local_i >= offset and local_i < size:
            shared[local_i] += current_val

        barrier()
        offset *= 2

    if global_i < size:
        output[global_i] = shared[local_i]


๋ณ‘๋ ฌ (ํฌํ•จ) ๋ˆ„์  ํ•ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค:

์„ค์ • ๋ฐ ๊ตฌ์„ฑ

  • TPB (๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜) = 8
  • SIZE (๋ฐฐ์—ด ํฌ๊ธฐ) = 8

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ช…์‹œ์  ๋™๊ธฐํ™”๋ฅผ ํ†ตํ•ด ์ฝ๊ธฐ-์“ฐ๊ธฐ ์ถฉ๋Œ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • ์ฝ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋จผ์ € ํ•„์š”ํ•œ ๊ฐ’์„ ๋กœ์ปฌ ๋ณ€์ˆ˜ current_val์— ์ฝ์–ด๋‘ 
  • ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ์“ฐ๊ธฐ๊ฐ€ ์‹œ์ž‘๋˜๋„๋ก ๋ณด์žฅ
  • ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก

์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜๋ฅผ ์ฝ๊ณ  ์“ธ ๋•Œ ๋ฐœ์ƒํ•˜๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋Œ€์•ˆ์  ์ ‘๊ทผ: ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์€ ๋”๋ธ” ๋ฒ„ํผ๋ง ์ž…๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ 2๋ฐฐ๋กœ ํ• ๋‹นํ•œ ๋’ค, ํ•œ ๋ฒ„ํผ์—์„œ ์ฝ๊ณ  ๋‹ค๋ฅธ ๋ฒ„ํผ์— ์“ฐ๋Š” ๊ฒƒ์„ ๋ฒˆ๊ฐˆ์•„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์™„์ „ํžˆ ์ œ๊ฑฐํ•˜์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์ด ๋Š˜์–ด๋‚˜๊ณ  ๋ณต์žก๋„๊ฐ€ ์˜ฌ๋ผ๊ฐ‘๋‹ˆ๋‹ค. ํ•™์Šต ๋ชฉ์ ์œผ๋กœ๋Š” ์ดํ•ดํ•˜๊ธฐ ๋” ์‰ฌ์šด ๋ช…์‹œ์  ๋™๊ธฐํ™” ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๋งคํ•‘

  • thread_idx.x: \([0, 1, 2, 3, 4, 5, 6, 7]\) (local_i)
  • block_idx.x: \([0, 0, 0, 0, 0, 0, 0, 0]\)
  • global_i: \([0, 1, 2, 3, 4, 5, 6, 7]\) (block_idx.x * TPB + thread_idx.x)

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ดˆ๊ธฐ ๋กœ๋“œ

Threads:      Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡
Input array:  [0    1    2    3    4    5    6    7]
shared:       [0    1    2    3    4    5    6    7]
               โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
              Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

Offset = 1: ์ฒซ ๋ฒˆ์งธ ๋ณ‘๋ ฌ ๋‹จ๊ณ„

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ: \(T_1 \ldots T_7\) (local_i โ‰ฅ 1์ธ ์Šค๋ ˆ๋“œ)

์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

Tโ‚ reads shared[0] = 0    Tโ‚… reads shared[4] = 4
Tโ‚‚ reads shared[1] = 1    Tโ‚† reads shared[5] = 5
Tโ‚ƒ reads shared[2] = 2    Tโ‚‡ reads shared[6] = 6
Tโ‚„ reads shared[3] = 3

๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ํ˜„์žฌ ์œ„์น˜์— ๋”ํ•จ:

Before:      [0    1    2    3    4    5    6    7]
Add:              +0   +1   +2   +3   +4   +5   +6
                   |    |    |    |    |    |    |
Result:      [0    1    3    5    7    9    11   13]
                   โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
                  Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

Offset = 2: ๋‘ ๋ฒˆ์งธ ๋ณ‘๋ ฌ ๋‹จ๊ณ„

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ: \(T_2 \ldots T_7\) (local_i โ‰ฅ 2์ธ ์Šค๋ ˆ๋“œ)

์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

Tโ‚‚ reads shared[0] = 0    Tโ‚… reads shared[3] = 5
Tโ‚ƒ reads shared[1] = 1    Tโ‚† reads shared[4] = 7
Tโ‚„ reads shared[2] = 3    Tโ‚‡ reads shared[5] = 9

๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

Before:      [0    1    3    5    7    9    11   13]
Add:                   +0   +1   +3   +5   +7   +9
                        |    |    |    |    |    |
Result:      [0    1    3    6    10   14   18   22]
                        โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
                       Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

Offset = 4: ์„ธ ๋ฒˆ์งธ ๋ณ‘๋ ฌ ๋‹จ๊ณ„

ํ™œ์„ฑ ์Šค๋ ˆ๋“œ: \(T_4 \ldots T_7\) (local_i โ‰ฅ 4์ธ ์Šค๋ ˆ๋“œ)

์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

Tโ‚„ reads shared[0] = 0    Tโ‚† reads shared[2] = 3
Tโ‚… reads shared[1] = 1    Tโ‚‡ reads shared[3] = 6

๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

Before:      [0    1    3    6    10   14   18   22]
Add:                              +0   +1   +3   +6
                                  |    |    |    |
Result:      [0    1    3    6    10   15   21   28]
                                  โ†‘    โ†‘    โ†‘    โ†‘
                                  Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ output์— ๊ธฐ๋ก

Threads:      Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡
global_i:     0    1    2    3    4    5    6    7
output:       [0    1    3    6    10   15   21   28]
              โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘    โ†‘
              Tโ‚€   Tโ‚   Tโ‚‚   Tโ‚ƒ   Tโ‚„   Tโ‚…   Tโ‚†   Tโ‚‡

์ฃผ์š” ๊ตฌํ˜„ ์ƒ์„ธ

๋™๊ธฐํ™” ํŒจํ„ด: ๊ฐ ๋ฐ˜๋ณต์€ ์—„๊ฒฉํ•œ ์ฝ๊ธฐ โ†’ ๋™๊ธฐํ™” โ†’ ์“ฐ๊ธฐ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. var current_val: output.ElementType = 0 - ๋กœ์ปฌ ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
  2. current_val = shared[local_i - offset] - ์ฝ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  3. barrier() - ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ช…์‹œ์  ๋™๊ธฐํ™”
  4. shared[local_i] += current_val - ์“ฐ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  5. barrier() - ๋‹ค์Œ ๋ฐ˜๋ณต ์ „ ๋™๊ธฐํ™”

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•˜์ง€ ์•Š์œผ๋ฉด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•˜์—ฌ ๋ฏธ์ •์˜ ๋™์ž‘์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ช…์‹œ์  ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•œ 2๋‹จ๊ณ„ ์ ‘๊ทผ ๋ฐฉ์‹์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • if local_i >= offset and local_i < size๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ์ž„์‹œ ๋ณ€์ˆ˜์˜ ์ ์ ˆํ•œ ์ดˆ๊ธฐํ™”
  • ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๋Š” ์กฐ์œจ๋œ ์ ‘๊ทผ ํŒจํ„ด

์ด ์†”๋ฃจ์…˜์€ barrier()๋ฅผ ์‚ฌ์šฉํ•ด ๋‹จ๊ณ„ ๊ฐ„ ์˜ฌ๋ฐ”๋ฅธ ๋™๊ธฐํ™”๋ฅผ ๋ณด์žฅํ•˜๊ณ , if global_i < size๋กœ ๋ฐฐ์—ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” ๊ฐ ์›์†Œ \(i\)๊ฐ€ \(\sum_{j=0}^{i} a[j]\) ๋ฅผ ํฌํ•จํ•˜๋Š” ํฌํ•จ ๋ˆ„์  ํ•ฉ์ž…๋‹ˆ๋‹ค.

์™„์„ฑ ๋ฒ„์ „

1D TileTensor a์— ๋Œ€ํ•ด ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ 1D TileTensor output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ฐธ๊ณ : a์˜ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก ํฌ๊ธฐ๋ณด๋‹ค ํฐ ๊ฒฝ์šฐ, ์˜ฌ๋ฐ”๋ฅธ ๊ฒฐ๊ณผ๋ฅผ ์–ป์œผ๋ ค๋ฉด ์—ฌ๋Ÿฌ ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฐฐ์—ด ํฌ๊ธฐ: SIZE_2 = 15
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: TPB = 8
  • ๋ธ”๋ก ์ˆ˜: 2
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น TPB๊ฐœ ์›์†Œ

์ฐธ๊ณ :

  • ๋‹ค์ค‘ ๋ธ”๋ก: ์ž…๋ ฅ ๋ฐฐ์—ด์ด ํ•˜๋‚˜์˜ ๋ธ”๋ก๋ณด๋‹ค ํด ๋•Œ๋Š” ๋‹ค๋‹จ๊ณ„ ์ ‘๊ทผ์ด ํ•„์š”
  • ๋ธ”๋ก ๋ ˆ๋ฒจ ๋™๊ธฐํ™”: ๋ธ”๋ก ๋‚ด์—์„œ๋Š” barrier()๋กœ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”
  • ํ˜ธ์ŠคํŠธ ๋ ˆ๋ฒจ ๋™๊ธฐํ™”: Mojo์˜ DeviceContext๊ฐ€ ์ปค๋„ ์‹คํ–‰ ์ˆœ์„œ๋ฅผ ๋ณด์žฅํ•˜๋ฏ€๋กœ, ์ปค๋„๋“ค์€ ํ์— ๋„ฃ์€ ์ˆœ์„œ๋Œ€๋กœ ์‹คํ–‰๋˜๊ณ  ์ด์ „ ์ปค๋„์ด ๋๋‚˜์•ผ ๋‹ค์Œ์ด ์‹œ์ž‘๋ฉ๋‹ˆ๋‹ค. ํ˜ธ์ŠคํŠธ์—์„œ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „์— ctx.synchronize()๋กœ ๋ชจ๋“  GPU ์ž‘์—… ์™„๋ฃŒ๋ฅผ ํ™•์ธํ•ด์•ผ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๋ณด์กฐ ์ €์žฅ์†Œ: ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•ด ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅํ•  ์ถ”๊ฐ€ ๊ณต๊ฐ„ ์‚ฌ์šฉ

์™„์„ฑํ•  ์ฝ”๋“œ

๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ˆ„์  ํ•ฉ์„ ์œ„ํ•ด ๋‘ ๊ฐœ์˜ ๋ณ„๋„ ์ปค๋„ ํ•จ์ˆ˜๋ฅผ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ฒซ ๋ฒˆ์งธ ์ปค๋„ (prefix_sum_local_phase): ๊ฐ ๋ธ”๋ก ๋‚ด์—์„œ ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•˜๊ณ  ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅ
  2. ๋‘ ๋ฒˆ์งธ ์ปค๋„ (prefix_sum_block_sum_phase): ์ด์ „ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋ฅผ ํ›„์† ๋ธ”๋ก์˜ ์›์†Œ์— ๋”ํ•จ

๋ฉ”์ธ ํ•จ์ˆ˜๊ฐ€ ์ด ์ปค๋„๋“ค ์‚ฌ์ด์— ํ•„์š”ํ•œ ํ˜ธ์ŠคํŠธ ์ธก ๋™๊ธฐํ™”๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

comptime SIZE_2 = 15
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime EXTENDED_SIZE = SIZE_2 + 2  # up to 2 blocks
comptime layout_2 = row_major[SIZE_2]()
comptime Layout2Type = type_of(layout_2)
comptime extended_layout = row_major[EXTENDED_SIZE]()
comptime ExtendedLayoutType = type_of(extended_layout)


# Kernel 1: Compute local prefix sums and store block sums in out
def prefix_sum_local_phase(
    output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FILL ME IN (roughly 20 lines)


# Kernel 2: Add block sums to their respective blocks
def prefix_sum_block_sum_phase(
    output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 3 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p14/p14.mojo

์ด ํผ์ฆ์˜ ํ•ต์‹ฌ์€ barrier๊ฐ€ ๋ธ”๋ก ๋‚ด๋ถ€์˜ ์Šค๋ ˆ๋“œ๋งŒ ๋™๊ธฐํ™”ํ•˜๋ฉฐ, ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๋Š” ํ•˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด์„œ๋Š” ๋””๋ฐ”์ด์Šค์—์„œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋˜๋Š” ์—ฌ๋Ÿฌ ์ปค๋„์„ ํ์— ๋„ฃ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

            # Phase 1: Local prefix sums
            ctx.enqueue_function[
                prefix_sum_local_phase, prefix_sum_local_phase
            ](
                out_tensor,
                a_tensor,
                size,
                grid_dim=BLOCKS_PER_GRID_2,
                block_dim=THREADS_PER_BLOCK_2,
            )

            # Phase 2: Add block sums
            ctx.enqueue_function[
                prefix_sum_block_sum_phase, prefix_sum_block_sum_phase
            ](
                out_tensor,
                size,
                grid_dim=BLOCKS_PER_GRID_2,
                block_dim=THREADS_PER_BLOCK_2,
            )

๋‘ ์ปค๋„์ด ์ˆœ์ฐจ์ ์œผ๋กœ ํ์— ๋“ค์–ด๊ฐ€์ง€๋งŒ, out_tensor๋Š” ๋‘ ์ปค๋„์˜ ์ž‘์—…์ด ๋ชจ๋‘ ๋๋‚  ๋•Œ๊นŒ์ง€ ํ˜ธ์ŠคํŠธ๋กœ ์ „์†ก๋˜์ง€ ์•Š๋Š”๋‹ค๋Š” ์ ์— ์ฃผ๋ชฉํ•˜์„ธ์š”. Mojo์˜ DeviceContext๊ฐ€ ๋‹จ์ผ ์‹คํ–‰ ์ŠคํŠธ๋ฆผ์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ, ํ์— ๋„ฃ์€ ๋ชจ๋“  ์ปค๋„์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. ํ˜ธ์ŠคํŠธ์—์„œ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „์— ๋ชจ๋“  GPU ์ž‘์—…์˜ ์™„๋ฃŒ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋Œ€๊ธฐํ•˜๋ ค๋ฉด ctx.synchronize()๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํŒ

1. ๊ธฐ๋ณธ ๋ˆ„์  ํ•ฉ ์œ„์— ์Œ“์•„ ์˜ฌ๋ฆฌ๊ธฐ

๐Ÿ”ฐ ๊ธฐ๋ณธ ๋ฒ„์ „์—์„œ ๋‹จ์ผ ๋ธ”๋ก ๋ˆ„์  ํ•ฉ ๊ตฌํ˜„ ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ๋ฒ•์„ ์—ฌ๋Ÿฌ ๋ธ”๋ก์—์„œ ๋™์ž‘ํ•˜๋„๋ก ํ™•์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ๋ณธ ๋ฒ„์ „ (๋‹จ์ผ ๋ธ”๋ก): [0,1,2,3,4,5,6,7] โ†’ [0,1,3,6,10,15,21,28]

์™„์„ฑ ๋ฒ„์ „ (๋‘ ๋ธ”๋ก):
Block 0: [0,1,2,3,4,5,6,7] โ†’ [0,1,3,6,10,15,21,28]
Block 1: [8,9,10,11,12,13,14] โ†’ [8,17,27,38,50,63,77]

๊ทธ๋Ÿฐ๋ฐ ๋‘ ๋ฒˆ์งธ ๋ธ”๋ก์˜ ๊ฐ’์€ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ์š”? ์ฒซ ๋ฒˆ์งธ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋ฅผ ํฌํ•จํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค!

2. 2๋‹จ๊ณ„ ์ ‘๊ทผ

๊ธฐ๋ณธ ๋ˆ„์  ํ•ฉ์œผ๋กœ๋Š” ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ๋ถˆ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ, ์ž‘์—…์„ ๋‚˜๋ˆ•๋‹ˆ๋‹ค:

  1. 1๋‹จ๊ณ„: ๊ฐ ๋ธ”๋ก์ด ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐ (๊ธฐ๋ณธ ๋ฒ„์ „๊ณผ ๋™์ผ)
  2. 2๋‹จ๊ณ„: ๊ฐ ๋ธ”๋ก์ด ์ด์ „ ๋ธ”๋ก์˜ ํ•ฉ๊ณ„๋ฅผ ๋ฐ˜์˜

์ฃผ์˜: barrier()๋Š” ํ•˜๋‚˜์˜ ๋ธ”๋ก ๋‚ด์—์„œ๋งŒ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ๊ณ„ ๊ฐ„์—๋Š” ํ˜ธ์ŠคํŠธ ๋ ˆ๋ฒจ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

3. ํ™•์žฅ ๋ฉ”๋ชจ๋ฆฌ ์ „๋žต

๋ธ”๋ก๋ผ๋ฆฌ ์ง์ ‘ ํ†ต์‹ ํ•  ์ˆ˜ ์—†์œผ๋ฏ€๋กœ, ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅํ•  ๊ณณ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ์ถœ๋ ฅ ๋ฒ„ํผ ๋์— ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹น
  • ๊ฐ ๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ ์ด ์ถ”๊ฐ€ ๊ณต๊ฐ„์— ์ €์žฅ
  • ํ›„์† ๋ธ”๋ก์ด ์ด ํ•ฉ๊ณ„๋ฅผ ์ฝ์–ด์„œ ์ž๊ธฐ ์›์†Œ์— ๋”ํ•จ

4. ์ฃผ์š” ๊ตฌํ˜„ ํฌ์ธํŠธ

  • ๋ ˆ์ด์•„์›ƒ ์ฐจ์ด: ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ํ˜•ํƒœ๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ํ•ญ์ƒ global_i < size๋กœ ๋ฐฐ์—ด ๋ฒ”์œ„ ํ™•์ธ
  • ์Šค๋ ˆ๋“œ ์—ญํ•  ๋ถ„๋‹ด: ํŠน์ • ์Šค๋ ˆ๋“œ(์˜ˆ: ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ)๋งŒ ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์ €์žฅ
  • ๋‘ ์ปค๋„ ๊ฐ„ ๋™๊ธฐํ™”: ๋‘ ๋ฒˆ์งธ ์ปค๋„์€ ๋ฐ˜๋“œ์‹œ ์ฒซ ๋ฒˆ์งธ ์ปค๋„์ด ์™„๋ฃŒ๋œ ํ›„์— ์‹คํ–‰๋˜์–ด์•ผ ํ•จ

5. ๋””๋ฒ„๊น… ์ „๋žต

๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด, 1๋‹จ๊ณ„ ์ดํ›„์˜ ์ค‘๊ฐ„ ์ƒํƒœ๋ฅผ ์‹œ๊ฐํ™”ํ•ด ๋ณด์„ธ์š”:

1๋‹จ๊ณ„ ์ดํ›„: [0,1,3,6,10,15,21,28, 8,17,27,38,50,63,77, ???,???]

์—ฌ๊ธฐ์„œ ???์—๋Š” 2๋‹จ๊ณ„์—์„œ ์‚ฌ์šฉ๋  ๋ธ”๋ก ํ•ฉ๊ณ„๊ฐ€ ๋“ค์–ด๊ฐ€์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ํ™•์ธํ•˜๋ ค๋ฉด ๋จผ์ € ๋””๋ฐ”์ด์Šค์˜ ์ž‘์—… ์™„๋ฃŒ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ณด์žฅํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์„ ๊ธฐ์–ตํ•˜์„ธ์š”.

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p14 --complete
pixi run -e amd p14 --complete
pixi run -e apple p14 --complete
uv run poe p14 --complete

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0])

์†”๋ฃจ์…˜



# Kernel 1: Compute local prefix sums and store block sums in out
def prefix_sum_local_phase(
    output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, Layout2Type, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    # Load data into shared memory
    # Example with SIZE_2=15, TPB=8, BLOCKS=2:
    # Block 0 shared mem: [0,1,2,3,4,5,6,7]
    # Block 1 shared mem: [8,9,10,11,12,13,14,uninitialized]
    # Note: The last position remains uninitialized since global_i >= size,
    # but this is safe because that thread doesn't participate in computation
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # Compute local prefix sum using parallel reduction
    # This uses a tree-based algorithm with log(TPB) iterations
    # Iteration 1 (offset=1):
    #   Block 0: [0,0+1,2+1,3+2,4+3,5+4,6+5,7+6] = [0,1,3,5,7,9,11,13]
    # Iteration 2 (offset=2):
    #   Block 0: [0,1,3+0,5+1,7+3,9+5,11+7,13+9] = [0,1,3,6,10,14,18,22]
    # Iteration 3 (offset=4):
    #   Block 0: [0,1,3,6,10+0,14+1,18+3,22+6] = [0,1,3,6,10,15,21,28]
    #   Block 1 follows same pattern to get [8,17,27,38,50,63,77,???]
    var offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        var current_val: output.ElementType = 0
        if local_i >= offset and local_i < TPB:
            current_val = shared[local_i - offset]  # read

        barrier()
        if local_i >= offset and local_i < TPB:
            shared[local_i] += current_val  # write

        barrier()
        offset *= 2

    # Write local results to output
    # Block 0 writes: [0,1,3,6,10,15,21,28]
    # Block 1 writes: [8,17,27,38,50,63,77,???]
    if global_i < size:
        output[global_i] = shared[local_i]

    # Store block sums in auxiliary space
    # Block 0: Thread 7 stores shared[7] == 28 at position size+0 (position 15)
    # Block 1: Thread 7 stores shared[7] == ??? at position size+1 (position 16).  This sum is not needed for the final output.
    # This gives us: [0,1,3,6,10,15,21,28, 8,17,27,38,50,63,77, 28,???]
    #                                                           โ†‘  โ†‘
    #                                                     Block sums here
    if local_i == TPB - 1:
        output[size + block_idx.x] = shared[local_i]


# Kernel 2: Add block sums to their respective blocks
def prefix_sum_block_sum_phase(
    output: TileTensor[mut=True, dtype, ExtendedLayoutType, MutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    # Second pass: add previous block's sum to each element
    # Block 0: No change needed - already correct
    # Block 1: Add Block 0's sum (28) to each element
    #   Before: [8,17,27,38,50,63,77]
    #   After: [36,45,55,66,78,91,105]
    # Final result combines both blocks:
    # [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]
    if block_idx.x > 0 and global_i < size:
        var prev_block_sum = output[size + block_idx.x - 1]
        output[global_i] += prev_block_sum


์ด ์†”๋ฃจ์…˜์€ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์น˜๋Š” ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด 2๊ฐœ์˜ ์ปค๋„์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ˆ„์  ํ•ฉ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋ถ€๋ถ„์„ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์˜ ๊ณผ์ œ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ทผ๋ณธ์ ์ธ ์ œ์•ฝ์€ barrier()๋ฅผ ์‚ฌ์šฉํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๊ฐ€ ๋ธ”๋ก ๋‚ด๋ถ€์—์„œ๋งŒ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๊ฐ€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์ณ ์žˆ์„ ๋•Œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ณผ์ œ์— ์ง๋ฉดํ•ฉ๋‹ˆ๋‹ค: ๋ธ”๋ก์ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค๋ฅธ ๋ธ”๋ก์— ์–ด๋–ป๊ฒŒ ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์„๊นŒ?

๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์‹œ๊ฐํ™”

ํ…Œ์ŠคํŠธ ์ผ€์ด์Šค SIZE_2 = 15, TPB = 8์˜ ๊ฒฝ์šฐ:

Input array:  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

Block 0 ์ฒ˜๋ฆฌ: [0, 1, 2, 3, 4, 5, 6, 7]
Block 1 ์ฒ˜๋ฆฌ: [8, 9, 10, 11, 12, 13, 14] (์œ ํšจ ์›์†Œ 7๊ฐœ)

๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์œ„ํ•œ ๊ณต๊ฐ„์„ ํฌํ•จํ•˜๋„๋ก ์ถœ๋ ฅ ๋ฒ„ํผ๋ฅผ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

ํ™•์žฅ ๋ฒ„ํผ: [๋ฐ์ดํ„ฐ ๊ฐ’ (15๊ฐœ)] + [๋ธ”๋ก ํ•ฉ๊ณ„ (2๊ฐœ)]
           [0...14] + [block0_sum, block1_sum]

์ด ํ™•์žฅ ๋ฒ„ํผ์˜ ํฌ๊ธฐ: EXTENDED_SIZE = SIZE_2 + num_blocks = 15 + 2 = 17

1๋‹จ๊ณ„ ์ปค๋„: ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ

๋กœ์ปฌ ๋‹จ๊ณ„์—์„œ์˜ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€

๋กœ์ปฌ ๋‹จ๊ณ„๋Š” ๊ธฐ๋ณธ ๋ฒ„์ „๊ณผ ๋™์ผํ•œ ๋ช…์‹œ์  ๋™๊ธฐํ™” ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฝ๊ธฐ-์“ฐ๊ธฐ ์ถฉ๋Œ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค:

  • ์ฝ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋จผ์ € ํ•„์š”ํ•œ ๊ฐ’์„ ๋กœ์ปฌ ๋ณ€์ˆ˜ current_val์— ์ฝ์–ด๋‘ 
  • ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ์“ฐ๊ธฐ๊ฐ€ ์‹œ์ž‘๋˜๋„๋ก ๋ณด์žฅ
  • ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก

์ด๋ฅผ ํ†ตํ•ด ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ค‘ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

Block 0 ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ฐ’ ๋กœ๋“œ:

    shared = [0, 1, 2, 3, 4, 5, 6, 7]
    
  2. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋ฐ˜๋ณต (\(\log_2(TPB) = 3\)ํšŒ ๋ฐ˜๋ณต):

    ๋ฐ˜๋ณต 1 (offset=1):

    ์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

    Tโ‚ reads shared[0] = 0    Tโ‚… reads shared[4] = 4
    Tโ‚‚ reads shared[1] = 1    Tโ‚† reads shared[5] = 5
    Tโ‚ƒ reads shared[2] = 2    Tโ‚‡ reads shared[6] = 6
    Tโ‚„ reads shared[3] = 3
    

    ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

    ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

    shared[0] = 0              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[1] = 1 + 0 = 1
    shared[2] = 2 + 1 = 3
    shared[3] = 3 + 2 = 5
    shared[4] = 4 + 3 = 7
    shared[5] = 5 + 4 = 9
    shared[6] = 6 + 5 = 11
    shared[7] = 7 + 6 = 13
    

    ๋ฐฐ๋ฆฌ์–ด ํ›„: shared = [0, 1, 3, 5, 7, 9, 11, 13]

    ๋ฐ˜๋ณต 2 (offset=2):

    ์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

    Tโ‚‚ reads shared[0] = 0    Tโ‚… reads shared[3] = 5
    Tโ‚ƒ reads shared[1] = 1    Tโ‚† reads shared[4] = 7
    Tโ‚„ reads shared[2] = 3    Tโ‚‡ reads shared[5] = 9
    

    ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

    ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

    shared[0] = 0              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[1] = 1              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[2] = 3 + 0 = 3      (๋ณ€๊ฒฝ ์—†์Œ)
    shared[3] = 5 + 1 = 6
    shared[4] = 7 + 3 = 10
    shared[5] = 9 + 5 = 14
    shared[6] = 11 + 7 = 18
    shared[7] = 13 + 9 = 22
    

    ๋ฐฐ๋ฆฌ์–ด ํ›„: shared = [0, 1, 3, 6, 10, 14, 18, 22]

    ๋ฐ˜๋ณต 3 (offset=4):

    ์ฝ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•„์š”ํ•œ ๊ฐ’์„ ์ฝ์Œ:

    Tโ‚„ reads shared[0] = 0    Tโ‚† reads shared[2] = 3
    Tโ‚… reads shared[1] = 1    Tโ‚‡ reads shared[3] = 6
    

    ๋™๊ธฐํ™”: barrier()๋กœ ๋ชจ๋“  ์ฝ๊ธฐ ์™„๋ฃŒ๋ฅผ ๋ณด์žฅ

    ์“ฐ๊ธฐ ๋‹จ๊ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์€ ๊ฐ’์„ ๋”ํ•จ:

    shared[0] = 0              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[1] = 1              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[2] = 3              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[3] = 6              (๋ณ€๊ฒฝ ์—†์Œ)
    shared[4] = 10 + 0 = 10    (๋ณ€๊ฒฝ ์—†์Œ)
    shared[5] = 14 + 1 = 15
    shared[6] = 18 + 3 = 21
    shared[7] = 22 + 6 = 28
    

    ๋ฐฐ๋ฆฌ์–ด ํ›„: shared = [0, 1, 3, 6, 10, 15, 21, 28]

  3. ๋กœ์ปฌ ๊ฒฐ๊ณผ๋ฅผ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก:

    output[0...7] = [0, 1, 3, 6, 10, 15, 21, 28]
    
  4. ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ๋ณด์กฐ ๊ณต๊ฐ„์— ์ €์žฅ (๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๋งŒ):

    output[15] = 28  // ์œ„์น˜: size + block_idx.x = 15 + 0
    

Block 1 ๋‹จ๊ณ„๋ณ„ ์‹คํ–‰

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๊ฐ’ ๋กœ๋“œ:

    shared = [8, 9, 10, 11, 12, 13, 14, ๋ฏธ์ดˆ๊ธฐํ™”]
    

    ์ฐธ๊ณ : ์Šค๋ ˆ๋“œ 7์€ global_i = 15 >= SIZE_2์ด๋ฏ€๋กœ ์•„๋ฌด๊ฒƒ๋„ ๋กœ๋“œํ•˜์ง€ ์•Š์•„ shared[7]์ด ๋ฏธ์ดˆ๊ธฐํ™” ์ƒํƒœ๋กœ ๋‚จ์Šต๋‹ˆ๋‹ค. ์Šค๋ ˆ๋“œ 7์€ ์ตœ์ข… ์ถœ๋ ฅ์— ์ฐธ์—ฌํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ ์•ˆ์ „ํ•ฉ๋‹ˆ๋‹ค.

  2. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋ฐ˜๋ณต (\(\log_2(TPB) = 3\)ํšŒ ๋ฐ˜๋ณต):

    ์œ ํšจํ•œ ์ž…๋ ฅ์„ ๋“ค๊ณ  ์—ฐ์‚ฐํ•˜๋Š” ๊ฒƒ์€ ์ฒ˜์Œ 7๊ฐœ ์Šค๋ ˆ๋“œ๋ฟ์ž…๋‹ˆ๋‹ค (์Šค๋ ˆ๋“œ 7๋„ ๋ฆฌ๋•์…˜์—๋Š” ์ฐธ์—ฌํ•˜์ง€๋งŒ ๋ฏธ์ดˆ๊ธฐํ™” ๊ฐ’์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค). ์„ธ ๋ฒˆ์˜ ๋ฐ˜๋ณต์„ ๊ฑฐ์น˜๋ฉด:

    shared = [8, 17, 27, 38, 50, 63, 77, ๋ฏธ์ดˆ๊ธฐํ™”]
    
  3. ๋กœ์ปฌ ๊ฒฐ๊ณผ๋ฅผ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ก:

    output[8...14] = [8, 17, 27, 38, 50, 63, 77]  // ์œ ํšจ ์ถœ๋ ฅ 7๊ฐœ๋งŒ
    
  4. ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ๋ณด์กฐ ๊ณต๊ฐ„์— ์ €์žฅ (๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๋งŒ):

    output[16] = shared[7]  // ์Šค๋ ˆ๋“œ 7 (TPB-1)์ด shared[7]์˜ ๊ฐ’์„ ์ €์žฅ
    

    ์ฐธ๊ณ : ์Šค๋ ˆ๋“œ 7์€ ์œ ํšจํ•œ ์ž…๋ ฅ์„ ๋กœ๋“œํ•˜์ง€ ์•Š์•˜์ง€๋งŒ, ๋ธ”๋ก ๋‚ด ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์—๋Š” ๊ทธ๋Œ€๋กœ ์ฐธ์—ฌํ•ฉ๋‹ˆ๋‹ค. shared[7]์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ฑฐ์น˜๋ฉฐ ๊ฐฑ์‹ ๋˜์ง€๋งŒ, ๋ฏธ์ดˆ๊ธฐํ™” ์ƒํƒœ์—์„œ ์‹œ์ž‘ํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ตœ์ข… ๊ฐ’์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ Block 1์ด ๋งˆ์ง€๋ง‰ ๋ธ”๋ก์ด๋ฏ€๋กœ ์ด ๋ธ”๋ก ํ•ฉ๊ณ„๋Š” 2๋‹จ๊ณ„์—์„œ ์‚ฌ์šฉ๋˜์ง€ ์•Š์•„ ์ •ํ™•์„ฑ์—๋Š” ์˜ํ–ฅ์ด ์—†์Šต๋‹ˆ๋‹ค.

1๋‹จ๊ณ„ ์ดํ›„ ์ถœ๋ ฅ ๋ฒ„ํผ์˜ ๋‚ด์šฉ:

[0, 1, 3, 6, 10, 15, 21, 28, 8, 17, 27, 38, 50, 63, 77, 28, ???]
                                                        ^   ^
                                                ๋ธ”๋ก ํ•ฉ๊ณ„๊ฐ€ ์—ฌ๊ธฐ์— ์ €์žฅ๋จ

์ฐธ๊ณ : ๋งˆ์ง€๋ง‰ ๋ธ”๋ก ํ•ฉ๊ณ„ (???) ๋Š” ๋ฏธ์ดˆ๊ธฐํ™” ๋ฉ”๋ชจ๋ฆฌ์— ๊ธฐ๋ฐ˜ํ•˜๋ฏ€๋กœ ์˜ˆ์ธกํ•  ์ˆ˜ ์—†์ง€๋งŒ, ์ตœ์ข… ๊ฒฐ๊ณผ์—๋Š” ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”: ์‹ค์ œ๋กœ ํ•„์š”ํ•œ ์‹œ์ 

๋‘ ์ปค๋„ ๋‹จ๊ณ„๋Š” ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค:

# 1๋‹จ๊ณ„: ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ
ctx.enqueue_function[prefix_sum_local_phase[...], prefix_sum_local_phase[...]](...)

# 2๋‹จ๊ณ„: ๋ธ”๋ก ํ•ฉ๊ณ„ ๋”ํ•˜๊ธฐ (์ž๋™์œผ๋กœ 1๋‹จ๊ณ„ ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐ)
ctx.enqueue_function[prefix_sum_block_sum_phase[...], prefix_sum_block_sum_phase[...]](...)

ํ•ต์‹ฌ ํ†ต์ฐฐ: Mojo์˜ DeviceContext๋Š” ๋‹จ์ผ ์‹คํ–‰ ์ŠคํŠธ๋ฆผ(NVIDIA GPU์—์„œ๋Š” CUDA ์ŠคํŠธ๋ฆผ, AMD ROCm GPU์—์„œ๋Š” HIP ์ŠคํŠธ๋ฆผ)์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ, ํ์— ๋„ฃ์€ ์ปค๋„์ด ์ •ํ™•ํžˆ ๋„ฃ์€ ์ˆœ์„œ๋Œ€๋กœ ์‹คํ–‰๋จ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค. ์ปค๋„ ๊ฐ„์— ๋ช…์‹œ์  ๋™๊ธฐํ™”๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค.

ctx.synchronize()๊ฐ€ ํ•„์š”ํ•œ ์‹œ์ :

# ๋‘ ์ปค๋„ ์™„๋ฃŒ ํ›„, ํ˜ธ์ŠคํŠธ์—์„œ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „
ctx.synchronize()  # ํ˜ธ์ŠคํŠธ๊ฐ€ GPU ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐ

with out.map_to_host() as out_host:  # ์ด์ œ GPU ๊ฒฐ๊ณผ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ฝ์„ ์ˆ˜ ์žˆ์Œ
    print("out:", out_host)

ctx.synchronize() ํ˜ธ์ถœ์˜ ์—ญํ• :

  • ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”: ๊ฒฐ๊ณผ์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ํ˜ธ์ŠคํŠธ๊ฐ€ ๋ชจ๋“  GPU ์ž‘์—…์˜ ์™„๋ฃŒ๋ฅผ ๋Œ€๊ธฐํ•˜๋„๋ก ๋ณด์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์•ˆ์ „์„ฑ: ์—ฐ์‚ฐ์ด ๋๋‚˜๊ธฐ ์ „์— GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€

์‹คํ–‰ ๋ชจ๋ธ: ๋ธ”๋ก ๋‚ด๋ถ€์˜ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๋Š” barrier()์™€ ๋‹ฌ๋ฆฌ, ์ปค๋„ ์‹คํ–‰ ์ˆœ์„œ๋Š” Mojo์˜ ๋‹จ์ผ ์ŠคํŠธ๋ฆผ ์‹คํ–‰ ๋ชจ๋ธ์—์„œ ๋ณด์žฅ๋˜๋ฉฐ, ctx.synchronize()๋Š” ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๊ฐ„ ์กฐ์œจ์„ ๋‹ด๋‹นํ•ฉ๋‹ˆ๋‹ค.

2๋‹จ๊ณ„ ์ปค๋„: ๋ธ”๋ก ํ•ฉ๊ณ„ ๋”ํ•˜๊ธฐ

  1. Block 0: ๋ณ€๊ฒฝ ๋ถˆํ•„์š” (์ด๋ฏธ ์˜ฌ๋ฐ”๋ฅธ ์ƒํƒœ).

  2. Block 1: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ Block 0์˜ ํ•ฉ๊ณ„๋ฅผ ์ž๊ธฐ ์›์†Œ์— ๋”ํ•จ:

    prev_block_sum = output[size + block_idx.x - 1] = output[15] = 28
    output[global_i] += prev_block_sum
    

    Block 1์˜ ๊ฐ’์ด ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค:

    Before: [8, 17, 27, 38, 50, 63, 77]
    After:  [36, 45, 55, 66, 78, 91, 105]
    

์„ฑ๋Šฅ ๋ฐ ์ตœ์ ํ™” ๊ณ ๋ ค ์‚ฌํ•ญ

์ฃผ์š” ๊ตฌํ˜„ ์ƒ์„ธ

๋กœ์ปฌ ๋‹จ๊ณ„ ๋™๊ธฐํ™” ํŒจํ„ด: ๋ธ”๋ก ๋‚ด ๊ฐ ๋ฐ˜๋ณต์€ ์—„๊ฒฉํ•œ ์ฝ๊ธฐ โ†’ ๋™๊ธฐํ™” โ†’ ์“ฐ๊ธฐ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. var current_val: output.ElementType = 0 - ๋กœ์ปฌ ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
  2. current_val = shared[local_i - offset] - ์ฝ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  3. barrier() - ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ๋ช…์‹œ์  ๋™๊ธฐํ™”
  4. shared[local_i] += current_val - ์“ฐ๊ธฐ ๋‹จ๊ณ„ (์กฐ๊ฑด ์ถฉ์กฑ ์‹œ)
  5. barrier() - ๋‹ค์Œ ๋ฐ˜๋ณต ์ „ ๋™๊ธฐํ™”

๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”: ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‘ ๋‹จ๊ณ„์˜ ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • ๋ธ”๋ก ๋‚ด๋ถ€: ๋กœ์ปฌ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ ์ค‘ barrier()๋กœ ๊ฐ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”
  • ๋ธ”๋ก ๊ฐ„: DeviceContext๊ฐ€ ํ์— ๋„ฃ์€ ์ปค๋„์„ ์ˆœ์ฐจ ์‹คํ–‰ํ•˜์—ฌ 1๋‹จ๊ณ„๊ฐ€ 2๋‹จ๊ณ„ ์ „์— ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ. ๊ฒฐ๊ณผ๋ฅผ ์ฝ๊ธฐ ์ „์— ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•˜๋ฉด ctx.synchronize()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ๋กœ์ปฌ ๋‹จ๊ณ„์—์„œ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ๋ถ„๋ฆฌํ•˜์—ฌ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ค‘ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ๋™์‹œ์— ์ ‘๊ทผํ•  ๋•Œ ์ƒ๊ธธ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

  1. ์ž‘์—… ํšจ์œจ์„ฑ: ์ด ๊ตฌํ˜„์˜ ์ž‘์—… ๋ณต์žก๋„๋Š” \(O(n \log n)\)์ด๋ฉฐ, ์ˆœ์ฐจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ \(O(n)\)์ž…๋‹ˆ๋‹ค. ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ์ „ํ˜•์ ์ธ ๊ณต๊ฐ„-์‹œ๊ฐ„ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„์ž…๋‹ˆ๋‹ค.

  2. ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ์œ„ํ•œ ์ถ”๊ฐ€ ๊ณต๊ฐ„์€ ์•„์ฃผ ์ ์Šต๋‹ˆ๋‹ค (๋ธ”๋ก๋‹น ์›์†Œ ํ•˜๋‚˜).
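\(O(n \log n)\) ์ž‘์—… ๋ณต์žก๋„๋Š” ๊ฐ offset ๋‹จ๊ณ„์—์„œ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ ์ˆ˜ \(n - \text{offset}\)๋ฅผ ํ•ฉ์‚ฐํ•ด ์ˆซ์ž๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ•จ์ˆ˜๋ช…์€ ์„ค๋ช…์šฉ์œผ๋กœ ์ง€์€ ๊ฒƒ์ด๋ฉฐ, ํŒŒ์ด์ฌ์œผ๋กœ ์“ด ๊ฐ„๋‹จํ•œ ๊ณ„์‚ฐ์ž…๋‹ˆ๋‹ค:

```python
def hillis_steele_ops(n):
    """๊ฐ ๋‹จ๊ณ„(offset = 1, 2, 4, ...)์—์„œ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ ์ˆ˜ (n - offset)๋ฅผ ํ•ฉ์‚ฐ."""
    ops = 0
    offset = 1
    while offset < n:
        ops += n - offset
        offset *= 2
    return ops

for n in [8, 1024, 1 << 20]:
    # ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ด ๋ง์…ˆ ํšŸ์ˆ˜ vs ์ˆœ์ฐจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ง์…ˆ ํšŸ์ˆ˜ (n - 1)
    print(n, hillis_steele_ops(n), n - 1)
```

n = 8์ผ ๋•Œ ๋ณ‘๋ ฌ ๋ฐฉ์‹์€ 7 + 6 + 4 = 17๋ฒˆ์˜ ๋ง์…ˆ์„ ์ˆ˜ํ–‰ํ•˜์ง€๋งŒ(์ˆœ์ฐจ๋Š” 7๋ฒˆ), ๋‹จ๊ณ„ ์ˆ˜๋Š” 3์œผ๋กœ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ณธ๋ฌธ์ด ๋งํ•˜๋Š” ํŠธ๋ ˆ์ด๋“œ์˜คํ”„์ž…๋‹ˆ๋‹ค.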

์ด 2๊ฐœ ์ปค๋„ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์ด ํ•„์š”ํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ณธ ํŒจํ„ด์ž…๋‹ˆ๋‹ค. ๊ธฐ์ˆ˜ ์ •๋ ฌ, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ณ„์‚ฐ, ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ ๋“ฑ ๋‹ค๋ฅธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—๋„ ๋™์ผํ•œ ์ „๋žต์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Puzzle 15: ์ถ• ํ•ฉ๊ณ„

๊ฐœ์š”

2D ํ–‰๋ ฌ a์˜ ๊ฐ ํ–‰์— ๋Œ€ํ•ด ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ TileTensor๋ฅผ ์‚ฌ์šฉํ•ด output์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ถ• ํ•ฉ๊ณ„ ์‹œ๊ฐํ™” ์ถ• ํ•ฉ๊ณ„ ์‹œ๊ฐํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • TileTensor๋ฅผ ํ™œ์šฉํ•œ ํ–‰๋ ฌ ์ฐจ์› ๋ฐฉํ–ฅ์˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ์ด์šฉํ•œ ๋ฐ์ดํ„ฐ ๋ถ„ํ• 
  • ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด
  • ๋‹ค์ฐจ์› ํ…์„œ ๋ ˆ์ด์•„์›ƒ ๋‹ค๋ฃจ๊ธฐ

ํ•ต์‹ฌ์€ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ํ–‰๋ ฌ์˜ ํ–‰์— ๋งคํ•‘ํ•˜๊ณ , TileTensor์˜ ์ฐจ์›๋ณ„ ์ธ๋ฑ์‹ฑ์„ ํ™œ์šฉํ•˜๋ฉด์„œ ๊ฐ ๋ธ”๋ก ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{BATCH} \times \text{SIZE} = 4 \times 6\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} = 8\)
  • ๊ทธ๋ฆฌ๋“œ ํฌ๊ธฐ: \(1 \times \text{BATCH}\)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \(\text{TPB}\)๊ฐœ ์›์†Œ
  • ์ž…๋ ฅ ๋ ˆ์ด์•„์›ƒ: row_major[BATCH, SIZE]()
  • ์ถœ๋ ฅ ๋ ˆ์ด์•„์›ƒ: row_major[BATCH, 1]()

ํ–‰๋ ฌ ์‹œ๊ฐํ™”:

Row 0: [0, 1, 2, 3, 4, 5]       โ†’ Block(0,0)
Row 1: [6, 7, 8, 9, 10, 11]     โ†’ Block(0,1)
Row 2: [12, 13, 14, 15, 16, 17] โ†’ Block(0,2)
Row 3: [18, 19, 20, 21, 22, 23] โ†’ Block(0,3)
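๊ฐ ํ–‰์˜ ๊ธฐ๋Œ€ ํ•ฉ๊ณ„๋Š” ํ˜ธ์ŠคํŠธ์—์„œ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์œ„ 4×6 ํ–‰๋ ฌ์„ ํŒŒ์ด์ฌ์œผ๋กœ ๋งŒ๋“ค์–ด ํ–‰๋ณ„ ํ•ฉ์„ ๊ตฌํ•˜๋Š” ๊ฐ„๋‹จํ•œ ๊ฒ€์ฆ์ž…๋‹ˆ๋‹ค:

```python
BATCH, SIZE = 4, 6
# a[r][c] = r * SIZE + c : ๋ณธ๋ฌธ ์‹œ๊ฐํ™”์™€ ๋™์ผํ•œ ์ž…๋ ฅ ํ–‰๋ ฌ
a = [[r * SIZE + c for c in range(SIZE)] for r in range(BATCH)]
row_sums = [sum(row) for row in a]
print(row_sums)  # [15, 51, 87, 123]
```

์ด ๊ฐ’์ด ์•„๋ž˜ "์ฝ”๋“œ ์‹คํ–‰" ์ ˆ์˜ expected ๋ฒ„ํผ์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.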

์™„์„ฑํ•  ์ฝ”๋“œ

from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace
from layout import TileTensor
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation


comptime TPB = 8
comptime BATCH = 4
comptime SIZE = 6
comptime BLOCKS_PER_GRID = (1, BATCH)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime in_layout = row_major[BATCH, SIZE]()
comptime InLayout = type_of(in_layout)
comptime out_layout = row_major[BATCH, 1]()
comptime OutLayout = type_of(out_layout)


def axis_sum(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var batch = block_idx.y
    # FILL ME IN (roughly 15 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p15/p15.mojo

ํŒ
  1. batch = block_idx.y๋กœ ํ–‰ ์„ ํƒ
  2. ์›์†Œ ๋กœ๋“œ: cache[local_i] = a[batch, local_i]
  3. ์ŠคํŠธ๋ผ์ด๋“œ๋ฅผ ์ ˆ๋ฐ˜์”ฉ ์ค„์ด๋ฉฐ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ˆ˜ํ–‰
  4. ์Šค๋ ˆ๋“œ 0์ด ์ตœ์ข… ํ•ฉ๊ณ„๋ฅผ output[batch]์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p15
pixi run -e amd p15
pixi run -e apple p15
uv run poe p15

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([15.0, 51.0, 87.0, 123.0])

์†”๋ฃจ์…˜

def axis_sum(
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var batch = block_idx.y
    var cache = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    # Visualize:
    # Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
    # Block(0,1): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 1: [6,7,8,9,10,11]
    # Block(0,2): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 2: [12,13,14,15,16,17]
    # Block(0,3): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 3: [18,19,20,21,22,23]

    # each row is handled by each block bc we have grid_dim=(1, BATCH)

    if local_i < size:
        cache[local_i] = a[batch, local_i]
    else:
        # Add zero-initialize padding elements for later reduction
        cache[local_i] = 0

    barrier()

    # do reduction sum per each block
    var stride = TPB // 2
    while stride > 0:
        # Read phase: all threads read the values they need first to avoid race conditions
        var temp_val: output.ElementType = 0
        if local_i < stride:
            temp_val = cache[local_i + stride]

        barrier()

        # Write phase: all threads safely write their computed values
        if local_i < stride:
            cache[local_i] += temp_val

        barrier()
        stride //= 2

    # writing with local thread = 0 that has the sum for each batch
    if local_i == 0:
        output[batch, 0] = cache[0]


TileTensor๋ฅผ ํ™œ์šฉํ•ด 2D ํ–‰๋ ฌ์˜ ํ–‰ ๋ฐฉํ–ฅ ํ•ฉ๊ณ„๋ฅผ ๋ณ‘๋ ฌ๋กœ ๊ตฌํ•˜๋Š” ๋ฆฌ๋•์…˜ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

ํ–‰๋ ฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ๋ธ”๋ก ๋งคํ•‘

Input Matrix (4ร—6) with TileTensor:                Block Assignment:
[[ a[0,0]  a[0,1]  a[0,2]  a[0,3]  a[0,4]  a[0,5] ] โ†’ Block(0,0)
 [ a[1,0]  a[1,1]  a[1,2]  a[1,3]  a[1,4]  a[1,5] ] โ†’ Block(0,1)
 [ a[2,0]  a[2,1]  a[2,2]  a[2,3]  a[2,4]  a[2,5] ] โ†’ Block(0,2)
 [ a[3,0]  a[3,1]  a[3,2]  a[3,3]  a[3,4]  a[3,5] ] โ†’ Block(0,3)

๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ณผ์ •

  1. ์ดˆ๊ธฐ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ:

    Block(0,0): cache = [a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] * *]  // * = ํŒจ๋”ฉ
    Block(0,1): cache = [a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] * *]
    Block(0,2): cache = [a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] * *]
    Block(0,3): cache = [a[3,0] a[3,1] a[3,2] a[3,3] a[3,4] a[3,5] * *]
    
  2. ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„ (Block 0,0 ๊ธฐ์ค€):

    Initial:  [0  1  2  3  4  5  0  0]   // ํŒจ๋”ฉ์€ 0์œผ๋กœ ์ดˆ๊ธฐํ™”๋จ
    Stride 4: [4  6  2  3  4  5  0  0]   // cache[i] += cache[i+4]
    Stride 2: [6  9  2  3  4  5  0  0]   // cache[i] += cache[i+2]
    Stride 1: [15 9  2  3  4  5  0  0]   // cache[0] += cache[1] โ†’ ํ–‰ 0์˜ ํ•ฉ 15
    

์ฃผ์š” ๊ตฌํ˜„ ํŠน์ง•

  1. ๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

    • ์ž…๋ ฅ: ํ–‰ ์šฐ์„ (row-major) ๋ ˆ์ด์•„์›ƒ (BATCH ร— SIZE)
    • ์ถœ๋ ฅ: ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ (BATCH ร— 1)
    • ๊ฐ ๋ธ”๋ก์ด ํ•˜๋‚˜์˜ ํ–‰ ์ „์ฒด๋ฅผ ์ฒ˜๋ฆฌ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ์ž…๋ ฅ์— TileTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: a[batch, local_i]
    • ํšจ์œจ์ ์ธ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
    • ์ถœ๋ ฅ์— TileTensor 2D ์ธ๋ฑ์‹ฑ ์‚ฌ์šฉ: output[batch, 0]
  3. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๋กœ์ง:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            cache[local_i] += cache[local_i + stride]
        barrier()
        stride //= 2
    

    ์ฐธ๊ณ : ์ด ๊ตฌํ˜„์—์„œ๋Š” ๊ฐ™์€ ๋ฐ˜๋ณต ๋‚ด์—์„œ ์Šค๋ ˆ๋“œ๋“ค์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋™์‹œ์— ์ฝ๊ณ  ์“ฐ๊ธฐ ๋•Œ๋ฌธ์— ์ž ์žฌ์ ์ธ ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋” ์•ˆ์ „ํ•œ ๋ฐฉ๋ฒ•์€ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋‹จ๊ณ„๋ฅผ ๋ถ„๋ฆฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

    stride = TPB // 2
    while stride > 0:
        var temp_val: output.ElementType = 0
        if local_i < stride:
            temp_val = cache[local_i + stride]  # ์ฝ๊ธฐ ๋‹จ๊ณ„
        barrier()
        if local_i < stride:
            cache[local_i] += temp_val  # ์“ฐ๊ธฐ ๋‹จ๊ณ„
        barrier()
        stride //= 2
    
  4. ์ถœ๋ ฅ ๊ธฐ๋ก:

    if local_i == 0:
        output[batch, 0] = cache[0]  --> ๋ฐฐ์น˜๋‹น ๊ฒฐ๊ณผ ํ•˜๋‚˜
    

์„ฑ๋Šฅ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ:

    • TileTensor๋ฅผ ํ†ตํ•œ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
    • ๋น ๋ฅธ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
    • ํ–‰ ๊ฒฐ๊ณผ๋‹น ํ•œ ๋ฒˆ์˜ ์“ฐ๊ธฐ
  2. ์Šค๋ ˆ๋“œ ํ™œ์šฉ:

    • ํ–‰ ๊ฐ„ ์™„๋ฒฝํ•œ ๋ถ€ํ•˜ ๊ท ํ˜•
    • ์ฃผ์š” ์—ฐ์‚ฐ์—์„œ ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ ์—†์Œ
    • ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด
  3. ๋™๊ธฐํ™”:

    • ์ตœ์†Œํ•œ์˜ ๋ฐฐ๋ฆฌ์–ด (๋ฆฌ๋•์…˜ ์ค‘์—๋งŒ ์‚ฌ์šฉ)
    • ํ–‰ ๊ฐ„ ๋…๋ฆฝ์ ์ธ ์ฒ˜๋ฆฌ
    • ๋ธ”๋ก ๊ฐ„ ํ†ต์‹  ๋ถˆํ•„์š”
    • ๊ฒฝ์Ÿ ์ƒํƒœ ๊ณ ๋ ค์‚ฌํ•ญ: ํ˜„์žฌ ๊ตฌํ˜„์—์„œ๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์ค‘์— ์ฝ๊ธฐ-์“ฐ๊ธฐ ์ถฉ๋Œ์ด ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋ช…์‹œ์ ์ธ ์ฝ๊ธฐ-์“ฐ๊ธฐ ๋‹จ๊ณ„ ๋ถ„๋ฆฌ๋กœ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

๋ณต์žก๋„ ๋ถ„์„

  • ์‹œ๊ฐ„: ํ–‰๋‹น \(O(\log n)\), n์€ ํ–‰์˜ ๊ธธ์ด
  • ๊ณต๊ฐ„: ๋ธ”๋ก๋‹น \(O(TPB)\) ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
  • ์ „์ฒด ๋ณ‘๋ ฌ ์‹œ๊ฐ„: ์Šค๋ ˆ๋“œ๊ฐ€ ์ถฉ๋ถ„ํ•  ๋•Œ \(O(\log n)\)

Puzzle 16: ํ–‰๋ ฌ ๊ณฑ์…ˆ (MatMul)

๊ฐœ์š”

ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ ๊ณผํ•™ ๊ณ„์‚ฐ, ๋จธ์‹  ๋Ÿฌ๋‹, ๊ทธ๋ž˜ํ”ฝ์Šค์—์„œ ๊ฐ€์žฅ ๊ธฐ๋ณธ์ด ๋˜๋Š” ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ๋‘ ํ–‰๋ ฌ \(A\)์™€ \(B\)๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ์ด๋“ค์˜ ๊ณฑ \(C = A \times B\) ๋ฅผ ๊ตฌํ•˜๊ณ ์ž ํ•ฉ๋‹ˆ๋‹ค.

ํ–‰๋ ฌ \(A_{m\times k}\)์™€ \(B_{k\times n}\)์— ๋Œ€ํ•ด, ๊ฒฐ๊ณผ \(C_{m\times n}\)์˜ ๊ฐ ์›์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค:

\[\Large C_{ij} = \sum_{l=0}^{k-1} A_{il} \cdot B_{lj} \]

ํ–‰๋ ฌ ๊ณฑ์…ˆ ์‹œ๊ฐํ™” ํ–‰๋ ฌ ๊ณฑ์…ˆ ์‹œ๊ฐํ™”

์ด ํผ์ฆ์—์„œ๋Š” GPU์—์„œ ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ•˜๋Š” ์—ฌ๋Ÿฌ ์ ‘๊ทผ๋ฒ•์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ๊ฐ ๋ฒ„์ „์€ ์„œ๋กœ ๋‹ค๋ฅธ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ์ง๊ด€์ ์ธ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค. ์ดํ•ดํ•˜๊ธฐ ์‰ฝ์ง€๋งŒ, ์ค‘๋ณต๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋งŽ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค.

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „ ์ž…๋ ฅ ํ–‰๋ ฌ์˜ ๋ธ”๋ก์„ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์˜ฌ๋ ค ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์ž…๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ๊ฐ™์ง€๋งŒ, ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ฝ์Šต๋‹ˆ๋‹ค.

  • ํƒ€์ผ๋ง ๋ฒ„์ „ ์—ฐ์‚ฐ์„ ํƒ€์ผ ๋‹จ์œ„๋กœ ๋‚˜๋ˆ„์–ด ์Šค๋ ˆ๋“œ๋“ค์ด ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ๋ธ”๋ก์„ ํ•จ๊ป˜ ๋กœ๋“œํ•˜๊ณ  ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ์„ ํ•œ์ธต ํšจ๊ณผ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

๊ฐ ๋ฒ„์ „์€ ์ด์ „ ๋ฒ„์ „ ์œ„์— ์Œ“์•„ ์˜ฌ๋ฆฌ๋ฉด์„œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋˜๋Š” ์ƒˆ๋กœ์šด ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ์ „๋žต์ด ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๋ฐฐ์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋‹จ๊ณ„์  ์ง„ํ–‰ ๊ณผ์ •์€ GPU ์ตœ์ ํ™”์˜ ๋Œ€ํ‘œ์ ์ธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ์ •ํ™•ํ•˜์ง€๋งŒ ๋‹จ์ˆœํ•œ ๊ธฐ๋ณธ ๊ตฌํ˜„์—์„œ ์ถœ๋ฐœ
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ค„์ด๊ธฐ
  3. ํƒ€์ผ๋ง์œผ๋กœ ๋ฐ์ดํ„ฐ ์ง€์—ญ์„ฑ๊ณผ ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ ๊ฐœ์„ 
  4. ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ฅผ ํ™œ์šฉํ•˜๋ฉด์„œ๋„ ์„ฑ๋Šฅ ์œ ์ง€

์›ํ•˜๋Š” ๋ฒ„์ „์„ ๊ณจ๋ผ ํ–‰๋ ฌ ๊ณฑ์…ˆ ์—ฌ์ •์„ ์‹œ์ž‘ํ•ด ๋ณด์„ธ์š”!

์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ ๋ฒ„์ „

๊ฐœ์š”

์ •๋ฐฉ ํ–‰๋ ฌ \(A\)์™€ \(B\)๋ฅผ ๊ณฑํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ \(\text{output}\)์— ์ €์žฅํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ํ–‰๋ ฌ์˜ ์›์†Œ ํ•˜๋‚˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๊ฐ€์žฅ ๊ธฐ๋ณธ์ ์ธ ๊ตฌํ˜„์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์œ„ํ•œ 2D ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ํ–‰ ์šฐ์„ (row-major) ๋ ˆ์ด์•„์›ƒ์—์„œ์˜ ํ–‰๋ ฌ ์ธ๋ฑ์‹ฑ
  • ์Šค๋ ˆ๋“œ์™€ ์ถœ๋ ฅ ์›์†Œ ๊ฐ„ ๋งคํ•‘

ํ•ต์‹ฌ์€ 2D ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ํ–‰๋ ฌ ์›์†Œ์— ๋งคํ•‘ํ•˜๊ณ , ๋‚ด์ ์„ ๋ณ‘๋ ฌ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(1 \times 1\)

๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

  • ์ž…๋ ฅ A: row_major[SIZE, SIZE]()
  • ์ž…๋ ฅ B: row_major[SIZE, SIZE]()
  • ์ถœ๋ ฅ: row_major[SIZE, SIZE]()

์™„์„ฑํ•  ์ฝ”๋“œ

from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.memory import AddressSpace
from layout import TileTensor
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation


comptime TPB = 3
comptime SIZE = 2
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (TPB, TPB)
comptime dtype = DType.float32
comptime layout = row_major[SIZE, SIZE]()
comptime LayoutType = type_of(layout)


def naive_matmul(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 6 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p16/p16.mojo

ํŒ
  1. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋กœ row์™€ col ๊ณ„์‚ฐ
  2. ์ธ๋ฑ์Šค๊ฐ€ size ๋ฒ”์œ„ ์•ˆ์— ์žˆ๋Š”์ง€ ํ™•์ธ
  3. ๋กœ์ปฌ ๋ณ€์ˆ˜์— ๊ณฑ์˜ ํ•ฉ ๋ˆ„์ 
  4. ์ตœ์ข… ํ•ฉ์„ ์˜ฌ๋ฐ”๋ฅธ ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p16 --naive
pixi run -e amd p16 --naive
pixi run -e apple p16 --naive
uv run poe p16 --naive

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])
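기대 출력은 호스트에서 미리 검산해 볼 수 있습니다. 아래 스케치는 호스트 코드가 a를 0,1,2,3으로, b를 그 두 배 값으로 채운다고 가정한 것입니다 (기대 버퍼 값에서 역산한 가정이며, 실제 초기화는 problems/p16/p16.mojo를 확인하세요):

```python
SIZE = 2
# ๊ฐ€์ •: row-major ํ‰ํƒ„ํ™” ๊ธฐ์ค€ a[i] = i, b[i] = 2*i (๊ธฐ๋Œ€ ์ถœ๋ ฅ์—์„œ ์—ญ์‚ฐํ•œ ์ดˆ๊ธฐํ™”)
a = [[r * SIZE + c for c in range(SIZE)] for r in range(SIZE)]
b = [[2 * (r * SIZE + c) for c in range(SIZE)] for r in range(SIZE)]

out = [[0.0] * SIZE for _ in range(SIZE)]
for row in range(SIZE):          # ์ปค๋„์˜ ์Šค๋ ˆ๋“œ (row, col) ํ•˜๋‚˜์— ํ•ด๋‹น
    for col in range(SIZE):
        acc = 0.0
        for k in range(SIZE):    # ์ปค๋„์˜ ๋‚ด์  ๋ฃจํ”„์™€ ๋™์ผ
            acc += a[row][k] * b[k][col]
        out[row][col] = acc

flat = [x for r in out for x in r]
print(flat)  # [4.0, 6.0, 12.0, 22.0]
```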

์†”๋ฃจ์…˜

def naive_matmul[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x

    if row < size and col < size:
        var acc: output.ElementType = 0

        comptime for k in range(size):
            acc += a[row, k] * b[k, col]

        output[row, col] = acc


TileTensor๋ฅผ ํ™œ์šฉํ•œ ๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

ํ–‰๋ ฌ ๋ ˆ์ด์•„์›ƒ (2ร—2 ์˜ˆ์‹œ)

Matrix A:          Matrix B:                   Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]             [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]             [c[1,0] c[1,1]]

๊ตฌํ˜„ ์ƒ์„ธ

  1. ์Šค๋ ˆ๋“œ ๋งคํ•‘:

    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑ: a[row, k]
    • ์ „์น˜ ์ ‘๊ทผ: b[k, col]
    • ์ถœ๋ ฅ ๊ธฐ๋ก: output[row, col]
  3. ์—ฐ์‚ฐ ํ๋ฆ„:

    # var๋กœ ๊ฐ€๋ณ€ ๋ˆ„์  ๋ณ€์ˆ˜๋ฅผ ์„ ์–ธํ•˜๊ณ  ํ…์„œ์˜ ์›์†Œ ํƒ€์ž…์„ ์‚ฌ์šฉ
    var acc: output.ElementType = 0
    
    # @parameter๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ
    @parameter
    for k in range(size):
        acc += a[row, k] * b[k, col]
    

์ฃผ์š” ์–ธ์–ด ๊ธฐ๋Šฅ

  1. ๋ณ€์ˆ˜ ์„ ์–ธ:

    • var acc: output.element_type = 0์—์„œ var๋กœ ๊ฐ€๋ณ€ ๋ณ€์ˆ˜๋ฅผ ์„ ์–ธํ•˜๊ณ , output.element_type์œผ๋กœ ์ถœ๋ ฅ ํ…์„œ์™€ ๋™์ผํ•œ ํƒ€์ž…์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค
    • ๋ˆ„์  ์—ฐ์‚ฐ ์ „์— 0์œผ๋กœ ์ดˆ๊ธฐํ™”
  2. ๋ฃจํ”„ ์ตœ์ ํ™”:

    • @parameter ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋กœ ์ปดํŒŒ์ผ ํƒ€์ž„์— ๋ฃจํ”„ ์ „๊ฐœ
    • ํฌ๊ธฐ๊ฐ€ ์ž‘๊ณ  ๋ฏธ๋ฆฌ ์•Œ๋ ค์ง„ ํ–‰๋ ฌ์—์„œ ์„ฑ๋Šฅ ํ–ฅ์ƒ
    • ๋” ๋‚˜์€ ๋ช…๋ น์–ด ์Šค์ผ€์ค„๋ง ๊ฐ€๋Šฅ

์„ฑ๋Šฅ ํŠน์„ฑ

  1. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ 2 x SIZEํšŒ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ฝ์Œ
    • ์Šค๋ ˆ๋“œ๋‹น ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ 1ํšŒ
    • ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ์—†์Œ
  2. ์—ฐ์‚ฐ ํšจ์œจ:

    • ๋‹จ์ˆœํ•œ ๊ตฌํ˜„์ด์ง€๋งŒ ์„ฑ๋Šฅ์€ ์ตœ์ ์ด ์•„๋‹˜
    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ค‘๋ณต์œผ๋กœ ๋งŽ์ด ์ฝ์Œ
    • ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์ง€ ์•Š์Œ
  3. ํ•œ๊ณ„:

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ๋งŽ์ด ์†Œ๋ชจ
    • ๋‚ฎ์€ ๋ฐ์ดํ„ฐ ์ง€์—ญ์„ฑ
    • ํฐ ํ–‰๋ ฌ๋กœ ๊ฐˆ์ˆ˜๋ก ํ™•์žฅ์„ฑ ๋ถ€์กฑ

์ด ๊ธฐ๋ณธ ๊ตฌํ˜„์€ GPU ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ๊ธฐ์ค€์ ์œผ๋กœ, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ตœ์ ํ™”ํ•ด์•ผ ํ•˜๋Š” ์ด์œ ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

GPU ์„ฑ๋Šฅ ์ดํ•ดํ•˜๊ธฐ: ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ

๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ–ˆ์œผ๋‹ˆ, ์ด๋Ÿฐ ๊ถ๊ธˆ์ฆ์ด ์ƒ๊ธธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: ์šฐ๋ฆฌ ์ปค๋„์€ ์‹ค์ œ๋กœ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋™์ž‘ํ•˜๊ณ  ์žˆ์„๊นŒ? GPU์˜ ์—ฐ์‚ฐ ๋Šฅ๋ ฅ์— ์˜ํ•ด ์ œํ•œ๋˜๋Š” ๊ฑธ๊นŒ, ์•„๋‹ˆ๋ฉด ๋‹ค๋ฅธ ๋ฌด์–ธ๊ฐ€๊ฐ€ ๋ฐœ๋ชฉ์„ ์žก๊ณ  ์žˆ๋Š” ๊ฑธ๊นŒ?

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ(์—ญ์ฃผ: ๋ฃจํ”„๋ผ์ธ์€ โ€œ์ƒํ•œ์„ โ€œ์ด๋ผ๋Š” ๋œป์œผ๋กœ, ์„ฑ๋Šฅ์ด ๋„˜์„ ์ˆ˜ ์—†๋Š” ํ•œ๊ณ„๋ฅผ ์ง€๋ถ• ์„ ์— ๋น„์œ ํ•œ ์ด๋ฆ„์ž…๋‹ˆ๋‹ค)์€ GPU ์ตœ์ ํ™”์˜ ๋‚˜์นจ๋ฐ˜์ž…๋‹ˆ๋‹ค. ์ปค๋„์˜ ์„ฑ๋Šฅ์„ ์ œํ•œํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๋ณ‘๋ชฉ์ด ๋ฌด์—‡์ธ์ง€ ์•Œ๋ ค์ฃผ๊ณ , ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ์ตœ์ ํ™” ๋ฐฉํ–ฅ์œผ๋กœ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค. ๊ฐ์œผ๋กœ ๊ฐœ์„ ํ•˜๋Š” ๋Œ€์‹ , ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์ด ์ •ํ™•ํžˆ ์–ด๋””์— ์ง‘์ค‘ํ•ด์•ผ ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

1. ๋ชจ๋“  GPU ์ปค๋„์˜ ๋‘ ๊ฐ€์ง€ ์„ฑ๋Šฅ ์ƒํ•œ

๋ชจ๋“  GPU ์ปค๋„์€ ๋‘ ๊ฐ€์ง€ ๊ทผ๋ณธ์ ์ธ ์ œ์•ฝ ์•„๋ž˜์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค:

  • ์—ฐ์‚ฐ ์ƒํ•œ(compute ceiling) โ€“ ์ฝ”์–ด๊ฐ€ ๋ถ€๋™์†Œ์ˆ˜์  ์—ฐ์‚ฐ์„ ์–ผ๋งˆ๋‚˜ ๋น ๋ฅด๊ฒŒ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€ (์ตœ๋Œ€ FLOPs/s)
  • ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ(memory ceiling) โ€“ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ์ด ์ฝ”์–ด์— ๋ฐ์ดํ„ฐ๋ฅผ ์–ผ๋งˆ๋‚˜ ๋น ๋ฅด๊ฒŒ ๊ณต๊ธ‰ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€ (์ตœ๋Œ€ bytes/s)

์–ด๋–ค ์ƒํ•œ์ด ์ปค๋„์„ ์ œ์•ฝํ•˜๋Š”์ง€ ํŒŒ์•…ํ•˜๋Š” ๊ฒƒ์ด ์ตœ์ ํ™” ์ „๋žต์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์€ ๋‘ ๊ฐ€์ง€ ํ•ต์‹ฌ ์ง€ํ‘œ๋ฅผ ๊ทธ๋ž˜ํ”„๋กœ ํ‘œํ˜„ํ•˜์—ฌ ์ด ๊ด€๊ณ„๋ฅผ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค:

X์ถ•: ์‚ฐ์ˆ  ๊ฐ•๋„(Arithmetic Intensity) โ€“ ๋ฐ์ดํ„ฐ 1๋ฐ”์ดํŠธ๋‹น ์ˆ˜ํ–‰ํ•˜๋Š” ์—ฐ์‚ฐ๋Ÿ‰

\[\Large I = \frac{\text{Total FLOPs}}{\text{Total Bytes from Memory}} \quad [\text{FLOP/B}]\]

Y์ถ•: ์‹ค์ธก ์„ฑ๋Šฅ(Sustained Performance) โ€“ ์ปค๋„์ด ์‹ค์ œ๋กœ ๋‹ฌ์„ฑํ•˜๋Š” ์†๋„

\[\Large P_{\text{sustained}} = \frac{\text{Total FLOPs}}{\text{Elapsed Time}} \quad [\text{GFLOP/s}]\]

๋‘ ๊ฐœ์˜ โ€œ์ƒํ•œ(roof)โ€œ์ด ๋‹ฌ์„ฑ ๊ฐ€๋Šฅํ•œ ์„ฑ๋Šฅ์˜ ์ƒํ•œ์„ ์ •ํ•ฉ๋‹ˆ๋‹ค:

์ƒํ•œ์ˆ˜์‹์˜๋ฏธ
๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ\(P = B_{\text{peak}} \cdot I\)๊ธฐ์šธ์–ด์ง„ ์ง์„ . ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์˜ํ•ด ์„ฑ๋Šฅ์ด ์ œํ•œ๋จ
์—ฐ์‚ฐ ์ƒํ•œ\(P = P_{\text{peak}}\)์ˆ˜ํ‰์„ . ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์˜ํ•ด ์„ฑ๋Šฅ์ด ์ œํ•œ๋จ

์ž„๊ณ„ ๊ฐ•๋„(critical intensity)

\[\Large I^* = \frac{P_{\text{peak}}}{B_{\text{peak}}}\]

๋Š” ์ปค๋„์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ(\(I < I^\ast\) )์—์„œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ(\(I > I^\ast\) )๋กœ ์ „ํ™˜๋˜๋Š” ์ง€์ ์ž…๋‹ˆ๋‹ค.

2. ํ•˜๋“œ์›จ์–ด ์˜ˆ์‹œ: NVIDIA A100 ์‚ฌ์–‘

์ด๋ก ์„ NVIDIA A100์˜ ๊ตฌ์ฒด์ ์ธ ์ˆซ์ž๋กœ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

์ตœ๋Œ€ FP32 ์ฒ˜๋ฆฌ๋Ÿ‰ \[\Large P_{\text{peak}} = 19.5 \text{ TFLOP/s} = 19{,}500 \text{ GFLOP/s}\]

์ตœ๋Œ€ HBM2 ๋Œ€์—ญํญ \[\Large B_{\text{peak}} = 1{,}555 \text{ GB/s}\]

์ž„๊ณ„ ๊ฐ•๋„ \[\Large I^* = \frac{19{,}500}{1{,}555} \approx 12.5 \text{ FLOP/B}\]

์ถœ์ฒ˜: NVIDIA A100 Tensor Core GPU Architecture

์ด๋Š” ์‚ฐ์ˆ  ๊ฐ•๋„๊ฐ€ 12.5 FLOP/B ๋ฏธ๋งŒ์ธ ์ปค๋„์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ, ๊ทธ ์ด์ƒ์ธ ์ปค๋„์€ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ž„์„ ๋œปํ•ฉ๋‹ˆ๋‹ค.

3. ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ตฌํ˜„์˜ ์‹œ๊ฐํ™”

์•„๋ž˜ ์• ๋‹ˆ๋ฉ”์ด์…˜์€ ์ด ํผ์ฆ์˜ ๊ตฌํ˜„๋“ค์ด A100์˜ ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์— ์–ด๋–ป๊ฒŒ ๋Œ€์‘ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ ์‹œ๊ฐํ™”

์ด ์‹œ๊ฐํ™”๋Š” ์ด ํผ์ฆ์—์„œ ๊ฑฐ์น˜๊ฒŒ ๋  ์ตœ์ ํ™” ๊ณผ์ •์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ํ•˜๋“œ์›จ์–ด ์ œ์•ฝ โ€“ ๋นจ๊ฐ„์ƒ‰ ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ๊ณผ ํŒŒ๋ž€์ƒ‰ ์—ฐ์‚ฐ ์ƒํ•œ์ด ์„ฑ๋Šฅ ํ•œ๊ณ„๋ฅผ ์ •์˜
  2. ์ถœ๋ฐœ์  โ€“ ๊ธฐ๋ณธ ๊ตฌํ˜„(์ฃผํ™ฉ์ƒ‰ ์ )์ด ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ ์œ„์— ์œ„์น˜
  3. ์ตœ์ ํ™” ๋ชฉํ‘œ โ€“ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „(์ฒญ๋ก์ƒ‰ ์ )์œผ๋กœ ์‚ฐ์ˆ  ๊ฐ•๋„๊ฐ€ ๊ฐœ์„ ๋จ
  4. ๊ถ๊ทน์  ๋ชฉํ‘œ โ€“ ๊ธˆ์ƒ‰ ํ™”์‚ดํ‘œ๋Š” ์ปค๋„์ด ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๊ฐ€ ๋˜๋Š” ์ž„๊ณ„ ๊ฐ•๋„ ์ง€์ ์„ ๊ฐ€๋ฆฌํ‚ด

4. ๊ธฐ๋ณธ ๊ตฌํ˜„ ๋ถ„์„

์ด์ „ ์„น์…˜์˜ ๊ธฐ๋ณธ ์ปค๋„์ด ์™œ ์ด๋Ÿฐ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. \(2 \times 2\) ํ–‰๋ ฌ ๊ณฑ์…ˆ์˜ ๊ฒฝ์šฐ:

์ถœ๋ ฅ ์›์†Œ๋‹น ์—ฐ์‚ฐ๋Ÿ‰: \(\text{SIZE} + (\text{SIZE}-1) = 3 \text{ FLOPs }\)

๊ฐ ์›์†Œ์—๋Š” \(\text{SIZE}\) ํšŒ์˜ ๊ณฑ์…ˆ๊ณผ \(\text{SIZE} - 1\) ํšŒ์˜ ๋ง์…ˆ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค: \[C_{00} = A_{00} \cdot B_{00} + A_{01} \cdot B_{10}\] \(\text{SIZE} = 2\) ์ผ ๋•Œ ๊ณฑ์…ˆ 2ํšŒ + ๋ง์…ˆ 1ํšŒ = 3 FLOPs

์ถœ๋ ฅ ์›์†Œ๋‹น ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

  • ํ–‰๋ ฌ A์˜ ํ–‰: \(2 \times 4 = 8\) bytes (FP32)
  • ํ–‰๋ ฌ B์˜ ์—ด: \(2 \times 4 = 8\) bytes (FP32)
  • ํ•ฉ๊ณ„: ์ถœ๋ ฅ ์›์†Œ๋‹น \(16\) bytes

์‚ฐ์ˆ  ๊ฐ•๋„: \[\Large I_{\text{naive}} = \frac{3 \text{ FLOPs}}{16 \text{ bytes}} = 0.1875 \text{ FLOP/B}\]

์ด ์‚ฐ์ˆ  ๊ฐ•๋„๋Š” A100์˜ ์ž„๊ณ„ ๊ฐ•๋„์— ํ•œ์ฐธ ๋ชป ๋ฏธ์น˜๋ฏ€๋กœ, ๊ธฐ๋ณธ ์ปค๋„์€ ์‹ฌ๊ฐํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

\[\Large I_{\text{naive}} = 0.1875 \ll I^* = 12.5\]

์˜ˆ์ƒ ์„ฑ๋Šฅ: \[\Large P \approx B_{\text{peak}} \times I_{\text{naive}} = 1{,}555 \times 0.1875 \approx 292 \text{ GFLOP/s}\]

์ด๋Š” GPU ์—ฐ์‚ฐ ์ž ์žฌ๋ ฅ์˜ \(\frac{292}{19{,}500} \approx 1.5\%\) ์— ๋ถˆ๊ณผํ•ฉ๋‹ˆ๋‹ค! ์‹œ๊ฐํ™”์—์„œ ๋…ธ๋ž€์ƒ‰ ์ ์ด ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ ์œ„์— ๋†“์ธ ๊ฒƒ์ด ์ด๋ฅผ ์ž˜ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค โ€” ์—ฐ์‚ฐ ์ƒํ•œ์—๋Š” ํ•œ์ฐธ ๋ฏธ์น˜์ง€ ๋ชปํ•˜๋Š” ์ˆ˜์ค€์ž…๋‹ˆ๋‹ค.

5. ๋‹ค์Œ ๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์ด ์•Œ๋ ค์ฃผ๋Š” ์ตœ์ ํ™” ์ „๋žต์€ ๋ช…ํ™•ํ•ฉ๋‹ˆ๋‹ค: ์ค‘๋ณต ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์—ฌ ์‚ฐ์ˆ  ๊ฐ•๋„๋ฅผ ๋†’์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๋ฒ•์ด ๋ฐ”๋กœ ์ด๋ฅผ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์ด์ :

  • ํ˜‘๋ ฅ์  ๋กœ๋”ฉ: ์Šค๋ ˆ๋“œ๋“ค์ด ํ•จ๊ป˜ ํ–‰๋ ฌ ๋ธ”๋ก์„ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ
  • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ: ๋กœ๋“œํ•œ ์›์†Œ ํ•˜๋‚˜๋ฅผ ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์— ํ™œ์šฉ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ: ๋А๋ฆฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ๋Œ€ํ•œ ์ ‘๊ทผ ํšŸ์ˆ˜ ๊ฐ์†Œ

์‚ฐ์ˆ  ๊ฐ•๋„ ๊ฐœ์„  ์˜ˆ์ƒ์น˜: \[\Large I_{\text{shared}} = \frac{12 \text{ FLOPs}}{32 \text{ bytes}} = 0.375 \text{ FLOP/B}\]

์ž‘์€ \(2 \times 2\) ๊ทœ๋ชจ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ด์ง€๋งŒ, ์ด 2๋ฐฐ์˜ ์‚ฐ์ˆ  ๊ฐ•๋„ ํ–ฅ์ƒ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ํ›จ์”ฌ ๋” ๋งŽ์ด ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ํฐ ํ–‰๋ ฌ์—์„œ ๊ทน์ ์ธ ํšจ๊ณผ๋ฅผ ๋ฐœํœ˜ํ•ฉ๋‹ˆ๋‹ค.

6. ๋ฃจํ”„๋ผ์ธ์ด ์•Œ๋ ค์ฃผ๋Š” ์ตœ์ ํ™” ์ „๋žต

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์€ ํ˜„์žฌ ์„ฑ๋Šฅ์„ ์ง„๋‹จํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ์ตœ์ ํ™” ๋ฐฉํ–ฅ๊นŒ์ง€ ์•Œ๋ ค์ค๋‹ˆ๋‹ค. ์ดํ›„ ํผ์ฆ์—์„œ ์‚ดํŽด๋ณผ ํ•ต์‹ฌ ๊ธฐ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

๊ธฐ๋ฒ•๋ฃจํ”„๋ผ์ธ ํšจ๊ณผ๊ตฌํ˜„ ๋ฐฉ๋ฒ•
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ๋ง๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์œผ๋กœ ์‚ฐ์ˆ  ๊ฐ•๋„ โ†‘ํ˜‘๋ ฅ์  ๋กœ๋”ฉ, ๋ธ”๋ก ๋‹จ์œ„ ์—ฐ์‚ฐ
๋ ˆ์ง€์Šคํ„ฐ ๋ธ”๋กœํ‚น๋ ˆ์ง€์Šคํ„ฐ ๋ˆ„์ ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ๋ ˆ์ง€์Šคํ„ฐ ๋ณ€์ˆ˜์™€ ๋ฃจํ”„ ์ „๊ฐœ
์ปค๋„ ํ“จ์ „์—ฐ์‚ฐ ๊ฒฐํ•ฉ์œผ๋กœ ๋ฐ”์ดํŠธ๋‹น FLOPs ์ฆ๊ฐ€๋‹จ์ผ ์ปค๋„์—์„œ ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ ๋‹จ๊ณ„ ์ฒ˜๋ฆฌ
๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ(coalescing)์‹คํšจ ๋Œ€์—ญํญ ํ™œ์šฉ ๊ทน๋Œ€ํ™”๊ตฌ์กฐํ™”๋œ ์ ‘๊ทผ ํŒจํ„ด, ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ
๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ๋ณต์‚ฌ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์œผ๋กœ ์—ฐ์‚ฐ-๋ฉ”๋ชจ๋ฆฌ ์ค‘์ฒฉcopy_dram_to_sram_async์™€ ์—ฐ์‚ฐ ์ค‘์ฒฉ
ํ˜ผํ•ฉ ์ •๋ฐ€๋„์ž‘์€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€ํ•˜ ๊ฐ์†ŒFP16/BF16 ์ž…๋ ฅ + FP32 ๋ˆ„์ 

๊ฐ ๊ธฐ๋ฒ•์€ ์ปค๋„์„ ๋ฃจํ”„๋ผ์ธ ์œ„์—์„œ ์ด๋™์‹œํ‚ต๋‹ˆ๋‹ค โ€” ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ์„ ๋”ฐ๋ผ ์œ„๋กœ(๋Œ€์—ญํญ ํ™œ์šฉ ๊ฐœ์„ ), ๋˜๋Š” ์˜ค๋ฅธ์ชฝ ์—ฐ์‚ฐ ์ƒํ•œ์„ ํ–ฅํ•ด(์‚ฐ์ˆ  ๊ฐ•๋„ ํ–ฅ์ƒ).

๋น„๋™๊ธฐ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ฐธ๊ณ : ํ‘œ์ค€ GPU ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ(ld.global)๋Š” ์ด๋ฏธ ๋น„๋™๊ธฐ์ž…๋‹ˆ๋‹ค โ€” ์›Œํ”„๋Š” ๋กœ๋“œํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์‹ค์ œ๋กœ ํ•„์š”ํ•ด์งˆ ๋•Œ๊นŒ์ง€ ๊ณ„์† ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค. cp.async(CUDA)๋‚˜ copy_dram_to_sram_async(Mojo) ๊ฐ™์€ ์ „์šฉ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๋ช…๋ น์€ ์—ฌ๊ธฐ์„œ ํ•œ ๊ฑธ์Œ ๋” ๋‚˜์•„๊ฐ€, ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์„ ์‚ฌ์šฉํ•˜๊ณ  ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์šฐํšŒํ•˜์—ฌ ์ž์› ํ™œ์šฉ์„ ๋†’์ž…๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ ๋™๊ธฐ ์—ฐ์‚ฐ์„ ๋น„๋™๊ธฐ๋กœ ๋ฐ”๊พธ๋Š” ๊ฒƒ๊ณผ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

7. ๋‹จ์ˆœํ•œ ๋ฃจํ”„๋ผ์ธ์„ ๋„˜์–ด์„œ

๋‹ค๋‹จ๊ณ„ ๋ฉ”๋ชจ๋ฆฌ: ๊ณ ๊ธ‰ ๋ฃจํ”„๋ผ์ธ์€ L2 ์บ์‹œ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ ˆ์ง€์Šคํ„ฐ ๋Œ€์—ญํญ์— ๋Œ€ํ•ด ๋ณ„๋„์˜ ์ƒํ•œ์„ ํฌํ•จํ•˜์—ฌ ์–ด๋–ค ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต์ด ์„ฑ๋Šฅ์„ ์ œ์•ฝํ•˜๋Š”์ง€ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

ํ†ต์‹  ๋ฃจํ”„๋ผ์ธ: ๋ฉ€ํ‹ฐ GPU ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ๋Œ€์‹  ์ธํ„ฐ์ปค๋„ฅํŠธ ๋Œ€์—ญํญ(NVLink, InfiniBand)์„ ์‚ฌ์šฉํ•˜์—ฌ ์Šค์ผ€์ผ๋ง ํšจ์œจ์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค.

์ „์šฉ ์œ ๋‹›: ์ตœ์‹  GPU๋Š” ๊ณ ์œ ํ•œ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ๊ฐ€์ง„ ํ…์„œ ์ฝ”์–ด๋ฅผ ํฌํ•จํ•˜๋ฉฐ, ๋ณ„๋„์˜ ๋ฃจํ”„๋ผ์ธ ๋ถ„์„์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

8. ์‹ค์ „์—์„œ ๋ฃจํ”„๋ผ์ธ ํ™œ์šฉํ•˜๊ธฐ

  1. ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง: Nsight Compute ๊ฐ™์€ ๋„๊ตฌ๋กœ ์‹ค์ œ FLOPs์™€ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ธก์ •
  2. ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ ํ‘œ์‹œ: ์‚ฐ์ˆ  ๊ฐ•๋„์™€ ์‹ค์ธก ์„ฑ๋Šฅ ๊ณ„์‚ฐ
  3. ๋ณ‘๋ชฉ ์‹๋ณ„: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„์€ ๋ฉ”๋ชจ๋ฆฌ ์ƒํ•œ ์œ„์—, ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ์ปค๋„์€ ์—ฐ์‚ฐ ์ƒํ•œ์— ๊ทผ์ ‘
  4. ์ตœ์ ํ™” ์„ ํƒ: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„์—๋Š” ๋Œ€์—ญํญ ๊ฐœ์„ ์—, ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ์ปค๋„์—๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ€๊ฒฝ์— ์ง‘์ค‘
  5. ์ธก์ •๊ณผ ๋ฐ˜๋ณต: ์ตœ์ ํ™”๊ฐ€ ์ปค๋„์„ ๊ธฐ๋Œ€ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ์ด๋™์‹œํ‚ค๋Š”์ง€ ๊ฒ€์ฆ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํผ์ฆ๊ณผ์˜ ์—ฐ๊ฒฐ

๋‹ค์Œ ์„น์…˜์—์„œ๋Š” ์ปค๋„์„ ๋ฃจํ”„๋ผ์ธ ์œ„๋กœ ๋Œ์–ด์˜ฌ๋ฆฌ๊ธฐ ์‹œ์ž‘ํ•˜๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์‹œ๊ฐํ™”์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, ์ฃผํ™ฉ์ƒ‰ ์ (๊ธฐ๋ณธ)์—์„œ ์ฒญ๋ก์ƒ‰ ์ (๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)์œผ๋กœ ์ด๋™ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค โ€” ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ๊ฐœ์„ ์„ ํ†ตํ•œ ํ™•์‹คํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์ž…๋‹ˆ๋‹ค.

\(2 \times 2\) ์˜ˆ์ œ์—์„œ๋Š” ์—ฐ์‚ฐ ์ƒํ•œ์— ๋„๋‹ฌํ•˜์ง€ ๋ชปํ•˜์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์„ฑ๋Šฅ์— ๊ฒฐ์ •์ ์ธ ์—ญํ• ์„ ํ•˜๋Š” ํฐ ํ–‰๋ ฌ์—์„œ ๋™์ผํ•œ ์›๋ฆฌ๊ฐ€ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์™œ ๋„์›€์ด ๋˜๊ณ  ์–ผ๋งˆ๋‚˜ ๊ฐœ์„ ์„ ๊ธฐ๋Œ€ํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ์ด๋ก ์  ํ† ๋Œ€๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๋ฉด GPU ์ตœ์ ํ™”๊ฐ€ ์ถ”์ธก์—์„œ ์ฒด๊ณ„์ ์ธ ์—”์ง€๋‹ˆ์–ด๋ง์œผ๋กœ ๋ฐ”๋€๋‹ˆ๋‹ค. ์ด ์ฑ…์˜ ๋ชจ๋“  ์ตœ์ ํ™” ๊ธฐ๋ฒ•์€ ์ด ๋‹จ์ˆœํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ์„ฑ๋Šฅ ๋ชจ๋ธ์— ๋Œ€ํ•œ ํšจ๊ณผ๋กœ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „

๊ฐœ์š”

์ •๋ฐฉ ํ–‰๋ ฌ \(A\) ์™€ \(B\) ์˜ ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ \(\text{output}\)์— ์ €์žฅํ•˜๋Š” ํผ์ฆ์ž…๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค. ์—ฐ์‚ฐ ์ „์— ํ–‰๋ ฌ ๋ธ”๋ก์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋ฏธ๋ฆฌ ๋กœ๋“œํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ๋ธ”๋ก ๋กœ์ปฌ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ
  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ํŒจํ„ด
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”
  • 2D ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํ˜‘๋ ฅ์  ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
  • ํ–‰๋ ฌ ์—ฐ์‚ฐ์— TileTensor๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๊ธฐ

ํ•ต์‹ฌ์€ TileTensor๋ฅผ ํ†ตํ•ด ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋น„์šฉ์ด ํฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(1 \times 1\)

๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

  • ์ž…๋ ฅ A: row_major[SIZE, SIZE]()
  • ์ž…๋ ฅ B: row_major[SIZE, SIZE]()
  • ์ถœ๋ ฅ: row_major[SIZE, SIZE]()
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB ร— TPB ํฌ๊ธฐ์˜ TileTensor 2๊ฐœ

๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ:

Global Memory (TileTensor):          Shared Memory (TileTensor):
A[i,j]: Direct access                  a_shared[local_row, local_col]
B[i,j]: Direct access                  b_shared[local_row, local_col]

์™„์„ฑํ•  ์ฝ”๋“œ

def single_block_matmul(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    # FILL ME IN (roughly 12 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p16/p16.mojo

ํŒ
  1. ์ „์—ญ ์ธ๋ฑ์Šค์™€ ๋กœ์ปฌ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ–‰๋ ฌ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ
  2. ๋กœ๋“œ ํ›„ barrier() ํ˜ธ์ถœ
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‚ด์  ๊ณ„์‚ฐ
  4. ๋ชจ๋“  ์—ฐ์‚ฐ์—์„œ ๋ฐฐ์—ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p16 --single-block
pixi run -e amd p16 --single-block
pixi run -e apple p16 --single-block
uv run poe p16 --single-block

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([4.0, 6.0, 12.0, 22.0])

์†”๋ฃจ์…˜

def single_block_matmul[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    var row = block_dim.y * block_idx.y + thread_idx.y
    var col = block_dim.x * block_idx.x + thread_idx.x
    var local_row = thread_idx.y
    var local_col = thread_idx.x

    var a_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())
    var b_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())

    if row < size and col < size:
        a_shared[local_row, local_col] = a[row, col]
        b_shared[local_row, local_col] = b[row, col]

    barrier()

    if row < size and col < size:
        var acc: output.ElementType = 0

        comptime for k in range(size):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        output[row, col] = acc


TileTensor๋ฅผ ํ™œ์šฉํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ตฌํ˜„์€ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค:

๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ

Input Tensors (2ร—2):                Shared Memory (3ร—3):
Matrix A:                           a_shared:
 [a[0,0] a[0,1]]                     [s[0,0] s[0,1] s[0,2]]
 [a[1,0] a[1,1]]                     [s[1,0] s[1,1] s[1,2]]
                                     [s[2,0] s[2,1] s[2,2]]
Matrix B:                           b_shared: (๋น„์Šทํ•œ ๋ ˆ์ด์•„์›ƒ)
 [b[0,0] b[0,1]]                     [t[0,0] t[0,1] t[0,2]]
 [b[1,0] b[1,1]]                     [t[1,0] t[1,1] t[1,2]]
                                     [t[2,0] t[2,1] t[2,2]]

๊ตฌํ˜„ ๋‹จ๊ณ„

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •:

    # address_space๋ฅผ ์ง€์ •ํ•œ TileTensor๋กœ 2D ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ…์„œ ์ƒ์„ฑ
    a_shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB, TPB]())
    b_shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB, TPB]())
    
  2. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ:

    # ํ–‰๋ ฌ ์ ‘๊ทผ์„ ์œ„ํ•œ ์ „์—ญ ์ธ๋ฑ์Šค
    row = block_dim.y * block_idx.y + thread_idx.y
    col = block_dim.x * block_idx.x + thread_idx.x
    
    # ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์šฉ ๋กœ์ปฌ ์ธ๋ฑ์Šค
    local_row = thread_idx.y
    local_col = thread_idx.x
    
  3. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ:

    # TileTensor ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œ
    if row < size and col < size:
        a_shared[local_row, local_col] = a[row, col]
        b_shared[local_row, local_col] = b[row, col]
    
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ์—ฐ์‚ฐ:

    # ๊ฐ€๋“œ๋กœ ์œ ํšจํ•œ ํ–‰๋ ฌ ์›์†Œ๋งŒ ๊ณ„์‚ฐ
    if row < size and col < size:
        # ์ถœ๋ ฅ ํ…์„œ์˜ ํƒ€์ž…์œผ๋กœ ๋ˆ„์  ๋ณ€์ˆ˜ ์ดˆ๊ธฐํ™”
        var acc: output.ElementType = 0
    
        # ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ๋˜๋Š” ํ–‰๋ ฌ ๊ณฑ์…ˆ ๋ฃจํ”„
        comptime for k in range(size):
            acc += a_shared[local_row, k] * b_shared[k, local_col]
    
        # ํ–‰๋ ฌ ๊ฒฝ๊ณ„ ๋‚ด์˜ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฐ๊ณผ ๊ธฐ๋ก
        output[row, col] = acc
    

    ์ฃผ์š” ํฌ์ธํŠธ:

    • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: if row < size and col < size

      • ๋ฒ”์œ„ ๋ฐ– ์—ฐ์‚ฐ ๋ฐฉ์ง€
      • ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ๋งŒ ์ž‘์—… ์ˆ˜ํ–‰
      • TPB (3ร—3) > SIZE (2ร—2)์ด๋ฏ€๋กœ ํ•„์ˆ˜
    • ๋ˆ„์  ๋ณ€์ˆ˜ ํƒ€์ž…: var acc: output.element_type

      • ์ถœ๋ ฅ ํ…์„œ์˜ ์›์†Œ ํƒ€์ž…์œผ๋กœ ํƒ€์ž… ์•ˆ์ „์„ฑ ํ™•๋ณด
      • ์ผ๊ด€๋œ ์ˆ˜์น˜ ์ •๋ฐ€๋„ ๋ณด์žฅ
      • ๋ˆ„์  ์ „์— 0์œผ๋กœ ์ดˆ๊ธฐํ™”
    • ๋ฃจํ”„ ์ตœ์ ํ™”: @parameter for k in range(size)

      • ์ปดํŒŒ์ผ ํƒ€์ž„์— ๋ฃจํ”„ ์ „๊ฐœ
      • ๋” ๋‚˜์€ ๋ช…๋ น์–ด ์Šค์ผ€์ค„๋ง ๊ฐ€๋Šฅ
      • ํฌ๊ธฐ๊ฐ€ ์ž‘๊ณ  ๋ฏธ๋ฆฌ ์•Œ๋ ค์ง„ ํ–‰๋ ฌ์— ํšจ๊ณผ์ 
    • ๊ฒฐ๊ณผ ๊ธฐ๋ก: output[row, col] = acc

      • ๋™์ผํ•œ ๊ฐ€๋“œ ์กฐ๊ฑด์œผ๋กœ ๋ณดํ˜ธ
      • ์œ ํšจํ•œ ์Šค๋ ˆ๋“œ๋งŒ ๊ฒฐ๊ณผ ๊ธฐ๋ก
      • ํ–‰๋ ฌ ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ ์œ ์ง€

์Šค๋ ˆ๋“œ ์•ˆ์ „์„ฑ๊ณผ ๋™๊ธฐํ™”

  1. ๊ฐ€๋“œ ์กฐ๊ฑด:

    • ์ž…๋ ฅ ๋กœ๋”ฉ: if row < size and col < size
    • ์—ฐ์‚ฐ: ๋™์ผํ•œ ๊ฐ€๋“œ๋กœ ์Šค๋ ˆ๋“œ ์•ˆ์ „์„ฑ ๋ณด์žฅ
    • ์ถœ๋ ฅ ๊ธฐ๋ก: ๊ฐ™์€ ์กฐ๊ฑด์œผ๋กœ ๋ณดํ˜ธ
    • ์ž˜๋ชป๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์•ˆ์ „์„ฑ:

    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TPB ๋ฒ”์œ„ ๋‚ด์—์„œ๋งŒ ์ ‘๊ทผ
    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: ํฌ๊ธฐ ๊ฒ€์‚ฌ๋กœ ๋ณดํ˜ธ
    • ์ถœ๋ ฅ: ๊ฐ€๋“œ๋œ ์“ฐ๊ธฐ๋กœ ๋ฐ์ดํ„ฐ ์†์ƒ ๋ฐฉ์ง€

์ฃผ์š” ์–ธ์–ด ๊ธฐ๋Šฅ

  1. TileTensor์˜ ์žฅ์ :

    • ์ง์ ‘ 2D ์ธ๋ฑ์‹ฑ์œผ๋กœ ์ฝ”๋“œ ๋‹จ์ˆœํ™”
    • element_type์„ ํ†ตํ•œ ํƒ€์ž… ์•ˆ์ „์„ฑ
    • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น:

    • address_space๋ฅผ ์ง€์ •ํ•œ TileTensor๋กœ ๊ตฌ์กฐํ™”๋œ ํ• ๋‹น
    • ์ž…๋ ฅ ํ…์„œ์™€ ๋™์ผํ•œ ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ
    • ํšจ์œจ์  ์ ‘๊ทผ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ
  3. ๋™๊ธฐํ™”:

    • barrier()๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ผ๊ด€์„ฑ ๋ณด์žฅ
    • ๋กœ๋“œ์™€ ์—ฐ์‚ฐ ๊ฐ„ ์ ์ ˆํ•œ ๋™๊ธฐํ™”
    • ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ

์„ฑ๋Šฅ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ:

    • ์›์†Œ๋‹น ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ 1ํšŒ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋‹ค์ค‘ ์žฌ์‚ฌ์šฉ
    • ๋ณ‘ํ•ฉ๋œ(coalesced) ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  2. ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ:

    • ํ˜‘๋ ฅ์  ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
    • ๊ณต์œ  ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ
    • ํšจ์œจ์ ์ธ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  3. ์—ฐ์‚ฐ ์ด์ :

    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ
    • ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
    • ๋ช…๋ น์–ด ์ฒ˜๋ฆฌ๋Ÿ‰ ๊ฐœ์„ 

์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๊ธฐ๋ณธ ๋ฒ„์ „ ๋Œ€๋น„ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšŸ์ˆ˜ ๊ฐ์†Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ
  • TileTensor์˜ ํšจ์œจ์ ์ธ 2D ์ธ๋ฑ์‹ฑ ํ™œ์šฉ
  • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ์œ ์ง€

ํƒ€์ผ๋ง ๋ฒ„์ „

๊ฐœ์š”

TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์œผ๋กœ ์ •๋ฐฉ ํ–‰๋ ฌ \(A\) ์™€ \(B\) ๋ฅผ ๊ณฑํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•˜์„ธ์š”. ํฐ ํ–‰๋ ฌ์„ ์ž‘์€ ์กฐ๊ฐ(ํƒ€์ผ)์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

  • TileTensor๋ฅผ ์‚ฌ์šฉํ•œ ํ–‰๋ ฌ ํƒ€์ผ๋ง์œผ๋กœ ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ
  • ์ ์ ˆํ•œ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•œ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์œจ
  • TensorBuilder๋ฅผ ํ†ตํ•œ ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  • TileTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํƒ€์ผ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

๊ตฌ์„ฑ

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE_TILED} = 9\)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(3 \times 3\) ๋ธ”๋ก
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก๋‹น \(\text{TPB} \times \text{TPB}\) TileTensor 2๊ฐœ

๋ ˆ์ด์•„์›ƒ ๊ตฌ์„ฑ:

  • ์ž…๋ ฅ A: row_major[SIZE_TILED, SIZE_TILED]()
  • ์ž…๋ ฅ B: row_major[SIZE_TILED, SIZE_TILED]()
  • ์ถœ๋ ฅ: row_major[SIZE_TILED, SIZE_TILED]()
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: TensorBuilder๋ฅผ ์‚ฌ์šฉํ•œ TPB ร— TPB TileTensor 2๊ฐœ

ํƒ€์ผ๋ง ์ „๋žต

๋ธ”๋ก ๊ตฌ์„ฑ

Grid Layout (3ร—3):           Thread Layout per Block (3ร—3):
[B00][B01][B02]               [T00 T01 T02]
[B10][B11][B12]               [T10 T11 T12]
[B20][B21][B22]               [T20 T21 T22]

๊ฐ ๋ธ”๋ก์€ TileTensor ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ํƒ€์ผ์„ ์ฒ˜๋ฆฌ

ํƒ€์ผ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„

  1. ์Šค๋ ˆ๋“œ ์œ„์น˜์— ๋Œ€ํ•œ ์ „์—ญ ์ธ๋ฑ์Šค์™€ ๋กœ์ปฌ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ
  2. A์™€ B ํƒ€์ผ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  3. ๊ฐ ํƒ€์ผ์— ๋Œ€ํ•ด:
    • ํ–‰๋ ฌ A์™€ B์—์„œ ํƒ€์ผ ๋กœ๋“œ
    • ๋ถ€๋ถ„ ๊ณฑ ๊ณ„์‚ฐ
    • ๋ ˆ์ง€์Šคํ„ฐ์— ๊ฒฐ๊ณผ ๋ˆ„์ 
  4. ์ตœ์ข… ๋ˆ„์  ๊ฒฐ๊ณผ ๊ธฐ๋ก

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

Matrix A (8ร—8)                 Matrix B (8ร—8)               Matrix C (8ร—8)
+---+---+---+                  +---+---+---+                +---+---+---+
|T00|T01|T02| ...              |T00|T01|T02| ...            |T00|T01|T02| ...
+---+---+---+                  +---+---+---+                +---+---+---+
|T10|T11|T12|                  |T10|T11|T12|                |T10|T11|T12|
+---+---+---+                  +---+---+---+                +---+---+---+
|T20|T21|T22|                  |T20|T21|T22|                |T20|T21|T22|
+---+---+---+                  +---+---+---+                +---+---+---+
  ...                            ...                          ...

ํƒ€์ผ ์ฒ˜๋ฆฌ ๊ณผ์ • (C[T11] ๊ณ„์‚ฐ ์˜ˆ์‹œ):
1. A์™€ B์—์„œ ํƒ€์ผ ๋กœ๋“œ:
   +---+      +---+
   |A11| ร—    |B11|     ๊ฐ ๋‹จ๊ณ„ k์— ๋Œ€ํ•ด:
   +---+      +---+     C[T11] += A[row, k] ร— B[k, col]

2. ํƒ€์ผ ์ด๋™:
   ๋‹จ๊ณ„ 1      ๋‹จ๊ณ„ 2      ๋‹จ๊ณ„ 3
   A: [T10]    A: [T11]    A: [T12]
   B: [T01]    B: [T11]    B: [T21]

3. ํƒ€์ผ ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ (i,j)์˜ ์—ฐ์‚ฐ:
   C[i,j] = ฮฃ (A[i,k] ร— B[k,j]), k๋Š” ํƒ€์ผ ๋„ˆ๋น„ ๋ฒ”์œ„

๋™๊ธฐํ™” ํ•„์š” ์‹œ์ :
* ํƒ€์ผ์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ๋กœ๋“œํ•œ ํ›„
* ๊ฐ ๋‹จ๊ณ„์˜ ์—ฐ์‚ฐ์ด ๋๋‚œ ํ›„

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE_TILED = 9
comptime BLOCKS_PER_GRID_TILED = (3, 3)  # each block covers a 3x3 tile
comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
comptime layout_tiled = row_major[SIZE_TILED, SIZE_TILED]()
comptime LayoutTiledType = type_of(layout_tiled)


def matmul_tiled(
    output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
):
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var tiled_row = block_idx.y * TPB + thread_idx.y
    var tiled_col = block_idx.x * TPB + thread_idx.x
    # FILL ME IN (roughly 20 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p16/p16.mojo

ํŒ
  1. ํ‘œ์ค€ ์ธ๋ฑ์‹ฑ ๊ทœ์น™์„ ์‚ฌ์šฉํ•˜์„ธ์š”: local_row = thread_idx.y, local_col = thread_idx.x

  2. ์ „์—ญ ์œ„์น˜ ๊ณ„์‚ฐ:

    global_row = block_idx.y * TPB + local_row
    

    ๊ทธ๋ฆฌ๊ณ 

    global_col = block_idx.x * TPB + local_col
    

    ์ „์—ญ ์ธ๋ฑ์‹ฑ ๊ณต์‹ ์ดํ•ดํ•˜๊ธฐ:

    • ๊ฐ ๋ธ”๋ก์€ ํ–‰๋ ฌ์˜ TPB ร— TPB ํƒ€์ผ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

    • block_idx.y๋Š” ํ˜„์žฌ ๋ช‡ ๋ฒˆ์งธ ๋ธ”๋ก ํ–‰์ธ์ง€๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค (0, 1, 2โ€ฆ)

    • block_idx.y * TPB๋Š” ํ•ด๋‹น ๋ธ”๋ก ํƒ€์ผ์˜ ์‹œ์ž‘ ํ–‰์ž…๋‹ˆ๋‹ค

    • local_row (0~TPB-1)์€ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ์˜ ์˜คํ”„์…‹์ž…๋‹ˆ๋‹ค

    • ๋‘˜์„ ๋”ํ•˜๋ฉด ์ „์ฒด ํ–‰๋ ฌ์—์„œ์˜ ์‹ค์ œ ํ–‰ ์œ„์น˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค

      TPB=3 ์˜ˆ์‹œ:

      Block Layout:        Global Matrix (9ร—9):
      [B00][B01][B02]      [0 1 2 | 3 4 5 | 6 7 8]
      [B10][B11][B12]  โ†’   [9 A B | C D E | F G H]
      [B20][B21][B22]      [I J K | L M N | O P Q]
                          โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”
                          [R S T | U V W | X Y Z]
                          [a b c | d e f | g h i]
                          [j k l | m n o | p q r]
                          โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”
                          [s t u | v w x | y z ฮฑ]
                          [ฮฒ ฮณ ฮด | ฮต ฮถ ฮท | ฮธ ฮน ฮบ]
                          [ฮป ฮผ ฮฝ | ฮพ ฮฟ ฯ€ | ฯ ฯƒ ฯ„]
      
      Thread(1,2) in Block(1,0):
      - block_idx.y = 1, local_row = 1
      - global_row = 1 * 3 + 1 = 4
      - ์ด ์Šค๋ ˆ๋“œ๋Š” ํ–‰๋ ฌ์˜ 4๋ฒˆ์งธ ํ–‰์„ ๋‹ด๋‹น
      
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น (.fill(0)์œผ๋กœ ์‚ฌ์ „ ์ดˆ๊ธฐํ™”๋จ)

  4. 9ร—9 ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์ด๋ฏ€๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๋ถˆํ•„์š”!

  5. ์ ์ ˆํ•œ ๋™๊ธฐํ™”์™€ ํ•จ๊ป˜ ํƒ€์ผ ๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ๋ˆ„์ 

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p16 --tiled
pixi run -e amd p16 --tiled
pixi run -e apple p16 --tiled
uv run poe p16 --tiled

ํผ์ฆ์„ ์•„์ง ํ’€์ง€ ์•Š์•˜๋‹ค๋ฉด ์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 4176.0, 4248.0, 9504.0, 9738.0, 9972.0, 10206.0, 10440.0, 10674.0, 10908.0, 11142.0, 11376.0, 15336.0, 15732.0, 16128.0, 16524.0, 16920.0, 17316.0, 17712.0, 18108.0, 18504.0, 21168.0, 21726.0, 22284.0, 22842.0, 23400.0, 23958.0, 24516.0, 25074.0, 25632.0, 27000.0, 27720.0, 28440.0, 29160.0, 29880.0, 30600.0, 31320.0, 32040.0, 32760.0, 32832.0, 33714.0, 34596.0, 35478.0, 36360.0, 37242.0, 38124.0, 39006.0, 39888.0, 38664.0, 39708.0, 40752.0, 41796.0, 42840.0, 43884.0, 44928.0, 45972.0, 47016.0, 44496.0, 45702.0, 46908.0, 48114.0, 49320.0, 50526.0, 51732.0, 52938.0, 54144.0, 50328.0, 51696.0, 53064.0, 54432.0, 55800.0, 57168.0, 58536.0, 59904.0, 61272.0])

์†”๋ฃจ์…˜: ์ˆ˜๋™ ํƒ€์ผ๋ง

def matmul_tiled[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
):
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var tiled_row = block_idx.y * TPB + local_row
    var tiled_col = block_idx.x * TPB + local_col

    var a_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())
    var b_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())

    var acc: output.ElementType = 0

    # Iterate over tiles to compute matrix product
    comptime for tile in range((size + TPB - 1) // TPB):
        # Load A tile - global row stays the same, col determined by tile
        if tiled_row < size and (tile * TPB + local_col) < size:
            a_shared[local_row, local_col] = a[
                tiled_row, tile * TPB + local_col
            ]

        # Load B tile - row determined by tile, global col stays the same
        if (tile * TPB + local_row) < size and tiled_col < size:
            b_shared[local_row, local_col] = b[
                tile * TPB + local_row, tiled_col
            ]

        barrier()

        # Matrix multiplication within the tile
        if tiled_row < size and tiled_col < size:
            comptime for k in range(min(Int(TPB), Int(size - tile * TPB))):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write out final result
    if tiled_row < size and tiled_col < size:
        output[tiled_row, tiled_col] = acc


ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ตฌํ˜„์€ ์ž‘์€ ํƒ€์ผ \((3 \times 3)\) ์„ ์‚ฌ์šฉํ•˜์—ฌ ํฐ ํ–‰๋ ฌ \((9 \times 9)\) ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋™์ž‘ ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น

    Input matrices (9ร—9) - (3ร—3) ํƒ€์ผ๋ง์— ๋”ฑ ๋งž๋Š” ํฌ๊ธฐ:
    A = [0  1  2  3  4  5  6  7  8 ]    B = [0  2  4  6  8  10 12 14 16]
        [9  10 11 12 13 14 15 16 17]        [18 20 22 24 26 28 30 32 34]
        [18 19 20 21 22 23 24 25 26]        [36 38 40 42 44 46 48 50 52]
        [27 28 29 30 31 32 33 34 35]        [54 56 58 60 62 64 66 68 70]
        [36 37 38 39 40 41 42 43 44]        [72 74 76 78 80 82 84 86 88]
        [45 46 47 48 49 50 51 52 53]        [90 92 94 96 98 100 102 104 106]
        [54 55 56 57 58 59 60 61 62]        [108 110 112 114 116 118 120 122 124]
        [63 64 65 66 67 68 69 70 71]        [126 128 130 132 134 136 138 140 142]
        [72 73 74 75 76 77 78 79 80]        [144 146 148 150 152 154 156 158 160]
    
    ๋ธ”๋ก๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ (3ร—3):
    a_shared[TPB, TPB]  b_shared[TPB, TPB]
    
  2. ํƒ€์ผ ์ฒ˜๋ฆฌ ๋ฃจํ”„

    ํƒ€์ผ ์ˆ˜ = 9 // 3 = 3๊ฐœ (๋‚˜๋จธ์ง€ ์—†์ด ๋”ฑ ๋‚˜๋ˆ ์ง!)
    
    ๊ฐ ํƒ€์ผ์— ๋Œ€ํ•ด:
    1. A์™€ B์—์„œ ํƒ€์ผ ๋กœ๋“œ
    2. ๋ถ€๋ถ„ ๊ณฑ ๊ณ„์‚ฐ
    3. ๋ ˆ์ง€์Šคํ„ฐ์— ๋ˆ„์ 
    
  3. ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ ํŒจํ„ด

    • \((9 \times 9)\) ์ด ๋”ฑ ๋‚˜๋ˆ ์ง€๋ฏ€๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๊ธฐ์ˆ ์ ์œผ๋กœ๋Š” ๋ถˆํ•„์š”ํ•˜์ง€๋งŒ, ๋ฐฉ์–ด์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ค๋ฅธ ํ–‰๋ ฌ ํฌ๊ธฐ์—๋„ ๋Œ€์‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

         # A ํƒ€์ผ ๋กœ๋“œ - ์ „์—ญ ํ–‰์€ ๊ทธ๋Œ€๋กœ, ์—ด์€ ํƒ€์ผ์— ์˜ํ•ด ๊ฒฐ์ •
         if tiled_row < size and (tile * TPB + local_col) < size:
             a_shared[local_row, local_col] = a[
                 tiled_row, tile * TPB + local_col
             ]
      
         # B ํƒ€์ผ ๋กœ๋“œ - ํ–‰์€ ํƒ€์ผ์— ์˜ํ•ด ๊ฒฐ์ •, ์ „์—ญ ์—ด์€ ๊ทธ๋Œ€๋กœ
         if (tile * TPB + local_row) < size and tiled_col < size:
             b_shared[local_row, local_col] = b[
                 tile * TPB + local_row, tiled_col
             ]
      
  4. ํƒ€์ผ ๋‚ด ์—ฐ์‚ฐ

    for k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ:

      Bank Conflict Free (Good):        Bank Conflicts (Bad):
      Thread0: a_shared[0,k] b_shared[k,0]  Thread0: a_shared[k,0] b_shared[0,k]
      Thread1: a_shared[0,k] b_shared[k,1]  Thread1: a_shared[k,0] b_shared[1,k]
      Thread2: a_shared[0,k] b_shared[k,2]  Thread2: a_shared[k,0] b_shared[2,k]
      โ†“                                     โ†“
      ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ๋ณ‘๋ ฌ ์ ‘๊ทผ             b_shared๊ฐ€ ์—ด ์šฐ์„ ์ด์—ˆ๋‹ค๋ฉด
      (a_shared๋Š” broadcast)               ๊ฐ™์€ ๋ฑ…ํฌ์— ์ง๋ ฌ ์ ‘๊ทผ
      

      ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ ์„ค๋ช…:

      • ์™ผ์ชฝ (Good): b_shared[k,threadIdx.x]๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๊ณ , a_shared[0,k]๋Š” ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ฉ๋‹ˆ๋‹ค
      • ์˜ค๋ฅธ์ชฝ (Bad): b_shared๊ฐ€ ์—ด ์šฐ์„ ์ด์—ˆ๋‹ค๋ฉด ์Šค๋ ˆ๋“œ๋“ค์ด ๋™์‹œ์— ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค
      • ํ•ต์‹ฌ: ์ด๊ฒƒ์€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์•„๋‹Œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ๊ด€ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค
      • ๋ฑ…ํฌ ๊ตฌ์กฐ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” 32๊ฐœ ๋ฑ…ํฌ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์œผ๋ฉฐ, ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ฐ™์€ ๋ฑ…ํฌ์˜ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ์ ‘๊ทผํ•  ๋•Œ ์ถฉ๋Œ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค
  5. ๋™๊ธฐํ™” ์ง€์ 

    barrier() ํ˜ธ์ถœ ์‹œ์ :
    1. ํƒ€์ผ ๋กœ๋”ฉ ํ›„
    2. ํƒ€์ผ ์—ฐ์‚ฐ ํ›„
    

์ฃผ์š” ์„ฑ๋Šฅ ํŠน์„ฑ:

  • \((3 \times 3)\) ํƒ€์ผ๋กœ \((9 \times 9)\) ํ–‰๋ ฌ ์ฒ˜๋ฆฌ (๋”ฑ ๋งž๋Š” ํฌ๊ธฐ!)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋น ๋ฅธ ํƒ€์ผ ์ ‘๊ทผ
  • ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜ ์ตœ์†Œํ™”
  • ๋ฑ…ํฌ ์ถฉ๋Œ์„ ํ”ผํ•˜๋„๋ก ์ตœ์ ํ™”๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ์ ‘๊ทผ ํŒจํ„ด
  1. ๊ฒฐ๊ณผ ๊ธฐ๋ก:

    if tiled_row < size and tiled_col < size:
       output[tiled_row, tiled_col] = acc
    
    • ๋‹ค๋ฅธ ํ–‰๋ ฌ ํฌ๊ธฐ์™€ ํƒ€์ผ๋ง ์ „๋žต์„ ์œ„ํ•œ ๋ฐฉ์–ด์  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํฌํ•จ
    • ์ถœ๋ ฅ ํ–‰๋ ฌ์— ์ง์ ‘ ๋Œ€์ž…
    • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ ํšจํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก
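
4๋ฒˆ์—์„œ ์„ค๋ช…ํ•œ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ๊ฐ„๋‹จํ•œ ๊ณ„์‚ฐ์œผ๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” 32๊ฐœ ๋ฑ…ํฌ์™€ 4๋ฐ”์ดํŠธ ์›Œ๋“œ๋ฅผ ๊ฐ€์ •ํ•œ ์˜ˆ์‹œ ์Šค์ผ€์น˜๋กœ(TPB=3์˜ ์ž‘์€ ํƒ€์ผ ๋Œ€์‹  ํšจ๊ณผ๊ฐ€ ๋šœ๋ ทํ•œ 32ร—32 ๋ฐฐ์—ด ์‚ฌ์šฉ), ํ–‰ ๋ฐฉํ–ฅ ์ ‘๊ทผ์€ 32๊ฐœ ๋ฑ…ํฌ์— ๋ถ„์‚ฐ๋˜๊ณ  ์—ด ๋ฐฉํ–ฅ ์ ‘๊ทผ(stride 32)์€ ํ•œ ๋ฑ…ํฌ์— ์ง‘์ค‘๋จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

```python
# ๊ฐ€์ •: 32๊ฐœ ๋ฑ…ํฌ, 4๋ฐ”์ดํŠธ ์›Œ๋“œ, ํ–‰ ์šฐ์„  32x32 ๊ณต์œ  ๋ฐฐ์—ด.
# ์›Œํ”„(32 ์Šค๋ ˆ๋“œ)์˜ ์ ‘๊ทผ ํŒจํ„ด๋ณ„๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋ฑ…ํฌ ์ˆ˜๋ฅผ ์„ธ๋Š” ์Šค์ผ€์น˜.
N, NUM_BANKS = 32, 32

def bank(i, j):
    # ํ–‰ ์šฐ์„  flat ์ธ๋ฑ์Šค(i*N + j)๊ฐ€ ์†ํ•˜๋Š” ๋ฑ…ํฌ ๋ฒˆํ˜ธ
    return (i * N + j) % NUM_BANKS

k = 0
good = {bank(k, t) for t in range(32)}  # b_shared[k, t]: ํ–‰ ๋ฐฉํ–ฅ (๋ณธ๋ฌธ์˜ Good ํŒจํ„ด)
bad = {bank(t, k) for t in range(32)}   # b_shared[t, k]: ์—ด ๋ฐฉํ–ฅ, stride 32 (Bad ํŒจํ„ด)

print(len(good), len(bad))  # → 32 1  (Good: 32๊ฐœ ๋ฑ…ํฌ ๋ถ„์‚ฐ, Bad: ๋ชจ๋‘ ๊ฐ™์€ ๋ฑ…ํฌ)
```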

์ฃผ์š” ์ตœ์ ํ™”

  1. ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”:

    • ๋ชจ๋“  ํ…์„œ์— ํ–‰ ์šฐ์„  ๋ ˆ์ด์•„์›ƒ
    • ํšจ์œจ์ ์ธ 2D ์ธ๋ฑ์‹ฑ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ๋ณ‘ํ•ฉ๋œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ
    • ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  3. ์—ฐ์‚ฐ:

    • ๋ ˆ์ง€์Šคํ„ฐ ๊ธฐ๋ฐ˜ ๋ˆ„์ , ์ฆ‰ var acc: output.ElementType = 0
    • comptime for๋ฅผ ํ†ตํ•œ ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ

์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ๋†’์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • TileTensor๋ฅผ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
  • ์ตœ์ ์˜ ํƒ€์ผ๋ง ์ „๋žต
  • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  • ์„ธ์‹ฌํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

์†”๋ฃจ์…˜: ๊ด€์šฉ์  TileTensor ํƒ€์ผ๋ง

from std.gpu.memory import async_copy_wait_all
from layout.layout_tensor import copy_dram_to_sram_async
from layout import Layout as IntTupleLayout

comptime NUM_THREADS = TPB * TPB
comptime BLOCK_DIM_COUNT = 2


def matmul_idiomatic_tiled[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutTiledType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutTiledType, ImmutAnyOrigin],
):
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var tiled_row = block_idx.y * TPB + local_row
    var tiled_col = block_idx.x * TPB + local_col

    # Get the tile of the output matrix that this thread block is responsible for
    var out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
    var a_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())
    var b_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB, TPB]())

    var acc: output.ElementType = 0

    comptime load_a_layout = IntTupleLayout.row_major(
        1, TPB
    )  # Coalesced loading
    comptime load_b_layout = IntTupleLayout.row_major(
        1, TPB
    )  # Coalesced loading
    # Note: Both matrices stored in same orientation for correct matrix multiplication
    # Transposed loading would be useful if B were pre-transposed in global memory

    comptime for idx in range(
        size // TPB
    ):  # Perfect division: 9 // 3 = 3 tiles
        # Get tiles from A and B matrices
        var a_tile = a.tile[TPB, TPB](block_idx.y, Int(idx))
        var b_tile = b.tile[TPB, TPB](Int(idx), block_idx.x)

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads=NUM_THREADS,
            block_dim_count=BLOCK_DIM_COUNT,
        ](a_shared.to_layout_tensor(), a_tile.to_layout_tensor())
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads=NUM_THREADS,
            block_dim_count=BLOCK_DIM_COUNT,
        ](b_shared.to_layout_tensor(), b_tile.to_layout_tensor())

        # Wait for all async copies to complete
        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        comptime for k in range(TPB):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result to output tile
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc


๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์€ Mojo์˜ TileTensor API์™€ ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•˜์—ฌ ๊น”๋”ํ•œ ๊ตฌํ˜„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํฌ์ธํŠธ: ์ด ๊ตฌํ˜„์€ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ‘œ์ค€ A ร— B ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ตฌํ˜„์ด ํ•˜๋Š” ๊ฒƒ:

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ: ํ‘œ์ค€ \(A \times B\) ๊ณฑ์…ˆ (\(A \times B^T\) ๊ฐ€ ์•„๋‹˜)
  • ๋กœ๋”ฉ ํŒจํ„ด: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ row_major[1, TPB]()๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
  • ์—ฐ์‚ฐ: acc += a_shared[local_row, k] * b_shared[k, local_col]
  • ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ: ๋กœ๋”ฉ ์‹œ ์ „์น˜ ์—†์Œ - ๋‘ ํ–‰๋ ฌ์„ ๊ฐ™์€ ๋ฐฉํ–ฅ์œผ๋กœ ๋กœ๋“œ

์ด ๊ตฌํ˜„์ด ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ:

  • \(A \times B^T\) ๊ณฑ์…ˆ์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์Œ
  • ์ „์น˜ ๋กœ๋”ฉ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ
  • ๋ณต์‚ฌ ๊ณผ์ •์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ „์น˜ํ•˜์ง€ ์•Š์Œ

\((9 \times 9)\) ํ–‰๋ ฌ ํฌ๊ธฐ์—์„œ๋Š” ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์ด ์ด๋ฃจ์–ด์ ธ ๋ชจ๋“  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๋ถˆํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  1. TileTensor ํƒ€์ผ API

    out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)
    

    ์ˆ˜๋™ ์ขŒํ‘œ ๊ณ„์‚ฐ ์—†์ด โ€œ(block_idx.y, block_idx.x) ์œ„์น˜์˜ ํƒ€์ผ์„ ๊ฐ€์ ธ์˜จ๋‹คโ€œ๋ฅผ ์ง์ ‘ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

  2. ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ

    copy_dram_to_sram_async[
        thread_layout=load_a_layout,
        num_threads=NUM_THREADS,
        block_dim_count=BLOCK_DIM_COUNT,
    ](a_shared.to_layout_tensor(), a_tile.to_layout_tensor())
    copy_dram_to_sram_async[
        thread_layout=load_b_layout,
        num_threads=NUM_THREADS,
        block_dim_count=BLOCK_DIM_COUNT,
    ](b_shared.to_layout_tensor(), b_tile.to_layout_tensor())
    async_copy_wait_all()
    

    ์ด ์—ฐ์‚ฐ๋“ค์€:

    • ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์šฐํšŒํ•˜๋Š” ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์˜ ์ค‘์ฒฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค (copy_dram_to_sram_async ์ฐธ๊ณ )
    • ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์œ„ํ•œ ํŠนํ™”๋œ ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ์ˆ˜๋™ ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™”๊ฐ€ ๋ถˆํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
    • ์ค‘์š”:
      • ํ‘œ์ค€ GPU ๋กœ๋“œ๋Š” ์ด๋ฏธ ๋น„๋™๊ธฐ์ ์ž…๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋“ค์€ ๋” ๋‚˜์€ ๋ฆฌ์†Œ์Šค ํ™œ์šฉ๊ณผ ๋ ˆ์ง€์Šคํ„ฐ ์šฐํšŒ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
      • copy_dram_to_sram_async๋Š” ๊ธฐ๋ณธ์ ์œผ๋กœ 1D ์Šค๋ ˆ๋“œ ๋ธ”๋ก(block_dim.y == block_dim.z == 1)์„ ๊ฐ€์ •ํ•˜๋ฉฐ, ๋ณ„๋„ ์ง€์ •์ด ์—†์œผ๋ฉด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณต์‚ฌ์— ์ฐธ์—ฌํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ์„ ์ง€์ •ํ•˜์—ฌ ์ด ๋™์ž‘์„ ๋ณ€๊ฒฝํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:
        • block_dim_count: ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ์ฐจ์› ์ˆ˜ (2D ์Šค๋ ˆ๋“œ ๋ธ”๋ก THREADS_PER_BLOCK_TILED = (TPB, TPB)์˜ ๊ฒฝ์šฐ 2)
        • num_threads: ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB*TPB == 9)
  3. ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ ˆ์ด์•„์›ƒ

    comptime load_a_layout = row_major[1, TPB]()    # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ
    comptime load_b_layout = row_major[1, TPB]()    # ๋ณ‘ํ•ฉ ๋กœ๋”ฉ
    # ์ฐธ๊ณ : ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ์—์„œ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ๊ฐ™์€ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉ
    

    ํ˜„์žฌ ๊ตฌํ˜„์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ถ„์„:

    ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์œ„ํ•ด row_major[1, TPB]()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

    • load_a_layout: ์Šค๋ ˆ๋“œ๋“ค์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ–‰๋ ฌ A ํ–‰์˜ ์—ฐ์† ์›์†Œ๋ฅผ ๋กœ๋“œ
    • load_b_layout: ์Šค๋ ˆ๋“œ๋“ค์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ–‰๋ ฌ B ํ–‰์˜ ์—ฐ์† ์›์†Œ๋ฅผ ๋กœ๋“œ
    • ํ•ต์‹ฌ: ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์€ ๋ณต์‚ฌ ์‹œ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ˜‘๋ ฅ ๋ฐฉ์‹์„ ๊ฒฐ์ •ํ•˜๋ฉฐ, ์ตœ์ข… ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ๊ณผ๋Š” ๋ณ„๊ฐœ์ž…๋‹ˆ๋‹ค

    ์‹ค์ œ ์—ฐ์‚ฐ ํŒจํ„ด (A ร— B์ž„์„ ์ฆ๋ช…):

    # ํ˜„์žฌ ๊ตฌํ˜„์˜ ์‹ค์ œ ์—ฐ์‚ฐ
    acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    # ์ด๊ฒƒ์€ C[i,j] = ฮฃ(A[i,k] * B[k,j])์— ํ•ด๋‹น
    # ์ฆ‰, ํ‘œ์ค€ ํ–‰๋ ฌ ๊ณฑ์…ˆ A ร— B
    

    ๋‘ ํ–‰๋ ฌ์ด ๊ฐ™์€ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ :

    ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ํƒ€์ผ ๋กœ๋”ฉ:
    - Matrix A ํƒ€์ผ: ์Šค๋ ˆ๋“œ๋“ค์ด A[block_row, k], A[block_row, k+1], A[block_row, k+2]... ๋กœ๋“œ (์—ฐ์†)
    - Matrix B ํƒ€์ผ: ์Šค๋ ˆ๋“œ๋“ค์ด B[k, block_col], B[k, block_col+1], B[k, block_col+2]... ๋กœ๋“œ (์—ฐ์†)
    
    row_major[1, TPB]()๋กœ ๋‘ ํŒจํ„ด ๋ชจ๋‘ ๋ณ‘ํ•ฉ
    

    ์„ธ ๊ฐ€์ง€ ๋ณ„๊ฐœ์˜ ๋ฉ”๋ชจ๋ฆฌ ๊ณ ๋ ค์‚ฌํ•ญ:

    1. ์ „์—ญโ†’๊ณต์œ  ๋ณ‘ํ•ฉ: row_major[1, TPB]()๋กœ ๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ณด์žฅ
    2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ: a_shared[local_row, k] * b_shared[k, local_col]๋กœ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ
    3. ํ–‰๋ ฌ ์—ฐ์‚ฐ: ์—ฐ์‚ฐ ํŒจํ„ด์ด A ร— B๋ฅผ ๊ฒฐ์ • (A ร— B^T๊ฐ€ ์•„๋‹˜)
  4. ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์œผ๋กœ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

    comptime for idx in range(size // TPB):  # ๋‚˜๋จธ์ง€ ์—†๋Š” ๋‚˜๋ˆ—์…ˆ: 9 // 3 = 3
    

    \((9 \times 9)\) ํ–‰๋ ฌ๊ณผ \((3 \times 3)\) ํƒ€์ผ์—์„œ๋Š” ๋ชจ๋“  ํƒ€์ผ์ด ์ •ํ™•ํžˆ ๊ฝ‰ ์ฐจ๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค!

  5. ๋ฐฉ์–ด์  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•œ ๊น”๋”ํ•œ ํƒ€์ผ ์ฒ˜๋ฆฌ

    # ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์—์„œ๋„ ๋ฐฉ์–ด์  ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํฌํ•จ
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc
    

    \((9 \times 9)\) ์˜ ์™„๋ฒฝํ•œ ํƒ€์ผ๋ง์—์„œ๋Š” ์ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ๊ธฐ์ˆ ์ ์œผ๋กœ ๋ถˆํ•„์š”ํ•˜์ง€๋งŒ, ๋ฐฉ์–ด์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ค๋ฅธ ํ–‰๋ ฌ ํฌ๊ธฐ์™€์˜ ์ผ๊ด€์„ฑ์„ ์œ„ํ•ด ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ

๊ด€์šฉ์  ๊ตฌํ˜„์€ ํƒ€์ผ๋ง์˜ ์„ฑ๋Šฅ ์ด์ ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋” ๊น”๋”ํ•œ ์ถ”์ƒํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  1. ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ: ํƒ€์ผ๋ง์„ ํ†ตํ•ด ๊ณต๊ฐ„์ , ์‹œ๊ฐ„์  ์ง€์—ญ์„ฑ์„ ํ™œ์šฉ
  2. ๋ณ‘ํ•ฉ ์ ‘๊ทผ: ํŠนํ™”๋œ ๋กœ๋“œ ๋ ˆ์ด์•„์›ƒ์œผ๋กœ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ณด์žฅ
  3. ์—ฐ์‚ฐ-๋ฉ”๋ชจ๋ฆฌ ์ค‘์ฒฉ: ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ํ†ตํ•œ ์ค‘์ฒฉ ๊ฐ€๋Šฅ
  4. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ๋ถˆํ•„์š”ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™” ์—†์Œ
  5. ๋ ˆ์ง€์Šคํ„ฐ ์••๋ ฅ: ์ตœ์ ์˜ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์œ„ํ•œ ๋ˆ„์  ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ

์ด ๊ตฌํ˜„์€ ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋กœ๋„ ์„ฑ๋Šฅ ์ €ํ•˜ ์—†์ด ๋ณต์žกํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๊ณ ์ˆ˜์ค€์˜ ํ‘œํ˜„๋ ฅ๊ณผ ์ €์ˆ˜์ค€์˜ ์„ฑ๋Šฅ ์ œ์–ด๋ฅผ ๊ฒฐํ•ฉํ•˜๋Š” Mojo์˜ ์ฒ ํ•™์„ ์ž˜ ๋ณด์—ฌ์ฃผ๋Š” ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค.

์ˆ˜๋™ ํƒ€์ผ๋ง๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด์ 

๊ธฐ๋Šฅ์ˆ˜๋™ Tiling๊ด€์šฉ์  Tiling
๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ์žˆ๋Š” ์ง์ ‘ ์ธ๋ฑ์‹ฑTileTensor ํƒ€์ผ API
ํƒ€์ผ ๋กœ๋”ฉ์›์†Œ๋ณ„ ๋ช…์‹œ์  ๋ณต์‚ฌ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์˜ ๋ฒŒํฌ ์ „์†ก
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์ˆ˜๋™ ์ดˆ๊ธฐํ™” (๋ฐฉ์–ด์ )๋ณต์‚ฌ ํ•จ์ˆ˜๊ฐ€ ๊ด€๋ฆฌ
์ฝ”๋“œ ๋ณต์žก๋„๋ช…์‹œ์  ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋‹ค์†Œ ์žฅํ™ฉ๊ณ ์ˆ˜์ค€ API๋กœ ๋” ๊ฐ„๊ฒฐ
๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ๋”ฉ๊ณผ ์—ฐ์‚ฐ ์ค‘ ๋‹ค์ˆ˜์˜ ๊ฒ€์‚ฌ์ตœ์ข… ๊ธฐ๋ก ์‹œ ๋‹จ์ผ ๋ฐฉ์–ด์  ๊ฒ€์‚ฌ
ํ–‰๋ ฌ ๋ฐฉํ–ฅA์™€ B ๋ชจ๋‘ ๊ฐ™์€ ๋ฐฉํ–ฅ (ํ‘œ์ค€ A ร— B)A์™€ B ๋ชจ๋‘ ๊ฐ™์€ ๋ฐฉํ–ฅ (ํ‘œ์ค€ A ร— B)
์„ฑ๋Šฅ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์˜ ๋ช…์‹œ์  ์ œ์–ด๋ ˆ์ง€์Šคํ„ฐ ์šฐํšŒ๋ฅผ ํฌํ•จํ•œ ์ตœ์ ํ™”๋œ ๋ ˆ์ด์•„์›ƒ

๊ด€์šฉ์  ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋‹จ์ˆœํžˆ ๋” ๊น”๋”ํ•  ๋ฟ ์•„๋‹ˆ๋ผ, ํŠนํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ๋น„๋™๊ธฐ ์—ฐ์‚ฐ ๋•๋ถ„์— ์„ฑ๋Šฅ๋„ ๋” ์ข‹์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์ „์น˜ ๋กœ๋”ฉ์€ ์–ธ์ œ ์œ ์šฉํ• ๊นŒ?

ํ˜„์žฌ ๊ตฌํ˜„์€ ์ „์น˜ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด ์„น์…˜์€ ๋ ˆ์ด์•„์›ƒ ์‹œ์Šคํ…œ์œผ๋กœ ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•œ ๊ต์œก์  ๋‚ด์šฉ์ž…๋‹ˆ๋‹ค.

ํ˜„์žฌ ๊ตฌํ˜„ ์š”์•ฝ:

  • ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ row_major[1, TPB]() ์‚ฌ์šฉ
  • ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ ์ˆ˜ํ–‰
  • ๋ณต์‚ฌ ์ค‘ ๋ฐ์ดํ„ฐ ์ „์น˜ ์—†์Œ

์ „์น˜ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ต์œก์  ์‹œ๋‚˜๋ฆฌ์˜ค:

์ด ํผ์ฆ์€ ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ํ‘œ์ค€ ๋ณ‘ํ•ฉ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•˜์ง€๋งŒ, ๋ ˆ์ด์•„์›ƒ ์‹œ์Šคํ…œ์˜ ์œ ์—ฐ์„ฑ์€ ๋‹ค๋ฅธ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ ๊ฐ•๋ ฅํ•œ ์ตœ์ ํ™”๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค:

# ์˜ˆ์‹œ: A ร— B๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์‚ฌ์ „ ์ „์น˜๋œ ํ–‰๋ ฌ B^T๋ฅผ ๋กœ๋“œ
# (ํ˜„์žฌ ๊ตฌํ˜„์—์„œ๋Š” ์ด๋ ‡๊ฒŒ ํ•˜์ง€ ์•Š์Œ)
comptime load_b_layout = row_major[TPB, 1]()   # B^T๋ฅผ ๋ณ‘ํ•ฉ ์ ‘๊ทผ์œผ๋กœ ๋กœ๋“œ
comptime store_b_layout = row_major[1, TPB]()  # ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— B๋กœ ์ €์žฅ
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)

์ „์น˜ ๋กœ๋”ฉ์˜ ํ™œ์šฉ ์‚ฌ๋ก€ (์ด ํผ์ฆ์—์„œ๋Š” ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ):

  1. ์ด๋ฏธ ์ „์น˜๋œ ์ž…๋ ฅ ํ–‰๋ ฌ: \(B\) ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ์ „์น˜ ์ƒํƒœ๋กœ ์ €์žฅ๋˜์–ด ์žˆ๋Š” ๊ฒฝ์šฐ
  2. ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜: \(A^T \times B\), \(A \times B^T\), ๋˜๋Š” \(A^T \times B^T\) ๊ณ„์‚ฐ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ณ€ํ™˜: ํ–‰ ์šฐ์„ ๊ณผ ์—ด ์šฐ์„  ๋ ˆ์ด์•„์›ƒ ๊ฐ„ ๋ณ€ํ™˜
  4. ๋ณ„๋„ ์ „์น˜ ์—ฐ์‚ฐ ์—†์ด ๋กœ๋“œ: ํ•„์š”ํ•œ ๋ฐฉํ–ฅ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ง์ ‘ ๋กœ๋“œ

ํ•ต์‹ฌ ๊ตฌ๋ถ„:

  • ํ˜„์žฌ ๊ตฌํ˜„: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ ํ‘œ์ค€ \(A \times B\) ๊ณฑ์…ˆ์— row_major[1, TPB]() ์‚ฌ์šฉ
  • ์ „์น˜ ๋กœ๋”ฉ ์˜ˆ์‹œ: ์ด๋ฏธ ์ „์น˜๋œ ๋ฐ์ดํ„ฐ๋‚˜ ๋‹ค๋ฅธ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•  ๋•Œ ๋‹ค๋ฅธ ๋ ˆ์ด์•„์›ƒ ์‚ฌ์šฉ

์ด๊ฒƒ์€ Mojo์˜ ์ฒ ํ•™์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: ์ผ๋ฐ˜์ ์ธ ๊ฒฝ์šฐ์— ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ๋„, ํ•„์š”ํ•  ๋•Œ ์ €์ˆ˜์ค€ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.


์š”์•ฝ: ํ•ต์‹ฌ ์ •๋ฆฌ

๊ด€์šฉ์  ํƒ€์ผ๋ง ๊ตฌํ˜„์ด ์‹ค์ œ๋กœ ํ•˜๋Š” ๊ฒƒ:

  1. ํ–‰๋ ฌ ์—ฐ์‚ฐ: ํ‘œ์ค€ A ร— B ๊ณฑ์…ˆ
  2. ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ: ๋‘ ํ–‰๋ ฌ ๋ชจ๋‘ row_major[1, TPB]()๋กœ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
  3. ์—ฐ์‚ฐ ํŒจํ„ด: acc += a_shared[local_row, k] * b_shared[k, local_col]
  4. ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ: ๋กœ๋”ฉ ์‹œ ์ „์น˜ ์—†์Œ

์ด๊ฒƒ์ด ์ตœ์ ์ธ ์ด์œ :

  • ๋ณ‘ํ•ฉ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: row_major[1, TPB]()๋กœ ํšจ์œจ์ ์ธ ๋กœ๋”ฉ ๋ณด์žฅ
  • ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์ถฉ๋Œ์„ ๋ฐฉ์ง€
  • ํ‘œ์ค€ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๊ฐ€์žฅ ์ผ๋ฐ˜์ ์ธ ํ–‰๋ ฌ ๊ณฑ์…ˆ ํŒจํ„ด์„ ๊ตฌํ˜„

Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op

MAX ๊ทธ๋ž˜ํ”„๋กœ ํŒŒ์ด์ฌ ์—ฐ๋™ํ•˜๊ธฐ

GPU ํผ์ฆ ์—ฌ์ •์˜ Part IV์— ์ง„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค: MAX ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ…€ Op์œผ๋กœ ํŒŒ์ด์ฌ ์—ฐ๋™ํ•˜๊ธฐ.

์ด์ „ ํผ์ฆ๋“ค์—์„œ๋Š” Mojo๋กœ ํšจ์œจ์ ์ธ GPU ์ปค๋„์„ ์ž‘์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ์ด์ œ๋ถ€ํ„ฐ๋Š” ๋‹ค์Œ์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค:

  • ์ปค๋„์„ ํŒŒ์ด์ฌ์—์„œ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ๋Š” ์ปค์Šคํ…€ ์—ฐ์‚ฐ์œผ๋กœ ํŒจํ‚ค์ง•ํ•˜๊ธฐ
  • MAX ๊ทธ๋ž˜ํ”„ ์‹œ์Šคํ…œ๊ณผ ํ†ตํ•ฉํ•˜์—ฌ ๋จธ์‹ ๋Ÿฌ๋‹์„ ๊ฐ€์†ํ•˜๊ธฐ
  • ํ•˜์ด๋ ˆ๋ฒจ ํŒŒ์ด์ฌ API์™€ ๋กœ์šฐ๋ ˆ๋ฒจ GPU ์ฝ”๋“œ ์‚ฌ์ด์˜ ๊ฐ„๊ทน ๋ฉ”์šฐ๊ธฐ

์ด๋ฅผ ํ†ตํ•ด ์ต์ˆ™ํ•œ ํŒŒ์ด์ฌ ํ™˜๊ฒฝ์—์„œ ์ž‘์—…ํ•˜๋ฉด์„œ๋„ Mojo GPU ์ปค๋„์˜ ์„ฑ๋Šฅ์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

Puzzle 13: 1D ํ•ฉ์„ฑ๊ณฑ์—์„œ GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์„ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฒˆ์—๋Š” ์ด ์ปค๋„์„ MAX ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด ํŒŒ์ด์ฌ์—์„œ ์ง์ ‘ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ๋Š” ์ปค์Šคํ…€ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

์‚ฌ์šฉํ•  1D ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์€ ์ด๋ฏธ ๊ตฌํ˜„๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

comptime TPB = 15
comptime BLOCKS_PER_GRID = (2, 1)


def conv1d_kernel[
    input_size: Int,
    conv_size: Int,
    OutLayout: TensorLayout,
    InLayout: TensorLayout,
    ConvLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    kernel: TileTensor[mut=False, dtype, ConvLayout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # first: need to account for padding
    var shared_a = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB + conv_size - 1]())
    var shared_b = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[conv_size]())
    if global_i < input_size:
        shared_a[local_i] = input[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < conv_size - 1:
        # indices from next block
        var next_idx = global_i + TPB
        if next_idx < input_size:
            shared_a[TPB + local_i] = input[next_idx]
        else:
            shared_a[TPB + local_i] = 0

    if local_i < conv_size:
        shared_b[local_i] = kernel[local_i]

    barrier()

    if global_i < input_size:
        var local_sum: output.ElementType = 0

        comptime for j in range(conv_size):
            if local_i + j < TPB + conv_size - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum
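
์ด ์ปค๋„์ด ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์€ ์ œ๋กœ ํŒจ๋”ฉ ๋ฐฉ์‹์˜ output[i] = ฮฃ_j input[i+j] ร— kernel[j] ์ž…๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ด๋ฅผ NumPy๋กœ ์žฌํ˜„ํ•œ ์ฐธ์กฐ ๊ตฌํ˜„ ์Šค์ผ€์น˜๋กœ(`conv1d_ref`๋Š” ์„ค๋ช…์šฉ์œผ๋กœ ์ž„์˜๋กœ ์ •ํ•œ ์ด๋ฆ„), ๋’ค์— ๋‚˜์˜ค๋Š” ์‹คํ–‰ ์˜ˆ์‹œ์™€ ๊ฐ™์€ ์ž…๋ ฅ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

```python
import numpy as np

# ๊ฐ€์ •: ์ œ๋กœ ํŒจ๋”ฉ. ์ปค๋„๊ณผ ๋™์ผํ•œ output[i] = ฮฃ_j input[i+j]*kernel[j] ๊ณ„์‚ฐ.
def conv1d_ref(x, k):
    n, m = len(x), len(k)
    padded = np.concatenate([x, np.zeros(m - 1)])  # ๊ฒฝ๊ณ„ ๋ฐ–์€ 0์œผ๋กœ ์ฑ„์›€
    return np.array([padded[i:i + m] @ k for i in range(n)])

x = np.arange(15, dtype=np.float64)       # [0, 1, ..., 14]
k = np.array([0.0, 1.0, 2.0, 3.0])        # ํ•ฉ์„ฑ๊ณฑ ์ปค๋„
print(conv1d_ref(x, k))
# → [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
```

์˜ˆ๋ฅผ ๋“ค์–ด output[0] = 0ร—0 + 1ร—1 + 2ร—2 + 3ร—3 = 14์ด๊ณ , ๋งˆ์ง€๋ง‰ ์›์†Œ๋“ค์€ ํŒจ๋”ฉ๋œ 0 ๋•Œ๋ฌธ์— ์ ์  ์ž‘์•„์ง‘๋‹ˆ๋‹ค.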


์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ์ปค์Šคํ…€ op ๋“ฑ๋ก: @compiler.register ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋ฅผ ํ†ตํ•ด Mojo ํ•จ์ˆ˜๋ฅผ ํŒŒ์ด์ฌ์— ๋…ธ์ถœํ•˜๋Š” ๋ฐฉ๋ฒ• ์ดํ•ดํ•˜๊ธฐ
  2. ์ปค์Šคํ…€ op ํŒจํ‚ค์ง•: Mojo ์ฝ”๋“œ๋ฅผ MAX ๊ทธ๋ž˜ํ”„์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํŒจํ‚ค์ง•ํ•˜๋Š” ๋ฐฉ๋ฒ• ์ตํžˆ๊ธฐ
  3. ํŒŒ์ด์ฌ ํ†ตํ•ฉ: MAX ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด ํŒŒ์ด์ฌ์—์„œ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํ˜ธ์ถœํ•˜๊ธฐ
  4. ํฌ๋กœ์Šค ์–ธ์–ด ๋ฐ์ดํ„ฐ ํ๋ฆ„: ํŒŒ์ด์ฌ๊ณผ GPU ์‚ฌ์ด์˜ ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌํ•˜๊ธฐ

์ด ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ์—์„œ NumPy ๋ฐฐ์—ด์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ
  • ์ด ๋ฐ์ดํ„ฐ๋ฅผ GPU๋กœ ์ „์†กํ•˜๊ธฐ
  • ์ตœ์ ํ™”๋œ ํ•ฉ์„ฑ๊ณฑ ์ปค๋„ ์‹คํ–‰ํ•˜๊ธฐ
  • ๊ฒฐ๊ณผ๋ฅผ ํŒŒ์ด์ฌ์œผ๋กœ ๋ฐ˜ํ™˜ํ•˜๊ธฐ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ฉด ํŒŒ์ด์ฌ์˜ ํ’๋ถ€ํ•œ ์ƒํƒœ๊ณ„์™€ Mojo์˜ ๊ฐ•๋ ฅํ•œ GPU ์„ฑ๋Šฅ์„ ์ž‡๋Š” ๋งค๋„๋Ÿฌ์šด ๋‹ค๋ฆฌ๋ฅผ ๋งŒ๋“ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด conv1d.mojo์—์„œ conv1d_kernel์„ ํ˜ธ์ถœํ•˜๋Š” ํ•œ ์ค„๋งŒ ์ฑ„์šฐ๋ฉด ๋ฉ๋‹ˆ๋‹ค:

import compiler
from std.runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from std.memory import UnsafePointer
from std.gpu.host import DeviceBuffer


@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    def execute[
        # The kind of device this will be run on: "cpu" or "gpu"
        target: StaticString,
        input_size: Int,
        conv_size: Int,
        dtype: DType = DType.float32,
    ](
        output: OutputTensor[rank=1, static_spec=_],
        input: InputTensor[rank=output.rank, static_spec=_],
        kernel: InputTensor[rank=output.rank, static_spec=_],
        # the context is needed for some GPU calls
        ctx: DeviceContextPtr,
    ) raises:
        var output_tensor = output.to_layout_tensor()
        var input_tensor = input.to_layout_tensor()
        var kernel_tensor = kernel.to_layout_tensor()

        comptime if target == "gpu":
            var gpu_ctx = ctx.get_device_context()
            # making sure the output tensor is zeroed out before the kernel is called
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output_tensor.dtype](
                    gpu_ctx,
                    output_tensor.ptr,
                    input_size,
                    owning=False,
                ),
                0,
            )

            # FILL ME IN with 1 line calling our conv1d_kernel

        elif target == "cpu":
            # we can fallback to CPU
            pass
        else:
            raise Error("Unsupported target: " + target)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p17/op/conv1d.mojo

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p17
pixi run -e amd p17
pixi run -e apple p17
uv run poe p17

์„ฑ๊ณตํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]
Expected result (NumPy calculation): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Compiling 1D convolution graph...
Executing 1D convolution...
1D Convolution result (custom Mojo kernel): [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
Verification passed: Custom kernel results match NumPy calculation

์ด ์ถœ๋ ฅ์€ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์ด 1D ํ•ฉ์„ฑ๊ณฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ–ˆ์Œ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค.

์†”๋ฃจ์…˜

์ด ํผ์ฆ์„ ํ’€๋ ค๋ฉด 1D ํ•ฉ์„ฑ๊ณฑ ์ปค๋„์„ MAX ๊ทธ๋ž˜ํ”„ ์‹œ์Šคํ…œ๊ณผ ํ†ตํ•ฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ Conv1DCustomOp ๊ตฌ์กฐ์ฒด์˜ execute ๋ฉ”์„œ๋“œ์—์„œ ์ปค๋„์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ํ˜ธ์ถœํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ’€์ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

            comptime kernel = conv1d_kernel[
                input_size, conv_size, OutLayout, InLayout, ConvLayout
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                output_tensor,
                input_tensor,
                kernel_tensor,
                grid_dim=BLOCKS_PER_GRID,
                block_dim=(TPB, 1),
            )
์ด ํ•œ ์ค„์ด ์ˆ˜ํ–‰ํ•˜๋Š” ์ค‘์š”ํ•œ ์ž‘์—…๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:
  1. GPU ์ปจํ…์ŠคํŠธ(gpu_ctx์˜ ํƒ€์ž…์€ DeviceContext)์—์„œ enqueue_function์„ ํ˜ธ์ถœํ•˜์—ฌ ์ปค๋„ ์‹คํ–‰ ์˜ˆ์•ฝ
  2. ํ•„์š”ํ•œ ๋ ˆ์ด์•„์›ƒ๊ณผ ํฌ๊ธฐ ์ •๋ณด๋ฅผ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ „๋‹ฌ
  3. ์ถœ๋ ฅ, ์ž…๋ ฅ, ์ปค๋„ ํ…์„œ๋ฅผ ๋Ÿฐํƒ€์ž„ ์ธ์ž๋กœ ์ œ๊ณต
  4. ์ ์ ˆํ•œ ์ฐจ์›์œผ๋กœ ์‹คํ–‰ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ

์ „์ฒด ๋งฅ๋ฝ์—์„œ ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

ํŒŒ์ด์ฌ-Mojo ํ†ตํ•ฉ ํ๋ฆ„

  1. ํŒŒ์ด์ฌ ์ชฝ (problems/p17/p17.py):

    • ์ž…๋ ฅ๊ณผ ์ปค๋„์šฉ NumPy ๋ฐฐ์—ด ์ƒ์„ฑ
    • MAX ๊ทธ๋ž˜ํ”„๋กœ ์—ฐ์‚ฐ์„ ๊ฐ์‹ธ๋Š” conv_1d() ํ•จ์ˆ˜ ํ˜ธ์ถœ
    • NumPy ๋ฐฐ์—ด์„ Buffer.from_numpy(input).to(device)๋กœ MAX driver Buffer๋กœ ๋ณ€ํ™˜
    • custom_extensions=[mojo_kernels]๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํŒจํ‚ค์ง€ ๋กœ๋“œ
  2. ๊ทธ๋ž˜ํ”„ ๊ตฌ์ถ•:

    • TensorType์œผ๋กœ ์ž…๋ ฅ ๋ฐ ์ถœ๋ ฅ ํ…์„œ ํƒ€์ž… ์ •์˜
    • parameters={...}๋ฅผ ํ†ตํ•ด ์—ฐ์‚ฐ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ์ง€์ •
    • Graph("conv_1d_graph", ...)๋กœ ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„ ์ƒ์„ฑ
    • ops.custom(name="conv1d", ...)๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํ˜ธ์ถœ
  3. ์ปค์Šคํ…€ op ๋“ฑ๋ก:

    • @compiler.register("conv1d") ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๊ฐ€ ์—ฐ์‚ฐ์„ MAX ๊ทธ๋ž˜ํ”„์— ๋…ธ์ถœ. @compiler.register ์ฐธ๊ณ 
    • execute ๋ฉ”์„œ๋“œ์˜ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ธํ„ฐํŽ˜์ด์Šค(์ž…๋ ฅ, ์ถœ๋ ฅ, ์ปจํ…์ŠคํŠธ) ์ •์˜
    • ์ž…์ถœ๋ ฅ ํ…์„œ๊ฐ€ ์ปค๋„์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก TileTensor๋กœ ๋ณ€ํ™˜
    • Device context๊ฐ€ GPU ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น๊ณผ ์ปค๋„ ์‹คํ–‰ ๊ด€๋ฆฌ
  4. ์ปค๋„ ์‹คํ–‰:

    • model.execute(...)๊ฐ€ ํ˜ธ์ถœ๋˜๋ฉด conv1d_kernel์ด ๋ฐ์ดํ„ฐ ์ˆ˜์‹ 
    • grid_dim๊ณผ block_dim์œผ๋กœ GPU ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ์„ค์ •
    • result.to(CPU())๋กœ ๊ฒฐ๊ณผ๋ฅผ CPU๋กœ ์ „์†ก
    • NumPy ๊ฒ€์ฆ์œผ๋กœ ๊ธฐ๋Œ€ ์ถœ๋ ฅ๊ณผ ๊ฒฐ๊ณผ ๋น„๊ต

ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ ์ƒ์„ธ

  1. ์ปค์Šคํ…€ Op ๊ตฌ์กฐ์ฒด:

    @compiler.register("conv1d")
    struct Conv1DCustomOp:
        @staticmethod
        def execute[target: StaticString, input_size: Int, conv_size: Int, dtype: DType = DType.float32](
            output: OutputTensor[rank=1],
            input: InputTensor[dtype = output.dtype, rank = output.rank],
            kernel: InputTensor[dtype = output.dtype, rank = output.rank],
            ctx: DeviceContextPtr,
        ) raises:
            # ๊ตฌํ˜„
    
    • target์€ ๋””๋ฐ”์ด์Šค ํƒ€์ž…(โ€œgpuโ€ ๋˜๋Š” โ€œcpuโ€)์„ ๋‚˜ํƒ€๋ƒ„
    • input_size์™€ conv_size๋Š” ํŒŒ์ด์ฌ์—์„œ ์ „๋‹ฌ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ
    • ํ…์„œ ํƒ€์ž…์ด ์˜ฌ๋ฐ”๋ฅธ shape๊ณผ ํƒ€์ž… ๊ฒ€์‚ฌ ๋ณด์žฅ
    • ๋ฐ˜ํ™˜ ํƒ€์ž…์€ ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ ์œ„ํ•ด raises
  2. ํ…์„œ ๋ณ€ํ™˜:

    output_tensor = output.to_layout_tensor()
    input_tensor = input.to_layout_tensor()
    kernel_tensor = kernel.to_layout_tensor()
    
    • MAX ๊ทธ๋ž˜ํ”„ ํ…์„œ๋ฅผ Mojo TileTensor๋กœ ๋ณ€ํ™˜
    • ์ปค๋„์ด ํ…์„œ๋ฅผ ์ง์ ‘ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์คŒ
    • ์ปดํŒŒ์ผ ํƒ€์ž„ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ ์ถ”์ถœ
  3. Device Context ์‚ฌ์šฉ:

    gpu_ctx = ctx.get_device_context()
    gpu_ctx.enqueue_memset(...)  # ์ถœ๋ ฅ ๋ฒ„ํผ ์ดˆ๊ธฐํ™”
    gpu_ctx.enqueue_function[..., ...](...) # ์ปค๋„ ์˜ˆ์•ฝ
    
    • ๋””๋ฐ”์ด์Šค ์ปจํ…์ŠคํŠธ๊ฐ€ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ๊ด€๋ฆฌ
    • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์œผ๋กœ ์˜ฌ๋ฐ”๋ฅธ ๋ฒ„ํผ ์ƒํƒœ๋ฅผ ๋ณด์žฅ
    • ํ•จ์ˆ˜๋ฅผ ํ์— ๋“ฑ๋กํ•˜์—ฌ ์ปค๋„ ์‹คํ–‰์„ ์˜ˆ์•ฝ

์ด ํ’€์ด๋Š” ํŒŒ์ด์ฌ ๋ฐ์ดํ„ฐ๊ฐ€ MAX ๊ทธ๋ž˜ํ”„๋ฅผ ๊ฑฐ์ณ GPU์—์„œ ์‹คํ–‰๋˜๊ณ  ๋‹ค์‹œ ๋Œ์•„์˜ค๋Š” ์ „์ฒด ํ๋ฆ„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. Mojo์˜ ๊ฐ•๋ ฅํ•œ ํƒ€์ž… ์‹œ์Šคํ…œ๊ณผ ๋งค๊ฐœ๋ณ€์ˆ˜ํ™” ํ•จ์ˆ˜๋ฅผ ํ™œ์šฉํ•˜์—ฌ ํšจ์œจ์ ์ด๊ณ  ํƒ€์ž… ์•ˆ์ „ํ•œ ๊ฐ€์† ์—ฐ์‚ฐ์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค.

MAX ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ…€ op ์ดํ•ดํ•˜๊ธฐ

๋” ์ž์„ธํ•œ ๋‚ด์šฉ์€ ์•„๋ž˜ ํŠœํ† ๋ฆฌ์–ผ์„ ์ฐธ๊ณ ํ•˜์„ธ์š”:

์ปค์Šคํ…€ op ๋“ฑ๋ก

์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ๋งŒ๋“œ๋Š” ํ•ต์‹ฌ์€ @compiler.register ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ์™€ ๊ด€๋ จ ๊ตฌ์กฐ์ฒด์ž…๋‹ˆ๋‹ค:

@compiler.register("conv1d")
struct Conv1DCustomOp:
    @staticmethod
    def execute[...](
        output: OutputTensor[rank=1],
        input: InputTensor[dtype = output.dtype, rank = output.rank],
        kernel: InputTensor[dtype = output.dtype, rank = output.rank],
        ctx: DeviceContextPtr,
    ) raises:
        # ๊ตฌํ˜„

๋“ฑ๋ก์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ:

  • ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ์— ์ „๋‹ฌํ•˜๋Š” ์ด๋ฆ„("conv1d")์ด ํŒŒ์ด์ฌ ์ฝ”๋“œ์—์„œ ์ด ์—ฐ์‚ฐ์„ ํ˜ธ์ถœํ•  ๋•Œ ์‚ฌ์šฉํ•˜๋Š” ์ด๋ฆ„
  • ๊ตฌ์กฐ์ฒด์—๋Š” ์˜ฌ๋ฐ”๋ฅธ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ๊ฐ€์ง„ execute ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์–ด์•ผ ํ•จ
  • OutputTensor์™€ InputTensor ํƒ€์ž…์ด ํŒŒ์ด์ฌ ๋ฐ์ดํ„ฐ์™€์˜ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ •์˜
  • DeviceContextPtr์ด ์‹คํ–‰ ํ™˜๊ฒฝ์— ๋Œ€ํ•œ ์ ‘๊ทผ์„ ์ œ๊ณต

์ปค์Šคํ…€ op ํŒจํ‚ค์ง•

์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ํŒŒ์ด์ฌ์—์„œ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋จผ์ € ํŒจํ‚ค์ง•ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

mojo package op -o op.mojopkg

์ด ๋ช…๋ น์€:

  1. Mojo ์ฝ”๋“œ๋ฅผ ๋ฐฐํฌ ๊ฐ€๋Šฅํ•œ ํŒจํ‚ค์ง€๋กœ ์ปดํŒŒ์ผ
  2. MAX ๊ทธ๋ž˜ํ”„๊ฐ€ ์—ฐ์‚ฐ์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ํ•„์š”ํ•œ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ ์ƒ์„ฑ
  3. ํŒŒ์ด์ฌ์—์„œ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐ”์ด๋„ˆ๋ฆฌ ์•„ํ‹ฐํŒฉํŠธ(op.mojopkg)๋ฅผ ์ƒ์„ฑ

ํŒจํ‚ค์ง€๋Š” MAX ๊ทธ๋ž˜ํ”„๊ฐ€ ์ฐพ์„ ์ˆ˜ ์žˆ๋Š” ์œ„์น˜์— ๋ฐฐ์น˜ํ•ด์•ผ ํ•˜๋ฉฐ, ๋ณดํ†ต ํŒŒ์ด์ฌ ์ฝ”๋“œ์—์„œ ์ ‘๊ทผ ๊ฐ€๋Šฅํ•œ ๋””๋ ‰ํ† ๋ฆฌ์— ๋‘ก๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ํ†ตํ•ฉ

ํŒŒ์ด์ฌ ์ชฝ์—์„œ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

# Mojo ์—ฐ์‚ฐ์ด ํฌํ•จ๋œ ๋””๋ ‰ํ† ๋ฆฌ ๊ฒฝ๋กœ
mojo_kernels = Path(__file__).parent / "op"

# ์ปค์Šคํ…€ conv1d ์—ฐ์‚ฐ์œผ๋กœ ๊ทธ๋ž˜ํ”„ ๊ตฌ์„ฑ
with Graph(
    "conv_1d_graph",
    input_types=[...],
    custom_extensions=[mojo_kernels],  # ์ปค์Šคํ…€ op ํŒจํ‚ค์ง€ ๋กœ๋“œ
) as graph:
    # ๊ทธ๋ž˜ํ”„์˜ ์ž…๋ ฅ ์ •์˜
    input_value, kernel_value = graph.inputs

    # ์ด๋ฆ„์œผ๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ์‚ฌ์šฉ
    output = ops.custom(
        name="conv1d",  # @compiler.register์˜ ์ด๋ฆ„๊ณผ ์ผ์น˜ํ•ด์•ผ ํ•จ
        values=[input_value, kernel_value],
        out_types=[...],
        parameters={
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor

ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. custom_extensions๋กœ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์˜ ๊ฒฝ๋กœ ์ง€์ •
  2. ๋“ฑ๋ก๋œ ์—ฐ์‚ฐ ์ด๋ฆ„์œผ๋กœ ops.custom ํ˜ธ์ถœ
  3. ์—ฐ์‚ฐ์˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜์— ๋งž๋Š” ์ž…๋ ฅ ๊ฐ’๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌ

Puzzle 18: ์†Œํ”„ํŠธ๋งฅ์Šค Op

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜๋ฅผ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์†Œํ”„ํŠธ๋งฅ์Šค๋Š” ์‹ค์ˆ˜ ๋ฒกํ„ฐ๋ฅผ ๋ฐ›์•„ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ์ •๊ทœํ™”ํ•˜๋Š” ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค.

์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜๋Š” ๋‘ ๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ง€์ˆ˜ ํ•จ์ˆ˜ ์ ์šฉ: ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ๊ฐ ์š”์†Œ์— ์ง€์ˆ˜ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋“  ๊ฐ’์ด ์–‘์ˆ˜๊ฐ€ ๋˜๊ณ  ๊ฐ’ ์‚ฌ์ด์˜ ์ฐจ์ด๊ฐ€ ์ฆํญ๋ฉ๋‹ˆ๋‹ค. ํฐ ์ž…๋ ฅ๊ฐ’์€ ํ›จ์”ฌ ํฐ ์ง€์ˆ˜ ์ถœ๋ ฅ์„ ๋งŒ๋“ค๊ณ , ์ž‘๊ฑฐ๋‚˜ ์Œ์ˆ˜์ธ ๊ฐ’์€ 0์— ๊ฐ€๊นŒ์šด ์ถœ๋ ฅ์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค.

  2. ์ •๊ทœํ™”: ๊ฐ ์ง€์ˆ˜ ๊ฐ’์„ ๋ชจ๋“  ์ง€์ˆ˜ ๊ฐ’์˜ ํ•ฉ์œผ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค. ์ด ์ •๊ทœํ™” ๋‹จ๊ณ„๋ฅผ ํ†ตํ•ด ๊ฒฐ๊ณผ๊ฐ’์ด ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ์ฆ‰, ๋ชจ๋“  ๊ฐ’์ด 0๊ณผ 1 ์‚ฌ์ด์ด๊ณ  ํ•ฉ์ด ์ •ํ™•ํžˆ 1์ด ๋ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์ ์œผ๋กœ ์†Œํ”„ํŠธ๋งฅ์Šค ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

์—ฌ๊ธฐ์„œ:

  • \(x_i\)๋Š” ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ \(i\)๋ฒˆ์งธ ์š”์†Œ
  • \(n\)์€ ์ž…๋ ฅ ๋ฒกํ„ฐ์˜ ๊ธธ์ด

๊ทธ๋Ÿฌ๋‚˜ ์ด ์ง์ ‘์ ์ธ ๊ตฌํ˜„์€ ๊ฐ’์ด ํด ๋•Œ ์ˆ˜์น˜ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ๋ฌธ์ œ๋ฅผ ์ผ์œผํ‚ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์ˆ˜์น˜์ ์œผ๋กœ ๋” ์•ˆ์ •์ ์ธ ๋ฒ„์ „์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

GPU ๊ตฌํ˜„์—์„œ๋Š” ์ตœ๋Œ“๊ฐ’ ์ฐพ๊ธฐ์™€ ์ง€์ˆ˜ ํ•ฉ ๊ณ„์‚ฐ ๋ชจ๋‘์— ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ํฐ ๋ฒกํ„ฐ์—์„œ๋„ ๋†’์€ ํšจ์œจ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

  • ํšจ์œจ์ ์ธ ์ตœ๋Œ“๊ฐ’ ๋ฐ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ ๊ธฐ๋ฒ•์„ ํ†ตํ•œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ
  • ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  • ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์˜ ํŒŒ์ด์ฌ ํ†ตํ•ฉ
  • ๋ฐฐ๋ฆฌ์–ด๋ฅผ ํ†ตํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: BLOCK_DIM_X = 1 << log2_ceil(SIZE). ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜์ด ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๋ ค๋ฉด BLOCK_DIM_X๊ฐ€ SIZE ์ด์ƒ์ธ ๊ฐ€์žฅ ์ž‘์€ 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: \(1 \times 1\) ๋ธ”๋ก
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„๋ฅผ ์œ„ํ•œ ๋‘ ๊ฐœ์˜ ๊ณต์œ  ๋ณ€์ˆ˜
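BLOCK_DIM_X ๊ณ„์‚ฐ์‹ `1 << log2_ceil(SIZE)`์˜ ์˜๋ฏธ๋Š” ํŒŒ์ด์ฌ ๋Œ€์‘์‹์œผ๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ�˜ ์žˆ์Šต๋‹ˆ๋‹ค. Mojo `log2_ceil`์˜ ์ •ํ™•ํ•œ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋Š” ํ‘œ์ค€ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ฌธ์„œ๋ฅผ ๋”ฐ๋ฅด๋ฉฐ, ์—ฌ๊ธฐ์„œ๋Š” ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๊ฐ€์ •ํ•œ ํŒŒ์ด์ฌ ํ•จ์ˆ˜๋งŒ ๋ณด์ž…๋‹ˆ๋‹ค:

```python
def next_pow2(n: int) -> int:
    # Mojo์˜ 1 << log2_ceil(SIZE)์™€ ๊ฐ™์€ ๊ฒฐ๊ณผ:
    # n ์ด์ƒ์ธ ๊ฐ€์žฅ ์ž‘์€ 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ์„ ๋ฐ˜ํ™˜
    return 1 << (n - 1).bit_length()

print(next_pow2(128))  # 128 (์ด๋ฏธ 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ์ด๋ฉด ๊ทธ๋Œ€๋กœ)
print(next_pow2(100))  # 128 (์˜ฌ๋ฆผ)
```

SIZE = 128์€ ์ด๋ฏธ 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ์ด๋ฏ€๋กœ BLOCK_DIM_X๋„ 128์ด ๋ฉ๋‹ˆ๋‹ค.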

๋ ˆ์ด์•„์›ƒ ์„ค์ •:

  • ์ž…๋ ฅ ํ…์„œ: row_major[SIZE]()
  • ์ถœ๋ ฅ ํ…์„œ: row_major[SIZE]()
  • ์ปค์Šคํ…€ op ํŒŒ๋ผ๋ฏธํ„ฐ: {"input_size": input_tensor.shape[0]}

์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ: ์ž ์žฌ์ ์ธ ์ˆ˜์น˜ ๋ฌธ์ œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ• ์ดํ•ดํ•˜๊ธฐ
  2. ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ํšจ์œจ์ ์ธ ์ตœ๋Œ“๊ฐ’ ๋ฐ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ
  3. ์ปค์Šคํ…€ op ํ†ตํ•ฉ: Mojo GPU ์ปค๋„์„ ์œ„ํ•œ ํŒŒ์ด์ฌ ์ธํ„ฐํŽ˜์ด์Šค ์™„์„ฑํ•˜๊ธฐ
  4. ํ…Œ์ŠคํŠธ์™€ ๊ฒ€์ฆ: ๊ตฌํ˜„์ด ๊ธฐ๋Œ€ ๊ฒฐ๊ณผ์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ

์†Œํ”„ํŠธ๋งฅ์Šค ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ์—์„œ NumPy ๋ฐฐ์—ด์„ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ
  • GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๊ธฐ
  • ์ •๊ทœํ™”๋œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๊ธฐ
  • SciPy์˜ ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„ ๊ฒฐ๊ณผ์™€ ์ผ์น˜์‹œํ‚ค๊ธฐ

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด Mojo ํŒŒ์ผ์—์„œ GPU์™€ CPU ์ปค๋„์„ ๋ชจ๋‘ ๊ตฌํ˜„ํ•˜๊ณ , ํŒŒ์ด์ฌ ์ฝ”๋“œ์—์„œ ๊ทธ๋ž˜ํ”„ ์ •์˜๋ฅผ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

1. softmax.mojo์—์„œ GPU ์ปค๋„ ๊ตฌํ˜„ํ•˜๊ธฐ

from std.gpu import thread_idx, block_idx, block_dim, barrier
from std.gpu.host import DeviceContext, HostBuffer, DeviceBuffer
from std.gpu.memory import AddressSpace
from layout import TileTensor
from layout.tile_layout import row_major
from layout.tile_tensor import stack_allocation
from std.math import exp
from std.bit import log2_ceil
from std.utils.numerics import max_finite, min_finite


comptime SIZE = 128  # This must be equal to INPUT_SIZE in p18.py
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)
comptime GRID_DIM_X = 1
# Tree-based reduction requires the number of threads to be the next power of two >= SIZE for correctness.
comptime BLOCK_DIM_X = 1 << log2_ceil(SIZE)


def softmax_gpu_kernel[
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    # FILL IN (roughly 31 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p18/op/softmax.mojo

ํŒ
  1. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋„๋ก ์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„ ๋ชจ๋‘์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”
  2. ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ ์ ˆํ•œ ์ง€์ ์—์„œ barrier()๋ฅผ ํ˜ธ์ถœํ•˜๋Š” ๊ฒƒ์„ ์žŠ์ง€ ๋งˆ์„ธ์š”
  3. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž…๋ ฅ ๋ฐฐ์—ด์˜ ์ผ๋ถ€๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•˜์„ธ์š”
  4. ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์„ธ์š”
  5. ํŠนํžˆ ํฐ ์ž…๋ ฅ์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ ‘๊ทผ์„ ์ฃผ์˜ ๊นŠ๊ฒŒ ์ฒ˜๋ฆฌํ•˜์„ธ์š”
  6. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์„ ์œ„ํ•ด \(e^{x_i}\) ๋Œ€์‹  \(e^{x_i - max}\)๋ฅผ ๊ณ„์‚ฐํ•˜์„ธ์š”

2. softmax.mojo์—์„œ CPU ์ปค๋„ ๊ตฌํ˜„ํ•˜๊ธฐ

def softmax_cpu_kernel[
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    # FILL IN (roughly 10 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p18/op/softmax.mojo

ํŒ
  1. GPU ๋ฒ„์ „๊ณผ ๋™์ผํ•œ ์ˆ˜ํ•™์  ๋‹จ๊ณ„๋ฅผ ๋”ฐ๋ฅด๋Š” ์ˆœ์ฐจ์  ๊ตฌํ˜„์„ ์ž‘์„ฑํ•˜์„ธ์š”
  2. ๋จผ์ € ๋ชจ๋“  ์ž…๋ ฅ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์œผ์„ธ์š”
  3. ๊ทธ๋‹ค์Œ ๊ฐ ์š”์†Œ์— ๋Œ€ํ•ด \(e^{x_i - max}\)๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ํ•ฉ๊ณ„๋ฅผ ๋ˆ„์ ํ•˜์„ธ์š”
  4. ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ฐ ์š”์†Œ๋ฅผ ํ•ฉ๊ณ„๋กœ ๋‚˜๋ˆ  ์ •๊ทœํ™”ํ•˜์„ธ์š”
  5. CPU ๊ตฌํ˜„์—๋Š” ๋ณ‘๋ ฌ ์Šค๋ ˆ๋“œ๊ฐ€ ์—†์œผ๋ฏ€๋กœ ์Šค์นผ๋ผ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์„ธ์š”

CPU์™€ GPU ์ปค๋„ ํ…Œ์ŠคํŠธ

uv run poe p18-test-kernels
pixi run p18-test-kernels

์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

Total Discovered Tests: 1

Passed : 1 (100.00%)
Failed : 0 (0.00%)
Skipped: 0 (0.00%)

3. p18.py์—์„œ ๊ทธ๋ž˜ํ”„ ์ •์˜ ์™„์„ฑํ•˜๊ธฐ

from pathlib import Path

import numpy as np
from max.driver import CPU, Accelerator, Buffer, Device
from max.dtype import DType
from max.engine import InferenceSession
from max.graph import DeviceRef, Graph, TensorType
from numpy.typing import NDArray
from scipy.special import softmax as scipy_softmax


def softmax(
    input: NDArray[np.float32],
    session: InferenceSession,
    device: Device,
) -> Buffer:
    dtype = DType.float32
    input_tensor = Buffer.from_numpy(input).to(device)
    mojo_kernels = Path(__file__).parent / "op"

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        # FILL IN (roughly 4 unformatted lines)
        pass

์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p18/p18.py

ํŒ
  1. graph.inputs[0]์œผ๋กœ ๊ทธ๋ž˜ํ”„์— ์ „๋‹ฌ๋œ ์ž…๋ ฅ ํ…์„œ์— ์ ‘๊ทผํ•˜์„ธ์š”
  2. ๋“ฑ๋กํ•œ ์ปค์Šคํ…€ op ์ด๋ฆ„(โ€œsoftmaxโ€)์œผ๋กœ ops.custom()์„ ํ˜ธ์ถœํ•˜์„ธ์š”
  3. ์ž…๋ ฅ ํ…์„œ๋ฅผ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์˜ ๊ฐ’์œผ๋กœ ์ „๋‹ฌํ•˜์„ธ์š”
  4. ์ž…๋ ฅ shape๊ณผ ์ผ์น˜ํ•˜๋Š” ์ถœ๋ ฅ ํƒ€์ž…์„ ์ง€์ •ํ•˜์„ธ์š”
  5. ์ปค๋„์— ํ•„์š”ํ•œ โ€œinput_sizeโ€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํฌํ•จํ•˜์„ธ์š”
  6. graph.outputs๋ฅผ ์—ฐ์‚ฐ์˜ ์ถœ๋ ฅ ํ…์„œ๊ฐ€ ๋‹ด๊ธด ๋ฆฌ์ŠคํŠธ๋กœ ์„ค์ •ํ•˜์„ธ์š”

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p18
pixi run -e amd p18
pixi run -e apple p18
uv run poe p18

์„ฑ๊ณตํ•˜๋ฉด CPU์™€ GPU์—์„œ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Input shape: (128,)
First few random input values: [ 1.1810775   0.60472375  0.5718309   0.6644599  -0.08899796]
Compiling softmax graph on Device(type=cpu,id=0)
Executing softmax on Device(type=cpu,id=0)
====================================================================================================
Compiling softmax graph on Device(type=gpu,id=0)
Executing softmax on Device(type=gpu,id=0)
====================================================================================================
First few softmax results on CPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few softmax results on GPU (custom Mojo kernel): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
First few expected results (SciPy calculation): [0.01718348 0.00965615 0.0093437  0.01025055 0.0048253 ]
Verification passed: Custom kernel results match SciPy calculation
Sum of all probabilities on CPU: 1.0
Sum of all probabilities on GPU: 1.0

์ด ์ถœ๋ ฅ์€ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์ด ์†Œํ”„ํŠธ๋งฅ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜์—ฌ ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ƒ์„ฑํ–ˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์†”๋ฃจ์…˜

์ด ํผ์ฆ์„ ํ’€๋ ค๋ฉด Mojo ์ปค๋„(GPU์™€ CPU)๊ณผ ํŒŒ์ด์ฌ ๊ทธ๋ž˜ํ”„ ์ •์˜๋ฅผ ๋ชจ๋‘ ๊ตฌํ˜„ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op์—์„œ ํ–ˆ๋˜ ๊ฒƒ์ฒ˜๋Ÿผ, ํŒŒ์ด์ฌ์˜ ์ƒํƒœ๊ณ„์™€ Mojo์˜ GPU ๊ฐ€์† ์ปดํ“จํŒ… ์—ญ๋Ÿ‰์„ ์ž‡๋Š” ๋‹ค๋ฆฌ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๊ตฌํ˜„ํ•  ์†Œํ”„ํŠธ๋งฅ์Šค ์—ฐ์‚ฐ์€ ์ˆ˜ํ•™์ ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}$$

ํ•˜์ง€๋งŒ ์ˆ˜์น˜ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋” ์•ˆ์ •์ ์ธ ํ˜•ํƒœ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

$$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

GPU ์ปค๋„ ๊ตฌํ˜„

def softmax_gpu_kernel[
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    var shared_max = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[BLOCK_DIM_X]())
    var shared_sum = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[BLOCK_DIM_X]())
    var global_i = thread_idx.x

    # Initialize out-of-bounds (shared_max[global_i], global_i >= input_size) shared memory addresses to the minimum
    # finite value for dtype, ensuring that if these elements are accessed in the parallel max reduction below they
    # do not influence the result (max(min_finite, x) == x for any x).
    var val: Scalar[dtype] = min_finite[dtype]()
    if global_i < input_size:
        val = rebind[Scalar[dtype]](input[global_i])
    shared_max[global_i] = val

    barrier()

    # Parallel reduction to find max similar to reduction we saw before
    var stride = BLOCK_DIM_X // 2
    while stride > 0:
        if global_i < stride:
            shared_max[global_i] = max(
                shared_max[global_i], shared_max[global_i + stride]
            )
        barrier()
        stride = stride // 2

    var block_max = shared_max[0]

    # Initialize out-of-bounds (shared_max[global_i], global_i >= input_size) shared memory addresses to 0.0,
    # ensuring that if these elements are accessed in the parallel sum reduction below they
    # do not influence the result (adding 0.0 does not change the sum).
    var exp_val: Scalar[dtype] = 0.0
    if global_i < input_size:
        exp_val = rebind[Scalar[dtype]](exp(val - block_max))
    shared_sum[global_i] = exp_val
    barrier()

    # Parallel reduction for sum similar to reduction we saw before
    stride = BLOCK_DIM_X // 2
    while stride > 0:
        if global_i < stride:
            shared_sum[global_i] += shared_sum[global_i + stride]
        barrier()
        stride = stride // 2

    var block_sum = shared_sum[0]

    # Normalize by sum
    if global_i < input_size:
        output[global_i] = exp_val / block_sum


GPU ์ปค๋„์€ ๊ณ ๋„๋กœ ์ตœ์ ํ™”๋œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์†Œํ”„ํŠธ๋งฅ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์ปค๋„์„ ์ƒ์„ธํžˆ ๋ถ„์„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

์ปค๋„ ์‹œ๊ทธ๋‹ˆ์ฒ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ

def softmax_gpu_kernel[
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
)

์ปค๋„์˜ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ตฌ์„ฑ:

  • ์ž…์ถœ๋ ฅ ํ…์„œ์— ๊ณตํ†ต์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๋ ˆ์ด์•„์›ƒ ํŒŒ๋ผ๋ฏธํ„ฐ
  • ์ •์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ง€์ •๋˜๋Š” ๋ฒกํ„ฐ ํฌ๊ธฐ
  • ๊ธฐ๋ณธ๊ฐ’์ด float32์ธ ์„ค์ • ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ ํƒ€์ž…
  • ์—ฐ์‚ฐ ๊ฒฐ๊ณผ๋ฅผ ์ง์ ‘ ์ €์žฅํ•˜๋Š” ๋ณ€๊ฒฝ ๊ฐ€๋Šฅํ•œ(mutable) ์ถœ๋ ฅ ํ…์„œ
  • ๋ณ€๊ฒฝ ๋ถˆ๊ฐ€๋Šฅํ•œ(mut=False) ์ž…๋ ฅ ํ…์„œ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น

shared_max = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]())
shared_sum = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[BLOCK_DIM_X]())

์ปค๋„์€ ๋‘ ๊ฐœ์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„ํผ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค:

  • shared_max: ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ํƒ์ƒ‰ ๋ฆฌ๋•์…˜์šฉ
  • shared_sum: ๋ณ‘๋ ฌ ํ•ฉ๊ณ„ ์—ฐ์‚ฐ์šฉ
  • ๋‘˜ ๋‹ค BLOCK_DIM_X = 128 ํฌ๊ธฐ๋ฅผ ์‚ฌ์šฉ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๋น ๋ฅธ ์ ‘๊ทผ์„ ์ œ๊ณต

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ

global_i = thread_idx.x

์ด ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„์€ ๋‹จ์ผ 1D ์Šค๋ ˆ๋“œ ๋ธ”๋ก์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ „์—ญ ์ธ๋ฑ์Šค์™€ ๋กœ์ปฌ ์ธ๋ฑ์Šค๊ฐ€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

์ตœ๋Œ“๊ฐ’ ํƒ์ƒ‰ ๋‹จ๊ณ„

var val: Scalar[dtype] = min_finite[dtype]()
if global_i < input_size:
    val = rebind[Scalar[dtype]](input[global_i])

shared_max[global_i] = val
barrier()

๊ฐ ์Šค๋ ˆ๋“œ๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ดˆ๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค:

  • ์œ ํšจ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์š”์†Œ์—๋Š” ์ตœ์†Œ ์œ ํ•œ(finite) ๊ฐ’ ํ• ๋‹น
  • ์œ ํšจํ•œ ์š”์†Œ์— ๋งคํ•‘๋˜๋Š” ์Šค๋ ˆ๋“œ์—๋Š” ์‹ค์ œ ์ž…๋ ฅ๊ฐ’ ํ• ๋‹น
  • ๋ฆฌ๋•์…˜ ๊ณผ์ •์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ
  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๋„๋ก ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”

๋ณ‘๋ ฌ max ๋ฆฌ๋•์…˜

stride = BLOCK_DIM_X // 2
while stride > 0:
    if global_i < stride:
        shared_max[global_i] = max(shared_max[global_i], shared_max[global_i + stride])
    barrier()
    stride = stride // 2

๋ณ‘๋ ฌ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  1. stride = 64(BLOCK_DIM_X์˜ ์ ˆ๋ฐ˜)๋กœ ์‹œ์ž‘
  2. ๊ฐ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ stride๋งŒํผ ๋–จ์–ด์ง„ ๋‘ ๊ฐ’ ๋น„๊ต
  3. ๋” ์ž‘์€ ์ธ๋ฑ์Šค์— ์ตœ๋Œ“๊ฐ’ ์ €์žฅ
  4. ๋ฐฐ๋ฆฌ์–ด๋กœ ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  5. Stride๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ด๊ณ  ๋ฐ˜๋ณต
  6. \(\log_2(BLOCK\_DIM\_X)~\) ๋‹จ๊ณ„ ํ›„ shared_max[0]์— ์ „์ฒด ์ตœ๋Œ“๊ฐ’์ด ๋‹ด๊น€

์ด ๋กœ๊ทธ ๋ฆฌ๋•์…˜์€ ๋Œ€๊ทœ๋ชจ ์ž…๋ ฅ์—์„œ ์„ ํ˜• ์Šค์บ”๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฆ…๋‹ˆ๋‹ค.

์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์ง€์ˆ˜ ํ•จ์ˆ˜ ์ ์šฉ

block_max = shared_max[0]

var exp_val: Scalar[dtype] = 0.0
if global_i < input_size:
    exp_val = rebind[Scalar[dtype]](exp(val - block_max))

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ์ž‘์—…:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ „์ฒด ์ตœ๋Œ“๊ฐ’ ์ฝ์Œ
  2. ์ง€์ˆ˜ ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜๊ธฐ ์ „์— ์ž…๋ ฅ๊ฐ’์—์„œ ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ
  3. ์ด ์ฐจ๊ฐ์ด ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์˜ ํ•ต์‹ฌ โ€” ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ๋ฐฉ์ง€
  4. ๊ฐ€์žฅ ํฐ ์ง€์ˆ˜๊ฐ€ \(e^0 = 1\)์ด ๋˜๊ณ , ๋‚˜๋จธ์ง€๋Š” ๋ชจ๋‘ \(e^{์Œ์ˆ˜} < 1\)

๋ณ‘๋ ฌ sum ๋ฆฌ๋•์…˜

shared_sum[global_i] = exp_val
barrier()

stride = BLOCK_DIM_X // 2
while stride > 0:
    if global_i < stride:
        shared_sum[global_i] += shared_sum[global_i + stride]
    barrier()
    stride = stride // 2

๋‘ ๋ฒˆ์งธ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„:

  1. ๋ชจ๋“  ์ง€์ˆ˜ ๊ฐ’์„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅ
  2. max์™€ ๋™์ผํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜ ํŒจํ„ด ์‚ฌ์šฉ
  3. ๋‹จ, ์ตœ๋Œ“๊ฐ’ ๋น„๊ต ๋Œ€์‹  ๋ง์…ˆ ์ˆ˜ํ–‰
  4. \(\log_2(BLOCK\_DIM\_X)~\) ๋‹จ๊ณ„ ํ›„ shared_sum[0]์— ๋ชจ๋“  ์ง€์ˆ˜ ๊ฐ’์˜ ์ดํ•ฉ์ด ๋‹ด๊น€

์ตœ์ข… ์ •๊ทœํ™”

block_sum = shared_sum[0]

if global_i < input_size:
    output[global_i] = exp_val / block_sum

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ์ž‘์—…:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ดํ•ฉ์„ ์ฝ์Œ
  2. ์ž์‹ ์˜ ์ง€์ˆ˜ ๊ฐ’์„ ์ด ์ดํ•ฉ์œผ๋กœ ๋‚˜๋ˆ”
  3. ์ •๊ทœํ™”๋œ ํ™•๋ฅ ์„ ์ถœ๋ ฅ ๋ฒ„ํผ์— ๊ธฐ๋ก
  4. ํ•ฉ์ด 1์ธ ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ ์ƒ์„ฑ

์„ฑ๋Šฅ ํŠน์„ฑ

์ด ๊ตฌํ˜„์€ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ๊ฐ–์Šต๋‹ˆ๋‹ค:

  • ๋ณต์žก๋„: ์ˆœ์ฐจ์  ์ ‘๊ทผ์˜ \(O(n)\)์— ๋น„ํ•ด max์™€ sum ๊ณ„์‚ฐ ๋ชจ๋‘ \(O(\log n)\)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ \(2 \times BLOCK\_DIM\_X~\) ์š”์†Œ๋งŒ ์‚ฌ์šฉ
  • ์ž‘์—… ํšจ์œจ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์•ฝ \(2 \times \log_2(BLOCK\_DIM\_X)~\) ํšŒ ์—ฐ์‚ฐ ์ˆ˜ํ–‰
  • ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์–‘์˜ ์ž‘์—… ์ฒ˜๋ฆฌ
  • ๋™๊ธฐํ™”: ํ•„์š”ํ•œ ๊ณณ์—์„œ๋งŒ ์ตœ์†Œํ•œ์˜ ๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ์ตœ์  ๋Œ€์—ญํญ์„ ์œ„ํ•œ ๋ณ‘ํ•ฉ๋œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ˆ˜์น˜์ ์œผ๋กœ๋„ ๊ฒฌ๊ณ ํ•ฉ๋‹ˆ๋‹ค. ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜์—ฌ ์‹ ๊ฒฝ๋ง ํ™œ์„ฑํ™”์—์„œ ํ”ํ•œ ๋„“์€ ๋ฒ”์œ„์˜ ๊ฐ’์—์„œ๋„ ์ •๋ฐ€๋„๋ฅผ ์œ ์ง€ํ•˜๋ฉฐ, ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ/์–ธ๋”ํ”Œ๋กœ์šฐ ๊ฐ€๋Šฅ์„ฑ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

CPU ํด๋ฐฑ ๊ตฌํ˜„

def softmax_cpu_kernel[
    input_size: Int,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    comptime assert (
        dtype.is_floating_point()
    ), "dtype must be a floating-point type"
    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        output[i] = exp_val
        sum_exp += exp_val

    for i in range(input_size):
        output[i] = output[i] / sum_exp


CPU ๊ตฌํ˜„์€ ๊ฐ™์€ ์ˆ˜ํ•™์  ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋”ฐ๋ฅด๋˜ ๋‹จ์ผ ์Šค๋ ˆ๋“œ ์‹คํ–‰์— ์ตœ์ ํ™”๋œ ์ˆœ์ฐจ์  ํด๋ฐฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„๋ฅผ ๋ถ„์„ํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:
  1. ์ตœ๋Œ“๊ฐ’ ํƒ์ƒ‰:

    var max_val: Scalar[dtype] = min_finite[dtype]()
    for i in range(input_size):
        max_val = max(max_val, rebind[Scalar[dtype]](input[i]))
    

    ์ตœ์†Œ ์œ ํ•œ๊ฐ’์œผ๋กœ ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ๋ฐฐ์—ด์„ ์„ ํ˜• ์Šค์บ”ํ•˜๋ฉฐ ๋งŒ๋‚œ ์ตœ๋Œ“๊ฐ’์„ ์ถ”์ ํ•ฉ๋‹ˆ๋‹ค. \(O(n)\) ๋ณต์žก๋„์ด์ง€๋งŒ, ๋ณ‘๋ ฌํ™”ํ•  ์ฝ”์–ด๊ฐ€ ๋งŽ์ง€ ์•Š์€ CPU์—์„œ๋Š” ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

  2. ์ง€์ˆ˜ ํ•จ์ˆ˜ ์ ์šฉ๊ณผ ํ•ฉ์‚ฐ:

    var sum_exp: Scalar[dtype] = 0.0
    for i in range(input_size):
        var exp_val = rebind[Scalar[dtype]](exp(input[i] - max_val))
        output[i] = exp_val
        sum_exp += exp_val
    

    ๊ฐ ์š”์†Œ์— ๋Œ€ํ•ด \(e^{x_i - max}\)๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๋ฒ„ํผ์— ์ €์žฅํ•˜๋ฉด์„œ ํ•ฉ๊ณ„ \(\sum_{j=1}^{n} e^{x_j - max}\)๋ฅผ ํ•œ ๋ฒˆ์˜ ์ˆœํšŒ๋กœ ๋ˆ„์ ํ•ฉ๋‹ˆ๋‹ค. ๋ณ„๋„์˜ ๋ฐ˜๋ณต๋ฌธ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์— ๋น„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค.

  3. ์ •๊ทœํ™”:

    for i in range(input_size):
        output[i] = output[i] / sum_exp
    

    ๋งˆ์ง€๋ง‰์œผ๋กœ ๊ฐ ์š”์†Œ๋ฅผ ํ•ฉ๊ณ„๋กœ ๋‚˜๋ˆ  ์†Œํ”„ํŠธ๋งฅ์Šค ๊ณต์‹์— ๋”ฐ๋ฅธ ์˜ฌ๋ฐ”๋ฅธ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

    $$\Large \text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{n} e^{x_j - \max(x)}}$$

CPU ๊ตฌํ˜„์€ ๋™์ผํ•œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ๊ธฐ๋ฒ•(์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ)์„ ์‚ฌ์šฉํ•˜๋˜, ๋ณ‘๋ ฌ์ด ์•„๋‹Œ ์ˆœ์ฐจ์  ์—ฐ์‚ฐ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๋ฅผ ๋‹ค๋ฃฐ ํ•„์š”๊ฐ€ ์—†์–ด GPU ๋ฒ„์ „๋ณด๋‹ค ๋‹จ์ˆœํ•˜์ง€๋งŒ, ๋Œ€๊ทœ๋ชจ ์ž…๋ ฅ์—์„œ๋Š” ํšจ์œจ์ด ๋–จ์–ด์ง‘๋‹ˆ๋‹ค.

๋‘ ๊ตฌํ˜„ ๋ชจ๋‘ @compiler.register("softmax") ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋ฅผ ํ†ตํ•ด MAX ๊ทธ๋ž˜ํ”„์˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ์‹œ์Šคํ…œ์— ๋“ฑ๋ก๋˜๋ฏ€๋กœ, ๊ฐ€์šฉ ์—ฌ๋ถ€์— ๋”ฐ๋ผ ์–ด๋А ๋””๋ฐ”์ด์Šค์—์„œ๋“  ๋งค๋„๋Ÿฝ๊ฒŒ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค.

ํŒŒ์ด์ฌ ํ†ตํ•ฉ

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
        input_value = graph.inputs[0]

        # The output shape is the same as the input for softmax
        # Note: the name must match the name used in `@compiler.register("softmax")` in op/softmax.mojo
        output = ops.custom(
            name="softmax",
            values=[input_value],
            device=DeviceRef.from_device(device),
            out_types=[
                TensorType(
                    dtype=input_value.tensor.dtype,
                    shape=input_value.tensor.shape,
                    device=DeviceRef.from_device(device),
                )
            ],
            parameters={
                "target": "gpu" if device == Accelerator() else "cpu",
                "input_size": input_tensor.shape[0],
                "dtype": dtype,
            },
        )[0].tensor
        graph.output(output)

ํŒŒ์ด์ฌ ํ†ตํ•ฉ์€ NumPy ๋ฐฐ์—ด๊ณผ ์ตœ์ ํ™”๋œ Mojo GPU ์ปค๋„ ์‚ฌ์ด์— ๋งค๋„๋Ÿฌ์šด ๋‹ค๋ฆฌ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค. ๊ตฌํ˜„์€ ์—ฌ๋Ÿฌ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ์ด๋ค„์ ธ ์žˆ์Šต๋‹ˆ๋‹ค:
  1. ๊ทธ๋ž˜ํ”„ ์„ค์ •๊ณผ ๊ตฌ์„ฑ:

    with Graph(
        "softmax_graph",
        input_types=[
            TensorType(
                dtype,
                shape=input_tensor.shape,
                device=DeviceRef.from_device(device),
            ),
        ],
        custom_extensions=[mojo_kernels],
    ) as graph:
    

    โ€œsoftmax_graphโ€œ๋ผ๋Š” ์ด๋ฆ„์˜ ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

    • ์ ์ ˆํ•œ dtype๊ณผ shape์œผ๋กœ ์ž…๋ ฅ ํ…์„œ ํƒ€์ž… ์ •์˜
    • ํ…์„œ๋ฅผ ๋Œ€์ƒ ๋””๋ฐ”์ด์Šค(CPU ๋˜๋Š” GPU)์— ๋งคํ•‘
    • ์ง€์ •๋œ ๋””๋ ‰ํ† ๋ฆฌ์—์„œ ์ปค์Šคํ…€ Mojo ์—ฐ์‚ฐ ๋กœ๋“œ
    • custom_extensions ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ Mojo ๊ตฌํ˜„๊ณผ์˜ ์—ฐ๊ฒฐ ํ•ต์‹ฌ
  2. ์ปค์Šคํ…€ ์—ฐ์‚ฐ ๊ตฌ์„ฑ:

    output = ops.custom(
        name="softmax",
        values=[input_value],
        device=DeviceRef.from_device(device),
        out_types=[
            TensorType(
                dtype=input_value.tensor.dtype,
                shape=input_value.tensor.shape,
                device=DeviceRef.from_device(device),
            )
        ],
        parameters={
            "target": "gpu" if device == Accelerator() else "cpu",
            "input_size": input_tensor.shape[0],
            "dtype": dtype,
        },
    )[0].tensor
    

    ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค:

    • Mojo ์ฝ”๋“œ์˜ @compiler.register("softmax")์™€ ์ผ์น˜ํ•˜๋Š” ์ด๋ฆ„
    • ๋ฆฌ์ŠคํŠธ๋กœ ์ „๋‹ฌ๋˜๋Š” ์ž…๋ ฅ ๊ฐ’
    • ์ž…๋ ฅ shape๊ณผ ํƒ€์ž…์— ๋งž๋Š” ์ถœ๋ ฅ ํƒ€์ž… ์ •์˜
    • ๋Œ€์ƒ ๋””๋ฐ”์ด์Šค, ๋ฒกํ„ฐ ํฌ๊ธฐ, ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ํฌํ•จํ•œ ์ปค๋„ ํ•„์ˆ˜ ํŒŒ๋ผ๋ฏธํ„ฐ
    • [0].tensor๋กœ ์ฒซ ๋ฒˆ์งธ ๋ฐ˜ํ™˜ ์š”์†Œ์—์„œ ํ…์„œ ์ถ”์ถœ
  3. ๊ทธ๋ž˜ํ”„ ์ถœ๋ ฅ ์ •์˜:

    graph.output(output)
    

    ์—ฐ์‚ฐ์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ทธ๋ž˜ํ”„์˜ ์ถœ๋ ฅ์œผ๋กœ ๋“ฑ๋กํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”์ธ ์Šคํฌ๋ฆฝํŠธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ผผ๊ผผํ•œ ๊ฒ€์ฆ์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค:

  • ๋žœ๋ค ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ: np.random.randn(INPUT_SIZE).astype(np.float32)
  • SciPy๋กœ ๊ธฐ๋Œ€ ๊ฒฐ๊ณผ ๊ณ„์‚ฐ: scipy_softmax(input_array)
  • ์ˆ˜์น˜ ์ •ํ™•๋„ ๊ฒ€์ฆ: np.testing.assert_allclose(..., rtol=1e-5)
  • ์ถœ๋ ฅ์ด ์œ ํšจํ•œ ํ™•๋ฅ  ๋ถ„ํฌ์ธ์ง€ ํ™•์ธ: np.sum(result.to_numpy())

์ด ๊ตฌํ˜„์€ ๊ณ ์„ฑ๋Šฅ Mojo ์ปค๋„๊ณผ ํŒŒ์ด์ฌ์˜ ๊ณผํ•™ ์ปดํ“จํŒ… ์ƒํƒœ๊ณ„๋ฅผ ํ†ตํ•ฉํ•˜๋Š” MAX ๊ทธ๋ž˜ํ”„์˜ ๊ฐ•๋ ฅํ•œ ์—ญ๋Ÿ‰์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ํšจ์œจ์„ฑ๊ณผ ์‚ฌ์šฉ ํŽธ์˜์„ฑ์„ ๋™์‹œ์— ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

Puzzle 19: ์–ดํ…์…˜ Op

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์–ดํ…์…˜์€ ํŠธ๋žœ์Šคํฌ๋จธ์™€ ํ•จ๊ป˜ ๋„๋ฆฌ ์•Œ๋ ค์ง„ ํ˜„๋Œ€ ์‹ ๊ฒฝ๋ง์˜ ํ•ต์‹ฌ ์š”์†Œ๋กœ, ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•  ๋•Œ ์ž…๋ ฅ์—์„œ ๊ด€๋ จ๋œ ๋ถ€๋ถ„์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค.

์ˆ˜ํ•™์ ์œผ๋กœ ์–ดํ…์…˜ ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋ฉ๋‹ˆ๋‹ค:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

์—ฌ๊ธฐ์„œ:

  • \(Q\)๋Š” \((d,)~\) ํ˜•ํƒœ์˜ ์ฟผ๋ฆฌ ๋ฒกํ„ฐ - ์ฐพ์œผ๋ ค๋Š” ๋Œ€์ƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
  • \(K\)๋Š” \((\text{seq_len}, d)~\) ํ˜•ํƒœ์˜ ํ‚ค ํ–‰๋ ฌ - ๋งค์นญํ•  ์ˆ˜ ์žˆ๋Š” ๋Œ€์ƒ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
  • \(V\)๋Š” \((\text{seq_len}, d)~\) ํ˜•ํƒœ์˜ ๊ฐ’ ํ–‰๋ ฌ - ๊ฒ€์ƒ‰ํ•  ์ •๋ณด๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
  • ์ถœ๋ ฅ์€ \((d,)\) ํ˜•ํƒœ์˜ ๊ฐ€์ค‘ํ•ฉ ๋ฒกํ„ฐ์ž…๋‹ˆ๋‹ค

์—ฐ์‚ฐ์€ ์„ธ ๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„๋กœ ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค:

  1. ์–ดํ…์…˜ ์ ์ˆ˜: \(Q \cdot K^T\)๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์ฟผ๋ฆฌ๊ฐ€ ๊ฐ ํ‚ค ๋ฒกํ„ฐ์™€ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งค์นญ๋˜๋Š”์ง€ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค
  2. ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜: ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•˜์—ฌ ์ ์ˆ˜๋ฅผ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค (๊ฐ€์ค‘์น˜์˜ ํ•ฉ = 1)
  3. ๊ฐ€์ค‘ ํ•ฉ: ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ’ ๋ฒกํ„ฐ๋“ค์„ ๊ฒฐํ•ฉํ•ด ์ตœ์ข… ์ถœ๋ ฅ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค

์–ดํ…์…˜ ์ดํ•ดํ•˜๊ธฐ: ๋‹จ๊ณ„๋ณ„ ๋ถ„์„

์–ดํ…์…˜์„ ์Šค๋งˆํŠธ ๊ฒ€์ƒ‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ์ฟผ๋ฆฌ(์ฐพ๊ณ ์ž ํ•˜๋Š” ๊ฒƒ)๊ฐ€ ์ฃผ์–ด์ง€๋ฉด, ์–ดํ…์…˜์€ ํ‚ค-๊ฐ’ ์Œ์˜ ๋ชจ์Œ์—์„œ ๊ฐ€์žฅ ๊ด€๋ จ์„ฑ ๋†’์€ ์ •๋ณด๋ฅผ ์ฐพ์•„๋ƒ…๋‹ˆ๋‹ค:

  1. 1๋‹จ๊ณ„ - ์œ ์‚ฌ๋„ ๋งค์นญ: ์ฟผ๋ฆฌ \(Q\)๋ฅผ ๋ชจ๋“  ํ‚ค \(K\)์™€ ๋น„๊ตํ•˜์—ฌ ์œ ์‚ฌ๋„ ์ ์ˆ˜๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค

    • \(Q \cdot K^T\)๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ \(Q\)๊ฐ€ ๊ฐ ํ‚ค ๋ฒกํ„ฐ์™€ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋งค์นญ๋˜๋Š”์ง€ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค
    • ๋†’์€ ์ ์ˆ˜ = ๋” ์ข‹์€ ๋งค์นญ
  2. 2๋‹จ๊ณ„ - ํ™•๋ฅ  ๋ถ„ํฌ: ์›์‹œ ์ ์ˆ˜๋ฅผ ์ •๊ทœํ™”๋œ ๊ฐ€์ค‘์น˜๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค

    • ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•˜์—ฌ ๋ชจ๋“  ๊ฐ€์ค‘์น˜์˜ ํ•ฉ์ด 1.0์ด ๋˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค
    • ์–ด๋–ค ๊ฐ’์— ์ง‘์ค‘ํ• ์ง€์— ๋Œ€ํ•œ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค
  3. 3๋‹จ๊ณ„ - ๊ฐ€์ค‘ ๊ฒ€์ƒ‰: ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ’๋“ค์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค

    • ๊ฐ ๊ฐ’ ๋ฒกํ„ฐ์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๊ณฑํ•ฉ๋‹ˆ๋‹ค
    • ๋ชจ๋“  ๊ฒƒ์„ ๋”ํ•ด ์ตœ์ข… ์ถœ๋ ฅ์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค

์‹ค์ƒํ™œ ๋น„์œ : ๋„์„œ๊ด€์—์„œ ๊ฒ€์ƒ‰ํ•˜๋Š” ๊ฒƒ์„ ์ƒ์ƒํ•ด ๋ณด์„ธ์š”. ์ฟผ๋ฆฌ๋Š” ์ฐพ๊ณ  ์‹ถ์€ ๊ฒƒ์ด๊ณ , ์ฑ… ์ œ๋ชฉ์€ ํ‚ค์ด๋ฉฐ, ์ฑ… ๋‚ด์šฉ์€ ๊ฐ’์ž…๋‹ˆ๋‹ค. ์–ดํ…์…˜์€ ๊ฐ ์ฑ…์ด ์ฟผ๋ฆฌ์™€ ์–ผ๋งˆ๋‚˜ ๊ด€๋ จ ์žˆ๋Š”์ง€ ๊ณ„์‚ฐํ•œ ๋‹ค์Œ, ๊ด€๋ จ๋„์— ๋”ฐ๋ผ ๊ฐ€์ค‘ ์š”์•ฝ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์—ฐ์‚ฐ ํ๋ฆ„ ์‹œ๊ฐํ™”

Input:  Q(16,)    K(16,16)    V(16,16)
         โ†“           โ†“           โ†“
Step 1: Q(1,16) @ K^T(16,16) โ†’ Scores(1,16)
         โ†“
Step 2: softmax(Scores) โ†’ Weights(1,16)  [sum = 1.0]
         โ†“
Step 3: Weights(1,16) @ V(16,16) โ†’ Output(1,16) โ†’ reshape โ†’ Output(16,)

ํ•ต์‹ฌ ์•„์ด๋””์–ด: ์ฟผ๋ฆฌ ๋ฒกํ„ฐ \(Q\)๋ฅผ \((16,)\)์—์„œ \((1,16)\)์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉด, ๋‚ด์  ๋Œ€์‹  ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋•๋ถ„์— Puzzle 18์˜ ๊ณ ๋„๋กœ ์ตœ์ ํ™”๋œ ํƒ€์ผ๋ง matmul ์ปค๋„์„ ๊ทธ๋Œ€๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค!

GPU ๊ตฌํ˜„์€ ์ด์ „ ํผ์ฆ์—์„œ ์ตœ์ ํ™”๋œ ์ปค๋„๋“ค์„ ์žฌ์‚ฌ์šฉํ•˜๊ณ  ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

๐Ÿ”„ ์ปค๋„ ์žฌ์‚ฌ์šฉ ์ „๋žต: ์ด ํผ์ฆ์€ ์ด์ „ ํผ์ฆ์—์„œ ๊ฒ€์ฆ๋œ ์ตœ์ ํ™” ์ปค๋„๋“ค์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ณต์žกํ•œ ์—ฐ์‚ฐ์„ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ๋ชจ๋“  ๊ฒƒ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ์ž‘์„ฑํ•˜๋Š” ๋Œ€์‹ , Puzzle 16์˜ matmul_idiomatic_tiled๊ณผ Puzzle 18์˜ softmax_kernel์„ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋“ˆํ˜• GPU ์ปค๋„ ์„ค๊ณ„์˜ ๊ฐ•๋ ฅํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

  • ์‹œํ€€์Šค ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ฒกํ„ฐ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜
  • ์ปค๋„ ์žฌ์‚ฌ์šฉ: Puzzle 16๊ณผ Puzzle 18์˜ ๊ฒ€์ฆ๋œ ๊ตฌํ˜„ ํ™œ์šฉ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ tiling์„ ํ™œ์šฉํ•œ ํšจ์œจ์ ์ธ ํ–‰๋ ฌ ๊ณฑ์…ˆ
  • ๋ฒ„ํผ ํ• ๋‹น์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ํ…์„œ ํ˜•ํƒœ ๋ณ€ํ™˜
  • ์—ฌ๋Ÿฌ ์ตœ์ ํ™” ์ปค๋„์„ ๋‹จ์ผ ์—ฐ์‚ฐ์œผ๋กœ ํ†ตํ•ฉ
  • ๋‹ค์ค‘ ์ž…๋ ฅ์„ ์ง€์›ํ•˜๋Š” ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ
  • ํ˜ธํ™˜์„ฑ์„ ์œ„ํ•œ CPU ํด๋ฐฑ ๊ตฌํ˜„

์„ค์ •

  • ์‹œํ€€์Šค ๊ธธ์ด: \(\text{SEQ_LEN} = 16~\) - ์‹œํ€€์Šค ๋‚ด ํ‚ค/๊ฐ’ ๋ฒกํ„ฐ์˜ ์ˆ˜
  • ๋ชจ๋ธ ์ฐจ์›: \(\text{D} = 16~\) - ๊ฐ ๋ฒกํ„ฐ(์ฟผ๋ฆฌ, ํ‚ค, ๊ฐ’)์˜ ์ฐจ์›
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: ๊ฐ ์ปค๋„์— ๋งž๊ฒŒ ๊ฐœ๋ณ„ ์ตœ์ ํ™”
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: ๋‹ค์–‘ํ•œ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋„๋ก ๋™์ ์œผ๋กœ ๊ณ„์‚ฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ์ „์น˜, matmul, ์†Œํ”„ํŠธ๋งฅ์Šค ์ปค๋„์—์„œ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ํ™œ์šฉ

๋ ˆ์ด์•„์›ƒ ์„ค์ •:

  • ์ฟผ๋ฆฌ ํ…์„œ: row_major[d]()
  • ํ‚ค ํ…์„œ: row_major[seq_len, d]()
  • ๊ฐ’ ํ…์„œ: row_major[seq_len, d]()
  • ์ถœ๋ ฅ ํ…์„œ: row_major[d]()
  • ์ปค์Šคํ…€ op ํŒŒ๋ผ๋ฏธํ„ฐ: {"seq_len": seq_len, "d": d, "dtype": dtype}

์ด ํผ์ฆ์˜ ํ•ต์‹ฌ ์š”์†Œ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ๋‹ค์ค‘ ์ปค๋„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜: ์ „์น˜, matmul, ์†Œํ”„ํŠธ๋งฅ์Šค ์—ฐ์‚ฐ์˜ ๊ฒฐํ•ฉ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ํ˜•ํƒœ ๋ณ€ํ™˜ ์—ฐ์‚ฐ๊ณผ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ตœ์†Œํ™”
  3. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ: Puzzle 18์˜ ๊ฒ€์ฆ๋œ ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„ ํ™œ์šฉ
  4. ์„ฑ๋Šฅ ์ตœ์ ํ™”: ๋ชจ๋“  ํ–‰๋ ฌ ์—ฐ์‚ฐ์— Puzzle 16์˜ ํƒ€์ผ๋ง ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‚ฌ์šฉ
  5. ๋‹ค์ค‘ ์ž…๋ ฅ ์—ฐ์‚ฐ: ๋‹จ์ผ ์ปค์Šคํ…€ op์—์„œ ์„ธ ๊ฐœ์˜ ์ž…๋ ฅ ํ…์„œ(Q, K, V) ์ฒ˜๋ฆฌ

์–ดํ…์…˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ผ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ํŒŒ์ด์ฌ์—์„œ ์ฟผ๋ฆฌ, ํ‚ค, ๊ฐ’ ํ…์„œ๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›๊ธฐ
  • ์ตœ์ ํ™”๋œ ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ GPU์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์–ดํ…์…˜ ๊ฐ€์ค‘ ์ถœ๋ ฅ ๋ฒกํ„ฐ ๋ฐ˜ํ™˜
  • NumPy ์ฐธ์กฐ ๊ตฌํ˜„ ๊ฒฐ๊ณผ์™€ ์ผ์น˜

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด Puzzle 16์˜ ํƒ€์ผ๋ง matmul ์ปค๋„๊ณผ Puzzle 18์˜ ์†Œํ”„ํŠธ๋งฅ์Šค ์ปค๋„์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Mojo ํŒŒ์ผ์—์„œ ์ „์น˜ ์ปค๋„๋งŒ ๊ตฌํ˜„ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

1. ์ „์น˜ ์ปค๋„ ๊ตฌํ˜„ํ•˜๊ธฐ

def transpose_kernel[
    rows: Int,
    cols: Int,
    OutLayout: TensorLayout,
    InLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
):
    # FILL ME IN (roughly 18 lines)
    ...


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p19/op/attention.mojo

ํŒ

์ „์น˜ ์ปค๋„ ๊ตฌํ˜„ ๊ฐ€์ด๋“œ:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์„ค์ •: stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY]())์„ ์‚ฌ์šฉํ•˜์—ฌ TRANSPOSE_BLOCK_DIM_XY ร— TRANSPOSE_BLOCK_DIM_XY ํฌ๊ธฐ์˜ ์ •์‚ฌ๊ฐํ˜• ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์Šค๋ ˆ๋“œ ๊ฐ„ ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

  2. ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: ์Šค๋ ˆ๋“œ๋ฅผ ํ–‰๋ ฌ ์š”์†Œ์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค:

    • local_row = thread_idx.y, local_col = thread_idx.x (๋ธ”๋ก ๋‚ด ์œ„์น˜)
    • global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row (์ „์ฒด ํ–‰๋ ฌ์—์„œ์˜ ์œ„์น˜)
  3. 2๋‹จ๊ณ„ ์—ฐ์‚ฐ:

    • 1๋‹จ๊ณ„: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ผ๋ฐ˜ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • 2๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  4. ํ•„์ˆ˜ ๋™๊ธฐํ™”: ๋กœ๋“œ์™€ ์ €์žฅ ์‚ฌ์ด์— barrier()๋ฅผ ํ˜ธ์ถœํ•˜์—ฌ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋กœ๋“œ๋ฅผ ์™„๋ฃŒํ•œ ํ›„์—์•ผ ์ €์žฅ์„ ์‹œ์ž‘ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค

  5. ์ „์น˜์˜ ํ•ต์‹ฌ: ์ „์น˜๋Š” ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ์„ ํ†ตํ•ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค: shared_tile[local_row, local_col] ๋Œ€์‹  shared_tile[local_col, local_row]๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค

  6. ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY๋กœ ์ •ํ™•ํžˆ ๋‚˜๋ˆ„์–ด์ง€์ง€ ์•Š๋Š” ํ–‰๋ ฌ์—์„œ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ฝ๊ธฐ/์“ฐ๊ธฐ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค

  7. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์ด ํŒจํ„ด์€ ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋ชจ๋‘ ๋ณ‘ํ•ฉ๋˜๋„๋ก ๋ณด์žฅํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค

2. ์–ดํ…์…˜ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜

            var gpu_ctx = rebind[DeviceContext](ctx[])

            # Define layouts for matrix multiplication
            # Q reshaped to (1, d)
            comptime layout_q_2d = row_major[1, d]()
            comptime Q2DLayout = type_of(layout_q_2d)
            # K^T is (d, seq_len)
            comptime layout_k_t = row_major[d, seq_len]()
            comptime KTLayout = type_of(layout_k_t)
            # Scores as (1, seq_len)
            comptime layout_scores_2d = row_major[1, seq_len]()
            comptime Scores2DLayout = type_of(layout_scores_2d)
            # Weights as (1, seq_len)
            comptime layout_weights_2d = row_major[1, seq_len]()
            comptime Weights2DLayout = type_of(layout_weights_2d)
            # Result as (1, d)
            comptime layout_result_2d = row_major[1, d]()
            comptime Result2DLayout = type_of(layout_result_2d)

            # Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks
            comptime transpose_threads_per_block = (
                TRANSPOSE_BLOCK_DIM_XY,
                TRANSPOSE_BLOCK_DIM_XY,
            )
            # Tile over the K (seq_len, d) matrix
            comptime transpose_blocks_per_grid = (
                (d + TRANSPOSE_BLOCK_DIM_XY - 1) // TRANSPOSE_BLOCK_DIM_XY,
                (seq_len + TRANSPOSE_BLOCK_DIM_XY - 1)
                // TRANSPOSE_BLOCK_DIM_XY,
            )
            # Matmul implementation limited to square (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) thread blocks
            comptime matmul_threads_per_block = (
                MATMUL_BLOCK_DIM_XY,
                MATMUL_BLOCK_DIM_XY,
            )
            # seq_len outputs ( Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len) ) with one thread per output
            comptime scores_blocks_per_grid = (
                seq_len + MATMUL_BLOCK_DIM_XY - 1
            ) // MATMUL_BLOCK_DIM_XY
            comptime softmax_threads = SOFTMAX_BLOCK_DIM_X
            comptime softmax_blocks_per_grid = 1
            # d outputs ( weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d) ) with one thread per output
            comptime result_blocks_per_grid = (
                d + MATMUL_BLOCK_DIM_XY - 1
            ) // MATMUL_BLOCK_DIM_XY

            # Allocate minimal temporary buffers - reuse same buffer for different shapes
            var k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len * d
            )  # K^T as (d, seq_len)
            var scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len
            )  # Reused for scores and weights

            var k_t = TileTensor(k_t_buf, layout_k_t)

            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            # FILL ME IN 1 line

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            # FILL ME IN 1 function call

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q ยท K^T[i] = Q ยท K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            # FILL ME IN 2 lines

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            # FILL ME IN 1 line

            # Step 5: Apply softmax to get attention weights
            # FILL ME IN 1 function call

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            # FILL ME IN 1 line

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            # FILL ME IN 2 lines

์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p19/op/attention.mojo

์ปค๋„ ํ…Œ์ŠคํŠธ

pixi run p19
pixi run -e amd p19
pixi run -e apple p19
uv run poe p19

์„ฑ๊ณตํ•˜๋ฉด CPU์™€ GPU์—์„œ ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Input shapes: Q=(16,), K=(16, 16), V=(16, 16)
Sample Q values: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]
Sample K[0] values: [-0.10128311  0.03142473 -0.09080241 -0.14123037  0.14656489]
Sample V[0] values: [ 0.11631638  0.00102331 -0.09815087  0.04621035  0.01990597]

================================================================================
STEP-BY-STEP VECTOR ATTENTION COMPUTATION DEBUG
================================================================================

1. INPUT SHAPES:
   Q shape: (16,) (query vector)
   K shape: (16, 16) (key matrix)
   V shape: (16, 16) (value matrix)
   Q[:5]: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]

2. ATTENTION SCORES (K[i] ยท Q):
   Scores shape: (16,)
   Scores[:5]: [-0.03479404 -0.01563787  0.04834607  0.06764711  0.04001468]
   Min: -0.061636, Max: 0.067647
   Manual verification:
     Q ยท K[0] = K[0] ยท Q = -0.034794 (computed: -0.034794)
     Q ยท K[1] = K[1] ยท Q = -0.015638 (computed: -0.015638)
     Q ยท K[2] = K[2] ยท Q = 0.048346 (computed: 0.048346)

3. SOFTMAX:
   Max score: 0.067647
   Attention weights shape: (16,)
   Attention weights[:5]: [0.05981331 0.06097015 0.06499878 0.0662655  0.06445949]
   Sum: 1.000000 (should be 1.0)

4. WEIGHTED SUM OF VALUES:
   Output shape: (16,)
   Output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Output norm: 0.092764
   Manual output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Match: True

================================================================================
TESTING INDIVIDUAL OPERATIONS
================================================================================

Test 1: Vector Dot Product
a ยท b = 3.000000

Test 2: Matrix-Vector Multiplication
M @ v = [ 3.  7. 11.]

Test 3: Softmax
Input: [1. 2. 3. 4.]
Softmax: [0.0320586  0.08714432 0.2368828  0.6439143 ]
Sum: 1.000000

================================================================================
TESTING FULL ATTENTION
================================================================================
Compiling attention graph on Device(type=cpu,id=0)
Executing attention on Device(type=cpu,id=0)
====================================================================================================

CPU attention output[:5]: [-0.00935538 -0.02434331  0.00306551  0.02346884  0.019306  ]
CPU matches NumPy: True
Compiling attention graph on Device(type=gpu,id=0)
Executing attention on Device(type=gpu,id=0)
====================================================================================================

GPU attention output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
Expected output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
GPU matches NumPy: True

================================================================================
FINAL VERIFICATION
================================================================================
โœ“ CPU implementation PASSED
โœ“ GPU implementation PASSED

Output vector norms:
  CPU: 0.092764
  GPU: 0.092764
  Expected: 0.092764

์ด ์ถœ๋ ฅ์€ ์ปค์Šคํ…€ MAX ๊ทธ๋ž˜ํ”„ ์—ฐ์‚ฐ์ด ์–ดํ…์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๊ตฌํ˜„ํ•˜์—ฌ NumPy ์ฐธ์กฐ ๊ตฌํ˜„๊ณผ ์ผ์น˜ํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ–ˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์†”๋ฃจ์…˜

์ด ํผ์ฆ์„ ํ’€๋ ค๋ฉด Mojo์—์„œ ์ „์น˜ ์ปค๋„์„ ๊ตฌํ˜„ํ•˜๊ณ  ์–ดํ…์…˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ํŒŒ์ด์ฌ ๊ทธ๋ž˜ํ”„ ์ •์˜๋ฅผ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ์ด์ „ ํผ์ฆ์˜ ๊ฐœ๋…๋“ค์„ ๊ธฐ๋ฐ˜์œผ๋กœ, Puzzle 16์˜ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ๊ณผ Puzzle 18์˜ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ์™„์ „ํ•œ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

์žฌ์‚ฌ์šฉ ์ปค๋„

๊ตฌํ˜„์—์„œ ๋‹ค์Œ์˜ ๊ฒ€์ฆ๋œ ์ปค๋„๋“ค์„ ์ง์ ‘ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค:

  1. matmul_idiomatic_tiled (Puzzle 16) - \(Q \times K^T\)์™€ \(\text{weights} \times V\) ์—ฐ์‚ฐ ๋ชจ๋‘๋ฅผ ์ˆ˜ํ–‰
  2. softmax_kernel (Puzzle 18) - ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜ ๊ณ„์‚ฐ ์ œ๊ณต

์ด๋Š” ๋ชจ๋“ˆํ˜• GPU ์•„ํ‚คํ…์ฒ˜์˜ ์ข‹์€ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค: ๋‹จ์ผ ๊ตฌํ˜„์ฒด๊ฐ€ ์•„๋‹Œ, ๊ฒ€์ฆ๋œ ์ตœ์ ํ™” ์ปดํฌ๋„ŒํŠธ๋ฅผ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ํ•˜์—ฌ ๋ณต์žกํ•œ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.

์–ดํ…์…˜ ์—ฐ์‚ฐ์€ ํ‘œ์ค€์ ์ธ ์ˆ˜ํ•™์  ์ •์˜๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

์ˆ˜์‹ ๋ถ„์„:

  • \(Q \cdot K^T~\): ์ฟผ๋ฆฌ-ํ‚ค ์œ ์‚ฌ๋„ ์ ์ˆ˜, ํ˜•ํƒœ: \((1, \text{seq_len})\)
  • \(\text{softmax}(\cdot)~\): ์ ์ˆ˜๋ฅผ ํ™•๋ฅ ๋กœ ์ •๊ทœํ™”, ํ˜•ํƒœ: \((1, \text{seq_len})\)
  • \(\text{weights} \cdot V~\): ๊ฐ’์˜ ๊ฐ€์ค‘ ๊ฒฐํ•ฉ, ํ˜•ํƒœ: \((1, d)\)

์ด ๊ณผ์ •์—๋Š” ์ด์ „ ํผ์ฆ์˜ GPU ์ปค๋„์„ ํ™œ์šฉํ•˜์—ฌ ์ตœ์ ํ™”ํ•˜๋Š” ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ ๋‹จ๊ณ„๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

1. ์ „์น˜ ์ปค๋„ ๊ตฌํ˜„

def transpose_kernel[
    rows: Int,
    cols: Int,
    OutLayout: TensorLayout,
    InLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
):
    """Transpose matrix using shared memory tiling for coalesced access."""
    comptime shared_layout = row_major[
        TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
    ]()
    var shared_tile = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](shared_layout)

    var local_row = thread_idx.y
    var local_col = thread_idx.x

    var global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row
    var global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col

    var inp_lt = inp.to_layout_tensor()
    var output_lt = output.to_layout_tensor()
    var shared_tile_lt = shared_tile.to_layout_tensor()

    if global_row < rows and global_col < cols:
        shared_tile_lt[local_row, local_col] = inp_lt[global_row, global_col]

    barrier()

    var out_row = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_row
    var out_col = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_col

    # Store data from shared memory to global memory (coalesced write)
    # Note: we transpose the shared memory access pattern
    if out_row < cols and out_col < rows:
        output_lt[out_row, out_col] = shared_tile_lt[local_col, local_row]


์ „์น˜ ์ปค๋„์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ tiling์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๊ตฌํ˜„ ๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ์ „์น˜ ํŒจํ„ด

# ์ผ๋ฐ˜ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๋กœ๋“œ
shared_tile[local_row, local_col] = inp[global_row, global_col]
barrier()
# ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ์œผ๋กœ ์ €์žฅํ•˜์—ฌ ์ „์น˜
output[out_row, out_col] = shared_tile[local_col, local_row]

์ „์น˜๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์—์„œ ๋’ค๋ฐ”๊พผ ์ธ๋ฑ์‹ฑ([local_row, local_col] ๋Œ€์‹  [local_col, local_row])๊ณผ ์ถœ๋ ฅ ์œ„์น˜ ์ง€์ •์„ ์œ„ํ•œ ๋’ค๋ฐ”๊พผ ๋ธ”๋ก ์ขŒํ‘œ๋ฅผ ํ†ตํ•ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ฝ๊ธฐ์™€ ์“ฐ๊ธฐ ๋ชจ๋‘ ๋ณ‘ํ•ฉ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์น˜ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

2. GPU ์ปค๋„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜


            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            var q_2d = q_tensor.reshape(layout_q_2d)

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            comptime kernel = transpose_kernel[
                seq_len, d, KTLayout, KLayout, dtype
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                k_t,
                k_tensor,
                grid_dim=transpose_blocks_per_grid,
                block_dim=transpose_threads_per_block,
            )

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q ยท K^T[i] = Q ยท K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            var scores_2d = TileTensor(scores_weights_buf, layout_scores_2d)
            comptime kernel2 = matmul_idiomatic_tiled[
                1,
                seq_len,
                d,
                Scores2DLayout,
                Q2DLayout,
                KTLayout,
                dtype,
            ]
            gpu_ctx.enqueue_function[kernel2, kernel2](
                scores_2d,
                q_2d,
                k_t,
                grid_dim=scores_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            var weights = scores_2d.reshape(layout_scores)

            # Step 5: Apply softmax to get attention weights (in-place)
            comptime ScoresLayout = type_of(layout_scores)
            comptime kernel3 = softmax_gpu_kernel[seq_len, ScoresLayout, dtype]
            # Create two TileTensor views from the underlying buffer to avoid aliasing error
            var weights_out = TileTensor[
                mut=True, dtype, ScoresLayout, MutAnyOrigin
            ](scores_weights_buf, layout_scores)
            var weights_in = TileTensor[
                mut=True, dtype, ScoresLayout, MutAnyOrigin
            ](scores_weights_buf, layout_scores)
            gpu_ctx.enqueue_function[kernel3, kernel3](
                weights_out,
                weights_in,
                grid_dim=softmax_blocks_per_grid,
                block_dim=softmax_threads,
            )

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            var weights_2d = weights.reshape(layout_weights_2d)

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            var result_2d = output_tensor.reshape(layout_result_2d)
            comptime kernel4 = matmul_idiomatic_tiled[
                1,
                d,
                seq_len,
                Result2DLayout,
                Weights2DLayout,
                VLayout,
                dtype,
            ]
            gpu_ctx.enqueue_function[kernel4, kernel4](
                result_2d,
                weights_2d,
                v_tensor,
                grid_dim=result_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

GPU ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜์€ ์ •๊ตํ•œ ์ปค๋„ ์ฒด์ด๋‹๊ณผ ์ œ๋กœ ์นดํ”ผ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์ „๋žต

# ์ œ๋กœ ์นดํ”ผ reshape - ๋ฐ์ดํ„ฐ ์ด๋™ ์—†์ด ํ…์„œ shape๋งŒ ์žฌํ•ด์„
q_2d = q_tensor.reshape[layout_q_2d]()
# ์ ๊ทน์ ์ธ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ - ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ, ๋‹ค๋ฅธ ํ•ด์„
weights = scores_2d.reshape[layout_scores]()

๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ์ตœ๋Œ€ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • ์ œ๋กœ ์นดํ”ผ ํ˜•ํƒœ ๋ณ€ํ™˜: ๋ฉ”๋ชจ๋ฆฌ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋™ํ•˜์ง€ ์•Š๊ณ  ํ…์„œ ํ˜•ํƒœ๋ฅผ ์žฌํ•ด์„
  • ์ง€๋Šฅ์  ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ: ๋™์ผํ•œ scores_weights_buf๊ฐ€ ์ ์ˆ˜ \((1,\text{seq_len})\)์™€ ๊ฐ€์ค‘์น˜ \((\text{seq_len},)\) ์ด์ค‘ ์šฉ๋„๋กœ ํ™œ์šฉ
  • ์ตœ์†Œ ํ• ๋‹น: ๋‹จ 2๊ฐœ์˜ ์ž„์‹œ ๋ฒ„ํผ๋กœ ์ „์ฒด ์–ดํ…์…˜ ์—ฐ์‚ฐ ์ˆ˜ํ–‰
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ๋ชจ๋“  ์—ฐ์‚ฐ์—์„œ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์œ ์ง€

์ „๋žต์  ์ปค๋„ ์žฌ์‚ฌ์šฉ ํŒจํ„ด

  • 3๋‹จ๊ณ„ & 7๋‹จ๊ณ„: ๋‘˜ ๋‹ค Puzzle 16์˜ matmul_idiomatic_tiled ํ™œ์šฉ
    • 3๋‹จ๊ณ„: \(Q \times K^T\) โ†’ ์–ดํ…์…˜ ์ ์ˆ˜ ๊ณ„์‚ฐ \((1,d) \times (d,\text{seq_len}) \rightarrow (1,\text{seq_len})\)
    • 7๋‹จ๊ณ„: \(\text{weights} \times V\) โ†’ ์ตœ์ข… ๊ฐ€์ค‘ ์ถœ๋ ฅ \((1,\text{seq_len}) \times (\text{seq_len},d) \rightarrow (1,d)\)
    • ๋‘ ์—ฐ์‚ฐ ๋ชจ๋‘ ๋‹ค์–‘ํ•œ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํฌํ•จ
  • 5๋‹จ๊ณ„: Puzzle 18์˜ softmax_kernel ์‚ฌ์šฉ
    • ์›์‹œ ์ ์ˆ˜๋ฅผ ์ •๊ทœํ™”๋œ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜
    • ์ตœ๋Œ“๊ฐ’ ์ฐจ๊ฐ๊ณผ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ํ†ตํ•œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ๋ณด์žฅ
    • \(\sum_{i} \text{weights}[i] = 1.0\) ๋ณด์žฅ

์ด๋Š” ๋ชจ๋“ˆํ˜• GPU ์•„ํ‚คํ…์ฒ˜์˜ ์ข‹์€ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค: ๋‹จ์ผ ๊ตฌํ˜„์ฒด๊ฐ€ ์•„๋‹Œ, ๊ฒ€์ฆ๋œ ์ตœ์ ํ™” ์ปค๋„๋“ค์„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ํ•˜์—ฌ ๋ณต์žกํ•œ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค!

ํ•ต์‹ฌ ๊ตฌํ˜„ ์ธ์‚ฌ์ดํŠธ

๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์ „๋žต

์ ๊ทน์ ์ธ ๋ฒ„ํผ ์žฌ์‚ฌ์šฉ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค:

# ์ „์ฒด ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ์ž„์‹œ ๋ฒ„ํผ๋Š” ๋‹จ 2๊ฐœ
var k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len * d)
var scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len)

ํ•ต์‹ฌ ์ตœ์ ํ™” ํฌ์ธํŠธ:

  • ๋™์ผํ•œ scores_weights_buf๊ฐ€ ํ˜•ํƒœ ๋ณ€ํ™˜ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ์–ดํ…์…˜ ์ ์ˆ˜์™€ ๊ฐ€์ค‘์น˜ ๋ชจ๋‘์— ์žฌ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค
  • ์ œ๋กœ ์นดํ”ผ ํ…์„œ ํ˜•ํƒœ ๋ณ€ํ™˜์œผ๋กœ ๋ถˆํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ์ด๋™์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค

์ปค๋„ ์žฌ์‚ฌ์šฉ ์•„ํ‚คํ…์ฒ˜

์ด ํผ์ฆ์€ ์„ธ ๊ฐ€์ง€ ํŠนํ™”๋œ ์ปค๋„์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ชจ๋“ˆํ˜• ์ปค๋„ ์„ค๊ณ„๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • matmul_idiomatic_tiled (2ํšŒ ์‚ฌ์šฉ) - \(Q \times K^T\)์™€ \(\text{weights} \times V\) ์—ฐ์‚ฐ ๋ชจ๋‘๋ฅผ ์ˆ˜ํ–‰
  • softmax_kernel - ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ํ™œ์šฉํ•˜์—ฌ ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜ ๊ณ„์‚ฐ
  • transpose_kernel - ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ํšจ์œจ์ ์ธ \(K^T\) ๊ณ„์‚ฐ

์•„ํ‚คํ…์ฒ˜์˜ ์žฅ์ :

  • ์กฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ: ๊ฒ€์ฆ๋œ ์ปดํฌ๋„ŒํŠธ๋กœ ๋ณต์žกํ•œ ์—ฐ์‚ฐ ๊ตฌ์ถ•
  • ์œ ์ง€๋ณด์ˆ˜์„ฑ: ๊ฐ ์ปค๋„์ด ๋ช…ํ™•ํ•˜๊ฒŒ ์ •์˜๋œ ๋‹จ์ผ ์—ญํ•  ์ˆ˜ํ–‰
  • ์„ฑ๋Šฅ: ์ด์ „ ํผ์ฆ์˜ ๊ณ ๋„๋กœ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„ ํ™œ์šฉ
  • ํ™•์žฅ์„ฑ: ๋ชจ๋“ˆํ˜• ์„ค๊ณ„๋กœ ๋” ํฐ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์œผ๋กœ ํ™•์žฅ ์šฉ์ด

์ด ๊ตฌํ˜„์€ ์ •๊ตํ•œ ์‹ ๊ฒฝ๋ง ์—ฐ์‚ฐ์ด ๋‹จ์ผ ๊ตฌํ˜„์ฒด๊ฐ€ ์•„๋‹Œ, ๋” ๋‹จ์ˆœํ•˜๊ณ  ์ž˜ ๊ฒ€์ฆ๋œ GPU ์ปค๋„๋“ค์„ ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜ํ•˜์—ฌ ๊ตฌ์ถ•ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ณด๋„ˆ์Šค ์ฑŒ๋ฆฐ์ง€

์ฑŒ๋ฆฐ์ง€ I: ๊ณ ๊ธ‰ ์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„

์ด ์ฑŒ๋ฆฐ์ง€๋Š” Puzzle 18: ์†Œํ”„ํŠธ๋งฅ์Šค Op์˜ ํ™•์žฅ์ž…๋‹ˆ๋‹ค

์†Œํ”„ํŠธ๋งฅ์Šค ๊ตฌํ˜„์„ ํ™•์žฅํ•˜๋Š” ๊ณ ๊ธ‰ ์ฑŒ๋ฆฐ์ง€๋“ค์ž…๋‹ˆ๋‹ค:

1. ๋Œ€๊ทœ๋ชจ ์†Œํ”„ํŠธ๋งฅ์Šค: TPB < SIZE ์ฒ˜๋ฆฌ

์ž…๋ ฅ ํฌ๊ธฐ๊ฐ€ ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜๋ฅผ ์ดˆ๊ณผํ•˜๋ฉด(TPB < SIZE), ๋‹จ์ผ ๋ธ”๋ก์ด ์ „์ฒด ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†์–ด ํ˜„์žฌ ๊ตฌํ˜„์ด ๋™์ž‘ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ๋ฒ•์ด ์žˆ์Šต๋‹ˆ๋‹ค:

1.1 ๋ฒ„ํผ ๋ฆฌ๋•์…˜

  • ๋ธ”๋ก ๋‹จ์œ„ ๊ฒฐ๊ณผ(์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„)๋ฅผ ๋””๋ฐ”์ด์Šค ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋‘ ๋ฒˆ์งธ ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋“ค์— ๋Œ€ํ•ด ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ์ „์—ญ ์ตœ๋Œ“๊ฐ’๊ณผ ํ•ฉ๊ณ„๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ตœ์ข… ์ •๊ทœํ™” ๋‹จ๊ณ„๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค

1.2 2๋‹จ๊ณ„ ์†Œํ”„ํŠธ๋งฅ์Šค

  • 1์ฐจ: ๊ฐ ๋ธ”๋ก์ด ๋กœ์ปฌ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ํ›„ ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • 2์ฐจ: \(e^{x-max}\)์™€ ๋กœ์ปฌ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ํ›„ ์ „์—ญ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์ข…: ์ „์—ญ ํ•ฉ๊ณ„๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค

2. ๋ฐฐ์น˜ ์†Œํ”„ํŠธ๋งฅ์Šค

๋ฒกํ„ฐ ๋ฐฐ์น˜(2D ์ž…๋ ฅ ํ…์„œ)์— ๋Œ€ํ•œ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ๋‹ค์Œ ๋ณ€ํ˜•์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  • ํ–‰ ๋‹จ์œ„ ์†Œํ”„ํŠธ๋งฅ์Šค: ๊ฐ ํ–‰์— ๋…๋ฆฝ์ ์œผ๋กœ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ์—ด ๋‹จ์œ„ ์†Œํ”„ํŠธ๋งฅ์Šค: ๊ฐ ์—ด์— ๋…๋ฆฝ์ ์œผ๋กœ ์†Œํ”„ํŠธ๋งฅ์Šค๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๋‘ ๊ตฌํ˜„ ๊ฐ„์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค

์ฑŒ๋ฆฐ์ง€ II: ๊ณ ๊ธ‰ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜

์ด ์ฑŒ๋ฆฐ์ง€๋Š” Puzzle 19: ์–ดํ…์…˜ Op์˜ ํ™•์žฅ์ž…๋‹ˆ๋‹ค

๋ฒกํ„ฐ ์–ดํ…์…˜ ๊ตฌํ˜„์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์˜ ํ•œ๊ณ„๋ฅผ ๋„“ํ˜€๋ณด๋Š” ๊ณ ๊ธ‰ ์ฑŒ๋ฆฐ์ง€๋“ค์ž…๋‹ˆ๋‹ค:

1. ๋” ๊ธด ์‹œํ€€์Šค ๊ธธ์ด

๊ธฐ์กด ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ ๋” ๊ธด ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

1.1 ์‹œํ€€์Šค ๊ธธ์ด ํ™•์žฅ

  • SEQ_LEN = 32์™€ SEQ_LEN = 64๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ์–ดํ…์…˜ ๊ตฌํ˜„์„ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค
  • TPB(๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ทธ์— ๋งž๊ฒŒ ์—…๋ฐ์ดํŠธํ•ฉ๋‹ˆ๋‹ค
  • ์ „์น˜ ์ปค๋„์ด ๋” ํฐ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค

1.2 ๋™์  ์‹œํ€€์Šค ๊ธธ์ด

  • ๋Ÿฐํƒ€์ž„์— ๊ฐ€๋ณ€ ์‹œํ€€์Šค ๊ธธ์ด๋ฅผ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๋Š” ์–ดํ…์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค
  • SEQ_LEN๋ณด๋‹ค ์งง์€ ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ปค๋„์— ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค
  • ๊ณ ์ • ์‹œํ€€์Šค ๊ธธ์ด ์ฒ˜๋ฆฌ์™€ ๋™์  ์‹œํ€€์Šค ๊ธธ์ด ์ฒ˜๋ฆฌ์˜ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค

2. ๋ฐฐ์น˜ ๋ฒกํ„ฐ ์–ดํ…์…˜

์—ฌ๋Ÿฌ ์–ดํ…์…˜ ์—ฐ์‚ฐ์„ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค:

2.1 ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ

  • ์—ฌ๋Ÿฌ ์ฟผ๋ฆฌ ๋ฒกํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์— ์ฒ˜๋ฆฌํ•˜๋„๋ก ์–ดํ…์…˜ ์—ฐ์‚ฐ์„ ์ˆ˜์ •ํ•ฉ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ ํ˜•ํƒœ: Q(batch_size, d), K(seq_len, d), V(seq_len, d)
  • ์ถœ๋ ฅ ํ˜•ํƒœ: (batch_size, d)
  • ์ ์ ˆํ•œ ์ธ๋ฑ์‹ฑ์œผ๋กœ ๊ธฐ์กด ์ปค๋„์„ ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค

2.2 ๋ฐฐ์น˜๋ฅผ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”

  • ๋ฐฐ์น˜ ์š”์†Œ ๊ฐ„ ๋ฒ„ํผ๋ฅผ ์žฌ์‚ฌ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค
  • ๋‹ค์–‘ํ•œ ๋ฐฐ์น˜ ํฌ๊ธฐ(2, 4, 8)์—์„œ ์„ฑ๋Šฅ์„ ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ํŒจํ„ด์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค

Puzzle 20: 1D ํ•ฉ์„ฑ๊ณฑ Op

MAX ๊ทธ๋ž˜ํ”„์—์„œ PyTorch ์ปค์Šคํ…€ Op์œผ๋กœ

GPU ํผ์ฆ ์—ฌ์ •์˜ Part V์— ์ง„์ž…ํ–ˆ์Šต๋‹ˆ๋‹ค: PyTorch ์ปค์Šคํ…€ Op ํ†ตํ•ฉํ•˜๊ธฐ.

Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op์—์„œ MAX ๊ทธ๋ž˜ํ”„๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Mojo GPU ์ปค๋„์„ ํŒŒ์ด์ฌ๊ณผ ์—ฐ๋™ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ์ด์ œ๋ถ€ํ„ฐ๋Š” ๋‹ค์Œ์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค:

  • ๋™์ผํ•œ Mojo ์ปค๋„์„ PyTorch์˜ CustomOpLibrary๋กœ ์‚ฌ์šฉํ•˜๊ธฐ
  • PyTorch์˜ ํ…์„œ ์‹œ์Šคํ…œ ๋ฐ ์˜คํ† ๊ทธ๋ž˜๋“œ(autograd)์™€ ํ†ตํ•ฉํ•˜๊ธฐ
  • MAX ๊ทธ๋ž˜ํ”„์™€ PyTorch ๋ฐฉ์‹์˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ๋น„๊ตํ•˜๊ธฐ
  • ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น์ด๋ผ๋Š” ํ•ต์‹ฌ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

์ด ์ „ํ™˜์„ ํ†ตํ•ด ๋™์ผํ•œ ์ตœ์ ํ™”๋œ GPU ์ปค๋„์ด ์„œ๋กœ ๋‹ค๋ฅธ ํŒŒ์ด์ฌ ํ†ตํ•ฉ ๋ฐฉ์‹์—์„œ ์–ด๋–ป๊ฒŒ ๋™์ž‘ํ•˜๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” Puzzle 17: 1D ํ•ฉ์„ฑ๊ณฑ Op์˜ 1D ํ•ฉ์„ฑ๊ณฑ(convolution) ์ปค๋„์„ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ์™€์„œ, MAX ๊ทธ๋ž˜ํ”„ ๋Œ€์‹  CustomOpLibrary๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ PyTorch์™€ ํ†ตํ•ฉํ•ฉ๋‹ˆ๋‹ค.

์—ฌ๊ธฐ์„œ ํ•ต์‹ฌ์€ ๋™์ผํ•œ Mojo ์ปค๋„์ด ์ˆ˜์ • ์—†์ด ๊ทธ๋Œ€๋กœ ๋™์ž‘ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. MAX ๊ทธ๋ž˜ํ”„์™€ PyTorch ๋ฐฉ์‹ ์‚ฌ์ด์—์„œ ๋‹ฌ๋ผ์ง€๋Š” ๊ฒƒ์€ ํŒŒ์ด์ฌ ํ†ตํ•ฉ ๋ ˆ์ด์–ด๋ฟ์ž…๋‹ˆ๋‹ค.

์™„์„ฑํ•  ์ฝ”๋“œ

์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ ค๋ฉด ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ํ˜ธ์ถœํ•˜๋Š” ํ•œ ์ค„๋งŒ ์ฑ„์šฐ๋ฉด ๋ฉ๋‹ˆ๋‹ค:

import torch
from max.experimental.torch import CustomOpLibrary


def conv1d_pytorch(
    input_tensor: torch.Tensor, kernel_tensor: torch.Tensor
) -> torch.Tensor:
    """
    1D convolution using our custom PyTorch operation.

    This demonstrates the transition from MAX Graph (p15) to PyTorch CustomOpLibrary.
    Uses the EXACT same Mojo kernel, but different Python integration!
    """
    # Load our custom operations
    mojo_kernels = Path(__file__).parent / "op"
    ops = CustomOpLibrary(mojo_kernels)

    # Create output tensor with same shape as input
    output_tensor = torch.empty_like(input_tensor)

    # Call our custom conv1d operation with explicit output tensor
    # The Mojo signature expects: (out, input, kernel)
    _conv1d = ops.conv1d[
        {
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
        }
    ]

    # FILL IN with 1 line of code

    return output_tensor


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p20/p20.py

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p20
pixi run -e amd p20
uv run poe p20

์„ฑ๊ณตํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Puzzle 20: From MAX Graph to PyTorch Custom Ops
============================================================
Input array: [ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14.]
Convolution kernel: [0. 1. 2. 3.]

NumPy reference result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]

Testing PyTorch Custom Op (device: cuda)
----------------------------------------
PyTorch custom op result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
โœ… PyTorch custom op verification PASSED

Comparing with MAX Graph approach (like p15)
--------------------------------------------
MAX Graph result: [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
โœ… MAX Graph verification PASSED
โœ… PyTorch and MAX Graph results MATCH

์†”๋ฃจ์…˜

์ปดํŒŒ์ผ๋œ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์„ ์ ์ ˆํ•œ ์ธ์ž์™€ ํ•จ๊ป˜ ํ˜ธ์ถœํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

    # Call our custom conv1d operation with explicit output tensor
    # The Mojo signature expects: (out, input, kernel)
    conv1d = ops.conv1d[
        {
            "input_size": input_tensor.shape[0],
            "conv_size": kernel_tensor.shape[0],
        }
    ]
    torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)

์ด ํ’€์ด๋Š” ๋ช‡ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

1. torch.compile() ํ†ตํ•ฉ

torch.compile ํ†ตํ•ฉ ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)

2. ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น

output_tensor = torch.empty_like(input_tensor)

  • MAX 그래프는 출력 할당을 자동으로 처리하지만
  • PyTorch CustomOpLibrary는 미리 할당된 출력 텐서가 필요합니다
  • Mojo 연산 시그니처는 (out, input, kernel) 순서를 기대합니다

3. ํŒŒ๋ผ๋ฏธํ„ฐ ๋”•์…”๋„ˆ๋ฆฌ

ops.conv1d[{"input_size": input_tensor.shape[0], "conv_size": kernel_tensor.shape[0]}]

  • 파라미터는 딕셔너리 형태로 연산에 전달됩니다
  • 이 값들은 Mojo 커널의 컴파일 타임 파라미터가 됩니다
  • Mojo @staticmethod fn execute 시그니처의 파라미터 이름과 일치해야 합니다

4. ๊ฐ™์€ ์ปค๋„, ๋‹ค๋ฅธ ํ†ตํ•ฉ ๋ฐฉ์‹

๋‚ด๋ถ€์˜ Mojo ์ปค๋„(conv1d_kernel)์€ Puzzle 17๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค:

  • ๋™์ผํ•œ GPU ์ปค๋„ ์ฝ”๋“œ
  • ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋™์ผํ•œ ์—ฐ์‚ฐ ๋กœ์ง
  • ํŒŒ์ด์ฌ ๋ž˜ํผ ๋ ˆ์ด์–ด๋งŒ ๋‹ฌ๋ผ์ง

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์€ PyTorch ์ปค์Šคํ…€ ์—ฐ์‚ฐ์˜ ์ฃผ์š” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ฐœ๋…MAX ๊ทธ๋ž˜ํ”„ (p15)PyTorch CustomOpLibrary (p18)
์ถœ๋ ฅ ํ• ๋‹น์ž๋™์ˆ˜๋™ (torch.empty_like())
์—ฐ์‚ฐ ํ˜ธ์ถœops.custom(...)torch.compile(op)(...)
ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌparameters={...}op[{...}]
๋””๋ฐ”์ด์Šค ๊ด€๋ฆฌ๋ช…์‹œ์  device contextPyTorch ํ…์„œ์˜ device
๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌMAX ๊ทธ๋ž˜ํ”„ ํ…์„œPyTorch ํ…์„œ

ํ•ต์‹ฌ ํŒจํ„ด: ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น

๊ฐ€์žฅ ์ค‘์š”ํ•œ ์ฐจ์ด์ ์€ PyTorch CustomOpLibrary๊ฐ€ ๋ช…์‹œ์  ์ถœ๋ ฅ ํ…์„œ ํ• ๋‹น์„ ์š”๊ตฌํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

# โŒ ๋™์ž‘ํ•˜์ง€ ์•Š์Œ - ์ถœ๋ ฅ ํ…์„œ ์—†์Œ
result = torch.compile(conv1d)(input_tensor, kernel_tensor)

# โœ… ๋™์ž‘ํ•จ - ๋ฏธ๋ฆฌ ํ• ๋‹น๋œ ์ถœ๋ ฅ ํ…์„œ
output_tensor = torch.empty_like(input_tensor)
torch.compile(conv1d)(output_tensor, input_tensor, kernel_tensor)

์ด ํŒจํ„ด์ด ๋ณด์žฅํ•˜๋Š” ๊ฒƒ๋“ค:

  • ์˜ฌ๋ฐ”๋ฅธ ๋””๋ฐ”์ด์Šค์— ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น
  • ์ถœ๋ ฅ ํ…์„œ์˜ shape๊ณผ dtype์ด ์ •ํ™•
  • Mojo ์ปค๋„์ด ์ถœ๋ ฅ ๋ฒ„ํผ์— ์ง์ ‘ ์“ฐ๊ธฐ ๊ฐ€๋Šฅ

torch.compile() ํ†ตํ•ฉ

torch.compile()์ด ํ•„์ˆ˜์ ์ธ ์ด์œ :

  • PyTorch์™€ Mojo ์‚ฌ์ด์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ๋ณ€ํ™˜ ์ฒ˜๋ฆฌ
  • ๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™” ๊ด€๋ฆฌ (CPU โ†” GPU)
  • ํ…์„œ ํฌ๋งท ๋ณ€ํ™˜ ์ตœ์ ํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ ์ œ๊ณต

์ฐธ๊ณ : torch.compile() ์—†์ด ์‚ฌ์šฉํ•˜๋ฉด std::bad_alloc ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋Š” ์›์‹œ ์—ฐ์‚ฐ์ด PyTorch์˜ ํ…์„œ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋ฅผ ์ฒ˜๋ฆฌํ•˜์ง€ ๋ชปํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

์ปค์Šคํ…€ ์—ฐ์‚ฐ ๋””๋ฒ„๊น…

์ž์ฃผ ๋ฐœ์ƒํ•˜๋Š” ๋ฌธ์ œ์™€ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•:

  1. ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์˜ค๋ฅ˜: ํ•ญ์ƒ torch.compile()์„ ์‚ฌ์šฉํ•˜์„ธ์š”
  2. ์ž˜๋ชป๋œ ์ถœ๋ ฅ ํ˜•์ƒ: ์ถœ๋ ฅ ํ…์„œ๊ฐ€ ๊ธฐ๋Œ€ํ•˜๋Š” ์ฐจ์›๊ณผ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”
  3. ๋””๋ฐ”์ด์Šค ๋ถˆ์ผ์น˜: ๋ชจ๋“  ํ…์„œ๊ฐ€ ๊ฐ™์€ ๋””๋ฐ”์ด์Šค์— ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  4. ํŒŒ๋ผ๋ฏธํ„ฐ ์˜ค๋ฅ˜: ํŒŒ๋ผ๋ฏธํ„ฐ ์ด๋ฆ„์ด Mojo ์—ฐ์‚ฐ ์‹œ๊ทธ๋‹ˆ์ฒ˜์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์„ธ์š”

๋””๋ฒ„๊น… ์ ‘๊ทผ๋ฒ•: PyTorch ๊ฒฐ๊ณผ๋ฅผ ๋™์ผํ•œ ์ปค๋„์„ ์‹คํ–‰ํ•˜๋Š” MAX ๊ทธ๋ž˜ํ”„ ๋ ˆํผ๋Ÿฐ์Šค ๊ตฌํ˜„๊ณผ ๋น„๊ตํ•ด ๋ณด์„ธ์š”.

Puzzle 21: ์ž„๋ฒ ๋”ฉ Op

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์„ฑ๋Šฅ

๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ๊ณผ GPU ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”์— ์ดˆ์ ์„ ๋งž์ถฐ Part V๋ฅผ ์ด์–ด๊ฐ‘๋‹ˆ๋‹ค.

Puzzle 20: 1D ํ•ฉ์„ฑ๊ณฑ Op์— ์ด์–ด, ๋™์ผํ•œ ์—ฐ์‚ฐ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ปค๋„ ๊ตฌํ˜„์ด ์„ฑ๋Šฅ์— ์–ผ๋งˆ๋‚˜ ๊ทน์ ์ธ ์ฐจ์ด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋Š”์ง€ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. ๋ฐฐ์šธ ๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • GPU ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ์ด ์ค‘์š”ํ•œ ์ด์œ 
  • ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์ปค๋„์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•
  • ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋”ฉ ์ „๋žต์ด ๊ฐ€์ ธ์˜ค๋Š” ์„ฑ๋Šฅ ์ฐจ์ด

์ด ํผ์ฆ์€ ์–ด๋–ค ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋А๋ƒ๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•˜๋А๋ƒ๊ฐ€ ๋” ์ค‘์š”ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์‹ ๊ฒฝ๋ง์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ์ธ ์ž„๋ฒ ๋”ฉ(embedding) ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋‘ ๊ฐ€์ง€ GPU ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋‘ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

๋น„๊ตํ•  ๋‘ ์ปค๋„:

  • 1D ๋ณ‘ํ•ฉ(coalesced) ์ปค๋„: ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์ตœ์ ํ™”
  • 2D ๋น„๋ณ‘ํ•ฉ(non-coalesced) ์ปค๋„: ๋น„๊ต๋ฅผ ์œ„ํ•œ ์ตœ์ ํ™”๋˜์ง€ ์•Š์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ด ๋น„๊ต๋ฅผ ํ†ตํ•ด GPU ์ปค๋„ ์„ฑ๋Šฅ์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ์ฒด๊ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐฐ๊ฒฝ: ์ž„๋ฒ ๋”ฉ ์—ฐ์‚ฐ

์ž„๋ฒ ๋”ฉ ์—ฐ์‚ฐ์€ ์ด์‚ฐ์ ์ธ ํ† ํฐ ์ธ๋ฑ์Šค๋ฅผ ๋ฐ€์ง‘ ๋ฒกํ„ฐ ํ‘œํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

# Input: token indices
indices = [[1, 5, 2], [7, 1, 9]]           # Shape: [batch_size, seq_len]

# Embedding table (learned parameters)
embedding_table = [                        # Shape: [vocab_size, embed_dim]
    [0.1, 0.2, 0.3, 0.4],  # Token 0
    [0.5, 0.6, 0.7, 0.8],  # Token 1
    [0.9, 1.0, 1.1, 1.2],  # Token 2
    # ... more tokens
]

# Output: embedded vectors
output[0,0] = embedding_table[1]  # [0.5, 0.6, 0.7, 0.8]
output[0,1] = embedding_table[5]  # lookup token 5's embedding
output[0,2] = embedding_table[2]  # [0.9, 1.0, 1.1, 1.2]
# ... and so on

์ด ์—ฐ์‚ฐ์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž…๋‹ˆ๋‹ค. ์„ฑ๋Šฅ์€ ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”์—์„œ ์–ผ๋งˆ๋‚˜ ํšจ์œจ์ ์œผ๋กœ ์ฝ๊ณ  ์ถœ๋ ฅ ํ…์„œ์— ์“ธ ์ˆ˜ ์žˆ๋А๋ƒ์— ๋‹ฌ๋ ค ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•™์Šต ๊ฒฝ๋กœ

์ด ํผ์ฆ์€ ์ฒด๊ณ„์ ์ธ ์ดํ•ด๋ฅผ ์œ„ํ•ด ๋‘ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ์ปค๋„

์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์‹ค์ œ ํผ์ฆ ์ฝ”๋“œ๋ฅผ ๊ตฌํ˜„ํ•˜๊ณ  ์ปค๋„ ๊ตฌํ˜„์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ํ•˜๊ฒŒ ๋ ๊นŒ์š”:

  • ๋‘ ๊ฐ€์ง€ GPU ์ž„๋ฒ ๋”ฉ ์ปค๋„ ์™„์„ฑ (1D ๋ณ‘ํ•ฉ vs 2D ๋น„๋ณ‘ํ•ฉ)
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ๋ณธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ•™์Šต
  • ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋”ฉ ์ „๋žต์œผ๋กœ ๊ตฌํ˜„ํ•˜๋Š” ์‚ฌ๋ก€ ํ™•์ธ
  • Mojo์—์„œ์˜ ์ปค์Šคํ…€ ์—ฐ์‚ฐ ๋“ฑ๋ก ์ดํ•ด

์„ฑ๋Šฅ ๋น„๊ต

์ปค๋„ ์„ฑ๋Šฅ์ด ์™œ ๋‹ค๋ฅธ์ง€, ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์˜ ์ด๋ก ์„ ๊นŠ์ด ํŒŒ๊ณ ๋“ญ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ๋ฐฐ์šธ๊นŒ์š”:

  • GPU ์„ฑ๋Šฅ์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์ค‘์š”ํ•œ ์ด์œ 
  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์ด ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ
  • ์‹ ๊ฒฝ๋ง ์ตœ์ ํ™”์— ๋Œ€ํ•œ ์‹ค์ œ ์‹œ์‚ฌ์ 
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ตœ์ ํ™” ์ „๋žต

์‹œ์ž‘ํ•˜๊ธฐ

GPU ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ฅผ ํƒ๊ตฌํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ์ปค๋„ ์—์„œ ์ฝ”๋“œ๋ฅผ ๊ตฌํ˜„ํ•œ ํ›„, ์„ฑ๋Šฅ ๋น„๊ต ๋กœ ๋„˜์–ด๊ฐ€ ์„ฑ๋Šฅ ์ฐจ์ด์˜ ์›์ธ์„ ์ดํ•ดํ•ด ๋ณด์„ธ์š”.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์„œ๋กœ ๋‹ค๋ฅธ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ(1D vs 2D)์ด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ฃผ์˜ ๊นŠ๊ฒŒ ์‚ดํŽด๋ณด์„ธ์š”. ์ด ํ†ต์ฐฐ์€ ์ž„๋ฒ ๋”ฉ์„ ๋„˜์–ด ๋‹ค์–‘ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์‹œ๋‚˜๋ฆฌ์˜ค์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ž„๋ฒ ๋”ฉ ์ปค๋„: ๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ

์ด ํผ์ฆ์—์„œ๋Š” ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜์ง€๋งŒ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ๋‘ ๊ฐ€์ง€ GPU ์ž„๋ฒ ๋”ฉ ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. GPU ์„ฑ๋Šฅ์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ์ง์ ‘ ์ฒดํ—˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

1D ๋ณ‘ํ•ฉ ์ปค๋„ (์ตœ์ ํ™”๋œ ์ ‘๊ทผ๋ฒ•)

์ด ์ปค๋„์€ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ์ถœ๋ ฅ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹จ์ˆœํ•œ 1D ๊ทธ๋ฆฌ๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•˜์—ฌ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๋‹ฌ์„ฑํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: [total_elements // 256] ๋ธ”๋ก, ๋ธ”๋ก๋‹น 256 ์Šค๋ ˆ๋“œ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ (batch, seq, embed) ์œ„์น˜ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ์ž„๋ฒ ๋”ฉ ์ฐจ์› ์ ‘๊ทผ

๊ตฌํ˜„ํ•  ๋‚ด์šฉ:

  1. ๋ธ”๋ก ์ธ๋ฑ์Šค์™€ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋กœ๋ถ€ํ„ฐ ์ „์—ญ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ
  2. 1์ฐจ์› ์ธ๋ฑ์Šค๋ฅผ 3D ์ขŒํ‘œ (batch_idx, seq_idx, embed_idx)๋กœ ๋ณ€ํ™˜
  3. indices ํ…์„œ์—์„œ ํ† ํฐ ์ธ๋ฑ์Šค ์กฐํšŒ
  4. ํ•ด๋‹นํ•˜๋Š” ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ ์š”์†Œ๋ฅผ ์ถœ๋ ฅ์— ๋ณต์‚ฌ

์™„์„ฑํ•  ์ฝ”๋“œ

๋‘ ์ž„๋ฒ ๋”ฉ ์ปค๋„์˜ ๋นˆ ๋ถ€๋ถ„์„ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

comptime THREADS_PER_BLOCK = 256


def embedding_kernel_coalesced[
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    OutLayout: TensorLayout,
    IndicesLayout: TensorLayout,
    WeightsLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
    weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
):
    """
    Memory-coalescing focused embedding kernel.

    Key insight: The bottleneck is memory access patterns, not computation.
    - Each thread handles one (batch, seq, embed) position
    - Simple 1D grid for maximum simplicity and correctness
    - Focus on getting memory access right first
    """

    # Simple 1D indexing - each thread = one output element
    var global_idx = block_idx.x * block_dim.x + thread_idx.x
    var total_elements = batch_size * seq_len * embed_dim

    if global_idx >= total_elements:
        return

    # Convert to (batch, seq, embed) coordinates
    # FILL IN roughly 4 lines

    # Get token index
    # FILL IN 1 line

    # Simple, correct assignment
    # FILL IN 4 lines


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p21/op/embedding.mojo

ํŒ
  • global_idx = block_idx.x * block_dim.x + thread_idx.x๋กœ ์‹œ์ž‘ํ•˜์„ธ์š”
  • ๋‚˜๋ˆ—์…ˆ๊ณผ ๋‚˜๋จธ์ง€ ์—ฐ์‚ฐ์œผ๋กœ 3D ์ขŒํ‘œ๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค: batch_idx = global_idx // (seq_len * embed_dim)
  • remaining = global_idx % (seq_len * embed_dim)์„ ์‚ฌ์šฉํ•˜๋ฉด ์ดํ›„ ๊ณ„์‚ฐ์ด ๊ฐ„๋‹จํ•ด์ง‘๋‹ˆ๋‹ค
  • ํ•ญ์ƒ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํ•˜์„ธ์š”: if global_idx >= total_elements: return
  • ์œ ํšจํ•˜์ง€ ์•Š์€ ํ† ํฐ ์ธ๋ฑ์Šค๋Š” ์ถœ๋ ฅ์„ 0์œผ๋กœ ์„ค์ •ํ•˜์„ธ์š”
  • ์ž„๋ฒ ๋”ฉ ์กฐํšŒ: output[batch_idx, seq_idx, embed_idx] = weights[token_idx, embed_idx]

2D ๋น„๋ณ‘ํ•ฉ ์ปค๋„ (๋น„๊ต์šฉ ์ ‘๊ทผ๋ฒ•)

์ด ์ปค๋„์€ X ์ฐจ์›์ด (batch ร— seq) ์œ„์น˜๋ฅผ, Y ์ฐจ์›์ด ์ž„๋ฒ ๋”ฉ ์ฐจ์›์„ ๋‹ด๋‹นํ•˜๋Š” 2D ๊ทธ๋ฆฌ๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋ณ‘ํ•ฉ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: [batch x seq // 16, embed_dim // 16] ๋ธ”๋ก, 16 x 16 ์Šค๋ ˆ๋“œ
  • ์Šค๋ ˆ๋“œ ๋งคํ•‘: thread_idx.x๋Š” batch/sequence์—, thread_idx.y๋Š” ์ž„๋ฒ ๋”ฉ ์ฐจ์›์— ๋งคํ•‘
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ํฉ์–ด์ง„ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Œ

๊ตฌํ˜„ํ•  ๋‚ด์šฉ:

  1. 2D ๊ทธ๋ฆฌ๋“œ์—์„œ X, Y ์ขŒํ‘œ ๊ณ„์‚ฐ
  2. X ์ขŒํ‘œ๋ฅผ batch ์ธ๋ฑ์Šค์™€ sequence ์ธ๋ฑ์Šค๋กœ ๋ถ„๋ฆฌ
  3. Y ์ขŒํ‘œ๋ฅผ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์œผ๋กœ ์ง์ ‘ ์‚ฌ์šฉ
  4. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ ๋™์ผํ•œ ์ž„๋ฒ ๋”ฉ ์กฐํšŒ ์ˆ˜ํ–‰

์™„์„ฑํ•  ์ฝ”๋“œ

๋‘ ์ž„๋ฒ ๋”ฉ ์ปค๋„์˜ ๋นˆ ๋ถ€๋ถ„์„ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

def embedding_kernel_2d[
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    OutLayout: TensorLayout,
    IndicesLayout: TensorLayout,
    WeightsLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
    weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
):
    """
    2D grid non-coalesced embedding kernel.

    Non-optimal approach for comparison:
    - 2D grid: (batch*seq, embed_dim)
    - More complex indexing
    - Potentially worse memory access patterns
    """

    # 2D grid indexing
    var batch_seq_idx = block_idx.x * block_dim.x + thread_idx.x
    var embed_idx = block_idx.y * block_dim.y + thread_idx.y
    var total_positions = batch_size * seq_len

    if batch_seq_idx >= total_positions or embed_idx >= embed_dim:
        return

    # Convert to (batch, seq) coordinates
    # FILL IN 2 lines

    # Get token index
    # FILL IN 1 line

    # Assignment with 2D grid pattern
    # FILL IN 4 lines


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p21/op/embedding.mojo

ํŒ
  • X, Y ์Šค๋ ˆ๋“œ ์ขŒํ‘œ๋ฅผ ๋ชจ๋‘ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: batch_seq_idx = block_idx.x * block_dim.x + thread_idx.x
  • ๊ทธ๋ฆฌ๊ณ : embed_idx = block_idx.y * block_dim.y + thread_idx.y
  • batch_seq_idx๋ฅผ batch์™€ sequence ์ธ๋ฑ์Šค๋กœ ๋ถ„๋ฆฌํ•ฉ๋‹ˆ๋‹ค: batch_idx = batch_seq_idx // seq_len
  • ๋‘ ์ฐจ์› ๋ชจ๋‘ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์žŠ์ง€ ๋งˆ์„ธ์š”: if batch_seq_idx >= total_positions or embed_idx >= embed_dim
  • ํ† ํฐ ์กฐํšŒ๋Š” 1D์™€ ๋™์ผํ•˜์ง€๋งŒ, ์Šค๋ ˆ๋“œ๋‹น ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋”ฉ ์ฐจ์›๋งŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ์ด ์ปค๋„์€ ์ „์ฒด ๋ฒกํ„ฐ๊ฐ€ ์•„๋‹Œ ์Šค๋ ˆ๋“œ๋‹น ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์ปค์Šคํ…€ op ๋“ฑ๋ก

์ปค๋„๋“ค์€ PyTorch์™€ ์‰ฝ๊ฒŒ ํ†ตํ•ฉํ•  ์ˆ˜ ์žˆ๋„๋ก ์ปค์Šคํ…€ ์—ฐ์‚ฐ์œผ๋กœ ๋ž˜ํ•‘๋ฉ๋‹ˆ๋‹ค. ๋“ฑ๋ก ํŒจํ„ด์€ MAX ๊ทธ๋ž˜ํ”„ ์ปค์Šคํ…€ op ์ดํ•ดํ•˜๊ธฐ์—์„œ ์„ค๋ช…ํ•œ MAX ์ปค์Šคํ…€ op๊ณผ ๋™์ผํ•ฉ๋‹ˆ๋‹ค:

1D ๋ณ‘ํ•ฉ ์—ฐ์‚ฐ

์ด ์—ฐ์‚ฐ์€ ์ตœ์ ํ™”๋œ 1D ์ž„๋ฒ ๋”ฉ ์ปค๋„์„ "embedding"์œผ๋กœ ๋“ฑ๋กํ•ฉ๋‹ˆ๋‹ค:

import compiler
from std.runtime.asyncrt import DeviceContextPtr
from tensor import InputTensor, OutputTensor
from std.memory import UnsafePointer
from std.gpu.host import DeviceBuffer


@compiler.register("embedding")
struct EmbeddingCustomOp:
    @staticmethod
    def execute[
        target: StaticString,
        batch_size: Int,
        seq_len: Int,
        vocab_size: Int,
        embed_dim: Int,
    ](
        output: OutputTensor[
            dtype=DType.float32, rank=3, static_spec=_
        ],  # [batch_size, seq_len, embed_dim]
        indices: InputTensor[
            dtype=DType.int32, rank=2, static_spec=_
        ],  # [batch_size, seq_len]
        weights: InputTensor[
            dtype=output.dtype, rank=2, static_spec=_
        ],  # [vocab_size, embed_dim]
        ctx: DeviceContextPtr,
    ) raises:
        comptime out_layout_val = row_major[batch_size, seq_len, embed_dim]()
        comptime OutLayout = type_of(out_layout_val)
        comptime indices_layout_val = row_major[batch_size, seq_len]()
        comptime IndicesLayout = type_of(indices_layout_val)
        comptime weights_layout_val = row_major[vocab_size, embed_dim]()
        comptime WeightsLayout = type_of(weights_layout_val)

        var output_tensor = TileTensor[
            mut=True, output.dtype, OutLayout, MutAnyOrigin
        ](output.unsafe_ptr(), out_layout_val)
        var indices_tensor = TileTensor[
            mut=True, DType.int32, IndicesLayout, MutAnyOrigin
        ](indices.unsafe_ptr(), indices_layout_val)
        var weights_tensor = TileTensor[
            mut=True, output.dtype, WeightsLayout, MutAnyOrigin
        ](weights.unsafe_ptr(), weights_layout_val)

        comptime if target == "gpu":
            var gpu_ctx = ctx.get_device_context()

            # Zero out output tensor
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output.dtype](
                    gpu_ctx,
                    output.unsafe_ptr(),
                    batch_size * seq_len * embed_dim,
                    owning=False,
                ),
                0,
            )

            # Calculate 1D grid dimensions (matching kernel's flat indexing)
            var total_elements = batch_size * seq_len * embed_dim
            var blocks = max(1, ceildiv(total_elements, THREADS_PER_BLOCK))

            # Compile and launch optimized kernel
            comptime kernel = embedding_kernel_coalesced[
                batch_size,
                seq_len,
                vocab_size,
                embed_dim,
                OutLayout,
                IndicesLayout,
                WeightsLayout,
                output.dtype,
            ]
            var compiled_kernel = gpu_ctx.compile_function[kernel, kernel]()

            gpu_ctx.enqueue_function(
                compiled_kernel,
                output_tensor,
                indices_tensor,
                weights_tensor,
                grid_dim=(blocks,),
                block_dim=(THREADS_PER_BLOCK,),
            )

        elif target == "cpu":
            for batch in range(batch_size):
                for seq in range(seq_len):
                    var token_idx_val = Int(indices_tensor[batch, seq])
                    if token_idx_val >= 0 and token_idx_val < vocab_size:
                        for emb in range(embed_dim):
                            output_tensor[batch, seq, emb] = weights_tensor[
                                token_idx_val, emb
                            ]
        else:
            raise Error("Unsupported target: " + target)


๋“ฑ๋ก์˜ ํ•ต์‹ฌ ์š”์†Œ:

  • ๋‹จ์ˆœํ•œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ceildiv(total_elements, THREADS_PER_BLOCK) ๋ธ”๋ก์œผ๋กœ ์ง๊ด€์ ์ธ 1D ๊ทธ๋ฆฌ๋“œ ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ๋‹จ์ผ enqueue_memset ํ˜ธ์ถœ๋กœ ์ถœ๋ ฅ ๋ฒ„ํผ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์ดˆ๊ธฐํ™”
  • ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ: ๋ชจ๋“  ํ…์„œ ์ฐจ์›์„ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ „๋‹ฌํ•˜์—ฌ ์ตœ์  ์„ฑ๋Šฅ ๋‹ฌ์„ฑ
  • ๋””๋ฐ”์ด์Šค ์ถ”์ƒํ™”: GPU ์‹คํ–‰๊ณผ CPU ํด๋ฐฑ์„ ๋งค๋„๋Ÿฝ๊ฒŒ ์ฒ˜๋ฆฌ

2D ๋น„๋ณ‘ํ•ฉ ์—ฐ์‚ฐ

์ด ์—ฐ์‚ฐ์€ ๋น„๊ต์šฉ 2D ์ž„๋ฒ ๋”ฉ ์ปค๋„์„ "embedding_2d"๋กœ ๋“ฑ๋กํ•ฉ๋‹ˆ๋‹ค:

@compiler.register("embedding_2d")
struct Embedding2DCustomOp:
    @staticmethod
    def execute[
        target: StaticString,
        batch_size: Int,
        seq_len: Int,
        vocab_size: Int,
        embed_dim: Int,
    ](
        output: OutputTensor[
            dtype=DType.float32, rank=3, static_spec=_
        ],  # [batch_size, seq_len, embed_dim]
        indices: InputTensor[
            dtype=DType.int32, rank=2, static_spec=_
        ],  # [batch_size, seq_len]
        weights: InputTensor[
            dtype=output.dtype, rank=2, static_spec=_
        ],  # [vocab_size, embed_dim]
        ctx: DeviceContextPtr,
    ) raises:
        comptime out_layout_val = row_major[batch_size, seq_len, embed_dim]()
        comptime OutLayout = type_of(out_layout_val)
        comptime indices_layout_val = row_major[batch_size, seq_len]()
        comptime IndicesLayout = type_of(indices_layout_val)
        comptime weights_layout_val = row_major[vocab_size, embed_dim]()
        comptime WeightsLayout = type_of(weights_layout_val)

        var output_tensor = TileTensor[
            mut=True, output.dtype, OutLayout, MutAnyOrigin
        ](output.unsafe_ptr(), out_layout_val)
        var indices_tensor = TileTensor[
            mut=True, DType.int32, IndicesLayout, MutAnyOrigin
        ](indices.unsafe_ptr(), indices_layout_val)
        var weights_tensor = TileTensor[
            mut=True, output.dtype, WeightsLayout, MutAnyOrigin
        ](weights.unsafe_ptr(), weights_layout_val)

        comptime if target == "gpu":
            var gpu_ctx = ctx.get_device_context()

            # Zero out output tensor
            gpu_ctx.enqueue_memset(
                DeviceBuffer[output.dtype](
                    gpu_ctx,
                    output.unsafe_ptr(),
                    batch_size * seq_len * embed_dim,
                    owning=False,
                ),
                0,
            )

            # Calculate 2D grid dimensions for non-coalesced access
            var total_positions = batch_size * seq_len
            comptime BLOCK_X = 16  # batch*seq dimension
            comptime BLOCK_Y = 16  # embed dimension
            var blocks_x = max(1, ceildiv(total_positions, BLOCK_X))
            var blocks_y = max(1, ceildiv(embed_dim, BLOCK_Y))

            # Compile and launch 2D kernel
            comptime kernel = embedding_kernel_2d[
                batch_size,
                seq_len,
                vocab_size,
                embed_dim,
                OutLayout,
                IndicesLayout,
                WeightsLayout,
                output.dtype,
            ]

            var compiled_kernel = gpu_ctx.compile_function[kernel, kernel]()

            gpu_ctx.enqueue_function(
                compiled_kernel,
                output_tensor,
                indices_tensor,
                weights_tensor,
                grid_dim=(blocks_x, blocks_y),
                block_dim=(BLOCK_X, BLOCK_Y),
            )

        elif target == "cpu":
            # Same CPU fallback as 1D version
            for batch in range(batch_size):
                for seq in range(seq_len):
                    var token_idx_val = Int(indices_tensor[batch, seq])
                    if token_idx_val >= 0 and token_idx_val < vocab_size:
                        for emb in range(embed_dim):
                            output_tensor[batch, seq, emb] = weights_tensor[
                                token_idx_val, emb
                            ]
        else:
            raise Error("Unsupported target: " + target)


1D ์—ฐ์‚ฐ๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด์ :

  • ๋ณต์žกํ•œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: blocks_x์™€ blocks_y๋ฅผ ๋ณ„๋„๋กœ ๊ณ„์‚ฐํ•˜๋Š” 2D ๊ทธ๋ฆฌ๋“œ ์‚ฌ์šฉ
  • ๊ณ ์ • ๋ธ”๋ก ์ฐจ์›: 2D ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์„ ์œ„ํ•ด BLOCK_X = 16, BLOCK_Y = 16์œผ๋กœ ๊ณ ์ •
  • ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ: ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™”์™€ CPU ํด๋ฐฑ ๋กœ์ง์€ ๋™์ผ
  • ๋‹ค๋ฅธ ์ปค๋„ ํ˜ธ์ถœ ๋ฐฉ์‹: 2D ๊ทธ๋ฆฌ๋“œ ์ฐจ์› (blocks_x, blocks_y)๊ณผ ๋ธ”๋ก ์ฐจ์› (BLOCK_X, BLOCK_Y) ์ „๋‹ฌ

๊ณตํ†ต ๋ž˜ํผ ๊ธฐ๋Šฅ

๋‘ ์ปค์Šคํ…€ ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํ•„์ˆ˜ ์ธํ”„๋ผ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  1. ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ:

    • enqueue_memset์œผ๋กœ ์ถœ๋ ฅ ํ…์„œ 0 ์ดˆ๊ธฐํ™”
    • ์ ์ ˆํ•œ ๋ฒ„ํผ ์ƒ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ฒ˜๋ฆฌ
    • ์ž๋™ ์ •๋ฆฌ ๋ฐ ๋ฆฌ์†Œ์Šค ๊ด€๋ฆฌ
  2. ๋””๋ฐ”์ด์Šค ์ถ”์ƒํ™”:

    • ์ตœ์ ํ™”๋œ ์ปค๋„๋กœ GPU ์‹คํ–‰
    • ํ˜ธํ™˜์„ฑ๊ณผ ๋””๋ฒ„๊น…์„ ์œ„ํ•œ CPU ํด๋ฐฑ
    • ์‹คํ–‰ ๋Œ€์ƒ์— ๊ด€๊ณ„์—†์ด ์ผ๊ด€๋œ ์ธํ„ฐํŽ˜์ด์Šค
  3. ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌ:

    • ์ปค๋„ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ปดํŒŒ์ผ ํƒ€์ž„ ํ…์„œ ์ฐจ์›
    • ๋ ˆ์ด์•„์›ƒ ํ…์„œ ๋ณ€ํ™˜์„ ํ†ตํ•œ ๋Ÿฐํƒ€์ž„ ํ…์„œ ๋ฐ์ดํ„ฐ
    • ํƒ€์ž… ์•ˆ์ „ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ฒ€์ฆ
  4. ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ:

    • ์ตœ์ ์˜ ๊ทธ๋ฆฌ๋“œ ์ฐจ์› ์ž๋™ ๊ณ„์‚ฐ
    • ๊ฐ ์ปค๋„์˜ ์ ‘๊ทผ ํŒจํ„ด์— ์ตœ์ ํ™”๋œ ์„œ๋กœ ๋‹ค๋ฅธ ์ „๋žต
    • ์ ์ ˆํ•œ ๋ธ”๋ก ์ฐจ์› ๊ด€๋ฆฌ

PyTorch ํ†ตํ•ฉ

๋“ฑ๋ก๋œ ์—ฐ์‚ฐ์€ CustomOpLibrary๋ฅผ ํ†ตํ•ด ํŒŒ์ด์ฌ์—์„œ ํ˜ธ์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# Load the custom operations
ops = CustomOpLibrary(mojo_kernels)

# Call the 1D coalesced version
result_1d = ops.embedding[{"batch_size": B, "seq_len": L, "vocab_size": V, "embed_dim": E}](
    indices, weights
)

# Call the 2D non-coalesced version
result_2d = ops.embedding_2d[{"batch_size": B, "seq_len": L, "vocab_size": V, "embed_dim": E}](
    indices, weights
)

์ด ์ ‘๊ทผ๋ฒ•์˜ ์žฅ์ ์€ ๋™์ผํ•œ ์ปค๋„ ๊ตฌํ˜„์„ ๋‹ค์–‘ํ•œ ํŒŒ์ด์ฌ ํ”„๋ ˆ์ž„์›Œํฌ์—์„œ ์‚ฌ์šฉํ•˜๋ฉด์„œ๋„ ์ตœ์ ์˜ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

๋‹ค์Œ ๋ช…๋ น์œผ๋กœ ํผ์ฆ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

pixi run p21
pixi run -e amd p21
uv run poe p21

์„ฑ๊ณตํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๋น„์Šทํ•œ ์ถœ๋ ฅ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

Puzzle 21: Mojo Embedding Kernel Comparison
======================================================================
Configuration: B=8, L=512, V=10000, E=512
------------------------------------------------------------

Testing Correctness...
   1D Coalesced - Max difference: 1.19e-07
   2D Non-coalesced - Max difference: 1.19e-07
   โœ… Both implementations CORRECT

Benchmarking Mojo Kernels...

Performance Results:
   1D Coalesced:     2.145 ms
   2D Non-coalesced: 3.867 ms
   1D is 1.80x faster than 2D

Key Learning Points:
โ€ข Compare different GPU kernel implementations
โ€ข 1D vs 2D grid patterns have different memory access
โ€ข Coalesced memory access should be faster
โ€ข Grid configuration affects GPU utilization

์†”๋ฃจ์…˜

๋‘ ์ปค๋„์˜ ์ขŒํ‘œ ๋ณ€ํ™˜๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค:

1D ๋ณ‘ํ•ฉ ์ปค๋„

def embedding_kernel_coalesced[
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    OutLayout: TensorLayout,
    IndicesLayout: TensorLayout,
    WeightsLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
    weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
):
    """
    Memory-coalescing focused embedding kernel.

    Key insight: The bottleneck is memory access patterns, not computation.
    - Each thread handles one (batch, seq, embed) position
    - Simple 1D grid for maximum simplicity and correctness
    - Focus on getting memory access right first
    """

    # Simple 1D indexing - each thread = one output element
    var global_idx = block_idx.x * block_dim.x + thread_idx.x
    var total_elements = batch_size * seq_len * embed_dim

    if global_idx >= total_elements:
        return

    var output_lt = output.to_layout_tensor()
    var indices_lt = indices.to_layout_tensor()
    var weights_lt = weights.to_layout_tensor()

    # Convert to (batch, seq, embed) coordinates
    var batch_idx = global_idx // (seq_len * embed_dim)
    var remaining = global_idx % (seq_len * embed_dim)
    var seq_idx = remaining // embed_dim
    var embed_idx = remaining % embed_dim

    # Get token index
    var token_idx_val = Int(indices_lt[batch_idx, seq_idx])

    # Simple, correct assignment
    if token_idx_val >= 0 and token_idx_val < vocab_size:
        output_lt[batch_idx, seq_idx, embed_idx] = weights_lt[
            token_idx_val, embed_idx
        ]
    else:
        output_lt[batch_idx, seq_idx, embed_idx] = 0


2D ๋น„๋ณ‘ํ•ฉ ์ปค๋„

def embedding_kernel_2d[
    batch_size: Int,
    seq_len: Int,
    vocab_size: Int,
    embed_dim: Int,
    OutLayout: TensorLayout,
    IndicesLayout: TensorLayout,
    WeightsLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    indices: TileTensor[mut=False, DType.int32, IndicesLayout, MutAnyOrigin],
    weights: TileTensor[mut=False, dtype, WeightsLayout, MutAnyOrigin],
):
    """
    2D grid non-coalesced embedding kernel.

    Non-optimal approach for comparison:
    - 2D grid: (batch*seq, embed_dim)
    - More complex indexing
    - Potentially worse memory access patterns
    """

    # 2D grid indexing
    var batch_seq_idx = block_idx.x * block_dim.x + thread_idx.x
    var embed_idx = block_idx.y * block_dim.y + thread_idx.y

    var total_positions = batch_size * seq_len

    # Bounds check
    if batch_seq_idx >= total_positions or embed_idx >= embed_dim:
        return

    var output_lt = output.to_layout_tensor()
    var indices_lt = indices.to_layout_tensor()
    var weights_lt = weights.to_layout_tensor()

    # Convert to (batch, seq) coordinates
    var batch_idx = batch_seq_idx // seq_len
    var seq_idx = batch_seq_idx % seq_len

    # Get token index
    var token_idx_val = Int(indices_lt[batch_idx, seq_idx])

    # Assignment with 2D grid pattern
    if token_idx_val >= 0 and token_idx_val < vocab_size:
        output_lt[batch_idx, seq_idx, embed_idx] = weights_lt[
            token_idx_val, embed_idx
        ]
    else:
        output_lt[batch_idx, seq_idx, embed_idx] = 0


๋‘ ํ’€์ด ๋ชจ๋‘ ๋™์ผํ•œ ์ž„๋ฒ ๋”ฉ ์กฐํšŒ ๋กœ์ง์„ ๊ตฌํ˜„ํ•˜์ง€๋งŒ ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค:

์ฃผ์š” ์ฐจ์ด์ 

  1. ์Šค๋ ˆ๋“œ ๋งคํ•‘:

    • 1D ์ปค๋„: ์ถœ๋ ฅ ์š”์†Œ๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ, ๋‹จ์ˆœํ•œ 1์ฐจ์› ์ธ๋ฑ์‹ฑ
    • 2D ์ปค๋„: (batchร—seq, embed_dim) ์ขŒํ‘œ์— ๋Œ€ํ•œ 2D ๊ทธ๋ฆฌ๋“œ ๋งคํ•‘
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • 1D ์ปค๋„: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์— ์ ‘๊ทผ โ†’ ๋ณ‘ํ•ฉ๋จ
    • 2D ์ปค๋„: ์Šค๋ ˆ๋“œ ์ ‘๊ทผ ํŒจํ„ด์ด ๋ธ”๋ก ๊ตฌ์„ฑ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง โ†’ ๋ณ‘ํ•ฉ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Œ
  3. ์ธ๋ฑ์‹ฑ ๋ณต์žก๋„:

    • 1D ์ปค๋„: ๋‹จ์ผ ๋‚˜๋ˆ—์…ˆ/๋‚˜๋จธ์ง€ ์ฒด์ธ์œผ๋กœ 3D ์ขŒํ‘œ ๊ณ„์‚ฐ
    • 2D ์ปค๋„: X/Y ์ขŒํ‘œ๋ฅผ ๋ณ„๋„๋กœ ๊ณ„์‚ฐ

์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ

1D ์ปค๋„์ด ์ผ๋ฐ˜์ ์œผ๋กœ ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ์ด์œ :

  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผ
  • ๋‹จ์ˆœํ•œ ์ธ๋ฑ์‹ฑ: ์ขŒํ‘œ ๊ณ„์‚ฐ์˜ ์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋‚ฎ์Œ
  • ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

2D ์ปค๋„์˜ ์„ฑ๋Šฅ์ด ๋–จ์–ด์งˆ ์ˆ˜ ์žˆ๋Š” ์ด์œ :

  • ํฉ์–ด์ง„ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ณต์žกํ•œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: 16ร—16 ๋ธ”๋ก์ด ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ๊ณผ ์ตœ์ ์œผ๋กœ ๋งž์ง€ ์•Š์„ ์ˆ˜ ์žˆ์Œ
  • ์›Œํ”„ ๋ถ„๊ธฐ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ์‹คํ–‰ ๊ฒฝ๋กœ๋ฅผ ๋”ฐ๋ฅผ ์ˆ˜ ์žˆ์Œ

ํ•ต์‹ฌ ๊ฐœ๋…

๊ฐœ๋…1D ๋ณ‘ํ•ฉ2D ๋น„๋ณ‘ํ•ฉ
์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ1D 1์ฐจ์› ์ธ๋ฑ์‹ฑ2D ๊ทธ๋ฆฌ๋“œ (batchร—seq, embed)
๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์—ฐ์†๋œ ์ฃผ์†Œํฉ์–ด์งˆ ์ˆ˜ ์žˆ์Œ
๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ๋‹จ์ˆœ: [total_elements // 256]๋ณต์žก: [batchร—seq // 16, embed // 16]
์„ฑ๋Šฅ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ์ตœ์ ํ™”์ตœ์ ํ™”๋˜์ง€ ์•Š์€ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด
์‚ฌ์šฉ ๋ชฉ์ ํ”„๋กœ๋•์…˜ ์ปค๋„๊ต์œก์šฉ ๋น„๊ต

ํ•ต์‹ฌ ๊ตํ›ˆ: ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์€ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ 2~3๋ฐฐ์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์„ฑ๋Šฅ: ๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ GPU ์„ฑ๋Šฅ ์ตœ์ ํ™”์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค. ์ด ์„น์…˜์—์„œ๋Š” ์ž„๋ฒ ๋”ฉ ์กฐํšŒ์™€ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์™œ ๋น„๋ณ‘ํ•ฉ ํŒจํ„ด๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š”์ง€ ์„ค๋ช…ํ•ฉ๋‹ˆ๋‹ค.

๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ๊ธฐ์ดˆ

๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์€ ์›Œํ”„ ๋‚ด ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. GPU๋Š” ์ด๋Ÿฌํ•œ ๊ฐœ๋ณ„ ๋ฉ”๋ชจ๋ฆฌ ์š”์ฒญ์„ ๋” ์ ์€ ์ˆ˜์˜ ๋Œ€์šฉ๋Ÿ‰ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜์œผ๋กœ ๊ฒฐํ•ฉํ•˜์—ฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋„๋ฅผ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.

๋ณ‘ํ•ฉ vs ๋น„๋ณ‘ํ•ฉ ์ ‘๊ทผ

๋ณ‘ํ•ฉ (ํšจ์œจ์ ):

- Thread 0 โ†’ Address 0x1000
- Thread 1 โ†’ Address 0x1004
- Thread 2 โ†’ Address 0x1008
- Thread 3 โ†’ Address 0x100C
- ...

๊ฒฐ๊ณผ: ์›Œํ”„ ์ „์ฒด(32๊ฐœ ์Šค๋ ˆ๋“œ)์— ๋Œ€ํ•ด 1๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

๋น„๋ณ‘ํ•ฉ (๋น„ํšจ์œจ์ ):

- Thread 0 โ†’ Address 0x1000
- Thread 1 โ†’ Address 0x2000
- Thread 2 โ†’ Address 0x3000
- Thread 3 โ†’ Address 0x4000
- ...

๊ฒฐ๊ณผ: ์ตœ๋Œ€ 32๋ฒˆ์˜ ๊ฐœ๋ณ„ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

์ž„๋ฒ ๋”ฉ ์—ฐ์‚ฐ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ธ ์ด์œ 

์ž„๋ฒ ๋”ฉ ์กฐํšŒ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํŠน์„ฑ ๋•Œ๋ฌธ์— ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž…๋‹ˆ๋‹ค:

  • ์ตœ์†Œํ•œ์˜ ์—ฐ์‚ฐ: ํ•˜๋Š” ์ผ์ด๋ผ๊ณค ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์ถœ๋ ฅ์œผ๋กœ ๋ณต์‚ฌํ•˜๋Š” ๊ฒƒ๋ฟ
  • ํฐ ๋ฉ”๋ชจ๋ฆฌ ํ’‹ํ”„๋ฆฐํŠธ: ์ž„๋ฒ ๋”ฉ ํ…Œ์ด๋ธ”์€ ์ˆ˜ ๊ธฐ๊ฐ€๋ฐ”์ดํŠธ์— ๋‹ฌํ•  ์ˆ˜ ์žˆ์Œ
  • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์š”๊ตฌ: ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ ์ „์†ก์ด ํ•„์š”

์ด๋Ÿฌํ•œ ์—ฐ์‚ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ๋ณต์žก๋„๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์ด ์„ฑ๋Šฅ์„ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

์ปค๋„ ๋น„๊ต

1D ๋ณ‘ํ•ฉ ์ปค๋„

  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ: [total_elements // 256] ๋ธ”๋ก, ์ถœ๋ ฅ ์š”์†Œ๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์—ฐ์†๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†๋œ ์ž„๋ฒ ๋”ฉ ์ฐจ์›์— ์ ‘๊ทผ
  • ์™œ ๋ณ‘ํ•ฉ๋˜๋Š”๊ฐ€: Thread 0: output[0,0,0], Thread 1: output[0,0,1] โ†’ ์—ฐ์†๋œ ์ฃผ์†Œ

2D ๋น„๋ณ‘ํ•ฉ ์ปค๋„

  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ: [batch*seq // 16, embed_dim // 16] ๋ธ”๋ก, 16ร—16 ์Šค๋ ˆ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Œ
  • ์™œ ๋น„๋ณ‘ํ•ฉ์ธ๊ฐ€: ์Šค๋ ˆ๋“œ ์ ‘๊ทผ ํŒจํ„ด์ด ๋ฉ”๋ชจ๋ฆฌ ์ „์ฒด์— ํฉ์–ด์งˆ ์ˆ˜ ์žˆ์Œ

์„ฑ๋Šฅ ๊ฒฐ๊ณผ

์ผ๋ฐ˜์ ์ธ ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ:

Performance Results:
   1D Coalesced:     2.145 ms
   2D Non-coalesced: 3.867 ms
   1D is 1.80x faster than 2D

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ๊ฐํ™”

๋ณ‘ํ•ฉ ํŒจํ„ด (1D ์ปค๋„)

output[0,0,0:32]์— ๋Œ€ํ•œ ์›Œํ”„ ์‹คํ–‰:

์š”์†Œ์Šค๋ ˆ๋“œ ID๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ฃผ์†Œ ํŒจํ„ด
output[0,0,0]0[0,0]Base + 0
output[0,0,1]1[0,1]Base + 4
output[0,0,2]2[0,2]Base + 8
output[0,0,3]3[0,3]Base + 12
โ€ฆโ€ฆโ€ฆโ€ฆ
output[0,0,31]31[0,31]Base + 124

๊ฒฐ๊ณผ: ์—ฐ์†๋œ ์ฃผ์†Œ โ†’ ์›Œํ”„ ์ „์ฒด์— ๋Œ€ํ•ด 1๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

๋น„๋ณ‘ํ•ฉ ํŒจํ„ด (2D ์ปค๋„)

16ร—16 ๋ธ”๋ก์˜ ์›Œํ”„ ์‹คํ–‰:

Block organization (16ร—16):
    X-dim: batch*seq positions (0-15)
    Y-dim: embed dimensions (0-15)

Warp threads might access:
    Thread 0:  batch=0, seq=0, embed=0  โ†’ Address A
    Thread 1:  batch=0, seq=1, embed=0  โ†’ Address B (different row)
    Thread 2:  batch=0, seq=2, embed=0  โ†’ Address C (different row)
    ...
    Thread 31: batch=1, seq=15, embed=0 โ†’ Address Z (scattered)

๊ฒฐ๊ณผ: ํฉ์–ด์ง„ ์ฃผ์†Œ โ†’ ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋žœ์žญ์…˜

ํ•ต์‹ฌ ์ตœ์ ํ™” ์ „๋žต

  1. ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ๋Š” ๊ฐ€๋Šฅํ•œ ํ•œ 1D ์ธ๋ฑ์‹ฑ์„ ์„ ํ˜ธํ•˜์„ธ์š”
  2. ๋ณ‘ํ•ฉ์— ์œ ๋ฆฌํ•˜๋„๋ก ๋ฐ์ดํ„ฐ ๊ตฌ์กฐ๋ฅผ ์ •๋ ฌํ•˜์„ธ์š”
  3. ์ปค๋„ ์„ค๊ณ„ ์‹œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ณ ๋ คํ•˜์„ธ์š”
  4. ๋ณ‘๋ชฉ ์ง€์ ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜์„ธ์š”
  5. ์ตœ์ ํ™” ํšจ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ๋ฒค์น˜๋งˆํฌ๋ฅผ ํ™œ์šฉํ•˜์„ธ์š”

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํŠนํžˆ ์ž„๋ฒ ๋”ฉ๊ณผ ๊ฐ™์€ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ๋Š” ์—ฐ์‚ฐ ๋ณต์žก๋„๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด GPU ์„ฑ๋Šฅ์„ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

Puzzle 22: ์ปค๋„ ํ“จ์ „๊ณผ ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค

์ปค๋„ ํ“จ์ „๊ณผ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ

์ปค๋„ ํ“จ์ „ ๊ณผ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์— ์ดˆ์ ์„ ๋งž์ถฐ Part V๋ฅผ ์ด์–ด๊ฐ‘๋‹ˆ๋‹ค.

Puzzle 21: ์ž„๋ฒ ๋”ฉ Op์— ์ด์–ด, ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ ํšจ์œจ์ ์ธ ์ปค๋„๋กœ ๊ฒฐํ•ฉํ•˜๊ณ  ์ด๋ฅผ PyTorch์˜ ์˜คํ† ๊ทธ๋ž˜๋“œ ์‹œ์Šคํ…œ๊ณผ ํ†ตํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค. ๋ฐฐ์šธ ๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • ์ปค๋„ ํ“จ์ „์ด ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค(forward pass)์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค(backward pass) ๋ชจ๋‘์—์„œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š” ์›๋ฆฌ
  • ํ“จ์ „ ์—ฐ์‚ฐ์— ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ํ•„์ˆ˜์ ์ธ ์ด์œ 
  • ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์„ ๊ฐ–์ถ˜ ํ“จ์ „ ์ปค๋„ ์„ค๊ณ„ ๋ฐฉ๋ฒ•
  • ์„œ๋กœ ๋‹ค๋ฅธ ํ“จ์ „ ์ „๋žต์ด ๊ฐ€์ ธ์˜ค๋Š” ์„ฑ๋Šฅ ์ฐจ์ด

์ด ํผ์ฆ์€ ์—ฐ์‚ฐ์„ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•˜๋А๋ƒ๊ฐ€ ์–ด๋–ป๊ฒŒ ๊ตฌํ˜„ํ•˜๋А๋ƒ๋งŒํผ ์ค‘์š”ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋ฅผ ๋ชจ๋‘ ํฌํ•จํ•˜๋Š” ํ“จ์ „ LayerNorm + Linear ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ํ“จ์ „๊ณผ ์–ธํ“จ์ „ ๊ตฌํ˜„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜์ง€๋งŒ, ์„œ๋กœ ๋‹ค๋ฅธ ์ „๋žต์„ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

๋น„๊ตํ•  ๋‚ด์šฉ:

  • ์–ธํ“จ์ „ ๋ฐฉ์‹: LayerNorm๊ณผ Linear๋ฅผ ๋ณ„๋„์˜ ์ปค๋„๋กœ ์‹คํ–‰
  • ํ“จ์ „ ์ปค๋„: ํ•˜๋‚˜์˜ ์ปค๋„์—์„œ ๋‘ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์‹คํ–‰
  • ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค: ํ“จ์ „ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ

์ด ๋น„๊ต๋ฅผ ํ†ตํ•ด ๋”ฅ๋Ÿฌ๋‹ ์—ฐ์‚ฐ์—์„œ ์ปค๋„ ํ“จ์ „๊ณผ ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์ด ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€ ์ฒด๊ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐฐ๊ฒฝ: LayerNorm + Linear ์—ฐ์‚ฐ

LayerNorm๊ณผ Linear๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์˜ ํ•ต์‹ฌ ์—ฐ์‚ฐ์œผ๋กœ, ํŠนํžˆ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜๊ณผ ํ”ผ๋“œํฌ์›Œ๋“œ ๋„คํŠธ์›Œํฌ์—์„œ ๋นˆ๋ฒˆํ•˜๊ฒŒ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ผ๋ฐ˜์ ์ธ ์‚ฌ์šฉ ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

import torch
import torch.nn.functional as F

# Input: hidden states
x = torch.randn(batch_size, seq_len, hidden_dim)

# LayerNorm parameters
ln_weight = torch.ones(hidden_dim)  # scale parameter (ฮณ)
ln_bias = torch.zeros(hidden_dim)   # shift parameter (ฮฒ)

# Linear layer parameters
linear_weight = torch.randn(output_dim, hidden_dim)
linear_bias = torch.zeros(output_dim)

# Unfused operations (with autograd)
ln_output = F.layer_norm(x, [hidden_dim], weight=ln_weight, bias=ln_bias)
output = F.linear(ln_output, linear_weight, linear_bias)

# Fused operation (custom implementation)
# This is what you'll implement in this puzzle
output_fused = fused_layernorm_linear(x, ln_weight, ln_bias, linear_weight, linear_bias)

ํ“จ์ „ ์—ฐ์‚ฐ์œผ๋กœ ๊ฒฐํ•ฉํ•˜๋ฉด ํ•˜๋‚˜์˜ ํšจ์œจ์ ์ธ ์ปค๋„์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์ ์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”
  • ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
  • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ œ๊ฑฐ

์‹ค์ œ๋กœ ์ด๋Ÿฌํ•œ ํ“จ์ „์€ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋ชจ๋‘์—์„œ ์ตœ๋Œ€ 1.5~2๋ฐฐ์˜ ์†๋„ ํ–ฅ์ƒ์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ํ•™์Šต ํšจ์œจ์— ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ 

PyTorch์˜ ์˜คํ† ๊ทธ๋ž˜๋“œ ์‹œ์Šคํ…œ์€ ๊ฐœ๋ณ„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ๋ฅผ ์ž๋™์œผ๋กœ ๊ณ„์‚ฐํ•˜์ง€๋งŒ, ํ“จ์ „ ์—ฐ์‚ฐ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์œ ๋กœ ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
  • ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„ ๋ณด์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”
  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ

ํ•™์Šต ๊ฒฝ๋กœ

์ด ํผ์ฆ์€ ์ฒด๊ณ„์ ์ธ ์ดํ•ด๋ฅผ ์œ„ํ•ด ๋‘ ๋ถ€๋ถ„์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

ํ“จ์ „ vs ์–ธํ“จ์ „ ์ปค๋„

์—ฌ๊ธฐ์„œ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ํ“จ์ „ ์ˆœ๋ฐฉํ–ฅ ์ปค๋„์„ ๊ตฌํ˜„ํ•˜๊ณ  ์ปค๋„ ํ“จ์ „์˜ ์ด์ ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ํ•˜๊ฒŒ ๋ ๊นŒ์š”:

  • ์–ธํ“จ์ „๊ณผ ํ“จ์ „ ์ˆœ๋ฐฉํ–ฅ ์ปค๋„ ๋ชจ๋‘ ๊ตฌํ˜„
  • ํ•ต์‹ฌ ์ปค๋„ ํ“จ์ „ ๊ธฐ๋ฒ• ํ•™์Šต
  • ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์„œ๋กœ ๋‹ค๋ฅธ ์ „๋žต์œผ๋กœ ๊ตฌํ˜„ํ•˜๋Š” ์‚ฌ๋ก€ ํ™•์ธ
  • ํ“จ์ „์ด ๊ฐ€์ ธ์˜ค๋Š” ์„ฑ๋Šฅ ์ฐจ์ด ์ดํ•ด
  • ์ตœ์  ์„ฑ๋Šฅ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ•™์Šต

์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค

์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์„ ๊นŠ์ด ํŒŒ๊ณ ๋“ญ๋‹ˆ๋‹ค.

๋ฌด์—‡์„ ๋ฐฐ์šธ๊นŒ์š”:

  • ์ปค์Šคํ…€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„ ๋ฐฉ๋ฒ•
  • ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์ด ์ค‘์š”ํ•œ ์ด์œ 
  • ํ•™์Šต ํšจ์œจ์— ๋Œ€ํ•œ ์‹ค์ œ ์‹œ์‚ฌ์ 
  • ์—ญ๋ฐฉํ–ฅ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ตœ์ ํ™” ์ „๋žต
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์˜ ์ˆ˜ํ•™์  ๊ธฐ์ดˆ
  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
  • ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์—์„œ์˜ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ

์‹œ์ž‘ํ•˜๊ธฐ

์ปค๋„ ํ“จ์ „๊ณผ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์„ ํƒ๊ตฌํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ํ“จ์ „ vs ์–ธํ“จ์ „ ์ปค๋„ ์—์„œ ํ“จ์ „ ์ปค๋„์„ ๊ตฌํ˜„ํ•œ ํ›„, ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋กœ ๋„˜์–ด๊ฐ€ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์„ ์ดํ•ดํ•ด ๋ณด์„ธ์š”.

์ด ํผ์ฆ์—๋Š” ๋‹ค์Œ์„ ๊ฒ€์ฆํ•˜๋Š” ์ข…ํ•ฉ ํ…Œ์ŠคํŠธ ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋ชจ๋‘์—์„œ PyTorch ๊ตฌํ˜„๊ณผ์˜ ์ˆ˜์น˜์  ์ •ํ™•๋„
  • CPU์™€ GPU ๊ตฌํ˜„ ๊ฐ„์˜ ์„ฑ๋Šฅ ๋น„๊ต
  • ๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ(์ž…๋ ฅ, LayerNorm ๊ฐ€์ค‘์น˜/๋ฐ”์ด์–ด์Šค, Linear ๊ฐ€์ค‘์น˜/๋ฐ”์ด์–ด์Šค)์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ ์ •ํ™•๋„
  • ์ปค๋„ ํ“จ์ „์„ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ตœ์ ํ™”

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์„œ๋กœ ๋‹ค๋ฅธ ๊ตฌํ˜„ ๋ฐฉ์‹(ํ“จ์ „ vs ์–ธํ“จ์ „)์ด ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์„ฑ๋Šฅ ๋ชจ๋‘์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ฃผ์˜ ๊นŠ๊ฒŒ ์‚ดํŽด๋ณด์„ธ์š”. ์ด ํ†ต์ฐฐ์€ LayerNorm + Linear๋ฅผ ๋„˜์–ด ๋‹ค์–‘ํ•œ ๋”ฅ๋Ÿฌ๋‹ ์—ฐ์‚ฐ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ํŠนํžˆ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์€ ํ•™์Šต ํšจ์œจ๊ณผ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์— ์ง์ ‘์ ์ธ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋ฏ€๋กœ ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

โš›๏ธ ํ“จ์ „ vs ์–ธํ“จ์ „ ์ปค๋„

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” LayerNorm๊ณผ Linear ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ ๋ฐฉ์‹์„ ๊ตฌํ˜„ํ•˜๊ณ  ๋น„๊ตํ•˜๋ฉฐ, ์ปค๋„ ํ“จ์ „์˜ ์„ฑ๋Šฅ ์ด์ ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค:

  1. ์–ธํ“จ์ „ ๋ฐฉ์‹: LayerNorm๊ณผ Linear๋ฅผ ๋ณ„๋„์˜ ์—ฐ์‚ฐ์œผ๋กœ ์‹คํ–‰
  2. ํ“จ์ „ ์ปค๋„: LayerNorm๊ณผ Linear ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ GPU ์ปค๋„๋กœ ๊ฒฐํ•ฉ

์ด ๋น„๊ต๋ฅผ ํ†ตํ•ด ์ปค๋„ ํ“จ์ „์ด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ๊ฐœ์„ ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”
  • ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
  • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ์ œ๊ฑฐ

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜๋Š” ์ปค๋„ ํ“จ์ „ ๊ธฐ๋ฒ•
  • ํ“จ์ „ ์—ฐ์‚ฐ์„ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ตœ์ ํ™”
  • ์„œ๋กœ ๋‹ค๋ฅธ ์ปค๋„ ๊ตฌํ˜„์˜ ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํ‚น
  • ํ“จ์ „ ์—ฐ์‚ฐ์—์„œ์˜ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ
  • PyTorch ์ปค์Šคํ…€ ์—ฐ์‚ฐ ํ†ตํ•ฉ

๊ฒฐํ•ฉํ•  ์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. LayerNorm: \[\Large \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

  2. Linear: \[\Large \text{Linear}(x) = Wx + b \]

ํ“จ์ „ ์—ฐ์‚ฐ์œผ๋กœ ๊ฒฐํ•ฉํ•˜๋ฉด ๋‹ค์Œ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Fused}(x) = W(\gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta) + b \]

LayerNorm ์ดํ•ดํ•˜๊ธฐ

LayerNorm์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต์„ ์•ˆ์ •ํ™”ํ•˜๊ณ  ๊ฐ€์†ํ•˜๋Š” ์ •๊ทœํ™” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ตฌ์„ฑ ์š”์†Œ์™€ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ•˜๋‚˜์”ฉ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

LayerNorm์ด ํ•˜๋Š” ์ผ

  1. ์ •๊ทœํ™”: LayerNorm์€ ๊ฐ ์ƒ˜ํ”Œ์˜ ํŠน์„ฑ(์€๋‹‰ ์ฐจ์›, hidden dimension) ์ „์ฒด์— ๊ฑธ์ณ ํ™œ์„ฑํ™” ๊ฐ’์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌ์ฒด์ ์œผ๋กœ:

    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์—์„œ ์€๋‹‰ ์ฐจ์›์— ๋Œ€ํ•œ ํ†ต๊ณ„๋Ÿ‰์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๋ฐฐ์น˜์˜ ๊ฐ ์ƒ˜ํ”Œ์€ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”๋ฉ๋‹ˆ๋‹ค
    • ๋ฐฐ์น˜ ์ฐจ์›์— ๋Œ€ํ•ด ์ •๊ทœํ™”ํ•˜๋Š” BatchNorm๊ณผ๋Š” ๋‹ค๋ฆ…๋‹ˆ๋‹ค
  2. ํŒŒ๋ผ๋ฏธํ„ฐ:

    • \(\gamma\) (scale): ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ ํŠน์„ฑ์˜ ์ตœ์  ์Šค์ผ€์ผ์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฒกํ„ฐ
    • \(\beta\) (shift): ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ ํŠน์„ฑ์˜ ์ตœ์  ์ด๋™๋Ÿ‰์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋Š” ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ๋ฒกํ„ฐ
    • \(\epsilon\): 0์œผ๋กœ ๋‚˜๋ˆ„๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋ถ„์‚ฐ์— ๋”ํ•˜๋Š” ์ž‘์€ ์ƒ์ˆ˜ (1e-5)

LayerNorm์˜ ์‹ค์ œ ์—ญํ• 

LayerNorm์€ ์‹ฌ์ธต ์‹ ๊ฒฝ๋ง์—์„œ ์—ฌ๋Ÿฌ ์ค‘์š”ํ•œ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  1. ํŠน์„ฑ ํ‘œ์ค€ํ™”:

    • ๊ฐ ํŠน์„ฑ์„ ํ‰๊ท  0, ๋ถ„์‚ฐ 1๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค
    • ๋„คํŠธ์›Œํฌ์˜ ํ•™์Šต ๊ณผ์ •์„ ๋” ์•ˆ์ •์ ์œผ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค
    • ํ•™์Šต ์ค‘ ๋ ˆ์ด์–ด ์ž…๋ ฅ์˜ ๋ถ„ํฌ๊ฐ€ ๋ณ€ํ•˜๋Š” โ€œ๋‚ด๋ถ€ ๊ณต๋ณ€๋Ÿ‰ ์ด๋™(internal covariate shift)โ€ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
  2. ๊ธฐ์šธ๊ธฐ ํ๋ฆ„:

    • ๋„คํŠธ์›Œํฌ๋ฅผ ํ†ตํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค
    • ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค/ํญ๋ฐœ ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ๋” ๋†’์€ ํ•™์Šต๋ฅ ์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์–ด ํ•™์Šต ํšจ์œจ์ด ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค
  3. ์ •๊ทœํ™” ํšจ๊ณผ:

    • ์•”๋ฌต์ ์ธ ์ •๊ทœํ™” ์—ญํ• ์„ ํ•ฉ๋‹ˆ๋‹ค
    • ํŠน์„ฑ ๋ถ„ํฌ๋ฅผ ์ •๊ทœํ™”ํ•˜์—ฌ ๊ณผ์ ํ•ฉ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ์ž…๋ ฅ ๋ณ€๋™์— ๋Œ€ํ•œ ๋„คํŠธ์›Œํฌ์˜ ๊ฐ•๊ฑด์„ฑ์„ ๋†’์ž…๋‹ˆ๋‹ค
  4. ์‹œํ€€์Šค ๋ชจ๋ธ๋ง:

    • ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์—์„œ ํŠนํžˆ ํšจ๊ณผ์ ์ž…๋‹ˆ๋‹ค
    • ์„œ๋กœ ๋‹ค๋ฅธ ์‹œํ€€์Šค ๊ธธ์ด์—์„œ๋„ ์ผ๊ด€๋œ ์‹ ํ˜ธ ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค๋ฅผ ๋” ์ž˜ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค
  5. ํ•™์Šต ์—ญํ•™:

    • ํ•™์Šต ์ˆ˜๋ ด์„ ๊ฐ€์†ํ•ฉ๋‹ˆ๋‹ค
    • ์„ธ๋ฐ€ํ•œ ํ•™์Šต๋ฅ  ์กฐ์ •์˜ ํ•„์š”์„ฑ์„ ์ค„์ž…๋‹ˆ๋‹ค
    • ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐํ™”์— ๋Œ€ํ•œ ๋„คํŠธ์›Œํฌ์˜ ๋ฏผ๊ฐ๋„๋ฅผ ๋‚ฎ์ถฅ๋‹ˆ๋‹ค

์ˆ˜ํ•™์  ๊ตฌ์„ฑ ์š”์†Œ

  1. ํ‰๊ท  ๊ณ„์‚ฐ (\(\mu\)): \[\Large \mu = \frac{1}{H} \sum_{i=1}^{H} x_i \]

    • ์€๋‹‰ ์ฐจ์›(H)์— ๊ฑธ์ณ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜๋งˆ๋‹ค ๊ณ ์œ ํ•œ ํ‰๊ท ์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค
  2. ๋ถ„์‚ฐ ๊ณ„์‚ฐ (\(\sigma^2\)): \[\Large \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 \]

    • ์€๋‹‰ ์ฐจ์›์— ๊ฑธ์ณ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ์ •๊ทœํ™”๋œ ๊ฐ’์˜ ์Šค์ผ€์ผ๋ง์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค
  3. ์ •๊ทœํ™”์™€ ์Šค์ผ€์ผ๋ง: \[\Large \text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]

    • ๋จผ์ € ์ž…๋ ฅ์„ ํ‰๊ท  0, ๋ถ„์‚ฐ 1๋กœ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค
    • ๊ทธ๋Ÿฐ ๋‹ค์Œ ํ•™์Šต ๊ฐ€๋Šฅํ•œ scale (\(\gamma\))๊ณผ shift (\(\beta\)) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
    • \(\odot\) ๊ธฐํ˜ธ๋Š” ์š”์†Œ๋ณ„ ๊ณฑ์…ˆ(์•„๋‹ค๋งˆ๋ฅด ๊ณฑ)์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค
    • ์˜ˆ๋ฅผ ๋“ค์–ด, \(\gamma = [1.2, 0.8, 1.5]\)์ด๊ณ  ์ •๊ทœํ™”๋œ ์ž…๋ ฅ์ด \([0.5, -0.3, 0.7]\)์ด๋ฉด, \(\gamma \odot x = [0.6, -0.24, 1.05]\)์ž…๋‹ˆ๋‹ค

LayerNorm์ด ์ค‘์š”ํ•œ ์ด์œ 

  1. ํ•™์Šต ์•ˆ์ •์„ฑ:

    • ํ™œ์„ฑํ™” ๊ฐ’์ด ๋„ˆ๋ฌด ํฌ๊ฑฐ๋‚˜ ์ž‘์•„์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
    • ๋„คํŠธ์›Œํฌ ์ „์ฒด์— ๊ฑธ์ณ ์ผ๊ด€๋œ ์‹ ํ˜ธ ํฌ๊ธฐ๋ฅผ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค
  2. ํŠน์„ฑ ํ•™์Šต:

    • scale (\(\gamma\))๊ณผ shift (\(\beta\)) ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ํ†ตํ•ด ์–ด๋–ค ํŠน์„ฑ์ด ์ค‘์š”ํ•œ์ง€ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
    • ํŠน์ • ํŠน์„ฑ์„ ๋ฌด์‹œํ•˜๊ฑฐ๋‚˜ ๊ฐ•์กฐํ•˜๋Š” ๊ฒƒ์„ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  3. ๋…๋ฆฝ์„ฑ:

    • BatchNorm๊ณผ ๋‹ฌ๋ฆฌ, LayerNorm์˜ ํ†ต๊ณ„๋Ÿ‰์€ ๊ฐ ์ƒ˜ํ”Œ์— ๋Œ€ํ•ด ๋…๋ฆฝ์ ์œผ๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค
    • ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค์™€ ์ž‘์€ ๋ฐฐ์น˜ ํฌ๊ธฐ์— ๋” ์ ํ•ฉํ•ฉ๋‹ˆ๋‹ค

๊ตฌ์„ฑ

  • ๋ฐฐ์น˜ ํฌ๊ธฐ: BATCH_SIZE = 4
  • ์‹œํ€€์Šค ๊ธธ์ด: SEQ_LEN = 4
  • ์€๋‹‰ ์ฐจ์›: HIDDEN_DIM = 8
  • ์ถœ๋ ฅ ์ฐจ์›: OUTPUT_DIM = 16
  • ์—ก์‹ค๋ก : EPS = 1e-5
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32

๊ตฌํ˜„ ๋ฐฉ์‹

1. ์–ธํ“จ์ „ ๊ตฌํ˜„

์–ธํ“จ์ „ ๋ฐฉ์‹์€ ์—ฌ๋Ÿฌ ์ปค๋„์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ด์ „ ์ฑ•ํ„ฐ์—์„œ ์ž‘์„ฑํ•œ ์ปค๋„๋“ค์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ปค๋„

Puzzle 16: ํ–‰๋ ฌ ๊ณฑ์…ˆ (MatMul)์—์„œ ์‚ฌ์šฉํ•œ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ปค๋„์„ ์„ ํ˜• ๋ณ€ํ™˜์— ์žฌ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์ปค๋„์€ ๋‹ค์–‘ํ•œ ํ–‰๋ ฌ ํฌ๊ธฐ๋ฅผ ์•ˆ์ „ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค:

# Idiomatic tiled matmul from p19.mojo
def matmul_idiomatic_tiled[
    rows: Int,
    cols: Int,
    inner: Int,
    OutLayout: TensorLayout,
    ALayout: TensorLayout,
    BLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, ALayout, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, BLayout, MutAnyOrigin],
):
    """Idiomatic tiled matrix multiplication from p19."""
    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var tiled_row = block_idx.y * MATMUL_BLOCK_DIM_XY + local_row
    var tiled_col = block_idx.x * MATMUL_BLOCK_DIM_XY + local_col

    # Get the tile of the output matrix that this thread block is responsible for
    var out_tile = output.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
        block_idx.y, block_idx.x
    )
    comptime shared_layout = row_major[
        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
    ]()
    var a_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](shared_layout)
    var b_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](shared_layout)
    var acc: output.ElementType = 0

    comptime load_a_layout = row_major[
        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
    ]()  # Coalesced loading
    comptime load_b_layout = row_major[
        MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY
    ]()  # Coalesced loading

    comptime for idx in range(
        (inner + MATMUL_BLOCK_DIM_XY - 1) // MATMUL_BLOCK_DIM_XY
    ):
        # Get tiles from A and B matrices
        var a_tile = a.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
            block_idx.y, idx
        )
        var b_tile = b.tile[MATMUL_BLOCK_DIM_XY, MATMUL_BLOCK_DIM_XY](
            idx, block_idx.x
        )

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads=MATMUL_NUM_THREADS,
            block_dim_count=MATMUL_BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads=MATMUL_NUM_THREADS,
            block_dim_count=MATMUL_BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        # Wait for all async copies to complete
        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        comptime for k in range(MATMUL_BLOCK_DIM_XY):
            if (
                tiled_row < rows and tiled_col < cols
            ):  # Only perform calculation for valid outputs
                if k < a_tile.dim(
                    1
                ):  # Only perform calculation on valid inputs
                    acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result with bounds checking (needed for variable matrix sizes)
    if tiled_row < rows and tiled_col < cols:
        out_tile[local_row, local_col] = acc


์ „์น˜ ์ปค๋„

ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋Š” ์ „์น˜ ์ปค๋„์ž…๋‹ˆ๋‹ค:

def transpose_kernel[
    rows: Int,
    cols: Int,
    OutLayout: TensorLayout,
    InLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    inp: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
):
    """Transpose matrix using shared memory tiling for coalesced access.
    We will learn more about coalesced access in the next part.
    """
    comptime shared_layout = row_major[
        TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY
    ]()
    var shared_tile = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](shared_layout)

    var local_row = thread_idx.y
    var local_col = thread_idx.x

    var global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row
    var global_col = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_col

    if global_row < rows and global_col < cols:
        shared_tile[local_row, local_col] = inp[global_row, global_col]

    barrier()

    var out_row = block_idx.x * TRANSPOSE_BLOCK_DIM_XY + local_row
    var out_col = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_col

    # Store data from shared memory to global memory (coalesced write)
    # Note: we transpose the shared memory access pattern
    if out_row < cols and out_col < rows:
        output[out_row, out_col] = shared_tile[local_col, local_row]
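위 커널의 핵심인 "타일을 읽어 스왑된 블록 좌표에 전치하여 쓰는" 패턴을 numpy로 흉내 낸 스케치입니다. TILE 크기와 행렬 값은 설명을 위해 임의로 가정했습니다.

```python
import numpy as np

TILE = 2
A = np.arange(6.0).reshape(2, 3)   # rows=2, cols=3 (경계 검사를 보기 위해 비정방 행렬)
out = np.zeros((3, 2))

# 커널과 같은 구조: 블록 (by, bx)의 타일을 읽어,
# 블록 좌표를 스왑한 위치 (bx, by)에 전치된 타일을 기록
for by in range(0, A.shape[0], TILE):        # block_idx.y에 대응
    for bx in range(0, A.shape[1], TILE):    # block_idx.x에 대응
        tile = A[by:by + TILE, bx:bx + TILE]  # 슬라이싱이 경계 검사 역할
        out[bx:bx + tile.shape[1], by:by + tile.shape[0]] = tile.T

assert np.array_equal(out, A.T)
```

실제 커널에서는 tile.T에 해당하는 부분이 shared_tile[local_col, local_row] 읽기로 구현되어, 전역 메모리의 읽기와 쓰기가 모두 병합(coalesced)됩니다.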


Bias ํ•ฉ์‚ฐ ์ปค๋„

Bias ํ•ญ์„ ๋”ํ•˜๋Š” ๊ฐ„๋‹จํ•œ ์š”์†Œ๋ณ„ ํ•ฉ์‚ฐ ์ปค๋„์ž…๋‹ˆ๋‹ค:

def add_bias_kernel[
    batch_size: Int,
    seq_len: Int,
    output_dim: Int,
    OutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    BiasLayout: TensorLayout,
](
    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InputLayout, MutAnyOrigin],
    bias: TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin],
):
    """Simple bias addition."""
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y
    var out_idx = thread_idx.x

    if batch_idx >= batch_size or seq_idx >= seq_len or out_idx >= output_dim:
        return

    output[batch_idx, seq_idx, out_idx] = input[
        batch_idx, seq_idx, out_idx
    ] + rebind[Scalar[dtype]](bias[out_idx])


LayerNorm ์ปค๋„

์ด์ œ ์ด ์ปค๋„์„ ์™„์„ฑํ•˜์—ฌ LayerNorm ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  1. ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์— ๋Œ€ํ•œ ํ‰๊ท  \(\mu\)๊ณผ ๋ถ„์‚ฐ \(\sigma^2\) ๊ณ„์‚ฐ
  2. ์ด ํ†ต๊ณ„๋Ÿ‰์„ ์‚ฌ์šฉํ•˜์—ฌ ์ž…๋ ฅ ์ •๊ทœํ™”
  3. ์Šค์ผ€์ผ \(\gamma\)๊ณผ ์‹œํ”„ํŠธ \(\beta\) ํŒŒ๋ผ๋ฏธํ„ฐ ์ ์šฉ
def layernorm_kernel[
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    OutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    LnParamsLayout: TensorLayout,
](
    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
    ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
    ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
):
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y
    var hidden_idx = thread_idx.x

    if (
        batch_idx >= batch_size
        or seq_idx >= seq_len
        or hidden_idx >= hidden_dim
    ):
        return

    # Compute statistics for this sequence position (redundant but simple)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    # FILL ME IN (roughly 11 lines)


๊ตฌํ˜„ ๋‹จ๊ณ„:

  1. ๋จผ์ €, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  2. ๊ทธ๋Ÿฐ ๋‹ค์Œ, ์ด ํ†ต๊ณ„๋Ÿ‰์œผ๋กœ ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค
  3. ๋งˆ์ง€๋ง‰์œผ๋กœ, ์Šค์ผ€์ผ๊ณผ ์‹œํ”„ํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค

์–ธํ“จ์ „ ๋ฐฉ์‹์˜ ํŠน์„ฑ:

  • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰ (LayerNorm โ†’ MatMul โ†’ Bias)
  • ์—ฐ์‚ฐ ๊ฐ„ ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น
  • ๋ณ„๋„์˜ ํŒจ์Šค๋กœ ์ธํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ฆ๊ฐ€
  • ๊ด€์‹ฌ์‚ฌ ๋ถ„๋ฆฌ๊ฐ€ ๋ช…ํ™•ํ•œ ๊ฐ„๊ฒฐํ•œ ๊ตฌํ˜„
  • ๊ฐ ์—ฐ์‚ฐ์ด ๊ฒฉ๋ฆฌ๋˜์–ด ๋””๋ฒ„๊น…์ด ์šฉ์ด
ํŒ
  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก ์‚ฌ์šฉ (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์€๋‹‰ ์ฐจ์› ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌ
    • ์‹œํ€€์Šค๋‹น ํ†ต๊ณ„๋Ÿ‰์„ ํ•œ ๋ฒˆ๋งŒ ๊ณ„์‚ฐํ•˜์—ฌ ์ค‘๋ณต ์—ฐ์‚ฐ ๋ฐฉ์ง€
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ ํ…์„œ: [batch_idx, seq_idx, hidden_idx]๋กœ ์ ‘๊ทผ
    • ์ถœ๋ ฅ ํ…์„œ: [batch_idx, seq_idx, hidden_idx]๋กœ ์ ‘๊ทผ
    • LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ: [hidden_idx]๋กœ ์ ‘๊ทผ
  3. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ:

    • ์ œ๊ณฑ๊ทผ์„ ์ทจํ•˜๊ธฐ ์ „์— ์—ก์‹ค๋ก (1e-5)์„ ๋”ํ•ฉ๋‹ˆ๋‹ค
    • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ…์„ ์œ„ํ•ด rebind[Scalar[dtype]] ์‚ฌ์šฉ
    • ๋ถ„์‚ฐ์€ (sq_sum / hidden_dim) - (mean * mean)์œผ๋กœ ๊ณ„์‚ฐ
  4. ์„ฑ๋Šฅ:

    • ํ•œ ๋ฒˆ์˜ ํŒจ์Šค๋กœ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๋™์‹œ์— ๊ณ„์‚ฐ
    • ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋Ÿ‰์„ ์‹œํ€€์Šค ๋‚ด ๋ชจ๋“  ์š”์†Œ์— ์žฌ์‚ฌ์šฉ
    • ๋ถˆํ•„์š”ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๋ฐฉ์ง€

์ฝ”๋“œ ์‹คํ–‰

์–ธํ“จ์ „ ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p22 --unfused
pixi run -e amd p22 --unfused
uv run poe p22 --unfused

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Testing with dimensions: [4, 4, 8] -> [4, 4, 16]
โœ… Loaded Mojo operations library
============================================================
   Puzzle 22: UNFUSED Algorithm Test & Benchmark
============================================================

๐Ÿงช Correctness Testing for UNFUSED Algorithm
====================================================

Testing Reference PyTorch Implementation
-----------------------------------------------
โœ… Reference PyTorch
   Max difference: 0.00e+00
   Result: โœ… CORRECT

Testing CPU Implementation
---------------------------------
โœ… Using Mojo fused kernel (CPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Testing GPU Unfused Implementation
-----------------------------------------
โœ… Using Mojo unfused kernel (GPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Correctness Summary:
   - Reference:   โœ… CORRECT
   - CPU:         โœ… CORRECT
   - GPU unfused: โœ… CORRECT

   Overall Correctness: โœ… ALL CORRECT

Benchmarking CPU vs GPU UNFUSED
------------------------------------------
   Testing CPU performance...
   CPU: 3173.70ms (50 iterations)
   Testing GPU unfused performance...
   GPU unfused: 3183.57ms (50 iterations)

   GPU unfused vs CPU: 1.00x slower
   CPU wins (GPU overhead > computation benefit)

UNFUSED Algorithm Test Completed!

์†”๋ฃจ์…˜

def layernorm_kernel[
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    OutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    LnParamsLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
    input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
    ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
    ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
):
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y
    var hidden_idx = thread_idx.x

    if (
        batch_idx >= batch_size
        or seq_idx >= seq_len
        or hidden_idx >= hidden_dim
    ):
        return

    var output_lt = output.to_layout_tensor()
    var input_lt = input.to_layout_tensor()
    var ln_weight_lt = ln_weight.to_layout_tensor()
    var ln_bias_lt = ln_bias.to_layout_tensor()

    # Compute statistics for this sequence position (redundant but simple)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    comptime for h in range(hidden_dim):
        var val = input_lt[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)

    var mean_val = sum_val / hidden_dim
    var var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
    var inv_std = 1.0 / sqrt(var_val + 1e-5)

    # Apply LayerNorm to this element
    var input_val = input_lt[batch_idx, seq_idx, hidden_idx]
    var normalized = (input_val - mean_val) * inv_std * rebind[Scalar[dtype]](
        ln_weight_lt[hidden_idx]
    ) + rebind[Scalar[dtype]](ln_bias_lt[hidden_idx])
    output_lt[batch_idx, seq_idx, hidden_idx] = normalized


์–ธํ“จ์ „ ๊ตฌํ˜„์€ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ํ…์„œ์˜ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์ง๊ด€์ ์ธ ๋ฐฉ์‹์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ํ•˜๋‚˜์”ฉ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ์™€ ๋ธ”๋ก ๊ตฌ์„ฑ:

    batch_idx = block_idx.x
    seq_idx = block_idx.y
    hidden_idx = thread_idx.x
    
    • ๊ฐ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ๋ฐฐ์น˜ ๋‚ด ํ•˜๋‚˜์˜ ์‹œํ€€์Šค ์œ„์น˜๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

    • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: [batch_size, seq_len]

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์€๋‹‰ ์ฐจ์›์˜ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

    • ์ธ๋ฑ์Šค๊ฐ€ ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜๋ฉด ์กฐ๊ธฐ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

      if (batch_idx >= batch_size or seq_idx >= seq_len or hidden_idx >= hidden_dim):
          return
      
  2. ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ:

    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0
    
    comptime for h in range(hidden_dim):
        var val = input[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)
    
    • 한 번의 패스로 합계와 제곱합을 동시에 계산합니다

    • 컴파일 타임 루프 전개를 위해 comptime for를 사용합니다

    • rebind[Scalar[dtype]]๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

    • ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

      var mean_val = sum_val / hidden_dim
      var var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
      var inv_std = 1.0 / sqrt(var_val + 1e-5)
      
  3. ์ •๊ทœํ™”์™€ ์Šค์ผ€์ผ๋ง:

    var input_val = input[batch_idx, seq_idx, hidden_idx]
    var normalized = (input_val - mean_val) * inv_std * rebind[Scalar[dtype]](
        ln_weight[hidden_idx]
    ) + rebind[Scalar[dtype]](ln_bias[hidden_idx])
    output[batch_idx, seq_idx, hidden_idx] = normalized
    
    • ์ •๊ทœํ™”๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{normalized} = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]
    • ํ•™์Šต ๊ฐ€๋Šฅํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ ฮณ (ln_weight)๋กœ ์Šค์ผ€์ผ๋งํ•ฉ๋‹ˆ๋‹ค
    • ํ•™์Šต ๊ฐ€๋Šฅํ•œ bias ฮฒ (ln_bias)๋ฅผ ๋”ํ•ฉ๋‹ˆ๋‹ค
    • ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ํ…์„œ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  4. ์„ฑ๋Šฅ ํŠน์„ฑ:

    • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ํ†ต๊ณ„๋Ÿ‰์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ์—†์Œ (๊ฐ„๋‹จํ•˜์ง€๋งŒ ๋œ ํšจ์œจ์ )
    • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:
      • ์ž…๋ ฅ: [batch_idx, seq_idx, h]
      • ์ถœ๋ ฅ: [batch_idx, seq_idx, hidden_idx]
      • ํŒŒ๋ผ๋ฏธํ„ฐ: [hidden_idx]
    • ๋‹ค์Œ์„ ํ†ตํ•ด ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:
      • ์ œ๊ณฑ๊ทผ ์ „์— ์—ก์‹ค๋ก (1e-5) ์ถ”๊ฐ€
      • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ
      • ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ธ ๋ฐฉ์‹์œผ๋กœ ๋ถ„์‚ฐ ๊ณ„์‚ฐ
  5. ๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ:

    • ํƒ€์ž… ์•ˆ์ „์„ฑ:

      • ์ค‘๊ฐ„ ๊ณ„์‚ฐ์— Scalar[dtype] ์‚ฌ์šฉ
      • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ…์„ ์œ„ํ•ด rebind[Scalar[dtype]] ์‚ฌ์šฉ
      • ์ผ๊ด€๋œ ๋ถ€๋™์†Œ์ˆ˜์  ์ •๋ฐ€๋„ ๋ณด์žฅ
    • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

      • ์ž…๋ ฅ ํ…์„œ์—์„œ ๋ณ‘ํ•ฉ ์ฝ๊ธฐ
      • ์ถœ๋ ฅ ํ…์„œ์— ๋ณ‘ํ•ฉ ์“ฐ๊ธฐ
      • LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ์— ์ˆœ์ฐจ์  ์ ‘๊ทผ
    • ์—ฐ์‚ฐ ํ๋ฆ„:

      • ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ: \[\Large O(H) \text{ operations per thread} \]
      • ์ •๊ทœํ™”: \[\Large O(1) \text{ operations per thread} \]
      • ์ „์ฒด ๋ณต์žก๋„: \[\Large O(H) \text{ per output element} \]
    • ํ•œ๊ณ„์ :

      • ํ†ต๊ณ„๋Ÿ‰์˜ ์ค‘๋ณต ๊ณ„์‚ฐ
      • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—†์Œ
      • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰
      • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰ ํ•„์š”

์ด ๊ตฌํ˜„์€ ์ •ํ™•ํ•˜์ง€๋งŒ ์„ฑ๋Šฅ ๋ฉด์—์„œ ์ตœ์ ์ด ์•„๋‹ˆ๋ฉฐ, ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ์—์„œ CPU ๋ฒ„์ „๋ณด๋‹ค ์•ฝ๊ฐ„ ๋А๋ฆฐ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ“จ์ „ ๊ตฌํ˜„์—์„œ๋Š” ๋‹ค์Œ์„ ํ†ตํ•ด ์ด๋Ÿฌํ•œ ์„ฑ๋Šฅ ํ•œ๊ณ„๋ฅผ ํ•ด๊ฒฐํ•ฉ๋‹ˆ๋‹ค:

  • ์‹œํ€€์Šค๋‹น ํ†ต๊ณ„๋Ÿ‰์„ ํ•œ ๋ฒˆ๋งŒ ๊ณ„์‚ฐ
  • ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ๊ฐ์†Œ
  • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ์ œ๊ฑฐ

2. ํ“จ์ „ ์ปค๋„ ๊ตฌํ˜„

ํ“จ์ „ ์ปค๋„์€ LayerNorm๊ณผ Linear ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ GPU ์ปค๋„๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

def minimal_fused_kernel[
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
    OutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    LnParamsLayout: TensorLayout,
    WeightLayout: TensorLayout,
    BiasLayout: TensorLayout,
](
    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
    ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
    ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
    linear_weight: TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin],
    linear_bias: TileTensor[mut=False, dtype, BiasLayout, ImmutAnyOrigin],
):
    """Minimal fused kernel - one thread per sequence position to avoid redundancy.
    """
    # Grid: (batch_size, seq_len) - one thread block per sequence position
    # Block: (1,) - single thread per sequence position to avoid redundant computation
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    # Step 1: Compute LayerNorm statistics once per sequence position

    # FILL IN roughly 10 lines

    # Step 2: Compute all outputs for this sequence position

    # FILL IN roughly 10 lines


ํ•ต์‹ฌ ์ตœ์ ํ™”:

  • ๋‘ ๋ฒˆ ๋Œ€์‹  ํ•œ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰
  • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ
  • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
ํŒ
  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ์ค‘๋ณต์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์˜ ๋ชจ๋“  ์ถœ๋ ฅ์„ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ณ„์‚ฐ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ ํ…์„œ: [batch_idx, seq_idx, h]๋กœ ์ ‘๊ทผ
    • ์ถœ๋ ฅ ํ…์„œ: [batch_idx, seq_idx, out_idx]๋กœ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜: ์„ ํ˜• ๋ ˆ์ด์–ด์—์„œ [out_idx, h]๋กœ ์ ‘๊ทผ
  3. ์—ฐ์‚ฐ ํ๋ฆ„:

    • ์‹œํ€€์Šค๋‹น LayerNorm ํ†ต๊ณ„๋Ÿ‰์„ ํ•œ ๋ฒˆ๋งŒ ๊ณ„์‚ฐ
    • ๋ชจ๋“  ์ถœ๋ ฅ ์ฐจ์›์— ์ •๊ทœํ™”๋œ ๊ฐ’์„ ์žฌ์‚ฌ์šฉ
    • ์ •๊ทœํ™”์™€ ์„ ํ˜• ๋ณ€ํ™˜์„ ๊ฒฐํ•ฉ
  4. ์„ฑ๋Šฅ:

    • ํ†ต๊ณ„๋Ÿ‰์˜ ์ค‘๋ณต ๊ณ„์‚ฐ ๋ฐฉ์ง€
    • ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • rebind[Scalar[dtype]]๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ

์ฝ”๋“œ ์‹คํ–‰

ํ“จ์ „ ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p22 --fused
pixi run -e amd p22 --fused
uv run poe p22 --fused

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Testing with dimensions: [4, 4, 8] -> [4, 4, 16]
โœ… Loaded Mojo operations library
============================================================
   Puzzle 22: FUSED Algorithm Test & Benchmark
============================================================

๐Ÿงช Correctness Testing for FUSED Algorithm
==================================================

Testing Reference PyTorch Implementation
-----------------------------------------------
โœ… Reference PyTorch
   Max difference: 0.00e+00
   Result: โœ… CORRECT

Testing CPU Implementation
---------------------------------
โœ… Using Mojo fused kernel (CPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Testing GPU Fused Implementation
---------------------------------------
โœ… Using Mojo fused kernel (GPU)
   Max difference: 1.86e-08
   Result: โœ… CORRECT

Correctness Summary:
   - Reference:   โœ… CORRECT
   - CPU:         โœ… CORRECT
   - GPU fused: โœ… CORRECT

   Overall Correctness: โœ… ALL CORRECT

โšก Benchmarking CPU vs GPU FUSED
----------------------------------------
   Testing CPU performance...
   CPU: 3144.75ms (50 iterations)
   Testing GPU fused performance...
   GPU fused: 3116.11ms (50 iterations)

   GPU fused vs CPU: 1.01x faster
   GPU fused wins!

FUSED Algorithm Test Completed!

์†”๋ฃจ์…˜

def minimal_fused_kernel[
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
    OutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    LnParamsLayout: TensorLayout,
    WeightLayout: TensorLayout,
    BiasLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    output: TileTensor[mut=True, dtype, OutputLayout, MutAnyOrigin],
    input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
    ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
    ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
    linear_weight: TileTensor[mut=True, dtype, WeightLayout, MutAnyOrigin],
    linear_bias: TileTensor[mut=True, dtype, BiasLayout, MutAnyOrigin],
):
    """Minimal fused kernel - one thread per sequence position to avoid redundancy.
    """
    # Grid: (batch_size, seq_len) - one thread block per sequence position
    # Block: (1,) - single thread per sequence position to avoid redundant computation
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    var output_lt = output.to_layout_tensor()
    var input_lt = input.to_layout_tensor()
    var ln_weight_lt = ln_weight.to_layout_tensor()
    var ln_bias_lt = ln_bias.to_layout_tensor()
    var linear_weight_lt = linear_weight.to_layout_tensor()
    var linear_bias_lt = linear_bias.to_layout_tensor()

    # Step 1: Compute LayerNorm statistics once per sequence position
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    comptime for h in range(hidden_dim):
        var val = input_lt[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)

    var mean_val = sum_val / hidden_dim
    var var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
    var inv_std = 1.0 / sqrt(var_val + 1e-5)

    # Step 2: Compute all outputs for this sequence position
    comptime for out_idx in range(output_dim):
        var acc: Scalar[dtype] = 0

        comptime for h in range(hidden_dim):
            var input_val = input_lt[batch_idx, seq_idx, h]
            var normalized = (input_val - mean_val) * inv_std * rebind[
                Scalar[dtype]
            ](ln_weight_lt[h]) + rebind[Scalar[dtype]](ln_bias_lt[h])
            acc += rebind[Scalar[dtype]](
                normalized * linear_weight_lt[out_idx, h]
            )

        output_lt[batch_idx, seq_idx, out_idx] = acc + rebind[Scalar[dtype]](
            linear_bias_lt[out_idx]
        )


ํ“จ์ „ ๊ตฌํ˜„์€ ์—ฐ์‚ฐ๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค: batch_idx = block_idx.x, seq_idx = block_idx.y
  2. LayerNorm ๋‹จ๊ณ„:

    • ์‹œํ€€์Šค ์œ„์น˜์— ๋Œ€ํ•œ ํ•ฉ๊ณ„์™€ ์ œ๊ณฑํ•ฉ ๊ณ„์‚ฐ
    • ํ‰๊ท  ๊ณ„์‚ฐ: \[\Large \mu = \frac{1}{H} \sum_{i=1}^{H} x_i \]
    • ๋ถ„์‚ฐ ๊ณ„์‚ฐ: \[\Large \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 \]
    • ์—ญํ‘œ์ค€ํŽธ์ฐจ ๊ณ„์‚ฐ: \[\Large \text{inv_std} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \]
  3. Linear ๋‹จ๊ณ„:

    • ๊ฐ ์ถœ๋ ฅ ์ฐจ์›์— ๋Œ€ํ•ด:
      • ์ •๊ทœํ™”๋œ ๊ฐ’ ๊ณ„์‚ฐ: \[\Large \text{normalized} = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \]
      • ์„ ํ˜• ๊ฐ€์ค‘์น˜์™€ ๊ณฑํ•˜๊ณ  ๋ˆ„์ : \[\Large \text{acc} = \sum_{h=1}^{H} \text{normalized}h \cdot W{out,h} \]
      • ์„ ํ˜• bias ์ถ”๊ฐ€: \[\Large \text{output} = \text{acc} + b_{out} \]
    • ๊ฒฐ๊ณผ๋ฅผ output[batch_idx, seq_idx, out_idx]์— ์ €์žฅ
  4. ์„ฑ๋Šฅ ์ตœ์ ํ™”:

    • ๋‘ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋Ÿ‰ ์žฌ์‚ฌ์šฉ
    • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
    • ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

์ด ๊ตฌํ˜„์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ค„์—ฌ ์–ธํ“จ์ „ ๋ฒ„์ „๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

์ปค๋„ ํ“จ์ „์˜ ์žฅ์ 

์ด ํผ์ฆ์—์„œ LayerNorm + Linear ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•˜๋Š” ๋‘ ๊ฐ€์ง€ ๋ฐฉ์‹์„ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค:

  1. ์–ธํ“จ์ „ ๊ตฌํ˜„:

    • LayerNorm๊ณผ Linear๋ฅผ ๋ณ„๋„์˜ ์ปค๋„๋กœ ์‹คํ–‰
    • ๊ตฌํ˜„์ด ๊ฐ„๋‹จํ•˜์ง€๋งŒ ๋œ ํšจ์œจ์ 
    • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰
    • ์—ฌ๋Ÿฌ ๋ฒˆ์˜ ์ปค๋„ ์‹คํ–‰
    • ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ: 3183.57ms (GPU)
  2. ํ“จ์ „ ๊ตฌํ˜„:

    • ๋‘ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•œ ๋‹จ์ผ ์ปค๋„
    • ๋” ๋ณต์žกํ•˜์ง€๋งŒ ํ›จ์”ฌ ํšจ์œจ์ 
    • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
    • ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ: 3116.11ms (GPU)

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ตœ์ ํ™”

  1. ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ œ๊ฑฐ:

    • ์—ฐ์‚ฐ ๊ฐ„ ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
    • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ/์“ฐ๊ธฐ ๊ฐ์†Œ
    • ์„ ํ˜• ๋ณ€ํ™˜์„ ์œ„ํ•œ ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
    • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ ˆ๊ฐ๋ฅ : \[\Large \text{reduction} = \frac{\text{unfused_bandwidth} - \text{fused_bandwidth}}{\text{unfused_bandwidth}}\]
  2. ์บ์‹œ ํšจ์œจ:

    • L1/L2 ์บ์‹œ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
    • ์บ์‹œ ๋ฏธ์Šค ๊ฐ์†Œ
    • ๊ฐœ์„ ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
    • ๋” ๋†’์€ ์‚ฐ์ˆ  ๊ฐ•๋„

์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ

  1. ์ปค๋„ ์‹คํ–‰ ์ตœ์ ํ™”:

    • ์—ฌ๋Ÿฌ ๋ฒˆ ๋Œ€์‹  ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๋“œ๋ผ์ด๋ฒ„ ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ
    • ๋™๊ธฐํ™” ์ง€์  ๊ฐ์†Œ
    • ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ํšŸ์ˆ˜ ๊ฐ์†Œ
  2. ๋ฆฌ์†Œ์Šค ๊ด€๋ฆฌ:

    • ์—ฐ์‚ฐ ๊ฐ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์žฌ์‚ฌ์šฉ
    • ๋ ˆ์ง€์Šคํ„ฐ ํ™œ์šฉ๋„ ํ–ฅ์ƒ
    • ์Šค๋ ˆ๋“œ ์ ์œ ์œจ ๊ฐœ์„ 
    • GPU ํ™œ์šฉ๋ฅ  ํ–ฅ์ƒ

์„ฑ๋Šฅ ํŠน์„ฑ

  1. ํ™•์žฅ์„ฑ:

    • ์ž…๋ ฅ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ํ™•์žฅ์„ฑ ํ–ฅ์ƒ
    • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ๋ณ‘๋ชฉ ๊ฐ์†Œ
    • GPU ๋ฆฌ์†Œ์Šค์˜ ๋” ํšจ์œจ์ ์ธ ํ™œ์šฉ
    • ๋Œ€๊ทœ๋ชจ ๋ชจ๋ธ์—์„œ ์ฒ˜๋ฆฌ๋Ÿ‰ ํ–ฅ์ƒ
  2. ์ˆ˜์น˜์  ํšจ์œจ:

    • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
    • ๋ฐ˜์˜ฌ๋ฆผ ์˜ค์ฐจ ๊ฐ์†Œ
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ์˜ ์ •๋ฐ€๋„ ํ–ฅ์ƒ
    • ์ตœ์ ํ™”๋œ ์—ฐ์‚ฐ ์ˆœ์„œ

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ปค๋„ ํ“จ์ „์€ ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์˜ LayerNorm + Linear์ฒ˜๋Ÿผ ์‹ ๊ฒฝ๋ง์—์„œ ์ž์ฃผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๋Š” ์—ฐ์‚ฐ์— ํŠนํžˆ ์œ ๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅ ํฌ๊ธฐ๊ฐ€ ํฌ๊ณ  ๋ชจ๋ธ์ด ๋ณต์žกํ• ์ˆ˜๋ก ์„ฑ๋Šฅ ์ด์ ์€ ๋”์šฑ ์ปค์ง‘๋‹ˆ๋‹ค.

โ›“๏ธ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ๊ณผ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ํ“จ์ „ LayerNorm + Linear ์—ฐ์‚ฐ์˜ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค(backward pass) ๊ตฌํ˜„์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” ๋‹ค์Œ์— ๋Œ€ํ•œ ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

  • ์ž…๋ ฅ ํ…์„œ
  • LayerNorm ์Šค์ผ€์ผ (\(\gamma\))๊ณผ ์‹œํ”„ํŠธ (\(\beta\)) ํŒŒ๋ผ๋ฏธํ„ฐ
  • Linear ๋ ˆ์ด์–ด์˜ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ๊ณผ bias

๊ตฌํ˜„ํ•  ์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค (์œ ๋„ ๊ณผ์ •์˜ ์ƒ์„ธ ๋‚ด์šฉ์€ LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์˜ ์ƒ์„ธ ์œ ๋„ ์ฐธ์กฐ): \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \gamma \odot \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)}) \]

  2. Linear ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค: \[\Large \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}x^T \] \[\Large \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \] \[\Large \frac{\partial L}{\partial x} = W^T\frac{\partial L}{\partial y} \]

  3. ํ“จ์ „ ์—ฐ์‚ฐ์˜ ์—ฐ์‡„ ๋ฒ•์น™: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y_{linear}} \frac{\partial y_{linear}}{\partial y_{norm}} \frac{\partial y_{norm}}{\partial x} \] ์—ฌ๊ธฐ์„œ:

  • \(y_{norm}\)์€ LayerNorm ์ถœ๋ ฅ
  • \(y_{linear}\)์€ Linear ๋ ˆ์ด์–ด ์ถœ๋ ฅ
  • ์—ฐ์‡„ ๋ฒ•์น™์ด ๋‘ ์—ฐ์‚ฐ์„ ํ†ตํ•œ ์ ์ ˆํ•œ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์„ ๋ณด์žฅ

ํ•ต์‹ฌ ๊ฐœ๋…

  • ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก (๊ทธ๋ฆฌ๋“œ: [batch_size, seq_len])
    • ์ค‘๋ณต์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ๊ฐ ์‹œํ€€์Šค ์œ„์น˜์˜ ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ณ„์‚ฐ
    • ์›์ž์  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” ๋ณด์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ ํ…์„œ: [batch_idx, seq_idx, h]๋กœ ์ ‘๊ทผ
    • ์ถœ๋ ฅ ํ…์„œ: [batch_idx, seq_idx, out_idx]๋กœ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜: ์„ ํ˜• ๋ ˆ์ด์–ด์—์„œ [out_idx, h]๋กœ ์ ‘๊ทผ
    • ์›์ž์  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ๋ณด์žฅ
    • ์ž์ฃผ ์ ‘๊ทผํ•˜๋Š” ๋ฐ์ดํ„ฐ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
  • ์—ฐ์‚ฐ ํ๋ฆ„:

    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆœ์„œ๋กœ LayerNorm ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
    • ๋ชจ๋“  ์ถœ๋ ฅ ์ฐจ์›์— ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
    • ์ •๊ทœํ™”์™€ ์„ ํ˜• ๋ณ€ํ™˜ ๊ฒฐํ•ฉ
    • ์ „์ฒด ๊ณผ์ •์—์„œ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
    • ์—ฃ์ง€ ์ผ€์ด์Šค๋ฅผ ์ ์ ˆํžˆ ์ฒ˜๋ฆฌ
  • ์„ฑ๋Šฅ:

    • ํ†ต๊ณ„๋Ÿ‰์˜ ์ค‘๋ณต ๊ณ„์‚ฐ ๋ฐฉ์ง€
    • ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • rebind[Scalar[dtype]]๋กœ ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ
    • ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ๋ณด์žฅ
    • ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์— ์ตœ์ ํ™”

๊ตฌ์„ฑ

  • ๋ฐฐ์น˜ ํฌ๊ธฐ: BATCH_SIZE = 4
  • ์‹œํ€€์Šค ๊ธธ์ด: SEQ_LEN = 4
  • ์€๋‹‰ ์ฐจ์›: HIDDEN_DIM = 8
  • ์ถœ๋ ฅ ์ฐจ์›: OUTPUT_DIM = 16
  • ์—ก์‹ค๋ก : EPS = 1e-5
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32

๊ตฌํ˜„ (๊ณ ๊ธ‰)

ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์ปค๋„์€ LayerNorm๊ณผ Linear์˜ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ GPU ์ปค๋„๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ์‹ ์ค‘ํ•˜๊ฒŒ ๋‹ค๋ค„์•ผ ํ•˜๋Š” ๋„์ „์ ์ธ ๊ณผ์ œ์ž…๋‹ˆ๋‹ค:

  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์—์„œ์˜ ์ˆ˜์น˜ ์•ˆ์ •์„ฑ
  • ํšจ์œจ์ ์ธ GPU ํ™œ์šฉ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์—ฐ์‚ฐ ๊ฐ„ ์ ์ ˆํ•œ ๋™๊ธฐํ™”
def minimal_fused_kernel_backward[
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
    GradInputLayout: TensorLayout,
    GradLnWeightLayout: TensorLayout,
    GradLnBiasLayout: TensorLayout,
    GradWeightLayout: TensorLayout,
    GradBiasLayout: TensorLayout,
    GradOutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    LnParamsLayout: TensorLayout,
    WeightLayout: TensorLayout,
](
    grad_input: TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin],
    grad_ln_weight: TileTensor[
        mut=True, dtype, GradLnWeightLayout, MutAnyOrigin
    ],
    grad_ln_bias: TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin],
    grad_weight: TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin],
    grad_bias: TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin],
    grad_output: TileTensor[mut=False, dtype, GradOutputLayout, ImmutAnyOrigin],
    input: TileTensor[mut=False, dtype, InputLayout, ImmutAnyOrigin],
    ln_weight: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
    ln_bias: TileTensor[mut=False, dtype, LnParamsLayout, ImmutAnyOrigin],
    linear_weight: TileTensor[mut=False, dtype, WeightLayout, ImmutAnyOrigin],
):
    """Fused backward kernel using atomic operations for safe gradient accumulation.
    """
    # Grid: (batch_size, seq_len) - one thread per sequence position
    # Block: (1,) - single thread per sequence position
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    # Initialize gradient tensors to zero (block 0,0 only to avoid UB with atomic ops)
    if batch_idx == 0 and seq_idx == 0:
        # Initialize grad_ln_weight and grad_ln_bias
        comptime for h in range(hidden_dim):
            (grad_ln_weight.ptr + h).init_pointee_copy(0)
            (grad_ln_bias.ptr + h).init_pointee_copy(0)

        # Initialize grad_weight and grad_bias
        comptime for out_idx in range(output_dim):
            (grad_bias.ptr + out_idx).init_pointee_copy(0)

            comptime for h in range(hidden_dim):
                (grad_weight.ptr + out_idx * hidden_dim + h).init_pointee_copy(
                    0
                )

    # Note: We cannot use barrier() here as it only synchronizes within a block.
    # The atomic operations will handle synchronization across blocks.

    # Step 1: Recompute forward pass statistics (needed for gradients)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    # FILL IN roughly 8 lines

    # Step 2: Atomically accumulate gradients w.r.t. linear bias

    # FILL IN roughly 4 lines

    # Step 3: Atomically accumulate gradients w.r.t. linear weight
    # Make sure to use the correct atomic operation to avoid race conditions

    # FILL IN roughly 10 lines

    # Step 4: Atomically accumulate gradients w.r.t. LayerNorm parameters

    # FILL IN roughly 10 lines

    # Step 5: Compute gradients w.r.t. input (LayerNorm backward)
    # Compute sum terms needed for LayerNorm backward
    # Make sure to use the correct atomic operation to avoid race conditions

    # FILL IN roughly 12 lines

    # Compute actual input gradients (no race conditions here - each thread writes to different positions)

    # FILL IN roughly 10 lines


ํ•ต์‹ฌ ์ตœ์ ํ™”:

  • ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
  • ์•ˆ์ „ํ•œ ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
  • ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
ํŒ
  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ:

    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก
    • ์‹œํ€€์Šค ์œ„์น˜๋‹น ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    • ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ณ„์‚ฐ
  2. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ:

    • ์ž…๋ ฅ/์ถœ๋ ฅ ํ…์„œ์— ๋Œ€ํ•œ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— ๋Œ€ํ•œ stride ์ ‘๊ทผ
    • ์›์ž์  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ ์ ˆํ•œ ์ •๋ ฌ
  3. ์—ฐ์‚ฐ ํ๋ฆ„:

    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆœ์„œ๋กœ ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
    • ์ •๊ทœํ™”๋œ ๊ฐ’ ์žฌ์‚ฌ์šฉ
    • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
  4. ์„ฑ๋Šฅ:

    • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • ์ ์ ˆํ•œ ํƒ€์ž… ์บ์ŠคํŒ… ์‚ฌ์šฉ
    • ์ ์ ˆํ•œ ์ •๋ ฌ ๋ณด์žฅ

์ฝ”๋“œ ์‹คํ–‰

ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ๋‹ค์Œ์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p22 --backward
pixi run -e amd p22 --backward
uv run poe p22 --backward

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

Testing with dimensions: [4, 4, 8] -> [4, 4, 16]
โœ… Loaded Mojo operations library
============================================================
           Comprehensive Backward Pass Test
           Testing Custom LayerNorm + Linear Gradients
============================================================
Testing with dimensions: [4, 4, 8] -> [4, 4, 16]

Testing CPU Backward Pass:

Testing CPU Backward Implementation - Backward Pass
---------------------------------------------------------
   Computing PyTorch autograd reference...
   Computing Mojo backward implementation (CPU)...
โœ… CPU Backward Implementation backward completed
   Forward max difference: 1.49e-08
   grad_input: 2.98e-08 โœ…
   grad_ln_weight: 5.96e-08 โœ…
   grad_ln_bias: 2.38e-07 โœ…
   grad_linear_weight: 9.54e-07 โœ…
   grad_linear_bias: 0.00e+00 โœ…

   Forward pass: โœ… CORRECT
   Gradients:    โœ… CORRECT
   Overall:      โœ… CORRECT

Testing GPU Backward Pass:

Testing GPU Backward Implementation - Backward Pass
---------------------------------------------------------
   Computing PyTorch autograd reference...
   Computing Mojo backward implementation (GPU)...

โœ… GPU Backward Implementation backward completed
   Forward max difference: 1.86e-08
   grad_input: 4.47e-08 โœ…
   grad_ln_weight: 5.96e-08 โœ…
   grad_ln_bias: 3.58e-07 โœ…
   grad_linear_weight: 9.54e-07 โœ…
   grad_linear_bias: 0.00e+00 โœ…

   Forward pass: โœ… CORRECT
   Gradients:    โœ… CORRECT
   Overall:      โœ… CORRECT

Backward Pass Test Summary:
   - CPU Backward:  โœ… CORRECT
   - GPU Backward:  โœ… CORRECT

   Overall Result: โœ… ALL CORRECT

BACKWARD PASS Test Completed!

์†”๋ฃจ์…˜

def minimal_fused_kernel_backward[
    batch_size: Int,
    seq_len: Int,
    hidden_dim: Int,
    output_dim: Int,
    GradInputLayout: TensorLayout,
    GradLnWeightLayout: TensorLayout,
    GradLnBiasLayout: TensorLayout,
    GradWeightLayout: TensorLayout,
    GradBiasLayout: TensorLayout,
    GradOutputLayout: TensorLayout,
    InputLayout: TensorLayout,
    LnParamsLayout: TensorLayout,
    WeightLayout: TensorLayout,
    dtype: DType = DType.float32,
](
    grad_input: TileTensor[mut=True, dtype, GradInputLayout, MutAnyOrigin],
    grad_ln_weight: TileTensor[
        mut=True, dtype, GradLnWeightLayout, MutAnyOrigin
    ],
    grad_ln_bias: TileTensor[mut=True, dtype, GradLnBiasLayout, MutAnyOrigin],
    grad_weight: TileTensor[mut=True, dtype, GradWeightLayout, MutAnyOrigin],
    grad_bias: TileTensor[mut=True, dtype, GradBiasLayout, MutAnyOrigin],
    grad_output: TileTensor[mut=True, dtype, GradOutputLayout, MutAnyOrigin],
    input: TileTensor[mut=True, dtype, InputLayout, MutAnyOrigin],
    ln_weight: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
    ln_bias: TileTensor[mut=True, dtype, LnParamsLayout, MutAnyOrigin],
    linear_weight: TileTensor[mut=True, dtype, WeightLayout, MutAnyOrigin],
):
    """Fused backward kernel using atomic operations for safe gradient accumulation.
    """
    # Grid: (batch_size, seq_len) - one thread per sequence position
    # Block: (1,) - single thread per sequence position
    var batch_idx = block_idx.x
    var seq_idx = block_idx.y

    if batch_idx >= batch_size or seq_idx >= seq_len:
        return

    var grad_input_lt = grad_input.to_layout_tensor()
    var grad_ln_weight_lt = grad_ln_weight.to_layout_tensor()
    var grad_ln_bias_lt = grad_ln_bias.to_layout_tensor()
    var grad_weight_lt = grad_weight.to_layout_tensor()
    var grad_bias_lt = grad_bias.to_layout_tensor()
    var grad_output_lt = grad_output.to_layout_tensor()
    var input_lt = input.to_layout_tensor()
    var ln_weight_lt = ln_weight.to_layout_tensor()
    var ln_bias_lt = ln_bias.to_layout_tensor()
    var linear_weight_lt = linear_weight.to_layout_tensor()

    # Initialize gradient tensors to zero (block 0,0 only to avoid UB with atomic ops)
    if batch_idx == 0 and seq_idx == 0:
        # Initialize grad_ln_weight and grad_ln_bias
        comptime for h in range(hidden_dim):
            (grad_ln_weight.ptr + h).init_pointee_copy(0)
            (grad_ln_bias.ptr + h).init_pointee_copy(0)

        # Initialize grad_weight and grad_bias
        comptime for out_idx in range(output_dim):
            (grad_bias.ptr + out_idx).init_pointee_copy(0)

            comptime for h in range(hidden_dim):
                (grad_weight.ptr + out_idx * hidden_dim + h).init_pointee_copy(
                    0
                )

    # Note: We cannot use barrier() here as it only synchronizes within a block.
    # The atomic operations will handle synchronization across blocks.

    # Step 1: Recompute forward pass statistics (needed for gradients)
    var sum_val: Scalar[dtype] = 0
    var sq_sum: Scalar[dtype] = 0

    comptime for h in range(hidden_dim):
        var val = input_lt[batch_idx, seq_idx, h]
        sum_val += rebind[Scalar[dtype]](val)
        sq_sum += rebind[Scalar[dtype]](val * val)

    var mean_val = sum_val / hidden_dim
    var var_val = (sq_sum / hidden_dim) - (mean_val * mean_val)
    var inv_std = 1.0 / sqrt(var_val + 1e-5)

    # Step 2: Atomically accumulate gradients w.r.t. linear bias
    comptime for out_idx in range(output_dim):
        var grad_bias_ptr = grad_bias.ptr + out_idx
        _ = Atomic[dtype].fetch_add(
            grad_bias_ptr,
            rebind[Scalar[dtype]](grad_output_lt[batch_idx, seq_idx, out_idx]),
        )

    # Step 3: Atomically accumulate gradients w.r.t. linear weight
    comptime for out_idx in range(output_dim):
        comptime for h in range(hidden_dim):
            var input_val = input_lt[batch_idx, seq_idx, h]
            var normalized = (input_val - mean_val) * inv_std
            var ln_output_val = normalized * rebind[Scalar[dtype]](
                ln_weight_lt[h]
            ) + rebind[Scalar[dtype]](ln_bias_lt[h])

            # Atomic gradient accumulation for linear weight
            var grad_w = (
                grad_output_lt[batch_idx, seq_idx, out_idx] * ln_output_val
            )
            var grad_weight_ptr = grad_weight.ptr + out_idx * hidden_dim + h
            _ = Atomic.fetch_add(grad_weight_ptr, rebind[Scalar[dtype]](grad_w))

    # Step 4: Atomically accumulate gradients w.r.t. LayerNorm parameters
    comptime for h in range(hidden_dim):
        input_val = input_lt[batch_idx, seq_idx, h]
        normalized = (input_val - mean_val) * inv_std

        # Compute gradient w.r.t. LayerNorm output for this h
        var grad_ln_out: Scalar[dtype] = 0

        comptime for out_idx in range(output_dim):
            grad_ln_out = grad_ln_out + rebind[Scalar[dtype]](
                grad_output_lt[batch_idx, seq_idx, out_idx]
                * linear_weight_lt[out_idx, h]
            )

        # Atomic accumulation of LayerNorm parameter gradients
        var grad_ln_weight_ptr = grad_ln_weight.ptr + h
        var grad_ln_bias_ptr = grad_ln_bias.ptr + h
        _ = Atomic[dtype].fetch_add(
            grad_ln_weight_ptr, rebind[Scalar[dtype]](grad_ln_out * normalized)
        )
        _ = Atomic[dtype].fetch_add(
            grad_ln_bias_ptr, rebind[Scalar[dtype]](grad_ln_out)
        )

    # Step 5: Compute gradients w.r.t. input (LayerNorm backward)
    # Compute sum terms needed for LayerNorm backward
    var sum_grad_normalized: Scalar[dtype] = 0
    var sum_grad_normalized_times_normalized: Scalar[dtype] = 0

    comptime for h in range(hidden_dim):
        h_input_val = input_lt[batch_idx, seq_idx, h]
        h_normalized = (h_input_val - mean_val) * inv_std

        var h_grad_ln_out: Scalar[dtype] = 0

        comptime for out_idx in range(output_dim):
            h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]](
                grad_output_lt[batch_idx, seq_idx, out_idx]
                * linear_weight_lt[out_idx, h]
            )

        h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight_lt[h])
        sum_grad_normalized = sum_grad_normalized + rebind[Scalar[dtype]](
            h_grad_norm
        )
        sum_grad_normalized_times_normalized = (
            sum_grad_normalized_times_normalized
            + rebind[Scalar[dtype]](h_grad_norm * h_normalized)
        )

    # Compute actual input gradients (no race conditions here - each thread writes to different positions)
    comptime for h in range(hidden_dim):
        h_input_val = input_lt[batch_idx, seq_idx, h]
        h_normalized = (h_input_val - mean_val) * inv_std

        var h_grad_ln_out: Scalar[dtype] = 0

        comptime for out_idx in range(output_dim):
            h_grad_ln_out = h_grad_ln_out + rebind[Scalar[dtype]](
                grad_output_lt[batch_idx, seq_idx, out_idx]
                * linear_weight_lt[out_idx, h]
            )

        h_grad_norm = h_grad_ln_out * rebind[Scalar[dtype]](ln_weight_lt[h])
        grad_input_lt[batch_idx, seq_idx, h] = inv_std * (
            h_grad_norm
            - (sum_grad_normalized / hidden_dim)
            - (h_normalized * sum_grad_normalized_times_normalized / hidden_dim)
        )
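
์†”๋ฃจ์…˜์˜ Step 2–4๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” "์œ„์น˜๋ณ„ ๊ธฐ์—ฌ๋ถ„์˜ ์›์ž์  ๋ˆ„์ "์€, ์›์ž์  ๋ง์…ˆ์„ ์ˆœ์ฐจ ๋ฃจํ”„์˜ `+=`๋กœ ๋Œ€์‘์‹œํ‚ค๋ฉด NumPy๋กœ ํ‰๋‚ด ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ln_out(LayerNorm ์ถœ๋ ฅ)์ด ์ด๋ฏธ ์žฌ๊ณ„์‚ฐ๋˜์—ˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ ์ฐธ๊ณ ์šฉ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

# ๊ฐ€์ •: ln_out์€ ์žฌ๊ณ„์‚ฐ๋œ LayerNorm ์ถœ๋ ฅ์ด๋ฉฐ, ์ˆœ์ฐจ ๋ฃจํ”„์˜ +=๊ฐ€
# ์ปค๋„์˜ Atomic.fetch_add์— ๋Œ€์‘ํ•ฉ๋‹ˆ๋‹ค.
B, S, H, O = 4, 4, 8, 16
rng = np.random.default_rng(1)
ln_out = rng.standard_normal((B, S, H))
grad_out = rng.standard_normal((B, S, O))

grad_W = np.zeros((O, H))
grad_b = np.zeros(O)
for bi in range(B):
    for si in range(S):                    # ์ปค๋„์—์„œ๋Š” (bi, si)๋งˆ๋‹ค ํ•œ ์Šค๋ ˆ๋“œ
        grad_b += grad_out[bi, si]         # Step 2: bias ๊ธฐ์šธ๊ธฐ ๋ˆ„์ 
        grad_W += np.outer(grad_out[bi, si], ln_out[bi, si])  # Step 3

# ๋ฒกํ„ฐํ™”๋œ ๊ธฐ์ค€๊ฐ’๊ณผ ์ผ์น˜ ํ™•์ธ
ref_W = np.einsum('bso,bsh->oh', grad_out, ln_out)
ref_b = grad_out.sum(axis=(0, 1))
print(np.allclose(grad_W, ref_W) and np.allclose(grad_b, ref_b))  # → True
```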


ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์€ ์—ฐ์‚ฐ๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

  1. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ:

    • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: [batch_size, seq_len]์œผ๋กœ ์‹œํ€€์Šค ์œ„์น˜๋‹น ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก
    • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค: batch_idx = block_idx.x, seq_idx = block_idx.y
    • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ:
      • ์ž…๋ ฅ ํ…์„œ: [batch_size, seq_len, hidden_dim]
      • ์ถœ๋ ฅ ํ…์„œ: [batch_size, seq_len, output_dim]
      • ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ: [output_dim, hidden_dim]
      • ๊ธฐ์šธ๊ธฐ: ์ž…๋ ฅ ๊ธฐ์šธ๊ธฐ์šฉ [batch_size, seq_len, hidden_dim]
      • ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ์šธ๊ธฐ: LayerNorm์šฉ [hidden_dim], Linear์šฉ [output_dim, hidden_dim]
  2. LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋‹จ๊ณ„:

    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆœ์„œ๋กœ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค ํ†ต๊ณ„๋Ÿ‰์„ ์žฌ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:
      • ํ‰๊ท : \[\Large \mu = \frac{1}{H} \sum_{i=1}^{H} x_i \]
      • ๋ถ„์‚ฐ: \[\Large \sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2 \]
      • ์—ญํ‘œ์ค€ํŽธ์ฐจ: \[\Large \text{inv_std} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} \]
    • ์ •๊ทœํ™”๋œ ๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2
      • \epsilon}} \]
    • ๊ธฐ์šธ๊ธฐ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:
      • ์ž…๋ ฅ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \gamma \odot \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)}) \]
      • ์Šค์ผ€์ผ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial \gamma} = \sum_{i=1}^{H} \frac{\partial L}{\partial y_i} \odot \hat{x}_i \]
      • ์‹œํ”„ํŠธ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial \beta} = \sum_{i=1}^{H} \frac{\partial L}{\partial y_i} \]
  3. Linear ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๋‹จ๊ณ„:

    • ๊ฐ ์ถœ๋ ฅ ์ฐจ์›์— ๋Œ€ํ•ด:
      • Bias ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \]
      • ๊ฐ€์ค‘์น˜ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}x^T \]
      • ์ž…๋ ฅ ๊ธฐ์šธ๊ธฐ: \[\Large \frac{\partial L}{\partial x} = W^T\frac{\partial L}{\partial y} \]
    • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ ์‚ฌ์šฉ:
      • Bias ๊ธฐ์šธ๊ธฐ์— ์ ์ ˆํ•œ ์ •๋ ฌ๋กœ atomic_add ์‚ฌ์šฉ
      • ๊ฐ€์ค‘์น˜ ๊ธฐ์šธ๊ธฐ์— ์ ์ ˆํ•œ ์ •๋ ฌ๋กœ atomic_add ์‚ฌ์šฉ
      • LayerNorm ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ์šธ๊ธฐ์— ์ ์ ˆํ•œ ์ •๋ ฌ๋กœ atomic_add ์‚ฌ์šฉ
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

    • ์ž…๋ ฅ/์ถœ๋ ฅ ํ…์„œ์— ๋Œ€ํ•œ ๋ณ‘ํ•ฉ ์ ‘๊ทผ
    • ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์— ๋Œ€ํ•œ stride ์ ‘๊ทผ
    • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์„ ์œ„ํ•œ ์›์ž์  ์—ฐ์‚ฐ
    • ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
    • ์ž์ฃผ ์ ‘๊ทผํ•˜๋Š” ๊ฐ’์„ ์œ„ํ•œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ
    • ๋ชจ๋“  ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ
  5. ์ˆ˜์น˜ ์•ˆ์ •์„ฑ:

    • ๋ถ„๋ชจ์˜ ์—ก์‹ค๋ก  ์ฒ˜๋ฆฌ์— ์ฃผ์˜
    • ๊ธฐ์šธ๊ธฐ์˜ ์ ์ ˆํ•œ ์Šค์ผ€์ผ๋ง
    • ์•ˆ์ •์ ์ธ ํ†ต๊ณ„๋Ÿ‰ ๊ณ„์‚ฐ
    • rebind[Scalar[dtype]]๋กœ ํƒ€์ž… ์บ์ŠคํŒ…
    • ์—ฃ์ง€ ์ผ€์ด์Šค์˜ ์ ์ ˆํ•œ ์ฒ˜๋ฆฌ
    • ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์—ฐ์‚ฐ ์ˆœ์„œ ์œ ์ง€
  6. ์„ฑ๋Šฅ ์ตœ์ ํ™”:

    • ๋ชจ๋“  ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋‹จ์ผ ์ปค๋„ ์‹คํ–‰
    • ๊ณ„์‚ฐ๋œ ํ†ต๊ณ„๋Ÿ‰ ์žฌ์‚ฌ์šฉ
    • ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ตœ์†Œํ™”
    • ์ค‘๊ฐ„ ํ…์„œ ํ• ๋‹น ๋ถˆํ•„์š”
    • ํšจ์œจ์ ์ธ ์Šค๋ ˆ๋“œ ํ™œ์šฉ
    • ๋™๊ธฐํ™” ์ง€์  ๊ฐ์†Œ
    • ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
    • ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ
  7. ๊ตฌํ˜„ ์„ธ๋ถ€ ์‚ฌํ•ญ:

    • ์ปดํŒŒ์ผ ํƒ€์ž„ ์ƒ์ˆ˜๋ฅผ ์œ„ํ•œ @parameter ์‚ฌ์šฉ
    • ํ…์„œ ์ฐจ์›์˜ ์ ์ ˆํ•œ ์ฒ˜๋ฆฌ
    • ํšจ์œจ์ ์ธ ํƒ€์ž… ์บ์ŠคํŒ…๊ณผ ๋ณ€ํ™˜
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์‹ ์ค‘ํ•œ ๊ด€๋ฆฌ
    • ์—ฐ์‚ฐ ๊ฐ„ ์ ์ ˆํ•œ ๋™๊ธฐํ™”
    • ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ์™€ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
    • PyTorch ์˜คํ† ๊ทธ๋ž˜๋“œ ์‹œ์Šคํ…œ๊ณผ์˜ ํ†ตํ•ฉ

์ด ๊ตฌํ˜„์€ ๋‹ค์Œ์„ ํ†ตํ•ด ์–ธํ“จ์ „ ๋ฒ„์ „๋ณด๋‹ค ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • ์ปค๋„ ํ“จ์ „์„ ํ†ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰ ์ ˆ๊ฐ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”
  • GPU ๋ฆฌ์†Œ์Šค์˜ ํšจ์œจ์  ํ™œ์šฉ
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ์œ ์ง€
  • ๊ธฐ์šธ๊ธฐ ๋ˆ„์ ์˜ ์ ์ ˆํ•œ ์ฒ˜๋ฆฌ
  • ์ ์ ˆํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ •๋ ฌ ๋ณด์žฅ
  • ํšจ์œจ์ ์ธ ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ

ํ“จ์ „ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” LayerNorm + Linear ์—ฐ์‚ฐ์ด ์ž์ฃผ ํ•จ๊ป˜ ์‚ฌ์šฉ๋˜๋Š” ํŠธ๋žœ์Šคํฌ๋จธ ์•„ํ‚คํ…์ฒ˜์—์„œ ํŠนํžˆ ์ค‘์š”ํ•˜๋ฉฐ, ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์ƒ๋‹นํ•œ ์„ฑ๋Šฅ ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๊ณ ๋ ค ์‚ฌํ•ญ

์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ตฌํ˜„์€ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ์ ํ™”๋œ torch.compile์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

# Compilation configuration
torch._dynamo.config.cache_size_limit = 64  # Increase cache
torch._dynamo.config.suppress_errors = True  # Handle errors gracefully
torch._dynamo.config.automatic_dynamic_shapes = True  # Dynamic shapes

์ด๋Ÿฌํ•œ ์ตœ์ ํ™”๊ฐ€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์—์„œ ํŠนํžˆ ์ค‘์š”ํ•œ ์ด์œ ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  • ์ž‘์€ ํ…์„œ ์—ฐ์‚ฐ์€ ์ปดํŒŒ์ผ ์บ์‹ฑ์˜ ์ด์ ์„ ๋ฐ›์Šต๋‹ˆ๋‹ค
  • ๋™์  ํ˜•์ƒ์€ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์—์„œ ํ”ํ•˜๊ฒŒ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์—๋Š” ๊ฐ•๊ฑดํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์บ์‹œ ํฌ๊ธฐ๋Š” ๋ฐ˜๋ณต์ ์ธ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ์—ฐ์‚ฐ์— ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค
  • ์ ์ ˆํ•œ ์˜ค๋ฅ˜ ์ฒ˜๋ฆฌ๋Š” ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์— ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ๋Š” ํ•™์Šต ์‹œ๊ฐ„์— ํฐ ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” ์ •ํ™•์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ์ปดํŒŒ์ผ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด reduce-overhead ๋ชจ๋“œ๋กœ ์ปดํŒŒ์ผ๋ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ํŠนํžˆ ์ค‘์š”ํ•œ ์ด์œ ๋Š”:

  • ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๋Š” ํ•™์Šต ์ค‘์— ๋นˆ๋ฒˆํ•˜๊ฒŒ ํ˜ธ์ถœ๋ฉ๋‹ˆ๋‹ค
  • ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐ์€ ์ˆ˜์น˜์ ์œผ๋กœ ์•ˆ์ •์ ์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์ตœ์ ํ™”๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ์›์ž์  ์—ฐ์‚ฐ์—๋Š” ์ ์ ˆํ•œ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  • ์˜คํ† ๊ทธ๋ž˜๋“œ ํ†ตํ•ฉ์ด ํšจ์œจ์ ์ด์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

LayerNorm ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค์˜ ์ƒ์„ธ ์œ ๋„

LayerNorm์˜ ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค ๊ธฐ์šธ๊ธฐ๋Š” ์—ฐ์‡„ ๋ฒ•์น™์„ ์ฃผ์˜ ๊นŠ๊ฒŒ ์ ์šฉํ•˜์—ฌ ์œ ๋„๋ฉ๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„ ์œ ๋„ ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค ์—ฐ์‚ฐ

  • ํ‰๊ท : \(\mu = \frac{1}{H} \sum_{i=1}^{H} x_i\)
  • ๋ถ„์‚ฐ: \(\sigma^2 = \frac{1}{H} \sum_{i=1}^{H} (x_i - \mu)^2\)
  • ์ •๊ทœํ™”๋œ ๊ฐ’: \(\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\)
  • ์ตœ์ข… ์ถœ๋ ฅ: \(y = \gamma \odot \hat{x} + \beta\)
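์œ„ ๋„ค ๋‹จ๊ณ„๋ฅผ NumPy๋กœ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ธฐ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค. H=8, \(\gamma = 1\), \(\beta = 0\)์€ ์„ค๋ช…์šฉ์œผ๋กœ ์ž„์˜๋กœ ๊ณ ๋ฅธ ๊ฐ’์ž…๋‹ˆ๋‹ค:

```python
import numpy as np

# ๊ฐ€์ •: H=8, gamma=1, beta=0์€ ์„ค๋ช…์šฉ์œผ๋กœ ์ž„์˜๋กœ ๊ณ ๋ฅธ ๊ฐ’์ž…๋‹ˆ๋‹ค.
H, eps = 8, 1e-5
rng = np.random.default_rng(2)
x = rng.standard_normal(H)
gamma, beta = np.ones(H), np.zeros(H)

mu = x.mean()                          # ํ‰๊ท 
var = ((x - mu) ** 2).mean()           # ๋ถ„์‚ฐ
x_hat = (x - mu) / np.sqrt(var + eps)  # ์ •๊ทœํ™”๋œ ๊ฐ’
y = gamma * x_hat + beta               # ์ตœ์ข… ์ถœ๋ ฅ

# ์ •๊ทœํ™” ๊ฒฐ๊ณผ: ํ‰๊ท  ≈ 0, ํ‘œ์ค€ํŽธ์ฐจ ≈ 1
print(abs(x_hat.mean()) < 1e-9, abs(x_hat.std() - 1.0) < 1e-3)  # → True True
```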

์—ฐ์‡„ ๋ฒ•์น™ ์ ์šฉ

\(\frac{\partial L}{\partial x}\)๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ์—ฐ์‡„ ๋ฒ•์น™์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial \hat{x}} \frac{\partial \hat{x}}{\partial x}\]

๊ธฐ์šธ๊ธฐ ๊ตฌ์„ฑ ์š”์†Œ

์ถœ๋ ฅ์—์„œ ์ •๊ทœํ™”๋œ ๊ฐ’์œผ๋กœ

  • \(\frac{\partial y}{\partial \hat{x}} = \gamma\) (์š”์†Œ๋ณ„ ๊ณฑ์…ˆ)

์ •๊ทœํ™”๋œ ๊ฐ’์—์„œ ์ž…๋ ฅ์œผ๋กœ

๊ธฐ์šธ๊ธฐ \(\frac{\partial \hat{x}}{\partial x}\)์—๋Š” ์„ธ ๊ฐ€์ง€ ๊ตฌ์„ฑ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ถ„์ž๋ฅผ ํ†ตํ•œ ์ง์ ‘์  ํšจ๊ณผ: \(\frac{1}{\sqrt{\sigma^2 + \epsilon}}\)
  • ํ‰๊ท ์„ ํ†ตํ•œ ๊ฐ„์ ‘์  ํšจ๊ณผ: \(-\frac{1}{H} \frac{1}{\sqrt{\sigma^2 + \epsilon}}\)
  • ๋ถ„์‚ฐ์„ ํ†ตํ•œ ๊ฐ„์ ‘์  ํšจ๊ณผ: \(-\frac{(x - \mu)}{H(\sigma^2 + \epsilon)^{3/2}} (x
    • \mu)\)

ํ•ญ ๊ฒฐํ•ฉ

์ •๊ทœํ™” ํ•ญ์„ ํ†ตํ•œ ๊ธฐ์šธ๊ธฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •๋ฆฌ๋ฉ๋‹ˆ๋‹ค: \[\Large \frac{\partial \hat{x}}{\partial x} = \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)})\]

์ตœ์ข… ๊ธฐ์šธ๊ธฐ ํ‘œํ˜„์‹

๋ชจ๋“  ํ•ญ์„ ๊ฒฐํ•ฉํ•˜๋ฉด: \[\Large \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \odot \gamma \odot \frac{1}{\sqrt{\sigma^2 + \epsilon}} (1 - \frac{1}{H} - \frac{(x - \mu)^2}{H(\sigma^2 + \epsilon)})\]

ํ•ต์‹ฌ ํ†ต์ฐฐ

  • ์—ฐ์‡„ ๋ฒ•์น™์€ x๊ฐ€ ์ถœ๋ ฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๋ชจ๋“  ๊ฒฝ๋กœ๋ฅผ ๊ณ ๋ คํ•ฉ๋‹ˆ๋‹ค
  • ์ •๊ทœํ™” ํ•ญ \(\sqrt{\sigma^2 + \epsilon}\)์€ ๋ถ„์ž์™€ ๋ถ„๋ชจ ๋ชจ๋‘์— ๋“ฑ์žฅํ•ฉ๋‹ˆ๋‹ค
  • ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ ํ•ญ์€ ๊ธฐ์šธ๊ธฐ ํ๋ฆ„์˜ ์ถ”๊ฐ€ ๊ฒฝ๋กœ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์ข… ํ‘œํ˜„์‹์€ ๋ชจ๋“  ํšจ๊ณผ๋ฅผ ํ•˜๋‚˜์˜ ํšจ์œจ์ ์ธ ๊ณ„์‚ฐ์œผ๋กœ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค

๊ตฌํ˜„ ์‹œ ๊ณ ๋ ค ์‚ฌํ•ญ

  • ๊ธฐ์šธ๊ธฐ๊ฐ€ \(\gamma\)์˜ ์Šค์ผ€์ผ๋ง ํšจ๊ณผ๋ฅผ ์ ์ ˆํžˆ ๋ฐ˜์˜ํ•ฉ๋‹ˆ๋‹ค
  • ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์˜ ์ •๊ทœํ™” ํšจ๊ณผ๊ฐ€ ๋ณด์กด๋ฉ๋‹ˆ๋‹ค
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ ํ•ญ \(\epsilon\)์ด ์œ ์ง€๋ฉ๋‹ˆ๋‹ค
  • ๊ธฐ์šธ๊ธฐ๊ฐ€ ์€๋‹‰ ์ฐจ์› H ์ „์ฒด์— ๊ฑธ์ณ ์ ์ ˆํžˆ ์Šค์ผ€์ผ๋ง๋ฉ๋‹ˆ๋‹ค
  • ์ˆ˜์น˜ ์•ˆ์ •์„ฑ์„ ์œ„ํ•ด ์—ฐ์‚ฐ ์ˆœ์„œ๊ฐ€ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค

์ด ์œ ๋„๋ฅผ ํ†ตํ•ด ์—ญ๋ฐฉํ–ฅ ํŒจ์Šค๊ฐ€ ์ˆœ๋ฐฉํ–ฅ ํŒจ์Šค์™€ ๋™์ผํ•œ ์ˆ˜์น˜์  ํŠน์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ํ•„์š”ํ•œ ๋ชจ๋“  ๊ธฐ์šธ๊ธฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

Puzzle 23: GPU ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด

๊ฐœ์š”

Part VI: ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” GPU ์—ฐ์‚ฐ์„ ์œ„ํ•œ Mojo์˜ ๊ณ ์ˆ˜์ค€ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ๋ฒกํ„ฐํ™”, ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”, ์„ฑ๋Šฅ ํŠœ๋‹์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ํ•จ์ˆ˜ํ˜• ์ ‘๊ทผ ๋ฐฉ์‹์„ ๋ฐฐ์šฐ๋ฉฐ, ์ˆ˜๋™ GPU ์ปค๋„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์šฐ์•„ํ•จ์„ ํฌ๊ธฐํ•  ํ•„์š”๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค - Mojo์˜ ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์€ ๋‘ ๊ฐ€์ง€๋ฅผ ๋ชจ๋‘ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

GPU ์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ

GPU ์Šค๋ ˆ๋“œ์™€ SIMD ์—ฐ์‚ฐ ์‚ฌ์ด์˜ ๊ทผ๋ณธ์ ์ธ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU Device
โ”œโ”€โ”€ Grid (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ Block 1 (์Šค๋ ˆ๋“œ ๊ทธ๋ฃน, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
โ”‚   โ”‚   โ”œโ”€โ”€ Warp 1 (32๊ฐœ ์Šค๋ ˆ๋“œ, ๋ก์Šคํ… ์‹คํ–‰) --> Part VI์—์„œ ํ•™์Šต
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 1 โ†’ SIMD
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 2 โ†’ SIMD
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ... (์ด 32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ”‚   โ””โ”€โ”€ Warp 2 (32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ Block 2 (๋…๋ฆฝ์ ์ธ ๊ทธ๋ฃน)

Mojo๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ๋“ค:

  • ๊ทธ๋ฆฌ๋“œ/๋ธ”๋ก ๊ตฌ์„ฑ ์ž๋™ ๊ณ„์‚ฐ
  • ์›Œํ”„ ๊ด€๋ฆฌ์˜ ํˆฌ๋ช…ํ•œ ์ฒ˜๋ฆฌ
  • ์Šค๋ ˆ๋“œ ์Šค์ผ€์ค„๋ง ์ž๋™ ์ตœ์ ํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™” ๋‚ด์žฅ

๐Ÿ’ก ์ฐธ๊ณ : ์ด Part๋Š” ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ์œผ๋ฉฐ, ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๊ณ ๊ธ‰ GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋Š” Part VII ์—์„œ ์ž์„ธํžˆ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

๋„ค ๊ฐ€์ง€ ๊ธฐ๋ณธ ํŒจํ„ด

GPU ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•ต์‹ฌ ํŒจํ„ด์„ ๋ชจ๋‘ ๋‹ค๋ฃน๋‹ˆ๋‹ค:

  1. Elementwise: ์ž๋™ SIMD ๋ฒกํ„ฐํ™”๋ฅผ ํ†ตํ•œ ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ
  2. Tiled: ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ํ™œ์šฉํ•œ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ด ๋†’์€ ์ฒ˜๋ฆฌ
  3. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”: SIMD ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ œ์–ด
  4. Mojo vectorize: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•œ ์•ˆ์ „ํ•œ ์ž๋™ ๋ฒกํ„ฐํ™”

ํ•œ๋ˆˆ์— ๋ณด๋Š” ์„ฑ๋Šฅ ํŒจํ„ด

๋ฌธ์ œ: 1024๊ฐœ ์š”์†Œ์˜ ๋ฒกํ„ฐ ๋‘ ๊ฐœ ๋”ํ•˜๊ธฐ (SIZE=1024, SIMD_WIDTH=4)

Elementwise:     256 ์Šค๋ ˆ๋“œ ร— 1 SIMD ์—ฐ์‚ฐ   = ๋†’์€ ๋ณ‘๋ ฌ์„ฑ
Tiled:           32 ์Šค๋ ˆ๋“œ  ร— 8 SIMD ์—ฐ์‚ฐ  = ์บ์‹œ ์ตœ์ ํ™”
Manual:          8 ์Šค๋ ˆ๋“œ   ร— 32 SIMD ์—ฐ์‚ฐ = ์ตœ๋Œ€ ์ œ์–ด
Mojo vectorize:  32 ์Šค๋ ˆ๋“œ  ร— 8 SIMD ์—ฐ์‚ฐ  = ์ž๋™ ์•ˆ์ „์„ฑ
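์œ„ ํ‘œ์˜ ์Šค๋ ˆ๋“œ ์ˆ˜๋Š” "์ „์ฒด ์š”์†Œ ์ˆ˜ = ์Šค๋ ˆ๋“œ ์ˆ˜ × ์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ ์ˆ˜ × SIMD ํญ"์ด๋ผ๋Š” ๊ด€๊ณ„์—์„œ ๋‚˜์˜ต๋‹ˆ๋‹ค. ๊ฐ„๋‹จํ•œ Python ๊ณ„์‚ฐ์œผ๋กœ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

```python
# ๊ฐ€์ •: ์œ„ ํ‘œ์˜ ์ˆ˜์น˜(SIZE=1024, SIMD_WIDTH=4)๋ฅผ ๊ทธ๋Œ€๋กœ ๋Œ€์ž…ํ•œ ์Šค์ผ€์น˜.
SIZE, SIMD_WIDTH = 1024, 4
elementwise_threads = SIZE // SIMD_WIDTH        # ์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ 1ํšŒ
tiled_threads = SIZE // (SIMD_WIDTH * 8)        # ์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ 8ํšŒ
manual_threads = SIZE // (SIMD_WIDTH * 32)      # ์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ 32ํšŒ
print(elementwise_threads, tiled_threads, manual_threads)  # → 256 32 8
```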

๐Ÿ“Š ์‹ค์ œ ์„ฑ๋Šฅ ๋ถ„์„

์‹ค์ฆ์  ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ๋ฅผ ํ•ด์„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค:

๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ (SIZE=1,048,576):
elementwise:        11.34ms  โ† ๋Œ€๊ทœ๋ชจ์—์„œ ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ์ด ์œ ๋ฆฌ
tiled:              12.04ms  โ† ์ง€์—ญ์„ฑ๊ณผ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ท ํ˜•
manual_vectorized:  15.75ms  โ† ๋‹จ์ˆœ ์—ฐ์‚ฐ์—์„œ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ์ด ๋ถˆ๋ฆฌ
vectorized:         13.38ms  โ† ์ž๋™ ์ตœ์ ํ™” ์˜ค๋ฒ„ํ—ค๋“œ

์„ ์ˆ˜ ์ง€์‹

ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์„ ํ•™์Šตํ•˜๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • ๊ธฐ๋ณธ GPU ๊ฐœ๋…: ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ, ์Šค๋ ˆ๋“œ ์‹คํ–‰, SIMD ์—ฐ์‚ฐ
  • Mojo ๊ธฐ์ดˆ: ํŒŒ๋ผ๋ฏธํ„ฐ ํ•จ์ˆ˜, ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”, ์บก์ฒ˜ ์˜๋ฏธ๋ก 
  • TileTensor ์—ฐ์‚ฐ: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘
  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ: ๋ฒ„ํผ ํ• ๋‹น, ํ˜ธ์ŠคํŠธ-๋””๋ฐ”์ด์Šค ๋™๊ธฐํ™”

ํ•™์Šต ๊ฒฝ๋กœ

1. Elementwise ์—ฐ์‚ฐ

โ†’ elementwise - ๊ธฐ๋ณธ GPU ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ

๊ธฐ์ดˆ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค: ์ž๋™ ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ์™€ SIMD ๋ฒกํ„ฐํ™”.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • elementwise๋ฅผ ํ™œ์šฉํ•œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ
  • GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ ์ž๋™ SIMD ๋ฒกํ„ฐํ™”
  • ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ TileTensor ์—ฐ์‚ฐ
  • ์ค‘์ฒฉ ํ•จ์ˆ˜์—์„œ์˜ ์บก์ฒ˜ ์˜๋ฏธ๋ก 

ํ•ต์‹ฌ ํŒจํ„ด:

elementwise[add_function, SIMD_WIDTH, target="gpu"](total_size, ctx)

2. ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ

โ†’ tile - ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ ์ธ ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ

elementwise๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ํƒ€์ผ๋ง ํŒจํ„ด์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ํƒ€์ผ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ํƒ€์ผ ๋‚ด ์ˆœ์ฐจ์  SIMD ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ ์›์น™๊ณผ ์บ์‹œ ์นœํ™”์  ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ-ํƒ€์ผ ๋งคํ•‘ vs ์Šค๋ ˆ๋“œ-์š”์†Œ ๋งคํ•‘

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํƒ€์ผ๋ง์€ ๋ณ‘๋ ฌ ํญ์„ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ๊ณผ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค - ๋” ์ ์€ ์ˆ˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ์œผ๋กœ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

3. ๊ณ ๊ธ‰ ๋ฒกํ„ฐํ™”

โ†’ vectorize - SIMD ์ œ์–ด

์ˆ˜๋™ ์ œ์–ด์™€ ์ž๋™ ๋ฒกํ„ฐํ™” ์ „๋žต์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ์ˆ˜๋™ SIMD ์—ฐ์‚ฐ
  • ์•ˆ์ „ํ•˜๊ณ  ์ž๋™์ ์ธ ๋ฒกํ„ฐํ™”๋ฅผ ์œ„ํ•œ Mojo์˜ vectorize ํ•จ์ˆ˜
  • ์ตœ์ ์˜ SIMD ์ •๋ ฌ์„ ์œ„ํ•œ ์ฒญํฌ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ์ˆ˜๋™ ์ œ์–ด์™€ ์•ˆ์ „์„ฑ ๊ฐ„์˜ ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ๋ฒ•:

  • ์ˆ˜๋™: ์ง์ ‘ ์ œ์–ด, ์ตœ๋Œ€ ์„ฑ๋Šฅ, ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ
  • Mojo vectorize: ์ž๋™ ์ตœ์ ํ™”, ๋‚ด์žฅ ์•ˆ์ „์„ฑ, ๊น”๋”ํ•œ ์ฝ”๋“œ

๐Ÿง  4. ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…

โ†’ GPU ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…

๋ณ‘๋ ฌ์„ฑ ์ˆ˜์ค€ ๊ฐ„์˜ ๊ทผ๋ณธ์ ์ธ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • GPU ์Šค๋ ˆ๋”ฉ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ํ•˜๋“œ์›จ์–ด ๋งคํ•‘
  • GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ SIMD ์—ฐ์‚ฐ
  • ํŒจํ„ด ๋น„๊ต์™€ ์Šค๋ ˆ๋“œ-์ž‘์—… ๋งคํ•‘
  • ์›Œํฌ๋กœ๋“œ์— ๋งž๋Š” ์˜ฌ๋ฐ”๋ฅธ ํŒจํ„ด ์„ ํƒ

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ตฌ์กฐ๋ฅผ ์ œ๊ณตํ•˜๊ณ , SIMD ์—ฐ์‚ฐ์ด ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“Š 5. Mojo ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํ‚น

โ†’ Mojo ๋ฒค์น˜๋งˆํ‚น

GPU ์„ฑ๋Šฅ์„ ๊ณผํ•™์ ์œผ๋กœ ์ธก์ •, ๋ถ„์„, ์ตœ์ ํ™”ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • Mojo์˜ ๋‚ด์žฅ ๋ฒค์น˜๋งˆํ‚น ํ”„๋ ˆ์ž„์›Œํฌ
  • GPU ๊ณ ์œ ์˜ ํƒ€์ด๋ฐ ๋ฐ ๋™๊ธฐํ™” ๋ฌธ์ œ
  • ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”๋ฅผ ํ™œ์šฉํ•œ ํŒŒ๋ผ๋ฏธํ„ฐํ™”๋œ ๋ฒค์น˜๋งˆํฌ ํ•จ์ˆ˜
  • ์‹ค์ฆ์  ์„ฑ๋Šฅ ๋ถ„์„๊ณผ ํŒจํ„ด ์„ ํƒ

ํ•ต์‹ฌ ๊ธฐ๋ฒ•: keep()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒค์น˜๋งˆํฌ ์ฝ”๋“œ์˜ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

Elementwise ํŒจํ„ด๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ๊ฐ ์„น์…˜์„ ์ฒด๊ณ„์ ์œผ๋กœ ํ•™์Šตํ•˜์„ธ์š”. ๊ฐ ํผ์ฆ์€ ์ด์ „ ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒˆ๋กœ์šด ์ˆ˜์ค€์˜ ์ •๊ตํ•จ์„ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ๊ฐ ํŒจํ„ด์˜ ์–ด๋–ป๊ฒŒ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์™œ๋ฅผ ์ดํ•ดํ•˜๋Š” ๋ฐ ์ง‘์ค‘ํ•˜์„ธ์š”. ์—ฌ๊ธฐ์„œ ํ˜•์„ฑํ•˜๋Š” ๊ฐœ๋…์  ํ”„๋ ˆ์ž„์›Œํฌ๋Š” GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ „๋ฐ˜์— ๊ฑธ์ณ ํ™œ์šฉ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Part VI๋ฅผ ๋งˆ์น˜๋ฉด, ์ €์ˆ˜์ค€ GPU ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋Œ€์‹  ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์˜ ๊ด€์ ์—์„œ ์‚ฌ๊ณ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์–ด, ๋” ์œ ์ง€๋ณด์ˆ˜ํ•˜๊ธฐ ์‰ฝ๊ณ , ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚˜๋ฉฐ, ์ด์‹์„ฑ์ด ๋†’์€ GPU ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: elementwise - ๊ธฐ๋ณธ GPU ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ ์—์„œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‹œ์ž‘ํ•˜์„ธ์š”.

elementwise - ๊ธฐ๋ณธ GPU ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ

์ด ํผ์ฆ์€ Mojo์˜ ํ•จ์ˆ˜ํ˜• elementwise ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ ๋ง์…ˆ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž๋™์œผ๋กœ ์—ฌ๋Ÿฌ SIMD ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์–ด๋–ป๊ฒŒ ์ €์ˆ˜์ค€ ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ์ถ”์ƒํ™”ํ•˜๋ฉด์„œ๋„ ๋†’์€ ์„ฑ๋Šฅ์„ ์œ ์ง€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: elementwise ํ•จ์ˆ˜๋Š” ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ, SIMD ๋ฒกํ„ฐํ™”, ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • elementwise๋ฅผ ํ™œ์šฉํ•œ ํ•จ์ˆ˜ํ˜• GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ
  • GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ ์ž๋™ SIMD ๋ฒกํ„ฐํ™”
  • ์•ˆ์ „ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•œ TileTensor ์—ฐ์‚ฐ
  • GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ vs SIMD ์—ฐ์‚ฐ
  • ์ค‘์ฒฉ ํ•จ์ˆ˜์—์„œ์˜ ์บก์ฒ˜ ์˜๋ฏธ๋ก 

์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‹จ์ˆœํ•œ ์š”์†Œ๋ณ„ ๋ง์…ˆ์ž…๋‹ˆ๋‹ค: \[\Large \text{output}[i] = a[i] + b[i]\]

์ด ๊ตฌํ˜„์€ Mojo์—์„œ์˜ ๋ชจ๋“  GPU ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค.

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 1024
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • SIMD ํญ: ํƒ€๊ฒŸ ์˜์กด์  (GPU ์•„ํ‚คํ…์ฒ˜์™€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์— ๋”ฐ๋ผ ๊ฒฐ์ •)
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D ํ–‰ ์šฐ์„ )

์™„์„ฑํ•  ์ฝ”๋“œ

comptime SIZE = 1024
comptime rank = 1
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype, target=get_gpu_target()]()


def elementwise_add[
    LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    def add[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var idx = indices[0]
        print("idx:", idx)
        # FILL IN (2 to 4 lines)

    elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ํ•จ์ˆ˜ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

elementwise ํ•จ์ˆ˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ •ํ™•ํ•œ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ๊ฐ€์ง„ ์ค‘์ฒฉ ํ•จ์ˆ˜๋ฅผ ๊ธฐ๋Œ€ํ•ฉ๋‹ˆ๋‹ค:

@parameter
@always_inline
def your_function[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    # ๊ตฌํ˜„ ์ฝ”๋“œ

๊ฐ ๋ถ€๋ถ„์ด ์ค‘์š”ํ•œ ์ด์œ :

  • @parameter: ์ตœ์ ์˜ GPU ์ฝ”๋“œ ์ƒ์„ฑ์„ ์œ„ํ•œ ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”๋ฅผ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค
  • @always_inline: GPU ์ปค๋„์—์„œ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ์ธ๋ผ์ด๋‹์„ ๊ฐ•์ œํ•ฉ๋‹ˆ๋‹ค
  • capturing: ์™ธ๋ถ€ ์Šค์ฝ”ํ”„์˜ ๋ณ€์ˆ˜(์ž…์ถœ๋ ฅ ํ…์„œ)์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค
  • IndexList[rank]: ๋‹ค์ฐจ์› ์ธ๋ฑ์‹ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค (๋ฒกํ„ฐ๋Š” rank=1, ํ–‰๋ ฌ์€ rank=2)

2. ์ธ๋ฑ์Šค ์ถ”์ถœ๊ณผ SIMD ์ฒ˜๋ฆฌ

idx = indices[0]  # 1D ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์„ ํ˜• ์ธ๋ฑ์Šค ์ถ”์ถœ

์ด idx๋Š” ๋‹จ์ผ ์š”์†Œ๊ฐ€ ์•„๋‹Œ SIMD ๋ฒกํ„ฐ์˜ ์‹œ์ž‘ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค. SIMD_WIDTH=4 (GPU ์˜์กด์ )์ธ ๊ฒฝ์šฐ:

  • Thread 0์€ idx=0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์š”์†Œ [0, 1, 2, 3]์„ ์ฒ˜๋ฆฌ
  • Thread 1์€ idx=4๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์š”์†Œ [4, 5, 6, 7]์„ ์ฒ˜๋ฆฌ
  • Thread 2๋Š” idx=8๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ์š”์†Œ [8, 9, 10, 11]์„ ์ฒ˜๋ฆฌ
  • ์ด๋Ÿฐ ์‹์œผ๋กœ ๊ณ„์†โ€ฆ

3. SIMD ๋กœ๋“œ ํŒจํ„ด

a_simd = a.aligned_load[simd_width](Index(idx))  # ์—ฐ์† float 4๊ฐœ ๋กœ๋“œ (GPU ์˜์กด์ )
b_simd = b.aligned_load[simd_width](Index(idx))  # ์—ฐ์† float 4๊ฐœ ๋กœ๋“œ (GPU ์˜์กด์ )

๋‘ ๋ฒˆ์งธ ๋งค๊ฐœ๋ณ€์ˆ˜ 0์€ ์ฐจ์› ์˜คํ”„์…‹์ž…๋‹ˆ๋‹ค (1D ๋ฒกํ„ฐ์—์„œ๋Š” ํ•ญ์ƒ 0). ์ด ์—ฐ์‚ฐ์€ ํ•œ ๋ฒˆ์— ๋ฒกํ„ฐํ™”๋œ ์ฒญํฌ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค. ๋กœ๋“œ๋˜๋Š” ์ •ํ™•ํ•œ ์š”์†Œ ์ˆ˜๋Š” GPU์˜ SIMD ๋Šฅ๋ ฅ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

4. ๋ฒกํ„ฐ ์—ฐ์‚ฐ

result = a_simd + b_simd  # 4๊ฐœ ์š”์†Œ์˜ SIMD ๋ง์…ˆ์„ ๋™์‹œ์— ์ˆ˜ํ–‰ (GPU ์˜์กด์ )

์ „์ฒด SIMD ๋ฒกํ„ฐ์— ๊ฑธ์ณ ์š”์†Œ๋ณ„ ๋ง์…ˆ์„ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค - 4๊ฐœ์˜ ๊ฐœ๋ณ„ ์Šค์นผ๋ผ ๋ง์…ˆ๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฆ…๋‹ˆ๋‹ค.

5. SIMD ์ €์žฅ

output.store[simd_width](Index(idx), result)  # 4๊ฐœ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ์— ์ €์žฅ (GPU ์˜์กด์ )

์ „์ฒด SIMD ๋ฒกํ„ฐ๋ฅผ ํ•œ ๋ฒˆ์˜ ์—ฐ์‚ฐ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ์— ๋‹ค์‹œ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค.

6. elementwise ํ•จ์ˆ˜ ํ˜ธ์ถœ

elementwise[your_function, SIMD_WIDTH, target="gpu"](total_size, ctx)
  • total_size๋Š” ๋ชจ๋“  ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด a.size()๋กœ ์„ค์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • GPU๋Š” ์‹คํ–‰ํ•  ์Šค๋ ˆ๋“œ ์ˆ˜๋ฅผ ์ž๋™์œผ๋กœ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค: total_size // SIMD_WIDTH

7. ๋””๋ฒ„๊น… ํ•ต์‹ฌ ํฌ์ธํŠธ

ํ…œํ”Œ๋ฆฟ์— ์žˆ๋Š” print("idx:", idx)์— ์ฃผ๋ชฉํ•˜์„ธ์š”. ์‹คํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

idx: 0, idx: 4, idx: 8, idx: 12, ...

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ SIMD ์ฒญํฌ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, SIMD_WIDTH (GPU ์˜์กด์ ) ๊ฐ„๊ฒฉ์œผ๋กœ ์ž๋™ ๋ฐฐ์น˜๋จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

ํ’€์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p23 --elementwise
pixi run -e amd p23 --elementwise
pixi run -e apple p23 --elementwise
uv run poe p23 --elementwise

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
...
idx: 404
idx: 408
idx: 412
idx: 416
...

out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

์†”๋ฃจ์…˜

def elementwise_add[
    LayoutT: TensorLayout, dtype: DType, simd_width: Int, rank: Int, size: Int
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    def add[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var idx = indices[0]
        # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
        var a_lt = a.to_layout_tensor()
        var b_lt = b.to_layout_tensor()
        var out_lt = output.to_layout_tensor()
        # Note: This is thread-local SIMD - each thread processes its own vector of data
        # we'll later better see this hierarchy in Mojo:
        # SIMD within threads, warp across threads, block across warps
        var a_simd = a_lt.aligned_load[width=simd_width](Index(idx))
        var b_simd = b_lt.aligned_load[width=simd_width](Index(idx))
        var ret = a_simd + b_simd
        out_lt.store[simd_width](Index(idx), ret)

    elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)


Mojo์˜ elementwise ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์€ ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ๋ช‡ ๊ฐ€์ง€ ๊ธฐ๋ณธ ๊ฐœ๋…์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค:

1. ํ•จ์ˆ˜ํ˜• ์ถ”์ƒํ™” ์ฒ ํ•™

elementwise ํ•จ์ˆ˜๋Š” ๊ธฐ์กด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ์˜ ํŒจ๋Ÿฌ๋‹ค์ž„ ์ „ํ™˜์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

์ „ํ†ต์ ์ธ CUDA/HIP ๋ฐฉ์‹:

# ์ˆ˜๋™ ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ
idx = thread_idx.x + block_idx.x * block_dim.x
if idx < size:
    output[idx] = a[idx] + b[idx]  # ์Šค์นผ๋ผ ์—ฐ์‚ฐ

Mojo ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹:

# ์ž๋™ ๊ด€๋ฆฌ + SIMD ๋ฒกํ„ฐํ™”
elementwise[add_function, simd_width, target="gpu"](size, ctx)

elementwise๊ฐ€ ์ถ”์ƒํ™”ํ•˜๋Š” ๊ฒƒ๋“ค:

  • ์Šค๋ ˆ๋“œ ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๋ธ”๋ก/๊ทธ๋ฆฌ๋“œ ์ฐจ์›์„ ๊ณ„์‚ฐํ•  ํ•„์š” ์—†์Œ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ฐฐ์—ด ๊ฒฝ๊ณ„๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๋‚ด์žฅ
  • SIMD ์˜ค์ผ€์ŠคํŠธ๋ ˆ์ด์…˜: ๋ฒกํ„ฐํ™”๋ฅผ ํˆฌ๋ช…ํ•˜๊ฒŒ ์ฒ˜๋ฆฌ
  • GPU ํƒ€๊ฒŸ ์„ ํƒ: ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ๋™์ž‘

2. ์‹ฌ์ธต ๋ถ„์„: ์ค‘์ฒฉ ํ•จ์ˆ˜ ์•„ํ‚คํ…์ฒ˜

@parameter
@always_inline
def add[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:

๋งค๊ฐœ๋ณ€์ˆ˜ ๋ถ„์„:

  • @parameter: ์ด ๋ฐ์ฝ”๋ ˆ์ดํ„ฐ๋Š” ์ปดํŒŒ์ผ ํƒ€์ž„ ํŠน์ˆ˜ํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๊ณ ์œ ํ•œ simd_width์™€ rank์— ๋Œ€ํ•ด ํ•จ์ˆ˜๊ฐ€ ๋ณ„๋„๋กœ ์ƒ์„ฑ๋˜์–ด ์ ๊ทน์ ์ธ ์ตœ์ ํ™”๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
  • @always_inline: GPU ์„ฑ๋Šฅ์— ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค - ์ฝ”๋“œ๋ฅผ ์ปค๋„์— ์ง์ ‘ ๋‚ด์žฅํ•˜์—ฌ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.
  • capturing: ๋ ‰์‹œ์ปฌ ์Šค์ฝ”ํ•‘์„ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค - ๋‚ด๋ถ€ ํ•จ์ˆ˜๊ฐ€ ๋ช…์‹œ์  ๋งค๊ฐœ๋ณ€์ˆ˜ ์ „๋‹ฌ ์—†์ด ์™ธ๋ถ€ ์Šค์ฝ”ํ”„์˜ ๋ณ€์ˆ˜์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • IndexList[rank]: ์ฐจ์› ๋ฌด๊ด€ ์ธ๋ฑ์‹ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค - ๋™์ผํ•œ ํŒจํ„ด์ด 1D ๋ฒกํ„ฐ, 2D ํ–‰๋ ฌ, 3D ํ…์„œ ๋“ฑ์—์„œ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.

3. SIMD ์‹คํ–‰ ๋ชจ๋ธ ์‹ฌ์ธต ๋ถ„์„

idx = indices[0]                                    # ์„ ํ˜• ์ธ๋ฑ์Šค: 0, 4, 8, 12... (GPU ์˜์กด์  ๊ฐ„๊ฒฉ)
a_simd = a.aligned_load[simd_width](Index(idx))     # ๋กœ๋“œ: [a[0:4], a[4:8], a[8:12]...] (๋กœ๋“œ๋‹น 4๊ฐœ ์š”์†Œ)
b_simd = b.aligned_load[simd_width](Index(idx))     # ๋กœ๋“œ: [b[0:4], b[4:8], b[8:12]...] (๋กœ๋“œ๋‹น 4๊ฐœ ์š”์†Œ)
ret = a_simd + b_simd                               # SIMD: 4๊ฐœ ๋ง์…ˆ์„ ๋ณ‘๋ ฌ ์ˆ˜ํ–‰ (GPU ์˜์กด์ )
output.store[simd_width](Index(idx), ret)           # ์ €์žฅ: 4๊ฐœ ๊ฒฐ๊ณผ๋ฅผ ๋™์‹œ ์ €์žฅ (GPU ์˜์กด์ )

์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ ์‹œ๊ฐํ™”:

GPU ์•„ํ‚คํ…์ฒ˜:
โ”œโ”€โ”€ Grid (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ Block 1 (์—ฌ๋Ÿฌ Warp)
โ”‚   โ”‚   โ”œโ”€โ”€ Warp 1 (32๊ฐœ ์Šค๋ ˆ๋“œ) --> Warp๋Š” ๋‹ค์Œ Part VI์—์„œ ํ•™์Šต
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 1 โ†’ SIMD[4๊ฐœ ์š”์†Œ]  โ† ํ˜„์žฌ ์ดˆ์  (GPU ์˜์กด์  ํญ)
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ Thread 2 โ†’ SIMD[4๊ฐœ ์š”์†Œ]
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ...
โ”‚   โ”‚   โ””โ”€โ”€ Warp 2 (32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ Block 2 (์—ฌ๋Ÿฌ Warp)

SIMD_WIDTH=4์ธ 1024๊ฐœ ์š”์†Œ ๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ (GPU ์˜ˆ์‹œ):

  • ํ•„์š”ํ•œ ์ด SIMD ์—ฐ์‚ฐ ์ˆ˜: 1024 รท 4 = 256
  • GPU ์‹คํ–‰: 256๊ฐœ ์Šค๋ ˆ๋“œ (1024 รท 4)
  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•˜๋Š” ์–‘: ์ •ํ™•ํžˆ 4๊ฐœ์˜ ์—ฐ์† ์š”์†Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์Šค์นผ๋ผ ์—ฐ์‚ฐ ๋Œ€๋น„ SIMD_WIDTH๋ฐฐ ํ–ฅ์ƒ

์ฐธ๊ณ : SIMD ํญ์€ GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ค๋ฆ…๋‹ˆ๋‹ค (์˜ˆ: ์ผ๋ถ€ GPU๋Š” 4, RTX 4090์€ 8, A100์€ 16).

4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

a.aligned_load[simd_width](Index(idx))  # ๋ณ‘ํ•ฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์˜ ์ด์ :

  • ์ˆœ์ฐจ์  ์ ‘๊ทผ: ์Šค๋ ˆ๋“œ๋“ค์ด ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผ
  • ์บ์‹œ ์ตœ์ ํ™”: L1/L2 ์บ์‹œ ํžˆํŠธ์œจ ๊ทน๋Œ€ํ™”
  • ๋Œ€์—ญํญ ํ™œ์šฉ: ์ด๋ก ์  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ๊ทผ์ ‘ํ•˜๋Š” ์„ฑ๋Šฅ ๋‹ฌ์„ฑ
  • ํ•˜๋“œ์›จ์–ด ํšจ์œจ: GPU ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ๊ฐ€ ์ด ํŒจํ„ด์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Œ

SIMD_WIDTH=4 (GPU ์˜์กด์ ) ์˜ˆ์‹œ:

Thread 0: a[0:4] ๋กœ๋“œ   โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ 0-3
Thread 1: a[4:8] ๋กœ๋“œ   โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ 4-7
Thread 2: a[8:12] ๋กœ๋“œ  โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ 8-11
...
๊ฒฐ๊ณผ: ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ํ™œ์šฉ

5. ์„ฑ๋Šฅ ํŠน์„ฑ ๋ฐ ์ตœ์ ํ™”

์‚ฐ์ˆ  ๊ฐ•๋„ ๋ถ„์„ (SIMD_WIDTH=4 ๊ธฐ์ค€):

  • ์‚ฐ์ˆ  ์—ฐ์‚ฐ: 4๊ฐœ ์š”์†Œ๋‹น 1ํšŒ SIMD ๋ง์…ˆ
  • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ: 4๊ฐœ ์š”์†Œ๋‹น 2ํšŒ SIMD ๋กœ๋“œ + 1ํšŒ SIMD ์ €์žฅ
  • ์‚ฐ์ˆ  ๊ฐ•๋„: 1 ๋ง์…ˆ รท 3 ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ = 0.33 (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ)

์ด๊ฒƒ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ธ ์ด์œ :

๋‹จ์ˆœ ์—ฐ์‚ฐ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ >>> ์—ฐ์‚ฐ ๋Šฅ๋ ฅ

์ตœ์ ํ™” ์‹œ์‚ฌ์ :

  • ์‚ฐ์ˆ  ์ตœ์ ํ™”๋ณด๋‹ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ์ง‘์ค‘ํ•ด์•ผ ํ•จ
  • SIMD ๋ฒกํ„ฐํ™”๊ฐ€ ์ฃผ์š” ์„ฑ๋Šฅ ์ด์ ์„ ์ œ๊ณต
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด ์„ฑ๋Šฅ์— ๋งค์šฐ ์ค‘์š”
  • ์—ฐ์‚ฐ ๋ณต์žก๋„๋ณด๋‹ค ์บ์‹œ ์ง€์—ญ์„ฑ์ด ๋” ์ค‘์š”

6. ํ™•์žฅ์„ฑ๊ณผ ์ ์‘์„ฑ

์ž๋™ ํ•˜๋“œ์›จ์–ด ์ ์‘:

comptime SIMD_WIDTH = simd_width_of[dtype, target=get_gpu_target()]()
  • GPU๋ณ„ ์ตœ์ ํ™”: SIMD ํญ์ด ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ์กฐ์ •๋จ (์˜ˆ: ์ผ๋ถ€ ์นด๋“œ๋Š” 4, RTX 4090์€ 8, A100์€ 16)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž… ์ธ์‹: float32์™€ float16์— ๋Œ€ํ•ด ์„œ๋กœ ๋‹ค๋ฅธ SIMD ํญ ์ ์šฉ
  • ์ปดํŒŒ์ผ ํƒ€์ž„ ์ตœ์ ํ™”: ํ•˜๋“œ์›จ์–ด ๊ฐ์ง€์— ๋Œ€ํ•œ ๋Ÿฐํƒ€์ž„ ์˜ค๋ฒ„ํ—ค๋“œ ์—†์Œ

ํ™•์žฅ์„ฑ ํŠน์„ฑ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜: ๋ฌธ์ œ ํฌ๊ธฐ์— ๋”ฐ๋ผ ์ž๋™ ํ™•์žฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰: ์ž…๋ ฅ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜์—ฌ ์„ ํ˜• ํ™•์žฅ
  • ์„ฑ๋Šฅ: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํฌํ™” ์‹œ์ ๊นŒ์ง€ ๊ฑฐ์˜ ์„ ํ˜•์ ์ธ ์†๋„ ํ–ฅ์ƒ

7. ๊ณ ๊ธ‰ ์ธ์‚ฌ์ดํŠธ: ์ด ํŒจํ„ด์ด ์ค‘์š”ํ•œ ์ด์œ 

๋ณต์žกํ•œ ์—ฐ์‚ฐ์˜ ๊ธฐ์ดˆ: ์ด elementwise ํŒจํ„ด์€ ๋‹ค์Œ ์—ฐ์‚ฐ๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

  • ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ: ๋Œ€๊ทœ๋ชจ ๋ฐฐ์—ด์—์„œ์˜ ํ•ฉ๊ณ„, ์ตœ๋Œ“๊ฐ’, ์ตœ์†Ÿ๊ฐ’
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ: ์Šค์นผ๋ผ-๋ฒกํ„ฐ ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ๋ณ€ํ™˜: ํ™œ์„ฑํ™” ํ•จ์ˆ˜, ์ •๊ทœํ™”
  • ๋‹ค์ฐจ์› ์—ฐ์‚ฐ: ํ–‰๋ ฌ ์—ฐ์‚ฐ, ํ•ฉ์„ฑ๊ณฑ

์ „ํ†ต์ ์ธ ๋ฐฉ์‹๊ณผ์˜ ๋น„๊ต:

// ์ „ํ†ต์ : ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ, ์žฅํ™ฉํ•จ, ํ•˜๋“œ์›จ์–ด ์ข…์†์ 
__global__ void add_kernel(float* output, float* a, float* b, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        output[idx] = a[idx] + b[idx];  // ๋ฒกํ„ฐํ™” ์—†์Œ
    }
}

# Mojo: ์•ˆ์ „, ๊ฐ„๊ฒฐ, ์ž๋™ ๋ฒกํ„ฐํ™”
elementwise[add, SIMD_WIDTH, target="gpu"](size, ctx)

ํ•จ์ˆ˜ํ˜• ์ ‘๊ทผ๋ฒ•์˜ ์ด์ :

  • ์•ˆ์ „์„ฑ: ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ ๋ฐฉ์ง€
  • ์ด์‹์„ฑ: ๋™์ผํ•œ ์ฝ”๋“œ๊ฐ€ ๋‹ค์–‘ํ•œ GPU ๋ฒค๋”/์„ธ๋Œ€์—์„œ ๋™์ž‘
  • ์„ฑ๋Šฅ: ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”๊ฐ€ ์ˆ˜๋™ ํŠœ๋‹ ์ฝ”๋“œ๋ฅผ ์ข…์ข… ๋Šฅ๊ฐ€
  • ์œ ์ง€๋ณด์ˆ˜์„ฑ: ๊น”๋”ํ•œ ์ถ”์ƒํ™”๋กœ ๋””๋ฒ„๊น… ๋ณต์žก๋„ ๊ฐ์†Œ
  • ์กฐํ•ฉ์„ฑ: ๋‹ค๋ฅธ ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ ๊ฐ€๋Šฅ

์ด ํŒจํ„ด์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋ฏธ๋ž˜๋ฅผ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค - ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š๋Š” ๊ณ ์ˆ˜์ค€ ์ถ”์ƒํ™”๋กœ, ์ตœ์ ์˜ ํšจ์œจ์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ GPU ์ปดํ“จํŒ…์„ ๋” ์‰ฝ๊ฒŒ ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

Elementwise ์—ฐ์‚ฐ์„ ํ•™์Šตํ–ˆ๋‹ค๋ฉด ๋‹ค์Œ์œผ๋กœ ๋„˜์–ด๊ฐˆ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: elementwise ํŒจํ„ด์€ Mojo๊ฐ€ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ์šฐ์•„ํ•จ๊ณผ GPU ์„ฑ๋Šฅ์„ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์™„์ „ํ•œ ์ œ์–ด๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ๋ฒกํ„ฐํ™”์™€ ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

tile - ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์ ์ธ ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ

๊ฐœ์š”

elementwise ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ด ํผ์ฆ์—์„œ๋Š” ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” GPU์—์„œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด๊ณผ ์บ์‹œ ํ™œ์šฉ์„ ์ตœ์ ํ™”ํ•˜๋Š” ํ•ต์‹ฌ ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ „์ฒด ๋ฐฐ์—ด์— ๊ฑธ์ณ ๊ฐœ๋ณ„ SIMD ๋ฒกํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ๋Œ€์‹ , ํƒ€์ผ๋ง์€ ๋ฐ์ดํ„ฐ๋ฅผ ์บ์‹œ ๋ฉ”๋ชจ๋ฆฌ์— ๋” ์ž˜ ๋งž๋Š” ์ž‘๊ณ  ๊ด€๋ฆฌ ๊ฐ€๋Šฅํ•œ ์ฒญํฌ๋กœ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Puzzle 16์˜ ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ ์—์„œ ์ด๋ฏธ ํƒ€์ผ๋ง์„ ๊ฒฝํ—˜ํ•œ ๋ฐ” ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฑฐ๊ธฐ์„œ๋Š” ํƒ€์ผ์„ ์‚ฌ์šฉํ•ด ๋Œ€๊ทœ๋ชจ ํ–‰๋ ฌ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ๋Š” ๋™์ผํ•œ ํƒ€์ผ๋ง ์›์น™์„ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์— ์ ์šฉํ•˜์—ฌ, ์ด ๊ธฐ๋ฒ•์ด 2D ํ–‰๋ ฌ์—์„œ 1D ๋ฐฐ์—ด๊นŒ์ง€ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Mojo์˜ ํƒ€์ผ๋ง ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ์ดํ„ฐ์˜ ํƒ€์ผ ์ „์ฒด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ์ด ํŠน์ • ์›Œํฌ๋กœ๋“œ์—์„œ ์–ด๋–ป๊ฒŒ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํƒ€์ผ๋ง์€ ๋ณ‘๋ ฌ ํญ์„ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ๊ณผ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค - ๋” ์ ์€ ์ˆ˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ์œผ๋กœ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ํƒ€์ผ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ํƒ€์ผ ๋‚ด์˜ ์ˆœ์ฐจ์  SIMD ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ ์›์น™๊ณผ ์บ์‹œ ์นœํ™”์  ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ-ํƒ€์ผ ๋งคํ•‘ vs ์Šค๋ ˆ๋“œ-์š”์†Œ ๋งคํ•‘
  • ๋ณ‘๋ ฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ ๊ฐ„์˜ ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

์š”์†Œ๋ณ„ ๋ฐฉ์‹๊ณผ ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ: \[\Large \text{output}[i] = a[i] + b[i]\]

ํ•˜์ง€๋งŒ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์— ์ตœ์ ํ™”๋œ ์™„์ „ํžˆ ๋‹ค๋ฅธ ์‹คํ–‰ ์ „๋žต์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 1024
  • ํƒ€์ผ ํฌ๊ธฐ: TILE_SIZE = 32
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • SIMD ํญ: GPU ์˜์กด์  (ํƒ€์ผ ๋‚ด ์—ฐ์‚ฐ์šฉ)
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D ํ–‰ ์šฐ์„ )

์™„์„ฑํ•  ์ฝ”๋“œ

comptime TILE_SIZE = 32


def tiled_elementwise_add[
    LayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    def process_tiles[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var tile_id = indices[0]
        print("tile_id:", tile_id)
        var output_tile = output.tile[tile_size](tile_id)
        var a_tile = a.tile[tile_size](tile_id)
        var b_tile = b.tile[tile_size](tile_id)

        # FILL IN (6 lines at most)

    var num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ํƒ€์ผ ๊ตฌ์„ฑ ์ดํ•ดํ•˜๊ธฐ

ํƒ€์ผ๋ง ๋ฐฉ์‹์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ ์ • ํฌ๊ธฐ์˜ ์ฒญํฌ๋กœ ๋‚˜๋ˆ•๋‹ˆ๋‹ค:

num_tiles = (size + tile_size - 1) // tile_size  # ์˜ฌ๋ฆผ ๋‚˜๋ˆ—์…ˆ

TILE_SIZE=32์ธ 1024๊ฐœ ์š”์†Œ ๋ฒกํ„ฐ์˜ ๊ฒฝ์šฐ: 1024 รท 32 = 32๊ฐœ ํƒ€์ผ์ด ์ •ํ™•ํžˆ ์ƒ๊น๋‹ˆ๋‹ค.

2. ํƒ€์ผ ์ถ”์ถœ ํŒจํ„ด

TileTensor .tile ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

tile_id = indices[0]  # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•  ํƒ€์ผ ํ•˜๋‚˜๋ฅผ ๋ฐ›์Œ
out_tile = output.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)

tile[size](id) ๋ฉ”์„œ๋“œ๋Š” id ร— size ์œ„์น˜๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” size๊ฐœ์˜ ์—ฐ์† ์š”์†Œ์— ๋Œ€ํ•œ ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.
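ํƒ€์ผ ๋ทฐ๊ฐ€ ๊ฐ€๋ฆฌํ‚ค๋Š” ๋ฒ”์œ„๋Š” ๋‹จ์ˆœํ•œ ๊ณฑ์…ˆ์œผ๋กœ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ๊ทธ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ๋งŒ ํ‰๋‚ด ๋‚ธ Python ์Šค์ผ€์น˜๋กœ, tile_bounds๋Š” ์„ค๋ช…์šฉ ๊ฐ€์ƒ ํ•จ์ˆ˜์ž…๋‹ˆ๋‹ค (์‹ค์ œ TileTensor ๋ทฐ๋Š” ๋ณต์‚ฌ ์—†์ด ์›๋ณธ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค):

```python
def tile_bounds(tile_id, tile_size):
    """tile[tile_size](tile_id)๊ฐ€ ๊ฐ€๋ฆฌํ‚ค๋Š” ๋ฐ˜์—ด๋ฆฐ ๊ตฌ๊ฐ„ [start, end)๋ฅผ ๊ณ„์‚ฐ"""
    start = tile_id * tile_size
    return start, start + tile_size

data = list(range(1024))            # ์›๋ณธ ๋ฐฐ์—ด ์—ญํ• 
start, end = tile_bounds(2, 32)
print(start, end)                   # 64 96
print(data[start:end][:4])          # [64, 65, 66, 67]
```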

3. ํƒ€์ผ ๋‚ด ์ˆœ์ฐจ ์ฒ˜๋ฆฌ

์š”์†Œ๋ณ„ ๋ฐฉ์‹๊ณผ ๋‹ฌ๋ฆฌ, ํƒ€์ผ์„ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

@parameter
for i in range(tile_size // simd_width):
    # ํ˜„์žฌ ํƒ€์ผ ๋‚ด i๋ฒˆ์งธ SIMD ๊ทธ๋ฃน(simd_width๊ฐœ ์š”์†Œ)์„ ์ฒ˜๋ฆฌ

์ด @parameter ๋ฃจํ”„๋Š” ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์ปดํŒŒ์ผ ํƒ€์ž„์— ์ „๊ฐœ๋ฉ๋‹ˆ๋‹ค.

4. ํƒ€์ผ ์š”์†Œ ๋‚ด SIMD ์—ฐ์‚ฐ

a_vec = a_tile.load[simd_width](i * simd_width, 0)  # ํƒ€์ผ ๋‚ด i๋ฒˆ์งธ SIMD ๊ทธ๋ฃน ๋กœ๋“œ
b_vec = b_tile.load[simd_width](i * simd_width, 0)  # ํƒ€์ผ ๋‚ด i๋ฒˆ์งธ SIMD ๊ทธ๋ฃน ๋กœ๋“œ
result = a_vec + b_vec                              # SIMD ๋ง์…ˆ (GPU ์˜์กด์  ํญ)
out_tile.store[simd_width](i * simd_width, 0, result)  # ๊ฐ™์€ ์œ„์น˜์— ์ €์žฅ

5. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ์˜ ์ฐจ์ด์ 

elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)

SIMD_WIDTH ๋Œ€์‹  1์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ํƒ€์ผ ์ „์ฒด๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

6. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ธ์‚ฌ์ดํŠธ

๊ฐ ์Šค๋ ˆ๋“œ๋Š” ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ธ”๋ก(ํƒ€์ผ)์— ์ ‘๊ทผํ•œ ๋‹ค์Œ, ๋‹ค์Œ ํƒ€์ผ๋กœ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ๊ฐ ์Šค๋ ˆ๋“œ์˜ ์‹คํ–‰ ๋‚ด์—์„œ ์šฐ์ˆ˜ํ•œ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ์ด ๋งŒ๋“ค์–ด์ง‘๋‹ˆ๋‹ค.

7. ๋””๋ฒ„๊น… ํ•ต์‹ฌ ํฌ์ธํŠธ

ํƒ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋ฉด ์Šค๋ ˆ๋“œ ์‹คํ–‰ ์ˆ˜๋Š” ์ค„์–ด๋“ค์ง€๋งŒ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋” ๋งŽ์€ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

  • ์š”์†Œ๋ณ„: ~256๊ฐœ ์Šค๋ ˆ๋“œ (SIMD_WIDTH=4 ๊ธฐ์ค€), ๊ฐ๊ฐ 4๊ฐœ ์š”์†Œ ์ฒ˜๋ฆฌ
  • Tiled: ~32๊ฐœ ์Šค๋ ˆ๋“œ, ๊ฐ๊ฐ 32๊ฐœ ์š”์†Œ๋ฅผ ์ˆœ์ฐจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
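๋‘ ๋ฐฉ์‹์˜ ์Šค๋ ˆ๋“œ ์ˆ˜๋Š” ๋‹ค์Œ ๊ณ„์‚ฐ์œผ๋กœ ํ™•์ธ๋ฉ๋‹ˆ๋‹ค (๋ณธ๋ฌธ ์ˆ˜์น˜๋ฅผ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ธด Python ์Šค์ผ€์น˜, SIMD_WIDTH=4 ๊ฐ€์ •):

```python
SIZE, SIMD_WIDTH, TILE_SIZE = 1024, 4, 32

elementwise_threads = SIZE // SIMD_WIDTH             # ์Šค๋ ˆ๋“œ๋‹น SIMD ์ฒญํฌ 1๊ฐœ
tiled_threads = (SIZE + TILE_SIZE - 1) // TILE_SIZE  # ์Šค๋ ˆ๋“œ๋‹น ํƒ€์ผ 1๊ฐœ
simd_ops_per_tile = TILE_SIZE // SIMD_WIDTH          # ํƒ€์ผ ํ•˜๋‚˜๋ฅผ ์ˆœ์ฐจ ์ฒ˜๋ฆฌํ•˜๋Š” SIMD ์—ฐ์‚ฐ ํšŸ์ˆ˜

print(elementwise_threads)  # 256
print(tiled_threads)        # 32
print(simd_ops_per_tile)    # 8
```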

์ฝ”๋“œ ์‹คํ–‰

ํ’€์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p23 --tiled
pixi run -e amd p23 --tiled
pixi run -e apple p23 --tiled
uv run poe p23 --tiled

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
...
tile_id: 29
tile_id: 30
tile_id: 31
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

์†”๋ฃจ์…˜

comptime TILE_SIZE = 32


def tiled_elementwise_add[
    LayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    def process_tiles[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var tile_id = indices[0]

        var output_tile = output.tile[tile_size](tile_id).to_layout_tensor()
        var a_tile = a.tile[tile_size](tile_id).to_layout_tensor()
        var b_tile = b.tile[tile_size](tile_id).to_layout_tensor()

        comptime for i in range(tile_size // simd_width):
            # ํƒ€์ผ ๋‚ด์—์„œ simd_width ๊ฐ„๊ฒฉ์œผ๋กœ ์ด๋™ํ•˜๋ฉฐ SIMD ๋‹จ์œ„๋กœ ์ฒ˜๋ฆฌ (32 รท 4 = 8ํšŒ)
            var a_vec = a_tile.aligned_load[width=simd_width](Index(i * simd_width))
            var b_vec = b_tile.aligned_load[width=simd_width](Index(i * simd_width))
            var ret = a_vec + b_vec
            output_tile.store[simd_width](Index(i * simd_width), ret)

    var num_tiles = (size + tile_size - 1) // tile_size
    elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)


ํƒ€์ผ๋ง ์ฒ˜๋ฆฌ ํŒจํ„ด์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

1. ํƒ€์ผ๋ง ์ฒ ํ•™๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

ํƒ€์ผ๋ง์€ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์— ๋Œ€ํ•œ ์‚ฌ๊ณ  ๋ฐฉ์‹์˜ ๊ทผ๋ณธ์ ์ธ ์ „ํ™˜์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค:

์š”์†Œ๋ณ„ ๋ฐฉ์‹:

  • ๋„“์€ ๋ณ‘๋ ฌ์„ฑ: ๋งŽ์€ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ๊ฐ ์ตœ์†Œํ•œ์˜ ์ž‘์—… ์ˆ˜ํ–‰
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ถ€ํ•˜: ์Šค๋ ˆ๋“œ๋“ค์ด ์ „์ฒด ๋ฐฐ์—ด์— ๋ถ„์‚ฐ
  • ์บ์‹œ ๋ฏธ์Šค: ์Šค๋ ˆ๋“œ ๊ฒฝ๊ณ„๋ฅผ ๋„˜๋‚˜๋“œ๋Š” ๋‚ฎ์€ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ

ํƒ€์ผ๋ง ๋ฐฉ์‹:

  • ๊นŠ์€ ๋ณ‘๋ ฌ์„ฑ: ๋” ์ ์€ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ๊ฐ ์ƒ๋‹นํ•œ ์ž‘์—… ์ˆ˜ํ–‰
  • ์ง€์—ญํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์†์ ์ธ ๋ฐ์ดํ„ฐ์—์„œ ์ž‘์—…
  • ์บ์‹œ ์ตœ์ ํ™”: ์šฐ์ˆ˜ํ•œ ๊ณต๊ฐ„ ๋ฐ ์‹œ๊ฐ„ ์ง€์—ญ์„ฑ

2. ํƒ€์ผ ๊ตฌ์„ฑ๊ณผ ์ธ๋ฑ์‹ฑ

tile_id = indices[0]
out_tile = output.tile[tile_size](tile_id)
a_tile = a.tile[tile_size](tile_id)
b_tile = b.tile[tile_size](tile_id)

ํƒ€์ผ ๋งคํ•‘ ์‹œ๊ฐํ™” (TILE_SIZE=32):

์›๋ณธ ๋ฐฐ์—ด: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ..., 1023]

Tile 0 (thread 0): [0, 1, 2, ..., 31]      โ† ์š”์†Œ 0-31
Tile 1 (thread 1): [32, 33, 34, ..., 63]   โ† ์š”์†Œ 32-63
Tile 2 (thread 2): [64, 65, 66, ..., 95]   โ† ์š”์†Œ 64-95
...
Tile 31 (thread 31): [992, 993, ..., 1023] โ† ์š”์†Œ 992-1023

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ:

  • tile[size](id)๋Š” ์›๋ณธ ํ…์„œ์— ๋Œ€ํ•œ ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ๋ทฐ๋Š” ์ œ๋กœ ์นดํ”ผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค - ๋ฐ์ดํ„ฐ๋ฅผ ๋ณต์‚ฌํ•˜์ง€ ์•Š๊ณ  ํฌ์ธํ„ฐ ์—ฐ์‚ฐ๋งŒ ์ˆ˜ํ–‰
  • ํƒ€์ผ ๊ฒฝ๊ณ„๋Š” ํ•ญ์ƒ tile_size ๋‹จ์œ„๋กœ ์ •๋ ฌ๋ฉ๋‹ˆ๋‹ค

3. ์ˆœ์ฐจ ์ฒ˜๋ฆฌ ์‹ฌ์ธต ๋ถ„์„

@parameter
for i in range(tile_size // simd_width):
    a_vec = a_tile.load[simd_width](i * simd_width, 0)
    b_vec = b_tile.load[simd_width](i * simd_width, 0)
    ret = a_vec + b_vec
    out_tile.store[simd_width](i * simd_width, 0, ret)

์™œ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ์ธ๊ฐ€?

  • ์บ์‹œ ์ตœ์ ํ™”: ์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์บ์‹œ ํžˆํŠธ์œจ์„ ๊ทน๋Œ€ํ™”
  • ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”: @parameter ๋ฃจํ”„๊ฐ€ ์ปดํŒŒ์ผ ํƒ€์ž„์— ์™„์ „ํžˆ ์ „๊ฐœ๋จ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์ˆœ์ฐจ ์ ‘๊ทผ์ด ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ์„ค๊ณ„์— ๋ถ€ํ•ฉ
  • ์กฐ์ • ๋น„์šฉ ๊ฐ์†Œ: SIMD ๊ทธ๋ฃน ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ๋ถˆํ•„์š”

ํ•˜๋‚˜์˜ ํƒ€์ผ ๋‚ด ์‹คํ–‰ ํŒจํ„ด (TILE_SIZE=32, SIMD_WIDTH=4):

์Šค๋ ˆ๋“œ๊ฐ€ ํƒ€์ผ์„ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ:
Step 0: ์š”์†Œ [0:4]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
Step 1: ์š”์†Œ [4:8]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
Step 2: ์š”์†Œ [8:12]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
...
Step 7: ์š”์†Œ [28:32]๋ฅผ SIMD๋กœ ์ฒ˜๋ฆฌ
ํ•ฉ๊ณ„: ์Šค๋ ˆ๋“œ๋‹น 8ํšŒ SIMD ์—ฐ์‚ฐ (32 รท 4 = 8)
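ํ•œ ์Šค๋ ˆ๋“œ์˜ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ ๊ณผ์ •์€ ์•„๋ž˜ Python ์Šค์ผ€์น˜๋กœ ํ‰๋‚ด ๋‚ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ž…๋ ฅ ๊ฐ’(a[i]=i, b[i]=3i)์€ ์„ค๋ช…์„ ์œ„ํ•œ ์ž„์˜์˜ ๊ฐ€์ •์ด๋ฉฐ, ๋‚ด๋ถ€ ์Šค์นผ๋ผ ๋ฃจํ”„๋Š” ์‹ค์ œ๋กœ๋Š” ํ•˜๋“œ์›จ์–ด SIMD ๋ง์…ˆ ํ•œ ๋ฒˆ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค:

```python
TILE_SIZE, SIMD_WIDTH = 32, 4

def process_tile(a, b, out, tile_id):
    """tile_id๋ฒˆ ํƒ€์ผ์„ 8ํšŒ(32 รท 4)์˜ SIMD ์Šคํ…์œผ๋กœ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ"""
    base = tile_id * TILE_SIZE
    for step in range(TILE_SIZE // SIMD_WIDTH):   # Step 0..7
        lo = base + step * SIMD_WIDTH
        for j in range(lo, lo + SIMD_WIDTH):      # SIMD ๋ง์…ˆ 1ํšŒ๋ฅผ ํ‰๋‚ด
            out[j] = a[j] + b[j]

a = [float(i) for i in range(1024)]
b = [float(3 * i) for i in range(1024)]
out = [0.0] * 1024
process_tile(a, b, out, tile_id=0)
print(out[:4])  # [0.0, 4.0, 8.0, 12.0]
```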

4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

์บ์‹œ ๋™์ž‘ ๋น„๊ต:

์š”์†Œ๋ณ„ ํŒจํ„ด:

Thread 0: ๊ธ€๋กœ๋ฒŒ ์œ„์น˜ [0:4] ์ ‘๊ทผ      โ† SIMD_WIDTH ํฌ๊ธฐ์˜ ์ฒญํฌ ํ•˜๋‚˜
Thread 1: ๊ธ€๋กœ๋ฒŒ ์œ„์น˜ [4:8] ์ ‘๊ทผ      โ† ๋ฐ”๋กœ ๋‹ค์Œ ์ฒญํฌ
...
๊ฒฐ๊ณผ: ์ˆ˜๋งŽ์€ ์Šค๋ ˆ๋“œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์ „์ฒด ๋ฐฐ์—ด์— ๋ถ„์‚ฐ

Tiled ํŒจํ„ด:

Thread 0: ์œ„์น˜ [0:32]๋ฅผ ์ˆœ์ฐจ ์ ‘๊ทผ               โ† ์—ฐ์†์ ์ธ 32๊ฐœ ์š”์†Œ ๋ธ”๋ก
Thread 1: ์œ„์น˜ [32:64]๋ฅผ ์ˆœ์ฐจ ์ ‘๊ทผ             โ† ๋‹ค์Œ ์—ฐ์†์ ์ธ 32๊ฐœ ์š”์†Œ ๋ธ”๋ก
...
๊ฒฐ๊ณผ: ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ์™„๋ฒฝํ•œ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ

์บ์‹œ ํšจ์œจ ์‹œ์‚ฌ์ :

  • L1 ์บ์‹œ: ์ž‘์€ ํƒ€์ผ์ด L1 ์บ์‹œ์— ๋” ์ž˜ ๋งž์•„ ์บ์‹œ ๋ฏธ์Šค ๊ฐ์†Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์ˆœ์ฐจ ์ ‘๊ทผ์ด ์œ ํšจ ๋Œ€์—ญํญ์„ ๊ทน๋Œ€ํ™”
  • TLB ํšจ์œจ: TLB ๋ฏธ์Šค ๊ฐ์†Œ (์—ญ์ฃผ: TLB(Translation Lookaside Buffer)๋Š” ๊ฐ€์ƒ ์ฃผ์†Œ๋ฅผ ๋ฌผ๋ฆฌ ์ฃผ์†Œ๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ์บ์‹œ๋กœ, ๋ฏธ์Šค๊ฐ€ ์ค„๋ฉด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ๋นจ๋ผ์ง‘๋‹ˆ๋‹ค)
  • ํ”„๋ฆฌํŽ˜์นญ: ํ•˜๋“œ์›จ์–ด ํ”„๋ฆฌํŽ˜์ฒ˜๊ฐ€ ์ˆœ์ฐจ ํŒจํ„ด์—์„œ ์ตœ์ ์œผ๋กœ ๋™์ž‘

5. ์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ์ „๋žต

elementwise[process_tiles, 1, target="gpu"](num_tiles, ctx)

์™œ SIMD_WIDTH ๋Œ€์‹  1์ธ๊ฐ€?

  • ์Šค๋ ˆ๋“œ ์ˆ˜: num_tiles ร— SIMD_WIDTH๊ฐ€ ์•„๋‹Œ ์ •ํ™•ํžˆ num_tiles๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋งŒ ์‹คํ–‰
  • ์ž‘์—… ๋ถ„๋ฐฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์™„์ „ํ•œ ํƒ€์ผ์„ ์ฒ˜๋ฆฌ
  • ๋กœ๋“œ ๋ฐธ๋Ÿฐ์‹ฑ: ์Šค๋ ˆ๋“œ๋‹น ๋” ๋งŽ์€ ์ž‘์—…, ์ „์ฒด์ ์œผ๋กœ ๋” ์ ์€ ์Šค๋ ˆ๋“œ
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ: ๊ฐ ์Šค๋ ˆ๋“œ์˜ ์ž‘์—…์ด ๊ณต๊ฐ„์ ์œผ๋กœ ์ง€์—ญํ™”

์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„:

  • ๋” ์ ์€ ๋…ผ๋ฆฌ์  ์Šค๋ ˆ๋“œ: ๋‚ฎ์€ ์ ์œ ์œจ์—์„œ ๋ชจ๋“  GPU ์ฝ”์–ด๋ฅผ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ์Œ
  • ์Šค๋ ˆ๋“œ๋‹น ๋” ๋งŽ์€ ์ž‘์—…: ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ๊ณผ ์กฐ์ • ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ
  • ์ˆœ์ฐจ ์ ‘๊ทผ: ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ
  • ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ: ์Šค๋ ˆ๋“œ ์‹คํ–‰ ๋ฐ ์กฐ์ • ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ

์ค‘์š” ์ฐธ๊ณ : โ€œ๋” ์ ์€ ์Šค๋ ˆ๋“œโ€๋Š” ๋…ผ๋ฆฌ์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. GPU ์Šค์ผ€์ค„๋Ÿฌ๋Š” ์—ฌ๋Ÿฌ ์›Œํ”„๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ ํšจ์œจ์ ์œผ๋กœ ์ „ํ™˜ํ•˜์—ฌ ๋†’์€ ํ•˜๋“œ์›จ์–ด ํ™œ์šฉ๋ฅ ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

6. ์„ฑ๋Šฅ ํŠน์„ฑ

ํƒ€์ผ๋ง์ด ๋„์›€์ด ๋˜๋Š” ๊ฒฝ์šฐ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ๋ณ‘๋ชฉ์ธ ๊ฒฝ์šฐ
  • ์บ์‹œ ๋ฏผ๊ฐ ์›Œํฌ๋กœ๋“œ: ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์˜ ์ด์ ์ด ์žˆ๋Š” ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ์—ฐ์‚ฐ: ์š”์†Œ๋‹น ์—ฐ์‚ฐ๋Ÿ‰์ด ๋งŽ์€ ๊ฒฝ์šฐ
  • ์ œํ•œ๋œ ๋ณ‘๋ ฌ์„ฑ: GPU ์ฝ”์–ด๋ณด๋‹ค ์Šค๋ ˆ๋“œ๊ฐ€ ์ ์€ ๊ฒฝ์šฐ

ํƒ€์ผ๋ง์ด ๋ถˆ๋ฆฌํ•œ ๊ฒฝ์šฐ:

  • ๊ณ ๋„๋กœ ๋ณ‘๋ ฌ์ ์ธ ์›Œํฌ๋กœ๋“œ: ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ํ™œ์šฉ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋‹จ์ˆœํ•œ ์—ฐ์‚ฐ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์—ฐ์‚ฐ๋ณด๋‹ค ์ง€๋ฐฐ์ ์ธ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™์  ์ ‘๊ทผ ํŒจํ„ด: ํƒ€์ผ๋ง์ด ์ง€์—ญ์„ฑ์„ ๊ฐœ์„ ํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ

๋‹จ์ˆœ ๋ง์…ˆ ์˜ˆ์‹œ (TILE_SIZE=32):

  • ์Šค๋ ˆ๋“œ ์ˆ˜: 256๊ฐœ ๋Œ€์‹  32๊ฐœ (8๋ฐฐ ์ ์Œ)
  • ์Šค๋ ˆ๋“œ๋‹น ์ž‘์—…๋Ÿ‰: 4๊ฐœ ๋Œ€์‹  32๊ฐœ ์š”์†Œ (8๋ฐฐ ๋งŽ์Œ)
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์ˆœ์ฐจ vs ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ
  • ์บ์‹œ ํ™œ์šฉ: ํ›จ์”ฌ ๋‚˜์€ ๊ณต๊ฐ„ ์ง€์—ญ์„ฑ

7. ๊ณ ๊ธ‰ ํƒ€์ผ๋ง ๊ณ ๋ ค ์‚ฌํ•ญ

ํƒ€์ผ ํฌ๊ธฐ ์„ ํƒ:

  • ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด: ์บ์‹œ ํ™œ์šฉ์ด ๋–จ์–ด์ง€๊ณ , ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ฆ๊ฐ€
  • ๋„ˆ๋ฌด ํฌ๋ฉด: ์บ์‹œ์— ๋งž์ง€ ์•Š์„ ์ˆ˜ ์žˆ๊ณ , ๋ณ‘๋ ฌ์„ฑ์ด ๊ฐ์†Œ
  • ์ตœ์  ์ง€์ : L1 ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ณดํ†ต 16-64๊ฐœ ์š”์†Œ
  • ํ˜„์žฌ ์„ ํƒ: 32๊ฐœ ์š”์†Œ๋กœ ์บ์‹œ ํ™œ์šฉ๊ณผ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ท ํ˜• ๋‹ฌ์„ฑ

ํ•˜๋“œ์›จ์–ด ๊ณ ๋ ค ์‚ฌํ•ญ:

  • ์บ์‹œ ํฌ๊ธฐ: ๊ฐ€๋Šฅํ•˜๋ฉด ํƒ€์ผ์ด L1 ์บ์‹œ์— ๋งž์•„์•ผ ํ•จ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ํญ์„ ๊ณ ๋ ค
  • ์ฝ”์–ด ์ˆ˜: ๋ชจ๋“  ์ฝ”์–ด๋ฅผ ํ™œ์šฉํ•˜๊ธฐ์— ์ถฉ๋ถ„ํ•œ ํƒ€์ผ ํ™•๋ณด
  • SIMD ํญ: ํƒ€์ผ ํฌ๊ธฐ๋Š” SIMD ํญ์˜ ๋ฐฐ์ˆ˜์—ฌ์•ผ ํ•จ

๋น„๊ต ์š”์•ฝ:

Elementwise: ๋†’์€ ๋ณ‘๋ ฌ์„ฑ, ๋ถ„์‚ฐ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
Tiled:       ์ ๋‹นํ•œ ๋ณ‘๋ ฌ์„ฑ, ์ง€์—ญํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ

์š”์†Œ๋ณ„ ํŒจํ„ด๊ณผ ํƒ€์ผ๋ง ํŒจํ„ด ๊ฐ„์˜ ์„ ํƒ์€ ํŠน์ • ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ, ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ํŒจํ„ด, ๋Œ€์ƒ ํ•˜๋“œ์›จ์–ด ๋Šฅ๋ ฅ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

์š”์†Œ๋ณ„ ํŒจํ„ด๊ณผ ํƒ€์ผ๋ง ํŒจํ„ด์„ ๋ชจ๋‘ ์ดํ•ดํ–ˆ๋‹ค๋ฉด:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์•ฝ: ํƒ€์ผ๋ง์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์›์‹œ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ๋Ÿ‰๋ณด๋‹ค ๋” ์ค‘์š”ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ตœ๊ณ ์˜ GPU ์ฝ”๋“œ๋Š” ๋ณ‘๋ ฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”์˜ ๊ท ํ˜•์„ ๋งž์ถฅ๋‹ˆ๋‹ค.

vectorize - SIMD ์ œ์–ด

๊ฐœ์š”

์ด ํผ์ฆ์—์„œ๋Š” ์ˆ˜๋™ ๋ฒกํ„ฐํ™”์™€ vectorize๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ GPU ์ปค๋„ ๋‚ด์—์„œ SIMD ์—ฐ์‚ฐ์„ ์ •๋ฐ€ํ•˜๊ฒŒ ์ œ์–ดํ•˜๋Š” ๊ณ ๊ธ‰ ๋ฒกํ„ฐํ™” ๊ธฐ๋ฒ•์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ๋ฒกํ„ฐํ™”๋œ ์—ฐ์‚ฐ์— ๋Œ€ํ•ด ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ์ ‘๊ทผ๋ฒ•์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”: ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ํ†ตํ•œ ์ง์ ‘์ ์ธ SIMD ์ œ์–ด
  2. Mojo์˜ vectorize ํ•จ์ˆ˜: ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ํฌํ•จํ•œ ๊ณ ์ˆ˜์ค€ ๋ฒกํ„ฐํ™”

๋‘ ์ ‘๊ทผ๋ฒ• ๋ชจ๋‘ ํƒ€์ผ๋ง ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์ง€๋งŒ, ์ œ์–ด, ์•ˆ์ „์„ฑ, ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ฐ„์˜ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ฒกํ„ฐํ™” ์ „๋žต์€ ์„ฑ๋Šฅ ์š”๊ตฌ ์‚ฌํ•ญ๊ณผ ๋ณต์žก๋„ ์ˆ˜์ค€์— ๋”ฐ๋ผ ๋‹ฌ๋ฆฌ ์„ ํƒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ์ˆ˜๋™ SIMD ์—ฐ์‚ฐ
  • ์•ˆ์ „ํ•˜๊ณ  ์ž๋™์ ์ธ ๋ฒกํ„ฐํ™”๋ฅผ ์œ„ํ•œ Mojo์˜ vectorize ํ•จ์ˆ˜
  • ์ตœ์ ์˜ SIMD ์ •๋ ฌ์„ ์œ„ํ•œ ์ฒญํฌ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ๊ตฌ์„ฑ
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์œ„ํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ์ „๋žต
  • ์ˆ˜๋™ ์ œ์–ด์™€ ์•ˆ์ „์„ฑ ๊ฐ„์˜ ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

์ด์ „๊ณผ ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ: \[\Large \text{output}[i] = a[i] + b[i]\]

ํ•˜์ง€๋งŒ ์ตœ๋Œ€ ์„ฑ๋Šฅ์„ ์œ„ํ•œ ์ •๊ตํ•œ ๋ฒกํ„ฐํ™” ์ „๋žต์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

์„ค์ •

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 1024
  • ํƒ€์ผ ํฌ๊ธฐ: TILE_SIZE = 32
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • SIMD ํญ: GPU ์˜์กด์ 
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D ํ–‰ ์šฐ์„ )

1. ์ˆ˜๋™ ๋ฒกํ„ฐํ™” ๋ฐฉ์‹

์™„์„ฑํ•  ์ฝ”๋“œ

def manual_vectorized_tiled_elementwise_add[
    LayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    comptime chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    def process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var tile_id = indices[0]
        print("tile_id:", tile_id)
        var output_tile = output.tile[chunk_size](tile_id)
        var a_tile = a.tile[chunk_size](tile_id)
        var b_tile = b.tile[chunk_size](tile_id)

        # FILL IN (7 lines at most)

    # Number of tiles needed: each tile processes chunk_size elements
    var num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ์ฒญํฌ ๊ตฌ์„ฑ ์ดํ•ดํ•˜๊ธฐ

comptime chunk_size = tile_size * simd_width  # 32 * 4 = ์ฒญํฌ๋‹น 128๊ฐœ ์š”์†Œ

๊ฐ ํƒ€์ผ์€ ์ด์ œ ๋‹จ์ˆœํ•œ ์ˆœ์ฐจ ์š”์†Œ๊ฐ€ ์•„๋‹Œ ์—ฌ๋Ÿฌ SIMD ๊ทธ๋ฃน์„ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

2. ์ „์—ญ ์ธ๋ฑ์Šค ๊ณ„์‚ฐ

global_start = tile_id * chunk_size + i * simd_width

์ฒญํฌ ๋‚ด ๊ฐ SIMD ๋ฒกํ„ฐ์˜ ์ •ํ™•ํ•œ ์ „์—ญ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

3. ํ…์„œ ์ง์ ‘ ์ ‘๊ทผ

a_vec = a.load[simd_width](global_start, 0)     # ์ „์—ญ ํ…์„œ์—์„œ ๋กœ๋“œ
output.store[simd_width](global_start, 0, ret)  # ์ „์—ญ ํ…์„œ์— ์ €์žฅ

์ฐธ๊ณ : ํƒ€์ผ ๋ทฐ๊ฐ€ ์•„๋‹Œ ์›๋ณธ ํ…์„œ์— ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค.

4. ์ฃผ์š” ํŠน์„ฑ

  • ๋” ๋งŽ์€ ์ œ์–ด, ๋” ๋งŽ์€ ๋ณต์žก์„ฑ, ์ „์—ญ ํ…์„œ ์ ‘๊ทผ
  • ํ•˜๋“œ์›จ์–ด์— ๋Œ€ํ•œ ์™„๋ฒฝํ•œ SIMD ์ •๋ ฌ
  • ์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ์‹คํ–‰

pixi run p23 --manual-vectorized
pixi run -e amd p23 --manual-vectorized
pixi run -e apple p23 --manual-vectorized
uv run poe p23 --manual-vectorized

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0
tile_id: 1
tile_id: 2
tile_id: 3
tile_id: 4
tile_id: 5
tile_id: 6
tile_id: 7
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ํ’€์ด

def manual_vectorized_tiled_elementwise_add[
    LayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size groups of simd_width elements
    comptime chunk_size = tile_size * simd_width

    @parameter
    @always_inline
    def process_manual_vectorized_tiles[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var tile_id = indices[0]
        # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
        var a_lt = a.to_layout_tensor()
        var b_lt = b.to_layout_tensor()
        var out_lt = output.to_layout_tensor()

        comptime for i in range(tile_size):
            var global_start = tile_id * chunk_size + i * simd_width

            var a_vec = a_lt.aligned_load[width=simd_width](Index(global_start))
            var b_vec = b_lt.aligned_load[width=simd_width](Index(global_start))
            var ret = a_vec + b_vec
            out_lt.store[simd_width](Index(global_start), ret)

    # Number of tiles needed: each tile processes chunk_size elements
    var num_tiles = (size + chunk_size - 1) // chunk_size
    elementwise[
        process_manual_vectorized_tiles, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


์ˆ˜๋™ ๋ฒกํ„ฐํ™” ์‹ฌ์ธต ๋ถ„์„

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋Š” ๋ช…์‹œ์  ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ํ†ตํ•ด SIMD ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ง์ ‘์ ์ธ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ์ฒญํฌ ๊ธฐ๋ฐ˜ ๊ตฌ์„ฑ: chunk_size = tile_size * simd_width
  • ์ „์—ญ ์ธ๋ฑ์‹ฑ: ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์˜ ์ง์ ‘ ๊ณ„์‚ฐ
  • ์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ด€๋ฆฌ: ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์ง์ ‘ ์ฒ˜๋ฆฌ

์•„ํ‚คํ…์ฒ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ:

comptime chunk_size = tile_size * simd_width  # 32 * 4 = 128

์ฒญํฌ ๊ตฌ์„ฑ ์‹œ๊ฐํ™” (TILE_SIZE=32, SIMD_WIDTH=4):

์›๋ณธ ๋ฐฐ์—ด: [0, 1, 2, 3, ..., 1023]

์ฒญํฌ 0 (thread 0): [0:128]    โ† 128๊ฐœ ์š”์†Œ = 4๊ฐœ์”ฉ 32๊ฐœ SIMD ๊ทธ๋ฃน
์ฒญํฌ 1 (thread 1): [128:256]  โ† ๋‹ค์Œ 128๊ฐœ ์š”์†Œ
์ฒญํฌ 2 (thread 2): [256:384]  โ† ๋‹ค์Œ 128๊ฐœ ์š”์†Œ
...
์ฒญํฌ 7 (thread 7): [896:1024] โ† ๋งˆ์ง€๋ง‰ 128๊ฐœ ์š”์†Œ

ํ•˜๋‚˜์˜ ์ฒญํฌ ๋‚ด ์ฒ˜๋ฆฌ:

@parameter
for i in range(tile_size):  # i = 0, 1, 2, ..., 31
    global_start = tile_id * chunk_size + i * simd_width
    # tile_id=0์ผ ๋•Œ: global_start = 0, 4, 8, 12, ..., 124
    # tile_id=1์ผ ๋•Œ: global_start = 128, 132, 136, 140, ..., 252

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜: 8๊ฐœ ์Šค๋ ˆ๋“œ (1024 รท 128 = 8)
  • ์Šค๋ ˆ๋“œ๋‹น ์ž‘์—…๋Ÿ‰: 128๊ฐœ ์š”์†Œ (๊ฐ 4๊ฐœ ์š”์†Œ์˜ SIMD ์—ฐ์‚ฐ 32ํšŒ)
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์™„๋ฒฝํ•œ SIMD ์ •๋ ฌ์„ ๊ฐ–์ถ˜ ๋Œ€ํ˜• ์ฒญํฌ
  • ์˜ค๋ฒ„ํ—ค๋“œ: ์ตœ์†Œ - ํ•˜๋“œ์›จ์–ด์— ์ง์ ‘ ๋งคํ•‘
  • ์•ˆ์ „์„ฑ: ์ˆ˜๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํ•„์š”

์ฃผ์š” ์žฅ์ :

  • ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ์ธ๋ฑ์‹ฑ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์— ๋Œ€ํ•œ ์ •ํ™•ํ•œ ์ œ์–ด
  • ์ตœ์ ์˜ ์ •๋ ฌ: SIMD ์—ฐ์‚ฐ์ด ํ•˜๋“œ์›จ์–ด์— ์™„๋ฒฝํžˆ ์ •๋ ฌ
  • ์ตœ๋Œ€ ์ฒ˜๋ฆฌ๋Ÿ‰: ์•ˆ์ „์„ฑ ๊ฒ€์‚ฌ๋กœ ์ธํ•œ ์˜ค๋ฒ„ํ—ค๋“œ ์—†์Œ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU SIMD ์œ ๋‹›์— ์ง์ ‘ ๋งคํ•‘

์ฃผ์š” ๊ณผ์ œ:

  • ์ธ๋ฑ์Šค ๋ณต์žก์„ฑ: ์ „์—ญ ์œ„์น˜์˜ ์ˆ˜๋™ ๊ณ„์‚ฐ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ์ฑ…์ž„: ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์ง์ ‘ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•จ
  • ๋””๋ฒ„๊น… ๋‚œ์ด๋„: ์ •ํ™•์„ฑ ๊ฒ€์ฆ์ด ๋” ๋ณต์žก

2. Mojo vectorize ๋ฐฉ์‹

์™„์„ฑํ•  ์ฝ”๋“œ

def vectorize_within_tiles_elementwise_add[
    LayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    def process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var tile_id = indices[0]
        var tile_start = tile_id * tile_size
        var tile_end = min(tile_start + tile_size, size)
        var actual_tile_size = tile_end - tile_start
        print(
            "tile_id:",
            tile_id,
            "tile_start:",
            tile_start,
            "tile_end:",
            tile_end,
            "actual_tile_size:",
            actual_tile_size,
        )

        # FILL IN (9 lines at most)

    var num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p23/p23.mojo

ํŒ

1. ํƒ€์ผ ๊ฒฝ๊ณ„ ๊ณ„์‚ฐ

tile_start = tile_id * tile_size
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

๋งˆ์ง€๋ง‰ ํƒ€์ผ์ด tile_size๋ณด๋‹ค ์ž‘์„ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

2. ๋ฒกํ„ฐํ™” ํ•จ์ˆ˜ ํŒจํ„ด

def vectorized_add[
  width: Int
](i: Int) unified {read tile_start, read a, read b, mut output}:
    global_idx = tile_start + i
    if global_idx + width <= size:  # ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
        # SIMD ์—ฐ์‚ฐ ์ฝ”๋“œ

width ๋งค๊ฐœ๋ณ€์ˆ˜๋Š” vectorize ํ•จ์ˆ˜์— ์˜ํ•ด ์ž๋™์œผ๋กœ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.

3. vectorize ํ˜ธ์ถœ

vectorize[simd_width](actual_tile_size, vectorized_add)

์ œ๊ณต๋œ SIMD ํญ์œผ๋กœ ๋ฒกํ„ฐํ™” ๋ฃจํ”„๋ฅผ ์ž๋™ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

4. ์ฃผ์š” ํŠน์„ฑ

  • ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ, ๋‚ด์žฅ ์•ˆ์ „์„ฑ, ํƒ€์ผ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ
  • ๋ช…์‹œ์  SIMD ํญ ๋งค๊ฐœ๋ณ€์ˆ˜ ์‚ฌ์šฉ
  • ๋‚ด์žฅ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ์ž๋™ ๋‚˜๋จธ์ง€ ์š”์†Œ ์ฒ˜๋ฆฌ

Mojo vectorize ์‹คํ–‰

uv run poe p23 --vectorized
pixi run p23 --vectorized

ํผ์ฆ์ด ์•„์ง ํ’€๋ฆฌ์ง€ ์•Š์€ ๊ฒฝ์šฐ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
tile size: 32
tile_id: 0 tile_start: 0 tile_end: 32 actual_tile_size: 32
tile_id: 1 tile_start: 32 tile_end: 64 actual_tile_size: 32
tile_id: 2 tile_start: 64 tile_end: 96 actual_tile_size: 32
tile_id: 3 tile_start: 96 tile_end: 128 actual_tile_size: 32
...
tile_id: 29 tile_start: 928 tile_end: 960 actual_tile_size: 32
tile_id: 30 tile_start: 960 tile_end: 992 actual_tile_size: 32
tile_id: 31 tile_start: 992 tile_end: 1024 actual_tile_size: 32
out: HostBuffer([0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 5.0, 9.0, ..., 4085.0, 4089.0, 4093.0])

Mojo vectorize ํ’€์ด

def vectorize_within_tiles_elementwise_add[
    LayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    num_threads_per_tile: Int,
    rank: Int,
    size: Int,
    tile_size: Int,
](
    output: TileTensor[mut=True, dtype, LayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    # Each tile contains tile_size elements (not SIMD groups)
    @parameter
    @always_inline
    def process_tile_with_vectorize[
        num_threads_per_tile: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var tile_id = indices[0]
        var tile_start = tile_id * tile_size
        var tile_end = min(tile_start + tile_size, size)
        var actual_tile_size = tile_end - tile_start
        # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
        var a_lt = a.to_layout_tensor()
        var b_lt = b.to_layout_tensor()
        var out_lt = output.to_layout_tensor()

        def vectorized_add[
            width: Int
        ](i: Int) {read tile_start, read a_lt, read b_lt, mut out_lt}:
            var global_idx = tile_start + i
            if global_idx + width <= size:
                var a_vec = a_lt.aligned_load[width](Index(global_idx))
                var b_vec = b_lt.aligned_load[width](Index(global_idx))
                var result = a_vec + b_vec
                out_lt.store[width](Index(global_idx), result)

        # Use vectorize within each tile
        vectorize[simd_width](actual_tile_size, vectorized_add)

    var num_tiles = (size + tile_size - 1) // tile_size
    elementwise[
        process_tile_with_vectorize, num_threads_per_tile, target="gpu"
    ](num_tiles, ctx)


Mojo vectorize ์‹ฌ์ธต ๋ถ„์„

Mojo์˜ vectorize ํ•จ์ˆ˜๋Š” ๋‚ด์žฅ ์•ˆ์ „์„ฑ๊ณผ ํ•จ๊ป˜ ์ž๋™ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

  • ๋ช…์‹œ์  SIMD ํญ ๋งค๊ฐœ๋ณ€์ˆ˜: ์‚ฌ์šฉํ•  simd_width๋ฅผ ์ง์ ‘ ์ง€์ •
  • ๋‚ด์žฅ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ๋ฅผ ์ž๋™์œผ๋กœ ๋ฐฉ์ง€
  • ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ: ๋‚จ์€ ์š”์†Œ๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์ค‘์ฒฉ ํ•จ์ˆ˜ ํŒจํ„ด: ๋ฒกํ„ฐํ™” ๋กœ์ง์˜ ๊น”๋”ํ•œ ๋ถ„๋ฆฌ

ํƒ€์ผ ๊ธฐ๋ฐ˜ ๊ตฌ์„ฑ:

tile_start = tile_id * tile_size    # 0, 32, 64, 96, ...
tile_end = min(tile_start + tile_size, size)
actual_tile_size = tile_end - tile_start

์ž๋™ ๋ฒกํ„ฐํ™” ๋ฉ”์ปค๋‹ˆ์ฆ˜:

def vectorized_add[
  width: Int
](i: Int) unified {read tile_start, read a, read b, mut output}:
    global_idx = tile_start + i
    if global_idx + width <= size:
        # ์ž๋™ SIMD ์ตœ์ ํ™”

vectorize์˜ ๋™์ž‘ ๋ฐฉ์‹:

  • ์ž๋™ ์ฒญํฌ ๋ถ„ํ• : actual_tile_size๋ฅผ ์ง€์ •ํ•œ simd_width์˜ ์ฒญํฌ๋กœ ๋ถ„ํ• 
  • ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ: ๋‚จ์€ ์š”์†Œ๋ฅผ ๋” ์ž‘์€ ํญ์œผ๋กœ ์ž๋™ ์ฒ˜๋ฆฌ
  • ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ: ๋ฒ„ํผ ์˜ค๋ฒ„ํ”Œ๋กœ์šฐ๋ฅผ ์ž๋™์œผ๋กœ ๋ฐฉ์ง€
  • ๋ฃจํ”„ ๊ด€๋ฆฌ: ๋ฒกํ„ฐํ™” ๋ฃจํ”„๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ

์‹คํ–‰ ์‹œ๊ฐํ™” (TILE_SIZE=32, SIMD_WIDTH=4):

Tile 0 ์ฒ˜๋ฆฌ:
  vectorize ํ˜ธ์ถœ 0: ์š”์†Œ [0:4]๋ฅผ SIMD_WIDTH=4๋กœ ์ฒ˜๋ฆฌ
  vectorize ํ˜ธ์ถœ 1: ์š”์†Œ [4:8]๋ฅผ SIMD_WIDTH=4๋กœ ์ฒ˜๋ฆฌ
  ...
  vectorize ํ˜ธ์ถœ 7: ์š”์†Œ [28:32]๋ฅผ SIMD_WIDTH=4๋กœ ์ฒ˜๋ฆฌ
  ํ•ฉ๊ณ„: 8ํšŒ ์ž๋™ SIMD ์—ฐ์‚ฐ
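vectorize의 분할 동작을 단순화해 흉내 낸 Python 스케치입니다. 실제 Mojo vectorize는 나머지를 중간 폭으로 처리할 수도 있으므로, 전체 폭 호출 후 폭 1 호출로 마무리하는 아래 방식은 설명용 가정입니다(`vectorize_calls`는 예시용 이름).

```python
# vectorize식 분할의 단순화 모델 (설명용; 실제 구현은 나머지를
# 더 작은 SIMD 폭으로 처리할 수도 있습니다).
SIMD_WIDTH = 4

def vectorize_calls(n: int) -> list[tuple[int, int]]:
    """n개 요소를 덮는 (시작 오프셋, 폭) 호출 목록."""
    full = n - n % SIMD_WIDTH
    calls = [(i, SIMD_WIDTH) for i in range(0, full, SIMD_WIDTH)]
    calls += [(i, 1) for i in range(full, n)]  # 나머지는 폭 1로 처리한다고 가정
    return calls

print(len(vectorize_calls(32)))  # 8  (본문 Tile 0의 8회 SIMD 호출과 일치)
print(vectorize_calls(30)[-3:])  # [(24, 4), (28, 1), (29, 1)]
```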

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์Šค๋ ˆ๋“œ ์ˆ˜: 32๊ฐœ ์Šค๋ ˆ๋“œ (1024 รท 32 = 32)
  • ์Šค๋ ˆ๋“œ๋‹น ์ž‘์—…๋Ÿ‰: 32๊ฐœ ์š”์†Œ (์ž๋™ SIMD ์ฒญํฌ ๋ถ„ํ• )
  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ์ž๋™ ๋ฒกํ„ฐํ™”๋ฅผ ๊ฐ–์ถ˜ ์ž‘์€ ํƒ€์ผ
  • ์˜ค๋ฒ„ํ—ค๋“œ: ์•ฝ๊ฐ„ - ์ž๋™ ์ตœ์ ํ™” ๋ฐ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ์•ˆ์ „์„ฑ: ๋‚ด์žฅ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ

์„ฑ๋Šฅ ๋น„๊ต์™€ ๋ชจ๋ฒ” ์‚ฌ๋ก€

๊ฐ ์ ‘๊ทผ๋ฒ•์˜ ์„ ํƒ ๊ธฐ์ค€

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋ฅผ ์„ ํƒํ•  ๋•Œ:

  • ์ตœ๋Œ€ ์„ฑ๋Šฅ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ
  • ์˜ˆ์ธก ๊ฐ€๋Šฅํ•˜๊ณ  ์ •๋ ฌ๋œ ๋ฐ์ดํ„ฐ ํŒจํ„ด์ด ์žˆ๋Š” ๊ฒฝ์šฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ์ˆ˜๋™์œผ๋กœ ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ
  • ํ•˜๋“œ์›จ์–ด๋ณ„ ์ตœ์ ํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ

Mojo vectorize๋ฅผ ์„ ํƒํ•  ๋•Œ:

  • ๊ฐœ๋ฐœ ์†๋„์™€ ์•ˆ์ „์„ฑ์ด ์šฐ์„ ์ธ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™ํ•˜๊ฑฐ๋‚˜ ๋™์ ์ธ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋ฅผ ๋‹ค๋ฃจ๋Š” ๊ฒฝ์šฐ
  • ์ˆ˜๋™ ๊ฒฝ๊ณ„ ์กฐ๊ฑด ๊ด€๋ฆฌ ๋Œ€์‹  ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ๋ฅผ ์›ํ•˜๋Š” ๊ฒฝ์šฐ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ณต์žก๋„๊ฐ€ ์˜ค๋ฅ˜๋ฅผ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ
  • ์ˆ˜๋™ ๋ฃจํ”„ ๊ด€๋ฆฌ๋ณด๋‹ค ๊น”๋”ํ•œ ๋ฒกํ„ฐํ™” ํŒจํ„ด์„ ์„ ํ˜ธํ•˜๋Š” ๊ฒฝ์šฐ

๊ณ ๊ธ‰ ์ตœ์ ํ™” ์ธ์‚ฌ์ดํŠธ

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ:

์ˆ˜๋™:      8 ์Šค๋ ˆ๋“œ ร— 32 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ
vectorize: 32 ์Šค๋ ˆ๋“œ ร— 8 SIMD ์—ฐ์‚ฐ = ์ด 256ํšŒ SIMD ์—ฐ์‚ฐ

๋‘˜ ๋‹ค ๋น„์Šทํ•œ ์ด ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๋‹ฌ์„ฑํ•˜์ง€๋งŒ, ๋ณ‘๋ ฌ์„ฑ ์ „๋žต์ด ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

์บ์‹œ ๋™์ž‘:

  • ์ˆ˜๋™: ๋Œ€ํ˜• ์ฒญํฌ๊ฐ€ L1 ์บ์‹œ๋ฅผ ์ดˆ๊ณผํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ์™„๋ฒฝํ•œ ์ˆœ์ฐจ ์ ‘๊ทผ
  • vectorize: ์ž‘์€ ํƒ€์ผ์ด ์บ์‹œ์— ๋” ์ž˜ ๋งž๊ณ , ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ

ํ•˜๋“œ์›จ์–ด ๋งคํ•‘:

  • ์ˆ˜๋™: ์›Œํ”„ ํ™œ์šฉ๊ณผ SIMD ์œ ๋‹› ๋งคํ•‘์— ๋Œ€ํ•œ ์ง์ ‘ ์ œ์–ด
  • vectorize: ์ž๋™ ๋ฃจํ”„ ๋ฐ ๋‚˜๋จธ์ง€ ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ๊ฐ„์†Œํ™”๋œ ๋ฒกํ„ฐํ™”

๋ชจ๋ฒ” ์‚ฌ๋ก€ ์š”์•ฝ

์ˆ˜๋™ ๋ฒกํ„ฐํ™” ๋ชจ๋ฒ” ์‚ฌ๋ก€:

  • ์ธ๋ฑ์Šค ๊ณ„์‚ฐ์„ ํ•ญ์ƒ ์‹ ์ค‘ํ•˜๊ฒŒ ๊ฒ€์ฆ
  • ๊ฐ€๋Šฅํ•˜๋ฉด chunk_size์— ์ปดํŒŒ์ผ ํƒ€์ž„ ์ƒ์ˆ˜ ์‚ฌ์šฉ
  • ์บ์‹œ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ”„๋กœํŒŒ์ผ๋ง
  • ์ตœ์ ์˜ SIMD ์„ฑ๋Šฅ์„ ์œ„ํ•œ ์ •๋ ฌ ์š”๊ตฌ ์‚ฌํ•ญ ๊ณ ๋ ค

Mojo vectorize ๋ชจ๋ฒ” ์‚ฌ๋ก€:

  • ๋ฐ์ดํ„ฐ์™€ ํ•˜๋“œ์›จ์–ด์— ์ ํ•ฉํ•œ SIMD ํญ ์„ ํƒ
  • ๋ฏธ์„ธ ์ตœ์ ํ™”๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋ช…ํ™•์„ฑ์— ์ง‘์ค‘
  • ๊น”๋”ํ•œ ๋ฒกํ„ฐํ™” ๋กœ์ง์„ ์œ„ํ•ด ์ค‘์ฒฉ ํŒŒ๋ผ๋ฏธํ„ฐ ํ•จ์ˆ˜ ์‚ฌ์šฉ
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด์—๋Š” ์ž๋™ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ ์‹ ๋ขฐ

๋‘ ์ ‘๊ทผ๋ฒ• ๋ชจ๋‘ GPU ์„ฑ๋Šฅ ์ตœ์ ํ™” ๋„๊ตฌ ๋ชจ์Œ์—์„œ ์œ ํšจํ•œ ์ „๋žต์ž…๋‹ˆ๋‹ค. ์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋Š” ์ตœ๋Œ€ํ•œ์˜ ์ œ์–ด๋ฅผ, Mojo์˜ vectorize๋Š” ์•ˆ์ „์„ฑ๊ณผ ์ž๋™ ๋‚˜๋จธ์ง€ ์ฒ˜๋ฆฌ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ชจ๋‘ ์ดํ•ดํ–ˆ๋‹ค๋ฉด:

💡 핵심 요약: 벡터화 전략은 성능 요구 사항에 따라 달리 선택해야 합니다. 수동 벡터화는 최대한의 제어를, Mojo의 vectorize 함수는 안전성과 자동 나머지 처리를 제공합니다. 구체적인 성능 요구 사항과 개발 제약 조건에 따라 선택하세요.

🧠 GPU 스레딩 vs SIMD - 실행 계층 구조 이해하기

๊ฐœ์š”

์š”์†Œ๋ณ„, ํƒ€์ผ๋ง, ๋ฒกํ„ฐํ™” ํŒจํ„ด์„ ํƒ๊ตฌํ•˜๋ฉด์„œ GPU ์—ฐ์‚ฐ์„ ๊ตฌ์„ฑํ•˜๋Š” ๋‹ค์–‘ํ•œ ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ์ด ์„น์…˜์—์„œ๋Š” GPU ์Šค๋ ˆ๋“œ์™€ SIMD ์—ฐ์‚ฐ ์‚ฌ์ด์˜ ๊ทผ๋ณธ์ ์ธ ๊ด€๊ณ„๋ฅผ ๋ช…ํ™•ํžˆ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋‘˜์€ ์„œ๋กœ ๋‹ค๋ฅด์ง€๋งŒ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ๋ณ‘๋ ฌ์„ฑ ์ˆ˜์ค€์œผ๋กœ, ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ํ•จ๊ป˜ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณ‘๋ ฌ์„ฑ์˜ ๊ตฌ์กฐ๋ฅผ ์ œ๊ณตํ•˜๊ณ , SIMD ์—ฐ์‚ฐ์ด ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

GPU ์Šค๋ ˆ๋”ฉ ๊ณ„์ธต ๊ตฌ์กฐ

GPU ์‹คํ–‰์€ ํ•˜๋“œ์›จ์–ด์˜ ๋ณต์žก์„ฑ์„ ์ถ”์ƒํ™”ํ•˜๋Š” ์ž˜ ์ •์˜๋œ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

GPU Device
├── Grid (전체 문제)
│   ├── Block 1 (스레드 그룹, 공유 메모리)
│   │   ├── 워프 1 (32개 스레드, 록스텝 실행)
│   │   │   ├── Thread 1 → SIMD 연산
│   │   │   ├── Thread 2 → SIMD 연산
│   │   │   └── ... (총 32개 스레드)
│   │   └── 워프 2 (32개 스레드)
│   └── Block 2 (독립적인 그룹)

💡 참고: 이 Part는 함수형 패턴에 초점을 맞추고 있으며, 워프 레벨 프로그래밍과 고급 GPU 메모리 관리는 Part VII에서 자세히 다룹니다.

Mojo๊ฐ€ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ฒƒ๋“ค:

  • ๊ทธ๋ฆฌ๋“œ/๋ธ”๋ก ๊ตฌ์„ฑ: ๋ฌธ์ œ ํฌ๊ธฐ์— ๋”ฐ๋ผ ์ž๋™ ๊ณ„์‚ฐ
  • ์›Œํ”„ ๊ด€๋ฆฌ: ํ•˜๋“œ์›จ์–ด๊ฐ€ 32๊ฐœ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์„ ํˆฌ๋ช…ํ•˜๊ฒŒ ์ฒ˜๋ฆฌ
  • ์Šค๋ ˆ๋“œ ์Šค์ผ€์ค„๋ง: GPU ์Šค์ผ€์ค„๋Ÿฌ๊ฐ€ ์‹คํ–‰์„ ์ž๋™ ๊ด€๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ์— ์ตœ์ ์˜ ์ ‘๊ทผ ํŒจํ„ด ๋‚ด์žฅ

GPU ์Šค๋ ˆ๋“œ ๋‚ด์˜ SIMD

๊ฐ GPU ์Šค๋ ˆ๋“œ๋Š” SIMD (Single Instruction, Multiple Data) ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ์š”์†Œ๋ฅผ ๋™์‹œ์— ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ํ•˜๋‚˜์˜ GPU ์Šค๋ ˆ๋“œ ๋‚ด๋ถ€:
a_simd = a.load[simd_width](idx, 0)      # float 4๊ฐœ๋ฅผ ๋™์‹œ์— ๋กœ๋“œ
b_simd = b.load[simd_width](idx, 0)      # float 4๊ฐœ๋ฅผ ๋™์‹œ์— ๋กœ๋“œ
result = a_simd + b_simd                 # 4์Œ์„ ๋™์‹œ์— ๋ง์…ˆ
output.store[simd_width](idx, 0, result) # ๊ฒฐ๊ณผ 4๊ฐœ๋ฅผ ๋™์‹œ์— ์ €์žฅ
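한 스레드 안에서 일어나는 width-4 SIMD 덧셈의 의미를 Python 슬라이스로 재현한 것입니다. 실제 SIMD 유닛은 네 번의 덧셈을 한 명령으로 수행하며, 아래 코드는 의미만 보여주는 설명용 모델입니다(입력값은 임의의 예시).

```python
# 한 스레드의 width-4 SIMD 덧셈을 Python 슬라이스로 재현 (설명용 모델).
SIMD_WIDTH = 4
a = [float(i) for i in range(16)]        # 예시 입력
b = [float(2 * i) for i in range(16)]
out = [0.0] * 16

idx = 4                                   # 이 스레드가 맡은 시작 요소
a_vec = a[idx:idx + SIMD_WIDTH]           # float 4개를 "동시에" 로드
b_vec = b[idx:idx + SIMD_WIDTH]
result = [x + y for x, y in zip(a_vec, b_vec)]  # 4쌍을 동시에 덧셈
out[idx:idx + SIMD_WIDTH] = result        # 결과 4개를 동시에 저장
print(out[4:8])  # [12.0, 15.0, 18.0, 21.0]
```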

ํŒจํ„ด ๋น„๊ต์™€ ์Šค๋ ˆ๋“œ-์ž‘์—… ๋งคํ•‘

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ: ๋ชจ๋“  ํŒจํ„ด์€ ๋™์ผํ•œ ์ด ์ž‘์—…๋Ÿ‰ - SIMD_WIDTH=4๋กœ 1024๊ฐœ ์š”์†Œ์— ๋Œ€ํ•ด 256ํšŒ์˜ SIMD ์—ฐ์‚ฐ - ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฐจ์ด์ ์€ ์ด ์ž‘์—…์ด GPU ์Šค๋ ˆ๋“œ์— ์–ด๋–ป๊ฒŒ ๋ถ„๋ฐฐ๋˜๋А๋ƒ์— ์žˆ์Šต๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๊ตฌ์„ฑ ๋น„๊ต (SIZE=1024, SIMD_WIDTH=4)

| 패턴 | 스레드 수 | 스레드당 SIMD 연산 | 메모리 패턴 | 트레이드오프 |
| --- | --- | --- | --- | --- |
| 요소별 | 256 | 1 | 분산 접근 | 최대 병렬성, 낮은 지역성 |
| 타일링 | 32 | 8 | 소형 블록 | 병렬성 + 지역성 균형 |
| 수동 벡터화 | 8 | 32 | 대형 청크 | 높은 대역폭, 적은 스레드 |
| Mojo vectorize | 32 | 8 | 스마트 블록 | 자동 최적화 |

์ƒ์„ธ ์‹คํ–‰ ํŒจํ„ด

์š”์†Œ๋ณ„ ํŒจํ„ด:

Thread 0: [0,1,2,3] → Thread 1: [4,5,6,7] → ... → Thread 255: [1020,1021,1022,1023]
256 스레드 × 1 SIMD 연산 = 총 256회 SIMD 연산

타일링 패턴:

Thread 0: [0:32] (8 SIMD) → Thread 1: [32:64] (8 SIMD) → ... → Thread 31: [992:1024] (8 SIMD)
32 스레드 × 8 SIMD 연산 = 총 256회 SIMD 연산

수동 벡터화 패턴:

Thread 0: [0:128] (32 SIMD) → Thread 1: [128:256] (32 SIMD) → ... → Thread 7: [896:1024] (32 SIMD)
8 스레드 × 32 SIMD 연산 = 총 256회 SIMD 연산

Mojo vectorize 패턴:

Thread 0: [0:32] 자동 벡터화 → Thread 1: [32:64] 자동 벡터화 → ... → Thread 31: [992:1024] 자동 벡터화
32 스레드 × 8 SIMD 연산 = 총 256회 SIMD 연산
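네 패턴이 정말 같은 총 SIMD 작업량을 수행하는지, 본문의 수치를 Python으로 검산해 볼 수 있습니다(패턴 이름 표기는 이 글에서 임의로 정한 것).

```python
# SIZE=1024, SIMD_WIDTH=4에서 네 패턴의 작업 분배 검산 (설명용).
SIZE, SIMD_WIDTH = 1024, 4
total_simd_ops = SIZE // SIMD_WIDTH  # 256

patterns = {  # 패턴: (스레드 수, 스레드당 SIMD 연산)
    "elementwise": (256, 1),
    "tiled": (32, 8),
    "manual_vectorized": (8, 32),
    "mojo_vectorize": (32, 8),
}
for name, (threads, ops) in patterns.items():
    assert threads * ops == total_simd_ops, name  # 분배만 다를 뿐 총량은 동일
print(total_simd_ops)  # 256
```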

์„ฑ๋Šฅ ํŠน์„ฑ๊ณผ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

ํ•ต์‹ฌ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ ์š”์•ฝ

์ธก๋ฉด์Šค๋ ˆ๋“œ ๋งŽ์Œ (์š”์†Œ๋ณ„)์Šค๋ ˆ๋“œ ์ค‘๊ฐ„ (ํƒ€์ผ๋ง/vectorize)์Šค๋ ˆ๋“œ ์ ์Œ (์ˆ˜๋™)
๋ณ‘๋ ฌ์„ฑ์ตœ๋Œ€ ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰๊ท ํ˜• ์žกํžŒ ์ ‘๊ทผ์ตœ์†Œํ•œ์˜ ๋ณ‘๋ ฌ์„ฑ
์บ์‹œ ์ง€์—ญ์„ฑ์Šค๋ ˆ๋“œ ๊ฐ„ ๋‚ฎ์Œํƒ€์ผ ๋‚ด์—์„œ ์–‘ํ˜ธ์ˆœ์ฐจ ์ ‘๊ทผ์œผ๋กœ ์šฐ์ˆ˜
๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์–‘ํ˜ธํ•œ ๋ณ‘ํ•ฉ์–‘ํ˜ธ + ์บ์‹œ ์žฌ์‚ฌ์šฉ์ด๋ก ์  ์ตœ๋Œ“๊ฐ’
๋ณต์žก๋„๊ฐ€์žฅ ๋‹จ์ˆœ๋ณดํ†ต๊ฐ€์žฅ ๋ณต์žก

๊ฐ ํŒจํ„ด์˜ ์„ ํƒ ๊ธฐ์ค€

์š”์†Œ๋ณ„ ํŒจํ„ด์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • ์š”์†Œ๋‹น ์—ฐ์‚ฐ๋Ÿ‰์ด ์ ์€ ๋‹จ์ˆœํ•œ ์—ฐ์‚ฐ
  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰์„ ์œ„ํ•ด ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํฌ๊ธฐ์— ๋Œ€ํ•œ ํ™•์žฅ์„ฑ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ

ํƒ€์ผ๋ง/vectorize๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ:

  • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์˜ ์ด์ ์ด ์žˆ๋Š” ์บ์‹œ ๋ฏผ๊ฐ ์—ฐ์‚ฐ
  • ์„ฑ๋Šฅ๊ณผ ์œ ์ง€๋ณด์ˆ˜์„ฑ์˜ ๊ท ํ˜•์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ์ž๋™ ์ตœ์ ํ™”(vectorize)๊ฐ€ ์„ ํ˜ธ๋˜๋Š” ๊ฒฝ์šฐ

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ:

  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์— ๋Œ€ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ์ตœ๋Œ€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ
  • ๊ฐœ๋ฐœ ๋ณต์žก๋„๋ฅผ ๊ฐ์ˆ˜ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ

ํ•˜๋“œ์›จ์–ด ๊ณ ๋ ค ์‚ฌํ•ญ

ํ˜„๋Œ€ GPU ์•„ํ‚คํ…์ฒ˜์—๋Š” Mojo๊ฐ€ ์ถ”์ƒํ™”ํ•˜๋Š” ์—ฌ๋Ÿฌ ์ˆ˜์ค€์ด ์žˆ์Šต๋‹ˆ๋‹ค:

ํ•˜๋“œ์›จ์–ด ์‹ค์ œ ๊ตฌ์กฐ:

  • ์›Œํ”„: 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ก์Šคํ…์œผ๋กœ ์‹คํ–‰
  • Streaming Multiprocessor (SM): ์—ฌ๋Ÿฌ ์›Œํ”„๊ฐ€ ๋™์‹œ์— ์‹คํ–‰
  • SIMD ์œ ๋‹›: ๊ฐ SM ๋‚ด์˜ ๋ฒกํ„ฐ ์ฒ˜๋ฆฌ ์œ ๋‹›
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: L1/L2 ์บ์‹œ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ

Mojo ์ถ”์ƒํ™”์˜ ์ด์ :

  • ์›Œํ”„ ์ •๋ ฌ๊ณผ ์Šค์ผ€์ค„๋ง์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํˆฌ๋ช…ํ•˜๊ฒŒ ์ตœ์ ํ™”
  • SM ๊ฐ„ ๋ฆฌ์†Œ์Šค ํ• ๋‹น์„ ๊ด€๋ฆฌ
  • GPU ๋ฒค๋” ๊ฐ„ ์ด์‹ ๊ฐ€๋Šฅํ•œ ์„ฑ๋Šฅ ์ œ๊ณต

์„ฑ๋Šฅ์— ๋Œ€ํ•œ ์‚ฌ๊ณ  ๋ชจ๋ธ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋‘ ๊ฐ€์ง€ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ๋ณ‘๋ ฌ์„ฑ ์œ ํ˜•์„ ๊ด€๋ฆฌํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”:

์Šค๋ ˆ๋“œ ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ:

  • ๋ณ‘๋ ฌ ๊ตฌ์กฐ๋ฅผ ์ œ๊ณต (์‹คํ–‰ ์œ ๋‹›์˜ ์ˆ˜)
  • ๋™์‹œ ์‹คํ–‰์„ ํ†ตํ•œ ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰ ๊ฐ€๋Šฅ
  • GPU ์Šค์ผ€์ค„๋Ÿฌ๊ฐ€ ์ž๋™์œผ๋กœ ๊ด€๋ฆฌ

SIMD ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ:

  • ๊ฐ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ ๋ฒกํ„ฐํ™”๋ฅผ ์ œ๊ณต
  • ์Šค๋ ˆ๋“œ๋‹น ์‚ฐ์ˆ  ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๊ทน๋Œ€ํ™”
  • ๋ฒกํ„ฐ ์ฒ˜๋ฆฌ ์œ ๋‹›์„ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉ

์ตœ์  ์„ฑ๋Šฅ ๊ณต์‹:

성능 = (지연 시간 은닉을 위한 충분한 스레드) ×
       (효율적인 SIMD 활용) ×
       (최적의 메모리 접근 패턴)

ํ™•์žฅ์„ฑ ๊ณ ๋ ค ์‚ฌํ•ญ

๋ฌธ์ œ ํฌ๊ธฐ์ตœ์  ํŒจํ„ด๊ทผ๊ฑฐ
์†Œ๊ทœ๋ชจ (< 1K)ํƒ€์ผ๋ง/vectorize๋‚ฎ์€ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ
์ค‘๊ทœ๋ชจ (1K-1M)๋ชจ๋“  ํŒจํ„ด์œ ์‚ฌํ•œ ์„ฑ๋Šฅ
๋Œ€๊ทœ๋ชจ (> 1M)๋ณดํ†ต ์š”์†Œ๋ณ„๋ณ‘๋ ฌ์„ฑ์ด ์ง€๋ฐฐ์ 

์ตœ์ ์˜ ์„ ํƒ์€ ํŠน์ • ํ•˜๋“œ์›จ์–ด, ์›Œํฌ๋กœ๋“œ ๋ณต์žก๋„, ๊ฐœ๋ฐœ ์ œ์•ฝ ์กฐ๊ฑด์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

GPU ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…์„ ํ™•์‹คํžˆ ์ดํ•ดํ–ˆ๋‹ค๋ฉด:

💡 핵심 요약: GPU 스레드와 SIMD 연산은 상호 보완적인 병렬성 수준으로 함께 동작합니다. 이 둘의 관계를 이해하면 구체적인 성능 요구 사항과 제약 조건에 맞는 올바른 패턴을 선택할 수 있습니다.

📊 Mojo 벤치마킹 - 성능 분석과 최적화

๊ฐœ์š”

์š”์†Œ๋ณ„, ํƒ€์ผ๋ง, ์ˆ˜๋™ ๋ฒกํ„ฐํ™”, Mojo vectorize ํŒจํ„ด์„ ํ•™์Šตํ•œ ํ›„, ์ด์ œ ์‹ค์ œ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•  ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค. p21.mojo์— ๋‚ด์žฅ๋œ ๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ๋ฒ•์„ ๊ณผํ•™์ ์œผ๋กœ ๋น„๊ตํ•˜๊ณ  ์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•„๋ด…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋ก ์  ๋ถ„์„์€ ๊ฐ€์น˜ ์žˆ์ง€๋งŒ, ์‹ค์ฆ์  ๋ฒค์น˜๋งˆํ‚น์ด ํŠน์ • ํ•˜๋“œ์›จ์–ด์—์„œ์˜ ์‹ค์ œ ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๋ฒค์น˜๋งˆํฌ ์‹คํ–‰

์ „์ฒด ๋ฒค์น˜๋งˆํฌ๋ฅผ ์‹คํ–‰ํ•˜๋ ค๋ฉด:

pixi run p23 --benchmark
pixi run -e amd p23 --benchmark
pixi run -e apple p23 --benchmark
uv run poe p23 --benchmark

๊ฐ ํŒจํ„ด์— ๋Œ€ํ•œ ์„ฑ๋Šฅ ์ธก์ • ๊ฒฐ๊ณผ๊ฐ€ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

SIZE: 1024
simd_width: 4
Running P21 GPU Benchmarks...
SIMD width: 4
--------------------------------------------------------------------------------
Testing SIZE=16, TILE=4
Running elementwise_16_4
Running tiled_16_4
Running manual_vectorized_16_4
Running vectorized_16_4
--------------------------------------------------------------------------------
Testing SIZE=128, TILE=16
Running elementwise_128_16
Running tiled_128_16
Running manual_vectorized_128_16
--------------------------------------------------------------------------------
Testing SIZE=128, TILE=16, Vectorize within tiles
Running vectorized_128_16
--------------------------------------------------------------------------------
Testing SIZE=1048576 (1M), TILE=1024
Running elementwise_1M_1024
Running tiled_1M_1024
Running manual_vectorized_1M_1024
Running vectorized_1M_1024
| name                      | met (ms)              | iters |
| ------------------------- | --------------------- | ----- |
| elementwise_16_4          | 0.0033248             | 100   |
| tiled_16_4                | 0.00327392            | 100   |
| manual_vectorized_16_4    | 0.0036169600000000002 | 100   |
| vectorized_16_4           | 0.0037209599999999997 | 100   |
| elementwise_128_16        | 0.00351999            | 100   |
| tiled_128_16              | 0.00370431            | 100   |
| manual_vectorized_128_16  | 0.0043696             | 100   |
| vectorized_128_16         | 0.00378048            | 100   |
| elementwise_1M_1024       | 0.03130143            | 100   |
| tiled_1M_1024             | 0.6892189000000001    | 100   |
| manual_vectorized_1M_1024 | 0.5923888             | 100   |
| vectorized_1M_1024        | 0.1876688             | 100   |

Benchmarks completed!

๋ฒค์น˜๋งˆํฌ ์„ค์ •

๋ฒค์น˜๋งˆํ‚น ์‹œ์Šคํ…œ์€ Mojo์˜ ๋‚ด์žฅ benchmark ๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

from benchmark import Bench, BenchConfig, Bencher, BenchId, keep
bench_config = BenchConfig(max_iters=10, num_warmup_iters=1)
  • max_iters=10: ํ†ต๊ณ„์  ์‹ ๋ขฐ์„ฑ์„ ์œ„ํ•ด ์ตœ๋Œ€ 10ํšŒ ๋ฐ˜๋ณต
  • num_warmup_iters=1: ์ธก์ • ์ „ GPU ์›Œ๋ฐ์—…
  • Benchmark ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”

๋ฒค์น˜๋งˆํ‚น ๊ตฌํ˜„์˜ ํ•ต์‹ฌ

ํ•ต์‹ฌ ์›Œํฌํ”Œ๋กœ์šฐ ํŒจํ„ด

๊ฐ ๋ฒค์น˜๋งˆํฌ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ฐ„๊ฒฐํ•œ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

@parameter
def benchmark_pattern_parameterized[test_size: Int, tile_size: Int](mut b: Bencher) raises:
    bench_ctx = DeviceContext()
    # ์…‹์—…: ๋ฒ„ํผ ์ƒ์„ฑ ๋ฐ ๋ฐ์ดํ„ฐ ์ดˆ๊ธฐํ™”
    @parameter
    def pattern_workflow(ctx: DeviceContext) raises:
      # ์—ฐ์‚ฐ: ์ธก์ • ๋Œ€์ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‹คํ–‰

    b.iter_custom[pattern_workflow](bench_ctx)
    # ์ตœ์ ํ™” ๋ฐฉ์ง€: keep(out.unsafe_ptr())
    # ๋™๊ธฐํ™”: ctx.synchronize()

์ฃผ์š” ๋‹จ๊ณ„:

  1. ์…‹์—…: ๋ฒ„ํผ ํ• ๋‹น ๋ฐ ๋ฐ์ดํ„ฐ ์ดˆ๊ธฐํ™”
  2. ์—ฐ์‚ฐ: ๋ฒค์น˜๋งˆํฌ ๋Œ€์ƒ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‹คํ–‰
  3. ์ตœ์ ํ™” ๋ฐฉ์ง€: ์ •ํ™•ํ•œ ์ธก์ •์„ ์œ„ํ•ด ํ•„์ˆ˜
  4. ๋™๊ธฐํ™”: GPU ์ž‘์—… ์™„๋ฃŒ ํ™•์ธ
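위 1-4 단계를 CPU 쪽에서 단순화해 흉내 낸 Python 스케치입니다. 실제로는 Mojo의 benchmark 모듈과 DeviceContext를 사용하며, 아래 `bench` 함수와 워크로드는 설명용으로 만든 것입니다.

```python
# 워밍업 -> 반복 측정 -> 결과 유지 순서의 벤치마크 워크플로우 스케치 (설명용).
import time

def bench(workload, warmup=1, iters=10):
    """워밍업 후 반복당 평균 소요 시간(초)을 반환."""
    for _ in range(warmup):                       # 1. 셋업/워밍업: 일회성 비용 제외
        workload()
    start = time.perf_counter()
    results = [workload() for _ in range(iters)]  # 2. 측정 대상 연산 반복 실행
    elapsed = time.perf_counter() - start
    assert results                                # 3. keep()에 해당: 결과를 붙들어 둠
    return elapsed / iters                        # 4. 반복당 평균 시간

mean_s = bench(lambda: sum(i * i for i in range(10_000)))
print(f"mean per-iteration: {mean_s * 1e3:.4f} ms")
```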

중요: keep() 함수 keep(out.unsafe_ptr())는 컴파일러가 연산 결과를 "사용되지 않는 코드"로 최적화하여 제거하는 것을 방지합니다. 이것이 없으면 알고리즘 대신 아무것도 측정하지 못할 수 있습니다! GPU 커널은 비동기적으로 실행되기 때문에 정확한 GPU 벤치마킹에 필수적입니다.

์ปค์Šคํ…€ ๋ฐ˜๋ณต์ด GPU์— ํ•„์š”ํ•œ ์ด์œ 

์ผ๋ฐ˜์ ์ธ ๋ฒค์น˜๋งˆํ‚น์€ CPU ์Šคํƒ€์ผ์˜ ๋™๊ธฐ ์‹คํ–‰์„ ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. GPU ์ปค๋„์€ ๋น„๋™๊ธฐ์ ์œผ๋กœ ์‹คํ–‰๋˜๋ฏ€๋กœ ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • GPU ์ปจํ…์ŠคํŠธ ๊ด€๋ฆฌ: ์ ์ ˆํ•œ DeviceContext ์ƒ๋ช…์ฃผ๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ: ๋ฐ˜๋ณต ๊ฐ„ ๋ฒ„ํผ ์ •๋ฆฌ
  • ๋™๊ธฐํ™” ์ฒ˜๋ฆฌ: ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์˜ ์ •ํ™•ํ•œ ํƒ€์ด๋ฐ
  • ์˜ค๋ฒ„ํ—ค๋“œ ๋ถ„๋ฆฌ: ์…‹์—… ๋น„์šฉ๊ณผ ์—ฐ์‚ฐ ๋น„์šฉ์˜ ๋ถ„๋ฆฌ

ํ…Œ์ŠคํŠธ ์‹œ๋‚˜๋ฆฌ์˜ค์™€ ์Šค๋ ˆ๋“œ ๋ถ„์„

๋ฒค์น˜๋งˆํฌ ๋ชจ์Œ์€ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•ด ์„ธ ๊ฐ€์ง€ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ ํ…Œ์ŠคํŠธํ•ฉ๋‹ˆ๋‹ค:

์Šค๋ ˆ๋“œ ํ™œ์šฉ ์š”์•ฝ

๋ฌธ์ œ ํฌ๊ธฐํŒจํ„ด์Šค๋ ˆ๋“œ ์ˆ˜์Šค๋ ˆ๋“œ๋‹น SIMD ์—ฐ์‚ฐ์ด SIMD ์—ฐ์‚ฐ
SIZE=16์š”์†Œ๋ณ„414
ํƒ€์ผ๋ง414
์ˆ˜๋™144
vectorize414
SIZE=128์š”์†Œ๋ณ„32132
ํƒ€์ผ๋ง8432
์ˆ˜๋™21632
vectorize8432
SIZE=1M์š”์†Œ๋ณ„262,1441262,144
ํƒ€์ผ๋ง1,024256262,144
์ˆ˜๋™2561,024262,144
vectorize1,024256262,144
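위 스레드 수는 "스레드 하나가 맡는 요소 수"로 문제 크기를 나눈 값입니다. 아래 Python 스케치(`threads_for`는 설명용 이름)로 세 가지 문제 크기의 행을 재현해 볼 수 있습니다.

```python
# 스레드 활용 수치 재현 스케치 (설명용): 패턴마다 스레드 하나가
# 맡는 요소 수가 다르므로 스레드 수가 달라집니다.
SIMD_WIDTH = 4

def threads_for(size: int, tile: int) -> dict[str, int]:
    return {
        "elementwise": size // SIMD_WIDTH,                 # 스레드당 SIMD_WIDTH개 요소
        "tiled": size // tile,                             # 스레드당 타일 1개
        "manual_vectorized": size // (tile * SIMD_WIDTH),  # 스레드당 타일*폭 요소
        "mojo_vectorize": size // tile,
    }

print(threads_for(16, 4))            # elementwise 4, tiled 4, manual 1, vectorize 4
print(threads_for(1_048_576, 1024))  # elementwise 262144, tiled 1024, manual 256
```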

๋ฌธ์ œ ํฌ๊ธฐ๋ณ„ ์„ฑ๋Šฅ ํŠน์„ฑ

์†Œ๊ทœ๋ชจ ๋ฌธ์ œ (SIZE=16):

  • ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ์  (~0.003ms ๊ธฐ์ค€์„ )
  • ์Šค๋ ˆ๋“œ ์ˆ˜ ์ฐจ์ด๋Š” ๊ฑฐ์˜ ๋ฌด์˜๋ฏธ
  • ํƒ€์ผ๋ง/vectorize๊ฐ€ ์•ฝ๊ฐ„ ๋‚ฎ์€ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ๋ณด์ž„

์ค‘๊ทœ๋ชจ ๋ฌธ์ œ (SIZE=128):

  • ์—ฌ์ „ํžˆ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ์  (~0.003ms ์ „ ํŒจํ„ด)
  • ์„ฑ๋Šฅ ์ฐจ์ด๊ฐ€ ๊ฑฐ์˜ ์‚ฌ๋ผ์ง
  • ์˜ค๋ฒ„ํ—ค๋“œ ์ง€๋ฐฐ์—์„œ ์—ฐ์‚ฐ ์ง€๋ฐฐ๋กœ์˜ ์ „ํ™˜ ๊ตฌ๊ฐ„

๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ (SIZE=1M):

  • ์‹ค์งˆ์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ฐจ์ด๊ฐ€ ๋“œ๋Ÿฌ๋‚จ
  • ๋น„๋ณ‘ํ•ฉ ๋กœ๋“œ์˜ ์˜ํ–ฅ์ด ๋ช…ํ™•ํ•ด์ง
  • ๋šœ๋ ทํ•œ ์„ฑ๋Šฅ ์ˆœ์œ„๊ฐ€ ๋‚˜ํƒ€๋‚จ

๋ฐ์ดํ„ฐ๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ

๋‹ค์–‘ํ•œ ํ•˜๋“œ์›จ์–ด์—์„œ์˜ ์‹ค์ฆ์  ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ:

์„ฑ๋Šฅ ์ˆœ์œ„ (๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ)

| 순위 | 패턴 | 소요 시간 | 핵심 인사이트 |
| --- | --- | --- | --- |
| 🥇 | 요소별 | ~0.03ms | 병합 메모리 접근이 메모리 바운드 연산에서 승리 |
| 🥈 | Mojo vectorize | ~0.19ms | 비병합 메모리 접근이 성능을 저하 |
| 🥉 | 수동 벡터화 | ~0.59ms | 비병합 메모리 접근과 수동 최적화가 성능 감소 |
| 4위 | 타일링 | ~0.69ms | 비병합 메모리 접근, SIMD 로드 없는 수동 최적화가 성능을 더 저하 |

ํ•ต์‹ฌ ์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ

๋‹จ์ˆœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์˜ ๊ฒฝ์šฐ: ์ตœ๋Œ€ ๋ณ‘๋ ฌ์„ฑ(elementwise)์ด ๋Œ€๊ทœ๋ชจ์—์„œ ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”๋ณด๋‹ค ์šฐ์ˆ˜ํ•ฉ๋‹ˆ๋‹ค.

์š”์†Œ๋ณ„ ํŒจํ„ด์ด ์Šน๋ฆฌํ•˜๋Š” ์ด์œ :

  • 262,144๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ์šฐ์ˆ˜ํ•œ ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰์„ ์ œ๊ณต
  • ๋‹จ์ˆœํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ์ข‹์€ ๋ณ‘ํ•ฉ์„ ๋‹ฌ์„ฑ
  • ์Šค๋ ˆ๋“œ๋‹น ์ตœ์†Œํ•œ์˜ ์˜ค๋ฒ„ํ—ค๋“œ
  • GPU ์ฝ”์–ด ์ˆ˜์— ๋”ฐ๋ผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํ™•์žฅ

ํƒ€์ผ๋ง๊ณผ vectorize๊ฐ€ ๊ฒฝ์Ÿ๋ ฅ ์žˆ๋Š” ์ด์œ :

  • ๋ณ‘๋ ฌ์„ฑ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ ์‚ฌ์ด์˜ ๊ท ํ˜• ์žกํžŒ ์ ‘๊ทผ
  • ์ž๋™ ์ตœ์ ํ™”(vectorize)๊ฐ€ ์ˆ˜๋™ ํƒ€์ผ๋ง๊ณผ ๊ฑฐ์˜ ๋™๋“ฑํ•œ ์„ฑ๋Šฅ
  • ๊ณผ๋„ํ•œ ๋ณต์žก๋„ ์—†์ด ์–‘ํ˜ธํ•œ ์Šค๋ ˆ๋“œ ํ™œ์šฉ

์ˆ˜๋™ ๋ฒกํ„ฐํ™”๊ฐ€ ๊ณ ์ „ํ•˜๋Š” ์ด์œ :

  • 256๊ฐœ ์Šค๋ ˆ๋“œ๋งŒ์œผ๋กœ๋Š” ๋ณ‘๋ ฌ์„ฑ์ด ์ œํ•œ์ 
  • ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ์ด ์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ถ”๊ฐ€
  • ์Šค๋ ˆ๋“œ๋‹น ๋Œ€ํ˜• ์ฒญํฌ๋กœ ์ธํ•œ ์บ์‹œ ๋ถ€๋‹ด
  • ๋‹จ์ˆœ ์‚ฐ์ˆ ์—์„œ ํšจ๊ณผ ์ฒด๊ฐ

ํ”„๋ ˆ์ž„์›Œํฌ ์ž๋™ํ™” ๊ธฐ๋Šฅ:

  • ์ž๋™ ๋ฐ˜๋ณต ํšŸ์ˆ˜ ์กฐ์ • (91-100ํšŒ ๋ฐ˜๋ณต)
  • ์„œ๋กœ ๋‹ค๋ฅธ ์‹คํ–‰ ์‹œ๊ฐ„์— ๊ฑธ์นœ ํ†ต๊ณ„์  ์‹ ๋ขฐ์„ฑ
  • ๋ฐœ์—ด ์ œํ•œ๊ณผ ์‹œ์Šคํ…œ ๋ณ€๋™์— ๋Œ€์‘

๊ฒฐ๊ณผ ํ•ด์„ํ•˜๊ธฐ

์ถœ๋ ฅ ํ…Œ์ด๋ธ” ์ฝ๊ธฐ

| name                     | met (ms)           | iters |
| ------------------------ | ------------------ | ----- |
| elementwise_1M_1024      | 0.03130143         | 100   |
  • met (ms): ๋‹จ์ผ ๋ฐ˜๋ณต์˜ ์‹คํ–‰ ์‹œ๊ฐ„
  • iters: ์ˆ˜ํ–‰๋œ ๋ฐ˜๋ณต ํšŸ์ˆ˜
  • ๋™์ผ ๋ฌธ์ œ ํฌ๊ธฐ ๋‚ด์—์„œ ๋น„๊ต: ๊ฐ™์€ ํฌ๊ธฐ๋ผ๋ฆฌ ๋น„๊ตํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ์˜๋ฏธ ์žˆ์Œ

์ตœ์ ํ™” ์˜์‚ฌ๊ฒฐ์ •

์‹ค์ฆ์  ์ฆ๊ฑฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํŒจํ„ด์„ ์„ ํƒํ•˜์„ธ์š”:

ํ”„๋กœ๋•์…˜ ์›Œํฌ๋กœ๋“œ์˜ ๊ฒฝ์šฐ:

  • ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ (>100K ์š”์†Œ): ์š”์†Œ๋ณ„ ํŒจํ„ด์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์ตœ์ 
  • ์†Œ๊ทœ๋ชจ/์‹œ์ž‘ ๋ฐ์ดํ„ฐ์…‹ (<1K ์š”์†Œ): ๋‚ฎ์€ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์œ„ํ•ด ํƒ€์ผ๋ง ๋˜๋Š” vectorize
  • ๊ฐœ๋ฐœ ์†๋„ ์šฐ์„ : ์ž๋™ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ Mojo vectorize
  • ์ˆ˜๋™ ๋ฒกํ„ฐํ™” ์ง€์–‘: ๋‹จ์ˆœ ์—ฐ์‚ฐ์—์„œ๋Š” ๋ณต์žก๋„๊ฐ€ ์„ฑ๋Šฅ์œผ๋กœ ๋ณด์ƒ๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋“œ๋ฌพ

์„ฑ๋Šฅ ์ตœ์ ํ™” ์›Œํฌํ”Œ๋กœ์šฐ:

  1. ๋จผ์ € ํ”„๋กœํŒŒ์ผ๋ง: ์ตœ์ ํ™”ํ•˜๊ธฐ ์ „์— ์ธก์ •
  2. ๋Œ€๊ทœ๋ชจ์—์„œ ํ…Œ์ŠคํŠธ: ์†Œ๊ทœ๋ชจ ๋ฌธ์ œ๋Š” ์‹ค์ œ ์„ฑ๋Šฅ์— ๋Œ€ํ•ด ์˜คํ•ด๋ฅผ ์ค„ ์ˆ˜ ์žˆ์Œ
  3. ์ด๋น„์šฉ ๊ณ ๋ ค: ๊ฐœ๋ฐœ ๋ฐ ์œ ์ง€๋ณด์ˆ˜ ๋…ธ๋ ฅ์„ ํฌํ•จ
  4. ๊ฐœ์„  ์‚ฌํ•ญ ๊ฒ€์ฆ: ๋Œ€์ƒ ํ•˜๋“œ์›จ์–ด์—์„œ ๋ฒค์น˜๋งˆํฌ๋กœ ํ™•์ธ

๊ณ ๊ธ‰ ๋ฒค์น˜๋งˆํ‚น ๊ธฐ๋ฒ•

์ปค์Šคํ…€ ํ…Œ์ŠคํŠธ ์‹œ๋‚˜๋ฆฌ์˜ค

๋งค๊ฐœ๋ณ€์ˆ˜๋ฅผ ์ˆ˜์ •ํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์กฐ๊ฑด์„ ํ…Œ์ŠคํŠธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํฌ๊ธฐ
benchmark_elementwise_parameterized[1024, 32]  # ๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ
benchmark_elementwise_parameterized[64, 8]     # ์†Œ๊ทœ๋ชจ ๋ฌธ์ œ

# ๋‹ค์–‘ํ•œ ํƒ€์ผ ํฌ๊ธฐ
benchmark_tiled_parameterized[256, 8]   # ์ž‘์€ ํƒ€์ผ
benchmark_tiled_parameterized[256, 64]  # ํฐ ํƒ€์ผ

ํ•˜๋“œ์›จ์–ด ๊ณ ๋ ค ์‚ฌํ•ญ

๊ฒฐ๊ณผ๋Š” ๋‹ค์Œ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค:

  • GPU ์•„ํ‚คํ…์ฒ˜: SIMD ํญ, ์ฝ”์–ด ์ˆ˜, ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  • ์‹œ์Šคํ…œ ๊ตฌ์„ฑ: PCIe ๋Œ€์—ญํญ, CPU ์„ฑ๋Šฅ
  • ์—ด ์ƒํƒœ: GPU ๋ถ€์ŠคํŠธ ํด๋Ÿญ vs ์ง€์† ์„ฑ๋Šฅ
  • ๋™์‹œ ์›Œํฌ๋กœ๋“œ: GPU ํ™œ์šฉ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์Šค

๋ชจ๋ฒ” ์‚ฌ๋ก€ ์š”์•ฝ

๋ฒค์น˜๋งˆํ‚น ์›Œํฌํ”Œ๋กœ์šฐ:

  1. ์ค‘์š”ํ•œ ์ธก์ • ์ „์— GPU ์›Œ๋ฐ์—…
  2. ํ†ต๊ณ„์  ์œ ์˜์„ฑ์„ ์œ„ํ•ด ์—ฌ๋Ÿฌ ๋ฒˆ ๋ฐ˜๋ณต ์‹คํ–‰
  3. ํ™•์žฅ ํŠน์„ฑ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํฌ๊ธฐ ํ…Œ์ŠคํŠธ
  4. ์ตœ์ ํ™” ์•„ํ‹ฐํŒฉํŠธ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด keep()์„ ์ผ๊ด€๋˜๊ฒŒ ์‚ฌ์šฉ
  5. ๋™์ผ ์กฐ๊ฑด์—์„œ ๋น„๊ต (๊ฐ™์€ ๋ฌธ์ œ ํฌ๊ธฐ, ๊ฐ™์€ ํ•˜๋“œ์›จ์–ด)

์„ฑ๋Šฅ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ:

  • ๋‹จ์ˆœํ•˜๊ฒŒ ์‹œ์ž‘: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—๋Š” ์š”์†Œ๋ณ„ ํŒจํ„ด๋ถ€ํ„ฐ
  • ์ถ”์ธกํ•˜์ง€ ๋ง๊ณ  ์ธก์ •: ์ด๋ก ์  ๋ถ„์„์€ ๋ฐฉํ–ฅ์„, ์‹ค์ฆ์  ๋ฐ์ดํ„ฐ๊ฐ€ ๊ฒฐ์ •์„
  • ๊ทœ๋ชจ๊ฐ€ ์ค‘์š”: ์†Œ๊ทœ๋ชจ ๋ฌธ์ œ์˜ ์„ฑ๋Šฅ์ด ๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ์˜ ๋™์ž‘์„ ์˜ˆ์ธกํ•˜์ง€ ๋ชปํ•จ
  • ์ด๋น„์šฉ ์ตœ์ ํ™”: ๊ฐœ๋ฐœ ์‹œ๊ฐ„ vs ๋Ÿฐํƒ€์ž„ ์„ฑ๋Šฅ์˜ ๊ท ํ˜•

๋‹ค์Œ ๋‹จ๊ณ„

๋ฒค์น˜๋งˆํ‚น ๊ธฐ์ˆ ์„ ๊ฐ–์ถ”์—ˆ๋‹ค๋ฉด:

  • ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ”„๋กœํŒŒ์ผ๋ง: ์ด ํŒจํ„ด๋“ค์„ ์‹ค์ œ ์›Œํฌ๋กœ๋“œ์— ์ ์šฉ
  • ๊ณ ๊ธ‰ GPU ํŒจํ„ด: ๋ฆฌ๋•์…˜, ํ•ฉ์„ฑ๊ณฑ, ํ–‰๋ ฌ ์—ฐ์‚ฐ ํƒ๊ตฌ
  • ๋ฉ€ํ‹ฐ GPU ํ™•์žฅ: ๋ถ„์‚ฐ GPU ์ปดํ“จํŒ… ํŒจํ„ด ์ดํ•ด
  • ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ ๊ธ‰ ์บ์‹ฑ์„ ๋” ๊นŠ์ด ํƒ๊ตฌ

💡 핵심 요약: 벤치마킹은 이론적 이해를 실질적인 성능 최적화로 전환합니다. 실증적 데이터를 사용하여 특정 하드웨어와 워크로드 특성에 가장 적합한 패턴을 선택하세요.

์•ž์œผ๋กœ์˜ ๋ฐฉํ–ฅ: ๋” ๋งŽ์€ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•  ๋•Œ

Part VI์˜ ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์€ ๋Œ€๋ถ€๋ถ„์˜ ์›Œํฌ๋กœ๋“œ์—์„œ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•˜์ง€๋งŒ, ์ผ๋ถ€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์ง์ ‘์ ์ธ ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์œ ์šฉํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ๋ฆฌ๋•์…˜: ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์— ๊ฑธ์นœ ํ•ฉ๊ณ„, ์ตœ๋Œ“๊ฐ’, ์ตœ์†Ÿ๊ฐ’ ์—ฐ์‚ฐ
  • ๋ˆ„์  ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ์ด๋™ ์ตœ๋Œ“๊ฐ’
  • ๋ฐ์ดํ„ฐ ์…”ํ”Œ: ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜
  • ํ˜‘๋ ฅ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์Šค๋ ˆ๋“œ ๊ฐ„ ๊ธด๋ฐ€ํ•œ ์กฐ์ •์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ

์„ฑ๋Šฅ ๋ฏธ๋ฆฌ๋ณด๊ธฐ:

Part VII์—์„œ๋Š” Part III์˜ ์—ฌ๋Ÿฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹ค์‹œ ์‚ดํŽด๋ณด๋ฉฐ ์›Œํ”„ ์—ฐ์‚ฐ์ด ์–ด๋–ป๊ฒŒ:

  • ์ฝ”๋“œ๋ฅผ ๊ฐ„์†Œํ™”ํ•˜๋Š”์ง€: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๋Š”์ง€: ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์ œ๊ฑฐํ•˜๊ณ  ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ๊ฐ์†Œ
  • ์ƒˆ๋กœ์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š”์ง€: ์ˆœ์ˆ˜ ํ•จ์ˆ˜ํ˜• ์ ‘๊ทผ์œผ๋กœ๋Š” ๋ถˆ๊ฐ€๋Šฅํ•œ ํŒจํ„ด์„ ๊ตฌํ˜„

๋‹ค์Œ ๋‚ด์šฉ: Part VII: ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ - Puzzle 14์˜ ๋ˆ„์  ํ•ฉ์„ ์™„์ „ํžˆ ์ƒˆ๋กญ๊ฒŒ ๊ตฌํ˜„ํ•˜๋Š” ๊ฒƒ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ

๊ฐœ์š”

Part VII: ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ๋Š” GPU์˜ ์›Œํ”„ ๋ ˆ๋ฒจ ๊ธฐ๋ณธ ์š”์†Œ - ์›Œํ”„ ๋‚ด ๋™๊ธฐํ™”๋œ ์Šค๋ ˆ๋“œ ์‹คํ–‰์„ ํ™œ์šฉํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ์œจ์ ์ธ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋‚ด์žฅ ์›Œํ”„ ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ชฉํ‘œ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜๋Š” ํšจ์œจ์ ์ธ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์›Œํ”„๋Š” ๋ก์Šคํ…(lockstep)์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค - Mojo์˜ ์›Œํ”„ ์—ฐ์‚ฐ์€ ์ด ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ๊ฐ•๋ ฅํ•œ ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

GPU ์›Œํ”„ ์‹คํ–‰ ๋ชจ๋ธ

GPU ๋ณ‘๋ ฌ์„ฑ์˜ ๊ธฐ๋ณธ ํ•˜๋“œ์›จ์–ด ๋‹จ์œ„๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU 블록 (예: 256 스레드)
├── 워프 0 (32 스레드, SIMT 록스텝 실행)
│   ├── 레인 0  ─┐
│   ├── 레인 1   │ 모든 스레드가 같은 명령을
│   ├── 레인 2   │ 동시에 실행 (SIMT)
│   │   ...      │
│   └── 레인 31 ─┘
├── 워프 1 (32 스레드, 독립적)
├── 워프 2 (32 스레드, 독립적)
└── ...

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • NVIDIA GPU์—์„œ ์›Œํ”„๋‹น 32 ์Šค๋ ˆ๋“œ (WARP_SIZE=32)
  • AMD GPU์—์„œ ์›Œํ”„๋‹น 32 ๋˜๋Š” 64 ์Šค๋ ˆ๋“œ (WARP_SIZE=32 or 64)
  • ๋ก์Šคํ… ์‹คํ–‰: ์›Œํ”„ ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ๋น„์šฉ ์ œ๋กœ: ์›Œํ”„ ์—ฐ์‚ฐ์€ ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ์ฆ‰์‹œ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค

Mojo์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์›Œํ”„ ์—ฐ์‚ฐ

gpu.primitives.warp์˜ ํ•ต์‹ฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. sum(value): ์›Œํ”„์˜ ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ํ•ฉ์‚ฐ
  2. shuffle_idx(value, lane): ํŠน์ • ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ
  3. shuffle_down(value, delta): lane+delta ์œ„์น˜์˜ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ
  4. prefix_sum(value): ๋ ˆ์ธ ์ „์ฒด์— ๊ฑธ์ณ ๋ˆ„์  ํ•ฉ ๊ณ„์‚ฐ
  5. lane_id(): ํ˜„์žฌ ์Šค๋ ˆ๋“œ์˜ ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋ฐ˜ํ™˜ (0-31 ๋˜๋Š” 0-63)
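각 연산의 의미는 32개 레인을 리스트로 놓고 Python으로 흉내 내 볼 수 있습니다. 실제 워프 연산은 하드웨어에서 동기화 비용 없이 수행되며, 아래는 의미만 보여주는 설명용 모델입니다(prefix_sum은 포함 누적 합으로, shuffle_down의 마지막 레인은 자기 값 유지로 가정).

```python
# 32개 레인을 리스트로 모델링한 워프 연산 스케치 (설명용).
WARP_SIZE = 32
lanes = list(range(WARP_SIZE))  # 각 레인이 가진 값: 레인 번호와 동일하다고 가정

warp_sum = sum(lanes)                         # sum(value)
broadcast7 = [lanes[7]] * WARP_SIZE           # shuffle_idx(value, 7): 레인 7 값을 방송
shifted = [lanes[min(i + 1, WARP_SIZE - 1)]   # shuffle_down(value, 1):
           for i in range(WARP_SIZE)]         #   마지막 레인은 자기 값 유지로 가정
prefix = [sum(lanes[:i + 1]) for i in range(WARP_SIZE)]  # 포함(inclusive) prefix_sum 가정

print(warp_sum)    # 496
print(prefix[:4])  # [0, 1, 3, 6]
```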

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# 1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฆฌ๋•์…˜
# ์•ž์„œ ์‚ดํŽด๋ณธ ๋ณต์žกํ•œ ํŒจํ„ด (p12.mojo):
shared = TileTensor[
    dtype,
    row_major[WARP_SIZE](),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
shared[local_i] = partial_product
barrier()

# ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ†ตํ•œ ์•ˆ์ „ํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์€ ๊ฐ ๋‹จ๊ณ„๋งˆ๋‹ค ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:
stride = WARP_SIZE // 2
while stride > 0:
    if local_i < stride:
        shared[local_i] += shared[local_i + stride]

    barrier()
    stride //= 2

# 2. ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ™œ์šฉํ•œ ๋ฆฌ๋•์…˜
# ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ ์•ˆ์ „ํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๊ฐ ๋‹จ๊ณ„์˜ ๋ฐฐ๋ฆฌ์–ด๊ฐ€
# ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.
# Mojo์˜ ์›Œํ”„ ๋ ˆ๋ฒจ sum ์—ฐ์‚ฐ์€ ๋‚ด๋ถ€์ ์œผ๋กœ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„
# ์ˆจ๊น๋‹ˆ๋‹ค:
total = sum(partial_product)  # ๋‚ด๋ถ€์ ์œผ๋กœ ๋ฐฐ๋ฆฌ์–ด๋„, ๊ฒฝ์Ÿ ์ƒํƒœ๋„ ์—†์Šต๋‹ˆ๋‹ค!
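두 방식이 같은 결과를 내는지는 CPU에서도 확인해 볼 수 있습니다. 아래 Python 스케치는 위의 트리 리덕션을 순차적으로 흉내 내어 직접 합계와 비교합니다(GPU에서는 각 단계마다 배리어가 필요하지만, 워프 sum은 필요 없다는 점이 핵심).

```python
# 트리 리덕션을 CPU에서 순차적으로 흉내 낸 스케치 (설명용).
WARP_SIZE = 32
values = [float(i) for i in range(WARP_SIZE)]  # 레인별 부분 곱이라고 가정

shared = values[:]                 # 공유 메모리 타일의 대역
stride = WARP_SIZE // 2
while stride > 0:
    for lane in range(stride):     # lane < stride 인 스레드들이 덧셈 수행
        shared[lane] += shared[lane + stride]
    stride //= 2                   # GPU라면 여기서 barrier()가 필요

assert shared[0] == sum(values)    # 워프 sum(partial_product)과 같은 결과
print(shared[0])  # 496.0
```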

์›Œํ”„ ์—ฐ์‚ฐ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

๋ฌธ์ œ ๊ทœ๋ชจ              ๊ธฐ์กด ๋ฐฉ์‹        ์›Œํ”„ ์—ฐ์‚ฐ
๋‹จ์ผ ์›Œํ”„ (32)         ๋น ๋ฆ„            ๊ฐ€์žฅ ๋น ๋ฆ„ (๋ฐฐ๋ฆฌ์–ด ์—†์Œ)
์†Œ์ˆ˜ ์›Œํ”„ (128)        ์ข‹์Œ            ์šฐ์ˆ˜ (์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œ)
๋‹ค์ˆ˜ ์›Œํ”„ (1024+)      ์ข‹์Œ            ๋›ฐ์–ด๋‚จ (์„ ํ˜• ํ™•์žฅ)
๋Œ€๊ทœ๋ชจ (16K+)          ๋ณ‘๋ชฉ ๋ฐœ์ƒ        ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ œํ•œ

์„ ์ˆ˜ ์ง€์‹

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • Part VI ํ•จ์ˆ˜ํ˜• ํŒจํ„ด: elementwise, tiled, vectorize ์ ‘๊ทผ ๋ฐฉ์‹
  • GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ: ๋ธ”๋ก, ์›Œํ”„, ์Šค๋ ˆ๋“œ์— ๋Œ€ํ•œ ์ดํ•ด
  • TileTensor ์—ฐ์‚ฐ: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋…: ๋ฐฐ๋ฆฌ์–ด์™€ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ์™œ ๋ณต์žกํ•œ์ง€

ํ•™์Šต ๊ฒฝ๋กœ

1. SIMT ์‹คํ–‰ ๋ชจ๋ธ

โ†’ ์›Œํ”„ ๋ ˆ์ธ๊ณผ SIMT ์‹คํ–‰

์›Œํ”„ ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ฐ˜์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • SIMT(Single Instruction, Multiple Thread) ์‹คํ–‰ ๋ชจ๋ธ
  • ์›Œํ”„ ๋ถ„๊ธฐ์™€ ์ˆ˜๋ ด ํŒจํ„ด
  • ์›Œํ”„ ๋‚ด ๋ ˆ์ธ ๋™๊ธฐํ™”
  • ํ•˜๋“œ์›จ์–ด vs ์†Œํ”„ํŠธ์›จ์–ด ์Šค๋ ˆ๋“œ ๊ด€๋ฆฌ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์›Œํ”„๋Š” GPU ์‹คํ–‰์˜ ๊ธฐ๋ณธ ๋‹จ์œ„์ž…๋‹ˆ๋‹ค - SIMT๋ฅผ ์ดํ•ดํ•˜๋ฉด ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋ฌธ์ด ์—ด๋ฆฝ๋‹ˆ๋‹ค.

2. ์›Œํ”„ sum ๊ธฐ์ดˆ

→ warp.sum()의 핵심

๋‚ด์  ๊ตฌํ˜„์„ ํ†ตํ•ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ์›Œํ”„ ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด๋ฅผ sum()์œผ๋กœ ๋Œ€์ฒด
  • GPU ์•„ํ‚คํ…์ฒ˜ ๊ฐ„ ํ˜ธํ™˜์„ฑ (WARP_SIZE)
  • ์›Œํ”„๋ฅผ ํ™œ์šฉํ•œ ์ปค๋„ vs ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด
  • ๊ธฐ์กด ๋ฐฉ์‹๊ณผ์˜ ์„ฑ๋Šฅ ๋น„๊ต

ํ•ต์‹ฌ ํŒจํ„ด:

partial_result = compute_per_lane_value()
total = sum(partial_result)  # ๋งˆ๋ฒ•์ด ์ผ์–ด๋‚˜๋Š” ๊ณณ!
if lane_id() == 0:
    output[0] = total

3. ์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ

โ†’ ์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ

๋Œ€์•ˆ ๋Œ€๋น„ ์›Œํ”„ ์—ฐ์‚ฐ์„ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•œ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • ์›Œํ”„ ์—ฐ์‚ฐ์— ์œ ๋ฆฌํ•œ ๋ฌธ์ œ ํŠน์„ฑ
  • ์›Œํ”„ ์ˆ˜์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ํ™•์žฅ ํŒจํ„ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ vs ์—ฐ์‚ฐ๋Ÿ‰ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„
  • ์›Œํ”„ ์—ฐ์‚ฐ ์„ ํƒ ๊ฐ€์ด๋“œ๋ผ์ธ

์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ: ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ์ด ๋ณ‘๋ชฉ์ด ๋  ๋•Œ, ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋ŒํŒŒ๊ตฌ๋ฅผ ์ œ๊ณตํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

ํ•˜๋“œ์›จ์–ด-์†Œํ”„ํŠธ์›จ์–ด ์ •๋ ฌ

Mojo ์›Œํ”„ ์—ฐ์‚ฐ์ด GPU ํ•˜๋“œ์›จ์–ด์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • SIMT ์‹คํ–‰: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์ผํ•œ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ๋‚ด์žฅ ๋™๊ธฐํ™”: ์›Œํ”„ ๋‚ด์—์„œ ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค
  • ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ์ง€์›: WARP_SIZE๊ฐ€ NVIDIA์™€ AMD์˜ ์ฐจ์ด๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

ํŒจํ„ด ๋ณ€ํ™˜

๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํŒจํ„ด์„ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ โ†’ sum()
  • ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ โ†’ prefix_sum()
  • ๋ฐ์ดํ„ฐ ์…”ํ”Œ โ†’ shuffle_idx(), shuffle_down()

์„ฑ๋Šฅ ํŠน์„ฑ

์›Œํ”„ ์—ฐ์‚ฐ์ด ์ด์ ์„ ์ œ๊ณตํ•˜๋Š” ๊ฒฝ์šฐ๋ฅผ ํŒŒ์•…ํ•ฉ๋‹ˆ๋‹ค:

  • ์†Œ~์ค‘๊ทœ๋ชจ ๋ฌธ์ œ: ๋ฐฐ๋ฆฌ์–ด ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค
  • ๋Œ€๊ทœ๋ชจ ๋ฌธ์ œ: ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ์ค„์ด๊ณ  ์บ์‹œ ํ™œ์šฉ์„ ๊ฐœ์„ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ทœ์น™์ ์ธ ํŒจํ„ด: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ์ ‘๊ทผ ํŒจํ„ด์—์„œ ์›Œํ”„ ์—ฐ์‚ฐ์ด ํƒ์›”ํ•ฉ๋‹ˆ๋‹ค

์‹œ์ž‘ํ•˜๊ธฐ

SIMT ์‹คํ–‰ ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์œผ๋กœ ์‹œ์ž‘ํ•˜์—ฌ, ์‹ค์šฉ์ ์ธ warp.sum ๊ตฌํ˜„์„ ๋‹ค๋ฃจ๊ณ , ์ „๋žต์  ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ๋กœ ๋งˆ๋ฌด๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์›Œํ”„๋ฅผ ๋…๋ฆฝ์ ์ธ ์Šค๋ ˆ๋“œ๊ฐ€ ์•„๋‹Œ ๋™๊ธฐํ™”๋œ ๋ฒกํ„ฐ ์œ ๋‹›์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”. ์ด ๋ฉ˜ํƒˆ ๋ชจ๋ธ์ด ํšจ๊ณผ์ ์ธ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์œผ๋กœ ์•ˆ๋‚ดํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Part VII์„ ๋งˆ์น˜๋ฉด, ์›Œํ”„ ์—ฐ์‚ฐ์ด ๋ณต์žกํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์ธ์‹ํ•˜์—ฌ ๋” ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅธ GPU ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: ์›Œํ”„ ๋ ˆ์ธ๊ณผ SIMT ์‹คํ–‰ ์—์„œ ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํž˜์„ ๋งŒ๋‚˜๋ณด์„ธ์š”!

๐Ÿง  ์›Œํ”„ ๋ ˆ์ธ๊ณผ SIMT ์‹คํ–‰

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ vs SIMD ๋ฉ˜ํƒˆ ๋ชจ๋ธ

์›Œํ”„๋ž€ ๋ฌด์—‡์ธ๊ฐ€?

์›Œํ”„๋Š” 32๊ฐœ(๋˜๋Š” 64๊ฐœ)์˜ GPU ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋™์ผํ•œ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•˜๋Š” ๊ทธ๋ฃน์ž…๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฒกํ„ฐ ํ”„๋กœ์„ธ์„œ์˜ โ€œ๋ ˆ์ธโ€ ์—ญํ• ์„ ํ•˜๋Š” ๋™๊ธฐํ™”๋œ ๋ฒกํ„ฐ ์œ ๋‹›์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

๊ฐ„๋‹จํ•œ ์˜ˆ์‹œ:

from gpu.primitives.warp import sum
# ์›Œํ”„ ๋‚ด 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์‹คํ–‰:
var my_value = input[my_thread_id]     # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์˜ด
var warp_total = sum(my_value)         # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ํ•ฉ๊ณ„์— ๊ธฐ์—ฌ

๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚œ ๊ฑธ๊นŒ์š”? 32๊ฐœ์˜ ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ณต์žกํ•œ ์กฐ์œจ์„ ํ•˜๋Š” ๋Œ€์‹ , ์›Œํ”„๊ฐ€ ์ž๋™์œผ๋กœ ๋™๊ธฐํ™”ํ•˜์—ฌ ํ•˜๋‚˜์˜ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ค์–ด๋ƒˆ์Šต๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฐ”๋กœ SIMT(Single Instruction, Multiple Thread) ์‹คํ–‰์ž…๋‹ˆ๋‹ค.

SIMT vs SIMD ๋น„๊ต

CPU ๋ฒกํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ(SIMD)์— ์ต์ˆ™ํ•˜๋‹ค๋ฉด, GPU ์›Œํ”„๋Š” ๋น„์Šทํ•˜์ง€๋งŒ ํ•ต์‹ฌ์ ์ธ ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค:

๊ด€์ CPU SIMD (์˜ˆ: AVX)GPU ์›Œํ”„ (SIMT)
ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ๋ช…์‹œ์  ๋ฒกํ„ฐ ์—ฐ์‚ฐ์Šค๋ ˆ๋“œ ๊ธฐ๋ฐ˜ ํ”„๋กœ๊ทธ๋ž˜๋ฐ
๋ฐ์ดํ„ฐ ํญ๊ณ ์ • (256/512 ๋น„ํŠธ)์œ ์—ฐ (32/64 ์Šค๋ ˆ๋“œ)
๋™๊ธฐํ™”๋ช…๋ น ๋‚ด ์•”์‹œ์ ์›Œํ”„ ๋‚ด ์•”์‹œ์ 
ํ†ต์‹ ๋ฉ”๋ชจ๋ฆฌ/๋ ˆ์ง€์Šคํ„ฐ ๊ฒฝ์œ ์…”ํ”Œ ์—ฐ์‚ฐ ๊ฒฝ์œ 
๋ถ„๊ธฐ ์ฒ˜๋ฆฌํ•ด๋‹น ์—†์Œํ•˜๋“œ์›จ์–ด ๋งˆ์Šคํ‚น
์˜ˆ์‹œa + bsum(thread_value)

CPU SIMD ๋ฐฉ์‹ (C++ intrinsics):

// ๋ช…์‹œ์  ๋ฒกํ„ฐ ์—ฐ์‚ฐ - 8๊ฐœ์˜ float๋ฅผ ๋ณ‘๋ ฌ๋กœ
__m256 result = _mm256_add_ps(a, b);   // 8์Œ์„ ๋™์‹œ์— ๋ง์…ˆ

CPU SIMD ๋ฐฉ์‹ (Mojo):

# Mojo์—์„œ SIMD๋Š” ์ผ๊ธ‰ ์‹œ๋ฏผ ํƒ€์ž…์ด๋ฏ€๋กœ a, b๊ฐ€ SIMD ํƒ€์ž…์ด๋ฉด
# ๋ง์…ˆ์ด ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค
var result = a + b # 8์Œ์„ ๋™์‹œ์— ๋ง์…ˆ

GPU SIMT ๋ฐฉ์‹ (Mojo):

# ์Šค๋ ˆ๋“œ ๊ธฐ๋ฐ˜ ์ฝ”๋“œ๊ฐ€ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค
from gpu.primitives.warp import sum

var my_data = input[thread_id]         # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž๊ธฐ ์š”์†Œ๋ฅผ ๊ฐ€์ ธ์˜ด
var partial = my_data * coefficient    # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๊ณ„์‚ฐ
var total = sum(partial)               # ํ•˜๋“œ์›จ์–ด๊ฐ€ ํ•ฉ์‚ฐ์„ ์กฐ์œจ

์›Œํ”„๋ฅผ ๊ฐ•๋ ฅํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ํ•ต์‹ฌ ๊ฐœ๋…

1. ๋ ˆ์ธ ์‹๋ณ„: ๊ฐ ์Šค๋ ˆ๋“œ๋Š” ์‚ฌ์‹ค์ƒ ๋น„์šฉ ์—†์ด ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๋Š” โ€œ๋ ˆ์ธ IDโ€ (0~31)๋ฅผ ๊ฐ–์Šต๋‹ˆ๋‹ค

var my_lane = lane_id()  # ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์ฝ์„ ๋ฟ

2. ์•”์‹œ์  ๋™๊ธฐํ™”: ์›Œํ”„ ๋‚ด์—์„œ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค

# ๊ทธ๋ƒฅ ๋™์ž‘ - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ž๋™์œผ๋กœ ๋™๊ธฐํ™”
var total = sum(my_contribution)

3. ํšจ์œจ์ ์ธ ํ†ต์‹ : ๋ฉ”๋ชจ๋ฆฌ ์—†์ด๋„ ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ฐ์ดํ„ฐ ๊ณต์œ ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค

# ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์œผ๋กœ ์ „๋‹ฌ
var broadcasted = shuffle_idx(my_value, 0)

ํ•ต์‹ฌ ํ†ต์ฐฐ: SIMT๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ž์—ฐ์Šค๋Ÿฌ์šด ์Šค๋ ˆ๋“œ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๋ฉด์„œ๋„ ํšจ์œจ์ ์ธ ๋ฒกํ„ฐ ์—ฐ์‚ฐ์œผ๋กœ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์–ด, ์Šค๋ ˆ๋“œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํŽธ๋ฆฌํ•จ๊ณผ ๋ฒกํ„ฐ ์ฒ˜๋ฆฌ์˜ ์„ฑ๋Šฅ์„ ๋ชจ๋‘ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ์‹คํ–‰ ๊ณ„์ธต ๊ตฌ์กฐ์—์„œ ์›Œํ”„์˜ ์œ„์น˜

์›Œํ”„๊ฐ€ ์ „์ฒด GPU ์‹คํ–‰ ๋ชจ๋ธ๊ณผ ์–ด๋–ป๊ฒŒ ์—ฐ๊ฒฐ๋˜๋Š”์ง€ ์ž์„ธํžˆ ์•Œ์•„๋ณด๋ ค๋ฉด GPU ์Šค๋ ˆ๋”ฉ vs SIMD ๊ฐœ๋…์„ ์ฐธ๊ณ ํ•˜์„ธ์š”. ์›Œํ”„์˜ ์œ„์น˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

GPU ๋””๋ฐ”์ด์Šค
โ”œโ”€โ”€ ๊ทธ๋ฆฌ๋“œ (์ „์ฒด ๋ฌธ์ œ)
โ”‚   โ”œโ”€โ”€ ๋ธ”๋ก 1 (์Šค๋ ˆ๋“œ ๊ทธ๋ฃน, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
โ”‚   โ”‚   โ”œโ”€โ”€ ์›Œํ”„ 1 (32 ์Šค๋ ˆ๋“œ, ๋ก์Šคํ… ์‹คํ–‰) โ† ์ด ๋ ˆ๋ฒจ
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ ์Šค๋ ˆ๋“œ 1 โ†’ SIMD ์—ฐ์‚ฐ
โ”‚   โ”‚   โ”‚   โ”œโ”€โ”€ ์Šค๋ ˆ๋“œ 2 โ†’ SIMD ์—ฐ์‚ฐ
โ”‚   โ”‚   โ”‚   โ””โ”€โ”€ ... (์ด 32๊ฐœ ์Šค๋ ˆ๋“œ)
โ”‚   โ”‚   โ””โ”€โ”€ ์›Œํ”„ 2 (32 ์Šค๋ ˆ๋“œ)
โ”‚   โ””โ”€โ”€ ๋ธ”๋ก 2 (๋…๋ฆฝ์ ์ธ ๊ทธ๋ฃน)

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ โ€œ์›Œํ”„ ๋ ˆ๋ฒจโ€œ์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค - ๋‹จ์ผ ์›Œํ”„ ๋‚ด์˜ 32๊ฐœ ์Šค๋ ˆ๋“œ๋ฅผ ๋ชจ๋‘ ์กฐ์œจํ•˜๋Š” ์—ฐ์‚ฐ์„ ๋‹ค๋ฃจ๋ฉฐ, ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ์ด ํ•„์š”ํ•œ sum() ๊ฐ™์€ ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด ๋ฉ˜ํƒˆ ๋ชจ๋ธ์€ ๋ฌธ์ œ๊ฐ€ ์›Œํ”„ ์—ฐ์‚ฐ์— ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋งคํ•‘๋˜๋Š” ๊ฒฝ์šฐ์™€ ๊ธฐ์กด์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋ฉ๋‹ˆ๋‹ค.

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ฐ˜

Single Instruction, Multiple Thread(SIMT) ์‹คํ–‰์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์€ ํšจ๊ณผ์ ์ธ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๋‹จ์ˆœํ•œ ์†Œํ”„ํŠธ์›จ์–ด ์ถ”์ƒํ™”๊ฐ€ ์•„๋‹ˆ๋ผ, GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ์‹ค๋ฆฌ์ฝ˜ ์ˆ˜์ค€์—์„œ ์‹ค์ œ๋กœ ์ž‘๋™ํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

SIMT ์‹คํ–‰์ด๋ž€?

SIMT๋ž€ ์›Œํ”„ ๋‚ด์—์„œ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๊ฐ™์€ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. ์ด๋Š” ์™„์ „ํžˆ ๋‹ค๋ฅธ ๋ช…๋ น์„ ๋…๋ฆฝ์ ์œผ๋กœ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” CPU ์Šค๋ ˆ๋“œ์™€ ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

CPU vs GPU ์‹คํ–‰ ๋ชจ๋ธ

๊ด€์ CPU (MIMD)GPU ์›Œํ”„ (SIMT)
๋ช…๋ น ๋ชจ๋ธMultiple Instructions, Multiple DataSingle Instruction, Multiple Thread
Core 1add r1, r2add r1, r2
Core 2load r3, [mem]add r1, r2 (๋™์ผ ๋ช…๋ น)
Core 3branch loopadd r1, r2 (๋™์ผ ๋ช…๋ น)
โ€ฆ Core 32๋‹ค๋ฅธ ๋ช…๋ นadd r1, r2 (๋™์ผ ๋ช…๋ น)
์‹คํ–‰ ๋ฐฉ์‹๋…๋ฆฝ์ , ๋น„๋™๊ธฐ๋™๊ธฐํ™”, ๋ก์Šคํ…
์Šค์ผ€์ค„๋ง๋ณต์žก, OS ๊ด€๋ฆฌ๋‹จ์ˆœ, ํ•˜๋“œ์›จ์–ด ๊ด€๋ฆฌ
๋ฐ์ดํ„ฐ๋…๋ฆฝ์ ์ธ ๋ฐ์ดํ„ฐ ์„ธํŠธ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ, ๊ฐ™์€ ์—ฐ์‚ฐ

GPU ์›Œํ”„ ์‹คํ–‰ ํŒจํ„ด:

  • ๋ช…๋ น: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์ผ: add r1, r2
  • ๋ ˆ์ธ 0: Data0์— ์—ฐ์‚ฐ โ†’ Result0
  • ๋ ˆ์ธ 1: Data1์— ์—ฐ์‚ฐ โ†’ Result1
  • ๋ ˆ์ธ 2: Data2์— ์—ฐ์‚ฐ โ†’ Result2
  • โ€ฆ (๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ์‹คํ–‰)
  • ๋ ˆ์ธ 31: Data31์— ์—ฐ์‚ฐ โ†’ Result31

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๊ฐ™์€ ๋ช…๋ น์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค.

SIMT๊ฐ€ GPU์— ์ ํ•ฉํ•œ ์ด์œ 

GPU๋Š” ์ง€์—ฐ ์‹œ๊ฐ„์ด ์•„๋‹Œ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์ตœ์ ํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. SIMT๊ฐ€ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๊ฒƒ๋“ค:

  • ํ•˜๋“œ์›จ์–ด ๋‹จ์ˆœํ™”: ํ•˜๋‚˜์˜ ๋ช…๋ น ๋””์ฝ”๋”๊ฐ€ 32๊ฐœ ๋˜๋Š” 64๊ฐœ ์Šค๋ ˆ๋“œ๋ฅผ ์ฒ˜๋ฆฌ
  • ์‹คํ–‰ ํšจ์œจ์„ฑ: ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ณต์žกํ•œ ์Šค์ผ€์ค„๋ง ๋ถˆํ•„์š”
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์ „๋ ฅ ํšจ์œจ์„ฑ: ๋ ˆ์ธ ์ „์ฒด์— ๊ฑธ์ณ ์ œ์–ด ๋กœ์ง ๊ณต์œ 

์›Œํ”„ ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜

๋ ˆ์ธ ๋ฒˆํ˜ธ์™€ ์‹๋ณ„

์›Œํ”„ ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ๋Š” 0๋ถ€ํ„ฐ WARP_SIZE-1๊นŒ์ง€์˜ ๋ ˆ์ธ ID๋ฅผ ๊ฐ–์Šต๋‹ˆ๋‹ค:

from gpu import lane_id
from gpu.primitives.warp import WARP_SIZE

# ์ปค๋„ ํ•จ์ˆ˜ ๋‚ด์—์„œ:
my_lane = lane_id()  # 0-31 (NVIDIA/RDNA) ๋˜๋Š” 0-63 (CDNA) ๋ฐ˜ํ™˜

ํ•ต์‹ฌ ํ†ต์ฐฐ: lane_id()๋Š” ๋น„์šฉ์ด ์—†์Šต๋‹ˆ๋‹ค - ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์ฝ์„ ๋ฟ์ž…๋‹ˆ๋‹ค.

์›Œํ”„ ๋‚ด ๋™๊ธฐํ™”

SIMT์˜ ๊ฐ€์žฅ ๊ฐ•๋ ฅํ•œ ์ธก๋ฉด: ์•”์‹œ์  ๋™๊ธฐํ™”.

# thread_idx.x < WARP_SIZE์ธ ๊ฒฝ์šฐ์˜ ์˜ˆ์‹œ

# 1. ๊ธฐ์กด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹:
shared[thread_idx.x] = partial_result
barrier()  # ๋ช…์‹œ์  ๋™๊ธฐํ™” ํ•„์š”
var total = shared[0] + shared[1] + ... + shared[WARP_SIZE - 1] # 합산 리덕션

# 2. ์›Œํ”„ ๋ฐฉ์‹:
from gpu.primitives.warp import sum

var total = sum(partial_result)  # ์•”์‹œ์  ๋™๊ธฐํ™”!

์™œ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š” ์—†์„๊นŒ์š”? ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋ช…๋ น์„ ์ •ํ™•ํžˆ ๊ฐ™์€ ์‹œ์ ์— ์‹คํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. sum()์ด ์‹œ์ž‘๋  ๋•Œ, ๋ชจ๋“  ๋ ˆ์ธ์€ ์ด๋ฏธ partial_result ๊ณ„์‚ฐ์„ ๋งˆ์นœ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

์›Œํ”„ ๋ถ„๊ธฐ์™€ ์ˆ˜๋ ด

์กฐ๊ฑด ์ฝ”๋“œ์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ ๊นŒ?

if lane_id() % 2 == 0:
    # ์ง์ˆ˜ ๋ ˆ์ธ์ด ์ด ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰
    result = compute_even()
else:
    # ํ™€์ˆ˜ ๋ ˆ์ธ์ด ์ด ๊ฒฝ๋กœ๋ฅผ ์‹คํ–‰
    result = compute_odd()
# ๋ชจ๋“  ๋ ˆ์ธ์ด ์—ฌ๊ธฐ์„œ ์ˆ˜๋ ด

ํ•˜๋“œ์›จ์–ด ๋™์ž‘ ๋‹จ๊ณ„:

๋‹จ๊ณ„ํŽ˜์ด์ฆˆํ™œ์„ฑ ๋ ˆ์ธ๋Œ€๊ธฐ ๋ ˆ์ธํšจ์œจ์„ฑ๋Šฅ ๋น„์šฉ
1์กฐ๊ฑด ํ‰๊ฐ€32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€์—†์Œ100%์ •์ƒ ์†๋„
2์ง์ˆ˜ ๋ ˆ์ธ ๋ถ„๊ธฐ๋ ˆ์ธ 0,2,4โ€ฆ30 (16๊ฐœ)๋ ˆ์ธ 1,3,5โ€ฆ31 (16๊ฐœ)50%2๋ฐฐ ๋А๋ฆผ
3ํ™€์ˆ˜ ๋ ˆ์ธ ๋ถ„๊ธฐ๋ ˆ์ธ 1,3,5โ€ฆ31 (16๊ฐœ)๋ ˆ์ธ 0,2,4โ€ฆ30 (16๊ฐœ)50%2๋ฐฐ ๋А๋ฆผ
4์ˆ˜๋ ด32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€์—†์Œ100%์ •์ƒ ์†๋„ ๋ณต๊ท€

์˜ˆ์‹œ ๋ถ„์„:

  • 2๋‹จ๊ณ„: ์ง์ˆ˜ ๋ ˆ์ธ๋งŒ compute_even()์„ ์‹คํ–‰ํ•˜๊ณ  ํ™€์ˆ˜ ๋ ˆ์ธ์€ ๋Œ€๊ธฐ
  • 3๋‹จ๊ณ„: ํ™€์ˆ˜ ๋ ˆ์ธ๋งŒ compute_odd()๋ฅผ ์‹คํ–‰ํ•˜๊ณ  ์ง์ˆ˜ ๋ ˆ์ธ์€ ๋Œ€๊ธฐ
  • ์ด ์†Œ์š” ์‹œ๊ฐ„: time(compute_even) + time(compute_odd) (์ˆœ์ฐจ ์‹คํ–‰)
  • ๋ถ„๊ธฐ ์—†๋Š” ๊ฒฝ์šฐ: max(time(compute_even), time(compute_odd)) (๋ณ‘๋ ฌ ์‹คํ–‰)

์„ฑ๋Šฅ ์˜ํ–ฅ:

  1. ๋ถ„๊ธฐ: ์›Œํ”„๊ฐ€ ์‹คํ–‰์„ ๋ถ„๋ฆฌ - ์ผ๋ถ€ ๋ ˆ์ธ์€ ํ™œ์„ฑ, ๋‚˜๋จธ์ง€๋Š” ๋Œ€๊ธฐ
  2. ์ˆœ์ฐจ ์‹คํ–‰: ์„œ๋กœ ๋‹ค๋ฅธ ๊ฒฝ๋กœ๊ฐ€ ๋ณ‘๋ ฌ์ด ์•„๋‹Œ ์ˆœ์ฐจ์ ์œผ๋กœ ์‹คํ–‰
  3. ์ˆ˜๋ ด: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋‹ค์‹œ ํ•ฉ๋ฅ˜ํ•˜์—ฌ ํ•จ๊ป˜ ์ง„ํ–‰
  4. ๋น„์šฉ: ๋ถ„๊ธฐ๊ฐ€ ์žˆ๋Š” ์›Œํ”„๋Š” ํ†ตํ•ฉ ์‹คํ–‰ ๋Œ€๋น„ 2๋ฐฐ ์ด์ƒ์˜ ์‹œ๊ฐ„ ์†Œ์š”

์›Œํ”„ ํšจ์œจ์„ ์œ„ํ•œ ๋ชจ๋ฒ” ์‚ฌ๋ก€

์›Œํ”„ ํšจ์œจ ํŒจํ„ด

โœ… ์šฐ์ˆ˜: ๊ท ์ผ ์‹คํ–‰ (100% ํšจ์œจ)

# ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ž‘์—… ์ˆ˜ํ–‰ - ๋ถ„๊ธฐ ์—†์Œ
var partial = a[global_i] * b[global_i]
var total = sum(partial)

์„ฑ๋Šฅ: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ ํ™œ์„ฑ

โš ๏ธ ํ—ˆ์šฉ: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๋ถ„๊ธฐ (~95% ํšจ์œจ)

# lane_id() ๊ธฐ๋ฐ˜ ๋ถ„๊ธฐ - ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋จ
if lane_id() == 0:
    output[block_idx] = sum(partial)

์„ฑ๋Šฅ: ๋‹จ์ผ ๋ ˆ์ธ์˜ ์งง์€ ์—ฐ์‚ฐ, ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ํŒจํ„ด

๐Ÿ”ถ ์ฃผ์˜: ๊ตฌ์กฐํ™”๋œ ๋ถ„๊ธฐ (~50-75% ํšจ์œจ)

# ๊ทœ์น™์ ์ธ ํŒจํ„ด์€ ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ตœ์ ํ™” ๊ฐ€๋Šฅ
if (global_i / 4) % 2 == 0:
    result = method_a()
else:
    result = method_b()

์„ฑ๋Šฅ: ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๊ทธ๋ฃน, ์ผ๋ถ€ ์ตœ์ ํ™” ๊ฐ€๋Šฅ

โŒ ํšŒํ”ผ: ๋ฐ์ดํ„ฐ ์˜์กด์  ๋ถ„๊ธฐ (~25-50% ํšจ์œจ)

# ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ๋ ˆ์ธ๋งˆ๋‹ค ๋‹ค๋ฅธ ๊ฒฝ๋กœ๋ฅผ ํƒˆ ์ˆ˜ ์žˆ์Œ
if input[global_i] > threshold:  # ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๋ถ„๊ธฐ
    result = expensive_computation()
else:
    result = simple_computation()

์„ฑ๋Šฅ: ๋ฌด์ž‘์œ„ ๋ถ„๊ธฐ๊ฐ€ ์›Œํ”„ ํšจ์œจ์„ ๋–จ์–ด๋œจ๋ฆผ

๐Ÿ’€ ์ตœ์•…: ์ค‘์ฒฉ๋œ ๋ฐ์ดํ„ฐ ์˜์กด์  ๋ถ„๊ธฐ (~10-25% ํšจ์œจ)

# ์˜ˆ์ธก ๋ถˆ๊ฐ€๋Šฅํ•œ ๋ถ„๊ธฐ์˜ ๋‹ค๋‹จ๊ณ„ ์ค‘์ฒฉ
if input[global_i] > threshold1:
    if input[global_i] > threshold2:
        result = very_expensive()
    else:
        result = expensive()
else:
    result = simple()

์„ฑ๋Šฅ: ์›Œํ”„ ํšจ์œจ์ด ์‚ฌ์‹ค์ƒ ๋ฌด๋„ˆ์ง

ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ํ˜ธํ™˜์„ฑ

NVIDIA vs AMD ์›Œํ”„ ํฌ๊ธฐ

from gpu.primitives.warp import WARP_SIZE

# NVIDIA GPUs:     WARP_SIZE = 32
# AMD RDNA GPUs:   WARP_SIZE = 32 (wavefront32 ๋ชจ๋“œ)
# AMD CDNA GPUs:   WARP_SIZE = 64 (์ „ํ†ต์ ์ธ wavefront64)

์™œ ์ค‘์š”ํ• ๊นŒ์š”:

  • ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ๋ณ‘ํ•ฉ๋œ ์ ‘๊ทผ์ด ์›Œํ”„ ํฌ๊ธฐ์— ์˜์กด
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„: ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ๊ฐ€ ์›Œํ”„ ํฌ๊ธฐ๋ฅผ ๊ณ ๋ คํ•ด์•ผ ํ•จ
  • ์„ฑ๋Šฅ ํ™•์žฅ: AMD์—์„œ ์›Œํ”„๋‹น ๋ ˆ์ธ์ด 2๋ฐฐ

์ด์‹ ๊ฐ€๋Šฅํ•œ ์›Œํ”„ ์ฝ”๋“œ ์ž‘์„ฑ

์•„ํ‚คํ…์ฒ˜ ์ ์‘ ์ „๋žต

โœ… ์ด์‹ ๊ฐ€๋Šฅ: ํ•ญ์ƒ WARP_SIZE ์‚ฌ์šฉ

comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)  # ์ž๋™์œผ๋กœ ์ ์‘
comptime ELEMENTS_PER_WARP = WARP_SIZE       # ํ•˜๋“œ์›จ์–ด์— ๋งž๊ฒŒ ํ™•์žฅ

๊ฒฐ๊ณผ: NVIDIA/AMD (32)์™€ AMD (64) ๋ชจ๋‘์—์„œ ์ตœ์ ์œผ๋กœ ๋™์ž‘

โŒ ์ž˜๋ชป๋œ ๋ฐฉ์‹: ์›Œํ”„ ํฌ๊ธฐ๋ฅผ ํ•˜๋“œ์ฝ”๋”ฉํ•˜์ง€ ๋งˆ์„ธ์š”

comptime THREADS_PER_BLOCK = (32, 1)  # AMD GPU์—์„œ ๋™์ž‘ ์•ˆ ํ•จ!
comptime REDUCTION_SIZE = 32          # AMD์—์„œ ์ž˜๋ชป๋œ ๊ฐ’!

๊ฒฐ๊ณผ: AMD์—์„œ ์„ฑ๋Šฅ ์ €ํ•˜, ์ •ํ™•์„ฑ ๋ฌธ์ œ ๊ฐ€๋Šฅ

์‹ค์ œ ํ•˜๋“œ์›จ์–ด ์˜ํ–ฅ

GPU ์•„ํ‚คํ…์ฒ˜WARP_SIZE์›Œํ”„๋‹น ๋ฉ”๋ชจ๋ฆฌ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„๋ ˆ์ธ ํŒจํ„ด
NVIDIA/AMD RDNA32128 bytes (4ร—32)5๋‹จ๊ณ„: 32โ†’16โ†’8โ†’4โ†’2โ†’1๋ ˆ์ธ 0-31
AMD CDNA64256 bytes (4ร—64)6๋‹จ๊ณ„: 64โ†’32โ†’16โ†’8โ†’4โ†’2โ†’1๋ ˆ์ธ 0-63

64 vs 32์˜ ์„ฑ๋Šฅ ์ฐจ์ด:

  • CDNA ์žฅ์ : ์›Œํ”„๋‹น 2๋ฐฐ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  • CDNA ์žฅ์ : ์›Œํ”„๋‹น 2๋ฐฐ์˜ ์—ฐ์‚ฐ๋Ÿ‰
  • NVIDIA/RDNA ์žฅ์ : ๋ธ”๋ก๋‹น ๋” ๋งŽ์€ ์›Œํ”„ (๋” ๋†’์€ ์ ์œ ์œจ)
  • ์ฝ”๋“œ ์ด์‹์„ฑ: ๊ฐ™์€ ์†Œ์Šค ์ฝ”๋“œ๋กœ ์–‘์ชฝ ๋ชจ๋‘ ์ตœ์  ์„ฑ๋Šฅ

์›Œํ”„์™€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

โœ… ์™„๋ฒฝ: ๋ณ‘ํ•ฉ๋œ ์ ‘๊ทผ (100% ๋Œ€์—ญํญ ํ™œ์šฉ)

# ์ธ์ ‘ ๋ ˆ์ธ โ†’ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ
var value = input[global_i]  # ๋ ˆ์ธ 0โ†’input[0], ๋ ˆ์ธ 1โ†’input[1], ๋“ฑ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

์ ‘๊ทผ ํŒจํ„ดNVIDIA/RDNA (32 ๋ ˆ์ธ)CDNA (64 ๋ ˆ์ธ)๋Œ€์—ญํญ ํ™œ์šฉ์„ฑ๋Šฅ
โœ… ๋ณ‘ํ•ฉ๋ ˆ์ธ N โ†’ ์ฃผ์†Œ 4ร—N๋ ˆ์ธ N โ†’ ์ฃผ์†Œ 4ร—N100%์ตœ์ 
1ํšŒ ํŠธ๋žœ์žญ์…˜: 128 bytes1ํšŒ ํŠธ๋žœ์žญ์…˜: 256 bytes์ „์ฒด ๋ฒ„์Šค ํญ๋น ๋ฆ„
โŒ ๋ถ„์‚ฐ๋ ˆ์ธ N โ†’ ์ž„์˜ ์ฃผ์†Œ๋ ˆ์ธ N โ†’ ์ž„์˜ ์ฃผ์†Œ~6%์ตœ์•…
32ํšŒ ๊ฐœ๋ณ„ ํŠธ๋žœ์žญ์…˜64ํšŒ ๊ฐœ๋ณ„ ํŠธ๋žœ์žญ์…˜๋Œ€๋ถ€๋ถ„ ์œ ํœด ๋ฒ„์Šค32๋ฐฐ ๋А๋ฆผ

์ฃผ์†Œ ์˜ˆ์‹œ:

  • ๋ณ‘ํ•ฉ: ๋ ˆ์ธ 0โ†’0, ๋ ˆ์ธ 1โ†’4, ๋ ˆ์ธ 2โ†’8, ๋ ˆ์ธ 3โ†’12, โ€ฆ
  • ๋ถ„์‚ฐ: ๋ ˆ์ธ 0โ†’1000, ๋ ˆ์ธ 1โ†’52, ๋ ˆ์ธ 2โ†’997, ๋ ˆ์ธ 3โ†’8, โ€ฆ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ

๋ฑ…ํฌ ์ถฉ๋Œ์ด๋ž€?

GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ๋™์‹œ ์ ‘๊ทผ์ด ๊ฐ€๋Šฅํ•œ 32๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋ฑ…ํฌ๋กœ ๋‚˜๋‰˜์–ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„ ๋‚ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ๋™์‹œ์— ์ ‘๊ทผํ•˜๋ ค ํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฒฝ์šฐ ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ ‘๊ทผ์„ ์ง๋ ฌํ™”ํ•ด์•ผ ํ•˜๋ฏ€๋กœ, ๋‹จ์ผ ์‚ฌ์ดํด์ด์–ด์•ผ ํ•  ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ ์‚ฌ์ดํด๋กœ ๋Š˜์–ด๋‚ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์ถฉ๋Œ ์—†์Œ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผ โ†’ ๋ชจ๋“  ์ ‘๊ทผ์ด ๋™์‹œ์— ๋ฐœ์ƒ (1 ์‚ฌ์ดํด)
  • ๋ฑ…ํฌ ์ถฉ๋Œ: ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผ โ†’ ์ ‘๊ทผ์ด ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฐœ์ƒ (N๊ฐœ ์Šค๋ ˆ๋“œ์— N ์‚ฌ์ดํด)
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ์ฃผ์†Œ์— ์ ‘๊ทผ โ†’ ํ•˜๋“œ์›จ์–ด๊ฐ€ 1 ์‚ฌ์ดํด๋กœ ์ตœ์ ํ™”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ๊ตฌ์„ฑ:

๋ฑ…ํฌ์ฃผ์†Œ (๋ฐ”์ดํŠธ ์˜คํ”„์…‹)์˜ˆ์‹œ ๋ฐ์ดํ„ฐ (float32)
๋ฑ…ํฌ 00, 128, 256, 384, โ€ฆshared[0], shared[32], shared[64], โ€ฆ
๋ฑ…ํฌ 14, 132, 260, 388, โ€ฆshared[1], shared[33], shared[65], โ€ฆ
๋ฑ…ํฌ 28, 136, 264, 392, โ€ฆshared[2], shared[34], shared[66], โ€ฆ
โ€ฆโ€ฆโ€ฆ
๋ฑ…ํฌ 31124, 252, 380, 508, โ€ฆshared[31], shared[63], shared[95], โ€ฆ

๋ฑ…ํฌ ์ถฉ๋Œ ์˜ˆ์‹œ:

์ ‘๊ทผ ํŒจํ„ด๋ฑ…ํฌ ์‚ฌ์šฉ์‚ฌ์ดํด์„ฑ๋Šฅ์„ค๋ช…
โœ… ์ˆœ์ฐจ์ shared[thread_idx.x]1 ์‚ฌ์ดํด100%๊ฐ ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ฑ…ํฌ ์ ‘๊ทผ
๋ ˆ์ธ 0โ†’๋ฑ…ํฌ 0, ๋ ˆ์ธ 1โ†’๋ฑ…ํฌ 1, โ€ฆ์ตœ์ ์ถฉ๋Œ ์—†์Œ
โœ… ๋™์ผ ์ธ๋ฑ์Šคshared[0]1 ์‚ฌ์ดํด100%๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ฃผ์†Œ์—์„œ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€โ†’๋ฑ…ํฌ 0 (๊ฐ™์€ ์ฃผ์†Œ)์ตœ์ ์ถฉ๋Œ ์—†์Œ
โŒ ์ŠคํŠธ๋ผ์ด๋“œ 2shared[thread_idx.x * 2]2 ์‚ฌ์ดํด50%๋ฑ…ํฌ๋‹น 2๊ฐœ ๋ ˆ์ธ
๋ ˆ์ธ 0,16โ†’๋ฑ…ํฌ 0; ๋ ˆ์ธ 1,17โ†’๋ฑ…ํฌ 12๋ฐฐ ๋А๋ฆผ์ง๋ ฌํ™”๋œ ์ ‘๊ทผ
๐Ÿ’€ ์ŠคํŠธ๋ผ์ด๋“œ 32shared[thread_idx.x * 32]32 ์‚ฌ์ดํด3%๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ๋ฑ…ํฌ ์ ‘๊ทผ
32๊ฐœ ๋ ˆ์ธ ์ „๋ถ€โ†’๋ฑ…ํฌ 0 (๋‹ค๋ฅธ ์ฃผ์†Œ)32๋ฐฐ ๋А๋ฆผ์™„์ „ํžˆ ์ง๋ ฌํ™”

์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ์‹ค์ „ ํ™œ์šฉ

์›Œํ”„ ์—ฐ์‚ฐ์ด ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ๊ฒฝ์šฐ

  1. ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ: sum(), max() ๋“ฑ
  2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ: shuffle_idx()๋กœ ๊ฐ’ ๊ณต์œ 
  3. ์ด์›ƒ ํ†ต์‹ : shuffle_down()์œผ๋กœ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ
  4. ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ: prefix_sum()์œผ๋กœ scan ์•Œ๊ณ ๋ฆฌ์ฆ˜

์„ฑ๋Šฅ ํŠน์„ฑ

์—ฐ์‚ฐ ์œ ํ˜•๊ธฐ์กด ๋ฐฉ์‹์›Œํ”„ ์—ฐ์‚ฐ
๋ฆฌ๋•์…˜ (32๊ฐœ ์š”์†Œ)~20๊ฐœ ๋ช…๋ น10๊ฐœ ๋ช…๋ น
๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ๋†’์Œ์ตœ์†Œ
๋™๊ธฐํ™” ๋น„์šฉ๋น„์šฉ ๋†’์Œ๋ฌด๋ฃŒ
์ฝ”๋“œ ๋ณต์žก๋„๋†’์Œ๋‚ฎ์Œ

๋‹ค์Œ ๋‹จ๊ณ„

SIMT์˜ ๊ธฐ๋ฐ˜์„ ์ดํ•ดํ–ˆ์œผ๋‹ˆ, ์ด ๊ฐœ๋…์ด ์–ด๋–ป๊ฒŒ ๊ฐ•๋ ฅํ•œ ์›Œํ”„ ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š”์ง€ ์•Œ์•„๋ณผ ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค. ๋‹ค์Œ ์„น์…˜์—์„œ๋Š” sum()์ด ๋ณต์žกํ•œ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•˜๊ณ  ํšจ์œจ์ ์ธ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

โ†’ ๋‹ค์Œ: warp.sum()์˜ ํ•ต์‹ฌ

warp.sum()์˜ ํ•ต์‹ฌ - ์›Œํ”„ ๋ ˆ๋ฒจ ๋‚ด์ 

Puzzle 12์—์„œ ์‚ดํŽด๋ณธ ๋‚ด์ ์„ Mojo์˜ ์›Œํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•œ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์›Œํ”„ ๋ ˆ์ธ์ด ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  warp.sum()์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ์ž๋™์œผ๋กœ ํ•ฉ์‚ฐํ•˜์—ฌ, ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด GPU ๋™๊ธฐํ™”๋ฅผ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: warp.sum() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ๋‹จ์ผ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ช…๋ น์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • warp.sum()์„ ํ™œ์šฉํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜
  • SIMT ์‹คํ–‰ ๋ชจ๋ธ๊ณผ ๋ ˆ์ธ ๋™๊ธฐํ™”
  • WARP_SIZE๋ฅผ ํ™œ์šฉํ•œ ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ํ˜ธํ™˜์„ฑ
  • ๋ณต์žกํ•œ ํŒจํ„ด์—์„œ ๊ฐ„๋‹จํ•œ ํŒจํ„ด์œผ๋กœ์˜ ์„ฑ๋Šฅ ๋ณ€ํ™˜
  • ๋ ˆ์ธ ID ๊ด€๋ฆฌ์™€ ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ

์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‚ด์ ์ž…๋‹ˆ๋‹ค: \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]

ํ•˜์ง€๋งŒ ๊ตฌํ˜„ ๊ณผ์ •์—์„œ Mojo์˜ ๋ชจ๋“  ์›Œํ”„ ๋ ˆ๋ฒจ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ์šฉ๋˜๋Š” ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D ํ–‰ ์šฐ์„ )

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ณต์žก์„ฑ (Puzzle 12์—์„œ)

solutions/p12/p12.mojo์˜ ๋ณต์žกํ•œ ๋ฐฉ์‹์„ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค:

comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime SIMD_WIDTH = simd_width_of[dtype]()
comptime in_layout = row_major[SIZE]()
comptime InLayoutType = type_of(in_layout)
comptime out_layout = row_major[1]()
comptime OutLayoutType = type_of(out_layout)


def traditional_dot_product_p12_style[
    size: Int
](
    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
):
    """
    This is the complex approach from p12_layout_tensor.mojo - kept for comparison.
    """
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[WARP_SIZE]())
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = (a[global_i] * b[global_i]).reduce_add()
    else:
        shared[local_i] = 0.0

    barrier()

    var stride = WARP_SIZE // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    if local_i == 0:
        output[global_i // WARP_SIZE] = shared[0]


์ด ๋ฐฉ์‹์ด ๋ณต์žกํ•œ ์ด์œ :

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ธ”๋ก ๋‚ด์—์„œ ์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด: ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•œ barrier() ํ˜ธ์ถœ
  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ณต์žกํ•œ ๋ฃจํ”„
  • ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก

๋™์ž‘์€ ํ•˜์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์žฅํ™ฉํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ GPU ๋™๊ธฐํ™”์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ์กด ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p24 --traditional
pixi run -e amd p24 --traditional
pixi run -e apple p24 --traditional
uv run poe p24 --traditional

์™„์„ฑํ•  ์ฝ”๋“œ

1. ๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„ ๋ฐฉ์‹

๋ณต์žกํ•œ ๊ธฐ์กด ๋ฐฉ์‹์„ warp_sum()์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

def simple_warp_dot_product[
    size: Int
](
    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL IN (6 lines at most)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p24/p24.mojo

ํŒ

1. ๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

simple_warp_dot_product ํ•จ์ˆ˜๋ฅผ 6์ค„ ์ด๋‚ด๋กœ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

def simple_warp_dot_product[...](output, a, b):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # ์—ฌ๊ธฐ๋ฅผ ์ฑ„์šฐ์„ธ์š” (์ตœ๋Œ€ 6์ค„)

๋”ฐ๋ผ์•ผ ํ•  ํŒจํ„ด:

  1. ์ด ์Šค๋ ˆ๋“œ์˜ ์š”์†Œ์— ๋Œ€ํ•œ ๋ถ€๋ถ„๊ณฑ ๊ณ„์‚ฐ
  2. warp_sum()์œผ๋กœ ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์˜ ๊ฐ’์„ ํ•ฉ์‚ฐ
  3. ๋ ˆ์ธ 0์ด ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก

2. ๋ถ€๋ถ„๊ณฑ ๊ณ„์‚ฐํ•˜๊ธฐ

var partial_product: Scalar[dtype] = 0
if global_i < size:
    partial_product = (a[global_i] * b[global_i]).reduce_add()

.reduce_add()๊ฐ€ ํ•„์š”ํ•œ ์ด์œ : Mojo์˜ ๊ฐ’์€ SIMD ๊ธฐ๋ฐ˜์ด๋ฏ€๋กœ a[global_i] * b[global_i]๋Š” SIMD ๋ฒกํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค. .reduce_add()๋กœ ๋ฒกํ„ฐ๋ฅผ ์Šค์นผ๋ผ ๊ฐ’์œผ๋กœ ํ•ฉ์‚ฐํ•ฉ๋‹ˆ๋‹ค.

๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ ํšจํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š์„ ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

3. ์›Œํ”„ ๋ฆฌ๋•์…˜์˜ ๋งˆ๋ฒ•

total = warp_sum(partial_product)

warp_sum()์ด ํ•˜๋Š” ์ผ:

  • ๊ฐ ๋ ˆ์ธ์˜ partial_product ๊ฐ’์„ ๊ฐ€์ ธ์˜ด
  • ์›Œํ”„ ๋‚ด ๋ชจ๋“  ๋ ˆ์ธ์˜ ๊ฐ’์„ ํ•ฉ์‚ฐ (ํ•˜๋“œ์›จ์–ด ๊ฐ€์†)
  • ๋ชจ๋“  ๋ ˆ์ธ์— ๊ฐ™์€ ํ•ฉ๊ณ„๋ฅผ ๋ฐ˜ํ™˜ (๋ ˆ์ธ 0๋งŒ์ด ์•„๋‹˜)
  • ๋ช…์‹œ์  ๋™๊ธฐํ™”๊ฐ€ ์ „ํ˜€ ํ•„์š” ์—†์Œ (SIMT๊ฐ€ ์ฒ˜๋ฆฌ)

4. ๊ฒฐ๊ณผ ๊ธฐ๋กํ•˜๊ธฐ

if lane_id() == 0:
    output[global_i // WARP_SIZE] = total

์™œ ๋ ˆ์ธ 0๋งŒ? warp_sum() ์ดํ›„ ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ total ๊ฐ’์„ ๊ฐ–์ง€๋งŒ, ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ•œ ๋ฒˆ๋งŒ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค.

์™œ output[0]์— ์ง์ ‘ ์“ฐ์ง€ ์•Š์„๊นŒ? ์œ ์—ฐ์„ฑ์„ ์œ„ํ•ด์„œ์ž…๋‹ˆ๋‹ค. ์ด ํ•จ์ˆ˜๋Š” ์›Œํ”„๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์ธ ๊ฒฝ์šฐ์—๋„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๊ฐ ์›Œํ”„์˜ ๊ฒฐ๊ณผ๊ฐ€ global_i // WARP_SIZE ์œ„์น˜์— ๊ธฐ๋ก๋ฉ๋‹ˆ๋‹ค.

lane_id(): 0-31 (NVIDIA) ๋˜๋Š” 0-63 (AMD)์„ ๋ฐ˜ํ™˜ - ์›Œํ”„ ๋‚ด์—์„œ ์–ด๋А ๋ ˆ์ธ์ธ์ง€ ์‹๋ณ„ํ•ฉ๋‹ˆ๋‹ค.

๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„ ํ…Œ์ŠคํŠธ:

uv run poe p24 --kernel
pixi run p24 --kernel

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
๐Ÿš€ Notice how simple the warp version is compared to p12.mojo!
   Same kernel structure, but warp_sum() replaces all the complexity!

ํ’€์ด

def simple_warp_dot_product[
    InLayoutT: TensorLayout, OutLayoutT: TensorLayout, size: Int
](
    output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
):
    var a_lt = a.to_layout_tensor()
    var b_lt = b.to_layout_tensor()
    var out_lt = output.to_layout_tensor()
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = rebind[Scalar[dtype]](a_lt[global_i]) * rebind[
            Scalar[dtype]
        ](b_lt[global_i])

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    var total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        out_lt.store[1](Index(global_i // WARP_SIZE), total)


๊ฐ„๋‹จํ•œ ์›Œํ”„ ์ปค๋„์€ ๋ณต์žกํ•œ ๋™๊ธฐํ™”์—์„œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋กœ์˜ ๊ทผ๋ณธ์ ์ธ ๋ณ€ํ™˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ธฐ์กด ๋ฐฉ์‹์—์„œ ์‚ฌ๋ผ์ง„ ๊ฒƒ๋“ค:

  • 15์ค„ ์ด์ƒ โ†’ 6์ค„: ํš๊ธฐ์ ์ธ ์ฝ”๋“œ ์ถ•์†Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ๋ถˆํ•„์š”
  • 3ํšŒ ์ด์ƒ์˜ barrier() ํ˜ธ์ถœ: ๋ช…์‹œ์  ๋™๊ธฐํ™” ์ œ๋กœ
  • ๋ณต์žกํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ: ์™„์ „ํžˆ ์ œ๊ฑฐ

SIMT ์‹คํ–‰ ๋ชจ๋ธ:

์›Œํ”„ ๋ ˆ์ธ (SIMT ์‹คํ–‰):
๋ ˆ์ธ 0: partial_product = a[0] * b[0]    = 0.0
๋ ˆ์ธ 1: partial_product = a[1] * b[1]    = 4.0
๋ ˆ์ธ 2: partial_product = a[2] * b[2]    = 16.0
...
๋ ˆ์ธ 31: partial_product = a[31] * b[31] = 3844.0

warp_sum() ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ:
๋ชจ๋“  ๋ ˆ์ธ โ†’ 0.0 + 4.0 + 16.0 + ... + 3844.0 = 10416.0
๋ชจ๋“  ๋ ˆ์ธ์ด ์ˆ˜์‹  โ†’ total = 10416.0 (๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฒฐ๊ณผ)

๋ฐฐ๋ฆฌ์–ด ์—†์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. SIMT ์‹คํ–‰: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋ช…๋ น ๋™์‹œ ์‹คํ–‰
  2. ํ•˜๋“œ์›จ์–ด ๋™๊ธฐํ™”: warp_sum()์ด ์‹œ์ž‘๋  ๋•Œ ๋ชจ๋“  ๋ ˆ์ธ์ด ์ด๋ฏธ partial_product ๊ณ„์‚ฐ ์™„๋ฃŒ
  3. ๋‚ด์žฅ ํ†ต์‹ : GPU ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ ์ฒ˜๋ฆฌ
  4. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ total ๊ฐ’ ์ˆ˜์‹ 

2. ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹

์ด๋ฒˆ์—๋Š” Mojo์˜ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ™์€ ์›Œํ”„ ๋‚ด์ ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

def functional_warp_dot_product[
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
](
    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutType, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutType, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    def compute_dot_product[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var idx = indices[0]
        print("idx:", idx)
        # FILL IN (10 lines at most)

    # Launch exactly size == WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](size, ctx)


ํŒ

1. ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์˜ ๊ตฌ์กฐ ์ดํ•ดํ•˜๊ธฐ

compute_dot_product ํ•จ์ˆ˜๋ฅผ 10์ค„ ์ด๋‚ด๋กœ ์™„์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

@parameter
@always_inline
def compute_dot_product[simd_width: Int, rank: Int](indices: IndexList[rank]) capturing -> None:
    idx = indices[0]
    # ์—ฌ๊ธฐ๋ฅผ ์ฑ„์šฐ์„ธ์š” (์ตœ๋Œ€ 10์ค„)

ํ•จ์ˆ˜ํ˜• ํŒจํ„ด์˜ ์ฐจ์ด์ :

  • elementwise๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •ํ™•ํžˆ WARP_SIZE๊ฐœ์˜ ์Šค๋ ˆ๋“œ ์‹คํ–‰
  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ idx๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋‚˜์˜ ์š”์†Œ ์ฒ˜๋ฆฌ
  • ๊ฐ™์€ ์›Œํ”„ ์—ฐ์‚ฐ, ๋‹ค๋ฅธ ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜

2. ๋ถ€๋ถ„๊ณฑ ๊ณ„์‚ฐํ•˜๊ธฐ

var partial_product: Scalar[dtype] = 0.0
if idx < size:
    a_val = a.load[1](idx, 0)
    b_val = b.load[1](idx, 0)
    partial_product = (a_val * b_val).reduce_add()
else:
    partial_product = 0.0

๋กœ๋”ฉ ํŒจํ„ด: a.load[1](idx, 0)์€ ์œ„์น˜ idx์—์„œ ์ •ํ™•ํžˆ 1๊ฐœ ์š”์†Œ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค (SIMD ๋ฒกํ„ฐํ™” ์—†์Œ).

๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์Šค๋ ˆ๋“œ์˜ partial_product๋ฅผ 0.0์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ํ•ฉ์‚ฐ์— ๊ธฐ์—ฌํ•˜์ง€ ์•Š๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

3. ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์ €์žฅ

total = warp_sum(partial_product)

if lane_id() == 0:
    output.store[1](Index(idx // WARP_SIZE), total)

์ €์žฅ ํŒจํ„ด: output.store[1](Index(idx // WARP_SIZE), 0, total)์€ ์ถœ๋ ฅ ํ…์„œ์˜ ์œ„์น˜ (idx // WARP_SIZE, 0)์— 1๊ฐœ ์š”์†Œ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค.

๋™์ผํ•œ ์›Œํ”„ ๋กœ์ง: warp_sum()๊ณผ ๋ ˆ์ธ 0์˜ ๊ธฐ๋ก ๋กœ์ง์€ ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์—์„œ๋„ ๋™์ผํ•˜๊ฒŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.

4. import์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•จ์ˆ˜๋“ค

from gpu import lane_id
from gpu.primitives.warp import sum as warp_sum, WARP_SIZE

# ํ•จ์ˆ˜ ๋‚ด์—์„œ:
my_lane = lane_id()           # 0 ~ WARP_SIZE-1
total = warp_sum(my_value)    # ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ฆฌ๋•์…˜
warp_size = WARP_SIZE         # 32 (NVIDIA) ๋˜๋Š” 64 (AMD)

ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

uv run poe p24 --functional
pixi run p24 --functional

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
=== RESULT ===
out: 10416.0
expected: 10416.0
๐Ÿ”ง Functional approach shows modern Mojo style with warp operations!
   Clean, composable, and still leverages warp hardware primitives!

ํ’€์ด

def functional_warp_dot_product[
    InLayoutT: TensorLayout,
    OutLayoutT: TensorLayout,
    dtype: DType,
    simd_width: Int,
    rank: Int,
    size: Int,
](
    output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
    ctx: DeviceContext,
) raises:
    @parameter
    @always_inline
    def compute_dot_product[
        simd_width: Int, rank: Int, alignment: Int = align_of[dtype]()
    ](indices: IndexList[rank]) capturing -> None:
        var idx = indices[0]
        # Convert inside GPU kernel to avoid host-captured LayoutTensor issues
        var a_lt = a.to_layout_tensor()
        var b_lt = b.to_layout_tensor()
        var out_lt = output.to_layout_tensor()

        # Each thread computes one partial product
        var partial_product: Scalar[dtype] = 0.0
        if idx < size:
            var a_val = a_lt.load[1](Index(idx))
            var b_val = b_lt.load[1](Index(idx))
            partial_product = rebind[Scalar[dtype]](a_val) * rebind[
                Scalar[dtype]
            ](b_val)
        else:
            partial_product = 0.0

        # Warp magic - combines all WARP_SIZE partial products!
        var total = warp_sum(partial_product)

        # Only lane 0 writes the result (all lanes have the same total)
        if lane_id() == 0:
            out_lt.store[1](Index(idx // WARP_SIZE), total)

    # Launch exactly size == WARP_SIZE threads (one warp) to process all elements
    elementwise[compute_dot_product, 1, target="gpu"](size, ctx)


ํ•จ์ˆ˜ํ˜• ์›Œํ”„ ๋ฐฉ์‹์€ ์›Œํ”„ ์—ฐ์‚ฐ์„ ํ™œ์šฉํ•œ ํ˜„๋Œ€์ ์ธ Mojo ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์˜ ํŠน์ง•:

elementwise[compute_dot_product, 1, target="gpu"](size, ctx)

์žฅ์ :

  • ํƒ€์ž… ์•ˆ์ „์„ฑ: ์ปดํŒŒ์ผ ํƒ€์ž„ ํ…์„œ ๋ ˆ์ด์•„์›ƒ ๊ฒ€์‚ฌ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ: ๋‹ค๋ฅธ ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ํ†ตํ•ฉ
  • ํ˜„๋Œ€์  ํŒจํ„ด: Mojo์˜ ํ•จ์ˆ˜ํ˜• ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋Šฅ ํ™œ์šฉ
  • ์ž๋™ ์ตœ์ ํ™”: ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ๊ณ ์ˆ˜์ค€ ์ตœ์ ํ™”๋ฅผ ์ ์šฉ ๊ฐ€๋Šฅ

์ปค๋„ ๋ฐฉ์‹๊ณผ์˜ ์ฃผ์š” ์ฐจ์ด:

  • ์‹คํ–‰ ๋ฉ”์ปค๋‹ˆ์ฆ˜: enqueue_function ๋Œ€์‹  elementwise ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: .load[1]()๊ณผ .store[1]() ํŒจํ„ด ์‚ฌ์šฉ
  • ํ†ตํ•ฉ์„ฑ: ๋‹ค๋ฅธ ํ•จ์ˆ˜ํ˜• ์—ฐ์‚ฐ๊ณผ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๊ฒฐํ•ฉ

๋™์ผํ•œ ์›Œํ”„์˜ ์ด์ :

  • ๋™๊ธฐํ™” ์ œ๋กœ: warp_sum()์ด ๋™์ผํ•˜๊ฒŒ ๋™์ž‘
  • ํ•˜๋“œ์›จ์–ด ๊ฐ€์†: ์ปค๋„ ๋ฐฉ์‹๊ณผ ๊ฐ™์€ ์„ฑ๋Šฅ
  • ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜: WARP_SIZE๊ฐ€ ์ž๋™์œผ๋กœ ์ ์‘

๋ฒค์น˜๋งˆํฌ๋ฅผ ํ†ตํ•œ ์„ฑ๋Šฅ ๋น„๊ต

์ข…ํ•ฉ ๋ฒค์น˜๋งˆํฌ๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์›Œํ”„ ์—ฐ์‚ฐ์˜ ํ™•์žฅ์„ฑ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

uv run poe p24 --benchmark
pixi run p24 --benchmark

์ „์ฒด ๋ฒค์น˜๋งˆํฌ ์‹คํ–‰ ๊ฒฐ๊ณผ์˜ ์˜ˆ์‹œ์ž…๋‹ˆ๋‹ค:

SIZE: 32
WARP_SIZE: 32
SIMD_WIDTH: 8
--------------------------------------------------------------------------------
Testing SIZE=1 x WARP_SIZE, BLOCKS=1
Running traditional_1x
Running simple_warp_1x
Running functional_warp_1x
--------------------------------------------------------------------------------
Testing SIZE=4 x WARP_SIZE, BLOCKS=4
Running traditional_4x
Running simple_warp_4x
Running functional_warp_4x
--------------------------------------------------------------------------------
Testing SIZE=32 x WARP_SIZE, BLOCKS=32
Running traditional_32x
Running simple_warp_32x
Running functional_warp_32x
--------------------------------------------------------------------------------
Testing SIZE=256 x WARP_SIZE, BLOCKS=256
Running traditional_256x
Running simple_warp_256x
Running functional_warp_256x
--------------------------------------------------------------------------------
Testing SIZE=2048 x WARP_SIZE, BLOCKS=2048
Running traditional_2048x
Running simple_warp_2048x
Running functional_warp_2048x
--------------------------------------------------------------------------------
Testing SIZE=16384 x WARP_SIZE, BLOCKS=16384 (Large Scale)
Running traditional_16384x
Running simple_warp_16384x
Running functional_warp_16384x
--------------------------------------------------------------------------------
Testing SIZE=65536 x WARP_SIZE, BLOCKS=65536 (Massive Scale)
Running traditional_65536x
Running simple_warp_65536x
Running functional_warp_65536x
| name                   | met (ms)              | iters |
| ---------------------- | --------------------- | ----- |
| traditional_1x         | 0.00460128            | 100   |
| simple_warp_1x         | 0.00574047            | 100   |
| functional_warp_1x     | 0.00484192            | 100   |
| traditional_4x         | 0.00492671            | 100   |
| simple_warp_4x         | 0.00485247            | 100   |
| functional_warp_4x     | 0.00587679            | 100   |
| traditional_32x        | 0.0062406399999999996 | 100   |
| simple_warp_32x        | 0.0054918400000000004 | 100   |
| functional_warp_32x    | 0.00552447            | 100   |
| traditional_256x       | 0.0050614300000000004 | 100   |
| simple_warp_256x       | 0.00488768            | 100   |
| functional_warp_256x   | 0.00461472            | 100   |
| traditional_2048x      | 0.01120031            | 100   |
| simple_warp_2048x      | 0.00884383            | 100   |
| functional_warp_2048x  | 0.007038720000000001  | 100   |
| traditional_16384x     | 0.038533750000000005  | 100   |
| simple_warp_16384x     | 0.0323264             | 100   |
| functional_warp_16384x | 0.01674271            | 100   |
| traditional_65536x     | 0.19784991999999998   | 100   |
| simple_warp_65536x     | 0.12870176            | 100   |
| functional_warp_65536x | 0.048680310000000004  | 100   |

Benchmarks completed!

WARP OPERATIONS PERFORMANCE ANALYSIS:
   GPU Architecture: NVIDIA (WARP_SIZE=32) vs AMD (WARP_SIZE=64)
   - 1,...,256 x WARP_SIZE: Grid size too small to benchmark
   - 2048 x WARP_SIZE: Warp primitive benefits emerge
   - 16384 x WARP_SIZE: Large scale (512K-1M elements)
   - 65536 x WARP_SIZE: Massive scale (2M-4M elements)

   Expected Results at Large Scales:
   โ€ข Traditional: Slower due to more barrier overhead
   โ€ข Warp operations: Faster, scale better with problem size
   โ€ข Memory bandwidth becomes the limiting factor

์ด ์˜ˆ์‹œ์—์„œ ์–ป์„ ์ˆ˜ ์žˆ๋Š” ์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ:

  • ์†Œ๊ทœ๋ชจ (1x-4x): ์›Œํ”„ ์—ฐ์‚ฐ์ด ์†Œํญ์˜ ๊ฐœ์„ ์„ ๋ณด์ž„ (~10-15% ๋น ๋ฆ„)
  • ์ค‘๊ทœ๋ชจ (32x-256x): ํ•จ์ˆ˜ํ˜• ๋ฐฉ์‹์ด ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  • ๋Œ€๊ทœ๋ชจ (16K-65K): ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ง€๋ฐฐ์ ์ด ๋˜๋ฉด์„œ ๋ชจ๋“  ๋ฐฉ์‹์˜ ์„ฑ๋Šฅ์ด ์ˆ˜๋ ด
  • ๋ณ€๋™์„ฑ: ์„ฑ๋Šฅ์€ ํŠน์ • GPU ์•„ํ‚คํ…์ฒ˜์™€ ๋ฉ”๋ชจ๋ฆฌ ์„œ๋ธŒ์‹œ์Šคํ…œ์— ํฌ๊ฒŒ ์˜์กด

์ฐธ๊ณ : ํ•˜๋“œ์›จ์–ด(GPU ๋ชจ๋ธ, ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ, WARP_SIZE)์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๊ฐ€ ํฌ๊ฒŒ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ํ•ต์‹ฌ์€ ์ ˆ๋Œ€์ ์ธ ์ˆ˜์น˜๋ณด๋‹ค ์ƒ๋Œ€์ ์ธ ์„ฑ๋Šฅ ์ถ”์„ธ๋ฅผ ๊ด€์ฐฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

warp.sum ์—ฐ์‚ฐ์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ: ์›Œํ”„ vs ๊ธฐ์กด ๋ฐฉ์‹์— ๋Œ€ํ•œ ์ „๋žต์  ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ
  • ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ: ๋ณต์žกํ•œ ํ†ต์‹  ํŒจํ„ด์„ ์œ„ํ•œ shuffle_idx(), shuffle_down(), prefix_sum()
  • ๋ฉ€ํ‹ฐ ์›Œํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋™๊ธฐํ™”์˜ ๊ฒฐํ•ฉ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ์ตœ์ ํ™”: ์ตœ๋Œ€ ๋Œ€์—ญํญ์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ์›Œํ”„ ์—ฐ์‚ฐ์€ ๋ณต์žกํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋Œ€์ฒดํ•˜์—ฌ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์‹คํ–‰ ๋ชจ๋ธ์„ ์ดํ•ดํ•˜๋ฉด ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š๊ณ ๋„ ํš๊ธฐ์ ์ธ ๋‹จ์ˆœํ™”๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

์–ธ์ œ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์‚ฌ์šฉํ• ๊นŒ

๋น ๋ฅธ ํŒ๋‹จ ๊ฐ€์ด๋“œ

โœ… ์›Œํ”„ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • 32๊ฐœ ์ด์ƒ์˜ ์š”์†Œ์— ๋Œ€ํ•œ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ (sum, max, min)
  • ๊ทœ์น™์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด (์ธ์ ‘ ๋ ˆ์ธ โ†’ ์ธ์ ‘ ์ฃผ์†Œ)
  • ํฌ๋กœ์Šค ์•„ํ‚คํ…์ฒ˜ ์ด์‹์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ (NVIDIA/RDNA 32 vs CDNA 64 ์Šค๋ ˆ๋“œ)
  • ๋” ๊ฐ„๋‹จํ•˜๊ณ  ์œ ์ง€๋ณด์ˆ˜ํ•˜๊ธฐ ์‰ฌ์šด ์ฝ”๋“œ๋ฅผ ์›ํ•  ๋•Œ

โŒ ๊ธฐ์กด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•  ๋•Œ:

  • ๋ณต์žกํ•œ ์›Œํ”„ ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™ํ•˜๊ฑฐ๋‚˜ ์‚ฐ๋ฐœ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ๋ณ„ ์ž‘์—…๋Ÿ‰์ด ๋‹ค๋ฅธ ๊ฒฝ์šฐ (์›Œํ”„ ๋ถ„๊ธฐ ๋ฐœ์ƒ)
  • ๋ฌธ์ œ ํฌ๊ธฐ๊ฐ€ size < WARP_SIZE์ธ ๊ฒฝ์šฐ

์„ฑ๋Šฅ ํŠน์„ฑ

๋ฌธ์ œ ํฌ๊ธฐ๋ณ„ ํ™•์žฅ์„ฑ

์š”์†Œ ์ˆ˜์›Œํ”„ ์ด์ ๋น„๊ณ 
< 32์—†์Œ๊ธฐ์กด ๋ฐฉ์‹์ด ์œ ๋ฆฌ
32-1K1.2-1.5๋ฐฐ์ด์ ์ด ๋‚˜ํƒ€๋‚˜๊ธฐ ์‹œ์ž‘
1K-32K1.5-2.5๋ฐฐ์›Œํ”„ ์—ฐ์‚ฐ์ด ํƒ์›”
> 32K๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์–‘์ชฝ ๋ชจ๋‘ ๋Œ€์—ญํญ์— ์˜ํ•ด ์ œํ•œ

์›Œํ”„์˜ ํ•ต์‹ฌ ์ด์ 

  • ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๋ฐฐ๋ฆฌ์–ด ๋น„์šฉ ์ œ๊ฑฐ
  • ์ตœ์†Œํ•œ์˜ ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  • ์šฐ์ˆ˜ํ•œ ํ™•์žฅ์„ฑ: ์›Œํ”„ ์ˆ˜๊ฐ€ ๋Š˜์–ด๋‚ ์ˆ˜๋ก ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • ๊ฐ„๊ฒฐํ•œ ์ฝ”๋“œ: ๋” ์ ์€ ์ค„ ์ˆ˜, ๋” ์ ์€ ์˜ค๋ฅ˜ ๊ฐ€๋Šฅ์„ฑ

์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณ„ ๊ฐ€์ด๋“œ

์•Œ๊ณ ๋ฆฌ์ฆ˜๊ถŒ์žฅ ์‚ฌํ•ญ์ด์œ 
๋‚ด์ ์›Œํ”„ ์—ฐ์‚ฐ (1K+ ์š”์†Œ)๋‹จ์ผ ๋ฆฌ๋•์…˜, ๊ทœ์น™์  ์ ‘๊ทผ
ํ–‰๋ ฌ ํ–‰/์—ด ํ•ฉ๊ณ„์›Œํ”„ ์—ฐ์‚ฐ์ž์—ฐ์Šค๋Ÿฌ์šด ๋ฆฌ๋•์…˜ ํŒจํ„ด
๋ˆ„์  ํ•ฉํ•ญ์ƒ prefix_sum() ์‚ฌ์šฉํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๊ธฐ๋ณธ ์š”์†Œ
ํ’€๋ง (max/min)์›Œํ”„ ์—ฐ์‚ฐ (๊ทœ์น™์  ์œˆ๋„์šฐ)ํšจ์œจ์ ์ธ ์œˆ๋„์šฐ ๋ฆฌ๋•์…˜
๊ตฌ๊ฐ„์ด ๋งŽ์€ ํžˆ์Šคํ† ๊ทธ๋žจ๊ธฐ์กด ๋ฐฉ์‹๋ถˆ๊ทœ์น™ํ•œ ์“ฐ๊ธฐ, ์›์ž์  ์—…๋ฐ์ดํŠธ

์ฝ”๋“œ ์˜ˆ์‹œ

โœ… ์›Œํ”„์— ์ ํ•ฉํ•œ ๊ฒฝ์šฐ

# ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ
from gpu.primitives.warp import sum, max
var total = sum(partial_values)
var maximum = max(partial_values)

# ํ†ต์‹  ํŒจํ„ด
from gpu.primitives.warp import shuffle_idx, prefix_sum
var broadcast = shuffle_idx(my_value, 0)
var running_sum = prefix_sum(my_value)

โŒ ๊ธฐ์กด ๋ฐฉ์‹์ด ๋‚˜์€ ๊ฒฝ์šฐ

# ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๋™๊ธฐํ™”
stage1_compute()
barrier()  # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
stage2_depends_on_stage1()

# ๋ถˆ๊ทœ์น™ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
var value = input[random_indices[global_i]]  # ์‚ฐ๋ฐœ์  ์ฝ๊ธฐ

# ๋ฐ์ดํ„ฐ ์˜์กด์  ์ž‘์—…
if input[global_i] > threshold:
    result = expensive_computation()  # ์›Œํ”„ ๋ถ„๊ธฐ ๋ฐœ์ƒ

์„ฑ๋Šฅ ์ธก์ •

# ํ•ญ์ƒ ์–‘์ชฝ ๋ฐฉ์‹์„ ๋ฒค์น˜๋งˆํฌํ•˜์„ธ์š”
mojo p22.mojo --benchmark

# ํ™•์žฅ ํŒจํ„ด์„ ํ™•์ธํ•˜์„ธ์š”:
# traditional_1x:  X.XX ms
# warp_1x:         Y.YY ms  # ๋” ๋นจ๋ผ์•ผ ํ•จ
# warp_32x:        Z.ZZ ms  # ์ด์ ์ด ์ปค์ ธ์•ผ ํ•จ

์š”์•ฝ

์›Œํ”„ ์—ฐ์‚ฐ์œผ๋กœ ์‹œ์ž‘ํ•˜์„ธ์š”:

  • ๊ทœ์น™์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ฐ€์ง„ ๋ฆฌ๋•์…˜
  • ๋ฌธ์ œ ํฌ๊ธฐ๊ฐ€ 1 ์›Œํ”„(WARP_SIZE) ์ด์ƒ์ธ ๊ฒฝ์šฐ
  • ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ์ด ํ•„์š”ํ•œ ๊ฒฝ์šฐ

๊ธฐ์กด ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์„ธ์š”:

  • ๋ณต์žกํ•œ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ
  • ๋ถˆ๊ทœ์น™ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด
  • ์ž‘์€ ๋ฌธ์ œ ๋˜๋Š” ์‹ฌํ•œ ๋ถ„๊ธฐ

ํŒ๋‹จ์ด ์–ด๋ ค์šธ ๋•Œ: ์–‘์ชฝ ๋ชจ๋‘ ๊ตฌํ˜„ํ•˜๊ณ  ๋ฒค์น˜๋งˆํฌํ•˜์„ธ์š”. ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋ณด๋ฉด ๋‹ต์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.

Puzzle 25: ์›Œํ”„ ํ†ต์‹ 

๊ฐœ์š”

Puzzle 25: ์›Œํ”„ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ์—์„œ๋Š” ๊ณ ๊ธ‰ GPU ์›Œํ”„ ๋ ˆ๋ฒจ ํ†ต์‹  ์—ฐ์‚ฐ - ์›Œํ”„ ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜๊ณผ ์กฐ์ • ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. shuffle_down๊ณผ broadcast๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด ์—†์ด ์ด์›ƒ ํ†ต์‹ ๊ณผ ์ง‘ํ•ฉ ์กฐ์ •์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

Part VII: GPU ์›Œํ”„ ํ†ต์‹ ์—์„œ๋Š” ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน ๋‚ด ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฐ์ดํ„ฐ ์ด๋™ ์—ฐ์‚ฐ์„ ๋‹ค๋ฃน๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ์ธ๋ฑ์‹ฑ + ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๋ฐ์ดํ„ฐ ์ด๋™์„ ํ™œ์šฉํ•˜๋Š” ํšจ์œจ์ ์ธ ์›Œํ”„ ํ†ต์‹  ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์›Œํ”„๋Š” ๋ก์Šคํ…์œผ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค - Mojo์˜ ์›Œํ”„ ํ†ต์‹  ์—ฐ์‚ฐ์€ ์ด ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์™€ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ๊ตํ™˜ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

์›Œํ”„ ํ†ต์‹  ๋ชจ๋ธ

GPU ์›Œํ”„ ๋‚ด ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ์›Œํ”„ (32 ์Šค๋ ˆ๋“œ, SIMT ๋ก์Šคํ… ์‹คํ–‰)
โ”œโ”€โ”€ ๋ ˆ์ธ 0  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 1  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 2
โ”œโ”€โ”€ ๋ ˆ์ธ 1  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 2  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 3
โ”œโ”€โ”€ ๋ ˆ์ธ 2  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 3  โ”€โ”€shuffle_downโ”€โ”€> ๋ ˆ์ธ 4
โ”‚   ...
โ””โ”€โ”€ ๋ ˆ์ธ 31 โ”€โ”€shuffle_downโ”€โ”€> undefined (๊ฒฝ๊ณ„)

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด:
๋ ˆ์ธ 0 โ”€โ”€broadcastโ”€โ”€> ๋ชจ๋“  ๋ ˆ์ธ (0, 1, 2, ..., 31)

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ : ๋ฐ์ดํ„ฐ๊ฐ€ ์Šค๋ ˆ๋“œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์ด๋ฅผ ์ง์ ‘ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค
  • ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ํ•˜๋“œ์›จ์–ด๊ฐ€ ์›Œํ”„ ๊ฒฝ๊ณ„์˜ ์˜ˆ์™ธ ์ƒํ™ฉ์„ ๊ด€๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ์ผ ์‚ฌ์ดํด ์—ฐ์‚ฐ: ํ•˜๋‚˜์˜ ๋ช…๋ น ์‚ฌ์ดํด์—์„œ ํ†ต์‹ ์ด ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค

Mojo์˜ ์›Œํ”„ ํ†ต์‹  ์—ฐ์‚ฐ

gpu.primitives.warp์˜ ํ•ต์‹ฌ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. shuffle_down(value, offset): ๋” ๋†’์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ (์ด์›ƒ ์ ‘๊ทผ)
  2. broadcast(value): ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ณต์œ  (์ผ๋Œ€๋‹ค)
  3. shuffle_idx(value, lane): ํŠน์ • ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ (์ž„์˜ ์ ‘๊ทผ)
  4. shuffle_up(value, offset): ๋” ๋‚ฎ์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ค๊ธฐ (์—ญ๋ฐฉํ–ฅ ์ด์›ƒ)

์ฐธ๊ณ : ์ด ํผ์ฆ์€ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ํ†ต์‹  ํŒจํ„ด์ธ shuffle_down()๊ณผ broadcast()์— ์ดˆ์ ์„ ๋งž์ถฅ๋‹ˆ๋‹ค. ๋ชจ๋“  ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ „์ฒด ๋‚ด์šฉ์€ Mojo GPU ์›Œํ”„ ๋ฌธ์„œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# ๋ณต์žกํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด (๊ธฐ์กด ๋ฐฉ์‹):
shared = TileTensor[
    dtype,
    row_major[WARP_SIZE](),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
shared[local_i] = input[global_i]
barrier()
if local_i < WARP_SIZE - 1:
    next_value = shared[local_i + 1]  # ์ด์›ƒ ์ ‘๊ทผ
    result = next_value - shared[local_i]
else:
    result = 0  # ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
barrier()

# ์›Œํ”„ ํ†ต์‹ ์€ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:
current_val = input[global_i]
next_val = shuffle_down(current_val, 1)  # ์ด์›ƒ์— ์ง์ ‘ ์ ‘๊ทผ
if lane < WARP_SIZE - 1:
    result = next_val - current_val
else:
    result = 0

์›Œํ”„ ํ†ต์‹ ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

| ํ†ต์‹  ํŒจํ„ด   | ๊ธฐ์กด ๋ฐฉ์‹            | ์›Œํ”„ ์—ฐ์‚ฐ               |
| ----------- | -------------------- | ----------------------- |
| ์ด์›ƒ ์ ‘๊ทผ   | ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ          | ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹   |
| ์Šคํ…์‹ค ์—ฐ์‚ฐ | ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ        | ๊ฐ„๋‹จํ•œ ์…”ํ”Œ ํŒจํ„ด        |
| ๋ธ”๋ก ์กฐ์ •   | ๋ฐฐ๋ฆฌ์–ด + ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ | ๋‹จ์ผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ        |
| ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ   | ์ˆ˜๋™ ๊ฒ€์‚ฌ            | ํ•˜๋“œ์›จ์–ด ์ž๋™ ์ฒ˜๋ฆฌ      |

์„ ์ˆ˜ ์ง€์‹

์›Œํ”„ ํ†ต์‹ ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • Part VII ์›Œํ”„ ๊ธฐ์ดˆ: SIMT ์‹คํ–‰๊ณผ ๊ธฐ๋ณธ ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด (Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ ์ฐธ๊ณ )
  • GPU ์Šค๋ ˆ๋“œ ๊ณ„์ธต ๊ตฌ์กฐ: ๋ธ”๋ก, ์›Œํ”„, ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ
  • TileTensor ์—ฐ์‚ฐ: ๋กœ๋“œ, ์ €์žฅ, ํ…์„œ ์กฐ์ž‘
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด ์ฒ˜๋ฆฌ: ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ฐ€์žฅ์ž๋ฆฌ ์ผ€์ด์Šค ๊ด€๋ฆฌ

ํ•™์Šต ๊ฒฝ๋กœ

1. shuffle_down์„ ์ด์šฉํ•œ ์ด์›ƒ ํ†ต์‹ 

โ†’ warp.shuffle_down()

์Šคํ…์‹ค ์—ฐ์‚ฐ๊ณผ ์œ ํ•œ ์ฐจ๋ถ„์„ ์œ„ํ•œ ์ด์›ƒ ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_down()์œผ๋กœ ์ธ์ ‘ ๋ ˆ์ธ ๋ฐ์ดํ„ฐ ์ ‘๊ทผํ•˜๊ธฐ
  • ์œ ํ•œ ์ฐจ๋ถ„๊ณผ ์ด๋™ ํ‰๊ท  ๊ตฌํ˜„
  • ์›Œํ”„ ๊ฒฝ๊ณ„ ์ž๋™ ์ฒ˜๋ฆฌ
  • ํ™•์žฅ๋œ ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•œ ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ

ํ•ต์‹ฌ ํŒจํ„ด:

current_val = input[global_i]
next_val = shuffle_down(current_val, 1)
if lane < WARP_SIZE - 1:
    result = compute_with_neighbors(current_val, next_val)

2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋ฅผ ์ด์šฉํ•œ ์ง‘ํ•ฉ ์กฐ์ •

โ†’ warp.broadcast()

๋ธ”๋ก ๋ ˆ๋ฒจ ์กฐ์ •๊ณผ ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •์„ ์œ„ํ•œ ์ผ๋Œ€๋‹ค ํ†ต์‹  ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • broadcast()๋กœ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ณต์œ 
  • ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต๊ณ„์™€ ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ • ๊ตฌํ˜„
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์™€ ์กฐ๊ฑด๋ถ€ ๋กœ์ง ๊ฒฐํ•ฉ
  • ๊ณ ๊ธ‰ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-์…”ํ”Œ ์กฐ์ • ํŒจํ„ด

ํ•ต์‹ฌ ํŒจํ„ด:

var shared_value = 0.0
if lane == 0:
    shared_value = compute_block_statistic()
shared_value = broadcast(shared_value)
result = use_shared_value(shared_value, local_data)
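์œ„ ํ•ต์‹ฌ ํŒจํ„ด(๋ ˆ์ธ 0์ด ํ†ต๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  broadcast๋กœ ๊ณต์œ )์„ Python์œผ๋กœ ๋ชจ๋ธ๋งํ•œ ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค. ๋ธ”๋ก ํ†ต๊ณ„๋กœ ์ตœ๋Œ“๊ฐ’์„ ์‚ฌ์šฉํ•ด ๊ฐ ๋ ˆ์ธ์˜ ๊ฐ’์„ ์ •๊ทœํ™”ํ•˜๋Š” ์˜ˆ๋Š” ์„ค๋ช…์„ ์œ„ํ•ด ์ž„์˜๋กœ ๊ณ ๋ฅธ ๊ฒƒ์ด๋ฉฐ, ์‹ค์ œ ์ปค๋„ ์ฝ”๋“œ๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค.

```python
WARP_SIZE = 32

def broadcast(vals, src=0):
    # ๋ ˆ์ธ src(๊ธฐ๋ณธ๊ฐ’ 0)์˜ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ณต์œ 
    return [vals[src]] * WARP_SIZE

def normalize_with_broadcast(data):
    # ๋ ˆ์ธ 0๋งŒ ๋ธ”๋ก ํ†ต๊ณ„(์—ฌ๊ธฐ์„œ๋Š” ์ตœ๋Œ“๊ฐ’)๋ฅผ ๊ณ„์‚ฐํ–ˆ๋‹ค๊ณ  ๊ฐ€์ •
    stats = [0.0] * WARP_SIZE
    stats[0] = max(data)
    # broadcast ํ›„ ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ๊ฐ’์„ ๋ณด์œ  - ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”
    shared = broadcast(stats)
    # ๊ฐ ๋ ˆ์ธ์ด ๊ณต์œ ๋œ ๊ฐ’์œผ๋กœ ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌ
    return [x / shared[i] for i, x in enumerate(data)]

data = [float(i + 1) for i in range(WARP_SIZE)]  # 1.0 .. 32.0
result = normalize_with_broadcast(data)
print(result[-1])  # 1.0 (32 / 32)
```

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด๋กœ ๊ตฌํ˜„ํ•˜๋˜ "ํ•œ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ, ์ „์ฒด๊ฐ€ ์‚ฌ์šฉ" ํŒจํ„ด์ด ๋‹จ์ผ broadcast ํ˜ธ์ถœ๋กœ ์ค„์–ด๋“œ๋Š” ๊ตฌ์กฐ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.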

ํ•ต์‹ฌ ๊ฐœ๋…

ํ†ต์‹  ํŒจํ„ด

์›Œํ”„ ํ†ต์‹ ์˜ ๊ธฐ๋ณธ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • ์ด์›ƒ ํ†ต์‹ : ๋ ˆ์ธ ๊ฐ„ ์ธ์ ‘ ๋ฐ์ดํ„ฐ ๊ตํ™˜
  • ์ง‘ํ•ฉ ์กฐ์ •: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์œผ๋กœ ์ •๋ณด ๊ณต์œ 
  • ์Šคํ…์‹ค ์—ฐ์‚ฐ: ๊ณ ์ •๋œ ํŒจํ„ด์œผ๋กœ ์ด์›ƒ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์›Œํ”„ ๊ฐ€์žฅ์ž๋ฆฌ์—์„œ์˜ ํ†ต์‹  ๊ด€๋ฆฌ

ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”

์›Œํ”„ ํ†ต์‹ ์ด GPU ํ•˜๋“œ์›จ์–ด์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • ๋ ˆ์ง€์Šคํ„ฐ ํŒŒ์ผ ํ†ต์‹ : ์Šค๋ ˆ๋“œ ๊ฐ„ ๋ ˆ์ง€์Šคํ„ฐ ์ง์ ‘ ์ ‘๊ทผ
  • SIMT ์‹คํ–‰: ๋ชจ๋“  ๋ ˆ์ธ์ด ํ†ต์‹ ์„ ๋™์‹œ์— ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ์ œ๋กœ ์ง€์—ฐ ์‹œ๊ฐ„: ์‹คํ–‰ ์œ ๋‹› ๋‚ด์—์„œ ํ†ต์‹ ์ด ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค
  • ์ž๋™ ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ€ํ™˜

๊ธฐ์กด ๋ณ‘๋ ฌ ํŒจํ„ด์„ ์›Œํ”„ ํ†ต์‹ ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฐฐ์—ด ์ด์›ƒ ์ ‘๊ทผ โ†’ shuffle_down()
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ • โ†’ broadcast()
  • ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ๋กœ์ง โ†’ ํ•˜๋“œ์›จ์–ด ์ž๋™ ์ฒ˜๋ฆฌ
  • ๋‹ค๋‹จ๊ณ„ ๋™๊ธฐํ™” โ†’ ๋‹จ์ผ ํ†ต์‹  ์—ฐ์‚ฐ

์‹œ์ž‘ํ•˜๊ธฐ

์ด์›ƒ ๊ธฐ๋ฐ˜ ์…”ํ”Œ ์—ฐ์‚ฐ์œผ๋กœ ๊ธฐ์ดˆ๋ฅผ ๋‹ค์ง„ ๋‹ค์Œ, ๊ณ ๊ธ‰ ์กฐ์ •์„ ์œ„ํ•œ ์ง‘ํ•ฉ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์›Œํ”„ ํ†ต์‹ ์„ ๊ฐ™์€ ์›Œํ”„ ๋‚ด ์Šค๋ ˆ๋“œ ๊ฐ„์˜ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ฉ”์‹œ์ง€ ํŒจ์‹ฑ์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”. ์ด ๋ฉ˜ํƒˆ ๋ชจ๋ธ์ด GPU์˜ SIMT ์•„ํ‚คํ…์ฒ˜๋ฅผ ํ™œ์šฉํ•˜๋Š” ํšจ์œจ์ ์ธ ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ์•ˆ๋‚ดํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Puzzle 25๋ฅผ ๋งˆ์น˜๋ฉด, ์›Œํ”„ ํ†ต์‹ ์ด ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์ธ์‹ํ•˜์—ฌ ๋” ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅธ ์ด์›ƒ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: warp.shuffle_down() ์—์„œ ์ด์›ƒ ํ†ต์‹ ์„ ๋ฐฐ์šด ๋‹ค์Œ, warp.broadcast() ์—์„œ ์ง‘ํ•ฉ ์กฐ์ • ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ€์„ธ์š”.

warp.shuffle_down() ์ผ๋Œ€์ผ ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ์ด์›ƒ ํ†ต์‹ ์—์„œ๋Š” shuffle_down()์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๋‚ด ์ธ์ ‘ ๋ ˆ์ธ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ์œ ํ•œ ์ฐจ๋ถ„, ์ด๋™ ํ‰๊ท , ์ด์›ƒ ๊ธฐ๋ฐ˜ ๊ณ„์‚ฐ์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: shuffle_down() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐ ๋ ˆ์ธ์ด ๊ฐ™์€ ์›Œํ”„ ๋‚ด ์ด์›ƒ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•˜๋ฉฐ, ํšจ์œจ์ ์ธ ์Šคํ…์‹ค ํŒจํ„ด๊ณผ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

์Šคํ…์‹ค ์—ฐ์‚ฐ์ด๋ž€? ์Šคํ…์‹ค ์—ฐ์‚ฐ์€ ๊ฐ ์ถœ๋ ฅ ์š”์†Œ๊ฐ€ ์ด์›ƒ ์ž…๋ ฅ ์š”์†Œ์˜ ๊ณ ์ •๋œ ํŒจํ„ด์— ์˜์กดํ•˜๋Š” ๊ณ„์‚ฐ์ž…๋‹ˆ๋‹ค. ๋Œ€ํ‘œ์ ์ธ ์˜ˆ๋กœ ์œ ํ•œ ์ฐจ๋ถ„(๋„ํ•จ์ˆ˜), ํ•ฉ์„ฑ๊ณฑ, ์ด๋™ ํ‰๊ท ์ด ์žˆ์Šต๋‹ˆ๋‹ค. โ€œ์Šคํ…์‹คโ€œ์€ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ฐ€๋ฆฌํ‚ต๋‹ˆ๋‹ค - ์˜ˆ๋ฅผ ๋“ค์–ด [i-1, i, i+1]์„ ์ฝ๋Š” 3์  ์Šคํ…์‹ค์ด๋‚˜ [i-2, i-1, i, i+1, i+2]๋ฅผ ์ฝ๋Š” 5์  ์Šคํ…์‹ค์ด ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_down()์„ ํ™œ์šฉํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฐ์ดํ„ฐ ์…”ํ”Œ
  • ์Šคํ…์‹ค ๊ณ„์‚ฐ์„ ์œ„ํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
  • ์›Œํ”„ ๊ฐ€์žฅ์ž๋ฆฌ์—์„œ์˜ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ
  • ํ™•์žฅ๋œ ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•œ ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ
  • ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ์˜ ์›Œํ”„ ๊ฐ„ ์กฐ์ •

shuffle_down ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด ๋” ๋†’์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{shuffle_down}(\text{value}, \text{offset}) = \text{value_from_lane}(\text{lane_id} + \text{offset})\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด์ด ๊ฐ„๋‹จํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์‹ฑ ์—†์ด ํšจ์œจ์ ์ธ ์Šคํ…์‹ค ๊ณ„์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ๊ธฐ๋ณธ ์ด์›ƒ ์ฐจ๋ถ„

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)

shuffle_down ๊ฐœ๋…

๊ธฐ์กด ์ด์›ƒ ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์›€
if global_i < size - 1:
    next_value = input[global_i + 1]  # ๋ฒ”์œ„ ์ดˆ๊ณผ ๊ฐ€๋Šฅ์„ฑ
    result = next_value - current_value

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋ฐฐ์—ด ๊ฒฝ๊ณ„๋ฅผ ์ˆ˜๋™์œผ๋กœ ํ™•์ธํ•ด์•ผ ํ•จ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋ณ„๋„์˜ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ๊ฐ€ ํ•„์š”
  • ๋™๊ธฐํ™”: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์—์„œ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ์Œ
  • ๋ณต์žกํ•œ ๋กœ์ง: ๊ฒฝ๊ณ„์˜ ์˜ˆ์™ธ ์ƒํ™ฉ ์ฒ˜๋ฆฌ๊ฐ€ ์žฅํ™ฉํ•ด์ง

shuffle_down()์„ ์‚ฌ์šฉํ•˜๋ฉด ์ด์›ƒ ์ ‘๊ทผ์ด ๊ฐ„๊ฒฐํ•ด์ง‘๋‹ˆ๋‹ค:

# ์›Œํ”„ ์…”ํ”Œ ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ์•ˆ์ „
current_val = input[global_i]
next_val = shuffle_down(current_val, 1)  # lane+1์—์„œ ๊ฐ’ ๊ฐ€์ ธ์˜ค๊ธฐ
if lane < WARP_SIZE - 1:
    result = next_val - current_val

shuffle_down์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋ถˆํ•„์š”
  • ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ํ•˜๋“œ์›จ์–ด๊ฐ€ ์›Œํ”„ ๊ฒฝ๊ณ„๋ฅผ ๊ด€๋ฆฌ
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ: ๋‹ค๋ฅธ ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_down()์œผ๋กœ ๋‹ค์Œ ์š”์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ์œ ํ•œ ์ฐจ๋ถ„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ๊ฐ ์š”์†Œ์˜ ์ด์‚ฐ ๋„ํ•จ์ˆ˜(์œ ํ•œ ์ฐจ๋ถ„)๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \text{input}[i+1] - \text{input}[i]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [0, 1, 4, 9, 16, 25, ...] (์ œ๊ณฑ์ˆ˜: i * i)๋ฅผ ์ฐจ๋ถ„๊ฐ’ [1, 3, 5, 7, 9, ...] (ํ™€์ˆ˜)๋กœ ๋ณ€ํ™˜ํ•˜์—ฌ, ์ด์ฐจ ํ•จ์ˆ˜์˜ ์ด์‚ฐ ๋„ํ•จ์ˆ˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

comptime SIZE = WARP_SIZE
comptime BLOCKS_PER_GRID = (1, 1)
comptime THREADS_PER_BLOCK = (WARP_SIZE, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)


def neighbor_difference[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Compute finite differences: output[i] = input[i+1] - input[i]
    Uses shuffle_down(val, 1) to get the next neighbor's value.
    Works across multiple blocks, each processing one warp worth of data.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    # FILL IN (roughly 7 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p25/p25.mojo

ํŒ

1. shuffle_down ์ดํ•ดํ•˜๊ธฐ

shuffle_down(value, offset) ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด ๋” ๋†’์€ ์ธ๋ฑ์Šค์˜ ๋ ˆ์ธ์—์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ›์„ ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋“œ ์—†์ด ์ด์›ƒ ์š”์†Œ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ดํŽด๋ณด์„ธ์š”.

shuffle_down(val, 1)์ด ํ•˜๋Š” ์ผ:

  • ๋ ˆ์ธ 0์ด ๋ ˆ์ธ 1์˜ ๊ฐ’์„ ๋ฐ›์Œ
  • ๋ ˆ์ธ 1์ด ๋ ˆ์ธ 2์˜ ๊ฐ’์„ ๋ฐ›์Œ
  • โ€ฆ
  • ๋ ˆ์ธ 30์ด ๋ ˆ์ธ 31์˜ ๊ฐ’์„ ๋ฐ›์Œ
  • ๋ ˆ์ธ 31์€ ๋ฏธ์ •์˜ ๊ฐ’์„ ๋ฐ›์Œ (๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋กœ ์ฒ˜๋ฆฌ)

2. ์›Œํ”„ ๊ฒฝ๊ณ„ ๊ณ ๋ ค์‚ฌํ•ญ

์›Œํ”„์˜ ๊ฐ€์žฅ์ž๋ฆฌ์—์„œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ์ผ๋ถ€ ๋ ˆ์ธ์€ ์…”ํ”Œ ์—ฐ์‚ฐ์œผ๋กœ ์ ‘๊ทผํ•  ์œ ํšจํ•œ ์ด์›ƒ์ด ์—†์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณผ์ œ: ์›Œํ”„ ๊ฒฝ๊ณ„์—์„œ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋ฏธ์ •์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜์„ธ์š”.

WARP_SIZE = 32์—์„œ์˜ ์ด์›ƒ ์ฐจ๋ถ„:

  • ์œ ํšจํ•œ ์ฐจ๋ถ„ (lane < WARP_SIZE - 1): ๋ ˆ์ธ 0-30 (31๊ฐœ ๋ ˆ์ธ)

    • ์กฐ๊ฑด: \(\text{lane_id}() \in {0, 1, \cdots, 30}\)
    • ์ด์œ : shuffle_down(current_val, 1)์ด ๋‹ค์Œ ์ด์›ƒ์˜ ๊ฐ’์„ ์„ฑ๊ณต์ ์œผ๋กœ ๊ฐ€์ ธ์˜ด
    • ๊ฒฐ๊ณผ: output[i] = input[i+1] - input[i] (์œ ํ•œ ์ฐจ๋ถ„)
  • ๊ฒฝ๊ณ„ ์ผ€์ด์Šค (else): ๋ ˆ์ธ 31 (1๊ฐœ ๋ ˆ์ธ)

    • ์กฐ๊ฑด: \(\text{lane_id}() = 31\)
    • ์ด์œ : shuffle_down(current_val, 1)์ด ๋ฏธ์ •์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ (๋ ˆ์ธ 32๊ฐ€ ์—†์Œ)
    • ๊ฒฐ๊ณผ: output[i] = 0 (์ฐจ๋ถ„ ๊ณ„์‚ฐ ๋ถˆ๊ฐ€)

3. ๋ ˆ์ธ ์‹๋ณ„

lane = lane_id()  # 0๋ถ€ํ„ฐ WARP_SIZE-1๊นŒ์ง€ ๋ฐ˜ํ™˜

๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ: ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ๋ ˆ์ธ์€ 0, 1, 2,โ€ฆ, WARP_SIZE-1๋กœ ๋ฒˆํ˜ธ๊ฐ€ ๋งค๊ฒจ์ง‘๋‹ˆ๋‹ค

์ด์›ƒ ์ฐจ๋ถ„ ํ…Œ์ŠคํŠธ:

pixi run p25 --neighbor
pixi run -e amd p25 --neighbor
pixi run -e apple p25 --neighbor
uv run poe p25 --neighbor

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0, 55.0, 57.0, 59.0, 61.0, 0.0]
expected: [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0, 17.0, 19.0, 21.0, 23.0, 25.0, 27.0, 29.0, 31.0, 33.0, 35.0, 37.0, 39.0, 41.0, 43.0, 45.0, 47.0, 49.0, 51.0, 53.0, 55.0, 57.0, 59.0, 61.0, 0.0]
โœ… Basic neighbor difference test passed!

์†”๋ฃจ์…˜

def neighbor_difference[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Compute finite differences: output[i] = input[i+1] - input[i]
    Uses shuffle_down(val, 1) to get the next neighbor's value.
    Works across multiple blocks, each processing one warp worth of data.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    if global_i < size:
        # Get current value
        var current_val = input[global_i]

        # Get next neighbor's value using shuffle_down
        var next_val = shuffle_down(current_val, 1)

        # Compute difference - valid within warp boundaries
        # Last lane of each warp has no valid neighbor within the warp
        # Note there's only one warp in this test, so we don't need to check global_i < size - 1
        # We'll see how this works with multiple blocks in the next tests
        if lane < WARP_SIZE - 1:
            output[global_i] = next_val - current_val
        else:
            # Last thread in warp or last thread overall, set to 0
            output[global_i] = 0


์ด ์†”๋ฃจ์…˜์€ shuffle_down()์ด ๊ธฐ์กด ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์„ ํšจ์œจ์ ์ธ ์›Œํ”„ ๋ ˆ๋ฒจ ํ†ต์‹ ์œผ๋กœ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]           # ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ฝ์Œ
    next_val = shuffle_down(current_val, 1) # ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ์ด๋™

    if lane < WARP_SIZE - 1:
        output[global_i] = next_val - current_val  # ์ฐจ๋ถ„ ๊ณ„์‚ฐ
    else:
        output[global_i] = 0                       # ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  ๋ ˆ์ธ 0: current_val = input[0] = 0
  ๋ ˆ์ธ 1: current_val = input[1] = 1
  ๋ ˆ์ธ 2: current_val = input[2] = 4
  ...
  ๋ ˆ์ธ 31: current_val = input[31] = 961

์‚ฌ์ดํด 2: shuffle_down(current_val, 1)์ด ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์‹คํ–‰
  ๋ ˆ์ธ 0: ๋ ˆ์ธ 1์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 1
  ๋ ˆ์ธ 1: ๋ ˆ์ธ 2์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 4
  ๋ ˆ์ธ 2: ๋ ˆ์ธ 3์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 9
  ...
  ๋ ˆ์ธ 30: ๋ ˆ์ธ 31์—์„œ current_val ์ˆ˜์‹  โ†’ next_val = 961
  ๋ ˆ์ธ 31: ๋ฏธ์ •์˜ ์ˆ˜์‹  (๋ ˆ์ธ 32 ์—†์Œ) โ†’ next_val = ?

์‚ฌ์ดํด 3: ์ฐจ๋ถ„ ๊ณ„์‚ฐ (๋ ˆ์ธ 0-30๋งŒ ํ•ด๋‹น)
  ๋ ˆ์ธ 0: output[0] = 1 - 0 = 1
  ๋ ˆ์ธ 1: output[1] = 4 - 1 = 3
  ๋ ˆ์ธ 2: output[2] = 9 - 4 = 5
  ...
  ๋ ˆ์ธ 31: output[31] = 0 (๊ฒฝ๊ณ„ ์กฐ๊ฑด)

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์ด์‚ฐ ๋„ํ•จ์ˆ˜ ์—ฐ์‚ฐ์ž \(D\)๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large D\lbrack f\rbrack(i) = f(i+1) - f(i)\]

์ด์ฐจ ์ž…๋ ฅ \(f(i) = i^2\)์— ๋Œ€ํ•ด: \[\Large D[i^2] = (i+1)^2 - i^2 = i^2 + 2i + 1 - i^2 = 2i + 1\]

shuffle_down์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ธฐ์กด ๋ฐฉ์‹์€ input[global_i + 1] ๋กœ๋“œ๊ฐ€ ํ•„์š”ํ•˜์—ฌ ์บ์‹œ ๋ฏธ์Šค๋ฅผ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ์Œ
  2. ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ: ๋ฒ”์œ„ ์ดˆ๊ณผ ์ ‘๊ทผ ์œ„ํ—˜์ด ์—†์Œ - ํ•˜๋“œ์›จ์–ด๊ฐ€ ์›Œํ”„ ๊ฒฝ๊ณ„๋ฅผ ์ฒ˜๋ฆฌ
  3. SIMT ์ตœ์ ํ™”: ๋‹จ์ผ ๋ช…๋ น์ด ๋ชจ๋“  ๋ ˆ์ธ์„ ๋™์‹œ์— ์ฒ˜๋ฆฌ
  4. ๋ ˆ์ง€์Šคํ„ฐ ํ†ต์‹ : ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ๊ฐ€ ์•„๋‹Œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์ด๋ฅผ ์ด๋™

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์˜ 100+ ์‚ฌ์ดํด ๋Œ€๋น„)
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๊ธฐ์กด ๋ฐฉ์‹์˜ ์Šค๋ ˆ๋“œ๋‹น 4๋ฐ”์ดํŠธ ๋Œ€๋น„)
  • ๋ณ‘๋ ฌ์„ฑ: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ์ฒ˜๋ฆฌ

2. ๋‹ค์ค‘ ์˜คํ”„์…‹ ์ด๋™ ํ‰๊ท 

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE_2 = 64 (๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: BLOCKS_PER_GRID = (2, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: THREADS_PER_BLOCK = (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

์—ฌ๋Ÿฌ shuffle_down ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์—ฌ 3์  ์ด๋™ ํ‰๊ท ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ์„ธ ๊ฐœ์˜ ์—ฐ์† ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์Šฌ๋ผ์ด๋”ฉ ์œˆ๋„์šฐ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \frac{1}{3}\left(\text{input}[i] + \text{input}[i+1] + \text{input}[i+2]\right)\]

๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์›Œํ”„ ๊ฒฝ๊ณ„์—์„œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•˜๊ฒŒ ์ ์‘ํ•ฉ๋‹ˆ๋‹ค:

  • 3์  ์ „์ฒด ์œˆ๋„์šฐ: \(\text{output}[i] = \frac{1}{3}\sum_{k=0}^{2} \text{input}[i+k]\) - ๋ชจ๋“  ์ด์›ƒ์ด ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•  ๋•Œ
  • 2์  ์œˆ๋„์šฐ: \(\text{output}[i] = \frac{1}{2}\sum_{k=0}^{1} \text{input}[i+k]\) - ๋‹ค์Œ ์ด์›ƒ๋งŒ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•  ๋•Œ
  • 1์  ์œˆ๋„์šฐ: \(\text{output}[i] = \text{input}[i]\) - ์ด์›ƒ์ด ์‚ฌ์šฉ ๋ถˆ๊ฐ€ํ•  ๋•Œ

์ด๋Š” shuffle_down()์ด ์›Œํ”„ ๋ฒ”์œ„ ๋‚ด์—์„œ ์ž๋™ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ์™€ ํ•จ๊ป˜ ํšจ์œจ์ ์ธ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = row_major[SIZE_2]()
comptime LayoutType_2 = type_of(layout_2)


def moving_average_3[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
    """
    Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
    Uses shuffle_down with offsets 1 and 2 to access neighbors.
    Works within warp boundaries across multiple blocks.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    # FILL IN (roughly 10 lines)


ํŒ

1. ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ ํŒจํ„ด

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ์ด์›ƒ์— ๋™์‹œ์— ์ ‘๊ทผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ์˜คํ”„์…‹์œผ๋กœ ์…”ํ”Œ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • input[i+1]๊ณผ input[i+2]๋ฅผ ์…”ํ”Œ ์—ฐ์‚ฐ์œผ๋กœ ์–ด๋–ป๊ฒŒ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ์„๊นŒ์š”?
  • ์…”ํ”Œ ์˜คํ”„์…‹๊ณผ ์ด์›ƒ ๊ฑฐ๋ฆฌ์˜ ๊ด€๊ณ„๋Š” ๋ฌด์—‡์ผ๊นŒ์š”?
  • ๊ฐ™์€ ์†Œ์Šค ๊ฐ’์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ๋ฒˆ ์…”ํ”Œ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์„๊นŒ์š”?

์‹œ๊ฐํ™” ๊ฐœ๋…:

ํ˜„์žฌ ๋ ˆ์ธ์ด ํ•„์š”ํ•œ ๊ฐ’: current_val, next_val, next_next_val
์…”ํ”Œ ์˜คํ”„์…‹:        0 (์ง์ ‘),    1,        2

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ๋ช‡ ๋ฒˆ์˜ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•˜๊ณ , ์–ด๋–ค ์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•ด์•ผ ํ• ๊นŒ์š”?

2. ๋‹จ๊ณ„์  ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ

๋‹จ์ˆœํ•œ ์ด์›ƒ ์ฐจ๋ถ„๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ 2๊ฐœ์˜ ์ด์›ƒ์— ์ ‘๊ทผํ•ด์•ผ ํ•˜๋ฏ€๋กœ ์—ฌ๋Ÿฌ ๊ฒฝ๊ณ„ ์‹œ๋‚˜๋ฆฌ์˜ค๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ๊ฒฝ๊ณ„ ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ์ „์ฒด ์œˆ๋„์šฐ: ๋ ˆ์ธ์ด ๋‘ ์ด์›ƒ ๋ชจ๋‘ ์ ‘๊ทผ ๊ฐ€๋Šฅ โ†’ 3๊ฐœ ๊ฐ’ ๋ชจ๋‘ ์‚ฌ์šฉ
  • ๋ถ€๋ถ„ ์œˆ๋„์šฐ: ๋ ˆ์ธ์ด 1๊ฐœ ์ด์›ƒ๋งŒ ์ ‘๊ทผ ๊ฐ€๋Šฅ โ†’ 2๊ฐœ ๊ฐ’ ์‚ฌ์šฉ
  • ์œˆ๋„์šฐ ์—†์Œ: ๋ ˆ์ธ์ด ์ด์›ƒ์— ์ ‘๊ทผ ๋ถˆ๊ฐ€ โ†’ 1๊ฐœ ๊ฐ’ ์‚ฌ์šฉ

๋น„ํŒ์  ์‚ฌ๊ณ :

  • ์–ด๋–ค ๋ ˆ์ธ์ด ๊ฐ ์นดํ…Œ๊ณ ๋ฆฌ์— ํ•ด๋‹นํ• ๊นŒ์š”?
  • ๊ฐ’์ด ์ ์„ ๋•Œ ํ‰๊ท ์˜ ๊ฐ€์ค‘์น˜๋ฅผ ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ•ด์•ผ ํ• ๊นŒ์š”?
  • ์–ด๋–ค ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ๊ฒ€์‚ฌํ•ด์•ผ ํ• ๊นŒ์š”?

๊ณ ๋ คํ•  ํŒจํ„ด:

if (๋‘ ์ด์›ƒ ๋ชจ๋‘ ์ ‘๊ทผ ๊ฐ€๋Šฅ):
    # 3์  ํ‰๊ท 
elif (ํ•œ ์ด์›ƒ๋งŒ ์ ‘๊ทผ ๊ฐ€๋Šฅ):
    # 2์  ํ‰๊ท 
else:
    # 1์  (ํ‰๊ท  ์—†์Œ)

3. ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•˜๋ฉฐ, ๊ฐ ๋ธ”๋ก์ด ๋ฐ์ดํ„ฐ์˜ ๋‹ค๋ฅธ ์˜์—ญ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

์ค‘์š”ํ•œ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๊ฐ ๋ธ”๋ก์€ ๋ ˆ์ธ 0๋ถ€ํ„ฐ WARP_SIZE-1๊นŒ์ง€์˜ ์ž์ฒด ์›Œํ”„๋ฅผ ๊ฐ€์ง
  • ๊ฒฝ๊ณ„ ์กฐ๊ฑด์€ ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ๋…๋ฆฝ์ ์œผ๋กœ ์ ์šฉ
  • ๋ธ”๋ก๋งˆ๋‹ค ๋ ˆ์ธ ๋ฒˆํ˜ธ๊ฐ€ ์ดˆ๊ธฐํ™”๋จ

์ƒ๊ฐํ•ด ๋ณผ ์งˆ๋ฌธ:

  • ๊ฒฝ๊ณ„ ๋กœ์ง์ด ๋ธ”๋ก 0๊ณผ ๋ธ”๋ก 1 ๋ชจ๋‘์—์„œ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๋‚˜์š”?
  • ๋ ˆ์ธ ๊ฒฝ๊ณ„์™€ ์ „์—ญ ๋ฐฐ์—ด ๊ฒฝ๊ณ„๋ฅผ ๋ชจ๋‘ ๊ฒ€์‚ฌํ•˜๊ณ  ์žˆ๋‚˜์š”?
  • ์„œ๋กœ ๋‹ค๋ฅธ ๋ธ”๋ก์—์„œ global_i์™€ lane_id()์˜ ๊ด€๊ณ„๋Š” ์–ด๋–ป๊ฒŒ ๋ ๊นŒ์š”?

๋””๋ฒ„๊น… ํŒ: ๊ฐ ๋ธ”๋ก์˜ ๊ฒฝ๊ณ„ ๋ ˆ์ธ์—์„œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ถ”์ ํ•˜์—ฌ ๋กœ์ง์„ ํ…Œ์ŠคํŠธํ•˜์„ธ์š”.

์ด๋™ ํ‰๊ท  ํ…Œ์ŠคํŠธ:

pixi run p25 --average
pixi run -e amd p25 --average
uv run poe p25 --average

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE_2:  64
output: HostBuffer([3.3333333, 6.3333335, 10.333333, 15.333333, 21.333334, 28.333334, 36.333332, 45.333332, 55.333332, 66.333336, 78.333336, 91.333336, 105.333336, 120.333336, 136.33333, 153.33333, 171.33333, 190.33333, 210.33333, 231.33333, 253.33333, 276.33334, 300.33334, 325.33334, 351.33334, 378.33334, 406.33334, 435.33334, 465.33334, 496.33334, 512.0, 528.0, 595.3333, 630.3333, 666.3333, 703.3333, 741.3333, 780.3333, 820.3333, 861.3333, 903.3333, 946.3333, 990.3333, 1035.3334, 1081.3334, 1128.3334, 1176.3334, 1225.3334, 1275.3334, 1326.3334, 1378.3334, 1431.3334, 1485.3334, 1540.3334, 1596.3334, 1653.3334, 1711.3334, 1770.3334, 1830.3334, 1891.3334, 1953.3334, 2016.3334, 2048.0, 2080.0])
expected: HostBuffer([3.3333333, 6.3333335, 10.333333, 15.333333, 21.333334, 28.333334, 36.333332, 45.333332, 55.333332, 66.333336, 78.333336, 91.333336, 105.333336, 120.333336, 136.33333, 153.33333, 171.33333, 190.33333, 210.33333, 231.33333, 253.33333, 276.33334, 300.33334, 325.33334, 351.33334, 378.33334, 406.33334, 435.33334, 465.33334, 496.33334, 512.0, 528.0, 595.3333, 630.3333, 666.3333, 703.3333, 741.3333, 780.3333, 820.3333, 861.3333, 903.3333, 946.3333, 990.3333, 1035.3334, 1081.3334, 1128.3334, 1176.3334, 1225.3334, 1275.3334, 1326.3334, 1378.3334, 1431.3334, 1485.3334, 1540.3334, 1596.3334, 1653.3334, 1711.3334, 1770.3334, 1830.3334, 1891.3334, 1953.3334, 2016.3334, 2048.0, 2080.0])
โœ… Moving average test passed!

์†”๋ฃจ์…˜

def moving_average_3[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
    """
    Compute 3-point moving average: output[i] = (input[i] + input[i+1] + input[i+2]) / 3
    Uses shuffle_down with offsets 1 and 2 to access neighbors.
    Works within warp boundaries across multiple blocks.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    if global_i < size:
        # Get current, next, and next+1 values
        var current_val = input[global_i]
        var next_val = shuffle_down(current_val, 1)
        var next_next_val = shuffle_down(current_val, 2)

        # Compute 3-point average - valid within warp boundaries
        if lane < WARP_SIZE - 2 and global_i < size - 2:
            output[global_i] = (current_val + next_val + next_next_val) / 3.0
        elif lane < WARP_SIZE - 1 and global_i < size - 1:
            # Second-to-last in warp: only current + next available
            output[global_i] = (current_val + next_val) / 2.0
        else:
            # Last thread in warp or boundary cases: only current available
            output[global_i] = current_val


์ด ์†”๋ฃจ์…˜์€ ๋ณต์žกํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ๋‹ค์ค‘ ์˜คํ”„์…‹ ์…”ํ”Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ์—ฌ๋Ÿฌ ์…”ํ”Œ๋กœ ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ๋ชจ๋‘ ํ™•๋ณด
    current_val = input[global_i]                   # ์ง์ ‘ ์ ‘๊ทผ
    next_val = shuffle_down(current_val, 1)         # ์˜ค๋ฅธ์ชฝ ์ด์›ƒ
    next_next_val = shuffle_down(current_val, 2)    # ์˜ค๋ฅธ์ชฝ+1 ์ด์›ƒ

    # ๋‹จ๊ณ„ 2: ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ฅธ ์ ์‘ํ˜• ๊ณ„์‚ฐ
    if lane < WARP_SIZE - 2 and global_i < size - 2:
        # 3์  ์Šคํ…์‹ค ์ „์ฒด ์‚ฌ์šฉ ๊ฐ€๋Šฅ
        output[global_i] = (current_val + next_val + next_next_val) / 3.0
    elif lane < WARP_SIZE - 1 and global_i < size - 1:
        # 2์  ์Šคํ…์‹ค๋งŒ ์‚ฌ์šฉ ๊ฐ€๋Šฅ (์›Œํ”„ ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜)
        output[global_i] = (current_val + next_val) / 2.0
    else:
        # ์Šคํ…์‹ค ์‚ฌ์šฉ ๋ถˆ๊ฐ€ (์›Œํ”„ ๊ฒฝ๊ณ„)
        output[global_i] = current_val

๋‹ค์ค‘ ์˜คํ”„์…‹ ์‹คํ–‰ ์ถ”์  (WARP_SIZE = 32):

์ดˆ๊ธฐ ์ƒํƒœ (๋ธ”๋ก 0, ์š”์†Œ 0-31):
  ๋ ˆ์ธ 0: current_val = input[0] = 1
  ๋ ˆ์ธ 1: current_val = input[1] = 2
  ๋ ˆ์ธ 2: current_val = input[2] = 4
  ...
  ๋ ˆ์ธ 31: current_val = input[31] = X

์ฒซ ๋ฒˆ์งธ ์…”ํ”Œ: shuffle_down(current_val, 1)
  ๋ ˆ์ธ 0: next_val = input[1] = 2
  ๋ ˆ์ธ 1: next_val = input[2] = 4
  ๋ ˆ์ธ 2: next_val = input[3] = 7
  ...
  ๋ ˆ์ธ 30: next_val = input[31] = X
  ๋ ˆ์ธ 31: next_val = ๋ฏธ์ •์˜

๋‘ ๋ฒˆ์งธ ์…”ํ”Œ: shuffle_down(current_val, 2)
  ๋ ˆ์ธ 0: next_next_val = input[2] = 4
  ๋ ˆ์ธ 1: next_next_val = input[3] = 7
  ๋ ˆ์ธ 2: next_next_val = input[4] = 11
  ...
  ๋ ˆ์ธ 29: next_next_val = input[31] = X
  ๋ ˆ์ธ 30: next_next_val = ๋ฏธ์ •์˜
  ๋ ˆ์ธ 31: next_next_val = ๋ฏธ์ •์˜

๊ณ„์‚ฐ ๋‹จ๊ณ„:
  ๋ ˆ์ธ 0-29: 3์  ์ „์ฒด ํ‰๊ท  โ†’ (current + next + next_next) / 3
  ๋ ˆ์ธ 30:   2์  ํ‰๊ท  โ†’ (current + next) / 2
  ๋ ˆ์ธ 31:   1์  ํ‰๊ท  โ†’ current (๊ทธ๋Œ€๋กœ ์ „๋‹ฌ)

์ˆ˜ํ•™์  ๊ธฐ๋ฐ˜: ๊ฐ€๋ณ€ ํญ ์ด์‚ฐ ํ•ฉ์„ฑ๊ณฑ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large h[i] = \sum_{k=0}^{K(i)-1} w_k^{(i)} \cdot f[i+k]\]

์œ„์น˜์— ๋”ฐ๋ผ ์ปค๋„์ด ์ ์‘ํ•ฉ๋‹ˆ๋‹ค:

  • ๋‚ด๋ถ€ ์ : \(K(i) = 3\), \(\mathbf{w}^{(i)} = [\frac{1}{3}, \frac{1}{3}, \frac{1}{3}]\)
  • ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜: \(K(i) = 2\), \(\mathbf{w}^{(i)} = [\frac{1}{2}, \frac{1}{2}]\)
  • ๊ฒฝ๊ณ„: \(K(i) = 1\), \(\mathbf{w}^{(i)} = [1]\)

๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •: SIZE_2 = 64์™€ 2๊ฐœ ๋ธ”๋ก:

๋ธ”๋ก 0 (์ „์—ญ ์ธ๋ฑ์Šค 0-31):
  ์ „์—ญ ์ธ๋ฑ์Šค 29, 30, 31์— ๋ ˆ์ธ ๊ฒฝ๊ณ„ ์ ์šฉ

๋ธ”๋ก 1 (์ „์—ญ ์ธ๋ฑ์Šค 32-63):
  ์ „์—ญ ์ธ๋ฑ์Šค 61, 62, 63์— ๋ ˆ์ธ ๊ฒฝ๊ณ„ ์ ์šฉ
  ๋ ˆ์ธ ๋ฒˆํ˜ธ ์ดˆ๊ธฐํ™”: global_i=32 โ†’ lane=0, global_i=63 โ†’ lane=31

์„ฑ๋Šฅ ์ตœ์ ํ™”:

  1. ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ํ™•๋ณด: ๋‘ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋™์‹œ์— ์‹คํ–‰
  2. ์กฐ๊ฑด๋ถ€ ๋ถ„๊ธฐ: GPU๊ฐ€ ํ”„๋ ˆ๋””์ผ€์ด์…˜์„ ํ†ตํ•ด ๋ถ„๊ธฐ ๋ ˆ์ธ์„ ํšจ์œจ์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ: ์ˆœ์ฐจ์  ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด GPU์— ์ตœ์ 
  4. ๋ ˆ์ง€์Šคํ„ฐ ์žฌ์‚ฌ์šฉ: ๋ชจ๋“  ์ค‘๊ฐ„ ๊ฐ’์ด ๋ ˆ์ง€์Šคํ„ฐ์— ์œ ์ง€

์‹ ํ˜ธ ์ฒ˜๋ฆฌ ๊ด€์ : ์ด๊ฒƒ์€ ์ž„ํŽ„์Šค ์‘๋‹ต \(h[n] = \frac{1}{3}[\delta[n] + \delta[n-1] + \delta[n-2]]\)๋ฅผ ๊ฐ€์ง„ ์ธ๊ณผ FIR ํ•„ํ„ฐ๋กœ, ์ฐจ๋‹จ ์ฃผํŒŒ์ˆ˜ \(f_c \approx 0.25f_s\)์—์„œ ์Šค๋ฌด๋”ฉ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.
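The frequency response of this FIR filter is easy to check numerically. The sketch below evaluates \(H(e^{j\omega}) = \frac{1}{3}\sum_{k=0}^{2} e^{-j\omega k}\) at a few normalized frequencies: unity gain at DC and an exact spectral null at \(f = f_s/3\):

```python
import numpy as np

# Frequency response of the 3-point moving-average FIR filter h = [1/3, 1/3, 1/3].
# A rough numerical check of the impulse-response definition given above.
h = np.array([1 / 3, 1 / 3, 1 / 3])
freqs = np.array([0.0, 1 / 4, 1 / 3])  # normalized frequencies f / fs
H = np.array([np.sum(h * np.exp(-1j * 2 * np.pi * f * np.arange(3))) for f in freqs])
# |H| is 1.0 at DC, about 0.3333 at fs/4, and 0 (a spectral null) at fs/3
print(np.abs(H).round(4))
```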

์š”์•ฝ

์ด ์„น์…˜์˜ ํ•ต์‹ฌ ํŒจํ„ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

current_val = input[global_i]
neighbor_val = shuffle_down(current_val, offset)
if lane < WARP_SIZE - offset:
    result = compute(current_val, neighbor_val)

ํ•ต์‹ฌ ์žฅ์ :

  • ํ•˜๋“œ์›จ์–ด ํšจ์œจ์„ฑ: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๊ฒฝ๊ณ„ ์•ˆ์ „์„ฑ: ์ž๋™ ์›Œํ”„ ๋ฒ”์œ„ ์ฒ˜๋ฆฌ
  • SIMT ์ตœ์ ํ™”: ๋‹จ์ผ ๋ช…๋ น, ๋ชจ๋“  ๋ ˆ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ

ํ™œ์šฉ ๋ถ„์•ผ: ์œ ํ•œ ์ฐจ๋ถ„, ์Šคํ…์‹ค ์—ฐ์‚ฐ, ์ด๋™ ํ‰๊ท , ํ•ฉ์„ฑ๊ณฑ.

warp.broadcast() ์ผ๋Œ€๋‹ค ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ์กฐ์ •์—์„œ๋Š” broadcast()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ ˆ์ธ์—์„œ ์›Œํ”„ ๋‚ด ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ๋ธ”๋ก ๋ ˆ๋ฒจ ๊ณ„์‚ฐ, ์กฐ๊ฑด๋ถ€ ๋กœ์ง ์กฐ์ •, ์ผ๋Œ€๋‹ค ํ†ต์‹  ํŒจํ„ด์„ ํšจ์œจ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: broadcast() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ๋ ˆ์ธ(๋ณดํ†ต ๋ ˆ์ธ 0)์ด ๊ณ„์‚ฐํ•œ ๊ฐ’์„ ๊ฐ™์€ ์›Œํ”„์˜ ๋ชจ๋“  ๋ ˆ์ธ์— ์ „๋‹ฌํ•˜๋ฉฐ, ํšจ์œจ์ ์ธ ์กฐ์ • ํŒจํ„ด๊ณผ ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์ด๋ž€? ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ์€ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๊ณ  ๊ทธ๋ฃน ๋‚ด ๋‹ค๋ฅธ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์™€ ๊ณต์œ ํ•˜๋Š” ํ†ต์‹  ํŒจํ„ด์ž…๋‹ˆ๋‹ค. ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต๊ณ„ ๊ณ„์‚ฐ, ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •, ์›Œํ”„ ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ์„ค์ • ํŒŒ๋ผ๋ฏธํ„ฐ ์ „๋‹ฌ ๋“ฑ์˜ ์กฐ์ • ์ž‘์—…์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • broadcast()๋ฅผ ํ™œ์šฉํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
  • ์ผ๋Œ€๋‹ค ํ†ต์‹  ํŒจํ„ด
  • ์ง‘ํ•ฉ ๊ณ„์‚ฐ ์ „๋žต
  • ๋ ˆ์ธ ๊ฐ„ ์กฐ๊ฑด๋ถ€ ์กฐ์ •
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-shuffle ๊ฒฐํ•ฉ ์—ฐ์‚ฐ

broadcast() ์—ฐ์‚ฐ์€ ํ•˜๋‚˜์˜ ๋ ˆ์ธ(๊ธฐ๋ณธ์ ์œผ๋กœ ๋ ˆ์ธ 0)์ด ์ž์‹ ์˜ ๊ฐ’์„ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{broadcast}(\text{value}) = \text{value_from_lane_0_to_all_lanes}\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ์กฐ์ • ํŒจํ„ด์ด ๊ฐ„๋‹จํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ์ง‘ํ•ฉ ๊ณ„์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
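The semantics of `broadcast()` can be expressed as a minimal plain-Python sketch over a list of per-lane values (a conceptual model, not SIMT hardware behavior):

```python
# Minimal sketch of broadcast() semantics on a plain Python list of lane values.
def broadcast(lane_vals, src_lane=0):
    # Every lane receives the value held by src_lane (lane 0 by default)
    return [lane_vals[src_lane]] * len(lane_vals)

vals = [10.0] + [0.0] * 31  # only lane 0 holds a meaningful value
print(broadcast(vals)[:4])  # [10.0, 10.0, 10.0, 10.0]
```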

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐœ๋…

๊ธฐ์กด ์กฐ์ • ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์›€
shared_memory[lane] = local_computation()
sync_threads()  # ๋น„์šฉ์ด ํฐ ๋™๊ธฐํ™”
if lane == 0:
    result = compute_from_shared_memory()
sync_threads()  # ๋˜ ๋‹ค๋ฅธ ๋น„์šฉ์ด ํฐ ๋™๊ธฐํ™”
final_result = shared_memory[0]  # ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฝ์Œ

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋™๊ธฐํ™”: ๋น„์šฉ์ด ํฐ ๋ฐฐ๋ฆฌ์–ด ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ ๋ฒˆ ํ•„์š”
  • ๋ณต์žกํ•œ ๋กœ์ง: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ธ๋ฑ์Šค์™€ ์ ‘๊ทผ ํŒจํ„ด ๊ด€๋ฆฌ
  • ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ์„ฑ: ๊ฒฝ์Ÿ ์ƒํƒœ๊ฐ€ ์‰ฝ๊ฒŒ ๋ฐœ์ƒ

broadcast()๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์กฐ์ •์ด ๊ฐ„๊ฒฐํ•ด์ง‘๋‹ˆ๋‹ค:

# ์›Œํ”„ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ์•ˆ์ „
collective_value = 0
if lane == 0:
    collective_value = compute_block_statistic()
collective_value = broadcast(collective_value)  # ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ 
result = use_collective_value(collective_value)

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”
  • ์ž๋™ ๋™๊ธฐํ™”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ๊ฐ„๋‹จํ•œ ํŒจํ„ด: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ๊ณ„์‚ฐํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์ด ์ˆ˜์‹ 
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ: ๋‹ค๋ฅธ ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ

1. ๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ

๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต๊ณ„๋ฅผ ๊ณ„์‚ฐํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ ํ•˜๋Š” ๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ:

  • ๋ ˆ์ธ 0์ด ํ˜„์žฌ ๋ธ”๋ก์˜ ์ฒ˜์Œ 4๊ฐœ ์š”์†Œ์˜ ํ•ฉ์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ์ด ๊ณ„์‚ฐ๋œ ๊ฐ’์„ broadcast()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„์˜ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ ๋ ˆ์ธ์€ ์ด ๊ณต์œ ๋œ ๊ฐ’์„ ์ž์‹ ์˜ ์ž…๋ ฅ ์š”์†Œ์— ๋”ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: ์ž…๋ ฅ [1, 2, 3, 4, 5, 6, 7, 8, ...]์€ ์ถœ๋ ฅ [11, 12, 13, 14, 15, 16, 17, 18, ...]์„ ์ƒ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

๊ณผ์ œ: ํ•˜๋‚˜์˜ ๋ ˆ์ธ๋งŒ ๋ธ”๋ก ๋ ˆ๋ฒจ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋˜, ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ทธ ๊ฒฐ๊ณผ๋ฅผ ์ž์‹ ์˜ ๊ฐœ๋ณ„ ์—ฐ์‚ฐ์— ์‚ฌ์šฉํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ•ด์•ผ ํ• ๊นŒ์š”?

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)

์™„์„ฑํ•  ์ฝ”๋“œ

def basic_broadcast[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
    Each lane then uses this broadcast value in its own computation.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())
    if global_i < size:
        var broadcast_value: output.ElementType = 0.0

        # FILL IN (roughly 10 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p25/p25.mojo

ํŒ

1. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋™์ž‘ ๋ฐฉ์‹ ์ดํ•ดํ•˜๊ธฐ

broadcast(value) ์—ฐ์‚ฐ์€ ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๊ฐ€์ ธ์™€ ์›Œํ”„์˜ ๋ชจ๋“  ๋ ˆ์ธ์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์—์„œ๋Š” ๋ ˆ์ธ 0์˜ ๊ฐ’๋งŒ ์˜๋ฏธ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ๋ ˆ์ธ์˜ ๊ฐ’์€ ๋ฌด์‹œ๋˜์ง€๋งŒ, ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ์ˆ˜์‹ ํ•ฉ๋‹ˆ๋‹ค.

์‹œ๊ฐํ™”:

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ „: ๋ ˆ์ธ 0์€ valโ‚€, ๋ ˆ์ธ 1์€ valโ‚, ๋ ˆ์ธ 2๋Š” valโ‚‚, ...
๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ›„: ๋ ˆ์ธ 0์€ valโ‚€, ๋ ˆ์ธ 1์€ valโ‚€, ๋ ˆ์ธ 2๋Š” valโ‚€, ...

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ๋ ˆ์ธ 0๋งŒ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•˜๋ ค๋Š” ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋„๋ก ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

2. ๋ ˆ์ธ๋ณ„ ๊ณ„์‚ฐ

๋ ˆ์ธ 0์ด ํŠน๋ณ„ํ•œ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ๋‹ค๋ฅธ ๋ ˆ์ธ์€ ๋Œ€๊ธฐํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ํŒจํ„ด:

var shared_value = ์ดˆ๊ธฐ๊ฐ’
if lane == 0:
    # ๋ ˆ์ธ 0๋งŒ ๊ณ„์‚ฐ
    shared_value = ํŠน๋ณ„ํ•œ_๊ณ„์‚ฐ()
# ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์— ์ฐธ์—ฌ
shared_value = broadcast(shared_value)

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ „์— ๋‹ค๋ฅธ ๋ ˆ์ธ์˜ ๊ฐ’์€ ์–ด๋–ค ์ƒํƒœ์—ฌ์•ผ ํ• ๊นŒ์š”?
  • ๋ ˆ์ธ 0์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ–๋„๋ก ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

3. ์ง‘ํ•ฉ์  ํ™œ์šฉ

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ›„ ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ๊ฐ’์„ ๊ฐ–๊ฒŒ ๋˜๋ฉฐ, ์ด๋ฅผ ๊ฐ์ž์˜ ๊ฐœ๋ณ„ ๊ณ„์‚ฐ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ๊ฐ ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’๊ณผ ์ž์‹ ์˜ ๋กœ์ปฌ ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ• ๊นŒ์š”?

๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ…Œ์ŠคํŠธ:

pixi run p25 --broadcast-basic
pixi run -e amd p25 --broadcast-basic
pixi run -e apple p25 --broadcast-basic
uv run poe p25 --broadcast-basic

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0])
expected: HostBuffer([11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0, 35.0, 36.0, 37.0, 38.0, 39.0, 40.0, 41.0, 42.0])
โœ… Basic broadcast test passed!

์†”๋ฃจ์…˜

def basic_broadcast[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
    """
    Basic broadcast: Lane 0 computes a block-local value, broadcasts it to all lanes.
    Each lane then uses this broadcast value in its own computation.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    if global_i < size:
        # Step 1: Lane 0 computes special value (sum of first 4 elements in this block)
        var broadcast_value: output.ElementType = 0.0
        if lane == 0:
            var block_start = block_idx.x * block_dim.x
            var sum: output.ElementType = 0.0
            for i in range(4):
                if block_start + i < size:
                    sum += input[block_start + i]
            broadcast_value = sum

        # Step 2: Broadcast lane 0's value to all lanes in this warp
        broadcast_value = broadcast(broadcast_value)

        # Step 3: All lanes use broadcast value in their computation
        output[global_i] = broadcast_value + input[global_i]


์ด ์†”๋ฃจ์…˜์€ ์›Œํ”„ ๋ ˆ๋ฒจ ์กฐ์ •์„ ์œ„ํ•œ ๊ธฐ๋ณธ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ํŠน๋ณ„ํ•œ ๊ฐ’์„ ๊ณ„์‚ฐ
    var broadcast_value: output.ElementType = 0.0
    if lane == 0:
        # ๋ ˆ์ธ 0๋งŒ ์ด ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰
        block_start = block_idx.x * block_dim.x
        var sum: output.ElementType = 0.0
        for i in range(4):
            if block_start + i < size:
                sum += input[block_start + i]
        broadcast_value = sum

    # ๋‹จ๊ณ„ 2: ๋ ˆ์ธ 0์˜ ๊ฐ’์„ ๋ชจ๋“  ๋ ˆ์ธ๊ณผ ๊ณต์œ 
    broadcast_value = broadcast(broadcast_value)

    # ๋‹จ๊ณ„ 3: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์„ ํ™œ์šฉ
    output[global_i] = broadcast_value + input[global_i]

SIMT ์‹คํ–‰ ์ถ”์ :

์‚ฌ์ดํด 1: ๋ ˆ์ธ๋ณ„ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0: input[0] + input[1] + input[2] + input[3] = 1+2+3+4 = 10์„ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 1: broadcast_value๋Š” 0.0 ์œ ์ง€ (๋ ˆ์ธ 0์ด ์•„๋‹˜)
  ๋ ˆ์ธ 2: broadcast_value๋Š” 0.0 ์œ ์ง€ (๋ ˆ์ธ 0์ด ์•„๋‹˜)
  ...
  ๋ ˆ์ธ 31: broadcast_value๋Š” 0.0 ์œ ์ง€ (๋ ˆ์ธ 0์ด ์•„๋‹˜)

์‚ฌ์ดํด 2: broadcast(broadcast_value) ์‹คํ–‰
  ๋ ˆ์ธ 0: ์ž์‹ ์˜ ๊ฐ’ ์œ ์ง€ โ†’ broadcast_value = 10.0
  ๋ ˆ์ธ 1: ๋ ˆ์ธ 0์˜ ๊ฐ’ ์ˆ˜์‹  โ†’ broadcast_value = 10.0
  ๋ ˆ์ธ 2: ๋ ˆ์ธ 0์˜ ๊ฐ’ ์ˆ˜์‹  โ†’ broadcast_value = 10.0
  ...
  ๋ ˆ์ธ 31: ๋ ˆ์ธ 0์˜ ๊ฐ’ ์ˆ˜์‹  โ†’ broadcast_value = 10.0

์‚ฌ์ดํด 3: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์„ ํ™œ์šฉํ•œ ๊ฐœ๋ณ„ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0: output[0] = 10.0 + input[0] = 10.0 + 1.0 = 11.0
  ๋ ˆ์ธ 1: output[1] = 10.0 + input[1] = 10.0 + 2.0 = 12.0
  ๋ ˆ์ธ 2: output[2] = 10.0 + input[2] = 10.0 + 3.0 = 13.0
  ...
  ๋ ˆ์ธ 31: output[31] = 10.0 + input[31] = 10.0 + 32.0 = 42.0

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๊ฐ€ ์šฐ์›”ํ•œ ์ด์œ :

  1. ์กฐ์ • ํšจ์œจ์„ฑ: ๋‹จ์ผ ์—ฐ์‚ฐ์œผ๋กœ ๋ชจ๋“  ๋ ˆ์ธ์„ ์กฐ์ •
  2. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  3. ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ž๋™์œผ๋กœ ์กฐ์ •์„ ์ฒ˜๋ฆฌ
  4. ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ํŒจํ„ด: ์›Œํ”„ ํฌ๊ธฐ์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ๋™์ผํ•˜๊ฒŒ ๋™์ž‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์—ฐ์‚ฐ 1 ์‚ฌ์ดํด
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ )
  • ์กฐ์ •: 32๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ์ž๋™ ๋™๊ธฐํ™”

2. ์กฐ๊ฑด๋ถ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ

๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ์กฐ๊ฑด๋ถ€ ์กฐ์ •์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ:

  • ๋ ˆ์ธ 0์ด ํ˜„์žฌ ๋ธ”๋ก์˜ ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์•„์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ์ด ์ตœ๋Œ“๊ฐ’์„ broadcast()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์— ์ „๋‹ฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ ๋ ˆ์ธ์€ ์กฐ๊ฑด๋ถ€ ๋กœ์ง์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค: ์ž์‹ ์˜ ์š”์†Œ๊ฐ€ ์ตœ๋Œ“๊ฐ’์˜ ์ ˆ๋ฐ˜๋ณด๋‹ค ํฌ๋ฉด 2๋ฐฐ๋กœ, ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ์ ˆ๋ฐ˜์œผ๋กœ ๋งŒ๋“ญ๋‹ˆ๋‹ค

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: ์ž…๋ ฅ [3, 1, 7, 2, 9, 4, 6, 8, ...] (๋ฐ˜๋ณต ํŒจํ„ด)์€ ์ถœ๋ ฅ [1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, ...]์„ ์ƒ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

๊ณผ์ œ: ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ถ„์„๊ณผ ์š”์†Œ๋ณ„ ์กฐ๊ฑด๋ถ€ ๋ณ€ํ™˜์„ ๋ชจ๋“  ๋ ˆ์ธ์— ๊ฑธ์ณ ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ• ๊นŒ์š”?

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

def conditional_broadcast[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
    All lanes apply different logic based on the broadcast decision.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())
    if global_i < size:
        var decision_value: output.ElementType = 0.0

        # FILL IN (roughly 10 lines)

        var current_input = input[global_i]
        var threshold = decision_value / 2.0
        if current_input >= threshold:
            output[global_i] = current_input * 2.0  # Double if >= threshold
        else:
            output[global_i] = current_input / 2.0  # Halve if < threshold


ํŒ

1. ๋ถ„์„๊ณผ ์˜์‚ฌ๊ฒฐ์ •

๋ ˆ์ธ 0์ด ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์˜ ๋™์ž‘์„ ์•ˆ๋‚ดํ•  ๊ฒฐ์ •์„ ๋‚ด๋ ค์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ถ„์„ํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?
  • ๋ ˆ์ธ์˜ ๋™์ž‘์„ ์กฐ์ •ํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋–ค ์ข…๋ฅ˜์˜ ๊ฒฐ์ •์„ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ• ๊นŒ์š”?
  • ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•  ๋•Œ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์€ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ์š”?

๊ณ ๋ คํ•  ํŒจํ„ด:

var decision = ๊ธฐ๋ณธ๊ฐ’
if lane == 0:
    # ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ๋ถ„์„
    decision = ๋ถ„์„_ํ›„_๊ฒฐ์ •()
decision = broadcast(decision)

2. ์กฐ๊ฑด๋ถ€ ์‹คํ–‰ ์กฐ์ •

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ๊ฒฐ์ •์„ ์ˆ˜์‹ ํ•œ ํ›„, ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ทธ ๊ฒฐ์ •์— ๊ธฐ๋ฐ˜ํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ๋กœ์ง์„ ์ ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์„ ์‚ฌ์šฉํ•˜์—ฌ ๋กœ์ปฌ ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ๋ฐฉ๋ฒ•์€?
  • ๊ฐ ์กฐ๊ฑด๋ถ€ ๋ถ„๊ธฐ์—์„œ ์–ด๋–ค ์—ฐ์‚ฐ์„ ์ ์šฉํ•ด์•ผ ํ• ๊นŒ์š”?
  • ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์ผ๊ด€๋œ ๋™์ž‘์„ ๋ณด์žฅํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ• ๊นŒ์š”?

์กฐ๊ฑด๋ถ€ ํŒจํ„ด:

if (๋กœ์ปฌ_๋ฐ์ดํ„ฐ๊ฐ€ broadcast_๊ธฐ์ค€์„ ์ถฉ์กฑ):
    # ํ•˜๋‚˜์˜ ๋ณ€ํ™˜ ์ ์šฉ
else:
    # ๋‹ค๋ฅธ ๋ณ€ํ™˜ ์ ์šฉ

3. ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ „๋žต

๋ ˆ์ธ 0์ด ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ถ„์„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ณ ๋ คํ•˜์„ธ์š”.

๊ณ ๋ คํ•  ์ ‘๊ทผ๋ฒ•:

  • ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’ ์ฐพ๊ธฐ
  • ํ‰๊ท ์ด๋‚˜ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ
  • ํŒจํ„ด์ด๋‚˜ ์ž„๊ณ„๊ฐ’ ๊ฐ์ง€
  • ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๊ธฐ๋ฐ˜ํ•œ ์ด์ง„ ๊ฒฐ์ •

์กฐ๊ฑด๋ถ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํ…Œ์ŠคํŠธ:

pixi run p25 --broadcast-conditional
pixi run -e amd p25 --broadcast-conditional
uv run poe p25 --broadcast-conditional

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0])
expected: HostBuffer([1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0, 1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0])
โœ… Conditional broadcast test passed!

์†”๋ฃจ์…˜

def conditional_broadcast[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
    """
    Conditional broadcast: Lane 0 makes a decision based on block-local data, broadcasts it to all lanes.
    All lanes apply different logic based on the broadcast decision.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    if global_i < size:
        # Step 1: Lane 0 analyzes block-local data and makes decision (find max of first 8 in block)
        var decision_value: output.ElementType = 0.0
        if lane == 0:
            var block_start = block_idx.x * block_dim.x
            decision_value = input[block_start] if block_start < size else 0.0
            for i in range(1, min(8, min(WARP_SIZE, size - block_start))):
                if block_start + i < size:
                    var current_val = input[block_start + i]
                    if current_val > decision_value:
                        decision_value = current_val

        # Step 2: Broadcast decision to all lanes in this warp
        decision_value = broadcast(decision_value)

        # Step 3: All lanes apply conditional logic based on broadcast decision
        var current_input = input[global_i]
        var threshold = decision_value / 2.0
        if current_input >= threshold:
            output[global_i] = current_input * 2.0  # Double if >= threshold
        else:
            output[global_i] = current_input / 2.0  # Halve if < threshold


์ด ์†”๋ฃจ์…˜์€ ๋ ˆ์ธ ๊ฐ„ ์กฐ๊ฑด๋ถ€ ์กฐ์ •์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„์„ํ•˜๊ณ  ๊ฒฐ์ •์„ ๋‚ด๋ฆผ
    var decision_value: output.ElementType = 0.0
    if lane == 0:
        # ๋ธ”๋ก์˜ ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ ์ค‘ ์ตœ๋Œ“๊ฐ’ ์ฐพ๊ธฐ
        block_start = block_idx.x * block_dim.x
        decision_value = input[block_start] if block_start < size else 0.0
        for i in range(1, min(8, min(WARP_SIZE, size - block_start))):
            if block_start + i < size:
                current_val = input[block_start + i]
                if current_val > decision_value:
                    decision_value = current_val

    # ๋‹จ๊ณ„ 2: ๊ฒฐ์ •์„ broadcastํ•˜์—ฌ ๋ชจ๋“  ๋ ˆ์ธ์„ ์กฐ์ •
    decision_value = broadcast(decision_value)

    # ๋‹จ๊ณ„ 3: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์— ๊ธฐ๋ฐ˜ํ•œ ์กฐ๊ฑด๋ถ€ ๋กœ์ง์„ ์ ์šฉ
    current_input = input[global_i]
    threshold = decision_value / 2.0
    if current_input >= threshold:
        output[global_i] = current_input * 2.0  # ์ž„๊ณ„๊ฐ’ ์ด์ƒ์ด๋ฉด 2๋ฐฐ
    else:
        output[global_i] = current_input / 2.0  # ์ž„๊ณ„๊ฐ’ ๋ฏธ๋งŒ์ด๋ฉด ์ ˆ๋ฐ˜

์˜์‚ฌ๊ฒฐ์ • ์‹คํ–‰ ์ถ”์ :

์ž…๋ ฅ ๋ฐ์ดํ„ฐ: [3.0, 1.0, 7.0, 2.0, 9.0, 4.0, 6.0, 8.0, ...]

๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ์ฒ˜์Œ 8๊ฐœ ์š”์†Œ์˜ ์ตœ๋Œ“๊ฐ’์„ ์ฐพ์Œ
  ๋ ˆ์ธ 0 ๋ถ„์„:
    input[0] = 3.0์œผ๋กœ ์‹œ์ž‘
    input[1] = 1.0๊ณผ ๋น„๊ต โ†’ 3.0 ์œ ์ง€
    input[2] = 7.0๊ณผ ๋น„๊ต โ†’ 7.0์œผ๋กœ ๊ฐฑ์‹ 
    input[3] = 2.0๊ณผ ๋น„๊ต โ†’ 7.0 ์œ ์ง€
    input[4] = 9.0๊ณผ ๋น„๊ต โ†’ 9.0์œผ๋กœ ๊ฐฑ์‹ 
    input[5] = 4.0๊ณผ ๋น„๊ต โ†’ 9.0 ์œ ์ง€
    input[6] = 6.0๊ณผ ๋น„๊ต โ†’ 9.0 ์œ ์ง€
    input[7] = 8.0๊ณผ ๋น„๊ต โ†’ 9.0 ์œ ์ง€
    ์ตœ์ข… decision_value = 9.0

๋‹จ๊ณ„ 2: decision_value = 9.0์„ ๋ชจ๋“  ๋ ˆ์ธ์— broadcast
  ๋ชจ๋“  ๋ ˆ์ธ: decision_value = 9.0, threshold = 4.5

๋‹จ๊ณ„ 3: ๋ ˆ์ธ๋ณ„ ์กฐ๊ฑด๋ถ€ ์‹คํ–‰
  ๋ ˆ์ธ 0: input[0] = 3.0 < 4.5 โ†’ output[0] = 3.0 / 2.0 = 1.5
  ๋ ˆ์ธ 1: input[1] = 1.0 < 4.5 โ†’ output[1] = 1.0 / 2.0 = 0.5
  ๋ ˆ์ธ 2: input[2] = 7.0 โ‰ฅ 4.5 โ†’ output[2] = 7.0 * 2.0 = 14.0
  ๋ ˆ์ธ 3: input[3] = 2.0 < 4.5 โ†’ output[3] = 2.0 / 2.0 = 1.0
  ๋ ˆ์ธ 4: input[4] = 9.0 โ‰ฅ 4.5 โ†’ output[4] = 9.0 * 2.0 = 18.0
  ๋ ˆ์ธ 5: input[5] = 4.0 < 4.5 โ†’ output[5] = 4.0 / 2.0 = 2.0
  ๋ ˆ์ธ 6: input[6] = 6.0 โ‰ฅ 4.5 โ†’ output[6] = 6.0 * 2.0 = 12.0
  ๋ ˆ์ธ 7: input[7] = 8.0 โ‰ฅ 4.5 โ†’ output[7] = 8.0 * 2.0 = 16.0
  ...๋‚˜๋จธ์ง€ ๋ ˆ์ธ์— ํŒจํ„ด ๋ฐ˜๋ณต

์ˆ˜ํ•™์  ๊ธฐ๋ฐ˜: ์ž„๊ณ„๊ฐ’ ๊ธฐ๋ฐ˜ ๋ณ€ํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large f(x) = \begin{cases} 2x & \text{if } x \geq \tau \\ \frac{x}{2} & \text{if } x < \tau \end{cases}\]

์—ฌ๊ธฐ์„œ \(\tau = \frac{\max(\text{block_data})}{2}\)๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ์ž„๊ณ„๊ฐ’์ž…๋‹ˆ๋‹ค.
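The decision trace above reproduces directly in plain Python (a sketch assuming the repeating test pattern [3, 1, 7, 2, 9, 4, 6, 8], not real GPU code):

```python
# Plain-Python re-enactment of the conditional broadcast decision trace.
WARP_SIZE = 32
pattern = [3.0, 1.0, 7.0, 2.0, 9.0, 4.0, 6.0, 8.0]
inp = (pattern * (WARP_SIZE // len(pattern)))[:WARP_SIZE]

# Step 1: lane 0 finds the max of the first 8 elements of the block
decision = max(inp[:8])     # 9.0
# Step 2: broadcast makes the decision visible to every lane
threshold = decision / 2.0  # 4.5
# Step 3: each lane applies the conditional transform to its own element
out = [x * 2.0 if x >= threshold else x / 2.0 for x in inp]
print(out[:8])  # [1.5, 0.5, 14.0, 1.0, 18.0, 2.0, 12.0, 16.0]
```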

์กฐ์ • ํŒจํ„ด์˜ ์žฅ์ :

  1. ์ค‘์•™ํ™”๋œ ๋ถ„์„: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ๋ถ„์„ํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์ด ํ˜œํƒ์„ ๋ฐ›์Œ
  2. ์ผ๊ด€๋œ ๊ฒฐ์ •: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ™์€ ์ž„๊ณ„๊ฐ’์„ ์‚ฌ์šฉ
  3. ์ ์‘ํ˜• ๋™์ž‘: ์ž„๊ณ„๊ฐ’์ด ๋ธ”๋ก ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ผ ์ ์‘
  4. ํšจ์œจ์  ์กฐ์ •: ๋‹จ์ผ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋ณต์žกํ•œ ์กฐ๊ฑด๋ถ€ ๋กœ์ง์„ ์กฐ์ •

ํ™œ์šฉ ๋ถ„์•ผ:

  • ์ ์‘ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋กœ์ปฌ ๋ฐ์ดํ„ฐ ํŠน์„ฑ์— ๋”ฐ๋ผ ํŒŒ๋ผ๋ฏธํ„ฐ ์กฐ์ •
  • ํ’ˆ์งˆ ๊ด€๋ฆฌ: ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ์ง€ํ‘œ์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ์ฒ˜๋ฆฌ ์ ์šฉ
  • ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๋ธ”๋ก ๋กœ์ปฌ ๋ณต์žก๋„ ๋ถ„์„์— ๊ธฐ๋ฐ˜ํ•œ ์ž‘์—… ๋ถ„๋ฐฐ

3. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-shuffle ์กฐ์ •

broadcast()์™€ shuffle_down()์„ ๋ชจ๋‘ ๊ฒฐํ•ฉํ•œ ๊ณ ๊ธ‰ ์กฐ์ •์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์š”๊ตฌ์‚ฌํ•ญ:

  • ๋ ˆ์ธ 0์ด ๋ธ”๋ก์˜ ์ฒ˜์Œ 4๊ฐœ ์š”์†Œ์˜ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•˜๊ณ  ์ด ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๋ชจ๋“  ๋ ˆ์ธ์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๊ฐ ๋ ˆ์ธ์€ shuffle_down(offset=1)์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์Œ ์ด์›ƒ์˜ ๊ฐ’์„ ๊ฐ€์ ธ์™€์•ผ ํ•ฉ๋‹ˆ๋‹ค
  • ๋Œ€๋ถ€๋ถ„์˜ ๋ ˆ์ธ: ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ์— (ํ˜„์žฌ_๊ฐ’ + ๋‹ค์Œ_์ด์›ƒ_๊ฐ’)์„ ๊ณฑํ•ฉ๋‹ˆ๋‹ค
  • ์›Œํ”„์˜ ๋งˆ์ง€๋ง‰ ๋ ˆ์ธ: ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ์— ํ˜„์žฌ_๊ฐ’๋งŒ ๊ณฑํ•ฉ๋‹ˆ๋‹ค (์œ ํšจํ•œ ์ด์›ƒ ์—†์Œ)

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: ์ž…๋ ฅ์€ [2, 4, 6, 8, 1, 3, 5, 7, ...] ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค (์ฒ˜์Œ 4๊ฐœ ์š”์†Œ: 2,4,6,8 ์ดํ›„ 1,3,5,7 ๋ฐ˜๋ณต)

  • ๋ ˆ์ธ 0์ด ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๊ณ„์‚ฐ: (2+4+6+8)/4 = 5.0
  • ์˜ˆ์ƒ ์ถœ๋ ฅ: [30.0, 50.0, 70.0, 45.0, 20.0, 40.0, 60.0, 40.0, ...]

๊ณผ์ œ: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์˜ ๊ณ„์‚ฐ์ด ๋ชจ๋“  ๋ ˆ์ธ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋ฉด์„œ, ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์ด์›ƒ ๋ฐ์ดํ„ฐ์—๋„ ์ ‘๊ทผํ•ด์•ผ ํ•˜๋Š” ์ƒํ™ฉ์—์„œ ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์–ด๋–ป๊ฒŒ ์กฐ์ •ํ• ๊นŒ์š”?

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

def broadcast_shuffle_coordination[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Combine broadcast() and shuffle_down() for advanced warp coordination.
    Lane 0 computes block-local scaling factor, broadcasts it to all lanes in the warp.
    Each lane uses shuffle_down() for neighbor access and applies broadcast factor.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())
    if global_i < size:
        var scale_factor: output.ElementType = 0.0

        # FILL IN (roughly 14 lines)


ํŒ

1. ๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์กฐ์ •

์ด ํผ์ฆ์€ broadcast์™€ ์…”ํ”Œ ์—ฐ์‚ฐ์„ ์ˆœ์„œ๋Œ€๋กœ ์กฐ์œจํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ๋ฆ„์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  1. ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ์ „์ฒด ์›Œํ”„๋ฅผ ์œ„ํ•œ ๊ฐ’์„ ๊ณ„์‚ฐ
  2. ์ด ๊ฐ’์ด ๋ชจ๋“  ๋ ˆ์ธ์— broadcast๋จ
  3. ๊ฐ ๋ ˆ์ธ์ด ์…”ํ”Œ๋กœ ์ด์›ƒ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผ
  4. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๊ฐ’์ด ์ด์›ƒ ๋ฐ์ดํ„ฐ์˜ ์ฒ˜๋ฆฌ ๋ฐฉ์‹์— ์˜ํ–ฅ

์กฐ์ • ํŒจํ„ด:

# ๋‹จ๊ณ„ 1: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์กฐ์ •
var shared_param = lane_0์ด๋ฉด_๊ณ„์‚ฐ()
shared_param = broadcast(shared_param)

# ๋‹จ๊ณ„ 2: ์…”ํ”Œ ์ด์›ƒ ์ ‘๊ทผ
current_val = input[global_i]
neighbor_val = shuffle_down(current_val, offset)

# ๋‹จ๊ณ„ 3: ๊ฒฐํ•ฉ ๊ณ„์‚ฐ
result = ๊ฒฐํ•ฉ(current_val, neighbor_val, shared_param)

2. ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณ„์‚ฐ ์ „๋žต

์ด์›ƒ ์—ฐ์‚ฐ์„ ์Šค์ผ€์ผ๋งํ•˜๋Š” ๋ฐ ์œ ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋ฌด์—‡์ผ์ง€ ๊ณ ๋ คํ•˜์„ธ์š”.

ํƒ๊ตฌํ•  ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋ฐ์ดํ„ฐ์—์„œ ์–ด๋–ค ํ†ต๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ด์•ผ ํ• ๊นŒ์š”?
  • ์ด ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ด์›ƒ ๊ธฐ๋ฐ˜ ๊ณ„์‚ฐ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์ณ์•ผ ํ• ๊นŒ์š”?
  • ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํฌํ•จ๋  ๋•Œ ์›Œํ”„ ๊ฒฝ๊ณ„์—์„œ ๋ฌด์Šจ ์ผ์ด ์ผ์–ด๋‚ ๊นŒ์š”?

3. ๊ฒฐํ•ฉ ์—ฐ์‚ฐ ์„ค๊ณ„

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ์™€ ์…”ํ”Œ ๊ธฐ๋ฐ˜ ์ด์›ƒ ์ ‘๊ทผ์„ ์˜๋ฏธ ์žˆ๊ฒŒ ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•˜์„ธ์š”.

ํŒจํ„ด ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ž…๋ ฅ, ์ถœ๋ ฅ, ๋˜๋Š” ๊ณ„์‚ฐ์„ ์Šค์ผ€์ผ๋งํ•ด์•ผ ํ• ๊นŒ์š”?
  • ์…”ํ”Œ์ด ๋ฏธ์ •์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒฝ๊ณ„ ์ผ€์ด์Šค๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ• ๊นŒ์š”?
  • ๊ฐ€์žฅ ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ ์ˆœ์„œ๋Š” ๋ฌด์—‡์ผ๊นŒ์š”?

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ-shuffle ์กฐ์ • ํ…Œ์ŠคํŠธ:

pixi run p25 --broadcast-shuffle-coordination
pixi run -e amd p25 --broadcast-shuffle-coordination
uv run poe p25 --broadcast-shuffle-coordination

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([30.0, 50.0, 70.0, 45.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 35.0])
expected: HostBuffer([30.0, 50.0, 70.0, 45.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 40.0, 20.0, 40.0, 60.0, 35.0])
โœ… Broadcast + shuffle coordination test passed!

์†”๋ฃจ์…˜

def broadcast_shuffle_coordination[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
):
    """
    Combine broadcast() and shuffle_down() for advanced warp coordination.
    Lane 0 computes block-local scaling factor, broadcasts it to all lanes in the warp.
    Each lane uses shuffle_down() for neighbor access and applies broadcast factor.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = Int(lane_id())

    if global_i < size:
        # Step 1: Lane 0 computes block-local scaling factor
        var scale_factor: output.ElementType = 0.0
        if lane == 0:
            # Compute average of first 4 elements in this block's data
            var block_start = block_idx.x * block_dim.x
            var sum: output.ElementType = 0.0
            for i in range(4):
                if block_start + i < size:
                    sum += input[block_start + i]
            scale_factor = sum / 4.0

        # Step 2: Broadcast scaling factor to all lanes in this warp
        scale_factor = broadcast(scale_factor)

        # Step 3: Each lane gets current and next values
        var current_val = input[global_i]
        var next_val = shuffle_down(current_val, 1)

        # Step 4: Apply broadcast factor with neighbor coordination
        if lane < WARP_SIZE - 1 and global_i < size - 1:
            # Combine current + next, then scale by broadcast factor
            output[global_i] = (current_val + next_val) * scale_factor
        else:
            # Last lane in warp or last element: only current value, scaled by broadcast factor
            output[global_i] = current_val * scale_factor


์ด ์†”๋ฃจ์…˜์€ broadcast์™€ ์…”ํ”Œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ฐ€์žฅ ๊ณ ๊ธ‰ ์›Œํ”„ ์กฐ์ • ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    # ๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ๋ธ”๋ก ๋กœ์ปฌ ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๊ณ„์‚ฐ
    var scale_factor: output.ElementType = 0.0
    if lane == 0:
        block_start = block_idx.x * block_dim.x
        var sum: output.ElementType = 0.0
        for i in range(4):
            if block_start + i < size:
                sum += input[block_start + i]
        scale_factor = sum / 4.0

    # ๋‹จ๊ณ„ 2: ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๋ชจ๋“  ๋ ˆ์ธ์— broadcast
    scale_factor = broadcast(scale_factor)

    # ๋‹จ๊ณ„ 3: ๊ฐ ๋ ˆ์ธ์ด shuffle์„ ํ†ตํ•ด ํ˜„์žฌ ๊ฐ’๊ณผ ๋‹ค์Œ ๊ฐ’์„ ๊ฐ€์ ธ์˜ด
    current_val = input[global_i]
    next_val = shuffle_down(current_val, 1)

    # ๋‹จ๊ณ„ 4: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒฉํ„ฐ๋ฅผ ์ด์›ƒ ์กฐ์ •๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ ์ ์šฉ
    if lane < WARP_SIZE - 1 and global_i < size - 1:
        output[global_i] = (current_val + next_val) * scale_factor
    else:
        output[global_i] = current_val * scale_factor

๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์‹คํ–‰ ์ถ”์ :

์ž…๋ ฅ ๋ฐ์ดํ„ฐ: [2, 4, 6, 8, 1, 3, 5, 7, ...]

๋‹จ๊ณ„ 1: ๋ ˆ์ธ 0์ด ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ๋ฅผ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0 ๊ณ„์‚ฐ: (input[0] + input[1] + input[2] + input[3]) / 4
              = (2 + 4 + 6 + 8) / 4 = 20 / 4 = 5.0
  ๋‹ค๋ฅธ ๋ ˆ์ธ: scale_factor๋Š” 0.0 ์œ ์ง€

๋‹จ๊ณ„ 2: scale_factor = 5.0์„ ๋ชจ๋“  ๋ ˆ์ธ์— broadcast
  ๋ชจ๋“  ๋ ˆ์ธ: scale_factor = 5.0

๋‹จ๊ณ„ 3: ์ด์›ƒ ์ ‘๊ทผ์„ ์œ„ํ•œ ์…”ํ”Œ ์—ฐ์‚ฐ
  ๋ ˆ์ธ 0: current_val = input[0] = 2, next_val = shuffle_down(2, 1) = input[1] = 4
  ๋ ˆ์ธ 1: current_val = input[1] = 4, next_val = shuffle_down(4, 1) = input[2] = 6
  ๋ ˆ์ธ 2: current_val = input[2] = 6, next_val = shuffle_down(6, 1) = input[3] = 8
  ๋ ˆ์ธ 3: current_val = input[3] = 8, next_val = shuffle_down(8, 1) = input[4] = 1
  ...
  ๋ ˆ์ธ 31: current_val = input[31], next_val = ๋ฏธ์ •์˜

๋‹จ๊ณ„ 4: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์Šค์ผ€์ผ๋ง๊ณผ ๊ฒฐํ•ฉํ•œ ๊ณ„์‚ฐ
  ๋ ˆ์ธ 0: output[0] = (2 + 4) * 5.0 = 6 * 5.0 = 30.0
  ๋ ˆ์ธ 1: output[1] = (4 + 6) * 5.0 = 10 * 5.0 = 50.0
  ๋ ˆ์ธ 2: output[2] = (6 + 8) * 5.0 = 14 * 5.0 = 70.0
  ๋ ˆ์ธ 3: output[3] = (8 + 1) * 5.0 = 9 * 5.0 = 45.0
  ...
  ๋ ˆ์ธ 31: output[31] = 7 * 5.0 = 35.0 (๊ฒฝ๊ณ„ - ์ด์›ƒ ์—†์Œ)

ํ†ต์‹  ํŒจํ„ด ๋ถ„์„: ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ˆ˜์ง ์กฐ์ • (broadcast): ๋ ˆ์ธ 0 โ†’ ๋ชจ๋“  ๋ ˆ์ธ
  2. ์ˆ˜ํ‰ ์กฐ์ • (shuffle): ๋ ˆ์ธ i โ†’ ๋ ˆ์ธ i+1
  3. ๊ฒฐํ•ฉ ๊ณ„์‚ฐ: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ฐ์ดํ„ฐ์™€ ์…”ํ”Œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ๋‘ ํ™œ์šฉ

์ˆ˜ํ•™์  ๊ธฐ๋ฐ˜: \[\Large \text{output}[i] = \begin{cases} (\text{input}[i]

  • \text{input}[i+1]) \cdot \beta & \text{if lane} i < \text{WARP_SIZE} - 1 \\ \text{input}[i] \cdot \beta & \text{if lane } i = \text{WARP_SIZE} - 1 \end{cases}\]

์—ฌ๊ธฐ์„œ \(\beta = \frac{1}{4}\sum_{k=0}^{3} \text{input}[\text{block_start} + k]\)๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ์ž…๋‹ˆ๋‹ค.

๊ณ ๊ธ‰ ์กฐ์ •์˜ ์žฅ์ :

  1. ๋‹ค๋‹จ๊ณ„ ํ†ต์‹ : ์ „์—ญ(broadcast)๊ณผ ์ง€์—ญ(shuffle) ์กฐ์ •์˜ ๊ฒฐํ•ฉ
  2. ์ ์‘ํ˜• ์Šค์ผ€์ผ๋ง: ๋ธ”๋ก ๋ ˆ๋ฒจ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์ด์›ƒ ์—ฐ์‚ฐ์— ์˜ํ–ฅ
  3. ํšจ์œจ์  ๊ตฌ์„ฑ: ๋‘ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋งค๋„๋Ÿฝ๊ฒŒ ํ˜‘๋ ฅ
  4. ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„: ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ

์‹ค์ œ ํ™œ์šฉ ์‚ฌ๋ก€:

  • ์ ์‘ํ˜• ํ•„ํ„ฐ๋ง: ๋ธ”๋ก ๋ ˆ๋ฒจ ๋…ธ์ด์ฆˆ ์ถ”์ •๊ณผ ์ด์›ƒ ๊ธฐ๋ฐ˜ ํ•„ํ„ฐ๋ง
  • ๋™์  ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ์ „์—ญ ์ž‘์—… ๋ถ„๋ฐฐ์™€ ๋กœ์ปฌ ์กฐ์ •
  • ๋‹ค์ค‘ ์Šค์ผ€์ผ ์ฒ˜๋ฆฌ: ์ „์—ญ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ๋กœ์ปฌ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์ œ์–ด

์š”์•ฝ

์ด ์„น์…˜์˜ ํ•ต์‹ฌ ํŒจํ„ด์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค

var shared_value = initial_value
if lane == 0:
    shared_value = compute_block_statistic()
shared_value = broadcast(shared_value)
result = use_shared_value(shared_value, local_data)

ํ•ต์‹ฌ ์žฅ์ :

  • ์ผ๋Œ€๋‹ค ์กฐ์ •: ํ•˜๋‚˜์˜ ๋ ˆ์ธ์ด ๊ณ„์‚ฐํ•˜๊ณ  ๋ชจ๋“  ๋ ˆ์ธ์ด ํ˜œํƒ์„ ๋ฐ›์Œ
  • ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: SIMT ์‹คํ–‰์ด ์กฐ์ •์„ ์ฒ˜๋ฆฌ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ํŒจํ„ด: ์…”ํ”Œ๊ณผ ๋‹ค๋ฅธ ์›Œํ”„ ์—ฐ์‚ฐ๊ณผ ์‰ฝ๊ฒŒ ๊ฒฐํ•ฉ

ํ™œ์šฉ ๋ถ„์•ผ: ๋ธ”๋ก ํ†ต๊ณ„, ์ง‘ํ•ฉ์  ์˜์‚ฌ๊ฒฐ์ •, ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณต์œ , ์ ์‘ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜.

Puzzle 26: ๊ณ ๊ธ‰ ์›Œํ”„ ํŒจํ„ด

๊ฐœ์š”

Puzzle 26: ๊ณ ๊ธ‰ ์›Œํ”„ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ์—์„œ๋Š” ์ •๊ตํ•œ GPU ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ๊ณผ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ, ์ฆ‰ ์›Œํ”„ ๋‚ด์—์„œ ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. shuffle_xor์„ ์‚ฌ์šฉํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ prefix_sum์„ ์‚ฌ์šฉํ•œ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ”์„ ๋ฐฐ์šฐ๋ฉฐ, ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์—†์ด ์ด๋ฅผ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ตํž™๋‹ˆ๋‹ค.

๋‹ฌ์„ฑ ๋ชฉํ‘œ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ๋‹ค๋‹จ๊ณ„ ๋ฆฌ๋•์…˜ ํŒจํ„ด์—์„œ ๋ฒ—์–ด๋‚˜, ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ๋ณ‘๋ ฌ ์Šค์บ” ์œ ๋‹›์„ ํ™œ์šฉํ•˜๋Š” ์šฐ์•„ํ•œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์›Œํ”„๋Š” ํ•˜๋“œ์›จ์–ด์—์„œ ์ •๊ตํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹ ๊ณผ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค - Mojo์˜ ๊ณ ๊ธ‰ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ์ „์šฉ ์Šค์บ” ์œ ๋‹›์„ ํ™œ์šฉํ•˜์—ฌ \(O(\log n)\) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ๋ช…๋ น ์ˆ˜์ค€์˜ ๊ฐ„๊ฒฐํ•จ์œผ๋กœ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ

๊ณ ๊ธ‰ ์›Œํ”„ ํ†ต์‹  ๋ชจ๋ธ

GPU ์›Œํ”„ ๋‚ด ์ •๊ตํ•œ ํ†ต์‹  ํŒจํ„ด์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ์›Œํ”„ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ (32 ์Šค๋ ˆ๋“œ, XOR ๊ธฐ๋ฐ˜ ํ†ต์‹ )
Offset 16: Lane 0 โ†” Lane 16, Lane 1 โ†” Lane 17, ..., Lane 15 โ†” Lane 31
Offset 8:  Lane 0 โ†” Lane 8,  Lane 1 โ†” Lane 9,  ..., Lane 23 โ†” Lane 31
Offset 4:  Lane 0 โ†” Lane 4,  Lane 1 โ†” Lane 5,  ..., Lane 27 โ†” Lane 31
Offset 2:  Lane 0 โ†” Lane 2,  Lane 1 โ†” Lane 3,  ..., Lane 29 โ†” Lane 31
Offset 1:  Lane 0 โ†” Lane 1,  Lane 2 โ†” Lane 3,  ..., Lane 30 โ†” Lane 31

ํ•˜๋“œ์›จ์–ด ๋ˆ„์  ํ•ฉ (๋ณ‘๋ ฌ ์Šค์บ” ๊ฐ€์†)
์ž…๋ ฅ:  [1, 2, 3, 4, 5, 6, 7, 8, ...]
์ถœ๋ ฅ: [1, 3, 6, 10, 15, 21, 28, 36, ...] (inclusive scan)
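위 입력/출력 관계는 파이썬 표준 라이브러리의 itertools.accumulate로 그대로 재현해 볼 수 있습니다. inclusive scan과 exclusive scan의 차이를 함께 보여 주는 개념 확인용 스케치입니다.

```python
# 개념 확인: inclusive scan(누적 합)을 파이썬 표준 라이브러리로 재현
from itertools import accumulate

inp = [1, 2, 3, 4, 5, 6, 7, 8]
inclusive = list(accumulate(inp))  # 각 원소까지 포함한 누적 합
# exclusive scan은 각 원소 "이전"까지의 합
exclusive = [s - x for s, x in zip(inclusive, inp)]
```

inclusive 결과가 위 예시의 출력 [1, 3, 6, 10, 15, 21, 28, 36]과 일치합니다.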

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ: XOR ๊ธฐ๋ฐ˜ ํ†ต์‹ ์ด ์ตœ์ ์˜ ํŠธ๋ฆฌ ํ† ํด๋กœ์ง€๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ์ „์šฉ ์Šค์บ” ์œ ๋‹›: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ
  • ๋กœ๊ทธ ๋ณต์žก๋„: \(O(\log n)\) ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด \(O(n)\) ์ˆœ์ฐจ ํŒจํ„ด์„ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ์ผ ์‚ฌ์ดํด ์—ฐ์‚ฐ: ๋ณต์žกํ•œ ๋ฆฌ๋•์…˜์ด ์ „์šฉ ํ•˜๋“œ์›จ์–ด์—์„œ ์ฒ˜๋ฆฌ๋ฉ๋‹ˆ๋‹ค

Mojo์˜ ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ

gpu.primitives.warp์˜ ์ •๊ตํ•œ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. shuffle_xor(value, mask): ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ XOR ๊ธฐ๋ฐ˜ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 
  2. prefix_sum(value): ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ
  3. ๊ณ ๊ธ‰ ์กฐ์ • ํŒจํ„ด: ์—ฌ๋Ÿฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜

์ฐธ๊ณ : ์ด ๊ธฐ๋ณธ ์š”์†Œ๋“ค์€ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜, ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, quicksort ํŒŒํ‹ฐ์…”๋‹, FFT ์—ฐ์‚ฐ ๋“ฑ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ • ์ฝ”๋“œ๊ฐ€ ์ˆ˜์‹ญ ์ค„ ํ•„์š”ํ–ˆ์„ ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ (๊ธฐ์กด ๋ฐฉ์‹ - Puzzle 14 ์ฐธ๊ณ ):
shared = TileTensor[
    dtype,
    row_major[WARP_SIZE](),
    MutAnyOrigin,
    address_space = AddressSpace.SHARED,
].stack_allocation()
shared[local_i] = input[global_i]
barrier()
offset = 1
for i in range(Int(log2(Scalar[dtype](WARP_SIZE)))):
    var current_val: output.element_type = 0
    if local_i >= offset and local_i < WARP_SIZE:
        current_val = shared[local_i - offset]
    barrier()
    if local_i >= offset and local_i < WARP_SIZE:
        shared[local_i] += current_val
    barrier()
    offset *= 2

# ๊ณ ๊ธ‰ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:
current_val = input[global_i]
scan_result = prefix_sum[exclusive=False](current_val)  # ๋‹จ์ผ ํ˜ธ์ถœ!
output[global_i] = scan_result
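두 방식이 같은 결과를 낸다는 것은 CPU에서도 확인할 수 있습니다. 아래는 위 '기존 방식' 루프(Hillis-Steele 스캔)를 파이썬 리스트로 옮긴 개념 확인용 스케치로, barrier()는 이터레이션마다 리스트 스냅샷을 뜨는 것으로 대신한다는 가정을 둡니다.

```python
# 가정: 위 '기존 방식' 루프(Hillis-Steele 스캔)를 파이썬으로 옮긴 CPU 스케치
from itertools import accumulate


def hillis_steele_scan(vals):
    shared = list(vals)
    offset = 1
    while offset < len(shared):
        snapshot = list(shared)  # barrier() 역할: 읽기/쓰기 단계 분리
        for i in range(len(shared)):
            if i >= offset:
                shared[i] = snapshot[i] + snapshot[i - offset]
        offset *= 2
    return shared


def hw_prefix_sum(vals):
    # 하드웨어 prefix_sum[exclusive=False] 호출 한 번에 해당하는 inclusive scan
    return list(accumulate(vals))
```

32개 값에 대해 두 함수의 결과가 같음을 확인하면, 다단계 공유 메모리 알고리즘 전체가 단일 prefix_sum 호출로 대체되는 이유가 드러납니다.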

๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด๋ณ„ ๋น„๊ต (๊ธฐ์กด ๋ฐฉ์‹ โ†’ ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ):

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด โ†’ ๋‹จ์ผ shuffle_xor ํŠธ๋ฆฌ
  • ๋ˆ„์  ํ•ฉ/์Šค์บ” ์—ฐ์‚ฐ: ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ โ†’ ํ•˜๋“œ์›จ์–ด prefix_sum
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ โ†’ prefix_sum + ์กฐ์ •
  • Quicksort ํŒŒํ‹ฐ์…˜: ์ˆ˜๋™ ์œ„์น˜ ๊ณ„์‚ฐ โ†’ ๊ฒฐํ•ฉ๋œ ๊ธฐ๋ณธ ์š”์†Œ
  • ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์žฌ๊ท€์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

์„ ์ˆ˜ ์ง€์‹

๊ณ ๊ธ‰ ์›Œํ”„ ํ†ต์‹ ์— ๋“ค์–ด๊ฐ€๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์— ์ต์ˆ™ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

  • Part VII ์›Œํ”„ ๊ธฐ์ดˆ: SIMT ์‹คํ–‰๊ณผ ๊ธฐ๋ณธ ์›Œํ”„ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ดํ•ด (Puzzle 24: ์›Œํ”„ ๊ธฐ์ดˆ์™€ Puzzle 25: ์›Œํ”„ ํ†ต์‹  ์ฐธ๊ณ )
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ด๋ก : ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜, ๋ณ‘๋ ฌ ์Šค์บ”, ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ
  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด๊ณผ ๋™๊ธฐํ™” (Puzzle 14: ๋ˆ„์  ํ•ฉ ์ฐธ๊ณ )
  • ์ˆ˜ํ•™ ์—ฐ์‚ฐ: XOR ์—ฐ์‚ฐ๊ณผ ๋กœ๊ทธ ๋ณต์žก๋„์— ๋Œ€ํ•œ ์ดํ•ด

ํ•™์Šต ๊ฒฝ๋กœ

1. shuffle_xor์„ ์ด์šฉํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

โ†’ warp.shuffle_xor()์™€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ

ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ XOR ๊ธฐ๋ฐ˜ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_xor()์œผ๋กœ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€ ๊ตฌ์„ฑํ•˜๊ธฐ
  • ํŠธ๋ฆฌ ํ†ต์‹ ์„ ํ™œ์šฉํ•œ \(O(\log n)\) ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ตฌํ˜„
  • XOR ๊ธฐ๋ฐ˜ ๋ ˆ์ธ ํŽ˜์–ด๋ง๊ณผ ํ†ต์‹  ํŒจํ„ด ์ดํ•ด
  • ๋‹ค์ค‘ ๊ฐ’ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณ ๊ธ‰ ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์—ฐ์‚ฐ

ํ•ต์‹ฌ ํŒจํ„ด:

max_val = input[global_i]
offset = WARP_SIZE // 2
while offset > 0:
    max_val = max(max_val, shuffle_xor(max_val, offset))
    offset //= 2
# ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ€์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

2. prefix_sum์„ ์ด์šฉํ•œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”

โ†’ warp.prefix_sum()๊ณผ ์Šค์บ” ์—ฐ์‚ฐ

๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฐฐ์šธ ๋‚ด์šฉ:

  • prefix_sum()์„ ํ™œ์šฉํ•œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ˆ„์  ์—ฐ์‚ฐ
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜๊ณผ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹ ๊ตฌํ˜„
  • prefix_sum๊ณผ shuffle_xor์„ ๊ฒฐํ•ฉํ•œ ๊ณ ๊ธ‰ ์กฐ์ •
  • Inclusive vs exclusive ์Šค์บ” ํŒจํ„ด ์ดํ•ด

ํ•ต์‹ฌ ํŒจํ„ด:

current_val = input[global_i]
scan_result = prefix_sum[exclusive=False](current_val)
output[global_i] = scan_result  # ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ˆ„์  ํ•ฉ

ํ•ต์‹ฌ ๊ฐœ๋…

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ†ต์‹ 

XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํ† ํด๋กœ์ง€๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • XOR ํŽ˜์–ด๋ง: lane_id โŠ• mask๊ฐ€ ๋Œ€์นญ ํ†ต์‹  ์Œ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ๊ณ„์ธต์  ๋ฐ์ดํ„ฐ ๊ตํ™˜์„ ํ†ตํ•œ ๋กœ๊ทธ ๋ณต์žก๋„
  • ๋ณ‘๋ ฌ ์กฐ์ •: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋ฆฌ๋•์…˜์— ๋™์‹œ์— ์ฐธ์—ฌํ•ฉ๋‹ˆ๋‹ค
  • ๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜: 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ WARP_SIZE (32, 64 ๋“ฑ) ์–ด๋””์„œ๋‚˜ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค

ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”

์ „์šฉ ์Šค์บ” ์œ ๋‹›์˜ ๋Šฅ๋ ฅ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

  • ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ํ™œ์šฉํ•œ ๋ˆ„์  ์—ฐ์‚ฐ
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜
  • ๋‹จ์ผ ํ•จ์ˆ˜ ๊ฐ„๊ฒฐ์„ฑ: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋‹จ์ผ ํ˜ธ์ถœ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ชจ๋“  ์กฐ์ •์„ ๋‚ด๋ถ€์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„ ๋ณ€ํ™˜

๊ธฐ์กด ํŒจํ„ด์„ ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ์ˆœ์ฐจ ๋ฆฌ๋•์…˜ (\(O(n)\)) โ†’ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ (\(O(\log n)\))
  • ๋‹ค๋‹จ๊ณ„ ์Šค์บ” ์•Œ๊ณ ๋ฆฌ์ฆ˜ โ†’ ๋‹จ์ผ ํ•˜๋“œ์›จ์–ด prefix_sum
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด โ†’ ๋ ˆ์ง€์Šคํ„ฐ ์ „์šฉ ์—ฐ์‚ฐ
  • ๋ช…์‹œ์  ๋™๊ธฐํ™” โ†’ ํ•˜๋“œ์›จ์–ด ๊ด€๋ฆฌ ์กฐ์ •

๊ณ ๊ธ‰ ์กฐ์ • ํŒจํ„ด

์—ฌ๋Ÿฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ์ด์ค‘ ๋ฆฌ๋•์…˜: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์„ ํ™œ์šฉํ•œ ๋™์‹œ min/max ์ถ”์ 
  • ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹: quicksort ์Šคํƒ€์ผ ์—ฐ์‚ฐ์„ ์œ„ํ•œ shuffle_xor + prefix_sum
  • ์กฐ๊ฑด๋ถ€ ์—ฐ์‚ฐ: ์ „์—ญ ์กฐ์ •์„ ํ†ตํ•œ ๋ ˆ์ธ ๊ธฐ๋ฐ˜ ์ถœ๋ ฅ ์„ ํƒ
  • ๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ตœ์  ์„ฑ๋Šฅ์˜ ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํŒจํ„ด

์‹œ์ž‘ํ•˜๊ธฐ

๊ณ ๊ธ‰ GPU ์›Œํ”„ ๋ ˆ๋ฒจ ํ†ต์‹ ์„ ํ™œ์šฉํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ์—ฐ์‚ฐ์œผ๋กœ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹ ์„ ์ดํ•ดํ•œ ๋‹ค์Œ, ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”์œผ๋กœ ๋‚˜์•„๊ฐ€ ์ตœ์ ์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜์„ธ์š”.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ๊ณ ๊ธ‰ ์›Œํ”„ ์—ฐ์‚ฐ์„ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋นŒ๋”ฉ ๋ธ”๋ก์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š”. ์ด ๊ธฐ๋ณธ ์š”์†Œ๋“ค์€ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ „์ฒด ๋ฒ”์ฃผ๋ฅผ ๋‹จ์ผ ์ตœ์ ํ™” ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋ชฉํ‘œ: Puzzle 26์„ ๋งˆ์น˜๋ฉด, ๊ณ ๊ธ‰ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์ธ์‹ํ•˜์—ฌ ํ›จ์”ฌ ๊ฐ„๋‹จํ•˜๊ณ  ๋น ๋ฅธ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ฆฌ๋•์…˜, ๋ณ‘๋ ฌ ์Šค์บ”, ์กฐ์ • ํŒจํ„ด์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ: warp.shuffle_xor()์™€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์—์„œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์„ ๋ฐฐ์šด ๋‹ค์Œ, warp.prefix_sum()๊ณผ ์Šค์บ” ์—ฐ์‚ฐ์—์„œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ” ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ€์„ธ์š”!

warp.shuffle_xor() ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ 

์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์—์„œ๋Š” shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๋‚ด์— ์ •๊ตํ•œ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด์„ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋‚˜ ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜, ์ •๋ ฌ ๋„คํŠธ์›Œํฌ, ๊ณ ๊ธ‰ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: shuffle_xor() ์—ฐ์‚ฐ์€ SIMT ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•˜๋ฉฐ, ์›Œํ”„ ํฌ๊ธฐ์— ๋Œ€ํ•ด \(O(\log n)\) ๋ณต์žก๋„๋กœ ํ™•์žฅ๋˜๋Š” ํšจ์œจ์ ์ธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์™€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋ž€? ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๋Š” ์Šค๋ ˆ๋“œ๋“ค์ด ์ธ๋ฑ์Šค์˜ XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•˜๋Š” ํ†ต์‹  ํ† ํด๋กœ์ง€์ž…๋‹ˆ๋‹ค. ์ด๋ฆ„์€ ์‹œ๊ฐ์ ์œผ๋กœ ๊ทธ๋ ธ์„ ๋•Œ ๋‚˜๋น„ ๋‚ ๊ฐœ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ์—ฐ๊ฒฐ ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋„คํŠธ์›Œํฌ๋Š” \(O(\log n)\) ํ†ต์‹  ๋ณต์žก๋„๋ฅผ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— FFT, bitonic ์ •๋ ฌ, ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ๊ฐ™์€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • shuffle_xor()์„ ํ™œ์šฉํ•œ XOR ๊ธฐ๋ฐ˜ ํ†ต์‹  ํŒจํ„ด
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ ํ† ํด๋กœ์ง€
  • \(O(\log n)\) ๋ณต์žก๋„์˜ ํŠธ๋ฆฌ ๊ธฐ๋ฐ˜ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜
  • ๊ณ ๊ธ‰ ์กฐ์ •์„ ์œ„ํ•œ ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์—ฐ์‚ฐ
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ

shuffle_xor ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ํŒจํ„ด์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{shuffle_xor}(\text{value}, \text{mask}) = \text{value_from_lane}(\text{lane_id} \oplus \text{mask})\]

์ด๋ฅผ ํ†ตํ•ด ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ์กฐ์ • ์—†์ด ํšจ์œจ์ ์ธ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜๊ณผ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ๊ธฐ๋ณธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)

shuffle_xor ๊ฐœ๋…

๊ธฐ์กด ํŽ˜์–ด ๊ตํ™˜ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ๊ณผ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ๊ธฐ์กด ๋ฐฉ์‹ - ๋ณต์žกํ•˜๊ณ  ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”
shared_memory[lane] = input[global_i]
barrier()
if lane % 2 == 0:
    partner = lane + 1
else:
    partner = lane - 1
if partner < WARP_SIZE:
    swapped_val = shared_memory[partner]

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”
  • ๋ณต์žกํ•œ ๋กœ์ง: ์ˆ˜๋™ ํŒŒํŠธ๋„ˆ ๊ณ„์‚ฐ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ๋‚ฎ์€ ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ํ†ต์‹ ์„ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•จ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜๋ฉด ํŽ˜์–ด ๊ตํ™˜์ด ์šฐ์•„ํ•ด์ง‘๋‹ˆ๋‹ค:

# ๋ฒ„ํ„ฐํ”Œ๋ผ์ด XOR ๋ฐฉ์‹ - ๊ฐ„๋‹จํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”
current_val = input[global_i]
swapped_val = shuffle_xor(current_val, 1)  # 1๊ณผ XORํ•˜๋ฉด ํŽ˜์–ด๊ฐ€ ์ƒ์„ฑ๋จ
output[global_i] = swapped_val

shuffle_xor์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ชจ๋“  ๋ ˆ์ธ์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_xor()์„ ์‚ฌ์šฉํ•˜์—ฌ ์ธ์ ‘ ํŽ˜์–ด ๊ฐ„ ๊ฐ’์„ ๊ตํ™˜ํ•˜๋Š” ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: XOR ํŒจํ„ด์œผ๋กœ ์ธ์ ‘ ํŽ˜์–ด๋ฅผ ๋งŒ๋“ค์–ด ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \text{input}[i \oplus 1]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [0, 1, 2, 3, 4, 5, 6, 7, ...]์„ ํŽ˜์–ด [1, 0, 3, 2, 5, 4, 7, 6, ...]์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๊ฐ ํŽ˜์–ด (i, i+1)์ด XOR ํ†ต์‹ ์œผ๋กœ ๊ฐ’์„ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค.


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p26/p26.mojo

ํŒ

1. shuffle_xor ์ดํ•ดํ•˜๊ธฐ

shuffle_xor(value, mask) ์—ฐ์‚ฐ์€ ๊ฐ ๋ ˆ์ธ์ด XOR ๋งˆ์Šคํฌ๋งŒํผ ์ฐจ์ด๋‚˜๋Š” ๋ ˆ์ธ๊ณผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ตํ™˜ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ๋งˆ์Šคํฌ ๊ฐ’์œผ๋กœ ๋ ˆ์ธ ID๋ฅผ XORํ–ˆ์„ ๋•Œ ์–ด๋–ค ์ผ์ด ์ผ์–ด๋‚˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํƒ๊ตฌํ•  ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ ˆ์ธ 0์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ๋ ˆ์ธ 1์ด ๋งˆ์Šคํฌ 1๋กœ XORํ•˜๋ฉด ์–ด๋–ค ํŒŒํŠธ๋„ˆ๋ฅผ ์–ป๋‚˜์š”?
  • ํŒจํ„ด์ด ๋ณด์ด๋‚˜์š”?

ํžŒํŠธ: ์ฒ˜์Œ ๋ช‡ ๊ฐœ์˜ ๋ ˆ์ธ ID์— ๋Œ€ํ•ด XOR ์—ฐ์‚ฐ์„ ์ง์ ‘ ํ•ด๋ณด๋ฉด ํŽ˜์–ด๋ง ํŒจํ„ด์„ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

2. XOR ํŽ˜์–ด ํŒจํ„ด

๋ ˆ์ธ ID์˜ ์ด์ง„ ํ‘œํ˜„๊ณผ ์ตœํ•˜์œ„ ๋น„ํŠธ๋ฅผ ๋’ค์ง‘์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

๊ณ ๋ คํ•  ์งˆ๋ฌธ:

  • ์ง์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ํ™€์ˆ˜ ๋ ˆ์ธ์„ 1๊ณผ XORํ•˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ์™œ ์ด๊ฒƒ์ด ์™„๋ฒฝํ•œ ํŽ˜์–ด๋ฅผ ๋งŒ๋“œ๋‚˜์š”?

3. ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ๋ถˆํ•„์š”

shuffle_down()๊ณผ ๋‹ฌ๋ฆฌ shuffle_xor() ์—ฐ์‚ฐ์€ ์›Œํ”„ ๊ฒฝ๊ณ„ ๋‚ด์—์„œ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ž‘์€ ๋งˆ์Šคํฌ๋กœ์˜ XOR์ด ์ ˆ๋Œ€๋กœ ๋ฒ”์œ„ ๋ฐ–์˜ ๋ ˆ์ธ ID๋ฅผ ๋งŒ๋“ค์ง€ ์•Š๋Š” ์ด์œ ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: ์œ ํšจํ•œ ๋ ˆ์ธ ID๋ฅผ 1๊ณผ XORํ–ˆ์„ ๋•Œ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ๋ ˆ์ธ ID๋Š” ์–ผ๋งˆ์ธ๊ฐ€์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŽ˜์–ด ๊ตํ™˜ ํ…Œ์ŠคํŠธ:

pixi run p26 --pair-swap
pixi run -e amd p26 --pair-swap
pixi run -e apple p26 --pair-swap
uv run poe p26 --pair-swap

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
expected: [1.0, 0.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0, 8.0, 11.0, 10.0, 13.0, 12.0, 15.0, 14.0, 17.0, 16.0, 19.0, 18.0, 21.0, 20.0, 23.0, 22.0, 25.0, 24.0, 27.0, 26.0, 29.0, 28.0, 31.0, 30.0]
โœ… Butterfly pair swap test passed!

์†”๋ฃจ์…˜

def butterfly_pair_swap[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Basic butterfly pair swap: Exchange values between adjacent pairs using XOR pattern.
    Each thread exchanges its value with its XOR-1 neighbor, creating pairs: (0,1), (2,3), (4,5), etc.
    Uses shuffle_xor(val, 1) to swap values within each pair.
    This is the foundation of butterfly network communication patterns.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var current_val = input[global_i]

        # Exchange with XOR-1 neighbor using butterfly pattern
        # Lane 0 exchanges with lane 1, lane 2 with lane 3, etc.
        var swapped_val = shuffle_xor(current_val, 1)

        # For demonstration, we'll store the swapped value
        # In real applications, this might be used for sorting, reduction, etc.
        output[global_i] = swapped_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด XOR ํ†ต์‹  ํŒจํ„ด์„ ํ†ตํ•ด ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ์–ด๋–ป๊ฒŒ ๋งŒ๋“œ๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]              # ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ฝ์Œ
    swapped_val = shuffle_xor(current_val, 1)  # XOR๋กœ ํŽ˜์–ด ๊ตํ™˜ ์ƒ์„ฑ

    # ๊ตํ™˜๋œ ๊ฐ’์„ ์ €์žฅ
    output[global_i] = swapped_val

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  Lane 0: current_val = input[0] = 0
  Lane 1: current_val = input[1] = 1
  Lane 2: current_val = input[2] = 2
  Lane 3: current_val = input[3] = 3
  ...
  Lane 31: current_val = input[31] = 31

์‚ฌ์ดํด 2: shuffle_xor(current_val, 1)์ด ๋ชจ๋“  ๋ ˆ์ธ์—์„œ ์‹คํ–‰
  Lane 0: Lane 1์—์„œ ์ˆ˜์‹  (0โŠ•1=1) โ†’ swapped_val = 1
  Lane 1: Lane 0์—์„œ ์ˆ˜์‹  (1โŠ•1=0) โ†’ swapped_val = 0
  Lane 2: Lane 3์—์„œ ์ˆ˜์‹  (2โŠ•1=3) โ†’ swapped_val = 3
  Lane 3: Lane 2์—์„œ ์ˆ˜์‹  (3โŠ•1=2) โ†’ swapped_val = 2
  ...
  Lane 30: Lane 31์—์„œ ์ˆ˜์‹  (30โŠ•1=31) โ†’ swapped_val = 31
  Lane 31: Lane 30์—์„œ ์ˆ˜์‹  (31โŠ•1=30) โ†’ swapped_val = 30

์‚ฌ์ดํด 3: ๊ฒฐ๊ณผ ์ €์žฅ
  Lane 0: output[0] = 1
  Lane 1: output[1] = 0
  Lane 2: output[2] = 3
  Lane 3: output[3] = 2
  ...

์ˆ˜ํ•™์  ํ†ต์ฐฐ: XOR ์†์„ฑ์„ ํ™œ์šฉํ•œ ์™„๋ฒฝํ•œ ํŽ˜์–ด ๊ตํ™˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{XOR}(i, 1) = \begin{cases} i + 1 & \text{if } i \bmod 2 = 0 \\ i - 1 & \text{if } i \bmod 2 = 1 \end{cases}\]

shuffle_xor์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ์™„๋ฒฝํ•œ ๋Œ€์นญ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ํŽ˜์–ด์— ์ฐธ์—ฌ
  2. ์กฐ์ • ๋ถˆํ•„์š”: ๋ชจ๋“  ํŽ˜์–ด๊ฐ€ ๋™์‹œ์— ๊ตํ™˜
  3. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์›Œํ”„ ์ „์ฒด์— ๋Œ€ํ•ด ๋‹จ์ผ ๋ช…๋ น์œผ๋กœ ์ฒ˜๋ฆฌ
  4. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๊ธฐ๋ฐ˜: ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋นŒ๋”ฉ ๋ธ”๋ก

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (ํ•˜๋“œ์›จ์–ด ๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜)
  • ๋Œ€์—ญํญ: 0 ๋ฐ”์ดํŠธ (๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์—†์Œ)
  • ๋ณ‘๋ ฌ์„ฑ: WARP_SIZE๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ๊ตํ™˜
  • ํ™•์žฅ์„ฑ: ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์— ๊ด€๊ณ„์—†์ด \(O(1)\) ๋ณต์žก๋„

2. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

๊ฐ์†Œํ•˜๋Š” offset์œผ๋กœ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด shuffle_xor์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ํ†ตํ•ด ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์—์„œ ์ตœ๋Œ“๊ฐ’์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{max_result} = \max_{i=0}^{\small\text{WARP_SIZE}-1} \text{input}[i]\]

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŒจํ„ด: XOR ์˜คํ”„์…‹์„ WARP_SIZE/2์—์„œ 1๊นŒ์ง€ ์ ˆ๋ฐ˜์”ฉ ์ค„์—ฌ๊ฐ€๋ฉฐ, ํ†ต์‹  ๋ฒ”์œ„๊ฐ€ ๋‹จ๊ณ„๋งˆ๋‹ค ๋ฐ˜์œผ๋กœ ์ข์•„์ง€๋Š” ์ด์ง„ ํŠธ๋ฆฌ๋ฅผ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • 1๋‹จ๊ณ„: WARP_SIZE/2 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (์›Œํ”„ ์ „์ฒด๋ฅผ ํฌ๊ด„)
  • 2๋‹จ๊ณ„: WARP_SIZE/4 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต (๋ฒ”์œ„๋ฅผ ์ ˆ๋ฐ˜์œผ๋กœ ์ขํž˜)
  • 3๋‹จ๊ณ„: WARP_SIZE/8 ๊ฑฐ๋ฆฌ์˜ ๋ ˆ์ธ๊ณผ ๋น„๊ต
  • 4๋‹จ๊ณ„: offset = 1์ด ๋  ๋•Œ๊นŒ์ง€ ๊ณ„์† ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์ž„

\(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„๋ฅผ ๊ฑฐ์น˜๋ฉด ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ–๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ์ด ๋ฐฉ์‹์€ ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค.
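위에서 설명한 오프셋 열(WARP_SIZE/2에서 1까지 절반씩)은 짧은 파이썬 함수로 나열해 볼 수 있습니다. butterfly_offsets는 설명용으로 도입한 가상의 헬퍼입니다.

```python
def butterfly_offsets(warp_size):
    # WARP_SIZE/2에서 시작해 1까지 절반씩 줄이는 XOR 오프셋 열
    offsets = []
    offset = warp_size // 2
    while offset > 0:
        offsets.append(offset)
        offset //= 2
    return offsets
```

32-스레드 워프는 [16, 8, 4, 2, 1]의 5단계, 64-스레드 워프는 [32, 16, 8, 4, 2, 1]의 6단계, 즉 log₂(WARP_SIZE) 단계가 됩니다.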

def butterfly_parallel_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets starting from WARP_SIZE/2 down to 1.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm that works
    for any WARP_SIZE (32, 64, etc.).
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    # FILL ME IN (roughly 7 lines)


ํŒ

1. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ์ดํ•ดํ•˜๊ธฐ

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ ์ด์ง„ ํŠธ๋ฆฌ ํ†ต์‹  ํŒจํ„ด์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„์—์„œ ๋ฌธ์ œ ํฌ๊ธฐ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ์ตœ๋Œ€ ๋ฒ”์œ„๋ฅผ ์ปค๋ฒ„ํ•˜๋ ค๋ฉด ์‹œ์ž‘ offset์ด ์–ผ๋งˆ์—ฌ์•ผ ํ•˜๋‚˜์š”?
  • ๋‹จ๊ณ„ ์‚ฌ์ด์— ์˜คํ”„์…‹์„ ์–ด๋–ป๊ฒŒ ๋ณ€๊ฒฝํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ์–ธ์ œ ๋ฆฌ๋•์…˜์„ ๋ฉˆ์ถฐ์•ผ ํ•˜๋‚˜์š”?

ํžŒํŠธ: โ€œ๋ฒ„ํ„ฐํ”Œ๋ผ์ดโ€œ๋ผ๋Š” ์ด๋ฆ„์€ ํ†ต์‹  ํŒจํ„ด์—์„œ ์œ ๋ž˜ํ•ฉ๋‹ˆ๋‹ค - ์ž‘์€ ์˜ˆ์ œ์— ๋Œ€ํ•ด ์ง์ ‘ ๊ทธ๋ ค๋ณด์„ธ์š”.

2. XOR ๋ฆฌ๋•์…˜ ํŠน์„ฑ

XOR์€ ๊ฐ ๋‹จ๊ณ„์—์„œ ๊ฒน์น˜์ง€ ์•Š๋Š” ํ†ต์‹  ํŽ˜์–ด๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์—์„œ ์™œ ์ค‘์š”ํ•œ์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ์„œ๋กœ ๋‹ค๋ฅธ ์˜คํ”„์…‹์œผ๋กœ์˜ XOR์ด ์–ด๋–ป๊ฒŒ ๋‹ค๋ฅธ ํ†ต์‹  ํŒจํ„ด์„ ๋งŒ๋“œ๋‚˜์š”?
  • ๊ฐ™์€ ๋‹จ๊ณ„์—์„œ ๋ ˆ์ธ๋“ค์ด ์™œ ์„œ๋กœ ๊ฐ„์„ญํ•˜์ง€ ์•Š๋‚˜์š”?
  • XOR์ด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์— ํŠนํžˆ ์ ํ•ฉํ•œ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

3. ์ตœ๋Œ“๊ฐ’ ๋ˆ„์ 

๊ฐ ๋ ˆ์ธ์€ ์ž์‹ ์˜ โ€œ์˜์—ญโ€œ์—์„œ ์ตœ๋Œ“๊ฐ’์˜ ์ง€์‹์„ ์ ์ง„์ ์œผ๋กœ ์Œ“์•„๊ฐ€์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ:

  • ์ž์‹ ์˜ ๊ฐ’์œผ๋กœ ์‹œ์ž‘
  • ๊ฐ ๋‹จ๊ณ„์—์„œ ์ด์›ƒ์˜ ๊ฐ’๊ณผ ๋น„๊ต
  • ์ตœ๋Œ“๊ฐ’์„ ์œ ์ง€ํ•˜๊ณ  ๊ณ„์† ์ง„ํ–‰

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๊ฐ ๋‹จ๊ณ„ ํ›„, โ€œ์ง€์‹์˜ ์˜์—ญโ€œ์ด ๋‘ ๋ฐฐ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.

  • ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„ ํ›„: ๊ฐ ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ์•Œ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

4. ์ด ํŒจํ„ด์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ 

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์€ \(\log_2(\text{WARP_SIZE})\) ๋‹จ๊ณ„ ํ›„์— ๋‹ค์Œ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

  • ๋ชจ๋“  ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ชจ๋“  ๋ ˆ์ธ์˜ ๊ฐ’์„ ๊ฐ„์ ‘์ ์œผ๋กœ ํ™•์ธ
  • ์ค‘๋ณต ํ†ต์‹  ์—†์Œ: ๊ฐ ํŽ˜์–ด๊ฐ€ ๋‹จ๊ณ„๋‹น ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ๊ตํ™˜
  • ์ตœ์  ๋ณต์žก๋„: \(O(n)\) ์ˆœ์ฐจ ๋น„๊ต ๋Œ€์‹  \(O(\log n)\) ๋‹จ๊ณ„

์ถ”์  ์˜ˆ์ œ (4๊ฐœ ๋ ˆ์ธ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ: Lane 0=3, Lane 1=1, Lane 2=7, Lane 3=2

1๋‹จ๊ณ„ (offset=2): 0 โ†” 2, 1 โ†” 3
  Lane 0: max(3, 7) = 7
  Lane 1: max(1, 2) = 2
  Lane 2: max(7, 3) = 7
  Lane 3: max(2, 1) = 2

2๋‹จ๊ณ„ (offset=1): 0 โ†” 1, 2 โ†” 3
  Lane 0: max(7, 2) = 7
  Lane 1: max(2, 7) = 7
  Lane 2: max(7, 2) = 7
  Lane 3: max(2, 7) = 7

๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’ = 7์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ณ‘๋ ฌ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --parallel-max
pixi run -e amd p26 --parallel-max
uv run poe p26 --parallel-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
expected: [1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0, 1000.0]
โœ… Butterfly parallel max test passed!

์†”๋ฃจ์…˜

def butterfly_parallel_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Parallel maximum reduction using butterfly pattern.
    Uses shuffle_xor with decreasing offsets (16, 8, 4, 2, 1) to perform tree-based reduction.
    Each step reduces the active range by half until all threads have the maximum value.
    This implements an efficient O(log n) parallel reduction algorithm.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var max_val = input[global_i]

        # Butterfly reduction tree: dynamic for any WARP_SIZE (32, 64, etc.)
        # Start with half the warp size and reduce by half each step
        var offset = WARP_SIZE // 2
        while offset > 0:
            max_val = max(max_val, shuffle_xor(max_val, UInt32(offset)))
            offset //= 2

        # All threads now have the maximum value across the entire warp
        output[global_i] = max_val


์ด ํ’€์ด๋Š” shuffle_xor()์ด \(O(\log n)\) ๋ณต์žก๋„์˜ ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ๋ฅผ ์–ด๋–ป๊ฒŒ ์ƒ์„ฑํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    max_val = input[global_i]  # ๋กœ์ปฌ ๊ฐ’์œผ๋กœ ์‹œ์ž‘

    # ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ: ๋ชจ๋“  WARP_SIZE์— ๋™์ ์œผ๋กœ ๋Œ€์‘
    offset = WARP_SIZE // 2
    while offset > 0:
        max_val = max(max_val, shuffle_xor(max_val, offset))
        offset //= 2

    output[global_i] = max_val  # ๋ชจ๋“  ๋ ˆ์ธ์ด ์ „์—ญ ์ตœ๋Œ“๊ฐ’์„ ๊ฐ€์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์‹คํ–‰ ์ถ”์  (8-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [0,2,4,6,8,10,12,1000]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: max_val = 0,    Lane 1: max_val = 2
  Lane 2: max_val = 4,    Lane 3: max_val = 6
  Lane 4: max_val = 8,    Lane 5: max_val = 10
  Lane 6: max_val = 12,   Lane 7: max_val = 1000

1๋‹จ๊ณ„: shuffle_xor(max_val, 4) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”4: max(0,8)=8,     Lane 1โ†”5: max(2,10)=10
  Lane 2โ†”6: max(4,12)=12,   Lane 3โ†”7: max(6,1000)=1000
  Lane 4โ†”0: max(8,0)=8,     Lane 5โ†”1: max(10,2)=10
  Lane 6โ†”2: max(12,4)=12,   Lane 7โ†”3: max(1000,6)=1000

2๋‹จ๊ณ„: shuffle_xor(max_val, 2) - 1/4 ๊ตํ™˜
  Lane 0โ†”2: max(8,12)=12,   Lane 1โ†”3: max(10,1000)=1000
  Lane 2โ†”0: max(12,8)=12,   Lane 3โ†”1: max(1000,10)=1000
  Lane 4โ†”6: max(8,12)=12,   Lane 5โ†”7: max(10,1000)=1000
  Lane 6โ†”4: max(12,8)=12,   Lane 7โ†”5: max(1000,10)=1000

3๋‹จ๊ณ„: shuffle_xor(max_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max(12,1000)=1000,  Lane 1โ†”0: max(1000,12)=1000
  Lane 2โ†”3: max(12,1000)=1000,  Lane 3โ†”2: max(1000,12)=1000
  Lane 4โ†”5: max(12,1000)=1000,  Lane 5โ†”4: max(1000,12)=1000
  Lane 6โ†”7: max(12,1000)=1000,  Lane 7โ†”6: max(1000,12)=1000

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์˜ max_val = 1000

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹ ์œผ๋กœ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ์ž๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Reduce}(\oplus, [a_0, a_1, \ldots, a_{n-1}]) = a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1}\]

์—ฌ๊ธฐ์„œ \(\oplus\)๋Š” max ์—ฐ์‚ฐ์ด๋ฉฐ, ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ์ตœ์  \(O(\log n)\) ๋ณต์žก๋„๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ๋กœ๊ทธ ๋ณต์žก๋„: ์ˆœ์ฐจ ๋ฆฌ๋•์…˜์˜ \(O(n)\)์— ๋น„ํ•ด \(O(\log n)\)
  2. ์™„๋ฒฝํ•œ ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ๋ชจ๋“  ๋ ˆ์ธ์ด ๊ฐ ๋‹จ๊ณ„์—์„œ ๋™๋“ฑํ•˜๊ฒŒ ์ฐธ์—ฌ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ ์—†์Œ: ์ˆœ์ˆ˜ ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ํ†ต์‹ 
  4. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ์— ์ง์ ‘ ๋งคํ•‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ๋‹จ๊ณ„ ์ˆ˜: \(\log_2(\text{WARP_SIZE})\) (์˜ˆ: 32-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 5๋‹จ๊ณ„, 64-์Šค๋ ˆ๋“œ ์›Œํ”„๋Š” 6๋‹จ๊ณ„)
  • ๋‹จ๊ณ„๋‹น ์ง€์—ฐ ์‹œ๊ฐ„: 1 ์‚ฌ์ดํด (๋ ˆ์ง€์Šคํ„ฐ ๊ตํ™˜ + ๋น„๊ต)
  • ์ด ์ง€์—ฐ ์‹œ๊ฐ„: ์ˆœ์ฐจ ๋ฐฉ์‹์˜ \((\text{WARP_SIZE}-1)\) ์‚ฌ์ดํด ๋Œ€๋น„ \(\log_2(\text{WARP_SIZE})\) ์‚ฌ์ดํด
  • ๋ณ‘๋ ฌ์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „์ฒด์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์ด ํ™œ์„ฑ ์ƒํƒœ

3. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE_2 = 64 (๋ฉ€ํ‹ฐ ๋ธ”๋ก ์‹œ๋‚˜๋ฆฌ์˜ค)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: BLOCKS_PER_GRID_2 = (2, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: THREADS_PER_BLOCK_2 = (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

์ง์ˆ˜ ๋ ˆ์ธ์€ ์ตœ๋Œ“๊ฐ’์„, ํ™€์ˆ˜ ๋ ˆ์ธ์€ ์ตœ์†Ÿ๊ฐ’์„ ์ €์žฅํ•˜๋Š” ์กฐ๊ฑด๋ถ€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’ ๋ชจ๋‘์— ๋Œ€ํ•ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \begin{cases} \max_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if} i \bmod 2 = 0 \\ \min_{j=0}^{\text{WARP_SIZE}-1} \text{input}[j] & \text{if } i \bmod 2 = 1 \end{cases}\]

์ด์ค‘ ๋ฆฌ๋•์…˜ ํŒจํ„ด: ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•œ ํ›„, ๋ ˆ์ธ ID ํ™€์ง์— ๋”ฐ๋ผ ์กฐ๊ฑด๋ถ€๋กœ ์ถœ๋ ฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์ด ๋ณต์žกํ•œ ๋‹ค์ค‘ ๊ฐ’ ๋ฆฌ๋•์…˜์œผ๋กœ ์–ด๋–ป๊ฒŒ ํ™•์žฅ๋˜๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

comptime SIZE_2 = 64
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (WARP_SIZE, 1)
comptime layout_2 = row_major[SIZE_2]()
comptime LayoutType_2 = type_of(layout_2)


def butterfly_conditional_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = lane_id()

    if global_i < size:
        var current_val = input[global_i]
        var min_val = current_val

        # FILL ME IN (roughly 11 lines)


ํŒ

1. ์ด์ค‘ ์ถ”์  ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ํผ์ฆ์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŠธ๋ฆฌ๋ฅผ ํ†ตํ•ด ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ๊ฐ’์„ ๋™์‹œ์— ์ถ”์ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋ณ‘๋ ฌ๋กœ ์‹คํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ๋ฆฌ๋•์…˜ ๊ณผ์ •์—์„œ ์ตœ๋Œ“๊ฐ’๊ณผ ์ตœ์†Ÿ๊ฐ’์„ ์–ด๋–ป๊ฒŒ ๋™์‹œ์— ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด์„ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ์–ด๋–ค ๋ณ€์ˆ˜๋ฅผ ์ถ”์ ํ•ด์•ผ ํ•˜๋‚˜์š”?

2. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ ๋กœ์ง

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ์™„๋ฃŒํ•œ ํ›„, ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ผ ๋‹ค๋ฅธ ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ๋ ˆ์ธ์ด ์ง์ˆ˜์ธ์ง€ ํ™€์ˆ˜์ธ์ง€ ์–ด๋–ป๊ฒŒ ํŒ๋ณ„ํ•˜๋‚˜์š”?
  • ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ๋Œ“๊ฐ’์„, ์–ด๋–ค ๋ ˆ์ธ์ด ์ตœ์†Ÿ๊ฐ’์„ ์ถœ๋ ฅํ•ด์•ผ ํ•˜๋‚˜์š”?
  • ๋ ˆ์ธ ID์— ์–ด๋–ป๊ฒŒ ์ ‘๊ทผํ•˜๋‚˜์š”?

3. min๊ณผ max ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

์ด ๊ณผ์ œ์˜ ํ•ต์‹ฌ์€ ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์œผ๋กœ min๊ณผ max๋ฅผ ํšจ์œจ์ ์œผ๋กœ ๋ณ‘๋ ฌ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • min๊ณผ max์— ๋ณ„๋„์˜ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•œ๊ฐ€์š”?
  • ๋‘ ์—ฐ์‚ฐ์— ๊ฐ™์€ ์ด์›ƒ ๊ฐ’์„ ์žฌ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  • ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์™„๋ฃŒ๋˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?

4. ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๊ฒฝ๊ณ„ ๊ณ ๋ ค์‚ฌํ•ญ

์ด ํผ์ฆ์€ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๋ฆฌ๋•์…˜ ๋ฒ”์œ„์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ค‘์š”ํ•œ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๊ฐ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์˜ ๋ฒ”์œ„๋Š” ์–ด๋””๊นŒ์ง€์ธ๊ฐ€์š”?
  • ๋ธ”๋ก ๊ตฌ์กฐ๊ฐ€ ๋ ˆ์ธ ๋ฒˆํ˜ธ ๋งค๊ธฐ๊ธฐ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋‚˜์š”?
  • ์ „์—ญ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”, ๋ธ”๋ก๋ณ„ min/max๋ฅผ ๊ณ„์‚ฐํ•˜๋‚˜์š”?

๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์กฐ๊ฑด๋ถ€ ์ตœ๋Œ“๊ฐ’ ํ…Œ์ŠคํŠธ:

pixi run p26 --conditional-max
pixi run -e amd p26 --conditional-max
uv run poe p26 --conditional-max

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE_2:  64
output: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
expected: [9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 9.0, 0.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0, 63.0, 32.0]
โœ… Butterfly conditional max test passed!

์†”๋ฃจ์…˜

def butterfly_conditional_max[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType_2, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType_2, ImmutAnyOrigin],
):
    """
    Conditional butterfly maximum: Perform butterfly max reduction, but only store result
    in even-numbered lanes. Odd-numbered lanes store the minimum value seen.
    Demonstrates conditional logic combined with butterfly communication patterns.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var lane = lane_id()

    if global_i < size:
        var current_val = input[global_i]
        var min_val = current_val

        # Butterfly reduction for both maximum and minimum: dynamic for any WARP_SIZE
        var offset = WARP_SIZE // 2
        while offset > 0:
            var neighbor_val = shuffle_xor(current_val, UInt32(offset))
            current_val = max(current_val, neighbor_val)

            var min_neighbor_val = shuffle_xor(min_val, UInt32(offset))
            min_val = min(min_val, min_neighbor_val)

            offset //= 2

        # Conditional output: max for even lanes, min for odd lanes
        if lane % 2 == 0:
            output[global_i] = current_val  # Maximum
        else:
            output[global_i] = min_val  # Minimum


์ด ํ’€์ด๋Š” ์ด์ค‘ ์ถ”์ ๊ณผ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•˜๋Š” ๊ณ ๊ธ‰ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]
    min_val = current_val  # ์ตœ์†Ÿ๊ฐ’์„ ๋ณ„๋„๋กœ ์ถ”์ 

    # max์™€ min ๋™์‹œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ (log_2(WARP_SIZE) ๋‹จ๊ณ„)
    offset = WARP_SIZE // 2
    while offset > 0:
        neighbor_val = shuffle_xor(current_val, offset)
        current_val = max(current_val, neighbor_val)    # Max ๋ฆฌ๋•์…˜

        min_neighbor_val = shuffle_xor(min_val, offset)
        min_val = min(min_val, min_neighbor_val)        # Min ๋ฆฌ๋•์…˜

        offset //= 2

    # ๋ ˆ์ธ ํ™€์ง์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    if lane % 2 == 0:
        output[global_i] = current_val  # ์ง์ˆ˜ ๋ ˆ์ธ: ์ตœ๋Œ“๊ฐ’
    else:
        output[global_i] = min_val      # ํ™€์ˆ˜ ๋ ˆ์ธ: ์ตœ์†Ÿ๊ฐ’

์ด์ค‘ ๋ฆฌ๋•์…˜ ์‹คํ–‰ ์ถ”์  (4-๋ ˆ์ธ ์˜ˆ์ œ, ๊ฐ’ [3, 1, 7, 2]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: current_val=3, min_val=3
  Lane 1: current_val=1, min_val=1
  Lane 2: current_val=7, min_val=7
  Lane 3: current_val=2, min_val=2

1๋‹จ๊ณ„: shuffle_xor(current_val, 2)์™€ shuffle_xor(min_val, 2) - ์ ˆ๋ฐ˜ ๊ตํ™˜
  Lane 0โ†”2: max_neighbor=7, min_neighbor=7 โ†’ current_val=max(3,7)=7, min_val=min(3,7)=3
  Lane 1โ†”3: max_neighbor=2, min_neighbor=2 โ†’ current_val=max(1,2)=2, min_val=min(1,2)=1
  Lane 2โ†”0: max_neighbor=3, min_neighbor=3 โ†’ current_val=max(7,3)=7, min_val=min(7,3)=3
  Lane 3โ†”1: max_neighbor=1, min_neighbor=1 โ†’ current_val=max(2,1)=2, min_val=min(2,1)=1

2๋‹จ๊ณ„: shuffle_xor(current_val, 1)์™€ shuffle_xor(min_val, 1) - ํŽ˜์–ด ๊ตํ™˜
  Lane 0โ†”1: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 1โ†”0: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1
  Lane 2โ†”3: max_neighbor=2, min_neighbor=1 โ†’ current_val=max(7,2)=7, min_val=min(3,1)=1
  Lane 3โ†”2: max_neighbor=7, min_neighbor=3 โ†’ current_val=max(2,7)=7, min_val=min(1,3)=1

์ตœ์ข… ๊ฒฐ๊ณผ: ๋ชจ๋“  ๋ ˆ์ธ์ด current_val=7 (์ „์—ญ max)๊ณผ min_val=1 (์ „์—ญ min)์„ ๊ฐ€์ง

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๋ชจ๋“  WARP_SIZE์—์„œ ๋™์ž‘):

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = max(current_val, neighbor_val)

    min_neighbor_val = shuffle_xor(min_val, offset)
    min_val = min(min_val, min_neighbor_val)

    offset //= 2

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์กฐ๊ฑด๋ถ€ ๋””๋ฉ€ํ‹ฐํ”Œ๋ ‰์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์ด์ค‘ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \begin{align} \text{max_result} &= \max_{i=0}^{n-1} \text{input}[i] \\ \text{min_result} &= \min_{i=0}^{n-1} \text{input}[i] \\ \text{output}[i] &= \text{lane_parity}(i) \; \text{?} \; \text{min_result}: \text{max_result} \end{align}\]

์ด์ค‘ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ๋…๋ฆฝ์  ๋ฆฌ๋•์…˜: Max์™€ min ๋ฆฌ๋•์…˜์€ ์ˆ˜ํ•™์ ์œผ๋กœ ๋…๋ฆฝ
  2. ๋ณ‘๋ ฌ ์‹คํ–‰: ๋‘˜ ๋‹ค ๊ฐ™์€ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  3. ํ†ต์‹  ๊ณต์œ : ๊ฐ™์€ ์…”ํ”Œ ์—ฐ์‚ฐ์ด ๋‘ ๋ฆฌ๋•์…˜ ๋ชจ๋‘์— ํ™œ์šฉ
  4. ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ: ๋ ˆ์ธ ํ™€์ง์ด ์–ด๋–ค ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ• ์ง€ ๊ฒฐ์ •

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ํ†ต์‹  ๋‹จ๊ณ„: \(\log_2(\text{WARP_SIZE})\) (๋‹จ์ผ ๋ฆฌ๋•์…˜๊ณผ ๋™์ผ)
  • ๋‹จ๊ณ„๋‹น ์—ฐ์‚ฐ: ๋‹จ์ผ ๋ฆฌ๋•์…˜์˜ 1๊ฐœ ๋Œ€๋น„ 2๊ฐœ ์—ฐ์‚ฐ (max + min)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ ๋Œ€๋น„ ์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ 2๊ฐœ
  • ์ถœ๋ ฅ ์œ ์—ฐ์„ฑ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ์ธ์ด ๋‹ค๋ฅธ ๋ฆฌ๋•์…˜ ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ ๊ฐ€๋Šฅ

์š”์•ฝ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๊ฐ•๋ ฅํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํ†ต์‹  ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ†ตํ•ด ๋‹ค์Œ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด

  1. ํŽ˜์–ด ๊ตํ™˜ (shuffle_xor(value, 1)):

    • ์™„๋ฒฝํ•œ ์ธ์ ‘ ํŽ˜์–ด ์ƒ์„ฑ: (0,1), (2,3), (4,5), โ€ฆ
    • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ์˜ \(O(1)\) ๋ณต์žก๋„
    • ์ •๋ ฌ ๋„คํŠธ์›Œํฌ์™€ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜์˜ ๊ธฐ๋ฐ˜
  2. ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ (๋™์  offset: WARP_SIZE/2 โ†’ 1):

    • ๋กœ๊ทธ ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ์ˆœ์ฐจ์˜ \(O(n)\) ๋Œ€๋น„ \(O(\log n)\)
    • ๋ชจ๋“  ๊ฒฐํ•ฉ ์—ฐ์‚ฐ์— ์ ์šฉ ๊ฐ€๋Šฅ (max, min, sum ๋“ฑ)
    • ๋ชจ๋“  ์›Œํ”„ ๋ ˆ์ธ์— ๊ฑธ์ณ ์ตœ์ ์˜ ๋ถ€ํ•˜ ๋ถ„์‚ฐ
  3. ์กฐ๊ฑด๋ถ€ ๋‹ค์ค‘ ๋ฆฌ๋•์…˜ (์ด์ค‘ ์ถ”์  + ๋ ˆ์ธ ํ™€์ง):

    • ์—ฌ๋Ÿฌ ๋ฆฌ๋•์…˜์„ ๋™์‹œ์— ๋ณ‘๋ ฌ ์ˆ˜ํ–‰
    • ์Šค๋ ˆ๋“œ ํŠน์„ฑ์— ๋”ฐ๋ฅธ ์กฐ๊ฑด๋ถ€ ์ถœ๋ ฅ
    • ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†๋Š” ๊ณ ๊ธ‰ ์กฐ์ •

ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ†ต์ฐฐ

XOR ํ†ต์‹  ํŠน์„ฑ:

  • shuffle_xor(value, mask)๊ฐ€ ๋Œ€์นญ์ ์ด๊ณ  ๊ฒน์น˜์ง€ ์•Š๋Š” ํŽ˜์–ด๋ฅผ ์ƒ์„ฑ
  • ๊ฐ ๋งˆ์Šคํฌ๊ฐ€ ๊ณ ์œ ํ•œ ํ†ต์‹  ํ† ํด๋กœ์ง€๋ฅผ ์ƒ์„ฑ
  • ์ด์ง„ XOR ํŒจํ„ด์—์„œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋„คํŠธ์›Œํฌ๊ฐ€ ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ๋„์ถœ

๋™์  ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„:

offset = WARP_SIZE // 2
while offset > 0:
    neighbor_val = shuffle_xor(current_val, offset)
    current_val = operation(current_val, neighbor_val)
    offset //= 2
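위 동적 루프는 결합 법칙을 만족하는 연산이면 무엇이든 끼워 넣을 수 있습니다. 아래는 `operation` 자리에 합, max, min을 넣어 본 CPU 스케치로, 실제 Mojo 코드가 아니라 동작 원리를 보여 주기 위한 가정된 예시입니다.

```python
def butterfly_reduce(vals, operation):
    """동적 offset 루프(WARP_SIZE/2 → 1)의 CPU 시뮬레이션.
    결합 연산이면 무엇이든 적용 가능하며, 종료 시 모든 레인이 전역 결과를 가짐."""
    n = len(vals)  # 2의 거듭제곱 가정
    cur = list(vals)
    offset = n // 2
    while offset > 0:
        # shuffle_xor(current_val, offset)에 해당: 레인 i는 레인 i ^ offset과 교환
        cur = [operation(cur[i], cur[i ^ offset]) for i in range(n)]
        offset //= 2
    return cur

print(butterfly_reduce(list(range(8)), lambda a, b: a + b))  # [28, 28, ..., 28]
print(butterfly_reduce([3, 1, 7, 2], max))                   # [7, 7, 7, 7]
```

합, max, min 어느 경우든 \(\log_2 n\)단계 만에 모든 레인이 동일한 전역 결과를 얻는다는 점이 이 패턴의 핵심입니다.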

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ๋ ˆ์ง€์Šคํ„ฐ ๊ฐ„ ์ง์ ‘ ํ†ต์‹ 
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: SIMT ์‹คํ–‰์ด ์ •ํ™•์„ฑ์„ ๋ณด์žฅ
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ณต์žก๋„: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ \(O(\log n)\)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”

์‹ค์šฉ์  ํ™œ์šฉ

์ด ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ํŒจํ„ด๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜: ํ•ฉ๊ณ„, max, min, ๋…ผ๋ฆฌ ์—ฐ์‚ฐ
  • ๋ˆ„์  ํ•ฉ/์Šค์บ” ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ๋ณ‘๋ ฌ ์ •๋ ฌ
  • FFT ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์‹ ํ˜ธ ์ฒ˜๋ฆฌ์™€ ํ•ฉ์„ฑ๊ณฑ
  • Bitonic ์ •๋ ฌ: ๋ณ‘๋ ฌ ์ •๋ ฌ ๋„คํŠธ์›Œํฌ
  • ๊ทธ๋ž˜ํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ํŠธ๋ฆฌ ์ˆœํšŒ์™€ ์—ฐ๊ฒฐ์„ฑ

shuffle_xor() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์กฐ์ •์„ ์šฐ์•„ํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ํ†ต์‹  ํŒจํ„ด์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ํšจ์œจ์ ์œผ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค.

warp.prefix_sum() ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ”

์›Œํ”„ ๋ ˆ๋ฒจ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์—์„œ๋Š” prefix_sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋Œ€์ฒดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๊ฐ•๋ ฅํ•œ ์—ฐ์‚ฐ์„ ํ†ตํ•ด ์ˆ˜์‹ญ ์ค„์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ ๋™๊ธฐํ™” ์ฝ”๋“œ๊ฐ€ ํ•„์š”ํ–ˆ์„ ํšจ์œจ์ ์ธ ๋ˆ„์  ๊ณ„์‚ฐ, ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹, ๊ณ ๊ธ‰ ์กฐ์ • ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: prefix_sum() ์—ฐ์‚ฐ์€ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ”์„ ํ™œ์šฉํ•˜์—ฌ ์›Œํ”„ ๋ ˆ์ธ์— ๊ฑธ์ณ \(O(\log n)\) ๋ณต์žก๋„๋กœ ๋ˆ„์  ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋ฉฐ, ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

๋ณ‘๋ ฌ ์Šค์บ”์ด๋ž€? ๋ณ‘๋ ฌ ์Šค์บ” (๋ˆ„์  ํ•ฉ)์€ ๋ฐ์ดํ„ฐ ์š”์†Œ์— ๊ฑธ์ณ ๋ˆ„์  ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ๋ณธ์ ์ธ ๋ณ‘๋ ฌ ๊ธฐ๋ณธ ์š”์†Œ์ž…๋‹ˆ๋‹ค. ๋ง์…ˆ์˜ ๊ฒฝ์šฐ [a, b, c, d]๋ฅผ [a, a+b, a+b+c, a+b+c+d]๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์—ฐ์‚ฐ์€ ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, quicksort ํŒŒํ‹ฐ์…”๋‹, ๋ณ‘๋ ฌ ์ •๋ ฌ ๊ฐ™์€ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • prefix_sum()์„ ํ™œ์šฉํ•œ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ณ‘๋ ฌ ์Šค์บ”
  • ํฌํ•จ(inclusive) vs ๋น„ํฌํ•จ(exclusive) ๋ˆ„์  ํ•ฉ ํŒจํ„ด
  • ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜๋ฅผ ์œ„ํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜
  • ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹
  • ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋Œ€์ฒดํ•˜๋Š” ๋‹จ์ผ ์›Œํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ตœ์ ํ™”

์ด๋ฅผ ํ†ตํ•ด ๋‹ค๋‹จ๊ณ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์šฐ์•„ํ•œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋ณ€ํ™˜๋˜์–ด, ๋ช…์‹œ์  ๋™๊ธฐํ™” ์—†์ด ํšจ์œจ์ ์ธ ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

1. ์›Œํ”„ ํฌํ•จ ๋ˆ„์  ํ•ฉ

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)

prefix_sum์˜ ์ด์ 

๊ธฐ์กด ๋ˆ„์  ํ•ฉ์€ ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. Puzzle 14: ๋ˆ„์  ํ•ฉ์—์„œ๋Š” ๋ช…์‹œ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ๋กœ ์ด๋ฅผ ํž˜๋“ค๊ฒŒ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค:

def prefix_sum_simple(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    var offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        var current_val: output.ElementType = 0
        if local_i >= offset and local_i < size:
            current_val = shared[local_i - offset]  # read

        barrier()
        if local_i >= offset and local_i < size:
            shared[local_i] += current_val

        barrier()
        offset *= 2

    if global_i < size:
        output[global_i] = shared[local_i]


๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ฌธ์ œ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ํ•„์š”
  • ๋‹ค์ค‘ ๋ฐฐ๋ฆฌ์–ด: ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ๋™๊ธฐํ™”
  • ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ: ์ˆ˜๋™ ์ŠคํŠธ๋ผ์ด๋“œ ๊ณ„์‚ฐ๊ณผ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ
  • ๋‚ฎ์€ ํ™•์žฅ์„ฑ: ๊ฐ ๋‹จ๊ณ„ ์‚ฌ์ด์— ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”ํ•œ \(O(\log n)\) ๋‹จ๊ณ„

prefix_sum()์„ ์‚ฌ์šฉํ•˜๋ฉด ๋ณ‘๋ ฌ ์Šค์บ”์ด ๊ฐ„๋‹จํ•ด์ง‘๋‹ˆ๋‹ค:

# ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ฐฉ์‹ - ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ!
current_val = input[global_i]
scan_result = prefix_sum[exclusive=False](current_val)
output[global_i] = scan_result

prefix_sum์˜ ์žฅ์ :

  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ์—ฐ์‚ฐ
  • ๋™๊ธฐํ™” ๋ถˆํ•„์š”: ๋‹จ์ผ ์•„ํ† ๋ฏน ์—ฐ์‚ฐ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์ „์šฉ ์Šค์บ” ์œ ๋‹› ํ™œ์šฉ
  • ์™„๋ฒฝํ•œ ํ™•์žฅ์„ฑ: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘

์™„์„ฑํ•  ์ฝ”๋“œ

ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” prefix_sum() ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํฌํ•จ ๋ˆ„์  ํ•ฉ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ๊ฐ ๋ ˆ์ธ์ด ์ž์‹ ์˜ ์œ„์น˜๊นŒ์ง€ ๋ชจ๋“  ์š”์†Œ์˜ ํ•ฉ์„ ํฌํ•จํ•˜๋Š” ๋ˆ„์  ํ•ฉ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \sum_{j=0}^{i} \text{input}[j]\]

์ž…๋ ฅ ๋ฐ์ดํ„ฐ [1, 2, 3, 4, 5, ...]๋ฅผ ๋ˆ„์  ํ•ฉ [1, 3, 6, 10, 15, ...]์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉฐ, ๊ฐ ์œ„์น˜์— ์ด์ „ ๋ชจ๋“  ์š”์†Œ์™€ ์ž๊ธฐ ์ž์‹ ์˜ ํ•ฉ์ด ๋‹ด๊น๋‹ˆ๋‹ค.

def warp_inclusive_prefix_sum[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Inclusive prefix sum using warp primitive:
    Each thread gets sum of all elements up to and including its position.
    Compare this to Puzzle 12's complex shared memory + barrier approach.

    Puzzle 12 approach:
    - Shared memory allocation
    - Multiple barrier synchronizations
    - Log(n) iterations with manual tree reduction
    - Complex multi-phase algorithm

    Warp prefix_sum approach:
    - Single function call!
    - Hardware-optimized parallel scan
    - Automatic synchronization
    - O(log n) complexity, but implemented in hardware.

    NOTE: This implementation only works correctly within a single warp (WARP_SIZE threads).
    For multi-warp scenarios, additional coordination would be needed.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    # FILL ME IN (roughly 4 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p26/p26.mojo

ํŒ

1. prefix_sum ๋งค๊ฐœ๋ณ€์ˆ˜ ์ดํ•ดํ•˜๊ธฐ

prefix_sum() ํ•จ์ˆ˜์—๋Š” ์Šค์บ” ์œ ํ˜•์„ ์ œ์–ดํ•˜๋Š” ์ค‘์š”ํ•œ ํ…œํ”Œ๋ฆฟ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • ํฌํ•จ ๋ˆ„์  ํ•ฉ๊ณผ ๋น„ํฌํ•จ ๋ˆ„์  ํ•ฉ์˜ ์ฐจ์ด๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  • ์–ด๋–ค ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ์ด ๋™์ž‘์„ ์ œ์–ดํ•˜๋‚˜์š”?
  • ํฌํ•จ ์Šค์บ”์—์„œ ๊ฐ ๋ ˆ์ธ์€ ๋ฌด์—‡์„ ์ถœ๋ ฅํ•ด์•ผ ํ•˜๋‚˜์š”?

ํžŒํŠธ: ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ๋ณด๊ณ  ๋ˆ„์  ์—ฐ์‚ฐ์—์„œ โ€œํฌํ•จ(inclusive)โ€œ์ด ๋ฌด์—‡์„ ์˜๋ฏธํ•˜๋Š”์ง€ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

2. ๋‹จ์ผ ์›Œํ”„ ์ œํ•œ

์ด ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ œํ•œ์˜ ์˜๋ฏธ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • ์—ฌ๋Ÿฌ ์›Œํ”„๊ฐ€ ์žˆ์œผ๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋‚˜์š”?
  • ์ด ์ œํ•œ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์™œ ์ค‘์š”ํ•œ๊ฐ€์š”?
  • ๋ฉ€ํ‹ฐ ์›Œํ”„ ์‹œ๋‚˜๋ฆฌ์˜ค๋กœ ํ™•์žฅํ•˜๋ ค๋ฉด ์–ด๋–ป๊ฒŒ ํ•ด์•ผ ํ•˜๋‚˜์š”?

3. ๋ฐ์ดํ„ฐ ํƒ€์ž… ๊ณ ๋ ค์‚ฌํ•ญ

prefix_sum ํ•จ์ˆ˜๋Š” ์ตœ์  ์„ฑ๋Šฅ์„ ์œ„ํ•ด ํŠน์ • ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์š”๊ตฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ์ž…๋ ฅ์ด ์–ด๋–ค ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์‚ฌ์šฉํ•˜๋‚˜์š”?
  • prefix_sum์ด ํŠน์ • ์Šค์นผ๋ผ ํƒ€์ž…์„ ๊ธฐ๋Œ€ํ•˜๋‚˜์š”?
  • ํ•„์š”ํ•œ ๊ฒฝ์šฐ ํƒ€์ž… ๋ณ€ํ™˜์„ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋‚˜์š”?

์›Œํ”„ ํฌํ•จ ๋ˆ„์  ํ•ฉ ํ…Œ์ŠคํŠธ:

pixi run p26 --prefix-sum
pixi run -e amd p26 --prefix-sum
pixi run -e apple p26 --prefix-sum
uv run poe p26 --prefix-sum

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0, 120.0, 136.0, 153.0, 171.0, 190.0, 210.0, 231.0, 253.0, 276.0, 300.0, 325.0, 351.0, 378.0, 406.0, 435.0, 465.0, 496.0, 528.0]
expected: [1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0, 120.0, 136.0, 153.0, 171.0, 190.0, 210.0, 231.0, 253.0, 276.0, 300.0, 325.0, 351.0, 378.0, 406.0, 435.0, 465.0, 496.0, 528.0]
โœ… Warp inclusive prefix sum test passed!

์†”๋ฃจ์…˜

def warp_inclusive_prefix_sum[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    """
    Inclusive prefix sum using warp primitive: Each thread gets sum of all elements up to and including its position.
    Compare this to Puzzle 12's complex shared memory + barrier approach.

    Puzzle 12 approach:
    - Shared memory allocation
    - Multiple barrier synchronizations
    - Log(n) iterations with manual tree reduction
    - Complex multi-phase algorithm

    Warp prefix_sum approach:
    - Single function call!
    - Hardware-optimized parallel scan
    - Automatic synchronization
    - O(log n) complexity, but implemented in hardware.

    NOTE: This implementation only works correctly within a single warp (WARP_SIZE threads).
    For multi-warp scenarios, additional coordination would be needed.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var current_val = input[global_i]

        # This one call replaces ~30 lines of complex shared memory logic from Puzzle 12!
        # But it only works within the current warp (WARP_SIZE threads)
        var scan_result = prefix_sum[exclusive=False](
            rebind[Scalar[dtype]](current_val)
        )

        output[global_i] = scan_result


์ด ์†”๋ฃจ์…˜์€ prefix_sum()์ด ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ์–ด๋–ป๊ฒŒ ๋Œ€์ฒดํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]

    # ์ด ํ•œ ์ค„์ด Puzzle 14์˜ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋กœ์ง ~30์ค„์„ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค!
    # ๋‹จ, ํ˜„์žฌ ์›Œํ”„ (WARP_SIZE ์Šค๋ ˆ๋“œ) ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค
    scan_result = prefix_sum[exclusive=False](
        rebind[Scalar[dtype]](current_val)
    )

    output[global_i] = scan_result

SIMT ์‹คํ–‰ ์ƒ์„ธ ๋ถ„์„:

์ž…๋ ฅ: [1, 2, 3, 4, 5, 6, 7, 8, ...]

์‚ฌ์ดํด 1: ๋ชจ๋“  ๋ ˆ์ธ์ด ๋™์‹œ์— ๊ฐ’์„ ๋กœ๋“œ
  Lane 0: current_val = 1
  Lane 1: current_val = 2
  Lane 2: current_val = 3
  Lane 3: current_val = 4
  ...
  Lane 31: current_val = 32

์‚ฌ์ดํด 2: prefix_sum[exclusive=False] ์‹คํ–‰ (ํ•˜๋“œ์›จ์–ด ๊ฐ€์†)
  Lane 0: scan_result = 1 (์š”์†Œ 0~0์˜ ํ•ฉ)
  Lane 1: scan_result = 3 (์š”์†Œ 0~1์˜ ํ•ฉ: 1+2)
  Lane 2: scan_result = 6 (์š”์†Œ 0~2์˜ ํ•ฉ: 1+2+3)
  Lane 3: scan_result = 10 (์š”์†Œ 0~3์˜ ํ•ฉ: 1+2+3+4)
  ...
  Lane 31: scan_result = 528 (์š”์†Œ 0~31์˜ ํ•ฉ)

์‚ฌ์ดํด 3: ๊ฒฐ๊ณผ ์ €์žฅ
  Lane 0: output[0] = 1
  Lane 1: output[1] = 3
  Lane 2: output[2] = 6
  Lane 3: output[3] = 10
  ...

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ํฌํ•จ ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \sum_{j=0}^{i} \text{input}[j]\]

Puzzle 14 ๋ฐฉ์‹๊ณผ์˜ ๋น„๊ต:

  • Puzzle 14: ๋ˆ„์  ํ•ฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ~30์ค„ + ๋‹ค์ค‘ ๋ฐฐ๋ฆฌ์–ด + ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ
  • ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์˜ ํ•จ์ˆ˜ ํ˜ธ์ถœ 1๊ฐœ
  • ์„ฑ๋Šฅ: ๊ฐ™์€ \(O(\log n)\) ๋ณต์žก๋„์ด์ง€๋งŒ, ์ „์šฉ ํ•˜๋“œ์›จ์–ด์—์„œ ๊ตฌํ˜„
  • ๋ฉ”๋ชจ๋ฆฌ: ๋ช…์‹œ์  ํ• ๋‹น ๋Œ€๋น„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰ ์ œ๋กœ

Puzzle 12์—์„œ์˜ ๋ฐœ์ „: ํ˜„๋Œ€ GPU ์•„ํ‚คํ…์ฒ˜์˜ ๊ฐ•๋ ฅํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - Puzzle 12์—์„œ ์‹ ์ค‘ํ•œ ์ˆ˜๋™ ๊ตฌํ˜„์ด ํ•„์š”ํ–ˆ๋˜ ๊ฒƒ์ด ์ด์ œ๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ณธ ์š”์†Œ ํ•˜๋‚˜๋กœ ํ•ด๊ฒฐ๋ฉ๋‹ˆ๋‹ค. ์›Œํ”„ ๋ ˆ๋ฒจ prefix_sum()์€ ๊ตฌํ˜„ ๋ณต์žก๋„ ์ œ๋กœ๋กœ ๊ฐ™์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜์  ์ด์ ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

prefix_sum์ด ์šฐ์›”ํ•œ ์ด์œ :

  1. ํ•˜๋“œ์›จ์–ด ๊ฐ€์†: ํ˜„๋Œ€ GPU์˜ ์ „์šฉ ์Šค์บ” ์œ ๋‹›
  2. ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  3. ์ž๋™ ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”
  4. ์™„๋ฒฝํ•œ ํ™•์žฅ์„ฑ: ๋ชจ๋“  WARP_SIZE์—์„œ ์ตœ์ ์œผ๋กœ ๋™์ž‘

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์ง€์—ฐ ์‹œ๊ฐ„: ~1-2 ์‚ฌ์ดํด (ํ•˜๋“œ์›จ์–ด ์Šค์บ” ์œ ๋‹›)
  • ๋Œ€์—ญํญ: ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ ์ œ๋กœ (๋ ˆ์ง€์Šคํ„ฐ ์ „์šฉ ์—ฐ์‚ฐ)
  • ๋ณ‘๋ ฌ์„ฑ: WARP_SIZE๊ฐœ ๋ ˆ์ธ ๋ชจ๋‘ ๋™์‹œ์— ์ฐธ์—ฌ
  • ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ๋™๋ฐ˜ํ•œ \(O(\log n)\) ๋ณต์žก๋„

์ค‘์š”ํ•œ ์ œํ•œ์‚ฌํ•ญ: ์ด ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ๋ฉ€ํ‹ฐ ์›Œํ”„ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” ์›Œํ”„ ๊ฐ„ ์ถ”๊ฐ€ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

2. ์›Œํ”„ ํŒŒํ‹ฐ์…˜

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = WARP_SIZE (GPU์— ๋”ฐ๋ผ 32 ๋˜๋Š” 64)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (WARP_SIZE, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜

์™„์„ฑํ•  ์ฝ”๋“œ

shuffle_xor๊ณผ prefix_sum ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜์—ฌ ๋‹จ์ผ ์›Œํ”„ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ํ”ผ๋ฒ— ๊ฐ’์„ ๊ธฐ์ค€์œผ๋กœ ์š”์†Œ๋ฅผ ๋ถ„ํ• ํ•˜์—ฌ, < pivot์ธ ์š”์†Œ๋Š” ์™ผ์ชฝ์—, >= pivot์ธ ์š”์†Œ๋Š” ์˜ค๋ฅธ์ชฝ์— ๋ฐฐ์น˜ํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{output} = [\text{elements} < \text{pivot}] \,|\, [\text{elements} \geq \text{pivot}]\]

๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‘ ๊ฐ€์ง€ ์ •๊ตํ•œ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค:

  1. shuffle_xor(): ์™ผ์ชฝ ์š”์†Œ ๊ฐœ์ˆ˜๋ฅผ ์„ธ๊ธฐ ์œ„ํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
  2. prefix_sum(): ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด ์œ„์น˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋น„ํฌํ•จ ์Šค์บ”

์ด๋Š” ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๋Š” ๊ฐ•๋ ฅํ•จ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

def warp_partition[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    pivot: Float32,
):
    """
    Single-warp parallel partitioning using BOTH shuffle_xor AND prefix_sum.
    This implements a warp-level quicksort partition step that places elements < pivot
    on the left and elements >= pivot on the right.

    ALGORITHM COMPLEXITY - combines two advanced warp primitives:
    1. shuffle_xor(): Butterfly pattern for warp-level reductions
    2. prefix_sum(): Warp-level exclusive scan for position calculation.

    This demonstrates the power of warp primitives for sophisticated parallel algorithms
    within a single warp (works for any WARP_SIZE: 32, 64, etc.).

    Example with pivot=5:
    Input:  [3, 7, 1, 8, 2, 9, 4, 6]
    Result: [3, 1, 2, 4, 7, 8, 9, 6] (< pivot | >= pivot).
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var current_val = input[global_i]

        # FILL ME IN (roughly 13 lines)


ํŒ

1. ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์—ฌ๋Ÿฌ ์กฐ์ •๋œ ๋‹จ๊ณ„๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํŒŒํ‹ฐ์…”๋‹์— ํ•„์š”ํ•œ ๋…ผ๋ฆฌ์  ๋‹จ๊ณ„๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”.

๊ณ ๋ คํ•  ํ•ต์‹ฌ ๋‹จ๊ณ„:

  • ์–ด๋–ค ์š”์†Œ๊ฐ€ ์–ด๋А ํŒŒํ‹ฐ์…˜์— ์†ํ•˜๋Š”์ง€ ์–ด๋–ป๊ฒŒ ์‹๋ณ„ํ•˜๋‚˜์š”?
  • ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด์—์„œ ์œ„์น˜๋ฅผ ์–ด๋–ป๊ฒŒ ๊ณ„์‚ฐํ•˜๋‚˜์š”?
  • ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜์˜ ์ „์ฒด ํฌ๊ธฐ๋ฅผ ์–ด๋–ป๊ฒŒ ์•Œ ์ˆ˜ ์žˆ๋‚˜์š”?
  • ์ตœ์ข… ์œ„์น˜์— ์š”์†Œ๋ฅผ ์–ด๋–ป๊ฒŒ ๊ธฐ๋กํ•˜๋‚˜์š”?

2. ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ

์–ด๋А ํŒŒํ‹ฐ์…˜์— ์†ํ•˜๋Š”์ง€ ํŒ๋ณ„ํ•˜๋Š” ๋ถˆ๋ฆฌ์–ธ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋ฅผ ๋งŒ๋“ค์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์ƒ๊ฐํ•ด ๋ณด์„ธ์š”:

  • โ€œ์ด ์š”์†Œ๋Š” ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜์— ์†ํ•œ๋‹คโ€œ๋ฅผ ์–ด๋–ป๊ฒŒ ํ‘œํ˜„ํ•˜๋‚˜์š”?
  • โ€œ์ด ์š”์†Œ๋Š” ์˜ค๋ฅธ์ชฝ ํŒŒํ‹ฐ์…˜์— ์†ํ•œ๋‹คโ€œ๋ฅผ ์–ด๋–ป๊ฒŒ ํ‘œํ˜„ํ•˜๋‚˜์š”?
  • prefix_sum์— ์ „๋‹ฌํ•  ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋Š” ์–ด๋–ค ๋ฐ์ดํ„ฐ ํƒ€์ž…์ด์–ด์•ผ ํ•˜๋‚˜์š”?

3. shuffle_xor๊ณผ prefix_sum ๊ฒฐํ•ฉ

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋‘ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์„œ๋กœ ๋‹ค๋ฅธ ๋ชฉ์ ์œผ๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

๊ณ ๋ คํ•  ์ :

  • ์ด ๋งฅ๋ฝ์—์„œ shuffle_xor์€ ๋ฌด์—‡์— ์‚ฌ์šฉ๋˜๋‚˜์š”?
  • ์ด ๋งฅ๋ฝ์—์„œ prefix_sum์€ ๋ฌด์—‡์— ์‚ฌ์šฉ๋˜๋‚˜์š”?
  • ์ด ๋‘ ์—ฐ์‚ฐ์ด ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ๋™์ž‘ํ•˜๋‚˜์š”?

4. ์œ„์น˜ ๊ณ„์‚ฐ

๊ฐ€์žฅ ๊นŒ๋‹ค๋กœ์šด ๋ถ€๋ถ„์€ ๊ฐ ์š”์†Œ๊ฐ€ ์ถœ๋ ฅ์—์„œ ์–ด๋””์— ๊ธฐ๋ก๋˜์–ด์•ผ ํ•˜๋Š”์ง€ ๊ณ„์‚ฐํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜ ์š”์†Œ: ์ตœ์ข… ์œ„์น˜๋ฅผ ๋ฌด์—‡์ด ๊ฒฐ์ •ํ•˜๋‚˜์š”?
  • ์˜ค๋ฅธ์ชฝ ํŒŒํ‹ฐ์…˜ ์š”์†Œ: ์˜คํ”„์…‹์„ ์–ด๋–ป๊ฒŒ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ ์šฉํ•˜๋‚˜์š”?
  • ๋กœ์ปฌ ์œ„์น˜์™€ ํŒŒํ‹ฐ์…˜ ๊ฒฝ๊ณ„๋ฅผ ์–ด๋–ป๊ฒŒ ๊ฒฐํ•ฉํ•˜๋‚˜์š”?

์›Œํ”„ ํŒŒํ‹ฐ์…˜ ํ…Œ์ŠคํŠธ:

pixi run p26 --partition
uv run poe p26 --partition

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

WARP_SIZE:  32
SIZE:  32
output: HostBuffer([3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 3.0, 1.0, 2.0, 4.0, 0.0, 3.0, 1.0, 4.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0, 7.0, 8.0, 9.0, 6.0, 10.0, 11.0, 12.0, 13.0])
pivot: 5.0
โœ… Warp partition test passed!

์†”๋ฃจ์…˜

def warp_partition[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    pivot: Float32,
):
    """
    Single-warp parallel partitioning using BOTH shuffle_xor AND prefix_sum.
    This implements a warp-level quicksort partition step that places elements < pivot
    on the left and elements >= pivot on the right.

    ALGORITHM COMPLEXITY - combines two advanced warp primitives:
    1. shuffle_xor(): Butterfly pattern for warp-level reductions
    2. prefix_sum(): Warp-level exclusive scan for position calculation.

    This demonstrates the power of warp primitives for sophisticated parallel algorithms
    within a single warp (works for any WARP_SIZE: 32, 64, etc.).

    Example with pivot=5:
    Input:  [3, 7, 1, 8, 2, 9, 4, 6]
    Result: [3, 1, 2, 4, 7, 8, 9, 6] (< pivot | >= pivot).
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    if global_i < size:
        var current_val = input[global_i]

        # Phase 1: Create warp-level predicates
        var predicate_left = Scalar[dtype](
            1.0
        ) if current_val < pivot else Scalar[dtype](0.0)
        var predicate_right = Scalar[dtype](
            1.0
        ) if current_val >= pivot else Scalar[dtype](0.0)

        # Phase 2: Warp-level prefix sum to get positions within warp
        var warp_left_pos = prefix_sum[exclusive=True](predicate_left)
        var warp_right_pos = prefix_sum[exclusive=True](predicate_right)

        # Phase 3: Get total left count using shuffle_xor reduction
        var warp_left_total = predicate_left

        # Butterfly reduction to get total across the warp: dynamic for any WARP_SIZE
        var offset = WARP_SIZE // 2
        while offset > 0:
            warp_left_total += shuffle_xor(warp_left_total, UInt32(offset))
            offset //= 2

        # Phase 4: Write to output positions
        if current_val < pivot:
            # Left partition: use warp-level position
            output[Int(warp_left_pos)] = current_val
        else:
            # Right partition: offset by total left count + right position
            output[Int(warp_left_total + warp_right_pos)] = current_val


์ด ์†”๋ฃจ์…˜์€ ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ ๊ฐ„์˜ ๊ณ ๊ธ‰ ์กฐ์ •์„ ํ†ตํ•ด ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ „์ฒด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

if global_i < size:
    current_val = input[global_i]

    # 1๋‹จ๊ณ„: ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
    predicate_left = Float32(1.0) if current_val < pivot else Float32(0.0)
    predicate_right = Float32(1.0) if current_val >= pivot else Float32(0.0)

    # 2๋‹จ๊ณ„: ์›Œํ”„ ๋ ˆ๋ฒจ ๋ˆ„์  ํ•ฉ์œผ๋กœ ์›Œํ”„ ๋‚ด ์œ„์น˜ ๊ณ„์‚ฐ
    warp_left_pos = prefix_sum[exclusive=True](predicate_left)
    warp_right_pos = prefix_sum[exclusive=True](predicate_right)

    # 3๋‹จ๊ณ„: shuffle_xor ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜์œผ๋กœ ์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜ ๊ตฌํ•˜๊ธฐ
    warp_left_total = predicate_left

    # ์›Œํ”„ ์ „์ฒด์˜ ํ•ฉ์‚ฐ์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜: ๋ชจ๋“  WARP_SIZE์— ๋™์  ๋Œ€์‘
    offset = WARP_SIZE // 2
    while offset > 0:
        warp_left_total += shuffle_xor(warp_left_total, offset)
        offset //= 2

    # 4๋‹จ๊ณ„: ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก
    if current_val < pivot:
        # ์™ผ์ชฝ ํŒŒํ‹ฐ์…˜: ์›Œํ”„ ๋ ˆ๋ฒจ ์œ„์น˜ ์‚ฌ์šฉ
        output[Int(warp_left_pos)] = current_val
    else:
        # ์˜ค๋ฅธ์ชฝ ํŒŒํ‹ฐ์…˜: ์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜ + ์˜ค๋ฅธ์ชฝ ์œ„์น˜๋กœ offset
        output[Int(warp_left_total + warp_right_pos)] = current_val

๋‹ค๋‹จ๊ณ„ ์‹คํ–‰ ์ถ”์  (8-๋ ˆ์ธ ์˜ˆ์ œ, pivot=5, ๊ฐ’ [3,7,1,8,2,9,4,6]):

์ดˆ๊ธฐ ์ƒํƒœ:
  Lane 0: current_val=3 (< 5)  Lane 1: current_val=7 (>= 5)
  Lane 2: current_val=1 (< 5)  Lane 3: current_val=8 (>= 5)
  Lane 4: current_val=2 (< 5)  Lane 5: current_val=9 (>= 5)
  Lane 6: current_val=4 (< 5)  Lane 7: current_val=6 (>= 5)

1๋‹จ๊ณ„: ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
  Lane 0: predicate_left=1.0, predicate_right=0.0
  Lane 1: predicate_left=0.0, predicate_right=1.0
  Lane 2: predicate_left=1.0, predicate_right=0.0
  Lane 3: predicate_left=0.0, predicate_right=1.0
  Lane 4: predicate_left=1.0, predicate_right=0.0
  Lane 5: predicate_left=0.0, predicate_right=1.0
  Lane 6: predicate_left=1.0, predicate_right=0.0
  Lane 7: predicate_left=0.0, predicate_right=1.0

2๋‹จ๊ณ„: ์œ„์น˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋น„ํฌํ•จ ๋ˆ„์  ํ•ฉ
  warp_left_pos:  [0, 0, 1, 1, 2, 2, 3, 3]
  warp_right_pos: [0, 0, 0, 1, 1, 2, 2, 3]

3๋‹จ๊ณ„: ์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜๋ฅผ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
  ์ดˆ๊ธฐ๊ฐ’: [1, 0, 1, 0, 1, 0, 1, 0]
  ๋ฆฌ๋•์…˜ ํ›„: ๋ชจ๋“  ๋ ˆ์ธ์ด warp_left_total = 4๋ฅผ ๊ฐ€์ง

4๋‹จ๊ณ„: ์ถœ๋ ฅ ์œ„์น˜์— ๊ธฐ๋ก
  Lane 0: current_val=3 < pivot โ†’ output[0] = 3
  Lane 1: current_val=7 >= pivot โ†’ output[4+0] = output[4] = 7
  Lane 2: current_val=1 < pivot โ†’ output[1] = 1
  Lane 3: current_val=8 >= pivot โ†’ output[4+1] = output[5] = 8
  Lane 4: current_val=2 < pivot โ†’ output[2] = 2
  Lane 5: current_val=9 >= pivot โ†’ output[4+2] = output[6] = 9
  Lane 6: current_val=4 < pivot โ†’ output[3] = 4
  Lane 7: current_val=6 >= pivot โ†’ output[4+3] = output[7] = 6

์ตœ์ข… ๊ฒฐ๊ณผ: [3, 1, 2, 4, 7, 8, 9, 6] (< pivot | >= pivot)

์ˆ˜ํ•™์  ํ†ต์ฐฐ: ์ด์ค‘ ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค: \[\Large \begin{align} \text{left_pos}[i] &= \text{prefix_sum}_{\text{exclusive}}(\text{predicate_left}[i]) \\ \text{right_pos}[i] &= \text{prefix_sum}_{\text{exclusive}}(\text{predicate_right}[i]) \\ \text{left_total} &= \text{butterfly_reduce}(\text{predicate_left}) \\ \text{final_pos}[i] &= \begin{cases} \text{left_pos}[i] & \text{if } \text{input}[i] < \text{pivot} \\ \text{left_total} + \text{right_pos}[i] & \text{if} \text{input}[i] \geq \text{pivot} \end{cases} \end{align}\]

๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์ ‘๊ทผ ๋ฐฉ์‹์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ: ๊ฐ ์š”์†Œ์˜ ํŒŒํ‹ฐ์…˜ ์†Œ์†์„ ์‹๋ณ„
  2. ๋น„ํฌํ•จ ๋ˆ„์  ํ•ฉ: ๊ฐ ํŒŒํ‹ฐ์…˜ ๋‚ด ์ƒ๋Œ€์  ์œ„์น˜๋ฅผ ๊ณ„์‚ฐ
  3. ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜: ํŒŒํ‹ฐ์…˜ ๊ฒฝ๊ณ„ (์™ผ์ชฝ ์ด ๊ฐœ์ˆ˜)๋ฅผ ์‚ฐ์ถœ
  4. ์กฐ์ •๋œ ๊ธฐ๋ก: ๋กœ์ปฌ ์œ„์น˜์™€ ์ „์—ญ ํŒŒํ‹ฐ์…˜ ๊ตฌ์กฐ๋ฅผ ๊ฒฐํ•ฉ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„:

  • 1๋‹จ๊ณ„: \(O(1)\) - ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
  • 2๋‹จ๊ณ„: \(O(\log n)\) - ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ˆ„์  ํ•ฉ
  • 3๋‹จ๊ณ„: \(O(\log n)\) - shuffle_xor์„ ํ™œ์šฉํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
  • 4๋‹จ๊ณ„: \(O(1)\) - ์กฐ์ •๋œ ๊ธฐ๋ก
  • ์ „์ฒด: ์šฐ์ˆ˜ํ•œ ์ƒ์ˆ˜๋ฅผ ๊ฐ€์ง„ \(O(\log n)\)

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ํ†ต์‹  ๋‹จ๊ณ„: \(2 \times \log_2(\text{WARP_SIZE})\) (๋ˆ„์  ํ•ฉ + ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜)
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œ๋กœ, ๋ชจ๋‘ ๋ ˆ์ง€์Šคํ„ฐ ๊ธฐ๋ฐ˜
  • ๋ณ‘๋ ฌ์„ฑ: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ „์ฒด์—์„œ ๋ชจ๋“  ๋ ˆ์ธ์ด ํ™œ์„ฑ ์ƒํƒœ
  • ํ™•์žฅ์„ฑ: ๋ชจ๋“  WARP_SIZE (32, 64 ๋“ฑ)์—์„œ ๋™์ž‘

์‹ค์šฉ์  ํ™œ์šฉ: ์ด ํŒจํ„ด์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • Quicksort ํŒŒํ‹ฐ์…”๋‹: ๋ณ‘๋ ฌ ์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ ๋‹จ๊ณ„
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ฐ์ดํ„ฐ ์ŠคํŠธ๋ฆผ์—์„œ null/๋ฌดํšจ ์š”์†Œ ์ œ๊ฑฐ
  • ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง: ๋ณต์žกํ•œ ํ”„๋ ˆ๋””์ผ€์ดํŠธ์— ๋”ฐ๋ฅธ ๋ฐ์ดํ„ฐ ๋ถ„๋ฆฌ
  • ๋ถ€ํ•˜ ๋ถ„์‚ฐ: ์—ฐ์‚ฐ ์š”๊ตฌ๋Ÿ‰์— ๋”ฐ๋ฅธ ์ž‘์—… ์žฌ๋ถ„๋ฐฐ

์š”์•ฝ

prefix_sum() ๊ธฐ๋ณธ ์š”์†Œ๋Š” ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋‘ ๊ฐ€์ง€ ๋ฌธ์ œ๋ฅผ ํ†ตํ•ด ๋‹ค์Œ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๋ˆ„์  ํ•ฉ ํŒจํ„ด

  1. ํฌํ•จ ๋ˆ„์  ํ•ฉ (prefix_sum[exclusive=False]):

    • ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋ˆ„์  ์—ฐ์‚ฐ
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ”๋“œ ~30์ค„์„ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
    • ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ๋™๋ฐ˜ํ•œ \(O(\log n)\) ๋ณต์žก๋„
  2. ๊ณ ๊ธ‰ ๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์กฐ์ • (prefix_sum + shuffle_xor ๊ฒฐํ•ฉ):

    • ๋‹จ์ผ ์›Œํ”„ ๋‚ด ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜
    • ์œ„์น˜ ๊ณ„์‚ฐ์„ ์œ„ํ•œ ๋น„ํฌํ•จ ์Šค์บ” + ์ดํ•ฉ์„ ์œ„ํ•œ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜
    • ์ตœ์ ์˜ ๋ณ‘๋ ฌ ํšจ์œจ์„ฑ์„ ๊ฐ€์ง„ ๋ณต์žกํ•œ ํŒŒํ‹ฐ์…”๋‹ ์—ฐ์‚ฐ

ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ†ต์ฐฐ

ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์˜ ์ด์ :

  • prefix_sum()์ด ํ˜„๋Œ€ GPU์˜ ์ „์šฉ ์Šค์บ” ์œ ๋‹›์„ ํ™œ์šฉ
  • ๊ธฐ์กด ๋ฐฉ์‹ ๋Œ€๋น„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ ์ œ๋กœ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ์—†๋Š” ์ž๋™ ๋™๊ธฐํ™”

๋‹ค์ค‘ ๊ธฐ๋ณธ ์š”์†Œ ์กฐ์ •:

# 1๋‹จ๊ณ„: ํŒŒํ‹ฐ์…˜ ์†Œ์†์„ ์œ„ํ•œ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
predicate = 1.0 if condition else 0.0

# 2๋‹จ๊ณ„: ๋กœ์ปฌ ์œ„์น˜๋ฅผ ์œ„ํ•œ prefix_sum ์‚ฌ์šฉ
local_pos = prefix_sum[exclusive=True](predicate)

# 3๋‹จ๊ณ„: ์ „์—ญ ์ดํ•ฉ์„ ์œ„ํ•œ shuffle_xor ์‚ฌ์šฉ
global_total = butterfly_reduce(predicate)

# 4๋‹จ๊ณ„: ์ตœ์ข… ์œ„์น˜ ๊ฒฐ์ •์„ ์œ„ํ•œ ๊ฒฐํ•ฉ
final_pos = local_pos + partition_offset

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์†Œํ”„ํŠธ์›จ์–ด ๊ตฌํ˜„ ๋Œ€๋น„ ์ „์šฉ ์Šค์บ” ์œ ๋‹›
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋Œ€๋น„ ๋ ˆ์ง€์Šคํ„ฐ ์ „์šฉ ์—ฐ์‚ฐ
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ๋ณต์žก๋„: ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ๋™๋ฐ˜ํ•œ \(O(\log n)\)
  • ๋‹จ์ผ ์›Œํ”„ ์ตœ์ ํ™”: WARP_SIZE ํ•œ๋„ ๋‚ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์ตœ์ 

์‹ค์šฉ์  ํ™œ์šฉ

์ด ๋ˆ„์  ํ•ฉ ํŒจํ„ด๋“ค์˜ ๊ธฐ๋ฐ˜์ด ๋˜๋Š” ๋ถ„์•ผ:

  • ๋ณ‘๋ ฌ ์Šค์บ” ์—ฐ์‚ฐ: ๋ˆ„์  ํ•ฉ, ๋ˆ„์  ๊ณฑ, min/max ์Šค์บ”
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ๋ฐ์ดํ„ฐ ์žฌ๋ฐฐ์น˜
  • Quicksort ํŒŒํ‹ฐ์…”๋‹: ๋ณ‘๋ ฌ ์ •๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ ๋นŒ๋”ฉ ๋ธ”๋ก
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋ถ€ํ•˜ ๋ถ„์‚ฐ, ์ž‘์—… ๋ถ„๋ฐฐ, ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์กฐํ™”

prefix_sum()๊ณผ shuffle_xor()์˜ ๊ฒฐํ•ฉ์€ ํ˜„๋Œ€ GPU ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์ตœ์†Œํ•œ์˜ ์ฝ”๋“œ ๋ณต์žก๋„์™€ ์ตœ์ ์˜ ์„ฑ๋Šฅ ํŠน์„ฑ์œผ๋กœ ์ •๊ตํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์–ด๋–ป๊ฒŒ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Puzzle 27: ๋ธ”๋ก ์ „์ฒด ํŒจํ„ด

๊ฐœ์š”

Puzzle 27: ๋ธ”๋ก ์ „์ฒด ํŒจํ„ด์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ํผ์ฆ์€ GPU ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํ•ต์‹ฌ ๊ตฌ์„ฑ ์š”์†Œ์ธ ๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ „์ฒด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์นœ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ํ†ต์‹  ํŒจํ„ด์„ ํƒ๊ตฌํ•˜๋ฉฐ, ๋ณต์žกํ•œ ์ˆ˜๋™ ๋™๊ธฐํ™”๋ฅผ ๊ฐ„๊ฒฐํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด์— ์ตœ์ ํ™”๋œ ์—ฐ์‚ฐ์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

๋ชฉํ‘œ: ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด(Puzzle 12)์—์„œ ๋ฒ—์–ด๋‚˜, ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ธ”๋ก ์ „์ฒด ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊ฐ„๊ฒฐํ•œ ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์ „ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: GPU ์Šค๋ ˆ๋“œ ๋ธ”๋ก์€ ์ •๊ตํ•œ ํ•˜๋“œ์›จ์–ด ์กฐ์œจ๋กœ ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค - Mojo์˜ ๋ธ”๋ก ์—ฐ์‚ฐ์€ ํฌ๋กœ์Šค ์›Œํ”„ ํ†ต์‹ ๊ณผ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์œ ๋‹›์„ ํ™œ์šฉํ•˜์—ฌ ์™„๋ฒฝํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋นŒ๋”ฉ ๋ธ”๋ก์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค: ๋ฆฌ๋•์…˜(์ „์ฒดโ†’ํ•˜๋‚˜), ์Šค์บ”(์ „์ฒดโ†’๊ฐ๊ฐ), ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(ํ•˜๋‚˜โ†’์ „์ฒด).

๋ฐฐ์šธ ๋‚ด์šฉ

๋ธ”๋ก ๋ ˆ๋ฒจ ํ†ต์‹  ๋ชจ๋ธ

GPU ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋‚ด ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

GPU ์Šค๋ ˆ๋“œ ๋ธ”๋ก (128 ์Šค๋ ˆ๋“œ, 4๊ฐœ ๋˜๋Š” 2๊ฐœ ์›Œํ”„, ํ•˜๋“œ์›จ์–ด ์กฐ์œจ)
์ „์ฒดโ†’ํ•˜๋‚˜ (Reduction):     ๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ ์Šค๋ ˆ๋“œ 0์— ๋‹จ์ผ ๊ฒฐ๊ณผ
์ „์ฒดโ†’๊ฐ๊ฐ (Scan):         ๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์  ์œ„์น˜๋ฅผ ๋ฐ›์Œ
ํ•˜๋‚˜โ†’์ „์ฒด (Broadcast):     ์Šค๋ ˆ๋“œ 0 โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ๋ฐ›์Œ

ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ:
โ”œโ”€โ”€ ์›Œํ”„ 0 (์Šค๋ ˆ๋“œ 0-31)   โ”€โ”€block.sum()โ”€โ”€โ”
โ”œโ”€โ”€ ์›Œํ”„ 1 (์Šค๋ ˆ๋“œ 32-63)  โ”€โ”€block.sum()โ”€โ”€โ”ผโ†’ ์Šค๋ ˆ๋“œ 0 ๊ฒฐ๊ณผ
โ”œโ”€โ”€ ์›Œํ”„ 2 (์Šค๋ ˆ๋“œ 64-95)  โ”€โ”€block.sum()โ”€โ”€โ”ค
โ””โ”€โ”€ ์›Œํ”„ 3 (์Šค๋ ˆ๋“œ 96-127) โ”€โ”€block.sum()โ”€โ”€โ”˜

ํ•˜๋“œ์›จ์–ด ํ˜„์‹ค:

  • ํฌ๋กœ์Šค ์›Œํ”„ ๋™๊ธฐํ™”: ๋ธ”๋ก ๋‚ด ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ฐ„ ์ž๋™ ์กฐ์œจ
  • ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์œ ๋‹›: ํŠนํ™”๋œ ์Šค์บ” ์œ ๋‹›๊ณผ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜ ๋„คํŠธ์›Œํฌ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”: ํ•˜๋“œ์›จ์–ด๊ฐ€ ๋ชจ๋“  ๋™๊ธฐํ™”๋ฅผ ๋‚ด๋ถ€์ ์œผ๋กœ ๊ด€๋ฆฌ
  • ๋กœ๊ทธ ๋ณต์žก๋„: \(O(\log n)\) ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋‹จ์ผ ๋ช…๋ น์˜ ๋‹จ์ˆœํ•จ์œผ๋กœ

Mojo์˜ ๋ธ”๋ก ์—ฐ์‚ฐ

gpu.primitives.block์˜ ์™„์ „ํ•œ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋„๊ตฌ ๋ชจ์Œ์„ ๋ฐฐ์›๋‹ˆ๋‹ค:

  1. block.sum(value): ํ•ฉ๊ณ„, ํ‰๊ท , ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’์„ ์œ„ํ•œ ์ „์ฒดโ†’ํ•˜๋‚˜ ๋ฆฌ๋•์…˜
  2. block.prefix_sum(value): ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ์„ ์œ„ํ•œ ์ „์ฒดโ†’๊ฐ๊ฐ ์Šค์บ”
  3. block.broadcast(value): ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ ์™€ ์กฐ์œจ์„ ์œ„ํ•œ ํ•˜๋‚˜โ†’์ „์ฒด ๋ถ„๋ฐฐ

์ฐธ๊ณ : ์ด ๊ธฐ๋ณธ ์š”์†Œ๋“ค์€ ํ†ต๊ณ„ ์—ฐ์‚ฐ, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜, ์ •๊ทœํ™” ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๊ฐ™์€ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ธฐ๋ณธ ์š”์†Œ ์—†์ด ๊ตฌํ˜„ํ•˜๋ ค๋ฉด ์ˆ˜์‹ญ ์ค„์˜ ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์œจ ์ฝ”๋“œ๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ณ€ํ™˜ ์˜ˆ์‹œ

# ๋ณต์žกํ•œ ๋ธ”๋ก ์ „์ฒด ๋ฆฌ๋•์…˜ (๊ธฐ์กด ๋ฐฉ์‹ - Puzzle 12์—์„œ):
shared_memory[local_i] = my_value
barrier()
stride = 64
while stride > 0:
    if local_i < stride:
        shared_memory[local_i] += shared_memory[local_i + stride]
    barrier()
    stride //= 2
if local_i == 0:
    output[block_idx.x] = shared_memory[0]

# ๋ธ”๋ก ์—ฐ์‚ฐ์œผ๋กœ ์ด ๋ชจ๋“  ๋ณต์žก์„ฑ์„ ์ œ๊ฑฐ:
my_partial = compute_local_contribution()
total = block.sum[block_size=128, broadcast=False](my_partial)  # ํ•œ ์ค„์ด๋ฉด ๋!
if local_i == 0:
    output[block_idx.x] = total[0]

๋ธ”๋ก ์—ฐ์‚ฐ์ด ๋น›๋‚˜๋Š” ์ˆœ๊ฐ„

์„ฑ๋Šฅ ํŠน์„ฑ์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค:

์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด๊ธฐ์กด ๋ฐฉ์‹๋ธ”๋ก ์—ฐ์‚ฐ
๋ธ”๋ก ์ „์ฒด ๋ฆฌ๋•์…˜๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด๋‹จ์ผ block.sum ํ˜ธ์ถœ
๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑblock.prefix_sum ์กฐ์œจ
๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ ์ˆ˜๋™ ๋™๊ธฐํ™”๋‹จ์ผ block.broadcast ํ˜ธ์ถœ
ํฌ๋กœ์Šค ์›Œํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌํ•˜๋“œ์›จ์–ด ๊ด€๋ฆฌ ์กฐ์œจ

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์˜ ์ง„ํ™”

์ถœ๋ฐœ์ : ์ˆ˜๋™ ์กฐ์œจ (Puzzle 12)

๋ณต์žกํ•˜์ง€๋งŒ ๊ต์œก์  - ๋ช…์‹œ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜:

# ์ˆ˜๋™ ๋ฐฉ์‹: 15์ค„ ์ด์ƒ์˜ ๋ณต์žกํ•œ ๋™๊ธฐํ™”
shared_memory[local_i] = my_value
barrier()
# ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜...
stride = 64
while stride > 0:
    if local_i < stride:
        shared_memory[local_i] += shared_memory[local_i + stride]
    barrier()
    stride //= 2

์ค‘๊ฐ„ ๋‹จ๊ณ„: ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 24)

ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์ด์ง€๋งŒ ๋ฒ”์œ„๊ฐ€ ์ œํ•œ์  - 32 ์Šค๋ ˆ๋“œ ์›Œํ”„ ๋‚ด์˜ warp.sum():

# ์›Œํ”„ ๋ฐฉ์‹: 1์ค„์ด์ง€๋งŒ ๋‹จ์ผ ์›Œํ”„๋งŒ
total = warp.sum[warp_size=WARP_SIZE](val=partial_product)

์ตœ์ข… ๋ชฉ์ ์ง€: ๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ (์ด๋ฒˆ ํผ์ฆ)

์™„์ „ํ•œ ๋„๊ตฌ ๋ชจ์Œ - ์ „์ฒด ๋ธ”๋ก์— ๊ฑธ์นœ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ:

# ๋ธ”๋ก ๋ฐฉ์‹: ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ 1์ค„ (128+ ์Šค๋ ˆ๋“œ)
total = block.sum[block_size=128, broadcast=False](val=partial_product)

์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด

๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ๋ชจ๋“  ๋ณ‘๋ ฌ ํ†ต์‹  ์š”๊ตฌ๋ฅผ ์ถฉ์กฑํ•˜๋Š” ์„ธ ๊ฐ€์ง€ ํ•ต์‹ฌ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

1. ์ „์ฒดโ†’ํ•˜๋‚˜: ๋ฆฌ๋•์…˜ (block.sum())

  • ํŒจํ„ด: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ โ†’ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์Œ
  • ์šฉ๋„: ํ•ฉ๊ณ„, ํ‰๊ท , ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’ ๊ณ„์‚ฐ
  • ์˜ˆ์‹œ: ๋‚ด์ , ํ†ต๊ณ„ ์ง‘๊ณ„
  • ํ•˜๋“œ์›จ์–ด: ์ž๋™ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํฌํ•จ๋œ ํฌ๋กœ์Šค ์›Œํ”„ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ๋ฆฌ๋•์…˜

2. ์ „์ฒดโ†’๊ฐ๊ฐ: ์Šค์บ” (block.prefix_sum())

  • ํŒจํ„ด: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ โ†’ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์  ์œ„์น˜๋ฅผ ๋ฐ›์Œ
  • ์šฉ๋„: ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง, ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜
  • ์˜ˆ์‹œ: ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ์ถ”์ถœ์„ ์œ„ํ•œ ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ
  • ํ•˜๋“œ์›จ์–ด: ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ์„ ํฌํ•จํ•œ ๋ณ‘๋ ฌ ์Šค์บ”

3. ํ•˜๋‚˜โ†’์ „์ฒด: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ (block.broadcast())

  • ํŒจํ„ด: ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์ œ๊ณต โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ๋ฐ›์Œ
  • ์šฉ๋„: ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ , ์„ค์ •๊ฐ’ ๋ถ„๋ฐฐ
  • ์˜ˆ์‹œ: ์ •๊ทœํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๊ณ„์‚ฐ๋œ ํ‰๊ท  ๊ณต์œ 
  • ํ•˜๋“œ์›จ์–ด: ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ ์ตœ์ ํ™”๋œ ๋ถ„๋ฐฐ

ํ•™์Šต ๊ฒฝ๋กœ

์„ธ ๋‹จ๊ณ„๋กœ ์ด ํผ์ฆ์„ ์™„์„ฑํ•˜๋ฉฐ, ๋‹จ์ˆœํ•œ ๊ฒƒ์—์„œ ๋ณต์žกํ•œ ๊ฒƒ์œผ๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

Part 1: block.sum()์˜ ํ•ต์‹ฌ

๋ณต์žกํ•œ ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ•œ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋ณ€ํ™˜

block.sum()์œผ๋กœ ๋‚ด์ ์„ ๊ตฌํ˜„ํ•˜๋ฉฐ ๋ธ”๋ก ๋ฆฌ๋•์…˜์˜ ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ๋ธ”๋ก ์—ฐ์‚ฐ์ด 15์ค„ ์ด์ƒ์˜ ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ๋‹จ์ผ ์ตœ์ ํ™” ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์นœ ๋ธ”๋ก ์ „์ฒด ๋™๊ธฐํ™”
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๋ฆฌ๋•์…˜ ํŒจํ„ด
  • ์Šค๋ ˆ๋“œ 0 ๊ฒฐ๊ณผ ๊ด€๋ฆฌ
  • ๊ธฐ์กด ๋ฐฉ์‹๊ณผ์˜ ์„ฑ๋Šฅ ๋น„๊ต

ํ•™์Šต ๋ชฉํ‘œ: block.sum()์ด ๋ธ”๋ก ๊ทœ๋ชจ์—์„œ warp.sum()์˜ ๋‹จ์ˆœํ•จ์„ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.


Part 2: block.prefix_sum()๊ณผ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜

๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ

ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•ด block.prefix_sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. ๋ˆ„์  ํ•ฉ์ด ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜์œผ๋กœ๋Š” ๊ตฌํ˜„ํ•˜๊ธฐ ์–ด๋ ค์šด ๋ณต์žกํ•œ ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋ฅผ ์ด์šฉํ•œ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง
  • ์กฐ์œจ๋œ ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ
  • ๊ณ ๊ธ‰ ํŒŒํ‹ฐ์…”๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ํฌ๋กœ์Šค ์Šค๋ ˆ๋“œ ๋ฐ์ดํ„ฐ ์ถ”์ถœ ํŒจํ„ด

ํ•™์Šต ๋ชฉํ‘œ: block.prefix_sum()์ด ๋‹จ์ˆœํ•œ ์ง‘๊ณ„๋ฅผ ๋„˜์–ด์„œ๋Š” ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.


Part 3: block.broadcast()์™€ ๋ฒกํ„ฐ ์ •๊ทœํ™”

๋ชจ๋“  ํŒจํ„ด์„ ๊ฒฐํ•ฉํ•˜๋Š” ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ

๋ธ”๋ก ์—ฐ์‚ฐ ๋„๊ตฌ ๋ชจ์Œ ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์–ด๋–ป๊ฒŒ ํ•จ๊ป˜ ์ž‘๋™ํ•˜์—ฌ ์ˆ˜ํ•™์  ์ •ํ™•์„ฑ์„ ๊ฐ–์ถ˜ ์‹ค์ œ ์—ฐ์‚ฐ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹  ํŒจํ„ด
  • ์กฐ์œจ๋œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ
  • ์‹ค์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„

ํ•™์Šต ๋ชฉํ‘œ: ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•ด ๋ธ”๋ก ์—ฐ์‚ฐ์„ ์กฐํ•ฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

๋ธ”๋ก ์—ฐ์‚ฐ์ด ์ค‘์š”ํ•œ ์ด์œ 

์ฝ”๋“œ ๋‹จ์ˆœํ™” ๋ณ€ํ™˜:

๊ธฐ์กด ๋ฐฉ์‹:     20์ค„ ์ด์ƒ์˜ ๋ฐฐ๋ฆฌ์–ด, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ณต์žกํ•œ ์ธ๋ฑ์‹ฑ
๋ธ”๋ก ์—ฐ์‚ฐ:     3-5์ค„์˜ ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ

์„ฑ๋Šฅ ์ด์ :

  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ์•„ํ‚คํ…์ฒ˜๋ณ„ ์ตœ์ ํ™”๋ฅผ ํ™œ์šฉ
  • ์ž๋™ ๋™๊ธฐํ™”: ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜ ์˜ค๋ฅ˜ ์ œ๊ฑฐ
  • ์กฐํ•ฉ ๊ฐ€๋Šฅ์„ฑ: ์—ฐ์‚ฐ๋“ค์ด ๋งค๋„๋Ÿฝ๊ฒŒ ํ•จ๊ป˜ ๋™์ž‘
  • ์ด์‹์„ฑ: ๋™์ผํ•œ ์ฝ”๋“œ๊ฐ€ ๋‹ค์–‘ํ•œ GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ์ž‘๋™

๊ต์œก์  ๊ฐ€์น˜:

  • ๊ฐœ๋…์  ๋ช…ํ™•์„ฑ: ๊ฐ ์—ฐ์‚ฐ์ด ๋ช…ํ™•ํ•œ ํ†ต์‹  ๋ชฉ์ ์„ ๊ฐ€์ง
  • ์ ์ง„์  ๋ณต์žก์„ฑ: ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜์—์„œ ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ฐœ์ „
  • ์‹ค์ œ ์‘์šฉ: ๊ณผํ•™ ์—ฐ์‚ฐ, ๊ทธ๋ž˜ํ”ฝ, AI์—์„œ ๊ด‘๋ฒ”์œ„ํ•˜๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ํŒจํ„ด

์„ ์ˆ˜ ์ง€์‹

์ด ํผ์ฆ์„ ์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋‹ค์Œ์„ ์™„๋ฃŒํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

ํ•™์Šต ์„ฑ๊ณผ

์„ธ ํŒŒํŠธ๋ฅผ ๋ชจ๋‘ ์™„๋ฃŒํ•˜๋ฉด ๋‹ค์Œ์„ ์ดํ•ดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  1. ๊ฐ ๋ธ”๋ก ์—ฐ์‚ฐ์˜ ์šฉ๋„ - ๋‹ค์–‘ํ•œ ๋ณ‘๋ ฌ ํ†ต์‹  ์š”๊ตฌ์— ๋งž๋Š” ์„ ํƒ
  2. ์—ฐ์‚ฐ ์กฐํ•ฉ ๋ฐฉ๋ฒ• - ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์ถ•
  3. ์„ฑ๋Šฅ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„ - ์ˆ˜๋™ ๋ฐฉ์‹๊ณผ ์ž๋™ํ™” ๋ฐฉ์‹ ๊ฐ„์˜ ๋น„๊ต
  4. ์‹ค์ œ ์‘์šฉ - ๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์˜ ํ™œ์šฉ
  5. ์•„ํ‚คํ…์ฒ˜ ๋…๋ฆฝ์  ํ”„๋กœ๊ทธ๋ž˜๋ฐ - ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ณธ ์š”์†Œ ํ™œ์šฉ

์‹œ์ž‘ํ•˜๊ธฐ

๊ถŒ์žฅ ์ˆœ์„œ: ๊ฐ ํŒŒํŠธ๊ฐ€ ์ด์ „ ํŒŒํŠธ์˜ ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋ฏ€๋กœ ์ˆœ์„œ๋Œ€๋กœ ์™„์„ฑํ•˜์„ธ์š”. ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜ โ†’ ๊ณ ๊ธ‰ ํŒŒํ‹ฐ์…”๋‹ โ†’ ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ๋กœ ์ด์–ด์ง€๋Š” ์ง„ํ–‰์ด ๋ธ”๋ก ๋ ˆ๋ฒจ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์ดํ•ดํ•˜๋Š” ์ตœ์ ์˜ ํ•™์Šต ๊ฒฝ๋กœ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ’ก ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋ธ”๋ก ์—ฐ์‚ฐ์€ ํ”„๋กœ๊ทธ๋ž˜๋จธ ์ƒ์‚ฐ์„ฑ๊ณผ ํ•˜๋“œ์›จ์–ด ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ์ตœ์  ์ง€์ ์„ ๋‚˜ํƒ€๋ƒ…๋‹ˆ๋‹ค - ๊ณ ์ˆ˜์ค€ ์—ฐ์‚ฐ์˜ ๋‹จ์ˆœํ•จ๊ณผ ์„ธ์‹ฌํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ์ €์ˆ˜์ค€ ๊ตฌํ˜„์˜ ํšจ์œจ์„ฑ์„ ๋™์‹œ์— ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ํ˜„๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ํ•ฉํ•œ ์ถ”์ƒํ™” ์ˆ˜์ค€์—์„œ ์‚ฌ๊ณ ํ•˜๋Š” ๋ฒ•์„ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค.

block.sum()์˜ ํ•ต์‹ฌ - ๋ธ”๋ก ๋ ˆ๋ฒจ ๋‚ด์ 

Puzzle 12์—์„œ ์‚ดํŽด๋ณธ ๋‚ด์ ์„ ๋ธ”๋ก ๋ ˆ๋ฒจ sum ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ๊ฐ„๋‹จํ•œ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. ๋ธ”๋ก ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  block.sum()์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ์ž๋™์œผ๋กœ ํ•ฉ์‚ฐํ•˜์—ฌ, ๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์ „์ฒด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์นœ GPU ๋™๊ธฐํ™”๋ฅผ ์–ด๋–ป๊ฒŒ ๋ณ€ํ™˜ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.sum() ์—ฐ์‚ฐ์€ ๋ธ”๋ก ์ „์ฒด ์‹คํ–‰์„ ํ™œ์šฉํ•˜์—ฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฑธ์ณ ์›Œํ”„ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ์ •๊ตํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์œผ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. LLVM ๋ถ„์„์€ ๊ธฐ์ˆ  ๋ถ„์„์„ ์ฐธ๊ณ ํ•˜์„ธ์š”.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • block.sum()์„ ํ™œ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜
  • ๋ธ”๋ก ์ „์ฒด ๋™๊ธฐํ™”์™€ ์Šค๋ ˆ๋“œ ์กฐ์œจ
  • ๋‹จ์ผ ๋ธ”๋ก ๋‚ด ํฌ๋กœ์Šค ์›Œํ”„ ํ†ต์‹ 
  • ๋ณต์žกํ•œ ํŒจํ„ด์—์„œ ๊ฐ„๋‹จํ•œ ํŒจํ„ด์œผ๋กœ์˜ ์„ฑ๋Šฅ ๋ณ€ํ™˜
  • ์Šค๋ ˆ๋“œ 0 ๊ฒฐ๊ณผ ๊ด€๋ฆฌ์™€ ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ

์ˆ˜ํ•™์  ์—ฐ์‚ฐ์€ ๋‚ด์ ์ž…๋‹ˆ๋‹ค: \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]

ํ•˜์ง€๋งŒ ๊ตฌํ˜„ ๊ณผ์ •์—์„œ Mojo์˜ ๋ชจ๋“  ๋ธ”๋ก ๋ ˆ๋ฒจ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์— ์ ์šฉ๋˜๋Š” ๊ธฐ๋ณธ ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128 ์š”์†Œ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (128, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB = 128)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)
  • ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: 128 / WARP_SIZE (NVIDIA์—์„œ 4๊ฐœ, AMD์—์„œ 2๊ฐœ ๋˜๋Š” 4๊ฐœ)

๊ธฐ์กด ๋ฐฉ์‹์˜ ๋ณต์žก์„ฑ (Puzzle 12์—์„œ)

Puzzle 12์˜ ๋ณต์žกํ•œ ๋ฐฉ์‹์„ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ฐฐ๋ฆฌ์–ด, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ–ˆ์Šต๋‹ˆ๋‹ค:

def traditional_dot_product[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    size: Int,
):
    """Traditional dot product using shared memory + barriers + tree reduction.
    Educational but complex - shows the manual coordination needed."""

    var shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[tpb]())
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Each thread computes partial product
    if global_i < size:
        var a_val = rebind[Scalar[dtype]](a[global_i])
        var b_val = rebind[Scalar[dtype]](b[global_i])
        shared[local_i] = a_val * b_val

    barrier()

    # Tree reduction in shared memory - complex but educational
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    # Only thread 0 writes final result
    if local_i == 0:
        output[0] = shared[0]


์ด ๋ฐฉ์‹์ด ๋ณต์žกํ•œ ์ด์œ :

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ธ”๋ก ๋‚ด์—์„œ ์ˆ˜๋™์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๊ด€๋ฆฌ
  • ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด: ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•˜๊ธฐ ์œ„ํ•œ barrier() ํ˜ธ์ถœ
  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ณต์žกํ•œ ๋ฃจํ”„ (64โ†’32โ†’16โ†’8โ†’4โ†’2โ†’1)
  • ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ: ์—ฌ๋Ÿฌ ์›Œํ”„ ๊ฐ„ ๋™๊ธฐํ™”๊ฐ€ ํ•„์š”
  • ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋ก

์ด ๋ฐฉ์‹์€ ์ „์ฒด ๋ธ”๋ก(GPU์— ๋”ฐ๋ผ 2๊ฐœ ๋˜๋Š” 4๊ฐœ ์›Œํ”„์— ๊ฑธ์นœ 128 ์Šค๋ ˆ๋“œ)์—์„œ ๋™์ž‘ํ•˜์ง€๋งŒ, ์ฝ”๋“œ๊ฐ€ ์žฅํ™ฉํ•˜๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๊ธฐ ์‰ฌ์šฐ๋ฉฐ ๋ธ”๋ก ๋ ˆ๋ฒจ GPU ๋™๊ธฐํ™”์— ๋Œ€ํ•œ ๊นŠ์€ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์›Œํ”„ ๋ ˆ๋ฒจ ๊ฐœ์„  (Puzzle 24์—์„œ)

๋ธ”๋ก ๋ ˆ๋ฒจ ์—ฐ์‚ฐ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์—, Puzzle 24์—์„œ warp.sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹จ์ผ ์›Œํ”„ ๋‚ด ๋ฆฌ๋•์…˜์„ ์–ด๋–ป๊ฒŒ ๋‹จ์ˆœํ™”ํ–ˆ๋Š”์ง€ ๋– ์˜ฌ๋ ค ๋ด…์‹œ๋‹ค:

def simple_warp_dot_product[
    InLayoutT: TensorLayout, OutLayoutT: TensorLayout, size: Int
](
    output: TileTensor[mut=True, dtype, OutLayoutT, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutT, MutAnyOrigin],
):
    var a_lt = a.to_layout_tensor()
    var b_lt = b.to_layout_tensor()
    var out_lt = output.to_layout_tensor()
    var global_i = block_dim.x * block_idx.x + thread_idx.x

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = rebind[Scalar[dtype]](a_lt[global_i]) * rebind[
            Scalar[dtype]
        ](b_lt[global_i])

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    var total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        out_lt.store[1](Index(global_i // WARP_SIZE), total)


warp.sum()์ด ๋‹ฌ์„ฑํ•œ ๊ฒƒ:

  • ๋‹จ์ผ ์›Œํ”„ ๋ฒ”์œ„: 32 ์Šค๋ ˆ๋“œ(NVIDIA) ๋˜๋Š” 32/64 ์Šค๋ ˆ๋“œ(AMD) ๋‚ด์—์„œ ๋™์ž‘
  • ํ•˜๋“œ์›จ์–ด ์…”ํ”Œ: ํšจ์œจ์ ์ธ shfl.sync.bfly.b32 ๋ช…๋ น ์‚ฌ์šฉ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ถˆํ•„์š”: ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ์—†์Œ
  • ํ•œ ์ค„ ๋ฆฌ๋•์…˜: total = warp_sum[warp_size=WARP_SIZE](val=partial_product)

๊ทธ๋Ÿฌ๋‚˜ ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค: warp.sum()์€ ๋‹จ์ผ ์›Œํ”„ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ์›Œํ”„๊ฐ€ ํ•„์š”ํ•œ ๋ฌธ์ œ(์˜ˆ: 128 ์Šค๋ ˆ๋“œ ๋ธ”๋ก)์—์„œ๋Š” ์—ฌ์ „ํžˆ ์›Œํ”„ ๊ฐ„ ์กฐ์œจ์„ ์œ„ํ•ด ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด ๋ฐฉ์‹์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ์กด ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p27 --traditional-dot-product
pixi run -e amd p27 --traditional-dot-product
pixi run -e apple p27 --traditional-dot-product
uv run poe p27 --traditional-dot-product

์™„์„ฑํ•  ์ฝ”๋“œ

block.sum() ๋ฐฉ์‹

๋ณต์žกํ•œ ๊ธฐ์กด ๋ฐฉ์‹์„ block.sum()์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ„๋‹จํ•œ ๋ธ”๋ก ์ปค๋„๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

comptime SIZE = 128
comptime TPB = 128
comptime NUM_BINS = 8
comptime in_layout = row_major[SIZE]()
comptime InLayoutType = type_of(in_layout)
comptime out_layout = row_major[1]()
comptime OutLayoutType = type_of(out_layout)
comptime dtype = DType.float32


def block_sum_dot_product[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # FILL IN (roughly 6 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p27/p27.mojo

pixi run p27 --block-sum-dot-product
pixi run -e amd p27 --block-sum-dot-product
uv run poe p27 --block-sum-dot-product

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 128
TPB: 128
Expected result: 1381760.0
Block.sum result: 1381760.0
Block.sum() gives identical results!
Compare the code: 15+ lines of barriers โ†’ 1 line of block.sum()!
Just like warp.sum() but for the entire block
ํŒ

1. ์„ธ ๋‹จ๊ณ„ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

๋ชจ๋“  ๋ธ”๋ก ๋ฆฌ๋•์…˜์€ ๋™์ผํ•œ ๊ฐœ๋…์  ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ๋กœ์ปฌ ๊ธฐ์—ฌ๋ถ„์„ ๊ณ„์‚ฐ
  2. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธ”๋ก ์ „์ฒด ๋ฆฌ๋•์…˜์— ์ฐธ์—ฌ
  3. ์ง€์ •๋œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ฒ˜๋ฆฌ

2. ๋‚ด์  ์ˆ˜ํ•™ ๊ธฐ์–ตํ•˜๊ธฐ

๊ฐ ์Šค๋ ˆ๋“œ๋Š” ๋ฒกํ„ฐ a์™€ b์—์„œ ํ•˜๋‚˜์˜ ์š”์†Œ ์Œ์„ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋“ค์„ ์Šค๋ ˆ๋“œ ๊ฐ„์— ํ•ฉ์‚ฐํ•  ์ˆ˜ ์žˆ๋Š” โ€œ๋ถ€๋ถ„ ๊ฒฐ๊ณผโ€œ๋กœ ํ•ฉ์น˜๋Š” ์—ฐ์‚ฐ์€ ๋ฌด์—‡์ผ๊นŒ์š”?

3. TileTensor ์ธ๋ฑ์‹ฑ ํŒจํ„ด

TileTensor ์š”์†Œ์— ์ ‘๊ทผํ•  ๋•Œ, ์ธ๋ฑ์‹ฑ์ด SIMD ๊ฐ’์„ ๋ฐ˜ํ™˜ํ•œ๋‹ค๋Š” ์ ์„ ๊ธฐ์–ตํ•˜์„ธ์š”. ์‚ฐ์ˆ  ์—ฐ์‚ฐ์„ ์œ„ํ•ด ์Šค์นผ๋ผ ๊ฐ’์„ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

4. block.sum() API ๊ฐœ๋…

ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” - ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ธ”๋ก ํฌ๊ธฐ๋ฅผ ์ง€์ •ํ•˜๋Š” ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ
  • ๊ฒฐ๊ณผ ๋ถ„๋ฐฐ ๋ฐฉ์‹์„ ์œ„ํ•œ ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ (broadcast)
  • ๋ฆฌ๋“€์Šคํ•  ๊ฐ’์„ ๋‹ด์€ ๋Ÿฐํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ

5. ์Šค๋ ˆ๋“œ ์กฐ์œจ ์›์น™

  • ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌํ•  ์œ ํšจํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์„๊นŒ์š”? (ํžŒํŠธ: ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ)
  • ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋กํ•ด์•ผ ํ• ๊นŒ์š”? (ํžŒํŠธ: ์ผ๊ด€๋œ ์„ ํƒ)
  • ๊ทธ ํŠน์ • ์Šค๋ ˆ๋“œ๋ฅผ ์–ด๋–ป๊ฒŒ ์‹๋ณ„ํ• ๊นŒ์š”? (ํžŒํŠธ: ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ)

์†”๋ฃจ์…˜

def block_sum_dot_product[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Each thread computes partial product
    var partial_product: Scalar[dtype] = 0.0
    if global_i < size:
        # TileTensor indexing `[0]` returns the underlying SIMD value
        partial_product = a[global_i][0] * b[global_i][0]

    # The magic: block.sum() replaces 15+ lines of manual reduction!
    # Just like warp.sum() but for the entire block
    var total = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](partial_product)
    )

    # Only thread 0 writes the result
    if local_i == 0:
        output[0] = total[0]


block.sum() ์ปค๋„์€ ๋ณต์žกํ•œ ๋ธ”๋ก ๋™๊ธฐํ™”์—์„œ ์ •๊ตํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์œผ๋กœ์˜ ๊ทผ๋ณธ์ ์ธ ๋ณ€ํ™˜์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ธฐ์กด ๋ฐฉ์‹์—์„œ ์‚ฌ๋ผ์ง„ ๊ฒƒ๋“ค:

  • 15์ค„ ์ด์ƒ โ†’ 8์ค„: ํš๊ธฐ์ ์ธ ์ฝ”๋“œ ์ถ•์†Œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ๋ถˆํ•„์š”
  • 7ํšŒ ์ด์ƒ์˜ barrier() ํ˜ธ์ถœ: ๋ช…์‹œ์  ๋™๊ธฐํ™” ์ œ๋กœ
  • ๋ณต์žกํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: ๋‹จ์ผ ํ•จ์ˆ˜ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ์ธ๋ฑ์‹ฑ: ์™„์ „ํžˆ ์ œ๊ฑฐ
  • ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ: ์ตœ์ ํ™”๋œ ๊ตฌํ˜„์ด ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ

๋ธ”๋ก ์ „์ฒด ์‹คํ–‰ ๋ชจ๋ธ:

๋ธ”๋ก ์Šค๋ ˆ๋“œ (128 ์Šค๋ ˆ๋“œ, 4๊ฐœ ์›Œํ”„):
์›Œํ”„ 0 (์Šค๋ ˆ๋“œ 0-31):
  ์Šค๋ ˆ๋“œ 0: partial_product = a[0] * b[0] = 0.0
  ์Šค๋ ˆ๋“œ 1: partial_product = a[1] * b[1] = 2.0
  ...
  ์Šค๋ ˆ๋“œ 31: partial_product = a[31] * b[31] = 1922.0

์›Œํ”„ 1 (์Šค๋ ˆ๋“œ 32-63):
  ์Šค๋ ˆ๋“œ 32: partial_product = a[32] * b[32] = 2048.0
  ...

์›Œํ”„ 2 (์Šค๋ ˆ๋“œ 64-95):
  ์Šค๋ ˆ๋“œ 64: partial_product = a[64] * b[64] = 8192.0
  ...

์›Œํ”„ 3 (์Šค๋ ˆ๋“œ 96-127):
  ์Šค๋ ˆ๋“œ 96: partial_product = a[96] * b[96] = 18432.0
  ์Šค๋ ˆ๋“œ 127: partial_product = a[127] * b[127] = 32258.0

block.sum() ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ:
๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ 0.0 + 2.0 + 1922.0 + 2048.0 + ... + 32258.0 = 1381760.0
์Šค๋ ˆ๋“œ 0์ด ์ˆ˜์‹  โ†’ total = 1381760.0 (broadcast=False์ผ ๋•Œ)

๋ฐฐ๋ฆฌ์–ด ์—†์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  1. ๋ธ”๋ก ์ „์ฒด ์‹คํ–‰: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ๋‚ด์—์„œ ๋ก์Šคํ…์œผ๋กœ ๊ฐ ๋ช…๋ น์„ ์‹คํ–‰
  2. ๋‚ด์žฅ ๋™๊ธฐํ™”: block.sum() ๊ตฌํ˜„์ด ๋™๊ธฐํ™”๋ฅผ ๋‚ด๋ถ€์ ์œผ๋กœ ์ฒ˜๋ฆฌ
  3. ํฌ๋กœ์Šค ์›Œํ”„ ํ†ต์‹ : ๋ธ”๋ก ๋‚ด ์›Œํ”„ ๊ฐ„ ์ตœ์ ํ™”๋œ ํ†ต์‹ 
  4. ์กฐ์œจ๋œ ๊ฒฐ๊ณผ ์ „๋‹ฌ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 

warp.sum() (Puzzle 24)๊ณผ์˜ ๋น„๊ต:

  • ์›Œํ”„ ๋ฒ”์œ„: warp.sum()์€ 32/64 ์Šค๋ ˆ๋“œ(๋‹จ์ผ ์›Œํ”„) ๋‚ด์—์„œ ๋™์ž‘
  • ๋ธ”๋ก ๋ฒ”์œ„: block.sum()์€ ์ „์ฒด ๋ธ”๋ก(์—ฌ๋Ÿฌ ์›Œํ”„)์— ๊ฑธ์ณ ๋™์ž‘
  • ๋™์ผํ•œ ๋‹จ์ˆœํ•จ: ๋‘˜ ๋‹ค ๋ณต์žกํ•œ ์ˆ˜๋™ ๋ฆฌ๋•์…˜์„ ํ•œ ์ค„ ํ˜ธ์ถœ๋กœ ๋Œ€์ฒด
  • ์ž๋™ ์กฐ์œจ: block.sum()์€ warp.sum()์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์—†๋Š” ํฌ๋กœ์Šค ์›Œํ”„ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ

๊ธฐ์ˆ  ๋ถ„์„: block.sum()์€ ์‹ค์ œ๋กœ ๋ฌด์—‡์œผ๋กœ ์ปดํŒŒ์ผ๋ ๊นŒ?

block.sum()์ด ์‹ค์ œ๋กœ ๋ฌด์—‡์„ ์ƒ์„ฑํ•˜๋Š”์ง€ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•ด, ๋””๋ฒ„๊ทธ ์ •๋ณด์™€ ํ•จ๊ป˜ ํผ์ฆ์„ ์ปดํŒŒ์ผํ–ˆ์Šต๋‹ˆ๋‹ค:

pixi run mojo build --emit llvm --debug-level=line-tables solutions/p27/p27.mojo -o solutions/p27/p27.ll

์ด๋ ‡๊ฒŒ ์ƒ์„ฑ๋œ LLVM ํŒŒ์ผ solutions/p27/p27.ll์—๋Š”, ํ˜ธํ™˜ NVIDIA GPU์—์„œ ์‹ค์ œ GPU ๋ช…๋ น์„ ๋ณด์—ฌ์ฃผ๋Š” PTX ์–ด์…ˆ๋ธ”๋ฆฌ๊ฐ€ ๋‚ด์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๋ฐœ๊ฒฌ 1: ๋‹จ์ผ ๋ช…๋ น์ด ์•„๋‹ˆ๋‹ค

block.sum()์€ ์•ฝ 20๊ฐœ ์ด์ƒ์˜ PTX ๋ช…๋ น์œผ๋กœ ์ปดํŒŒ์ผ๋˜๋ฉฐ, 2๋‹จ๊ณ„ ๋ฆฌ๋•์…˜์œผ๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜ (๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ)

shfl.sync.bfly.b32 %r23, %r46, 16, 31, -1;   // ์˜คํ”„์…‹ 16์œผ๋กœ ์…”ํ”Œ
add.f32            %r24, %r46, %r23;         // ์…”ํ”Œ๋œ ๊ฐ’์„ ํ•ฉ์‚ฐ
shfl.sync.bfly.b32 %r25, %r24, 8, 31, -1;    // ์˜คํ”„์…‹ 8๋กœ ์…”ํ”Œ
add.f32            %r26, %r24, %r25;         // ์…”ํ”Œ๋œ ๊ฐ’์„ ํ•ฉ์‚ฐ
// ... ์˜คํ”„์…‹ 4, 2, 1์— ๋Œ€ํ•ด ๊ณ„์†

2๋‹จ๊ณ„: ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ

shr.u32            %r32, %r1, 5;             // ์›Œํ”„ ID๋ฅผ ๊ณ„์‚ฐ
mov.b32            %r34, _global_alloc_$__gpu_shared_mem; // ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
bar.sync           0;                        // ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”
// ... ํฌ๋กœ์Šค ์›Œํ”„ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๋˜ ๋‹ค๋ฅธ ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ ์‹œํ€€์Šค

๋ฐœ๊ฒฌ 2: ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™” ๊ตฌํ˜„

  • ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜๋ณด๋‹ค ํšจ์œจ์ 
  • ์ž๋™ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜: ํฌ๋กœ์Šค ์›Œํ”„ ๋™๊ธฐํ™”๋ฅผ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌ
  • ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ „๋žต์ ์œผ๋กœ ์‚ฌ์šฉ
  • ์•„ํ‚คํ…์ฒ˜ ์ธ์‹: ๋™์ผํ•œ API๊ฐ€ NVIDIA(32 ์Šค๋ ˆ๋“œ ์›Œํ”„)์™€ AMD(32 ๋˜๋Š” 64 ์Šค๋ ˆ๋“œ ์›Œํ”„)์—์„œ ๋™์ž‘

๋ฐœ๊ฒฌ 3: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„ ๋ถ„์„

๋ถ„์„ ์ ‘๊ทผ ๋ฐฉ์‹:

  1. ๋ฐ”์ด๋„ˆ๋ฆฌ ELF ์„น์…˜(.nv_debug_ptx_txt)์—์„œ PTX ์–ด์…ˆ๋ธ”๋ฆฌ๋ฅผ ํ™•์ธ
  2. ๊ฐœ๋ณ„ ๋ช…๋ น ์ˆ˜๋ฅผ ์„ธ๊ธฐ๋ณด๋‹ค ์•Œ๊ณ ๋ฆฌ์ฆ˜์  ์ฐจ์ด๋ฅผ ์‹๋ณ„

๊ด€์ฐฐ๋œ ์ฃผ์š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ฐจ์ด:

  • ๊ธฐ์กด ๋ฐฉ์‹: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ + ๋‹ค์ˆ˜์˜ bar.sync ํ˜ธ์ถœ
  • block.sum(): ๋ฒ„ํ„ฐํ”Œ๋ผ์ด ์…”ํ”Œ ํŒจํ„ด + ์ตœ์ ํ™”๋œ ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ

์„ฑ๋Šฅ ์ด์ ์€ ๋ช…๋ น ์ˆ˜๋‚˜ ๋งˆ๋ฒ• ๊ฐ™์€ ํ•˜๋“œ์›จ์–ด๊ฐ€ ์•„๋‹ˆ๋ผ ์ •๊ตํ•˜๊ฒŒ ์ตœ์ ํ™”๋œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํƒ(๋ฒ„ํ„ฐํ”Œ๋ผ์ด > ํŠธ๋ฆฌ)์—์„œ ๋น„๋กฏ๋ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ Mojo gpu ๋ชจ๋“ˆ์˜ block.mojo๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ

block.sum() vs ๊ธฐ์กด ๋ฐฉ์‹:

  • ์ฝ”๋“œ ๋‹จ์ˆœํ•จ: ๋ฆฌ๋•์…˜ ๋ถ€๋ถ„์ด 15์ค„ ์ด์ƒ โ†’ 1์ค„๋กœ
  • ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ถˆํ•„์š”
  • ๋™๊ธฐํ™”: ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ๋ถˆํ•„์š”
  • ํ™•์žฅ์„ฑ: ํ•˜๋“œ์›จ์–ด ํ•œ๋„ ๋‚ด์—์„œ ๋ชจ๋“  ๋ธ”๋ก ํฌ๊ธฐ์— ๋™์ž‘

block.sum() vs warp.sum():

  • ๋ฒ”์œ„: ๋ธ”๋ก ์ „์ฒด(128 ์Šค๋ ˆ๋“œ) vs ์›Œํ”„ ์ „์ฒด(32 ์Šค๋ ˆ๋“œ)
  • ์šฉ๋„: ์ „์ฒด ๋ธ”๋ก์— ๊ฑธ์นœ ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ•  ๋•Œ
  • ํŽธ์˜์„ฑ: ๋™์ผํ•œ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ, ๋‹ค๋ฅธ ๊ทœ๋ชจ

block.sum()์„ ์‚ฌ์šฉํ•ด์•ผ ํ•  ๋•Œ:

  • ๋‹จ์ผ ๋ธ”๋ก ๋ฌธ์ œ: ๋ชจ๋“  ๋ฐ์ดํ„ฐ๊ฐ€ ํ•˜๋‚˜์˜ ๋ธ”๋ก์— ๋“ค์–ด๊ฐˆ ๋•Œ
  • ๋ธ”๋ก ๋ ˆ๋ฒจ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋ฆฌ๋•์…˜์ด ํ•„์š”ํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
  • ํ™•์žฅ์„ฑ๋ณด๋‹ค ํŽธ์˜์„ฑ: ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ฐฉ์‹๋ณด๋‹ค ๋‹จ์ˆœ

์ด์ „ ํผ์ฆ๊ณผ์˜ ๊ด€๊ณ„

Puzzle 12 (๊ธฐ์กด ๋ฐฉ์‹)์—์„œ:

๋ณต์žกํ•จ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜
โ†“
๋‹จ์ˆœํ•จ: block.sum() ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ณธ ์š”์†Œ

Puzzle 24 (warp.sum())์—์„œ:

์›Œํ”„ ๋ ˆ๋ฒจ: warp.sum() - 32 ์Šค๋ ˆ๋“œ (๋‹จ์ผ ์›Œํ”„)
โ†“
๋ธ”๋ก ๋ ˆ๋ฒจ: block.sum() - 128 ์Šค๋ ˆ๋“œ (์—ฌ๋Ÿฌ ์›Œํ”„)

3๋‹จ๊ณ„ ์ง„ํ–‰:

  1. ์ˆ˜๋™ ๋ฆฌ๋•์…˜ (Puzzle 12): ๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด + ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜
  2. ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ (Puzzle 24): warp.sum() - ๋‹จ์ˆœํ•˜์ง€๋งŒ ๋‹จ์ผ ์›Œํ”„๋กœ ์ œํ•œ
  3. ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ (Puzzle 27): block.sum() - ์›Œํ”„์˜ ๋‹จ์ˆœํ•จ์„ ์—ฌ๋Ÿฌ ์›Œํ”„๋กœ ํ™•์žฅ

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.sum()์€ warp.sum()์˜ ๋‹จ์ˆœํ•จ์„ ์ œ๊ณตํ•˜๋ฉด์„œ ์ „์ฒด ๋ธ”๋ก์œผ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค. ์ˆ˜๋™์œผ๋กœ ๊ตฌํ˜„ํ•ด์•ผ ํ–ˆ๋˜ ๋ณต์žกํ•œ ํฌ๋กœ์Šค ์›Œํ”„ ์กฐ์œจ์„ ์ž๋™์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

block.sum() ์—ฐ์‚ฐ์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ๋ธ”๋ก ์—ฐ์‚ฐ์€ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…์„ ์ „์ฒด ์Šค๋ ˆ๋“œ ๋ธ”๋ก์œผ๋กœ ํ™•์žฅํ•˜์—ฌ, ์—ฌ๋Ÿฌ ์›Œํ”„์— ๊ฑธ์ณ ๋™์‹œ์— ๋™์ž‘ํ•˜๋ฉด์„œ ๋ณต์žกํ•œ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋Œ€์ฒดํ•˜๋Š” ์ตœ์ ํ™”๋œ ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. warp.sum()์ด ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ™”ํ•œ ๊ฒƒ์ฒ˜๋Ÿผ, block.sum()์€ ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š๊ณ  ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ™”ํ•ฉ๋‹ˆ๋‹ค.

block.prefix_sum()๊ณผ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜

์ด ํผ์ฆ์€ ๋ธ”๋ก ๋ ˆ๋ฒจ block.prefix_sum ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ์„ ์œ„ํ•œ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ์š”์†Œ๊ฐ€ ์†ํ•  ๋Œ€์ƒ ๊ตฌ๊ฐ„์„ ๊ฒฐ์ •ํ•œ ๋‹ค์Œ, block.prefix_sum()์„ ์ ์šฉํ•˜์—ฌ ํŠน์ • ๊ตฌ๊ฐ„์˜ ์š”์†Œ๋ฅผ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•œ ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค. ๋ˆ„์  ํ•ฉ์ด ๋‹จ์ˆœํ•œ ๋ฆฌ๋•์…˜์„ ๋„˜์–ด ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.prefix_sum() ์—ฐ์‚ฐ์€ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฑธ์ณ ์ผ์น˜ํ•˜๋Š” ์š”์†Œ์˜ ๋ˆ„์  ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋‹ค๋ฃจ๋Š” ๋‚ด์šฉ:

  • block.prefix_sum()์„ ํ™œ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ˆ„์  ํ•ฉ
  • ๋ˆ„์  ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง๊ณผ ์ถ”์ถœ
  • ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹ ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๋ธ”๋ก ์ „์ฒด ์กฐ์œจ์„ ํ†ตํ•œ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜
  • ๋น„ํฌํ•จ(exclusive) vs ํฌํ•จ(inclusive) ๋ˆ„์  ํ•ฉ ํŒจํ„ด

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํŠน์ • ๊ฐ’ ๋ฒ”์œ„(๊ตฌ๊ฐ„)์— ์†ํ•˜๋Š” ์š”์†Œ๋ฅผ ์ถ”์ถœํ•˜์—ฌ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค: \[\Large \text{Bin}_k = \{x_i: k/N \leq x_i < (k+1)/N\}\]

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ์š”์†Œ๊ฐ€ ์†ํ•˜๋Š” ๊ตฌ๊ฐ„์„ ๊ฒฐ์ •ํ•˜๊ณ , block.prefix_sum()์ด ๋ณ‘๋ ฌ ์ถ”์ถœ์„ ์กฐ์œจํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128 ์š”์†Œ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (128, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB = 128)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๊ตฌ๊ฐ„ ์ˆ˜: NUM_BINS = 8 (๋ฒ”์œ„ [0.0, 0.125), [0.125, 0.25) ๋“ฑ)
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (1D row-major)
  • ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: 128 / WARP_SIZE (GPU์— ๋”ฐ๋ผ 2๊ฐœ ๋˜๋Š” 4๊ฐœ)

๋„์ „ ๊ณผ์ œ: ๋ณ‘๋ ฌ ๊ตฌ๊ฐ„ ์ถ”์ถœ

๊ธฐ์กด์˜ ์ˆœ์ฐจ์  ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ์„ฑ์€ ์š”์†Œ๋ฅผ ํ•˜๋‚˜์”ฉ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค:

# ์ˆœ์ฐจ์  ๋ฐฉ์‹ - ๋ณ‘๋ ฌํ™”๊ฐ€ ์–ด๋ ค์›€
histogram = [[] for _ in range(NUM_BINS)]
for element in data:
    bin_id = int(element * NUM_BINS)  # ๊ตฌ๊ฐ„ ๊ฒฐ์ •
    histogram[bin_id].append(element)  # ์ˆœ์ฐจ์  ์ถ”๊ฐ€

๋‹จ์ˆœํ•œ GPU ๋ณ‘๋ ฌํ™”์˜ ๋ฌธ์ œ์ :

  • ๊ฒฝ์Ÿ ์ƒํƒœ: ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ตฌ๊ฐ„์— ๋™์‹œ์— ์“ฐ๊ธฐ
  • ๋น„์ •๋ ฌ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ์Šค๋ ˆ๋“œ๋“ค์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์œ„์น˜์— ์ ‘๊ทผ
  • ๋ถ€ํ•˜ ๋ถˆ๊ท ํ˜•: ์ผ๋ถ€ ๊ตฌ๊ฐ„์— ํ›จ์”ฌ ๋งŽ์€ ์š”์†Œ๊ฐ€ ๋ชฐ๋ฆด ์ˆ˜ ์žˆ์Œ
  • ๋ณต์žกํ•œ ๋™๊ธฐํ™”: ๋ฐฐ๋ฆฌ์–ด์™€ ์›์ž์  ์—ฐ์‚ฐ์ด ํ•„์š”

๊ณ ๊ธ‰ ๋ฐฉ์‹: block.prefix_sum() ์กฐ์œจ

๋ณต์žกํ•œ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์„ ์กฐ์œจ๋œ ์ถ”์ถœ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

์™„์„ฑํ•  ์ฝ”๋“œ

block.prefix_sum() ๋ฐฉ์‹

block.prefix_sum()์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ‘๋ ฌ ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

comptime bin_layout = row_major[SIZE]()  # Max SIZE elements per bin
comptime BinLayoutType = type_of(bin_layout)


def block_histogram_bin_extract[
    tpb: Int
](
    input_data: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    bin_output: TileTensor[mut=True, dtype, BinLayoutType, MutAnyOrigin],
    count_output: TileTensor[
        mut=True, DType.int32, OutLayoutType, MutAnyOrigin
    ],
    size: Int,
    target_bin: Int,
    num_bins: Int,
):
    """Parallel histogram using block.prefix_sum() for bin extraction.

    This demonstrates advanced parallel filtering and extraction:
    1. Each thread determines which bin its element belongs to
    2. Use block.prefix_sum() to compute write positions for target_bin elements
    3. Extract and pack only elements belonging to target_bin
    """

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Step 1: Each thread determines its bin and element value

    # FILL IN (roughly 9 lines)

    # Step 2: Create predicate for target bin extraction

    # FILL IN (roughly 3 lines)

    # Step 3: Use block.prefix_sum() for parallel bin extraction!
    # This computes where each thread should write within the target bin

    # FILL IN (1 line)

    # Step 4: Extract and pack elements belonging to target_bin

    # FILL IN (roughly 2 lines)

    # Step 5: Final thread computes total count for this bin

    # FILL IN (roughly 3 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p27/p27.mojo

ํŒ

1. ํ•ต์‹ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ (์ด์ „ ํผ์ฆ์—์„œ ์ ์šฉ)

block_sum_dot_product์™€ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋‹ค์Œ ํ•ต์‹ฌ ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x

ํ•จ์ˆ˜๋Š” 5๊ฐ€์ง€ ์ฃผ์š” ๋‹จ๊ณ„(์ด ์•ฝ 15-20์ค„)๋กœ ๊ตฌ์„ฑ๋ฉ๋‹ˆ๋‹ค:

  1. ์š”์†Œ๋ฅผ ๋กœ๋“œํ•˜๊ณ  ๊ตฌ๊ฐ„์„ ๊ฒฐ์ •
  2. ๋Œ€์ƒ ๊ตฌ๊ฐ„์— ๋Œ€ํ•œ ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ
  3. ํ”„๋ ˆ๋””์ผ€์ดํŠธ์— block.prefix_sum() ์‹คํ–‰
  4. ๊ณ„์‚ฐ๋œ ์˜คํ”„์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ
  5. ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๊ฐ€ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐ

2. ๊ตฌ๊ฐ„ ๊ณ„์‚ฐ (math.floor ์‚ฌ์šฉ)

Float32 ๊ฐ’์„ ๊ตฌ๊ฐ„์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๋ ค๋ฉด:

my_value = input_data[global_i][0]  # ๋‚ด์ ์—์„œ์ฒ˜๋Ÿผ SIMD ์ถ”์ถœ
bin_number = Int(floor(my_value * num_bins))

๊ฒฝ๊ณ„ ์‚ฌ๋ก€ ์ฒ˜๋ฆฌ: ์ •ํ™•ํžˆ 1.0์ธ ๊ฐ’์€ ๊ตฌ๊ฐ„ NUM_BINS์— ๋“ค์–ด๊ฐ€์ง€๋งŒ, ์‹ค์ œ ๊ตฌ๊ฐ„์€ 0๋ถ€ํ„ฐ NUM_BINS-1๊นŒ์ง€์ž…๋‹ˆ๋‹ค. if ๋ฌธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ๋Œ€ ๊ตฌ๊ฐ„์„ ์ œํ•œํ•˜์„ธ์š”.

3. ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ

์ด ์Šค๋ ˆ๋“œ์˜ ์š”์†Œ๊ฐ€ target_bin์— ์†ํ•˜๋Š”์ง€๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” ์ •์ˆ˜ ๋ณ€์ˆ˜(0 ๋˜๋Š” 1)๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค:

var belongs_to_target: Int = 0
if (thread_has_valid_element) and (my_bin == target_bin):
    belongs_to_target = 1

์ด๊ฒƒ์ด ํ•ต์‹ฌ ํ†ต์ฐฐ์ž…๋‹ˆ๋‹ค: ๋ˆ„์  ํ•ฉ์ด ์ด ์ด์ง„ ํ”Œ๋ž˜๊ทธ์— ์ž‘์šฉํ•˜์—ฌ ์œ„์น˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค!

4. block.prefix_sum() ํ˜ธ์ถœ ํŒจํ„ด

๋ฌธ์„œ์— ๋”ฐ๋ฅด๋ฉด ํ˜ธ์ถœ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

offset = block.prefix_sum[
    dtype=DType.int32,         # ์ •์ˆ˜ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๋กœ ์ž‘์—…
    block_size=tpb,            # block.sum()๊ณผ ๋™์ผ
    exclusive=True             # ํ•ต์‹ฌ: ๊ฐ ์Šค๋ ˆ๋“œ ์ด์ „์˜ ์œ„์น˜๋ฅผ ์ œ๊ณต
](val=SIMD[DType.int32, 1](my_predicate_value))

์™œ ๋น„ํฌํ•จ(exclusive)์ธ๊ฐ€? ์œ„์น˜ 5์—์„œ ํ”„๋ ˆ๋””์ผ€์ดํŠธ=1์ธ ์Šค๋ ˆ๋“œ๋Š”, ์ž์‹  ์•ž์— 4๊ฐœ์˜ ์š”์†Œ๊ฐ€ ์žˆ์—ˆ๋‹ค๋ฉด output[4]์— ์จ์•ผ ํ•ฉ๋‹ˆ๋‹ค.

5. ์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ ํŒจํ„ด

belongs_to_target == 1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ธฐ๋กํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

if belongs_to_target == 1:
    bin_output[Int(offset[0])] = my_value  # ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด SIMD๋ฅผ Int๋กœ ๋ณ€ํ™˜

์ด๊ฒƒ์€ Puzzle 12์˜ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ ํŒจํ„ด๊ณผ ๋™์ผํ•˜์ง€๋งŒ, ์กฐ๊ฑด์ด โ€œ๋Œ€์ƒ ๊ตฌ๊ฐ„์— ์†ํ•˜๋Š”์ง€โ€œ๋กœ ๋ฐ”๋€Œ์—ˆ์Šต๋‹ˆ๋‹ค.

6. ์ตœ์ข… ๊ฐœ์ˆ˜ ๊ณ„์‚ฐ

๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ(์Šค๋ ˆ๋“œ 0์ด ์•„๋‹˜!)๊ฐ€ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค:

if local_i == tpb - 1:  # ๋ธ”๋ก์˜ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ
    total_count = offset[0] + belongs_to_target  # ํฌํ•จ = ๋น„ํฌํ•จ + ์ž์‹ ์˜ ๊ธฐ์—ฌ๋ถ„
    count_output[0] = total_count

์™œ ๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ์ธ๊ฐ€? ๊ฐ€์žฅ ๋†’์€ offset ๊ฐ’์„ ๊ฐ€์ง€๋ฏ€๋กœ, offset + ๊ธฐ์—ฌ๋ถ„์ด ์ด ๊ฐœ์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

7. ๋ฐ์ดํ„ฐ ํƒ€์ž…๊ณผ ๋ณ€ํ™˜

์ด์ „ ํผ์ฆ์˜ ํŒจํ„ด์„ ๊ธฐ์–ตํ•˜์„ธ์š”:

  • TileTensor ์ธ๋ฑ์‹ฑ์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: input_data[i][0]
  • block.prefix_sum()์€ SIMD๋ฅผ ๋ฐ˜ํ™˜: offset[0]์œผ๋กœ ์ถ”์ถœ
  • ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์€ Int๊ฐ€ ํ•„์š”: bin_output[...]์— Int(offset[0])

block.prefix_sum() ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p27 --histogram
pixi run -e amd p27 --histogram
pixi run -e apple p27 --histogram
uv run poe p27 --histogram

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 128
TPB: 128
NUM_BINS: 8

Input sample: 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 ...

=== Processing Bin 0 (range [ 0.0 , 0.125 )) ===
Bin 0 count: 26
Bin 0 extracted elements: 0.0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 ...

=== Processing Bin 1 (range [ 0.125 , 0.25 )) ===
Bin 1 count: 24
Bin 1 extracted elements: 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 ...

=== Processing Bin 2 (range [ 0.25 , 0.375 )) ===
Bin 2 count: 26
Bin 2 extracted elements: 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 ...

=== Processing Bin 3 (range [ 0.375 , 0.5 )) ===
Bin 3 count: 22
Bin 3 extracted elements: 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 ...

=== Processing Bin 4 (range [ 0.5 , 0.625 )) ===
Bin 4 count: 13
Bin 4 extracted elements: 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 ...

=== Processing Bin 5 (range [ 0.625 , 0.75 )) ===
Bin 5 count: 12
Bin 5 extracted elements: 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 ...

=== Processing Bin 6 (range [ 0.75 , 0.875 )) ===
Bin 6 count: 5
Bin 6 extracted elements: 0.75 0.76 0.77 0.78 0.79

=== Processing Bin 7 (range [ 0.875 , 1.0 )) ===
Bin 7 count: 0
Bin 7 extracted elements:

์†”๋ฃจ์…˜

def block_histogram_bin_extract[
    tpb: Int
](
    input_data: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    bin_output: TileTensor[mut=True, dtype, BinLayout, MutAnyOrigin],
    count_output: TileTensor[mut=True, DType.int32, OutLayout, MutAnyOrigin],
    size: Int,
    target_bin: Int,
    num_bins: Int,
):
    """Parallel histogram using block.prefix_sum() for bin extraction.

    This demonstrates advanced parallel filtering and extraction:
    1. Each thread determines which bin its element belongs to
    2. Use block.prefix_sum() to compute write positions for target_bin elements
    3. Extract and pack only elements belonging to target_bin
    """

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Step 1: Each thread determines its bin and element value
    var my_value: Scalar[dtype] = 0.0
    var my_bin: Int = -1

    if global_i < size:
        # `[0]` returns the underlying SIMD value
        my_value = input_data[global_i][0]
        # Bin values [0.0, 1.0) into num_bins buckets
        my_bin = Int(floor(my_value * Scalar[dtype](num_bins)))
        # Clamp to valid range
        if my_bin >= num_bins:
            my_bin = num_bins - 1
        if my_bin < 0:
            my_bin = 0

    # Step 2: Create predicate for target bin extraction
    var belongs_to_target: Int = 0
    if global_i < size and my_bin == target_bin:
        belongs_to_target = 1

    # Step 3: Use block.prefix_sum() for parallel bin extraction!
    # This computes where each thread should write within the target bin
    var write_offset = block.prefix_sum[
        dtype=DType.int32, block_size=tpb, exclusive=True
    ](val=SIMD[DType.int32, 1](belongs_to_target))

    # Step 4: Extract and pack elements belonging to target_bin
    if belongs_to_target == 1:
        bin_output[Int(write_offset[0])] = my_value

    # Step 5: Final thread computes total count for this bin
    if local_i == tpb - 1:
        # Inclusive sum = exclusive sum + my contribution
        var total_count = write_offset[0] + Int32(belongs_to_target)
        count_output[0] = total_count


block.prefix_sum() ์ปค๋„์€ ์ด์ „ ํผ์ฆ์˜ ๊ฐœ๋…์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์กฐ์œจ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋‹จ๊ณ„๋ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

1๋‹จ๊ณ„: ์š”์†Œ ์ฒ˜๋ฆฌ (Puzzle 12 ๋‚ด์ ๊ณผ ์œ ์‚ฌ)

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ (์ต์ˆ™ํ•œ ํŒจํ„ด):
  global_i = block_dim.x * block_idx.x + thread_idx.x  // ์ „์—ญ ์š”์†Œ ์ธ๋ฑ์Šค
  local_i = thread_idx.x                               // ๋กœ์ปฌ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค

์š”์†Œ ๋กœ๋”ฉ (TileTensor ํŒจํ„ด๊ณผ ๋™์ผ):
  ์Šค๋ ˆ๋“œ 0:  my_value = input_data[0][0] = 0.00
  ์Šค๋ ˆ๋“œ 1:  my_value = input_data[1][0] = 0.01
  ์Šค๋ ˆ๋“œ 13: my_value = input_data[13][0] = 0.13
  ์Šค๋ ˆ๋“œ 25: my_value = input_data[25][0] = 0.25
  ...

2๋‹จ๊ณ„: ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜ (์ƒˆ๋กœ์šด ๊ฐœ๋…)

floor ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•œ ๊ตฌ๊ฐ„ ๊ณ„์‚ฐ:
  ์Šค๋ ˆ๋“œ 0:  my_bin = Int(floor(0.00 * 8)) = 0  // ๊ฐ’ [0.000, 0.125) โ†’ ๊ตฌ๊ฐ„ 0
  ์Šค๋ ˆ๋“œ 1:  my_bin = Int(floor(0.01 * 8)) = 0  // ๊ฐ’ [0.000, 0.125) โ†’ ๊ตฌ๊ฐ„ 0
  ์Šค๋ ˆ๋“œ 13: my_bin = Int(floor(0.13 * 8)) = 1  // ๊ฐ’ [0.125, 0.250) โ†’ ๊ตฌ๊ฐ„ 1
  ์Šค๋ ˆ๋“œ 25: my_bin = Int(floor(0.25 * 8)) = 2  // ๊ฐ’ [0.250, 0.375) โ†’ ๊ตฌ๊ฐ„ 2
  ...

3๋‹จ๊ณ„: ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ ์ƒ์„ฑ (ํ•„ํ„ฐ๋ง ํŒจํ„ด)

target_bin=0์— ๋Œ€ํ•ด ์ถ”์ถœ ๋งˆ์Šคํฌ ์ƒ์„ฑ:
  ์Šค๋ ˆ๋“œ 0:  belongs_to_target = 1  (๊ตฌ๊ฐ„ 0 == ๋Œ€์ƒ 0)
  ์Šค๋ ˆ๋“œ 1:  belongs_to_target = 1  (๊ตฌ๊ฐ„ 0 == ๋Œ€์ƒ 0)
  ์Šค๋ ˆ๋“œ 13: belongs_to_target = 0  (๊ตฌ๊ฐ„ 1 != ๋Œ€์ƒ 0)
  ์Šค๋ ˆ๋“œ 25: belongs_to_target = 0  (๊ตฌ๊ฐ„ 2 != ๋Œ€์ƒ 0)
  ...

์ด์ง„ ๋ฐฐ์—ด ์ƒ์„ฑ: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...]

4๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ˆ„์  ํ•ฉ (๋งˆ๋ฒ•์ด ์ผ์–ด๋‚˜๋Š” ๊ณณ!)

ํ”„๋ ˆ๋””์ผ€์ดํŠธ์— block.prefix_sum[exclusive=True] ์ ์šฉ:
์ž…๋ ฅ:      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, ...]
๋น„ํฌํ•จ:    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12, -, -, -, ...]
                                                      ^
                                                 ์ค‘์š”ํ•˜์ง€ ์•Š์Œ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ถœ๋ ฅ ๋ฐฐ์—ด์—์„œ ์ž์‹ ์˜ ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ๋ฐ›์Šต๋‹ˆ๋‹ค!

5๋‹จ๊ณ„: ์กฐ์œจ๋œ ์ถ”์ถœ (์กฐ๊ฑด๋ถ€ ์“ฐ๊ธฐ)

belongs_to_target=1์ธ ์Šค๋ ˆ๋“œ๋งŒ ๊ธฐ๋ก:
  ์Šค๋ ˆ๋“œ 0:  bin_output[0] = 0.00   // write_offset[0] = 0 ์‚ฌ์šฉ
  ์Šค๋ ˆ๋“œ 1:  bin_output[1] = 0.01   // write_offset[1] = 1 ์‚ฌ์šฉ
  ์Šค๋ ˆ๋“œ 12: bin_output[12] = 0.12  // write_offset[12] = 12 ์‚ฌ์šฉ
  ์Šค๋ ˆ๋“œ 13: (๊ธฐ๋ก ์•ˆ ํ•จ)             // belongs_to_target = 0
  ์Šค๋ ˆ๋“œ 25: (๊ธฐ๋ก ์•ˆ ํ•จ)             // belongs_to_target = 0
  ...

๊ฒฐ๊ณผ: [0.00, 0.01, 0.02, ..., 0.12, ???, ???, ...] // ๋นˆํ‹ˆ์—†์ด ์ฑ„์›Œ์ง!

6๋‹จ๊ณ„: ๊ฐœ์ˆ˜ ๊ณ„์‚ฐ (block.sum() ํŒจํ„ด๊ณผ ์œ ์‚ฌ)

๋งˆ์ง€๋ง‰ ์Šค๋ ˆ๋“œ๊ฐ€ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ 0์ด ์•„๋‹˜!):
  if local_i == tpb - 1:  // ์ด ๊ฒฝ์šฐ ์Šค๋ ˆ๋“œ 127
      total = write_offset[0] + belongs_to_target  // ํฌํ•จ ํ•ฉ ๊ณต์‹
      count_output[0] = total

์ด ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ๋™์ž‘ํ•˜๋Š” ์ด์œ :

Puzzle 12 (๊ธฐ์กด ๋‚ด์ )๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ๋™์ผํ•œ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: global_i์™€ local_i ํŒจํ„ด
  • ๋™์ผํ•œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: if global_i < size ๊ฒ€์ฆ
  • ๋™์ผํ•œ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: [0]์„ ์‚ฌ์šฉํ•œ TileTensor SIMD ์ถ”์ถœ

block.sum() (์ด ํผ์ฆ์˜ ์•ž๋ถ€๋ถ„)๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ๋™์ผํ•œ ๋ธ”๋ก ์ „์ฒด ์—ฐ์‚ฐ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ์— ์ฐธ์—ฌ
  • ๋™์ผํ•œ ๊ฒฐ๊ณผ ์ฒ˜๋ฆฌ: ํŠน์ • ์Šค๋ ˆ๋“œ(์ฒซ ๋ฒˆ์งธ ๋Œ€์‹  ๋งˆ์ง€๋ง‰)๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ ์ฒ˜๋ฆฌ
  • ๋™์ผํ•œ SIMD ๋ณ€ํ™˜: ๋ฐฐ์—ด ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•œ Int(result[0]) ํŒจํ„ด

block.prefix_sum()๋งŒ์˜ ๊ณ ๊ธ‰ ๊ฐœ๋…:

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์Œ: ์Šค๋ ˆ๋“œ 0๋งŒ ์ค‘์š”ํ•œ block.sum()๊ณผ ๋‹ฌ๋ฆฌ
  • ์กฐ์œจ๋œ ์“ฐ๊ธฐ ์œ„์น˜: ๋ˆ„์  ํ•ฉ์ด ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ž๋™์œผ๋กœ ์ œ๊ฑฐ
  • ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง: ์ด์ง„ ํ”„๋ ˆ๋””์ผ€์ดํŠธ๊ฐ€ ๊ณ ๊ธ‰ ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•จ

๋‹จ์ˆœํ•œ ๋ฐฉ์‹ ๋Œ€๋น„ ์„ฑ๋Šฅ ์ด์ :

vs. ์›์ž์  ์—ฐ์‚ฐ:

  • ๊ฒฝ์Ÿ ์ƒํƒœ ์—†์Œ: ๋ˆ„์  ํ•ฉ์ด ๊ณ ์œ ํ•œ ์“ฐ๊ธฐ ์œ„์น˜๋ฅผ ์ œ๊ณต
  • ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ: ์ˆœ์ฐจ์  ์“ฐ๊ธฐ๊ฐ€ ์บ์‹œ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ
  • ์ง๋ ฌํ™” ์—†์Œ: ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰

vs. ๋‹ค์ค‘ ํŒจ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ๋‹จ์ผ ์ปค๋„: ํ•œ ๋ฒˆ์˜ GPU ์‹คํ–‰์œผ๋กœ ํžˆ์Šคํ† ๊ทธ๋žจ ์ถ”์ถœ ์™„๋ฃŒ
  • ์™„์ „ ํ™œ์šฉ: ๋ฐ์ดํ„ฐ ๋ถ„ํฌ์— ๊ด€๊ณ„์—†์ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ž‘์—…
  • ์ตœ์  ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์— ์ตœ์ ํ™”๋œ ํŒจํ„ด

์ด๊ฒƒ์€ block.prefix_sum()์ด block.sum() ๊ฐ™์€ ๋‹จ์ˆœํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋กœ๋Š” ๋ณต์žกํ•˜๊ฑฐ๋‚˜ ๋ถˆ๊ฐ€๋Šฅํ•œ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์–ด๋–ป๊ฒŒ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ

block.prefix_sum() vs ๊ธฐ์กด ๋ฐฉ์‹:

  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ •๊ตํ•จ: ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹ vs ์ˆœ์ฐจ์  ์ฒ˜๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ: ๋ณ‘ํ•ฉ๋œ ์“ฐ๊ธฐ vs ๋ถ„์‚ฐ๋œ ๋ฌด์ž‘์œ„ ์ ‘๊ทผ
  • ๋™๊ธฐํ™”: ๋‚ด์žฅ ์กฐ์œจ vs ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด์™€ ์›์ž์  ์—ฐ์‚ฐ
  • ํ™•์žฅ์„ฑ: ๋ชจ๋“  ๋ธ”๋ก ํฌ๊ธฐ์™€ ๊ตฌ๊ฐ„ ์ˆ˜์— ๋™์ž‘

block.prefix_sum() vs block.sum():

  • ๋ฒ”์œ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›์Œ vs ์Šค๋ ˆ๋“œ 0๋งŒ
  • ์šฉ๋„: ๋ณต์žกํ•œ ํŒŒํ‹ฐ์…”๋‹ vs ๋‹จ์ˆœํ•œ ์ง‘๊ณ„
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์œ ํ˜•: ๋ณ‘๋ ฌ ์Šค์บ” ๊ธฐ๋ณธ ์š”์†Œ vs ๋ฆฌ๋•์…˜ ๊ธฐ๋ณธ ์š”์†Œ
  • ์ถœ๋ ฅ ํŒจํ„ด: ์Šค๋ ˆ๋“œ๋ณ„ ์œ„์น˜ vs ๋‹จ์ผ ํ•ฉ๊ณ„

block.prefix_sum()์„ ์‚ฌ์šฉํ•ด์•ผ ํ•  ๋•Œ:

  • ๋ณ‘๋ ฌ ํ•„ํ„ฐ๋ง: ์กฐ๊ฑด์— ๋งž๋Š” ์š”์†Œ ์ถ”์ถœ
  • ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜: ๋ถˆํ•„์š”ํ•œ ์š”์†Œ ์ œ๊ฑฐ
  • ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹: ๋ฐ์ดํ„ฐ๋ฅผ ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„๋กœ ๋ถ„๋ฆฌ
  • ๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋ถ€ํ•˜ ๋ถ„์‚ฐ, ์ •๋ ฌ, ๊ทธ๋ž˜ํ”„ ์•Œ๊ณ ๋ฆฌ์ฆ˜

๋‹ค์Œ ๋‹จ๊ณ„

block.prefix_sum() ์—ฐ์‚ฐ์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • block.broadcast()์™€ ๋ฒกํ„ฐ ์ •๊ทœํ™”: ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ฐ’์„ ๊ณต์œ 
  • ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋” ํฐ ๋ฌธ์ œ๋ฅผ ์œ„ํ•œ ์—ฌ๋Ÿฌ ๋ธ”๋ก ๊ฐ„ ์กฐ์œจ
  • ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ •๋ ฌ, ๊ทธ๋ž˜ํ”„ ํƒ์ƒ‰, ๋™์  ๋ถ€ํ•˜ ๋ถ„์‚ฐ
  • ๋ณต์žกํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด: ๋ธ”๋ก ์—ฐ์‚ฐ๊ณผ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์˜ ๊ฒฐํ•ฉ

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ๋ธ”๋ก ๋ˆ„์  ํ•ฉ ์—ฐ์‚ฐ์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ๊ณ„์‚ฐ์—์„œ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. block.sum()์ด ๋ฆฌ๋•์…˜์„ ๋‹จ์ˆœํ™”ํ–ˆ๋‹ค๋ฉด, block.prefix_sum()์€ ๊ณ ์„ฑ๋Šฅ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ธ ๊ณ ๊ธ‰ ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ ํŒจํ„ด์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

block.broadcast()์™€ ๋ฒกํ„ฐ ์ •๊ทœํ™”

block.sum과 block.broadcast 연산을 결합하여 벡터 평균 정규화를 구현하면서, 블록 레벨 통신 워크플로우의 전체 모습을 보여줍니다. 각 스레드가 평균 계산에 기여한 다음 브로드캐스트된 평균을 받아 자신의 요소를 정규화하는 과정은, 블록 연산들이 실제 병렬 알고리즘을 구현하기 위해 어떻게 함께 동작하는지 보여줍니다.

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.broadcast() ์—ฐ์‚ฐ์€ ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹ ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜์—ฌ, ๊ธฐ๋ณธ ๋ธ”๋ก ํ†ต์‹  ํŒจํ„ด์„ ์™„์„ฑํ•ฉ๋‹ˆ๋‹ค: ๋ฆฌ๋•์…˜(์ „์ฒดโ†’ํ•˜๋‚˜), ์Šค์บ”(์ „์ฒดโ†’๊ฐ๊ฐ), ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(ํ•˜๋‚˜โ†’์ „์ฒด).

ํ•ต์‹ฌ ๊ฐœ๋…

์ด ํผ์ฆ์—์„œ ๋ฐฐ์šธ ๋‚ด์šฉ:

  • block.broadcast()๋ฅผ ํ™œ์šฉํ•œ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
  • ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹  ํŒจํ„ด
  • ์†Œ์Šค ์Šค๋ ˆ๋“œ ์ง€์ •๊ณผ ํŒŒ๋ผ๋ฏธํ„ฐ ์ œ์–ด
  • ์—ฌ๋Ÿฌ ์—ฐ์‚ฐ์„ ๊ฒฐํ•ฉํ•˜๋Š” ์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ
  • ์กฐ์œจ๋œ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ ์‹ค์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌํ˜„

์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: \[\Large \text{output}[i] = \frac{\text{input}[i]}{\frac{1}{N}\sum_{j=0}^{N-1} \text{input}[j]}\]

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ‰๊ท  ๊ณ„์‚ฐ์— ๊ธฐ์—ฌํ•œ ๋‹ค์Œ, ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ๋ฐ›์•„ ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: SIZE = 128 ์š”์†Œ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ธ”๋ก ๊ตฌ์„ฑ: (128, 1) ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (TPB = 128)
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: (1, 1) ๊ทธ๋ฆฌ๋“œ๋‹น ๋ธ”๋ก ์ˆ˜
  • ๋ ˆ์ด์•„์›ƒ: row_major[SIZE]() (์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๋ชจ๋‘ 1D row-major)
  • ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ: 1-8 ๋ฐ˜๋ณต ๊ฐ’, ํ‰๊ท  = 4.5
  • ์˜ˆ์ƒ ์ถœ๋ ฅ: ํ‰๊ท ์ด 1.0์ธ ์ •๊ทœํ™”๋œ ๋ฒกํ„ฐ

๋„์ „ ๊ณผ์ œ: ๋ธ”๋ก ์ „์ฒด ๊ณ„์‚ฐ๊ณผ ๋ถ„๋ฐฐ์˜ ์กฐ์œจ

๊ธฐ์กด์˜ ํ‰๊ท  ์ •๊ทœํ™” ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

# ์ˆœ์ฐจ์  ๋ฐฉ์‹ - ๋ณ‘๋ ฌ์„ฑ์„ ํ™œ์šฉํ•˜์ง€ ๋ชปํ•จ
total = sum(input_array)
mean = total / len(input_array)
output_array = [x / mean for x in input_array]

๋‹จ์ˆœํ•œ GPU ๋ณ‘๋ ฌํ™”์˜ ๋ฌธ์ œ์ :

  • ๋‹ค์ค‘ ์ปค๋„ ์‹คํ–‰: ํ‰๊ท  ๊ณ„์‚ฐ๊ณผ ์ •๊ทœํ™”์— ๊ฐ๊ฐ ๋ณ„๋„์˜ ํŒจ์Šค๊ฐ€ ํ•„์š”
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์™•๋ณต: ํ‰๊ท ์„ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์— ์ €์žฅํ–ˆ๋‹ค๊ฐ€ ๋‚˜์ค‘์— ๋‹ค์‹œ ์ฝ๊ธฐ
  • ๋™๊ธฐํ™” ๋ณต์žก์„ฑ: ๊ณ„์‚ฐ ๋‹จ๊ณ„ ๊ฐ„์— ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ํ•„์š”
  • ์Šค๋ ˆ๋“œ ๋ถ„๊ธฐ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ์ž‘์—…์„ ์ˆ˜ํ–‰

๊ธฐ์กด GPU ํ’€์ด์˜ ๋ณต์žก์„ฑ:

# 1๋‹จ๊ณ„: ํ•ฉ๊ณ„๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•œ ๋ฆฌ๋•์…˜ (๋ณต์žกํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ + ๋ฐฐ๋ฆฌ์–ด)
shared_sum[local_i] = my_value
barrier()
# ์—ฌ๋Ÿฌ barrier() ํ˜ธ์ถœ์ด ํ•„์š”ํ•œ ์ˆ˜๋™ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜...

# 2๋‹จ๊ณ„: ์Šค๋ ˆ๋“œ 0์ด ํ‰๊ท ์„ ๊ณ„์‚ฐ
if local_i == 0:
    mean = shared_sum[0] / size
    shared_mean[0] = mean

barrier()

# 3๋‹จ๊ณ„: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ‰๊ท ์„ ์ฝ๊ณ  ์ •๊ทœํ™”
mean = shared_mean[0]  # ๋ชจ๋‘๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ์ฝ์Œ
output[global_i] = my_value / mean

๊ณ ๊ธ‰ ๋ฐฉ์‹: block.sum() + block.broadcast() ์กฐ์œจ

๋‹ค๋‹จ๊ณ„ ์กฐ์œจ์„ ๊ฐ„๊ฒฐํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

์™„์„ฑํ•  ์ฝ”๋“œ

์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ

๋ธ”๋ก ์—ฐ์‚ฐ ๋„๊ตฌ ๋ชจ์Œ ์ „์ฒด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ๊ธ‰ ๋ฒกํ„ฐ ํ‰๊ท  ์ •๊ทœํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:


comptime vector_layout = row_major[SIZE]()
comptime VectorLayoutType = type_of(vector_layout)


def block_normalize_vector[
    tpb: Int
](
    input_data: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    output_data: TileTensor[mut=True, dtype, VectorLayoutType, MutAnyOrigin],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all โ†’ one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one โ†’ all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Step 1: Each thread loads its element

    # FILL IN (roughly 3 lines)

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)

    # FILL IN (1 line)

    # Step 3: Thread 0 computes mean value

    # FILL IN (roughly 4 lines)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration

    # FILL IN (1 line)

    # Step 5: Each thread normalizes by the mean

    # FILL IN (roughly 3 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p27/p27.mojo

ํŒ

1. ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ ๊ตฌ์กฐ (๋ชจ๋“  ์ด์ „ ์—ฐ์‚ฐ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์ถ•)

์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์™„๋ฒฝํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ํŒจํ„ด์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ๋กœ๋“œ (๋ชจ๋“  ์ด์ „ ํผ์ฆ์—์„œ ์ต์ˆ™ํ•œ ํŒจํ„ด)
  2. block.sum()์œผ๋กœ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐ (์ด ํผ์ฆ์˜ ์•ž๋ถ€๋ถ„์—์„œ ๋ฐฐ์šด ๋‚ด์šฉ)
  3. ์Šค๋ ˆ๋“œ 0์ด ํ•ฉ๊ณ„๋กœ๋ถ€ํ„ฐ ํ‰๊ท ์„ ๊ณ„์‚ฐ
  4. block.broadcast()๋กœ ํ‰๊ท ์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ณต์œ  (์ƒˆ๋กœ์šด ๋‚ด์šฉ!)
  5. ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์œผ๋กœ ์ •๊ทœํ™”

2. ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ๊ณผ ํ•ฉ๊ณ„ ๊ณ„์‚ฐ (์ต์ˆ™ํ•œ ํŒจํ„ด)

๊ธฐ์กด TileTensor ํŒจํ„ด์œผ๋กœ ์š”์†Œ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค:

var my_value: Scalar[dtype] = 0.0
if global_i < size:
    my_value = input_data[global_i][0]  # SIMD ์ถ”์ถœ

๊ทธ๋Ÿฐ ๋‹ค์Œ ์•ž์„œ ๋ฐฐ์šด ๋‚ด์ ๊ณผ ๋™์ผํ•˜๊ฒŒ block.sum()์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

total_sum = block.sum[block_size=tpb, broadcast=False](...)

3. ํ‰๊ท  ๊ณ„์‚ฐ (์Šค๋ ˆ๋“œ 0๋งŒ)

์Šค๋ ˆ๋“œ 0๋งŒ ํ‰๊ท ์„ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

var mean_value: Scalar[dtype] = 1.0  # ์•ˆ์ „ํ•œ ๊ธฐ๋ณธ๊ฐ’
if local_i == 0:
    # total_sum๊ณผ size๋กœ ํ‰๊ท  ๊ณ„์‚ฐ

์™œ ์Šค๋ ˆ๋“œ 0์ธ๊ฐ€? block.sum() ํŒจํ„ด์—์„œ ์Šค๋ ˆ๋“œ 0์ด ๊ฒฐ๊ณผ๋ฅผ ๋ฐ›๋Š” ๊ฒƒ๊ณผ ์ผ๊ด€์„ฑ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

4. block.broadcast() API ๊ฐœ๋…

ํ•จ์ˆ˜ ์‹œ๊ทธ๋‹ˆ์ฒ˜๋ฅผ ์‚ดํŽด๋ณด์„ธ์š” - ๋‹ค์Œ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค:

  • ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ: dtype, width, block_size
  • ๋Ÿฐํƒ€์ž„ ํŒŒ๋ผ๋ฏธํ„ฐ: val (๋ธŒ๋กœ๋“œ์บ์ŠคํŠธํ•  SIMD ๊ฐ’), src_thread (๊ธฐ๋ณธ๊ฐ’=0)

ํ˜ธ์ถœ ํŒจํ„ด์€ ๊ธฐ์กด ํ…œํ”Œ๋ฆฟ ์Šคํƒ€์ผ์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

result = block.broadcast[
    dtype = DType.float32,
    width = 1,
    block_size = tpb
](val=SIMD[DType.float32, 1](value_to_broadcast), src_thread=UInt(0))

5. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

ํ•ต์‹ฌ ํ†ต์ฐฐ: block.broadcast()๋Š” ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์—์„œ ๊ฐ’์„ ๊ฐ€์ ธ์™€ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ์ „๋‹ฌํ•ฉ๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ 0์ด ๊ณ„์‚ฐ๋œ ํ‰๊ท ๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ
  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ํ‰๊ท ๊ฐ’์ด ํ•„์š”
  • block.broadcast() ๊ฐ€ ์Šค๋ ˆ๋“œ 0์˜ ๊ฐ’์„ ๋ชจ๋‘์—๊ฒŒ ๋ณต์‚ฌ

์ด๊ฒƒ์€ block.sum()(์ „์ฒดโ†’ํ•˜๋‚˜)์˜ ๋ฐ˜๋Œ€์ด๋ฉฐ, block.prefix_sum()(์ „์ฒดโ†’๊ฐ๊ฐ ์œ„์น˜)๊ณผ๋„ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

6. ์ตœ์ข… ์ •๊ทœํ™” ๋‹จ๊ณ„

๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ๋ฐ›์œผ๋ฉด, ์ž์‹ ์˜ ์š”์†Œ๋ฅผ ์ •๊ทœํ™”ํ•ฉ๋‹ˆ๋‹ค:

if global_i < size:
    normalized_value = my_value / broadcasted_mean[0]  # SIMD ์ถ”์ถœ
    output_data[global_i] = normalized_value

SIMD ์ถ”์ถœ: block.broadcast()๊ฐ€ SIMD๋ฅผ ๋ฐ˜ํ™˜ํ•˜๋ฏ€๋กœ [0]์œผ๋กœ ์Šค์นผ๋ผ๋ฅผ ์ถ”์ถœํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

7. ์ด์ „ ํผ์ฆ์—์„œ์˜ ํŒจํ„ด ์ธ์‹

  • ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ: ํ•ญ์ƒ ๋™์ผํ•œ global_i, local_i ํŒจํ„ด
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: ๋™์ผํ•œ if global_i < size ๊ฒ€์ฆ
  • SIMD ์ฒ˜๋ฆฌ: ๋™์ผํ•œ [0] ์ถ”์ถœ ํŒจํ„ด
  • ๋ธ”๋ก ์—ฐ์‚ฐ: block.sum()๊ณผ ๋™์ผํ•œ ํ…œํ”Œ๋ฆฟ ํŒŒ๋ผ๋ฏธํ„ฐ ์Šคํƒ€์ผ

๊ฐ ๋ธ”๋ก ์—ฐ์‚ฐ์ด ์ผ๊ด€๋œ ํŒจํ„ด์„ ๋”ฐ๋ฅด๋Š” ๊ฒƒ์ด ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค!

block.broadcast() ๋ฐฉ์‹ ํ…Œ์ŠคํŠธ:

pixi run p27 --normalize
pixi run -e amd p27 --normalize
pixi run -e apple p27 --normalize
uv run poe p27 --normalize

ํ’€์—ˆ์„ ๋•Œ์˜ ์˜ˆ์ƒ ์ถœ๋ ฅ:

SIZE: 128
TPB: 128

Input sample: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 ...
Sum value: 576.0
Mean value: 4.5

Mean Normalization Results:
Normalized sample: 0.22222222 0.44444445 0.6666667 0.8888889 1.1111112 1.3333334 1.5555556 1.7777778 ...

Output sum: 128.0
Output mean: 1.0
โœ… Success: Output mean is 1.0 (should be close to 1.0)

์†”๋ฃจ์…˜

def block_normalize_vector[
    tpb: Int
](
    input_data: TileTensor[mut=False, dtype, InLayout, ImmutAnyOrigin],
    output_data: TileTensor[mut=True, dtype, VectorLayout, MutAnyOrigin],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all -> one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one -> all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Step 1: Each thread loads its element
    var my_value: Scalar[dtype] = 0.0
    if global_i < size:
        my_value = input_data[global_i][0]  # Extract SIMD value

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)
    var total_sum = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](my_value)
    )

    # Step 3: Thread 0 computes mean value
    var mean_value: Scalar[dtype] = 1.0  # Default to avoid division by zero
    if local_i == 0:
        if total_sum[0] > 0.0:
            mean_value = total_sum[0] / Scalar[dtype](size)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration
    var broadcasted_mean = block.broadcast[
        dtype=DType.float32, width=1, block_size=tpb
    ](val=SIMD[DType.float32, 1](mean_value), src_thread=UInt(0))

    # Step 5: Each thread normalizes by the mean
    if global_i < size:
        var normalized_value = my_value / broadcasted_mean[0]
        output_data[global_i] = normalized_value


block.broadcast() ์ปค๋„์€ ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ ํ†ต์‹  ํŒจํ„ด์„ ๋ชจ๋‘ ๊ฒฐํ•ฉํ•˜์—ฌ ์ˆ˜ํ•™์ ์œผ๋กœ ๊ฒ€์ฆ ๊ฐ€๋Šฅํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•˜๋Š” ์‹ค์ œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ ์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๊ตฌ์ฒด์ ์ธ ์‹คํ–‰์„ ํ†ตํ•œ ์™„์ „ํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ถ„์„:

1๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ (๋ชจ๋“  ์ด์ „ ํผ์ฆ์—์„œ ํ™•๋ฆฝ๋œ ํŒจํ„ด)

์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ (๋ชจ๋“  ํผ์ฆ์—์„œ ์ผ๊ด€๋จ):
  global_i = block_dim.x * block_idx.x + thread_idx.x  // ์ž…๋ ฅ ๋ฐฐ์—ด ์œ„์น˜์— ๋งคํ•‘
  local_i = thread_idx.x                              // ๋ธ”๋ก ๋‚ด ์œ„์น˜ (0-127)

TileTensor ํŒจํ„ด์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ์š”์†Œ ๋กœ๋”ฉ:
  ์Šค๋ ˆ๋“œ 0:   my_value = input_data[0][0] = 1.0    // ์ฒซ ๋ฒˆ์งธ ์ˆœํ™˜ ๊ฐ’
  ์Šค๋ ˆ๋“œ 1:   my_value = input_data[1][0] = 2.0    // ๋‘ ๋ฒˆ์งธ ์ˆœํ™˜ ๊ฐ’
  ์Šค๋ ˆ๋“œ 7:   my_value = input_data[7][0] = 8.0    // ๋งˆ์ง€๋ง‰ ์ˆœํ™˜ ๊ฐ’
  ์Šค๋ ˆ๋“œ 8:   my_value = input_data[8][0] = 1.0    // ์ˆœํ™˜ ๋ฐ˜๋ณต: 1,2,3,4,5,6,7,8,1,2...
  ์Šค๋ ˆ๋“œ 15:  my_value = input_data[15][0] = 8.0   // 15 % 8 = 7, 8๋ฒˆ์งธ ๊ฐ’
  ์Šค๋ ˆ๋“œ 127: my_value = input_data[127][0] = 8.0  // 127 % 8 = 7, 8๋ฒˆ์งธ ๊ฐ’

128๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ๋กœ๋“œ - ์™„๋ฒฝํ•œ ๋ณ‘๋ ฌ ํšจ์œจ!

2๋‹จ๊ณ„: ๋ธ”๋ก ์ „์ฒด ํ•ฉ๊ณ„ ๋ฆฌ๋•์…˜ (์•ž์„œ ๋ฐฐ์šด block.sum() ์ง€์‹ ํ™œ์šฉ)

128๊ฐœ ์Šค๋ ˆ๋“œ์— ๊ฑธ์นœ block.sum() ์กฐ์œจ:
  ๊ธฐ์—ฌ๋ถ„ ๋ถ„์„:
    - ๊ฐ’ 1,2,3,4,5,6,7,8์ด ๊ฐ๊ฐ 16๋ฒˆ ๋ฐ˜๋ณต (128/8 = 16)
    - ์Šค๋ ˆ๋“œ ๊ธฐ์—ฌ๋ถ„: 16ร—1 + 16ร—2 + 16ร—3 + 16ร—4 + 16ร—5 + 16ร—6 + 16ร—7 + 16ร—8
    - ์ˆ˜ํ•™์  ํ•ฉ๊ณ„: 16 ร— (1+2+3+4+5+6+7+8) = 16 ร— 36 = 576.0

block.sum() ํ•˜๋“œ์›จ์–ด ์‹คํ–‰:
  ๋ชจ๋“  ์Šค๋ ˆ๋“œ โ†’ [๋ฆฌ๋•์…˜ ํŠธ๋ฆฌ] โ†’ ์Šค๋ ˆ๋“œ 0
  total_sum = SIMD[DType.float32, 1](576.0)  // ์Šค๋ ˆ๋“œ 0๋งŒ ์ด ๊ฐ’์„ ์ˆ˜์‹ 

์Šค๋ ˆ๋“œ 1-127: total_sum์— ์ ‘๊ทผ ๋ถˆ๊ฐ€ (block.sum์—์„œ broadcast=False)

3๋‹จ๊ณ„: ๋…์ ์  ํ‰๊ท  ๊ณ„์‚ฐ (๋‹จ์ผ ์Šค๋ ˆ๋“œ ์ฒ˜๋ฆฌ)

์Šค๋ ˆ๋“œ 0์ด ํ•ต์‹ฌ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰:
  ์ž…๋ ฅ: total_sum[0] = 576.0, size = 128
  ๊ณ„์‚ฐ: mean_value = 576.0 / 128.0 = 4.5

  ๊ฒ€์ฆ: ๊ธฐ๋Œ€ ํ‰๊ท  = (1+2+3+4+5+6+7+8)/8 = 36/8 = 4.5 โœ“

๋‹ค๋ฅธ ๋ชจ๋“  ์Šค๋ ˆ๋“œ (1-127):
  mean_value = 1.0 (๊ธฐ๋ณธ ์•ˆ์ „ ๊ฐ’)
  ์ด ๊ฐ’๋“ค์€ ๋ฌด๊ด€ - ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋กœ ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด ์‹œ์ ์—์„œ ์˜ฌ๋ฐ”๋ฅธ ํ‰๊ท ๊ฐ’์„ ๊ฐ€์ง„ ๊ฒƒ์€ ์Šค๋ ˆ๋“œ 0๋ฟ์ž…๋‹ˆ๋‹ค!

4๋‹จ๊ณ„: ๋ธ”๋ก ์ „์ฒด ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ๋ถ„๋ฐฐ (ํ•˜๋‚˜ โ†’ ์ „์ฒด ํ†ต์‹ )

block.broadcast() API ์‹คํ–‰:
  ์†Œ์Šค: src_thread = UInt(0) โ†’ ์Šค๋ ˆ๋“œ 0์˜ mean_value = 4.5
  ๋Œ€์ƒ: ๋ธ”๋ก ๋‚ด ๋ชจ๋“  128 ์Šค๋ ˆ๋“œ

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ „:
  ์Šค๋ ˆ๋“œ 0:   mean_value = 4.5  โ† ์ง„์‹ค์˜ ์›์ฒœ
  ์Šค๋ ˆ๋“œ 1:   mean_value = 1.0  โ† ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •
  ์Šค๋ ˆ๋“œ 2:   mean_value = 1.0  โ† ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •
  ...
  ์Šค๋ ˆ๋“œ 127: mean_value = 1.0  โ† ๋ฎ์–ด์”Œ์›Œ์งˆ ์˜ˆ์ •

block.broadcast() ์‹คํ–‰ ํ›„:
  ์Šค๋ ˆ๋“œ 0:   broadcasted_mean[0] = 4.5  โ† ์ž์‹ ์˜ ๊ฐ’์„ ๋‹ค์‹œ ์ˆ˜์‹ 
  ์Šค๋ ˆ๋“œ 1:   broadcasted_mean[0] = 4.5  โ† ์ด์ œ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ€์ง!
  ์Šค๋ ˆ๋“œ 2:   broadcasted_mean[0] = 4.5  โ† ์ด์ œ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ€์ง!
  ...
  ์Šค๋ ˆ๋“œ 127: broadcasted_mean[0] = 4.5  โ† ์ด์ œ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ๊ฐ€์ง!

๊ฒฐ๊ณผ: ์™„๋ฒฝํ•œ ๋™๊ธฐํ™” - ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ํ‰๊ท ๊ฐ’์„ ๊ฐ€์ง!

5๋‹จ๊ณ„: ๋ณ‘๋ ฌ ํ‰๊ท  ์ •๊ทœํ™” (์กฐ์œจ๋œ ์ฒ˜๋ฆฌ)

๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๋œ ํ‰๊ท ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”:
  ์Šค๋ ˆ๋“œ 0:   normalized = 1.0 / 4.5 = 0.22222222...
  ์Šค๋ ˆ๋“œ 1:   normalized = 2.0 / 4.5 = 0.44444444...
  ์Šค๋ ˆ๋“œ 2:   normalized = 3.0 / 4.5 = 0.66666666...
  ์Šค๋ ˆ๋“œ 7:   normalized = 8.0 / 4.5 = 1.77777777...
  ์Šค๋ ˆ๋“œ 8:   normalized = 1.0 / 4.5 = 0.22222222...  (ํŒจํ„ด ๋ฐ˜๋ณต)
  ...

์ˆ˜ํ•™์  ๊ฒ€์ฆ:
  출력 합계 = (0.222... + 0.444... + ... + 1.777...) × 16 = 8 × 16 = 128.0
  ์ถœ๋ ฅ ํ‰๊ท  = 128.0 / 128 = 1.0  ์™„๋ฒฝํ•œ ์ •๊ทœํ™”!

๊ฐ ๊ฐ’์„ ์›๋ž˜ ํ‰๊ท ์œผ๋กœ ๋‚˜๋ˆ„๋ฉด ํ‰๊ท ์ด 1.0์ธ ์ถœ๋ ฅ์„ ์ƒ์„ฑ

6๋‹จ๊ณ„: ์ •ํ™•์„ฑ ๊ฒ€์ฆ

์ž…๋ ฅ ๋ถ„์„:
  - ํ•ฉ๊ณ„: 576.0, ํ‰๊ท : 4.5
  - ์ตœ๋Œ“๊ฐ’: 8.0, ์ตœ์†Ÿ๊ฐ’: 1.0
  - ๋ฒ”์œ„: [1.0, 8.0]

์ถœ๋ ฅ ๋ถ„์„:
  - ํ•ฉ๊ณ„: 128.0, ํ‰๊ท : 1.0 โœ“
  - ์ตœ๋Œ“๊ฐ’: 1.777..., ์ตœ์†Ÿ๊ฐ’: 0.222...
  - ๋ฒ”์œ„: [0.222, 1.777] (๋ชจ๋“  ๊ฐ’์ด 1/4.5 ๋น„์œจ๋กœ ์Šค์ผ€์ผ๋ง)

๋น„๋ก€ ๊ด€๊ณ„ ๋ณด์กด:
  - ์›๋ž˜ 8:1 ๋น„์œจ์ด 1.777:0.222 = 8:1๋กœ ์œ ์ง€ โœ“
  - ๋ชจ๋“  ์ƒ๋Œ€์  ํฌ๊ธฐ๊ฐ€ ์™„๋ฒฝํ•˜๊ฒŒ ์œ ์ง€

์ด ์™„์ „ํ•œ ์›Œํฌํ”Œ๋กœ์šฐ๊ฐ€ ์ˆ˜ํ•™์ ยท๊ณ„์‚ฐ์ ์œผ๋กœ ์šฐ์ˆ˜ํ•œ ์ด์œ :

๊ธฐ์ˆ ์  ์ •ํ™•์„ฑ๊ณผ ๊ฒ€์ฆ:

์ˆ˜ํ•™์  ์ •ํ™•์„ฑ ์ฆ๋ช…:
  ์ž…๋ ฅ: xโ‚, xโ‚‚, ..., xโ‚™ (n = 128)
  ํ‰๊ท : ฮผ = (โˆ‘xแตข)/n = 576/128 = 4.5

  ์ •๊ทœํ™”: yแตข = xแตข/ฮผ
  ์ถœ๋ ฅ ํ‰๊ท : (โˆ‘yแตข)/n = (โˆ‘xแตข/ฮผ)/n = (1/ฮผ)(โˆ‘xแตข)/n = (1/ฮผ)ฮผ = 1 โœ“

์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ฆ๋ช… ๊ฐ€๋Šฅํ•˜๊ฒŒ ์˜ฌ๋ฐ”๋ฅธ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค.

Puzzle 12 (๊ธฐ์ดˆ ํŒจํ„ด)๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ์Šค๋ ˆ๋“œ ์กฐ์œจ์˜ ์ง„ํ™”: ๋™์ผํ•œ global_i, local_i ํŒจํ„ด์ด์ง€๋งŒ ๋ธ”๋ก ๊ธฐ๋ณธ ์š”์†Œ ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด: ๋™์ผํ•œ TileTensor SIMD ์ถ”์ถœ [0]์ด์ง€๋งŒ ์ตœ์ ํ™”๋œ ์›Œํฌํ”Œ๋กœ์šฐ
  • ๋ณต์žก์„ฑ ์ œ๊ฑฐ: 20์ค„ ์ด์ƒ์˜ ์ˆ˜๋™ ๋ฐฐ๋ฆฌ์–ด๋ฅผ 2๊ฐœ์˜ ๋ธ”๋ก ์—ฐ์‚ฐ์œผ๋กœ ๋Œ€์ฒด
  • ๊ต์œก์  ์ง„ํ–‰: ์ˆ˜๋™ โ†’ ์ž๋™, ๋ณต์žก โ†’ ๋‹จ์ˆœ, ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ โ†’ ์‹ ๋ขฐ์„ฑ

block.sum() (์™„๋ฒฝํ•œ ํ†ตํ•ฉ)๊ณผ์˜ ์—ฐ๊ฒฐ:

  • API ์ผ๊ด€์„ฑ: ๋™์ผํ•œ ํ…œํ”Œ๋ฆฟ ๊ตฌ์กฐ [block_size=tpb, broadcast=False]
  • ๊ฒฐ๊ณผ ํ๋ฆ„ ์„ค๊ณ„: ์Šค๋ ˆ๋“œ 0์ด ํ•ฉ๊ณ„๋ฅผ ์ˆ˜์‹ ํ•˜๊ณ , ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํŒŒ์ƒ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ณ„์‚ฐ
  • ๋งค๋„๋Ÿฌ์šด ์กฐํ•ฉ: block.sum()์˜ ์ถœ๋ ฅ์ด ๊ณ„์‚ฐ + ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ์˜ ์ž…๋ ฅ์ด ๋จ
  • ์„ฑ๋Šฅ ์ตœ์ ํ™”: ๋‹จ์ผ ์ปค๋„ ์›Œํฌํ”Œ๋กœ์šฐ vs ๋‹ค์ค‘ ํŒจ์Šค ๋ฐฉ์‹

block.prefix_sum() (์ƒ๋ณด์  ํ†ต์‹ )๊ณผ์˜ ์—ฐ๊ฒฐ:

  • ๋ถ„๋ฐฐ ํŒจํ„ด: prefix_sum์€ ๊ณ ์œ ํ•œ ์œ„์น˜๋ฅผ, broadcast๋Š” ๊ณต์œ  ๊ฐ’์„ ์ œ๊ณต

  • ์‚ฌ์šฉ ์‹œ๋‚˜๋ฆฌ์˜ค: prefix_sum์€ ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹์šฉ, broadcast๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ ์šฉ

  • ํ…œํ”Œ๋ฆฟ ์ผ๊ด€์„ฑ: ๋ชจ๋“  ์—ฐ์‚ฐ์—์„œ ๋™์ผํ•œ dtype, block_size ํŒŒ๋ผ๋ฏธํ„ฐ ํŒจํ„ด

  • SIMD ์ฒ˜๋ฆฌ ํ†ต์ผ์„ฑ: ๋ชจ๋“  ๋ธ”๋ก ์—ฐ์‚ฐ์ด [0] ์ถ”์ถœ์ด ํ•„์š”ํ•œ SIMD๋ฅผ ๋ฐ˜ํ™˜

๊ณ ๊ธ‰ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ธ์‚ฌ์ดํŠธ:

ํ†ต์‹  ํŒจํ„ด ๋น„๊ต:
  ๊ธฐ์กด ๋ฐฉ์‹:
    1. ์ˆ˜๋™ ๋ฆฌ๋•์…˜:         O(log n), ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด ํ•„์š”
    2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ:     O(1), ๋™๊ธฐํ™” ํ•„์š”
    3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ:     O(1), ๋ฑ…ํฌ ์ถฉ๋Œ ๊ฐ€๋Šฅ์„ฑ
    ์ดํ•ฉ: ๋‹ค์ˆ˜์˜ ๋™๊ธฐํ™” ์ง€์ , ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅ

  ๋ธ”๋ก ์—ฐ์‚ฐ ๋ฐฉ์‹:
    1. block.sum():        O(log n), ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”, ์ž๋™ ๋ฐฐ๋ฆฌ์–ด
    2. ๊ณ„์‚ฐ:                O(1), ๋‹จ์ผ ์Šค๋ ˆ๋“œ
    3. block.broadcast():  O(log n), ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”, ์ž๋™ ๋ถ„๋ฐฐ
    ์ดํ•ฉ: ๋‘ ๊ฐœ์˜ ๊ธฐ๋ณธ ์š”์†Œ, ์ž๋™ ๋™๊ธฐํ™”, ์ฆ๋ช…๋œ ์ •ํ™•์„ฑ

์‹ค์ œ ์‘์šฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด:

์ผ๋ฐ˜์ ์ธ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐ:
  1๋‹จ๊ณ„: ๋ณ‘๋ ฌ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ        โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ธฐ์—ฌ
  2๋‹จ๊ณ„: ์ „์—ญ ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณ„์‚ฐ      โ†’ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐ
  3๋‹จ๊ณ„: ํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„๋ฐฐ          โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ˆ˜์‹ 
  4๋‹จ๊ณ„: ์กฐ์œจ๋œ ๋ณ‘๋ ฌ ์ถœ๋ ฅ        โ†’ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌ

์ด ์ •ํ™•ํ•œ ํŒจํ„ด์ด ๋“ฑ์žฅํ•˜๋Š” ๋ถ„์•ผ:
  - ๋ฐฐ์น˜ ์ •๊ทœํ™” (๋”ฅ๋Ÿฌ๋‹)
  - ํžˆ์Šคํ† ๊ทธ๋žจ ๊ท ๋“ฑํ™” (์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ)
  - ๋ฐ˜๋ณต์  ์ˆ˜์น˜ ํ•ด๋ฒ• (๊ณผํ•™ ์—ฐ์‚ฐ)
  - ์กฐ๋ช… ๊ณ„์‚ฐ (์ปดํ“จํ„ฐ ๊ทธ๋ž˜ํ”ฝ)

ํ‰๊ท  ์ •๊ทœํ™”๋Š” ์ด ๊ทผ๋ณธ์ ์ธ ํŒจํ„ด์˜ ์™„๋ฒฝํ•œ ๊ต์œก ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค.

๋ธ”๋ก ์—ฐ์‚ฐ 3๋ถ€์ž‘ ์™„์„ฑ:

1. block.sum() - ์ „์ฒดโ†’ํ•˜๋‚˜ (Reduction)

  • ์ž…๋ ฅ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ์ œ๊ณต
  • ์ถœ๋ ฅ: ์Šค๋ ˆ๋“œ 0์ด ์ง‘๊ณ„๋œ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 
  • ์šฉ๋„: ํ•ฉ๊ณ„, ์ตœ๋Œ“๊ฐ’ ๊ณ„์‚ฐ ๋“ฑ

2. block.prefix_sum() - ์ „์ฒดโ†’๊ฐ๊ฐ (Scan)

  • ์ž…๋ ฅ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ์ œ๊ณต
  • ์ถœ๋ ฅ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ˆ„์  ์œ„์น˜๋ฅผ ์ˆ˜์‹ 
  • ์šฉ๋„: ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ, ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹

3. block.broadcast() - ํ•˜๋‚˜โ†’์ „์ฒด (Broadcast)

  • ์ž…๋ ฅ: ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ’์„ ์ œ๊ณต (์ผ๋ฐ˜์ ์œผ๋กœ ์Šค๋ ˆ๋“œ 0)
  • ์ถœ๋ ฅ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๊ฐ’์„ ์ˆ˜์‹ 
  • ์šฉ๋„: ๊ณ„์‚ฐ๋œ ๋งค๊ฐœ๋ณ€์ˆ˜ ๊ณต์œ , ์„ค์ •๊ฐ’ ๋ถ„๋ฐฐ

์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ์ง„ํ–‰:

  1. ์ˆ˜๋™ ์กฐ์œจ (Puzzle 12): ๋ณ‘๋ ฌ ๊ธฐ์ดˆ ์ดํ•ด
  2. ์›Œํ”„ ๊ธฐ๋ณธ ์š”์†Œ (Puzzle 24): ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํŒจํ„ด ํ•™์Šต
  3. ๋ธ”๋ก ๋ฆฌ๋•์…˜ (block.sum()): ์ „์ฒดโ†’ํ•˜๋‚˜ ํ†ต์‹  ํ•™์Šต
  4. ๋ธ”๋ก ์Šค์บ” (block.prefix_sum()): ์ „์ฒดโ†’๊ฐ๊ฐ ํ†ต์‹  ํ•™์Šต
  5. ๋ธ”๋ก ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ (block.broadcast()): ํ•˜๋‚˜โ†’์ „์ฒด ํ†ต์‹  ํ•™์Šต

์ „์ฒด ๊ทธ๋ฆผ: ๋ธ”๋ก ์—ฐ์‚ฐ์€ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๊ธฐ๋ณธ ํ†ต์‹  ๋นŒ๋”ฉ ๋ธ”๋ก์„ ์ œ๊ณตํ•˜๋ฉฐ, ๋ณต์žกํ•œ ์ˆ˜๋™ ์กฐ์œจ์„ ๊น”๋”ํ•˜๊ณ  ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ๊ธฐ๋ณธ ์š”์†Œ๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ์ธ์‚ฌ์ดํŠธ์™€ ๊ธฐ์ˆ  ๋ถ„์„

์ •๋Ÿ‰์  ์„ฑ๋Šฅ ๋น„๊ต:

block.broadcast() vs ๊ธฐ์กด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฉ์‹ (์ฐธ๊ณ ์šฉ):

๊ธฐ์กด ์ˆ˜๋™ ๋ฐฉ์‹:

1๋‹จ๊ณ„: ์ˆ˜๋™ ๋ฆฌ๋•์…˜
  โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น: ~5 ์‚ฌ์ดํด
  โ€ข ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”: ~10 ์‚ฌ์ดํด
  โ€ข ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ๋ฃจํ”„: ~15 ์‚ฌ์ดํด
  โ€ข ์˜ค๋ฅ˜ ๋ฐœ์ƒ ๊ฐ€๋Šฅํ•œ ์ˆ˜๋™ ์ธ๋ฑ์‹ฑ

2๋‹จ๊ณ„: ํ‰๊ท  ๊ณ„์‚ฐ: ~2 ์‚ฌ์ดํด

3๋‹จ๊ณ„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ
  โ€ข ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ˆ˜๋™ ์“ฐ๊ธฐ: ~2 ์‚ฌ์ดํด
  โ€ข ๋ฐฐ๋ฆฌ์–ด ๋™๊ธฐํ™”: ~10 ์‚ฌ์ดํด
  โ€ข ๋ชจ๋“  ์Šค๋ ˆ๋“œ ์ฝ๊ธฐ: ~3 ์‚ฌ์ดํด

์ดํ•ฉ: ~47 ์‚ฌ์ดํด
  + ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  + ๊ฒฝ์Ÿ ์ƒํƒœ ๊ฐ€๋Šฅ์„ฑ
  + ์ˆ˜๋™ ์˜ค๋ฅ˜ ๋””๋ฒ„๊น…

๋ธ”๋ก ์—ฐ์‚ฐ ๋ฐฉ์‹:

1๋‹จ๊ณ„: block.sum()
  โ€ข ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ~3 ์‚ฌ์ดํด
  โ€ข ์ž๋™ ๋ฐฐ๋ฆฌ์–ด: ๋ช…์‹œ์  ๋น„์šฉ 0
  โ€ข ์ตœ์ ํ™”๋œ ๋ฆฌ๋•์…˜: ~8 ์‚ฌ์ดํด
  โ€ข ๊ฒ€์ฆ๋œ ์˜ฌ๋ฐ”๋ฅธ ๊ตฌํ˜„

2๋‹จ๊ณ„: ํ‰๊ท  ๊ณ„์‚ฐ: ~2 ์‚ฌ์ดํด

3๋‹จ๊ณ„: block.broadcast()
  โ€ข ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ~4 ์‚ฌ์ดํด
  โ€ข ์ž๋™ ๋ถ„๋ฐฐ: ๋ช…์‹œ์  ๋น„์šฉ 0
  โ€ข ๊ฒ€์ฆ๋œ ์˜ฌ๋ฐ”๋ฅธ ๊ตฌํ˜„

์ดํ•ฉ: ~17 ์‚ฌ์ดํด
  + ์ž๋™ ์ตœ์ ํ™”
  + ๋ณด์žฅ๋œ ์ •ํ™•์„ฑ
  + ์กฐํ•ฉ ๊ฐ€๋Šฅํ•œ ์„ค๊ณ„

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ด์ :

์บ์‹œ ํšจ์œจ:

  • block.sum(): ์ตœ์ ํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์บ์‹œ ๋ฏธ์Šค ๊ฐ์†Œ
  • block.broadcast(): ํšจ์œจ์ ์ธ ๋ถ„๋ฐฐ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ ์ตœ์†Œํ™”
  • ๊ฒฐํ•ฉ ์›Œํฌํ”Œ๋กœ์šฐ: ๋‹จ์ผ ์ปค๋„์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์™•๋ณต์„ 100% ๊ฐ์†Œ

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ:

๊ธฐ์กด ๋ฉ€ํ‹ฐ ์ปค๋„ ๋ฐฉ์‹:
  ์ปค๋„ 1: ์ž…๋ ฅ โ†’ ๋ฆฌ๋•์…˜ โ†’ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์“ฐ๊ธฐ
  ์ปค๋„ 2: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ฝ๊ธฐ โ†’ ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ โ†’ ์ถœ๋ ฅ
  ์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก: ๋ฐฐ์—ด ํฌ๊ธฐ์˜ 3๋ฐฐ

๋ธ”๋ก ์—ฐ์‚ฐ ๋‹จ์ผ ์ปค๋„:
  ์ž…๋ ฅ โ†’ block.sum() โ†’ block.broadcast() โ†’ ์ถœ๋ ฅ
  ์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก: ๋ฐฐ์—ด ํฌ๊ธฐ์˜ 2๋ฐฐ (33% ๊ฐœ์„ )

๊ฐ ๋ธ”๋ก ์—ฐ์‚ฐ์˜ ์ตœ์  ์‚ฌ์šฉ ์‹œ๋‚˜๋ฆฌ์˜ค:

block.sum() ์ตœ์  ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ๋ฐ์ดํ„ฐ ์ง‘๊ณ„: ํ•ฉ๊ณ„, ํ‰๊ท , ์ตœ๋Œ“๊ฐ’/์ตœ์†Ÿ๊ฐ’ ๊ณ„์‚ฐ
  • ๋ฆฌ๋•์…˜ ํŒจํ„ด: ์ „์ฒดโ†’ํ•˜๋‚˜ ํ†ต์‹ ์ด ํ•„์š”ํ•œ ๋ชจ๋“  ๊ฒฝ์šฐ
  • ํ†ต๊ณ„ ์—ฐ์‚ฐ: ํ‰๊ท , ๋ถ„์‚ฐ, ์ƒ๊ด€๊ด€๊ณ„ ๊ณ„์‚ฐ

block.prefix_sum() ์ตœ์  ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ๋ณ‘๋ ฌ ํŒŒํ‹ฐ์…”๋‹: ์ŠคํŠธ๋ฆผ ์ปดํŒฉ์…˜, ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜
  • ์“ฐ๊ธฐ ์œ„์น˜ ๊ณ„์‚ฐ: ๋ณ‘๋ ฌ ์ถœ๋ ฅ ์ƒ์„ฑ
  • ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์ •๋ ฌ, ๊ฒ€์ƒ‰, ๋ฐ์ดํ„ฐ ์žฌ๊ตฌ์„ฑ

block.broadcast() ์ตœ์  ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ๋งค๊ฐœ๋ณ€์ˆ˜ ๋ถ„๋ฐฐ: ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๊ณต์œ 
  • ์„ค์ • ์ „ํŒŒ: ๋ชจ๋“œ ํ”Œ๋ž˜๊ทธ, ์Šค์ผ€์ผ๋ง ํŒฉํ„ฐ, ์ž„๊ณ„๊ฐ’
  • ์กฐ์œจ๋œ ์ฒ˜๋ฆฌ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๊ณ„์‚ฐ๋œ ๋งค๊ฐœ๋ณ€์ˆ˜๊ฐ€ ํ•„์š”ํ•  ๋•Œ

์กฐํ•ฉ์˜ ์ด์ :

๊ฐœ๋ณ„ ์—ฐ์‚ฐ:   ์ข‹์€ ์„ฑ๋Šฅ, ์ œํ•œ๋œ ๋ฒ”์œ„
๊ฒฐํ•ฉ ์—ฐ์‚ฐ:   ํƒ์›”ํ•œ ์„ฑ๋Šฅ, ํฌ๊ด„์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜

์‹ค์ œ ์‘์šฉ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์กฐํ•ฉ ์˜ˆ์‹œ:
โ€ข block.sum() + block.broadcast():       ์ •๊ทœํ™” ์•Œ๊ณ ๋ฆฌ์ฆ˜
โ€ข block.prefix_sum() + block.sum():      ๊ณ ๊ธ‰ ํŒŒํ‹ฐ์…”๋‹
โ€ข ์„ธ ๊ฐ€์ง€ ๋ชจ๋‘ ๊ฒฐํ•ฉ:                      ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜
โ€ข ๊ธฐ์กด ํŒจํ„ด๊ณผ ํ•จ๊ป˜:                       ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์ตœ์ ํ™” ์ „๋žต

๋‹ค์Œ ๋‹จ๊ณ„

์™„์ „ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ 3๋ถ€์ž‘์„ ๋ฐฐ์› ์œผ๋‹ˆ, ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜: ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์นœ ์—ฐ์‚ฐ ์กฐ์œจ
  • ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ํŒจํ„ด: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ธ”๋ก ์—ฐ์‚ฐ ๊ฒฐํ•ฉ
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”: ํšจ์œจ์ ์ธ ๋ฐ์ดํ„ฐ ์ด๋™ ํŒจํ„ด
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„: ๋ธ”๋ก ์—ฐ์‚ฐ ๋นŒ๋”ฉ ๋ธ”๋ก์„ ์‚ฌ์šฉํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐํ™”
  • ์„ฑ๋Šฅ ์ตœ์ ํ™”: ์ตœ์ ์˜ ๋ธ”๋ก ํฌ๊ธฐ์™€ ์—ฐ์‚ฐ ์กฐํ•ฉ ์„ ํƒ

๐Ÿ’ก ํ•ต์‹ฌ ์š”์ : ๋ธ”๋ก ์—ฐ์‚ฐ 3๋ถ€์ž‘(sum, prefix_sum, broadcast)์€ ๋ธ”๋ก ๋ ˆ๋ฒจ ๋ณ‘๋ ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ์™„์ „ํ•œ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ์ด ์—ฐ์‚ฐ๋“ค์„ ์กฐํ•ฉํ•˜๋ฉด GPU ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”๋ฅผ ํ™œ์šฉํ•˜๋Š” ๊น”๋”ํ•˜๊ณ  ์œ ์ง€๋ณด์ˆ˜ํ•˜๊ธฐ ์‰ฌ์šด ์ฝ”๋“œ๋กœ ๊ณ ๊ธ‰ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ‰๊ท  ์ •๊ทœํ™”๋Š” ์ด ์—ฐ์‚ฐ๋“ค์ด ํ•จ๊ป˜ ์ž‘๋™ํ•˜์—ฌ ์‹ค์ œ ์—ฐ์‚ฐ ๋ฌธ์ œ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ•ด๊ฒฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.
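
본문이 언급한 평균 정규화가 sum → 평균 → broadcast 순서로 동작하는 계산 흐름을 순차 파이썬으로 재현한 개념 스케치입니다 (블록 연산 API 자체가 아니라 데이터 흐름만 보여 줍니다).

```python
# 평균 정규화의 3단계 계산 흐름을 순차적으로 재현한 스케치.
def mean_normalize(block_data):
    total = sum(block_data)            # 1단계: block.sum()에 해당하는 리덕션
    mean = total / len(block_data)     # 2단계: 평균 계산
    # 3단계: block.broadcast()에 해당 - 모든 스레드가 같은 평균값을 공유
    return [x - mean for x in block_data]

result = mean_normalize([1.0, 2.0, 3.0, 4.0])
print(result)       # [-1.5, -0.5, 0.5, 1.5]
print(sum(result))  # 0.0
```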

Puzzle 28: ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ๊ณผ ๋ณต์‚ฌ ์ค‘์ฒฉ

GPU 메모리 병목 현상: 실제 GPU 알고리즘 대부분은 좌절스러운 벽에 부딪힙니다. 바로 연산 능력이 아니라 메모리 대역폭에 의해 성능이 제한된다는 것입니다.

비싼 GPU 코어가 느린 DRAM에서 데이터가 도착하기를 기다리며 놀고 있는 것이죠.

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ํ”ํžˆ ๋ณผ ์ˆ˜ ์žˆ๋Š” ์ƒํ™ฉ์„ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

# ์„ฑ๋Šฅ์˜ ์  - ์ˆœ์ฐจ์  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ
load_input_tile()     # โ† DRAM ๋Œ€๊ธฐ 500 ์‚ฌ์ดํด
load_kernel_data()    # โ† ๋˜ 100 ์‚ฌ์ดํด ๋Œ€๊ธฐ
barrier()             # โ† ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์œ ํœด ๋Œ€๊ธฐ
compute()             # โ† ๋“œ๋””์–ด ์‹ค์ œ ์—ฐ์‚ฐ 50 ์‚ฌ์ดํด
# ์ด: 650 ์‚ฌ์ดํด, ์—ฐ์‚ฐ ํ™œ์šฉ๋ฅ  ๊ฒจ์šฐ 7.7%!

์ด๋ ‡๊ฒŒ ํ•  ์ˆ˜ ์žˆ๋‹ค๋ฉด ์–ด๋–จ๊นŒ์š”?

# ์„ฑ๋Šฅ ๊ฐœ์„  - ์ค‘์ฒฉ ์—ฐ์‚ฐ
launch_async_load()   # โ† ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ 500 ์‚ฌ์ดํด ์ „์†ก ์‹œ์ž‘
load_small_data()     # โ† ๋Œ€๊ธฐ ์ค‘ ์œ ์šฉํ•œ ์ž‘์—… 100 ์‚ฌ์ดํด
wait_and_compute()    # โ† ๋‚˜๋จธ์ง€ ~400 ์‚ฌ์ดํด๋งŒ ๋Œ€๊ธฐ ํ›„ ์—ฐ์‚ฐ
# ์ด: ~550 ์‚ฌ์ดํด, 45% ํ–ฅ์ƒ!

์ด๊ฒƒ์ด ๋น„๋™๊ธฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์˜ ์œ„๋ ฅ์ž…๋‹ˆ๋‹ค - ๋А๋ฆฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ GPU์˜ ์ž ์žฌ๋ ฅ์„ ์ตœ๋Œ€ํ•œ ๋ฐœํœ˜ํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด ๋ƒ…๋‹ˆ๋‹ค.

์™œ ์ค‘์š”ํ•œ๊ฐ€

์ด ํผ์ฆ์—์„œ๋Š” Puzzle 13์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ 1D ํ•ฉ์„ฑ๊ณฑ์„ ์—ฐ์‚ฐ ๋’ค์— ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๊ณ ์„ฑ๋Šฅ ๊ตฌํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ํ•™์ˆ ์  ์—ฐ์Šต์ด ์•„๋‹™๋‹ˆ๋‹ค - ์ด ํŒจํ„ด๋“ค์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค:

  • ๋”ฅ๋Ÿฌ๋‹: ๊ฐ€์ค‘์น˜์™€ ํ™œ์„ฑํ™”๊ฐ’์˜ ํšจ์œจ์  ๋กœ๋”ฉ
  • ๊ณผํ•™ ์—ฐ์‚ฐ: ์Šคํ…์‹ค ์—ฐ์‚ฐ์—์„œ ๋ฐ์ดํ„ฐ ์ „์†ก ์ค‘์ฒฉ
  • ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ: ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ํ†ตํ•œ ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹ ์ŠคํŠธ๋ฆฌ๋ฐ
  • ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜: ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ƒ์‚ฐ์ ์ธ ์ž‘์—…์œผ๋กœ ์ „ํ™˜

์‚ฌ์ „ ์ค€๋น„

์‹œ์ž‘ํ•˜๊ธฐ ์ „์— ๋‹ค์Œ ๋‚ด์šฉ์„ ํ™•์‹คํžˆ ์ดํ•ดํ•˜๊ณ  ์žˆ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

ํ•„์ˆ˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 8, Puzzle 16) - matmul ํŒจํ„ด์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ(coalescing) (Puzzle 21) - ์ตœ์ ์˜ ๋น„๋™๊ธฐ ์ „์†ก์— ํ•„์ˆ˜
  • ํƒ€์ผ ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ (Puzzle 23) - ์ด ์ตœ์ ํ™”์˜ ๊ธฐ๋ฐ˜

ํ•˜๋“œ์›จ์–ด ์ดํ•ด:

  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ (DRAM โ†’ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ ˆ์ง€์Šคํ„ฐ)
  • ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๊ตฌ์„ฑ๊ณผ ๋™๊ธฐํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„ vs. ๋Œ€์—ญํญ์— ๋Œ€ํ•œ ๊ธฐ๋ณธ ์ดํ•ด

API ์ˆ™์ง€: Mojo GPU Memory Operations

โš ๏ธ ํ•˜๋“œ์›จ์–ด ํ˜ธํ™˜์„ฑ ์ฐธ๊ณ : ์ด ํผ์ฆ์€ ์ตœ์‹  GPU ์•„ํ‚คํ…์ฒ˜๊ฐ€ ํ•„์š”ํ•  ์ˆ˜ ์žˆ๋Š” ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ(copy_dram_to_sram_async, async_copy_wait_all)์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. .async ์ˆ˜์ •์ž๋‚˜ ์ง€์›๋˜์ง€ ์•Š๋Š” ์—ฐ์‚ฐ ๊ด€๋ จ ์ปดํŒŒ์ผ ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ํ•ด๋‹น GPU๊ฐ€ ์ด ๊ธฐ๋Šฅ์„ ์ง€์›ํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜๋„ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ํŒจํ„ด์„ ์ดํ•ดํ•˜๋Š” ๋ฐ ๊ฐœ๋…์€ ์—ฌ์ „ํžˆ ์œ ์šฉํ•ฉ๋‹ˆ๋‹ค.

GPU ์ปดํ“จํŒ… ๋Šฅ๋ ฅ ํ™•์ธ:

nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader,nounits
  • SM_70 ์ด์ƒ (์˜ˆ: V100, T4, A10G, RTX 20+ ์‹œ๋ฆฌ์ฆˆ): ๊ธฐ๋ณธ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ง€์›
  • SM_80 ์ด์ƒ (์˜ˆ: A100, RTX 30+ ์‹œ๋ฆฌ์ฆˆ): ์ „์ฒด ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๊ธฐ๋Šฅ
  • SM_90 ์ด์ƒ (์˜ˆ: H100, RTX 40+ ์‹œ๋ฆฌ์ฆˆ): ๊ณ ๊ธ‰ TMA ์—ฐ์‚ฐ ์ง€์›

ํ•™์Šต ๋‚ด์šฉ

์ด ํผ์ฆ์„ ๋งˆ์น˜๋ฉด ๋‹ค์Œ์„ ์ง์ ‘ ๊ฒฝํ—˜ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ๊ธฐ๋ฒ•

  • ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๊ธฐ๋ณธ ์š”์†Œ: ๋ฐฑ๊ทธ๋ผ์šด๋“œ DRAMโ†’SRAM ์ „์†ก ์‹œ์ž‘
  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ(latency hiding): ๋น„์šฉ์ด ํฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์œ ์šฉํ•œ ์—ฐ์‚ฐ๊ณผ ์ค‘์ฒฉ
  • ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ํ•˜๋“œ์›จ์–ด์— ๋งž์ถ”๊ธฐ
  • ํŒŒ์ดํ”„๋ผ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: ๋ฉ”๋ชจ๋ฆฌ ํ™œ์šฉ์„ ๊ทน๋Œ€ํ™”ํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ตฌ์กฐํ™”

์ฃผ์š” API

Puzzle 16์˜ ๊ด€์šฉ์  matmul์—์„œ ์†Œ๊ฐœํ•œ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ด์ œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์ž ์žฌ๋ ฅ์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค:

  • copy_dram_to_sram_async(): ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐฑ๊ทธ๋ผ์šด๋“œ DRAMโ†’SRAM ์ „์†ก ์‹œ์ž‘
  • async_copy_wait_all(): ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ „ ์ „์†ก ์™„๋ฃŒ ๋™๊ธฐํ™”

Puzzle 16๊ณผ ๋‹ค๋ฅธ ์ ์€? Puzzle 16์—์„œ๋Š” matmul์˜ ๊น”๋”ํ•œ ํƒ€์ผ ๋กœ๋”ฉ์„ ์œ„ํ•ด ๋น„๋™๊ธฐ ๋ณต์‚ฌ๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค๋ฉด, ์ด ํผ์ฆ์€ ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค - ๋น„์šฉ์ด ํฐ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ๊ณผ ์œ ์šฉํ•œ ์—ฐ์‚ฐ ์ž‘์—…์„ ์ค‘์ฒฉํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌ์กฐํ™”ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ํšจ๊ณผ

์ด ๊ธฐ๋ฒ•๋“ค์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์„ฑ๋Šฅ์„ ํฌ๊ฒŒ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค:

  • DRAM ์ง€์—ฐ ์‹œ๊ฐ„ ์ˆจ๊ธฐ๊ธฐ: ์œ ํœด ๋Œ€๊ธฐ๋ฅผ ์ƒ์‚ฐ์ ์ธ ์—ฐ์‚ฐ ์‹œ๊ฐ„์œผ๋กœ ์ „ํ™˜
  • ๋Œ€์—ญํญ ๊ทน๋Œ€ํ™”: ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์บ์‹œ ๋ฏธ์Šค ๋ฐฉ์ง€
  • ํŒŒ์ดํ”„๋ผ์ธ ํšจ์œจ: ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์ด ๋ณ‘๋ ฌ๋กœ ์ผ์–ด๋‚˜๋Š” ๋™์•ˆ ์—ฐ์‚ฐ ์œ ๋‹›์„ ๋ฐ”์˜๊ฒŒ ์œ ์ง€

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์ด๋ž€? ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์€ GPU ๋ธ”๋ก์ด ๋‹ค๋ฅธ ์ž‘์—…์„ ๊ณ„์†ํ•˜๋Š” ๋™์•ˆ ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์‹คํ–‰๋˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ค๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์—ฐ์‚ฐ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ด๋™์„ ์ค‘์ฒฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ทผ๋ณธ์ ์ธ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก ์„ฑ๊ณต ํŒ: ์ด๊ฒƒ์„ GPU ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์œ„ํ•œ ํŒŒ์ดํ”„๋ผ์ธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์œผ๋กœ ์ƒ๊ฐํ•˜์„ธ์š” - ๋‹จ๊ณ„๋ฅผ ์ค‘์ฒฉํ•˜๊ณ , ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ณ , ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๊ทน๋Œ€ํ™”ํ•ฉ๋‹ˆ๋‹ค. ๋ชฉํ‘œ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ด๋™ํ•˜๋Š” ๋™์•ˆ ๋น„์‹ผ ์—ฐ์‚ฐ ์œ ๋‹›์„ ๋ฐ”์˜๊ฒŒ ์œ ์ง€ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

ํ—ค์ผ๋กœ ์˜์—ญ ์ดํ•ดํ•˜๊ธฐ

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์œผ๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ „์—, ํ•ฉ์„ฑ๊ณฑ๊ณผ ๊ฐ™์€ ์Šคํ…์‹ค ์—ฐ์‚ฐ์˜ ํƒ€์ผ ๊ธฐ๋ฐ˜ ์ฒ˜๋ฆฌ์— ํ•„์ˆ˜์ ์ธ ํ—ค์ผ๋กœ ์˜์—ญ(ghost cell ๋˜๋Š” guard cell์ด๋ผ๊ณ ๋„ ํ•จ)์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

ํ—ค์ผ๋กœ ์˜์—ญ์ด๋ž€?

ํ—ค์ผ๋กœ ์˜์—ญ์€ ์Šคํ…์‹ค ์—ฐ์‚ฐ์— ํ•„์š”ํ•œ ์ด์›ƒ ๋ฐ์ดํ„ฐ๋ฅผ ์ œ๊ณตํ•˜๊ธฐ ์œ„ํ•ด ์ฒ˜๋ฆฌ ํƒ€์ผ์˜ ๊ฒฝ๊ณ„๋ฅผ ๋„˜์–ด ํ™•์žฅ๋˜๋Š” ์ถ”๊ฐ€ ์š”์†Œ์ž…๋‹ˆ๋‹ค. ํƒ€์ผ ๊ฐ€์žฅ์ž๋ฆฌ ๊ทผ์ฒ˜์˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•  ๋•Œ, ์Šคํ…์‹ค ์—ฐ์‚ฐ์€ ์ธ์ ‘ ํƒ€์ผ์˜ ๋ฐ์ดํ„ฐ์— ์ ‘๊ทผํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ—ค์ผ๋กœ ์˜์—ญ์ด ํ•„์š”ํ•œ ์ด์œ 

ํƒ€์ผ์—์„œ 5์  ์ปค๋„์„ ์‚ฌ์šฉํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ ์ƒ๊ฐํ•ด ๋ด…์‹œ๋‹ค:

์›๋ณธ ๋ฐ์ดํ„ฐ:      [... | a b c d e f g h i j k l m n o | ...]
์ฒ˜๋ฆฌ ํƒ€์ผ:              [c d e f g h i j k l m n o]
                            ^                 ^
                      ์™ผ์ชฝ ํƒ€์ผ์—์„œ        ์˜ค๋ฅธ์ชฝ ํƒ€์ผ์—์„œ
                      ์ด์›ƒ ํ•„์š”           ์ด์›ƒ ํ•„์š”

ํ—ค์ผ๋กœ ํฌํ•จ:       [a b | c d e f g h i j k l m n o | p q]
                 ^^^                               ^^^
                 ์™ผ์ชฝ ํ—ค์ผ๋กœ                     ์˜ค๋ฅธ์ชฝ ํ—ค์ผ๋กœ

์ฃผ์š” ํŠน์„ฑ:

  • ํ—ค์ผ๋กœ ํฌ๊ธฐ: ์ผ๋ฐ˜์ ์œผ๋กœ ๊ฐ ์ธก๋ฉด์— KERNEL_SIZE // 2๊ฐœ ์š”์†Œ
  • ๋ชฉ์ : ํƒ€์ผ ๊ฒฝ๊ณ„์—์„œ ์ •ํ™•ํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ ๊ฐ€๋Šฅ
  • ๋‚ด์šฉ: ์ด์›ƒ ํƒ€์ผ์˜ ๋ฐ์ดํ„ฐ ๋ณต์‚ฌ๋ณธ ๋˜๋Š” ๊ฒฝ๊ณ„ ์กฐ๊ฑด
  • ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฒ„ํ—ค๋“œ: ํฐ ์—ฐ์‚ฐ ์ด์ ์„ ์œ„ํ•œ ์ ์€ ์ถ”๊ฐ€ ์ €์žฅ ๊ณต๊ฐ„

ํ•ฉ์„ฑ๊ณฑ์—์„œ์˜ ํ—ค์ผ๋กœ ์˜์—ญ

5์  ํ•ฉ์„ฑ๊ณฑ ์ปค๋„ \([k_0, k_1, k_2, k_3, k_4]\)์˜ ๊ฒฝ์šฐ:

  • ์ค‘์‹ฌ ์š”์†Œ: \(k_2\)๊ฐ€ ํ˜„์žฌ ์ฒ˜๋ฆฌ ์š”์†Œ์™€ ์ •๋ ฌ
  • ์™ผ์ชฝ ์ด์›ƒ: \(k_0, k_1\)์€ ์™ผ์ชฝ 2๊ฐœ ์š”์†Œ ํ•„์š”
  • ์˜ค๋ฅธ์ชฝ ์ด์›ƒ: \(k_3, k_4\)์€ ์˜ค๋ฅธ์ชฝ 2๊ฐœ ์š”์†Œ ํ•„์š”
  • ํ—ค์ผ๋กœ ํฌ๊ธฐ: ๊ฐ ์ธก๋ฉด์— HALO_SIZE = 5 // 2 = 2๊ฐœ ์š”์†Œ

ํ—ค์ผ๋กœ ์˜์—ญ ์—†์ด:

  • ํƒ€์ผ ๊ฒฝ๊ณ„ ์š”์†Œ์—์„œ ์ „์ฒด ํ•ฉ์„ฑ๊ณฑ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์—†์Œ
  • ์ž˜๋ชป๋œ ์ถœ๋ ฅ์ด๋‚˜ ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ๋กœ์ง์ด ํ•„์š”
  • ๋ถ„์‚ฐ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์„ฑ๋Šฅ ์ €ํ•˜

ํ—ค์ผ๋กœ ์˜์—ญ ์‚ฌ์šฉ ์‹œ:

  • ๋ชจ๋“  ํƒ€์ผ ์š”์†Œ๊ฐ€ ๋กœ์ปฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ฒด ํ•ฉ์„ฑ๊ณฑ ์ˆ˜ํ–‰ ๊ฐ€๋Šฅ
  • ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ๊ฐ„๊ฒฐํ•˜๊ณ  ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ
  • ๋” ๋‚˜์€ ์บ์‹œ ํ™œ์šฉ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ

์ด ๊ฐœ๋…์€ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•  ๋•Œ ํŠนํžˆ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ํ—ค์ผ๋กœ ์˜์—ญ์„ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋กœ๋”ฉํ•˜๊ณ  ๋™๊ธฐํ™”ํ•ด์•ผ ์—ฌ๋Ÿฌ ํƒ€์ผ์— ๊ฑธ์นœ ์ •ํ™•ํ•œ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ค‘์ฒฉ์„ ํ™œ์šฉํ•œ 1D ํ•ฉ์„ฑ๊ณฑ

Puzzle 13 ๊ธฐ๋ฐ˜: ์ด ํผ์ฆ์€ Puzzle 13์˜ 1D ํ•ฉ์„ฑ๊ณฑ์„ ๋‹ค์‹œ ๋‹ค๋ฃจ์ง€๋งŒ, ์ด๋ฒˆ์—๋Š” ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์—ฐ์‚ฐ ๋’ค์— ์ˆจ๊ธฐ๋Š” ์ตœ์ ํ™”๋ฅผ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ๋™๊ธฐ์‹ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๋Œ€์‹ , ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ์‚ฌ์šฉํ•˜์—ฌ ๋น„์šฉ์ด ํฐ DRAM ์ „์†ก๊ณผ ์œ ์šฉํ•œ ์ž‘์—…์„ ์ค‘์ฒฉํ•ฉ๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

  • ๋ฒกํ„ฐ ํฌ๊ธฐ: VECTOR_SIZE = 16384 (์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์นœ 16K ์š”์†Œ)
  • ํƒ€์ผ ํฌ๊ธฐ: CONV_TILE_SIZE = 256 (์ฒ˜๋ฆฌ ํƒ€์ผ ํฌ๊ธฐ)
  • ๋ธ”๋ก ๊ตฌ์„ฑ: ๋ธ”๋ก๋‹น (256, 1) ์Šค๋ ˆ๋“œ
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๊ทธ๋ฆฌ๋“œ๋‹น (VECTOR_SIZE // CONV_TILE_SIZE, 1) ๋ธ”๋ก (64๊ฐœ ๋ธ”๋ก)
  • ์ปค๋„ ํฌ๊ธฐ: KERNEL_SIZE = 5 (Puzzle 13๊ณผ ๋™์ผํ•œ ๊ฐ„๋‹จํ•œ 1D ํ•ฉ์„ฑ๊ณฑ)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ ˆ์ด์•„์›ƒ: row_major[VECTOR_SIZE]() (1D row-major)

๋น„๋™๊ธฐ ๋ณต์‚ฌ์˜ ๊ธฐํšŒ

Puzzle 16 ๊ธฐ๋ฐ˜: matmul์—์„œ ๊น”๋”ํ•œ ํƒ€์ผ ๋กœ๋”ฉ์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์„ ์ด๋ฏธ ๋ณด์…จ์Šต๋‹ˆ๋‹ค. ์ด์ œ ๊ณ ์„ฑ๋Šฅ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํ•ต์‹ฌ์ธ ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ ๊ธฐ๋Šฅ์— ์ง‘์ค‘ํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ์กด์˜ ๋™๊ธฐ์‹ ๋ฉ”๋ชจ๋ฆฌ ๋กœ๋”ฉ์€ ์ „์†ก ์ค‘ ์—ฐ์‚ฐ ์œ ๋‹›์„ ์œ ํœด ์ƒํƒœ๋กœ ๋Œ€๊ธฐํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค. ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์€ ์ „์†ก๊ณผ ์œ ์šฉํ•œ ์ž‘์—…์˜ ์ค‘์ฒฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค:

# ๋™๊ธฐ์‹ ์ ‘๊ทผ - ๋น„ํšจ์œจ์ :
for i in range(CONV_TILE_SIZE):
    input_shared[i] = input[base_idx + i]  # ๊ฐ ๋กœ๋“œ๊ฐ€ DRAM์„ ๊ธฐ๋‹ค๋ฆผ
for i in range(KERNEL_SIZE):
    kernel_shared[i] = kernel[i]           # DRAM ์ถ”๊ฐ€ ๋Œ€๊ธฐ
barrier()  # ์—ฐ์‚ฐ ์‹œ์ž‘ ์ „ ๋ชจ๋“  ์Šค๋ ˆ๋“œ ๋Œ€๊ธฐ
# โ†‘ ์ด ์‹œ๊ฐ„ = input_transfer_time + kernel_transfer_time

# ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ ‘๊ทผ - ํšจ์œจ์ :
copy_dram_to_sram_async[thread_layout](input_shared, input_tile)  # ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก ์‹œ์ž‘
# ์ž…๋ ฅ์ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ „์†ก๋˜๋Š” ๋™์•ˆ, ์ปค๋„์„ ๋™๊ธฐ์‹์œผ๋กœ ๋กœ๋”ฉ
for i in range(KERNEL_SIZE):
    kernel_shared[i] = kernel[i]  # ๋น„๋™๊ธฐ ์ž…๋ ฅ ์ „์†ก๊ณผ ์ค‘์ฒฉ
async_copy_wait_all()  # ๋‘ ์—ฐ์‚ฐ์ด ๋ชจ๋‘ ์™„๋ฃŒ๋  ๋•Œ๋งŒ ๋Œ€๊ธฐ
# โ†‘ ์ด ์‹œ๊ฐ„ = MAX(input_transfer_time, kernel_transfer_time)

๋น„๋™๊ธฐ ๋ณต์‚ฌ๊ฐ€ ์ž˜ ๋™์ž‘ํ•˜๋Š” ์ด์œ :

  • ์ „์šฉ ๋ณต์‚ฌ ์—”์ง„: ์ตœ์‹  GPU๋Š” ๋ ˆ์ง€์Šคํ„ฐ๋ฅผ ์šฐํšŒํ•˜๊ณ  ์ง„์ •ํ•œ ์—ฐ์‚ฐ-๋ฉ”๋ชจ๋ฆฌ ์ค‘์ฒฉ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค (Puzzle 16์—์„œ ์„ค๋ช…)
  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€ํ: GPU ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค
  • ์ตœ์ ์˜ ๋ณ‘ํ•ฉ: ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ์ด ํšจ์œจ์ ์ธ DRAM ์ ‘๊ทผ ํŒจํ„ด์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ฆฌ์†Œ์Šค ํ™œ์šฉ: ์—ฐ์‚ฐ ์œ ๋‹›์ด ์œ ํœด ๋Œ€๊ธฐ ๋Œ€์‹  ๊ณ„์† ๋ฐ”์˜๊ฒŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค

์™„์„ฑํ•  ์ฝ”๋“œ

Puzzle 16์˜ matmul ๊ตฌํ˜„ ํŒจํ„ด์„ ๋”ฐ๋ผ, ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก๊ณผ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜๋Š” 1D ํ•ฉ์„ฑ๊ณฑ์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

์ˆ˜ํ•™์  ์—ฐ์‚ฐ: ๋น„๋™๊ธฐ ๋ณต์‚ฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ๋ฒกํ„ฐ์— ๋Œ€ํ•œ 1D ํ•ฉ์„ฑ๊ณฑ์„ ํšจ์œจ์ ์œผ๋กœ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค: \[\text{output}[i] = \sum_{k=0}^{\text{KERNEL_SIZE}-1} \text{input}[i+k-\text{HALO_SIZE}] \times \text{kernel}[k]\]

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  1. ๋น„๋™๊ธฐ ํƒ€์ผ ๋กœ๋”ฉ: ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ๋ฐฑ๊ทธ๋ผ์šด๋“œ DRAMโ†’SRAM ์ „์†ก ์‹œ์ž‘
  2. ์ค‘์ฒฉ ์—ฐ์‚ฐ: ์ž…๋ ฅ ์ „์†ก ์ค‘ ์ž‘์€ ์ปค๋„ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ
  3. ๋™๊ธฐํ™”: ์ „์†ก ์™„๋ฃŒ ๋Œ€๊ธฐ ํ›„ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—ฐ์‚ฐ
comptime VECTOR_SIZE = 16384
comptime CONV_TILE_SIZE = 256
comptime KERNEL_SIZE = 5
comptime HALO_SIZE = KERNEL_SIZE // 2  # Halo elements needed for boundary
comptime BUFFER_SIZE = CONV_TILE_SIZE + 2 * HALO_SIZE  # Include halo for boundary conditions
comptime BLOCKS_PER_GRID_ASYNC = (
    VECTOR_SIZE + CONV_TILE_SIZE - 1
) // CONV_TILE_SIZE
comptime THREADS_PER_BLOCK_ASYNC = 256
comptime dtype = DType.float32
comptime layout_async = row_major[VECTOR_SIZE]()
comptime LayoutAsyncType = type_of(layout_async)
comptime kernel_layout = row_major[KERNEL_SIZE]()
comptime KernelLayoutType = type_of(kernel_layout)


def async_copy_overlap_convolution[
    dtype: DType
](
    output: TileTensor[mut=True, dtype, LayoutAsyncType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutAsyncType, ImmutAnyOrigin],
    kernel: TileTensor[mut=False, dtype, KernelLayoutType, ImmutAnyOrigin],
):
    """Demonstrates async copy operations building on p14 patterns.

    This shows how to use copy_dram_to_sram_async and async_copy_wait_all
    for efficient memory transfers, extending the patterns from p14 matmul.
    """

    # Shared memory buffers (like p14, but without .fill(0) to avoid race)
    var input_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[CONV_TILE_SIZE]())
    var kernel_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[KERNEL_SIZE]())

    # FILL IN HERE (roughly 19 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p28/p28.mojo

ํŒ

1. ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜ ์ดํ•ด

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์€ ๋ธ”๋ก์ด ๋‹ค๋ฅธ ์ฝ”๋“œ๋ฅผ ๊ณ„์† ์‹คํ–‰ํ•˜๋Š” ๋™์•ˆ ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

ํƒ๊ตฌํ•  ํ•ต์‹ฌ ์งˆ๋ฌธ:

  • DRAM์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ ์–ด๋–ค ๋ฐ์ดํ„ฐ๋ฅผ ์ „์†กํ•ด์•ผ ํ•˜๋Š”๊ฐ€?
  • ์ „์†ก์ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ผ์–ด๋‚˜๋Š” ๋™์•ˆ ์–ด๋–ค ์—ฐ์‚ฐ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?
  • ํ•˜๋“œ์›จ์–ด๊ฐ€ ์—ฌ๋Ÿฌ ๋™์‹œ ์—ฐ์‚ฐ์„ ์–ด๋–ป๊ฒŒ ์กฐ์œจํ•˜๋Š”๊ฐ€?

์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๋ธ”๋ก์—๋Š” THREADS_PER_BLOCK_ASYNC = 256๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค
  • ํƒ€์ผ์—๋Š” CONV_TILE_SIZE = 256๊ฐœ์˜ ์š”์†Œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค
  • ์–ด๋–ค ๋ ˆ์ด์•„์›ƒ ํŒจํ„ด์ด ์ตœ์ ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์„ ๋ณด์žฅํ•˜๋Š”๊ฐ€?

2. ์ค‘์ฒฉ ๊ธฐํšŒ ํŒŒ์•…

๋ชฉํ‘œ๋Š” ์œ ์šฉํ•œ ์—ฐ์‚ฐ ๋’ค์— ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๋ถ„์„ ์ ‘๊ทผ๋ฒ•:

  • ์–ด๋–ค ์—ฐ์‚ฐ์ด ์ˆœ์ฐจ์ ์œผ๋กœ vs. ๋ณ‘๋ ฌ๋กœ ์ผ์–ด๋‚˜์•ผ ํ•˜๋Š”๊ฐ€?
  • ์–ด๋–ค ๋ฐ์ดํ„ฐ ์ „์†ก์ด ํฐ(๋น„์šฉ์ด ๋†’์€) vs. ์ž‘์€(๋น„์šฉ์ด ๋‚ฎ์€)๊ฐ€?
  • ๋ณ‘๋ ฌ ์‹คํ–‰์„ ์ตœ๋Œ€ํ™”ํ•˜๋„๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์–ด๋–ป๊ฒŒ ๊ตฌ์กฐํ™”ํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ํฐ ์ž…๋ ฅ ํƒ€์ผ: 256 ์š”์†Œ ร— 4 ๋ฐ”์ดํŠธ = 1KB ์ „์†ก
  • ์ž‘์€ ์ปค๋„: 5 ์š”์†Œ ร— 4 ๋ฐ”์ดํŠธ = 20 ๋ฐ”์ดํŠธ
  • ์–ด๋–ค ์ „์†ก์ด ๋น„๋™๊ธฐ ์ตœ์ ํ™”์˜ ์ด์ ์„ ๊ฐ€์žฅ ๋งŽ์ด ๋ฐ›๋Š”๊ฐ€?

3. ๋™๊ธฐํ™” ์ „๋žต

์ ์ ˆํ•œ ๋™๊ธฐํ™”๋Š” ์„ฑ๋Šฅ์„ ํฌ์ƒํ•˜์ง€ ์•Š์œผ๋ฉด์„œ ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

ํƒ€์ด๋ฐ ๋ถ„์„:

  • ๊ฐ ์—ฐ์‚ฐ์ด ์‹ค์ œ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์ค€๋น„๋˜์–ด์•ผ ํ•˜๋Š” ์‹œ์ ์€ ์–ธ์ œ์ธ๊ฐ€?
  • ์ •ํ™•์„ฑ์„ ์œ„ํ•ด ํ•„์š”ํ•œ ์ตœ์†Œํ•œ์˜ ๋™๊ธฐํ™”๋Š” ๋ฌด์—‡์ธ๊ฐ€?
  • ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์„ ์œ ์ง€ํ•˜๋ฉด์„œ ๋ถˆํ•„์š”ํ•œ ์ •์ฒด๋ฅผ ์–ด๋–ป๊ฒŒ ํ”ผํ•  ์ˆ˜ ์žˆ๋Š”๊ฐ€?

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€:

  • ์ „์†ก์ด ์™„๋ฃŒ๋˜๊ธฐ ์ „์— ์—ฐ์‚ฐ์ด ์‹œ์ž‘๋˜๋ฉด ์–ด๋–ป๊ฒŒ ๋˜๋Š”๊ฐ€?
  • ๋ฉ”๋ชจ๋ฆฌ ํŽœ์Šค์™€ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์„ ์–ด๋–ป๊ฒŒ ์กฐ์œจํ•˜๋Š”๊ฐ€?

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ค‘์ฒฉ ํ…Œ์ŠคํŠธ:

pixi run p28
pixi run -e amd p28
pixi run -e apple p28
uv run poe p28

์†”๋ฃจ์…˜

์ƒ์„ธ ์„ค๋ช…์ด ํฌํ•จ๋œ ์ „์ฒด ์†”๋ฃจ์…˜

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์ค‘์ฒฉ ์†”๋ฃจ์…˜๋Š” ๋น„์šฉ์ด ํฐ DRAM ์ „์†ก๊ณผ ์œ ์šฉํ•œ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

def async_copy_overlap_convolution[
    dtype: DType
](
    output: TileTensor[mut=True, dtype, LayoutAsyncType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutAsyncType, ImmutAnyOrigin],
    kernel: LayoutTensor[dtype, kernel_layout, ImmutAnyOrigin],
):
    """Demonstrates async copy operations building on p14 patterns.

    This shows how to use copy_dram_to_sram_async and async_copy_wait_all
    for efficient memory transfers, extending the patterns from p14 matmul.
    """

    # Shared memory buffers (like p14, but without .fill(0) to avoid race)
    var input_shared = LayoutTensor[
        dtype,
        Layout.row_major(CONV_TILE_SIZE),
        MutAnyOrigin,
        address_space=AddressSpace.SHARED,
    ].stack_allocation()
    var kernel_shared = LayoutTensor[
        dtype,
        Layout.row_major(KERNEL_SIZE),
        MutAnyOrigin,
        address_space=AddressSpace.SHARED,
    ].stack_allocation()

    var local_i = thread_idx.x

    # Phase 1: Launch async copy for input tile
    # Note: tile() does NOT perform bounds checking - ensure valid tile bounds
    var input_tile = input.tile[CONV_TILE_SIZE](block_idx.x).to_layout_tensor()

    # Use async copy with thread layout matching p14 pattern
    comptime load_layout = Layout.row_major(THREADS_PER_BLOCK_ASYNC)
    copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile)

    # Phase 2: Load kernel synchronously (small data)
    if local_i < KERNEL_SIZE:
        kernel_shared[local_i] = kernel[local_i]

    # Phase 3: Wait for async copy to complete
    async_copy_wait_all()  # Always wait since we always do async copy
    barrier()  # Sync all threads

    # Phase 4: Compute convolution
    var global_i = block_idx.x * CONV_TILE_SIZE + local_i
    if local_i < CONV_TILE_SIZE and global_i < Int(output.dim[0]()):
        var result: output.ElementType = 0

        # Simple convolution avoiding boundary issues
        if local_i >= HALO_SIZE and local_i < CONV_TILE_SIZE - HALO_SIZE:
            # Full convolution for center elements
            for k in range(KERNEL_SIZE):
                var input_idx = local_i + k - HALO_SIZE
                if input_idx >= 0 and input_idx < CONV_TILE_SIZE:
                    result += rebind[Scalar[dtype]](
                        input_shared[input_idx]
                    ) * rebind[Scalar[dtype]](kernel_shared[k])
        else:
            # For boundary elements, just copy input (no convolution)
            result = rebind[Scalar[dtype]](input_shared[local_i])

        output[global_i] = result


๋‹จ๊ณ„๋ณ„ ๋ถ„์„

Phase 1: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‹œ์ž‘

# Phase 1: Launch async copy for input tile
input_tile = input.tile[CONV_TILE_SIZE](block_idx.x)
comptime load_layout = row_major[THREADS_PER_BLOCK_ASYNC]()
copy_dram_to_sram_async[thread_layout=load_layout](input_shared, input_tile)
  • ํƒ€์ผ ์ƒ์„ฑ: input.tile[CONV_TILE_SIZE](block_idx.x)๋Š” block_idx.x * 256์—์„œ ์‹œ์ž‘ํ•˜๋Š” 256๊ฐœ ์š”์†Œ์˜ ์ž…๋ ฅ ๋ฐฐ์—ด ๋ทฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. Mojo์˜ tile ๋ฉ”์„œ๋“œ๋Š” ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋‚˜ ์ œ๋กœ ํŒจ๋”ฉ์„ ์ˆ˜ํ–‰ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚œ ์ธ๋ฑ์Šค ์ ‘๊ทผ์€ ๋ฏธ์ •์˜ ๋™์ž‘์„ ์ดˆ๋ž˜ํ•ฉ๋‹ˆ๋‹ค. ๊ตฌํ˜„์—์„œ ํƒ€์ผ ํฌ๊ธฐ์™€ offset์ด ์œ ํšจํ•œ ๋ฐฐ์—ด ๋ฒ”์œ„ ๋‚ด์— ์žˆ๋Š”์ง€ ํ™•์ธํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

  • ์Šค๋ ˆ๋“œ ๋ ˆ์ด์•„์›ƒ: row_major[THREADS_PER_BLOCK_ASYNC, 1]()๋Š” ๋ธ”๋ก ๊ตฌ์„ฑ๊ณผ ์ผ์น˜ํ•˜๋Š” 256 x 1 ๋ ˆ์ด์•„์›ƒ์„ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ํ•„์ˆ˜์ž…๋‹ˆ๋‹ค - ์ตœ์ ์˜ ๋ณ‘ํ•ฉ๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์œ„ํ•ด ๋ ˆ์ด์•„์›ƒ์ด ๋ฌผ๋ฆฌ์  ์Šค๋ ˆ๋“œ ๋ฐฐ์น˜์™€ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ ˆ์ด์•„์›ƒ์ด ์ผ์น˜ํ•˜์ง€ ์•Š์œผ๋ฉด ์Šค๋ ˆ๋“œ๊ฐ€ ๋น„์—ฐ์†์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ฃผ์†Œ์— ์ ‘๊ทผํ•˜์—ฌ ๋ณ‘ํ•ฉ์ด ๊นจ์ง€๊ณ  ์„ฑ๋Šฅ์ด ์‹ฌ๊ฐํ•˜๊ฒŒ ์ €ํ•˜๋ฉ๋‹ˆ๋‹ค.

  • ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‹œ์ž‘: copy_dram_to_sram_async๋Š” DRAM์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋กœ์˜ ๋ฐฑ๊ทธ๋ผ์šด๋“œ ์ „์†ก์„ ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋“œ์›จ์–ด๊ฐ€ 256๊ฐœ์˜ float(1KB)๋ฅผ ๋ณต์‚ฌํ•˜๋Š” ๋™์•ˆ ๋ธ”๋ก์€ ๊ณ„์† ์‹คํ–‰๋ฉ๋‹ˆ๋‹ค.

Phase 2: ์ค‘์ฒฉ ์—ฐ์‚ฐ

# Phase 2: Load kernel synchronously (small data)
if local_i < KERNEL_SIZE:
    kernel_shared[local_i] = kernel[local_i]
  • ๋™์‹œ ์‹คํ–‰: 1KB ์ž…๋ ฅ ํƒ€์ผ์ด ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ „์†ก๋˜๋Š” ๋™์•ˆ, ์Šค๋ ˆ๋“œ๋“ค์€ ์ž‘์€ 20๋ฐ”์ดํŠธ ์ปค๋„์„ ๋™๊ธฐ์‹์œผ๋กœ ๋กœ๋”ฉํ•ฉ๋‹ˆ๋‹ค. ์ด ์ค‘์ฒฉ์ด ํ•ต์‹ฌ ์ตœ์ ํ™”์ž…๋‹ˆ๋‹ค.

  • ํฌ๊ธฐ ๊ธฐ๋ฐ˜ ์ „๋žต: ํฐ ์ „์†ก(์ž…๋ ฅ ํƒ€์ผ)์€ ๋น„๋™๊ธฐ ๋ณต์‚ฌ๋ฅผ, ์ž‘์€ ์ „์†ก(์ปค๋„)์€ ๋™๊ธฐ์‹ ๋กœ๋”ฉ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๋ณต์žก์„ฑ๊ณผ ์„ฑ๋Šฅ ์ด์ ์˜ ๊ท ํ˜•์„ ๋งž์ถฅ๋‹ˆ๋‹ค.

Phase 3: ๋™๊ธฐํ™”

# Phase 3: Wait for async copy to complete
async_copy_wait_all()  # Always wait since we always do async copy
barrier()  # Sync all threads
  • ์ „์†ก ์™„๋ฃŒ: async_copy_wait_all()์€ ๋ชจ๋“  ๋น„๋™๊ธฐ ์ „์†ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค. input_shared์— ์ ‘๊ทผํ•˜๊ธฐ ์ „์— ๋ฐ˜๋“œ์‹œ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

  • ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”: barrier()๋Š” ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฐ์‚ฐ์œผ๋กœ ๋„˜์–ด๊ฐ€๊ธฐ ์ „์— ์™„๋ฃŒ๋œ ์ „์†ก์„ ํ™•์ธํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

Phase 4: ์—ฐ์‚ฐ

# Phase 4: Compute convolution
global_i = block_idx.x * CONV_TILE_SIZE + local_i
if local_i < CONV_TILE_SIZE and global_i < Int(output.dim[0]()):
    var result: output.ElementType = 0

    if local_i >= HALO_SIZE and local_i < CONV_TILE_SIZE - HALO_SIZE:
        # Full convolution for center elements
        for k in range(KERNEL_SIZE):
            input_idx = local_i + k - HALO_SIZE
            if input_idx >= 0 and input_idx < CONV_TILE_SIZE:
                result += input_shared[input_idx] * kernel_shared[k]
    else:
        # For boundary elements, just copy input (no convolution)
        result = input_shared[local_i]

    output[global_i] = result
  • ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ: ๋ชจ๋“  ์—ฐ์‚ฐ์ด ๋ฏธ๋ฆฌ ๋กœ๋“œ๋œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ, ์—ฐ์‚ฐ ์ง‘์•ฝ์ ์ธ ํ•ฉ์„ฑ๊ณฑ ๋ฃจํ”„์—์„œ ๋А๋ฆฐ DRAM ์ ‘๊ทผ์„ ํ”ผํ•ฉ๋‹ˆ๋‹ค.

  • ๋‹จ์ˆœํ™”๋œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ด ๊ตฌํ˜„์€ ํƒ€์ผ ๊ฒฝ๊ณ„ ๊ทผ์ฒ˜ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์‹ค์šฉ์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

    • ์ค‘์‹ฌ ์š”์†Œ (local_i >= HALO_SIZE์ด๊ณ  local_i < CONV_TILE_SIZE - HALO_SIZE): ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ „์ฒด 5์  ํ•ฉ์„ฑ๊ณฑ ์ ์šฉ
    • ๊ฒฝ๊ณ„ ์š”์†Œ (๊ฐ ํƒ€์ผ์˜ ์ฒ˜์Œ 2๊ฐœ์™€ ๋งˆ์ง€๋ง‰ 2๊ฐœ ์š”์†Œ): ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ๋กœ์ง์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ•ฉ์„ฑ๊ณฑ ์—†์ด ์ž…๋ ฅ์„ ์ง์ ‘ ๋ณต์‚ฌ

๊ต์œก์  ๊ทผ๊ฑฐ: ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ณต์žกํ•œ ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ๋ณด๋‹ค ๋น„๋™๊ธฐ ๋ณต์‚ฌ ํŒจํ„ด ์‹œ์—ฐ์„ ์šฐ์„ ์‹œํ•ฉ๋‹ˆ๋‹ค. HALO_SIZE = 2์ธ 256๊ฐœ ์š”์†Œ ํƒ€์ผ์—์„œ, ์š”์†Œ 0-1๊ณผ 254-255๋Š” ์ž…๋ ฅ ๋ณต์‚ฌ๋ฅผ, ์š”์†Œ 2-253์€ ์ „์ฒด ํ•ฉ์„ฑ๊ณฑ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋™์ž‘ํ•˜๋Š” ๊ตฌํ˜„์„ ์ œ๊ณตํ•˜๋ฉด์„œ ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”์— ์ดˆ์ ์„ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ถ„์„

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—†์ด (๋™๊ธฐ์‹):

Total Time = Input_Transfer_Time + Kernel_Transfer_Time + Compute_Time
           = Large_DRAM_transfer + Small_DRAM_transfer + convolution
           = Major_latency + Minor_latency + computation_work

๋น„๋™๊ธฐ ๋ณต์‚ฌ ์‚ฌ์šฉ (์ค‘์ฒฉ):

Total Time = MAX(Input_Transfer_Time, Kernel_Transfer_Time) + Compute_Time
           = MAX(Major_latency, Minor_latency) + computation_work
           = Major_latency + computation_work

์„ฑ๋Šฅ ํ–ฅ์ƒ: ๋” ํฐ ์ž…๋ ฅ ์ „์†ก ๋’ค์— ๋” ์ž‘์€ ์ปค๋„ ์ „์†ก์˜ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊น€์œผ๋กœ์จ ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋ฉ๋‹ˆ๋‹ค. ์‹ค์ œ ์„ฑ๋Šฅ ํ–ฅ์ƒ ํญ์€ ์ „์†ก์˜ ์ƒ๋Œ€์  ํฌ๊ธฐ์™€ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง‘๋‹ˆ๋‹ค. ๋” ํฐ ์ค‘์ฒฉ์ด ๊ฐ€๋Šฅํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์‹œ๋‚˜๋ฆฌ์˜ค์—์„œ๋Š” ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ํ›จ์”ฌ ํด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ

  1. 스레드 레이아웃 매칭: row_major[256]() 레이아웃이 블록의 (256, 1) 스레드 구성과 정확히 일치하여 최적의 메모리 병합을 가능하게 합니다.

  2. ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ์ ์ ˆํ•œ ์ˆœ์„œ ์ง€์ •(๋น„๋™๊ธฐ ๋ณต์‚ฌ โ†’ ์ปค๋„ ๋กœ๋“œ โ†’ ๋Œ€๊ธฐ โ†’ ๋ฐฐ๋ฆฌ์–ด โ†’ ์—ฐ์‚ฐ)์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์†์ƒ์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

  3. ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: ์ตœ์‹  GPU๋Š” ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ฐ–์ถ”๊ณ  ์žˆ์–ด, ๋ฉ”๋ชจ๋ฆฌ ์œ ๋‹›๊ณผ ์—ฐ์‚ฐ ์œ ๋‹› ์‚ฌ์ด์˜ ์ง„์ •ํ•œ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

  4. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ํ™œ์šฉ: ์ด ํŒจํ„ด์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ํ†ตํ•ด ํšจ์œจ์ ์œผ๋กœ ์ด๋™์‹œํ‚ต๋‹ˆ๋‹ค: DRAM โ†’ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ ˆ์ง€์Šคํ„ฐ โ†’ ์—ฐ์‚ฐ.

  5. ํ…Œ์ŠคํŠธ-๊ตฌํ˜„ ์ผ๊ด€์„ฑ: ํ…Œ์ŠคํŠธ ๊ฒ€์ฆ ๋กœ์ง์€ local_i_in_tile = i % CONV_TILE_SIZE๋ฅผ ๊ฒ€์‚ฌํ•˜์—ฌ ๊ฐ ์š”์†Œ๊ฐ€ ํ•ฉ์„ฑ๊ณฑ ๊ฒฐ๊ณผ(์ค‘์‹ฌ ์š”์†Œ)๋ฅผ ๊ธฐ๋Œ€ํ•ด์•ผ ํ•˜๋Š”์ง€ ์ž…๋ ฅ ๋ณต์‚ฌ(๊ฒฝ๊ณ„ ์š”์†Œ)๋ฅผ ๊ธฐ๋Œ€ํ•ด์•ผ ํ•˜๋Š”์ง€ ํŒ๋ณ„ํ•˜๋ฉฐ, ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ ์ „๋žต๊ณผ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋‹จ์ˆœํ™”๋œ ๊ฒฝ๊ณ„ ์ ‘๊ทผ ๋ฐฉ์‹์˜ ์ •ํ™•ํ•œ ๊ฒ€์ฆ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค.

이 솔루션은 단순한 메모리 바운드 합성곱을 유용한 작업 뒤에 메모리 지연 시간을 숨기는 최적화된 구현으로 변환하여, 고성능 GPU 프로그래밍의 기본 원리를 보여줍니다.

Puzzle 29: GPU ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ

๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๋„˜์–ด์„œ

์ด ์žฅ์—์„œ๋Š” ์Šค๋ ˆ๋“œ ๊ฐ„ ์ •๋ฐ€ํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•œ ๋ณต์žกํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋™๊ธฐํ™” ํŒจํ„ด์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์— ์ดˆ์ ์„ ๋งž์ถ˜ ์ด์ „ ํผ์ฆ๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ์ด ์ฑŒ๋ฆฐ์ง€๋“ค์€ ์‹ค์ œ GPU ์†Œํ”„ํŠธ์›จ์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์•„ํ‚คํ…์ฒ˜ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ:

  • ์Šค๋ ˆ๋“œ ํŠนํ™”: ํ•˜๋‚˜์˜ ๋ธ”๋ก ์•ˆ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ๊ฐ๊ฐ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰
  • ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ํŒŒ์ดํ”„๋ผ์ธ: ๋ช…์‹œ์  ๋ฐ์ดํ„ฐ ์˜์กด์„ฑ์„ ๊ฐ€์ง„ ๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ
  • ๊ณ ๊ธ‰ ๋ฐฐ๋ฆฌ์–ด API: ๊ธฐ๋ณธ barrier() ํ˜ธ์ถœ์„ ๋„˜์–ด์„  ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •: ๋ฉ”๋ชจ๋ฆฌ ๊ฐ€์‹œ์„ฑ๊ณผ ์ˆœ์„œ์— ๋Œ€ํ•œ ๋ช…์‹œ์  ์ œ์–ด
  • ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด: ๋ณต์žกํ•œ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง๊ณผ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •

์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋Œ€๋ถ€๋ถ„์˜ GPU ํŠœํ† ๋ฆฌ์–ผ์€ ๋‹จ์ˆœํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ํŒจํ„ด์„ ๊ฐ€๋ฅด์น˜์ง€๋งŒ, ์‹ค์ œ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด, ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๊ฐ„์˜ ์ •๊ตํ•œ ์กฐ์œจ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ๋“ค์€ ํ•™์ˆ ์  ์˜ˆ์ œ์™€ ์‹ค์ œ GPU ์ปดํ“จํŒ… ์‚ฌ์ด์˜ ๊ฐ„๊ทน์„ ๋ฉ”์›Œ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

GPU ๋™๊ธฐํ™”๋Š” ๋ณต์žกํ•œ ๋ณ‘๋ ฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์˜ฌ๋ฐ”๋ฅด๊ณ  ํšจ์œจ์ ์œผ๋กœ ๋™์ž‘ํ•˜๊ฒŒ ํ•˜๋Š” ํ† ๋Œ€์ž…๋‹ˆ๋‹ค. ์ด ์žฅ์—์„œ๋Š” ๊ณ ์„ฑ๋Šฅ GPU ์ปดํ“จํŒ… ์ „๋ฐ˜์— ๊ฑธ์ณ ๋‚˜ํƒ€๋‚˜๋Š” ์„ธ ๊ฐ€์ง€ ๊ธฐ๋ณธ์ ์ธ ๋™๊ธฐํ™” ํŒจํ„ด์ธ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •, ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ, ์ŠคํŠธ๋ฆฌ๋ฐ ์—ฐ์‚ฐ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต ๋ชฉํ‘œ:

  • ์„œ๋กœ ๋‹ค๋ฅธ ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ๊ฐ€ ์–ธ์ œ, ์™œ ํ•„์š”ํ•œ์ง€ ์ดํ•ด
  • ์ ์ ˆํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”๋ฅผ ํ†ตํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„
  • ์ •๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •์ด ํ•„์š”ํ•œ ๋ฐ˜๋ณต ํŒจํ„ด ๊ตฌํ˜„
  • ์ •ํ™•์„ฑ์„ ๋ณด์žฅํ•˜๋ฉด์„œ ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์ ํ™”

์•„ํ‚คํ…์ฒ˜ ์ง„ํ–‰ ๊ตฌ์กฐ: ์ด ํผ์ฆ๋“ค์€ ๊ธฐ๋ณธ์ ์ธ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •๋ถ€ํ„ฐ ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ๊นŒ์ง€, ๊ทธ๋ฆฌ๊ณ  ์ตœ์ข…์ ์œผ๋กœ ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ŠคํŠธ๋ฆฌ๋ฐ ์—ฐ์‚ฐ ํŒจํ„ด๊นŒ์ง€ ๋‹จ๊ณ„์ ์œผ๋กœ ์ง„ํ–‰๋˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

์Šค๋ ˆ๋“œ ์กฐ์œจ ํŒจ๋Ÿฌ๋‹ค์ž„:

  • ๋‹จ์ˆœ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ (์ด์ „ ํผ์ฆ๋“ค)
  • ํŠนํ™” ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ๊ฐ๊ฐ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์ˆ˜ํ–‰ (์ด ์žฅ)
  • ํŒŒ์ดํ”„๋ผ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง„ ์ˆœ์ฐจ์  ๋‹จ๊ณ„
  • ๋ฐ˜๋ณต ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์‹ ์ค‘ํ•œ ๋ฒ„ํผ ๊ด€๋ฆฌ๋ฅผ ์ˆ˜๋ฐ˜ํ•˜๋Š” ๋‹ค์ค‘ ํŒจ์Šค

๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ์˜ ๊ณ„์ธต ๊ตฌ์กฐ:

  • ๊ธฐ๋ณธ barrier(): ๋ธ”๋ก ๋‚ด ๋‹จ์ˆœ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”
  • ๊ณ ๊ธ‰ mbarrier API: ์ƒํƒœ ์ถ”์ ์„ ์ง€์›ํ•˜๋Š” ์„ธ๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์ œ์–ด
  • ์ŠคํŠธ๋ฆฌ๋ฐ ์กฐ์ •: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ๋ฐ ๋Œ€๋Ÿ‰ ์ „์†ก ๋™๊ธฐํ™”

๋ฉ”๋ชจ๋ฆฌ ์ผ๊ด€์„ฑ ๋ชจ๋ธ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •: ์Šค๋ ˆ๋“œ ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•œ ๋น ๋ฅธ ์˜จ์นฉ ๋ฉ”๋ชจ๋ฆฌ
  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ˆœ์„œ ๋ณด์žฅ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ๊ณต๊ฐ„์— ๊ฑธ์ณ ์“ฐ๊ธฐ์˜ ๊ฐ€์‹œ์„ฑ ๋ณด์žฅ
  • ๋ฒ„ํผ ๊ด€๋ฆฌ: ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง๊ณผ ํ•‘ํ ํŒจํ„ด

๊ตฌ์„ฑ

์‹œ์Šคํ…œ ์•„ํ‚คํ…์ฒ˜:

  • ๋ธ”๋ก ํฌ๊ธฐ: ์ตœ์ ์˜ ์ ์œ ์œจ์„ ์œ„ํ•ด ๋ธ”๋ก๋‹น TPB = 256 ์Šค๋ ˆ๋“œ
  • ๊ทธ๋ฆฌ๋“œ ๊ตฌ์„ฑ: ๊ฐ๊ฐ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ํƒ€์ผ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋‹ค์ˆ˜์˜ ๋ธ”๋ก
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋ ˆ์ง€์Šคํ„ฐ, ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ์˜ ์ „๋žต์  ํ™œ์šฉ
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: ์ˆ˜์น˜ ์—ฐ์‚ฐ์„ ์œ„ํ•œ DType.float32

๋‹ค๋ฃจ๋Š” ๋™๊ธฐํ™” ํŒจํ„ด:

  1. ๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ: ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์„ ํ™œ์šฉํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”
  2. ๋”๋ธ” ๋ฒ„ํผ๋ง ๋ฐ˜๋ณต: ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ด€๋ฆฌ
  3. ์ŠคํŠธ๋ฆฌ๋ฐ ์—ฐ์‚ฐ: ๊ณ ์ฒ˜๋ฆฌ๋Ÿ‰ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์กฐ์ •

์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ:

  • ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ: ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐฐ๋ฆฌ์–ด ์œ ํ˜•์˜ ๋น„์šฉ ์ดํ•ด
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ์ตœ๋Œ€ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์œ„ํ•œ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”
  • ์Šค๋ ˆ๋“œ ํ™œ์šฉ๋„: ํŠนํ™”๋œ ์—ญํ• ๊ณผ ์ „์ฒด ํšจ์œจ์„ฑ ๊ฐ„์˜ ๊ท ํ˜•

ํผ์ฆ ๊ตฌ์„ฑ

์ด ์žฅ์—๋Š” ์„œ๋กœ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ฐœ์ „ํ•˜๋Š” ์„ธ ๊ฐœ์˜ ์—ฐ๊ฒฐ๋œ ํผ์ฆ์ด ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •

์ดˆ์ : ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜

ํ•˜๋‚˜์˜ ๋ธ”๋ก ์•ˆ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” GPU ์ปค๋„์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ๋Š” ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„์™€ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๊ฐ„์˜ ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ์Šค๋ ˆ๋“œ ์—ญํ•  ํŠนํ™” (Stage 1: ๋กœ๋“œ, Stage 2: ์ฒ˜๋ฆฌ, Stage 3: ์ถœ๋ ฅ)
  • ์ฒ˜๋ฆฌ ๋‹จ๊ณ„ ๊ฐ„ ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๋ฐ์ดํ„ฐ ํ๋ฆ„
  • ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์‚ฌ์ด์˜ ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ: ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ, ๋‹ค๋‹จ๊ณ„ ๊ณผํ•™ ์—ฐ์‚ฐ, ์‹ ๊ฒฝ๋ง ๋ ˆ์ด์–ด ์กฐ์ •

๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์—ฐ์‚ฐ

์ดˆ์ : ๊ณ ๊ธ‰ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด API์™€ ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ

์ •๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •์ด ํ•„์š”ํ•œ ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์œ„ํ•ด mbarrier API๋ฅผ ์‚ฌ์šฉํ•œ ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด๋ฅผ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ด ํผ์ฆ์€ ๋ฐ˜๋ณต๋ฒ•๊ณผ ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜์ ์ธ ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…:

  • ๊ณ ๊ธ‰ mbarrier API vs ๊ธฐ๋ณธ barrier()
  • ์ฝ๊ธฐ/์“ฐ๊ธฐ ๋ฒ„ํผ ์—ญํ• ์„ ๊ต๋Œ€ํ•˜๋Š” ๋”๋ธ” ๋ฒ„ํผ๋ง
  • ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์‚ฌ์šฉํ•œ ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์กฐ์ •

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ: ๋ฐ˜๋ณต๋ฒ• (Jacobi, Gauss-Seidel), ์…€๋ฃฐ๋Ÿฌ ์˜คํ† ๋งˆํƒ€, ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์‹œ๊ฐ„ ์Šคํ…

Getting Started

Recommended learning path:

  1. Start with pipeline coordination: understand the fundamentals of thread specialization
  2. Move on to memory barriers: learn fine-grained synchronization control
  3. Apply to streaming patterns: combine the concepts for high-throughput applications

Prerequisites:

  • Understanding of basic GPU programming concepts (threads, blocks, shared memory)
  • Understanding of the memory hierarchy and access patterns
  • Familiarity with barrier synchronization from earlier puzzles

Learning outcome: after completing this chapter, you will have the foundation to design and implement sophisticated GPU algorithms that require precise coordination, preparing you for the architectural complexity you will face in real GPU computing applications.

Ready to begin? Learn the basics of thread specialization in Multi-Stage Pipeline Coordination, then move on to Double-Buffered Stencil Computation to explore advanced memory barrier techniques.

๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์กฐ์ •

๊ฐœ์š”

์กฐ์œจ๋œ 3๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ์„ ํ†ตํ•ด ์ด๋ฏธ์ง€๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ์ปค๋„์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ํŠนํ™”๋œ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ๋‹ด๋‹นํ•˜๊ณ , ๋ช…์‹œ์  ๋ฐฐ๋ฆฌ์–ด๋กœ ๋™๊ธฐํ™”๋ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์Šค๋ ˆ๋“œ ์—ญํ• ์ด ํŠนํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค: Stage 1 (์Šค๋ ˆ๋“œ 0-127)์€ ๋ฐ์ดํ„ฐ๋ฅผ ๋กœ๋“œํ•˜๊ณ  ์ „์ฒ˜๋ฆฌํ•˜๋ฉฐ, Stage 2 (์Šค๋ ˆ๋“œ 128-255)๋Š” ๋ธ”๋Ÿฌ ์—ฐ์‚ฐ์„ ์ ์šฉํ•˜๊ณ , Stage 3 (์ „์ฒด ์Šค๋ ˆ๋“œ)์€ ์ตœ์ข… ์Šค๋ฌด๋”ฉ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

์•Œ๊ณ ๋ฆฌ์ฆ˜ ์•„ํ‚คํ…์ฒ˜: ์ด ํผ์ฆ์€ ํ•˜๋‚˜์˜ GPU ๋ธ”๋ก ์•ˆ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” ์ „ํ†ต์ ์ธ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋‹ฌ๋ฆฌ, ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ์Šค๋ ˆ๋“œ๋ฅผ ๊ธฐ๋Šฅ๋ณ„๋กœ ํŠนํ™”ํ•˜์—ฌ ๋ถ„ํ• ํ•ฉ๋‹ˆ๋‹ค.

ํŒŒ์ดํ”„๋ผ์ธ ๊ฐœ๋…: ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์„ธ ๊ฐœ์˜ ๊ตฌ๋ถ„๋œ ๋‹จ๊ณ„๋ฅผ ํ†ตํ•ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋ฉฐ, ๊ฐ ๋‹จ๊ณ„์—๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰ํ•˜๋Š” ํŠนํ™”๋œ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ๋‹จ๊ณ„๋Š” ๋‹ค์Œ ๋‹จ๊ณ„๊ฐ€ ์†Œ๋น„ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•˜์—ฌ, ๋ฐฐ๋ฆฌ์–ด๋กœ ์‹ ์ค‘ํ•˜๊ฒŒ ๋™๊ธฐํ™”ํ•ด์•ผ ํ•˜๋Š” ๋ช…์‹œ์  ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

๋ฐ์ดํ„ฐ ์˜์กด์„ฑ๊ณผ ๋™๊ธฐํ™”: ๊ฐ ๋‹จ๊ณ„๋Š” ๋‹ค์Œ ๋‹จ๊ณ„๊ฐ€ ์†Œ๋น„ํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค:

  • Stage 1 โ†’ Stage 2: ์ฒซ ๋ฒˆ์งธ ๋‹จ๊ณ„๊ฐ€ ๋ธ”๋Ÿฌ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ƒ์„ฑ
  • Stage 2 โ†’ Stage 3: ๋‘ ๋ฒˆ์งธ ๋‹จ๊ณ„๊ฐ€ ์ตœ์ข… ์Šค๋ฌด๋”ฉ์„ ์œ„ํ•œ ๋ธ”๋Ÿฌ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑ
  • ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€: ์˜์กดํ•˜๋Š” ๋‹จ๊ณ„๊ฐ€ ์‹œ์ž‘๋˜๊ธฐ ์ „์— ํ•ด๋‹น ๋‹จ๊ณ„๊ฐ€ ์™„์ „ํžˆ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ

๊ตฌ์ฒด์ ์œผ๋กœ, ๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ์€ ์„ธ ๊ฐ€์ง€ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์œผ๋กœ ๊ตฌ์„ฑ๋œ ์กฐ์œจ๋œ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค:

Stage 1 - Preprocessing enhancement:

\[P[i] = I[i] \times 1.1\]

where \(P[i]\) is the preprocessed data and \(I[i]\) is the input data.

Stage 2 - Horizontal blur filter:

\[B[i] = \frac{1}{N_i} \sum_{k=-2}^{2} P[i+k] \quad \text{where } i+k \in [0, 255]\]

where \(B[i]\) is the blur result and \(N_i\) is the number of valid neighbors within the tile boundary.

Stage 3 - Cascaded neighbor smoothing:

\[F[i] = \begin{cases} (B[i] + B[i+1]) \times 0.6 & \text{if } i = 0 \\ ((B[i] + B[i-1]) \times 0.6 + B[i+1]) \times 0.6 & \text{if } 0 < i < 255 \\ (B[i] + B[i-1]) \times 0.6 & \text{if } i = 255 \end{cases}\]

where \(F[i]\) is the final output with cascaded smoothing applied.

Thread specialization:

  • Threads 0-127: compute \(P[i]\) for \(i \in \{0, 1, 2, \ldots, 255\}\) (2 elements per thread)
  • Threads 128-255: compute \(B[i]\) for \(i \in \{0, 1, 2, \ldots, 255\}\) (2 elements per thread)
  • All 256 threads: compute \(F[i]\) for \(i \in \{0, 1, 2, \ldots, 255\}\) (1 element per thread)

Synchronization points:

\[\text{barrier}_1 \Rightarrow P[i] \text{ complete} \Rightarrow \text{barrier}_2 \Rightarrow B[i] \text{ complete} \Rightarrow \text{barrier}_3 \Rightarrow F[i] \text{ complete}\]
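The three stage equations above can be checked against a small CPU reference. The following is a minimal Python sketch (illustrative only, not the Mojo kernel; the function names are hypothetical) that applies the same math sequentially to one 256-element tile, with the same valid-neighbor boundary handling:

```python
# CPU reference for the three pipeline stages on one 256-element tile.
# Boundary handling follows the formulas above: only neighbors inside
# [0, TPB-1] contribute to the blur average.

TPB = 256
BLUR_RADIUS = 2

def stage1_preprocess(inp):
    # P[i] = I[i] * 1.1
    return [x * 1.1 for x in inp]

def stage2_blur(p):
    # B[i] = mean of P[i-2 .. i+2] over valid neighbors
    out = []
    for i in range(len(p)):
        neighbors = [p[i + k] for k in range(-BLUR_RADIUS, BLUR_RADIUS + 1)
                     if 0 <= i + k < len(p)]
        out.append(sum(neighbors) / len(neighbors))
    return out

def stage3_smooth(b):
    # F[i]: cascaded neighbor smoothing with 0.6 scaling at each step
    n = len(b)
    out = []
    for i in range(n):
        v = b[i]
        if i > 0:
            v = (v + b[i - 1]) * 0.6
        if i < n - 1:
            v = (v + b[i + 1]) * 0.6
        out.append(v)
    return out

inp = [1.0] * TPB  # constant test input
result = stage3_smooth(stage2_blur(stage1_preprocess(inp)))
```

With a constant input of 1.0, every interior element becomes ((1.1 + 1.1) * 0.6 + 1.1) * 0.6 = 1.452, which makes the cascaded scaling easy to verify by hand.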

Key Concepts

In this puzzle you will learn:

  • Implementing thread role specialization within a single GPU block
  • Orchestrating producer-consumer relationships between processing stages
  • Using barriers to synchronize between different algorithms (not just within the same algorithm)

The key insight is understanding how to design multi-stage pipelines in which different thread groups execute entirely different algorithms, coordinated through strategic barrier placement.

Why it matters: most GPU tutorials teach barrier usage within a single algorithm - synchronizing threads during a reduction or a shared memory operation. Real GPU algorithms, however, often require architectural complexity involving multiple distinct processing stages that must be carefully coordinated. This puzzle shows how to transform a monolithic algorithm into a specialized, coordinated processing pipeline.

Barrier usage in earlier puzzles vs. this one:

  • Earlier puzzles (P8, P12, P15): all threads run the same algorithm, and barriers synchronize within an algorithm phase
  • This puzzle: different thread groups run different algorithms, and barriers coordinate between those algorithms

Thread specialization architecture: unlike data parallelism, where threads differ only in the data indices they process, this puzzle implements algorithm parallelism, where threads execute fundamentally different code paths depending on their role in the pipeline.

Configuration

System parameters:

  • Image size: SIZE = 1024 elements (1D for simplicity)
  • Threads per block: TPB = 256 threads, organized as (256, 1) block dimensions
  • Grid configuration: (4, 1) blocks to process the full image in tiles (4 blocks total)
  • Data type: DType.float32 for all computation

Thread specialization architecture:

  • Stage 1 threads: STAGE1_THREADS = 128 (threads 0-127, first half of the block)

    • Role: load input data from global memory and apply preprocessing
    • Work distribution: 2 elements per thread for efficient load balancing
    • Output: populate input_shared[256] with preprocessed data
  • Stage 2 threads: STAGE2_THREADS = 128 (threads 128-255, second half of the block)

    • Role: apply the horizontal blur filter to the preprocessed data
    • Work distribution: 2 blur operations per thread
    • Output: populate blur_shared[256] with blur results
  • Stage 3 threads: all 256 threads cooperate

    • Role: final smoothing and output to global memory
    • Work distribution: one-to-one mapping (thread i processes element i)
    • Output: write final results to the global output array
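As a quick sanity check on the 2-elements-per-thread mapping above, this small Python sketch (illustrative, not part of the puzzle code) confirms that 128 Stage 1 threads writing slots local_i and local_i + STAGE1_THREADS cover all 256 shared-memory slots exactly once:

```python
# Verify the "2 elements per thread" index mapping used by Stage 1 and
# Stage 2: 128 threads together cover 256 shared-memory slots exactly once.
TPB = 256
STAGE1_THREADS = TPB // 2

covered = []
for local_i in range(STAGE1_THREADS):          # threads 0..127
    covered.append(local_i)                    # first element
    covered.append(local_i + STAGE1_THREADS)   # second element

# Every slot 0..255 is written exactly once, with no gaps or duplicates.
assert sorted(covered) == list(range(TPB))
```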

Code to Complete


comptime TPB = 256  # Threads per block for pipeline stages
comptime SIZE = 1024  # Image size (1D for simplicity)
comptime BLOCKS_PER_GRID = (4, 1)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)

# Multi-stage processing configuration
comptime STAGE1_THREADS = TPB // 2
comptime STAGE2_THREADS = TPB // 2
comptime BLUR_RADIUS = 2


def multi_stage_image_blur_pipeline(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Multi-stage image blur pipeline with barrier coordination.

    Stage 1 (threads 0-127): Load input data and apply 1.1x preprocessing
    Stage 2 (threads 128-255): Apply 5-point blur with BLUR_RADIUS=2
    Stage 3 (all threads): Final neighbor smoothing and output
    """

    # Shared memory buffers for pipeline stages
    var input_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var blur_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Stage 1: Load and preprocess (threads 0-127)

    # FILL ME IN (roughly 10 lines)

    barrier()  # Wait for Stage 1 completion

    # Stage 2: Apply blur (threads 128-255)

    # FILL ME IN (roughly 25 lines)

    barrier()  # Wait for Stage 2 completion

    # Stage 3: Final smoothing (all threads)

    # FILL ME IN (roughly 7 lines)

    barrier()  # Ensure all writes complete


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p29/p29.mojo

Tips

Identifying thread roles

  • Use thread index comparisons to decide which stage each thread should execute
  • Stage 1: first half of the threads (threads 0-127)
  • Stage 2: second half of the threads (threads 128-255)
  • Stage 3: all threads participate

Stage 1 approach

  • Identify Stage 1 threads with the appropriate index comparison
  • Have each thread process multiple elements for load balancing
  • Apply the preprocessing enhancement factor
  • Implement proper boundary handling with zero padding

Stage 2 approach

  • Identify Stage 2 threads and map their indices onto the processing range
  • Implement the blur kernel by averaging neighboring elements
  • Handle boundary conditions by including only valid neighbors
  • Process multiple elements per thread for efficiency

Stage 3 approach

  • All threads participate in the final processing
  • Apply neighbor smoothing with the specified scaling factor
  • Handle the edge cases where a neighbor does not exist
  • Write results to global output with bounds checking

Synchronization strategy

  • Place barriers between stages to prevent race conditions
  • Ensure each stage completes before the dependent stage begins
  • Use a final barrier to guarantee completion before the block exits

Running the Code

To test your solution, run the following commands in your terminal:

pixi run p29 --multi-stage
pixi run -e amd p29 --multi-stage
uv run poe p29 --multi-stage

ํผ์ฆ์„ ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒํ•˜๋ฉด ๋‹ค์Œ๊ณผ ์œ ์‚ฌํ•œ ์ถœ๋ ฅ์ด ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค:

Puzzle 29: GPU Synchronization Primitives
==================================================
TPB: 256
SIZE: 1024
STAGE1_THREADS: 128
STAGE2_THREADS: 128
BLUR_RADIUS: 2

Testing Puzzle 29A: Multi-Stage Pipeline Coordination
============================================================
Multi-stage pipeline blur completed
Input sample: 0.0 1.01 2.02
Output sample: 1.6665002 2.3331003 3.3996604
โœ… Multi-stage pipeline coordination test PASSED!

Solution

def multi_stage_image_blur_pipeline(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Multi-stage image blur pipeline with barrier coordination.

    Stage 1 (threads 0-127): Load input data and apply 1.1x preprocessing
    Stage 2 (threads 128-255): Apply 5-point blur with BLUR_RADIUS=2
    Stage 3 (all threads): Final neighbor smoothing and output
    """

    # Shared memory buffers for pipeline stages
    var input_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var blur_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Stage 1: Load and preprocess (threads 0-127)
    if local_i < STAGE1_THREADS:
        if global_i < size:
            input_shared[local_i] = input[global_i] * 1.1
            # Each thread loads 2 elements
            if local_i + STAGE1_THREADS < size:
                input_shared[local_i + STAGE1_THREADS] = (
                    input[global_i + STAGE1_THREADS] * 1.1
                )
        else:
            # Zero-padding for out-of-bounds
            input_shared[local_i] = 0.0
            if local_i + STAGE1_THREADS < TPB:
                input_shared[local_i + STAGE1_THREADS] = 0.0

    barrier()  # Wait for Stage 1 completion

    # Stage 2: Apply blur (threads 128-255)
    if local_i >= STAGE1_THREADS:
        var blur_idx = local_i - STAGE1_THREADS
        var blur_sum: Scalar[dtype] = 0.0
        blur_count = 0

        # 5-point blur kernel
        for offset in range(-BLUR_RADIUS, BLUR_RADIUS + 1):
            sample_idx = blur_idx + offset
            if sample_idx >= 0 and sample_idx < TPB:
                blur_sum += rebind[Scalar[dtype]](input_shared[sample_idx])
                blur_count += 1

        if blur_count > 0:
            blur_shared[blur_idx] = blur_sum / Scalar[dtype](blur_count)
        else:
            blur_shared[blur_idx] = 0.0

        # Process second element
        var second_idx = blur_idx + STAGE1_THREADS
        if second_idx < TPB:
            blur_sum = 0.0
            blur_count = 0
            for offset in range(-BLUR_RADIUS, BLUR_RADIUS + 1):
                sample_idx = second_idx + offset
                if sample_idx >= 0 and sample_idx < TPB:
                    blur_sum += rebind[Scalar[dtype]](input_shared[sample_idx])
                    blur_count += 1

            if blur_count > 0:
                blur_shared[second_idx] = blur_sum / Scalar[dtype](blur_count)
            else:
                blur_shared[second_idx] = 0.0

    barrier()  # Wait for Stage 2 completion

    # Stage 3: Final smoothing (all threads)
    if global_i < size:
        final_value = blur_shared[local_i]

        # Neighbor smoothing with 0.6 scaling
        if local_i > 0:
            final_value = (final_value + blur_shared[local_i - 1]) * 0.6
        if local_i < TPB - 1:
            final_value = (final_value + blur_shared[local_i + 1]) * 0.6

        output[global_i] = final_value

    barrier()  # Ensure all writes complete


ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ด๊ฒƒ์ด ์Šค๋ ˆ๋“œ ์—ญํ•  ํŠนํ™”๋ฅผ ๊ฐ€์ง„ ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜ ๋ฌธ์ œ์ž„์„ ์ธ์‹ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. ๋‹จ๊ณ„๋ณ„ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน ์„ค๊ณ„: ๋ฐ์ดํ„ฐ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ธฐ๋Šฅ๋ณ„๋กœ ์Šค๋ ˆ๋“œ๋ฅผ ๋ถ„ํ• 
  2. ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ์ฒด์ธ ๊ตฌํ˜„: Stage 1์ด Stage 2๋ฅผ ์œ„ํ•ด ์ƒ์‚ฐํ•˜๊ณ , Stage 2๊ฐ€ Stage 3์„ ์œ„ํ•ด ์ƒ์‚ฐ
  3. ์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜: ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด๊ฐ€ ์•„๋‹ˆ๋ผ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐ„์˜ ๋™๊ธฐํ™”
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”: ๋ณ‘ํ•ฉ๋œ ์ฝ๊ธฐ์™€ ํšจ์œจ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ ๋ณด์žฅ

์ƒ์„ธ ์„ค๋ช…์ด ํฌํ•จ๋œ ์ „์ฒด ์†”๋ฃจ์…˜

๋‹ค๋‹จ๊ณ„ ํŒŒ์ดํ”„๋ผ์ธ ์†”๋ฃจ์…˜์€ ์ •๊ตํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ์ „ํ†ต์ ์ธ ๋‹จ์ผ์ฒด์  GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํŠนํ™”๋˜๊ณ  ์กฐ์œจ๋œ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค.

ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„

์ด ํผ์ฆ์˜ ๊ทผ๋ณธ์ ์ธ ๋ŒํŒŒ๊ตฌ๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹Œ ์—ญํ• ์— ์˜ํ•œ ์Šค๋ ˆ๋“œ ํŠนํ™”์ž…๋‹ˆ๋‹ค:

์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ (๋ฆฌ๋•์…˜์ด๋‚˜ ํ–‰๋ ฌ ์—ฐ์‚ฐ ๋“ฑ)
  • ๋ฐฐ๋ฆฌ์–ด๋Š” ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‹จ๊ณ„ ๋‚ด์—์„œ ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”
  • ์Šค๋ ˆ๋“œ ์—ญํ• ์€ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ์ดํ„ฐ ์ธ๋ฑ์Šค๋งŒ ๋‹ค๋ฆ„

์ด ํผ์ฆ์˜ ํ˜์‹ : ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์™„์ „ํžˆ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰

  • ์Šค๋ ˆ๋“œ 0-127์ด ๋กœ๋”ฉ ๋ฐ ์ „์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰
  • ์Šค๋ ˆ๋“œ 128-255๊ฐ€ ๋ธ”๋Ÿฌ ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰
  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ์Šค๋ฌด๋”ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ˜‘๋ ฅ
  • ๋ฐฐ๋ฆฌ์–ด๋Š” ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด๊ฐ€ ์•„๋‹ˆ๋ผ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๊ฐ„์˜ ์กฐ์œจ

์ƒ์‚ฐ์ž-์†Œ๋น„์ž ์กฐ์ •

์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋‚ด์—์„œ ๋™๋“ฑํ•œ ์—ญํ• ์„ ํ•˜๋˜ ์ด์ „ ํผ์ฆ๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ ๋ช…์‹œ์ ์ธ ์ƒ์‚ฐ์ž-์†Œ๋น„์ž ๊ด€๊ณ„๋ฅผ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค:

  • Stage 1: ์ƒ์‚ฐ์ž (Stage 2๋ฅผ ์œ„ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ)
  • Stage 2: ์†Œ๋น„์ž (Stage 1์˜ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ) + ์ƒ์‚ฐ์ž (Stage 3์„ ์œ„ํ•œ ๋ธ”๋Ÿฌ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ)
  • Stage 3: ์†Œ๋น„์ž (Stage 2์˜ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ)

์ „๋žต์  ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜

๋ฐฐ๋ฆฌ์–ด๊ฐ€ ์–ธ์ œ ํ•„์š”ํ•˜๊ณ  ์–ธ์ œ ๋‚ญ๋น„์ ์ธ์ง€ ์ดํ•ดํ•˜๊ธฐ:

  • ํ•„์š”ํ•œ ๊ฒฝ์šฐ: ์˜์กด์ ์ธ ๋‹จ๊ณ„ ์‚ฌ์ด์—์„œ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด
  • ๋‚ญ๋น„์ ์ธ ๊ฒฝ์šฐ: ๊ฐ™์€ ๋‹จ๊ณ„์˜ ๋…๋ฆฝ์ ์ธ ์—ฐ์‚ฐ ๋‚ด์—์„œ
  • ์„ฑ๋Šฅ ํ†ต์ฐฐ: ๊ฐ ๋ฐฐ๋ฆฌ์–ด์—๋Š” ๋น„์šฉ์ด ์žˆ์œผ๋ฏ€๋กœ ์ „๋žต์ ์œผ๋กœ ์‚ฌ์šฉ

ํ•ต์‹ฌ ๋™๊ธฐํ™” ์ง€์ :

  1. Stage 1 ์ดํ›„: Stage 2๊ฐ€ ๋ถˆ์™„์ „ํ•œ ์ „์ฒ˜๋ฆฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
  2. Stage 2 ์ดํ›„: Stage 3์ด ๋ถˆ์™„์ „ํ•œ ๋ธ”๋Ÿฌ ๊ฒฐ๊ณผ๋ฅผ ์ฝ๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€
  3. Stage 3 ์ดํ›„: ๋ธ”๋ก ์ข…๋ฃŒ ์ „ ๋ชจ๋“  ์ถœ๋ ฅ ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ

์Šค๋ ˆ๋“œ ํ™œ์šฉ ํŒจํ„ด

  • Stage 1: 50% ํ™œ์šฉ (256๊ฐœ ์ค‘ 128๊ฐœ ์Šค๋ ˆ๋“œ ํ™œ์„ฑ, 128๊ฐœ ์œ ํœด)
  • Stage 2: 50% ํ™œ์šฉ (128๊ฐœ ํ™œ์„ฑ, 128๊ฐœ ์œ ํœด)
  • Stage 3: 100% ํ™œ์šฉ (์ „์ฒด 256๊ฐœ ์Šค๋ ˆ๋“œ ํ™œ์„ฑ)

์ด๊ฒƒ์€ ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์กฐ์œจ๋œ ํŒŒ์ดํ”„๋ผ์ธ ๋‚ด์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ ์ž‘์—…์— ํŠนํ™”๋˜๋Š” ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹จ์ˆœํ•œ ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๋ฅผ ๋„˜์–ด ์‹ค์ œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์š”ํ•œ ์•„ํ‚คํ…์ฒ˜์  ์‚ฌ๊ณ ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

Memory Hierarchy Optimization

Shared memory architecture:

  • Two specialized buffers handle the data flow between stages
  • Global memory accesses are minimized to the boundary operations only
  • Fast shared memory is used for all intermediate processing

Access pattern benefits:

  • Stage 1: coalesced global memory reads for input loading
  • Stage 2: fast shared memory reads for blur processing
  • Stage 3: coalesced global memory writes for output

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ

์ด ํŒŒ์ดํ”„๋ผ์ธ ์•„ํ‚คํ…์ฒ˜ ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ:

  • ๋‹ค๋‹จ๊ณ„ ํ•„ํ„ฐ (๋ธ”๋Ÿฌ, ์„ ๋ช…ํ™”, ์—ฃ์ง€ ๊ฒ€์ถœ์„ ์ˆœ์ฐจ์ ์œผ๋กœ)
  • ์ƒ‰ ๊ณต๊ฐ„ ๋ณ€ํ™˜ (RGB โ†’ HSV โ†’ ์ฒ˜๋ฆฌ โ†’ RGB)
  • ๋‹ค์ค‘ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจ์Šค๋ฅผ ์‚ฌ์šฉํ•œ ๋…ธ์ด์ฆˆ ๊ฐ์†Œ

๊ณผํ•™ ์—ฐ์‚ฐ:

  • ๋‹ค๋‹จ๊ณ„ ์œ ํ•œ ์ฐจ๋ถ„ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•œ ์Šคํ…์‹ค ์—ฐ์‚ฐ
  • ํ•„ํ„ฐ๋ง, ๋ณ€ํ™˜, ๋ถ„์„ ํŒŒ์ดํ”„๋ผ์ธ์„ ์‚ฌ์šฉํ•œ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ
  • ๋‹ค๋‹จ๊ณ„ ์†”๋ฒ„ ๋ฐ˜๋ณต์„ ์‚ฌ์šฉํ•œ ์ „์‚ฐ ์œ ์ฒด ์—ญํ•™

๋จธ์‹ ๋Ÿฌ๋‹:

  • ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์„ ์œ„ํ•ด ํŠนํ™”๋œ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์„ ๊ฐ€์ง„ ์‹ ๊ฒฝ๋ง ๋ ˆ์ด์–ด
  • ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ (์กฐ์œจ๋œ ๋‹จ๊ณ„์—์„œ ๋กœ๋“œ, ์ •๊ทœํ™”, ์ฆ๊ฐ•)
  • ์„œ๋กœ ๋‹ค๋ฅธ ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน์ด ์„œ๋กœ ๋‹ค๋ฅธ ์—ฐ์‚ฐ์„ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ

ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ vs. ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ:

  • ๋ฐ์ดํ„ฐ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ์š”์†Œ์— ๋™์ผํ•œ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์Šค๋ ˆ๋“œ๊ฐ€ ํŠนํ™”๋œ ์—ญํ• ์— ๋”ฐ๋ผ ๊ทผ๋ณธ์ ์œผ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‹คํ–‰

๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ ์ฒ ํ•™:

  • ์ „๋žต์  ๋ฐฐ์น˜: ์˜์กด์ ์ธ ๋‹จ๊ณ„ ๊ฐ„์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์š”ํ•œ ๊ณณ์—๋งŒ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜
  • ์„ฑ๋Šฅ ๊ณ ๋ ค์‚ฌํ•ญ: ๊ฐ ๋ฐฐ๋ฆฌ์–ด์—๋Š” ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐœ์ƒํ•˜๋ฏ€๋กœ ์ •ํ™•ํ•˜์ง€๋งŒ ์ ˆ์ œ๋œ ์‚ฌ์šฉ
  • ์ •ํ™•์„ฑ ๋ณด์žฅ: ์ ์ ˆํ•œ ๋ฐฐ๋ฆฌ์–ด ๋ฐฐ์น˜๋กœ ์Šค๋ ˆ๋“œ ์‹คํ–‰ ํƒ€์ด๋ฐ์— ๊ด€๊ณ„์—†์ด ๊ฒฐ์ •์  ๊ฒฐ๊ณผ๋ฅผ ๋ณด์žฅ

์Šค๋ ˆ๋“œ ํŠนํ™”์˜ ์ด์ :

  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ตœ์ ํ™”: ๊ฐ ๋‹จ๊ณ„๋ฅผ ํ•ด๋‹น ์—ฐ์‚ฐ ํŒจํ„ด์— ๋งž๊ฒŒ ์ตœ์ ํ™” ๊ฐ€๋Šฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”: ์„œ๋กœ ๋‹ค๋ฅธ ๋‹จ๊ณ„์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ „๋žต ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • ๋ฆฌ์†Œ์Šค ํ™œ์šฉ: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํŠนํ™”๋˜๊ณ  ํšจ์œจ์ ์ธ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ๋ถ„ํ•ด ๊ฐ€๋Šฅ

์ด ์†”๋ฃจ์…˜์€ ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์—ฐ์‚ฐ์„ ์œ„ํ•ด ์Šค๋ ˆ๋“œ ํŠนํ™”์™€ ์ „๋žต์  ๋™๊ธฐํ™”๋ฅผ ํ™œ์šฉํ•˜๋Š” ์ •๊ตํ•œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ๋ฃจํ”„๋ฅผ ๋„˜์–ด ์‹ค์ œ GPU ์†Œํ”„ํŠธ์›จ์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์•„ํ‚คํ…์ฒ˜์  ์ ‘๊ทผ ๋ฐฉ์‹์œผ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์—ฐ์‚ฐ

๐Ÿ”ฌ ์„ธ๋ฐ€ํ•œ ๋™๊ธฐํ™”: mbarrier vs barrier()

์ด ํผ์ฆ์€ ์ด์ „ ํผ์ฆ์—์„œ ์‚ฌ์šฉํ•œ ๊ธฐ๋ณธ barrier() ํ•จ์ˆ˜๋ณด๋‹ค ํ›จ์”ฌ ๊ฐ•๋ ฅํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•˜๋Š” ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด API๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๊ธฐ๋ณธ barrier()์˜ ํ•œ๊ณ„:

  • ์ผํšŒ์„ฑ ์‚ฌ์šฉ: ์ƒํƒœ ์ถ”์  ์—†์ด ๋‹จ์ผ ๋™๊ธฐํ™” ์ง€์ ๋งŒ ์ œ๊ณต
  • ๋ธ”๋ก ์ „์ฒด ์ „์šฉ: ๋ธ”๋ก์˜ ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์‹œ์— ์ฐธ์—ฌํ•ด์•ผ ํ•จ
  • ์žฌ์‚ฌ์šฉ ๋ถˆ๊ฐ€: ๋งค barrier() ํ˜ธ์ถœ์ด ์ƒˆ๋กœ์šด ๋™๊ธฐํ™” ์ด๋ฒคํŠธ๋ฅผ ์ƒ์„ฑ
  • ์„ธ๋ฐ€๋„ ๋ถ€์กฑ: ๋ฉ”๋ชจ๋ฆฌ ์ˆœ์„œ์™€ ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ œํ•œ์  ์ œ์–ด
  • ์ •์  ์กฐ์ •: ์Šค๋ ˆ๋“œ ์ฐธ์—ฌ ํŒจํ„ด์˜ ๋ณ€ํ™”์— ์ ์‘ ๋ถˆ๊ฐ€

๊ณ ๊ธ‰ mbarrier API์˜ ๊ธฐ๋Šฅ:

  • ์ •๋ฐ€ํ•œ ์ œ์–ด: mbarrier_init()๋กœ ํŠน์ • ์Šค๋ ˆ๋“œ ์ˆ˜๋ฅผ ์ง€์ •ํ•˜์—ฌ ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐฐ๋ฆฌ์–ด ๊ฐ์ฒด๋ฅผ ์„ค์ •
  • ์ƒํƒœ ์ถ”์ : mbarrier_arrive()๋กœ ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฌ๊ณ  ๋„์ฐฉ ํšŸ์ˆ˜๋ฅผ ์œ ์ง€
  • ์œ ์—ฐํ•œ ๋Œ€๊ธฐ: mbarrier_test_wait()๋กœ ํŠน์ • ์™„๋ฃŒ ์ƒํƒœ๋ฅผ ๊ธฐ๋‹ค๋ฆด ์ˆ˜ ์žˆ์Œ
  • ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๊ฐ์ฒด: ๋™์ผํ•œ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ์—ฌ๋Ÿฌ ๋ฐ˜๋ณต์— ๊ฑธ์ณ ์žฌ์ดˆ๊ธฐํ™”ํ•˜๊ณ  ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅ
  • ๋‹ค์ค‘ ๋ฐฐ๋ฆฌ์–ด: ์„œ๋กœ ๋‹ค๋ฅธ ๋™๊ธฐํ™” ์ง€์ (์ดˆ๊ธฐํ™”, ๋ฐ˜๋ณต, ๋งˆ๋ฌด๋ฆฌ)์— ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐฐ๋ฆฌ์–ด ๊ฐ์ฒด ์‚ฌ์šฉ
  • ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”: GPU ํ•˜๋“œ์›จ์–ด ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ์— ์ง์ ‘ ๋งคํ•‘ํ•˜์—ฌ ๋” ๋‚˜์€ ์„ฑ๋Šฅ
  • ๋ฉ”๋ชจ๋ฆฌ ์˜๋ฏธ๋ก : ๋ฉ”๋ชจ๋ฆฌ ๊ฐ€์‹œ์„ฑ๊ณผ ์ˆœ์„œ ๋ณด์žฅ์— ๋Œ€ํ•œ ๋ช…์‹œ์  ์ œ์–ด

๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์—์„œ๋Š” ๋ฒ„ํผ ๊ต์ฒด ๋‹จ๊ณ„ ๊ฐ„์˜ ์ •๋ฐ€ํ•œ ์กฐ์ •์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ธฐ๋ณธ barrier()๋กœ๋Š” ๋‹ค์Œ์— ํ•„์š”ํ•œ ์„ธ๋ฐ€ํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณตํ•  ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค:

  • ๋ฒ„ํผ ์—ญํ•  ๊ต๋Œ€: buffer_A์— ๋Œ€ํ•œ ๋ชจ๋“  ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ buffer_A์—์„œ ์ฝ๊ธฐ ์‹œ์ž‘๋˜๋„๋ก ๋ณด์žฅ
  • ๋ฐ˜๋ณต ๊ฒฝ๊ณ„: ๋‹จ์ผ ์ปค๋„ ๋‚ด์—์„œ ์—ฌ๋Ÿฌ ๋™๊ธฐํ™” ์ง€์  ์กฐ์œจ
  • ์ƒํƒœ ๊ด€๋ฆฌ: ์–ด๋–ค ์Šค๋ ˆ๋“œ๊ฐ€ ์–ด๋–ค ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ–ˆ๋Š”์ง€ ์ถ”์ 
  • ์„ฑ๋Šฅ ์ตœ์ ํ™”: ์žฌ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ฐฐ๋ฆฌ์–ด ๊ฐ์ฒด๋ฅผ ํ†ตํ•ด ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ ์ตœ์†Œํ™”

์ด ํผ์ฆ์€ ๋ฐ˜๋ณต๋ฒ•, ์‹œ๋ฎฌ๋ ˆ์ด์…˜ ํ”„๋ ˆ์ž„์›Œํฌ, ๊ณ ์„ฑ๋Šฅ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ ๋“ฑ ์‹ค์ œ GPU ์ปดํ“จํŒ… ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

Overview

Implement a kernel that performs iterative stencil computation using double-buffered shared memory, coordinated with explicit memory barriers to guarantee safe buffer swapping between iterations. A stencil computation is a pattern in which each element of an array is recomputed from a fixed pattern of its neighboring elements.

Note: the buffer roles alternate: buffer_A and buffer_B swap between read and write roles every iteration, and mbarrier synchronization guarantees that all threads finish writing before each buffer swap.

Algorithm architecture: this puzzle implements a double-buffering pattern in which two shared memory buffers alternate as the read source and the write target across multiple iterations. Unlike a simple stencil computation that processes the data once, this approach performs iterative refinement, with careful memory barrier coordination to prevent race conditions during buffer transitions.

Pipeline concept: the algorithm processes data through iterative stencil refinement. Each iteration reads from one buffer and writes to the other, and the buffers swap roles every iteration, creating a ping-pong pattern that enables continuous processing without data corruption.

Data dependencies and synchronization: each iteration depends on the completed results of the previous iteration:

  • Iteration N → iteration N+1: the current iteration produces refined data that the next iteration consumes
  • Buffer coordination: the read and write buffers exchange roles every iteration
  • Memory barriers prevent race conditions: all writes must complete before reads begin from a freshly written buffer

Concretely, the double-buffered stencil implements an iterative smoothing algorithm described by the following mathematical operations:

Iteration pattern - buffer alternation:

\[\text{Iteration } i: \begin{cases} \text{Read from buffer\_A, write to buffer\_B} & \text{if } i \bmod 2 = 0 \\ \text{Read from buffer\_B, write to buffer\_A} & \text{if } i \bmod 2 = 1 \end{cases}\]

Stencil operation - 3-point average:

\[S^{(i+1)}[j] = \frac{1}{N_j} \sum_{k=-1}^{1} S^{(i)}[j+k] \quad \text{where } j+k \in [0, 255]\]

where \(S^{(i)}[j]\) is the stencil value at position \(j\) after iteration \(i\), and \(N_j\) is the number of valid neighbors.

Memory barrier coordination:

\[\text{mbarrier\_arrive}() \Rightarrow \text{mbarrier\_test\_wait}() \Rightarrow \text{buffer swap} \Rightarrow \text{next iteration}\]

Final output selection:

\[\text{Output}[j] = \begin{cases} \text{buffer\_A}[j] & \text{if STENCIL\_ITERATIONS} \bmod 2 = 0 \\ \text{buffer\_B}[j] & \text{if STENCIL\_ITERATIONS} \bmod 2 = 1 \end{cases}\]
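The buffer alternation and final selection above can be mirrored by a small CPU reference. This Python sketch (illustrative only, not the GPU kernel) runs the 3-point stencil for STENCIL_ITERATIONS passes over two alternating buffers and then picks the active buffer by parity:

```python
# CPU reference for the double-buffered 3-point stencil, with the
# adaptive boundary averaging described above (only valid neighbors count).
TPB = 256
STENCIL_ITERATIONS = 3

buffer_A = [1.0] * TPB   # initialized from input (constant test data)
buffer_B = [0.0] * TPB

for iteration in range(STENCIL_ITERATIONS):
    # Even iterations: read A, write B; odd iterations: read B, write A.
    src, dst = (buffer_A, buffer_B) if iteration % 2 == 0 else (buffer_B, buffer_A)
    for j in range(TPB):
        neighbors = [src[j + k] for k in (-1, 0, 1) if 0 <= j + k < TPB]
        dst[j] = sum(neighbors) / len(neighbors)
    # (On the GPU, mbarrier_arrive / mbarrier_test_wait sit here, before
    # any thread is allowed to read the freshly written buffer.)

# 3 iterations (odd count) leave the final data in buffer_B.
final = buffer_A if STENCIL_ITERATIONS % 2 == 0 else buffer_B
```

A constant input is a useful smoke test: averaging equal neighbors leaves every value unchanged, which matches the expected output sample (1.0 1.0 1.0) shown later.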

Key Concepts

In this puzzle you will learn:

  • Implementing the double-buffering pattern for iterative algorithms
  • Coordinating explicit memory barriers with the mbarrier API
  • Managing read/write buffer roles that alternate across iterations

The key insight is understanding how to safely coordinate buffer swaps in an iterative algorithm, where race conditions between read and write operations can corrupt data unless they are properly synchronized.

Why it matters: most GPU tutorials show simple single-pass algorithms, but real applications often require iterative refinement with multiple passes over the data. Double buffering is essential for algorithms in which each iteration depends on the completed results of the previous one, such as iterative methods, image processing filters, and simulation updates.

Synchronization in earlier puzzles vs. this one:

  • Earlier puzzles (P8, P12, P15): simple barrier() calls for single-pass algorithms
  • This puzzle: the explicit mbarrier API for precise control over buffer swap timing

Memory barrier specialization: unlike basic thread synchronization, this puzzle uses explicit memory barriers that provide fine-grained control over when memory operations complete, which is essential for complex memory access patterns.

Configuration

System parameters:

  • Image size: SIZE = 1024 elements (1D for simplicity)
  • Threads per block: TPB = 256 threads, organized as (256, 1) block dimensions
  • Grid configuration: (4, 1) blocks to process the full image in tiles (4 blocks total)
  • Data type: DType.float32 for all computation

Iteration parameters:

  • Stencil iterations: STENCIL_ITERATIONS = 3 refinement passes
  • Buffer count: BUFFER_COUNT = 2 (double buffering)
  • Stencil kernel: 3-point average with radius 1

Buffer architecture:

  • buffer_A: primary shared memory buffer ([256] elements)
  • buffer_B: secondary shared memory buffer ([256] elements)
  • Role alternation: the buffers swap between read source and write target every iteration

Processing requirements:

Initialization phase:

  • Buffer setup: initialize buffer_A with the input data and buffer_B with zeros
  • Barrier initialization: set up mbarrier objects for the synchronization points
  • Thread coordination: all threads participate in initialization

Iterative processing:

  • Even iterations (0, 2, 4…): read from buffer_A, write to buffer_B
  • Odd iterations (1, 3, 5…): read from buffer_B, write to buffer_A
  • Stencil operation: 3-point average \((\text{left} + \text{center} + \text{right}) / 3\)
  • Boundary handling: use adaptive averaging for elements at the buffer edges

Memory barrier coordination:

  • mbarrier_arrive(): each thread signals completion of its write phase
  • mbarrier_test_wait(): wait until all threads have completed their writes
  • Buffer swap safety: prevents reading from a buffer while other threads are still writing to it
  • Barrier reinitialization: reset the barrier state between iterations

Output phase:

  • Final buffer selection: choose the active buffer based on the parity of the iteration count
  • Global memory write: copy the final results to the output array
  • Completion barrier: ensure all writes finish before the block exits

Code to Complete


# Double-buffered stencil configuration
comptime STENCIL_ITERATIONS = 3
comptime BUFFER_COUNT = 2


def double_buffered_stencil_computation(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Double-buffered stencil computation with memory barrier coordination.

    Iteratively applies 3-point stencil using alternating buffers.
    Uses mbarrier APIs for precise buffer swap coordination.
    """

    # Double-buffering: Two shared memory buffers
    var buffer_A = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var buffer_B = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    # Memory barriers for coordinating buffer swaps
    var init_barrier = stack_allocation[
        dtype=DType.uint64, address_space=AddressSpace.SHARED
    ](row_major[1]())
    var iter_barrier = stack_allocation[
        dtype=DType.uint64, address_space=AddressSpace.SHARED
    ](row_major[1]())
    var final_barrier = stack_allocation[
        dtype=DType.uint64, address_space=AddressSpace.SHARED
    ](row_major[1]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Initialize barriers (only thread 0)
    if local_i == 0:
        mbarrier_init(init_barrier.ptr, TPB)
        mbarrier_init(iter_barrier.ptr, TPB)
        mbarrier_init(final_barrier.ptr, TPB)

    # Initialize buffer_A with input data

    # FILL ME IN (roughly 4 lines)

    # Wait for buffer_A initialization
    _ = mbarrier_arrive(init_barrier.ptr)
    _ = mbarrier_test_wait(init_barrier.ptr, TPB)

    # Iterative stencil processing with double-buffering
    comptime for iteration in range(STENCIL_ITERATIONS):
        comptime if iteration % 2 == 0:
            # Even iteration: Read from A, Write to B

            # FILL ME IN (roughly 12 lines)
            ...

        else:
            # Odd iteration: Read from B, Write to A

            # FILL ME IN (roughly 12 lines)
            ...

        # Memory barrier: wait for all writes before buffer swap
        _ = mbarrier_arrive(iter_barrier.ptr)
        _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

        # Reinitialize barrier for next iteration
        if local_i == 0:
            mbarrier_init(iter_barrier.ptr, TPB)

    # Write final results from active buffer
    if local_i < TPB and global_i < size:
        comptime if STENCIL_ITERATIONS % 2 == 0:
            # Even iterations end in buffer_A
            output[global_i] = buffer_A[local_i]
        else:
            # Odd iterations end in buffer_B
            output[global_i] = buffer_B[local_i]

    # Final barrier
    _ = mbarrier_arrive(final_barrier.ptr)
    _ = mbarrier_test_wait(final_barrier.ptr, TPB)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p29/p29.mojo

Tips

Buffer initialization

  • Initialize buffer_A with the input data; buffer_B can start empty
  • Use proper bounds checking with zero padding for out-of-range elements
  • Only thread 0 should initialize the mbarrier objects
  • Set up separate barriers for the different synchronization points

Iteration control

  • Use comptime for iteration in range(STENCIL_ITERATIONS) for compile-time loop unrolling
  • Use iteration % 2 to decide buffer roles, alternating the read/write assignments
  • Apply the stencil only within the valid range, using neighbor bounds checks

Stencil computation

  • Implement the 3-point average: (left + center + right) / 3
  • Handle boundary conditions by including only valid neighbors in the average
  • Use adaptive counting to handle the edge cases smoothly

Memory barrier coordination

  • Call mbarrier_arrive() after each thread completes its write operations
  • Use mbarrier_test_wait() so all threads finish before the buffer swap
  • Reinitialize barriers between iterations for reuse: mbarrier_init()
  • Only thread 0 reinitializes the barrier, to avoid race conditions

Output selection

  • Select the final active buffer based on STENCIL_ITERATIONS % 2
  • An even iteration count leaves the data in buffer_A
  • An odd iteration count leaves the data in buffer_B
  • Write the final results to global output with bounds checking

์ฝ”๋“œ ์‹คํ–‰

์†”๋ฃจ์…˜์„ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์„ ์‹คํ–‰ํ•ฉ๋‹ˆ๋‹ค:

pixi run p29 --double-buffer
pixi run -e amd p29 --double-buffer
uv run poe p29 --double-buffer

ํผ์ฆ์„ ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒํ•˜๋ฉด ๋‹ค์Œ๊ณผ ์œ ์‚ฌํ•œ ์ถœ๋ ฅ์ด ํ‘œ์‹œ๋ฉ๋‹ˆ๋‹ค:

Puzzle 29: GPU Synchronization Primitives
==================================================
TPB: 256
SIZE: 1024
STENCIL_ITERATIONS: 3
BUFFER_COUNT: 2

Testing Puzzle 29B: Double-Buffered Stencil Computation
============================================================
Double-buffered stencil completed
Input sample: 1.0 1.0 1.0
GPU output sample: 1.0 1.0 1.0
โœ… Double-buffered stencil test PASSED!

์†”๋ฃจ์…˜

def double_buffered_stencil_computation(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, MutAnyOrigin],
    size: Int,
):
    """Double-buffered stencil computation with memory barrier coordination.

    Iteratively applies 3-point stencil using alternating buffers.
    Uses mbarrier APIs for precise buffer swap coordination.
    """

    # Double-buffering: Two shared memory buffers
    var buffer_A = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())
    var buffer_B = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    # Memory barriers for coordinating buffer swaps
    var init_barrier = stack_allocation[
        dtype=DType.uint64, address_space=AddressSpace.SHARED
    ](row_major[1]())
    var iter_barrier = stack_allocation[
        dtype=DType.uint64, address_space=AddressSpace.SHARED
    ](row_major[1]())
    var final_barrier = stack_allocation[
        dtype=DType.uint64, address_space=AddressSpace.SHARED
    ](row_major[1]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Initialize barriers (only thread 0)
    if local_i == 0:
        mbarrier_init(init_barrier.ptr, TPB)
        mbarrier_init(iter_barrier.ptr, TPB)
        mbarrier_init(final_barrier.ptr, TPB)

    # Initialize buffer_A with input data
    if local_i < TPB and global_i < size:
        buffer_A[local_i] = input[global_i]
    else:
        buffer_A[local_i] = 0.0

    # Wait for buffer_A initialization
    _ = mbarrier_arrive(init_barrier.ptr)
    _ = mbarrier_test_wait(init_barrier.ptr, TPB)

    # Iterative stencil processing with double-buffering
    @parameter
    for iteration in range(STENCIL_ITERATIONS):
        @parameter
        if iteration % 2 == 0:
            # Even iteration: Read from A, Write to B
            if local_i < TPB:
                var stencil_sum: Scalar[dtype] = 0.0
                var stencil_count: Int = 0

                # 3-point stencil: [i-1, i, i+1]
                for offset in range(-1, 2):
                    sample_idx = local_i + offset
                    if sample_idx >= 0 and sample_idx < TPB:
                        stencil_sum += rebind[Scalar[dtype]](
                            buffer_A[sample_idx]
                        )
                        stencil_count += 1

                if stencil_count > 0:
                    buffer_B[local_i] = stencil_sum / Scalar[dtype](
                        stencil_count
                    )
                else:
                    buffer_B[local_i] = buffer_A[local_i]

        else:
            # Odd iteration: Read from B, Write to A
            if local_i < TPB:
                var stencil_sum: Scalar[dtype] = 0.0
                var stencil_count: Int = 0

                # 3-point stencil: [i-1, i, i+1]
                for offset in range(-1, 2):
                    sample_idx = local_i + offset
                    if sample_idx >= 0 and sample_idx < TPB:
                        stencil_sum += rebind[Scalar[dtype]](
                            buffer_B[sample_idx]
                        )
                        stencil_count += 1

                if stencil_count > 0:
                    buffer_A[local_i] = stencil_sum / Scalar[dtype](
                        stencil_count
                    )
                else:
                    buffer_A[local_i] = buffer_B[local_i]

        # Memory barrier: wait for all writes before buffer swap
        _ = mbarrier_arrive(iter_barrier.ptr)
        _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

        # Reinitialize barrier for next iteration
        if local_i == 0:
            mbarrier_init(iter_barrier.ptr, TPB)

    # Write final results from active buffer
    if local_i < TPB and global_i < size:
        @parameter
        if STENCIL_ITERATIONS % 2 == 0:
            # Even iterations end in buffer_A
            output[global_i] = buffer_A[local_i]
        else:
            # Odd iterations end in buffer_B
            output[global_i] = buffer_B[local_i]

    # Final barrier
    _ = mbarrier_arrive(final_barrier.ptr)
    _ = mbarrier_test_wait(final_barrier.ptr, TPB)


ํ•ต์‹ฌ ํ†ต์ฐฐ์€ ์ด๊ฒƒ์ด ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์„ ์‚ฌ์šฉํ•˜๋Š” ๋”๋ธ” ๋ฒ„ํผ๋ง ์•„ํ‚คํ…์ฒ˜ ๋ฌธ์ œ์ž„์„ ์ธ์‹ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค:

  1. ๊ต๋Œ€ํ•˜๋Š” ๋ฒ„ํผ ์—ญํ•  ์„ค๊ณ„: ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ์ฝ๊ธฐ/์“ฐ๊ธฐ ์ฑ…์ž„์„ ๊ตํ™˜
  2. ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ๊ตฌํ˜„: ์ •๋ฐ€ํ•œ ๋™๊ธฐํ™” ์ œ์–ด๋ฅผ ์œ„ํ•ด mbarrier API ์‚ฌ์šฉ
  3. ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ ์กฐ์œจ: ๋ฒ„ํผ ๊ต์ฒด ์ „ ๋ฐ˜๋ณต ๊ฒฐ๊ณผ๊ฐ€ ์™„์ „ํžˆ ์™„๋ฃŒ๋˜๋„๋ก ๋ณด์žฅ
  4. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”: ๋ชจ๋“  ์ฒ˜๋ฆฌ๋ฅผ ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์ˆ˜ํ–‰

์ƒ์„ธ ์„ค๋ช…์ด ํฌํ•จ๋œ ์ „์ฒด ์†”๋ฃจ์…˜

๋”๋ธ” ๋ฒ„ํผ๋ง ์Šคํ…์‹ค ์†”๋ฃจ์…˜์€ ์ •๊ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์กฐ์ •๊ณผ ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ์ด ์ ‘๊ทผ ๋ฐฉ์‹์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ์•ˆ์ „ํ•œ ๋ฐ˜๋ณต์  ๊ฐœ์„  ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•ฉ๋‹ˆ๋‹ค.

๋”๋ธ” ๋ฒ„ํผ๋ง ์•„ํ‚คํ…์ฒ˜ ์„ค๊ณ„

์ด ํผ์ฆ์˜ ๊ทผ๋ณธ์ ์ธ ๋ŒํŒŒ๊ตฌ๋Š” ๋‹จ์ˆœํ•œ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™”๊ฐ€ ์•„๋‹Œ ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์ œ์–ด์ž…๋‹ˆ๋‹ค:

์ „ํ†ต์ ์ธ ์ ‘๊ทผ ๋ฐฉ์‹: ๋‹จ์ˆœํ•œ ์Šค๋ ˆ๋“œ ์กฐ์ •์„ ์œ„ํ•ด ๊ธฐ๋ณธ barrier() ์‚ฌ์šฉ

  • ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์— ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์‹คํ–‰
  • ๋‹จ์ผ ๋ฐฐ๋ฆฌ์–ด ํ˜ธ์ถœ๋กœ ์Šค๋ ˆ๋“œ ์™„๋ฃŒ๋ฅผ ๋™๊ธฐํ™”
  • ํŠน์ • ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ ํƒ€์ด๋ฐ์— ๋Œ€ํ•œ ์ œ์–ด ์—†์Œ

์ด ํผ์ฆ์˜ ํ˜์‹ : ๋ช…์‹œ์  ๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๋กœ ์กฐ์ •๋˜๋Š” ์„œ๋กœ ๋‹ค๋ฅธ ๋ฒ„ํผ ์—ญํ• 

  • buffer_A์™€ buffer_B๊ฐ€ ์ฝ๊ธฐ ์†Œ์Šค์™€ ์“ฐ๊ธฐ ๋Œ€์ƒ ์‚ฌ์ด๋ฅผ ๊ต๋Œ€
  • mbarrier API๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ ์™„๋ฃŒ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ์ œ์–ด๋ฅผ ์ œ๊ณต
  • ๋ช…์‹œ์  ์กฐ์ •์œผ๋กœ ๋ฒ„ํผ ์ „ํ™˜ ์ค‘ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€

๋ฐ˜๋ณต ์ฒ˜๋ฆฌ ์กฐ์œจ

๋‹จ์ผ ํŒจ์Šค ์•Œ๊ณ ๋ฆฌ์ฆ˜๊ณผ ๋‹ฌ๋ฆฌ, ์ด ํผ์ฆ์€ ์‹ ์ค‘ํ•œ ๋ฒ„ํผ ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•œ ๋ฐ˜๋ณต์  ๊ฐœ์„ ์„ ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค:

  • ๋ฐ˜๋ณต 0: buffer_A์—์„œ ์ฝ๊ธฐ (์ž…๋ ฅ์œผ๋กœ ์ดˆ๊ธฐํ™”๋จ), buffer_B์— ์“ฐ๊ธฐ
  • ๋ฐ˜๋ณต 1: buffer_B์—์„œ ์ฝ๊ธฐ (์ด์ „ ๊ฒฐ๊ณผ), buffer_A์— ์“ฐ๊ธฐ
  • ๋ฐ˜๋ณต 2: buffer_A์—์„œ ์ฝ๊ธฐ (์ด์ „ ๊ฒฐ๊ณผ), buffer_B์— ์“ฐ๊ธฐ
  • ๊ต๋Œ€ ๊ณ„์†: ๊ฐ ๋ฐ˜๋ณต์ด ์ด์ „ ๋ฐ˜๋ณต์˜ ๊ฒฐ๊ณผ๋ฅผ ๊ฐœ์„ 

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด API ์‚ฌ์šฉ๋ฒ•

mbarrier ์กฐ์ • ํŒจํ„ด์˜ ์ดํ•ด:

  • mbarrier_init(): ํŠน์ • ์Šค๋ ˆ๋“œ ์ˆ˜(TPB)๋ฅผ ์ง€์ •ํ•˜์—ฌ ๋ฐฐ๋ฆฌ์–ด ์ดˆ๊ธฐํ™”
  • mbarrier_arrive(): ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ ๋‹จ๊ณ„ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆผ
  • mbarrier_test_wait(): ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆด ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
  • ์žฌ์ดˆ๊ธฐํ™”: ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๋ฐ˜๋ณต ๊ฐ„์— ๋ฐฐ๋ฆฌ์–ด ์ƒํƒœ๋ฅผ ์žฌ์„ค์ •

ํ•ต์‹ฌ ํƒ€์ด๋ฐ ์ˆœ์„œ:

  1. ๋ชจ๋“  ์Šค๋ ˆ๋“œ ์“ฐ๊ธฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํ• ๋‹น๋œ ๋ฒ„ํผ ์š”์†Œ๋ฅผ ์—…๋ฐ์ดํŠธ
  2. ์™„๋ฃŒ ์•Œ๋ฆผ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ mbarrier_arrive() ํ˜ธ์ถœ
  3. ์ „์ฒด ๋Œ€๊ธฐ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ mbarrier_test_wait() ํ˜ธ์ถœ
  4. ์ง„ํ–‰ ์•ˆ์ „: ์ด์ œ ๋‹ค์Œ ๋ฐ˜๋ณต์„ ์œ„ํ•ด ๋ฒ„ํผ ์—ญํ• ์„ ์•ˆ์ „ํ•˜๊ฒŒ ๊ต์ฒด ๊ฐ€๋Šฅ
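์ด "์“ฐ๊ธฐ → ์•Œ๋ฆผ → ๋Œ€๊ธฐ → ์ฝ๊ธฐ" ์ˆœ์„œ๋Š” Python์˜ threading.Barrier๋กœ ๊ฐœ๋…์ ์œผ๋กœ ๋ชจ๋ธ๋งํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” mbarrier_arrive + mbarrier_test_wait ์Œ์„ barrier.wait() ํ•œ ๋ฒˆ์œผ๋กœ ๋Œ€์‘์‹œํ‚จ ๊ฐ€์ƒ์˜ ์Šค์ผ€์น˜์ผ ๋ฟ, ์‹ค์ œ GPU ๋™์ž‘๊ณผ 1:1๋กœ ์ผ์น˜ํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค:

```python
import threading

TPB = 8
buffer_B = [0.0] * TPB
results = [0.0] * TPB
# mbarrier_arrive + mbarrier_test_wait ์Œ์„ ๋‹จ์ผ ๋ฐฐ๋ฆฌ์–ด๋กœ ๋ชจ๋ธ๋ง
barrier = threading.Barrier(TPB)

def worker(i):
    buffer_B[i] = float(i * i)            # 1) ํ• ๋‹น๋œ ์š”์†Œ์— ์“ฐ๊ธฐ
    barrier.wait()                         # 2-3) ์™„๋ฃŒ ์•Œ๋ฆผ ํ›„ ์ „์ฒด ๋Œ€๊ธฐ
    results[i] = buffer_B[(i + 1) % TPB]  # 4) ์ด์ œ ์ด์›ƒ ์ฝ๊ธฐ๊ฐ€ ์•ˆ์ „

threads = [threading.Thread(target=worker, args=(i,)) for i in range(TPB)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์ด์›ƒ์˜ '์™„๋ฃŒ๋œ' ์“ฐ๊ธฐ๋ฅผ ์ฝ์Œ
```

๋ฐฐ๋ฆฌ์–ด ์—†์ด๋ผ๋ฉด ์ด์›ƒ์˜ ์“ฐ๊ธฐ๊ฐ€ ๋๋‚˜๊ธฐ ์ „์— ์ฝ์„ ์ˆ˜ ์žˆ์ง€๋งŒ, ๋ฐฐ๋ฆฌ์–ด ๋•๋ถ„์— ๊ฒฐ๊ณผ๊ฐ€ ํ•ญ์ƒ ๊ฒฐ์ •์ ์ž…๋‹ˆ๋‹ค.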

์Šคํ…์‹ค ์—ฐ์‚ฐ ๋ฉ”์ปค๋‹ˆ์ฆ˜

์ ์‘์  ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ๋ฅผ ํฌํ•จํ•œ 3์  ์Šคํ…์‹ค ์—ฐ์‚ฐ:

๋‚ด๋ถ€ ์š”์†Œ (์ธ๋ฑ์Šค 1๋ถ€ํ„ฐ 254):

# ์™ผ์ชฝ, ์ค‘์‹ฌ, ์˜ค๋ฅธ์ชฝ ์ด์›ƒ๊ณผ์˜ ํ‰๊ท 
stencil_sum = buffer[i-1] + buffer[i] + buffer[i+1]
result[i] = stencil_sum / 3.0

๊ฒฝ๊ณ„ ์š”์†Œ (์ธ๋ฑ์Šค 0๊ณผ 255):

# ์œ ํšจํ•œ ์ด์›ƒ๋งŒ ํ‰๊ท ์— ํฌํ•จ
stencil_count = 0
for neighbor in valid_neighbors:
    stencil_sum += buffer[neighbor]
    stencil_count += 1
result[i] = stencil_sum / stencil_count

๋ฒ„ํผ ์—ญํ•  ๊ต๋Œ€

ํ•‘ํ ๋ฒ„ํผ ํŒจํ„ด์ด ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ์„ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค:

์ง์ˆ˜ ๋ฐ˜๋ณต (0, 2, 4โ€ฆ):

  • ์ฝ๊ธฐ ์†Œ์Šค: buffer_A์— ํ˜„์žฌ ๋ฐ์ดํ„ฐ ํฌํ•จ
  • ์“ฐ๊ธฐ ๋Œ€์ƒ: buffer_B๊ฐ€ ์—…๋ฐ์ดํŠธ๋œ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 
  • ๋ฉ”๋ชจ๋ฆฌ ํ๋ฆ„: buffer_A โ†’ ์Šคํ…์‹ค ์—ฐ์‚ฐ โ†’ buffer_B

ํ™€์ˆ˜ ๋ฐ˜๋ณต (1, 3, 5โ€ฆ):

  • ์ฝ๊ธฐ ์†Œ์Šค: buffer_B์— ํ˜„์žฌ ๋ฐ์ดํ„ฐ ํฌํ•จ
  • ์“ฐ๊ธฐ ๋Œ€์ƒ: buffer_A๊ฐ€ ์—…๋ฐ์ดํŠธ๋œ ๊ฒฐ๊ณผ๋ฅผ ์ˆ˜์‹ 
  • ๋ฉ”๋ชจ๋ฆฌ ํ๋ฆ„: buffer_B โ†’ ์Šคํ…์‹ค ์—ฐ์‚ฐ โ†’ buffer_A

๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด๊ฐ€ ์—ฌ๋Ÿฌ ์œ ํ˜•์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค:

๋ฐฐ๋ฆฌ์–ด ์—†์ด (์ž˜๋ชป๋œ ๊ฒฝ์šฐ):

# ์Šค๋ ˆ๋“œ A๊ฐ€ buffer_B[10]์— ์“ฐ๊ธฐ
buffer_B[10] = stencil_result_A

# ์Šค๋ ˆ๋“œ B๊ฐ€ ์Šคํ…์‹ค ์—ฐ์‚ฐ์„ ์œ„ํ•ด buffer_B[10]์„ ์ฆ‰์‹œ ์ฝ๊ธฐ
# ๊ฒฝ์Ÿ ์ƒํƒœ: ์Šค๋ ˆ๋“œ B๊ฐ€ ์Šค๋ ˆ๋“œ A์˜ ์“ฐ๊ธฐ๊ฐ€ ์™„๋ฃŒ๋˜๊ธฐ ์ „์— ์ด์ „ ๊ฐ’์„ ์ฝ์„ ์ˆ˜ ์žˆ์Œ
stencil_input = buffer_B[10]  # ๋ฏธ์ •์˜ ๋™์ž‘!

๋ฐฐ๋ฆฌ์–ด ์‚ฌ์šฉ (์˜ฌ๋ฐ”๋ฅธ ๊ฒฝ์šฐ):

# ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฐ๊ณผ๋ฅผ ์“ฐ๊ธฐ
buffer_B[local_i] = stencil_result

# ์“ฐ๊ธฐ ์™„๋ฃŒ ์•Œ๋ฆผ
mbarrier_arrive(barrier)

# ๋ชจ๋“  ์Šค๋ ˆ๋“œ์˜ ์“ฐ๊ธฐ ์™„๋ฃŒ๊นŒ์ง€ ๋Œ€๊ธฐ
mbarrier_test_wait(barrier, TPB)

# ์ด์ œ ์ฝ๊ธฐ ์•ˆ์ „ - ๋ชจ๋“  ์“ฐ๊ธฐ ์™„๋ฃŒ ๋ณด์žฅ
stencil_input = buffer_B[neighbor_index]  # ํ•ญ์ƒ ์˜ฌ๋ฐ”๋ฅธ ๊ฐ’์„ ์ฝ์Œ

์ถœ๋ ฅ ๋ฒ„ํผ ์„ ํƒ

์ตœ์ข… ๊ฒฐ๊ณผ ์œ„์น˜๋Š” ๋ฐ˜๋ณต ํšŸ์ˆ˜์˜ ํ™€์ง์— ๋”ฐ๋ผ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค:

์ˆ˜ํ•™์  ๊ฒฐ์ •:

  • STENCIL_ITERATIONS = 3 (ํ™€์ˆ˜)
  • ์ตœ์ข… ํ™œ์„ฑ ๋ฒ„ํผ: ๋ฐ˜๋ณต 2๊ฐ€ buffer_B์— ์“ฐ๊ธฐ
  • ์ถœ๋ ฅ ์†Œ์Šค: buffer_B์—์„œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋กœ ๋ณต์‚ฌ

๊ตฌํ˜„ ํŒจํ„ด:

@parameter
if STENCIL_ITERATIONS % 2 == 0:
    # ์ง์ˆ˜ ์ด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋Š” buffer_A์—์„œ ์ข…๋ฃŒ
    output[global_i] = buffer_A[local_i]
else:
    # ํ™€์ˆ˜ ์ด ๋ฐ˜๋ณต ํšŸ์ˆ˜๋Š” buffer_B์—์„œ ์ข…๋ฃŒ
    output[global_i] = buffer_B[local_i]

์„ฑ๋Šฅ ํŠน์„ฑ

๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: ์ž…๋ ฅ ๋กœ๋”ฉ๊ณผ ์ตœ์ข… ์ถœ๋ ฅ์—๋งŒ ์ ‘๊ทผ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ชจ๋“  ๋ฐ˜๋ณต ์ฒ˜๋ฆฌ์— ๋น ๋ฅธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ
  • ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ค‘์‹ฌ์œผ๋กœ ์ตœ์†Œํ™”

๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ:

  • mbarrier ๋น„์šฉ: ๊ธฐ๋ณธ barrier()๋ณด๋‹ค ๋†’์ง€๋งŒ ํ•„์ˆ˜์ ์ธ ์ œ์–ด๋ฅผ ์ œ๊ณต
  • ๋ฐ˜๋ณต ํ™•์žฅ์„ฑ: ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ฐ˜๋ณต ํšŸ์ˆ˜์— ๋น„๋ก€ํ•˜์—ฌ ์„ ํ˜•์ ์œผ๋กœ ์ฆ๊ฐ€
  • ์Šค๋ ˆ๋“œ ํšจ์œจ์„ฑ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ์ฒ˜๋ฆฌ ์ „๋ฐ˜์— ๊ฑธ์ณ ํ™œ์„ฑ ์ƒํƒœ ์œ ์ง€

์‹ค์ œ ์‘์šฉ ๋ถ„์•ผ

์ด ๋”๋ธ” ๋ฒ„ํผ๋ง ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

๋ฐ˜๋ณต๋ฒ•:

  • ์„ ํ˜• ์‹œ์Šคํ…œ์„ ์œ„ํ•œ Gauss-Seidel ๋ฐ Jacobi ๋ฐฉ๋ฒ•
  • ์ˆ˜์น˜ ์ •ํ™•๋„๋ฅผ ์œ„ํ•œ ๋ฐ˜๋ณต์  ๊ฐœ์„ 
  • ๋ ˆ๋ฒจ๋ณ„ ์ฒ˜๋ฆฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‹ค์ค‘ ๊ทธ๋ฆฌ๋“œ ๋ฐฉ๋ฒ•

์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ:

  • ๋‹ค์ค‘ ํŒจ์Šค ํ•„ํ„ฐ (์–‘์ธก, ์œ ๋„, ์—ฃ์ง€ ๋ณด์กด)
  • ๋ฐ˜๋ณต์  ๋””๋…ธ์ด์ง• ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ์—ด ํ™•์‚ฐ๊ณผ ์ด๋ฐฉ์„ฑ ์Šค๋ฌด๋”ฉ

์‹œ๋ฎฌ๋ ˆ์ด์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

  • ์ƒํƒœ ์ง„ํ™”๋ฅผ ๊ฐ€์ง„ ์…€๋ฃฐ๋Ÿฌ ์˜คํ† ๋งˆํƒ€
  • ์œ„์น˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์ˆ˜๋ฐ˜ํ•˜๋Š” ์ž…์ž ์‹œ์Šคํ…œ
  • ๋ฐ˜๋ณต์  ์••๋ ฅ ์†”๋น™์„ ์‚ฌ์šฉํ•œ ์œ ์ฒด ์—ญํ•™

ํ•ต์‹ฌ ๊ธฐ์ˆ ์  ํ†ต์ฐฐ

๋ฉ”๋ชจ๋ฆฌ ๋ฐฐ๋ฆฌ์–ด ์ฒ ํ•™:

  • ๋ช…์‹œ์  ์ œ์–ด: ์ž๋™ ๋™๊ธฐํ™” ๋Œ€๋น„ ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์— ๋Œ€ํ•œ ์ •๋ฐ€ํ•œ ํƒ€์ด๋ฐ ์ œ์–ด
  • ๊ฒฝ์Ÿ ์ƒํƒœ ๋ฐฉ์ง€: ๊ต๋Œ€ํ•˜๋Š” ์ฝ๊ธฐ/์“ฐ๊ธฐ ํŒจํ„ด์„ ๊ฐ€์ง„ ๋ชจ๋“  ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ํ•„์ˆ˜
  • ์„ฑ๋Šฅ ์ ˆ์ถฉ: ๋ณด์žฅ๋œ ์ •ํ™•์„ฑ์„ ์œ„ํ•œ ๋” ๋†’์€ ๋™๊ธฐํ™” ๋น„์šฉ

๋”๋ธ” ๋ฒ„ํผ๋ง์˜ ์ด์ :

  • ๋ฐ์ดํ„ฐ ๋ฌด๊ฒฐ์„ฑ: ์“ฐ๊ธฐ ์ค‘ ์ฝ๊ธฐ hazard ์ œ๊ฑฐ
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ช…ํ™•์„ฑ: ํ˜„์žฌ์™€ ๋‹ค์Œ ๋ฐ˜๋ณต ์ƒํƒœ ๊ฐ„์˜ ๊น”๋”ํ•œ ๋ถ„๋ฆฌ
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ค‘๊ฐ„ ์ €์žฅ์†Œ ๋ถˆํ•„์š”

๋ฐ˜๋ณต ๊ด€๋ฆฌ:

  • ์ปดํŒŒ์ผ ํƒ€์ž„ ๋ฃจํ”„ ์ „๊ฐœ: @parameter for๊ฐ€ ์ตœ์ ํ™” ๊ธฐํšŒ๋ฅผ ์ œ๊ณต
  • ์ƒํƒœ ์ถ”์ : ๋ฒ„ํผ ์—ญํ•  ๊ต๋Œ€๊ฐ€ ๊ฒฐ์ •์ ์ด์–ด์•ผ ํ•จ
  • ๊ฒฝ๊ณ„ ์ฒ˜๋ฆฌ: ์ ์‘์  ์Šคํ…์‹ค ์—ฐ์‚ฐ์ด ์—ฃ์ง€ ์ผ€์ด์Šค๋ฅผ ๋งค๋„๋Ÿฝ๊ฒŒ ์ฒ˜๋ฆฌ

์ด ์†”๋ฃจ์…˜์€ ์ •๋ฐ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ œ์–ด๊ฐ€ ํ•„์š”ํ•œ ๋ฐ˜๋ณต GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ณด์—ฌ์ฃผ๋ฉฐ, ๋‹จ์ˆœํ•œ ๋ณ‘๋ ฌ ๋ฃจํ”„๋ฅผ ๋„˜์–ด ์‹ค์ œ ์ˆ˜์น˜ ์†Œํ”„ํŠธ์›จ์–ด์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ •๊ตํ•œ ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ ํŒจํ„ด์œผ๋กœ ๋‚˜์•„๊ฐ‘๋‹ˆ๋‹ค.

Puzzle 30: GPU ํ”„๋กœํŒŒ์ผ๋ง

์˜ฌ๋ฐ”๋ฅธ ์ฝ”๋“œ, ๊ทธ ๋„ˆ๋จธ๋กœ

์ฐธ๊ณ : ์ด ํŒŒํŠธ๋Š” ํ˜ธํ™˜๋˜๋Š” NVIDIA GPU ์ „์šฉ์ž…๋‹ˆ๋‹ค

์ด ์ฑ•ํ„ฐ์—์„œ๋Š” ๋™์ž‘ํ•˜๋Š” GPU ์ฝ”๋“œ๋ฅผ ๊ณ ์„ฑ๋Šฅ ์ฝ”๋“œ๋กœ ํƒˆ๋ฐ”๊ฟˆ์‹œํ‚ค๋Š” ์ฒด๊ณ„์  ์„ฑ๋Šฅ ๋ถ„์„์„ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค. ์ •ํ™•์„ฑ๊ณผ GPU ๊ธฐ๋Šฅ์— ์ง‘์ค‘ํ–ˆ๋˜ ์ด์ „ ํผ์ฆ๋“ค๊ณผ ๋‹ฌ๋ฆฌ, ์—ฌ๊ธฐ์„œ๋Š” ์‹ค๋ฌด GPU ์†Œํ”„ํŠธ์›จ์–ด ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ:

  • ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ: ์ข…ํ•ฉ์ ์ธ ์„ฑ๋Šฅ ๋ถ„์„์„ ์œ„ํ•œ NSight Systems์™€ NSight Compute
  • ์„ฑ๋Šฅ ํƒ์ • ์ž‘์—…: ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ๋ณ‘๋ชฉ๊ณผ ์ตœ์ ํ™” ๊ธฐํšŒ ํŒŒ์•…
  • ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ํ†ต์ฐฐ: ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ๊ทน์ ์ธ ์˜ํ–ฅ ์ดํ•ด
  • ๋ฐ˜์ง๊ด€์ ์ธ ๋ฐœ๊ฒฌ: โ€œ์ข‹์•„ ๋ณด์ด๋Š”โ€ ์ง€ํ‘œ๊ฐ€ ์˜คํžˆ๋ ค ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋Š” ๊ฒฝ์šฐ
  • ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”: ๊ฐ€์ •์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ์ ํ™” ํŒ๋‹จ

์™œ ์ค‘์š”ํ•œ๊ฐ€: ๋Œ€๋ถ€๋ถ„์˜ GPU ํŠœํ† ๋ฆฌ์–ผ์€ ๊ธฐ๋ณธ์ ์ธ ์„ฑ๋Šฅ ๊ฐœ๋…๋งŒ ๊ฐ€๋ฅด์น˜์ง€๋งŒ, ์‹ค์ œ GPU ๊ฐœ๋ฐœ์—์„œ๋Š” ์‹ค์งˆ์ ์ธ ๋ณ‘๋ชฉ์„ ์ฐพ์•„๋‚ด๊ณ , ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ๋™์ž‘์„ ์ดํ•ดํ•˜๋ฉฐ, ๊ทผ๊ฑฐ ์žˆ๋Š” ์ตœ์ ํ™” ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๊ธฐ ์œ„ํ•œ ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ์—ญ๋Ÿ‰์ด ํ•™์ˆ ์  ์˜ˆ์ œ์™€ ์‹ค๋ฌด GPU ์ปดํ“จํŒ… ์‚ฌ์ด์˜ ๊ฒฉ์ฐจ๋ฅผ ๋ฉ”์›Œ์ค๋‹ˆ๋‹ค.

๊ฐœ์š”

GPU ์„ฑ๋Šฅ ํ”„๋กœํŒŒ์ผ๋ง์€ ์ฒด๊ณ„์  ๋ถ„์„์„ ํ†ตํ•ด ์˜ฌ๋ฐ”๋ฅธ ์ฝ”๋“œ๋ฅผ ๊ณ ์„ฑ๋Šฅ ์ฝ”๋“œ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ์ด ์ฑ•ํ„ฐ์—์„œ๋Š” ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ์™€ ํƒ์ • ๋ฐฉ๋ฒ•๋ก ์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต ๋ชฉํ‘œ:

  • ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ ์„ ํƒ๋ฒ• ํ•™์Šต - NSight Systems์™€ NSight Compute๋ฅผ ์–ธ์ œ ์‚ฌ์šฉํ•˜๋Š”์ง€ ์ดํ•ด
  • ์„ฑ๋Šฅ ํƒ์ • ๋Šฅ๋ ฅ ๊ฐœ๋ฐœ - ์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ถœ๋ ฅ์„ ํ™œ์šฉํ•˜์—ฌ ๋ณ‘๋ชฉ ์‹๋ณ„
  • ๋ฐ˜์ง๊ด€์ ์ธ ํ†ต์ฐฐ ๋ฐœ๊ฒฌ - GPU ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ๊ณผ ์บ์‹ฑ ๋™์ž‘์— ๋Œ€ํ•œ ์ƒˆ๋กœ์šด ์‹œ๊ฐ
  • ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™” ํ•™์Šต - ๊ฐ€์ •์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ์ ํ™”

ํ•ต์‹ฌ ๊ฐœ๋…

์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ:

  • NSight Systems (nsys): CPU-GPU ์กฐ์œจ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์œ„ํ•œ ์‹œ์Šคํ…œ ์ „์ฒด ํƒ€์ž„๋ผ์ธ ๋ถ„์„
  • NSight Compute (ncu): ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ๊ณผ ์—ฐ์‚ฐ ํ™œ์šฉ๋„๋ฅผ ์œ„ํ•œ ์ƒ์„ธ ์ปค๋„ ๋ถ„์„
  • ์ฒด๊ณ„์  ๋ฐฉ๋ฒ•๋ก : ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ๋ณ‘๋ชฉ ์‹๋ณ„๊ณผ ์ตœ์ ํ™” ๊ฒ€์ฆ

๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋  ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๋ฐ˜์ง๊ด€์  ๋™์ž‘: ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์ด ์‹ค์ œ๋กœ๋Š” ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๊ฒฝ์šฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด: ๋ณ‘ํ•ฉ์ด ๋Œ€์—ญํญ ํ™œ์šฉ์— ๋ฏธ์น˜๋Š” ๊ทน์ ์ธ ์˜ํ–ฅ
  • ๋„๊ตฌ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”: ์„ฑ๋Šฅ ๊ฐ€์ •์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ์˜์‚ฌ๊ฒฐ์ •

๊ตฌ์„ฑ

์š”๊ตฌ ์‚ฌํ•ญ:

  • NVIDIA GPU: ํ”„๋กœํŒŒ์ผ๋ง์ด ํ™œ์„ฑํ™”๋œ CUDA ํ˜ธํ™˜ ํ•˜๋“œ์›จ์–ด
  • CUDA Toolkit: NSight Systems ๋ฐ NSight Compute ๋„๊ตฌ
  • ๋นŒ๋“œ ์„ค์ •: ๋””๋ฒ„๊ทธ ์ •๋ณด๊ฐ€ ํฌํ•จ๋œ ์ตœ์ ํ™” ์ฝ”๋“œ (--debug-level=full)

๋ฐฉ๋ฒ•๋ก :

  1. NSight Systems๋ฅผ ํ™œ์šฉํ•œ ์‹œ์Šคํ…œ ์ „์ฒด ๋ถ„์„์œผ๋กœ ์ฃผ์š” ๋ณ‘๋ชฉ ์‹๋ณ„
  2. NSight Compute๋ฅผ ํ™œ์šฉํ•œ ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„์œผ๋กœ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ๋ถ„์„
  3. ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•œ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ๊ฒฐ๋ก ์œผ๋กœ ์ตœ์ ํ™” ๋ฐฉํ–ฅ ๋„์ถœ

ํผ์ฆ ๊ตฌ์„ฑ

์ด ์ฑ•ํ„ฐ๋Š” ์„œ๋กœ ์—ฐ๊ฒฐ๋˜์–ด ์ ์ง„์ ์œผ๋กœ ๋ฐœ์ „ํ•˜๋Š” ๋‘ ๊ฐœ์˜ ๊ตฌ์„ฑ ์š”์†Œ๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค:

NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ

์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ถœ๋ ฅ์„ ์‚ฌ์šฉํ•œ ์‹ค์Šต ์˜ˆ์ œ๋ฅผ ํ†ตํ•ด NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ์ƒํƒœ๊ณ„์˜ ํ•ต์‹ฌ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

ํ•™์Šต ๋‚ด์šฉ:

  • ์‹œ์Šคํ…œ ์ „์ฒด ํƒ€์ž„๋ผ์ธ ๋ถ„์„๊ณผ ๋ณ‘๋ชฉ ์‹๋ณ„์„ ์œ„ํ•œ NSight Systems
  • ์ƒ์„ธ ์ปค๋„ ๋ถ„์„๊ณผ ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ํ†ต์ฐฐ์„ ์œ„ํ•œ NSight Compute
  • ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ์›Œํฌํ”Œ๋กœ์šฐ์™€ ๋ชจ๋ฒ” ์‚ฌ๋ก€

์บ์‹œ ํžˆํŠธ์˜ ์—ญ์„ค

๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์ปค๋„ ์„ธ ๊ฐœ๊ฐ€ ๊ทน์ ์œผ๋กœ ๋‹ค๋ฅธ ์„ฑ๋Šฅ์„ ๋ณด์ด๋Š” ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ํ’€์–ด๋ด…๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: ์บ์‹œ ํžˆํŠธ์œจ์ด ๊ฐ€์žฅ ๋†’์€ ์ปค๋„์ด ์„ฑ๋Šฅ์€ ๊ฐ€์žฅ ๋‚ฎ์€ ์ด์œ ๋ฅผ ๋ฐํ˜€๋‚ด์„ธ์š” - CPU ์ค‘์‹ฌ์˜ ์ „ํ†ต์ ์ธ ์„ฑ๋Šฅ ๊ด€๋…์„ ๋’ค์ง‘๋Š” ๋ฐ˜์ง๊ด€์  ํ†ต์ฐฐ์ž…๋‹ˆ๋‹ค.

ํƒ์ • ๋Šฅ๋ ฅ: ์‹ค์ œ NSight Systems์™€ NSight Compute ๋ฐ์ดํ„ฐ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ ํšจ๊ณผ์™€ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๋ฅผ ์ดํ•ดํ•ฉ๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

ํ•™์Šต ๊ฒฝ๋กœ:

  1. NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ - NSight Systems์™€ NSight Compute ํ•™์Šต
  2. ์บ์‹œ ํžˆํŠธ์˜ ์—ญ์„ค - ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ ํ’€๊ธฐ์— ๋Šฅ๋ ฅ ์ ์šฉ

์‚ฌ์ „ ์ค€๋น„:

  • GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ์™€ ์ ‘๊ทผ ํŒจํ„ด
  • GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ์ดˆ (์Šค๋ ˆ๋“œ, ๋ธ”๋ก, ์›Œํ”„, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ)
  • ์ปค๋งจ๋“œ๋ผ์ธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ ์‚ฌ์šฉ ๊ฒฝํ—˜

ํ•™์Šต ์„ฑ๊ณผ: ์‹ค๋ฌด GPU ๊ฐœ๋ฐœ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ์ฒด๊ณ„์  ๋ณ‘๋ชฉ ์‹๋ณ„๊ณผ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ํ”„๋กœํŒŒ์ผ๋ง ์—ญ๋Ÿ‰.

์ด ์ฑ•ํ„ฐ๋Š” ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง์ด ์ง๊ด€์ด ๋†“์น˜๋Š” ์ง„์‹ค์„ ๋“œ๋Ÿฌ๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๋ ค์ค๋‹ˆ๋‹ค - GPU ์„ฑ๋Šฅ ์ตœ์ ํ™”๋Š” ๊ฐ€์ •์ด ์•„๋‹Œ ๋„๊ตฌ ๊ธฐ๋ฐ˜์˜ ๋ฐœ๊ฒฌ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

์ถ”๊ฐ€ ์ž๋ฃŒ:

๐Ÿ“š NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ

๊ฐœ์š”

์ง€๊ธˆ๊นŒ์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ธฐ์ดˆ์™€ ๊ณ ๊ธ‰ ํŒจํ„ด์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. Part II์—์„œ๋Š” compute-sanitizer์™€ cuda-gdb๋ฅผ ์‚ฌ์šฉํ•œ ์ •ํ™•์„ฑ ๋””๋ฒ„๊น… ๊ธฐ๋ฒ•์„, ๋‹ค๋ฅธ ํŒŒํŠธ์—์„œ๋Š” ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ, ๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ, ๋ธ”๋ก ๋ ˆ๋ฒจ ์—ฐ์‚ฐ ๋“ฑ ๋‹ค์–‘ํ•œ GPU ๊ธฐ๋Šฅ์„ ๋‹ค๋ค˜์Šต๋‹ˆ๋‹ค. ์ปค๋„์ด ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๊ธด ํ•ฉ๋‹ˆ๋‹ค - ํ•˜์ง€๋งŒ ๋น ๋ฅด๊ธฐ๋„ ํ• ๊นŒ์š”?

์ด ํŠœํ† ๋ฆฌ์–ผ์€ CUDA Best Practices Guide์—์„œ ๊ถŒ์žฅํ•˜๋Š” NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์˜ฌ๋ฐ”๋ฅธ ์ปค๋„์ด๋ผ๋„ ์ตœ์ ์˜ ์„ฑ๋Šฅ๋ณด๋‹ค ์ˆ˜์‹ญ ๋ฐฐ๋‚˜ ๋А๋ฆด ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋ง์€ ๋™์ž‘ํ•˜๋Š” ์ฝ”๋“œ์™€ ๊ณ ์„ฑ๋Šฅ ์ฝ”๋“œ ์‚ฌ์ด์˜ ๊ฒฉ์ฐจ๋ฅผ ์ขํž™๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ ๋ชจ์Œ

pixi๋ฅผ ํ†ตํ•ด cuda-toolkit์ด ์„ค์น˜๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ, NVIDIA์˜ ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ๋ฐ”๋กœ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

NSight Systems (nsys) - โ€œ์ „์ฒด ๊ทธ๋ฆผโ€ ๋„๊ตฌ

์šฉ๋„: ์‹œ์Šคํ…œ ์ „์ฒด ์„ฑ๋Šฅ ๋ถ„์„ (NSight Systems ๋ฌธ์„œ)

  • CPU-GPU ์ƒํ˜ธ์ž‘์šฉ์˜ ํƒ€์ž„๋ผ์ธ ๋ทฐ
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋ณ‘๋ชฉ
  • ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ
  • ๋ฉ€ํ‹ฐ GPU ์กฐ์œจ
  • API ํ˜ธ์ถœ ์ถ”์ 

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ธํ„ฐํŽ˜์ด์Šค: ์ปค๋งจ๋“œ๋ผ์ธ (nsys) ๋ฐ GUI (nsys-ui)

์‚ฌ์šฉ ์‹œ์ :

  • ์ „์ฒด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ํ๋ฆ„ ํŒŒ์•…
  • CPU-GPU ๋™๊ธฐํ™” ๋ฌธ์ œ ์‹๋ณ„
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ํŒจํ„ด ๋ถ„์„
  • ์ปค๋„ ์‹คํ–‰ ๋ณ‘๋ชฉ ๋ฐœ๊ฒฌ
# ๋„์›€๋ง ๋ณด๊ธฐ
pixi run nsys --help

# ๊ธฐ๋ณธ ์‹œ์Šคํ…œ ์ „์ฒด ํ”„๋กœํŒŒ์ผ๋ง
pixi run nsys profile --trace=cuda,nvtx --output=timeline mojo your_program.mojo

# ๋Œ€ํ™”ํ˜• ๋ถ„์„
pixi run nsys stats --force-export=true timeline.nsys-rep

NSight Compute (ncu) - โ€œ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„โ€ ๋„๊ตฌ

์šฉ๋„: ์ƒ์„ธํ•œ ๋‹จ์ผ ์ปค๋„ ์„ฑ๋Šฅ ๋ถ„์„ (NSight Compute ๋ฌธ์„œ)

  • ๋ฃจํ”„๋ผ์ธ ๋ชจ๋ธ ๋ถ„์„
  • ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ํ™œ์šฉ๋„
  • ์›Œํ”„ ์‹คํ–‰ ํšจ์œจ
  • ๋ ˆ์ง€์Šคํ„ฐ/๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰
  • ์—ฐ์‚ฐ ์œ ๋‹› ํ™œ์šฉ๋„

์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ธํ„ฐํŽ˜์ด์Šค: ์ปค๋งจ๋“œ๋ผ์ธ (ncu) ๋ฐ GUI (ncu-ui)

์‚ฌ์šฉ ์‹œ์ :

  • ํŠน์ • ์ปค๋„ ์„ฑ๋Šฅ ์ตœ์ ํ™”
  • ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํŒŒ์•…
  • ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ vs ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„ ๋ถ„์„
  • ์›Œํ”„ ๋ถ„๊ธฐ ๋ฌธ์ œ ์‹๋ณ„
# ๋„์›€๋ง ๋ณด๊ธฐ
pixi run ncu --help

# ์ƒ์„ธ ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง
pixi run ncu --set full --output kernel_profile mojo your_program.mojo

# ํŠน์ • ์ปค๋„์— ์ง‘์ค‘
pixi run ncu --kernel-name regex:your_kernel_name mojo your_program.mojo

๋„๊ตฌ ์„ ํƒ ์˜์‚ฌ๊ฒฐ์ • ํŠธ๋ฆฌ

์„ฑ๋Šฅ ๋ฌธ์ œ ๋ฐœ์ƒ
      |
      v
์–ด๋–ค ์ปค๋„์ธ์ง€ ์•„๋Š”๊ฐ€?
    |           |
  ์•„๋‹ˆ์˜ค         ์˜ˆ
    |           |
    v           v
NSight    ์ปค๋„ ๊ณ ์œ ์˜ ๋ฌธ์ œ์ธ๊ฐ€?
Systems       |         |
    |       ์•„๋‹ˆ์˜ค       ์˜ˆ
    v         |         |
ํƒ€์ž„๋ผ์ธ        |         v
๋ถ„์„    <------+   NSight Compute
                        |
                        v
                   ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„

๋น ๋ฅธ ์˜์‚ฌ๊ฒฐ์ • ๊ฐ€์ด๋“œ:

  • ๋ณ‘๋ชฉ์ด ์–ด๋””์ธ์ง€ ๋ชจ๋ฅด๊ฒ ์œผ๋ฉด NSight Systems (nsys)๋ถ€ํ„ฐ ์‹œ์ž‘
  • ์ตœ์ ํ™”ํ•  ์ปค๋„์„ ์ •ํ™•ํžˆ ์•Œ๋ฉด NSight Compute (ncu) ์‚ฌ์šฉ
  • ์ข…ํ•ฉ์ ์ธ ๋ถ„์„์ด ํ•„์š”ํ•˜๋ฉด ๋‘˜ ๋‹ค ์‚ฌ์šฉ (์ผ๋ฐ˜์ ์ธ ์›Œํฌํ”Œ๋กœ์šฐ)

์‹ค์Šต: NSight Systems๋กœ ์‹œ์Šคํ…œ ์ „์ฒด ํ”„๋กœํŒŒ์ผ๋ง

Puzzle 16์˜ ํ–‰๋ ฌ ๊ณฑ์…ˆ ๊ตฌํ˜„๋“ค์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜์—ฌ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ํŒŒ์•…ํ•ด ๋ด…์‹œ๋‹ค.

GUI ์ฐธ๊ณ : NSight Systems์™€ Compute GUI (nsys-ui, ncu-ui)๋Š” ๋””์Šคํ”Œ๋ ˆ์ด์™€ OpenGL ์ง€์›์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. X11 ํฌ์›Œ๋”ฉ์ด ์—†๋Š” ํ—ค๋“œ๋ฆฌ์Šค ์„œ๋ฒ„๋‚˜ ์›๊ฒฉ ์‹œ์Šคํ…œ์—์„œ๋Š” ์ปค๋งจ๋“œ๋ผ์ธ ๋ฒ„์ „ (nsys, ncu)์„ ์‚ฌ์šฉํ•˜์—ฌ nsys stats์™€ ncu --import --page details๋กœ ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜ ๋ถ„์„์„ ์ˆ˜ํ–‰ํ•˜์„ธ์š”. .nsys-rep์™€ .ncu-rep ํŒŒ์ผ์„ ๋กœ์ปฌ ๋จธ์‹ ์œผ๋กœ ์ „์†กํ•˜์—ฌ GUI๋กœ ๋ถ„์„ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

Step 1: ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ์ฝ”๋“œ ์ค€๋น„

์ค‘์š”: ์ •ํ™•ํ•œ ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•ด ์ตœ์ ํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜์—ฌ ๋นŒ๋“œํ•ฉ๋‹ˆ๋‹ค:

pixi shell -e nvidia
# ์ตœ์ ํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด ํฌํ•จ ๋นŒ๋“œ (ํฌ๊ด„์ ์ธ ์†Œ์Šค ๋งคํ•‘์šฉ)
mojo build --debug-level=full solutions/p16/p16.mojo -o solutions/p16/p16_optimized

# ์ตœ์ ํ™” ๋นŒ๋“œ ํ…Œ์ŠคํŠธ
./solutions/p16/p16_optimized --naive

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด: ํ”„๋กœํŒŒ์ผ๋Ÿฌ๋ฅผ ์œ„ํ•œ ์™„์ „ํ•œ ์‹ฌ๋ณผ ํ…Œ์ด๋ธ”, ๋ณ€์ˆ˜๋ช…, ์†Œ์Šค ๋ผ์ธ ๋งคํ•‘ ์ œ๊ณต
  • ํฌ๊ด„์  ๋ถ„์„: NSight ๋„๊ตฌ๊ฐ€ ์„ฑ๋Šฅ ๋ฐ์ดํ„ฐ๋ฅผ ํŠน์ • ์ฝ”๋“œ ์œ„์น˜์™€ ์—ฐ๊ฒฐ ๊ฐ€๋Šฅ
  • ์ตœ์ ํ™” ์œ ์ง€: ํ”„๋กœ๋•์…˜ ๋นŒ๋“œ์™€ ์ผ์น˜ํ•˜๋Š” ํ˜„์‹ค์ ์ธ ์„ฑ๋Šฅ ์ธก์ • ๋ณด์žฅ

Step 2: ์‹œ์Šคํ…œ ์ „์ฒด ํ”„๋กœํŒŒ์ผ ์ˆ˜์ง‘

# ํฌ๊ด„์  ์ถ”์ ์œผ๋กœ ์ตœ์ ํ™” ๋นŒ๋“œ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile \
  --trace=cuda,nvtx \
  --output=matmul_naive \
  --force-overwrite=true \
  ./solutions/p16/p16_optimized --naive

๋ช…๋ น์–ด ๋ถ„์„:

  • --trace=cuda,nvtx: CUDA API ํ˜ธ์ถœ ๋ฐ ์ปค์Šคํ…€ ์–ด๋…ธํ…Œ์ด์…˜ ์บก์ฒ˜
  • --output=matmul_naive: ํ”„๋กœํŒŒ์ผ์„ matmul_naive.nsys-rep๋กœ ์ €์žฅ
  • --force-overwrite=true: ๊ธฐ์กด ํ”„๋กœํŒŒ์ผ ๋ฎ์–ด์“ฐ๊ธฐ
  • ๋งˆ์ง€๋ง‰ ์ธ์ˆ˜: Mojo ํ”„๋กœ๊ทธ๋žจ

Step 3: ํƒ€์ž„๋ผ์ธ ๋ถ„์„

# ํ…์ŠคํŠธ ๊ธฐ๋ฐ˜ ํ†ต๊ณ„ ์ƒ์„ฑ
nsys stats --force-export=true matmul_naive.nsys-rep

# ์ฃผ์š” ์ง€ํ‘œ ํ™•์ธ:
# - GPU ํ™œ์šฉ๋ฅ 
# - ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ์‹œ๊ฐ„
# - ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„
# - CPU-GPU ๋™๊ธฐํ™” ๊ฐ„๊ฒฉ

ํ™•์ธํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฐ๊ณผ (2ร—2 ํ–‰๋ ฌ ๊ณฑ์…ˆ์˜ ์‹ค์ œ ์ถœ๋ ฅ):

** CUDA API Summary (cuda_api_sum):
 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)          Name
 --------  ---------------  ---------  ---------  --------  --------  --------  -----------  --------------------
     81.9          8617962          3  2872654.0    2460.0      1040   8614462    4972551.6  cuMemAllocAsync
     15.1          1587808          4   396952.0    5965.5      3810   1572067     783412.3  cuMemAllocHost_v2
      0.6            67152          1    67152.0   67152.0     67152     67152          0.0  cuModuleLoadDataEx
      0.4            44961          1    44961.0   44961.0     44961     44961          0.0  cuLaunchKernelEx

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                    Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------
    100.0             1920          1    1920.0    1920.0      1920      1920          0.0  p16_naive_matmul_Layout_Int6A6AcB6A6AsA6A6A

** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):
 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ----------------------------
     49.4             4224      3    1408.0    1440.0      1312      1472         84.7  [CUDA memcpy Device-to-Host]
     36.0             3072      4     768.0     528.0       416      1600        561.0  [CUDA memset]
     14.6             1248      3     416.0     416.0       416       416          0.0  [CUDA memcpy Host-to-Device]

์ฃผ์š” ์„ฑ๋Šฅ ํ†ต์ฐฐ:

  • ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์ด ์ง€๋ฐฐ์ : ์ „์ฒด ์‹œ๊ฐ„์˜ 81.9%๊ฐ€ cuMemAllocAsync์— ์†Œ๋น„
  • ์ปค๋„์€ ๋ฒˆ๊ฐœ์ฒ˜๋Ÿผ ๋น ๋ฆ„: ์‹คํ–‰ ์‹œ๊ฐ„ 1,920 ns (0.000001920์ดˆ)์— ๋ถˆ๊ณผ
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋‚ด์—ญ: 49.4% Deviceโ†’Host, 36.0% memset, 14.6% Hostโ†’Device
  • ์•„์ฃผ ์ž‘์€ ๋ฐ์ดํ„ฐ: ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด 0.001 MB ๋ฏธ๋งŒ (float32 4๊ฐœ = 16๋ฐ”์ดํŠธ)

Step 4: ๊ตฌํ˜„ ๋น„๊ต

๋‹ค๋ฅธ ๋ฒ„์ „๋“ค์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜๊ณ  ๋น„๊ตํ•ฉ๋‹ˆ๋‹ค:

# pixi shell ์ƒํƒœ๋ฅผ ์œ ์ง€ํ•˜์„ธ์š” `pixi run -e nvidia`

# ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฒ„์ „ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_shared ./solutions/p16/p16_optimized --single-block

# Tiled ๋ฒ„์ „ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_tiled ./solutions/p16/p16_optimized --tiled

# ๊ด€์šฉ์  Tiled ๋ฒ„์ „ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_idiomatic_tiled ./solutions/p16/p16_optimized --idiomatic-tiled

# ๊ฐ ๊ตฌํ˜„์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ๋ถ„์„ (nsys stats๋Š” ํ•œ ๋ฒˆ์— ํ•˜๋‚˜์˜ ํŒŒ์ผ๋งŒ ์ฒ˜๋ฆฌ)
nsys stats --force-export=true matmul_shared.nsys-rep
nsys stats --force-export=true matmul_tiled.nsys-rep
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep

๊ฒฐ๊ณผ ๋น„๊ต ๋ฐฉ๋ฒ•:

  1. GPU Kernel Summary ํ™•์ธ - ๊ตฌํ˜„ ๊ฐ„ ์‹คํ–‰ ์‹œ๊ฐ„ ๋น„๊ต
  2. Memory Operations ํ™•์ธ - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ์ค„์ด๋Š”์ง€ ํ™•์ธ
  3. API ์˜ค๋ฒ„ํ—ค๋“œ ๋น„๊ต - ๋ชจ๋‘ ๋น„์Šทํ•œ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ํŒจํ„ด์„ ๊ฐ€์ ธ์•ผ ํ•จ

์ˆ˜๋™ ๋น„๊ต ์›Œํฌํ”Œ๋กœ์šฐ:

# ๊ฐ ๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•˜์—ฌ ๋น„๊ต
nsys stats --force-export=true matmul_naive.nsys-rep > naive_stats.txt
nsys stats --force-export=true matmul_shared.nsys-rep > shared_stats.txt
nsys stats --force-export=true matmul_tiled.nsys-rep > tiled_stats.txt
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep > idiomatic_tiled_stats.txt

๊ณต์ •ํ•œ ๋น„๊ต ๊ฒฐ๊ณผ (์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋ง ์ถœ๋ ฅ):

๋น„๊ต 1: 2 x 2 ํ–‰๋ ฌ

๊ตฌํ˜„                       ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น                ์ปค๋„ ์‹คํ–‰       ์„ฑ๋Šฅ
Naive                     81.9% cuMemAllocAsync    โœ… 1,920 ns   ๊ธฐ์ค€์„ 
Shared (--single-block)   81.8% cuMemAllocAsync    โœ… 1,984 ns   +3.3% ๋А๋ฆผ

๋น„๊ต 2: 9 x 9 ํ–‰๋ ฌ

๊ตฌํ˜„               ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น                ์ปค๋„ ์‹คํ–‰       ์„ฑ๋Šฅ
Tiled (์ˆ˜๋™)        81.1% cuMemAllocAsync    โœ… 2,048 ns   ๊ธฐ์ค€์„ 
Idiomatic Tiled   81.6% cuMemAllocAsync    โœ… 2,368 ns   +15.6% ๋А๋ฆผ

๊ณต์ • ๋น„๊ต์—์„œ ์–ป์€ ํ•ต์‹ฌ ํ†ต์ฐฐ:

๋‘ ํ–‰๋ ฌ ํฌ๊ธฐ ๋ชจ๋‘ GPU ์ž‘์—…์—๋Š” ๋„ˆ๋ฌด ์ž‘์Œ!:

  • 2ร—2 ํ–‰๋ ฌ: 4๊ฐœ ์š”์†Œ - ์™„์ „ํžˆ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ
  • 9ร—9 ํ–‰๋ ฌ: 81๊ฐœ ์š”์†Œ - ์—ฌ์ „ํžˆ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ง€๋ฐฐ
  • ์‹ค์ œ GPU ์›Œํฌ๋กœ๋“œ: ์ฐจ์›๋‹น ์ˆ˜์ฒœ~์ˆ˜๋ฐฑ๋งŒ ๊ฐœ ์š”์†Œ

์ด ๊ฒฐ๊ณผ๊ฐ€ ์‹ค์ œ๋กœ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ:

  • ๋ชจ๋“  ๋ณ€ํ˜•์ด ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น์— ์ง€๋ฐฐ๋จ (์‹œ๊ฐ„์˜ 81% ์ด์ƒ)
  • ์ปค๋„ ์‹คํ–‰์€ ์˜๋ฏธ ์—†์Œ - ์„ค์ • ๋น„์šฉ์— ๋น„ํ•˜๋ฉด ๋ฏธ๋ฏธ
  • โ€œ์ตœ์ ํ™”โ€œ๊ฐ€ ์˜คํžˆ๋ ค ํ•ด๋กœ์šธ ์ˆ˜ ์žˆ์Œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ 3.3%, async_copy๊ฐ€ 15.6% ์˜ค๋ฒ„ํ—ค๋“œ ์ถ”๊ฐ€
  • ์ง„์งœ ๊ตํ›ˆ: ์ž‘์€ ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ ํƒ์ด ๋ฌด์˜๋ฏธ - ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋ชจ๋“  ๊ฒƒ์„ ์••๋„

์ด๋Ÿฐ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๋Š” ์ด์œ :

  • GPU ์„ค์ • ๋น„์šฉ(๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ์ปค๋„ ์‹คํ–‰)์€ ๋ฌธ์ œ ํฌ๊ธฐ์— ๊ด€๊ณ„์—†์ด ๊ณ ์ •
  • ์ž‘์€ ๋ฌธ์ œ์—์„œ๋Š” ์ด ๊ณ ์ • ๋น„์šฉ์ด ์—ฐ์‚ฐ ์‹œ๊ฐ„์„ ๋ฌด์ƒ‰ํ•˜๊ฒŒ ๋งŒ๋“ฆ
  • ํฐ ๋ฌธ์ œ๋ฅผ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ตœ์ ํ™”๊ฐ€ ์ž‘์€ ๋ฌธ์ œ์—์„œ๋Š” ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๋จ

์‹ค๋ฌด ํ”„๋กœํŒŒ์ผ๋ง ๊ตํ›ˆ:

  • ๋ฌธ์ œ ํฌ๊ธฐ ๋งฅ๋ฝ์ด ์ค‘์š”: 2ร—2์™€ 9ร—9 ๋ชจ๋‘ GPU์—๊ฒŒ๋Š” ์ž‘์Œ
  • ๊ณ ์ • ๋น„์šฉ์ด ์ž‘์€ ๋ฌธ์ œ๋ฅผ ์ง€๋ฐฐ: ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น, ์ปค๋„ ์‹คํ–‰ ์˜ค๋ฒ„ํ—ค๋“œ
  • โ€œ์ตœ์ ํ™”โ€œ๊ฐ€ ์ž‘์€ ์›Œํฌ๋กœ๋“œ์— ํ•ด๋กœ์šธ ์ˆ˜ ์žˆ์Œ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์ด ์˜ค๋ฒ„ํ—ค๋“œ ์ถ”๊ฐ€
  • ์ž‘์€ ๋ฌธ์ œ๋ฅผ ์ตœ์ ํ™”ํ•˜์ง€ ๋ง ๊ฒƒ: ์‹ค์ œ ์›Œํฌ๋กœ๋“œ๋กœ ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์— ์ง‘์ค‘
  • ํ•ญ์ƒ ๋ฒค์น˜๋งˆํ‚นํ•  ๊ฒƒ: โ€œ๋” ์ข‹์€โ€ ์ฝ”๋“œ์— ๋Œ€ํ•œ ๊ฐ€์ •์€ ํ”ํžˆ ํ‹€๋ฆผ

์ž‘์€ ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง์˜ ์ดํ•ด: ์ด 2ร—2 ํ–‰๋ ฌ ์˜ˆ์ œ๋Š” ์ „ํ˜•์ ์ธ ์ž‘์€ ์ปค๋„ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  • ์‹ค์ œ ์—ฐ์‚ฐ(ํ–‰๋ ฌ ๊ณฑ์…ˆ)์€ ๊ทนํžˆ ๋น ๋ฆ„ (1,920 ns)
  • ๋ฉ”๋ชจ๋ฆฌ ์„ค์ • ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์ „์ฒด ์‹œ๊ฐ„์„ ์ง€๋ฐฐ (์‹คํ–‰์˜ 97% ์ด์ƒ)
  • ์ด๊ฒƒ์ด ์‹ค๋ฌด GPU ์ตœ์ ํ™”๊ฐ€ ๋‹ค์Œ์— ์ง‘์ค‘ํ•˜๋Š” ์ด์œ ์ž…๋‹ˆ๋‹ค:
    • ์—ฐ์‚ฐ ์ผ๊ด„ ์ฒ˜๋ฆฌ๋กœ ์„ค์ • ๋น„์šฉ ๋ถ„์‚ฐ
    • ๋ฉ”๋ชจ๋ฆฌ ์žฌ์‚ฌ์šฉ์œผ๋กœ ํ• ๋‹น ์˜ค๋ฒ„ํ—ค๋“œ ๊ฐ์†Œ
    • ์—ฐ์‚ฐ์ด ๋ณ‘๋ชฉ์ด ๋˜๋Š” ๋” ํฐ ๋ฌธ์ œ ํฌ๊ธฐ

์‹ค์Šต: NSight Compute๋กœ ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„

์ด์ œ ํŠน์ • ์ปค๋„์˜ ์„ฑ๋Šฅ ํŠน์„ฑ์„ ์‹ฌ์ธต์ ์œผ๋กœ ๋“ค์—ฌ๋‹ค๋ด…์‹œ๋‹ค.

Step 1: ํŠน์ • ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง

# ํ™œ์„ฑ shell ์ƒํƒœ์ธ์ง€ ํ™•์ธ
pixi shell -e nvidia

# Naive MatMul ์ปค๋„์„ ์ƒ์„ธ ํ”„๋กœํŒŒ์ผ๋ง (์ตœ์ ํ™” ๋นŒ๋“œ ์‚ฌ์šฉ)
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

ํ”ํ•œ ๋ฌธ์ œ: ๊ถŒํ•œ ์˜ค๋ฅ˜

ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ๋‹ค์Œ ํ•ด๊ฒฐ ๋ฐฉ๋ฒ•์„ ์‹œ๋„ํ•˜์„ธ์š”:

# NVIDIA ๋“œ๋ผ์ด๋ฒ„ ์˜ต์…˜ ์ถ”๊ฐ€ (rmmod๋ณด๋‹ค ์•ˆ์ „)
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

# ์ปค๋„ ํŒŒ๋ผ๋ฏธํ„ฐ ์„ค์ •
sudo sysctl -w kernel.perf_event_paranoid=0

# ์˜๊ตฌ ์ ์šฉ
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf

# ๋“œ๋ผ์ด๋ฒ„ ๋ณ€๊ฒฝ ์‚ฌํ•ญ ์ ์šฉ์„ ์œ„ํ•ด ์žฌ๋ถ€ํŒ… ํ•„์š”
sudo reboot

# ๊ทธ๋Ÿฐ ๋‹ค์Œ ncu ๋ช…๋ น์„ ๋‹ค์‹œ ์‹คํ–‰
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

Step 2: ์ฃผ์š” ์ง€ํ‘œ ๋ถ„์„

# ์ƒ์„ธ ๋ณด๊ณ ์„œ ์ƒ์„ฑ (์˜ฌ๋ฐ”๋ฅธ ๊ตฌ๋ฌธ)
ncu --import kernel_analysis.ncu-rep --page details

์‹ค์ œ NSight Compute ์ถœ๋ ฅ (2ร—2 Naive MatMul):

GPU Speed Of Light Throughput
----------------------- ----------- ------------
DRAM Frequency              Ghz         6.10
SM Frequency                Ghz         1.30
Elapsed Cycles            cycle         3733
Memory Throughput             %         1.02
DRAM Throughput               %         0.19
Duration                     us         2.88
Compute (SM) Throughput       %         0.00
----------------------- ----------- ------------

Launch Statistics
-------------------------------- --------------- ---------------
Block Size                                                     9
Grid Size                                                      1
Threads                           thread               9
Waves Per SM                                                0.00
-------------------------------- --------------- ---------------

Occupancy
------------------------------- ----------- ------------
Theoretical Occupancy                 %        33.33
Achieved Occupancy                    %         2.09
------------------------------- ----------- ------------

์‹ค์ œ ๋ฐ์ดํ„ฐ์—์„œ ์–ป์€ ํ•ต์‹ฌ ํ†ต์ฐฐ:

์„ฑ๋Šฅ ๋ถ„์„ - ๋ƒ‰ํ˜นํ•œ ํ˜„์‹ค

  • Compute Throughput: 0.00% - GPU๊ฐ€ ์—ฐ์‚ฐ์ ์œผ๋กœ ์™„์ „ํžˆ ์œ ํœด ์ƒํƒœ
  • Memory Throughput: 1.02% - ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ๊ฑฐ์˜ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ
  • Achieved Occupancy: 2.09% - GPU ๋Šฅ๋ ฅ์˜ 2%๋งŒ ์‚ฌ์šฉ ์ค‘
  • Grid Size: 1 ๋ธ”๋ก - 80๊ฐœ ๋ฉ€ํ‹ฐํ”„๋กœ์„ธ์„œ๋ฅผ ์™„์ „ํžˆ ๋‚ญ๋น„!

์„ฑ๋Šฅ์ด ์ด๋ ‡๊ฒŒ ๋‚ฎ์€ ์ด์œ 

  • ์ž‘์€ ๋ฌธ์ œ ํฌ๊ธฐ: 2ร—2 ํ–‰๋ ฌ = ์ด 4๊ฐœ ์š”์†Œ
  • ์ž˜๋ชป๋œ ์‹คํ–‰ ๊ตฌ์„ฑ: 1๊ฐœ ๋ธ”๋ก์— 9๊ฐœ ์Šค๋ ˆ๋“œ (32์˜ ๋ฐฐ์ˆ˜์—ฌ์•ผ ํ•จ)
  • ์‹ฌ๊ฐํ•œ ๊ณผ์†Œ ํ™œ์šฉ: SM๋‹น 0.00 wave (ํšจ์œจ์„ ์œ„ํ•ด ์ˆ˜์ฒœ ๊ฐœ ํ•„์š”)

NSight Compute์˜ ํ•ต์‹ฌ ์ตœ์ ํ™” ๊ถŒ๊ณ ์‚ฌํ•ญ

  • โ€œEst. Speedup: 98.75%โ€ - 80๊ฐœ SM์„ ๋ชจ๋‘ ์‚ฌ์šฉํ•˜๋„๋ก ๊ทธ๋ฆฌ๋“œ ํฌ๊ธฐ ์ฆ๊ฐ€
  • โ€œEst. Speedup: 71.88%โ€ - ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ 32์˜ ๋ฐฐ์ˆ˜๋กœ ์‚ฌ์šฉ
  • โ€œKernel grid is too smallโ€ - GPU ํšจ์œจ์„ ์œ„ํ•ด ํ›จ์”ฌ ํฐ ๋ฌธ์ œ ํ•„์š”
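권고사항이 말하는 낭비를 숫자로 보면 더 분명합니다. 본문 예시 수치(SM 80개, Grid Size 1, 블록당 9스레드)를 그대로 가정한 간단한 계산 스케치입니다.

```python
# 가정: SM 80개 GPU (본문 예시 기준)
num_sms = 80
grid_blocks = 1          # Grid Size = 1 (프로파일 결과)
threads_per_block = 9    # 32의 배수가 아님

busy_sms = min(grid_blocks, num_sms)
idle_sms = num_sms - busy_sms
inactive_lanes = (32 - threads_per_block % 32) % 32  # 워프에서 노는 레인 수

print(f"일하는 SM: {busy_sms}개 / 노는 SM: {idle_sms}개")
print(f"유일한 워프에서도 32레인 중 {inactive_lanes}개가 비활성")
```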

Step 3: ํ˜„์‹ค ์ง์‹œ

์ด ํ”„๋กœํŒŒ์ผ๋ง ๋ฐ์ดํ„ฐ๊ฐ€ ์•Œ๋ ค์ฃผ๋Š” ๊ฒƒ:

  1. ์ž‘์€ ๋ฌธ์ œ๋Š” GPU์—๊ฒŒ ๋…: 2ร—2 ํ–‰๋ ฌ์€ GPU ๋ฆฌ์†Œ์Šค๋ฅผ ์™„์ „ํžˆ ๋‚ญ๋น„
  2. ์‹คํ–‰ ๊ตฌ์„ฑ์ด ์ค‘์š”: ์ž˜๋ชป๋œ ์Šค๋ ˆ๋“œ/๋ธ”๋ก ํฌ๊ธฐ๊ฐ€ ์„ฑ๋Šฅ์„ ์ฃฝ์ž„
  3. ๊ทœ๋ชจ๊ฐ€ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณด๋‹ค ์ค‘์š”: ๊ทผ๋ณธ์ ์œผ๋กœ ์ž‘์€ ๋ฌธ์ œ๋Š” ์–ด๋–ค ์ตœ์ ํ™”๋กœ๋„ ํ•ด๊ฒฐ ๋ถˆ๊ฐ€
  4. NSight Compute๋Š” ์ •์งํ•จ: ์ปค๋„ ์„ฑ๋Šฅ์ด ๋‚ฎ์„ ๋•Œ ๊ทธ๋Œ€๋กœ ์•Œ๋ ค์คŒ

์ง„์งœ ๊ตํ›ˆ:

  • ํ† ์ด ๋ฌธ์ œ๋ฅผ ์ตœ์ ํ™”ํ•˜์ง€ ๋ง ๊ฒƒ - ์‹ค์ œ GPU ์›Œํฌ๋กœ๋“œ๋ฅผ ๋Œ€ํ‘œํ•˜์ง€ ์•Š์Œ
  • ํ˜„์‹ค์ ์ธ ์›Œํฌ๋กœ๋“œ์— ์ง‘์ค‘ - ์ตœ์ ํ™”๊ฐ€ ์‹ค์ œ๋กœ ์˜๋ฏธ ์žˆ๋Š” 1000ร—1000+ ํ–‰๋ ฌ
  • ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ์ตœ์ ํ™”๋ฅผ ์•ˆ๋‚ด - ๋‹จ, ์ตœ์ ํ™”ํ•  ๊ฐ€์น˜๊ฐ€ ์žˆ๋Š” ๋ฌธ์ œ์—๋งŒ

2×2 예제의 경우: 이미 오버헤드가 지배적인 워크로드에서는 정교한 알고리즘(공유 메모리, tiling)이 오버헤드만 추가합니다.

ํ”„๋กœํŒŒ์ผ๋Ÿฌ ์ถœ๋ ฅ์„ ์„ฑ๋Šฅ ํƒ์ •์ฒ˜๋Ÿผ ์ฝ๊ธฐ

์ž์ฃผ ๋‚˜ํƒ€๋‚˜๋Š” ์„ฑ๋Šฅ ํŒจํ„ด

ํŒจํ„ด 1: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„

  • NSight Systems가 보여주는 것: 긴 메모리 전송 시간
  • NSight Compute가 보여주는 것: 높은 메모리 처리량, 낮은 연산 활용도
  • 해결책: 메모리 접근 패턴 최적화, 공유 메모리 사용

ํŒจํ„ด 2: ๋‚ฎ์€ ์ ์œ ์œจ

  • NSight Systems가 보여주는 것: 짧은 커널 실행과 간격
  • NSight Compute가 보여주는 것: 실제 점유율이 낮음
  • 해결책: 레지스터 사용량 줄이기, 블록 크기 최적화

ํŒจํ„ด 3: ์›Œํ”„ ๋ถ„๊ธฐ

  • NSight Systems가 보여주는 것: 불규칙한 커널 실행 패턴
  • NSight Compute가 보여주는 것: 낮은 워프 실행 효율
  • 해결책: 조건 분기 최소화, 알고리즘 재구성

ํ”„๋กœํŒŒ์ผ๋ง ํƒ์ • ์›Œํฌํ”Œ๋กœ์šฐ

์„ฑ๋Šฅ ๋ฌธ์ œ ๋ฐœ์ƒ
     |
     v
NSight Systems: ์ „์ฒด ๊ทธ๋ฆผ
        |
        v
GPU๋ฅผ ์ž˜ ํ™œ์šฉํ•˜๊ณ  ์žˆ๋Š”๊ฐ€?
    |             |
  ์•„๋‹ˆ์˜ค           ์˜ˆ
    |             |
    v             v
CPU-GPU    NSight Compute: ์ปค๋„ ์ƒ์„ธ
ํŒŒ์ดํ”„๋ผ์ธ          |
์ˆ˜์ •               v
        ๋ฉ”๋ชจ๋ฆฌ ๋˜๋Š” ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ธ๊ฐ€?
          |       |       |
         ๋ฉ”๋ชจ๋ฆฌ   ์—ฐ์‚ฐ    ๋‘˜ ๋‹ค ์•„๋‹˜
          |       |       |
          v       v       v
        ๋ฉ”๋ชจ๋ฆฌ    ์‚ฐ์ˆ      ์ ์œ ์œจ
        ์ ‘๊ทผ     ์ตœ์ ํ™”    ํ™•์ธ
        ์ตœ์ ํ™”

ํ”„๋กœํŒŒ์ผ๋ง ๋ชจ๋ฒ” ์‚ฌ๋ก€

ํฌ๊ด„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ์ง€์นจ์€ Best Practices Guide - Performance Metrics๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

์ด๋ ‡๊ฒŒ ํ•˜์„ธ์š”

  1. ๋Œ€ํ‘œ์ ์ธ ์›Œํฌ๋กœ๋“œ๋ฅผ ํ”„๋กœํŒŒ์ผ๋ง: ํ˜„์‹ค์ ์ธ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์™€ ํŒจํ„ด ์‚ฌ์šฉ
  2. ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋กœ ๋นŒ๋“œ: ์ตœ์ ํ™”์™€ ํ•จ๊ป˜ ํฌ๊ด„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ๋ฐ์ดํ„ฐ ๋ฐ ์†Œ์Šค ๋งคํ•‘์„ ์œ„ํ•ด --debug-level=full ์‚ฌ์šฉ
  3. GPU ์›Œ๋ฐ์—…: ์ปค๋„์„ ์—ฌ๋Ÿฌ ๋ฒˆ ์‹คํ–‰ํ•œ ํ›„ ํ›„๋ฐ˜ ๋ฐ˜๋ณต์„ ํ”„๋กœํŒŒ์ผ๋ง
  4. ๋Œ€์•ˆ ๋น„๊ต: ํ•ญ์ƒ ์—ฌ๋Ÿฌ ๊ตฌํ˜„์„ ํ”„๋กœํŒŒ์ผ๋ง
  5. ํ•ซ์ŠคํŒŸ์— ์ง‘์ค‘: ๊ฐ€์žฅ ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š” ์ปค๋„์„ ์ตœ์ ํ™”

์ด๋ ‡๊ฒŒ ํ•˜์ง€ ๋งˆ์„ธ์š”

  1. ๋””๋ฒ„๊ทธ ์ •๋ณด ์—†์ด ํ”„๋กœํŒŒ์ผ๋งํ•˜์ง€ ๋ง ๊ฒƒ: ์„ฑ๋Šฅ์„ ์†Œ์Šค ์ฝ”๋“œ์— ๋งคํ•‘ํ•  ์ˆ˜ ์—†์Œ (mojo build --help)
  2. ๋‹จ์ผ ์‹คํ–‰๋งŒ ํ”„๋กœํŒŒ์ผ๋งํ•˜์ง€ ๋ง ๊ฒƒ: GPU ์„ฑ๋Šฅ์€ ์‹คํ–‰๋งˆ๋‹ค ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ
  3. ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ๋ฌด์‹œํ•˜์ง€ ๋ง ๊ฒƒ: CPU-GPU ์ „์†ก์ด ํ”ํžˆ ์ง€๋ฐฐ์ 
  4. ์„ฃ๋ถˆ๋ฆฌ ์ตœ์ ํ™”ํ•˜์ง€ ๋ง ๊ฒƒ: ๋จผ์ € ํ”„๋กœํŒŒ์ผ๋ง, ๊ทธ๋‹ค์Œ ์ตœ์ ํ™”

ํ”ํ•œ ํ•จ์ •๊ณผ ํ•ด๊ฒฐ์ฑ…

ํ•จ์ • 1: ์ฝœ๋“œ ์Šคํƒ€ํŠธ ํšจ๊ณผ

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ์ฒซ ๋ฒˆ์งธ ์‹คํ–‰์„ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile mojo your_program.mojo

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ์›Œ๋ฐ์—… ํ›„ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile --delay=5 mojo your_program.mojo  # GPU ์›Œ๋ฐ์—… ๋Œ€๊ธฐ

ํ•จ์ • 2: ์ž˜๋ชป๋œ ๋นŒ๋“œ ๊ตฌ์„ฑ

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ์ „์ฒด ๋””๋ฒ„๊ทธ ๋นŒ๋“œ (์ตœ์ ํ™” ๋น„ํ™œ์„ฑํ™”) ์ฆ‰, `--no-optimization`
mojo build -O0 your_program.mojo -o your_program

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: ๋””๋ฒ„๊ทธ ์ •๋ณด ์—†์Œ (์†Œ์Šค ๋งคํ•‘ ๋ถˆ๊ฐ€)
mojo build your_program.mojo -o your_program

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด ํฌํ•จ ์ตœ์ ํ™” ๋นŒ๋“œ
mojo build --debug-level=full your_program.mojo -o optimized_program
nsys profile ./optimized_program

ํ•จ์ • 3: ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋ฌด์‹œ

# NSight Systems์—์„œ ์ด ํŒจํ„ด์„ ์ฐพ์•„๋ณด์„ธ์š”:
CPU -> GPU transfer: 50ms
Kernel execution: 2ms
GPU -> CPU transfer: 48ms
# ์ด: 100ms (์ปค๋„์€ ๊ฒจ์šฐ 2%!)

ํ•ด๊ฒฐ์ฑ…: ์ „์†ก๊ณผ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜๊ณ  ์ „์†ก ๋นˆ๋„๋ฅผ ์ค„์ด๊ธฐ (Part IX์—์„œ ๋‹ค๋ฃธ)
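위 예시 수치를 그대로 넣으면 커널 비중이 얼마나 작은지 바로 계산할 수 있습니다.

```python
# 전송 시간이 지배적인 파이프라인에서 커널이 차지하는 비중 (위 예시 수치)
h2d_ms, kernel_ms, d2h_ms = 50, 2, 48
total_ms = h2d_ms + kernel_ms + d2h_ms

print(f"전체 {total_ms} ms 중 커널 비중: {kernel_ms / total_ms:.0%}")
```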

ํ•จ์ • 4: ๋‹จ์ผ ์ปค๋„์—๋งŒ ์ง‘์ค‘

# ์ž˜๋ชป๋œ ๋ฐฉ๋ฒ•: "๋А๋ฆฐ" ์ปค๋„๋งŒ ํ”„๋กœํŒŒ์ผ๋ง
ncu --kernel-name regex:slow_kernel program

# ์˜ฌ๋ฐ”๋ฅธ ๋ฐฉ๋ฒ•: ๋จผ์ € ์ „์ฒด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜์„ ํ”„๋กœํŒŒ์ผ๋ง
nsys profile mojo program.mojo  # ์‹ค์ œ ๋ณ‘๋ชฉ ์ฐพ๊ธฐ

๋ชจ๋ฒ” ์‚ฌ๋ก€์™€ ๊ณ ๊ธ‰ ์˜ต์…˜

๊ณ ๊ธ‰ NSight Systems ํ”„๋กœํŒŒ์ผ๋ง

ํฌ๊ด„์ ์ธ ์‹œ์Šคํ…œ ์ „์ฒด ๋ถ„์„์„ ์œ„ํ•ด ๋‹ค์Œ ๊ณ ๊ธ‰ nsys ํ”Œ๋ž˜๊ทธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

# ํ”„๋กœ๋•์…˜๊ธ‰ ํ”„๋กœํŒŒ์ผ๋ง ๋ช…๋ น
nsys profile \
  --gpu-metrics-devices=all \
  --trace=cuda,osrt,nvtx \
  --trace-fork-before-exec=true \
  --cuda-memory-usage=true \
  --cuda-um-cpu-page-faults=true \
  --cuda-um-gpu-page-faults=true \
  --opengl-gpu-workload=false \
  --delay=2 \
  --duration=30 \
  --sample=cpu \
  --cpuctxsw=process-tree \
  --output=comprehensive_profile \
  --force-overwrite=true \
  ./your_program

ํ”Œ๋ž˜๊ทธ ์„ค๋ช…:

  • --gpu-metrics-devices=all: ๋ชจ๋“  ๋””๋ฐ”์ด์Šค์—์„œ GPU ์ง€ํ‘œ ์ˆ˜์ง‘
  • --trace=cuda,osrt,nvtx: ํฌ๊ด„์  API ์ถ”์ 
  • --cuda-memory-usage=true: ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น/ํ•ด์ œ ์ถ”์ 
  • --cuda-um-cpu/gpu-page-faults=true: Unified Memory ํŽ˜์ด์ง€ ํดํŠธ ๋ชจ๋‹ˆํ„ฐ๋ง
  • --delay=2: ํ”„๋กœํŒŒ์ผ๋ง ์ „ 2์ดˆ ๋Œ€๊ธฐ (์ฝœ๋“œ ์Šคํƒ€ํŠธ ํšŒํ”ผ)
  • --duration=30: ์ตœ๋Œ€ 30์ดˆ๊ฐ„ ํ”„๋กœํŒŒ์ผ๋ง
  • --sample=cpu: ํ•ซ์ŠคํŒŸ ๋ถ„์„์„ ์œ„ํ•œ CPU ์ƒ˜ํ”Œ๋ง ํฌํ•จ
  • --cpuctxsw=process-tree: CPU ์ปจํ…์ŠคํŠธ ์Šค์œ„์น˜ ์ถ”์ 

๊ณ ๊ธ‰ NSight Compute ํ”„๋กœํŒŒ์ผ๋ง

ํฌ๊ด„์  ์ง€ํ‘œ๋ฅผ ํฌํ•จํ•œ ์ƒ์„ธ ์ปค๋„ ๋ถ„์„:

# ๋ชจ๋“  ์ง€ํ‘œ ์„ธํŠธ๋กœ ์ „์ฒด ์ปค๋„ ๋ถ„์„
ncu \
  --set full \
  --import-source=on \
  --kernel-id=:::1 \
  --launch-skip=0 \
  --launch-count=1 \
  --target-processes=all \
  --replay-mode=kernel \
  --cache-control=all \
  --clock-control=base \
  --apply-rules=yes \
  --check-exit-code=yes \
  --export=detailed_analysis \
  --force-overwrite \
  ./your_program

# ํŠน์ • ์„ฑ๋Šฅ ์ธก๋ฉด์— ์ง‘์ค‘
ncu \
  --set=@roofline \
  --section=InstructionStats \
  --section=LaunchStats \
  --section=Occupancy \
  --section=SpeedOfLight \
  --section=WarpStateStats \
  --metrics=sm__cycles_elapsed.avg,dram__throughput.avg.pct_of_peak_sustained_elapsed \
  --kernel-name regex:your_kernel_.* \
  --export=targeted_analysis \
  ./your_program

์ฃผ์š” NSight Compute ํ”Œ๋ž˜๊ทธ:

  • --set full: ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ๋ชจ๋“  ์ง€ํ‘œ ์ˆ˜์ง‘ (ํฌ๊ด„์ ์ด์ง€๋งŒ ๋А๋ฆผ)
  • --set @roofline: ๋ฃจํ”„๋ผ์ธ ๋ถ„์„์— ์ตœ์ ํ™”๋œ ์„ธํŠธ
  • --import-source=on: ๊ฒฐ๊ณผ๋ฅผ ์†Œ์Šค ์ฝ”๋“œ์— ๋งคํ•‘
  • --replay-mode=kernel: ์ •ํ™•ํ•œ ์ธก์ •์„ ์œ„ํ•ด ์ปค๋„ ๋ฆฌํ”Œ๋ ˆ์ด
  • --cache-control=all: ์ผ๊ด€๋œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ GPU ์บ์‹œ ์ œ์–ด
  • --clock-control=base: ๊ธฐ๋ณธ ์ฃผํŒŒ์ˆ˜๋กœ ํด๋Ÿญ ๊ณ ์ •
  • --section=SpeedOfLight: Speed of Light ๋ถ„์„ ํฌํ•จ
  • --metrics=...: ํŠน์ • ์ง€ํ‘œ๋งŒ ์ˆ˜์ง‘
  • --kernel-name regex:pattern: ์ •๊ทœ์‹ ํŒจํ„ด์œผ๋กœ ์ปค๋„ ์ง€์ • (--kernel-regex๊ฐ€ ์•„๋‹˜)

ํ”„๋กœํŒŒ์ผ๋ง ์›Œํฌํ”Œ๋กœ์šฐ ๋ชจ๋ฒ” ์‚ฌ๋ก€

1. ์ ์ง„์  ํ”„๋กœํŒŒ์ผ๋ง ์ „๋žต

# Step 1: ๋น ๋ฅธ ๊ฐœ์š” (๋น ๋ฆ„)
nsys profile --trace=cuda --duration=10 --output=quick_look ./program

# Step 2: ์ƒ์„ธ ์‹œ์Šคํ…œ ๋ถ„์„ (์ค‘๊ฐ„)
nsys profile --trace=cuda,osrt,nvtx --cuda-memory-usage=true --output=detailed ./program

# Step 3: ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„ (๋А๋ฆฌ์ง€๋งŒ ํฌ๊ด„์ )
ncu --set=@roofline --kernel-name regex:hotspot_kernel ./program

2. ์‹ ๋ขฐ์„ฑ์„ ์œ„ํ•œ ๋‹ค์ค‘ ์‹คํ–‰ ๋ถ„์„

# ์—ฌ๋Ÿฌ ๋ฒˆ ํ”„๋กœํŒŒ์ผ๋งํ•˜๊ณ  ๋น„๊ต
for i in {1..5}; do
  nsys profile --output=run_${i} ./program
  nsys stats run_${i}.nsys-rep > stats_${i}.txt
done

# ๊ฒฐ๊ณผ ๋น„๊ต
diff stats_1.txt stats_2.txt
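diff로 두 실행을 직접 비교하는 것 외에, 여러 실행의 평균과 변동계수(CV)를 보면 측정 신뢰성을 더 간단히 판단할 수 있습니다. 아래는 가상의 실행 시간 수치를 사용한 파이썬 스케치입니다.

```python
import statistics

# 가상의 예시: 5회 실행에서 얻은 커널 총 시간(ms)
run_times_ms = [171.2, 173.5, 170.8, 172.9, 171.6]

mean = statistics.mean(run_times_ms)
cv = statistics.stdev(run_times_ms) / mean  # 변동계수

# 경험적으로 CV가 몇 % 이내면 실행 간 편차가 작다고 볼 수 있음
print(f"평균 {mean:.1f} ms, CV {cv:.1%}")
```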

3. ํƒ€๊ฒŸ ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง

# ๋จผ์ € ํ•ซ์ŠคํŒŸ ์ปค๋„ ์‹๋ณ„
nsys profile --trace=cuda,nvtx --output=overview ./program
nsys stats overview.nsys-rep | grep -A 10 "GPU Kernel Summary"

# ๊ทธ๋Ÿฐ ๋‹ค์Œ ํŠน์ • ์ปค๋„ ํ”„๋กœํŒŒ์ผ๋ง
ncu --kernel-name="identified_hotspot_kernel" --set full ./program

ํ™˜๊ฒฝ ๋ฐ ๋นŒ๋“œ ๋ชจ๋ฒ” ์‚ฌ๋ก€

์ตœ์  ๋นŒ๋“œ ๊ตฌ์„ฑ

# ํ”„๋กœํŒŒ์ผ๋ง์šฉ: ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด ํฌํ•จ ์ตœ์ ํ™” ๋นŒ๋“œ
mojo build --debug-level=full --optimization-level=3 program.mojo -o program_profile

# ๋นŒ๋“œ ์„ค์ • ํ™•์ธ
mojo build --help | grep -E "(debug|optimization)"

ํ”„๋กœํŒŒ์ผ๋ง ํ™˜๊ฒฝ ์„ค์ •

# ์ผ๊ด€๋œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด GPU ๋ถ€์ŠคํŠธ ๋น„ํ™œ์„ฑํ™”
sudo nvidia-smi -ac 1215,1410  # ๋ฉ”๋ชจ๋ฆฌ ๋ฐ GPU ํด๋Ÿญ ๊ณ ์ •

# ๊ฒฐ์ •๋ก ์  ๋™์ž‘ ์„ค์ •
export CUDA_LAUNCH_BLOCKING=1  # ์ •ํ™•ํ•œ ํƒ€์ด๋ฐ์„ ์œ„ํ•œ ๋™๊ธฐ์‹ ์‹คํ–‰

# ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ๋“œ๋ผ์ด๋ฒ„ ์ œํ•œ ์™„ํ™”
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

๋ฉ”๋ชจ๋ฆฌ ๋ฐ ์„ฑ๋Šฅ ๊ฒฉ๋ฆฌ

# ํ”„๋กœํŒŒ์ผ๋ง ์ „ GPU ๋ฉ”๋ชจ๋ฆฌ ์ดˆ๊ธฐํ™”
nvidia-smi --gpu-reset

# ๋‹ค๋ฅธ GPU ํ”„๋กœ์„ธ์Šค ๋น„ํ™œ์„ฑํ™”
sudo fuser -v /dev/nvidia*  # GPU ์‚ฌ์šฉ ์ค‘์ธ ํ”„๋กœ์„ธ์Šค ํ™•์ธ
sudo pkill -f cuda  # ํ•„์š”์‹œ CUDA ํ”„๋กœ์„ธ์Šค ์ข…๋ฃŒ

# ๋†’์€ ์šฐ์„ ์ˆœ์œ„๋กœ ์‹คํ–‰
sudo nice -n -20 nsys profile ./program

๋ถ„์„ ๋ฐ ๋ณด๊ณ  ๋ชจ๋ฒ” ์‚ฌ๋ก€

์ข…ํ•ฉ ๋ณด๊ณ ์„œ ์ƒ์„ฑ

# ์—ฌ๋Ÿฌ ๋ณด๊ณ ์„œ ํ˜•์‹ ์ƒ์„ฑ
nsys stats --report=cuda_api_sum,cuda_gpu_kern_sum,cuda_gpu_mem_time_sum --format=csv --output=. profile.nsys-rep

# ์™ธ๋ถ€ ๋ถ„์„์„ ์œ„ํ•ด ๋‚ด๋ณด๋‚ด๊ธฐ
nsys export --type=sqlite profile.nsys-rep
nsys export --type=json profile.nsys-rep

# ๋น„๊ต ๋ณด๊ณ ์„œ ์ƒ์„ฑ
nsys stats --report=cuda_gpu_kern_sum baseline.nsys-rep > baseline_kernels.txt
nsys stats --report=cuda_gpu_kern_sum optimized.nsys-rep > optimized_kernels.txt
diff -u baseline_kernels.txt optimized_kernels.txt

์„ฑ๋Šฅ ํšŒ๊ท€ ํ…Œ์ŠคํŠธ

#!/bin/bash
# CI/CD์šฉ ์ž๋™ํ™” ํ”„๋กœํŒŒ์ผ๋ง ์Šคํฌ๋ฆฝํŠธ
BASELINE_TIME=$(nsys stats baseline.nsys-rep | grep "Total Time" | awk '{print $3}')
CURRENT_TIME=$(nsys stats current.nsys-rep | grep "Total Time" | awk '{print $3}')

REGRESSION_THRESHOLD=1.10  # 10% ์„ฑ๋Šฅ ์ €ํ•˜ ์ž„๊ณ„๊ฐ’
if (( $(echo "$CURRENT_TIME > $BASELINE_TIME * $REGRESSION_THRESHOLD" | bc -l) )); then
    echo "Performance regression detected: ${CURRENT_TIME}ns vs ${BASELINE_TIME}ns"
    exit 1
fi
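같은 판정 로직을 bc 대신 파이썬으로 옮기면 CI 스크립트에 넣기도 쉽습니다. 위 셸 스크립트와 동일한 10% 임계값 규칙을 따르는 스케치입니다(함수명은 예시).

```python
def is_regression(baseline_ns: float, current_ns: float, threshold: float = 1.10) -> bool:
    """기준 실행 시간 대비 threshold 배(기본 10% 증가)를 넘으면 회귀로 판정."""
    return current_ns > baseline_ns * threshold

# 사용 예
print(is_regression(100.0, 105.0))  # 5% 증가 → False
print(is_regression(100.0, 115.0))  # 15% 증가 → True
```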

๋‹ค์Œ ๋‹จ๊ณ„

ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ๋ฅผ ์ดํ•ดํ–ˆ์œผ๋‹ˆ:

  1. ๊ธฐ์กด ์ปค๋„๋กœ ์—ฐ์Šต: ์ด๋ฏธ ํ’€์—ˆ๋˜ ํผ์ฆ๋“ค์„ ํ”„๋กœํŒŒ์ผ๋งํ•ด ๋ณด์„ธ์š”
  2. ์ตœ์ ํ™” ์ค€๋น„: Puzzle 31์—์„œ ์ด ํ†ต์ฐฐ์„ ์ ์œ ์œจ ์ตœ์ ํ™”์— ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค
  3. ๋„๊ตฌ ์ตํžˆ๊ธฐ: ๋‹ค์–‘ํ•œ NSight Systems์™€ NSight Compute ์˜ต์…˜์„ ์‹คํ—˜ํ•ด ๋ณด์„ธ์š”

๊ธฐ์–ตํ•˜์„ธ์š”: ํ”„๋กœํŒŒ์ผ๋ง์€ ๋‹จ์ˆœํžˆ ๋А๋ฆฐ ์ฝ”๋“œ๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์•„๋‹™๋‹ˆ๋‹ค - ํ”„๋กœ๊ทธ๋žจ์˜ ๋™์ž‘์„ ์ดํ•ดํ•˜๊ณ  ๊ทผ๊ฑฐ ์žˆ๋Š” ์ตœ์ ํ™” ๊ฒฐ์ •์„ ๋‚ด๋ฆฌ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ถ”๊ฐ€ ํ”„๋กœํŒŒ์ผ๋ง ์ž๋ฃŒ:

๐Ÿ•ต ์บ์‹œ ํžˆํŠธ์˜ ์—ญ์„ค

๊ฐœ์š”

첫 번째 프로파일링 탐정 사건에 오신 것을 환영합니다! 세 개의 GPU 커널이 모두 동일한 벡터 덧셈 output[i] = a[i] + b[i]를 수행합니다. 당연히 성능도 같겠죠?

์•„๋‹™๋‹ˆ๋‹ค! ์ด ์ปค๋„๋“ค์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋Š” ๊ทน์ ์ž…๋‹ˆ๋‹ค - ํ•˜๋‚˜๋Š” ๋‚˜๋จธ์ง€๋ณด๋‹ค ์ˆ˜์‹ญ ๋ฐฐ๋‚˜ ๋А๋ฆฝ๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ๋ถ„์˜ ์ž„๋ฌด: ๋ฐฉ๊ธˆ ๋ฐฐ์šด ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์™œ ๊ทธ๋Ÿฐ์ง€ ๋ฐํ˜€๋‚ด์„ธ์š”.

๋„์ „ ๊ณผ์ œ

GPU ์ตœ์ ํ™”์— ๋Œ€ํ•œ ๊ธฐ์กด ์ƒ์‹์„ ์™„์ „ํžˆ ๋’ค์ง‘๋Š” ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ๋ˆˆ์•ž์—๋Š” ๊ฒ‰๋ณด๊ธฐ์— ๋™์ผํ•œ ๋ฒกํ„ฐ ๋ง์…ˆ ์ปค๋„ ์„ธ ๊ฐœ๊ฐ€ ์žˆ๊ณ , ๋ชจ๋‘ ์ •ํ™•ํžˆ ๊ฐ™์€ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค:

output[i] = a[i] + b[i]  // ๋‹จ์ˆœํ•œ ์‚ฐ์ˆ  ์—ฐ์‚ฐ - ๋ญ๊ฐ€ ์ž˜๋ชป๋  ์ˆ˜ ์žˆ์„๊นŒ?

์ถฉ๊ฒฉ์ ์ธ ํ˜„์‹ค:

  • ์„ธ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•˜๊ณ  ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ํ•˜๋‚˜์˜ ์ปค๋„์ด ๋‚˜๋จธ์ง€๋ณด๋‹ค ~50๋ฐฐ ๋А๋ฆฝ๋‹ˆ๋‹ค
  • ๊ฐ€์žฅ ๋А๋ฆฐ ์ปค๋„์ด ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ž…๋‹ˆ๋‹ค (์˜ˆ์ƒ๊ณผ ์ •๋ฐ˜๋Œ€!)
  • ์ผ๋ฐ˜์ ์ธ ์„ฑ๋Šฅ ์ง๊ด€์ด ์™„์ „ํžˆ ๋น—๋‚˜๊ฐ‘๋‹ˆ๋‹ค

ํƒ์ • ์ž„๋ฌด:

  1. ์„ฑ๋Šฅ ๋ฒ”์ธ ์‹๋ณ„ - ์–ด๋–ค ์ปค๋„์ด ์น˜๋ช…์ ์œผ๋กœ ๋А๋ฆฐ๊ฐ€?
  2. ์บ์‹œ์˜ ์—ญ์„ค ๊ทœ๋ช… - ๋†’์€ ์บ์‹œ ํžˆํŠธ๊ฐ€ ์™œ ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ์˜๋ฏธํ•˜๋Š”๊ฐ€?
  3. ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํ•ด๋… - ๋™์ผํ•œ ์—ฐ์‚ฐ์ด ์–ด๋–ป๊ฒŒ ์ด๋ ‡๊ฒŒ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•˜๋Š”๊ฐ€?
  4. ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก  ํ•™์Šต - ์ถ”์ธก์ด ์•„๋‹Œ NSight ๋„๊ตฌ๋กœ ๊ทผ๊ฑฐ๋ฅผ ํ™•๋ณดํ•˜๋ผ

์™œ ์ค‘์š”ํ•œ๊ฐ€: ์ด ํผ์ฆ์€ CPU ๊ธฐ๋ฐ˜ ์ง๊ด€์— ๋„์ „ํ•˜๋Š” GPU ์„ฑ๋Šฅ์˜ ๊ทผ๋ณธ ์›๋ฆฌ๋ฅผ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๊ธฐ๋ฅด๋Š” ์—ญ๋Ÿ‰์€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„๋ณด๋‹ค ์ค‘์š”ํ•œ ์‹ค๋ฌด GPU ์ตœ์ ํ™”์— ์ง์ ‘ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

๋ฐ˜์ „: ์ด ๊ณผ์ •์€ ํ”„๋กœ๋•์…˜ ์„ฑ๋Šฅ ์ด์Šˆ๋ฅผ ๋””๋ฒ„๊น…ํ•˜๋“ฏ์ด, ์†Œ์Šค ์ฝ”๋“œ๋ฅผ ๋จผ์ € ๋ณด์ง€ ์•Š๊ณ  ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋งŒ์œผ๋กœ ์ ‘๊ทผํ•ฉ๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ๋ฅผ ์–ป์€ ํ›„์— ์ฝ”๋“œ๋ฅผ ๋“ค์—ฌ๋‹ค๋ด…๋‹ˆ๋‹ค.

ํƒ์ • ๋„๊ตฌ ๋ชจ์Œ

ํ”„๋กœํŒŒ์ผ๋ง ํŠœํ† ๋ฆฌ์–ผ์—์„œ ๋ฐฐ์šด ๋„๊ตฌ๋“ค:

  • NSight Systems (nsys) - ์–ด๋–ค ์ปค๋„์ด ๋А๋ฆฐ์ง€ ์ฐพ๊ธฐ
  • NSight Compute (ncu) - ์ปค๋„์ด ์™œ ๋А๋ฆฐ์ง€ ๋ถ„์„ํ•˜๊ธฐ
  • ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ ์ง€ํ‘œ - ๋น„ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด ํƒ์ง€

์‹œ์ž‘ํ•˜๊ธฐ

Step 1: ๋ฒค์น˜๋งˆํฌ ์‹คํ–‰

pixi shell -e nvidia
mojo problems/p30/p30.mojo --benchmark

์ปค๋„ ๊ฐ„์— ๊ทน์ ์ธ ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค! ํ•˜๋‚˜์˜ ์ปค๋„์ด ๋‚˜๋จธ์ง€๋ณด๋‹ค ํ›จ์”ฌ ๋А๋ฆฝ๋‹ˆ๋‹ค. ์ฝ”๋“œ๋ฅผ ๋ณด์ง€ ์•Š๊ณ  ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋งŒ์œผ๋กœ ์›์ธ์„ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค.

์ถœ๋ ฅ ์˜ˆ์‹œ:

| name    | met (ms)  | iters |
| ------- | --------- | ----- |
| kernel1 | 171.85    | 11    |
| kernel2 | 1546.68   | 11    |  <- ์ด๊ฒƒ๋งŒ ์œ ๋… ๋А๋ฆฌ๋‹ค!
| kernel3 | 172.18    | 11    |

Step 2: ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•œ ๋นŒ๋“œ ์ค€๋น„

ํ•„์ˆ˜: ์ •ํ™•ํ•œ ํ”„๋กœํŒŒ์ผ๋ง์„ ์œ„ํ•ด ์ตœ์ ํ™”๋ฅผ ์œ ์ง€ํ•˜๋ฉด์„œ ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด๋ฅผ ํฌํ•จํ•˜์—ฌ ๋นŒ๋“œํ•ฉ๋‹ˆ๋‹ค:

mojo build --debug-level=full problems/p30/p30.mojo -o problems/p30/p30_profiler

์ค‘์š”ํ•œ ์ด์œ :

  • ์ „์ฒด ๋””๋ฒ„๊ทธ ์ •๋ณด: ํ”„๋กœํŒŒ์ผ๋Ÿฌ์— ์™„์ „ํ•œ ์‹ฌ๋ณผ ํ…Œ์ด๋ธ”, ๋ณ€์ˆ˜๋ช…, ์†Œ์Šค ๋ผ์ธ ๋งคํ•‘์„ ์ œ๊ณต
  • ์ข…ํ•ฉ ๋ถ„์„: NSight ๋„๊ตฌ๊ฐ€ ์„ฑ๋Šฅ ๋ฐ์ดํ„ฐ๋ฅผ ํŠน์ • ์ฝ”๋“œ ์œ„์น˜์™€ ์—ฐ๊ด€ ์ง“๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅ
  • ์ตœ์ ํ™” ์œ ์ง€: ํ”„๋กœ๋•์…˜ ๋นŒ๋“œ์™€ ๋™์ผํ•œ ํ˜„์‹ค์ ์ธ ์„ฑ๋Šฅ ์ธก์ • ๋ณด์žฅ

Step 3: ์‹œ์Šคํ…œ ์ „์ฒด ์กฐ์‚ฌ (NSight Systems)

๊ฐ ์ปค๋„์„ ํ”„๋กœํŒŒ์ผ๋งํ•˜์—ฌ ์ „์ฒด ๊ทธ๋ฆผ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

# ์ตœ์ ํ™” ๋นŒ๋“œ๋กœ ๊ฐ ์ปค๋„์„ ๊ฐœ๋ณ„ ํ”„๋กœํŒŒ์ผ๋ง (์ฝœ๋“œ ์Šคํƒ€ํŠธ ๋ฐฉ์ง€๋ฅผ ์œ„ํ•œ ์›Œ๋ฐ์—… ํฌํ•จ)
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel1_profile ./problems/p30/p30_profiler --kernel1
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel2_profile ./problems/p30/p30_profiler --kernel2
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel3_profile ./problems/p30/p30_profiler --kernel3

# ๊ฒฐ๊ณผ ๋ถ„์„
nsys stats --force-export=true ./problems/p30/kernel1_profile.nsys-rep > ./problems/p30/kernel1_profile.txt
nsys stats --force-export=true ./problems/p30/kernel2_profile.nsys-rep > ./problems/p30/kernel2_profile.txt
nsys stats --force-export=true ./problems/p30/kernel3_profile.nsys-rep > ./problems/p30/kernel3_profile.txt

ํ™•์ธํ•  ์‚ฌํ•ญ:

  • GPU ์ปค๋„ ์š”์•ฝ - ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ์˜ค๋ž˜ ๊ฑธ๋ฆฌ๋Š”๊ฐ€?
  • ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„ - ์ฐจ์ด๊ฐ€ ์–ผ๋งˆ๋‚˜ ๋‚˜๋Š”๊ฐ€?
  • ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ํŒจํ„ด - ๊ตฌํ˜„ ๊ฐ„์— ๋น„์Šทํ•œ๊ฐ€?

Step 4: ์ปค๋„ ์‹ฌ์ธต ๋ถ„์„ (NSight Compute)

๋А๋ฆฐ ์ปค๋„์„ ์‹๋ณ„ํ•œ ํ›„, NSight Compute๋กœ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค:

# ์ตœ์ ํ™” ๋นŒ๋“œ๋กœ ๊ฐ ์ปค๋„์˜ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด ์‹ฌ์ธต ๋ถ„์„
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel1_analysis ./problems/p30/p30_profiler --kernel1
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel2_analysis ./problems/p30/p30_profiler --kernel2
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel3_analysis ./problems/p30/p30_profiler --kernel3

# ๊ฒฐ๊ณผ ํ™•์ธ
ncu --import ./problems/p30/kernel1_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel2_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel3_analysis.ncu-rep --page details

์œ„ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ถœ๋ ฅ์ด ๋‚˜ํƒ€๋‚ฉ๋‹ˆ๋‹ค:

Kernel1: Memory Throughput: ~308 Gbyte/s, Max Bandwidth: ~51%
Kernel2: Memory Throughput: ~6 Gbyte/s,   Max Bandwidth: ~12%
Kernel3: Memory Throughput: ~310 Gbyte/s, Max Bandwidth: ~52%
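이 수치만으로도 본문에서 말하는 수십 배 격차를 확인할 수 있습니다:

```python
# 위 NSight Compute 결과의 메모리 처리량(GB/s)
throughput_gbs = {"kernel1": 308, "kernel2": 6, "kernel3": 310}

fastest = max(throughput_gbs.values())
slowest = min(throughput_gbs.values())
print(f"처리량 격차: 약 {fastest / slowest:.0f}배")  # 310/6 ≈ 52배
```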

์ฃผ์š” ์กฐ์‚ฌ ์ง€ํ‘œ:

  • Memory Throughput (Gbyte/s) - ์‹ค์ œ ๋‹ฌ์„ฑํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  • Max Bandwidth (%) - ์ด๋ก ์  ์ตœ๋Œ€ ๋Œ€์—ญํญ ๋Œ€๋น„ ํ™œ์šฉ๋ฅ 
  • L1/TEX Hit Rate (%) - L1 ์บ์‹œ ํšจ์œจ
  • L2 Hit Rate (%) - L2 ์บ์‹œ ํšจ์œจ

๐Ÿค” ๋ฐ˜์ง๊ด€์ ์ธ ๊ฒฐ๊ณผ: Kernel2๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ด๋ฉด์„œ ๊ฐ€์žฅ ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ž…๋‹ˆ๋‹ค! ์ด๊ฒƒ์ด ํ’€์–ด์•ผ ํ•  ํ•ต์‹ฌ ๋ฏธ์Šคํ„ฐ๋ฆฌ์ž…๋‹ˆ๋‹ค.

Step 5: ํƒ์ • ์งˆ๋ฌธ

ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ปค๋„ ์ฝ”๋“œ problems/p30/p30.mojo๋ฅผ ์‚ดํŽด๋ณด๋ฉฐ ๋‹ค์Œ ์งˆ๋ฌธ์— ๋‹ตํ•ด ๋ณด์„ธ์š”:

์„ฑ๋Šฅ ๋ถ„์„

  1. ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋†’์€ Memory Throughput์„ ๋‹ฌ์„ฑํ•˜๋Š”๊ฐ€? (Gbyte/s ๊ฐ’ ํ™•์ธ)
  2. ์–ด๋–ค ์ปค๋„์˜ Max Bandwidth ํ™œ์šฉ๋ฅ ์ด ๊ฐ€์žฅ ๋‚ฎ์€๊ฐ€? (๋ฐฑ๋ถ„์œจ ๋น„๊ต)
  3. ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์˜ ์„ฑ๋Šฅ ๊ฒฉ์ฐจ๋Š” ์–ผ๋งˆ์ธ๊ฐ€? (๊ฐ€์žฅ ๋น ๋ฅธ ๊ฒƒ๊ณผ ๊ฐ€์žฅ ๋А๋ฆฐ ๊ฒƒ์˜ ๋ฐฐ์ˆ˜ ์ฐจ์ด)

์บ์‹œ์˜ ์—ญ์„ค

  1. ์–ด๋–ค ์ปค๋„์˜ L1/TEX Hit Rate๊ฐ€ ๊ฐ€์žฅ ๋†’์€๊ฐ€?
  2. ์–ด๋–ค ์ปค๋„์˜ L2 Hit Rate๊ฐ€ ๊ฐ€์žฅ ๋†’์€๊ฐ€?
  3. ๐Ÿคฏ ์บ์‹œ ํžˆํŠธ์œจ์ด ๊ฐ€์žฅ ๋†’์€ ์ปค๋„์ด ์™œ ์„ฑ๋Šฅ์ด ๊ฐ€์žฅ ๋‚˜์œ๊ฐ€?

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด ํƒ๊ตฌ

  1. ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์ด ์‹ค์ œ๋กœ ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ๋Š”๊ฐ€?
  2. ์–ด๋–ค ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๋†’์€ ์บ์‹œ ํžˆํŠธ์™€ ๋‚ฎ์€ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ๋™์‹œ์— ์œ ๋ฐœํ•˜๋Š”๊ฐ€?
  3. ์™œ โ€œํšจ์œจ์ ์ธ ์บ์‹ฑโ€œ์ด โ€œ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผโ€œ์˜ ์ฆ์ƒ์ผ ์ˆ˜ ์žˆ๋Š”๊ฐ€?

โ€œ์•„ํ•˜!โ€ ์ˆœ๊ฐ„

  1. ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ, ์ด ์‚ฌ๋ก€๊ฐ€ ๋ณด์—ฌ์ฃผ๋Š” GPU ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ทผ๋ณธ ์›๋ฆฌ๋Š” ๋ฌด์—‡์ธ๊ฐ€?

๋ฐœ๊ฒฌํ•  ํ•ต์‹ฌ ํ†ต์ฐฐ: ๋•Œ๋กœ๋Š” ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์ด ์„ฑ๋Šฅ ์Šน๋ฆฌ๊ฐ€ ์•„๋‹ˆ๋ผ ์œ„ํ—˜ ์‹ ํ˜ธ์ž…๋‹ˆ๋‹ค!

์†”๋ฃจ์…˜

์ด ๋ฏธ์Šคํ„ฐ๋ฆฌ๋Š” GPU ์„ฑ๋Šฅ์˜ ๊ทผ๋ณธ ์›๋ฆฌ๋ฅผ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค: ์ปค๋„์ด ๋™์ผํ•œ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋”๋ผ๋„ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์„ฑ๋Šฅ์„ ์ง€๋ฐฐํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ๊ฐ€ ๋ฐํžˆ๋Š” ๊ฒƒ:

  1. ์„ฑ๋Šฅ ์œ„๊ณ„: Kernel1๊ณผ Kernel3์€ ๋น ๋ฅด๊ณ , Kernel2๋Š” ์น˜๋ช…์ ์œผ๋กœ ๋А๋ฆผ (์ˆ˜์‹ญ ๋ฐฐ ์ฐจ์ด)
  2. ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์ด ๋‹ต์„ ๋งํ•ด์ค€๋‹ค: ๋น ๋ฅธ ์ปค๋„์€ ๋†’์€ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์„ ๋‹ฌ์„ฑํ•˜๊ณ , ๋А๋ฆฐ ์ปค๋„์€ ์ตœ์†Œํ•œ์˜ ํ™œ์šฉ๋ฅ ๋งŒ ๋‹ฌ์„ฑ
  3. ์บ์‹œ์˜ ์—ญ์„ค: ๊ฐ€์žฅ ๋А๋ฆฐ ์ปค๋„์ด ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ž„ - ๋†’์€ ์บ์‹œ ํžˆํŠธ๊ฐ€ ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ์„ ์‹œ์‚ฌ
  4. ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ GPU ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„๋ณด๋‹ค ์ค‘์š”
์ƒ์„ธ ์†”๋ฃจ์…˜๊ณผ ์‹ฌ์ธต ์„ค๋ช…

์ด ํ”„๋กœํŒŒ์ผ๋ง ํƒ์ • ์‚ฌ๊ฑด์€ ์ปค๋„์ด ๋™์ผํ•œ ์ˆ˜ํ•™ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋”๋ผ๋„ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์–ด๋–ป๊ฒŒ ์ˆ˜์‹ญ ๋ฐฐ์˜ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š”์ง€ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ํ™•์ธํ•œ ์„ฑ๋Šฅ ๊ทผ๊ฑฐ

NSight Systems ํƒ€์ž„๋ผ์ธ ๋ถ„์„:

  • Kernel 1: ์งง์€ ์‹คํ–‰ ์‹œ๊ฐ„ - ํšจ์œจ์ 
  • Kernel 3: Kernel 1๊ณผ ์œ ์‚ฌ - ํšจ์œจ์ 
  • Kernel 2: ๊ทน์ ์œผ๋กœ ๊ธด ์‹คํ–‰ ์‹œ๊ฐ„ - ๋น„ํšจ์œจ์ 

NSight Compute ๋ฉ”๋ชจ๋ฆฌ ๋ถ„์„ (ํ•˜๋“œ์›จ์–ด ๋ฌด๊ด€ํ•œ ํŒจํ„ด):

  • ํšจ์œจ์ ์ธ ์ปค๋„ (1 & 3): ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰, ์–‘ํ˜ธํ•œ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ , ๋ณดํ†ต ์ˆ˜์ค€์˜ ์บ์‹œ ํžˆํŠธ์œจ
  • ๋น„ํšจ์œจ์ ์ธ ์ปค๋„ (2): ๋งค์šฐ ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰, ์—ด์•…ํ•œ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ , ๊ทน๋„๋กœ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ

์บ์‹œ์˜ ์—ญ์„ค ๊ทœ๋ช…

๐Ÿคฏ ๋ฐ˜์ง๊ด€์ ์ธ ๋ฐœ๊ฒฌ:

  • Kernel2๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ณด์ด๋ฉด์„œ ์„ฑ๋Šฅ์€ ์ตœ์•…
  • ๊ธฐ์กด ์ƒ์‹์— ๋Œ€ํ•œ ๋„์ „: โ€œ๋†’์€ ์บ์‹œ ํžˆํŠธ = ์ข‹์€ ์„ฑ๋Šฅโ€
  • ์ง„์‹ค: ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์€ ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์˜ ์ฆ์ƒ์ผ ์ˆ˜ ์žˆ์Œ

์บ์‹œ์˜ ์—ญ์„ค์ด ๋ฐœ์ƒํ•˜๋Š” ์ด์œ :

์ „ํ†ต์ ์ธ CPU ์ง๊ด€ (GPU์—์„œ๋Š” ํ‹€๋ฆผ):

  • ์บ์‹œ ํžˆํŠธ์œจ์ด ๋†’์„์ˆ˜๋ก ํ•ญ์ƒ ์„ฑ๋Šฅ์ด ์ข‹๋‹ค
  • ์บ์‹œ ํžˆํŠธ๋Š” ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ์„ ์ค„์—ฌ ํšจ์œจ์„ ๋†’์ธ๋‹ค

GPU ๋ฉ”๋ชจ๋ฆฌ์˜ ํ˜„์‹ค (์˜ฌ๋ฐ”๋ฅธ ์ดํ•ด):

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ๋ณ‘ํ•ฉ์ด ์บ์‹ฑ๋ณด๋‹ค ์ค‘์š”
  • ๋น„ํšจ์œจ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์€ ์ธ์œ„์ ์œผ๋กœ ์บ์‹œ ํžˆํŠธ์œจ์„ ๋ถ€ํ’€๋ฆด ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์ง„์ •ํ•œ ์„ฑ๋Šฅ ์ง€ํ‘œ

๊ทผ๋ณธ ์›์ธ ๋ถ„์„ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

p30.mojo์˜ ์‹ค์ œ ์ปค๋„ ๊ตฌํ˜„:

Kernel 1 - ํšจ์œจ์ ์ธ ๋ณ‘ํ•ฉ ์ ‘๊ทผ:

def kernel1(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + b[i]


ํ‘œ์ค€ ์Šค๋ ˆ๋“œ ์ธ๋ฑ์‹ฑ - ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ

Kernel 2 - ๋น„ํšจ์œจ์ ์ธ ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ:

def kernel2(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var tid = block_idx.x * block_dim.x + thread_idx.x
    var stride = 512

    var i = tid
    while i < size:
        output[i] = a[i] + b[i]
        i += stride


ํฐ stride=512๋กœ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ฐ„๊ฒฉ ๋ฐœ์ƒ - ๋™์ผํ•œ ์—ฐ์‚ฐ์ด์ง€๋งŒ ํฉ์–ด์ง„ ์ ‘๊ทผ

Kernel 3 - ํšจ์œจ์ ์ธ ์—ญ์ˆœ ์ ‘๊ทผ:

def kernel3(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var tid = block_idx.x * block_dim.x + thread_idx.x
    var total_threads = (SIZE // 1024) * 1024

    for step in range(0, size, total_threads):
        var forward_i = step + tid
        if forward_i < size:
            var reverse_i = size - 1 - forward_i
            output[reverse_i] = a[reverse_i] + b[reverse_i]


์—ญ์ˆœ ์ธ๋ฑ์‹ฑ์ด์ง€๋งŒ ์—ฌ์ „ํžˆ ์˜ˆ์ธก ๊ฐ€๋Šฅ - ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ์ฃผ์†Œ์— ์ ‘๊ทผ (๋ฐฉํ–ฅ๋งŒ ๋ฐ˜๋Œ€)

ํŒจํ„ด ๋ถ„์„:

  • Kernel 1: ์ „ํ˜•์ ์ธ ๋ณ‘ํ•ฉ ์ ‘๊ทผ - ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผ
  • Kernel 2: ์น˜๋ช…์ ์ธ ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ - ์Šค๋ ˆ๋“œ๊ฐ€ 512๊ฐœ ์š”์†Œ์”ฉ ๊ฑด๋„ˆ๋œ€
  • Kernel 3: ์—ญ์ˆœ์ด์ง€๋งŒ ์›Œํ”„ ๋‚ด์—์„œ๋Š” ๋ณ‘ํ•ฉ ์œ ์ง€ - ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ํŒจํ„ด

๋ฉ”๋ชจ๋ฆฌ ์‹œ์Šคํ…œ ์ดํ•ด

GPU ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜ ๊ธฐ์ดˆ:

  • ์›Œํ”„ ์‹คํ–‰: 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ ํ•จ๊ป˜ ์‹คํ–‰
  • ์บ์‹œ ๋ผ์ธ ํฌ๊ธฐ: 128๋ฐ”์ดํŠธ (float32 ๊ฐ’ 32๊ฐœ)
  • ๋ณ‘ํ•ฉ ์š”๊ฑด: ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•ด์•ผ ํ•จ

p30.mojo ์„ค์ • ์ƒ์„ธ:

comptime SIZE = 16 * 1024 * 1024          # 16M ์š”์†Œ (float32 ๋ฐ์ดํ„ฐ 64MB)
comptime THREADS_PER_BLOCK = (1024, 1)    # ๋ธ”๋ก๋‹น 1024 ์Šค๋ ˆ๋“œ
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)  # ์ด 16,384 ๋ธ”๋ก
comptime dtype = DType.float32             # ์š”์†Œ๋‹น 4๋ฐ”์ดํŠธ

์ด ์„ค์ •์ด ์ค‘์š”ํ•œ ์ด์œ :

  • ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹ (16M): ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์˜ ์ฐจ์ด๊ฐ€ ๋ช…ํ™•ํ•˜๊ฒŒ ๋“œ๋Ÿฌ๋‚จ
  • ๋ธ”๋ก๋‹น 1024 ์Šค๋ ˆ๋“œ: CUDA ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ์ˆ˜
  • ๋ธ”๋ก๋‹น 32๊ฐœ ์›Œํ”„: ๊ฐ ๋ธ”๋ก์— 32๊ฐœ์˜ ์›Œํ”„(๊ฐ 32 ์Šค๋ ˆ๋“œ)๊ฐ€ ํฌํ•จ

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ ์‹œ๊ฐํ™”:

KERNEL 1 (๋ณ‘ํ•ฉ):                KERNEL 2 (stride 512):
์›Œํ”„ ์Šค๋ ˆ๋“œ 0-31:               ์›Œํ”„ ์Šค๋ ˆ๋“œ 0-31:
  Thread 0: Memory[0]            Thread 0: Memory[0]
  Thread 1: Memory[1]            Thread 1: Memory[512]
  Thread 2: Memory[2]            Thread 2: Memory[1024]
  ...                           ...
  Thread 31: Memory[31]          Thread 31: Memory[15872]

๊ฒฐ๊ณผ: ์บ์‹œ ๋ผ์ธ 1ํšŒ fetch          ๊ฒฐ๊ณผ: ๋ณ„๋„์˜ ์บ์‹œ ๋ผ์ธ 32ํšŒ fetch
์ƒํƒœ: ~308 GB/s ์ฒ˜๋ฆฌ๋Ÿ‰            ์ƒํƒœ: ~6 GB/s ์ฒ˜๋ฆฌ๋Ÿ‰
์บ์‹œ: ํšจ์œจ์  ํ™œ์šฉ                  ์บ์‹œ: ๊ฐ™์€ ๋ผ์ธ์„ ๋ฐ˜๋ณต ํžˆํŠธ!

KERNEL 3 (์—ญ์ˆœ์ด์ง€๋งŒ ๋ณ‘ํ•ฉ):

์›Œํ”„ ์Šค๋ ˆ๋“œ 0-31 (์ฒซ ๋ฒˆ์งธ ๋ฐ˜๋ณต):
  Thread 0: Memory[SIZE-1]     (reverse_i = SIZE-1-0)
  Thread 1: Memory[SIZE-2]     (reverse_i = SIZE-1-1)
  Thread 2: Memory[SIZE-3]     (reverse_i = SIZE-1-2)
  ...
  Thread 31: Memory[SIZE-32]   (reverse_i = SIZE-1-31)

๊ฒฐ๊ณผ: ์ธ์ ‘ํ•œ ์ฃผ์†Œ (๋ฐฉํ–ฅ๋งŒ ๋ฐ˜๋Œ€)
์ƒํƒœ: ~310 GB/s ์ฒ˜๋ฆฌ๋Ÿ‰ (Kernel 1๊ณผ ๊ฑฐ์˜ ๋™์ผ)
์บ์‹œ: ์—ญ์ˆœ์ž„์—๋„ ํšจ์œจ์  ํ™œ์šฉ

์บ์‹œ์˜ ์—ญ์„ค ์„ค๋ช…

Kernel2 (stride=512)๊ฐ€ ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์—๋„ ์„ฑ๋Šฅ์ด ๋‚˜์œ ์ด์œ :

stride=512์˜ ์žฌ์•™ ์„ค๋ช…:

# ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ํฐ ๊ฐ„๊ฒฉ์œผ๋กœ ์—ฌ๋Ÿฌ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌ:
Thread 0: elements [0, 512, 1024, 1536, 2048, ...]
Thread 1: elements [1, 513, 1025, 1537, 2049, ...]
Thread 2: elements [2, 514, 1026, 1538, 2050, ...]
...

์ด๊ฒƒ์ด ์บ์‹œ์˜ ์—ญ์„ค์„ ๋งŒ๋“œ๋Š” ์ด์œ :

  1. ์บ์‹œ ๋ผ์ธ ๋ฐ˜๋ณต: 512๊ฐœ ์š”์†Œ๋ฅผ ๊ฑด๋„ˆ๋›ฐ์–ด๋„ ๊ฒน์น˜๋Š” ์บ์‹œ ๋ผ์ธ ์˜์—ญ ์•ˆ์— ๋จธ๋ฌด๋ฆ„
  2. ๊ฑฐ์ง“ ํšจ์œจ์˜ ํ™˜์ƒ: ๊ฐ™์€ ์บ์‹œ ๋ผ์ธ์— ๋ฐ˜๋ณต ์ ‘๊ทผ = ์ธ์œ„์ ์œผ๋กœ ๋†’์€ โ€œํžˆํŠธ์œจโ€
  3. ๋Œ€์—ญํญ ์žฌ์•™: 32๊ฐœ ์Šค๋ ˆ๋“œ ร— 32๊ฐœ ๋ณ„๋„ ์บ์‹œ ๋ผ์ธ = ๋ง‰๋Œ€ํ•œ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ
  4. ์›Œํ”„ ์‹คํ–‰ ๋ถˆ์ผ์น˜: GPU๋Š” ๋ณ‘ํ•ฉ ์ ‘๊ทผ์— ๋งž๊ฒŒ ์„ค๊ณ„๋˜์—ˆ์ง€๋งŒ, ํฉ์–ด์ง„ ์ ‘๊ทผ์„ ๋ฐ›์Œ

float32 (๊ฐ 4๋ฐ”์ดํŠธ) ๊ตฌ์ฒด ์˜ˆ์‹œ:

  • ์บ์‹œ ๋ผ์ธ: 128๋ฐ”์ดํŠธ = float32 ๊ฐ’ 32๊ฐœ
  • stride 512: ์Šค๋ ˆ๋“œ๊ฐ€ 512ร—4 = 2048๋ฐ”์ดํŠธ = 16 ์บ์‹œ ๋ผ์ธ ๊ฐ„๊ฒฉ์œผ๋กœ ์ ํ”„!
  • ์›Œํ”„ ์˜ํ–ฅ: 32๊ฐœ ์Šค๋ ˆ๋“œ๊ฐ€ 1๊ฐœ ๋Œ€์‹  32๊ฐœ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์บ์‹œ ๋ผ์ธ์„ ํ•„์š”๋กœ ํ•จ

ํ•ต์‹ฌ ํ†ต์ฐฐ: Kernel2์˜ ๋†’์€ ์บ์‹œ ํžˆํŠธ๋Š” ๋น„ํšจ์œจ์ ์œผ๋กœ ๊ฐ€์ ธ์˜จ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ๋ฐ˜๋ณต ์ ‘๊ทผ์ด์ง€, ํ˜„๋ช…ํ•œ ์บ์‹ฑ์ด ์•„๋‹™๋‹ˆ๋‹ค!

ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก  ํ†ต์ฐฐ

์ฒด๊ณ„์  ํƒ์ • ์ ‘๊ทผ๋ฒ•:

1๋‹จ๊ณ„: NSight Systems (์ „์ฒด ๊ทธ๋ฆผ)

  • ์–ด๋–ค ์ปค๋„์ด ๋А๋ฆฐ์ง€ ์‹๋ณ„
  • ๋ช…๋ฐฑํ•œ ๋ณ‘๋ชฉ ๋ฐฐ์ œ (๋ฉ”๋ชจ๋ฆฌ ์ „์†ก, API ์˜ค๋ฒ„ํ—ค๋“œ)
  • ์ปค๋„ ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด์— ์ง‘์ค‘

2๋‹จ๊ณ„: NSight Compute (์‹ฌ์ธต ๋ถ„์„)

  • ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰ ์ง€ํ‘œ ๋ถ„์„
  • ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ  ๋ฐฑ๋ถ„์œจ ๋น„๊ต
  • ์บ์‹œ ํžˆํŠธ์œจ๊ณผ ํŒจํ„ด ์กฐ์‚ฌ

3๋‹จ๊ณ„: ๊ทผ๊ฑฐ๋ฅผ ์ด๋ก ์œผ๋กœ ์—ฐ๊ฒฐ

ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ โ†’ ์ฝ”๋“œ ๋ถ„์„:

NSight Compute ๊ฒฐ๊ณผ:              ์‹ค์ œ ์ฝ”๋“œ ํŒจํ„ด:
- Kernel1: ~308 GB/s            โ†’ i = block_idx*block_dim + thread_idx (๋ณ‘ํ•ฉ)
- Kernel2: ~6 GB/s, 99% L2 hits โ†’ i += 512 (์น˜๋ช…์  stride)
- Kernel3: ~310 GB/s            โ†’ reverse_i = size-1-forward_i (์—ญ์ˆœ ๋ณ‘ํ•ฉ)

ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์„ ์ง์ ‘ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค!

๊ทผ๊ฑฐ์—์„œ ์ฝ”๋“œ๋กœ์˜ ์—ฐ๊ฒฐ:

  • ๋†’์€ ์ฒ˜๋ฆฌ๋Ÿ‰ + ๋ณดํ†ต ์บ์‹œ ํžˆํŠธ์œจ = ๋ณ‘ํ•ฉ ์ ‘๊ทผ (Kernel 1 & 3)
  • ๋‚ฎ์€ ์ฒ˜๋ฆฌ๋Ÿ‰ + ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ = ๋น„ํšจ์œจ์  ์ŠคํŠธ๋ผ์ด๋“œ ์ ‘๊ทผ (Kernel 2)
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์บ์‹œ ํ†ต๊ณ„์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ์ง„์ •ํ•œ ํšจ์œจ์„ ๋“œ๋Ÿฌ๋ƒ„

์‹ค๋ฌด ์„ฑ๋Šฅ ์‹œ์‚ฌ์ 

์ด ํŒจํ„ด์ด ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” GPU ์‘์šฉ ๋ถ„์•ผ:

๊ณผํ•™ ์ปดํ“จํŒ…:

  • ์Šคํ…์‹ค ์—ฐ์‚ฐ: ๊ทธ๋ฆฌ๋“œ ์‹œ๋ฎฌ๋ ˆ์ด์…˜์—์„œ์˜ ์ด์›ƒ ์ ‘๊ทผ ํŒจํ„ด
  • ์„ ํ˜• ๋Œ€์ˆ˜: ํ–‰๋ ฌ ์ˆœํšŒ ์ˆœ์„œ (ํ–‰ ์šฐ์„  vs ์—ด ์šฐ์„ )
  • ํŽธ๋ฏธ๋ถ„ ๋ฐฉ์ •์‹ ํ’€์ด: ์œ ํ•œ ์ฐจ๋ถ„๋ฒ•์—์„œ์˜ ๊ฒฉ์ž์  ์ ‘๊ทผ ํŒจํ„ด

๊ทธ๋ž˜ํ”ฝ์Šค ๋ฐ ์ด๋ฏธ์ง€ ์ฒ˜๋ฆฌ:

  • ํ…์Šค์ฒ˜ ํ•„ํ„ฐ๋ง: ์…ฐ์ด๋”์—์„œ์˜ ์ƒ˜ํ”Œ ์ ‘๊ทผ ํŒจํ„ด
  • ์ด๋ฏธ์ง€ ํ•ฉ์„ฑ๊ณฑ: ํ•„ํ„ฐ ์ปค๋„์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
  • ์ƒ‰ ๊ณต๊ฐ„ ๋ณ€ํ™˜: ์ฑ„๋„ ์ธํ„ฐ๋ฆฌ๋น™ ์ „๋žต

๋จธ์‹ ๋Ÿฌ๋‹:

  • ํ–‰๋ ฌ ์—ฐ์‚ฐ: GEMM์—์„œ์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”
  • ํ…์„œ ์ถ•์•ฝ: ๋‹ค์ฐจ์› ๋ฐฐ์—ด ์ ‘๊ทผ ํŒจํ„ด
  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ: ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ์™€ ์ „์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ

GPU ์ตœ์ ํ™”์˜ ๊ทผ๋ณธ ์›์น™

๋ฉ”๋ชจ๋ฆฌ ์šฐ์„  ์ตœ์ ํ™” ์ „๋žต:

  1. ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ์ง€๋ฐฐ: ์ ‘๊ทผ ํŒจํ„ด์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„๋ณด๋‹ค ๋” ์ค‘์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  2. ๋ณ‘ํ•ฉ์ด ํ•ต์‹ฌ: ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋„๋ก ์„ค๊ณ„
  3. ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ  ์ธก์ •: ์บ์‹œ ํ†ต๊ณ„๊ฐ€ ์•„๋‹Œ ์‹ค์ œ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์ง‘์ค‘
  4. ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง: NSight ๋„๊ตฌ๋กœ ์‹ค์ œ ๋ณ‘๋ชฉ์„ ํŒŒ์•…

ํ•ต์‹ฌ ๊ธฐ์ˆ  ํ†ต์ฐฐ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ: ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์„ฑ๋Šฅ์„ ๊ฒฐ์ •
  • ์บ์‹œ ์ง€ํ‘œ์˜ ํ•จ์ •: ๋†’์€ ํžˆํŠธ์œจ์ด ํ•ญ์ƒ ํšจ์œจ์„ ์˜๋ฏธํ•˜์ง€๋Š” ์•Š์Œ
  • ์›Œํ”„ ๋ ˆ๋ฒจ ์‚ฌ๊ณ : 32๊ฐœ ์Šค๋ ˆ๋“œ ์‹คํ–‰ ๊ทธ๋ฃน์„ ์œ„ํ•œ ์ ‘๊ทผ ํŒจํ„ด ์„ค๊ณ„
  • ํ•˜๋“œ์›จ์–ด ์ธ์‹ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: GPU ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ดํ•ด๊ฐ€ ํ•„์ˆ˜

ํ•ต์‹ฌ ๊ตํ›ˆ

์ด๋ฒˆ์— ํƒ๊ตฌํ•œ ์‚ฌ๋ก€๋Š” GPU ์„ฑ๋Šฅ ์ตœ์ ํ™”๊ฐ€ CPU ์ง๊ด€์„ ๋ฒ„๋ฆฌ๊ณ  ๋ฉ”๋ชจ๋ฆฌ ์ค‘์‹ฌ ์‚ฌ๊ณ ๋กœ ์ „ํ™˜ํ•  ๊ฒƒ์„ ์š”๊ตฌํ•œ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

ํ•ต์‹ฌ ํ†ต์ฐฐ:

  • ๋†’์€ ์บ์‹œ ํžˆํŠธ์œจ์€ ์ข‹์€ ์„ฑ๋Šฅ์ด ์•„๋‹ˆ๋ผ ๋น„ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ๋‚˜ํƒ€๋‚ผ ์ˆ˜ ์žˆ์Œ
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ ์ด ์บ์‹œ ํ†ต๊ณ„๋ณด๋‹ค ์ค‘์š”
  • ๋‹จ์ˆœํ•œ ๋ณ‘ํ•ฉ ํŒจํ„ด์ด ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋ณด๋‹ค ๋” ๋น ๋ฅธ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  • ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๊ฐ€ ์ง๊ด€์œผ๋กœ๋Š” ์•Œ ์ˆ˜ ์—†๋Š” ์„ฑ๋Šฅ์˜ ์ง„์‹ค์„ ๋“œ๋Ÿฌ๋ƒ„

์‹ค์ „ ๋ฐฉ๋ฒ•๋ก :

  • NSight Systems์™€ NSight Compute๋กœ ์ฒด๊ณ„์ ์œผ๋กœ ํ”„๋กœํŒŒ์ผ๋ง
  • ์ธ์ ‘ ์Šค๋ ˆ๋“œ๊ฐ€ ์ธ์ ‘ ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋„๋ก ์„ค๊ณ„ (๋ณ‘ํ•ฉ)
  • ์ง๊ด€์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋Ÿฌ ๊ทผ๊ฑฐ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ตœ์ ํ™” ๊ฒฐ์ •

์บ์‹œ์˜ ์—ญ์„ค์€ ์•„ํ‚คํ…์ฒ˜์— ๋Œ€ํ•œ ์ดํ•ด ์—†์ด ๊ณ ์ˆ˜์ค€ ์ง€ํ‘œ์— ์˜์กดํ•˜๋ฉด ์ž˜๋ชป๋œ ๊ฒฐ๋ก ์— ์ด๋ฅผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋„˜์–ด ๋‘๋ฃจ ์ ์šฉ๋˜๋Š” ๊ตํ›ˆ์ž…๋‹ˆ๋‹ค.

Puzzle 31: ์ ์œ ์œจ ์ตœ์ ํ™”

์ด ํผ์ฆ์ด ์ค‘์š”ํ•œ ์ด์œ 

Puzzle 30์˜ ์—ฐ์žฅ์„ : GPU ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ๋ฐฐ์šฐ๊ณ , ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ์–ด๋–ป๊ฒŒ ๊ทน์ ์ธ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š”์ง€ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ๋‚˜์•„๊ฐˆ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค: ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™”.

ํ•™์Šต ์—ฌ์ •:

  • Puzzle 30์—์„œ๋Š” NSight ํ”„๋กœํŒŒ์ผ๋ง(nsys์™€ ncu)์„ ํ†ตํ•ด ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ์ง„๋‹จํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  • Puzzle 31์—์„œ๋Š” ๋ฆฌ์†Œ์Šค ๊ด€๋ฆฌ๋ฅผ ํ†ตํ•ด ์„ฑ๋Šฅ์„ ์˜ˆ์ธกํ•˜๊ณ  ์ œ์–ดํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค
  • ๋‘˜์„ ํ•ฉ์น˜๋ฉด GPU ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์™„์ „ํ•œ ๋„๊ตฌ ์„ธํŠธ๋ฅผ ๊ฐ–์ถ”๊ฒŒ ๋ฉ๋‹ˆ๋‹ค

발견하게 될 것: GPU 성능은 단순히 알고리즘 효율의 문제가 아닙니다 - 코드가 한정된 하드웨어 리소스를 어떻게 활용하느냐가 핵심입니다. 모든 GPU는 유한한 레지스터, 공유 메모리, 실행 유닛을 갖고 있습니다. 점유율(occupancy) - SM당 최대 가능 워프 수 대비 활성 워프 수의 비율 - 을 이해하는 것은 다음과 같은 이유로 중요합니다:

  • ์ง€์—ฐ ์‹œ๊ฐ„ ์€๋‹‰: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€๊ธฐ ์‹œ๊ฐ„ ๋™์•ˆ GPU๊ฐ€ ์œ ํœด ์ƒํƒœ์— ๋น ์ง€์ง€ ์•Š๋„๋ก ์œ ์ง€
  • ๋ฆฌ์†Œ์Šค ํ• ๋‹น: ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ๋ธ”๋ก ๊ฐ„์˜ ๊ท ํ˜• ์กฐ์ ˆ
  • ์„ฑ๋Šฅ ์˜ˆ์ธก: ๋ณ‘๋ชฉ์ด ๋ฐœ์ƒํ•˜๊ธฐ ์ „์— ๋ฏธ๋ฆฌ ํŒŒ์•…
  • ์ตœ์ ํ™” ์ „๋žต: ์ ์œ ์œจ์— ์ง‘์ค‘ํ•ด์•ผ ํ•  ๋•Œ์™€ ๋‹ค๋ฅธ ์š”์†Œ์— ์ง‘์ค‘ํ•ด์•ผ ํ•  ๋•Œ ํŒ๋‹จ

GPU๋ฅผ ๋„˜์–ด์„œ ์ ์šฉ๋˜๋Š” ์›๋ฆฌ: ์—ฌ๊ธฐ์„œ ๋ฐฐ์šฐ๋Š” ์›๋ฆฌ๋Š” ๋ฆฌ์†Œ์Šค๋ฅผ ์—ฌ๋Ÿฌ ์‹คํ–‰ ์œ ๋‹›์ด ๊ณต์œ ํ•˜๋Š” ๋ชจ๋“  ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค - ํ•˜์ดํผ์Šค๋ ˆ๋”ฉ์„ ์‚ฌ์šฉํ•˜๋Š” CPU๋ถ€ํ„ฐ ๋ถ„์‚ฐ ์ปดํ“จํŒ… ํด๋Ÿฌ์Šคํ„ฐ๊นŒ์ง€.

๊ฐœ์š”

GPU ์ ์œ ์œจ์€ SM๋‹น ํ™œ์„ฑ ์›Œํ”„ ์ˆ˜ ๋Œ€๋น„ ์ตœ๋Œ€ ๊ฐ€๋Šฅ ์›Œํ”„ ์ˆ˜์˜ ๋น„์œจ์ž…๋‹ˆ๋‹ค. GPU๊ฐ€ ์›Œํ”„ ์ „ํ™˜์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์–ผ๋งˆ๋‚˜ ํšจ๊ณผ์ ์œผ๋กœ ์ˆจ๊ธธ ์ˆ˜ ์žˆ๋Š”์ง€๋ฅผ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค.

SAXPY๋Š” Single-precision Alpha times X plus Y์˜ ์•ฝ์ž์ž…๋‹ˆ๋‹ค. ์ด ํผ์ฆ์—์„œ๋Š” ์ˆ˜ํ•™์ ์œผ๋กœ ๋™์ผํ•˜์ง€๋งŒ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๋‹ค๋ฅธ ์„ธ ๊ฐ€์ง€ SAXPY ์ปค๋„(y[i] = alpha * x[i] + y[i])์„ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค:

comptime SIZE = 32 * 1024 * 1024  # 32M elements - larger workload to show occupancy effects
comptime THREADS_PER_BLOCK = (1024, 1)
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)
comptime ALPHA = Scalar[dtype](2.5)  # SAXPY coefficient


def minimal_kernel(
    y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    alpha: Float32,
    size: Int,
):
    """Minimal SAXPY kernel - simple and register-light for high occupancy."""
    var i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        # Direct computation: y[i] = alpha * x[i] + y[i]
        # Uses minimal registers (~8), no shared memory
        y[i] = alpha * x[i] + y[i]


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p31/p31.mojo

def sophisticated_kernel(
    y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    alpha: Float32,
    size: Int,
):
    """Sophisticated SAXPY kernel - over-engineered with excessive resource usage.
    """
    # Maximum shared memory allocation (close to 48KB limit)
    var shared_cache = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](
        row_major[1024 * 12]()
    )  # 48KB

    var i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    if i < size:
        # REAL computational work that can't be optimized away - affects final result
        var base_x = x[i]
        var base_y = y[i]

        # Simulate "precision enhancement" - multiple small adjustments that add up
        # Each computation affects the final result so compiler can't eliminate them
        # But artificially increases register pressure
        var precision_x1 = base_x * 1.0001
        var precision_x2 = precision_x1 * 0.9999
        var precision_x3 = precision_x2 * 1.000001
        var precision_x4 = precision_x3 * 0.999999

        var precision_y1 = base_y * 1.000005
        var precision_y2 = precision_y1 * 0.999995
        var precision_y3 = precision_y2 * 1.0000001
        var precision_y4 = precision_y3 * 0.9999999

        # Multiple alpha computations for "stability" - should equal alpha
        var alpha1 = alpha * 1.00001 * 0.99999
        var alpha2 = alpha1 * 1.000001 * 0.999999
        var alpha3 = alpha2 * 1.0000001 * 0.9999999
        var alpha4 = alpha3 * 1.00000001 * 0.99999999

        # Complex polynomial "optimization" - creates register pressure
        var x_power2 = precision_x4 * precision_x4
        var x_power3 = x_power2 * precision_x4
        var x_power4 = x_power3 * precision_x4
        var x_power5 = x_power4 * precision_x4
        var x_power6 = x_power5 * precision_x4
        var x_power7 = x_power6 * precision_x4
        var x_power8 = x_power7 * precision_x4

        # "Advanced" mathematical series that contributes tiny amount to result
        var series_term1 = x_power2 * 0.0000001  # x^2/10M
        var series_term2 = x_power4 * 0.00000001  # x^4/100M
        var series_term3 = x_power6 * 0.000000001  # x^6/1B
        var series_term4 = x_power8 * 0.0000000001  # x^8/10B
        var series_correction = (
            series_term1 - series_term2 + series_term3 - series_term4
        )

        # Over-engineered shared memory usage with multiple caching strategies
        if local_i < 1024:
            shared_cache[local_i] = precision_x4
            shared_cache[local_i + 1024] = precision_y4
            shared_cache[local_i + 2048] = alpha4
            shared_cache[local_i + 3072] = series_correction
        barrier()

        # Load from shared memory for "optimization"
        var cached_x = shared_cache[local_i] if local_i < 1024 else precision_x4
        var cached_y = (
            shared_cache[local_i + 1024] if local_i < 1024 else precision_y4
        )
        var cached_alpha = (
            shared_cache[local_i + 2048] if local_i < 1024 else alpha4
        )
        var cached_correction = (
            shared_cache[local_i + 3072] if local_i
            < 1024 else series_correction
        )

        # Final "high precision" computation - all work contributes to result
        var high_precision_result = (
            cached_alpha * cached_x + cached_y + cached_correction
        )

        # Over-engineered result with massive resource usage but mathematically ~= alpha*x + y
        y[i] = high_precision_result


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p31/p31.mojo

def balanced_kernel(
    y: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    x: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    alpha: Float32,
    size: Int,
):
    """Balanced SAXPY kernel - efficient optimization with moderate resources.
    """
    # Reasonable shared memory usage for effective caching (16KB)
    var shared_cache = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](
        row_major[1024 * 4]()
    )  # 16KB total

    var i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    if i < size:
        # Moderate computational work that contributes to result
        var base_x = x[i]
        var base_y = y[i]

        # Light precision enhancement - less than sophisticated kernel
        var enhanced_x = base_x * 1.00001 * 0.99999
        var enhanced_y = base_y * 1.00001 * 0.99999
        var stable_alpha = alpha * 1.000001 * 0.999999

        # Moderate computational optimization
        var x_squared = enhanced_x * enhanced_x
        var optimization_hint = x_squared * 0.000001

        # Efficient shared memory caching - only what we actually need
        if local_i < 1024:
            shared_cache[local_i] = enhanced_x
            shared_cache[local_i + 1024] = enhanced_y
        barrier()

        # Use cached values efficiently
        var cached_x = shared_cache[local_i] if local_i < 1024 else enhanced_x
        var cached_y = (
            shared_cache[local_i + 1024] if local_i < 1024 else enhanced_y
        )

        # Balanced computation - moderate work, good efficiency
        var result = stable_alpha * cached_x + cached_y + optimization_hint

        # Balanced result with moderate resource usage (~15 registers, 16KB shared)
        y[i] = result


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p31/p31.mojo

๋„์ „ ๊ณผ์ œ

ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์„ธ ์ปค๋„์„ ์กฐ์‚ฌํ•˜๊ณ , ์ ์œ ์œจ ์ตœ์ ํ™”์— ๋Œ€ํ•œ ๋ถ„์„ ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”. ์ปค๋„๋“ค์€ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•˜์ง€๋งŒ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๊ทน์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค - ์„ฑ๋Šฅ๊ณผ ์ ์œ ์œจ์ด ์™œ ์ง๊ด€์— ์–ด๊ธ‹๋‚˜๋Š” ๋ฐฉ์‹์œผ๋กœ ๋™์ž‘ํ•˜๋Š”์ง€ ๋ฐœ๊ฒฌํ•˜๋Š” ๊ฒƒ์ด ์—ฌ๋Ÿฌ๋ถ„์˜ ์ž„๋ฌด์ž…๋‹ˆ๋‹ค!

์ด ํผ์ฆ์— ํ‘œ์‹œ๋œ ๊ตฌ์ฒด์ ์ธ ์ˆ˜์น˜ ๊ฒฐ๊ณผ๋Š” NVIDIA A10G (Ampere 8.6) ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” GPU ์ œ์กฐ์‚ฌ์™€ ์•„ํ‚คํ…์ฒ˜(NVIDIA: Pascal/Turing/Ampere/Ada/Hopper, AMD: RDNA/GCN, Apple: M1/M2/M3/M4/M5)์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€์ง€๋งŒ, ๊ธฐ๋ณธ ๊ฐœ๋…, ๋ฐฉ๋ฒ•๋ก , ํ†ต์ฐฐ์€ ๋ชจ๋“  ์ตœ์‹  GPU์— ๋ณดํŽธ์ ์œผ๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. pixi run gpu-specs๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ํ•˜๋“œ์›จ์–ด๋ณ„ ์ˆ˜์น˜๋ฅผ ํ™•์ธํ•˜์„ธ์š”.

๊ตฌ์„ฑ

์š”๊ตฌ ์‚ฌํ•ญ:

  • CUDA ํˆดํ‚ท์ด ์„ค์น˜๋œ NVIDIA GPU
  • Puzzle 30์˜ NSight Compute

โš ๏ธ GPU ํ˜ธํ™˜์„ฑ ์ฐธ๊ณ : ๊ธฐ๋ณธ ์„ค์ •์€ ๊ณต๊ฒฉ์ ์ธ ๊ฐ’์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ๊ตฌํ˜•์ด๋‚˜ ์ €์‚ฌ์–‘ GPU์—์„œ๋Š” ์‹คํŒจํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

comptime SIZE = 32 * 1024 * 1024  # 32M ์š”์†Œ (๋ฐฐ์—ด๋‹น ~128MB, ๋‘ ๋ฐฐ์—ด ์ด ~256MB ๋ฉ”๋ชจ๋ฆฌ)
comptime THREADS_PER_BLOCK = (1024, 1)  # ๋ธ”๋ก๋‹น 1024 ์Šค๋ ˆ๋“œ
comptime BLOCKS_PER_GRID = (SIZE // 1024, 1)  # 32768 ๋ธ”๋ก

์‹คํ–‰ ์‹คํŒจ ์‹œ problems/p31/p31.mojo์—์„œ ๋‹ค์Œ ๊ฐ’์„ ์ค„์ด์„ธ์š”:

  • ๊ตฌํ˜• GPU (Compute Capability < 3.0): THREADS_PER_BLOCK = (512, 1), SIZE = 16 * 1024 * 1024 ์‚ฌ์šฉ
  • ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ GPU (< 2GB): SIZE = 8 * 1024 * 1024 ๋˜๋Š” SIZE = 4 * 1024 * 1024 ์‚ฌ์šฉ
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์› ์ œํ•œ: BLOCKS_PER_GRID๋Š” SIZE์— ๋งž์ถฐ ์ž๋™ ์กฐ์ •๋ฉ๋‹ˆ๋‹ค

์ ์œ ์œจ ๊ณต์‹:

์ด๋ก ์  ์ ์œ ์œจ = min(
    SM๋‹น ๋ ˆ์ง€์Šคํ„ฐ ์ˆ˜ / (์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ ์ˆ˜ ร— ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜),
    SM๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ / ๋ธ”๋ก๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ,
    SM๋‹น ์ตœ๋Œ€ ๋ธ”๋ก ์ˆ˜
) ร— ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ / SM๋‹น ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ์ˆ˜

์กฐ์‚ฌ ๊ณผ์ •

Step 1: ์ปค๋„ ํ…Œ์ŠคํŠธ

pixi shell -e nvidia
mojo problems/p31/p31.mojo --all

์„ธ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฏธ์Šคํ„ฐ๋ฆฌ: ์™œ ์„ฑ๋Šฅ์€ ๋‹ค๋ฅผ๊นŒ์š”?

Step 2: ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํฌ

mojo problems/p31/p31.mojo --benchmark

์„ธ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋ฏธ์Šคํ„ฐ๋ฆฌ: ์™œ ์„ฑ๋Šฅ์€ ๋‹ค๋ฅผ๊นŒ์š”?

Step 3: ํ”„๋กœํŒŒ์ผ๋ง์šฉ ๋นŒ๋“œ

mojo build --debug-level=full problems/p31/p31.mojo -o problems/p31/p31_profiler

Step 4: ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋ง

# ๊ฐ ์ปค๋„์˜ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋ง
ncu --set=@occupancy --section=LaunchStats problems/p31/p31_profiler --minimal
ncu --set=@occupancy --section=LaunchStats problems/p31/p31_profiler --sophisticated
ncu --set=@occupancy --section=LaunchStats problems/p31/p31_profiler --balanced

์ ์œ ์œจ ๋ถ„์„์„ ์œ„ํ•ด ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰์„ ๊ธฐ๋กํ•˜์„ธ์š”.

Step 5: ์ด๋ก ์  ์ ์œ ์œจ ๊ณ„์‚ฐ

๋จผ์ € GPU ์•„ํ‚คํ…์ฒ˜์™€ ์„ธ๋ถ€ ์ŠคํŽ™์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค:

pixi run gpu-specs

์ฐธ๊ณ : gpu-specs๋Š” GPU ์ œ์กฐ์‚ฌ(NVIDIA/AMD/Apple)๋ฅผ ์ž๋™ ๊ฐ์ง€ํ•˜๊ณ  ํ•˜๋“œ์›จ์–ด์—์„œ ํŒŒ์ƒ๋œ ๋ชจ๋“  ์•„ํ‚คํ…์ฒ˜ ์„ธ๋ถ€ ์ •๋ณด๋ฅผ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค - ๋ณ„๋„์˜ ์ฐธ์กฐํ‘œ๊ฐ€ ํ•„์š” ์—†์Šต๋‹ˆ๋‹ค!

์ฃผ์š” ์•„ํ‚คํ…์ฒ˜ ์ŠคํŽ™ (์ฐธ๊ณ ์šฉ):

์•„ํ‚คํ…์ฒ˜Compute Cap๋ ˆ์ง€์Šคํ„ฐ/SM๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ/SM์ตœ๋Œ€ ์Šค๋ ˆ๋“œ/SM์ตœ๋Œ€ ๋ธ”๋ก/SM
Hopper (H100)9.065,536228KB2,04832
Ada (RTX 40xx)8.965,536128KB2,04832
Ampere (RTX 30xx, A100, A10G)8.0, 8.665,536164KB2,04832
Turing (RTX 20xx)7.565,53696KB1,02416
Pascal (GTX 10xx)6.165,53696KB2,04832

๐Ÿ“š ๊ณต์‹ ๋ฌธ์„œ:

โš ๏ธ ์ฐธ๊ณ : ์ด ๊ฐ’๋“ค์€ ์ด๋ก ์  ์ตœ๋Œ€์น˜์ž…๋‹ˆ๋‹ค. ์‹ค์ œ ์ ์œ ์œจ์€ ํ•˜๋“œ์›จ์–ด ์Šค์ผ€์ค„๋ง ์ œ์•ฝ, ๋“œ๋ผ์ด๋ฒ„ ์˜ค๋ฒ„ํ—ค๋“œ ๋“ฑ์˜ ์š”์ธ์œผ๋กœ ๋” ๋‚ฎ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ์ŠคํŽ™๊ณผ ์ ์œ ์œจ ๊ณต์‹์„ ์‚ฌ์šฉํ•˜์—ฌ:

  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: 1024 (์ปค๋„ ์„ค์ •๊ฐ’)

์ ์œ ์œจ ๊ณต์‹๊ณผ ํ•˜๋“œ์›จ์–ด ์ŠคํŽ™์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ปค๋„์˜ ์ด๋ก ์  ์ ์œ ์œจ์„ ์˜ˆ์ธกํ•˜์„ธ์š”.

Step 6: ์‹ค์ œ ์ ์œ ์œจ ์ธก์ •

# ๊ฐ ์ปค๋„์˜ ์‹ค์ œ ์ ์œ ์œจ ์ธก์ •
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active problems/p31/p31_profiler --minimal
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active problems/p31/p31_profiler --sophisticated
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active problems/p31/p31_profiler --balanced

์ด๋ก ์  ๊ณ„์‚ฐ๊ณผ ์‹ค์ œ ์ธก์ •๋œ ์ ์œ ์œจ์„ ๋น„๊ตํ•˜์„ธ์š” - ๋ฏธ์Šคํ„ฐ๋ฆฌ๊ฐ€ ๋“œ๋Ÿฌ๋‚˜๋Š” ์ˆœ๊ฐ„์ž…๋‹ˆ๋‹ค!

ํ•ต์‹ฌ ํ†ต์ฐฐ

๐Ÿ’ก ์ ์œ ์œจ ์ž„๊ณ„๊ฐ’: ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ์— ์ถฉ๋ถ„ํ•œ ์ ์œ ์œจ(~25-50%)์„ ํ™•๋ณดํ•˜๋ฉด, ๊ทธ ์ด์ƒ์˜ ์ ์œ ์œจ์€ ์ˆ˜ํ™• ์ฒด๊ฐ ํšจ๊ณผ๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

๐Ÿ’ก ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ vs ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ: SAXPY๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž…๋‹ˆ๋‹ค. ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„์—์„œ๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ ์œ ์œจ๋ณด๋‹ค ๋” ์ค‘์š”ํ•œ ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก ๋ฆฌ์†Œ์Šค ํšจ์œจ: ์ตœ์‹  GPU๋Š” ์ ๋‹นํ•œ ์ˆ˜์ค€์˜ ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ•(์Šค๋ ˆ๋“œ๋‹น 20-40๊ฐœ)์„ ์ ์œ ์œจ์˜ ๊ทน์ ์ธ ๊ฐ์†Œ ์—†์ด ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: ๋‹ค์Œ ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”

์œ„์˜ ์กฐ์‚ฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ•œ ํ›„, ๋‹ค์Œ ๋ถ„์„ ์งˆ๋ฌธ์— ๋‹ตํ•˜์—ฌ ์ ์œ ์œจ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ณด์„ธ์š”:

์„ฑ๋Šฅ ๋ถ„์„ (Step 2):

  1. ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋น ๋ฅด๊ณ , ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋А๋ฆฐ๊ฐ€์š”? ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด๋ฅผ ๊ธฐ๋กํ•˜์„ธ์š”.

๋ฆฌ์†Œ์Šค ํ”„๋กœํŒŒ์ผ๋ง (Step 4):

  1. ๊ฐ ์ปค๋„์˜ ์Šค๋ ˆ๋“œ๋‹น ๋ ˆ์ง€์Šคํ„ฐ ์ˆ˜, ๋ธ”๋ก๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, SM๋‹น ์›Œํ”„ ์ˆ˜๋ฅผ ๊ธฐ๋กํ•˜์„ธ์š”.

์ด๋ก ์  ๊ณ„์‚ฐ (Step 5):

  1. GPU ์ŠคํŽ™๊ณผ ์ ์œ ์œจ ๊ณต์‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์ปค๋„์˜ ์ด๋ก ์  ์ ์œ ์œจ์„ ๊ณ„์‚ฐํ•˜์„ธ์š”. ์–ด๋–ค ์ปค๋„์ด ๊ฐ€์žฅ ๋†’๊ณ /๋‚ฎ์•„์•ผ ํ•˜๋‚˜์š”?

์ธก์ •๋œ ์ ์œ ์œจ (Step 6):

  1. ์ธก์ •๋œ ์ ์œ ์œจ ๊ฐ’์ด ๊ณ„์‚ฐ ๊ฒฐ๊ณผ์™€ ์–ด๋–ป๊ฒŒ ๋น„๊ต๋˜๋‚˜์š”?

์ ์œ ์œจ ๋ฏธ์Šคํ„ฐ๋ฆฌ:

  1. ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๊ทน์ ์œผ๋กœ ๋‹ค๋ฅธ๋ฐ๋„ ์„ธ ์ปค๋„ ๋ชจ๋‘ ๋น„์Šทํ•œ ์ ์œ ์œจ(~64-66%, GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ)๋ฅผ ๋‹ฌ์„ฑํ•˜๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  2. ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๊ทน์ ์œผ๋กœ ์ฐจ์ด๋‚˜๋Š”๋ฐ(19 vs 40 ๋ ˆ์ง€์Šคํ„ฐ, 0KB vs 49KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ) ์„ฑ๋Šฅ์ด ๊ฑฐ์˜ ๋™์ผํ•œ(<2% ์ฐจ์ด) ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  3. ์ด๋ก ์  ์ ์œ ์œจ ๊ณ„์‚ฐ๊ณผ ์‹ค์ œ GPU ๋™์ž‘ ์‚ฌ์ด์˜ ๊ด€๊ณ„์— ๋Œ€ํ•ด ๋ฌด์—‡์„ ์•Œ ์ˆ˜ ์žˆ๋‚˜์š”?
  4. ์ด SAXPY ์›Œํฌ๋กœ๋“œ์˜ ์‹ค์ œ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์ด ์ ์œ ์œจ์ด ์•„๋‹ˆ๋ผ๋ฉด ๋ฌด์—‡์ธ๊ฐ€์š”?
ํŒ

ํƒ์ • ๋„๊ตฌ ๋ชจ์Œ:

  • NSight Compute (ncu) - ์ ์œ ์œจ๊ณผ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ์ธก์ •
  • GPU ์•„ํ‚คํ…์ฒ˜ ์ŠคํŽ™ - pixi run gpu-specs๋ฅผ ์‚ฌ์šฉํ•œ ์ด๋ก ์  ํ•œ๊ณ„ ๊ณ„์‚ฐ
  • ์ ์œ ์œจ ๊ณต์‹ - ๋ฆฌ์†Œ์Šค ๋ณ‘๋ชฉ ์˜ˆ์ธก
  • ์„ฑ๋Šฅ ๋ฒค์น˜๋งˆํฌ - ์ด๋ก ์  ๋ถ„์„ ๊ฒ€์ฆ

ํ•ต์‹ฌ ์ตœ์ ํ™” ์›์น™:

  • ์ตœ์ ํ™” ์ „์— ๊ณ„์‚ฐํ•˜๊ธฐ: ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์ „์— ์ ์œ ์œจ ๊ณต์‹์œผ๋กœ ๋ฆฌ์†Œ์Šค ํ•œ๊ณ„๋ฅผ ์˜ˆ์ธก
  • ์ธก์ •์œผ๋กœ ๊ฒ€์ฆํ•˜๊ธฐ: ์ด๋ก ์  ๊ณ„์‚ฐ์€ ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”์™€ ํ•˜๋“œ์›จ์–ด ์„ธ๋ถ€ ์‚ฌํ•ญ์„ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•จ
  • ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ ๊ณ ๋ คํ•˜๊ธฐ: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ๋Š” ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๋ณด๋‹ค ์ ์œ ์œจ์ด ๋œ ํ•„์š”
  • ์ตœ๋Œ€ ์ ์œ ์œจ์„ ๋ชฉํ‘œ๋กœ ํ•˜์ง€ ์•Š๊ธฐ: ์ถฉ๋ถ„ํ•œ ์ ์œ ์œจ + ๋‹ค๋ฅธ ์„ฑ๋Šฅ ์š”์†Œ๋ฅผ ์ตœ์ ํ™”
  • ์ž„๊ณ„๊ฐ’ ๊ด€์ ์œผ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ: 25-50% ์ ์œ ์œจ์ด๋ฉด ๋Œ€๋ถ€๋ถ„ ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ์— ์ถฉ๋ถ„
  • ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋งํ•˜๊ธฐ: NSight Compute๋กœ ์‹ค์ œ ๋ ˆ์ง€์Šคํ„ฐ์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์†Œ๋น„๋Ÿ‰ ํŒŒ์•…

์กฐ์‚ฌ ์ ‘๊ทผ๋ฒ•:

  1. ๋ฒค์น˜๋งˆํ‚น๋ถ€ํ„ฐ ์‹œ์ž‘ - ๋จผ์ € ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ํ™•์ธ
  2. NSight Compute๋กœ ํ”„๋กœํŒŒ์ผ๋ง - ์‹ค์ œ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ ์œ ์œจ ๋ฐ์ดํ„ฐ ํ™•๋ณด
  3. ์ด๋ก ์  ์ ์œ ์œจ ๊ณ„์‚ฐ - GPU ์ŠคํŽ™๊ณผ ์ ์œ ์œจ ๊ณต์‹ ํ™œ์šฉ
  4. ์ด๋ก ๊ณผ ํ˜„์‹ค ๋น„๊ต - ๋ฏธ์Šคํ„ฐ๋ฆฌ๊ฐ€ ๋“œ๋Ÿฌ๋‚˜๋Š” ์ˆœ๊ฐ„!
  5. ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ ๊ณ ์ฐฐ - ์ด๋ก ๊ณผ ์‹ค์ œ๊ฐ€ ์™œ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋Š”์ง€ ์ƒ๊ฐํ•ด๋ณด๊ธฐ

์†”๋ฃจ์…˜

์‹ฌ์ธต ํ•ด์„ค์ด ํฌํ•จ๋œ ์™„์ „ํ•œ ํ’€์ด

์ด ์ ์œ ์œจ ํƒ์ • ์‚ฌ๊ฑด์€ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด GPU ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๊ณ , ์ด๋ก ์  ์ ์œ ์œจ๊ณผ ์‹ค์ œ ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ๋ณต์žกํ•œ ๊ด€๊ณ„๋ฅผ ๋“œ๋Ÿฌ๋ƒ…๋‹ˆ๋‹ค.

์•„๋ž˜ ๊ตฌ์ฒด์ ์ธ ๊ณ„์‚ฐ์€ NVIDIA A10G (Ampere 8.6) - ํ…Œ์ŠคํŠธ์— ์‚ฌ์šฉ๋œ GPU - ๊ธฐ์ค€์ž…๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ๋Š” GPU ์•„ํ‚คํ…์ฒ˜์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง€์ง€๋งŒ, ๋ฐฉ๋ฒ•๋ก ๊ณผ ํ†ต์ฐฐ์€ ๋ณดํŽธ์ ์œผ๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. pixi run gpu-specs๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ํ•˜๋“œ์›จ์–ด๋ณ„ ์ˆ˜์น˜๋ฅผ ํ™•์ธํ•˜์„ธ์š”.

๋ฆฌ์†Œ์Šค ๋ถ„์„์„ ํ†ตํ•œ ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ

NSight Compute ๋ฆฌ์†Œ์Šค ๋ถ„์„:

์‹ค์ œ ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ (NVIDIA A10G - GPU์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ):

  • Minimal: 19 ๋ ˆ์ง€์Šคํ„ฐ, ~0KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ์ ์œ ์œจ 63.87%, 327.7ms
  • Balanced: 25 ๋ ˆ์ง€์Šคํ„ฐ, 16.4KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ์ ์œ ์œจ 65.44%, 329.4ms
  • Sophisticated: 40 ๋ ˆ์ง€์Šคํ„ฐ, 49.2KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ์ ์œ ์œจ 65.61%, 330.9ms

๋ฒค์น˜๋งˆํฌ ์„ฑ๋Šฅ ๊ทผ๊ฑฐ:

  • ์„ธ ์ปค๋„ ๋ชจ๋‘ ๊ฑฐ์˜ ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ž„ (~327-331ms, <2% ์ฐจ์ด)
  • ๋ฆฌ์†Œ์Šค ์ฐจ์ด๊ฐ€ ํฌ์ง€๋งŒ ๋ชจ๋‘ ๋น„์Šทํ•œ ์ ์œ ์œจ์„ ๋‹ฌ์„ฑ (~64-66%)
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์ œํ•œ ์š”์ธ์œผ๋กœ ์ž‘์šฉ

์ ์œ ์œจ ๊ณ„์‚ฐ์˜ ์‹ค์ฒด

์ด๋ก ์  ์ ์œ ์œจ ๋ถ„์„ (NVIDIA A10G, Ampere 8.6):

GPU ์ŠคํŽ™ (pixi run gpu-specs ์ถœ๋ ฅ):

  • SM๋‹น ๋ ˆ์ง€์Šคํ„ฐ: 65,536
  • SM๋‹น ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: 164KB (์•„ํ‚คํ…์ฒ˜ ์ตœ๋Œ€์น˜)
  • SM๋‹น ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ: 1,536 (A10G ํ•˜๋“œ์›จ์–ด ์ œํ•œ)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ: 1,024 (์ปค๋„ ์„ค์ •๊ฐ’)
  • SM๋‹น ์ตœ๋Œ€ ๋ธ”๋ก: 32

Minimal ์ปค๋„ ๊ณ„์‚ฐ:

๋ ˆ์ง€์Šคํ„ฐ ์ œํ•œ = 65,536 / (19 ร— 1,024) = 3.36 ๋ธ”๋ก/SM
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ = 164KB / 0KB = โˆž ๋ธ”๋ก/SM
ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก ์ œํ•œ = 32 ๋ธ”๋ก/SM

์Šค๋ ˆ๋“œ ์ œํ•œ = 1,536 / 1,024 = 1 ๋ธ”๋ก/SM (๋‚ด๋ฆผ)
์‹ค์ œ ๋ธ”๋ก = min(3, โˆž, 1) = 1 ๋ธ”๋ก/SM
์ด๋ก ์  ์ ์œ ์œจ = (1 ร— 1,024) / 1,536 = 66.7%

Balanced ์ปค๋„ ๊ณ„์‚ฐ:

๋ ˆ์ง€์Šคํ„ฐ ์ œํ•œ = 65,536 / (25 ร— 1,024) = 2.56 ๋ธ”๋ก/SM
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ = 164KB / 16.4KB = 10 ๋ธ”๋ก/SM
ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก ์ œํ•œ = 32 ๋ธ”๋ก/SM

์Šค๋ ˆ๋“œ ์ œํ•œ = 1,536 / 1,024 = 1 ๋ธ”๋ก/SM (๋‚ด๋ฆผ)
์‹ค์ œ ๋ธ”๋ก = min(2, 10, 1) = 1 ๋ธ”๋ก/SM
์ด๋ก ์  ์ ์œ ์œจ = (1 ร— 1,024) / 1,536 = 66.7%

Sophisticated ์ปค๋„ ๊ณ„์‚ฐ:

๋ ˆ์ง€์Šคํ„ฐ ์ œํ•œ = 65,536 / (40 ร— 1,024) = 1.6 ๋ธ”๋ก/SM
๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ œํ•œ = 164KB / 49.2KB = 3.33 ๋ธ”๋ก/SM
ํ•˜๋“œ์›จ์–ด ๋ธ”๋ก ์ œํ•œ = 32 ๋ธ”๋ก/SM

์Šค๋ ˆ๋“œ ์ œํ•œ = 1,536 / 1,024 = 1 ๋ธ”๋ก/SM (๋‚ด๋ฆผ)
์‹ค์ œ ๋ธ”๋ก = min(1, 3, 1) = 1 ๋ธ”๋ก/SM
์ด๋ก ์  ์ ์œ ์œจ = (1 ร— 1,024) / 1,536 = 66.7%

ํ•ต์‹ฌ ๋ฐœ๊ฒฌ: ์ด๋ก ๊ณผ ํ˜„์‹ค์ด ์ผ์น˜ํ•œ๋‹ค!

  • ์ด๋ก ์ : ๋ชจ๋“  ์ปค๋„ ~66.7% (A10G์˜ ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰์— ์˜ํ•ด ์ œํ•œ)
  • ์‹ค์ธก: ๋ชจ๋‘ ~64-66% (๋งค์šฐ ๊ทผ์ ‘ํ•œ ๊ฒฐ๊ณผ!)

์ด๋Š” A10G์˜ ์Šค๋ ˆ๋“œ ์ œํ•œ์ด ์ง€๋ฐฐ์ ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - SM๋‹น ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ๊ฐ€ 1,536๊ฐœ์ด๋ฏ€๋กœ 1,024 ์Šค๋ ˆ๋“œ ๋ธ”๋ก์€ 1๊ฐœ๋งŒ ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค. ์ด๋ก (66.7%)๊ณผ ์‹ค์ธก(~65%) ์‚ฌ์ด์˜ ์ž‘์€ ์ฐจ์ด๋Š” ํ•˜๋“œ์›จ์–ด ์Šค์ผ€์ค„๋ง ์˜ค๋ฒ„ํ—ค๋“œ์™€ ๋“œ๋ผ์ด๋ฒ„ ์ œ์•ฝ์—์„œ ๋น„๋กฏ๋ฉ๋‹ˆ๋‹ค.

์ด๋ก ๊ณผ ํ˜„์‹ค์ด ๊ทผ์ ‘ํ•œ ์ด์œ 

์ด๋ก ์ (66.7%)๊ณผ ์‹ค์ธก(~65%) ์ ์œ ์œจ ์‚ฌ์ด ์ž‘์€ ์ฐจ์ด์˜ ์›์ธ:

  1. ํ•˜๋“œ์›จ์–ด ์Šค์ผ€์ค„๋ง ์˜ค๋ฒ„ํ—ค๋“œ: ์‹ค์ œ ์›Œํ”„ ์Šค์ผ€์ค„๋Ÿฌ๋Š” ์ด๋ก ์  ๊ณ„์‚ฐ์„ ๋„˜์–ด์„œ๋Š” ์‹ค์งˆ์  ์ œ์•ฝ์ด ์žˆ์Œ
  2. CUDA ๋Ÿฐํƒ€์ž„ ์˜ˆ์•ฝ: ๋“œ๋ผ์ด๋ฒ„์™€ ๋Ÿฐํƒ€์ž„ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ๊ฐ€์šฉ SM ๋ฆฌ์†Œ์Šค๋ฅผ ์•ฝ๊ฐ„ ์ค„์ž„
  3. ๋ฉ”๋ชจ๋ฆฌ ์ปจํŠธ๋กค๋Ÿฌ ์••๋ฐ•: A10G์˜ ๋ฉ”๋ชจ๋ฆฌ ์„œ๋ธŒ์‹œ์Šคํ…œ์ด ์•ฝ๊ฐ„์˜ ์Šค์ผ€์ค„๋ง ์ œ์•ฝ์„ ๋งŒ๋“ฆ
  4. ์ „๋ ฅ ๋ฐ ์—ด ๊ด€๋ฆฌ: ๋™์  ์ฃผํŒŒ์ˆ˜ ์กฐ์ ˆ์ด ์ตœ๋Œ€ ์„ฑ๋Šฅ์— ์˜ํ–ฅ
  5. ๋ช…๋ น์–ด ์บ์‹œ ํšจ๊ณผ: ์‹ค์ œ ์ปค๋„์€ ์ ์œ ์œจ ๊ณ„์‚ฐ์— ํฌ์ฐฉ๋˜์ง€ ์•Š๋Š” ๋ช…๋ น์–ด ํŽ˜์น˜ ์˜ค๋ฒ„ํ—ค๋“œ๊ฐ€ ์žˆ์Œ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ด๋ก ๊ณผ ์‹ค์ธก์ด ๊ทผ์ ‘ํ•˜๋‹ค๋Š” ๊ฒƒ(66.7% vs ~65%)์€ ๋ ˆ์ง€์Šคํ„ฐ์™€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฐจ์ด์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ A10G์˜ ์Šค๋ ˆ๋“œ ์ œํ•œ์ด ์„ธ ์ปค๋„ ๋ชจ๋‘๋ฅผ ์ง€๋ฐฐํ•œ๋‹ค๋Š” ๋œป์ž…๋‹ˆ๋‹ค. ์ง„์งœ ๋ณ‘๋ชฉ์„ ์ •ํ™•ํžˆ ์งš์–ด๋‚ธ ์ข‹์€ ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค!

์ ์œ ์œจ ๋ฏธ์Šคํ„ฐ๋ฆฌ ํ•ด์„ค

๋ฏธ์Šคํ„ฐ๋ฆฌ์˜ ์ง„์งœ ์ •์ฒด:

  • ๋ฆฌ์†Œ์Šค ์ฐจ์ด๊ฐ€ ๊ทน์ ์ธ๋ฐ๋„ ์„ธ ์ปค๋„ ๋ชจ๋‘ ๊ฑฐ์˜ ๋™์ผํ•œ ์ ์œ ์œจ์„ ๋‹ฌ์„ฑ (~64-66%)
  • ์„ฑ๋Šฅ์ด ๋ณธ์งˆ์ ์œผ๋กœ ๋™์ผ (์„ธ ์ปค๋„ ๋ชจ๋‘ <2% ๋ณ€๋™)
  • ์ด๋ก ์ด ์ ์œ ์œจ์„ ์ •ํ™•ํžˆ ์˜ˆ์ธก (66.7% ์ด๋ก  โ‰ˆ 65% ์‹ค์ธก)
  • ๋ฏธ์Šคํ„ฐ๋ฆฌ๋Š” ์ ์œ ์œจ ๋ถˆ์ผ์น˜๊ฐ€ ์•„๋‹™๋‹ˆ๋‹ค - ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ํฌ๊ฒŒ ๋‹ค๋ฅธ๋ฐ๋„ ์™œ ์ ์œ ์œจ๊ณผ ์„ฑ๋Šฅ์ด ๋™์ผํ•œ์ง€๊ฐ€ ์ง„์งœ ๋ฏธ์Šคํ„ฐ๋ฆฌ์ž…๋‹ˆ๋‹ค!

๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์ด ๋‹ค๋ฅธ๋ฐ ์„ฑ๋Šฅ์ด ๋™์ผํ•œ ์ด์œ :

SAXPY ์›Œํฌ๋กœ๋“œ์˜ ํŠน์„ฑ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ: ๊ฐ ์Šค๋ ˆ๋“œ์˜ ์—ฐ์‚ฐ๋Ÿ‰์ด ๊ทนํžˆ ์ ์Œ (y[i] = alpha * x[i] + y[i])
  • ๋†’์€ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ: ์Šค๋ ˆ๋“œ๋‹น 2๊ฐœ ๊ฐ’ ์ฝ๊ธฐ, 1๊ฐœ ๊ฐ’ ์“ฐ๊ธฐ
  • ๋‚ฎ์€ ์‚ฐ์ˆ  ๊ฐ•๋„: 12๋ฐ”์ดํŠธ ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ๋‹น 2 FLOPS๋งŒ ์ˆ˜ํ–‰

๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ๋ถ„์„ (A10G):

๋‹จ์ผ ์ปค๋„ ํŒจ์Šค ๋ถ„์„:
- ์ž…๋ ฅ ๋ฐฐ์—ด: 32M ร— 4๋ฐ”์ดํŠธ ร— 2 ๋ฐฐ์—ด = 256MB ์ฝ๊ธฐ
- ์ถœ๋ ฅ ๋ฐฐ์—ด: 32M ร— 4๋ฐ”์ดํŠธ ร— 1 ๋ฐฐ์—ด = 128MB ์“ฐ๊ธฐ
- ์ปค๋„๋‹น ์ด๋Ÿ‰: 384MB ๋ฉ”๋ชจ๋ฆฌ ํŠธ๋ž˜ํ”ฝ

์ตœ๋Œ€ ๋Œ€์—ญํญ (A10G): 600 GB/s
๋‹จ์ผ ํŒจ์Šค ์‹œ๊ฐ„: 384MB / 600 GB/s โ‰ˆ 0.64ms ์ด๋ก ์  ์ตœ์†Œ์น˜
๋ฒค์น˜๋งˆํฌ ์‹œ๊ฐ„: ~328ms (์—ฌ๋Ÿฌ ๋ฐ˜๋ณต + ์˜ค๋ฒ„ํ—ค๋“œ ํฌํ•จ)

์‹ค์ œ ์„ฑ๋Šฅ ๊ฒฐ์ • ์š”์ธ:

  1. ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ: ๋ชจ๋“  ์ปค๋„์ด ๊ฐ€์šฉ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ํฌํ™”์‹œํ‚ด
  2. ์—ฐ์‚ฐ ์˜ค๋ฒ„ํ—ค๋“œ: Sophisticated ์ปค๋„์ด ์ถ”๊ฐ€ ์ž‘์—…์„ ์ˆ˜ํ–‰ (๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ• ํšจ๊ณผ)
  3. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ด์ : Balanced ์ปค๋„์ด ์ผ๋ถ€ ์บ์‹ฑ ์ด์ ์„ ์–ป์Œ
  4. ์ปดํŒŒ์ผ๋Ÿฌ ์ตœ์ ํ™”: ์ตœ์‹  ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ๊ฐ€๋Šฅํ•œ ํ•œ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ์„ ์ตœ์†Œํ™”

์ ์œ ์œจ ์ž„๊ณ„๊ฐ’ ๊ฐœ๋… ์ดํ•ดํ•˜๊ธฐ

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ ์œ ์œจ์€ โ€œ์ตœ๋Œ€โ€œ๊ฐ€ ์•„๋‹Œ โ€œ์ถฉ๋ถ„ํ•จโ€œ์˜ ๋ฌธ์ œ

๋Œ€๊ธฐ ์‹œ๊ฐ„ ์€๋‹‰ ์š”๊ตฌ ์‚ฌํ•ญ:

  • ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„: ์ตœ์‹  GPU์—์„œ ~500-800 ์‚ฌ์ดํด
  • ์›Œํ”„ ์Šค์ผ€์ค„๋ง: GPU๋Š” ์ด ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ ์œ„ํ•ด ์ถฉ๋ถ„ํ•œ ์›Œํ”„๊ฐ€ ํ•„์š”
  • ์ถฉ๋ถ„ํ•œ ์ž„๊ณ„๊ฐ’: ๋ณดํ†ต 25-50% ์ ์œ ์œจ์ด๋ฉด ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ํšจ๊ณผ์ ์œผ๋กœ ์ˆจ๊ธธ ์ˆ˜ ์žˆ์Œ

๋†’์€ ์ ์œ ์œจ์ด ํ•ญ์ƒ ๋„์›€์ด ๋˜์ง€ ์•Š๋Š” ์ด์œ :

๋ฆฌ์†Œ์Šค ๊ฒฝ์Ÿ:

  • ๋” ๋งŽ์€ ํ™œ์„ฑ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์„ ๋†“๊ณ  ๊ฒฝ์Ÿ
  • ๋™์‹œ ์ ‘๊ทผ์ด ๋งŽ์•„์ง€๋ฉด ์บ์‹œ ์••๋ฐ•์ด ์ฆ๊ฐ€
  • ๋ ˆ์ง€์Šคํ„ฐ/๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์••๋ฐ•์ด ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ ์„ฑ๋Šฅ์„ ์ €ํ•˜์‹œํ‚ฌ ์ˆ˜ ์žˆ์Œ

์›Œํฌ๋กœ๋“œ๋ณ„ ์ตœ์ ํ™”:

  • ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ: ๋†’์€ ์ ์œ ์œจ์ด ALU ํŒŒ์ดํ”„๋ผ์ธ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๋Š” ๋ฐ ๋„์›€
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ: ์ ์œ ์œจ๊ณผ ๋ฌด๊ด€ํ•˜๊ฒŒ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์„ฑ๋Šฅ์„ ์ œํ•œ
  • ํ˜ผํ•ฉ ์›Œํฌ๋กœ๋“œ: ์ ์œ ์œจ๊ณผ ๋‹ค๋ฅธ ์ตœ์ ํ™” ์š”์†Œ ์‚ฌ์ด์—์„œ ๊ท ํ˜• ํ•„์š”

์‹ค์ „ ์ ์œ ์œจ ์ตœ์ ํ™” ์›์น™

์ฒด๊ณ„์  ์ ์œ ์œจ ๋ถ„์„ ์ ‘๊ทผ๋ฒ•:

1๋‹จ๊ณ„: ์ด๋ก ์  ํ•œ๊ณ„ ๊ณ„์‚ฐ

# GPU ์ŠคํŽ™ ํ™•์ธ
pixi run gpu-specs

2๋‹จ๊ณ„: ์‹ค์ œ ์‚ฌ์šฉ๋Ÿ‰ ํ”„๋กœํŒŒ์ผ๋ง

# ๋ฆฌ์†Œ์Šค ์†Œ๋น„๋Ÿ‰ ์ธก์ •
ncu --set=@occupancy --section=LaunchStats your_kernel

# ๋‹ฌ์„ฑ๋œ ์ ์œ ์œจ ์ธก์ •
ncu --metrics=smsp__warps_active.avg.pct_of_peak_sustained_active your_kernel

3๋‹จ๊ณ„: ์„ฑ๋Šฅ ๊ฒ€์ฆ

# ํ•ญ์ƒ ์‹ค์ œ ์„ฑ๋Šฅ ์ธก์ •์œผ๋กœ ๊ฒ€์ฆ
ncu --set=@roofline --section=MemoryWorkloadAnalysis your_kernel

๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์˜์‚ฌ๊ฒฐ์ • ํ”„๋ ˆ์ž„์›Œํฌ:

์ ์œ ์œจ ๋ถ„์„ โ†’ ์ตœ์ ํ™” ์ „๋žต:

๋†’์€ ์ ์œ ์œจ (>70%) + ์ข‹์€ ์„ฑ๋Šฅ:
โ†’ ์ ์œ ์œจ์€ ์ถฉ๋ถ„, ๋‹ค๋ฅธ ๋ณ‘๋ชฉ์— ์ง‘์ค‘

๋‚ฎ์€ ์ ์œ ์œจ (<30%) + ๋‚˜์œ ์„ฑ๋Šฅ:
โ†’ ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™”๋ฅผ ํ†ตํ•ด ์ ์œ ์œจ ํ–ฅ์ƒ ํ•„์š”

์ ๋‹นํ•œ ์ ์œ ์œจ (50-70%) + ๋‚˜์œ ์„ฑ๋Šฅ:
โ†’ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ, ์บ์‹œ, ์—ฐ์‚ฐ ๋ณ‘๋ชฉ ์กฐ์‚ฌ ํ•„์š”

๋‚ฎ์€ ์ ์œ ์œจ (<30%) + ์ข‹์€ ์„ฑ๋Šฅ:
โ†’ ์›Œํฌ๋กœ๋“œ๊ฐ€ ๋†’์€ ์ ์œ ์œจ์„ ํ•„์š”๋กœ ํ•˜์ง€ ์•Š์Œ (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ)

์‹ค์šฉ์ ์ธ ์ ์œ ์œจ ์ตœ์ ํ™” ๊ธฐ๋ฒ•

๋ ˆ์ง€์Šคํ„ฐ ์ตœ์ ํ™”:

  • ์ ์ ˆํ•œ ๋ฐ์ดํ„ฐ ํƒ€์ž… ์‚ฌ์šฉ: float32 vs float64, int32 vs int64
  • ์ค‘๊ฐ„ ๋ณ€์ˆ˜ ์ตœ์†Œํ™”: ์ปดํŒŒ์ผ๋Ÿฌ๊ฐ€ ์ž„์‹œ ์ €์žฅ์†Œ๋ฅผ ์ตœ์ ํ™”ํ•˜๋„๋ก ๋งก๊ธฐ๊ธฐ
  • ๋ฃจํ”„ ์ „๊ฐœ ๊ณ ๋ ค: ์ ์œ ์œจ๊ณผ ๋ช…๋ น์–ด ์ˆ˜์ค€ ๋ณ‘๋ ฌ์„ฑ ์‚ฌ์ด์˜ ๊ท ํ˜•

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”:

  • ํ•„์š”ํ•œ ํฌ๊ธฐ ๊ณ„์‚ฐ: ๊ณผ๋‹ค ํ• ๋‹น ๋ฐฉ์ง€
  • ํƒ€์ผ๋ง ์ „๋žต ๊ณ ๋ ค: ์ ์œ ์œจ๊ณผ ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ์‚ฌ์ด์˜ ๊ท ํ˜•
  • ๋ฑ…ํฌ ์ถฉ๋Œ ํšŒํ”ผ: ์ถฉ๋Œ ์—†๋Š” ์ ‘๊ทผ ํŒจํ„ด ์„ค๊ณ„

๋ธ”๋ก ํฌ๊ธฐ ํŠœ๋‹:

  • ์—ฌ๋Ÿฌ ์„ค์ • ํ…Œ์ŠคํŠธ: ๋ธ”๋ก๋‹น 256, 512, 1024 ์Šค๋ ˆ๋“œ
  • ์›Œํ”„ ํ™œ์šฉ ๊ณ ๋ ค: ๊ฐ€๋Šฅํ•˜๋ฉด ๋ถˆ์™„์ „ํ•œ ์›Œํ”„ ๋ฐฉ์ง€
  • ์ ์œ ์œจ๊ณผ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์˜ ๊ท ํ˜•: ๋ธ”๋ก์ด ํด์ˆ˜๋ก ๋ฆฌ์†Œ์Šค ํ•œ๊ณ„์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ์Œ

ํ•ต์‹ฌ ์ •๋ฆฌ: A10G ๋ฏธ์Šคํ„ฐ๋ฆฌ์—์„œ ๋ณดํŽธ์  ์›์น™์œผ๋กœ

์ด A10G ์ ์œ ์œจ ์กฐ์‚ฌ๋Š” ๋ชจ๋“  GPU ์ตœ์ ํ™”์— ์ ์šฉ๋˜๋Š” ๋ช…ํ™•ํ•œ ํ†ต์ฐฐ์˜ ์ง„ํ–‰์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

A10G ๋ฐœ๊ฒฌ ๊ณผ์ •:

  1. ์Šค๋ ˆ๋“œ ์ œํ•œ์ด ๋ชจ๋“  ๊ฒƒ์„ ์ง€๋ฐฐ - 19 vs 40 ๋ ˆ์ง€์Šคํ„ฐ, 0KB vs 49KB ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฐจ์ด์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , A10G์˜ 1,536 ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰ ๋•Œ๋ฌธ์— ๋ชจ๋“  ์ปค๋„์ด SM๋‹น 1๋ธ”๋ก์ด๋ผ๋Š” ๋™์ผํ•œ ์ œํ•œ์— ๊ฑธ๋ฆผ
  2. ์ด๋ก ์ด ํ˜„์‹ค๊ณผ ๊ทผ์ ‘ํ•˜๊ฒŒ ์ผ์น˜ - 66.7% ์ด๋ก  vs ~65% ์‹ค์ธก ์ ์œ ์œจ์€ ์˜ฌ๋ฐ”๋ฅธ ๋ณ‘๋ชฉ์„ ์‹๋ณ„ํ–ˆ์„ ๋•Œ ๊ณ„์‚ฐ์ด ์œ ํšจํ•จ์„ ๋ณด์—ฌ์คŒ
  3. ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์ด ์„ฑ๋Šฅ์„ ์ง€๋ฐฐ - ๋™์ผํ•œ 66.7% ์ ์œ ์œจ์—์„œ, SAXPY์˜ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ํŠน์„ฑ(600 GB/s ํฌํ™”)์ด ๋ฆฌ์†Œ์Šค ์ฐจ์ด์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ๋™์ผํ•œ ์„ฑ๋Šฅ์„ ์„ค๋ช…

๋ณดํŽธ์ ์ธ GPU ์ตœ์ ํ™” ์›์น™:

์ง„์งœ ๋ณ‘๋ชฉ ์‹๋ณ„ํ•˜๊ธฐ:

  • ๋ชจ๋“  ๋ฆฌ์†Œ์Šค์—์„œ ์ ์œ ์œจ ์ œํ•œ์„ ๊ณ„์‚ฐ: ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰
  • ๊ฐ€์žฅ ์ œํ•œ์ ์ธ ์š”์†Œ๊ฐ€ ๊ฒฐ์ •์  - ๋ ˆ์ง€์Šคํ„ฐ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ํ•ญ์ƒ ๋ณ‘๋ชฉ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜์ง€ ๋ง ๊ฒƒ
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ(SAXPY ๊ฐ™์€)๋Š” ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธธ ๋งŒํผ ์ถฉ๋ถ„ํ•œ ์Šค๋ ˆ๋“œ๋งŒ ํ™•๋ณด๋˜๋ฉด ์ ์œ ์œจ์ด ์•„๋‹Œ ๋Œ€์—ญํญ์ด ์ œํ•œ ์š”์ธ

์ ์œ ์œจ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ vs ์ค‘์š”ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ:

  • ๋†’์€ ์ ์œ ์œจ์ด ์ค‘์š”: ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„(GEMM, ๊ณผํ•™ ์‹œ๋ฎฌ๋ ˆ์ด์…˜)์—์„œ ALU ํŒŒ์ดํ”„๋ผ์ธ์ด ๋ฉˆ์ถ”๋Š” ์‹œ๊ฐ„์„ ๋‹ค๋ฅธ ์›Œํ”„ ์‹คํ–‰์œผ๋กœ ์ˆจ๊ฒจ์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ
  • ์ ์œ ์œจ์ด ๋œ ์ค‘์š”: ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฐ์‚ฐ(BLAS Level 1, ๋ฉ”๋ชจ๋ฆฌ ๋ณต์‚ฌ)์—์„œ ์ ์œ ์œจ์ด ์ œํ•œ ์š”์ธ์ด ๋˜๊ธฐ ์ „์— ๋Œ€์—ญํญ์ด ํฌํ™”๋˜๋Š” ๊ฒฝ์šฐ
  • ์ ์ • ์ˆ˜์ค€: 60-70% ์ ์œ ์œจ์ด๋ฉด ๋Œ€๊ธฐ ์‹œ๊ฐ„์„ ์ˆจ๊ธฐ๊ธฐ์— ์ถฉ๋ถ„ - ๊ทธ ์ด์ƒ์€ ์ง„์งœ ๋ณ‘๋ชฉ์— ์ง‘์ค‘

์‹ค์ „ ์ตœ์ ํ™” ์›Œํฌํ”Œ๋กœ์šฐ:

  1. ๋จผ์ € ํ”„๋กœํŒŒ์ผ๋ง (ncu --set=@occupancy) - ์‹ค์ œ ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ๋Ÿ‰๊ณผ ์ ์œ ์œจ ์ธก์ •
  2. ์ด๋ก ์  ํ•œ๊ณ„ ๊ณ„์‚ฐ - GPU ์ŠคํŽ™ ํ™œ์šฉ (pixi run gpu-specs)
  3. ์ง€๋ฐฐ์  ์ œ์•ฝ ์‹๋ณ„ - ๋ ˆ์ง€์Šคํ„ฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ, ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰, ๋˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ
  4. ๋ณ‘๋ชฉ ์ตœ์ ํ™” - ์ œํ•œ ์š”์ธ์ด ์•„๋‹Œ ๋ฆฌ์†Œ์Šค์— ์‹œ๊ฐ„ ๋‚ญ๋น„ํ•˜์ง€ ์•Š๊ธฐ
  5. ์ข…๋‹จ๊ฐ„ ์„ฑ๋Šฅ์œผ๋กœ ๊ฒ€์ฆ - ์ ์œ ์œจ์€ ์„ฑ๋Šฅ์„ ์œ„ํ•œ ์ˆ˜๋‹จ์ด์ง€ ๋ชฉํ‘œ๊ฐ€ ์•„๋‹˜

A10G ์‚ฌ๋ก€๋Š” ์ฒด๊ณ„์  ๋ณ‘๋ชฉ ๋ถ„์„์ด ์ง๊ด€๋ณด๋‹ค ๋‚ซ๋‹ค๋Š” ๊ฒƒ์„ ์™„๋ฒฝํ•˜๊ฒŒ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰์ด ์ง€๋ฐฐ์ ์ด์—ˆ๊ธฐ์— Sophisticated ์ปค๋„์˜ ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ•์€ ๋ฌด๊ด€ํ–ˆ๊ณ , ๋™์ผํ•œ ์ ์œ ์œจ๊ณผ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํฌํ™”๊ฐ€ ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ์™„์ „ํžˆ ์„ค๋ช…ํ•ด์ค๋‹ˆ๋‹ค.

Puzzle 32: ๋ฑ…ํฌ ์ถฉ๋Œ

์ด ํผ์ฆ์ด ์ค‘์š”ํ•œ ์ด์œ 

์„ฑ๋Šฅ ์ตœ์ ํ™” 3๋ถ€์ž‘์˜ ์™„๊ฒฐ: Puzzle 30์—์„œ GPU ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ๋ฐฐ์šฐ๊ณ , Puzzle 31์—์„œ ์ ์œ ์œจ ์ตœ์ ํ™”๋ฅผ ์ดํ•ดํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ์„ฑ๋Šฅ ์ตœ์ ํ™” ํผ์ฆ์˜ ๋งˆ์ง€๋ง‰ ์กฐ๊ฐ์„ ๋งž์ถœ ์ค€๋น„๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ.

์ˆจ๊ฒจ์ง„ ์„ฑ๋Šฅ ํ•จ์ •: ์™„๋ฒฝํ•œ ์ ์œ ์œจ, ์ตœ์ ์˜ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ, ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ์„ ๊ฐ–์ถ˜ GPU ์ปค๋„์„ ์ž‘์„ฑํ•˜๊ณ ๋„ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹ ๋•Œ๋ฌธ์— ๊ทน์ ์ธ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๊ฒฝํ—˜ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๊ฐ€์žฅ ๋ฏธ๋ฌ˜ํ•˜๋ฉด์„œ๋„ ์˜ํ–ฅ๋ ฅ์ด ํฐ ์„ฑ๋Šฅ ํ•จ์ • ์ค‘ ํ•˜๋‚˜์ž…๋‹ˆ๋‹ค.

ํ•™์Šต ์—ฌ์ •:

  • Puzzle 30์—์„œ๋Š” NSight ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•˜๊ณ  ์ง„๋‹จํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  • Puzzle 31์—์„œ๋Š” ์ ์œ ์œจ ๋ถ„์„์„ ํ†ตํ•ด ๋ฆฌ์†Œ์Šค ์‚ฌ์šฉ์„ ์˜ˆ์ธกํ•˜๊ณ  ์ œ์–ดํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  • Puzzle 32์—์„œ๋Š” ์ตœ๋Œ€ ํšจ์œจ์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค

GPU๋ฅผ ๋„˜์–ด์„œ ์ ์šฉ๋˜๋Š” ์›๋ฆฌ: ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น, ์ถฉ๋Œ ๊ฐ์ง€, ์ฒด๊ณ„์ ์ธ ์ ‘๊ทผ ํŒจํ„ด ์ตœ์ ํ™”์˜ ์›๋ฆฌ๋Š” CPU ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ๋ถ€ํ„ฐ ๋ถ„์‚ฐ ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜๊นŒ์ง€ ๋‹ค์–‘ํ•œ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ์‹œ์Šคํ…œ์— ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

์ฐธ๊ณ : ์ด ํผ์ฆ์€ NVIDIA GPU ์ „์šฉ์ž…๋‹ˆ๋‹ค

๋ฑ…ํฌ ์ถฉ๋Œ ๋ถ„์„์€ NVIDIA์˜ 32-๋ฑ…ํฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์™€ NSight Compute ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ตœ์ ํ™” ์›๋ฆฌ๋Š” ๋„๋ฆฌ ์ ์šฉ๋˜์ง€๋งŒ, ๊ตฌ์ฒด์ ์ธ ๊ธฐ๋ฒ•๊ณผ ์ธก์ • ๋ฐฉ๋ฒ•์€ NVIDIA CUDA์— ์ดˆ์ ์„ ๋งž์ถ”๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

๊ฐœ์š”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„ ๋‚ด์˜ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ๋™์‹œ์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•˜๋ฉฐ, ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ด๋Ÿฌํ•œ ์ ‘๊ทผ์„ ์ง๋ ฌํ™”ํ•˜๋„๋ก ๊ฐ•์ œํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์ผ ์‚ฌ์ดํด ๋ฉ”๋ชจ๋ฆฌ ์—ฐ์‚ฐ์ด์–ด์•ผ ํ•  ๊ฒƒ์ด ์—ฌ๋Ÿฌ ์‚ฌ์ดํด์˜ ์ง๋ ฌํ™”๋œ ์ ‘๊ทผ์œผ๋กœ ๋ฐ”๋€” ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฐœ๊ฒฌํ•˜๊ฒŒ ๋  ๊ฒƒ:

  • ํ•˜๋“œ์›จ์–ด ์ˆ˜์ค€์—์„œ GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น์ด ์ž‘๋™ํ•˜๋Š” ๋ฐฉ์‹
  • ๋™์ผํ•œ ์ปค๋„์ด ์™œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์—์„œ ํฌ๊ฒŒ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ๋Š”์ง€
  • ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๊ธฐ ์ „์— ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์˜ˆ์ธกํ•˜๊ณ  ์ธก์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•
  • ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์„ค๊ณ„ํ•˜๊ธฐ ์œ„ํ•œ ์ „๋ฌธ์ ์ธ ์ตœ์ ํ™” ์ „๋žต

ํƒ์ • ๋ฐฉ๋ฒ•๋ก : ์ด ํผ์ฆ์€ ์ด์ „ ์„ฑ๋Šฅ ํผ์ฆ๊ณผ ๋™์ผํ•œ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ ‘๊ทผ๋ฒ•์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค - ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋กœ ์ˆจ๊ฒจ์ง„ ๋น„ํšจ์œจ์„ ๋ฐํ˜€๋‚ธ ๋‹ค์Œ, ์ฒด๊ณ„์ ์ธ ์ตœ์ ํ™” ์›์น™์„ ์ ์šฉํ•˜์—ฌ ์ œ๊ฑฐํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ฐœ๋…

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•„ํ‚คํ…์ฒ˜์˜ ๊ธฐ์ดˆ:

  • 32-๋ฑ…ํฌ ์„ค๊ณ„: NVIDIA GPU๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ 32๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋ฑ…ํฌ๋กœ ๊ตฌ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ์ถฉ๋Œ ์œ ํ˜•: ์ถฉ๋Œ ์—†์Œ(์ตœ์ ), N-way ์ถฉ๋Œ(์ง๋ ฌํ™”), ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ(์ตœ์ ํ™”)
  • ์ ‘๊ทผ ํŒจํ„ด ์ˆ˜ํ•™: ๋ฑ…ํฌ ํ• ๋‹น ๊ณต์‹๊ณผ ์ถฉ๋Œ ์˜ˆ์ธก
  • ์„ฑ๋Šฅ ์˜ํ–ฅ: ์ตœ์ ์˜ 1์‚ฌ์ดํด ์ ‘๊ทผ๋ถ€ํ„ฐ ์ตœ์•…์˜ 32์‚ฌ์ดํด ์ง๋ ฌํ™”๊นŒ์ง€

์ „๋ฌธ์ ์ธ ์ตœ์ ํ™” ๊ธฐ์ˆ :

  • ํŒจํ„ด ๋ถ„์„: ๋ฑ…ํ‚น ๋™์ž‘์˜ ์ˆ˜ํ•™์  ์˜ˆ์ธก
  • ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก : ์ถฉ๋Œ ์ธก์ •์„ ์œ„ํ•œ NSight Compute ๋ฉ”ํŠธ๋ฆญ
  • ์„ค๊ณ„ ์›์น™: ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ํŒจํ„ด๊ณผ ์˜ˆ๋ฐฉ ์ „๋žต
  • ์„ฑ๋Šฅ ๊ฒ€์ฆ: ์ฒด๊ณ„์ ์ธ ์ธก์ •์„ ํ†ตํ•œ ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”

ํผ์ฆ ๊ตฌ์„ฑ

์ด ํผ์ฆ์€ ์ „๋ฌธ์„ฑ์„ ์ ์ง„์ ์œผ๋กœ ์Œ“์•„๊ฐ€๋Š” ๋‘ ๊ฐœ์˜ ์ƒํ˜ธ ๋ณด์™„์ ์ธ ์„น์…˜์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ“š ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ดํ•ดํ•˜๊ธฐ

๋ช…ํ™•ํ•œ ์„ค๋ช…๊ณผ ์‹ค์šฉ์ ์ธ ์˜ˆ์ œ๋ฅผ ํ†ตํ•ด GPU ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น์˜ ์ด๋ก ์  ๊ธฐ์ดˆ๋ฅผ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

๋ฐฐ์šฐ๊ฒŒ ๋  ๊ฒƒ:

  • NVIDIA์˜ 32-๋ฑ…ํฌ ์•„ํ‚คํ…์ฒ˜๊ฐ€ ๋ณ‘๋ ฌ ์ ‘๊ทผ์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ๋ฐฉ์‹
  • ๋ฑ…ํฌ ํ• ๋‹น๊ณผ ์ถฉ๋Œ ์˜ˆ์ธก์˜ ์ˆ˜ํ•™
  • ์ถฉ๋Œ ์œ ํ˜•๊ณผ ์„ฑ๋Šฅ์— ๋ฏธ์น˜๋Š” ์˜ํ–ฅ
  • ์ด์ „ ๊ฐœ๋…๊ณผ์˜ ์—ฐ๊ฒฐ (์›Œํ”„ ์‹คํ–‰, ์ ์œ ์œจ, ํ”„๋กœํŒŒ์ผ๋ง)

ํ•ต์‹ฌ ํ†ต์ฐฐ: ํ•˜๋“œ์›จ์–ด๋ฅผ ์ดํ•ดํ•˜๋ฉด ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์ „์— ์„ฑ๋Šฅ์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด

๋ฑ…ํ‚น ์ง€์‹์„ ํ™œ์šฉํ•˜์—ฌ ์ „๋ฌธ ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ด…๋‹ˆ๋‹ค.

ํƒ์ • ๋„์ „ ๊ณผ์ œ: ๋‘ ์ปค๋„์ด ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•˜์ง€๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์€ ๊ทน์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. NSight Compute๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•œ ์ปค๋„์€ ์ฒด๊ณ„์ ์ธ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ๊ฒช๊ณ  ๋‹ค๋ฅธ ์ปค๋„์€ ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑํ•˜๋Š” ์ด์œ ๋ฅผ ๋ฐํ˜€๋‚ด์„ธ์š”.

๊ธธ๋Ÿฌ์ง€๋Š” ์—ญ๋Ÿ‰: ํŒจํ„ด ๋ถ„์„, ์ถฉ๋Œ ์ธก์ •, ์ฒด๊ณ„์  ์ตœ์ ํ™”, ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์„ฑ๋Šฅ ๊ฐœ์„ .

์‹œ์ž‘ํ•˜๊ธฐ

ํ•™์Šต ๊ฒฝ๋กœ:

  1. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ดํ•ดํ•˜๊ธฐ - ์ด๋ก ์  ๊ธฐ์ดˆ ์Œ“๊ธฐ
  2. ์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด - ์‹ค์ „ ์ตœ์ ํ™”์— ํƒ์ • ์—ญ๋Ÿ‰ ์ ์šฉํ•˜๊ธฐ

์„ ์ˆ˜ ์กฐ๊ฑด:

  • Puzzle 30์—์„œ ์ตํžŒ GPU ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฝํ—˜
  • Puzzle 31์—์„œ ์ตํžŒ ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™” ์ดํ•ด
  • Puzzle 8๊ณผ Puzzle 16์—์„œ ์ตํžŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฒฝํ—˜

ํ•˜๋“œ์›จ์–ด ์š”๊ตฌ ์‚ฌํ•ญ:

์ตœ์ ํ™”์˜ ํšจ๊ณผ

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ:

  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ๋ง์„ ์‚ฌ์šฉํ•˜๋Š” ํ–‰๋ ฌ ๊ณฑ์…ˆ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์บ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์Šคํ…์‹ค ์—ฐ์‚ฐ
  • ์ŠคํŠธ๋ผ์ด๋“œ ๊ธฐ๋ฐ˜ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜๋Š” ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜

์ „๋ฌธ ์—ญ๋Ÿ‰ ๊ฐœ๋ฐœ:

  • ์ฒด๊ณ„์  ์ตœ์ ํ™”: ๊ทผ๊ฑฐ ๊ธฐ๋ฐ˜ ์„ฑ๋Šฅ ๊ฐœ์„  ๋ฐฉ๋ฒ•๋ก 
  • ํ•˜๋“œ์›จ์–ด ์ธ์‹: ์†Œํ”„ํŠธ์›จ์–ด๊ฐ€ ํ•˜๋“œ์›จ์–ด ์ œ์•ฝ์— ์–ด๋–ป๊ฒŒ ๋งคํ•‘๋˜๋Š”์ง€ ์ดํ•ด
  • ํŒจํ„ด ์ธ์‹: ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„์—์„œ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ ‘๊ทผ ํŒจํ„ด ์‹๋ณ„

ํ•™์Šต ์„ฑ๊ณผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์„ ์„ค๊ณ„, ์ธก์ •, ์ตœ์ ํ™”ํ•˜๋Š” ์—ญ๋Ÿ‰๊นŒ์ง€ ๊ฐ–์ถ”๋ฉด GPU ์„ฑ๋Šฅ ์ตœ์ ํ™” ๋„๊ตฌ ์„ธํŠธ๊ฐ€ ์™„์„ฑ๋ฉ๋‹ˆ๋‹ค - ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ์œ„ํ•œ ๋งˆ์ง€๋ง‰ ํผ์ฆ ์กฐ๊ฐ์ž…๋‹ˆ๋‹ค.

์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์—์„œ ์ ์œ ์œจ ๊ด€๋ฆฌ๋ฅผ ๊ฑฐ์ณ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น ํšจ์œจ๊นŒ์ง€, ์ด ํผ์ฆ์€ ์ตœ์ ์˜ GPU ์„ฑ๋Šฅ์„ ์œ„ํ•ด์„œ๋Š” ์—ฌ๋Ÿฌ ์ˆ˜์ค€์—์„œ ํ•˜๋“œ์›จ์–ด๋ฅผ ์ดํ•ดํ•ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

๐Ÿ“š ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ดํ•ดํ•˜๊ธฐ

์ง€๊ธˆ๊นŒ์ง€ ๋ฐฐ์šด ๊ฒƒ์„ ๋ฐ”ํƒ•์œผ๋กœ

GPU ์ตœ์ ํ™” ์—ฌ์ •์—์„œ ์ด๋ฏธ ๋งŽ์€ ๊ธธ์„ ๊ฑธ์–ด์™”์Šต๋‹ˆ๋‹ค. Puzzle 8์—์„œ๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๋ณด๋‹ค ํ›จ์”ฌ ๋น ๋ฅธ ๋ธ”๋ก ๋‚ด๋ถ€ ์ €์žฅ์†Œ๋ฅผ ์ œ๊ณตํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. Puzzle 16์—์„œ๋Š” ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ปค๋„์ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฐ์ดํ„ฐ ํƒ€์ผ์„ ์บ์‹ฑํ•˜๊ณ , ๋น„์šฉ์ด ํฐ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•์„ ํ™•์ธํ–ˆ์Šต๋‹ˆ๋‹ค.

ํ•˜์ง€๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์—๋Š” ๋ณ‘๋ ฌ ์—ฐ์‚ฐ์„ ์ง๋ ฌํ™”์‹œํ‚ฌ ์ˆ˜ ์žˆ๋Š” ์ˆจ๊ฒจ์ง„ ์„ฑ๋Šฅ ํ•จ์ •์ด ๋„์‚ฌ๋ฆฌ๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค: ๋ฑ…ํฌ ์ถฉ๋Œ.

์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ: ๊ฒ‰๋ณด๊ธฐ์— ๋™์ผํ•œ ๋ฐฉ์‹์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋‘ ์ปค๋„์„ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค - ๋‘˜ ๋‹ค ๊ฐ™์€ ์–‘์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , ์™„๋ฒฝํ•œ ์ ์œ ์œจ์„ ๊ฐ€์ง€๋ฉฐ, ๊ฒฝ์Ÿ ์ƒํƒœ๋„ ์—†์Šต๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ํ•˜๋‚˜๊ฐ€ ๋‹ค๋ฅธ ๊ฒƒ๋ณด๋‹ค 32๋ฐฐ ๋А๋ฆฝ๋‹ˆ๋‹ค. ๋ฒ”์ธ์€? ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ๋ž€?

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋ฑ…ํฌ๋ผ๊ณ  ๋ถˆ๋ฆฌ๋Š” 32๊ฐœ์˜ ๋…๋ฆฝ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์œ ๋‹›์˜ ์ง‘ํ•ฉ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜์„ธ์š”. ๊ฐ ๋ฑ…ํฌ๋Š” ํด๋ก ์‚ฌ์ดํด๋‹น ํ•˜๋‚˜์˜ ๋ฉ”๋ชจ๋ฆฌ ์š”์ฒญ์„ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฑ…ํ‚น ์‹œ์Šคํ…œ์ด ์กด์žฌํ•˜๋Š” ๊ทผ๋ณธ์ ์ธ ์ด์œ ๋Š” ํ•˜๋“œ์›จ์–ด ๋ณ‘๋ ฌ์„ฑ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

32๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ๊ตฌ์„ฑ๋œ ์›Œํ”„๊ฐ€ ๋™์‹œ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•ด์•ผ ํ•  ๋•Œ, ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•œ๋‹ค๋ฉด GPU๋Š” 32๊ฐœ์˜ ์š”์ฒญ์„ ๋ชจ๋‘ ๋ณ‘๋ ฌ๋กœ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋ ค ํ•˜๋ฉด ํ•˜๋“œ์›จ์–ด๋Š” ์ด๋ฅผ ์ง๋ ฌํ™”ํ•ด์•ผ ํ•˜๋ฏ€๋กœ, 1์‚ฌ์ดํด์ด๋ฉด ๋  ์—ฐ์‚ฐ์ด ์—ฌ๋Ÿฌ ์‚ฌ์ดํด๋กœ ๋Š˜์–ด๋‚ฉ๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ฃผ์†Œ ๋งคํ•‘

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๊ฐ 4๋ฐ”์ดํŠธ ์›Œ๋“œ๋Š” ๋‹ค์Œ ๊ณต์‹์— ๋”ฐ๋ผ ํŠน์ • ๋ฑ…ํฌ์— ๋ฐฐ์ •๋ฉ๋‹ˆ๋‹ค:

bank_id = (byte_address / 4) % 32

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ์ฒ˜์Œ 128๋ฐ”์ดํŠธ๊ฐ€ ๋ฑ…ํฌ์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

| Address Range | Bank ID | Example float32 Elements |
| ------------- | ------- | ------------------------ |
| 0-3 bytes     | Bank 0  | shared[0]                |
| 4-7 bytes     | Bank 1  | shared[1]                |
| 8-11 bytes    | Bank 2  | shared[2]                |
| โ€ฆ             | โ€ฆ       | โ€ฆ                        |
| 124-127 bytes | Bank 31 | shared[31]               |
| 128-131 bytes | Bank 0  | shared[32]               |
| 132-135 bytes | Bank 1  | shared[33]               |

ํ•ต์‹ฌ ํ†ต์ฐฐ: float32 ๋ฐฐ์—ด์—์„œ ๋ฑ…ํ‚น ํŒจํ„ด์€ 32๊ฐœ ์š”์†Œ๋งˆ๋‹ค ๋ฐ˜๋ณต๋˜๋ฉฐ, ์ด๋Š” 32๊ฐœ ์Šค๋ ˆ๋“œ๋กœ ๊ตฌ์„ฑ๋œ ์›Œํ”„ ํฌ๊ธฐ์™€ ์ •ํ™•ํžˆ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ์šฐ์—ฐ์ด ์•„๋‹™๋‹ˆ๋‹ค - ์ตœ์ ์˜ ๋ณ‘๋ ฌ ์ ‘๊ทผ์„ ์œ„ํ•ด ์„ค๊ณ„๋œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
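์œ„ ๋งคํ•‘ ๊ณต์‹์€ ๋ช‡ ์ค„์˜ ํŒŒ์ด์ฌ์œผ๋กœ ์ง์ ‘ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์„ค๋ช…์„ ์œ„ํ•œ ๊ฐ„๋‹จํ•œ ์Šค์ผ€์น˜์ด๋ฉฐ, `bank_id` ํ•จ์ˆ˜ ์ด๋ฆ„์€ ์ž„์˜๋กœ ์ •ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค:

```python
NUM_BANKS = 32
BANK_WIDTH_BYTES = 4  # ๊ฐ ๋ฑ…ํฌ๋Š” 4๋ฐ”์ดํŠธ ์›Œ๋“œ ๋‹จ์œ„๋กœ ์ธํ„ฐ๋ฆฌ๋น™๋จ


def bank_id(byte_address: int) -> int:
    """๋ณธ๋ฌธ์˜ ๊ณต์‹ bank_id = (byte_address / 4) % 32๋ฅผ ๊ทธ๋Œ€๋กœ ๊ณ„์‚ฐ."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


# float32 ์š”์†Œ i์˜ ๋ฐ”์ดํŠธ ์ฃผ์†Œ๋Š” i * 4
print(bank_id(0 * 4))    # shared[0]  -> Bank 0
print(bank_id(31 * 4))   # shared[31] -> Bank 31
print(bank_id(32 * 4))   # shared[32] -> Bank 0 (ํŒจํ„ด์ด 32์š”์†Œ๋งˆ๋‹ค ๋ฐ˜๋ณต)
```

์ถœ๋ ฅ์ด ์œ„ ํ‘œ์˜ ๋งคํ•‘๊ณผ ์ผ์น˜ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.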

๋ฑ…ํฌ ์ถฉ๋Œ์˜ ์œ ํ˜•

์ถฉ๋Œ ์—†์Œ: ์ด์ƒ์ ์ธ ๊ฒฝ์šฐ

์›Œํ”„ ๋‚ด ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋ฉด 32๊ฐœ์˜ ์ ‘๊ทผ์ด ๋ชจ๋‘ 1์‚ฌ์ดํด์— ์™„๋ฃŒ๋ฉ๋‹ˆ๋‹ค:

# Perfect case: each thread accesses a different bank
shared[thread_idx.x]  # Thread 0โ†’Bank 0, Thread 1โ†’Bank 1, ..., Thread 31โ†’Bank 31

๊ฒฐ๊ณผ: 32๊ฐœ ๋ณ‘๋ ฌ ์ ‘๊ทผ, ์ด 1์‚ฌ์ดํด

N-way ๋ฑ…ํฌ ์ถฉ๋Œ

N๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ™์€ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ์ ‘๊ทผํ•˜๋ฉด ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ ‘๊ทผ์„ ์ง๋ ฌํ™”ํ•ฉ๋‹ˆ๋‹ค:

# 2-way conflict: stride-2 access pattern
shared[thread_idx.x * 2]  # Thread 0,16โ†’Bank 0; Thread 1,17โ†’Bank 1; etc.

๊ฒฐ๊ณผ: ๋ฑ…ํฌ๋‹น 2ํšŒ ์ ‘๊ทผ, ์ด 2์‚ฌ์ดํด (ํšจ์œจ 50%)

# Worst case: all threads access different addresses in Bank 0
shared[thread_idx.x * 32]  # All threadsโ†’Bank 0

๊ฒฐ๊ณผ: 32ํšŒ ์ง๋ ฌํ™”๋œ ์ ‘๊ทผ, ์ด 32์‚ฌ์ดํด (ํšจ์œจ 3%)

๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์˜ˆ์™ธ

์ถฉ๋Œ ๊ทœ์น™์—๋Š” ํ•œ ๊ฐ€์ง€ ์ค‘์š”ํ•œ ์˜ˆ์™ธ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค: ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ ‘๊ทผ. ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ์ฃผ์†Œ๋ฅผ ์ฝ์œผ๋ฉด ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ด๋ฅผ ๋‹จ์ผ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค:

# Broadcast: all threads read the same value
constant = shared[0]  # All threads read shared[0]

๊ฒฐ๊ณผ: 1ํšŒ ์ ‘๊ทผ์œผ๋กœ 32๊ฐœ ์Šค๋ ˆ๋“œ์— ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ, ์ด 1์‚ฌ์ดํด

์ด ์ตœ์ ํ™”๊ฐ€ ์กด์žฌํ•˜๋Š” ์ด์œ ๋Š” ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ๊ฐ€ ํ”ํ•œ ํŒจํ„ด(์ƒ์ˆ˜ ๋กœ๋”ฉ, ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ ๋“ฑ)์ด๊ณ , ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ถ”๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์—†์ด ๋‹จ์ผ ๊ฐ’์„ ๋ชจ๋“  ์Šค๋ ˆ๋“œ์— ๋ณต์ œํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ์ด ์ค‘์š”ํ•œ ์ด์œ 

์„ฑ๋Šฅ ์˜ํ–ฅ

๋ฑ…ํฌ ์ถฉ๋Œ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์‹œ๊ฐ„์„ ์ง์ ‘์ ์œผ๋กœ ๋ฐฐ๊ฐ€์‹œํ‚ต๋‹ˆ๋‹ค:

์ถฉ๋Œ ์œ ํ˜•์ ‘๊ทผ ์‹œ๊ฐ„ํšจ์œจ์„ฑ๋Šฅ ์˜ํ–ฅ
์ถฉ๋Œ ์—†์Œ1์‚ฌ์ดํด100%๊ธฐ์ค€์„ 
2-way conflict2์‚ฌ์ดํด50%2๋ฐฐ ๋А๋ฆผ
4-way conflict4์‚ฌ์ดํด25%4๋ฐฐ ๋А๋ฆผ
32-way conflict32์‚ฌ์ดํด3%32๋ฐฐ ๋А๋ฆผ

์‹ค์ „ ๋งฅ๋ฝ

Puzzle 30์—์„œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด ๊ทน์ ์ธ ์„ฑ๋Šฅ ์ฐจ์ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์ด ์›๋ฆฌ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ˆ˜์ค€์—์„œ ์ž‘๋™ํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค.

์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘ํ•ฉ์ด DRAM ๋Œ€์—ญํญ ํ™œ์šฉ์— ์˜ํ–ฅ์„ ์ฃผ๋Š” ๊ฒƒ์ฒ˜๋Ÿผ, ๋ฑ…ํฌ ์ถฉ๋Œ์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์— ์˜ํ–ฅ์„ ์ค๋‹ˆ๋‹ค. ์ฐจ์ด๋Š” ๊ทœ๋ชจ์— ์žˆ์Šต๋‹ˆ๋‹ค: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์€ ์ˆ˜๋ฐฑ ์‚ฌ์ดํด์ด์ง€๋งŒ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋Œ์€ ์ ‘๊ทผ๋‹น ๋ช‡ ์‚ฌ์ดํด๋งŒ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์ง‘์ค‘์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„์—์„œ๋Š” ์ด โ€œ๋ช‡ ์‚ฌ์ดํดโ€œ์ด ๋น ๋ฅด๊ฒŒ ๋ˆ„์ ๋ฉ๋‹ˆ๋‹ค.

์›Œํ”„ ์‹คํ–‰๊ณผ์˜ ๊ด€๊ณ„

Puzzle 24์—์„œ ์›Œํ”„๊ฐ€ SIMT(Single Instruction, Multiple Thread) ๋ฐฉ์‹์œผ๋กœ ์‹คํ–‰๋œ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ์›Œํ”„๊ฐ€ ๋ฑ…ํฌ ์ถฉ๋Œ์— ๋ถ€๋”ชํžˆ๋ฉด ์ง๋ ฌํ™”๋œ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ 32๊ฐœ ์Šค๋ ˆ๋“œ ๋ชจ๋‘๊ฐ€ ๋Œ€๊ธฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋Œ€๊ธฐ ์‹œ๊ฐ„์€ ์ถฉ๋Œ์„ ์ผ์œผํ‚จ ์Šค๋ ˆ๋“œ๋งŒ์ด ์•„๋‹ˆ๋ผ ์›Œํ”„ ์ „์ฒด์˜ ์ง„ํ–‰์— ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค.

์ด๋Š” Puzzle 31์˜ ์ ์œ ์œจ ๊ฐœ๋…๊ณผ ์—ฐ๊ฒฐ๋ฉ๋‹ˆ๋‹ค: ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ฐ ์‹œ๊ฐ„์„ ํšจ๊ณผ์ ์œผ๋กœ ์ˆจ๊ธฐ๋Š” ๊ฒƒ์„ ๋ฐฉํ•ดํ•˜์—ฌ, ๋†’์€ ์ ์œ ์œจ์˜ ์‹ค์งˆ์ ์ธ ์ด์ ์„ ์ค„์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ ๊ฐ์ง€ํ•˜๊ธฐ

์‹œ๊ฐ์  ํŒจํ„ด ์ธ์‹

์ ‘๊ทผ ํŒจํ„ด์„ ๋ถ„์„ํ•˜๋ฉด ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค:

์ˆœ์ฐจ ์ ‘๊ทผ (์ถฉ๋Œ ์—†์Œ):

# Thread ID:  0  1  2  3  ...  31
# Address:    0  4  8 12  ... 124
# Bank:       0  1  2  3  ...  31  โœ… All different banks

Stride-2 ์ ‘๊ทผ (2-way conflict):

# Thread ID:  0  1  2  3  ...  15  16  17  18 ...  31
# Address:    0  8 16 24  ... 120 128 136 144 ... 248
# Bank:       0  2  4  6  ...  30   0   2   4 ...  30
# Conflict:   Banks 0,2,4,...,30 serve 2 threads each  โŒ

Stride-32 ์ ‘๊ทผ (32-way conflict):

# Thread ID:  0   1   2   3  ...  31
# Address:    0  128 256 384 ... 3968
# Bank:       0   0   0   0  ...   0  โŒ All threadsโ†’Bank 0
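์„ธ ํŒจํ„ด์˜ ์ฐจ์ด๋Š” ์ŠคํŠธ๋ผ์ด๋“œ ํ•˜๋‚˜๋ฟ์ž…๋‹ˆ๋‹ค. ์ž„์˜์˜ ์ŠคํŠธ๋ผ์ด๋“œ์— ๋Œ€ํ•ด ์ถฉ๋Œ ์ •๋„๋ฅผ ๊ณ„์‚ฐํ•ด ์ฃผ๋Š” ์Šค์ผ€์น˜๋ฅผ ๋งŒ๋“ค์–ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. `conflict_degree`๋Š” ์„ค๋ช…์šฉ์œผ๋กœ ์ง€์–ด๋‚ธ ์ด๋ฆ„์ด๋ฉฐ, float32 ๋ฐฐ์—ด(์š”์†Œ๋‹น ๋ฑ…ํฌ 1์นธ)์„ ๊ฐ€์ •ํ•ฉ๋‹ˆ๋‹ค:

```python
def conflict_degree(stride: int, warp_size: int = 32, num_banks: int = 32) -> int:
    """float32 ๋ฐฐ์—ด์—์„œ shared[t * stride] ์ ‘๊ทผ ์‹œ ํ•œ ๋ฑ…ํฌ์— ๋ชฐ๋ฆฌ๋Š” ์ตœ๋Œ€ ์Šค๋ ˆ๋“œ ์ˆ˜."""
    banks = [(t * stride) % num_banks for t in range(warp_size)]
    return max(banks.count(b) for b in set(banks))


for s in (1, 2, 3, 4, 8, 16, 32, 33):
    print(s, conflict_degree(s))
# stride 1, 3, 33์ฒ˜๋Ÿผ 32์™€ ์„œ๋กœ์†Œ์ธ ์ŠคํŠธ๋ผ์ด๋“œ๋Š” ์ถฉ๋Œ์ด ์—†๊ณ ,
# 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ ์ŠคํŠธ๋ผ์ด๋“œ๋Š” gcd(stride, 32)-way ์ถฉ๋Œ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค
```

stride-1์€ 1(์ถฉ๋Œ ์—†์Œ), stride-2๋Š” 2, stride-32๋Š” 32๊ฐ€ ๋‚˜์™€ ์œ„์˜ ์„ธ ์ฃผ์„ ๋ธ”๋ก๊ณผ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค.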

NSight Compute(ncu)๋ฅผ ์‚ฌ์šฉํ•œ ํ”„๋กœํŒŒ์ผ๋ง

Puzzle 30์—์„œ ๋ฐฐ์šด ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ๋ฐ”ํƒ•์œผ๋กœ, ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์ •๋Ÿ‰์ ์œผ๋กœ ์ธก์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

# Key metrics for shared memory bank conflicts
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st your_kernel

# Additional context metrics
ncu --metrics=smsp__sass_average_branch_targets_threads_uniform.pct your_kernel
ncu --metrics=smsp__warps_issue_stalled_membar_per_warp_active.pct your_kernel

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld์™€ l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st ๋ฉ”ํŠธ๋ฆญ์€ ์ปค๋„ ์‹คํ–‰ ์ค‘ ๋กœ๋“œ ๋ฐ ์Šคํ† ์–ด ์—ฐ์‚ฐ์˜ ๋ฑ…ํฌ ์ถฉ๋Œ ํšŸ์ˆ˜๋ฅผ ์ง์ ‘ ์นด์šดํŠธํ•ฉ๋‹ˆ๋‹ค. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšŸ์ˆ˜์™€ ๊ฒฐํ•ฉํ•˜๋ฉด ์ถฉ๋Œ ๋น„์œจ์„ ๊ตฌํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ์ด๋Š” ํ•ต์‹ฌ์ ์ธ ์„ฑ๋Šฅ ์ง€ํ‘œ์ž…๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ

์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„

๋ฑ…ํฌ ์ถฉ๋Œ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ปค๋„์—์„œ ๊ฐ€์žฅ ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นฉ๋‹ˆ๋‹ค:

  • ํƒ€์ดํŠธํ•œ ๋ฃจํ”„ ์•ˆ์—์„œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ž์ฃผ ์ ‘๊ทผํ•˜๋Š” ๊ฒฝ์šฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๋‹น ์—ฐ์‚ฐ๋Ÿ‰์ด ์ ์€ ๊ฒฝ์šฐ
  • ์ปค๋„์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ธ ๊ฒฝ์šฐ

๋Œ€ํ‘œ์ ์ธ ์‹œ๋‚˜๋ฆฌ์˜ค:

  • ํ–‰๋ ฌ ๊ณฑ์…ˆ ๋‚ด๋ถ€ ๋ฃจํ”„ (Puzzle 16์˜ ํƒ€์ผ๋ง ๋ฒ„์ „๊ณผ ๊ฐ™์€)
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์บ์‹ฑ์„ ์‚ฌ์šฉํ•˜๋Š” ์Šคํ…์‹ค ์—ฐ์‚ฐ
  • ๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜ ์—ฐ์‚ฐ

๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ vs ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„

Puzzle 31์—์„œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ์—์„œ๋Š” ์ ์œ ์œจ์ด ๋œ ์ค‘์š”ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์•˜๋“ฏ์ด, ์ปค๋„์ด ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ์— ๋ณ‘๋ชฉ์ด ๊ฑธ๋ฆฌ๊ฑฐ๋‚˜ ์‚ฐ์ˆ  ๊ฐ•๋„๊ฐ€ ๋งค์šฐ ๋‚ฎ์€ ๊ฒฝ์šฐ์—๋Š” ๋ฑ…ํฌ ์ถฉ๋Œ์˜ ์˜ํ–ฅ๋„ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฌ๋‚˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋งŽ์€ ์ปค๋„์€ ๋ฐ”๋กœ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์—์„œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๋กœ ์ „ํ™˜ํ•˜๊ธฐ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฐ ๊ฒฝ์šฐ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์• ์ดˆ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋„์ž…ํ•œ ์ด์œ ์˜€๋˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•˜์ง€ ๋ชปํ•˜๊ฒŒ ๋งŒ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์•ž์œผ๋กœ์˜ ๋ฐฉํ–ฅ

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น์„ ์ดํ•ดํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ธฐ์ดˆ๋ฅผ ๊ฐ–์ถ”๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

  1. ์ ‘๊ทผ ํŒจํ„ด์„ ๋ถ„์„ํ•˜์—ฌ ์ฝ”๋“œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ ์ „์— ์„ฑ๋Šฅ์„ ์˜ˆ์ธก
  2. ์ฒด๊ณ„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ์ ‘๊ทผ๋ฒ•์œผ๋กœ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ์ง„๋‹จ
  3. ๋†’์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ฒ˜๋ฆฌ๋Ÿ‰์„ ์œ ์ง€ํ•˜๋Š” ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„
  4. ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„์™€ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ ์‚ฌ์ด์˜ ๊ท ํ˜• ์žกํžŒ ํŒ๋‹จ

๋‹ค์Œ ์„น์…˜์—์„œ๋Š” ์ด ์ง€์‹์„ ์‹ค์Šต์— ์ ์šฉํ•˜์—ฌ ์ผ๋ฐ˜์ ์ธ ์ถฉ๋Œ ํŒจํ„ด๊ณผ ํ•ด๊ฒฐ์ฑ…์„ ์ง์ ‘ ๋‹ค๋ค„๋ด…๋‹ˆ๋‹ค - ์ด๋ก ์  ์ดํ•ด๋ฅผ ์‹ค์ „ ์ตœ์ ํ™” ์—ญ๋Ÿ‰์œผ๋กœ ๋ฐ”๊พธ๋Š” ๊ณผ์ •์ž…๋‹ˆ๋‹ค.

์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด

์ฐธ๊ณ : ์ด ์„น์…˜์€ NVIDIA GPU ์ „์šฉ์ž…๋‹ˆ๋‹ค

์—ฌ๊ธฐ์„œ ๋‹ค๋ฃจ๋Š” ๋ฑ…ํฌ ์ถฉ๋Œ ๋ถ„์„๊ณผ ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋ฒ•์€ NVIDIA GPU์— ํŠนํ™”๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. ํ”„๋กœํŒŒ์ผ๋ง ๋ช…๋ น์€ NVIDIA CUDA ํˆดํ‚ท์— ํฌํ•จ๋œ NSight Compute ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง ์—ญ๋Ÿ‰์„ ๋ฐ”ํƒ•์œผ๋กœ

Puzzle 30์—์„œ GPU ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ๋ฅผ ๋ฐฐ์šฐ๊ณ , Puzzle 31์—์„œ ๋ฆฌ์†Œ์Šค ์ตœ์ ํ™”๋ฅผ ์ดํ•ดํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด์ œ ๋ฐฐ์šด ํƒ์ • ๊ธฐ์ˆ ์„ ์ƒˆ๋กœ์šด ์„ฑ๋Šฅ ๋ฏธ์Šคํ„ฐ๋ฆฌ์— ์ ์šฉํ•  ์ฐจ๋ก€์ž…๋‹ˆ๋‹ค: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ.

ํƒ์ • ๋„์ „ ๊ณผ์ œ: ๋™์ผํ•œ ์ˆ˜ํ•™์  ์—ฐ์‚ฐ((input + 10) * 2)์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‘ GPU ์ปค๋„์ด ์žˆ์Šต๋‹ˆ๋‹ค. ๋‘˜ ๋‹ค ์ •ํ™•ํžˆ ๊ฐ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค. ๊ฐ™์€ ์–‘์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์ ์œ ์œจ๋„ ๋™์ผํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ํ•˜๋‚˜๋Š” ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐฉ์‹ ๋•Œ๋ฌธ์— ์ฒด๊ณ„์ ์ธ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๊ฒช์Šต๋‹ˆ๋‹ค.

์—ฌ๋Ÿฌ๋ถ„์˜ ์ž„๋ฌด: ์ง€๊ธˆ๊นŒ์ง€ ๋ฐฐ์šด ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์œผ๋กœ ์ด ์ˆจ๊ฒจ์ง„ ์„ฑ๋Šฅ ํ•จ์ •์„ ๋ฐํ˜€๋‚ด๊ณ , ์‹ค์ œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์—์„œ ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์–ธ์ œ ์ค‘์š”ํ•œ์ง€ ์ดํ•ดํ•˜์„ธ์š”.

๊ฐœ์š”

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์›Œํ”„ ๋‚ด์˜ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ๊ฐ€ ๋™์ผํ•œ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ์†Œ์— ๋™์‹œ์— ์ ‘๊ทผํ•  ๋•Œ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด ํƒ์ • ์‚ฌ๊ฑด์—์„œ๋Š” ๋Œ€์กฐ์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์„ ๊ฐ€์ง„ ๋‘ ์ปค๋„์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค:

comptime SIZE = 8 * 1024  # 8K elements - small enough to focus on shared memory patterns
comptime TPB = 256  # Threads per block - divisible by 32 (warp size)
comptime THREADS_PER_BLOCK = (TPB, 1)
comptime BLOCKS_PER_GRID = (SIZE // TPB, 1)
comptime dtype = DType.float32
comptime layout = row_major[SIZE]()
comptime LayoutType = type_of(layout)


def no_conflict_kernel(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Perfect shared memory access - no bank conflicts.

    Each thread accesses a different bank: thread_idx.x maps to bank thread_idx.x % 32.
    This achieves optimal shared memory bandwidth utilization.
    """

    # Shared memory buffer - each thread loads one element
    var shared_buf = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Load from global memory to shared memory - no conflicts
    if global_i < size:
        shared_buf[local_i] = (
            input[global_i] + 10.0
        )  # Add 10 as simple operation

    barrier()  # Synchronize shared memory writes

    # Read back from shared memory and write to output - no conflicts
    if global_i < size:
        output[global_i] = shared_buf[local_i] * 2.0  # Multiply by 2

    barrier()  # Ensure completion


def two_way_conflict_kernel(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Stride-2 shared memory access - creates 2-way bank conflicts.

    Threads 0,16 -> Bank 0, Threads 1,17 -> Bank 1, etc.
    Each bank serves 2 threads, doubling access time.
    """

    # Shared memory buffer - stride-2 access pattern creates conflicts
    var shared_buf = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB]())

    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # CONFLICT: stride-2 access creates 2-way bank conflicts
    var conflict_index = (local_i * 2) % TPB

    # Load with bank conflicts
    if global_i < size:
        shared_buf[conflict_index] = (
            input[global_i] + 10.0
        )  # Same operation as no-conflict

    barrier()  # Synchronize shared memory writes

    # Read back with same conflicts
    if global_i < size:
        output[global_i] = (
            shared_buf[conflict_index] * 2.0
        )  # Same operation as no-conflict

    barrier()  # Ensure completion


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p32/p32.mojo

๋ฏธ์Šคํ„ฐ๋ฆฌ: ์ด ์ปค๋„๋“ค์€ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•˜์ง€๋งŒ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํšจ์œจ์€ ๊ทน์ ์œผ๋กœ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. ์ฒด๊ณ„์ ์ธ ํ”„๋กœํŒŒ์ผ๋ง ๋ถ„์„์„ ํ†ตํ•ด ๊ทธ ์ด์œ ๋ฅผ ๋ฐํ˜€๋‚ด๋Š” ๊ฒƒ์ด ์ž„๋ฌด์ž…๋‹ˆ๋‹ค.

๊ตฌ์„ฑ

์š”๊ตฌ ์‚ฌํ•ญ:

  • Puzzle 30์˜ CUDA ํˆดํ‚ท๊ณผ NSight Compute๊ฐ€ ์„ค์น˜๋œ NVIDIA GPU
  • ์ด์ „ ์„น์…˜์—์„œ ๋‹ค๋ฃฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํ‚น ๊ฐœ๋…์— ๋Œ€ํ•œ ์ดํ•ด

์ปค๋„ ์„ค์ •:

comptime SIZE = 8 * 1024      # 8K elements - focus on shared memory patterns
comptime TPB = 256            # 256 threads per block (8 warps)
comptime BLOCKS_PER_GRID = (SIZE // TPB, 1)  # 32 blocks

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์ œํ•œ์ด ์•„๋‹Œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ๊ณผ๋ฅผ ๋ถ€๊ฐํ•˜๊ธฐ ์œ„ํ•ด ๋ฌธ์ œ ํฌ๊ธฐ๋ฅผ ์˜๋„์ ์œผ๋กœ ์ด์ „ ํผ์ฆ๋ณด๋‹ค ์ž‘๊ฒŒ ์„ค์ •ํ–ˆ์Šต๋‹ˆ๋‹ค.

์กฐ์‚ฌ ๊ณผ์ •

Step 1: ์ •ํ™•์„ฑ ๊ฒ€์ฆ

pixi shell -e nvidia
mojo problems/p32/p32.mojo --test

๋‘ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์ •ํ™•์„ฑ์ด ์•„๋‹Œ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ๋ฏธ์นœ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

Step 2: ์„ฑ๋Šฅ ๊ธฐ์ค€์„  ๋ฒค์น˜๋งˆํฌ

mojo problems/p32/p32.mojo --benchmark

์‹คํ–‰ ์‹œ๊ฐ„์„ ๊ธฐ๋กํ•˜์„ธ์š”. ์›Œํฌ๋กœ๋“œ๊ฐ€ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์— ์˜ํ•ด ์ง€๋ฐฐ๋˜๊ธฐ ๋•Œ๋ฌธ์— ๋น„์Šทํ•œ ์„ฑ๋Šฅ์ด ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์ง€๋งŒ, ๋ฑ…ํฌ ์ถฉ๋Œ์€ ํ”„๋กœํŒŒ์ผ๋ง ๋ฉ”ํŠธ๋ฆญ์„ ํ†ตํ•ด ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค.

Step 3: ํ”„๋กœํŒŒ์ผ๋ง์šฉ ๋นŒ๋“œ

mojo build --debug-level=full problems/p32/p32.mojo -o problems/p32/p32_profiler

Step 4: ๋ฑ…ํฌ ์ถฉ๋Œ ํ”„๋กœํŒŒ์ผ๋ง

NSight Compute๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์ •๋Ÿ‰์ ์œผ๋กœ ์ธก์ •ํ•ฉ๋‹ˆ๋‹ค:

# Profile no-conflict kernel
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st problems/p32/p32_profiler --no-conflict

๊ทธ๋ฆฌ๊ณ 

# Profile two-way conflict kernel
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st problems/p32/p32_profiler --two-way

๊ธฐ๋กํ•  ํ•ต์‹ฌ ๋ฉ”ํŠธ๋ฆญ:

  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum - ๋กœ๋“œ ์ถฉ๋Œ
  • l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum - ์Šคํ† ์–ด ์ถฉ๋Œ

Step 5: ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„

ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ๋ฅผ ๋ฐ”ํƒ•์œผ๋กœ ์ˆ˜ํ•™์  ์ ‘๊ทผ ํŒจํ„ด์„ ๋ถ„์„ํ•ฉ๋‹ˆ๋‹ค:

์ถฉ๋Œ ์—†๋Š” ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด:

# Thread mapping: thread_idx.x directly maps to shared memory index
shared_buf[thread_idx.x]  # Thread 0โ†’Index 0, Thread 1โ†’Index 1, etc.
# Bank mapping: Index % 32 = Bank ID
# Result: Thread 0โ†’Bank 0, Thread 1โ†’Bank 1, ..., Thread 31โ†’Bank 31

2-way ์ถฉ๋Œ ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด:

# Thread mapping with stride-2 modulo operation
shared_buf[(thread_idx.x * 2) % TPB]
# For threads 0-31: indices 0,2,4,...,62 (the % TPB wrap only kicks in for threads >= 128)
# Bank mapping examples:
# Thread 0  โ†’ Index 0   โ†’ Bank 0
# Thread 16 โ†’ Index 32  โ†’ Bank 0  (conflict!)
# Thread 1  โ†’ Index 2   โ†’ Bank 2
# Thread 17 โ†’ Index 34  โ†’ Bank 2  (conflict!)

๋„์ „ ๊ณผ์ œ: ๋ฑ…ํฌ ์ถฉ๋Œ ๋ฏธ์Šคํ„ฐ๋ฆฌ๋ฅผ ํ’€์–ด๋ณด์„ธ์š”

์œ„์˜ ์กฐ์‚ฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ•œ ํ›„, ๋‹ค์Œ ๋ถ„์„ ์งˆ๋ฌธ์— ๋‹ตํ•˜์„ธ์š”:

์„ฑ๋Šฅ ๋ถ„์„ (Step 1-2)

  1. ๋‘ ์ปค๋„์ด ๋™์ผํ•œ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋‚˜์š”?
  2. ์ปค๋„ ๊ฐ„ ์‹คํ–‰ ์‹œ๊ฐ„ ์ฐจ์ด๊ฐ€ ์žˆ๋‚˜์š”?
  3. ์ ‘๊ทผ ํŒจํ„ด์ด ๋‹ค๋ฅธ๋ฐ๋„ ์„ฑ๋Šฅ์ด ๋น„์Šทํ•  ์ˆ˜ ์žˆ๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

๋ฑ…ํฌ ์ถฉ๋Œ ํ”„๋กœํŒŒ์ผ๋ง (Step 4)

  1. ์ถฉ๋Œ ์—†๋Š” ์ปค๋„์€ ๋กœ๋“œ์™€ ์Šคํ† ์–ด์—์„œ ๋ช‡ ๊ฑด์˜ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ๋ฐœ์ƒ์‹œํ‚ค๋‚˜์š”?
  2. 2-way ์ถฉ๋Œ ์ปค๋„์€ ๋กœ๋“œ์™€ ์Šคํ† ์–ด์—์„œ ๋ช‡ ๊ฑด์˜ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ๋ฐœ์ƒ์‹œํ‚ค๋‚˜์š”?
  3. ๋‘ ์ปค๋„ ๊ฐ„ ์ด ์ถฉ๋Œ ํšŸ์ˆ˜ ์ฐจ์ด๋Š” ์–ผ๋งˆ์ธ๊ฐ€์š”?

์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„ (Step 5)

  1. ์ถฉ๋Œ ์—†๋Š” ์ปค๋„์—์„œ Thread 0์€ ์–ด๋–ค ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋‚˜์š”? Thread 31์€?
  2. 2-way ์ถฉ๋Œ ์ปค๋„์—์„œ Bank 0์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ๋Š”? Bank 2์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ๋Š”?
  3. ์ถฉ๋Œ ์ปค๋„์—์„œ ๊ฐ™์€ ๋ฑ…ํฌ๋ฅผ ๋†“๊ณ  ๊ฒฝ์Ÿํ•˜๋Š” ์Šค๋ ˆ๋“œ๋Š” ๋ช‡ ๊ฐœ์ธ๊ฐ€์š”?

๋ฑ…ํฌ ์ถฉ๋Œ ํƒ์ • ์ž‘์—…

  1. ์ถฉ๋Œ ์—†๋Š” ์ปค๋„์€ ์ถฉ๋Œ์ด 0์ธ๋ฐ, 2-way ์ถฉ๋Œ ์ปค๋„์—์„œ๋Š” ์ธก์ • ๊ฐ€๋Šฅํ•œ ์ถฉ๋Œ์ด ๋‚˜ํƒ€๋‚˜๋Š” ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?
  2. stride-2 ์ ‘๊ทผ ํŒจํ„ด (thread_idx.x * 2) % TPB๋Š” ์–ด๋–ป๊ฒŒ ์ฒด๊ณ„์ ์ธ ์ถฉ๋Œ์„ ๋งŒ๋“ค์–ด๋‚ด๋‚˜์š”?
  3. ๋ฑ…ํฌ ์ถฉ๋Œ์ด ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์ปค๋„๋ณด๋‹ค ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„์—์„œ ๋” ์ค‘์š”ํ•œ ์ด์œ ๋Š” ๋ฌด์—‡์ธ๊ฐ€์š”?

์‹ค์ „ ์‹œ์‚ฌ์ 

  1. ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜ ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น  ๊ฒƒ์œผ๋กœ ์˜ˆ์ƒ๋˜๋Š” ๊ฒฝ์šฐ๋Š” ์–ธ์ œ์ธ๊ฐ€์š”?
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์ „์— ๋ฑ…ํฌ ์ถฉ๋Œ ํŒจํ„ด์„ ์–ด๋–ป๊ฒŒ ์˜ˆ์ธกํ•  ์ˆ˜ ์žˆ๋‚˜์š”?
  3. ํ–‰๋ ฌ ์—ฐ์‚ฐ๊ณผ ์Šคํ…์‹ค ์—ฐ์‚ฐ์—์„œ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ํ”ผํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ์„ค๊ณ„ ์›์น™์€ ๋ฌด์—‡์ธ๊ฐ€์š”?
ํŒ

๋ฑ…ํฌ ์ถฉ๋Œ ํƒ์ • ๋„๊ตฌ ๋ชจ์Œ:

  • NSight Compute ๋ฉ”ํŠธ๋ฆญ - ์ •๋ฐ€ํ•œ ์ธก์ •์œผ๋กœ ์ถฉ๋Œ์„ ์ •๋Ÿ‰ํ™”
  • ์ ‘๊ทผ ํŒจํ„ด ์‹œ๊ฐํ™” - ์Šค๋ ˆ๋“œ ์ธ๋ฑ์Šค๋ฅผ ๋ฑ…ํฌ์— ์ฒด๊ณ„์ ์œผ๋กœ ๋งคํ•‘
  • ์ˆ˜ํ•™์  ๋ถ„์„ - ๋ชจ๋“ˆ๋กœ ์—ฐ์‚ฐ์œผ๋กœ ์ถฉ๋Œ ์˜ˆ์ธก
  • ์›Œํฌ๋กœ๋“œ ํŠน์„ฑ - ์ถฉ๋Œ์ด ์ค‘์š”ํ•œ ๊ฒฝ์šฐ์™€ ๊ทธ๋ ‡์ง€ ์•Š์€ ๊ฒฝ์šฐ ์ดํ•ด

ํ•ต์‹ฌ ์กฐ์‚ฌ ์›์น™:

  • ์ฒด๊ณ„์ ์œผ๋กœ ์ธก์ •ํ•˜๊ธฐ: ์ถฉ๋Œ์„ ์ถ”์ธกํ•˜์ง€ ๋ง๊ณ  ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๋ฅผ ์‚ฌ์šฉ
  • ์ ‘๊ทผ ํŒจํ„ด ์‹œ๊ฐํ™”ํ•˜๊ธฐ: ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์Šค๋ ˆ๋“œ-๋ฑ…ํฌ ๋งคํ•‘์„ ๊ทธ๋ ค๋ณด๊ธฐ
  • ์›Œํฌ๋กœ๋“œ ๋งฅ๋ฝ ๊ณ ๋ คํ•˜๊ธฐ: ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์—ฐ์‚ฐ ์ง‘์•ฝ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜์—์„œ ๊ฐ€์žฅ ์ค‘์š”
  • ์˜ˆ๋ฐฉ์ ์œผ๋กœ ์‚ฌ๊ณ ํ•˜๊ธฐ: ์ฒ˜์Œ๋ถ€ํ„ฐ ์ถฉ๋Œ ์—†๋Š” ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„

์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„ ๋ฐฉ๋ฒ•:

  1. ์Šค๋ ˆ๋“œ๋ฅผ ์ธ๋ฑ์Šค์— ๋งคํ•‘: ์ˆ˜ํ•™์  ์ฃผ์†Œ ๊ณ„์‚ฐ์„ ์ดํ•ด
  2. ๋ฑ…ํฌ ํ• ๋‹น ๊ณ„์‚ฐ: ๊ณต์‹ bank_id = (address / 4) % 32 ์‚ฌ์šฉ
  3. ์ถฉ๋Œ ์‹๋ณ„: ๊ฐ™์€ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์ธ์ง€ ํ™•์ธ
  4. ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ๊ฒ€์ฆ: NSight Compute ์ธก์ •์œผ๋กœ ์ด๋ก ์  ๋ถ„์„ ํ™•์ธ

์ผ๋ฐ˜์ ์ธ ์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด:

  • ์ˆœ์ฐจ ์ ‘๊ทผ: shared[thread_idx.x] - ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผ
  • ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ ‘๊ทผ: ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ shared[0] - ํ•˜๋“œ์›จ์–ด ์ตœ์ ํ™”
  • 2์˜ ๊ฑฐ๋“ญ์ œ๊ณฑ ์ŠคํŠธ๋ผ์ด๋“œ: stride-32๋Š” ๋ฑ…ํ‚น ํŒจํ„ด์— ๊น”๋”ํ•˜๊ฒŒ ๋งคํ•‘๋˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Œ
  • ํŒจ๋”ฉ๋œ ๋ฐฐ์—ด: ํŒจ๋”ฉ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ ‘๊ทผ ํŒจํ„ด์„ ์ด๋™

์†”๋ฃจ์…˜

๋ฑ…ํฌ ์ถฉ๋Œ ๋ถ„์„์ด ํฌํ•จ๋œ ์™„์ „ํ•œ ํ’€์ด

์ด ๋ฑ…ํฌ ์ถฉ๋Œ ํƒ์ • ์‚ฌ๊ฑด์€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด์ด GPU ์„ฑ๋Šฅ์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€, ๊ทธ๋ฆฌ๊ณ  ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง์˜ ์ค‘์š”์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง์„ ํ†ตํ•œ ์กฐ์‚ฌ ๊ฒฐ๊ณผ

Step 1: ์ •ํ™•์„ฑ ๊ฒ€์ฆ ๋‘ ์ปค๋„ ๋ชจ๋‘ ๋™์ผํ•œ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ๋ƒ…๋‹ˆ๋‹ค:

โœ… No-conflict kernel: PASSED
โœ… Two-way conflict kernel: PASSED
โœ… Both kernels produce identical results

Step 2: ์„ฑ๋Šฅ ๊ธฐ์ค€์„  ๋ฒค์น˜๋งˆํฌ ๊ฒฐ๊ณผ๋Š” ๋น„์Šทํ•œ ์‹คํ–‰ ์‹œ๊ฐ„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

| name             | met (ms)           | iters |
| ---------------- | ------------------ | ----- |
| no_conflict      | 2.1930616745886655 | 547   |
| two_way_conflict | 2.1978922967032966 | 546   |

ํ•ต์‹ฌ ํ†ต์ฐฐ: ์„ฑ๋Šฅ์ด ๊ฑฐ์˜ ๋™์ผํ•œ ์ด์œ (~2.19ms vs ~2.20ms)๋Š” ์ด ์›Œํฌ๋กœ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์‹คํ–‰ ์‹œ๊ฐ„์ด ์•„๋‹Œ ํ”„๋กœํŒŒ์ผ๋ง ๋ฉ”ํŠธ๋ฆญ์„ ํ†ตํ•ด ๋“œ๋Ÿฌ๋‚ฉ๋‹ˆ๋‹ค.

๋ฑ…ํฌ ์ถฉ๋Œ ํ”„๋กœํŒŒ์ผ๋ง ๊ทผ๊ฑฐ

์ถฉ๋Œ ์—†๋Š” ์ปค๋„ (์ตœ์  ์ ‘๊ทผ ํŒจํ„ด):

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum    0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum    0

๊ฒฐ๊ณผ: ๋กœ๋“œ์™€ ์Šคํ† ์–ด ๋ชจ๋‘ ์ถฉ๋Œ 0๊ฑด - ์™„๋ฒฝํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ.

2-Way ์ถฉ๋Œ ์ปค๋„ (๋ฌธ์ œ ์žˆ๋Š” ์ ‘๊ทผ ํŒจํ„ด):

l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum    256
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum    256

๊ฒฐ๊ณผ: ๋กœ๋“œ์™€ ์Šคํ† ์–ด ๊ฐ๊ฐ 256๊ฑด์˜ ์ถฉ๋Œ - ์ฒด๊ณ„์ ์ธ ๋ฑ…ํ‚น ๋ฌธ์ œ์˜ ๋ช…ํ™•ํ•œ ๊ทผ๊ฑฐ.

์ด ์ถฉ๋Œ ์ฐจ์ด: 512๊ฑด์˜ ์ถฉ๋Œ(256 + 256)์ด ์ธก์ • ๊ฐ€๋Šฅํ•œ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋น„ํšจ์œจ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

์ ‘๊ทผ ํŒจํ„ด ์ˆ˜ํ•™์  ๋ถ„์„

์ถฉ๋Œ ์—†๋Š” ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด

์Šค๋ ˆ๋“œ-์ธ๋ฑ์Šค ๋งคํ•‘:

shared_buf[thread_idx.x]

๋ฑ…ํฌ ํ• ๋‹น ๋ถ„์„:

Thread 0  โ†’ Index 0   โ†’ Bank 0  (0 % 32)
Thread 1  โ†’ Index 1   โ†’ Bank 1  (1 % 32)
Thread 2  โ†’ Index 2   โ†’ Bank 2  (2 % 32)
...
Thread 31 โ†’ Index 31  โ†’ Bank 31 (31 % 32)

๊ฒฐ๊ณผ: ์™„๋ฒฝํ•œ ๋ฑ…ํฌ ๋ถ„๋ฐฐ - ๊ฐ ์›Œํ”„ ๋‚ด์—์„œ ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฑ…ํฌ์— ์ ‘๊ทผํ•˜์—ฌ ๋ณ‘๋ ฌ ์ ‘๊ทผ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.

2-way ์ถฉ๋Œ ์ปค๋„ ์ ‘๊ทผ ํŒจํ„ด

์Šค๋ ˆ๋“œ-์ธ๋ฑ์Šค ๋งคํ•‘:

shared_buf[(thread_idx.x * 2) % TPB]  # TPB = 256

์ฒซ ๋ฒˆ์งธ ์›Œํ”„(์Šค๋ ˆ๋“œ 0-31)์˜ ๋ฑ…ํฌ ํ• ๋‹น ๋ถ„์„:

Thread 0  โ†’ Index (0*2)%256 = 0   โ†’ Bank 0
Thread 1  โ†’ Index (1*2)%256 = 2   โ†’ Bank 2
Thread 2  โ†’ Index (2*2)%256 = 4   โ†’ Bank 4
...
Thread 16 โ†’ Index (16*2)%256 = 32 โ†’ Bank 0  โ† Thread 0๊ณผ ์ถฉ๋Œ
Thread 17 โ†’ Index (17*2)%256 = 34 โ†’ Bank 2  โ† Thread 1๊ณผ ์ถฉ๋Œ
Thread 18 โ†’ Index (18*2)%256 = 36 โ†’ Bank 4  โ† Thread 2์™€ ์ถฉ๋Œ
...

์ถฉ๋Œ ํŒจํ„ด: ๊ฐ ๋ฑ…ํฌ๊ฐ€ ์ •ํ™•ํžˆ 2๊ฐœ์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์ฒ˜๋ฆฌํ•˜์—ฌ 32๊ฐœ ๋ฑ…ํฌ ์ „์ฒด์—์„œ ์ฒด๊ณ„์ ์ธ 2-way ์ถฉ๋Œ์ด ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜ํ•™์  ์„ค๋ช…: stride-2 ํŒจํ„ด๊ณผ ๋ชจ๋“ˆ๋กœ 256์˜ ์กฐํ•ฉ์ด ๋ฐ˜๋ณต์ ์ธ ์ ‘๊ทผ ํŒจํ„ด์„ ๋งŒ๋“ค์–ด๋ƒ…๋‹ˆ๋‹ค:

  • ์Šค๋ ˆ๋“œ 0-15๋Š” ๋ฑ…ํฌ 0,2,4,โ€ฆ,30์— ์ ‘๊ทผ
  • ์Šค๋ ˆ๋“œ 16-31์€ ๋™์ผํ•œ ๋ฑ…ํฌ 0,2,4,โ€ฆ,30์— ์ ‘๊ทผ
  • ๊ฐ ๋ฑ…ํฌ ์ถฉ๋Œ๋งˆ๋‹ค ํ•˜๋“œ์›จ์–ด ์ง๋ ฌํ™”๊ฐ€ ํ•„์š”

์ด๊ฒƒ์ด ์ค‘์š”ํ•œ ์ด์œ : ์›Œํฌ๋กœ๋“œ ๋งฅ๋ฝ ๋ถ„์„

๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ vs ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ ์‹œ์‚ฌ์ 

์ด ์›Œํฌ๋กœ๋“œ์˜ ํŠน์„ฑ:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ์ง€๋ฐฐ์ : ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก ๋Œ€๋น„ ์ตœ์†Œํ•œ์˜ ์—ฐ์‚ฐ๋งŒ ์ˆ˜ํ–‰
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋Š” ๋ถ€์ฐจ์ : ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ถ”๊ฐ€ํ•˜์ง€๋งŒ ์ „์ฒด ์‹คํ–‰ ์‹œ๊ฐ„์„ ์ง€๋ฐฐํ•˜์ง€๋Š” ์•Š์Œ
  • ๋™์ผํ•œ ์„ฑ๋Šฅ: ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํฌํ™”๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋น„ํšจ์œจ์„ ๊ฐ€๋ฆผ

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ:

  1. ์—ฐ์‚ฐ ์ง‘์•ฝ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜ - ํ–‰๋ ฌ ๊ณฑ์…ˆ, ์Šคํ…์‹ค ์—ฐ์‚ฐ, FFT
  2. ํƒ€์ดํŠธํ•œ ์—ฐ์‚ฐ ๋ฃจํ”„ - ๋‚ด๋ถ€ ๋ฃจํ”„ ์•ˆ์—์„œ ๋ฐ˜๋ณต์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ
  3. ๋†’์€ ์‚ฐ์ˆ  ๊ฐ•๋„ - ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๋‹น ์ƒ๋‹นํ•œ ์—ฐ์‚ฐ๋Ÿ‰
  4. ๋Œ€๊ทœ๋ชจ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ž‘์—… ์„ธํŠธ - ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์บ์‹ฑ์„ ์ง‘์ค‘์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜

์‹ค์ „ ์„ฑ๋Šฅ ์‹œ์‚ฌ์ 

๋ฑ…ํฌ ์ถฉ๋Œ์ด ์„ฑ๋Šฅ์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์• ํ”Œ๋ฆฌ์ผ€์ด์…˜:

ํ–‰๋ ฌ ๊ณฑ์…ˆ:

# Problematic: All threads in warp access same column
for k in range(tile_size):
    acc += a_shared[local_row, k] * b_shared[k, local_col]  # b_shared[k, 0] conflicts

์Šคํ…์‹ค ์—ฐ์‚ฐ:

# Problematic: Stride access in boundary handling
shared_buf[thread_idx.x * stride]  # Creates systematic conflicts

๋ณ‘๋ ฌ ๋ฆฌ๋•์…˜:

# Problematic: Power-of-2 stride patterns
if thread_idx.x < stride:
    shared_buf[thread_idx.x] += shared_buf[thread_idx.x + stride]  # Conflict potential

์ถฉ๋Œ ์—†๋Š” ์„ค๊ณ„ ์›์น™

์˜ˆ๋ฐฉ ์ „๋žต

1. ์ˆœ์ฐจ ์ ‘๊ทผ ํŒจํ„ด:

shared[thread_idx.x]  # Optimal - each thread different bank

2. ๋ธŒ๋กœ๋“œ์บ์ŠคํŠธ ์ตœ์ ํ™”:

constant = shared[0]  # All threads read same address - hardware optimized

3. ํŒจ๋”ฉ ๊ธฐ๋ฒ•:

shared = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + 1]())  # Shift access patterns

4. ์ ‘๊ทผ ํŒจํ„ด ๋ถ„์„:

  • ๊ตฌํ˜„ ์ „์— ๋ฑ…ํฌ ํ• ๋‹น์„ ๊ณ„์‚ฐ
  • ๋ชจ๋“ˆ๋กœ ์—ฐ์‚ฐ ์‚ฌ์šฉ: bank_id = (address_bytes / 4) % 32
  • ๋ณต์žกํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ์Šค๋ ˆ๋“œ-๋ฑ…ํฌ ๋งคํ•‘์„ ์‹œ๊ฐํ™”

์ฒด๊ณ„์  ์ตœ์ ํ™” ์›Œํฌํ”Œ๋กœ์šฐ

์„ค๊ณ„ ๋‹จ๊ณ„:

  1. ์ ‘๊ทผ ํŒจํ„ด ๊ณ„ํš - ์Šค๋ ˆ๋“œ-๋ฉ”๋ชจ๋ฆฌ ๋งคํ•‘์„ ์Šค์ผ€์น˜
  2. ๋ฑ…ํฌ ํ• ๋‹น ๊ณ„์‚ฐ - ์ˆ˜ํ•™์  ๋ถ„์„ ํ™œ์šฉ
  3. ์ถฉ๋Œ ์˜ˆ์ธก - ๋ฌธ์ œ๊ฐ€ ๋˜๋Š” ์ ‘๊ทผ ํŒจํ„ด ์‹๋ณ„
  4. ๋Œ€์•ˆ ์„ค๊ณ„ - ํŒจ๋”ฉ, ์ „์น˜, ๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณ€๊ฒฝ ๊ณ ๋ ค

๊ตฌํ˜„ ๋‹จ๊ณ„:

  1. ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง - NSight Compute ์ถฉ๋Œ ๋ฉ”ํŠธ๋ฆญ ์‚ฌ์šฉ
  2. ์˜ํ–ฅ ์ธก์ • - ๊ตฌํ˜„ ๊ฐ„ ์ถฉ๋Œ ํšŸ์ˆ˜ ๋น„๊ต
  3. ์„ฑ๋Šฅ ๊ฒ€์ฆ - ์ตœ์ ํ™”๊ฐ€ ์ข…๋‹จ๊ฐ„ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š”์ง€ ํ™•์ธ
  4. ํŒจํ„ด ๋ฌธ์„œํ™” - ์„ฑ๊ณต์ ์ธ ์ถฉ๋Œ ์—†๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๊ธฐ๋ก

ํ•ต์‹ฌ ์ •๋ฆฌ: ํƒ์ • ์ž‘์—…์—์„œ ์ตœ์ ํ™” ์ „๋ฌธ์„ฑ์œผ๋กœ

๋ฑ…ํฌ ์ถฉ๋Œ ์กฐ์‚ฌ์—์„œ ๋ฐํ˜€์ง„ ๊ฒƒ:

  1. ์ธก์ •์ด ์ง๊ด€๋ณด๋‹ค ๋‚ซ๋‹ค - ํ”„๋กœํŒŒ์ผ๋ง ๋„๊ตฌ๊ฐ€ ์„ฑ๋Šฅ ํƒ€์ด๋ฐ์œผ๋กœ๋Š” ๋ณด์ด์ง€ ์•Š๋Š” ์ถฉ๋Œ์„ ๋“œ๋Ÿฌ๋ƒ„
  2. ํŒจํ„ด ๋ถ„์„์ด ์œ ํšจํ•˜๋‹ค - ์ˆ˜ํ•™์  ์˜ˆ์ธก์ด NSight Compute ๊ฒฐ๊ณผ์™€ ์ •ํ™•ํžˆ ์ผ์น˜
  3. ๋งฅ๋ฝ์ด ์ค‘์š”ํ•˜๋‹ค - ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์—ฐ์‚ฐ ์ง‘์•ฝ์  ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์›Œํฌ๋กœ๋“œ์—์„œ ๊ฐ€์žฅ ์ค‘์š”
  4. ์˜ˆ๋ฐฉ์ด ์ˆ˜์ •๋ณด๋‹ค ๋‚ซ๋‹ค - ์ถฉ๋Œ ์—†๋Š” ํŒจํ„ด์„ ์„ค๊ณ„ํ•˜๋Š” ๊ฒƒ์ด ์‚ฌํ›„ ์ตœ์ ํ™”๋ณด๋‹ค ์‰ฌ์›€

๋ณดํŽธ์ ์ธ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™” ์›์น™:

๋ฑ…ํฌ ์ถฉ๋Œ์— ์ฃผ์˜ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ:

  • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์—ฐ์‚ฐ ์ง‘์•ฝ์  ์ปค๋„
  • ํƒ€์ดํŠธํ•œ ๋ฃจํ”„์—์„œ ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ๋ฐ˜๋ณต ์•Œ๊ณ ๋ฆฌ์ฆ˜
  • ๋ชจ๋“  ์‚ฌ์ดํด์ด ์ค‘์š”ํ•œ ์„ฑ๋Šฅ ํ•ต์‹ฌ ์ฝ”๋“œ
  • ๋Œ€์—ญํญ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ง‘์•ฝ์  ์—ฐ์‚ฐ

๋ฑ…ํฌ ์ถฉ๋Œ์ด ๋œ ์ค‘์š”ํ•œ ๊ฒฝ์šฐ:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ๊ฐ€ ์„ฑ๋Šฅ์„ ์ง€๋ฐฐํ•˜๋Š” ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์›Œํฌ๋กœ๋“œ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์žฌ์‚ฌ์šฉ์ด ์ตœ์†Œ์ธ ๋‹จ์ˆœ ์บ์‹ฑ ์‹œ๋‚˜๋ฆฌ์˜ค
  • ๋ฐ˜๋ณต์ ์ธ ์ถฉ๋Œ ๋ฐœ์ƒ ์—ฐ์‚ฐ์ด ์—†๋Š” ์ผํšŒ์„ฑ ์ ‘๊ทผ ํŒจํ„ด

์ „๋ฌธ์  ๊ฐœ๋ฐœ ๋ฐฉ๋ฒ•๋ก :

  1. ์ตœ์ ํ™” ์ „์— ํ”„๋กœํŒŒ์ผ๋ง - NSight Compute๋กœ ์ถฉ๋Œ์„ ์ •๋Ÿ‰์ ์œผ๋กœ ์ธก์ •
  2. ์ ‘๊ทผ ์ˆ˜ํ•™ ์ดํ•ด - ๋ฑ…ํฌ ํ• ๋‹น ๊ณต์‹์œผ๋กœ ๋ฌธ์ œ๋ฅผ ์˜ˆ์ธก
  3. ์ฒด๊ณ„์ ์œผ๋กœ ์„ค๊ณ„ - ๋ฑ…ํ‚น์„ ์‚ฌํ›„ ๊ณ ๋ ค๊ฐ€ ์•„๋‹Œ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์„ค๊ณ„ ๋‹จ๊ณ„์—์„œ ๊ณ ๋ ค
  4. ์ตœ์ ํ™” ๊ฒ€์ฆ - ์ถฉ๋Œ ๊ฐ์†Œ๊ฐ€ ์‹ค์ œ ์„ฑ๋Šฅ์„ ๊ฐœ์„ ํ•˜๋Š”์ง€ ํ™•์ธ

์ด ํƒ์ • ์‚ฌ๊ฑด์€ ์ฒด๊ณ„์  ํ”„๋กœํŒŒ์ผ๋ง์ด ์„ฑ๋Šฅ ํƒ€์ด๋ฐ๋งŒ์œผ๋กœ๋Š” ๋ณด์ด์ง€ ์•Š๋Š” ์ตœ์ ํ™” ๊ธฐํšŒ๋ฅผ ๋“œ๋Ÿฌ๋‚ธ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค - ๋ฑ…ํฌ ์ถฉ๋Œ์€ ์ธก์ • ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๊ฐ€ ์ถ”์ธก๋ณด๋‹ค ๋‚˜์€ ๋Œ€ํ‘œ์ ์ธ ์‚ฌ๋ก€์ž…๋‹ˆ๋‹ค.

Puzzle 33: ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ

์†Œ๊ฐœ

GPU ํ–‰๋ ฌ ๊ณฑ์…ˆ ์ตœ์ ํ™”์˜ ์ตœ์ „์„ ์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ํผ์ฆ์—์„œ๋Š” ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ „๋ก€ ์—†๋Š” ์†๋„๋กœ ๊ฐ€์†ํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ์œ ๋‹›์ธ ํ…์„œ ์ฝ”์–ด๋ฅผ ํƒ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

์ง€๊ธˆ๊นŒ์ง€ ๋ฐฐ์šด ๋ชจ๋“  ๊ฒƒ, ํŠนํžˆ Puzzle 16์˜ ๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ตœ์‹  GPU๊ฐ€ ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ๊ทน์ ์œผ๋กœ ๋น ๋ฅด๊ฒŒ ๋งŒ๋“œ๋Š” ์ „์šฉ ์‹ค๋ฆฌ์ฝ˜์„ ์–ด๋–ป๊ฒŒ ์ œ๊ณตํ•˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

ํ…์„œ ์ฝ”์–ด๋ž€?

ํ…์„œ ์ฝ”์–ด(AMD ํ•˜๋“œ์›จ์–ด์—์„œ๋Š” Matrix Core๋ผ๊ณ ๋„ ํ•จ)๋Š” ๋‹จ์ผ ๋ช…๋ น์–ด๋กœ ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ํ–‰๋ ฌ-ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์ „์šฉ ํ”„๋กœ์„ธ์‹ฑ ์œ ๋‹›์ž…๋‹ˆ๋‹ค. ์ด ์œ ๋‹›์€ ์ตœ์‹  GPU ์•„ํ‚คํ…์ฒ˜์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • NVIDIA: Tensor Cores (Volta, Turing, Ampere, Hopper)
  • AMD: Matrix Cores (CDNA/CDNA2/CDNA3 ์•„ํ‚คํ…์ฒ˜)

GPU์— ์ง์ ‘ ๋‚ด์žฅ๋œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† GEMM(์—ญ์ฃผ: General Matrix Multiply, ๋ฒ”์šฉ ํ–‰๋ ฌ ๊ณฑ์…ˆ) ์—”์ง„์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํŠน์ง•

  • ์›Œํ”„ ์ˆ˜์ค€ ์—ฐ์‚ฐ: ๊ฐ ๋ช…๋ น์–ด๊ฐ€ ์ „์ฒด ์›Œํ”„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋Œ€์ƒ์œผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค (NVIDIA์—์„œ 32๊ฐœ ์Šค๋ ˆ๋“œ, AMD์—์„œ 32 ๋˜๋Š” 64๊ฐœ)
  • ๊ณ ์ • ํƒ€์ผ ํฌ๊ธฐ: ์—ฐ์‚ฐ์ด ํŠน์ • ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ํฌ๊ธฐ์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค (์˜ˆ: FP32์˜ ๊ฒฝ์šฐ 16ร—8ร—8)
  • ํ˜ผํ•ฉ ์ •๋ฐ€๋„: ์ตœ์ ์˜ ์„ฑ๋Šฅ์„ ์œ„ํ•ด ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ์ •๋ฐ€๋„๋ฅผ ํ˜ผํ•ฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • ๋Œ€๊ทœ๋ชจ ์ฒ˜๋ฆฌ๋Ÿ‰: ํ–‰๋ ฌ ์—ฐ์‚ฐ์—์„œ ์ผ๋ฐ˜ ์ปดํ“จํŠธ ์ฝ”์–ด ๋Œ€๋น„ 10~100๋ฐฐ ์†๋„ ํ–ฅ์ƒ์„ ๋‹ฌ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

ํƒ€์ผ๋ง์—์„œ ํ…์„œ ์ฝ”์–ด๋กœ

๊ธฐ๋ณธ ํ–‰๋ ฌ ๊ณฑ์…ˆ์—์„œ ํ…์„œ ์ฝ”์–ด๊นŒ์ง€์˜ ์—ฌ์ •์„ ๋Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

  1. Puzzle 16: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ™œ์šฉํ•œ ๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”: ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค
  3. ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ: ๋ฐฐ๋ฆฌ์–ด์™€ ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ ์›Œํ”„๋ฅผ ์กฐ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค
  4. ์ง€๊ธˆ: ํ•ต์‹ฌ ์—ฐ์‚ฐ์„ ๊ฐ€์†ํ•˜๊ธฐ ์œ„ํ•ด ์ „์šฉ ํ•˜๋“œ์›จ์–ด(ํ…์„œ ์ฝ”์–ด)๋ฅผ ์‚ฌ์šฉํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค

ํ…์„œ ์ฝ”์–ด ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

ํ…์„œ ์ฝ”์–ด๋Š” ๊ธฐ์กด๊ณผ ๋‹ค๋ฅธ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

๊ธฐ์กด ์ปดํ“จํŠธ ์ฝ”์–ด ๋ฐฉ์‹

# Each thread computes one element
acc += a_shared[local_row, k] * b_shared[k, local_col]

ํ…์„œ ์ฝ”์–ด ๋ฐฉ์‹

# Entire warp cooperates on matrix fragments
a_reg = mma_op.load_a(A_mma_tile)           # Load 16ร—8 fragment
b_reg = mma_op.load_b(B_mma_tile)           # Load 8ร—8 fragment
c_reg = mma_op.load_c(C_mma_tile)           # Load 16ร—8 accumulator
d_reg = mma_op.mma_op(a_reg, b_reg, c_reg)  # D = Aร—B + C
mma_op.store_d(C_mma_tile, d_reg)           # Store result

Mojo์˜ ํ…์„œ ์ฝ”์–ด API

Mojo๋Š” TensorCore ํƒ€์ž…์„ ํ†ตํ•ด ํ…์„œ ์ฝ”์–ด์— ๋Œ€ํ•œ ๊น”๋”ํ•œ ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

from layout.tensor_core import TensorCore

# Create a Tensor Core operator for specific tile sizes
mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

# Core operations:
# - load_a(): Load matrix A fragment from shared memory
# - load_b(): Load matrix B fragment from shared memory
# - load_c(): Load matrix C fragment (accumulator)
# - mma_op(): Perform D = Aร—B + C operation
# - store_d(): Store result fragment to memory

๊ณ ๊ธ‰ ๊ธฐ๋Šฅ: TensorCore API๋Š” ์–‘์žํ™” ์—ฐ์‚ฐ, ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”๋ฅผ ์œ„ํ•œ ๋‹ค์–‘ํ•œ ์Šค์œ„์ฆ ํŒจํ„ด(์—ญ์ฃผ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์˜ ๋ฑ…ํฌ ์ถฉ๋Œ์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐ ์ฃผ์†Œ๋ฅผ ๋น„ํŠธ ์—ฐ์‚ฐ์œผ๋กœ ์žฌ๋ฐฐ์น˜ํ•˜๋Š” ๊ธฐ๋ฒ•), ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ์—ฐ์‚ฐ๋„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ์ง€์›๋˜๋Š” ๋ชจ๋“  ํ˜•ํƒœ, ๋ฐ์ดํ„ฐ ํƒ€์ž…, ๋ฉ”์„œ๋“œ์— ๋Œ€ํ•œ ์ „์ฒด ๋ฌธ์„œ๋Š” ๊ณต์‹ TensorCore API ๋ ˆํผ๋Ÿฐ์Šค๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ํฌ๊ธฐ

TensorCore API๋Š” GPU ํ•˜๋“œ์›จ์–ด์— ๋”ฐ๋ผ ๋‹ค์–‘ํ•œ ํ˜•ํƒœ์™€ ๋ฐ์ดํ„ฐ ํƒ€์ž…์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค:

NVIDIA GPU:

  • float32: 16ร—8ร—8 ๋˜๋Š” 16ร—8ร—4
  • half-precision: 16ร—8ร—16
  • float8: 16ร—8ร—32

AMD GPU:

  • float32: 16ร—16ร—4
  • half-precision: 16ร—16ร—16 ๋˜๋Š” 32ร—32ร—8

์ด ํผ์ฆ์—์„œ๋Š” FP32์™€ 16ร—8ร—8 ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • MMA_M = 16: ํ–‰๋ ฌ A์˜ ๋†’์ด (์ถœ๋ ฅ ๋†’์ด์™€ ๋™์ผ)
  • MMA_N = 8: ํ–‰๋ ฌ B์˜ ๋„ˆ๋น„ (์ถœ๋ ฅ ๋„ˆ๋น„์™€ ๋™์ผ)
  • MMA_K = 8: ๋‚ด๋ถ€ ์ฐจ์› (A์˜ ๋„ˆ๋น„ = B์˜ ๋†’์ด)

MMA๋ž€? MMA๋Š” โ€œMatrix Multiply-Accumulateโ€(ํ–‰๋ ฌ ๊ณฑ์…ˆ-๋ˆ„์‚ฐ)์˜ ์•ฝ์ž๋กœ, ํ…์„œ ์ฝ”์–ด๊ฐ€ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ธฐ๋ณธ ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค. ๊ฐ MMA ๋ช…๋ น์–ด๋Š” D = A ร— B + C๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉฐ, ์—ฌ๊ธฐ์„œ A, B, C, D๋Š” ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ์ž…๋‹ˆ๋‹ค.

ํ”„๋ž˜๊ทธ๋จผํŠธ ์‹œ๊ฐํ™”:

A fragment (16ร—8)  ร—  B fragment (8ร—8)  +  C fragment (16ร—8)  =  D fragment (16ร—8)

    16 rows             8 rows               16 rows              16 rows
    8 cols              8 cols               8 cols               8 cols
      |                   |                    |                    |
   [A data]         ร—   [B data]         +   [C data]         =  [D result]

์ฆ‰, ๊ฐ ํ…์„œ ์ฝ”์–ด ๋ช…๋ น์–ด๋Š” A์˜ 16ร—8 ํƒ€์ผ๊ณผ B์˜ 8ร—8 ํƒ€์ผ์„ ๊ณฑํ•œ ๋’ค ๊ธฐ์กด 16ร—8 ๋ˆ„์‚ฐ๊ธฐ์— ๋”ํ•˜์—ฌ 16ร—8 ์ถœ๋ ฅ ํƒ€์ผ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

ํ…์„œ ์ฝ”์–ด๋ฅผ ์œ„ํ•œ ์›Œํ”„ ๊ตฌ์„ฑ

์›Œํ”„๋ž€? ์›Œํ”„๋Š” ๋ก์Šคํ…์œผ๋กœ ๋ช…๋ น์–ด๋ฅผ ํ•จ๊ป˜ ์‹คํ–‰ํ•˜๋Š” ์Šค๋ ˆ๋“œ ๊ทธ๋ฃน(NVIDIA์—์„œ 32๊ฐœ, AMD์—์„œ 32 ๋˜๋Š” 64๊ฐœ)์ž…๋‹ˆ๋‹ค. ํ…์„œ ์ฝ”์–ด๋Š” ๋‹จ์ผ ํ–‰๋ ฌ ์—ฐ์‚ฐ์— ์›Œํ”„ ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ํ˜‘๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

์™œ ์›Œํ”„ ์ˆ˜์ค€์ผ๊นŒ? ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ๋™์ž‘ํ•˜๋Š” ์ผ๋ฐ˜ ์—ฐ์‚ฐ๊ณผ ๋‹ฌ๋ฆฌ, ํ…์„œ ์ฝ”์–ด๋Š” ์ „์ฒด ์›Œํ”„๊ฐ€ ํ•จ๊ป˜ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋กœ๋“œํ•˜๊ณ , MMA ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ณ , ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ…์„œ ์ฝ”์–ด๊ฐ€ ์›Œํ”„ ์ˆ˜์ค€์—์„œ ๋™์ž‘ํ•˜๋ฏ€๋กœ, ์Šค๋ ˆ๋“œ๋ฅผ ๋‹ค๋ฅด๊ฒŒ ๊ตฌ์„ฑํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค:

# Calculate warp coordinates within the block
warp_id = thread_idx.x // WARP_SIZE
warps_in_n = BN // WN  # Number of warps along N dimension
warps_in_m = BM // WM  # Number of warps along M dimension
warp_y = warp_id // warps_in_n  # Warp's row
warp_x = warp_id % warps_in_n   # Warp's column

# Each warp handles a WMร—WN tile of the output
C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

์›Œํ”„ ๊ตฌ์„ฑ ์˜ˆ์‹œ (BM=128, BN=64, WM=32, WN=32์ธ ๊ฒฝ์šฐ):

Block (128ร—64) contains 8 warps arranged as:

    32 cols    32 cols
     |          |
[  Warp 0  ][  Warp 1  ]  โ† 32 rows each
[  Warp 2  ][  Warp 3  ]  โ† 32 rows each
[  Warp 4  ][  Warp 5  ]  โ† 32 rows each
[  Warp 6  ][  Warp 7  ]  โ† 32 rows each

Total: 4ร—2 = 8 warps, each handling 32ร—32 output region
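위 워프 좌표 산술이 실제로 그림의 4×2 배치를 만들어 내는지는 파이썬으로 쉽게 확인할 수 있습니다. 아래는 커널의 warp_y/warp_x 계산을 그대로 옮긴 스케치입니다.

```python
# BM=128, BN=64, WM=32, WN=32, WARP_SIZE=32일 때의 워프 좌표 계산 스케치
# (커널 코드의 warp_id // warps_in_n, warp_id % warps_in_n 산술과 동일)
WARP_SIZE = 32
BM, BN, WM, WN = 128, 64, 32, 32

warps_in_n = BN // WN  # N 방향 워프 수 = 2
warps_in_m = BM // WM  # M 방향 워프 수 = 4

layout = {}
for warp_id in range(warps_in_m * warps_in_n):  # 블록당 8개 워프
    warp_y = warp_id // warps_in_n  # 워프의 행
    warp_x = warp_id % warps_in_n   # 워프의 열
    layout[warp_id] = (warp_y, warp_x)

print(layout)
# 워프 0,1이 첫 번째 행, 워프 2,3이 두 번째 행, ... → 본문의 4x2 배치와 일치
```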

ํ…์„œ ์ฝ”์–ด์™€ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ

ํ…์„œ ์ฝ”์–ด๋Š” ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”์— ํ•œ ๋‹จ๊ณ„๋ฅผ ๋” ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ โ†’ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: copy_dram_to_sram_async ์‚ฌ์šฉ (Puzzle 16์—์„œ ๋ฐฐ์šด ๊ฒƒ)
  2. ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ โ†’ ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ: mma_op.load_a/load_b ์‚ฌ์šฉ
  3. ์—ฐ์‚ฐ: ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ์—์„œ mma_op.mma_op ์‚ฌ์šฉ
  4. ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ โ†’ ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: mma_op.store_d ์‚ฌ์šฉ

๋„์ „ ๊ณผ์ œ

tensor_core_matrix_multiplication ํ•จ์ˆ˜๋ฅผ ์™„์„ฑํ•˜๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ž…๋‹ˆ๋‹ค. ์Šค์ผˆ๋ ˆํ†ค ์ฝ”๋“œ๋Š” ํƒ€์ผ๋ง ๋ฐฉ์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋˜ ์‹ค์ œ ํ…์„œ ์ฝ”์–ด ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์š”๊ตฌ์‚ฌํ•ญ

  1. ์‹ค์ œ ํ…์„œ ์ฝ”์–ด API ์‚ฌ์šฉ: ์‹œ๋ฎฌ๋ ˆ์ด์…˜์ด ์•„๋‹Œ ์‹ค์ œ mma_op.load_a(), mma_op.mma_op() ๋“ฑ์„ ์‚ฌ์šฉํ•˜์„ธ์š”
  2. ์ •ํ™•์„ฑ ์œ ์ง€: ๊ฒฐ๊ณผ๊ฐ€ CPU ์ฐธ์กฐ ๊ตฌํ˜„๊ณผ ์ผ์น˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค
  3. ์˜ฌ๋ฐ”๋ฅธ ์›Œํ”„ ์กฐ์ •: ๋ธ”๋ก๋‹น ์—ฌ๋Ÿฌ ์›Œํ”„๋ฅผ ์˜ฌ๋ฐ”๋ฅด๊ฒŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค (NVIDIA์™€ AMD ๋ชจ๋‘์—์„œ ๋™์ž‘)
  4. ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ: Puzzle 16์—์„œ ๋ฐฐ์šด ๋น„๋™๊ธฐ ๋ณต์‚ฌ ํŒจํ„ด์„ ๋™์ผํ•˜๊ฒŒ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  5. ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ: ํƒ€์ผ๋ง ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ WARP_SIZE์˜ ๋ฐฐ์ˆ˜์ธ์ง€ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค

์„ค์ •

  • ํ–‰๋ ฌ ํฌ๊ธฐ: \(\text{SIZE} = 1024\)
  • ๋ธ”๋ก ํƒ€์ผ๋ง: \(\text{BM} = 128, \text{BN} = 64, \text{BK} = 32\)
  • ์›Œํ”„ ํƒ€์ผ๋ง: \(\text{WM} = 32, \text{WN} = 32\) (WARP_SIZE์˜ ๋ฐฐ์ˆ˜)
  • MMA ํ”„๋ž˜๊ทธ๋จผํŠธ: \(16 \times 8 \times 8\) (FP32)
  • ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜: \(8 \times \text{WARP_SIZE}\) (๋ธ”๋ก๋‹น 8๊ฐœ ์›Œํ”„)
  • ๊ทธ๋ฆฌ๋“œ ์ฐจ์›: ๋ธ”๋ก ํƒ€์ผ๋กœ ์ „์ฒด ํ–‰๋ ฌ์„ ์ปค๋ฒ„

๋ ˆ์ด์•„์›ƒ ์„ค์ •:

  • ์ž…๋ ฅ A: row_major[SIZE, SIZE]()
  • ์ž…๋ ฅ B: row_major[SIZE, SIZE]()
  • ์ถœ๋ ฅ C: row_major[SIZE, SIZE]()
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋น„๋™๊ธฐ ๋ณต์‚ฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•˜๋Š” ๋ธ”๋ก ํฌ๊ธฐ ํƒ€์ผ

๋„์ „ ๊ณผ์ œ

์ด ํผ์ฆ์—์„œ๋Š” Puzzle 16์˜ ๊ด€์šฉ์  ํƒ€์ผ๋ง ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ํ…์„œ ์ฝ”์–ด ๊ตฌํ˜„์œผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค. ๋‹จ๊ณ„๋ณ„๋กœ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ํƒ€์ผ๋ง ๊ธฐ๋ณธ ๊ตฌํ˜„ ์ดํ•ดํ•˜๊ธฐ

ํผ์ฆ์€ ์ฐธ์กฐ์šฉ์œผ๋กœ ์™„์„ฑ๋œ ๊ด€์šฉ์  ํƒ€์ผ๋ง ๊ตฌํ˜„์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค:

def matmul_idiomatic_tiled[
    size: Int
](
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
):
    # Use block_dim to get actual tile size dynamically
    var tile_size_x = block_dim.x
    var tile_size_y = block_dim.y

    var local_row = thread_idx.y
    var local_col = thread_idx.x
    var tiled_row = block_idx.y * tile_size_y + local_row
    var tiled_col = block_idx.x * tile_size_x + local_col

    # Get the tile of the output matrix that this thread block is responsible for
    var out_tile = output.tile[TILE_SIZE, TILE_SIZE](block_idx.y, block_idx.x)
    var a_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TILE_SIZE, TILE_SIZE]())
    var b_shared = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TILE_SIZE, TILE_SIZE]())

    var acc: output.ElementType = 0

    comptime load_a_layout = row_major[1, TILE_SIZE]()  # Coalesced loading
    comptime load_b_layout = row_major[1, TILE_SIZE]()  # Coalesced loading
    # Note: Both matrices stored in same orientation for correct matrix multiplication
    # Transposed loading would be useful if B were pre-transposed in global memory

    for idx in range(size // TILE_SIZE):  # Iterate over K tiles
        # Get tiles from A and B matrices
        var a_tile = a.tile[TILE_SIZE, TILE_SIZE](block_idx.y, idx)
        var b_tile = b.tile[TILE_SIZE, TILE_SIZE](idx, block_idx.x)

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads=TILE_SIZE * TILE_SIZE,
            block_dim_count=BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads=TILE_SIZE * TILE_SIZE,
            block_dim_count=BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        for k in range(TILE_SIZE):
            if (
                local_row < TILE_SIZE
                and local_col < TILE_SIZE
                and k < TILE_SIZE
            ):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result to output tile
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc


์ด ๊ธฐ๋ณธ ๊ตฌํ˜„์ด ํ•˜๋Š” ์ผ:

  • ์ •ํ™•์„ฑ: ์ด ๊ตฌํ˜„์€ ์™„๋ฒฝํ•˜๊ฒŒ ๋™์ž‘ํ•˜๋ฉฐ ๋ชจ๋“  ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•ฉ๋‹ˆ๋‹ค
  • ์Šค๋ ˆ๋“œ ํ˜‘๋ ฅ: ํšจ์œจ์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ์ „์†ก์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ฐฐ๋ฆฌ์–ด์™€ ๋น„๋™๊ธฐ ์—ฐ์‚ฐ์œผ๋กœ ์Šค๋ ˆ๋“œ๋ฅผ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค
  • ํƒ€์ผ๋ง ์—ฐ์‚ฐ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํƒ€์ผ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•˜๋‚˜์˜ ์ถœ๋ ฅ ์š”์†Œ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

2๋‹จ๊ณ„: ํ…์„œ ์ฝ”์–ด ๋ฏธ์…˜

์œ„ ๋ฐฉ์‹์„ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์„ ํ™œ์šฉํ•˜๋„๋ก ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค:

  • ๊ธฐ์กด: ์Šค๋ ˆ๋“œ ์ˆ˜์ค€ ์—ฐ์‚ฐ โ†’ ๋ณ€ํ™˜ ํ›„: ์›Œํ”„ ์ˆ˜์ค€ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ
  • ๊ธฐ์กด: ํ‘œ์ค€ FP32 ์‚ฐ์ˆ  โ†’ ๋ณ€ํ™˜ ํ›„: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† GEMM ์—ฐ์‚ฐ
  • ๊ธฐ์กด: ๊ฐœ๋ณ„ ์š”์†Œ ๊ฒฐ๊ณผ โ†’ ๋ณ€ํ™˜ ํ›„: 16ร—8 ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ๊ฒฐ๊ณผ

3๋‹จ๊ณ„: ์„ค์ • ์ดํ•ดํ•˜๊ธฐ

ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „์€ ํ•˜๋“œ์›จ์–ด์— ์ตœ์ ํ™”๋œ ๋‹ค๋ฅธ ํƒ€์ผ๋ง ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค:

  • ๋ธ”๋ก ํƒ€์ผ๋ง: BM=128, BN=64, BK=32 (๋” ๋‚˜์€ ์ ์œ ์œจ์„ ์œ„ํ•ด ๋” ํฐ ๋ธ”๋ก)
  • ์›Œํ”„ ํƒ€์ผ๋ง: WM=32, WN=32 (๊ฐ ์›Œํ”„๊ฐ€ 32ร—32 ์ถœ๋ ฅ ์˜์—ญ์„ ๋‹ด๋‹น)
  • MMA ํ”„๋ž˜๊ทธ๋จผํŠธ: 16ร—8ร—8 (ํ•˜๋“œ์›จ์–ด๊ฐ€ ์ •์˜ํ•œ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ํฌ๊ธฐ)
  • ๋ธ”๋ก๋‹น ์›Œํ”„: 8๊ฐœ (BMร—BN ๋ธ”๋ก ๋‚ด์—์„œ 4ร—2๋กœ ๋ฐฐ์น˜)

์™œ ์ด ํŠน์ • ํฌ๊ธฐ์ธ๊ฐ€?

  • BM=128, BN=64: ํ…์„œ ์ฝ”์–ด๋ฅผ ๋” ์ž˜ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํƒ€์ผ๋ง ๋ฒ„์ „(32ร—32)๋ณด๋‹ค ํฝ๋‹ˆ๋‹ค
  • WM=WN=32: WARP_SIZE์˜ ๋ฐฐ์ˆ˜์ด๋ฉฐ 2ร—4=8๊ฐœ์˜ MMA ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค (32รท16=2, 32รท8=4)
  • MMA 16ร—8ร—8: ํ•˜๋“œ์›จ์–ด์— ์˜ํ•ด ๊ณ ์ •๋จ - ํ…์„œ ์ฝ”์–ด๊ฐ€ ๋ฌผ๋ฆฌ์ ์œผ๋กœ ๊ณ„์‚ฐํ•˜๋Š” ํฌ๊ธฐ์ž…๋‹ˆ๋‹ค
  • 8 ์›Œํ”„: BMรทWM ร— BNรทWN = 128รท32 ร— 64รท32 = 4ร—2 = ๋ธ”๋ก๋‹น 8๊ฐœ ์›Œํ”„

์›Œํ”„๊ฐ€ MMA ํ”„๋ž˜๊ทธ๋จผํŠธ์— ๋งคํ•‘๋˜๋Š” ๋ฐฉ์‹:

Each 32ร—32 warp tile contains multiple 16ร—8 MMA fragments:

     8 cols    8 cols    8 cols    8 cols
      |         |         |         |
[ MMA 0,0 ][ MMA 0,1 ][ MMA 0,2 ][ MMA 0,3 ]  ← 16 rows each
[ MMA 1,0 ][ MMA 1,1 ][ MMA 1,2 ][ MMA 1,3 ]  ← 16 rows each

4 fragments across (32÷8=4) × 2 fragments down (32÷16=2) = 8 MMA operations per warp per K-tile

4๋‹จ๊ณ„: ์™„์„ฑํ•  ์ฝ”๋“œ

# Block and warp tiling sizes
comptime BM = 4 * WARP_SIZE  # Block tile M (4 warps along M)
comptime BN = 2 * WARP_SIZE  # Block tile N (2 warps along N)
comptime BK = WARP_SIZE  # Block tile K (stay within SMEM limit)
comptime WM = WARP_SIZE  # Warp tile M
comptime WN = WARP_SIZE  # Warp tile N

# MMA tile sizes for tensor cores
comptime MMA_M = 16
comptime MMA_N = 8
comptime MMA_K = 8

comptime THREADS_PER_BLOCK_TENSOR_CORE = (8 * WARP_SIZE, 1)  # 8 warps per block
# grid_dim is (x, y). We want x to sweep N (columns) and y to sweep M (rows)
comptime BLOCKS_PER_GRID_TENSOR_CORE = (
    (SIZE + BN - 1) // BN,
    (SIZE + BM - 1) // BM,
)


def tensor_core_matrix_multiplication[
    dtype: DType,
    BM: Int,
    BN: Int,
    BK: Int,
    WM: Int,
    WN: Int,
    MMA_M: Int,
    MMA_N: Int,
    MMA_K: Int,
](
    A: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    B: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    C: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
):
    comptime M = C.dim[0]()
    comptime N = C.dim[1]()
    comptime K = A.dim[1]()

    var warp_id = thread_idx.x // WARP_SIZE
    var warps_in_n = BN // WN
    var warps_in_m = BM // WM
    var warp_y = warp_id // warps_in_n
    var warp_x = warp_id % warps_in_n

    var warp_is_active = warp_y < warps_in_m

    var C_block_tile = C.tile[BM, BN](block_idx.y, block_idx.x)
    var C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

    var mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

    # Shared SRAM tiles (no padding to stay under shared memory limit)
    var A_sram_tile = stack_allocation[
        dtype=A.dtype, address_space=AddressSpace.SHARED
    ](row_major[BM, BK]())
    var B_sram_tile = stack_allocation[
        dtype=B.dtype, address_space=AddressSpace.SHARED
    ](row_major[BK, BN]())

    # One per-warp accumulator tile of shape [WM, WN]
    var C_warp_accum = stack_allocation[
        dtype=C.dtype, address_space=AddressSpace.GENERIC
    ](row_major[WM, WN]())

    # Zero initialize accumulator (only for active warps)
    if warp_is_active:
        comptime for i in range(WM):
            comptime for j in range(WN):
                C_warp_accum[i, j] = 0.0

    # Sweep across K in BK chunks (single-buffered)
    for k_i in range(K // BK):
        barrier()

        var A_dram_tile = A.tile[BM, BK](block_idx.y, k_i)
        var B_dram_tile = B.tile[BK, BN](k_i, block_idx.x)

        copy_dram_to_sram_async[
            thread_layout=row_major[4, 8](),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
        copy_dram_to_sram_async[
            thread_layout=row_major[4, 8](),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())

        async_copy_wait_all()
        barrier()

        if warp_is_active:
            var A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
            var B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

            comptime for mma_k in range(BK // MMA_K):
                comptime for mma_m in range(WM // MMA_M):
                    comptime for mma_n in range(WN // MMA_N):
                        # FILL IN (roughly 8 lines)
                        ...

    # Store the final per-warp accumulation to the output warp tile
    if warp_is_active:
        comptime for mma_m in range(WM // MMA_M):
            comptime for mma_n in range(WN // MMA_N):
                var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
                var Acc_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](mma_m, mma_n)
                var frag = mma_op.load_c(Acc_mma_tile)
                mma_op.store_d(C_mma_tile, frag)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p33/p33.mojo

ํ•  ์ผ: ์„ธ ๊ฒน์˜ ์ค‘์ฒฉ ๋ฃจํ”„ ์•ˆ์— ์žˆ๋Š” ๋นˆ ๋ถ€๋ถ„(# FILL IN (roughly 8 lines)์œผ๋กœ ํ‘œ์‹œ๋จ)์„ ์™„์„ฑํ•˜์„ธ์š”.

์ดํ•ดํ•ด์•ผ ํ•  ๊ฒƒ:

  • ์Šค์ผˆ๋ ˆํ†ค์ด ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ๊ด€๋ฆฌ, ์›Œํ”„ ๊ตฌ์„ฑ, ๋™๊ธฐํ™”๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ํ•ต์‹ฌ ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ๋งŒ ๊ตฌํ˜„ํ•˜๋ฉด ๋ฉ๋‹ˆ๋‹ค
  • ๋ฃจํ”„๋Š” MMA ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค: mma_k, mma_m, mma_n
  • ๊ฐ ๋ฐ˜๋ณต์—์„œ ํ•˜๋‚˜์˜ 16ร—8ร—8 ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์„ธ ๊ฒน ์ค‘์ฒฉ ๋ฃจํ”„ ์ดํ•ดํ•˜๊ธฐ:

@parameter
for mma_k in range(BK // MMA_K):     # 32รท8 = 4 iterations (K dimension)
    @parameter
    for mma_m in range(WM // MMA_M): # 32รท16 = 2 iterations (M dimension)
        @parameter
        for mma_n in range(WN // MMA_N): # 32รท8 = 4 iterations (N dimension)
            # YOUR CODE HERE: Process one 16ร—8ร—8 MMA fragment

๊ฐ ๋ฃจํ”„๊ฐ€ ํ•˜๋Š” ์ผ:

  • mma_k: ํ˜„์žฌ K-ํƒ€์ผ์˜ K-์Šฌ๋ผ์ด์Šค๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๊ฐ 8๊ฐœ ์š”์†Œ์˜ 4๊ฐœ ์Šฌ๋ผ์ด์Šค)
  • mma_m: ์›Œํ”„ ์ถœ๋ ฅ์˜ M-์Šฌ๋ผ์ด์Šค๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๊ฐ 16ํ–‰์˜ 2๊ฐœ ์Šฌ๋ผ์ด์Šค)
  • mma_n: ์›Œํ”„ ์ถœ๋ ฅ์˜ N-์Šฌ๋ผ์ด์Šค๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๊ฐ 8์—ด์˜ 4๊ฐœ ์Šฌ๋ผ์ด์Šค)
  • ํ•ฉ๊ณ„: 4ร—2ร—4 = K-ํƒ€์ผ๋‹น ์›Œํ”„๋‹น 32๊ฐœ MMA ์—ฐ์‚ฐ

팁

ํ…์„œ ์ฝ”์–ด ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”. ํ•„์š”ํ•œ ๋‹จ๊ณ„๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค:

  1. ์˜ฌ๋ฐ”๋ฅธ ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ์ถ”์ถœํ•˜๊ธฐ:

    • ์›Œํ”„ ํƒ€์ผ(A_warp_tile, B_warp_tile, C_warp_accum)์—์„œ MMA ํฌ๊ธฐ์˜ ํŠน์ • ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ถ”์ถœํ•ฉ๋‹ˆ๋‹ค
    • ๋ฃจํ”„ ์ธ๋ฑ์Šค(mma_m, mma_k, mma_n)๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์˜ฌ๋ฐ”๋ฅธ ํƒ€์ผ ์ขŒํ‘œ๋ฅผ ๊ตฌํ•ฉ๋‹ˆ๋‹ค
    • ๊ธฐ์–ตํ•˜์„ธ์š”: A๋Š” [MMA_M, MMA_K], B๋Š” [MMA_K, MMA_N], C๋Š” [MMA_M, MMA_N]์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค
  2. ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ํ…์„œ ์ฝ”์–ด ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•˜๊ธฐ:

    • mma_op ๊ฐ์ฒด์—๋Š” ๊ฐ ํ–‰๋ ฌ ํƒ€์ž…์„ ๋กœ๋“œํ•˜๋Š” ๋ฉ”์„œ๋“œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค
    • ๊ฐ ๋กœ๋“œ ๋ฉ”์„œ๋“œ๋Š” ํƒ€์ผ์„ ๋ฐ›์•„์„œ ๋ ˆ์ง€์Šคํ„ฐ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋ฐ˜ํ™˜ํ•ฉ๋‹ˆ๋‹ค
    • ์ƒ๊ฐํ•ด ๋ณด์„ธ์š”: load_a(), load_b(), load_c() - ๊ฐ๊ฐ ๋ฌด์—‡์„ ๋ฐ›์„๊นŒ์š”?
  3. ํ•˜๋“œ์›จ์–ด ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ฒฐ๊ณผ ์ €์žฅํ•˜๊ธฐ:

    • MMA ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ๊ฒฐ๊ณผ๋ฅผ ๋ˆ„์‚ฐ๊ธฐ ํƒ€์ผ์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
    • ์—ฐ์‚ฐ ํŒจํ„ด: result = A ร— B + C

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ: 128๊ฐœ์˜ ๊ฐœ๋ณ„ ๊ณฑ์…ˆ-๋ง์…ˆ ์—ฐ์‚ฐ์„ ํ•˜๋‚˜์˜ ํ•˜๋“œ์›จ์–ด ๋ช…๋ น์–ด๋กœ ๋Œ€์ฒดํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค!

๋””๋ฒ„๊น… ํŒ: ์ฐจ์› ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•˜๋ฉด ํƒ€์ผ ์ธ๋ฑ์‹ฑ์„ ๋‹ค์‹œ ํ™•์ธํ•˜์„ธ์š” - mma_m, mma_k, mma_n์˜ ์ˆœ์„œ๊ฐ€ ์˜ฌ๋ฐ”๋ฅธ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๋ฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

ํ’€์ด๋ฅผ ํ…Œ์ŠคํŠธํ•˜๋ ค๋ฉด ํ„ฐ๋ฏธ๋„์—์„œ ๋‹ค์Œ ๋ช…๋ น์–ด๋ฅผ ์‹คํ–‰ํ•˜์„ธ์š”:

pixi run p33 --test
uv run poe p33 --test

์™„์„ฑํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ •ํ™•๋„ ํ…Œ์ŠคํŠธ ๊ฒฐ๊ณผ๊ฐ€ ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค:

=== Running All Accuracy Tests ===
--- Test 1: Tensor Core vs CPU Reference ---
โœ… TENSOR CORE ACCURACY TEST PASSED!
--- Test 2: Idiomatic Tiled vs CPU Reference ---
โœ… IDIOMATIC TILED ACCURACY TEST PASSED!
ALL TESTS PASSED!

์†”๋ฃจ์…˜

def tensor_core_matrix_multiplication[
    dtype: DType,
    layout_a: Layout,
    layout_b: Layout,
    layout_c: Layout,
    BM: Int,
    BN: Int,
    BK: Int,
    WM: Int,
    WN: Int,
    MMA_M: Int,
    MMA_N: Int,
    MMA_K: Int,
](
    A: LayoutTensor[dtype, layout_a, ImmutAnyOrigin],
    B: LayoutTensor[dtype, layout_b, ImmutAnyOrigin],
    C: LayoutTensor[dtype, layout_c, MutAnyOrigin],
):
    comptime M = C.shape[0]()
    comptime N = C.shape[1]()
    comptime K = A.shape[1]()

    var warp_id = thread_idx.x // WARP_SIZE
    var warps_in_n = BN // WN
    var warps_in_m = BM // WM
    var warp_y = warp_id // warps_in_n
    var warp_x = warp_id % warps_in_n

    var warp_is_active = warp_y < warps_in_m

    var C_block_tile = C.tile[BM, BN](block_idx.y, block_idx.x)
    var C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

    var mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

    # Shared SRAM tiles (no padding to stay under shared memory limit)
    var A_sram_tile = LayoutTensor[
        A.dtype,
        Layout.row_major(BM, BK),
        MutAnyOrigin,
        address_space=AddressSpace.SHARED,
    ].stack_allocation()
    var B_sram_tile = LayoutTensor[
        B.dtype,
        Layout.row_major(BK, BN),
        MutAnyOrigin,
        address_space=AddressSpace.SHARED,
    ].stack_allocation()

    # One per-warp accumulator tile of shape [WM, WN]
    var C_warp_accum = LayoutTensor[
        C.dtype,
        Layout.row_major(WM, WN),
        MutAnyOrigin,
        address_space=AddressSpace.LOCAL,
    ].stack_allocation()

    # Zero initialize accumulator (only for active warps)
    if warp_is_active:
        comptime for i in range(WM):
            comptime for j in range(WN):
                C_warp_accum[i, j] = 0.0

    # (Removed shared C accumulator to reduce shared usage)

    # Sweep across K in BK chunks (single-buffered)
    for k_i in range(K // BK):
        barrier()

        var A_dram_tile = A.tile[BM, BK](block_idx.y, k_i)
        var B_dram_tile = B.tile[BK, BN](k_i, block_idx.x)

        copy_dram_to_sram_async[
            thread_layout=Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
        copy_dram_to_sram_async[
            thread_layout=Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())

        async_copy_wait_all()
        barrier()

        if warp_is_active:
            var A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
            var B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

            comptime for mma_k in range(BK // MMA_K):
                comptime for mma_m in range(WM // MMA_M):
                    comptime for mma_n in range(WN // MMA_N):
                        var A_mma_tile = A_warp_tile.tile[MMA_M, MMA_K](
                            mma_m, mma_k
                        )
                        var B_mma_tile = B_warp_tile.tile[MMA_K, MMA_N](
                            mma_k, mma_n
                        )
                        C_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](
                            mma_m, mma_n
                        )

                        var a_reg = mma_op.load_a(A_mma_tile)
                        var b_reg = mma_op.load_b(B_mma_tile)
                        var c_reg = mma_op.load_c(C_mma_tile)
                        var d_reg = mma_op.mma_op(a_reg, b_reg, c_reg)
                        mma_op.store_d(C_mma_tile, d_reg)

    # Store the final per-warp accumulation to the output warp tile
    if warp_is_active:
        comptime for mma_m in range(WM // MMA_M):
            comptime for mma_n in range(WN // MMA_N):
                var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
                var Acc_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](mma_m, mma_n)
                var frag = mma_op.load_c(Acc_mma_tile)
                mma_op.store_d(C_mma_tile, frag)


์ด ํ’€์ด๋Š” ํ…์„œ ์ฝ”์–ด ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ์›Œํ”„ ๊ตฌ์„ฑ

    • warp_id = thread_idx.x // WARP_SIZE๋กœ ๋ธ”๋ก ๋‚ด ์›Œํ”„ ์ขŒํ‘œ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • ์›Œํ”„๋ฅผ ์ถœ๋ ฅ ํƒ€์ผ์— ๋งคํ•‘ํ•ฉ๋‹ˆ๋‹ค: ๊ฐ ์›Œํ”„๊ฐ€ WMร—WN ์˜์—ญ์„ ๋‹ด๋‹นํ•ฉ๋‹ˆ๋‹ค
    • ์˜ˆ์ƒ๋ณด๋‹ค ์ ์€ ์ˆ˜์˜ ์›Œํ”„๊ฐ€ ์žˆ๋Š” ๋ธ”๋ก์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด warp_is_active ๊ฐ€๋“œ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  2. ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ์ตœ์ ํ™”

    • ๊ธ€๋กœ๋ฒŒ โ†’ ๊ณต์œ : ํšจ์œจ์ ์ธ ๋ธ”๋ก ์ˆ˜์ค€ ์ „์†ก์„ ์œ„ํ•ด copy_dram_to_sram_async๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ๊ณต์œ  โ†’ ๋ ˆ์ง€์Šคํ„ฐ: ์›Œํ”„ ์ˆ˜์ค€ ํ”„๋ž˜๊ทธ๋จผํŠธ ๋กœ๋”ฉ์„ ์œ„ํ•ด mma_op.load_a/load_b๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ๋ ˆ์ง€์Šคํ„ฐ ์—ฐ์‚ฐ: ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์œ„ํ•ด mma_op.mma_op๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
    • ๋ ˆ์ง€์Šคํ„ฐ โ†’ ๊ธ€๋กœ๋ฒŒ: ํšจ์œจ์ ์ธ ๊ฒฐ๊ณผ ์ €์žฅ์„ ์œ„ํ•ด mma_op.store_d๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  3. ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ

    • load_a(A_mma_tile): 16ร—8 ํ–‰๋ ฌ A ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • load_b(B_mma_tile): 8ร—8 ํ–‰๋ ฌ B ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋ ˆ์ง€์Šคํ„ฐ์— ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • load_c(C_mma_tile): 16ร—8 ๋ˆ„์‚ฐ๊ธฐ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ๋กœ๋“œํ•ฉ๋‹ˆ๋‹ค
    • mma_op(a_reg, b_reg, c_reg): ์ „์šฉ ํ•˜๋“œ์›จ์–ด๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ D = Aร—B + C๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
    • store_d(C_mma_tile, d_reg): 16ร—8 ๊ฒฐ๊ณผ ํ”„๋ž˜๊ทธ๋จผํŠธ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  4. ํฌ๋กœ์Šค ํ”Œ๋žซํผ ํ˜ธํ™˜์„ฑ

    • ๋ชจ๋“  ํƒ€์ผ๋ง ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ WARP_SIZE์˜ ๋ฐฐ์ˆ˜์ž…๋‹ˆ๋‹ค (NVIDIA์—์„œ 32, AMD์—์„œ 64)
    • Mojo๋Š” TensorCore ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ํ†ตํ•ด ํ•˜๋“œ์›จ์–ด ์ฐจ์ด๋ฅผ ์ถ”์ƒํ™”ํ•ฉ๋‹ˆ๋‹ค
    • ๋™์ผํ•œ ์ฝ”๋“œ๊ฐ€ NVIDIA ํ…์„œ ์ฝ”์–ด์™€ AMD Matrix Core ๋ชจ๋‘์—์„œ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ๋Š” ํ…์„œ ์ฝ”์–ด๊ฐ€ ์Šค๋ ˆ๋“œ ์ˆ˜์ค€์˜ ๊ฐœ๋ณ„ ์š”์†Œ๊ฐ€ ์•„๋‹Œ ์›Œํ”„ ์ˆ˜์ค€์˜ ์ „์ฒด ํ–‰๋ ฌ ํ”„๋ž˜๊ทธ๋จผํŠธ ๋‹จ์œ„๋กœ ๋™์ž‘ํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ์™€ ์ „์šฉ ํ•˜๋“œ์›จ์–ด ๊ฐ€์†์ด ๊ฐ€๋Šฅํ•ด์ง‘๋‹ˆ๋‹ค.

์„ฑ๋Šฅ ๋ถ„์„: ์ด๊ฒƒ์œผ๋กœ ๋์ผ๊นŒ?

์ด์ œ ํ…์„œ ์ฝ”์–ด๊ฐ€ ๊ด€์šฉ์  ํƒ€์ผ๋ง ๋ฐฉ์‹ ๋Œ€๋น„ ์•ฝ์†๋œ ์„ฑ๋Šฅ ์šฐ์œ„๋ฅผ ์‹ค์ œ๋กœ ์ œ๊ณตํ•˜๋Š”์ง€ ํ™•์ธํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

ํ”„๋กœํŒŒ์ผ๋ง์šฉ ๋นŒ๋“œ

uv run mojo build problems/p33/p33.mojo -o problems/p33/p33_profiler
pixi run mojo build problems/p33/p33.mojo -o problems/p33/p33_profiler

NVIDIA Nsight Compute๋กœ ํ”„๋กœํŒŒ์ผ๋ง (NVIDIA ์ „์šฉ)

๋จผ์ € ncu์— ์ ‘๊ทผํ•˜๊ธฐ ์œ„ํ•ด CUDA ํ™˜๊ฒฝ์— ์ง„์ž…ํ•ฉ๋‹ˆ๋‹ค:

# Enter CUDA environment
pixi shell -e nvidia

# Profile tensor core version
ncu --set full --metrics sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed,smsp__inst_executed_pipe_tensor_op_hmma.sum ./problems/p33/p33_profiler --tensor-core

# Profile tiled version for comparison
ncu --set full --metrics sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed ./problems/p33/p33_profiler --tiled

๋น„๊ตํ•  ํ•ต์‹ฌ ๋ฉ”ํŠธ๋ฆญ

์„ฑ๋Šฅ ๋ฉ”ํŠธ๋ฆญ:

  • Duration: ์ „์ฒด kernel ์‹คํ–‰ ์‹œ๊ฐ„ (๋‚ฎ์„์ˆ˜๋ก ์ข‹์Œ)
  • SM Active %: SM ํ™œ์šฉ๋ฅ  (๋†’์„์ˆ˜๋ก ์ข‹์Œ)
  • DRAM Throughput: ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ํ™œ์šฉ๋ฅ  (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ ์—ฌ๋ถ€๋ฅผ ๋ณด์—ฌ์คŒ)
  • Tensor Op Instructions: ์‹ค์ œ ํ…์„œ ์ฝ”์–ด ์—ฐ์‚ฐ ํšŸ์ˆ˜ (ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „์—๋งŒ ํ•ด๋‹น)

์ผ๋ฐ˜์ ์ธ ๊ฒฐ๊ณผ:

ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „ (๋” ๋А๋ฆผ):

  • Duration: ~13.9 ms (ํ›จ์”ฌ ๋А๋ฆผ!)
  • SM Active: 83.7% (์ข‹์€ ํ™œ์šฉ๋ฅ )
  • DRAM Throughput: 72.5% (๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ!)
  • Occupancy: 26.3% (๋‚˜์จ - ๋ ˆ์ง€์Šคํ„ฐ์— ์˜ํ•ด ์ œํ•œ๋จ)
  • Tensor Op Instructions: 1,048,576 (ํ…์„œ ์ฝ”์–ด๊ฐ€ ๋™์ž‘ ์ค‘์ž„์„ ํ™•์ธ)

ํƒ€์ผ๋ง ๋ฒ„์ „ (๋” ๋น ๋ฆ„):

  • Duration: ~1.62 ms (8.6๋ฐฐ ๋น ๋ฆ„!)
  • SM Active: 98.0% (ํƒ์›”ํ•œ ํ™œ์šฉ๋ฅ )
  • DRAM Throughput: 1.7% (์˜ˆ์ƒ๋Œ€๋กœ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ)
  • Occupancy: 66.7% (ํ›จ์”ฌ ๋‚˜์Œ)
  • L2 Hit Rate: 96.9% vs 29.7% (ํ›จ์”ฌ ๋‚˜์€ ์บ์‹œ ์ง€์—ญ์„ฑ)

์™œ ํ…์„œ ์ฝ”์–ด๊ฐ€ ๋” ๋А๋ฆด๊นŒ?

  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ: 72% DRAM ์‚ฌ์šฉ๋Ÿ‰์€ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๋‚ฎ์€ ์ ์œ ์œจ: 26% vs 67% - ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰(์Šค๋ ˆ๋“œ๋‹น 68 vs 38)์ด ๋™์‹œ ์›Œํ”„ ์ˆ˜๋ฅผ ์ œํ•œํ•ฉ๋‹ˆ๋‹ค
  • ์บ์‹œ ๋ฏธ์Šค: 29% L2 ์ ์ค‘๋ฅ  vs 97%๋Š” ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋Œ: ์ตœ์ ํ™”๋˜์ง€ ์•Š์€ ์ ‘๊ทผ ํŒจํ„ด์œผ๋กœ ์ธํ•œ ๋ฑ…ํฌ ์ถฉ๋Œ
  • ์‹คํ–‰ ์„ค์ •: ์ด ๋ฌธ์ œ ํฌ๊ธฐ์— ๋Œ€ํ•ด ์ตœ์ ์ด ์•„๋‹Œ ๋ธ”๋ก/์›Œํ”„ ๊ตฌ์„ฑ

์„ฑ๋Šฅ์˜ ํ˜„์‹ค

ํ”„๋กœํŒŒ์ผ๋ง ๊ฒฐ๊ณผ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ์ด, โ€œ์ „์šฉ ํ•˜๋“œ์›จ์–ดโ€œ๊ฐ€ ์ž๋™์œผ๋กœ ๋นจ๋ผ์ง€๋Š” ๊ฒƒ์€ ์•„๋‹™๋‹ˆ๋‹ค! ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „์€ ๋‹จ์ˆœํ•œ ํƒ€์ผ๋ง ๋ฐฉ์‹๋ณด๋‹ค ์ƒ๋‹นํžˆ ๋А๋ฆฝ๋‹ˆ๋‹ค(~8.6๋ฐฐ). ์ด๋Š” GPU ์ตœ์ ํ™”์—์„œ ํ”ํžˆ ๋ณผ ์ˆ˜ ์žˆ๋Š” ํ˜„์‹ค์ž…๋‹ˆ๋‹ค - ํ•˜๋“œ์›จ์–ด์˜ ์›์‹œ ์„ฑ๋Šฅ์ด ๊ณง ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋ณด์žฅํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ์ธ์‚ฌ์ดํŠธ:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ: 72% DRAM ์‚ฌ์šฉ๋Ÿ‰์€ ํ…์„œ ์ฝ”์–ด๊ฐ€ ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ๊ฐ€ ์•„๋‹Œ ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ์ž„์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๋‚ฎ์€ ์ ์œ ์œจ: ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์œผ๋กœ ์ธํ•ด 26% vs 67%๋กœ ๋™์‹œ ์›Œํ”„ ์ˆ˜๊ฐ€ ์ œํ•œ๋ฉ๋‹ˆ๋‹ค
  • ์บ์‹œ ๋ฏธ์Šค: 29% vs 97% L2 ์ ์ค‘๋ฅ ์€ ๋‚ฎ์€ ๋ฉ”๋ชจ๋ฆฌ ์ง€์—ญ์„ฑ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ๋ฆฌ์†Œ์Šค ๋‚ญ๋น„: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ์ถฉ๋Œ๊ณผ ์ตœ์ ์ด ์•„๋‹Œ ์‹คํ–‰ ์„ค์ •

๊ตํ›ˆ: ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์„ ์ดํ•ดํ•˜๊ณ  ์ฒด๊ณ„์ ์œผ๋กœ ์ตœ์ ํ™”ํ•˜๋Š” ๊ฒƒ์ด โ€œ์ตœ์‹ ์˜ ๊ฐ€์žฅ ๋›ฐ์–ด๋‚œโ€ API๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ํ•˜๋“œ์›จ์–ด ๊ธฐ๋Šฅ์€ ์„ธ์‹ฌํ•œ ํŠœ๋‹์ด ํ•„์š”ํ•œ ๋„๊ตฌ์ด์ง€, ๋งˆ๋ฒ•์˜ ์€ํƒ„ํ™˜์ด ์•„๋‹™๋‹ˆ๋‹ค.

๋‹ค์Œ ๋‹จ๊ณ„

보람 있는 GPU 최적화 도전을 할 준비가 되셨나요? 🎯 성능 보너스 챌린지로 이동하여, 메모리 바운드 상태인 텐서 코어 구현을 단순한 타일링 버전을 실제로 능가하는 구현으로 바꾸는 방법을 배워보세요!

๐ŸŽฏ ์„ฑ๋Šฅ ๋ณด๋„ˆ์Šค ์ฑŒ๋ฆฐ์ง€

๋ฐœ๊ฒฌ

Puzzle 33์„ ์™„๋ฃŒํ•˜๊ณ  Mojo์˜ TensorCore API๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹ค์ œ ํ…์„œ ์ฝ”์–ด ํ–‰๋ ฌ ๊ณฑ์…ˆ์„ ๊ตฌํ˜„ํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ตฌํ˜„์€ ์ •ํ™•ํ•˜๊ฒŒ ๋™์ž‘ํ•˜๊ณ , ๋ชจ๋“  ์ •ํ™•๋„ ํ…Œ์ŠคํŠธ๋ฅผ ํ†ต๊ณผํ•˜๋ฉฐ, ์‹ค์ œ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ํ–‰๋ ฌ ์—ฐ์‚ฐ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ Puzzle 16์˜ ํƒ€์ผ๋ง ๋ฒ„์ „๊ณผ ํ”„๋กœํŒŒ์ผ๋ง์œผ๋กœ ๋น„๊ตํ•˜๋ฉดโ€ฆ

โ€œ์ „์šฉ ํ•˜๋“œ์›จ์–ดโ€œ๊ฐ€ ์—„์ฒญ๋‚˜๊ฒŒ ๋” ๋А๋ฆฝ๋‹ˆ๋‹ค!

๋ฌด์—‡์ด ์ž˜๋ชป๋œ ๊ฑธ๊นŒ?

(NVIDIA ์ „์šฉ) ncu๋ฅผ ์‚ฌ์šฉํ•œ ํ”„๋กœํŒŒ์ผ๋ง์ด ๋ƒ‰ํ˜นํ•œ ํ˜„์‹ค์„ ๋“œ๋Ÿฌ๋ƒˆ์Šต๋‹ˆ๋‹ค (ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ๋ฒ•์„ ๋ณต์Šตํ•˜๋ ค๋ฉด Puzzle 10์˜ ๋ฉ”๋ชจ๋ฆฌ ์˜ค๋ฅ˜ ํƒ์ง€์™€ Puzzle 30์˜ GPU ํ”„๋กœํŒŒ์ผ๋ง์„ ์ฐธ๊ณ ํ•˜์„ธ์š”):

ํ…์„œ ์ฝ”์–ด ๋ฒ„์ „ (๊ธฐ๋Œ€์— ๋ชป ๋ฏธ์นจ):

  • Duration: ~13.9 ms
  • ๋ฉ”๋ชจ๋ฆฌ ๋ฐ”์šด๋“œ: 72.5% DRAM ์ฒ˜๋ฆฌ๋Ÿ‰ (์—ฐ์‚ฐ ๋ฐ”์šด๋“œ์—ฌ์•ผ ํ•˜๋Š”๋ฐ!)
  • ๋‚ฎ์€ ์ ์œ ์œจ: 26.3% (ํ•˜๋“œ์›จ์–ด ๋‚ญ๋น„)
  • ์บ์‹œ ์žฌ์•™: 29.7% L2 ์ ์ค‘๋ฅ 
  • ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ•: ์Šค๋ ˆ๋“œ๋‹น 68๊ฐœ ๋ ˆ์ง€์Šคํ„ฐ
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ถฉ๋Œ: ๋ฑ…ํฌ ์ถฉ๋Œ์ด ์„ฑ๋Šฅ์„ ํŒŒ๊ดด

ํƒ€์ผ๋ง ๋ฒ„์ „ (์Šน์ž):

  • Duration: ~1.62 ms (8.6๋ฐฐ ๋น ๋ฆ„!)
  • ์—ฐ์‚ฐ ๋ฐ”์šด๋“œ: 1.7% DRAM ์ฒ˜๋ฆฌ๋Ÿ‰ (์˜ˆ์ƒ๋Œ€๋กœ)
  • ํƒ์›”ํ•œ ์ ์œ ์œจ: 66.7%
  • ์บ์‹œ ์นœํ™”์ : 96.9% L2 ์ ์ค‘๋ฅ 
  • ํšจ์œจ์ : ์Šค๋ ˆ๋“œ๋‹น 38๊ฐœ ๋ ˆ์ง€์Šคํ„ฐ
  • ๊น”๋”ํ•œ ๋ฉ”๋ชจ๋ฆฌ: ์œ ์˜๋ฏธํ•œ ๋ฑ…ํฌ ์ถฉ๋Œ ์—†์Œ

๋ƒ‰ํ˜นํ•œ ํ˜„์‹ค

์ด๋Š” GPU ์ตœ์ ํ™”์—์„œ ํ”ํ•œ ์ด์•ผ๊ธฐ์ž…๋‹ˆ๋‹ค: ํ•˜๋“œ์›จ์–ด์˜ ์›์‹œ ์„ฑ๋Šฅ โ‰  ์‹ค์ œ ์„ฑ๋Šฅ. ํ…์„œ ์ฝ”์–ด๋Š” ๋†€๋ž๋„๋ก ๊ฐ•๋ ฅํ•˜์ง€๋งŒ, ๋™์‹œ์— ์š”๊ตฌ์‚ฌํ•ญ๋„ ๋†€๋ž๋„๋ก ๊นŒ๋‹ค๋กญ์Šต๋‹ˆ๋‹ค:

  • ๋ฉ”๋ชจ๋ฆฌ ๋ฒฝ: ์—ฐ์‚ฐ์ด ๋„ˆ๋ฌด ๋นจ๋ผ์„œ ๋ชจ๋“  ๋ฉ”๋ชจ๋ฆฌ ๋ณ‘๋ชฉ์ด ๋“œ๋Ÿฌ๋‚จ
  • ๋ฆฌ์†Œ์Šค ํƒ์‹: ๋†’์€ ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์ด ์ ์œ ์œจ์„ ์ €ํ•˜์‹œํ‚ด
  • ์ ‘๊ทผ ํŒจํ„ด ๋ฏผ๊ฐ: ๋‚˜์œ ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด์ด ์บ์‹œ ๋™์ž‘์„ ํŒŒ๊ดดํ•จ
  • ์„ค์ •์ด ํ•ต์‹ฌ: ์‹คํ–‰ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์™„๋ฒฝํ•˜๊ฒŒ ํŠœ๋‹ํ•ด์•ผ ํ•จ

๋ฏธ์…˜: ํ…์„œ ์ฝ”์–ด ์„ฑ๋Šฅ ๊ฐœ์„ ํ•˜๊ธฐ

도전 과제: 메모리 바운드에 점유율도 낮은 텐서 코어 구현을, 단순한 타일링 버전을 실제로 능가하는 구현으로 바꿔 보세요.

์ด๊ฒจ์•ผ ํ•  ๊ธฐ์ค€:

  • ๋ชฉํ‘œ Duration: < 1.62 ms
  • ์ ์œ ์œจ: > 26.3% ๊ธฐ์ค€์„ 
  • DRAM ๋ถ€ํ•˜: < 72.5% ๊ธฐ์ค€์„ 
  • ์บ์‹œ ์„ฑ๋Šฅ: > 29.7% L2 ์ ์ค‘๋ฅ  ๊ธฐ์ค€์„ 

ํƒ๊ตฌํ•  ์ตœ์ ํ™” ์ „๋žต:

  1. ๋ ˆ์ง€์Šคํ„ฐ ์••๋ฐ• ์ค„์ด๊ธฐ

    • ๋” ์ž‘์€ ๋ˆ„์‚ฐ๊ธฐ ํƒ€์ผ ์‚ฌ์šฉ
    • ์ค‘๊ฐ„ ์ €์žฅ ๊ณต๊ฐ„ ์ตœ์†Œํ™”
    • ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰์„ ์ค„์ด๊ธฐ ์œ„ํ•ด ํ˜ผํ•ฉ ์ •๋ฐ€๋„ ๊ณ ๋ ค
    • ํšจ์œจ์ ์ธ ๋ˆ„์  ํŒจํ„ด์€ Puzzle 16์˜ ํƒ€์ผ๋ง ๋ฐฉ์‹ ์ฐธ๊ณ 
  2. ๋ฉ”๋ชจ๋ฆฌ ํŒจํ„ด ์ตœ์ ํ™”

    • ๋ฑ…ํฌ ์ถฉ๋Œ์„ ์ œ๊ฑฐํ•˜๊ธฐ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํŒจ๋”ฉ ์ถ”๊ฐ€ (๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ฐœ๋… ์ฐธ๊ณ )
    • copy_dram_to_sram_async ๋ ˆ์ด์•„์›ƒ ์ตœ์ ํ™”
    • ๋ณ‘ํ•ฉ ํŒจํ„ด ๊ฐœ์„  (์ดˆ๋ฐ˜ ํผ์ฆ์˜ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ๊ธฐ์ดˆ ์ฐธ๊ณ )
  3. ์ ์œ ์œจ ๊ฐœ์„ 

    • ๋” ๋‚˜์€ ์›Œํ”„ ํ™œ์šฉ์„ ์œ„ํ•œ ๋ธ”๋ก ํฌ๊ธฐ ํŠœ๋‹
    • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ vs ๋ ˆ์ง€์Šคํ„ฐ ์‚ฌ์šฉ๋Ÿ‰ ๊ท ํ˜• ๋งž์ถ”๊ธฐ
    • ์›Œํ”„-SM ๋งคํ•‘ ์ตœ์ ํ™”
    • Puzzle 11-20 ์‹œ๋ฆฌ์ฆˆ์˜ ์Šค๋ ˆ๋“œ ์กฐ์ • ๊ตํ›ˆ ์ ์šฉ
  4. ์บ์‹œ ์ตœ์ ํ™”

    • ๋ฐ์ดํ„ฐ ์žฌ์‚ฌ์šฉ ํŒจํ„ด ๊ฐœ์„ 
    • ์บ์‹œ ๊ณ„์ธต ๊ตฌ์กฐ์— ๋งž๋Š” ํƒ€์ผ ํฌ๊ธฐ ์ตœ์ ํ™”
    • ๋ฐ์ดํ„ฐ ๋ ˆ์ด์•„์›ƒ ๋ณ€ํ™˜ ๊ณ ๋ ค
    • ์ด์ „ ํผ์ฆ ๊ณผ์ •์˜ ๋ฉ”๋ชจ๋ฆฌ ๊ณ„์ธต ๊ตฌ์กฐ ๊ฐœ๋… ํ™œ์šฉ
  5. ๊ณ ๊ธ‰ ๊ธฐ๋ฒ•

    • ๋ฉ”๋ชจ๋ฆฌ์™€ ์—ฐ์‚ฐ์„ ์ค‘์ฒฉํ•˜๊ธฐ ์œ„ํ•œ ๋”๋ธ” ๋ฒ„ํผ๋ง ๊ตฌํ˜„
    • ์†Œํ”„ํŠธ์›จ์–ด ํŒŒ์ดํ”„๋ผ์ด๋‹ ์‚ฌ์šฉ
    • ๋น„๋™๊ธฐ ์‹คํ–‰ ํŒจํ„ด ํƒ๊ตฌ
    • ์ƒˆ๋‹ˆํƒ€์ด์ € ํผ์ฆ์˜ ๊ณ ๊ธ‰ ์กฐ์ • ๊ธฐ๋ฒ• ์ ์šฉ

์„ฑ๊ณต ๊ธฐ์ค€

  • ์ •ํ™•์„ฑ: ๋ชจ๋“  ์ •ํ™•๋„ ํ…Œ์ŠคํŠธ๊ฐ€ ์—ฌ์ „ํžˆ ํ†ต๊ณผ
  • ์„ฑ๋Šฅ: ํ…์„œ ์ฝ”์–ด Duration < 1.62 ms
  • ํšจ์œจ์„ฑ: ๋” ๋†’์€ ์ ์œ ์œจ (>26.3%)
  • ๋ฉ”๋ชจ๋ฆฌ: ๋” ๋‚ฎ์€ DRAM ๋ถ€ํ•˜ (<72.5%)
  • ์บ์‹œ: ๋” ๋†’์€ ์ ์ค‘๋ฅ  (>29.7% L2)

๋” ๊นŠ์€ ๊ตํ›ˆ

์ด ๋ณด๋„ˆ์Šค ์ฑŒ๋ฆฐ์ง€๋Š” GPU ์ตœ์ ํ™”์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๊ตํ›ˆ์„ ๊ฐ€๋ฅด์นฉ๋‹ˆ๋‹ค: ๋ณ‘๋ชฉ์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ตœ์‹  API๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

๋ชฉํ‘œ๋Š” ๋‹จ์ˆœํžˆ ํ…์„œ ์ฝ”์–ด๋ฅผ ๋” ๋น ๋ฅด๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด ์•„๋‹™๋‹ˆ๋‹ค - ํ…์„œ ์ฝ”์–ด๊ฐ€ ์™œ ๋” ๋А๋ ค์งˆ ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•˜๊ณ , ์„ฑ๋Šฅ ๋ฌธ์ œ๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ง„๋‹จํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์šฐ๊ณ , ์›์น™์— ๊ธฐ๋ฐ˜ํ•œ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

์ด ์ฑŒ๋ฆฐ์ง€๋ฅผ ์™„์ˆ˜ํ•˜๋ฉด, ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•˜๋“œ์›จ์–ด ๊ธฐ๋Šฅ๊ณผ ๊ด€๊ณ„์—†์ด ์–ด๋–ค GPU ์›Œํฌ๋กœ๋“œ๋“  ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ๋Š” ์—ญ๋Ÿ‰์„ ๊ฐ–์ถ”๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

Puzzle 34: GPU ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (SM90+)

์†Œ๊ฐœ

ํ•˜๋“œ์›จ์–ด ์š”๊ตฌ์‚ฌํ•ญ: โš ๏ธ NVIDIA SM90+ ์ „์šฉ

์ด ํผ์ฆ์€ SM90+ ์ปดํ“จํŠธ ๋Šฅ๋ ฅ์„ ๊ฐ–์ถ˜ NVIDIA Hopper ์•„ํ‚คํ…์ฒ˜ (H100, H200) ์ด์ƒ์˜ GPU๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ API๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๊ธฐ๋ฐ˜์ด๋ฉฐ, ์ง€์›ํ•˜์ง€ ์•Š๋Š” ํ•˜๋“œ์›จ์–ด์—์„œ๋Š” ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์‚ฌ์šฉ ์ค‘์ธ ์•„ํ‚คํ…์ฒ˜๊ฐ€ ํ™•์‹คํ•˜์ง€ ์•Š๋‹ค๋ฉด pixi run gpu-specs๋ฅผ ์‹คํ–‰ํ•˜์—ฌ ์ตœ์†Œ Compute Cap: 9.0 ์ด์ƒ์ธ์ง€ ํ™•์ธํ•˜์„ธ์š” (ํ•˜๋“œ์›จ์–ด ์‹๋ณ„์— ๋Œ€ํ•œ ์ž์„ธํ•œ ๋‚ด์šฉ์€ NVIDIA ํ”„๋กœํŒŒ์ผ๋ง ๊ธฐ์ดˆ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”)

์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 24-26) ์—์„œ ๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ (Puzzle 27) ๊นŒ์ง€์˜ ์—ฌ์ •์„ ์ด์–ด, ์ด์ œ ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค - ๋‹จ์ผ ๋ธ”๋ก์˜ ํ•œ๊ณ„๋ฅผ ๋„˜์–ด์„œ๋Š” ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ์กฐ์ •ํ•˜๋Š” ๊ธฐ๋ฒ•์ž…๋‹ˆ๋‹ค.

์Šค๋ ˆ๋“œ ๋ธ”๋ก ํด๋Ÿฌ์Šคํ„ฐ๋ž€?

์Šค๋ ˆ๋“œ ๋ธ”๋ก ํด๋Ÿฌ์Šคํ„ฐ๋Š” ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋™๊ธฐํ™” ๋ฐ ํ†ต์‹  ๊ธฐ๋ณธ ์š”์†Œ๋ฅผ ํ†ตํ•ด ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ํ˜‘๋ ฅํ•˜์—ฌ ํ•˜๋‚˜์˜ ์—ฐ์‚ฐ ์ž‘์—…์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•ด์ฃผ๋Š” ํ˜์‹ ์ ์ธ SM90+ ๊ธฐ๋Šฅ์ž…๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ๊ธฐ๋Šฅ:

  • ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™”: cluster_sync, cluster_arrive, cluster_wait๋กœ ์—ฌ๋Ÿฌ ๋ธ”๋ก์„ ์กฐ์ •ํ•ฉ๋‹ˆ๋‹ค
  • ๋ธ”๋ก ์‹๋ณ„: block_rank_in_cluster๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ณ ์œ ํ•œ ๋ธ”๋ก ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ํšจ์œจ์ ์ธ ์กฐ์ •: elect_one_sync๋กœ ์ตœ์ ํ™”๋œ ์›Œํ”„ ์ˆ˜์ค€ ํ˜‘๋ ฅ์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค
  • ๊ณ ๊ธ‰ ํŒจํ„ด: cluster_mask_base๋กœ ์„ ํƒ์  ๋ธ”๋ก ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋ชจ๋ธ

๊ธฐ์กด GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ

Grid (Multiple Blocks)
โ”œโ”€โ”€ Block (Multiple Warps) - barrier() synchronization
    โ”œโ”€โ”€ Warp (32 Threads) - SIMT lockstep execution
    โ”‚   โ”œโ”€โ”€ Lane 0  โ”€โ”
    โ”‚   โ”œโ”€โ”€ Lane 1   โ”‚ All execute same instruction
    โ”‚   โ”œโ”€โ”€ Lane 2   โ”‚ at same time (SIMT)
    โ”‚   โ”‚   ...      โ”‚ warp.sum(), warp.broadcast()
    โ”‚   โ””โ”€โ”€ Lane 31 โ”€โ”˜
        โ””โ”€โ”€ Thread (SIMD operations within each thread)

์ƒˆ๋กœ์šด ๊ณ„์ธต: ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ:

Grid (Multiple Clusters)
โ”œโ”€โ”€ ๐Ÿ†• Cluster (Multiple Blocks) - cluster_sync(), cluster_arrive()
    โ”œโ”€โ”€ Block (Multiple Warps) - barrier() synchronization
        โ”œโ”€โ”€ Warp (32 Threads) - SIMT lockstep execution
        โ”‚   โ”œโ”€โ”€ Lane 0  โ”€โ”
        โ”‚   โ”œโ”€โ”€ Lane 1   โ”‚ All execute same instruction
        โ”‚   โ”œโ”€โ”€ Lane 2   โ”‚ at same time (SIMT)
        โ”‚   โ”‚   ...      โ”‚ warp.sum(), warp.broadcast()
        โ”‚   โ””โ”€โ”€ Lane 31 โ”€โ”˜
            โ””โ”€โ”€ Thread (SIMD operations within each thread)

์‹คํ–‰ ๋ชจ๋ธ ์ƒ์„ธ:

  • ์Šค๋ ˆ๋“œ ๋ ˆ๋ฒจ: ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ์˜ SIMD ์—ฐ์‚ฐ
  • ์›Œํ”„ ๋ ˆ๋ฒจ: SIMT ์‹คํ–‰ - 32๊ฐœ ์Šค๋ ˆ๋“œ์˜ ๋ก์Šคํ… ์กฐ์ •
  • ๋ธ”๋ก ๋ ˆ๋ฒจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์™€ ๋ฐฐ๋ฆฌ์–ด๋ฅผ ํ™œ์šฉํ•œ ๋ฉ€ํ‹ฐ ์›Œํ”„ ์กฐ์ •
  • ๐Ÿ†• ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: SM90+ ํด๋Ÿฌ์Šคํ„ฐ API๋ฅผ ํ™œ์šฉํ•œ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •

ํ•™์Šต ๋‹จ๊ณ„

์ด ํผ์ฆ์€ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์—ญ๋Ÿ‰์„ ์ฒด๊ณ„์ ์œผ๋กœ ์Œ“์•„๊ฐ€๋Š” 3๋‹จ๊ณ„ ๊ตฌ์„ฑ์œผ๋กœ ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

๐Ÿ”ฐ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ • ๊ธฐ์ดˆ

ํ•ต์‹ฌ: ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด์˜ ๊ธฐ๋ณธ ์ดํ•ด

์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด cluster_arrive()์™€ cluster_wait()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ๋ณธ์ ์ธ ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ๊ณผ ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ๋ฅผ ์œ„ํ•ด ์‹คํ–‰์„ ์กฐ์ •ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

์ฃผ์š” API: block_rank_in_cluster(), cluster_arrive(), cluster_wait()


โ˜ธ๏ธ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ

ํ•ต์‹ฌ: ๋ธ”๋ก ๋ ˆ๋ฒจ ํŒจํ„ด์„ ํด๋Ÿฌ์Šคํ„ฐ ๊ทœ๋ชจ๋กœ ํ™•์žฅ

์ต์ˆ™ํ•œ block.sum() ๊ฐœ๋…์„ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์ณ ํ™•์žฅํ•˜์—ฌ ๋Œ€๊ทœ๋ชจ ์—ฐ์‚ฐ์„ ์กฐ์ •ํ•˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ๋ฆฌ๋•์…˜๊ณผ ์ง‘ํ•ฉ ์—ฐ์‚ฐ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

์ฃผ์š” API: cluster_sync(), ํšจ์œจ์ ์ธ ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ์œ„ํ•œ elect_one_sync()


๐Ÿง  ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜

ํ•ต์‹ฌ: ํ”„๋กœ๋•์…˜ ์ˆ˜์ค€์˜ ๋‹ค๋‹จ๊ณ„ ์กฐ์ • ํŒจํ„ด

GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ณ  ๋ณต์žกํ•œ ์—ฐ์‚ฐ ์›Œํฌํ”Œ๋กœ์šฐ๋ฅผ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ์›Œํ”„ ๋ ˆ๋ฒจ, ๋ธ”๋ก ๋ ˆ๋ฒจ, ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ์˜ ์กฐ์ •์„ ๊ฒฐํ•ฉํ•˜๋Š” ์ •๊ตํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

์ฃผ์š” API: elect_one_sync(), cluster_arrive(), ๊ณ ๊ธ‰ ์กฐ์ • ํŒจํ„ด

ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์ด ์ค‘์š”ํ•œ ์ด์œ 

๋ฌธ์ œ ๊ทœ๋ชจ: ํ˜„๋Œ€ AI ๋ฐ ๊ณผํ•™ ์›Œํฌ๋กœ๋“œ๋Š” ๋‹จ์ผ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์˜ ๋Šฅ๋ ฅ์„ ์ดˆ๊ณผํ•˜๋Š” ์—ฐ์‚ฐ์„ ํ•„์š”๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค:

ํ•˜๋“œ์›จ์–ด ๋ฐœ์ „: GPU๊ฐ€ ๋” ๋งŽ์€ ์—ฐ์‚ฐ ์œ ๋‹›์„ ๊ฐ–์ถ”๊ฒŒ ๋จ์— ๋”ฐ๋ผ (Puzzle 30์˜ GPU ์•„ํ‚คํ…์ฒ˜ ํ”„๋กœํŒŒ์ผ๋ง ์ฐธ๊ณ ), ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์ฐจ์„ธ๋Œ€ ํ•˜๋“œ์›จ์–ด๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™œ์šฉํ•˜๋Š” ๋ฐ ํ•„์ˆ˜์ ์ด ๋ฉ๋‹ˆ๋‹ค.

๊ต์œก์  ๊ฐ€์น˜

์ด ํผ์ฆ์„ ์™„๋ฃŒํ•˜๋ฉด ์™„์ „ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ํ•™์Šตํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค:

์ด ๊ณผ์ •์€ Puzzle 30-32์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๊ธฐ๋ฐ˜์œผ๋กœ, ์ฐจ์„ธ๋Œ€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ๊ณผ ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌ ์ปดํ“จํŒ… ๋„์ „์— ๋Œ€๋น„ํ•  ์ˆ˜ ์žˆ๋„๋ก ์ค€๋น„์‹œ์ผœ ์ค๋‹ˆ๋‹ค.

์‹œ์ž‘ํ•˜๊ธฐ

์„ ์ˆ˜ ์กฐ๊ฑด:

๊ถŒ์žฅ ํ•™์Šต ๋ฐฉ๋ฒ•: 3๋‹จ๊ณ„ ๊ตฌ์„ฑ์„ ์ˆœ์„œ๋Œ€๋กœ ๋”ฐ๋ผ๊ฐ€์„ธ์š”. ๊ฐ ๋‹จ๊ณ„๊ฐ€ ๋‹ค์Œ ๋‹จ๊ณ„์˜ ๋ณต์žก์„ฑ์„ ์œ„ํ•œ ํ•ต์‹ฌ ๊ฐœ๋…์„ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค.

ํ•˜๋“œ์›จ์–ด ์ฐธ๊ณ : SM90+ ์ด์™ธ์˜ ํ•˜๋“œ์›จ์–ด์—์„œ ์‹คํ–‰ํ•˜๋Š” ๊ฒฝ์šฐ, ์ด ํผ์ฆ์€ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ฐœ๋…๊ณผ API ์‚ฌ์šฉ ํŒจํ„ด์˜ ๊ต์œก์  ์˜ˆ์ œ๋กœ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋ฏธ๋ž˜๋ฅผ ๋ฐฐ์šธ ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ • ๊ธฐ์ดˆ ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜์—ฌ ๊ธฐ๋ณธ์ ์ธ ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ฐฐ์›Œ๋ณด์„ธ์š”!

๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ • ๊ธฐ์ดˆ

๊ฐœ์š”

์ฒซ ๋ฒˆ์งธ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๋„์ „์— ์˜ค์‹  ๊ฒƒ์„ ํ™˜์˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ์„น์…˜์—์„œ๋Š” SM90+ ํด๋Ÿฌ์Šคํ„ฐ API๋ฅผ ์‚ฌ์šฉํ•œ ๋ธ”๋ก ๊ฐ„ ์กฐ์ •์˜ ๊ธฐ๋ณธ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์†Œ๊ฐœํ•ฉ๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: 4๊ฐœ์˜ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์ด ์กฐ์ •ํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๊ณต์œ  ์ถœ๋ ฅ ๋ฐฐ์—ด์— ์ €์žฅํ•˜๋Š” ๋ฉ€ํ‹ฐ ๋ธ”๋ก ํžˆ์Šคํ† ๊ทธ๋žจ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต: cluster_arrive() โ†’ ์ฒ˜๋ฆฌ โ†’ cluster_wait()๋ผ๋Š” ํ•„์ˆ˜์ ์ธ ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ฐฐ์›๋‹ˆ๋‹ค. Puzzle 29์˜ barrier()์—์„œ ๋ฐฐ์šด ๋™๊ธฐํ™” ๊ฐœ๋…์„ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.

๋ฌธ์ œ: ๋ฉ€ํ‹ฐ ๋ธ”๋ก ํžˆ์Šคํ† ๊ทธ๋žจ ๊ตฌ๊ฐ„ ๋ถ„๋ฅ˜

Puzzle 27๊ณผ ๊ฐ™์€ ๊ธฐ์กด์˜ ๋‹จ์ผ ๋ธ”๋ก ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ํ•˜๋‚˜์˜ ๋ธ”๋ก์ด ๊ฐ€์ง„ ์Šค๋ ˆ๋“œ ์šฉ๋Ÿ‰(์˜ˆ: 256๊ฐœ ์Šค๋ ˆ๋“œ) ๋‚ด์— ๋“ค์–ด์˜ค๋Š” ๋ฐ์ดํ„ฐ๋งŒ ์ฒ˜๋ฆฌํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์„ ์ดˆ๊ณผํ•˜๋Š” ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์˜ ๊ฒฝ์šฐ, ์—ฌ๋Ÿฌ ๋ธ”๋ก์ด ํ˜‘๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ: 4๊ฐœ ๋ธ”๋ก ๊ฐ๊ฐ์ด ์„œ๋กœ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ ๋ฒ”์œ„๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , ๊ณ ์œ ํ•œ ๋ธ”๋ก ์ˆœ์œ„๋กœ ๊ฐ’์„ ์Šค์ผ€์ผ๋งํ•˜๋ฉฐ, Puzzle 29์˜ ๋™๊ธฐํ™” ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ๋ธ”๋ก๋“ค๊ณผ ์กฐ์ •ํ•จ์œผ๋กœ์จ ๋ชจ๋“  ๋ธ”๋ก์˜ ์ฒ˜๋ฆฌ๊ฐ€ ์™„๋ฃŒ๋œ ํ›„์—์•ผ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ์ฝ์„ ์ˆ˜ ์žˆ๋„๋ก ํ•˜๋Š” ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ตฌํ˜„ํ•˜์„ธ์š”.

๋ฌธ์ œ ๋ช…์„ธ

๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋ฐ์ดํ„ฐ ๋ถ„๋ฐฐ:

  • Block 0: ์š”์†Œ 0-255๋ฅผ ์ฒ˜๋ฆฌ, 1๋ฐฐ ์Šค์ผ€์ผ๋ง
  • Block 1: ์š”์†Œ 256-511์„ ์ฒ˜๋ฆฌ, 2๋ฐฐ ์Šค์ผ€์ผ๋ง
  • Block 2: ์š”์†Œ 512-767์„ ์ฒ˜๋ฆฌ, 3๋ฐฐ ์Šค์ผ€์ผ๋ง
  • Block 3: ์š”์†Œ 768-1023์„ ์ฒ˜๋ฆฌ, 4๋ฐฐ ์Šค์ผ€์ผ๋ง

์กฐ์ • ์š”๊ตฌ์‚ฌํ•ญ:

  1. ๊ฐ ๋ธ”๋ก์€ cluster_arrive()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ ค์•ผ ํ•ฉ๋‹ˆ๋‹ค
  2. ๋ชจ๋“  ๋ธ”๋ก์€ cluster_wait()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค๋ฅธ ๋ธ”๋ก์„ ๊ธฐ๋‹ค๋ ค์•ผ ํ•ฉ๋‹ˆ๋‹ค
  3. ์ตœ์ข… ์ถœ๋ ฅ์€ ๊ฐ ๋ธ”๋ก์˜ ์ฒ˜๋ฆฌ๋œ ํ•ฉ๊ณ„๋ฅผ 4๊ฐœ ์š”์†Œ ๋ฐฐ์—ด๋กœ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค

์„ค์ •

  • ๋ฌธ์ œ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ (1D ๋ฐฐ์—ด)
  • ๋ธ”๋ก ์„ค์ •: TPB = 256 ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (256, 1)
  • ๊ทธ๋ฆฌ๋“œ ์„ค์ •: CLUSTER_SIZE = 4 ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜ (4, 1)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ์ž…๋ ฅ row_major[SIZE](), ์ถœ๋ ฅ row_major[CLUSTER_SIZE]()

์Šค๋ ˆ๋“œ ๋ธ”๋ก ๋ถ„๋ฐฐ:

  • Block 0: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 0-255
  • Block 1: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 256-511
  • Block 2: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 512-767
  • Block 3: ์Šค๋ ˆ๋“œ 0-255 โ†’ ์š”์†Œ 768-1023
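์œ„ ๋ถ„๋ฐฐ๋Š” ์ปค๋„ ์ฝ”๋“œ์˜ global_i = block_dim.x * block_idx.x + thread_idx.x ๊ณ„์‚ฐ์—์„œ ๋‚˜์˜ต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” ์ด ์ธ๋ฑ์Šค ๋งคํ•‘๋งŒ ํ™•์ธํ•˜๋Š” ๊ฐ„๋‹จํ•œ Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (Mojo/GPU ์ฝ”๋“œ๊ฐ€ ์•„๋‹ˆ๋ผ ๊ฐœ๋… ํ™•์ธ์šฉ์ž…๋‹ˆ๋‹ค):

```python
TPB = 256           # ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜
CLUSTER_SIZE = 4    # ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜

# ๊ฐ ๋ธ”๋ก์˜ ์Šค๋ ˆ๋“œ t๋Š” ์ „์—ญ ์ธ๋ฑ์Šค block_id * TPB + t๋ฅผ ๋‹ด๋‹นํ•œ๋‹ค
for block_id in range(CLUSTER_SIZE):
    first = block_id * TPB            # ์Šค๋ ˆ๋“œ 0์˜ ์ „์—ญ ์ธ๋ฑ์Šค
    last = block_id * TPB + TPB - 1   # ์Šค๋ ˆ๋“œ 255์˜ ์ „์—ญ ์ธ๋ฑ์Šค
    print(f"Block {block_id}: ์š”์†Œ {first}-{last}")
```

์‹คํ–‰ํ•˜๋ฉด ์œ„ ํ‘œ์™€ ๋™์ผํ•œ ๋„ค ๊ตฌ๊ฐ„(0-255, 256-511, 512-767, 768-1023)์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค.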

์™„์„ฑํ•  ์ฝ”๋“œ

def cluster_coordination_basics[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Real cluster coordination using SM90+ cluster APIs."""
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Check what's happening with cluster ranks
    var my_block_rank = Int(block_rank_in_cluster())
    var block_id = block_idx.x

    var shared_data = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[tpb]())

    # FIX: Use block_idx.x for data distribution instead of cluster rank
    # Each block should process different portions of the data
    var data_scale = Scalar[dtype](
        block_id + 1
    )  # Use block_idx instead of cluster rank

    # Phase 1: Each block processes its portion
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Phase 2: Use cluster_arrive() for inter-block coordination
    # Signal this block has completed processing

    # FILL IN 1 line here

    # Block-level aggregation (only thread 0)
    if local_i == 0:
        # FILL IN 4 lines here
        ...

    # Wait for all blocks in cluster to complete

    # FILL IN 1 line here


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p34/p34.mojo

ํŒ

๋ธ”๋ก ์‹๋ณ„ ํŒจํ„ด

  • block_rank_in_cluster()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํด๋Ÿฌ์Šคํ„ฐ ์ˆœ์œ„(0-3)๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค
  • ๊ทธ๋ฆฌ๋“œ ์‹คํ–‰์—์„œ ์•ˆ์ •์ ์ธ ๋ธ”๋ก ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด Int(block_idx.x)๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ๋ธ”๋ก ์œ„์น˜์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ๋ฅผ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๊ณ ์œ ํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •

  • stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())์œผ๋กœ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค (Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ๊ธฐ์ดˆ ์ฐธ๊ณ )
  • block_id + 1๋กœ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๋ธ”๋ก๋งˆ๋‹ค ๊ณ ์œ ํ•œ ์Šค์ผ€์ผ๋ง์„ ์ ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ ๋ฐ์ดํ„ฐ ์ ‘๊ทผ ์‹œ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (Puzzle 3์˜ ๊ฐ€๋“œ ํŒจํ„ด)

ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ํŒจํ„ด

  1. ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ ์˜์—ญ์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  2. ์‹ ํ˜ธ: cluster_arrive()๋กœ ์ฒ˜๋ฆฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  3. ์—ฐ์‚ฐ: ๋ธ”๋ก ๋‚ด๋ถ€ ์—ฐ์‚ฐ (๋ฆฌ๋•์…˜, ์ง‘๊ณ„)
  4. ๋Œ€๊ธฐ: cluster_wait()๋กœ ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค

๋ธ”๋ก ๋‚ด๋ถ€ ์Šค๋ ˆ๋“œ ์กฐ์ •

  • ํด๋Ÿฌ์Šคํ„ฐ ์—ฐ์‚ฐ ์ „์— ๋ธ”๋ก ๋‚ด๋ถ€ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด barrier()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (Puzzle 29์˜ ๋ฐฐ๋ฆฌ์–ด ๊ฐœ๋…)
  • ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ ๊ธฐ๋กํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด)
  • ์•ˆ์ •์ ์ธ ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด ๊ฒฐ๊ณผ๋ฅผ output[block_id]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

์ฝ”๋“œ ์‹คํ–‰

pixi run p34 --coordination
uv run poe p34 --coordination

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Multi-Block Coordination
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Block coordination results:
  Block 0 : 127.5
  Block 1 : 255.0
  Block 2 : 382.5
  Block 3 : 510.0
โœ… Multi-block coordination tests passed!

์„ฑ๊ณต ๊ธฐ์ค€:

  • 4๊ฐœ ๋ธ”๋ก ๋ชจ๋‘ 0์ด ์•„๋‹Œ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฐ๊ณผ๊ฐ€ ์Šค์ผ€์ผ๋ง ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: Block 1 > Block 0, Block 2 > Block 1 ๋“ฑ
  • ๊ฒฝ์Ÿ ์ƒํƒœ๋‚˜ ์กฐ์ • ์‹คํŒจ๊ฐ€ ์—†์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

์†”๋ฃจ์…˜

def cluster_coordination_basics[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
    size: Int,
):
    """Real cluster coordination using SM90+ cluster APIs."""
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # Check what's happening with cluster ranks
    var my_block_rank = Int(block_rank_in_cluster())
    var block_id = block_idx.x

    var shared_data = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[tpb]())

    # FIX: Use block_idx.x for data distribution instead of cluster rank
    # Each block should process different portions of the data
    var data_scale = Scalar[dtype](
        block_id + 1
    )  # Use block_idx instead of cluster rank

    # Phase 1: Each block processes its portion
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Phase 2: Use cluster_arrive() for inter-block coordination
    cluster_arrive()  # Signal this block has completed processing

    # Block-level aggregation (only thread 0)
    if local_i == 0:
        var block_sum: Float32 = 0.0
        for i in range(tpb):
            block_sum += shared_data[i][0]
        # FIX: Store result at block_idx position (guaranteed unique per block)
        output[block_id] = block_sum

    # Wait for all blocks in cluster to complete
    cluster_wait()


ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ํ’€์ด๋Š” ์‹ ์ค‘ํ•˜๊ฒŒ ์„ค๊ณ„๋œ 2๋‹จ๊ณ„ ์ ‘๊ทผ ๋ฐฉ์‹์„ ํ†ตํ•ด ๊ธฐ๋ณธ์ ์ธ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ๋™๊ธฐํ™” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

1๋‹จ๊ณ„: ๋…๋ฆฝ์  ๋ธ”๋ก ์ฒ˜๋ฆฌ

์Šค๋ ˆ๋“œ ๋ฐ ๋ธ”๋ก ์‹๋ณ„:

global_i = block_dim.x * block_idx.x + thread_idx.x  # Global thread index
local_i = thread_idx.x                               # Local thread index within block
my_block_rank = Int(block_rank_in_cluster())         # Cluster rank (0-3)
block_id = Int(block_idx.x)                          # Block index for reliable addressing

๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น ๋ฐ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ:

  • ๊ฐ ๋ธ”๋ก์ด ์ž์ฒด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ž‘์—… ๊ณต๊ฐ„์„ ํ• ๋‹นํ•ฉ๋‹ˆ๋‹ค: stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[tpb]())
  • ์Šค์ผ€์ผ๋ง ์ „๋žต: data_scale = Float32(block_id + 1)๋กœ ๊ฐ ๋ธ”๋ก์ด ๋‹ค๋ฅด๊ฒŒ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค
    • Block 0: 1.0๋ฐฐ, Block 1: 2.0๋ฐฐ, Block 2: 3.0๋ฐฐ, Block 3: 4.0๋ฐฐ
  • ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ: if global_i < size:๋กœ ๋ฒ”์œ„ ๋ฐ– ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค
  • ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ: shared_data[local_i] = input[global_i] * data_scale๋กœ ๋ธ”๋ก๋ณ„ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ ์Šค์ผ€์ผ๋งํ•ฉ๋‹ˆ๋‹ค

๋ธ”๋ก ๋‚ด๋ถ€ ๋™๊ธฐํ™”:

  • barrier()๋Š” ๊ฐ ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ์„ ์™„๋ฃŒํ•œ ํ›„์—์•ผ ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ์ง„ํ–‰ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ฐ์ดํ„ฐ ๋กœ๋”ฉ๊ณผ ์ดํ›„์˜ ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ์‚ฌ์ด์˜ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค

2๋‹จ๊ณ„: ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •

๋ธ”๋ก ๊ฐ„ ์‹ ํ˜ธ:

  • cluster_arrive()๋Š” ์ด ๋ธ”๋ก์ด ๋กœ์ปฌ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ–ˆ์Œ์„ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ํ•˜๋“œ์›จ์–ด์— ์™„๋ฃŒ๋ฅผ ๋“ฑ๋กํ•˜๋Š” ๋…ผ๋ธ”๋กœํ‚น ์—ฐ์‚ฐ์ž…๋‹ˆ๋‹ค

๋กœ์ปฌ ์ง‘๊ณ„ (์Šค๋ ˆ๋“œ 0๋งŒ):

if local_i == 0:
    var block_sum: Float32 = 0.0
    for i in range(tpb):
        block_sum += shared_data[i][0]  # Sum all elements in shared memory
    output[block_id] = block_sum        # Store result at unique block position

  • ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ์Šค๋ ˆ๋“œ 0๋งŒ ํ•ฉ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • output[block_id]์— ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•˜์—ฌ ๊ฐ ๋ธ”๋ก์ด ๊ณ ์œ ํ•œ ์œ„์น˜์— ๊ธฐ๋กํ•˜๋„๋ก ํ•ฉ๋‹ˆ๋‹ค

์ตœ์ข… ๋™๊ธฐํ™”:

  • cluster_wait()๋Š” ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๋ชจ๋“  ๋ธ”๋ก์ด ์ž‘์—…์„ ์™„๋ฃŒํ•  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค
  • ์ด๋ฅผ ํ†ตํ•ด ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ์— ๊ฑธ์ณ ๊ฒฐ์ •๋ก ์  ์™„๋ฃŒ ์ˆœ์„œ๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ๊ธฐ์ˆ  ์ธ์‚ฌ์ดํŠธ

์™œ my_block_rank ๋Œ€์‹  block_id๋ฅผ ์‚ฌ์šฉํ• ๊นŒ?

  • block_idx.x๋Š” ์•ˆ์ •์ ์ธ ๊ทธ๋ฆฌ๋“œ ์‹คํ–‰ ์ธ๋ฑ์‹ฑ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค (0, 1, 2, 3)
  • block_rank_in_cluster()๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์„ค์ •์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ๋™์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • block_id๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ ๋ธ”๋ก์ด ๊ณ ์œ ํ•œ ๋ฐ์ดํ„ฐ ์˜์—ญ๊ณผ ์ถœ๋ ฅ ์œ„์น˜๋ฅผ ํ™•๋ณดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

  • ์ „์—ญ ๋ฉ”๋ชจ๋ฆฌ: ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ input[global_i]๋ฅผ ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ์ฝ์Šต๋‹ˆ๋‹ค
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก ๋‚ด๋ถ€ ํ†ต์‹ ๊ณผ ์ง‘๊ณ„์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค
  • ์ถœ๋ ฅ ๋ฉ”๋ชจ๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด output[block_id]์— ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค

๋™๊ธฐํ™” ๊ณ„์ธต ๊ตฌ์กฐ:

  1. barrier(): ๊ฐ ๋ธ”๋ก ๋‚ด ์Šค๋ ˆ๋“œ๋ฅผ ๋™๊ธฐํ™”ํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ๋‚ด๋ถ€)
  2. cluster_arrive(): ๋‹ค๋ฅธ ๋ธ”๋ก์— ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค (๋ธ”๋ก ๊ฐ„, ๋…ผ๋ธ”๋กœํ‚น)
  3. cluster_wait(): ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ๊ฐ„, ๋ธ”๋กœํ‚น)
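cluster_arrive()(๋…ผ๋ธ”๋กœํ‚น)์™€ cluster_wait()(๋ธ”๋กœํ‚น)์˜ ๊ด€๊ณ„๋Š” CPU ์Šค๋ ˆ๋“œ๋กœ ๋น„์œ ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์•„๋ž˜๋Š” GPU ๋™์ž‘์„ ๊ทธ๋Œ€๋กœ ์žฌํ˜„ํ•œ ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, arrive/wait ๋ถ„๋ฆฌ ํŒจํ„ด์˜ ์˜๋ฏธ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ๊ฐœ๋…์  Python ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค:

```python
import threading

CLUSTER_SIZE = 4
cond = threading.Condition()
arrived = 0
results = [0] * CLUSTER_SIZE

def cluster_arrive():
    """๋…ผ๋ธ”๋กœํ‚น: ๋„์ฐฉ ์‚ฌ์‹ค๋งŒ ๋“ฑ๋กํ•˜๊ณ  ๋ฐ”๋กœ ๋ฐ˜ํ™˜ํ•œ๋‹ค."""
    global arrived
    with cond:
        arrived += 1
        cond.notify_all()

def cluster_wait():
    """๋ธ”๋กœํ‚น: ๋ชจ๋“  '๋ธ”๋ก'์ด ๋„์ฐฉํ•  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•œ๋‹ค."""
    with cond:
        while arrived < CLUSTER_SIZE:
            cond.wait()

def block(rank):
    results[rank] = rank + 1      # ๋กœ์ปฌ ์ฒ˜๋ฆฌ (๋ธ”๋ก๋ณ„ ๊ณ ์œ  ์ž‘์—…)
    cluster_arrive()              # ์™„๋ฃŒ ์‹ ํ˜ธ (๋…ผ๋ธ”๋กœํ‚น)
    cluster_wait()                # ๋ชจ๋“  ๋ธ”๋ก ์™„๋ฃŒ๊นŒ์ง€ ๋Œ€๊ธฐ
    # wait ์ดํ›„์—๋Š” ๋ชจ๋“  ๋ธ”๋ก์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋ณด์ž„์ด ๋ณด์žฅ๋œ๋‹ค
    assert sum(results) == 1 + 2 + 3 + 4

threads = [threading.Thread(target=block, args=(r,)) for r in range(CLUSTER_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # [1, 2, 3, 4]
```

arrive์™€ wait๋ฅผ ๋ถ„๋ฆฌํ•˜๋ฉด ์‹ ํ˜ธ์™€ ๋Œ€๊ธฐ ์‚ฌ์ด์— ๋‹ค๋ฅธ ๋ธ”๋ก ๊ฒฐ๊ณผ์— ์˜์กดํ•˜์ง€ ์•Š๋Š” ์ž‘์—…์„ ๋ผ์›Œ ๋„ฃ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์ด ํ•ต์‹ฌ์ž…๋‹ˆ๋‹ค.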

์„ฑ๋Šฅ ํŠน์„ฑ:

  • ์—ฐ์‚ฐ ๋ณต์žก๋„: ๋ธ”๋ก๋‹น ๋กœ์ปฌ ํ•ฉ์‚ฐ์— O(TPB), ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์— O(1)
  • ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ: ๊ฐ ์ž…๋ ฅ ์š”์†Œ๋ฅผ ํ•œ ๋ฒˆ๋งŒ ์ฝ์œผ๋ฉฐ, ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์€ ์ตœ์†Œํ™”
  • ํ™•์žฅ์„ฑ: ํŒจํ„ด์ด ๋” ํฐ ํด๋Ÿฌ์Šคํ„ฐ ํฌ๊ธฐ์—๋„ ์ตœ์†Œํ•œ์˜ ์˜ค๋ฒ„ํ—ค๋“œ๋กœ ํ™•์žฅ ๊ฐ€๋Šฅ

ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์˜ ํ•ต์‹ฌ ํŒจํ„ด์€ ๋‹จ์ˆœํ•˜์ง€๋งŒ ๊ฐ•๋ ฅํ•œ ๊ตฌ์กฐ๋ฅผ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค:

  1. 1๋‹จ๊ณ„: ๊ฐ ๋ธ”๋ก์ด ํ• ๋‹น๋œ ๋ฐ์ดํ„ฐ ์˜์—ญ์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  2. ์‹ ํ˜ธ: cluster_arrive()๋กœ ์ฒ˜๋ฆฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  3. 2๋‹จ๊ณ„: ๋‹ค๋ฅธ ๋ธ”๋ก์˜ ๊ฒฐ๊ณผ์— ์˜์กดํ•˜๋Š” ์—ฐ์‚ฐ์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  4. ๋™๊ธฐํ™”: cluster_wait()๋กœ ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋œ ํ›„ ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค

๋‹ค์Œ ๋‹จ๊ณ„: ๋” ๊ณ ๊ธ‰ ์กฐ์ •์„ ๋ฐฐ์šธ ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ ์œผ๋กœ ์ด๋™ํ•˜์—ฌ Puzzle 27์˜ block.sum() ํŒจํ„ด์„ ํด๋Ÿฌ์Šคํ„ฐ ๊ทœ๋ชจ๋กœ ํ™•์žฅํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›Œ๋ณด์„ธ์š”. Puzzle 24์˜ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ฆฌ๋•์…˜์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค!

โ˜ธ๏ธ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ

๊ฐœ์š”

์ด์ „ ์„น์…˜์˜ ๊ธฐ๋ณธ ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ๋ฐ”ํƒ•์œผ๋กœ, ์ด ๋„์ „์—์„œ๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ์„ ๊ตฌํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๋ฐฐ์›๋‹ˆ๋‹ค - Puzzle 27์—์„œ ์ตํžŒ block.sum ํŒจํ„ด์„ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์— ๊ฑธ์ณ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: 4๊ฐœ์˜ ์กฐ์ •๋œ ๋ธ”๋ก์— ๊ฑธ์ณ 1024๊ฐœ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•˜๊ณ , ๊ฐ ๋ธ”๋ก์˜ ๊ฐœ๋ณ„ ๋ฆฌ๋•์…˜์„ ํ•˜๋‚˜์˜ ์ „์—ญ ๊ฒฐ๊ณผ๋กœ ํ•ฉ์น˜๋Š” ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต: ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ์œ„ํ•œ cluster_sync()์™€ ํšจ์œจ์ ์ธ ์ตœ์ข… ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ elect_one_sync()๋ฅผ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฌธ์ œ: ๋Œ€๊ทœ๋ชจ ์ „์—ญ ํ•ฉ์‚ฐ

๋‹จ์ผ ๋ธ”๋ก์€ (Puzzle 27์—์„œ ๋ฐฐ์› ๋“ฏ์ด) ์Šค๋ ˆ๋“œ ์ˆ˜์™€ Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์šฉ๋Ÿ‰์— ์˜ํ•ด ์ œํ•œ๋ฉ๋‹ˆ๋‹ค. ๋‹จ์ผ ๋ธ”๋ก ๋ฆฌ๋•์…˜์„ ๋„˜์–ด์„œ๋Š” ๋Œ€๊ทœ๋ชจ ๋ฐ์ดํ„ฐ์…‹์˜ ์ „์—ญ ํ†ต๊ณ„(ํ‰๊ท , ๋ถ„์‚ฐ, ํ•ฉ๊ณ„)๋ฅผ ๊ตฌํ•˜๋ ค๋ฉด ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ์ง‘ํ•ฉ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

๊ณผ์ œ: ๋‹ค์Œ๊ณผ ๊ฐ™์€ ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด ํ•ฉ์‚ฐ ๋ฆฌ๋•์…˜์„ ๊ตฌํ˜„ํ•˜์„ธ์š”:

  1. ๊ฐ ๋ธ”๋ก์ด ๋กœ์ปฌ ๋ฆฌ๋•์…˜์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค (Puzzle 27์˜ block.sum()๊ณผ ์œ ์‚ฌ)
  2. Puzzle 29์˜ ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ธ”๋ก๋“ค์ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นฉ๋‹ˆ๋‹ค
  3. ์„ ์ถœ๋œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ์„ ์ถœ ํŒจํ„ด์„ ์‚ฌ์šฉํ•˜์—ฌ ์ตœ์ข… ์ „์—ญ ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

๋ฌธ์ œ ๋ช…์„ธ

์•Œ๊ณ ๋ฆฌ์ฆ˜ ํ๋ฆ„:

1๋‹จ๊ณ„ - ๋กœ์ปฌ ๋ฆฌ๋•์…˜ (๊ฐ ๋ธ”๋ก ๋‚ด๋ถ€): \[R_i = \sum_{j=0}^{TPB-1} input[i \times TPB + j] \quad \text{for block } i\]

2๋‹จ๊ณ„ - ์ „์—ญ ์ง‘๊ณ„ (ํด๋Ÿฌ์Šคํ„ฐ ์ „์ฒด): \[\text{Global Sum} = \sum_{i=0}^{\text{CLUSTER_SIZE}-1} R_i\]

์กฐ์ • ์š”๊ตฌ์‚ฌํ•ญ:

  1. ๋กœ์ปฌ ๋ฆฌ๋•์…˜: ๊ฐ ๋ธ”๋ก์ด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์œผ๋กœ ๋ถ€๋ถ„ ํ•ฉ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค
  2. ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”: cluster_sync()๋กœ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๊ฐ€ ์ค€๋น„๋˜์—ˆ๋Š”์ง€ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  3. ์ตœ์ข… ์ง‘๊ณ„: ์„ ์ถœ๋œ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๊ฐ€ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นฉ๋‹ˆ๋‹ค

์„ค์ •

  • ๋ฌธ์ œ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ
  • ๋ธ”๋ก ์„ค์ •: TPB = 256 ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (256, 1)
  • ๊ทธ๋ฆฌ๋“œ ์„ค์ •: CLUSTER_SIZE = 4 ํด๋Ÿฌ์Šคํ„ฐ๋‹น ๋ธ”๋ก ์ˆ˜ (4, 1)
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ์ž…๋ ฅ row_major[SIZE](), ์ถœ๋ ฅ row_major[1]()
  • ์ž„์‹œ ์ €์žฅ์†Œ: ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•œ row_major[CLUSTER_SIZE]()

์˜ˆ์ƒ ๊ฒฐ๊ณผ: ์ˆ˜์—ด 0, 0.01, 0.02, ..., 10.23์˜ ํ•ฉ = 523,776

์™„์„ฑํ•  ์ฝ”๋“œ

def cluster_collective_operations[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, OutLayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    temp_storage: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
    size: Int,
):
    """Cluster-wide collective operations using real cluster APIs."""
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # FILL IN (roughly 24 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p34/p34.mojo

ํŒ

๋กœ์ปฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด

ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ์ „๋žต

  • ์•ˆ์ •์ ์ธ ์ธ๋ฑ์‹ฑ์„ ์œ„ํ•ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ temp_storage[block_id]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”๋ฅผ ์œ„ํ•ด cluster_sync()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (arrive/wait๋ณด๋‹ค ๊ฐ•๋ ฅ)
  • ์ตœ์ข… ์ „์—ญ ์ง‘๊ณ„๋Š” ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ์ˆ˜ํ–‰ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค

ํšจ์œจ์ ์ธ ์„ ์ถœ ํŒจํ„ด

  • ์ฒซ ๋ฒˆ์งธ ๋ธ”๋ก(my_block_rank == 0) ๋‚ด์—์„œ elect_one_sync()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ํŒจํ„ด)
  • ์ค‘๋ณต ์—ฐ์‚ฐ์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ์ตœ์ข… ํ•ฉ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๊ฐ€ temp_storage์—์„œ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์ฝ์Šต๋‹ˆ๋‹ค (Puzzle 8์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ๊ณผ ์œ ์‚ฌ)

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด

  • ๊ฐ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ์™€ ํ•จ๊ป˜ input[global_i]๋ฅผ ์ฝ์Šต๋‹ˆ๋‹ค (Puzzle 3์˜ ๊ฐ€๋“œ)
  • ๋ธ”๋ก ๋‚ด๋ถ€ ๋ฆฌ๋•์…˜์„ ์œ„ํ•ด ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ค‘๊ฐ„ ๊ฒฐ๊ณผ๋ฅผ ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋ธ”๋ก ๊ฐ„ ํ†ต์‹ ์„ ์œ„ํ•ด ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ temp_storage[block_id]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค
  • ์ตœ์ข… ๊ฒฐ๊ณผ๋Š” output[0]์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค (๋ธ”๋ก ์กฐ์ •์˜ ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด)

ํด๋Ÿฌ์Šคํ„ฐ API ์ฐธ์กฐ

gpu.primitives.cluster ๋ชจ๋“ˆ:

  • cluster_sync(): ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” - arrive/wait ํŒจํ„ด๋ณด๋‹ค ๊ฐ•๋ ฅ
  • elect_one_sync(): ํšจ์œจ์ ์ธ ์กฐ์ •์„ ์œ„ํ•ด ์›Œํ”„ ๋‚ด์—์„œ ๋‹จ์ผ ์Šค๋ ˆ๋“œ๋ฅผ ์„ ์ถœ
  • block_rank_in_cluster(): ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๊ณ ์œ ํ•œ ๋ธ”๋ก ์‹๋ณ„์ž๋ฅผ ๋ฐ˜ํ™˜

ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด

Puzzle 27์˜ ์ „ํ†ต์ ์ธ ๋‚ด์ ์—์„œ ๋ฐฐ์šด ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๋– ์˜ฌ๋ ค ๋ณด์„ธ์š”:

Stride 128: [T0] += [T128], [T1] += [T129], [T2] += [T130], ...
Stride 64:  [T0] += [T64],  [T1] += [T65],  [T2] += [T66],  ...
Stride 32:  [T0] += [T32],  [T1] += [T33],  [T2] += [T34],  ...
Stride 16:  [T0] += [T16],  [T1] += [T17],  [T2] += [T18],  ...
...
Stride 1:   [T0] += [T1] โ†’ Final result at T0

์ด์ œ ์ด ํŒจํ„ด์„ ํด๋Ÿฌ์Šคํ„ฐ ๊ทœ๋ชจ๋กœ ํ™•์žฅํ•ฉ๋‹ˆ๋‹ค - ๊ฐ ๋ธ”๋ก์ด ํ•˜๋‚˜์˜ ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•œ ๋’ค, ๋ธ”๋ก ๊ฐ„์— ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค.

์ฝ”๋“œ ์‹คํ–‰

pixi run p34 --reduction
uv run poe p34 --reduction

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Cluster-Wide Reduction
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Expected sum: 523776.0
Cluster reduction result: 523776.0
Expected: 523776.0
Error: 0.0
โœ… Passed: Cluster reduction accuracy test
โœ… Cluster-wide collective operations tests passed!

์„ฑ๊ณต ๊ธฐ์ค€:

  • ์™„๋ฒฝํ•œ ์ •ํ™•๋„: ๊ฒฐ๊ณผ๊ฐ€ ์˜ˆ์ƒ ํ•ฉ๊ณ„(523,776)์™€ ์ •ํ™•ํžˆ ์ผ์น˜ํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •: 4๊ฐœ ๋ธ”๋ก ๋ชจ๋‘๊ฐ€ ๋ถ€๋ถ„ ํ•ฉ์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค
  • ํšจ์œจ์ ์ธ ์ตœ์ข… ๋ฆฌ๋•์…˜: ์„ ์ถœ๋œ ๋‹จ์ผ ์Šค๋ ˆ๋“œ๊ฐ€ ์ตœ์ข… ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค

์†”๋ฃจ์…˜

def cluster_collective_operations[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, OutLayout, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayout, MutAnyOrigin],
    temp_storage: TileTensor[mut=True, dtype, ClusterLayout, MutAnyOrigin],
    size: Int,
):
    """Cluster-wide collective operations using real cluster APIs."""
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var my_block_rank = Int(block_rank_in_cluster())
    var block_id = block_idx.x

    # Each thread accumulates its data
    var my_value: Float32 = 0.0
    if global_i < size:
        my_value = input[global_i][0]

    # Block-level reduction using shared memory
    var shared_mem = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[tpb]())
    shared_mem[local_i] = my_value
    barrier()

    # Tree reduction within block
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride and local_i + stride < tpb:
            shared_mem[local_i] += shared_mem[local_i + stride]
        barrier()
        stride = stride // 2

    # FIX: Store block result using block_idx for reliable indexing
    if local_i == 0:
        temp_storage[block_id] = shared_mem[0]

    # Use cluster_sync() for full cluster synchronization
    cluster_sync()

    # Final cluster reduction (elect one thread to do the final work)
    if elect_one_sync() and my_block_rank == 0:
        var total: Float32 = 0.0
        for i in range(CLUSTER_SIZE):
            total += temp_storage[i][0]
        output[0] = total


ํด๋Ÿฌ์Šคํ„ฐ ์ง‘ํ•ฉ ์—ฐ์‚ฐ ํ’€์ด๋Š” ๋ถ„์‚ฐ ์ปดํ“จํŒ…์˜ ๊ณ ์ „์ ์ธ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: ๋กœ์ปฌ ๋ฆฌ๋•์…˜ โ†’ ์ „์—ญ ์กฐ์ • โ†’ ์ตœ์ข… ์ง‘๊ณ„:

1๋‹จ๊ณ„: ๋กœ์ปฌ ๋ธ”๋ก ๋ฆฌ๋•์…˜ (์ „ํ†ต์  ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜)

๋ฐ์ดํ„ฐ ๋กœ๋”ฉ ๋ฐ ์ดˆ๊ธฐํ™”:

var my_value: Float32 = 0.0
if global_i < size:
    my_value = input[global_i][0]  # Load with bounds checking
shared_mem[local_i] = my_value     # Store in shared memory
barrier()                          # Ensure all threads complete loading

ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜:

var stride = tpb // 2  # Start with half the threads (128)
while stride > 0:
    if local_i < stride and local_i + stride < tpb:
        shared_mem[local_i] += shared_mem[local_i + stride]
    barrier()          # Synchronize after each reduction step
    stride = stride // 2

ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ์‹œ๊ฐํ™” (TPB=256):

Step 1: stride=128  [T0]+=T128, [T1]+=T129, ..., [T127]+=T255
Step 2: stride=64   [T0]+=T64,  [T1]+=T65,  ..., [T63]+=T127
Step 3: stride=32   [T0]+=T32,  [T1]+=T33,  ..., [T31]+=T63
Step 4: stride=16   [T0]+=T16,  [T1]+=T17,  ..., [T15]+=T31
Step 5: stride=8    [T0]+=T8,   [T1]+=T9,   ..., [T7]+=T15
Step 6: stride=4    [T0]+=T4,   [T1]+=T5,   [T2]+=T6,  [T3]+=T7
Step 7: stride=2    [T0]+=T2,   [T1]+=T3
Step 8: stride=1    [T0]+=T1    โ†’ Final result at shared_mem[0]
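์œ„ stride ๋ฃจํ”„๋ฅผ Python์œผ๋กœ ๊ทธ๋Œ€๋กœ ์˜ฎ๊ฒจ, ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ๊ฒฐ๊ณผ๊ฐ€ ์ง์ ‘ ํ•ฉ์‚ฐ๊ณผ ๊ฐ™๊ณ  ์ •ํ™•ํžˆ 8๋‹จ๊ณ„(logโ‚‚ 256)๋งŒ์— ๋๋‚จ์„ ํ™•์ธํ•˜๋Š” ์Šค์ผ€์น˜์ž…๋‹ˆ๋‹ค (GPU ๋ณ‘๋ ฌ ์‹คํ–‰์„ ์ˆœ์ฐจ ๋ฃจํ”„๋กœ ํ‰๋‚ด ๋‚ธ ๊ฒƒ์ž…๋‹ˆ๋‹ค):

```python
TPB = 256
shared_mem = [float(i) for i in range(TPB)]  # ํ•œ ๋ธ”๋ก์˜ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์ƒํƒœ
expected = sum(shared_mem)

stride = TPB // 2
steps = 0
while stride > 0:
    # GPU์—์„œ๋Š” local_i < stride์ธ ์Šค๋ ˆ๋“œ๋“ค์ด ์ด ๋ง์…ˆ์„ ๋ณ‘๋ ฌ๋กœ ์ˆ˜ํ–‰
    for local_i in range(stride):
        shared_mem[local_i] += shared_mem[local_i + stride]
    # ์‹ค์ œ ์ปค๋„์—์„œ๋Š” ์—ฌ๊ธฐ์„œ barrier() ํ˜ธ์ถœ
    stride //= 2
    steps += 1

print(shared_mem[0], steps)  # 32640.0 8
```

๊ฐ ๋‹จ๊ณ„๋งˆ๋‹ค ์ฐธ์—ฌ ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ์ ˆ๋ฐ˜์œผ๋กœ ์ค„์–ด, ๋ณต์žก๋„ ๋ถ„์„์—์„œ ๋งํ•˜๋Š” O(logโ‚‚ TPB) ๋‹จ๊ณ„๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.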

๋ถ€๋ถ„ ๊ฒฐ๊ณผ ์ €์žฅ:

  • ์Šค๋ ˆ๋“œ 0๋งŒ ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค: temp_storage[block_id] = shared_mem[0]
  • ๊ฐ ๋ธ”๋ก์ด ์ž์‹ ์˜ ํ•ฉ๊ณ„๋ฅผ temp_storage[0], temp_storage[1], temp_storage[2], temp_storage[3]์— ์ €์žฅํ•ฉ๋‹ˆ๋‹ค

2๋‹จ๊ณ„: ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”

์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ ๋ฐฐ๋ฆฌ์–ด:

  • cluster_sync()๋Š” cluster_arrive()/cluster_wait()๋ณด๋‹ค ๋” ๊ฐ•๋ ฅํ•œ ๋ณด์žฅ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค
  • ์–ด๋–ค ๋ธ”๋ก์ด๋“  ๋‹ค์Œ์œผ๋กœ ์ง„ํ–‰ํ•˜๊ธฐ ์ „์— ๋ชจ๋“  ๋ธ”๋ก์ด ๋กœ์ปฌ ๋ฆฌ๋•์…˜์„ ์™„๋ฃŒํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๋ชจ๋“  ๋ธ”๋ก์— ๊ฑธ์นœ ํ•˜๋“œ์›จ์–ด ๊ฐ€์† ๋™๊ธฐํ™”์ž…๋‹ˆ๋‹ค

3๋‹จ๊ณ„: ์ตœ์ข… ์ „์—ญ ์ง‘๊ณ„

ํšจ์œจ์ ์ธ ์Šค๋ ˆ๋“œ ์„ ์ถœ:

if elect_one_sync() and my_block_rank == 0:
    var total: Float32 = 0.0
    for i in range(CLUSTER_SIZE):
        total += temp_storage[i][0]  # Sum: temp[0] + temp[1] + temp[2] + temp[3]
    output[0] = total

์™œ ์ด ์„ ์ถœ ์ „๋žต์„ ์‚ฌ์šฉํ• ๊นŒ?

  • elect_one_sync(): ์›Œํ”„๋‹น ์ •ํ™•ํžˆ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์„ ํƒํ•˜๋Š” ํ•˜๋“œ์›จ์–ด ๊ธฐ๋ณธ ์š”์†Œ์ž…๋‹ˆ๋‹ค
  • my_block_rank == 0: ๋‹จ์ผ ์“ฐ๊ธฐ๋ฅผ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด ์ฒซ ๋ฒˆ์งธ ๋ธ”๋ก์—์„œ๋งŒ ์„ ์ถœํ•ฉ๋‹ˆ๋‹ค
  • ๊ฒฐ๊ณผ: ์ „์ฒด ํด๋Ÿฌ์Šคํ„ฐ์—์„œ ๋‹จ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋งŒ ์ตœ์ข… ํ•ฉ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ํšจ์œจ์„ฑ: 1024๊ฐœ ์ „์ฒด ์Šค๋ ˆ๋“œ์— ๊ฑธ์นœ ์ค‘๋ณต ์—ฐ์‚ฐ์„ ํ”ผํ•ฉ๋‹ˆ๋‹ค

ํ•ต์‹ฌ ๊ธฐ์ˆ  ์ธ์‚ฌ์ดํŠธ

3๋‹จ๊ณ„ ๋ฆฌ๋•์…˜ ๊ณ„์ธต ๊ตฌ์กฐ:

  1. ์Šค๋ ˆ๋“œ โ†’ ์›Œํ”„: ๊ฐœ๋ณ„ ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ๋ ˆ๋ฒจ ๋ถ€๋ถ„ ํ•ฉ์— ๊ธฐ์—ฌํ•ฉ๋‹ˆ๋‹ค
  2. ์›Œํ”„ โ†’ ๋ธ”๋ก: ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์ด ์›Œํ”„๋“ค์„ ํ•˜๋‚˜์˜ ๋ธ”๋ก ๊ฒฐ๊ณผ๋กœ ํ•ฉ์นฉ๋‹ˆ๋‹ค (256 โ†’ 1)
  3. ๋ธ”๋ก โ†’ ํด๋Ÿฌ์Šคํ„ฐ: ๋‹จ์ˆœ ๋ฃจํ”„๊ฐ€ ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ ์ตœ์ข… ํ•ฉ๊ณ„๋กœ ํ•ฉ์นฉ๋‹ˆ๋‹ค (4 โ†’ 1)

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

  • ์ž…๋ ฅ: ๊ฐ ์š”์†Œ๋ฅผ ์ •ํ™•ํžˆ ํ•œ ๋ฒˆ ์ฝ์Šต๋‹ˆ๋‹ค (input[global_i])
  • ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ: ๋ธ”๋ก ๋‚ด๋ถ€ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜์„ ์œ„ํ•œ ๊ณ ์† ์ž‘์—… ๊ณต๊ฐ„
  • ์ž„์‹œ ์ €์žฅ์†Œ: ์ €๋น„์šฉ ๋ธ”๋ก ๊ฐ„ ํ†ต์‹  (4๊ฐœ ๊ฐ’๋งŒ)
  • ์ถœ๋ ฅ: ๋‹จ์ผ ์ „์—ญ ๊ฒฐ๊ณผ๋ฅผ ํ•œ ๋ฒˆ ๊ธฐ๋ก

๋™๊ธฐํ™” ๋ณด์žฅ:

  • barrier(): ๋ธ”๋ก ๋‚ด ๋ชจ๋“  ์Šค๋ ˆ๋“œ๊ฐ€ ๊ฐ ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜ ๋‹จ๊ณ„๋ฅผ ์™„๋ฃŒํ•˜๋„๋ก ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • cluster_sync(): ์ „์—ญ ๋ฐฐ๋ฆฌ์–ด - ๋ชจ๋“  ๋ธ”๋ก์ด ๋™์ผํ•œ ์‹คํ–‰ ์ง€์ ์— ๋„๋‹ฌํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ์ผ ์“ฐ๊ธฐ: ์„ ์ถœ์„ ํ†ตํ•ด ์ตœ์ข… ์ถœ๋ ฅ์— ๋Œ€ํ•œ ๊ฒฝ์Ÿ ์ƒํƒœ๋ฅผ ๋ฐฉ์ง€ํ•ฉ๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ณต์žก๋„ ๋ถ„์„:

  • ํŠธ๋ฆฌ ๋ฆฌ๋•์…˜: O(logโ‚‚ TPB) = O(logโ‚‚ 256) = ๋ธ”๋ก๋‹น 8๋‹จ๊ณ„
  • ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •: O(1) ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  • ์ตœ์ข… ์ง‘๊ณ„: O(CLUSTER_SIZE) = O(4) ๋‹จ์ˆœ ๋ง์…ˆ
  • ์ „์ฒด: ๋ธ”๋ก ๋‚ด๋ถ€๋Š” ๋กœ๊ทธ, ๋ธ”๋ก ๊ฐ„์€ ์„ ํ˜•

ํ™•์žฅ์„ฑ ํŠน์„ฑ:

  • ๋ธ”๋ก ๋ ˆ๋ฒจ: ๋กœ๊ทธ ๋ณต์žก๋„๋กœ ์ˆ˜์ฒœ ๊ฐœ์˜ ์Šค๋ ˆ๋“œ๊นŒ์ง€ ํ™•์žฅ ๊ฐ€๋Šฅ
  • ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: ์„ ํ˜• ๋ณต์žก๋„๋กœ ์ˆ˜์‹ญ ๊ฐœ์˜ ๋ธ”๋ก๊นŒ์ง€ ํ™•์žฅ ๊ฐ€๋Šฅ
  • ๋ฉ”๋ชจ๋ฆฌ: ์ž„์‹œ ์ €์žฅ์†Œ ์š”๊ตฌ๋Ÿ‰์ด ํด๋Ÿฌ์Šคํ„ฐ ํฌ๊ธฐ์— ๋น„๋ก€ํ•˜์—ฌ ์„ ํ˜• ์ฆ๊ฐ€
  • ํ†ต์‹ : ์ตœ์†Œํ•œ์˜ ๋ธ”๋ก ๊ฐ„ ๋ฐ์ดํ„ฐ ์ด๋™ (๋ธ”๋ก๋‹น ํ•˜๋‚˜์˜ ๊ฐ’)

์ง‘ํ•ฉ ์—ฐ์‚ฐ ํŒจํ„ด ์ดํ•ดํ•˜๊ธฐ

์ด ํผ์ฆ์€ ๋ถ„์‚ฐ ์ปดํ“จํŒ…์—์„œ ์‚ฌ์šฉ๋˜๋Š” ๊ณ ์ „์ ์ธ 2๋‹จ๊ณ„ ๋ฆฌ๋•์…˜ ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

  1. ๋กœ์ปฌ ์ง‘๊ณ„: ๊ฐ ์ฒ˜๋ฆฌ ๋‹จ์œ„(๋ธ”๋ก)๊ฐ€ ์ž์‹ ์˜ ๋ฐ์ดํ„ฐ ์˜์—ญ์„ ๋ฆฌ๋•์…˜ํ•ฉ๋‹ˆ๋‹ค
  2. ์ „์—ญ ์กฐ์ •: ์ฒ˜๋ฆฌ ๋‹จ์œ„๋“ค์ด ๋™๊ธฐํ™”ํ•˜๊ณ  ๊ฒฐ๊ณผ๋ฅผ ๊ตํ™˜ํ•ฉ๋‹ˆ๋‹ค
  3. ์ตœ์ข… ๋ฆฌ๋•์…˜: ์„ ์ถœ๋œ ํ•˜๋‚˜์˜ ๋‹จ์œ„๊ฐ€ ๋ชจ๋“  ๋ถ€๋ถ„ ๊ฒฐ๊ณผ๋ฅผ ํ•ฉ์นฉ๋‹ˆ๋‹ค

๋‹จ์ผ ๋ธ”๋ก ๋ฐฉ์‹๊ณผ์˜ ๋น„๊ต:

  • ๊ธฐ์กด block.sum(): ์ตœ๋Œ€ 256๊ฐœ ์Šค๋ ˆ๋“œ ๋‚ด์—์„œ๋งŒ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ์ง‘ํ•ฉ ์—ฐ์‚ฐ: ์—ฌ๋Ÿฌ ๋ธ”๋ก์— ๊ฑธ์ณ 1000๊ฐœ ์ด์ƒ์˜ ์Šค๋ ˆ๋“œ๋กœ ํ™•์žฅ๋ฉ๋‹ˆ๋‹ค
  • ๋™์ผํ•œ ์ •ํ™•๋„: ๋‘˜ ๋‹ค ๋™์ผํ•œ ์ˆ˜ํ•™์  ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค
  • ๋‹ค๋ฅธ ๊ทœ๋ชจ: ํด๋Ÿฌ์Šคํ„ฐ ๋ฐฉ์‹์ด ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค

์„ฑ๋Šฅ ์ด์ :

  • ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹: ๋‹จ์ผ ๋ธ”๋ก ์šฉ๋Ÿ‰์„ ์ดˆ๊ณผํ•˜๋Š” ๋ฐฐ์—ด์„ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋” ๋‚˜์€ ํ™œ์šฉ๋ฅ : ๋” ๋งŽ์€ GPU ์—ฐ์‚ฐ ์œ ๋‹›์„ ๋™์‹œ์— ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ํŒจํ„ด: ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค

๋‹ค์Œ ๋‹จ๊ณ„: ์ตœ์ข… ๋„์ „์„ ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์œผ๋กœ ์ด๋™ํ•˜์—ฌ ์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ+๋ธ”๋ก ์กฐ์ •+ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™”๋ฅผ ๊ฒฐํ•ฉํ•œ ๊ณ„์ธต์  ํŒจํ„ด์„ ๋ฐฐ์›Œ๋ณด์„ธ์š”. ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค!

๐Ÿง  ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜

๊ฐœ์š”

์ด ๋งˆ์ง€๋ง‰ ๋„์ „์—์„œ๋Š” ์›Œํ”„ ๋ ˆ๋ฒจ (Puzzle 24-26), ๋ธ”๋ก ๋ ˆ๋ฒจ (Puzzle 27), ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์— ์ด๋ฅด๊ธฐ๊นŒ์ง€ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ณ„์ธต ๊ตฌ์กฐ์˜ ๋ชจ๋“  ๋ ˆ๋ฒจ์„ ๊ฒฐํ•ฉํ•˜์—ฌ GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ์ •๊ตํ•œ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

๋„์ „ ๊ณผ์ œ: ์›Œํ”„ ๋ ˆ๋ฒจ ์ตœ์ ํ™” (elect_one_sync()), ๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„, ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ์กฐ์ •์„ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ ํŒจํ„ด์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ณ„์ธต์  ํด๋Ÿฌ์Šคํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค.

ํ•ต์‹ฌ ํ•™์Šต: ๊ณ ๊ธ‰ ์—ฐ์‚ฐ ์›Œํฌ๋กœ๋“œ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํ”„๋กœ๋•์…˜ ์ˆ˜์ค€์˜ ์กฐ์ • ํŒจํ„ด๊ณผ ํ•จ๊ป˜ ์™„์ „ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์Šคํƒ์„ ๋ฐฐ์›๋‹ˆ๋‹ค.

๋ฌธ์ œ: ๋‹ค๋‹จ๊ณ„ ๋ฐ์ดํ„ฐ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ

์‹ค์ œ GPU ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ GPU ๊ณ„์ธต ๊ตฌ์กฐ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๋ ˆ๋ฒจ(Puzzle 24์˜ ์›Œํ”„, Puzzle 27์˜ ๋ธ”๋ก, ํด๋Ÿฌ์Šคํ„ฐ)์ด ์กฐ์ •๋œ ์—ฐ์‚ฐ ํŒŒ์ดํ”„๋ผ์ธ์—์„œ ๊ฐ๊ฐ ์ „๋ฌธํ™”๋œ ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ณ„์ธต์  ์กฐ์ •์„ ํ•„์š”๋กœ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์Šต๋‹ˆ๋‹ค. ์ด๋Š” Puzzle 29์˜ ๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ๋ฅผ ํ™•์žฅํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ณผ์ œ: ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋‹ค๋‹จ๊ณ„ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„ํ•˜์„ธ์š”:

  1. ์›Œํ”„ ๋ ˆ๋ฒจ: ํšจ์œจ์ ์ธ ์›Œํ”„ ๋‚ด๋ถ€ ์กฐ์ •์„ ์œ„ํ•ด elect_one_sync()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (SIMT ์‹คํ–‰)
  2. ๋ธ”๋ก ๋ ˆ๋ฒจ: ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ ์กฐ์ •์„ ์‚ฌ์šฉํ•˜์—ฌ ์›Œํ”„ ๊ฒฐ๊ณผ๋ฅผ ์ง‘๊ณ„ํ•ฉ๋‹ˆ๋‹ค
  3. ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: cluster_arrive() / cluster_wait()์™€ Puzzle 29์˜ ๋‹จ๊ณ„์  ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ธ”๋ก ๊ฐ„ ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋ช…์„ธ

๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ:

  1. 1๋‹จ๊ณ„ (์›Œํ”„ ๋ ˆ๋ฒจ): ๊ฐ ์›Œํ”„๊ฐ€ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ๋ฅผ ์„ ์ถœํ•˜์—ฌ 32๊ฐœ์˜ ์—ฐ์† ์š”์†Œ๋ฅผ ํ•ฉ์‚ฐํ•ฉ๋‹ˆ๋‹ค
  2. 2๋‹จ๊ณ„ (๋ธ”๋ก ๋ ˆ๋ฒจ): ๊ฐ ๋ธ”๋ก ๋‚ด์˜ ๋ชจ๋“  ์›Œํ”„ ํ•ฉ๊ณ„๋ฅผ ์ง‘๊ณ„ํ•ฉ๋‹ˆ๋‹ค
  3. 3๋‹จ๊ณ„ (ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ): cluster_arrive() / cluster_wait()๋กœ ๋ธ”๋ก ๊ฐ„ ์กฐ์ •์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค

์ž…๋ ฅ: ํ…Œ์ŠคํŠธ๋ฅผ ์œ„ํ•œ (i % 50) * 0.02 ํŒจํ„ด์˜ 1024๊ฐœ float ๊ฐ’

์ถœ๋ ฅ: ๊ณ„์ธต์  ์ฒ˜๋ฆฌ ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” 4๊ฐœ ๋ธ”๋ก ๊ฒฐ๊ณผ

์„ค์ •

  • ๋ฌธ์ œ ํฌ๊ธฐ: SIZE = 1024 ์š”์†Œ
  • ๋ธ”๋ก ์„ค์ •: TPB = 256 ๋ธ”๋ก๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (256, 1)
  • ๊ทธ๋ฆฌ๋“œ ์„ค์ •: CLUSTER_SIZE = 4 ๋ธ”๋ก (4, 1)
  • ์›Œํ”„ ํฌ๊ธฐ: WARP_SIZE = 32 ์›Œํ”„๋‹น ์Šค๋ ˆ๋“œ ์ˆ˜ (NVIDIA ํ‘œ์ค€)
  • ๋ธ”๋ก๋‹น ์›Œํ”„ ์ˆ˜: TPB / WARP_SIZE = 8 ์›Œํ”„
  • ๋ฐ์ดํ„ฐ ํƒ€์ž…: DType.float32
  • ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ: ์ž…๋ ฅ row_major[SIZE](), ์ถœ๋ ฅ row_major[CLUSTER_SIZE]()

์ฒ˜๋ฆฌ ๋ถ„๋ฐฐ:

  • Block 0: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 0-255
  • Block 1: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 256-511
  • Block 2: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 512-767
  • Block 3: 256 ์Šค๋ ˆ๋“œ โ†’ 8 ์›Œํ”„ โ†’ ์š”์†Œ 768-1023
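๋ช…์„ธ์˜ ์ž…๋ ฅ ํŒจํ„ด (i % 50) * 0.02์™€ ํŒ์— ์–ธ๊ธ‰๋œ ๋ธ”๋ก๋ณ„ ์Šค์ผ€์ผ๋ง(block_id + 1)์„ ๊ฐ€์ •ํ•˜๋ฉด, ์˜ˆ์ƒ ์ถœ๋ ฅ์˜ ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ Python์œผ๋กœ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค (float64 ๊ณ„์‚ฐ์ด๋ฏ€๋กœ GPU์˜ float32 ์ถœ๋ ฅ๊ณผ๋Š” ๋งˆ์ง€๋ง‰ ์ž๋ฆฟ์ˆ˜๊ฐ€ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค):

```python
SIZE, TPB, CLUSTER_SIZE = 1024, 256, 4
inp = [(i % 50) * 0.02 for i in range(SIZE)]  # ๋ช…์„ธ์˜ ํ…Œ์ŠคํŠธ ์ž…๋ ฅ ํŒจํ„ด

results = []
for b in range(CLUSTER_SIZE):
    scale = b + 1  # ๊ฐ€์ •: ํŒ์— ์–ธ๊ธ‰๋œ ๋ธ”๋ก๋ณ„ ์Šค์ผ€์ผ๋ง block_id + 1
    results.append(sum(inp[b * TPB:(b + 1) * TPB]) * scale)

for b, r in enumerate(results):
    print(f"Block {b}: {r:.2f}")
# Block 0: 122.80, Block 1: 247.04, Block 2: 372.72, Block 3: 499.84
```

๋ธ”๋ก๋งˆ๋‹ค ์‹œ์ž‘ ์œ„์น˜๊ฐ€ 50์˜ ๋ฐฐ์ˆ˜๊ฐ€ ์•„๋‹ˆ๋ผ ์›๋ณธ ๊ตฌ๊ฐ„ ํ•ฉ์ด ์กฐ๊ธˆ์”ฉ ๋‹ฌ๋ผ์ง€๊ณ , ๊ฑฐ๊ธฐ์— ์Šค์ผ€์ผ๋ง์ด ๊ณฑํ•ด์ ธ ์•„๋ž˜ ์˜ˆ์ƒ ์ถœ๋ ฅ์˜ ๋น„๋ก€์  ์ฆ๊ฐ€ ํŒจํ„ด์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.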

์™„์„ฑํ•  ์ฝ”๋“œ

def advanced_cluster_patterns[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Advanced cluster programming using cluster masks and relaxed synchronization.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x

    # FILL IN (roughly 26 lines)


์ „์ฒด ํŒŒ์ผ ๋ณด๊ธฐ: problems/p34/p34.mojo

ํŒ

์›Œํ”„ ๋ ˆ๋ฒจ ์ตœ์ ํ™” ํŒจํ„ด

๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„ ์ „๋žต

  • ์›Œํ”„ ์ฒ˜๋ฆฌ ํ›„ ๋ชจ๋“  ์›Œํ”„ ๊ฒฐ๊ณผ๋ฅผ ์ง‘๊ณ„ํ•ฉ๋‹ˆ๋‹ค (Puzzle 27์˜ ๋ธ”๋ก ์กฐ์ • ํ™•์žฅ)
  • ์„ ์ถœ๋œ ์œ„์น˜์—์„œ ์ฝ์Šต๋‹ˆ๋‹ค: ์ธ๋ฑ์Šค 0, 32, 64, 96, 128, 160, 192, 224
  • for i in range(0, tpb, 32) ๋ฃจํ”„๋กœ ์›Œํ”„ ๋ฆฌ๋”๋ฅผ ์ˆœํšŒํ•ฉ๋‹ˆ๋‹ค (๋ฆฌ๋•์…˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ํŒจํ„ด)
  • ์Šค๋ ˆ๋“œ 0๋งŒ ์ตœ์ข… ๋ธ”๋ก ํ•ฉ๊ณ„๋ฅผ ๊ณ„์‚ฐํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค (๋ฐฐ๋ฆฌ์–ด ์กฐ์ •์˜ ๋‹จ์ผ ์“ฐ๊ธฐ ํŒจํ„ด)

ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • ํ๋ฆ„

  1. ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ๊ณ„์ธต์  ์›Œํ”„ ์ตœ์ ํ™”๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  2. ์‹ ํ˜ธ: cluster_arrive()๋กœ ๋กœ์ปฌ ์ฒ˜๋ฆฌ ์™„๋ฃŒ๋ฅผ ์•Œ๋ฆฝ๋‹ˆ๋‹ค
  3. ์ €์žฅ: ์Šค๋ ˆ๋“œ 0์ด ๋ธ”๋ก ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅ์— ๊ธฐ๋กํ•ฉ๋‹ˆ๋‹ค
  4. ๋Œ€๊ธฐ: cluster_wait()๋กœ ๋ชจ๋“  ๋ธ”๋ก์ด ์™„๋ฃŒ๋  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐํ•ฉ๋‹ˆ๋‹ค

๋ฐ์ดํ„ฐ ์Šค์ผ€์ผ๋ง ๋ฐ ๊ฒฝ๊ณ„ ๊ฒ€์‚ฌ

  • Float32(block_id + 1)๋กœ ์ž…๋ ฅ์„ ์Šค์ผ€์ผ๋งํ•˜์—ฌ ๋ธ”๋ก๋ณ„ ๊ณ ์œ  ํŒจํ„ด์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค
  • ์ž…๋ ฅ์„ ์ฝ๊ธฐ ์ „์— ํ•ญ์ƒ global_i < size๋ฅผ ๊ฒ€์‚ฌํ•ฉ๋‹ˆ๋‹ค (Puzzle 3์˜ ๊ฐ€๋“œ)
  • ๋ธ”๋ก ๋‚ด ์ฒ˜๋ฆฌ ๋‹จ๊ณ„ ์‚ฌ์ด์— barrier()๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค (๋™๊ธฐํ™” ํŒจํ„ด)
  • ๋ฃจํ”„์—์„œ ์›Œํ”„ ๊ฒฝ๊ณ„ ์กฐ๊ฑด์„ ์ฃผ์˜ ๊นŠ๊ฒŒ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค (์›Œํ”„ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์˜ ๊ณ ๋ ค์‚ฌํ•ญ)

๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ API

gpu.primitives.cluster ๋ชจ๋“ˆ:

  • elect_one_sync(): ํšจ์œจ์ ์ธ ์—ฐ์‚ฐ์„ ์œ„ํ•œ ์›Œํ”„ ๋ ˆ๋ฒจ ์Šค๋ ˆ๋“œ ์„ ์ถœ
  • cluster_arrive(): ๋‹จ๊ณ„์  ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ์œ„ํ•œ ์™„๋ฃŒ ์‹ ํ˜ธ
  • cluster_wait(): ๋ชจ๋“  ๋ธ”๋ก์ด ๋™๊ธฐํ™” ์ง€์ ์— ๋„๋‹ฌํ•  ๋•Œ๊นŒ์ง€ ๋Œ€๊ธฐ
  • block_rank_in_cluster(): ํด๋Ÿฌ์Šคํ„ฐ ๋‚ด ๊ณ ์œ ํ•œ ๋ธ”๋ก ์‹๋ณ„์ž ๋ฐ˜ํ™˜

๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด

์ด ํผ์ฆ์€ 3๋‹จ๊ณ„ ์กฐ์ • ๊ณ„์ธต ๊ตฌ์กฐ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋ ˆ๋ฒจ 1: ์›Œํ”„ ์กฐ์ • (Puzzle 24)

Warp (32 threads) โ†’ elect_one_sync() โ†’ 1 elected thread โ†’ processes 32 elements

๋ ˆ๋ฒจ 2: ๋ธ”๋ก ์กฐ์ • (Puzzle 27)

Block (8 warps) โ†’ aggregate warp results โ†’ 1 block total

๋ ˆ๋ฒจ 3: ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ • (์ด ํผ์ฆ)

Cluster (4 blocks) โ†’ cluster_arrive/wait โ†’ synchronized completion

๊ฒฐํ•ฉ ํšจ๊ณผ: 1024๊ฐœ ์Šค๋ ˆ๋“œ โ†’ 32๊ฐœ ์›Œํ”„ ๋ฆฌ๋” โ†’ 4๊ฐœ ๋ธ”๋ก ๊ฒฐ๊ณผ โ†’ ์กฐ์ •๋œ ํด๋Ÿฌ์Šคํ„ฐ ์™„๋ฃŒ

์ฝ”๋“œ ์‹คํ–‰

pixi run p34 --advanced
uv run poe p34 --advanced

์˜ˆ์ƒ ์ถœ๋ ฅ:

Testing Advanced Cluster Algorithms
SIZE: 1024 TPB: 256 CLUSTER_SIZE: 4
Advanced cluster algorithm results:
  Block 0 : 122.799995
  Block 1 : 247.04001
  Block 2 : 372.72
  Block 3 : 499.83997
โœ… Advanced cluster patterns tests passed!

์„ฑ๊ณต ๊ธฐ์ค€:

  • ๊ณ„์ธต์  ์Šค์ผ€์ผ๋ง: ๊ฒฐ๊ณผ๊ฐ€ ๋‹ค๋‹จ๊ณ„ ์กฐ์ • ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค
  • ์›Œํ”„ ์ตœ์ ํ™”: elect_one_sync()๊ฐ€ ์ค‘๋ณต ์—ฐ์‚ฐ์„ ์ค„์ž…๋‹ˆ๋‹ค
  • ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •: ๋ชจ๋“  ๋ธ”๋ก์ด ์ฒ˜๋ฆฌ๋ฅผ ์„ฑ๊ณต์ ์œผ๋กœ ์™„๋ฃŒํ•ฉ๋‹ˆ๋‹ค
  • ์„ฑ๋Šฅ ํŒจํ„ด: ๋” ๋†’์€ ๋ธ”๋ก ID๊ฐ€ ๋น„๋ก€์ ์œผ๋กœ ๋” ํฐ ๊ฒฐ๊ณผ๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค

์†”๋ฃจ์…˜

def advanced_cluster_patterns[
    tpb: Int
](
    output: TileTensor[mut=True, dtype, ClusterLayoutType, MutAnyOrigin],
    input: TileTensor[mut=False, dtype, InLayoutType, ImmutAnyOrigin],
    size: Int,
):
    """Advanced cluster programming using cluster masks and relaxed synchronization.
    """
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    var my_block_rank = Int(block_rank_in_cluster())
    var block_id = block_idx.x

    var shared_data = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[tpb]())

    # Compute cluster mask for advanced coordination
    # base_mask = cluster_mask_base()  # Requires cluster_shape parameter

    # FIX: Process data with block_idx-based scaling for guaranteed uniqueness
    var data_scale = Scalar[dtype](block_id + 1)
    if global_i < size:
        shared_data[local_i] = input[global_i] * data_scale
    else:
        shared_data[local_i] = 0.0

    barrier()

    # Advanced pattern: Use elect_one_sync for efficient coordination
    if elect_one_sync():  # Only one thread per warp does this work
        var warp_sum: Float32 = 0.0
        var warp_start = (local_i // 32) * 32  # Get warp start index
        for i in range(32):  # Sum across warp
            if warp_start + i < tpb:
                warp_sum += shared_data[warp_start + i][0]
        shared_data[local_i] = warp_sum

    barrier()

    # Use cluster_arrive for staged synchronization in sm90+
    cluster_arrive()

    # Only first thread in each block stores result
    if local_i == 0:
        var block_total: Float32 = 0.0
        for i in range(0, tpb, 32):  # Sum warp results
            if i < tpb:
                block_total += shared_data[i][0]
        output[block_id] = block_total

    # Wait for all blocks to complete their calculations in sm90+
    cluster_wait()


๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ ํŒจํ„ด ํ’€์ด๋Š” GPU ํ™œ์šฉ๋ฅ ์„ ๊ทน๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ์›Œํ”„, ๋ธ”๋ก, ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ๊ฒฐํ•ฉํ•˜๋Š” ์ •๊ตํ•œ 3๋‹จ๊ณ„ ๊ณ„์ธต์  ์ตœ์ ํ™”๋ฅผ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค:

๋ ˆ๋ฒจ 1: ์›Œํ”„ ๋ ˆ๋ฒจ ์ตœ์ ํ™” (์Šค๋ ˆ๋“œ ์„ ์ถœ)

๋ฐ์ดํ„ฐ ์ค€๋น„ ๋ฐ ์Šค์ผ€์ผ๋ง:

var data_scale = Float32(block_id + 1)  # Block-specific scaling factor
if global_i < size:
    shared_data[local_i] = input[global_i] * data_scale
else:
    shared_data[local_i] = 0.0  # Zero-pad for out-of-bounds
barrier()  # Ensure all threads complete data loading

์›Œํ”„ ๋ ˆ๋ฒจ ์Šค๋ ˆ๋“œ ์„ ์ถœ:

if elect_one_sync():  # Hardware elects exactly 1 thread per warp
    var warp_sum: Float32 = 0.0
    var warp_start = (local_i // 32) * 32  # Calculate warp boundary
    for i in range(32):  # Process entire warp's data
        if warp_start + i < tpb:
            warp_sum += shared_data[warp_start + i][0]
    shared_data[local_i] = warp_sum  # Store result at elected thread's position

์›Œํ”„ ๊ฒฝ๊ณ„ ๊ณ„์‚ฐ ์„ค๋ช…:

  • ์Šค๋ ˆ๋“œ 37 (์›Œํ”„ 1): warp_start = (37 // 32) * 32 = 1 * 32 = 32
  • ์Šค๋ ˆ๋“œ 67 (์›Œํ”„ 2): warp_start = (67 // 32) * 32 = 2 * 32 = 64
  • ์Šค๋ ˆ๋“œ 199 (์›Œํ”„ 6): warp_start = (199 // 32) * 32 = 6 * 32 = 192

์„ ์ถœ ํŒจํ„ด ์‹œ๊ฐํ™” (TPB=256, 8 ์›Œํ”„):

Warp 0 (threads 0-31):   elect_one_sync() โ†’ Thread 0   processes elements 0-31
Warp 1 (threads 32-63):  elect_one_sync() โ†’ Thread 32  processes elements 32-63
Warp 2 (threads 64-95):  elect_one_sync() โ†’ Thread 64  processes elements 64-95
Warp 3 (threads 96-127): elect_one_sync() โ†’ Thread 96  processes elements 96-127
Warp 4 (threads 128-159):elect_one_sync() โ†’ Thread 128 processes elements 128-159
Warp 5 (threads 160-191):elect_one_sync() โ†’ Thread 160 processes elements 160-191
Warp 6 (threads 192-223):elect_one_sync() โ†’ Thread 192 processes elements 192-223
Warp 7 (threads 224-255):elect_one_sync() โ†’ Thread 224 processes elements 224-255

๋ ˆ๋ฒจ 2: ๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„ (์›Œํ”„ ๋ฆฌ๋” ์กฐ์ •)

์›Œํ”„ ๊ฐ„ ๋™๊ธฐํ™”:

barrier()  # Ensure all warps complete their elected computations

์›Œํ”„ ๋ฆฌ๋” ์ง‘๊ณ„ (์Šค๋ ˆ๋“œ 0๋งŒ):

if local_i == 0:
    var block_total: Float32 = 0.0
    for i in range(0, tpb, 32):  # Iterate through warp leader positions
        if i < tpb:
            block_total += shared_data[i][0]  # Sum warp results
    output[block_id] = block_total

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ํŒจํ„ด:

  • ์Šค๋ ˆ๋“œ 0์ด ๋‹ค์Œ ์œ„์น˜์—์„œ ์ฝ์Šต๋‹ˆ๋‹ค: shared_data[0], shared_data[32], shared_data[64], shared_data[96], shared_data[128], shared_data[160], shared_data[192], shared_data[224]
  • ์ด ์œ„์น˜๋“ค์—๋Š” ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ๊ณ„์‚ฐํ•œ ์›Œํ”„ ํ•ฉ๊ณ„๊ฐ€ ์ €์žฅ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค
  • ๊ฒฐ๊ณผ: 8๊ฐœ ์›Œํ”„ ํ•ฉ๊ณ„ โ†’ 1๊ฐœ ๋ธ”๋ก ํ•ฉ๊ณ„

๋ ˆ๋ฒจ 3: ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ๋‹จ๊ณ„์  ๋™๊ธฐํ™”

๋‹จ๊ณ„์  ๋™๊ธฐํ™” ์ ‘๊ทผ:

cluster_arrive()  # Non-blocking: signal this block's completion
# ... Thread 0 computes and stores block result ...
cluster_wait()    # Blocking: wait for all blocks to complete

์™œ ๋‹จ๊ณ„์  ๋™๊ธฐํ™”๋ฅผ ์‚ฌ์šฉํ• ๊นŒ?

  • cluster_arrive() ๋ฅผ ์ตœ์ข… ์—ฐ์‚ฐ ์ด์ „์— ํ˜ธ์ถœํ•˜๋ฉด ์ž‘์—… ์ค‘์ฒฉ์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค
  • ๋‹ค๋ฅธ ๋ธ”๋ก์ด ์•„์ง ์ฒ˜๋ฆฌ ์ค‘์ธ ๋™์•ˆ์—๋„ ๋ธ”๋ก์ด ์ž์ฒด ๊ฒฐ๊ณผ๋ฅผ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค
  • cluster_wait() ๋กœ ๊ฒฐ์ •๋ก ์  ์™„๋ฃŒ ์ˆœ์„œ๋ฅผ ๋ณด์žฅํ•ฉ๋‹ˆ๋‹ค
  • ๋…๋ฆฝ์ ์ธ ๋ธ”๋ก ์—ฐ์‚ฐ์˜ ๊ฒฝ์šฐ cluster_sync()๋ณด๋‹ค ๋” ํšจ์œจ์ ์ž…๋‹ˆ๋‹ค

๊ณ ๊ธ‰ ํŒจํ„ด ํŠน์„ฑ

๊ณ„์ธต์  ์—ฐ์‚ฐ ์ถ•์†Œ:

  1. 256๊ฐœ ์Šค๋ ˆ๋“œ โ†’ 8๊ฐœ ์„ ์ถœ ์Šค๋ ˆ๋“œ (๋ธ”๋ก๋‹น 32๋ฐฐ ์ถ•์†Œ)
  2. 8๊ฐœ ์›Œํ”„ ํ•ฉ๊ณ„ โ†’ 1๊ฐœ ๋ธ”๋ก ํ•ฉ๊ณ„ (๋ธ”๋ก๋‹น 8๋ฐฐ ์ถ•์†Œ)
  3. 4๊ฐœ ๋ธ”๋ก โ†’ ๋‹จ๊ณ„์  ์™„๋ฃŒ (๋™๊ธฐํ™”๋œ ์ข…๋ฃŒ)
  4. ์ „์ฒด ํšจ์œจ: ๋ธ”๋ก๋‹น ์ค‘๋ณต ์—ฐ์‚ฐ 256๋ฐฐ ์ถ•์†Œ
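The reduction factors claimed above follow from the launch sizes alone; a one-screen arithmetic check (assuming TPB=256 and a warp size of 32):

```python
THREADS_PER_BLOCK = 256
WARP_SIZE = 32

elected = THREADS_PER_BLOCK // WARP_SIZE        # elected threads per block
stage1_factor = THREADS_PER_BLOCK // elected    # threads -> elected threads
stage2_factor = elected // 1                    # warp sums -> one block sum
total_factor = stage1_factor * stage2_factor    # combined per-block reduction
print(elected, stage1_factor, stage2_factor, total_factor)
```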

๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ ์ตœ์ ํ™”:

  • ๋ ˆ๋ฒจ 1: input[global_i]์—์„œ ๋ณ‘ํ•ฉ๋œ ์ฝ๊ธฐ, ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์Šค์ผ€์ผ๋ง๋œ ์“ฐ๊ธฐ
  • ๋ ˆ๋ฒจ 2: ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์›Œํ”„ ๋ ˆ๋ฒจ ์ง‘๊ณ„๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค (256๊ฐœ ๋Œ€์‹  8๊ฐœ ์—ฐ์‚ฐ)
  • ๋ ˆ๋ฒจ 3: ์Šค๋ ˆ๋“œ 0์ด ๋ธ”๋ก ๋ ˆ๋ฒจ ์ง‘๊ณ„๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค (8๊ฐœ ๋Œ€์‹  1๊ฐœ ์—ฐ์‚ฐ)
  • ๊ฒฐ๊ณผ: ๊ณ„์ธต์  ๋ฆฌ๋•์…˜์„ ํ†ตํ•ด ๋ฉ”๋ชจ๋ฆฌ ๋Œ€์—ญํญ ์‚ฌ์šฉ๋Ÿ‰์„ ์ตœ์†Œํ™”ํ•ฉ๋‹ˆ๋‹ค

๋™๊ธฐํ™” ๊ณ„์ธต ๊ตฌ์กฐ:

  1. barrier(): ๋ธ”๋ก ๋‚ด๋ถ€ ์Šค๋ ˆ๋“œ ๋™๊ธฐํ™” (๋ฐ์ดํ„ฐ ๋กœ๋”ฉ ๋ฐ ์›Œํ”„ ์ฒ˜๋ฆฌ ํ›„)
  2. cluster_arrive(): ๋ธ”๋ก ๊ฐ„ ์‹ ํ˜ธ (๋…ผ๋ธ”๋กœํ‚น, ์ž‘์—… ์ค‘์ฒฉ ๊ฐ€๋Šฅ)
  3. cluster_wait(): ๋ธ”๋ก ๊ฐ„ ๋™๊ธฐํ™” (๋ธ”๋กœํ‚น, ์™„๋ฃŒ ์ˆœ์„œ ๋ณด์žฅ)

์™œ "๊ณ ๊ธ‰"์ธ๊ฐ€:

  • ๋‹ค๋‹จ๊ณ„ ์ตœ์ ํ™”: ์›Œํ”„, ๋ธ”๋ก, ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋ฒ•์„ ๊ฒฐํ•ฉํ•ฉ๋‹ˆ๋‹ค
  • ํ•˜๋“œ์›จ์–ด ํšจ์œจ: elect_one_sync()๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์›Œํ”„ ํ™œ์šฉ๋ฅ ์„ ์ตœ์ ํ™”ํ•ฉ๋‹ˆ๋‹ค
  • ๋‹จ๊ณ„์  ์กฐ์ •: ๊ณ ๊ธ‰ ํด๋Ÿฌ์Šคํ„ฐ API๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์œ ์—ฐํ•œ ๋™๊ธฐํ™”๋ฅผ ๊ตฌํ˜„ํ•ฉ๋‹ˆ๋‹ค
  • ํ”„๋กœ๋•์…˜ ์ˆ˜์ค€: ์‹ค์ œ GPU ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์—์„œ ์‚ฌ์šฉ๋˜๋Š” ํŒจํ„ด์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค

์‹ค์ œ ์„ฑ๋Šฅ ์ด์ :

  • ๋ฉ”๋ชจ๋ฆฌ ๋ถ€ํ•˜ ๊ฐ์†Œ: ๋™์‹œ์— ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ์— ์ ‘๊ทผํ•˜๋Š” ์Šค๋ ˆ๋“œ ์ˆ˜๊ฐ€ ์ ์–ด์ง‘๋‹ˆ๋‹ค
  • ๋” ๋‚˜์€ ์›Œํ”„ ํ™œ์šฉ: ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๊ฐ€ ์ง‘์ค‘์ ์ธ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค
  • ํ™•์žฅ ๊ฐ€๋Šฅํ•œ ์กฐ์ •: ๋‹จ๊ณ„์  ๋™๊ธฐํ™”๊ฐ€ ๋” ํฐ ํด๋Ÿฌ์Šคํ„ฐ ํฌ๊ธฐ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์œ ์—ฐ์„ฑ: ๋ณต์žกํ•œ ๋‹ค๋‹จ๊ณ„ ์ฒ˜๋ฆฌ ํŒŒ์ดํ”„๋ผ์ธ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค

๋ณต์žก๋„ ๋ถ„์„:

  • ์›Œํ”„ ๋ ˆ๋ฒจ: ์„ ์ถœ๋œ ์Šค๋ ˆ๋“œ๋‹น O(32) ์—ฐ์‚ฐ = ๋ธ”๋ก๋‹น ์ด O(256)
  • ๋ธ”๋ก ๋ ˆ๋ฒจ: ๋ธ”๋ก๋‹น O(8) ์ง‘๊ณ„ ์—ฐ์‚ฐ
  • ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ: ๋ธ”๋ก๋‹น O(1) ๋™๊ธฐํ™” ์˜ค๋ฒ„ํ—ค๋“œ
  • ์ „์ฒด: ๋Œ€๊ทœ๋ชจ ๋ณ‘๋ ฌํ™” ์ด์ ์„ ๊ฐ€์ง„ ์„ ํ˜• ๋ณต์žก๋„

์™„์ „ํ•œ GPU ๊ณ„์ธต ๊ตฌ์กฐ

์ถ•ํ•˜ํ•ฉ๋‹ˆ๋‹ค! ์ด ํผ์ฆ์„ ์™„๋ฃŒํ•จ์œผ๋กœ์จ ์™„์ „ํ•œ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์Šคํƒ์„ ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค:

โœ… ์Šค๋ ˆ๋“œ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: ๊ฐœ๋ณ„ ์‹คํ–‰ ๋‹จ์œ„
โœ… ์›Œํ”„ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: 32๊ฐœ ์Šค๋ ˆ๋“œ SIMT ์กฐ์ •
โœ… ๋ธ”๋ก ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: ๋ฉ€ํ‹ฐ ์›Œํ”„ ์กฐ์ •๊ณผ ๊ณต์œ  ๋ฉ”๋ชจ๋ฆฌ
โœ… ๐Ÿ†• ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ ํ”„๋กœ๊ทธ๋ž˜๋ฐ: SM90+ API๋ฅผ ํ™œ์šฉํ•œ ๋ฉ€ํ‹ฐ ๋ธ”๋ก ์กฐ์ •
โœ… ํด๋Ÿฌ์Šคํ„ฐ ๋™๊ธฐํ™” ๊ธฐ๋ณธ ์š”์†Œ๋กœ ์—ฌ๋Ÿฌ ์Šค๋ ˆ๋“œ ๋ธ”๋ก์„ ์กฐ์ •
โœ… ํด๋Ÿฌ์Šคํ„ฐ API๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹จ์ผ ๋ธ”๋ก ํ•œ๊ณ„๋ฅผ ๋„˜์–ด ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ํ™•์žฅ
โœ… ์›Œํ”„ + ๋ธ”๋ก + ํด๋Ÿฌ์Šคํ„ฐ ์กฐ์ •์„ ๊ฒฐํ•ฉํ•œ ๊ณ„์ธต์  ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๊ตฌํ˜„
โœ… SM90+ ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ์œผ๋กœ ์ฐจ์„ธ๋Œ€ GPU ํ•˜๋“œ์›จ์–ด๋ฅผ ํ™œ์šฉ

์‹ค์ „ ์‘์šฉ

์ด ํผ์ฆ์˜ ๊ณ„์ธต์  ์กฐ์ • ํŒจํ„ด์€ ๋‹ค์Œ ๋ถ„์•ผ์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค:

๊ณ ์„ฑ๋Šฅ ์ปดํ“จํŒ…:

  • ๋ฉ€ํ‹ฐ ๊ทธ๋ฆฌ๋“œ ๊ธฐ๋ฒ•: ๊ฐ ๋ ˆ๋ฒจ์ด ์„œ๋กœ ๋‹ค๋ฅธ ํ•ด์ƒ๋„์˜ ๊ทธ๋ฆฌ๋“œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ๋„๋ฉ”์ธ ๋ถ„ํ•ด: ๋ฌธ์ œ์˜ ํ•˜์œ„ ๋„๋ฉ”์ธ์— ๊ฑธ์นœ ๊ณ„์ธต์  ์กฐ์ •
  • ๋ณ‘๋ ฌ ๋ฐ˜๋ณต๋ฒ•: ์›Œํ”„ ๋ ˆ๋ฒจ์˜ ๋กœ์ปฌ ์—ฐ์‚ฐ๊ณผ ํด๋Ÿฌ์Šคํ„ฐ ๋ ˆ๋ฒจ์˜ ์ „์—ญ ํ†ต์‹ 

๋”ฅ๋Ÿฌ๋‹:

  • ๋ชจ๋ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ๊ฐ ๋ธ”๋ก์ด ๋ชจ๋ธ์˜ ์„œ๋กœ ๋‹ค๋ฅธ ๊ตฌ์„ฑ ์š”์†Œ๋ฅผ ์ฒ˜๋ฆฌํ•ฉ๋‹ˆ๋‹ค
  • ํŒŒ์ดํ”„๋ผ์ธ ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ: ์—ฌ๋Ÿฌ ํŠธ๋žœ์Šคํฌ๋จธ ๋ ˆ์ด์–ด์— ๊ฑธ์นœ ๋‹จ๊ณ„์  ์ฒ˜๋ฆฌ
  • ๊ธฐ์šธ๊ธฐ ์ง‘๊ณ„: ๋ถ„์‚ฐ ํ•™์Šต ๋…ธ๋“œ์— ๊ฑธ์นœ ๊ณ„์ธต์  ๋ฆฌ๋•์…˜

๊ทธ๋ž˜ํ”ฝ์Šค ๋ฐ ์‹œ๊ฐํ™”:

  • ๋ฉ€ํ‹ฐ ํŒจ์Šค ๋ Œ๋”๋ง: ๋ณต์žกํ•œ ์‹œ๊ฐ ํšจ๊ณผ๋ฅผ ์œ„ํ•œ ๋‹จ๊ณ„์  ์ฒ˜๋ฆฌ
  • ๊ณ„์ธต์  ์ปฌ๋ง: ๊ฐ ๋ ˆ๋ฒจ์ด ์„œ๋กœ ๋‹ค๋ฅธ ์„ธ๋ถ„๋„์—์„œ ์ปฌ๋งํ•ฉ๋‹ˆ๋‹ค
  • ๋ณ‘๋ ฌ ์ง€์˜ค๋ฉ”ํŠธ๋ฆฌ ์ฒ˜๋ฆฌ: ์กฐ์ •๋œ ๋ณ€ํ™˜ ํŒŒ์ดํ”„๋ผ์ธ

๋‹ค์Œ ๋‹จ๊ณ„

์ด์ œ ์ตœ์‹  ํ•˜๋“œ์›จ์–ด์—์„œ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ตœ์ฒจ๋‹จ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ๊ธฐ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค!

๋” ๋งŽ์€ ๋„์ „์„ ํ•  ์ค€๋น„๊ฐ€ ๋˜์…จ๋‚˜์š”? ๋‹ค๋ฅธ ๊ณ ๊ธ‰ GPU ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ฃผ์ œ๋ฅผ ํƒ๊ตฌํ•˜๊ณ , Puzzle 30-32์˜ ์„ฑ๋Šฅ ์ตœ์ ํ™” ๊ธฐ๋ฒ•์„ ๋ณต์Šตํ•˜๊ณ , NVIDIA ๋„๊ตฌ์˜ ํ”„๋กœํŒŒ์ผ๋ง ๋ฐฉ๋ฒ•๋ก ์„ ์ ์šฉํ•˜๊ฑฐ๋‚˜, ์ด ํด๋Ÿฌ์Šคํ„ฐ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ํŒจํ„ด์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž์‹ ๋งŒ์˜ ์—ฐ์‚ฐ ์›Œํฌ๋กœ๋“œ๋ฅผ ๊ตฌ์ถ•ํ•ด ๋ณด์„ธ์š”!