Performance: coalesced vs non-coalesced memory access

Understanding memory access patterns is central to GPU performance optimization. This section explains why coalesced memory access patterns outperform non-coalesced ones in memory-bound operations such as embedding lookups.

Memory coalescing basics

Memory coalescing occurs when consecutive threads in a warp access consecutive memory addresses. The GPU combines these individual memory requests into fewer, larger memory transactions, which greatly improves bandwidth utilization.

Coalesced vs non-coalesced access

Coalesced (efficient):

- Thread 0 → Address 0x1000
- Thread 1 → Address 0x1004
- Thread 2 → Address 0x1008
- Thread 3 → Address 0x100C
- ...

Result: 1 memory transaction for the entire warp (32 threads × 4 bytes = 128 contiguous bytes)

Non-coalesced (inefficient):

- Thread 0 → Address 0x1000
- Thread 1 → Address 0x2000
- Thread 2 → Address 0x3000
- Thread 3 → Address 0x4000
- ...

Result: up to 32 separate memory transactions
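The two address lists above can be checked with a small model. The sketch below is not a hardware simulator: it assumes a 128-byte transaction granularity (a common cache-line size, but an assumption here) and simply counts how many aligned segments one warp touches. The names `SEGMENT_BYTES` and `count_transactions` are hypothetical.

```python
SEGMENT_BYTES = 128  # assumed transaction granularity; real hardware may differ
WARP_SIZE = 32


def count_transactions(addresses, segment_bytes=SEGMENT_BYTES):
    """Count the distinct aligned segments touched by one warp's addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Coalesced: lane i reads 4 bytes at 0x1000 + 4*i (consecutive addresses).
coalesced = [0x1000 + 4 * lane for lane in range(WARP_SIZE)]

# Non-coalesced: lane i reads at 0x1000, 0x2000, 0x3000, ... (stride 0x1000).
scattered = [0x1000 * (lane + 1) for lane in range(WARP_SIZE)]

print(count_transactions(coalesced))  # 1
print(count_transactions(scattered))  # 32
```

Under this model the coalesced warp is served by one segment, while the strided warp needs one segment per thread.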

Why embedding operations are memory-bound

Embedding lookups are memory-bound because of the following characteristics:

  • Minimal computation: the kernel only copies data from input to output
  • Large memory footprint: embedding tables can reach several gigabytes
  • High memory bandwidth demand: large amounts of data must be transferred

For such operations, memory access efficiency determines performance more than computational complexity does.
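To make the "copy only" point concrete, here is a minimal CPU reference sketch of an embedding lookup with a rough count of the bytes it moves. The function names, the 4-byte value and index sizes, and the tiny table are all assumptions for illustration, not the kernels discussed in this section.

```python
def embedding_lookup(indices, table):
    """Copy one table row per index; no arithmetic is done on the values."""
    return [list(table[i]) for i in indices]


def bytes_moved(num_indices, embed_dim, bytes_per_value=4, bytes_per_index=4):
    """Rough traffic estimate: read each index and row once, write each row once."""
    read = num_indices * (embed_dim * bytes_per_value + bytes_per_index)
    written = num_indices * embed_dim * bytes_per_value
    return read + written


# Hypothetical 5-row table with 4 values per row.
table = [[float(r * 10 + c) for c in range(4)] for r in range(5)]
out = embedding_lookup([3, 0, 3], table)

print(out[0])             # [30.0, 31.0, 32.0, 33.0]
print(bytes_moved(3, 4))  # 108
```

Every output value costs roughly one read and one write and zero floating-point operations, so the runtime is governed by how efficiently those bytes are moved.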

Kernel comparison

1D coalesced kernel

  • Thread organization: [total_elements // 256] blocks, one thread per output element
  • Memory pattern: consecutive threads access consecutive embedding dimensions
  • Why it is coalesced: Thread 0: output[0,0,0], Thread 1: output[0,0,1] → consecutive addresses
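The 1D mapping can be sketched as plain index arithmetic. This is an illustrative model rather than the kernel itself: it assumes a row-major output of shape (batch, seq_len, embed_dim) with 4-byte values, and the helper names and sizes are hypothetical. The ceiling division in `blocks_needed` is also an assumption of the sketch, so that a partial last block would still be launched when the element count is not a multiple of 256.

```python
def blocks_needed(total_elements, threads_per_block=256):
    """Ceiling division so that every output element gets a thread."""
    return (total_elements + threads_per_block - 1) // threads_per_block


def unflatten(global_idx, seq_len, embed_dim):
    """Map a flat thread index to (batch, seq, embed) for a row-major output."""
    embed = global_idx % embed_dim
    position = global_idx // embed_dim  # combined batch*seq position
    return position // seq_len, position % seq_len, embed


def output_offset_bytes(batch, seq, embed, seq_len, embed_dim, bytes_per_value=4):
    """Byte offset of output[batch, seq, embed] from the start of the buffer."""
    return ((batch * seq_len + seq) * embed_dim + embed) * bytes_per_value


BATCH, SEQ_LEN, EMBED_DIM = 2, 8, 64  # hypothetical sizes

first_warp = [unflatten(lane, SEQ_LEN, EMBED_DIM) for lane in range(32)]
offsets = [output_offset_bytes(b, s, e, SEQ_LEN, EMBED_DIM) for b, s, e in first_warp]

print(first_warp[:2])  # [(0, 0, 0), (0, 0, 1)]
print(offsets[:4])     # [0, 4, 8, 12]
print(blocks_needed(BATCH * SEQ_LEN * EMBED_DIM))  # 4
```

Because the embedding dimension varies fastest with the thread index, neighbouring threads land on neighbouring 4-byte slots.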

2D non-coalesced kernel

  • Thread organization: [batch*seq // 16, embed_dim // 16] blocks, 16×16 threads each
  • Memory pattern: threads may access different embedding vectors
  • Why it is non-coalesced: the threads' accesses can be scattered across memory

Performance results

Typical benchmark results (exact timings vary with hardware and problem size):

Performance Results:
   1D Coalesced:     2.145 ms
   2D Non-coalesced: 3.867 ms
   1D is 1.80x faster than 2D

Memory access visualization

Coalesced pattern (1D kernel)

Warp execution for output[0,0,0:32]:

Element          Thread ID   Memory access   Address pattern
output[0,0,0]    0           [0,0]           Base + 0
output[0,0,1]    1           [0,1]           Base + 4
output[0,0,2]    2           [0,2]           Base + 8
output[0,0,3]    3           [0,3]           Base + 12
...              ...         ...             ...
output[0,0,31]   31          [0,31]          Base + 124

Result: consecutive addresses → 1 memory transaction for the entire warp

Non-coalesced pattern (2D kernel)

Warp execution with 16×16 blocks:

Block organization (16ร—16):
    X-dim: batch*seq positions (0-15)
    Y-dim: embed dimensions (0-15)

Warp threads might access:
    Thread 0:  batch=0, seq=0, embed=0  → Address A
    Thread 1:  batch=0, seq=1, embed=0  → Address B (different row)
    Thread 2:  batch=0, seq=2, embed=0  → Address C (different row)
    ...
    Thread 31: batch=0, seq=15, embed=1 → Address Z (scattered)

Result: scattered addresses → multiple memory transactions
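The two warp pictures can be compared numerically with a rough model. The sketch below makes several assumptions that are not stated in this section: lanes are numbered with the X index varying fastest, the output is row-major with 4-byte values, transactions are 128-byte aligned segments, and `embed_dim` is 128. It only models writes to the output; reads from the embedding table additionally depend on the index values.

```python
SEGMENT_BYTES = 128  # assumed transaction granularity
WARP_SIZE = 32
BYTES_PER_VALUE = 4


def segments_touched(offsets):
    """Count the distinct aligned segments covered by a warp's byte offsets."""
    return len({off // SEGMENT_BYTES for off in offsets})


def warp_offsets_1d():
    """1D kernel: lane i of the first warp writes flat output element i."""
    return [lane * BYTES_PER_VALUE for lane in range(WARP_SIZE)]


def warp_offsets_2d(embed_dim, block_x=16):
    """2D kernel: X indexes the batch*seq position, Y the embedding dimension."""
    offsets = []
    for lane in range(WARP_SIZE):
        position = lane % block_x  # X index: which output row
        embed = lane // block_x    # Y index: which column in that row
        offsets.append((position * embed_dim + embed) * BYTES_PER_VALUE)
    return offsets


EMBED_DIM = 128  # hypothetical embedding width

print(segments_touched(warp_offsets_1d()))           # 1
print(segments_touched(warp_offsets_2d(EMBED_DIM)))  # 16
```

In this model the 2D warp spreads its 32 writes over 16 different rows, so it touches 16 segments where the 1D warp touches one.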

Key optimization strategies

  1. Prefer 1D indexing for memory-bound operations whenever possible
  2. Lay out data structures so that accesses coalesce
  3. Consider memory access patterns when designing kernels
  4. Profile memory bandwidth to identify bottlenecks
  5. Use memory-bound benchmarks to validate optimizations
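One simple way to act on the profiling and benchmarking advice is to convert a measured kernel time into effective bandwidth and compare it across kernel variants. The sketch below uses the two timings quoted in this section; the byte count is a made-up placeholder, since the benchmark's tensor shapes are not given here, and the function name is hypothetical.

```python
def effective_bandwidth_gb_s(bytes_moved, elapsed_ms):
    """Bytes the algorithm needs to move divided by elapsed time, in GB/s."""
    return bytes_moved / (elapsed_ms / 1000.0) / 1e9


# Placeholder traffic figure: NOT taken from the benchmark above.
BYTES_MOVED = 256 * 1024 * 1024

bw_1d = effective_bandwidth_gb_s(BYTES_MOVED, 2.145)  # 1D coalesced timing
bw_2d = effective_bandwidth_gb_s(BYTES_MOVED, 3.867)  # 2D non-coalesced timing

print(f"{bw_1d / bw_2d:.2f}x")  # 1.80x
```

Both kernels move the same data, so the bandwidth ratio equals the speedup. Comparing the absolute figure against the device's peak bandwidth would show how much headroom remains.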

Core insight: memory access patterns often determine GPU performance more than computational complexity does, especially for memory-bound operations such as embedding.