Puzzle 19: Attention Op

κ°œμš”

이 νΌμ¦μ—μ„œλŠ” μ–΄ν…μ…˜ λ©”μ»€λ‹ˆμ¦˜μ„ μ»€μŠ€ν…€ MAX κ·Έλž˜ν”„ μ—°μ‚°μœΌλ‘œ κ΅¬ν˜„ν•©λ‹ˆλ‹€. μ–΄ν…μ…˜μ€ νŠΈλžœμŠ€ν¬λ¨Έμ™€ ν•¨κ»˜ 널리 μ•Œλ €μ§„ ν˜„λŒ€ μ‹ κ²½λ§μ˜ 핡심 μš”μ†Œλ‘œ, λͺ¨λΈμ΄ μ˜ˆμΈ‘ν•  λ•Œ μž…λ ₯μ—μ„œ κ΄€λ ¨λœ 뢀뢄에 집쀑할 수 있게 ν•΄μ€λ‹ˆλ‹€.

μˆ˜ν•™μ μœΌλ‘œ μ–΄ν…μ…˜ ν•¨μˆ˜λŠ” λ‹€μŒκ³Ό 같이 μ •μ˜λ©λ‹ˆλ‹€:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

μ—¬κΈ°μ„œ:

  • \(Q\)λŠ” \((d,)~\) ν˜•νƒœμ˜ 쿼리 벑터 - μ°ΎμœΌλ €λŠ” λŒ€μƒμ„ λ‚˜νƒ€λƒ…λ‹ˆλ‹€
  • \(K\)λŠ” \((\text{seq_len}, d)~\) ν˜•νƒœμ˜ ν‚€ ν–‰λ ¬ - λ§€μΉ­ν•  수 μžˆλŠ” λŒ€μƒμ„ λ‚˜νƒ€λƒ…λ‹ˆλ‹€
  • \(V\)λŠ” \((\text{seq_len}, d)~\) ν˜•νƒœμ˜ κ°’ ν–‰λ ¬ - 검색할 정보λ₯Ό λ‚˜νƒ€λƒ…λ‹ˆλ‹€
  • 좜λ ₯은 \((d,)\) ν˜•νƒœμ˜ 가쀑합 λ²‘ν„°μž…λ‹ˆλ‹€

연산은 μ„Έ κ°€μ§€ μ£Όμš” λ‹¨κ³„λ‘œ μ΄λ£¨μ–΄μ§‘λ‹ˆλ‹€:

  1. μ–΄ν…μ…˜ 점수: \(Q \cdot K^T\)λ₯Ό κ³„μ‚°ν•˜μ—¬ 쿼리가 각 ν‚€ 벑터와 μ–Όλ§ˆλ‚˜ 잘 λ§€μΉ­λ˜λŠ”μ§€ μΈ‘μ •ν•©λ‹ˆλ‹€
  2. μ–΄ν…μ…˜ κ°€μ€‘μΉ˜: μ†Œν”„νŠΈλ§₯슀λ₯Ό μ μš©ν•˜μ—¬ 점수λ₯Ό ν™•λ₯  λΆ„ν¬λ‘œ λ³€ν™˜ν•©λ‹ˆλ‹€ (κ°€μ€‘μΉ˜μ˜ ν•© = 1)
  3. 가쀑 ν•©: μ–΄ν…μ…˜ κ°€μ€‘μΉ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ κ°’ 벑터듀을 κ²°ν•©ν•΄ μ΅œμ’… 좜λ ₯을 μƒμ„±ν•©λ‹ˆλ‹€

μ–΄ν…μ…˜ μ΄ν•΄ν•˜κΈ°: 단계별 뢄석

μ–΄ν…μ…˜μ„ 슀마트 검색 λ©”μ»€λ‹ˆμ¦˜μœΌλ‘œ 생각해 λ³΄μ„Έμš”. 쿼리(찾고자 ν•˜λŠ” 것)κ°€ μ£Όμ–΄μ§€λ©΄, μ–΄ν…μ…˜μ€ ν‚€-κ°’ 쌍의 λͺ¨μŒμ—μ„œ κ°€μž₯ κ΄€λ ¨μ„± 높은 정보λ₯Ό μ°Ύμ•„λƒ…λ‹ˆλ‹€:

  1. Step 1 - similarity matching: compare the query \(Q\) against every key in \(K\) to obtain similarity scores

    • Compute \(Q \cdot K^T\) to measure how well \(Q\) matches each key vector
    • Higher score = better match
  2. Step 2 - probability distribution: convert the raw scores into normalized weights

    • Apply softmax so that all the weights sum to 1.0
    • This yields a probability distribution over which values to focus on
  3. Step 3 - weighted retrieval: combine the values using the attention weights

    • Multiply each value vector by its corresponding weight
    • Sum everything to obtain the final output

μ‹€μƒν™œ λΉ„μœ : λ„μ„œκ΄€μ—μ„œ κ²€μƒ‰ν•˜λŠ” 것을 상상해 λ³΄μ„Έμš”. μΏΌλ¦¬λŠ” μ°Ύκ³  싢은 것이고, μ±… 제λͺ©μ€ 킀이며, μ±… λ‚΄μš©μ€ κ°’μž…λ‹ˆλ‹€. μ–΄ν…μ…˜μ€ 각 책이 쿼리와 μ–Όλ§ˆλ‚˜ κ΄€λ ¨ μžˆλŠ”μ§€ κ³„μ‚°ν•œ λ‹€μŒ, 관련도에 따라 가쀑 μš”μ•½μ„ μ œκ³΅ν•©λ‹ˆλ‹€.

μ—°μ‚° 흐름 μ‹œκ°ν™”

Input:  Q(16,)    K(16,16)    V(16,16)
         ↓           ↓           ↓
Step 1: Q(1,16) @ K^T(16,16) β†’ Scores(1,16)
         ↓
Step 2: softmax(Scores) β†’ Weights(1,16)  [sum = 1.0]
         ↓
Step 3: Weights(1,16) @ V(16,16) β†’ Output(1,16) β†’ reshape β†’ Output(16,)

핡심 아이디어: 쿼리 벑터 \(Q\)λ₯Ό \((16,)\)μ—μ„œ \((1,16)\)으둜 λ³€ν™˜ν•˜λ©΄, 내적 λŒ€μ‹  ν–‰λ ¬ κ³±μ…ˆμ„ μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€. 덕뢄에 Puzzle 18의 κ³ λ„λ‘œ μ΅œμ ν™”λœ 타일링 matmul 컀널을 κ·ΈλŒ€λ‘œ ν™œμš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€!

GPU κ΅¬ν˜„μ€ 이전 νΌμ¦μ—μ„œ μ΅œμ ν™”λœ 컀널듀을 μž¬μ‚¬μš©ν•˜κ³  κ²°ν•©ν•©λ‹ˆλ‹€:

πŸ”„ 컀널 μž¬μ‚¬μš© μ „λž΅: 이 퍼즐은 이전 νΌμ¦μ—μ„œ κ²€μ¦λœ μ΅œμ ν™” 컀널듀을 κ²°ν•©ν•˜μ—¬ λ³΅μž‘ν•œ 연산을 κ΅¬μΆ•ν•˜λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€. λͺ¨λ“  것을 μ²˜μŒλΆ€ν„° μž‘μ„±ν•˜λŠ” λŒ€μ‹ , Puzzle 16의 matmul_idiomatic_tiledκ³Ό Puzzle 18의 softmax_kernel을 ν™œμš©ν•˜μ—¬ λͺ¨λ“ˆν˜• GPU 컀널 μ„€κ³„μ˜ κ°•λ ₯함을 λ³΄μ—¬μ€λ‹ˆλ‹€.

핡심 κ°œλ…

  • μ‹œν€€μŠ€ 처리λ₯Ό μœ„ν•œ 벑터 μ–΄ν…μ…˜ λ©”μ»€λ‹ˆμ¦˜
  • 컀널 μž¬μ‚¬μš©: Puzzle 16κ³Ό Puzzle 18의 κ²€μ¦λœ κ΅¬ν˜„ ν™œμš©
  • 곡유 λ©”λͺ¨λ¦¬ tiling을 ν™œμš©ν•œ 효율적인 ν–‰λ ¬ κ³±μ…ˆ
  • 버퍼 할당을 μ΅œμ†Œν™”ν•˜λŠ” λ©”λͺ¨λ¦¬ μ΅œμ ν™” ν…μ„œ ν˜•νƒœ λ³€ν™˜
  • μ—¬λŸ¬ μ΅œμ ν™” 컀널을 단일 μ—°μ‚°μœΌλ‘œ 톡합
  • 닀쀑 μž…λ ₯을 μ§€μ›ν•˜λŠ” μ»€μŠ€ν…€ MAX κ·Έλž˜ν”„ μ—°μ‚°
  • ν˜Έν™˜μ„±μ„ μœ„ν•œ CPU 폴백 κ΅¬ν˜„

μ„€μ •

  • μ‹œν€€μŠ€ 길이: \(\text{SEQ_LEN} = 16~\) - μ‹œν€€μŠ€ λ‚΄ ν‚€/κ°’ λ²‘ν„°μ˜ 수
  • λͺ¨λΈ 차원: \(\text{D} = 16~\) - 각 벑터(쿼리, ν‚€, κ°’)의 차원
  • 블둝당 μŠ€λ ˆλ“œ 수: 각 컀널에 맞게 κ°œλ³„ μ΅œμ ν™”
  • κ·Έλ¦¬λ“œ 차원: λ‹€μ–‘ν•œ ν–‰λ ¬ 크기λ₯Ό 효율적으둜 μ²˜λ¦¬ν•˜λ„λ‘ λ™μ μœΌλ‘œ 계산
  • 곡유 λ©”λͺ¨λ¦¬: μ „μΉ˜, matmul, μ†Œν”„νŠΈλ§₯슀 μ»€λ„μ—μ„œ μ„±λŠ₯을 μœ„ν•΄ ν™œμš©

λ ˆμ΄μ•„μ›ƒ μ„€μ •:

  • 쿼리 ν…μ„œ: Layout.row_major(d)
  • ν‚€ ν…μ„œ: Layout.row_major(seq_len, d)
  • κ°’ ν…μ„œ: Layout.row_major(seq_len, d)
  • 좜λ ₯ ν…μ„œ: Layout.row_major(d)
  • μ»€μŠ€ν…€ op νŒŒλΌλ―Έν„°: {"seq_len": seq_len, "d": d, "dtype": dtype}

The key elements of this puzzle are:

  1. Multi-kernel orchestration: combining transpose, matmul, and softmax operations
  2. Memory optimization: minimizing memory allocations through reshape operations and buffer reuse
  3. Numerical stability: leveraging the proven softmax implementation from Puzzle 18
  4. Performance optimization: using Puzzle 16's tiling algorithm for all matrix operations
  5. Multi-input operation: handling three input tensors (Q, K, V) in a single custom op

μ–΄ν…μ…˜ μ»€μŠ€ν…€ 연산은 λ‹€μŒκ³Ό 같은 일을 μˆ˜ν–‰ν•©λ‹ˆλ‹€:

  • νŒŒμ΄μ¬μ—μ„œ 쿼리, ν‚€, κ°’ ν…μ„œλ₯Ό μž…λ ₯으둜 λ°›κΈ°
  • μ΅œμ ν™”λœ 컀널을 μ‚¬μš©ν•˜μ—¬ GPUμ—μ„œ 효율적으둜 처리
  • μ–΄ν…μ…˜ 가쀑 좜λ ₯ 벑터 λ°˜ν™˜
  • NumPy μ°Έμ‘° κ΅¬ν˜„ 결과와 일치

μ™„μ„±ν•  μ½”λ“œ

이 퍼즐을 μ™„μ„±ν•˜λ €λ©΄ Puzzle 16의 타일링 matmul 컀널과 Puzzle 18의 μ†Œν”„νŠΈλ§₯슀 컀널을 ν™œμš©ν•©λ‹ˆλ‹€. 곡유 λ©”λͺ¨λ¦¬λ₯Ό μ‚¬μš©ν•˜μ—¬ Mojo νŒŒμΌμ—μ„œ μ „μΉ˜ μ»€λ„λ§Œ κ΅¬ν˜„ν•˜λ©΄ λ©λ‹ˆλ‹€.

1. μ „μΉ˜ 컀널 κ΅¬ν˜„ν•˜κΈ°

fn transpose_kernel[
    layout_in: Layout,  # Layout for input matrix (seq_len, d)
    layout_out: Layout,  # Layout for output matrix (d, seq_len)
    rows: Int,
    cols: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
):
    # FILL ME IN (roughly 18 lines)
    ...


View the full file: problems/p19/op/attention.mojo

Tips

Transpose kernel implementation guide:

  1. 곡유 λ©”λͺ¨λ¦¬ μ„€μ •: LayoutTensor[dtype, Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY), MutAnyOrigin, address_space = AddressSpace.SHARED].stack_allocation()을 μ‚¬μš©ν•˜μ—¬ TRANSPOSE_BLOCK_DIM_XY Γ— TRANSPOSE_BLOCK_DIM_XY 크기의 μ •μ‚¬κ°ν˜• 곡유 λ©”λͺ¨λ¦¬ 타일을 μƒμ„±ν•©λ‹ˆλ‹€. 이λ₯Ό 톡해 μŠ€λ ˆλ“œ κ°„ 효율적인 데이터 κ΅ν™˜μ΄ κ°€λŠ₯ν•©λ‹ˆλ‹€.

  2. μŠ€λ ˆλ“œ 인덱싱: μŠ€λ ˆλ“œλ₯Ό ν–‰λ ¬ μš”μ†Œμ— λ§€ν•‘ν•©λ‹ˆλ‹€:

    • local_row = thread_idx.y, local_col = thread_idx.x (블둝 λ‚΄ μœ„μΉ˜)
    • global_row = block_idx.y * TRANSPOSE_BLOCK_DIM_XY + local_row (전체 ν–‰λ ¬μ—μ„œμ˜ μœ„μΉ˜)
  3. 2단계 μ—°μ‚°:

    • 1단계: μ „μ—­ λ©”λͺ¨λ¦¬μ—μ„œ 곡유 λ©”λͺ¨λ¦¬λ‘œ 일반 μΈλ±μ‹±μœΌλ‘œ 데이터λ₯Ό λ‘œλ“œν•©λ‹ˆλ‹€
    • 2단계: 곡유 λ©”λͺ¨λ¦¬μ—μ„œ μ „μ—­ λ©”λͺ¨λ¦¬λ‘œ λ’€λ°”κΎΌ μΈλ±μ‹±μœΌλ‘œ 데이터λ₯Ό μ €μž₯ν•©λ‹ˆλ‹€
  4. ν•„μˆ˜ 동기화: λ‘œλ“œμ™€ μ €μž₯ 사이에 barrier()λ₯Ό ν˜ΈμΆœν•˜μ—¬ λͺ¨λ“  μŠ€λ ˆλ“œκ°€ λ‘œλ“œλ₯Ό μ™„λ£Œν•œ 후에야 μ €μž₯을 μ‹œμž‘ν•˜λ„λ‘ 보μž₯ν•©λ‹ˆλ‹€

  5. μ „μΉ˜μ˜ 핡심: μ „μΉ˜λŠ” λ’€λ°”κΎΌ 인덱싱을 톡해 μ΄λ£¨μ–΄μ§‘λ‹ˆλ‹€: shared_tile[local_row, local_col] λŒ€μ‹  shared_tile[local_col, local_row]λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€

  6. 경계 처리: μ „μ—­ λ©”λͺ¨λ¦¬ μ ‘κ·Ό μ‹œ 경계 검사λ₯Ό μˆ˜ν–‰ν•˜μ—¬ TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY둜 μ •ν™•νžˆ λ‚˜λˆ„μ–΄μ§€μ§€ μ•ŠλŠ” ν–‰λ ¬μ—μ„œ λ²”μœ„λ₯Ό λ²—μ–΄λ‚œ 읽기/μ“°κΈ°λ₯Ό λ°©μ§€ν•©λ‹ˆλ‹€

  7. λ©”λͺ¨λ¦¬ 병합: 이 νŒ¨ν„΄μ€ 읽기와 μ“°κΈ° λͺ¨λ‘ λ³‘ν•©λ˜λ„λ‘ 보μž₯ν•˜μ—¬ 졜적의 λ©”λͺ¨λ¦¬ λŒ€μ—­ν­μ„ ν™œμš©ν•©λ‹ˆλ‹€

2. μ–΄ν…μ…˜ μ˜€μΌ€μŠ€νŠΈλ ˆμ΄μ…˜

            var gpu_ctx = rebind[DeviceContext](ctx[])

            # Define layouts for matrix multiplication
            # Q reshaped to (1, d)
            comptime layout_q_2d = Layout.row_major(1, d)
            # K^T is (d, seq_len)
            comptime layout_k_t = Layout.row_major(d, seq_len)
            # Scores as (1, seq_len)
            comptime layout_scores_2d = Layout.row_major(1, seq_len)
            # Weights as (1, seq_len)
            comptime layout_weights_2d = Layout.row_major(1, seq_len)
            # Result as (1, d)
            comptime layout_result_2d = Layout.row_major(1, d)

            # Transpose implementation limited to square (TRANSPOSE_BLOCK_DIM_XY x TRANSPOSE_BLOCK_DIM_XY) thread blocks
            comptime transpose_threads_per_block = (
                TRANSPOSE_BLOCK_DIM_XY,
                TRANSPOSE_BLOCK_DIM_XY,
            )
            # Tile over the K (seq_len, d) matrix
            comptime transpose_blocks_per_grid = (
                (d + TRANSPOSE_BLOCK_DIM_XY - 1) // TRANSPOSE_BLOCK_DIM_XY,
                (seq_len + TRANSPOSE_BLOCK_DIM_XY - 1)
                // TRANSPOSE_BLOCK_DIM_XY,
            )
            # Matmul implementation limited to square (MATMUL_BLOCK_DIM_XY x MATMUL_BLOCK_DIM_XY) thread blocks
            comptime matmul_threads_per_block = (
                MATMUL_BLOCK_DIM_XY,
                MATMUL_BLOCK_DIM_XY,
            )
            # seq_len outputs ( Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len) ) with one thread per output
            comptime scores_blocks_per_grid = (
                seq_len + MATMUL_BLOCK_DIM_XY - 1
            ) // MATMUL_BLOCK_DIM_XY
            comptime softmax_threads = SOFTMAX_BLOCK_DIM_X
            comptime softmax_blocks_per_grid = 1
            # d outputs ( weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d) ) with one thread per output
            comptime result_blocks_per_grid = (
                d + MATMUL_BLOCK_DIM_XY - 1
            ) // MATMUL_BLOCK_DIM_XY

            # Allocate minimal temporary buffers - reuse same buffer for different shapes
            k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len * d
            )  # K^T as (d, seq_len)
            scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](
                seq_len
            )  # Reused for scores and weights

            k_t = LayoutTensor[dtype, layout_k_t, MutAnyOrigin](k_t_buf)

            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            # FILL ME IN 1 line

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            # FILL ME IN 1 function call

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q Β· K^T[i] = Q Β· K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            # FILL ME IN 2 lines

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            # FILL ME IN 1 line

            # Step 5: Apply softmax to get attention weights
            # FILL ME IN 1 function call

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            # FILL ME IN 1 line

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            # FILL ME IN 2 lines

View the full file: problems/p19/op/attention.mojo

컀널 ν…ŒμŠ€νŠΈ

pixi run p19
pixi run -e amd p19
pixi run -e apple p19
uv run poe p19

μ„±κ³΅ν•˜λ©΄ CPU와 GPUμ—μ„œ λ‹€μŒκ³Ό λΉ„μŠ·ν•œ 좜λ ₯을 λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€:

Input shapes: Q=(16,), K=(16, 16), V=(16, 16)
Sample Q values: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]
Sample K[0] values: [-0.10128311  0.03142473 -0.09080241 -0.14123037  0.14656489]
Sample V[0] values: [ 0.11631638  0.00102331 -0.09815087  0.04621035  0.01990597]

================================================================================
STEP-BY-STEP VECTOR ATTENTION COMPUTATION DEBUG
================================================================================

1. INPUT SHAPES:
   Q shape: (16,) (query vector)
   K shape: (16, 16) (key matrix)
   V shape: (16, 16) (value matrix)
   Q[:5]: [ 0.04967142 -0.01382643  0.06476886  0.15230298 -0.02341534]

2. ATTENTION SCORES (K[i] Β· Q):
   Scores shape: (16,)
   Scores[:5]: [-0.03479404 -0.01563787  0.04834607  0.06764711  0.04001468]
   Min: -0.061636, Max: 0.067647
   Manual verification:
     Q Β· K[0] = K[0] Β· Q = -0.034794 (computed: -0.034794)
     Q Β· K[1] = K[1] Β· Q = -0.015638 (computed: -0.015638)
     Q Β· K[2] = K[2] Β· Q = 0.048346 (computed: 0.048346)

3. SOFTMAX:
   Max score: 0.067647
   Attention weights shape: (16,)
   Attention weights[:5]: [0.05981331 0.06097015 0.06499878 0.0662655  0.06445949]
   Sum: 1.000000 (should be 1.0)

4. WEIGHTED SUM OF VALUES:
   Output shape: (16,)
   Output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Output norm: 0.092764
   Manual output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
   Match: True

================================================================================
TESTING INDIVIDUAL OPERATIONS
================================================================================

Test 1: Vector Dot Product
a Β· b = 3.000000

Test 2: Matrix-Vector Multiplication
M @ v = [ 3.  7. 11.]

Test 3: Softmax
Input: [1. 2. 3. 4.]
Softmax: [0.0320586  0.08714432 0.2368828  0.6439143 ]
Sum: 1.000000

================================================================================
TESTING FULL ATTENTION
================================================================================
Compiling attention graph on Device(type=cpu,id=0)
Executing attention on Device(type=cpu,id=0)
====================================================================================================

CPU attention output[:5]: [-0.00935538 -0.02434331  0.00306551  0.02346884  0.019306  ]
CPU matches NumPy: True
Compiling attention graph on Device(type=gpu,id=0)
Executing attention on Device(type=gpu,id=0)
====================================================================================================

GPU attention output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
Expected output[:5]: [-0.00935538 -0.0243433   0.00306551  0.02346884  0.019306  ]
GPU matches NumPy: True

================================================================================
FINAL VERIFICATION
================================================================================
βœ“ CPU implementation PASSED
βœ“ GPU implementation PASSED

Output vector norms:
  CPU: 0.092764
  GPU: 0.092764
  Expected: 0.092764

이 좜λ ₯은 μ»€μŠ€ν…€ MAX κ·Έλž˜ν”„ 연산이 μ–΄ν…μ…˜ μ•Œκ³ λ¦¬μ¦˜μ„ μ˜¬λ°”λ₯΄κ²Œ κ΅¬ν˜„ν•˜μ—¬ NumPy μ°Έμ‘° κ΅¬ν˜„κ³Ό μΌμΉ˜ν•˜λŠ” κ²°κ³Όλ₯Ό μƒμ„±ν–ˆμŒμ„ λ³΄μ—¬μ€λ‹ˆλ‹€.

μ†”λ£¨μ…˜

이 퍼즐을 ν’€λ €λ©΄ Mojoμ—μ„œ μ „μΉ˜ 컀널을 κ΅¬ν˜„ν•˜κ³  μ–΄ν…μ…˜ μ»€μŠ€ν…€ 연산을 μœ„ν•œ 파이썬 κ·Έλž˜ν”„ μ •μ˜λ₯Ό μ™„μ„±ν•΄μ•Ό ν•©λ‹ˆλ‹€. 이 퍼즐은 이전 퍼즐의 κ°œλ…λ“€μ„ 기반으둜, Puzzle 16의 타일링 ν–‰λ ¬ κ³±μ…ˆκ³Ό Puzzle 18의 μ†Œν”„νŠΈλ§₯슀λ₯Ό κ²°ν•©ν•˜μ—¬ μ™„μ „ν•œ μ–΄ν…μ…˜ λ©”μ»€λ‹ˆμ¦˜μ„ κ΅¬μ„±ν•©λ‹ˆλ‹€.

μž¬μ‚¬μš© 컀널

κ΅¬ν˜„μ—μ„œ λ‹€μŒμ˜ κ²€μ¦λœ 컀널듀을 직접 ν™œμš©ν•©λ‹ˆλ‹€:

  1. matmul_idiomatic_tiled (Puzzle 16) - performs both the \(Q \times K^T\) and \(\text{weights} \times V\) operations
  2. softmax_kernel (Puzzle 18) - provides numerically stable computation of the attention weights

This is a good example of modular GPU architecture: building a complex neural network operation by orchestrating proven, optimized components rather than writing a monolithic implementation.

μ–΄ν…μ…˜ 연산은 ν‘œμ€€μ μΈ μˆ˜ν•™μ  μ •μ˜λ₯Ό λ”°λ¦…λ‹ˆλ‹€:

$$\Large \text{Attention}(Q, K, V) = \text{softmax}(Q \cdot K^T) \cdot V$$

μˆ˜μ‹ 뢄석:

  • \(Q \cdot K^T~\): 쿼리-ν‚€ μœ μ‚¬λ„ 점수, ν˜•νƒœ: \((1, \text{seq_len})\)
  • \(\text{softmax}(\cdot)~\): 점수λ₯Ό ν™•λ₯ λ‘œ μ •κ·œν™”, ν˜•νƒœ: \((1, \text{seq_len})\)
  • \(\text{weights} \cdot V~\): κ°’μ˜ 가쀑 κ²°ν•©, ν˜•νƒœ: \((1, d)\)
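The shape bookkeeping above can be verified with a few NumPy assertions (illustrative sizes; a uniform distribution stands in for the softmax output):

```python
import numpy as np

seq_len, d = 16, 16
Q = np.zeros((1, d))          # query reshaped to 2D
K_T = np.zeros((d, seq_len))  # transposed keys
V = np.zeros((seq_len, d))    # values

scores = Q @ K_T
assert scores.shape == (1, seq_len)   # query-key similarity scores

weights = np.full_like(scores, 1.0 / seq_len)  # stand-in for softmax output
assert np.isclose(weights.sum(), 1.0)

result = weights @ V
assert result.shape == (1, d)         # weighted combination of the values
```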

이 κ³Όμ •μ—λŠ” 이전 퍼즐의 GPU 컀널을 ν™œμš©ν•˜μ—¬ μ΅œμ ν™”ν•˜λŠ” μ—¬λŸ¬ μ—°μ‚° 단계가 ν¬ν•¨λ©λ‹ˆλ‹€.

1. μ „μΉ˜ 컀널 κ΅¬ν˜„

fn transpose_kernel[
    layout_in: Layout,  # Layout for input matrix (seq_len, d)
    layout_out: Layout,  # Layout for output matrix (d, seq_len)
    rows: Int,
    cols: Int,
    dtype: DType = DType.float32,
](
    output: LayoutTensor[dtype, layout_out, MutAnyOrigin],
    inp: LayoutTensor[dtype, layout_in, ImmutAnyOrigin],
):
    """Transpose matrix using shared memory tiling for coalesced access."""
    shared_tile = LayoutTensor[
        dtype,
        Layout.row_major(TRANSPOSE_BLOCK_DIM_XY, TRANSPOSE_BLOCK_DIM_XY),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    local_row = Int(thread_idx.y)
    local_col = Int(thread_idx.x)

    global_row = Int(block_idx.y) * TRANSPOSE_BLOCK_DIM_XY + local_row
    global_col = Int(block_idx.x) * TRANSPOSE_BLOCK_DIM_XY + local_col

    if global_row < rows and global_col < cols:
        shared_tile[local_row, local_col] = inp[global_row, global_col]

    barrier()

    out_row = Int(block_idx.x) * TRANSPOSE_BLOCK_DIM_XY + local_row
    out_col = Int(block_idx.y) * TRANSPOSE_BLOCK_DIM_XY + local_col

    # Store data from shared memory to global memory (coalesced write)
    # Note: we transpose the shared memory access pattern
    if out_row < cols and out_col < rows:
        output[out_row, out_col] = shared_tile[local_col, local_row]


μ „μΉ˜ 컀널은 곡유 λ©”λͺ¨λ¦¬ tiling을 μ‚¬μš©ν•˜μ—¬ 병합 λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄μ„ λ‹¬μ„±ν•©λ‹ˆλ‹€. 핡심 κ΅¬ν˜„ λ‚΄μš©μ€ λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:

핡심 μ „μΉ˜ νŒ¨ν„΄

# 일반 μΈλ±μ‹±μœΌλ‘œ λ‘œλ“œ
shared_tile[local_row, local_col] = inp[global_row, global_col]
barrier()
# λ’€λ°”κΎΌ μΈλ±μ‹±μœΌλ‘œ μ €μž₯ν•˜μ—¬ μ „μΉ˜
output[out_row, out_col] = shared_tile[local_col, local_row]

μ „μΉ˜λŠ” 곡유 λ©”λͺ¨λ¦¬ μ ‘κ·Όμ—μ„œ λ’€λ°”κΎΌ 인덱싱([local_row, local_col] λŒ€μ‹  [local_col, local_row])κ³Ό 좜λ ₯ μœ„μΉ˜ 지정을 μœ„ν•œ λ’€λ°”κΎΌ 블둝 μ’Œν‘œλ₯Ό 톡해 μ΄λ£¨μ–΄μ§‘λ‹ˆλ‹€. 이λ₯Ό 톡해 읽기와 μ“°κΈ° λͺ¨λ‘ 병합을 μœ μ§€ν•˜λ©΄μ„œ μ „μΉ˜ 연산을 μˆ˜ν–‰ν•©λ‹ˆλ‹€.

2. GPU 컀널 μ˜€μΌ€μŠ€νŠΈλ ˆμ΄μ…˜


            # Step 1: Reshape Q from (d,) to (1, d) - no buffer needed
            q_2d = q_tensor.reshape[layout_q_2d]()

            # Step 2: Transpose K from (seq_len, d) to K^T (d, seq_len)
            comptime kernel = transpose_kernel[
                layout_k, layout_k_t, seq_len, d, dtype
            ]
            gpu_ctx.enqueue_function[kernel, kernel](
                k_t,
                k_tensor,
                grid_dim=transpose_blocks_per_grid,
                block_dim=transpose_threads_per_block,
            )

            # Step 3: Compute attention scores using matmul: Q @ K^T = (1, d) @ (d, seq_len) -> (1, seq_len)
            # This computes Q Β· K^T[i] = Q Β· K[i] for each column i of K^T (which is row i of K)
            # Reuse scores_weights_buf as (1, seq_len) for scores
            scores_2d = LayoutTensor[dtype, layout_scores_2d, MutAnyOrigin](
                scores_weights_buf
            )
            comptime kernel2 = matmul_idiomatic_tiled[
                layout_q_2d,
                layout_k_t,
                layout_scores_2d,
                1,
                seq_len,
                d,
                dtype,
            ]
            gpu_ctx.enqueue_function[kernel2, kernel2](
                scores_2d,
                q_2d,
                k_t,
                grid_dim=scores_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

            # Step 4: Reshape scores from (1, seq_len) to (seq_len,) for softmax
            weights = scores_2d.reshape[layout_scores]()

            # Step 5: Apply softmax to get attention weights
            comptime kernel3 = softmax_gpu_kernel[layout_scores, seq_len, dtype]
            gpu_ctx.enqueue_function[kernel3, kernel3](
                weights,
                weights,
                grid_dim=softmax_blocks_per_grid,
                block_dim=softmax_threads,
            )

            # Step 6: Reshape weights from (seq_len,) to (1, seq_len) for final matmul
            weights_2d = weights.reshape[layout_weights_2d]()

            # Step 7: Compute final result using matmul: weights @ V = (1, seq_len) @ (seq_len, d) -> (1, d)
            # Reuse out_tensor reshaped as (1, d) for result
            result_2d = output_tensor.reshape[layout_result_2d]()
            comptime kernel4 = matmul_idiomatic_tiled[
                layout_weights_2d,
                layout_v,
                layout_result_2d,
                1,
                d,
                seq_len,
                dtype,
            ]
            gpu_ctx.enqueue_function[kernel4, kernel4](
                result_2d,
                weights_2d,
                v_tensor,
                grid_dim=result_blocks_per_grid,
                block_dim=matmul_threads_per_block,
            )

GPU μ˜€μΌ€μŠ€νŠΈλ ˆμ΄μ…˜μ€ μ •κ΅ν•œ 컀널 체이닝과 제둜 μΉ΄ν”Ό λ©”λͺ¨λ¦¬ μ΅œμ ν™”λ₯Ό λ³΄μ—¬μ€λ‹ˆλ‹€:

κ³ κΈ‰ λ©”λͺ¨λ¦¬ μ΅œμ ν™” μ „λž΅

# 제둜 μΉ΄ν”Ό reshape - 데이터 이동 없이 ν…μ„œ shape만 μž¬ν•΄μ„
q_2d = q_tensor.reshape[layout_q_2d]()
# 적극적인 버퍼 μž¬μ‚¬μš© - 같은 λ©”λͺ¨λ¦¬, λ‹€λ₯Έ 해석
weights = scores_2d.reshape[layout_scores]()

κ΅¬ν˜„μ€ λ‹€μŒμ„ 톡해 μ΅œλŒ€ λ©”λͺ¨λ¦¬ νš¨μœ¨μ„ λ‹¬μ„±ν•©λ‹ˆλ‹€:

  • 제둜 μΉ΄ν”Ό ν˜•νƒœ λ³€ν™˜: λ©”λͺ¨λ¦¬μ—μ„œ 데이터λ₯Ό μ΄λ™ν•˜μ§€ μ•Šκ³  ν…μ„œ ν˜•νƒœλ₯Ό μž¬ν•΄μ„
  • μ§€λŠ₯적 버퍼 μž¬μ‚¬μš©: λ™μΌν•œ scores_weights_bufκ°€ 점수 \((1,\text{seq_len})\)와 κ°€μ€‘μΉ˜ \((\text{seq_len},)\) 이쀑 μš©λ„λ‘œ ν™œμš©
  • μ΅œμ†Œ ν• λ‹Ή: 단 2개의 μž„μ‹œ λ²„νΌλ‘œ 전체 μ–΄ν…μ…˜ μ—°μ‚° μˆ˜ν–‰
  • λ©”λͺ¨λ¦¬ 병합: λͺ¨λ“  μ—°μ‚°μ—μ„œ 졜적의 λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄ μœ μ§€

μ „λž΅μ  컀널 μž¬μ‚¬μš© νŒ¨ν„΄

  • 3단계 & 7단계: λ‘˜ λ‹€ Puzzle 16의 matmul_idiomatic_tiled ν™œμš©
    • 3단계: \(Q \times K^T\) β†’ μ–΄ν…μ…˜ 점수 계산 \((1,d) \times (d,\text{seq_len}) \rightarrow (1,\text{seq_len})\)
    • 7단계: \(\text{weights} \times V\) β†’ μ΅œμ’… 가쀑 좜λ ₯ \((1,\text{seq_len}) \times (\text{seq_len},d) \rightarrow (1,d)\)
    • 두 μ—°μ‚° λͺ¨λ‘ λ‹€μ–‘ν•œ ν–‰λ ¬ 크기λ₯Ό μ•ˆμ „ν•˜κ²Œ μ²˜λ¦¬ν•˜κΈ° μœ„ν•΄ 경계 검사 포함
  • 5단계: Puzzle 18의 softmax_kernel μ‚¬μš©
    • μ›μ‹œ 점수λ₯Ό μ •κ·œν™”λœ ν™•λ₯  λΆ„ν¬λ‘œ λ³€ν™˜
    • μ΅œλŒ“κ°’ 차감과 병렬 λ¦¬λ•μ…˜μ„ ν†΅ν•œ 수치 μ•ˆμ •μ„± 보μž₯
    • \(\sum_{i} \text{weights}[i] = 1.0\) 보μž₯
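Numerical stability via max subtraction can be sketched in NumPy; the GPU kernel computes the same quantity with a parallel reduction:

```python
import numpy as np

def softmax_stable(x: np.ndarray) -> np.ndarray:
    # Subtracting the max keeps exp() from overflowing without changing the result
    shifted = np.exp(x - x.max())
    return shifted / shifted.sum()

scores = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
w = softmax_stable(scores)
assert np.isclose(w.sum(), 1.0)
assert np.all(np.isfinite(w))
```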

μ΄λŠ” λͺ¨λ“ˆν˜• GPU μ•„ν‚€ν…μ²˜μ˜ 쒋은 μ˜ˆμ‹œμž…λ‹ˆλ‹€: 단일 κ΅¬ν˜„μ²΄κ°€ μ•„λ‹Œ, κ²€μ¦λœ μ΅œμ ν™” 컀널듀을 μ˜€μΌ€μŠ€νŠΈλ ˆμ΄μ…˜ν•˜μ—¬ λ³΅μž‘ν•œ 신경망 연산을 κ΅¬μΆ•ν•©λ‹ˆλ‹€!

핡심 κ΅¬ν˜„ μΈμ‚¬μ΄νŠΈ

λ©”λͺ¨λ¦¬ μ΅œμ ν™” μ „λž΅

적극적인 버퍼 μž¬μ‚¬μš©μœΌλ‘œ λ©”λͺ¨λ¦¬ 할당을 μ΅œμ†Œν™”ν•©λ‹ˆλ‹€:

# 전체 연산에 ν•„μš”ν•œ μž„μ‹œ λ²„νΌλŠ” 단 2개
k_t_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len * d)
scores_weights_buf = gpu_ctx.enqueue_create_buffer[dtype](seq_len)

핡심 μ΅œμ ν™” 포인트:

  • λ™μΌν•œ scores_weights_bufκ°€ ν˜•νƒœ λ³€ν™˜ 연산을 톡해 μ–΄ν…μ…˜ μ μˆ˜μ™€ κ°€μ€‘μΉ˜ λͺ¨λ‘μ— μž¬μ‚¬μš©λ©λ‹ˆλ‹€
  • 제둜 μΉ΄ν”Ό ν…μ„œ ν˜•νƒœ λ³€ν™˜μœΌλ‘œ λΆˆν•„μš”ν•œ 데이터 이동을 μ œκ±°ν•©λ‹ˆλ‹€

컀널 μž¬μ‚¬μš© μ•„ν‚€ν…μ²˜

이 퍼즐은 μ„Έ κ°€μ§€ νŠΉν™”λœ 컀널을 κ²°ν•©ν•˜μ—¬ λͺ¨λ“ˆν˜• 컀널 섀계λ₯Ό λ³΄μ—¬μ€λ‹ˆλ‹€:

  • matmul_idiomatic_tiled (2회 μ‚¬μš©) - \(Q \times K^T\)와 \(\text{weights} \times V\) μ—°μ‚° λͺ¨λ‘λ₯Ό μˆ˜ν–‰
  • softmax_kernel - 병렬 λ¦¬λ•μ…˜μ„ ν™œμš©ν•˜μ—¬ 수치적으둜 μ•ˆμ •μ μΈ μ–΄ν…μ…˜ κ°€μ€‘μΉ˜ 계산
  • transpose_kernel - 병합 λ©”λͺ¨λ¦¬ μ ‘κ·ΌμœΌλ‘œ 효율적인 \(K^T\) 계산

μ•„ν‚€ν…μ²˜μ˜ μž₯점:

  • μ‘°ν•© κ°€λŠ₯μ„±: κ²€μ¦λœ μ»΄ν¬λ„ŒνŠΈλ‘œ λ³΅μž‘ν•œ μ—°μ‚° ꡬ좕
  • μœ μ§€λ³΄μˆ˜μ„±: 각 컀널이 λͺ…ν™•ν•˜κ²Œ μ •μ˜λœ 단일 μ—­ν•  μˆ˜ν–‰
  • μ„±λŠ₯: 이전 퍼즐의 κ³ λ„λ‘œ μ΅œμ ν™”λœ κ΅¬ν˜„ ν™œμš©
  • ν™•μž₯μ„±: λͺ¨λ“ˆν˜• μ„€κ³„λ‘œ 더 큰 μ–΄ν…μ…˜ λ©”μ»€λ‹ˆμ¦˜μœΌλ‘œ ν™•μž₯ 용이

이 κ΅¬ν˜„μ€ μ •κ΅ν•œ 신경망 연산이 단일 κ΅¬ν˜„μ²΄κ°€ μ•„λ‹Œ, 더 λ‹¨μˆœν•˜κ³  잘 κ²€μ¦λœ GPU 컀널듀을 μ˜€μΌ€μŠ€νŠΈλ ˆμ΄μ…˜ν•˜μ—¬ ꡬ좕할 수 μžˆμŒμ„ λ³΄μ—¬μ€λ‹ˆλ‹€.