타일링 버전

κ°œμš”

LayoutTensorλ₯Ό μ‚¬μš©ν•œ 타일링 ν–‰λ ¬ κ³±μ…ˆμœΌλ‘œ μ •λ°© ν–‰λ ¬ \(A\) 와 \(B\) λ₯Ό κ³±ν•˜λŠ” 컀널을 κ΅¬ν˜„ν•˜μ„Έμš”. 큰 행렬을 μž‘μ€ 쑰각(타일)으둜 λ‚˜λˆ„μ–΄ μ²˜λ¦¬ν•˜λŠ” λ°©μ‹μž…λ‹ˆλ‹€.

핡심 κ°œλ…

  • LayoutTensorλ₯Ό μ‚¬μš©ν•œ ν–‰λ ¬ νƒ€μΌλ§μœΌλ‘œ 효율적인 μ—°μ‚°
  • μ μ ˆν•œ λ ˆμ΄μ•„μ›ƒμ„ μ‚¬μš©ν•œ λ©€ν‹° 블둝 쑰율
  • TensorBuilderλ₯Ό ν†΅ν•œ 효율적인 곡유 λ©”λͺ¨λ¦¬ ν™œμš©
  • LayoutTensor 인덱싱을 μ‚¬μš©ν•œ 타일 경계 처리

ꡬ성

  • ν–‰λ ¬ 크기: \(\text{SIZE_TILED} = 9\)
  • 블둝당 μŠ€λ ˆλ“œ 수: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • κ·Έλ¦¬λ“œ 차원: \(3 \times 3\) 블둝
  • 곡유 λ©”λͺ¨λ¦¬: 블둝당 \(\text{TPB} \times \text{TPB}\) LayoutTensor 2개

λ ˆμ΄μ•„μ›ƒ ꡬ성:

  • μž…λ ₯ A: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • μž…λ ₯ B: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • 좜λ ₯: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • 곡유 λ©”λͺ¨λ¦¬: TensorBuilderλ₯Ό μ‚¬μš©ν•œ TPB Γ— TPB LayoutTensor 2개

타일링 μ „λž΅

블둝 ꡬ성

Grid Layout (3Γ—3):           Thread Layout per Block (3Γ—3):
[B00][B01][B02]               [T00 T01 T02]
[B10][B11][B12]               [T10 T11 T12]
[B20][B21][B22]               [T20 T21 T22]

각 블둝은 LayoutTensor 인덱싱을 μ‚¬μš©ν•˜μ—¬ ν•˜λ‚˜μ˜ 타일을 처리

타일 처리 단계

  1. μŠ€λ ˆλ“œ μœ„μΉ˜μ— λŒ€ν•œ μ „μ—­ μΈλ±μŠ€μ™€ 둜컬 인덱슀 계산
  2. A와 B 타일을 μœ„ν•œ 곡유 λ©”λͺ¨λ¦¬ ν• λ‹Ή
  3. 각 타일에 λŒ€ν•΄:
    • ν–‰λ ¬ A와 Bμ—μ„œ 타일 λ‘œλ“œ
    • λΆ€λΆ„ κ³± 계산
    • λ ˆμ§€μŠ€ν„°μ— κ²°κ³Ό λˆ„μ 
  4. μ΅œμ’… λˆ„μ  κ²°κ³Ό 기둝

λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄

Matrix A (8Γ—8)                 Matrix B (8Γ—8)               Matrix C (8Γ—8)
+---+---+---+                  +---+---+---+                +---+---+---+
|T00|T01|T02| ...              |T00|T01|T02| ...            |T00|T01|T02| ...
+---+---+---+                  +---+---+---+                +---+---+---+
|T10|T11|T12|                  |T10|T11|T12|                |T10|T11|T12|
+---+---+---+                  +---+---+---+                +---+---+---+
|T20|T21|T22|                  |T20|T21|T22|                |T20|T21|T22|
+---+---+---+                  +---+---+---+                +---+---+---+
  ...                            ...                          ...

타일 처리 κ³Όμ • (C[T11] 계산 μ˜ˆμ‹œ):
1. A와 Bμ—μ„œ 타일 λ‘œλ“œ:
   +---+      +---+
   |A11| Γ—    |B11|     각 단계 k에 λŒ€ν•΄:
   +---+      +---+     C[T11] += A[row, k] Γ— B[k, col]

2. 타일 이동:
   단계 1      단계 2      단계 3
   A: [T10]    A: [T11]    A: [T12]
   B: [T01]    B: [T11]    B: [T21]

3. 타일 λ‚΄ 각 μŠ€λ ˆλ“œ (i,j)의 μ—°μ‚°:
   C[i,j] = Ξ£ (A[i,k] Γ— B[k,j]), kλŠ” 타일 λ„ˆλΉ„ λ²”μœ„

동기화 ν•„μš” μ‹œμ :
* 타일을 곡유 λ©”λͺ¨λ¦¬μ— λ‘œλ“œν•œ ν›„
* 각 λ‹¨κ³„μ˜ 연산이 λλ‚œ ν›„

μ™„μ„±ν•  μ½”λ“œ

comptime SIZE_TILED = 9
comptime BLOCKS_PER_GRID_TILED = (3, 3)  # each block convers 3x3 elements
comptime THREADS_PER_BLOCK_TILED = (TPB, TPB)
comptime layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)


fn matmul_tiled[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * TPB + thread_idx.y
    tiled_col = block_idx.x * TPB + thread_idx.x
    # FILL ME IN (roughly 20 lines)


전체 파일 보기: problems/p16/p16.mojo

팁
  1. ν‘œμ€€ 인덱싱 κ·œμΉ™μ„ μ‚¬μš©ν•˜μ„Έμš”: local_row = thread_idx.y, local_col = thread_idx.x

  2. μ „μ—­ μœ„μΉ˜ 계산:

    global_row = block_idx.y * TPB + local_row
    

    그리고

    global_col = block_idx.x * TPB + local_col
    

    μ „μ—­ 인덱싱 곡식 μ΄ν•΄ν•˜κΈ°:

    • 각 블둝은 ν–‰λ ¬μ˜ TPB Γ— TPB 타일을 μ²˜λ¦¬ν•©λ‹ˆλ‹€

    • block_idx.yλŠ” ν˜„μž¬ λͺ‡ 번째 블둝 행인지λ₯Ό λ‚˜νƒ€λƒ…λ‹ˆλ‹€ (0, 1, 2…)

    • block_idx.y * TPBλŠ” ν•΄λ‹Ή 블둝 νƒ€μΌμ˜ μ‹œμž‘ ν–‰μž…λ‹ˆλ‹€

    • local_row (0~TPB-1)은 블둝 λ‚΄ μŠ€λ ˆλ“œμ˜ μ˜€ν”„μ…‹μž…λ‹ˆλ‹€

    • λ‘˜μ„ λ”ν•˜λ©΄ 전체 ν–‰λ ¬μ—μ„œμ˜ μ‹€μ œ ν–‰ μœ„μΉ˜κ°€ λ©λ‹ˆλ‹€

      TPB=3 μ˜ˆμ‹œ:

    Block Layout:        Global Matrix (9Γ—9):
    [B00][B01][B02]      [0 1 2 | 3 4 5 | 6 7 8]
    [B10][B11][B12]  β†’   [9 A B | C D E | F G H]
    [B20][B21][B22]      [I J K | L M N | O P Q]
                         β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”
                         [R S T | U V W | X Y Z]
                         [a b c | d e f | g h i]
                         [j k l | m n o | p q r]
                         β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”β€”
                         [s t u | v w x | y z Ξ±]
                         [Ξ² Ξ³ Ξ΄ | Ξ΅ ΞΆ Ξ· | ΞΈ ΞΉ ΞΊ]
                         [Ξ» ΞΌ Ξ½ | ΞΎ ΞΏ Ο€ | ρ Οƒ Ο„]
    
    Thread(1,2) in Block(1,0):
    - block_idx.y = 1, local_row = 1
    - global_row = 1 * 3 + 1 = 4
    - 이 μŠ€λ ˆλ“œλŠ” ν–‰λ ¬μ˜ 4번째 행을 λ‹΄λ‹Ή
    
  3. 곡유 λ©”λͺ¨λ¦¬ ν• λ‹Ή (.fill(0)으둜 사전 μ΄ˆκΈ°ν™”λ¨)

  4. 9Γ—9 μ™„λ²½ν•œ νƒ€μΌλ§μ΄λ―€λ‘œ 경계 검사가 λΆˆν•„μš”!

  5. μ μ ˆν•œ 동기화와 ν•¨κ»˜ 타일 κ°„ κ²°κ³Όλ₯Ό λˆ„μ 

μ½”λ“œ μ‹€ν–‰

μ†”λ£¨μ…˜μ„ ν…ŒμŠ€νŠΈν•˜λ €λ©΄ ν„°λ―Έλ„μ—μ„œ λ‹€μŒ λͺ…λ Ήμ–΄λ₯Ό μ‹€ν–‰ν•˜μ„Έμš”:

pixi run p16 --tiled
pixi run -e amd p16 --tiled
pixi run -e apple p16 --tiled
uv run poe p16 --tiled

퍼즐을 아직 ν’€μ§€ μ•Šμ•˜λ‹€λ©΄ 좜λ ₯은 λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([3672.0, 3744.0, 3816.0, 3888.0, 3960.0, 4032.0, 4104.0, 4176.0, 4248.0, 9504.0, 9738.0, 9972.0, 10206.0, 10440.0, 10674.0, 10908.0, 11142.0, 11376.0, 15336.0, 15732.0, 16128.0, 16524.0, 16920.0, 17316.0, 17712.0, 18108.0, 18504.0, 21168.0, 21726.0, 22284.0, 22842.0, 23400.0, 23958.0, 24516.0, 25074.0, 25632.0, 27000.0, 27720.0, 28440.0, 29160.0, 29880.0, 30600.0, 31320.0, 32040.0, 32760.0, 32832.0, 33714.0, 34596.0, 35478.0, 36360.0, 37242.0, 38124.0, 39006.0, 39888.0, 38664.0, 39708.0, 40752.0, 41796.0, 42840.0, 43884.0, 44928.0, 45972.0, 47016.0, 44496.0, 45702.0, 46908.0, 48114.0, 49320.0, 50526.0, 51732.0, 52938.0, 54144.0, 50328.0, 51696.0, 53064.0, 54432.0, 55800.0, 57168.0, 58536.0, 59904.0, 61272.0])

μ†”λ£¨μ…˜: μˆ˜λ™ 타일링

fn matmul_tiled[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * TPB + local_row
    tiled_col = block_idx.x * TPB + local_col

    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    var acc: output.element_type = 0

    # Iterate over tiles to compute matrix product
    @parameter
    for tile in range((size + TPB - 1) // TPB):
        # Load A tile - global row stays the same, col determined by tile
        if tiled_row < size and (tile * TPB + local_col) < size:
            a_shared[local_row, local_col] = a[
                tiled_row, tile * TPB + local_col
            ]

        # Load B tile - row determined by tile, global col stays the same
        if (tile * TPB + local_row) < size and tiled_col < size:
            b_shared[local_row, local_col] = b[
                tile * TPB + local_row, tiled_col
            ]

        barrier()

        # Matrix multiplication within the tile
        if tiled_row < size and tiled_col < size:

            @parameter
            for k in range(min(Int(TPB), Int(size - tile * TPB))):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write out final result
    if tiled_row < size and tiled_col < size:
        output[tiled_row, tiled_col] = acc


타일링 ν–‰λ ¬ κ³±μ…ˆ κ΅¬ν˜„μ€ μž‘μ€ 타일 \((3 \times 3)\) 을 μ‚¬μš©ν•˜μ—¬ 큰 ν–‰λ ¬ \((9 \times 9)\) 을 효율적으둜 μ²˜λ¦¬ν•˜λŠ” 방법을 λ³΄μ—¬μ€λ‹ˆλ‹€. λ™μž‘ 방식은 λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€:

  1. 곡유 λ©”λͺ¨λ¦¬ ν• λ‹Ή

    Input matrices (9Γ—9) - (3Γ—3) 타일링에 λ”± λ§žλŠ” 크기:
    A = [0  1  2  3  4  5  6  7  8 ]    B = [0  2  4  6  8  10 12 14 16]
        [9  10 11 12 13 14 15 16 17]        [18 20 22 24 26 28 30 32 34]
        [18 19 20 21 22 23 24 25 26]        [36 38 40 42 44 46 48 50 52]
        [27 28 29 30 31 32 33 34 35]        [54 56 58 60 62 64 66 68 70]
        [36 37 38 39 40 41 42 43 44]        [72 74 76 78 80 82 84 86 88]
        [45 46 47 48 49 50 51 52 53]        [90 92 94 96 98 100 102 104 106]
        [54 55 56 57 58 59 60 61 62]        [108 110 112 114 116 118 120 122 124]
        [63 64 65 66 67 68 69 70 71]        [126 128 130 132 134 136 138 140 142]
        [72 73 74 75 76 77 78 79 80]        [144 146 148 150 152 154 156 158 160]
    
    블둝당 곡유 λ©”λͺ¨λ¦¬ (3Γ—3):
    a_shared[TPB, TPB]  b_shared[TPB, TPB]
    
  2. 타일 처리 루프

    타일 수 = 9 // 3 = 3개 (λ‚˜λ¨Έμ§€ 없이 λ”± λ‚˜λˆ μ§!)
    
    각 타일에 λŒ€ν•΄:
    1. A와 Bμ—μ„œ 타일 λ‘œλ“œ
    2. λΆ€λΆ„ κ³± 계산
    3. λ ˆμ§€μŠ€ν„°μ— λˆ„μ 
    
  3. λ©”λͺ¨λ¦¬ λ‘œλ”© νŒ¨ν„΄

    • \((9 \times 9)\) 이 λ”± λ‚˜λˆ μ§€λ―€λ‘œ 경계 검사가 κΈ°μˆ μ μœΌλ‘œλŠ” λΆˆν•„μš”ν•˜μ§€λ§Œ, 방어적 ν”„λ‘œκ·Έλž˜λ°κ³Ό λ‹€λ₯Έ ν–‰λ ¬ 크기에도 λŒ€μ‘ν•  수 μžˆλ„λ‘ ν¬ν•¨ν•©λ‹ˆλ‹€.

         # A 타일 λ‘œλ“œ - μ „μ—­ 행은 κ·ΈλŒ€λ‘œ, 열은 타일에 μ˜ν•΄ κ²°μ •
         if tiled_row < size and (tile * TPB + local_col) < size:
             a_shared[local_row, local_col] = a[
                 tiled_row, tile * TPB + local_col
             ]
      
         # B 타일 λ‘œλ“œ - 행은 타일에 μ˜ν•΄ κ²°μ •, μ „μ—­ 열은 κ·ΈλŒ€λ‘œ
         if (tile * TPB + local_row) < size and tiled_col < size:
             b_shared[local_row, local_col] = b[
                 tile * TPB + local_row, tiled_col
             ]
      
  4. 타일 λ‚΄ μ—°μ‚°

    for k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    • 곡유 λ©”λͺ¨λ¦¬ 뱅크 좩돌 νšŒν”Ό:

      Bank Conflict Free (Good):        Bank Conflicts (Bad):
      Thread0: a_shared[0,k] b_shared[k,0]  Thread0: a_shared[k,0] b_shared[0,k]
      Thread1: a_shared[0,k] b_shared[k,1]  Thread1: a_shared[k,0] b_shared[1,k]
      Thread2: a_shared[0,k] b_shared[k,2]  Thread2: a_shared[k,0] b_shared[2,k]
      ↓                                     ↓
      μ„œλ‘œ λ‹€λ₯Έ 뱅크에 병렬 μ ‘κ·Ό             b_sharedκ°€ μ—΄ μš°μ„ μ΄μ—ˆλ‹€λ©΄
      (a_sharedλŠ” broadcast)               같은 뱅크에 직렬 μ ‘κ·Ό
      

      곡유 λ©”λͺ¨λ¦¬ 뱅크 좩돌 μ„€λͺ…:

      • μ™Όμͺ½ (Good): b_shared[k,threadIdx.x]λŠ” μ„œλ‘œ λ‹€λ₯Έ 뱅크에 μ ‘κ·Όν•˜κ³ , a_shared[0,k]λŠ” λͺ¨λ“  μŠ€λ ˆλ“œμ— λΈŒλ‘œλ“œμΊμŠ€νŠΈ λ©λ‹ˆλ‹€
      • 였λ₯Έμͺ½ (Bad): b_sharedκ°€ μ—΄ μš°μ„ μ΄μ—ˆλ‹€λ©΄ μŠ€λ ˆλ“œλ“€μ΄ λ™μ‹œμ— 같은 뱅크에 μ ‘κ·Όν•˜κ²Œ λ©λ‹ˆλ‹€
      • 핡심: 이것은 μ „μ—­ λ©”λͺ¨λ¦¬ 병합이 μ•„λ‹Œ 곡유 λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄μ— κ΄€ν•œ κ²ƒμž…λ‹ˆλ‹€
      • 뱅크 ꡬ쑰: 곡유 λ©”λͺ¨λ¦¬λŠ” 32개 λ±…ν¬λ‘œ κ΅¬μ„±λ˜μ–΄ 있으며, μ—¬λŸ¬ μŠ€λ ˆλ“œκ°€ λ™μ‹œμ— 같은 λ±…ν¬μ˜ λ‹€λ₯Έ μ£Όμ†Œμ— μ ‘κ·Όν•  λ•Œ 좩돌이 λ°œμƒν•©λ‹ˆλ‹€
  5. 동기화 지점

    barrier() 호좜 μ‹œμ :
    1. 타일 λ‘œλ”© ν›„
    2. 타일 μ—°μ‚° ν›„
    

μ£Όμš” μ„±λŠ₯ νŠΉμ„±:

  • \((3 \times 3)\) νƒ€μΌλ‘œ \((9 \times 9)\) ν–‰λ ¬ 처리 (λ”± λ§žλŠ” 크기!)
  • 곡유 λ©”λͺ¨λ¦¬λ‘œ λΉ λ₯Έ 타일 μ ‘κ·Ό
  • λ³‘ν•©λœ λ©”λͺ¨λ¦¬ μ ‘κ·ΌμœΌλ‘œ μ „μ—­ λ©”λͺ¨λ¦¬ νŠΈλžœμž­μ…˜ μ΅œμ†Œν™”
  • 뱅크 μΆ©λŒμ„ ν”Όν•˜λ„λ‘ μ΅œμ ν™”λœ 곡유 λ©”λͺ¨λ¦¬ λ ˆμ΄μ•„μ›ƒκ³Ό μ ‘κ·Ό νŒ¨ν„΄
  1. 결과 기둝:

    if tiled_row < size and tiled_col < size:
       output[tiled_row, tiled_col] = acc
    
    • λ‹€λ₯Έ ν–‰λ ¬ 크기와 타일링 μ „λž΅μ„ μœ„ν•œ 방어적 경계 검사 포함
    • 좜λ ₯ 행렬에 직접 λŒ€μž…
    • λͺ¨λ“  μŠ€λ ˆλ“œκ°€ μœ νš¨ν•œ κ²°κ³Όλ₯Ό 기둝

μ£Όμš” μ΅œμ ν™”

  1. λ ˆμ΄μ•„μ›ƒ μ΅œμ ν™”:

    • λͺ¨λ“  ν…μ„œμ— ν–‰ μš°μ„  λ ˆμ΄μ•„μ›ƒ
    • 효율적인 2D 인덱싱
  2. λ©”λͺ¨λ¦¬ μ ‘κ·Ό:

    • λ³‘ν•©λœ μ „μ—­ λ©”λͺ¨λ¦¬ λ‘œλ“œ
    • 효율적인 곡유 λ©”λͺ¨λ¦¬ ν™œμš©
  3. μ—°μ‚°:

    • λ ˆμ§€μŠ€ν„° 기반 λˆ„μ , 즉 var acc: output.element_type = 0
    • @parameterλ₯Ό ν†΅ν•œ 컴파일 νƒ€μž„ 루프 μ „κ°œ

이 κ΅¬ν˜„μ€ λ‹€μŒμ„ 톡해 높은 μ„±λŠ₯을 λ‹¬μ„±ν•©λ‹ˆλ‹€:

  • LayoutTensorλ₯Ό ν™œμš©ν•œ 효율적인 λ©”λͺ¨λ¦¬ μ ‘κ·Ό
  • 졜적의 타일링 μ „λž΅
  • μ μ ˆν•œ μŠ€λ ˆλ“œ 동기화
  • μ„Έμ‹¬ν•œ 경계 처리

μ†”λ£¨μ…˜: κ΄€μš©μ  LayoutTensor 타일링

from gpu.memory import async_copy_wait_all
from layout.layout_tensor import copy_dram_to_sram_async

comptime NUM_THREADS = TPB * TPB
comptime BLOCK_DIM_COUNT = 2


fn matmul_idiomatic_tiled[
    layout: Layout, size: UInt
](
    output: LayoutTensor[dtype, layout_tiled, MutAnyOrigin],
    a: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
    b: LayoutTensor[dtype, layout_tiled, ImmutAnyOrigin],
):
    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * TPB + local_row
    tiled_col = block_idx.x * TPB + local_col

    # Get the tile of the output matrix that this thread block is responsible for
    out_tile = output.tile[TPB, TPB](Int(block_idx.y), Int(block_idx.x))
    a_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    b_shared = LayoutTensor[
        dtype,
        Layout.row_major(TPB, TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    var acc: output.element_type = 0

    comptime load_a_layout = Layout.row_major(1, TPB)  # Coalesced loading
    comptime load_b_layout = Layout.row_major(1, TPB)  # Coalesced loading
    # Note: Both matrices stored in same orientation for correct matrix multiplication
    # Transposed loading would be useful if B were pre-transposed in global memory

    @parameter
    for idx in range(size // TPB):  # Perfect division: 9 // 3 = 3 tiles
        # Get tiles from A and B matrices
        a_tile = a.tile[TPB, TPB](Int(block_idx.y), Int(idx))
        b_tile = b.tile[TPB, TPB](Int(idx), Int(block_idx.x))

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads=NUM_THREADS,
            block_dim_count=BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads=NUM_THREADS,
            block_dim_count=BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        # Wait for all async copies to complete
        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        @parameter
        for k in range(TPB):
            acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result to output tile
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc


κ΄€μš©μ  타일링 ν–‰λ ¬ κ³±μ…ˆμ€ Mojo의 LayoutTensor API와 비동기 λ©”λͺ¨λ¦¬ 연산을 ν™œμš©ν•˜μ—¬ κΉ”λ”ν•œ κ΅¬ν˜„μ„ μ œκ³΅ν•©λ‹ˆλ‹€.

핡심 포인트: 이 κ΅¬ν˜„μ€ 두 ν–‰λ ¬ λͺ¨λ‘ 병합 λ‘œλ”©μ„ μ‚¬μš©ν•˜μ—¬ ν‘œμ€€ A Γ— B ν–‰λ ¬ κ³±μ…ˆμ„ μˆ˜ν–‰ν•©λ‹ˆλ‹€.

이 κ΅¬ν˜„μ΄ ν•˜λŠ” 것:

  • ν–‰λ ¬ μ—°μ‚°: ν‘œμ€€ \(A \times B\) κ³±μ…ˆ (\(A \times B^T\) κ°€ μ•„λ‹˜)
  • λ‘œλ”© νŒ¨ν„΄: 두 ν–‰λ ¬ λͺ¨λ‘ Layout.row_major(1, TPB)둜 병합 μ ‘κ·Ό
  • μ—°μ‚°: acc += a_shared[local_row, k] * b_shared[k, local_col]
  • 데이터 λ ˆμ΄μ•„μ›ƒ: λ‘œλ”© μ‹œ μ „μΉ˜ μ—†μŒ - 두 행렬을 같은 λ°©ν–₯으둜 λ‘œλ“œ

이 κ΅¬ν˜„μ΄ ν•˜μ§€ μ•ŠλŠ” 것:

  • \(A \times B^T\) κ³±μ…ˆμ„ μˆ˜ν–‰ν•˜μ§€ μ•ŠμŒ
  • μ „μΉ˜ λ‘œλ”© νŒ¨ν„΄μ„ μ‚¬μš©ν•˜μ§€ μ•ŠμŒ
  • 볡사 κ³Όμ •μ—μ„œ 데이터λ₯Ό μ „μΉ˜ν•˜μ§€ μ•ŠμŒ

\((9 \times 9)\) ν–‰λ ¬ ν¬κΈ°μ—μ„œλŠ” μ™„λ²½ν•œ 타일링이 이루어져 λͺ¨λ“  경계 검사가 λΆˆν•„μš”ν•©λ‹ˆλ‹€:

  1. LayoutTensor 타일 API

    out_tile = output.tile[TPB, TPB](block_idx.y, block_idx.x)
    a_tile = a.tile[TPB, TPB](block_idx.y, idx)
    b_tile = b.tile[TPB, TPB](idx, block_idx.x)
    

    μˆ˜λ™ μ’Œν‘œ 계산 없이 β€œ(block_idx.y, block_idx.x) μœ„μΉ˜μ˜ 타일을 κ°€μ Έμ˜¨λ‹€β€œλ₯Ό 직접 ν‘œν˜„ν•©λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©μ€ λ¬Έμ„œλ₯Ό μ°Έκ³ ν•˜μ„Έμš”.

  2. 비동기 λ©”λͺ¨λ¦¬ μ—°μ‚°

    copy_dram_to_sram_async[
       thread_layout = load_a_layout,
       num_threads = NUM_THREADS,
       block_dim_count = BLOCK_DIM_COUNT
    ](a_shared,a_tile)
    copy_dram_to_sram_async[
       thread_layout = load_b_layout,
       num_threads = NUM_THREADS,
       block_dim_count = BLOCK_DIM_COUNT
    ](b_shared, b_tile)
    async_copy_wait_all()
    

    이 연산듀은:

    • λ ˆμ§€μŠ€ν„°λ₯Ό μš°νšŒν•˜λŠ” μ „μš© 볡사 엔진을 μ‚¬μš©ν•˜μ—¬ μ—°μ‚°κ³Ό λ©”λͺ¨λ¦¬ μ „μ†‘μ˜ 쀑첩을 κ°€λŠ₯ν•˜κ²Œ ν•©λ‹ˆλ‹€ (copy_dram_to_sram_async μ°Έκ³ )
    • 졜적의 λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄μ„ μœ„ν•œ νŠΉν™”λœ μŠ€λ ˆλ“œ λ ˆμ΄μ•„μ›ƒμ„ μ‚¬μš©ν•©λ‹ˆλ‹€
    • μˆ˜λ™ λ©”λͺ¨λ¦¬ μ΄ˆκΈ°ν™”κ°€ λΆˆν•„μš”ν•©λ‹ˆλ‹€
    • μ€‘μš”:
      • ν‘œμ€€ GPU λ‘œλ“œλŠ” 이미 λΉ„λ™κΈ°μ μž…λ‹ˆλ‹€. 이 ν•¨μˆ˜λ“€μ€ 더 λ‚˜μ€ λ¦¬μ†ŒμŠ€ ν™œμš©κ³Ό λ ˆμ§€μŠ€ν„° 우회λ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€
      • copy_dram_to_sram_asyncλŠ” 기본적으둜 1D μŠ€λ ˆλ“œ 블둝(block_dim.y == block_dim.z == 1)을 κ°€μ •ν•˜λ©°, 별도 지정이 μ—†μœΌλ©΄ μŠ€λ ˆλ“œ λΈ”λ‘μ˜ λͺ¨λ“  μŠ€λ ˆλ“œκ°€ 볡사에 μ°Έμ—¬ν•©λ‹ˆλ‹€. λ‹€μŒμ„ μ§€μ •ν•˜μ—¬ 이 λ™μž‘μ„ λ³€κ²½ν•  수 μžˆμŠ΅λ‹ˆλ‹€:
        • block_dim_count: μŠ€λ ˆλ“œ λΈ”λ‘μ˜ 차원 수 (2D μŠ€λ ˆλ“œ 블둝 THREADS_PER_BLOCK_TILED = (TPB, TPB)의 경우 2)
        • num_threads: μŠ€λ ˆλ“œ λΈ”λ‘μ˜ μŠ€λ ˆλ“œ 수 (TPB*TPB == 9)
  3. μ΅œμ ν™”λœ λ©”λͺ¨λ¦¬ μ ‘κ·Ό λ ˆμ΄μ•„μ›ƒ

    comptime load_a_layout = Layout.row_major(1, TPB)    # 병합 λ‘œλ”©
    comptime load_b_layout = Layout.row_major(1, TPB)    # 병합 λ‘œλ”©
    # μ°Έκ³ : ν‘œμ€€ A Γ— B κ³±μ…ˆμ—μ„œ 두 ν–‰λ ¬ λͺ¨λ‘ 같은 λ ˆμ΄μ•„μ›ƒμ„ μ‚¬μš©
    

    ν˜„μž¬ κ΅¬ν˜„μ˜ λ©”λͺ¨λ¦¬ μ ‘κ·Ό 뢄석:

    두 ν–‰λ ¬ λͺ¨λ‘ μ „μ—­ λ©”λͺ¨λ¦¬μ—μ„œ 병합 λ‘œλ”©μ„ μœ„ν•΄ Layout.row_major(1, TPB)λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€:

    • load_a_layout: μŠ€λ ˆλ“œλ“€μ΄ ν˜‘λ ₯ν•˜μ—¬ ν–‰λ ¬ A ν–‰μ˜ 연속 μ›μ†Œλ₯Ό λ‘œλ“œ
    • load_b_layout: μŠ€λ ˆλ“œλ“€μ΄ ν˜‘λ ₯ν•˜μ—¬ ν–‰λ ¬ B ν–‰μ˜ 연속 μ›μ†Œλ₯Ό λ‘œλ“œ
    • 핡심: μŠ€λ ˆλ“œ λ ˆμ΄μ•„μ›ƒμ€ 볡사 μ‹œ μŠ€λ ˆλ“œ κ°„ ν˜‘λ ₯ 방식을 κ²°μ •ν•˜λ©°, μ΅œμ’… 데이터 λ ˆμ΄μ•„μ›ƒκ³ΌλŠ” λ³„κ°œμž…λ‹ˆλ‹€

    μ‹€μ œ μ—°μ‚° νŒ¨ν„΄ (A Γ— Bμž„μ„ 증λͺ…):

    # ν˜„μž¬ κ΅¬ν˜„μ˜ μ‹€μ œ μ—°μ‚°
    acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    # 이것은 C[i,j] = Ξ£(A[i,k] * B[k,j])에 ν•΄λ‹Ή
    # 즉, ν‘œμ€€ ν–‰λ ¬ κ³±μ…ˆ A Γ— B
    

    두 행렬이 같은 병합 λ‘œλ”© νŒ¨ν„΄μ„ μ‚¬μš©ν•˜λŠ” 이유:

    μ „μ—­ λ©”λͺ¨λ¦¬μ—μ„œ 타일 λ‘œλ”©:
    - Matrix A 타일: μŠ€λ ˆλ“œλ“€μ΄ A[block_row, k], A[block_row, k+1], A[block_row, k+2]... λ‘œλ“œ (연속)
    - Matrix B 타일: μŠ€λ ˆλ“œλ“€μ΄ B[k, block_col], B[k, block_col+1], B[k, block_col+2]... λ‘œλ“œ (연속)
    
    Layout.row_major(1, TPB)둜 두 νŒ¨ν„΄ λͺ¨λ‘ 병합
    

    μ„Έ κ°€μ§€ λ³„κ°œμ˜ λ©”λͺ¨λ¦¬ 고렀사항:

    1. μ „μ—­β†’κ³΅μœ  병합: Layout.row_major(1, TPB)둜 병합 μ „μ—­ λ©”λͺ¨λ¦¬ μ ‘κ·Ό 보μž₯
    2. 곡유 λ©”λͺ¨λ¦¬ μ—°μ‚°: a_shared[local_row, k] * b_shared[k, local_col]둜 뱅크 좩돌 νšŒν”Ό
    3. ν–‰λ ¬ μ—°μ‚°: μ—°μ‚° νŒ¨ν„΄μ΄ A Γ— Bλ₯Ό κ²°μ • (A Γ— B^Tκ°€ μ•„λ‹˜)
  4. μ™„λ²½ν•œ νƒ€μΌλ§μœΌλ‘œ 경계 검사 λΆˆν•„μš”

    @parameter
    for idx in range(size // TPB):  # λ‚˜λ¨Έμ§€ μ—†λŠ” λ‚˜λˆ—μ…ˆ: 9 // 3 = 3
    

    \((9 \times 9)\) ν–‰λ ¬κ³Ό \((3 \times 3)\) νƒ€μΌμ—μ„œλŠ” λͺ¨λ“  타일이 μ •ν™•νžˆ 꽉 μ°¨κΈ° λ•Œλ¬Έμ— 경계 검사가 ν•„μš” μ—†μŠ΅λ‹ˆλ‹€!

  5. 방어적 경계 검사λ₯Ό ν¬ν•¨ν•œ κΉ”λ”ν•œ 타일 처리

    # μ™„λ²½ν•œ νƒ€μΌλ§μ—μ„œλ„ 방어적 경계 검사 포함
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc
    

    \((9 \times 9)\) 의 μ™„λ²½ν•œ νƒ€μΌλ§μ—μ„œλŠ” 이 경계 검사가 기술적으둜 λΆˆν•„μš”ν•˜μ§€λ§Œ, 방어적 ν”„λ‘œκ·Έλž˜λ°κ³Ό λ‹€λ₯Έ ν–‰λ ¬ ν¬κΈ°μ™€μ˜ 일관성을 μœ„ν•΄ ν¬ν•¨ν•©λ‹ˆλ‹€.

μ„±λŠ₯ 고렀사항

κ΄€μš©μ  κ΅¬ν˜„μ€ νƒ€μΌλ§μ˜ μ„±λŠ₯ 이점을 μœ μ§€ν•˜λ©΄μ„œ 더 κΉ”λ”ν•œ 좔상화λ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€:

  1. λ©”λͺ¨λ¦¬ μ§€μ—­μ„±: 타일링을 톡해 곡간적, μ‹œκ°„μ  지역성을 ν™œμš©
  2. 병합 μ ‘κ·Ό: νŠΉν™”λœ λ‘œλ“œ λ ˆμ΄μ•„μ›ƒμœΌλ‘œ 병합 λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄ 보μž₯
  3. μ—°μ‚°-λ©”λͺ¨λ¦¬ 쀑첩: 비동기 λ©”λͺ¨λ¦¬ 연산을 ν†΅ν•œ 쀑첩 κ°€λŠ₯
  4. 곡유 λ©”λͺ¨λ¦¬ 효율: λΆˆν•„μš”ν•œ 곡유 λ©”λͺ¨λ¦¬ μ΄ˆκΈ°ν™” μ—†μŒ
  5. λ ˆμ§€μŠ€ν„° μ••λ ₯: 졜적의 μ—°μ‚° μ²˜λ¦¬λŸ‰μ„ μœ„ν•œ λˆ„μ  λ ˆμ§€μŠ€ν„° μ‚¬μš©

이 κ΅¬ν˜„μ€ κ³ μˆ˜μ€€ μΆ”μƒν™”λ‘œλ„ μ„±λŠ₯ μ €ν•˜ 없이 λ³΅μž‘ν•œ GPU μ•Œκ³ λ¦¬μ¦˜μ„ ν‘œν˜„ν•  수 μžˆμŒμ„ λ³΄μ—¬μ€λ‹ˆλ‹€. κ³ μˆ˜μ€€μ˜ ν‘œν˜„λ ₯κ³Ό μ €μˆ˜μ€€μ˜ μ„±λŠ₯ μ œμ–΄λ₯Ό κ²°ν•©ν•˜λŠ” Mojo의 철학을 잘 λ³΄μ—¬μ£ΌλŠ” μ˜ˆμ‹œμž…λ‹ˆλ‹€.

μˆ˜λ™ νƒ€μΌλ§κ³Όμ˜ μ£Όμš” 차이점

κΈ°λŠ₯μˆ˜λ™ Tilingκ΄€μš©μ  Tiling
λ©”λͺ¨λ¦¬ 접근경계 검사가 μžˆλŠ” 직접 인덱싱LayoutTensor 타일 API
타일 λ‘œλ”©μ›μ†Œλ³„ λͺ…μ‹œμ  λ³΅μ‚¬μ „μš© 볡사 μ—”μ§„μ˜ 벌크 전솑
곡유 λ©”λͺ¨λ¦¬μˆ˜λ™ μ΄ˆκΈ°ν™” (방어적)볡사 ν•¨μˆ˜κ°€ 관리
μ½”λ“œ λ³΅μž‘λ„λͺ…μ‹œμ  μΈλ±μ‹±μœΌλ‘œ λ‹€μ†Œ μž₯ν™©κ³ μˆ˜μ€€ API둜 더 κ°„κ²°
경계 κ²€μ‚¬λ‘œλ”©κ³Ό μ—°μ‚° 쀑 λ‹€μˆ˜μ˜ κ²€μ‚¬μ΅œμ’… 기둝 μ‹œ 단일 방어적 검사
ν–‰λ ¬ λ°©ν–₯A와 B λͺ¨λ‘ 같은 λ°©ν–₯ (ν‘œμ€€ A Γ— B)A와 B λͺ¨λ‘ 같은 λ°©ν–₯ (ν‘œμ€€ A Γ— B)
μ„±λŠ₯λ©”λͺ¨λ¦¬ νŒ¨ν„΄μ˜ λͺ…μ‹œμ  μ œμ–΄λ ˆμ§€μŠ€ν„° 우회λ₯Ό ν¬ν•¨ν•œ μ΅œμ ν™”λœ λ ˆμ΄μ•„μ›ƒ

κ΄€μš©μ  μ ‘κ·Ό 방식은 λ‹¨μˆœνžˆ 더 깔끔할 뿐 μ•„λ‹ˆλΌ, νŠΉν™”λœ λ©”λͺ¨λ¦¬ λ ˆμ΄μ•„μ›ƒκ³Ό 비동기 μ—°μ‚° 덕뢄에 μ„±λŠ₯도 더 쒋을 수 μžˆμŠ΅λ‹ˆλ‹€.

μ°Έκ³ : μ „μΉ˜ λ‘œλ”©μ€ μ–Έμ œ μœ μš©ν• κΉŒ?

ν˜„μž¬ κ΅¬ν˜„μ€ μ „μΉ˜ λ‘œλ”©μ„ μ‚¬μš©ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€. 이 μ„Ήμ…˜μ€ λ ˆμ΄μ•„μ›ƒ μ‹œμŠ€ν…œμœΌλ‘œ ν•  수 μžˆλŠ” 것을 보여주기 μœ„ν•œ ꡐ윑적 λ‚΄μš©μž…λ‹ˆλ‹€.

ν˜„μž¬ κ΅¬ν˜„ μš”μ•½:

  • 두 ν–‰λ ¬ λͺ¨λ‘ Layout.row_major(1, TPB) μ‚¬μš©
  • ν‘œμ€€ A Γ— B κ³±μ…ˆ μˆ˜ν–‰
  • 볡사 쀑 데이터 μ „μΉ˜ μ—†μŒ

μ „μΉ˜ λ‘œλ”©μ„ μ‚¬μš©ν•˜λŠ” ꡐ윑적 μ‹œλ‚˜λ¦¬μ˜€:

이 퍼즐은 두 ν–‰λ ¬ λͺ¨λ‘ ν‘œμ€€ 병합 λ‘œλ”©μ„ μ‚¬μš©ν•˜μ§€λ§Œ, λ ˆμ΄μ•„μ›ƒ μ‹œμŠ€ν…œμ˜ μœ μ—°μ„±μ€ λ‹€λ₯Έ μ‹œλ‚˜λ¦¬μ˜€μ—μ„œ κ°•λ ₯ν•œ μ΅œμ ν™”λ₯Ό κ°€λŠ₯ν•˜κ²Œ ν•©λ‹ˆλ‹€:

# μ˜ˆμ‹œ: A Γ— Bλ₯Ό κ³„μ‚°ν•˜κΈ° μœ„ν•΄ 사전 μ „μΉ˜λœ ν–‰λ ¬ B^Tλ₯Ό λ‘œλ“œ
# (ν˜„μž¬ κ΅¬ν˜„μ—μ„œλŠ” μ΄λ ‡κ²Œ ν•˜μ§€ μ•ŠμŒ)
comptime load_b_layout = Layout.row_major(TPB, 1)   # B^Tλ₯Ό 병합 μ ‘κ·ΌμœΌλ‘œ λ‘œλ“œ
comptime store_b_layout = Layout.row_major(1, TPB)  # 곡유 λ©”λͺ¨λ¦¬μ— B둜 μ €μž₯
copy_dram_to_sram_async[src_thread_layout=load_b_layout, dst_thread_layout=store_b_layout](b_shared, b_tile)

μ „μΉ˜ λ‘œλ”©μ˜ ν™œμš© 사둀 (이 νΌμ¦μ—μ„œλŠ” μ‚¬μš©ν•˜μ§€ μ•ŠμŒ):

  1. 이미 μ „μΉ˜λœ μž…λ ₯ ν–‰λ ¬: \(B\) κ°€ μ „μ—­ λ©”λͺ¨λ¦¬μ— μ „μΉ˜ μƒνƒœλ‘œ μ €μž₯λ˜μ–΄ μžˆλŠ” 경우
  2. λ‹€λ₯Έ μ•Œκ³ λ¦¬μ¦˜: \(A^T \times B\), \(A \times B^T\), λ˜λŠ” \(A^T \times B^T\) 계산
  3. λ©”λͺ¨λ¦¬ λ ˆμ΄μ•„μ›ƒ λ³€ν™˜: ν–‰ μš°μ„ κ³Ό μ—΄ μš°μ„  λ ˆμ΄μ•„μ›ƒ κ°„ λ³€ν™˜
  4. 별도 μ „μΉ˜ μ—°μ‚° 없이 λ‘œλ“œ: ν•„μš”ν•œ λ°©ν–₯으둜 데이터λ₯Ό 직접 λ‘œλ“œ

핡심 ꡬ뢄:

  • ν˜„μž¬ κ΅¬ν˜„: 두 ν–‰λ ¬ λͺ¨λ‘ ν‘œμ€€ \(A \times B\) κ³±μ…ˆμ— Layout.row_major(1, TPB) μ‚¬μš©
  • μ „μΉ˜ λ‘œλ”© μ˜ˆμ‹œ: 이미 μ „μΉ˜λœ λ°μ΄ν„°λ‚˜ λ‹€λ₯Έ ν–‰λ ¬ 연산을 μ²˜λ¦¬ν•  λ•Œ λ‹€λ₯Έ λ ˆμ΄μ•„μ›ƒ μ‚¬μš©

이것은 Mojo의 철학을 λ³΄μ—¬μ€λ‹ˆλ‹€: 일반적인 κ²½μš°μ— κ³ μˆ˜μ€€ 좔상화λ₯Ό μœ μ§€ν•˜λ©΄μ„œλ„, ν•„μš”ν•  λ•Œ μ €μˆ˜μ€€ μ œμ–΄λ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€.


μš”μ•½: 핡심 정리

κ΄€μš©μ  타일링 κ΅¬ν˜„μ΄ μ‹€μ œλ‘œ ν•˜λŠ” 것:

  1. ν–‰λ ¬ μ—°μ‚°: ν‘œμ€€ A Γ— B κ³±μ…ˆ
  2. λ©”λͺ¨λ¦¬ λ‘œλ”©: 두 ν–‰λ ¬ λͺ¨λ‘ Layout.row_major(1, TPB)둜 병합 μ ‘κ·Ό
  3. μ—°μ‚° νŒ¨ν„΄: acc += a_shared[local_row, k] * b_shared[k, local_col]
  4. 데이터 λ ˆμ΄μ•„μ›ƒ: λ‘œλ”© μ‹œ μ „μΉ˜ μ—†μŒ

이것이 졜적인 이유:

  • 병합 μ „μ—­ λ©”λͺ¨λ¦¬ μ ‘κ·Ό: Layout.row_major(1, TPB)둜 효율적인 λ‘œλ”© 보μž₯
  • 뱅크 좩돌 νšŒν”Ό: 곡유 λ©”λͺ¨λ¦¬ μ ‘κ·Ό νŒ¨ν„΄μ΄ μΆ©λŒμ„ λ°©μ§€
  • ν‘œμ€€ μ•Œκ³ λ¦¬μ¦˜: κ°€μž₯ 일반적인 ν–‰λ ¬ κ³±μ…ˆ νŒ¨ν„΄μ„ κ΅¬ν˜„