Puzzle 33: Tensor Core Operations

Introduction

Welcome to the final frontier of GPU matrix multiplication optimization! In this puzzle, we’ll explore Tensor Cores - specialized hardware units designed to accelerate mixed-precision matrix operations at unprecedented speeds.

Building on everything we’ve learned so far, especially from Puzzle 16’s idiomatic tiled matrix multiplication, we’ll see how modern GPUs provide dedicated silicon to make matrix operations blazingly fast.

What are tensor cores?

Tensor Cores (also known as Matrix Cores on AMD hardware) are specialized processing units that can perform mixed-precision matrix-matrix operations in a single instruction. These units are available on modern GPU architectures:

  • NVIDIA: Tensor Cores (Volta, Turing, Ampere, Hopper)
  • AMD: Matrix Cores (CDNA/CDNA2/CDNA3 architectures)

Think of them as hardware-accelerated GEMM (General Matrix Multiply) engines built directly into the GPU.

Key characteristics

  • Warp-level operations: Each instruction operates on data from an entire warp (32 threads on NVIDIA, 32 or 64 on AMD)
  • Fixed tile sizes: Operations work on specific matrix fragment sizes (e.g., 16×8×8 for FP32)
  • Mixed precision: Can mix input and output precisions for optimal performance
  • Massive throughput: Can achieve 10-100x speedup over regular compute cores for matrix operations

From tiled to tensor cores

Let’s trace our journey from basic matrix multiplication to Tensor Cores:

  1. Puzzle 16: We learned idiomatic tiled matrix multiplication using shared memory
  2. Shared memory optimization: We used copy_dram_to_sram_async for efficient memory transfers
  3. Thread cooperation: We coordinated warps using barriers and async operations
  4. Now: We’ll use specialized hardware (Tensor Cores) to accelerate the core computation

The tensor core programming model

Tensor Cores expose a different programming paradigm:

Traditional compute core approach

# Each thread computes one element
acc += a_shared[local_row, k] * b_shared[k, local_col]

Tensor core approach

# Entire warp cooperates on matrix fragments
a_reg = mma_op.load_a(A_mma_tile)           # Load 16×8 fragment
b_reg = mma_op.load_b(B_mma_tile)           # Load 8×8 fragment
c_reg = mma_op.load_c(C_mma_tile)           # Load 16×8 accumulator
d_reg = mma_op.mma_op(a_reg, b_reg, c_reg)  # D = A×B + C
mma_op.store_d(C_mma_tile, d_reg)           # Store result

Tensor core API in Mojo

Mojo provides a clean interface to Tensor Cores through the TensorCore type:

from layout.tensor_core import TensorCore

# Create a Tensor Core operator for specific tile sizes
mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

# Core operations:
# - load_a(): Load matrix A fragment from shared memory
# - load_b(): Load matrix B fragment from shared memory
# - load_c(): Load matrix C fragment (accumulator)
# - mma_op(): Perform D = A×B + C operation
# - store_d(): Store result fragment to memory

Advanced features: The TensorCore API also supports quantized operations, different swizzle patterns for memory access optimization, and mixed-precision arithmetic. For complete documentation of all supported shapes, data types, and methods, see the official TensorCore API reference.

Matrix fragment sizes

The TensorCore API supports different shapes and data types depending on the GPU hardware:

NVIDIA GPUs:

  • float32: 16×8×8 or 16×8×4
  • half-precision: 16×8×16
  • float8: 16×8×32

AMD GPUs:

  • float32: 16×16×4
  • half-precision: 16×16×16 or 32×32×8

This puzzle uses FP32 with 16×8×8 fragments:

  • MMA_M = 16: Matrix A height (and output height)
  • MMA_N = 8: Matrix B width (and output width)
  • MMA_K = 8: Inner dimension (A width = B height)

What is MMA? MMA stands for “Mixed-precision Matrix-Multiply-Accumulate” - the fundamental operation that Tensor Cores perform. Each MMA instruction computes: D = A × B + C where A, B, C, and D are matrix fragments.

Fragment visualization:

A fragment (16×8)  ×  B fragment (8×8)  +  C fragment (16×8)  =  D fragment (16×8)

    16 rows             8 rows               16 rows              16 rows
    8 cols              8 cols               8 cols               8 cols
      |                   |                    |                    |
   [A data]         ×   [B data]         +   [C data]         =  [D result]

This means each Tensor Core instruction computes a 16×8 output tile by multiplying a 16×8 tile from A with an 8×8 tile from B, then adding it to the existing 16×8 accumulator.
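In equation form, each Tensor Core instruction in this puzzle's FP32 configuration performs

\[
D_{16 \times 8} = A_{16 \times 8}\, B_{8 \times 8} + C_{16 \times 8},
\]

which amounts to \(16 \times 8 \times 8 = 1024\) multiply-add operations issued as a single warp-wide instruction.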

Warp organization for tensor cores

What is a warp? A warp is a group of threads (32 on NVIDIA, 32 or 64 on AMD) that execute instructions together in lockstep. Tensor Cores require all threads in a warp to cooperate on a single matrix operation.

Why warp-level? Unlike regular operations where each thread works independently, Tensor Cores need the entire warp to collectively load matrix fragments, perform the MMA operation, and store results.

Since Tensor Cores operate at warp-level, we need to organize our threads differently:

# Calculate warp coordinates within the block
warp_id = thread_idx.x // WARP_SIZE
warps_in_n = BN // WN  # Number of warps along N dimension
warps_in_m = BM // WM  # Number of warps along M dimension
warp_y = warp_id // warps_in_n  # Warp's row
warp_x = warp_id % warps_in_n   # Warp's column

# Each warp handles a WM×WN tile of the output
C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

Warp organization example (with BM=128, BN=64, WM=32, WN=32):

Block (128×64) contains 8 warps arranged as:

    32 cols    32 cols
     |          |
[  Warp 0  ][  Warp 1  ]  ← 32 rows each
[  Warp 2  ][  Warp 3  ]  ← 32 rows each
[  Warp 4  ][  Warp 5  ]  ← 32 rows each
[  Warp 6  ][  Warp 7  ]  ← 32 rows each

Total: 4×2 = 8 warps, each handling 32×32 output region
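
If it helps to see the index math by itself, here is a minimal host-side sketch (plain Mojo, no GPU imports). The thread index 165 is just an illustrative value, and WARP_SIZE = 32 is assumed as on NVIDIA GPUs:

fn main():
    alias WARP_SIZE = 32        # assumed NVIDIA warp size; 64 on AMD CDNA
    alias BM = 4 * WARP_SIZE    # 128
    alias BN = 2 * WARP_SIZE    # 64
    alias WM = WARP_SIZE        # 32
    alias WN = WARP_SIZE        # 32

    var warps_in_n = BN // WN   # 2 warps along N
    var warps_in_m = BM // WM   # 4 warps along M

    # Example: the thread with thread_idx.x == 165 belongs to warp 5
    var warp_id = 165 // WARP_SIZE          # 5
    var warp_y = warp_id // warps_in_n      # 5 // 2 = 2 (warp row)
    var warp_x = warp_id % warps_in_n       # 5 %  2 = 1 (warp column)

    print("warp grid:", warps_in_m, "x", warps_in_n)          # 4 x 2
    print("warp", warp_id, "-> row", warp_y, "col", warp_x)   # matches Warp 5 in the diagram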

Memory hierarchy with tensor cores

Tensor Cores add another layer to our memory optimization:

  1. Global Memory → Shared Memory: Use copy_dram_to_sram_async (from Puzzle 16)
  2. Shared Memory → Register Fragments: Use mma_op.load_a/load_b
  3. Computation: Use mma_op.mma_op on register fragments
  4. Register Fragments → Global Memory: Use mma_op.store_d

The challenge

Your task is to complete the tensor_core_matrix_multiplication function. The skeleton builds on the tiled approach but uses actual Tensor Core hardware operations.

Key requirements

  1. Use the actual Tensor Core API: Don’t simulate - use real mma_op.load_a(), mma_op.mma_op(), etc.
  2. Maintain correctness: Your result must match the CPU reference implementation
  3. Proper warp coordination: Handle multiple warps per block correctly (works on both NVIDIA and AMD)
  4. Memory efficiency: Use the same async copy patterns from Puzzle 16
  5. Cross-platform compatibility: Ensure tiling parameters are multiples of WARP_SIZE

Configuration

  • Matrix size: \(\text{SIZE} = 1024\)
  • Block tiling: \(\text{BM} = 128, \text{BN} = 64, \text{BK} = 32\)
  • Warp tiling: \(\text{WM} = 32, \text{WN} = 32\) (multiples of WARP_SIZE)
  • MMA fragments: \(16 \times 8 \times 8\) for FP32
  • Threads per block: \(8 \times \text{WARP_SIZE}\) (8 warps per block)
  • Grid dimensions: Covers full matrix with block tiles

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE)
  • Output C: Layout.row_major(SIZE, SIZE)
  • Shared Memory: Block-sized tiles with async copy operations
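
Putting these numbers together, the launch configuration works out as in this small host-side sketch (assuming WARP_SIZE = 32, as on NVIDIA GPUs; the names mirror the aliases in the skeleton further below):

fn main():
    alias SIZE = 1024
    alias WARP_SIZE = 32             # assumed; AMD CDNA GPUs use 64
    alias BM = 4 * WARP_SIZE         # 128 rows per block tile
    alias BN = 2 * WARP_SIZE         # 64 columns per block tile

    var threads_per_block = 8 * WARP_SIZE        # 256 threads = 8 warps
    var grid_x = (SIZE + BN - 1) // BN           # 16 block tiles along N
    var grid_y = (SIZE + BM - 1) // BM           # 8 block tiles along M

    print("grid:", grid_x, "x", grid_y)              # 16 x 8
    print("threads per block:", threads_per_block)   # 256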

Breaking down the challenge

In this puzzle, you’ll transform the idiomatic tiled matrix multiplication from Puzzle 16 into a Tensor Core implementation. Let’s break this down step by step:

Step 1: Understanding your tiled baseline

The puzzle provides a complete idiomatic tiled implementation as your reference:

fn matmul_idiomatic_tiled[
    layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    # Use block_dim to get actual tile size dynamically
    var tile_size_x = block_dim.x
    var tile_size_y = block_dim.y

    local_row = thread_idx.y
    local_col = thread_idx.x
    tiled_row = block_idx.y * tile_size_y + local_row
    tiled_col = block_idx.x * tile_size_x + local_col

    # Get the tile of the output matrix that this thread block is responsible for
    out_tile = output.tile[TILE_SIZE, TILE_SIZE](block_idx.y, block_idx.x)
    a_shared = tb[dtype]().row_major[TILE_SIZE, TILE_SIZE]().shared().alloc()
    b_shared = tb[dtype]().row_major[TILE_SIZE, TILE_SIZE]().shared().alloc()

    var acc: output.element_type = 0

    alias load_a_layout = Layout.row_major(1, TILE_SIZE)  # Coalesced loading
    alias load_b_layout = Layout.row_major(1, TILE_SIZE)  # Coalesced loading
    # Note: Both matrices stored in same orientation for correct matrix multiplication
    # Transposed loading would be useful if B were pre-transposed in global memory

    for idx in range(size // TILE_SIZE):  # Iterate over K tiles
        # Get tiles from A and B matrices
        a_tile = a.tile[TILE_SIZE, TILE_SIZE](block_idx.y, idx)
        b_tile = b.tile[TILE_SIZE, TILE_SIZE](idx, block_idx.x)

        # Asynchronously copy tiles to shared memory with consistent orientation
        copy_dram_to_sram_async[
            thread_layout=load_a_layout,
            num_threads = TILE_SIZE * TILE_SIZE,
            block_dim_count=BLOCK_DIM_COUNT,
        ](a_shared, a_tile)
        copy_dram_to_sram_async[
            thread_layout=load_b_layout,
            num_threads = TILE_SIZE * TILE_SIZE,
            block_dim_count=BLOCK_DIM_COUNT,
        ](b_shared, b_tile)

        async_copy_wait_all()
        barrier()

        # Compute partial matrix multiplication for this tile
        for k in range(TILE_SIZE):
            if (
                local_row < TILE_SIZE
                and local_col < TILE_SIZE
                and k < TILE_SIZE
            ):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write final result to output tile
    if tiled_row < size and tiled_col < size:
        out_tile[local_row, local_col] = acc


What this baseline does:

  • Correctness: This implementation works perfectly and passes all tests
  • Thread cooperation: Uses copy_dram_to_sram_async for efficient memory transfers
  • Shared memory: Coordinates threads with barriers and async operations
  • Tiled computation: Each thread computes one output element using shared memory tiles

Step 2: Your tensor core mission

Transform the above approach using specialized hardware acceleration:

  • From: Thread-level computation → To: Warp-level matrix fragments
  • From: Standard FP32 arithmetic → To: Hardware-accelerated GEMM operations
  • From: Individual element results → To: 16×8 matrix fragment results

Step 3: Configuration understanding

The tensor core version uses different tiling parameters optimized for hardware:

  • Block tiling: BM=128, BN=64, BK=32 (larger blocks for better occupancy)
  • Warp tiling: WM=32, WN=32 (each warp handles a 32×32 output region)
  • MMA fragments: 16×8×8 (hardware-defined matrix fragment sizes)
  • Warps per block: 8 warps (organized as 4×2 in the BM×BN block)

Why these specific sizes?

  • BM=128, BN=64: Larger than tiled version (32×32) to better utilize Tensor Cores
  • WM=WN=32: Multiple of WARP_SIZE and contains 2×4=8 MMA fragments (32÷16=2, 32÷8=4)
  • MMA 16×8×8: Fixed by hardware - this is what the Tensor Cores physically compute
  • 8 warps: BM÷WM × BN÷WN = 128÷32 × 64÷32 = 4×2 = 8 warps per block

How warps map to MMA fragments:

Each 32×32 warp tile contains multiple 16×8 MMA fragments:

      8 cols    8 cols    8 cols    8 cols
       |         |         |         |
[ MMA 0,0 ][ MMA 0,1 ][ MMA 0,2 ][ MMA 0,3 ]  ← 16 rows each (32÷16=2 fragments down)
[ MMA 1,0 ][ MMA 1,1 ][ MMA 1,2 ][ MMA 1,3 ]  ← 16 rows each

4 fragments across (32÷8=4) × 2 fragments down (32÷16=2) = 8 output fragments, i.e. 8 MMA operations per warp per K-slice

Step 4: Code to complete

# Block and warp tiling sizes
alias BM = 4 * WARP_SIZE  # Block tile M (4 warps along M)
alias BN = 2 * WARP_SIZE  # Block tile N (2 warps along N)
alias BK = WARP_SIZE  # Block tile K (stay within SMEM limit)
alias WM = WARP_SIZE  # Warp tile M
alias WN = WARP_SIZE  # Warp tile N

# MMA tile sizes for tensor cores
alias MMA_M = 16
alias MMA_N = 8
alias MMA_K = 8

alias THREADS_PER_BLOCK_TENSOR_CORE = (8 * WARP_SIZE, 1)  # 8 warps per block
# grid_dim is (x, y). We want x to sweep N (columns) and y to sweep M (rows)
alias BLOCKS_PER_GRID_TENSOR_CORE = (
    (SIZE + BN - 1) // BN,
    (SIZE + BM - 1) // BM,
)
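# Sketch note: assuming WARP_SIZE = 32 (NVIDIA), BM = 128 and BN = 64, so this
# grid evaluates to (16, 8) blocks, each launched with 8 * 32 = 256 threads.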


fn tensor_core_matrix_multiplication[
    dtype: DType,
    layout_a: Layout,
    layout_b: Layout,
    layout_c: Layout,
    BM: Int,
    BN: Int,
    BK: Int,
    WM: Int,
    WN: Int,
    MMA_M: Int,
    MMA_N: Int,
    MMA_K: Int,
](
    A: LayoutTensor[mut=False, dtype, layout_a],
    B: LayoutTensor[mut=False, dtype, layout_b],
    C: LayoutTensor[mut=True, dtype, layout_c],
):
    alias M = C.shape[0]()
    alias N = C.shape[1]()
    alias K = A.shape[1]()

    warp_id = thread_idx.x // WARP_SIZE
    warps_in_n = BN // WN
    warps_in_m = BM // WM
    warp_y = warp_id // warps_in_n
    warp_x = warp_id % warps_in_n

    warp_is_active = warp_y < warps_in_m

    C_block_tile = C.tile[BM, BN](block_idx.y, block_idx.x)
    C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

    mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

    # Shared SRAM tiles (no padding to stay under shared memory limit)
    A_sram_tile = tb[A.dtype]().row_major[BM, BK]().shared().alloc()
    B_sram_tile = tb[B.dtype]().row_major[BK, BN]().shared().alloc()

    # One per-warp accumulator tile of shape [WM, WN]
    C_warp_accum = tb[C.dtype]().row_major[WM, WN]().local().alloc()

    # Zero initialize accumulator (only for active warps)
    if warp_is_active:

        @parameter
        for i in range(WM):

            @parameter
            for j in range(WN):
                C_warp_accum[i, j] = 0.0

    # Sweep across K in BK chunks (single-buffered)
    for k_i in range(K // BK):
        barrier()

        A_dram_tile = A.tile[BM, BK](block_idx.y, k_i)
        B_dram_tile = B.tile[BK, BN](k_i, block_idx.x)

        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())

        async_copy_wait_all()
        barrier()

        if warp_is_active:
            A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
            B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

            @parameter
            for mma_k in range(BK // MMA_K):

                @parameter
                for mma_m in range(WM // MMA_M):

                    @parameter
                    for mma_n in range(WN // MMA_N):
                        # FILL IN (roughly 8 lines)
                        ...

    # Store the final per-warp accumulation to the output warp tile
    if warp_is_active:

        @parameter
        for mma_m in range(WM // MMA_M):

            @parameter
            for mma_n in range(WN // MMA_N):
                var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
                Acc_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](mma_m, mma_n)
                frag = mma_op.load_c(Acc_mma_tile)
                mma_op.store_d(C_mma_tile, frag)


View full file: problems/p33/p33.mojo

Your task: Complete the missing section (marked with # FILL IN (roughly 8 lines)) inside the triple nested loops.

What you need to understand:

  • The skeleton handles all memory management, warp organization, and synchronization
  • You only need to implement the core Tensor Core computation
  • The loops iterate over MMA fragments: mma_k, mma_m, mma_n
  • Each iteration processes one 16×8×8 matrix fragment

Understanding the triple nested loops:

@parameter
for mma_k in range(BK // MMA_K):     # 32÷8 = 4 iterations (K dimension)
    @parameter
    for mma_m in range(WM // MMA_M): # 32÷16 = 2 iterations (M dimension)
        @parameter
        for mma_n in range(WN // MMA_N): # 32÷8 = 4 iterations (N dimension)
            # YOUR CODE HERE: Process one 16×8×8 MMA fragment

What each loop does:

  • mma_k: Iterates through K-slices of the current K-tile (4 slices of 8 elements each)
  • mma_m: Iterates through M-slices of the warp’s output (2 slices of 16 rows each)
  • mma_n: Iterates through N-slices of the warp’s output (4 slices of 8 columns each)
  • Total: 4×2×4 = 32 MMA operations per warp per K-tile
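
As a sanity check on the total amount of work, the number of MMA instructions for the full \(1024 \times 1024 \times 1024\) problem is

\[
\frac{M}{\text{MMA}_M} \cdot \frac{N}{\text{MMA}_N} \cdot \frac{K}{\text{MMA}_K}
= \frac{1024}{16} \cdot \frac{1024}{8} \cdot \frac{1024}{8}
= 64 \cdot 128 \cdot 128 = 1{,}048{,}576,
\]

which matches the tensor-op instruction count reported by the profiler later in this puzzle. Equivalently: 32 MMA operations per warp per K-tile, times 32 K-tiles (1024 ÷ 32), times 1,024 warps across the grid (128 blocks × 8 warps).
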
Tips

Think about the Tensor Core workflow - you need to:

  1. Get the right matrix fragments:

    • From the warp tiles (A_warp_tile, B_warp_tile, C_warp_accum), extract the specific MMA-sized fragments
    • Use the loop indices (mma_m, mma_k, mma_n) to get the correct tile coordinates
    • Remember: A needs [MMA_M, MMA_K], B needs [MMA_K, MMA_N], C needs [MMA_M, MMA_N]
  2. Load fragments into Tensor Core registers:

    • The mma_op object has methods to load each matrix type
    • Each load method takes a tile and returns register fragments
    • Think: load_a(), load_b(), load_c() - what do they each take?
  3. Perform the hardware operation and store:

    • Use the MMA operation to compute the result
    • Store the result back to the accumulator tile
    • The operation follows the pattern: result = A × B + C

Key insight: a single MMA instruction computes a full 16×8 output fragment (128 elements) over K = 8, replacing 16×8×8 = 1,024 individual multiply-add operations with one warp-wide hardware instruction!

Debugging tip: If you get dimension errors, double-check your tile indexing - the order of mma_m, mma_k, mma_n matters for getting the right fragments.

Running the code

To test your solution, run the following command in your terminal:

pixi run p33 --test
uv run poe p33 --test

Your output will show accuracy test results once completed:

=== Running All Accuracy Tests ===
--- Test 1: Tensor Core vs CPU Reference ---
✅ TENSOR CORE ACCURACY TEST PASSED!
--- Test 2: Idiomatic Tiled vs CPU Reference ---
✅ IDIOMATIC TILED ACCURACY TEST PASSED!
ALL TESTS PASSED!

Solution

fn tensor_core_matrix_multiplication[
    dtype: DType,
    layout_a: Layout,
    layout_b: Layout,
    layout_c: Layout,
    BM: Int,
    BN: Int,
    BK: Int,
    WM: Int,
    WN: Int,
    MMA_M: Int,
    MMA_N: Int,
    MMA_K: Int,
](
    A: LayoutTensor[mut=False, dtype, layout_a],
    B: LayoutTensor[mut=False, dtype, layout_b],
    C: LayoutTensor[mut=True, dtype, layout_c],
):
    alias M = C.shape[0]()
    alias N = C.shape[1]()
    alias K = A.shape[1]()

    warp_id = thread_idx.x // WARP_SIZE
    warps_in_n = BN // WN
    warps_in_m = BM // WM
    warp_y = warp_id // warps_in_n
    warp_x = warp_id % warps_in_n

    warp_is_active = warp_y < warps_in_m

    C_block_tile = C.tile[BM, BN](block_idx.y, block_idx.x)
    C_warp_tile = C_block_tile.tile[WM, WN](warp_y, warp_x)

    mma_op = TensorCore[A.dtype, C.dtype, Index(MMA_M, MMA_N, MMA_K)]()

    # Shared SRAM tiles (no padding to stay under shared memory limit)
    A_sram_tile = tb[A.dtype]().row_major[BM, BK]().shared().alloc()
    B_sram_tile = tb[B.dtype]().row_major[BK, BN]().shared().alloc()

    # One per-warp accumulator tile of shape [WM, WN]
    C_warp_accum = tb[C.dtype]().row_major[WM, WN]().local().alloc()

    # Zero initialize accumulator (only for active warps)
    if warp_is_active:

        @parameter
        for i in range(WM):

            @parameter
            for j in range(WN):
                C_warp_accum[i, j] = 0.0

    # (Removed shared C accumulator to reduce shared usage)

    # Sweep across K in BK chunks (single-buffered)
    for k_i in range(K // BK):
        barrier()

        A_dram_tile = A.tile[BM, BK](block_idx.y, k_i)
        B_dram_tile = B.tile[BK, BN](k_i, block_idx.x)

        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](A_sram_tile.vectorize[1, 4](), A_dram_tile.vectorize[1, 4]())
        copy_dram_to_sram_async[
            thread_layout = Layout.row_major(4, 8),
            num_threads=256,
            block_dim_count=BLOCK_DIM_COUNT,
        ](B_sram_tile.vectorize[1, 4](), B_dram_tile.vectorize[1, 4]())

        async_copy_wait_all()
        barrier()

        if warp_is_active:
            A_warp_tile = A_sram_tile.tile[WM, BK](warp_y, 0)
            B_warp_tile = B_sram_tile.tile[BK, WN](0, warp_x)

            @parameter
            for mma_k in range(BK // MMA_K):

                @parameter
                for mma_m in range(WM // MMA_M):

                    @parameter
                    for mma_n in range(WN // MMA_N):
                        A_mma_tile = A_warp_tile.tile[MMA_M, MMA_K](
                            mma_m, mma_k
                        )
                        B_mma_tile = B_warp_tile.tile[MMA_K, MMA_N](
                            mma_k, mma_n
                        )
                        C_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](
                            mma_m, mma_n
                        )

                        a_reg = mma_op.load_a(A_mma_tile)
                        b_reg = mma_op.load_b(B_mma_tile)
                        c_reg = mma_op.load_c(C_mma_tile)
                        d_reg = mma_op.mma_op(a_reg, b_reg, c_reg)
                        mma_op.store_d(C_mma_tile, d_reg)

    # Store the final per-warp accumulation to the output warp tile
    if warp_is_active:

        @parameter
        for mma_m in range(WM // MMA_M):

            @parameter
            for mma_n in range(WN // MMA_N):
                var C_mma_tile = C_warp_tile.tile[MMA_M, MMA_N](mma_m, mma_n)
                Acc_mma_tile = C_warp_accum.tile[MMA_M, MMA_N](mma_m, mma_n)
                frag = mma_op.load_c(Acc_mma_tile)
                mma_op.store_d(C_mma_tile, frag)


This solution demonstrates the Tensor Core programming model:

  1. Warp organization

    • Calculates warp coordinates within the block using warp_id = thread_idx.x // WARP_SIZE
    • Maps warps to output tiles: each warp handles a WM×WN region
    • Uses warp_is_active guards to handle blocks with fewer than expected warps
  2. Memory hierarchy optimization

    • Global → Shared: Uses copy_dram_to_sram_async for efficient block-level transfers
    • Shared → Registers: Uses mma_op.load_a/load_b for warp-level fragment loading
    • Register computation: Uses mma_op.mma_op for hardware-accelerated matrix operations
    • Registers → Global: Uses mma_op.store_d for efficient result storage
  3. Tensor Core operations

    • load_a(A_mma_tile): Loads 16×8 matrix A fragment into registers
    • load_b(B_mma_tile): Loads 8×8 matrix B fragment into registers
    • load_c(C_mma_tile): Loads 16×8 accumulator fragment
    • mma_op(a_reg, b_reg, c_reg): Computes D = A×B + C using specialized hardware
    • store_d(C_mma_tile, d_reg): Stores 16×8 result fragment
  4. Cross-platform compatibility

    • All tiling parameters are multiples of WARP_SIZE (32 on NVIDIA, 64 on AMD)
    • Mojo abstracts hardware differences through the TensorCore interface
    • Same code works on both NVIDIA Tensor Cores and AMD Matrix Cores

The key insight is that Tensor Cores operate on entire matrix fragments at the warp level, rather than individual elements at the thread level. This enables massive parallelism and specialized hardware acceleration.

Performance analysis: Are we done?

Now let’s see if Tensor Cores deliver their promised performance advantage over the idiomatic tiled approach.

Building for profiling

uv run mojo build problems/p33/p33.mojo -o problems/p33/p33_profiler
pixi run mojo build problems/p33/p33.mojo -o problems/p33/p33_profiler

Profiling with NVIDIA Nsight Compute (NVIDIA only)

First, enter the CUDA environment for ncu access:

# Enter CUDA environment
pixi shell -e nvidia

# Profile tensor core version
ncu --set full --metrics sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed,smsp__inst_executed_pipe_tensor_op_hmma.sum ./problems/p33/p33_profiler --tensor-core

# Profile tiled version for comparison
ncu --set full --metrics sm__cycles_elapsed.avg,smsp__cycles_active.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed ./problems/p33/p33_profiler --tiled

Key metrics to compare

Performance metrics:

  • Duration: Total kernel execution time (lower is better)
  • SM Active %: Streaming multiprocessor utilization (higher is better)
  • DRAM Throughput: Memory bandwidth utilization (shows if memory-bound)
  • Tensor Op Instructions: Number of actual tensor core operations (tensor core only)

What the results typically show:

Tensor Core version (slower):

  • Duration: ~13.9 ms (much slower!)
  • SM Active: 83.7% (good utilization)
  • DRAM Throughput: 72.5% (memory-bound!)
  • Occupancy: 26.3% (poor - limited by registers)
  • Tensor Op Instructions: 1,048,576 (confirms tensor cores are working)

Tiled version (faster):

  • Duration: ~1.62 ms (8.6× faster!)
  • SM Active: 98.0% (excellent utilization)
  • DRAM Throughput: 1.7% (compute-bound, as expected)
  • Occupancy: 66.7% (much better)
  • L2 Hit Rate: 96.9% vs 29.7% (much better cache locality)

Why is Tensor Core slower?

  • Memory bottleneck: 72% DRAM usage shows it’s memory-bound, not compute-bound
  • Poor occupancy: 26% vs 67% - high register usage (68 vs 38 per thread) limits concurrent warps
  • Cache misses: 29% L2 hit rate vs 97% shows poor memory locality
  • Shared memory conflicts: Bank conflicts from unoptimized access patterns
  • Launch configuration: Suboptimal block/warp organization for this problem size

The performance reality

As you can see from the profiling results, the “specialized hardware” isn’t automatically faster! The Tensor Core version is significantly slower (~8.6×) than the simple tiled approach. This is a common reality in GPU optimization - raw hardware capability doesn’t guarantee better performance.

Key insights:

  • Memory bottleneck: 72% DRAM usage shows tensor cores are memory-bound, not compute-bound
  • Poor occupancy: 26% vs 67% due to high register usage limits concurrent warps
  • Cache misses: 29% vs 97% L2 hit rate shows poor memory locality
  • Resource waste: Shared memory bank conflicts and suboptimal launch configuration

The lesson: Understanding performance bottlenecks and systematic optimization matter more than using the “latest and greatest” APIs. Hardware features are tools that require careful tuning, not magic bullets.

Next step

Ready for a rewarding GPU optimization challenge? Head to the 🎯 Performance Bonus Challenge to learn how to transform your memory-bound Tensor Core implementation into something that actually beats the simple tiled version!