Naive Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) and stores the result in \(\text{out}\). This is the most straightforward implementation where each thread computes one element of the output matrix.

Key concepts

In this puzzle, you’ll learn about:

  • 2D thread organization for matrix operations
  • Global memory access patterns
  • Matrix indexing in row-major layout
  • Thread-to-output element mapping

The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.
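This mapping can be sketched on the CPU in plain Python (not Mojo): with a \(3 \times 3\) thread block over a \(2 \times 2\) matrix, nine threads launch but only four map onto valid output elements; the rest are masked off by the bounds check.

```python
# CPU sketch of the thread-to-element mapping (Python, not the Mojo kernel).
# TPB = 3 threads per block dimension, SIZE = 2 matrix dimension,
# so 9 threads launch but only 4 correspond to output elements.
TPB, SIZE = 3, 2

active = []
for thread_y in range(TPB):
    for thread_x in range(TPB):
        # With a 1x1 grid, the global index equals the thread index.
        global_i, global_j = thread_x, thread_y
        if global_i < SIZE and global_j < SIZE:
            active.append((global_i, global_j))

print(sorted(active))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```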

Configuration

  • Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(1 \times 1\)

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE) (transpose of A)
  • Output: Layout.row_major(SIZE, SIZE)

Memory layout (LayoutTensor):

  • Matrix A: a[i, j] directly indexes position (i,j)
  • Matrix B: b[i, j] is the transpose of A
  • Output C: out[i, j] stores result at position (i,j)
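Row-major layout means element \((i, j)\) of an \(n \times n\) matrix lives at flat offset \(i \cdot n + j\) in the backing buffer. A quick Python check of this indexing rule (the buffer values are hypothetical, purely for illustration):

```python
# Row-major indexing: element (i, j) of an n x n matrix sits at offset i*n + j.
SIZE = 2
flat = [10.0, 20.0, 30.0, 40.0]  # hypothetical backing buffer

def at(buf, i, j, n=SIZE):
    return buf[i * n + j]

assert at(flat, 0, 1) == 20.0  # row 0, column 1
assert at(flat, 1, 0) == 30.0  # row 1, column 0
```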

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 6 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Calculate global_i and global_j from thread indices
  2. Check if indices are within size
  3. Accumulate products in a local variable
  4. Write final sum to correct output position
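The four steps above can be sketched as a CPU reference in Python (a simulation of the per-thread work, not the Mojo kernel; the input values are assumptions chosen to match the expected output below):

```python
# CPU simulation of the naive kernel: one "thread" per (global_i, global_j) pair.
TPB, SIZE = 3, 2

def naive_matmul_ref(a, b, size):
    out = [[0.0] * size for _ in range(size)]
    for global_i in range(TPB):                      # step 1: global indices
        for global_j in range(TPB):
            if global_i < size and global_j < size:  # step 2: bounds check
                acc = 0.0                            # step 3: local accumulator
                for k in range(size):
                    acc += a[global_i][k] * b[k][global_j]
                out[global_i][global_j] = acc        # step 4: write result
    return out

a = [[0.0, 1.0], [2.0, 3.0]]         # hypothetical input A
b = [[0.0, 2.0], [1.0, 3.0]]         # transpose of A
print(naive_matmul_ref(a, b, SIZE))  # [[1.0, 3.0], [3.0, 13.0]]
```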

Running the code

To test your solution, run the following command in your terminal:

magic run p14 --naive

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 3.0, 3.0, 13.0])
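The expected buffer is consistent with, for example, \(A\) filled with 0, 1, 2, 3 in row-major order (an assumption about the host setup; check p14.mojo for the actual initialization). A quick Python check that \(A \cdot A^T\) reproduces it:

```python
# Verify that A @ A^T reproduces the expected buffer for A = [[0, 1], [2, 3]].
A = [[0.0, 1.0], [2.0, 3.0]]  # assumed host initialization
SIZE = 2

expected = [
    sum(A[i][k] * A[j][k] for k in range(SIZE))  # (A @ A^T)[i][j]
    for i in range(SIZE)
    for j in range(SIZE)
]
print(expected)  # [1.0, 3.0, 3.0, 13.0]
```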

Solution

fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y

    if global_i < size and global_j < size:
        var acc: out.element_type = 0

        @parameter
        for k in range(size):
            acc += a[global_i, k] * b[k, global_j]

        out[global_i, global_j] = acc


The naive matrix multiplication using LayoutTensor demonstrates the basic approach:

Matrix Layout (2×2 example)

Matrix A:        Matrix B (transpose of A):    Output C:
 [a[0,0] a[0,1]]  [b[0,0] b[0,1]]             [c[0,0] c[0,1]]
 [a[1,0] a[1,1]]  [b[1,0] b[1,1]]             [c[1,0] c[1,1]]

Implementation Details:

  1. Thread Mapping:

    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    
  2. Memory Access Pattern:

    • Direct 2D indexing: a[global_i, k]
    • Transposed access: b[k, global_j]
    • Output writing: out[global_i, global_j]
  3. Computation Flow:

    # Use var for mutable accumulator with tensor's element type
    var acc: out.element_type = 0
    
    # @parameter for compile-time loop unrolling
    @parameter
    for k in range(size):
        acc += a[global_i, k] * b[k, global_j]
    

Key Language Features:

  1. Variable Declaration:

    • Declaring the accumulator as var acc: out.element_type = 0 makes it mutable, and the out.element_type annotation ensures type compatibility with the output tensor
    • Initialized to zero before accumulation
  2. Loop Optimization:

    • @parameter decorator unrolls the loop at compile time
    • Improves performance for small, known matrix sizes
    • Enables better instruction scheduling
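For \(\text{SIZE} = 2\), unrolling simply replaces the loop with the two products written out, eliminating the loop counter and branch at runtime. A Python sketch of the equivalence (illustrative only; the real unrolling happens at Mojo compile time):

```python
# What unrolling means for SIZE = 2: the loop body is expanded, the loop vanishes.
a = [[0.0, 1.0], [2.0, 3.0]]  # hypothetical input A
b = [[0.0, 2.0], [1.0, 3.0]]  # transpose of A
i, j = 1, 1

# Loop form
acc_loop = 0.0
for k in range(2):
    acc_loop += a[i][k] * b[k][j]

# Unrolled form (no loop counter, no branch at runtime)
acc_unrolled = a[i][0] * b[0][j] + a[i][1] * b[1][j]

assert acc_loop == acc_unrolled == 13.0
```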

Performance Characteristics:

  1. Memory Access:

    • Each thread makes \(2 \times \text{SIZE}\) global memory reads
    • One global memory write per thread
    • No data reuse between threads
  2. Computational Efficiency:

    • Simple implementation but suboptimal performance
    • Many redundant global memory accesses
    • No use of fast shared memory
  3. Limitations:

    • High global memory bandwidth usage
    • Poor data locality
    • Limited scalability for large matrices
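These costs scale as follows: with one thread per output element, \(\text{SIZE}^2\) threads each issue \(2 \times \text{SIZE}\) reads and one write, so global reads grow cubically with the matrix dimension. A quick Python check of the arithmetic:

```python
# Global memory traffic of the naive kernel as a function of matrix size.
def traffic(size):
    threads = size * size        # one thread per output element
    reads = threads * 2 * size   # each thread reads a row of A and a column of B
    writes = threads             # one write per thread
    return reads, writes

print(traffic(2))     # (16, 4)
print(traffic(1024))  # (2147483648, 1048576) -- reads grow as 2 * n^3
```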

This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.