Naive Matrix Multiplication
Overview
Implement a kernel that multiplies a square matrix \(A\) by its transpose \(A^T\) and stores the result in \(\text{out}\). This is the most straightforward implementation: each thread computes one element of the output matrix.
Key concepts
In this puzzle, you’ll learn about:
- 2D thread organization for matrix operations
- Global memory access patterns
- Matrix indexing in row-major layout
- Thread-to-output element mapping
The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.
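To make that mapping concrete, here is a small host-side sketch (illustrative only; it simulates the thread grid on the CPU rather than launching a kernel) showing which output element each thread of the 3×3 block owns under the bounds guard:

```mojo
def main():
    # Enumerate the 3x3 thread block and show which output element each
    # in-bounds thread owns under the guard `i < SIZE and j < SIZE`.
    alias TPB = 3
    alias SIZE = 2
    for i in range(TPB):
        for j in range(TPB):
            if i < SIZE and j < SIZE:
                print("thread (", i, ",", j, ") -> out[", i, ",", j, "]")
            else:
                print("thread (", i, ",", j, ") -> inactive (guard fails)")
```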
Configuration
- Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
- Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- Grid dimensions: \(1 \times 1\)
Layout configuration:
- Input A: `Layout.row_major(SIZE, SIZE)`
- Input B: `Layout.row_major(SIZE, SIZE)` (transpose of A)
- Output: `Layout.row_major(SIZE, SIZE)`

Memory layout (LayoutTensor):
- Matrix A: `a[i, j]` directly indexes position (i, j)
- Matrix B: `b[i, j]` holds the transpose of A, i.e. `b[i, j] == a[j, i]`
- Output C: `out[i, j]` stores the result at position (i, j)
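For intuition about what `Layout.row_major` means: element (i, j) of a row-major matrix lives at linear offset `i * cols + j`. A minimal sketch of that arithmetic (a hypothetical helper for exposition, not part of the layout API):

```mojo
fn row_major_offset(i: Int, j: Int, cols: Int) -> Int:
    # Rows are contiguous in memory: stepping down one row skips
    # `cols` elements, stepping right skips one.
    return i * cols + j


def main():
    # For the 2x2 case, a[1, 0] lives at linear offset 1 * 2 + 0 = 2.
    print(row_major_offset(1, 0, 2))  # => 2
```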
Code to complete
```mojo
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb

alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 6 lines)
```
View full file: problems/p14/p14.mojo
Tips
- Calculate `global_i` and `global_j` from the thread indices
- Check that both indices are within `size`
- Accumulate the products in a local variable
- Write the final sum to the correct output position
Running the code
To test your solution, run the following command in your terminal:
```bash
magic run p14 --naive
```
Your output will look like this if the puzzle isn’t solved yet:
```txt
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 3.0, 3.0, 13.0])
```
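Where the expected values come from: they are consistent with the host filling \(A\) with the values 0, 1, 2, 3 in row-major order (an assumption about the setup in p14.mojo that matches the buffer above). Then \(A = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}\) and \(B = A^T = \begin{pmatrix} 0 & 2 \\ 1 & 3 \end{pmatrix}\), so:

- \(\text{out}[0,0] = 0 \cdot 0 + 1 \cdot 1 = 1\)
- \(\text{out}[0,1] = 0 \cdot 2 + 1 \cdot 3 = 3\)
- \(\text{out}[1,0] = 2 \cdot 0 + 3 \cdot 1 = 3\)
- \(\text{out}[1,1] = 2 \cdot 2 + 3 \cdot 3 = 13\)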
Solution
```mojo
fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    # Map each thread to one output element
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y

    # Guard: the 3x3 block has more threads than the 2x2 output has elements
    if global_i < size and global_j < size:
        # Accumulate the dot product of row global_i of A and column global_j of B
        var acc: out.element_type = 0

        # Compile-time loop: unrolled because size is a compile-time parameter
        @parameter
        for k in range(size):
            acc += a[global_i, k] * b[k, global_j]

        # One global memory write per thread
        out[global_i, global_j] = acc
```
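For context, here is a hedged sketch of how such a kernel is typically launched from the host in these puzzles. The actual host code lives in problems/p14/p14.mojo and may differ in detail; buffer initialization is elided, and the exact wrapping of buffers into LayoutTensors is an assumption:

```mojo
from gpu.host import DeviceContext

def main():
    with DeviceContext() as ctx:
        # Allocate device buffers (initialization elided; see p14.mojo).
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
        b_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE * SIZE)

        # Wrap the buffers in LayoutTensors matching the kernel signature
        # (assumed construction, not copied from the puzzle file).
        a_tensor = LayoutTensor[mut=False, dtype, layout](a_buf.unsafe_ptr())
        b_tensor = LayoutTensor[mut=False, dtype, layout](b_buf.unsafe_ptr())
        out_tensor = LayoutTensor[mut=True, dtype, layout](out_buf.unsafe_ptr())

        # Launch one 3x3 block; the in-kernel guard masks the extra threads.
        ctx.enqueue_function[naive_matmul[layout, SIZE]](
            out_tensor,
            a_tensor,
            b_tensor,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()
```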
The naive matrix multiplication using LayoutTensor demonstrates the basic approach:
Matrix Layout (2×2 example)
```txt
Matrix A:          Matrix B (transpose of A):   Output C:
[a[0,0] a[0,1]]    [b[0,0] b[0,1]]              [c[0,0] c[0,1]]
[a[1,0] a[1,1]]    [b[1,0] b[1,1]]              [c[1,0] c[1,1]]
```
Implementation Details:

- Thread Mapping:

  ```mojo
  global_i = block_dim.x * block_idx.x + thread_idx.x
  global_j = block_dim.y * block_idx.y + thread_idx.y
  ```

- Memory Access Pattern:
  - Direct 2D indexing: `a[global_i, k]`
  - Transposed access: `b[k, global_j]`
  - Output writing: `out[global_i, global_j]`

- Computation Flow:

  ```mojo
  # Use var for a mutable accumulator with the tensor's element type
  var acc: out.element_type = 0

  # @parameter for compile-time loop unrolling
  @parameter
  for k in range(size):
      acc += a[global_i, k] * b[k, global_j]
  ```
Key Language Features:

- Variable Declaration:
  - `var acc: out.element_type = 0` declares a mutable accumulator
  - Using `out.element_type` ensures type compatibility with the output tensor
  - The accumulator is initialized to zero before accumulation

- Loop Optimization:
  - The `@parameter` decorator unrolls the loop at compile time
  - Improves performance for small, known matrix sizes
  - Enables better instruction scheduling
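For concreteness, with `size = 2` the `@parameter` loop in the solution is equivalent to writing the two accumulation steps out by hand (an illustrative expansion, not extra code in the puzzle):

```mojo
# Equivalent unrolled form of the @parameter loop for size = 2:
acc += a[global_i, 0] * b[0, global_j]  # k = 0
acc += a[global_i, 1] * b[1, global_j]  # k = 1
```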
Performance Characteristics:

- Memory Access:
  - Each thread makes \(2 \times \text{SIZE}\) global memory reads
  - One global memory write per thread
  - No data reuse between threads

- Computational Efficiency:
  - Simple implementation but suboptimal performance
  - Many redundant global memory accesses
  - No use of fast shared memory

- Limitations:
  - High global memory bandwidth usage
  - Poor data locality
  - Limited scalability for large matrices
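As a worked count for this configuration: each of the 4 active threads issues \(2 \times \text{SIZE} = 4\) global reads and 1 write, so the kernel performs 16 reads even though \(A\) and \(B\) together hold only 8 distinct values; every input element is fetched from global memory twice. Caching tiles of \(A\) and \(B\) in fast shared memory is the standard fix for exactly this redundancy.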
This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.