Mojo🔥 GPU Puzzles
🚧 This book is a work in progress! Some sections may be incomplete or subject to change. 🚧
“For the things we have to learn before we can do them, we learn by doing them.” Aristotle, Nicomachean Ethics
Welcome to Mojo 🔥 GPU Puzzles, a hands-on guide to mastering GPU programming using Mojo 🔥 — the innovative programming language that combines Pythonic syntax with systems-level performance. GPU programming remains one of the most powerful skills in modern computing, driving advances in artificial intelligence, scientific simulation, and high-performance computing.
This book takes a unique approach to teaching GPU programming: learning by solving increasingly challenging puzzles. Rather than traditional textbook learning, you’ll immediately start writing real GPU code and seeing the results.
The early chapters of this book are heavily inspired by GPU Puzzles, an interactive CUDA learning project by Sasha Rush. This adaptation reimplements these concepts using Mojo’s powerful abstractions and performance capabilities, while expanding on advanced topics with Mojo-specific optimizations.
Why Mojo 🔥 for GPU Programming?
The computing industry has reached a critical inflection point. We can no longer rely on new CPU generations to automatically increase application performance through higher clock speeds. As power and heat constraints have plateaued CPU speeds, hardware manufacturers have shifted toward increasing the number of physical cores. This multi-core revolution has reached its zenith in modern GPUs, which contain thousands of cores operating in parallel. The NVIDIA H100, for example, can run an astonishing 16,896 threads simultaneously in a single clock cycle, with over 270,000 threads queued and ready for execution.
Mojo represents a fresh approach to GPU programming, making this massive parallelism more accessible and productive:
- Python-like Syntax with systems programming capabilities that feels familiar to the largest programming community
- Zero-cost Abstractions that compile to efficient machine code without sacrificing performance
- Strong Type System that catches errors at compile time while maintaining expressiveness
- Built-in Tensor Support with hardware-aware optimizations specifically designed for GPU computation
- Direct Access to low-level CPU and GPU intrinsics for systems-level programming
- Cross-Hardware Portability allowing you to write code that can run on both CPUs and GPUs
- Ergonomic and Safety Improvements over traditional C/C++ GPU programming
- Lower Barrier to Entry enabling more programmers to harness GPU power effectively
Mojo 🔥 aims to fuel innovation by democratizing GPU programming. By expanding on Python’s familiar syntax while adding direct GPU access, Mojo empowers programmers with minimal specialized knowledge to build high-performance, heterogeneous (CPU, GPU-enabled) applications.
The GPU Programming Mindset
Effective GPU programming requires a fundamental shift in how we think about computation. Here are some key mental models that will guide your journey:
From Sequential to Parallel: Eliminating Loops with Threads
In traditional CPU programming, we process data sequentially through loops:
# CPU approach
for i in range(data_size):
    result[i] = process(data[i])
With GPUs, we flip this model entirely. Instead of moving sequentially through data, we map thousands of parallel threads directly onto the data:
# GPU approach (conceptual)
thread_id = get_global_id()
if thread_id < data_size:
    result[thread_id] = process(data[thread_id])
Each thread becomes responsible for computing a single element, eliminating the need for explicit loops. This mental shift—from “stepping through data” to “blanketing data with compute”—is central to GPU programming.
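In Mojo, that conceptual snippet maps almost directly onto a real kernel. Here is a minimal sketch, assuming a single block of threads as in the early puzzles; the `process` helper is hypothetical, standing in for whatever per-element work you need:

from gpu import thread_idx
from memory import UnsafePointer

alias dtype = DType.float32

fn process(x: Scalar[dtype]) -> Scalar[dtype]:
    # Hypothetical per-element operation
    return x + 10.0

fn process_kernel(
    out: UnsafePointer[Scalar[dtype]],
    data: UnsafePointer[Scalar[dtype]],
    data_size: Int,
):
    thread_id = thread_idx.x      # each thread owns exactly one index
    if thread_id < data_size:     # guard: extra threads simply do nothing
        out[thread_id] = process(data[thread_id])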
Fitting a Mesh of Compute Over Data
Imagine your data as a grid, and GPU threads as another grid that overlays it. Your task is to design this “compute mesh” to efficiently cover your data:
- Threads: Individual compute units that process single data elements
- Blocks: Organized groups of threads that share fast memory
- Grid: The entire collection of blocks that covers your dataset
The art of GPU programming lies in crafting this mesh to maximize parallelism while respecting memory and synchronization constraints.
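For example, to blanket a vector of 1,000 elements with blocks of 256 threads, you would launch \(\lceil 1000 / 256 \rceil = 4\) blocks (1,024 threads in total); each thread finds its place in the mesh as `block_dim.x * block_idx.x + thread_idx.x`, and the final 24 threads fall outside the data and simply do nothing.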
Data Movement vs. Computation
In GPU programming, data movement is often more expensive than computation:
- Moving data between CPU and GPU is slow
- Moving data between global and shared memory is faster
- Operating on data already in registers or shared memory is extremely fast
This inverts another common assumption in programming: computation is no longer the bottleneck—data movement is.
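The puzzles make this concrete starting with shared memory in Puzzle 8. As a rough preview of the pattern (a sketch only; the import locations shown are our assumption, and the book introduces each piece properly later), each thread pays for one slow global read, parks the value in fast block-local shared memory, and then computes from there:

from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.memory import AddressSpace  # assumed import location
from memory import stack_allocation, UnsafePointer
from sys import sizeof

alias TPB = 4
alias dtype = DType.float32

fn stage_then_compute(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # Fast block-local scratchpad (shared memory)
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]  # one slow global read per thread
    barrier()                          # make the staged values visible to the whole block
    if global_i < size:
        out[global_i] = shared[local_i] + 10.0  # fast shared-memory access afterwards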
Through the puzzles in this book, you’ll develop an intuitive understanding of these principles, transforming how you approach computational problems.
What You Will Learn
This book takes you on a journey from first principles to advanced GPU programming techniques. Rather than treating the GPU as a mysterious black box, we’ll build your understanding layer by layer—starting with how individual threads operate and culminating in sophisticated parallel algorithms. By mastering both low-level memory management and high-level tensor abstractions, you’ll gain the versatility to tackle any GPU programming challenge.
Your learning path includes:
- GPU Programming Fundamentals: Thread organization, memory hierarchies, and kernel execution models
- Dual Implementation Paths: Beginning with raw memory approaches using pointers, then transitioning to LayoutTensor abstractions
- Memory Management: Working with global, shared, and thread-local memory for optimal performance
- Low-level to High-level Progression: Understanding the foundation with UnsafePointer before leveraging LayoutTensor’s elegant abstractions
- Layout Tensors: Mastering Mojo’s powerful tensor abstractions for simplified, efficient GPU computation
- Parallel Algorithms: Implementing and optimizing parallel reductions, convolutions, matrix operations, and more
- Performance Optimization: Advanced techniques for memory coalescing, tiling, bank conflict avoidance, and minimizing thread divergence
- Real-world Applications: Applying these concepts to machine learning, signal processing, and computational tasks
The book uniquely challenges the status quo approach by first building understanding with low-level memory manipulation, then gradually transitioning to Mojo’s powerful LayoutTensor abstractions. This gives you both deep understanding of GPU memory patterns and practical knowledge of modern tensor-based approaches.
How to Use This Book
Each puzzle follows a consistent format designed to progressively build your skills:
- Overview: Clear problem statement and key concepts introduced in each puzzle
- Configuration: Setup parameters and memory organization specific to each challenge
- Code to Complete: Skeleton code with specific sections for you to implement
- Tips: Optional hints if you get stuck, without giving away complete solutions
- Solution: Detailed explanations of the implementation, performance considerations, and underlying concepts
The puzzles gradually increase in complexity, introducing new concepts while reinforcing fundamentals. We recommend solving them in order, as later puzzles build on skills developed in earlier ones.
Running the code
All puzzles are designed to be run with the provided testing framework that verifies your implementation against expected results. Each puzzle includes instructions for running the code and validating your solution.
Prerequisites
Compatible GPU
You’ll need a compatible GPU to run the examples.
Setting up your environment
Clone the GitHub repository and make sure you have the `magic` CLI installed to be able to run the Mojo programs:
# Clone the repository
git clone https://github.com/modular/mojo-gpu-puzzles
cd mojo-gpu-puzzles
# Install magic CLI (if not already installed)
curl -ssL https://magic.modular.com/ | bash
# Or update if already installed
magic self-update
Knowledge prerequisites
Basic knowledge of:
- Programming fundamentals (variables, loops, conditionals, functions)
- Parallel computing concepts (threads, synchronization, race conditions)
- Basic familiarity with Mojo (the language basics and the intro to pointers sections)
- A tour of GPU basics in Mojo is helpful
No prior GPU programming experience is necessary! We’ll build that knowledge through the puzzles.
Let’s begin our journey into the exciting world of GPU computing with Mojo 🔥!
Join the community
Join our vibrant community to discuss GPU programming, share solutions, and get help!
Puzzle 1: Map
Overview
GPU programming is all about parallelism. In this puzzle, each thread will process a single element of the input array independently.
Implement a kernel that adds 10 to each position of vector `a` and stores it in vector `out`.
Note: You have 1 thread per position.
Key concepts
- Basic GPU kernel structure
- One-to-one thread to data mapping
- Memory access patterns
- Array operations on GPU
For each position \(i\): \[\Large out[i] = a[i] + 10\]
What we cover
🔰 Raw Memory Approach
Start with direct memory manipulation to understand GPU fundamentals.
💡 Preview: Modern Approach with LayoutTensor
See how LayoutTensor simplifies GPU programming with safer, cleaner code.
💡 Tip: Understanding both approaches helps you better appreciate modern GPU programming patterns.
Key concepts
In this puzzle, you’ll learn about:
- Basic GPU kernel structure
- Thread indexing with `thread_idx.x`
- Simple parallel operations
- Parallelism: Each thread executes independently
- Thread indexing: Access element at position `i = thread_idx.x`
- Memory access: Read from `a[i]` and write to `out[i]`
- Data independence: Each output depends only on its corresponding input
Code to complete
alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32
fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    local_i = thread_idx.x
    # FILL ME IN (roughly 1 line)
View full file: problems/p01/p01.mojo
Tips
- Store `thread_idx.x` in `local_i`
- Add 10 to `a[local_i]`
- Store result in `out[local_i]`
Running the code
To test your solution, run the following command in your terminal:
magic run p01
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
Solution
fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    local_i = thread_idx.x
    out[local_i] = a[local_i] + 10.0
This solution:
- Gets thread index with `local_i = thread_idx.x`
- Adds 10 to input value: `out[local_i] = a[local_i] + 10.0`
Why Consider LayoutTensor?
Looking at our traditional implementation above, you might notice some potential issues:
Current approach
local_i = thread_idx.x
out[local_i] = a[local_i] + 10.0
This works for 1D arrays, but what happens when we need to:
- Handle 2D or 3D data?
- Deal with different memory layouts?
- Ensure coalesced memory access?
Preview of future challenges
As we progress through the puzzles, array indexing will become more complex:
# 2D indexing coming in later puzzles
idx = row * WIDTH + col
# 3D indexing
idx = (batch * HEIGHT + row) * WIDTH + col
# With padding
idx = (batch * padded_height + row) * padded_width + col
LayoutTensor preview
LayoutTensor will help us handle these cases more elegantly:
# Future preview - don't worry about this syntax yet!
out[i, j] = a[i, j] + 10.0 # 2D indexing
out[b, i, j] = a[b, i, j] + 10.0 # 3D indexing
We’ll learn about LayoutTensor in detail in Puzzle 4, where these concepts become essential. For now, focus on understanding:
- Basic thread indexing
- Simple memory access patterns
- One-to-one mapping of threads to data
💡 Key Takeaway: While direct indexing works for simple cases, we’ll soon need more sophisticated tools for complex GPU programming patterns.
Puzzle 2: Zip
Overview
Implement a kernel that adds together each position of vector `a` and vector `b` and stores it in `out`.
Note: You have 1 thread per position.
Key concepts
In this puzzle, you’ll learn about:
- Processing multiple input arrays in parallel
- Element-wise operations with multiple inputs
- Thread-to-data mapping across arrays
- Memory access patterns with multiple arrays
For each thread \(i\): \[\Large out[i] = a[i] + b[i]\]
Memory access pattern
Thread 0: a[0] + b[0] → out[0]
Thread 1: a[1] + b[1] → out[1]
Thread 2: a[2] + b[2] → out[2]
...
💡 Note: Notice how we’re now managing three arrays (`a`, `b`, `out`) in our kernel. As we progress to more complex operations, managing multiple array accesses will become increasingly challenging.
Code to complete
alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32
fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    local_i = thread_idx.x
    # FILL ME IN (roughly 1 line)
View full file: problems/p02/p02.mojo
Tips
- Store `thread_idx.x` in `local_i`
- Add `a[local_i]` and `b[local_i]`
- Store result in `out[local_i]`
Running the code
To test your solution, run the following command in your terminal:
magic run p02
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 2.0, 4.0, 6.0])
Solution
fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    local_i = thread_idx.x
    out[local_i] = a[local_i] + b[local_i]
This solution:
- Gets thread index with `local_i = thread_idx.x`
- Adds values from both arrays: `out[local_i] = a[local_i] + b[local_i]`
Looking ahead
While this direct indexing works for simple element-wise operations, consider:
- What if arrays have different layouts?
- What if we need to broadcast one array to another?
- How to ensure coalesced access across multiple arrays?
These questions will be addressed when we introduce LayoutTensor in Puzzle 4.
Puzzle 3: Guards
Overview
Implement a kernel that adds 10 to each position of vector `a` and stores it in vector `out`.
Note: You have more threads than positions. This means you need to protect against out-of-bounds memory access.
Key concepts
In this puzzle, you’ll learn about:
- Handling thread/data size mismatches
- Preventing out-of-bounds memory access
- Using conditional execution in GPU kernels
- Safe memory access patterns
Mathematical Description
For each thread \(i\): \[\Large \text{if}\ i < \text{size}: out[i] = a[i] + 10\]
Memory Safety Pattern
Thread 0 (i=0): if 0 < size: out[0] = a[0] + 10 ✓ Valid
Thread 1 (i=1): if 1 < size: out[1] = a[1] + 10 ✓ Valid
Thread 2 (i=2): if 2 < size: out[2] = a[2] + 10 ✓ Valid
Thread 3 (i=3): if 3 < size: out[3] = a[3] + 10 ✓ Valid
Thread 4 (i=4): if 4 < size: ❌ Skip (out of bounds)
Thread 5 (i=5): if 5 < size: ❌ Skip (out of bounds)
💡 Note: Boundary checking becomes increasingly complex with:
- Multi-dimensional arrays
- Different array shapes
- Complex access patterns
Code to complete
alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (8, 1)
alias dtype = DType.float32
fn add_10_guard(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    # FILL ME IN (roughly 2 lines)
View full file: problems/p03/p03.mojo
Tips
- Store `thread_idx.x` in `local_i`
- Add guard: `if local_i < size`
- Inside guard: `out[local_i] = a[local_i] + 10.0`
Running the code
To test your solution, run the following command in your terminal:
magic run p03
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
Solution
fn add_10_guard(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    if local_i < size:
        out[local_i] = a[local_i] + 10.0
This solution:
- Gets thread index with `local_i = thread_idx.x`
- Guards against out-of-bounds access with `if local_i < size`
- Inside guard: adds 10 to input value
Looking ahead
While simple boundary checks work here, consider these challenges:
- What about 2D/3D array boundaries?
- How to handle different shapes efficiently?
- What if we need padding or edge handling?
Example of growing complexity:
# Current: 1D bounds check
if i < size: ...
# Coming soon: 2D bounds check
if i < height and j < width: ...
# Later: 3D with padding
if (i < height and j < width and k < depth and
    i >= padding and j >= padding): ...
These boundary handling patterns will become more elegant when we learn about LayoutTensor in Puzzle 4, which provides built-in boundary checking and shape management.
Puzzle 4: 2D Map
Overview
Implement a kernel that adds 10 to each position of 2D square matrix `a` and stores it in 2D square matrix `out`.
Note: You have more threads than positions.
Key concepts
- 2D thread indexing
- Matrix operations on GPU
- Handling excess threads
- Memory layout patterns
For each position \((i,j)\): \[\Large out[i,j] = a[i,j] + 10\]
Implementation approaches
🔰 Raw memory approach
Learn how 2D indexing works with manual memory management.
📚 Learn about LayoutTensor
Discover a powerful abstraction that simplifies multi-dimensional array operations and memory management on GPU.
🚀 Modern 2D operations
Put LayoutTensor into practice with natural 2D indexing and automatic bounds checking.
💡 Note: From this puzzle onward, we’ll primarily use LayoutTensor for cleaner, safer GPU code.
Key concepts
In this puzzle, you’ll learn about:
- Working with 2D thread indices (`thread_idx.x`, `thread_idx.y`)
- Converting 2D coordinates to 1D memory indices
- Handling boundary checks in two dimensions
The key insight is understanding how to map from 2D thread coordinates \((i,j)\) to elements in a row-major matrix of size \(n \times n\), while ensuring thread indices are within bounds.
- 2D indexing: Each thread has a unique \((i,j)\) position
- Memory layout: Row-major ordering maps 2D to 1D memory
- Guard condition: Need bounds checking in both dimensions
- Thread bounds: More threads \((3 \times 3)\) than matrix elements \((2 \times 2)\)
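To make the mapping concrete: with `size = 2`, the element at row `local_j = 1`, column `local_i = 0` lives at flat index `local_j * size + local_i = 1 * 2 + 0 = 2` in row-major storage, which is exactly the arithmetic the kernel below performs.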
Code to complete
alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
fn add_10_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p04/p04.mojo
Tips
- Get 2D indices: `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Add guard: `if local_i < size and local_j < size`
- Inside guard: `out[local_j * size + local_i] = a[local_j * size + local_i] + 10.0`
Running the code
To test your solution, run the following command in your terminal:
magic run p04
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
Solution
fn add_10_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_j * size + local_i] = a[local_j * size + local_i] + 10.0
This solution:
- Gets 2D thread indices with `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Guards against out-of-bounds with `if local_i < size and local_j < size`
- Inside guard: adds 10 to input value using row-major indexing
Introduction to LayoutTensor
After dealing with manual indexing, bounds checking, and growing complexity in the previous puzzles, it’s time to introduce a powerful abstraction that will make GPU programming more intuitive and safer.
Why LayoutTensor?
Let’s look at the challenges we’ve faced:
# Puzzle 1: Simple indexing
out[local_i] = a[local_i] + 10.0
# Puzzle 2: Multiple array management
out[local_i] = a[local_i] + b[local_i]
# Puzzle 3: Bounds checking
if local_i < size:
    out[local_i] = a[local_i] + 10.0
As dimensions grow, code becomes more complex:
# Traditional 2D indexing for a row-major 2D matrix
idx = row * WIDTH + col
if row < height and col < width:
    out[idx] = a[idx] + 10.0
The LayoutTensor solution
LayoutTensor provides:
- Natural Indexing: Use `tensor[i, j]` instead of manual offset calculations
- Automatic Bounds Checking: Built-in protection against out-of-bounds access
- Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
- Performance Optimization: Efficient memory access patterns for GPU
Basic usage
from layout import Layout, LayoutTensor
# Define layout
alias HEIGHT = 2
alias WIDTH = 3
alias layout = Layout.row_major(HEIGHT, WIDTH)
# Create tensor
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())
# Access elements naturally
tensor[0, 0] = 1.0 # First element
tensor[1, 2] = 2.0 # Last element
Memory layout control
LayoutTensor supports different memory organizations:
# Row-major (default)
layout_row = Layout.row_major(HEIGHT, WIDTH)
# Column-major
layout_col = Layout.col_major(HEIGHT, WIDTH)
# Tiled (for better cache utilization)
layout_tiled = tensor.tiled[4, 4](HEIGHT, WIDTH)
Understanding memory layouts
Memory layout affects performance dramatically. LayoutTensor supports:
- Row-major: Elements in a row are contiguous
  # [1 2 3]
  # [4 5 6] -> [1 2 3 4 5 6]
  layout_row = Layout.row_major(2, 3)
- Column-major: Elements in a column are contiguous
  # [1 2 3]
  # [4 5 6] -> [1 4 2 5 3 6]
  layout_col = Layout.col_major(2, 3)
- Tiled: Elements grouped in tiles for cache efficiency
  # [[1 2] [3 4]] in 2x2 tiles
  layout_tiled = Layout.tiled[2, 2](4, 4)
Benefits over traditional approach
- Readability:
  # Traditional
  out[row * WIDTH + col] = a[row * WIDTH + col] + 10.0
  # LayoutTensor
  out[row, col] = a[row, col] + 10.0
- Flexibility:
  - Easy to change memory layouts without modifying computation code (see the sketch after this list)
  - Support for complex access patterns
  - Built-in optimizations
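To see the flexibility point in code, here is a small sketch that follows the parameterized style used in later puzzles: the kernel body uses logical `(i, j)` indexing only, so swapping the layout alias from row-major to column-major changes how the data is stored without touching the computation. Names here are illustrative, not from the puzzles themselves.

from gpu import thread_idx
from layout import Layout, LayoutTensor

alias dtype = DType.float32
alias SIZE = 2
alias layout_row = Layout.row_major(SIZE, SIZE)
alias layout_col = Layout.col_major(SIZE, SIZE)  # swap this in; the kernel stays the same

fn add_10_any_layout[layout: Layout](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        # Logical indexing; the layout parameter decides the physical element order
        out[local_i, local_j] = a[local_i, local_j] + 10.0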
Advanced features preview
While we’ll start with basic operations, LayoutTensor’s true power shines in advanced GPU optimizations:
1. Memory hierarchy management
# Shared memory allocation
shared_mem = tb[dtype]().row_major[BM, BK]().shared().alloc()
# Register allocation
reg_tile = tb[dtype]().row_major[TM, TN]().local().alloc()
2. Tiling strategies
# Block tiling
block_tile = tensor.tile[BM, BN](block_idx.y, block_idx.x)
# Register tiling
reg_tile = block_tile.tile[TM, TN](thread_row, thread_col)
3. Memory access patterns
# Vectorized access
vec_tensor = tensor.vectorize[1, simd_width]()
# Asynchronous transfers
copy_dram_to_sram_async[thread_layout=layout](dst, src)
4. Hardware acceleration
# Tensor Core operations (coming in later puzzles)
mma_op = TensorCore[dtype, out_type, Index(M, N, K)]()
result = mma_op.mma_op(a_reg, b_reg, c_reg)
💡 Looking Ahead: As we progress through the puzzles, you’ll learn how to:
- Use shared memory for faster data access
- Implement efficient tiling strategies
- Leverage vectorized operations
- Utilize hardware accelerators
- Optimize memory access patterns
Each puzzle will introduce these concepts gradually, building on the fundamentals to create highly optimized GPU code.
Ready to start your journey from basic operations to advanced GPU programming? Let’s begin with the fundamentals!
Quick example
Let’s put everything together with a simple example that demonstrates the basics of LayoutTensor:
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor
alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)
fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)

def main():
    ctx = DeviceContext()
    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)
    ctx.synchronize()
When we run this code with `magic run layout_tensor_intro`, we see:
Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0
Let’s break down what’s happening:
- We create a `2 x 3` tensor with row-major layout
- Initially, all elements are zero
- Using natural indexing, we modify a single element
- The change is reflected in our output
This simple example demonstrates key LayoutTensor benefits:
- Clean syntax for tensor creation and access
- Automatic memory layout handling
- Built-in bounds checking
- Natural multi-dimensional indexing
While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You’ll see how these basic concepts extend to:
- Multi-threaded GPU operations
- Shared memory optimizations
- Complex tiling strategies
- Hardware-accelerated computations
Ready to start your GPU programming journey with LayoutTensor? Let’s dive into the puzzles!
💡 Tip: Keep this example in mind as we progress - we’ll build upon these fundamental concepts to create increasingly sophisticated GPU programs.
LayoutTensor Version
Overview
Implement a kernel that adds 10 to each position of 2D LayoutTensor `a` and stores it in 2D LayoutTensor `out`.
Note: You have more threads than positions.
Key concepts
In this puzzle, you’ll learn about:
- Using `LayoutTensor` for 2D array access
- Direct 2D indexing with `tensor[i, j]`
- Handling bounds checking with `LayoutTensor`
The key insight is that `LayoutTensor` provides a natural 2D indexing interface, abstracting away the underlying memory layout while still requiring bounds checking.
- 2D access: Natural \((i,j)\) indexing with `LayoutTensor`
- Memory abstraction: No manual row-major calculation needed
- Guard condition: Still need bounds checking in both dimensions
- Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)
Code to complete
alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)
fn add_10_2d(
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p04/p04_layout_tensor.mojo
Tips
- Get 2D indices: `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Add guard: `if local_i < size and local_j < size`
- Inside guard: `out[local_i, local_j] = a[local_i, local_j] + 10.0`
Running the code
To test your solution, run the following command in your terminal:
magic run p04_layout_tensor
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
Solution
fn add_10_2d(
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_i, local_j] = a[local_i, local_j] + 10.0
This solution:
- Gets 2D thread indices with `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Guards against out-of-bounds with `if local_i < size and local_j < size`
- Uses `LayoutTensor`’s 2D indexing: `out[local_i, local_j] = a[local_i, local_j] + 10.0`
Puzzle 5: Broadcast
Overview
Implement a kernel that broadcast adds vector `a` and vector `b` and stores it in 2D matrix `out`.
Note: You have more threads than positions.
Key concepts
- Broadcasting vectors to matrix
- 2D thread management
- Mixed dimension operations
- Memory layout patterns
Implementation approaches
🔰 Raw memory approach
Learn how to handle broadcasting with manual memory indexing.
📐 LayoutTensor Version
Use LayoutTensor to handle mixed-dimension operations.
💡 Note: Notice how LayoutTensor simplifies broadcasting compared to manual indexing.
Key concepts
In this puzzle, you’ll learn about:
- Broadcasting 1D vectors across different dimensions
- Using 2D thread indices for broadcast operations
- Handling boundary conditions in broadcast patterns
The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.
- Broadcasting: Each element of `a` combines with each element of `b`
- Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
- Vector access: Different access patterns for `a` and `b`
- Bounds checking: Guard against threads outside matrix dimensions
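Concretely, the expected output `[0.0, 1.0, 1.0, 2.0]` below corresponds to `a = [0, 1]` and `b = [0, 1]`: each output element is `out[j, i] = a[i] + b[j]`, so for example `out[1, 0] = a[0] + b[1] = 0 + 1 = 1`, and the \(2 \times 2\) result `[[0, 1], [1, 2]]` is stored row-major as `[0, 1, 1, 2]`.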
Code to complete
alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p05/p05.mojo
Tips
- Get 2D indices: `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Add guard: `if local_i < size and local_j < size`
- Inside guard: `out[local_j * size + local_i] = a[local_i] + b[local_j]`
Running the code
To test your solution, run the following command in your terminal:
magic run p05
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])
Solution
fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_j * size + local_i] = a[local_i] + b[local_j]
This solution:
- Gets 2D thread indices with `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Guards against out-of-bounds with `if local_i < size and local_j < size`
- Broadcasts by adding `a[local_i]` and `b[local_j]` into the output matrix
LayoutTensor Version
Overview
Implement a kernel that broadcast adds LayoutTensor vector `a` and LayoutTensor vector `b` and stores it in LayoutTensor `out`.
Note: You have more threads than positions.
Key concepts
In this puzzle, you’ll learn about:
- Using `LayoutTensor` for broadcast operations
- Working with different tensor shapes
- Handling 2D indexing with `LayoutTensor`
The key insight is that `LayoutTensor` allows natural broadcasting through different tensor shapes: \((n,1)\) and \((1,n)\) to \((n,n)\), while still requiring bounds checking.
- Tensor shapes: Input vectors have shapes \((n,1)\) and \((1,n)\)
- Broadcasting: Output combines both dimensions to \((n,n)\)
- Guard condition: Still need bounds checking for output size
- Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)
Code to complete
alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, 1)
alias b_layout = Layout.row_major(1, SIZE)
fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    b: LayoutTensor[mut=True, dtype, b_layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p05/p05_layout_tensor.mojo
Tips
- Get 2D indices: `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Add guard: `if local_i < size and local_j < size`
- Inside guard: `out[local_i, local_j] = a[local_i, 0] + b[0, local_j]`
Running the code
To test your solution, run the following command in your terminal:
magic run p05_layout_tensor
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])
Solution
fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    b: LayoutTensor[mut=True, dtype, b_layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_i, local_j] = a[local_i, 0] + b[0, local_j]
This solution:
- Gets 2D thread indices with `local_i = thread_idx.x`, `local_j = thread_idx.y`
- Guards against out-of-bounds with `if local_i < size and local_j < size`
- Uses `LayoutTensor` broadcasting: `a[local_i, 0] + b[0, local_j]`
Puzzle 6: Blocks
Overview
Implement a kernel that adds 10 to each position of vector `a` and stores it in `out`.
Note: You have fewer threads per block than the size of `a`.
Key concepts
In this puzzle, you’ll learn about:
- Processing data larger than thread block size
- Coordinating multiple blocks of threads
- Computing global thread positions
The key insight is understanding how blocks of threads work together to process data that’s larger than a single block’s capacity, while maintaining correct element-to-thread mapping.
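For example, with 4 threads per block, thread `thread_idx.x = 2` in block `block_idx.x = 1` handles element `global_i = block_dim.x * block_idx.x + thread_idx.x = 4 * 1 + 2 = 6`; the last three threads of block 2 would compute indices 9, 10, and 11, which the guard rejects because `SIZE = 9`.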
Code to complete
alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32
fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)
View full file: problems/p06/p06.mojo
Tips
- Calculate global index: `global_i = block_dim.x * block_idx.x + thread_idx.x`
- Add guard: `if global_i < size`
- Inside guard: `out[global_i] = a[global_i] + 10.0`
Running the code
To test your solution, run the following command in your terminal:
magic run p06
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])
Solution
fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    if global_i < size:
        out[global_i] = a[global_i] + 10.0
This solution:
- Computes global thread index from block and thread indices
- Guards against out-of-bounds with `if global_i < size`
- Inside guard: adds 10 to input value at global index
Puzzle 7: 2D Blocks
Overview
Implement a kernel that adds 10 to each position of matrix `a` and stores it in `out`.
Note: You have fewer threads per block than the size of `a` in both directions.
Key concepts
- Block-based processing
- Grid-block coordination
- Multi-block indexing
- Memory access patterns
Implementation approaches
🔰 Raw memory approach
Learn how to handle multi-block operations with manual indexing.
📐 LayoutTensor Version
Use LayoutTensor features to elegantly handle block-based processing.
💡 Note: See how LayoutTensor simplifies block coordination and memory access patterns.
Key concepts
In this puzzle, you’ll learn about:
- Working with 2D block and thread arrangements
- Handling matrix data larger than block size
- Converting between 2D and linear memory access
The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that’s larger than a single block’s dimensions.
Configuration
- Matrix size: \(5 \times 5\) elements
- 2D blocks: Each block processes a \(3 \times 3\) region
- Grid layout: Blocks arranged in \(2 \times 2\) grid
- Total threads: \(36\) for \(25\) elements
- Memory pattern: Row-major storage for 2D data
- Coverage: Ensuring all matrix elements are processed
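As a worked example, the thread at `(thread_idx.x, thread_idx.y) = (1, 2)` in block `(block_idx.x, block_idx.y) = (1, 1)` covers element `(global_i, global_j) = (3 * 1 + 1, 3 * 1 + 2) = (4, 5)`; since `global_j = 5` is not less than `size = 5`, that thread is masked off by the guard. This is how the \(36\) launched threads safely cover the \(25\) matrix elements.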
Code to complete
alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p07/p07.mojo
Tips
- Calculate global indices: `global_i = block_dim.x * block_idx.x + thread_idx.x`, `global_j = block_dim.y * block_idx.y + thread_idx.y`
- Add guard: `if global_i < size and global_j < size`
- Inside guard: `out[global_j * size + global_i] = a[global_j * size + global_i] + 10.0`
Running the code
To test your solution, run the following command in your terminal:
magic run p07
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])
Solution
fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    if global_i < size and global_j < size:
        out[global_j * size + global_i] = a[global_j * size + global_i] + 10.0
This solution:
- Computes global indices with `block_dim * block_idx + thread_idx`
- Guards against out-of-bounds with `if global_i < size and global_j < size`
- Uses row-major indexing to access and update matrix elements
LayoutTensor Version
Overview
Implement a kernel that adds 10 to each position of 2D LayoutTensor `a` and stores it in 2D LayoutTensor `out`.
Note: You have fewer threads per block than the size of `a` in both directions.
Key concepts
In this puzzle, you’ll learn about:
- Using `LayoutTensor` with multiple blocks
- Handling large matrices with 2D block organization
- Combining block indexing with `LayoutTensor` access
The key insight is that `LayoutTensor` simplifies 2D indexing while still requiring proper block coordination for large matrices.
Configuration
- Matrix size: \(5 \times 5\) elements
- Layout handling: `LayoutTensor` manages row-major organization
- Block coordination: Multiple blocks cover the full matrix
- 2D indexing: Natural \((i,j)\) access with bounds checking
- Total threads: \(36\) for \(25\) elements
- Thread mapping: Each thread processes one matrix element
Code to complete
alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, 1)
fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p07/p07_layout_tensor.mojo
Tips
- Calculate global indices: `global_i = block_dim.x * block_idx.x + thread_idx.x`, `global_j = block_dim.y * block_idx.y + thread_idx.y`
- Add guard: `if global_i < size and global_j < size`
- Inside guard: `out[global_i, global_j] = a[global_i, global_j] + 10.0`
Running the code
To test your solution, run the following command in your terminal:
magic run p07_layout_tensor
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])
Solution
fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    if global_i < size and global_j < size:
        out[global_i, global_j] = a[global_i, global_j] + 10.0
This solution:
- Computes global indices with `block_dim * block_idx + thread_idx`
- Guards against out-of-bounds with `if global_i < size and global_j < size`
- Uses `LayoutTensor`’s 2D indexing: `out[global_i, global_j] = a[global_i, global_j] + 10.0`
Puzzle 8: Shared Memory
Overview
Implement a kernel that adds 10 to each position of `a` and stores it in `out`.
Note: You have fewer threads per block than the size of `a`.
Implementation approaches
🔰 Raw memory approach
Learn how to manually manage shared memory and synchronization.
📐 LayoutTensor Version
Use LayoutTensor’s built-in shared memory management features.
💡 Note: Experience how LayoutTensor simplifies shared memory operations while maintaining performance.
Key concepts
In this puzzle, you’ll learn about:
- Using shared memory within thread blocks
- Synchronizing threads with barriers
- Managing block-local data storage
The key insight is understanding how shared memory provides fast, block-local storage that all threads in a block can access, requiring careful coordination between threads.
Configuration
- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 4`
- Number of blocks: 2
- Shared memory: `TPB` elements per block
Notes:
- Shared memory: Fast storage shared by threads in a block
- Thread sync: Coordination using `barrier()`
- Memory scope: Shared memory only visible within block
- Access pattern: Local vs global indexing
Warning: Each block has only a fixed amount of shared memory that its threads can read and write. The size must be a compile-time constant, not a runtime variable. After writing to shared memory you need to call `barrier()` so that every thread in the block sees the writes before any thread reads them.
Code to complete
alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
fn add_10_shared(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]
    # wait for all threads to complete
    # works within a thread block
    barrier()
    # FILL ME IN (roughly 2 lines)
View full file: problems/p08/p08.mojo
Tips
- Wait for shared memory load with `barrier()`
- Use `local_i` to access shared memory: `shared[local_i]`
- Use `global_i` for output: `out[global_i]`
- Add guard: `if global_i < size`
Running the code
To test your solution, run the following command in your terminal:
magic run p08
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])
Solution
fn add_10_shared(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]
    # wait for all threads to complete
    # works within a thread block
    barrier()
    # process using shared memory
    if global_i < size:
        out[global_i] = shared[local_i] + 10
This solution:
- Waits for shared memory load with `barrier()`
- Guards against out-of-bounds with `if global_i < size`
- Reads from shared memory using `shared[local_i]`
- Writes result to global memory at `out[global_i]`
Key concepts
In this puzzle, you’ll learn about:
- Using LayoutTensor’s shared memory features
- Thread synchronization with shared memory
- Block-local data management with tensor builder
The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.
Configuration
- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 4`
- Number of blocks: 2
- Shared memory: `TPB` elements per block
Key differences from raw approach
- Memory allocation: We will use `LayoutTensorBuild` instead of `stack_allocation`
  # Raw approach
  shared = stack_allocation[TPB * sizeof[dtype](), ...]()
  # LayoutTensor approach
  shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
- Memory access: Same syntax
  # Raw approach
  shared[local_i] = a[global_i]
  # LayoutTensor approach
  shared[local_i] = a[global_i]
- Safety features:
  - Type safety
  - Layout management
  - Memory alignment handling
Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with `barrier()` when using shared memory.
Code to complete
alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]
    barrier()
    # FILL ME IN (roughly 2 lines)
View full file: problems/p08/p08_layout_tensor.mojo
Tips
- Create shared memory with tensor builder
- Load data with natural indexing: `shared[local_i] = a[global_i]`
- Synchronize with `barrier()`
- Process data using shared memory indices
- Guard against out-of-bounds access
Running the code
To test your solution, run the following command in your terminal:
magic run p08_layout_tensor
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])
Solution
fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]
    barrier()
    if global_i < size:
        out[global_i] = shared[local_i] + 10
This solution:
- Creates shared memory using tensor builder’s fluent API
- Guards against out-of-bounds with `if global_i < size`
- Uses natural indexing for both shared and global memory
- Ensures thread synchronization with `barrier()`
- Leverages LayoutTensor’s built-in safety features
Key steps:
- Allocate shared memory with proper layout
- Load global data into shared memory
- Synchronize threads
- Process data using shared memory
- Write results back to global memory
Puzzle 9: Pooling
Overview
Implement a kernel that sums together the last 3 positions of vector `a` and stores it in vector `out`.
Note: You have 1 thread per position. You only need 1 global read and 1 global write per thread.
Implementation approaches
🔰 Raw memory approach
Learn how to implement sliding window operations with manual memory management and synchronization.
📐 LayoutTensor Version
Use LayoutTensor’s features for efficient window-based operations and shared memory management.
💡 Note: See how LayoutTensor simplifies sliding window operations while maintaining efficient memory access patterns.
Key concepts
In this puzzle, you’ll learn about:
- Using shared memory for sliding window operations
- Handling boundary conditions in pooling
- Coordinating thread access to neighboring elements
The key insight is understanding how to efficiently access a window of elements using shared memory, with special handling for the first elements in the sequence.
Configuration
- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 8`
- Window size: 3 elements
- Shared memory: `TPB` elements
Notes:
- Window access: Each output depends on up to 3 previous elements
- Edge handling: First two positions need special treatment
- Memory pattern: One shared memory load per thread
- Thread sync: Coordination before window operations
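To make the window concrete: the test below expects `[0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]`, which corresponds to the input `[0, 1, 2, 3, 4, 5, 6, 7]`. For instance, `out[4] = a[2] + a[3] + a[4] = 2 + 3 + 4 = 9`, while the shorter windows at the start give `out[0] = a[0] = 0` and `out[1] = a[0] + a[1] = 1`.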
Code to complete
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
fn pooling(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)
View full file: problems/p09/p09.mojo
Tips
- Load data and call `barrier()`
- Special cases: `out[0] = shared[0]`, `out[1] = shared[0] + shared[1]`
- General case: `if 1 < global_i < size`
- Sum three elements: `shared[local_i - 2] + shared[local_i - 1] + shared[local_i]`
Running the code
To test your solution, run the following command in your terminal:
magic run p09
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])
Solution
fn pooling(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]
    barrier()
    if global_i == 0:
        out[0] = shared[0]
    elif global_i == 1:
        out[1] = shared[0] + shared[1]
    if 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )
The solution implements a sliding window sum using shared memory with these key steps:
- Shared Memory Setup:
  - Allocates `TPB` elements in shared memory
  - Each thread loads one element from global memory
  - Uses `barrier()` to ensure all data is loaded
- Boundary Cases:
  - Position 0: `out[0] = shared[0]` (only first element)
  - Position 1: `out[1] = shared[0] + shared[1]` (sum of first two elements)
- Main Window Operation:
  - For positions 2 and beyond: `out[i] = shared[i-2] + shared[i-1] + shared[i]`
  - Uses local indices for shared memory access
  - Maintains coalesced memory access pattern
- Memory Access Pattern:
  - One global read per thread into shared memory
  - One global write per thread from shared memory
  - Uses shared memory for efficient neighbor access
  - Avoids redundant global memory loads
This approach is efficient because:
- Minimizes global memory access
- Uses shared memory for fast neighbor lookups
- Handles boundary conditions without branching in the main case
- Maintains good memory coalescing
Key concepts
In this puzzle, you’ll learn about:
- Using LayoutTensor for sliding window operations
- Managing shared memory with `LayoutTensorBuilder` that we saw in Puzzle 8
- Efficient neighbor access patterns
- Boundary condition handling
The key insight is how LayoutTensor simplifies shared memory management while maintaining efficient window-based operations.
Configuration
- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 8`
- Window size: 3 elements
- Shared memory: `TPB` elements
Notes:
- Tensor builder: Use `LayoutTensorBuilder[dtype]().row_major[TPB]().shared().alloc()`
- Window access: Natural indexing for 3-element windows
- Edge handling: Special cases for first two positions
- Memory pattern: One shared memory load per thread
Code to complete
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)
View full file: problems/p09/p09_layout_tensor.mojo
Tips
- Create shared memory with tensor builder
- Load data with natural indexing: `shared[local_i] = a[global_i]`
- Handle special cases for first two elements
- Use shared memory for window operations
- Guard against out-of-bounds access
Running the code
To test your solution, run the following command in your terminal:
magic run p09_layout_tensor
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])
Solution
fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]
    # Synchronize threads within block
    barrier()
    # Handle first two special cases
    if global_i == 0:
        out[0] = shared[0]
    if global_i == 1:
        out[1] = shared[0] + shared[1]
    # Handle general case
    if 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )
The solution implements a sliding window sum using LayoutTensor with these key steps:
- Shared Memory Setup:
  - Uses tensor builder for clean shared memory allocation
  - Natural indexing for data loading
  - Thread synchronization with `barrier()`
- Boundary Cases:
  - Position 0: Single element output
  - Position 1: Sum of first two elements
  - Clean indexing with LayoutTensor bounds checking
- Main Window Operation:
  - Natural indexing for 3-element window
  - Safe access to neighboring elements
  - Automatic bounds checking
- Memory Access Pattern:
  - Efficient shared memory usage
  - Type-safe operations
  - Layout-aware indexing
  - Automatic alignment handling
Benefits over raw approach:
- Cleaner shared memory allocation
- Safer memory access
- Natural indexing syntax
- Built-in bounds checking
- Layout management
Puzzle 10: Dot Product
Overview
Implement a kernel that computes the dot-product of vector `a` and vector `b` and stores it in `out`.
Note: You have 1 thread per position. You only need 2 global reads and 1 global write per thread.
Key concepts
In this puzzle, you’ll learn about:
- Implementing parallel reduction operations
- Using shared memory for intermediate results
- Coordinating threads for collective operations
The key insight is understanding how to efficiently combine multiple values into a single result using parallel computation and shared memory.
Configuration
- Vector size: `SIZE = 8` elements
- Threads per block: `TPB = 8`
- Number of blocks: 1
- Output size: 1 element
- Shared memory: `TPB` elements
Notes:
- Element access: Each thread reads corresponding elements from `a` and `b`
- Partial results: Computing and storing intermediate values
- Thread coordination: Synchronizing before combining results
- Final reduction: Converting partial results to scalar output
Note: For this problem, you don’t need to worry about the number of shared memory reads. We will handle that challenge later.
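As a sanity check on the expected result below: the test computes the dot product of `[0, 1, …, 7]` with itself, \(0^2 + 1^2 + \dots + 7^2 = 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 = 140\), which is why the expected output is `140.0`.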
Code to complete
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32
fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...
View full file: problems/p10/p10.mojo
Tips
- Store `a[global_i] * b[global_i]` in `shared[local_i]`
- Call `barrier()` to synchronize
- Use thread 0 to sum all products in shared memory
- Write final sum to `out[0]`
Running the code
To test your solution, run the following command in your terminal:
magic run p10
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0])
expected: HostBuffer([140.0])
Solution
fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]
    barrier()
    # The following causes race condition: all threads writing to the same location
    # out[0] += shared[local_i]
    # Instead can do parallel reduction in shared memory as opposed to
    # global memory which has no guarantee on synchronization.
    # Loops using global memory can cause thread divergence because
    # fundamentally GPUs execute threads in warps (groups of 32 threads typically)
    # and warps can be scheduled independently.
    # However, shared memory does not have such issues as long as we use `barrier()`
    # correctly when we're in the same thread block.
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    # only thread 0 writes the final result
    if local_i == 0:
        out[0] = shared[0]
The solution implements a parallel reduction algorithm for dot product computation using shared memory. Here’s a detailed breakdown:
Phase 1: Element-wise Multiplication
Each thread performs one multiplication:
Thread i: shared[i] = a[i] * b[i]
Phase 2: Parallel Reduction
The reduction uses a tree-based approach that halves active threads in each step:
Initial: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7]
= [0 1 4 9 16 25 36 49]
Step 1: [0+16 1+25 4+36 9+49 16 25 36 49]
= [16 26 40 58 16 25 36 49]
Step 2: [16+40 26+58 40 58 16 25 36 49]
= [56 84 40 58 16 25 36 49]
Step 3: [56+84 84 40 58 16 25 36 49]
= [140 84 40 58 16 25 36 49]
Key Implementation Features:
- Memory Access Pattern:
  - Each thread loads exactly two values from global memory (`a[i]`, `b[i]`)
  - Uses shared memory for intermediate results
  - Final result written once to global memory
- Thread Synchronization:
  - `barrier()` after initial multiplication
  - `barrier()` after each reduction step
  - Prevents race conditions between reduction steps
- Reduction Logic:
  stride = TPB // 2
  while stride > 0:
      if local_i < stride:
          shared[local_i] += shared[local_i + stride]
      barrier()
      stride //= 2
  - Halves stride in each step
  - Only active threads perform additions
  - Maintains work efficiency
- Performance Considerations:
  - \(\log_2(n)\) steps for \(n\) elements
  - Coalesced memory access pattern
  - Minimal thread divergence
  - Efficient use of shared memory
This implementation achieves \(O(\log n)\) time complexity compared to \(O(n)\) in sequential execution, demonstrating the power of parallel reduction algorithms.
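If you want to sanity-check the arithmetic on the CPU, here is a small plain-Python sketch (not part of the puzzle files) that mirrors the same tree reduction on the 8-element inputs used above:
# CPU sketch: simulates the tree-based reduction for SIZE = TPB = 8
TPB = 8
a = [float(i) for i in range(8)]
b = [float(i) for i in range(8)]

# Phase 1: element-wise products, one "thread" per element
shared = [a[i] * b[i] for i in range(TPB)]

# Phase 2: tree reduction; each pass halves the number of active "threads"
stride = TPB // 2
while stride > 0:
    for local_i in range(stride):          # active threads 0..stride-1
        shared[local_i] += shared[local_i + stride]
    stride //= 2

print(shared[0])  # 140.0, matching the expected HostBuffer([140.0])
Each pass of the while loop plays the role of one barrier()-separated reduction step on the GPU.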
Barrier Synchronization Importance
The barrier() between reduction steps is critical for correctness. Here's why:
Without barrier(), race conditions occur:
Initial shared memory: [0 1 4 9 16 25 36 49]
Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49
Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 1 starts next step (stride = 2) before Thread 0 finishes
and reads old value shared[0] = 0 instead of 16!
With barrier():
Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values
Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84
The barrier() ensures:
- All writes from current step complete
- All threads see updated values
- No thread starts next iteration early
- Consistent shared memory state
Without these synchronization points, we could get:
- Memory race conditions
- Threads reading stale values
- Non-deterministic results
- Incorrect final sum
Key concepts
In this puzzle, you’ll learn about:
- Similar to puzzle 8 and puzzle 9, implementing parallel reduction with LayoutTensor
- Managing shared memory using LayoutTensorBuilder
- Coordinating threads for collective operations
- Using layout-aware tensor operations
The key insight is how LayoutTensor simplifies memory management while maintaining efficient parallel reduction patterns.
Configuration
- Vector size: SIZE = 8 elements
- Threads per block: TPB = 8
- Number of blocks: 1
- Output size: 1 element
- Shared memory: TPB elements
Notes:
- Tensor builder: Use LayoutTensorBuilder[dtype]().row_major[TPB]().shared().alloc()
- Element access: Natural indexing with bounds checking
- Layout handling: Separate layouts for input and output
- Thread coordination: Same synchronization patterns with barrier()
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(1)
fn dot_product[
in_layout: Layout, out_layout: Layout
](
out: LayoutTensor[mut=True, dtype, out_layout],
a: LayoutTensor[mut=True, dtype, in_layout],
b: LayoutTensor[mut=True, dtype, in_layout],
size: Int,
):
# FILL ME IN (roughly 13 lines)
...
View full file: problems/p10/p10_layout_tensor.mojo
Tips
- Create shared memory with the tensor builder
- Store a[global_i] * b[global_i] in shared[local_i]
- Use the parallel reduction pattern with barrier()
- Let thread 0 write the final result to out[0]
Running the code
To test your solution, run the following command in your terminal:
magic run p10_layout_tensor
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0])
expected: HostBuffer([140.0])
Solution
fn dot_product[
in_layout: Layout, out_layout: Layout
](
out: LayoutTensor[mut=True, dtype, out_layout],
a: LayoutTensor[mut=True, dtype, in_layout],
b: LayoutTensor[mut=True, dtype, in_layout],
size: Int,
):
shared = tb[dtype]().row_major[TPB]().shared().alloc()
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# Compute element-wise multiplication into shared memory
if global_i < size:
shared[local_i] = a[global_i] * b[global_i]
# Synchronize threads within block
barrier()
# Parallel reduction in shared memory
stride = TPB // 2
while stride > 0:
if local_i < stride:
shared[local_i] += shared[local_i + stride]
barrier()
stride //= 2
# Only thread 0 writes the final result
if local_i == 0:
out[0] = shared[0]
The solution implements a parallel reduction for dot product using LayoutTensor. Here’s the detailed breakdown:
Phase 1: Element-wise Multiplication
Each thread performs one multiplication with natural indexing:
shared[local_i] = a[global_i] * b[global_i]
Phase 2: Parallel Reduction
Tree-based reduction with layout-aware operations:
Initial: [0*0 1*1 2*2 3*3 4*4 5*5 6*6 7*7]
= [0 1 4 9 16 25 36 49]
Step 1: [0+16 1+25 4+36 9+49 16 25 36 49]
= [16 26 40 58 16 25 36 49]
Step 2: [16+40 26+58 40 58 16 25 36 49]
= [56 84 40 58 16 25 36 49]
Step 3: [56+84 84 40 58 16 25 36 49]
= [140 84 40 58 16 25 36 49]
Key Implementation Features:
- Memory Management:
  - Clean shared memory allocation with the tensor builder
  - Type-safe operations with LayoutTensor
  - Automatic bounds checking
  - Layout-aware indexing
- Thread Synchronization:
  - barrier() after the initial multiplication
  - barrier() between reduction steps
  - Safe thread coordination
- Reduction Logic:
  stride = TPB // 2
  while stride > 0:
      if local_i < stride:
          shared[local_i] += shared[local_i + stride]
      barrier()
      stride //= 2
- Performance Benefits:
  - \(O(\log n)\) time complexity
  - Coalesced memory access
  - Minimal thread divergence
  - Efficient shared memory usage
The LayoutTensor version maintains the same efficient parallel reduction while providing:
- Better type safety
- Cleaner memory management
- Layout awareness
- Natural indexing syntax
Barrier Synchronization Importance
The barrier() between reduction steps is critical for correctness. Here's why:
Without barrier(), race conditions occur:
Initial shared memory: [0 1 4 9 16 25 36 49]
Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49
Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 1 starts next step (stride = 2) before Thread 0 finishes
and reads old value shared[0] = 0 instead of 16!
With barrier():
Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values
Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84
The barrier() ensures:
- All writes from current step complete
- All threads see updated values
- No thread starts next iteration early
- Consistent shared memory state
Without these synchronization points, we could get:
- Memory race conditions
- Threads reading stale values
- Non-deterministic results
- Incorrect final sum
Puzzle 11: 1D Convolution
Moving to LayoutTensor
So far in our GPU puzzle journey, we’ve been exploring two parallel approaches to GPU memory management:
- Raw memory management with direct pointer manipulation using UnsafePointer
- The more structured LayoutTensor and its related abstractions such as LayoutTensorBuild
Starting from this puzzle, we're transitioning exclusively to using LayoutTensor. This abstraction provides several benefits:
- Type-safe memory access patterns
- Clear representation of data layouts
- Better code maintainability
- Reduced chance of memory-related bugs
- More expressive code that better represents the underlying computations
- A lot more … that we’ll uncover gradually!
This transition aligns with best practices in modern GPU programming in Mojo 🔥, where higher-level abstractions help manage complexity without sacrificing performance.
Overview
In signal processing and image analysis, convolution is a fundamental operation that combines two sequences to produce a third sequence. This puzzle challenges you to implement a 1D convolution on the GPU, where each output element is computed by sliding a kernel over an input array.
Implement a kernel that computes a 1D convolution between vector a and vector b and stores it in out, using the LayoutTensor abstraction.
Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.
For those new to convolution, think of it as a weighted sliding window operation. At each position, we multiply the kernel values with the corresponding input values and sum the results. In mathematical notation, this is often written as:
\[\Large out[i] = \sum_{j=0}^{\text{CONV}-1} a[i+j] \cdot b[j] \]
In pseudocode, 1D convolution is:
for i in range(SIZE):
for j in range(CONV):
if i + j < SIZE:
ret[i] += a_host[i + j] * b_host[j]
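As a quick CPU reference (plain Python, independent of the Mojo kernels), the same pseudocode runs directly and reproduces the expected output shown later in this puzzle:
# Runnable CPU version of the pseudocode above, for reference only
SIZE, CONV = 6, 3
a_host = [float(i) for i in range(SIZE)]   # [0, 1, 2, 3, 4, 5]
b_host = [float(j) for j in range(CONV)]   # [0, 1, 2]

ret = [0.0] * SIZE
for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:                   # zero padding past the end of `a`
            ret[i] += a_host[i + j] * b_host[j]

print(ret)  # [5.0, 8.0, 11.0, 14.0, 5.0, 0.0]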
This puzzle is split into two parts to help you build understanding progressively:
-
Simple version: Single block Start here to learn the basics of implementing convolution with shared memory in a single block using LayoutTensor.
-
Complete version: Block boundary Then tackle the more challenging case where data needs to be shared across block boundaries, leveraging LayoutTensor’s capabilities.
Each version presents unique challenges in terms of memory access patterns and thread coordination. The simple version helps you understand the basic convolution operation, while the complete version tests your ability to handle more complex scenarios that arise in real-world GPU programming.
Simple Version: Single Block
Key concepts
In this puzzle, you’ll learn about:
- Implementing sliding window operations on GPUs
- Managing data dependencies across threads
- Using shared memory for overlapping regions
The key insight is understanding how to efficiently access overlapping elements while maintaining correct boundary conditions.
Configuration
- Input array size: SIZE = 6 elements
- Kernel size: CONV = 3 elements
- Threads per block: TPB = 8
- Number of blocks: 1
- Shared memory: Two arrays of size SIZE and CONV
Notes:
- Data loading: Each thread loads one element from the input and the kernel
- Memory pattern: Shared arrays for the input and the convolution kernel
- Thread sync: Coordination before computation
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
alias MAX_CONV = 4
alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)
fn conv_1d_simple[
in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
out: LayoutTensor[mut=False, dtype, out_layout],
a: LayoutTensor[mut=False, dtype, in_layout],
b: LayoutTensor[mut=False, dtype, in_layout],
a_size: Int,
b_size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# FILL ME IN (roughly 13 lines)
View full file: problems/p11/p11.mojo
Tips
- Use tb[dtype]().row_major[SIZE]().shared().alloc() for shared memory allocation
- Load the input to shared_a[local_i] and the kernel to shared_b[local_i]
- Call barrier() after loading
- Sum products within bounds: if local_i + j < SIZE
- Write the result if global_i < a_size
Running the code
To test your solution, run the following command in your terminal:
magic run p11 --simple
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])
Solution
fn conv_1d_simple[
in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
out: LayoutTensor[mut=False, dtype, out_layout],
a: LayoutTensor[mut=False, dtype, in_layout],
b: LayoutTensor[mut=False, dtype, in_layout],
a_size: Int,
b_size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
if global_i < a_size:
shared_a[local_i] = a[global_i]
if global_i < b_size:
shared_b[local_i] = b[global_i]
barrier()
# Note: this is unsafe as it enforces no guard so could access `shared_a` beyond its bounds
# local_sum = Scalar[dtype](0)
# for j in range(CONV):
# if local_i + j < SIZE:
# local_sum += shared_a[local_i + j] * shared_b[j]
# if global_i < a_size:
# out[global_i] = local_sum
# Safe and correct:
if global_i < a_size:
# Note: using `var` allows us to include the type in the type inference
# `out.element_type` is available in LayoutTensor
var local_sum: out.element_type = 0
# Note: `@parameter` decorator unrolls the loop at compile time given `CONV` is a compile-time constant
# See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
@parameter
for j in range(CONV):
# Bonus: do we need this check for this specific example with fixed SIZE, CONV
if local_i + j < SIZE:
local_sum += shared_a[local_i + j] * shared_b[j]
out[global_i] = local_sum
The solution implements a 1D convolution using shared memory for efficient access to overlapping elements. Here’s a detailed breakdown:
Memory Layout
Input array a: [0 1 2 3 4 5]
Kernel b: [0 1 2]
Computation Steps
- Data Loading:
  shared_a: [0 1 2 3 4 5]   // Input array
  shared_b: [0 1 2]         // Convolution kernel
- Convolution Process for each position i:
  out[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
  out[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
  out[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
  out[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
  out[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
  out[5] = a[5]*b[0] + 0*b[1] + 0*b[2]       = 5*0 + 0*1 + 0*2 = 0
Implementation Details
- Memory Safety Considerations:
  - The naive approach without proper bounds checking could be unsafe:
    # Unsafe version - could access shared_a beyond its bounds
    local_sum = Scalar[dtype](0)
    for j in range(CONV):
        if local_i + j < SIZE:
            local_sum += shared_a[local_i + j] * shared_b[j]
  - The safe and correct implementation:
    if global_i < a_size:
        var local_sum: out.element_type = 0  # Using var allows type inference
        @parameter  # Unrolls loop at compile time since CONV is constant
        for j in range(CONV):
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]
Key Implementation Features:
- Uses var with out.element_type for proper type inference
- Employs the @parameter decorator to unroll the convolution loop at compile time
- Maintains strict bounds checking for memory safety
- Leverages LayoutTensor's type system for better code safety
- Memory Management:
  - Uses shared memory for both the input array and the kernel
  - Single load per thread from global memory
  - Efficient reuse of loaded data
- Thread Coordination:
  - barrier() ensures all data is loaded before computation
  - Each thread computes one output element
  - Maintains a coalesced memory access pattern
- Performance Optimizations:
  - Minimizes global memory access
  - Uses shared memory for fast data access
  - Avoids thread divergence in the main computation loop
  - Loop unrolling through the @parameter decorator
Complete version: Block Boundary Case
Configuration
- Input array size: SIZE_2 = 15 elements
- Kernel size: CONV_2 = 4 elements
- Threads per block: TPB = 8
- Number of blocks: 2
- Shared memory: TPB + CONV_2 - 1 elements for the input
Notes:
- Extended loading: Account for boundary overlap
- Block edges: Handle data across block boundaries
- Memory layout: Efficient shared memory usage
- Synchronization: Proper thread coordination
Code to complete
alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)
fn conv_1d_block_boundary[
in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
out: LayoutTensor[mut=False, dtype, out_layout],
a: LayoutTensor[mut=False, dtype, in_layout],
b: LayoutTensor[mut=False, dtype, in_layout],
a_size: Int,
b_size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# FILL ME IN (roughly 18 lines)
View full file: problems/p11/p11.mojo
Tips
- Use tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc() for shared memory
- Load the main data: shared_a[local_i] = a[global_i]
- Load the boundary: if local_i < CONV_2 - 1, handle data from the next block
- Load the kernel: shared_b[local_i] = b[local_i]
- Sum within extended bounds: if local_i + j < TPB + CONV_2 - 1
Running the code
To test your solution, run the following command in your terminal:
magic run p11 --block-boundary
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])
Solution
fn conv_1d_block_boundary[
in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
out: LayoutTensor[mut=False, dtype, out_layout],
a: LayoutTensor[mut=False, dtype, in_layout],
b: LayoutTensor[mut=False, dtype, in_layout],
a_size: Int,
b_size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# first: need to account for padding
shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
if global_i < a_size:
shared_a[local_i] = a[global_i]
# second: load elements needed for convolution at block boundary
if local_i < CONV_2 - 1:
# indices from next block
next_idx = global_i + TPB
if next_idx < a_size:
shared_a[TPB + local_i] = a[next_idx]
if local_i < b_size:
shared_b[local_i] = b[local_i]
barrier()
if global_i < a_size:
var local_sum: out.element_type = 0
@parameter
for j in range(CONV_2):
if local_i + j < TPB + CONV_2 - 1:
local_sum += shared_a[local_i + j] * shared_b[j]
out[global_i] = local_sum
The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:
Memory Layout and Sizing
Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements
Block 0 shared memory: [0 1 2 3 4 5 6 7|8 9 10] // TPB(8) + (CONV_2-1)(3) padding
Block 1 shared memory: [8 9 10 11 12 13 14|0 0] // Second block with padding
Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
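To see what ends up in each block's shared buffer, here is a small plain-Python sketch (a CPU simulation, not part of the puzzle files) of the main-plus-halo loading described above:
# CPU sketch of the halo-loading scheme: each block copies its TPB elements
# plus CONV_2 - 1 elements from the next block into a local buffer.
TPB, CONV_2, SIZE_2 = 8, 4, 15
a = list(range(SIZE_2))                    # [0, 1, ..., 14]

for block in range(2):
    shared_a = [0] * (TPB + CONV_2 - 1)    # zero-initialized, gives implicit padding
    for local_i in range(TPB):             # main data for this block
        global_i = block * TPB + local_i
        if global_i < SIZE_2:
            shared_a[local_i] = a[global_i]
    for local_i in range(CONV_2 - 1):      # halo: first CONV_2 - 1 threads load ahead
        next_idx = block * TPB + local_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
    print(block, shared_a)

# prints: 0 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
#         1 [8, 9, 10, 11, 12, 13, 14, 0, 0, 0, 0]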
Implementation Details
- Shared Memory Allocation:
  # First: account for padding needed for convolution window
  shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
  shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
  This allocation pattern ensures we have enough space for both the block's data and the overlap region.
- Data Loading Strategy:
  # Main block data
  if global_i < a_size:
      shared_a[local_i] = a[global_i]
  # Boundary data from next block
  if local_i < CONV_2 - 1:
      next_idx = global_i + TPB
      if next_idx < a_size:
          shared_a[TPB + local_i] = a[next_idx]
  - Only threads with local_i < CONV_2 - 1 load boundary data
  - Prevents unnecessary thread divergence
  - Maintains memory coalescing for the main data load
- Kernel Loading:
  if local_i < b_size:
      shared_b[local_i] = b[local_i]
  - Single load per thread
  - Bounded by kernel size
- Convolution Computation:
  if global_i < a_size:
      var local_sum: out.element_type = 0
      @parameter
      for j in range(CONV_2):
          if local_i + j < TPB + CONV_2 - 1:
              local_sum += shared_a[local_i + j] * shared_b[j]
  - Uses @parameter for compile-time loop unrolling
  - Proper type inference with out.element_type
  - Extended bounds check for the overlap region
Memory Access Pattern Analysis
- Block 0 Access Pattern:
  Thread 0: [0 1 2 3]    × [0 1 2 3]
  Thread 1: [1 2 3 4]    × [0 1 2 3]
  Thread 2: [2 3 4 5]    × [0 1 2 3]
  ...
  Thread 7: [7 8 9 10]   × [0 1 2 3]  // Uses overlap data
- Block 1 Access Pattern:
  Thread 0: [8 9 10 11]  × [0 1 2 3]
  Thread 1: [9 10 11 12] × [0 1 2 3]
  ...
  Thread 7: [14 0 0 0]   × [0 1 2 3]  // Zero padding at end
Performance Optimizations
-
Memory Coalescing:
- Main data load: Consecutive threads access consecutive memory
- Boundary data: Only necessary threads participate
- Single barrier synchronization point
-
Thread Divergence Minimization:
- Clean separation of main and boundary loading
- Uniform computation pattern within warps
- Efficient bounds checking
-
Shared Memory Usage:
- Optimal sizing to handle block boundaries
- No bank conflicts in access pattern
- Efficient reuse of loaded data
-
Boundary Handling:
- Implicit zero padding at array end
- Seamless block transition
- Proper handling of edge cases
This implementation achieves efficient cross-block convolution while maintaining:
- Memory safety through proper bounds checking
- High performance through optimized memory access
- Clean code structure using LayoutTensor abstractions
- Minimal synchronization overhead
Puzzle 12: Prefix Sum
Overview
Prefix sum (also known as scan) is a fundamental parallel algorithm that computes running totals of a sequence. Found at the heart of many parallel applications - from sorting algorithms to scientific simulations - it transforms a sequence of numbers into their running totals. While simple to compute sequentially, making this efficient on a GPU requires clever parallel thinking!
Implement a kernel that computes a prefix-sum over vector a and stores it in out.
Note: If the size of a is greater than the block size, only store the sum of each block.
Key concepts
In this puzzle, you’ll learn about:
- Parallel algorithms with logarithmic complexity
- Shared memory coordination patterns
- Multi-phase computation strategies
The key insight is understanding how to transform a sequential operation into an efficient parallel algorithm using shared memory.
For example, given an input sequence \([3, 1, 4, 1, 5, 9]\), the prefix sum would produce:
- \([3]\) (just the first element)
- \([3, 4]\) (3 + 1)
- \([3, 4, 8]\) (previous sum + 4)
- \([3, 4, 8, 9]\) (previous sum + 1)
- \([3, 4, 8, 9, 14]\) (previous sum + 5)
- \([3, 4, 8, 9, 14, 23]\) (previous sum + 9)
Mathematically, for a sequence \([x_0, x_1, …, x_n]\), the prefix sum produces: \[ [x_0, x_0+x_1, x_0+x_1+x_2, …, \sum_{i=0}^n x_i] \]
While a sequential algorithm would need \(O(n)\) steps, our parallel approach will use a clever two-phase algorithm that completes in \(O(\log n)\) steps! Here’s a visualization of this process:
This puzzle is split into two parts to help you master the concept:
-
Simple Version Start with a single block implementation where all data fits in shared memory. This helps understand the core parallel algorithm.
-
Complete Version Then tackle the more challenging case of handling larger arrays that span multiple blocks, requiring coordination between blocks.
Each version builds on the previous one, helping you develop a deep understanding of parallel prefix sum computation. The simple version establishes the fundamental algorithm, while the complete version shows how to scale it to larger datasets - a common requirement in real-world GPU applications.
Simple Version
Configuration
- Array size: SIZE = 8 elements
- Threads per block: TPB = 8
- Number of blocks: 1
- Shared memory: TPB elements
Notes:
- Data loading: Each thread loads one element using LayoutTensor access
- Memory pattern: Shared memory for intermediate results using LayoutTensorBuild
- Thread sync: Coordination between computation phases
- Access pattern: Stride-based parallel computation
- Type safety: Leveraging LayoutTensor's type system
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
from math import log2
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
fn prefix_sum_simple[
layout: Layout
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# FILL ME IN (roughly 12 lines)
View full file: problems/p12/p12.mojo
Tips
- Load data into shared[local_i]
- Use offset = 1 and double it each step
- Add elements where local_i >= offset
- Call barrier() between steps
Running the code
To test your solution, run the following command in your terminal:
magic run p12 --simple
Your output will look like this if the puzzle isn’t solved yet:
out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])
Solution
fn prefix_sum_simple[
layout: Layout
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
shared = tb[dtype]().row_major[TPB]().shared().alloc()
if global_i < size:
shared[local_i] = a[global_i]
barrier()
offset = 1
for i in range(Int(log2(Scalar[dtype](TPB)))):
if local_i >= offset and local_i < size:
shared[local_i] += shared[local_i - offset]
barrier()
offset *= 2
if global_i < size:
out[global_i] = shared[local_i]
The parallel (inclusive) prefix-sum algorithm works as follows:
Setup & Configuration
- TPB (Threads Per Block) = 8
- SIZE (Array Size) = 8
Thread Mapping
- thread_idx.x: \([0, 1, 2, 3, 4, 5, 6, 7]\) (local_i)
- block_idx.x: \([0, 0, 0, 0, 0, 0, 0, 0]\)
- global_i: \([0, 1, 2, 3, 4, 5, 6, 7]\) (block_idx.x * TPB + thread_idx.x)
Initial Load to Shared Memory
Threads: T₀ T₁ T₂ T₃ T₄ T₅ T₆ T₇
Input array: [0 1 2 3 4 5 6 7]
shared: [0 1 2 3 4 5 6 7]
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
T₀ T₁ T₂ T₃ T₄ T₅ T₆ T₇
Offset = 1: First Parallel Step
Active threads: \(T_1 \ldots T_7\) (where local_i ≥ 1)
Before: [0 1 2 3 4 5 6 7]
Add: +0 +1 +2 +3 +4 +5 +6
 | | | | | | |
Result: [0 1 3 5 7 9 11 13]
↑ ↑ ↑ ↑ ↑ ↑ ↑
T₁ T₂ T₃ T₄ T₅ T₆ T₇
Offset = 2: Second Parallel Step
Active threads: \(T_2 \ldots T_7\) (where local_i ≥ 2)
Before: [0 1 3 5 7 9 11 13]
Add: +0 +1 +3 +5 +7 +9
 | | | | | |
Result: [0 1 3 6 10 14 18 22]
↑ ↑ ↑ ↑ ↑ ↑
T₂ T₃ T₄ T₅ T₆ T₇
Offset = 4: Third Parallel Step
Active threads: \(T_4 \ldots T_7\) (where local_i ≥ 4)
Before: [0 1 3 6 10 14 18 22]
Add: +0 +1 +3 +6
 | | | |
Result: [0 1 3 6 10 15 21 28]
↑ ↑ ↑ ↑
T₄ T₅ T₆ T₇
Final Write to Output
Threads: T₀ T₁ T₂ T₃ T₄ T₅ T₆ T₇
global_i: 0 1 2 3 4 5 6 7
out[]: [0 1 3 6 10 15 21 28]
↑ ↑ ↑ ↑ ↑ ↑ ↑ ↑
T₀ T₁ T₂ T₃ T₄ T₅ T₆ T₇
Thread-by-Thread Execution
\(T_0\) (local_i=0):
- Loads shared[0] = 0
- Never adds (local_i < offset always)
- Writes out[0] = 0
\(T_1\) (local_i=1):
- Loads shared[1] = 1
- offset=1: adds shared[0] → 1
- offset=2,4: no action (local_i < offset)
- Writes out[1] = 1
\(T_2\) (local_i=2):
- Loads shared[2] = 2
- offset=1: adds shared[1] → 3
- offset=2: adds shared[0] → 3
- offset=4: no action
- Writes out[2] = 3
\(T_3\) (local_i=3):
- Loads shared[3] = 3
- offset=1: adds shared[2] → 5
- offset=2: adds shared[1] → 6
- offset=4: no action
- Writes out[3] = 6
\(T_4\) (local_i=4):
- Loads shared[4] = 4
- offset=1: adds shared[3] → 7
- offset=2: adds shared[2] → 10
- offset=4: adds shared[0] → 10
- Writes out[4] = 10
\(T_5\) (local_i=5):
- Loads shared[5] = 5
- offset=1: adds shared[4] → 9
- offset=2: adds shared[3] → 14
- offset=4: adds shared[1] → 15
- Writes out[5] = 15
\(T_6\) (local_i=6):
- Loads shared[6] = 6
- offset=1: adds shared[5] → 11
- offset=2: adds shared[4] → 18
- offset=4: adds shared[2] → 21
- Writes out[6] = 21
\(T_7\) (local_i=7):
- Loads shared[7] = 7
- offset=1: adds shared[6] → 13
- offset=2: adds shared[5] → 22
- offset=4: adds shared[3] → 28
- Writes out[7] = 28
The solution ensures correct synchronization between phases using barrier() and handles array bounds checking with if global_i < size. The final result is the inclusive prefix sum, where each element \(i\) contains \(\sum_{j=0}^{i} a[j]\).
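A plain-Python sketch of the same offset-doubling scan makes the step semantics easy to check on the CPU. The snapshot copy below stands in for the read-before-write behavior that the barrier()-separated parallel steps give you on the GPU:
# CPU sketch of the offset-doubling (Hillis-Steele) inclusive scan
TPB = 8
shared = [float(i) for i in range(TPB)]    # [0, 1, 2, ..., 7]

offset = 1
while offset < TPB:                        # log2(TPB) = 3 steps
    snapshot = shared[:]                   # values as they were before this step
    for local_i in range(TPB):
        if local_i >= offset:
            shared[local_i] = snapshot[local_i] + snapshot[local_i - offset]
    offset *= 2

print(shared)  # [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0]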
Complete Version
Configuration
- Array size: SIZE_2 = 15 elements
- Threads per block: TPB = 8
- Number of blocks: 2
- Shared memory: TPB elements per block
Notes:
- Block handling: Multiple blocks process array segments
- Partial blocks: Last block may not be full
- Block sums: Store running totals between blocks
- Global result: Combine local and block sums
- Layout safety: Consistent layout handling through LayoutTensor
Code to complete
alias SIZE_2 = 15
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)
fn prefix_sum[
layout: Layout
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# FILL ME IN (roughly 19 lines)
View full file: problems/p12/p12.mojo
Tips
- Compute local prefix sums like in the Simple Version
- The last thread stores the block sum at TPB * (block_idx.x + 1)
- Add the previous block's sum to the current block
- Handle array bounds for all operations
Running the code
To test your solution, run the following command in your terminal:
magic run p12 --complete
Your output will look like this if the puzzle isn’t solved yet:
out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])
Solution
fn prefix_sum[
layout: Layout
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
shared = tb[dtype]().row_major[TPB]().shared().alloc()
if global_i < size:
shared[local_i] = a[global_i]
barrier()
# Idea: two passes
# SIZE=15, TPB=8, BLOCKS=2
# buffer: [0,1,2,...,7 | 8,...,14]
# Step 1: Each block computes local prefix sum
# Block 0: [0,1,2,3,4,5,6,7] → [0,1,3,6,10,15,21,28]
# Block 1: [8,9,10,11,12,13,14] → [8,17,27,38,50,63,77]
# Step 2: Store block sums
# Block 0's sum (28) → position 8
# Step 3: Add previous block's sum
# Block 1: Each element += 28
# [8,17,27,38,50,63,77] → [36,45,55,66,78,91,105]
# Final result combines both blocks:
# [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]
# local prefix-sum for each block
offset = 1
for i in range(Int(log2(Scalar[dtype](TPB)))):
if local_i >= offset and local_i < size:
shared[local_i] += shared[local_i - offset]
barrier()
offset *= 2
# store block results
if global_i < size:
out[global_i] = shared[local_i]
# store block sum in first element of next block:
# - Only last thread (local_i == 7) in each block except last block executes
# - Block 0: Thread 7 stores 28 (sum of 0-7) at position 8 (start of Block 1)
# - Calculation: TPB * (block_idx.x + 1)
# Block 0: 8 * (0 + 1) = position 8
# Block 1: No action (last block)
# Memory state:
# [0,1,3,6,10,15,21,28 | 28,17,27,38,50,63,77]
# ↑
# Block 0's sum stored here
if local_i == TPB - 1 and block_idx.x < size // TPB - 1:
out[TPB * (block_idx.x + 1)] = shared[local_i]
# wait for all blocks to store their sums
barrier()
# second pass: add previous block's sum which becomes:
# Before (block 1 local prefix sums): [8,17,27,38,50,63,77]
# Add 28: [36,45,55,66,78,91,105]
if block_idx.x > 0 and global_i < size:
shared[local_i] += out[block_idx.x * TPB - 1]
# final result
if global_i < size:
out[global_i] = shared[local_i]
- Local prefix sum (per block):
  Block 0 (8 elements): [0,1,2,3,4,5,6,7]
  After local prefix sum: [0,1,3,6,10,15,21,28]
  Block 1 (7 elements): [8,9,10,11,12,13,14]
  After local prefix sum: [8,17,27,38,50,63,77]
- Block sum communication:
  - The last thread (local_i == TPB - 1) in each non-final block stores its block's sum at the next block's start:
    if local_i == TPB - 1 and block_idx.x < size // TPB - 1:
        out[TPB * (block_idx.x + 1)] = shared[local_i]
  - Block 0's sum (28) stored at position 8
  - Memory layout:
    [0,1,3,6,10,15,21,28 | 28,17,27,38,50,63,77]
    ↑ Block 0's sum
- Final adjustment:
  - Each block after the first adds the previous block's sum:
    if block_idx.x > 0 and global_i < size:
        shared[local_i] += out[block_idx.x * TPB - 1]
  - Block 1: Each element += 28
  - Final result:
    [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]
Key implementation details:
- Uses barrier() after shared memory operations
- Guards all operations with proper bounds checking
- Maintains correct thread and block synchronization
- Achieves \(O(\log n)\) complexity per block
The solution scales to arbitrary-sized inputs by combining local prefix sums with efficient block-to-block communication.
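The two-pass idea is easy to verify on the CPU. The following plain-Python sketch (not part of the puzzle files) uses a sequential scan for the per-block pass purely for brevity; the kernel uses the offset-doubling pattern from the simple version instead:
# CPU sketch of the two-pass block scan for SIZE_2 = 15, TPB = 8
TPB, SIZE_2 = 8, 15
a = [float(i) for i in range(SIZE_2)]
out = [0.0] * SIZE_2

# Pass 1: each block computes a local inclusive prefix sum of its segment
for block in range(2):
    running = 0.0
    for local_i in range(TPB):
        global_i = block * TPB + local_i
        if global_i < SIZE_2:
            running += a[global_i]
            out[global_i] = running

# Pass 2: every block after the first adds the previous block's last value
block_0_sum = out[TPB - 1]                 # 28.0
for global_i in range(TPB, SIZE_2):
    out[global_i] += block_0_sum

print(out)
# [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0,
#  36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0]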
Puzzle 13: Axis Sum
Overview
Implement a kernel that computes a sum over each row of 2D matrix a
and stores it in out
using LayoutTensor.
Key concepts
In this puzzle, you’ll learn about:
- Parallel reduction along matrix dimensions using LayoutTensor
- Using block coordinates for data partitioning
- Efficient shared memory reduction patterns
- Working with multi-dimensional tensor layouts
The key insight is understanding how to map thread blocks to matrix rows and perform efficient parallel reduction within each block while leveraging LayoutTensor’s dimensional indexing.
Configuration
- Matrix dimensions: \(\text{BATCH} \times \text{SIZE} = 4 \times 6\)
- Threads per block: \(\text{TPB} = 8\)
- Grid dimensions: \(1 \times \text{BATCH}\)
- Shared memory: \(\text{TPB}\) elements per block
- Input layout: Layout.row_major(BATCH, SIZE)
- Output layout: Layout.row_major(BATCH, 1)
Matrix visualization:
Row 0: [0, 1, 2, 3, 4, 5] → Block(0,0)
Row 1: [6, 7, 8, 9, 10, 11] → Block(0,1)
Row 2: [12, 13, 14, 15, 16, 17] → Block(0,2)
Row 3: [18, 19, 20, 21, 22, 23] → Block(0,3)
Code to Complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
alias TPB = 8
alias BATCH = 4
alias SIZE = 6
alias BLOCKS_PER_GRID = (1, BATCH)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(BATCH, SIZE)
alias out_layout = Layout.row_major(BATCH, 1)
fn axis_sum[
in_layout: Layout, out_layout: Layout
](
out: LayoutTensor[mut=False, dtype, out_layout],
a: LayoutTensor[mut=False, dtype, in_layout],
size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
batch = block_idx.y
# FILL ME IN (roughly 15 lines)
View full file: problems/p13/p13.mojo
Tips
- Use batch = block_idx.y to select the row
- Load elements: cache[local_i] = a[batch * size + local_i]
- Perform a parallel reduction with a halving stride
- Thread 0 writes the final sum to out[batch]
Running the Code
To test your solution, run the following command in your terminal:
magic run p13
Your output will look like this if the puzzle isn’t solved yet:
out: DeviceBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([15.0, 51.0, 87.0, 123.0])
Solution
fn axis_sum[
in_layout: Layout, out_layout: Layout
](
out: LayoutTensor[mut=False, dtype, out_layout],
a: LayoutTensor[mut=False, dtype, in_layout],
size: Int,
):
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
batch = block_idx.y
cache = tb[dtype]().row_major[TPB]().shared().alloc()
# Visualize:
# Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
# Block(0,1): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 1: [6,7,8,9,10,11]
# Block(0,2): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 2: [12,13,14,15,16,17]
# Block(0,3): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 3: [18,19,20,21,22,23]
# each row is handled by each block bc we have grid_dim=(1, BATCH)
if local_i < size:
cache[local_i] = a[batch, local_i]
else:
# Add zero-initialize padding elements for later reduction
cache[local_i] = 0
barrier()
# do reduction sum per each block
stride = TPB // 2
while stride > 0:
if local_i < stride:
cache[local_i] += cache[local_i + stride]
barrier()
stride //= 2
# writing with local thread = 0 that has the sum for each batch
if local_i == 0:
out[batch, 0] = cache[0]
The solution implements a parallel row-wise sum reduction for a 2D matrix using LayoutTensor. Here’s a comprehensive breakdown:
Matrix Layout and Block Mapping
Input Matrix (4×6) with LayoutTensor: Block Assignment:
[[ a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] ] → Block(0,0)
[ a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] ] → Block(0,1)
[ a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] ] → Block(0,2)
[ a[3,0] a[3,1] a[3,2] a[3,3] a[3,4] a[3,5] ] → Block(0,3)
Parallel Reduction Process
- Initial Data Loading:
  Block(0,0): cache = [a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] 0 0]  // last two are zero padding
  Block(0,1): cache = [a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] 0 0]
  Block(0,2): cache = [a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] 0 0]
  Block(0,3): cache = [a[3,0] a[3,1] a[3,2] a[3,3] a[3,4] a[3,5] 0 0]
- Reduction Steps (for Block 0,0):
  Initial:  [0  1 2 3 4 5 0 0]
  Stride 4: [4  6 2 3 4 5 0 0]
  Stride 2: [6  9 2 3 4 5 0 0]
  Stride 1: [15 9 2 3 4 5 0 0]
Key Implementation Features:
- Layout Configuration:
  - Input: row-major layout (BATCH × SIZE)
  - Output: row-major layout (BATCH × 1)
  - Each block processes one complete row
- Memory Access Pattern:
  - LayoutTensor 2D indexing for input: a[batch, local_i]
  - Shared memory for efficient reduction
  - LayoutTensor 2D indexing for output: out[batch, 0]
- Parallel Reduction Logic:
  stride = TPB // 2
  while stride > 0:
      if local_i < stride:
          cache[local_i] += cache[local_i + stride]
      barrier()
      stride //= 2
- Output Writing:
  if local_i == 0:
      out[batch, 0] = cache[0]  # one result per batch
Performance Optimizations:
-
Memory Efficiency:
- Coalesced memory access through LayoutTensor
- Shared memory for fast reduction
- Single write per row result
-
Thread Utilization:
- Perfect load balancing across rows
- No thread divergence in main computation
- Efficient parallel reduction pattern
-
Synchronization:
- Minimal barriers (only during reduction)
- Independent processing between rows
- No inter-block communication needed
Complexity Analysis:
- Time: \(O(\log n)\) per row, where n is row length
- Space: \(O(TPB)\) shared memory per block
- Total parallel time: \(O(\log n)\) with sufficient threads
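Here is a plain-Python sketch (a CPU check, not part of the puzzle files) that mirrors the per-row reduction and reproduces the expected row sums:
# CPU sketch of the per-row tree reduction for the 4x6 input used above
BATCH, SIZE, TPB = 4, 6, 8
a = [[float(row * SIZE + col) for col in range(SIZE)] for row in range(BATCH)]

out = []
for batch in range(BATCH):                 # one "block" per row
    cache = [0.0] * TPB                    # zero padding beyond SIZE, as in the kernel
    for local_i in range(SIZE):
        cache[local_i] = a[batch][local_i]
    stride = TPB // 2
    while stride > 0:                      # same tree reduction as the kernel
        for local_i in range(stride):
            cache[local_i] += cache[local_i + stride]
        stride //= 2
    out.append(cache[0])

print(out)  # [15.0, 51.0, 87.0, 123.0]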
Puzzle 14: Matrix Multiplication (MatMul)
Overview
Matrix multiplication is a fundamental operation in scientific computing, machine learning, and graphics. Given two matrices \(A\) and \(B\), we want to compute their product \(C = A \times B.\)
For matrices \(A_{m\times k}\) and \(B_{k\times n}\), each element of the result \(C_{m\times n}\) is computed as:
\[\Large C_{ij} = \sum_{l=0}^{k-1} A_{il} \cdot B_{lj} \]
This puzzle explores different approaches to implementing matrix multiplication on GPUs, each with its own performance characteristics:
-
Naive Version The straightforward implementation where each thread computes one element of the output matrix. While simple to understand, this approach makes many redundant memory accesses.
-
Shared Memory Version Improves performance by loading blocks of input matrices into fast shared memory, reducing global memory accesses. Each thread still computes one output element but reads from shared memory.
-
Tiled Version Further optimizes by dividing the computation into tiles, allowing threads to cooperate on loading and computing blocks of the output matrix. This approach better utilizes memory hierarchy and thread cooperation.
Each version builds upon the previous one, introducing new optimization techniques common in GPU programming. You’ll learn how different memory access patterns and thread cooperation strategies affect performance.
The progression illustrates a common pattern in GPU optimization:
- Start with a correct but naive implementation
- Reduce global memory access with shared memory
- Improve data locality and thread cooperation with tiling
- Use high-level abstractions while maintaining performance
Choose a version to begin your matrix multiplication journey!
Naive Matrix Multiplication
Overview
Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) and stores the result in \(\text{out}\). This is the most straightforward implementation where each thread computes one element of the output matrix.
Key concepts
In this puzzle, you’ll learn about:
- 2D thread organization for matrix operations
- Global memory access patterns
- Matrix indexing in row-major layout
- Thread-to-output element mapping
The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.
Configuration
- Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
- Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- Grid dimensions: \(1 \times 1\)
Layout configuration:
- Input A: Layout.row_major(SIZE, SIZE)
- Input B: Layout.row_major(SIZE, SIZE) (transpose of A)
- Output: Layout.row_major(SIZE, SIZE)
Memory layout (LayoutTensor):
- Matrix A: a[i, j] directly indexes position (i,j)
- Matrix B: b[i, j] is the transpose of A
- Output C: out[i, j] stores the result at position (i,j)
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)
fn naive_matmul[
layout: Layout, size: Int
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
b: LayoutTensor[mut=False, dtype, layout],
):
global_i = block_dim.x * block_idx.x + thread_idx.x
global_j = block_dim.y * block_idx.y + thread_idx.y
# FILL ME IN (roughly 6 lines)
View full file: problems/p14/p14.mojo
Tips
- Calculate global_i and global_j from thread indices
- Check if indices are within size
- Accumulate products in a local variable
- Write the final sum to the correct output position
Running the code
To test your solution, run the following command in your terminal:
magic run p14 --naive
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 3.0, 3.0, 13.0])
Solution
fn naive_matmul[
layout: Layout, size: Int
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
b: LayoutTensor[mut=False, dtype, layout],
):
global_i = block_dim.x * block_idx.x + thread_idx.x
global_j = block_dim.y * block_idx.y + thread_idx.y
if global_i < size and global_j < size:
var acc: out.element_type = 0
@parameter
for k in range(size):
acc += a[global_i, k] * b[k, global_j]
out[global_i, global_j] = acc
The naive matrix multiplication using LayoutTensor demonstrates the basic approach:
Matrix Layout (2×2 example)
Matrix A: Matrix B (transpose of A): Output C:
[a[0,0] a[0,1]] [b[0,0] b[0,1]] [c[0,0] c[0,1]]
[a[1,0] a[1,1]] [b[1,0] b[1,1]] [c[1,0] c[1,1]]
Implementation Details:
-
Thread Mapping:
global_i = block_dim.x * block_idx.x + thread_idx.x global_j = block_dim.y * block_idx.y + thread_idx.y
-
Memory Access Pattern:
- Direct 2D indexing:
a[global_i, k]
- Transposed access:
b[k, global_j]
- Output writing:
out[global_i, global_j]
- Direct 2D indexing:
-
Computation Flow:
# Use var for mutable accumulator with tensor's element type var acc: out.element_type = 0 # @parameter for compile-time loop unrolling @parameter for k in range(size): acc += a[global_i, k] * b[k, global_j]
Key Language Features:
- Variable Declaration:
  - The use of var in var acc: out.element_type = 0 declares a mutable accumulator
  - out.element_type ensures type compatibility with the output tensor
  - Initialized to zero before accumulation
- Loop Optimization:
  - The @parameter decorator unrolls the loop at compile time
  - Improves performance for small, known matrix sizes
  - Enables better instruction scheduling
Performance Characteristics:
- Memory Access:
  - Each thread makes 2 × SIZE global memory reads
  - One global memory write per thread
  - No data reuse between threads
- Computational Efficiency:
  - Simple implementation but suboptimal performance
  - Many redundant global memory accesses
  - No use of fast shared memory
- Limitations:
  - High global memory bandwidth usage
  - Poor data locality
  - Limited scalability for large matrices
This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.
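For reference, this plain-Python sketch computes exactly what each thread of the naive kernel does for the 2×2 case, assuming the test fills A with 0..3 in row-major order (which matches the expected output shown above):
# CPU sketch: one (i, j) pair per "thread", with B taken as the transpose of A
SIZE = 2
A = [[0.0, 1.0],
     [2.0, 3.0]]                                           # assumed test input
B = [[A[j][i] for j in range(SIZE)] for i in range(SIZE)]  # transpose(A)

out = [[0.0] * SIZE for _ in range(SIZE)]
for i in range(SIZE):                              # global_i
    for j in range(SIZE):                          # global_j
        acc = 0.0
        for k in range(SIZE):                      # the dot product each thread performs
            acc += A[i][k] * B[k][j]
        out[i][j] = acc

print(out)  # [[1.0, 3.0], [3.0, 13.0]]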
Shared Memory Matrix Multiplication
Overview
Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) and stores the result in \(\text{out}\), using shared memory to improve memory access efficiency. This version loads matrix blocks into shared memory before computation.
Key concepts
In this puzzle, you’ll learn about:
- Block-local memory management with LayoutTensor
- Thread synchronization patterns
- Memory access optimization using shared memory
- Collaborative data loading with 2D indexing
- Efficient use of LayoutTensor for matrix operations
The key insight is understanding how to use fast shared memory with LayoutTensor to reduce expensive global memory operations.
Configuration
- Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
- Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- Grid dimensions: \(1 \times 1\)
Layout configuration:
- Input A: Layout.row_major(SIZE, SIZE)
- Input B: Layout.row_major(SIZE, SIZE) (transpose of A)
- Output: Layout.row_major(SIZE, SIZE)
- Shared Memory: Two TPB × TPB LayoutTensors
Memory organization:
Global Memory (LayoutTensor): Shared Memory (LayoutTensor):
A[i,j]: Direct access a_shared[local_i, local_j]
B[i,j]: Transposed access b_shared[local_i, local_j]
Code to complete
fn single_block_matmul[
layout: Layout, size: Int
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
b: LayoutTensor[mut=False, dtype, layout],
):
global_i = block_dim.x * block_idx.x + thread_idx.x
global_j = block_dim.y * block_idx.y + thread_idx.y
local_i = thread_idx.x
local_j = thread_idx.y
# FILL ME IN (roughly 12 lines)
View full file: problems/p14/p14.mojo
Tips
- Load matrices to shared memory using global and local indices
- Call barrier() after loading
- Check array bounds for all operations
Running the code
To test your solution, run the following command in your terminal:
magic run p14 --single-block
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 3.0, 3.0, 13.0])
Solution
fn single_block_matmul[
layout: Layout, size: Int
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
b: LayoutTensor[mut=False, dtype, layout],
):
global_i = block_dim.x * block_idx.x + thread_idx.x
global_j = block_dim.y * block_idx.y + thread_idx.y
local_i = thread_idx.x
local_j = thread_idx.y
a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
if global_i < size and global_j < size:
a_shared[local_i, local_j] = a[global_i, global_j]
b_shared[local_i, local_j] = b[global_i, global_j]
barrier()
if global_i < size and global_j < size:
var acc: out.element_type = 0
@parameter
for k in range(size):
acc += a_shared[local_i, k] * b_shared[k, local_j]
out[global_i, global_j] = acc
The shared memory implementation with LayoutTensor improves performance through efficient memory access patterns:
Memory Organization
Input Tensors (2×2): Shared Memory (3×3):
Matrix A: a_shared:
[a[0,0] a[0,1]] [s[0,0] s[0,1] s[0,2]]
[a[1,0] a[1,1]] [s[1,0] s[1,1] s[1,2]]
[s[2,0] s[2,1] s[2,2]]
Matrix B (transpose): b_shared: (similar layout)
[b[0,0] b[0,1]] [t[0,0] t[0,1] t[0,2]]
[b[1,0] b[1,1]] [t[1,0] t[1,1] t[1,2]]
[t[2,0] t[2,1] t[2,2]]
Implementation Phases:
-
Shared Memory Setup:
# Create 2D shared memory tensors using TensorBuilder a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc() b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
-
Thread Indexing:
# Global indices for matrix access global_i = block_dim.x * block_idx.x + thread_idx.x global_j = block_dim.y * block_idx.y + thread_idx.y # Local indices for shared memory local_i = thread_idx.x local_j = thread_idx.y
-
Data Loading:
# Load data into shared memory using LayoutTensor indexing if global_i < size and global_j < size: a_shared[local_i, local_j] = a[global_i, global_j] b_shared[local_i, local_j] = b[global_i, global_j]
-
Computation with Shared Memory:
# Guard ensures we only compute for valid matrix elements if global_i < size and global_j < size: # Initialize accumulator with output tensor's type var acc: out.element_type = 0 # Compile-time unrolled loop for matrix multiplication @parameter for k in range(size): acc += a_shared[local_i, k] * b_shared[k, local_j] # Write result only for threads within matrix bounds out[global_i, global_j] = acc
Key aspects:
- Boundary Check: if global_i < size and global_j < size
  - Prevents out-of-bounds computation
  - Only valid threads perform work
  - Essential because TPB (3×3) > SIZE (2×2)
- Accumulator Type: var acc: out.element_type
  - Uses the output tensor's element type for type safety
  - Ensures consistent numeric precision
  - Initialized to zero before accumulation
- Loop Optimization: @parameter for k in range(size)
  - Unrolls the loop at compile time
  - Enables better instruction scheduling
  - Efficient for small, known matrix sizes
- Result Writing: out[global_i, global_j] = acc
  - Protected by the same guard condition
  - Only valid threads write results
  - Maintains matrix bounds safety
- Thread Safety and Synchronization:
  - Guard Conditions:
    - Input Loading: if global_i < size and global_j < size
    - Computation: Same guard ensures thread safety
    - Output Writing: Protected by the same condition
    - Prevents invalid memory access and race conditions
  - Memory Access Safety:
    - Shared memory: Accessed only within TPB bounds
    - Global memory: Protected by size checks
    - Output: Guarded writes prevent corruption
Key Language Features:
- LayoutTensor Benefits:
  - Direct 2D indexing simplifies code
  - Type safety through element_type
  - Efficient memory layout handling
- Shared Memory Allocation:
  - TensorBuilder for structured allocation
  - Row-major layout matching input tensors
  - Proper alignment for efficient access
- Synchronization:
  - barrier() ensures shared memory consistency
  - Proper synchronization between load and compute
  - Thread cooperation within the block
Performance Optimizations:
-
Memory Access Efficiency:
- Single global memory load per element
- Multiple reuse through shared memory
- Coalesced memory access patterns
-
Thread Cooperation:
- Collaborative data loading
- Shared data reuse
- Efficient thread synchronization
-
Computational Benefits:
- Reduced global memory traffic
- Better cache utilization
- Improved instruction throughput
This implementation significantly improves performance over the naive version by:
- Reducing global memory accesses
- Enabling data reuse through shared memory
- Using efficient 2D indexing with LayoutTensor
- Maintaining proper thread synchronization
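A rough way to see the benefit is to count global-memory reads. The sketch below (plain Python, single-block assumption, so it only applies when the whole matrix fits in one block) compares the naive kernel, where every thread re-reads a full row and column, with the shared-memory version, where each input element is loaded from global memory once:
def naive_reads(size: int) -> int:
    # each of the size*size threads reads a full row of A and a full column of B
    return size * size * 2 * size

def shared_reads(size: int) -> int:
    # each element of A and B is loaded from global memory exactly once per block
    return 2 * size * size

for size in (2, 8, 64):
    print(size, naive_reads(size), shared_reads(size))
# e.g. for size=64: 524288 naive reads vs 8192 reads with shared memory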
Tiled Matrix Multiplication
Overview
Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) using tiled matrix multiplication with LayoutTensor. This approach handles large matrices by processing them in smaller chunks (tiles).
Key concepts
- Matrix tiling with LayoutTensor for large-scale computation
- Multi-block coordination with proper layouts
- Efficient shared memory usage through TensorBuilder
- Boundary handling for tiles with LayoutTensor indexing
Configuration
- Matrix size: \(\text{SIZE_TILED} = 8\)
- Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
- Grid dimensions: \(3 \times 3\) blocks
- Shared memory: Two \(\text{TPB} \times \text{TPB}\) LayoutTensors per block
Layout configuration:
- Input A: Layout.row_major(SIZE_TILED, SIZE_TILED)
- Input B: Layout.row_major(SIZE_TILED, SIZE_TILED) (transpose of A)
- Output: Layout.row_major(SIZE_TILED, SIZE_TILED)
- Shared Memory: Two TPB × TPB LayoutTensors using TensorBuilder
Tiling strategy
Block organization
Grid Layout (3×3): Thread Layout per Block (3×3):
[B00][B01][B02] [T00 T01 T02]
[B10][B11][B12] [T10 T11 T12]
[B20][B21][B22] [T20 T21 T22]
Each block processes a tile using LayoutTensor indexing
Tile processing steps
- Load tile from matrix A into shared memory using LayoutTensor indexing
- Load corresponding tile from matrix B into shared memory
- Compute partial products using shared memory
- Accumulate results in registers
- Move to next tile
- Repeat until all tiles are processed
Memory access pattern
For each tile using LayoutTensor:
Input Tiles: Output Computation:
A[i:i+TPB, k:k+TPB] × Result tile
B[k:k+TPB, j:j+TPB] → C[i:i+TPB, j:j+TPB]
Code to complete
alias SIZE_TILED = 8
alias BLOCKS_PER_GRID_TILED = (3, 3)  # each block covers a 3x3 tile of the output
alias THREADS_PER_BLOCK_TILED = (TPB, TPB)
alias layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
fn matmul_tiled[
layout: Layout, size: Int
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
b: LayoutTensor[mut=False, dtype, layout],
):
local_row = thread_idx.x
local_col = thread_idx.y
global_row = block_idx.x * TPB + local_row
global_col = block_idx.y * TPB + local_col
# FILL ME IN (roughly 20 lines)
View full file: problems/p14/p14.mojo
Tips
- Calculate global thread positions from block and thread indices
- Clear shared memory before loading new tiles
- Load tiles with proper bounds checking
- Accumulate results across tiles with proper synchronization
Running the code
To test your solution, run the following command in your terminal:
magic run p14 --tiled
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([140.0, 364.0, 588.0, 812.0, 1036.0, 1260.0, 1484.0, 1708.0, 364.0, 1100.0, 1836.0, 2572.0, 3308.0, 4044.0, 4780.0, 5516.0, 588.0, 1836.0, 3084.0, 4332.0, 5580.0, 6828.0, 8076.0, 9324.0, 812.0, 2572.0, 4332.0, 6092.0, 7852.0, 9612.0, 11372.0, 13132.0, 1036.0, 3308.0, 5580.0, 7852.0, 10124.0, 12396.0, 14668.0, 16940.0, 1260.0, 4044.0, 6828.0, 9612.0, 12396.0, 15180.0, 17964.0, 20748.0, 1484.0, 4780.0, 8076.0, 11372.0, 14668.0, 17964.0, 21260.0, 24556.0, 1708.0, 5516.0, 9324.0, 13132.0, 16940.0, 20748.0, 24556.0, 28364.0])
Solution
fn matmul_tiled[
layout: Layout, size: Int
](
out: LayoutTensor[mut=False, dtype, layout],
a: LayoutTensor[mut=False, dtype, layout],
b: LayoutTensor[mut=False, dtype, layout],
):
local_row = thread_idx.x
local_col = thread_idx.y
global_row = block_idx.x * TPB + local_row
global_col = block_idx.y * TPB + local_col
a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
var acc: out.element_type = 0
# Iterate over tiles to compute matrix product
@parameter
for tile in range((size + TPB - 1) // TPB):
# Reset shared memory tiles
if local_row < TPB and local_col < TPB:
a_shared[local_row, local_col] = 0
b_shared[local_row, local_col] = 0
barrier()
# Load A tile - global row stays the same, col determined by tile
if global_row < size and (tile * TPB + local_col) < size:
a_shared[local_row, local_col] = a[
global_row, tile * TPB + local_col
]
# Load B tile - row determined by tile, global col stays the same
if (tile * TPB + local_row) < size and global_col < size:
b_shared[local_row, local_col] = b[
tile * TPB + local_row, global_col
]
barrier()
# Matrix multiplication within the tile
if global_row < size and global_col < size:
@parameter
for k in range(min(TPB, size - tile * TPB)):
acc += a_shared[local_row, k] * b_shared[k, local_col]
barrier()
# Write out final result
if global_row < size and global_col < size:
out[global_row, global_col] = acc
The tiled implementation with LayoutTensor handles large matrices efficiently by processing them in blocks. Here’s a comprehensive analysis:
Implementation Architecture
-
Layout Configuration:
alias layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
- Defines row-major layout for all tensors
- Ensures consistent memory access patterns
- Enables efficient 2D indexing
-
Shared Memory Setup:
a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc() b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
- Uses TensorBuilder for structured allocation
- Maintains row-major layout for consistency
- Enables efficient tile processing
-
Thread and Block Organization:
local_row = thread_idx.x local_col = thread_idx.y global_row = block_idx.x * TPB + local_row global_col = block_idx.y * TPB + local_col
- Maps threads to matrix elements
- Handles 2D indexing efficiently
- Maintains proper boundary checks
Tile Processing Pipeline
-
Tile Iteration:
@parameter for tile in range((size + TPB - 1) // TPB):
- Compile-time unrolled loop
- Handles matrix size not divisible by TPB
- Processes matrix in TPB×TPB tiles
-
Shared Memory Reset:
if local_row < TPB and local_col < TPB: a_shared[local_row, local_col] = 0 b_shared[local_row, local_col] = 0
- Clears previous tile data
- Ensures clean state for new tile
- Prevents data corruption
-
Tile Loading:
# Load A tile if global_row < size and (tile * TPB + local_col) < size: a_shared[local_row, local_col] = a[global_row, tile * TPB + local_col] # Load B tile if (tile * TPB + local_row) < size and global_col < size: b_shared[local_row, local_col] = b[tile * TPB + local_row, global_col]
- Handles boundary conditions
- Uses LayoutTensor indexing
- Maintains memory coalescing
-
Computation:
@parameter for k in range(min(TPB, size - tile * TPB)): acc += a_shared[local_row, k] * b_shared[k, local_col]
- Processes current tile
- Uses shared memory for efficiency
- Handles partial tiles correctly
Memory Access Optimization
-
Global Memory Pattern:
A[global_row, tile * TPB + local_col] → Coalesced reads B[tile * TPB + local_row, global_col] → Transposed access
- Maximizes memory coalescing
- Minimizes bank conflicts
- Efficient for transposed access
-
Shared Memory Usage:
a_shared[local_row, k] → Row-wise access b_shared[k, local_col] → Column-wise access
- Optimized for matrix multiplication
- Reduces bank conflicts
- Enables data reuse
Synchronization and Safety
-
Barrier Points:
barrier() # After shared memory reset barrier() # After tile loading barrier() # After computation
- Ensures shared memory consistency
- Prevents race conditions
- Maintains thread cooperation
-
Boundary Handling:
if global_row < size and global_col < size: out[global_row, global_col] = acc
- Prevents out-of-bounds access
- Handles matrix edges
- Safe result writing
Performance Characteristics
-
Memory Efficiency:
- Reduced global memory traffic through tiling
- Efficient shared memory reuse
- Coalesced memory access patterns
-
Computational Throughput:
- High data locality in shared memory
- Efficient thread utilization
- Minimal thread divergence
-
Scalability:
- Handles arbitrary matrix sizes
- Efficient for large matrices
- Good thread occupancy
Key Optimizations
-
Layout Optimization:
- Row-major layout for all tensors
- Efficient 2D indexing
- Proper alignment
-
Memory Access:
- Coalesced global memory loads
- Efficient shared memory usage
- Minimal bank conflicts
-
Computation:
- Register-based accumulation
- Compile-time loop unrolling
- Efficient tile processing
This implementation achieves high performance through:
- Efficient use of LayoutTensor for memory access
- Optimal tiling strategy
- Proper thread synchronization
- Careful boundary handling
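To make the tiling loop structure concrete, here is a plain-Python sketch (a CPU model, not part of the puzzle files) that walks the same block/tile/thread loops for the 8×8 case, assuming A is filled with 0..63 in row-major order, and reproduces the first row of the expected buffer:
# CPU sketch of the tiling structure: process the output in TPB x TPB tiles,
# accumulating partial products tile by tile along the k dimension.
TPB, SIZE = 3, 8
A = [[float(i * SIZE + j) for j in range(SIZE)] for i in range(SIZE)]
B = [[A[j][i] for j in range(SIZE)] for i in range(SIZE)]   # transpose(A)

out = [[0.0] * SIZE for _ in range(SIZE)]
num_tiles = (SIZE + TPB - 1) // TPB                          # 3 tiles of width 3 cover 8 columns

for block_i in range(num_tiles):                             # block row
    for block_j in range(num_tiles):                         # block column
        for tile in range(num_tiles):                        # slide along the k dimension
            for local_i in range(TPB):
                for local_j in range(TPB):
                    gi = block_i * TPB + local_i
                    gj = block_j * TPB + local_j
                    if gi >= SIZE or gj >= SIZE:
                        continue
                    for k in range(TPB):
                        gk = tile * TPB + k
                        if gk < SIZE:                        # partial tile at the edge
                            out[gi][gj] += A[gi][gk] * B[gk][gj]

print(out[0][:4])  # [140.0, 364.0, 588.0, 812.0] - matches the expected buffer above
On the GPU, the two innermost index pairs (local_i, local_j) run as threads within a block and the tile loads go through shared memory; this CPU model only mirrors the loop structure and the boundary checks.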