
Mojo🔥 GPU Puzzles


🚧 This book is a work in progress! Some sections may be incomplete or subject to change. 🚧

“For the things we have to learn before we can do them, we learn by doing them.” (Aristotle, Nicomachean Ethics)

Welcome to Mojo 🔥 GPU Puzzles, a hands-on guide to mastering GPU programming using Mojo 🔥 — the innovative programming language that combines Pythonic syntax with systems-level performance. GPU programming remains one of the most powerful skills in modern computing, driving advances in artificial intelligence, scientific simulation, and high-performance computing.

This book takes a unique approach to teaching GPU programming: learning by solving increasingly challenging puzzles. Rather than traditional textbook learning, you’ll immediately start writing real GPU code and seeing the results.

The early chapters of this book are heavily inspired by GPU Puzzles, an interactive CUDA learning project by Sasha Rush. This adaptation reimplements these concepts using Mojo’s powerful abstractions and performance capabilities, while expanding on advanced topics with Mojo-specific optimizations.

Why Mojo 🔥 for GPU Programming?

The computing industry has reached a critical inflection point. We can no longer rely on new CPU generations to automatically increase application performance through higher clock speeds. As power and heat constraints have plateaued CPU speeds, hardware manufacturers have shifted toward increasing the number of physical cores. This multi-core revolution has reached its zenith in modern GPUs, which contain thousands of cores operating in parallel. The NVIDIA H100, for example, can execute an astonishing 16,896 threads simultaneously, with over 270,000 threads resident and ready for execution.

Mojo represents a fresh approach to GPU programming, making this massive parallelism more accessible and productive:

  • Python-like Syntax with systems programming capabilities that feels familiar to the largest programming community
  • Zero-cost Abstractions that compile to efficient machine code without sacrificing performance
  • Strong Type System that catches errors at compile time while maintaining expressiveness
  • Built-in Tensor Support with hardware-aware optimizations specifically designed for GPU computation
  • Direct Access to low-level CPU and GPU intrinsics for systems-level programming
  • Cross-Hardware Portability allowing you to write code that can run on both CPUs and GPUs
  • Ergonomic and Safety Improvements over traditional C/C++ GPU programming
  • Lower Barrier to Entry enabling more programmers to harness GPU power effectively

Mojo 🔥 aims to fuel innovation by democratizing GPU programming. By expanding on Python’s familiar syntax while adding direct GPU access, Mojo empowers programmers with minimal specialized knowledge to build high-performance, heterogeneous (CPU and GPU) applications.

The GPU Programming Mindset

Effective GPU programming requires a fundamental shift in how we think about computation. Here are some key mental models that will guide your journey:

From Sequential to Parallel: Eliminating Loops with Threads

In traditional CPU programming, we process data sequentially through loops:

# CPU approach
for i in range(data_size):
    result[i] = process(data[i])

With GPUs, we flip this model entirely. Instead of moving sequentially through data, we map thousands of parallel threads directly onto the data:

# GPU approach (conceptual)
thread_id = get_global_id()
if thread_id < data_size:
    result[thread_id] = process(data[thread_id])

Each thread becomes responsible for computing a single element, eliminating the need for explicit loops. This mental shift—from “stepping through data” to “blanketing data with compute”—is central to GPU programming.

Fitting a Mesh of Compute Over Data

Imagine your data as a grid, and GPU threads as another grid that overlays it. Your task is to design this “compute mesh” to efficiently cover your data:

  • Threads: Individual compute units that process single data elements
  • Blocks: Organized groups of threads that share fast memory
  • Grid: The entire collection of blocks that covers your dataset

The art of GPU programming lies in crafting this mesh to maximize parallelism while respecting memory and synchronization constraints.
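
To make the mesh concrete, here is a minimal sketch in the style of the kernels you’ll write later in the book (the kernel name and the doubling operation are made up for illustration; the index arithmetic and imports mirror what the puzzles use):

from gpu import block_dim, block_idx, thread_idx
from memory import UnsafePointer

alias dtype = DType.float32


fn double_element(
    out: UnsafePointer[Scalar[dtype]],
    data: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # Global position = block offset (block_dim.x * block_idx.x) + position within the block
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # The compute mesh may extend past the data, so guard the edge
    if global_i < size:
        out[global_i] = data[global_i] * 2.0

Launched with enough blocks that blocks × threads per block ≥ data size, every element is covered by exactly one thread.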

Data Movement vs. Computation

In GPU programming, data movement is often more expensive than computation:

  • Moving data between CPU and GPU is slow
  • Moving data between global and shared memory is faster
  • Operating on data already in registers or shared memory is extremely fast

This inverts another common assumption in programming: computation is no longer the bottleneck—data movement is.
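
To put rough numbers on this (a hand-worked illustration, not a benchmark from this book): in the pooling puzzle later on, each output element needs three inputs. Reading all three directly from global memory would cost three global reads per output, while staging each block’s data in shared memory first costs only one global read per element, with the remaining accesses served from much faster shared memory. The computation (two additions) is trivial either way; how the data moves is what changes.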

Through the puzzles in this book, you’ll develop an intuitive understanding of these principles, transforming how you approach computational problems.

What You Will Learn

This book takes you on a journey from first principles to advanced GPU programming techniques. Rather than treating the GPU as a mysterious black box, we’ll build your understanding layer by layer—starting with how individual threads operate and culminating in sophisticated parallel algorithms. By mastering both low-level memory management and high-level tensor abstractions, you’ll gain the versatility to tackle any GPU programming challenge.

Your learning path includes:

  • GPU Programming Fundamentals: Thread organization, memory hierarchies, and kernel execution models
  • Dual Implementation Paths: Beginning with raw memory approaches using pointers, then transitioning to LayoutTensor abstractions
  • Memory Management: Working with global, shared, and thread-local memory for optimal performance
  • Low-level to High-level Progression: Understanding the foundation with UnsafePointer before leveraging LayoutTensor’s elegant abstractions
  • Layout Tensors: Mastering Mojo’s powerful tensor abstractions for simplified, efficient GPU computation
  • Parallel Algorithms: Implementing and optimizing parallel reductions, convolutions, matrix operations, and more
  • Performance Optimization: Advanced techniques for memory coalescing, tiling, bank conflict avoidance, and minimizing thread divergence
  • Real-world Applications: Applying these concepts to machine learning, signal processing, and computational tasks

The book deliberately departs from the usual approach: it first builds understanding through low-level memory manipulation, then gradually transitions to Mojo’s powerful LayoutTensor abstractions. This gives you both a deep understanding of GPU memory patterns and practical knowledge of modern tensor-based approaches.

How to Use This Book

Each puzzle follows a consistent format designed to progressively build your skills:

  • Overview: Clear problem statement and key concepts introduced in each puzzle
  • Configuration: Setup parameters and memory organization specific to each challenge
  • Code to Complete: Skeleton code with specific sections for you to implement
  • Tips: Optional hints if you get stuck, without giving away complete solutions
  • Solution: Detailed explanations of the implementation, performance considerations, and underlying concepts

The puzzles gradually increase in complexity, introducing new concepts while reinforcing fundamentals. We recommend solving them in order, as later puzzles build on skills developed in earlier ones.

Running the code

All puzzles are designed to be run with the provided testing framework that verifies your implementation against expected results. Each puzzle includes instructions for running the code and validating your solution.

Prerequisites

Compatible GPU

You’ll need a compatible GPU to run the examples.

Setting up your environment

Clone the GitHub repository and make sure you have the magic CLI installed to be able to run the Mojo programs:

# Clone the repository
git clone https://github.com/modular/mojo-gpu-puzzles
cd mojo-gpu-puzzles

# Install magic CLI (if not already installed)
curl -ssL https://magic.modular.com/ | bash

# Or update if already installed
magic self-update

Knowledge prerequisites

Basic knowledge of:

  • Programming fundamentals (variables, loops, conditionals, functions)
  • Parallel computing concepts (threads, synchronization, race conditions)
  • Basic familiarity with Mojo (the language basics and intro to pointers sections)
  • A tour of GPU basics in Mojo (helpful, but not required)

No prior GPU programming experience is necessary! We’ll build that knowledge through the puzzles.

Let’s begin our journey into the exciting world of GPU computing with Mojo 🔥!

Join the community


Join our vibrant community to discuss GPU programming, share solutions, and get help!

Puzzle 1: Map

Overview

GPU programming is all about parallelism. In this puzzle, each thread will process a single element of the input array independently. Implement a kernel that adds 10 to each position of vector a and stores it in vector out.

Note: You have 1 thread per position.

Map

Key concepts

  • Basic GPU kernel structure
  • One-to-one thread to data mapping
  • Memory access patterns
  • Array operations on GPU

For each position \(i\): \[\Large out[i] = a[i] + 10\]

What we cover

🔰 Raw Memory Approach

Start with direct memory manipulation to understand GPU fundamentals.

💡 Preview: Modern Approach with LayoutTensor

See how LayoutTensor simplifies GPU programming with safer, cleaner code.

💡 Tip: Understanding both approaches helps you better appreciate modern GPU programming patterns.

Key concepts

In this puzzle, you’ll learn about:

  • Basic GPU kernel structure

  • Thread indexing with thread_idx.x

  • Simple parallel operations

  • Parallelism: Each thread executes independently

  • Thread indexing: Access element at position i = thread_idx.x

  • Memory access: Read from a[i] and write to out[i]

  • Data independence: Each output depends only on its corresponding input

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32


fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    local_i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View full file: problems/p01/p01.mojo

Tips
  1. Store thread_idx.x in local_i
  2. Add 10 to a[local_i]
  3. Store result in out[local_i]

Running the code

To test your solution, run the following command in your terminal:

magic run p01

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    local_i = thread_idx.x
    out[local_i] = a[local_i] + 10.0


This solution:

  • Gets thread index with local_i = thread_idx.x
  • Adds 10 to input value: out[local_i] = a[local_i] + 10.0

Why Consider LayoutTensor?

Looking at our traditional implementation above, you might notice some potential issues:

Current approach

local_i = thread_idx.x
out[local_i] = a[local_i] + 10.0

This works for 1D arrays, but what happens when we need to:

  • Handle 2D or 3D data?
  • Deal with different memory layouts?
  • Ensure coalesced memory access?

Preview of future challenges

As we progress through the puzzles, array indexing will become more complex:

# 2D indexing coming in later puzzles
idx = row * WIDTH + col

# 3D indexing
idx = (batch * HEIGHT + row) * WIDTH + col

# With padding
idx = (batch * padded_height + row) * padded_width + col

LayoutTensor preview

LayoutTensor will help us handle these cases more elegantly:

# Future preview - don't worry about this syntax yet!
out[i, j] = a[i, j] + 10.0  # 2D indexing
out[b, i, j] = a[b, i, j] + 10.0  # 3D indexing

We’ll learn about LayoutTensor in detail in Puzzle 4, where these concepts become essential. For now, focus on understanding:

  • Basic thread indexing
  • Simple memory access patterns
  • One-to-one mapping of threads to data

💡 Key Takeaway: While direct indexing works for simple cases, we’ll soon need more sophisticated tools for complex GPU programming patterns.

Puzzle 2: Zip

Overview

Implement a kernel that adds together each position of vector a and vector b and stores it in out.

Note: You have 1 thread per position.

Zip

Key concepts

In this puzzle, you’ll learn about:

  • Processing multiple input arrays in parallel
  • Element-wise operations with multiple inputs
  • Thread-to-data mapping across arrays
  • Memory access patterns with multiple arrays

For each thread \(i\): \[\Large out[i] = a[i] + b[i]\]

Memory access pattern

Thread 0:  a[0] + b[0] → out[0]
Thread 1:  a[1] + b[1] → out[1]
Thread 2:  a[2] + b[2] → out[2]
...

💡 Note: Notice how we’re now managing three arrays (a, b, out) in our kernel. As we progress to more complex operations, managing multiple array accesses will become increasingly challenging.

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32


fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    local_i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View full file: problems/p02/p02.mojo

Tips
  1. Store thread_idx.x in local_i
  2. Add a[local_i] and b[local_i]
  3. Store result in out[local_i]

Running the code

To test your solution, run the following command in your terminal:

magic run p02

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 2.0, 4.0, 6.0])

Solution

fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    local_i = thread_idx.x
    out[local_i] = a[local_i] + b[local_i]


This solution:

  • Gets thread index with local_i = thread_idx.x
  • Adds values from both arrays: out[local_i] = a[local_i] + b[local_i]

Looking ahead

While this direct indexing works for simple element-wise operations, consider:

  • What if arrays have different layouts?
  • What if we need to broadcast one array to another?
  • How to ensure coalesced access across multiple arrays?

These questions will be addressed when we introduce LayoutTensor in Puzzle 4.

Puzzle 3: Guards

Overview

Implement a kernel that adds 10 to each position of vector a and stores it in vector out.

Note: You have more threads than positions. This means you need to protect against out-of-bounds memory access.

Guard

Key concepts

In this puzzle, you’ll learn about:

  • Handling thread/data size mismatches
  • Preventing out-of-bounds memory access
  • Using conditional execution in GPU kernels
  • Safe memory access patterns

Mathematical Description

For each thread \(i\): \[\Large \text{if}\ i < \text{size}: out[i] = a[i] + 10\]

Memory Safety Pattern

Thread 0 (i=0):  if 0 < size:  out[0] = a[0] + 10  ✓ Valid
Thread 1 (i=1):  if 1 < size:  out[1] = a[1] + 10  ✓ Valid
Thread 2 (i=2):  if 2 < size:  out[2] = a[2] + 10  ✓ Valid
Thread 3 (i=3):  if 3 < size:  out[3] = a[3] + 10  ✓ Valid
Thread 4 (i=4):  if 4 < size:  ❌ Skip (out of bounds)
Thread 5 (i=5):  if 5 < size:  ❌ Skip (out of bounds)

💡 Note: Boundary checking becomes increasingly complex with:

  • Multi-dimensional arrays
  • Different array shapes
  • Complex access patterns

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (8, 1)
alias dtype = DType.float32


fn add_10_guard(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p03/p03.mojo

Tips
  1. Store thread_idx.x in local_i
  2. Add guard: if local_i < size
  3. Inside guard: out[local_i] = a[local_i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p03

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10_guard(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    if local_i < size:
        out[local_i] = a[local_i] + 10.0


This solution:

  • Gets thread index with local_i = thread_idx.x
  • Guards against out-of-bounds access with if local_i < size
  • Inside guard: adds 10 to input value

Looking ahead

While simple boundary checks work here, consider these challenges:

  • What about 2D/3D array boundaries?
  • How to handle different shapes efficiently?
  • What if we need padding or edge handling?

Example of growing complexity:

# Current: 1D bounds check
if i < size: ...

# Coming soon: 2D bounds check
if i < height and j < width: ...

# Later: 3D with padding
if (i < height and j < width and k < depth
    and i >= padding and j >= padding): ...

These boundary handling patterns will become more elegant when we learn about LayoutTensor in Puzzle 4, which provides built-in boundary checking and shape management.

Puzzle 4: 2D Map

Overview

Implement a kernel that adds 10 to each position of 2D square matrix a and stores it in 2D square matrix out.

Note: You have more threads than positions.

2D Matrix Mapping

Key concepts

  • 2D thread indexing
  • Matrix operations on GPU
  • Handling excess threads
  • Memory layout patterns

For each position \((i,j)\): \[\Large out[i,j] = a[i,j] + 10\]

Implementation approaches

🔰 Raw memory approach

Learn how 2D indexing works with manual memory management.

📚 Learn about LayoutTensor

Discover a powerful abstraction that simplifies multi-dimensional array operations and memory management on GPU.

🚀 Modern 2D operations

Put LayoutTensor into practice with natural 2D indexing and automatic bounds checking.

💡 Note: From this puzzle onward, we’ll primarily use LayoutTensor for cleaner, safer GPU code.

Key concepts

In this puzzle, you’ll learn about:

  • Working with 2D thread indices (thread_idx.x, thread_idx.y)
  • Converting 2D coordinates to 1D memory indices
  • Handling boundary checks in two dimensions

The key insight is understanding how to map from 2D thread coordinates \((i,j)\) to elements in a row-major matrix of size \(n \times n\), while ensuring thread indices are within bounds.

  • 2D indexing: Each thread has a unique \((i,j)\) position
  • Memory layout: Row-major ordering maps 2D to 1D memory
  • Guard condition: Need bounds checking in both dimensions
  • Thread bounds: More threads \((3 \times 3)\) than matrix elements \((2 \times 2)\)
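
As a quick hand-worked check of the row-major mapping (using this puzzle’s convention that local_i = thread_idx.x and local_j = thread_idx.y): with size = 2, thread (local_i, local_j) = (1, 0) writes to flat index local_j * size + local_i = 0 * 2 + 1 = 1, and thread (0, 1) writes to 1 * 2 + 0 = 2. Any thread with local_i = 2 or local_j = 2 fails the guard and writes nothing.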

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p04/p04.mojo

Tips
  1. Get 2D indices: local_i = thread_idx.x, local_j = thread_idx.y
  2. Add guard: if local_i < size and local_j < size
  3. Inside guard: out[local_j * size + local_i] = a[local_j * size + local_i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p04

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_j * size + local_i] = a[local_j * size + local_i] + 10.0


This solution:

  • Gets 2D thread indices with local_i = thread_idx.x, local_j = thread_idx.y
  • Guards against out-of-bounds with if local_i < size and local_j < size
  • Inside guard: adds 10 to input value using row-major indexing

Introduction to LayoutTensor

After dealing with manual indexing, bounds checking, and growing complexity in the previous puzzles, it’s time to introduce a powerful abstraction that will make GPU programming more intuitive and safer.

Why LayoutTensor?

Let’s look at the challenges we’ve faced:

# Puzzle 1: Simple indexing
out[local_i] = a[local_i] + 10.0

# Puzzle 2: Multiple array management
out[local_i] = a[local_i] + b[local_i]

# Puzzle 3: Bounds checking
if local_i < size:
    out[local_i] = a[local_i] + 10.0

As dimensions grow, code becomes more complex:

# Traditional 2D indexing for row-major 2D matrix
idx = row * WIDTH + col
if row < height and col < width:
    out[idx] = a[idx] + 10.0

The LayoutTensor solution

LayoutTensor provides:

  1. Natural Indexing: Use tensor[i, j] instead of manual offset calculations
  2. Automatic Bounds Checking: Built-in protection against out-of-bounds access
  3. Flexible Memory Layouts: Support for row-major, column-major, and tiled organizations
  4. Performance Optimization: Efficient memory access patterns for GPU

Basic usage

from layout import Layout, LayoutTensor

# Define layout
alias HEIGHT = 2
alias WIDTH = 3
alias layout = Layout.row_major(HEIGHT, WIDTH)

# Create tensor
tensor = LayoutTensor[dtype, layout](buffer.unsafe_ptr())

# Access elements naturally
tensor[0, 0] = 1.0  # First element
tensor[1, 2] = 2.0  # Last element

Memory layout control

LayoutTensor supports different memory organizations:

# Row-major (default)
layout_row = Layout.row_major(HEIGHT, WIDTH)

# Column-major
layout_col = Layout.col_major(HEIGHT, WIDTH)

# Tiled (for better cache utilization)
layout_tiled = tensor.tiled[4, 4](HEIGHT, WIDTH)

Understanding memory layouts

Memory layout affects performance dramatically. LayoutTensor supports:

  • Row-major: Elements in a row are contiguous

    # [1 2 3]
    # [4 5 6] -> [1 2 3 4 5 6]
    layout_row = Layout.row_major(2, 3)
    
  • Column-major: Elements in a column are contiguous

    # [1 2 3]
    # [4 5 6] -> [1 4 2 5 3 6]
    layout_col = Layout.col_major(2, 3)
    
  • Tiled: Elements grouped in tiles for cache efficiency

    # [[1 2] [3 4]] in 2x2 tiles
    layout_tiled = Layout.tiled[2, 2](4, 4)
    

Benefits over traditional approach

  1. Readability:

    # Traditional
    out[row * WIDTH + col] = a[row * WIDTH + col] + 10.0
    
    # LayoutTensor
    out[row, col] = a[row, col] + 10.0
    
  2. Flexibility:

    • Easy to change memory layouts without modifying computation code
    • Support for complex access patterns
    • Built-in optimizations

Advanced features preview

While we’ll start with basic operations, LayoutTensor’s true power shines in advanced GPU optimizations:

1. Memory hierarchy management

# Shared memory allocation
shared_mem = tb[dtype]().row_major[BM, BK]().shared().alloc()

# Register allocation
reg_tile = tb[dtype]().row_major[TM, TN]().local().alloc()

2. Tiling strategies

# Block tiling
block_tile = tensor.tile[BM, BN](block_idx.y, block_idx.x)

# Register tiling
reg_tile = block_tile.tile[TM, TN](thread_row, thread_col)

3. Memory access patterns

# Vectorized access
vec_tensor = tensor.vectorize[1, simd_width]()

# Asynchronous transfers
copy_dram_to_sram_async[thread_layout=layout](dst, src)

4. Hardware acceleration

# Tensor Core operations (coming in later puzzles)
mma_op = TensorCore[dtype, out_type, Index(M, N, K)]()
result = mma_op.mma_op(a_reg, b_reg, c_reg)

💡 Looking Ahead: As we progress through the puzzles, you’ll learn how to:

  • Use shared memory for faster data access
  • Implement efficient tiling strategies
  • Leverage vectorized operations
  • Utilize hardware accelerators
  • Optimize memory access patterns

Each puzzle will introduce these concepts gradually, building on the fundamentals to create highly optimized GPU code.

Ready to start your journey from basic operations to advanced GPU programming? Let’s begin with the fundamentals!

Quick example

Let’s put everything together with a simple example that demonstrates the basics of LayoutTensor:

from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

alias HEIGHT = 2
alias WIDTH = 3
alias dtype = DType.float32
alias layout = Layout.row_major(HEIGHT, WIDTH)

fn kernel[dtype: DType, layout: Layout](tensor: LayoutTensor[mut=True, dtype, layout]):
    print("Before:")
    print(tensor)
    tensor[0, 0] += 1
    print("After:")
    print(tensor)

def main():
    ctx = DeviceContext()

    a = ctx.enqueue_create_buffer[dtype](HEIGHT * WIDTH).enqueue_fill(0)
    tensor = LayoutTensor[mut=True, dtype, layout](a.unsafe_ptr())
    # Note: since `tensor` is a device tensor we can't print it without the kernel wrapper
    ctx.enqueue_function[kernel[dtype, layout]](tensor, grid_dim=1, block_dim=1)

    ctx.synchronize()

When we run this code with magic run layout_tensor_intro, we see:

Before:
0.0 0.0 0.0
0.0 0.0 0.0
After:
1.0 0.0 0.0
0.0 0.0 0.0

Let’s break down what’s happening:

  1. We create a 2 x 3 tensor with row-major layout
  2. Initially, all elements are zero
  3. Using natural indexing, we modify a single element
  4. The change is reflected in our output

This simple example demonstrates key LayoutTensor benefits:

  • Clean syntax for tensor creation and access
  • Automatic memory layout handling
  • Built-in bounds checking
  • Natural multi-dimensional indexing

While this example is straightforward, the same patterns will scale to complex GPU operations in upcoming puzzles. You’ll see how these basic concepts extend to:

  • Multi-threaded GPU operations
  • Shared memory optimizations
  • Complex tiling strategies
  • Hardware-accelerated computations

Ready to start your GPU programming journey with LayoutTensor? Let’s dive into the puzzles!

💡 Tip: Keep this example in mind as we progress - we’ll build upon these fundamental concepts to create increasingly sophisticated GPU programs.

LayoutTensor Version

Overview

Implement a kernel that adds 10 to each position of 2D LayoutTensor a and stores it in 2D LayoutTensor out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for 2D array access
  • Direct 2D indexing with tensor[i, j]
  • Handling bounds checking with LayoutTensor

The key insight is that LayoutTensor provides a natural 2D indexing interface, abstracting away the underlying memory layout while still requiring bounds checking.

  • 2D access: Natural \((i,j)\) indexing with LayoutTensor
  • Memory abstraction: No manual row-major calculation needed
  • Guard condition: Still need bounds checking in both dimensions
  • Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn add_10_2d(
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p04/p04_layout_tensor.mojo

Tips
  1. Get 2D indices: local_i = thread_idx.x, local_j = thread_idx.y
  2. Add guard: if local_i < size and local_j < size
  3. Inside guard: out[local_i, local_j] = a[local_i, local_j] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p04_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10_2d(
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_i, local_j] = a[local_i, local_j] + 10.0


This solution:

  • Gets 2D thread indices with local_i = thread_idx.x, local_j = thread_idx.y
  • Guards against out-of-bounds with if local_i < size and local_j < size
  • Uses LayoutTensor’s 2D indexing: out[local_i, local_j] = a[local_i, local_j] + 10.0

Puzzle 5: Broadcast

Overview

Implement a kernel that broadcast adds vector a and vector b and stores it in 2D matrix out.

Note: You have more threads than positions.

Broadcast visualization

Key concepts

  • Broadcasting vectors to matrix
  • 2D thread management
  • Mixed dimension operations
  • Memory layout patterns

Implementation approaches

🔰 Raw memory approach

Learn how to handle broadcasting with manual memory indexing.

📐 LayoutTensor Version

Use LayoutTensor to handle mixed-dimension operations.

💡 Note: Notice how LayoutTensor simplifies broadcasting compared to manual indexing.

Key concepts

In this puzzle, you’ll learn about:

  • Broadcasting 1D vectors across different dimensions
  • Using 2D thread indices for broadcast operations
  • Handling boundary conditions in broadcast patterns

The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.

  • Broadcasting: Each element of a combines with each element of b
  • Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
  • Vector access: Different access patterns for a and b
  • Bounds checking: Guard against threads outside matrix dimensions
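
As a hand-worked example (assuming the test inputs are a = [0, 1] and b = [0, 1], consistent with the expected output shown below): out[j * size + i] = a[i] + b[j] gives out[0] = 0 + 0 = 0, out[1] = 1 + 0 = 1, out[2] = 0 + 1 = 1, and out[3] = 1 + 1 = 2, i.e. [0, 1, 1, 2].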

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05.mojo

Tips
  1. Get 2D indices: local_i = thread_idx.x, local_j = thread_idx.y
  2. Add guard: if local_i < size and local_j < size
  3. Inside guard: out[local_j * size + local_i] = a[local_i] + b[local_j]

Running the code

To test your solution, run the following command in your terminal:

magic run p05

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_j * size + local_i] = a[local_i] + b[local_j]


This solution:

  • Gets 2D thread indices with local_i = thread_idx.x, local_j = thread_idx.y
  • Guards against out-of-bounds with if local_i < size and local_j < size
  • Broadcasts by adding a[local_i] and b[local_j] into the output matrix

LayoutTensor Version

Overview

Implement a kernel that broadcast adds LayoutTensor vector a and LayoutTensor vector b and stores it in LayoutTensor out.

Note: You have more threads than positions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for broadcast operations
  • Working with different tensor shapes
  • Handling 2D indexing with LayoutTensor

The key insight is that LayoutTensor allows natural broadcasting through different tensor shapes: \((n,1)\) and \((1,n)\) to \((n,n)\), while still requiring bounds checking.

  • Tensor shapes: Input vectors have shapes \((n,1)\) and \((1,n)\)
  • Broadcasting: Output combines both dimensions to \((n,n)\)
  • Guard condition: Still need bounds checking for output size
  • Thread bounds: More threads \((3 \times 3)\) than tensor elements \((2 \times 2)\)

Code to complete

alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, 1)
alias b_layout = Layout.row_major(1, SIZE)


fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    b: LayoutTensor[mut=True, dtype, b_layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p05/p05_layout_tensor.mojo

Tips
  1. Get 2D indices: local_i = thread_idx.x, local_j = thread_idx.y
  2. Add guard: if local_i < size and local_j < size
  3. Inside guard: out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

Running the code

To test your solution, run the following command in your terminal:

magic run p05_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 1.0, 2.0])

Solution

fn broadcast_add[
    out_layout: Layout,
    a_layout: Layout,
    b_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    b: LayoutTensor[mut=True, dtype, b_layout],
    size: Int,
):
    local_i = thread_idx.x
    local_j = thread_idx.y
    if local_i < size and local_j < size:
        out[local_i, local_j] = a[local_i, 0] + b[0, local_j]


This solution:

  • Gets 2D thread indices with local_i = thread_idx.x, local_j = thread_idx.y
  • Guards against out-of-bounds with if local_i < size and local_j < size
  • Uses LayoutTensor broadcasting: a[local_i, 0] + b[0, local_j]

Puzzle 6: Blocks

Overview

Implement a kernel that adds 10 to each position of vector a and stores it in out.

Note: You have fewer threads per block than the size of a.

Blocks visualization

Key concepts

In this puzzle, you’ll learn about:

  • Processing data larger than thread block size
  • Coordinating multiple blocks of threads
  • Computing global thread positions

The key insight is understanding how blocks of threads work together to process data that’s larger than a single block’s capacity, while maintaining correct element-to-thread mapping.
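
To make the block arithmetic concrete with the configuration below (SIZE = 9, 3 blocks of 4 threads): thread 2 of block 1 computes global_i = block_dim.x * block_idx.x + thread_idx.x = 4 * 1 + 2 = 6, while threads 1 through 3 of block 2 compute global_i = 9, 10, and 11 and are skipped by the guard, since only indices 0 through 8 are valid.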

Code to complete

alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32


fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p06/p06.mojo

Tips
  1. Calculate global index: global_i = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if global_i < size
  3. Inside guard: out[global_i] = a[global_i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p06

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

Solution

fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    if global_i < size:
        out[global_i] = a[global_i] + 10.0


This solution:

  • Computes global thread index from block and thread indices
  • Guards against out-of-bounds with if global_i < size
  • Inside guard: adds 10 to input value at global index

Puzzle 7: 2D Blocks

Overview

Implement a kernel that adds 10 to each position of matrix a and stores it in out.

Note: You have fewer threads per block than the size of a in both directions.

Blocks 2D visualization

Key concepts

  • Block-based processing
  • Grid-block coordination
  • Multi-block indexing
  • Memory access patterns

Implementation approaches

🔰 Raw memory approach

Learn how to handle multi-block operations with manual indexing.

📐 LayoutTensor Version

Use LayoutTensor features to elegantly handle block-based processing.

💡 Note: See how LayoutTensor simplifies block coordination and memory access patterns.

Key concepts

In this puzzle, you’ll learn about:

  • Working with 2D block and thread arrangements
  • Handling matrix data larger than block size
  • Converting between 2D and linear memory access

The key insight is understanding how to coordinate multiple blocks of threads to process a 2D matrix that’s larger than a single block’s dimensions.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • 2D blocks: Each block processes a \(3 \times 3\) region
  • Grid layout: Blocks arranged in \(2 \times 2\) grid
  • Total threads: \(36\) for \(25\) elements
  • Memory pattern: Row-major storage for 2D data
  • Coverage: Ensuring all matrix elements are processed
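
As a hand-worked example of the 2D indexing above: thread (1, 2) of block (1, 0) computes global_i = 3 * 1 + 1 = 4 and global_j = 3 * 0 + 2 = 2, so it updates element (4, 2), which lives at flat index global_j * size + global_i = 2 * 5 + 4 = 14. Thread (2, 2) of block (1, 1) lands at (5, 5), which fails the guard; this is how 36 threads safely cover 25 elements.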

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07.mojo

Tips
  1. Calculate global indices: global_i = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if global_i < size and global_j < size
  3. Inside guard: out[global_j * size + global_i] = a[global_j * size + global_i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p07

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    if global_i < size and global_j < size:
        out[global_j * size + global_i] = a[global_j * size + global_i] + 10.0


This solution:

  • Computes global indices with block_dim * block_idx + thread_idx
  • Guards against out-of-bounds with if global_i < size and global_j < size
  • Uses row-major indexing to access and update matrix elements

LayoutTensor Version

Overview

Implement a kernel that adds 10 to each position of 2D LayoutTensor a and stores it in 2D LayoutTensor out.

Note: You have fewer threads per block than the size of a in both directions.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor with multiple blocks
  • Handling large matrices with 2D block organization
  • Combining block indexing with LayoutTensor access

The key insight is that LayoutTensor simplifies 2D indexing while still requiring proper block coordination for large matrices.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • Layout handling: LayoutTensor manages row-major organization
  • Block coordination: Multiple blocks cover the full matrix
  • 2D indexing: Natural \((i,j)\) access with bounds checking
  • Total threads: \(36\) for \(25\) elements
  • Thread mapping: Each thread processes one matrix element

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07_layout_tensor.mojo

Tips
  1. Calculate global indices: global_i = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if global_i < size and global_j < size
  3. Inside guard: out[global_i, global_j] = a[global_i, global_j] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p07_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    if global_i < size and global_j < size:
        out[global_i, global_j] = a[global_i, global_j] + 10.0


This solution:

  • Computes global indices with block_dim * block_idx + thread_idx
  • Guards against out-of-bounds with if global_i < size and global_j < size
  • Uses LayoutTensor’s 2D indexing: out[global_i, global_j] = a[global_i, global_j] + 10.0

Puzzle 8: Shared Memory

Overview

Implement a kernel that adds 10 to each position of a and stores it in out.

Note: You have fewer threads per block than the size of a.

Shared memory visualization

Implementation approaches

🔰 Raw memory approach

Learn how to manually manage shared memory and synchronization.

📐 LayoutTensor Version

Use LayoutTensor’s built-in shared memory management features.

💡 Note: Experience how LayoutTensor simplifies shared memory operations while maintaining performance.

Key concepts

In this puzzle, you’ll learn about:

  • Using shared memory within thread blocks
  • Synchronizing threads with barriers
  • Managing block-local data storage

The key insight is understanding how shared memory provides fast, block-local storage that all threads in a block can access, requiring careful coordination between threads.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Notes:

  • Shared memory: Fast storage shared by threads in a block
  • Thread sync: Coordination using barrier()
  • Memory scope: Shared memory only visible within block
  • Access pattern: Local vs global indexing

Warning: Each block can only use a fixed amount of shared memory that threads in that block can read and write. Its size must be a compile-time constant (such as an alias), not a runtime variable. After writing to shared memory you need to call barrier() to ensure all threads in the block have finished writing before any thread reads.

Code to complete

alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32


fn add_10_shared(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # wait for all threads to complete
    # works within a thread block
    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08.mojo

Tips
  1. Wait for shared memory load with barrier()
  2. Use local_i to access shared memory: shared[local_i]
  3. Use global_i for output: out[global_i]
  4. Add guard: if global_i < size

Running the code

To test your solution, run the following command in your terminal:

magic run p08

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # local data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # wait for all threads to complete
    # works within a thread block
    barrier()

    # process using shared memory
    if global_i < size:
        out[global_i] = shared[local_i] + 10


This solution:

  • Waits for shared memory load with barrier()
  • Guards against out-of-bounds with if global_i < size
  • Reads from shared memory using shared[local_i]
  • Writes result to global memory at out[global_i]

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor’s shared memory features
  • Thread synchronization with shared memory
  • Block-local data management with tensor builder

The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 4
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Key differences from raw approach

  1. Memory allocation: We will use LayoutTensorBuild instead of stack_allocation

    # Raw approach
    shared = stack_allocation[TPB * sizeof[dtype](), ...]()
    
    # LayoutTensor approach
    shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
    
  2. Memory access: Same syntax

    # Raw approach
    shared[local_i] = a[global_i]
    
    # LayoutTensor approach
    shared[local_i] = a[global_i]
    
  3. Safety features:

    • Type safety
    • Layout management
    • Memory alignment handling

Note: LayoutTensor handles memory layout, but you still need to manage thread synchronization with barrier() when using shared memory.

Code to complete

alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)


View full file: problems/p08/p08_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Synchronize with barrier()
  4. Process data using shared memory indices
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

magic run p08_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])

Solution

fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i < size:
        out[global_i] = shared[local_i] + 10


This solution:

  • Creates shared memory using tensor builder’s fluent API
  • Guards against out-of-bounds with if global_i < size
  • Uses natural indexing for both shared and global memory
  • Ensures thread synchronization with barrier()
  • Leverages LayoutTensor’s built-in safety features

Key steps:

  1. Allocate shared memory with proper layout
  2. Load global data into shared memory
  3. Synchronize threads
  4. Process data using shared memory
  5. Write results back to global memory

Puzzle 9: Pooling

Overview

Implement a kernel that sums together the last 3 positions of vector a and stores it in vector out.

Note: You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Pooling visualization

Implementation approaches

🔰 Raw memory approach

Learn how to implement sliding window operations with manual memory management and synchronization.

📐 LayoutTensor Version

Use LayoutTensor’s features for efficient window-based operations and shared memory management.

💡 Note: See how LayoutTensor simplifies sliding window operations while maintaining efficient memory access patterns.

Key concepts

In this puzzle, you’ll learn about:

  • Using shared memory for sliding window operations
  • Handling boundary conditions in pooling
  • Coordinating thread access to neighboring elements

The key insight is understanding how to efficiently access a window of elements using shared memory, with special handling for the first elements in the sequence.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Window size: 3 elements
  • Shared memory: TPB elements

Notes:

  • Window access: Each output depends on the current element and up to two previous elements
  • Edge handling: First two positions need special treatment
  • Memory pattern: One shared memory load per thread
  • Thread sync: Coordination before window operations
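
As a hand-worked example (assuming the test input is a = [0, 1, ..., 7], consistent with the expected output shown below): out[0] = a[0] = 0, out[1] = a[0] + a[1] = 1, and for i ≥ 2, out[i] = a[i-2] + a[i-1] + a[i], e.g. out[4] = 2 + 3 + 4 = 9, giving [0, 1, 3, 6, 9, 12, 15, 18].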

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32


fn pooling(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


View full file: problems/p09/p09.mojo

Tips
  1. Load data and call barrier()
  2. Special cases: out[0] = shared[0], out[1] = shared[0] + shared[1]
  3. General case: if 1 < global_i < size
  4. Sum three elements: shared[local_i - 2] + shared[local_i - 1] + shared[local_i]

Running the code

To test your solution, run the following command in your terminal:

magic run p09

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

Solution

fn pooling(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i == 0:
        out[0] = shared[0]
    elif global_i == 1:
        out[1] = shared[0] + shared[1]

    if 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


The solution implements a sliding window sum using shared memory with these key steps:

  1. Shared Memory Setup:

    • Allocates TPB elements in shared memory
    • Each thread loads one element from global memory
    • Uses barrier() to ensure all data is loaded
  2. Boundary Cases:

    • Position 0: out[0] = shared[0] (only first element)
    • Position 1: out[1] = shared[0] + shared[1] (sum of first two elements)
  3. Main Window Operation:

    • For positions 2 and beyond: out[i] = shared[i-2] + shared[i-1] + shared[i]
    • Uses local indices for shared memory access
    • Maintains coalesced memory access pattern
  4. Memory Access Pattern:

    • One global read per thread into shared memory
    • One global write per thread from shared memory
    • Uses shared memory for efficient neighbor access
    • Avoids redundant global memory loads

This approach is efficient because:

  • Minimizes global memory access
  • Uses shared memory for fast neighbor lookups
  • Handles boundary conditions without branching in the main case
  • Maintains good memory coalescing

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for sliding window operations
  • Managing shared memory with the LayoutTensorBuild API (tb) that we saw in Puzzle 8
  • Efficient neighbor access patterns
  • Boundary condition handling

The key insight is how LayoutTensor simplifies shared memory management while maintaining efficient window-based operations.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Window size: 3 elements
  • Shared memory: TPB elements

Notes:

  • Tensor builder: Use tb[dtype]().row_major[TPB]().shared().alloc() (tb is the LayoutTensorBuild builder)
  • Window access: Natural indexing for 3-element windows
  • Edge handling: Special cases for first two positions
  • Memory pattern: One shared memory load per thread

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


View full file: problems/p09/p09_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Handle special cases for first two elements
  4. Use shared memory for window operations
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

magic run p09_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

Solution

fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        out[0] = shared[0]
    if global_i == 1:
        out[1] = shared[0] + shared[1]

    # Handle general case
    if 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


The solution implements a sliding window sum using LayoutTensor with these key steps:

  1. Shared Memory Setup:

    • Uses tensor builder for clean shared memory allocation
    • Natural indexing for data loading
    • Thread synchronization with barrier()
  2. Boundary Cases:

    • Position 0: Single element output
    • Position 1: Sum of first two elements
    • Clean indexing with LayoutTensor bounds checking
  3. Main Window Operation:

    • Natural indexing for 3-element window
    • Safe access to neighboring elements
    • Automatic bounds checking
  4. Memory Access Pattern:

    • Efficient shared memory usage
    • Type-safe operations
    • Layout-aware indexing
    • Automatic alignment handling

Benefits over raw approach:

  • Cleaner shared memory allocation
  • Safer memory access
  • Natural indexing syntax
  • Built-in bounds checking
  • Layout management

Puzzle 10: Dot Product

Overview

Implement a kernel that computes the dot-product of vector a and vector b and stores it in out.

Note: You have 1 thread per position. You only need 2 global reads and 1 global write per thread.

Dot product visualization

Key concepts

In this puzzle, you’ll learn about:

  • Implementing parallel reduction operations
  • Using shared memory for intermediate results
  • Coordinating threads for collective operations

The key insight is understanding how to efficiently combine multiple values into a single result using parallel computation and shared memory.

Configuration

  • Vector size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Output size: 1 element
  • Shared memory: TPB elements

Notes:

  • Element access: Each thread reads corresponding elements from a and b
  • Partial results: Computing and storing intermediate values
  • Thread coordination: Synchronizing before combining results
  • Final reduction: Converting partial results to scalar output

Note: For this problem, you don’t need to worry about the number of shared memory reads. We will handle that challenge later.

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32


fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...


View full file: problems/p10/p10.mojo

Tips
  1. Store a[global_i] * b[global_i] in shared[local_i]
  2. Call barrier() to synchronize
  3. Use thread 0 to sum all products in shared memory
  4. Write final sum to out[0]

Running the code

To test your solution, run the following command in your terminal:

magic run p10

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

Solution

fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    barrier()

    # The following causes race condition: all threads writing to the same location
    # out[0] += shared[local_i]

    # Instead can do parallel reduction in shared memory as opposed to
    # global memory which has no guarantee on synchronization.
    # Loops using global memory can cause thread divergence because
    # fundamentally GPUs execute threads in warps (groups of 32 threads typically)
    # and warps can be scheduled independently.
    # However, shared memory does not have such issues as long as we use `barrier()`
    # correctly when we're in the same thread block.
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # only thread 0 writes the final result
    if local_i == 0:
        out[0] = shared[0]


The solution implements a parallel reduction algorithm for dot product computation using shared memory. Here’s a detailed breakdown:

Phase 1: Element-wise Multiplication

Each thread performs one multiplication:

Thread i: shared[i] = a[i] * b[i]

Phase 2: Parallel Reduction

The reduction uses a tree-based approach that halves active threads in each step:

Initial:  [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]

Key Implementation Features:

  1. Memory Access Pattern:

    • Each thread loads exactly two values from global memory (a[i], b[i])
    • Uses shared memory for intermediate results
    • Final result written once to global memory
  2. Thread Synchronization:

    • barrier() after initial multiplication
    • barrier() after each reduction step
    • Prevents race conditions between reduction steps
  3. Reduction Logic:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
    • Halves stride in each step
    • Only active threads perform additions
    • Maintains work efficiency
  4. Performance Considerations:

    • \(\log_2(n)\) steps for \(n\) elements
    • Coalesced memory access pattern
    • Minimal thread divergence
    • Efficient use of shared memory

This implementation achieves \(O(\log n)\) time complexity compared to \(O(n)\) in sequential execution, demonstrating the power of parallel reduction algorithms.
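
The reduction trace above can be reproduced on the CPU with a short Python sketch (illustrative only, not part of the puzzle files). Copying the values before each step plays the role of barrier(): every “thread” reads the previous step’s results.

# Simulate the tree reduction for the dot product (TPB = SIZE = 8)
TPB = 8
a = [float(i) for i in range(8)]
b = [float(i) for i in range(8)]
shared = [x * y for x, y in zip(a, b)]      # phase 1: per-thread products

stride = TPB // 2
while stride > 0:
    snapshot = shared[:]                    # models the barrier between steps
    for local_i in range(stride):           # only "threads" below stride are active
        shared[local_i] = snapshot[local_i] + snapshot[local_i + stride]
    stride //= 2

print(shared[0])                            # 140.0 -- matches HostBuffer([140.0])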

Barrier Synchronization Importance

The barrier() between reduction steps is critical for correctness. Here’s why:

Without barrier(), race conditions occur:

Initial shared memory: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49

Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 1 starts next step (stride = 2) before Thread 0 finishes
  and reads old value shared[0] = 0 instead of 16!

With barrier():

Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values

Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

The barrier() ensures:

  1. All writes from current step complete
  2. All threads see updated values
  3. No thread starts next iteration early
  4. Consistent shared memory state

Without these synchronization points, we could get:

  • Memory race conditions
  • Threads reading stale values
  • Non-deterministic results
  • Incorrect final sum

LayoutTensor Version

Key concepts

In this puzzle, you’ll learn about:

  • Similar to puzzles 8 and 9, implementing parallel reduction with LayoutTensor
  • Managing shared memory using LayoutTensorBuild
  • Coordinating threads for collective operations
  • Using layout-aware tensor operations

The key insight is how LayoutTensor simplifies memory management while maintaining efficient parallel reduction patterns.

Configuration

  • Vector size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Output size: 1 element
  • Shared memory: TPB elements

Notes:

  • Tensor builder: Use tb[dtype]().row_major[TPB]().shared().alloc() (imported as LayoutTensorBuild)
  • Element access: Natural indexing with bounds checking
  • Layout handling: Separate layouts for input and output
  • Thread coordination: Same synchronization patterns with barrier()

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(1)


fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, in_layout],
    b: LayoutTensor[mut=True, dtype, in_layout],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...


View full file: problems/p10/p10_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Store a[global_i] * b[global_i] in shared[local_i]
  3. Use parallel reduction pattern with barrier()
  4. Let thread 0 write final result to out[0]

Running the code

To test your solution, run the following command in your terminal:

magic run p10_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0])
expected: HostBuffer([140.0])

Solution

fn dot_product[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, in_layout],
    b: LayoutTensor[mut=True, dtype, in_layout],
    size: Int,
):
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Compute element-wise multiplication into shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Synchronize threads within block
    barrier()

    # Parallel reduction in shared memory
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        out[0] = shared[0]


The solution implements a parallel reduction for dot product using LayoutTensor. Here’s the detailed breakdown:

Phase 1: Element-wise Multiplication

Each thread performs one multiplication with natural indexing:

shared[local_i] = a[global_i] * b[global_i]

Phase 2: Parallel Reduction

Tree-based reduction with layout-aware operations:

Initial:  [0*0  1*1  2*2  3*3  4*4  5*5  6*6  7*7]
        = [0    1    4    9    16   25   36   49]

Step 1:   [0+16 1+25 4+36 9+49  16   25   36   49]
        = [16   26   40   58   16   25   36   49]

Step 2:   [16+40 26+58 40   58   16   25   36   49]
        = [56   84   40   58   16   25   36   49]

Step 3:   [56+84  84   40   58   16   25   36   49]
        = [140   84   40   58   16   25   36   49]

Key Implementation Features:

  1. Memory Management:

    • Clean shared memory allocation with tensor builder
    • Type-safe operations with LayoutTensor
    • Automatic bounds checking
    • Layout-aware indexing
  2. Thread Synchronization:

    • barrier() after initial multiplication
    • barrier() between reduction steps
    • Safe thread coordination
  3. Reduction Logic:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2
    
  4. Performance Benefits:

    • \(O(\log n)\) time complexity
    • Coalesced memory access
    • Minimal thread divergence
    • Efficient shared memory usage

The LayoutTensor version maintains the same efficient parallel reduction while providing:

  • Better type safety
  • Cleaner memory management
  • Layout awareness
  • Natural indexing syntax

Barrier Synchronization Importance

The barrier() between reduction steps is critical for correctness. Here’s why:

Without barrier(), race conditions occur:

Initial shared memory: [0 1 4 9 16 25 36 49]

Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49

Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 1 starts next step (stride = 2) before Thread 0 finishes
  and reads old value shared[0] = 0 instead of 16!

With barrier():

Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values

Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84

The barrier() ensures:

  1. All writes from current step complete
  2. All threads see updated values
  3. No thread starts next iteration early
  4. Consistent shared memory state

Without these synchronization points, we could get:

  • Memory race conditions
  • Threads reading stale values
  • Non-deterministic results
  • Incorrect final sum

Puzzle 11: 1D Convolution

Moving to LayoutTensor

So far in our GPU puzzle journey, we’ve been exploring two parallel approaches to GPU memory management:

  1. Raw memory management with direct pointer manipulation using UnsafePointer
  2. The more structured LayoutTensor and its related abstractions such as LayoutTensorBuild

Starting from this puzzle, we’re transitioning exclusively to using LayoutTensor. This abstraction provides several benefits:

  • Type-safe memory access patterns
  • Clear representation of data layouts
  • Better code maintainability
  • Reduced chance of memory-related bugs
  • More expressive code that better represents the underlying computations
  • A lot more … that we’ll uncover gradually!

This transition aligns with best practices in modern GPU programming in Mojo 🔥, where higher-level abstractions help manage complexity without sacrificing performance.

Overview

In signal processing and image analysis, convolution is a fundamental operation that combines two sequences to produce a third sequence. This puzzle challenges you to implement a 1D convolution on the GPU, where each output element is computed by sliding a kernel over an input array.

Implement a kernel that computes a 1D convolution between vector a and vector b and stores it in out using the LayoutTensor abstraction.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.

1D Convolution

For those new to convolution, think of it as a weighted sliding window operation. At each position, we multiply the kernel values with the corresponding input values and sum the results. In mathematical notation, this is often written as:

\[\Large out[i] = \sum_{j=0}^{\text{CONV}-1} a[i+j] \cdot b[j] \]

In pseudocode, 1D convolution is:

for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:
            ret[i] += a_host[i + j] * b_host[j]
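
The pseudocode above translates directly into a runnable CPU reference in Python (a sketch for checking results only; the helper name conv_reference is not part of the puzzle files):

# CPU reference for the 1D convolution defined above
def conv_reference(a, b):
    out = [0.0] * len(a)
    for i in range(len(a)):
        for j in range(len(b)):
            if i + j < len(a):
                out[i] += a[i + j] * b[j]
    return out

# Simple version inputs: a = [0..5], kernel b = [0, 1, 2]
print(conv_reference([0.0, 1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 2.0]))
# [5.0, 8.0, 11.0, 14.0, 5.0, 0.0] -- the expected output of the simple version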

This puzzle is split into two parts to help you build understanding progressively:

  • Simple version: Single block Start here to learn the basics of implementing convolution with shared memory in a single block using LayoutTensor.

  • Complete version: Block boundary Then tackle the more challenging case where data needs to be shared across block boundaries, leveraging LayoutTensor’s capabilities.

Each version presents unique challenges in terms of memory access patterns and thread coordination. The simple version helps you understand the basic convolution operation, while the complete version tests your ability to handle more complex scenarios that arise in real-world GPU programming.

Simple Version: Single Block

Key concepts

In this puzzle, you’ll learn about:

  • Implementing sliding window operations on GPUs
  • Managing data dependencies across threads
  • Using shared memory for overlapping regions

The key insight is understanding how to efficiently access overlapping elements while maintaining correct boundary conditions.

Configuration

  • Input array size: SIZE = 6 elements
  • Kernel size: CONV = 3 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Shared memory: Two arrays of size SIZE and CONV

Notes:

  • Data loading: Each thread loads one element from input and kernel
  • Memory pattern: Shared arrays for input and convolution kernel
  • Thread sync: Coordination before computation

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias MAX_CONV = 4
alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)


fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 13 lines)


View full file: problems/p11/p11.mojo

Tips
  1. Use tb[dtype]().row_major[SIZE]().shared().alloc() for shared memory allocation
  2. Load input to shared_a[local_i] and kernel to shared_b[local_i]
  3. Call barrier() after loading
  4. Sum products within bounds: if local_i + j < SIZE
  5. Write result if global_i < a_size

Running the code

To test your solution, run the following command in your terminal:

magic run p11 --simple

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])

Solution

fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]

    if global_i < b_size:
        shared_b[local_i] = b[global_i]

    barrier()

    # Note: this version is unsafe as it enforces no guard, so it could access `shared_a` beyond its bounds
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]

    # if global_i < a_size:
    #     out[global_i] = local_sum

    # Safe and correct:
    if global_i < a_size:
        # Note: declaring with `var` lets us annotate the accumulator with an explicit type
        # `out.element_type` is available on any LayoutTensor
        var local_sum: out.element_type = 0

        # Note: `@parameter` decorator unrolls the loop at compile time given `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: do we need this check for this specific example with fixed SIZE and CONV?
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The solution implements a 1D convolution using shared memory for efficient access to overlapping elements. Here’s a detailed breakdown:

Memory Layout

Input array a:   [0  1  2  3  4  5]
Kernel b:        [0  1  2]

Computation Steps

  1. Data Loading:

    shared_a: [0  1  2  3  4  5]  // Input array
    shared_b: [0  1  2]           // Convolution kernel
    
  2. Convolution Process for each position i:

    out[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
    out[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
    out[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
    out[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
    out[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
    out[5] = a[5]*b[0] + 0*b[1]   + 0*b[2]     = 5*0 + 0*1 + 0*2 = 0
    

Implementation Details

  1. Memory Safety Considerations:

    • The naive approach without proper bounds checking could be unsafe:

      # Unsafe version - could access shared_a beyond its bounds
      local_sum = Scalar[dtype](0)
      for j in range(CONV):
          if local_i + j < SIZE:
              local_sum += shared_a[local_i + j] * shared_b[j]
      
    • The safe and correct implementation:

      if global_i < a_size:
          var local_sum: out.element_type = 0  # Using var allows type inference
          @parameter  # Unrolls loop at compile time since CONV is constant
          for j in range(CONV):
              if local_i + j < SIZE:
                  local_sum += shared_a[local_i + j] * shared_b[j]
      
  2. Key Implementation Features:

    • Uses var for proper type inference with out.element_type
    • Employs @parameter decorator to unroll the convolution loop at compile time
    • Maintains strict bounds checking for memory safety
    • Leverages LayoutTensor’s type system for better code safety
  3. Memory Management:

    • Uses shared memory for both input array and kernel
    • Single load per thread from global memory
    • Efficient reuse of loaded data
  4. Thread Coordination:

    • barrier() ensures all data is loaded before computation
    • Each thread computes one output element
    • Maintains coalesced memory access pattern
  5. Performance Optimizations:

    • Minimizes global memory access
    • Uses shared memory for fast data access
    • Avoids thread divergence in main computation loop
    • Loop unrolling through @parameter decorator

Complete version: Block Boundary Case

Configuration

  • Input array size: SIZE_2 = 15 elements
  • Kernel size: CONV_2 = 4 elements
  • Threads per block: TPB = 8
  • Number of blocks: 2
  • Shared memory: TPB + CONV_2 - 1 elements for input

Notes:

  • Extended loading: Account for boundary overlap
  • Block edges: Handle data across block boundaries
  • Memory layout: Efficient shared memory usage
  • Synchronization: Proper thread coordination

Code to complete

alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


View full file: problems/p11/p11.mojo

Tips
  1. Use tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc() for shared memory
  2. Load main data: shared_a[local_i] = a[global_i]
  3. Load boundary: if local_i < CONV_2 - 1 handle next block data
  4. Load kernel: shared_b[local_i] = b[local_i]
  5. Sum within extended bounds: if local_i + j < TPB + CONV_2 - 1

Running the code

To test your solution, run the following command in your terminal:

magic run p11 --block-boundary

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])

Solution

fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < a_size:
            shared_a[TPB + local_i] = a[next_idx]

    if local_i < b_size:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < a_size:
        var local_sum: out.element_type = 0

        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:

Memory Layout and Sizing

Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements

Block 0 shared memory:  [0 1 2 3 4 5 6 7|8 9 10]  // TPB(8) + (CONV_2-1)(3) padding
Block 1 shared memory:  [8 9 10 11 12 13 14 0|0 0 0]  // Second block: 7 valid elements, rest is padding

Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
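
A small Python sketch (illustrative only, assuming the same sizes as above) makes the halo loading concrete: each block copies its TPB elements plus CONV_2 - 1 elements from the next block’s range, then convolves entirely out of its local copy.

# Simulate per-block shared memory with a halo of CONV_2 - 1 elements
TPB, CONV_2 = 8, 4
a = [float(i) for i in range(15)]            # SIZE_2 = 15
b = [float(i) for i in range(CONV_2)]        # kernel [0, 1, 2, 3]
out = [0.0] * len(a)

for block in range(2):
    shared = [0.0] * (TPB + CONV_2 - 1)      # main data + halo; unset slots stay 0
    for local_i in range(TPB + CONV_2 - 1):
        g = block * TPB + local_i
        if g < len(a):
            shared[local_i] = a[g]
    for local_i in range(TPB):               # one "thread" per output element
        g = block * TPB + local_i
        if g < len(a):
            out[g] = sum(shared[local_i + j] * b[j] for j in range(CONV_2))

print(out)
# [14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0]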

Implementation Details

  1. Shared Memory Allocation:

    # First: account for padding needed for convolution window
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    

    This allocation pattern ensures we have enough space for both the block’s data and the overlap region.

  2. Data Loading Strategy:

    # Main block data
    if global_i < a_size:
        shared_a[local_i] = a[global_i]
    
    # Boundary data from next block
    if local_i < CONV_2 - 1:
        next_idx = global_i + TPB
        if next_idx < a_size:
            shared_a[TPB + local_i] = a[next_idx]
    
    • Only threads with local_i < CONV_2 - 1 load boundary data
    • Prevents unnecessary thread divergence
    • Maintains memory coalescing for main data load
  3. Kernel Loading:

    if local_i < b_size:
        shared_b[local_i] = b[local_i]
    
    • Single load per thread
    • Bounded by kernel size
  4. Convolution Computation:

    if global_i < a_size:
        var local_sum: out.element_type = 0
        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]
    
    • Uses @parameter for compile-time loop unrolling
    • Proper type inference with out.element_type
    • Extended bounds check for overlap region

Memory Access Pattern Analysis

  1. Block 0 Access Pattern:

    Thread 0: [0 1 2 3] × [0 1 2 3]
    Thread 1: [1 2 3 4] × [0 1 2 3]
    Thread 2: [2 3 4 5] × [0 1 2 3]
    ...
    Thread 7: [7 8 9 10] × [0 1 2 3]  // Uses overlap data
    
  2. Block 1 Access Pattern:

    Thread 0: [8 9 10 11] × [0 1 2 3]
    Thread 1: [9 10 11 12] × [0 1 2 3]
    ...
    Thread 7: [14 0 0 0] × [0 1 2 3]  // Zero padding at end
    

Performance Optimizations

  1. Memory Coalescing:

    • Main data load: Consecutive threads access consecutive memory
    • Boundary data: Only necessary threads participate
    • Single barrier synchronization point
  2. Thread Divergence Minimization:

    • Clean separation of main and boundary loading
    • Uniform computation pattern within warps
    • Efficient bounds checking
  3. Shared Memory Usage:

    • Optimal sizing to handle block boundaries
    • No bank conflicts in access pattern
    • Efficient reuse of loaded data
  4. Boundary Handling:

    • Implicit zero padding at array end
    • Seamless block transition
    • Proper handling of edge cases

This implementation achieves efficient cross-block convolution while maintaining:

  • Memory safety through proper bounds checking
  • High performance through optimized memory access
  • Clean code structure using LayoutTensor abstractions
  • Minimal synchronization overhead

Puzzle 12: Prefix Sum

Overview

Prefix sum (also known as scan) is a fundamental parallel algorithm that computes running totals of a sequence. Found at the heart of many parallel applications - from sorting algorithms to scientific simulations - it transforms a sequence of numbers into their running totals. While simple to compute sequentially, making this efficient on a GPU requires clever parallel thinking!

Implement a kernel that computes a prefix-sum over vector a and stores it in out.

Note: If the size of a is greater than the block size, only store the sum of each block.

Prefix sum

Key concepts

In this puzzle, you’ll learn about:

  • Parallel algorithms with logarithmic complexity
  • Shared memory coordination patterns
  • Multi-phase computation strategies

The key insight is understanding how to transform a sequential operation into an efficient parallel algorithm using shared memory.

For example, given an input sequence \([3, 1, 4, 1, 5, 9]\), the prefix sum would produce:

  • \([3]\) (just the first element)
  • \([3, 4]\) (3 + 1)
  • \([3, 4, 8]\) (previous sum + 4)
  • \([3, 4, 8, 9]\) (previous sum + 1)
  • \([3, 4, 8, 9, 14]\) (previous sum + 5)
  • \([3, 4, 8, 9, 14, 23]\) (previous sum + 9)

Mathematically, for a sequence \([x_0, x_1, …, x_n]\), the prefix sum produces: \[ [x_0, x_0+x_1, x_0+x_1+x_2, …, \sum_{i=0}^n x_i] \]

While a sequential algorithm would need \(O(n)\) steps, our parallel approach will use a clever two-phase algorithm that completes in \(O(\log n)\) steps! Here’s a visualization of this process:

This puzzle is split into two parts to help you master the concept:

  • Simple Version Start with a single block implementation where all data fits in shared memory. This helps understand the core parallel algorithm.

  • Complete Version Then tackle the more challenging case of handling larger arrays that span multiple blocks, requiring coordination between blocks.

Each version builds on the previous one, helping you develop a deep understanding of parallel prefix sum computation. The simple version establishes the fundamental algorithm, while the complete version shows how to scale it to larger datasets - a common requirement in real-world GPU applications.

Simple Version

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Shared memory: TPB elements

Notes:

  • Data loading: Each thread loads one element using LayoutTensor access
  • Memory pattern: Shared memory for intermediate results using LayoutTensorBuild
  • Thread sync: Coordination between computation phases
  • Access pattern: Stride-based parallel computation
  • Type safety: Leveraging LayoutTensor’s type system

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
from math import log2


alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn prefix_sum_simple[
    layout: Layout
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 12 lines)


View full file: problems/p12/p12.mojo

Tips
  1. Load data into shared[local_i]
  2. Use offset = 1 and double it each step
  3. Add elements where local_i >= offset
  4. Call barrier() between steps

Running the code

To test your solution, run the following command in your terminal:

magic run p12 --simple

Your output will look like this if the puzzle isn’t solved yet:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0])

Solution

fn prefix_sum_simple[
    layout: Layout
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        if local_i >= offset and local_i < size:
            shared[local_i] += shared[local_i - offset]

        barrier()
        offset *= 2

    if global_i < size:
        out[global_i] = shared[local_i]


The parallel (inclusive) prefix-sum algorithm works as follows:

Setup & Configuration

  • TPB (Threads Per Block) = 8
  • SIZE (Array Size) = 8

Thread Mapping

  • thread_idx.x: \([0, 1, 2, 3, 4, 5, 6, 7]\) (local_i)
  • block_idx.x: \([0, 0, 0, 0, 0, 0, 0, 0]\)
  • global_i: \([0, 1, 2, 3, 4, 5, 6, 7]\) (block_idx.x * TPB + thread_idx.x)

Initial Load to Shared Memory

Threads:      T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇
Input array:  [0    1    2    3    4    5    6    7]
shared:       [0    1    2    3    4    5    6    7]
               ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑
              T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇

Offset = 1: First Parallel Step

Active threads: \(T_1 \ldots T_7\) (where local_i ≥ 1)

Before:      [0    1    2    3    4    5    6    7]
Add:              +0   +1   +2   +3   +4   +5   +6
                   |    |    |    |    |    |    |
Result:      [0    1    3    5    7    9    11   13]
                   ↑    ↑    ↑    ↑    ↑    ↑    ↑
                  T₁   T₂   T₃   T₄   T₅   T₆   T₇

Offset = 2: Second Parallel Step

Active threads: \(T_2 \ldots T_7\) (where local_i ≥ 2)

Before:      [0    1    3    5    7    9    11   13]
Add:                   +0   +1   +3   +5   +7   +9
                        |    |    |    |    |    |
Result:      [0    1    3    6    10   14   18   22]
                        ↑    ↑    ↑    ↑    ↑    ↑
                       T₂   T₃   T₄   T₅   T₆   T₇

Offset = 4: Third Parallel Step

Active threads: \(T_4 \ldots T_7\) (where local_i ≥ 4)

Before:      [0    1    3    6    10   14   18   22]
Add:                              +0   +1   +3   +6
                                  |    |    |    |
Result:      [0    1    3    6    10   15   21   28]
                                  ↑    ↑    ↑    ↑
                                  T₄   T₅   T₆   T₇

Final Write to Output

Threads:      T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇
global_i:     0    1    2    3    4    5    6    7
out[]:       [0    1    3    6    10   15   21   28]
              ↑    ↑    ↑    ↑    ↑    ↑    ↑    ↑
              T₀   T₁   T₂   T₃   T₄   T₅   T₆   T₇

Thread-by-Thread Execution

\(T_0\) (local_i=0):

  • Loads shared[0] = 0
  • Never adds (local_i < offset always)
  • Writes out[0] = 0

\(T_1\) (local_i=1):

  • Loads shared[1] = 1
  • offset=1: adds shared[0] → 1
  • offset=2,4: no action (local_i < offset)
  • Writes out[1] = 1

\(T_2\) (local_i=2):

  • Loads shared[2] = 2
  • offset=1: adds shared[1] → 3
  • offset=2: adds shared[0] → 3
  • offset=4: no action
  • Writes out[2] = 3

\(T_3\) (local_i=3):

  • Loads shared[3] = 3
  • offset=1: adds shared[2] → 5
  • offset=2: adds shared[1] → 6
  • offset=4: no action
  • Writes out[3] = 6

\(T_4\) (local_i=4):

  • Loads shared[4] = 4
  • offset=1: adds shared[3] → 7
  • offset=2: adds shared[2] → 10
  • offset=4: adds shared[0] → 10
  • Writes out[4] = 10

\(T_5\) (local_i=5):

  • Loads shared[5] = 5
  • offset=1: adds shared[4] → 9
  • offset=2: adds shared[3] → 14
  • offset=4: adds shared[1] → 15
  • Writes out[5] = 15

\(T_6\) (local_i=6):

  • Loads shared[6] = 6
  • offset=1: adds shared[5] → 11
  • offset=2: adds shared[4] → 18
  • offset=4: adds shared[2] → 21
  • Writes out[6] = 21

\(T_7\) (local_i=7):

  • Loads shared[7] = 7
  • offset=1: adds shared[6] → 13
  • offset=2: adds shared[5] → 22
  • offset=4: adds shared[3] → 28
  • Writes out[7] = 28

The solution ensures correct synchronization between phases using barrier() and handles array bounds checking with if global_i < size. The final result produces the inclusive prefix sum where each element \(i\) contains \(\sum_{j=0}^{i} a[j]\).
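
The same step-by-step behavior can be reproduced with a short CPU sketch in Python (illustrative only). Taking a snapshot before each step models the barrier(): every “thread” adds values from the previous step.

# Simulate the inclusive prefix sum for TPB = SIZE = 8
TPB = 8
shared = [float(i) for i in range(TPB)]

offset = 1
while offset < TPB:                          # 3 steps: offset = 1, 2, 4
    snapshot = shared[:]                     # models "read, then barrier()"
    for local_i in range(TPB):
        if local_i >= offset:
            shared[local_i] = snapshot[local_i] + snapshot[local_i - offset]
    offset *= 2

print(shared)
# [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0] -- the expected inclusive prefix sum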

Complete Version

Configuration

  • Array size: SIZE_2 = 15 elements
  • Threads per block: TPB = 8
  • Number of blocks: 2
  • Shared memory: TPB elements per block

Notes:

  • Block handling: Multiple blocks process array segments
  • Partial blocks: Last block may not be full
  • Block sums: Store running totals between blocks
  • Global result: Combine local and block sums
  • Layout safety: Consistent layout handling through LayoutTensor

Code to complete

alias SIZE_2 = 15
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)


fn prefix_sum[
    layout: Layout
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 19 lines)


View full file: problems/p12/p12.mojo

Tips
  1. Compute local prefix sums like in Simple Version
  2. Last thread stores block sum at TPB * (block_idx.x + 1)
  3. Add previous block’s sum to current block
  4. Handle array bounds for all operations

Running the code

To test your solution, run the following command in your terminal:

magic run p12 --complete

Your output will look like this if the puzzle isn’t solved yet:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0])

Solution

fn prefix_sum[
    layout: Layout
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # Idea: two passes
    # SIZE=15, TPB=8, BLOCKS=2
    # buffer: [0,1,2,...,7 | 8,...,14]

    # Step 1: Each block computes local prefix sum
    # Block 0: [0,1,2,3,4,5,6,7] → [0,1,3,6,10,15,21,28]
    # Block 1: [8,9,10,11,12,13,14] → [8,17,27,38,50,63,77]

    # Step 2: Store block sums
    # Block 0's sum (28) → position 8

    # Step 3: Add previous block's sum
    # Block 1: Each element += 28
    # [8,17,27,38,50,63,77] → [36,45,55,66,78,91,105]

    # Final result combines both blocks:
    # [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]

    # local prefix-sum for each block
    offset = 1
    for i in range(Int(log2(Scalar[dtype](TPB)))):
        if local_i >= offset and local_i < size:
            shared[local_i] += shared[local_i - offset]

        barrier()
        offset *= 2

    # store block results
    if global_i < size:
        out[global_i] = shared[local_i]

    # store block sum in first element of next block:
    # - Only last thread (local_i == 7) in each block except last block executes
    # - Block 0: Thread 7 stores 28 (sum of 0-7) at position 8 (start of Block 1)
    # - Calculation: TPB * (block_idx.x + 1)
    # Block 0: 8 * (0 + 1) = position 8
    # Block 1: No action (last block)
    # Memory state:
    # [0,1,3,6,10,15,21,28 | 28,17,27,38,50,63,77]
    #                        ↑
    #                       Block 0's sum stored here
    if local_i == TPB - 1 and block_idx.x < size // TPB - 1:
        out[TPB * (block_idx.x + 1)] = shared[local_i]

    # wait for all blocks to store their sums
    barrier()

    # second pass: add previous block's sum to block 1's local prefix sums:
    # Before: [8,17,27,38,50,63,77]
    # Add 28: [36,45,55,66,78,91,105]
    if block_idx.x > 0 and global_i < size:
        shared[local_i] += out[block_idx.x * TPB - 1]

    # final result
    if global_i < size:
        out[global_i] = shared[local_i]


This solution handles multi-block prefix sum in three main phases:
  1. Local prefix sum (per block):

    Block 0 (8 elements):    [0,1,2,3,4,5,6,7]
    After local prefix sum:  [0,1,3,6,10,15,21,28]
    
    Block 1 (7 elements):    [8,9,10,11,12,13,14]
    After local prefix sum:  [8,17,27,38,50,63,77]
    
  2. Block sum communication:

    • Last thread (local_i == TPB-1) in each non-final block
    • Stores its block’s sum at next block’s start:
    if local_i == TPB - 1 and block_idx.x < size // TPB - 1:
        out[TPB * (block_idx.x + 1)] = shared[local_i]
    
    • Block 0’s sum (28) stored at position 8
    • Memory layout: [0,1,3,6,10,15,21,28 | 28,17,27,38,50,63,77] ↑ Block 0’s sum
  3. Final adjustment:

    • Each block after first adds previous block’s sum
    if block_idx.x > 0 and global_i < size:
        shared[local_i] += out[block_idx.x * TPB - 1]
    
    • Block 1: Each element += 28
    • Final result: [0,1,3,6,10,15,21,28, 36,45,55,66,78,91,105]

Key implementation details:

  • Uses barrier() after shared memory operations
  • Handles partial blocks (last block size < TPB)
  • Guards all operations with proper bounds checking
  • Maintains correct thread and block synchronization
  • Achieves \(O(\log n)\) complexity per block

The solution scales to arbitrary-sized inputs by combining local prefix sums with efficient block-to-block communication.
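
As a sanity check, the three phases can be mirrored on the CPU with a Python sketch (illustrative only, assuming the same SIZE_2 = 15 and TPB = 8): a local inclusive scan per block, followed by adding the previous block’s total to every element of the later blocks.

# Simulate the multi-block prefix sum on the CPU
TPB, SIZE_2 = 8, 15
a = [float(i) for i in range(SIZE_2)]
out = [0.0] * SIZE_2

# Phase 1: local inclusive scan within each block
scanned = []
for start in range(0, SIZE_2, TPB):
    local, running = [], 0.0
    for x in a[start:start + TPB]:
        running += x
        local.append(running)
    scanned.append(local)

# Phases 2-3: carry the previous block's total into each later block
carry, pos = 0.0, 0
for local in scanned:
    for value in local:
        out[pos] = value + carry
        pos += 1
    carry = out[pos - 1]                    # this block's inclusive total

print(out)
# [0.0, 1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0, 36.0, 45.0, 55.0, 66.0, 78.0, 91.0, 105.0]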

Puzzle 13: Axis Sum

Overview

Implement a kernel that computes a sum over each row of 2D matrix a and stores it in out using LayoutTensor.

Axis Sum visualization

Key concepts

In this puzzle, you’ll learn about:

  • Parallel reduction along matrix dimensions using LayoutTensor
  • Using block coordinates for data partitioning
  • Efficient shared memory reduction patterns
  • Working with multi-dimensional tensor layouts

The key insight is understanding how to map thread blocks to matrix rows and perform efficient parallel reduction within each block while leveraging LayoutTensor’s dimensional indexing.

Configuration

  • Matrix dimensions: \(\text{BATCH} \times \text{SIZE} = 4 \times 6\)
  • Threads per block: \(\text{TPB} = 8\)
  • Grid dimensions: \(1 \times \text{BATCH}\)
  • Shared memory: \(\text{TPB}\) elements per block
  • Input layout: Layout.row_major(BATCH, SIZE)
  • Output layout: Layout.row_major(BATCH, 1)

Matrix visualization:

Row 0: [0, 1, 2, 3, 4, 5]       → Block(0,0)
Row 1: [6, 7, 8, 9, 10, 11]     → Block(0,1)
Row 2: [12, 13, 14, 15, 16, 17] → Block(0,2)
Row 3: [18, 19, 20, 21, 22, 23] → Block(0,3)

Code to Complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 8
alias BATCH = 4
alias SIZE = 6
alias BLOCKS_PER_GRID = (1, BATCH)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(BATCH, SIZE)
alias out_layout = Layout.row_major(BATCH, 1)


fn axis_sum[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    batch = block_idx.y
    # FILL ME IN (roughly 15 lines)


View full file: problems/p13/p13.mojo

Tips
  1. Use batch = block_idx.y to select row
  2. Load elements with 2D indexing: cache[local_i] = a[batch, local_i]
  3. Perform parallel reduction with halving stride
  4. Thread 0 writes final sum to out[batch]

Running the Code

To test your solution, run the following command in your terminal:

magic run p13

Your output will look like this if the puzzle isn’t solved yet:

out: DeviceBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([15.0, 51.0, 87.0, 123.0])

Solution

fn axis_sum[
    in_layout: Layout, out_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    batch = block_idx.y
    cache = tb[dtype]().row_major[TPB]().shared().alloc()

    # Visualize:
    # Block(0,0): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 0: [0,1,2,3,4,5]
    # Block(0,1): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 1: [6,7,8,9,10,11]
    # Block(0,2): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 2: [12,13,14,15,16,17]
    # Block(0,3): [T0,T1,T2,T3,T4,T5,T6,T7] -> Row 3: [18,19,20,21,22,23]

    # each row is handled by one block because grid_dim = (1, BATCH)

    if local_i < size:
        cache[local_i] = a[batch, local_i]
    else:
        # Zero-initialize padding elements for the later reduction
        cache[local_i] = 0

    barrier()

    # do reduction sum per each block
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            cache[local_i] += cache[local_i + stride]

        barrier()
        stride //= 2

    # thread 0 of each block writes the row sum for its batch
    if local_i == 0:
        out[batch, 0] = cache[0]


The solution implements a parallel row-wise sum reduction for a 2D matrix using LayoutTensor. Here’s a comprehensive breakdown:

Matrix Layout and Block Mapping

Input Matrix (4×6) with LayoutTensor:                Block Assignment:
[[ a[0,0]  a[0,1]  a[0,2]  a[0,3]  a[0,4]  a[0,5] ] → Block(0,0)
 [ a[1,0]  a[1,1]  a[1,2]  a[1,3]  a[1,4]  a[1,5] ] → Block(0,1)
 [ a[2,0]  a[2,1]  a[2,2]  a[2,3]  a[2,4]  a[2,5] ] → Block(0,2)
 [ a[3,0]  a[3,1]  a[3,2]  a[3,3]  a[3,4]  a[3,5] ] → Block(0,3)

Parallel Reduction Process

  1. Initial Data Loading:

    Block(0,0): cache = [a[0,0] a[0,1] a[0,2] a[0,3] a[0,4] a[0,5] * *]  // * = padding
    Block(0,1): cache = [a[1,0] a[1,1] a[1,2] a[1,3] a[1,4] a[1,5] * *]
    Block(0,2): cache = [a[2,0] a[2,1] a[2,2] a[2,3] a[2,4] a[2,5] * *]
    Block(0,3): cache = [a[3,0] a[3,1] a[3,2] a[3,3] a[3,4] a[3,5] * *]
    
  2. Reduction Steps (for Block 0,0), reproduced by the sketch below:

    Initial:  [0  1  2  3  4  5  0  0]   // padding zero-initialized
    Stride 4: [4  6  2  3  4  5  0  0]
    Stride 2: [6  9  2  3  4  5  0  0]
    Stride 1: [15 9  2  3  4  5  0  0]
    
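
For reference, a small Python sketch (illustrative only) reproduces this padded per-block reduction for row 0; the other rows work identically and give 51.0, 87.0, and 123.0.

# Simulate one block's padded reduction (row 0 of the 4x6 matrix)
TPB, SIZE = 8, 6
row = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
cache = row + [0.0] * (TPB - SIZE)           # zero-initialized padding

stride = TPB // 2
while stride > 0:
    snapshot = cache[:]                      # models the barrier between steps
    for local_i in range(stride):
        cache[local_i] = snapshot[local_i] + snapshot[local_i + stride]
    stride //= 2

print(cache[0])                              # 15.0 -- the sum of row 0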

Key Implementation Features:

  1. Layout Configuration:

    • Input: row-major layout (BATCH × SIZE)
    • Output: row-major layout (BATCH × 1)
    • Each block processes one complete row
  2. Memory Access Pattern:

    • LayoutTensor 2D indexing for input: a[batch, local_i]
    • Shared memory for efficient reduction
    • LayoutTensor 2D indexing for output: out[batch, 0]
  3. Parallel Reduction Logic:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            cache[local_i] += cache[local_i + stride]
        barrier()
        stride //= 2
    
  4. Output Writing:

    if local_i == 0:
        out[batch, 0] = cache[0]  --> One result per batch
    

Performance Optimizations:

  1. Memory Efficiency:

    • Coalesced memory access through LayoutTensor
    • Shared memory for fast reduction
    • Single write per row result
  2. Thread Utilization:

    • Perfect load balancing across rows
    • No thread divergence in main computation
    • Efficient parallel reduction pattern
  3. Synchronization:

    • Minimal barriers (only during reduction)
    • Independent processing between rows
    • No inter-block communication needed

Complexity Analysis:

  • Time: \(O(\log n)\) per row, where n is row length
  • Space: \(O(TPB)\) shared memory per block
  • Total parallel time: \(O(\log n)\) with sufficient threads

Puzzle 14: Matrix Multiplication (MatMul)

Overview

Matrix multiplication is a fundamental operation in scientific computing, machine learning, and graphics. Given two matrices \(A\) and \(B\), we want to compute their product \(C = A \times B.\)

For matrices \(A_{m\times k}\) and \(B_{k\times n}\), each element of the result \(C_{m\times n}\) is computed as:

\[\Large C_{ij} = \sum_{l=0}^{k-1} A_{il} \cdot B_{lj} \]

Matrix Multiply visualization

This puzzle explores different approaches to implementing matrix multiplication on GPUs, each with its own performance characteristics:

  • Naive Version The straightforward implementation where each thread computes one element of the output matrix. While simple to understand, this approach makes many redundant memory accesses.

  • Shared Memory Version Improves performance by loading blocks of input matrices into fast shared memory, reducing global memory accesses. Each thread still computes one output element but reads from shared memory.

  • Tiled Version Further optimizes by dividing the computation into tiles, allowing threads to cooperate on loading and computing blocks of the output matrix. This approach better utilizes memory hierarchy and thread cooperation.

Each version builds upon the previous one, introducing new optimization techniques common in GPU programming. You’ll learn how different memory access patterns and thread cooperation strategies affect performance.

The progression illustrates a common pattern in GPU optimization:

  1. Start with a correct but naive implementation
  2. Reduce global memory access with shared memory
  3. Improve data locality and thread cooperation with tiling
  4. Use high-level abstractions while maintaining performance

Choose a version to begin your matrix multiplication journey!

Naive Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) and stores the result in \(\text{out}\). This is the most straightforward implementation where each thread computes one element of the output matrix.

Key concepts

In this puzzle, you’ll learn about:

  • 2D thread organization for matrix operations
  • Global memory access patterns
  • Matrix indexing in row-major layout
  • Thread-to-output element mapping

The key insight is understanding how to map 2D thread indices to matrix elements and compute dot products in parallel.

Configuration

  • Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(1 \times 1\)

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE) (transpose of A)
  • Output: Layout.row_major(SIZE, SIZE)

Memory layout (LayoutTensor):

  • Matrix A: a[i, j] directly indexes position (i,j)
  • Matrix B: b[i, j] is the transpose of A
  • Output C: out[i, j] stores result at position (i,j)

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias TPB = 3
alias SIZE = 2
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, TPB)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE, SIZE)


fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 6 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Calculate global_i and global_j from thread indices
  2. Check if indices are within size
  3. Accumulate products in a local variable
  4. Write final sum to correct output position

Running the code

To test your solution, run the following command in your terminal:

magic run p14 --naive

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 3.0, 3.0, 13.0])

Solution

fn naive_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y

    if global_i < size and global_j < size:
        var acc: out.element_type = 0

        @parameter
        for k in range(size):
            acc += a[global_i, k] * b[k, global_j]

        out[global_i, global_j] = acc


The naive matrix multiplication using LayoutTensor demonstrates the basic approach:

Matrix Layout (2×2 example)

Matrix A:        Matrix B (transpose of A):    Output C:
 [a[0,0] a[0,1]]  [b[0,0] b[0,1]]             [c[0,0] c[0,1]]
 [a[1,0] a[1,1]]  [b[1,0] b[1,1]]             [c[1,0] c[1,1]]
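
To make the 2×2 example concrete, here is a tiny CPU check in Python (sketch only), assuming the iota-filled test input A = [[0, 1], [2, 3]], which reproduces the expected HostBuffer values:

# CPU check of C = A x transpose(A) for the 2x2 test case
A = [[0.0, 1.0],
     [2.0, 3.0]]
B = [[A[j][i] for j in range(2)] for i in range(2)]   # transpose(A)

C = [[0.0, 0.0], [0.0, 0.0]]
for i in range(2):
    for j in range(2):
        acc = 0.0
        for k in range(2):
            acc += A[i][k] * B[k][j]          # the dot product each GPU thread computes
        C[i][j] = acc

print(C)   # [[1.0, 3.0], [3.0, 13.0]] -- matches HostBuffer([1.0, 3.0, 3.0, 13.0])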

Implementation Details:

  1. Thread Mapping:

    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    
  2. Memory Access Pattern:

    • Direct 2D indexing: a[global_i, k]
    • Transposed access: b[k, global_j]
    • Output writing: out[global_i, global_j]
  3. Computation Flow:

    # Use var for mutable accumulator with tensor's element type
    var acc: out.element_type = 0
    
    # @parameter for compile-time loop unrolling
    @parameter
    for k in range(size):
        acc += a[global_i, k] * b[k, global_j]
    

Key Language Features:

  1. Variable Declaration:

    • The use of var in var acc: out.element_type = 0 gives the accumulator an explicit type; using out.element_type ensures type compatibility with the output tensor
    • Initialized to zero before accumulation
  2. Loop Optimization:

    • @parameter decorator unrolls the loop at compile time
    • Improves performance for small, known matrix sizes
    • Enables better instruction scheduling

Performance Characteristics:

  1. Memory Access:

    • Each thread makes 2 x SIZE global memory reads
    • One global memory write per thread
    • No data reuse between threads
  2. Computational Efficiency:

    • Simple implementation but suboptimal performance
    • Many redundant global memory accesses
    • No use of fast shared memory
  3. Limitations:

    • High global memory bandwidth usage
    • Poor data locality
    • Limited scalability for large matrices

This naive implementation serves as a baseline for understanding matrix multiplication on GPUs, highlighting the need for optimization in memory access patterns.

Shared Memory Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) and stores the result in \(\text{out}\), using shared memory to improve memory access efficiency. This version loads matrix blocks into shared memory before computation.

Key concepts

In this puzzle, you’ll learn about:

  • Block-local memory management with LayoutTensor
  • Thread synchronization patterns
  • Memory access optimization using shared memory
  • Collaborative data loading with 2D indexing
  • Efficient use of LayoutTensor for matrix operations

The key insight is understanding how to use fast shared memory with LayoutTensor to reduce expensive global memory operations.

Configuration

  • Matrix size: \(\text{SIZE} \times \text{SIZE} = 2 \times 2\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(1 \times 1\)

Layout configuration:

  • Input A: Layout.row_major(SIZE, SIZE)
  • Input B: Layout.row_major(SIZE, SIZE) (transpose of A)
  • Output: Layout.row_major(SIZE, SIZE)
  • Shared Memory: Two TPB × TPB LayoutTensors

Memory organization:

Global Memory (LayoutTensor):          Shared Memory (LayoutTensor):
A[i,j]: Direct access                  a_shared[local_i, local_j]
B[i,j]: Transposed access              b_shared[local_i, local_j]

Code to complete

fn single_block_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    local_i = thread_idx.x
    local_j = thread_idx.y
    # FILL ME IN (roughly 12 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Load matrices to shared memory using global and local indices
  2. Call barrier() after loading
  3. Compute dot product using shared memory indices
  4. Check array bounds for all operations

Running the code

To test your solution, run the following command in your terminal:

magic run p14 --single-block

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 3.0, 3.0, 13.0])

Solution

fn single_block_matmul[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    local_i = thread_idx.x
    local_j = thread_idx.y
    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()

    if global_i < size and global_j < size:
        a_shared[local_i, local_j] = a[global_i, global_j]
        b_shared[local_i, local_j] = b[global_i, global_j]

    barrier()

    if global_i < size and global_j < size:
        var acc: out.element_type = 0

        @parameter
        for k in range(size):
            acc += a_shared[local_i, k] * b_shared[k, local_j]

        out[global_i, global_j] = acc


The shared memory implementation with LayoutTensor improves performance through efficient memory access patterns:

Memory Organization

Input Tensors (2×2):                Shared Memory (3×3):
Matrix A:                           a_shared:
 [a[0,0] a[0,1]]                    [s[0,0] s[0,1] s[0,2]]
 [a[1,0] a[1,1]]                    [s[1,0] s[1,1] s[1,2]]
                                    [s[2,0] s[2,1] s[2,2]]
Matrix B (transpose):               b_shared: (similar layout)
 [b[0,0] b[0,1]]                    [t[0,0] t[0,1] t[0,2]]
 [b[1,0] b[1,1]]                    [t[1,0] t[1,1] t[1,2]]
                                    [t[2,0] t[2,1] t[2,2]]

Implementation Phases:

  1. Shared Memory Setup:

    # Create 2D shared memory tensors using TensorBuilder
    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    
  2. Thread Indexing:

    # Global indices for matrix access
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    
    # Local indices for shared memory
    local_i = thread_idx.x
    local_j = thread_idx.y
    
  3. Data Loading:

    # Load data into shared memory using LayoutTensor indexing
    if global_i < size and global_j < size:
        a_shared[local_i, local_j] = a[global_i, global_j]
        b_shared[local_i, local_j] = b[global_i, global_j]
    
  4. Computation with Shared Memory:

    # Guard ensures we only compute for valid matrix elements
    if global_i < size and global_j < size:
        # Initialize accumulator with output tensor's type
        var acc: out.element_type = 0
    
        # Compile-time unrolled loop for matrix multiplication
        @parameter
        for k in range(size):
            acc += a_shared[local_i, k] * b_shared[k, local_j]
    
        # Write result only for threads within matrix bounds
        out[global_i, global_j] = acc
    

    Key aspects:

    • Boundary Check: if global_i < size and global_j < size

      • Prevents out-of-bounds computation
      • Only valid threads perform work
      • Essential because the block has TPB × TPB = 3 × 3 threads but the matrix is only SIZE × SIZE = 2 × 2
    • Accumulator Type: var acc: out.element_type

      • Uses output tensor’s element type for type safety
      • Ensures consistent numeric precision
      • Initialized to zero before accumulation
    • Loop Optimization: @parameter for k in range(size) (see the short sketch after this list)

      • Unrolls the loop at compile time
      • Enables better instruction scheduling
      • Efficient for small, known matrix sizes
    • Result Writing: out[global_i, global_j] = acc

      • Protected by the same guard condition
      • Only valid threads write results
      • Maintains matrix bounds safety
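
To see the compile-time unrolling in isolation, here is a minimal CPU-side sketch (a hypothetical standalone example, not part of p14.mojo):

fn dot4(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var acc: Float32 = 0

    # @parameter replicates the loop body 4 times at compile time,
    # so no loop counter or branch remains at runtime
    @parameter
    for i in range(4):
        acc += a[i] * b[i]
    return acc


def main():
    var x = SIMD[DType.float32, 4](1, 2, 3, 4)
    print(dot4(x, x))  # 1 + 4 + 9 + 16 = 30.0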

Thread Safety and Synchronization:

  1. Guard Conditions:

    • Input Loading: if global_i < size and global_j < size
    • Computation: Same guard ensures thread safety
    • Output Writing: Protected by the same condition
    • Prevents invalid memory access and race conditions
  2. Memory Access Safety:

    • Shared memory: Accessed only within TPB bounds
    • Global memory: Protected by size checks
    • Output: Guarded writes prevent corruption

Key Language Features:

  1. LayoutTensor Benefits:

    • Direct 2D indexing simplifies code
    • Type safety through element_type
    • Efficient memory layout handling
  2. Shared Memory Allocation:

    • TensorBuilder for structured allocation
    • Row-major layout matching input tensors
    • Proper alignment for efficient access
  3. Synchronization:

    • barrier() ensures shared memory consistency
    • Proper synchronization between load and compute
    • Thread cooperation within block

Performance Optimizations:

  1. Memory Access Efficiency:

    • Single global memory load per element
    • Multiple reuse through shared memory
    • Coalesced memory access patterns
  2. Thread Cooperation:

    • Collaborative data loading
    • Shared data reuse
    • Efficient thread synchronization
  3. Computational Benefits:

    • Reduced global memory traffic
    • Better cache utilization
    • Improved instruction throughput

This implementation significantly improves performance over the naive version by:

  • Reducing global memory accesses (a rough count follows below)
  • Enabling data reuse through shared memory
  • Using efficient 2D indexing with LayoutTensor
  • Maintaining proper thread synchronization
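
As a rough measure of that reduction (a back-of-envelope count of reads of \(A\) and \(B\) only, ignoring hardware caching): the naive kernel issues \(2 \cdot \text{SIZE}\) global reads per output element, or \(2 \cdot \text{SIZE}^3\) in total, while the shared-memory version loads each input element exactly once, for \(2 \cdot \text{SIZE}^2\) global reads. With \(\text{SIZE} = 2\) that is 16 versus 8 global reads; the reads inside the dot-product loop are then served from fast shared memory instead.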

Tiled Matrix Multiplication

Overview

Implement a kernel that multiplies square matrices \(A\) and \(\text{transpose}(A)\) using tiled matrix multiplication with LayoutTensor. This approach handles large matrices by processing them in smaller chunks (tiles).

Key concepts

  • Matrix tiling with LayoutTensor for large-scale computation
  • Multi-block coordination with proper layouts
  • Efficient shared memory usage through TensorBuilder
  • Boundary handling for tiles with LayoutTensor indexing

Configuration

  • Matrix size: \(\text{SIZE_TILED} = 8\)
  • Threads per block: \(\text{TPB} \times \text{TPB} = 3 \times 3\)
  • Grid dimensions: \(3 \times 3\) blocks
  • Shared memory: Two \(\text{TPB} \times \text{TPB}\) LayoutTensors per block

Layout configuration:

  • Input A: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • Input B: Layout.row_major(SIZE_TILED, SIZE_TILED) (transpose of A)
  • Output: Layout.row_major(SIZE_TILED, SIZE_TILED)
  • Shared Memory: Two TPB × TPB LayoutTensors using TensorBuilder

Tiling strategy

Block organization

Grid Layout (3×3):           Thread Layout per Block (3×3):
[B00][B01][B02]               [T00 T01 T02]
[B10][B11][B12]               [T10 T11 T12]
[B20][B21][B22]               [T20 T21 T22]

Each block processes a tile using LayoutTensor indexing

Tile processing steps

  1. Load tile from matrix A into shared memory using LayoutTensor indexing
  2. Load corresponding tile from matrix B into shared memory
  3. Compute partial products using shared memory
  4. Accumulate results in registers
  5. Move to next tile
  6. Repeat until all tiles are processed (a worked example follows below)

Memory access pattern

For each tile using LayoutTensor:
  Input Tiles:                Output Computation:
    A[i:i+TPB, k:k+TPB]   ×    Result tile
    B[k:k+TPB, j:j+TPB]   →    C[i:i+TPB, j:j+TPB]
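
As a concrete illustration with SIZE_TILED = 8 and TPB = 3: the block with block_idx.x = 1, block_idx.y = 0 computes output rows 3-5 and columns 0-2, looping over ⌈8 / 3⌉ = 3 tiles:

  tile 0: A[3:6, 0:3] × B[0:3, 0:3]
  tile 1: A[3:6, 3:6] × B[3:6, 0:3]
  tile 2: A[3:6, 6:8] × B[6:8, 0:3]   (partial tile: only 2 of TPB rows/columns are in bounds)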

Code to complete

alias SIZE_TILED = 8
alias BLOCKS_PER_GRID_TILED = (3, 3)  # each block covers 3x3 elements
alias THREADS_PER_BLOCK_TILED = (TPB, TPB)
alias layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)


fn matmul_tiled[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    local_row = thread_idx.x
    local_col = thread_idx.y
    global_row = block_idx.x * TPB + local_row
    global_col = block_idx.y * TPB + local_col
    # FILL ME IN (roughly 20 lines)


View full file: problems/p14/p14.mojo

Tips
  1. Calculate global thread positions from block and thread indices
  2. Clear shared memory before loading new tiles
  3. Load tiles with proper bounds checking
  4. Accumulate results across tiles with proper synchronization

Running the code

To test your solution, run the following command in your terminal:

magic run p14 --tiled

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([140.0, 364.0, 588.0, 812.0, 1036.0, 1260.0, 1484.0, 1708.0, 364.0, 1100.0, 1836.0, 2572.0, 3308.0, 4044.0, 4780.0, 5516.0, 588.0, 1836.0, 3084.0, 4332.0, 5580.0, 6828.0, 8076.0, 9324.0, 812.0, 2572.0, 4332.0, 6092.0, 7852.0, 9612.0, 11372.0, 13132.0, 1036.0, 3308.0, 5580.0, 7852.0, 10124.0, 12396.0, 14668.0, 16940.0, 1260.0, 4044.0, 6828.0, 9612.0, 12396.0, 15180.0, 17964.0, 20748.0, 1484.0, 4780.0, 8076.0, 11372.0, 14668.0, 17964.0, 21260.0, 24556.0, 1708.0, 5516.0, 9324.0, 13132.0, 16940.0, 20748.0, 24556.0, 28364.0])

Solution

fn matmul_tiled[
    layout: Layout, size: Int
](
    out: LayoutTensor[mut=False, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
):
    local_row = thread_idx.x
    local_col = thread_idx.y
    global_row = block_idx.x * TPB + local_row
    global_col = block_idx.y * TPB + local_col

    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()

    var acc: out.element_type = 0

    # Iterate over tiles to compute matrix product
    @parameter
    for tile in range((size + TPB - 1) // TPB):
        # Reset shared memory tiles
        if local_row < TPB and local_col < TPB:
            a_shared[local_row, local_col] = 0
            b_shared[local_row, local_col] = 0

        barrier()

        # Load A tile - global row stays the same, col determined by tile
        if global_row < size and (tile * TPB + local_col) < size:
            a_shared[local_row, local_col] = a[
                global_row, tile * TPB + local_col
            ]

        # Load B tile - row determined by tile, global col stays the same
        if (tile * TPB + local_row) < size and global_col < size:
            b_shared[local_row, local_col] = b[
                tile * TPB + local_row, global_col
            ]

        barrier()

        # Matrix multiplication within the tile
        if global_row < size and global_col < size:

            @parameter
            for k in range(min(TPB, size - tile * TPB)):
                acc += a_shared[local_row, k] * b_shared[k, local_col]

        barrier()

    # Write out final result
    if global_row < size and global_col < size:
        out[global_row, global_col] = acc


The tiled implementation with LayoutTensor handles large matrices efficiently by processing them in blocks. Here’s a comprehensive analysis:

Implementation Architecture

  1. Layout Configuration:

    alias layout_tiled = Layout.row_major(SIZE_TILED, SIZE_TILED)
    
    • Defines row-major layout for all tensors
    • Ensures consistent memory access patterns
    • Enables efficient 2D indexing
  2. Shared Memory Setup:

    a_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    b_shared = tb[dtype]().row_major[TPB, TPB]().shared().alloc()
    
    • Uses TensorBuilder for structured allocation
    • Maintains row-major layout for consistency
    • Enables efficient tile processing
  3. Thread and Block Organization:

    local_row = thread_idx.x
    local_col = thread_idx.y
    global_row = block_idx.x * TPB + local_row
    global_col = block_idx.y * TPB + local_col
    
    • Maps threads to matrix elements
    • Handles 2D indexing efficiently
    • Maintains proper boundary checks

Tile Processing Pipeline

  1. Tile Iteration:

    @parameter
    for tile in range((size + TPB - 1) // TPB):
    
    • Compile-time unrolled loop
    • Handles matrix sizes not divisible by TPB (here ⌈8 / 3⌉ = 3 tiles, with the last tile only partially filled)
    • Processes matrix in TPB×TPB tiles
  2. Shared Memory Reset:

    if local_row < TPB and local_col < TPB:
        a_shared[local_row, local_col] = 0
        b_shared[local_row, local_col] = 0
    
    • Clears previous tile data
    • Ensures clean state for new tile
    • Prevents data corruption
  3. Tile Loading:

    # Load A tile
    if global_row < size and (tile * TPB + local_col) < size:
        a_shared[local_row, local_col] = a[global_row, tile * TPB + local_col]
    
    # Load B tile
    if (tile * TPB + local_row) < size and global_col < size:
        b_shared[local_row, local_col] = b[tile * TPB + local_row, global_col]
    
    • Handles boundary conditions
    • Uses LayoutTensor indexing
    • Maintains memory coalescing
  4. Computation:

    @parameter
    for k in range(min(TPB, size - tile * TPB)):
        acc += a_shared[local_row, k] * b_shared[k, local_col]
    
    • Processes current tile
    • Uses shared memory for efficiency
    • Handles partial tiles correctly: for the last tile, min(TPB, size - tile * TPB) = min(3, 8 - 6) = 2, so only k = 0, 1 contribute

Memory Access Optimization

  1. Global Memory Pattern:

    A[global_row, tile * TPB + local_col] → Coalesced reads
    B[tile * TPB + local_row, global_col] → Transposed access
    
    • Maximizes memory coalescing
    • Minimizes bank conflicts
    • Efficient for transposed access
  2. Shared Memory Usage:

    a_shared[local_row, k] → Row-wise access
    b_shared[k, local_col] → Column-wise access
    
    • Optimized for matrix multiplication
    • Reduces bank conflicts
    • Enables data reuse

Synchronization and Safety

  1. Barrier Points:

    barrier()  # After shared memory reset
    barrier()  # After tile loading
    barrier()  # After computation
    
    • Ensures shared memory consistency
    • Prevents race conditions
    • Maintains thread cooperation
  2. Boundary Handling:

    if global_row < size and global_col < size:
        out[global_row, global_col] = acc
    
    • Prevents out-of-bounds access
    • Handles matrix edges
    • Safe result writing

Performance Characteristics

  1. Memory Efficiency:

    • Reduced global memory traffic through tiling (quantified below)
    • Efficient shared memory reuse
    • Coalesced memory access patterns
  2. Computational Throughput:

    • High data locality in shared memory
    • Efficient thread utilization
    • Minimal thread divergence
  3. Scalability:

    • Handles arbitrary matrix sizes
    • Efficient for large matrices
    • Good thread occupancy
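
To put rough numbers on the reduced traffic (a back-of-envelope count that ignores hardware caching): each element of \(A\) is loaded from global memory once per block along the output's column dimension, i.e. ⌈SIZE / TPB⌉ times, instead of SIZE times as in a naive kernel. With SIZE_TILED = 8 and TPB = 3, total global reads of \(A\) and \(B\) drop from \(2 \cdot 8^3 = 1024\) to \(2 \cdot 8^2 \cdot 3 = 384\).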

Key Optimizations

  1. Layout Optimization:

    • Row-major layout for all tensors
    • Efficient 2D indexing
    • Proper alignment
  2. Memory Access:

    • Coalesced global memory loads
    • Efficient shared memory usage
    • Minimal bank conflicts
  3. Computation:

    • Register-based accumulation
    • Compile-time loop unrolling
    • Efficient tile processing

This implementation achieves high performance through:

  • Efficient use of LayoutTensor for memory access
  • Optimal tiling strategy
  • Proper thread synchronization
  • Careful boundary handling