Key concepts

In this puzzle, you’ll learn about:

  • Working with 2D block and thread arrangements
  • Handling matrix data larger than block size
  • Converting between 2D and linear memory access

The key insight is that a single block cannot cover the whole matrix, so several blocks must cooperate: each thread combines its block index and thread index into a global position, and threads that land outside the matrix must do nothing.

Configuration

  • Matrix size: \(5 \times 5\) elements
  • 2D blocks: Each block processes a \(3 \times 3\) region
  • Grid layout: Blocks arranged in \(2 \times 2\) grid
  • Total threads: \(36\) for \(25\) elements
  • Memory pattern: Row-major storage for 2D data
  • Coverage: Every matrix element is processed by exactly one thread
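
The grid size follows from a ceiling division of the matrix side by the block side. As a quick check of the figures above, writing \(N = 5\) for the matrix side and \(B = 3\) for the block side (symbols introduced here only for illustration):

\[
\left\lceil \frac{N}{B} \right\rceil = \left\lceil \frac{5}{3} \right\rceil = 2 \ \text{blocks per dimension}
\]

\[
(2 \cdot 3)^2 = 36 \ \text{threads launched}, \qquad 36 - 5^2 = 11 \ \text{threads outside the matrix}
\]

Those \(11\) surplus threads are the reason the kernel needs a bounds check.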

Code to complete

from gpu import thread_idx, block_idx, block_dim
from memory import UnsafePointer

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32


fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # Global column (x) and row (y) of this thread across all blocks.
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07.mojo

Tips
  1. Calculate global indices: global_i = block_dim.x * block_idx.x + thread_idx.x, and global_j likewise from the .y components (both are already provided)
  2. Add guard: if global_i < size and global_j < size
  3. Inside guard: out[global_j * size + global_i] = a[global_j * size + global_i] + 10.0
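
As a worked example of these tips (the thread chosen is arbitrary), take the thread with thread_idx = (1, 2) in the block with block_idx = (1, 0), using block_dim = (3, 3) and size = 5 from the configuration:

\[
\begin{aligned}
\text{global\_i} &= 3 \cdot 1 + 1 = 4 \\
\text{global\_j} &= 3 \cdot 0 + 2 = 2 \\
\text{offset} &= \text{global\_j} \cdot 5 + \text{global\_i} = 2 \cdot 5 + 4 = 14
\end{aligned}
\]

Both indices are below \(5\), so this thread passes the guard and updates element \(14\) of the flat buffer. Its neighbour with thread_idx = (2, 2) in the same block gets global_i \(= 3 \cdot 1 + 2 = 5\), fails the guard, and does nothing.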

Running the code

To test your solution, run the following command in your terminal:

magic run p07

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

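fn add_10_blocks_2d(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # Global column (x) and row (y) of this thread across all blocks.
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # 36 threads are launched for 25 elements, so skip out-of-range threads.
    if global_i < size and global_j < size:
        # Row-major offset: row * width + column.
        out[global_j * size + global_i] = a[global_j * size + global_i] + 10.0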


This solution:

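  • Computes global indices with block_dim * block_idx + thread_idx, separately for the x and y dimensions
  • Guards against out-of-bounds access with if global_i < size and global_j < size, since the grid launches more threads than there are elements
  • Uses row-major indexing (global_j * size + global_i) to access and update matrix elements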
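
As a sanity check on the coverage claim, the same index arithmetic can be replayed on the CPU with plain loops. This is only a sketch under stated assumptions: it does not launch a GPU kernel, the names count_writes, blocks and threads are hypothetical and not part of the puzzle files, and it assumes a square grid of square blocks as in this puzzle's configuration.

```mojo
# CPU-only sketch: replays a blocks x blocks grid of threads x threads blocks
# with nested loops. count_writes is a hypothetical helper, not puzzle code.


fn count_writes(size: Int, blocks: Int, threads: Int) -> List[Int]:
    # writes[k] counts how many simulated threads target linear offset k.
    var writes = List[Int]()
    for _ in range(size * size):
        writes.append(0)

    for block_y in range(blocks):
        for block_x in range(blocks):
            for thread_y in range(threads):
                for thread_x in range(threads):
                    # Same formulas as the kernel, with block_dim = threads.
                    var global_i = threads * block_x + thread_x
                    var global_j = threads * block_y + thread_y
                    # Same guard as the kernel: skip out-of-range threads.
                    if global_i < size and global_j < size:
                        writes[global_j * size + global_i] += 1
    return writes^


fn main():
    var writes = count_writes(5, 2, 3)
    var active = 0
    for k in range(len(writes)):
        active += writes[k]
    print("active threads:", active)
    print("idle threads:", 2 * 2 * 3 * 3 - active)
```

With this puzzle's configuration, every entry of writes should be 1: each element is touched by exactly one thread, and the remaining launched threads are filtered out by the guard.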