Puzzle 6: Blocks

Overview

Implement a kernel that adds 10 to each position of vector a and stores the result in out.

Note: Each block has fewer threads than a has elements, so a single block cannot cover the whole vector.

[Figure: Blocks visualization]

Key concepts

In this puzzle, you’ll learn about:

  • Processing data larger than thread block size
  • Coordinating multiple blocks of threads
  • Computing global thread positions

The key insight is that several blocks of threads cooperate to process data larger than a single block can cover, while each thread still maps to exactly one element.
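To make the element-to-thread mapping concrete, here is a minimal sketch that enumerates the global index of every thread in this puzzle's launch (3 blocks of 4 threads). It is plain Python running on the CPU, not Mojo GPU code, and the helper name global_index is hypothetical; it only mirrors the formula block_dim.x * block_idx.x + thread_idx.x.

```python
# CPU-side model of the launch configuration used in this puzzle.
THREADS_PER_BLOCK = 4  # plays the role of block_dim.x
BLOCKS_PER_GRID = 3    # number of blocks along x


def global_index(block_idx: int, thread_idx: int,
                 block_dim: int = THREADS_PER_BLOCK) -> int:
    """Mirror of: block_dim.x * block_idx.x + thread_idx.x (hypothetical helper)."""
    return block_dim * block_idx + thread_idx


# One row per block, one entry per thread in that block.
mapping = [
    [global_index(b, t) for t in range(THREADS_PER_BLOCK)]
    for b in range(BLOCKS_PER_GRID)
]

for b, row in enumerate(mapping):
    print(f"block {b}: global indices {row}")
# block 0: global indices [0, 1, 2, 3]
# block 1: global indices [4, 5, 6, 7]
# block 2: global indices [8, 9, 10, 11]
```

Each block covers a contiguous slice of 4 indices, so every global index is produced by exactly one (block, thread) pair.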

Code to complete

alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32


fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)


View full file: problems/p06/p06.mojo

Tips
  1. Use the global index, already computed in the starter code: global_i = block_dim.x * block_idx.x + thread_idx.x
  2. Add guard: if global_i < size
  3. Inside guard: out[global_i] = a[global_i] + 10.0
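The guard in tip 2 matters because the grid launches more threads than there are elements. The following sketch, again plain Python on the CPU rather than Mojo, works out how many blocks are needed for this puzzle's sizes and which global indices fall past the end of the data. The variable names are illustrative only.

```python
SIZE = 9               # elements in a
THREADS_PER_BLOCK = 4  # threads in each block

# Ceiling division: the smallest block count whose threads cover every element.
blocks_needed = (SIZE + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
total_threads = blocks_needed * THREADS_PER_BLOCK

# Global indices that fail the `global_i < size` check and must do nothing.
idle = [i for i in range(total_threads) if not i < SIZE]

print(blocks_needed, total_threads, idle)
# 3 12 [9, 10, 11]
```

Three blocks give 12 threads for 9 elements, so the last three threads of the final block would read and write out of bounds without the guard.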

Running the code

To test your solution, run the following command in your terminal:

magic run p06

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

Solution

fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    if global_i < size:
        out[global_i] = a[global_i] + 10.0


This solution:

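  • Computes the global thread index from the block and thread indices
  • Guards against out-of-bounds access with if global_i < size, since the grid launches 12 threads for only 9 elements
  • Adds 10 to the input value at the global index and writes the result to out, inside the guard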
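As a sanity check, the solution's logic can be replayed on the CPU. This is a sketch in plain Python, not the Mojo kernel: the launch helper is hypothetical and runs the "threads" one after another, which is safe here only because each thread touches a distinct element. The input a = [0.0, ..., 8.0] is an assumption inferred from the expected output shown earlier.

```python
def add_10_blocks(out, a, size, block_dim, block_idx, thread_idx):
    """Python mirror of the kernel body; indices are passed in explicitly."""
    global_i = block_dim * block_idx + thread_idx
    if global_i < size:  # threads past the end of the data do nothing
        out[global_i] = a[global_i] + 10.0


def launch(kernel, grid_dim, block_dim, *args):
    """Hypothetical sequential stand-in for a 1D GPU launch."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(*args, block_dim, block_idx, thread_idx)


SIZE = 9
a = [float(i) for i in range(SIZE)]  # assumed input: 0.0 .. 8.0
out = [0.0] * SIZE

launch(add_10_blocks, 3, 4, out, a, SIZE)  # 3 blocks of 4 threads
print(out)
# [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
```

Removing the guard in this model raises an IndexError for global indices 9 to 11. On a real GPU, the same mistake would be an out-of-bounds memory access rather than a clean error.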