Puzzle 6: Blocks

Overview

Implement a kernel that adds 10 to each element of vector a and stores the result in out.

Note: Each block has fewer threads than a has elements, so a single block cannot cover the whole vector.

Figure: Blocks visualization

Key concepts

In this puzzle, you’ll learn about:

Processing data larger than thread block size
Coordinating multiple blocks of threads
Computing global thread positions

The key insight is that several blocks of threads cooperate to process data larger than a single block can cover, with each thread deriving a unique global index so that every element maps to exactly one thread.
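To make the mapping concrete, here is a minimal CPU-only sketch that replays the index arithmetic, with plain loops standing in for GPU blocks and threads. It assumes the launch configuration used in this puzzle (3 blocks of 4 threads, 9 elements). The names BLOCKS and THREADS are illustrative and are not part of the puzzle file.

```mojo
alias SIZE = 9      # number of elements in a (same as the puzzle)
alias BLOCKS = 3    # hypothetical name, mirrors the x component of BLOCKS_PER_GRID
alias THREADS = 4   # hypothetical name, mirrors the x component of THREADS_PER_BLOCK


fn main():
    # Nested loops stand in for block_idx.x and thread_idx.x on the GPU.
    for block in range(BLOCKS):
        for thread in range(THREADS):
            # Same formula as the kernel: block size * block index + thread index.
            var global_i = THREADS * block + thread
            if global_i < SIZE:
                print("block", block, "thread", thread, "-> element", global_i)
            else:
                print("block", block, "thread", thread, "-> no element")
```

Under these assumptions only the last block has threads without work: its thread 0 handles element 8, while threads 1 to 3 compute indices 9 to 11, which fall outside the vector.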

Code to complete

alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32


fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)

View full file: problems/p06/p06.mojo

Tips

Calculate global index: global_i = block_dim.x * block_idx.x + thread_idx.x
Add guard: if global_i < size
Inside guard: out[global_i] = a[global_i] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p06

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])

Solution

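fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # Offset of this block's first thread, plus the thread's position in the block.
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # The grid has more threads than elements; skip threads past the end of a.
    if global_i < size:
        out[global_i] = a[global_i] + 10.0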

This solution:

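Computes the global thread index from the block size, block index, and thread index
Guards against out-of-bounds access with if global_i < size, since the grid launches more threads than there are elements
Inside the guard, adds 10 to the input value at the global index and stores the result in out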
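One common way to arrive at a block count like the one in this puzzle is ceiling division: launch enough blocks to cover every element and let the guard absorb the remainder. The sketch below is a host-side calculation under that assumption. blocks_needed is a hypothetical helper, not something defined in the puzzle file.

```mojo
alias SIZE = 9      # number of elements in a (same as the puzzle)
alias THREADS = 4   # threads per block, as in THREADS_PER_BLOCK


fn blocks_needed(size: Int, threads: Int) -> Int:
    # Ceiling division using integer arithmetic only.
    return (size + threads - 1) // threads


fn main():
    var blocks = blocks_needed(SIZE, THREADS)
    var launched = blocks * THREADS
    print("blocks:", blocks)
    print("threads launched:", launched)
    print("threads masked by the guard:", launched - SIZE)
```

For 9 elements and 4 threads per block this gives 3 blocks, which matches BLOCKS_PER_GRID, and 12 launched threads, of which 3 are masked by the guard.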