Puzzle 6: Blocks
Overview
Implement a kernel that adds 10 to each position of vector a and stores it in out.
Note: You have fewer threads per block than the size of a.
Key concepts
In this puzzle, you’ll learn about:
- Processing data larger than thread block size
- Coordinating multiple blocks of threads
- Computing global thread positions
The key insight is that several blocks of threads cooperate to process data larger than a single block can cover, and each thread finds its own element by computing a global index from its block and thread indices.
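To make the element-to-thread mapping concrete, here is a minimal CPU-side sketch in Python (not Mojo, and not GPU code) that tabulates the global index for every thread in the puzzle's launch configuration of 3 blocks with 4 threads each. The helper names global_index and index_table are hypothetical and exist only for this illustration.

```python
# CPU-side illustration of the global index formula; this is not GPU code.
BLOCKS_PER_GRID = 3    # x-dimension of the puzzle's BLOCKS_PER_GRID
THREADS_PER_BLOCK = 4  # x-dimension of the puzzle's THREADS_PER_BLOCK


def global_index(block_dim_x: int, block_idx_x: int, thread_idx_x: int) -> int:
    """Mirror of the kernel expression block_dim.x * block_idx.x + thread_idx.x."""
    return block_dim_x * block_idx_x + thread_idx_x


def index_table(blocks: int, threads: int) -> list[list[int]]:
    """One row per block, one entry per thread, each entry a global index."""
    return [
        [global_index(threads, b, t) for t in range(threads)]
        for b in range(blocks)
    ]


if __name__ == "__main__":
    for b, row in enumerate(index_table(BLOCKS_PER_GRID, THREADS_PER_BLOCK)):
        print(f"block {b}: {row}")
    # block 0: [0, 1, 2, 3]
    # block 1: [4, 5, 6, 7]
    # block 2: [8, 9, 10, 11]
```

Each block covers a contiguous run of 4 indices, so the three blocks together produce indices 0 through 11. Only 0 through 8 correspond to elements of a.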
Code to complete
alias SIZE = 9
alias BLOCKS_PER_GRID = (3, 1)
alias THREADS_PER_BLOCK = (4, 1)
alias dtype = DType.float32
fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # FILL ME IN (roughly 2 lines)
View full file: problems/p06/p06.mojo
Tips
- Calculate global index: global_i = block_dim.x * block_idx.x + thread_idx.x
- Add guard: if global_i < size
- Inside guard: out[global_i] = a[global_i] + 10.0
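The guard matters because the launch creates more threads than there are elements. The following Python sketch (a CPU-side illustration, not Mojo) shows one common way to pick a block count, ceiling division, and lists the global indices the guard filters out. The puzzle itself hardcodes 3 blocks, so the ceiling-division rule is an assumption about how that number could be derived, and the helper names blocks_needed and idle_indices are hypothetical.

```python
def blocks_needed(size: int, threads_per_block: int) -> int:
    """Smallest block count whose threads cover `size` elements (ceiling division)."""
    return (size + threads_per_block - 1) // threads_per_block


def idle_indices(size: int, blocks: int, threads_per_block: int) -> list[int]:
    """Global indices rejected by the guard `global_i < size`."""
    total_threads = blocks * threads_per_block
    return [i for i in range(total_threads) if i >= size]


if __name__ == "__main__":
    blocks = blocks_needed(9, 4)
    print(blocks)                      # 3
    print(idle_indices(9, blocks, 4))  # [9, 10, 11]
```

With 9 elements and 4 threads per block, 2 blocks would leave element 8 unprocessed, while 3 blocks leave three threads with nothing to do. Without the guard, those three threads would read and write past the end of the buffers.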
Running the code
To test your solution, run the following command in your terminal:
magic run p06
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0])
Solution
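fn add_10_blocks(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # Position of this thread across the whole grid, not just within its block
    global_i = block_dim.x * block_idx.x + thread_idx.x
    # Skip threads whose index falls past the end of the data
    if global_i < size:
        out[global_i] = a[global_i] + 10.0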
This solution:
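- Computes each thread's global index from the block size, its block index, and its thread index within the block
- Guards against out-of-bounds access with if global_i < size, which is needed because 3 blocks of 4 threads launch 12 threads for only 9 elements
- Inside the guard, adds 10 to the input value at the global index and stores the result in out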
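As a way to check the logic without a GPU, the kernel can be simulated on the CPU. The Python sketch below replaces the parallel threads with two nested loops; it is an illustration of the indexing and guard only, not a translation of how the Mojo kernel is launched. The function name add_10_blocks_sim is hypothetical, and the input values 0.0 through 8.0 are an assumption inferred from the expected output shown earlier.

```python
def add_10_blocks_sim(
    a: list[float], size: int, blocks_per_grid: int, threads_per_block: int
) -> list[float]:
    """Sequential stand-in for the kernel: one loop iteration per GPU thread."""
    out = [0.0] * size
    for block_idx_x in range(blocks_per_grid):
        for thread_idx_x in range(threads_per_block):
            # Same formula as the kernel: block_dim.x * block_idx.x + thread_idx.x
            global_i = threads_per_block * block_idx_x + thread_idx_x
            # Same guard as the kernel: threads past the end do nothing
            if global_i < size:
                out[global_i] = a[global_i] + 10.0
    return out


if __name__ == "__main__":
    a = [float(i) for i in range(9)]  # assumed input: 0.0 .. 8.0
    print(add_10_blocks_sim(a, 9, 3, 4))
    # [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
```

On a real GPU the 12 threads run concurrently and in no guaranteed order. The result is the same because each thread writes to a distinct element.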