Overview
Implement a kernel that broadcast-adds vector a and vector b and stores the result in the 2D matrix output.
Note: You have more threads than positions.
Key concepts
In this puzzle, you’ll learn about:
- Broadcasting 1D vectors across different dimensions
- Using 2D thread indices for broadcast operations
- Handling boundary conditions in broadcast patterns
The key insight is understanding how to map elements from two 1D vectors to create a 2D output matrix through broadcasting, while handling thread bounds correctly.
- Broadcasting: Each element of a combines with each element of b
- Thread mapping: 2D thread grid \((3 \times 3)\) for \(2 \times 2\) output
- Vector access: Different access patterns for a and b
- Bounds checking: Guard against threads outside matrix dimensions
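The thread mapping above can be sketched on the CPU. The Python snippet below is only an emulation, not GPU code: `active_threads` is a hypothetical helper name, and it assumes the row comes from the thread's y index and the column from its x index, as in this puzzle.

```python
def active_threads(block_dim, size):
    """Return (row, col) pairs for threads that map to an output element.

    block_dim is (x_threads, y_threads); size is the side of the square output.
    """
    x_threads, y_threads = block_dim
    active = []
    for row in range(y_threads):      # plays the role of thread_idx.y
        for col in range(x_threads):  # plays the role of thread_idx.x
            if row < size and col < size:  # threads outside the matrix are skipped
                active.append((row, col))
    return active


if __name__ == "__main__":
    threads = active_threads((3, 3), 2)
    print(threads)                    # the in-bounds (row, col) pairs
    print(len(threads), "of", 3 * 3)  # how many threads do useful work
```

For a 3×3 block and a 2×2 output, only 4 of the 9 threads correspond to a matrix position; the other 5 must do nothing.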
Code to complete
alias SIZE = 2
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
fn broadcast_add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    # FILL ME IN (roughly 2 lines)
View full file: problems/p05/p05.mojo
Tips
- Get 2D indices: row = thread_idx.y, col = thread_idx.x
- Add guard: if row < size and col < size
- Inside guard: think about how to broadcast values of a and b
Running the code
To test your solution, run the command that matches your environment in your terminal:
pixi run p05
pixi run -e amd p05
pixi run -e apple p05
uv run poe p05
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([1.0, 2.0, 11.0, 12.0])
Solution
fn broadcast_add(
    output: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    a: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    b: UnsafePointer[Scalar[dtype], MutAnyOrigin],
    size: Int,
):
    row = thread_idx.y
    col = thread_idx.x
    if row < size and col < size:
        output[row * size + col] = a[col] + b[row]
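To check the indexing on the host, the kernel can be emulated by looping over every thread in the block. This is a minimal Python sketch, not Mojo: `broadcast_add_emulated` is a hypothetical name, and the inputs a = [1, 2] and b = [0, 10] are assumed values, chosen only because they reproduce the expected buffer shown earlier.

```python
def broadcast_add_emulated(a, b, size, block_dim=(3, 3)):
    """CPU emulation of the kernel: one loop iteration per GPU thread."""
    output = [0.0] * (size * size)  # flat, row-major storage for the 2D matrix
    x_threads, y_threads = block_dim
    for row in range(y_threads):      # stands in for thread_idx.y
        for col in range(x_threads):  # stands in for thread_idx.x
            if row < size and col < size:
                output[row * size + col] = a[col] + b[row]
    return output


if __name__ == "__main__":
    # Assumed inputs; they are consistent with the expected output above.
    print(broadcast_add_emulated([1.0, 2.0], [0.0, 10.0], 2))
    # [1.0, 2.0, 11.0, 12.0]
```

On a real GPU the threads run in parallel rather than in loop order, but since each in-bounds thread writes a distinct element, the result is the same.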
This solution demonstrates fundamental GPU broadcasting concepts without the LayoutTensor abstraction:
- Thread to matrix mapping
  - Uses thread_idx.y for row access and thread_idx.x for column access
  - Direct mapping from 2D thread grid to output matrix elements
  - Handles excess threads (3×3 grid) for 2×2 output matrix
- Broadcasting mechanics
  - Vector a broadcasts horizontally: the same a[col] is reused in every row
  - Vector b broadcasts vertically: the same b[row] is reused in every column
  - Output combines both vectors through addition

  \[
  \begin{bmatrix} a_0 & a_1 \end{bmatrix} + \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} a_0 + b_0 & a_1 + b_0 \\ a_0 + b_1 & a_1 + b_1 \end{bmatrix}
  \]
- Bounds checking
  - Single guard condition row < size and col < size handles both dimensions
  - Prevents out-of-bounds access for both input vectors and output matrix
  - Required due to 3×3 thread grid being larger than 2×2 data
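The guard protects more than the input vectors. As a rough illustration in Python (a CPU emulation with a hypothetical helper, `unguarded_accesses`), the sketch below lists what each excess thread would touch if the check were removed.

```python
def unguarded_accesses(size, block_dim=(3, 3)):
    """For each thread outside the size x size matrix, report what it would index."""
    x_threads, y_threads = block_dim
    report = []
    for row in range(y_threads):
        for col in range(x_threads):
            if row < size and col < size:
                continue  # in-bounds threads are not of interest here
            flat = row * size + col  # index the thread would write to
            report.append({
                "thread": (row, col),
                "flat_index": flat,
                "output_in_range": flat < size * size,
                "a_in_range": col < size,  # a is read at a[col]
                "b_in_range": row < size,  # b is read at b[row]
            })
    return report


if __name__ == "__main__":
    for entry in unguarded_accesses(2):
        print(entry)
```

With size = 2, the thread at row 0, col 2 computes flat index 2. That is a valid slot in the output, but it belongs to row 1, col 0, so without the guard the thread would read a[2] out of bounds and also overwrite another thread's result.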
Compare this with the LayoutTensor version to see how the abstraction simplifies broadcasting operations while maintaining the same underlying concepts.