LayoutTensor Version
Overview
Implement a kernel that adds 10 to each position of 2D LayoutTensor a and stores the result in 2D LayoutTensor out.
Note: You have fewer threads per block than the size of a in both directions.
Key concepts
In this puzzle, you'll learn about:
- Using LayoutTensor with multiple blocks
- Handling large matrices with 2D block organization
- Combining block indexing with LayoutTensor access
The key insight is that LayoutTensor simplifies 2D indexing while still requiring proper block coordination for large matrices.
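To make "simplifies 2D indexing" concrete, here is a minimal sketch of the flat-offset arithmetic that a row-major layout implies and that you would otherwise write by hand. The helper row_major_offset is hypothetical and exists only for illustration; it is not part of the LayoutTensor API.

```mojo
alias SIZE = 5  # matches the puzzle's 5 x 5 matrix


# Hypothetical helper for illustration only, not part of the LayoutTensor API.
fn row_major_offset(i: Int, j: Int, cols: Int) -> Int:
    # Flat offset of element (i, j) in a row-major matrix with `cols` columns.
    return i * cols + j


fn main():
    # Element (2, 3) of a 5 x 5 row-major matrix: 2 * 5 + 3 = 13.
    print(row_major_offset(2, 3, SIZE))
    # The last element (4, 4): 4 * 5 + 4 = 24, the final slot of 25.
    print(row_major_offset(4, 4, SIZE))
```

With LayoutTensor you write the pair of indices directly and leave this bookkeeping to the layout.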
Configuration
- Matrix size: \(5 \times 5\) elements
- Layout handling: LayoutTensor manages row-major organization
- Block coordination: Multiple blocks cover the full matrix
- 2D indexing: Natural \((i,j)\) access with bounds checking
- Total threads: \(36\) for \(25\) elements (\(2 \times 2\) blocks of \(3 \times 3\) threads)
- Thread mapping: Each thread processes one matrix element
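The numbers in this configuration follow from simple launch arithmetic. The sketch below reproduces them on the host, assuming the same sizes as the puzzle; the names THREADS_PER_DIM and BLOCKS_PER_DIM are introduced here for illustration and do not appear in the puzzle file.

```mojo
alias SIZE = 5
alias THREADS_PER_DIM = 3  # threads per block along each axis (illustrative name)
alias BLOCKS_PER_DIM = 2   # blocks per grid along each axis (illustrative name)


fn main():
    # Ceiling division: the smallest block count whose threads span SIZE.
    var blocks_needed: Int = (SIZE + THREADS_PER_DIM - 1) // THREADS_PER_DIM
    # Threads launched along one axis, then over the whole 2D grid.
    var threads_per_axis: Int = BLOCKS_PER_DIM * THREADS_PER_DIM
    var total_threads: Int = threads_per_axis * threads_per_axis
    # Threads that fall outside the 5 x 5 matrix and must be guarded out.
    var idle_threads: Int = total_threads - SIZE * SIZE
    print(blocks_needed)  # (5 + 2) // 3 = 2
    print(total_threads)  # 6 * 6 = 36
    print(idle_threads)   # 36 - 25 = 11
```

The 11 surplus threads are the reason the kernel needs a bounds check in both directions.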
Code to complete
alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)
fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)
View full file: problems/p07/p07_layout_tensor.mojo
Tips
- Calculate global indices: global_i = block_dim.x * block_idx.x + thread_idx.x
- Add guard: if global_i < size and global_j < size
- Inside guard: out[global_i, global_j] = a[global_i, global_j] + 10.0
Running the code
To test your solution, run the following command in your terminal:
magic run p07_layout_tensor
Your output will look like this if the puzzle isn't solved yet:
out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])
Solution
fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    if global_i < size and global_j < size:
        out[global_i, global_j] = a[global_i, global_j] + 10.0
This solution:
- Computes global indices with block_dim * block_idx + thread_idx
- Guards against out-of-bounds with if global_i < size and global_j < size
- Uses LayoutTensor's 2D indexing: out[global_i, global_j] = a[global_i, global_j] + 10.0
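One way to convince yourself that the index computation and the guard cover the matrix exactly once is to replay the launch on the CPU. The following is a rough host-side model, not GPU code: plain loops stand in for the blocks and threads, and the names BLOCKS, THREADS, writes, and skipped are illustrative assumptions rather than anything from the puzzle file.

```mojo
alias SIZE = 5
alias BLOCKS = 2   # blocks per grid along each axis (illustrative name)
alias THREADS = 3  # threads per block along each axis (illustrative name)


fn main():
    # One write counter per matrix element, stored row-major.
    var writes = List[Int]()
    for _ in range(SIZE * SIZE):
        writes.append(0)

    var skipped = 0
    # Walk every (block, thread) pair that the 2 x 2 grid of 3 x 3 blocks creates.
    for block_y in range(BLOCKS):
        for block_x in range(BLOCKS):
            for thread_y in range(THREADS):
                for thread_x in range(THREADS):
                    # Same formula as the kernel: block_dim * block_idx + thread_idx.
                    var global_i = THREADS * block_x + thread_x
                    var global_j = THREADS * block_y + thread_y
                    # Same guard as the kernel.
                    if global_i < SIZE and global_j < SIZE:
                        writes[global_i * SIZE + global_j] += 1
                    else:
                        skipped += 1

    # Check that every element was written exactly once.
    var all_once = True
    for k in range(SIZE * SIZE):
        if writes[k] != 1:
            all_once = False
    print(all_once)
    print(skipped)
```

Under these assumptions the model reports that all 25 elements are written exactly once and that 11 of the 36 threads are filtered out by the guard. Removing the guard in this model would index past the 5 x 5 range, which is the out-of-bounds access the kernel's if statement prevents.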