LayoutTensor Version

Overview

Implement a kernel that adds 10 to each element of the 2D LayoutTensor a and stores the result in the 2D LayoutTensor out.

Note: Each block has fewer threads than a has elements along either dimension, so a single block cannot cover the whole matrix.

Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor with multiple blocks
  • Handling large matrices with 2D block organization
  • Combining block indexing with LayoutTensor access

The key insight is that LayoutTensor simplifies 2D indexing while still requiring proper block coordination for large matrices.
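To make the "simplifies 2D indexing" point concrete, here is a minimal host-side sketch of the offset arithmetic that a row-major layout implies and that you would otherwise write by hand. The helper name flat_index is hypothetical and is not part of the puzzle file or of the LayoutTensor API. It only illustrates the standard row-major formula.

```mojo
alias SIZE = 5


# Hypothetical helper (not part of the puzzle): the flat offset of element
# (i, j) in a row-major matrix with `width` columns.
fn flat_index(i: Int, j: Int, width: Int) -> Int:
    # Advancing i by one skips a whole row of `width` elements.
    return i * width + j


fn main():
    # Element (2, 3) of a 5 x 5 row-major matrix sits at offset 2 * 5 + 3.
    print(flat_index(2, 3, SIZE))  # 13
    # The last element (4, 4) sits at the last offset.
    print(flat_index(4, 4, SIZE))  # 24
```

With LayoutTensor you write out[i, j] and the layout supplies this arithmetic. The block and thread bookkeeping that produces i and j is still up to you.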

Configuration

  • Matrix size: \(5 \times 5\) elements
  • Layout handling: LayoutTensor manages row-major organization
  • Block coordination: Multiple blocks cover the full matrix
  • 2D indexing: Natural \((i,j)\) access with bounds checking
  • Total threads: \(36\) (\(2 \times 2\) blocks of \(3 \times 3\) threads) for \(25\) elements
  • Thread mapping: Each thread processes one matrix element
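The numbers in this configuration follow from a ceiling division, which the short host-side sketch below works through. The names THREADS and blocks_needed are illustrative assumptions, not identifiers from the puzzle file.

```mojo
alias SIZE = 5
alias THREADS = 3  # threads per block along each axis (assumed name)


# Hypothetical helper: smallest number of blocks whose threads cover `size`.
fn blocks_needed(size: Int, threads: Int) -> Int:
    # Ceiling division using integer arithmetic only.
    return (size + threads - 1) // threads


fn main():
    var blocks = blocks_needed(SIZE, THREADS)  # 2 blocks per axis
    var per_axis = blocks * THREADS  # 6 threads per axis
    var total = per_axis * per_axis  # 36 threads in the grid
    var idle = total - SIZE * SIZE  # 11 threads outside the matrix
    print(blocks, per_axis, total, idle)  # 2 6 36 11
```

The 11 surplus threads are why the kernel needs a bounds check in both dimensions.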

Code to complete

alias SIZE = 5
alias BLOCKS_PER_GRID = (2, 2)
alias THREADS_PER_BLOCK = (3, 3)
alias dtype = DType.float32
alias out_layout = Layout.row_major(SIZE, SIZE)
alias a_layout = Layout.row_major(SIZE, SIZE)


fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    # FILL ME IN (roughly 2 lines)


View full file: problems/p07/p07_layout_tensor.mojo

Tips
  1. Calculate global indices: global_i = block_dim.x * block_idx.x + thread_idx.x, and likewise global_j from the y components
  2. Add guard: if global_i < size and global_j < size
  3. Inside guard: out[global_i, global_j] = a[global_i, global_j] + 10.0

Running the code

To test your solution, run the following command in your terminal:

magic run p07_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, ... , 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, ... , 11.0])

Solution

fn add_10_blocks_2d[
    out_layout: Layout,
    a_layout: Layout,
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=True, dtype, a_layout],
    size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    global_j = block_dim.y * block_idx.y + thread_idx.y
    if global_i < size and global_j < size:
        out[global_i, global_j] = a[global_i, global_j] + 10.0


This solution:

  • Computes global indices with block_dim * block_idx + thread_idx
  • Guards against out-of-bounds with if global_i < size and global_j < size
  • Uses LayoutTensor’s 2D indexing: out[global_i, global_j] = a[global_i, global_j] + 10.0
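As a rough check on the reasoning above, the following CPU-only sketch walks every (block, thread) pair of the 2 x 2 grid in plain loops, applies the same index formula and guard, and records how often each element would be written. It is an emulation under stated assumptions, not GPU code. The names BLOCKS, THREADS, and guard_passes are hypothetical, and the loop variables stand in for block_idx and thread_idx.

```mojo
alias SIZE = 5
alias BLOCKS = 2  # blocks per axis, mirroring BLOCKS_PER_GRID = (2, 2)
alias THREADS = 3  # threads per axis, mirroring THREADS_PER_BLOCK = (3, 3)


# Hypothetical helper: the same bounds check the kernel uses.
fn guard_passes(global_i: Int, global_j: Int, size: Int) -> Bool:
    return global_i < size and global_j < size


fn main():
    # One write counter per matrix element, stored row-major.
    var hits = List[Int]()
    for _ in range(SIZE * SIZE):
        hits.append(0)

    var launched = 0
    for block_y in range(BLOCKS):
        for block_x in range(BLOCKS):
            for thread_y in range(THREADS):
                for thread_x in range(THREADS):
                    launched += 1
                    # Same formula as the kernel: block_dim * block_idx + thread_idx.
                    var global_i = THREADS * block_x + thread_x
                    var global_j = THREADS * block_y + thread_y
                    if guard_passes(global_i, global_j, SIZE):
                        hits[global_i * SIZE + global_j] += 1

    # Check that no element was missed or written twice.
    var exactly_once = True
    for k in range(SIZE * SIZE):
        if hits[k] != 1:
            exactly_once = False

    print("launched:", launched)
    print("every element written exactly once:", exactly_once)
```

Run on the host, this should report 36 launched threads, with each of the 25 elements written exactly once and the remaining 11 threads filtered out by the guard.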