Key concepts

In this puzzle, you’ll learn about:

  • Using LayoutTensor for sliding window operations
  • Managing shared memory with the LayoutTensorBuilder pattern introduced in puzzle_08
  • Efficient neighbor access patterns
  • Boundary condition handling

The key insight is how LayoutTensor simplifies shared memory management while keeping window-based operations efficient.

Configuration

  • Array size: SIZE = 8 elements
  • Threads per block: TPB = 8
  • Window size: 3 elements
  • Shared memory: TPB elements

Notes:

  • Tensor builder: Use tb[dtype]().row_major[TPB]().shared().alloc() (tb is the LayoutTensorBuilder alias used in the code below)
  • Window access: Natural indexing for 3-element windows
  • Edge handling: Special cases for first two positions
  • Memory pattern: One shared memory load per thread
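Before writing the kernel, it helps to pin down what the 3-element sliding window sum should produce. The sketch below is plain Python (not Mojo), and the helper name `sliding_window_sum` is made up for illustration; it computes the same values the kernel must write, including the truncated windows at positions 0 and 1:

```python
# Host-side sketch of the 3-element sliding window sum (Python, not Mojo).
# Positions 0 and 1 have truncated windows, matching the kernel's edge cases.
def sliding_window_sum(a):
    out = []
    for i in range(len(a)):
        lo = max(0, i - 2)  # window covers a[i-2..i], clamped at index 0
        out.append(sum(a[lo : i + 1]))
    return out

print(sliding_window_sum([float(i) for i in range(8)]))
# [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```

This matches the expected HostBuffer shown later, so it doubles as a quick sanity check for your kernel output.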

Code to complete

alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)


View full file: problems/p09/p09_layout_tensor.mojo

Tips
  1. Create shared memory with tensor builder
  2. Load data with natural indexing: shared[local_i] = a[global_i]
  3. Handle special cases for first two elements
  4. Use shared memory for window operations
  5. Guard against out-of-bounds access

Running the code

To test your solution, run the following command in your terminal:

magic run p09_layout_tensor

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

Solution

fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        out[0] = shared[0]
    if global_i == 1:
        out[1] = shared[0] + shared[1]

    # Handle general case
    if 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )


The solution implements a sliding window sum using LayoutTensor with these key steps:

  1. Shared Memory Setup:

    • Uses tensor builder for clean shared memory allocation
    • Natural indexing for data loading
    • Thread synchronization with barrier()
  2. Boundary Cases:

    • Position 0: Single element output
    • Position 1: Sum of first two elements
    • Clean indexing with LayoutTensor bounds checking
  3. Main Window Operation:

    • Natural indexing for 3-element window
    • Safe access to neighboring elements
    • Automatic bounds checking
  4. Memory Access Pattern:

    • Efficient shared memory usage
    • Type-safe operations
    • Layout-aware indexing
    • Automatic alignment handling
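
The per-thread control flow above can be simulated on the host. This Python sketch (an illustration, not Mojo) runs every thread index through the same load/branch sequence; the kernel's barrier() is modeled by finishing all loads before any output is computed:

```python
# Simulate the kernel on the host: local_i == global_i since TPB == SIZE == 8.
TPB, SIZE = 8, 8
a = [float(i) for i in range(SIZE)]
shared = [0.0] * TPB
out = [0.0] * SIZE

# Phase 1: each "thread" loads one element into shared memory, guarded by size.
for global_i in range(TPB):
    if global_i < SIZE:
        shared[global_i] = a[global_i]

# barrier() in the kernel: phase 2 starts only after all loads have finished.

# Phase 2: special cases for positions 0 and 1, then the general 3-wide window.
for global_i in range(TPB):
    if global_i == 0:
        out[0] = shared[0]
    if global_i == 1:
        out[1] = shared[0] + shared[1]
    if 1 < global_i < SIZE:
        out[global_i] = (
            shared[global_i - 2] + shared[global_i - 1] + shared[global_i]
        )

print(out)  # [0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0]
```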

Benefits over raw approach:

  • Cleaner shared memory allocation
  • Safer memory access
  • Natural indexing syntax
  • Built-in bounds checking
  • Layout management