Key concepts
In this puzzle, you'll learn about:

- Using `LayoutTensor` for sliding window operations
- Managing shared memory with `LayoutTensorBuilder` that we saw in puzzle_08
- Efficient neighbor access patterns
- Boundary condition handling

The key insight is how `LayoutTensor` simplifies shared memory management while maintaining efficient window-based operations.
Configuration

- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 8`
- Window size: 3 elements
- Shared memory: `TPB` elements
Notes:

- Tensor builder: Use `tb[dtype]().row_major[TPB]().shared().alloc()` (`tb` is the shorthand alias for `LayoutTensorBuilder`, as in the code below)
- Window access: Natural indexing for 3-element windows
- Edge handling: Special cases for the first two positions
- Memory pattern: One shared memory load per thread
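Before diving into the kernel, it helps to pin down exactly what the window operation computes. Here is a minimal CPU reference in Python (an illustration, not part of the puzzle code): each output is the sum of up to three elements ending at position `i`, with the window clamped at the left edge.

```python
def pooling_reference(a):
    """CPU reference: out[i] = sum of a[max(0, i-2) .. i]."""
    out = []
    for i in range(len(a)):
        lo = max(0, i - 2)  # window start, clamped at the left boundary
        out.append(sum(a[lo : i + 1]))
    return out


# With input [0.0, 1.0, ..., 7.0] this matches the expected buffer
# shown under "Running the code" below.
print(pooling_reference([float(i) for i in range(8)]))
```

Positions 0 and 1 naturally fall out as the special cases the kernel must handle: their windows contain only one and two elements, respectively.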
Code to complete

```mojo
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 10 lines)
```
View full file: problems/p09/p09_layout_tensor.mojo
Tips

- Create shared memory with the tensor builder
- Load data with natural indexing: `shared[local_i] = a[global_i]`
- Handle special cases for the first two elements
- Use shared memory for window operations
- Guard against out-of-bounds access
Running the code

To test your solution, run the following command in your terminal:

```bash
magic run p09_layout_tensor
```

Your output will look like this if the puzzle isn't solved yet:

```txt
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 1.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0])
```
Solution

```mojo
fn pooling[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Load data into shared memory
    if global_i < size:
        shared[local_i] = a[global_i]

    # Synchronize threads within block
    barrier()

    # Handle first two special cases
    if global_i == 0:
        out[0] = shared[0]
    if global_i == 1:
        out[1] = shared[0] + shared[1]

    # Handle general case
    if 1 < global_i < size:
        out[global_i] = (
            shared[local_i - 2] + shared[local_i - 1] + shared[local_i]
        )
```
The solution implements a sliding window sum using LayoutTensor with these key steps:

1. Shared memory setup:
   - Uses the tensor builder for clean shared memory allocation
   - Natural indexing for data loading
   - Thread synchronization with `barrier()`
2. Boundary cases:
   - Position 0: single-element output
   - Position 1: sum of the first two elements
   - Clean indexing with LayoutTensor bounds checking
3. Main window operation:
   - Natural indexing for the 3-element window
   - Safe access to neighboring elements
   - Automatic bounds checking
4. Memory access pattern:
   - Efficient shared memory usage
   - Type-safe operations
   - Layout-aware indexing
   - Automatic alignment handling
Benefits over the raw pointer approach:
- Cleaner shared memory allocation
- Safer memory access
- Natural indexing syntax
- Built-in bounds checking
- Layout management