Key concepts
In this puzzle, you'll learn about:
- Using LayoutTensor's shared memory features
- Thread synchronization with shared memory
- Block-local data management with tensor builder
The key insight is how LayoutTensor simplifies shared memory management while maintaining the performance benefits of block-local storage.
Configuration
- Array size: `SIZE = 8` elements
- Threads per block: `TPB = 4`
- Number of blocks: 2
- Shared memory: `TPB` elements per block
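With this configuration the two blocks tile the array exactly: block 0 covers elements 0-3 and block 1 covers elements 4-7. The sketch below illustrates that index math with a hypothetical `element_for` helper (not part of the puzzle) that mirrors the `global_i` formula used in the kernel further down:

```mojo
alias TPB = 4  # matches the puzzle's configuration

# Hypothetical helper, illustrative only: how 2 blocks of 4 threads
# partition the 8-element array.
fn element_for(block: Int, thread: Int) -> Int:
    # Same formula as the kernel: block_dim.x * block_idx.x + thread_idx.x
    return TPB * block + thread

# block 0: element_for(0, 0) == 0 ... element_for(0, 3) == 3
# block 1: element_for(1, 0) == 4 ... element_for(1, 3) == 7
```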
Key differences from raw approach
- Memory allocation: we will use `LayoutTensorBuild` instead of `stack_allocation` (the builder chain is unpacked step by step right after this list)

  ```mojo
  # Raw approach
  shared = stack_allocation[TPB * sizeof[dtype](), ...]()

  # LayoutTensor approach
  shared = LayoutTensorBuild[dtype]().row_major[TPB]().shared().alloc()
  ```
- Memory access: the indexing syntax is identical in both approaches

  ```mojo
  # Raw approach
  shared[local_i] = a[global_i]

  # LayoutTensor approach
  shared[local_i] = a[global_i]
  ```
- Safety features:
  - Type safety
  - Layout management
  - Memory alignment handling
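To make the allocation chain from the first item above easier to read, here is the same expression annotated step by step. This is a sketch of a kernel-body statement, not standalone host code; `tb` is the short alias the puzzle code uses for `LayoutTensorBuild`:

```mojo
# Inside the kernel body, assuming the usual alias:
#   from layout.tensor_builder import LayoutTensorBuild as tb
shared = (
    tb[dtype]()          # start building a tensor of `dtype` elements
    .row_major[TPB]()    # describe a 1-D row-major layout of TPB elements
    .shared()            # place the allocation in block-local shared memory
    .alloc()             # materialize the tile for this thread block
)
```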
Note: LayoutTensor handles the memory layout, but you still need to manage thread synchronization with `barrier()` when using shared memory.
Code to complete
```mojo
alias TPB = 4
alias SIZE = 8
alias BLOCKS_PER_GRID = (2, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)


fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    # FILL ME IN (roughly 2 lines)
```
View full file: problems/p08/p08_layout_tensor.mojo
Tips
- Create shared memory with tensor builder
- Load data with natural indexing: `shared[local_i] = a[global_i]`
- Synchronize with `barrier()`
- Process data using shared memory indices
- Guard against out-of-bounds access
Running the code
To test your solution, run the following command in your terminal:

```bash
magic run p08_layout_tensor
```

Your output will look like this if the puzzle isn't solved yet:

```txt
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0, 11.0])
```
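For context, the harness that this command runs looks roughly like the sketch below. It is an approximation for orientation only; the buffer names, the input fill value, and the printing are assumptions, and the real setup lives in problems/p08/p08_layout_tensor.mojo:

```mojo
from gpu.host import DeviceContext
from layout import Layout, LayoutTensor

# Rough sketch of a host-side driver for this kernel (not the puzzle's exact harness).
def main():
    with DeviceContext() as ctx:
        # Device buffers for output and input; filling the input with 1.0
        # would make every expected output element 11.0, as shown above.
        out_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        a_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(1)

        # Wrap the raw device pointers in LayoutTensors with the puzzle's layout.
        out_tensor = LayoutTensor[mut=True, dtype, layout](out_buf.unsafe_ptr())
        a_tensor = LayoutTensor[mut=True, dtype, layout](a_buf.unsafe_ptr())

        # Launch 2 blocks of TPB threads each.
        ctx.enqueue_function[add_10_shared_layout_tensor[layout]](
            out_tensor,
            a_tensor,
            SIZE,
            grid_dim=BLOCKS_PER_GRID,
            block_dim=THREADS_PER_BLOCK,
        )
        ctx.synchronize()

        # Copy the result back to the host to inspect it.
        with out_buf.map_to_host() as out_host:
            print("out:", out_host)
```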
Solution
```mojo
fn add_10_shared_layout_tensor[
    layout: Layout
](
    out: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=True, dtype, layout],
    size: Int,
):
    # Allocate shared memory using tensor builder
    shared = tb[dtype]().row_major[TPB]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i]

    barrier()

    if global_i < size:
        out[global_i] = shared[local_i] + 10
```
This solution:
- Creates shared memory using tensor builder's fluent API
- Guards against out-of-bounds access with `if global_i < size`
- Uses natural indexing for both shared and global memory
- Ensures thread synchronization with `barrier()`
- Leverages LayoutTensor's built-in safety features
Key steps:
- Allocate shared memory with proper layout
- Load global data into shared memory
- Synchronize threads
- Process data using shared memory
- Write results back to global memory
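One detail worth calling out: the guard `if global_i < size` appears twice, with `barrier()` between the two guarded regions rather than inside either of them. That is deliberate. `barrier()` must be reached by every thread in the block, so placing it inside a conditional that some threads might skip can hang the block. In this puzzle every thread happens to pass the guard (8 elements, 8 threads), but the habit matters once sizes stop dividing evenly. A sketch of the anti-pattern, written as a kernel-body fragment for illustration only:

```mojo
# Anti-pattern (kernel-body fragment, illustrative only):
if global_i < size:
    shared[local_i] = a[global_i]
    barrier()  # unsafe: threads that fail the guard never reach this barrier

# Correct shape (what the solution does): guard the memory accesses,
# but let every thread in the block reach the barrier.
if global_i < size:
    shared[local_i] = a[global_i]
barrier()
```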