Simple Version: Single Block
Key concepts
In this puzzle, you’ll learn about:
- Implementing sliding window operations on GPUs
- Managing data dependencies across threads
- Using shared memory for overlapping regions
The key insight is understanding how to efficiently access overlapping elements while maintaining correct boundary conditions.
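As a mental model, the operation treats input elements past the end of the array as zero. Below is a minimal CPU-side sketch of that behavior; the function name conv_1d_cpu and the use of List are illustrative assumptions, not part of the puzzle code:

from collections import List

fn conv_1d_cpu(a: List[Float32], b: List[Float32]) -> List[Float32]:
    # out[i] = sum over j of a[i + j] * b[j], treating a[i + j] as 0 past the end
    var result = List[Float32]()
    for i in range(len(a)):
        var acc: Float32 = 0
        for j in range(len(b)):
            if i + j < len(a):
                acc += a[i + j] * b[j]
        result.append(acc)
    return result

With a = [0, 1, 2, 3, 4, 5] and b = [0, 1, 2], this yields [5, 8, 11, 14, 5, 0], matching the expected output shown further below.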
Configuration
- Input array size: `SIZE = 6` elements
- Kernel size: `CONV = 3` elements
- Threads per block: `TPB = 8`
- Number of blocks: 1
- Shared memory: Two arrays of size `SIZE` and `CONV`
Notes:
- Data loading: Each thread loads one element from input and kernel
- Memory pattern: Shared arrays for input and convolution kernel
- Thread sync: Coordination before computation
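The general shape of this load-then-synchronize pattern looks like the sketch below. It is a generic illustration rather than the puzzle solution; the element count N and the placeholder store are assumptions made for the example:

from gpu import thread_idx, barrier
from layout.tensor_builder import LayoutTensorBuild as tb

alias N = 8  # placeholder element count for the illustration

fn load_then_compute_pattern():
    local_i = thread_idx.x
    # Each thread stages one element into block-shared memory
    shared = tb[DType.float32]().row_major[N]().shared().alloc()
    shared[local_i] = 1.0  # stand-in for a load from global memory
    # Every thread must finish loading before any thread reads a neighbor's element
    barrier()
    # ... computation that reads overlapping ranges of `shared` goes here ...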
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
alias MAX_CONV = 4
alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)
fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 13 lines)
View full file: problems/p11/p11.mojo
Tips
- Use `tb[dtype]().row_major[SIZE]().shared().alloc()` for shared memory allocation
- Load input to `shared_a[local_i]` and kernel to `shared_b[local_i]`
- Call `barrier()` after loading
- Sum products within bounds: `if local_i + j < SIZE`
- Write result if `global_i < a_size`
Running the code
To test your solution, run the following command in your terminal:
magic run p11 --simple
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])
Solution
fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]
    if global_i < b_size:
        shared_b[local_i] = b[global_i]
    barrier()

    # Note: this version is unsafe: there is no guard on the overall thread
    # index, so threads past `a_size` still run the loop and can read
    # `shared_a` entries that were never loaded.
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]
    # if global_i < a_size:
    #     out[global_i] = local_sum

    # Safe and correct:
    if global_i < a_size:
        # Note: `var` lets us give the accumulator an explicit type;
        # `out.element_type` is provided by LayoutTensor.
        var local_sum: out.element_type = 0

        # Note: the `@parameter` decorator unrolls the loop at compile time,
        # since `CONV` is a compile-time constant.
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: is this check needed for this specific example, where SIZE and CONV are fixed?
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum
The solution implements a 1D convolution using shared memory for efficient access to overlapping elements. Here’s a detailed breakdown:
Memory Layout
Input array a: [0 1 2 3 4 5]
Kernel b: [0 1 2]
Computation Steps
- Data Loading:

      shared_a: [0 1 2 3 4 5]   // Input array
      shared_b: [0 1 2]         // Convolution kernel

- Convolution Process for each position i:

      out[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
      out[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
      out[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
      out[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
      out[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
      out[5] = a[5]*b[0] + 0*b[1]    + 0*b[2]    = 5*0 + 0*1 + 0*2 = 0
Implementation Details
- Memory Safety Considerations:
  - The naive approach, which relies only on the inner bounds check and has no guard on the thread index, could be unsafe:

        # Unsafe version - can read `shared_a` entries that were never loaded
        local_sum = Scalar[dtype](0)
        for j in range(CONV):
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

  - The safe and correct implementation:

        if global_i < a_size:
            var local_sum: out.element_type = 0  # Using var allows type inference

            @parameter  # Unrolls loop at compile time since CONV is constant
            for j in range(CONV):
                if local_i + j < SIZE:
                    local_sum += shared_a[local_i + j] * shared_b[j]

- Key Implementation Features:
  - Uses `var` with `out.element_type` for proper type inference
  - Employs the `@parameter` decorator to unroll the convolution loop at compile time
  - Maintains strict bounds checking for memory safety
  - Leverages LayoutTensor's type system for better code safety

- Memory Management:
  - Uses shared memory for both the input array and the kernel
  - Single load per thread from global memory
  - Efficient reuse of loaded data

- Thread Coordination:
  - `barrier()` ensures all data is loaded before computation
  - Each thread computes one output element
  - Maintains a coalesced memory access pattern

- Performance Optimizations:
  - Minimizes global memory access
  - Uses shared memory for fast data access
  - Avoids thread divergence in the main computation loop
  - Loop unrolling through the `@parameter` decorator (see the sketch after this list)
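As a standalone illustration of `@parameter` loop unrolling, here is a small sketch separate from the puzzle; the function name dot3 and the SIMD width of 4 are arbitrary choices made for the example:

alias N = 3

fn dot3(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var acc: Float32 = 0
    # Because N is known at compile time, @parameter unrolls the loop:
    # the compiler emits one copy of the body per j, with j fixed in each copy
    @parameter
    for j in range(N):
        acc += a[j] * b[j]
    return acc

In the solution, the same mechanism turns the `for j in range(CONV)` loop into three straight-line multiply-add steps, since `CONV = 3`.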