Complete version: Block Boundary Case

Configuration

- Input array size: `SIZE_2 = 15` elements
- Kernel size: `CONV_2 = 4` elements
- Threads per block: `TPB = 8`
- Number of blocks: 2
- Shared memory: `TPB + CONV_2 - 1` elements for input

Notes:

- Extended loading: account for boundary overlap
- Block edges: handle data across block boundaries
- Memory layout: efficient shared memory usage
- Synchronization: proper thread coordination
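
To see why the extended loading is needed, trace the last thread of block 0: its convolution window reaches past the block's own eight elements (a sketch derived from the configuration above, not from p11.mojo):

```
Block 0 owns a[0..7]. Thread 7 computes
    out[7] = a[7]*b[0] + a[8]*b[1] + a[9]*b[2] + a[10]*b[3]
where a[8..10] fall in block 1's range, so each block stages
TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 input elements in shared memory:
its own 8 plus a 3-element overlap from the next block.
```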
Code to complete

```mojo
alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)
```
View full file: problems/p11/p11.mojo
Tips

- Use `tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()` for shared memory
- Load main data: `shared_a[local_i] = a[global_i]`
- Load boundary: `if local_i < CONV_2 - 1` handles next-block data
- Load kernel: `shared_b[local_i] = b[local_i]`
- Sum within extended bounds: `if local_i + j < TPB + CONV_2 - 1`
Running the code

To test your solution, run the following command in your terminal:

```bash
magic run p11 --block-boundary
```

Your output will look like this if the puzzle isn't solved yet:

```
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])
```
Solution

```mojo
fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # First: allocate enough shared memory for the block's own data
    # plus the (CONV_2 - 1)-element overlap with the next block
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]
    else:
        # zero-fill slots past the end of the input
        shared_a[local_i] = 0

    # Second: load the elements needed for convolution at the block boundary
    if local_i < CONV_2 - 1:
        # indices from the next block
        next_idx = global_i + TPB
        if next_idx < a_size:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # zero padding past the end of the input
            shared_a[TPB + local_i] = 0

    if local_i < b_size:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < a_size:
        var local_sum: out.element_type = 0

        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum
```
The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:
Memory Layout and Sizing

Test configuration:

- Full array size: `SIZE_2 = 15` elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: `CONV_2 = 4` elements

```
Block 0 shared memory: [0 1 2 3 4 5 6 7|8 9 10]     // TPB (8) elements + (CONV_2 - 1) (3) overlap
Block 1 shared memory: [8 9 10 11 12 13 14 0|0 0 0] // trailing slots zero-filled past the array end
```

Size calculation:

- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
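
The same arithmetic as a standalone compile-time check (a hypothetical snippet, not part of p11.mojo):

```mojo
# Hypothetical standalone check of the shared-memory sizing (not from p11.mojo).
alias TPB = 8
alias CONV_2 = 4
# Each block stages its own TPB elements plus the (CONV_2 - 1)-element overlap.
alias SHARED_A_SIZE = TPB + CONV_2 - 1


fn main():
    print(SHARED_A_SIZE)  # 11
```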
Implementation Details

1. Shared Memory Allocation:

```mojo
# First: account for the padding needed by the convolution window
shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
```

This allocation pattern ensures there is enough space for both the block's own data and the overlap region.

2. Data Loading Strategy:

```mojo
# Main block data (zero-fill past the end of the input)
if global_i < a_size:
    shared_a[local_i] = a[global_i]
else:
    shared_a[local_i] = 0

# Boundary data from the next block (zero-fill past the end)
if local_i < CONV_2 - 1:
    next_idx = global_i + TPB
    if next_idx < a_size:
        shared_a[TPB + local_i] = a[next_idx]
    else:
        shared_a[TPB + local_i] = 0
```

- Only threads with `local_i < CONV_2 - 1` load boundary data
- Prevents unnecessary thread divergence
- Maintains memory coalescing for the main data load
3. Kernel Loading:

```mojo
if local_i < b_size:
    shared_b[local_i] = b[local_i]
```

- Single load per thread
- Bounded by kernel size
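
Putting the three loads together, here is the per-thread activity they produce (a trace derived from the code above, with TPB = 8, CONV_2 = 4, a_size = 15):

```
Block 0 (global_i = local_i):
  main load:     threads 0..7 → shared_a[0..7]  = a[0..7]
  boundary load: threads 0..2 → shared_a[8..10] = a[8..10]
  kernel load:   threads 0..3 → shared_b[0..3]  = b[0..3]

Block 1 (global_i = 8 + local_i):
  main load:     threads 0..6 → shared_a[0..6]  = a[8..14]; thread 7 zero-fills (15 ≥ a_size)
  boundary load: threads 0..2 → next_idx = 16..18 ≥ a_size → shared_a[8..10] = 0
  kernel load:   threads 0..3 → shared_b[0..3]  = b[0..3]
```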
4. Convolution Computation:

```mojo
if global_i < a_size:
    var local_sum: out.element_type = 0

    @parameter
    for j in range(CONV_2):
        if local_i + j < TPB + CONV_2 - 1:
            local_sum += shared_a[local_i + j] * shared_b[j]

    out[global_i] = local_sum
```

- Uses `@parameter` for compile-time loop unrolling
- Proper type inference with `out.element_type`
- Extended bounds check covers the overlap region
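
As an aside, `@parameter for` unrolls a loop whose trip count is known at compile time. A minimal standalone illustration (hypothetical code, not from p11.mojo):

```mojo
# Hypothetical illustration of @parameter loop unrolling (not from p11.mojo).
fn dot4(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var acc: Float32 = 0

    # The range is a compile-time constant, so the compiler expands the
    # four multiply-adds inline instead of emitting a runtime loop.
    @parameter
    for j in range(4):
        acc += a[j] * b[j]
    return acc


fn main():
    var a = SIMD[DType.float32, 4](1, 2, 3, 4)
    var b = SIMD[DType.float32, 4](1, 1, 1, 1)
    print(dot4(a, b))  # 10.0
```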
Memory Access Pattern Analysis

Block 0 access pattern:

```
Thread 0: [0 1 2 3]  × [0 1 2 3]
Thread 1: [1 2 3 4]  × [0 1 2 3]
Thread 2: [2 3 4 5]  × [0 1 2 3]
...
Thread 7: [7 8 9 10] × [0 1 2 3]   // uses overlap data
```

Block 1 access pattern:

```
Thread 0: [8 9 10 11]  × [0 1 2 3]
Thread 1: [9 10 11 12] × [0 1 2 3]
...
Thread 6: [14 0 0 0]   × [0 1 2 3] // zero padding at the end
Thread 7: inactive (global_i = 15 ≥ a_size)
```
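
The zero-padding semantics above are easiest to pin down with a plain CPU reference. Here is a minimal sketch (a hypothetical helper, not part of p11.mojo) that computes the output the kernel should produce:

```mojo
from collections import List


# Hypothetical CPU reference for the same zero-padded 1D convolution
# (not from p11.mojo); useful for sanity-checking the kernel's output.
fn conv_1d_reference(a: List[Float32], b: List[Float32]) -> List[Float32]:
    var out = List[Float32]()
    for i in range(len(a)):
        var acc: Float32 = 0
        for j in range(len(b)):
            # Terms that would read past the end of `a` contribute zero,
            # exactly like the kernel's zero-filled shared-memory slots.
            if i + j < len(a):
                acc += a[i + j] * b[j]
        out.append(acc)
    return out
```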
Performance Optimizations

1. Memory Coalescing:
   - Main data load: consecutive threads access consecutive memory
   - Boundary data: only the necessary threads participate
   - Single barrier synchronization point

2. Thread Divergence Minimization:
   - Clean separation of main and boundary loading
   - Uniform computation pattern within warps
   - Efficient bounds checking

3. Shared Memory Usage:
   - Sized to exactly TPB + CONV_2 - 1 to cover the block boundary
   - No bank conflicts in the access pattern
   - Loaded data is reused across overlapping windows

4. Boundary Handling:
   - Explicit zero padding at the array end
   - Seamless transition between blocks
   - Proper handling of edge cases
This implementation achieves efficient cross-block convolution while maintaining:
- Memory safety through proper bounds checking
- High performance through optimized memory access
- Clean code structure using LayoutTensor abstractions
- Minimal synchronization overhead