Complete version: Block Boundary Case

Configuration

  • Input array size: SIZE_2 = 15 elements
  • Kernel size: CONV_2 = 4 elements
  • Threads per block: TPB = 8
  • Number of blocks: 2
  • Shared memory: TPB + CONV_2 - 1 elements for input

Notes:

  • Extended loading: Account for boundary overlap
  • Block edges: Handle data across block boundaries
  • Memory layout: Efficient shared memory usage
  • Synchronization: Proper thread coordination
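
For example, with TPB = 8 and CONV_2 = 4, the last thread of block 0 computes out[7] = a[7] * b[0] + a[8] * b[1] + a[9] * b[2] + a[10] * b[3]. Elements a[8], a[9], and a[10] live in block 1's slice of the input, which is exactly why each block must load CONV_2 - 1 extra elements beyond its own TPB-element range.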

Code to complete

alias SIZE_2 = 15
alias CONV_2 = 4
alias BLOCKS_PER_GRID_2 = (2, 1)
alias THREADS_PER_BLOCK_2 = (TPB, 1)


fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


View full file: problems/p11/p11.mojo

Tips
  1. Use tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc() for shared memory
  2. Load main data: shared_a[local_i] = a[global_i]
  3. Load boundary: if local_i < CONV_2 - 1 handle next block data
  4. Load kernel: shared_b[local_i] = b[local_i]
  5. Sum within extended bounds: if local_i + j < TPB + CONV_2 - 1

Running the code

To test your solution, run the following command in your terminal:

magic run p11 --block-boundary

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])

Solution

fn conv_1d_block_boundary[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout, dtype: DType
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # first: need to account for padding
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]
    else:
        # zero-fill so threads near the array end never read uninitialized shared memory
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        next_idx = global_i + TPB
        if next_idx < a_size:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # zero padding when the next block's data runs past the array end
            shared_a[TPB + local_i] = 0

    if local_i < b_size:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < a_size:
        var local_sum: out.element_type = 0

        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:

Memory Layout and Sizing

Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements

Block 0 shared memory:  [0 1 2 3 4 5 6 7|8 9 10]     // TPB(8) elements + (CONV_2-1)(3) halo
Block 1 shared memory:  [8 9 10 11 12 13 14 0|0 0 0]  // 7 valid elements, remaining slots zero-padded

Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements
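
To make the split concrete, here is a small host-side sketch (not part of p11.mojo; plain Mojo with no GPU imports) that reports how many elements each block loads into its main region and its halo region:

fn main():
    var size = 15  # SIZE_2
    var tpb = 8    # TPB
    var conv = 4   # CONV_2
    var num_blocks = (size + tpb - 1) // tpb
    for block in range(num_blocks):
        var start = block * tpb
        # Main region: one element per thread, clipped to the array size
        var main_valid = min(tpb, size - start)
        # Halo region: up to CONV_2 - 1 elements borrowed from the next block
        var halo_valid = max(0, min(conv - 1, size - (start + tpb)))
        print("block", block, "loads", main_valid, "main +", halo_valid, "halo elements")

This prints 8 main + 3 halo elements for block 0 and 7 main + 0 halo elements for block 1, matching the diagram above.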

Implementation Details

  1. Shared Memory Allocation:

    # First: account for padding needed for convolution window
    shared_a = tb[dtype]().row_major[TPB + CONV_2 - 1]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV_2]().shared().alloc()
    

    This allocation pattern ensures we have enough space for both the block’s data and the overlap region.

  2. Data Loading Strategy:

    # Main block data (zero-filled when global_i is past the end of the array)
    if global_i < a_size:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0
    
    # Boundary data from next block (zero-filled past the array end)
    if local_i < CONV_2 - 1:
        next_idx = global_i + TPB
        if next_idx < a_size:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            shared_a[TPB + local_i] = 0
    
    • Only threads with local_i < CONV_2 - 1 load boundary data
    • Out-of-range slots are zero-filled, providing the zero padding used at the array end
    • Keeps thread divergence confined to the small boundary-loading step
    • Maintains memory coalescing for the main data load
  3. Kernel Loading:

    if local_i < b_size:
        shared_b[local_i] = b[local_i]
    
    • Single load per thread
    • Bounded by kernel size
  4. Convolution Computation:

    if global_i < a_size:
        var local_sum: out.element_type = 0
        @parameter
        for j in range(CONV_2):
            if local_i + j < TPB + CONV_2 - 1:
                local_sum += shared_a[local_i + j] * shared_b[j]
    
    • Uses @parameter for compile-time loop unrolling (see the standalone sketch after this list)
    • Proper type inference with out.element_type
    • Extended bounds check for overlap region
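
As an aside on the @parameter decorator: because CONV_2 is a compile-time constant, the loop body is unrolled during compilation instead of branching at runtime. A tiny standalone sketch of the same pattern, unrelated to p11.mojo, assuming a fixed width of 4:

fn dot4(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var s: Float32 = 0

    @parameter
    for j in range(4):  # bound is a compile-time constant, so the loop unrolls
        s += a[j] * b[j]
    return s


fn main():
    var x = SIMD[DType.float32, 4](1, 2, 3, 4)
    var y = SIMD[DType.float32, 4](0, 1, 2, 3)
    print(dot4(x, y))  # 20.0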

Memory Access Pattern Analysis

  1. Block 0 Access Pattern:

    Thread 0: [0 1 2 3] × [0 1 2 3]
    Thread 1: [1 2 3 4] × [0 1 2 3]
    Thread 2: [2 3 4 5] × [0 1 2 3]
    ...
    Thread 7: [7 8 9 10] × [0 1 2 3]  // Uses overlap data
    
  2. Block 1 Access Pattern:

    Thread 0: [8 9 10 11] × [0 1 2 3]
    Thread 1: [9 10 11 12] × [0 1 2 3]
    ...
    Thread 6: [14 0 0 0] × [0 1 2 3]  // Zero padding at end
    Thread 7: idle (global_i = 15 is outside the 15-element array)
    
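These per-thread windows are exactly what a plain sequential convolution computes, so a host-side reference is a handy way to sanity-check the kernel. Below is a minimal sketch (not part of p11.mojo); the inputs a = [0, 1, ..., 14] and b = [0, 1, 2, 3] are assumed purely for illustration, so substitute the actual test data from the problem file:

fn conv_1d_reference(a: List[Float32], b: List[Float32]) -> List[Float32]:
    # Naive convolution: result[i] = sum over j of a[i + j] * b[j], zero-padded past the end
    var result = List[Float32]()
    for i in range(len(a)):
        var s: Float32 = 0
        for j in range(len(b)):
            if i + j < len(a):
                s += a[i + j] * b[j]
        result.append(s)
    return result


fn main():
    # Hypothetical inputs for illustration only
    var a = List[Float32]()
    for i in range(15):  # SIZE_2 elements
        a.append(Float32(i))
    var b = List[Float32]()
    for j in range(4):  # CONV_2 elements
        b.append(Float32(j))
    var expected = conv_1d_reference(a, b)
    for i in range(len(expected)):
        print(expected[i])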

Performance Optimizations

  1. Memory Coalescing:

    • Main data load: Consecutive threads access consecutive memory
    • Boundary data: Only necessary threads participate
    • Single barrier synchronization point
  2. Thread Divergence Minimization:

    • Clean separation of main and boundary loading
    • Uniform computation pattern within warps
    • Efficient bounds checking
  3. Shared Memory Usage:

    • Optimal sizing to handle block boundaries
    • No bank conflicts in access pattern
    • Efficient reuse of loaded data
  4. Boundary Handling:

    • Zero padding for elements past the end of the array
    • Seamless block transition
    • Proper handling of edge cases

This implementation achieves efficient cross-block convolution while maintaining:

  • Memory safety through proper bounds checking
  • High performance through optimized memory access
  • Clean code structure using LayoutTensor abstractions
  • Minimal synchronization overhead