Block Boundary Version

Implement a kernel that computes a 1D convolution between 1D TileTensor a and 1D TileTensor b and stores it in 1D TileTensor output.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.

Configuration

Input array size: SIZE_2 = 15 elements
Kernel size: CONV_2 = 4 elements
Threads per block: TPB = 8
Number of blocks: 2
Shared memory: TPB + CONV_2 - 1 elements for input

Notes:

Extended loading: Account for boundary overlap
Block edges: Handle data across block boundaries
Memory layout: Efficient shared memory usage
Synchronization: Proper thread coordination

Code to complete

comptime SIZE_2 = 15
comptime CONV_2 = 4
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime in_2_layout = row_major[SIZE_2]()
comptime In2Layout = type_of(in_2_layout)
comptime out_2_layout = row_major[SIZE_2]()
comptime Out2Layout = type_of(out_2_layout)
comptime conv_2_layout = row_major[CONV_2]()
comptime Conv2Layout = type_of(conv_2_layout)


def conv_1d_block_boundary(
    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)

View full file: problems/p13/p13.mojo

Tips

Use stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]()) for shared memory
Load main data: shared_a[local_i] = a[global_i]
Load boundary: if local_i < CONV_2 - 1 handle next block data
Load kernel: shared_b[local_i] = b[local_i]
Sum within input bounds: if global_i + j < SIZE_2

Running the code

To test your solution, run the following command in your terminal:

pixi run p13 --block-boundary

pixi run -e amd p13 --block-boundary

pixi run -e apple p13 --block-boundary

uv run poe p13 --block-boundary

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])

Solution

def conv_1d_block_boundary(
    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # first: need to account for padding
    var shared_a = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB + CONV_2 - 1]())
    var shared_b = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[CONV_2]())
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        var next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.ElementType = 0

        comptime for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum

The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:

Memory layout and sizing

Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements

Block 0 shared memory:  [0 1 2 3 4 5 6 7|8 9 10]  // TPB(8) + (CONV_2-1)(3) padding
Block 1 shared memory:  [8 9 10 11 12 13 14 0|0 0 0]  // Second block. data(7) + padding to fill grid(1) + (CONV_2-1)(3) padding

Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements

Implementation details

Shared Memory Allocation:

# First: account for padding needed for convolution window
shared_a = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]())
shared_b = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[CONV_2]())

This allocation pattern ensures we have enough space for both the block’s data and the overlap region.

Data Loading Strategy:

# Main block data
if global_i < SIZE_2:
    shared_a[local_i] = a[global_i]
else:
    shared_a[local_i] = 0

# Boundary data from next block
if local_i < CONV_2 - 1:
    next_idx = global_i + TPB
    if next_idx < SIZE_2:
        shared_a[TPB + local_i] = a[next_idx]
    else:
        # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
        # which is an undefined behavior
        shared_a[TPB + local_i] = 0

Only threads with local_i < CONV_2 - 1 load boundary data
Prevents unnecessary thread divergence
Maintains memory coalescing for main data load
Explicitly zeroes out-of-bounds elements to avoid undefined behavior

Kernel Loading:
```
if local_i < b_size:
    shared_b[local_i] = b[local_i]
```
- Single load per thread
- Bounded by kernel size

Convolution Computation:

if global_i < SIZE_2:
    var local_sum: output.element_type = 0
    @parameter
    for j in range(CONV_2):
        if global_i + j < SIZE_2:
            local_sum += shared_a[local_i + j] * shared_b[j]

Uses @parameter for compile-time loop unrolling
Proper type inference with output.element_type
Semantically correct bounds check: only compute convolution for valid input positions

Memory access pattern analysis

Block 0 Access Pattern:

Thread 0: [0 1 2 3] × [0 1 2 3]
Thread 1: [1 2 3 4] × [0 1 2 3]
Thread 2: [2 3 4 5] × [0 1 2 3]
...
Thread 7: [7 8 9 10] × [0 1 2 3]  // Uses overlap data

Block 1 Access Pattern: Note how starting from thread 4, global_i + j < SIZE_2 evaluates to False and hence iterations are skipped.

Thread 0: [8  9 10 11] × [0 1 2 3]
Thread 1: [9 10 11 12] × [0 1 2 3]
...
Thread 4: [12 13 14] × [0 1 2]       // Zero padding at end
Thread 5: [13 14]    × [0 1]
Thread 6: [14]       × [0]
Thread 7: skipped                    // global_i + j < SIZE_2 evaluates to false for all j, no computation

Performance optimizations

Memory Coalescing:
- Main data load: Consecutive threads access consecutive memory
- Boundary data: Only necessary threads participate
- Single barrier synchronization point
Thread Divergence Minimization:
- Clean separation of main and boundary loading
- Uniform computation pattern within warps
- Efficient bounds checking
Shared Memory Usage:
- Optimal sizing to handle block boundaries
- No bank conflicts in access pattern
- Efficient reuse of loaded data
Boundary Handling:
- Explicit zero initialization for out-of-bounds elements which prevents reading from uninitialized shared memory
- Semantically correct boundary checking using global_i + j < SIZE_2 instead of shared memory bounds
- Proper handling of edge cases without over-computation

Boundary condition improvement

The solution uses if global_i + j < SIZE_2: rather than checking shared memory bounds. This pattern is:

Mathematically correct: Only computes convolution where input data actually exists
More efficient: Avoids unnecessary computations for positions beyond the input array
Safer: Prevents reliance on zero-padding behavior in shared memory

This implementation achieves efficient cross-block convolution while maintaining:

Memory safety through proper bounds checking
High performance through optimized memory access
Clean code structure using TileTensor abstractions
Minimal synchronization overhead
Mathematically sound boundary handling