Block Boundary Version

Implement a kernel that computes a 1D convolution between 1D TileTensor a and 1D TileTensor b and stores it in 1D TileTensor output.

Note: You need to handle the general case. You only need 2 global reads and 1 global write per thread.

Configuration

  • Input array size: SIZE_2 = 15 elements
  • Kernel size: CONV_2 = 4 elements
  • Threads per block: TPB = 8
  • Number of blocks: 2
  • Shared memory: TPB + CONV_2 - 1 elements for input

Notes:

  • Extended loading: Account for boundary overlap
  • Block edges: Handle data across block boundaries
  • Memory layout: Efficient shared memory usage
  • Synchronization: Proper thread coordination

Code to complete

comptime SIZE_2 = 15
comptime CONV_2 = 4
comptime BLOCKS_PER_GRID_2 = (2, 1)
comptime THREADS_PER_BLOCK_2 = (TPB, 1)
comptime in_2_layout = row_major[SIZE_2]()
comptime In2Layout = type_of(in_2_layout)
comptime out_2_layout = row_major[SIZE_2]()
comptime Out2Layout = type_of(out_2_layout)
comptime conv_2_layout = row_major[CONV_2]()
comptime Conv2Layout = type_of(conv_2_layout)


def conv_1d_block_boundary(
    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # FILL ME IN (roughly 18 lines)


View full file: problems/p13/p13.mojo

Tips
  1. Use stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]()) for shared memory
  2. Load main data: shared_a[local_i] = a[global_i]
  3. Load boundary: if local_i < CONV_2 - 1 handle next block data
  4. Load kernel: shared_b[local_i] = b[local_i]
  5. Sum within input bounds: if global_i + j < SIZE_2

Running the code

To test your solution, run the following command in your terminal:

pixi run p13 --block-boundary
pixi run -e amd p13 --block-boundary
pixi run -e apple p13 --block-boundary
uv run poe p13 --block-boundary

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([14.0, 20.0, 26.0, 32.0, 38.0, 44.0, 50.0, 56.0, 62.0, 68.0, 74.0, 80.0, 41.0, 14.0, 0.0])

Solution

def conv_1d_block_boundary(
    output: TileTensor[mut=True, dtype, Out2Layout, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, In2Layout, ImmutAnyOrigin],
    b: TileTensor[mut=False, dtype, Conv2Layout, ImmutAnyOrigin],
):
    var global_i = block_dim.x * block_idx.x + thread_idx.x
    var local_i = thread_idx.x
    # first: need to account for padding
    var shared_a = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[TPB + CONV_2 - 1]())
    var shared_b = stack_allocation[
        dtype=dtype, address_space=AddressSpace.SHARED
    ](row_major[CONV_2]())
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0

    # second: load elements needed for convolution at block boundary
    if local_i < CONV_2 - 1:
        # indices from next block
        var next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0

    if local_i < CONV_2:
        shared_b[local_i] = b[local_i]

    barrier()

    if global_i < SIZE_2:
        var local_sum: output.ElementType = 0

        comptime for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]

        output[global_i] = local_sum


The solution handles block boundary cases in 1D convolution using extended shared memory. Here’s a detailed analysis:

Memory layout and sizing

Test Configuration:
- Full array size: SIZE_2 = 15 elements
- Grid: 2 blocks × 8 threads
- Convolution kernel: CONV_2 = 4 elements

Block 0 shared memory:  [0 1 2 3 4 5 6 7|8 9 10]  // TPB(8) + (CONV_2-1)(3) padding
Block 1 shared memory:  [8 9 10 11 12 13 14 0|0 0 0]  // Second block. data(7) + padding to fill grid(1) + (CONV_2-1)(3) padding

Size calculation:
- Main data: TPB elements (8)
- Overlap: CONV_2 - 1 elements (4 - 1 = 3)
- Total: TPB + CONV_2 - 1 = 8 + 4 - 1 = 11 elements

Implementation details

  1. Shared Memory Allocation:

    # First: account for padding needed for convolution window
    shared_a = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[TPB + CONV_2 - 1]())
    shared_b = stack_allocation[dtype=dtype, address_space=AddressSpace.SHARED](row_major[CONV_2]())
    

    This allocation pattern ensures we have enough space for both the block’s data and the overlap region.

  2. Data Loading Strategy:

    # Main block data
    if global_i < SIZE_2:
        shared_a[local_i] = a[global_i]
    else:
        shared_a[local_i] = 0
    
    # Boundary data from next block
    if local_i < CONV_2 - 1:
        next_idx = global_i + TPB
        if next_idx < SIZE_2:
            shared_a[TPB + local_i] = a[next_idx]
        else:
            # Initialize out-of-bounds elements to 0 to avoid reading from uninitialized memory
            # which is an undefined behavior
            shared_a[TPB + local_i] = 0
    
    • Only threads with local_i < CONV_2 - 1 load boundary data
    • Prevents unnecessary thread divergence
    • Maintains memory coalescing for main data load
    • Explicitly zeroes out-of-bounds elements to avoid undefined behavior
  3. Kernel Loading:

    if local_i < b_size:
        shared_b[local_i] = b[local_i]
    
    • Single load per thread
    • Bounded by kernel size
  4. Convolution Computation:

    if global_i < SIZE_2:
        var local_sum: output.element_type = 0
        @parameter
        for j in range(CONV_2):
            if global_i + j < SIZE_2:
                local_sum += shared_a[local_i + j] * shared_b[j]
    
    • Uses @parameter for compile-time loop unrolling
    • Proper type inference with output.element_type
    • Semantically correct bounds check: only compute convolution for valid input positions

Memory access pattern analysis

  1. Block 0 Access Pattern:

    Thread 0: [0 1 2 3] × [0 1 2 3]
    Thread 1: [1 2 3 4] × [0 1 2 3]
    Thread 2: [2 3 4 5] × [0 1 2 3]
    ...
    Thread 7: [7 8 9 10] × [0 1 2 3]  // Uses overlap data
    
  2. Block 1 Access Pattern: Note how starting from thread 4, global_i + j < SIZE_2 evaluates to False and hence iterations are skipped.

    Thread 0: [8  9 10 11] × [0 1 2 3]
    Thread 1: [9 10 11 12] × [0 1 2 3]
    ...
    Thread 4: [12 13 14] × [0 1 2]       // Zero padding at end
    Thread 5: [13 14]    × [0 1]
    Thread 6: [14]       × [0]
    Thread 7: skipped                    // global_i + j < SIZE_2 evaluates to false for all j, no computation
    

Performance optimizations

  1. Memory Coalescing:

    • Main data load: Consecutive threads access consecutive memory
    • Boundary data: Only necessary threads participate
    • Single barrier synchronization point
  2. Thread Divergence Minimization:

    • Clean separation of main and boundary loading
    • Uniform computation pattern within warps
    • Efficient bounds checking
  3. Shared Memory Usage:

    • Optimal sizing to handle block boundaries
    • No bank conflicts in access pattern
    • Efficient reuse of loaded data
  4. Boundary Handling:

    • Explicit zero initialization for out-of-bounds elements which prevents reading from uninitialized shared memory
    • Semantically correct boundary checking using global_i + j < SIZE_2 instead of shared memory bounds
    • Proper handling of edge cases without over-computation

Boundary condition improvement

The solution uses if global_i + j < SIZE_2: rather than checking shared memory bounds. This pattern is:

  • Mathematically correct: Only computes convolution where input data actually exists
  • More efficient: Avoids unnecessary computations for positions beyond the input array
  • Safer: Prevents reliance on zero-padding behavior in shared memory

This implementation achieves efficient cross-block convolution while maintaining:

  • Memory safety through proper bounds checking
  • High performance through optimized memory access
  • Clean code structure using TileTensor abstractions
  • Minimal synchronization overhead
  • Mathematically sound boundary handling