Simple Version: Single Block
Key concepts
In this puzzle, you’ll learn about:
- Implementing sliding window operations on GPUs
- Managing data dependencies across threads
- Using shared memory for overlapping regions
The key insight is understanding how to efficiently access overlapping elements while maintaining correct boundary conditions.
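As a mental model, the operation treats input elements past the end of the array as zero. Below is a minimal CPU-side sketch of that behavior; the function name conv_1d_cpu and the use of List are illustrative assumptions, not part of the puzzle code:

from collections import List

fn conv_1d_cpu(a: List[Float32], b: List[Float32]) -> List[Float32]:
    # out[i] = sum over j of a[i + j] * b[j], treating a[i + j] as 0 past the end
    var result = List[Float32]()
    for i in range(len(a)):
        var acc: Float32 = 0
        for j in range(len(b)):
            if i + j < len(a):
                acc += a[i + j] * b[j]
        result.append(acc)
    return result

With a = [0, 1, 2, 3, 4, 5] and b = [0, 1, 2], this yields [5, 8, 11, 14, 5, 0], matching the expected output shown further below.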
Configuration
- Input array size: `SIZE = 6` elements
- Kernel size: `CONV = 3` elements
- Threads per block: `TPB = 8`
- Number of blocks: 1
- Shared memory: Two arrays of size `SIZE` and `CONV`
Notes:
- Data loading: Each thread loads one element from input and kernel
- Memory pattern: Shared arrays for input and convolution kernel
- Thread sync: Coordination before computation
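The general shape of this load-then-synchronize pattern looks like the sketch below. It is a generic illustration rather than the puzzle solution; the element count N and the placeholder store are assumptions made for the example:

from gpu import thread_idx, barrier
from layout.tensor_builder import LayoutTensorBuild as tb

alias N = 8  # placeholder element count for the illustration

fn load_then_compute_pattern():
    local_i = thread_idx.x
    # Each thread stages one element into block-shared memory
    shared = tb[DType.float32]().row_major[N]().shared().alloc()
    shared[local_i] = 1.0  # stand-in for a load from global memory
    # Every thread must finish loading before any thread reads a neighbor's element
    barrier()
    # ... computation that reads overlapping ranges of `shared` goes here ...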
Code to complete
from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb
alias MAX_CONV = 4
alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)
fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 13 lines)
View full file: problems/p11/p11.mojo
Tips
- Use `tb[dtype]().row_major[SIZE]().shared().alloc()` for shared memory allocation
- Load input to `shared_a[local_i]` and kernel to `shared_b[local_i]`
- Call `barrier()` after loading
- Sum products within bounds: `if local_i + j < SIZE`
- Write result if `global_i < a_size`
Running the code
To test your solution, run the following command in your terminal:
magic run p11 --simple
Your output will look like this if the puzzle isn’t solved yet:
out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])
Solution
fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]
    if global_i < b_size:
        shared_b[local_i] = b[global_i]
    barrier()

    # Note: this version is unsafe: there is no guard on the overall thread
    # index, so threads past `a_size` still run the loop and can read
    # `shared_a` entries that were never loaded.
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]
    # if global_i < a_size:
    #     out[global_i] = local_sum

    # Safe and correct:
    if global_i < a_size:
        # Note: `var` lets us give the accumulator an explicit type;
        # `out.element_type` is provided by LayoutTensor.
        var local_sum: out.element_type = 0

        # Note: the `@parameter` decorator unrolls the loop at compile time,
        # since `CONV` is a compile-time constant.
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: is this check needed for this specific example, where SIZE and CONV are fixed?
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum
The solution implements a 1D convolution using shared memory for efficient access to overlapping elements. Here’s a detailed breakdown:
Memory Layout
Input array a: [0 1 2 3 4 5]
Kernel b: [0 1 2]
Computation Steps
- Data Loading:

      shared_a: [0 1 2 3 4 5]   // Input array
      shared_b: [0 1 2]         // Convolution kernel

- Convolution Process for each position i:

      out[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
      out[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
      out[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
      out[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
      out[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
      out[5] = a[5]*b[0] + 0*b[1]    + 0*b[2]    = 5*0 + 0*1 + 0*2 = 0
Implementation Details
- Memory Safety Considerations:
  - The naive approach, which relies only on the inner bounds check and has no guard on the thread index, could be unsafe:

        # Unsafe version - can read `shared_a` entries that were never loaded
        local_sum = Scalar[dtype](0)
        for j in range(CONV):
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

  - The safe and correct implementation:

        if global_i < a_size:
            var local_sum: out.element_type = 0  # Using var allows type inference

            @parameter  # Unrolls loop at compile time since CONV is constant
            for j in range(CONV):
                if local_i + j < SIZE:
                    local_sum += shared_a[local_i + j] * shared_b[j]

- Key Implementation Features:
  - Uses `var` with `out.element_type` for proper type inference
  - Employs the `@parameter` decorator to unroll the convolution loop at compile time
  - Maintains strict bounds checking for memory safety
  - Leverages LayoutTensor's type system for better code safety

- Memory Management:
  - Uses shared memory for both the input array and the kernel
  - Single load per thread from global memory
  - Efficient reuse of loaded data

- Thread Coordination:
  - `barrier()` ensures all data is loaded before computation
  - Each thread computes one output element
  - Maintains a coalesced memory access pattern

- Performance Optimizations:
  - Minimizes global memory access
  - Uses shared memory for fast data access
  - Avoids thread divergence in the main computation loop
  - Loop unrolling through the `@parameter` decorator (see the sketch after this list)
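As a standalone illustration of `@parameter` loop unrolling, here is a small sketch separate from the puzzle; the function name dot3 and the SIMD width of 4 are arbitrary choices made for the example:

alias N = 3

fn dot3(a: SIMD[DType.float32, 4], b: SIMD[DType.float32, 4]) -> Float32:
    var acc: Float32 = 0
    # Because N is known at compile time, @parameter unrolls the loop:
    # the compiler emits one copy of the body per j, with j fixed in each copy
    @parameter
    for j in range(N):
        acc += a[j] * b[j]
    return acc

In the solution, the same mechanism turns the `for j in range(CONV)` loop into three straight-line multiply-add steps, since `CONV = 3`.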