Puzzle 11: 1D Convolution

Moving to LayoutTensor

So far in our GPU puzzle journey, we’ve been exploring two parallel approaches to GPU memory management:

  1. Raw memory management with direct pointer manipulation using UnsafePointer
  2. The more structured LayoutTensor and its related abstractions such as LayoutTensorBuild

Starting from this puzzle, we’re transitioning exclusively to using LayoutTensor. This abstraction provides several benefits:

  • Type-safe memory access patterns
  • Clear representation of data layouts
  • Better code maintainability
  • Reduced chance of memory-related bugs
  • More expressive code that better represents the underlying computations
  • A lot more … that we’ll uncover gradually!

This transition aligns with best practices in modern GPU programming in Mojo 🔥, where higher-level abstractions help manage complexity without sacrificing performance.

Overview

In signal processing and image analysis, convolution is a fundamental operation that combines two sequences to produce a third sequence. This puzzle challenges you to implement a 1D convolution on the GPU, where each output element is computed by sliding a kernel over an input array.

Implement a kernel that computes the 1D convolution of vector a with kernel vector b and stores the result in out, using the LayoutTensor abstraction.

Note: You need to handle the general case (arbitrary input and kernel sizes). Each thread needs only 2 global reads and 1 global write.

1D Convolution

For those new to convolution, think of it as a weighted sliding window operation. At each position, we multiply the kernel values with the corresponding input values and sum the results. In mathematical notation, this is often written as:

\[\Large out[i] = \sum_{j=0}^{\text{CONV}-1} a[i+j] \cdot b[j] \]
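For example (values chosen purely for illustration), with a = [1, 2, 3] and b = [4, 5], we get out[0] = 1·4 + 2·5 = 14 and out[1] = 2·4 + 3·5 = 23, while at the right boundary out[2] = 3·4 = 12, since terms with i + j ≥ SIZE are dropped.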

In pseudocode (written here as runnable Python, with example sizes), 1D convolution is:

SIZE, CONV = 6, 3                          # example input and kernel lengths
a_host = [float(i) for i in range(SIZE)]   # input signal
b_host = [float(j) for j in range(CONV)]   # kernel weights
ret = [0.0] * SIZE                         # zero-initialized output
for i in range(SIZE):
    for j in range(CONV):
        if i + j < SIZE:                   # skip reads past the end of a
            ret[i] += a_host[i + j] * b_host[j]
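
As a quick sanity check (assuming NumPy is available; it is not part of the puzzle itself), the same result can be reproduced on the host. This works because the operation above is a cross-correlation with implicit zero padding past the end of a:

import numpy as np

# np.convolve flips its second argument, so reversing b_host recovers our
# sum of a_host[i + j] * b_host[j]; "full" mode zero-pads, matching the
# i + j < SIZE boundary rule above.
expected = np.convolve(a_host, b_host[::-1], mode="full")[CONV - 1:]
assert np.allclose(ret, expected)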

This puzzle is split into two parts to help you build understanding progressively:

  • Simple version (single block): Start here to learn the basics of implementing convolution with shared memory in a single block using LayoutTensor.

  • Complete version (block boundary): Then tackle the more challenging case where data needs to be shared across block boundaries, leveraging LayoutTensor’s capabilities.

Each version presents unique challenges in terms of memory access patterns and thread coordination. The simple version helps you understand the basic convolution operation, while the complete version tests your ability to handle more complex scenarios that arise in real-world GPU programming.
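
To preview the core idea behind the complete version, here is a minimal host-side Python sketch of the halo-loading strategy (not Mojo, and with TPB and the data chosen arbitrarily for illustration): each block stages its own TPB input elements plus a (CONV − 1)-element halo from the neighboring block into a local buffer that stands in for shared memory, then convolves entirely from that buffer.

# Host-side simulation of per-block shared memory with a halo region.
# TPB and the data are illustrative; on the GPU, the inner loop over
# local_i runs as one thread per iteration.
TPB = 4
SIZE, CONV = 10, 3
a = [float(i) for i in range(SIZE)]
b = [2.0, 1.0, 1.0]
out = [0.0] * SIZE

num_blocks = (SIZE + TPB - 1) // TPB
for block in range(num_blocks):
    start = block * TPB
    # "Shared" buffer: TPB owned elements plus a (CONV - 1)-element halo,
    # zero-padded once we run past the end of the input.
    shared = [a[start + k] if start + k < SIZE else 0.0
              for k in range(TPB + CONV - 1)]
    for local_i in range(TPB):  # one iteration per thread in the block
        i = start + local_i
        if i < SIZE:
            out[i] = sum(shared[local_i + j] * b[j] for j in range(CONV))

Zero-padding the halo reproduces the i + j < SIZE guard from the pseudocode: out-of-range products contribute nothing, so threads near a block boundary never have to branch differently from their neighbors.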