Simple Version: Single Block

Key concepts

In this puzzle, you’ll learn about:

  • Implementing sliding window operations on GPUs
  • Managing data dependencies across threads
  • Using shared memory for overlapping regions

The key insight is how to access overlapping elements efficiently while maintaining correct boundary conditions.
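
Before mapping this onto GPU threads, it can help to see the same sliding-window sum as a plain sequential loop. Below is a minimal CPU-style sketch; the helper name conv_1d_reference and the use of List are illustrative assumptions, not part of p11.mojo:

from collections import List


# Hypothetical CPU reference for illustration only (not in p11.mojo).
fn conv_1d_reference(a: List[Float32], b: List[Float32]) -> List[Float32]:
    # result[i] = sum over j of a[i + j] * b[j], treating elements past the
    # end of `a` as zero (the same boundary rule the kernel uses).
    var result = List[Float32]()
    for i in range(len(a)):
        var acc: Float32 = 0
        for j in range(len(b)):
            if i + j < len(a):  # zero-padding at the right boundary
                acc += a[i + j] * b[j]
        result.append(acc)
    return result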

Configuration

  • Input array size: SIZE = 6 elements
  • Kernel size: CONV = 3 elements
  • Threads per block: TPB = 8
  • Number of blocks: 1
  • Shared memory: Two arrays of size SIZE and CONV

Notes:

  • Data loading: Each thread loads one element from input and kernel
  • Memory pattern: Shared arrays for input and convolution kernel
  • Thread sync: Coordination before computation

Code to complete

from gpu import thread_idx, block_idx, block_dim, barrier
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuild as tb


alias MAX_CONV = 4
alias TPB = 8
alias SIZE = 6
alias CONV = 3
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(SIZE)
alias conv_layout = Layout.row_major(CONV)


fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL ME IN (roughly 13 lines)


View full file: problems/p11/p11.mojo

Tips
  1. Use tb[dtype]().row_major[SIZE]().shared().alloc() for shared memory allocation
  2. Load input to shared_a[local_i] and kernel to shared_b[local_i]
  3. Call barrier() after loading
  4. Sum products within bounds: if local_i + j < SIZE
  5. Write result if global_i < a_size

Running the code

To test your solution, run the following command in your terminal:

magic run p11 --simple

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([5.0, 8.0, 11.0, 14.0, 5.0, 0.0])

Solution

fn conv_1d_simple[
    in_layout: Layout, out_layout: Layout, conv_layout: Layout
](
    out: LayoutTensor[mut=False, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, conv_layout],
    a_size: Int,
    b_size: Int,
):
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    shared_a = tb[dtype]().row_major[SIZE]().shared().alloc()
    shared_b = tb[dtype]().row_major[CONV]().shared().alloc()
    if global_i < a_size:
        shared_a[local_i] = a[global_i]

    if global_i < b_size:
        shared_b[local_i] = b[global_i]

    barrier()

    # Note: this is unsafe because it enforces no guard around the computation, so it could access `shared_a` beyond its bounds
    # local_sum = Scalar[dtype](0)
    # for j in range(CONV):
    #     if local_i + j < SIZE:
    #         local_sum += shared_a[local_i + j] * shared_b[j]

    # if global_i < a_size:
    #     out[global_i] = local_sum

    # Safe and correct:
    if global_i < a_size:
        # Note: `var` lets us give the accumulator an explicit type annotation;
        # `out.element_type` is provided by LayoutTensor
        var local_sum: out.element_type = 0

        # Note: the `@parameter` decorator unrolls the loop at compile time because `CONV` is a compile-time constant
        # See: https://docs.modular.com/mojo/manual/decorators/parameter/#parametric-for-statement
        @parameter
        for j in range(CONV):
            # Bonus: do we need this check for this specific example with fixed SIZE and CONV?
            if local_i + j < SIZE:
                local_sum += shared_a[local_i + j] * shared_b[j]

        out[global_i] = local_sum


The solution implements a 1D convolution using shared memory for efficient access to overlapping elements. Here’s a detailed breakdown:

Memory Layout

Input array a:   [0  1  2  3  4  5]
Kernel b:        [0  1  2]

Computation Steps

  1. Data Loading:

    shared_a: [0  1  2  3  4  5]  // Input array
    shared_b: [0  1  2]           // Convolution kernel
    
  2. Convolution Process for each position i (positions past the end of `a` contribute zero because the `local_i + j < SIZE` guard skips them):

    out[0] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] = 0*0 + 1*1 + 2*2 = 5
    out[1] = a[1]*b[0] + a[2]*b[1] + a[3]*b[2] = 1*0 + 2*1 + 3*2 = 8
    out[2] = a[2]*b[0] + a[3]*b[1] + a[4]*b[2] = 2*0 + 3*1 + 4*2 = 11
    out[3] = a[3]*b[0] + a[4]*b[1] + a[5]*b[2] = 3*0 + 4*1 + 5*2 = 14
    out[4] = a[4]*b[0] + a[5]*b[1] + 0*b[2]    = 4*0 + 5*1 + 0*2 = 5
    out[5] = a[5]*b[0] + 0*b[1]   + 0*b[2]     = 5*0 + 0*1 + 0*2 = 0
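
For a quick sanity check outside the GPU path, the hypothetical conv_1d_reference sketch from the Key concepts section reproduces exactly this table (again an illustrative snippet, not part of p11.mojo):

fn main():
    # Calls the hypothetical conv_1d_reference helper sketched earlier
    # in the Key concepts section (not part of p11.mojo).
    var a = List[Float32](0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
    var b = List[Float32](0.0, 1.0, 2.0)
    var result = conv_1d_reference(a, b)
    for i in range(len(result)):
        print(result[i])  # 5.0, 8.0, 11.0, 14.0, 5.0, 0.0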
    

Implementation Details

  1. Memory Safety Considerations:

    • The naive approach, which skips the outer guard around the computation, could be unsafe:

      # Unsafe version - could access shared_a beyond its bounds
      local_sum = Scalar[dtype](0)
      for j in range(CONV):
          if local_i + j < SIZE:
              local_sum += shared_a[local_i + j] * shared_b[j]
      
    • The safe and correct implementation:

      if global_i < a_size:
          var local_sum: out.element_type = 0  # var with an explicit type annotation (out.element_type)
          @parameter  # Unrolls loop at compile time since CONV is constant
          for j in range(CONV):
              if local_i + j < SIZE:
                  local_sum += shared_a[local_i + j] * shared_b[j]
      
  2. Key Implementation Features:

    • Uses var with an explicit out.element_type annotation for the accumulator
    • Employs @parameter decorator to unroll the convolution loop at compile time
    • Maintains strict bounds checking for memory safety
    • Leverages LayoutTensor’s type system for better code safety
  3. Memory Management:

    • Uses shared memory for both input array and kernel
    • Single load per thread from global memory
    • Efficient reuse of loaded data
  4. Thread Coordination:

    • barrier() ensures all data is loaded before computation
    • Each thread computes one output element
    • Maintains coalesced memory access pattern
  5. Performance Optimizations:

    • Minimizes global memory access
    • Uses shared memory for fast data access
    • Minimizes thread divergence: only threads at the right boundary take a different path in the guarded loop
    • Loop unrolling through the @parameter decorator (see the sketch below)
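
For reference, here is the @parameter for-statement in isolation. This is a minimal standalone sketch (not part of p11.mojo): because the trip count is a compile-time alias, the compiler replicates the loop body instead of emitting a runtime loop.

alias N = 3


# Standalone illustration of @parameter loop unrolling (not in p11.mojo).
fn unrolled_sum() -> Float32:
    var total: Float32 = 0

    @parameter
    for i in range(N):
        # The body is stamped out N times at compile time; in each copy,
        # `i` is the compile-time value 0, 1, or 2.
        total += Float32(i)
    return total  # always 3.0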