Puzzle 2: Zip

Overview

Implement a kernel that adds the elements of vector a and vector b at each position and stores the result in out.

Note: You have 1 thread per position, so every thread index maps to a valid element.

Key concepts

In this puzzle, you’ll learn about:

  • Processing multiple input arrays in parallel
  • Element-wise operations with multiple inputs
  • Thread-to-data mapping across arrays
  • Memory access patterns with multiple arrays

For each thread \(i\): \[\Large out[i] = a[i] + b[i]\]

Memory access pattern

Thread 0:  a[0] + b[0] → out[0]
Thread 1:  a[1] + b[1] → out[1]
Thread 2:  a[2] + b[2] → out[2]
...
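
The mapping above can be modeled on the CPU. The following is a minimal Python sketch, not Mojo GPU code: the helper names add_thread and launch are hypothetical, the input values are chosen only for illustration, and the simulated threads run one after another, whereas on the GPU they execute in parallel.

```python
SIZE = 4  # mirrors the puzzle's SIZE: one simulated thread per element


def add_thread(thread_idx_x, out, a, b):
    """Body run by one simulated thread: it touches only its own index."""
    out[thread_idx_x] = a[thread_idx_x] + b[thread_idx_x]


def launch(kernel, num_threads, *args):
    """Hypothetical stand-in for a GPU launch: runs each thread index in turn."""
    for i in range(num_threads):
        kernel(i, *args)


a = [1.0, 2.0, 3.0, 4.0]      # illustrative inputs, not the puzzle's data
b = [10.0, 20.0, 30.0, 40.0]  # illustrative inputs, not the puzzle's data
out = [0.0] * SIZE

launch(add_thread, SIZE, out, a, b)
print(out)  # [11.0, 22.0, 33.0, 44.0]
```

Because each simulated thread reads and writes only index i of every array, the order in which the threads run does not affect the result.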

💡 Note: This kernel now manages three arrays (a, b, out). As operations become more complex, keeping track of multiple array accesses gets harder.

Code to complete

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32


fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    local_i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View full file: problems/p02/p02.mojo

Tips
  1. Store thread_idx.x in local_i
  2. Add a[local_i] and b[local_i]
  3. Store result in out[local_i]

Running the code

To test your solution, run the following command in your terminal:

magic run p02

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([0.0, 2.0, 4.0, 6.0])
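
The expected values shown above are consistent with both inputs holding their own index (a[i] = b[i] = i). That input choice is an assumption here, since the host-side setup is not shown in this section. Under that assumption, a Python sketch (CPU only, with a hypothetical matches helper) of how an unsolved and a solved result compare against the reference:

```python
SIZE = 4

# Assumed inputs: each element equals its index (not shown in this section).
a = [float(i) for i in range(SIZE)]
b = [float(i) for i in range(SIZE)]

# Reference result computed on the host.
expected = [a[i] + b[i] for i in range(SIZE)]

unsolved_out = [0.0] * SIZE                       # kernel body empty: out stays zeroed
solved_out = [a[i] + b[i] for i in range(SIZE)]   # what a correct kernel would write


def matches(out, reference):
    """Element-wise equality check between a result and the reference."""
    return len(out) == len(reference) and all(o == r for o, r in zip(out, reference))


print(expected)                         # [0.0, 2.0, 4.0, 6.0]
print(matches(unsolved_out, expected))  # False
print(matches(solved_out, expected))    # True
```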

Solution

fn add(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
):
    local_i = thread_idx.x
    out[local_i] = a[local_i] + b[local_i]


This solution:

  • Reads the thread index into local_i with local_i = thread_idx.x
  • Adds the values at that index from both arrays and writes the sum to out: out[local_i] = a[local_i] + b[local_i]

Looking ahead

While this direct indexing works for simple element-wise operations, consider:

  • What if arrays have different layouts?
  • What if we need to broadcast one array to another?
  • How do we ensure coalesced access across multiple arrays?

These questions will be addressed when we introduce LayoutTensor in Puzzle 4.
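
To make the broadcasting question concrete, here is a small Python sketch (CPU only, with hypothetical function names) of why indexing every array at the same position stops being enough: broadcasting a one-element array requires a different index expression for b than for a.

```python
def zip_add(a, b):
    """Direct indexing: every array is read at the same position i."""
    return [a[i] + b[i] for i in range(len(a))]


def broadcast_add(a, b):
    """Broadcast a one-element b across a: b is always read at index 0."""
    assert len(b) == 1
    return [a[i] + b[0] for i in range(len(a))]


a = [0.0, 1.0, 2.0, 3.0]
same_shape = zip_add(a, [10.0, 10.0, 10.0, 10.0])
broadcast = broadcast_add(a, [10.0])
print(same_shape)  # [10.0, 11.0, 12.0, 13.0]
print(broadcast)   # [10.0, 11.0, 12.0, 13.0]
```

Both calls produce the same result, but the second needs its own index rule for b, which is the kind of bookkeeping the questions above point at.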