Key concepts

In this puzzle, you’ll learn about:

  • Basic GPU kernel structure

  • Thread indexing with thread_idx.x

  • Simple parallel operations

The kernel in this puzzle relies on four ideas:

  • Parallelism: Each thread runs the same kernel independently of the others

  • Thread indexing: Access element at position i = thread_idx.x

  • Memory access: Read from a[i] and write to out[i]

  • Data independence: Each output depends only on its corresponding input
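The ideas above can be modeled on the CPU. The following is a minimal Python sketch, not Mojo and not GPU code: `launch`, `double_kernel`, and the doubling operation are hypothetical names chosen for illustration. It assumes a single block in which every thread receives a distinct index, and it runs the simulated threads in two different orders to show that the result does not depend on scheduling when each output depends only on its own input.

```python
# CPU-side model of a one-block kernel launch (illustrative only, not Mojo).
# `launch` and `double_kernel` are hypothetical names, not part of any GPU API.


def double_kernel(out, a, thread_idx_x):
    """Body executed by one simulated thread: touch only element i."""
    i = thread_idx_x
    out[i] = a[i] * 2.0


def launch(kernel, out, a, thread_order):
    """Run the kernel once per simulated thread, in the given order."""
    for thread_idx_x in thread_order:
        kernel(out, a, thread_idx_x)
    return out


size = 4
a = [float(i) for i in range(size)]

forward = launch(double_kernel, [0.0] * size, a, range(size))
backward = launch(double_kernel, [0.0] * size, a, reversed(range(size)))

# Each output depends only on its own input, so the order is irrelevant.
print(forward == backward)
```

On a real GPU the threads run concurrently rather than in a loop, but the same property holds: no thread reads or writes an element owned by another thread.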

Code to complete

from gpu import thread_idx
from memory import UnsafePointer

alias SIZE = 4
alias BLOCKS_PER_GRID = 1
alias THREADS_PER_BLOCK = SIZE
alias dtype = DType.float32


fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    local_i = thread_idx.x
    # FILL ME IN (roughly 1 line)


View full file: problems/p01/p01.mojo

Tips
  1. Store thread_idx.x in local_i (the starter code already does this)
  2. Add 10 to a[local_i]
  3. Store result in out[local_i]

Running the code

To test your solution, run the following command in your terminal:

magic run p01

Your output will look like this if the puzzle isn’t solved yet:

out: HostBuffer([0.0, 0.0, 0.0, 0.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])

Solution

fn add_10(out: UnsafePointer[Scalar[dtype]], a: UnsafePointer[Scalar[dtype]]):
    local_i = thread_idx.x
    out[local_i] = a[local_i] + 10.0


This solution:

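  • Gets the thread index with local_i = thread_idx.x
  • Adds 10 to the input value and stores the result: out[local_i] = a[local_i] + 10.0
  • Needs no bounds check, because THREADS_PER_BLOCK equals SIZE and every thread maps to a valid element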
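
To see why the expected buffer is [10.0, 11.0, 12.0, 13.0], the solution can be traced on the CPU. This is a minimal Python sketch rather than Mojo. It assumes the input buffer holds 0.0 through 3.0, which is inferred from the expected output shown earlier and is not stated in this section. `add_10_model` and `run_block` are hypothetical names.

```python
# CPU-side trace of the solution (illustrative only, not Mojo or GPU code).
SIZE = 4
THREADS_PER_BLOCK = SIZE


def add_10_model(out, a, thread_idx_x):
    """Mirror of the kernel body for one simulated thread."""
    local_i = thread_idx_x
    out[local_i] = a[local_i] + 10.0


def run_block(a):
    """Simulate one block of THREADS_PER_BLOCK threads."""
    out = [0.0] * SIZE
    for thread_idx_x in range(THREADS_PER_BLOCK):
        add_10_model(out, a, thread_idx_x)
    return out


# Assumption: the input is [0.0, 1.0, 2.0, 3.0], inferred from the expected output.
a = [float(i) for i in range(SIZE)]
result = run_block(a)
print(result)  # [10.0, 11.0, 12.0, 13.0]
```

Because the loop runs exactly SIZE simulated threads, every index from 0 to 3 is written once, which mirrors the one-thread-per-element launch configuration in the puzzle.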