Key concepts
In this puzzle, you’ll learn about:
- Implementing parallel reduction operations
- Using shared memory for intermediate results
- Coordinating threads for collective operations
The key insight is understanding how to efficiently combine multiple values into a single result using parallel computation and shared memory.
Configuration
- Vector size: SIZE = 8 elements
- Threads per block: TPB = 8
- Number of blocks: 1
- Output size: 1 element
- Shared memory: TPB elements
Notes:
- Element access: Each thread reads corresponding elements from a and b
- Partial results: Computing and storing intermediate values
- Thread coordination: Synchronizing before combining results
- Final reduction: Converting partial results to scalar output
Note: For this problem, you don't need to worry about the number of shared memory reads. We will handle that challenge later.
Code to complete
alias TPB = 8
alias SIZE = 8
alias BLOCKS_PER_GRID = (1, 1)
alias THREADS_PER_BLOCK = (SIZE, 1)
alias dtype = DType.float32
fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    # FILL ME IN (roughly 13 lines)
    ...
View full file: problems/p10/p10.mojo
Tips
- Store a[global_i] * b[global_i] in shared[local_i]
- Call barrier() to synchronize
- Use thread 0 to sum all products in shared memory
- Write the final sum to out[0]
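Taken literally, these tips describe a simpler variant than the tree reduction shown in the solution below: after the barrier, thread 0 alone walks shared memory and accumulates the sum. A minimal sketch of that approach follows, reusing the solution's signature and helpers (stack_allocation, barrier, thread_idx); the kernel name dot_product_thread0 is just an illustrative label, and this is a sketch rather than the reference answer:

fn dot_product_thread0(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Tip 1: each thread stores its product in shared memory
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    # Tip 2: synchronize so every product is visible before summing
    barrier()

    # Tips 3 and 4: thread 0 alone sums the products and writes out[0]
    if local_i == 0:
        var total: Scalar[dtype] = 0
        for i in range(size):
            total += shared[i]
        out[0] = total

This trades the \(\log_2(n)\)-step reduction for a serial loop on a single thread, which is fine at SIZE = 8 but scales poorly for larger vectors.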
Running the code
To test your solution, run the following command in your terminal:
magic run p10
Your output will look like this if the puzzle isn't solved yet:
out: HostBuffer([0.0])
expected: HostBuffer([140.0])
The expected value comes from the test inputs, where each product a[i] * b[i] equals i * i (see the reduction walkthrough below), so the dot product is 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 = 140.
Solution
fn dot_product(
    out: UnsafePointer[Scalar[dtype]],
    a: UnsafePointer[Scalar[dtype]],
    b: UnsafePointer[Scalar[dtype]],
    size: Int,
):
    shared = stack_allocation[
        TPB * sizeof[dtype](),
        Scalar[dtype],
        address_space = AddressSpace.SHARED,
    ]()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    if global_i < size:
        shared[local_i] = a[global_i] * b[global_i]

    barrier()

    # The following causes a race condition: all threads write to the same location.
    # out[0] += shared[local_i]
    #
    # Instead we do a parallel reduction in shared memory, as opposed to
    # global memory, which gives no synchronization guarantee.
    # Loops over global memory can also cause thread divergence because
    # GPUs fundamentally execute threads in warps (groups of typically 32 threads),
    # and warps can be scheduled independently.
    # Shared memory avoids these issues as long as we use `barrier()`
    # correctly within the same thread block.
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]

        barrier()
        stride //= 2

    # Only thread 0 writes the final result
    if local_i == 0:
        out[0] = shared[0]
The solution implements a parallel reduction algorithm for dot product computation using shared memory. Here’s a detailed breakdown:
Phase 1: Element-wise Multiplication
Each thread performs one multiplication:
Thread i: shared[i] = a[i] * b[i]
Phase 2: Parallel Reduction
The reduction uses a tree-based approach that halves active threads in each step:
Initial:             [0*0   1*1   2*2   3*3   4*4  5*5  6*6  7*7]
                   = [0     1     4     9     16   25   36   49]
Step 1 (stride = 4): [0+16  1+25  4+36  9+49  16   25   36   49]
                   = [16    26    40    58    16   25   36   49]
Step 2 (stride = 2): [16+40 26+58 40    58    16   25   36   49]
                   = [56    84    40    58    16   25   36   49]
Step 3 (stride = 1): [56+84 84    40    58    16   25   36   49]
                   = [140   84    40    58    16   25   36   49]
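To check that this walkthrough really lands on the expected 140.0, here is a small CPU-side sketch in plain Mojo (no GPU involved) that replays the same stride-halving schedule on the products 0*0 through 7*7. The function name and the use of List are illustrative assumptions, not part of the puzzle code:

from collections import List


def simulate_tree_reduction():
    # Products a[i] * b[i] for the test inputs, where each product is i * i
    vals = List[Float32]()
    for i in range(8):
        vals.append(Float32(i * i))

    # Same schedule as the kernel: strides 4, 2, 1
    stride = 4
    while stride > 0:
        for i in range(stride):
            # Mirrors shared[local_i] += shared[local_i + stride]
            vals[i] = vals[i] + vals[i + stride]
        stride //= 2

    print("result:", vals[0])  # 140.0

Running the three passes reproduces the rows above: [16, 26, 40, 58, ...], then [56, 84, ...], and finally 140 in slot 0.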
Key Implementation Features:
- Memory Access Pattern:
  - Each thread loads exactly two values from global memory (a[i], b[i])
  - Uses shared memory for intermediate results
  - Final result written once to global memory
- Thread Synchronization:
  - barrier() after the initial multiplication
  - barrier() after each reduction step
  - Prevents race conditions between reduction steps
- Reduction Logic:

  stride = TPB // 2
  while stride > 0:
      if local_i < stride:
          shared[local_i] += shared[local_i + stride]

      barrier()
      stride //= 2

  - Halves the stride in each step
  - Only active threads perform additions
  - Maintains work efficiency
- Performance Considerations:
  - \(\log_2(n)\) steps for \(n\) elements
  - Coalesced memory access pattern
  - Minimal thread divergence
  - Efficient use of shared memory
This implementation achieves \(O(\log n)\) time complexity compared to \(O(n)\) in sequential execution, demonstrating the power of parallel reduction algorithms.
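As a concrete instance for this puzzle's configuration:

\[
n = \text{SIZE} = 8:\qquad n - 1 = 7 \ \text{sequential additions} \quad \text{vs.} \quad \log_2 8 = 3 \ \text{parallel steps (strides } 4, 2, 1\text{).}
\]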
Barrier Synchronization Importance
The barrier() between reduction steps is critical for correctness. Here's why.
Without barrier(), race conditions occur:
Initial shared memory: [0 1 4 9 16 25 36 49]
Step 1 (stride = 4):
Thread 0 reads: shared[0] = 0, shared[4] = 16
Thread 1 reads: shared[1] = 1, shared[5] = 25
Thread 2 reads: shared[2] = 4, shared[6] = 36
Thread 3 reads: shared[3] = 9, shared[7] = 49
Without barrier:
- Thread 0 writes: shared[0] = 0 + 16 = 16
- Thread 2 writes: shared[2] = 4 + 36 = 40
- Thread 0 starts the next step (stride = 2) before Thread 2's write is visible
  and reads the old value shared[2] = 4 instead of 40!
With barrier():
Step 1 (stride = 4):
All threads write their sums:
[16 26 40 58 16 25 36 49]
barrier() ensures ALL threads see these values
Step 2 (stride = 2):
Now threads safely read the updated values:
Thread 0: shared[0] = 16 + 40 = 56
Thread 1: shared[1] = 26 + 58 = 84
The barrier() ensures:
- All writes from current step complete
- All threads see updated values
- No thread starts next iteration early
- Consistent shared memory state
Without these synchronization points, we could get:
- Memory race conditions
- Threads reading stale values
- Non-deterministic results
- Incorrect final sum
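To tie this back to the kernel, here is the solution's reduction loop again with comments marking what each barrier() protects, using the thread 0 / thread 2 example from above; the annotations are explanatory only and the logic is unchanged:

    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            # stride = 4: thread 2 writes shared[2] = 4 + 36 = 40
            shared[local_i] += shared[local_i + stride]

        # Without this barrier, thread 0 could move on to stride = 2
        # and read shared[2] while it still holds the old value 4,
        # producing 16 + 4 instead of 16 + 40 = 56 in shared[0].
        barrier()
        stride //= 2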