Double-Buffered Stencil Computation

🔬 Fine-Grained Synchronization: mbarrier vs barrier()

This puzzle introduces explicit memory barrier APIs that provide significantly more control than the basic barrier() function used in previous puzzles.

Basic barrier() limitations:

Fire-and-forget: Single synchronization point with no state tracking

Block-wide only: All threads in the block must participate simultaneously

No reusability: Each barrier() call creates a new synchronization event

Coarse-grained: Limited control over memory ordering and timing

Static coordination: Cannot adapt to different thread participation patterns

Advanced mbarrier APIs capabilities:

Precise control: mbarrier_init() sets up reusable barrier objects with specific thread counts

State tracking: mbarrier_arrive() signals individual thread completion and maintains arrival count

Flexible waiting: mbarrier_test_wait() allows threads to wait for specific completion states

Reusable objects: Same barrier can be reinitialized and reused across multiple iterations

Multiple barriers: Different barrier objects for different synchronization points (initialization, iteration, finalization)

Hardware optimization: Maps directly to GPU hardware synchronization primitives for better performance

Memory semantics: Explicit control over memory visibility and ordering guarantees

Why this matters for iterative algorithms: In double-buffering patterns, you need precise coordination between buffer swap phases. Basic barrier() cannot provide the fine-grained control required for:

Buffer role alternation: Ensuring all writes to buffer_A complete before reading from buffer_A begins

Iteration boundaries: Coordinating multiple synchronization points within a single kernel

State management: Tracking which threads have completed which phase of processing

Performance optimization: Minimizing synchronization overhead through reusable barrier objects

This puzzle demonstrates synchronization patterns used in real-world GPU computing applications like iterative solvers, simulation frameworks, and high-performance image processing pipelines.

Overview

Implement a kernel that performs iterative stencil operations using double-buffered shared memory, coordinated with explicit memory barriers to ensure safe buffer swapping between iterations. A stencil operation is a computational pattern where the value of each element in an array is calculated based on a fixed pattern of its neighbors.

Note: You have alternating buffer roles: buffer_A and buffer_B swap between read and write operations each iteration, with mbarrier synchronization ensuring all threads complete writes before buffer swaps.

Algorithm architecture: This puzzle implements a double-buffering pattern where two shared memory buffers alternate roles as read and write targets across multiple iterations. Unlike simple stencil operations that process data once, this approach performs iterative refinement with careful memory barrier coordination to prevent race conditions during buffer transitions.

Pipeline concept: The algorithm processes data through iterative stencil refinement, where each iteration reads from one buffer and writes to another. The buffers alternate roles each iteration, creating a ping-pong pattern that enables continuous processing without data corruption.

Data dependencies and synchronization: Each iteration depends on the complete results of the previous iteration:

Iteration N → Iteration N+1: Current iteration produces refined data that next iteration consumes
Buffer coordination: Read and write buffers swap roles each iteration
Memory barriers prevent race conditions by ensuring all writes complete before any thread begins reading from the newly written buffer

Concretely, the double-buffered stencil implements an iterative smoothing algorithm with three mathematical operations:

Iteration Pattern - Buffer Alternation:

\[\text{Iteration } i: \begin{cases} \text{Read from buffer_A, Write to buffer_B} & \text{if } i \bmod 2 = 0 \\ \text{Read from buffer_B, Write to buffer_A} & \text{if } i \bmod 2 = 1 \end{cases}\]

Stencil Operation - 3-Point Average:

\[S^{(i+1)}[j] = \frac{1}{N_j} \sum_{k=-1}^{1} S^{(i)}[j+k] \quad \text{where } j+k \in [0, 255]\]

where \(S^{(i)}[j]\) is the stencil value at position \(j\) after iteration \(i\), and \(N_j\) is the count of valid neighbors.

Memory Barrier Coordination:

\[\text{mbarrier_arrive}() \Rightarrow \text{mbarrier_test_wait}() \Rightarrow \text{buffer swap} \Rightarrow \text{next iteration}\]

Final Output Selection:

\[\text{Output}[j] = \begin{cases} \text{buffer_A}[j] & \text{if STENCIL_ITERATIONS } \bmod 2 = 0 \\ \text{buffer_B}[j] & \text{if STENCIL_ITERATIONS } \bmod 2 = 1 \end{cases}\]

Key concepts

In this puzzle, you’ll learn about:

Implementing double-buffering patterns for iterative algorithms
Coordinating explicit memory barriers using mbarrier APIs
Managing alternating read/write buffer roles across iterations

The key insight is understanding how to safely coordinate buffer swapping in iterative algorithms where race conditions between read and write operations can corrupt data if not properly synchronized.

Why this matters: Most GPU tutorials show simple one-pass algorithms, but real-world applications often require iterative refinement with multiple passes over data. Double-buffering is essential for algorithms like iterative solvers, image processing filters, and simulation updates where each iteration depends on the complete results of the previous iteration.

Previous vs. current synchronization:

Previous puzzles (P8, P12, P15): Simple barrier() calls for single-pass algorithms
This puzzle: Explicit mbarrier APIs for precise control over buffer swap timing

Memory barrier specialization: Unlike basic thread synchronization, this puzzle uses explicit memory barriers that provide fine-grained control over when memory operations complete, essential for complex memory access patterns.

Configuration

System parameters:

Image size: SIZE = 1024 elements (1D for simplicity)
Threads per block: TPB = 256 threads organized as (256, 1) block dimension
Grid configuration: (4, 1) blocks to process entire image in tiles (4 blocks total)
Data type: DType.float32 for all computations

Iteration parameters:

Stencil iterations: STENCIL_ITERATIONS = 3 refinement passes
Buffer count: BUFFER_COUNT = 2 (double-buffering)
Stencil kernel: 3-point averaging with radius 1

Buffer architecture:

buffer_A: Primary shared memory buffer ([256] elements)
buffer_B: Secondary shared memory buffer ([256] elements)
Role alternation: Buffers swap between read source and write target each iteration

Processing requirements:

Initialization phase:

Buffer setup: Initialize buffer_A with input data, buffer_B with zeros
Barrier initialization: Set up mbarrier objects for synchronization points
Thread coordination: All threads participate in initialization

Iterative processing:

Even iterations (0, 2, 4…): Read from buffer_A, write to buffer_B
Odd iterations (1, 3, 5…): Read from buffer_B, write to buffer_A
Stencil operation: 3-point average \((\text{left} + \text{center} + \text{right}) / 3\)
Boundary handling: Use adaptive averaging for elements at buffer edges

Memory barrier coordination:

mbarrier_arrive(): Each thread signals completion of write phase
mbarrier_test_wait(): All threads wait until everyone completes writes
Buffer swap safety: Prevents reading from buffer while others still writing
Barrier reinitialization: Reset barrier state between iterations

Output phase:

Final buffer selection: Choose active buffer based on iteration parity
Global memory write: Copy final results to output array
Completion barrier: Ensure all writes finish before block termination

Code to complete


# Double-buffered stencil configuration
alias STENCIL_ITERATIONS = 3
alias BUFFER_COUNT = 2


fn double_buffered_stencil_computation[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Double-buffered stencil computation with memory barrier coordination.

    Iteratively applies 3-point stencil using alternating buffers.
    Uses mbarrier APIs for precise buffer swap coordination.
    """

    # Double-buffering: Two shared memory buffers
    buffer_A = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    buffer_B = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Memory barriers for coordinating buffer swaps
    init_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    iter_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    final_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Initialize barriers (only thread 0)
    if local_i == 0:
        mbarrier_init(init_barrier.ptr, TPB)
        mbarrier_init(iter_barrier.ptr, TPB)
        mbarrier_init(final_barrier.ptr, TPB)

    # Initialize buffer_A with input data

    # FILL ME IN (roughly 4 lines)

    # Wait for buffer_A initialization
    _ = mbarrier_arrive(init_barrier.ptr)
    _ = mbarrier_test_wait(init_barrier.ptr, TPB)

    # Iterative stencil processing with double-buffering
    @parameter
    for iteration in range(STENCIL_ITERATIONS):

        @parameter
        if iteration % 2 == 0:
            # Even iteration: Read from A, Write to B

            # FILL ME IN (roughly 12 lines)
            ...

        else:
            # Odd iteration: Read from B, Write to A

            # FILL ME IN (roughly 12 lines)
            ...

        # Memory barrier: wait for all writes before buffer swap
        _ = mbarrier_arrive(iter_barrier.ptr)
        _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

        # Reinitialize barrier for next iteration
        if local_i == 0:
            mbarrier_init(iter_barrier.ptr, TPB)

    # Write final results from active buffer
    if local_i < TPB and global_i < size:

        @parameter
        if STENCIL_ITERATIONS % 2 == 0:
            # Even iterations end in buffer_A
            output[global_i] = buffer_A[local_i]
        else:
            # Odd iterations end in buffer_B
            output[global_i] = buffer_B[local_i]

    # Final barrier
    _ = mbarrier_arrive(final_barrier.ptr)
    _ = mbarrier_test_wait(final_barrier.ptr, TPB)

View full file: problems/p29/p29.mojo

Tips

Buffer initialization

Initialize buffer_A with input data, buffer_B can start empty
Use proper bounds checking with zero-padding for out-of-range elements
Only thread 0 should initialize the mbarrier objects
Set up separate barriers for different synchronization points

Iteration control

Use @parameter for iteration in range(STENCIL_ITERATIONS) for compile-time unrolling
Determine buffer roles using iteration % 2 to alternate read/write assignments
Apply stencil operation only within valid bounds with neighbor checking

Stencil computation

Implement 3-point averaging: (left + center + right) / 3
Handle boundary conditions by only including valid neighbors in average
Use adaptive counting to handle edge cases gracefully

Memory barrier coordination

Call mbarrier_arrive() after each thread completes its write operations
Use mbarrier_test_wait() to ensure all threads finish before buffer swap
Reinitialize barriers between iterations for reuse: mbarrier_init()
Only thread 0 should reinitialize barriers to avoid race conditions

Output selection

Choose final active buffer based on STENCIL_ITERATIONS % 2
Even iteration counts end with data in buffer_A
Odd iteration counts end with data in buffer_B
Write final results to global output with bounds checking

Running the code

To test your solution, run the following command in your terminal:

pixi run p29 --double-buffer

pixi run -e amd p29 --double-buffer

uv run poe p29 --double-buffer

After completing the puzzle successfully, you should see output similar to:

Puzzle 29: GPU Synchronization Primitives
==================================================
TPB: 256
SIZE: 1024
STENCIL_ITERATIONS: 3
BUFFER_COUNT: 2

Testing Puzzle 29B: Double-Buffered Stencil Computation
============================================================
Double-buffered stencil completed
Input sample: 1.0 1.0 1.0
GPU output sample: 1.0 1.0 1.0
✅ Double-buffered stencil test PASSED!

Solution

fn double_buffered_stencil_computation[
    layout: Layout
](
    output: LayoutTensor[dtype, layout, MutAnyOrigin],
    input: LayoutTensor[dtype, layout, ImmutAnyOrigin],
    size: Int,
):
    """Double-buffered stencil computation with memory barrier coordination.

    Iteratively applies 3-point stencil using alternating buffers.
    Uses mbarrier APIs for precise buffer swap coordination.
    """

    # Double-buffering: Two shared memory buffers
    buffer_A = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    buffer_B = LayoutTensor[
        dtype,
        Layout.row_major(TPB),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    # Memory barriers for coordinating buffer swaps
    init_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    iter_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()
    final_barrier = LayoutTensor[
        DType.uint64,
        Layout.row_major(1),
        MutAnyOrigin,
        address_space = AddressSpace.SHARED,
    ].stack_allocation()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Initialize barriers (only thread 0)
    if local_i == 0:
        mbarrier_init(init_barrier.ptr, TPB)
        mbarrier_init(iter_barrier.ptr, TPB)
        mbarrier_init(final_barrier.ptr, TPB)

    # Initialize buffer_A with input data
    if local_i < TPB and global_i < size:
        buffer_A[local_i] = input[global_i]
    else:
        buffer_A[local_i] = 0.0

    # Wait for buffer_A initialization
    _ = mbarrier_arrive(init_barrier.ptr)
    _ = mbarrier_test_wait(init_barrier.ptr, TPB)

    # Iterative stencil processing with double-buffering
    @parameter
    for iteration in range(STENCIL_ITERATIONS):

        @parameter
        if iteration % 2 == 0:
            # Even iteration: Read from A, Write to B
            if local_i < TPB:
                var stencil_sum: Scalar[dtype] = 0.0
                var stencil_count: Int = 0

                # 3-point stencil: [i-1, i, i+1]
                for offset in range(-1, 2):
                    sample_idx = local_i + offset
                    if sample_idx >= 0 and sample_idx < TPB:
                        stencil_sum += rebind[Scalar[dtype]](
                            buffer_A[sample_idx]
                        )
                        stencil_count += 1

                if stencil_count > 0:
                    buffer_B[local_i] = stencil_sum / stencil_count
                else:
                    buffer_B[local_i] = buffer_A[local_i]

        else:
            # Odd iteration: Read from B, Write to A
            if local_i < TPB:
                var stencil_sum: Scalar[dtype] = 0.0
                var stencil_count: Int = 0

                # 3-point stencil: [i-1, i, i+1]
                for offset in range(-1, 2):
                    sample_idx = local_i + offset
                    if sample_idx >= 0 and sample_idx < TPB:
                        stencil_sum += rebind[Scalar[dtype]](
                            buffer_B[sample_idx]
                        )
                        stencil_count += 1

                if stencil_count > 0:
                    buffer_A[local_i] = stencil_sum / stencil_count
                else:
                    buffer_A[local_i] = buffer_B[local_i]

        # Memory barrier: wait for all writes before buffer swap
        _ = mbarrier_arrive(iter_barrier.ptr)
        _ = mbarrier_test_wait(iter_barrier.ptr, TPB)

        # Reinitialize barrier for next iteration
        if local_i == 0:
            mbarrier_init(iter_barrier.ptr, TPB)

    # Write final results from active buffer
    if local_i < TPB and global_i < size:

        @parameter
        if STENCIL_ITERATIONS % 2 == 0:
            # Even iterations end in buffer_A
            output[global_i] = buffer_A[local_i]
        else:
            # Odd iterations end in buffer_B
            output[global_i] = buffer_B[local_i]

    # Final barrier
    _ = mbarrier_arrive(final_barrier.ptr)
    _ = mbarrier_test_wait(final_barrier.ptr, TPB)

The key insight is recognizing this as a double-buffering architecture problem with explicit memory barrier coordination:

Design alternating buffer roles: Swap read/write responsibilities each iteration
Implement explicit memory barriers: Use mbarrier APIs for precise synchronization control
Coordinate iterative processing: Ensure complete iteration results before buffer swaps
Optimize memory access patterns: Keep all processing in fast shared memory

Complete Solution with Detailed Explanation

The double-buffered stencil solution demonstrates sophisticated memory barrier coordination and iterative processing patterns. This approach enables safe iterative refinement algorithms that require precise control over memory access timing.

Double-buffering architecture design

The fundamental breakthrough in this puzzle is explicit memory barrier control rather than simple thread synchronization:

Traditional approach: Use basic barrier() for simple thread coordination

All threads execute same operation on different data
Single barrier call synchronizes thread completion
No control over specific memory operation timing

This puzzle’s innovation: Different buffer roles coordinated with explicit memory barriers

buffer_A and buffer_B alternate between read source and write target
mbarrier APIs provide precise control over memory operation completion
Explicit coordination prevents race conditions during buffer transitions

Iterative processing coordination

Unlike single-pass algorithms, this establishes iterative refinement with careful buffer management:

Iteration 0: Read from buffer_A (initialized with input), write to buffer_B
Iteration 1: Read from buffer_B (previous results), write to buffer_A
Iteration 2: Read from buffer_A (previous results), write to buffer_B
Continue alternating: Each iteration refines results from previous iteration

Memory barrier API usage

Understanding the mbarrier coordination pattern:

mbarrier_init(): Initialize barrier for specific thread count (TPB)
mbarrier_arrive(): Signal individual thread completion of write phase
mbarrier_test_wait(): Block until all threads signal completion
Reinitialization: Reset barrier state between iterations for reuse

Critical timing sequence:

All threads write: Each thread updates its assigned buffer element
Signal completion: Each thread calls mbarrier_arrive()
Wait for all: All threads call mbarrier_test_wait()
Safe to proceed: Now safe to swap buffer roles for next iteration

Stencil operation mechanics

The 3-point stencil operation with adaptive boundary handling:

Interior elements (indices 1 to 254):

# Average with left, center, and right neighbors
stencil_sum = buffer[i-1] + buffer[i] + buffer[i+1]
result[i] = stencil_sum / 3.0

Boundary elements (indices 0 and 255):

# Only include valid neighbors in average
stencil_count = 0
for neighbor in valid_neighbors:
    stencil_sum += buffer[neighbor]
    stencil_count += 1
result[i] = stencil_sum / stencil_count

Buffer role alternation

The ping-pong buffer pattern ensures data integrity:

Even iterations (0, 2, 4…):

Read source: buffer_A contains current data
Write target: buffer_B receives updated results
Memory flow: buffer_A → stencil operation → buffer_B

Odd iterations (1, 3, 5…):

Read source: buffer_B contains current data
Write target: buffer_A receives updated results
Memory flow: buffer_B → stencil operation → buffer_A

Race condition prevention

Memory barriers eliminate multiple categories of race conditions:

Without barriers (broken):

# Thread A writes to buffer_B[10]
buffer_B[10] = stencil_result_A

# Thread B immediately reads buffer_B[10] for its stencil
# RACE CONDITION: Thread B might read old value before Thread A's write completes
stencil_input = buffer_B[10]  // Undefined behavior!

With barriers (correct):

# All threads write their results
buffer_B[local_i] = stencil_result

# Signal write completion
mbarrier_arrive(barrier)

# Wait for ALL threads to complete writes
mbarrier_test_wait(barrier, TPB)

# Now safe to read - all writes guaranteed complete
stencil_input = buffer_B[neighbor_index]  // Always sees correct values

Output buffer selection

Final result location depends on iteration parity:

Mathematical determination:

STENCIL_ITERATIONS = 3 (odd number)
Final active buffer: Iteration 2 writes to buffer_B
Output source: Copy from buffer_B to global memory

Implementation pattern:

@parameter
if STENCIL_ITERATIONS % 2 == 0:
    # Even total iterations end in buffer_A
    output[global_i] = buffer_A[local_i]
else:
    # Odd total iterations end in buffer_B
    output[global_i] = buffer_B[local_i]

Performance characteristics

Memory hierarchy optimization:

Global memory: Accessed only for input loading and final output
Shared memory: All iterative processing uses fast shared memory
Register usage: Minimal due to shared memory focus

Synchronization overhead:

mbarrier cost: Higher than basic barrier() but provides essential control
Iteration scaling: Overhead increases linearly with iteration count
Thread efficiency: All threads remain active throughout processing

Real-world applications

This double-buffering pattern is fundamental to:

Iterative solvers:

Gauss-Seidel and Jacobi methods for linear systems
Iterative refinement for numerical accuracy
Multigrid methods with level-by-level processing

Image processing:

Multi-pass filters (bilateral, guided, edge-preserving)
Iterative denoising algorithms
Heat diffusion and anisotropic smoothing

Simulation algorithms:

Cellular automata with state evolution
Particle systems with position updates
Fluid dynamics with iterative pressure solving

Key technical insights

Memory barrier philosophy:

Explicit control: Precise timing control over memory operations vs automatic synchronization
Race prevention: Essential for any algorithm with alternating read/write patterns
Performance trade-off: Higher synchronization cost for guaranteed correctness

Double-buffering benefits:

Data integrity: Eliminates read-while-write hazards
Algorithm clarity: Clean separation between current and next iteration state
Memory efficiency: No need for global memory intermediate storage

Iteration management:

Compile-time unrolling: @parameter for enables optimization opportunities
State tracking: Buffer role alternation must be deterministic
Boundary handling: Adaptive stencil operations handle edge cases gracefully

This solution demonstrates how to design iterative GPU algorithms that require precise memory access control, moving beyond simple parallel loops to sophisticated memory management patterns used in production numerical software.