Multi-Stage Pipeline Coordination

Overview

Implement a kernel that processes an image through a coordinated 3-stage pipeline where different thread groups handle specialized processing stages, synchronized with explicit barriers.

Note: Threads take on specialized roles: Stage 1 (threads 0-127) loads and preprocesses data, Stage 2 (threads 128-255) applies blur operations, and Stage 3 (all threads) performs final smoothing.

Algorithm architecture: This puzzle implements a producer-consumer pipeline where different thread groups execute completely different algorithms within a single GPU block. Unlike traditional GPU programming where all threads execute the same algorithm on different data, this approach divides threads by functional specialization.

Pipeline concept: The algorithm processes data through three distinct stages, where each stage has specialized thread groups that execute different algorithms. Each stage produces data that the next stage consumes, creating explicit producer-consumer relationships that must be carefully synchronized with barriers.

Data dependencies and synchronization: Each stage produces data that the next stage consumes:

  • Stage 1 → Stage 2: First stage produces preprocessed data for blur processing
  • Stage 2 → Stage 3: Second stage produces blur results for final smoothing
  • Barriers prevent race conditions by ensuring complete stage completion before dependent stages begin

Concretely, the multi-stage pipeline implements a coordinated image processing algorithm with three mathematical operations:

Stage 1 - Preprocessing Enhancement:

\[P[i] = I[i] \times 1.1\]

where \(P[i]\) is the preprocessed data and \(I[i]\) is the input data.

Stage 2 - Horizontal Blur Filter:

\[B[i] = \frac{1}{N_i} \sum_{k=-2}^{2} P[i+k] \quad \text{where } i+k \in [0, 255]\]

where \(B[i]\) is the blur result, and \(N_i\) is the count of valid neighbors within the tile boundary.
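
For example, at the left tile edge only the neighbors \(k \in \{0, 1, 2\}\) fall inside the tile, so \(N_0 = 3\) and \(B[0] = \frac{P[0] + P[1] + P[2]}{3}\).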

Stage 3 - Cascading Neighbor Smoothing:

\[F[i] = \begin{cases} (B[i] + B[i+1]) \times 0.6 & \text{if } i = 0 \\ ((B[i] + B[i-1]) \times 0.6 + B[i+1]) \times 0.6 & \text{if } 0 < i < 255 \\ (B[i] + B[i-1]) \times 0.6 & \text{if } i = 255 \end{cases}\]

where \(F[i]\) is the final output with cascading smoothing applied.
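
For example, \(F[1] = ((B[1] + B[0]) \times 0.6 + B[2]) \times 0.6\): the left-neighbor step is scaled first, and its result feeds into the right-neighbor step, which is what makes the smoothing cascading rather than a single 3-point average.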

Thread Specialization:

  • Threads 0-127: Compute \(P[i]\) for \(i \in \{0, 1, 2, \ldots, 255\}\) (2 elements per thread)
  • Threads 128-255: Compute \(B[i]\) for \(i \in \{0, 1, 2, \ldots, 255\}\) (2 elements per thread)
  • All 256 threads: Compute \(F[i]\) for \(i \in \{0, 1, 2, \ldots, 255\}\) (1 element per thread)
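
For example, under the natural mapping where thread \(t\) handles elements \(2t\) and \(2t + 1\), thread 5 computes \(P[10]\) and \(P[11]\) in Stage 1, thread 130 computes \(B[4]\) and \(B[5]\) in Stage 2, and every thread \(t\) computes \(F[t]\) in Stage 3. Any mapping that covers all 256 elements exactly once works.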

Synchronization Points:

\[\text{Stage 1 writes } P[i] \xrightarrow{\ \text{barrier}_1\ } \text{Stage 2 writes } B[i] \xrightarrow{\ \text{barrier}_2\ } \text{Stage 3 writes } F[i] \xrightarrow{\ \text{barrier}_3\ } \text{block exit}\]

Key concepts

In this puzzle, you’ll learn about:

  • Implementing thread role specialization within a single GPU block
  • Coordinating producer-consumer relationships between processing stages
  • Using barriers to synchronize between different algorithms (not just within the same algorithm)

The key insight is understanding how to design multi-stage pipelines where different thread groups execute completely different algorithms, coordinated through strategic barrier placement.

Why this matters: Most GPU tutorials teach barrier usage within a single algorithm: synchronizing threads during reductions or shared memory operations. But real-world GPU algorithms often consist of multiple distinct processing stages that must be carefully orchestrated. This puzzle demonstrates how to transform a monolithic algorithm into a specialized, coordinated processing pipeline.

Previous vs. current barrier usage:

  • Previous puzzles (P8, P12, P15): All threads execute the same algorithm, barriers sync within algorithm steps
  • This puzzle: Different thread groups execute different algorithms, barriers coordinate between different algorithms

Thread specialization architecture: Unlike data parallelism where threads differ only in their data indices, this puzzle implements algorithmic parallelism where threads execute fundamentally different code paths based on their role in the pipeline.

Configuration

System parameters:

  • Image size: SIZE = 1024 elements (1D for simplicity)
  • Threads per block: TPB = 256 threads organized as (256, 1) block dimension
  • Grid configuration: (4, 1) blocks to process entire image in tiles (4 blocks total)
  • Data type: DType.float32 for all computations

Thread specialization architecture:

  • Stage 1 threads: STAGE1_THREADS = 128 (threads 0-127, first half of block)

    • Responsibility: Load input data from global memory and apply preprocessing
    • Work distribution: Each thread processes 2 elements for efficient load balancing
    • Output: Populates input_shared[256] with preprocessed data
  • Stage 2 threads: STAGE2_THREADS = 128 (threads 128-255, second half of block)

    • Responsibility: Apply horizontal blur filter on preprocessed data
    • Work distribution: Each thread processes 2 blur operations
    • Output: Populates blur_shared[256] with blur results
  • Stage 3 threads: All 256 threads collaborate

    • Responsibility: Final smoothing and output to global memory
    • Work distribution: One-to-one mapping (thread i processes element i)
    • Output: Writes final results to global output array

Code to complete


# Imports and aliases as used throughout these puzzles (see the full file p29.mojo)
from gpu import barrier, block_dim, block_idx, thread_idx
from layout import Layout, LayoutTensor
from layout.tensor_builder import LayoutTensorBuilder as tb

alias TPB = 256  # Threads per block for pipeline stages
alias SIZE = 1024  # Image size (1D for simplicity)
alias BLOCKS_PER_GRID = (4, 1)
alias THREADS_PER_BLOCK = (TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)

# Multi-stage processing configuration
alias STAGE1_THREADS = TPB // 2
alias STAGE2_THREADS = TPB // 2
alias BLUR_RADIUS = 2


fn multi_stage_image_blur_pipeline[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    """Multi-stage image blur pipeline with barrier coordination.

    Stage 1 (threads 0-127): Load input data and apply 1.1x preprocessing
    Stage 2 (threads 128-255): Apply 5-point blur with BLUR_RADIUS=2
    Stage 3 (all threads): Final neighbor smoothing and output
    """

    # Shared memory buffers for pipeline stages
    input_shared = tb[dtype]().row_major[TPB]().shared().alloc()
    blur_shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Stage 1: Load and preprocess (threads 0-127)

    # FILL ME IN (roughly 10 lines)

    barrier()  # Wait for Stage 1 completion

    # Stage 2: Apply blur (threads 128-255)

    # FILL ME IN (roughly 25 lines)

    barrier()  # Wait for Stage 2 completion

    # Stage 3: Final smoothing (all threads)

    # FILL ME IN (roughly 7 lines)

    barrier()  # Ensure all writes complete


View full file: problems/p29/p29.mojo

Tips

Thread role identification

  • Use thread index comparisons to determine which stage each thread should execute
  • Stage 1: First half of threads (threads 0-127)
  • Stage 2: Second half of threads (threads 128-255)
  • Stage 3: All threads participate
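
A minimal sketch of the role checks, using local_i and the aliases from the template above (is_stage1 and is_stage2 are illustrative names):

# Branch on thread role within the block
is_stage1 = local_i < STAGE1_THREADS   # threads 0-127
is_stage2 = local_i >= STAGE1_THREADS  # threads 128-255
# Stage 3 needs no predicate: every thread participates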

Stage 1 approach

  • Identify Stage 1 threads using appropriate index comparison
  • Each thread should handle multiple elements for load balancing
  • Apply the preprocessing enhancement factor
  • Implement proper boundary handling with zero-padding
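
A sketch of Stage 1, assuming the elem = local_i * 2 + offset work distribution; elem and g are illustrative names, and the reference solution may distribute work differently:

# Stage 1: threads 0-127 load two elements each and preprocess
if local_i < STAGE1_THREADS:
    for offset in range(2):
        elem = local_i * 2 + offset              # local element index 0..255
        g = block_idx.x * TPB + elem             # matching global element index
        if g < size:
            input_shared[elem] = input[g] * 1.1  # preprocessing enhancement
        else:
            input_shared[elem] = 0.0             # zero-pad outside the image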

Stage 2 approach

  • Identify Stage 2 threads and map their indices to processing range
  • Implement the blur kernel by averaging neighboring elements
  • Handle boundary conditions by only including valid neighbors
  • Process multiple elements per thread for efficiency
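
A sketch of Stage 2's boundary-aware averaging, under the same two-elements-per-thread mapping; blur_idx, blur_sum, and count are illustrative names:

# Stage 2: threads 128-255 compute two blurred elements each
if local_i >= STAGE1_THREADS:
    blur_idx = local_i - STAGE1_THREADS          # remap to 0..127
    for offset in range(2):
        elem = blur_idx * 2 + offset             # local element index 0..255
        var blur_sum: Scalar[dtype] = 0.0
        var count = 0
        for k in range(-BLUR_RADIUS, BLUR_RADIUS + 1):
            n = elem + k
            if n >= 0 and n < TPB:               # include valid neighbors only
                blur_sum += input_shared[n]
                count += 1
        blur_shared[elem] = blur_sum / Scalar[dtype](count)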

Stage 3 approach

  • All threads participate in final processing
  • Apply neighbor smoothing using the specified scaling factor
  • Handle edge cases where neighbors may not exist
  • Write results to global output with bounds checking
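
A sketch of Stage 3's cascading smoothing; result is an illustrative name:

# Stage 3: all 256 threads, one element each
var result = blur_shared[local_i]
if local_i > 0:
    result = (result + blur_shared[local_i - 1]) * 0.6  # left neighbor first
if local_i < TPB - 1:
    result = (result + blur_shared[local_i + 1]) * 0.6  # then right neighbor
if global_i < size:
    output[global_i] = result                           # bounds-checked write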

Synchronization strategy

  • Place barriers between stages to prevent race conditions
  • Ensure each stage completes before dependent stages begin
  • Use final barrier to guarantee completion before block exit

Running the code

To test your solution, run the following command in your terminal:

uv run poe p29 --multi-stage
pixi run p29 --multi-stage

After completing the puzzle successfully, you should see output similar to:

Puzzle 29: GPU Synchronization Primitives
==================================================
TPB: 256
SIZE: 1024
STAGE1_THREADS: 128
STAGE2_THREADS: 128
BLUR_RADIUS: 2

Testing Puzzle 29A: Multi-Stage Pipeline Coordination
============================================================
Multi-stage pipeline blur completed
Input sample: 0.0 1.01 2.02
Output sample: 1.6665002 2.3331003 3.3996604
✅ Multi-stage pipeline coordination test PASSED!

Solution

The key insight is recognizing this as a pipeline architecture problem with thread role specialization:

  1. Design stage-specific thread groups: Divide threads by function, not just by data
  2. Implement producer-consumer chains: Stage 1 produces for Stage 2, Stage 2 produces for Stage 3
  3. Use strategic barrier placement: Synchronize between different algorithms, not within the same algorithm
  4. Optimize memory access patterns: Ensure coalesced reads and efficient shared memory usage

Complete Solution with Detailed Explanation

The multi-stage pipeline solution demonstrates sophisticated thread specialization and barrier coordination. This approach transforms a traditional monolithic GPU algorithm into a specialized, coordinated processing pipeline.
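
Here is one way the completed kernel can look, with imports and aliases as in the template above. Treat it as a sketch consistent with the expected output shown earlier, not necessarily the reference implementation: names like elem, blur_idx, blur_sum, and result are illustrative, and the work distribution within each stage may differ.

fn multi_stage_image_blur_pipeline[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    input: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    input_shared = tb[dtype]().row_major[TPB]().shared().alloc()
    blur_shared = tb[dtype]().row_major[TPB]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Stage 1: load and preprocess (threads 0-127, two elements each)
    if local_i < STAGE1_THREADS:
        for offset in range(2):
            elem = local_i * 2 + offset
            g = block_idx.x * TPB + elem
            if g < size:
                input_shared[elem] = input[g] * 1.1
            else:
                input_shared[elem] = 0.0

    barrier()  # input_shared fully written before Stage 2 reads it

    # Stage 2: 5-point horizontal blur (threads 128-255, two elements each)
    if local_i >= STAGE1_THREADS:
        blur_idx = local_i - STAGE1_THREADS
        for offset in range(2):
            elem = blur_idx * 2 + offset
            var blur_sum: Scalar[dtype] = 0.0
            var count = 0
            for k in range(-BLUR_RADIUS, BLUR_RADIUS + 1):
                n = elem + k
                if n >= 0 and n < TPB:
                    blur_sum += input_shared[n]
                    count += 1
            blur_shared[elem] = blur_sum / Scalar[dtype](count)

    barrier()  # blur_shared fully written before Stage 3 reads it

    # Stage 3: cascading neighbor smoothing (all threads, one element each)
    var result = blur_shared[local_i]
    if local_i > 0:
        result = (result + blur_shared[local_i - 1]) * 0.6
    if local_i < TPB - 1:
        result = (result + blur_shared[local_i + 1]) * 0.6
    if global_i < size:
        output[global_i] = result

    barrier()  # all output writes complete before the block exits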

Pipeline architecture design

The fundamental breakthrough in this puzzle is thread specialization by role rather than by data:

Traditional approach: All threads execute the same algorithm on different data

  • Everyone performs identical operations (like reductions or matrix operations)
  • Barriers synchronize threads within the same algorithm steps
  • Thread roles differ only by data indices they process

This puzzle’s innovation: Different thread groups execute completely different algorithms

  • Threads 0-127 execute loading and preprocessing algorithms
  • Threads 128-255 execute blur processing algorithms
  • All threads collaborate in final smoothing algorithm
  • Barriers coordinate between different algorithms, not within the same algorithm

Producer-consumer coordination

Unlike previous puzzles where threads were peers in the same algorithm, this establishes explicit producer-consumer relationships:

  • Stage 1: Producer (creates preprocessed data for Stage 2)
  • Stage 2: Consumer (uses Stage 1 data) + Producer (creates blur data for Stage 3)
  • Stage 3: Consumer (uses Stage 2 data)

Strategic barrier placement

Understanding when barriers are necessary vs. wasteful:

  • Necessary: Between dependent stages to prevent race conditions
  • Wasteful: Within independent operations of the same stage
  • Performance insight: Each barrier has a cost, so place them strategically

Critical synchronization points:

  1. After Stage 1: Prevent Stage 2 from reading incomplete preprocessed data
  2. After Stage 2: Prevent Stage 3 from reading incomplete blur results
  3. After Stage 3: Ensure all output writes complete before block termination

Thread utilization patterns

  • Stage 1: 50% utilization (128/256 threads active, 128 idle)
  • Stage 2: 50% utilization (128/256 threads active, 128 idle)
  • Stage 3: 100% utilization (all 256 threads active)

This demonstrates sophisticated algorithmic parallelism where different thread groups specialize in different computational tasks within a coordinated pipeline, moving beyond simple data parallelism to the architectural thinking required for real-world GPU algorithms.

Memory hierarchy optimization

Shared memory architecture:

  • Two specialized buffers handle data flow between stages
  • Global memory access minimized to boundary operations only
  • All intermediate processing uses fast shared memory

Access pattern benefits:

  • Stage 1: Coalesced global memory reads for input loading
  • Stage 2: Fast shared memory reads for blur processing
  • Stage 3: Coalesced global memory writes for output

Real-world applications

This pipeline architecture pattern is fundamental to:

Image processing pipelines:

  • Multi-stage filters (blur, sharpen, edge detection in sequence)
  • Color space conversions (RGB → HSV → processing → RGB)
  • Noise reduction with multiple algorithm passes

Scientific computing:

  • Stencil computations with multi-stage finite difference methods
  • Signal processing with filtering, transformation, and analysis pipelines
  • Computational fluid dynamics with multi-stage solver iterations

Machine learning:

  • Neural network layers with specialized thread groups for different operations
  • Data preprocessing pipelines (load, normalize, augment in coordinated stages)
  • Batch processing where different thread groups handle different operations

Key technical insights

Algorithmic vs. data parallelism:

  • Data parallelism: Threads execute identical code on different data elements
  • Algorithmic parallelism: Threads execute fundamentally different algorithms based on their specialized roles

Barrier usage philosophy:

  • Strategic placement: Barriers only where necessary to prevent race conditions between dependent stages
  • Performance consideration: Each barrier incurs synchronization overhead, so use them sparingly but correctly
  • Correctness guarantee: Proper barrier placement ensures deterministic results regardless of thread execution timing

Thread specialization benefits:

  • Algorithmic optimization: Each stage can be optimized for its specific computational pattern
  • Memory access optimization: Different stages can use different memory access strategies
  • Resource utilization: Complex algorithms can be decomposed into specialized, efficient components

This solution demonstrates how to design sophisticated GPU algorithms that leverage thread specialization and strategic synchronization for complex multi-stage computations, moving beyond simple parallel loops to architectural approaches used in production GPU software.