block.broadcast() Vector Normalization
Implement vector mean normalization by combining block.sum and block.broadcast operations to demonstrate the complete block-level communication workflow. Each thread will contribute to computing the mean, then receive the broadcast mean to normalize its element, showcasing how block operations work together to solve real parallel algorithms.
Key insight: The block.broadcast() operation enables one-to-all communication, completing the fundamental block communication patterns: reduction (all→one), scan (all→each), and broadcast (one→all).
Key concepts
In this puzzle, you’ll learn:
- Block-level broadcast with block.broadcast()
- One-to-all communication patterns
- Source thread specification and parameter control
- Complete block operations workflow combining multiple operations
- Real-world algorithm implementation using coordinated block primitives
The algorithm demonstrates vector mean normalization: \[\Large \text{output}[i] = \frac{\text{input}[i]}{\frac{1}{N}\sum_{j=0}^{N-1} \text{input}[j]}\]
Each thread contributes to the mean calculation, then receives the broadcast mean to normalize its element.
Configuration
- Vector size: SIZE = 128 elements
- Data type: DType.float32
- Block configuration: (128, 1) threads per block (TPB = 128)
- Grid configuration: (1, 1) blocks per grid
- Layout: Layout.row_major(SIZE) (1D row-major for input and output)
- Test data: Values cycling 1-8, so mean = 4.5
- Expected output: Normalized vector with mean = 1.0
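To make these expected numbers concrete before writing any GPU code, here is a small host-side reference sketch in plain Python (illustrative only; the variable names are not part of problems/p27/p27.mojo) that reproduces the test data and the values the kernel should produce:
# Host-side reference (plain Python): reproduce the test data and expected output
SIZE = 128
input_data = [float((i % 8) + 1) for i in range(SIZE)]  # values cycling 1-8

total = sum(input_data)                  # 576.0
mean = total / SIZE                      # 4.5
expected_output = [x / mean for x in input_data]

print(total, mean)                       # 576.0 4.5
print(sum(expected_output) / SIZE)       # 1.0 (output mean)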
The challenge: Coordinating block-wide computation and distribution
Traditional approaches to mean normalization require complex coordination:
# Sequential approach - doesn't utilize parallelism
total = sum(input_array)
mean = total / len(input_array)
output_array = [x / mean for x in input_array]
Problems with naive GPU parallelization:
- Multiple kernel launches: One pass to compute mean, another to normalize
- Global memory round-trip: Store mean to global memory, read back later
- Synchronization complexity: Need barriers between computation phases
- Thread divergence: Different threads doing different tasks
Traditional GPU solution complexity:
# Phase 1: Reduce to find sum (complex shared memory + barriers)
shared_sum[local_i] = my_value
barrier()
# Manual tree reduction with multiple barrier() calls...
# Phase 2: Thread 0 computes mean
if local_i == 0:
    mean = shared_sum[0] / size
    shared_mean[0] = mean
barrier()
# Phase 3: All threads read mean and normalize
mean = shared_mean[0] # Everyone reads the same value
output[global_i] = my_value / mean
The advanced approach: block.sum() + block.broadcast() coordination
Transform the multi-phase coordination into an elegant block operations workflow:
Code to complete
Complete block operations workflow
Implement sophisticated vector mean normalization using the full block operations toolkit:
alias vector_layout = Layout.row_major(SIZE)


fn block_normalize_vector[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[mut=False, dtype, in_layout],
    output_data: LayoutTensor[mut=True, dtype, out_layout],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all → one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one → all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Step 1: Each thread loads its element
    # FILL IN (roughly 3 lines)

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)
    # FILL IN (1 line)

    # Step 3: Thread 0 computes mean value
    # FILL IN (roughly 4 lines)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration
    # FILL IN (1 line)

    # Step 5: Each thread normalizes by the mean
    # FILL IN (roughly 3 lines)
View full file: problems/p27/p27.mojo
Tips
1. Complete workflow structure (builds on all previous operations)
The algorithm follows the perfect block operations pattern:
- Each thread loads its element (familiar from all previous puzzles)
- Use block.sum() to compute total (from earlier in this puzzle)
- Thread 0 computes mean from the sum
- Use block.broadcast() to share mean to all threads (NEW!)
- Each thread normalizes using the broadcast mean
2. Data loading and sum computation (familiar patterns)
Load your element using the established LayoutTensor pattern:
var my_value: Scalar[dtype] = 0.0
if global_i < size:
    my_value = input_data[global_i][0]  # SIMD extraction
Then use block.sum() exactly like the dot product earlier:
total_sum = block.sum[block_size=tpb, broadcast=False](...)
3. Mean computation (thread 0 only)
Only thread 0 should compute the mean:
var mean_value: Scalar[dtype] = 1.0  # Safe default
if local_i == 0:
    # Compute mean from total_sum and size
Why thread 0? Consistent with the block.sum() pattern, where thread 0 receives the result.
4. block.broadcast() API concepts
Study the function signature - it needs:
- Template parameters: dtype, width, block_size
- Runtime parameters: val (SIMD value to broadcast), src_thread (default = 0)
The call pattern follows the established template style:
result = block.broadcast[
    dtype = DType.float32,
    width = 1,
    block_size = tpb
](val=SIMD[DType.float32, 1](value_to_broadcast), src_thread=UInt(0))
5. Understanding the broadcast pattern
Key insight: block.broadcast() takes a value from ONE thread and gives it to ALL threads:
- Thread 0 has the computed mean value
- All threads need that same mean value
- block.broadcast() copies thread 0's value to everyone
This is the opposite of block.sum() (all→one) and different from block.prefix_sum() (all→each position).
6. Final normalization step
Once every thread has the broadcast mean, normalize your element:
if global_i < size:
    normalized_value = my_value / broadcasted_mean[0]  # Extract SIMD
    output_data[global_i] = normalized_value
SIMD extraction: Remember that block.broadcast() returns a SIMD value, so use [0] to extract the scalar.
7. Pattern recognition from previous puzzles
- Thread indexing: Same global_i, local_i pattern as always
- Bounds checking: Same if global_i < size validation
- SIMD handling: Same [0] extraction patterns
- Block operations: Same template parameter style as block.sum()
The beauty is that each block operation follows consistent patterns!
Test the block.broadcast() approach:
uv run p27 --normalize
pixi run p27 --normalize
Expected output when solved:
SIZE: 128
TPB: 128
Input sample: 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 ...
Sum value: 576.0
Mean value: 4.5
Mean Normalization Results:
Normalized sample: 0.22222222 0.44444445 0.6666667 0.8888889 1.1111112 1.3333334 1.5555556 1.7777778 ...
Output sum: 128.0
Output mean: 1.0
✅ Success: Output mean is 1.0 (should be close to 1.0)
Solution
fn block_normalize_vector[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    input_data: LayoutTensor[mut=False, dtype, in_layout],
    output_data: LayoutTensor[mut=True, dtype, out_layout],
    size: Int,
):
    """Vector mean normalization using block.sum() + block.broadcast() combination.

    This demonstrates the complete block operations workflow:
    1. Use block.sum() to compute sum of all elements (all → one)
    2. Thread 0 computes mean = sum / size
    3. Use block.broadcast() to share mean to all threads (one → all)
    4. Each thread normalizes: output[i] = input[i] / mean
    """
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Step 1: Each thread loads its element
    var my_value: Scalar[dtype] = 0.0
    if global_i < size:
        my_value = input_data[global_i][0]  # Extract SIMD value

    # Step 2: Use block.sum() to compute total sum (familiar from earlier!)
    total_sum = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](my_value)
    )

    # Step 3: Thread 0 computes mean value
    var mean_value: Scalar[dtype] = 1.0  # Default to avoid division by zero
    if local_i == 0:
        if total_sum[0] > 0.0:
            mean_value = total_sum[0] / Float32(size)

    # Step 4: block.broadcast() shares mean to ALL threads!
    # This completes the block operations trilogy demonstration
    broadcasted_mean = block.broadcast[
        dtype = DType.float32, width=1, block_size=tpb
    ](val=SIMD[DType.float32, 1](mean_value), src_thread=UInt(0))

    # Step 5: Each thread normalizes by the mean
    if global_i < size:
        normalized_value = my_value / broadcasted_mean[0]
        output_data[global_i] = normalized_value
The block.broadcast() kernel demonstrates the complete block operations workflow by combining all three fundamental communication patterns in a real algorithm that produces mathematically verifiable results:
Complete algorithm walkthrough with concrete execution:
Phase 1: Parallel data loading (established patterns from all previous puzzles)
Thread indexing (consistent across all puzzles):
global_i = block_dim.x * block_idx.x + thread_idx.x // Maps to input array position
local_i = thread_idx.x // Position within block (0-127)
Parallel element loading using LayoutTensor pattern:
Thread 0: my_value = input_data[0][0] = 1.0 // First cycle value
Thread 1: my_value = input_data[1][0] = 2.0 // Second cycle value
Thread 7: my_value = input_data[7][0] = 8.0 // Last cycle value
Thread 8: my_value = input_data[8][0] = 1.0 // Cycle repeats: 1,2,3,4,5,6,7,8,1,2...
Thread 15: my_value = input_data[15][0] = 8.0 // 15 % 8 = 7, so 8th value
Thread 127: my_value = input_data[127][0] = 8.0 // 127 % 8 = 7, so 8th value
All 128 threads load simultaneously - perfect parallel efficiency!
Phase 2: Block-wide sum reduction (leveraging earlier block.sum() knowledge)
block.sum() coordination across all 128 threads:
Contribution analysis:
- Values 1,2,3,4,5,6,7,8 repeat 16 times each (128/8 = 16)
- Thread contributions: 16×1 + 16×2 + 16×3 + 16×4 + 16×5 + 16×6 + 16×7 + 16×8
- Mathematical sum: 16 × (1+2+3+4+5+6+7+8) = 16 × 36 = 576.0
block.sum() hardware execution:
All threads → [reduction tree] → Thread 0
total_sum = SIMD[DType.float32, 1](576.0) // Only thread 0 receives this
Threads 1-127: Have no access to total_sum (broadcast=False in block.sum)
Phase 3: Exclusive mean computation (single-thread processing)
Thread 0 performs critical computation:
Input: total_sum[0] = 576.0, size = 128
Computation: mean_value = 576.0 / 128.0 = 4.5
Verification: Expected mean = (1+2+3+4+5+6+7+8)/8 = 36/8 = 4.5 ✓
All other threads (1-127):
mean_value = 1.0 (default safety value)
These values are irrelevant - will be overwritten by broadcast
Critical insight: Only thread 0 has the correct mean value at this point!
Phase 4: Block-wide broadcast distribution (one → all communication)
block.broadcast() API execution:
Source: src_thread = UInt(0) → Thread 0's mean_value = 4.5
Target: All 128 threads in block
Before broadcast:
Thread 0: mean_value = 4.5 ← Source of truth
Thread 1: mean_value = 1.0 ← Will be overwritten
Thread 2: mean_value = 1.0 ← Will be overwritten
...
Thread 127: mean_value = 1.0 ← Will be overwritten
After block.broadcast() execution:
Thread 0: broadcasted_mean[0] = 4.5 ← Receives own value back
Thread 1: broadcasted_mean[0] = 4.5 ← Now has correct value!
Thread 2: broadcasted_mean[0] = 4.5 ← Now has correct value!
...
Thread 127: broadcasted_mean[0] = 4.5 ← Now has correct value!
Result: Perfect synchronization - all threads have identical mean value!
Phase 5: Parallel mean normalization (coordinated processing)
Each thread independently normalizes using broadcast mean:
Thread 0: normalized = 1.0 / 4.5 = 0.22222222...
Thread 1: normalized = 2.0 / 4.5 = 0.44444444...
Thread 2: normalized = 3.0 / 4.5 = 0.66666666...
Thread 7: normalized = 8.0 / 4.5 = 1.77777777...
Thread 8: normalized = 1.0 / 4.5 = 0.22222222... (pattern repeats)
...
Mathematical verification:
Output sum = (0.222... + 0.444... + ... + 1.777...) × 16 = 8.0 × 16 = 128.0
Output mean = 128.0 / 128 = 1.0 (perfect normalization!)
Each value divided by original mean gives output with mean = 1.0
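A quick sanity check of the per-cycle arithmetic above (plain Python, purely illustrative):
cycle = [v / 4.5 for v in range(1, 9)]        # one normalized cycle: 0.222... to 1.777...
assert abs(sum(cycle) - 8.0) < 1e-9           # one cycle sums to 8.0
assert abs(16 * sum(cycle) - 128.0) < 1e-6    # 16 cycles sum to 128.0, so output mean = 1.0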
Phase 6: Verification of correctness
Input analysis:
- Sum: 576.0, Mean: 4.5
- Max: 8.0, Min: 1.0
- Range: [1.0, 8.0]
Output analysis:
- Sum: 128.0, Mean: 1.0 ✓
- Max: 1.777..., Min: 0.222...
- Range: [0.222, 1.777] (all values scaled by factor 1/4.5)
Proportional relationships preserved:
- Original 8:1 ratio becomes 1.777:0.222 = 8:1 ✓
- All relative magnitudes maintained perfectly
Why this complete workflow is mathematically and computationally superior:
Technical accuracy and verification:
Mathematical proof of correctness:
Input: x₁, x₂, ..., xₙ where n = 128
Mean: μ = (∑xᵢ)/n = 576/128 = 4.5
Normalization: yᵢ = xᵢ/μ
Output mean: (∑yᵢ)/n = (∑xᵢ/μ)/n = (1/μ)(∑xᵢ)/n = (1/μ)μ = 1 ✓
Algorithm produces provably correct mathematical result.
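The same derivation in display form, using the notation from the formula at the top of this section:
\[\frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\sum_{i=1}^{n}\frac{x_i}{\mu} = \frac{1}{\mu}\cdot\frac{1}{n}\sum_{i=1}^{n} x_i = \frac{\mu}{\mu} = 1\]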
Connection to Puzzle 12 (foundational patterns):
- Thread coordination evolution: Same global_i, local_i patterns but with block primitives
- Memory access patterns: Same LayoutTensor SIMD extraction [0] but optimized workflow
- Complexity elimination: Replaces 20+ lines of manual barriers with 2 block operations
- Educational progression: Manual → automated, complex → simple, error-prone → reliable
Connection to block.sum() (perfect integration):
- API consistency: Identical template structure [block_size=tpb, broadcast=False]
- Result flow design: Thread 0 receives sum, naturally computes derived parameter
- Seamless composition: Output of block.sum() becomes input for computation + broadcast
- Performance optimization: Single-kernel workflow vs multi-pass approaches
Connection to block.prefix_sum() (complementary communication):
- Distribution patterns: prefix_sum gives unique positions, broadcast gives shared values
- Usage scenarios: prefix_sum for parallel partitioning, broadcast for parameter sharing
- Template consistency: Same dtype, block_size parameter patterns across all operations
- SIMD handling uniformity: All block operations return SIMD requiring [0] extraction
Advanced algorithmic insights:
Communication pattern comparison:
Traditional approach:
1. Manual reduction: O(log n) with explicit barriers
2. Shared memory write: O(1) with synchronization
3. Shared memory read: O(1) with potential bank conflicts
Total: Multiple synchronization points, error-prone
Block operations approach:
1. block.sum(): O(log n) hardware-optimized, automatic barriers
2. Computation: O(1) single thread
3. block.broadcast(): O(log n) hardware-optimized, automatic distribution
Total: Two primitives, automatic synchronization, provably correct
Real-world algorithm patterns demonstrated:
Common parallel algorithm structure:
Phase 1: Parallel data processing → All threads contribute
Phase 2: Global parameter computation → One thread computes
Phase 3: Parameter distribution → All threads receive
Phase 4: Coordinated parallel output → All threads process
This exact pattern appears in:
- Batch normalization (deep learning)
- Histogram equalization (image processing)
- Iterative numerical methods (scientific computing)
- Lighting calculations (computer graphics)
Mean normalization is the perfect educational example of this fundamental pattern.
Block operations trilogy completed:
1. block.sum() - All to One (Reduction)
- Input: All threads provide values
- Output: Thread 0 receives aggregated result
- Use case: Computing totals, finding maximums, etc.
2. block.prefix_sum() - All to Each (Scan)
- Input: All threads provide values
- Output: Each thread receives cumulative position
- Use case: Computing write positions, parallel partitioning
3. block.broadcast() - One to All (Broadcast)
- Input: One thread provides value (typically thread 0)
- Output: All threads receive the same value
- Use case: Sharing computed parameters, configuration values
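For intuition only, the three communication patterns can be mimicked sequentially on a plain Python list. These are stand-ins, not GPU code: the real block operations run in parallel across threads, and the scan shown here is inclusive, whereas block.prefix_sum may be configured differently.
from itertools import accumulate

values = [3.0, 1.0, 4.0, 1.0]             # one value per "thread"

reduction = sum(values)                   # all → one:  9.0 (like block.sum)
scan = list(accumulate(values))           # all → each: [3.0, 4.0, 8.0, 9.0] (like block.prefix_sum)
broadcast = [values[0]] * len(values)     # one → all:  [3.0, 3.0, 3.0, 3.0] (like block.broadcast, src_thread=0)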
Complete block operations progression:
- Manual coordination (Puzzle 12): Understand parallel fundamentals
- Warp primitives (Puzzle 24): Learn hardware-accelerated patterns
- Block reduction (block.sum()): Master all→one communication
- Block scan (block.prefix_sum()): Master all→each communication
- Block broadcast (block.broadcast()): Master one→all communication
The complete picture: Block operations provide the fundamental communication building blocks for sophisticated parallel algorithms, replacing complex manual coordination with clean, composable primitives.
Performance insights and technical analysis
Quantitative performance comparison:
block.broadcast() vs traditional shared memory approach (for demonstration):
Traditional Manual Approach:
Phase 1: Manual reduction
• Shared memory allocation: ~5 cycles
• Barrier synchronization: ~10 cycles
• Tree reduction loop: ~15 cycles
• Error-prone manual indexing
Phase 2: Mean computation: ~2 cycles
Phase 3: Shared memory broadcast
• Manual write to shared: ~2 cycles
• Barrier synchronization: ~10 cycles
• All threads read: ~3 cycles
Total: ~47 cycles
+ synchronization overhead
+ potential race conditions
+ manual error debugging
Block Operations Approach:
Phase 1: block.sum()
• Hardware-optimized: ~3 cycles
• Automatic barriers: 0 explicit cost
• Optimized reduction: ~8 cycles
• Verified correct implementation
Phase 2: Mean computation: ~2 cycles
Phase 3: block.broadcast()
• Hardware-optimized: ~4 cycles
• Automatic distribution: 0 explicit cost
• Verified correct implementation
Total: ~17 cycles
+ automatic optimization
+ guaranteed correctness
+ composable design
Memory hierarchy advantages:
Cache efficiency:
- block.sum(): Optimized memory access patterns reduce cache misses
- block.broadcast(): Efficient distribution minimizes memory bandwidth usage
- Combined workflow: Single kernel eliminates the intermediate global memory round-trip entirely
Memory bandwidth utilization:
Traditional multi-kernel approach:
Kernel 1: Input → Reduction → Global memory write
Kernel 2: Global memory read → Broadcast → Output
Total global memory transfers: 3× array size
Block operations single-kernel:
Input → block.sum() → block.broadcast() → Output
Total global memory transfers: 2× array size (33% improvement)
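A back-of-envelope version of that comparison for this puzzle's configuration (plain Python; the 3× and 2× factors are the assumptions stated above, not measured numbers):
SIZE = 128
BYTES_PER_FLOAT32 = 4

multi_kernel = 3 * SIZE * BYTES_PER_FLOAT32    # ~1536 bytes of global traffic
single_kernel = 2 * SIZE * BYTES_PER_FLOAT32   # ~1024 bytes of global traffic
savings = 1.0 - single_kernel / multi_kernel   # ~0.33, the 33% improvement above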
When to use each block operation:
block.sum() optimal scenarios:
- Data aggregation: Computing totals, averages, maximum/minimum values
- Reduction patterns: Any all-to-one communication requirement
- Statistical computation: Mean, variance, correlation calculations
block.prefix_sum() optimal scenarios:
- Parallel partitioning: Stream compaction, histogram binning
- Write position calculation: Parallel output generation
- Parallel algorithms: Sorting, searching, data reorganization
block.broadcast() optimal scenarios:
- Parameter distribution: Sharing computed values to all threads
- Configuration propagation: Mode flags, scaling factors, thresholds
- Coordinated processing: When all threads need the same computed parameter
Composition benefits:
Individual operations: Good performance, limited scope
Combined operations: Excellent performance, comprehensive algorithms
Example combinations seen in real applications:
• block.sum() + block.broadcast(): Normalization algorithms
• block.prefix_sum() + block.sum(): Advanced partitioning
• All three together: Complex parallel algorithms
• With traditional patterns: Hybrid optimization strategies
Next Steps
Once you’ve learned about the complete block operations trilogy, you’re ready for:
- Multi-block algorithms: Coordinating operations across multiple thread blocks
- Advanced parallel patterns: Combining block operations for complex algorithms
- Memory hierarchy optimization: Efficient data movement patterns
- Algorithm design: Structuring parallel algorithms using block operation building blocks
- Performance optimization: Choosing optimal block sizes and operation combinations
💡 Key Takeaway: The block operations trilogy (sum, prefix_sum, broadcast) provides complete communication primitives for block-level parallel programming. By composing these operations, you can implement sophisticated parallel algorithms with clean, maintainable code that leverages GPU hardware optimizations. Mean normalization demonstrates how these operations work together to solve real computational problems efficiently.