block.sum() Essentials - Block-Level Dot Product
Implement the dot product we saw in puzzle 12 using block-level sum operations to replace complex shared memory patterns with simple function calls. Each thread in the block will process one element and use block.sum()
to combine results automatically, demonstrating how block programming transforms GPU synchronization across entire thread blocks.
Key insight: The block.sum() operation leverages block-wide execution to replace shared memory + barriers + tree reduction with expertly optimized implementations that work across all threads using warp patterns in a block. See technical investigation for LLVM analysis.
Key concepts
In this puzzle, you’ll learn:
- Block-level reductions with block.sum()
- Block-wide synchronization and thread coordination
- Cross-warp communication within a single block
- Performance transformation from complex to simple patterns
- Thread 0 result management and conditional writes
The mathematical operation is a dot product (inner product): \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]
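For example, with N = 4, a = [1, 2, 3, 4], and b = [5, 6, 7, 8] (illustrative values, not the puzzle's test data), each thread contributes one product and the reduction yields:
\[\Large \text{output}[0] = 1 \times 5 + 2 \times 6 + 3 \times 7 + 4 \times 8 = 5 + 12 + 21 + 32 = 70\]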
But the implementation teaches fundamental patterns for all block-level GPU programming in Mojo.
Configuration
- Vector size: SIZE = 128 elements
- Data type: DType.float32
- Block configuration: (128, 1) threads per block (TPB = 128)
- Grid configuration: (1, 1) blocks per grid
- Layout: Layout.row_major(SIZE) (1D row-major)
- Warps per block: 128 / WARP_SIZE (4 warps on NVIDIA, 2 or 4 warps on AMD)
The traditional complexity (from Puzzle 12)
Recall the complex approach from Puzzle 12 that required shared memory, barriers, and tree reduction:
fn traditional_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Traditional dot product using shared memory + barriers + tree reduction.
    Educational but complex - shows the manual coordination needed."""
    shared = tb[dtype]().row_major[tpb]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Each thread computes partial product
    if global_i < size:
        a_val = rebind[Scalar[dtype]](a[global_i])
        b_val = rebind[Scalar[dtype]](b[global_i])
        shared[local_i] = a_val * b_val

    barrier()

    # Tree reduction in shared memory - complex but educational
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    # Only thread 0 writes final result
    if local_i == 0:
        output[0] = shared[0]
What makes this complex:
- Shared memory allocation: Manual memory management within blocks
- Explicit barriers: barrier() calls to synchronize all threads in block
- Tree reduction: Complex loop with stride-based indexing (64→32→16→8→4→2→1)
- Cross-warp coordination: Must synchronize across multiple warps
- Conditional writes: Only thread 0 writes the final result
This works across the entire block (128 threads across 2 or 4 warps depending on GPU), but it’s verbose, error-prone, and requires deep understanding of block-level GPU synchronization.
The warp-level improvement (from Puzzle 24)
Before jumping to block-level operations, recall how Puzzle 24 simplified reduction within a single warp using warp.sum():
fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # Each thread computes one partial product using vectorized approach as values in Mojo are SIMD based
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        output[0] = total
What warp.sum() achieved:
- Single warp scope: Works within 32 threads (NVIDIA) or 32/64 threads (AMD)
- Hardware shuffle: Uses shfl.sync.bfly.b32 instructions for efficiency
- Zero shared memory: No explicit memory management needed
- One-line reduction: total = warp_sum[warp_size=WARP_SIZE](val=partial_product)
But the limitation: warp.sum() only works within a single warp. For problems requiring multiple warps (like our 128-thread block), you’d still need the complex shared memory + barriers approach to coordinate between warps.
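To make that limitation concrete, here is a hedged sketch (not part of the puzzle files) of what a hand-rolled multi-warp reduction could look like: warp_sum() handles each warp, but shared memory and a barrier() are still required to combine the per-warp totals. It reuses the names from the kernels above (dtype, WARP_SIZE, tb, warp_sum, lane_id, barrier) and assumes the same imports as the traditional kernel; the kernel name hybrid_warp_dot_product is hypothetical:
fn hybrid_warp_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Hypothetical hybrid: warp_sum() inside each warp, manual coordination across warps."""
    # One shared slot per thread; only lane-0 slots (multiples of WARP_SIZE) get used
    warp_totals = tb[dtype]().row_major[tpb]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Step 1: each thread computes its partial product
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        a_val = rebind[Scalar[dtype]](a[global_i])
        b_val = rebind[Scalar[dtype]](b[global_i])
        partial_product = a_val * b_val

    # Step 2: warp_sum() reduces within each warp - no barrier needed for this part
    warp_total = warp_sum(partial_product)

    # Step 3: cross-warp coordination still needs shared memory + an explicit barrier
    if lane_id() == 0:
        warp_totals[local_i] = warp_total
    barrier()

    # Step 4: thread 0 combines the per-warp totals and writes the result
    if local_i == 0:
        var block_total: Scalar[dtype] = 0
        for w in range(tpb // WARP_SIZE):
            block_total += rebind[Scalar[dtype]](warp_totals[w * WARP_SIZE])
        output[0] = block_total
This is exactly the boilerplate that block.sum() hides, as the rest of this puzzle shows.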
Test the traditional approach:
uv run p27 --traditional-dot-product
pixi run p27 --traditional-dot-product
Code to complete
block.sum() approach
Transform the complex traditional approach into a simple block kernel using block.sum():
alias SIZE = 128
alias TPB = 128
alias NUM_BINS = 8
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(1)
alias dtype = DType.float32


fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x
    # FILL IN (roughly 6 lines)
View full file: problems/p27/p27.mojo
uv run p27 --block-sum-dot-product
pixi run p27 --block-sum-dot-product
Expected output when solved:
SIZE: 128
TPB: 128
Expected result: 1381760.0
Block.sum result: 1381760.0
Block.sum() gives identical results!
Compare the code: 15+ lines of barriers → 1 line of block.sum()!
Just like warp.sum() but for the entire block
Tips
1. Think about the three-step pattern
Every block reduction follows the same conceptual pattern:
- Each thread computes its local contribution
- All threads participate in a block-wide reduction
- One designated thread handles the final result
2. Remember the dot product math
Each thread should handle one element pair from vectors a and b. What operation combines these into a “partial result” that can be summed across threads?
3. LayoutTensor indexing patterns
When accessing LayoutTensor elements, remember that indexing returns SIMD values. You’ll need to extract the scalar value for arithmetic operations (see the short snippet after these tips).
4. block.sum() API concepts
Study the function signature - it needs:
- A template parameter specifying the block size
- A template parameter for result distribution (broadcast)
- A runtime parameter containing the value to reduce
5. Thread coordination principles
- Which threads have valid data to process? (Hint: bounds checking)
- Which thread should write the final result? (Hint: consistent choice)
- How do you identify that specific thread? (Hint: thread indexing)
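As a concrete illustration of tip 3, here are the two scalar-extraction patterns used elsewhere on this page, shown out of context (global_i, a, and dtype come from the kernel):
# Two ways to get a scalar out of a LayoutTensor element:
a_val = rebind[Scalar[dtype]](a[global_i])  # explicit rebind, as in the traditional kernel
a_val = a[global_i][0]  # index the returned SIMD value, as in the solution below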
Solution
fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Each thread computes partial product
    var partial_product: Scalar[dtype] = 0.0
    if global_i < size:
        # LayoutTensor indexing `[0]` returns the underlying SIMD value
        partial_product = a[global_i][0] * b[global_i][0]

    # The magic: block.sum() replaces 15+ lines of manual reduction!
    # Just like warp.sum() but for the entire block
    total = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](partial_product)
    )

    # Only thread 0 writes the result
    if local_i == 0:
        output[0] = total[0]
The block.sum() kernel demonstrates the fundamental transformation from complex block synchronization to expertly optimized implementations:
What disappeared from the traditional approach:
- 15+ lines → 8 lines: Dramatic code reduction
- Shared memory allocation: Zero memory management required
- 7+ barrier() calls: Zero explicit synchronization needed
- Complex tree reduction: Single function call
- Stride-based indexing: Eliminated entirely
- Cross-warp coordination: Handled automatically by optimized implementation
Block-wide execution model:
Block threads (128 threads across 4 warps):
  Warp 0 (threads 0-31):
    Thread 0:   partial_product = a[0] * b[0] = 0.0
    Thread 1:   partial_product = a[1] * b[1] = 2.0
    ...
    Thread 31:  partial_product = a[31] * b[31] = 1922.0
  Warp 1 (threads 32-63):
    Thread 32:  partial_product = a[32] * b[32] = 2048.0
    ...
  Warp 2 (threads 64-95):
    Thread 64:  partial_product = a[64] * b[64] = 8192.0
    ...
  Warp 3 (threads 96-127):
    Thread 96:  partial_product = a[96] * b[96] = 18432.0
    ...
    Thread 127: partial_product = a[127] * b[127] = 32258.0

block.sum() hardware operation:
  All threads → 0.0 + 2.0 + 1922.0 + 2048.0 + ... + 32258.0 = 1381760.0
  Thread 0 receives → total = 1381760.0 (when broadcast=False)
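The per-thread products listed above (0.0, 2.0, ..., 32258.0) are consistent with test data of the form a[i] = i and b[i] = 2i (an inference from the listed values, not something stated by the puzzle); under that assumption the expected total checks out:
\[\Large \sum_{i=0}^{127} i \times 2i = 2 \sum_{i=0}^{127} i^2 = 2 \times \frac{127 \times 128 \times 255}{6} = 1381760\]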
Why this works without barriers:
- Block-wide execution: All threads execute each instruction in lockstep within warps
- Built-in synchronization: block.sum() implementation handles synchronization internally
- Cross-warp communication: Optimized communication between warps in the block
- Coordinated result delivery: Only thread 0 receives the final result
Comparison to warp.sum() (Puzzle 24):
- Warp scope: warp.sum() works within 32/64 threads (single warp)
- Block scope: block.sum() works across entire block (multiple warps)
- Same simplicity: Both replace complex manual reductions with one-line calls
- Automatic coordination: block.sum() handles the cross-warp barriers that warp.sum() cannot
Technical investigation: What does block.sum() actually compile to?
To understand what block.sum() actually generates, we compiled the puzzle with debug information:
pixi run mojo build --emit llvm --debug-level=line-tables solutions/p27/p27.mojo -o solutions/p27/p27.ll
This generates the LLVM file solutions/p27/p27.ll. For example, on a compatible NVIDIA GPU, the p27.ll file has embedded PTX assembly showing the actual GPU instructions:
Finding 1: Not a single instruction
block.sum() compiles to 20+ PTX instructions, organized in a two-phase reduction:
Phase 1: Warp-level reduction (butterfly shuffles)
shfl.sync.bfly.b32 %r23, %r46, 16, 31, -1; // shuffle with offset 16
add.f32 %r24, %r46, %r23; // add shuffled values
shfl.sync.bfly.b32 %r25, %r24, 8, 31, -1; // shuffle with offset 8
add.f32 %r26, %r24, %r25; // add shuffled values
// ... continues for offsets 4, 2, 1
Phase 2: Cross-warp coordination
shr.u32 %r32, %r1, 5; // compute warp ID
mov.b32 %r34, _global_alloc_$__gpu_shared_mem; // shared memory
bar.sync 0; // barrier synchronization
// ... another butterfly shuffle sequence for cross-warp reduction
Finding 2: Hardware-optimized implementation
- Butterfly shuffles: More efficient than tree reduction (see the sketch after this list)
- Automatic barrier placement: Handles cross-warp synchronization
- Optimized memory access: Uses shared memory strategically
- Architecture-aware: Same API works on NVIDIA (32-thread warps) and AMD (32 or 64-thread warps)
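To see why the butterfly (XOR-offset) exchange finishes in log2(n) steps and leaves every lane holding the full sum, here is a small host-side Mojo sketch that simulates the data movement on 8 "lane" values. It is a simulation of the pattern only, not the GPU shuffle instructions shown above:
from collections import List


def main():
    # Eight simulated "lanes", each holding one partial product
    var vals = List[Float32](1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0)

    # Butterfly reduction: at each step, lane i adds the value held by lane (i XOR offset)
    var offset = 4
    while offset > 0:
        var new_vals = List[Float32]()
        for i in range(len(vals)):
            new_vals.append(vals[i] + vals[i ^ offset])
        vals = new_vals^
        offset //= 2

    # After log2(8) = 3 steps every lane holds the full sum (36.0),
    # unlike tree reduction, where only slot 0 ends up with it
    print(vals[0], vals[7])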
Finding 3: Algorithm complexity analysis
Our approach to investigation:
- Located PTX assembly in binary ELF sections (.nv_debug_ptx_txt)
- Identified algorithmic differences rather than counting individual instructions
Key algorithmic differences observed:
- Traditional: Tree reduction with shared memory + multiple bar.sync calls
- block.sum(): Butterfly shuffle pattern + optimized cross-warp coordination
The performance advantage comes from the expertly optimized algorithm choice (butterfly > tree), not from instruction count or magical hardware. Take a look at block.mojo in the Mojo gpu module for more details about the implementation.
Performance insights
block.sum() vs Traditional:
- Code simplicity: 15+ lines → 1 line for the reduction
- Memory usage: No shared memory allocation required
- Synchronization: No explicit barriers needed
- Scalability: Works with any block size (up to hardware limits)
block.sum() vs warp.sum():
- Scope: Block-wide (128 threads) vs warp-wide (32 threads)
- Use case: When you need reduction across entire block
- Convenience: Same programming model, different scale
When to use block.sum():
- Single block problems: When all data fits in one block
- Block-level algorithms: Shared memory computations needing reduction
- Convenience over scalability: Simpler than multi-block approaches
Relationship to previous puzzles
From Puzzle 12 (Traditional):
Complex: shared memory + barriers + tree reduction
↓
Simple: block.sum() hardware primitive
From Puzzle 24 (warp.sum()):
Warp-level: warp.sum() across 32 threads (single warp)
↓
Block-level: block.sum() across 128 threads (multiple warps)
Three-stage progression:
- Manual reduction (Puzzle 12): Complex shared memory + barriers + tree reduction
- Warp primitives (Puzzle 24): warp.sum() - simple but limited to single warp
- Block primitives (Puzzle 27): block.sum() - extends warp simplicity across multiple warps
The key insight: block.sum()
gives you the simplicity of warp.sum()
but scales across an entire block by automatically handling the complex cross-warp coordination that you’d otherwise need to implement manually.
Next Steps
Once you’ve learned about block.sum()
operations, you’re ready for:
- Block Prefix Sum Operations: Cumulative operations across block threads
- Block Broadcast Operations: Sharing values across all threads in a block
💡 Key Takeaway: Block operations extend warp programming concepts to entire thread blocks, providing optimized primitives that replace complex synchronization patterns while working across multiple warps simultaneously. Just like warp.sum()
simplified warp-level reductions, block.sum()
simplifies block-level reductions without sacrificing performance.