block.sum() Essentials - Block-Level Dot Product

Implement the dot product from Puzzle 12 using block-level sum operations, replacing complex shared memory patterns with a simple function call. Each thread in the block processes one element and uses block.sum() to combine the results automatically, demonstrating how block-level programming simplifies synchronization across an entire thread block.

Key insight: block.sum() leverages block-wide execution to replace shared memory + barriers + tree reduction with an expertly optimized implementation that combines warp-level patterns to cover every thread in the block. See the technical investigation below for the LLVM/PTX analysis.

Key concepts

In this puzzle, you’ll learn:

  • Block-level reductions with block.sum()
  • Block-wide synchronization and thread coordination
  • Cross-warp communication within a single block
  • Performance transformation from complex to simple patterns
  • Thread 0 result management and conditional writes

The mathematical operation is a dot product (inner product): \[\Large \text{output}[0] = \sum_{i=0}^{N-1} a[i] \times b[i]\]
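
For a tiny concrete example with \(N = 3\): \[a = (1, 2, 3),\ b = (4, 5, 6) \;\Rightarrow\; \text{output}[0] = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 32\]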

But the implementation teaches fundamental patterns for all block-level GPU programming in Mojo.

Configuration

  • Vector size: SIZE = 128 elements
  • Data type: DType.float32
  • Block configuration: (128, 1) threads per block (TPB = 128)
  • Grid configuration: (1, 1) blocks per grid
  • Layout: Layout.row_major(SIZE) (1D row-major)
  • Warps per block: 128 / WARP_SIZE (4 warps on NVIDIA, 2 or 4 warps on AMD)

The traditional complexity (from Puzzle 12)

Recall the complex approach from Puzzle 12 that required shared memory, barriers, and tree reduction:

fn traditional_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Traditional dot product using shared memory + barriers + tree reduction.
    Educational but complex - shows the manual coordination needed."""

    shared = tb[dtype]().row_major[tpb]().shared().alloc()
    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Each thread computes partial product
    if global_i < size:
        a_val = rebind[Scalar[dtype]](a[global_i])
        b_val = rebind[Scalar[dtype]](b[global_i])
        shared[local_i] = a_val * b_val

    barrier()

    # Tree reduction in shared memory - complex but educational
    var stride = tpb // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        barrier()
        stride //= 2

    # Only thread 0 writes final result
    if local_i == 0:
        output[0] = shared[0]


What makes this complex:

  • Shared memory allocation: Manual memory management within blocks
  • Explicit barriers: barrier() calls to synchronize all threads in block
  • Tree reduction: Complex loop with stride-based indexing (64→32→16→8→4→2→1)
  • Cross-warp coordination: Must synchronize across multiple warps
  • Conditional writes: Only thread 0 writes the final result

This works across the entire block (128 threads across 2 or 4 warps depending on GPU), but it’s verbose, error-prone, and requires deep understanding of block-level GPU synchronization.
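
To see exactly where the barriers come from, here is a trace of the reduction loop for TPB = 128 (one barrier() per iteration, on top of the initial barrier after the partial products):

stride = 64: threads 0-63 do shared[i] += shared[i + 64]  → 64 partial sums remain
stride = 32: threads 0-31 do shared[i] += shared[i + 32]  → 32 remain
stride = 16: threads 0-15 do shared[i] += shared[i + 16]  → 16 remain
stride = 8:  threads 0-7  do shared[i] += shared[i + 8]   → 8 remain
stride = 4:  threads 0-3  do shared[i] += shared[i + 4]   → 4 remain
stride = 2:  threads 0-1  do shared[i] += shared[i + 2]   → 2 remain
stride = 1:  thread 0     does shared[0] += shared[1]     → final sum in shared[0]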

The warp-level improvement (from Puzzle 24)

Before jumping to block-level operations, recall how Puzzle 24 simplified reduction within a single warp using warp.sum():

fn simple_warp_dot_product[
    in_layout: Layout, out_layout: Layout, size: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
):
    global_i = block_dim.x * block_idx.x + thread_idx.x

    # Each thread computes one partial product; indexing a LayoutTensor returns
    # a SIMD value in Mojo, so reduce_add() collapses the product to a scalar
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()

    # warp_sum() replaces all the shared memory + barriers + tree reduction
    total = warp_sum(partial_product)

    # Only lane 0 writes the result (all lanes have the same total)
    if lane_id() == 0:
        output[0] = total


What warp.sum() achieved:

  • Single warp scope: Works within 32 threads (NVIDIA) or 32/64 threads (AMD)
  • Hardware shuffle: Uses shfl.sync.bfly.b32 instructions for efficiency
  • Zero shared memory: No explicit memory management needed
  • One line reduction: total = warp_sum[warp_size=WARP_SIZE](val=partial_product)

But the limitation: warp.sum() only works within a single warp. For problems requiring multiple warps (like our 128-thread block), you’d still need the complex shared memory + barriers approach to coordinate between warps.
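
To make that limitation concrete, here is a hypothetical hybrid kernel (not part of the puzzle) that combines warp_sum() with the manual cross-warp machinery it cannot replace. It is only a sketch: it assumes the same imports and aliases as the kernels above, hard-codes a 32-thread warp width, and the warp_partials buffer name is illustrative.

fn hybrid_warp_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Hypothetical sketch: warp_sum() per warp + manual cross-warp reduction.
    Assumes 32-thread warps; shows the coordination that block.sum() hides."""

    # One shared slot per warp for the per-warp totals (4 slots when tpb = 128)
    alias num_warps = tpb // 32  # assumption: 32-thread warps
    warp_partials = tb[dtype]().row_major[num_warps]().shared().alloc()

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Step 1: each warp reduces its 32 partial products with warp_sum()
    var partial_product: Scalar[dtype] = 0
    if global_i < size:
        partial_product = (a[global_i] * b[global_i]).reduce_add()
    warp_total = warp_sum(partial_product)

    # Step 2: lane 0 of each warp publishes its warp total, then all warps sync
    if lane_id() == 0:
        warp_partials[local_i // 32] = warp_total
    barrier()

    # Step 3: thread 0 combines the per-warp totals and writes the result
    if local_i == 0:
        var total: Scalar[dtype] = 0
        for w in range(num_warps):
            total += rebind[Scalar[dtype]](warp_partials[w])
        output[0] = total


All of this bookkeeping (the shared buffer, the barrier, the final serial loop over warp totals) is exactly what block.sum() folds into a single call below.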

Test the traditional approach:

uv run p27 --traditional-dot-product
pixi run p27 --traditional-dot-product

Code to complete

block.sum() approach

Transform the complex traditional approach into a simple block kernel using block.sum():

alias SIZE = 128
alias TPB = 128
alias NUM_BINS = 8
alias in_layout = Layout.row_major(SIZE)
alias out_layout = Layout.row_major(1)
alias dtype = DType.float32


fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # FILL IN (roughly 6 lines)


View full file: problems/p27/p27.mojo

uv run p27 --block-sum-dot-product
pixi run p27 --block-sum-dot-product

Expected output when solved:

SIZE: 128
TPB: 128
Expected result: 1381760.0
Block.sum result: 1381760.0
Block.sum() gives identical results!
Compare the code: 15+ lines of barriers → 1 line of block.sum()!
Just like warp.sum() but for the entire block
Tips

1. Think about the three-step pattern

Every block reduction follows the same conceptual pattern:

  1. Each thread computes its local contribution
  2. All threads participate in a block-wide reduction
  3. One designated thread handles the final result

2. Remember the dot product math

Each thread should handle one element pair from vectors a and b. What operation combines these into a “partial result” that can be summed across threads?

3. LayoutTensor indexing patterns

When accessing LayoutTensor elements, remember that indexing returns SIMD values. You’ll need to extract the scalar value for arithmetic operations.

4. block.sum() API concepts

Study the function signature - it needs:

  • A template parameter specifying the block size
  • A template parameter for result distribution (broadcast)
  • A runtime parameter containing the value to reduce

5. Thread coordination principles

  • Which threads have valid data to process? (Hint: bounds checking)
  • Which thread should write the final result? (Hint: consistent choice)
  • How do you identify that specific thread? (Hint: thread indexing)

Solution

fn block_sum_dot_product[
    in_layout: Layout, out_layout: Layout, tpb: Int
](
    output: LayoutTensor[mut=True, dtype, out_layout],
    a: LayoutTensor[mut=False, dtype, in_layout],
    b: LayoutTensor[mut=False, dtype, in_layout],
    size: Int,
):
    """Dot product using block.sum() - convenience function like warp.sum()!
    Replaces manual shared memory + barriers + tree reduction with one line."""

    global_i = block_dim.x * block_idx.x + thread_idx.x
    local_i = thread_idx.x

    # Each thread computes partial product
    var partial_product: Scalar[dtype] = 0.0
    if global_i < size:
        # LayoutTensor indexing returns a SIMD value; `[0]` extracts the scalar element
        partial_product = a[global_i][0] * b[global_i][0]

    # The magic: block.sum() replaces 15+ lines of manual reduction!
    # Just like warp.sum() but for the entire block
    total = block.sum[block_size=tpb, broadcast=False](
        val=SIMD[DType.float32, 1](partial_product)
    )

    # Only thread 0 writes the result
    if local_i == 0:
        output[0] = total[0]


The block.sum() kernel demonstrates the fundamental transformation from complex, manual block synchronization to a single expertly optimized primitive:

What disappeared from the traditional approach:

  • 15+ lines → 8 lines: Dramatic code reduction
  • Shared memory allocation: Zero memory management required
  • 7+ barrier() calls: Zero explicit synchronization needed
  • Complex tree reduction: Single function call
  • Stride-based indexing: Eliminated entirely
  • Cross-warp coordination: Handled automatically by the optimized implementation

Block-wide execution model:

Block threads (128 threads across 4 warps, assuming 32-thread warps):
Warp 0 (threads 0-31):
  Thread 0: partial_product = a[0] * b[0] = 0.0
  Thread 1: partial_product = a[1] * b[1] = 2.0
  ...
  Thread 31: partial_product = a[31] * b[31] = 1922.0

Warp 1 (threads 32-63):
  Thread 32: partial_product = a[32] * b[32] = 2048.0
  ...

Warp 2 (threads 64-95):
  Thread 64: partial_product = a[64] * b[64] = 8192.0
  ...

Warp 3 (threads 96-127):
  Thread 96: partial_product = a[96] * b[96] = 18432.0
  ...
  Thread 127: partial_product = a[127] * b[127] = 32258.0

block.sum() hardware operation:
All threads → 0.0 + 2.0 + 1922.0 + 2048.0 + ... + 32258.0 = 1381760.0
Thread 0 receives → total = 1381760.0 (when broadcast=False)
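
As a sanity check on the trace above, the per-thread products follow \(a[i] \cdot b[i] = 2i^2\) (the test data implied by the sample values), so the block-wide total is \[\sum_{i=0}^{127} 2i^{2} = 2 \cdot \frac{127 \cdot 128 \cdot 255}{6} = 1{,}381{,}760\] which matches the expected result of 1381760.0.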

Why this works without barriers:

  1. Warp-level lockstep: Threads within each warp execute in lockstep, so the intra-warp shuffle phase needs no explicit barriers
  2. Built-in synchronization: block.sum() implementation handles synchronization internally
  3. Cross-warp communication: Optimized communication between warps in the block
  4. Coordinated result delivery: With broadcast=False, only thread 0 receives the final result (see the sketch below)
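
On point 4: the broadcast parameter controls who receives the reduced value. As a variant sketch (behavior inferred from the parameter name and the solution above, so verify against block.mojo), passing broadcast=True would hand the block-wide total to every thread instead of only thread 0:

# Hypothetical variant of the solution's reduction line (inferred behavior):
# with broadcast=True, all 128 threads are expected to receive the total,
# so any thread - not just thread 0 - could use it afterwards.
total = block.sum[block_size=tpb, broadcast=True](
    val=SIMD[DType.float32, 1](partial_product)
)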

Comparison to warp.sum() (Puzzle 24):

  • Warp scope: warp.sum() works within 32/64 threads (single warp)
  • Block scope: block.sum() works across entire block (multiple warps)
  • Same simplicity: Both replace complex manual reductions with one-line calls
  • Automatic coordination: block.sum() handles the cross-warp barriers that warp.sum() cannot

Technical investigation: What does block.sum() actually compile to?

To understand what block.sum() actually generates, we compiled the puzzle with debug information:

pixi run mojo build --emit llvm --debug-level=line-tables solutions/p27/p27.mojo -o solutions/p27/p27.ll

This generated the LLVM IR file solutions/p27/p27.ll. On a compatible NVIDIA GPU, for example, the p27.ll file contains embedded PTX assembly showing the actual GPU instructions:

Finding 1: Not a single instruction

block.sum() compiles to 20+ PTX instructions, organized as a two-phase reduction:

Phase 1: Warp-level reduction (butterfly shuffles)

shfl.sync.bfly.b32 %r23, %r46, 16, 31, -1;  // shuffle with offset 16
add.f32            %r24, %r46, %r23;         // add shuffled values
shfl.sync.bfly.b32 %r25, %r24, 8, 31, -1;   // shuffle with offset 8
add.f32            %r26, %r24, %r25;         // add shuffled values
// ... continues for offsets 4, 2, 1
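
Each shfl.sync.bfly.b32 is a butterfly (XOR-pattern) exchange: lane i trades values with lane i XOR offset. With offsets 16, 8, 4, 2, 1, five shuffle + add pairs reduce all 32 lanes of a warp in log2(32) = 5 steps, with no shared memory traffic inside the warp.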

Phase 2: Cross-warp coordination

shr.u32            %r32, %r1, 5;             // compute warp ID
mov.b32            %r34, _global_alloc_$__gpu_shared_mem; // shared memory
bar.sync           0;                        // barrier synchronization
// ... another butterfly shuffle sequence for cross-warp reduction

Finding 2: Hardware-optimized implementation

  • Butterfly shuffles: More efficient than tree reduction
  • Automatic barrier placement: Handles cross-warp synchronization
  • Optimized memory access: Uses shared memory strategically
  • Architecture-aware: Same API works on NVIDIA (32-thread warps) and AMD (32 or 64-thread warps)

Finding 3: Algorithm complexity analysis

Our approach to investigation:

  1. Located PTX assembly in binary ELF sections (.nv_debug_ptx_txt)
  2. Identified algorithmic differences rather than counting individual instructions

Key algorithmic differences observed:

  • Traditional: Tree reduction with shared memory + multiple bar.sync calls
  • block.sum(): Butterfly shuffle pattern + optimized cross-warp coordination

The performance advantage comes from an expertly optimized algorithm choice (butterfly shuffles rather than a plain tree reduction), not from a lower instruction count or magical hardware. Take a look at block.mojo in the Mojo gpu module for more details about the implementation.
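
To reproduce the investigation, searching the emitted file for the shuffle mnemonic is a quick way to locate the reduction (assuming a standard grep is available):

grep -n "shfl.sync.bfly" solutions/p27/p27.ll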

Performance insights

block.sum() vs Traditional:

  • Code simplicity: 15+ lines → 1 line for the reduction
  • Memory usage: No shared memory allocation required
  • Synchronization: No explicit barriers needed
  • Scalability: Works with any block size (up to hardware limits)

block.sum() vs warp.sum():

  • Scope: Block-wide (128 threads) vs warp-wide (32 threads)
  • Use case: When you need reduction across entire block
  • Convenience: Same programming model, different scale

When to use block.sum():

  • Single block problems: When all data fits in one block
  • Block-level algorithms: Shared memory computations needing reduction
  • Convenience over scalability: Simpler than multi-block approaches

Relationship to previous puzzles

From Puzzle 12 (Traditional):

Complex: shared memory + barriers + tree reduction
↓
Simple: block.sum() hardware primitive

From Puzzle 24 (warp.sum()):

Warp-level: warp.sum() across 32 threads (single warp)
↓
Block-level: block.sum() across 128 threads (multiple warps)

Three-stage progression:

  1. Manual reduction (Puzzle 12): Complex shared memory + barriers + tree reduction
  2. Warp primitives (Puzzle 24): warp.sum() - simple but limited to single warp
  3. Block primitives (Puzzle 27): block.sum() - extends warp simplicity across multiple warps

The key insight: block.sum() gives you the simplicity of warp.sum() but scales across an entire block by automatically handling the complex cross-warp coordination that you’d otherwise need to implement manually.

Next Steps

Once you've learned about block.sum() operations, you're ready to move on.

đź’ˇ Key Takeaway: Block operations extend warp programming concepts to entire thread blocks, providing optimized primitives that replace complex synchronization patterns while working across multiple warps simultaneously. Just like warp.sum() simplified warp-level reductions, block.sum() simplifies block-level reductions without sacrificing performance.