Puzzle 34: GPU Cluster Programming (SM90+)

Introduction

Hardware requirement: ⚠️ NVIDIA SM90+ Only

This puzzle requires an NVIDIA Hopper-architecture GPU (H100, H200) or newer with SM90+ compute capability. The cluster programming APIs are hardware-accelerated and raise errors on unsupported hardware. If you’re unsure about your GPU’s architecture, run pixi run gpu-specs and confirm that it reports at least Compute Cap: 9.0 (see GPU profiling basics for hardware identification).

Building on your journey from warp-level programming (Puzzles 24-26) through block-level programming (Puzzle 27), you’ll now learn cluster-level programming - coordinating multiple thread blocks to solve problems that exceed single-block capabilities.

What are thread block clusters?

Thread Block Clusters are an SM90+ feature that enables multiple thread blocks to cooperate on a single computational task, with hardware-accelerated synchronization and communication primitives.

Key capabilities:

  • Co-scheduling guarantee: all blocks in a cluster run concurrently on the same GPU processing cluster
  • Hardware-accelerated barriers: cluster_sync(), cluster_arrive(), and cluster_wait() synchronize across blocks
  • Block identification: block_rank_in_cluster() gives each block its position within the cluster
  • Distributed shared memory: blocks in a cluster can access one another’s shared memory

The cluster programming model

Traditional GPU programming hierarchy:

Grid (Multiple Blocks)
└── Block (Multiple Warps) - barrier() synchronization
    └── Warp (32 Threads) - SIMT lockstep execution
        β”œβ”€β”€ Lane 0  ─┐
        β”œβ”€β”€ Lane 1   β”‚ All execute same instruction
        β”œβ”€β”€ Lane 2   β”‚ at same time (SIMT)
        β”‚   ...      β”‚ warp.sum(), warp.broadcast()
        └── Lane 31 β”€β”˜
            └── Thread (SIMD operations within each thread)

New cluster programming hierarchy:

Grid (Multiple Clusters)
└── πŸ†• Cluster (Multiple Blocks) - cluster_sync(), cluster_arrive()
    └── Block (Multiple Warps) - barrier() synchronization
        └── Warp (32 Threads) - SIMT lockstep execution
            β”œβ”€β”€ Lane 0  ─┐
            β”œβ”€β”€ Lane 1   β”‚ All execute same instruction
            β”œβ”€β”€ Lane 2   β”‚ at same time (SIMT)
            β”‚   ...      β”‚ warp.sum(), warp.broadcast()
            └── Lane 31 β”€β”˜
                └── Thread (SIMD operations within each thread)

Execution Model Details:

  • Thread Level: SIMD operations within individual threads
  • Warp Level: SIMT execution - 32 threads in lockstep coordination
  • Block Level: Multi-warp coordination with shared memory and barriers
  • πŸ†• Cluster Level: Multi-block coordination with SM90+ cluster APIs (see the sketch below)
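
To make this mapping concrete, here is a minimal Mojo kernel sketch annotating which primitive synchronizes each level. It is illustrative rather than one of this puzzle’s solutions: the hierarchy_demo name and output buffer are invented for this example, and the host-side launch (which must request a cluster size) is elided.

from gpu import barrier, thread_idx
from gpu.cluster import block_rank_in_cluster, cluster_sync
from gpu.warp import sum as warp_sum
from memory import UnsafePointer

fn hierarchy_demo(output: UnsafePointer[Float32]):
    # Thread level: ordinary per-thread (SIMD) arithmetic.
    var local = Float32(thread_idx.x)

    # Warp level: 32 lanes combine their values in lockstep.
    var warp_total = warp_sum(local)

    # Block level: every warp in this block meets at a barrier.
    barrier()

    # πŸ†• Cluster level (SM90+): every block in the cluster meets at a
    # hardware-accelerated barrier.
    cluster_sync()

    # Each block knows its rank within the cluster (0 .. cluster size - 1).
    var rank = Int(block_rank_in_cluster())
    if thread_idx.x == 0:
        output[rank] = warp_total

Each primitive only synchronizes its own scope: barrier() does not order blocks against each other; only cluster_sync() does.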

Learning progression

This puzzle follows a carefully designed 3-part progression that builds your cluster programming expertise:

πŸ”° Multi-Block Coordination Basics

Focus: Understanding fundamental cluster synchronization patterns

Learn how multiple thread blocks coordinate their execution using cluster_arrive() and cluster_wait() for basic inter-block communication and data distribution.

Key APIs: block_rank_in_cluster(), cluster_arrive(), cluster_wait()
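
As a preview of the pattern, here is a minimal sketch of the arrive/wait split barrier. It is not the puzzle’s solution: the staged_exchange kernel and its staging/out buffers (one slot per block) are assumptions made for illustration.

from gpu import barrier, thread_idx
from gpu.cluster import block_rank_in_cluster, cluster_arrive, cluster_wait
from memory import UnsafePointer

fn staged_exchange(staging: UnsafePointer[Float32], out: UnsafePointer[Float32]):
    var rank = Int(block_rank_in_cluster())

    # Each block publishes one value into its own staging slot.
    if thread_idx.x == 0:
        staging[rank] = Float32(rank) * 2.0

    barrier()         # block level: the write is visible block-wide
    cluster_arrive()  # tell the cluster this block's data is published

    # (Independent per-block work could overlap here.)

    cluster_wait()    # block until every block in the cluster has arrived

    # Now it is safe to read slots written by other blocks.
    if thread_idx.x == 0:
        out[rank] = staging[0]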


πŸ“Š Cluster-Wide Collective Operations

Focus: Extending block-level patterns to cluster scale

Learn cluster-wide reductions and collective operations that extend familiar block.sum() concepts to coordinate across multiple thread blocks for large-scale computations.

Key APIs: cluster_sync(), elect_one_sync() for efficient cluster coordination
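
The sketch below shows one way such a reduction can be staged. It assumes one warp per block, so that a warp sum is also the block’s total, and a partials scratch buffer with one slot per block; both are assumptions of this example rather than part of the puzzle’s setup.

from gpu import block_dim, block_idx, thread_idx
from gpu.cluster import block_rank_in_cluster, cluster_sync, elect_one_sync
from gpu.warp import sum as warp_sum
from memory import UnsafePointer

fn cluster_reduce(
    data: UnsafePointer[Float32],
    partials: UnsafePointer[Float32],  # one slot per block (assumed scratch)
    result: UnsafePointer[Float32],
    cluster_size: Int,
):
    var gid = Int(block_idx.x * block_dim.x + thread_idx.x)
    var rank = Int(block_rank_in_cluster())

    # Warp level: with one warp per block, this is also the block's total.
    var block_total = warp_sum(data[gid])

    # elect_one_sync() picks exactly one lane to publish the partial sum,
    # instead of 32 redundant global-memory writes.
    if elect_one_sync():
        partials[rank] = block_total

    cluster_sync()  # cluster-wide barrier: every partial is now visible

    # The cluster's first block folds the per-block partials together.
    if rank == 0 and elect_one_sync():
        var total = Float32(0)
        for b in range(cluster_size):
            total += partials[b]
        result[0] = total

Note how elect_one_sync() stands in for the usual β€œonly lane 0 writes” check: the hardware elects one lane per warp to act as leader.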


πŸš€ Advanced Cluster Algorithms

Focus: Production-ready multi-level coordination patterns

Implement sophisticated algorithms combining warp-level, block-level, and cluster-level coordination for maximum GPU utilization and complex computational workflows.

Key APIs: elect_one_sync(), cluster_arrive(), advanced coordination patterns
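
A hallmark of these advanced patterns is the split barrier: cluster_arrive() lets a block announce that its data is published and keep doing independent work, and only cluster_wait() actually blocks. Below is a sketch of that overlap, under the same one-warp-per-block and scratch-buffer assumptions as the previous example.

from gpu import block_dim, block_idx, thread_idx
from gpu.cluster import block_rank_in_cluster, cluster_arrive, cluster_wait, elect_one_sync
from gpu.warp import sum as warp_sum
from memory import UnsafePointer

fn overlapped_reduce(
    data: UnsafePointer[Float32],
    extra: UnsafePointer[Float32],     # unrelated per-thread work (assumed)
    partials: UnsafePointer[Float32],  # one slot per block (assumed scratch)
    result: UnsafePointer[Float32],
    cluster_size: Int,
):
    var gid = Int(block_idx.x * block_dim.x + thread_idx.x)
    var rank = Int(block_rank_in_cluster())

    # Warp-level reduce, then one elected lane publishes the partial.
    var block_total = warp_sum(data[gid])
    if elect_one_sync():
        partials[rank] = block_total

    cluster_arrive()  # signal completion early...

    # ...and overlap independent work while other blocks catch up.
    extra[gid] = extra[gid] * 2.0

    cluster_wait()  # only now block until the whole cluster has arrived

    if rank == 0 and elect_one_sync():
        var total = Float32(0)
        for b in range(cluster_size):
            total += partials[b]
        result[0] = total

The work done between cluster_arrive() and cluster_wait() is effectively free latency hiding: faster blocks stay busy while slower ones catch up, which is the payoff of the split barrier over a single cluster_sync().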

Why cluster programming matters

Problem Scale: Modern AI and scientific workloads often require computations that exceed single thread block capabilities, such as reductions whose working sets overflow one block’s shared memory, or pipelines that need more concurrent threads than one block can hold.

Hardware Evolution: As GPUs gain more compute units (see GPU architecture profiling in Puzzle 30), cluster programming becomes essential for utilizing next-generation hardware efficiently.

Educational value

By completing this puzzle, you’ll have learned the complete GPU programming hierarchy:

  • Thread level: SIMD operations within individual threads
  • Warp level: SIMT coordination across 32 threads (Puzzles 24-26)
  • Block level: multi-warp coordination with shared memory and barriers (Puzzle 27)
  • πŸ†• Cluster level: multi-block coordination with SM90+ cluster APIs (this puzzle)

This progression prepares you for next-generation GPU programming and large-scale parallel computing challenges, building on the performance optimization techniques from Puzzles 30-32.

Getting started

Prerequisites:

  • Warp-level programming from Puzzles 24-26
  • Block-level programming from Puzzle 27
  • The performance optimization techniques from Puzzles 30-32 (helpful context)
  • NVIDIA SM90+ hardware to actually run the kernels (see the hardware note below)

Recommended approach: Follow the 3-part progression sequentially, as each part builds essential concepts for the next level of complexity.

Hardware note: If running on non-SM90+ hardware, the puzzles serve as educational examples of cluster programming concepts and API usage patterns.

Ready to learn the future of GPU programming? Start with Multi-Block Coordination Basics to master fundamental cluster synchronization patterns!