Puzzle 34: GPU Cluster Programming (SM90+)
Introduction
Hardware requirement: ⚠️ NVIDIA SM90+ Only
This puzzle requires NVIDIA Hopper architecture (H100, H200) or newer GPUs with SM90+ compute capability. The cluster programming APIs are hardware-accelerated and will raise errors on unsupported hardware. If you're unsure about the underlying architecture, run
`pixi run gpu-specs`
and confirm that it reports at least `Compute Cap: 9.0` (see GPU profiling basics for hardware identification).
Building on your journey from warp-level programming (Puzzles 24-26) through block-level programming (Puzzle 27), you'll now learn cluster-level programming: coordinating multiple thread blocks to solve problems that exceed single-block capabilities.
What are thread block clusters?
Thread Block Clusters are a revolutionary SM90+ feature that enables multiple thread blocks to cooperate on a single computational task with hardware-accelerated synchronization and communication primitives.
Key capabilities (a minimal kernel sketch follows this list):
- Inter-block synchronization: Coordinate multiple blocks with `cluster_sync`, `cluster_arrive`, `cluster_wait`
- Block identification: Use `block_rank_in_cluster` for unique block coordination
- Efficient coordination: `elect_one_sync` for optimized warp-level cooperation
- Advanced patterns: `cluster_mask_base` for selective block coordination
The cluster programming model
Traditional GPU programming hierarchy:
Grid (Multiple Blocks)
└── Block (Multiple Warps) - barrier() synchronization
    ├── Warp (32 Threads) - SIMT lockstep execution
    │   ├── Lane 0  ──┐
    │   ├── Lane 1    │  All execute same instruction
    │   ├── Lane 2    │  at same time (SIMT)
    │   │   ...       │  warp.sum(), warp.broadcast()
    │   └── Lane 31 ──┘
    └── Thread (SIMD operations within each thread)
New: Cluster programming hierarchy:
Grid (Multiple Clusters)
└── Cluster (Multiple Blocks) - cluster_sync(), cluster_arrive()
    └── Block (Multiple Warps) - barrier() synchronization
        ├── Warp (32 Threads) - SIMT lockstep execution
        │   ├── Lane 0  ──┐
        │   ├── Lane 1    │  All execute same instruction
        │   ├── Lane 2    │  at same time (SIMT)
        │   │   ...       │  warp.sum(), warp.broadcast()
        │   └── Lane 31 ──┘
        └── Thread (SIMD operations within each thread)
Execution Model Details:
- Thread Level: SIMD operations within individual threads
- Warp Level: SIMT execution - 32 threads in lockstep coordination
- Block Level: Multi-warp coordination with shared memory and barriers
- Cluster Level: Multi-block coordination with SM90+ cluster APIs (see the kernel sketch below)
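The following small sketch (hypothetical kernel and buffer names) maps each level to the primitive that coordinates it: per-thread work, lane position within a 32-thread warp, the block's rank within its cluster, and the block-level versus cluster-level barriers:

```mojo
from memory import UnsafePointer
from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.cluster import block_rank_in_cluster, cluster_sync


# Hypothetical kernel: encode "which block in the cluster / which lane in the
# warp" for every thread, then synchronize at the block and cluster levels.
fn hierarchy_demo(ids: UnsafePointer[Float32]):
    var global_i = block_idx.x * block_dim.x + thread_idx.x  # grid-level index
    var lane = Int(thread_idx.x % 32)        # warp level: lane in a 32-thread warp
    var rank = Int(block_rank_in_cluster())  # cluster level: block within the cluster
    ids[global_i] = Float32(rank) * 100.0 + Float32(lane)
    barrier()       # block level: wait for all warps in this block
    cluster_sync()  # cluster level: wait for all blocks in this cluster
```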
Learning progression
This puzzle follows a carefully designed 3-part progression that builds your cluster programming expertise:
Multi-Block Coordination Basics
Focus: Understanding fundamental cluster synchronization patterns
Learn how multiple thread blocks coordinate their execution using `cluster_arrive()` and `cluster_wait()` for basic inter-block communication and data distribution.
Key APIs: `block_rank_in_cluster()`, `cluster_arrive()`, `cluster_wait()`
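Here is a sketch of the arrive/wait pattern this part teaches, under assumed names (`staging`, `neighbor_exchange`) and an assumed cluster of `CLUSTER_BLOCKS = 4` blocks. The split `cluster_arrive()`/`cluster_wait()` pair lets a block signal that its data is published, overlap other work, and only then wait for the rest of the cluster:

```mojo
from memory import UnsafePointer
from gpu import thread_idx
from gpu.cluster import block_rank_in_cluster, cluster_arrive, cluster_wait

alias CLUSTER_BLOCKS = 4  # assumed cluster size for this sketch


# Hypothetical kernel: each block publishes one value, then reads the value
# published by its neighbor once the whole cluster has arrived.
fn neighbor_exchange(
    staging: UnsafePointer[Float32],  # one slot per block in the cluster
    output: UnsafePointer[Float32],
):
    var my_rank = Int(block_rank_in_cluster())
    # One thread per block publishes this block's contribution.
    if thread_idx.x == 0:
        staging[my_rank] = Float32(my_rank) * 10.0
    # Signal "my slot is written" without blocking yet...
    cluster_arrive()
    # ...independent per-block work could overlap here...
    # ...then block until every block in the cluster has arrived.
    cluster_wait()
    # Now it is safe to read another block's slot.
    if thread_idx.x == 0:
        output[my_rank] = staging[(my_rank + 1) % CLUSTER_BLOCKS]
```

The host-side launch configuration, which requests the cluster shape, is omitted from this sketch.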
Cluster-Wide Collective Operations
Focus: Extending block-level patterns to cluster scale
Learn cluster-wide reductions and collective operations that extend familiar `block.sum()` concepts to coordinate across multiple thread blocks for large-scale computations.
Key APIs: `cluster_sync()`, `elect_one_sync()` for efficient cluster coordination
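As a rough sketch of what a cluster-wide sum can look like, assume one 32-thread warp per block, a grid that is a single cluster of `CLUSTER_BLOCKS` blocks, and a `partials` scratch buffer with one slot per block (all names and sizes are illustrative):

```mojo
from memory import UnsafePointer
from gpu import thread_idx, block_idx, block_dim
from gpu.cluster import block_rank_in_cluster, cluster_sync, elect_one_sync
from gpu.warp import sum as warp_sum

alias CLUSTER_BLOCKS = 4  # assumed: the grid is a single cluster of 4 blocks


# Hypothetical kernel: cluster-wide sum of `a`, assuming block_dim.x == 32
# (one warp per block) so a warp reduction covers the whole block.
fn cluster_reduce(
    output: UnsafePointer[Float32],
    a: UnsafePointer[Float32],
    partials: UnsafePointer[Float32],  # scratch: one slot per block
    size: UInt,
):
    var global_i = block_idx.x * block_dim.x + thread_idx.x
    var my_rank = Int(block_rank_in_cluster())

    # Each thread loads one element (0 if out of range).
    var val: Float32 = 0.0
    if global_i < size:
        val = a[global_i]

    # Warp-level reduction: lane 0 now holds this block's partial sum.
    var block_partial = warp_sum(val)
    if thread_idx.x == 0:
        partials[my_rank] = block_partial

    # Cluster-wide barrier: all partials are written and visible.
    cluster_sync()

    # One elected thread in block rank 0 folds the partials together.
    if my_rank == 0 and elect_one_sync():
        var total: Float32 = 0.0
        for r in range(CLUSTER_BLOCKS):
            total += partials[r]
        output[0] = total
```

With multiple warps per block, a block-level reduction step sits in between; that multi-level shape is the subject of the next part.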
Advanced Cluster Algorithms
Focus: Production-ready multi-level coordination patterns
Implement sophisticated algorithms combining warp-level, block-level, and cluster-level coordination for maximum GPU utilization and complex computational workflows.
Key APIs: `elect_one_sync()`, `cluster_arrive()`, advanced coordination patterns
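A sketch of that multi-level shape, with assumed geometry (4 warps per block, 4 blocks per cluster) and scratch buffers kept in global memory purely to keep the example short: warps reduce their lanes, each block reduces its warps after `barrier()`, and the cluster reduces its blocks after `cluster_arrive()`/`cluster_wait()`:

```mojo
from memory import UnsafePointer
from gpu import thread_idx, block_idx, block_dim, barrier
from gpu.cluster import block_rank_in_cluster, cluster_arrive, cluster_wait, elect_one_sync
from gpu.warp import sum as warp_sum

alias CLUSTER_BLOCKS = 4   # assumed blocks per cluster (= the whole grid here)
alias WARPS_PER_BLOCK = 4  # assumed block_dim.x == 4 * 32 == 128


# Hypothetical kernel: three-level sum. Warps reduce lanes, blocks reduce
# warps, and the cluster reduces blocks.
fn multilevel_sum(
    output: UnsafePointer[Float32],
    a: UnsafePointer[Float32],
    warp_partials: UnsafePointer[Float32],   # CLUSTER_BLOCKS * WARPS_PER_BLOCK slots
    block_partials: UnsafePointer[Float32],  # CLUSTER_BLOCKS slots
    size: UInt,
):
    var global_i = block_idx.x * block_dim.x + thread_idx.x
    var my_rank = Int(block_rank_in_cluster())
    var warp_id = Int(thread_idx.x // 32)
    var lane = Int(thread_idx.x % 32)

    var val: Float32 = 0.0
    if global_i < size:
        val = a[global_i]

    # Level 1 (warp): lane 0 of each warp holds that warp's sum.
    var wsum = warp_sum(val)
    if lane == 0:
        warp_partials[my_rank * WARPS_PER_BLOCK + warp_id] = wsum
    barrier()  # Level 2 (block): all warps in this block have written.

    # One thread per block folds its warps' partials into a block partial.
    if thread_idx.x == 0:
        var bsum: Float32 = 0.0
        for w in range(WARPS_PER_BLOCK):
            bsum += warp_partials[my_rank * WARPS_PER_BLOCK + w]
        block_partials[my_rank] = bsum

    # Level 3 (cluster): publish, then wait for every block in the cluster.
    cluster_arrive()
    cluster_wait()

    # One elected thread in the first warp of block rank 0 writes the result.
    if my_rank == 0 and warp_id == 0 and elect_one_sync():
        var total: Float32 = 0.0
        for b in range(CLUSTER_BLOCKS):
            total += block_partials[b]
        output[0] = total
```

In production code the per-block step would typically use shared memory rather than global scratch; global buffers are used here only to keep the sketch self-contained.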
Why cluster programming matters
Problem Scale: Modern AI and scientific workloads often require computations that exceed single thread block capabilities:
- Large matrix operations requiring inter-block coordination (like matrix multiplication from Puzzle 16)
- Multi-stage algorithms with producer-consumer dependencies from Puzzle 29
- Global statistics across datasets larger than shared memory from Puzzle 8
- Advanced stencil computations requiring neighbor block communication
Hardware Evolution: As GPUs gain more compute units (see GPU architecture profiling in Puzzle 30), cluster programming becomes essential for utilizing next-generation hardware efficiently.
Educational value
By completing this puzzle, you'll have learned the complete GPU programming hierarchy:
- Thread-level: Individual computation units with SIMD operations
- Warp-level: 32-thread SIMT coordination (Puzzles 24-26)
- Block-level: Multi-warp coordination with shared memory (Puzzle 27)
- Cluster-level: Multi-block coordination (Puzzle 34)
- Grid-level: Independent block execution across multiple streaming multiprocessors
This progression prepares you for next-generation GPU programming and large-scale parallel computing challenges, building on the performance optimization techniques from Puzzles 30-32.
Getting started
Prerequisites:
- Complete understanding of block-level programming (Puzzle 27)
- Experience with warp-level programming (Puzzles 24-26)
- Familiarity with GPU memory hierarchy from shared memory concepts (Puzzle 8)
- Understanding of GPU synchronization from barriers (Puzzle 29)
- Access to NVIDIA SM90+ hardware or compatible environment
Recommended approach: Follow the 3-part progression sequentially, as each part builds essential concepts for the next level of complexity.
Hardware note: If running on non-SM90+ hardware, the puzzles serve as educational examples of cluster programming concepts and API usage patterns.
Ready to explore the future of GPU programming? Start with Multi-Block Coordination Basics to learn fundamental cluster synchronization patterns!