How to Use This Book

Each puzzle maintains a consistent structure to support systematic skill development:

  • Overview: Problem definition and key concepts for each challenge
  • Configuration: Technical setup and memory organization details
  • Code to Complete: Implementation framework with clearly marked sections to fill in
  • Tips: Strategic hints available when needed, without revealing complete solutions
  • Solution: Comprehensive implementation analysis, including performance considerations and conceptual explanations

The puzzles increase in complexity systematically, building new concepts on established foundations. Working through them sequentially is recommended, as advanced puzzles assume familiarity with concepts from earlier challenges.

Running the code

All puzzles integrate with a testing framework that validates implementations against expected results. Each puzzle provides specific execution instructions and solution verification procedures.

Prerequisites

System requirements

Make sure your system meets our system requirements.

Compatible GPU

You’ll need a compatible GPU to run the puzzles. If you have a supported GPU, run one of the following commands (depending on your package manager) to print information about it:

pixi run gpu-specs
uv run poe gpu-specs

macOS Apple Silicon (Early preview)

For osx-arm64 users, you’ll need:

  • macOS 15.0 or later. Run pixi run check-macos; if it fails, you need to upgrade.
  • Xcode 16 or later. Use xcodebuild -version to check.

If xcrun -sdk macosx metal reports cannot execute tool 'metal' due to missing Metal toolchain, install the toolchain by running

xcodebuild -downloadComponent MetalToolchain

Afterwards, xcrun -sdk macosx metal should report a no input files error, which indicates that the toolchain is installed.
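
The version checks above can also be scripted by hand. The following is a minimal sketch, not part of the repository: the helper name major_at_least is hypothetical, and it compares only major version numbers, which is enough for the 15.0 and 16 thresholds listed above. It assumes sw_vers -productVersion (the standard macOS command for the OS version) and that the first line of xcodebuild -version has the form "Xcode <version>".

```shell
#!/bin/sh
# Sketch only: major_at_least is a hypothetical helper, not part of the puzzles repo.

# Succeed if the major component of dotted version $1 is >= integer $2.
major_at_least() {
    major=${1%%.*}
    [ "$major" -ge "$2" ]
}

# Run the real checks only on macOS; elsewhere this script just defines the helper.
if [ "$(uname -s)" = "Darwin" ]; then
    macos_version=$(sw_vers -productVersion)
    # Assumes the first output line looks like "Xcode <version>".
    xcode_version=$(xcodebuild -version | awk 'NR==1 {print $2}')

    major_at_least "$macos_version" 15 ||
        echo "macOS 15.0 or later is required (found $macos_version)"
    major_at_least "$xcode_version" 16 ||
        echo "Xcode 16 or later is required (found $xcode_version)"
fi
```

The script prints nothing when both requirements are met. pixi run check-macos remains the check documented for this project; the sketch only illustrates what such a check involves.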

Note: Puzzles 1-8 and 11-15 currently work on macOS. We’re working to enable more. Please stay tuned!

Programming knowledge

Basic knowledge of:

  • Programming fundamentals (variables, loops, conditionals, functions)
  • Parallel computing concepts (threads, synchronization, race conditions)
  • Basic familiarity with Mojo (the language basics parts and the intro to pointers section)
  • Familiarity with GPU programming fundamentals is helpful, but not required

No prior GPU programming experience is necessary! We’ll build that knowledge through the puzzles.

Let’s begin our journey into the exciting world of GPU computing with Mojo🔥!

Setting up your environment

  1. Clone the GitHub repository and navigate to the repository:

    # Clone the repository
    git clone https://github.com/modular/mojo-gpu-puzzles
    cd mojo-gpu-puzzles
    
  2. Install a package manager to run the Mojo🔥 programs:

    Option 1 (recommended): pixi

    pixi is the recommended option for this project because:

    • Easy access to Modular’s MAX/Mojo packages
    • Handles GPU dependencies
    • Full conda + PyPI ecosystem support

    Note: Some puzzles only work with pixi.

    Install:

    curl -fsSL https://pixi.sh/install.sh | sh
    

    Update:

    pixi self-update
    

    Option 2: uv

    Install:

    curl -fsSL https://astral.sh/uv/install.sh | sh
    

    Update:

    uv self update
    

    Create a virtual environment:

    uv venv && source .venv/bin/activate
    
  3. Run the puzzles via pixi or uv as follows:

    pixi run pXX  # Default environment; replace XX with the puzzle number
    
    pixi run pXX -e amd  # AMD GPU environment; replace XX with the puzzle number
    
    pixi run pXX -e apple  # Apple GPU environment; replace XX with the puzzle number
    
    uv run poe pXX  # Using uv; replace XX with the puzzle number
    

For example, to run puzzle 01:

  • pixi run p01 or
  • uv run poe p01
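
The mapping from a puzzle number to its run command can be sketched as a small shell helper. This is an illustration only: puzzle_cmd is a hypothetical function, not something the repository provides, and it covers just the default pixi and uv invocations shown above.

```shell
#!/bin/sh
# Sketch only: puzzle_cmd is a hypothetical helper, not part of the puzzles repo.

# Print the run command for puzzle number $2 using package manager $1.
# Pass the number without a leading zero (1, not 01); printf adds the padding.
puzzle_cmd() {
    task=$(printf 'p%02d' "$2")
    case "$1" in
        pixi) echo "pixi run $task" ;;
        uv)   echo "uv run poe $task" ;;
        *)    echo "unknown package manager: $1" >&2; return 1 ;;
    esac
}

puzzle_cmd pixi 1    # prints: pixi run p01
puzzle_cmd uv 12     # prints: uv run poe p12
```

The helper only prints the command, so the output can be inspected before running it.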

GPU support matrix

The following table shows GPU platform compatibility for each puzzle. Different puzzles require different GPU features and vendor-specific tools.

Puzzle | NVIDIA GPU | AMD GPU | Apple GPU | Notes

Part I: GPU Fundamentals
1 - Map | ✅ | ✅ | ✅ | Basic GPU kernels
2 - Zip | ✅ | ✅ | ✅ | Basic GPU kernels
3 - Guard | ✅ | ✅ | ✅ | Basic GPU kernels
4 - Map 2D | ✅ | ✅ | ✅ | Basic GPU kernels
5 - Broadcast | ✅ | ✅ | ✅ | Basic GPU kernels
6 - Blocks | ✅ | ✅ | ✅ | Basic GPU kernels
7 - Shared Memory | ✅ | ✅ | ✅ | Basic GPU kernels
8 - Stencil | ✅ | ✅ | ✅ | Basic GPU kernels

Part II: Debugging
9 - GPU Debugger | ✅ | ❌ | ❌ | NVIDIA-specific debugging tools
10 - Sanitizer | ✅ | ❌ | ❌ | NVIDIA-specific debugging tools

Part III: GPU Algorithms
11 - Reduction | ✅ | ✅ | ✅ | Basic GPU kernels
12 - Scan | ✅ | ✅ | ✅ | Basic GPU kernels
13 - Pool | ✅ | ✅ | ✅ | Basic GPU kernels
14 - Conv | ✅ | ✅ | ✅ | Basic GPU kernels
15 - Matmul | ✅ | ✅ | ✅ | Basic GPU kernels
16 - Flashdot | ✅ | ✅ | ❌ | Advanced memory patterns

Part IV: MAX Graph
17 - Custom Op | ✅ | ✅ | ❌ | MAX Graph integration
18 - Softmax | ✅ | ✅ | ❌ | MAX Graph integration
19 - Attention | ✅ | ✅ | ❌ | MAX Graph integration

Part V: PyTorch Integration
20 - Torch Bridge | ✅ | ✅ | ❌ | PyTorch integration
21 - Autograd | ✅ | ✅ | ❌ | PyTorch integration
22 - Fusion | ✅ | ✅ | ❌ | PyTorch integration

Part VI: Functional Patterns
23 - Functional | ✅ | ✅ | ❌ | Advanced Mojo patterns

Part VII: Warp Programming
24 - Warp Sum | ✅ | ✅ | ❌ | Warp-level operations
25 - Warp Communication | ✅ | ✅ | ❌ | Warp-level operations
26 - Advanced Warp | ✅ | ✅ | ❌ | Warp-level operations

Part VIII: Block Programming
27 - Block Operations | ✅ | ✅ | ❌ | Block-level patterns

Part IX: Memory Systems
28 - Async Memory | ✅ | ✅ | ❌ | Advanced memory operations
29 - Barriers | ✅ | ✅ | ❌ | Advanced synchronization

Part X: Performance Analysis
30 - Profiling | ✅ | ❌ | ❌ | NVIDIA profiling tools (NSight)
31 - Occupancy | ✅ | ❌ | ❌ | NVIDIA profiling tools
32 - Bank Conflicts | ✅ | ❌ | ❌ | NVIDIA profiling tools

Part XI: Modern GPU Features
33 - Tensor Cores | ✅ | ❌ | ❌ | NVIDIA Tensor Core specific
34 - Cluster | ✅ | ❌ | ❌ | NVIDIA cluster programming

Legend

  • ✅ Supported: Puzzle works on this platform
  • ❌ Not Supported: Puzzle requires platform-specific features

Platform notes

NVIDIA GPUs (Complete Support)

  • All puzzles (1-34) work on NVIDIA GPUs with CUDA support
  • Requires CUDA toolkit and compatible drivers
  • Best learning experience with access to all features

AMD GPUs (Extensive Support)

  • Most puzzles (1-8, 11-29) work with ROCm support
  • Missing only: Debugging tools (9-10), profiling (30-32), Tensor Cores and cluster programming (33-34)
  • Excellent for learning GPU programming including advanced algorithms and memory patterns

Apple GPUs (Basic Support)

  • Only fundamental puzzles (1-8, 11-15) supported
  • Missing: All advanced features, debugging, profiling tools
  • Suitable for learning basic GPU programming patterns

Future Support: We’re actively working to expand tooling and platform support for AMD and Apple GPUs. Missing features like debugging tools, profiling capabilities, and advanced GPU operations are planned for future releases. Check back for updates as we continue to broaden cross-platform compatibility.

Recommendations

  • Complete Learning Path: Use an NVIDIA GPU for full curriculum access (all 34 puzzles)
  • Comprehensive Learning: AMD GPUs work well for most content (27 of 34 puzzles)
  • Basic Understanding: Apple GPUs are suitable for fundamental concepts (13 of 34 puzzles)
  • Debugging & Profiling: An NVIDIA GPU is required for debugging tools and performance analysis
  • Modern GPU Features: An NVIDIA GPU is required for Tensor Cores and cluster programming
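
The per-platform puzzle ranges and counts can be checked with a short script. This is a sketch under stated assumptions: puzzle_supported and count_supported are hypothetical helpers that simply transcribe the support matrix in this section (NVIDIA 1-34, AMD 1-8 and 11-29, Apple 1-8 and 11-15); they do not query any real tool or hardware.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and just transcribe the
# support matrix from this section; they do not probe any real GPU.

# Succeed if puzzle number $2 (1-34) is marked supported on platform $1.
puzzle_supported() {
    n=$2
    [ "$n" -ge 1 ] && [ "$n" -le 34 ] || return 1
    case "$1" in
        nvidia) return 0 ;;
        amd)    [ "$n" -le 8 ] || { [ "$n" -ge 11 ] && [ "$n" -le 29 ]; } ;;
        apple)  [ "$n" -le 8 ] || { [ "$n" -ge 11 ] && [ "$n" -le 15 ]; } ;;
        *)      return 1 ;;
    esac
}

# Print how many of the 34 puzzles are supported on platform $1.
count_supported() {
    count=0
    i=1
    while [ "$i" -le 34 ]; do
        if puzzle_supported "$1" "$i"; then count=$((count + 1)); fi
        i=$((i + 1))
    done
    echo "$count"
}

for platform in nvidia amd apple; do
    echo "$platform: $(count_supported "$platform") of 34"
done
```

Running it reports 34, 27, and 13 supported puzzles for NVIDIA, AMD, and Apple respectively, matching the counts in the recommendations above.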

Development

Please see details in the README.

Join the community

  • Subscribe for Updates
  • Modular Forum
  • Discord

Join our vibrant community to discuss GPU programming, share solutions, and get help!