How to Use This Book
Each puzzle maintains a consistent structure to support systematic skill development:
- Overview: Problem definition and key concepts for each challenge
- Configuration: Technical setup and memory organization details
- Code to Complete: Implementation framework with clearly marked sections to fill in
- Tips: Strategic hints available when needed, without revealing complete solutions
- Solution: Comprehensive implementation analysis, including performance considerations and conceptual explanations
The puzzles increase in complexity systematically, building new concepts on established foundations. Working through them sequentially is recommended, as advanced puzzles assume familiarity with concepts from earlier challenges.
Running the code
All puzzles integrate with a testing framework that validates implementations against expected results. Each puzzle provides specific execution instructions and solution verification procedures.
Prerequisites
System requirements
Make sure your system meets our system requirements.
Compatible GPU
You’ll need a compatible GPU to run the puzzles. If you have a supported GPU, run one of the following commands (for pixi or uv, respectively) to get information about it:
pixi run gpu-specs
uv run poe gpu-specs
macOS Apple Silicon (Early preview)
For `osx-arm64` users, you’ll need:
- macOS 15.0 or later for optimal compatibility. Run `pixi run check-macos`; if it fails, you’ll need to upgrade.
- Xcode 16 or later (minimum required). Run `xcodebuild -version` to check.
If `xcrun -sdk macosx metal` outputs `cannot execute tool 'metal' due to missing Metal toolchain`, run `xcodebuild -downloadComponent MetalToolchain`. Afterwards, `xcrun -sdk macosx metal` should report a `no input files` error, which indicates that the toolchain is now available.
Note: Currently, puzzles 1-8 and 11-15 work on macOS. We’re working to enable more. Please stay tuned!
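The version requirements above (macOS 15.0 or later, Xcode 16 or later) can be checked programmatically. The following is a minimal sketch, not part of the puzzles repository: the helper names `parse_version`, `meets_minimum`, and `xcode_version` are hypothetical, and it assumes `xcodebuild -version` prints a first line shaped like `Xcode 16.0`.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '15.0' into a tuple of ints, e.g. (15, 0)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """Compare two dotted versions numerically, padding the shorter one with zeros."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def xcode_version(xcodebuild_output: str) -> str:
    """Extract the version number from output assumed to start with 'Xcode <version>'."""
    match = re.search(r"Xcode\s+(\d+(?:\.\d+)*)", xcodebuild_output)
    if match is None:
        raise ValueError("no Xcode version found in output")
    return match.group(1)


if __name__ == "__main__":
    import platform
    import subprocess

    # Only query the system tools when actually running on macOS.
    if platform.system() == "Darwin":
        macos = platform.mac_ver()[0]
        print("macOS", macos, "ok" if meets_minimum(macos, "15.0") else "too old")
        try:
            result = subprocess.run(
                ["xcodebuild", "-version"], capture_output=True, text=True
            )
            xcode = xcode_version(result.stdout)
            print("Xcode", xcode, "ok" if meets_minimum(xcode, "16") else "too old")
        except (FileNotFoundError, ValueError):
            print("Xcode not found")
```

Comparing integer tuples rather than raw strings avoids the pitfall where `"9.0"` sorts after `"15.0"` lexicographically.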
Programming knowledge
Basic knowledge of:
- Programming fundamentals (variables, loops, conditionals, functions)
- Parallel computing concepts (threads, synchronization, race conditions)
- Basic familiarity with Mojo (the language basics sections and the intro to pointers section)
- GPU programming fundamentals (helpful, but not required)
No prior GPU programming experience is necessary! We’ll build that knowledge through the puzzles.
Let’s begin our journey into the exciting world of GPU computing with Mojo🔥!
Setting up your environment
- Clone the GitHub repository and navigate into it:
# Clone the repository
git clone https://github.com/modular/mojo-gpu-puzzles
cd mojo-gpu-puzzles
- Install a package manager to run the Mojo🔥 programs:
Option 1 (Highly recommended): pixi
`pixi` is the recommended option for this project because it provides:
- Easy access to Modular’s MAX/Mojo packages
- Handling of GPU dependencies
- Full conda + PyPI ecosystem support
Note: Some puzzles only work with `pixi`.
Install:
curl -fsSL https://pixi.sh/install.sh | sh
Update:
pixi self-update
Option 2: uv
Install:
curl -fsSL https://astral.sh/uv/install.sh | sh
Update:
uv self update
Create a virtual environment:
uv venv && source .venv/bin/activate
- Run the puzzles via `pixi` or `uv` as follows:
pixi run pXX # Replace XX with the puzzle number
pixi run pXX -e amd # Replace XX with the puzzle number
pixi run pXX -e apple # Replace XX with the puzzle number
uv run poe pXX # Replace XX with the puzzle number
For example, to run puzzle 01:
pixi run p01
or
uv run poe p01
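The naming scheme above (a `p` prefix followed by a two-digit puzzle number) can be captured in a small helper. This is an illustrative sketch only: `puzzle_task` and `puzzle_command` are hypothetical names, and the environment names `amd` and `apple` are taken from the commands listed above. The sketch places pixi's `-e` option before the task name, which is where pixi documents options to `pixi run`; treat that ordering as an assumption and adjust it if your setup differs.

```python
from typing import List, Optional

# Environment names taken from the run commands listed in this section.
PIXI_ENVIRONMENTS = {"amd", "apple"}


def puzzle_task(number: int) -> str:
    """Return the task name for a puzzle, e.g. 1 -> 'p01'."""
    if not 1 <= number <= 34:
        raise ValueError("puzzle number must be between 1 and 34")
    return f"p{number:02d}"


def puzzle_command(
    number: int, manager: str = "pixi", environment: Optional[str] = None
) -> List[str]:
    """Build the argument list for running one puzzle."""
    task = puzzle_task(number)
    if manager == "pixi":
        if environment is None:
            return ["pixi", "run", task]
        if environment not in PIXI_ENVIRONMENTS:
            raise ValueError(f"unknown pixi environment: {environment}")
        # Assumption: -e goes before the task name, as an option to `pixi run`.
        return ["pixi", "run", "-e", environment, task]
    if manager == "uv":
        if environment is not None:
            raise ValueError("no environment flag is listed for uv")
        return ["uv", "run", "poe", task]
    raise ValueError(f"unknown package manager: {manager}")


if __name__ == "__main__":
    print(" ".join(puzzle_command(1)))
    print(" ".join(puzzle_command(1, manager="uv")))
```

Returning a list of arguments rather than one string means the result can be passed directly to `subprocess.run` without shell quoting.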
GPU support matrix
The following table shows GPU platform compatibility for each puzzle. Different puzzles require different GPU features and vendor-specific tools.
Puzzle | NVIDIA GPU | AMD GPU | Apple GPU | Notes |
---|---|---|---|---|
Part I: GPU Fundamentals | ||||
1 - Map | ✅ | ✅ | ✅ | Basic GPU kernels |
2 - Zip | ✅ | ✅ | ✅ | Basic GPU kernels |
3 - Guard | ✅ | ✅ | ✅ | Basic GPU kernels |
4 - Map 2D | ✅ | ✅ | ✅ | Basic GPU kernels |
5 - Broadcast | ✅ | ✅ | ✅ | Basic GPU kernels |
6 - Blocks | ✅ | ✅ | ✅ | Basic GPU kernels |
7 - Shared Memory | ✅ | ✅ | ✅ | Basic GPU kernels |
8 - Stencil | ✅ | ✅ | ✅ | Basic GPU kernels |
Part II: Debugging | ||||
9 - GPU Debugger | ✅ | ❌ | ❌ | NVIDIA-specific debugging tools |
10 - Sanitizer | ✅ | ❌ | ❌ | NVIDIA-specific debugging tools |
Part III: GPU Algorithms | ||||
11 - Reduction | ✅ | ✅ | ✅ | Basic GPU kernels |
12 - Scan | ✅ | ✅ | ✅ | Basic GPU kernels |
13 - Pool | ✅ | ✅ | ✅ | Basic GPU kernels |
14 - Conv | ✅ | ✅ | ✅ | Basic GPU kernels |
15 - Matmul | ✅ | ✅ | ✅ | Basic GPU kernels |
16 - Flashdot | ✅ | ✅ | ❌ | Advanced memory patterns |
Part IV: MAX Graph | ||||
17 - Custom Op | ✅ | ✅ | ❌ | MAX Graph integration |
18 - Softmax | ✅ | ✅ | ❌ | MAX Graph integration |
19 - Attention | ✅ | ✅ | ❌ | MAX Graph integration |
Part V: PyTorch Integration | ||||
20 - Torch Bridge | ✅ | ✅ | ❌ | PyTorch integration |
21 - Autograd | ✅ | ✅ | ❌ | PyTorch integration |
22 - Fusion | ✅ | ✅ | ❌ | PyTorch integration |
Part VI: Functional Patterns | ||||
23 - Functional | ✅ | ✅ | ❌ | Advanced Mojo patterns |
Part VII: Warp Programming | ||||
24 - Warp Sum | ✅ | ✅ | ❌ | Warp-level operations |
25 - Warp Communication | ✅ | ✅ | ❌ | Warp-level operations |
26 - Advanced Warp | ✅ | ✅ | ❌ | Warp-level operations |
Part VIII: Block Programming | ||||
27 - Block Operations | ✅ | ✅ | ❌ | Block-level patterns |
Part IX: Memory Systems | ||||
28 - Async Memory | ✅ | ✅ | ❌ | Advanced memory operations |
29 - Barriers | ✅ | ✅ | ❌ | Advanced synchronization |
Part X: Performance Analysis | ||||
30 - Profiling | ✅ | ❌ | ❌ | NVIDIA profiling tools (NSight) |
31 - Occupancy | ✅ | ❌ | ❌ | NVIDIA profiling tools |
32 - Bank Conflicts | ✅ | ❌ | ❌ | NVIDIA profiling tools |
Part XI: Modern GPU Features | ||||
33 - Tensor Cores | ✅ | ❌ | ❌ | NVIDIA Tensor Core specific |
34 - Cluster | ✅ | ❌ | ❌ | NVIDIA cluster programming |
Legend
- ✅ Supported: Puzzle works on this platform
- ❌ Not Supported: Puzzle requires platform-specific features
Platform notes
NVIDIA GPUs (Complete Support)
- All puzzles (1-34) work on NVIDIA GPUs with CUDA support
- Requires CUDA toolkit and compatible drivers
- Best learning experience with access to all features
AMD GPUs (Extensive Support)
- Most puzzles (1-8, 11-29) work with ROCm support
- Missing only: debugging tools (9-10), profiling (30-32), and modern GPU features such as Tensor Cores and cluster programming (33-34)
- Excellent for learning GPU programming including advanced algorithms and memory patterns
Apple GPUs (Basic Support)
- Only fundamental puzzles (1-8, 11-15) supported
- Missing: All advanced features, debugging, profiling tools
- Suitable for learning basic GPU programming patterns
Future Support: We’re actively working to expand tooling and platform support for AMD and Apple GPUs. Missing features like debugging tools, profiling capabilities, and advanced GPU operations are planned for future releases. Check back for updates as we continue to broaden cross-platform compatibility.
Recommendations
- Complete Learning Path: Use NVIDIA GPU for full curriculum access (all 34 puzzles)
- Comprehensive Learning: AMD GPUs work well for most content (27 of 34 puzzles)
- Basic Understanding: Apple GPUs suitable for fundamental concepts (13 of 34 puzzles)
- Debugging & Profiling: NVIDIA GPU required for debugging tools and performance analysis
- Modern GPU Features: NVIDIA GPU required for Tensor Cores and cluster programming
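The puzzle ranges in the platform notes can be written down as sets, which makes it easy to look up where a given puzzle runs and to confirm the counts quoted in the recommendations. This is a small sketch for illustration: the names `SUPPORTED`, `is_supported`, `platforms_for`, and `coverage` are hypothetical, and the ranges are copied from the platform notes rather than queried from any tool.

```python
from typing import Dict, List, Set

# Puzzle ranges copied from the platform notes (range() excludes its upper bound).
SUPPORTED: Dict[str, Set[int]] = {
    "nvidia": set(range(1, 35)),                     # 1-34
    "amd": set(range(1, 9)) | set(range(11, 30)),    # 1-8, 11-29
    "apple": set(range(1, 9)) | set(range(11, 16)),  # 1-8, 11-15
}


def is_supported(puzzle: int, platform: str) -> bool:
    """Return True if the puzzle is listed as working on the platform."""
    return puzzle in SUPPORTED[platform]


def platforms_for(puzzle: int) -> List[str]:
    """Return the platforms that list this puzzle, in alphabetical order."""
    return sorted(name for name, puzzles in SUPPORTED.items() if puzzle in puzzles)


def coverage(platform: str) -> int:
    """Return how many of the 34 puzzles the platform lists as supported."""
    return len(SUPPORTED[platform])


if __name__ == "__main__":
    for name in SUPPORTED:
        print(f"{name}: {coverage(name)} of 34 puzzles")
    print("Puzzle 16 runs on:", ", ".join(platforms_for(16)))
```

The counts follow directly from the ranges: 8 + 19 = 27 puzzles for AMD and 8 + 5 = 13 for Apple.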
Development
Please see details in the README.
Join the community
Join our vibrant community to discuss GPU programming, share solutions, and get help!