- Getting Started
- Introduction
- 🧩 Puzzles Usage Guide
- Part I: GPU Fundamentals
- Puzzle 1: Map
- 🔰 Raw Memory Approach
- 💡 Preview: Modern Approach with LayoutTensor
- Puzzle 2: Zip
- Puzzle 3: Guards
- Puzzle 4: 2D Map
- 🔰 Raw Memory Approach
- 📚 Learn about LayoutTensor
- 🚀 Modern 2D Operations
- Puzzle 5: Broadcast
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 6: Blocks
- Puzzle 7: 2D Blocks
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 8: Shared Memory
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Part II: GPU Algorithms
- Puzzle 9: Pooling
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 10: Dot Product
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 11: 1D Convolution
- 🔰 Simple Version
- ⭐ Complete Version
- Puzzle 12: Prefix Sum
- 🔰 Simple Version
- ⭐ Complete Version
- Puzzle 13: Axis Sum
- Puzzle 14: Matrix Multiplication (MatMul)
- 🔰 Naive Version with Global Memory
- 📚 Learn about Roofline Model
- 🤝 Shared Memory Version
- 📐 Tiled Version
- Part III: Interfacing with Python via MAX Graph Custom Ops
- Puzzle 15: 1D Convolution Op
- Puzzle 16: Softmax Op
- Puzzle 17: Attention Op
- 🎯 Bonus Challenges
- Part IV: Advanced GPU Algorithms
- Puzzle 18: 2D Convolution Op
- Puzzle 19: 3D Average Pooling
- 📚 Learn about 3D Memory Layout
- Basic Version
- LayoutTensor Version
- Puzzle 20: 3D Convolution
- 📚 Learn about 3D Convolution
- Basic Version
- Optimized Version
- Puzzle 21: 3D Tensor Multiplication
- 📚 Learn about Tensor Operations
- Basic Version
- LayoutTensor Version
- Puzzle 22: Multi-Head Self-Attention
- 📚 Learn about Attention Mechanisms
- Basic Version
- Optimized Version
- Part V: Performance Optimization Puzzles
- Puzzle 23: Memory Coalescing
- 📚 Learn about Memory Access Patterns
- Basic Version
- Optimized Version
- Puzzle 24: Bank Conflicts
- 📚 Learn about Shared Memory Banks
- Version 1: With Conflicts
- Version 2: Conflict-Free
- Puzzle 25: Warp-Level Optimization
- 📚 Learn about Warp Primitives
- Version 1: Shared Memory Reduction
- Version 2: Warp Shuffle Reduction
- Part VI: Real-world Application Puzzles
- Puzzle 26: Image Processing Pipeline
- 📚 Learn about Kernel Fusion
- Version 1: Separate Kernels
- Version 2: Fused Pipeline
- Puzzle 27: Neural Network Layers
- 📚 Learn about Layer Fusion
- Version 1: Basic Implementation
- Version 2: Optimized Implementation
- Puzzle 28: Multi-Level Tiling
- 📚 Learn about Cache Hierarchies
- Version 1: Single-Level MatMul
- Version 2: Multi-Level MatMul
- Part VII: Debug & Profile Puzzles
- Puzzle 29: Race Condition Detective
- 📚 Learn about Race Conditions
- Version 1: Find the Bug
- Version 2: Fix the Bug
- Puzzle 30: Memory Optimization
- 📚 Learn about Memory Management
- Version 1: Memory Leaks
- Version 2: Memory Planning
- Part VIII: Modern GPU Features
- Puzzle 31: Dynamic Parallelism
- 📚 Learn about Nested Parallelism
- Version 1: Flat Implementation
- Version 2: Nested Launch
- Puzzle 32: Tensor Core Programming
- 📚 Learn about Tensor Cores
- Version 1: Regular MatMul
- Version 2: Tensor Core MatMul
- Puzzle 33: Multi-GPU Programming
- 📚 Learn about Device Communication
- Version 1: Single GPU
- Version 2: Multi-GPU