- Getting Started
- 🔥 Introduction
- 🧭 Puzzles Usage Guide
- Part I: GPU Fundamentals
- Puzzle 1: Map
- 🔰 Raw Memory Approach
- 💡 Preview: Modern Approach with LayoutTensor
- Puzzle 2: Zip
- Puzzle 3: Guards
- Puzzle 4: 2D Map
- 🔰 Raw Memory Approach
- 📚 Learn about LayoutTensor
- 🚀 Modern 2D Operations
- Puzzle 5: Broadcast
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 6: Blocks
- Puzzle 7: 2D Blocks
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 8: Shared Memory
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Part II: 🐞 Debugging GPU Programs
- Puzzle 9: GPU Debugging Workflow
- 📚 Mojo GPU Debugging Essentials
- 🧐 Detective Work: First Case
- 🔍 Detective Work: Second Case
- 🕵 Detective Work: Third Case
- Puzzle 10: Memory Error Detection & Race Conditions with Sanitizers
- 👮🏼‍♂️ Detect Memory Violation
- 🏁 Debug Race Condition
- Part III: 🧮 GPU Algorithms
- Puzzle 11: Pooling
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 12: Dot Product
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 13: 1D Convolution
- 🔰 Simple Version
- ⭐ Block Boundary Version
- Puzzle 14: Prefix Sum
- 🔰 Simple Version
- ⭐ Complete Version
- Puzzle 15: Axis Sum
- Puzzle 16: Matrix Multiplication (MatMul)
- 🔰 Naïve Version with Global Memory
- 📚 Learn about Roofline Model
- 🤝 Shared Memory Version
- 📐 Tiled Version
- Part IV: 🐍 Interfacing with Python via MAX Graph Custom Ops
- Puzzle 17: 1D Convolution Op
- Puzzle 18: Softmax Op
- Puzzle 19: Attention Op
- 🎯 Bonus Challenges
- Part V: 🔥 PyTorch Custom Ops Integration
- Puzzle 20: 1D Convolution Op
- Puzzle 21: Embedding Op
- 🔰 Coalesced vs Non-Coalesced Kernels
- 📊 Performance Comparison
- Puzzle 22: Kernel Fusion and Custom Backward Pass
- ⚛️ Fused vs Unfused Kernels
- ⛓️ Autograd Integration & Backward Pass
- Part VI: 🌊 Mojo Functional Patterns and Benchmarking
- Puzzle 23: GPU Functional Programming Patterns
- elementwise - Basic GPU Functional Operations
- tile - Memory-Efficient Tiled Processing
- vectorize - SIMD Control
- 🧠 GPU Threading vs SIMD Concepts
- 📊 Benchmarking in Mojo
- Part VII: ⚡ Warp-Level Programming
- Puzzle 24: Warp Fundamentals
- 🧠 Warp lanes & SIMT execution
- 🔰 warp.sum() Essentials
- 🤔 When to Use Warp Programming
- Puzzle 25: Warp Communication
- ⬇️ warp.shuffle_down()
- 📢 warp.broadcast()
- Puzzle 26: Advanced Warp Patterns
- 🦋 warp.shuffle_xor() Butterfly Networks
- 🔢 warp.prefix_sum() Scan Operations
- Part VIII: 🧱 Block-Level Programming
- Puzzle 27: Block-Wide Patterns
- 🔰 block.sum() Essentials
- 📈 block.prefix_sum() Parallel Histogram Binning
- 📡 block.broadcast() Vector Normalization
- Part IX: 🧠 Advanced Memory Systems
- Puzzle 28: Async Memory Operations & Copy Overlap
- Puzzle 29: GPU Synchronization Primitives
- 📶 Multi-Stage Pipeline Coordination
- Double-Buffered Stencil Computation
- Part X: 📊 Performance Analysis & Optimization
- Puzzle 30: GPU Profiling Basics
- Puzzle 31: Occupancy Optimization
- Puzzle 32: Bank Conflicts
- 📚 Understanding Shared Memory Banks
- Conflict-Free Patterns
- Part XI: 🚀 Advanced GPU Features
- Puzzle 33: Tensor Core Operations
- Puzzle 34: Essential TMA Operations (H100+)
- Puzzle 35: GPU Cluster Programming (SM90+)
- Thread Block Clusters
- cluster_sync() Coordination
- Advanced Cluster Patterns