- Getting Started
- 🔥 Introduction
- 🧭 Puzzles Usage Guide
- Part I: GPU Fundamentals
- Puzzle 1: Map
- 🔰 Raw Memory Approach
- 💡 Preview: Modern Approach with LayoutTensor
- Puzzle 2: Zip
- Puzzle 3: Guards
- Puzzle 4: 2D Map
- 🔰 Raw Memory Approach
- 📚 Learn about LayoutTensor
- 🚀 Modern 2D Operations
- Puzzle 5: Broadcast
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 6: Blocks
- Puzzle 7: 2D Blocks
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 8: Shared Memory
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Part II: 🐞 Debugging GPU Programs
- Puzzle 9: GPU Debugging Workflow
- 📚 Mojo GPU Debugging Essentials
- 🧐 Detective Work: First Case
- 🔍 Detective Work: Second Case
- 🕵 Detective Work: Third Case
- Puzzle 10: Memory Error Detection & Race Conditions with Sanitizers
- 👮🏼‍♂️ Detect Memory Violation
- 🏁 Debug Race Condition
- Part III: 🧮 GPU Algorithms
- Puzzle 11: Pooling
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 12: Dot Product
- 🔰 Raw Memory Approach
- 📐 LayoutTensor Version
- Puzzle 13: 1D Convolution
- 🔰 Simple Version
- ⭐ Block Boundary Version
- Puzzle 14: Prefix Sum
- 🔰 Simple Version
- ⭐ Complete Version
- Puzzle 15: Axis Sum
- Puzzle 16: Matrix Multiplication (MatMul)
- 🔰 Naïve Version with Global Memory
- 📚 Learn about Roofline Model
- 🤝 Shared Memory Version
- 📐 Tiled Version
- Part IV: 🐍 Interfacing with Python via MAX Graph Custom Ops
- Puzzle 17: 1D Convolution Op
- Puzzle 18: Softmax Op
- Puzzle 19: Attention Op
- 🎯 Bonus Challenges
- Part V: 🔥 PyTorch Custom Ops Integration
- Puzzle 20: 1D Convolution Op
- Puzzle 21: Embedding Op
- 🔰 Coalesced vs Non-Coalesced Kernels
- 📊 Performance Comparison
- Puzzle 22: Kernel Fusion and Custom Backward Pass
- ⚛️ Fused vs Unfused Kernels
- ⛓️ Autograd Integration & Backward Pass
- Part VI: 🌊 Mojo Functional Patterns and Benchmarking
- Puzzle 23: GPU Functional Programming Patterns
- elementwise - Basic GPU Functional Operations
- tile - Memory-Efficient Tiled Processing
- vectorize - SIMD Control
- 🧠 GPU Threading vs SIMD Concepts
- 📊 Benchmarking in Mojo
- Part VII: ⚡ Warp-Level Programming
- Puzzle 24: Warp Fundamentals
- 🧠 Warp lanes & SIMT execution
- 🔰 warp.sum() Essentials
- 🤔 When to Use Warp Programming
- Puzzle 25: Warp Communication
- ⬇️ warp.shuffle_down()
- 📢 warp.broadcast()
- Puzzle 26: Advanced Warp Patterns
- 🦋 warp.shuffle_xor() Butterfly Networks
- 🔢 warp.prefix_sum() Scan Operations
- Part VIII: 🧱 Block-Level Programming
- Puzzle 27: Block-Wide Patterns
- 🔰 block.sum() Essentials
- 📈 block.prefix_sum() Parallel Histogram Binning
- 📡 block.broadcast() Vector Normalization
- Part IX: 🧠 Advanced Memory Systems
- Puzzle 28: Async Memory Operations & Copy Overlap
- Puzzle 29: GPU Synchronization Primitives
- 📶 Multi-Stage Pipeline Coordination
- Double-Buffered Stencil Computation
- Part X: 📊 Performance Analysis & Optimization
- Puzzle 30: GPU Profiling Basics
- Puzzle 31: Occupancy Optimization
- Puzzle 32: Bank Conflicts
- 📚 Understanding Shared Memory Banks
- Conflict-Free Patterns
- Part XI: 🚀 Advanced GPU Features
- Puzzle 33: Tensor Core Operations
- Puzzle 34: Essential TMA Operations (H100+)
- Puzzle 35: GPU Cluster Programming (SM90+)
- Thread Block Clusters
- cluster_sync() Coordination
- Advanced Cluster Patterns