  1. Getting Started
  2. Introduction
  3. 🧩 Puzzles Usage Guide
  4. Part I: GPU Fundamentals
  5. Puzzle 1: Map
    1. 🔰 Raw Memory Approach
    2. 💡 Preview: Modern Approach with LayoutTensor
  6. Puzzle 2: Zip
  7. Puzzle 3: Guards
  8. Puzzle 4: 2D Map
    1. 🔰 Raw Memory Approach
    2. 📚 Learn about LayoutTensor
    3. 🚀 Modern 2D Operations
  9. Puzzle 5: Broadcast
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  10. Puzzle 6: Blocks
  11. Puzzle 7: 2D Blocks
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  12. Puzzle 8: Shared Memory
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  13. Part II: GPU Algorithms
  14. Puzzle 9: Pooling
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  15. Puzzle 10: Dot Product
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  16. Puzzle 11: 1D Convolution
    1. 🔰 Simple Version
    2. ⭐ Complete Version
  17. Puzzle 12: Prefix Sum
    1. 🔰 Simple Version
    2. ⭐ Complete Version
  18. Puzzle 13: Axis Sum
  19. Puzzle 14: Matrix Multiplication (MatMul)
    1. 🔰 Naive Version with Global Memory
    2. 📚 Learn about Roofline Model
    3. 🤝 Shared Memory Version
    4. 📐 Tiled Version
  20. Part III: Interfacing with Python via MAX Graph Custom Ops
  21. Puzzle 15: 1D Convolution Op
  22. Puzzle 16: Softmax Op
  23. Puzzle 17: Attention Op
  24. 🎯 Bonus Challenges
  25. Part IV: Advanced GPU Algorithms
  26. Puzzle 18: 2D Convolution Op
  27. Puzzle 19: 3D Average Pooling
    1. 📚 Learn about 3D Memory Layout
    2. Basic Version
    3. LayoutTensor Version
  28. Puzzle 20: 3D Convolution
    1. 📚 Learn about 3D Convolution
    2. Basic Version
    3. Optimized Version
  29. Puzzle 21: 3D Tensor Multiplication
    1. 📚 Learn about Tensor Operations
    2. Basic Version
    3. LayoutTensor Version
  30. Puzzle 22: Multi-Head Self-Attention
    1. 📚 Learn about Attention Mechanisms
    2. Basic Version
    3. Optimized Version
  31. Part V: Performance Optimization Puzzles
  32. Puzzle 23: Memory Coalescing
    1. 📚 Learn about Memory Access Patterns
    2. Basic Version
    3. Optimized Version
  33. Puzzle 24: Bank Conflicts
    1. 📚 Learn about Shared Memory Banks
    2. Version 1: With Conflicts
    3. Version 2: Conflict-Free
  34. Puzzle 25: Warp-Level Optimization
    1. 📚 Learn about Warp Primitives
    2. Version 1: Shared Memory Reduction
    3. Version 2: Warp Shuffle Reduction
  35. Part VI: Real-world Application Puzzles
  36. Puzzle 26: Image Processing Pipeline
    1. 📚 Learn about Kernel Fusion
    2. Version 1: Separate Kernels
    3. Version 2: Fused Pipeline
  37. Puzzle 27: Neural Network Layers
    1. 📚 Learn about Layer Fusion
    2. Version 1: Basic Implementation
    3. Version 2: Optimized Implementation
  38. Puzzle 28: Multi-Level Tiling
    1. 📚 Learn about Cache Hierarchies
    2. Version 1: Single-Level MatMul
    3. Version 2: Multi-Level MatMul
  39. Part VII: Debug & Profile Puzzles
  40. Puzzle 29: Race Condition Detective
    1. 📚 Learn about Race Conditions
    2. Version 1: Find the Bug
    3. Version 2: Fix the Bug
  41. Puzzle 30: Memory Optimization
    1. 📚 Learn about Memory Management
    2. Version 1: Memory Leaks
    3. Version 2: Memory Planning
  42. Part VIII: Modern GPU Features
  43. Puzzle 31: Dynamic Parallelism
    1. 📚 Learn about Nested Parallelism
    2. Version 1: Flat Implementation
    3. Version 2: Nested Launch
  44. Puzzle 32: Tensor Core Programming
    1. 📚 Learn about Tensor Cores
    2. Version 1: Regular MatMul
    3. Version 2: Tensor Core MatMul
  45. Puzzle 33: Multi-GPU Programming
    1. 📚 Learn about Device Communication
    2. Version 1: Single GPU
    3. Version 2: Multi-GPU