1. Getting Started
  1. 🔥 Introduction
  2. 🧭 Puzzles Usage Guide
2. Part I: GPU Fundamentals
  1. Puzzle 1: Map
    1. 🔰 Raw Memory Approach
    2. 💡 Preview: Modern Approach with LayoutTensor
  2. Puzzle 2: Zip
  3. Puzzle 3: Guards
  4. Puzzle 4: 2D Map
    1. 🔰 Raw Memory Approach
    2. 📚 Learn about LayoutTensor
    3. 🚀 Modern 2D Operations
  5. Puzzle 5: Broadcast
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  6. Puzzle 6: Blocks
  7. Puzzle 7: 2D Blocks
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  8. Puzzle 8: Shared Memory
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
3. Part II: 🐞 Debugging GPU Programs
  1. Puzzle 9: GPU Debugging Workflow
    1. 📚 Mojo GPU Debugging Essentials
    2. 🧐 Detective Work: First Case
    3. 🔍 Detective Work: Second Case
    4. 🕵 Detective Work: Third Case
  2. Puzzle 10: Memory Error Detection & Race Conditions with Sanitizers
    1. 👮🏼‍♂️ Detect Memory Violation
    2. 🏁 Debug Race Condition
4. Part III: 🧮 GPU Algorithms
  1. Puzzle 11: Pooling
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  2. Puzzle 12: Dot Product
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  3. Puzzle 13: 1D Convolution
    1. 🔰 Simple Version
    2. ⭐ Block Boundary Version
  4. Puzzle 14: Prefix Sum
    1. 🔰 Simple Version
    2. ⭐ Complete Version
  5. Puzzle 15: Axis Sum
  6. Puzzle 16: Matrix Multiplication (MatMul)
    1. 🔰 Naïve Version with Global Memory
    2. 📚 Learn about Roofline Model
    3. 🤝 Shared Memory Version
    4. 📐 Tiled Version
5. Part IV: 🐍 Interfacing with Python via MAX Graph Custom Ops
  1. Puzzle 17: 1D Convolution Op
  2. Puzzle 18: Softmax Op
  3. Puzzle 19: Attention Op
  4. 🎯 Bonus Challenges
6. Part V: 🔥 PyTorch Custom Ops Integration
  1. Puzzle 20: 1D Convolution Op
  2. Puzzle 21: Embedding Op
    1. 🔰 Coalesced vs Non-Coalesced Kernels
    2. 📊 Performance Comparison
  3. Puzzle 22: Kernel Fusion and Custom Backward Pass
    1. ⚛️ Fused vs Unfused Kernels
    2. ⛓️ Autograd Integration & Backward Pass
7. Part VI: 🌊 Mojo Functional Patterns and Benchmarking
  1. Puzzle 23: GPU Functional Programming Patterns
    1. elementwise - Basic GPU Functional Operations
    2. tile - Memory-Efficient Tiled Processing
    3. vectorize - SIMD Control
    4. 🧠 GPU Threading vs SIMD Concepts
    5. 📊 Benchmarking in Mojo
8. Part VII: ⚡ Warp-Level Programming
  1. Puzzle 24: Warp Fundamentals
    1. 🧠 Warp Lanes & SIMT Execution
    2. 🔰 warp.sum() Essentials
    3. 🤔 When to Use Warp Programming
  2. Puzzle 25: Warp Communication
    1. ⬇️ warp.shuffle_down()
    2. 📢 warp.broadcast()
  3. Puzzle 26: Advanced Warp Patterns
    1. 🦋 warp.shuffle_xor() Butterfly Networks
    2. 🔢 warp.prefix_sum() Scan Operations
9. Part VIII: 🧱 Block-Level Programming
  1. Puzzle 27: Block-Wide Patterns
    1. 🔰 block.sum() Essentials
    2. 📈 block.prefix_sum() Parallel Histogram Binning
    3. 📡 block.broadcast() Vector Normalization
10. Part IX: 🧠 Advanced Memory Systems
  1. Puzzle 28: Async Memory Operations & Copy Overlap
  2. Puzzle 29: GPU Synchronization Primitives
    1. 📶 Multi-Stage Pipeline Coordination
    2. Double-Buffered Stencil Computation
11. Part X: 📊 Performance Analysis & Optimization
  1. Puzzle 30: GPU Profiling Basics
  2. Puzzle 31: Occupancy Optimization
  3. Puzzle 32: Bank Conflicts
    1. 📚 Understanding Shared Memory Banks
    2. Conflict-Free Patterns
12. Part XI: 🚀 Advanced GPU Features
  1. Puzzle 33: Tensor Core Operations
  2. Puzzle 34: Essential TMA Operations (H100+)
  3. Puzzle 35: GPU Cluster Programming (SM90+)
    1. Thread Block Clusters
    2. cluster_sync() Coordination
    3. Advanced Cluster Patterns