1. Getting Started
  2. 🔥 Introduction
  3. 🧭 Puzzles Usage Guide
  4. 🏆 Claim Your Rewards
  5. Part I: GPU Fundamentals
  6. Puzzle 1: Map
    1. 🔰 Raw Memory Approach
    2. 💡 Preview: Modern Approach with LayoutTensor
  7. Puzzle 2: Zip
  8. Puzzle 3: Guards
  9. Puzzle 4: 2D Map
    1. 🔰 Raw Memory Approach
    2. 📚 Learn about LayoutTensor
    3. 🚀 Modern 2D Operations
  10. Puzzle 5: Broadcast
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  11. Puzzle 6: Blocks
  12. Puzzle 7: 2D Blocks
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  13. Puzzle 8: Shared Memory
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  14. Part II: 🐞 Debugging GPU Programs
  15. Puzzle 9: GPU Debugging Workflow
    1. 📚 Mojo GPU Debugging Essentials
    2. 🧐 Detective Work: First Case
    3. 🔍 Detective Work: Second Case
    4. 🕵 Detective Work: Third Case
  16. Puzzle 10: Memory Error Detection & Race Conditions with Sanitizers
    1. 👮🏼‍♂️ Detect Memory Violation
    2. 🏁 Debug Race Condition
  17. Part III: 🧮 GPU Algorithms
  18. Puzzle 11: Pooling
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  19. Puzzle 12: Dot Product
    1. 🔰 Raw Memory Approach
    2. 📐 LayoutTensor Version
  20. Puzzle 13: 1D Convolution
    1. 🔰 Simple Version
    2. ⭐ Block Boundary Version
  21. Puzzle 14: Prefix Sum
    1. 🔰 Simple Version
    2. ⭐ Complete Version
  22. Puzzle 15: Axis Sum
  23. Puzzle 16: Matrix Multiplication (MatMul)
    1. 🔰 Naïve Version with Global Memory
    2. 📚 Learn about Roofline Model
    3. 🤝 Shared Memory Version
    4. 📐 Tiled Version
  24. Part IV: 🐍 Interfacing with Python via MAX Graph Custom Ops
  25. Puzzle 17: 1D Convolution Op
  26. Puzzle 18: Softmax Op
  27. Puzzle 19: Attention Op
  28. 🎯 Bonus Challenges
  29. Part V: 🔥 PyTorch Custom Ops Integration
  30. Puzzle 20: 1D Convolution Op
  31. Puzzle 21: Embedding Op
    1. 🔰 Coalesced vs Non-Coalesced Kernel
    2. 📊 Performance Comparison
  32. Puzzle 22: Kernel Fusion and Custom Backward Pass
    1. ⚛️ Fused vs Unfused Kernels
    2. ⛓️ Autograd Integration & Backward Pass
  33. Part VI: 🌊 Mojo Functional Patterns and Benchmarking
  34. Puzzle 23: GPU Functional Programming Patterns
    1. elementwise - Basic GPU Functional Operations
    2. tile - Memory-Efficient Tiled Processing
    3. vectorize - SIMD Control
    4. 🧠 GPU Threading vs SIMD Concepts
    5. 📊 Benchmarking in Mojo
  35. Part VII: ⚡ Warp-Level Programming
  36. Puzzle 24: Warp Fundamentals
    1. 🧠 Warp lanes & SIMT execution
    2. 🔰 warp.sum() Essentials
    3. 🤔 When to Use Warp Programming
  37. Puzzle 25: Warp Communication
    1. ⬇️ warp.shuffle_down()
    2. 📢 warp.broadcast()
  38. Puzzle 26: Advanced Warp Patterns
    1. 🦋 warp.shuffle_xor() Butterfly Networks
    2. 🔢 warp.prefix_sum() Scan Operations
  39. Part VIII: 🧱 Block-Level Programming
  40. Puzzle 27: Block-Wide Patterns
    1. 🔰 block.sum() Essentials
    2. 📈 block.prefix_sum() Parallel Histogram Binning
    3. 📡 block.broadcast() Vector Normalization
  41. Part IX: 🧠 Advanced Memory Systems
  42. Puzzle 28: Async Memory Operations & Copy Overlap
  43. Puzzle 29: GPU Synchronization Primitives
    1. 📶 Multi-Stage Pipeline Coordination
    2. Double-Buffered Stencil Computation
  44. Part X: 📊 Performance Analysis & Optimization
  45. Puzzle 30: GPU Profiling
    1. 📚 NVIDIA Profiling Basics
    2. 🕵 The Cache Hit Paradox
  46. Puzzle 31: Occupancy Optimization
  47. Puzzle 32: Bank Conflicts
    1. 📚 Understanding Shared Memory Banks
    2. Conflict-Free Patterns
  48. Part XI: 🚀 Advanced GPU Features
  49. Puzzle 33: Tensor Core Operations
    1. 🎯 Performance Bonus Challenge
  50. Puzzle 34: GPU Cluster Programming (SM90+)
    1. 🔰 Multi-Block Coordination Basics
    2. ☸️ Cluster-Wide Collective Operations
    3. 🧠 Advanced Cluster Algorithms