  1. 🔥 Introduction
  2. 🧭 Puzzles Usage Guide
  3. 🏆 Claim Your Rewards
  4. Puzzle 1: Map
    1. 🔰 Raw Memory Approach
    2. 💡 Preview: Modern Approach with TileTensor
  5. Puzzle 2: Zip
  6. Puzzle 3: Guards
  7. Puzzle 4: 2D Map
    1. 🔰 Raw Memory Approach
    2. 📚 Learn about TileTensor
    3. 🚀 Modern 2D Operations
  8. Puzzle 5: Broadcast
  9. Puzzle 6: Blocks
  10. Puzzle 7: 2D Blocks
  11. Puzzle 8: Shared Memory
  12. Puzzle 9: GPU Debugging Workflow
    1. 📚 Mojo GPU Debugging Essentials
    2. 🧐 Detective Work: First Case
    3. 🔍 Detective Work: Second Case
    4. 🕵 Detective Work: Third Case
  13. Puzzle 10: Memory Error Detection & Race Conditions with Sanitizers
    1. 👮🏼‍♂️ Detect Memory Violation
    2. 🏁 Debug Race Condition
  14. Puzzle 11: Pooling
  15. Puzzle 12: Dot Product
  16. Puzzle 13: 1D Convolution
    1. 🔰 Simple Version
    2. ⭐ Block Boundary Version
  17. Puzzle 14: Prefix Sum
    1. 🔰 Simple Version
    2. ⭐ Complete Version
  18. Puzzle 15: Axis Sum
  19. Puzzle 16: Matrix Multiplication (MatMul)
    1. 🔰 Naïve Version with Global Memory
    2. 📚 Learn about Roofline Model
    3. 🤝 Shared Memory Version
    4. 📐 Tiled Version
  20. Puzzle 17: 1D Convolution Op
  21. Puzzle 18: Softmax Op
  22. Puzzle 19: Attention Op
  23. 🎯 Bonus Challenges
  24. Puzzle 20: 1D Convolution Op
  25. Puzzle 21: Embedding Op
    1. 🔰 Coalesced vs Non-Coalesced Kernel
    2. 📊 Performance Comparison
  26. Puzzle 22: Kernel Fusion and Custom Backward Pass
    1. ⚛️ Fused vs Unfused Kernels
    2. ⛓️ Autograd Integration & Backward Pass
  27. Puzzle 23: GPU Functional Programming Patterns
    1. elementwise - Basic GPU Functional Operations
    2. tile - Memory-Efficient Tiled Processing
    3. vectorize - SIMD Control
    4. 🧠 GPU Threading vs SIMD Concepts
    5. 📊 Benchmarking in Mojo
  28. Puzzle 24: Warp Fundamentals
    1. 🧠 Warp lanes & SIMT execution
    2. 🔰 warp.sum() Essentials
    3. 🤔 When to Use Warp Programming
  29. Puzzle 25: Warp Communication
    1. ⬇️ warp.shuffle_down()
    2. 📢 warp.broadcast()
  30. Puzzle 26: Advanced Warp Patterns
    1. 🦋 warp.shuffle_xor() Butterfly Networks
    2. 🔢 warp.prefix_sum() Scan Operations
  31. Puzzle 27: Block-Wide Patterns
    1. 🔰 block.sum() Essentials
    2. 📈 block.prefix_sum() Parallel Histogram Binning
    3. 📡 block.broadcast() Vector Normalization
  32. Puzzle 28: Async Memory Operations & Copy Overlap
  33. Puzzle 29: GPU Synchronization Primitives
    1. 📶 Multi-Stage Pipeline Coordination
    2. Double-Buffered Stencil Computation
  34. Puzzle 30: GPU Profiling
    1. 📚 NVIDIA Profiling Basics
    2. 🕵 The Cache Hit Paradox
  35. Puzzle 31: Occupancy Optimization
  36. Puzzle 32: Bank Conflicts
    1. 📚 Understanding Shared Memory Banks
    2. Conflict-Free Patterns
  37. Puzzle 33: Tensor Core Operations
    1. 🎯 Performance Bonus Challenge
  38. Puzzle 34: GPU Cluster Programming (SM90+)
    1. 🔰 Multi-Block Coordination Basics
    2. ☸️ Cluster-Wide Collective Operations
    3. 🧠 Advanced Cluster Algorithms