🎯 Performance Bonus Challenge

The discovery

You’ve just completed Puzzle 33 and implemented actual Tensor Core matrix multiplication using Mojo’s TensorCore API. The implementation works correctly, passes all accuracy tests, and uses real hardware-accelerated matrix operations. But when you profile it against the simple idiomatic tiled version from Puzzle 16 …

The “specialized hardware” is orders of magnitude slower!

What went wrong?

Your profiling with (NVIDIA only) ncu revealed the brutal truth (if you need a refresher on profiling techniques, see Puzzle 10’s memory error detection and Puzzle 30’s GPU profiling):

Tensor Core version (the disappointment):

Duration: ~13.9 ms
Memory bound: 72.5% DRAM throughput (should be compute-bound!)
Poor occupancy: 26.3% (wasted hardware)
Cache disaster: 29.7% L2 hit rate
Register pressure: 68 registers per thread
Shared memory conflicts: Bank conflicts destroying performance

Tiled version (the winner):

Duration: ~1.62 ms (8.6x faster!)
Compute bound: 1.7% DRAM throughput (as expected)
Excellent occupancy: 66.7%
Cache friendly: 96.9% L2 hit rate
Efficient: 38 registers per thread
Clean memory: No significant bank conflicts

The harsh reality

This is a common story in GPU optimization: raw hardware capability ≠ actual performance. Tensor Cores are incredibly powerful, but they’re also incredibly demanding:

Memory wall: They’re so fast they expose every memory bottleneck
Resource hungry: High register usage kills occupancy
Access sensitive: Poor memory patterns destroy cache behavior
Configuration critical: Launch parameters must be perfectly tuned

Your mission: Fix the tensor core performance

The challenge: Transform your memory-bound, low-occupancy Tensor Core implementation into something that actually beats the simple tiled version.

What you need to beat:

Target duration: < 1.62 ms
Occupancy: > 26.3% baseline
DRAM pressure: < 72.5% baseline
Cache performance: > 29.7% L2 hit rate baseline

Optimization strategies to explore:

Register pressure reduction
- Use smaller accumulator tiles
- Minimize intermediate storage
- Consider mixed-precision to reduce register footprint
- Review Puzzle 16’s tiled approach for efficient accumulation patterns
Memory pattern optimization
- Add shared memory padding to eliminate bank conflicts (see shared memory concepts)
- Optimize copy_dram_to_sram_async layouts
- Improve coalescing patterns (memory access fundamentals from early puzzles)
Occupancy improvements
- Tune block sizes for better warp utilization
- Balance shared memory vs register usage
- Optimize warp-to-SM mapping
- Apply thread coordination lessons from Puzzle 11-20 series
Cache optimization
- Improve data reuse patterns
- Optimize tile sizes for cache hierarchy
- Consider data layout transformations
- Build on memory hierarchy concepts from puzzle progression
Advanced techniques
- Implement double buffering to overlap memory and compute
- Use software pipelining
- Explore async execution patterns
- Apply advanced coordination from sanitization puzzles

Success criteria

Correctness: All accuracy tests still pass
Performance: Tensor Core duration < 1.62 ms
Efficiency: Higher occupancy (>26.3%)
Memory: Lower DRAM pressure (<72.5%)
Cache: Better hit rates (>29.7% L2)

The deeper lesson

This bonus challenge teaches the most important lesson in GPU optimization: understanding bottlenecks matters more than using the latest APIs.

The goal isn’t just to make Tensor Cores faster - it’s to understand why they can be slower, how to systematically diagnose performance problems, and how to apply principled optimization techniques.

Complete this challenge, and you’ll have the skills to optimize any GPU workload, regardless of the hardware features available.