🕵 The Cache Hit Paradox
Overview
Welcome to your first profiling detective case! You have three GPU kernels that all compute the same simple vector addition: output[i] = a[i] + b[i]. They should all perform identically, right?
Wrong! These kernels have dramatically different performance - one is orders of magnitude slower than the others. Your mission: use the profiling tools you just learned to discover why.
The challenge
Welcome to a performance mystery that will challenge everything you think you know about GPU optimization! You’re confronted with three seemingly identical vector addition kernels that compute the exact same mathematical operation:
output[i] = a[i] + b[i] // Simple arithmetic - what could go wrong?
The shocking reality:
- All three kernels produce identical, correct results
- One kernel runs dramatically slower than the others (roughly 50x lower memory throughput)
- The slowest kernel has the highest cache hit rates (counterintuitive!)
- Standard performance intuition completely fails
Your detective mission:
- Identify the performance culprit - Which kernel is catastrophically slow?
- Uncover the cache paradox - Why do high cache hits indicate poor performance?
- Decode memory access patterns - What makes identical operations behave so differently?
- Master profiling methodology - Use NSight tools to gather evidence, not guesses
Why this matters: This puzzle reveals a fundamental GPU performance principle that challenges CPU-based intuition. The skills you develop here apply to real-world GPU optimization where memory access patterns often matter more than algorithmic complexity.
The twist: We approach this without looking at the source code first, using only profiling tools as our guide - just like debugging a production performance issue. Only after we have gathered the profiling evidence do we open the code for further analysis.
Your detective toolkit
From the profiling tutorial, you have:
- NSight Systems (nsys) - Find which kernels are slow
- NSight Compute (ncu) - Analyze why kernels are slow
- Memory efficiency metrics - Detect poor access patterns
Getting started
Step 1: Run the benchmark
pixi shell -e cuda
mojo problems/p30/p30.mojo --benchmark
You'll see dramatic timing differences between kernels! One kernel is much slower than the others. Your job is to figure out why - using only the profiling tools, without looking at the code.
Example output:
| name | met (ms) | iters |
| ------- | --------- | ----- |
| kernel1 | 171.85 | 11 |
| kernel2 | 1546.68 | 11 | <- This one is much slower!
| kernel3 | 172.18 | 11 |
Step 2: Prepare your code for profiling
Critical: For accurate profiling, build with full debug information while keeping optimizations enabled:
mojo build --debug-level=full problems/p30/p30.mojo -o problems/p30/p30_profiler
Why this matters:
- Full debug info: Provides complete symbol tables, variable names, and source line mapping for profilers
- Comprehensive analysis: Enables NSight tools to correlate performance data with specific code locations
- Optimizations enabled: Ensures realistic performance measurements that match production builds
Step 3: System-wide investigation (NSight Systems)
Profile each kernel to see the big picture:
# Profile each kernel individually using the optimized build (with warmup to avoid cold start effects)
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel1_profile ./problems/p30/p30_profiler --kernel1
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel2_profile ./problems/p30/p30_profiler --kernel2
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel3_profile ./problems/p30/p30_profiler --kernel3
# Analyze the results
nsys stats --force-export=true ./problems/p30/kernel1_profile.nsys-rep > ./problems/p30/kernel1_profile.txt
nsys stats --force-export=true ./problems/p30/kernel2_profile.nsys-rep > ./problems/p30/kernel2_profile.txt
nsys stats --force-export=true ./problems/p30/kernel3_profile.nsys-rep > ./problems/p30/kernel3_profile.txt
Look for:
- GPU Kernel Summary - Which kernels take longest?
- Kernel execution times - How much do they vary?
- Memory transfer patterns - Are they similar across implementations?
Step 4: Kernel deep-dive (NSight Compute)
Once you identify the slow kernel, analyze it with NSight Compute:
# Deep-dive into memory patterns for each kernel using the optimized build
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel1_analysis ./problems/p30/p30_profiler --kernel1
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel2_analysis ./problems/p30/p30_profiler --kernel2
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel3_analysis ./problems/p30/p30_profiler --kernel3
# View the results
ncu --import ./problems/p30/kernel1_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel2_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel3_analysis.ncu-rep --page details
When you run these commands, you’ll see output like this:
Kernel1: Memory Throughput: ~308 Gbyte/s, Max Bandwidth: ~51%
Kernel2: Memory Throughput: ~6 Gbyte/s, Max Bandwidth: ~12%
Kernel3: Memory Throughput: ~310 Gbyte/s, Max Bandwidth: ~52%
Key metrics to investigate:
- Memory Throughput (Gbyte/s) - Actual memory bandwidth achieved
- Max Bandwidth (%) - Percentage of theoretical peak bandwidth utilized
- L1/TEX Hit Rate (%) - L1 cache efficiency
- L2 Hit Rate (%) - L2 cache efficiency
🤔 The Counterintuitive Result: You’ll notice Kernel2 has the highest cache hit rates but the lowest performance! This is the key mystery to solve.
Step 5: Detective questions
Use your profiling evidence, together with the kernel code in problems/p30/p30.mojo, to answer these questions:
Performance Analysis:
- Which kernel achieves the highest Memory Throughput? (Look at Gbyte/s values)
- Which kernel has the lowest Max Bandwidth utilization? (Compare percentages)
- What’s the performance gap in memory throughput? (Factor difference between fastest and slowest)
The Cache Paradox:
- Which kernel has the highest L1/TEX Hit Rate?
- Which kernel has the highest L2 Hit Rate?
- 🤯 Why does the kernel with the BEST cache hit rates perform the WORST?
Memory Access Detective Work:
- Can high cache hit rates actually indicate a performance problem?
- What memory access pattern would cause high cache hits but low throughput?
- Why might “efficient caching” be a symptom of “inefficient memory access”?
The “Aha!” Moment:
- Based on the profiling evidence, what fundamental GPU memory principle does this demonstrate?
Key insight to discover: Sometimes high cache hit rates are a red flag, not a performance victory!
Solution
The mystery reveals a fundamental GPU performance principle: memory access patterns dominate performance for memory-bound operations, even when kernels perform identical computations.
The profiling evidence reveals:
- Performance hierarchy: Kernel1 and Kernel3 are fast, Kernel2 is catastrophically slow (orders of magnitude difference)
- Memory throughput tells the story: Fast kernels achieve high bandwidth utilization, slow kernel achieves minimal utilization
- The cache paradox: The slowest kernel has the highest cache hit rates - revealing that high cache hits can indicate poor memory access patterns
- Memory access patterns matter more than algorithmic complexity for memory-bound GPU workloads
Complete Solution with Enhanced Explanation
This profiling detective case demonstrates how memory access patterns create orders-of-magnitude performance differences, even when kernels perform identical mathematical operations.
Performance evidence from profiling
NSight Systems Timeline Analysis:
- Kernel 1: Short execution time - EFFICIENT
- Kernel 3: Similar to Kernel 1 - EFFICIENT
- Kernel 2: Dramatically longer execution time - INEFFICIENT
NSight Compute Memory Analysis (Hardware-Agnostic Patterns):
- Efficient kernels (1 & 3): High memory throughput, good bandwidth utilization, moderate cache hit rates
- Inefficient kernel (2): Very low memory throughput, poor bandwidth utilization, extremely high cache hit rates
The cache paradox revealed
🤯 The Counterintuitive Discovery:
- Kernel2 has the HIGHEST cache hit rates but WORST performance
- This challenges conventional wisdom: “High cache hits = good performance”
- The truth: High cache hit rates can be a symptom of inefficient memory access patterns
Why the Cache Paradox Occurs:
Traditional CPU intuition (INCORRECT for GPUs):
- Higher cache hit rates always mean better performance
- Cache hits reduce memory traffic, improving efficiency
GPU memory reality (CORRECT understanding):
- Coalescing matters more than caching for memory-bound workloads
- Poor access patterns can cause artificial cache hit inflation
- Memory bandwidth utilization is the real performance indicator
Root cause analysis - memory access patterns
Actual Kernel Implementations from p30.mojo:
Kernel 1 - Efficient Coalesced Access:
fn kernel1[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + b[i]
Standard thread indexing - adjacent threads access adjacent memory
Kernel 2 - Inefficient Strided Access:
fn kernel2[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    tid = block_idx.x * block_dim.x + thread_idx.x
    stride = 512
    i = tid
    while i < size:
        output[i] = a[i] + b[i]
        i += stride
Large stride=512 creates memory access gaps - same operation but scattered access
Kernel 3 - Efficient Reverse Access:
fn kernel3[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    tid = block_idx.x * block_dim.x + thread_idx.x
    total_threads = (SIZE // 1024) * 1024
    for step in range(0, size, total_threads):
        forward_i = step + tid
        if forward_i < size:
            reverse_i = size - 1 - forward_i
            output[reverse_i] = a[reverse_i] + b[reverse_i]
Reverse indexing but still predictable - adjacent threads access adjacent addresses (just backwards)
Pattern Analysis:
- Kernel 1: Classic coalesced access - adjacent threads access adjacent memory
- Kernel 2: Catastrophic strided access - threads jump by 512 elements
- Kernel 3: Reverse but still coalesced within warps - predictable pattern
Understanding the memory system
GPU Memory Architecture Fundamentals:
- Warp execution: 32 threads execute together
- Cache line size: 128 bytes (32 float32 values)
- Coalescing requirement: Adjacent threads should access adjacent memory (see the sketch below)
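To make the coalescing rule concrete, here is a minimal host-side sketch (a hypothetical helper, not part of p30.mojo) that counts how many distinct cache lines a single 32-thread warp touches when consecutive threads access elements spaced stride float32 values apart, assuming the 4-byte elements and 128-byte cache lines described above:
# Hypothetical helper (host-side, no GPU needed): distinct cache lines touched
# by one warp when adjacent threads access elements `stride_elements` apart.
# Assumes float32 (4 bytes) and 128-byte cache lines, as stated above.
fn cache_lines_per_warp(stride_elements: Int) -> Int:
    alias WARP_SIZE = 32
    alias ELEM_BYTES = 4
    alias LINE_BYTES = 128
    var lines = 0
    var prev_line = -1
    for t in range(WARP_SIZE):
        # byte offset of thread t's access, reduced to a cache line index
        var line = (t * stride_elements * ELEM_BYTES) // LINE_BYTES
        if line != prev_line:  # offsets grow monotonically, so count the changes
            lines += 1
            prev_line = line
    return lines

fn main():
    print("stride 1:  ", cache_lines_per_warp(1))    # 1 line  -> fully coalesced warp
    print("stride 512:", cache_lines_per_warp(512))  # 32 lines -> one line per thread
One fetched line serving the whole warp versus one line per thread is exactly the gap that shows up in the profiler as bandwidth utilization.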
p30.mojo Configuration Details:
alias SIZE = 16 * 1024 * 1024 # 16M elements (64MB of float32 data)
alias THREADS_PER_BLOCK = (1024, 1) # 1024 threads per block
alias BLOCKS_PER_GRID = (SIZE // 1024, 1) # 16,384 blocks total
alias dtype = DType.float32 # 4 bytes per element
Why these settings matter:
- Large dataset (16M): Makes memory access patterns clearly visible
- 1024 threads/block: Maximum CUDA threads per block
- 32 warps/block: Each block contains 32 warps of 32 threads each
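As a rough sanity check on these settings and the throughput figures from Step 4, the illustrative calculation below (assuming each element of a and b is read once and each element of output is written once) estimates the minimum data volume per vector-addition pass and how long that volume would take at the measured bandwidths:
# Back-of-the-envelope sketch: minimum traffic for one vector-addition pass
# (2 reads + 1 write per element) and the ideal time at a given bandwidth.
# The 308 GB/s and 6 GB/s figures are the rough Kernel1/Kernel2 numbers from Step 4.
fn main():
    alias SIZE = 16 * 1024 * 1024                      # 16M float32 elements
    alias ELEM_BYTES = 4
    var bytes_per_pass = 3 * SIZE * ELEM_BYTES         # ~192 MB moved at minimum
    var gb = Float64(bytes_per_pass) / 1e9
    print("data per pass (GB):", gb)
    print("ideal time at 308 GB/s (ms):", gb / 308.0 * 1e3)  # well under 1 ms
    print("ideal time at   6 GB/s (ms):", gb / 6.0 * 1e3)    # tens of milliseconds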
Memory Access Efficiency Visualization:
KERNEL 1 (Coalesced):              KERNEL 2 (Strided by 512):
Warp threads 0-31:                 Warp threads 0-31:
Thread 0:  Memory[0]               Thread 0:  Memory[0]
Thread 1:  Memory[1]               Thread 1:  Memory[512]
Thread 2:  Memory[2]               Thread 2:  Memory[1024]
...                                ...
Thread 31: Memory[31]              Thread 31: Memory[15872]
Result: 1 cache line fetch         Result: 32 separate cache line fetches
Status: ~308 GB/s throughput       Status: ~6 GB/s throughput
Cache:  Efficient utilization      Cache:  Same lines hit repeatedly!
KERNEL 3 (Reverse but Coalesced):
Warp threads 0-31 (first iteration):
Thread 0: Memory[SIZE-1] (reverse_i = SIZE-1-0)
Thread 1: Memory[SIZE-2] (reverse_i = SIZE-1-1)
Thread 2: Memory[SIZE-3] (reverse_i = SIZE-1-2)
...
Thread 31: Memory[SIZE-32] (reverse_i = SIZE-1-31)
Result: Adjacent addresses (just backwards)
Status: ~310 GB/s throughput (nearly identical to Kernel 1)
Cache: Efficient utilization despite reverse order
The cache paradox explained
Why Kernel2 (stride=512) has high cache hit rates but poor performance:
The stride=512 disaster explained:
# Each thread processes multiple elements with huge gaps:
Thread 0: elements [0, 512, 1024, 1536, 2048, ...]
Thread 1: elements [1, 513, 1025, 1537, 2049, ...]
Thread 2: elements [2, 514, 1026, 1538, 2050, ...]
...
Why this creates the cache paradox:
- Cache line repetition: The strided walks of different threads keep landing on cache lines that were already fetched, so the same lines are touched over and over
- False efficiency illusion: Same cache lines accessed repeatedly = artificially high “hit rates”
- Bandwidth catastrophe: 32 threads Ă— 32 separate cache lines = massive memory traffic
- Warp execution mismatch: GPU designed for coalesced access, but getting scattered access
Concrete example with float32 (4 bytes each):
- Cache line: 128 bytes = 32 float32 values
- Stride 512: Thread jumps by 512Ă—4 = 2048 bytes = 16 cache lines apart!
- Warp impact: 32 threads need 32 different cache lines instead of 1
The key insight: High cache hits in Kernel2 are repeated access to inefficiently fetched data, not smart caching!
Profiling methodology insights
Systematic Detective Approach:
Phase 1: NSight Systems (Big Picture)
- Identify which kernels are slow
- Rule out obvious bottlenecks (memory transfers, API overhead)
- Focus on kernel execution time differences
Phase 2: NSight Compute (Deep Analysis)
- Analyze memory throughput metrics
- Compare bandwidth utilization percentages
- Investigate cache hit rates and patterns
Phase 3: Connect Evidence to Theory
PROFILING EVIDENCE → CODE ANALYSIS:
NSight Compute Results: Actual Code Pattern:
- Kernel1: ~308 GB/s → i = block_idx*block_dim + thread_idx (coalesced)
- Kernel2: ~6 GB/s, 99% L2 hits → i += 512 (catastrophic stride)
- Kernel3: ~310 GB/s → reverse_i = size-1-forward_i (reverse coalesced)
The profiler data directly reveals the memory access efficiency!
Evidence-to-Code Connection:
- High throughput + normal cache rates = Coalesced access (Kernels 1 & 3)
- Low throughput + high cache rates = Inefficient strided access (Kernel 2)
- Memory bandwidth utilization reveals true efficiency regardless of cache statistics
Real-world performance implications
This pattern affects many GPU applications:
Scientific Computing:
- Stencil computations: Neighbor access patterns in grid simulations
- Linear algebra: Matrix traversal order (row-major vs column-major)
- PDE solvers: Grid point access patterns in finite difference methods
Graphics and Image Processing:
- Texture filtering: Sample access patterns in shaders
- Image convolution: Filter kernel memory access
- Color space conversion: Channel interleaving strategies
Machine Learning:
- Matrix operations: Memory layout optimization in GEMM
- Tensor contractions: Multi-dimensional array access patterns
- Data loading: Batch processing and preprocessing pipelines
Fundamental GPU optimization principles
Memory-First Optimization Strategy:
- Memory patterns dominate: Access patterns often matter more than algorithmic complexity
- Coalescing is critical: Design for adjacent threads accessing adjacent memory
- Measure bandwidth utilization: Focus on actual throughput, not just cache statistics
- Profile systematically: Use NSight tools to identify real bottlenecks
Key Technical Insights:
- Memory-bound workloads: Bandwidth utilization determines performance
- Cache metrics can mislead: High hit rates don’t always indicate efficiency
- Warp-level thinking: Design access patterns for 32-thread execution groups
- Hardware-aware programming: Understanding GPU memory hierarchy is essential
Key takeaways
This detective case reveals that GPU performance optimization requires abandoning CPU intuition for memory-centric thinking:
Critical insights:
- High cache hit rates can indicate poor memory access patterns (not good performance)
- Memory bandwidth utilization matters more than cache statistics
- Simple coalesced patterns often outperform complex algorithms
- Profiling tools reveal counterintuitive performance truths
Practical methodology:
- Profile systematically with NSight Systems and NSight Compute
- Design for adjacent threads accessing adjacent memory (coalescing)
- Let profiler evidence guide optimization decisions, not intuition
The cache paradox demonstrates that high-level metrics can mislead without architectural understanding - applicable far beyond GPU programming.