🕵 The Cache Hit Paradox
Overview
Welcome to your first profiling detective case! You have three GPU kernels that all compute the same simple vector addition: output[i] = a[i] + b[i]. They should all perform identically, right?
Wrong! These kernels have dramatically different performance - one is orders of magnitude slower than the others. Your mission: use the profiling tools you just learned to discover why.
The challenge
Welcome to a performance mystery that will challenge everything you think you know about GPU optimization! You’re confronted with three seemingly identical vector addition kernels that compute the exact same mathematical operation:
output[i] = a[i] + b[i] // Simple arithmetic - what could go wrong?
The shocking reality:
- All three kernels produce identical, correct results
- One kernel runs dramatically slower than the others (roughly 50x lower memory throughput)
- The slowest kernel has the highest cache hit rates (counterintuitive!)
- Standard performance intuition completely fails
Your detective mission:
- Identify the performance culprit - Which kernel is catastrophically slow?
- Uncover the cache paradox - Why do high cache hits indicate poor performance?
- Decode memory access patterns - What makes identical operations behave so differently?
- Master profiling methodology - Use NSight tools to gather evidence, not guesses
Why this matters: This puzzle reveals a fundamental GPU performance principle that challenges CPU-based intuition. The skills you develop here apply to real-world GPU optimization where memory access patterns often matter more than algorithmic complexity.
The twist: We approach this without looking at the source code first, using only profiling tools as our guide - just like debugging a production performance issue. Only after we have gathered the profiling evidence do we open the code for further analysis.
Your detective toolkit
From the profiling tutorial, you have:
- NSight Systems (nsys) - Find which kernels are slow
- NSight Compute (ncu) - Analyze why kernels are slow
- Memory efficiency metrics - Detect poor access patterns
Getting started
Step 1: Run the benchmark
pixi shell -e cuda
mojo problems/p30/p30.mojo --benchmark
You'll see dramatic timing differences between kernels! One kernel is much slower than the others. Your job is to figure out why - using only the profiling tools, without looking at the code.
Example output:
| name | met (ms) | iters |
| ------- | --------- | ----- |
| kernel1 | 171.85 | 11 |
| kernel2 | 1546.68 | 11 | <- This one is much slower!
| kernel3 | 172.18 | 11 |
Step 2: Prepare your code for profiling
Critical: For accurate profiling, build with full debug information while keeping optimizations enabled:
mojo build --debug-level=full problems/p30/p30.mojo -o problems/p30/p30_profiler
Why this matters:
- Full debug info: Provides complete symbol tables, variable names, and source line mapping for profilers
- Comprehensive analysis: Enables NSight tools to correlate performance data with specific code locations
- Optimizations enabled: Ensures realistic performance measurements that match production builds
Step 3: System-wide investigation (NSight Systems)
Profile each kernel to see the big picture:
# Profile each kernel individually using the optimized build (with warmup to avoid cold start effects)
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel1_profile ./problems/p30/p30_profiler --kernel1
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel2_profile ./problems/p30/p30_profiler --kernel2
nsys profile --trace=cuda,osrt,nvtx --delay=2 --output=./problems/p30/kernel3_profile ./problems/p30/p30_profiler --kernel3
# Analyze the results
nsys stats --force-export=true ./problems/p30/kernel1_profile.nsys-rep > ./problems/p30/kernel1_profile.txt
nsys stats --force-export=true ./problems/p30/kernel2_profile.nsys-rep > ./problems/p30/kernel2_profile.txt
nsys stats --force-export=true ./problems/p30/kernel3_profile.nsys-rep > ./problems/p30/kernel3_profile.txt
Look for:
- GPU Kernel Summary - Which kernels take longest?
- Kernel execution times - How much do they vary?
- Memory transfer patterns - Are they similar across implementations?
Step 4: Kernel deep-dive (NSight Compute)
Once you identify the slow kernel, analyze it with NSight Compute:
# Deep-dive into memory patterns for each kernel using the optimized build
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel1_analysis ./problems/p30/p30_profiler --kernel1
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel2_analysis ./problems/p30/p30_profiler --kernel2
ncu --set=@roofline --section=MemoryWorkloadAnalysis -f -o ./problems/p30/kernel3_analysis ./problems/p30/p30_profiler --kernel3
# View the results
ncu --import ./problems/p30/kernel1_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel2_analysis.ncu-rep --page details
ncu --import ./problems/p30/kernel3_analysis.ncu-rep --page details
When you run these commands, you’ll see output like this:
Kernel1: Memory Throughput: ~308 Gbyte/s, Max Bandwidth: ~51%
Kernel2: Memory Throughput: ~6 Gbyte/s, Max Bandwidth: ~12%
Kernel3: Memory Throughput: ~310 Gbyte/s, Max Bandwidth: ~52%
Key metrics to investigate:
- Memory Throughput (Gbyte/s) - Actual memory bandwidth achieved
- Max Bandwidth (%) - Percentage of theoretical peak bandwidth utilized
- L1/TEX Hit Rate (%) - L1 cache efficiency
- L2 Hit Rate (%) - L2 cache efficiency
🤔 The Counterintuitive Result: You’ll notice Kernel2 has the highest cache hit rates but the lowest performance! This is the key mystery to solve.
Step 5: Detective questions
Use your profiling evidence, together with the kernel code in problems/p30/p30.mojo, to answer these questions:
Performance Analysis:
- Which kernel achieves the highest Memory Throughput? (Look at Gbyte/s values)
- Which kernel has the lowest Max Bandwidth utilization? (Compare percentages)
- What’s the performance gap in memory throughput? (Factor difference between fastest and slowest)
The Cache Paradox:
- Which kernel has the highest L1/TEX Hit Rate?
- Which kernel has the highest L2 Hit Rate?
- 🤯 Why does the kernel with the BEST cache hit rates perform the WORST?
Memory Access Detective Work:
- Can high cache hit rates actually indicate a performance problem?
- What memory access pattern would cause high cache hits but low throughput?
- Why might “efficient caching” be a symptom of “inefficient memory access”?
The “Aha!” Moment:
- Based on the profiling evidence, what fundamental GPU memory principle does this demonstrate?
Key insight to discover: Sometimes high cache hit rates are a red flag, not a performance victory!
Solution
The mystery reveals a fundamental GPU performance principle: memory access patterns dominate performance for memory-bound operations, even when kernels perform identical computations.
The profiling evidence reveals:
- Performance hierarchy: Kernel1 and Kernel3 are fast, Kernel2 is catastrophically slow (orders of magnitude difference)
- Memory throughput tells the story: Fast kernels achieve high bandwidth utilization, slow kernel achieves minimal utilization
- The cache paradox: The slowest kernel has the highest cache hit rates - revealing that high cache hits can indicate poor memory access patterns
- Memory access patterns matter more than algorithmic complexity for memory-bound GPU workloads
Complete Solution with Enhanced Explanation
This profiling detective case demonstrates how memory access patterns create orders-of-magnitude performance differences, even when kernels perform identical mathematical operations.
Performance evidence from profiling
NSight Systems Timeline Analysis:
- Kernel 1: Short execution time - EFFICIENT
- Kernel 3: Similar to Kernel 1 - EFFICIENT
- Kernel 2: Dramatically longer execution time - INEFFICIENT
NSight Compute Memory Analysis (Hardware-Agnostic Patterns):
- Efficient kernels (1 & 3): High memory throughput, good bandwidth utilization, moderate cache hit rates
- Inefficient kernel (2): Very low memory throughput, poor bandwidth utilization, extremely high cache hit rates
The cache paradox revealed
🤯 The Counterintuitive Discovery:
- Kernel2 has the HIGHEST cache hit rates but WORST performance
- This challenges conventional wisdom: “High cache hits = good performance”
- The truth: High cache hit rates can be a symptom of inefficient memory access patterns
Why the Cache Paradox Occurs:
Traditional CPU intuition (INCORRECT for GPUs):
- Higher cache hit rates always mean better performance
- Cache hits reduce memory traffic, improving efficiency
GPU memory reality (CORRECT understanding):
- Coalescing matters more than caching for memory-bound workloads
- Poor access patterns can cause artificial cache hit inflation
- Memory bandwidth utilization is the real performance indicator
Root cause analysis - memory access patterns
Actual Kernel Implementations from p30.mojo:
Kernel 1 - Efficient Coalesced Access:
fn kernel1[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    i = block_dim.x * block_idx.x + thread_idx.x
    if i < size:
        output[i] = a[i] + b[i]
Standard thread indexing - adjacent threads access adjacent memory
Kernel 2 - Inefficient Strided Access:
fn kernel2[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    tid = block_idx.x * block_dim.x + thread_idx.x
    stride = 512
    i = tid
    while i < size:
        output[i] = a[i] + b[i]
        i += stride
Large stride=512 creates memory access gaps - same operation but scattered access
Kernel 3 - Efficient Reverse Access:
fn kernel3[
    layout: Layout
](
    output: LayoutTensor[mut=True, dtype, layout],
    a: LayoutTensor[mut=False, dtype, layout],
    b: LayoutTensor[mut=False, dtype, layout],
    size: Int,
):
    tid = block_idx.x * block_dim.x + thread_idx.x
    total_threads = (SIZE // 1024) * 1024
    for step in range(0, size, total_threads):
        forward_i = step + tid
        if forward_i < size:
            reverse_i = size - 1 - forward_i
            output[reverse_i] = a[reverse_i] + b[reverse_i]
Reverse indexing but still predictable - adjacent threads access adjacent addresses (just backwards)
Pattern Analysis:
- Kernel 1: Classic coalesced access - adjacent threads access adjacent memory
- Kernel 2: Catastrophic strided access - threads jump by 512 elements
- Kernel 3: Reverse but still coalesced within warps - predictable pattern
Understanding the memory system
GPU Memory Architecture Fundamentals:
- Warp execution: 32 threads execute together
- Cache line size: 128 bytes (32 float32 values)
- Coalescing requirement: Adjacent threads should access adjacent memory (see the sketch below)
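To make the coalescing rule concrete, here is a minimal host-side sketch (a hypothetical helper, not part of p30.mojo) that counts how many distinct cache lines a single 32-thread warp touches when consecutive threads access elements spaced stride float32 values apart, assuming the 4-byte elements and 128-byte cache lines described above:
# Hypothetical helper (host-side, no GPU needed): distinct cache lines touched
# by one warp when adjacent threads access elements `stride_elements` apart.
# Assumes float32 (4 bytes) and 128-byte cache lines, as stated above.
fn cache_lines_per_warp(stride_elements: Int) -> Int:
    alias WARP_SIZE = 32
    alias ELEM_BYTES = 4
    alias LINE_BYTES = 128
    var lines = 0
    var prev_line = -1
    for t in range(WARP_SIZE):
        # byte offset of thread t's access, reduced to a cache line index
        var line = (t * stride_elements * ELEM_BYTES) // LINE_BYTES
        if line != prev_line:  # offsets grow monotonically, so count the changes
            lines += 1
            prev_line = line
    return lines

fn main():
    print("stride 1:  ", cache_lines_per_warp(1))    # 1 line  -> fully coalesced warp
    print("stride 512:", cache_lines_per_warp(512))  # 32 lines -> one line per thread
One fetched line serving the whole warp versus one line per thread is exactly the gap that shows up in the profiler as bandwidth utilization.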
p30.mojo Configuration Details:
alias SIZE = 16 * 1024 * 1024 # 16M elements (64MB of float32 data)
alias THREADS_PER_BLOCK = (1024, 1) # 1024 threads per block
alias BLOCKS_PER_GRID = (SIZE // 1024, 1) # 16,384 blocks total
alias dtype = DType.float32 # 4 bytes per element
Why these settings matter:
- Large dataset (16M): Makes memory access patterns clearly visible
- 1024 threads/block: Maximum CUDA threads per block
- 32 warps/block: Each block contains 32 warps of 32 threads each
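As a rough sanity check on these settings and the throughput figures from Step 4, the illustrative calculation below (assuming each element of a and b is read once and each element of output is written once) estimates the minimum data volume per vector-addition pass and how long that volume would take at the measured bandwidths:
# Back-of-the-envelope sketch: minimum traffic for one vector-addition pass
# (2 reads + 1 write per element) and the ideal time at a given bandwidth.
# The 308 GB/s and 6 GB/s figures are the rough Kernel1/Kernel2 numbers from Step 4.
fn main():
    alias SIZE = 16 * 1024 * 1024                      # 16M float32 elements
    alias ELEM_BYTES = 4
    var bytes_per_pass = 3 * SIZE * ELEM_BYTES         # ~192 MB moved at minimum
    var gb = Float64(bytes_per_pass) / 1e9
    print("data per pass (GB):", gb)
    print("ideal time at 308 GB/s (ms):", gb / 308.0 * 1e3)  # well under 1 ms
    print("ideal time at   6 GB/s (ms):", gb / 6.0 * 1e3)    # tens of milliseconds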
Memory Access Efficiency Visualization:
KERNEL 1 (Coalesced):              KERNEL 2 (Strided by 512):
Warp threads 0-31:                 Warp threads 0-31:
Thread 0:  Memory[0]               Thread 0:  Memory[0]
Thread 1:  Memory[1]               Thread 1:  Memory[512]
Thread 2:  Memory[2]               Thread 2:  Memory[1024]
...                                ...
Thread 31: Memory[31]              Thread 31: Memory[15872]
Result: 1 cache line fetch         Result: 32 separate cache line fetches
Status: ~308 GB/s throughput       Status: ~6 GB/s throughput
Cache:  Efficient utilization      Cache:  Same lines hit repeatedly!
KERNEL 3 (Reverse but Coalesced):
Warp threads 0-31 (first iteration):
Thread 0: Memory[SIZE-1] (reverse_i = SIZE-1-0)
Thread 1: Memory[SIZE-2] (reverse_i = SIZE-1-1)
Thread 2: Memory[SIZE-3] (reverse_i = SIZE-1-2)
...
Thread 31: Memory[SIZE-32] (reverse_i = SIZE-1-31)
Result: Adjacent addresses (just backwards)
Status: ~310 GB/s throughput (nearly identical to Kernel 1)
Cache: Efficient utilization despite reverse order
The cache paradox explained
Why Kernel2 (stride=512) has high cache hit rates but poor performance:
The stride=512 disaster explained:
# Each thread processes multiple elements with huge gaps:
Thread 0: elements [0, 512, 1024, 1536, 2048, ...]
Thread 1: elements [1, 513, 1025, 1537, 2049, ...]
Thread 2: elements [2, 514, 1026, 1538, 2050, ...]
...
Why this creates the cache paradox:
- Cache line repetition: The strided walks of different threads keep landing on cache lines that were already fetched, so the same lines are touched over and over
- False efficiency illusion: Same cache lines accessed repeatedly = artificially high “hit rates”
- Bandwidth catastrophe: 32 threads Ă— 32 separate cache lines = massive memory traffic
- Warp execution mismatch: GPU designed for coalesced access, but getting scattered access
Concrete example with float32 (4 bytes each):
- Cache line: 128 bytes = 32 float32 values
- Stride 512: Thread jumps by 512Ă—4 = 2048 bytes = 16 cache lines apart!
- Warp impact: 32 threads need 32 different cache lines instead of 1
The key insight: High cache hits in Kernel2 are repeated access to inefficiently fetched data, not smart caching!
Profiling methodology insights
Systematic Detective Approach:
Phase 1: NSight Systems (Big Picture)
- Identify which kernels are slow
- Rule out obvious bottlenecks (memory transfers, API overhead)
- Focus on kernel execution time differences
Phase 2: NSight Compute (Deep Analysis)
- Analyze memory throughput metrics
- Compare bandwidth utilization percentages
- Investigate cache hit rates and patterns
Phase 3: Connect Evidence to Theory
PROFILING EVIDENCE → CODE ANALYSIS:
NSight Compute Results: Actual Code Pattern:
- Kernel1: ~308 GB/s → i = block_idx*block_dim + thread_idx (coalesced)
- Kernel2: ~6 GB/s, 99% L2 hits → i += 512 (catastrophic stride)
- Kernel3: ~310 GB/s → reverse_i = size-1-forward_i (reverse coalesced)
The profiler data directly reveals the memory access efficiency!
Evidence-to-Code Connection:
- High throughput + normal cache rates = Coalesced access (Kernels 1 & 3)
- Low throughput + high cache rates = Inefficient strided access (Kernel 2)
- Memory bandwidth utilization reveals true efficiency regardless of cache statistics
Real-world performance implications
This pattern affects many GPU applications:
Scientific Computing:
- Stencil computations: Neighbor access patterns in grid simulations
- Linear algebra: Matrix traversal order (row-major vs column-major)
- PDE solvers: Grid point access patterns in finite difference methods
Graphics and Image Processing:
- Texture filtering: Sample access patterns in shaders
- Image convolution: Filter kernel memory access
- Color space conversion: Channel interleaving strategies
Machine Learning:
- Matrix operations: Memory layout optimization in GEMM
- Tensor contractions: Multi-dimensional array access patterns
- Data loading: Batch processing and preprocessing pipelines
Fundamental GPU optimization principles
Memory-First Optimization Strategy:
- Memory patterns dominate: Access patterns often matter more than algorithmic complexity
- Coalescing is critical: Design for adjacent threads accessing adjacent memory
- Measure bandwidth utilization: Focus on actual throughput, not just cache statistics
- Profile systematically: Use NSight tools to identify real bottlenecks
Key Technical Insights:
- Memory-bound workloads: Bandwidth utilization determines performance
- Cache metrics can mislead: High hit rates don’t always indicate efficiency
- Warp-level thinking: Design access patterns for 32-thread execution groups
- Hardware-aware programming: Understanding GPU memory hierarchy is essential
Key takeaways
This detective case reveals that GPU performance optimization requires abandoning CPU intuition for memory-centric thinking:
Critical insights:
- High cache hit rates can indicate poor memory access patterns (not good performance)
- Memory bandwidth utilization matters more than cache statistics
- Simple coalesced patterns often outperform complex algorithms
- Profiling tools reveal counterintuitive performance truths
Practical methodology:
- Profile systematically with NSight Systems and NSight Compute
- Design for adjacent threads accessing adjacent memory (coalescing)
- Let profiler evidence guide optimization decisions, not intuition
The cache paradox demonstrates that high-level metrics can mislead without architectural understanding - applicable far beyond GPU programming.