📚 NVIDIA Profiling Basics

Overview

You’ve learned GPU programming fundamentals and advanced patterns. Part II covered debugging for correctness with compute-sanitizer and cuda-gdb, while other parts covered GPU features such as warp programming, memory systems, and block-level operations. Your kernels work correctly - but are they fast?

This tutorial follows NVIDIA’s recommended profiling methodology from the CUDA Best Practices Guide.

Key Insight: A correct kernel can still be orders of magnitude slower than optimal. Profiling bridges the gap between working code and high-performance code.

The profiling toolkit

Since you have cuda-toolkit via pixi, you have access to NVIDIA’s professional profiling suite:

NSight Systems (nsys) - the “big picture” tool

Purpose: System-wide performance analysis (NSight Systems Documentation)

  • Timeline view of CPU-GPU interaction
  • Memory transfer bottlenecks
  • Kernel launch overhead
  • Multi-GPU coordination
  • API call tracing

Available interfaces: Command-line (nsys) and GUI (nsys-ui)

Use when:

  • Understanding overall application flow
  • Identifying CPU-GPU synchronization issues
  • Analyzing memory transfer patterns
  • Finding kernel launch bottlenecks
# See the help
pixi run nsys --help

# Basic system-wide profiling
pixi run nsys profile --trace=cuda,nvtx --output=timeline mojo your_program.mojo

# Interactive analysis
pixi run nsys stats --force-export=true timeline.nsys-rep

NSight Compute (ncu) - the “kernel deep-dive” tool

Purpose: Detailed single-kernel performance analysis (NSight Compute Documentation)

  • Roofline model analysis
  • Memory hierarchy utilization
  • Warp execution efficiency
  • Register/shared memory usage
  • Compute unit utilization

Available interfaces: Command-line (ncu) and GUI (ncu-ui)

Use when:

  • Optimizing specific kernel performance
  • Understanding memory access patterns
  • Analyzing compute vs memory bound kernels
  • Identifying warp divergence issues
# See the help
pixi run ncu --help

# Detailed kernel profiling
pixi run ncu --set full --output kernel_profile mojo your_program.mojo

# Focus on specific kernels
pixi run ncu --kernel-name regex:your_kernel_name mojo your_program.mojo

Tool selection decision tree

Performance Problem
        |
        v
Know which kernel?
    |           |
   No          Yes
    |           |
    v           v
NSight    Kernel-specific issue?
Systems       |           |
    |        No          Yes
    v         |           |
Timeline      |           v
Analysis <----+     NSight Compute
                          |
                          v
                   Kernel Deep-Dive

Quick Decision Guide:

  • Start with NSight Systems (nsys) if you’re unsure where the bottleneck is
  • Use NSight Compute (ncu) when you know exactly which kernel to optimize
  • Use both for comprehensive analysis (common workflow)
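A minimal version of that combined workflow, with placeholder program and kernel names, looks like this (the hands-on sections below walk through each step in detail):

# 1. System-wide pass to find the hotspot kernel (placeholder program name)
nsys profile --trace=cuda,nvtx --output=overview ./your_program
nsys stats overview.nsys-rep

# 2. Kernel deep-dive on the hotspot identified above (placeholder kernel name)
ncu --set full --kernel-name regex:hotspot_kernel -o hotspot_analysis ./your_program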

Hands-on: system-wide profiling with NSight Systems

Let’s profile the Matrix Multiplication implementations from Puzzle 16 to understand performance differences.

GUI Note: The NSight Systems and Compute GUIs (nsys-ui, ncu-ui) require a display and OpenGL support. On headless servers or remote systems without X11 forwarding, use the command-line versions (nsys, ncu) with text-based analysis via nsys stats and ncu --import --page details. You can also transfer .nsys-rep and .ncu-rep files to local machines for GUI analysis.

Step 1: Prepare your code for profiling

Critical: For accurate profiling, build with full debug information while keeping optimizations enabled:

pixi shell -e cuda
# Build with full debug info (for comprehensive source mapping) with optimizations enabled
mojo build --debug-level=full solutions/p16/p16.mojo -o solutions/p16/p16_optimized

# Test the optimized build
./solutions/p16/p16_optimized --naive

Why this matters:

  • Full debug info: Provides complete symbol tables, variable names, and source line mapping for profilers
  • Comprehensive analysis: Enables NSight tools to correlate performance data with specific code locations
  • Optimizations enabled: Ensures realistic performance measurements that match production builds
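As a quick sanity check (assuming a Linux host with binutils installed; readelf is not part of the CUDA toolkit), you can confirm the binary actually carries debug sections before spending time on a profile:

# List ELF sections and look for DWARF debug info (.debug_info, .debug_line, ...)
readelf -S solutions/p16/p16_optimized | grep -i debug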

Step 2: Capture system-wide profile

# Profile the optimized build with comprehensive tracing
nsys profile \
  --trace=cuda,nvtx \
  --output=matmul_naive \
  --force-overwrite=true \
  ./solutions/p16/p16_optimized --naive

Command breakdown:

  • --trace=cuda,nvtx: Capture CUDA API calls and custom annotations
  • --output=matmul_naive: Save profile as matmul_naive.nsys-rep
  • --force-overwrite=true: Replace existing profiles
  • Final argument: Your Mojo program
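If you want the summary tables printed right after the run instead of a separate nsys stats step, recent nsys versions also accept a --stats flag; a minimal sketch using the same trace settings:

# Optional: print summary statistics immediately after collection
nsys profile --trace=cuda,nvtx --stats=true --output=matmul_naive --force-overwrite=true \
  ./solutions/p16/p16_optimized --naive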

Step 3: Analyze the timeline

# Generate text-based statistics
nsys stats --force-export=true matmul_naive.nsys-rep

# Key metrics to look for:
# - GPU utilization percentage
# - Memory transfer times
# - Kernel execution times
# - CPU-GPU synchronization gaps

What you’ll see (actual output from a 2×2 matrix multiplication):

** CUDA API Summary (cuda_api_sum):
 Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)          Name
 --------  ---------------  ---------  ---------  --------  --------  --------  -----------  --------------------
     81.9          8617962          3  2872654.0    2460.0      1040   8614462    4972551.6  cuMemAllocAsync
     15.1          1587808          4   396952.0    5965.5      3810   1572067     783412.3  cuMemAllocHost_v2
      0.6            67152          1    67152.0   67152.0     67152     67152          0.0  cuModuleLoadDataEx
      0.4            44961          1    44961.0   44961.0     44961     44961          0.0  cuLaunchKernelEx

** CUDA GPU Kernel Summary (cuda_gpu_kern_sum):
 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                    Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------
    100.0             1920          1    1920.0    1920.0      1920      1920          0.0  p16_naive_matmul_Layout_Int6A6AcB6A6AsA6A6A

** CUDA GPU MemOps Summary (by Time) (cuda_gpu_mem_time_sum):
 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)           Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ----------------------------
     49.4             4224      3    1408.0    1440.0      1312      1472         84.7  [CUDA memcpy Device-to-Host]
     36.0             3072      4     768.0     528.0       416      1600        561.0  [CUDA memset]
     14.6             1248      3     416.0     416.0       416       416          0.0  [CUDA memcpy Host-to-Device]

Key Performance Insights:

  • Memory allocation dominates: 81.9% of total time spent on cuMemAllocAsync
  • Kernel is lightning fast: Only 1,920 ns (0.000001920 seconds) execution time
  • Memory transfer breakdown: 49.4% Device→Host, 36.0% memset, 14.6% Host→Device
  • Tiny data sizes: All memory operations are < 0.001 MB (4 float32 values = 16 bytes)

Step 4: Compare implementations

Profile different versions and compare:

# Make sure you're still in the pixi shell: `pixi shell -e cuda`

# Profile shared memory version
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_shared ./solutions/p16/p16_optimized --single-block

# Profile tiled version
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_tiled ./solutions/p16/p16_optimized --tiled

# Profile idiomatic tiled version
nsys profile --trace=cuda,nvtx --force-overwrite=true --output=matmul_idiomatic_tiled ./solutions/p16/p16_optimized --idiomatic-tiled

# Analyze each implementation separately (nsys stats processes one file at a time)
nsys stats --force-export=true matmul_shared.nsys-rep
nsys stats --force-export=true matmul_tiled.nsys-rep
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep

How to compare the results:

  1. Look at GPU Kernel Summary - Compare execution times between implementations
  2. Check Memory Operations - See if shared memory reduces global memory traffic
  3. Compare API overhead - All should have similar memory allocation patterns

Manual comparison workflow:

# Run each analysis and save output for comparison
nsys stats --force-export=true matmul_naive.nsys-rep > naive_stats.txt
nsys stats --force-export=true matmul_shared.nsys-rep > shared_stats.txt
nsys stats --force-export=true matmul_tiled.nsys-rep > tiled_stats.txt
nsys stats --force-export=true matmul_idiomatic_tiled.nsys-rep > idiomatic_tiled_stats.txt
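To put the kernel numbers side by side without scrolling through the full reports, one option (assuming the four files saved above) is to grep just the GPU kernel summary out of each text file:

# Print the kernel summary block from each saved report (adjust -A if your output differs)
for f in naive shared tiled idiomatic_tiled; do
  echo "=== ${f} ==="
  grep -A 5 "CUDA GPU Kernel Summary" ${f}_stats.txt
done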

Fair Comparison Results (actual output from profiling):

Comparison 1: 2×2 matrices

 Implementation           Memory Allocation       Kernel Execution   Performance
 -----------------------  ----------------------  -----------------  ------------
 Naive                    81.9% cuMemAllocAsync   ✅ 1,920 ns        Baseline
 Shared (--single-block)  81.8% cuMemAllocAsync   ✅ 1,984 ns        +3.3% slower

Comparison 2: 9×9 matrices

 Implementation    Memory Allocation       Kernel Execution   Performance
 ----------------  ----------------------  -----------------  -------------
 Tiled (manual)    81.1% cuMemAllocAsync   ✅ 2,048 ns        Baseline
 Idiomatic Tiled   81.6% cuMemAllocAsync   ✅ 2,368 ns        +15.6% slower

Key Insights from Fair Comparisons:

Both Matrix Sizes Are Tiny for GPU Work:

  • 2×2 matrices: 4 elements - completely overhead-dominated
  • 9×9 matrices: 81 elements - still completely overhead-dominated
  • Real GPU workloads: Thousands to millions of elements per dimension

What These Results Actually Show:

  • All variants dominated by memory allocation (>81% of time)
  • Kernel execution is irrelevant compared to setup costs
  • “Optimizations” can hurt: Shared memory adds 3.3% overhead, async_copy adds 15.6%
  • The real lesson: For tiny workloads, algorithm choice doesn’t matter - overhead dominates everything

Why This Happens:

  • GPU setup cost (memory allocation, kernel launch) is fixed regardless of problem size
  • For tiny problems, this fixed cost dwarfs computation time
  • Optimizations designed for large problems become overhead for small ones

Real-World Profiling Lessons:

  • Problem size context matters: Both 2×2 and 9×9 are tiny for GPUs
  • Fixed costs dominate small problems: Memory allocation, kernel launch overhead
  • “Optimizations” can hurt tiny workloads: Shared memory, async operations add overhead
  • Don’t optimize tiny problems: Focus on algorithms that scale to real workloads
  • Always benchmark: Assumptions about “better” code are often wrong

Understanding Small Kernel Profiling: This 2×2 matrix example demonstrates a classic small-kernel pattern:

  • The actual computation (matrix multiply) is extremely fast (1,920 ns)
  • Memory setup overhead dominates the total time (97%+ of execution)
  • This is why real-world GPU optimization focuses on:
    • Batching operations to amortize setup costs
    • Memory reuse to reduce allocation overhead
    • Larger problem sizes where compute becomes the bottleneck

Hands-on: kernel deep-dive with NSight Compute

Now let’s dive deep into a specific kernel’s performance characteristics.

Step 1: Profile a specific kernel

# Make sure you're in an active shell
pixi shell -e cuda

# Profile the naive MatMul kernel in detail (using our optimized build)
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

Common Issue: Permission Error

If you get “ERR_NVGPUCTRPERM - The user does not have permission to access NVIDIA GPU Performance Counters”, try these solutions:

# Add NVIDIA driver option (safer than rmmod)
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

# Set kernel parameter
sudo sysctl -w kernel.perf_event_paranoid=0

# Make permanent
echo 'kernel.perf_event_paranoid=0' | sudo tee -a /etc/sysctl.conf

# Reboot required for driver changes to take effect
sudo reboot

# Then run the ncu command again
ncu \
  --set full \
  -o kernel_analysis \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive

Step 2: Analyze key metrics

# Generate detailed report (correct syntax)
ncu --import kernel_analysis.ncu-rep --page details
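Other report pages can be pulled from the same .ncu-rep file; for example (page names as of recent NSight Compute versions), the raw page dumps every collected metric, which is convenient to grep:

# Dump all collected metrics and filter for occupancy-related counters
ncu --import kernel_analysis.ncu-rep --page raw | grep -i occupancy

# Show device and process information for the capture
ncu --import kernel_analysis.ncu-rep --page session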

Real NSight Compute Output (from your 2×2 naive MatMul):

GPU Speed Of Light Throughput
----------------------- ----------- ------------
DRAM Frequency              Ghz         6.10
SM Frequency                Ghz         1.30
Elapsed Cycles            cycle         3733
Memory Throughput             %         1.02
DRAM Throughput               %         0.19
Duration                     us         2.88
Compute (SM) Throughput       %         0.00
----------------------- ----------- ------------

Launch Statistics
-------------------------------- --------------- ---------------
Block Size                                                     9
Grid Size                                                      1
Threads                           thread               9
Waves Per SM                                                0.00
-------------------------------- --------------- ---------------

Occupancy
------------------------------- ----------- ------------
Theoretical Occupancy                 %        33.33
Achieved Occupancy                    %         2.09
------------------------------- ----------- ------------

Critical Insights from Real Data:

Performance analysis - the brutal truth!

  • Compute Throughput: 0.00% - GPU is completely idle computationally
  • Memory Throughput: 1.02% - Barely touching memory bandwidth
  • Achieved Occupancy: 2.09% - Using only 2% of GPU capability
  • Grid Size: 1 block - Completely underutilizing 80 multiprocessors!

Why performance is so poor

  • Tiny problem size: 2×2 matrix = 4 elements total
  • Poor launch configuration: 9 threads in 1 block (should be multiples of 32)
  • Massive underutilization: 0.00 waves per SM (need thousands for efficiency)

Key optimization recommendations from NSight Compute

  • “Est. Speedup: 98.75%” - Increase grid size to use all 80 SMs
  • “Est. Speedup: 71.88%” - Use thread blocks as multiples of 32
  • “Kernel grid is too small” - Need much larger problems for GPU efficiency

Step 3: The reality check

What This Profiling Data Teaches Us:

  1. Tiny problems are GPU poison: 2×2 matrices completely waste GPU resources
  2. Launch configuration matters: Wrong thread/block sizes kill performance
  3. Scale matters more than algorithm: No optimization can fix a fundamentally tiny problem
  4. NSight Compute is honest: It tells us when our kernel performance is poor

The Real Lesson:

  • Don’t optimize toy problems - they’re not representative of real GPU workloads
  • Focus on realistic workloads - 1000×1000+ matrices where optimizations actually matter
  • Use profiling to guide optimization - but only on problems worth optimizing

For our tiny 2×2 example: All the sophisticated algorithms (shared memory, tiling) just add overhead to an already overhead-dominated workload.

Reading profiler output like a performance detective

Common performance patterns

Pattern 1: Memory-bound kernel

  • NSight Systems shows: Long memory transfer times
  • NSight Compute shows: High memory throughput, low compute utilization
  • Solution: Optimize memory access patterns, use shared memory

Pattern 2: Low occupancy

  • NSight Systems shows: Short kernel execution with gaps
  • NSight Compute shows: Low achieved occupancy
  • Solution: Reduce register usage, optimize block size

Pattern 3: Warp divergence

  • NSight Systems shows: Irregular kernel execution patterns
  • NSight Compute shows: Low warp execution efficiency
  • Solution: Minimize conditional branches, restructure algorithms
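To check which of these patterns applies to one of your kernels, a targeted capture of just the relevant sections keeps profiling overhead low; a sketch using the p16 build from earlier (section names as of recent NSight Compute versions):

# Collect only the sections needed to distinguish the three patterns above
ncu \
  --section SpeedOfLight \
  --section MemoryWorkloadAnalysis \
  --section Occupancy \
  --section WarpStateStats \
  -o pattern_check \
  --force-overwrite \
  ./solutions/p16/p16_optimized --naive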

Profiling detective workflow

Performance Issue
        |
        v
NSight Systems: Big Picture
        |
        v
GPU Well Utilized?
    |           |
   No          Yes
    |           |
    v           v
Fix CPU-GPU    NSight Compute: Kernel Detail
Pipeline            |
                    v
            Memory or Compute Bound?
                |       |       |
             Memory  Compute  Neither
                |       |       |
                v       v       v
           Optimize  Optimize  Check
           Memory    Arithmetic Occupancy
           Access

Profiling best practices

For comprehensive profiling guidelines, refer to the Best Practices Guide - Performance Metrics.

Do’s

  1. Profile representative workloads: Use realistic data sizes and patterns
  2. Build with full debug info: Use --debug-level=full so profilers get complete source mapping while optimizations stay enabled
  3. Warm up the GPU: Run kernels multiple times, profile later iterations
  4. Compare alternatives: Always profile multiple implementations
  5. Focus on hotspots: Optimize the kernels that take the most time

Don’ts

  1. Don’t profile without debug info: You won’t be able to map performance back to source code (mojo build --help)
  2. Don’t profile single runs: GPU performance can vary between runs
  3. Don’t ignore memory transfers: CPU-GPU transfers often dominate
  4. Don’t optimize prematurely: Profile first, then optimize

Common pitfalls and solutions

Pitfall 1: Cold start effects

# Wrong: Profile first run
nsys profile mojo your_program.mojo

# Right: Warm up, then profile
nsys profile --delay=5 mojo your_program.mojo  # Let GPU warm up

Pitfall 2: Wrong build configuration

# Wrong: unoptimized build (optimizations disabled via -O0, i.e. `--no-optimization`)
mojo build -O0 your_program.mojo -o your_program

# Wrong: No debug info (can't map to source)
mojo build your_program.mojo -o your_program

# Right: Optimized build with full debug info for profiling
mojo build --debug-level=full your_program.mojo -o optimized_program
nsys profile ./optimized_program

Pitfall 3: Ignoring memory transfers

# Look for this pattern in NSight Systems:
CPU -> GPU transfer: 50ms
Kernel execution: 2ms
GPU -> CPU transfer: 48ms
# Total: 100ms (kernel is only 2%!)

Solution: Overlap transfers with compute, reduce transfer frequency (covered in Part IX)
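To quantify this pattern on your own capture, the memory-operation and kernel summary reports (the same cuda_gpu_mem_time_sum and cuda_gpu_kern_sum names shown earlier) isolate transfer time vs kernel time; your_profile.nsys-rep is a placeholder for any capture from the steps above:

# Compare transfer time vs kernel time from an existing profile
nsys stats --report=cuda_gpu_mem_time_sum,cuda_gpu_kern_sum your_profile.nsys-rep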

Pitfall 4: Single kernel focus

# Wrong: Only profile the "slow" kernel
ncu --kernel-name regex:slow_kernel program

# Right: Profile the whole application first
nsys profile mojo program.mojo  # Find real bottlenecks

Best practices and advanced options

Advanced NSight Systems profiling

For comprehensive system-wide analysis, use these advanced nsys flags:

# Production-grade profiling command
nsys profile \
  --gpu-metrics-devices=all \
  --trace=cuda,osrt,nvtx \
  --trace-fork-before-exec=true \
  --cuda-memory-usage=true \
  --cuda-um-cpu-page-faults=true \
  --cuda-um-gpu-page-faults=true \
  --opengl-gpu-workload=false \
  --delay=2 \
  --duration=30 \
  --sample=cpu \
  --cpuctxsw=process-tree \
  --output=comprehensive_profile \
  --force-overwrite=true \
  ./your_program

Flag explanations:

  • --gpu-metrics-devices=all: Collect GPU metrics from all devices
  • --trace=cuda,osrt,nvtx: Comprehensive API tracing
  • --cuda-memory-usage=true: Track memory allocation/deallocation
  • --cuda-um-cpu/gpu-page-faults=true: Monitor Unified Memory page faults
  • --delay=2: Wait 2 seconds before profiling (avoid cold start)
  • --duration=30: Profile for 30 seconds max
  • --sample=cpu: Include CPU sampling for hotspot analysis
  • --cpuctxsw=process-tree: Track CPU context switches
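Once the capture completes, the same text-based analysis applies to the new report file:

# Summarize the comprehensive capture
nsys stats --force-export=true comprehensive_profile.nsys-rep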

Advanced NSight Compute profiling

For detailed kernel analysis with comprehensive metrics:

# Full kernel analysis with all metric sets
ncu \
  --set full \
  --import-source=on \
  --kernel-id=:::1 \
  --launch-skip=0 \
  --launch-count=1 \
  --target-processes=all \
  --replay-mode=kernel \
  --cache-control=all \
  --clock-control=base \
  --apply-rules=yes \
  --check-exit-code=yes \
  --export=detailed_analysis \
  --force-overwrite \
  ./your_program

# Focus on specific performance aspects
ncu \
  --set=@roofline \
  --section=InstructionStats \
  --section=LaunchStats \
  --section=Occupancy \
  --section=SpeedOfLight \
  --section=WarpStateStats \
  --metrics=sm__cycles_elapsed.avg,dram__throughput.avg.pct_of_peak_sustained_elapsed \
  --kernel-name regex:your_kernel_.* \
  --export=targeted_analysis \
  ./your_program

Key NSight Compute flags:

  • --set full: Collect all available metrics (comprehensive but slow)
  • --set @roofline: Optimized set for roofline analysis
  • --import-source=on: Map results back to source code
  • --replay-mode=kernel: Replay kernels for accurate measurements
  • --cache-control=all: Control GPU caches for consistent results
  • --clock-control=base: Lock clocks to base frequencies
  • --section=SpeedOfLight: Include Speed of Light analysis
  • --metrics=...: Collect specific metrics only
  • --kernel-name regex:pattern: Target kernels using regex patterns (not --kernel-regex)

Profiling workflow best practices

1. Progressive profiling strategy

# Step 1: Quick overview (fast)
nsys profile --trace=cuda --duration=10 --output=quick_look ./program

# Step 2: Detailed system analysis (medium)
nsys profile --trace=cuda,osrt,nvtx --cuda-memory-usage=true --output=detailed ./program

# Step 3: Kernel deep-dive (slow but comprehensive)
ncu --set=@roofline --kernel-name regex:hotspot_kernel ./program

2. Multi-run analysis for reliability

# Profile multiple runs and compare
for i in {1..5}; do
  nsys profile --output=run_${i} ./program
  nsys stats run_${i}.nsys-rep > stats_${i}.txt
done

# Compare results
diff stats_1.txt stats_2.txt
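Plain diffs of the full stats output tend to be noisy; an alternative sketch (assuming your nsys supports the --report/--format options used later in this section) exports only the kernel summary per run:

# Extract just the GPU kernel summary from each run for easier comparison
for i in {1..5}; do
  nsys stats --report=cuda_gpu_kern_sum --format=csv run_${i}.nsys-rep > kernels_${i}.csv
done
diff kernels_1.csv kernels_2.csv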

3. Targeted kernel profiling

# First, identify hotspot kernels
nsys profile --trace=cuda,nvtx --output=overview ./program
nsys stats overview.nsys-rep | grep -A 10 "GPU Kernel Summary"

# Then profile specific kernels
ncu --kernel-name="identified_hotspot_kernel" --set full ./program

Environment and build best practices

Optimal build configuration

# For profiling: optimized with full debug info
mojo build --debug-level=full --optimization-level=3 program.mojo -o program_profile

# Verify build settings
mojo build --help | grep -E "(debug|optimization)"

Profiling environment setup

# Disable GPU boost for consistent results
sudo nvidia-smi -ac 1215,1410  # Lock memory and GPU clocks

# Set deterministic behavior
export CUDA_LAUNCH_BLOCKING=1  # Synchronous launches for accurate timing

# Increase driver limits for profiling
echo 0 | sudo tee /proc/sys/kernel/perf_event_paranoid
echo 'options nvidia "NVreg_RestrictProfilingToAdminUsers=0"' | sudo tee -a /etc/modprobe.d/nvidia-kernel-common.conf

Memory and performance isolation

# Clear GPU memory before profiling
nvidia-smi --gpu-reset

# Disable other GPU processes
sudo fuser -v /dev/nvidia*  # Check what's using GPU
sudo pkill -f cuda  # Kill CUDA processes if needed

# Run with high priority
sudo nice -n -20 nsys profile ./program

Analysis and reporting best practices

Comprehensive report generation

# Generate multiple report formats
nsys stats --report=cuda_api_sum,cuda_gpu_kern_sum,cuda_gpu_mem_time_sum --format=csv --output=. profile.nsys-rep

# Export for external analysis
nsys export --type=sqlite profile.nsys-rep
nsys export --type=json profile.nsys-rep

# Generate comparison reports
nsys stats --report=cuda_gpu_kern_sum baseline.nsys-rep > baseline_kernels.txt
nsys stats --report=cuda_gpu_kern_sum optimized.nsys-rep > optimized_kernels.txt
diff -u baseline_kernels.txt optimized_kernels.txt

Performance regression testing

#!/bin/bash
# Automated profiling script for CI/CD
BASELINE_TIME=$(nsys stats baseline.nsys-rep | grep "Total Time" | awk '{print $3}')
CURRENT_TIME=$(nsys stats current.nsys-rep | grep "Total Time" | awk '{print $3}')

REGRESSION_THRESHOLD=1.10  # 10% slowdown threshold
if (( $(echo "$CURRENT_TIME > $BASELINE_TIME * $REGRESSION_THRESHOLD" | bc -l) )); then
    echo "Performance regression detected: ${CURRENT_TIME}ns vs ${BASELINE_TIME}ns"
    exit 1
fi
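The baseline.nsys-rep and current.nsys-rep inputs above are assumed to already exist; a sketch of how they might be produced in CI (binary names are placeholders):

# Produce the two profiles compared by the script above (placeholder binary names)
nsys profile --trace=cuda --output=baseline --force-overwrite=true ./program_baseline
nsys profile --trace=cuda --output=current --force-overwrite=true ./program_current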

Next steps

Now that you understand profiling fundamentals:

  1. Practice with your existing kernels: Profile puzzles you’ve already solved
  2. Prepare for optimization: Puzzle 31 will use these insights for occupancy optimization
  3. Understand the tools: Experiment with different NSight Systems and NSight Compute options

Remember: Profiling is not just about finding slow code - it’s about understanding your program’s behavior and making informed optimization decisions.

For additional profiling resources, see: