👮🏼‍♂️ The Silent Memory Corruption
Overview
Learn how to detect memory violations that can silently corrupt GPU programs, even when tests appear to pass. Using NVIDIA’s compute-sanitizer
(avaible through pixi
) with the memcheck
tool, you’ll discover hidden memory bugs that could cause unpredictable behavior in your GPU code.
Key insight: A GPU program can produce “correct” results while simultaneously performing illegal memory accesses.
Prerequisites: Understanding of Puzzle 4 LayoutTensor and basic GPU memory concepts.
The silent memory bug discovery
Test passes, but is my code actually correct?
Let’s start with a seemingly innocent program that appears to work perfectly (this is Puzzle 04 without guards):
fn add_10_2d(
output: LayoutTensor[mut=True, dtype, layout],
a: LayoutTensor[mut=True, dtype, layout],
size: Int,
):
row = thread_idx.y
col = thread_idx.x
output[row, col] = a[row, col] + 10.0
View full file: problems/p10/p10.mojo
When you run this program normally, everything looks fine:
pixi run p10 --memory-bug
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
âś… Memory test PASSED! (memcheck may find bounds violations)
âś… Test PASSED! The output matches expected results perfectly. Case closed, right?
Wrong! Let’s see what compute-sanitizer
reveals:
pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug
========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
========= Invalid __global__ read of size 4 bytes
========= at p10_add_10_2d_...+0x80
========= by thread (2,1,0) in block (0,0,0)
========= Access at 0xe0c000210 is out of bounds
========= and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes
========= Invalid __global__ read of size 4 bytes
========= at p10_add_10_2d_...+0x80
========= by thread (0,2,0) in block (0,0,0)
========= Access at 0xe0c000210 is out of bounds
========= and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes
========= Invalid __global__ read of size 4 bytes
========= at p10_add_10_2d_...+0x80
========= by thread (1,2,0) in block (0,0,0)
========= Access at 0xe0c000214 is out of bounds
========= and is 517 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes
========= Invalid __global__ read of size 4 bytes
========= at p10_add_10_2d_...+0x80
========= by thread (2,2,0) in block (0,0,0)
========= Access at 0xe0c000218 is out of bounds
========= and is 521 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuStreamSynchronize.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuEventCreate.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuMemFreeAsync.
========= ERROR SUMMARY: 7 errors
The program has 7 total errors despite passing all tests:
- 4 memory violations (Invalid global read)
- 3 runtime errors (caused by the memory violations)
Understanding the hidden bug
Root cause analysis
The Problem:
- Tensor size: 2Ă—2 (valid indices: 0, 1)
- Thread grid: 3Ă—3 (thread indices: 0, 1, 2)
- Out-of-bounds threads:
(2,1)
,(0,2)
,(1,2)
,(2,2)
access invalid memory - Missing bounds check: No validation of
thread_idx
against tensor dimensions
Understanding the 7 total errors
4 Memory Violations:
- Each out-of-bounds thread
(2,1)
,(0,2)
,(1,2)
,(2,2)
caused an “Invalid global read”
3 CUDA Runtime Errors:
cuStreamSynchronize
failed due to kernel launch failurecuEventCreate
failed during cleanupcuMemFreeAsync
failed during memory deallocation
Key Insight: Memory violations have cascading effects - one bad memory access causes multiple downstream CUDA API failures.
Why tests still passed:
- Valid threads
(0,0)
,(0,1)
,(1,0)
,(1,1)
wrote correct results - Test only checked valid output locations
- Out-of-bounds accesses didn’t immediately crash the program
Understanding undefined behavior (UB)
What is undefined behavior?
Undefined Behavior (UB) occurs when a program performs operations that have no defined meaning according to the language specification. Out-of-bounds memory access is a classic example of undefined behavior.
Key characteristics of UB:
- The program can do literally anything: crash, produce wrong results, appear to work, or corrupt memory
- No guarantees: Behavior may change between compilers, hardware, drivers, or even different runs
Why undefined behavior is especially dangerous
Correctness issues:
- Unpredictable results: Your program may work during testing but fail in production
- Non-deterministic behavior: Same code can produce different results on different runs
- Silent corruption: UB can corrupt data without any visible errors
- Compiler optimizations: Compilers assume no UB occurs and may optimize in unexpected ways
Security vulnerabilities:
- Buffer overflows: Classic source of security exploits in systems programming
- Memory corruption: Can lead to privilege escalation and code injection attacks
- Information leakage: Out-of-bounds reads can expose sensitive data
- Control flow hijacking: UB can be exploited to redirect program execution
GPU-specific undefined behavior dangers
Massive scale impact:
- Thread divergence: One thread’s UB can affect entire warps (32 threads)
- Memory coalescing: Out-of-bounds access can corrupt neighboring threads’ data
- Kernel failures: UB can cause entire GPU kernels to fail catastrophically
Hardware variations:
- Different GPU architectures: UB may manifest differently on different GPU models
- Driver differences: Same UB may behave differently across driver versions
- Memory layout changes: GPU memory allocation patterns can change UB manifestation
Fixing the memory violation
The solution
As we saw in Puzzle 04, we need to bound-check as follows:
fn add_10_2d(
output: LayoutTensor[mut=True, dtype, layout],
a: LayoutTensor[mut=True, dtype, layout],
size: Int,
):
row = thread_idx.y
col = thread_idx.x
if col < size and row < size:
output[row, col] = a[row, col] + 10.0
The fix is simple: always validate thread indices against data dimensions before accessing memory.
Verification with compute-sanitizer
# Fix the bounds checking in your copy of p10.mojo, then run:
pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug
========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
âś… Memory test PASSED! (memcheck may find bounds violations)
========= ERROR SUMMARY: 0 errors
âś… SUCCESS: No memory violations detected!
Key learning points
Why manual bounds checking matters
- Clarity: Makes the safety requirements explicit in the code
- Control: You decide exactly what happens for out-of-bounds cases
- Debugging: Easier to reason about when memory violations occur
GPU memory safety rules
- Always validate thread indices against data dimensions
- Avoid undefined behavior (UB) at all costs - out-of-bounds access is UB and can break everything
- Use compute-sanitizer during development and testing
- Never assume “it works” without memory checking
- Test with different grid/block configurations to catch undefined behavior (UB) that manifests inconsistently
Compute-sanitizer best practices
pixi run compute-sanitizer --tool memcheck mojo your_code.mojo
Note: You may see Mojo runtime warnings in the sanitizer output. Focus on the ========= Invalid
and ========= ERROR SUMMARY
lines for actual memory violations.