👮🏼‍♂️ The Silent Memory Corruption

Overview

Learn how to detect memory violations that can silently corrupt GPU programs, even when tests appear to pass. Using NVIDIA’s compute-sanitizer (available through pixi) with the memcheck tool, you’ll discover hidden memory bugs that could cause unpredictable behavior in your GPU code.

Key insight: A GPU program can produce “correct” results while simultaneously performing illegal memory accesses.

Prerequisites: Understanding of Puzzle 4 TileTensor and basic GPU memory concepts.

The silent memory bug discovery

Test passes, but is my code actually correct?

Let’s start with a seemingly innocent program that appears to work perfectly (this is Puzzle 04 without guards):

def add_10_2d(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=False, dtype, LayoutType, ImmutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    output[row, col] = a[row, col] + 10.0


View full file: problems/p10/p10.mojo

When you run this program normally, everything looks fine:

pixi run p10 --memory-bug
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
âś… Memory test PASSED! (memcheck may find bounds violations)

âś… Test PASSED! The output matches expected results perfectly. Case closed, right?

Wrong! Let’s see what compute-sanitizer reveals:

MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug

Note: MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 is a command-line environment variable setting that disables a device context’s buffer cache. This setting can reveal memory issues, like bounds violations, that are otherwise masked by the normal caching behavior.

========= COMPUTE-SANITIZER
out shape: 2 x 2
Running memory bug example (bounds checking issue)...

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (2,1,0) in block (0,0,0)
=========     Access at 0xe0c000210 is out of bounds
=========     and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (0,2,0) in block (0,0,0)
=========     Access at 0xe0c000210 is out of bounds
=========     and is 513 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (1,2,0) in block (0,0,0)
=========     Access at 0xe0c000214 is out of bounds
=========     and is 517 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Invalid __global__ read of size 4 bytes
=========     at p10_add_10_2d_...+0x80
=========     by thread (2,2,0) in block (0,0,0)
=========     Access at 0xe0c000218 is out of bounds
=========     and is 521 bytes after the nearest allocation at 0xe0c000000 of size 16 bytes

========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuStreamSynchronize.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuEventCreate.
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuMemFreeAsync.

========= ERROR SUMMARY: 7 errors

The program has 7 total errors despite passing all tests:

  • 4 memory violations (Invalid global read)
  • 3 runtime errors (caused by the memory violations)

Understanding the hidden bug

Root cause analysis

The Problem:

  • Tensor size: 2Ă—2 (valid indices: 0, 1)
  • Thread grid: 3Ă—3 (thread indices: 0, 1, 2)
  • Out-of-bounds threads: (2,1), (0,2), (1,2), (2,2) access invalid memory
  • Missing bounds check: No validation of thread_idx against tensor dimensions

Understanding the 7 total errors

4 Memory Violations:

  • Each out-of-bounds thread (2,1), (0,2), (1,2), (2,2) caused an “Invalid global read”

3 CUDA Runtime Errors:

  • cuStreamSynchronize failed due to kernel launch failure
  • cuEventCreate failed during cleanup
  • cuMemFreeAsync failed during memory deallocation

Key Insight: Memory violations have cascading effects - one bad memory access causes multiple downstream CUDA API failures.

Why tests still passed:

  • Valid threads (0,0), (0,1), (1,0), (1,1) wrote correct results
  • Test only checked valid output locations
  • Out-of-bounds accesses didn’t immediately crash the program

Understanding undefined behavior (UB)

What is undefined behavior?

Undefined Behavior (UB) occurs when a program performs operations that have no defined meaning according to the language specification. Out-of-bounds memory access is a classic example of undefined behavior.

Key characteristics of UB:

  • The program can do literally anything: crash, produce wrong results, appear to work, or corrupt memory
  • No guarantees: Behavior may change between compilers, hardware, drivers, or even different runs

Why undefined behavior is especially dangerous

Correctness issues:

  • Unpredictable results: Your program may work during testing but fail in production
  • Non-deterministic behavior: Same code can produce different results on different runs
  • Silent corruption: UB can corrupt data without any visible errors
  • Compiler optimizations: Compilers assume no UB occurs and may optimize in unexpected ways

Security vulnerabilities:

  • Buffer overflows: Classic source of security exploits in systems programming
  • Memory corruption: Can lead to privilege escalation and code injection attacks
  • Information leakage: Out-of-bounds reads can expose sensitive data
  • Control flow hijacking: UB can be exploited to redirect program execution

GPU-specific undefined behavior dangers

Massive scale impact:

  • Thread divergence: One thread’s UB can affect entire warps (32 threads)
  • Memory coalescing: Out-of-bounds access can corrupt neighboring threads’ data
  • Kernel failures: UB can cause entire GPU kernels to fail catastrophically

Hardware variations:

  • Different GPU architectures: UB may manifest differently on different GPU models
  • Driver differences: Same UB may behave differently across driver versions
  • Memory layout changes: GPU memory allocation patterns can change UB manifestation

Fixing the memory violation

The solution

As we saw in Puzzle 04, we need to bound-check as follows:

def add_10_2d(
    output: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    a: TileTensor[mut=True, dtype, LayoutType, MutAnyOrigin],
    size: Int,
):
    var row = thread_idx.y
    var col = thread_idx.x
    if col < size and row < size:
        output[row, col] = a[row, col] + 10.0


The fix is simple: always validate thread indices against data dimensions before accessing memory.

Verification with compute-sanitizer

# Fix the bounds checking in your copy of p10.mojo, then run:
MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo problems/p10/p10.mojo --memory-bug
========= COMPUTE-SANITIZER
[...]:WARNING close_multiple.cc:66] close: Bad file descriptor (9)
[...]:WARNING close_multiple.cc:66] close: Bad file descriptor (9)
Please submit a bug report to https://github.com/modular/modular/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0.      Program arguments: .../.pixi/envs/default/bin/mojo problems/p10/p10.mojo --memory-bug
 #0 0x... (/.../.pixi/envs/default/bin/mojo+0x...)
 ...
[...] intermediate process terminated by signal 11 (Segmentation fault) (core dumped)
out shape: 2 x 2
Running memory bug example (bounds checking issue)...
out: HostBuffer([10.0, 11.0, 12.0, 13.0])
expected: HostBuffer([10.0, 11.0, 12.0, 13.0])
âś… Memory test PASSED! (memcheck may find bounds violations)
========= ERROR SUMMARY: 0 errors

âś… SUCCESS: No memory violations detected!

Note on the segfault: The crash lines above (“intermediate process terminated by signal 11”) are a known compatibility issue between Mojo’s process initialization and compute-sanitizer’s injection libraries. They appear before the GPU kernel runs and do not affect the sanitizer’s analysis. See the note at the end of this page for details.

Key learning points

Why manual bounds checking matters

  1. Clarity: Makes the safety requirements explicit in the code
  2. Control: You decide exactly what happens for out-of-bounds cases
  3. Debugging: Easier to reason about when memory violations occur

GPU memory safety rules

  1. Always validate thread indices against data dimensions
  2. Avoid undefined behavior (UB) at all costs - out-of-bounds access is UB and can break everything
  3. Use compute-sanitizer during development and testing
  4. Never assume “it works” without memory checking
  5. Test with different grid/block configurations to catch undefined behavior (UB) that manifests inconsistently

Compute-sanitizer best practices

MODULAR_DEVICE_CONTEXT_MEMORY_MANAGER_SIZE_PERCENT=0 pixi run compute-sanitizer --tool memcheck mojo your_code.mojo

Note on Mojo + compute-sanitizer compatibility: You may see a crash at the start of the sanitizer output — lines like close: Bad file descriptor, a stack dump, and intermediate process terminated by signal 11 (Segmentation fault). This is a known issue where compute-sanitizer’s injection libraries conflict with Mojo’s process initialization. Despite the crash, the sanitizer completes its GPU kernel analysis correctly. Always use the ========= ERROR SUMMARY line at the end as the authoritative result, and look for ========= Invalid lines for specific memory violations.