🧐 Detective Work: First Case
Overview
In this puzzle, you’ll face a GPU program that crashes. Your task is to find the issue without looking at the code, using only CUDA-GDB. Run the debugger and be a detective!
Prerequisites: Complete Mojo GPU Debugging Essentials to understand CUDA-GDB setup and basic debugging commands. Make sure you’ve run `pixi run setup-cuda-gdb`, or that equivalent symlinks are available:
ln -sf /usr/local/cuda/bin/cuda-gdb-minimal $CONDA_PREFIX/bin/cuda-gdb-minimal
ln -sf /usr/local/cuda/bin/cuda-gdb-python3.12-tui $CONDA_PREFIX/bin/cuda-gdb-python3.12-tui
Key concepts
In this debugging challenge, you’ll learn about:
- Systematic debugging: Using error messages as clues to find root causes
- Error analysis: Reading crash messages and stack traces
- Hypothesis formation: Making educated guesses about the problem
- Debugging workflow: Step-by-step investigation process
Running the code
You are given only the kernel, without the complete code:
fn add_10(
result: UnsafePointer[Scalar[dtype]], input: UnsafePointer[Scalar[dtype]]
):
i = thread_idx.x
result[i] = input[i] + 10.0
For firsthand experience, run the following command in your terminal (pixi only):
pixi run p09 --first-case
You’ll see output like the following when the program crashes:
CUDA call failed: CUDA_ERROR_ILLEGAL_ADDRESS (an illegal memory access was encountered)
[24326:24326:20250801,180816.333593:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq: No such file or directory (2)
[24326:24326:20250801,180816.333653:ERROR file_io_posix.cc:144] open /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: No such file or directory (2)
Please submit a bug report to https://github.com/modular/modular/issues and include the crash backtrace along with all the relevant source codes.
Stack dump:
0. Program arguments: /home/ubuntu/workspace/mojo-gpu-puzzles/.pixi/envs/default/bin/mojo problems/p09/p09.mojo
Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var `LLVM_SYMBOLIZER_PATH` to point to it):
0 mojo 0x0000653a338d3d2b
1 mojo 0x0000653a338d158a
2 mojo 0x0000653a338d48d7
3 libc.so.6 0x00007cbc08442520
4 libc.so.6 0x00007cbc0851e88d syscall + 29
5 libAsyncRTMojoBindings.so 0x00007cbc0ab68653
6 libc.so.6 0x00007cbc08442520
7 libc.so.6 0x00007cbc084969fc pthread_kill + 300
8 libc.so.6 0x00007cbc08442476 raise + 22
9 libc.so.6 0x00007cbc084287f3 abort + 211
10 libAsyncRTMojoBindings.so 0x00007cbc097c7c7b
11 libAsyncRTMojoBindings.so 0x00007cbc097c7c9e
12 (error) 0x00007cbb5c00600f
mojo crashed!
Please file a bug report.
Your Task: Detective Work
Challenge: Without looking at the code yet, what would be your debugging strategy to investigate this crash?
Start with:
pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --first-case
Tips
- Read the crash message carefully - `CUDA_ERROR_ILLEGAL_ADDRESS` means the GPU tried to access invalid memory
- Check the breakpoint information - Look at the function parameters shown when CUDA-GDB stops
- Inspect all pointers systematically - Use `print` to examine each pointer parameter (see the session sketch after this list)
- Look for suspicious addresses - Valid GPU addresses are typically large hex numbers (what does `0x0` mean?)
- Test memory access - Try accessing the data through each pointer to see which one fails
- Apply the systematic approach - Like a detective, follow the evidence from symptom to root cause
- Compare valid vs invalid patterns - If one pointer works and another doesn’t, focus on the broken one
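As a concrete starting point, the sketch below shows the kind of CUDA-GDB commands you might issue once `--break-on-launch` stops at the kernel. The parameter names `result` and `input` come from the kernel shown above; the lines starting with `#` are explanatory comments rather than commands to type, and the exact values will differ on your machine.

# Start the program; --break-on-launch stops at kernel entry
run
# Step to the line that performs the memory access
next
# Inspect each pointer parameter and compare their addresses
print result
print input
# Test a read through each pointer - the one that fails is the culprit
print result[0]
print input[0]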
💡 Investigation & Solution
Step-by-Step Investigation with CUDA-GDB
Launch the Debugger
pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --first-case
Examine the Breakpoint Information
When CUDA-GDB stops, it immediately shows valuable clues:
(cuda-gdb) run
CUDA thread hit breakpoint, p09_add_10_... (result=0x302000000, input=0x0)
at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:31
31 i = thread_idx.x
🔍 First Clue: The function signature shows `(result=0x302000000, input=0x0)`
- `result` has a valid GPU memory address
- `input` is `0x0` - this is a null pointer!
Systematic Variable Inspection
(cuda-gdb) next
32 result[i] = input[i] + 10.0
(cuda-gdb) print i
$1 = 0
(cuda-gdb) print result
$2 = (!pop.scalar<f32> * @register) 0x302000000
(cuda-gdb) print input
$3 = (!pop.scalar<f32> * @register) 0x0
🔍 Evidence Gathering:
- ✅ Thread index `i=0` is valid
- ✅ Result pointer `0x302000000` is a proper GPU address
- ❌ Input pointer `0x0` is null
Confirm the Problem
(cuda-gdb) print input[i]
Cannot access memory at address 0x0
💥 Smoking Gun: Cannot access memory at null address - this confirms the crash cause!
Root Cause Analysis
The Problem: If we now look at the code for `--first-case`, we see that the host code creates a null pointer instead of allocating proper GPU memory:
input_ptr = UnsafePointer[Scalar[dtype]]() # Creates NULL pointer!
Why This Crashes:
- `UnsafePointer[Scalar[dtype]]()` creates an uninitialized (null) pointer
- This null pointer gets passed to the GPU kernel
- When the kernel tries `input[i]`, it dereferences null → `CUDA_ERROR_ILLEGAL_ADDRESS` (a defensive host-side check is sketched after this list)
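As an aside, this class of bug can also be caught on the host before the kernel ever launches. The minimal sketch below is an illustration, not part of the puzzle’s source: it assumes Mojo’s `UnsafePointer` boolean conversion reports whether the pointer is non-null, and the guard message is made up.

from memory import UnsafePointer

alias dtype = DType.float32

def main():
    # A default-constructed UnsafePointer holds address 0x0 (null).
    input_ptr = UnsafePointer[Scalar[dtype]]()

    # Hypothetical guard: fail fast on the host instead of crashing on the GPU.
    if not input_ptr:
        raise Error("input_ptr is null - allocate a device buffer before launching")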
The Fix
Replace null pointer creation with proper buffer allocation:
# Wrong: Creates null pointer
input_ptr = UnsafePointer[Scalar[dtype]]()
# Correct: Allocates and initializes actual GPU memory for safe processing
input_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
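For context, here is a minimal sketch of what the corrected host side might look like, following the usual mojo-gpu-puzzles pattern. The `SIZE` value, the grid/block configuration, and the `DeviceContext` launch calls are assumptions for illustration, not the puzzle’s exact source.

from gpu import thread_idx
from gpu.host import DeviceContext
from memory import UnsafePointer

alias SIZE = 4  # assumed problem size
alias dtype = DType.float32

fn add_10(
    result: UnsafePointer[Scalar[dtype]], input: UnsafePointer[Scalar[dtype]]
):
    i = thread_idx.x
    result[i] = input[i] + 10.0

def main():
    with DeviceContext() as ctx:
        # Allocate and zero-fill real device memory for both buffers.
        result_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)
        input_buf = ctx.enqueue_create_buffer[dtype](SIZE).enqueue_fill(0)

        # Launch one block of SIZE threads with valid device pointers.
        ctx.enqueue_function[add_10](
            result_buf.unsafe_ptr(),
            input_buf.unsafe_ptr(),
            grid_dim=1,
            block_dim=SIZE,
        )
        ctx.synchronize()

With real device allocations behind both pointers, the `input[i]` read in the kernel touches valid memory and the illegal-address error disappears.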
Key Debugging Lessons
Pattern Recognition:
- `0x0` addresses are always null pointers
- Valid GPU addresses are large hex numbers (e.g., `0x302000000`)
Debugging Strategy:
- Read crash messages - They often hint at the problem type
- Check function parameters - CUDA-GDB shows them at breakpoint entry
- Inspect all pointers - Compare addresses to identify null/invalid ones
- Test memory access - Try dereferencing suspicious pointers
- Trace back to allocation - Find where the problematic pointer was created (see the supporting cuda-gdb commands sketched below)
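To support these steps, a few additional CUDA-GDB commands (beyond the `print` calls used earlier) are worth keeping at hand. All are standard gdb/cuda-gdb commands; the lines starting with `#` are explanatory comments, and what they report depends on your kernel and machine.

# Show the current function's arguments in one shot
info args
# List active kernels and the thread/block coordinates in focus
info cuda kernels
info cuda threads
# Show the device-side call stack leading to the faulting access
backtrace
# Dump a few words of raw memory at a suspicious address (example address from this session)
x/4xw 0x302000000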
💡 Key Insight: This type of null pointer bug is extremely common in GPU programming. The systematic CUDA-GDB investigation approach you learned here applies to debugging many other GPU memory issues, race conditions, and kernel crashes.
Next Steps: From Crashes to Silent Bugs
You’ve learned crash debugging! You can now:
- ✅ Systematically investigate GPU crashes using error messages as clues
- ✅ Identify null pointer bugs through pointer address inspection
- ✅ Use CUDA-GDB effectively for memory-related debugging
Your Next Challenge: Detective Work: Second Case
But what if your program doesn’t crash? What if it runs perfectly but produces wrong results?
The Second Case presents a completely different debugging challenge:
- ❌ No crash messages to guide you
- ❌ No obvious pointer problems to investigate
- ❌ No stack traces pointing to the issue
- ✅ Just wrong results that need systematic investigation
New skills you’ll develop:
- Logic bug detection - Finding algorithmic errors without crashes
- Pattern analysis - Using incorrect output to trace back to root causes
- Execution flow debugging - When variable inspection fails due to optimizations
The systematic investigation approach you learned here - reading clues, forming hypotheses, testing systematically - forms the foundation for debugging the more subtle logic errors ahead.