🔍 Detective Work: Second Case

Overview

Building on your crash debugging skills from the First Case, you’ll now face a completely different challenge: a logic bug that produces incorrect results without crashing.

The debugging shift:

  • First Case: Clear crash signals (CUDA_ERROR_ILLEGAL_ADDRESS) guided your investigation
  • Second Case: No crashes, no error messages - just subtly wrong results that require detective work

This intermediate-level debugging challenge teaches you to investigate algorithmic errors in a LayoutTensor-based kernel, where the program runs successfully but produces wrong output - a much more common (and trickier) real-world debugging scenario.

Prerequisites: Complete Mojo GPU Debugging Essentials and Detective Work: First Case to understand the CUDA-GDB workflow and systematic debugging techniques. Make sure you've run pixi run setup-cuda-gdb, or that the equivalent symlinks are available:

ln -sf /usr/local/cuda/bin/cuda-gdb-minimal $CONDA_PREFIX/bin/cuda-gdb-minimal
ln -sf /usr/local/cuda/bin/cuda-gdb-python3.12-tui $CONDA_PREFIX/bin/cuda-gdb-python3.12-tui
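
To confirm the symlinks resolved, a quick sanity check is to list them (a hypothetical check; adjust the path if your setup differs):

ls -l $CONDA_PREFIX/bin/cuda-gdb-*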

Key concepts

In this debugging challenge, you’ll learn about:

  • LayoutTensor debugging: Investigating structured data access patterns
  • Logic bug detection: Finding algorithmic errors that don’t crash
  • Loop boundary analysis: Understanding iteration count problems
  • Result pattern analysis: Using output data to trace back to root causes

Running the code

Given only the kernel below, and without looking at the complete code:

fn process_sliding_window(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x

    # Each thread processes a sliding window of 3 elements
    window_sum = Scalar[dtype](0.0)

    # Sum elements in sliding window: [i-1, i, i+1]
    for offset in range(ITER):
        idx = thread_id + offset - 1
        if 0 <= idx < SIZE:
            value = rebind[Scalar[dtype]](input[idx])
            window_sum += value

    output[thread_id] = window_sum


First, experience the bug firsthand by running the following command in your terminal (pixi only):

pixi run p09 --second-case

You’ll see output like this - no crash, but wrong results:

This program computes sliding window sums for each position...

Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]
Actual result: HostBuffer([0.0, 1.0, 3.0, 5.0])
Expected: [1.0, 3.0, 6.0, 5.0]
❌ Test FAILED - Sliding window sums are incorrect!
Check the window indexing logic...

Your task: detective work

Challenge: The program runs without crashing but produces consistently wrong results. Without looking at the code, what would be your systematic approach to investigate this logic bug?

Think about:

  • What pattern do you see in the wrong results?
  • How would you investigate a loop that might not be running correctly?
  • What debugging strategy works when you can’t inspect variables directly?
  • How can you apply the systematic investigation approach from First Case when there are no crash signals to guide you?

Start with:

pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

GDB command shortcuts (faster debugging)

Use these abbreviations to speed up your debugging session:

Short   Full       Usage Example
r       run        (cuda-gdb) r
n       next       (cuda-gdb) n
c       continue   (cuda-gdb) c
b       break      (cuda-gdb) b 39
p       print      (cuda-gdb) p thread_id
q       quit       (cuda-gdb) q

All debugging commands below use these shortcuts for efficiency!

Tips
  1. Pattern analysis first - Look at the relationship between expected and actual results (what’s the mathematical pattern in the differences?)
  2. Focus on execution flow - Count loop iterations when variables aren’t accessible
  3. Use simple breakpoints - Complex debugging commands often fail with optimized code
  4. Mathematical reasoning - Work out what each thread should access vs what it actually accesses (see the host-side sketch after this list)
  5. Missing data investigation - If results are consistently smaller than expected, what might be missing?
  6. Host output verification - The final results often reveal the pattern of the bug
  7. Algorithm boundary analysis - Check if loops are processing the right number of elements
  8. Cross-validate with working cases - Why does thread 3 work correctly but others don’t?
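
To ground tips 1 and 4, it helps to compute the expected output with straight-line host code before opening the debugger. Here is a minimal sketch - a hypothetical helper, not part of the puzzle code - assuming the input [0, 1, 2, 3] and a window of size 3:

fn print_expected_window_sums():
    # Host-side reference: what each GPU thread SHOULD produce.
    var input = List[Float32](0.0, 1.0, 2.0, 3.0)
    for i in range(len(input)):
        var window_sum: Float32 = 0
        for offset in range(3):        # window of 3 → offsets 0, 1, 2
            idx = i + offset - 1       # covers [i-1, i, i+1]
            if 0 <= idx < len(input):  # skip out-of-bounds neighbors
                window_sum += input[idx]
        print("position", i, "expected:", window_sum)

Running this prints 1.0, 3.0, 6.0, 5.0 - the expected values from the failing test - giving you a trusted reference before you chase the kernel.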
💡 Investigation & Solution

Step-by-step investigation with CUDA-GDB

Phase 1: Launch and initial analysis

Step 1: Start the debugger

pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --second-case

Step 2: Analyze the symptoms first

Before diving into the debugger, let’s examine what we know:

Actual result: [0.0, 1.0, 3.0, 5.0]
Expected: [1.0, 3.0, 6.0, 5.0]

🔍 Pattern Recognition:

  • Thread 0: Got 0.0, Expected 1.0 → Missing 1.0
  • Thread 1: Got 1.0, Expected 3.0 → Missing 2.0
  • Thread 2: Got 3.0, Expected 6.0 → Missing 3.0
  • Thread 3: Got 5.0, Expected 5.0 → ✅ Correct

Initial Hypothesis: The missing amount for thread i is exactly input[i + 1] (1.0, 2.0, 3.0), so each thread appears to be missing its right neighbor. Thread 3's right neighbor (input[4]) is out of bounds anyway, which would explain why it alone looks correct.

Phase 2: Entering the kernel

Step 3: Observe the breakpoint entry

Based on the real debugging session, here’s what happens:

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --second-case

This program computes sliding window sums for each position...
Input array: [0, 1, 2, 3]
Computing sliding window sums (window size = 3)...
Each position should sum its neighbors: [left + center + right]

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:30
30          input: LayoutTensor[mut=False, dtype, vector_layout],

Step 4: Navigate to the main logic

(cuda-gdb) n
29          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
32          thread_id = thread_idx.x
(cuda-gdb) n
38          for offset in range(ITER):

Step 5: Test variable accessibility - crucial discovery

(cuda-gdb) p thread_id
$1 = 0

✅ Good: Thread ID is accessible.

(cuda-gdb) p window_sum
Cannot access memory at address 0x0

❌ Problem: window_sum is not accessible.

(cuda-gdb) p input[0]
Attempt to take address of value not located in memory.

❌ Problem: Direct LayoutTensor indexing doesn’t work.

(cuda-gdb) p input.ptr[0]
$2 = {0}
(cuda-gdb) p input.ptr[0]@4
$3 = {{0}, {1}, {2}, {3}}

🎯 BREAKTHROUGH: input.ptr[0]@4 shows the full input array! This is how we can inspect LayoutTensor data.
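
The same .ptr escape hatch works for any LayoutTensor parameter while the kernel is still resident, for example:

(cuda-gdb) p output.ptr[0]@4

(Try it after the kernel completes and you'll get "No symbol ... in current context" instead, as Step 10 shows.)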

Phase 3: The critical loop investigation

Step 6: Set up loop monitoring

(cuda-gdb) b 39
Breakpoint 1 at 0x7fffd326ffd0: file problems/p09/p09.mojo, line 39.
(cuda-gdb) c
Continuing.

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
   <<<(1,1,1),(4,1,1)>>> (output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:39
39              idx = thread_id + offset - 1

🔍 We’re now inside the loop body. Let’s count iterations manually.

Step 7: First loop iteration (offset = 0)

(cuda-gdb) n
40              if 0 <= idx < SIZE:
(cuda-gdb) n
38          for offset in range(ITER):

First iteration complete: The loop went from line 39 → 40 → back to 38, skipping the if-body (lines 41-42): for thread 0, idx = 0 + 0 - 1 = -1 fails the bounds check. The loop continues.

Step 8: Second loop iteration (offset = 1)

(cuda-gdb) n

CUDA thread hit Breakpoint 1, p09_process_sliding_window_...
39              idx = thread_id + offset - 1
(cuda-gdb) n
40              if 0 <= idx < SIZE:
(cuda-gdb) n
41                  value = rebind[Scalar[dtype]](input[idx])
(cuda-gdb) n
42                  window_sum += value
(cuda-gdb) n
40              if 0 <= idx < SIZE:
(cuda-gdb) n
38          for offset in range(ITER):

Second iteration complete: This time it went through the if-block (lines 41-42): idx = 0 + 1 - 1 = 0 passes the bounds check, so window_sum += input[0].

Step 9: Test for a third iteration

(cuda-gdb) n
44          output[thread_id] = window_sum

CRITICAL DISCOVERY: The loop exited after only 2 iterations! It went directly to line 44 instead of hitting our breakpoint at line 39 again.

Conclusion: The loop ran exactly 2 iterations and then exited.
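
Stepping through each iteration works, but standard GDB breakpoint bookkeeping, which cuda-gdb inherits, can count iterations faster. A sketch, assuming default breakpoint behavior; note that the hit count aggregates across all threads that reach the line:

(cuda-gdb) ignore 1 100          # silently skip breakpoint 1 the next 100 times
(cuda-gdb) c                     # run the kernel to completion
(cuda-gdb) info breakpoints      # "breakpoint already hit N times" gives the total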

Step 10: Complete kernel execution and context loss

(cuda-gdb) n
28      fn process_sliding_window(
(cuda-gdb) n
[Switching to Thread 0x7ffff7cc0e00 (LWP 110927)]
0x00007ffff064f84a in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(cuda-gdb) p output.ptr[0]@4
No symbol "output" in current context.
(cuda-gdb) p offset
No symbol "offset" in current context.

🔍 Context Lost: After kernel completion, we lose access to kernel variables. This is normal behavior.
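
If you need the device-side results, set a breakpoint at the final store (line 44 in this session) before the kernel returns, then inspect the output tensor with the same .ptr trick while the kernel is still resident (a sketch; with 4 threads you may need a few more c commands before every element is written):

(cuda-gdb) b 44
(cuda-gdb) c
(cuda-gdb) p output.ptr[0]@4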

Phase 4: Root cause analysis

Step 11: Algorithm analysis from observed execution

From our debugging session, we observed:

  1. Loop Iterations: Only 2 iterations (offset = 0, offset = 1)
  2. Expected: A sliding window of size 3 should require 3 iterations (offset = 0, 1, 2)
  3. Missing: The third iteration (offset = 2)

Looking at what each thread should compute:

  • Thread 0: window_sum = input[-1] + input[0] + input[1] = (boundary) + 0 + 1 = 1.0
  • Thread 1: window_sum = input[0] + input[1] + input[2] = 0 + 1 + 2 = 3.0
  • Thread 2: window_sum = input[1] + input[2] + input[3] = 1 + 2 + 3 = 6.0
  • Thread 3: window_sum = input[2] + input[3] + input[4] = 2 + 3 + (boundary) = 5.0

Step 12: Trace the actual execution for thread 0

With only 2 iterations (offset = 0, 1):

Iteration 1 (offset = 0):

  • idx = thread_id + offset - 1 = 0 + 0 - 1 = -1
  • if 0 <= idx < SIZE → 0 <= -1 < 4 → False
  • Skip the sum operation

Iteration 2 (offset = 1):

  • idx = thread_id + offset - 1 = 0 + 1 - 1 = 0
  • if 0 <= idx < SIZE → 0 <= 0 < 4 → True
  • window_sum += input[0] → window_sum += 0

Missing Iteration 3 (offset = 2):

  • idx = thread_id + offset - 1 = 0 + 2 - 1 = 1
  • if 0 <= idx < SIZE → 0 <= 1 < 4 → True
  • window_sum += input[1] → window_sum += 1 → THIS NEVER HAPPENS

Result: Thread 0 gets window_sum = 0 instead of 0 + 1 = 1. The same reasoning explains every thread: each one misses its right neighbor input[thread_id + 1], and thread 3 only looks correct because its right neighbor (input[4]) is out of bounds and would have been skipped regardless.

Phase 5: Bug confirmation

Looking at the problem code, we find:

alias ITER = 2                       # ← BUG: Should be 3!

for offset in range(ITER):           # ← Only 2 iterations: [0, 1]
    idx = thread_id + offset - 1     # ← Missing offset = 2
    if 0 <= idx < SIZE:
        window_sum += input[idx]

🎯 ROOT CAUSE IDENTIFIED: ITER = 2 should be ITER = 3 for a sliding window of size 3.

The Fix: Change alias ITER = 2 to alias ITER = 3 in the source code.
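
With that one-line change, the loop covers all three offsets (corrected excerpt, matching the simplified listing above):

alias ITER = 3                       # window size 3 → offsets 0, 1, 2

for offset in range(ITER):
    idx = thread_id + offset - 1     # now also reaches idx = thread_id + 1
    if 0 <= idx < SIZE:
        window_sum += input[idx]

Rerunning pixi run p09 --second-case should now report [1.0, 3.0, 6.0, 5.0].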

Key debugging lessons

When Variables Are Inaccessible:

  1. Focus on execution flow - Count breakpoint hits and loop iterations
  2. Use mathematical reasoning - Work out what should happen vs what does happen
  3. Pattern analysis - Let the wrong results guide your investigation
  4. Cross-validation - Test your hypothesis against multiple data points

Professional GPU Debugging Reality:

  • Variable inspection often fails due to compiler optimizations
  • Execution flow analysis is more reliable than data inspection
  • Host output patterns provide crucial debugging clues
  • Source code reasoning complements limited debugger capabilities

LayoutTensor Debugging:

  • Even with LayoutTensor abstractions, underlying algorithmic bugs still manifest
  • Focus on the algorithm logic rather than trying to inspect tensor contents
  • Use systematic reasoning to trace what each thread should access vs what it actually accesses (a device-side print fallback is sketched below)
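
When even the .ptr trick fails, device-side printing is a complementary fallback, assuming your Mojo toolchain supports print inside GPU kernels (recent versions do). A hypothetical instrumentation of the loop, restricted to one thread to keep the output readable:

for offset in range(ITER):
    idx = thread_id + offset - 1
    if thread_id == 0:               # trace a single thread only
        print("offset:", offset, "idx:", idx)
    if 0 <= idx < SIZE:
        window_sum += rebind[Scalar[dtype]](input[idx])

With the buggy ITER = 2, thread 0 would print only offsets 0 and 1, exposing the missing third iteration without stepping at all.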

💡 Key Insight: This type of off-by-one loop bug is extremely common in GPU programming. The systematic approach you learned here - combining limited debugger info with mathematical analysis and pattern recognition - is exactly how professional GPU developers debug when tools have limitations.

Next Steps: From Logic Bugs to Coordination Deadlocks

You’ve learned logic bug debugging! You can now:

  • Investigate algorithmic errors without crashes or obvious symptoms
  • Use pattern analysis to trace wrong results back to root causes
  • Debug with limited variable access using execution flow analysis
  • Apply mathematical reasoning when debugger tools have limitations

Your Final Challenge: Detective Work: Third Case

But what if your program doesn’t crash AND doesn’t finish? What if it just hangs forever?

The Third Case presents the ultimate debugging challenge:

  • No crash messages (like First Case)
  • No wrong results (like Second Case)
  • No completion at all - just infinite hanging
  • Silent deadlock requiring advanced thread coordination analysis

New skills you’ll develop:

  • Barrier deadlock detection - Finding coordination failures in parallel threads
  • Multi-thread state analysis - Examining all threads simultaneously
  • Synchronization debugging - Understanding thread cooperation breakdowns

The debugging evolution:

  1. First Case: Follow crash signals → Find memory bugs
  2. Second Case: Analyze result patterns → Find logic bugs
  3. Third Case: Investigate thread states → Find coordination bugs

The systematic investigation skills from both previous cases - hypothesis formation, evidence gathering, pattern analysis - become crucial when debugging the most challenging GPU issue: threads that coordinate incorrectly and wait forever.