# 🕵 Detective Work: Third Case

## Overview
You’ve learned to debug memory crashes and logic bugs. Now face the ultimate GPU debugging challenge: a barrier deadlock that causes the program to hang indefinitely - no error messages, no wrong results, just eternal silence.
The complete debugging journey:
- First Case: Program crashes → Follow error signals → Find memory bugs
- Second Case: Program produces wrong results → Analyze patterns → Find logic bugs
- **Third Case**: Program hangs forever → Investigate thread states → Find coordination bugs
This advanced-level debugging challenge teaches you to investigate thread coordination failures using shared memory, LayoutTensor operations, and barrier synchronization - combining all the systematic investigation skills from the previous cases.
Prerequisites: Complete Mojo GPU Debugging Essentials, Detective Work: First Case, and Detective Work: Second Case to understand the CUDA-GDB workflow, variable inspection limitations, and systematic debugging approaches. Make sure you’ve run `pixi run setup-cuda-gdb` or that equivalent symlinks are available:

```bash
ln -sf /usr/local/cuda/bin/cuda-gdb-minimal $CONDA_PREFIX/bin/cuda-gdb-minimal
ln -sf /usr/local/cuda/bin/cuda-gdb-python3.12-tui $CONDA_PREFIX/bin/cuda-gdb-python3.12-tui
```
## Key concepts
In this debugging challenge, you’ll learn about:
- **Barrier deadlock detection**: Identifying when threads wait forever at synchronization points
- **Shared memory coordination**: Understanding thread cooperation patterns
- **Conditional execution analysis**: Debugging when some threads take different code paths
- **Thread coordination debugging**: Using CUDA-GDB to analyze multi-thread synchronization failures
## Running the code

Given only the kernel below, and without looking at the complete code:
```mojo
fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x

    # Shared memory workspace for collaborative processing
    shared_workspace = tb[dtype]().row_major[SIZE - 1]().shared().alloc()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])

    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        # Apply collaborative filter with neighbors
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5

        barrier()

    # Phase 3: Final synchronization and output
    barrier()

    # Write filtered results back to output
    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](input[thread_id])
```
To experience the issue firsthand, run the following command in your terminal (pixi only):

```bash
pixi run p09 --third-case
```
You’ll see output like this - the program hangs indefinitely:
```
Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed
Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...
[HANGS FOREVER - Use Ctrl+C to stop]
```

⚠️ Warning: This program will hang and never complete. Use Ctrl+C to stop it.
## Your task: detective work
Challenge: The program launches successfully but hangs during GPU computation and never returns. Without looking at the complete code, what would be your systematic approach to investigate this deadlock?
Think about:
- What could cause a GPU kernel to never complete?
- How would you investigate thread coordination issues?
- What debugging strategy works when the program just “freezes” with no error messages?
- How do you debug when threads might not be cooperating correctly?
- How can you combine systematic investigation (First Case) with execution flow analysis (Second Case) to debug coordination failures?
Start with:

```bash
pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case
```
## GDB command shortcuts (faster debugging)
Use these abbreviations to speed up your debugging session:
| Short | Full | Usage Example |
|---|---|---|
| `r` | `run` | `(cuda-gdb) r` |
| `n` | `next` | `(cuda-gdb) n` |
| `c` | `continue` | `(cuda-gdb) c` |
| `b` | `break` | `(cuda-gdb) b 62` |
| `p` | `print` | `(cuda-gdb) p thread_id` |
| `q` | `quit` | `(cuda-gdb) q` |
All debugging commands below use these shortcuts for efficiency!
## Tips

- **Silent hang investigation** - When programs freeze without error messages, what GPU primitives could cause infinite waiting?
- **Thread state inspection** - Use `info cuda threads` to see where different threads are stopped (an illustrative example follows this list)
- **Conditional execution analysis** - Check which threads execute which code paths (do all threads follow the same path?)
- **Synchronization point investigation** - Look for places where threads might need to coordinate
- **Thread divergence detection** - Are all threads at the same program location, or are some elsewhere?
- **Coordination primitive analysis** - What happens if threads don’t all participate in the same synchronization operations?
- **Execution flow tracing** - Follow the path each thread takes through conditional statements
- **Thread ID impact analysis** - How do different thread IDs affect which code paths execute?
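If you haven’t used `info cuda threads` before, its output looks roughly like the sketch below (illustrative only - the exact columns, addresses, and paths depend on your CUDA version and program). Each row groups consecutive threads that share the same PC, and the Count column tells you how many threads are in that group; more than one row for a kernel means the threads have diverged:

```
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count       PC        Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)     (3,0,0)     4  0x00007fff...  p09.mojo    60
```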
## 💡 Investigation & Solution

### Step-by-step investigation with CUDA-GDB

#### Phase 1: launch and initial setup

**Step 1: start the debugger**

```bash
pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case
```
**Step 2: analyze the hanging behavior**

Before diving into debugging, let’s understand what we know:

- Expected: Program completes and shows filtered results
- Actual: Program hangs at "Waiting for GPU computation to complete..."

🔍 Initial Hypothesis: The GPU kernel is deadlocked - some synchronization primitive is causing threads to wait forever.
#### Phase 2: entering the kernel

**Step 3: launch and observe kernel entry**

```
(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --third-case
Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed
Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> (
    output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:52
52          input: LayoutTensor[mut=False, dtype, vector_layout],
```
🔍 Key Observations:
- Grid: (1,1,1) - single block
- Block: (4,1,1) - 4 threads total (0, 1, 2, 3)
- Current thread: (0,0,0) - debugging thread 0
- Function: collaborative_filter with shared memory operations
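If you want to double-check the launch configuration from inside the debugger, `info cuda kernels` lists the active kernels along with their grid and block dimensions. The output below is only a rough sketch of what to expect - the exact columns vary between CUDA versions:

```
(cuda-gdb) info cuda kernels
  Kernel Parent Dev Grid Status   SMs Mask   GridDim  BlockDim  Invocation
*      0      -   0    1 Active 0x00000001  (1,1,1)   (4,1,1)   ...
```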
**Step 4: navigate through initialization**

```
(cuda-gdb) n
51          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
54      thread_id = thread_idx.x
(cuda-gdb) n
57      shared_workspace = tb[dtype]().row_major[SIZE-1]().shared().alloc()
(cuda-gdb) n
60      if thread_id < SIZE - 1:
(cuda-gdb) p thread_id
$1 = 0
```

✅ Thread 0 state: `thread_id = 0`, about to check the condition `0 < 3` → True
**Step 5: trace through phase 1**

```
(cuda-gdb) n
61          shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
(cuda-gdb) n
60      if thread_id < SIZE - 1:
(cuda-gdb) n
62      barrier()
```

Phase 1 Complete: Thread 0 executed the initialization and reached the first barrier.
#### Phase 3: the critical barrier investigation

**Step 6: examine the first barrier**

```
(cuda-gdb) n
65      if thread_id < SIZE - 1:
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                      Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)     (3,0,0)     4 0x00007fffd3272180 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    65
```

✅ Good: All 4 threads are at line 65 (after the first barrier). The first barrier worked correctly.

🔍 Critical Point: Now we’re entering Phase 2 with another conditional statement.
**Step 7: trace through phase 2 - thread 0 perspective**

```
(cuda-gdb) n
67          if thread_id > 0:
```

Thread 0 Analysis: `0 < 3` → True → Thread 0 enters the Phase 2 block

```
(cuda-gdb) n
69          barrier()
```

Thread 0 Path: `0 > 0` → False → Thread 0 skips the inner computation but still reaches the barrier at line 69

CRITICAL MOMENT: Thread 0 is now waiting at the barrier on line 69.

```
(cuda-gdb) n   # <-- if you run this, the program hangs!
[HANGS HERE - Program never proceeds beyond this point]
```
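If you accidentally step past this point and the session appears frozen, pressing Ctrl+C inside the debugger should interrupt the hung kernel and return you to the `(cuda-gdb)` prompt, from which you can still inspect thread state - this is general CUDA-GDB behavior, not output captured from this exact session:

```
^C
(cuda-gdb) info cuda threads
```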
**Step 8: investigate other threads**

```
(cuda-gdb) cuda thread (1,0,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
69          barrier()
(cuda-gdb) p thread_id
$2 = 1
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                      Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)     (2,0,0)     3 0x00007fffd3273aa0 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    69
   (0,0,0)   (3,0,0)     (0,0,0)     (3,0,0)     1 0x00007fffd3273b10 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    72
```

SMOKING GUN DISCOVERED:

- Threads 0, 1, 2: All waiting at line 69 (the barrier inside the conditional block)
- Thread 3: At line 72 (after the conditional block - it never reached the barrier!)
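You can make the divergence even more concrete by switching focus to thread 3 itself. The session fragment below is illustrative - the `$3` history number and the exact focus line will differ in your run:

```
(cuda-gdb) cuda thread (3,0,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (3,0,0), device 0, sm 0, warp 0, lane 3]
72      barrier()
(cuda-gdb) p thread_id
$3 = 3
```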
**Step 9: analyze thread 3’s execution path**

🔍 Thread 3 Analysis from the `info cuda threads` output:

- Location: line 72 (PC: 0x00007fffd3273b10)
- Phase 2 condition: `thread_id < SIZE - 1` → `3 < 3` → False
- Result: Thread 3 NEVER entered the Phase 2 block (lines 65-69)
- Consequence: Thread 3 NEVER reached the barrier at line 69
- Current state: Thread 3 is at line 72 (the final barrier), while threads 0, 1, 2 are stuck at line 69
#### Phase 4: root cause analysis

**Step 10: deadlock mechanism identified**

```mojo
    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:  # ← Only threads 0, 1, 2 enter this block
        # Apply collaborative filter with neighbors
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5

        barrier()  # ← DEADLOCK: Only 3 out of 4 threads reach here!
```

💀 Deadlock Mechanism:

- Thread 0: `0 < 3` → True → Enters block → Waits at barrier (line 69)
- Thread 1: `1 < 3` → True → Enters block → Waits at barrier (line 69)
- Thread 2: `2 < 3` → True → Enters block → Waits at barrier (line 69)
- Thread 3: `3 < 3` → False → NEVER enters block → Continues to line 72

Result: Three threads wait forever for the fourth, but thread 3 never arrives at the barrier.
#### Phase 5: bug confirmation and solution

**Step 11: the fundamental barrier rule violation**

GPU Barrier Rule: ALL threads in a thread block must reach the SAME barrier for synchronization to complete.

What went wrong:

```mojo
# ❌ WRONG: Barrier inside conditional
if thread_id < SIZE - 1:  # Not all threads enter
    # ... some computation ...
    barrier()             # Only some threads reach this

# ✅ CORRECT: Barrier outside conditional
if thread_id < SIZE - 1:  # Not all threads enter
    # ... some computation ...
barrier()                 # ALL threads reach this
```
The Fix: Move the barrier outside the conditional block:

```mojo
fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x
    shared_workspace = tb[dtype]().row_major[SIZE - 1]().shared().alloc()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])

    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5

    # ✅ FIX: Move barrier outside conditional so ALL threads reach it
    barrier()

    # Phase 3: Final synchronization and output
    barrier()

    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](input[thread_id])
```
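To confirm the fix, re-run the problem without the debugger; with every thread now reaching each `barrier()`, the kernel should complete and the program should print the filtered results instead of hanging (the exact output values are not reproduced here):

```bash
pixi run p09 --third-case
```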
## Key debugging lessons

Barrier deadlock detection:

- Use `info cuda threads` - Shows which threads are at which lines
- Look for thread state divergence - Some threads at different program locations
- Trace conditional execution paths - Check if all threads reach the same barriers
- Verify barrier reachability - Ensure no thread can skip a barrier that others reach
Professional GPU debugging reality:
- Deadlocks are silent killers - programs just hang with no error messages
- Thread coordination debugging requires patience - systematic analysis of each thread’s path
- Conditional barriers are the #1 deadlock cause - always verify all threads reach the same sync points
- CUDA-GDB thread inspection is essential - the only way to see thread coordination failures
Advanced GPU synchronization:
- Barrier rule: ALL threads in a block must reach the SAME barrier
- Conditional execution pitfalls: Any if-statement can cause thread divergence
- Shared memory coordination: Requires careful barrier placement for correct synchronization
- LayoutTensor doesn’t prevent deadlocks: Higher-level abstractions still need correct synchronization
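As a rule of thumb: keep the work conditional, but keep the synchronization unconditional. The fragment below is a minimal sketch (not part of this puzzle) that reuses names from the kernel above - `thread_id`, `shared_workspace`, `SIZE`, and `barrier()` - plus an assumed compile-time constant `NUM_PASSES`; it only illustrates barrier placement, not race-free neighbor access:

```mojo
    # Hypothetical multi-pass smoothing loop: the update is guarded,
    # but every thread executes the same barrier() on every pass.
    for _ in range(NUM_PASSES):  # NUM_PASSES: assumed constant, not from the puzzle
        if thread_id > 0 and thread_id < SIZE - 1:
            # Only interior threads update their element...
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
        # ...but ALL threads reach this barrier, so no deadlock is possible.
        barrier()
```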
💡 Key Insight: Barrier deadlocks are among the hardest GPU bugs to debug because:
- No visible error - just infinite waiting
- Requires multi-thread analysis - can’t debug by examining one thread
- Silent failure mode - looks like performance issue, not correctness bug
- Complex thread coordination - need to trace execution paths across all threads
This type of debugging - using CUDA-GDB to analyze thread states, identify divergent execution paths, and verify barrier reachability - is exactly what professional GPU developers do when facing deadlock issues in production systems.
## Next steps: GPU debugging skills complete

You’ve completed the GPU debugging trilogy!

### Your complete GPU debugging arsenal
From the First Case - Crash debugging:
- ✅ Systematic crash investigation using error messages as guides
- ✅ Memory bug detection through pointer address inspection
- ✅ CUDA-GDB fundamentals for memory-related issues
From the Second Case - Logic bug debugging:
- ✅ Algorithm error investigation without obvious symptoms
- ✅ Pattern analysis techniques for tracing wrong results to root causes
- ✅ Execution flow debugging when variable inspection fails
From the Third Case - Coordination debugging:
- ✅ Barrier deadlock investigation for thread coordination failures
- ✅ Multi-thread state analysis using advanced CUDA-GDB techniques
- ✅ Synchronization verification for complex parallel programs
### The professional GPU debugging methodology
You’ve learned the systematic approach used by professional GPU developers:
1. Read the symptoms - Crashes? Wrong results? Infinite hangs?
2. Form hypotheses - Memory issue? Logic error? Coordination problem?
3. Gather evidence - Use CUDA-GDB strategically based on the bug type
4. Test systematically - Verify each hypothesis through targeted investigation
5. Trace to root cause - Follow the evidence chain to the source
Achievement Unlocked: You can now debug the three most common GPU programming issues:
- Memory crashes (First Case) - null pointers, out-of-bounds access
- Logic bugs (Second Case) - algorithmic errors, incorrect results
- Coordination deadlocks (Third Case) - barrier synchronization failures