🕵 Detective Work: Third Case

Overview

You’ve learned to debug memory crashes and logic bugs. Now face the ultimate GPU debugging challenge: a barrier deadlock that causes the program to hang indefinitely - no error messages, no wrong results, just eternal silence.

The complete debugging journey:

  • First Case: Program crashes → Follow error signals → Find memory bugs
  • Second Case: Program produces wrong results → Analyze patterns → Find logic bugs
  • [Third Case]: Program hangs forever → Investigate thread states → Find coordination bugs

This advanced-level debugging challenge teaches you to investigate thread coordination failures using shared memory, LayoutTensor operations, and barrier synchronization - combining all the systematic investigation skills from the previous cases.

Prerequisites: Complete Mojo GPU Debugging Essentials, Detective Work: First Case, and Detective Work: Second Case to understand the CUDA-GDB workflow, variable inspection limitations, and systematic debugging approaches. Make sure you’ve run pixi run setup-cuda-gdb or that equivalent symlinks are available:

ln -sf /usr/local/cuda/bin/cuda-gdb-minimal $CONDA_PREFIX/bin/cuda-gdb-minimal
ln -sf /usr/local/cuda/bin/cuda-gdb-python3.12-tui $CONDA_PREFIX/bin/cuda-gdb-python3.12-tui

Key concepts

In this debugging challenge, you’ll learn about:

  • Barrier deadlock detection: Identifying when threads wait forever at synchronization points
  • Shared memory coordination: Understanding thread cooperation patterns
  • Conditional execution analysis: Debugging when some threads take different code paths
  • Thread coordination debugging: Using CUDA-GDB to analyze multi-thread synchronization failures

Running the code

Consider the following kernel, without looking at the complete code:

fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x

    # Shared memory workspace for collaborative processing
    shared_workspace = tb[dtype]().row_major[SIZE - 1]().shared().alloc()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        # Apply collaborative filter with neighbors
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
        barrier()

    # Phase 3: Final synchronization and output
    barrier()

    # Write filtered results back to output
    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](input[thread_id])


First, experience the issue firsthand by running the following command in your terminal (pixi only):

pixi run p09 --third-case

You’ll see output like this - the program hangs indefinitely:

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...
[HANGS FOREVER - Use Ctrl+C to stop]

⚠️ Warning: This program will hang and never complete. Use Ctrl+C to stop it.

Your task: detective work

Challenge: The program launches successfully but hangs during GPU computation and never returns. Without looking at the complete code, what would be your systematic approach to investigate this deadlock?

Think about:

  • What could cause a GPU kernel to never complete?
  • How would you investigate thread coordination issues?
  • What debugging strategy works when the program just “freezes” with no error messages?
  • How do you debug when threads might not be cooperating correctly?
  • How can you combine systematic investigation (First Case) with execution flow analysis (Second Case) to debug coordination failures?

Start with:

pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

GDB command shortcuts (faster debugging)

Use these abbreviations to speed up your debugging session:

Short  Full      Usage Example
r      run       (cuda-gdb) r
n      next      (cuda-gdb) n
c      continue  (cuda-gdb) c
b      break     (cuda-gdb) b 62
p      print     (cuda-gdb) p thread_id
q      quit      (cuda-gdb) q

All debugging commands below use these shortcuts for efficiency!

Tips
  1. Silent hang investigation - When programs freeze without error messages, what GPU primitives could cause infinite waiting?
  2. Thread state inspection - Use info cuda threads to see where different threads are stopped (a short command sketch follows this list)
  3. Conditional execution analysis - Check which threads execute which code paths (do all threads follow the same path?)
  4. Synchronization point investigation - Look for places where threads might need to coordinate
  5. Thread divergence detection - Are all threads at the same program location, or are some elsewhere?
  6. Coordination primitive analysis - What happens if threads don’t all participate in the same synchronization operations?
  7. Execution flow tracing - Follow the path each thread takes through conditional statements
  8. Thread ID impact analysis - How do different thread IDs affect which code paths execute?
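
For tips 2-6, the cuda-gdb commands below are a minimal starting sketch for a hang investigation. The thread coordinate (3,0,0) is only an example - substitute whichever thread you want to inspect:

(cuda-gdb) info cuda threads     # which source line is each thread stopped at?
(cuda-gdb) cuda thread (3,0,0)   # switch focus to one thread, e.g. (3,0,0)
(cuda-gdb) p thread_id           # confirm which thread is currently in focus
(cuda-gdb) bt                    # show how this thread reached its current location

If every thread is stopped at the same line, the block is progressing together; if some threads sit at a barrier while others have already moved past it, you have found the divergence to investigate.
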
💡 Investigation & Solution

Step-by-step investigation with CUDA-GDB

Phase 1: launch and initial setup

Step 1: start the debugger

pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

Step 2: analyze the hanging behavior

Before diving into debugging, let’s understand what we know:

Expected: Program completes and shows filtered results
Actual: Program hangs at "Waiting for GPU computation to complete..."

🔍 Initial Hypothesis: The GPU kernel is deadlocked - some synchronization primitive is causing threads to wait forever.

Phase 2: entering the kernel

Step 3: launch and observe kernel entry

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --third-case

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> (
    output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:52
52          input: LayoutTensor[mut=False, dtype, vector_layout],

🔍 Key Observations (see the verification commands after this list):

  • Grid: (1,1,1) - single block
  • Block: (4,1,1) - 4 threads total (0, 1, 2, 3)
  • Current thread: (0,0,0) - debugging thread 0
  • Function: collaborative_filter with shared memory operations
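
These launch parameters come straight from the breakpoint banner, but you can also query them yourself. The following are standard cuda-gdb commands (output omitted here):

(cuda-gdb) info cuda kernels    # active kernels with their grid and block dimensions
(cuda-gdb) info cuda devices    # which GPU (and SMs) the kernel is running on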

Step 4: navigate through initialization

(cuda-gdb) n
51          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
54          thread_id = thread_idx.x
(cuda-gdb) n
57          shared_workspace = tb[dtype]().row_major[SIZE-1]().shared().alloc()
(cuda-gdb) n
60          if thread_id < SIZE - 1:
(cuda-gdb) p thread_id
$1 = 0

✅ Thread 0 state: thread_id = 0, about to check condition 0 < 3 → True

Step 5: trace through phase 1

(cuda-gdb) n
61              shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
(cuda-gdb) n
60          if thread_id < SIZE - 1:
(cuda-gdb) n
62          barrier()

Phase 1 Complete: Thread 0 executed the initialization and reached the first barrier.

Phase 3: the critical barrier investigation

Step 6: examine the first barrier

(cuda-gdb) n
65          if thread_id < SIZE - 1:
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffd3272180 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    65

✅ Good: All 4 threads are at line 65 (after the first barrier). The first barrier worked correctly.

🔍 Critical Point: Now we’re entering Phase 2 with another conditional statement.

Step 7: trace through phase 2 - thread 0 perspective

(cuda-gdb) n
67              if thread_id > 0:

Thread 0 Analysis: 0 < 3 → True → Thread 0 enters the Phase 2 block

(cuda-gdb) n
69              barrier()

Thread 0 Path: 0 > 0 → False → Thread 0 skips the inner computation but reaches the barrier at line 69

CRITICAL MOMENT: Thread 0 is now waiting at the barrier on line 69.

(cuda-gdb) n # <-- if you run this, the program hangs!
[HANGS HERE - Program never proceeds beyond this point]
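
If you do step or continue past this point and the session appears frozen, you don’t have to kill the debugger. As a general cuda-gdb workflow (not part of the transcript above), you can interrupt and re-inspect the device state:

(cuda-gdb) n                   # or c - either way the kernel deadlocks and the prompt never returns
^C                             # press Ctrl+C to interrupt and get the (cuda-gdb) prompt back
(cuda-gdb) info cuda threads   # the stuck threads should still be parked at the barrier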

Step 8: investigate other threads

(cuda-gdb) cuda thread (1,0,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
69              barrier()
(cuda-gdb) p thread_id
$2 = 1
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (2,0,0)     3 0x00007fffd3273aa0 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    69
   (0,0,0)   (3,0,0)     (0,0,0)      (3,0,0)     1 0x00007fffd3273b10 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    72

SMOKING GUN DISCOVERED:

  • Threads 0, 1, 2: All waiting at line 69 (barrier inside the conditional block)
  • Thread 3: At line 72 (after the conditional block, never reached the barrier!)

Step 9: analyze thread 3’s execution path

🔍 Thread 3 Analysis from the info output (a direct cross-check follows this list):

  • Thread 3: Located at line 72 (PC: 0x00007fffd3273b10)
  • Phase 2 condition: thread_id < SIZE - 1 → 3 < 3 → False
  • Result: Thread 3 NEVER entered the Phase 2 block (lines 65-69)
  • Consequence: Thread 3 NEVER reached the barrier at line 69
  • Current state: Thread 3 is at line 72 (final barrier), while threads 0,1,2 are stuck at line 69
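
As an optional cross-check (not shown in the transcript above), switch focus to thread 3 and confirm its state directly:

(cuda-gdb) cuda thread (3,0,0)   # focus on the one thread that skipped the Phase 2 block
(cuda-gdb) p thread_id           # should print 3

Its reported source line should match line 72 from the info cuda threads output, confirming that it bypassed the barrier at line 69.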

Phase 4: root cause analysis

Step 10: deadlock mechanism identified

# Phase 2: Collaborative processing
if thread_id < SIZE - 1:        # ← Only threads 0, 1, 2 enter this block
    # Apply collaborative filter with neighbors
    if thread_id > 0:
        shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    barrier()                   # ← DEADLOCK: Only 3 out of 4 threads reach here!

💀 Deadlock Mechanism:

  1. Thread 0: 0 < 3 → True → Enters block → Waits at barrier (line 69)
  2. Thread 1: 1 < 3 → True → Enters block → Waits at barrier (line 69)
  3. Thread 2: 2 < 3 → True → Enters block → Waits at barrier (line 69)
  4. Thread 3: 3 < 3 → False → NEVER enters block → Continues to line 72

Result: 3 threads wait forever for the 4th thread, but thread 3 never arrives at the barrier.

Phase 5: bug confirmation and solution

Step 11: the fundamental barrier rule violation

GPU Barrier Rule: ALL threads in a thread block must reach the SAME barrier for synchronization to complete.

What went wrong:

# ❌ WRONG: Barrier inside conditional
if thread_id < SIZE - 1:    # Not all threads enter
    # ... some computation ...
    barrier()               # Only some threads reach this

# ✅ CORRECT: Barrier outside conditional
if thread_id < SIZE - 1:    # Not all threads enter
    # ... some computation ...
barrier()                   # ALL threads reach this

The Fix: Move the barrier outside the conditional block:

fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x
    shared_workspace = tb[dtype]().row_major[SIZE-1]().shared().alloc()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    # ✅ FIX: Move barrier outside conditional so ALL threads reach it
    barrier()

    # Phase 3: Final synchronization and output
    barrier()

    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](input[thread_id])

Key debugging lessons

Barrier deadlock detection:

  1. Use info cuda threads - Shows which threads are at which lines
  2. Look for thread state divergence - Some threads at different program locations
  3. Trace conditional execution paths - Check if all threads reach the same barriers
  4. Verify barrier reachability - Ensure no thread can skip a barrier that others reach

Professional GPU debugging reality:

  • Deadlocks are silent killers - programs just hang with no error messages
  • Thread coordination debugging requires patience - systematic analysis of each thread’s path
  • Conditional barriers are among the most common deadlock causes - always verify that all threads reach the same sync points
  • CUDA-GDB thread inspection is essential - often the only practical way to see thread coordination failures as they happen

Advanced GPU synchronization:

  • Barrier rule: ALL threads in a block must reach the SAME barrier
  • Conditional execution pitfalls: Any if-statement can cause thread divergence
  • Shared memory coordination: Requires careful barrier placement for correct synchronization
  • LayoutTensor doesn’t prevent deadlocks: Higher-level abstractions still need correct synchronization

💡 Key Insight: Barrier deadlocks are among the hardest GPU bugs to debug because:

  • No visible error - just infinite waiting
  • Requires multi-thread analysis - can’t debug by examining one thread
  • Silent failure mode - looks like performance issue, not correctness bug
  • Complex thread coordination - need to trace execution paths across all threads

This type of debugging - using CUDA-GDB to analyze thread states, identify divergent execution paths, and verify barrier reachability - is exactly what professional GPU developers do when facing deadlock issues in production systems.

Next steps: GPU debugging skills complete

You’ve completed the GPU debugging trilogy!

Your complete GPU debugging arsenal

From the First Case - Crash debugging:

  • Systematic crash investigation using error messages as guides
  • Memory bug detection through pointer address inspection
  • CUDA-GDB fundamentals for memory-related issues

From the Second Case - Logic bug debugging:

  • Algorithm error investigation without obvious symptoms
  • Pattern analysis techniques for tracing wrong results to root causes
  • Execution flow debugging when variable inspection fails

From the Third Case - Coordination debugging:

  • Barrier deadlock investigation for thread coordination failures
  • Multi-thread state analysis using advanced CUDA-GDB techniques
  • Synchronization verification for complex parallel programs

The professional GPU debugging methodology

You’ve learned the systematic approach used by professional GPU developers:

  1. Read the symptoms - Crashes? Wrong results? Infinite hangs?
  2. Form hypotheses - Memory issue? Logic error? Coordination problem?
  3. Gather evidence - Use CUDA-GDB strategically based on the bug type
  4. Test systematically - Verify each hypothesis through targeted investigation
  5. Trace to root cause - Follow the evidence chain to the source

Achievement Unlocked: You can now debug the three most common GPU programming issues:

  • Memory crashes (First Case) - null pointers, out-of-bounds access
  • Logic bugs (Second Case) - algorithmic errors, incorrect results
  • Coordination deadlocks (Third Case) - barrier synchronization failures