🕵 Detective Work: Third Case

Overview

You’ve learned to debug memory crashes and logic bugs. Now face the ultimate GPU debugging challenge: a barrier deadlock that causes the program to hang indefinitely - no error messages, no wrong results, just eternal silence.

The complete debugging journey:

  • First Case: Program crashes → Follow error signals → Find memory bugs
  • Second Case: Program produces wrong results → Analyze patterns → Find logic bugs
  • [Third Case]: Program hangs forever → Investigate thread states → Find coordination bugs

This advanced-level debugging challenge teaches you to investigate thread coordination failures using shared memory, LayoutTensor operations, and barrier synchronization - combining all the systematic investigation skills from the previous cases.

Prerequisites: Complete Mojo GPU Debugging Essentials, Detective Work: First Case, and Detective Work: Second Case to understand the CUDA-GDB workflow, variable inspection limitations, and systematic debugging approaches. Make sure you’ve run pixi run setup-cuda-gdb or that equivalent symlinks are available:

ln -sf /usr/local/cuda/bin/cuda-gdb-minimal $CONDA_PREFIX/bin/cuda-gdb-minimal
ln -sf /usr/local/cuda/bin/cuda-gdb-python3.12-tui $CONDA_PREFIX/bin/cuda-gdb-python3.12-tui

Key concepts

In this debugging challenge, you’ll learn about:

  • Barrier deadlock detection: Identifying when threads wait forever at synchronization points
  • Shared memory coordination: Understanding thread cooperation patterns
  • Conditional execution analysis: Debugging when some threads take different code paths
  • Thread coordination debugging: Using CUDA-GDB to analyze multi-thread synchronization failures

Running the code

Consider the following kernel, without looking at the complete code:

fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x

    # Shared memory workspace for collaborative processing
    shared_workspace = tb[dtype]().row_major[SIZE - 1]().shared().alloc()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        # Apply collaborative filter with neighbors
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
        barrier()

    # Phase 3: Final synchronization and output
    barrier()

    # Write filtered results back to output
    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](input[thread_id])


First, experience the issue firsthand by running the following command in your terminal (pixi only):

pixi run p09 --third-case

You’ll see output like this - the program hangs indefinitely:

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...
[HANGS FOREVER - Use Ctrl+C to stop]

⚠️ Warning: This program will hang and never complete. Use Ctrl+C to stop it.

Your task: detective work

Challenge: The program launches successfully but hangs during GPU computation and never returns. Without looking at the complete code, what would be your systematic approach to investigate this deadlock?

Think about:

  • What could cause a GPU kernel to never complete?
  • How would you investigate thread coordination issues?
  • What debugging strategy works when the program just “freezes” with no error messages?
  • How do you debug when threads might not be cooperating correctly?
  • How can you combine systematic investigation (First Case) with execution flow analysis (Second Case) to debug coordination failures?

Start with:

pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

GDB command shortcuts (faster debugging)

Use these abbreviations to speed up your debugging session:

Short  Full      Usage Example
r      run       (cuda-gdb) r
n      next      (cuda-gdb) n
c      continue  (cuda-gdb) c
b      break     (cuda-gdb) b 62
p      print     (cuda-gdb) p thread_id
q      quit      (cuda-gdb) q

All debugging commands below use these shortcuts for efficiency!

Tips
  1. Silent hang investigation - When programs freeze without error messages, what GPU primitives could cause infinite waiting?
  2. Thread state inspection - Use info cuda threads to see where different threads are stopped (a short command sketch follows this list)
  3. Conditional execution analysis - Check which threads execute which code paths (do all threads follow the same path?)
  4. Synchronization point investigation - Look for places where threads might need to coordinate
  5. Thread divergence detection - Are all threads at the same program location, or are some elsewhere?
  6. Coordination primitive analysis - What happens if threads don’t all participate in the same synchronization operations?
  7. Execution flow tracing - Follow the path each thread takes through conditional statements
  8. Thread ID impact analysis - How do different thread IDs affect which code paths execute?
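
For tips 2-6, the cuda-gdb commands below are a minimal starting sketch for a hang investigation. The thread coordinate (3,0,0) is only an example - substitute whichever thread you want to inspect:

(cuda-gdb) info cuda threads     # which source line is each thread stopped at?
(cuda-gdb) cuda thread (3,0,0)   # switch focus to one thread, e.g. (3,0,0)
(cuda-gdb) p thread_id           # confirm which thread is currently in focus
(cuda-gdb) bt                    # show how this thread reached its current location

If every thread is stopped at the same line, the block is progressing together; if some threads sit at a barrier while others have already moved past it, you have found the divergence to investigate.
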
💡 Investigation & Solution

Step-by-step investigation with CUDA-GDB

Phase 1: launch and initial setup

Step 1: start the debugger

pixi run mojo debug --cuda-gdb --break-on-launch problems/p09/p09.mojo --third-case

Step 2: analyze the hanging behavior

Before diving into debugging, let’s understand what we know:

Expected: Program completes and shows filtered results
Actual: Program hangs at "Waiting for GPU computation to complete..."

🔍 Initial Hypothesis: The GPU kernel is deadlocked - some synchronization primitive is causing threads to wait forever.

Phase 2: entering the kernel

Step 3: launch and observe kernel entry

(cuda-gdb) r
Starting program: .../mojo run problems/p09/p09.mojo --third-case

Third Case: Advanced collaborative filtering with shared memory...
WARNING: This may hang - use Ctrl+C to stop if needed

Input array: [1, 2, 3, 4]
Applying collaborative filter using shared memory...
Each thread cooperates with neighbors for smoothing...
Waiting for GPU computation to complete...

[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]

CUDA thread hit application kernel entry function breakpoint, p09_collaborative_filter_Orig6A6AcB6A6A_1882ca334fc2d34b2b9c4fa338df6c07<<<(1,1,1),(4,1,1)>>> (
    output=..., input=...)
    at /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo:52
52          input: LayoutTensor[mut=False, dtype, vector_layout],

🔍 Key Observations (see the verification commands after this list):

  • Grid: (1,1,1) - single block
  • Block: (4,1,1) - 4 threads total (0, 1, 2, 3)
  • Current thread: (0,0,0) - debugging thread 0
  • Function: collaborative_filter with shared memory operations
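
These launch parameters come straight from the breakpoint banner, but you can also query them yourself. The following are standard cuda-gdb commands (output omitted here):

(cuda-gdb) info cuda kernels    # active kernels with their grid and block dimensions
(cuda-gdb) info cuda devices    # which GPU (and SMs) the kernel is running on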

Step 4: navigate through initialization

(cuda-gdb) n
51          output: LayoutTensor[mut=True, dtype, vector_layout],
(cuda-gdb) n
54          thread_id = thread_idx.x
(cuda-gdb) n
57          shared_workspace = tb[dtype]().row_major[SIZE-1]().shared().alloc()
(cuda-gdb) n
60          if thread_id < SIZE - 1:
(cuda-gdb) p thread_id
$1 = 0

✅ Thread 0 state: thread_id = 0, about to check condition 0 < 3 → True

Step 5: trace through phase 1

(cuda-gdb) n
61              shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
(cuda-gdb) n
60          if thread_id < SIZE - 1:
(cuda-gdb) n
62          barrier()

Phase 1 Complete: Thread 0 executed the initialization and reached the first barrier.

Phase 3: the critical barrier investigation

Step 6: examine the first barrier

(cuda-gdb) n
65          if thread_id < SIZE - 1:
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (3,0,0)     4 0x00007fffd3272180 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    65

✅ Good: All 4 threads are at line 65 (after the first barrier). The first barrier worked correctly.

🔍 Critical Point: Now we’re entering Phase 2 with another conditional statement.

Step 7: trace through phase 2 - thread 0 perspective

(cuda-gdb) n
67              if thread_id > 0:

Thread 0 Analysis: 0 < 3 → True → Thread 0 enters the Phase 2 block

(cuda-gdb) n
69              barrier()

Thread 0 Path: 0 > 0 → False → Thread 0 skips the inner computation but reaches the barrier at line 69

CRITICAL MOMENT: Thread 0 is now waiting at the barrier on line 69.

(cuda-gdb) n # <-- if you run this, the program hangs!
[HANGS HERE - Program never proceeds beyond this point]
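
If you do step or continue past this point and the session appears frozen, you don’t have to kill the debugger. As a general cuda-gdb workflow (not part of the transcript above), you can interrupt and re-inspect the device state:

(cuda-gdb) n                   # or c - either way the kernel deadlocks and the prompt never returns
^C                             # press Ctrl+C to interrupt and get the (cuda-gdb) prompt back
(cuda-gdb) info cuda threads   # the stuck threads should still be parked at the barrier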

Step 8: investigate other threads

(cuda-gdb) cuda thread (1,0,0)
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (1,0,0), device 0, sm 0, warp 0, lane 1]
69              barrier()
(cuda-gdb) p thread_id
$2 = 1
(cuda-gdb) info cuda threads
  BlockIdx ThreadIdx To BlockIdx To ThreadIdx Count                 PC                                                       Filename  Line
Kernel 0
*  (0,0,0)   (0,0,0)     (0,0,0)      (2,0,0)     3 0x00007fffd3273aa0 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    69
   (0,0,0)   (3,0,0)     (0,0,0)      (3,0,0)     1 0x00007fffd3273b10 /home/ubuntu/workspace/mojo-gpu-puzzles/problems/p09/p09.mojo    72

SMOKING GUN DISCOVERED:

  • Threads 0, 1, 2: All waiting at line 69 (barrier inside the conditional block)
  • Thread 3: At line 72 (after the conditional block, never reached the barrier!)

Step 9: analyze thread 3’s execution path

🔍 Thread 3 Analysis from the info output (a direct cross-check follows this list):

  • Thread 3: Located at line 72 (PC: 0x00007fffd3273b10)
  • Phase 2 condition: thread_id < SIZE - 1 → 3 < 3 → False
  • Result: Thread 3 NEVER entered the Phase 2 block (lines 65-69)
  • Consequence: Thread 3 NEVER reached the barrier at line 69
  • Current state: Thread 3 is at line 72 (final barrier), while threads 0,1,2 are stuck at line 69
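
As an optional cross-check (not shown in the transcript above), switch focus to thread 3 and confirm its state directly:

(cuda-gdb) cuda thread (3,0,0)   # focus on the one thread that skipped the Phase 2 block
(cuda-gdb) p thread_id           # should print 3

Its reported source line should match line 72 from the info cuda threads output, confirming that it bypassed the barrier at line 69.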

Phase 4: root cause analysis

Step 10: deadlock mechanism identified

# Phase 2: Collaborative processing
if thread_id < SIZE - 1:        # ← Only threads 0, 1, 2 enter this block
    # Apply collaborative filter with neighbors
    if thread_id > 0:
        shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    barrier()                   # ← DEADLOCK: Only 3 out of 4 threads reach here!

💀 Deadlock Mechanism:

  1. Thread 0: 0 < 3 → True → Enters block → Waits at barrier (line 69)
  2. Thread 1: 1 < 3 → True → Enters block → Waits at barrier (line 69)
  3. Thread 2: 2 < 3 → True → Enters block → Waits at barrier (line 69)
  4. Thread 3: 3 < 3 → False → NEVER enters block → Continues to line 72

Result: 3 threads wait forever for the 4th thread, but thread 3 never arrives at the barrier.

Phase 5: bug confirmation and solution

Step 11: the fundamental barrier rule violation

GPU Barrier Rule: ALL threads in a thread block must reach the SAME barrier for synchronization to complete.

What went wrong:

# ❌ WRONG: Barrier inside conditional
if thread_id < SIZE - 1:    # Not all threads enter
    # ... some computation ...
    barrier()               # Only some threads reach this

# ✅ CORRECT: Barrier outside conditional
if thread_id < SIZE - 1:    # Not all threads enter
    # ... some computation ...
barrier()                   # ALL threads reach this

The Fix: Move the barrier outside the conditional block:

fn collaborative_filter(
    output: LayoutTensor[mut=True, dtype, vector_layout],
    input: LayoutTensor[mut=False, dtype, vector_layout],
):
    thread_id = thread_idx.x
    shared_workspace = tb[dtype]().row_major[SIZE-1]().shared().alloc()

    # Phase 1: Initialize shared workspace (all threads participate)
    if thread_id < SIZE - 1:
        shared_workspace[thread_id] = rebind[Scalar[dtype]](input[thread_id])
    barrier()

    # Phase 2: Collaborative processing
    if thread_id < SIZE - 1:
        if thread_id > 0:
            shared_workspace[thread_id] += shared_workspace[thread_id - 1] * 0.5
    # ✅ FIX: Move barrier outside conditional so ALL threads reach it
    barrier()

    # Phase 3: Final synchronization and output
    barrier()

    if thread_id < SIZE - 1:
        output[thread_id] = shared_workspace[thread_id]
    else:
        output[thread_id] = rebind[Scalar[dtype]](input[thread_id])

Key debugging lessons

Barrier deadlock detection:

  1. Use info cuda threads - Shows which threads are at which lines
  2. Look for thread state divergence - Some threads at different program locations
  3. Trace conditional execution paths - Check if all threads reach the same barriers
  4. Verify barrier reachability - Ensure no thread can skip a barrier that others reach

Professional GPU debugging reality:

  • Deadlocks are silent killers - programs just hang with no error messages
  • Thread coordination debugging requires patience - systematic analysis of each thread’s path
  • Conditional barriers are among the most common deadlock causes - always verify that all threads reach the same sync points
  • CUDA-GDB thread inspection is essential - often the only practical way to see thread coordination failures as they happen

Advanced GPU synchronization:

  • Barrier rule: ALL threads in a block must reach the SAME barrier
  • Conditional execution pitfalls: Any if-statement can cause thread divergence
  • Shared memory coordination: Requires careful barrier placement for correct synchronization
  • LayoutTensor doesn’t prevent deadlocks: Higher-level abstractions still need correct synchronization

💡 Key Insight: Barrier deadlocks are among the hardest GPU bugs to debug because:

  • No visible error - just infinite waiting
  • Requires multi-thread analysis - can’t debug by examining one thread
  • Silent failure mode - looks like performance issue, not correctness bug
  • Complex thread coordination - need to trace execution paths across all threads

This type of debugging - using CUDA-GDB to analyze thread states, identify divergent execution paths, and verify barrier reachability - is exactly what professional GPU developers do when facing deadlock issues in production systems.

Next steps: GPU debugging skills complete

You’ve completed the GPU debugging trilogy!

Your complete GPU debugging arsenal

From the First Case - Crash debugging:

  • Systematic crash investigation using error messages as guides
  • Memory bug detection through pointer address inspection
  • CUDA-GDB fundamentals for memory-related issues

From the Second Case - Logic bug debugging:

  • Algorithm error investigation without obvious symptoms
  • Pattern analysis techniques for tracing wrong results to root causes
  • Execution flow debugging when variable inspection fails

From the Third Case - Coordination debugging:

  • Barrier deadlock investigation for thread coordination failures
  • Multi-thread state analysis using advanced CUDA-GDB techniques
  • Synchronization verification for complex parallel programs

The professional GPU debugging methodology

You’ve learned the systematic approach used by professional GPU developers:

  1. Read the symptoms - Crashes? Wrong results? Infinite hangs?
  2. Form hypotheses - Memory issue? Logic error? Coordination problem?
  3. Gather evidence - Use CUDA-GDB strategically based on the bug type
  4. Test systematically - Verify each hypothesis through targeted investigation
  5. Trace to root cause - Follow the evidence chain to the source

Achievement Unlocked: You can now debug the three most common GPU programming issues:

  • Memory crashes (First Case) - null pointers, out-of-bounds access
  • Logic bugs (Second Case) - algorithmic errors, incorrect results
  • Coordination deadlocks (Third Case) - barrier synchronization failures