Conflict-Free Patterns
Note: This section is specific to NVIDIA GPUs
Bank conflict analysis and profiling techniques covered here apply specifically to NVIDIA GPUs. The profiling commands use NSight Compute tools that are part of the NVIDIA CUDA toolkit.
Building on your profiling skills
You’ve learned GPU profiling fundamentals in Puzzle 30 and understood resource optimization in Puzzle 31. Now you’re ready to apply those detective skills to a new performance mystery: shared memory bank conflicts.
The detective challenge: You have two GPU kernels that perform identical mathematical operations ((input + 10) * 2). Both produce exactly the same results. Both use the same amount of shared memory. Both have identical occupancy. Yet one experiences systematic performance degradation due to how it accesses shared memory.
Your mission: Use the profiling methodology you’ve learned to uncover this hidden performance trap and understand when bank conflicts matter in real-world GPU programming.
Overview
Shared memory bank conflicts occur when multiple threads in a warp simultaneously access different addresses within the same memory bank. This detective case explores two kernels with contrasting access patterns:
alias SIZE = 8 * 1024 # 8K elements - small enough to focus on shared memory patterns
alias TPB = 256 # Threads per block - divisible by 32 (warp size)
alias THREADS_PER_BLOCK = (TPB, 1)
alias BLOCKS_PER_GRID = (SIZE // TPB, 1)
alias dtype = DType.float32
alias layout = Layout.row_major(SIZE)
fn no_conflict_kernel[
layout: Layout
](
output: LayoutTensor[mut=True, dtype, layout],
input: LayoutTensor[mut=False, dtype, layout],
size: Int,
):
"""Perfect shared memory access - no bank conflicts.
Each thread accesses a different bank: thread_idx.x maps to bank thread_idx.x % 32.
This achieves optimal shared memory bandwidth utilization.
"""
# Shared memory buffer - each thread loads one element
shared_buf = tb[dtype]().row_major[TPB]().shared().alloc()
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# Load from global memory to shared memory - no conflicts
if global_i < size:
shared_buf[local_i] = (
input[global_i] + 10.0
) # Add 10 as simple operation
barrier() # Synchronize shared memory writes
# Read back from shared memory and write to output - no conflicts
if global_i < size:
output[global_i] = shared_buf[local_i] * 2.0 # Multiply by 2
barrier() # Ensure completion
fn two_way_conflict_kernel[
layout: Layout
](
output: LayoutTensor[mut=True, dtype, layout],
input: LayoutTensor[mut=False, dtype, layout],
size: Int,
):
"""Stride-2 shared memory access - creates 2-way bank conflicts.
Threads 0,16 → Bank 0, Threads 1,17 → Bank 2, etc.
Each bank serves 2 threads, doubling access time.
"""
# Shared memory buffer - stride-2 access pattern creates conflicts
shared_buf = tb[dtype]().row_major[TPB]().shared().alloc()
global_i = block_dim.x * block_idx.x + thread_idx.x
local_i = thread_idx.x
# CONFLICT: stride-2 access creates 2-way bank conflicts
conflict_index = (local_i * 2) % TPB
# Load with bank conflicts
if global_i < size:
shared_buf[conflict_index] = (
input[global_i] + 10.0
) # Same operation as no-conflict
barrier() # Synchronize shared memory writes
# Read back with same conflicts
if global_i < size:
output[global_i] = (
shared_buf[conflict_index] * 2.0
) # Same operation as no-conflict
barrier() # Ensure completion
View full file: problems/p32/p32.mojo
The mystery: These kernels compute identical results but have dramatically different shared memory access efficiency. Your job is to discover why using systematic profiling analysis.
Configuration
Requirements:
- NVIDIA GPU with CUDA toolkit and NSight Compute from Puzzle 30
- Understanding of shared memory banking concepts from the previous section
Kernel specifications:
alias SIZE = 8 * 1024 # 8K elements - focus on shared memory patterns
alias TPB = 256 # 256 threads per block (8 warps)
alias BLOCKS_PER_GRID = (SIZE // TPB, 1) # 32 blocks
Key insight: The problem size is deliberately smaller than previous puzzles to highlight shared memory effects rather than global memory bandwidth limitations.
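To keep the later profiling numbers in perspective, it helps to write out the launch geometry this configuration implies. Below is a minimal host-side sketch in Mojo (plain arithmetic, no GPU calls; not part of p32.mojo):
def main():
    size = 8 * 1024                         # SIZE
    tpb = 256                               # TPB
    warp_size = 32
    blocks = size // tpb                    # 32 blocks in the grid
    warps_per_block = tpb // warp_size      # 8 warps per block
    total_warps = blocks * warps_per_block  # 256 warps across the grid
    print("blocks:", blocks)
    print("warps per block:", warps_per_block)
    print("total warps:", total_warps)
The total of 256 warps is worth remembering - it reappears in the bank conflict counts measured in the solution.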
The investigation
Step 1: Verify correctness
pixi shell -e cuda
mojo problems/p32/p32.mojo --test
Both kernels should produce identical results. This confirms that bank conflicts affect performance but not correctness.
Step 2: Benchmark performance baseline
mojo problems/p32/p32.mojo --benchmark
Record the execution times. You may notice similar performance due to the workload being dominated by global memory access, but bank conflicts will be revealed through profiling metrics.
Step 3: Build for profiling
mojo build --debug-level=full problems/p32/p32.mojo -o problems/p32/p32_profiler
Step 4: Profile bank conflicts
Use NSight Compute to measure shared memory bank conflicts quantitatively:
# Profile no-conflict kernel
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st problems/p32/p32_profiler --no-conflict
and
# Profile two-way conflict kernel
ncu --metrics=l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld,l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st problems/p32/p32_profiler --two-way
Key metrics to record:
- l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum - Load conflicts
- l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum - Store conflicts
Step 5: Analyze access patterns
Based on your profiling results, analyze the mathematical access patterns:
No-conflict kernel access pattern:
# Thread mapping: thread_idx.x directly maps to shared memory index
shared_buf[thread_idx.x] # Thread 0→Index 0, Thread 1→Index 1, etc.
# Bank mapping: Index % 32 = Bank ID
# Result: Thread 0→Bank 0, Thread 1→Bank 1, ..., Thread 31→Bank 31
Two-way conflict kernel access pattern:
# Thread mapping with stride-2 modulo operation
shared_buf[(thread_idx.x * 2) % TPB]
# Threads 0-31 → indices 0,2,4,...,62; threads 32-127 continue to 64,...,254; threads 128-255 wrap back to 0,2,...
# Bank mapping examples:
# Thread 0 → Index 0 → Bank 0
# Thread 16 → Index 32 → Bank 0 (conflict!)
# Thread 1 → Index 2 → Bank 2
# Thread 17 → Index 34 → Bank 2 (conflict!)
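Before profiling, you can predict these bank assignments with a few lines of host-side Mojo. This sketch enumerates the first warp for both access patterns (the variable names are illustrative and not part of p32.mojo):
def main():
    warp_size = 32
    tpb = 256
    for t in range(warp_size):
        no_conflict_bank = t % 32           # no-conflict kernel: shared_buf[t]
        conflict_index = (t * 2) % tpb      # two-way kernel: shared_buf[(t * 2) % TPB]
        conflict_bank = conflict_index % 32
        print("thread", t,
              "| no-conflict bank:", no_conflict_bank,
              "| stride-2 index:", conflict_index,
              "| stride-2 bank:", conflict_bank)
The output shows every lane in its own bank for the first pattern, while lanes t and t + 16 share a bank for the stride-2 pattern (for example, threads 0 and 16 both land in bank 0) - exactly the 2-way conflict NSight Compute should report.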
Your task: solve the bank conflict mystery
After completing the investigation steps above, answer these analysis questions:
Performance analysis (Steps 1-2):
- Do both kernels produce identical mathematical results?
- What are the execution time differences (if any) between the kernels?
- Why might performance be similar despite different access patterns?
Bank conflict profiling (Step 4):
- How many bank conflicts does the no-conflict kernel generate for loads and stores?
- How many bank conflicts does the two-way conflict kernel generate for loads and stores?
- What is the total conflict count difference between the kernels?
Access pattern analysis (Step 5):
- In the no-conflict kernel, which bank does Thread 0 access? Thread 31?
- In the two-way conflict kernel, which threads access Bank 0? Which access Bank 2?
- How many threads compete for the same bank in the conflict kernel?
The bank conflict detective work:
- Why does the two-way conflict kernel show measurable conflicts while the no-conflict kernel shows zero?
- How does the stride-2 access pattern (thread_idx.x * 2) % TPB create systematic conflicts?
- Why do bank conflicts matter more in compute-intensive kernels than memory-bound kernels?
Real-world implications:
- When would you expect bank conflicts to significantly impact application performance?
- How can you predict bank conflict patterns before implementing shared memory algorithms?
- What design principles help avoid bank conflicts in matrix operations and stencil computations?
Tips
Bank conflict detective toolkit:
- NSight Compute metrics - Quantify conflicts with precise measurements
- Access pattern visualization - Map thread indices to banks systematically
- Mathematical analysis - Use modulo arithmetic to predict conflicts
- Workload characteristics - Understand when conflicts matter vs when they don’t
Key investigation principles:
- Measure systematically: Use profiling tools rather than guessing about conflicts
- Visualize access patterns: Draw thread-to-bank mappings for complex algorithms
- Consider workload context: Bank conflicts matter most in compute-intensive shared memory algorithms
- Think prevention: Design algorithms with conflict-free access patterns from the start
Access pattern analysis approach:
- Map threads to indices: Understand the mathematical address calculation
- Calculate bank assignments: Use the formula bank_id = (address_bytes / 4) % 32 (see the sketch after this list)
- Identify conflicts: Look for multiple threads accessing the same bank
- Validate with profiling: Confirm theoretical analysis with NSight Compute measurements
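A small host-side helper makes this checklist mechanical: apply the bank formula to a candidate per-thread stride and report the worst-case number of lanes that collide in one bank. The function below is an illustrative Mojo sketch under a simple one-access-per-lane model, not part of the puzzle code:
fn conflict_degree(stride_elems: Int, elem_bytes: Int) -> Int:
    """Worst-case number of lanes in a warp that map to the same bank."""
    alias NUM_BANKS = 32
    alias BANK_WIDTH_BYTES = 4
    var worst = 0
    for bank in range(NUM_BANKS):
        var hits = 0
        for lane in range(32):
            var address_bytes = lane * stride_elems * elem_bytes
            if (address_bytes // BANK_WIDTH_BYTES) % NUM_BANKS == bank:
                hits += 1
        if hits > worst:
            worst = hits
    return worst

def main():
    print("stride 1 :", conflict_degree(1, 4))    # 1  -> conflict-free
    print("stride 2 :", conflict_degree(2, 4))    # 2  -> 2-way conflicts (this puzzle)
    print("stride 32:", conflict_degree(32, 4))   # 32 -> worst case, full serialization
    print("stride 33:", conflict_degree(33, 4))   # 1  -> odd strides avoid conflicts
A result of 1 means conflict-free; anything higher is the number of serialized replays each warp access would need under this simplified model.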
Common conflict-free patterns:
- Sequential access: shared[thread_idx.x] - each thread hits a different bank
- Broadcast access: shared[0] for all threads - served by a single hardware broadcast
- Power-of-2 loop strides: a per-thread stride of 32 between loop iterations keeps each thread in its own bank
- Padded arrays: Add padding to shift problematic access patterns
Solution
Complete Solution with Bank Conflict Analysis
This bank conflict detective case demonstrates how shared memory access patterns affect GPU performance and reveals the importance of systematic profiling for optimization.
Investigation results from profiling
Step 1: Correctness verification - Both kernels produce identical mathematical results:
✅ No-conflict kernel: PASSED
✅ Two-way conflict kernel: PASSED
✅ Both kernels produce identical results
Step 2: Performance baseline - Benchmark results show similar execution times:
| name | met (ms) | iters |
| ---------------- | ------------------ | ----- |
| no_conflict | 2.1930616745886655 | 547 |
| two_way_conflict | 2.1978922967032966 | 546 |
Key insight: Performance is nearly identical (~2.19ms vs ~2.20ms) because this workload is global memory bound rather than shared memory bound. Bank conflicts become visible through profiling metrics rather than execution time.
Bank conflict profiling evidence
No-Conflict Kernel (Optimal Access Pattern):
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 0
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 0
Result: Zero conflicts for both loads and stores - perfect shared memory efficiency.
Two-Way Conflict Kernel (Problematic Access Pattern):
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_ld.sum 256
l1tex__data_bank_conflicts_pipe_lsu_mem_shared_op_st.sum 256
Result: 256 conflicts each for loads and stores - clear evidence of systematic banking problems.
Total conflict difference: 512 conflicts (256 loads + 256 stores) - a measurable shared memory inefficiency that never shows up in the execution timing.
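These numbers also line up with a quick estimate, assuming the metric counts the extra shared-memory wavefronts that serialization adds (one extra wavefront per conflicted warp access - an assumption about the metric's semantics, so treat this as a sanity check rather than a definition):
def main():
    size = 8 * 1024
    warp_size = 32
    total_warps = size // warp_size          # 256 warps launched across the grid
    threads_per_bank = 2                     # stride-2 pattern: 2 lanes per bank
    extra_wavefronts = threads_per_bank - 1  # each warp access needs 1 extra wavefront
    # The two-way kernel does one shared store and one shared load per thread
    print("expected store conflicts:", total_warps * extra_wavefronts)  # 256
    print("expected load conflicts :", total_warps * extra_wavefronts)  # 256
Both estimates match the measured values above.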
Access pattern mathematical analysis
No-conflict kernel access pattern
Thread-to-index mapping:
shared_buf[thread_idx.x]
Bank assignment analysis:
Thread 0  → Index 0  → Bank 0  (0 % 32)
Thread 1  → Index 1  → Bank 1  (1 % 32)
Thread 2  → Index 2  → Bank 2  (2 % 32)
...
Thread 31 → Index 31 → Bank 31 (31 % 32)
Result: Perfect bank distribution - each thread accesses a different bank within each warp, enabling parallel access.
Two-way conflict kernel access pattern
Thread-to-index mapping:
shared_buf[(thread_idx.x * 2) % TPB] # TPB = 256
Bank assignment analysis for first warp (threads 0-31):
Thread 0 → Index (0*2)%256 = 0 → Bank 0
Thread 1 → Index (1*2)%256 = 2 → Bank 2
Thread 2 → Index (2*2)%256 = 4 → Bank 4
...
Thread 16 → Index (16*2)%256 = 32 → Bank 0 ← CONFLICT with Thread 0
Thread 17 → Index (17*2)%256 = 34 → Bank 2 ← CONFLICT with Thread 1
Thread 18 → Index (18*2)%256 = 36 → Bank 4 ← CONFLICT with Thread 2
...
Conflict pattern: Each bank serves exactly 2 threads, creating systematic 2-way conflicts across all 32 banks.
Mathematical explanation: The stride-2 pattern with modulo 256 creates a repeating access pattern where:
- Threads 0-15 access banks 0,2,4,…,30
- Threads 16-31 access the same banks 0,2,4,…,30
- Each bank collision requires hardware serialization
Why this matters: workload context analysis
Memory-bound vs compute-bound implications
This workload's characteristics:
- Global memory dominant: Each thread performs minimal computation relative to memory transfer
- Shared memory secondary: Bank conflicts add overhead but don’t dominate total execution time
- Identical performance: Global memory bandwidth saturation masks shared memory inefficiency
When bank conflicts matter most:
- Compute-intensive shared memory algorithms - Matrix multiplication, stencil computations, FFT
- Tight computational loops - Repeated shared memory access within inner loops
- High arithmetic intensity - Significant computation per memory access
- Large shared memory working sets - Algorithms that heavily utilize shared memory caching
Real-world performance implications
Applications where bank conflicts significantly impact performance:
Matrix Multiplication:
# Problematic: when a warp's threads span rows, a_shared[local_row, k] strides by
# the tile width; if that width is a multiple of 32, every lane hits the same bank (32-way conflict)
for k in range(tile_size):
    acc += a_shared[local_row, k] * b_shared[k, local_col]
Stencil Computations:
# Problematic: Stride access in boundary handling
shared_buf[thread_idx.x * stride]  # Conflicts whenever stride is even (shares a factor with the 32 banks)
Parallel Reductions:
# Problematic: interleaved addressing with power-of-2 strides
index = thread_idx.x * 2 * stride
if index < TPB:
    shared_buf[index] += shared_buf[index + stride]  # 2-way conflicts at stride=1, 4-way at stride=2, ...
Conflict-free design principles
Prevention strategies
1. Sequential access patterns:
shared[thread_idx.x] # Optimal - each thread different bank
2. Broadcast optimization:
constant = shared[0] # All threads read same address - hardware optimized
3. Padding techniques (see the bank-math sketch after this list):
shared = tb[dtype]().row_major[TPB + 1]().shared().alloc()  # Padding shifts access patterns - most useful as +1 on the row width of 2D tiles
4. Access pattern analysis:
- Calculate bank assignments before implementation
- Use modulo arithmetic: bank_id = (address_bytes / 4) % 32
- Visualize thread-to-bank mappings for complex algorithms
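To see concretely why padding helps, compare the bank each lane hits when a warp reads down a column of a 32-wide float32 tile versus one padded to 33 columns. This is a host-side Mojo sketch of the arithmetic only; the 2D tile scenario is illustrative and not taken from p32.mojo:
def main():
    num_banks = 32
    # A warp reads down column 0 of a row-major tile: lane r accesses element r * width
    unpadded_width = 32
    padded_width = 33
    for lane in range(32):
        unpadded_bank = (lane * unpadded_width) % num_banks  # always bank 0 -> 32-way conflict
        padded_bank = (lane * padded_width) % num_banks      # bank = lane   -> conflict-free
        print("lane", lane,
              "| width 32 bank:", unpadded_bank,
              "| width 33 bank:", padded_bank)
With the padded width every lane lands in a distinct bank, which is why declaring tiles with one extra column (for example, tb[dtype]().row_major[TILE, TILE + 1]().shared().alloc(), assuming the builder supports 2D shapes as in the earlier tiled matmul puzzles) is a standard fix for column-wise shared memory access.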
Systematic optimization workflow
Design Phase:
- Plan access patterns - Sketch thread-to-memory mappings
- Calculate bank assignments - Use mathematical analysis
- Predict conflicts - Identify problematic access patterns
- Design alternatives - Consider padding, transpose, or algorithm changes
Implementation Phase:
- Profile systematically - Use NSight Compute conflict metrics
- Measure impact - Compare conflict counts across implementations
- Validate performance - Ensure optimizations improve end-to-end performance
- Document patterns - Record successful conflict-free algorithms for reuse
Key takeaways: from detective work to optimization expertise
The Bank Conflict Investigation revealed:
- Measurement trumps intuition - Profiling tools reveal conflicts invisible to performance timing
- Pattern analysis works - Mathematical prediction accurately matched NSight Compute results
- Context matters - Bank conflicts matter most in compute-intensive shared memory workloads
- Prevention beats fixing - Designing conflict-free patterns is easier than retrofitting optimizations
Universal shared memory optimization principles:
When to worry about bank conflicts:
- High-computation kernels using shared memory for data reuse
- Iterative algorithms with repeated shared memory access in tight loops
- Performance-critical code where every cycle matters
- Shared-memory-intensive operations that are limited by shared memory throughput rather than global memory bandwidth
When bank conflicts are less critical:
- Memory-bound workloads where global memory dominates performance
- Simple caching scenarios with minimal shared memory reuse
- One-time access patterns without repeated conflict-prone operations
Professional development methodology:
- Profile before optimizing - Measure conflicts quantitatively with NSight Compute
- Understand access mathematics - Use bank assignment formulas to predict problems
- Design systematically - Consider banking in algorithm design, not as afterthought
- Validate optimizations - Confirm that conflict reduction improves actual performance
This detective case demonstrates that systematic profiling reveals optimization opportunities invisible to performance timing alone - bank conflicts are a perfect example of where measurement-driven optimization beats guesswork.