Systematic Debugging - Root Cause Analysis Framework
When to use this skill
- Encountering bugs or unexpected application behavior
- Investigating test failures and flaky tests
- Diagnosing production issues and outages
- Tracing error sources through complex call stacks
- Analyzing logs, stack traces, and error messages
- Reproducing intermittent or hard-to-replicate bugs
- Debugging race conditions and timing issues
- Investigating memory leaks or performance degradation
- Root cause analysis before proposing fixes
- Debugging integration issues between services
- Investigating data corruption or inconsistencies
- Applying scientific method to systematic problem-solving
When to use this skill
- Encountering bugs, test failures, unexpected behavior, or production issues - BEFORE proposing fixes.
- When working on related tasks or features
- During development that requires this expertise
Use when: Encountering bugs, test failures, unexpected behavior, or production issues - BEFORE proposing fixes.
Core Philosophy
Never guess at fixes. Always understand the root cause first.
❌ Bad: "Let's try adding a timeout here"
✅ Good: "The timeout is occurring because X. Here's proof: [evidence]"
Four-Phase Framework
Phase 1: Root Cause Investigation
Goal: Understand exactly what's happening and why
1. Reproduce the issue reliably
- Minimal reproduction case
- Consistent reproduction steps
- Document environment/conditions
2. Gather evidence
- Error messages (full stack traces)
- Logs (with timestamps and context)
- State at failure point
- Input data that triggers issue
3. Form hypothesis
- Based on evidence, not guesswork
- Specific and testable
- Includes mechanism of failure
Example:
Bug: User authentication fails intermittently
Investigation:
1. Reproduced: Fails every ~10th login attempt
2. Evidence:
- Error: "Invalid token signature"
- Logs show token created at 14:32:15, validated at 14:32:17
- Server logs show time drift between auth & API servers
3. Hypothesis: Clock skew causing token validation failures
- Auth server: 14:32:15
- API server: 14:32:10 (5 seconds behind)
- Token "not yet valid" due to nbf (not before) claim
Phase 2: Pattern Analysis
Goal: Understand if this is isolated or systemic
1. Check for similar issues
- Same error in other places?
- Same pattern in related code?
- Recurring in error logs?
2. Identify scope
- One function or entire subsystem?
- One user or all users?
- One environment or all?
3. Find common factors
- Timing (time of day, duration)?
- Data characteristics?
- Execution path?
Phase 3: Hypothesis Testing
Goal: Prove understanding with experiments
1. Design tests that prove/disprove hypothesis
2. Add instrumentation if needed
3. Run experiments systematically
4. Document results
Example Tests:
// Hypothesis: Clock skew causes token failures
// Test 1: Artificially set server clocks in sync
// Result: No failures in 100 attempts ✓
// Test 2: Increase token nbf tolerance to 10 seconds
// Result: No failures in 100 attempts ✓
// Test 3: Log exact time delta when failures occur
// Result: All failures show 4-6 second clock difference ✓
// Conclusion: Hypothesis confirmed
Phase 4: Implementation
Goal: Fix root cause, not symptoms
1. Address root cause
- Fix underlying issue
- Not just surface symptoms
2. Add safeguards
- Validation
- Error handling
- Monitoring
3. Verify fix
- Reproducer no longer triggers issue
- Related edge cases handled
- No new issues introduced
Debugging Techniques
1. Binary Search Debugging
For "it broke somewhere between working and now":
# Git bisect example
git bisect start
git bisect bad HEAD # Current broken state
git bisect good v1.2.0 # Last known working
# Git will checkout commits for testing
# Test each: git bisect good / git bisect bad
# Automatically finds breaking commit
2. Differential Debugging
Compare working vs. broken:
Working environment:
- Node 18.16.0
- Dependency A v2.1.0
- Feature flag X: off
Broken environment:
- Node 18.17.0 ← Suspect
- Dependency A v2.1.0
- Feature flag X: off
Test: Change Node version → Bug disappears → Root cause found
3. Instrumentation
Add strategic logging:
// Not enough information
function processUser(user) {
const result = complexOperation(user);
return result; // Fails sometimes, why?
}
// Rich instrumentation
function processUser(user) {
logger.debug('processUser start', {
userId: user.id,
userState: user.state,
timestamp: Date.now()
});
const result = complexOperation(user);
logger.debug('processUser complete', {
userId: user.id,
resultStatus: result.status,
duration: Date.now() - start
});
return result;
}
4. Rubber Duck Debugging
Explain the problem out loud:
"When a user clicks login, we:
1. Hash their password ← Wait, are we using the same salt?
2. Compare to database
3. ... oh. We changed the salt algorithm last week."
5. Time Travel Debugging
Use debugger to step backwards:
Modern debuggers (rr, WinDbg, Chrome DevTools) can:
- Record execution
- Replay backwards
- Find exact moment state became invalid
Common Root Causes
1. Race Conditions
// Symptom: Intermittent failures, works in debugger
// Root cause: Async operations completing in wrong order
// Bad
let userData = null;
fetchUser().then(data => userData = data); // Async
sendEmail(userData.email); // Runs before fetch completes! ❌
// Fixed
const userData = await fetchUser();
sendEmail(userData.email); // ✓
2. Shared Mutable State
// Symptom: Tests pass individually, fail together
// Root cause: Tests sharing state
// Bad - shared state
const cache = {}; // Global
test('test1', () => { cache.foo = 'bar'; });
test('test2', () => { expect(cache.foo).toBeUndefined(); }); // Fails! ❌
// Fixed - isolated state
test('test1', () => {
const cache = {};
cache.foo = 'bar';
});
test('test2', () => {
const cache = {};
expect(cache.foo).toBeUndefined();
}); // ✓
3. Incorrect Assumptions
// Symptom: Crashes with certain inputs
// Root cause: Assumed data always present
// Bad - assumes email exists
function sendWelcome(user) {
sendEmail(user.email); // Crashes if email is null ❌
}
// Fixed - validate assumptions
function sendWelcome(user) {
if (!user?.email) {
logger.warn('Cannot send welcome email', { userId: user.id });
return;
}
sendEmail(user.email); // ✓
}
4. Off-by-One Errors
// Symptom: Array index errors, missing last item
// Root cause: Loop boundary wrong
// Bad
for (let i = 0; i < array.length - 1; i++) { // Misses last element ❌
process(array[i]);
}
// Fixed
for (let i = 0; i < array.length; i++) { // ✓
process(array[i]);
}
// Or better: array.forEach(process);
5. Timezone Issues
// Symptom: Date calculations wrong for some users
// Root cause: Not handling timezones
// Bad
const deadline = new Date('2024-01-01'); // Midnight in what timezone? ❌
// Fixed
const deadline = new Date('2024-01-01T00:00:00Z'); // Explicit UTC
// Or use library: dayjs.utc('2024-01-01')
Debugging Checklist
□ Can you reproduce the issue reliably?
□ Do you have the full error message and stack trace?
□ Do you know the exact input that triggers the issue?
□ Have you checked recent changes (git log)?
□ Have you verified your assumptions with logging?
□ Have you isolated the failing component?
□ Do you understand WHY it fails (not just WHERE)?
□ Have you tested your fix against the reproducer?
□ Have you added tests to prevent regression?
□ Have you checked for similar issues elsewhere?
When to Ask for Help
Ask when:
- Stuck after 2+ hours of systematic investigation
- Issue involves unfamiliar subsystem
- Reproducer is inconsistent
But first, prepare:
## Issue Description
[What's broken]
## Reproduction Steps
1. [Exact steps]
2. [Expected vs actual]
## Investigation So Far
- [What I've tried]
- [What I've ruled out]
- [Current hypothesis]
## Evidence
- [Logs, errors, screenshots]
- [Minimal code reproducer]
## Environment
- OS, versions, configuration
Anti-Patterns to Avoid
❌ Shotgun Debugging
"Let me try changing this... and this... and this..."
→ You don't know what actually fixed it
❌ printf Debugging Overload
Adding print statements everywhere without a plan
→ Noise obscures signal
❌ Assuming It's Not Your Code
"Must be a framework bug"
→ 95% of the time, it's your code
❌ Fixing Symptoms, Not Root Cause
Bug: Crashes with large files
Bad fix: Add try/catch to hide error ❌
Good fix: Implement streaming to handle large files ✓
Advanced Techniques
Core Dump Analysis
# When process crashes
gdb program core
(gdb) bt # backtrace
(gdb) info locals # local variables
(gdb) frame 3 # inspect frame
Network Debugging
# Capture traffic
tcpdump -i any -w capture.pcap
# Analyze with Wireshark
wireshark capture.pcap
# Or use Charles Proxy, mitmproxy
Performance Profiling
// Node.js
node --prof app.js
node --prof-process isolate-*.log
// Chrome DevTools
// Performance tab → Record → Analyze flame graph
Resources
- Debugging Guide by Julia Evans
- The Art of Debugging by Norman Matloff
- Effective Debugging by Diomidis Spinellis
Remember: Debugging is a skill. The more systematic your approach, the faster you'll find root causes and the fewer bugs you'll introduce with "fixes" that don't address the real problem.
