Root Cause Tracing
When to Use This Skill
- Investigating production errors
- Debugging complex multi-step failures
- Analyzing error chains and cascading failures
- Understanding why a specific state occurred
- Post-mortem analysis of incidents
Tracing Methodology
1. Start from the Symptom
## Error Chain Template
SYMPTOM: [What the user/system reported]
↓
IMMEDIATE CAUSE: [Direct technical cause]
↓
CONTRIBUTING FACTOR: [What enabled the immediate cause]
↓
ROOT CAUSE: [The fundamental issue to fix]
Example Trace
SYMPTOM: User sees "500 Internal Server Error"
↓
IMMEDIATE CAUSE: Unhandled null pointer exception in UserService.getProfile()
↓
CONTRIBUTING FACTOR: Database returned null for user that should exist
↓
ROOT CAUSE: Race condition during user registration - DB write not committed before redirect
Trace Techniques
Stack Trace Analysis
# Given this stack trace:
# Traceback (most recent call last):
#   File "api/handlers.py", line 45, in get_user
#     profile = user_service.get_profile(user_id)
#   File "services/user.py", line 23, in get_profile
#     return self.repo.find(user_id).to_dict()
#   File "models/user.py", line 67, in to_dict
#     'email': self.email.lower()
# AttributeError: 'NoneType' object has no attribute 'lower'
# Trace backwards:
# 1. self.email is None (immediate cause)
# 2. User model was created without email validation
# 3. API endpoint doesn't validate email before save
# 4. ROOT CAUSE: Missing input validation
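The backwards walk above starts from the deepest frame, which is the last entry in a Python traceback. A minimal sketch of pulling that frame out programmatically follows; the `User` class is a hypothetical stand-in for `models/user.py`, and `innermost_frame` is an illustrative helper, not part of any library.

```python
import traceback


def innermost_frame(exc: BaseException) -> traceback.FrameSummary:
    """Return the frame where the exception was actually raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    return frames[-1]  # last entry = deepest call = raise site


class User:  # hypothetical stand-in for models/user.py
    def __init__(self, email):
        self.email = email

    def to_dict(self):
        return {"email": self.email.lower()}


try:
    User(email=None).to_dict()
except AttributeError as exc:
    frame = innermost_frame(exc)
    print(f"{type(exc).__name__} raised in {frame.name}() at line {frame.lineno}")
```

The frame only identifies the immediate cause; the remaining steps (why `email` was `None` in the first place) still need to be traced by hand.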
Log Correlation
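# Find related logs by request ID; sort on the text after grep's "filename:" prefix
# (chronological only if each log line starts with a sortable timestamp)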
grep "req_abc123" /var/log/app/*.log | sort -t: -k2
# Timeline reconstruction
grep -h "2024-01-15T10:3" error.log access.log | sort
# Find first occurrence of error pattern
grep -n "NullPointerException" app.log | head -1
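When the grep output gets too noisy, the same correlation can be sketched in a few lines of Python. The line format assumed here (`<ISO timestamp> <request id> <message>`) and the sample entries are illustrative assumptions, not taken from a real log.

```python
import re
from datetime import datetime

# Assumed line format: "<ISO timestamp> <request id> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<req>req_\w+)\s+(?P<msg>.*)$")


def timeline(lines, request_id):
    """Return (timestamp, message) pairs for one request, oldest first."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match["req"] == request_id:
            events.append((datetime.fromisoformat(match["ts"]), match["msg"]))
    return sorted(events)


sample = [
    "2024-01-15T10:30:02 req_abc123 ERROR profile lookup failed",
    "2024-01-15T10:30:01 req_abc123 INFO GET /users/42",
    "2024-01-15T10:30:01 req_zzz999 INFO GET /health",
]
for ts, msg in timeline(sample, "req_abc123"):
    print(ts.isoformat(), msg)
```

Parsing timestamps into `datetime` objects avoids relying on string ordering, which matters once logs from different sources use different formats.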
State Inspection
# Add trace points to understand state flow
import logging

logger = logging.getLogger(__name__)

def process_order(order):
    logger.debug(f"[TRACE] Input state: {order.__dict__}")
    validated = validate_order(order)
    logger.debug(f"[TRACE] After validation: {validated.__dict__}")
    calculated = calculate_totals(validated)
    logger.debug(f"[TRACE] After calculation: {calculated.__dict__}")
    saved = save_order(calculated)
    logger.debug(f"[TRACE] After save: {saved.__dict__}")
    return saved
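Hand-written trace points like these can be factored into a decorator so they are easy to add and remove. This is one possible sketch; `traced` is a hypothetical helper, and the `calculate_totals` shown here is a simplified stand-in that works on a plain dict.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def traced(func):
    """Log the argument and return value of a single-argument pipeline step."""
    @functools.wraps(func)
    def wrapper(value):
        logger.debug("[TRACE] %s input: %r", func.__name__, value)
        result = func(value)
        logger.debug("[TRACE] %s output: %r", func.__name__, result)
        return result
    return wrapper


@traced
def calculate_totals(order):  # hypothetical step operating on a plain dict
    return {**order, "total": sum(order["prices"])}


logging.basicConfig(level=logging.DEBUG)
calculate_totals({"prices": [5, 7]})
```

Passing the values as logging arguments (rather than f-strings) means the formatting cost is only paid when DEBUG logging is enabled.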
Debugging Patterns
Binary Search Debugging
# When you have a long process that fails somewhere
def long_process(data):
    # Add checkpoint
    step1_result = step1(data)
    print(f"CHECKPOINT 1: {step1_result is not None}")  # Pass
    step2_result = step2(step1_result)
    print(f"CHECKPOINT 2: {step2_result is not None}")  # Pass
    step3_result = step3(step2_result)
    print(f"CHECKPOINT 3: {step3_result is not None}")  # FAIL - narrow down here
    step4_result = step4(step3_result)
    # ...
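The checkpoints above are inspected one after another. When inspecting the state is the expensive part (for example, a manual check), the same idea can be turned into an actual binary search over the steps. The sketch below assumes the steps are deterministic and that an invalid state stays invalid; `first_bad_step` and the sample steps are hypothetical.

```python
def first_bad_step(steps, data, is_valid):
    """Return the index of the first step whose output fails is_valid.

    Assumes the input itself is valid and that once the state is invalid
    it stays invalid. Returns None if the final output is valid.
    """
    def state_after(n):
        state = data
        for step in steps[:n]:
            state = step(state)
        return state

    lo, hi = 0, len(steps)  # invariant: state_after(lo) is valid
    if is_valid(state_after(hi)):
        return None
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_valid(state_after(mid)):
            lo = mid
        else:
            hi = mid
    return hi - 1  # steps[hi - 1] turned a valid state into an invalid one


steps = [
    lambda d: {**d, "a": 1},
    lambda d: {**d, "b": 2},
    lambda d: {**d, "a": None},  # the bug: clobbers "a"
    lambda d: {**d, "c": 3},
]
culprit = first_bad_step(steps, {}, lambda d: d.get("a", 0) is not None)
print(f"first bad step index: {culprit}")
```

This version re-runs the pipeline prefix for every probe, so it saves inspections rather than step executions.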
Delta Debugging
# Find which commit introduced a bug
git bisect start
git bisect bad HEAD
git bisect good v1.0.0
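# Git checks out a commit halfway between good and bad
# Test it, then mark it with `git bisect good` or `git bisect bad`
# Repeat until Git reports the first bad commit, then run `git bisect reset`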
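`git bisect` narrows down a commit; the same reduce-and-retest idea can also shrink a failing input. The sketch below is a simplified reducer in the spirit of delta debugging (it only tries removing chunks), not a full implementation of the published ddmin algorithm. `ddmin` and the `fails` predicate are illustrative names.

```python
def ddmin(items, fails):
    """Shrink items to a smaller list for which fails(...) is still True."""
    assert fails(items), "the starting input must reproduce the failure"
    n = 2  # number of chunks to split the input into
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        reduced = False
        for start in range(0, len(items), chunk):
            # Try the input with one chunk removed.
            candidate = items[:start] + items[start + chunk:]
            if fails(candidate):
                items = candidate
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunk == 1:
                break  # no single element can be removed
            n = min(n * 2, len(items))  # retry with finer chunks
    return items


# Hypothetical failure: the bug needs both 3 and 7 in the input.
minimal = ddmin(list(range(10)), lambda xs: 3 in xs and 7 in xs)
print(minimal)
```

The result is minimal only in the sense that removing any single remaining element makes the failure disappear.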
Rubber Duck Tracing
## Explain the flow out loud:
1. User clicks "Submit Order"
2. Frontend sends POST to /api/orders
3. Backend validates the payload... WAIT
- Does it validate the discount code?
- What if discount code is empty string vs null?
4. Found it: Empty string "" passes validation but fails lookup
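The finding in step 4 can be reproduced in a few lines. This is a minimal sketch; the `DISCOUNTS` table and both validators are hypothetical and only illustrate the empty-string versus null distinction.

```python
DISCOUNTS = {"SAVE10": 0.10}  # hypothetical lookup table


def validate_loose(code):
    # Buggy check: only rejects a missing field, so "" slips through.
    return code is not None


def validate_strict(code):
    # Treat None, "" and whitespace-only strings as "no code supplied".
    return bool(code and code.strip())


def discount_for(code, validate):
    if not validate(code):
        return 0.0  # no discount requested
    return DISCOUNTS[code]  # raises KeyError for "" under validate_loose


print(discount_for("", validate_strict))
try:
    discount_for("", validate_loose)
except KeyError:
    print("empty string passed validation but failed lookup")
```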
Error Pattern Recognition
Null/Undefined Errors
SYMPTOM: Cannot read property 'X' of null/undefined
TRACE QUESTIONS:
1. What variable is null?
2. Where was it supposed to be set?
3. What condition would leave it unset?
4. Is there a race condition?
5. Is there a missing await/callback?
COMMON ROOT CAUSES:
- Async operation not awaited
- Conditional initialization with edge case
- Object destructuring with missing keys
- Database query returning no results
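For the "query returning no results" case, one common mitigation is to fail where the value goes missing instead of letting `None` travel further down the call chain. The sketch below uses an in-memory dict as a stand-in for a database; `UserNotFound`, `find_user`, and `get_profile` are hypothetical names.

```python
from typing import Optional


class UserNotFound(LookupError):
    """Raised when a lookup returns no row (hypothetical error type)."""


USERS = {1: {"email": "Ada@Example.com"}}  # stand-in for a database table


def find_user(user_id: int) -> Optional[dict]:
    return USERS.get(user_id)  # None when the query matches nothing


def get_profile(user_id: int) -> dict:
    user = find_user(user_id)
    if user is None:
        # Fail at the point where the value went missing, with context.
        raise UserNotFound(f"no user with id={user_id}")
    return {"email": user["email"].lower()}


print(get_profile(1))
```

An explicit error with the missing ID shortens the trace: the symptom and the immediate cause end up in the same message.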
Race Conditions
SYMPTOM: Intermittent failures, works on retry
TRACE QUESTIONS:
1. Are there multiple async operations?
2. Is there shared state?
3. Are there assumptions about order of execution?
4. Are database transactions being used?
COMMON ROOT CAUSES:
- Missing database transaction
- Read-after-write without waiting
- Multiple requests modifying same resource
- Cache invalidation timing
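Race conditions are hard to reproduce on demand, so it can help to replay the suspected interleaving by hand. The sketch below first replays a lost update deterministically (two requests read before either writes), then shows a lock-based fix. `Account` and `withdraw` are hypothetical, and a database would typically use a transaction or row lock instead of `threading.Lock`.

```python
import threading


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()


# Lost update, replayed step by step: both requests read before either writes.
account = Account(100)
read_a = account.balance
read_b = account.balance
account.balance = read_a - 30
account.balance = read_b - 50  # overwrites request A's write
lost_update_balance = account.balance


def withdraw(account, amount):
    # Hold the lock across the whole read-modify-write sequence.
    with account.lock:
        account.balance -= amount


account = Account(100)
threads = [
    threading.Thread(target=withdraw, args=(account, amount))
    for amount in (30, 50)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(lost_update_balance, account.balance)
```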
Resource Exhaustion
SYMPTOM: System slows/crashes under load
TRACE QUESTIONS:
1. What resources are being consumed?
2. Are connections being closed?
3. Are there memory leaks?
4. Is there unbounded growth?
COMMON ROOT CAUSES:
- Database connection pool exhaustion
- Memory leaks in long-running processes
- Unbounded queues or caches
- Missing cleanup in error paths
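"Missing cleanup in error paths" is easy to demonstrate with a toy pool that only counts checkouts. Everything in this sketch (`Pool`, `leaky_query`, `safe_query`) is hypothetical; real pools behave similarly but have their own APIs.

```python
from contextlib import contextmanager


class Pool:
    """Toy connection pool that only counts checkouts (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1

    def release(self):
        self.in_use -= 1

    @contextmanager
    def connection(self):
        self.acquire()
        try:
            yield self
        finally:
            self.release()  # runs on success and on error


def leaky_query(pool, fail):
    pool.acquire()
    if fail:
        raise ValueError("query failed")  # release() is skipped
    pool.release()


def safe_query(pool, fail):
    with pool.connection():
        if fail:
            raise ValueError("query failed")


leaky, safe = Pool(2), Pool(2)
for _ in range(2):
    for pool, query in ((leaky, leaky_query), (safe, safe_query)):
        try:
            query(pool, fail=True)
        except ValueError:
            pass
print(leaky.in_use, safe.in_use)
```

After two failed queries the leaky pool has no free connections left, while the context-manager version returns every connection.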
Systematic Trace Template
## Root Cause Analysis: [Issue Title]
### 1. Incident Summary
- **Date/Time**:
- **Duration**:
- **Impact**:
- **Detected by**:
### 2. Timeline
| Time | Event |
|------|-------|
| 10:00 | First error logged |
| 10:05 | Alert triggered |
| 10:10 | Investigation started |
| 10:30 | Root cause identified |
| 10:45 | Fix deployed |
### 3. Error Chain
[Symptom]
↓
[Immediate Cause]
↓
[Contributing Factor]
↓
[Root Cause]
### 4. Evidence
- Log snippets
- Stack traces
- Metrics/graphs
- Reproduction steps
### 5. Root Cause
[Clear statement of the fundamental issue]
### 6. Fix
[What was done to resolve]
### 7. Prevention
- [ ] Add validation for X
- [ ] Add monitoring for Y
- [ ] Update documentation for Z
Tools for Tracing
Distributed Tracing
# Using OpenTelemetry
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
def process_request(request):
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("user_id", request.user_id)
        with tracer.start_as_current_span("validate"):
            validate(request)
        with tracer.start_as_current_span("process"):
            result = process(request)
        span.set_attribute("result_count", len(result))
        return result
Error Aggregation Query
-- Find the most frequent error patterns in the last 24 hours
-- (interval syntax as used by PostgreSQL)
SELECT
error_type,
error_message,
COUNT(*) as occurrences,
MIN(timestamp) as first_seen,
MAX(timestamp) as last_seen
FROM error_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY error_type, error_message
ORDER BY occurrences DESC
LIMIT 20;
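To try the aggregation without a production database, the query can be run against an in-memory SQLite table. This sketch swaps in SQLite's `datetime('now', ...)` for the interval arithmetic above; the table contents are made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE error_logs (error_type TEXT, error_message TEXT, timestamp TEXT)"
)
# Sample rows; the third column is an offset relative to the current time.
conn.executemany(
    "INSERT INTO error_logs VALUES (?, ?, datetime('now', ?))",
    [
        ("TimeoutError", "upstream timed out", "-1 hours"),
        ("TimeoutError", "upstream timed out", "-2 hours"),
        ("KeyError", "missing 'email'", "-3 hours"),
        ("KeyError", "missing 'email'", "-3 days"),  # outside the window
    ],
)
rows = conn.execute(
    """
    SELECT error_type, error_message, COUNT(*) AS occurrences,
           MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen
    FROM error_logs
    WHERE timestamp > datetime('now', '-24 hours')
    GROUP BY error_type, error_message
    ORDER BY occurrences DESC
    LIMIT 20
    """
).fetchall()
for row in rows:
    print(row)
```

The `first_seen` column is often the most useful one for tracing: it points at the moment to start reading logs from.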
Checklist
- Capture exact error message and stack trace
- Identify timestamp and affected users/requests
- Gather relevant logs around the timeframe
- Reproduce in isolation if possible
- Trace backwards from symptom to root
- Document the error chain
- Identify fix AND prevention
- Create regression test
