Root Cause Tracing

When to Use This Skill

Investigating production errors
Debugging complex multi-step failures
Analyzing error chains and cascading failures
Understanding why a specific state occurred
Post-mortem analysis of incidents

Tracing Methodology

1. Start from the Symptom

## Error Chain Template

SYMPTOM: [What the user/system reported]
↓
IMMEDIATE CAUSE: [Direct technical cause]
↓
CONTRIBUTING FACTOR: [What enabled the immediate cause]
↓
ROOT CAUSE: [The fundamental issue to fix]

Example Trace

SYMPTOM: User sees "500 Internal Server Error"
↓
IMMEDIATE CAUSE: Unhandled null pointer exception in UserService.getProfile()
↓
CONTRIBUTING FACTOR: Database returned null for user that should exist
↓
ROOT CAUSE: Race condition during user registration - DB write not committed before redirect

Trace Techniques

Stack Trace Analysis

# Given this stack trace:
# Traceback (most recent call last):
#   File "api/handlers.py", line 45, in get_user
#     profile = user_service.get_profile(user_id)
#   File "services/user.py", line 23, in get_profile
#     return self.repo.find(user_id).to_dict()
#   File "models/user.py", line 67, in to_dict
#     'email': self.email.lower()
# AttributeError: 'NoneType' object has no attribute 'lower'

# Trace backwards:
# 1. self.email is None (immediate cause)
# 2. User model was created without email validation
# 3. API endpoint doesn't validate email before save
# 4. ROOT CAUSE: Missing input validation
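
Part of the backward walk above can be scripted. The following is a minimal sketch using Python's standard `traceback` module to list frames starting from the failure site; the `User` class, `get_profile`, and `frames_innermost_first` are hypothetical names chosen to mirror the trace, not code from a real project.

```python
import traceback


class User:
    """Hypothetical model reproducing the AttributeError from the trace above."""

    def __init__(self, email=None):
        self.email = email

    def to_dict(self):
        return {"email": self.email.lower()}


def get_profile(user):
    return user.to_dict()


def frames_innermost_first(exc):
    """Return (function name, source line) pairs, starting where exc was raised."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(frame.name, frame.line) for frame in reversed(frames)]


try:
    get_profile(User(email=None))
except AttributeError as exc:
    trail = frames_innermost_first(exc)
    for name, line in trail:
        print(f"{name}: {line}")
```

Reading the list top-down gives the same order as the manual trace: the failing expression first, then each caller that passed the bad value along.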

Log Correlation

# Find related logs by request ID, sorted by line content
# (assumes each log line starts with a timestamp)
grep "req_abc123" /var/log/app/*.log | sort -t: -k2

# Timeline reconstruction
grep -h "2024-01-15T10:3" error.log access.log | sort

# Find first occurrence of error pattern
grep -n "NullPointerException" app.log | head -1
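
When the logs are already loaded in a script, the same correlation can be done in Python. This is a rough sketch that assumes a line format of `<ISO-8601 timestamp> <LEVEL> <request id> <message>`; that format, the sample lines, and the `timeline_by_request` helper are illustrative assumptions, so adjust the pattern to the real log layout.

```python
import re
from collections import defaultdict

# Assumed line format: "<ISO-8601 timestamp> <LEVEL> <request id> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+) (?P<level>\w+) (?P<req>req_\w+) (?P<msg>.*)$")


def timeline_by_request(lines):
    """Group parsed log lines by request ID, each group sorted by timestamp."""
    grouped = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # skip lines that do not fit the assumed format
            grouped[match["req"]].append((match["ts"], match["level"], match["msg"]))
    # ISO-8601 timestamps in one timezone sort correctly as plain strings.
    return {req: sorted(events) for req, events in grouped.items()}


sample = [
    "2024-01-15T10:30:02Z ERROR req_abc123 profile lookup failed",
    "2024-01-15T10:30:01Z INFO req_abc123 user registered",
    "2024-01-15T10:30:01Z INFO req_def456 health check",
]
timeline = timeline_by_request(sample)
```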

State Inspection

# Add trace points to understand state flow
import logging

logger = logging.getLogger(__name__)

# validate_order, calculate_totals and save_order are the application's own steps
def process_order(order):
    # vars(obj) returns obj.__dict__; %s defers formatting until DEBUG is enabled
    logger.debug("[TRACE] Input state: %s", vars(order))

    validated = validate_order(order)
    logger.debug("[TRACE] After validation: %s", vars(validated))

    calculated = calculate_totals(validated)
    logger.debug("[TRACE] After calculation: %s", vars(calculated))

    saved = save_order(calculated)
    logger.debug("[TRACE] After save: %s", vars(saved))

    return saved
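
If many functions need trace points, a decorator avoids repeating the logging calls. A minimal sketch, assuming the standard `logging` module; `trace_state` and the toy `calculate_totals` step are hypothetical names for illustration.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def trace_state(func):
    """Log a function's arguments and return value at DEBUG level."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.debug("[TRACE] %s input: args=%r kwargs=%r", func.__name__, args, kwargs)
        result = func(*args, **kwargs)
        logger.debug("[TRACE] %s output: %r", func.__name__, result)
        return result

    return wrapper


@trace_state
def calculate_totals(order):
    # Hypothetical step: sum the line-item prices into a total.
    return {**order, "total": sum(order["items"])}
```

The trace lines only appear when the logger is set to DEBUG, so the decorator can stay in place after the investigation.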

Debugging Patterns

Binary Search Debugging

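# When a long process fails somewhere, add checkpoints to find the first step
# whose output is wrong. With many steps, check the midpoint first and then
# halve the failing side instead of checking every step.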

def long_process(data):
    # Add checkpoint
    step1_result = step1(data)
    print(f"CHECKPOINT 1: {step1_result is not None}")  # Pass

    step2_result = step2(step1_result)
    print(f"CHECKPOINT 2: {step2_result is not None}")  # Pass

    step3_result = step3(step2_result)
    print(f"CHECKPOINT 3: {step3_result is not None}")  # FAIL - narrow down here

    step4_result = step4(step3_result)
    # ...
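
The checkpoint idea can be turned into an actual binary search when the steps are available as a list. This sketch assumes the steps are deterministic, that the input itself is valid, and that once an intermediate result goes bad every later result is bad too; `first_failing_step` and the sample steps are hypothetical.

```python
def first_failing_step(steps, data, is_valid):
    """Binary-search for the first step whose output fails is_valid.

    Returns the index of the first failing step, or None if all pass.
    """

    def valid_after(count):
        # Re-run the first `count` steps from the original input.
        result = data
        try:
            for step in steps[:count]:
                result = step(result)
        except Exception:
            return False  # a crash counts as an invalid result
        return is_valid(result)

    if valid_after(len(steps)):
        return None
    low, high = 0, len(steps)  # valid after `low` steps, invalid after `high`
    while high - low > 1:
        mid = (low + high) // 2
        if valid_after(mid):
            low = mid
        else:
            high = mid
    return high - 1


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: None, lambda x: x]
culprit = first_failing_step(steps, 1, lambda result: result is not None)
```

Each probe re-runs the pipeline prefix, so this trades extra executions for fewer manual checkpoints; it pays off when there are many steps.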

Delta Debugging

# Find which commit introduced a bug
git bisect start
git bisect bad HEAD
git bisect good v1.0.0
# Git checks out commits by binary search; mark each one with
# `git bisect good` or `git bisect bad` until it reports the first bad commit
git bisect reset  # return to the original HEAD when done
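
Marking commits by hand can be automated with `git bisect run <command>`, which reads the command's exit code: 0 marks the commit good, 125 skips it, and other codes from 1 to 127 mark it bad. Below is a sketch of a helper that produces those codes; the build and test commands are stand-ins, and `bisect_verdict` is a hypothetical name.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_verdict(build_cmd, test_cmd):
    """Map a build-and-test run on the current checkout to a bisect exit code."""
    if subprocess.run(build_cmd).returncode != 0:
        return SKIP  # commit cannot be built, so it cannot be judged
    return GOOD if subprocess.run(test_cmd).returncode == 0 else BAD


# Stand-in commands for illustration; a real script would call the project's
# own build and test commands and finish with sys.exit(verdict).
ok = [sys.executable, "-c", "pass"]
failing = [sys.executable, "-c", "raise SystemExit(3)"]
verdict = bisect_verdict(ok, failing)
```

Saved as a script that exits with the verdict, this could be passed to `git bisect run` after the good and bad commits are marked.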

Rubber Duck Tracing

## Explain the flow out loud:

1. User clicks "Submit Order"
2. Frontend sends POST to /api/orders
3. Backend validates the payload... WAIT
   - Does it validate the discount code?
   - What if discount code is empty string vs null?
4. Found it: Empty string "" passes validation but fails lookup
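
The bug found in step 4 can be reproduced in a few lines. This is an illustrative sketch, not the actual order service: `KNOWN_DISCOUNTS`, `validate_naive`, and `normalize_discount_code` are hypothetical names.

```python
KNOWN_DISCOUNTS = {"SAVE10": 0.10}  # hypothetical lookup table


def validate_naive(code):
    # Bug: only None is treated as "no code", so "" slips through.
    return code is not None


def normalize_discount_code(code):
    """Treat None, "" and whitespace-only input as 'no discount code'."""
    if code is None or not code.strip():
        return None
    return code.strip().upper()


def discount_rate(code):
    normalized = normalize_discount_code(code)
    if normalized is None:
        return 0.0
    return KNOWN_DISCOUNTS[normalized]  # raises KeyError for unknown codes
```

Normalizing once at the boundary means the lookup never has to distinguish an empty string from a missing value.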

Error Pattern Recognition

Null/Undefined Errors

SYMPTOM: Cannot read property 'X' of null/undefined

TRACE QUESTIONS:
1. What variable is null?
2. Where was it supposed to be set?
3. What condition would leave it unset?
4. Is there a race condition?
5. Is there a missing await/callback?

COMMON ROOT CAUSES:
- Async operation not awaited
- Conditional initialization with edge case
- Object destructuring with missing keys
- Database query returning no results
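
For the "query returning no results" case, one common remedy is to fail at the point where the value goes missing, so the error names the real problem. A minimal sketch with an in-memory dictionary standing in for the database; `USERS`, `find_user`, and `UserNotFound` are hypothetical.

```python
from typing import Optional


class UserNotFound(LookupError):
    """Raised when a user ID has no matching record."""


USERS = {1: {"email": "Ada@Example.com"}}  # hypothetical stand-in for a database


def find_user(user_id: int) -> Optional[dict]:
    return USERS.get(user_id)  # None when the query matches nothing


def get_profile(user_id: int) -> dict:
    user = find_user(user_id)
    if user is None:
        # Fail where the value goes missing, not several calls later.
        raise UserNotFound(f"no user with id {user_id}")
    return {"email": user["email"].lower()}
```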

Race Conditions

SYMPTOM: Intermittent failures, works on retry

TRACE QUESTIONS:
1. Are there multiple async operations?
2. Is there shared state?
3. Are there assumptions about order of execution?
4. Are database transactions being used?

COMMON ROOT CAUSES:
- Missing database transaction
- Read-after-write without waiting
- Multiple requests modifying same resource
- Cache invalidation timing
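
The "multiple requests modifying same resource" cause can be shown deterministically with `asyncio`, where the only task switches happen at `await`. This sketch uses a hypothetical `Account` class; the `asyncio.sleep(0)` stands in for any I/O between the read and the write.

```python
import asyncio


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = asyncio.Lock()


async def withdraw_unsafe(account, amount):
    current = account.balance            # read
    await asyncio.sleep(0)               # yield: another task can run here
    account.balance = current - amount   # write based on a stale read


async def withdraw_safe(account, amount):
    async with account.lock:             # read and write happen as one unit
        current = account.balance
        await asyncio.sleep(0)
        account.balance = current - amount


async def run_withdrawals(withdraw):
    account = Account(100)
    await asyncio.gather(withdraw(account, 30), withdraw(account, 30))
    return account.balance


unsafe_balance = asyncio.run(run_withdrawals(withdraw_unsafe))  # 70: one update lost
safe_balance = asyncio.run(run_withdrawals(withdraw_safe))      # 40: both applied
```

With threads or separate processes the interleaving is not deterministic, which is why the same bug shows up there as an intermittent failure.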

Resource Exhaustion

SYMPTOM: System slows/crashes under load

TRACE QUESTIONS:
1. What resources are being consumed?
2. Are connections being closed?
3. Are there memory leaks?
4. Is there unbounded growth?

COMMON ROOT CAUSES:
- Database connection pool exhaustion
- Memory leaks in long-running processes
- Unbounded queues or caches
- Missing cleanup in error paths
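
"Missing cleanup in error paths" is usually fixed by releasing the resource in a `finally` block or a context manager. The sketch below uses a toy pool that only counts free slots; `ConnectionPool` is a hypothetical stand-in for a real driver's pool.

```python
from contextlib import contextmanager


class ConnectionPool:
    """Toy pool that only counts free slots."""

    def __init__(self, size):
        self.free = size

    @contextmanager
    def connection(self):
        if self.free == 0:
            raise RuntimeError("connection pool exhausted")
        self.free -= 1
        try:
            yield object()  # placeholder for a real connection
        finally:
            self.free += 1  # runs on success and on error


pool = ConnectionPool(size=2)
try:
    with pool.connection():
        raise ValueError("query failed")
except ValueError:
    pass  # the slot was still returned to the pool
```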

Systematic Trace Template

## Root Cause Analysis: [Issue Title]

### 1. Incident Summary
- **Date/Time**:
- **Duration**:
- **Impact**:
- **Detected by**:

### 2. Timeline
| Time | Event |
|------|-------|
| 10:00 | First error logged |
| 10:05 | Alert triggered |
| 10:10 | Investigation started |
| 10:30 | Root cause identified |
| 10:45 | Fix deployed |

### 3. Error Chain

[Symptom]
↓
[Immediate Cause]
↓
[Contributing Factor]
↓
[Root Cause]

### 4. Evidence
- Log snippets
- Stack traces
- Metrics/graphs
- Reproduction steps

### 5. Root Cause
[Clear statement of the fundamental issue]

### 6. Fix
[What was done to resolve]

### 7. Prevention
- [ ] Add validation for X
- [ ] Add monitoring for Y
- [ ] Update documentation for Z
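
The timeline in section 2 of the template also yields the intervals a post-mortem typically reports. A small sketch using the sample times from the table; the event keys and the `elapsed` helper are hypothetical, and all events are assumed to fall on the same day.

```python
from datetime import datetime, timedelta

timeline = {
    "first_error": "10:00",
    "alert": "10:05",
    "investigation": "10:10",
    "root_cause": "10:30",
    "fix_deployed": "10:45",
}


def elapsed(start, end, events=timeline):
    """Return the time between two same-day events given as HH:MM strings."""
    fmt = "%H:%M"
    return datetime.strptime(events[end], fmt) - datetime.strptime(events[start], fmt)


time_to_detect = elapsed("first_error", "alert")           # 5 minutes
time_to_resolve = elapsed("first_error", "fix_deployed")   # 45 minutes
```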

Tools for Tracing

Distributed Tracing

# Using OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_request(request):
    with tracer.start_as_current_span("process_request") as span:
        span.set_attribute("user_id", request.user_id)

        with tracer.start_as_current_span("validate"):
            validate(request)

        with tracer.start_as_current_span("process"):
            result = process(request)
            span.set_attribute("result_count", len(result))

        return result

Error Aggregation Query

-- Find the most frequent error patterns in the last 24 hours
-- (interval syntax as in PostgreSQL)
SELECT
  error_type,
  error_message,
  COUNT(*) as occurrences,
  MIN(timestamp) as first_seen,
  MAX(timestamp) as last_seen
FROM error_logs
WHERE timestamp > NOW() - INTERVAL '24 hours'
GROUP BY error_type, error_message
ORDER BY occurrences DESC
LIMIT 20;
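
To try the aggregation without a database server, the same grouping can be run against an in-memory SQLite table. This is a sketch under stated assumptions: the sample rows are invented, timestamps are stored as ISO-8601 text, and an explicit `since` cutoff replaces `NOW() - INTERVAL '24 hours'` so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE error_logs (error_type TEXT, error_message TEXT, timestamp TEXT)")
conn.executemany(
    "INSERT INTO error_logs VALUES (?, ?, ?)",
    [
        ("AttributeError", "'NoneType' object has no attribute 'lower'", "2024-01-15T10:30:00"),
        ("AttributeError", "'NoneType' object has no attribute 'lower'", "2024-01-15T10:31:00"),
        ("TimeoutError", "db connect timed out", "2024-01-15T10:32:00"),
        ("TimeoutError", "db connect timed out", "2024-01-13T09:00:00"),  # outside the window
    ],
)


def top_errors(conn, since, limit=20):
    """Group errors logged at or after `since` (ISO-8601 text) by type and message."""
    return conn.execute(
        """
        SELECT error_type, error_message, COUNT(*) AS occurrences,
               MIN(timestamp) AS first_seen, MAX(timestamp) AS last_seen
        FROM error_logs
        WHERE timestamp >= ?
        GROUP BY error_type, error_message
        ORDER BY occurrences DESC
        LIMIT ?
        """,
        (since, limit),
    ).fetchall()


rows = top_errors(conn, since="2024-01-14T10:32:00")
```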

Checklist

Capture exact error message and stack trace
Identify timestamp and affected users/requests
Gather relevant logs around the timeframe
Reproduce in isolation if possible
Trace backwards from symptom to root
Document the error chain
Identify fix AND prevention
Create regression test
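
For the last checklist item, a regression test pins the fix in place. The sketch below targets the missing-email bug from the stack trace example using the standard `unittest` module; the `User` class here is a hypothetical fixed version, not the original model.

```python
import unittest


class User:
    """Hypothetical model with the validation the stack trace example lacked."""

    def __init__(self, email):
        if not email:
            raise ValueError("email is required")
        self.email = email

    def to_dict(self):
        return {"email": self.email.lower()}


class UserEmailRegressionTest(unittest.TestCase):
    def test_missing_email_is_rejected_at_creation(self):
        with self.assertRaises(ValueError):
            User(email=None)

    def test_to_dict_lowercases_email(self):
        self.assertEqual(User("Ada@Example.com").to_dict(), {"email": "ada@example.com"})
```

The first test fails against the unfixed model, which is what makes it a regression test: it encodes the root cause, not just the symptom.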
