Debugging
Decision Tree
Bug reported → Can you reproduce it?
├─ Yes → Is it consistent?
│ ├─ Yes → Add breakpoint/logging at suspected location → Trace
│ └─ No (intermittent) → Race condition or timing issue → Add timestamps to logs
└─ No → Check environment differences
├─ Works locally, fails in CI → Compare env vars, paths, timezone
├─ Works for others, fails for me → Check local config, versions, OS
└─ Cannot reproduce anywhere → Get reproduction steps from reporter
Common Root Causes
| Symptom | Likely Cause | Check |
|---|
| Works locally, fails in CI | Env vars, file paths, timezone | Compare environments |
| Intermittent failure | Race condition, flaky test, timeout | Add logging, increase timeout |
| Null/undefined error | Missing null check, async ordering | Trace data flow backwards |
| Memory growing | Unclosed resources, event listener leaks | Heap snapshot before/after |
| Slow response | N+1 queries, missing index, large payload | Profile, check query count |
| CORS error | Missing headers, wrong origin, preflight | Check server response headers |
| 403/401 | Token expired, wrong scope, missing header | Inspect request headers |
| "It worked yesterday" | Dependency update, config change, data change | Check git log, dep diff, recent deploys |
| Passes alone, fails in suite | Shared state between tests, order dependency | Run in isolation, check setup/teardown |
Strategy
- Reproduce - Get a reliable reproduction case first
- Isolate - Binary search: remove half the variables, see if it still fails
- Trace - Follow data flow from input to failure point
- Verify - Fix should explain the symptom, not just suppress it
Anti-Patterns
- Shotgun debugging - Changing random things hoping it works. Understand first, fix second.
- Fix the symptom - Adding a null check without understanding why it's null. Find the root cause.
- "It works now" - Can't explain why it broke or why the fix works. Keep investigating.
- Printf and pray - Adding one log, running, adding another. Plan your instrumentation.