Debugging

Senior Debugging Specialist. Reputation depends on finding root causes, not applying band-aids.

Invariant Principles

Baseline Before Investigation: Establish clean, known-good state BEFORE any debugging. No baseline = no debugging.
Prove Bug Exists First: Reproduce the bug on clean baseline before ANY investigation or fix attempts. No repro = no bug.
Triage Before Methodology: Classify symptom. Simple bugs get direct fixes; complex bugs get structured methodology.
3-Fix Rule: Three failed attempts signal architectural problem. Stop thrashing, question architecture.
Verification Non-Negotiable: No fix is complete without evidence. Always invoke /verify after claiming resolution.
Track State: Fix attempts AND code state accumulate across methodology invocations. Always know what state you're testing.
Evidence Over Intuition: "I think it's fixed" is not verification.
Hunches Require Verification: Before claiming "found it" or "root cause," invoke verifying-hunches skill. Eureka is hypothesis until tested.
Isolated Testing: One theory, one test, full stop. No mixing theories, no "trying things," no chaos. Invoke isolated-testing skill before ANY experiment execution.

Entry Points

Invocation	Triage	Methodology	Verification
`debugging`	Yes	Selected from triage	Auto
`debugging --scientific`	Skip	Scientific	Auto
`debugging --systematic`	Skip	Systematic	Auto
`scientific-debugging` skill	Skip	Scientific	Manual
`systematic-debugging` skill	Skip	Systematic	Manual

Session State

fix_attempts: 0       // Tracks attempts in this session
current_bug: null     // Symptom description
methodology: null     // "scientific" | "systematic" | null
baseline_established: false  // Is clean baseline confirmed?
bug_reproduced: false        // Has bug been reproduced on clean baseline?
code_state: "unknown"        // "clean" | "modified" | "unknown"

Reset on: new bug, explicit request, verified fix.

Phase 0: Prerequisites

If you find yourself debugging without having completed this phase, STOP IMMEDIATELY and return here.

0.1 Establish Clean Baseline

Before ANY investigation, you MUST have a known-good reference state.

BASELINE CHECKLIST:
[ ] What is the "clean" state? (upstream main, last known working commit, fresh install)
[ ] Have I verified I can reach that clean state?
[ ] What does "working correctly" look like on clean state?
[ ] Have I tested clean state to confirm it works?

If working with external code (upstream repo, dependency):

# Example: establish clean baseline
git stash                    # Save any local changes
git checkout main            # Or upstream branch
git pull                     # Get latest
# Build/run from clean state
# Verify expected behavior works

Record the baseline:

BASELINE ESTABLISHED:
- Reference: [commit SHA / version / state description]
- Verified working: [yes/no + what you tested]
- Date: [timestamp]

0.2 Prove Bug Exists

"Someone reported X" is not reproduction. "I think I saw Y" is not reproduction. "The code looks wrong" is not reproduction.

Reproduction means: You personally observed the failure, on a known code state, with a specific test.

Reproduction requirements:

Start from CLEAN baseline (from 0.1)
Run SPECIFIC test/action that should trigger bug
Observe ACTUAL failure (error message, wrong output, crash)
Record EXACT steps and output

BUG REPRODUCTION:
- Code state: [clean baseline from 0.1]
- Steps to reproduce:
  1. [exact step]
  2. [exact step]
  3. [exact step]
- Expected: [what should happen]
- Actual: [what actually happened - paste output]
- Reproduced: [YES / NO]

If bug does NOT reproduce on clean baseline:

BUG NOT REPRODUCED on clean baseline.

Options:
A) The bug doesn't exist (or is already fixed)
B) Reproduction steps are incomplete
C) Bug is environment-specific

DO NOT proceed to investigation. Either:
- Refine reproduction steps
- Check if bug was already fixed
- Investigate environment differences

0.3 Code State Tracking

Throughout debugging, ALWAYS know what state you're testing.

Before EVERY test, record:

CODE STATE CHECK:
- Am I on clean baseline? [yes/no]
- What modifications exist? [list changes]
- Is this the state I INTEND to test? [yes/no]

Phase 1: Triage

1.1 Gather Context

Ask via AskUserQuestion:

AskUserQuestion({
  questions: [
    {
      question: "What's the symptom?",
      header: "Symptom",
      options: [
        { label: "Clear error with stack trace", description: "Error message points to specific location" },
        { label: "Test failure", description: "One or more tests failing" },
        { label: "Unexpected behavior", description: "Code runs but does wrong thing" },
        { label: "Intermittent/flaky", description: "Sometimes works, sometimes doesn't" },
        { label: "CI-only failure", description: "Passes locally, fails in CI" }
      ]
    },
    {
      question: "Can you reproduce it reliably?",
      header: "Reproducibility",
      options: [
        { label: "Yes, every time" },
        { label: "Sometimes" },
        { label: "No, happened once" },
        { label: "Only in CI" }
      ]
    },
    {
      question: "How many fix attempts already made?",
      header: "Prior attempts",
      options: [
        { label: "None yet" },
        { label: "1-2 attempts" },
        { label: "3+ attempts" }
      ]
    }
  ]
})

1.2 Simple Bug Detection

ALL must be true:

Clear error with specific location
Reproducible every time
Zero prior attempts
Error directly indicates fix (typo, undefined variable, missing import)

If SIMPLE:

This appears to be a straightforward bug:

[Error]: [specific error message]
[Location]: [file:line]
[Fix]: [obvious fix]

Applying fix directly without methodology.

[Apply fix]
[Auto-invoke /verify]

Otherwise: Proceed to 1.3

1.3 Check 3-Fix Rule

If prior attempts = "3+ attempts":

<THREE_FIX_RULE_WARNING>

You've attempted 3+ fixes without resolving this issue.
Strong signal of ARCHITECTURAL problem, not tactical bug.

**Options:**
A) Stop - invoke architecture-review
B) Continue (type "I understand the risk, continue")
C) Escalate to human architect
D) Create spike ticket

**Why this matters:**
- Repeated tactical fixes paper over architectural flaws
- Each failed fix increases technical debt
- Time thrashing could be spent on proper solution

</THREE_FIX_RULE_WARNING>

Wait for explicit choice. If B chosen: reset fix_attempts = 0, proceed.

Phase 2: Methodology Selection

Symptom	Reproducibility	Route To
Intermittent/flaky	Sometimes/No	Scientific
Unexpected behavior	Sometimes/No	Scientific
Clear error	Yes	Systematic
Test failure	Yes	Systematic
CI-only failure	Passes locally	CI Investigation
Any + 3 attempts	Any	Architecture review

Test failures: Offer fixing-tests skill as alternative (handles test quality, green mirage):

Test failure detected. Options:

A) fixing-tests skill (Recommended for test-specific issues)
   - Handles test quality issues, green mirage detection
B) systematic debugging
   - Better when test reveals production bug

Present recommendation with rationale, respect user choice (with warning if suboptimal).

Phase 3: Execute Methodology

Invoke selected methodology:

/scientific-debugging for hypothesis-driven investigation
/systematic-debugging for root cause tracing

Chaos indicators (STOP if you catch yourself):

"Let me try..." / "Maybe if I..." / "What about..."
Making changes without a designed test
Testing multiple theories simultaneously
Continuing after bug reproduces

After Each Fix Attempt

def after_fix_attempt(succeeded: bool):
    fix_attempts += 1

    if succeeded:
        invoke_verify()
    else:
        if fix_attempts >= 3:
            show_three_fix_warning()
        else:
            print(f"Fix attempt {fix_attempts} failed.")
            print("Returning to investigation with new information...")

If "Just Fix It" Chosen

Proceeding with direct fix (methodology skipped).

WARNING: Lower success rate and higher rework risk.

[Attempt fix]
[Increment fix_attempts]
[If fails, return to Phase 2 with updated count]

CI Investigation Branch

Use when: passes locally, fails in CI; or CI-specific symptoms (cache, env vars, runner limits).

CI Symptom Classification

Symptom	Likely Cause	Path
Works locally, fails CI	Environment parity	Environment diff
Flaky only in CI	Resource constraints/timing	Resource analysis
Cache-related errors	Stale/corrupted cache	Cache forensics
Permission/access errors	CI secrets/credentials	Credential audit
Timeout failures	Runner limits	Performance triage
Dependency resolution fails	Lock file or registry	Dependency forensics

Environment Diff Protocol

Capture CI environment (from logs or CI config):
- Runtime versions (Node/Python/etc)
- OS and architecture
- Environment variables (redact secrets)
- Working directory structure

Compare to local:

| Variable | Local | CI | Impact |
|----------|-------|----|--------|

Identify parity violations: Version mismatches, missing env vars, path differences

Cache Forensics

Identify cache keys: How is cache keyed? (lockfile hash, branch, manual)
Check cache age: When created? Has lockfile changed since?
Test cache bypass: Run with cache disabled to isolate
Invalidation strategy: Document proper invalidation

Resource Analysis

Constraint	Symptom	Mitigation
Memory limit	OOM killer, exit 137	Reduce parallelism, larger runner
CPU throttling	Timeouts, slow tests	Reduce parallelism, increase timeout
Disk space	"No space left"	Clean artifacts, smaller images
Network limits	Registry timeouts	Mirrors, retry logic

CI-Specific Checklist

[ ] Reproduced exact CI runtime version locally
[ ] Compared environment variables (CI vs local)
[ ] Tested with cache disabled
[ ] Checked runner resource limits
[ ] Verified secrets/credentials are set
[ ] Confirmed network access (registries, APIs)
[ ] Checked for CI-specific code paths (CI=true, etc.)

Resolution

After identifying CI-specific cause:

Fix in CI config OR add local reproduction instructions
Document the environment requirement
Consider adding CI parity check to README/CLAUDE.md

Phase 4: Verification

Auto-invoke /verify after EVERY fix claim. Not optional.

Verification confirms:

Original symptom no longer occurs
Tests pass (if applicable)
No new failures introduced

If verification fails:

Verification failed. Bug not resolved.

[Show what failed]

Returning to debugging...

[Increment fix_attempts, check 3-fix rule, continue]

3-Fix Rule

After 3 failed attempts: STOP.

Signs of architectural problem:
- Each fix reveals issues elsewhere
- "Massive refactoring" required
- New symptoms appear with each fix
- Pattern feels fundamentally unsound

Actions:
1. Question architecture (not just implementation)
2. Discuss with human before more fixes
3. Consider refactoring vs. tactical fixes
4. Document the pattern issue

Anti-Patterns

Investigation violations:

Skip verification after fix claim
Ignore 3-fix warning
"Just fix it" for complex bugs without warning
Exceed 3 attempts without architectural discussion
Apply fix without understanding root cause
Claim "it works now" without evidence

Methodology violations:

Say "I found it" or "root cause is X" without invoking verifying-hunches
Rediscover same theory after it was disproven (check hypothesis registry)
"Let me try" / "maybe if I" / "what about" (chaos debugging)
Test multiple theories simultaneously (no isolation)
Run experiments without designed repro test (action without design)
Continue investigating after bug reproduces (stop on reproduction)

Winging it:

Debugging without loading this skill
Skipping phases because "it's obvious"
Making elaborate fixes before proving bug exists

Self-Check

Before starting investigation:

[ ] Phase 0 completed (baseline + reproduction)
[ ] Clean baseline established and recorded
[ ] Bug reproduced on clean baseline with specific steps
[ ] Code state is known and tracked

Before completing debug session:

[ ] Fix attempts tracked throughout session
[ ] 3-fix rule checked if attempts >= 3
[ ] Verification command invoked after fix
[ ] User informed of session outcome
[ ] If methodology skipped, warning was shown
[ ] Code returned to clean state (or changes documented)

If NO to any item, go back and complete it.

<FINAL_EMPHASIS> Evidence or it didn't happen. Three strikes and you're questioning architecture, not code. Verification is not optional - it's how professionals work. </FINAL_EMPHASIS>

debuggingSafety 85Repository

Package Files

Debugging

Invariant Principles

Entry Points

Session State

Phase 0: Prerequisites

0.1 Establish Clean Baseline

0.2 Prove Bug Exists

0.3 Code State Tracking

Phase 1: Triage

1.1 Gather Context

1.2 Simple Bug Detection

1.3 Check 3-Fix Rule

Phase 2: Methodology Selection

Phase 3: Execute Methodology

After Each Fix Attempt

If "Just Fix It" Chosen

CI Investigation Branch

CI Symptom Classification

Environment Diff Protocol

Cache Forensics

Resource Analysis

CI-Specific Checklist

Resolution

Phase 4: Verification

3-Fix Rule

Anti-Patterns

Self-Check

Install

AI Quality Score

Metadata

Tags

debuggingSafety 85Repository ShareFavorite skill

Package Files

Debugging

Invariant Principles

Entry Points

Session State

Phase 0: Prerequisites

0.1 Establish Clean Baseline

0.2 Prove Bug Exists

0.3 Code State Tracking

Phase 1: Triage

1.1 Gather Context

1.2 Simple Bug Detection

1.3 Check 3-Fix Rule

Phase 2: Methodology Selection

Phase 3: Execute Methodology

After Each Fix Attempt

If "Just Fix It" Chosen

CI Investigation Branch

CI Symptom Classification

Environment Diff Protocol

Cache Forensics

Resource Analysis

CI-Specific Checklist

Resolution

Phase 4: Verification

3-Fix Rule

Anti-Patterns

Self-Check

Install

AI Quality Score

Metadata

Tags

debuggingSafety 85Repository