Competing Hypotheses: Adversarial Debugging Pattern
Overview
When the root cause is unclear, a single investigator anchors on the first plausible explanation. Multiple investigators testing competing theories — and actively trying to disprove each other — converge faster and more accurately.
Core principle: Adversarial debate eliminates weak hypotheses. The theory that survives structured attack is most likely correct.
Management theory: Psychological Safety (agents MUST challenge each other), Tuckman's Storming (debate is the mechanism, not a problem), Belbin Monitor-Evaluator (devil-advocate role is mandatory).
When to Use
- Bug with unclear root cause
- Multiple plausible explanations exist
- Single investigator would likely anchor on first theory
- Reproducing the bug is possible but cause is ambiguous
Don't use when:
- Root cause is obvious (just fix it)
- Only one plausible explanation
- Bug cannot be reproduced
- Debugging requires sequential steps (A reveals B reveals C)
Team Composition
coordinator (lead, acts as judge)
├── debugger × 2-5 (each tests one hypothesis)
└── devil-advocate × 1 (challenges all hypotheses)
Belbin coverage:
- Thinking: devil-advocate (Monitor-Evaluator)
- Action: debuggers (Shaper — drives toward root cause)
- People: coordinator (Coordinator — judges evidence)
Sizing by complexity:
| Bug Complexity | Debuggers | Notes |
|---|---|---|
| 2 plausible causes | 2 | Minimum viable debate |
| 3-4 theories | 3-4 | Standard case |
| Systemic/unknown | 4-5 + devil-advocate | Maximum investigation |
The Process
Phase 1: Forming — Hypothesis Generation
Coordinator:
- Describes the bug symptoms clearly
- Lists all plausible hypotheses (or asks team to generate them)
- Assigns one hypothesis per debugger
- Spawns devil-advocate to challenge all of them
Spawn prompt template for debugger:
You are testing the hypothesis: [HYPOTHESIS]
Bug symptoms: [SYMPTOMS]
Reproduction steps: [STEPS]
Your job:
1. Find evidence FOR your hypothesis (prove it)
2. Find evidence AGAINST your hypothesis (disprove it)
3. Be honest — if your hypothesis is wrong, say so
4. Report: evidence for, evidence against, confidence (0-100%)
You MUST report disconfirming evidence. Hiding evidence that
disproves your hypothesis is a critical failure.
Devil-advocate spawn prompt:
You will review each debugger's findings.
For EACH hypothesis:
1. What evidence would definitively prove it? Did they find it?
2. What evidence would definitively disprove it? Did they look?
3. Are there alternative explanations for their evidence?
4. What tests would distinguish this hypothesis from others?
Your success metric is finding flaws. Approving a weak hypothesis
without challenge is a failure.
Phase 2: Storming — Parallel Investigation
Each debugger independently:
- Investigates their assigned hypothesis
- Collects evidence (code, logs, test results)
- Reports both confirming AND disconfirming evidence
- Assigns confidence score
Critical: Debuggers must report disconfirming evidence. The prompt explicitly requires this to prevent confirmation bias.
Phase 3: Norming — Adversarial Debate
For each hypothesis:
1. Debugger presents: evidence for + against + confidence
2. Devil-advocate challenges: gaps, alternative explanations
3. Other debuggers challenge: "my evidence contradicts yours because..."
4. Coordinator scores: STRONG / WEAK / DISPROVEN
Elimination round:
- DISPROVEN hypotheses are discarded
- WEAK hypotheses get one more investigation round
- STRONG hypotheses proceed to verification
The debate IS the value. Sequential investigation suffers from anchoring. Parallel adversarial investigation eliminates weak theories faster.
Phase 4: Performing — Verification
For the surviving hypothesis:
- Debugger implements the fix
- Verify: does the fix resolve the symptoms?
- Regression: does the fix break anything else?
- Devil-advocate: "Is this really the root cause, or are we masking the symptom?"
Phase 5: Adjourning — Record & Reflect
- Document: the winning hypothesis, eliminated hypotheses, and why
- Record to
.claude.md: what theories were wrong and why (prevents future anchoring) - Trigger team-orchestrator:session-reflection
Example: App Exits After One Message
Coordinator identifies 5 hypotheses:
H1: WebSocket connection closing prematurely
H2: Event loop draining with no listeners
H3: Unhandled promise rejection causing exit
H4: Session timeout misconfigured
H5: Message handler throwing uncaught error
Team investigates in parallel:
Debugger 1 (H1): "WebSocket stays open — DISPROVEN"
Debugger 2 (H2): "Event loop has active listeners — DISPROVEN"
Debugger 3 (H3): "Found unhandled rejection in auth middleware — STRONG (80%)"
Debugger 4 (H4): "Timeout is 30min, app exits in 1s — DISPROVEN"
Debugger 5 (H5): "Message handler has try-catch — WEAK (30%)"
Devil-advocate: "H3 is strong but — does the rejection happen on EVERY
message or just the first? If first-only, the auth token refresh
might be the real cause, not the handler."
→ Follow-up reveals: auth token refresh throws on first use because
token is not yet set. H3 was close but the real root cause is
token initialization order.
Without adversarial debate, team would have patched the rejection
handler without fixing the token initialization.
Common Mistakes
| Mistake | Fix |
|---|---|
| Debugger hides disconfirming evidence | Prompt explicitly requires both-sides reporting |
| All hypotheses are variations of one idea | Ensure hypotheses are truly independent |
| Skipping debate — just picking highest confidence | Debate reveals flaws that confidence scores don't |
| Devil-advocate too soft | Prompt: "approving a weak hypothesis is a failure" |
| Not recording eliminated hypotheses | They prevent future anchoring — record them |
| Fixing symptom, not root cause | Devil-advocate's final question prevents this |
Integration
Pre-requisite: team-orchestrator:orchestrating-work routes here Post-requisite: team-orchestrator:session-reflection records learnings Related: superpowers:systematic-debugging for single-agent debugging
