Purpose
Detect when a long-running agent task has stalled, entered an infinite loop, exhausted resources, or stopped making meaningful progress — then intervene with a graduated recovery policy before the task wastes further compute or produces corrupt output.
When to use
- An autonomous agent run has exceeded its expected wall-clock time.
- Output logs show repeated identical actions (loop signature).
- A heartbeat or progress metric has flatlined for N intervals.
- Resource consumption (tokens, API calls, disk writes) is spiking without corresponding output.
- The orchestrator needs pre-defined escalation rules for unattended runs.
Do NOT use when
- The task is a single prompt/response with no iteration.
- The agent completed normally but the user dislikes the result (use review skills instead).
- You are actively debugging a failure that already terminated (use
multi-agent-debugging). - The run is interactive with a human in the loop providing guidance each step.
Operating procedure
- Collect run metadata. Read the agent's task description, expected duration, and any
configured timeout values. List them in a summary table:
| Field | Value |. - Define heartbeat contract. Set a heartbeat interval (default: 60 s for short tasks, 300 s for long tasks). Record the metric that constitutes "progress" — e.g., new file written, test passed, commit created, lines of output produced.
- Scan recent output for loop signatures. Run
grepor string-match on the last 200 lines of agent output. Flag if any 5-line block repeats 3+ times consecutively. - Check progress metric delta. Compare the progress metric at T-now vs T-minus-one-interval. If delta is zero for 2 consecutive intervals, mark the run as STALLED.
- Check resource burn rate. Count total tokens or API calls consumed in the last interval. If burn rate exceeds 3× the average of prior intervals with zero output delta, mark as RUNAWAY.
- Apply recovery action based on severity:
- STALLED → Send a nudge prompt: restate the goal and last successful checkpoint.
- RUNAWAY → Pause the agent, snapshot current state, and escalate to the orchestrator.
- LOOP_DETECTED → Inject a "break the pattern" prompt with an alternative approach suggestion.
- TIMEOUT → Terminate the run, preserve artifacts, and write a summary of progress so far.
- Log the intervention. Write a structured record:
{ timestamp, run_id, condition, action_taken, outcome }. - Verify recovery. After a nudge or pattern-break, wait one full interval and re-run steps 3-5. If the condition persists after 2 recovery attempts, escalate to TIMEOUT.
- Produce a watchdog report. Summarize: total run time, intervals monitored, conditions detected, actions taken, final status (recovered / terminated / escalated).
Decision rules
- Never terminate a run on the first stall detection; always attempt at least one nudge first.
- If the agent has produced partial valuable output, snapshot it before any destructive action.
- A loop is only confirmed when the same action sequence repeats 3+ times — not 2.
- Resource burn rate thresholds are relative to the run's own history, not absolute numbers.
- Escalation to a human or parent orchestrator is mandatory after 2 failed recovery attempts.
Output requirements
- Watchdog Status — one of: HEALTHY, STALLED, RUNAWAY, LOOP_DETECTED, TIMEOUT, RECOVERED.
- Heartbeat Log — table of intervals with progress metric values.
- Intervention Record — each action taken with timestamp and outcome.
- Recovery Recommendation — if terminated, a concrete suggestion for what to change on retry.
References
references/checkpoint-rules.md— how to snapshot agent state.references/failure-escalation.md— escalation chain definitions.references/delegate-contracts.md— expected heartbeat clauses in delegation.
Related skills
multi-agent-debugging— for post-mortem analysis after a watchdog termination.verification-before-advance— for quality gates that prevent runaway progress.human-interrupt-handling— for escalations that reach a human operator.autonomous-run-control— for broader run lifecycle management.
Failure handling
- If agent output is inaccessible (no logs), treat as STALLED immediately and escalate.
- If the heartbeat metric is undefined, fall back to wall-clock time with a 2× expected-duration threshold.
- If recovery actions themselves fail (e.g., prompt injection rejected), terminate and preserve all state.
- If multiple agents share a run, isolate the stalled agent before intervening to avoid cascade disruption.
