Sysadmin Ops Skill
Use when the user asks to investigate crashes, runaway memory, or machine instability, or to design autonomous reliability guardrails for Pi workflows.
Primary goals
- Stabilize the host quickly.
- Preserve evidence for root-cause analysis.
- Restore interrupted workflows with minimal context loss.
- Prevent recurrence with layered guardrails.
Incident triage workflow
- Confirm event window
  - capture local timestamp range
  - list active/affected workspaces
- Collect host forensics (macOS)
  - inspect `/Library/Logs/DiagnosticReports/JetsamEvent-*.ips`
  - summarize top process families by aggregate RSS and process count
  - extract evidence of process storms (count, age distribution, coalition hints)
- Collect Pi execution forensics
  - parse `~/.pi/agent/sessions/**` for bash commands around the event window
  - identify high-risk commands (tests/builds/nested CLI/orchestration)
  - detect unfinished commands and crash-correlated sessions
- Containment recommendations
  - immediate limits (session count, concurrency, timeouts)
  - command-level guardrails for sharp edges
  - optional emergency stop procedure
- Recovery plan
  - regenerate per-workspace handoff state
  - enumerate next-resume checklist per workspace
- Preventive hardening
  - extension guardrails
  - slice policy changes
  - watchdog automation and alerts
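The host-forensics step above (aggregate RSS and process count per process family) could be sketched roughly as follows. This is a minimal sketch, not a definitive parser. It assumes each `.ips` report is a one-line JSON header followed by a JSON body, and that the body carries a `processes` list with `name` and `rpages` fields plus `memoryStatus.pageSize`. Verify those field names against an actual report before relying on them. The helper names and the `min_count` storm threshold are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def load_jetsam_report(path):
    """Parse one .ips file (ASSUMED layout: JSON header line, then JSON body)."""
    text = Path(path).read_text(errors="replace")
    header_line, _, body = text.partition("\n")
    return json.loads(header_line), json.loads(body)


def summarize_families(processes, page_size):
    """Aggregate resident bytes and process count per process name.

    ASSUMPTION: each entry has `name` and `rpages` (resident pages).
    Returns (name, stats) pairs sorted by aggregate RSS, largest first.
    """
    totals = defaultdict(lambda: {"count": 0, "rss_bytes": 0})
    for proc in processes:
        family = proc.get("name", "<unknown>")
        totals[family]["count"] += 1
        totals[family]["rss_bytes"] += proc.get("rpages", 0) * page_size
    return sorted(totals.items(), key=lambda kv: kv[1]["rss_bytes"], reverse=True)


def storm_candidates(summary, min_count=50):
    """Names whose process count reaches a (hypothetical) storm threshold."""
    return [name for name, stats in summary if stats["count"] >= min_count]


def summarize_reports(report_dir="/Library/Logs/DiagnosticReports"):
    """Yield (file name, summary) for every Jetsam report in report_dir."""
    for path in sorted(Path(report_dir).glob("JetsamEvent-*.ips")):
        _, body = load_jetsam_report(path)
        page_size = body.get("memoryStatus", {}).get("pageSize")
        if page_size is None:
            continue  # layout differs from the assumption; inspect manually
        yield path.name, summarize_families(body.get("processes", []), page_size)
```

Grouping here is by exact process name. Treating helper processes as one family (for example by stripping a common suffix) is a judgment call that depends on what the reports actually contain.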
Key sharp edges to check
- nested non-interactive `pi` invocations inside agents
- unbounded test/build commands without explicit timeout
- test runners without worker caps
- team/pipeline recursion loops
- many simultaneous workspaces with heavy runners
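Several of the sharp edges above can be checked mechanically once bash commands have been extracted from the session logs. The sketch below is one possible approach under stated assumptions: the regular expressions are illustrative and not an exhaustive list of runners or flags, and the `(timestamp, command)` record shape is hypothetical because the session file schema is not specified here. All function names and risk labels are made up for illustration.

```python
import re
import shlex
from datetime import datetime

# Illustrative patterns only; adapt them to the runners used in your workspaces.
TEST_OR_BUILD = re.compile(
    r"\b(pytest|jest|vitest|make|cargo\s+(?:test|build)|npm\s+(?:test|run\s+build))\b"
)
HAS_TIMEOUT = re.compile(r"(?:^|\s)g?timeout\s+\d|--timeout\b")
WORKER_CAP = re.compile(r"--maxWorkers\b|--jobs\b|(?:^|\s)-[jn]\s*\d")


def classify_command(command):
    """Return a sorted list of risk labels for one bash command string."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes: fall back to naive splitting
        tokens = command.split()
    risks = set()
    if "pi" in tokens:
        risks.add("nested-pi")
    if TEST_OR_BUILD.search(command):
        if not HAS_TIMEOUT.search(command):
            risks.add("no-timeout")
        if not WORKER_CAP.search(command):
            risks.add("no-worker-cap")
    return sorted(risks)


def risky_in_window(records, start, end):
    """Yield (timestamp, command, risks) for flagged commands inside [start, end]."""
    for ts, command in records:
        if start <= ts <= end:
            risks = classify_command(command)
            if risks:
                yield ts, command, risks


# Made-up sample records, purely for illustration.
records = [
    (datetime(2024, 1, 1, 10, 0), "npx vitest run"),
    (datetime(2024, 1, 1, 10, 5), "timeout 600 pytest -n 4"),
    (datetime(2024, 1, 1, 12, 0), "make"),
]
flagged = list(
    risky_in_window(records, datetime(2024, 1, 1, 9, 30), datetime(2024, 1, 1, 10, 30))
)
```

With these sample records only the first command is flagged: the second carries both a timeout and a worker cap, and the third falls outside the window. Pattern matching of this kind yields leads for manual review, not proof of cause. Recursion loops and workspace concurrency need session-level correlation that this sketch does not attempt.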
Output contract
## Incident Summary
## Forensic Evidence
- host
- session timeline
## Likely Root Causes (ranked)
## Immediate Containment
## Recovery Plan
## Hardening Plan
- now
- next
- later
