Performs independent self-review of completed work to catch issues before they become problems. The agent invokes this skill when completing tasks, reviewing changes, or validating quality.
benchmark-runner
yasarshaikh, 1/27/2026
Guides evaluation workflow execution for SF-Bench. The agent invokes this skill when running evaluations, discussing scoring methodology, or working with the evaluation pipeline.