askill
agent-evaluation


Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.

Updated 2/3/2026

SKILL.md

Agent Evaluation Methods

Evaluate agents by outcomes, not execution paths. Agents are non-deterministic and may take different valid routes to the same goal.

Performance Drivers (95% of variance)

| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Model upgrades beat token increases |

Multi-Dimensional Rubric

| Dimension | Weight | What to Measure |
|---|---|---|
| Instruction Following | 0.30 | Did it follow all instructions? |
| Output Completeness | 0.25 | All requested aspects covered? |
| Tool Efficiency | 0.20 | Right tools, minimal calls? |
| Reasoning Quality | 0.15 | Clear, logical reasoning? |
| Response Coherence | 0.10 | Well-structured, easy to follow? |

Scoring: rate each dimension 1-5, then sum score × weight for the weighted total. Pass threshold: 3.5 (general), 4.25 (critical).
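A minimal sketch of the weighted total and pass check, using the weights and thresholds from this section. The snake_case dimension keys and the function names are illustrative assumptions, not part of the skill.

```python
# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}

# Pass thresholds from the text: 3.5 for general tasks, 4.25 for critical ones.
THRESHOLDS = {"general": 3.5, "critical": 4.25}


def weighted_score(scores: dict) -> float:
    """Return the weighted total of per-dimension scores (each 1-5)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside 1-5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def passes(scores: dict, level: str = "general") -> bool:
    """Compare the weighted total against the threshold for the given level."""
    return weighted_score(scores) >= THRESHOLDS[level]
```

With scores of 4, 4, 3, 4, 5 (in table order) the total is about 3.9, which clears the general threshold but not the critical one.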

LLM-as-Judge

Two Approaches

Direct Scoring — Single LLM rates one response on defined scale.

  • Best for: Objective criteria (accuracy, instruction following)
  • Always require justification BEFORE score (+15-25% reliability)

Pairwise Comparison — LLM compares two responses, picks better one.

  • Best for: Subjective preferences (tone, style)
  • MUST swap positions and check consistency (mitigates position bias)

Decision Tree

Objective ground truth exists? → Direct Scoring
  (accuracy, instruction following, format compliance)

Preference/quality judgment? → Pairwise Comparison
  (tone, style, creativity)

Compare to reference? → Reference-based evaluation
  (summarization, translation)
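The decision tree can be read as an ordered set of checks. The sketch below assumes the questions are asked in the order listed; the function and argument names are hypothetical.

```python
def choose_method(has_ground_truth: bool, is_preference: bool, has_reference: bool) -> str:
    """Pick an evaluation method by walking the decision tree top to bottom."""
    if has_ground_truth:
        # Accuracy, instruction following, format compliance.
        return "direct_scoring"
    if is_preference:
        # Tone, style, creativity.
        return "pairwise_comparison"
    if has_reference:
        # Summarization, translation.
        return "reference_based"
    raise ValueError("no branch of the decision tree applies to this task")
```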

Known Biases & Mitigations

| Bias | Problem | Primary Fix | Secondary Fix |
|---|---|---|---|
| Position | First response preferred | Swap positions, consistency check | Multiple shuffles |
| Length | Longer = higher score | "Do NOT prefer longer responses" | Length-normalized scoring |
| Self-Enhancement | Models prefer own outputs | Cross-model evaluation | Anonymize responses |
| Verbosity | Detail rewarded even if irrelevant | Relevance weighting | Rubric penalizes padding |
| Authority | Confident tone rated higher | Require evidence for claims | Fact-checking layer |

Evaluation Prompt Template

You are evaluating the output of a Claude Code agent.

## Original Task
{task_description}

## Agent Output
{agent_output}

## Evaluation Criteria
{criteria with rubric levels}

## Instructions
For each criterion:
1. Find specific evidence in the output
2. Write justification citing evidence
3. THEN assign score (1-5)
4. Suggest one improvement

IMPORTANT: Justification BEFORE score. Do NOT prefer longer responses.

## Output Format
### [Criterion Name]
**Evidence**: [specific quotes/observations]
**Justification**: [maps evidence to rubric level]
**Score**: [1-5]
**Improvement**: [one actionable suggestion]

### Overall
**Weighted Score**: [sum of score × weight]
**Pass/Fail**: [Pass if ≥ 3.5; use 4.25 for critical tasks]
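
If the judge follows the output format above, its per-criterion scores can be pulled out with a small parser. This is a sketch under that assumption: real judge output may drift from the format, and the `parse_scores` name is hypothetical.

```python
import re

# A criterion section starts with a level-3 heading, e.g. "### Tool Efficiency".
SECTION_RE = re.compile(r"^### (?P<name>.+?)[ \t]*$", re.MULTILINE)
# A score line looks like "**Score**: 4" (optional square brackets tolerated).
SCORE_RE = re.compile(r"^\*\*Score\*\*:[ \t]*\[?(?P<score>[1-5])\]?[ \t]*$", re.MULTILINE)


def parse_scores(judge_output: str) -> dict:
    """Map each criterion heading to its integer score, skipping 'Overall'."""
    scores = {}
    sections = list(SECTION_RE.finditer(judge_output))
    for i, match in enumerate(sections):
        name = match.group("name").strip()
        if name == "Overall":
            continue
        # The section body runs until the next heading or the end of the text.
        end = sections[i + 1].start() if i + 1 < len(sections) else len(judge_output)
        found = SCORE_RE.search(judge_output[match.end():end])
        if found is None:
            raise ValueError(f"no score found for criterion {name!r}")
        scores[name] = int(found.group("score"))
    return scores
```

Recomputing the weighted total from the parsed scores, rather than trusting the judge's own arithmetic in the Overall section, is one reasonable design choice.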

Pairwise Comparison Protocol

  1. Pass 1: A first, B second → record winner + confidence
  2. Pass 2: B first, A second → record winner + confidence
  3. Result: Both agree → confirmed winner (avg confidence). Disagree → TIE (confidence 0.5, bias detected)
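
The protocol above can be sketched as follows. The `Judge` callable is a hypothetical stand-in for an LLM call: it receives the two responses in display order and returns which position won plus a confidence in [0, 1].

```python
from typing import Callable, Tuple

# Hypothetical judge signature: (first_shown, second_shown) -> ("first" | "second", confidence).
Judge = Callable[[str, str], Tuple[str, float]]


def pairwise_compare(judge: Judge, a: str, b: str) -> dict:
    """Run both orderings and reconcile the two verdicts."""
    # Pass 1: A shown first, B second.
    pos1, conf1 = judge(a, b)
    winner1 = "A" if pos1 == "first" else "B"

    # Pass 2: B shown first, A second; map the winning position back to a label.
    pos2, conf2 = judge(b, a)
    winner2 = "B" if pos2 == "first" else "A"

    if winner1 == winner2:
        # Both passes agree: confirmed winner with averaged confidence.
        return {"winner": winner1, "confidence": (conf1 + conf2) / 2, "bias_detected": False}
    # Passes disagree: the verdict followed position, so report a tie.
    return {"winner": "TIE", "confidence": 0.5, "bias_detected": True}
```

A judge that always picks whichever response is shown first ends up as a TIE with `bias_detected` set, which is the case the swap is meant to expose.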

Test Set Design

Stratify by complexity:

  • Simple: Single operation, one tool call
  • Medium: Multiple operations, several tool calls
  • Complex: Cross-file changes, significant ambiguity
  • Edge Case: Known tricky scenarios

Start small (5-10 cases). Early prompt changes tend to have large effects, so a small set is enough to detect them.
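
One way to keep a small test set stratified is to tag each case with its tier and check coverage. The `EvalCase` structure and tier identifiers below are illustrative assumptions.

```python
from collections import Counter
from dataclasses import dataclass

# The four strata listed above, as identifiers.
TIERS = ("simple", "medium", "complex", "edge_case")


@dataclass(frozen=True)
class EvalCase:
    name: str
    tier: str  # one of TIERS
    task: str  # the prompt given to the agent


def missing_tiers(cases: list) -> list:
    """Return the tiers that have no test case yet, in TIERS order."""
    for case in cases:
        if case.tier not in TIERS:
            raise ValueError(f"unknown tier {case.tier!r} in case {case.name!r}")
    counts = Counter(case.tier for case in cases)
    return [tier for tier in TIERS if counts[tier] == 0]
```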

Iterative Improvement Workflow

  1. Identify weakness — evaluation finds where agent struggles
  2. Hypothesize cause — prompt? context? examples?
  3. Modify prompt — targeted change based on hypothesis
  4. Re-evaluate — same test cases, modified prompt
  5. Compare — did target dimension improve?
  6. Check regression — did other dimensions suffer?
  7. Iterate — repeat until quality meets threshold
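
Steps 5 and 6 amount to comparing per-dimension averages before and after the prompt change on the same test cases. A minimal sketch; the `tolerance` value that separates a regression from noise is an assumption, not something the skill specifies.

```python
from statistics import mean


def compare_runs(baseline: list, candidate: list, target: str, tolerance: float = 0.1) -> dict:
    """Compare mean per-dimension scores of two runs over the same test cases.

    baseline and candidate are lists of {dimension: score} dicts, one per case.
    """
    dims = list(baseline[0].keys())
    deltas = {
        d: mean(run[d] for run in candidate) - mean(run[d] for run in baseline)
        for d in dims
    }
    # A regression is any non-target dimension that dropped by more than the tolerance.
    regressions = sorted(d for d in dims if d != target and deltas[d] < -tolerance)
    return {
        "target_improved": deltas[target] > 0,
        "deltas": deltas,
        "regressions": regressions,
    }
```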

Hierarchical Evaluation (Cost-Efficient)

Tier 1: Quick screen (cheap model, 0-10 score)
  → < 5: Fail  |  ≥ 7: Pass  |  5 to < 7: Escalate

Tier 2: Detailed evaluation (expensive model, full rubric)
  → Score + confidence

Tier 3: Human review (low-confidence cases < 0.6)
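
The tier routing can be sketched as below. Both judges are hypothetical callables standing in for model calls. Two assumptions are made here: a quick score of exactly 7 counts as a pass (so only 5 up to but not including 7 escalates), and the Tier 2 verdict reuses the general 3.5 rubric threshold.

```python
from typing import Callable, Tuple


def hierarchical_evaluate(
    output: str,
    quick_judge: Callable[[str], float],                   # cheap model, 0-10 score
    detailed_judge: Callable[[str], Tuple[float, float]],  # (rubric score 1-5, confidence 0-1)
) -> dict:
    """Route an agent output through the three tiers."""
    # Tier 1: quick screen settles the clear cases cheaply.
    quick = quick_judge(output)
    if quick < 5:
        return {"tier": 1, "verdict": "fail", "score": quick}
    if quick >= 7:
        return {"tier": 1, "verdict": "pass", "score": quick}

    # Tier 2: borderline cases get the full rubric from the expensive model.
    score, confidence = detailed_judge(output)
    if confidence < 0.6:
        # Tier 3: low-confidence results are handed to a human.
        return {"tier": 3, "verdict": "needs_human_review", "score": score}
    verdict = "pass" if score >= 3.5 else "fail"  # assumed: general threshold
    return {"tier": 2, "verdict": verdict, "score": score}
```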

Quality Indicators

| Metric | Good | Acceptable | Concerning |
|---|---|---|---|
| Spearman's ρ (vs human) | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ (agreement) | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length-score correlation | < 0.2 | 0.2-0.4 | > 0.4 |

Anti-Patterns

  • ❌ Scoring without justification → always require evidence first
  • ❌ Single-pass pairwise → always swap positions
  • ❌ Overloaded criteria → one criterion = one measurable aspect
  • ❌ Missing edge case guidance → include explicit rubric for ambiguous cases
  • ❌ Ignoring low confidence → escalate to human review
  • ❌ Generic rubrics → create domain-specific rubrics

Install

Requires askill CLI v1.0+

Metadata

License: unknown
Version: -
Updated: 2/3/2026
Publisher: Jacknelson6

Tags

github-actions, llm, observability, prompting, testing