Agent Evaluation Methods
Evaluate agents by outcomes, not execution paths. Agents are non-deterministic and may take different valid routes to the same goal.
Performance Drivers (95% of variance)
| Factor | Variance Explained | Implication |
|---|---|---|
| Token usage | 80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Model upgrades beat token increases |
Multi-Dimensional Rubric
| Dimension | Weight | What to Measure |
|---|---|---|
| Instruction Following | 0.30 | Did it follow all instructions? |
| Output Completeness | 0.25 | All requested aspects covered? |
| Tool Efficiency | 0.20 | Right tools, minimal calls? |
| Reasoning Quality | 0.15 | Clear, logical reasoning? |
| Response Coherence | 0.10 | Well-structured, easy to follow? |
Scoring: rate each dimension 1-5, then sum score × weight to get a weighted total on the same 1-5 scale. Pass threshold: 3.5 (general tasks), 4.25 (critical tasks).
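The weighted total can be sketched in a few lines of Python. The weights and thresholds come from the rubric above; the dictionary keys, function names, and the small floating-point tolerance are illustrative assumptions, not part of any existing tool.

```python
# Weights copied from the rubric table; the key names are illustrative.
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}
# Pass thresholds from the scoring rule above.
THRESHOLDS = {"general": 3.5, "critical": 4.25}


def weighted_score(scores: dict) -> float:
    """Return the weighted total of 1-5 scores, one per rubric dimension."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside 1-5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def passes(scores: dict, level: str = "general") -> bool:
    """Compare the weighted total with the threshold for the given level."""
    # Assumed tolerance so float rounding does not fail a borderline total.
    return weighted_score(scores) >= THRESHOLDS[level] - 1e-9


example = {
    "instruction_following": 4,
    "output_completeness": 4,
    "tool_efficiency": 3,
    "reasoning_quality": 4,
    "response_coherence": 3,
}
print(round(weighted_score(example), 2))  # 3.7
print(passes(example), passes(example, "critical"))  # True False
```

Because the weights sum to 1.0, the total stays on the 1-5 scale and can be compared directly against either threshold.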
LLM-as-Judge
Two Approaches
Direct Scoring — Single LLM rates one response on defined scale.
- Best for: Objective criteria (accuracy, instruction following)
- Always require justification BEFORE score (+15-25% reliability)
Pairwise Comparison — LLM compares two responses, picks better one.
- Best for: Subjective preferences (tone, style)
- MUST swap positions and check consistency (mitigates position bias)
Decision Tree
- Objective ground truth exists? → Direct Scoring (accuracy, instruction following, format compliance)
- Preference/quality judgment? → Pairwise Comparison (tone, style, creativity)
- Compare to reference? → Reference-based evaluation (summarization, translation)
Known Biases & Mitigations
| Bias | Problem | Primary Fix | Secondary Fix |
|---|---|---|---|
| Position | First response preferred | Swap positions, consistency check | Multiple shuffles |
| Length | Longer = higher score | "Do NOT prefer longer responses" | Length-normalized scoring |
| Self-Enhancement | Models prefer own outputs | Cross-model evaluation | Anonymize responses |
| Verbosity | Detail rewarded even if irrelevant | Relevance weighting | Rubric penalizes padding |
| Authority | Confident tone rated higher | Require evidence for claims | Fact-checking layer |
Evaluation Prompt Template
You are evaluating the output of a Claude Code agent.
## Original Task
{task_description}
## Agent Output
{agent_output}
## Evaluation Criteria
{criteria with rubric levels}
## Instructions
For each criterion:
1. Find specific evidence in the output
2. Write justification citing evidence
3. THEN assign score (1-5)
4. Suggest one improvement
IMPORTANT: Justification BEFORE score. Do NOT prefer longer responses.
## Output Format
### [Criterion Name]
**Evidence**: [specific quotes/observations]
**Justification**: [maps evidence to rubric level]
**Score**: [1-5]
**Improvement**: [one actionable suggestion]
### Overall
**Weighted Score**: [sum of score × weight]
**Pass/Fail**: [Pass if ≥ 3.5]
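A judge reply in this format is easy to parse mechanically. The following is a minimal sketch that assumes the judge follows the output format exactly; `parse_scores` is a hypothetical helper name, and real judge output may need more forgiving patterns.

```python
import re


def parse_scores(judge_output: str) -> dict:
    """Extract {criterion name: score} from a reply in the template's format."""
    scores = {}
    # Splitting on a pattern with one capture group yields
    # [preamble, name1, body1, name2, body2, ...].
    parts = re.split(r"^### (.+?)[ \t]*$", judge_output, flags=re.MULTILINE)
    for name, body in zip(parts[1::2], parts[2::2]):
        # Only a line that starts with **Score** counts, so the
        # "**Weighted Score**" line in the Overall section is skipped.
        match = re.search(r"^\*\*Score\*\*:\s*([1-5])\s*$", body, flags=re.MULTILINE)
        if match:
            scores[name] = int(match.group(1))
    return scores


reply = """### Instruction Following
**Evidence**: Followed the requested file layout.
**Justification**: All instructions met except naming.
**Score**: 4
**Improvement**: Use the requested naming scheme.
### Overall
**Weighted Score**: 3.7
**Pass/Fail**: Pass
"""
print(parse_scores(reply))  # {'Instruction Following': 4}
```

Recomputing the weighted score from the parsed per-criterion values, rather than trusting the judge's own arithmetic in the Overall section, is a cheap consistency check.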
Pairwise Comparison Protocol
- Pass 1: A first, B second → record winner + confidence
- Pass 2: B first, A second → record winner + confidence
- Result: Both passes agree → confirmed winner (average the two confidences). Passes disagree → TIE (confidence 0.5, position bias flagged)
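The two-pass protocol can be sketched as follows. The `judge` callable is a hypothetical stand-in for an LLM call: it is assumed to take the task and two responses in presentation order and return `"first"` or `"second"` plus a confidence between 0 and 1.

```python
from typing import Callable, Tuple

# Assumed judge signature: (task, first_response, second_response)
# -> ("first" or "second", confidence in [0, 1]).
Judge = Callable[[str, str, str], Tuple[str, float]]


def compare_with_swap(judge: Judge, task: str, a: str, b: str) -> dict:
    """Run both orderings and confirm a winner only if the passes agree."""
    pick1, conf1 = judge(task, a, b)  # Pass 1: A first, B second
    pick2, conf2 = judge(task, b, a)  # Pass 2: B first, A second

    # Map positional picks back to the underlying responses.
    winner1 = "A" if pick1 == "first" else "B"
    winner2 = "B" if pick2 == "first" else "A"

    if winner1 == winner2:
        return {
            "winner": winner1,
            "confidence": (conf1 + conf2) / 2,
            "position_bias": False,
        }
    # Disagreement means the verdict followed the position, not the content.
    return {"winner": "TIE", "confidence": 0.5, "position_bias": True}


def always_first(task: str, first: str, second: str) -> Tuple[str, float]:
    """Toy judge with pure position bias, for demonstration only."""
    return "first", 0.9


print(compare_with_swap(always_first, "task", "resp A", "resp B")["winner"])  # TIE
```

A judge that always favors the first slot picks A in pass 1 and B in pass 2, so the swap exposes the bias as a tie instead of a false win.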
Test Set Design
Stratify by complexity:
- Simple: Single operation, one tool call
- Medium: Multiple operations, several tool calls
- Complex: Cross-file changes, significant ambiguity
- Edge Case: Known tricky scenarios
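Start small (5-10 cases). Early in development, prompt changes tend to have large effects, so even a small set shows clear differences. Grow the set as improvements become more subtle.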
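One way to keep the strata visible in results is to tag every case with its tier and report pass rates per tier. This is a sketch only; `EvalCase`, the tier identifiers, and `pass_rate_by_tier` are illustrative names, not an existing API.

```python
from collections import defaultdict
from dataclasses import dataclass

# Tier names mirror the four strata listed above.
TIERS = ("simple", "medium", "complex", "edge_case")


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    tier: str
    task: str


def pass_rate_by_tier(cases: list, passed: dict) -> dict:
    """Return {tier: fraction passed} for every tier with at least one case."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for case in cases:
        if case.tier not in TIERS:
            raise ValueError(f"unknown tier: {case.tier}")
        totals[case.tier] += 1
        wins[case.tier] += 1 if passed[case.case_id] else 0
    return {tier: wins[tier] / totals[tier] for tier in TIERS if totals[tier]}


cases = [
    EvalCase("c1", "simple", "Rename a variable"),
    EvalCase("c2", "simple", "Add a docstring"),
    EvalCase("c3", "complex", "Refactor across three files"),
]
print(pass_rate_by_tier(cases, {"c1": True, "c2": False, "c3": True}))
# {'simple': 0.5, 'complex': 1.0}
```

Reporting per tier keeps a strong result on simple cases from hiding failures on complex or edge cases.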
Iterative Improvement Workflow
1. Identify weakness: evaluation finds where the agent struggles
2. Hypothesize cause: prompt? context? examples?
3. Modify prompt: targeted change based on the hypothesis
4. Re-evaluate: same test cases, modified prompt
5. Compare: did the target dimension improve?
6. Check regression: did other dimensions suffer?
7. Iterate: repeat until quality meets the threshold
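The compare and regression-check steps amount to a per-dimension diff between two evaluation runs over the same test cases. A minimal sketch follows; the function names and the 0.1 regression tolerance are assumptions to tune per project.

```python
from statistics import mean


def dimension_means(runs: list) -> dict:
    """Average each rubric dimension over a list of per-case score dicts."""
    return {dim: mean(run[dim] for run in runs) for dim in runs[0]}


def compare_runs(baseline: list, candidate: list, target: str,
                 tolerance: float = 0.1) -> dict:
    """Report whether the target dimension improved and which others regressed."""
    base = dimension_means(baseline)
    cand = dimension_means(candidate)
    deltas = {dim: cand[dim] - base[dim] for dim in base}
    # A non-target dimension regresses if it drops by more than the tolerance.
    regressions = sorted(
        dim for dim, delta in deltas.items()
        if dim != target and delta < -tolerance
    )
    return {
        "target_improved": deltas[target] > 0,
        "regressions": regressions,
        "deltas": deltas,
    }


baseline = [{"tool_efficiency": 3, "reasoning_quality": 4}] * 2
candidate = [
    {"tool_efficiency": 4, "reasoning_quality": 3},
    {"tool_efficiency": 4, "reasoning_quality": 4},
]
result = compare_runs(baseline, candidate, target="tool_efficiency")
print(result["target_improved"], result["regressions"])
# True ['reasoning_quality']
```

In this toy run the targeted dimension improves by a full point while reasoning quality drops by half a point, which is the case the regression check is meant to catch before the change is kept.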
Hierarchical Evaluation (Cost-Efficient)
Tier 1: Quick screen (cheap model, 0-10 score)
→ < 5: Fail | ≥ 7: Pass | 5 to < 7: Escalate
Tier 2: Detailed evaluation (expensive model, full rubric)
→ Score + confidence
Tier 3: Human review (cases where Tier 2 confidence < 0.6)
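The routing rules reduce to two small functions. This sketch treats a quick score of exactly 7 as a pass, following the "≥ 7" rule; the function names and return labels are illustrative assumptions.

```python
def route_tier1(quick_score: float) -> str:
    """Route a 0-10 quick-screen score: fail, pass, or escalate to Tier 2."""
    if not 0 <= quick_score <= 10:
        raise ValueError("quick_score must be between 0 and 10")
    if quick_score < 5:
        return "fail"
    if quick_score >= 7:
        return "pass"
    return "escalate"  # scores from 5 up to, but not including, 7


def route_tier2(confidence: float, min_confidence: float = 0.6) -> str:
    """Send low-confidence Tier 2 verdicts to human review."""
    return "human_review" if confidence < min_confidence else "accept"


for score in (3, 5, 6.9, 7):
    print(score, route_tier1(score))
# 3 fail
# 5 escalate
# 6.9 escalate
# 7 pass
```

Only the escalated middle band reaches the expensive model, which is where the cost saving comes from.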
Quality Indicators
| Metric | Good | Acceptable | Concerning |
|---|---|---|---|
| Spearman's ρ (vs human) | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ (agreement) | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length-score correlation | < 0.2 | 0.2-0.4 | > 0.4 |
Anti-Patterns
- ❌ Scoring without justification → always require evidence first
- ❌ Single-pass pairwise → always swap positions
- ❌ Overloaded criteria → one criterion = one measurable aspect
- ❌ Missing edge case guidance → include explicit rubric for ambiguous cases
- ❌ Ignoring low confidence → escalate to human review
- ❌ Generic rubrics → create domain-specific rubrics
