Agent Evaluation Methods

Evaluate agents by outcomes, not execution paths. Agents are non-deterministic and may take different valid routes to the same goal.

Performance Drivers (95% of variance)

Factor	Variance Explained	Implication
Token usage	80%	More tokens = better performance
Tool calls	~10%	More exploration helps
Model choice	~5%	Model upgrades beat token increases

Multi-Dimensional Rubric

Dimension	Weight	What to Measure
Instruction Following	0.30	Did it follow all instructions?
Output Completeness	0.25	All requested aspects covered?
Tool Efficiency	0.20	Right tools, minimal calls?
Reasoning Quality	0.15	Clear, logical reasoning?
Response Coherence	0.10	Well-structured, easy to follow?

Scoring: 1-5 per dimension. Weighted total. Pass threshold: 3.5 (general), 4.25 (critical).
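The weighted total can be computed mechanically. The following is a minimal sketch, assuming per-dimension scores arrive as a dictionary; the key names, function names, and the `critical` flag are illustrative and not part of the rubric itself.

```python
# Weights copied from the rubric table above; they sum to 1.0.
WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}

GENERAL_THRESHOLD = 3.5
CRITICAL_THRESHOLD = 4.25


def weighted_score(scores: dict[str, float]) -> float:
    """Return the weighted total of 1-5 scores, one per rubric dimension."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the rubric dimensions")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside 1-5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def passes(scores: dict[str, float], critical: bool = False) -> bool:
    """Apply the general or critical pass threshold to the weighted total."""
    threshold = CRITICAL_THRESHOLD if critical else GENERAL_THRESHOLD
    return weighted_score(scores) >= threshold
```

For example, scores of 4, 4, 3, 4, 5 (in table order) give a weighted total of about 3.9, which passes the general threshold but not the critical one.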

LLM-as-Judge

Two Approaches

Direct Scoring: a single LLM rates one response on a defined scale.

Best for: Objective criteria (accuracy, instruction following)
Always require the justification BEFORE the score (+15-25% reliability)

Pairwise Comparison: an LLM compares two responses and picks the better one.

Best for: Subjective preferences (tone, style)
MUST swap positions and check consistency (mitigates position bias)

Decision Tree

Objective ground truth exists? → Direct Scoring
  (accuracy, instruction following, format compliance)

Preference/quality judgment? → Pairwise Comparison
  (tone, style, creativity)

Compare to reference? → Reference-based evaluation
  (summarization, translation)

Known Biases & Mitigations

Bias	Problem	Primary Fix	Secondary Fix
Position	First response preferred	Swap positions, consistency check	Multiple shuffles
Length	Longer = higher score	"Do NOT prefer longer responses"	Length-normalized scoring
Self-Enhancement	Models prefer own outputs	Cross-model evaluation	Anonymize responses
Verbosity	Detail rewarded even if irrelevant	Relevance weighting	Rubric penalizes padding
Authority	Confident tone rated higher	Require evidence for claims	Fact-checking layer

Evaluation Prompt Template

You are evaluating the output of a Claude Code agent.

## Original Task
{task_description}

## Agent Output
{agent_output}

## Evaluation Criteria
{criteria_with_rubric_levels}

## Instructions
For each criterion:
1. Find specific evidence in the output
2. Write justification citing evidence
3. THEN assign score (1-5)
4. Suggest one improvement

IMPORTANT: Justification BEFORE score. Do NOT prefer longer responses.

## Output Format
### [Criterion Name]
**Evidence**: [specific quotes/observations]
**Justification**: [maps evidence to rubric level]
**Score**: [1-5]
**Improvement**: [one actionable suggestion]

### Overall
**Weighted Score**: [sum of score × weight]
**Pass/Fail**: [Pass if ≥ 3.5; ≥ 4.25 for critical tasks]
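If the judge follows the output format above, its per-criterion scores can be extracted with a small parser. This is a sketch under that assumption: the function name and the regular expressions are illustrative, and a real judge may deviate from the format, so failures are raised rather than guessed around. The parser also checks that the justification appears before the score.

```python
import re

# Matches criterion headings such as "### Instruction Following".
SECTION_RE = re.compile(r"^### (?P<name>.+?)[ \t]*$", re.MULTILINE)
# Matches a line such as "**Score**: 4".
SCORE_RE = re.compile(r"^\*\*Score\*\*:[ \t]*(?P<score>[1-5])[ \t]*$", re.MULTILINE)


def parse_scores(judge_output: str) -> dict[str, int]:
    """Map each criterion heading to its 1-5 score, skipping the Overall section."""
    scores = {}
    headings = list(SECTION_RE.finditer(judge_output))
    for i, heading in enumerate(headings):
        name = heading.group("name")
        if name.lower() == "overall":
            continue
        # A section's body runs until the next heading or the end of the text.
        end = headings[i + 1].start() if i + 1 < len(headings) else len(judge_output)
        body = judge_output[heading.end():end]
        match = SCORE_RE.search(body)
        if match is None:
            raise ValueError(f"no score found for criterion {name!r}")
        justification = body.find("**Justification**")
        if justification == -1 or justification > match.start():
            raise ValueError(f"justification must precede the score for {name!r}")
        scores[name] = int(match.group("score"))
    return scores
```

The resulting dictionary can then be combined with the rubric weights to recompute the weighted score independently of the judge's own arithmetic.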

Pairwise Comparison Protocol

Pass 1: A first, B second → record winner + confidence
Pass 2: B first, A second → record winner + confidence
Result: Both agree → confirmed winner (avg confidence). Disagree → TIE (confidence 0.5, bias detected)
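The two-pass protocol above can be sketched as follows. The `Judge` callable is a hypothetical interface (not from any library): it receives the two responses in presentation order and returns which position won plus a confidence in [0, 1].

```python
from typing import Callable, Tuple

# Hypothetical judge interface: (first_response, second_response) ->
# ("first" or "second", confidence between 0 and 1).
Judge = Callable[[str, str], Tuple[str, float]]


def pairwise_compare(judge: Judge, a: str, b: str) -> dict:
    """Run both orderings and only confirm a winner when they agree."""
    pos1, conf1 = judge(a, b)  # Pass 1: A first, B second
    pos2, conf2 = judge(b, a)  # Pass 2: B first, A second
    for pos in (pos1, pos2):
        if pos not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {pos!r}")

    # Translate positional verdicts back to response labels.
    winner1 = "A" if pos1 == "first" else "B"
    winner2 = "B" if pos2 == "first" else "A"

    if winner1 == winner2:
        return {
            "winner": winner1,
            "confidence": (conf1 + conf2) / 2,
            "position_bias": False,
        }
    # Disagreement means the verdict depended on ordering.
    return {"winner": "TIE", "confidence": 0.5, "position_bias": True}
```

A judge that always picks the first position yields a TIE with `position_bias` set, which is exactly the failure a single-pass comparison would hide.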

Test Set Design

Stratify by complexity:

Simple: Single operation, one tool call
Medium: Multiple operations, several tool calls
Complex: Cross-file changes, significant ambiguity
Edge Case: Known tricky scenarios

Start small (5-10 cases). Early changes tend to have a dramatic impact, so a small set is enough to reveal them.
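One way to keep a small test set stratified is to tag each case with its tier and check coverage before running an evaluation. The sketch below assumes a hypothetical `EvalCase` record; the field names and tier identifiers are illustrative.

```python
from collections import Counter
from dataclasses import dataclass

# Tier identifiers mirroring the four strata listed above.
TIERS = ("simple", "medium", "complex", "edge_case")


@dataclass(frozen=True)
class EvalCase:
    name: str
    tier: str
    task: str


def coverage(cases: list[EvalCase]) -> dict[str, int]:
    """Count cases per tier, rejecting tiers outside the known set."""
    counts = Counter(case.tier for case in cases)
    unknown = set(counts) - set(TIERS)
    if unknown:
        raise ValueError(f"unknown tiers: {sorted(unknown)}")
    return {tier: counts.get(tier, 0) for tier in TIERS}


def missing_tiers(cases: list[EvalCase]) -> list[str]:
    """Return the tiers that have no test case yet."""
    return [tier for tier, count in coverage(cases).items() if count == 0]
```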

Iterative Improvement Workflow

1. Identify weakness: evaluation finds where the agent struggles
2. Hypothesize cause: prompt? context? examples?
3. Modify prompt: targeted change based on the hypothesis
4. Re-evaluate: same test cases, modified prompt
5. Compare: did the target dimension improve?
6. Check regression: did other dimensions suffer?
7. Iterate: repeat until quality meets the threshold
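The compare and regression-check steps can be automated once each run is reduced to a mean score per dimension. This is a minimal sketch; the `tolerance` parameter (how much of a drop counts as a regression) is an assumption introduced here, not a value from the workflow.

```python
def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    target: str,
    tolerance: float = 0.1,  # assumed noise margin, tune for your test set
) -> dict:
    """Compare mean per-dimension scores from two runs on the same test cases."""
    if set(baseline) != set(candidate):
        raise ValueError("both runs must report the same dimensions")
    deltas = {dim: candidate[dim] - baseline[dim] for dim in baseline}
    # A regression is any non-target dimension that dropped beyond the margin.
    regressions = sorted(
        dim for dim, delta in deltas.items()
        if dim != target and delta < -tolerance
    )
    return {
        "target_improved": deltas[target] > 0,
        "regressions": regressions,
        "deltas": deltas,
    }
```

A change is worth keeping only when `target_improved` is true and `regressions` is empty; otherwise the hypothesis or the prompt change needs another pass.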

Hierarchical Evaluation (Cost-Efficient)

Tier 1: Quick screen (cheap model, 0-10 score)
  → < 5: Fail  |  ≥ 7: Pass  |  5 to < 7: Escalate

Tier 2: Detailed evaluation (expensive model, full rubric)
  → Score + confidence

Tier 3: Human review (cases where Tier 2 confidence < 0.6)
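The tier routing can be sketched as a small function. Two assumptions are made here and labeled in the code: a quick score of exactly 7 counts as a pass, and Tier 2 applies the rubric's general pass threshold of 3.5. The `detailed_eval` callable is a hypothetical stand-in for the expensive judge.

```python
from typing import Callable, Tuple


def tier1_decision(quick_score: float) -> str:
    """Classify a 0-10 quick-screen score as fail, pass, or escalate."""
    if not 0 <= quick_score <= 10:
        raise ValueError("quick_score must be between 0 and 10")
    if quick_score < 5:
        return "fail"
    if quick_score >= 7:  # assumption: exactly 7 passes
        return "pass"
    return "escalate"


def evaluate(
    quick_score: float,
    detailed_eval: Callable[[], Tuple[float, float]],  # returns (score, confidence)
    pass_threshold: float = 3.5,  # assumption: rubric's general threshold
    min_confidence: float = 0.6,
) -> dict:
    """Route a case through the tiers, calling the expensive judge only on escalation."""
    decision = tier1_decision(quick_score)
    if decision != "escalate":
        return {"tier": 1, "result": decision}
    score, confidence = detailed_eval()
    if confidence < min_confidence:
        return {"tier": 3, "result": "human_review"}
    return {"tier": 2, "result": "pass" if score >= pass_threshold else "fail"}
```

Passing the detailed evaluation as a callable keeps the cost saving explicit: it is never invoked for cases that Tier 1 settles.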

Quality Indicators

Metric	Good	Acceptable	Concerning
Spearman's ρ (vs human)	> 0.8	0.6-0.8	< 0.6
Cohen's κ (agreement)	> 0.7	0.5-0.7	< 0.5
Position consistency	> 0.9	0.8-0.9	< 0.8
Length-score correlation	< 0.2	0.2-0.4	> 0.4

Anti-Patterns

❌ Scoring without justification → always require evidence first
❌ Single-pass pairwise → always swap positions
❌ Overloaded criteria → one criterion = one measurable aspect
❌ Missing edge case guidance → include explicit rubric for ambiguous cases
❌ Ignoring low confidence → escalate to human review
❌ Generic rubrics → create domain-specific rubrics
