askill
evaluation

This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions agent testing or quality gates for agent pipelines.

1 star
1.2k downloads
Updated 2/5/2026

SKILL.md

Evaluation Methods for Agent Systems

Agent evaluation requires different approaches than traditional software testing. Agents make dynamic decisions, behave non-deterministically, and often work on tasks that have no single correct answer.

When to Activate

Activate this skill when:

  • Testing agent performance systematically
  • Validating context engineering choices
  • Building quality gates for agent pipelines
  • Comparing different agent configurations

Core Concepts

The 95% Finding

Research found that three factors explain 95% of performance variance:

| Factor | Variance Explained |
| --- | --- |
| Token usage | 80% |
| Number of tool calls | ~10% |
| Model choice | ~5% |

Implication: Model upgrades often provide larger gains than doubling token budgets.

Evaluation Challenges

Non-Determinism: Agents may take completely different valid paths to reach goals. Evaluate outcomes, not specific steps.

Context-Dependent Failures: Failures may emerge only after extended interaction. Test with realistic context sizes.

Composite Quality: Agent quality spans multiple dimensions that require separate evaluation.
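One way to handle non-determinism is to run the same task several times and score only the outcome of each run. The sketch below assumes a hypothetical `run_agent(task, seed)` callable and a hypothetical `check_outcome(output)` predicate; neither comes from this skill, and `toy_agent` is a stand-in used only to make the example runnable.

```python
import random
from typing import Callable


def pass_rate(
    run_agent: Callable[[str, int], str],
    check_outcome: Callable[[str], bool],
    task: str,
    trials: int = 10,
) -> float:
    """Run one task several times and return the fraction of runs that pass.

    Only the final output is checked; the path the agent took is ignored.
    """
    passed = 0
    for seed in range(trials):
        output = run_agent(task, seed)
        if check_outcome(output):
            passed += 1
    return passed / trials


def toy_agent(task: str, seed: int) -> str:
    """Hypothetical stand-in agent: the path varies by seed, the answer does not."""
    rng = random.Random(seed)
    path = rng.choice(["search then summarize", "answer from memory"])
    return f"[{path}] The capital of France is Paris."


rate = pass_rate(
    toy_agent,
    lambda out: "Paris" in out,
    "What is the capital of France?",
    trials=5,
)
print(rate)  # 1.0, because every path ends in the same correct outcome
```

A pass rate over repeated trials is usually more informative than a single pass/fail result, since one run can succeed or fail by chance.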

Multi-Dimensional Rubric

| Dimension | Description |
| --- | --- |
| Factual accuracy | Claims match ground truth |
| Completeness | Output covers the requested aspects |
| Citation accuracy | Citations match the claimed sources |
| Source quality | Uses appropriate primary sources |
| Tool efficiency | Uses the right tools a reasonable number of times |
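The rubric can be represented as data so that each dimension is scored separately and only then combined. This is a minimal sketch: the weights, the 0.0 to 1.0 scale, and the names `Dimension`, `RUBRIC`, and `score_output` are illustrative assumptions, not values prescribed by this skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    weight: float  # assumed relative importance; tune per use case


# Illustrative weights only; the skill does not prescribe any.
RUBRIC = [
    Dimension("factual_accuracy", 0.30),
    Dimension("completeness", 0.25),
    Dimension("citation_accuracy", 0.15),
    Dimension("source_quality", 0.10),
    Dimension("tool_efficiency", 0.20),
]


def score_output(scores: dict, rubric: list = RUBRIC) -> dict:
    """Combine per-dimension scores (each in [0, 1]) into a weighted overall score."""
    missing = [d.name for d in rubric if d.name not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    for d in rubric:
        if not 0.0 <= scores[d.name] <= 1.0:
            raise ValueError(f"{d.name} must be between 0.0 and 1.0")
    total_weight = sum(d.weight for d in rubric)
    overall = sum(d.weight * scores[d.name] for d in rubric) / total_weight
    # Keep the per-dimension scores so a weak dimension is not hidden by the average.
    return {
        "per_dimension": {d.name: scores[d.name] for d in rubric},
        "overall": overall,
    }


report = score_output({
    "factual_accuracy": 1.0,
    "completeness": 0.5,
    "citation_accuracy": 1.0,
    "source_quality": 1.0,
    "tool_efficiency": 1.0,
})
print(report["overall"])
```

Returning the per-dimension scores alongside the overall number keeps the evaluation multi-dimensional; the single aggregate is a convenience, not a replacement.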

Evaluation Methodologies

LLM-as-Judge

An LLM judge scales to large test sets and applies the same criteria to every output. Design judge prompts that capture the dimensions of interest.
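A judge can be sketched as a prompt builder plus a parser for a structured reply. In the example below, `call_model` is a hypothetical callable that sends a prompt to whatever model is used and returns its text; the prompt wording, the JSON reply format, and the dimension names are assumptions for illustration. `fake_model` stands in for a real model so the example runs offline.

```python
import json
from typing import Callable

# Assumed subset of rubric dimensions for the judge to score.
DIMENSIONS = ["factual_accuracy", "completeness", "citation_accuracy"]


def build_judge_prompt(task: str, answer: str) -> str:
    """Build a grading prompt that asks for one score per dimension."""
    dims = ", ".join(DIMENSIONS)
    return (
        "You are grading an agent's answer.\n"
        f"Task: {task}\n"
        f"Answer: {answer}\n"
        f"Score each of these dimensions from 0.0 to 1.0: {dims}.\n"
        "Reply with a JSON object mapping each dimension name to its score."
    )


def judge(task: str, answer: str, call_model: Callable[[str], str]) -> dict:
    """Ask the judge model for scores and clamp each one to [0, 1]."""
    raw = call_model(build_judge_prompt(task, answer))
    parsed = json.loads(raw)  # raises ValueError if the reply is not valid JSON
    scores = {}
    for dim in DIMENSIONS:
        value = float(parsed[dim])  # raises KeyError if a dimension is missing
        scores[dim] = min(1.0, max(0.0, value))
    return scores


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return json.dumps(
        {"factual_accuracy": 0.9, "completeness": 0.7, "citation_accuracy": 1.2}
    )


scores = judge("Summarize the report", "The report says ...", fake_model)
print(scores)
```

Parsing a fixed schema and clamping out-of-range values makes judge failures visible as exceptions or bounded scores instead of silently skewing results.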

Human Evaluation

Catches what automation misses: hallucinated answers on unusual queries, system failures, subtle biases.

End-State Evaluation

For agents that mutate persistent state, focus on whether final state matches expectations.
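End-state evaluation can be as simple as diffing the state the agent left behind against the expected state. The sketch below assumes the state can be flattened into a dictionary with string keys; the `diff_state` helper and the ticket example are hypothetical.

```python
_MISSING = object()  # sentinel for keys absent from one side


def diff_state(expected: dict, actual: dict) -> dict:
    """Return {key: (expected_value, actual_value)} for every key that differs.

    Keys present on only one side are reported with "<missing>" on the other.
    """
    mismatches = {}
    for key in sorted(expected.keys() | actual.keys()):
        exp = expected.get(key, _MISSING)
        act = actual.get(key, _MISSING)
        if exp != act:
            mismatches[key] = (
                "<missing>" if exp is _MISSING else exp,
                "<missing>" if act is _MISSING else act,
            )
    return mismatches


expected_state = {"ticket_42.status": "closed", "ticket_42.assignee": "alice"}
actual_state = {
    "ticket_42.status": "closed",
    "ticket_42.assignee": "bob",
    "ticket_43.status": "open",  # unexpected side effect
}

mismatches = diff_state(expected_state, actual_state)
print(mismatches)
passed = not mismatches  # the run passes only if the final state matches exactly
```

Comparing the full state, not just the fields the task mentions, also catches unintended side effects such as the extra ticket above.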

Test Set Design

Complexity Stratification

| Level | Description |
| --- | --- |
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many tool calls, significant ambiguity |
| Very Complex | Extended interaction, deep reasoning |

Start with small samples during development: early changes tend to have dramatic impacts, so even a small test set is enough to reveal them.
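Reporting pass rates per complexity level shows where an agent starts to struggle, which a single aggregate number can hide. This is a minimal sketch; the level identifiers and the `(level, passed)` result format are assumptions.

```python
from collections import Counter
from typing import Optional

# Assumed identifiers for the four complexity levels.
LEVELS = ["simple", "medium", "complex", "very_complex"]


def pass_rate_by_level(results: list) -> dict:
    """Group (level, passed) results by level and return the pass rate for each.

    Levels with no test cases map to None so coverage gaps stay visible.
    """
    totals = Counter()
    passes = Counter()
    for level, passed in results:
        if level not in LEVELS:
            raise ValueError(f"unknown complexity level: {level}")
        totals[level] += 1
        passes[level] += int(passed)
    rates = {}
    for level in LEVELS:
        rate: Optional[float] = None
        if totals[level]:
            rate = passes[level] / totals[level]
        rates[level] = rate
    return rates


results = [
    ("simple", True),
    ("simple", True),
    ("medium", True),
    ("medium", False),
    ("complex", False),
]
rates = pass_rate_by_level(results)
print(rates)
```

In this example the `None` for `very_complex` flags that the test set has no cases at that level yet.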

Context Engineering Evaluation

Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.
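A comparison across context strategies can be summarized with a few averages per strategy. The sketch below assumes each run has already been reduced to a quality score, a token count, and a tool-call count; the `RunResult` type, the strategy names, and the numbers are illustrative only.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunResult:
    quality: float   # rubric or judge score in [0, 1]
    tokens: int      # total tokens consumed by the run
    tool_calls: int  # number of tool invocations


def summarize(results_by_strategy: dict) -> dict:
    """Average quality, tokens, and tool calls per strategy on the same test set."""
    summary = {}
    for name, runs in results_by_strategy.items():
        mean_quality = mean(r.quality for r in runs)
        mean_tokens = mean(r.tokens for r in runs)
        summary[name] = {
            "mean_quality": mean_quality,
            "mean_tokens": mean_tokens,
            "mean_tool_calls": mean(r.tool_calls for r in runs),
            # Simple efficiency metric: quality obtained per 1,000 tokens.
            "quality_per_1k_tokens": mean_quality / (mean_tokens / 1000),
        }
    return summary


# Made-up numbers for two hypothetical strategies run on the same tasks.
summary = summarize({
    "full_history": [RunResult(0.8, 20000, 6), RunResult(0.9, 22000, 8)],
    "summarized": [RunResult(0.8, 8000, 5), RunResult(0.85, 10000, 5)],
})
for name, stats in summary.items():
    print(name, stats)
```

Keeping quality and cost side by side makes trade-offs explicit: a strategy with slightly lower quality may still be preferable if it uses far fewer tokens.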

Degradation Testing: Test at different context sizes to identify performance cliffs.
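A performance cliff can be located by scoring the same test set at increasing context sizes and looking for the first large drop between neighboring sizes. The `find_cliff` helper, the 0.15 drop threshold, and the scores below are assumptions chosen for illustration.

```python
from typing import Optional


def find_cliff(scores_by_size: dict, max_drop: float = 0.15) -> Optional[int]:
    """Return the first context size where quality falls by more than max_drop
    relative to the next smaller size, or None if no such drop exists."""
    sizes = sorted(scores_by_size)
    for prev, curr in zip(sizes, sizes[1:]):
        drop = scores_by_size[prev] - scores_by_size[curr]
        if drop > max_drop:
            return curr
    return None


# Made-up mean quality scores keyed by context size in tokens.
scores_by_size = {4000: 0.91, 16000: 0.89, 64000: 0.86, 128000: 0.62}
cliff = find_cliff(scores_by_size)
print(cliff)  # 128000: quality drops by 0.24 between 64000 and 128000 tokens
```

Gradual decline stays below the threshold and is not flagged, so a cliff detector like this complements, but does not replace, tracking the overall trend.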

Guidelines

  1. Use multi-dimensional rubrics, not single metrics
  2. Evaluate outcomes, not specific execution paths
  3. Cover complexity levels from simple to complex
  4. Test with realistic context sizes and histories
  5. Run evaluations continuously, not just before release
  6. Supplement LLM evaluation with human review
  7. Track metrics over time for trend detection
  8. Set clear pass/fail thresholds based on use case
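Guidelines 7 and 8 can be combined into a small quality gate: per-dimension thresholds decide pass or fail for one run, and a moving-window comparison flags downward trends across runs. The threshold values, window size, tolerance, and function names here are assumptions to adapt per use case.

```python
from statistics import mean

# Illustrative minimum scores; set these according to the use case.
THRESHOLDS = {"factual_accuracy": 0.90, "completeness": 0.80, "tool_efficiency": 0.70}


def quality_gate(scores: dict, thresholds: dict = THRESHOLDS) -> tuple:
    """Return (passed, failures). A missing score counts as 0.0 and fails."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name, 0.0)
        if value < minimum:
            failures.append(f"{name}: {value:.2f} < {minimum:.2f}")
    return (not failures, failures)


def is_regressing(history: list, window: int = 3, tolerance: float = 0.02) -> bool:
    """True if the mean of the last `window` scores is lower than the mean of
    the preceding `window` scores by more than `tolerance`."""
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    recent = mean(history[-window:])
    previous = mean(history[-2 * window:-window])
    return previous - recent > tolerance


passed, failures = quality_gate(
    {"factual_accuracy": 0.93, "completeness": 0.75, "tool_efficiency": 0.80}
)
print(passed, failures)  # False ['completeness: 0.75 < 0.80']

regressing = is_regressing([0.90, 0.91, 0.89, 0.85, 0.84, 0.83])
print(regressing)  # True: the recent window averages about 0.06 lower
```

Running a gate like this on every change, not only before release, is one way to follow guideline 5.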

Created: 2025-12-20 | Version: 1.0.0

Install

Requires askill CLI v1.0+

Metadata

License: unknown
Version: -
Updated: 2/5/2026
Publisher: bthillerup

Tags

llm, observability, testing