askill
advanced-evaluation

Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.

0 stars
1.2k downloads
Updated 2/23/2026

Advanced Evaluation

LLM-as-a-Judge techniques for evaluating AI outputs. LLM-as-a-Judge is not a single technique but a family of approaches; the core competency is choosing the right one and mitigating its biases.

When to Activate

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards
  • Debugging inconsistent evaluation results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation

Core Concepts

Evaluation Taxonomy

Direct Scoring: A single LLM rates one response on a defined scale.

  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria

Pairwise Comparison: An LLM compares two responses and selects the better one.

  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences

Known Biases

| Bias | Description | Mitigation |
|---|---|---|
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer = higher scores | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |

Decision Framework

Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)
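
The decision tree above can be written as a small routing function. This is a minimal sketch: the function name `choose_method`, the returned labels, and the `CRITERIA` mapping are illustrative assumptions, not part of the skill.

```python
def choose_method(has_objective_ground_truth: bool) -> str:
    """Mirror the decision tree: ground truth -> direct scoring, else pairwise."""
    return "direct_scoring" if has_objective_ground_truth else "pairwise_comparison"


# Example criteria taken from the tree; the True/False labels are assumptions
# about whether each criterion has an objective ground truth.
CRITERIA = {
    "factual accuracy": True,
    "format compliance": True,
    "tone": False,
    "creativity": False,
}

# Route every criterion to an evaluation method.
plan = {name: choose_method(objective) for name, objective in CRITERIA.items()}
```

Routing per criterion rather than per task lets one pipeline score factual accuracy directly while comparing tone pairwise.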

Quick Reference

Direct Scoring Requirements

  1. Clear criteria definitions
  2. Calibrated scale (1-5 recommended)
  3. Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
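
The three requirements above can be sketched as a prompt builder plus a strict score parser. The prompt wording, the function names, and the `Score: <1-5>` output convention are assumptions made for illustration; the skill's own templates live in the referenced guide files.

```python
import re


def build_direct_scoring_prompt(response: str, criterion: str, definition: str) -> str:
    """Assemble a prompt that asks for the justification first and the score last."""
    return (
        f"Evaluate the response on this criterion: {criterion}.\n"
        f"Definition: {definition}\n"
        "Scale: 1 (fails the criterion) to 5 (fully meets it).\n\n"
        f"Response:\n{response}\n\n"
        "First write a short justification citing evidence from the response.\n"
        "Then, on the final line, write 'Score: <1-5>'."
    )


def parse_score(judge_output: str) -> int:
    """Return the last 'Score: N' line; reject output without a valid 1-5 score."""
    matches = re.findall(r"^Score:\s*([1-5])\s*$", judge_output, flags=re.MULTILINE)
    if not matches:
        raise ValueError("no valid 'Score: <1-5>' line found")
    return int(matches[-1])
```

Rejecting malformed output instead of guessing a score keeps silent parsing errors out of the aggregate results.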

Pairwise Comparison Protocol

  1. First pass: A in first position
  2. Second pass: B in first position (swap)
  3. Consistency check: If passes disagree → TIE
  4. Final verdict: Consistent winner with averaged confidence
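
The four protocol steps above can be sketched as follows. The `Judge` signature is hypothetical: it stands in for whatever LLM call is used and is assumed to return which position won (`"first"` or `"second"`) plus a confidence in [0, 1]. Returning `None` as the confidence of a tie is also an assumption, since the protocol does not specify one.

```python
from typing import Callable, Optional, Tuple

# Hypothetical judge: (first_response, second_response) -> (winning position, confidence)
Judge = Callable[[str, str], Tuple[str, float]]


def pairwise_verdict(judge: Judge, a: str, b: str) -> Tuple[str, Optional[float]]:
    """Run both orderings and keep a winner only if the two passes agree."""
    pos1, conf1 = judge(a, b)  # first pass: A in first position
    pos2, conf2 = judge(b, a)  # second pass: B in first position (swap)

    # Map positional answers back to the underlying responses.
    winner1 = "A" if pos1 == "first" else "B"
    winner2 = "B" if pos2 == "first" else "A"

    if winner1 != winner2:
        return "TIE", None  # passes disagree, likely position bias
    return winner1, (conf1 + conf2) / 2  # consistent winner, averaged confidence
```

A judge that always prefers whatever appears first produces a disagreement between the passes, so its position bias surfaces as a tie rather than as a false win.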

Rubric Components

  • Level descriptions with clear boundaries
  • Observable characteristics per level
  • Edge case guidance
  • Strictness calibration (lenient/balanced/strict)
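
One way to hold the rubric components above is a pair of small dataclasses. This is a sketch under assumptions: the class names, field names, and rendered text layout are illustrative and not defined by the skill.

```python
from dataclasses import dataclass, field
from typing import List

STRICTNESS_LEVELS = ("lenient", "balanced", "strict")


@dataclass
class RubricLevel:
    score: int
    description: str        # level description with a clear boundary
    observable: List[str]   # observable characteristics for this level


@dataclass
class Rubric:
    criterion: str
    levels: List[RubricLevel]
    edge_cases: List[str] = field(default_factory=list)
    strictness: str = "balanced"

    def __post_init__(self) -> None:
        if self.strictness not in STRICTNESS_LEVELS:
            raise ValueError(f"strictness must be one of {STRICTNESS_LEVELS}")
        scores = [level.score for level in self.levels]
        if not scores or len(set(scores)) != len(scores):
            raise ValueError("rubric needs at least one level and unique scores")

    def render(self) -> str:
        """Render the rubric as plain text for a judge prompt, lowest score first."""
        lines = [f"Criterion: {self.criterion} (strictness: {self.strictness})"]
        for level in sorted(self.levels, key=lambda lv: lv.score):
            lines.append(f"{level.score}: {level.description}")
            lines.extend(f"  - {item}" for item in level.observable)
        if self.edge_cases:
            lines.append("Edge cases:")
            lines.extend(f"  - {case}" for case in self.edge_cases)
        return "\n".join(lines)
```

Validating at construction time catches a misspelled strictness setting or duplicate score levels before the rubric reaches a judge prompt.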

Integration

Works with:

  • context-fundamentals - Effective context structure
  • tool-design - Evaluation tool schemas
  • evaluation (foundational) - Core evaluation concepts

For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md

See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md

Install

Requires askill CLI v1.0+

AI Quality Score

82/100 (analyzed 3/1/2026)

High-quality technical reference skill on LLM-as-a-Judge evaluation. Covers direct scoring, pairwise comparison, known biases with mitigations, and decision frameworks. Well-structured with clear when-to-use guidance and step-by-step protocols. Slight penalty for .agent path suggesting internal config, but content is generic and reference-style. References external files for detailed implementation patterns, which is appropriate for a skill document.

Metadata

License: unknown
Version: 1.0.0
Updated: 2/23/2026
Publisher: xiangteng007

Tags

llm, observability, prompting