databricks-mlflow-evaluation

Safety: 95

MLflow 3 GenAI agent evaluation. Use when writing mlflow.genai.evaluate() code, creating @scorer functions, using built-in scorers (Guidelines, Correctness, Safety, RetrievalGroundedness), building eval datasets from traces, setting up trace ingestion and production monitoring, aligning judges with MemAlign from domain expert feedback, or running optimize_prompts() with GEPA for automated prompt improvement.

611 stars
12.2k downloads
Updated 2/20/2026

SKILL.md

MLflow 3 GenAI Evaluation

Before Writing Any Code

  1. Read GOTCHAS.md - 15+ common mistakes that cause failures
  2. Read CRITICAL-interfaces.md - Exact API signatures and data schemas

End-to-End Workflows

Follow these workflows based on your goal. Each step indicates which reference files to read.

Workflow 1: First-Time Evaluation Setup

For users new to MLflow GenAI evaluation or setting up evaluation for a new agent.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand what to evaluate | user-journeys.md (Journey 0: Strategy) |
| 2 | Learn API patterns | GOTCHAS.md + CRITICAL-interfaces.md |
| 3 | Build initial dataset | patterns-datasets.md (Patterns 1-4) |
| 4 | Choose/create scorers | patterns-scorers.md + CRITICAL-interfaces.md (built-in list) |
| 5 | Run evaluation | patterns-evaluation.md (Patterns 1-3) |

Workflow 2: Production Trace -> Evaluation Dataset

For building evaluation datasets from production traces.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Search and filter traces | patterns-trace-analysis.md (MCP tools section) |
| 2 | Analyze trace quality | patterns-trace-analysis.md (Patterns 1-7) |
| 3 | Tag traces for inclusion | patterns-datasets.md (Patterns 16-17) |
| 4 | Build dataset from traces | patterns-datasets.md (Patterns 6-7) |
| 5 | Add expectations/ground truth | patterns-datasets.md (Pattern 2) |

Workflow 3: Performance Optimization

For debugging slow or expensive agent execution.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Profile latency by span | patterns-trace-analysis.md (Patterns 4-6) |
| 2 | Analyze token usage | patterns-trace-analysis.md (Pattern 9) |
| 3 | Detect context issues | patterns-context-optimization.md (Section 5) |
| 4 | Apply optimizations | patterns-context-optimization.md (Sections 1-4, 6) |
| 5 | Re-evaluate to measure impact | patterns-evaluation.md (Patterns 6-7) |

Workflow 4: Regression Detection

For comparing agent versions and finding regressions.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Establish baseline | patterns-evaluation.md (Pattern 4: named runs) |
| 2 | Run current version | patterns-evaluation.md (Pattern 1) |
| 3 | Compare metrics | patterns-evaluation.md (Patterns 6-7) |
| 4 | Analyze failing traces | patterns-trace-analysis.md (Pattern 7) |
| 5 | Debug specific failures | patterns-trace-analysis.md (Patterns 8-9) |
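
Step 3 (compare metrics) can be sketched in plain Python. This is a minimal illustration, not the pattern from patterns-evaluation.md. It assumes each run's aggregate metrics are available as a dict of metric name to float (for example, from the `metrics` attribute of the result returned by `mlflow.genai.evaluate()`). The helper name `find_regressions`, the metric names, and the `tolerance` value are all hypothetical.

```python
def find_regressions(baseline: dict, candidate: dict, tolerance: float = 0.02) -> dict:
    """Return metrics whose candidate value dropped by more than `tolerance`.

    Assumes higher is better for every metric. Only metrics present in
    both runs are compared.
    """
    regressions = {}
    for name in sorted(baseline.keys() & candidate.keys()):
        drop = baseline[name] - candidate[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical aggregate metrics from a baseline run and the current run.
baseline_metrics = {"safety/mean": 1.0, "correctness/mean": 0.82, "guidelines/mean": 0.90}
candidate_metrics = {"safety/mean": 1.0, "correctness/mean": 0.74, "guidelines/mean": 0.91}

regressions = find_regressions(baseline_metrics, candidate_metrics)
print(regressions)
```

A tolerance is used because LLM-judged scores tend to vary slightly between runs; the right threshold depends on dataset size and judge stability.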

Workflow 5: Custom Scorer Development

For creating project-specific evaluation metrics.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Understand scorer interface | CRITICAL-interfaces.md (Scorer section) |
| 2 | Choose scorer pattern | patterns-scorers.md (Patterns 4-11) |
| 3 | For multi-agent scorers | patterns-scorers.md (Patterns 13-16) |
| 4 | Test with evaluation | patterns-evaluation.md (Pattern 1) |
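
As a rough sketch of what a custom scorer can look like, the example below keeps the scoring logic in a plain function so it can be tested without MLflow, then wraps it with the `scorer` decorator from `mlflow.genai.scorers`. The function names, the `"response"` output key, and the `"expected_facts"` expectation key are assumptions for illustration; check CRITICAL-interfaces.md for the exact schema.

```python
def contains_expected_facts(outputs, expectations) -> bool:
    """Pass only if every expected fact appears (case-insensitively) in the response text."""
    # Outputs may be a plain string or a dict; the "response" key is an assumption.
    text = outputs.get("response", "") if isinstance(outputs, dict) else str(outputs)
    facts = (expectations or {}).get("expected_facts", [])
    return all(fact.lower() in text.lower() for fact in facts)


def as_mlflow_scorer(fn):
    """Wrap a plain function with MLflow's scorer decorator (imported lazily)."""
    from mlflow.genai.scorers import scorer

    return scorer(fn)


# Exercise the scoring logic directly before wiring it into evaluate().
sample_outputs = {"response": "MLflow is an open source platform for the ML lifecycle."}
sample_expectations = {"expected_facts": ["open source", "ML lifecycle"]}
passed = contains_expected_facts(sample_outputs, sample_expectations)
failed = contains_expected_facts("No relevant content.", sample_expectations)
```

Once the logic behaves as intended, `as_mlflow_scorer(contains_expected_facts)` can be passed in the `scorers` list of `mlflow.genai.evaluate()` (step 4).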

Workflow 6: Unity Catalog Trace Ingestion & Production Monitoring

For storing traces in Unity Catalog, instrumenting applications, and enabling continuous production monitoring.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Link UC schema to experiment | patterns-trace-ingestion.md (Patterns 1-2) |
| 2 | Set trace destination | patterns-trace-ingestion.md (Patterns 3-4) |
| 3 | Instrument your application | patterns-trace-ingestion.md (Patterns 5-8) |
| 4 | Configure trace sources (Apps/Serving/OTEL) | patterns-trace-ingestion.md (Patterns 9-11) |
| 5 | Enable production monitoring | patterns-trace-ingestion.md (Patterns 12-13) |
| 6 | Query and analyze UC traces | patterns-trace-ingestion.md (Pattern 14) |

Workflow 7: Judge Alignment with MemAlign

For aligning an LLM judge to match domain expert preferences. A well-aligned judge improves every downstream use: evaluation accuracy, production monitoring signal, and prompt optimization quality. This workflow is valuable on its own, independent of prompt optimization.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Design base judge with make_judge (any feedback type) | patterns-judge-alignment.md (Pattern 1) |
| 2 | Run evaluate(), tag successful traces | patterns-judge-alignment.md (Pattern 2) |
| 3 | Build UC dataset + create SME labeling session | patterns-judge-alignment.md (Pattern 3) |
| 4 | Align judge with MemAlign after labeling completes | patterns-judge-alignment.md (Pattern 4) |
| 5 | Register aligned judge to experiment | patterns-judge-alignment.md (Pattern 5) |
| 6 | Re-evaluate with aligned judge (baseline) | patterns-judge-alignment.md (Pattern 6) |

Workflow 8: Automated Prompt Optimization with GEPA

For automatically improving a registered system prompt using optimize_prompts(). It works with any scorer, but pairing it with an aligned judge (Workflow 7) gives the most domain-accurate signal. For the full end-to-end loop combining alignment and optimization, see user-journeys.md, Journey 10.

| Step | Action | Reference Files |
|------|--------|-----------------|
| 1 | Build optimization dataset (inputs + expectations) | patterns-prompt-optimization.md (Pattern 1) |
| 2 | Run optimize_prompts() with GEPA + scorer | patterns-prompt-optimization.md (Pattern 2) |
| 3 | Register new version, promote conditionally | patterns-prompt-optimization.md (Pattern 3) |
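
Because the optimization dataset in step 1 needs both inputs and expectations on every record, a quick pre-flight check can catch incomplete records before an optimization run is started. The sketch below is plain Python and does not call optimize_prompts() itself. The helper name `validate_optimization_dataset` and the `"expected_response"` key are hypothetical.

```python
def validate_optimization_dataset(records: list) -> list:
    """Return human-readable problems; an empty list means every record has both keys."""
    problems = []
    for i, record in enumerate(records):
        for key in ("inputs", "expectations"):
            value = record.get(key)
            # Each key must be present and hold a non-empty dict.
            if not isinstance(value, dict) or not value:
                problems.append(f"record {i}: missing or empty '{key}'")
    return problems


dataset = [
    {
        "inputs": {"query": "Reset my password"},
        "expectations": {"expected_response": "Send a reset link"},
    },
    # No expectations: this record would need ground truth before optimization.
    {"inputs": {"query": "Cancel my order"}},
]
problems = validate_optimization_dataset(dataset)
print(problems)
```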

Reference Files Quick Lookup

| Reference | Purpose | When to Read |
|-----------|---------|--------------|
| GOTCHAS.md | Common mistakes | Always read first before writing code |
| CRITICAL-interfaces.md | API signatures, schemas | When writing any evaluation code |
| patterns-evaluation.md | Running evals, comparing | When executing evaluations |
| patterns-scorers.md | Custom scorer creation | When built-in scorers aren't enough |
| patterns-datasets.md | Dataset building | When preparing evaluation data |
| patterns-trace-analysis.md | Trace debugging | When analyzing agent behavior |
| patterns-context-optimization.md | Token/latency fixes | When agent is slow or expensive |
| patterns-trace-ingestion.md | UC trace setup, monitoring | When setting up trace storage or production monitoring |
| patterns-judge-alignment.md | MemAlign judge alignment, labeling sessions, SME feedback | When aligning judges to domain expert preferences |
| patterns-prompt-optimization.md | GEPA optimization: build dataset, optimize_prompts(), promote | When running automated prompt improvement |
| user-journeys.md | High-level workflows, full domain-expert optimization loop | When starting a new evaluation project or running the full align + optimize cycle |

Critical API Facts

  • Use: mlflow.genai.evaluate() (NOT mlflow.evaluate())
  • Data format: {"inputs": {"query": "..."}} (nested structure required)
  • predict_fn: Receives **unpacked kwargs (not a dict)
  • MemAlign: Scorer-agnostic (works with any feedback_value_type -- float, bool, categorical); token-heavy on the embedding model so set embedding_model explicitly
  • Label schema name matching: The label schema name in the labeling session MUST match the judge name used in evaluate() for align() to pair scores
  • Aligned judge scores: May be lower than unaligned judge scores -- this is expected and means the judge is now more accurate, not that the agent regressed
  • GEPA optimization dataset: Must have both inputs AND expectations per record (different from eval dataset)
  • Episodic memory: Lazily loaded -- get_scorer() results won't show episodic memory on print until the judge is first used
  • optimize_prompts: Requires MLflow >= 3.5.0

See GOTCHAS.md for complete list.
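
The first three facts above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference pattern: `predict_fn` here is a stand-in for a real agent, the `"response"` key is hypothetical, and `run_evaluation` assumes MLflow 3 is installed with a tracking backend configured. The MLflow import is deferred so the data-shape check at the bottom runs on its own.

```python
# Each record nests the agent's arguments under "inputs".
eval_data = [
    {"inputs": {"query": "What is MLflow?"}},
    {"inputs": {"query": "How do I log a trace?"}},
]


def predict_fn(query: str) -> dict:
    """Stand-in for an agent. Parameter names must match the keys under "inputs"."""
    return {"response": f"You asked: {query}"}


def run_evaluation(scorers: list):
    """Call the GenAI entry point (mlflow.genai.evaluate, not mlflow.evaluate)."""
    import mlflow  # imported lazily so the preview below works without MLflow

    return mlflow.genai.evaluate(data=eval_data, predict_fn=predict_fn, scorers=scorers)


# Preview of the calling convention: the nested "inputs" dict is unpacked
# into keyword arguments, so predict_fn never receives the dict itself.
preview = [predict_fn(**row["inputs"]) for row in eval_data]
print(preview[0])
```

For a real run, `run_evaluation` would be called with a list of scorers, for example built-in ones such as `Safety()` imported from `mlflow.genai.scorers`, or a custom `@scorer` function.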

Install

Requires askill CLI v1.0+

AI Quality Score

78/100 (analyzed 2/23/2026)

Comprehensive reference skill for MLflow 3 GenAI evaluation on Databricks with 8 well-structured workflows. Excellent organization with tables, clear triggers in description, and good tags. Main limitation is actionability - provides workflow steps referencing external files but lacks direct code examples or commands. Highly reusable as reference material but less actionable for immediate implementation. No safety concerns.

Metadata

License: unknown
Version: -
Updated: 2/20/2026
Publisher: databricks-solutions

Tags

api, ci-cd, github-actions, llm, observability, prompting, testing