benchmark-runner

Guides evaluation workflow execution for SF-Bench. The agent invokes this skill when running evaluations, discussing scoring methodology, or working with the evaluation pipeline.

Updated 1/27/2026

SKILL.md

Benchmark Runner

Overview

This skill provides expertise in executing SF-Bench evaluations, understanding the scoring methodology, and managing the evaluation pipeline.

When This Skill Applies

  • Running or discussing evaluation workflows
  • Questions about scoring or metrics
  • Working with evaluation scripts
  • Managing checkpoints and results
  • Understanding task execution flow

Evaluation Pipeline

Pipeline Stages

1. Preflight Checks
   ├── API Key validation
   ├── DevHub connectivity
   ├── Scratch org capacity
   └── LLM model validation

2. Solution Generation
   ├── Load task definition
   ├── Generate prompt from task
   ├── Call AI provider API
   └── Validate git diff format

3. Task Execution
   ├── Clone task repository
   ├── Apply solution patch
   ├── Create scratch org
   ├── Deploy to scratch org
   └── Execute validation

4. Validation
   ├── Deployment validation (code compiles/deploys)
   ├── Test validation (unit tests pass)
   └── Functional validation (business outcome achieved)

5. Scoring & Reporting
   ├── Calculate component scores
   ├── Aggregate results
   ├── Generate reports (JSON + Markdown)
   └── Create checkpoint
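
Stage 2 ends by validating that the generated solution is a git diff. The source does not show how that check is done, so the following is only a heuristic sketch: the function name `looks_like_git_diff` and the regular expressions are assumptions, and the real SF-Bench validator may be stricter.

```python
import re

# Heuristic patterns for a unified diff (assumed, not taken from SF-Bench).
_OLD_FILE = re.compile(r"^--- (a/\S+|/dev/null)", re.MULTILINE)
_NEW_FILE = re.compile(r"^\+\+\+ (b/\S+|/dev/null)", re.MULTILINE)
_HUNK = re.compile(r"^@@ -\d+(,\d+)? \+\d+(,\d+)? @@", re.MULTILINE)


def looks_like_git_diff(patch: str) -> bool:
    """Return True if the text has both file headers and at least one hunk header."""
    return all(p.search(patch) for p in (_OLD_FILE, _NEW_FILE, _HUNK))


sample_patch = (
    "--- a/force-app/Foo.cls\n"
    "+++ b/force-app/Foo.cls\n"
    "@@ -1,2 +1,3 @@\n"
    " public class Foo {\n"
    "+    // added line\n"
    " }\n"
)
is_valid = looks_like_git_diff(sample_patch)
```

A check like this only rejects obviously malformed output (for example, a model reply that is prose instead of a patch); whether the patch actually applies is settled later, in the Task Execution stage.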

Running Evaluations

Basic Evaluation

python scripts/evaluate.py \
    --model grok-4.1-fast \
    --tasks data/tasks/verified.json

With Functional Validation

python scripts/evaluate.py \
    --model claude-3-opus \
    --tasks data/tasks/realistic.json \
    --functional

Resume from Checkpoint

python scripts/evaluate.py \
    --model gemini-pro \
    --tasks data/tasks/verified.json \
    --output results/existing-run-dir/

With Pre-Generated Solutions

python scripts/evaluate.py \
    --model gpt-4 \
    --tasks data/tasks/verified.json \
    --solutions solutions/gpt-4/
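
When evaluations are launched from another script, the four command variants above can be assembled programmatically. This is a minimal sketch that uses only the flags shown in the examples (`--model`, `--tasks`, `--functional`, `--output`, `--solutions`); the helper name `build_evaluate_command` is hypothetical and not part of SF-Bench.

```python
from typing import List, Optional


def build_evaluate_command(
    model: str,
    tasks: str,
    functional: bool = False,
    output: Optional[str] = None,
    solutions: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list using only the flags shown in the examples above."""
    cmd = ["python", "scripts/evaluate.py", "--model", model, "--tasks", tasks]
    if functional:
        cmd.append("--functional")
    if output:
        cmd += ["--output", output]
    if solutions:
        cmd += ["--solutions", solutions]
    return cmd


# Equivalent of the "Resume from Checkpoint" example.
resume_cmd = build_evaluate_command(
    model="gemini-pro",
    tasks="data/tasks/verified.json",
    output="results/existing-run-dir/",
)
```

The resulting list can be passed to `subprocess.run(resume_cmd, check=True)`; building a list rather than a shell string avoids quoting problems with paths.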

Scoring Methodology

Component Weights

Component    Weight    Description
Deployment   10%       Code deploys without errors
Tests        20%       Unit tests pass
Functional   50%       Business outcome achieved
Bulk         10%       Handles 200+ records
No Tweaks    10%       No manual modifications needed

Score Calculation

score = (
    deploy_score * 0.10 +
    test_score * 0.20 +
    functional_score * 0.50 +
    bulk_score * 0.10 +
    no_tweaks_score * 0.10
)
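
As a worked example of the formula, the sketch below applies the same weights to a solution that passes every component except functional validation. It assumes each component score lies between 0.0 and 1.0 (as in the result schema); the names `WEIGHTS` and `weighted_score` are illustrative, not SF-Bench identifiers.

```python
from typing import Dict

# Weights copied from the Component Weights table; they sum to 1.0.
WEIGHTS: Dict[str, float] = {
    "deploy": 0.10,
    "tests": 0.20,
    "functional": 0.50,
    "bulk": 0.10,
    "no_tweaks": 0.10,
}


def weighted_score(components: Dict[str, float]) -> float:
    """Weighted sum of component scores, each assumed to be in [0.0, 1.0]."""
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Everything passes except the functional check.
example = {
    "deploy": 1.0,
    "tests": 1.0,
    "functional": 0.0,
    "bulk": 1.0,
    "no_tweaks": 1.0,
}
example_score = weighted_score(example)
```

Here the total comes to about 0.5: because functional validation carries half the weight, a solution that deploys and passes its tests but misses the business outcome cannot reach 0.6.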

Pass/Fail Criteria

  • Pass: Score ≥ 0.6 (60%)
  • Fail: Score < 0.6
  • Partial: Some components pass, others fail
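
The criteria above can be sketched as two small helpers. Two points here are assumptions rather than documented behavior: the score is rounded before comparison so that floating-point noise does not flip a borderline 0.6, and a component is treated as "passing" only when its score is 1.0.

```python
from typing import Dict

PASS_THRESHOLD = 0.6  # from the Pass/Fail Criteria above


def verdict(score: float) -> str:
    """Return "pass" when the score reaches the threshold, else "fail"."""
    # Rounding is an assumed guard against floating-point noise near 0.6.
    return "pass" if round(score, 6) >= PASS_THRESHOLD else "fail"


def is_partial(components: Dict[str, float]) -> bool:
    """True when some components pass and others fail (pass assumed to mean 1.0)."""
    passed = [value >= 1.0 for value in components.values()]
    return any(passed) and not all(passed)


borderline = verdict(0.6)
mixed = is_partial({"deploy": 1.0, "tests": 1.0, "functional": 0.0})
```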

Task Types

Apex Tasks

  • Trigger implementation
  • Class development
  • Test class creation
  • Bulk operation handling

LWC Tasks

  • Component creation
  • Apex controller integration
  • Event handling
  • Wire service usage

Flow Tasks

  • Record-triggered flows
  • Screen flows
  • Scheduled flows
  • Flow variables and formulas

Architecture Tasks

  • Cross-cutting concerns
  • Integration patterns
  • Multi-object solutions

Checkpoint Management

Checkpoint Structure

{
    "evaluation_id": "run-20260127-123456",
    "completed_tasks": ["task-001", "task-002"],
    "results": {...},
    "metadata": {
        "model": "grok-4.1-fast",
        "provider": "routellm",
        "timestamp": "2026-01-27T12:34:56Z"
    },
    "hash": "sha256:abc123..."
}

Resume Behavior

  1. Load checkpoint from output directory
  2. Verify checkpoint integrity (hash)
  3. Skip completed tasks
  4. Continue from next pending task
  5. Merge results on completion
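
Steps 2 to 4 can be sketched as follows. The source shows a `sha256:` prefix but does not say what the hash covers, so this example assumes it is a SHA-256 digest of the checkpoint serialized as canonical JSON with the `hash` field excluded. The function names are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def compute_checkpoint_hash(checkpoint: Dict) -> str:
    """Hash every field except "hash" itself (an assumed scheme)."""
    payload = {k: v for k, v in checkpoint.items() if k != "hash"}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pending_tasks(checkpoint: Dict, all_task_ids: List[str]) -> List[str]:
    """Verify integrity, then return the tasks not yet completed, in order."""
    if checkpoint.get("hash") != compute_checkpoint_hash(checkpoint):
        raise ValueError("checkpoint hash mismatch; refusing to resume")
    done = set(checkpoint["completed_tasks"])
    return [task_id for task_id in all_task_ids if task_id not in done]


checkpoint = {
    "evaluation_id": "run-20260127-123456",
    "completed_tasks": ["task-001", "task-002"],
    "results": {},
    "metadata": {"model": "grok-4.1-fast"},
}
checkpoint["hash"] = compute_checkpoint_hash(checkpoint)
remaining = pending_tasks(checkpoint, ["task-001", "task-002", "task-003"])
```

Sorting keys before hashing keeps the digest stable regardless of the order in which fields were written, which matters if the checkpoint is rewritten after every task.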

Result Schema (v2)

SWE-bench Compatible Format

{
    "model_name_or_path": "grok-4.1-fast",
    "instance_id": "task-001",
    "model_patch": "--- a/file.cls\n+++ b/file.cls\n...",
    "resolved": true,
    "scores": {
        "deploy": 1.0,
        "tests": 1.0,
        "functional": 1.0,
        "bulk": 1.0,
        "no_tweaks": 1.0,
        "total": 1.0
    }
}
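
Records in this shape are easy to summarize. The source does not specify how results are aggregated, so the sketch below assumes a simple resolution rate (the fraction of records with `resolved` set to true); `resolution_rate` is an illustrative name.

```python
from typing import Dict, List


def resolution_rate(records: List[Dict]) -> float:
    """Fraction of result records whose "resolved" flag is true."""
    if not records:
        return 0.0
    return sum(1 for record in records if record["resolved"]) / len(records)


records = [
    {"model_name_or_path": "grok-4.1-fast", "instance_id": "task-001", "resolved": True},
    {"model_name_or_path": "grok-4.1-fast", "instance_id": "task-002", "resolved": False},
]
rate = resolution_rate(records)
```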

Troubleshooting Evaluations

Common Issues

Preflight Failure

  • Check API key environment variables
  • Verify DevHub authentication: sf org list --all
  • Check scratch org limits

Solution Generation Failure

  • Verify model is available via provider
  • Check API key permissions
  • Review rate limit status

Deployment Failure

  • Check patch format (valid git diff)
  • Review Salesforce error messages
  • Verify scratch org is active

Validation Failure

  • Check test assertions
  • Review functional validation script
  • Verify test data setup

Debug Commands

# Check DevHub orgs
sf org list --all

# View scratch org details
sf org display -o <alias>

# Run tests manually
sf apex run test -o <alias> -n <TestClass> -r human

# Execute anonymous Apex
sf apex run -o <alias> -f test-script.apex

Metadata

License: unknown
Version: -
Updated: 1/27/2026
Publisher: yasarshaikh

Tags

api, ci-cd, llm, observability, prompting, testing