Updated 1/13/2026

Evaluate Benchmark Attempt

Overview

This SOP evaluates a completed brazil-bench attempt against the spec.md requirements, capturing metrics for comparison across orchestration patterns. Supports both Python and Swift/iOS implementations.

Parameters

  • attempt_repo (required): Repository name (e.g., attempt-3)
  • output_dir (optional, default: ./results): Where to write evaluation results

Steps

0. Detect Project Language

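Identify the primary language/platform of the implementation. The commands below assume the repository has already been cloned into ./reviews/{attempt_repo} (see Step 1).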

Detection Commands:

cd ./reviews/{attempt_repo}

# Python indicators
ls pyproject.toml setup.py requirements.txt 2>/dev/null

# Swift/iOS indicators
ls Package.swift *.xcodeproj *.xcworkspace 2>/dev/null

# Check file extensions
find . -name "*.py" -not -path "./.venv/*" | head -5
find . -name "*.swift" | head -5

Language Detection Matrix:

| Files Found | Language | Test Framework |
|-------------|----------|----------------|
| pyproject.toml, *.py | Python | pytest |
| Package.swift, *.swift | Swift Package | swift test |
| *.xcodeproj, *.swift | iOS/Xcode | xcodebuild test |
| Both Python and Swift | Multi-language | Run both |

Constraints:

  • You MUST detect the language before running tests
  • You MUST use appropriate commands for the detected language
  • You SHOULD note the detected language in the report
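
The detection matrix can also be expressed as a small Python helper. This is a minimal sketch rather than part of the SOP: the function name `detect_language`, the "Unknown" fallback, and the choice to treat bare .swift files without an Xcode project as a Swift Package are assumptions made for illustration.

```python
from pathlib import Path

# Packaging files that the SOP lists as Python indicators
PYTHON_MARKERS = ("pyproject.toml", "setup.py", "requirements.txt")


def detect_language(repo_dir):
    """Classify a cloned attempt using the indicators from the detection matrix."""
    root = Path(repo_dir)
    # Python: a packaging marker at the root, or any .py file outside .venv
    has_python = any((root / name).exists() for name in PYTHON_MARKERS) or any(
        ".venv" not in p.relative_to(root).parts for p in root.rglob("*.py")
    )
    has_swift_files = any(root.rglob("*.swift"))
    has_xcode = any(root.glob("*.xcodeproj")) or any(root.glob("*.xcworkspace"))
    has_spm = (root / "Package.swift").exists()

    if has_python and (has_swift_files or has_xcode or has_spm):
        return "Multi-language"
    if has_python:
        return "Python"
    if has_xcode:
        return "iOS/Xcode"
    if has_spm or has_swift_files:
        # Assumption: Swift sources without an Xcode project are run with `swift test`
        return "Swift Package"
    return "Unknown"
```

The returned label can be recorded in the report and used to pick the matching test commands in Step 3.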

1. Clone Attempt

Fetch the attempt repository for local analysis.

Constraints:

  • You MUST clone into ./reviews/{attempt_repo}
  • You MUST verify the clone succeeded before proceeding
  • You MUST NOT modify any files in the cloned repo

gh repo clone brazil-bench/{attempt_repo} ./reviews/{attempt_repo}

2. Verify Spec Integrity

Confirm the spec.md was not modified from the template.

Constraints:

  • You MUST compare spec.md against the template version
  • You MUST fail the evaluation if spec.md was modified
  • You SHOULD use a checksum comparison

gh repo clone brazil-bench/benchmark-template ./reviews/_template --depth 1
diff ./reviews/{attempt_repo}/spec.md ./reviews/_template/spec.md
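
The constraints recommend a checksum comparison, while the commands above use diff. One way to do the checksum variant is sketched here with Python's standard hashlib. The helper names `sha256_of` and `spec_unmodified` are hypothetical, and the paths in the usage note simply mirror the ones used in this step.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def spec_unmodified(attempt_spec, template_spec):
    """True when the attempt's spec.md is byte-identical to the template's."""
    return sha256_of(attempt_spec) == sha256_of(template_spec)
```

For example, `spec_unmodified("./reviews/attempt-3/spec.md", "./reviews/_template/spec.md")` returning False would mean the evaluation fails. A byte-level checksum also flags whitespace and line-ending changes, so diff remains useful for describing what changed.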

3. Run Conformance Tests

Execute the test suite defined in the spec against the implementation.

Constraints:

  • You MUST attempt to run all tests specified in spec.md
  • You MUST capture pass/fail counts and output
  • You SHOULD timeout tests after 60 seconds each
  • You MAY retry flaky tests once
  • If tests fail due to missing dependencies, follow the dependency resolution steps below

Python Test Commands

cd ./reviews/{attempt_repo}

# Run pytest with verbose output
pytest --tb=short -v 2>&1 | tee test_output.log

# Get summary counts
pytest --tb=no -q 2>&1 | tail -5
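
To capture pass/fail counts programmatically, the final summary line that pytest prints (for example `44 passed, 15 skipped in 3.21s`) can be parsed. The following is a minimal sketch; the function name `parse_pytest_summary` is an assumption, and the regular expression only covers the outcome words listed in it.

```python
import re

# Matches items such as "44 passed" or "2 errors" in a pytest summary line
_SUMMARY_ITEM = re.compile(r"(\d+) (passed|failed|skipped|errors?|xfailed|xpassed)")


def parse_pytest_summary(line):
    """Extract outcome counts from a pytest summary line."""
    counts = {"passed": 0, "failed": 0, "skipped": 0, "error": 0}
    for number, outcome in _SUMMARY_ITEM.findall(line):
        # pytest prints "1 error" or "2 errors"; store both under one key
        key = "error" if outcome.startswith("error") else outcome
        counts[key] = counts.get(key, 0) + int(number)
    return counts
```

Feeding it the last line of test_output.log gives the counts needed for the report without re-running the suite.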

Swift/iOS Test Commands

cd ./reviews/{attempt_repo}

# Swift Package Manager
swift test 2>&1 | tee test_output.log

# Xcode project (iOS Simulator)
xcodebuild test \
    -project *.xcodeproj \
    -scheme "YourScheme" \
    -destination 'platform=iOS Simulator,name=iPhone 15' \
    2>&1 | tee test_output.log

# Parse xcodebuild results
grep -E "(Test Case|passed|failed)" test_output.log

# Using xcpretty for cleaner output (if available)
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
    -destination 'platform=iOS Simulator,name=iPhone 15' \
    | xcpretty --report junit

3a. Handle Missing Dependencies (Neo4j, etc.)

If tests fail due to missing external dependencies like Neo4j:

Step 1: Try to start the dependency via Docker

# Check if Docker is available
docker --version

# Check for docker-compose files in the repo
ls ./reviews/{attempt_repo}/docker-compose*.yml

# If Neo4j docker-compose exists, start it
docker-compose -f ./reviews/{attempt_repo}/docker-compose.neo4j.yml up -d

# Or start Neo4j directly
docker run -d --name neo4j-eval -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/password \
  neo4j:5

# Wait for Neo4j to be ready
sleep 10
docker logs neo4j-eval 2>&1 | tail -5

Step 2: If Docker unavailable or fails, look for evidence of prior test runs

Check these sources for test results:

# Check git history for test-related commits (run against the cloned repo, not the current directory)
git -C ./reviews/{attempt_repo} log --oneline --all | grep -iE "(test|pass|100%|fix.*test)"

# Check for CI/CD logs or badges
cat ./reviews/{attempt_repo}/README.md | grep -iE "(pass|badge|ci|test)"

# Check prompts.txt for test execution evidence
cat ./reviews/{attempt_repo}/prompts.txt 2>/dev/null | grep -iE "(pytest|test|pass|fail|scenario)"

# Check for pytest cache with results
ls -la ./reviews/{attempt_repo}/.pytest_cache/ 2>/dev/null

# Check for coverage reports
ls -la ./reviews/{attempt_repo}/htmlcov/ ./reviews/{attempt_repo}/coverage.xml 2>/dev/null

Step 3: Document findings in the report

If tests cannot be run directly, document:

  • Why tests couldn't run (missing Neo4j, etc.)
  • Evidence found of prior test runs (commit messages, prompts.txt entries)
  • Claimed test results from the attempt's documentation
  • Mark as "CANNOT VERIFY" with explanation

Constraints for dependency handling:

  • You MUST try Docker first if available
  • You MUST search for evidence if Docker fails
  • You MUST NOT claim tests pass without verification
  • You SHOULD note the source of any claimed test results
  • You SHOULD clean up Docker containers after evaluation: docker stop neo4j-eval && docker rm neo4j-eval
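
A fixed `sleep 10` can be too short on a slow machine and needlessly long on a fast one. As an alternative, the evaluator could poll the Bolt port until it accepts connections. This is a minimal sketch using only the standard library; `wait_for_port` is a hypothetical helper name, and an open port only shows that the listener is up, not that the database has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

With the container started as in Step 1, `wait_for_port("localhost", 7687)` would block until the mapped Bolt port responds or a minute has passed.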

3b. Detect Skipped Tests

Skipped tests inflate test counts without providing actual verification. You MUST detect and report them separately.

Python: Detect Skipped Tests

Step 1: Run pytest with verbose output to capture skipped tests

cd ./reviews/{attempt_repo}

# Run pytest and capture skip count
pytest --tb=no -v 2>&1 | grep -E "(PASSED|FAILED|SKIPPED|ERROR)" | head -100

# Get summary counts
pytest --tb=no -q 2>&1 | tail -5

# Look for skip patterns in test files
grep -r "pytest.skip\|@pytest.mark.skip\|skipif\|xfail" tests/ --include="*.py"

Step 2: Analyze test files for skip patterns

# Count pytest.skip() calls inside test bodies (worst pattern)
grep -r "pytest.skip(" tests/ --include="*.py" | wc -l

# Count tests with @pytest.mark.skip decorator
grep -r "@pytest.mark.skip" tests/ --include="*.py" | wc -l

# Count conditional skips (skipif)
grep -r "@pytest.mark.skipif" tests/ --include="*.py" | wc -l

Swift/iOS: Detect Skipped Tests

Step 1: Run swift test or xcodebuild and capture skipped tests

cd ./reviews/{attempt_repo}

# Swift Package Manager - look for skipped in output
swift test 2>&1 | grep -E "(passed|failed|skipped)"

# Xcode - parse test results
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
    -destination 'platform=iOS Simulator,name=iPhone 15' \
    2>&1 | grep -E "Test Case.*passed|Test Case.*failed|skipped"

Step 2: Analyze test files for skip patterns

# Count XCTSkip usage (explicit skips)
grep -r "XCTSkip\|throw XCTSkip" Tests/ --include="*.swift" | wc -l

# Count disabled tests (func name doesn't start with test)
grep -r "func disabled_test\|// func test" Tests/ --include="*.swift" | wc -l

# Count tests with availability checks that skip
grep -r "@available\|#available" Tests/ --include="*.swift" -A 2 | grep -i skip | wc -l

# Look for conditional test execution
grep -r "guard.*else.*return\|if.*XCTSkip" Tests/ --include="*.swift" | wc -l

Swift Skip Patterns:

| Pattern | Type | Assessment |
|---------|------|------------|
| throw XCTSkip("reason") | Explicit skip | Acceptable if documented |
| #if !targetEnvironment(simulator) | Conditional | Acceptable for device-only |
| @available(iOS 16, *) | Version skip | Acceptable |
| Renamed to disabled_testFoo | Hidden skip | Should be penalized |
| Empty test body | Stub | Should be penalized |

Step 3: Calculate effective test count

| Metric | How to Calculate |
|--------|------------------|
| Total Tests | Number of test functions defined |
| Passed Tests | Tests that ran and passed |
| Skipped Tests | Tests marked skip or calling pytest.skip() |
| Effective Tests | Total - Skipped (tests that actually run) |
| Skip Ratio | Skipped / Total (percentage of tests that skip) |

Constraints for skipped test handling:

  • You MUST report skipped tests separately from passed tests
  • You MUST calculate the "effective test count" (passed + failed, excluding skipped)
  • You MUST flag ANY skipped tests for issue filing - zero tolerance for skips
  • You MUST distinguish between skip types for the issue description:
    • Conditional skips (@pytest.mark.skipif): Document reason in issue
    • Unconditional skips (pytest.skip() in body): Critical - tests never run
    • Decorator skips (@pytest.mark.skip): Document reason in issue
  • You MUST NOT count skipped tests toward the test score in rankings
  • You MUST file an issue for ANY skipped test (no acceptable skip threshold)
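
The effective-count arithmetic from Step 3 can be sketched as a short Python function. The name `summarize_tests` and the dictionary keys are assumptions for illustration. The sketch also assumes that the total is the sum of passed, failed, and skipped tests, so that "Total - Skipped" and "passed + failed" give the same effective count.

```python
def summarize_tests(passed, failed, skipped):
    """Compute the skip-aware metrics used for scoring."""
    total = passed + failed + skipped
    effective = total - skipped  # tests that actually ran
    skip_ratio = skipped / total if total else 0.0
    return {
        "total": total,
        "effective": effective,
        "skip_ratio": skip_ratio,
        # Zero tolerance: any skipped test triggers an issue
        "file_issue": skipped > 0,
    }
```

For a run with 44 passed, 0 failed, and 15 skipped, this yields a total of 59, an effective count of 44, and a skip ratio of 15/59, or roughly 25%.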

Example Analysis:

Total tests:     59
Passed:          44
Skipped:         15  (25% skip ratio - HIGH)
Failed:          0
Effective:       44  (use this for scoring, not 59)

Skip breakdown:
- pytest.skip() in body: 15 (integration tests that never run)
- @pytest.mark.skipif: 0
- @pytest.mark.skip: 0

Flag: INFLATED TEST COUNT - 15 tests skip unconditionally

Document in Report:

## Test Results

| Metric | Count |
|--------|-------|
| Total Tests | 59 |
| Passed | 44 |
| **Skipped** | **15** |
| Failed | 0 |
| **Effective Tests** | **44** |
| Skip Ratio | 25% |

⚠️ **Warning:** 15 tests (25%) are skipped and never execute.
These are integration tests that call `pytest.skip()` inside the test body.
The effective test count for scoring is 44, not 59.

3c. Self-Contained Integration Tests (REQUIRED)

Integration tests MUST be self-contained and actually run. Tests that skip because "Neo4j not available" or similar are not acceptable.

Requirement: Integration tests must start their own data stores as needed.

Detection Commands:

cd ./reviews/{attempt_repo}

# Check for testcontainers usage (Python)
grep -r "testcontainers\|TestContainer\|DockerContainer" tests/ --include="*.py"

# Check for docker-compose in test setup
grep -r "docker-compose\|subprocess.*docker" tests/ --include="*.py"

# Check for pytest-docker fixture
grep -r "pytest-docker\|docker_compose" tests/ --include="*.py" pyproject.toml

# Check for in-memory alternatives (e.g., SQLite instead of Postgres)
grep -r "sqlite.*memory\|:memory:\|MockNeo4j\|FakeNeo4j" tests/ --include="*.py"

# Check for conftest fixtures that start services
grep -A 20 "@pytest.fixture" tests/conftest.py 2>/dev/null | grep -E "docker|container|start|neo4j"

# Swift: Check for test containers
grep -r "Docker\|Container\|TestServer" Tests/ --include="*.swift"

Acceptable Patterns for Self-Contained Tests:

| Pattern | Example | Assessment |
|---------|---------|------------|
| testcontainers | Neo4jContainer() in fixture | ✓ Best - automatic lifecycle |
| pytest-docker | docker_compose_file fixture | ✓ Good - compose-based |
| conftest startup | Fixture runs docker run neo4j | ✓ Acceptable - manual but works |
| In-memory mock | MockNeo4jClient class | ✗ NOT acceptable - not persistent |
| External dependency | pytest.skip("Neo4j not running") | ✗ NOT acceptable |
| CI-only tests | @pytest.mark.skipif(not CI) | ✗ NOT acceptable |
| No integration tests | No tests for data layer | ✗ NOT acceptable |

Example: testcontainers Pattern (Python)

# conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer

@pytest.fixture(scope="session")
def neo4j_container():
    """Start Neo4j container for integration tests."""
    with Neo4jContainer("neo4j:5") as neo4j:
        yield neo4j

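@pytest.fixture
def neo4j_client(neo4j_container):
    """Get client connected to test container."""
    # Neo4jClient is a placeholder for the project's own client class;
    # import it from the package under test.
    return Neo4jClient(
        uri=neo4j_container.get_connection_url(),
        auth=("neo4j", "password")
    )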

Example: pytest-docker Pattern

# conftest.py
import pytest

@pytest.fixture(scope="session")
def docker_compose_file():
    return "docker-compose.test.yml"

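@pytest.fixture(scope="session")
def neo4j_service(docker_services):
    """Wait for Neo4j to be ready."""
    # is_neo4j_ready is a placeholder: define a helper that returns True
    # once the database accepts connections.
    docker_services.wait_until_responsive(
        timeout=30.0,
        pause=0.5,
        check=lambda: is_neo4j_ready()
    )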

Scoring Impact:

| Integration Test Quality | Score Modifier |
|--------------------------|----------------|
| Self-contained (testcontainers/docker) | No penalty |
| In-memory mock (not persistent) | -10 points quality |
| Skips due to missing dependency | -10 points quality |
| No integration tests at all | -15 points quality |

Constraints:

  • You MUST check if integration tests are self-contained
  • You MUST flag tests that skip due to external dependencies
  • You MUST NOT accept "works on CI" as justification for skipping locally
  • You SHOULD recommend testcontainers or pytest-docker patterns
  • You SHOULD verify integration tests actually execute (not just exist)

Document in Report:

## Integration Test Quality

| Aspect | Status |
|--------|--------|
| Self-contained | Yes/No |
| Data store management | testcontainers / docker-compose / mock / external |
| Integration tests run | X passed, Y skipped |

⚠️ **Issue:** Integration tests skip when Neo4j is not running.
Tests should use testcontainers or pytest-docker to manage dependencies.

3d. Context Header Blocks (REQUIRED)

Every source code file MUST have a context header comment block that documents:

  1. Purpose - What the file/module does
  2. Interfaces - Key classes, functions, or APIs exposed
  3. Change History - Record of modifications (updated on every change)

Detection Commands:

cd ./reviews/{attempt_repo}

# Python: Check for docstrings or header comments in source files
for f in $(find src -name "*.py" -not -name "__init__.py"); do
  echo "=== $f ==="
  head -50 "$f" | grep -E '""".*|^#.*Purpose|^#.*Context|CONTEXT BLOCK|Change History|Interfaces'
done

# Swift: Check for header comments
for f in $(find Sources -name "*.swift" 2>/dev/null); do
  echo "=== $f ==="
  head -50 "$f" | grep -E '///|/\*\*|Purpose|Context|History'
done

# Count files with context headers vs total
total=$(find src -name "*.py" -not -name "__init__.py" | wc -l)
# grep -q succeeds only when the first 30 lines contain a header marker, so -print lists just those files
with_header=$(find src -name "*.py" -not -name "__init__.py" -exec sh -c 'head -30 "$1" | grep -qE "CONTEXT|Purpose|Module:"' _ {} \; -print | wc -l)
echo "Files with headers: $with_header / $total"

Required Header Format (Python):

"""
================================================================================
CONTEXT BLOCK
================================================================================
File: {filename}
Module: {module.path}
Purpose: {one-line description}

Description:
    {detailed description of what this module does}

Interfaces:
    - {ClassName}: {brief description}
    - {function_name}(): {brief description}

Dependencies:
    - {module}: {why needed}

Change History:
    - {date}: {description of change}
    - {date}: Initial creation
================================================================================
"""

Required Header Format (Swift):

//
//  {FileName}.swift
//  {ProjectName}
//
//  Purpose: {one-line description}
//
//  Interfaces:
//    - {ClassName}: {brief description}
//    - {functionName}(): {brief description}
//
//  Change History:
//    - {date}: {description of change}
//    - {date}: Initial creation
//

Assessment Criteria:

| Coverage | Assessment | Score Impact |
|----------|------------|--------------|
| 100% files have headers | Excellent | No penalty |
| 75-99% files have headers | Good | -2 quality |
| 50-74% files have headers | Partial | -5 quality |
| <50% files have headers | Poor | -10 quality |

Constraints:

  • You MUST check all source files for context headers
  • You MUST verify headers include purpose, interfaces, and change history
  • You MUST flag files missing headers for issue filing
  • You SHOULD note which files have incomplete headers (missing sections)
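
The header check and the coverage bands can be sketched in Python as follows. This is only an illustration: the helper names are hypothetical, the section check is a plain substring match on the first 50 lines, and treating a project with zero source files as "Poor" is an assumption.

```python
# Sections every context header must contain, per this step
REQUIRED_SECTIONS = ("Purpose", "Interfaces", "Change History")


def missing_header_sections(source_text, head_lines=50):
    """List the required sections absent from the top of a source file."""
    head = "\n".join(source_text.splitlines()[:head_lines])
    return [section for section in REQUIRED_SECTIONS if section not in head]


def assess_coverage(files_with_headers, total_files):
    """Map header coverage to (assessment, quality score impact)."""
    pct = 100.0 * files_with_headers / total_files if total_files else 0.0
    if pct >= 100:
        return ("Excellent", 0)
    if pct >= 75:
        return ("Good", -2)
    if pct >= 50:
        return ("Partial", -5)
    return ("Poor", -10)
```

A file whose `missing_header_sections` result is non-empty but shorter than the full list has an incomplete header and can be noted as such in the report.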

Document in Report:

## Context Header Compliance

| Metric | Count |
|--------|-------|
| Source files | X |
| With headers | Y |
| Coverage | Z% |

### Files Missing Headers
- `src/module.py` - No header
- `src/utils.py` - Missing change history

### Assessment
{Excellent/Good/Partial/Poor} - {X}% coverage

4. Measure Code Metrics

Collect quantitative data about the implementation.

Constraints:

  • You MUST capture: total lines of code, number of files, dependencies
  • You SHOULD capture: cyclomatic complexity, test coverage
  • You MAY capture: documentation coverage, type hint coverage

Python Metrics

# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/src -name "*.py" | xargs wc -l

# Dependencies
cat ./reviews/{attempt_repo}/pyproject.toml | grep dependencies -A 50

# File count
find ./reviews/{attempt_repo}/src -name "*.py" | wc -l

Swift/iOS Metrics

# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/Sources -name "*.swift" | xargs wc -l

# For Xcode projects
find ./reviews/{attempt_repo} -name "*.swift" -not -path "*/Tests/*" -not -path "*Test*" | xargs wc -l

# Dependencies (Swift Package Manager)
cat ./reviews/{attempt_repo}/Package.swift | grep -A 50 "dependencies:"

# Dependencies (CocoaPods)
cat ./reviews/{attempt_repo}/Podfile 2>/dev/null

# Dependencies (Xcode project - SPM)
grep -r "repositoryURL" ./reviews/{attempt_repo}/*.xcodeproj/project.pbxproj 2>/dev/null | head -20

# File count
find ./reviews/{attempt_repo}/Sources -name "*.swift" | wc -l

# Check for SwiftLint configuration
ls ./reviews/{attempt_repo}/.swiftlint.yml 2>/dev/null

5. Extract Git Metrics and Analyze Development Timeline

Analyze the development history to separate agent-driven work from human interactions.

Constraints:

  • You MUST capture: total commits, time from first to last commit
  • You MUST separate commits into agent-driven vs human-driven phases
  • You MUST calculate autonomous duration (agent work only)
  • You SHOULD capture: number of reverts, force pushes (if detectable)
  • You SHOULD extract commit messages mentioning "fix", "revert", "oops"

5a. Gather Raw Git Data

cd ./reviews/{attempt_repo}

# Full commit history with timestamps and messages
git log --format="%ai | %H | %s" --reverse

# Count total commits
git log --oneline | wc -l

# Find fix/revert commits
git log --format="%H %s" | grep -iE "(fix|revert|oops|wrong)"

# Get first and last commit times
git log --format="%ai" --reverse | head -1  # First commit
git log --format="%ai" | head -1             # Last commit

5b. Identify Development Phases

Analyze commit timestamps and messages to identify distinct phases:

Phase 1: Setup (Human)

  • Initial commit, repo setup, file uploads
  • Typically first 1-3 commits before implementation starts
  • Look for: "Initial commit", "Add files", "upload", "setup"

Phase 2: Agent Implementation

  • Bulk implementation work by the agent
  • Characterized by:
    • Rapid succession of commits (minutes apart)
    • Large code changes
    • Messages like "Implement", "Add", "Create"
    • Consistent commit patterns (same author, similar timing)

Phase 3: Agent Test Iteration

  • Test fixing and iteration by the agent
  • Characterized by:
    • Commits mentioning "fix", "test", "pass"
    • Still rapid succession
    • Often shows progression: "Fix X" → "Fix Y" → "100% pass"

Phase 4: Human Intervention (Post-Completion)

  • Human-driven changes after agent work completes
  • Characterized by:
    • Time gaps (hours/days after previous commits)
    • Different commit patterns or author info
    • Messages about data, documentation, cleanup
    • Changes not required by the spec

5c. Heuristics for Identifying Agent vs Human Commits

Agent commits typically show:

  • Timestamps within minutes of each other
  • Consistent formatting in commit messages
  • Co-authored-by lines mentioning Claude/AI
  • Large, comprehensive changes
  • Focus on implementation and tests

Human commits typically show:

  • Time gaps of hours or days from previous work
  • Different commit message style
  • Focus on data, docs, or polish
  • Smaller, targeted changes
  • Work done after "100% tests pass" milestone

# Look for time gaps > 1 hour between commits (potential phase boundaries)
git log --format="%at %ai %s" --reverse | awk 'NR > 1 && $1 - prev > 3600 { print "gap before: " $0 } { prev = $1 }'

# Check for co-author lines indicating AI
git log --format="%b" | grep -i "co-authored"

# Check prompts.txt for session boundaries
cat prompts.txt 2>/dev/null | grep -E "^(Done|Session|Agent)"
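
The time-gap heuristic can also be applied in Python when the timestamps need further processing. This is a minimal sketch that assumes the timestamps were exported in strict ISO 8601 form, for example with `git log --format=%aI --reverse`; the function name `find_phase_boundaries` and the one-hour default are illustrative choices.

```python
from datetime import datetime, timedelta


def find_phase_boundaries(timestamps, min_gap=timedelta(hours=1)):
    """Return (previous, current) commit times separated by more than min_gap.

    timestamps: ISO 8601 strings in chronological order.
    """
    parsed = [datetime.fromisoformat(ts) for ts in timestamps]
    return [(a, b) for a, b in zip(parsed, parsed[1:]) if b - a > min_gap]
```

Each returned pair marks a candidate phase boundary. A gap alone does not prove a switch from agent to human work, so it should be read together with the commit messages and co-author lines.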

5d. Calculate Duration Metrics

| Metric | How to Calculate |
|--------|------------------|
| Total Duration | Last commit - First commit |
| Agent Duration | Sum of time during agent phases only |
| Human Duration | Sum of time during human phases |
| Autonomous Duration | Phase 2 + Phase 3 (implementation + test fixing) |

Example Timeline Analysis:

09:00:00 - Initial commit (Human Setup)
09:05:00 - Add spec file (Human Setup)
         --- Agent work begins ---
09:15:00 - Implement Phase 1 (Agent)
09:45:00 - Implement Phase 2 (Agent)
10:10:00 - Implement Phase 3 (Agent)
10:25:00 - Fix test issues (Agent)
10:40:00 - 100% tests pass (Agent)
         --- Agent work ends ---
         --- 2 day gap ---
Oct 3     - Add real data (Human)
Oct 3     - Update docs (Human)

Agent Duration: ~1h 25m (09:15 → 10:40)
Human Duration: ~5m setup + later changes
Autonomous Duration: ~1h 25m
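
The duration arithmetic in this example can be reproduced with a short Python sketch. It assumes each commit has already been labeled "agent" or "human" by the evaluator, and it measures the span from the first to the last agent commit, which is how the example above arrives at 09:15 → 10:40. The function name and the labels are hypothetical.

```python
from datetime import datetime, timedelta


def autonomous_duration(commits):
    """Span between the first and last agent commit.

    commits: list of (iso_timestamp, actor) tuples, actor being "agent" or "human".
    """
    agent_times = [datetime.fromisoformat(ts) for ts, actor in commits if actor == "agent"]
    if not agent_times:
        return timedelta(0)
    return max(agent_times) - min(agent_times)
```

Because the span starts at the first agent commit, any agent work done before that commit (for instance between 09:05 and 09:15 in the example) is not captured; the evaluator may want to note this limitation in the report.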

5e. Document in Report

Include a Development Timeline section in the report:

## Development Duration Breakdown

| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | ~5 min | Initial commit, file upload |
| **Agent Implementation** | ~55 min | Agent implements all phases |
| **Agent Test Iteration** | ~30 min | Agent iterates to 100% pass |
| **Total Autonomous** | **~1h 25m** | Agent work only |
| **Human Intervention** | 2 days later | Data and docs added |

### Commit Analysis
- Total commits: 15
- Agent commits: 10 (09:15 - 10:40 on Day 1)
- Human commits: 5 (setup + Day 3 changes)
- Fix commits: 3 (normal iteration, not rework)

6. Analyze Against Spec

Review implementation completeness against spec.md requirements.

Constraints:

  • You MUST evaluate against ALL 16 canonical requirements listed below
  • You MUST assess each as: implemented, partial, missing
  • You SHOULD note implementation approach for each
  • You MUST NOT make subjective quality judgments beyond spec compliance
  • You MUST use the exact requirement numbering for cross-attempt consistency

6.0 Canonical Requirements Checklist (16 Requirements)

All evaluations MUST use this exact checklist to ensure consistency across attempts.

Functional Requirements (6):

  1. [FR-1] Search and return match data from all CSV files
  2. [FR-2] Search and return player data
  3. [FR-3] Calculate basic statistics (wins, losses, goals)
  4. [FR-4] Compare teams head-to-head
  5. [FR-5] Handle team name variations correctly
  6. [FR-6] Return properly formatted responses

Query Performance (3):

  7. [QP-1] Simple lookups respond in < 2 seconds
  8. [QP-2] Aggregate queries respond in < 5 seconds
  9. [QP-3] No timeout errors

Data Coverage (3):

  10. [DC-1] All 6 CSV files are loadable and queryable
  11. [DC-2] At least 20 sample questions can be answered
  12. [DC-3] Cross-file queries work (player + match data)

Technical Requirements (4):

  13. [TR-1] MCP server implementation with callable tools
  14. [TR-2] BDD testing with Given-When-Then structure
  15. [TR-3] UTF-8 encoding support (Portuguese characters: ã, ç, é, etc.)
  16. [TR-4] Multiple date format handling (ISO, Brazilian DD/MM/YYYY, with time)

Report Format for Requirements:

## Requirements Checklist

### Functional Requirements (X/6)
- [x] [FR-1] Search and return match data from all CSV files
- [x] [FR-2] Search and return player data
- [ ] [FR-3] Calculate basic statistics (partial: missing draws)
...

### Query Performance (X/3)
- [x] [QP-1] Simple lookups respond in < 2 seconds
...

### Data Coverage (X/3)
- [x] [DC-1] All 6 CSV files are loadable and queryable
...

### Technical Requirements (X/4)
- [x] [TR-1] MCP server implementation with callable tools
- [x] [TR-2] BDD testing with Given-When-Then structure
...

**Total: X/16 requirements implemented**
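
The per-category counts in this report format can be tallied with a small Python sketch. The function name `tally_requirements` and the status strings are assumptions. Following the example above, where a partial requirement is left unchecked, only "implemented" counts toward the totals.

```python
# Category prefix -> (report heading, number of requirements), per the canonical checklist
CATEGORIES = {
    "FR": ("Functional Requirements", 6),
    "QP": ("Query Performance", 3),
    "DC": ("Data Coverage", 3),
    "TR": ("Technical Requirements", 4),
}


def tally_requirements(statuses):
    """Build the section headings and total line for the requirements report.

    statuses: dict such as {"FR-1": "implemented", "FR-3": "partial"};
    requirements not listed are treated as missing.
    """
    lines = []
    implemented_total = 0
    for prefix, (title, count) in CATEGORIES.items():
        done = sum(
            1 for i in range(1, count + 1)
            if statuses.get(f"{prefix}-{i}") == "implemented"
        )
        implemented_total += done
        lines.append(f"### {title} ({done}/{count})")
    overall = sum(count for _, count in CATEGORIES.values())
    lines.append(f"**Total: {implemented_total}/{overall} requirements implemented**")
    return lines
```

Keeping the category sizes in one table makes it harder for individual reports to drift from the canonical 16-item checklist.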

6a. Real Data vs Simulated Data Assessment

Determine whether the implementation uses real external data or simulated/mock data.

Real Data Indicators:

  • Data loaders for external sources (Kaggle, APIs, etc.)
  • CSV/JSON files in data directory
  • API client code with authentication
  • Data normalization/mapping logic for external schemas

Simulated Data Indicators:

  • Hardcoded test fixtures
  • Factory/faker-generated data
  • Mock data in test files only
  • No external data loading code

Constraints for Real Data Implementations:

  • You MUST note which external data source is used
  • You MUST assess schema mapping quality (how well does the implementation adapt external schema to spec schema)
  • You MUST distinguish between:
    • Schema Implemented: The code defines models matching spec entities
    • Data Populated: The data loader can populate those fields from external source
    • Not Available in Source: Spec field cannot be populated because external data doesn't include it
  • You SHOULD credit implementations that adapt to real-world data constraints
  • You SHOULD note any enhancements beyond spec (e.g., additional fields from richer data sources)

Adjusted Compliance Scoring:

  • If real data is used and a spec field is "Not Available in Source", count it as:
    • Implemented if the model/schema supports the field
    • Note the data limitation separately
  • Example: If spec requires "attendance" but Kaggle data has no attendance:
    • Check if Match model has attendance field (schema compliance)
    • Note that field would be null with Kaggle data (data limitation)
    • This is NOT a failure - it's a data source constraint

6b. Documentation Quality Assessment

Evaluate the README.md for essential user documentation.

Required Elements:

  1. Setup Instructions: Prerequisites, installation steps, environment configuration
  2. MCP Server Setup: How to start the server, how to connect Claude
  3. Example Q&A: Sample questions and expected responses/output

Extraction Commands:

# Check README content
head -100 ./reviews/{attempt_repo}/README.md

# Look for key documentation sections
grep -E "Quick Start|Installation|Setup|MCP|Example|Usage" ./reviews/{attempt_repo}/README.md

Documentation Quality Levels:

| Level | Criteria | In Report |
|-------|----------|-----------|
| Excellent | All 3 elements + extras (architecture, API ref, troubleshooting) | "Comprehensive README" |
| Good | All 3 required elements present | "Good documentation" |
| Acceptable | 2 of 3 elements | "Partial documentation" |
| Poor | 0-1 elements | "Missing documentation" |
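
The quality levels can be applied mechanically once the three required elements have been checked by hand. A minimal sketch follows; the function name and the boolean parameters are assumptions, and judging whether "extras" are present remains a manual decision.

```python
def documentation_level(setup, mcp_setup, examples, has_extras=False):
    """Map the presence of the required README elements to a quality level."""
    present = sum([bool(setup), bool(mcp_setup), bool(examples)])
    if present == 3:
        return "Excellent" if has_extras else "Good"
    if present == 2:
        return "Acceptable"
    return "Poor"
```

For instance, a README with setup instructions and MCP server setup but no example Q&A would be rated "Acceptable".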

Best Practice Reference:

  • 2025-10-30-python-hive: Excellent (Quick Start, MCP config, 15+ demo questions, architecture, troubleshooting)
  • 2025-12-15-python-claude-ruvector: Excellent (detailed setup, claude mcp add example, Q&A with output)

Include in Report:

## Documentation Quality

| Element | Present | Notes |
|---------|---------|-------|
| Setup Instructions | Yes/No | {details} |
| MCP Server Setup | Yes/No | {details} |
| Example Q&A | Yes/No | {details} |

**Assessment:** {Excellent/Good/Acceptable/Poor}

7. Generate Codebase Documentation

Generate comprehensive documentation for the implementation using the codebase-summary SOP.

Constraints:

  • You MUST run the codebase-summary skill on the cloned repository
  • You MUST output documentation to {output_dir}/{attempt_repo}-summary/
  • You SHOULD use the generated documentation to inform the final report
  • The documentation provides architecture, components, interfaces, and workflow analysis

summarize codebase reviews/{attempt_repo} to {output_dir}/{attempt_repo}-summary/

8. Generate Report

Produce structured evaluation output.

Constraints:

  • You MUST write results to {output_dir}/{attempt_repo}.md
  • You MUST include: attempt name, orchestration pattern, all metrics
  • You MUST use consistent format for cross-attempt comparison
  • You SHOULD include raw data as appendix

Output Format

# Evaluation: {attempt_repo}

## Summary
- **Pattern:** [swarm|hive|solo|...]
- **Spec Compliance:** X/Y requirements
- **Tests:** X passed, Y skipped, Z failed (N effective)
- **Autonomous Duration:** Xh Ym
- **Documentation:** See `{attempt_repo}-summary/`

## Metrics
| Metric | Value |
|--------|-------|
| Lines of Code | |
| Files | |
| Dependencies | |
| Commits (Total) | |
| Commits (Agent) | |
| Commits (Human) | |
| Fix Commits | |
| Tests (Total) | |
| Tests (Passed) | |
| Tests (Skipped) | |
| Tests (Effective) | |
| Skip Ratio | |

## Development Duration Breakdown

| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | | Initial commit, file upload |
| **Agent Implementation** | | Core implementation work |
| **Agent Test Iteration** | | Test fixing to 100% pass |
| **Total Autonomous** | | Agent work only |
| **Human Intervention** | | Post-completion changes |

### Timeline

{timestamp} - {commit message} ({phase})
...

### Commit Analysis
- Total commits: X
- Agent commits: X (timespan)
- Human commits: X (description)
- Fix commits: X (context: normal iteration vs rework)

## Requirements Checklist
- [x] [FR-1] Search and return match data from all CSV files
- [ ] [FR-3] Calculate basic statistics (partial: notes)
- [ ] [TR-4] Multiple date format handling (missing)

## Architecture Summary
(Key insights from generated codebase documentation)

## Raw Data
...

Troubleshooting

Clone fails

  • Verify repo exists: gh repo view brazil-bench/{attempt_repo}
  • Check permissions: repo must be public or you need access

Tests won't run due to missing dependencies

  • Try starting Neo4j via Docker (see Step 3a above)
  • If Docker unavailable, search for evidence of prior test runs
  • Check git commits for "100% pass" or similar messages
  • Check prompts.txt for pytest output
  • Document as "CANNOT VERIFY" with evidence found

Neo4j connection errors

  • Verify Neo4j is running: docker ps | grep neo4j
  • Check credentials match: NEO4J_AUTH=neo4j/password
  • Wait for startup: Neo4j needs ~10-15 seconds to initialize
  • Check logs: docker logs neo4j-eval

Spec diff shows changes

  • Fail the evaluation
  • Note the changes in the report
  • This invalidates the benchmark comparison

Codebase documentation fails

  • Verify the codebase-summary skill is available
  • Check that the codebase-path exists and contains code
  • Ensure the output directory is writable
  • Try running the skill standalone first to debug


Metadata

License: unknown
Version: 1.5
Updated: 1/13/2026
Publisher: brazil-bench

Tags

api, ci-cd, database, github-actions, llm, observability, security, testing