Evaluate Benchmark Attempt
Overview
This SOP evaluates a completed brazil-bench attempt against the spec.md requirements, capturing metrics for comparison across orchestration patterns. Supports both Python and Swift/iOS implementations.
Parameters
- attempt_repo (required): Repository name (e.g., `attempt-3`)
- output_dir (optional, default: `./results`): Where to write evaluation results
Steps
0. Detect Project Language
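Identify the primary language/platform of the implementation. The detection commands operate on the cloned directory, so run them once the clone from Step 1 exists.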
Detection Commands:
cd ./reviews/{attempt_repo}
# Python indicators
ls pyproject.toml setup.py requirements.txt 2>/dev/null
# Swift/iOS indicators
ls Package.swift *.xcodeproj *.xcworkspace 2>/dev/null
# Check file extensions
find . -name "*.py" -not -path "./.venv/*" | head -5
find . -name "*.swift" | head -5
Language Detection Matrix:
| Files Found | Language | Test Framework |
|---|---|---|
| `pyproject.toml`, `*.py` | Python | pytest |
| `Package.swift`, `*.swift` | Swift Package | swift test |
| `*.xcodeproj`, `*.swift` | iOS/Xcode | xcodebuild test |
| Both Python and Swift | Multi-language | Run both |
Constraints:
- You MUST detect the language before running tests
- You MUST use appropriate commands for the detected language
- You SHOULD note the detected language in the report
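The detection matrix above can also be applied programmatically. The following is a minimal sketch, assuming the marker files sit in the repository root; the function name `detect_languages` and the returned labels are illustrative and not part of the SOP.

```python
from pathlib import Path


def detect_languages(repo: Path) -> list[str]:
    """Return the platforms suggested by marker files in the repo root."""
    found = []
    # Python markers, mirroring the `ls` check in the detection commands
    if any((repo / name).exists() for name in ("pyproject.toml", "setup.py", "requirements.txt")):
        found.append("Python (pytest)")
    # Swift Package Manager manifest
    if (repo / "Package.swift").exists():
        found.append("Swift Package (swift test)")
    # Xcode project or workspace bundles
    if any(repo.glob("*.xcodeproj")) or any(repo.glob("*.xcworkspace")):
        found.append("iOS/Xcode (xcodebuild test)")
    return found
```

More than one entry in the result corresponds to the "Multi-language" row, in which case both test suites are run.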
1. Clone Attempt
Fetch the attempt repository for local analysis.
Constraints:
- You MUST clone into `./reviews/{attempt_repo}`
- You MUST verify the clone succeeded before proceeding
- You MUST NOT modify any files in the cloned repo
gh repo clone brazil-bench/{attempt_repo} ./reviews/{attempt_repo}
2. Verify Spec Integrity
Confirm the spec.md was not modified from the template.
Constraints:
- You MUST compare `spec.md` against the template version
- You MUST fail the evaluation if spec.md was modified
- You SHOULD use a checksum comparison
gh repo clone brazil-bench/benchmark-template ./reviews/_template --depth 1
diff ./reviews/{attempt_repo}/spec.md ./reviews/_template/spec.md
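The constraints recommend a checksum comparison, while the commands above use `diff`. A minimal sketch of the checksum variant follows, assuming SHA-256 is an acceptable digest; the helper names `sha256_of` and `spec_unmodified` are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def spec_unmodified(attempt_spec: Path, template_spec: Path) -> bool:
    """True when both spec files are byte-for-byte identical."""
    return sha256_of(attempt_spec) == sha256_of(template_spec)
```

A byte-level digest also flags whitespace or line-ending changes, so keep the `diff` output for the report when the check fails.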
3. Run Conformance Tests
Execute the test suite defined in the spec against the implementation.
Constraints:
- You MUST attempt to run all tests specified in spec.md
- You MUST capture pass/fail counts and output
- You SHOULD time out each test after 60 seconds
- You MAY retry flaky tests once
- If tests fail due to missing dependencies, follow the dependency resolution steps below
Python Test Commands
cd ./reviews/{attempt_repo}
# Run pytest with verbose output
pytest --tb=short -v 2>&1 | tee test_output.log
# Get summary counts
pytest --tb=no -q 2>&1 | tail -5
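To capture pass/fail counts from the log rather than reading them by eye, the final pytest summary line can be parsed. This is a sketch that assumes the usual summary wording (for example `44 passed, 15 skipped in 3.21s`); the function name `parse_pytest_summary` is illustrative, and other outcome labels such as `xfailed` are ignored here.

```python
import re


def parse_pytest_summary(line: str) -> dict[str, int]:
    """Extract outcome counts from a pytest summary line."""
    counts = {"passed": 0, "failed": 0, "skipped": 0, "errors": 0}
    # Matches fragments such as "44 passed" or "1 error"
    for number, label in re.findall(r"(\d+) (passed|failed|skipped|errors?)\b", line):
        key = "errors" if label.startswith("error") else label
        counts[key] = int(number)
    return counts
```

Feed it the last non-empty line of `test_output.log`; a result of all zeros suggests the run aborted before any summary was printed.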
Swift/iOS Test Commands
cd ./reviews/{attempt_repo}
# Swift Package Manager
swift test 2>&1 | tee test_output.log
# Xcode project (iOS Simulator)
xcodebuild test \
-project *.xcodeproj \
-scheme "YourScheme" \
-destination 'platform=iOS Simulator,name=iPhone 15' \
2>&1 | tee test_output.log
# Parse xcodebuild results
grep -E "(Test Case|passed|failed)" test_output.log
# Using xcpretty for cleaner output (if available)
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
-destination 'platform=iOS Simulator,name=iPhone 15' \
| xcpretty --report junit
3a. Handle Missing Dependencies (Neo4j, etc.)
If tests fail due to missing external dependencies like Neo4j:
Step 1: Try to start the dependency via Docker
# Check if Docker is available
docker --version
# Check for docker-compose files in the repo
ls ./reviews/{attempt_repo}/docker-compose*.yml
# If Neo4j docker-compose exists, start it
docker-compose -f ./reviews/{attempt_repo}/docker-compose.neo4j.yml up -d
# Or start Neo4j directly
docker run -d --name neo4j-eval -p 7474:7474 -p 7687:7687 \
-e NEO4J_AUTH=neo4j/password \
neo4j:5
# Wait for Neo4j to be ready
sleep 10
docker logs neo4j-eval 2>&1 | tail -5
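A fixed `sleep 10` may be too short on a slow machine. As an alternative, the sketch below polls the Bolt port until it accepts TCP connections. It assumes the port mapping from the `docker run` command above (7687 on localhost); `wait_for_port` is an illustrative helper, and an open port does not by itself prove the database is ready to serve queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, pause: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=pause):
                return True
        except OSError:
            # Port not accepting connections yet; wait and retry
            time.sleep(pause)
    return False
```

For example, `wait_for_port("localhost", 7687)` returns `False` after a minute if the container never came up, which is the cue to move on to Step 2.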
Step 2: If Docker unavailable or fails, look for evidence of prior test runs
Check these sources for test results:
# Check git history for test-related commits
git log --oneline --all | grep -iE "(test|pass|100%|fix.*test)"
# Check for CI/CD logs or badges
cat ./reviews/{attempt_repo}/README.md | grep -iE "(pass|badge|ci|test)"
# Check prompts.txt for test execution evidence
cat ./reviews/{attempt_repo}/prompts.txt 2>/dev/null | grep -iE "(pytest|test|pass|fail|scenario)"
# Check for pytest cache with results
ls -la ./reviews/{attempt_repo}/.pytest_cache/ 2>/dev/null
# Check for coverage reports
ls -la ./reviews/{attempt_repo}/htmlcov/ ./reviews/{attempt_repo}/coverage.xml 2>/dev/null
Step 3: Document findings in the report
If tests cannot be run directly, document:
- Why tests couldn't run (missing Neo4j, etc.)
- Evidence found of prior test runs (commit messages, prompts.txt entries)
- Claimed test results from the attempt's documentation
- Mark as "CANNOT VERIFY" with explanation
Constraints for dependency handling:
- You MUST try Docker first if available
- You MUST search for evidence if Docker fails
- You MUST NOT claim tests pass without verification
- You SHOULD note the source of any claimed test results
- You SHOULD clean up Docker containers after evaluation:
docker stop neo4j-eval && docker rm neo4j-eval
3b. Detect Skipped Tests
Skipped tests inflate test counts without providing actual verification. You MUST detect and report them separately.
Python: Detect Skipped Tests
Step 1: Run pytest with verbose output to capture skipped tests
cd ./reviews/{attempt_repo}
# Run pytest and capture skip count
pytest --tb=no -v 2>&1 | grep -E "(PASSED|FAILED|SKIPPED|ERROR)" | head -100
# Get summary counts
pytest --tb=no -q 2>&1 | tail -5
# Look for skip patterns in test files
grep -r "pytest.skip\|@pytest.mark.skip\|skipif\|xfail" tests/ --include="*.py"
Step 2: Analyze test files for skip patterns
# Count pytest.skip() calls inside test bodies (worst pattern)
grep -r "pytest\.skip(" tests/ --include="*.py" | wc -l
# Count @pytest.mark.skip decorators (excluding skipif, which is counted below)
grep -r "@pytest\.mark\.skip" tests/ --include="*.py" | grep -v "skipif" | wc -l
# Count conditional skips (skipif)
grep -r "@pytest\.mark\.skipif" tests/ --include="*.py" | wc -l
Swift/iOS: Detect Skipped Tests
Step 1: Run swift test or xcodebuild and capture skipped tests
cd ./reviews/{attempt_repo}
# Swift Package Manager - look for skipped in output
swift test 2>&1 | grep -E "(passed|failed|skipped)"
# Xcode - parse test results
xcodebuild test -project *.xcodeproj -scheme "YourScheme" \
-destination 'platform=iOS Simulator,name=iPhone 15' \
2>&1 | grep -E "Test Case.*passed|Test Case.*failed|skipped"
Step 2: Analyze test files for skip patterns
# Count XCTSkip usage (explicit skips)
grep -r "XCTSkip\|throw XCTSkip" Tests/ --include="*.swift" | wc -l
# Count disabled tests (func name doesn't start with test)
grep -r "func disabled_test\|// func test" Tests/ --include="*.swift" | wc -l
# Count tests with availability checks that skip
grep -r "@available\|#available" Tests/ --include="*.swift" -A 2 | grep -i skip | wc -l
# Look for conditional test execution
grep -r "guard.*else.*return\|if.*XCTSkip" Tests/ --include="*.swift" | wc -l
Swift Skip Patterns:
| Pattern | Type | Assessment |
|---|---|---|
| `throw XCTSkip("reason")` | Explicit skip | Acceptable if documented |
| `#if !targetEnvironment(simulator)` | Conditional | Acceptable for device-only |
| `@available(iOS 16, *)` | Version skip | Acceptable |
| Renamed to `disabled_testFoo` | Hidden skip | Should be penalized |
| Empty test body | Stub | Should be penalized |
Step 3: Calculate effective test count
| Metric | How to Calculate |
|---|---|
| Total Tests | Number of test functions defined |
| Passed Tests | Tests that ran and passed |
| Skipped Tests | Tests marked skip or calling pytest.skip() |
| Effective Tests | Total - Skipped (tests that actually run) |
| Skip Ratio | Skipped / Total (percentage of tests that skip) |
Constraints for skipped test handling:
- You MUST report skipped tests separately from passed tests
- You MUST calculate the "effective test count" (passed + failed, excluding skipped)
- You MUST flag ANY skipped tests for issue filing - zero tolerance for skips
- You MUST distinguish between skip types for the issue description:
  - Conditional skips (`@pytest.mark.skipif`): Document reason in issue
  - Unconditional skips (`pytest.skip()` in body): Critical - tests never run
  - Decorator skips (`@pytest.mark.skip`): Document reason in issue
- You MUST NOT count skipped tests toward the test score in rankings
- You MUST file an issue for ANY skipped test (no acceptable skip threshold)
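The effective test count and skip ratio defined in the table above reduce to a few lines of arithmetic. The sketch below assumes the total is the sum of passed, failed, and skipped tests (collection errors are not modelled); `summarize_tests` is an illustrative name.

```python
def summarize_tests(passed: int, failed: int, skipped: int) -> dict[str, float]:
    """Compute total, effective count, and skip ratio from raw outcome counts."""
    total = passed + failed + skipped
    effective = total - skipped  # tests that actually ran
    skip_ratio = skipped / total if total else 0.0
    return {"total": total, "effective": effective, "skip_ratio": skip_ratio}


summary = summarize_tests(passed=44, failed=0, skipped=15)
print(f"Effective: {summary['effective']} of {summary['total']} "
      f"(skip ratio {summary['skip_ratio']:.0%})")
# prints: Effective: 44 of 59 (skip ratio 25%)
```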
Example Analysis:
Total tests: 59
Passed: 44
Skipped: 15 (25% skip ratio - HIGH)
Failed: 0
Effective: 44 (use this for scoring, not 59)
Skip breakdown:
- pytest.skip() in body: 15 (integration tests that never run)
- @pytest.mark.skipif: 0
- @pytest.mark.skip: 0
Flag: INFLATED TEST COUNT - 15 tests skip unconditionally
Document in Report:
## Test Results
| Metric | Count |
|--------|-------|
| Total Tests | 59 |
| Passed | 44 |
| **Skipped** | **15** |
| Failed | 0 |
| **Effective Tests** | **44** |
| Skip Ratio | 25% |
⚠️ **Warning:** 15 tests (25%) are skipped and never execute.
These are integration tests that call `pytest.skip()` inside the test body.
The effective test count for scoring is 44, not 59.
3c. Self-Contained Integration Tests (REQUIRED)
Integration tests MUST be self-contained and actually run. Tests that skip because "Neo4j not available" or similar are not acceptable.
Requirement: Integration tests must start their own data stores as needed.
Detection Commands:
cd ./reviews/{attempt_repo}
# Check for testcontainers usage (Python)
grep -r "testcontainers\|TestContainer\|DockerContainer" tests/ --include="*.py"
# Check for docker-compose in test setup
grep -r "docker-compose\|subprocess.*docker" tests/ --include="*.py"
# Check for pytest-docker fixture
grep -r "pytest-docker\|docker_compose" tests/ --include="*.py" pyproject.toml
# Check for in-memory alternatives (e.g., SQLite instead of Postgres)
grep -r "sqlite.*memory\|:memory:\|MockNeo4j\|FakeNeo4j" tests/ --include="*.py"
# Check for conftest fixtures that start services
grep -A 20 "@pytest.fixture" tests/conftest.py 2>/dev/null | grep -E "docker|container|start|neo4j"
# Swift: Check for test containers
grep -r "Docker\|Container\|TestServer" Tests/ --include="*.swift"
Acceptable Patterns for Self-Contained Tests:
| Pattern | Example | Assessment |
|---|---|---|
| testcontainers | Neo4jContainer() in fixture | ✓ Best - automatic lifecycle |
| pytest-docker | docker_compose_file fixture | ✓ Good - compose-based |
| conftest startup | Fixture runs docker run neo4j | ✓ Acceptable - manual but works |
| In-memory mock | MockNeo4jClient class | ✗ NOT acceptable - not persistent |
| External dependency | pytest.skip("Neo4j not running") | ✗ NOT acceptable |
| CI-only tests | @pytest.mark.skipif(not CI) | ✗ NOT acceptable |
| No integration tests | No tests for data layer | ✗ NOT acceptable |
Example: testcontainers Pattern (Python)
# conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer
# Neo4jClient is the attempt's own client class; import it from the implementation
@pytest.fixture(scope="session")
def neo4j_container():
    """Start Neo4j container for integration tests."""
    with Neo4jContainer("neo4j:5") as neo4j:
        yield neo4j
@pytest.fixture
def neo4j_client(neo4j_container):
    """Get client connected to test container."""
    return Neo4jClient(
        uri=neo4j_container.get_connection_url(),
        auth=("neo4j", "password")
    )
Example: pytest-docker Pattern
# conftest.py
import pytest
# is_neo4j_ready() is a helper the attempt must define (e.g., a connection probe)
@pytest.fixture(scope="session")
def docker_compose_file():
    return "docker-compose.test.yml"
@pytest.fixture(scope="session")
def neo4j_service(docker_services):
    """Wait for Neo4j to be ready."""
    docker_services.wait_until_responsive(
        timeout=30.0,
        pause=0.5,
        check=lambda: is_neo4j_ready()
    )
Scoring Impact:
| Integration Test Quality | Score Modifier |
|---|---|
| Self-contained (testcontainers/docker) | No penalty |
| In-memory mock (not persistent) | -10 points quality |
| Skips due to missing dependency | -10 points quality |
| No integration tests at all | -15 points quality |
Constraints:
- You MUST check if integration tests are self-contained
- You MUST flag tests that skip due to external dependencies
- You MUST NOT accept "works on CI" as justification for skipping locally
- You SHOULD recommend testcontainers or pytest-docker patterns
- You SHOULD verify integration tests actually execute (not just exist)
Document in Report:
## Integration Test Quality
| Aspect | Status |
|--------|--------|
| Self-contained | Yes/No |
| Data store management | testcontainers / docker-compose / mock / external |
| Integration tests run | X passed, Y skipped |
⚠️ **Issue:** Integration tests skip when Neo4j is not running.
Tests should use testcontainers or pytest-docker to manage dependencies.
3d. Context Header Blocks (REQUIRED)
Every source code file MUST have a context header comment block that documents:
- Purpose - What the file/module does
- Interfaces - Key classes, functions, or APIs exposed
- Change History - Record of modifications (updated on every change)
Detection Commands:
cd ./reviews/{attempt_repo}
# Python: Check for docstrings or header comments in source files
for f in $(find src -name "*.py" -not -name "__init__.py"); do
echo "=== $f ==="
head -50 "$f" | grep -E '""".*|^#.*Purpose|^#.*Context|CONTEXT BLOCK|Change History|Interfaces'
done
# Swift: Check for header comments
for f in $(find Sources -name "*.swift" 2>/dev/null); do
echo "=== $f ==="
head -50 "$f" | grep -E '///|/\*\*|Purpose|Context|History'
done
# Count files with context headers vs total
total=$(find src -name "*.py" -not -name "__init__.py" | wc -l)
with_header=$(find src -name "*.py" -not -name "__init__.py" -exec grep -lE "CONTEXT|Purpose|Module:" {} + | wc -l)
echo "Files with headers: $with_header / $total"
Required Header Format (Python):
"""
================================================================================
CONTEXT BLOCK
================================================================================
File: {filename}
Module: {module.path}
Purpose: {one-line description}
Description:
{detailed description of what this module does}
Interfaces:
- {ClassName}: {brief description}
- {function_name}(): {brief description}
Dependencies:
- {module}: {why needed}
Change History:
- {date}: {description of change}
- {date}: Initial creation
================================================================================
"""
Required Header Format (Swift):
//
// {FileName}.swift
// {ProjectName}
//
// Purpose: {one-line description}
//
// Interfaces:
// - {ClassName}: {brief description}
// - {functionName}(): {brief description}
//
// Change History:
// - {date}: {description of change}
// - {date}: Initial creation
//
Assessment Criteria:
| Coverage | Assessment | Score Impact |
|---|---|---|
| 100% files have headers | Excellent | No penalty |
| 75-99% files have headers | Good | -2 quality |
| 50-74% files have headers | Partial | -5 quality |
| <50% files have headers | Poor | -10 quality |
Constraints:
- You MUST check all source files for context headers
- You MUST verify headers include purpose, interfaces, and change history
- You MUST flag files missing headers for issue filing
- You SHOULD note which files have incomplete headers (missing sections)
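The header check and the coverage thresholds can be combined into two small helpers. This is a sketch under the assumption that each required section appears literally as `Purpose:`, `Interfaces:`, and `Change History:` within the first 50 lines, as in the formats above; `missing_header_sections` and `header_penalty` are illustrative names.

```python
from pathlib import Path

REQUIRED_SECTIONS = ("Purpose", "Interfaces", "Change History")


def missing_header_sections(path: Path, max_lines: int = 50) -> list[str]:
    """Return the required header sections absent from the top of a file."""
    lines = path.read_text(encoding="utf-8", errors="replace").splitlines()[:max_lines]
    head = "\n".join(lines)
    return [section for section in REQUIRED_SECTIONS if f"{section}:" not in head]


def header_penalty(coverage_pct: float) -> int:
    """Map header coverage to the quality penalty from the assessment table."""
    if coverage_pct >= 100:
        return 0
    if coverage_pct >= 75:
        return -2
    if coverage_pct >= 50:
        return -5
    return -10
```

A file counts toward coverage only when `missing_header_sections` returns an empty list; a partial result identifies the incomplete headers to list in the report.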
Document in Report:
## Context Header Compliance
| Metric | Count |
|--------|-------|
| Source files | X |
| With headers | Y |
| Coverage | Z% |
### Files Missing Headers
- `src/module.py` - No header
- `src/utils.py` - Missing change history
### Assessment
{Excellent/Good/Partial/Poor} - {X}% coverage
4. Measure Code Metrics
Collect quantitative data about the implementation.
Constraints:
- You MUST capture: total lines of code, number of files, dependencies
- You SHOULD capture: cyclomatic complexity, test coverage
- You MAY capture: documentation coverage, type hint coverage
Python Metrics
# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/src -name "*.py" | xargs wc -l
# Dependencies
cat ./reviews/{attempt_repo}/pyproject.toml | grep dependencies -A 50
# File count
find ./reviews/{attempt_repo}/src -name "*.py" | wc -l
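When the same numbers are needed in a script, the file and line counts can be collected together. This sketch assumes sources live under a single root such as `src/` and counts every line, including blanks and comments, much as `wc -l` does; `count_source_lines` is an illustrative name.

```python
from pathlib import Path


def count_source_lines(src_root: Path, suffix: str = ".py") -> tuple[int, int]:
    """Return (file count, total line count) for files with the given suffix."""
    files = [p for p in src_root.rglob(f"*{suffix}") if p.is_file()]
    lines = sum(
        len(p.read_text(encoding="utf-8", errors="replace").splitlines())
        for p in files
    )
    return len(files), lines
```

Passing `suffix=".swift"` with the `Sources` directory gives the equivalent Swift figures.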
Swift/iOS Metrics
# Lines of code (excluding tests)
find ./reviews/{attempt_repo}/Sources -name "*.swift" | xargs wc -l
# For Xcode projects
find ./reviews/{attempt_repo} -name "*.swift" -not -path "*/Tests/*" -not -path "*Test*" | xargs wc -l
# Dependencies (Swift Package Manager)
cat ./reviews/{attempt_repo}/Package.swift | grep -A 50 "dependencies:"
# Dependencies (CocoaPods)
cat ./reviews/{attempt_repo}/Podfile 2>/dev/null
# Dependencies (Xcode project - SPM)
grep -r "repositoryURL" ./reviews/{attempt_repo}/*.xcodeproj/project.pbxproj 2>/dev/null | head -20
# File count
find ./reviews/{attempt_repo}/Sources -name "*.swift" | wc -l
# Check for SwiftLint configuration
ls ./reviews/{attempt_repo}/.swiftlint.yml 2>/dev/null
5. Extract Git Metrics and Analyze Development Timeline
Analyze the development history to separate agent-driven work from human interactions.
Constraints:
- You MUST capture: total commits, time from first to last commit
- You MUST separate commits into agent-driven vs human-driven phases
- You MUST calculate autonomous duration (agent work only)
- You SHOULD capture: number of reverts, force pushes (if detectable)
- You SHOULD extract commit messages mentioning "fix", "revert", "oops"
5a. Gather Raw Git Data
cd ./reviews/{attempt_repo}
# Full commit history with timestamps and messages
git log --format="%ai | %H | %s" --reverse
# Count total commits
git log --oneline | wc -l
# Find fix/revert commits
git log --format="%H %s" | grep -iE "(fix|revert|oops|wrong)"
# Get first and last commit times
git log --format="%ai" --reverse | head -1 # First commit
git log --format="%ai" | head -1 # Last commit
5b. Identify Development Phases
Analyze commit timestamps and messages to identify distinct phases:
Phase 1: Setup (Human)
- Initial commit, repo setup, file uploads
- Typically first 1-3 commits before implementation starts
- Look for: "Initial commit", "Add files", "upload", "setup"
Phase 2: Agent Implementation
- Bulk implementation work by the agent
- Characterized by:
- Rapid succession of commits (minutes apart)
- Large code changes
- Messages like "Implement", "Add", "Create"
- Consistent commit patterns (same author, similar timing)
Phase 3: Agent Test Iteration
- Test fixing and iteration by the agent
- Characterized by:
- Commits mentioning "fix", "test", "pass"
- Still rapid succession
- Often shows progression: "Fix X" → "Fix Y" → "100% pass"
Phase 4: Human Intervention (Post-Completion)
- Human-driven changes after agent work completes
- Characterized by:
- Time gaps (hours/days after previous commits)
- Different commit patterns or author info
- Messages about data, documentation, cleanup
- Changes not required by the spec
5c. Heuristics for Identifying Agent vs Human Commits
Agent commits typically show:
- Timestamps within minutes of each other
- Consistent formatting in commit messages
- Co-authored-by lines mentioning Claude/AI
- Large, comprehensive changes
- Focus on implementation and tests
Human commits typically show:
- Time gaps of hours or days from previous work
- Different commit message style
- Focus on data, docs, or polish
- Smaller, targeted changes
- Work done after "100% tests pass" milestone
# Look for time gaps > 1 hour between commits (potential phase boundaries)
git log --format="%at %ai" --reverse | awk 'NR > 1 && $1 - prev > 3600 { print "gap before:", $2, $3 } { prev = $1 }'
# Check for co-author lines indicating AI
git log --format="%b" | grep -i "co-authored"
# Check prompts.txt for session boundaries
cat prompts.txt 2>/dev/null | grep -E "^(Done|Session|Agent)"
5d. Calculate Duration Metrics
| Metric | How to Calculate |
|---|---|
| Total Duration | Last commit - First commit |
| Agent Duration | Sum of time during agent phases only |
| Human Duration | Sum of time during human phases |
| Autonomous Duration | Phase 2 + Phase 3 (implementation + test fixing) |
Example Timeline Analysis:
09:00:00 - Initial commit (Human Setup)
09:05:00 - Add spec file (Human Setup)
--- Agent work begins ---
09:15:00 - Implement Phase 1 (Agent)
09:45:00 - Implement Phase 2 (Agent)
10:10:00 - Implement Phase 3 (Agent)
10:25:00 - Fix test issues (Agent)
10:40:00 - 100% tests pass (Agent)
--- Agent work ends ---
--- 2 day gap ---
Oct 3 - Add real data (Human)
Oct 3 - Update docs (Human)
Agent Duration: ~1h 25m (09:15 → 10:40)
Human Duration: ~5m setup + later changes
Autonomous Duration: ~1h 25m
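The example timeline can be reproduced with a small grouping routine. The sketch below splits commit times into sessions wherever the gap exceeds one hour, then measures the agent span. The year and the Oct 3 clock times are made up for illustration, `split_on_gaps` is a hypothetical helper, and the setup commits still have to be separated by inspecting messages or authors, because a 10-minute gap is too short to detect.

```python
from datetime import datetime, timedelta


def split_on_gaps(timestamps, max_gap=timedelta(hours=1)):
    """Group commit times into sessions separated by gaps longer than max_gap."""
    sessions: list[list[datetime]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Commit times from the example timeline (year and Oct 3 times are illustrative)
day1 = [datetime(2025, 10, 1, h, m)
        for h, m in [(9, 0), (9, 5), (9, 15), (9, 45), (10, 10), (10, 25), (10, 40)]]
day3 = [datetime(2025, 10, 3, 12, 0), datetime(2025, 10, 3, 12, 30)]

sessions = split_on_gaps(day1 + day3)
agent_commits = sessions[0][2:]  # drop the two human setup commits identified manually
agent_duration = agent_commits[-1] - agent_commits[0]
print(len(sessions), agent_duration)
# prints: 2 1:25:00
```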
5e. Document in Report
Include a Development Timeline section in the report:
## Development Duration Breakdown
| Phase | Duration | Description |
|-------|----------|-------------|
| **Phase 1: Setup (Human)** | ~5 min | Initial commit, file upload |
| **Phase 2: Agent Implementation** | ~55 min | Agent implements all phases |
| **Phase 3: Agent Test Iteration** | ~30 min | Agent iterates to 100% pass |
| **Total Autonomous** | **~1h 25m** | Agent work only (Phase 2 + Phase 3) |
| **Phase 4: Human Intervention** | 2 days later | Data and docs added |
### Commit Analysis
- Total commits: 15
- Agent commits: 10 (09:15 - 10:40 on Day 1)
- Human commits: 5 (setup + Day 3 changes)
- Fix commits: 3 (normal iteration, not rework)
6. Analyze Against Spec
Review implementation completeness against spec.md requirements.
Constraints:
- You MUST evaluate against ALL 16 canonical requirements listed below
- You MUST assess each as: implemented, partial, missing
- You SHOULD note implementation approach for each
- You MUST NOT make subjective quality judgments beyond spec compliance
- You MUST use the exact requirement numbering for cross-attempt consistency
6.0 Canonical Requirements Checklist (16 Requirements)
All evaluations MUST use this exact checklist to ensure consistency across attempts.
Functional Requirements (6):
- [FR-1] Search and return match data from all CSV files
- [FR-2] Search and return player data
- [FR-3] Calculate basic statistics (wins, losses, goals)
- [FR-4] Compare teams head-to-head
- [FR-5] Handle team name variations correctly
- [FR-6] Return properly formatted responses
Query Performance (3):
- [QP-1] Simple lookups respond in < 2 seconds
- [QP-2] Aggregate queries respond in < 5 seconds
- [QP-3] No timeout errors
Data Coverage (3):
- [DC-1] All 6 CSV files are loadable and queryable
- [DC-2] At least 20 sample questions can be answered
- [DC-3] Cross-file queries work (player + match data)
Technical Requirements (4):
- [TR-1] MCP server implementation with callable tools
- [TR-2] BDD testing with Given-When-Then structure
- [TR-3] UTF-8 encoding support (Portuguese characters: ã, ç, é, etc.)
- [TR-4] Multiple date format handling (ISO, Brazilian DD/MM/YYYY, with time)
Report Format for Requirements:
## Requirements Checklist
### Functional Requirements (X/6)
- [x] [FR-1] Search and return match data from all CSV files
- [x] [FR-2] Search and return player data
- [ ] [FR-3] Calculate basic statistics (partial: missing draws)
...
### Query Performance (X/3)
- [x] [QP-1] Simple lookups respond in < 2 seconds
...
### Data Coverage (X/3)
- [x] [DC-1] All 6 CSV files are loadable and queryable
...
### Technical Requirements (X/4)
- [x] [TR-1] MCP server implementation with callable tools
- [x] [TR-2] BDD testing with Given-When-Then structure
...
**Total: X/16 requirements implemented**
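The per-category totals in the report can be derived from a mapping of requirement IDs to statuses. This sketch assumes that only `implemented` counts toward a total, matching the unchecked `partial` item in the format above; `GROUP_SIZES` and `tally_requirements` are illustrative names.

```python
# Category sizes from the canonical checklist
GROUP_SIZES = {"FR": 6, "QP": 3, "DC": 3, "TR": 4}


def tally_requirements(statuses: dict[str, str]) -> dict[str, int]:
    """Count implemented requirements per category prefix (FR, QP, DC, TR)."""
    counts = {prefix: 0 for prefix in GROUP_SIZES}
    for req_id, status in statuses.items():
        prefix = req_id.split("-")[0]
        if prefix not in counts:
            raise ValueError(f"unknown requirement id: {req_id}")
        if status == "implemented":
            counts[prefix] += 1
    return counts
```

For example, `{"FR-1": "implemented", "FR-3": "partial"}` yields 1/6 for the functional group, and the sum of all values gives the X in "X/16".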
6a. Real Data vs Simulated Data Assessment
Determine whether the implementation uses real external data or simulated/mock data.
Real Data Indicators:
- Data loaders for external sources (Kaggle, APIs, etc.)
- CSV/JSON files in data directory
- API client code with authentication
- Data normalization/mapping logic for external schemas
Simulated Data Indicators:
- Hardcoded test fixtures
- Factory/faker-generated data
- Mock data in test files only
- No external data loading code
Constraints for Real Data Implementations:
- You MUST note which external data source is used
- You MUST assess schema mapping quality (how well does the implementation adapt external schema to spec schema)
- You MUST distinguish between:
- Schema Implemented: The code defines models matching spec entities
- Data Populated: The data loader can populate those fields from external source
- Not Available in Source: Spec field cannot be populated because external data doesn't include it
- You SHOULD credit implementations that adapt to real-world data constraints
- You SHOULD note any enhancements beyond spec (e.g., additional fields from richer data sources)
Adjusted Compliance Scoring:
- If real data is used and a spec field is "Not Available in Source", count it as:
- Implemented if the model/schema supports the field
- Note the data limitation separately
- Example: If spec requires "attendance" but Kaggle data has no attendance:
- Check if Match model has attendance field (schema compliance)
- Note that field would be null with Kaggle data (data limitation)
- This is NOT a failure - it's a data source constraint
6b. Documentation Quality Assessment
Evaluate the README.md for essential user documentation.
Required Elements:
- Setup Instructions: Prerequisites, installation steps, environment configuration
- MCP Server Setup: How to start the server, how to connect Claude
- Example Q&A: Sample questions and expected responses/output
Extraction Commands:
# Check README content
head -100 ./reviews/{attempt_repo}/README.md
# Look for key documentation sections
grep -E "Quick Start|Installation|Setup|MCP|Example|Usage" ./reviews/{attempt_repo}/README.md
Documentation Quality Levels:
| Level | Criteria | In Report |
|---|---|---|
| Excellent | All 3 elements + extras (architecture, API ref, troubleshooting) | "Comprehensive README" |
| Good | All 3 required elements present | "Good documentation" |
| Acceptable | 2 of 3 elements | "Partial documentation" |
| Poor | 0-1 elements | "Missing documentation" |
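The quality levels above map directly onto the number of required elements found. A minimal sketch follows, assuming "extras" is a single yes/no judgment by the evaluator; `documentation_level` is an illustrative name.

```python
def documentation_level(setup: bool, mcp_setup: bool, examples: bool, extras: bool = False) -> str:
    """Map the presence of README elements to a documentation quality level."""
    present = sum([setup, mcp_setup, examples])
    if present == 3:
        # Extras only matter once all three required elements exist
        return "Excellent" if extras else "Good"
    if present == 2:
        return "Acceptable"
    return "Poor"
```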
Best Practice Reference:
- 2025-10-30-python-hive: Excellent (Quick Start, MCP config, 15+ demo questions, architecture, troubleshooting)
- 2025-12-15-python-claude-ruvector: Excellent (detailed setup, `claude mcp add` example, Q&A with output)
Include in Report:
## Documentation Quality
| Element | Present | Notes |
|---------|---------|-------|
| Setup Instructions | Yes/No | {details} |
| MCP Server Setup | Yes/No | {details} |
| Example Q&A | Yes/No | {details} |
**Assessment:** {Excellent/Good/Acceptable/Poor}
7. Generate Codebase Documentation
Generate comprehensive documentation for the implementation using the codebase-summary SOP.
Constraints:
- You MUST run the codebase-summary skill on the cloned repository
- You MUST output documentation to `{output_dir}/{attempt_repo}-summary/`
- You SHOULD use the generated documentation to inform the final report
- The documentation provides architecture, components, interfaces, and workflow analysis
summarize codebase reviews/{attempt_repo} to {output_dir}/{attempt_repo}-summary/
8. Generate Report
Produce structured evaluation output.
Constraints:
- You MUST write results to `{output_dir}/{attempt_repo}.md`
- You MUST include: attempt name, orchestration pattern, all metrics
- You MUST use consistent format for cross-attempt comparison
- You SHOULD include raw data as appendix
Output Format
# Evaluation: {attempt_repo}
## Summary
- **Pattern:** [swarm|hive|solo|...]
- **Spec Compliance:** X/Y requirements
- **Tests:** X passed, Y skipped, Z failed (X + Z effective)
- **Autonomous Duration:** Xh Ym
- **Documentation:** See `{attempt_repo}-summary/`
## Metrics
| Metric | Value |
|--------|-------|
| Lines of Code | |
| Files | |
| Dependencies | |
| Commits (Total) | |
| Commits (Agent) | |
| Commits (Human) | |
| Fix Commits | |
| Tests (Total) | |
| Tests (Passed) | |
| Tests (Skipped) | |
| Tests (Effective) | |
| Skip Ratio | |
## Development Duration Breakdown
| Phase | Duration | Description |
|-------|----------|-------------|
| **Setup (Human)** | | Initial commit, file upload |
| **Agent Implementation** | | Core implementation work |
| **Agent Test Iteration** | | Test fixing to 100% pass |
| **Total Autonomous** | | Agent work only |
| **Human Intervention** | | Post-completion changes |
### Timeline
{timestamp} - {commit message} ({phase}) ...
### Commit Analysis
- Total commits: X
- Agent commits: X (timespan)
- Human commits: X (description)
- Fix commits: X (context: normal iteration vs rework)
## Requirements Checklist
- [x] Requirement 1
- [ ] Requirement 2 (partial: notes)
- [ ] Requirement 3 (missing)
## Architecture Summary
(Key insights from generated codebase documentation)
## Raw Data
...
Troubleshooting
Clone fails
- Verify repo exists: `gh repo view brazil-bench/{attempt_repo}`
- Check permissions: repo must be public or you need access
Tests won't run due to missing dependencies
- Try starting Neo4j via Docker (see Step 3a above)
- If Docker unavailable, search for evidence of prior test runs
- Check git commits for "100% pass" or similar messages
- Check prompts.txt for pytest output
- Document as "CANNOT VERIFY" with evidence found
Neo4j connection errors
- Verify Neo4j is running: `docker ps | grep neo4j`
- Check credentials match: NEO4J_AUTH=neo4j/password
- Wait for startup: Neo4j needs ~10-15 seconds to initialize
- Check logs: `docker logs neo4j-eval`
Spec diff shows changes
- Fail the evaluation
- Note the changes in the report
- This invalidates the benchmark comparison
Codebase documentation fails
- Verify the codebase-summary skill is available
- Check that the codebase-path exists and contains code
- Ensure the output directory is writable
- Try running the skill standalone first to debug
