File Issues from Evaluation

Overview

This SOP parses evaluation reports from brazil-bench attempts and creates individual GitHub issues on each attempt's repository for every shortcoming, warning, or improvement opportunity identified during the review.

Parameters

attempt_repo (required): Repository name (e.g., 2025-12-13-python-claude-hive)
evaluation_path (optional, default: ./results/{attempt_repo}.md): Path to evaluation report
dry_run (optional, default: false): If true, show issues that would be created without creating them

Steps

1. Load Evaluation Report

Read the evaluation report for the specified attempt.

Constraints:

You MUST verify the evaluation report exists before proceeding
You MUST parse markdown sections to identify shortcomings
You MUST NOT proceed if the report is missing or malformed

# Verify report exists
ls ./results/{attempt_repo}.md

# Read the report content
cat ./results/{attempt_repo}.md

2. Verify Repository Access

Confirm the attempt repository exists and is writable.

Constraints:

You MUST verify the repo exists on GitHub
You MUST check that issues are enabled on the repo
You SHOULD check for existing issues to avoid duplicates
You SHOULD use default GitHub labels (no custom labels required)

# Verify repo exists
gh repo view brazil-bench/{attempt_repo}

# List existing issues to check for duplicates
gh issue list -R brazil-bench/{attempt_repo} --limit 100

# List available labels (use default GitHub labels)
gh label list -R brazil-bench/{attempt_repo}

Default GitHub Labels to Use:

Issue Type	Default Label	Reason
Missing requirement	`enhancement`	It's a feature that needs to be added
Test quality issue	`bug`	Tests not working as expected
Compliance summary	`enhancement`	Tracking improvements needed
Quality concern	`enhancement`	Code improvement request
Documentation gap	`documentation`	Missing or incomplete docs

Using default labels avoids permission issues (creating custom labels requires admin access).

3. Extract Shortcomings from Report

Parse the evaluation report to identify all issues to file.

Systematic Extraction Commands:

Use these commands to programmatically extract shortcomings:

# Extract missing requirements (unchecked items)
grep -E "^- \[ \]" ./results/{attempt_repo}.md

# Extract test quality warnings
grep -A 20 "## Test Quality Warning\|## Test Skip Analysis" ./results/{attempt_repo}.md

# Extract weaknesses section
grep -A 10 "## Weaknesses" ./results/{attempt_repo}.md

# Extract compliance score
grep -E "Spec Compliance.*[0-9]+/[0-9]+" ./results/{attempt_repo}.md

# Extract skip ratio
grep -E "Skip Ratio.*[0-9]+%" ./results/{attempt_repo}.md

Categories of Issues to Extract:

3a. Missing Requirements

Look in the "Requirements Checklist" section for unchecked items:

Pattern: Lines starting with - [ ] indicate missing/partial requirements

# Extract all missing requirements
grep -E "^- \[ \]" ./results/{attempt_repo}.md

Issue Template:

Title: [Missing] {requirement description}
Label: enhancement
Body:
## Requirement
{description}

## Notes from Evaluation
{any additional context from the report}

## Suggested Fix
{if available}

---
Filed from evaluation: results/{attempt_repo}.md

3b. Test Quality Warnings

Look for the "Test Quality Warning" or "Test Skip Analysis" sections:

Patterns to detect:

Skip ratio > 20%
Tests with "not yet implemented" messages
Integration tests that can't run

Issue Template:

Title: [Test Quality] {description of test issue}
Label: bug
Body:
## Issue
{description of the test quality problem}

## Details
{skip breakdown, affected files, etc.}

## Impact
- Skip Ratio: {X}%
- Effective Tests: {Y} (vs {Z} total)

## Suggested Fix
{recommendations}

---
Filed from evaluation: results/{attempt_repo}.md

3c. Spec Compliance Issues

Look for compliance scores below 100%:

Patterns:

Spec Compliance: X/16 where X < 16
Partial implementations noted

Issue Template:

Title: [Compliance] Spec compliance at {X}% ({Y}/16 requirements)
Label: enhancement
Body:
## Current Status
{X}/16 requirements met ({percentage}%)

## Missing Requirements
{list of unchecked requirements}

## Partial Implementations
{list of partial implementations with notes}

---
Filed from evaluation: results/{attempt_repo}.md

3d. Skipped Tests

Look for ANY skipped tests in the "Metrics" or "Test Skip Analysis" sections:

Patterns to detect:

Tests (Skipped) > 0 (zero tolerance for skips)
Any Skip Ratio > 0%
Conditional skips for external dependencies

Extraction:

# Extract skip ratio and skipped test count
grep -E "Skip Ratio|Tests \(Skipped\)" ./results/{attempt_repo}.md

Issue Template:

Title: [Test Quality] {X}% tests skip conditionally ({Y}/{Z} tests)
Label: enhancement
Body:
## Issue
{Y} of {Z} tests ({X}%) are skipped.

## Details
{description of why tests skip - e.g., Neo4j not available}

## Impact
- Skip Ratio: {X}%
- Effective Tests: {effective} (vs {total} total)
- Benchmark penalty: Exceeds 10% threshold

## Assessment
{evaluation of whether skips are acceptable or problematic}

### Acceptable Skips
- Integration tests that require external services (Neo4j, Redis, etc.)
- Conditional skips with proper `pytest.skip()` messages

### Problematic Skips
- Tests with "not yet implemented" messages
- Unconditional skips that inflate test counts
- Stub tests that never execute

## Recommendation
{suggested action or note that no action required}

---
Filed from evaluation: results/{attempt_repo}.md

Skip Ratio Thresholds:

Skip Ratio	Assessment	Action
0%	Clean	No issue needed
>0%	Not acceptable	File issue for every skipped test
>20%	Concerning	File issue, recommend fixes
>50%	Critical	File issue as `bug`, high priority

3d-integration. Self-Contained Integration Tests (CRITICAL)

Integration tests that skip because external dependencies aren't running are NOT acceptable. Tests must manage their own data stores.

Patterns to detect:

Tests skipping with "Neo4j not running", "database not available", etc.
Tests requiring manual docker setup before running
Tests that only pass in CI environments
Conditional skips based on external service availability

Extraction:

# Check for external dependency skips
grep -r "pytest.skip.*neo4j\|pytest.skip.*database\|pytest.skip.*not running" ./reviews/{attempt_repo}/tests/

# Check for skipif based on service availability
grep -r "skipif.*neo4j\|skipif.*database\|skipif.*connection" ./reviews/{attempt_repo}/tests/

# Check for testcontainers usage (good)
grep -r "testcontainers\|Neo4jContainer" ./reviews/{attempt_repo}/tests/

# Check for pytest-docker usage (good)
grep -r "pytest-docker\|docker_compose_file" ./reviews/{attempt_repo}/

Issue Template:

Title: [Test Quality] Integration tests must be self-contained - use testcontainers
Label: bug
Body:
## Issue

Integration tests skip when external dependencies (e.g., Neo4j) are not running. This is not acceptable - tests must manage their own data stores.

## Current Behavior

Tests skip with messages like:
- "Neo4j not running"
- "Database not available"
- "Skipping integration tests"

{examples from codebase}

## Required Behavior

Integration tests MUST start their own data stores. Tests should pass on any machine with Docker installed, without manual setup.

## Recommended Solution: testcontainers

Use [testcontainers-python](https://testcontainers-python.readthedocs.io/) to automatically manage Neo4j:

```python
# conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer

@pytest.fixture(scope="session")
def neo4j_container():
    """Start Neo4j container for integration tests."""
    with Neo4jContainer("neo4j:5") as neo4j:
        yield neo4j

@pytest.fixture
def neo4j_client(neo4j_container):
    """Get client connected to test container."""
    return Neo4jClient(
        uri=neo4j_container.get_connection_url(),
        auth=("neo4j", "test")
    )

# test_integration.py
def test_create_match(neo4j_client):
    """Test creating a match in Neo4j - container starts automatically."""
    result = neo4j_client.create_match(...)
    assert result is not None

Alternative: pytest-docker

If you prefer docker-compose, use pytest-docker:

# conftest.py
@pytest.fixture(scope="session")
def docker_compose_file():
    return "docker-compose.test.yml"

@pytest.fixture(scope="session")
def neo4j_service(docker_services):
    docker_services.wait_until_responsive(
        timeout=30.0,
        pause=0.5,
        check=lambda: is_responsive()
    )

Dependencies to Add

# pyproject.toml
[project.optional-dependencies]
test = [
    "pytest",
    "testcontainers[neo4j]",
    # OR
    "pytest-docker",
]

Impact

Current: Tests skip, providing no integration coverage
After fix: Tests run everywhere, actually verify Neo4j integration

Filed from evaluation: results/{attempt_repo}.md


**Issue Template: Missing Integration Tests**

Title: [Test Quality] Missing integration tests - no tests verify data persistence Label: bug Body:

Issue

No integration tests exist to verify the data persistence layer works correctly.

Current State

Only unit tests with mocks exist
No tests verify actual database operations
Data layer is untested in real conditions

Required

Integration tests MUST exist and MUST:

Use persistent storage (testcontainers, pytest-docker, or file-based)
Test actual CRUD operations against real storage
Run without manual setup (self-contained)

Impact

Current: Data layer has zero test coverage
After fix: Confidence that database operations work correctly

Filed from evaluation: results/{attempt_repo}.md


**Issue Template: In-Memory Mock Not Acceptable**

Title: [Test Quality] Integration tests use in-memory mock instead of persistent storage Label: bug Body:

Issue

Integration tests use in-memory mock storage instead of persistent storage.

Current Behavior

Tests use mock classes like MockNeo4jDatabase or FakeStorage that store data in memory dictionaries.

Why This Fails

In-memory mocks:

Don't test actual persistence behavior
Don't catch serialization/deserialization bugs
Don't verify database constraints
Data doesn't survive process restart

Required Behavior

Integration tests must use persistent storage that actually tests the data persistence layer.

Benefits of pytest-bdd

Readable scenarios for non-technical stakeholders
Reusable step definitions
Standard tooling (IDE support, linting)
Separation of concerns

Recommendation

{specific migration guidance}

Filed from evaluation: results/{attempt_repo}.md


**Test Pattern Quality:**
| Pattern | Quality | Action |
|---------|---------|--------|
| pytest-bdd + .feature files (sync) | Best | No issue needed |
| pytest-bdd without .feature (sync) | Good | Optional improvement |
| pytest-bdd with async functions | Broken | File bug issue - tests silently fail |
| Docstring BDD | Acceptable | File enhancement issue |
| Custom BDD helper | Non-standard | File enhancement issue |
| E2E only / No BDD | Poor | File enhancement issue |

**Critical: Async pytest-bdd Warning**

pytest-bdd step definitions and scenarios MUST be synchronous. Async functions will:
- Silently skip or fail tests
- Return coroutine objects instead of results
- Cause confusing "test passed" reports when tests didn't actually run

Always use synchronous wrappers around async code in BDD tests.

---

#### 3e-swift. Swift/iOS Test Best Practices

Check if tests follow Swift/iOS testing best practices:

**Patterns to detect:**
- Missing test targets in Xcode project
- No XCTest usage
- Empty test methods (stubs)
- Disabled tests (renamed without `test` prefix)
- Missing async test handling
- No mock/dependency injection

**Extraction:**
```bash
# Check for XCTest usage
grep -r "import XCTest\|XCTestCase" ./reviews/{attempt_repo}/ --include="*.swift"

# Check for test methods
grep -r "func test" ./reviews/{attempt_repo}/ --include="*.swift" | wc -l

# Check for disabled/stub tests
grep -r "func disabled_\|// func test\|func _test" ./reviews/{attempt_repo}/ --include="*.swift"

# Check for proper async testing
grep -r "async throws\|expectation\|wait(for:" ./reviews/{attempt_repo}/ --include="*.swift"

# Check for Quick/Nimble BDD framework
grep -r "import Quick\|import Nimble\|describe.*context.*it" ./reviews/{attempt_repo}/ --include="*.swift"

Issue Template (Swift):

Title: [Test Quality] Use XCTest with proper async handling instead of {current_approach}
Label: enhancement
Body:
## Issue
{description of current test approach}

## Current Approach
{code example showing current pattern}

## Best Practice
Use XCTest with proper structure:

```swift
import XCTest
@testable import YourApp

final class FeatureTests: XCTestCase {
    var sut: FeatureViewModel!
    var mockService: MockNetworkService!

    override func setUp() {
        super.setUp()
        mockService = MockNetworkService()
        sut = FeatureViewModel(service: mockService)
    }

    override func tearDown() {
        sut = nil
        mockService = nil
        super.tearDown()
    }

    func test_whenLoadingData_thenUpdatesState() async throws {
        // Given
        mockService.mockResult = TestData.sampleItems

        // When
        await sut.loadData()

        // Then
        XCTAssertEqual(sut.items.count, 3)
        XCTAssertFalse(sut.isLoading)
    }
}

For BDD-style tests, use Quick/Nimble:

import Quick
import Nimble
@testable import YourApp

class FeatureSpec: QuickSpec {
    override class func spec() {
        describe("Feature") {
            var sut: FeatureViewModel!

            beforeEach {
                sut = FeatureViewModel()
            }

            context("when loading data") {
                it("updates the state") {
                    waitUntil { done in
                        sut.loadData {
                            expect(sut.items).toNot(beEmpty())
                            done()
                        }
                    }
                }
            }
        }
    }
}

Recommendation

{specific migration guidance}

Filed from evaluation: results/{attempt_repo}.md


**Swift Test Pattern Quality:**

| Pattern | Quality | Action |
|---------|---------|--------|
| XCTest + async/await + DI | Best | No issue needed |
| XCTest + expectations + DI | Good | No issue needed |
| Quick/Nimble BDD | Good | No issue needed |
| XCTest without DI | Acceptable | Optional improvement |
| XCTest with stubs/disabled | Poor | File enhancement issue |
| No tests | Critical | File bug issue |

**Swift-Specific Test Issues:**

| Issue | Detection | Severity |
|-------|-----------|----------|
| Tests don't compile | `swift build` fails | Critical |
| Tests timeout on CI | Long `wait(for:)` calls | High |
| No async handling | Missing `async` or `expectation` | Medium |
| Force unwrapping in tests | `!` without guard | Low |
| Missing mocks | No protocol-based DI | Medium |

#### 3f. Documentation Quality
Check if README.md contains essential user documentation:

**Required Elements:**
1. **Setup Instructions**: Prerequisites, installation, configuration
2. **MCP Server Setup**: How to start server, how to connect Claude
3. **Example Q&A**: Sample questions and expected responses

**Extraction:**
```bash
# Check README in cloned repo
head -100 ./reviews/{attempt_repo}/README.md

# Look for key sections
grep -E "Quick Start|Installation|Setup|MCP|Example|Usage" ./reviews/{attempt_repo}/README.md

Issue Template:

Title: [Docs] README missing setup instructions and usage examples
Label: documentation
Body:
## Issue
The README.md lacks essential user documentation.

## Missing Documentation

### 1. Setup Instructions
- Prerequisites (Python version, Neo4j)
- Installation steps
- Environment configuration

### 2. MCP Server Setup
- How to start the MCP server
- How to configure Claude to use the server
- Example configuration

### 3. Usage Examples with Q&A
- Example questions users can ask
- Expected responses/output

## Best Practice Example
See `2025-10-30-python-hive` for comprehensive documentation.

---
Filed from evaluation: results/{attempt_repo}.md

Documentation Quality Levels:

Level	Criteria	Action
Excellent	All 3 elements + extras (architecture, API ref)	No issue needed
Good	All 3 elements present	No issue needed
Acceptable	2 of 3 elements	Optional improvement issue
Poor	0-1 elements	File documentation issue

3g. Architecture/Quality Issues

Look for concerns in "Architecture Summary", "Weaknesses", or "Areas of Note":

Patterns:

Bullet points under "Weaknesses" heading
Notes about missing MCP endpoints
Performance concerns
Code structure issues

Issue Template:

Title: [Quality] {description}
Label: enhancement
Body:
## Issue
{description}

## Context
{relevant context from the evaluation}

## Recommendation
{suggested improvement}

---
Filed from evaluation: results/{attempt_repo}.md

3h. Context Header Blocks

Check if source code files have proper context header comment blocks.

Patterns to detect:

Files missing header blocks entirely
Headers missing required sections (Purpose, Interfaces, Change History)
Headers not updated with change history

Extraction:

# Check for context headers in source files
for f in $(find ./reviews/{attempt_repo}/src -name "*.py" -not -name "__init__.py"); do
  if ! head -30 "$f" | grep -q "CONTEXT\|Purpose\|Module:"; then
    echo "Missing header: $f"
  fi
done

Issue Template:

Title: [Quality] Source files missing context header blocks
Label: enhancement
Body:
## Issue

Source code files are missing required context header comment blocks.

## Required Header Format

Every source file must have a header block that includes:
1. **Purpose** - What the file/module does
2. **Interfaces** - Key classes, functions, or APIs exposed
3. **Change History** - Record of modifications (updated on every change)

## Example Header (Python)

```python
"""
================================================================================
CONTEXT BLOCK
================================================================================
File: server.py
Module: brazilian_soccer_mcp.server
Purpose: MCP server implementation for Brazilian soccer knowledge graph

Description:
    Exposes Neo4j queries as MCP tools for Claude to use.

Interfaces:
    - create_server(): Creates and configures the MCP server
    - MCPServer: Main server class with tool handlers

Dependencies:
    - mcp: MCP protocol implementation
    - neo4j: Database connection

Change History:
    - 2025-01-04: Add get_player_stats tool
    - 2025-01-03: Initial creation
================================================================================
"""

Files Missing Headers

{list of files without proper headers}

Impact

Without context headers, code is harder to understand and maintain
Change history helps track evolution of the codebase
Interfaces section provides quick API reference

Filed from evaluation: results/{attempt_repo}.md


**Context Header Assessment:**

| Coverage | Assessment | Action |
|----------|------------|--------|
| 100% | Excellent | No issue needed |
| 75-99% | Good | Optional issue |
| 50-74% | Partial | File enhancement issue |
| <50% | Poor | File enhancement issue |

### 4. Create Issues
File each extracted shortcoming as a separate GitHub issue.

**Constraints:**
- You MUST create one issue per shortcoming (not a combined issue)
- You MUST apply appropriate default labels
- You MUST include reference to the evaluation report
- You MUST check for duplicates before creating (match on title)
- You MUST create detail issues BEFORE the summary issue (so you can reference issue numbers)
- You SHOULD group related issues with a common prefix
- You MAY skip issues if identical ones already exist

**Issue Creation Order:**
1. Create all detail issues first ([Missing], [Test Quality], [Quality], etc.)
2. Note the issue numbers assigned
3. Create the [Compliance] summary issue last, referencing the detail issues

```bash
# Check if issue already exists
gh issue list -R brazil-bench/{attempt_repo} --search "{issue_title}" --json title

# Create issue using HEREDOC for multi-line markdown body
gh issue create -R brazil-bench/{attempt_repo} \
  --title "[Missing] MCP server implementation" \
  --label "enhancement" \
  --body "$(cat <<'EOF'
## Requirement

The spec requires an MCP server with tools exposed as endpoints.

## Current State

Query engine exists but not exposed as MCP tools.

## Suggested Fix

Add MCP server wrapper using `fastmcp` or `mcp` package.

---
Filed from evaluation: [results/{attempt_repo}.md](https://github.com/brazil-bench/pourpoise/blob/main/results/{attempt_repo}.md)
EOF
)"

Cross-Referencing Issues:

When creating the summary [Compliance] issue, reference related detail issues:

## Missing Requirements

1. **MCP server implementation** - see #2
2. **Query performance benchmarks** - see #3
3. **Cross-file queries verification** - see #4

## Additional Issues

- **Test Quality:** 84% skip ratio - see #1

Issue Creation Rules:

Shortcoming Type	Title Prefix	Default Label
Missing requirement	`[Missing]`	`enhancement`
Test quality issue	`[Test Quality]`	`bug`
Skipped tests (>10%)	`[Test Quality]`	`enhancement` or `bug`*
Integration tests skip	`[Test Quality]`	`bug`
Test best practices	`[Test Quality]`	`enhancement`
Documentation quality	`[Docs]`	`documentation`
Spec compliance	`[Compliance]`	`enhancement`
Architecture/quality	`[Quality]`	`enhancement`
Performance concern	`[Performance]`	`enhancement`

*Use bug for skip ratios >50%, problematic skips, or integration tests that skip due to missing dependencies; use enhancement for acceptable conditional skips.

4a. Summary Issue Template

After creating all detail issues, create a [Compliance] summary issue that links them together:

gh issue create -R brazil-bench/{attempt_repo} \
  --title "[Compliance] Spec compliance at {X}% ({Y}/16 requirements)" \
  --label "enhancement" \
  --body "$(cat <<'EOF'
## Current Status

{Y}/16 requirements met ({X}% compliance)

## Missing Requirements

1. **{requirement_1}** - see #{issue_number_1}
2. **{requirement_2}** - see #{issue_number_2}
...

## Implemented Requirements

### Category 1 (N/M)
- [x] Requirement that passed
- [x] Another passing requirement
...

## Additional Issues

- **Test Quality:** {skip_ratio}% skip ratio - see #{test_quality_issue}

## Benchmark Score

Current score: **{score}** (rank {rank} of 8)

To improve ranking, address the issues linked above.

---
Filed from evaluation: [results/{attempt_repo}.md](https://github.com/brazil-bench/pourpoise/blob/main/results/{attempt_repo}.md)
EOF
)"

This summary issue serves as an index to all other filed issues and provides context for prioritization.

5. Generate Summary

Output a summary of all issues created.

Constraints:

You MUST list all issues created with their URLs
You MUST note any issues skipped due to duplicates
You MUST include the total count

Output Format:

## Issues Filed for {attempt_repo}

### Created ({N} issues)
| # | Title | URL |
|---|-------|-----|
| 1 | [Missing] MCP server implementation | {url} |
| 2 | [Test Quality] 84% tests skip unconditionally | {url} |
| ... | ... | ... |

### Skipped (duplicates)
- {title} - already exists as #{existing_issue_number}

### Summary
- Total shortcomings identified: {X}
- Issues created: {Y}
- Issues skipped (duplicates): {Z}

Examples

Example 1: Filing issues for Hive v2

# File issues from the Hive v2 evaluation
file issues for 2025-12-13-python-claude-hive

Expected issues:

[Missing] MCP server implementation - tools not exposed as MCP endpoints
[Missing] Query performance benchmarks (< 2s simple, < 5s aggregate)
[Missing] Cross-file queries verification
[Test Quality] 84% of tests skip unconditionally (53/63 tests)
[Compliance] Spec compliance at 81% (13/16 requirements)

Example 2: Dry run to preview issues

# Preview issues without creating them
file issues for 2025-12-13-python-claude-hive --dry-run

Troubleshooting

No evaluation report found

Run the evaluate-attempt skill first
Check the path: ./results/{attempt_repo}.md

Permission denied on repository

Verify you have write access: gh repo view brazil-bench/{attempt_repo}
Check if issues are enabled on the repo

Duplicate detection failing

Issues are matched by exact title
If titles differ slightly, duplicates may be created
Use gh issue list -R repo --search "keyword" to check

Labels don't exist

Use default GitHub labels (enhancement, bug, documentation) which exist on all repos
Avoid custom labels as creating them requires admin access (HTTP 403 error)
If a label doesn't exist, omit the --label flag entirely
The title prefix (e.g., [Missing], [Test Quality]) provides categorization without labels

Too many issues

Use --dry-run first to preview
Consider grouping minor issues into a single "Minor improvements" issue
Focus on high-impact shortcomings first

file-issuesSafety 95Repository ShareFavorite skill

Package Files

File Issues from Evaluation

Overview

Parameters

Steps

1. Load Evaluation Report

2. Verify Repository Access

3. Extract Shortcomings from Report

3a. Missing Requirements

3b. Test Quality Warnings

3c. Spec Compliance Issues

3d. Skipped Tests

3d-integration. Self-Contained Integration Tests (CRITICAL)

Alternative: pytest-docker

Dependencies to Add

Impact

Issue

Current State

Required

Recommended Solution

Impact

Issue

Current Behavior

Why This Fails

Required Behavior

Recommended Solution

Benefits of pytest-bdd

Recommendation

Recommendation

3g. Architecture/Quality Issues

3h. Context Header Blocks

Files Missing Headers

Impact

4a. Summary Issue Template

5. Generate Summary

Examples

Example 1: Filing issues for Hive v2

Example 2: Dry run to preview issues

Troubleshooting

Install

AI Quality Score

Metadata

Tags

file-issuesSafety 95Repository