askill
black-box-red-testing

black-box-red-testingSafety --Repository

Black-Box Red Testing — red tests that expose real bugs

59 stars
1.2k downloads
Updated 2/21/2026

Package Files

Loading files...
SKILL.md

Write tests that catch, not tests that pass.

Objective

Produce tests that fail against real bugs. A good test is one that exposes a bug the codebase actually has (row 2 of the Test Truth Table in the Testing skill).

The agent needs bug hypotheses, not buggy code. The skill never modifies the codebase — it only writes tests.

When to Use

  • Exploring an area for unknown bugs — you don't know where to look
  • Specs are weak or absent — spec gap exploitation is this skill's strength
  • During code review — single round (default) as a lightweight probe
  • You want a confidence signal — "is this module solid?"

If you have specific targets (changed files, low-coverage functions) or need classified findings, use the white-box-red-testing skill instead.

Input

Target scope: A module, package, or file set to test adversarially.

If the target scope is large (many packages, broad module), narrow before starting. Adversarial testing is deep, not wide — pick the riskiest or most complex area first.

Before starting, assess the target:

  • Read the code under test
  • Read available specs (if any)
  • Identify which situation applies:
    1. Clear specs — spec is the oracle. Hypothesize bugs in the code.
    2. Spec gaps — ambiguity is the attack surface. See Spec Gap Exploitation below.

Parameters

ParameterDefaultDescription
rounds1Number of rounds. Each round produces 10 tests.

Round Protocol

10 tests per round. Each round is a structured exploration budget. When running multiple rounds, each is a chance to pivot — if a direction isn't productive, refocus on another part of the code or another kind of bug.

Per Round

1. Hypothesis Formation

Before hypothesizing, read the target's contracts:

  • Docstrings, type annotations, function/parameter names
  • Existing tests (what's covered, what's not)
  • Call sites (how the code is actually used)
  • Commit messages on recent changes (what was the intent?)

These are inputs to sharper hypotheses, not classification evidence.

Form 10 distinct bug hypotheses for this round. A senior engineer knows the typical bugs — think:

  • Boundary conditions, off-by-one, empty/nil inputs
  • State corruption, ordering dependencies
  • Error handling gaps, missing validation
  • Concurrency, race conditions
  • Type coercion, encoding issues
  • Contract violations between components
  • Implicit assumptions in the code

Each hypothesis must be distinct (see Anti-Gaming below).

Hypothesis format (required):

Each hypothesis must be stated as a structured one-liner before writing the test:

[code_path] × [defect_class] → [observable_symptom]

Example output:

Hypotheses:
1. parse_date(empty_string) × missing_guard → uncaught ValueError
2. process_batch(>1000_items) × unbounded_loop → timeout
3. auth_token(expired) × stale_cache → false_positive_auth
4. save_record(unicode_name) × encoding_assumption → corrupted_output
5. concurrent_update(same_id) × race_condition → lost_write
...

This format forces precision. Vague hypotheses ("some edge case × something → failure") are immediately visible and must be sharpened before proceeding.

2. Test Writing

Write one test per hypothesis. Each test should:

  • Target a specific, articulable bug hypothesis
  • Be deterministic and self-contained
  • Follow the codebase's existing test conventions
  • Be a valid test — it must compile and run

3. Execution

Run all 10 tests against the existing codebase.

4. Triage

ResultAction
RedKeep — the test exposed something.
GreenDiscard — the code handled this correctly.
Error (won't compile, bad setup)Fix or discard — broken tests aren't adversarial, they're wrong.

5. Pivot Decision

After triage, assess:

  • What kinds of bugs did this round find (or not find)?
  • Is this angle exhausted, or is there more to explore?
  • What area of the code or what class of bugs hasn't been tried?

Pivot to a new angle for the next round, or continue deepening a productive one.

Multi-Round Guidance (rounds > 1)

Pivot between rounds. If a round was unproductive, change angle — different area of the code, different class of bugs. Don't repeat the same approach.

Triage throttle. If a round produces 5+ red tests, pause for the coder to triage before spending the next round. A wall of red tests is noise, not signal.

All-Green Rounds

An all-green round from high-quality hypotheses is a confidence signal, not waste. It means the code handles those cases correctly. This information is valuable — report it as such.

Spec Gap Exploitation

When specs are insufficient or absent, ambiguity becomes the attack surface.

A "compliant" red test that exploits a spec gap is a legitimate find. The agent is gaming for good — the adversarial move becomes a discovery tool that surfaces gaps the spec should have covered.

The skill does not classify whether a red test points to a code bug or a spec gap. The skill finds; the coder diagnoses. This separation mirrors TDD and the programmer/QA role distinction. A red test is a red test.

Anti-Gaming

The contract's general anti-gaming clause applies. Additionally, within this skill:

Within a round, tests must represent distinct bug hypotheses. Ten variations of "nil pointer on empty input" is not ten hypotheses — it's one hypothesis tested ten ways. Vary the kind of bug, not just the input to the same bug.

Distinct means different in at least one of:

  • The code path exercised
  • The category of defect hypothesized
  • The component boundary tested

The structured hypothesis format makes gaming visible — if multiple lines share the same [code_path] × [defect_class] pattern with only input variations, they must be consolidated into one hypothesis.

Output

Surviving red tests only. Green tests and broken tests are discarded.

Present each surviving red test with:

  • The test code
  • The failure output
  • The target (what code path it exercises)

No hypothesis or classification is attached — the coder owns diagnosis.

If no red tests survive, report the confidence signal: state which areas were tested and what classes of bugs were hypothesized. The structured hypotheses from the round serve as documentation of verified absence.

Convergence

The skill completes after the configured number of rounds. Early termination is acceptable if:

  • A round produced several red tests and the coder wants to address them before continuing
  • The target scope is small enough that further rounds would produce redundant hypotheses

Integration with Testing Skill

This skill operates within the Testing skill's constraints:

  • Test Modification rules apply — never weaken existing tests
  • Test Truth Table is the reference — this skill targets row 2 (Buggy code, Red test = Good)
  • Assertion Strength — adversarial tests should have strong assertions. A weak assertion that passes for broken implementations defeats the purpose.
  • Mocking Discipline — mock external boundaries, not internal logic. Adversarial tests must exercise real code paths.

Tests written by this skill that survive (stay red) become inputs to the normal development workflow. The coder decides whether to fix the bug, update the spec, or accept the behavior.

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

AI review pending.

Metadata

Licenseunknown
Version-
Updated2/21/2026
Publisherliza-mas

Tags

github-actionstesting