Analyze Test JSON — Flow & Response Analysis
Analyzes IntegrationTesterApp test-result JSON files to explain what happened per test (flow: prompt → tools called → response → assertion outcomes) and to surface cross-test patterns and concrete fix recommendations.
When to Apply
- User says /analyze-test-json, /analyze-test-json latest, /analyze-test-json <path>, or "analyze test json", "analyze test results", "analyze latest test results"
- User @-mentions a specific test-results_*.json or multi-model-results_*.json and asks for analysis or recommendations
- User wants to understand why tests failed or how tool-call flow and response text relate to assertions
File Resolution
Base directory: src/IntegrationTesterApp/test-results/
| User input | Action |
|---|---|
| No path, or "latest" | Do not rely on listing the directory (it can truncate). Use a glob for all result files: test-results_*.json and multi-model-results_*.json under the base directory. From the matched paths, parse the embedded timestamp in each filename (_yyyyMMdd_HHmmss before .json). Sort by that timestamp descending and pick the first file — that is the latest. Example: test-results_20260127_054623.json (20260127054623) is newer than multi-model-results_20260126_083048.json (20260126083048). |
| Explicit path | Use that path. If relative, resolve from repo root. If the user gives a bare name like test-results_20260127_054623, resolve to src/IntegrationTesterApp/test-results/test-results_20260127_054623.json (add base dir and .json when missing). |
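The resolution rules in the table above can be sketched in Python as follows. This is a minimal illustration, not part of the skill itself: the helper names (stamp_key, resolve_latest, resolve_explicit) are hypothetical, and resolving relative paths against the repo root is left to the caller.

```python
import re
from pathlib import Path
from typing import Optional

# Matches the _yyyyMMdd_HHmmss stamp that sits directly before ".json".
STAMP = re.compile(r"_(\d{8})_(\d{6})\.json$")


def stamp_key(path: Path) -> Optional[str]:
    """Return 'yyyyMMddHHmmss' parsed from the filename, or None if absent."""
    m = STAMP.search(path.name)
    return m.group(1) + m.group(2) if m else None


def resolve_latest(base_dir: str) -> Optional[Path]:
    """Pick the newest result file by the timestamp embedded in its name."""
    base = Path(base_dir)
    candidates = []
    for pattern in ("test-results_*.json", "multi-model-results_*.json"):
        for p in base.glob(pattern):
            key = stamp_key(p)
            if key is not None:  # skip files without a parseable stamp
                candidates.append((key, p))
    if not candidates:
        return None
    # Fixed-width digit strings sort chronologically as plain strings.
    return max(candidates, key=lambda kp: kp[0])[1]


def resolve_explicit(user_input: str, base_dir: str) -> Path:
    """Add '.json' and the base directory when the user gives a bare name."""
    p = Path(user_input)
    if p.suffix != ".json":
        p = p.with_name(p.name + ".json")
    if len(p.parts) == 1:  # bare name, no directory component
        p = Path(base_dir) / p
    return p
```

Sorting on the filename stamp rather than on file modification time keeps the result stable when files are copied or checked out again.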
Supported formats:
- Single-run: test-results_{timestamp}.json → root is TestRunResults with configuration, summary, testResults.
- Multi-model: multi-model-results_{timestamp}.json → root is MultiModelTestResults with modelResults[]; each entry has model, summary, testResults.
If the chosen path does not exist, report that and suggest running tests or checking the path.
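As a rough sketch of loading a result file and telling the two formats apart by their root keys, something like the following could work. The function names and the error messages are illustrative assumptions; only the root keys modelResults and testResults come from the format description above.

```python
import json
from pathlib import Path
from typing import Any, Dict, Tuple


def detect_format(root: Dict[str, Any]) -> str:
    """Classify a parsed result file by its root keys."""
    if "modelResults" in root:
        return "multi-model"
    if "testResults" in root:
        return "single-run"
    raise ValueError("unrecognized result file: no modelResults or testResults key")


def load_results(path: str) -> Tuple[str, Dict[str, Any]]:
    """Load the JSON file and return (format, root)."""
    p = Path(path)
    if not p.is_file():
        # Mirrors the guidance above: report the problem and suggest next steps.
        raise FileNotFoundError(f"{p} not found; run the tests or check the path")
    # utf-8-sig decodes files with or without a byte-order mark.
    root = json.loads(p.read_text(encoding="utf-8-sig"))
    return detect_format(root), root
```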
Per-Test Shape (from JSON)
Each item in testResults (or modelResults[].testResults) has this shape (camelCase in JSON):
- testCase: name, prompt, conversationHistory, expectedToolsToCall, expectedToolsNotToCall, responseMustContain, responseMustContainAny, tags, …
- passed: boolean
- response: LLM response text
- functionCallDetails: { functionName, pluginName, name, parameters, result, resultString, nestedCalls? }[]
- failures: list of failure strings from TestRunner
- error: if exception occurred
- durationMs: int
- timedOut: boolean
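A small sketch of walking this shape uniformly across both file formats is shown below. The helper names are hypothetical. It also assumes that nestedCalls, when present, holds entries of the same shape as functionCallDetails, which the field list above suggests but does not state.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

Result = Dict[str, Any]


def iter_test_results(root: Result) -> Iterator[Tuple[Optional[Any], Result]]:
    """Yield (model, result) pairs; model is None for single-run files."""
    if "modelResults" in root:
        for entry in root.get("modelResults") or []:
            for result in entry.get("testResults") or []:
                yield entry.get("model"), result
    else:
        for result in root.get("testResults") or []:
            yield None, result


def called_tools(result: Result, include_nested: bool = False) -> List[str]:
    """Collect functionCallDetails[].functionName in call order.

    With include_nested=True, nestedCalls are visited depth-first
    (assumption: they share the functionCallDetails entry shape).
    """
    names: List[str] = []

    def visit(calls: Optional[List[Result]]) -> None:
        for call in calls or []:  # tolerate null or missing lists
            names.append(call.get("functionName"))
            if include_nested:
                visit(call.get("nestedCalls"))

    visit(result.get("functionCallDetails"))
    return names
```

Treating a null list the same as an empty one avoids special cases for tests that errored or timed out before any tool call.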
Failure Strings (TestRunner Verification)
The agent matches failures[] to these patterns to classify assertions:
| Pattern (substring or full) | Meaning |
|---|---|
| Expected tool '…' was not called | A tool in expectedToolsToCall was never invoked. |
| Tool '…' was called but should not have been | A tool in expectedToolsNotToCall was invoked. |
| Response should contain '…' but does not | A keyword from responseMustContain was missing (AND condition). |
| Response should contain at least one of […] but contains none | No keyword from responseMustContainAny appeared (OR condition). |
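The pattern table above could be turned into a classifier along these lines. The category labels (missing_tool and so on) are hypothetical names chosen for this sketch, and the exact formatting of the bracketed keyword list is not specified in this document, so the text between the brackets is returned unparsed.

```python
import re
from typing import List, Optional, Pattern, Tuple

# One (label, regex) pair per row of the pattern table; labels are illustrative.
PATTERNS: List[Tuple[str, Pattern[str]]] = [
    ("missing_tool", re.compile(r"Expected tool '(.+?)' was not called")),
    ("forbidden_tool", re.compile(r"Tool '(.+?)' was called but should not have been")),
    ("missing_keyword", re.compile(r"Response should contain '(.+?)' but does not")),
    ("missing_any_keyword",
     re.compile(r"Response should contain at least one of \[(.*?)\] but contains none")),
]


def classify_failure(failure: str) -> Tuple[str, Optional[str]]:
    """Return (category, captured detail) for one entry of failures[]."""
    for kind, pattern in PATTERNS:
        m = pattern.search(failure)  # substring match, as the table allows
        if m:
            return kind, m.group(1)
    return "other", None  # e.g. timeouts or exception messages
```

Unmatched strings fall through to "other" rather than raising, so new failure messages from TestRunner do not break the analysis.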
Analysis Workflow
- Resolve file: Use "File Resolution" above.
- Detect format: Inspect root keys. Presence of modelResults → multi-model; presence of testResults at root → single-run.
- Run overview: From configuration/root and summary (or each modelResults[].summary), extract: file path, format type, model(s), pass/fail counts, total duration.
- Per-failing-test analysis: For every test where passed === false (and per model in multi-model):
  - Flow: prompt (+ conversationHistory) → tools actually called (functionCallDetails[].functionName) → response snippet → assertion outcomes (failures).
  - Expected vs actual tools: List testCase.expectedToolsToCall / expectedToolsNotToCall vs functionCallDetails[].functionName. Highlight missing or forbidden calls.
  - Response vs keywords: Compare response to responseMustContain and responseMustContainAny; note which keywords are missing.
  - Flow narrative: One short paragraph: "User asked X; the model called A, B (expected C but did not call it); it answered with …; assertions failed because …."
- Cross-test patterns: Aggregate across failing tests (and across models in multi-model). Examples:
  - Same assertion failing in many tests (e.g. same required tool never invoked).
  - One tool never invoked anywhere.
  - One model much worse than others; same prompt failing only for that model.
  - Same missing keyword across tests.
- Recommendations: Concrete, actionable fixes. Target: TestCaseDefinitions.cs, plugin/kernel code, or test design. Examples:
  - "In TestCaseDefinitions.cs, test '…': add 'Foo' to ResponseMustContainAny or relax to avoid flakiness."
  - "Expected tool 'StockAgent.AskStockAgent' was never called; check plugin descriptions and routing in StandardKernel / agent registration."
  - "Tool 'WeatherAgent.AskWeatherAgent' was called but forbidden; tighten ExpectedToolsNotToCall or improve tool-choice prompts."
  - "Add a test in TestCaseDefinitions.cs for scenario X to prevent regression."
Prefer bullet lists and short paragraphs. When suggesting code or config changes, mention specific files (TestCaseDefinitions.cs, StandardKernel.cs, plugin files) and approximate areas (e.g. "around the test named …", "AddNestedKernelPlugins") where relevant.
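The per-failing-test comparison and the cross-test aggregation from the workflow can be sketched as below. Two assumptions are made that this document does not confirm: keywords are matched as case-insensitive substrings, and expected tool names are compared directly against functionName (if the expected names are plugin-qualified, a different field may be the right one). All function and key names in the sketch are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable

Result = Dict[str, Any]


def diff_test(result: Result) -> Dict[str, Any]:
    """Compare expected vs actual tools and keywords for one test result."""
    case = result.get("testCase") or {}
    called = [c.get("functionName") for c in (result.get("functionCallDetails") or [])]
    # Assumption: case-insensitive substring matching for keywords.
    response = (result.get("response") or "").lower()
    any_of = case.get("responseMustContainAny") or []
    any_hit = any(k.lower() in response for k in any_of)
    return {
        "called": called,
        "missing_tools": [t for t in (case.get("expectedToolsToCall") or [])
                          if t not in called],
        "forbidden_called": [t for t in (case.get("expectedToolsNotToCall") or [])
                             if t in called],
        "missing_keywords": [k for k in (case.get("responseMustContain") or [])
                             if k.lower() not in response],
        # Non-empty only when the OR condition failed outright.
        "missing_any_of": [] if (any_hit or not any_of) else list(any_of),
    }


def cross_test_patterns(results: Iterable[Result]) -> Dict[str, Counter]:
    """Count recurring problems across failing tests only."""
    missing_tools: Counter = Counter()
    missing_keywords: Counter = Counter()
    failures: Counter = Counter()
    for r in results:
        if r.get("passed"):
            continue
        d = diff_test(r)
        missing_tools.update(d["missing_tools"])
        missing_keywords.update(d["missing_keywords"])
        failures.update(r.get("failures") or [])
    return {"missing_tools": missing_tools,
            "missing_keywords": missing_keywords,
            "failures": failures}
```

Calling most_common() on any of the returned counters surfaces the tool, keyword, or failure string that recurs most often, which is a natural starting point for the recommendations.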
Output Format — Report Template
Use this structure. Omit sections that have no content (e.g. no failing tests → omit "Per-failing-test analysis" detail; no cross-model data → omit model comparison in patterns).
## Test Results Analysis
### 1. Run overview
- **File**: <resolved path>
- **Format**: Single-run | Multi-model
- **Model(s)**: <from configuration or modelResults[].model>
- **Pass / Fail / Total**: <counts>
- **Total duration**: <ms or per-model if multi-model>
### 2. Per-failing-test analysis
For each failing test (and per model in multi-model when relevant):
#### <testCase.name> [Model: <model> if multi-model]
- **Prompt** (snippet): "<prompt text>" [+ conversation history if present]
- **Expected tools**: <expectedToolsToCall> | **Actual calls**: <functionCallDetails[].functionName>
- **Forbidden tools**: <expectedToolsNotToCall> | **Called anyway**: <list if any>
- **Required keywords** (ResponseMustContain): <list> | **In response**: ✓/✗ per keyword
- **Any-of keywords** (ResponseMustContainAny): <list> | **In response**: ✓/✗ per keyword
- **Failures**: <exact failure strings>
- **Flow**: <One short paragraph: prompt → tools called → response → why assertions failed.>
### 3. Cross-test patterns
- <Bullet list: same assertion failing in many tests; tools never invoked; one model much worse; recurring missing keywords; etc.>
### 4. Recommendations
- <Bullet list of concrete fixes: TestCaseDefinitions.cs changes, keyword adjustments, plugin/kernel fixes, new tests. Mention files and areas where relevant.>
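As an illustration of filling in the first section of the template, the sketch below renders the run overview. The field names inside summary are not listed in this document, so the counts and the total duration are derived here from each result's passed and durationMs fields instead; that derivation, and the function name render_overview, are assumptions of this sketch.

```python
from typing import Any, Dict, List


def render_overview(path: str, fmt: str, models: List[str],
                    results: List[Dict[str, Any]]) -> str:
    """Render the 'Run overview' section of the report template."""
    total = len(results)
    passed = sum(1 for r in results if r.get("passed"))
    # Assumption: total duration is the sum of per-test durationMs values.
    duration = sum(r.get("durationMs") or 0 for r in results)
    lines = [
        "## Test Results Analysis",
        "### 1. Run overview",
        f"- **File**: {path}",
        f"- **Format**: {fmt}",
        f"- **Model(s)**: {', '.join(models)}",
        f"- **Pass / Fail / Total**: {passed} / {total - passed} / {total}",
        f"- **Total duration**: {duration} ms",
    ]
    return "\n".join(lines)
```

When summary does carry these figures, reading them from it is preferable, since the summed per-test durations may not match the wall-clock time of the run.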
Usage Examples
| Command / request | Action |
|---|---|
| /analyze-test-json or "analyze latest test results" | Resolve latest file in test-results/, then run full analysis. |
| /analyze-test-json latest | Same as no path: use the newest file in test-results/. |
| /analyze-test-json src/IntegrationTesterApp/test-results/multi-model-results_20250126_120000.json | Analyze that multi-model file. |
| User @-mentions test-results_20250126_143022.json and says "why did these fail?" | Treat as explicit path (or resolve under test-results/), then analyze and emphasize failures and recommendations. |
File Locations & References
- Test result directory: src/IntegrationTesterApp/test-results/
- Single-run filenames: test-results_{yyyyMMdd_HHmmss}.json
- Multi-model filenames: multi-model-results_{yyyyMMdd_HHmmss}.json
- Resolving "latest": Glob for test-results_*.json and multi-model-results_*.json in that directory; parse yyyyMMdd_HHmmss from each filename; sort descending; use the first. Do not rely on directory listing order or truncation.
- TestCase definitions: src/IntegrationTesterApp/TestCaseDefinitions.cs (adjust expected/forbidden tools and response keywords here).
- Verification logic: src/IntegrationTesterApp/TestRunner.cs (VerifyTest produces the failure strings above).
- Result types: src/IntegrationTesterApp/TestCase.cs (TestResult, TestRunResults, ModelTestResults, MultiModelTestResults, TestCase).
Relation to Other Skills
- smart-test-runner: Runs tests and produces console output; it does not produce or parse JSON. Use analyze-test-json when the user has (or asks for) JSON result files and wants flow analysis and fix recommendations.
- add-plugin-test: Adds test cases to TestCaseDefinitions.cs. Recommendations from analyze-test-json may suggest new tests; the user can then use add-plugin-test to add them.
