askill
trace-production-issue


Trace production alerts and issues back through REQ-* to the original intent, then create a new intent for remediation. Closes the feedback loop from Runtime → Intent. Use when production alerts fire or issues are discovered.

Updated 2/5/2026

trace-production-issue

Skill Type: Actuator (Feedback Loop)
Purpose: Trace production issues back to requirements and create remediation intent
Prerequisites: Production alert or issue identified


Agent Instructions

You are closing the feedback loop from production to intent.

Workflow: Alert → REQ-* → Original Intent → New Remediation Intent

Your goal: Trace issue back and create actionable remediation intent.


Workflow

Step 1: Parse Production Alert

Extract REQ-* from alert:

{
  "alert_id": "alert_12345",
  "timestamp": "2025-11-20T15:30:00Z",
  "title": "Login latency exceeded",
  "description": "p95 latency is 750ms (threshold: 500ms)",
  "tags": {
    "req": "<REQ-ID>",  // ← Extract this
    "severity": "critical",
    "sla": "performance"
  },
  "metric": "auth.login.duration",
  "value": 750,
  "threshold": 500
}

Extracted: <REQ-ID>
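The extraction above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the function name `extract_req_id`, the ID pattern, and the sample ID `REQ-F-AUTH-001` (standing in for `<REQ-ID>`) are assumptions. The payload is plain JSON, so the inline `//` annotation shown in the alert above would have to be absent.

```python
import json
import re
from typing import Optional

# Assumed ID shape: "REQ-" followed by dash-separated uppercase/digit segments,
# as in REQ-NFR-PERF-001. Adjust to the project's real convention.
REQ_PATTERN = re.compile(r"^REQ-[A-Z0-9]+(?:-[A-Z0-9]+)*$")


def extract_req_id(alert_json: str) -> Optional[str]:
    """Return alert["tags"]["req"] if it looks like a REQ-* key, else None."""
    alert = json.loads(alert_json)
    req = alert.get("tags", {}).get("req")
    if isinstance(req, str) and REQ_PATTERN.match(req):
        return req
    return None


# Hypothetical payload; REQ-F-AUTH-001 stands in for <REQ-ID>.
sample_alert = json.dumps({
    "alert_id": "alert_12345",
    "tags": {"req": "REQ-F-AUTH-001", "severity": "critical"},
    "value": 750,
    "threshold": 500,
})

print(extract_req_id(sample_alert))
```

Returning `None` for a missing or malformed tag lets the caller decide how to handle an untraceable alert instead of failing mid-trace.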


Step 2: Trace to Requirement

Find requirement definition:

# Search for requirement
grep -rn "^## <REQ-ID>" docs/requirements/

# Output:
# docs/requirements/authentication.md:15:## <REQ-ID>: User Login

Load requirement:

## <REQ-ID>: User Login with Email and Password

**Type**: Functional Requirement
**Priority**: P0
**Intent**: INT-100

**Acceptance Criteria**:
...

**Related Requirements**:
- REQ-NFR-PERF-001: Login response < 500ms  ← SLA being violated!
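As a rough sketch of loading a requirement programmatically, the snippet below pulls the `**Intent**` field and the related REQ-* keys out of one `## <REQ-ID>` section. It assumes the markdown layout shown above; `load_requirement`, the sample document, and the IDs `REQ-F-AUTH-001` / `REQ-F-AUTH-002` are hypothetical.

```python
import re


def load_requirement(markdown: str, req_id: str) -> dict:
    """Extract the intent and related requirements from one '## <REQ-ID>' section."""
    # Capture from the requirement heading up to the next level-2 heading or end of text.
    section = re.search(
        rf"^## {re.escape(req_id)}\b.*?(?=^## |\Z)",
        markdown,
        re.MULTILINE | re.DOTALL,
    )
    if section is None:
        raise KeyError(f"{req_id} not found")
    body = section.group(0)
    intent = re.search(r"\*\*Intent\*\*:\s*(INT-\d+)", body)
    related = re.findall(r"^- (REQ-[A-Z0-9-]+)", body, re.MULTILINE)
    return {
        "req": req_id,
        "intent": intent.group(1) if intent else None,
        "related": related,
    }


# Hypothetical requirements file content, following the layout shown above.
doc = """## REQ-F-AUTH-001: User Login with Email and Password

**Type**: Functional Requirement
**Priority**: P0
**Intent**: INT-100

**Related Requirements**:
- REQ-NFR-PERF-001: Login response < 500ms

## REQ-F-AUTH-002: Password Reset

**Intent**: INT-101
"""

print(load_requirement(doc, "REQ-F-AUTH-001"))
```

Scoping the search to a single section avoids picking up the `**Intent**` line of a neighboring requirement in the same file.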

Step 3: Trace to Original Intent

Find original intent:

# From the requirement, find the intent (the field is written as **Intent**: INT-100)
grep -A 5 "^## <REQ-ID>" docs/requirements/authentication.md | grep -F '**Intent**:'

# Load intent
grep -A 20 "^## INT-100" intent.md

Intent context:

## INT-100: User Authentication System

**Requestor**: Product Team
**Priority**: P0

**Description**:
Secure user authentication for personalization and data protection.

Step 4: Trace to Implementing Code

Find code implementing requirement:

# Find files implementing <REQ-ID>
grep -rn "# Implements: <REQ-ID>" src/

# Output:
# src/auth/login.py:23:# Implements: <REQ-ID>
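The same search can be done from Python when the result needs to feed later steps. This is a sketch under the assumption that implementing files carry a `# Implements: <REQ-ID>` comment, as in the grep above; `find_implementations` and the sample ID are hypothetical names.

```python
from pathlib import Path


def find_implementations(root: Path, req_id: str) -> list[tuple[str, int]]:
    """Return (relative path, line number) for each '# Implements: <REQ-ID>' tag under root."""
    tag = f"# Implements: {req_id}"
    hits = []
    for path in sorted(root.rglob("*.py")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            # Plain substring match, mirroring the grep command above.
            if tag in line:
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


if __name__ == "__main__":
    # Hypothetical ID standing in for <REQ-ID>.
    for rel_path, lineno in find_implementations(Path("src"), "REQ-F-AUTH-001"):
        print(f"{rel_path}:{lineno}")
```

Like the grep, this only scans files that carry the tag; code that implements a requirement without the comment will not be found.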

Analyze code:

  • What could cause 750ms latency?
  • Is the database query slow?
  • Is the bcrypt cost factor too high?
  • Is caching missing or ineffective?

Step 5: Find Related Commits

Git history for requirement:

git log --all --grep="<REQ-ID>" --oneline

# Output:
# abc123 feat: Add user login (<REQ-ID>)
# def456 perf: Add caching to login (<REQ-ID>)
# ghi789 fix: Optimize db query (<REQ-ID>)

Recent changes: Did a recent commit introduce a regression?
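One way to narrow the regression question is to parse the `git log --oneline` output and flag the commit types most likely to change runtime behavior. The sketch below assumes subjects use a `type:` prefix as in the sample output above; the function names, the choice of "risky" types, and the sample ID are assumptions.

```python
def parse_oneline(log_output: str) -> list[dict]:
    """Split `git log --oneline` text into sha, commit type, and subject."""
    commits = []
    for line in log_output.splitlines():
        line = line.strip()
        if not line:
            continue
        sha, _, subject = line.partition(" ")
        prefix, sep, _ = subject.partition(":")
        # "perf(auth)" -> "perf"; subjects without a "type:" prefix are "unknown".
        kind = prefix.split("(")[0].strip() if sep else "unknown"
        commits.append({"sha": sha, "type": kind, "subject": subject})
    return commits


def regression_suspects(commits: list[dict], risky_types=("perf", "refactor", "fix")) -> list[dict]:
    """Keep commits whose type is in the assumed 'risky' set."""
    return [c for c in commits if c["type"] in risky_types]


# Sample text shaped like the output above; REQ-F-AUTH-001 stands in for <REQ-ID>.
sample_log = """abc123 feat: Add user login (REQ-F-AUTH-001)
def456 perf: Add caching to login (REQ-F-AUTH-001)
ghi789 fix: Optimize db query (REQ-F-AUTH-001)
"""

for commit in regression_suspects(parse_oneline(sample_log)):
    print(commit["sha"], commit["subject"])
```

The filter only prioritizes where to look first; it does not replace diffing the flagged commits or bisecting.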


Step 6: Create Remediation Intent

Generate new intent from alert:

# docs/intents/remediation.md

## INT-150: Fix Login Performance Degradation

**Type**: Remediation (URGENT)
**Created**: 2025-11-20
**Priority**: P0 (Critical - SLA violation)
**Source**: Production Alert (alert_12345)

**Related To**:
- **Original Intent**: INT-100 (User Authentication System)
- **Requirement**: <REQ-ID> (User login)
- **SLA Violated**: REQ-NFR-PERF-001 (Login response < 500ms)

**Problem**:
Login p95 latency increased to 750ms (threshold: 500ms).
SLA violation detected in production.

**Alert Details**:
- Alert: "Login latency exceeded"
- Metric: auth.login.duration{req:<REQ-ID>}
- Current: 750ms
- Threshold: 500ms
- Violation: +250ms (+50% over limit)

**Root Cause Analysis Needed**:
1. Database query performance (C-003: should be < 100ms)
2. bcrypt cost factor (C-001: cost 12, ~200ms expected)
3. Caching effectiveness (if implemented)
4. External service calls (if any)

**Proposed Investigation**:
1. Check database query times (should be < 100ms)
2. Profile bcrypt hashing time (should be ~200ms)
3. Check for N+1 queries
4. Review recent code changes (commits for <REQ-ID>)

**Success Criteria**:
- p95 latency < 500ms (back within SLA)
- p50 latency < 200ms (stretch goal)
- Root cause identified and fixed
- No regression in other requirements

**Impact**:
- Affected Users: at least 5% of login attempts (those at or above the 750ms p95)
- Business Impact: Poor user experience, potential churn
- SLA Status: VIOLATED (critical)
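The "Violation" line in the alert details is simple arithmetic on the alert's `value` and `threshold` fields. A minimal sketch, with hypothetical helper names:

```python
def violation(value_ms: float, threshold_ms: float) -> tuple[float, float]:
    """Return (absolute overshoot in ms, overshoot as a percentage of the threshold)."""
    delta = value_ms - threshold_ms
    return delta, 100.0 * delta / threshold_ms


def format_violation(value_ms: float, threshold_ms: float) -> str:
    """Render the overshoot in the style used in the intent template above."""
    delta, pct = violation(value_ms, threshold_ms)
    return f"{delta:+.0f}ms ({pct:+.0f}% over limit)"


# Values from the alert: 750ms measured against a 500ms threshold.
print(format_violation(750, 500))  # +250ms (+50% over limit)
```

Computing the figure from the alert payload keeps the remediation intent consistent with the alert if the numbers change between firings.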

Step 7: Link to Traceability

Create feedback loop entry:

# docs/traceability/feedback-loops.yml

alerts:
  - alert_id: "alert_12345"
    timestamp: "2025-11-20T15:30:00Z"
    title: "Login latency exceeded"
    requirement: "<REQ-ID>"
    original_intent: "INT-100"
    remediation_intent: "INT-150"  # ← New intent created
    status: "OPEN"
    assigned_to: "Backend Team"

Step 8: Commit Remediation Intent

git add docs/intents/remediation.md docs/traceability/feedback-loops.yml
git commit -m "FEEDBACK: Create INT-150 from production alert (<REQ-ID>)

Create remediation intent from SLA violation alert.

Alert:
- ID: alert_12345
- Title: Login latency exceeded
- Metric: auth.login.duration (750ms, threshold 500ms)
- Requirement: <REQ-ID>

Traceability:
  Alert → req:<REQ-ID> → REQ-NFR-PERF-001 → INT-100 → INT-150 (new)

Remediation Intent Created:
- INT-150: Fix login performance degradation
- Priority: P0 (SLA violation)
- Related: <REQ-ID>, REQ-NFR-PERF-001

Feedback Loop:
  Production Issue → New Intent → SDLC Cycle Begins Again ♻️

Next: Investigate root cause, implement fix using TDD workflow
"

Output Format

[TRACE PRODUCTION ISSUE - alert_12345]

Alert Details:
  ID: alert_12345
  Title: "Login latency exceeded"
  Timestamp: 2025-11-20T15:30:00Z
  Severity: CRITICAL

Requirement Trace:
  Alert Tag: req:<REQ-ID>
    ↓
  Requirement: <REQ-ID> (User Login)
  Location: docs/requirements/authentication.md:15
    ↓
  Related SLA: REQ-NFR-PERF-001 (Login < 500ms)
    ↓
  Original Intent: INT-100 (User Authentication System)
  Location: intent.md:5

Code Trace:
  Implementation: src/auth/login.py:23
  Recent Commits: 3 commits in last 7 days
    - abc123: feat: Add user login (3 days ago)
    - def456: perf: Add caching to login (5 days ago)
    - ghi789: fix: Optimize db query (7 days ago)

Root Cause Hypothesis:
  1. Database query slow (check C-003: should be < 100ms)
  2. bcrypt too slow (check C-001: cost 12 → ~200ms)
  3. Recent caching change (commit def456)
  4. Increased traffic/load
  4. Increased traffic/load

Remediation Intent Created:
  ✓ INT-150: Fix login performance degradation
    - Type: Remediation (URGENT)
    - Priority: P0
    - Related: <REQ-ID>, REQ-NFR-PERF-001
    - Source: Production alert_12345

Feedback Loop:
  ✓ Alert → Requirement traced
  ✓ Original intent identified
  ✓ Remediation intent created
  ✓ Traceability logged

Next Steps:
  1. Assign INT-150 to backend team
  2. Investigate root cause (profile, DB queries)
  3. Implement fix using TDD workflow
  4. Deploy fix and verify SLA restored

✅ Production Issue Traced!
   Feedback loop closed
   Remediation intent ready for SDLC

Notes

Why trace production issues?

  • Close feedback loop: Production → Intent → SDLC
  • Root cause: Understand which requirement is problematic
  • Living system: Requirements evolve based on production reality
  • Homeostasis: Production deviations generate corrective intents

Feedback loop:

Intent (INT-100)
  → Requirements (<REQ-ID>)
  → Design → Code → Deploy
  → Production (running)
  → Alert (SLA violation)
  → Trace back to <REQ-ID>
  → Create new Intent (INT-150: Fix performance)
  → SDLC cycle begins again ♻️

Homeostasis Goal:

desired_state:
  all_alerts_traceable_to_req: true
  all_violations_create_intent: true
  feedback_loop: closed

"Excellence or nothing" 🔥

Install

Requires askill CLI v1.0+


Metadata

License: unknown
Version: -
Updated: 2/5/2026
Publisher: majiayu000

Tags

ci-cd, database, github-actions, observability, security