Instrument Backend Observability
Purpose
Make backend services diagnosable in production by standardizing logs, error tracking, metrics, and tracing.
When to use
Use this skill when you are:
- Adding new endpoints or background jobs that require monitoring
- Debugging production incidents (5xx spikes, latency regressions)
- Integrating an error tracker or APM solution
- Standardizing log formats and correlation IDs
Inputs
- The runtime environment(s) and deployment model
- Current logging and monitoring stack (if any)
- What “good” looks like: SLOs, latency targets, error budgets
Outputs
- A consistent logging and error tracking plan
- Standard fields for correlation and debugging
- A minimal alert strategy for critical signals
Core rules
- Unknown errors MUST be captured by an error tracker (or equivalent) with context.
- Logs MUST be structured and SHOULD include a correlation/request ID.
- Sensitive data MUST NOT be logged (tokens, passwords, secrets, raw PII beyond what is required).
- Observability MUST NOT change business behavior (instrumentation should be side-effect free).
Recommended signals
- Errors
- rate of
5xx - rate of domain-specific
4xx(for detecting client issues or abuse)
- rate of
- Latency
- p50/p95/p99 per endpoint
- Saturation
- CPU, memory, DB connection pool utilization
- Traffic
- request volume per endpoint
Steps
- Ensure a request/correlation ID exists for every request.
- Add structured logs at key boundaries:
- request start/end (method, path, status, duration)
- key domain actions (entity IDs, operation names)
- Capture exceptions with context:
- endpoint name
- user/tenant identifiers (redacted as needed)
- correlation ID
- Add metrics for:
- request duration
- error counts
- Define alerts for:
- sustained 5xx rate
- sustained latency regression
- Verify by simulating:
- a known operational error
- an unknown exception
Verification
- All requests have a correlation/request ID in logs
- Structured logs include method, path, status, and duration
- Exceptions are captured with correlation ID and endpoint context
- Sensitive data (tokens, passwords, PII) is not present in logs
- Alerts fire for sustained 5xx rates (test with simulated errors)
- Latency metrics are recorded per endpoint
Boundaries
- MUST NOT log secrets, tokens, passwords, or raw PII
- MUST NOT allow observability code to change business behavior
- MUST NOT create high-cardinality metric labels (e.g., user IDs as labels)
- SHOULD NOT log request/response bodies in production (except for debugging)
- SHOULD NOT rely solely on logs for error tracking (use a dedicated tracker)
- SHOULD NOT skip correlation ID propagation in async operations
Included assets
- Templates:
./templates/includes recommended log fields and exception capture patterns. - Examples:
./examples/includes incident triage checklists.
