Prometheus

Workflow

Confirm SLOs and critical service signals.
Define metric model and label policy before instrumentation.
Configure scrape jobs and service discovery with security constraints.
Implement recording rules for expensive or standardized queries.
Build alert rules with clear severity and runbook links.
Tune retention and storage based on ingestion profile.
Validate alert and query behavior under failure scenarios.

Preflight (Ask / Check First)

Prometheus version and deployment model.
Ingestion volume and retention targets.
Alertmanager routing and escalation policy.
Known cardinality hotspots.
Multi-tenant or federation requirements.

Metric and Label Design

Use stable, descriptive metric names with consistent units.
Prefer low-cardinality labels and bounded value sets.
Avoid dynamic identifiers (request IDs, user IDs) as labels.
Keep most metrics unlabeled or minimally labeled.
Prefer counters/histograms with clear semantics.
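
The label policy above can also be enforced at scrape time. The following is a minimal sketch, not a definitive configuration: the job name api, the target address, and the offending label names request_id and user_id are all hypothetical. It uses the labeldrop relabel action to strip unbounded identifier labels before samples are stored.

```yaml
scrape_configs:
  - job_name: api                                   # hypothetical job name
    static_configs:
      - targets: ["api.example.internal:8080"]      # hypothetical target
    metric_relabel_configs:
      # Remove unbounded identifier labels from every scraped series
      # before ingestion. Label names here are assumptions.
      - action: labeldrop
        regex: "request_id|user_id"
```

Dropping a label can make previously distinct series collide, so this works best as a safety net while the instrumentation itself is corrected.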

Scrape and Rule Architecture

Keep scrape intervals aligned with SLO sensitivity.
Separate high-churn targets from steady infrastructure jobs.
Use recording rules for heavy dashboards and common alert predicates.
Group related rules and set rule evaluation intervals intentionally.
Keep rule files versioned and reviewed like code.
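Convert legacy Prometheus 1.x rule files to the YAML format before upgrading. promtool update rules handled this in early 2.x releases and may be missing from newer promtool builds, so check promtool --help first.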
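
As a sketch of separating steady infrastructure from high-churn targets, the configuration below gives each class its own job and scrape interval. It assumes a Kubernetes environment for the high-churn job; the job names, target address, intervals, and the prometheus.io/scrape opt-in annotation are illustrative assumptions, not requirements.

```yaml
global:
  scrape_interval: 60s         # default for steady infrastructure jobs
  evaluation_interval: 60s     # default rule evaluation interval

rule_files:
  - "rules/*.yml"

scrape_configs:
  # Steady infrastructure: inherits the 60s global interval.
  - job_name: node
    static_configs:
      - targets: ["node1.example.internal:9100"]    # hypothetical target

  # High-churn, SLO-sensitive workloads: tighter interval, discovered dynamically.
  - job_name: app-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

Keeping the two classes in separate jobs lets intervals, limits, and discovery credentials be tuned independently.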

Rule Group Pattern

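groups:
  - name: service-latency
    # Evaluation interval for this group; overrides the global default.
    interval: 30s
    rules:
      # Per-service request rate over 5 minutes, named level:metric:operations.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))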

Alert Quality and Reliability

Alert on symptoms that map to user impact.
Add a for duration to each alert rule so the condition must hold before the alert fires, which reduces flapping.
Include clear labels/annotations and runbook pointers.
Keep warning vs critical thresholds consistent.
Test alert queries against realistic failure data.
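
A hedged example of an alert rule that follows these points: it targets a user-visible symptom (error ratio), uses a for duration, and carries a severity label plus a runbook link. The code label on http_requests_total, the 5% threshold, the 10m hold time, and the runbook URL are assumptions chosen for illustration.

```yaml
groups:
  - name: service-availability
    rules:
      - alert: HighErrorRate
        # Share of 5xx responses per service over the last 5 minutes.
        # Assumes http_requests_total carries a "code" label.
        expr: |
          sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
            /
          sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 10m                       # condition must hold before firing
        labels:
          severity: critical
        annotations:
          summary: "High 5xx ratio on {{ $labels.service }}"
          description: "5xx ratio is {{ $value | humanizePercentage }} (threshold 5%)."
          runbook_url: "https://runbooks.example.internal/high-error-rate"   # hypothetical URL
```

A matching warning-level rule would normally reuse the same expression with a lower threshold, which keeps the two severities consistent.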

TSDB and Operations

Set retention by business need and storage budget.
Monitor scrape failures, target churn, and WAL pressure.
Plan compaction/storage overhead with growth forecasts.
Keep a backup/restore strategy for rules and critical state.
Validate upgrade path and config compatibility.
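
The monitoring points above can be expressed as self-monitoring alerts. This is a sketch under stated assumptions: the thresholds, hold times, and the head-series budget of 2e6 are placeholders to be replaced with values from the actual ingestion profile. The metrics used (up, prometheus_tsdb_wal_corruptions_total, prometheus_tsdb_head_series) are exposed by Prometheus itself.

```yaml
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} in job {{ $labels.job }} is down"

      - alert: PrometheusWALCorruption
        # increase() over 1h keeps the alert active for an hour after a corruption event.
        expr: increase(prometheus_tsdb_wal_corruptions_total[1h]) > 0
        labels:
          severity: critical
        annotations:
          summary: "WAL corruption detected on {{ $labels.instance }}"

      - alert: PrometheusHeadSeriesHigh
        # 2e6 is a hypothetical cardinality budget, not a recommended value.
        expr: prometheus_tsdb_head_series > 2e6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series on {{ $labels.instance }} exceed the cardinality budget"
```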

Upgrade Notes

For Prometheus 3.0, review the migration guide and remove --enable-feature flags for behavior that is now enabled by default.
Revalidate PromQL range selector semantics (left-open and right-closed as of 3.0) and scrape config changes after major upgrades.

Security and Governance

Restrict scrape endpoints and admin APIs.
Protect service discovery credentials.
Isolate multi-tenant data where required.
Keep an audit trail for rule/config changes.
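
A minimal sketch of a scrape job restricted to an authenticated TLS endpoint. The job name, file paths, and host names are hypothetical; the token is read from a file so it stays out of the versioned configuration.

```yaml
scrape_configs:
  - job_name: secure-app                              # hypothetical job name
    scheme: https
    authorization:
      type: Bearer
      # Token is loaded from disk rather than embedded in the config.
      credentials_file: /etc/prometheus/secrets/scrape-token   # hypothetical path
    tls_config:
      ca_file: /etc/prometheus/tls/ca.crt             # hypothetical path
      server_name: app.example.internal
    static_configs:
      - targets: ["app.example.internal:8443"]        # hypothetical target
```

Separately, the TSDB admin API endpoints stay disabled unless Prometheus is started with --web.enable-admin-api, so leaving that flag off is the simplest way to restrict them.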

Validation Commands

promtool check config prometheus.yml
promtool check rules rules/*.yml
curl -fsS http://localhost:9090/-/ready

Common Failure Modes

High-cardinality labels exhausting memory.
Alert rules without a for clause, causing flapping storms.
Expensive ad hoc queries running as dashboard defaults.
Stale scrape targets from discovery misconfiguration.
Retention settings disconnected from disk capacity.
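
Per-job scrape limits are one guard against the cardinality failure mode listed above. The sketch below assumes a hypothetical job and target, and the numeric limits are placeholders rather than recommendations. When a limit is exceeded, the whole scrape is treated as failed, so the problem shows up as a scrape failure instead of silent memory growth.

```yaml
scrape_configs:
  - job_name: api                                   # hypothetical job name
    # Fail the scrape if more samples than this remain after metric relabeling.
    sample_limit: 50000
    # Fail the scrape if any series carries more labels than this.
    label_limit: 30
    # Fail the scrape if any label value is longer than this.
    label_value_length_limit: 200
    static_configs:
      - targets: ["api.example.internal:8080"]      # hypothetical target
```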

Definition of Done

Metric model is documented and cardinality-safe.
Rule and alert sets pass promtool checks.
Alerting is actionable and noise-controlled.
Retention and storage policies are capacity-validated.
Security controls and ownership are defined.

Operational Checklist

Cardinality budget review is part of instrumentation PRs.
Alert coverage maps to top service-level risks.
Rule files have named owners and linked on-call runbooks.
Retention settings are reviewed against monthly growth.
Scrape failures and stale targets are continuously monitored.

References

references/prometheus-2026-02-18.md

Reference Index

rg -n "metric|label|cardinality" references/prometheus-2026-02-18.md
rg -n "rule group|recording|alert" references/prometheus-2026-02-18.md
rg -n "retention|TSDB|operations" references/prometheus-2026-02-18.md
rg -n "promtool|validation" references/prometheus-2026-02-18.md
