askill
dd-monitors

dd-monitorsSafety 85Repository

Monitor management - create, update, mute, and alerting best practices.

463 stars
9.3k downloads
Updated 3/14/2026

Package Files

Loading files...
SKILL.md

Datadog Monitors

Create, manage, and maintain monitors for alerting.

Prerequisites

This requires Go or the pup binary in your path.

pup - go install github.com/datadog-labs/pup@latest Ensure ~/go/bin is in $PATH.

Quick Start

pup auth login

Common Operations

List Monitors

pup monitors list
pup monitors list --tags "team:platform"
pup monitors list --status "Alert"

Get Monitor

pup monitors get <id> --json

Create Monitor

pup monitors create \
  --name "High CPU on web servers" \
  --type "metric alert" \
  --query "avg(last_5m):avg:system.cpu.user{env:prod} > 80" \
  --message "CPU above 80% @slack-ops"

Mute/Unmute

# Mute with duration
pup monitors mute --id 12345 --duration 1h

# Or mute with specific end time
pup monitors mute --id 12345 --end "2024-01-15T18:00:00Z"

# Unmute
pup monitors unmute --id 12345

⚠️ Monitor Creation Best Practices

1. Avoid Alert Fatigue

RuleWhy
No flapping alertsUse last_Xm not last_1m
Meaningful thresholdsBased on SLOs, not guesses
Actionable alertsIf no action needed, don't alert
Include runbook@runbook-url in message
# WRONG - will flap constantly
query = "avg(last_1m):avg:system.cpu.user{*} > 50"  # ❌ Too sensitive

# CORRECT - stable alerting
query = "avg(last_5m):avg:system.cpu.user{env:prod} by {host} > 80"  # ✅ Reasonable window

2. Use Proper Scoping

# WRONG - alerts on everything
query = "avg(last_5m):avg:system.cpu.user{*} > 80"  # ❌ No scope

# CORRECT - scoped to what matters
query = "avg(last_5m):avg:system.cpu.user{env:prod,service:api} by {host} > 80"  # ✅

3. Set Recovery Thresholds

monitor = {
    "query": "avg(last_5m):avg:system.cpu.user{env:prod} > 80",
    "options": {
        "thresholds": {
            "critical": 80,
            "critical_recovery": 70,  # ✅ Prevents flapping
            "warning": 60,
            "warning_recovery": 50
        }
    }
}

4. Include Context in Messages

message = """
## High CPU Alert

Host: {{host.name}}
Current Value: {{value}}
Threshold: {{threshold}}

### Runbook
1. Check top processes: `ssh {{host.name}} 'top -bn1 | head -20'`
2. Check recent deploys
3. Scale if needed

@slack-ops @pagerduty-oncall
"""

⚠️ NEVER Delete Monitors Directly

Use safe deletion workflow (same as dashboards):

def safe_mark_monitor_for_deletion(monitor_id: str, client) -> bool:
    """Mark monitor instead of deleting."""
    monitor = client.get_monitor(monitor_id)
    name = monitor.get("name", "")
    
    if "[MARKED FOR DELETION]" in name:
        print(f"Already marked: {name}")
        return False
    
    new_name = f"[MARKED FOR DELETION] {name}"
    client.update_monitor(monitor_id, {"name": new_name})
    print(f"✓ Marked: {new_name}")
    return True

Monitor Types

TypeUse Case
metric alertCPU, memory, custom metrics
query alertComplex metric queries
service checkAgent check status
event alertEvent stream patterns
log alertLog pattern matching
compositeCombine multiple monitors
apmAPM metrics

Audit Monitors

# Find monitors without owners
pup monitors list --json | jq '.[] | select(.tags | contains(["team:"]) | not) | {id, name}'

# Find noisy monitors (high alert count)
pup monitors list --json | jq 'sort_by(.overall_state_modified) | .[:10] | .[] | {id, name, status: .overall_state}'

Downtime vs Muting

UseWhen
Mute monitorQuick one-off, < 1 hour
DowntimeScheduled maintenance, recurring
# Downtime (preferred)
pup downtime create \
  --scope "env:prod" \
  --monitor-tags "team:platform" \
  --start "2024-01-15T02:00:00Z" \
  --end "2024-01-15T06:00:00Z"

Failure Handling

ProblemFix
Alert not firingCheck query returns data, thresholds
Too many alertsIncrease window, add recovery threshold
No data alertsCheck agent connectivity, metric exists
Auth errorpup auth refresh

References

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

88/100Analyzed 3/27/2026

Highly polished public skill for Datadog monitor management with the `pup` CLI tool. Provides comprehensive coverage of monitor lifecycle operations (list, get, create, mute/unmute, downtime), excellent best practices guidance with anti-patterns, and a safety-focused deletion workflow. Well-structured with tables, code examples, and clear actionability. Includes troubleshooting section and external references. Suitable for any Datadog user working with monitors.

85
92
90
88
94

Metadata

Licenseunknown
Version-
Updated3/14/2026
Publisherdatadog-labs

Tags

alertingalertsdatadogdd-monitorsmonitors