tron-dashboard-creating


Create production-ready Grafana dashboards for TRON team services including consumer dashboards, task metadata, RQ rules manager, and Kafka metrics. Use when building dashboards for task event consumers or RMS TRON services.

Updated 1/24/2026

SKILL.md

TRON Dashboard Creation

Purpose

Create on-call friendly Grafana dashboards for TRON team services using grafanalib and axon_helpers. Implements Grafana best practices including health overview panels, tiered information design, and service-specific alert annotations.

When NOT to Use

  • Generic API dashboards without Kafka consumers (use standard rms_helpers patterns)
  • Non-RMS services (use cd/ or common/ helpers)
  • Simple metric additions to existing dashboards

Quick Start: Minimal TRON Consumer Dashboard

from grafanalib.core import Template, Threshold
from axon_helpers.graph_helpers import AxonGraph, AxonSingleStat, GenDashboard, UNITS
from axon_helpers.rms_helpers import generate_dashboard_template_values

# Template variable for filtering by deployment type
deployment_type_template = Template(
    name="deployment_type",
    label="Deployment Type",
    type="custom",
    default="All",
    includeAll=False,
    query="All : .*,Internal : .*-internal.*,Customer : .*-customer.*",
    options=[
        {"selected": True, "text": "All", "value": ".*"},
        {"selected": False, "text": "Internal", "value": ".*-internal.*"},
        {"selected": False, "text": "Customer", "value": ".*-customer.*"},
    ],
)

rows = {
    "Health Overview": [
        AxonSingleStat(
            title="Consumer Lag",
            expressions=[{"expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",group=~".*taskmetadatasvc-task-event-consumer.*"})'}],
            thresholds=[
                Threshold("green", 0, 0.0),
                Threshold("orange", 1, 2500),
                Threshold("red", 2, 5000),
            ],
            reduceCalc="lastNotNull",
            graphMode="area",
            format="short",
            span=4,
        ),
    ],
}

dashboard = GenDashboard(
    title="My TRON Consumer Dashboard",
    uid="my-tron-consumer-dashboard",
    templating=generate_dashboard_template_values(
        additional_templates_list=[deployment_type_template]
    ),
    rows=rows,
)
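
In the custom template above, the `query` string and the `options` list repeat the same text/value pairs, so the two can drift apart when a choice is added. The following is a minimal sketch of one way to derive both from a single mapping. `build_custom_choices` is a hypothetical helper, not part of grafanalib or axon_helpers, and the "text : value" query format is simply copied from the example above.

```python
def build_custom_choices(choices, selected):
    """Return (query, options) for a custom template variable.

    choices:  dict of display text -> value, in display order.
    selected: display text of the entry selected by default.
    """
    if selected not in choices:
        raise ValueError(f"unknown default choice: {selected!r}")
    # Same "text : value" pairs, comma-separated, as in the hand-written query.
    query = ",".join(f"{text} : {value}" for text, value in choices.items())
    options = [
        {"selected": text == selected, "text": text, "value": value}
        for text, value in choices.items()
    ]
    return query, options


DEPLOYMENT_CHOICES = {
    "All": ".*",
    "Internal": ".*-internal.*",
    "Customer": ".*-customer.*",
}

query, options = build_custom_choices(DEPLOYMENT_CHOICES, selected="All")
```

The resulting `query` and `options` could then be passed to `Template(...)` in place of the literal values.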

Core Workflows

Workflow 1: Health Overview Dashboard (On-Call Triage)

Create a health overview row for instant on-call triage. This follows the KubeCon "Foolproof K8s Dashboards for Sleep-Deprived On-Calls" pattern.

Step 1: Define SLO Thresholds

# Define SLO constants at top of file (not hardcoded in panels)
CUSTOMER_LATENCY_WARNING_MS = 120000   # 2 minutes
CUSTOMER_LATENCY_CRITICAL_MS = 300000  # 5 minutes
CUSTOMER_LAG_WARNING = 2500
CUSTOMER_LAG_CRITICAL = 5000
DLQ_WARNING = 5
DLQ_CRITICAL = 10

Step 2: Create Health Stat Panels

from grafanalib.core import Threshold
from axon_helpers.graph_helpers import AxonSingleStat

def health_stat(title, description, expr, unit, thresholds, span=2):
    """Create a stat panel showing current metric value with thresholds."""
    return AxonSingleStat(
        title=title,
        description=description,
        expressions=[{"expr": expr, "legendFormat": title}],
        thresholds=thresholds,
        reduceCalc="lastNotNull",
        graphMode="area",
        format=unit,
        decimals=1,
        span=span,
    )

# Health Overview panels (6 stat panels + timeline)
health_panels = [
    health_stat(
        title="Overall Health",
        description="Health score 0-100% based on Kafka lag",
        # Zero lag maps to 100%; lag at or above CUSTOMER_LAG_CRITICAL maps to 0%
        expr=f'(1 - clamp_max(sum(kafka_consumergroup_group_topic_sum_lag{{...}}) / {CUSTOMER_LAG_CRITICAL}, 1)) * 100',
        unit="percent",
        thresholds=[
            Threshold("red", 0, 0.0),
            Threshold("orange", 1, 50.0),
            Threshold("green", 2, 90.0),
        ],
    ),
    health_stat(
        title="Max Latency",
        description="Max consumer latency in minutes",
        # Metric is in milliseconds; divide by 60000 to display minutes
        expr='max(rms_taskmetadatasvc_task_event_consumer_read_latency{...}) / 60000',
        unit="m",
        thresholds=[
            Threshold("green", 0, 0.0),
            Threshold("orange", 1, CUSTOMER_LATENCY_WARNING_MS / 60000),   # 2 minutes
            Threshold("red", 2, CUSTOMER_LATENCY_CRITICAL_MS / 60000),     # 5 minutes
        ],
    ),
    # ... Consumer Lag, DLQ Rate, Throughput panels
]

Step 3: Add Timeline Graph

timeline = AxonGraph(
    title="Kafka Lag Timeline",
    expressions=[{
        "expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",...})',
        "legendFormat": "Total Consumer Lag",
    }],
    thresholds=[
        Threshold("green", 0, 0.0),
        Threshold("orange", 1, float(CUSTOMER_LAG_WARNING)),
        Threshold("red", 2, float(CUSTOMER_LAG_CRITICAL)),
    ],
    thresholdsStyleMode="line+area",
    span=12,
    unit=UNITS.SHORT,
)

Validation:

  • Health overview is first row (not collapsed)
  • 5-6 stat panels cover: Health, Latency, Lag, DLQ, Throughput
  • Thresholds use SLO constants (not hardcoded values)
  • Timeline shows historical context
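
To make the Overall Health expression easier to reason about, here is a plain-Python mirror of its arithmetic together with the threshold bands from Step 2. This is only an illustration of the formula; `health_score` and `health_color` are hypothetical names and the real value is computed by Prometheus, not by this code.

```python
CUSTOMER_LAG_CRITICAL = 5000  # same value as the SLO constant in Step 1


def health_score(total_lag, critical_lag=CUSTOMER_LAG_CRITICAL):
    """Mirror of (1 - clamp_max(lag / critical, 1)) * 100."""
    ratio = min(total_lag / critical_lag, 1.0)  # clamp_max(..., 1)
    return (1.0 - ratio) * 100.0


def health_color(score):
    """Apply the Overall Health bands: red from 0, orange from 50, green from 90."""
    if score >= 90.0:
        return "green"
    if score >= 50.0:
        return "orange"
    return "red"
```

With these numbers, a lag of 2500 (the warning SLO) gives a score of exactly 50, which is where the panel turns from orange to red, and any lag at or above 5000 pins the score at 0.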

Workflow 2: Consumer Dashboard (Using ConsumerMetrics)

Use the ConsumerMetrics class for standard consumer dashboard sections.

Step 1: Import and Configure

from axon_helpers.rms_helpers import ConsumerMetrics, Query
from axon_helpers.utils import flatten

# Filter tags for internal vs customer
isInternalServiceTag = ', service=~".*-internal.*"'
isCustomerServiceTag = ', service!~".*-internal.*"'
isNotDLQTag = ', is_dlq!="true"'

Step 2: Create Consumer Section

# For Task Metadata Consumer (Internal)
internal_consumer_graphs = flatten(
    ConsumerMetrics(
        service_name="Task Metadata Consumer (Internal)",
        container_name="taskmetadatasvc-task-event-consumer-internal",
        consumer_group=".*taskmetadatasvc-task-event-consumer-internal",
        fetch_latency_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_read_latency",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        process_latency_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_sync_latency",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        dlq_submitted_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_dlq_submitted_count",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
        message_volume_query=Query(
            "rms_taskmetadatasvc_task_event_consumer_processed_count",
            additional_expressions=isNotDLQTag + isInternalServiceTag,
        ),
    ).generate_consumer_graphs()
)

Step 3: Add to Dashboard Rows

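# customer_consumer_graphs is built the same way as internal_consumer_graphs,
# using isCustomerServiceTag in place of isInternalServiceTag.
rows = {
    "Health Overview": health_panels + [timeline],
    "TASK Events: Task Metadata Consumer (Internal)": internal_consumer_graphs,
    "TASK Events: Task Metadata Consumer (Customer)": customer_consumer_graphs,
}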

Validation:

  • Internal and Customer are separate rows (not combined)
  • Proper filter tags applied (isInternalServiceTag, isCustomerServiceTag)
  • DLQ metrics excluded from main consumer (is_dlq!="true")
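
The internal and customer filter tags are meant to split the same metric into two non-overlapping sets. The sketch below checks that split in plain Python. It assumes Prometheus's behavior of anchoring label regexes at both ends (hence `fullmatch`), and the service names are made up for illustration.

```python
import re

INTERNAL_SERVICE_REGEX = r".*-internal.*"  # regex used in isInternalServiceTag


def split_services(service_names):
    """Partition service label values the way =~ and !~ would.

    Prometheus anchors label regexes at both ends, so re.fullmatch is the
    closest Python equivalent.
    """
    pattern = re.compile(INTERNAL_SERVICE_REGEX)
    internal = [s for s in service_names if pattern.fullmatch(s)]
    customer = [s for s in service_names if not pattern.fullmatch(s)]
    return internal, customer


# Hypothetical service label values, for illustration only.
services = [
    "taskmetadatasvc-task-event-consumer-internal",
    "taskmetadatasvc-task-event-consumer-customer",
    "taskmetadatasvc-task-event-consumer",
]
internal, customer = split_services(services)
```

Note that `service!~".*-internal.*"` treats every non-internal service as customer, including one with no suffix at all, whereas the `Customer` choice in the deployment type template only matches names containing `-customer`.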

Workflow 3: Alert-Linked Dashboard

Link dashboards to alerts for directed browsing (Grafana best practice).

Step 1: Define Service-Specific Alert Pattern

import grafanalib.core as G

# CRITICAL: Use narrow patterns, NOT generic like ".*[Ll]ag.*"
# Generic patterns match 90+ alerts and create solid annotation blocks
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)

Step 2: Create Alert Annotations

alert_annotations = G.Annotations(
    list=[
        {
            "builtIn": 0,
            "datasource": {"type": "prometheus", "uid": "${DataSource}"},
            "enable": True,
            "expr": f'ALERTS{{alertname=~"{SERVICE_ALERT_PATTERN}", alertstate="firing", axon_cluster=~"$axon_cluster"}}',
            "hide": False,
            "iconColor": "rgba(255, 120, 50, 0.25)",  # Orange with 25% opacity
            "name": "Service Alerts",
            "titleFormat": "{{alertname}}",
            "useValueForTime": False,
        },
    ]
)

Step 3: Apply to Dashboard

dashboard = GenDashboard(
    title="RMS TRON Consumer Dashboard",
    # ... other config
    annotations=alert_annotations,
)

Validation:

  • Alert pattern is service-specific (not generic)
  • Includes axon_cluster=~"$axon_cluster" for environment filtering
  • Icon color has transparency (25% opacity for subtle background)
  • Alert annotations appear on relevant panels only
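
The annotation expression mixes f-string interpolation with PromQL braces, which is easy to get wrong. As a small sketch, the expression can be built in a helper and inspected before it goes into the dashboard. `build_alert_expr` is a hypothetical name, not an axon_helpers function.

```python
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)


def build_alert_expr(alert_pattern, cluster_variable="$axon_cluster"):
    """Build the ALERTS selector used by the annotation.

    Doubled braces in an f-string produce literal { and } in the output,
    while single braces interpolate the Python value.
    """
    return (
        f'ALERTS{{alertname=~"{alert_pattern}", '
        f'alertstate="firing", axon_cluster=~"{cluster_variable}"}}'
    )


expr = build_alert_expr(SERVICE_ALERT_PATTERN)
```

Printing `expr` once during development is a cheap way to confirm that the selector has exactly one pair of braces and still carries the `axon_cluster` filter.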

Panel Selection Guide

| Metric Type | Panel | Best Practice Source |
| --- | --- | --- |
| Current health/status | AxonSingleStat | KubeCon: instant triage |
| Latency (p50/p90/p99) | AxonGraph (lines) | RED Method: Duration |
| Message volume/rate | AxonGraph (bars) | RED Method: Rate |
| Error/fault counts | AxonGraph (bars, stacked) | USE Method: Errors |
| Kafka consumer lag | AxonGraph (lines) | USE Method: Saturation |
| Top-K operations | AxonBarGauge | Grafana best practices |
| HPA replica status | AxonGraph (lines) | KubeCon: normalization |

See PATTERNS.md for complete code examples for each pattern.


Template Variables

Always include these variables:

from axon_helpers.rms_helpers import generate_dashboard_template_values

# Standard RMS template variables
templating = generate_dashboard_template_values(
    additional_templates_list=[
        deployment_type_template,  # Internal vs Customer
        consumer_group_template,   # Kafka consumer group filter
    ]
)

Standard variables provided by generate_dashboard_template_values():

  • $DataSource - Prometheus/Cortex datasource
  • $axon_cluster - Cluster/environment selector

See REFERENCE.md for custom template variable patterns.


Alert Setup

CRITICAL: Use service-specific patterns, NOT generic patterns

# WRONG - matches 90+ alerts, creates solid annotation blocks
ALERT_PATTERN = ".*[Ll]ag.*"

# CORRECT - matches only TRON service alerts
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)

See ALERTS.md for complete alert annotation patterns and threshold integration.
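
To see why the generic pattern is a problem, the two patterns can be compared against a list of alert names. This is a rough sketch: the alert names below are invented for illustration, and `re.fullmatch` is used because Prometheus anchors label regexes at both ends.

```python
import re

GENERIC_PATTERN = ".*[Ll]ag.*"
SERVICE_ALERT_PATTERN = (
    "RMS.*TaskMetadataSvc.*|"
    "RMS.*RQ.*Rule.*Manager.*|"
    "RMS.*RuleManager.*"
)


def matching_alerts(pattern, alert_names):
    """Return the alert names that the pattern would select with =~."""
    compiled = re.compile(pattern)
    return [name for name in alert_names if compiled.fullmatch(name)]


# Hypothetical alert names, for illustration only.
alert_names = [
    "RMSTaskMetadataSvcConsumerLagHigh",
    "RMSRuleManagerDLQRateHigh",
    "KafkaBrokerReplicationLag",
    "PaymentsQueueLagCritical",
    "NodeDiskPressure",
]

generic_hits = matching_alerts(GENERIC_PATTERN, alert_names)
service_hits = matching_alerts(SERVICE_ALERT_PATTERN, alert_names)
```

In this sample the generic pattern picks up two unrelated lag alerts and misses the service's own DLQ alert, while the service-specific pattern selects exactly the two TRON alerts.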


Row Organization (Best Practice)

rows = {
    # TIER 1: Health Overview - NOT collapsed (on-call triage)
    "Health Overview": health_panels + [timeline],

    # TIER 2: Infrastructure - Collapsed by default
    "Overview: Kafka Infrastructure": kafka_panels,
    "Overview: Consumer Pod Replicas": replica_panels,

    # TIER 3: Service-specific - Collapsed by default
    "TASK Events: Rules Manager Consumers": task_event_panels,
    "TASK Events: Task Metadata Consumer (Internal)": internal_panels,
    "TASK Events: Task Metadata Consumer (Customer)": customer_panels,

    # TIER 4: Advanced/DLQ - Collapsed by default
    "DLQ Processors: All Event Types": dlq_panels,
}

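dashboard = GenDashboard(
    # ... title, uid, templating
    rows=rows,
    # Every row except "Health Overview" starts collapsed (Tiers 2-4)
    rows_to_collapse_by_title={
        "Overview: Kafka Infrastructure",
        "Overview: Consumer Pod Replicas",
        "TASK Events: Rules Manager Consumers",
        "TASK Events: Task Metadata Consumer (Internal)",
        "TASK Events: Task Metadata Consumer (Customer)",
        "DLQ Processors: All Event Types",
    },
)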
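
A hand-maintained set of collapsed titles can fall out of sync with the `rows` dict when a row is added or renamed. One possible approach, sketched below under the assumption that only Tier 1 should stay expanded, is to derive the set from the dict. `titles_to_collapse` is a hypothetical helper and the empty lists stand in for real panel lists.

```python
ALWAYS_EXPANDED = {"Health Overview"}  # Tier 1 stays open for on-call triage


def titles_to_collapse(rows, always_expanded=ALWAYS_EXPANDED):
    """Return the set of row titles to collapse: every row except Tier 1."""
    missing = set(always_expanded) - set(rows)
    if missing:
        # Catch a renamed or deleted health row early instead of silently
        # collapsing everything.
        raise KeyError(f"expanded rows not present in dashboard: {sorted(missing)}")
    return {title for title in rows if title not in always_expanded}


# Placeholder panel lists; a real dashboard would hold panel objects here.
rows = {
    "Health Overview": [],
    "Overview: Kafka Infrastructure": [],
    "Overview: Consumer Pod Replicas": [],
    "TASK Events: Rules Manager Consumers": [],
    "DLQ Processors: All Event Types": [],
}

collapsed = titles_to_collapse(rows)
```

The result can be passed as `rows_to_collapse_by_title=collapsed`, assuming that parameter accepts any set of titles as the example above suggests.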

Common Issues

Issue: Kafka metrics don't filter by $axon_cluster

Cause: Kafka exporter metrics don't have an axon_cluster label.

Solution: Use explicit consumer group patterns instead:

# Instead of axon_cluster, use explicit group pattern
kafka_lag_expr = 'kafka_consumergroup_group_topic_sum_lag{group=~".*taskmetadatasvc-task-event-consumer.*"}'

Issue: Alert annotations create solid blocks

Cause: Generic alert pattern matches too many alerts.

Solution: Use a service-specific pattern (see ALERTS.md).

Issue: Internal and customer metrics overlap

Cause: Missing deployment type filter.

Solution: Add a service=~"$deployment_type" filter and split into separate rows.


Resources

  • REFERENCE.md - API reference for axon_helpers classes
  • PATTERNS.md - Dashboard patterns with code examples
  • ALERTS.md - Alert annotation patterns (PRIMARY)
  • METRICS.md - RMS metric naming conventions
  • EXAMPLES.md - Complete dashboard examples

Quick Reference

# Dashboard file location
/Users/mriley/projects/ops/grafana-telemetry/dashboards/default/services/rms/

# Generate dashboard
cd /Users/mriley/projects/ops/grafana-telemetry
make rms.rms_tron_consumers_v2.dashboard

# Example dashboard to reference
rms.rms_tron_consumers_v2.dashboard.py

Test Scenarios

Use these scenarios to validate skill invocation and output quality.

Scenario 1: Create Health Overview

Input: "Create a health overview for task metadata consumer"

Expected Output:

  • Uses health_stat() helper function pattern
  • 5-6 stat panels (Health, Latency, Lag, DLQ, Throughput, Pods)
  • Timeline graph with threshold lines
  • Thresholds use SLO constants (e.g., CUSTOMER_LAG_CRITICAL = 5000)
  • Row is NOT collapsed (health always visible)

Validation:

# Should see patterns like:
health_stat(title="Consumer Health", ...)
health_stat(title="P99 Latency", ...)
AxonGraph(title="Health Timeline", thresholds=[...])

Scenario 2: Add Alert Annotations

Input: "Add alert annotations to my TRON dashboard"

Expected Output:

  • Uses SERVICE_ALERT_PATTERN constant (service-specific)
  • Pattern matches: RMS.*TaskMetadataSvc.*|RMS.*RQ.*Rule.*Manager.*|RMS.*RuleManager.*
  • Includes axon_cluster=~"$axon_cluster" filter
  • Icon color has transparency (rgba(255, 120, 50, 0.25))

Anti-patterns (should NOT see):

  • ❌ Generic patterns like .*[Ll]ag.* or .*[Cc]onsumer.*
  • ❌ Missing axon_cluster filter
  • ❌ Solid colors without transparency

Scenario 3: Create Consumer Section

Input: "Add task metadata consumer metrics to dashboard"

Expected Output:

  • Uses ConsumerMetrics class from axon_helpers.rms_helpers
  • Separates internal and customer into different rows
  • Applies filter tags (isNotDLQTag, isInternalServiceTag, isCustomerServiceTag)
  • Uses flatten() wrapper for panel lists
  • Includes latency (p50/p90/p99), volume, and DLQ panels

Validation:

# Should see patterns like:
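from axon_helpers.rms_helpers import ConsumerMetrics, Query

# Same constructor arguments as in Workflow 2
internal_metrics = ConsumerMetrics(
    service_name="Task Metadata Consumer (Internal)",
    container_name="taskmetadatasvc-task-event-consumer-internal",
    consumer_group=".*taskmetadatasvc-task-event-consumer-internal",
    fetch_latency_query=Query(
        "rms_taskmetadatasvc_task_event_consumer_read_latency",
        additional_expressions=isNotDLQTag + isInternalServiceTag,
    ),
    # ... process_latency_query, dlq_submitted_query, message_volume_query
)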

Scenario 4: Kafka Lag Graph

Input: "Add Kafka lag graph for TRON consumers"

Expected Output:

  • Uses kafka_consumergroup_group_topic_sum_lag metric
  • Does NOT filter by axon_cluster (Kafka metrics don't have this label)
  • Filters by explicit consumer group pattern instead
  • Groups by group label for breakdown

Validation:

# Should see pattern like:
sum by (group) (
    kafka_consumergroup_group_topic_sum_lag{
        group=~".*taskmetadatasvc-task-event-consumer.*"
    }
)

Anti-pattern (should NOT see):

  • axon_cluster=~"$axon_cluster" on Kafka metrics (will return no data)
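
The anti-pattern above can be caught mechanically before a dashboard is generated. The sketch below builds the per-group lag query from this scenario and flags any `kafka_*` selector that filters on `axon_cluster`. Both function names are hypothetical, and the regex-based check is deliberately simple: it assumes label values contain no `}` character.

```python
import re


def kafka_lag_by_group(group_regex):
    """Build the per-group lag query shown in this scenario."""
    return (
        "sum by (group) ("
        f'kafka_consumergroup_group_topic_sum_lag{{group=~"{group_regex}"}}'
        ")"
    )


def uses_cluster_filter_on_kafka(expr):
    """Return True if any kafka_* selector in expr filters on axon_cluster."""
    selectors = re.findall(r"(kafka_\w+)\{([^}]*)\}", expr)
    return any("axon_cluster" in labels for _metric, labels in selectors)


good_expr = kafka_lag_by_group(".*taskmetadatasvc-task-event-consumer.*")
bad_expr = 'sum(kafka_consumergroup_group_topic_sum_lag{axon_cluster=~"$axon_cluster"})'
```

A check like `uses_cluster_filter_on_kafka` could run in a unit test over every expression in the dashboard file.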


AI Quality Score

78/100, analyzed 2/18/2026

Well-structured skill with comprehensive workflows for creating Grafana dashboards for TRON services. Contains detailed code examples, validation checklists, and best practices. However, heavy reliance on internal helpers (axon_helpers, rms_helpers) and external reference files (PATTERNS.md, REFERENCE.md, ALERTS.md) limits standalone actionability. Includes clear "When NOT to Use" section and proper tags for discoverability.


Metadata

License: unknown
Version: 1.0.0
Updated: 1/24/2026
Publisher: meriley

Tags

api, ci-cd, github-actions, observability, testing