TRON Dashboard Creation
Purpose
Create on-call friendly Grafana dashboards for TRON team services using grafanalib and axon_helpers. Implements Grafana best practices including health overview panels, tiered information design, and service-specific alert annotations.
When NOT to Use
- Generic API dashboards without Kafka consumers (use standard rms_helpers patterns)
- Non-RMS services (use cd/ or common/ helpers)
- Simple metric additions to existing dashboards
Quick Start: Minimal TRON Consumer Dashboard
from grafanalib.core import Template, Threshold
from axon_helpers.graph_helpers import AxonGraph, AxonSingleStat, GenDashboard, UNITS
from axon_helpers.rms_helpers import generate_dashboard_template_values
# Template variable for filtering by deployment type
deployment_type_template = Template(
name="deployment_type",
label="Deployment Type",
type="custom",
default="All",
includeAll=False,
query="All : .*,Internal : .*-internal.*,Customer : .*-customer.*",
options=[
{"selected": True, "text": "All", "value": ".*"},
{"selected": False, "text": "Internal", "value": ".*-internal.*"},
{"selected": False, "text": "Customer", "value": ".*-customer.*"},
],
)
rows = {
"Health Overview": [
AxonSingleStat(
title="Consumer Lag",
expressions=[{"expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",group=~".*taskmetadatasvc-task-event-consumer.*"})'}],
thresholds=[
Threshold("green", 0, 0.0),
Threshold("orange", 1, 2500),
Threshold("red", 2, 5000),
],
reduceCalc="lastNotNull",
graphMode="area",
format="short",
span=4,
),
],
}
dashboard = GenDashboard(
title="My TRON Consumer Dashboard",
uid="my-tron-consumer-dashboard",
templating=generate_dashboard_template_values(
additional_templates_list=[deployment_type_template]
),
rows=rows,
)
Core Workflows
Workflow 1: Health Overview Dashboard (On-Call Triage)
Create a health overview row for instant on-call triage. This follows the KubeCon "Foolproof K8s Dashboards for Sleep-Deprived On-Calls" pattern.
Step 1: Define SLO Thresholds
# Define SLO constants at top of file (not hardcoded in panels)
CUSTOMER_LATENCY_WARNING_MS = 120000 # 2 minutes
CUSTOMER_LATENCY_CRITICAL_MS = 300000 # 5 minutes
CUSTOMER_LAG_WARNING = 2500
CUSTOMER_LAG_CRITICAL = 5000
DLQ_WARNING = 5
DLQ_CRITICAL = 10
Step 2: Create Health Stat Panels
from grafanalib.core import Threshold
def health_stat(title, description, expr, unit, thresholds, span=2):
"""Create a stat panel showing current metric value with thresholds."""
return AxonSingleStat(
title=title,
description=description,
expressions=[{"expr": expr, "legendFormat": title}],
thresholds=thresholds,
reduceCalc="lastNotNull",
graphMode="area",
format=unit,
decimals=1,
span=span,
)
# Health Overview panels (6 stat panels + timeline)
health_panels = [
health_stat(
title="Overall Health",
description="Health score 0-100% based on Kafka lag",
expr='(1 - clamp_max(sum(kafka_consumergroup_group_topic_sum_lag{...}) / 5000, 1)) * 100',
unit="percent",
thresholds=[
Threshold("red", 0, 0.0),
Threshold("orange", 1, 50.0),
Threshold("green", 2, 90.0),
],
),
health_stat(
title="Max Latency",
description="Max consumer latency in minutes",
expr='max(rms_taskmetadatasvc_task_event_consumer_read_latency{...}) / 60000',
unit="m",
thresholds=[
Threshold("green", 0, 0.0),
Threshold("orange", 1, 2.0),
Threshold("red", 2, 5.0),
],
),
# ... Consumer Lag, DLQ Rate, Throughput panels
]
Step 3: Add Timeline Graph
timeline = AxonGraph(
title="Kafka Lag Timeline",
expressions=[{
"expr": 'sum(kafka_consumergroup_group_topic_sum_lag{topic="task-events-public",...})',
"legendFormat": "Total Consumer Lag",
}],
thresholds=[
Threshold("green", 0, 0.0),
Threshold("orange", 1, float(CUSTOMER_LAG_WARNING)),
Threshold("red", 2, float(CUSTOMER_LAG_CRITICAL)),
],
thresholdsStyleMode="line+area",
span=12,
unit=UNITS.SHORT,
)
Validation:
- Health overview is first row (not collapsed)
- 5-6 stat panels cover: Health, Latency, Lag, DLQ, Throughput
- Thresholds use SLO constants (not hardcoded values)
- Timeline shows historical context
Workflow 2: Consumer Dashboard (Using ConsumerMetrics)
Use the ConsumerMetrics class for standard consumer dashboard sections.
Step 1: Import and Configure
from axon_helpers.rms_helpers import ConsumerMetrics, Query
from axon_helpers.utils import flatten
# Filter tags for internal vs customer
isInternalServiceTag = ', service=~".*-internal.*"'
isCustomerServiceTag = ', service!~".*-internal.*"'
isNotDLQTag = ', is_dlq!="true"'
Step 2: Create Consumer Section
# For Task Metadata Consumer (Internal)
internal_consumer_graphs = flatten(
ConsumerMetrics(
service_name="Task Metadata Consumer (Internal)",
container_name="taskmetadatasvc-task-event-consumer-internal",
consumer_group=".*taskmetadatasvc-task-event-consumer-internal",
fetch_latency_query=Query(
"rms_taskmetadatasvc_task_event_consumer_read_latency",
additional_expressions=isNotDLQTag + isInternalServiceTag,
),
process_latency_query=Query(
"rms_taskmetadatasvc_task_event_consumer_sync_latency",
additional_expressions=isNotDLQTag + isInternalServiceTag,
),
dlq_submitted_query=Query(
"rms_taskmetadatasvc_task_event_consumer_dlq_submitted_count",
additional_expressions=isNotDLQTag + isInternalServiceTag,
),
message_volume_query=Query(
"rms_taskmetadatasvc_task_event_consumer_processed_count",
additional_expressions=isNotDLQTag + isInternalServiceTag,
),
).generate_consumer_graphs()
)
Step 3: Add to Dashboard Rows
rows = {
"Health Overview": health_panels + [timeline],
"TASK Events: Task Metadata Consumer (Internal)": internal_consumer_graphs,
"TASK Events: Task Metadata Consumer (Customer)": customer_consumer_graphs,
}
Validation:
- Internal and Customer are separate rows (not combined)
- Proper filter tags applied (isInternalServiceTag, isCustomerServiceTag)
- DLQ metrics excluded from main consumer (is_dlq!="true")
Workflow 3: Alert-Linked Dashboard
Link dashboards to alerts for directed browsing (Grafana best practice).
Step 1: Define Service-Specific Alert Pattern
import grafanalib.core as G
# CRITICAL: Use narrow patterns, NOT generic like ".*[Ll]ag.*"
# Generic patterns match 90+ alerts and create solid annotation blocks
SERVICE_ALERT_PATTERN = (
"RMS.*TaskMetadataSvc.*|"
"RMS.*RQ.*Rule.*Manager.*|"
"RMS.*RuleManager.*"
)
Step 2: Create Alert Annotations
alert_annotations = G.Annotations(
list=[
{
"builtIn": 0,
"datasource": {"type": "prometheus", "uid": "${DataSource}"},
"enable": True,
"expr": f'ALERTS{{alertname=~"{SERVICE_ALERT_PATTERN}", alertstate="firing", axon_cluster=~"$axon_cluster"}}',
"hide": False,
"iconColor": "rgba(255, 120, 50, 0.25)", # Orange with 25% opacity
"name": "Service Alerts",
"titleFormat": "{{alertname}}",
"useValueForTime": False,
},
]
)
Step 3: Apply to Dashboard
dashboard = GenDashboard(
title="RMS TRON Consumer Dashboard",
# ... other config
annotations=alert_annotations,
)
Validation:
- Alert pattern is service-specific (not generic)
- Includes
axon_cluster=~"$axon_cluster"for environment filtering - Icon color has transparency (25% opacity for subtle background)
- Alert annotations appear on relevant panels only
Panel Selection Guide
| Metric Type | Panel | Best Practice Source |
|---|---|---|
| Current health/status | AxonSingleStat | KubeCon: instant triage |
| Latency (p50/p90/p99) | AxonGraph (lines) | RED Method: Duration |
| Message volume/rate | AxonGraph (bars) | RED Method: Rate |
| Error/fault counts | AxonGraph (bars, stacked) | USE Method: Errors |
| Kafka consumer lag | AxonGraph (lines) | USE Method: Saturation |
| Top-K operations | AxonBarGauge | Grafana best practices |
| HPA replica status | AxonGraph (lines) | KubeCon: normalization |
See PATTERNS.md for complete code examples for each pattern.
Template Variables
Always include these variables:
from axon_helpers.rms_helpers import generate_dashboard_template_values
# Standard RMS template variables
templating = generate_dashboard_template_values(
additional_templates_list=[
deployment_type_template, # Internal vs Customer
consumer_group_template, # Kafka consumer group filter
]
)
Standard variables provided by generate_dashboard_template_values():
$DataSource- Prometheus/Cortex datasource$axon_cluster- Cluster/environment selector
See REFERENCE.md for custom template variable patterns.
Alert Setup
CRITICAL: Use service-specific patterns, NOT generic patterns
# WRONG - matches 90+ alerts, creates solid annotation blocks
ALERT_PATTERN = ".*[Ll]ag.*"
# CORRECT - matches only TRON service alerts
SERVICE_ALERT_PATTERN = (
"RMS.*TaskMetadataSvc.*|"
"RMS.*RQ.*Rule.*Manager.*|"
"RMS.*RuleManager.*"
)
See ALERTS.md for complete alert annotation patterns and threshold integration.
Row Organization (Best Practice)
rows = {
# TIER 1: Health Overview - NOT collapsed (on-call triage)
"Health Overview": health_panels + [timeline],
# TIER 2: Infrastructure - Collapsed by default
"Overview: Kafka Infrastructure": kafka_panels,
"Overview: Consumer Pod Replicas": replica_panels,
# TIER 3: Service-specific - Collapsed by default
"TASK Events: Rules Manager Consumers": task_event_panels,
"TASK Events: Task Metadata Consumer (Internal)": internal_panels,
"TASK Events: Task Metadata Consumer (Customer)": customer_panels,
# TIER 4: Advanced/DLQ - Collapsed by default
"DLQ Processors: All Event Types": dlq_panels,
}
dashboard = GenDashboard(
rows=rows,
rows_to_collapse_by_title={
"Overview: Kafka Infrastructure",
"Overview: Consumer Pod Replicas",
"TASK Events: Rules Manager Consumers",
"DLQ Processors: All Event Types",
},
)
Common Issues
Issue: Kafka metrics don't filter by $axon_cluster
Cause: Kafka exporter metrics don't have axon_cluster label
Solution: Use explicit consumer group patterns instead:
# Instead of axon_cluster, use explicit group pattern
kafka_lag_expr = 'kafka_consumergroup_group_topic_sum_lag{group=~".*taskmetadatasvc-task-event-consumer.*"}'
Issue: Alert annotations create solid blocks
Cause: Generic alert pattern matches too many alerts Solution: Use service-specific pattern (see ALERTS.md)
Issue: Internal and customer metrics overlap
Cause: Missing deployment type filter
Solution: Add service=~"$deployment_type" filter and split into separate rows
Resources
- REFERENCE.md - API reference for axon_helpers classes
- PATTERNS.md - Dashboard patterns with code examples
- ALERTS.md - Alert annotation patterns (PRIMARY)
- METRICS.md - RMS metric naming conventions
- EXAMPLES.md - Complete dashboard examples
Quick Reference
# Dashboard file location
/Users/mriley/projects/ops/grafana-telemetry/dashboards/default/services/rms/
# Generate dashboard
cd /Users/mriley/projects/ops/grafana-telemetry
make rms.rms_tron_consumers_v2.dashboard
# Example dashboard to reference
rms.rms_tron_consumers_v2.dashboard.py
Test Scenarios
Use these scenarios to validate skill invocation and output quality.
Scenario 1: Create Health Overview
Input: "Create a health overview for task metadata consumer"
Expected Output:
- Uses
health_stat()helper function pattern - 5-6 stat panels (Health, Latency, Lag, DLQ, Throughput, Pods)
- Timeline graph with threshold lines
- Thresholds use SLO constants (e.g.,
CUSTOMER_LAG_CRITICAL = 5000) - Row is NOT collapsed (health always visible)
Validation:
# Should see patterns like:
health_stat(title="Consumer Health", ...)
health_stat(title="P99 Latency", ...)
AxonGraph(title="Health Timeline", thresholds=[...])
Scenario 2: Add Alert Annotations
Input: "Add alert annotations to my TRON dashboard"
Expected Output:
- Uses
SERVICE_ALERT_PATTERNconstant (service-specific) - Pattern matches:
RMS.*TaskMetadataSvc.*|RMS.*RQ.*Rule.*Manager.* - Includes
axon_cluster=~"$axon_cluster"filter - Icon color has transparency (
rgba(255, 120, 50, 0.25))
Anti-patterns (should NOT see):
- ❌ Generic patterns like
.*[Ll]ag.*or.*[Cc]onsumer.* - ❌ Missing
axon_clusterfilter - ❌ Solid colors without transparency
Scenario 3: Create Consumer Section
Input: "Add task metadata consumer metrics to dashboard"
Expected Output:
- Uses
ConsumerMetricsclass fromaxon_helpers.rms_helpers - Separates internal and customer into different rows
- Applies filter tags (
isNotDLQTag,isInternalServiceTag,isCustomerServiceTag) - Uses
flatten()wrapper for panel lists - Includes latency (p50/p90/p99), volume, and DLQ panels
Validation:
# Should see patterns like:
from axon_helpers.rms_helpers import ConsumerMetrics
internal_metrics = ConsumerMetrics(
metric_prefix="rms_taskmetadatasvc_task_event_consumer",
filter_tags=isNotDLQTag + isInternalServiceTag,
)
Scenario 4: Kafka Lag Graph
Input: "Add Kafka lag graph for TRON consumers"
Expected Output:
- Uses
kafka_consumergroup_group_topic_sum_lagmetric - Does NOT filter by
axon_cluster(Kafka metrics don't have this label) - Filters by explicit consumer group pattern instead
- Groups by
grouplabel for breakdown
Validation:
# Should see pattern like:
sum by (group) (
kafka_consumergroup_group_topic_sum_lag{
group=~".*taskmetadatasvc-task-event-consumer.*"
}
)
Anti-pattern (should NOT see):
- ❌
axon_cluster=~"$axon_cluster"on Kafka metrics (will return no data)
