ssmd-health-run
Quick Triage
# Non-Running pods (excludes Completed CronJobs)
kubectl get pods -n ssmd --no-headers | grep -v -E 'Running|Completed'
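# Not-ready pods (Running, but READY shows fewer containers ready than expected; works for any container count)
kubectl get pods -n ssmd --no-headers | awk '$3 == "Running" { split($2, r, "/"); if (r[1] != r[2]) print }'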
# Recent error/warning events
kubectl get events -n ssmd --sort-by='.lastTimestamp' --field-selector type!=Normal | tail -20
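For scripted runs it can help to turn the pod checks into a pass/fail gate. The following is a minimal sketch, assuming the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...); the function name `pods_all_healthy` is hypothetical and not part of any ssmd tooling.

```shell
# Hypothetical helper: succeed only if every pod is Completed, or Running and fully ready.
# Reads `kubectl get pods --no-headers` output on stdin.
pods_all_healthy() {
  awk '
    $3 == "Completed" { next }
    { split($2, r, "/") }
    $3 != "Running" || r[1] != r[2] { bad++; print "UNHEALTHY: " $1 " (" $3 ", " $2 ")" }
    END { exit (bad > 0) }
  '
}
```

Usage: `kubectl get pods -n ssmd --no-headers | pods_all_healthy && echo "pods OK"`.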
Deployments
Actual deployment names in the ssmd namespace:
| Deployment | Component |
|---|---|
| kalshi-crypto-connector | Kalshi connector |
| kraken-futures-connector | Kraken connector |
| polymarket-connector | Polymarket connector |
| kalshi-crypto-archiver | Kalshi archiver |
| kraken-futures-archiver | Kraken archiver |
| polymarket-archiver | Polymarket archiver |
| ssmd-operator | CRD operator |
| ssmd-data-ts | API server (port 8080) |
| ssmd-cdc | CDC pipeline |
| ssmd-redis | Redis |
StatefulSets: ssmd-postgres (pod ssmd-postgres-0)
Connector / Archiver Health
Rust containers have no wget/curl — use kubectl port-forward:
# Connector health (each returns JSON with status, feed, connected, last_message_secs_ago).
# kill %1 follows ';' rather than '&&' so the port-forward is torn down even when curl fails.
kubectl port-forward -n ssmd deploy/kalshi-crypto-connector 8080:8080 &
sleep 2 && curl -s http://localhost:8080/health; kill %1
kubectl port-forward -n ssmd deploy/kraken-futures-connector 8081:8080 &
sleep 2 && curl -s http://localhost:8081/health; kill %1
kubectl port-forward -n ssmd deploy/polymarket-connector 8082:8080 &
sleep 2 && curl -s http://localhost:8082/health; kill %1
Archiver health uses the same pattern with archiver deployment names.
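The port-forward, probe, and teardown steps can be wrapped in one function so the same call works for connectors and archivers. This is a sketch only: `probe_health` is a hypothetical name, and it assumes every target serves its endpoint on container port 8080, as the connectors above do.

```shell
# Hypothetical helper: probe an HTTP path on a deployment through a temporary port-forward.
# Usage: probe_health <deployment> <local-port> [path]
probe_health() {
  local deploy="$1"
  local local_port="$2"
  local path="${3:-/health}"
  local pf_pid rc
  kubectl port-forward -n ssmd "deploy/${deploy}" "${local_port}:8080" >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2                       # give the tunnel time to come up
  curl -s --max-time 5 "http://localhost:${local_port}${path}"
  rc=$?
  kill "$pf_pid" 2>/dev/null    # always tear the tunnel down, even if curl failed
  wait "$pf_pid" 2>/dev/null
  return "$rc"
}
```

Usage: `probe_health kalshi-crypto-archiver 8084`. Tracking the port-forward by PID instead of `%1` keeps the teardown correct when other background jobs are running in the same shell.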
Prometheus metrics snapshot (also via port-forward):
kubectl port-forward -n ssmd deploy/kalshi-crypto-connector 8080:8080 &
sleep 2 && curl -s http://localhost:8080/metrics | grep -E 'websocket_connected|idle_seconds|messages_total'; kill %1
Infrastructure
NATS
NATS runs as a StatefulSet (nats-0). Use the nats-box pod for CLI commands:
# JetStream health
kubectl exec -n nats deploy/nats-box -- nats server check jetstream
# List streams (shows message counts, last message time)
kubectl exec -n nats deploy/nats-box -- nats stream ls
# Stream detail
kubectl exec -n nats deploy/nats-box -- nats stream info PROD_KALSHI_CRYPTO
# Consumer list for a stream
kubectl exec -n nats deploy/nats-box -- nats consumer ls PROD_KALSHI_CRYPTO
Streams: PROD_KALSHI_CRYPTO, PROD_KRAKEN_FUTURES, PROD_POLYMARKET, PROD_KALSHI_LIFECYCLE, SECMASTER_CDC, SIGNALS
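To inspect every stream in one pass, the stream-detail command can be looped over the list above. A minimal sketch; the variable `SSMD_STREAMS` and the function `all_stream_info` are hypothetical names introduced for illustration.

```shell
# Stream names as listed in this runbook.
SSMD_STREAMS="PROD_KALSHI_CRYPTO PROD_KRAKEN_FUTURES PROD_POLYMARKET PROD_KALSHI_LIFECYCLE SECMASTER_CDC SIGNALS"

# Hypothetical helper: print `nats stream info` for each stream under a header line.
all_stream_info() {
  for stream in $SSMD_STREAMS; do
    echo "=== ${stream} ==="
    kubectl exec -n nats deploy/nats-box -- nats stream info "$stream" \
      || echo "WARN: could not read ${stream}" >&2
  done
}
```

A failure on one stream is reported on stderr and the loop continues, so a single missing stream does not hide the state of the others.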
data-ts (Postgres + API)
data-ts listens on port 8080 (not 3000). Health endpoint is /health (not /v1/health).
From allowed CIDRs (home network) — LoadBalancer direct access, no port-forward needed:
curl -s http://<LB-IP>:8080/health
# End-to-end API probe (requires datasets:read API key):
curl -s -H "Authorization: Bearer <API_KEY>" "http://<LB-IP>:8080/v1/markets/lookup?ids=KXBTCD-26FEB0317-T76999.99&feed=kalshi"
From elsewhere — via port-forward:
kubectl port-forward -n ssmd deploy/ssmd-data-ts 8083:8080 &
sleep 2 && curl -s http://localhost:8083/health; kill %1
Returns {"status":"ok"} when Postgres is connected.
Redis
kubectl exec -n ssmd deploy/ssmd-redis -- redis-cli ping
Operator
kubectl get deploy ssmd-operator -n ssmd
kubectl logs -n ssmd deploy/ssmd-operator --tail=50
Data Pipeline Health (CLI)
# Composite health report (writes to DB)
ssmd health daily
Cloud Monitoring Queries
gcloud monitoring time-series list \
--project=massive-acrobat-227416 \
--filter='metric.type="prometheus.googleapis.com/ssmd_connector_websocket_connected/gauge"' \
--interval-start-time=$(date -u -v-1H +%Y-%m-%dT%H:%M:%SZ)
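To query other metrics, replace the metric name in the filter, for example ssmd_connector_messages_total/counter or ssmd_archiver_messages_total/counter. The date -v-1H syntax is BSD/macOS only; with GNU date (Linux), use date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ instead.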
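A small wrapper can pick whichever `date` syntax the local system accepts, so the same query runs on macOS and Linux. This is a sketch; `one_hour_ago` is a hypothetical helper name, and it has not been checked against minimal `date` implementations such as BusyBox.

```shell
# Hypothetical helper: UTC timestamp for one hour ago, formatted like 2026-02-03T17:00:00Z.
# Tries the GNU form first, then falls back to the BSD/macOS form.
one_hour_ago() {
  date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ 2>/dev/null \
    || date -u -v-1H +%Y-%m-%dT%H:%M:%SZ
}
```

Usage: pass `--interval-start-time="$(one_hour_ago)"` in the `gcloud` command.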
DQ Checks
See ssmd-dq-run skill. Summary:
uv run data/dq.py --date YYYY-MM-DD --feed kalshi --stream crypto
uv run data/dq.py --date YYYY-MM-DD --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date YYYY-MM-DD --feed polymarket --stream markets --prefix polymarket
The ssmd-dq-daily CronJob runs at 03:30 UTC. To trigger a manual run:
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-manual-MMDD -n ssmd
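The MMDD suffix in the manual job name can be generated rather than typed. A minimal sketch, assuming MMDD means the UTC date on which the manual run is started; `dq_manual_job_name` is a hypothetical helper.

```shell
# Hypothetical helper: build the manual job name from today's UTC month and day.
dq_manual_job_name() {
  echo "ssmd-dq-manual-$(date -u +%m%d)"
}
```

Usage: `kubectl create job --from=cronjob/ssmd-dq-daily "$(dq_manual_job_name)" -n ssmd`. Job names must be unique within a namespace, so a second manual run on the same day needs the earlier job deleted first or a different suffix.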
Interpreting Results
| Dimension | GREEN | YELLOW | RED |
|---|---|---|---|
| Pods | All Running, ready | Restarts > 0 | CrashLoopBackOff |
| Connectors | connected, idle < 60s | idle 60-300s | disconnected or idle > 300s |
| Archivers | Running, GCS sync < 6h ago | sync 6-12h ago | sync > 12h |
| NATS | JetStream OK, msgs flowing | consumer lag > 1000 | JetStream unhealthy |
| Postgres | data-ts /health OK | slow queries | connection refused |
| Redis | PONG | - | error / timeout |
| DQ Score | >= 98 | 85-97 | < 85 |
| Composite | >= 85 | 60-84 | < 60 |
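The numeric thresholds in the table can be applied mechanically. The sketch below encodes the Connectors, DQ Score, and Composite rows; both function names are hypothetical, and it assumes the `connected` field of the connector /health JSON is the literal `true` or `false`.

```shell
# Hypothetical helper: rate a connector from its /health fields.
# Usage: classify_connector <connected: true|false> <last_message_secs_ago>
classify_connector() {
  awk -v c="$1" -v idle="$2" 'BEGIN {
    if (c != "true" || idle > 300) { print "RED" }
    else if (idle >= 60)           { print "YELLOW" }
    else                           { print "GREEN" }
  }'
}

# Hypothetical helper: rate a score against the GREEN and YELLOW floors.
# Usage: classify_score <score> <green_floor> <yellow_floor>
classify_score() {
  awk -v s="$1" -v g="$2" -v y="$3" 'BEGIN {
    if (s >= g)      { print "GREEN" }
    else if (s >= y) { print "YELLOW" }
    else             { print "RED" }
  }'
}
```

For example, `classify_score "$dq" 98 85` rates a DQ score and `classify_score "$composite" 85 60` rates the composite. Fractional scores between 97 and 98 fall into YELLOW under this reading of the table.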
