ssmd-dq-run
Procedures for running ssmd Data Quality checks and interpreting results.
Source Files
| File | Purpose |
|---|---|
| data/dq.py | DQRunner engine: 13 checks, scoring, CLI |
| data/dq_email.py | Email report wrapper: runs all feeds, HTML output |
| data/Dockerfile | DQ image: python:3.12-slim + duckdb + gcloud monitoring |
Running DQ Locally
GCS access requires application-default credentials. Run gcloud auth application-default login once before the commands below.
# Single feed
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto
# With verbose progress
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --verbose
# JSON output (for programmatic use)
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto --json
# Non-default prefix (--prefix defaults to kalshi, so every other feed must pass its GCS prefix explicitly)
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket
All Three Feeds
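Run all three feeds for full pipeline verification. The commands are independent of each other, so they can run in parallel (for example, in separate terminals):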
uv run data/dq.py --date 2026-02-17 --feed kalshi --stream crypto
uv run data/dq.py --date 2026-02-17 --feed kraken-futures --stream futures --prefix kraken-futures
uv run data/dq.py --date 2026-02-17 --feed polymarket --stream markets --prefix polymarket
Feed Parameters
| Feed | --feed | --stream | --prefix |
|---|---|---|---|
| Kalshi | kalshi | crypto | (default: kalshi) |
| Kraken Futures | kraken-futures | futures | kraken-futures |
| Polymarket | polymarket | markets | polymarket |
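The feed parameters in the table above can be driven from a short script instead of three terminals. The following is a minimal sketch, not part of the repository: `FEEDS`, `build_command`, and `run_all` are hypothetical names, and it uses only the flags shown in the commands above. It assumes `uv` is on the PATH and that the script runs from the repository root.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Feed parameters copied from the table above; prefix=None means "use the default".
FEEDS = [
    {"feed": "kalshi", "stream": "crypto", "prefix": None},
    {"feed": "kraken-futures", "stream": "futures", "prefix": "kraken-futures"},
    {"feed": "polymarket", "stream": "markets", "prefix": "polymarket"},
]


def build_command(date, feed, stream, prefix=None):
    """Build the dq.py command line for one feed (hypothetical helper)."""
    cmd = ["uv", "run", "data/dq.py", "--date", date,
           "--feed", feed, "--stream", stream]
    if prefix:
        cmd += ["--prefix", prefix]
    return cmd


def run_all(date, runner=subprocess.run):
    """Run every feed concurrently and return {feed: exit code}."""
    cmds = [build_command(date, **params) for params in FEEDS]
    # One thread per feed; each thread just waits on its own subprocess.
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        procs = list(pool.map(runner, cmds))
    return {params["feed"]: proc.returncode
            for params, proc in zip(FEEDS, procs)}
```

Calling `run_all("2026-02-17")` starts all three runs at once. Their output is interleaved on the terminal, so a real wrapper would probably capture output per feed.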
Running DQ In-Cluster
The DQ CronJob runs at 03:30 UTC daily (after parquet-gen at 02:00 UTC).
Manifest: clusters/gke-prod/apps/ssmd/cronjobs/dq-daily.yaml
Trigger a manual DQ email run
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-manual-MMDD -n ssmd
Watch progress
kubectl logs -n ssmd job/ssmd-dq-manual-MMDD -f
Re-run for a specific date
The CronJob defaults to yesterday. To override:
kubectl create job --from=cronjob/ssmd-dq-daily ssmd-dq-rerun-MMDD -n ssmd --dry-run=client -o yaml | \
sed 's|dq_email.py|dq_email.py --date 2026-02-17|' | \
kubectl apply -f -
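The job name and the sed substitution above can also be produced in Python, which makes it easy to validate the date first. This is a sketch under assumptions: `job_name` and `inject_date` are hypothetical helpers, and it assumes that MMDD means the month and day of the date being re-checked, which the procedure above does not state.

```python
from datetime import date


def job_name(run_date, kind="rerun"):
    """Return e.g. 'ssmd-dq-rerun-0217' for '2026-02-17' (MMDD meaning assumed)."""
    parsed = date.fromisoformat(run_date)  # raises ValueError on a malformed date
    return f"ssmd-dq-{kind}-{parsed:%m%d}"


def inject_date(manifest, run_date):
    """Mimic the sed step: append --date after the first dq_email.py on each line."""
    date.fromisoformat(run_date)  # validate before touching the manifest
    replacement = f"dq_email.py --date {run_date}"
    return "\n".join(
        line.replace("dq_email.py", replacement, 1)
        for line in manifest.split("\n")
    )
```

The output of `inject_date` would be piped to `kubectl apply -f -` exactly as in the shell version.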
Interpreting Scores
Grades
| Grade | Score Range | Meaning |
|---|---|---|
| GREEN | >= 98 | Pipeline healthy, all checks passing |
| YELLOW | >= 85 and < 98 | Minor issues, investigate when convenient |
| RED | < 85 | Significant issues, investigate promptly |
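The grade thresholds in the table can be expressed as a small function. This is an illustrative sketch of the table, not the code in data/dq.py; the name `grade` is hypothetical.

```python
def grade(score):
    """Map a 0-100 DQ score to a grade using the thresholds in the table."""
    if score >= 98:
        return "GREEN"
    if score >= 85:
        return "YELLOW"
    return "RED"
```

The boundaries are inclusive on the lower side: a score of exactly 98 is GREEN and exactly 85 is YELLOW.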
Check Statuses
| Status | Weight | Meaning |
|---|---|---|
| pass | 1.0 | Check passed |
| warn | 0.7 | Threshold exceeded but not critical |
| fail | 0.0 | Check failed |
| skip | excluded | Not enough data to run, excluded from score |
Score = mean weight of all non-skipped checks * 100.
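The scoring rule can be worked through in a few lines. This is a sketch of the rule as described here, not the DQRunner implementation; `WEIGHTS` and `dq_score` are hypothetical names, and returning `None` when every check is skipped is an assumption.

```python
# Weights from the Check Statuses table; "skip" is excluded rather than weighted.
WEIGHTS = {"pass": 1.0, "warn": 0.7, "fail": 0.0}


def dq_score(statuses):
    """Average the weights of the non-skipped checks and scale to 0-100."""
    scored = [WEIGHTS[s] for s in statuses if s != "skip"]
    if not scored:
        return None  # assumption: no score when nothing could be evaluated
    return sum(scored) / len(scored) * 100


# 13 checks: 11 pass, 1 warn, 1 skip -> (11 * 1.0 + 0.7) / 12 * 100
example = ["pass"] * 11 + ["warn", "skip"]
print(round(dq_score(example), 1))  # 97.5
```

With all 13 checks scored, a single warn gives 12.7 / 13 * 100, or about 97.7, which is already below the GREEN threshold of 98.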
Exit Codes
- dq.py exits 1 if any check has status fail
- dq_email.py always exits 0 (email is the alert mechanism)
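A calling script can branch on the dq.py exit status. The helper below is a minimal sketch with a hypothetical name, `dq_passed`. It treats any non-zero status as a failure, because an uncaught Python exception also exits with status 1 and so cannot be told apart from a failed check by exit status alone.

```python
import subprocess


def dq_passed(cmd):
    """Run a DQ command and return True only when it exits 0."""
    # Non-zero covers both "a check has status fail" and a crash of the script.
    return subprocess.run(cmd).returncode == 0
```

For example, `dq_passed(["uv", "run", "data/dq.py", "--date", "2026-02-17", "--feed", "kalshi", "--stream", "crypto"])` returns `False` when any check fails. The same wrapper is not useful for dq_email.py, since that script always exits 0.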
Notebook / Programmatic Usage
from dq import DQRunner
runner = DQRunner(bucket="ssmd-data", feed="kalshi", stream="crypto")
results = runner.run("2026-02-12")
results.summary() # print human-readable report
results.score() # float 0-100
results.to_json() # JSON string
# Ad-hoc queries via the shared DuckDB connection
runner.con.execute(
    "SELECT * FROM read_parquet('gcs://ssmd-data/kalshi/crypto/2026-02-12/ticker_*.parquet') LIMIT 10"
).fetchdf()
# Date range
all_results = runner.run_range("2026-02-10", "2026-02-17")
Email Report
dq_email.py runs all 3 feeds, generates an HTML email with per-feed grades and check details, and sends it via SMTP.
Required env vars: SMTP_USER, SMTP_PASS, SMTP_TO
Optional: SMTP_HOST (default: smtp.gmail.com), SMTP_PORT (default: 587)
These are provided in-cluster via the ssmd-smtp-credentials Secret.
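When running the email report outside the cluster, it helps to check the environment before a long run starts. The sketch below shows one way to read the variables listed above with their documented defaults. `load_smtp_config` is a hypothetical helper and is not taken from dq_email.py.

```python
import os

REQUIRED = ("SMTP_USER", "SMTP_PASS", "SMTP_TO")


def load_smtp_config(env=None):
    """Read SMTP settings, failing early if a required variable is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    return {
        "host": env.get("SMTP_HOST", "smtp.gmail.com"),  # documented default
        "port": int(env.get("SMTP_PORT", "587")),        # documented default
        "user": env["SMTP_USER"],
        "password": env["SMTP_PASS"],
        "to": env["SMTP_TO"],
    }
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.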
Post-Deploy / Post-Backfill Verification
After deploying a new DQ version or backfilling parquet data:
- Run DQ locally for all 3 feeds (see commands above)
- Verify target checks show PASS
- Optionally trigger in-cluster email: kubectl create job --from=cronjob/ssmd-dq-daily ...
- Verify email arrives with corrected scores
Image Build
DQ image is built from data/Dockerfile, triggered by dq-v* tags in the 899bushwick repo (not ssmd).
See the ssmd-deploy skill for full deployment procedure.
