ssmd-dq-checks
Reference catalog for all 13 DQ checks in data/dq.py.
Check Catalog
Check 1: data_completeness
What: Verifies parquet files exist and the archiver manifest is present.
| Status | Criteria |
|---|---|
| pass | Parquet files found AND manifest present with files listed |
| warn | Parquet files found but no manifest (or manifest has no files) |
| fail | No parquet files found |
Common failures: Archiver not running, GCS permissions, wrong date.
Check 2: record_counts
What: Reports row counts per message type from parquet files. Uses archiver manifest records_by_type or falls back to parquet-gen manifest records_written.
| Status | Criteria |
|---|---|
| pass | At least one parquet file has rows |
| fail | All parquet files are empty (0 rows) |
Note: Polymarket archiver manifest has files: [] — the check falls back to pq_manifest for type breakdown.
Check 3: schema
What: Validates parquet column names match expected schema per feed/type.
| Status | Criteria |
|---|---|
| pass | All files match expected schema |
| warn | Extra columns present (forward-compatible) |
| fail | Missing required columns |
Check 4: null_rates
What: Checks null percentage for required (non-nullable) columns.
| Status | Criteria |
|---|---|
| pass | All required columns have < 1% nulls |
| warn | 1-5% nulls in required columns |
| fail | > 5% nulls in required columns |
Check 5: duplicates
What: Checks for duplicate rows by primary key (feed-specific).
| Status | Criteria |
|---|---|
| pass | No duplicates found |
| warn | < 1% duplicate rate |
| fail | >= 1% duplicate rate |
Primary keys: Kalshi trade: trade_id. Kalshi ticker: (market_ticker, ts, _nats_seq).
Check 6: nats_continuity
What: Checks for gaps in _nats_seq (NATS JetStream sequence numbers).
| Status | Criteria |
|---|---|
| pass | No gaps in _nats_seq |
| warn | Gaps present but < 1% of total |
| fail | Gaps >= 1% of total |
Common failures: NATS consumer restart, archiver reconnection.
Check 7: ts_continuity
What: Checks for time gaps in exchange timestamps. Divides the day into 15-minute slots and checks for empty slots.
| Status | Criteria |
|---|---|
| pass | < 5% of 15-min slots empty |
| warn | 5-15% empty |
| fail | > 15% empty |
Check 8: per_ticker_stats
What: Per-ticker row counts and time coverage. Reports top/bottom tickers by volume.
| Status | Criteria |
|---|---|
| pass | Always passes (informational) |
Check 9: parquet_vs_jsonl (accountability)
What: Verifies every JSONL line is accounted for in parquet. The zero-tolerance accountability check.
Sources (in priority order):
- parquet-gen manifest (
parquet-manifest.json) — preferred, scoped to what was actually processed - Archiver manifest (
manifest.json) — fallback for pre-v2.0.0 data
Comparison logic:
- Data types:
parquet_rows == jsonl_lines(exact match), orparquet_rows >= jsonl_linesfor fanout types - Control types: counted and reported, NOT converted to parquet
- Unknown types: any unrecognized type = FAIL (needs new schema)
| Status | Criteria |
|---|---|
| pass | All data types accounted for |
| warn | Minor discrepancies or coverage < 100% |
| fail | Missing data or unknown types |
Pipeline stats surfaced: lines_json_error, lines_type_unknown, lines_no_schema, parse_batch_dropped
u64 underflow guard: Values > 2^63 in parse_batch_dropped are treated as 0 (known Rust underflow from fanout).
Check 10: exchange_seq_gaps
What: Gaps in exchange-provided sequence numbers per file.
Grouping:
- Kalshi: group by
_shard_id(v1.3.0+), falls back tosid(v1.2.0), thenticker_col - Kraken Futures: group by
product_id
| Status | Criteria |
|---|---|
| pass | No gaps in exchange sequences |
| warn | Gaps present but coverage > 95% |
| fail | Coverage < 95% |
Common failures: Kalshi pre-1.3.0 data groups by sid which collides across shards — expect false gaps on old data. Post-1.3.0 groups by _shard_id for accurate per-shard analysis.
Check 11: data_coverage
What: Percentage of 15-minute time slots (96/day) that have exchange-timestamped data.
| Status | Criteria |
|---|---|
| pass | >= 95% of slots have data |
| warn | 80-95% |
| fail | < 80% |
Feed-specific timestamp columns: Kalshi ts, Kraken time, Polymarket timestamp_ms.
Note: Polymarket coverage may be lower due to activity concentrated in US hours.
Check 12: connection_uptime
What: WebSocket connection uptime from Cloud Monitoring websocket_connected gauge.
| Status | Criteria |
|---|---|
| pass | >= 99% uptime (min across all shards) |
| warn | 95-99% |
| fail | < 95% |
Known issue: Ghost shards from before connector restarts have low data point counts, dragging down the min. The check should consider only the latest shard generation.
Requires: google-cloud-monitoring package, Monitoring Viewer IAM role on the SA.
Check 13: schema_version
What: Verifies parquet files report their schema version via metadata.
| Status | Criteria |
|---|---|
| pass | Schema version present and recognized |
| warn | Schema version missing (pre-versioning data) |
| fail | Unknown schema version |
Fanout Rules
Some message types produce multiple parquet rows per JSONL line (1:N fan-out):
| Feed | Type | Reason |
|---|---|---|
| kraken | ticker | Batch array: one WS message contains multiple symbol updates |
| kraken | trade | Batch array: one WS message contains multiple trades |
| polymarket | price_change | Array-wrapped messages may contain multiple updates |
For fanout types: parquet_rows >= jsonl_lines (not exact match).
For non-fanout types: parquet_rows == jsonl_lines (exact match).
Per-Feed Expected Message Types
| Feed | Data Types (have parquet schemas) | Control Types (skipped) |
|---|---|---|
| kalshi | ticker, trade, market_lifecycle_v2 | subscribed, ok, error |
| kraken | ticker, trade | heartbeat, status |
| kraken-futures | ticker, trade | control |
| polymarket | book, last_trade_price, price_change, best_bid_ask | new_market, market_resolved |
Trade Message Schemas
Kalshi NATS message (post-connector injection)
{
"type": "trade",
"sid": 2,
"seq": 5724,
"msg": {
"trade_id": "uuid",
"market_ticker": "KXBTCD-26FEB0317-T76999.99",
"yes_price": 17,
"count": 130,
"taker_side": "no",
"ts": 1770153448
},
"_shard_id": 3
}
Kalshi REST API trade
{
"trade_id": "uuid",
"ticker": "KXBTCD-26FEB0317-T76999.99",
"yes_price": 17,
"count": 130,
"taker_side": "no",
"created_time": "2026-02-03T21:17:28.18002Z"
}
Adding a New Check
- Add
def check_<name>(con, base, ...) -> dicttodata/dq.py - Return dict with at minimum:
{"check": "<name>", "status": "pass|warn|fail|skip", "detail": "..."} - Add to
check_fnslist inDQRunner.run() - Update this skill with the check's criteria and common failures
- Deploy: tag
dq-v*, updatedq-daily.yamlmanifest (seessmd-deployskill)
GCS Data Layout
gs://ssmd-data/{prefix}/{feed}/{stream}/{date}/
manifest.json # archiver manifest (records_by_type, files list)
parquet-manifest.json # parquet-gen manifest (parse_batch_input, records_written)
ticker_000000.parquet # parquet files per type
trade_000000.parquet
...
