askill
ssmd-dq-checks

ssmd-dq-checksSafety 100Repository

Catalog of ssmd DQ checks — what each measures, pass/fail criteria, common failure modes and fixes, fanout rules, and accountability SLOs. Use when investigating DQ failures, understanding check behavior, or adding new checks.

1 stars
1.2k downloads
Updated 2/20/2026

Package Files

Loading files...
SKILL.md

ssmd-dq-checks

Reference catalog for all 13 DQ checks in data/dq.py.

Check Catalog

Check 1: data_completeness

What: Verifies parquet files exist and the archiver manifest is present.

StatusCriteria
passParquet files found AND manifest present with files listed
warnParquet files found but no manifest (or manifest has no files)
failNo parquet files found

Common failures: Archiver not running, GCS permissions, wrong date.

Check 2: record_counts

What: Reports row counts per message type from parquet files. Uses archiver manifest records_by_type or falls back to parquet-gen manifest records_written.

StatusCriteria
passAt least one parquet file has rows
failAll parquet files are empty (0 rows)

Note: Polymarket archiver manifest has files: [] — the check falls back to pq_manifest for type breakdown.

Check 3: schema

What: Validates parquet column names match expected schema per feed/type.

StatusCriteria
passAll files match expected schema
warnExtra columns present (forward-compatible)
failMissing required columns

Check 4: null_rates

What: Checks null percentage for required (non-nullable) columns.

StatusCriteria
passAll required columns have < 1% nulls
warn1-5% nulls in required columns
fail> 5% nulls in required columns

Check 5: duplicates

What: Checks for duplicate rows by primary key (feed-specific).

StatusCriteria
passNo duplicates found
warn< 1% duplicate rate
fail>= 1% duplicate rate

Primary keys: Kalshi trade: trade_id. Kalshi ticker: (market_ticker, ts, _nats_seq).

Check 6: nats_continuity

What: Checks for gaps in _nats_seq (NATS JetStream sequence numbers).

StatusCriteria
passNo gaps in _nats_seq
warnGaps present but < 1% of total
failGaps >= 1% of total

Common failures: NATS consumer restart, archiver reconnection.

Check 7: ts_continuity

What: Checks for time gaps in exchange timestamps. Divides the day into 15-minute slots and checks for empty slots.

StatusCriteria
pass< 5% of 15-min slots empty
warn5-15% empty
fail> 15% empty

Check 8: per_ticker_stats

What: Per-ticker row counts and time coverage. Reports top/bottom tickers by volume.

StatusCriteria
passAlways passes (informational)

Check 9: parquet_vs_jsonl (accountability)

What: Verifies every JSONL line is accounted for in parquet. The zero-tolerance accountability check.

Sources (in priority order):

  1. parquet-gen manifest (parquet-manifest.json) — preferred, scoped to what was actually processed
  2. Archiver manifest (manifest.json) — fallback for pre-v2.0.0 data

Comparison logic:

  • Data types: parquet_rows == jsonl_lines (exact match), or parquet_rows >= jsonl_lines for fanout types
  • Control types: counted and reported, NOT converted to parquet
  • Unknown types: any unrecognized type = FAIL (needs new schema)
StatusCriteria
passAll data types accounted for
warnMinor discrepancies or coverage < 100%
failMissing data or unknown types

Pipeline stats surfaced: lines_json_error, lines_type_unknown, lines_no_schema, parse_batch_dropped

u64 underflow guard: Values > 2^63 in parse_batch_dropped are treated as 0 (known Rust underflow from fanout).

Check 10: exchange_seq_gaps

What: Gaps in exchange-provided sequence numbers per file.

Grouping:

  • Kalshi: group by _shard_id (v1.3.0+), falls back to sid (v1.2.0), then ticker_col
  • Kraken Futures: group by product_id
StatusCriteria
passNo gaps in exchange sequences
warnGaps present but coverage > 95%
failCoverage < 95%

Common failures: Kalshi pre-1.3.0 data groups by sid which collides across shards — expect false gaps on old data. Post-1.3.0 groups by _shard_id for accurate per-shard analysis.

Check 11: data_coverage

What: Percentage of 15-minute time slots (96/day) that have exchange-timestamped data.

StatusCriteria
pass>= 95% of slots have data
warn80-95%
fail< 80%

Feed-specific timestamp columns: Kalshi ts, Kraken time, Polymarket timestamp_ms.

Note: Polymarket coverage may be lower due to activity concentrated in US hours.

Check 12: connection_uptime

What: WebSocket connection uptime from Cloud Monitoring websocket_connected gauge.

StatusCriteria
pass>= 99% uptime (min across all shards)
warn95-99%
fail< 95%

Known issue: Ghost shards from before connector restarts have low data point counts, dragging down the min. The check should consider only the latest shard generation.

Requires: google-cloud-monitoring package, Monitoring Viewer IAM role on the SA.

Check 13: schema_version

What: Verifies parquet files report their schema version via metadata.

StatusCriteria
passSchema version present and recognized
warnSchema version missing (pre-versioning data)
failUnknown schema version

Fanout Rules

Some message types produce multiple parquet rows per JSONL line (1:N fan-out):

FeedTypeReason
krakentickerBatch array: one WS message contains multiple symbol updates
krakentradeBatch array: one WS message contains multiple trades
polymarketprice_changeArray-wrapped messages may contain multiple updates

For fanout types: parquet_rows >= jsonl_lines (not exact match). For non-fanout types: parquet_rows == jsonl_lines (exact match).

Per-Feed Expected Message Types

FeedData Types (have parquet schemas)Control Types (skipped)
kalshiticker, trade, market_lifecycle_v2subscribed, ok, error
krakenticker, tradeheartbeat, status
kraken-futuresticker, tradecontrol
polymarketbook, last_trade_price, price_change, best_bid_asknew_market, market_resolved

Trade Message Schemas

Kalshi NATS message (post-connector injection)

{
  "type": "trade",
  "sid": 2,
  "seq": 5724,
  "msg": {
    "trade_id": "uuid",
    "market_ticker": "KXBTCD-26FEB0317-T76999.99",
    "yes_price": 17,
    "count": 130,
    "taker_side": "no",
    "ts": 1770153448
  },
  "_shard_id": 3
}

Kalshi REST API trade

{
  "trade_id": "uuid",
  "ticker": "KXBTCD-26FEB0317-T76999.99",
  "yes_price": 17,
  "count": 130,
  "taker_side": "no",
  "created_time": "2026-02-03T21:17:28.18002Z"
}

Adding a New Check

  1. Add def check_<name>(con, base, ...) -> dict to data/dq.py
  2. Return dict with at minimum: {"check": "<name>", "status": "pass|warn|fail|skip", "detail": "..."}
  3. Add to check_fns list in DQRunner.run()
  4. Update this skill with the check's criteria and common failures
  5. Deploy: tag dq-v*, update dq-daily.yaml manifest (see ssmd-deploy skill)

GCS Data Layout

gs://ssmd-data/{prefix}/{feed}/{stream}/{date}/
  manifest.json           # archiver manifest (records_by_type, files list)
  parquet-manifest.json   # parquet-gen manifest (parse_batch_input, records_written)
  ticker_000000.parquet   # parquet files per type
  trade_000000.parquet
  ...

Install

Download ZIP
Requires askill CLI v1.0+

AI Quality Score

68/100Analyzed 2/23/2026

High-quality technical reference catalog for 13 DQ checks with excellent clarity and completeness. Well-structured with tables, criteria, failure modes, fanout rules, and schemas. However, it's a reference document (not actionalbe procedure) and highly specific to an internal market data pipeline, limiting reusability. Score benefits from dedicated skills folder location and comprehensive tags.

100
95
40
92
55

Metadata

Licenseunknown
Version-
Updated2/20/2026
Publisheraaronwald

Tags

apici-cdobservability