DCT Diff - Compare Datasets
Compare two data files with key matching and optional aggregation metrics.
When to Use
Use this skill when you need to:
- Validate data consistency between two versions
- Compare production vs test data
- Reconcile data after ETL processes
- Check for data drift over time
- Validate data migrations
Installation
which dct || go build -o dct && chmod +x ./dct
Usage
dct diff <keys> <file1> <file2> [flags]
Arguments
- keys: Key column(s) for matching records. Formats:
  - Single key: id
  - Composite keys: key1,key2
  - Different names: left_col=right_col
- file1: First data file (left side)
- file2: Second data file (right side)
Flags
- -m, --metrics <spec>: Metrics specification (JSON string or file path)
- -a, --all: Show all metrics columns
- -o, --output <file>: Output to file instead of stdout
Examples
Basic Comparison
Compare by single key:
dct diff id left.csv right.csv
Compare by composite keys:
dct diff "first_name,last_name" file1.parquet file2.parquet
Key Name Mapping
When key columns have different names:
dct diff user_id=customer_id old.csv new.csv
With Metrics
Compare with count distinct metric:
dct diff id left.csv right.csv -m '[{"agg":"count_distinct","left":"email","right":"email"}]'
Multiple metrics:
dct diff id left.csv right.csv -m '[{"agg":"mean","left":"amount","right":"amount"},{"agg":"count_distinct","left":"category","right":"category"}]'
Load metrics from file:
dct diff id left.csv right.csv -m metrics.json -a
Metrics Specification
JSON array of metric objects:
[
  {
    "agg": "count_distinct",
    "left": "column_name",
    "right": "column_name"
  }
]
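As a sketch of the file-based form, the snippet below writes a two-metric specification to metrics.json and runs the comparison only when dct and the input files are present. The column names amount and category are taken from the examples above; the python3 validity check is an optional extra and not part of dct.

```shell
# Write a two-metric spec to metrics.json (the file name is arbitrary).
cat > metrics.json <<'EOF'
[
  {"agg": "mean", "left": "amount", "right": "amount"},
  {"agg": "count_distinct", "left": "category", "right": "category"}
]
EOF

# Optional sanity check (assumes python3 is installed): confirm the file parses as JSON.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool metrics.json >/dev/null && echo "metrics.json is valid JSON"
fi

# Run the comparison only if dct is on PATH and both input files exist.
if command -v dct >/dev/null 2>&1 && [ -f left.csv ] && [ -f right.csv ]; then
  dct diff id left.csv right.csv -m metrics.json -a
fi
```

Keeping the specification in a file avoids shell-quoting problems that tend to appear once the inline JSON grows past one or two metrics.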
Available Aggregations
- mean: Average value
- median: Median value
- min: Minimum value
- max: Maximum value
- sum: Sum of values
- count: Count of records
- count_distinct: Count of unique values
Output Columns
Default output includes:
- Key column(s)
- l_cnt: Count from left file
- r_cnt: Count from right file
- cnt_eq: Whether counts match
With metrics and -a flag:
- l_<col>_<agg>: Left aggregation
- r_<col>_<agg>: Right aggregation
- <col>_<agg>_eq: Whether aggregations match
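To cross-check the count columns by hand, the sketch below computes per-key row counts for two small CSV files with standard tools. It assumes that l_cnt and r_cnt are row counts per key value, which this document does not state explicitly. The demo files, the count_by_key helper, and the choice to show 0 for a key missing on one side are all illustrative and not dct behavior. The comma split is naive and does not handle quoted fields.

```shell
# Hypothetical sample data; 'id' (first field) is the key column.
printf 'id,amount\n1,10\n2,20\n2,25\n' > left_demo.csv
printf 'id,amount\n1,10\n2,20\n3,30\n' > right_demo.csv

# Count data rows per key value, skipping the header line.
count_by_key() {
  awk -F, 'NR > 1 { n[$1]++ } END { for (k in n) print k "," n[k] }' "$1" | sort
}

count_by_key left_demo.csv  > left_counts.txt
count_by_key right_demo.csv > right_counts.txt

# Full outer join on the key: key, left count, right count (0 when the key is absent).
join -t, -a1 -a2 -e 0 -o 0,1.2,2.2 left_counts.txt right_counts.txt > counts_demo.csv
cat counts_demo.csv
```

Rows where the second and third fields differ are the keys to inspect in the dct output.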
Best Practices
- Use the -a flag to see all comparison metrics
- Both files must contain the key columns
- Files must have at least one row of data
- Start with a small sample to verify keys work
- Use composite keys when single keys aren't unique
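The sampling and uniqueness advice above can be tried with plain shell tools before running a full comparison. This is a minimal sketch for comma-separated files with a header row and the key in the first column. The helpers sample_rows and duplicate_keys and the demo file are hypothetical and not part of dct.

```shell
# Hypothetical input; 'id' is assumed to be the first column.
printf 'id,amount\n1,10\n2,20\n2,25\n3,30\n' > big_demo.csv

# Keep the header plus the first N data rows: sample_rows FILE N
sample_rows() {
  head -n "$(( $2 + 1 ))" "$1"
}

# Print key values that occur more than once (naive comma split, first column only).
duplicate_keys() {
  tail -n +2 "$1" | cut -d, -f1 | sort | uniq -d
}

sample_rows big_demo.csv 3 > sample_demo.csv
duplicate_keys sample_demo.csv
```

If duplicate_keys prints anything, the single key is not unique in the sample and a composite key may be the better choice.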
Error Handling
Common issues:
- attempted to diff when least one of the files have no data: Check files aren't empty
- Key not found: Verify column names match exactly (case-sensitive)
- Format errors: Ensure metrics JSON is valid
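The first two issues can be caught before dct runs. The sketch below assumes a comma-separated file whose first line is the header, so it does not apply to Parquet inputs. The preflight function is a hypothetical helper, not a dct command.

```shell
# preflight FILE KEY : hypothetical helper, not part of dct.
preflight() {
  file=$1
  key=$2
  # Need a header line plus at least one data row.
  if [ ! -f "$file" ] || [ "$(awk 'END { print NR }' "$file")" -lt 2 ]; then
    echo "$file: missing or has no data rows" >&2
    return 1
  fi
  # Exact, case-sensitive match of KEY against the header fields.
  if ! head -n 1 "$file" | tr -d '\r' | tr ',' '\n' | grep -qxF -- "$key"; then
    echo "$file: key column '$key' not found in header" >&2
    return 1
  fi
}
```

A typical use is to chain the checks in front of the comparison, for example `preflight left.csv id && preflight right.csv id && dct diff id left.csv right.csv`.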
Example Workflow
# 1. Preview both files first
dct peek left.csv -n 3
dct peek right.csv -n 3
# 2. Compare by ID
dct diff id left.csv right.csv -a
# 3. Save results
dct diff id left.csv right.csv -m metrics.json -a -o comparison.csv
Related Skills
- dct-peek: Preview files before comparing
- dct-profile: Check data quality of each file
