DCT Generate - Create Synthetic Data
Generate realistic test data with customizable schemas and field types.
When to Use
Use this skill when you need to:
- Create test datasets for development
- Generate mock data for demos
- Produce synthetic data for testing ETL pipelines
- Create data with specific distributions
- Generate data with referential integrity
Installation
which dct || { go build -o dct && chmod +x ./dct; }
Usage
dct gen <schema> [flags]
Arguments
schema: JSON schema as a file path or inline JSON string
Flags
-n, --lines <number>: Number of rows to generate (default: 1)
-f, --format <format>: Output format, csv or ndjson (default: csv)
-o, --outfile <file>: Output file path (default: stdout)
Examples
From schema file:
dct gen schema.json -n 1000 -o test_data.csv
Inline schema:
dct gen '[{"field":"name","source":"firstNames"}]' -n 100
NDJSON output:
dct gen schema.json -n 500 -f ndjson -o output.ndjson
Generate to stdout:
dct gen users-schema.json -n 10
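When these commands are driven from a script, the argument list can be assembled programmatically. The sketch below is a minimal Python helper, not part of dct: `build_gen_command` and `run_gen` are hypothetical names, and only the `-n`, `-f`, and `-o` flags documented above are used. A schema given as a list is passed as inline JSON.

```python
import json
import shutil
import subprocess


def build_gen_command(schema, lines=1, fmt="csv", outfile=None):
    """Build the argv for `dct gen` using only the flags documented above.

    `schema` is either a path to a schema file (str) or a list of field
    objects, which is serialized and passed as an inline JSON string.
    """
    schema_arg = schema if isinstance(schema, str) else json.dumps(schema)
    argv = ["dct", "gen", schema_arg, "-n", str(lines), "-f", fmt]
    if outfile is not None:
        argv += ["-o", outfile]
    return argv


def run_gen(argv):
    """Run the command if dct is on the PATH; return None when it is not installed."""
    if shutil.which(argv[0]) is None:
        return None
    return subprocess.run(argv, check=True)


# Hypothetical usage: the inline-schema example from above, as an argv list.
cmd = build_gen_command([{"field": "name", "source": "firstNames"}], lines=100)
print(cmd)
```

Passing the arguments as a list avoids shell quoting problems with inline JSON, which contains both quotes and brackets.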
Schema Format
Array of field objects:
[
{
"field": "column_name",
"source": "source_type",
"config": { ... }
}
]
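A schema can be checked for this shape before it is handed to the tool. The following Python sketch is one possible pre-flight check. Its rules are inferred from the format shown above rather than taken from dct's own validation, and the duplicate-name check is an added assumption. `validate_schema` is a hypothetical helper name.

```python
import json


def validate_schema(schema):
    """Return a list of problems found in a schema; an empty list means it looks well formed.

    The rules are inferred from the documented format, not from dct itself.
    """
    if not isinstance(schema, list):
        return ["schema must be a JSON array of field objects"]
    problems = []
    seen = set()
    for i, entry in enumerate(schema):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: expected an object")
            continue
        # "field" and "source" appear in every documented example.
        for key in ("field", "source"):
            value = entry.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"entry {i}: missing or empty '{key}'")
        # "config" is optional, but when present it is an object.
        if "config" in entry and not isinstance(entry["config"], dict):
            problems.append(f"entry {i}: 'config' must be an object")
        # Assumption: two columns with the same name are a mistake.
        name = entry.get("field")
        if isinstance(name, str):
            if name in seen:
                problems.append(f"entry {i}: duplicate field '{name}'")
            seen.add(name)
    return problems


schema = json.loads('[{"field": "active", "source": "randomBool"}]')
print(validate_schema(schema))  # prints []
```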
Available Data Sources
Random Generators
- randomBool - Boolean true/false
  {"field": "active", "source": "randomBool"}
- randomEnum - Random value from list
  {"field": "status", "source": "randomEnum", "config": {"values": ["pending", "active", "inactive"]}}
- randomAscii - Random ASCII string
  {"field": "code", "source": "randomAscii", "config": {"length": 10}}
- randomUniformInt - Uniform integer distribution
  {"field": "age", "source": "randomUniformInt", "config": {"min": 18, "max": 65}}
- randomNormal - Normal/Gaussian distribution
  {"field": "score", "source": "randomNormal", "config": {"mean": 100, "std": 15}}
- randomPoisson - Poisson distribution
  {"field": "events", "source": "randomPoisson", "config": {"lambda": 5}}
- randomDatetime - Random date/time
  {"field": "created_at", "source": "randomDatetime", "config": {"min": "2024-01-01 00:00:00", "max": "2024-12-31 23:59:59", "tz": "UTC"}}
- randomDate - Random date
  {"field": "birth_date", "source": "randomDate", "config": {"min": "1980-01-01", "max": "2005-12-31"}}
- randomTime - Random time
  {"field": "meeting_time", "source": "randomTime", "config": {"min": "09:00:00", "max": "17:00:00"}}
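To get a feel for how the `min`/`max`, `mean`/`std`, and `lambda` parameters shape the values, the sketch below draws samples from the same three distributions using Python's standard library. It illustrates the distributions only and is not how dct generates values. Whether dct treats `max` as inclusive is not stated in this document.

```python
import math
import random
import statistics


def poisson(rng, lam):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(42)  # fixed seed so repeated runs give the same samples
n = 10_000

ages = [rng.randint(18, 65) for _ in range(n)]   # uniform integers; both ends inclusive in Python
scores = [rng.gauss(100, 15) for _ in range(n)]  # normal with mean 100 and std 15
events = [poisson(rng, 5) for _ in range(n)]     # Poisson with lambda 5

print("age range:", min(ages), max(ages))
print("score mean/std:", round(statistics.mean(scores), 1), round(statistics.stdev(scores), 1))
print("events mean:", round(statistics.mean(events), 2))
```

A uniform source suits values with hard bounds such as ages, a normal source suits values clustered around a typical level such as scores or salaries, and a Poisson source suits non-negative event counts.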
Data Generators
- uuid - UUID v4
  {"field": "id", "source": "uuid"}
- firstNames - Random first names
  {"field": "first_name", "source": "firstNames"}
- lastNames - Random last names
  {"field": "last_name", "source": "lastNames"}
- companies - Company names
  {"field": "company", "source": "companies"}
- emails - Email addresses
  {"field": "email", "source": "emails"}
Derived Fields
Create computed fields using the Expr language:
{
"field": "full_name",
"source": "derived",
"config": {
"fields": ["first_name", "last_name"],
"expression": "first_name + ' ' + last_name"
}
}
Complex expressions:
{
"field": "display_name",
"source": "derived",
"config": {
"fields": ["first_name", "last_name", "company"],
"expression": "first_name + ' ' + last_name + ' (' + company + ')'"
}
}
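As a mental model, a derived field reads the values named in `fields` from the row being built and combines them. The Python sketch below checks that each derived field only names inputs that are defined elsewhere in the schema, then mirrors the `display_name` expression with plain string concatenation. It does not evaluate Expr. `missing_dependencies` is a hypothetical helper, not part of dct, and the row values are made up for illustration.

```python
def missing_dependencies(schema):
    """Map each derived field to the input fields it names that the schema does not define."""
    defined = {entry["field"] for entry in schema}
    missing = {}
    for entry in schema:
        if entry.get("source") != "derived":
            continue
        inputs = entry.get("config", {}).get("fields", [])
        unknown = [name for name in inputs if name not in defined]
        if unknown:
            missing[entry["field"]] = unknown
    return missing


# This schema uses "company" in a derived field without defining it.
schema = [
    {"field": "first_name", "source": "firstNames"},
    {"field": "last_name", "source": "lastNames"},
    {"field": "display_name", "source": "derived",
     "config": {"fields": ["first_name", "last_name", "company"],
                "expression": "first_name + ' ' + last_name + ' (' + company + ')'"}},
]
print(missing_dependencies(schema))  # {'display_name': ['company']}

# Python equivalent of the display_name expression for one illustrative row.
row = {"first_name": "Ada", "last_name": "Lovelace", "company": "Acme"}
display_name = row["first_name"] + " " + row["last_name"] + " (" + row["company"] + ")"
print(display_name)  # Ada Lovelace (Acme)
```

Adding `{"field": "company", "source": "companies"}` to the schema makes the check return an empty dictionary.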
Complete Schema Example
[
{"field": "id", "source": "uuid"},
{"field": "first_name", "source": "firstNames"},
{"field": "last_name", "source": "lastNames"},
{"field": "email", "source": "emails"},
{"field": "age", "source": "randomUniformInt", "config": {"min": 18, "max": 65}},
{"field": "department", "source": "randomEnum", "config": {"values": ["Engineering", "Sales", "Marketing", "HR"]}},
{"field": "salary", "source": "randomNormal", "config": {"mean": 75000, "std": 15000}},
{"field": "is_active", "source": "randomBool"},
{
"field": "full_name",
"source": "derived",
"config": {
"fields": ["first_name", "last_name"],
"expression": "first_name + ' ' + last_name"
}
}
]
Best Practices
- Generate a small sample first (-n 10) to verify the schema
- Use derived fields to create realistic relationships
- Use NDJSON format for nested/complex data
- Save schemas to files for reuse
- Use appropriate distributions for realistic data
Output Formats
CSV (default):
id,first_name,age
550e8400-e29b-41d4-a716-446655440000,John,34
NDJSON:
{"id":"550e8400-e29b-41d4-a716-446655440000","first_name":"John","age":34}
{"id":"550e8400-e29b-41d4-a716-446655440001","first_name":"Jane","age":28}
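Both formats are easy to read back for a quick check. The sketch below parses the sample rows shown above with Python's `csv` and `json` modules. CSV yields every value as a string, while NDJSON keeps `age` as a number in these samples.

```python
import csv
import io
import json

csv_text = """id,first_name,age
550e8400-e29b-41d4-a716-446655440000,John,34
"""
ndjson_text = """{"id":"550e8400-e29b-41d4-a716-446655440000","first_name":"John","age":34}
{"id":"550e8400-e29b-41d4-a716-446655440001","first_name":"Jane","age":28}
"""

# CSV: the first line is the header; DictReader maps each row onto it.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# NDJSON: one JSON object per non-empty line.
ndjson_rows = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

print(csv_rows[0]["age"], type(csv_rows[0]["age"]).__name__)        # 34 str
print(ndjson_rows[0]["age"], type(ndjson_rows[0]["age"]).__name__)  # 34 int
```

To read generated files instead of inline text, replace `io.StringIO(csv_text)` with `open(path, newline="")` and iterate over the lines of the NDJSON file.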
Related Skills
dct-peek: Verify generated data looks correct
dct-infer: Check schema of generated data
dct-diff: Compare generated data with production samples
