Database Handling Specifications
Goals
- Safely explore database schema without performance degradation.
- Construct precise, efficient SQL queries that prevent system crashes (OOM & OOT).
- Handle dialect-specific nuances (PostgreSQL, MySQL, SQLite, etc.).
- Transform raw result sets into structured, validated data for analysis.
Inspection
- Volume Estimation:
- Before any
SELECT *, always runSELECT COUNT(*) FROM table_nameto understand the scale. - If a table has >1,000,000 rows, strictly use indexed columns for filtering.
- Before any
- Sample Data:
- Use
SELECT * FROM table_name LIMIT 5to see actual data formats.
- Use
Querying
- Safety Constraints:
- Always use
LIMIT: Never execute a query without aLIMITclause unless the row count is confirmed to be small. - Avoid
SELECT *: In production-scale tables, explicitly name columns to reduce I/O and memory usage.
- Always use
- Dialect & Syntax:
- Case Sensitivity: If a column/table name contains uppercase or special characters, MUST quote it (e.g.,
"UserTable"in Postgres,`UserTable`in MySQL). - Date/Time: Use standard ISO strings for date filtering; be mindful of timezone-aware vs. naive columns.
- Case Sensitivity: If a column/table name contains uppercase or special characters, MUST quote it (e.g.,
- Complex Queries:
- For
JOINoperations, ensure joining columns are indexed to prevent full table scans. - When performing
GROUP BY, ensure the result set size is manageable.
- For
Data Retrieval & Transformation
- Type Mapping:
- Ensure SQL types (e.g.,
DECIMAL,BIGINT,TIMESTAMP) are correctly mapped to Python/JSON types without precision loss. - Convert
NULLvalues to a consistent "missing" representation (e.g.,NoneorNaN).
- Ensure SQL types (e.g.,
- Chunked Fetching:
- For medium-to-large exports, use
fetchmany(size)orOFFSET/LIMITpagination instead of fetching everything into memory at once.
- For medium-to-large exports, use
- Aggregations:
- Prefer performing calculations (SUM, AVG, COUNT) at the database level rather than pulling raw data to the client for processing.
Error Handling & Recovery
- Timeout Management: If a query takes too long, retry with more restrictive filters or optimized joins.
- Syntax Errors: If a query fails, inspect the dialect-specific error message and re-verify the schema (it's often a misspelled column or missing quotes).
Anti-Pattern Prevention (Avoiding "Bad" SQL)
- Index-Friendly Filters: Never wrap indexed columns in functions (e.g.,
DATE(),UPPER()) within theWHEREclause. - Join Safety: Always verify join keys. Before joining, check if the key has high cardinality to avoid massive intermediate result sets.
- Memory Safety:
- Avoid
DISTINCTandUNION(which performs de-duplication) on multi-million row sets unless necessary; useUNION ALLif duplicates are acceptable. - Avoid
ORDER BYon large non-indexed text fields.
- Avoid
- Wildcard Warning: Strictly avoid leading wildcards in
LIKEpatterns (e.g.,%term) on large text columns. - No Function on Columns:
WHERE col = FUNC(val)is good;WHERE FUNC(col) = valis bad. - Explicit Columns: Only fetch what is necessary.
- Early Filtering: Push
WHEREconditions as close to the base tables as possible. - CTE for Clarity: Use
WITHfor complex multi-step logic to improve maintainability and optimizer hints.
Best Practices
- Always verify database structure before querying
- Use appropriate sampling techniques for large datasets
- Optimize queries for efficiency based on schema inspection
- Self-review the draft SQL against the "Anti-Pattern Prevention" list.
- Perform a silent mental 'EXPLAIN' on your query. If it smells like a full table scan on a large table, refactor it before outputting
