How to use an AI assistant to construct SQL queries and data transformation steps for a pipeline
Database connection (any SQL database), AI assistant for query generation
What this does
Writing SQL queries for data pipelines involves joins, window functions, aggregations, and conditional logic that can become error-prone when the pipeline has many transformation stages. This guide demonstrates how to use an AI assistant to generate correct SQL queries from natural language descriptions of your data transformation needs. The assistant also validates queries against the schema, suggests performance optimizations, and outputs self-contained transformation steps ready to be chained in a pipeline.
Steps
Export the database schema as a DDL statement. Use
pg_dump --schema-onlyfor PostgreSQL or.schemafor SQLite. Provide this fully to the AI assistant as context for every query-generation session.Describe the desired transformation in plain language. Include the source tables, the target output columns, any filter conditions, grouping requirements, and the expected sort order. For example: "Join the orders and customers tables on customer_id. Return customer name, total order value (sum of order amounts), and count of orders. Only include customers with more than 2 orders. Sort by total order value descending."
Ask the AI to produce the SQL query with the following constraints: use explicit join syntax only, qualify all column references with table aliases, and include a comment block describing the transformation logic.
Review the generated query for correctness against the schema. Verify that every referenced column exists and every join condition references the correct columns. Ask the AI to explain any parts of the query that are unclear.
Test the query with
EXPLAINorEXPLAIN ANALYZEand provide the output to the AI. Ask if there are index recommendations or query structure changes that would improve performance.For multi-step transformations, ask the AI to produce the full sequence of CTEs or temporary tables in dependency order, naming each intermediate result clearly so the pipeline graph is self-documenting.
Verification
# Verify the AI-generated query runs without errors and returns rows
python3 -c "
import subprocess, sys, json
result = subprocess.run(
['python3', 'scripts/run_generated_query.py'],
capture_output=True, text=True, timeout=30
)
output = json.loads(result.stdout)
assert output['row_count'] > 0, 'Query returned zero rows'
assert output['errors'] == [], f'Query errors: {output[\"errors\"]}'
print(f'Verification passed: {output[\"row_count\"]} rows returned, no errors')
sys.exit(0)
"
# Expected: Verification passed: <N> rows returned, no errors
Common failures
- The AI generates a query using a column that does not exist in the provided schema. The AI may hallucinate typical column names. Always validate by running the query in a dry-run mode (e.g.,
EXPLAINwithout execution) and check for column-not-found errors before executing. - The generated query uses implicit joins (comma-separated FROM clauses) that produce Cartesian products. Explicitly instruct the AI to use
INNER JOIN ... ONsyntax only and reject any query containing a comma in the FROM clause. - The AI omits NULL handling, causing unexpected empty result sets. When columns can be NULL, explicitly ask the AI to add
COALESCEor NULL-safe comparison operators. Review every WHERE clause for NULL-safety. - The AI produces a correct query for a different SQL dialect than the target database. PostgreSQL, MySQL, and SQLite differ in string concatenation, date functions, and LIMIT syntax. Always specify the database engine and version at the start of the session.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.