What this does

Batch ETL pipelines move data from source systems into analytics-ready data warehouses on a periodic schedule. When the transformation logic requires unstructured-to-structured conversion (e.g., extracting entities from free-text fields, classifying records, or normalizing inconsistent category labels), AI models are uniquely suited for the transform step. This guide covers designing a pipeline that extracts paginated data from REST APIs, applies AI-driven transformations in batches to manage cost and latency, validates outputs, and loads the final records into a data warehouse for BI and reporting.

Steps

Define the target schema as Pydantic models for both the source (API response) and target (warehouse table) data structures. This provides runtime validation at both boundaries of the pipeline.
Implement the extract phase with pagination handling. Use httpx with a retry-enabled transport. Walk through paginated endpoints following the API's pagination mechanism (cursor, offset, or page-based) until all records are collected. Write each page of raw records to an intermediate staging file (stage/raw_page_N.jsonl).
Design the transform phase as a batched AI processing step. Chunk raw records into batches of 10-25 items (small enough to fit in the AI context window, large enough to amortize API overhead). For each batch, construct a prompt containing the raw records and the target schema, instructing the AI to return an array of transformed JSON objects matching the target Pydantic model.
Implement transformation validation. After AI processing, parse the response array and validate each transformed record against the target Pydantic model. Rejected records go to dead_letter/batch_N.jsonl with the validation error message appended.
Load valid records into the data warehouse using batched inserts (100-500 rows per INSERT for optimal throughput). Use a staging table approach: insert into a temporary staging table, then atomically swap or merge into the production table.
Orchestrate the full pipeline with a Python script that calls extract, transform, and load in sequence, logging timing and record counts for each phase.
Schedule the pipeline with cron or an orchestrator (Airflow, Prefect, or Dagster) to run at the desired interval.

Verification

# Verify the ETL pipeline completes and the warehouse contains transformed records
python3 -c "
import subprocess, json, sys
result = subprocess.run(
    ['python3', 'scripts/run_etl_pipeline.py'],
    capture_output=True, text=True, timeout=120
)
output = json.loads(result.stdout)
assert output['extracted'] > 0, 'No records extracted from API'
assert output['transformed'] > 0, 'No records transformed'
assert output['loaded'] > 0, 'No records loaded to warehouse'
assert output['loaded'] == output['transformed'], 'Load count mismatch'
print(f'Verification passed: {output[\"loaded\"]} records extracted, transformed, and loaded')
sys.exit(0)
"
# Expected: Verification passed: <N> records extracted, transformed, and loaded

Common failures

API pagination fails silently, extracting only the first page repeatedly. The AI cannot detect this because it processes whatever data it receives. Implement a page-count assertion: if the extracted record count is exactly the page size, verify the next page exists and that its records differ from the current page.
AI transformation produces keys in a different case or format than the target schema. Some AI models randomly capitalize keys or add extra whitespace. Add a normalization step after parsing that lowercases all dictionary keys and strips trailing whitespace from string values.
A large batch exceeds the AI context window and is silently truncated. Measure the byte size of each batch before sending. If it exceeds 50,000 bytes, split the batch in half and process separately. Log each batch size so you can tune the batch count over time.
Warehouse connection times out during long load phases due to idle timeout settings. Set the database connection keepalive parameters (keepalives_idle, tcp_keepalives_interval) and re-establish the connection before the load phase if the transform phase took more than 5 minutes.
Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.

How to design a batch data pipeline that extracts from APIs, transforms with AI, and loads into a data warehouse

What this does

Steps

Verification

Common failures

Related guides