How to build a scheduled web scraping job that feeds extracted data into a database using AI parsing
Working web scraper script, database (PostgreSQL/SQLite), scheduler tool
What this does
Manually running a scraper produces one-off data that quickly becomes stale. This guide describes how to build an end-to-end scheduled scraping pipeline where the scraper runs on a recurring cron schedule, raw HTML is processed by an AI model that parses unstructured content into structured records, and the resulting records are upserted into a database for downstream querying and analysis. The pipeline handles duplicates, deduplication, and failed job recovery automatically.
Steps
Define a database schema with columns for the extracted fields (e.g., product name, price, URL, and a
scraped_attimestamp). Add aUNIQUEconstraint on the natural key (typically the URL or item ID) to support upsert logic.Separate the scraper into two components: the fetch layer (downloading HTML) and the parse layer (AI-driven extraction). The fetch layer writes raw HTML to a staging table or local file cache keyed by URL and retrieved timestamp.
The parse layer reads new or updated HTML from the staging area, sends it to the AI for structured extraction, and receives validated JSON records back. Validate each record against the schema before insertion.
Implement upsert logic using
INSERT ... ON CONFLICT DO UPDATE(PostgreSQL) orINSERT OR REPLACE(SQLite) so re-scraped URLs update existing rows rather than creating duplicates.Add a deduplication step that computes a content hash (SHA-256 of normalized extracted fields) and skips database writes if the hash is unchanged from the previous scrape.
Wrap the full pipeline (fetch + parse + upsert) in a single runnable script and schedule it with cron or systemd timer. Record each job run with start time, end time, status, and record count in a
scrape_job_logtable.Add failure handling: if the AI API is unavailable, retry up to three times with exponential backoff. If parsing fails for a specific URL, log it and continue to the next URL without aborting the entire batch.
Verification
# Verify scheduled job produces records in the database
python3 -c "
import sqlite3, subprocess, sys
result = subprocess.run(
['python3', 'scripts/scheduled_scrape_job.py'],
capture_output=True, text=True, timeout=60
)
conn = sqlite3.connect('data/scraped_records.db')
count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
assert count > 0, 'No records found in database'
print(f'Verification passed: {count} records in database')
conn.close()
sys.exit(0)
"
# Expected: Verification passed: <N> records in database
Common failures
- The cron job runs before the previous invocation finishes, causing concurrent DB write conflicts. Use a lock file (
/tmp/scrape_job.lock) or database advisory lock to ensure only one instance runs at a time. - AI parsing returns inconsistent JSON structure across runs. Pin the AI model to a specific version for production pipelines. Add a JSON schema validation step that rejects records not matching the expected structure and writes them to a dead-letter queue for manual review.
- The source website changes its HTML structure, breaking the content hash deduplication silently. Implement a daily monitoring check that compares the number of records produced against a historical baseline. Alert if the count drops by more than 20%.
- Database connection pool exhausts during long-running jobs. Use a connection manager that closes connections after each batch of 50 writes. Set
pool_maxsizeto a small number (5) for scheduled jobs. - Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.