© 2026 runlocalai.co · Independently operated
HOW-TO · DEV

How to build a scheduled web scraping job that feeds extracted data into a database using AI parsing

advanced·35 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Working web scraper script, database (PostgreSQL/SQLite), scheduler tool

What this does

Manually running a scraper produces one-off data that quickly goes stale. This guide describes how to build an end-to-end scheduled scraping pipeline: the scraper runs on a recurring cron schedule, an AI model parses the raw HTML into structured records, and those records are upserted into a database for downstream querying and analysis. The pipeline handles deduplication and failed-job recovery automatically.

Steps

  1. Define a database schema with columns for the extracted fields (e.g., product name, price, URL, and a scraped_at timestamp). Add a UNIQUE constraint on the natural key (typically the URL or item ID) to support upsert logic.

  2. Separate the scraper into two components: the fetch layer (downloading HTML) and the parse layer (AI-driven extraction). The fetch layer writes raw HTML to a staging table or local file cache keyed by URL and retrieved timestamp.

  3. The parse layer reads new or updated HTML from the staging area, sends it to the AI model for structured extraction, and receives JSON records back. Validate each record against the schema before insertion.

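  4. Implement upsert logic using INSERT ... ON CONFLICT DO UPDATE (PostgreSQL, and SQLite 3.24 or later) so re-scraped URLs update existing rows rather than creating duplicates. SQLite's INSERT OR REPLACE also avoids duplicates, but it deletes the conflicting row and inserts a new one, so any column not supplied in the statement is reset to its default.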

  5. Add a deduplication step that computes a content hash (SHA-256 of normalized extracted fields) and skips database writes if the hash is unchanged from the previous scrape.

  6. Wrap the full pipeline (fetch + parse + upsert) in a single runnable script and schedule it with cron or a systemd timer. Record each job run (start time, end time, status, and record count) in a scrape_job_log table.

  7. Add failure handling: if the AI API is unavailable, retry up to three times with exponential backoff. If parsing fails for a specific URL, log it and continue to the next URL without aborting the entire batch.
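
Steps 1, 4, and 5 can be sketched together with Python's built-in sqlite3 module. This is a minimal illustration, not the guide's reference implementation: the products columns, the content_hash column, and the helper names normalize, content_hash, and upsert_if_changed are all assumptions chosen for the example. It assumes SQLite 3.24 or later, which is required for ON CONFLICT DO UPDATE.

```python
import hashlib
import json
import sqlite3

# Assumed schema: the UNIQUE constraint on url is the natural key for upserts.
SCHEMA = """
CREATE TABLE IF NOT EXISTS products (
    id           INTEGER PRIMARY KEY,
    url          TEXT NOT NULL UNIQUE,
    name         TEXT NOT NULL,
    price        REAL,
    content_hash TEXT NOT NULL,
    scraped_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def normalize(record):
    """Trim string fields so cosmetic whitespace does not change the hash."""
    return {k: (v.strip() if isinstance(v, str) else v) for k, v in record.items()}


def content_hash(record):
    """SHA-256 over a stable JSON encoding (sorted keys) of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert_if_changed(conn, url, record):
    """Write the record unless its hash matches the stored row.

    Returns True when a row was inserted or updated, False when skipped.
    """
    record = normalize(record)
    new_hash = content_hash(record)
    row = conn.execute(
        "SELECT content_hash FROM products WHERE url = ?", (url,)
    ).fetchone()
    if row is not None and row[0] == new_hash:
        return False  # unchanged since the previous scrape
    conn.execute(
        """
        INSERT INTO products (url, name, price, content_hash)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            name = excluded.name,
            price = excluded.price,
            content_hash = excluded.content_hash,
            scraped_at = CURRENT_TIMESTAMP
        """,
        (url, record["name"], record.get("price"), new_hash),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")  # in-memory database for the demo only
conn.executescript(SCHEMA)
item = {"name": "Example widget ", "price": 19.99}
print(upsert_if_changed(conn, "https://example.com/p/1", item))  # True
print(upsert_if_changed(conn, "https://example.com/p/1", item))  # False
```

Storing the hash in the same row keeps the "has anything changed?" check to a single indexed lookup per URL. For PostgreSQL the same ON CONFLICT clause applies, with the driver's own placeholder syntax in place of the question marks.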
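
The failure handling in step 7 might look like the sketch below. It is illustrative only: call_with_retries, parse_batch, and flaky_parse are hypothetical names, and the choice to treat ConnectionError and TimeoutError as "API unavailable" is an assumption. Replace retry_on with whatever exceptions your model client actually raises.

```python
import logging
import time

logger = logging.getLogger("scrape_job")


def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call func(*args), retrying up to `retries` times on availability errors.

    Waits base_delay, then 2x, then 4x between attempts (exponential backoff).
    Any exception not listed in retry_on propagates immediately.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except retry_on as exc:
            if attempt == retries:
                raise  # out of retries: let the caller decide what to do
            delay = base_delay * 2 ** attempt
            logger.warning("attempt %d failed (%s); retrying in %.1fs",
                           attempt + 1, exc, delay)
            sleep(delay)


def parse_batch(pages, parse_fn, sleep=time.sleep):
    """Parse each (url, html) pair; a failing URL is logged and skipped."""
    records, failed = [], []
    for url, html in pages:
        try:
            records.append((url, call_with_retries(parse_fn, html, sleep=sleep)))
        except Exception:
            logger.exception("giving up on %s", url)
            failed.append(url)
    return records, failed


# Demo with a stand-in parser: the first two calls simulate an outage,
# and any page containing "broken" simulates unusable model output.
calls = {"n": 0}


def flaky_parse(html):
    if "broken" in html:
        raise ValueError("model returned unusable output")
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("model server unavailable")
    return {"name": html}


delays = []  # record the backoff delays instead of actually sleeping
records, failed = parse_batch(
    [("https://example.com/a", "Widget"), ("https://example.com/b", "broken page")],
    flaky_parse,
    sleep=delays.append,
)
print(failed)  # ['https://example.com/b']
print(delays)  # [1.0, 2.0]
```

Passing sleep as a parameter keeps the backoff testable without real waiting. The failed list is what you would write to the job log alongside the record count.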

Verification

# Verify that the scheduled job runs cleanly and that the database contains records
python3 -c "
import sqlite3, subprocess
# Run the job once; raise the timeout if a full scrape takes longer than 60 seconds
result = subprocess.run(
    ['python3', 'scripts/scheduled_scrape_job.py'],
    capture_output=True, text=True, timeout=60
)
# Fail early, with the job's stderr, if the script itself exited with an error
assert result.returncode == 0, f'Job failed: {result.stderr}'
conn = sqlite3.connect('data/scraped_records.db')
count = conn.execute('SELECT COUNT(*) FROM products').fetchone()[0]
conn.close()
assert count > 0, 'No records found in database'
print(f'Verification passed: {count} records in database')
"
# Expected: Verification passed: <N> records in database

Common failures

  • The cron job runs before the previous invocation finishes, causing concurrent DB write conflicts. Use a lock file (/tmp/scrape_job.lock) or database advisory lock to ensure only one instance runs at a time.
  • AI parsing returns inconsistent JSON structure across runs. Pin the AI model to a specific version for production pipelines. Add a JSON schema validation step that rejects records not matching the expected structure and writes them to a dead-letter queue for manual review.
  • The source website changes its HTML structure, and extraction degrades silently: records come back empty or partial, and the content hash deduplication raises no error. Implement a daily monitoring check that compares the number of records produced against a historical baseline. Alert if the count drops by more than 20%.
  • The database connection pool is exhausted during long-running jobs. Use a connection manager that closes connections after each batch of 50 writes. Cap the pool at a small size (5) for scheduled jobs; the setting is named pool_size in SQLAlchemy and max_size in psycopg_pool.
  • Version mismatch: the installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
  • Local environment drift: another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
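
The lock-file remedy for overlapping cron runs can be sketched with fcntl.flock from the standard library (Linux and other Unix systems only). The helper name acquire_lock is hypothetical, and the demo uses a temporary directory rather than /tmp/scrape_job.lock so it does not interfere with a real job.

```python
import fcntl
import os
import tempfile


def acquire_lock(path):
    """Return an open file holding an exclusive lock, or None if it is taken.

    LOCK_NB makes the call fail immediately instead of waiting, so a second
    invocation can exit straight away rather than queueing behind the first.
    """
    handle = open(path, "w")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Demo: the second attempt fails while the first handle is still open.
lock_path = os.path.join(tempfile.mkdtemp(), "scrape_job.lock")
first = acquire_lock(lock_path)
second = acquire_lock(lock_path)
print(first is not None, second is None)  # True True
first.close()  # closing the file releases the lock
```

In the job script, call acquire_lock once at startup and exit if it returns None. Because the kernel releases the lock when the process ends, a crashed run does not leave a stale lock behind, which is the usual weakness of checking only whether the lock file exists.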
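
For the inconsistent-JSON failure above, a validation gate with a dead-letter list might look like the following. It uses only the standard library; REQUIRED_FIELDS, validate_record, and split_valid are hypothetical names, and the field list is an assumption based on the example columns from step 1. A dedicated JSON Schema library could replace the hand-written checks.

```python
import json

# Assumed record shape: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"name": str, "price": (int, float), "url": str}


def validate_record(raw):
    """Return (record, None) when raw is valid, otherwise (None, reason)."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return None, "top-level value is not an object"
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            return None, f"missing field: {field}"
        value = record[field]
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return None, f"wrong type for field: {field}"
    return record, None


def split_valid(raw_outputs):
    """Separate model outputs into valid records and dead-letter entries."""
    valid, dead_letter = [], []
    for raw in raw_outputs:
        record, reason = validate_record(raw)
        if record is None:
            dead_letter.append({"raw": raw, "reason": reason})
        else:
            valid.append(record)
    return valid, dead_letter


outputs = [
    '{"name": "Widget", "price": 19.99, "url": "https://example.com/p/1"}',
    '{"name": "Widget", "price": "19.99"}',
    "not json",
]
valid, dead_letter = split_valid(outputs)
print(len(valid), len(dead_letter))  # 1 2
```

Keeping the raw model output next to the rejection reason makes manual review practical. In a real pipeline the dead_letter entries would be written to a table or file rather than held in memory.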

Related guides

  • How to set up an AI-assisted web scraping pipeline that extracts structured data from HTML
  • How to handle API rate limiting and retry logic in AI-integrated API calls