HOW-TO · DEV
How to set up an AI-assisted web scraping pipeline that extracts structured data from HTML
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Python 3.10+, requests or httpx, BeautifulSoup4 or Playwright, AI API access (OpenAI or Anthropic)
What this does
Web scraping typically produces raw HTML that requires brittle CSS or XPath selectors to extract meaningful data. This guide explains how to build a pipeline where raw HTML is fed to an AI assistant which returns structured JSON matching a defined schema. This approach handles pages with complex layouts, inconsistent markup, and embedded content that would be difficult to parse with selectors alone. The pipeline is suitable for recurring scrape jobs such as aggregating product listings, news articles, or job postings.
Steps
- Define an extraction schema using a Pydantic model or a JSON Schema document that describes the fields to extract, their types, and whether each field is required or optional.
- Write a fetch function using httpx that retrieves the target URL and returns the full HTML body. Add headers to mimic a standard browser user agent.
- Write a truncate function that limits the HTML to the first 8,000 characters. Excessively long HTML can exceed the model's context window or bury the relevant content; if the page is long, target a specific <main> or <article> section using BeautifulSoup before truncating.
- Construct an AI prompt: include the extraction schema, the truncated HTML, and the instruction to return a JSON object matching the schema and nothing else.
- Call the AI API with the prompt and parse the response as JSON. Validate it against the extraction schema using Pydantic or jsonschema.
- Write the validated record to a JSON Lines file (output/records.jsonl), appending one record per line.
- Add error handling: if the AI returns malformed JSON or the schema validation fails, log the failure, write the raw AI response to a dead-letter file, and continue to the next URL.
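The prompt, validation, output, and error-handling steps above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not a reference implementation. The names SCHEMA, build_prompt, validate, and process_url are hypothetical, and so are the title and price fields and the dead_letter.jsonl file name. A small hand-rolled check stands in for Pydantic or jsonschema so that the sketch uses only the standard library, and call_model is a placeholder for whichever AI API client is in use. Targeting a <main> or <article> section is left out; the sketch truncates by character count only.

```python
import json
import logging
from pathlib import Path
from typing import Callable

# Hypothetical schema: field name -> (expected type, required?).
# A Pydantic model or JSON Schema document would replace this in practice.
SCHEMA = {"title": (str, True), "price": (float, False)}
MAX_HTML_CHARS = 8000  # truncation limit from the steps above


def build_prompt(html: str) -> str:
    """Combine the schema, the truncated HTML, and the output instruction."""
    fields = {
        name: {"type": typ.__name__, "required": required}
        for name, (typ, required) in SCHEMA.items()
    }
    return (
        "Extract the following fields from the HTML below.\n"
        f"Schema: {json.dumps(fields)}\n"
        "Return a JSON object matching the schema and nothing else. "
        "Use null for missing optional fields.\n\n"
        f"HTML:\n{html[:MAX_HTML_CHARS]}"
    )


def validate(record: object) -> dict:
    """Return the record if it matches SCHEMA, otherwise raise ValueError."""
    if not isinstance(record, dict):
        raise ValueError("response is not a JSON object")
    for name, (typ, required) in SCHEMA.items():
        value = record.get(name)
        if value is None:
            if required:
                raise ValueError(f"missing required field: {name}")
            continue
        # JSON has a single number type, so accept ints where floats are expected.
        allowed = (int, float) if typ is float else (typ,)
        # bool is a subclass of int; this schema has no boolean fields, so reject it.
        if isinstance(value, bool) or not isinstance(value, allowed):
            raise ValueError(f"wrong type for field: {name}")
    return record


def process_url(
    url: str,
    html: str,
    call_model: Callable[[str], str],
    out_dir: Path = Path("output"),
) -> bool:
    """Extract one record; return True on success, False if dead-lettered."""
    out_dir.mkdir(parents=True, exist_ok=True)
    raw = call_model(build_prompt(html))
    try:
        # json.JSONDecodeError is a subclass of ValueError.
        record = validate(json.loads(raw))
    except ValueError as exc:
        logging.warning("extraction failed for %s: %s", url, exc)
        with open(out_dir / "dead_letter.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps({"url": url, "error": str(exc), "raw": raw}) + "\n")
        return False
    # Assumption: the pipeline, not the model, records the source URL.
    record["url"] = url
    with open(out_dir / "records.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return True
```

In a real run, html would come from the fetch function and call_model would wrap the chosen API client; passing the model call in as a function also makes it possible to exercise the pipeline offline with a stub that returns a fixed string.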
Verification
# Verify the output file exists, contains valid JSON Lines, and has records
test -f output/records.jsonl && \
python3 -c "
import json
# Parse every non-empty line; json.loads raises if a line is not valid JSON
with open('output/records.jsonl') as f:
    records = [json.loads(line) for line in f if line.strip()]
assert len(records) > 0, 'No records found'
for rec in records:
    assert 'url' in rec, 'Record missing url field'
print(f'Verified: {len(records)} records in output/records.jsonl')
"
# Expected: Verified: <N> records in output/records.jsonl
Common failures
- The AI returns Markdown fences instead of raw JSON. Add a post-processor that strips the first and last lines if they contain triple backticks and parses the remainder.
- The target site requires JavaScript rendering or runs anti-bot checks, so a plain HTTP fetch returns incomplete HTML. Use Playwright or Selenium to render the page fully before passing HTML to the AI.
- Rate limiting or IP blocking from the target site. Add a 2-second sleep between requests and rotate the User-Agent header. Monitor the HTTP status code and pause on non-200 responses.
- Extraction schema is too broad, causing the AI to omit optional fields. Narrow the schema to only required fields and mark optional fields explicitly; instruct the AI to return null for missing optional fields rather than omitting them.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
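For the Markdown-fence failure listed above, a post-processor along these lines may be enough. It is a sketch only: strip_code_fences is a hypothetical helper name, and it assumes the fences, when present, occupy the first and last lines of the response.

```python
import json

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def strip_code_fences(text: str) -> str:
    """Drop a leading and trailing fence line if present; otherwise return the text stripped."""
    lines = text.strip().splitlines()
    if lines and lines[0].lstrip().startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip().startswith(FENCE):
        lines = lines[:-1]
    return "\n".join(lines)


# Usage: clean the raw model response before parsing it.
fenced = FENCE + 'json\n{"title": "Widget"}\n' + FENCE
record = json.loads(strip_code_fences(fenced))
```

Responses that are already raw JSON pass through unchanged apart from surrounding whitespace, so the helper can be applied to every response before parsing.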