RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

HOW-TO · DEV

How to set up an AI-assisted web scraping pipeline that extracts structured data from HTML

intermediate·25 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Python 3.10+, requests or httpx, BeautifulSoup4 or Playwright, AI API access (OpenAI or Anthropic)

What this does

Web scraping typically produces raw HTML that requires brittle CSS or XPath selectors to extract meaningful data. This guide explains how to build a pipeline where raw HTML is fed to an AI assistant which returns structured JSON matching a defined schema. This approach handles pages with complex layouts, inconsistent markup, and embedded content that would be difficult to parse with selectors alone. The pipeline is suitable for recurring scrape jobs such as aggregating product listings, news articles, or job postings.
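To make "structured JSON matching a defined schema" concrete, here is a minimal sketch using only the standard library. The `JOB_SCHEMA` fields and the `check_record` helper are hypothetical and exist only for illustration; the check covers required fields and basic types and is a simplified stand-in for the Pydantic or jsonschema validation the steps below call for.

```python
import json

# Hypothetical schema for a job posting; the field names are illustrative only.
JOB_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": ["string", "null"]},  # optional, may be null
        "salary": {"type": ["string", "null"]},    # optional, may be null
    },
    "required": ["title", "company"],
}

# Maps JSON Schema type names to Python types for this simplified check.
_PY_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "null": type(None),
    "object": dict,
    "array": list,
}


def check_record(record, schema):
    """Return a list of problems; an empty list means the record passed."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in record
    ]
    for name, spec in schema.get("properties", {}).items():
        if name not in record:
            continue  # absent optional fields are tolerated in this sketch
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(isinstance(record[name], _PY_TYPES[t]) for t in allowed):
            problems.append(f"wrong type for field: {name}")
    return problems


# What a well-behaved model response might look like for one page.
model_output = '{"title": "Data Engineer", "company": "Acme", "location": null}'
print(check_record(json.loads(model_output), JOB_SCHEMA))  # → []
```

A real validator handles nested objects, formats, and enums, which this sketch deliberately ignores.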

Steps

  1. Define an extraction schema using a Pydantic model or a JSON Schema document that describes the fields to extract, their types, and whether each field is required or optional.
  2. Write a fetch function using httpx that retrieves the target URL and returns the full HTML body. Add headers to mimic a standard browser user agent.
  3. Write a truncate function that limits the HTML to the first 8,000 characters, so the prompt stays within the model's context window and the relevant content is not buried in markup. If the page is long, first isolate a specific <main> or <article> section using BeautifulSoup, then truncate.
  4. Construct an AI prompt: include the extraction schema, the truncated HTML, and the instruction to return a JSON object matching the schema and nothing else.
  5. Call the AI API with the prompt and parse the response as JSON. Validate it against the extraction schema using Pydantic or jsonschema.
  6. Write the validated record to a JSON Lines file (output/records.jsonl), appending one record per line.
  7. Add error handling: if the AI returns malformed JSON or the schema validation fails, log the failure, write the raw AI response to a dead-letter file, and continue to the next URL.
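The steps above can be sketched as follows. This is one possible layout under stated assumptions, not a definitive implementation: `call_model` and `validate` are hypothetical callables you supply (the first wraps whichever AI API you use and returns the response text, the second raises `ValueError` when a record does not match the schema), the dead-letter file name and the `User-Agent` string are invented for illustration, and the `url` field is added to each record on the assumption that the verification below expects it. httpx and BeautifulSoup are imported lazily so the rest of the sketch runs without them.

```python
import json
import logging
from pathlib import Path

MAX_HTML_CHARS = 8_000                                 # limit from step 3
OUTPUT_PATH = Path("output/records.jsonl")             # path from step 6
DEAD_LETTER_PATH = Path("output/dead_letter.jsonl")    # hypothetical name

# Hypothetical browser-like header for step 2.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}


def fetch_html(url: str) -> str:
    """Step 2: fetch the page body with httpx (third-party)."""
    import httpx

    response = httpx.get(url, headers=HEADERS, timeout=30.0, follow_redirects=True)
    response.raise_for_status()
    return response.text


def truncate_html(html: str, limit: int = MAX_HTML_CHARS) -> str:
    """Step 3: prefer <main> or <article> if bs4 is installed, then cut to `limit`."""
    try:
        from bs4 import BeautifulSoup

        soup = BeautifulSoup(html, "html.parser")
        section = soup.find("main") or soup.find("article")
        if section is not None:
            html = str(section)
    except ImportError:
        pass  # fall back to plain truncation of the full document
    return html[:limit]


def build_prompt(schema: dict, html: str) -> str:
    """Step 4: schema, truncated HTML, and the JSON-only instruction."""
    return (
        "Extract data from the HTML below.\n"
        "Return a single JSON object matching this schema and nothing else.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"HTML:\n{html}"
    )


def append_jsonl(path: Path, record: dict) -> None:
    """Step 6: append one JSON object per line, creating the folder if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def process_page(url, html, schema, call_model, validate,
                 out_path=OUTPUT_PATH, dead_path=DEAD_LETTER_PATH) -> bool:
    """Steps 5 to 7 for one page. Returns True if a record was written."""
    raw = call_model(build_prompt(schema, truncate_html(html)))
    try:
        record = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
        if not isinstance(record, dict):
            raise ValueError("expected a JSON object")
        validate(record)          # expected to raise ValueError on mismatch
    except ValueError as exc:
        logging.warning("extraction failed for %s: %s", url, exc)
        append_jsonl(dead_path, {"url": url, "error": str(exc), "raw": raw})
        return False
    record["url"] = url           # assumption: keep the source URL on each record
    append_jsonl(out_path, record)
    return True
```

A driver would loop over the URL list, call `fetch_html(url)` inside its own `try` block so a network error on one page does not stop the run, and pass the result to `process_page`. Injecting `call_model` and `validate` keeps the sketch independent of any particular AI provider or validation library.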

Verification

# Verify the output file exists, contains valid JSON Lines, and has records
test -f output/records.jsonl && \
python3 -c "
import json
with open('output/records.jsonl') as f:
    # Skip blank lines so a trailing newline does not break parsing
    records = [json.loads(line) for line in f if line.strip()]
assert len(records) > 0, 'No records found'
# Check every record before reporting success
for rec in records:
    assert 'url' in rec, 'Record missing url field'
print(f'Verified: {len(records)} records in output/records.jsonl')
"
# Expected: Verified: <N> records in output/records.jsonl

Common failures

  • The AI returns Markdown fences instead of raw JSON. Add a post-processor that strips the first and last lines if they contain triple backticks and parses the remainder.
  • The target site renders its content with JavaScript or blocks scrapers with anti-bot checks, so a plain HTTP fetch returns incomplete HTML. Use Playwright or Selenium to render the page fully before passing HTML to the AI.
  • Rate limiting or IP blocking from the target site. Add a 2-second sleep between requests and rotate the User-Agent header. Monitor the HTTP status code and pause on non-200 responses.
  • The extraction schema is too broad, causing the AI to omit optional fields. Narrow the schema to the fields you actually need and mark optional fields explicitly; instruct the AI to return null for missing optional fields rather than omitting them.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
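Two of the fixes above are small enough to sketch with the standard library: stripping Markdown fences from the model response, and filling absent optional fields with null. The function names are hypothetical, and the fence handling follows the bullet literally (first and last lines only), so it does not cover a response where the fence and the JSON share one line.

```python
import json

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def strip_code_fences(text: str) -> str:
    """Drop the first and last lines if they contain a triple backtick."""
    lines = text.strip().splitlines()
    if lines and FENCE in lines[0]:
        lines = lines[1:]
    if lines and FENCE in lines[-1]:
        lines = lines[:-1]
    return "\n".join(lines)


def fill_missing_optional(record: dict, optional_fields) -> dict:
    """Return a copy where absent optional fields are set to None (JSON null)."""
    return {**{name: None for name in optional_fields}, **record}


# A fenced response of the kind described in the first bullet.
raw = FENCE + 'json\n{"title": "A"}\n' + FENCE
record = json.loads(strip_code_fences(raw))
print(fill_missing_optional(record, ["salary"]))  # → {'salary': None, 'title': 'A'}
```

Running the stripper on every response is harmless: text without fences passes through unchanged apart from leading and trailing whitespace.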

Related guides

  • Use AI to transform legacy API response schemas into modern typed structures
  • Handle API rate limiting and retry logic in AI-integrated API calls