RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /RAG Systems: Part 1
  6. /Ch. 4
RAG Systems: Part 1

04. HTML Ingestion

Chapter 4 of 22 · 25 min
KEY INSIGHT

HTML's semantic structure lets you extract content by heading boundaries, preserving contextual relationships that PDFs lack. ```python from bs4 import BeautifulSoup # Minimal working example html = "<html><body><h1>Title</h1><p>Content here</p></body></html>" soup = BeautifulSoup(html, "lxml") print(soup.find("p").get_text()) ```

Web pages and HTML documents require different parsing than PDFs. HTML has structural elements (headings, paragraphs, lists) that PDFs lack. Exploiting this structure produces cleaner chunks.

Installation and dependencies

pip install beautifulsoup4 lxml

BeautifulSoup4 parses HTML. lxml is a faster parser backend.

Basic HTML parsing

from bs4 import BeautifulSoup

def extract_html_content(html: str) -> dict:
    """Extract clean text content from HTML."""
    soup = BeautifulSoup(html, "lxml")

    # Remove script, style, and nav elements
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    # Extract title
    title = soup.title.string if soup.title else ""

    # Extract headings
    headings = []
    for h in soup.find_all(["h1", "h2", "h3"]):
        headings.append(h.get_text(strip=True))

    # Extract main content (paragraphs and list items)
    paragraphs = []
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        if text:
            paragraphs.append(text)

    return {
        "title": title,
        "headings": headings,
        "content": "\n".join(paragraphs),
        "raw_text": soup.get_text(separator="\n", strip=True)
    }

Extracting by semantic structure

HTML semantic elements (<article>, <section>, <main>) let you extract context-aware content. A chapter in a <section> has more meaning than the same text in a <div>.

def extract_semantic_html(html: str) -> list[dict]:
    """Extract content preserving semantic structure."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")

    # Remove noise
    for tag in soup(["script", "style", "nav", "iframe"]):
        tag.decompose()

    sections = []

    # Find main content area
    main = soup.find("main") or soup.find("article") or soup.body

    # Process sections and headings
    current_section = {"heading": "", "content": []}

    for element in main.find_all(["h1", "h2", "h3", "h4", "p", "ul", "ol"]):
        if element.name in ["h1", "h2", "h3", "h4"]:
            # Save previous section
            if current_section["content"]:
                sections.append(current_section)

            current_section = {
                "heading": element.get_text(strip=True),
                "content": []
            }
        else:
            text = element.get_text(strip=True)
            if text:
                current_section["content"].append(text)

    # Don't forget last section
    if current_section["content"]:
        sections.append(current_section)

    return sections

Handling relative URLs

RAG systems need absolute URLs for source tracking. Convert relative links before storing.

from urllib.parse import urljoin

def make_absolute(url: str, base_url: str) -> str:
    """Convert relative URL to absolute URL."""
    return urljoin(base_url, url)

Fetching HTML from URLs

import requests
from bs4 import BeautifulSoup

def fetch_html(url: str, timeout: int = 30) -> str:
    """Fetch HTML content from a URL."""
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; RAG-Bot/1.0)"
    }

    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()

    return response.text

Batch crawling with rate limiting

Crawling too fast gets you banned. Add delays between requests.

import time
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

def crawl_urls(urls: list[str], delay: float = 1.0, max_workers: int = 3) -> list[dict]:
    """Crawl multiple URLs with rate limiting."""
    results = []

    def fetch_with_delay(url):
        time.sleep(delay)
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return extract_html_content(response.text)
        except Exception as e:
            print(f"Error fetching {url}: {e}")
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch_with_delay, url) for url in urls]

        for future in futures:
            result = future.result()
            if result:
                results.append(result)

    return results

Extracting metadata

HTML meta tags contain valuable metadata for filtering and context.

def extract_metadata(html: str) -> dict:
    """Extract Open Graph and meta tags."""
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "lxml")
    meta = {}

    # Standard meta tags
    for tag in soup.find_all("meta"):
        name = tag.get("name") or tag.get("property")
        content = tag.get("content")
        if name and content:
            meta[name] = content

    # Favicon and other
    if soup.find("link", rel="icon"):
        meta["favicon"] = soup.find("link", rel="icon").get("href")

    return meta

Common failure modes

  1. JavaScript-rendered content: BeautifulSoup cannot execute JavaScript. For SPAs, use Selenium or Playwright.

  2. Cookie banners: These often appear in <div> elements and pollute text extraction. Remove them by targeting known banner selectors.

  3. Infinite scroll pages: Content loads only when user scrolls. Use browser automation for these.

  4. Encoded content: Some sites serve compressed or encoded HTML. requests handles gzip automatically, but base64-encoded text needs decoding.

EXERCISE

Find a documentation page (e.g., from readthedocs.io) and extract its content using the semantic parser. Count how many sections it found. Print each section's heading and the first 100 characters of its content. Verify the extraction makes sense by comparing to the original page.

← Chapter 3
PDF Ingestion with PyMuPDF
Chapter 5 →
Markdown Ingestion