RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Extract Metadata from Documents for Filtering

intermediate·15 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

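Documents with metadata (author, date, source), plus Python 3 with the langchain-community and pypdf packages installed (PyPDFLoader depends on pypdf)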

What this does

Metadata turns flat document content into a queryable structure. By extracting structured fields such as author, creation date, department, and file type and attaching them to each document, you let a retriever filter results either before or after vector retrieval. Filtering keeps irrelevant documents out of the result set and keeps RAG answers grounded in the right context.
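To make the before/after distinction concrete, here is a minimal sketch in plain Python. The `Doc` class, the word-overlap `score` function, and the `search` helper are hypothetical stand-ins, not LangChain or vector-store APIs; a real retriever would rank by embedding similarity rather than word overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def score(doc, query):
    """Toy relevance score: how many query words appear in the document text."""
    words = set(doc.text.lower().split())
    return sum(1 for word in query.lower().split() if word in words)


def search(docs, query, k, where, pre_filter=True):
    """Return up to k docs, applying the metadata predicate before or after ranking."""
    if pre_filter:
        # Pre-filter: drop non-matching docs first, then rank what is left.
        candidates = [d for d in docs if where(d.metadata)]
        return sorted(candidates, key=lambda d: score(d, query), reverse=True)[:k]
    # Post-filter: rank everything, take the top k, then drop non-matching docs.
    top_k = sorted(docs, key=lambda d: score(d, query), reverse=True)[:k]
    return [d for d in top_k if where(d.metadata)]


docs = [
    Doc("quarterly revenue report", {"department": "finance", "year": 2023}),
    Doc("revenue report update", {"department": "finance", "year": 2024}),
    Doc("sales notes", {"department": "sales", "year": 2024}),
]


def is_recent(meta):
    return meta.get("year", 0) >= 2024


query = "quarterly revenue report"
pre = search(docs, query, k=1, where=is_recent, pre_filter=True)
post = search(docs, query, k=1, where=is_recent, pre_filter=False)
print([d.text for d in pre])   # ['revenue report update']
print([d.text for d in post])  # []
```

With k=1, post-filtering returns nothing because the single best match is from 2023 and is removed after ranking, while pre-filtering still returns the best 2024 document. This is the practical difference between the two modes: post-filtering can hand back fewer than k results.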

Steps

  1. Inspect existing metadata on loaded documents.

    from langchain_community.document_loaders import PyPDFLoader
    
    loader = PyPDFLoader("/data/report.pdf")
    pages = loader.load()
    for doc in pages:
        print("Metadata:", doc.metadata)
    
  2. Inject metadata from the filesystem.

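    from pathlib import Path
    
    def enrich_from_path(doc, base_path="/data"):
        # Fall back to a placeholder when the loader did not set "source".
        path = Path(doc.metadata.get("source", "/unknown"))
        doc.metadata["filename"] = path.name
        try:
            # Folder relative to base_path, e.g. "reports/2024".
            doc.metadata["folder"] = path.parent.relative_to(base_path).as_posix()
        except ValueError:
            # The file lives outside base_path; keep the full parent folder.
            doc.metadata["folder"] = path.parent.as_posix()
        # Only stat files that exist, to avoid FileNotFoundError.
        if path.is_file():
            doc.metadata["file_size_bytes"] = path.stat().st_size
        return doc
    
    enriched = [enrich_from_path(d) for d in pages]
    print(enriched[0].metadata)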
    
  3. Add date-based metadata for temporal filtering.

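    from datetime import datetime
    
    def add_date_metadata(doc, date_str="2024-11-15"):
        # Parse first so a malformed date raises ValueError instead of being stored.
        parsed = datetime.strptime(date_str, "%Y-%m-%d")
        doc.metadata["created_date"] = parsed.date().isoformat()
        doc.metadata["year"] = parsed.year  # stored as int for numeric filters
        return doc
    
    # Continue from the step 2 output so the path-based fields are kept.
    enriched = [add_date_metadata(d) for d in enriched]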
    
  4. Filter documents at retrieval time using metadata.

    relevant = [d for d in enriched if d.metadata.get("year", 0) >= 2024]
    print(f"Kept {len(relevant)} of {len(enriched)} documents")
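Step 3 uses a fixed date string for every document. As a hedged alternative, the sketch below derives the date from each file's modification time and then applies the same year filter as step 4. It works on plain metadata dicts and temporary files so it runs on its own; `date_metadata_from_file` and `filter_by_year` are hypothetical helper names, and modification time is only an approximation of the creation date.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def date_metadata_from_file(path):
    """Return date fields derived from the file's modification time (UTC)."""
    modified = datetime.fromtimestamp(Path(path).stat().st_mtime, tz=timezone.utc)
    return {"modified_date": modified.date().isoformat(), "year": modified.year}


def filter_by_year(metadatas, min_year):
    """Keep metadata dicts whose year, coerced to int, is at least min_year."""
    return [m for m in metadatas if int(m.get("year", 0)) >= min_year]


# Demo with two throwaway files whose timestamps are set explicitly.
with tempfile.TemporaryDirectory() as tmp:
    records = []
    for name, year in [("old.txt", 2022), ("new.txt", 2024)]:
        file_path = Path(tmp) / name
        file_path.write_text("placeholder")
        stamp = datetime(year, 11, 15, 12, tzinfo=timezone.utc).timestamp()
        os.utime(file_path, (stamp, stamp))  # set access and modification time
        records.append({"source": str(file_path), **date_metadata_from_file(file_path)})

recent = filter_by_year(records, 2024)
print([Path(m["source"]).name for m in recent])  # ['new.txt']
```

To apply this to loaded documents, one option is `doc.metadata.update(date_metadata_from_file(doc.metadata["source"]))` for each document whose source is a local file.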
    

Verification

python -c "
from langchain_community.document_loaders import TextLoader
loader = TextLoader('/etc/hostname')
docs = loader.load()
print('Metadata keys:', list(docs[0].metadata.keys()))
"
# Expected: Metadata keys: ['source']
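The command above only confirms that the loader sets a source key. A slightly broader check, sketched here under the assumption that every document should carry a string source and an integer year, walks a list of metadata dicts and reports anything missing or wrongly typed. `find_metadata_problems` is a hypothetical helper, not part of LangChain.

```python
def find_metadata_problems(metadatas, required):
    """Return human-readable problems; an empty list means every record passed.

    `required` maps a metadata key to the Python type its value must have.
    """
    problems = []
    for index, meta in enumerate(metadatas):
        for key, expected_type in required.items():
            if key not in meta:
                problems.append(f"doc {index}: missing '{key}'")
            elif not isinstance(meta[key], expected_type):
                actual = type(meta[key]).__name__
                problems.append(
                    f"doc {index}: '{key}' is {actual}, expected {expected_type.__name__}"
                )
    return problems


sample = [
    {"source": "/data/report.pdf", "year": 2024},
    {"source": "/data/notes.txt", "year": "2023"},  # year stored as a string
    {"year": 2022},                                  # loader omitted source
]
problems = find_metadata_problems(sample, {"source": str, "year": int})
for line in problems:
    print(line)
# doc 1: 'year' is str, expected int
# doc 2: missing 'source'
```

For the documents built in the steps above, the input would be `[d.metadata for d in enriched]`.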

Common failures

  • Metadata lost after transformation. The function builds a new dict or object but never assigns it back to doc.metadata or returns it. Ensure functions modify the document in place or return the updated object.
  • Year filter raises TypeError or returns zero documents. Year stored as string, not integer. In plain Python, comparing "2024" >= 2024 raises TypeError; a metadata filter in a vector store may instead silently match nothing. Convert with int(doc.metadata["year"]).
  • KeyError on source key. Some loaders omit source metadata. Use .get("source", "/unknown") with a fallback.
  • Conflicting metadata keys. An enrichment step overwrites a key the loader already set. Use distinct key names, or reset with doc.metadata = {} and then re-add the fields you still need, such as source.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
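Several of the failures above come down to how new fields are combined with what the loader already set. The following is a minimal sketch of a defensive merge, assuming plain dicts; `merge_metadata` is a hypothetical helper, and the "/unknown" fallback mirrors the one suggested in the list.

```python
def merge_metadata(existing, extra, overwrite=False):
    """Return a new dict combining both inputs without mutating either one."""
    merged = dict(existing)
    for key, value in extra.items():
        if key in merged and not overwrite:
            continue  # keep the loader-supplied value on conflict
        merged[key] = value
    # Guarantee a source key so later lookups cannot raise KeyError.
    merged.setdefault("source", "/unknown")
    # Normalize year to int so numeric comparisons behave as expected.
    if "year" in merged:
        merged["year"] = int(merged["year"])
    return merged


loader_meta = {"source": "/data/report.pdf", "page": 0}
extra_meta = {"source": "manual-override", "year": "2024"}

safe = merge_metadata(loader_meta, extra_meta)
forced = merge_metadata(loader_meta, extra_meta, overwrite=True)
print(safe)    # {'source': '/data/report.pdf', 'page': 0, 'year': 2024}
print(forced)  # {'source': 'manual-override', 'page': 0, 'year': 2024}
```

Returning a new dict means the caller has to assign it back, for example `doc.metadata = merge_metadata(doc.metadata, extra_meta)`, which makes the first failure in the list harder to miss.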

Related guides

  • How to Load Documents with LangChain Document Loaders (load-documents-langchain-loaders)
  • How to Optimize Chunk Size and Overlap Strategy (optimize-chunk-size-overlap)