What this does

Metadata turns flat document content into a queryable structure. By extracting and attaching structured metadata such as author, creation date, department, and file type, you let a search engine filter results before or after vector retrieval. This reduces the chance of irrelevant documents surfacing and helps keep RAG answers grounded in the right context.

Steps

Inspect existing metadata on loaded documents.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/data/report.pdf")
pages = loader.load()
for doc in pages:
    print("Metadata:", doc.metadata)
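
When many documents are loaded, printing every metadata dict gets noisy. The sketch below counts how many documents carry each key, which makes gaps easy to spot before you rely on a field for filtering. It assumes a stand-in `Doc` class and made-up sample values so it runs without any loader; `metadata_key_coverage` is a hypothetical helper name, not part of LangChain.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a loaded document; only the metadata dict matters here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def metadata_key_coverage(docs):
    """Count how many documents carry each metadata key."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.metadata.keys())
    return dict(counts)


# Illustrative sample data, not real loader output.
sample = [
    Doc("page one", {"source": "/data/report.pdf", "page": 0}),
    Doc("page two", {"source": "/data/report.pdf", "page": 1}),
    Doc("notes", {"source": "/data/notes.txt"}),
]
coverage = metadata_key_coverage(sample)
print(coverage)
```

A key that appears on fewer documents than expected is a candidate for the enrichment steps that follow.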

Inject metadata from the filesystem.

from pathlib import Path

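def enrich_from_path(doc, base_path="/data"):
    """Add filename, folder, and file size derived from the document's source path."""
    source = doc.metadata.get("source")
    if not source:
        return doc  # no source path recorded, so nothing to derive
    path = Path(source)
    doc.metadata["filename"] = path.name
    try:
        # "." means the file sits directly in base_path.
        doc.metadata["folder"] = path.parent.relative_to(base_path).as_posix()
    except ValueError:
        # Source lies outside base_path; keep the full parent path instead.
        doc.metadata["folder"] = path.parent.as_posix()
    if path.is_file():
        doc.metadata["file_size_bytes"] = path.stat().st_size
    return doc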

enriched = [enrich_from_path(d) for d in pages]
print(enriched[0].metadata)
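
The folder value can double as a source for the department field mentioned earlier, if your directory layout encodes it. The following is a minimal sketch under the assumption that the first component of the relative folder names the department (for example `finance/2024`); both that convention and the helper name `department_from_folder` are hypothetical.

```python
from pathlib import PurePosixPath


def department_from_folder(folder, default="unknown"):
    """Return the first component of a relative folder path, lowercased."""
    parts = PurePosixPath(folder).parts
    # "." or "" (file directly under the base path) has no parts, so fall back.
    return parts[0].lower() if parts else default


print(department_from_folder("Finance/2024"))
print(department_from_folder("."))
```

It could be applied as `doc.metadata["department"] = department_from_folder(doc.metadata["folder"])` once the folder field has been set.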

Add date-based metadata for temporal filtering.

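from datetime import datetime

def add_date_metadata(doc, date_str="2024-11-15"):
    """Attach an ISO date and an integer year; raises ValueError on a malformed date."""
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    doc.metadata["created_date"] = parsed.date().isoformat()
    doc.metadata["year"] = parsed.year  # stored as int so numeric comparisons work
    return doc

# Continue from the path-enriched documents rather than the raw pages.
enriched = [add_date_metadata(d) for d in enriched]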
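
A hard-coded date is fine for a demo, but real collections usually need a per-file value. One option, sketched here as an assumption rather than the guide's method, is to read the file's last-modified time. Note that this is the modification time, not a true creation date, so treat the result as an approximation. `date_metadata_from_mtime` is a hypothetical helper, and the demo sets a known timestamp on a temporary file so the result is predictable.

```python
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def date_metadata_from_mtime(path):
    """Build created_date/year fields from a file's last-modified time (UTC)."""
    modified = datetime.fromtimestamp(Path(path).stat().st_mtime, tz=timezone.utc)
    return {"created_date": modified.date().isoformat(), "year": modified.year}


# Demo on a temporary file whose modification time is set to a known value.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp_path = tmp.name
known = datetime(2024, 11, 15, 12, 0, tzinfo=timezone.utc).timestamp()
os.utime(tmp_path, (known, known))
fields = date_metadata_from_mtime(tmp_path)
os.remove(tmp_path)
print(fields)
```

The returned dict can be merged into a document with `doc.metadata.update(fields)`.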

Filter documents using metadata. The example filters the loaded list in memory; the same condition can be applied to search results after retrieval.

relevant = [d for d in enriched if d.metadata.get("year", 0) >= 2024]
print(f"Kept {len(relevant)} of {len(enriched)} documents")
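
Once more than one field is involved, inline conditions get hard to read. The sketch below wraps the checks in a small predicate that combines a year floor with exact-match conditions. It uses a stand-in `Doc` class and invented sample values so it runs without a loader; `matches` is a hypothetical helper, and a vector store's own query-time filter is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def matches(metadata, min_year=None, **equals):
    """True if metadata meets the year floor and every exact-match condition."""
    if min_year is not None:
        year = metadata.get("year")
        # A missing or non-integer year fails the check instead of raising.
        if not isinstance(year, int) or year < min_year:
            return False
    return all(metadata.get(key) == value for key, value in equals.items())


# Illustrative sample data.
docs = [
    Doc("Q3 results", {"year": 2024, "folder": "finance"}),
    Doc("Old policy", {"year": 2021, "folder": "hr"}),
    Doc("Hiring plan", {"year": 2024, "folder": "hr"}),
]
kept = [d for d in docs if matches(d.metadata, min_year=2024, folder="hr")]
print([d.page_content for d in kept])
```

Applying the predicate before similarity scoring narrows the candidate set; applying it to the returned results keeps the search untouched but may leave fewer hits than requested.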

Verification

python -c "
from langchain_community.document_loaders import TextLoader
loader = TextLoader('/etc/hostname')
docs = loader.load()
print('Metadata keys:', list(docs[0].metadata.keys()))
"
# Expected: Metadata keys: ['source']

Common failures

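Metadata lost after transformation. The enrichment function built a new document or dict without returning it, or the caller discarded the return value. Modify doc.metadata in place, or return the document and use the returned object.
Year filter returns zero documents or raises TypeError. Year is stored as a string, so equality checks against an integer never match and >= comparisons fail. Store it as an integer, converting with int(doc.metadata["year"]).
KeyError on source key. Some loaders omit source metadata. Use doc.metadata.get("source", "/unknown") with a fallback.
Duplicate or conflicting metadata keys. An enrichment step writes a key that an earlier step already set, overwriting its value. Use a distinct key name, or clear the dict first with doc.metadata = {} only when the earlier fields, including source, are no longer needed.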
Version mismatch. The installed package or runtime differs from the one the commands assume. Check the installed version first, then rerun the smallest verification command.
Local environment drift. A different service, virtual environment, model, or path is in use than intended. Print the active binary path and configuration before changing the guide steps.
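
Several of the failures above can be guarded against with one normalization pass before filtering. The following is a minimal sketch, not a complete validator: it guarantees a source value and converts a string year to an integer where possible. `normalize_metadata` is a hypothetical helper, and dropping an unparseable year is an assumed policy you may want to change.

```python
def normalize_metadata(metadata):
    """Return a copy with a guaranteed source and an integer year when possible."""
    cleaned = dict(metadata)  # copy so the caller's dict is left untouched
    cleaned.setdefault("source", "/unknown")
    year = cleaned.get("year")
    if isinstance(year, str):
        try:
            cleaned["year"] = int(year.strip())
        except ValueError:
            # Unparseable year: drop it rather than keep a value that breaks comparisons.
            del cleaned["year"]
    return cleaned


print(normalize_metadata({"year": "2024"}))
print(normalize_metadata({"source": "/data/a.pdf", "year": "n/a"}))
```

Because it returns a copy, assign the result back with `doc.metadata = normalize_metadata(doc.metadata)`.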

How to Extract Metadata from Documents for Filtering
