HOW-TO · RAG
How to Extract Metadata from Documents for Filtering
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
Documents with metadata (author, date, source)
What this does
Metadata turns flat document content into a queryable structure. By extracting and attaching structured metadata such as author, creation date, department, and file type, you let a retriever filter results before or after vector retrieval. This helps keep irrelevant documents from surfacing and keeps RAG answers grounded in the right context.
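The difference between filtering before and after retrieval can be sketched with a toy example. This is a minimal illustration, not a real vector store: the `Doc` class, the hard-coded `score` values, and the `top_k` and `is_recent` helpers are all hypothetical stand-ins for an embedding search.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document with a precomputed score."""
    text: str
    score: float  # pretend similarity to the query; higher is closer
    metadata: dict = field(default_factory=dict)


def top_k(docs, k):
    """Return the k highest-scoring documents."""
    return sorted(docs, key=lambda d: d.score, reverse=True)[:k]


def is_recent(doc, min_year=2024):
    """Metadata check: keep documents from min_year onward."""
    return doc.metadata.get("year", 0) >= min_year


corpus = [
    Doc("2022 budget", 0.95, {"year": 2022}),
    Doc("2023 budget", 0.90, {"year": 2023}),
    Doc("2024 budget", 0.80, {"year": 2024}),
    Doc("2025 forecast", 0.70, {"year": 2025}),
]

# Pre-filter: narrow by metadata first, then rank what is left.
pre = top_k([d for d in corpus if is_recent(d)], k=2)

# Post-filter: rank first, then drop results that fail the metadata check.
post = [d for d in top_k(corpus, k=2) if is_recent(d)]

print([d.text for d in pre])   # ['2024 budget', '2025 forecast']
print([d.text for d in post])  # []
```

In this sketch, post-filtering returns nothing because both top-ranked documents are too old, while pre-filtering still fills both slots. Which order fits best depends on how selective the filter is and on what the retriever supports.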
Steps
Inspect existing metadata on loaded documents.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/data/report.pdf")
pages = loader.load()
for doc in pages:
    print("Metadata:", doc.metadata)
Inject metadata from the filesystem.
from pathlib import Path

def enrich_from_path(doc, base_path="/data"):
    path = Path(doc.metadata.get("source", ""))
    doc.metadata["filename"] = path.name
    # Folder relative to base_path; raises ValueError if the file lives elsewhere
    doc.metadata["folder"] = path.parent.relative_to(base_path).as_posix()
    doc.metadata["file_size_bytes"] = path.stat().st_size
    return doc

enriched = [enrich_from_path(d) for d in pages]
print(enriched[0].metadata)
Add date-based metadata for temporal filtering.
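from datetime import datetime

def add_date_metadata(doc, date_str="2024-11-15"):
    # Parse the date so a malformed string fails here, not at filter time
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    doc.metadata["created_date"] = date_str
    doc.metadata["year"] = parsed.year  # stored as int for numeric comparison
    return doc

enriched = [add_date_metadata(d) for d in pages]
Filter documents at retrieval time using metadata.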
relevant = [d for d in enriched if d.metadata.get("year", 0) >= 2024]
print(f"Kept {len(relevant)} of {len(enriched)} documents")
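The single-condition filter in the last step can be generalized to several metadata conditions at once. The following is a rough sketch under stated assumptions: the `matches` helper and the operator names (`eq`, `gte`, `lte`, `in`) are conventions invented for this example rather than part of any library, and `SimpleNamespace` stands in for loader output.

```python
import operator
from types import SimpleNamespace

# Comparison names are this sketch's own convention, not a library API.
OPS = {
    "eq": operator.eq,
    "gte": operator.ge,
    "lte": operator.le,
    "in": lambda value, allowed: value in allowed,
}


def matches(metadata, conditions):
    """Return True when every (key, op, expected) condition holds.

    A missing key fails the condition instead of raising KeyError.
    """
    for key, op, expected in conditions:
        if key not in metadata:
            return False
        if not OPS[op](metadata[key], expected):
            return False
    return True


# SimpleNamespace stands in for loaded documents; only .metadata is used.
docs = [
    SimpleNamespace(metadata={"year": 2024, "folder": "finance"}),
    SimpleNamespace(metadata={"year": 2023, "folder": "finance"}),
    SimpleNamespace(metadata={"year": 2024, "folder": "hr"}),
    SimpleNamespace(metadata={"folder": "finance"}),  # no year recorded
]

conditions = [("year", "gte", 2024), ("folder", "in", {"finance", "legal"})]
kept = [d for d in docs if matches(d.metadata, conditions)]
print(f"Kept {len(kept)} of {len(docs)} documents")  # Kept 1 of 4 documents
```

Treating a missing key as a failed match is a design choice; a pipeline that prefers recall over precision might let documents without the key pass through instead.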
Verification
python -c "
from langchain_community.document_loaders import TextLoader
loader = TextLoader('/etc/hostname')
docs = loader.load()
print('Metadata keys:', list(docs[0].metadata.keys()))
"
# Expected: Metadata keys: ['source']
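Beyond checking which keys a loader emits, it can help to confirm that enriched documents carry the keys and types the filters rely on. The check below is a minimal sketch: `check_metadata` and the `EXPECTED` schema are hypothetical names chosen for this example, and `SimpleNamespace` stands in for real loader output.

```python
from types import SimpleNamespace

# Expected keys and types; adjust to whatever your enrichment steps add.
EXPECTED = {"source": str, "filename": str, "year": int}


def check_metadata(docs, expected=EXPECTED):
    """Return a list of readable problems; an empty list means all good."""
    problems = []
    for i, doc in enumerate(docs):
        for key, typ in expected.items():
            if key not in doc.metadata:
                problems.append(f"doc {i}: missing '{key}'")
            elif not isinstance(doc.metadata[key], typ):
                found = type(doc.metadata[key]).__name__
                problems.append(
                    f"doc {i}: '{key}' is {found}, expected {typ.__name__}"
                )
    return problems


sample = [
    SimpleNamespace(
        metadata={"source": "/data/report.pdf", "filename": "report.pdf", "year": 2024}
    ),
    SimpleNamespace(metadata={"source": "/data/old.pdf", "year": "2023"}),
]
for problem in check_metadata(sample):
    print(problem)
# doc 1: missing 'filename'
# doc 1: 'year' is str, expected int
```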
Common failures
- Metadata lost after transformation. Not assigning back to doc in functions. Ensure functions modify in place or return the object.
- Year filter returns zero documents. Year stored as string, not integer. Convert with int(doc.metadata["year"]).
- KeyError on source key. Some loaders omit source metadata. Use .get("source", "/unknown") with a fallback.
- Duplicate metadata keys. Clear metadata dict first with doc.metadata = {}.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
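Several of the failures above can be repaired in a single pass before filtering. The helper below is a sketch under assumptions, not a prescribed fix: `normalize_metadata` and its `extra` parameter are hypothetical, and it merges new keys without overwriting existing ones as a gentler alternative to clearing the dict.

```python
from types import SimpleNamespace


def normalize_metadata(doc, extra=None):
    """Repair common metadata problems in place and return the document."""
    meta = doc.metadata
    # Fall back when a loader did not record where the document came from.
    meta.setdefault("source", "/unknown")
    # Coerce a year stored as a string so numeric comparisons work.
    if "year" in meta:
        try:
            meta["year"] = int(meta["year"])
        except (TypeError, ValueError):
            del meta["year"]  # unparseable; treat as missing
    # Add new keys without overwriting values that are already set.
    for key, value in (extra or {}).items():
        meta.setdefault(key, value)
    return doc


# SimpleNamespace stands in for a loaded document; only .metadata is used.
doc = SimpleNamespace(metadata={"year": "2024", "department": "finance"})
normalize_metadata(doc, extra={"department": "unknown", "file_type": "pdf"})
print(doc.metadata)
```

Because the function returns the same object it modified, it works both in a list comprehension and as a plain call.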