HOW-TO · RAG
How to Load Documents with LangChain Document Loaders
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES
LangChain installed, source documents
What this does
LangChain document loaders provide a unified interface for ingesting data from dozens of sources into Document objects that the rest of a RAG pipeline consumes. Each loader's load() method returns a list of Document instances, each carrying the extracted text in page_content and a metadata dictionary (for example, the source path or page number).
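A quick way to sanity-check what a loader returned is to summarize the list before passing it on. The sketch below is a hypothetical helper, not part of LangChain: it relies only on the page_content and metadata attributes described above, and the sample objects are stand-ins so the snippet runs without LangChain installed. In a real pipeline you would pass it the result of loader.load().

```python
from collections import Counter
from types import SimpleNamespace


def summarize_documents(docs):
    """Return basic stats for objects exposing page_content and metadata."""
    key_counts = Counter()
    empty_indexes = []
    total_chars = 0
    for index, doc in enumerate(docs):
        text = doc.page_content or ""
        total_chars += len(text)
        # Whitespace-only content usually means nothing was extracted.
        if not text.strip():
            empty_indexes.append(index)
        key_counts.update(doc.metadata.keys())
    return {
        "count": len(docs),
        "total_chars": total_chars,
        "empty_indexes": empty_indexes,
        "metadata_keys": dict(key_counts),
    }


# Stand-in objects (NOT real LangChain Documents), used only for illustration.
sample_docs = [
    SimpleNamespace(page_content="First page text.", metadata={"source": "report.pdf", "page": 0}),
    SimpleNamespace(page_content="   ", metadata={"source": "report.pdf"}),
]
summary = summarize_documents(sample_docs)
print(summary)
```

A non-empty empty_indexes list, or a metadata key that appears on fewer documents than count, is worth investigating before chunking and embedding.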
Steps
Install extras for your specific loader.
pip install langchain-community pypdf beautifulsoup4
Import the loader and load documents.
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document

# PyPDFLoader returns one Document per PDF page.
loader = PyPDFLoader("/data/report.pdf")
pages: list[Document] = loader.load()
for page in pages:
    print(f"Page {page.metadata.get('page')}: {len(page.page_content)} chars")
Load from a text file or directory.
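from langchain_community.document_loaders import DirectoryLoader, TextLoader

txt_loader = TextLoader("/data/notes.txt")
txt_docs = txt_loader.load()
# Recursively load every .txt file under /data/docs with TextLoader.
dir_loader = DirectoryLoader("/data/docs", glob="**/*.txt", loader_cls=TextLoader)
dir_docs = dir_loader.load()
Load from a URL with WebBaseLoader. BSHTMLLoader only reads HTML files already on disk and does not fetch URLs.
from langchain_community.document_loaders import WebBaseLoader

# WebBaseLoader fetches the page over HTTP and parses it with BeautifulSoup.
html_loader = WebBaseLoader("https://example.com/page.html")
html_docs = html_loader.load()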
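When a pipeline ingests mixed file types, the steps above can be folded into a small dispatcher that picks a loader from the file extension. This is a minimal sketch under stated assumptions: LOADER_BY_SUFFIX, loader_name_for, and load_path are hypothetical names, the mapping is an illustrative choice and not a LangChain feature, and only local files are handled. The LangChain import happens inside load_path, so the name lookup can be exercised even where the package is not installed.

```python
import importlib
from pathlib import Path

# Suffix -> loader class name in langchain_community.document_loaders.
# Illustrative mapping (an assumption of this sketch); extend it as needed.
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".txt": "TextLoader",
    ".html": "BSHTMLLoader",
    ".htm": "BSHTMLLoader",
}


def loader_name_for(path):
    """Return the loader class name for a path, based on its file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return LOADER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No loader configured for {suffix!r} files: {path}") from None


def load_path(path):
    """Load one local file with the loader chosen by loader_name_for()."""
    # Imported lazily so a missing package surfaces only when loading is attempted.
    loaders = importlib.import_module("langchain_community.document_loaders")
    loader_cls = getattr(loaders, loader_name_for(path))
    return loader_cls(str(path)).load()


print(loader_name_for("/data/report.pdf"))
print(loader_name_for("/data/notes.TXT"))
```

Raising ValueError for unknown extensions, instead of silently skipping the file, makes gaps in the mapping visible at ingest time.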
Verification
python -c "
from langchain_community.document_loaders import TextLoader
loader = TextLoader('/etc/hostname')
docs = loader.load()
print(f'Loaded {len(docs)} document(s), {len(docs[0].page_content)} chars')
"
# Expected: Loaded 1 document(s), <N> chars
Common failures
- ImportError for PyPDFLoader. Missing langchain-community or pypdf. Install with pip install langchain-community pypdf.
- FileNotFoundError. Verify the path exists with ls -la /data/report.pdf.
- Empty page_content on all pages. Encrypted PDF or image-only scan. Run pdftotext to confirm text extraction works first.
- KeyError on metadata. A metadata field is missing for some pages. Use .get("page", 0) with a fallback default.
- Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
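Several of the failures above can be caught before any loader runs. The following is a rough pre-flight sketch using only the standard library. The preflight function is a hypothetical helper, and the package and path lists simply reuse the names from this guide. Note that it checks import names (langchain_community, bs4), which differ from the pip distribution names.

```python
import importlib.util
import sys
from pathlib import Path


def preflight(modules, paths):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    for path in paths:
        if not Path(path).exists():
            problems.append(f"missing path: {path}")
    return problems


# Printing the interpreter path helps spot the wrong virtual environment.
print("interpreter:", sys.executable)
issues = preflight(["langchain_community", "pypdf", "bs4"], ["/data/report.pdf"])
for issue in issues:
    print(issue)
```

If the interpreter path is not the one inside your intended virtual environment, fix that first; the other checks are only meaningful for the environment that will actually run the loaders.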
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
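One possible way to capture that record is a short script that prints the runtime and package versions as JSON, which can be pasted next to the verification output. This is a sketch using only the standard library. The environment_snapshot function is a hypothetical name, and the distribution list is an assumption based on the packages installed earlier in this guide.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot(distributions):
    """Collect interpreter, platform, and installed package versions."""
    versions = {}
    for dist in distributions:
        try:
            versions[dist] = metadata.version(dist)
        except metadata.PackageNotFoundError:
            # None marks a distribution that is not installed in this environment.
            versions[dist] = None
    return {
        "python": platform.python_version(),
        "executable": sys.executable,
        "platform": platform.platform(),
        "packages": versions,
    }


snapshot = environment_snapshot(
    ["langchain-community", "langchain-core", "pypdf", "beautifulsoup4"]
)
print(json.dumps(snapshot, indent=2))
```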