12. Document Loaders
Document loaders read data from files and other sources into LangChain's Document format. Each Document contains page_content (the text) and metadata (a dict with fields such as source and page number). LangChain provides 50+ loader types covering PDF, CSV, Markdown, HTML, and proprietary formats.
Start with the simplest loader for plain text files.
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./policy.txt")
documents = loader.load()
print(type(documents[0])) # <class 'langchain_core.documents.base.Document'>
print(documents[0].page_content[:100])
print(documents[0].metadata) # {'source': './policy.txt'}
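To make the Document shape concrete, here is a minimal stdlib-only sketch of what a plain-text loader does conceptually. SketchDocument and load_text_file are hypothetical names used for illustration; they are not LangChain's classes, and the real TextLoader may differ in its details.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> list:
    """Read one file and wrap it in a single-element list of documents."""
    text = Path(path).read_text(encoding=encoding)
    return [SketchDocument(page_content=text, metadata={"source": path})]


# Demo with a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = str(Path(tmp) / "policy.txt")
    Path(sample_path).write_text("Refunds are accepted within 30 days.", encoding="utf-8")
    sample_docs = load_text_file(sample_path)

print(len(sample_docs))
print(sample_docs[0].page_content)
```

The point of the sketch is the return type: a list of objects that each carry text and a metadata dict, which is the contract every loader in this chapter follows.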
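PDF loading typically uses PyPDFLoader (which needs the pypdf package) or UnstructuredPDFLoader (which needs the unstructured package). The former is faster but extracts text in simple reading order; the latter handles complex layouts such as multi-column pages better.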
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("./report.pdf")
pages = loader.load()  # Returns one Document per page
print(f"Loaded {len(pages)} pages")
print(pages[0].page_content)
print(pages[0].metadata)  # Includes {'source': ..., 'page': 0}; page numbers are zero-based
CSV files load row by row. Each row becomes one Document whose page_content lists every column name followed by its value, one pair per line.
from langchain_community.document_loaders import CSVLoader
loader = CSVLoader("./sales_data.csv")
docs = loader.load()
print(docs[0].page_content) # "column1: value1\ncolumn2: value2"
print(docs[0].metadata)  # {'source': './sales_data.csv', 'row': 0}
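The row-to-text conversion can be sketched with the standard library alone. This is an approximation of the formatting described above, not CSVLoader's actual code; rows_to_texts and the sample data are hypothetical.

```python
import csv
import io


def rows_to_texts(csv_text: str) -> list:
    """Turn each CSV row into 'column: value' lines joined by newlines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]


# Hypothetical sample data standing in for sales_data.csv.
sample_csv = "region,units\nNorth,120\nSouth,95\n"
row_texts = rows_to_texts(sample_csv)

print(len(row_texts))
print(row_texts[0])
```

Because every row carries its own column names, each Document stays readable on its own once it is separated from the rest of the file.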
For directories, use DirectoryLoader with glob patterns.
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(
    "./docs",
    glob="**/*.md",  # Only markdown files, including subdirectories
    loader_cls=TextLoader,
)
docs = loader.load()
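Before loading a large directory, it can help to preview which files a glob pattern matches. The sketch below uses pathlib and assumes DirectoryLoader resolves patterns with pathlib-style semantics; preview_matches and the temporary files are hypothetical.

```python
from pathlib import Path
import tempfile


def preview_matches(root: str, pattern: str) -> list:
    """Return sorted relative paths of files under root that match pattern."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


# Build a small throwaway tree: two markdown files and one text file.
with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp)
    (base_dir / "guides").mkdir()
    (base_dir / "README.md").write_text("# Top level", encoding="utf-8")
    (base_dir / "guides" / "setup.md").write_text("# Setup", encoding="utf-8")
    (base_dir / "notes.txt").write_text("not markdown", encoding="utf-8")
    matched = preview_matches(tmp, "**/*.md")

print(matched)
```

In pathlib, `**` matches zero or more directory levels, so the pattern picks up markdown files at the top level as well as nested ones.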
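A frequent error is a mismatched encoding. Unless told otherwise, TextLoader decodes with the platform default encoding, so a file saved in a different encoding fails to load until you pass encoding explicitly (or set autodetect_encoding=True to let the loader try to detect it).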
loader = TextLoader("./legacy_doc.txt", encoding="latin-1")
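The underlying failure is easy to reproduce without LangChain. This sketch encodes text as Latin-1, shows that UTF-8 decoding rejects it, and tries a short list of candidate encodings in order; decode_with_fallback is a hypothetical helper, not part of any loader.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "latin-1")):
    """Try each encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Bytes as they would sit on disk in a legacy Latin-1 file.
raw = "café résumé".encode("latin-1")

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

text, used = decode_with_fallback(raw)
print(utf8_failed, used, text)
```

Latin-1 maps every byte to a character, so it never raises; keep it last in the list, since it will happily decode files that are really in some other encoding.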
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Download a PDF, load it with PyPDFLoader, and verify that metadata["page"] increments correctly across pages.
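One way to automate the check in this exercise is a small helper that inspects the page metadata. The sketch assumes zero-based page numbering (the default start value) and uses SimpleNamespace objects as stand-ins for loaded pages; page_numbers_are_sequential is a hypothetical name.

```python
from types import SimpleNamespace


def page_numbers_are_sequential(docs, start: int = 0) -> bool:
    """Check that metadata['page'] counts up by one from start, with no gaps."""
    pages = [doc.metadata["page"] for doc in docs]
    return pages == list(range(start, start + len(pages)))


# Stand-ins for the Documents a PDF loader would return.
fake_pages = [
    SimpleNamespace(page_content=f"page {i}", metadata={"source": "report.pdf", "page": i})
    for i in range(3)
]
# Same list with the middle page missing, to show a failing case.
with_gap = [fake_pages[0], fake_pages[2]]

print(page_numbers_are_sequential(fake_pages))
print(page_numbers_are_sequential(with_gap))
```

To use it on real data, pass the list returned by the loader in place of fake_pages, and adjust start if your loader version numbers pages differently.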