12. Document Loaders

Chapter 12 of 18 · 20 min

Document loaders read files from disk into LangChain's Document format. Each Document has two fields: page_content (the text) and metadata (a dict holding details such as the source path and page number). LangChain supports 50+ loader types, covering PDF, CSV, Markdown, HTML, and proprietary formats.

Start with the simplest loader for plain text files.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("./policy.txt")
documents = loader.load()
print(type(documents[0]))  # <class 'langchain_core.documents.base.Document'>
print(documents[0].page_content[:100])
print(documents[0].metadata)  # {'source': './policy.txt'}
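
To make the Document shape concrete, here is a minimal standard-library sketch of what a text loader does conceptually. SimpleDocument and load_text are hypothetical stand-ins written for this illustration; they are not LangChain classes and only mirror the two fields described above.

```python
from dataclasses import dataclass, field
from pathlib import Path
import tempfile


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two fields as a Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and wrap it in a single document that records its source."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


# Demo on a temporary file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = str(Path(tmp) / "policy.txt")
    Path(demo_path).write_text("Refunds are accepted within 30 days.", encoding="utf-8")
    demo_docs = load_text(demo_path)
    print(demo_docs[0].page_content)
    print(demo_docs[0].metadata)
```

The point of the sketch is the return shape: a list of documents, each carrying its text plus a metadata dict that later stages can use to trace content back to its origin.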

PDF loading typically uses PyPDFLoader (which requires the pypdf package) or UnstructuredPDFLoader (which requires the unstructured package). The former is faster but extracts text sequentially, page by page; the latter handles complex layouts better.

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./report.pdf")
pages = loader.load()  # Returns one Document per page
# Note: load_and_split() would further chunk the text with a text splitter
print(f"Loaded {len(pages)} pages")
print(pages[0].page_content)
print(pages[0].metadata)  # Includes {'source': ..., 'page': 0}; 'page' is zero-based
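
PyPDFLoader stores a zero-based page index in metadata["page"], which is easy to misread when showing sources to people. The following is a small sketch of turning that metadata into a human-readable citation; format_citation is a hypothetical helper for illustration, and it works on any dict with the same keys.

```python
def format_citation(metadata: dict) -> str:
    """Build a readable citation from loader metadata.

    Assumes 'page' is a zero-based index; if it is missing,
    only the source is returned.
    """
    source = metadata.get("source", "unknown source")
    page = metadata.get("page")
    if page is None:
        return source
    return f"{source}, p. {page + 1}"


citation = format_citation({"source": "./report.pdf", "page": 0})
print(citation)
```

With a loaded PDF you could call format_citation(pages[0].metadata) in the same way.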

CSV files load row by row. Each row becomes one Document whose page_content lists every column as a "column: value" line; the metadata records the source file and the row index.

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader("./sales_data.csv")
docs = loader.load()
print(docs[0].page_content)  # "column1: value1\ncolumn2: value2"
print(docs[0].metadata)  # {'source': './sales_data.csv', 'row': 0}
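
The row-to-document mapping can be approximated with the standard library alone. This sketch is an illustration of the output format, not the loader's actual implementation; rows_to_documents and the sample data are made up for the example, and plain dicts stand in for Document objects.

```python
import csv
import io


def rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into a dict with 'column: value' text and row metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_index, row in enumerate(reader):
        # One "column: value" line per column, joined with newlines.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


sample = "region,units\nNorth,120\nSouth,95\n"
csv_docs = rows_to_documents(sample, "./sales_data.csv")
print(csv_docs[0]["page_content"])
print(csv_docs[1]["metadata"])
```

Because every row carries its column names, each document stays meaningful on its own, at the cost of repeating the headers in every row.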

For directories, use DirectoryLoader with a glob pattern to select files and loader_cls to choose the loader applied to each match.

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "./docs",
    glob="**/*.md",  # Only markdown files
    loader_cls=TextLoader
)
docs = loader.load()
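
Before loading a large directory, it can help to preview which files a glob pattern selects. The sketch below uses pathlib only; preview_matches is a hypothetical helper, and it does not reproduce every DirectoryLoader option (for example, its handling of hidden files), so treat the result as an approximation.

```python
from pathlib import Path
import tempfile


def preview_matches(root: str, pattern: str) -> list[str]:
    """Return the sorted relative paths of files under root that match pattern."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix() for p in base.glob(pattern) if p.is_file()
    )


# Build a small throwaway tree so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "guides").mkdir()
    (base / "intro.md").write_text("# Intro", encoding="utf-8")
    (base / "guides" / "setup.md").write_text("# Setup", encoding="utf-8")
    (base / "notes.txt").write_text("not markdown", encoding="utf-8")
    matched = preview_matches(tmp, "**/*.md")
    print(matched)
```

The "**/" prefix makes the pattern recursive, so Markdown files in the top-level directory and in subdirectories are both selected, while notes.txt is skipped.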

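A frequent error is a mismatched encoding. Unless told otherwise, TextLoader opens files with the platform's default encoding, so a file saved in a different encoding can fail to load. Pass the encoding explicitly, or set autodetect_encoding=True to let the loader try to detect it.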

loader = TextLoader("./legacy_doc.txt", encoding="latin-1")
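
The underlying failure is easy to reproduce without LangChain. This standard-library sketch writes a Latin-1 file, shows that decoding it as UTF-8 raises an error, and then reads it correctly with the right encoding. The file name and sample text are placeholders.

```python
from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = Path(tmp) / "legacy_doc.txt"
    # Write bytes in Latin-1; the accented characters are not valid UTF-8 here.
    legacy_path.write_bytes("café résumé".encode("latin-1"))

    try:
        legacy_path.read_text(encoding="utf-8")
        utf8_failed = False
    except UnicodeDecodeError:
        utf8_failed = True

    # Reading with the encoding the file was written in succeeds.
    recovered = legacy_path.read_text(encoding="latin-1")

print(utf8_failed)
print(recovered)
```

A loader hits the same decoding step internally, which is why naming the encoding up front avoids the error.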

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
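
One way to capture the package versions and runtime for that record is a short script like the sketch below. The package list is an assumption about what this chapter's examples use; adjust it to match your environment. Packages that are not installed are reported rather than raising an error.

```python
from importlib import metadata
import platform
import sys


def environment_report(packages: list[str]) -> dict[str, str]:
    """Collect the Python version, platform, and installed package versions."""
    report = {"python": platform.python_version(), "platform": sys.platform}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
    return report


# Assumed package names; edit to match what you actually installed.
report = environment_report(["langchain-community", "langchain-core", "pypdf"])
for key, value in report.items():
    print(f"{key}: {value}")
```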

EXERCISE

Download a PDF, load it with PyPDFLoader, and verify that metadata["page"] starts at 0 and increases by 1 on each subsequent page.