13. Text Splitters
Large documents must be chunked before embedding. Text splitters divide documents into smaller pieces that fit within embedding model limits (typically 512-8192 tokens) while preserving semantic coherence.
LangChain provides RecursiveCharacterTextSplitter as the default choice. It tries a list of separators in order: paragraph breaks first, then line breaks, then spaces, and finally individual characters, so chunks follow natural language boundaries wherever possible.
# Newer LangChain releases also ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("./article.txt") as f:
    text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # Maximum chunk length, in units of length_function
    chunk_overlap=50,     # Maximum overlap between consecutive chunks
    length_function=len,  # len counts characters, not tokens
    separators=["\n\n", "\n", " ", ""]  # Tried in order, coarsest first
)
chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(f"First chunk length: {len(chunks[0])}")
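The separator fallback described above can be illustrated with a simplified sketch in plain Python. This is not LangChain's implementation: the function name recursive_split is hypothetical, lengths are measured in characters, and overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: split on the coarsest separator, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # Keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer than the first."
print(recursive_split(sample, 30))
# ['First paragraph.', 'Second paragraph is a bit', 'longer than the first.']
```

The short first paragraph survives intact, while the longer one falls through to the space separator and is broken between words rather than mid-word.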
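The chunk_overlap parameter matters more than most tutorials acknowledge. Without overlap, a sentence that straddles a chunk boundary is cut in two, and neither chunk carries its full context. With an overlap of 50 (characters here, because length_function=len), up to 50 characters from the end of one chunk are repeated at the start of the next, so text near a boundary appears in both chunks. The splitter builds the overlap from whole separator-delimited pieces, so the repeated span is often shorter than 50 and can be empty when the pieces near a boundary are long.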
# Verify overlap is working
print(chunks[0][-100:]) # End of chunk 0
print(chunks[1][:100]) # Start of chunk 1 - should overlap
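To make the mechanics of overlap concrete, here is a minimal sketch of fixed-size windows over a string. It is deliberately simpler than RecursiveCharacterTextSplitter (it ignores separators entirely), and the function name window_chunks is an assumption made for this illustration.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # Each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # This window already reached the end of the text
    return chunks


pieces = window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it. The cost is that the windows advance by only chunk_size - chunk_overlap characters, so larger overlaps produce more chunks to embed and store.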
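For code repositories, use RecursiveCharacterTextSplitter.from_language, which selects separators suited to the given language (for example, class and function definitions in Python).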
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=100,   # Characters; very small, so long functions will still be split
    chunk_overlap=20
)
with open("./processor.py") as f:
    code_chunks = python_splitter.split_text(f.read())
# Class and function boundaries are preferred split points, not a guarantee
print(f"Code split into {len(code_chunks)} chunks")
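The idea behind language-specific separators can be sketched with the standard library alone. The helper below, split_top_level, is hypothetical and is not how LangChain implements its Python preset; it only shows what splitting at top-level def and class statements looks like, and it does not handle decorators or async def.

```python
import re


def split_top_level(source):
    """Split Python source just before each top-level def or class statement."""
    # The lookahead matches an empty position at the start of a line, so no text is consumed.
    blocks = re.split(r"(?m)^(?=def |class )", source)
    return [block for block in blocks if block.strip()]


source = "import os\n\ndef load():\n    return os.getcwd()\n\nclass Processor:\n    pass\n"
for block in split_top_level(source):
    print(repr(block))
```

Because the split points are zero-width, joining the blocks reproduces the original source exactly, which makes it easy to check that nothing was lost.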
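Token-aware splitting prevents embedding model truncation. Character counts only approximate token counts, so use tiktoken when chunk_size must be measured in real tokens.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Rough estimate: about 4 characters per token for English text
approx_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=lambda x: len(x) // 4
)

# More accurate: count tokens with tiktoken (requires the tiktoken package)
token_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # Choose the encoding that matches your embedding model
    chunk_size=500,
    chunk_overlap=50
)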
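The unit of chunk_size is whatever the length function measures, which a small sketch can show without any dependencies. Both names below are hypothetical: pack_words is a greedy word packer written for this illustration, and approx_token_count is a crude stand-in for a real tokenizer such as tiktoken, not a replacement for one.

```python
import re


def approx_token_count(text):
    """Stand-in for a tokenizer: counts words and punctuation marks."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def pack_words(text, chunk_size, length_function=len):
    """Greedily pack words into chunks whose measured length stays within chunk_size."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))  # Adding this word would exceed the limit
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "one two three four five six"
print(pack_words(text, 10))                     # Limit of 10 characters
# ['one two', 'three four', 'five six']
print(pack_words(text, 4, approx_token_count))  # Limit of 4 approximate tokens
# ['one two three four', 'five six']
```

The same text and the same packing logic yield different chunks once the length function changes, which is why a chunk_size tuned in characters cannot be reused unchanged as a token budget.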
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Load a 2000+ word article, split it with chunk_size=300 and chunk_overlap=50, then verify that identical text appears at the end of chunk N and start of chunk N+1.
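One way to check the overlap in this exercise is to measure it rather than compare printed slices by eye. The helper shared_overlap below is a hypothetical name for this sketch; it works on any pair of strings, so it can be applied to consecutive chunks from whichever splitter produced them.

```python
def shared_overlap(left, right):
    """Return the longest suffix of left that is also a prefix of right."""
    max_len = min(len(left), len(right))
    for size in range(max_len, 0, -1):  # Try the longest candidate first
        if left.endswith(right[:size]):
            return right[:size]
    return ""


print(repr(shared_overlap("the quick brown fox", "brown fox jumps")))
# 'brown fox'
print(repr(shared_overlap("abc", "xyz")))
# ''
```

Applied as shared_overlap(chunks[n], chunks[n + 1]) for each consecutive pair, it reports the repeated text directly. An empty result for some pairs is not necessarily a bug, since a splitter may strip whitespace or find no piece small enough to carry over at that boundary.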