13. Text Splitters

Chapter 13 of 18 · 20 min

Large documents must be chunked before embedding. Text splitters divide documents into smaller pieces that fit within embedding model limits (typically 512-8192 tokens) while preserving semantic coherence.

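LangChain provides RecursiveCharacterTextSplitter as the default choice. It tries a list of separators in order: paragraph breaks first, then line breaks, then spaces, and finally individual characters. Each chunk therefore ends on the most natural boundary that still fits within the size limit.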

from langchain.text_splitter import RecursiveCharacterTextSplitter

with open("./article.txt") as f:
    text = f.read()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # Max characters per chunk (length_function=len counts characters, not tokens)
    chunk_overlap=50,    # Max characters repeated between consecutive chunks
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

chunks = splitter.split_text(text)
print(f"Created {len(chunks)} chunks")
print(f"First chunk length: {len(chunks[0])}")
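To make the separator fallback concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not LangChain's implementation: the function name recursive_split is hypothetical, and the sketch neither merges small pieces back together nor adds overlap.

```python
def recursive_split(text, separators, chunk_size):
    """Split text so that every piece is at most chunk_size characters.

    Simplified illustration: tries each separator in order and only
    falls back to the next one for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # Out of separators, or reached the empty separator: cut at fixed widths.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        # Pieces that already fit are returned as-is by the recursive call.
        pieces.extend(recursive_split(part, remaining, chunk_size))
    return pieces


sample = "alpha beta\n\ngamma delta epsilon"
print(recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=12))
# -> ['alpha beta', 'gamma', 'delta', 'epsilon']
```

The first paragraph fits within 12 characters and is kept whole, while the second is too long and falls through to the space separator. The real splitter would go on to merge adjacent small pieces up to chunk_size.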

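The chunk_overlap parameter matters more than most tutorials acknowledge. Without overlap, a sentence that straddles a chunk boundary is cut in two, and neither chunk carries its full context. With chunk_overlap=50 (characters here, because length_function=len), up to 50 characters from the end of one chunk are repeated at the start of the next, so text near a boundary appears in both chunks.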

# Verify overlap is working
print(chunks[0][-100:])   # End of chunk 0
print(chunks[1][:100])     # Start of chunk 1 - should overlap
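Comparing printed slices by eye is error-prone, so the check can also be done programmatically. The following is a small sketch; the helper name shared_overlap and the sample chunks are hypothetical. It finds the longest suffix of one chunk that is also a prefix of the next.

```python
def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first and shrink until one matches.
    for size in range(max_len, 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Hypothetical chunks, standing in for the output of split_text()
demo_chunks = ["The quick brown fox jumps", "fox jumps over the lazy dog"]

for i in range(len(demo_chunks) - 1):
    overlap = shared_overlap(demo_chunks[i], demo_chunks[i + 1])
    print(f"chunks {i} -> {i + 1}: {len(overlap)} shared characters: {overlap!r}")
# -> chunks 0 -> 1: 9 shared characters: 'fox jumps'
```

With real splitter output, the measured overlap can be shorter than chunk_overlap, or even empty, because the splitter carries over whole pieces between separators instead of cutting mid-word.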

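For code repositories, use RecursiveCharacterTextSplitter.from_language, which selects language-specific separators. For Python, it tries class and function definitions before falling back to blank lines, line breaks, and spaces.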

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.text_splitter import Language

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=100,
    chunk_overlap=20
)

# Read the source with a context manager so the file handle is closed
with open("./processor.py") as f:
    code_chunks = python_splitter.split_text(f.read())
print(f"Code split into {len(code_chunks)} chunks, preferring class and function boundaries")

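Token-aware splitting keeps each chunk within the embedding model's token limit, so the model does not silently truncate it. Use tiktoken to count real tokens instead of characters, and choose the encoding that matches your model.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Measure chunk_size and chunk_overlap in tokens via tiktoken (pip install tiktoken)
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # Assumption: use the encoding that matches your embedding model
    chunk_size=500,               # Max tokens per chunk
    chunk_overlap=50              # Max tokens repeated between chunks
)

# Fallback without tiktoken: length_function=lambda x: len(x) // 4
# This is only a rough estimate (about 4 characters per token for English text).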
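Whichever length function is used, it can help to check the resulting chunks against the embedding model's limit before sending them. The sketch below assumes a pluggable token counter. The names estimate_tokens and oversized_chunks are hypothetical, and the default counter is only the rough estimate of 4 characters per token.

```python
def estimate_tokens(text):
    """Rough heuristic: about 4 characters per token for English text."""
    return len(text) // 4


def oversized_chunks(chunks, max_tokens, count_tokens=estimate_tokens):
    """Return (index, token_count) for every chunk whose count exceeds max_tokens."""
    flagged = []
    for index, chunk in enumerate(chunks):
        tokens = count_tokens(chunk)
        if tokens > max_tokens:
            flagged.append((index, tokens))
    return flagged


# Hypothetical chunks and limit, for illustration only
sample_chunks = ["a" * 40, "b" * 400]
print(oversized_chunks(sample_chunks, max_tokens=50))
# -> [(1, 100)]
```

For exact counts, a tiktoken-based counter can be passed instead, for example count_tokens=lambda t: len(enc.encode(t)) with enc = tiktoken.get_encoding("cl100k_base"), assuming that encoding matches the embedding model in use.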

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Load a 2000+ word article, split it with chunk_size=300 and chunk_overlap=50, then verify that identical text appears at the end of chunk N and start of chunk N+1.