HOW-TO · RAG
How to Use Recursive Character Text Splitter
Target environment
Ubuntu 24.04 · Ollama 0.4.x
Prerequisites
Python 3 with the langchain package installed (for example, pip install langchain inside a virtual environment)
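Before running the steps, it can help to confirm which interpreter is active and which packages it sees. The sketch below uses only the standard library. The helper name installed_version and the list of package names are illustrative assumptions, not part of the original setup.

```python
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    # Package names are assumptions; adjust them to whatever your pipeline imports.
    for name in ("langchain", "langchain-core", "langchain-text-splitters"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If langchain shows as not installed here, the imports in the steps will fail in the same interpreter.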
What this does
RecursiveCharacterTextSplitter is the splitter LangChain recommends for generic text. It tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then single characters) and recurses into any piece that is still larger than chunk_size. Chunks therefore follow natural boundaries where possible while staying at or below the size limit, which by default is measured in characters.
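The recursion described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name recursive_split is hypothetical, overlap is left out, and separators between chunks are discarded.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Nothing coarser left to try: fall back to a hard cut by length.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph here.\n\n"
    "Second paragraph is a bit longer than the first."
)
demo_chunks = recursive_split(text, chunk_size=30, separators=["\n\n", "\n", " ", ""])
for piece in demo_chunks:
    print(len(piece), repr(piece))
```

The short first paragraph survives intact, while the longer second one is broken at spaces because no paragraph or line break inside it is available.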
Steps
Create the splitter with sensible defaults.
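from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recent LangChain releases also expose this class from langchain_text_splitters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # maximum chunk length, in characters by default
    chunk_overlap=50,  # maximum overlap between neighbouring chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # tried in order, coarse to fine
)
Split raw text directly.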
sample = (
    "Natural language processing enables computers to understand text. "
    "RAG combines retrieval with generation."
)
chunks = splitter.split_text(sample)
print(f"Split into {len(chunks)} chunk(s)")
for i, chunk in enumerate(chunks):
    print(f" [{i}] ({len(chunk)} chars): {chunk}")
Split pre-loaded LangChain documents.
from langchain_core.documents import Document

docs = [Document(page_content=sample, metadata={"source": "demo"})]
chunked_docs = splitter.split_documents(docs)  # each chunk keeps its source metadata
print(f"Produced {len(chunked_docs)} document chunks")
Customize separators for code.
code_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
    separators=["\nclass ", "\ndef ", "\n ", "\n", " "],
)
print(code_splitter.split_text("class Agent:\n    def run(self):\n        pass"))
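The chunk_size and chunk_overlap arguments used in the steps above can be pictured with a fixed-width sliding window. This is only a sketch of what the two numbers mean: the helper name sliding_chunks is hypothetical, and the real splitter cuts at separators rather than at fixed offsets.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far the window advances each time
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


windows = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(windows)  # ['abcd', 'defg', 'ghij']
```

Each window starts with the last character of the previous one, which is the repetition that chunk_overlap controls.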
Verification
python -c "
from langchain.text_splitter import RecursiveCharacterTextSplitter
s = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
c = s.split_text('A B C D E F G H I J K L M N O P Q R S T U V W X Y Z')
print(f'Chunks: {len(c)}, max len: {max(len(x) for x in c)}')
"
# Expected: Chunks: <N>, max len: <=100
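The one-liner above only prints a summary. A slightly stricter check, sketched below, works on any list of strings such as the output of split_text. The function name chunk_report and its length_fn parameter are assumptions made for illustration.

```python
def chunk_report(chunks, chunk_size, length_fn=len):
    """Summarize chunk lengths and list the indexes that exceed `chunk_size`."""
    lengths = [length_fn(c) for c in chunks]
    return {
        "count": len(chunks),
        "max_len": max(lengths, default=0),
        "oversized": [i for i, n in enumerate(lengths) if n > chunk_size],
    }


# Stand-in data; in practice pass the list returned by splitter.split_text(...).
report = chunk_report(["alpha beta", "gamma", "x" * 120], chunk_size=100)
print(report)  # {'count': 3, 'max_len': 120, 'oversized': [2]}
```

Passing a token-counting function as length_fn would let the same check report sizes in tokens instead of characters.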
Common failures
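- Overlap causes duplicate concepts. Set chunk_overlap to roughly 10% of chunk_size.
- Empty chunks produced. split_text returns at least one chunk for non-empty input, but an empty list when the input is empty or only whitespace. Verify with assert chunks.
- Whitespace-heavy chunks. Strip with chunk.strip() in a post-processing step.
- Chunks too large for embedding. chunk_size counts characters by default, not tokens. Lower chunk_size, or build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder (requires the tiktoken package) so that length is measured in tokens and chunks stay within the embedding model's limit, for example 512 tokens or fewer.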
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
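Several of the fixes above (stripping whitespace, guarding against empty results) can be combined into one post-processing pass. The sketch below is one possible way to do that. clean_chunks is a hypothetical helper, and dropping exact duplicates is an added assumption that may not suit every corpus.

```python
def clean_chunks(chunks):
    """Strip whitespace, drop empty chunks, and remove exact duplicates in order."""
    seen = set()
    cleaned = []
    for chunk in chunks:
        text = chunk.strip()
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


raw = ["  Intro paragraph.  ", "", "\n\n", "Intro paragraph.", "Details follow."]
print(clean_chunks(raw))  # ['Intro paragraph.', 'Details follow.']
```

Running this before embedding keeps blank or repeated chunks out of the vector store.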