06. Fixed-Size Chunking
The simplest chunking strategy is fixed size: split text every N characters or tokens. It is fast, predictable, and requires no text analysis. It is also the worst strategy for semantic coherence.
Token-based vs character-based
Character-based chunking splits at exact character counts. Token-based chunking respects language boundaries (words, subwords). LLMs process tokens, so token-based chunking aligns with how models see your text.
Use token-based chunking unless you have a specific reason not to.
Basic fixed-size tokenizer chunking
import tiktoken
def fixed_size_chunk(
text: str,
chunk_size: int = 512,
overlap: int = 50
) -> list[str]:
"""
Split text into fixed-size token chunks with overlap.
Args:
text: Input text to chunk
chunk_size: Target tokens per chunk
overlap: Token overlap between chunks
Returns:
List of text chunks
"""
encoder = tiktoken.get_encoding("cl100k_base") # GPT-4's encoder
tokens = encoder.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + chunk_size
chunk_tokens = tokens[start:end]
chunk_text = encoder.decode(chunk_tokens)
if chunk_text.strip():
chunks.append(chunk_text)
start += (chunk_size - overlap)
return chunks
Handling the overlap problem
Overlap preserves context at chunk boundaries. A question might start in one chunk and continue in the next. Overlap ensures the retriever can find the complete context.
But overlap has a cost: the same text appears in multiple chunks, inflating your vector store. With 50-token overlap on 512-token chunks, roughly 10% of your content is duplicated.
def fixed_chunk_with_overlap_stats(
text: str,
chunk_size: int = 512,
overlap: int = 50
) -> tuple[list[str], dict]:
"""Chunk text and return statistics about overlap."""
encoder = tiktoken.get_encoding("cl100k_base")
tokens = encoder.encode(text)
chunks = []
total_tokens = len(tokens)
unique_tokens = set()
start = 0
while start < len(tokens):
end = start + chunk_size
chunk_tokens = tokens[start:end]
chunk_text = encoder.decode(chunk_tokens)
if chunk_text.strip():
chunks.append(chunk_text)
unique_tokens.update(chunk_tokens)
start += (chunk_size - overlap)
stats = {
"total_tokens": total_tokens,
"num_chunks": len(chunks),
"unique_tokens": len(unique_tokens),
"overlap_ratio": 1 - (len(unique_tokens) / total_tokens),
"tokens_per_chunk_avg": total_tokens / len(chunks) if chunks else 0
}
return chunks, stats
Document-level chunking with metadata
When chunking a document, attach source metadata to every chunk.
from dataclasses import dataclass
from typing import Optional
@dataclass
class Chunk:
text: str
chunk_index: int
source_document: str
start_char: int
end_char: int
num_tokens: int
def chunk_document(
document: dict,
chunk_size: int = 512,
overlap: int = 50
) -> list[Chunk]:
"""Chunk a document with full provenance metadata."""
encoder = tiktoken.get_encoding("cl100k_base")
text = document["text"]
source = document.get("source", "unknown")
tokens = encoder.encode(text)
chunks = []
start = 0
while start < len(tokens):
end = start + chunk_size
chunk_tokens = tokens[start:end]
chunk_text = encoder.decode(chunk_tokens)
if chunk_text.strip():
# Calculate character positions (approximate)
char_start = len(encoder.decode(tokens[:start]))
char_end = len(encoder.decode(tokens[:end]))
chunks.append(Chunk(
text=chunk_text,
chunk_index=len(chunks),
source_document=source,
start_char=char_start,
end_char=char_end,
num_tokens=len(chunk_tokens)
))
start += (chunk_size - overlap)
return chunks
When to use fixed-size chunking
Fixed-size chunking works when:
- Documents have no structure (chat logs, meeting transcripts)
- Speed matters more than semantic coherence
- You are prototyping and need something that works immediately
Fixed-size chunking fails when:
- Documents have clear section boundaries
- Paragraphs contain complete thoughts that should stay together
- Headers and their content are separated (common failure)
# Demonstration of boundary problems
text = """
# Return Policy
## Section 1: Electronics
The return window for electronics is 30 days.
This applies to all purchases made in 2024.
## Section 2: Clothing
The return window for clothing is 60 days.
Clothing must be unworn with tags attached.
"""
chunks = fixed_size_chunk(text, chunk_size=50, overlap=10)
# Chunk 0 might end at "...electronics is 30"
# Chunk 1 might start at "days. This applies..."
# The question "What is the return window for electronics?" gets answered
# with half electronics policy and half clothing policy
Take a 2000-word article and chunk it with chunk_size=300 combined with overlap=50, 100, and 150. Calculate the overlap ratio for each. Explain why increasing overlap beyond a certain point provides diminishing returns.