Fixed-Size Chunking — RAG Systems: Part 1 (Chapter 6)

The simplest chunking strategy is fixed size: split text every N characters or tokens. It is fast, predictable, and requires no text analysis. It is also the worst strategy for semantic coherence.

Token-based vs character-based

Character-based chunking splits at exact character counts. Token-based chunking respects language boundaries (words, subwords). LLMs process tokens, so token-based chunking aligns with how models see your text.

Use token-based chunking unless you have a specific reason not to.

Basic fixed-size tokenizer chunking

import tiktoken

def fixed_size_chunk(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50
) -> list[str]:
    """
    Split text into fixed-size token chunks with overlap.

    Args:
        text: Input text to chunk
        chunk_size: Target tokens per chunk
        overlap: Token overlap between chunks

    Returns:
        List of text chunks
    """
    encoder = tiktoken.get_encoding("cl100k_base")  # GPT-4's encoder

    tokens = encoder.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]

        chunk_text = encoder.decode(chunk_tokens)
        if chunk_text.strip():
            chunks.append(chunk_text)

        start += (chunk_size - overlap)

    return chunks

Handling the overlap problem

Overlap preserves context at chunk boundaries. A question might start in one chunk and continue in the next. Overlap ensures the retriever can find the complete context.

But overlap has a cost: the same text appears in multiple chunks, inflating your vector store. With 50-token overlap on 512-token chunks, roughly 10% of your content is duplicated.

def fixed_chunk_with_overlap_stats(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50
) -> tuple[list[str], dict]:
    """Chunk text and return statistics about overlap."""
    encoder = tiktoken.get_encoding("cl100k_base")
    tokens = encoder.encode(text)

    chunks = []
    total_tokens = len(tokens)
    unique_tokens = set()

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)

        if chunk_text.strip():
            chunks.append(chunk_text)
            unique_tokens.update(chunk_tokens)

        start += (chunk_size - overlap)

    stats = {
        "total_tokens": total_tokens,
        "num_chunks": len(chunks),
        "unique_tokens": len(unique_tokens),
        "overlap_ratio": 1 - (len(unique_tokens) / total_tokens),
        "tokens_per_chunk_avg": total_tokens / len(chunks) if chunks else 0
    }

    return chunks, stats

Document-level chunking with metadata

When chunking a document, attach source metadata to every chunk.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Chunk:
    text: str
    chunk_index: int
    source_document: str
    start_char: int
    end_char: int
    num_tokens: int

def chunk_document(
    document: dict,
    chunk_size: int = 512,
    overlap: int = 50
) -> list[Chunk]:
    """Chunk a document with full provenance metadata."""
    encoder = tiktoken.get_encoding("cl100k_base")
    text = document["text"]
    source = document.get("source", "unknown")

    tokens = encoder.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)

        if chunk_text.strip():
            # Calculate character positions (approximate)
            char_start = len(encoder.decode(tokens[:start]))
            char_end = len(encoder.decode(tokens[:end]))

            chunks.append(Chunk(
                text=chunk_text,
                chunk_index=len(chunks),
                source_document=source,
                start_char=char_start,
                end_char=char_end,
                num_tokens=len(chunk_tokens)
            ))

        start += (chunk_size - overlap)

    return chunks

When to use fixed-size chunking

Fixed-size chunking works when:

Documents have no structure (chat logs, meeting transcripts)
Speed matters more than semantic coherence
You are prototyping and need something that works immediately

Fixed-size chunking fails when:

Documents have clear section boundaries
Paragraphs contain complete thoughts that should stay together
Headers and their content are separated (common failure)

# Demonstration of boundary problems
text = """
# Return Policy

## Section 1: Electronics
The return window for electronics is 30 days.
This applies to all purchases made in 2024.

## Section 2: Clothing
The return window for clothing is 60 days.
Clothing must be unworn with tags attached.
"""

chunks = fixed_size_chunk(text, chunk_size=50, overlap=10)
# Chunk 0 might end at "...electronics is 30"
# Chunk 1 might start at "days. This applies..."
# The question "What is the return window for electronics?" gets answered
# with half electronics policy and half clothing policy