RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 1

06. Fixed-Size Chunking

Chapter 6 of 22 · 25 min
KEY INSIGHT

Fixed-size chunking is fast but ignores semantic boundaries, often splitting paragraphs and separating headers from their content.

```python
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")
text = "Hello world"
tokens = encoder.encode(text)
print(f"Token count: {len(tokens)}")  # Output: Token count: 2
```

The simplest chunking strategy is fixed size: split text every N characters or tokens. It is fast, predictable, and requires no text analysis. It is also the weakest strategy for semantic coherence, because the split points ignore sentence, paragraph, and section boundaries.

Token-based vs character-based

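Character-based chunking splits at exact character counts, so a cut can land in the middle of a word. Token-based chunking splits at token boundaries, the word and subword units produced by a tokenizer. LLMs process tokens, and their context limits are measured in tokens, so token-based chunking aligns chunk sizes with how models see your text.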

Use token-based chunking unless you have a specific reason not to.
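For contrast, character-based chunking can be sketched in a few lines using only the standard library. This is an illustration rather than code from the course: `char_chunk` and the sample sentence are assumed names and data. It slices at fixed character offsets, which is why the first chunk below stops partway through a word.

```python
def char_chunk(text: str, chunk_size: int = 20, overlap: int = 5) -> list[str]:
    """Split text every `chunk_size` characters (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "The return window for electronics is 30 days."
pieces = char_chunk(sample, chunk_size=20, overlap=5)
for piece in pieces:
    print(repr(piece))
```

The first chunk is `'The return window fo'`: the cut falls inside "for", which a token-based splitter is less likely to do for common words.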

Basic fixed-size token chunking

import tiktoken

def fixed_size_chunk(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50
) -> list[str]:
    """
    Split text into fixed-size token chunks with overlap.

    Args:
        text: Input text to chunk
        chunk_size: Target tokens per chunk
        overlap: Token overlap between chunks

    Returns:
        List of text chunks
    """
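    # A non-positive step would make the loop below run forever.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    encoder = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4

    tokens = encoder.encode(text)
    chunks = []
    step = chunk_size - overlap  # how far the window advances each iteration

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]

        chunk_text = encoder.decode(chunk_tokens)
        if chunk_text.strip():
            chunks.append(chunk_text)

        # Stop once the window reaches the end of the input. Otherwise the
        # next chunk would contain only tokens already covered by this one.
        if end >= len(tokens):
            break
        start += step

    return chunks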
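To see which token ranges the sliding window covers without loading a tokenizer, the loop can be reduced to index arithmetic. The sketch below is an assumption-laden illustration: `window_spans` is a hypothetical helper, not part of tiktoken or the listing above, and it stops as soon as a window reaches the end of the input.

```python
def window_spans(
    n_tokens: int,
    chunk_size: int = 512,
    overlap: int = 50,
) -> list[tuple[int, int]]:
    """Return (start, end) token index pairs for each fixed-size window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    spans = []
    start = 0
    while start < n_tokens:
        end = min(start + chunk_size, n_tokens)
        spans.append((start, end))
        if end == n_tokens:  # final window reached
            break
        start += step
    return spans


spans = window_spans(1200, chunk_size=512, overlap=50)
print(spans)
```

For a 1200-token input this yields three windows, each starting 462 tokens after the previous one, with the last window shorter than `chunk_size`.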

Handling the overlap problem

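Overlap preserves context at chunk boundaries. A sentence or an answer might start in one chunk and continue in the next. Overlap repeats the tail of each chunk at the head of the following one, so a passage that straddles a boundary still appears intact in at least one chunk, provided it is no longer than the overlap.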

But overlap has a cost: the same text appears in multiple chunks, inflating your vector store. With a 50-token overlap on 512-token chunks, each chunk after the first repeats 50 tokens of its predecessor, so roughly 10% of what you store is duplicated content.
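The duplication figure can be checked with plain arithmetic before chunking anything. A minimal sketch, assuming a sliding window that stops once it reaches the end of the input; `storage_overhead` is a hypothetical helper and the 10,000-token document is an illustrative size, not a measurement.

```python
def storage_overhead(total_tokens: int, chunk_size: int = 512, overlap: int = 50) -> dict:
    """Estimate chunk count and duplicated tokens for a sliding window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    if total_tokens <= 0:
        num_chunks = 0
    elif total_tokens <= chunk_size:
        num_chunks = 1
    else:
        # One full window, then enough steps to cover the remainder
        # (integer ceiling division).
        num_chunks = 1 + (total_tokens - chunk_size + step - 1) // step
    # Every chunk after the first repeats `overlap` tokens of its predecessor.
    duplicated = overlap * max(num_chunks - 1, 0)
    return {
        "num_chunks": num_chunks,
        "duplicated_tokens": duplicated,
        "stored_tokens": total_tokens + duplicated,
    }


result = storage_overhead(10_000, chunk_size=512, overlap=50)
print(result)
print(f"extra storage: {result['duplicated_tokens'] / 10_000:.1%}")
```

Under these assumptions the example stores 11,050 tokens for a 10,000-token document, about 10.5% more than the original.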

def fixed_chunk_with_overlap_stats(
    text: str,
    chunk_size: int = 512,
    overlap: int = 50
) -> tuple[list[str], dict]:
    """Chunk text and return statistics about overlap."""
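    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    encoder = tiktoken.get_encoding("cl100k_base")
    tokens = encoder.encode(text)

    chunks = []
    total_tokens = len(tokens)
    stored_tokens = 0  # tokens summed over all chunks, duplicates included

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)

        if chunk_text.strip():
            chunks.append(chunk_text)
            stored_tokens += len(chunk_tokens)

        # Stop at the final window so no chunk consists only of overlap.
        if end >= len(tokens):
            break
        start += (chunk_size - overlap)

    # Count duplicates by position, not by token ID: a set of token IDs
    # would measure repeated vocabulary in the text, not chunk overlap.
    duplicated_tokens = max(stored_tokens - total_tokens, 0)
    stats = {
        "total_tokens": total_tokens,
        "num_chunks": len(chunks),
        "stored_tokens": stored_tokens,
        "duplicated_tokens": duplicated_tokens,
        "overlap_ratio": duplicated_tokens / stored_tokens if stored_tokens else 0.0,
        "tokens_per_chunk_avg": stored_tokens / len(chunks) if chunks else 0
    }

    return chunks, stats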

Document-level chunking with metadata

When chunking a document, attach source metadata to every chunk so that a retrieved chunk can be traced back to its document and position.

from dataclasses import dataclass

import tiktoken

@dataclass
class Chunk:
    text: str
    chunk_index: int
    source_document: str
    start_char: int
    end_char: int
    num_tokens: int

def chunk_document(
    document: dict,
    chunk_size: int = 512,
    overlap: int = 50
) -> list[Chunk]:
    """Chunk a document with full provenance metadata."""
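    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    encoder = tiktoken.get_encoding("cl100k_base")
    text = document["text"]
    source = document.get("source", "unknown")

    tokens = encoder.encode(text)
    chunks = []

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        chunk_tokens = tokens[start:end]
        chunk_text = encoder.decode(chunk_tokens)

        if chunk_text.strip():
            # Character positions are approximate: decoding a token prefix
            # can differ from the source text when a boundary splits a
            # multi-byte character. Decoding the prefix on every iteration
            # is also quadratic in document length.
            char_start = len(encoder.decode(tokens[:start]))
            char_end = len(encoder.decode(tokens[:end]))

            chunks.append(Chunk(
                text=chunk_text,
                chunk_index=len(chunks),
                source_document=source,
                start_char=char_start,
                end_char=char_end,
                num_tokens=len(chunk_tokens)
            ))

        # Stop at the final window so no chunk consists only of overlap.
        if end >= len(tokens):
            break
        start += (chunk_size - overlap)

    return chunks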
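One way to carry this metadata into a vector store is to flatten each chunk into a plain dictionary. The sketch below rests on several assumptions: `Chunk` is redefined with the same fields so the example runs on its own, `to_record` and the `source#index` id scheme are hypothetical, and the file name and field values are made up for illustration. No particular vector store requires this shape.

```python
from dataclasses import asdict, dataclass


@dataclass
class Chunk:
    text: str
    chunk_index: int
    source_document: str
    start_char: int
    end_char: int
    num_tokens: int


def to_record(chunk: Chunk) -> dict:
    """Flatten a Chunk into an id, the text to embed, and its metadata."""
    metadata = asdict(chunk)
    text = metadata.pop("text")  # keep the embedded text out of the metadata
    return {
        "id": f"{chunk.source_document}#{chunk.chunk_index}",
        "text": text,
        "metadata": metadata,
    }


example = Chunk(
    text="The return window for electronics is 30 days.",
    chunk_index=0,
    source_document="return_policy.md",  # hypothetical file name
    start_char=0,
    end_char=45,
    num_tokens=10,  # illustrative value, not measured with a tokenizer
)
record = to_record(example)
print(record["id"])
print(record["metadata"])
```

Keeping the id derived from source and index makes it straightforward to replace all chunks of one document when it is re-ingested.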

When to use fixed-size chunking

Fixed-size chunking works when:

  • Documents have no structure (chat logs, meeting transcripts)
  • Speed matters more than semantic coherence
  • You are prototyping and need something that works immediately

Fixed-size chunking fails when:

  • Documents have clear section boundaries
  • Paragraphs contain complete thoughts that should stay together
  • Headers must stay attached to the content they introduce (separating them is a common failure)

# Demonstration of boundary problems
text = """
# Return Policy

## Section 1: Electronics
The return window for electronics is 30 days.
This applies to all purchases made in 2024.

## Section 2: Clothing
The return window for clothing is 60 days.
Clothing must be unworn with tags attached.
"""

chunks = fixed_size_chunk(text, chunk_size=50, overlap=10)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)

# Exact split points depend on the tokenizer, but a 50-token window will
# not line up with the section headers. One chunk can end mid-sentence
# (for example at "...window for clothing is") and the next can begin
# partway through a section, without the header that names it.
# A query such as "What is the return window for electronics?" may then
# retrieve a chunk that mixes electronics policy with clothing policy.

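
A rough way to quantify boundary problems is to count how many chunks stop before a sentence ends. The sketch below is a crude heuristic, not a method from the course: `count_mid_sentence_cuts` is a hypothetical helper, the three chunks are hand-written to mimic the failure, and a chunk that legitimately ends on a header line would also be counted.

```python
def count_mid_sentence_cuts(chunks: list[str]) -> int:
    """Count non-empty chunks that do not end with sentence-ending punctuation."""
    terminators = (".", "!", "?")
    return sum(
        1
        for chunk in chunks
        if chunk.strip() and not chunk.rstrip().endswith(terminators)
    )


# Hand-written chunks imitating fixed-size splits of the return policy text.
example_chunks = [
    "## Section 1: Electronics\nThe return window for electronics is 30",
    " days.\nThis applies to all purchases made in 2024.",
    "## Section 2: Clothing\nThe return window for",
]
cuts = count_mid_sentence_cuts(example_chunks)
print(f"{cuts} of {len(example_chunks)} chunks end mid-sentence")
```

Running the same count on real output for several chunk sizes gives a quick signal of how often a given setting cuts through sentences.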
EXERCISE

Take a 2000-word article and chunk it with chunk_size=300 and overlap set to 50, 100, and 150 in turn. Calculate the overlap ratio for each setting. Explain why increasing overlap beyond a certain point provides diminishing returns.
