03. Fixed-Size vs Semantic Tradeoffs

Chapter 3 of 24 · 15 min

The two chunking strategies have distinct performance profiles that suit different use cases.

Fixed-size chunking offers predictable memory usage, uniform embedding computation, and simplicity. Implementation requires only character/word counting and slice operations. However, this approach frequently splits sentences mid-thought and separates related concepts across chunk boundaries.
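The counting-and-slicing idea can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `fixed_size_chunk` and the character-based `chunk_size` parameter are assumptions made for this example.

```python
from typing import List


def fixed_size_chunk(text: str, chunk_size: int = 400) -> List[str]:
    """Split text into consecutive slices of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Plain slicing: boundaries fall wherever the character count says,
    # regardless of word or sentence structure.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Chunking splits text. Boundaries ignore meaning."
chunks = fixed_size_chunk(sample, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

Every chunk except possibly the last has exactly `chunk_size` characters, which is where the predictability comes from. Printing the chunks also shows the cost: slice boundaries can fall in the middle of a sentence.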

Semantic chunking preserves document structure, but boundary detection adds complexity and the resulting chunks vary in size. The per-document cost of detecting boundaries adds up when processing millions of documents.
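As a rough illustration, the simplest structural form of semantic chunking treats blank lines as boundaries. The helper name `paragraph_chunk` is hypothetical, and production systems may instead rely on sentence segmentation or embedding similarity, which this sketch does not attempt.

```python
import re
from typing import List


def paragraph_chunk(text: str) -> List[str]:
    """Split on blank lines so each chunk is one whole paragraph."""
    paragraphs = re.split(r'\n\s*\n', text)
    # Drop empty pieces produced by leading or trailing blank lines.
    return [p.strip() for p in paragraphs if p.strip()]


doc = (
    "Short intro.\n\n"
    "A much longer paragraph that explains one idea in full detail.\n\n"
    "End."
)
chunks = paragraph_chunk(doc)
print([len(c) for c in chunks])
```

No paragraph is split, but chunk lengths now depend entirely on how the author wrote the document, so they can differ widely.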

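| Factor                       | Fixed-Size | Semantic    |
|------------------------------|------------|-------------|
| Implementation complexity    | Low        | Medium-High |
| Chunk size variance          | Low        | High        |
| Context preservation         | Poor       | Good        |
| Index size predictability    | High       | Medium      |
| Retrieval on fragmented docs | Degraded   | Stable      |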

Hybrid approaches offer a middle path. First, identify semantic boundaries (headings, paragraph breaks). Then apply fixed-size chunking within sections. This preserves section context while maintaining reasonable size uniformity.

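import re
from typing import List


def hybrid_chunk(document: str, target_size: int = 400) -> List[str]:
    """Chunk by sections first, then apply size limits within sections.

    target_size is a character budget. Long sections are split by word
    count, assuming roughly 5 characters per word.
    """
    # Split before headings: lines starting with 1-6 '#' characters, or
    # ALL-CAPS lines ending in ':'. Splitting on a zero-width match
    # requires Python 3.7+.
    section_pattern = r'(?m)^(?=#{1,6}\s|[A-Z][A-Z ]+:$)'
    sections = re.split(section_pattern, document)

    # Guard against a zero step when target_size is smaller than 5
    words_per_chunk = max(1, target_size // 5)

    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= target_size:
            chunks.append(section)
        else:
            # Apply fixed-size (word-count) windows within long sections
            words = section.split()
            for i in range(0, len(words), words_per_chunk):
                chunks.append(' '.join(words[i:i + words_per_chunk]))

    return chunks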

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Calculate chunk size distribution statistics (mean, std, min, max) for both strategies on a sample of 20 documents using NumPy. Visualize with a histogram.