03. Fixed-Size vs Semantic Tradeoffs
Each chunking strategy has a distinct performance profile that suits different use cases.
Fixed-size chunking offers predictable memory usage, uniform embedding computation, and simplicity. Implementation requires only character/word counting and slice operations. However, this approach frequently splits sentences mid-thought and separates related concepts across chunk boundaries.
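As a minimal sketch of the slicing described above, the following assumes character-based windows with no overlap. The function name `fixed_size_chunk`, the `chunk_size` parameter, and the sample text are illustrative choices, not part of any particular library.

```python
from typing import List


def fixed_size_chunk(text: str, chunk_size: int = 50) -> List[str]:
    """Slice text into consecutive windows of chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Hypothetical sample text: two sentences, longer than one window in total.
sample = (
    "Embeddings map text to vectors. "
    "Similar passages end up close together in that space."
)
chunks = fixed_size_chunk(sample, chunk_size=50)
for chunk in chunks:
    print(repr(chunk))
```

Because the window boundary ignores punctuation, the first chunk here ends partway through the second sentence, which is the mid-thought split mentioned above.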
Semantic chunking preserves document structure but introduces complexity in boundary detection and produces chunks of variable size. The added cost of boundary detection becomes significant when processing millions of documents.
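To illustrate the size variance, here is a deliberately simple stand-in for semantic chunking that treats blank lines as boundaries. Real boundary detection is usually more involved; the function name `paragraph_chunk` and the sample document are assumptions made for this sketch.

```python
import re
from typing import List


def paragraph_chunk(text: str) -> List[str]:
    """Split on blank lines so each paragraph becomes one chunk."""
    paragraphs = re.split(r"\n\s*\n", text)
    return [p.strip() for p in paragraphs if p.strip()]


# Hypothetical document with paragraphs of very different lengths.
doc = (
    "Short intro.\n\n"
    "A much longer paragraph that explains one idea in full, "
    "so it stays together even though it is several times the size "
    "of the intro.\n\n"
    "Closing note."
)
para_chunks = paragraph_chunk(doc)
sizes = [len(c) for c in para_chunks]
print(sizes)
```

Each idea stays intact, but the chunk lengths differ widely, which is why index size is harder to predict with this strategy.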
| Factor | Fixed-Size | Semantic |
|---|---|---|
| Implementation complexity | Low | Medium-High |
| Chunk size variance | Low | High |
| Context preservation | Poor | Good |
| Index size predictability | High | Medium |
| Retrieval on fragmented docs | Degraded | Stable |
Hybrid approaches offer a middle path. First, identify semantic boundaries (headings, paragraph breaks). Then apply fixed-size chunking within sections. This preserves section context while maintaining reasonable size uniformity.
```python
import re
from typing import List


def hybrid_chunk(document: str, target_size: int = 400) -> List[str]:
    """Chunk by sections first, then apply size limits within sections.

    target_size is measured in characters.
    """
    # Split before headings: Markdown "#" lines or ALL-CAPS lines ending in ":".
    section_pattern = r'(?=\n#{1,6}\s|\n[A-Z][A-Z ]+:\n)'
    sections = re.split(section_pattern, document)
    # Approximate words per chunk, assuming about 5 characters per word.
    words_per_chunk = max(1, target_size // 5)
    chunks = []
    for section in sections:
        text = section.strip()
        if not text:
            continue
        if len(text) <= target_size:
            chunks.append(text)
        else:
            # Apply fixed-size word windows within long sections.
            words = text.split()
            for i in range(0, len(words), words_per_chunk):
                chunks.append(' '.join(words[i:i + words_per_chunk]))
    return chunks
```
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Calculate chunk size distribution statistics (mean, std, min, max) for both strategies on a sample of 20 documents using NumPy. Visualize with a histogram.
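One possible starting point for the statistics, assuming chunk size is measured in characters. The helper `size_stats` and the toy chunk lists are placeholders for illustration only; they are not measurements, and you would replace them with the output of each strategy on your 20 sample documents.

```python
import numpy as np


def size_stats(chunks):
    """Return mean, std, min, and max of chunk lengths in characters."""
    sizes = np.array([len(c) for c in chunks])
    return {
        "mean": float(sizes.mean()),
        "std": float(sizes.std()),
        "min": int(sizes.min()),
        "max": int(sizes.max()),
    }


# Toy stand-ins for the output of each strategy (not real measurements).
fixed_chunks = ["a" * 100, "b" * 100, "c" * 100, "d" * 40]
semantic_chunks = ["a" * 30, "b" * 220, "c" * 90]

fixed_stats = size_stats(fixed_chunks)
semantic_stats = size_stats(semantic_chunks)
print(fixed_stats)
print(semantic_stats)

# Bin counts for a histogram; shared bin edges keep the two comparable.
edges = np.arange(0, 251, 50)
fixed_counts, _ = np.histogram([len(c) for c in fixed_chunks], bins=edges)
semantic_counts, _ = np.histogram([len(c) for c in semantic_chunks], bins=edges)
```

The bin counts can be passed to any plotting tool; if matplotlib is available, calling `plt.hist` on the raw size lists with the same `bins` is a common way to draw the histogram.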