02. Semantic Chunking at Scale
Fixed-size chunking ignores document structure. Semantic chunking identifies natural topic boundaries and splits accordingly, improving retrieval relevance by preserving coherent units.
Sentence splitting forms the foundation. Naive regex-based splitting fails on abbreviations, decimal numbers, and edge cases. Use a tokenizer-aware splitter that understands sentence boundaries in your target language.
Paragraph detection identifies topic shifts within documents. Sections separated by blank lines or headings typically represent distinct concepts. Long paragraphs may contain multiple sub-topics requiring subdivision.
Hierarchical merging combines short segments up to a target size while respecting semantic boundaries. Merge paragraphs until reaching the target chunk size, but stop at heading boundaries.
import re
from typing import List, Tuple
def semantic_chunk(
text: str,
min_chunk_size: int = 100,
max_chunk_size: int = 500,
overlap: int = 50
) -> List[Tuple[str, dict]]:
"""Split text into semantically coherent chunks with overlap."""
# Split into paragraphs at blank lines
paragraphs = re.split(r'\n\s*\n', text)
chunks = []
current_chunk = []
current_length = 0
for para in paragraphs:
para = para.strip()
if not para:
continue
para_length = len(para)
# If single paragraph exceeds max, split by sentences
if para_length > max_chunk_size:
sentences = split_into_sentences(para)
for sentence in sentences:
if current_length + len(sentence) > max_chunk_size:
if current_chunk:
chunks.append(('\n'.join(current_chunk), {}))
current_chunk = []
current_length = 0
# Carry overlap
if overlap > 0 and current_chunk:
current_chunk = current_chunk[-1:]
current_chunk.append(sentence)
current_length += len(sentence)
else:
if current_length + para_length > max_chunk_size:
chunks.append(('\n'.join(current_chunk), {}))
# Start new chunk with overlap from previous
overlap_text = '\n'.join(current_chunk)[-overlap:] if overlap else ''
current_chunk = [overlap_text, para] if overlap_text else [para]
current_length = len(overlap_text) + para_length
else:
current_chunk.append(para)
current_length += para_length
if current_chunk:
chunks.append(('\n'.join(current_chunk), {}))
return chunks
def split_into_sentences(text: str) -> List[str]:
"""Use punctuation-aware sentence splitting."""
# Split on sentence-ending punctuation followed by space and uppercase
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
return [s.strip() for s in sentences if s.strip()]
Implement semantic_chunk and compare results against fixed-size chunking on a technical blog post. Count how many code blocks, lists, or distinct topics span multiple chunks in each approach.