Semantic Chunking at Scale — Advanced RAG — Chunking, Retrieval, Re-ranking (Chapter 2)

Fixed-size chunking ignores document structure: a chunk boundary can fall mid-sentence or separate a claim from its supporting detail. Semantic chunking instead splits at natural topic boundaries, which keeps each chunk a coherent unit and tends to improve retrieval relevance.

Sentence splitting forms the foundation. Naive regex-based splitting fails on abbreviations ("Dr.", "e.g."), decimal numbers ("3.14"), and other edge cases such as initials and ellipses. Use a tokenizer-aware splitter that understands sentence boundaries in your target language.
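
To make the abbreviation problem concrete, here is a minimal sketch of a splitter that first cuts at every candidate boundary and then re-joins pieces that end in a known abbreviation. The names split_sentences and ABBREVIATIONS are illustrative assumptions, and the abbreviation list is deliberately tiny; it is not a substitute for a trained, language-specific tokenizer.

```python
import re
from typing import List

# Hypothetical, deliberately small abbreviation list; extend it for your corpus.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "vs", "etc", "e.g", "i.e", "fig"}

# Candidate boundary: sentence-ending punctuation followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> List[str]:
    """Split at candidate boundaries, then re-join pieces ending in an abbreviation."""
    sentences: List[str] = []
    buffer = ""
    for piece in _BOUNDARY.split(text.strip()):
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        last_word = buffer.split()[-1].rstrip(".!?").lower()
        # Keep accumulating when the text so far ends with a known abbreviation.
        if buffer.endswith(".") and last_word in ABBREVIATIONS:
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```

Because the boundary pattern requires whitespace after the punctuation, decimals such as 3.14 are never split; abbreviations are handled only if they appear in the list, which is the main limitation of this approach.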

Paragraph detection identifies topic shifts within documents. Sections separated by blank lines or headings typically represent distinct concepts. Long paragraphs may contain multiple sub-topics requiring subdivision.
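
A rough sketch of this step, assuming Markdown-style headings that start with one to six '#' characters (the source format is an assumption, as are the names detect_blocks and _HEADING). It splits on blank lines and labels each block as a heading or a paragraph:

```python
import re
from typing import List, Tuple

_HEADING = re.compile(r"^#{1,6}\s+\S")  # assumption: Markdown-style '#' headings


def detect_blocks(text: str) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs where kind is 'heading' or 'paragraph'."""
    blocks: List[Tuple[str, str]] = []
    for raw in re.split(r"\n\s*\n", text):
        block = raw.strip()
        if not block:
            continue
        lines = block.split("\n")
        if _HEADING.match(lines[0]):
            blocks.append(("heading", lines[0].strip()))
            # A heading followed directly by body text (no blank line) is split in two.
            rest = "\n".join(lines[1:]).strip()
            if rest:
                blocks.append(("paragraph", rest))
        else:
            blocks.append(("paragraph", block))
    return blocks
```

Other formats (HTML, reStructuredText, PDF extractions) need their own heading test; only the _HEADING pattern would change.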

Hierarchical merging combines short segments up to a target size while respecting semantic boundaries. Merge adjacent paragraphs until the next one would exceed the target chunk size, and always close the chunk at a heading so that no chunk spans two sections.

import re
from typing import List, Tuple

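def semantic_chunk(
    text: str,
    min_chunk_size: int = 100,
    max_chunk_size: int = 500,
    overlap: int = 50
) -> List[Tuple[str, dict]]:
    """Split text into semantically coherent chunks with overlap.

    Sizes and overlap are measured in characters. Returns a list of
    (chunk_text, metadata) tuples; metadata is left as an empty dict.
    Note: min_chunk_size is accepted but not enforced by this function.
    """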
    
    # Split into paragraphs at blank lines
    paragraphs = re.split(r'\n\s*\n', text)
    
    chunks = []
    current_chunk = []
    current_length = 0
    
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
            
        para_length = len(para)
        
        # If single paragraph exceeds max, split by sentences
        if para_length > max_chunk_size:
            sentences = split_into_sentences(para)
            for sentence in sentences:
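                if current_chunk and current_length + len(sentence) > max_chunk_size:
                    chunks.append(('\n'.join(current_chunk), {}))
                    # Carry overlap: take the tail of the chunk just emitted
                    # before the buffer is reset (same character-based tail
                    # as the paragraph branch below)
                    overlap_text = '\n'.join(current_chunk)[-overlap:] if overlap > 0 else ''
                    current_chunk = [overlap_text] if overlap_text else []
                    current_length = len(overlap_text)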
                current_chunk.append(sentence)
                current_length += len(sentence)
        else:
            if current_length + para_length > max_chunk_size:
                chunks.append(('\n'.join(current_chunk), {}))
                # Start new chunk with overlap from previous
                overlap_text = '\n'.join(current_chunk)[-overlap:] if overlap else ''
                current_chunk = [overlap_text, para] if overlap_text else [para]
                current_length = len(overlap_text) + para_length
            else:
                current_chunk.append(para)
                current_length += para_length
    
    if current_chunk:
        chunks.append(('\n'.join(current_chunk), {}))
    
    return chunks

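def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences with a simple punctuation-based regex."""
    # Split where '.', '!' or '?' is followed by whitespace and an uppercase letter.
    # Decimals such as "3.14" are left intact, but an abbreviation followed by a
    # capitalized word ("Dr. Smith") is still split; use a tokenizer-aware
    # splitter when that matters.
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [s.strip() for s in sentences if s.strip()]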
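
The hierarchical merging described earlier stops at heading boundaries, which semantic_chunk does not check: it merges on size alone. One possible way to add that rule is sketched below, assuming Markdown-style '#' headings; merge_within_sections, target_size, and _HEADING are hypothetical names, and overlap is left out to keep the boundary logic visible.

```python
import re
from typing import List

_HEADING = re.compile(r"^#{1,6}\s+\S")  # assumption: Markdown-style '#' headings


def merge_within_sections(text: str, target_size: int = 500) -> List[str]:
    """Greedily merge paragraphs up to target_size characters, never across a heading."""
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        if current:
            chunks.append("\n\n".join(current))
        current, current_len = [], 0

    for raw in re.split(r"\n\s*\n", text):
        block = raw.strip()
        if not block:
            continue
        if _HEADING.match(block):
            flush()  # a heading always closes the chunk in progress
        elif current and current_len + len(block) > target_size:
            flush()  # size limit reached inside a section
        current.append(block)
        current_len += len(block)
    flush()
    return chunks
```

Each heading stays attached to the first chunk of its section, which gives the retriever some topical context for free. A single paragraph longer than target_size is emitted as its own oversized chunk here; in practice it would be passed through sentence-level splitting first.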