How to Use Semantic Chunking with Embedding Similarity
Prerequisites: sentence-transformers installed and a sample text to chunk.
What this does
Semantic chunking groups sentences or paragraphs based on embedding vector similarity rather than fixed token counts. Consecutive text segments with embedding similarity above a defined threshold are grouped into the same chunk. This produces boundaries that align with semantic shifts in the content, yielding chunks that are more coherent for retrieval than uniform-size splits.
Steps
- Load your text corpus and split it into individual sentences using a sentence segmentation library such as NLTK or spaCy.
- Initialize a sentence-transformer model (e.g., all-MiniLM-L6-v2) and encode each sentence into a dense vector.
- Compute cosine similarity between the embedding of the current sentence and the previous one.
- If the similarity meets or exceeds your threshold (start with 0.7), append the sentence to the current chunk. Otherwise, close the current chunk and begin a new one.
- Optionally merge chunks that fall below a minimum size threshold with their neighbors to avoid overly short fragments.
- Collect the resulting chunks and inspect their boundaries against the source text to verify semantic coherence.
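The steps above can be sketched in a few small functions. This is a minimal illustration, not a definitive implementation: the function names (embed_sentences, semantic_chunks, merge_small_chunks), the min_sentences parameter, and the choice to keep a sentence in the current chunk when similarity equals the threshold are all assumptions made for this example. Sentence segmentation is assumed to have been done already.

```python
import math


def embed_sentences(sentences, model_name="all-MiniLM-L6-v2"):
    """Encode sentences into dense vectors with sentence-transformers."""
    # Imported lazily so the pure-Python helpers below work without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(sentences)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_chunks(sentences, embeddings, threshold=0.7):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim >= threshold:  # assumption: equality keeps the sentence in the chunk
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return chunks


def merge_small_chunks(chunks, min_sentences=2):
    """Fold chunks shorter than min_sentences into a neighboring chunk."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk) < min_sentences:
            merged[-1].extend(chunk)  # merge into the preceding chunk
        else:
            merged.append(list(chunk))
    # A short first chunk has no predecessor, so fold it into its successor.
    if len(merged) > 1 and len(merged[0]) < min_sentences:
        merged[1] = merged[0] + merged[1]
        merged.pop(0)
    return merged
```

With a list of sentences in hand, a typical call would be merge_small_chunks(semantic_chunks(sentences, embed_sentences(sentences))). Merging into the preceding chunk is only one possible policy; merging toward the more similar neighbor is another reasonable choice.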
Verification
Print each chunk with its sentence count and token estimate. Verify that semantic boundaries occur at topic transitions. For the sentence "The migration completed successfully." followed by "Next, we configure the load balancer.", the similarity should fall below the threshold and trigger a split.
Expected output: Chunk 1 contains sentences 1-5 (similarity average 0.84), Chunk 2 begins at sentence 6 (similarity drop to 0.31). The chunk boundary aligns with a topic shift in the source text.
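A small reporting helper can make this check easier. The sketch below assumes chunks are lists of sentence strings; the name chunk_report and the four-characters-per-token estimate are assumptions for illustration, not an exact tokenizer count.

```python
def chunk_report(chunks):
    """Return one summary line per chunk: index, sentence count, token estimate."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        text = " ".join(chunk)
        # Rough heuristic (assumption): about 4 characters per token in English.
        est_tokens = max(1, len(text) // 4)
        lines.append(f"Chunk {i}: sentences={len(chunk)}, est_tokens={est_tokens}")
    return lines


if __name__ == "__main__":
    example = [
        ["The migration completed successfully."],
        ["Next, we configure the load balancer."],
    ]
    for line in chunk_report(example):
        print(line)
```

For a closer token count, the estimate could be replaced with the tokenizer of whichever model will consume the chunks.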
Common failures
- Threshold set too aggressively: A threshold of 0.9 or higher produces tiny chunks because even closely related sentences rarely score that high. Lower the threshold incrementally if chunks are averaging fewer than three sentences.
- Sentence boundary detection errors: Abbreviations like "e.g." or "Dr." cause incorrect splits. Use a robust sentence tokenizer and preprocess the text to normalize these patterns before segmentation.
- Model selection mismatch: A different embedding model (for example, a larger one) can shift the similarity distribution, so an existing threshold behaves differently. Before switching models, compute adjacent-sentence similarities across a sample with the new model, compare them against your existing threshold, and adjust the cutoff accordingly.
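One way to compare models is to look at the distribution of adjacent-sentence similarities on a sample. The sketch below is an assumption-laden illustration: the helper names and the idea of picking the cutoff as a percentile of the observed similarities are not from this guide, and the embeddings are assumed to be plain lists (or arrays) of floats.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def adjacent_similarities(embeddings):
    """Similarity between each sentence embedding and the one before it."""
    return [
        cosine_similarity(embeddings[i - 1], embeddings[i])
        for i in range(1, len(embeddings))
    ]


def summarize(sims):
    """Basic distribution summary for a non-empty list of similarities."""
    return {"min": min(sims), "median": statistics.median(sims), "max": max(sims)}


def percentile_threshold(sims, fraction=0.25):
    """Pick a cutoff so that roughly `fraction` of adjacent pairs fall below it."""
    ordered = sorted(sims)
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]
```

Running summarize on the same sample with the old and new model shows whether the distribution has shifted. The fraction passed to percentile_threshold controls how often a split occurs, so it is worth checking the resulting chunks against the source text rather than trusting the number alone.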
Related guides
- create-context-aware-chunks-parent
- implement-dynamic-chunk-sizing