What this does

Hybrid chunking combines semantic boundaries from an embedding model with fixed-size constraints. The algorithm first splits text into semantically coherent units, then merges adjacent units until they approach the target chunk size; a unit that is larger than the target on its own is split further. The result is chunks that follow topic boundaries where possible and stay within the size limit.
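The merge step described above can be sketched without any library. This is a minimal illustration, not the guide's actual implementation: `merge_units`, `target_size`, and `joiner` are hypothetical names, sizes are measured in characters, and overlap between chunks is left out for brevity.

```python
from typing import List


def merge_units(units: List[str], target_size: int = 512, joiner: str = "\n\n") -> List[str]:
    """Greedily pack adjacent units into chunks of at most target_size characters."""
    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = unit if not current else current + joiner + unit
        if len(candidate) <= target_size:
            # The unit still fits, so keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        # A unit longer than the target is cut into fixed-size slices;
        # the last slice stays open so later units can still join it.
        while len(unit) > target_size:
            chunks.append(unit[:target_size])
            unit = unit[target_size:]
        current = unit
    if current:
        chunks.append(current)
    return chunks


units = ["alpha " * 10, "beta " * 10, "gamma " * 100]
chunks = merge_units(units, target_size=200)
print([len(c) for c in chunks])  # [112, 200, 200, 200]
```

The first two short units end up in one chunk, while the oversized third unit is sliced at the size limit. This is the trade-off hybrid chunking makes: semantic boundaries are kept until the size constraint forces a cut.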

Steps

Set up the embedding model and semantic chunker.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings

embed = OllamaEmbeddings(model="nomic-embed-text")
semantic_chunker = SemanticChunker(embed, breakpoint_threshold_type="gradient")

Pre-split the document semantically.

from langchain_community.document_loaders import TextLoader

loader = TextLoader("/data/article.txt")
docs = loader.load()
semantic_units = semantic_chunker.split_documents(docs)
print(f"Semantic units: {len(semantic_units)}")

Merge units into fixed-size chunks.

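from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters, not tokens.
# "\n\n" comes first so the splitter prefers the joins between semantic units
# and only cuts inside a unit when that unit is larger than chunk_size.
fixed_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=64, separators=["\n\n", "\n", " ", ""],
)
merged_text = "\n\n".join(u.page_content for u in semantic_units)
# All units come from one file here, so the first unit's metadata is reused.
merged_docs = [Document(page_content=merged_text, metadata=dict(semantic_units[0].metadata))]
hybrid_chunks = fixed_splitter.split_documents(merged_docs)
print(f"Hybrid chunks: {len(hybrid_chunks)}")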

Tag chunks with semantic origin.

for chunk in hybrid_chunks:
    chunk.metadata["strategy"] = "hybrid_semantic_fixed"
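The loop above records only the strategy name. If each chunk should also point back to the semantic unit it starts in, one option is to map character offsets to unit indices. The sketch below is an assumption-laden illustration: `unit_start_offsets` and `unit_index_at` are hypothetical helpers, and it presumes the units were joined with "\n\n" and that a start offset is known for each chunk (for example from the `start_index` metadata that LangChain text splitters add when constructed with `add_start_index=True`).

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List


def unit_start_offsets(unit_texts: List[str], joiner: str = "\n\n") -> List[int]:
    """Character offset at which each unit starts inside joiner.join(unit_texts)."""
    lengths = [len(t) + len(joiner) for t in unit_texts]
    return [0] + list(accumulate(lengths))[:-1]


def unit_index_at(offsets: List[int], position: int) -> int:
    """Index of the unit whose span (including its trailing joiner) holds position."""
    return max(bisect_right(offsets, position) - 1, 0)


unit_texts = ["first unit", "second unit", "third"]
offsets = unit_start_offsets(unit_texts)
print(offsets)                     # [0, 12, 25]
print(unit_index_at(offsets, 15))  # 1
```

A chunk's unit index could then be stored next to the strategy tag, for example under a hypothetical `semantic_unit` metadata key.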

Verification

python -c "
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings
e = OllamaEmbeddings(model='nomic-embed-text')
c = SemanticChunker(e)
print('Hybrid chunker ready')
"
# Expected: Hybrid chunker ready
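The command above only confirms that the imports and constructors work. A complementary check, sketched here under the assumption that chunk size is measured in characters, is to summarize the lengths of the produced chunks. `chunk_size_report` is a hypothetical helper that takes plain strings, such as the `page_content` of each hybrid chunk.

```python
from statistics import mean
from typing import Dict, List


def chunk_size_report(texts: List[str], max_size: int = 512) -> Dict[str, float]:
    """Summarize chunk lengths and count chunks longer than max_size."""
    sizes = [len(t) for t in texts]
    if not sizes:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "oversized": 0}
    return {
        "count": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "oversized": sum(s > max_size for s in sizes),
    }


report = chunk_size_report(["a" * 500, "b" * 512, "c" * 600])
print(report)
```

With real output, it could be called as `chunk_size_report([c.page_content for c in hybrid_chunks])`; a non-zero `oversized` count would suggest the fixed-size pass is not being applied as intended.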

Common failures

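Single enormous hybrid chunk - The semantic units are too large for the merge step; lower breakpoint_threshold_amount so SemanticChunker produces more, smaller units.
Ollama embedding timeout on long text - A unit exceeds the embedding model's context window; pre-split long input before it reaches SemanticChunker.
Metadata lost on merged chunks - The merge step keeps only one metadata dict; re-attach the source metadata in a post-processing step.
ImportError for langchain_experimental - The package is not installed in the active environment; run pip install langchain-experimental.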
Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
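For the "Metadata lost on merged chunks" failure, the post-processing step could look roughly like the sketch below. It uses a small stand-in class instead of a real LangChain document so it runs without dependencies; `Chunk`, `reattach_metadata`, and the `chunk_index` key are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class Chunk:
    """Stand-in for a LangChain Document: text plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def reattach_metadata(chunks: Iterable[Chunk], source_metadata: Dict[str, Any]) -> List[Chunk]:
    """Return new chunks carrying the source metadata plus a running index.

    Keys already present on a chunk win over keys from the source metadata.
    """
    result = []
    for i, chunk in enumerate(chunks):
        merged = {**source_metadata, **chunk.metadata, "chunk_index": i}
        result.append(Chunk(chunk.page_content, merged))
    return result


chunks = [Chunk("first part"), Chunk("second part", {"strategy": "hybrid_semantic_fixed"})]
fixed = reattach_metadata(chunks, {"source": "/data/article.txt"})
print(fixed[1].metadata)
```

Returning new objects instead of mutating the inputs keeps the original chunks available for comparison while debugging.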

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
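One way to capture that record is a short script that prints the interpreter path, platform, and installed package versions. This is a sketch using only the standard library; `environment_record` is a hypothetical helper, and the package names passed in are assumptions based on the imports used in this guide.

```python
import json
import platform
import sys
from importlib import metadata
from typing import Dict, Iterable


def environment_record(packages: Iterable[str]) -> Dict[str, object]:
    """Collect interpreter, platform, and package versions for the run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # None marks a package that is missing from the active environment.
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "executable": sys.executable,
        "platform": platform.platform(),
        "packages": versions,
    }


record = environment_record(
    ["langchain-experimental", "langchain-ollama", "langchain-text-splitters"]
)
print(json.dumps(record, indent=2))
```

The embedding model name and the verification output are not visible to this script, so they still need to be noted alongside it by hand.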

How to Implement Hybrid Chunking (Semantic + Fixed Size)
