What this does

Raw text pulled from documents or web pages frequently contains noise, such as leftover HTML tags, character entities, irregular whitespace, and repeated passages, that degrades embedding quality. Preprocessing turns this unstructured text into clean, uniformly sized chunks that an embedding model can represent accurately. This guide covers cleaning, splitting, deduplication, and ingestion into a FAISS-indexed vector store.

Steps

Clean raw text.

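import html
import re
import unicodedata

def clean_text(raw):
    text = re.sub(r"<[^>]+>", " ", raw)          # strip HTML tags
    text = html.unescape(text)                   # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters, e.g. non-breaking spaces
    text = re.sub(r"\s+", " ", text)             # collapse every whitespace run to one space
    return text.strip()

cleaned = clean_text("<p>  RAG combines &nbsp; LLMs with context.  </p>")
print(repr(cleaned))
# Expected: 'RAG combines LLMs with context.'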
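
The `\s+` substitution in `clean_text` also collapses newlines, which removes the paragraph and line breaks that a separator-based splitter can use and flattens any tabular rows into one line. The sketch below is a hypothetical variant (`clean_text_keep_lines` is not part of the original guide) that collapses spaces and tabs within each line but keeps line breaks, assuming the input uses newlines to separate paragraphs and table rows.

```python
import html
import re
import unicodedata

def clean_text_keep_lines(raw):
    """Hypothetical variant of clean_text that keeps line and paragraph breaks."""
    text = re.sub(r"<[^>]+>", " ", raw)          # strip HTML tags
    text = html.unescape(text)                   # decode entities such as &amp;
    text = unicodedata.normalize("NFKC", text)   # fold compatibility characters
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs inside each line, but leave newlines alone.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    text = "\n".join(lines)
    # Squeeze three or more consecutive newlines down to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "<p>First  paragraph.</p>\n\n\n<p>Second &amp; last.</p>\nname\tqty"
print(repr(clean_text_keep_lines(sample)))
# 'First paragraph.\n\nSecond & last.\nname qty'
```

Column separators such as tabs still become single spaces here; only the row boundaries survive, so tables with meaningful column alignment may need their own handling.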

Split into chunks at natural boundaries. The splitter counts chunk_size and chunk_overlap in characters by default, not tokens.

# Recent LangChain releases ship the splitters in the langchain-text-splitters package.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(cleaned)
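
To make the two parameters concrete, here is a deliberately simplified fixed-width splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators); `sliding_window_chunks` is a hypothetical helper that only illustrates what a character-based `chunk_size` and `chunk_overlap` mean.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=50):
    """Hypothetical fixed-width splitter; illustrates size and overlap only."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(demo)
# ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, which is the role overlap plays at larger scale: context that straddles a boundary appears in both neighbours.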

Remove duplicate chunks using embeddings.

import numpy as np
import ollama

def dedupe_chunks(chunks, threshold=0.95):
    if not chunks:
        return []
    # One embedding request per chunk.
    vectors = [ollama.embeddings(model="nomic-embed-text", prompt=c)["embedding"] for c in chunks]
    embs = np.array(vectors, dtype="float32")
    # Scale each row to unit length so the dot product equals cosine similarity.
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = unit @ unit.T
    keep = []
    discard = set()
    for i in range(len(chunks)):
        if i in discard:
            continue
        keep.append(chunks[i])
        # Drop every later chunk that is too similar to the one just kept.
        for j in range(i + 1, len(chunks)):
            if sim[i, j] > threshold:
                discard.add(j)
    return keep
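
The embedding-based pass issues one model request per chunk and compares every pair, so it can be worth removing verbatim repeats first. The following is a minimal sketch of such a pre-filter, assuming that chunks differing only in case or whitespace should count as identical; `drop_exact_duplicates` is a hypothetical helper, not part of the original guide.

```python
import hashlib

def drop_exact_duplicates(chunks):
    """Hypothetical pre-filter: keep the first occurrence of each normalized chunk."""
    seen = set()
    unique = []
    for chunk in chunks:
        # Normalize case and whitespace so trivially different copies hash the same.
        key = " ".join(chunk.lower().split())
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

deduped = drop_exact_duplicates(
    ["RAG uses context.", "rag  uses context.", "FAISS stores vectors."]
)
print(deduped)
# ['RAG uses context.', 'FAISS stores vectors.']
```

The result can then be passed to `dedupe_chunks` so that the similarity matrix is only built over chunks that are not already literal copies.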

Ingest into FAISS vector store.

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OllamaEmbeddings(model="nomic-embed-text")
unique_chunks = dedupe_chunks(chunks)  # ingest the deduplicated chunks, not the raw split
vectorstore = FAISS.from_texts(unique_chunks, embedding=embeddings)
print(f"Ingested {vectorstore.index.ntotal} chunks")
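
For larger corpora, ingesting in batches keeps each embedding request bounded. The sketch below assumes only that the store exposes an `add_texts(list_of_str)` method, which LangChain vector stores provide; `batched`, `ingest_in_batches`, the default `batch_size`, and the `InMemoryStore` stand-in are hypothetical and exist so the example runs without Ollama or FAISS.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def ingest_in_batches(store, texts, batch_size=64):
    """Add texts to any store exposing add_texts(list_of_str); return the count added."""
    added = 0
    for batch in batched(texts, batch_size):
        store.add_texts(batch)
        added += len(batch)
    return added

class InMemoryStore:
    """Stand-in for a vector store so the sketch runs without external services."""
    def __init__(self):
        self.texts = []
        self.calls = 0

    def add_texts(self, texts):
        self.texts.extend(texts)
        self.calls += 1

store = InMemoryStore()
total = ingest_in_batches(store, [f"chunk {i}" for i in range(10)], batch_size=4)
print(total, store.calls)
# 10 3
```

With a real FAISS store, one plausible pattern is to build the store from the first batch with `FAISS.from_texts` and pass the remaining chunks to `ingest_in_batches`.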

Verification

python3 -c "
import ollama
resp = ollama.embeddings(model='nomic-embed-text', prompt='test')
print(f'Embedding dims: {len(resp[\"embedding\"])}')
"
# Expected: Embedding dims: <model-dimension>

Common failures

Chunks exceed the embedding model's context window. chunk_size counts characters by default, not tokens, so keep it comfortably below the model's token limit.
Large overlap biases retrieval toward repeated text. Set chunk_overlap to 10–15% of chunk_size.
Whitespace normalization destroys table structure. Split tables into rows before collapsing whitespace so that each row stays intact.
Deduplication threshold too aggressive. Calibrate it on a sample by inspecting the pairs flagged as duplicates.
Metadata lost after a LangChain round-trip. After saving and reloading a store, check that each document's metadata is still present.
Version mismatch. The installed package or runtime differs from the one the command assumes; check the version first, then rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is in use; print the active binary path and configuration before changing the guide steps.
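
Calibrating the deduplication threshold is easier with a listing of the pairs it would flag. The following is a rough sketch under the assumption that embeddings for a sample of chunks are already available as plain lists of floats; `cosine` and `flagged_pairs` are hypothetical helpers, and the toy 2-D vectors stand in for real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def flagged_pairs(chunks, vectors, threshold=0.95):
    """Return (similarity, chunk_i, chunk_j) for pairs above threshold, highest first."""
    pairs = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            sim = cosine(vectors[i], vectors[j])
            if sim > threshold:
                pairs.append((sim, chunks[i], chunks[j]))
    return sorted(pairs, key=lambda p: p[0], reverse=True)

# Toy 2-D vectors stand in for real embeddings.
sample_chunks = ["a", "b", "c"]
sample_vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
pairs = flagged_pairs(sample_chunks, sample_vectors)
for sim, left, right in pairs:
    print(f"{sim:.3f}  {left!r} ~ {right!r}")
```

Reading the flagged pairs at a few candidate thresholds shows where genuinely distinct chunks start being treated as duplicates.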

How to Preprocess Text for Vector Database Ingestion
