HOW-TO · RAG
How to Preprocess Text for Vector Database Ingestion
Target environment
Ubuntu 24.04 · Ollama 0.4.x
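Prerequisites
Raw text documents; Python 3 with the packages imported in the steps below (ollama, numpy, the LangChain Ollama and community integrations, FAISS); the nomic-embed-text model available in Ollama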
What this does
Raw text pulled from documents or web pages often contains noise, such as leftover HTML tags, inconsistent Unicode forms, and irregular whitespace, that degrades embedding quality. Preprocessing turns that text into clean chunks of bounded size that an embedding model can represent accurately. This guide covers cleaning, splitting, deduplication, and batch ingestion into a FAISS-indexed vector store.
Steps
Clean raw text.
import re
import unicodedata

def clean_text(raw):
    # Replace HTML tags with spaces, normalize Unicode, collapse whitespace runs.
    text = re.sub(r"<[^>]+>", " ", raw)
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = clean_text("<p> RAG combines LLMs with context. </p>")
print(repr(cleaned))
# Expected: 'RAG combines LLMs with context.'

Split into semantically coherent chunks.
from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(cleaned)

Remove duplicate chunks using embeddings.
import numpy as np
import ollama

def dedupe_chunks(chunks, threshold=0.95):
    # Nothing to compare when there are no chunks.
    if not chunks:
        return []
    # Embed every chunk, then L2-normalize so the dot product is cosine similarity.
    vectors = [
        ollama.embeddings(model="nomic-embed-text", prompt=c)["embedding"]
        for c in chunks
    ]
    embs = np.array(vectors, dtype="float32")
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sim = unit @ unit.T

    # Greedy pass: keep the first chunk of each near-duplicate group.
    keep = []
    discard = set()
    for i in range(len(chunks)):
        if i in discard:
            continue
        keep.append(chunks[i])
        for j in range(i + 1, len(chunks)):
            if sim[i, j] > threshold:
                discard.add(j)
    return keep

Ingest into FAISS vector store.
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings

# Ingest the deduplicated chunks rather than the raw splitter output.
unique_chunks = dedupe_chunks(chunks)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = FAISS.from_texts(unique_chunks, embedding=embeddings)
print(f"Ingested {vectorstore.index.ntotal} chunks")
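The ingestion step above embeds every chunk in a single from_texts call. For larger corpora it may help to ingest in fixed-size batches, so that one failed embedding request does not discard all progress. The following is a minimal sketch rather than part of the original steps: the helper name iter_batches, the batch size of 64, and the placeholder list texts are all assumptions made for illustration. The commented usage relies on the add_texts method of the LangChain FAISS store.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in for real chunk strings.
texts = [f"chunk {n}" for n in range(150)]
batches = list(iter_batches(texts, 64))
print([len(b) for b in batches])  # prints [64, 64, 22]

# Assumed usage with the FAISS store and embeddings object from the ingestion step:
#   first, *rest = iter_batches(texts, 64)
#   vectorstore = FAISS.from_texts(first, embedding=embeddings)
#   for batch in rest:
#       vectorstore.add_texts(batch)
```

Slicing keeps the original chunk order, so the position of a vector in the index still corresponds to the position of its chunk in the input list.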
Verification
python3 -c "
import ollama
resp = ollama.embeddings(model='nomic-embed-text', prompt='test')
print(f'Embedding dims: {len(resp[\"embedding\"])}')
"
# Expected: Embedding dims: <model-dimension>
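Beyond confirming that the embedding call responds, it can be worth checking the splitter output before ingesting it. The sketch below is an assumption-laden illustration, not part of the original guide: summarize_chunks is a hypothetical helper, the sample list is made-up data, and the 500-character limit mirrors the chunk_size used in the splitting step. It measures characters, not tokens.

```python
def summarize_chunks(chunks, max_chars=500):
    """Return basic size statistics and the indices of problem chunks."""
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "longest": max(lengths, default=0),
        # Chunks longer than the configured character limit.
        "oversized": [i for i, n in enumerate(lengths) if n > max_chars],
        # Chunks that are empty or contain only whitespace.
        "empty": [i for i, c in enumerate(chunks) if not c.strip()],
    }


# Made-up sample: one oversized chunk and one whitespace-only chunk.
sample = ["a" * 120, "b" * 501, "   ", "c" * 500]
report = summarize_chunks(sample)
print(report)
# prints {'count': 4, 'longest': 501, 'oversized': [1], 'empty': [2]}
```

An empty "oversized" and "empty" list on real data suggests the chunks are safe to embed as far as character length is concerned.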
Common failures
- Chunks exceeding embedding model context. Enforce chunk_size below the model's token limit. Note that chunk_size is measured in characters by default, not tokens.
- Excessive overlap between chunks causes retrieval bias. Set overlap to 10–15% of chunk_size.
- Whitespace normalization destroys table structure. Preserve tabular layout with explicit row-splitting.
- Deduplication threshold too aggressive. Calibrate on a sample by inspecting flagged pairs.
- Metadata lost after LangChain round-trip. Verify metadata in reloaded stores.
- Version mismatch. The installed package or runtime differs from the one the command was written for. Check the installed version first, then rerun the smallest verification command.
- Local environment drift. A different service, virtual environment, model, or path is in use than the guide assumes. Print the active binary path and configuration before changing the guide steps.
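To calibrate the deduplication threshold mentioned in the list above, one option is to print the pairs that a given threshold would flag and read them side by side. The sketch below is only an illustration under stated assumptions: cosine and flagged_pairs are hypothetical helpers, and the two-dimensional toy vectors stand in for real embeddings, which you would obtain from the embedding model instead.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def flagged_pairs(chunks, vectors, threshold=0.95):
    """List (similarity, chunk_i, chunk_j) for pairs above threshold, highest first."""
    pairs = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            score = cosine(vectors[i], vectors[j])
            if score > threshold:
                pairs.append((score, chunks[i], chunks[j]))
    return sorted(pairs, key=lambda p: p[0], reverse=True)


# Toy data (hypothetical): the first two vectors point in nearly the same direction.
chunks = ["Install Ollama.", "Install Ollama first.", "Configure FAISS."]
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
for score, left, right in flagged_pairs(chunks, vectors):
    print(f"{score:.3f}  {left!r} ~ {right!r}")
```

If the flagged pairs include chunks that a reader would consider distinct, raise the threshold; if obvious repeats are missing, lower it and rerun on the same sample.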