RUNLOCALAI · v38
© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Preprocess Text for Vector Database Ingestion

intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

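Raw text documents; a vector database set up (FAISS in this guide); Ollama running locally with the nomic-embed-text model pulled; the LangChain, Ollama, NumPy, and FAISS Python packages used in the steps.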

What this does

Raw text pulled from documents or web pages frequently contains noise that degrades embedding quality. Preprocessing transforms unstructured text into clean, uniformly sized chunks that an embedding model can represent accurately. This guide covers cleaning, splitting, deduplication, and batch ingestion into a FAISS-indexed vector store.

Steps

  1. Clean raw text.

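    import html
    import re
    import unicodedata
    
    def clean_text(raw):
        text = re.sub(r"<[^>]+>", " ", raw)         # strip HTML tags
        text = html.unescape(text)                   # decode entities such as &nbsp;
        text = unicodedata.normalize("NFKC", text)   # fold compatibility characters (e.g. non-breaking space)
        text = re.sub(r"\s+", " ", text)             # collapse all whitespace, including newlines
        return text.strip()
    
    cleaned = clean_text("<p>  RAG combines &nbsp; LLMs with context.  </p>")
    print(repr(cleaned))
    # Expected: 'RAG combines LLMs with context.'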
    
  2. Split into semantically coherent chunks.

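    # Current LangChain releases ship the splitters in the langchain-text-splitters package.
    from langchain_text_splitters import RecursiveCharacterTextSplitter
    
    # chunk_size and chunk_overlap are measured in characters, not tokens.
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(cleaned)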
    
  3. Remove duplicate chunks using embeddings.

    import numpy as np
    import ollama
    
    def dedupe_chunks(chunks, threshold=0.95):
        """Drop chunks whose cosine similarity to an earlier kept chunk exceeds threshold."""
        if not chunks:
            return []
        vectors = [ollama.embeddings(model="nomic-embed-text", prompt=c)["embedding"] for c in chunks]
        embs = np.array(vectors, dtype="float32")
        unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # L2-normalize each row
        sim = unit @ unit.T                                        # pairwise cosine similarity
        keep = []
        discard = set()
        for i in range(len(chunks)):
            if i in discard:
                continue
            keep.append(chunks[i])
            for j in range(i + 1, len(chunks)):
                if sim[i, j] > threshold:
                    discard.add(j)
        return keep
    
  4. Ingest into FAISS vector store.

    from langchain_ollama import OllamaEmbeddings
    from langchain_community.vectorstores import FAISS
    
    unique_chunks = dedupe_chunks(chunks)  # ingest the deduplicated chunks from step 3
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    vectorstore = FAISS.from_texts(unique_chunks, embedding=embeddings)
    print(f"Ingested {vectorstore.index.ntotal} chunks")
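
The clean_text function in step 1 collapses every run of whitespace, newlines included, so any row structure in the input is gone before splitting. For documents with tables or line-oriented data, a line-wise variant may be a better fit. The sketch below is one way to do that; the function name clean_text_keep_rows and the sample input are hypothetical and not part of the original guide.

```python
import html
import re
import unicodedata


def clean_text_keep_rows(raw: str) -> str:
    """Clean text line by line so row boundaries survive.

    Hypothetical variant of clean_text: same tag stripping and NFKC
    normalization, but whitespace is collapsed within each line only.
    """
    text = re.sub(r"<[^>]+>", " ", raw)         # strip HTML tags
    text = html.unescape(text)                   # decode entities such as &nbsp;
    text = unicodedata.normalize("NFKC", text)
    rows = []
    for line in text.splitlines():
        line = re.sub(r"\s+", " ", line).strip()  # collapse spaces inside the row
        if line:                                   # drop rows that are now empty
            rows.append(line)
    return "\n".join(rows)


# Illustrative input only: a small pipe-delimited table with ragged spacing.
sample = "name | value\n  a   |  1 \n\nb | 2"
print(clean_text_keep_rows(sample))
```

Because newlines are kept, RecursiveCharacterTextSplitter can split on them before falling back to spaces, which makes it less likely that a row is cut in half.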
    

Verification

python3 -c "
import ollama
resp = ollama.embeddings(model='nomic-embed-text', prompt='test')
print(f'Embedding dims: {len(resp[\"embedding\"])}')
"
# Expected: Embedding dims: <model-dimension> (768 for nomic-embed-text)
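
The command above confirms that the embedding model responds, but it says nothing about the chunks themselves. A quick length report before ingestion can catch oversized chunks early. This is a minimal sketch: the helper name summarize_chunks is hypothetical, and chars_per_token=4 is an assumed rough ratio for estimating tokens, not a measured property of nomic-embed-text.

```python
def summarize_chunks(chunks, max_chars=500, chars_per_token=4):
    """Report chunk-length statistics and flag oversized chunks.

    max_chars mirrors the chunk_size used in step 2; chars_per_token is a
    rough assumed ratio for estimating tokens, not a measured value.
    """
    lengths = [len(c) for c in chunks]
    longest = max(lengths, default=0)
    return {
        "count": len(chunks),
        "longest_chars": longest,
        "est_longest_tokens": longest // chars_per_token,
        # Indexes of chunks longer than the configured limit.
        "oversized": [i for i, n in enumerate(lengths) if n > max_chars],
    }


# Illustrative input: one normal chunk and one that exceeds the limit.
report = summarize_chunks(["short chunk", "x" * 600], max_chars=500)
print(report)
```

An empty "oversized" list is the result to look for; any index it contains points at a chunk worth re-splitting before it reaches the embedding model.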

Common failures

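  • Chunks exceeding the embedding model's context. chunk_size counts characters by default, so keep it well below the model's token limit or measure length in tokens.
  • Excessive chunk overlap. Too much overlap puts the same text in several chunks, which can crowd retrieval results with near-duplicates. Set chunk_overlap to 10–15% of chunk_size (50 of 500, as used in step 2, is 10%).
  • Whitespace normalization destroys table structure. The \s+ substitution in step 1 also collapses newlines; clean tabular text row by row to keep row boundaries.
  • Deduplication threshold too aggressive. A threshold set too low discards chunks that are merely similar; calibrate on a sample by inspecting the flagged pairs.
  • Metadata lost after a LangChain round-trip. FAISS.from_texts keeps only the text unless metadatas is passed; verify metadata in reloaded stores.
  • Version mismatch. The installed package or runtime differs from the command shown; check the version first, then rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.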
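
For the deduplication threshold in particular, calibration can be as simple as listing every pair that would be flagged and reading the texts side by side. The sketch below uses only the standard library; the helper names cosine and flagged_pairs are hypothetical, and the toy 2-D vectors stand in for real embeddings from the model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flagged_pairs(vectors, threshold=0.95):
    """Return (i, j, similarity) for every pair above threshold, most similar first."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim > threshold:
                pairs.append((i, j, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy 2-D vectors stand in for real embeddings (assumption for illustration).
toy = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
for i, j, sim in flagged_pairs(toy, threshold=0.95):
    print(f"{i} ~ {j}: {sim:.3f}")
```

Running this on a sample of real embeddings at a few candidate thresholds shows which chunk pairs each setting would merge, which is usually enough to pick a value that removes true repeats without dropping distinct content.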

Related guides

  • How to Set Up FAISS Index for Similarity Search (setup-faiss-index-similarity-search)
  • How to Extract Text from PDFs Using PyMuPDF (extract-text-pdfs-pymupdf)