Dense Retrieval — RAG Systems: Part 1 (Chapter 13)

Dense retrieval converts text into high-dimensional vectors where semantically similar content clusters together. The search operation becomes finding the nearest neighbors in this vector space.

How Embedding Models Work

An embedding model maps text to a fixed-size vector (typically 384 to 1536 dimensions). The model encodes semantic meaning, not word frequency. "The cat sat on the mat" and "A feline rested on the rug" share almost no content words, yet they produce vectors that sit much closer together than their word overlap suggests.

from your_rag_library import EmbeddingModel

model = EmbeddingModel(
    name="BAAI/bge-large-en-v1.5",
    dimension=1024,
    batch_size=32
)

# Encode single text
query_vector = model.encode("How does authentication work?")
print(f"Vector shape: {query_vector.shape}")  # (1024,)

# Encode batch
texts = ["Text chunk 1", "Text chunk 2", "Text chunk 3"]
chunk_vectors = model.encode(texts)
print(f"Batch shape: {chunk_vectors.shape}")  # (3, 1024)
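To make "closer together" concrete, closeness between two embeddings is commonly measured with cosine similarity. The sketch below uses NumPy and hand-made 4-dimensional toy vectors; the vectors and variable names are illustrative assumptions, not output from any real model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made toy vectors standing in for real embeddings (assumption).
cat_on_mat = np.array([0.9, 0.8, 0.1, 0.0])
feline_on_rug = np.array([0.8, 0.9, 0.2, 0.1])
tax_filing = np.array([0.0, 0.1, 0.9, 0.8])

sim_paraphrase = cosine_similarity(cat_on_mat, feline_on_rug)
sim_unrelated = cosine_similarity(cat_on_mat, tax_filing)
print(f"paraphrase: {sim_paraphrase:.3f}, unrelated: {sim_unrelated:.3f}")
```

With a real model, the two toy paraphrase vectors would be replaced by the outputs of two encode calls; the similarity function stays the same.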

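The model choice matters. Open models such as BGE, E5, and GTE, all available on Hugging Face, outperform OpenAI's text-embedding-ada-002 on many benchmarks. The MTEB leaderboard is a useful starting point for comparing models, though a ranking on public benchmarks does not always carry over to your own data.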

Vector Search Implementation

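Vector search at scale relies on approximate nearest neighbor (ANN) algorithms. Exact search compares the query against every stored vector, so its cost grows linearly with corpus size; on 1 million vectors a single query can take seconds. ANN indexes trade a small amount of recall for speed: a well-tuned index typically returns around 99% of the true nearest neighbors in milliseconds.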
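For reference, the exact search that ANN approximates can be sketched in a few lines of NumPy. This is a brute-force baseline under assumed toy data (random vectors, a hypothetical `exact_top_k` helper), not the method used by any particular vector store.

```python
import numpy as np

def exact_top_k(query: np.ndarray, vectors: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k vectors most similar to query (cosine), best first."""
    # Normalize so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q                             # one score per stored vector
    top = np.argpartition(-scores, k - 1)[:k]  # top k, in arbitrary order
    return top[np.argsort(-scores[top])]       # order those k by score

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 64)).astype(np.float32)
# A query that is a slightly perturbed copy of row 42.
query = corpus[42] + 0.01 * rng.normal(size=64).astype(np.float32)

nearest = exact_top_k(query, corpus, k=5)
print(nearest)
```

Every query touches all stored vectors, which is why this approach stops scaling. It remains useful as ground truth when measuring how many true neighbors an ANN index recovers.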

from your_rag_library import VectorStore, HNSWIndex

store = VectorStore(
    dimension=1024,
    index_type=HNSWIndex,
    index_params={
        "m": 16,      # Connections per node (higher = better recall, more memory, slower build)
        "ef_construction": 200,  # Build-time quality (higher = better graph, slower build)
        "ef_search": 50    # Search-time quality (higher = better recall, slower queries)
    }
)

# Add chunks to store
store.add_vectors(
    ids=["chunk_1", "chunk_2", "chunk_3"],
    vectors=chunk_vectors
)

# Search
results = store.search(
    query_vector=query_vector,
    top_k=10,
    metric="cosine"  # or "dot" or "l2"
)

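A setting of m=16 works well for most use cases. If recall drops below 95%, increase m to 32 or 64, at the cost of a larger index and slower builds. Raising ef_search from 50 to 100 also improves recall, at the cost of search latency, and is a query-time setting, so it does not require rebuilding the index.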
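Tuning against a recall target requires measuring recall first. One common approach is to run a sample of queries through both exact search and the ANN index, then compare the returned IDs. The helper below is a minimal sketch; `recall_at_k` and the chunk IDs are hypothetical names used for illustration.

```python
def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact top-k neighbors that the ANN search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical results for one query at top_k=10.
exact_ids = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
ann_ids = ["c1", "c2", "c3", "c5", "c4", "c6", "c7", "c8", "c9", "c17"]

recall = recall_at_k(ann_ids, exact_ids)
print(f"recall@10 = {recall:.2f}")  # recall@10 = 0.90
```

Order within the top k is ignored here: the ANN result swaps c4 and c5 but only loses credit for returning c17 in place of c10. Averaging this value over a few hundred representative queries gives a more stable estimate than any single query.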

Filtering with Metadata

Pure vector search ignores document metadata. Production queries often need metadata filters: "Only documents from the last 30 days" or "Only from the API documentation section."

results = store.search(
    query_vector=query_vector,
    top_k=10,
    filters={
        "category": "API",
        "last_updated": {"$gte": "2024-01-01"},
        "version": {"$in": ["v2", "v3"]}
    }
)

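Where the filter runs matters. Applied as a post-processing step, the filter discards non-matching chunks from the top_k vector results, so a selective filter can leave fewer than top_k results, or none. Applied as a pre-filter, it narrows the candidate set before vector comparison, which keeps the result list full whenever enough chunks match. HNSW indexes handle millions of vectors either way, but the two strategies return different results, so check which one your vector store uses.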
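The difference between filtering after and before the vector comparison can be seen on a toy example. The sketch below uses NumPy with made-up similarity scores and a single metadata field; it illustrates the two strategies in general and does not describe how any specific store implements filters.

```python
import numpy as np

def top_k(scores: np.ndarray, candidates: np.ndarray, k: int) -> list[int]:
    """Best-scoring k indices among candidates, highest score first."""
    order = np.argsort(-scores[candidates])[:k]
    return candidates[order].tolist()

# Toy data: a similarity score and a category for each of 8 chunks.
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.40, 0.35, 0.30, 0.20])
category = np.array(["blog", "blog", "blog", "API", "API", "blog", "API", "API"])
all_ids = np.arange(len(scores))
k = 3

# Post-filter: take the top k by similarity, then drop non-matching chunks.
post = [i for i in top_k(scores, all_ids, k) if category[i] == "API"]

# Pre-filter: restrict to matching chunks first, then take the top k.
pre = top_k(scores, all_ids[category == "API"], k)

print("post-filter:", post)  # post-filter: []
print("pre-filter: ", pre)   # pre-filter:  [3, 4, 6]
```

Here the three most similar chunks are all blog posts, so post-filtering returns nothing even though four API chunks exist. A common mitigation for post-filtering is to over-fetch (request several times top_k) before applying the filter.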

Embedding Model Training

Pre-trained models work well for general content. Domain-specific content often benefits from fine-tuning on your own data. Even 100 to 1,000 curated query-document pairs can improve retrieval quality noticeably:

from your_rag_library import FineTuner

fine_tuner = FineTuner(
    base_model="BAAI/bge-large-en-v1.5",
    negative_strategy="hard"  # Mine hard negatives from corpus
)

fine_tuner.train(
    train_data="training_pairs.jsonl",  # Format: {"query": "", "positive": "", "negative": ""}
    epochs=3,
    learning_rate=2e-5,
    batch_size=16
)

# Export fine-tuned model
fine_tuner.save("models/my-rag-embedder")
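The training file pairs each query with a positive and a negative chunk. One way to pick a "hard" negative is to take the chunk the current model scores highest for the query among those that are not the labeled positive. The sketch below shows that idea with NumPy and toy 2-dimensional vectors; `mine_hard_negative`, the example texts, and the vectors are assumptions for illustration, and the actual mining strategy of any given fine-tuning tool may differ.

```python
import json
import numpy as np

def mine_hard_negative(query_vec: np.ndarray, chunk_vecs: np.ndarray,
                       positive_idx: int) -> int:
    """Index of the highest-scoring chunk that is not the labeled positive."""
    scores = chunk_vecs @ query_vec   # dot product; assumes unit-length vectors
    scores[positive_idx] = -np.inf    # never pick the positive itself
    return int(np.argmax(scores))

chunks = [
    "OAuth tokens expire after one hour.",
    "API keys are rotated from the dashboard.",
    "The cafeteria opens at 8 am.",
]
# Toy unit vectors standing in for real embeddings of the chunks above.
chunk_vecs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
query = "How long is an OAuth token valid?"
query_vec = np.array([1.0, 0.0])

neg_idx = mine_hard_negative(query_vec, chunk_vecs, positive_idx=0)
record = {"query": query, "positive": chunks[0], "negative": chunks[neg_idx]}
line = json.dumps(record)  # one line of a file like training_pairs.jsonl
print(line)
```

The mined negative is the API-key chunk, which is topically close but does not answer the query, rather than the unrelated cafeteria chunk. Mined negatives are worth spot-checking, since a high-scoring chunk is sometimes a second valid answer instead of a true negative.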

Without fine-tuning, general-purpose embedders tend to struggle with domain vocabulary: medical RAG systems often mishandle specialized terminology, and legal RAG systems often miss case citations and statute references.