What this does

Different embedding models trade off speed, memory, and retrieval accuracy. This guide benchmarks several models on your own dataset so you can select the best fit.

Steps

Pull candidate embedding models.

ollama pull all-minilm
ollama pull nomic-embed-text
ollama pull bge-m3

Create a benchmark script.

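import time

import numpy as np
import requests
from sklearn.metrics import ndcg_score

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(model, text):
    # Request a single embedding from the local Ollama server.
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def evaluate_model(model, queries, relevant_docs, all_docs, k=5):
    # relevant_docs[i] holds one relevance label per entry of all_docs for queries[i].
    # Embed the corpus once per model instead of once per query; this also loads
    # the model, so the timed query requests below are not cold starts.
    doc_vecs = np.array([embed(model, d) for d in all_docs])
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    times, scores = [], []
    for q, rel in zip(queries, relevant_docs):
        start = time.perf_counter()
        q_vec = np.array(embed(model, q))
        times.append(time.perf_counter() - start)
        # Cosine similarity between the query and every document
        sims = doc_vecs @ q_vec / (doc_norms * np.linalg.norm(q_vec))
        # NDCG@k for this query
        scores.append(ndcg_score([rel], [sims], k=k))
    return np.mean(times), np.mean(scores)

# test_queries, test_labels, and test_corpus must be loaded from your own dataset.
K = 5
models = ["all-minilm", "nomic-embed-text", "bge-m3"]
for m in models:
    latency, ndcg = evaluate_model(m, test_queries, test_labels, test_corpus, k=K)
    print(f"{m}: {latency*1000:.1f}ms avg, NDCG@{K}: {ndcg:.3f}")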
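To make the metric less of a black box, here is a minimal standard-library sketch of NDCG@k using linear gain and a log2 rank discount. The function names and the toy data are hypothetical, and the sketch does not average over tied scores the way scikit-learn's ndcg_score can, so treat it as an illustration of the formula rather than a drop-in replacement.

```python
import math

def dcg_at_k(relevance_in_rank_order, k):
    # Discounted cumulative gain: each label is divided by log2(rank + 1),
    # with ranks starting at 1, and only the top k ranks are counted.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance_in_rank_order[:k], start=1))

def ndcg_at_k(relevance, scores, k=5):
    # relevance[i] and scores[i] must refer to the same document.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [relevance[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical toy data: four documents, the first and third are relevant.
toy_relevance = [1, 0, 1, 0]
perfect_scores = [0.9, 0.2, 0.8, 0.1]  # both relevant documents ranked on top
flawed_scores = [0.9, 0.8, 0.2, 0.1]   # one relevant document pushed to rank 3

print(ndcg_at_k(toy_relevance, perfect_scores))
print(ndcg_at_k(toy_relevance, flawed_scores))
```

The first ranking scores 1.0 because it matches the ideal ordering; the second scores lower because a relevant document sits below an irrelevant one.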

Measure memory usage per model.
```
nvidia-smi --query-gpu=memory.used --format=csv,noheader
```
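Take one reading before any model is loaded, then test each model on its own with the same batch workload and take another reading. The difference between the two readings approximates that model's GPU memory footprint.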
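If you would rather collect these readings from Python, the sketch below wraps the same nvidia-smi command and parses its output. The helper names are hypothetical, and the parser assumes each output line looks like "1234 MiB" (one line per GPU), which is what the csv,noheader format produces on the systems I have seen.

```python
import subprocess

def parse_memory_used(nvidia_smi_output):
    # Assumed format: one "<number> MiB" line per GPU.
    values = []
    for line in nvidia_smi_output.strip().splitlines():
        number, unit = line.split()
        if unit != "MiB":
            raise ValueError(f"unexpected unit in line: {line!r}")
        values.append(int(number))
    return values

def gpu_memory_used_mib():
    # Runs the same query as the shell command above; requires an NVIDIA GPU.
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used(result.stdout)

def memory_delta_mib(before, after):
    # Per-GPU difference between two readings, in MiB.
    return [a - b for b, a in zip(before, after)]
```

A typical use would be to call gpu_memory_used_mib() once while idle, once after running the workload for a model, and pass both readings to memory_delta_mib().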

Visualize the trade-off.

import matplotlib.pyplot as plt

# Example values only: replace these with the numbers you measured above.
models = ["all-minilm", "nomic-embed-text", "bge-m3"]
latency = [12, 28, 45]   # ms
ndcg = [0.82, 0.87, 0.91]
memory = [0.5, 1.2, 2.1] # GB

# Marker area scales with memory use, so larger bubbles mean heavier models.
plt.scatter(latency, ndcg, s=[gb * 200 for gb in memory], alpha=0.5)
for i, name in enumerate(models):
    plt.annotate(name, (latency[i], ndcg[i]))
plt.xlabel("Latency (ms)")
plt.ylabel("NDCG@5")
plt.savefig("embedding_comparison.png")

Verification

python benchmark_embeddings.py
# Expected output: one line per model with average latency and NDCG@5.
# Memory is not printed by the script; record it separately with nvidia-smi.

Common failures

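Cold start skews latency: The first request to a model includes model loading time, so send a warm-up request before timing.
Inconsistent NDCG: Each query's relevance judgments must be in the same order as all_docs. Use binary relevance (0/1) for simplicity.
Missing task prefix: Some models expect a prefix on queries, documents, or both. For example, nomic-embed-text documents "search_query: " and "search_document: " prefixes, and earlier BGE models (such as the bge-*-en-v1.5 series) recommend the query instruction "Represent this sentence for searching relevant passages: ", while the bge-m3 model card says no instruction is needed. Check each model's documentation.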
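The warm-up and prefix advice above can be folded into one small timing helper. This is a sketch under assumptions: the prefix table and its "example-model" entry are hypothetical placeholders to be filled in from each model's documentation, and embed is any callable of the form embed(model, text), such as a function that posts to your local Ollama server. A fake embedder stands in here so the example runs without a server.

```python
import time

# Hypothetical prefix table; fill it in from each model's documentation.
QUERY_PREFIXES = {"example-model": "query: "}

def with_prefix(model, query, prefixes=QUERY_PREFIXES):
    # Models without an entry get the query unchanged.
    return prefixes.get(model, "") + query

def timed_embeddings(embed, model, queries, warmup_text="warm-up"):
    # One untimed request first, so model loading is not counted as latency.
    embed(model, warmup_text)
    latencies, vectors = [], []
    for q in queries:
        start = time.perf_counter()
        vectors.append(embed(model, with_prefix(model, q)))
        latencies.append(time.perf_counter() - start)
    return latencies, vectors

# Stand-in embedder that records what it was asked to embed.
calls = []
def fake_embed(model, text):
    calls.append(text)
    return [float(len(text))]

latencies, vectors = timed_embeddings(fake_embed, "example-model", ["what is ndcg?"])
print(calls)
```

Passing the embedder in as an argument keeps the timing logic independent of the HTTP client, which makes it easy to check the warm-up and prefix behavior before pointing it at real models.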

How to compare embedding model performance for your use case
