HOW-TO · INF
How to compare embedding model performance for your use case
PREREQUISITES
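A running Ollama instance with multiple embedding models available; a test dataset of queries, documents, and relevance labels; Python with requests, numpy, scikit-learn, and matplotlib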
What this does
Different embedding models trade off speed, memory, and retrieval accuracy. This guide benchmarks multiple models on your own dataset to select the best fit.
Steps
Pull candidate embedding models.
ollama pull all-minilm
ollama pull nomic-embed-text
ollama pull bge-m3
Create a benchmark script.
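import time

import numpy as np
import requests
from sklearn.metrics import ndcg_score

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(model, text):
    # Request one embedding vector from the local Ollama server.
    resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": text})
    resp.raise_for_status()
    return resp.json()["embedding"]

def evaluate_model(model, queries, relevant_docs, all_docs, k=5):
    # relevant_docs[i] holds one relevance label per document in all_docs,
    # in the same order, for queries[i].
    # Embed the corpus once, outside the query loop. This also loads the
    # model, so the query timings below exclude cold-start time.
    doc_vecs = np.array([embed(model, d) for d in all_docs])
    doc_norms = np.linalg.norm(doc_vecs, axis=1)
    times, scores = [], []
    for q, labels in zip(queries, relevant_docs):
        start = time.perf_counter()
        q_vec = np.array(embed(model, q))
        times.append(time.perf_counter() - start)
        # Cosine similarity between the query and every document, then NDCG@k.
        sims = doc_vecs @ q_vec / (doc_norms * np.linalg.norm(q_vec))
        scores.append(ndcg_score([labels], [sims], k=k))
    return np.mean(times), np.mean(scores)

# test_queries, test_labels, and test_corpus come from your own dataset
# and must be defined before this point.
models = ["all-minilm", "nomic-embed-text", "bge-m3"]
for m in models:
    latency, ndcg = evaluate_model(m, test_queries, test_labels, test_corpus)
    print(f"{m}: {latency*1000:.1f}ms avg, NDCG@5: {ndcg:.3f}")
Measure memory usage per model.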
nvidia-smi --query-gpu=memory.used --format=csv,noheader
Test each model individually with the same batch workload.
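To turn the nvidia-smi reading into a per-model figure, one option is to take a reading before the model is loaded and another after the benchmark has run, then subtract. The sketch below assumes nvidia-smi is on the PATH and that no other process is allocating GPU memory in the meantime. The helper names parse_memory_mib, gpu_memory_used_mib, and memory_delta_mib are illustrative and not part of any library.

```python
import subprocess

def parse_memory_mib(output):
    """Parse nvidia-smi csv,noheader output: one 'N MiB' line per GPU."""
    values = []
    for line in output.strip().splitlines():
        line = line.strip()
        if line:
            # Each line looks like "1234 MiB"; keep the integer part.
            values.append(int(line.split()[0]))
    return values

def gpu_memory_used_mib():
    """Run the same nvidia-smi query as above; return used MiB per GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_memory_mib(out)

def memory_delta_mib(before, after):
    """Per-GPU increase in used memory between two readings."""
    return [a - b for b, a in zip(before, after)]
```

A typical use would be to call gpu_memory_used_mib() once before the first request to a model and once after the workload finishes, then pass both readings to memory_delta_mib. The difference is only an approximation of the model's footprint, since other GPU activity also shows up in the reading.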
Visualize the trade-off.
import matplotlib.pyplot as plt

models = ["all-minilm", "nomic-embed-text", "bge-m3"]
# Example values; replace with your own measurements.
latency = [12, 28, 45]  # ms
ndcg = [0.82, 0.87, 0.91]
memory = [0.5, 1.2, 2.1]  # GB

# Marker area scales with memory usage.
plt.scatter(latency, ndcg, s=[m * 200 for m in memory], alpha=0.5)
for i, m in enumerate(models):
    plt.annotate(m, (latency[i], ndcg[i]))
plt.xlabel("Latency (ms)")
plt.ylabel("NDCG@5")
plt.savefig("embedding_comparison.png")
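Once the trade-off is visible, the choice can also be written down as a rule. The sketch below picks the highest-NDCG model that fits a latency and memory budget. The function name best_within_budget and the budget values are hypothetical, and the figures are the example values from the plotting step, not measurements.

```python
def best_within_budget(results, max_latency_ms, max_memory_gb):
    """Return the highest-NDCG model that fits both budgets, or None."""
    candidates = [
        (name, r) for name, r in results.items()
        if r["latency_ms"] <= max_latency_ms and r["memory_gb"] <= max_memory_gb
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1]["ndcg"])[0]

# Example values from the plotting step; replace with your own measurements.
results = {
    "all-minilm": {"latency_ms": 12, "ndcg": 0.82, "memory_gb": 0.5},
    "nomic-embed-text": {"latency_ms": 28, "ndcg": 0.87, "memory_gb": 1.2},
    "bge-m3": {"latency_ms": 45, "ndcg": 0.91, "memory_gb": 2.1},
}

# Hypothetical budget: 30 ms per query and 1.5 GB of GPU memory.
print(best_within_budget(results, max_latency_ms=30, max_memory_gb=1.5))
# prints: nomic-embed-text
```

A hard budget is only one possible selection rule. If no model fits, the function returns None, which is a signal to relax a constraint or add smaller candidates.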
Verification
python benchmark_embeddings.py
# Expected output: one line per model with average latency and NDCG@5.
# Memory is measured separately with nvidia-smi.
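If you want latency, NDCG@5, and memory side by side in one table, a small formatter is enough. This is a minimal sketch: format_results is a hypothetical helper, and the rows reuse the example values from the plotting step rather than real measurements.

```python
def format_results(rows):
    """rows: (model, latency_ms, ndcg, memory_gb) tuples -> aligned text table."""
    header = f"{'model':<20}{'latency (ms)':>14}{'NDCG@5':>10}{'memory (GB)':>13}"
    lines = [header, "-" * len(header)]
    for model, latency_ms, ndcg, memory_gb in rows:
        lines.append(
            f"{model:<20}{latency_ms:>14.1f}{ndcg:>10.3f}{memory_gb:>13.1f}"
        )
    return "\n".join(lines)

# Example values; replace with your own measurements.
print(format_results([
    ("all-minilm", 12.0, 0.82, 0.5),
    ("nomic-embed-text", 28.0, 0.87, 1.2),
    ("bge-m3", 45.0, 0.91, 2.1),
]))
```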
Common failures
- Cold start skews latency: Send a warm-up request before timing. The first request includes model loading time.
- Inconsistent NDCG: Relevance judgments must be in the same order as all_docs. Use binary relevance (0/1) for simplicity.
- bge-m3 may require a prefix: Some BGE models need an instruction such as "Represent this sentence for searching: " prepended to queries. Check the model documentation for the exact string and for whether it applies to the model you are testing.
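The ordering requirement for relevance judgments can be made explicit by building each label vector directly from the corpus. The sketch below assumes judgments are stored as a set of relevant document strings per query. The helpers binary_labels, dcg_at_k, and ndcg_at_k are hypothetical. The pure-Python NDCG@k uses linear gain and is meant only as a sanity check, since library implementations may differ in details such as tie handling.

```python
import math

def binary_labels(all_docs, relevant):
    """One 0/1 label per document, in the same order as all_docs."""
    relevant = set(relevant)
    return [1 if doc in relevant else 0 for doc in all_docs]

def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain over the top k ranked labels."""
    return sum(
        rel / math.log2(rank + 2)
        for rank, rel in enumerate(labels_in_rank_order[:k])
    )

def ndcg_at_k(labels, scores, k=5):
    """NDCG@k with linear gain; labels[i] and scores[i] share a document."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal if ideal > 0 else 0.0

all_docs = ["doc a", "doc b", "doc c"]
labels = binary_labels(all_docs, {"doc c"})
print(labels)                                   # [0, 0, 1]
print(ndcg_at_k(labels, [0.1, 0.2, 0.9], k=2))  # 1.0
```

If the labels were built in a different order from the similarity scores, the same model output would score differently, for example 0.0 for the scores [0.9, 0.2, 0.1] at k=2. Misalignment of this kind is one way NDCG can come out inconsistent between runs.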