© 2026 runlocalai.co · Independently operated
How to compare embedding model performance for your use case

intermediate·20 min·By Fredoline Eruo
PREREQUISITES

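A running Ollama server, at least two candidate embedding models, and a test dataset of queries with relevance labels. The scripts use Python with requests, numpy, scikit-learn, and matplotlib.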

What this does

Different embedding models trade off speed, memory, and retrieval accuracy. This guide benchmarks multiple models on your own dataset to select the best fit.
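
Before benchmarking, the test dataset needs a concrete shape. The sketch below shows one plausible layout, assuming binary relevance and one row of labels per query: a list of queries, a list of documents, and a label matrix ordered like the corpus. The toy sentences, the `relevant` mapping, and the `build_labels` helper are hypothetical and only illustrate the structure.

```python
# Hypothetical toy data; replace with your own queries and documents.
test_corpus = [
    "Ollama serves models over a local HTTP API.",
    "NDCG rewards rankings that put relevant documents first.",
    "GPU memory limits which models can be loaded together.",
]
test_queries = ["how is ranking quality measured", "how much VRAM do I need"]

# For each query index, the indices of the relevant documents in test_corpus.
relevant = {0: [1], 1: [2]}

def build_labels(num_queries, num_docs, relevant):
    """Return one row of 0/1 labels per query, ordered like the corpus."""
    labels = [[0] * num_docs for _ in range(num_queries)]
    for q_idx, doc_ids in relevant.items():
        for d_idx in doc_ids:
            labels[q_idx][d_idx] = 1
    return labels

test_labels = build_labels(len(test_queries), len(test_corpus), relevant)
print(test_labels)  # [[0, 1, 0], [0, 0, 1]]
```

Keeping labels as a dense matrix costs memory on large corpora, but it makes the ordering requirement explicit: column j always refers to document j.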

Steps

  1. Pull candidate embedding models.

    ollama pull all-minilm
    ollama pull nomic-embed-text
    ollama pull bge-m3
    
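  2. Create a benchmark script and save it as benchmark_embeddings.py.

    import time

    import numpy as np
    import requests
    from sklearn.metrics import ndcg_score

    OLLAMA_URL = "http://localhost:11434/api/embeddings"
    K = 5

    def embed(model, text):
        """Return one embedding vector from the local Ollama server."""
        resp = requests.post(OLLAMA_URL, json={"model": model, "prompt": text})
        resp.raise_for_status()
        return np.array(resp.json()["embedding"])

    def evaluate_model(model, queries, relevance, all_docs, k=K):
        # relevance[i][j] is the label of all_docs[j] for queries[i].
        embed(model, "warm-up")  # load the model before timing anything
        # Embed the corpus once per model rather than once per query.
        doc_vecs = np.array([embed(model, d) for d in all_docs])
        doc_norms = np.linalg.norm(doc_vecs, axis=1)
        times, scores = [], []
        for q, labels in zip(queries, relevance):
            start = time.perf_counter()
            q_vec = embed(model, q)
            times.append(time.perf_counter() - start)
            # Cosine similarity between the query and every document.
            sims = doc_vecs @ q_vec / (doc_norms * np.linalg.norm(q_vec))
            scores.append(ndcg_score([labels], [sims], k=k))
        return np.mean(times), np.mean(scores)

    # test_queries, test_corpus, and test_labels (one row of 0/1 labels per
    # query, in test_corpus order) must be defined or loaded before this point.
    models = ["all-minilm", "nomic-embed-text", "bge-m3"]
    for m in models:
        latency, ndcg = evaluate_model(m, test_queries, test_labels, test_corpus)
        print(f"{m}: {latency*1000:.1f}ms avg, NDCG@{K}: {ndcg:.3f}")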
    
  3. Measure memory usage per model.

    nvidia-smi --query-gpu=memory.used --format=csv,noheader
    

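    Record the reading while only one model is loaded, and send each model the same batch workload so the numbers are comparable. The figure covers everything on the GPU, so subtract the idle baseline taken before loading any model.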

  4. Visualize the trade-off.

    import matplotlib.pyplot as plt
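    models = ["all-minilm", "nomic-embed-text", "bge-m3"]
    # Placeholder values: replace with your own results from steps 2 and 3.
    latency = [12, 28, 45]   # ms, average per query
    ndcg = [0.82, 0.87, 0.91]
    memory = [0.5, 1.2, 2.1] # GB, sets the marker size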
    plt.scatter(latency, ndcg, s=[m*200 for m in memory], alpha=0.5)
    for i, m in enumerate(models):
        plt.annotate(m, (latency[i], ndcg[i]))
    plt.xlabel("Latency (ms)"); plt.ylabel("NDCG@5")
    plt.savefig("embedding_comparison.png")
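
Once the three measurements exist, choosing a model can be made mechanical. The sketch below is one possible approach, not part of the original steps: it drops any model that another model beats on latency, NDCG, and memory at once, then picks the most accurate survivor within a latency and memory budget. The `pareto_front` and `pick` helpers and the budget values are hypothetical, and the figures are the placeholder numbers from the plotting step, not measurements.

```python
def pareto_front(results):
    """Keep models that no other model beats on all three metrics.

    results maps model name -> (latency_ms, ndcg, memory_gb).
    Lower latency and memory are better; higher NDCG is better.
    """
    def dominates(a, b):
        no_worse = a[0] <= b[0] and a[1] >= b[1] and a[2] <= b[2]
        strictly_better = a[0] < b[0] or a[1] > b[1] or a[2] < b[2]
        return no_worse and strictly_better

    return [
        name for name, metrics in results.items()
        if not any(dominates(other, metrics)
                   for other_name, other in results.items()
                   if other_name != name)
    ]

def pick(results, max_latency_ms, max_memory_gb):
    """Most accurate non-dominated model within both budgets, or None."""
    candidates = [
        name for name in pareto_front(results)
        if results[name][0] <= max_latency_ms and results[name][2] <= max_memory_gb
    ]
    return max(candidates, key=lambda name: results[name][1]) if candidates else None

# Placeholder figures from the plotting step, not measurements.
results = {
    "all-minilm": (12, 0.82, 0.5),
    "nomic-embed-text": (28, 0.87, 1.2),
    "bge-m3": (45, 0.91, 2.1),
}
print(pareto_front(results))
print(pick(results, max_latency_ms=30, max_memory_gb=1.5))
```

With these placeholder numbers every model survives the filter, because each slower model is also more accurate. The budgets then decide the outcome.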
    

Verification

python benchmark_embeddings.py
# Expected: one line per model with average latency and NDCG@5.
# Memory is measured separately with nvidia-smi (step 3).
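
To sanity-check the NDCG figures, it can help to compute the metric by hand on a tiny case where the right answer is obvious. The sketch below is a minimal pure-Python NDCG@k with linear gain, assuming binary labels. The `dcg_at_k` and `ndcg_at_k` helpers and the example scores are hypothetical. It breaks ties by input order, so it can differ slightly from scikit-learn's `ndcg_score` when similarity scores tie.

```python
import math

def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(labels_in_rank_order[:k]))

def ndcg_at_k(labels, scores, k=5):
    """NDCG@k for one query: DCG of the predicted ranking over the ideal DCG."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked_labels = [labels[i] for i in order]
    ideal_labels = sorted(labels, reverse=True)
    best = dcg_at_k(ideal_labels, k)
    return dcg_at_k(ranked_labels, k) / best if best > 0 else 0.0

labels = [0, 1, 0, 1]            # documents 1 and 3 are relevant
good = [0.1, 0.9, 0.2, 0.8]      # relevant documents score highest
bad = [0.9, 0.1, 0.8, 0.2]       # relevant documents score lowest

print(ndcg_at_k(labels, good))   # 1.0
print(ndcg_at_k(labels, bad))    # noticeably lower than 1.0
```

If a model scores near the `bad` case on data you know it should handle, a label ordering problem is more likely than a poor model.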

Common failures

  • Cold start skews latency: Send a warm-up request before timing. The first request includes model loading time.
  • Inconsistent NDCG: Each query needs its own relevance judgments, listed in the same order as all_docs. Use binary relevance (0/1) for simplicity.
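  • Missing query prefix: Some models expect a task prefix. nomic-embed-text is trained with "search_query: " and "search_document: " prefixes, and the English BGE v1.5 models use "Represent this sentence for searching relevant passages: " on queries. bge-m3 does not need one. Check each model's documentation.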
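
The warm-up and prefix failures above can both be handled in a thin wrapper around the embedding call. The sketch below is one way to do it, under stated assumptions: the prefix tables hold only the nomic-embed-text entries and should be filled in from each model's documentation, `with_prefix` and `timed_calls` are hypothetical helpers, and `fake_embed` stands in for the real Ollama request so the example runs offline.

```python
import statistics
import time

# Assumed prefixes; confirm against each model's documentation.
QUERY_PREFIX = {"nomic-embed-text": "search_query: "}
DOC_PREFIX = {"nomic-embed-text": "search_document: "}

def with_prefix(model, text, kind):
    """Prepend the model's task prefix, if any. kind is "query" or "document"."""
    table = QUERY_PREFIX if kind == "query" else DOC_PREFIX
    return table.get(model, "") + text

def timed_calls(fn, inputs, warmup=1):
    """Run untimed warm-up calls, then return per-call durations in seconds."""
    for x in inputs[:warmup]:
        fn(x)  # absorbs model loading time on a real server
    times = []
    for x in inputs:
        start = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - start)
    return times

def fake_embed(text):
    """Stand-in for the real embedding request."""
    return [float(len(text))]

queries = [with_prefix("nomic-embed-text", q, "query") for q in ["a", "b", "c"]]
times = timed_calls(fake_embed, queries)
print(f"median {statistics.median(times) * 1000:.3f} ms over {len(times)} calls")
```

Reporting the median rather than the mean is a design choice here: a single slow request, such as one that triggers a model reload, moves the median far less.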

Related guides

  • How to run embedding models for semantic search
  • How to fine-tune embedding batch sizes for your hardware