RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Vector Database Internals
  6. /Ch. 6
Vector Database Internals

06. IVF Parameters

Chapter 6 of 18 · 20 min
KEY INSIGHT

IVF parameters interact. Doubling nprobe roughly doubles query time but doesn't double recall. The relationship is non-linear and data-dependent—you must measure, not guess. ### Number of Clusters (nlist) The number of clusters determines the granularity of your partition. Guidelines: - **Too few clusters:** Large clusters, higher recall per probe, but fewer candidates to search means still potentially slow - **Too many clusters:** Small clusters, more clusters to probe for good recall, higher overhead Typical rule of thumb: nlist = √n (square root of dataset size) to √10n. For 1M vectors, that's 1k to 10k clusters. ```python # Rule-of-thumb calculation def recommended_nlist(n_vectors): return int(np.sqrt(n_vectors)) for n in [10_000, 100_000, 1_000_000, 10_000_000]: print(f"{n:>10,}: nlist ≈ {recommended_nlist(n):>8,}") ``` ### nprobe nprobe controls how many clusters you search. Higher nprobe = higher recall = slower queries. The relationship isn't linear: - nprobe=1 searches 1 cluster, but recall might be only 60% - nprobe=10 searches 10 clusters, but recall might only reach 90% - nprobe=50 might reach 98% but take 5x longer than nprobe=10 ```python import numpy as np import matplotlib.pyplot as plt def measure_recall_curve(index, vectors, queries, max_nprobe=50): """Measure recall vs nprobe.""" results = [] for nprobe in range(1, max_nprobe + 1, 2): recall = evaluate_recall(index, vectors, queries[:50], k=10, nprobe=nprobe) results.append((nprobe, recall)) return results # This would plot the curve # results = measure_recall_curve(index, vectors, queries) # plt.plot([r[0] for r in results], [r[1] for r in results]) # plt.xlabel('nprobe') # plt.ylabel('Recall@10') # plt.show() ``` ### Memory Considerations IVF without quantization stores: - Vectors: n × dim × 4 bytes (float32) - Centroids: nlist × dim × 4 bytes - Inverted lists: n × 4 bytes (indices) For 1M vectors, 768 dimensions: ~3GB for vectors + small overhead. ### Choosing Parameters by Latency Budget If you have a 50ms latency budget: 1. Benchmark brute force: if it's already <50ms, you might not need IVF 2. Start with nlist = √n, nprobe = 1 3. Measure recall and latency 4. If recall is too low, increase nprobe 5. If latency is too high, decrease nprobe or increase nlist (smaller clusters) ```python def tune_ivf(vectors, queries, latency_budget_ms, target_recall=0.95): """ Find IVF parameters that meet latency budget while achieving target recall. """ n = len(vectors) nlist = int(np.sqrt(n)) # Build index kmeans = MiniBatchKMeans(n_clusters=nlist, batch_size=4096) kmeans.fit(vectors) centroids = kmeans.cluster_centers_ assignments = kmeans.predict(vectors) # Binary search on nprobe low, high = 1, nlist best_nprobe = 1 while low <= high: mid = (low + high) // 2 latency = benchmark_search_latency(centroids, vectors, assignments, queries, nprobe=mid) recall = evaluate_recall(vectors, queries, nprobe=mid) if latency <= latency_budget_ms and recall >= target_recall: best_nprobe = mid high = mid - 1 else: low = mid + 1 return nlist, best_nprobe ```

IVF has three primary parameters to tune: number of clusters, nprobe, and whether to use quantization. Getting these wrong produces either slow queries or poor recall.

EXERCISE

Create a parameter sweep over nlist ∈ {100, 500, 1000, 5000} and nprobe ∈ {1, 5, 10, 20}. For each combination, record recall and latency. Identify the Pareto frontier—combinations where you can't improve recall without sacrificing latency.

← Chapter 5
IVF Training and Search
Chapter 7 →
HNSW: Hierarchical Navigable Small World