06. IVF Parameters

Chapter 6 of 18 · 20 min

KEY INSIGHT

IVF parameters interact. Doubling nprobe roughly doubles query time but doesn't double recall. The relationship is non-linear and data-dependent—you must measure, not guess. ### Number of Clusters (nlist) The number of clusters determines the granularity of your partition. Guidelines: - **Too few clusters:** Large clusters, higher recall per probe, but fewer candidates to search means still potentially slow - **Too many clusters:** Small clusters, more clusters to probe for good recall, higher overhead Typical rule of thumb: nlist = √n (square root of dataset size) to √10n. For 1M vectors, that's 1k to 10k clusters. ```python # Rule-of-thumb calculation def recommended_nlist(n_vectors): return int(np.sqrt(n_vectors)) for n in [10_000, 100_000, 1_000_000, 10_000_000]: print(f"{n:>10,}: nlist ≈ {recommended_nlist(n):>8,}") ``` ### nprobe nprobe controls how many clusters you search. Higher nprobe = higher recall = slower queries. The relationship isn't linear: - nprobe=1 searches 1 cluster, but recall might be only 60% - nprobe=10 searches 10 clusters, but recall might only reach 90% - nprobe=50 might reach 98% but take 5x longer than nprobe=10 ```python import numpy as np import matplotlib.pyplot as plt def measure_recall_curve(index, vectors, queries, max_nprobe=50): """Measure recall vs nprobe.""" results = [] for nprobe in range(1, max_nprobe + 1, 2): recall = evaluate_recall(index, vectors, queries[:50], k=10, nprobe=nprobe) results.append((nprobe, recall)) return results # This would plot the curve # results = measure_recall_curve(index, vectors, queries) # plt.plot([r[0] for r in results], [r[1] for r in results]) # plt.xlabel('nprobe') # plt.ylabel('Recall@10') # plt.show() ``` ### Memory Considerations IVF without quantization stores: - Vectors: n × dim × 4 bytes (float32) - Centroids: nlist × dim × 4 bytes - Inverted lists: n × 4 bytes (indices) For 1M vectors, 768 dimensions: ~3GB for vectors + small overhead. ### Choosing Parameters by Latency Budget If you have a 50ms latency budget: 1. Benchmark brute force: if it's already <50ms, you might not need IVF 2. Start with nlist = √n, nprobe = 1 3. Measure recall and latency 4. If recall is too low, increase nprobe 5. If latency is too high, decrease nprobe or increase nlist (smaller clusters) ```python def tune_ivf(vectors, queries, latency_budget_ms, target_recall=0.95): """ Find IVF parameters that meet latency budget while achieving target recall. """ n = len(vectors) nlist = int(np.sqrt(n)) # Build index kmeans = MiniBatchKMeans(n_clusters=nlist, batch_size=4096) kmeans.fit(vectors) centroids = kmeans.cluster_centers_ assignments = kmeans.predict(vectors) # Binary search on nprobe low, high = 1, nlist best_nprobe = 1 while low <= high: mid = (low + high) // 2 latency = benchmark_search_latency(centroids, vectors, assignments, queries, nprobe=mid) recall = evaluate_recall(vectors, queries, nprobe=mid) if latency <= latency_budget_ms and recall >= target_recall: best_nprobe = mid high = mid - 1 else: low = mid + 1 return nlist, best_nprobe ```

IVF has three primary parameters to tune: number of clusters, nprobe, and whether to use quantization. Getting these wrong produces either slow queries or poor recall.

EXERCISE

Create a parameter sweep over nlist ∈ {100, 500, 1000, 5000} and nprobe ∈ {1, 5, 10, 20}. For each combination, record recall and latency. Identify the Pareto frontier—combinations where you can't improve recall without sacrificing latency.