KEY INSIGHT
IVF parameters interact. Doubling nprobe roughly doubles query time but doesn't double recall. The relationship is non-linear and data-dependent—you must measure, not guess.
### Number of Clusters (nlist)
The number of clusters determines the granularity of your partition. Guidelines:
- **Too few clusters:** Large clusters, higher recall per probe, but fewer candidates to search means still potentially slow
- **Too many clusters:** Small clusters, more clusters to probe for good recall, higher overhead
Typical rule of thumb: nlist = √n (square root of dataset size) to √10n. For 1M vectors, that's 1k to 10k clusters.
```python
# Rule-of-thumb calculation
def recommended_nlist(n_vectors):
return int(np.sqrt(n_vectors))
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
print(f"{n:>10,}: nlist ≈ {recommended_nlist(n):>8,}")
```
### nprobe
nprobe controls how many clusters you search. Higher nprobe = higher recall = slower queries.
The relationship isn't linear:
- nprobe=1 searches 1 cluster, but recall might be only 60%
- nprobe=10 searches 10 clusters, but recall might only reach 90%
- nprobe=50 might reach 98% but take 5x longer than nprobe=10
```python
import numpy as np
import matplotlib.pyplot as plt
def measure_recall_curve(index, vectors, queries, max_nprobe=50):
"""Measure recall vs nprobe."""
results = []
for nprobe in range(1, max_nprobe + 1, 2):
recall = evaluate_recall(index, vectors, queries[:50], k=10, nprobe=nprobe)
results.append((nprobe, recall))
return results
# This would plot the curve
# results = measure_recall_curve(index, vectors, queries)
# plt.plot([r[0] for r in results], [r[1] for r in results])
# plt.xlabel('nprobe')
# plt.ylabel('Recall@10')
# plt.show()
```
### Memory Considerations
IVF without quantization stores:
- Vectors: n × dim × 4 bytes (float32)
- Centroids: nlist × dim × 4 bytes
- Inverted lists: n × 4 bytes (indices)
For 1M vectors, 768 dimensions: ~3GB for vectors + small overhead.
### Choosing Parameters by Latency Budget
If you have a 50ms latency budget:
1. Benchmark brute force: if it's already <50ms, you might not need IVF
2. Start with nlist = √n, nprobe = 1
3. Measure recall and latency
4. If recall is too low, increase nprobe
5. If latency is too high, decrease nprobe or increase nlist (smaller clusters)
```python
def tune_ivf(vectors, queries, latency_budget_ms, target_recall=0.95):
"""
Find IVF parameters that meet latency budget while achieving target recall.
"""
n = len(vectors)
nlist = int(np.sqrt(n))
# Build index
kmeans = MiniBatchKMeans(n_clusters=nlist, batch_size=4096)
kmeans.fit(vectors)
centroids = kmeans.cluster_centers_
assignments = kmeans.predict(vectors)
# Binary search on nprobe
low, high = 1, nlist
best_nprobe = 1
while low <= high:
mid = (low + high) // 2
latency = benchmark_search_latency(centroids, vectors, assignments,
queries, nprobe=mid)
recall = evaluate_recall(vectors, queries, nprobe=mid)
if latency <= latency_budget_ms and recall >= target_recall:
best_nprobe = mid
high = mid - 1
else:
low = mid + 1
return nlist, best_nprobe
```