05. IVF Training and Search
Chapter 5 of 18 · 20 min
IVF's effectiveness hinges on the quality of its cluster centroids. Training is where operators can introduce subtle bugs that silently degrade recall.
EXERCISE
Train IVF on data with one dense cluster and one sparse cluster (e.g., 90% of vectors in a tight ball, 10% spread across space). Observe how cluster sizes and search behavior differ.
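import numpy as np

# Create imbalanced data: 90% of vectors in a tight ball, 10% spread across space
center = np.random.rand(1, 128)  # one shared center, so the dense points form a single ball
dense = np.random.randn(90000, 128) * 0.5 + center
sparse = np.random.rand(10000, 128) * 10
vectors = np.vstack([dense, sparse]).astype('float32')

# Train and observe (kmeans_train is assumed to be the k-means trainer from this chapter)
centroids, assignments = kmeans_train(vectors, n_clusters=100)
# int(...) keeps the printed list readable under NumPy 2.x scalar reprs
cluster_sizes = [int(np.sum(assignments == i)) for i in range(100)]
print(f"Cluster size stats: min={min(cluster_sizes)}, max={max(cluster_sizes)}, "
      f"std={np.std(cluster_sizes):.1f}")
print(f"Largest 5 clusters: {sorted(cluster_sizes, reverse=True)[:5]}")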
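The exercise also asks about search behavior, which the snippet above does not measure. One way to observe it is to count how many stored vectors a query has to be compared against when it probes its nearest clusters. The sketch below is self-contained and makes several assumptions: `simple_kmeans` is a plain Lloyd's k-means written here as a hypothetical stand-in for `kmeans_train`, `candidates_scanned` and `nprobe` are illustrative names, and the dataset is scaled down (10,000 vectors, 32 dimensions, 20 clusters) so it runs in a few seconds.

```python
import numpy as np

def sq_dists(x, centroids):
    # Squared Euclidean distances, shape (len(x), len(centroids))
    return ((x ** 2).sum(axis=1, keepdims=True)
            - 2 * x @ centroids.T
            + (centroids ** 2).sum(axis=1))

def simple_kmeans(vectors, n_clusters, n_iters=10, seed=0):
    # Plain Lloyd's k-means; a hypothetical stand-in for kmeans_train
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        assignments = sq_dists(vectors, centroids).argmin(axis=1)
        for k in range(n_clusters):
            members = vectors[assignments == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    return centroids, sq_dists(vectors, centroids).argmin(axis=1)

def candidates_scanned(queries, centroids, cluster_sizes, nprobe):
    # Per query: total size of the nprobe clusters whose centroids are nearest,
    # i.e. how many stored vectors an IVF search would compare against
    probed = np.argsort(sq_dists(queries, centroids), axis=1)[:, :nprobe]
    return cluster_sizes[probed].sum(axis=1)

# Scaled-down version of the imbalanced dataset (sizes are assumptions)
rng = np.random.default_rng(0)
dim, n_clusters = 32, 20
center = rng.random((1, dim))
dense = rng.standard_normal((9000, dim)) * 0.5 + center
sparse = rng.random((1000, dim)) * 10
vectors = np.vstack([dense, sparse]).astype('float32')

centroids, assignments = simple_kmeans(vectors, n_clusters)
cluster_sizes = np.bincount(assignments, minlength=n_clusters)

# Fresh queries drawn from each region
dense_queries = (rng.standard_normal((100, dim)) * 0.5 + center).astype('float32')
sparse_queries = (rng.random((100, dim)) * 10).astype('float32')

for nprobe in (1, 5):
    dense_scan = candidates_scanned(dense_queries, centroids, cluster_sizes, nprobe)
    sparse_scan = candidates_scanned(sparse_queries, centroids, cluster_sizes, nprobe)
    print(f"nprobe={nprobe}: mean candidates scanned "
          f"dense={dense_scan.mean():.0f}, sparse={sparse_scan.mean():.0f}")
```

The numbers depend on the seed and on how centroids are initialized, so it is worth rerunning with a few seeds and comparing the scan counts against the cluster size statistics printed earlier.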