05. IVF Training and Search

Chapter 5 of 18 · 20 min

KEY INSIGHT

IVF training isn't just clustering—it's configuring k-means for your specific data distribution. Poor training produces centroids that don't represent your vector space, causing queries to land in sparse or unrepresentative clusters. ### The Training Process Training IVF has two phases: centroid estimation and vector assignment. **Phase 1: K-means for centroid estimation** K-means finds cluster centroids that minimize within-cluster sum of squares. The algorithm iteratively assigns vectors to nearest centroid and recomputes centroids. ```python import numpy as np def kmeans_train(vectors, n_clusters, max_iter=100, tol=1e-4): """ Simplified k-means for understanding IVF training. Production: use sklearn.MiniBatchKMeans or FAISS's k-means. """ dim = vectors.shape[1] # Initialize centroids randomly from data points indices = np.random.choice(len(vectors), n_clusters, replace=False) centroids = vectors[indices].copy() for iteration in range(max_iter): # Assignment step: assign each vector to nearest centroid diff = vectors[:, np.newaxis] - centroids[np.newaxis, :] distances = np.sum(diff ** 2, axis=2) assignments = np.argmin(distances, axis=1) # Update step: compute new centroids new_centroids = np.zeros_like(centroids) for c in range(n_clusters): mask = assignments == c if mask.sum() > 0: new_centroids[c] = vectors[mask].mean(axis=0) else: # Empty cluster: reinitialize randomly new_centroids[c] = vectors[np.random.randint(len(vectors))] # Check convergence shift = np.linalg.norm(new_centroids - centroids) centroids = new_centroids if iteration % 10 == 0: print(f" Iteration {iteration}: shift={shift:.4f}") if shift < tol: print(f" Converged at iteration {iteration}") break return centroids, assignments ``` ### Assignment Strategy After finding centroids, you assign all vectors to their nearest cluster. This becomes the inverted index. The critical question: do you store the original vectors or residuals (vector minus centroid)? **IVF-Flat (no quantization):** Store originals. Lower recall variance, but higher memory. **IVF-PQ:** Store residuals compressed via product quantization. Higher recall variance, but lower memory. ### Search Pipeline At query time, IVF performs: 1. **Centroid search:** Find nprobe closest centroids to query 2. **List scanning:** Retrieve vectors from selected clusters 3. **Distance computation:** Compute distances to all candidates 4. **Top-k selection:** Return k smallest distances ```python def ivf_search(centroids, inverted_lists, vectors, query, nprobe, k=10): """Complete IVF search pipeline.""" # Step 1: Find closest centroids diff = centroids - query centroid_distances = np.sum(diff ** 2, axis=1) closest_centroids = np.argsort(centroid_distances)[:nprobe] print(f"Query -> Centroid distances: {centroid_distances[closest_centroids]}") # Step 2: Collect candidate indices candidate_indices = [] for c in closest_centroids: candidate_indices.extend(inverted_lists[c]) print(f"Scanning {len(candidate_indices)} candidates from {nprobe} clusters") # Step 3: Compute distances to candidates candidates = vectors[candidate_indices] diff = candidates - query distances = np.sum(diff ** 2, axis=1) # Step 4: Return top-k top_k_idx = np.argsort(distances)[:k] return distances[top_k_idx], np.array(candidate_indices)[top_k_idx] # Example execution vectors = np.random.rand(50000, 128).astype('float32') centroids, assignments = kmeans_train(vectors, n_clusters=100) inverted_lists = {i: np.where(assignments == i)[0].tolist() for i in range(100)} query = np.random.rand(128).astype('float32') distances, indices = ivf_search(centroids, inverted_lists, vectors, query, nprobe=5, k=10) print(f"Top-10 distances: {distances}") ``` ### Failure Modes **Empty clusters:** If a cluster has no vectors, queries to that region produce no results. K-means with random initialization can produce empty clusters. **Unbalanced clusters:** Data with uneven density produces some clusters with many vectors, others with few. Search time varies dramatically based on which clusters are queried. **Non-spherical clusters:** K-means assumes spherical clusters. If your data forms elongated manifolds (common in deep learning embeddings), centroids won't represent the data well.

IVF's effectiveness hinges on the quality of its cluster centroids. Training is where operators can introduce subtle bugs that silently degrade recall.

EXERCISE

Train IVF on data with one dense cluster and one sparse cluster (e.g., 90% of vectors in a tight ball, 10% spread across space). Observe how cluster sizes and search behavior differ.

# Create imbalanced data
dense = np.random.randn(90000, 128) * 0.5 + np.random.rand(90000, 128)
sparse = np.random.rand(10000, 128) * 10
vectors = np.vstack([dense, sparse]).astype('float32')

# Train and observe
centroids, assignments = kmeans_train(vectors, n_clusters=100)
cluster_sizes = [np.sum(assignments == i) for i in range(100)]
print(f"Cluster size stats: min={min(cluster_sizes)}, max={max(cluster_sizes)}, "
      f"std={np.std(cluster_sizes):.1f}")
print(f"Largest 5 clusters: {sorted(cluster_sizes, reverse=True)[:5]}")