Vector Database Internals

04. IVF: Inverted File Index

Chapter 4 of 18 · 20 min

KEY INSIGHT

IVF trades recall for speed by limiting search scope. If your query lands in the correct cluster region, you get exact results faster. If it lands near cluster boundaries, you miss relevant vectors in neighboring clusters. ### The Inverted Index Structure IVF maintains an inverted list for each cluster. Each inverted list contains the IDs of vectors belonging to that cluster (and optionally their residuals for reconstruction). ```python import numpy as np from sklearn.cluster import MiniBatchKMeans class IVFFlat: """ Simplified IVF index for educational purposes. Production implementations (FAISS, ScaNN) add quantization and pruning. """ def __init__(self, nlist, dim): self.nlist = nlist self.dim = dim self.centroids = None self.inverted_lists = {} # list_id -> list of vector indices def train(self, vectors): """Cluster vectors to build centroids.""" print(f"Training IVF with {self.nlist} clusters...") kmeans = MiniBatchKMeans( n_clusters=self.nlist, batch_size=4096, n_init=3, max_iter=100 ) kmeans.fit(vectors) self.centroids = kmeans.cluster_centers_ # Assign vectors to clusters assignments = kmeans.predict(vectors) self.inverted_lists = {i: [] for i in range(self.nlist)} for idx, cluster_id in enumerate(assignments): self.inverted_lists[cluster_id].append(idx) print(f"Training complete. Cluster sizes: min={min(len(v) for v in self.inverted_lists.values())}, " f"max={max(len(v) for v in self.inverted_lists.values())}") return self def search(self, query, nprobe=1): """Search nprobe closest clusters.""" # Find closest centroids diff = self.centroids - query distances = np.sum(diff ** 2, axis=1) closest_clusters = np.argsort(distances)[:nprobe] # Search vectors in those clusters candidate_indices = [] for cluster_id in closest_clusters: candidate_indices.extend(self.inverted_lists[cluster_id]) # Brute force among candidates candidates = np.array(candidate_indices) diff = self.vectors[candidates] - query dists = np.sum(diff ** 2, axis=1) top_k = np.argsort(dists)[:10] return dists[top_k], candidates[top_k] ``` ### How IVF Reduces Work If you have 1M vectors and 1000 clusters, each cluster has ~1000 vectors. Searching 1 cluster requires 1000 distance computations instead of 1M. With nprobe=10, you search 10k vectors—still 100x faster than brute force. The catch: you only find vectors in the searched clusters. If the true nearest neighbors span multiple clusters, you miss them.

IVF (Inverted File Index) is the workhorse of vector search. Its core idea: partition your vectors into clusters using k-means, then at query time, find which cluster your query belongs to and search only that cluster (plus neighbors).

EXERCISE

Build an IVF index with 100 clusters on 100k vectors. Vary nprobe from 1 to 20 and measure recall (compare against brute force ground truth). Plot recall vs nprobe. Observe how recall saturates—there's a point of diminishing returns.

# Continuing from above...
def evaluate_recall(index, vectors, queries, k=10, nprobe=1):
    """Calculate recall by comparing to brute force."""
    # Ground truth via brute force
    db = index.vectors  # access vectors
    diff = db[:, np.newaxis] - queries[np.newaxis, :, :]
    all_dists = np.sqrt(np.sum(diff ** 2, axis=2))
    gt_indices = np.argsort(all_dists, axis=0)[:k]
    
    # IVF results
    recall_sum = 0
    for i, query in enumerate(queries):
        _, indices = index.search(query, nprobe)
        relevant = set(gt_indices[:, i])
        retrieved = set(indices[:k])
        recall_sum += len(relevant & retrieved) / k
    
    return recall_sum / len(queries)

# Run evaluation
for nprobe in [1, 5, 10, 20, 50]:
    recall = evaluate_recall(index, vectors, queries[:100], k=10, nprobe=nprobe)
    print(f"nprobe={nprobe:2d}: recall={recall:.4f}")