RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Vector Database Internals
  6. /Ch. 5
Vector Database Internals

05. IVF Training and Search

Chapter 5 of 18 · 20 min
KEY INSIGHT

IVF training isn't just clustering—it's configuring k-means for your specific data distribution. Poor training produces centroids that don't represent your vector space, causing queries to land in sparse or unrepresentative clusters. ### The Training Process Training IVF has two phases: centroid estimation and vector assignment. **Phase 1: K-means for centroid estimation** K-means finds cluster centroids that minimize within-cluster sum of squares. The algorithm iteratively assigns vectors to nearest centroid and recomputes centroids. ```python import numpy as np def kmeans_train(vectors, n_clusters, max_iter=100, tol=1e-4): """ Simplified k-means for understanding IVF training. Production: use sklearn.MiniBatchKMeans or FAISS's k-means. """ dim = vectors.shape[1] # Initialize centroids randomly from data points indices = np.random.choice(len(vectors), n_clusters, replace=False) centroids = vectors[indices].copy() for iteration in range(max_iter): # Assignment step: assign each vector to nearest centroid diff = vectors[:, np.newaxis] - centroids[np.newaxis, :] distances = np.sum(diff ** 2, axis=2) assignments = np.argmin(distances, axis=1) # Update step: compute new centroids new_centroids = np.zeros_like(centroids) for c in range(n_clusters): mask = assignments == c if mask.sum() > 0: new_centroids[c] = vectors[mask].mean(axis=0) else: # Empty cluster: reinitialize randomly new_centroids[c] = vectors[np.random.randint(len(vectors))] # Check convergence shift = np.linalg.norm(new_centroids - centroids) centroids = new_centroids if iteration % 10 == 0: print(f" Iteration {iteration}: shift={shift:.4f}") if shift < tol: print(f" Converged at iteration {iteration}") break return centroids, assignments ``` ### Assignment Strategy After finding centroids, you assign all vectors to their nearest cluster. This becomes the inverted index. The critical question: do you store the original vectors or residuals (vector minus centroid)? **IVF-Flat (no quantization):** Store originals. Lower recall variance, but higher memory. **IVF-PQ:** Store residuals compressed via product quantization. Higher recall variance, but lower memory. ### Search Pipeline At query time, IVF performs: 1. **Centroid search:** Find nprobe closest centroids to query 2. **List scanning:** Retrieve vectors from selected clusters 3. **Distance computation:** Compute distances to all candidates 4. **Top-k selection:** Return k smallest distances ```python def ivf_search(centroids, inverted_lists, vectors, query, nprobe, k=10): """Complete IVF search pipeline.""" # Step 1: Find closest centroids diff = centroids - query centroid_distances = np.sum(diff ** 2, axis=1) closest_centroids = np.argsort(centroid_distances)[:nprobe] print(f"Query -> Centroid distances: {centroid_distances[closest_centroids]}") # Step 2: Collect candidate indices candidate_indices = [] for c in closest_centroids: candidate_indices.extend(inverted_lists[c]) print(f"Scanning {len(candidate_indices)} candidates from {nprobe} clusters") # Step 3: Compute distances to candidates candidates = vectors[candidate_indices] diff = candidates - query distances = np.sum(diff ** 2, axis=1) # Step 4: Return top-k top_k_idx = np.argsort(distances)[:k] return distances[top_k_idx], np.array(candidate_indices)[top_k_idx] # Example execution vectors = np.random.rand(50000, 128).astype('float32') centroids, assignments = kmeans_train(vectors, n_clusters=100) inverted_lists = {i: np.where(assignments == i)[0].tolist() for i in range(100)} query = np.random.rand(128).astype('float32') distances, indices = ivf_search(centroids, inverted_lists, vectors, query, nprobe=5, k=10) print(f"Top-10 distances: {distances}") ``` ### Failure Modes **Empty clusters:** If a cluster has no vectors, queries to that region produce no results. K-means with random initialization can produce empty clusters. **Unbalanced clusters:** Data with uneven density produces some clusters with many vectors, others with few. Search time varies dramatically based on which clusters are queried. **Non-spherical clusters:** K-means assumes spherical clusters. If your data forms elongated manifolds (common in deep learning embeddings), centroids won't represent the data well.

IVF's effectiveness hinges on the quality of its cluster centroids. Training is where operators can introduce subtle bugs that silently degrade recall.

EXERCISE

Train IVF on data with one dense cluster and one sparse cluster (e.g., 90% of vectors in a tight ball, 10% spread across space). Observe how cluster sizes and search behavior differ.

# Create imbalanced data
dense = np.random.randn(90000, 128) * 0.5 + np.random.rand(90000, 128)
sparse = np.random.rand(10000, 128) * 10
vectors = np.vstack([dense, sparse]).astype('float32')

# Train and observe
centroids, assignments = kmeans_train(vectors, n_clusters=100)
cluster_sizes = [np.sum(assignments == i) for i in range(100)]
print(f"Cluster size stats: min={min(cluster_sizes)}, max={max(cluster_sizes)}, "
      f"std={np.std(cluster_sizes):.1f}")
print(f"Largest 5 clusters: {sorted(cluster_sizes, reverse=True)[:5]}")
← Chapter 4
IVF: Inverted File Index
Chapter 6 →
IVF Parameters