COURSE · OPS · A010

Vector Database Internals

Learn vector database internals through RunLocalAI's practical lens: vectordb, hnsw, ivf and quantization, hardware fit, runtime settings, verification habits and local-vs-cloud tradeoffs.

18 chapters14hOperator trackBy Fredoline Eruo
PREREQUISITES
  • B012
  • I001

Why this course matters

Vector Database Internals is for operators making local AI reliable, measurable and cheaper to run. It connects vectordb, hnsw, ivf, quantization and cuda to the questions RunLocalAI wants every reader to answer before they install, upgrade or scale a model: will it run, what will it cost in memory, what setting changes the result, and how do you verify the answer instead of trusting a demo?

What you will be able to do

By the end, you should be able to explain the main tradeoffs in plain language, choose a safe next experiment, and use the chapter exercises as a repeatable operator checklist. The course favors local evidence, hardware fit, context limits, latency and failure modes over generic AI vocabulary.

How to use this course

Start at chapter one if the topic is new. If you already have a working stack, scan for chapters such as Why Build a Vector DB?, Vector Search Fundamentals, Brute Force Search and IVF: Inverted File Index and use those lessons as a quality-control pass before changing a workstation, team workflow or production-like local deployment.

CHAPTERS
  1. 01Why Build a Vector DB?Vector databases exist because exact nearest neighbor search in high-dimensional space is computationally intractable at scale, and approximate methods trade recall for speed in ways that matter enormously in production. The core problem: given 10 million embeddings (from images, text, audio, or any deep learning model), find the k closest vectors to a query. A naive approach examines every vector—O(n) time. At 10M vectors with 768-dimensional float32 vectors, you're doing 7.68 billion float comparisons per query. That's approximately never acceptable. Approximate Nearest Neighbor (ANN) indexes solve this by accepting that you don't need the *exact* answer, just a *good enough* answer that's fast. The trade-off is parameterized—you control how much accuracy you sacrifice for speed. Three core techniques power modern vector databases: **Inverted File Index (IVF)** partitions your vector space into clusters. At query time, you find which cluster your query lands in and only search that cluster (plus a few neighbors). The parameter `nprobe` controls how many clusters you check—higher nprobe = higher recall = slower queries. **Hierarchical Navigable Small World (HNSW)** builds a multi-layer graph structure. Upper layers are sparse and let you jump toward your target region quickly. Lower layers are dense and refine the search. Think of it as a highway system for vectors. **Product Quantization (PQ)** compresses vectors by splitting them into subvectors and clustering each subvector independently. This reduces memory footprint by 10-100x and allows GPU-accelerated distance computation on compressed data.15 min
  2. 02Vector Search FundamentalsYour embeddings define your search semantics. A poor embedding model produces a vector database that's "fast at finding irrelevant things." ### Vector Representations Modern embeddings come from models like CLIP (images + text), BERT variants (text), or ResNet (images). Each produces a fixed-length vector, typically 128 to 2048 dimensions. These vectors are points in a high-dimensional space—semantically similar items cluster together. The embedding model matters more than the index type. If your model maps "dog" and "puppy" to distant points, no indexing trick will make them retrievable together. ### Distance Metrics Three metrics dominate vector search: **Cosine Similarity** measures the angle between vectors, ignoring magnitude. Range: -1 to 1. Common for text embeddings where direction matters more than scale. **L2 Distance (Euclidean)** measures straight-line distance. Common for normalized embeddings and image similarity. **Inner Product (Dot Product)** measures vector alignment. Range: -∞ to ∞. Common when vectors have varying magnitudes and alignment direction matters. ```python import numpy as np def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)) def l2_distance(a, b): return np.linalg.norm(a - b) def inner_product(a, b): return np.dot(a, b) # Verify relationships v1 = np.array([1.0, 0.0]) v2 = np.array([1.0, 1.0]) print(f"Cosine: {cosine_similarity(v1, v2):.4f}") # 0.7071 print(f"L2: {l2_distance(v1, v2):.4f}") # 1.4142 print(f"IP: {inner_product(v1, v2):.4f}") # 1.0 ``` ### Dimensionality and the Curse High-dimensional spaces behave counterintuitively. As dimensions increase, the relative difference between nearest and farthest neighbors shrinks toward zero—you can't easily distinguish "close" from "far." This is the curse of dimensionality. This is why approximate methods are necessary. Exact search requires examining essentially all vectors anyway, because distances become meaningless. ANN methods exploit structure (clusters, graphs) that exists *before* the curse dominates completely.20 min
  3. 03Brute Force SearchBrute force search on 1M vectors with 768 dimensions does ~768M FLOPs per query. At 100 QPS, that's 76.8 billion FLOPs—your CPU screams, and latency explodes. This is why indexes exist. ### Implementation ```python import numpy as np def brute_force_search(database, query, k=10): """ Find k nearest neighbors by exhaustive comparison. Args: database: (n_vectors, dim) array of vectors query: (dim,) query vector k: number of results Returns: distances, indices of k nearest neighbors """ # Compute L2 distance to all vectors diff = database - query distances = np.sqrt(np.sum(diff ** 2, axis=1)) # Get indices of k smallest distances indices = np.argsort(distances)[:k] return distances[indices], indices # Benchmark n = 100_000 dim = 768 k = 10 db = np.random.rand(n, dim).astype('float32') query = np.random.rand(dim).astype('float32') import time start = time.time() for _ in range(100): distances, indices = brute_force_search(db, query, k) elapsed = (time.time() - start) / 100 print(f"Average latency: {elapsed*1000:.2f}ms for {n:,} vectors, dim={dim}") ``` ### Complexity Analysis | Operation | Complexity | |-----------|------------| | Distance computation | O(n × dim) | | Sorting for top-k | O(n log k) | | Memory access | O(n × dim) bytes | For n=1M, dim=768: 768M distance computations, plus sorting 1M elements. This doesn't fit in L3 cache—the memory bandwidth bottleneck bites hard. ### Why Brute Force Sometimes Wins For small datasets (<10k vectors), the overhead of index construction and traversal often exceeds brute force cost. Many production systems fall back to brute force for small indexes. Additionally, when you need 100% recall (guaranteed exact results), brute force is your only option. ANN methods are a bet on "good enough recall."15 min
  4. 04IVF: Inverted File IndexIVF trades recall for speed by limiting search scope. If your query lands in the correct cluster region, you get exact results faster. If it lands near cluster boundaries, you miss relevant vectors in neighboring clusters. ### The Inverted Index Structure IVF maintains an inverted list for each cluster. Each inverted list contains the IDs of vectors belonging to that cluster (and optionally their residuals for reconstruction). ```python import numpy as np from sklearn.cluster import MiniBatchKMeans class IVFFlat: """ Simplified IVF index for educational purposes. Production implementations (FAISS, ScaNN) add quantization and pruning. """ def __init__(self, nlist, dim): self.nlist = nlist self.dim = dim self.centroids = None self.inverted_lists = {} # list_id -> list of vector indices def train(self, vectors): """Cluster vectors to build centroids.""" print(f"Training IVF with {self.nlist} clusters...") kmeans = MiniBatchKMeans( n_clusters=self.nlist, batch_size=4096, n_init=3, max_iter=100 ) kmeans.fit(vectors) self.centroids = kmeans.cluster_centers_ # Assign vectors to clusters assignments = kmeans.predict(vectors) self.inverted_lists = {i: [] for i in range(self.nlist)} for idx, cluster_id in enumerate(assignments): self.inverted_lists[cluster_id].append(idx) print(f"Training complete. Cluster sizes: min={min(len(v) for v in self.inverted_lists.values())}, " f"max={max(len(v) for v in self.inverted_lists.values())}") return self def search(self, query, nprobe=1): """Search nprobe closest clusters.""" # Find closest centroids diff = self.centroids - query distances = np.sum(diff ** 2, axis=1) closest_clusters = np.argsort(distances)[:nprobe] # Search vectors in those clusters candidate_indices = [] for cluster_id in closest_clusters: candidate_indices.extend(self.inverted_lists[cluster_id]) # Brute force among candidates candidates = np.array(candidate_indices) diff = self.vectors[candidates] - query dists = np.sum(diff ** 2, axis=1) top_k = np.argsort(dists)[:10] return dists[top_k], candidates[top_k] ``` ### How IVF Reduces Work If you have 1M vectors and 1000 clusters, each cluster has ~1000 vectors. Searching 1 cluster requires 1000 distance computations instead of 1M. With nprobe=10, you search 10k vectors—still 100x faster than brute force. The catch: you only find vectors in the searched clusters. If the true nearest neighbors span multiple clusters, you miss them.20 min
  5. 05IVF Training and SearchIVF training isn't just clustering—it's configuring k-means for your specific data distribution. Poor training produces centroids that don't represent your vector space, causing queries to land in sparse or unrepresentative clusters. ### The Training Process Training IVF has two phases: centroid estimation and vector assignment. **Phase 1: K-means for centroid estimation** K-means finds cluster centroids that minimize within-cluster sum of squares. The algorithm iteratively assigns vectors to nearest centroid and recomputes centroids. ```python import numpy as np def kmeans_train(vectors, n_clusters, max_iter=100, tol=1e-4): """ Simplified k-means for understanding IVF training. Production: use sklearn.MiniBatchKMeans or FAISS's k-means. """ dim = vectors.shape[1] # Initialize centroids randomly from data points indices = np.random.choice(len(vectors), n_clusters, replace=False) centroids = vectors[indices].copy() for iteration in range(max_iter): # Assignment step: assign each vector to nearest centroid diff = vectors[:, np.newaxis] - centroids[np.newaxis, :] distances = np.sum(diff ** 2, axis=2) assignments = np.argmin(distances, axis=1) # Update step: compute new centroids new_centroids = np.zeros_like(centroids) for c in range(n_clusters): mask = assignments == c if mask.sum() > 0: new_centroids[c] = vectors[mask].mean(axis=0) else: # Empty cluster: reinitialize randomly new_centroids[c] = vectors[np.random.randint(len(vectors))] # Check convergence shift = np.linalg.norm(new_centroids - centroids) centroids = new_centroids if iteration % 10 == 0: print(f" Iteration {iteration}: shift={shift:.4f}") if shift < tol: print(f" Converged at iteration {iteration}") break return centroids, assignments ``` ### Assignment Strategy After finding centroids, you assign all vectors to their nearest cluster. This becomes the inverted index. The critical question: do you store the original vectors or residuals (vector minus centroid)? **IVF-Flat (no quantization):** Store originals. Lower recall variance, but higher memory. **IVF-PQ:** Store residuals compressed via product quantization. Higher recall variance, but lower memory. ### Search Pipeline At query time, IVF performs: 1. **Centroid search:** Find nprobe closest centroids to query 2. **List scanning:** Retrieve vectors from selected clusters 3. **Distance computation:** Compute distances to all candidates 4. **Top-k selection:** Return k smallest distances ```python def ivf_search(centroids, inverted_lists, vectors, query, nprobe, k=10): """Complete IVF search pipeline.""" # Step 1: Find closest centroids diff = centroids - query centroid_distances = np.sum(diff ** 2, axis=1) closest_centroids = np.argsort(centroid_distances)[:nprobe] print(f"Query -> Centroid distances: {centroid_distances[closest_centroids]}") # Step 2: Collect candidate indices candidate_indices = [] for c in closest_centroids: candidate_indices.extend(inverted_lists[c]) print(f"Scanning {len(candidate_indices)} candidates from {nprobe} clusters") # Step 3: Compute distances to candidates candidates = vectors[candidate_indices] diff = candidates - query distances = np.sum(diff ** 2, axis=1) # Step 4: Return top-k top_k_idx = np.argsort(distances)[:k] return distances[top_k_idx], np.array(candidate_indices)[top_k_idx] # Example execution vectors = np.random.rand(50000, 128).astype('float32') centroids, assignments = kmeans_train(vectors, n_clusters=100) inverted_lists = {i: np.where(assignments == i)[0].tolist() for i in range(100)} query = np.random.rand(128).astype('float32') distances, indices = ivf_search(centroids, inverted_lists, vectors, query, nprobe=5, k=10) print(f"Top-10 distances: {distances}") ``` ### Failure Modes **Empty clusters:** If a cluster has no vectors, queries to that region produce no results. K-means with random initialization can produce empty clusters. **Unbalanced clusters:** Data with uneven density produces some clusters with many vectors, others with few. Search time varies dramatically based on which clusters are queried. **Non-spherical clusters:** K-means assumes spherical clusters. If your data forms elongated manifolds (common in deep learning embeddings), centroids won't represent the data well.20 min
  6. 06IVF ParametersIVF parameters interact. Doubling nprobe roughly doubles query time but doesn't double recall. The relationship is non-linear and data-dependent—you must measure, not guess. ### Number of Clusters (nlist) The number of clusters determines the granularity of your partition. Guidelines: - **Too few clusters:** Large clusters, higher recall per probe, but fewer candidates to search means still potentially slow - **Too many clusters:** Small clusters, more clusters to probe for good recall, higher overhead Typical rule of thumb: nlist = √n (square root of dataset size) to √10n. For 1M vectors, that's 1k to 10k clusters. ```python # Rule-of-thumb calculation def recommended_nlist(n_vectors): return int(np.sqrt(n_vectors)) for n in [10_000, 100_000, 1_000_000, 10_000_000]: print(f"{n:>10,}: nlist ≈ {recommended_nlist(n):>8,}") ``` ### nprobe nprobe controls how many clusters you search. Higher nprobe = higher recall = slower queries. The relationship isn't linear: - nprobe=1 searches 1 cluster, but recall might be only 60% - nprobe=10 searches 10 clusters, but recall might only reach 90% - nprobe=50 might reach 98% but take 5x longer than nprobe=10 ```python import numpy as np import matplotlib.pyplot as plt def measure_recall_curve(index, vectors, queries, max_nprobe=50): """Measure recall vs nprobe.""" results = [] for nprobe in range(1, max_nprobe + 1, 2): recall = evaluate_recall(index, vectors, queries[:50], k=10, nprobe=nprobe) results.append((nprobe, recall)) return results # This would plot the curve # results = measure_recall_curve(index, vectors, queries) # plt.plot([r[0] for r in results], [r[1] for r in results]) # plt.xlabel('nprobe') # plt.ylabel('Recall@10') # plt.show() ``` ### Memory Considerations IVF without quantization stores: - Vectors: n × dim × 4 bytes (float32) - Centroids: nlist × dim × 4 bytes - Inverted lists: n × 4 bytes (indices) For 1M vectors, 768 dimensions: ~3GB for vectors + small overhead. ### Choosing Parameters by Latency Budget If you have a 50ms latency budget: 1. Benchmark brute force: if it's already <50ms, you might not need IVF 2. Start with nlist = √n, nprobe = 1 3. Measure recall and latency 4. If recall is too low, increase nprobe 5. If latency is too high, decrease nprobe or increase nlist (smaller clusters) ```python def tune_ivf(vectors, queries, latency_budget_ms, target_recall=0.95): """ Find IVF parameters that meet latency budget while achieving target recall. """ n = len(vectors) nlist = int(np.sqrt(n)) # Build index kmeans = MiniBatchKMeans(n_clusters=nlist, batch_size=4096) kmeans.fit(vectors) centroids = kmeans.cluster_centers_ assignments = kmeans.predict(vectors) # Binary search on nprobe low, high = 1, nlist best_nprobe = 1 while low <= high: mid = (low + high) // 2 latency = benchmark_search_latency(centroids, vectors, assignments, queries, nprobe=mid) recall = evaluate_recall(vectors, queries, nprobe=mid) if latency <= latency_budget_ms and recall >= target_recall: best_nprobe = mid high = mid - 1 else: low = mid + 1 return nlist, best_nprobe ```20 min
  7. 07HNSW: Hierarchical Navigable Small WorldHNSW is essentially a highway system for vectors. The top layer is sparse highways that get you close to your destination fast. The bottom layer is dense local roads that find the exact nearest neighbors. ### The Multi-Layer Graph HNSW builds a layered graph where: - Layer 0 contains all vectors (dense) - Layer 1 contains ~50% of vectors (sparse) - Layer 2 contains ~25% of vectors (sparsest) - And so on... Search starts at the top layer, finds the closest node, and descends to the next layer, repeating until layer 0. ```python import numpy as np from dataclasses import dataclass from typing import List, Dict, Set @dataclass class Node: id: int vector: np.ndarray neighbors: Dict[int, Set[int]] # layer -> set of neighbor node IDs def __hash__(self): return hash(self.id) class HNSW: """ Simplified HNSW implementation for educational purposes. Production systems use optimized structures and additional pruning. """ def __init__(self, m: int = 16, max_layers: int = None): self.m = m # Number of connections per node per layer self.max_layers = max_layers self.nodes: Dict[int, Node] = {} self.entry_point: Node = None def _get_layers(self, num_vectors: int) -> List[int]: """ Determine layer for each node. Layer selection follows geometric distribution: P(layer > L) = exp(-L / lambda) Lambda controls average number of layers. """ if self.max_layers is None: # Auto-calculate based on size self.max_layers = int(np.log2(num_vectors)) + 1 # Geometric distribution for layer assignment lamb = 1.0 / np.log(2) # Average ~1 layer per node layers = [] for _ in range(num_vectors): l = 0 while l < self.max_layers and np.random.random() < 1.0 / np.exp(l / lamb): l += 1 layers.append(l - 1) if l > 0 else layers.append(0) return layers def add(self, vectors: np.ndarray): """Add vectors to HNSW index.""" if len(self.nodes) == 0: # First node becomes entry point first_node = Node(id=0, vector=vectors[0], neighbors={}) for layer in range(self.max_layers or 1): first_node.neighbors[layer] = set() self.nodes[0] = first_node self.entry_point = first_node vectors = vectors[1:] node_layers = self._get_layers(len(vectors)) for i, vec in enumerate(vectors): node_id = len(self.nodes) node = Node(id=node_id, vector=vec, neighbors={}) # Determine max layer for this node max_layer = node_layers[i] for layer in range(max_layer + 1): node.neighbors[layer] = set() # Search for insertion point at this layer entry = self.entry_point for l in range(len(self.nodes), -1, -1): if l > max_layer: continue best = self._search_layer(entry, vec, layer, ef=1)[0] entry = best # Find neighbors and connect neighbors = self._search_layer(entry, vec, layer, ef=self.m) for neighbor in neighbors[:self.m]: node.neighbors[layer].add(neighbor.id) neighbor.neighbors[layer].add(node.id) self.nodes[node_id] = node def _search_layer(self, entry: Node, query: np.ndarray, layer: int, ef: int = 1) -> List[Node]: """ Search within a single layer. ef: search width (number of candidates to keep) """ visited = {entry.id} candidates = [(0, entry)] # (distance, node) best = entry while candidates: # Get furthest element _, current = heapq.heappop(candidates) # Get node at max layer if current.neighbors.get(layer): for neighbor_id in current.neighbors[layer]: if neighbor_id in visited: continue visited.add(neighbor_id) neighbor = self.nodes[neighbor_id] dist = np.linalg.norm(neighbor.vector - query) if dist < np.linalg.norm(best.vector - query): best = neighbor # Add to candidates if promising heapq.heappush(candidates, (dist, neighbor)) return best ``` ### Why HNSW Works HNSW exploits the small world phenomenon: in high-dimensional spaces, local neighborhood and long-range connections coexist. By organizing vectors into a navigable small world graph, you can traverse from anywhere to anywhere in logarithmic time. The top layers provide "long-range" connections that skip large portions of the space. The bottom layers provide local precision.20 min
  8. 08HNSW ConstructionHNSW construction is O(n log n) if done correctly—each insertion costs roughly O(log n) to find its layer and position, then O(m log n) to maintain connections. But the quality of connections matters for search performance. ### Insertion Algorithm When inserting a new vector: 1. **Determine its maximum layer** via geometric distribution 2. **For each layer up to max_layer:** - Search from current entry point to find where the new vector fits - Connect to nearest neighbors at that layer 3. **Update entry point** if new node is closer to origin ```python import heapq def hnsw_insert(hnsw, vector: np.ndarray, ef_construction: int = 100): """ Insert a vector into HNSW. Args: hnsw: HNSW index vector: vector to insert ef_construction: size of candidate pool during construction """ node_id = len(hnsw.nodes) node = Node(id=node_id, vector=vector, neighbors={}) # Determine max layer for this node max_layer = int(-np.log(np.random.random()) * 1.0 / np.log(2)) max_layer = min(max_layer, hnsw.max_layers or 10) # Start from entry point current_entry = hnsw.entry_point level = hnsw.max_layers or 10 # Descend from top to max_layer for l in range(level, max_layer, -1): candidates = [(np.linalg.norm(current_entry.vector - vector), current_entry)] visited = {current_entry.id} while candidates: dist, candidate = heapq.heappop(candidates) entry_dist = np.linalg.norm(current_entry.vector - vector) if dist > entry_dist: break for neighbor_id in candidate.neighbors.get(l, []): if neighbor_id in visited: continue visited.add(neighbor_id) neighbor = hnsw.nodes[neighbor_id] d = np.linalg.norm(neighbor.vector - vector) if d < entry_dist or len(candidates) < ef_construction: heapq.heappush(candidates, (d, neighbor)) # Track best entry for this layer if d < entry_dist: current_entry = neighbor entry_dist = d # Now at max_layer, begin actual insertion for l in range(max_layer, -1, -1): node.neighbors[l] = set() # Search for neighbors at this layer candidates = [(np.linalg.norm(current_entry.vector - vector), current_entry)] visited = {current_entry.id} results = [] while candidates: dist, candidate = heapq.heappop(candidates) results.append((dist, candidate)) for neighbor_id in candidate.neighbors.get(l, []): if neighbor_id in visited: continue visited.add(neighbor_id) neighbor = hnsw.nodes[neighbor_id] d = np.linalg.norm(neighbor.vector - vector) heapq.heappush(candidates, (d, neighbor)) # Sort and take closest m neighbors results.sort() neighbors = [node for _, node in results[:hnsw.m]] for neighbor in neighbors: node.neighbors[l].add(neighbor.id) neighbor.neighbors[l].add(node_id) hnsw.nodes[node_id] = node # Update entry point if necessary if hnsw.entry_point is None or \ np.linalg.norm(vector) < np.linalg.norm(hnsw.entry_point.vector): hnsw.entry_point = node return node_id ``` ### ef_construction Parameter ef_construction controls how thoroughly you search during construction. Higher values produce better connections but slow construction. - ef_construction = 100-200 is typical for production - Too low (ef_construction = m): edges may not connect to true nearest neighbors - Too high: diminishing returns, much slower construction ### Construction Order Matters HNSW is sensitive to insertion order. If you insert similar vectors in sequence, they form local clusters but miss connections to distant clusters. Random shuffling of insertions typically produces better results. ```python def build_hnsw_batched(vectors: np.ndarray, m: int = 16, ef_construction: int = 100): """ Build HNSW index with randomized insertion order. """ hnsw = HNSW(m=m) # Shuffle indices for better graph structure indices = np.random.permutation(len(vectors)) shuffled = vectors[indices] # Insert with shuffled order for i, vec in enumerate(shuffled): hnsw_insert(hnsw, vec, ef_construction) if (i + 1) % 10000 == 0: print(f"Inserted {i+1}/{len(vectors)} vectors") return hnsw # Benchmark construction import time vectors = np.random.randn(100_000, 128).astype('float32') start = time.time() hnsw = build_hnsw_batched(vectors, m=16, ef_construction=100) elapsed = time.time() - start print(f"Construction time: {elapsed:.2f}s for {len(vectors)} vectors") print(f"Rate: {len(vectors)/elapsed:.0f} vectors/sec") ```25 min
  9. 09HNSW SearchHNSW search is a guided greedy walk. At each layer, you examine local neighbors and move to the one closest to your query. This greedy approach works because the graph is constructed to be navigable—neighbors are arranged so greedy walks reach good solutions. ### Search Algorithm ```python def hnsw_search(hnsw, query: np.ndarray, ef: int = 10, k: int = 10) -> List[int]: """ Search HNSW for k nearest neighbors. Args: hnsw: HNSW index query: query vector ef: search width (larger = more thorough = higher recall) k: number of neighbors to return Returns: List of k nearest node IDs """ if hnsw.entry_point is None: return [] # Get top layer top_layer = max(hnsw.entry_point.neighbors.keys()) if hnsw.entry_point.neighbors else 0 # Phase 1: Descend from top to layer 1 current = hnsw.entry_point for layer in range(top_layer, 0, -1): improved = True while improved: improved = False current_dist = np.linalg.norm(current.vector - query) for neighbor_id in current.neighbors.get(layer, []): neighbor = hnsw.nodes[neighbor_id] dist = np.linalg.norm(neighbor.vector - query) if dist < current_dist: current = neighbor improved = True break # Phase 2: Search layer 0 with ef parameter candidates = [(np.linalg.norm(current.vector - query), current)] visited = {current.id} results = [] # (distance, node) while candidates: dist, node = heapq.heappop(candidates) heapq.heappush(results, (-dist, node)) # Max heap via negative # If we have enough results and this candidate is worse, stop if len(results) >= ef and dist > -results[0][0]: break for neighbor_id in node.neighbors.get(0, []): if neighbor_id in visited: continue visited.add(neighbor_id) neighbor = hnsw.nodes[neighbor_id] dist = np.linalg.norm(neighbor.vector - query) # Add if promising if len(results) < ef or dist < -results[0][0]: heapq.heappush(candidates, (dist, neighbor)) heapq.heappush(results, (-dist, neighbor)) # Extract top k results.sort(reverse=True) return [node.id for _, node in results[:k]] ``` ### The ef Parameter ef controls the search width at the bottom layer. Higher ef = more candidates examined = higher recall = slower search. The relationship: - ef=1: Pure greedy, very fast, poor recall - ef=10-50: Good balance for most use cases - ef=100+: Near-perfect recall, but approaching brute force speed ```python def benchmark_ef(hnsw, queries, ground_truth): """Measure recall vs ef parameter.""" results = [] for ef in [1, 5, 10, 25, 50, 100, 200]: total_recall = 0 for query, gt in zip(queries, ground_truth): retrieved = hnsw_search(hnsw, query, ef=ef, k=10) recall = len(set(retrieved) & set(gt)) / 10 total_recall += recall avg_recall = total_recall / len(queries) results.append((ef, avg_recall)) print(f"ef={ef:3d}: recall@10 = {avg_recall:.4f}") return results # Run benchmark recall_by_ef = benchmark_ef(hnsw, queries[:100], gt) ``` ### Why ef Works With ef > 1, you maintain a pool of candidates rather than just tracking the best. This lets you explore alternatives that might lead to better solutions. The key insight: the best solution at layer 0 might not be reachable from the best node at layer 1. By examining more candidates, you find paths that pure greedy would miss. ### Trade-offs | ef | Recall | Latency | Memory | |----|--------|---------|--------| | 1 | ~0.7 | Fastest | Low | | 10 | ~0.95 | Baseline | Baseline | | 100 | ~0.99 | 5-10x baseline | Baseline | | 1000 | ~1.0 | Approaches brute force | Baseline | Memory is constant because HNSW stores the graph structure; ef only affects search-time computation.25 min
  10. 10HNSW ParametersHNSW parameters create a direct trade-off triangle: build time, memory consumption, and query recall. You can optimize any two at the expense of the third. Profile your actual recall requirements rather than assuming maximum quality is always best.20 min
  11. 11Product QuantizationPQ achieves extreme compression by accepting reconstruction error. The key insight is that small errors in individual subspace dimensions often cancel out in distance calculations, making PQ distances correlate well with true Euclidean distances despite imperfect reconstruction.20 min
  12. 12PQ Encoding and SearchPQ search is fundamentally a table lookup problem. The art is structuring your data to maximize cache hits during distance accumulation. The precomputed distance tables should fit in L2 cache if possible.25 min
  13. 13SIMD Distance ComputationDistance computation is rarely the bottleneck in a vector database—graph traversal, memory bandwidth, and cache efficiency matter more. Only optimize distance computation after profiling identifies it as a hot path.25 min
  14. 14Multi-Tier IndexingMulti-tier indexing is fundamentally a budget allocation problem. Your goal is to maximize recall per dollar spent on storage and compute. The optimal tier sizes depend on your query distribution—profile before committing to a tier split.25 min
  15. 15Distributed SearchDistributed vector search is fundamentally constrained by the coordination overhead between nodes. Minimize communication by sending queries in parallel, limiting the number of nodes queried per request, and designing for approximate rather than exact results.25 min
  16. 16Index PersistenceIndex persistence is a trade-off between recovery speed and write throughput. More frequent checkpoints mean faster recovery but more overhead. Profile your crash recovery time at different checkpoint intervals to find the right balance.25 min
  17. 17Filtering and Hybrid SearchFiltering fundamentally conflicts with HNSW's graph-based navigation. The best strategy depends on your filter selectivity: pre-filter for rare filters (high cardinality), post-filter for common filters, and indexed filtering for categorical fields with moderate cardinality.25 min
  18. 18Minimal Vector Database ProjectA minimal vector database is surprisingly small—under 500 lines of code implements core functionality. The engineering complexity lies in making it fast, reliable, and scalable. Start simple, measure everything, and optimize based on actual bottlenecks.30 min