02. Vector Search Fundamentals

Chapter 2 of 18 · 20 min

KEY INSIGHT

Your embeddings define your search semantics. A poor embedding model produces a vector database that's "fast at finding irrelevant things."

### Vector Representations

Modern embeddings come from models like CLIP (images + text), BERT variants (text), or ResNet (images). Each produces a fixed-length vector, typically 128 to 2048 dimensions. These vectors are points in a high-dimensional space where semantically similar items cluster together.

The embedding model matters more than the index type. If your model maps "dog" and "puppy" to distant points, no indexing trick will make them retrievable together.

### Distance Metrics

Three metrics dominate vector search:

**Cosine Similarity** measures the cosine of the angle between vectors, ignoring magnitude. Range: -1 to 1. Common for text embeddings where direction matters more than scale.

**L2 Distance (Euclidean)** measures straight-line distance. Range: 0 to ∞. Common for normalized embeddings and image similarity.

**Inner Product (Dot Product)** measures vector alignment, scaled by both magnitudes. Range: -∞ to ∞. Common when vectors have varying magnitudes and magnitude carries meaning alongside direction.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_distance(a, b):
    return np.linalg.norm(a - b)

def inner_product(a, b):
    return np.dot(a, b)

# Verify relationships
v1 = np.array([1.0, 0.0])
v2 = np.array([1.0, 1.0])
print(f"Cosine: {cosine_similarity(v1, v2):.4f}")  # 0.7071
print(f"L2: {l2_distance(v1, v2):.4f}")            # 1.0000
print(f"IP: {inner_product(v1, v2):.4f}")          # 1.0000
```

### Dimensionality and the Curse

High-dimensional spaces behave counterintuitively. As dimensionality increases, the relative difference between the nearest and farthest neighbor distances shrinks toward zero, so you can't easily distinguish "close" from "far." This is the curse of dimensionality.

This is why approximate methods are necessary. Exact indexes lose their ability to prune and end up examining essentially all vectors anyway, because the distances are too similar to rule anything out. Approximate nearest neighbor (ANN) methods exploit structure (clusters, graphs) that exists *before* the curse dominates completely.
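As a baseline for the approximate methods mentioned above, exact search simply scores the query against every stored vector. The following is a minimal sketch, assuming L2 distance and in-memory NumPy arrays; `exact_knn`, its parameters, and the random data are illustrative names chosen here, not part of any library.

```python
import numpy as np

def exact_knn(query, vectors, k=5):
    """Return (indices, distances) of the k nearest vectors by L2 distance."""
    # Exact search: one distance per stored vector, so cost grows as n * dim.
    dists = np.linalg.norm(vectors - query, axis=1)
    # argpartition selects the k smallest without a full sort;
    # only those k are then sorted by distance.
    top_k = np.argpartition(dists, k - 1)[:k]
    top_k = top_k[np.argsort(dists[top_k])]
    return top_k, dists[top_k]

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 128))   # hypothetical stored embeddings
query = vectors[42] + 0.001           # slightly perturbed copy of vector 42

indices, distances = exact_knn(query, vectors, k=5)
print(indices[0])  # 42: the source of the perturbed query is its nearest neighbor
```

Every query touches all 10,000 vectors here, which is the cost that ANN indexes try to avoid by giving up a small amount of recall.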

Before diving into indexes, you need to understand what you're actually searching and how distances are measured. The choice of vector representation and distance metric affects everything downstream.
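To make the effect of metric choice concrete, here is a small sketch using hypothetical toy vectors and plain NumPy. It illustrates that inner product and cosine similarity can rank the same candidates differently when magnitudes vary, and that after L2-normalization all three metrics agree, because ||a - b||² = 2 - 2 a·b for unit vectors. The `normalize` helper is defined here for illustration only.

```python
import numpy as np

query = np.array([1.0, 0.0])
candidates = np.array([
    [0.9, 0.1],   # well aligned with the query, small magnitude
    [3.0, 3.0],   # 45 degrees off, large magnitude
])

def normalize(x):
    """Scale a vector (or each row of a matrix) to unit L2 length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Raw vectors: inner product rewards magnitude, cosine ignores it.
ip = candidates @ query
cos = normalize(candidates) @ normalize(query)
print(np.argmax(ip), np.argmax(cos))  # 1 0: the two metrics disagree

# Unit vectors: inner product equals cosine, and squared L2 = 2 - 2 * cosine,
# so all three metrics produce the same ranking.
q_unit, c_unit = normalize(query), normalize(candidates)
ip_unit = c_unit @ q_unit
l2_unit = np.linalg.norm(c_unit - q_unit, axis=1)
print(np.allclose(l2_unit ** 2, 2 - 2 * ip_unit))  # True
print(np.argmax(ip_unit) == np.argmin(l2_unit))    # True
```

In practice this means the metric should match how the embedding model was trained; if the vectors are normalized up front, the choice among the three affects speed and convenience rather than results.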

EXERCISE

Generate vectors in 2D, 16D, 64D, and 256D. For each dimensionality, compute the ratio between the 10th-nearest-neighbor distance and the nearest-neighbor distance, taking the median of each across all points. Watch how this ratio shrinks toward 1 as dimensionality increases, demonstrating why search becomes harder.

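import numpy as np

def nearest_ratio(dim, n_points=1000, seed=0):
    """Median 10th-nearest distance divided by median nearest distance."""
    rng = np.random.default_rng(seed)
    vectors = rng.random((n_points, dim))
    # Brute-force pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # which avoids allocating an (n_points, n_points, dim) intermediate array.
    sq_norms = np.sum(vectors ** 2, axis=1)
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * (vectors @ vectors.T)
    dists = np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives from rounding
    np.fill_diagonal(dists, np.inf)  # exclude each point's distance to itself
    sorted_dists = np.sort(dists, axis=1)
    first_dist = np.median(sorted_dists[:, 0])  # nearest neighbor
    tenth_dist = np.median(sorted_dists[:, 9])  # 10th nearest neighbor
    return tenth_dist / first_dist

for dim in [2, 16, 64, 256]:
    ratio = nearest_ratio(dim)
    print(f"Dim {dim:3d}: 10th/1st ratio = {ratio:.4f}")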