RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Vector Database Internals

02. Vector Search Fundamentals

Chapter 2 of 18 · 20 min
KEY INSIGHT

Your embeddings define your search semantics. A poor embedding model produces a vector database that's "fast at finding irrelevant things."

### Vector Representations

Modern embeddings come from models like CLIP (images + text), BERT variants (text), or ResNet (images). Each produces a fixed-length vector, typically 128 to 2048 dimensions. These vectors are points in a high-dimensional space where semantically similar items cluster together.

The embedding model matters more than the index type. If your model maps "dog" and "puppy" to distant points, no indexing trick will make them retrievable together.

### Distance Metrics

Three metrics dominate vector search:

**Cosine Similarity** measures the angle between vectors, ignoring magnitude. Range: -1 to 1. Common for text embeddings where direction matters more than scale.

**L2 Distance (Euclidean)** measures straight-line distance. Range: 0 to ∞. Common for normalized embeddings and image similarity.

**Inner Product (Dot Product)** measures vector alignment, scaled by magnitude. Range: -∞ to ∞. Common when vectors have varying magnitudes and both magnitude and direction should influence the score.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the norms: depends only on the angle
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_distance(a, b):
    # Straight-line (Euclidean) distance between the two points
    return np.linalg.norm(a - b)

def inner_product(a, b):
    # Raw dot product: sensitive to both direction and magnitude
    return np.dot(a, b)

# Verify relationships
v1 = np.array([1.0, 0.0])
v2 = np.array([1.0, 1.0])

print(f"Cosine: {cosine_similarity(v1, v2):.4f}")  # 0.7071
print(f"L2: {l2_distance(v1, v2):.4f}")            # 1.0000
print(f"IP: {inner_product(v1, v2):.4f}")          # 1.0000
```

### Dimensionality and the Curse

High-dimensional spaces behave counterintuitively. As dimensions increase, the relative difference between the nearest and farthest neighbors shrinks toward zero, so you can't easily distinguish "close" from "far." This is the curse of dimensionality.

This is why approximate methods are necessary. Exact indexes that prune by distance lose their pruning power when distances barely discriminate between candidates, so exact search ends up examining essentially all vectors anyway. ANN methods exploit structure (clusters, graphs) that exists *before* the curse dominates completely.
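The shrinking gap between nearest and farthest neighbors described above can be observed directly. The following is a minimal sketch, assuming uniformly random points in the unit hypercube as a stand-in for real embeddings (real embeddings usually have more structure). The helper name `relative_contrast` and its parameters are illustrative, not part of any library.

```python
import numpy as np

def relative_contrast(dim, n_points=2000, seed=0):
    """(farthest - nearest) / nearest distance from one random query.

    Assumption: points are uniform in [0, 1]^dim, which is a worst case
    with no cluster structure at all.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

contrasts = {dim: relative_contrast(dim) for dim in [2, 16, 64, 256]}
for dim, contrast in contrasts.items():
    print(f"Dim {dim:3d}: relative contrast = {contrast:.3f}")
```

A large value means the nearest neighbor is far closer than the farthest point. As the value falls toward zero, every point is roughly the same distance from the query and distance-based pruning has little to work with.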

Before diving into indexes, you need to understand what you're actually searching and how distances are measured. The choice of vector representation and distance metric affects everything downstream.
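To make the downstream effect of the metric choice concrete, here is a small sketch with hand-picked toy vectors (the data and the `rank_by` helper are hypothetical, chosen only for illustration). On raw vectors the three metrics each pick a different top result. After scaling every vector to unit length they agree, because for unit vectors the squared L2 distance equals 2 minus twice the inner product.

```python
import numpy as np

def rank_by(query, docs, metric):
    """Return document indices ordered best-first under the given metric."""
    if metric == "cosine":
        scores = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
        return np.argsort(-scores)          # higher similarity is better
    if metric == "ip":
        return np.argsort(-(docs @ query))  # higher inner product is better
    if metric == "l2":
        return np.argsort(np.linalg.norm(docs - query, axis=1))  # lower is better
    raise ValueError(f"unknown metric: {metric}")

# Toy data (assumed for illustration): same rough direction, very different magnitudes
query = np.array([1.0, 0.0])
docs = np.array([[0.9, 0.1], [5.0, 5.0], [0.5, 0.0]])

metrics = ("cosine", "l2", "ip")
raw = {m: rank_by(query, docs, m).tolist() for m in metrics}

# Normalize each document to unit length (the query already has norm 1)
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
unit = {m: rank_by(query, unit_docs, m).tolist() for m in metrics}

print("raw: ", raw)
print("unit:", unit)
```

If the embedding model was trained with a particular metric in mind, matching that metric at search time (or normalizing the vectors so the choice stops mattering) avoids silently changing which results count as "nearest."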

EXERCISE

Generate vectors in 2D, 16D, 64D, and 256D. For each dimensionality, compute the ratio between the 10th-nearest-neighbor distance and the 1st-nearest-neighbor distance, taking the median of each across all points. Watch how this ratio shrinks toward 1 as dimensionality increases: the 10th neighbor becomes almost as close as the 1st, which is why search becomes harder.

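import numpy as np

def nearest_ratio(dim, n_points=1000, seed=0):
    rng = np.random.default_rng(seed)
    vectors = rng.random((n_points, dim))
    # Brute-force pairwise distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # which avoids materializing an (n_points, n_points, dim) array (about 2 GB at 256D)
    sq_norms = np.sum(vectors ** 2, axis=1)
    sq_dists = sq_norms[:, np.newaxis] + sq_norms[np.newaxis, :] - 2.0 * vectors @ vectors.T
    dists = np.sqrt(np.maximum(sq_dists, 0.0))  # clip tiny negatives caused by rounding
    np.fill_diagonal(dists, np.inf)  # exclude each point's distance to itself
    sorted_dists = np.sort(dists, axis=1)
    first_dist = np.median(sorted_dists[:, 0])  # median 1st-nearest-neighbor distance
    tenth_dist = np.median(sorted_dists[:, 9])  # median 10th-nearest-neighbor distance
    return tenth_dist / first_dist

for dim in [2, 16, 64, 256]:
    ratio = nearest_ratio(dim)
    print(f"Dim {dim:3d}: 10th/1st ratio = {ratio:.4f}")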