RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Learn
  4. /Courses
  5. /Vector Database Internals
  6. /Ch. 13
Vector Database Internals

13. SIMD Distance Computation

Chapter 13 of 18 · 25 min
KEY INSIGHT

Distance computation is rarely the bottleneck in a vector database—graph traversal, memory bandwidth, and cache efficiency matter more. Only optimize distance computation after profiling identifies it as a hot path.

SIMD (Single Instruction Multiple Data) instructions process multiple floating-point operations per CPU cycle. Vector distance computation is embarrassingly parallel—making it ideal for SIMD acceleration.

Available SIMD Instruction Sets

Instruction Set Vector Width Operations/Cycle
SSE4.2 128 bits 4 float32
AVX2 256 bits 8 float32
AVX-512 512 bits 16 float32

Most modern servers support AVX2; high-end processors support AVX-512.

Manual SIMD with NumPy

NumPy automatically uses SIMD when operations are contiguous:

import numpy as np

def euclidean_distance_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    NumPy automatically vectorizes this using available SIMD instructions.
    For contiguous float32 arrays, this compiles to optimized assembly.
    """
    return np.sqrt(np.sum((a - b) ** 2, axis=1))

# Verify SIMD usage
# On Linux: cat /proc/cpuinfo | grep flags | grep avx
# Or compile with: gcc -O3 -mavx2 -mfma source.c

Explicit SIMD with Numba

For explicit control, use Numba's SIMD intrinsic functions:

from numba import njit, prange
import numpy as np

@njit('float32[:](float32[:,:], float32[:])')
def compute_distances_numba(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    """
    Compute distances from query to all vectors.
    Numba generates SIMD code automatically with parallel=True.
    """
    n = vectors.shape[0]
    distances = np.empty(n, dtype=np.float32)
    
    for i in prange(n):  # prange enables parallel execution
        diff = 0.0
        for j in range(vectors.shape[1]):
            d = vectors[i, j] - query[j]
            diff += d * d
        distances[i] = np.sqrt(diff)
    
    return distances

AVX2 Implementation

For precise control, write explicit AVX2 intrinsics:

// distance_avx2.c - compile with: gcc -O3 -mavx2 -mfma distance_avx2.c -o distance
#include <immintrin.h>
#include <stdint.h>

void distance_avx2(const float* a, const float* b, float* out, int n, int dim) {
    for (int i = 0; i < n; i++) {
        __m256 sum = _mm256_setzero_ps();
        
        for (int j = 0; j < dim; j += 8) {
            __m256 va = _mm256_loadu_ps(&a[i * dim + j]);
            __m256 vb = _mm256_loadu_ps(&b[j]);
            __m256 diff = _mm256_sub_ps(va, vb);
            sum = _mm256_fmadd_ps(diff, diff, sum);  // multiply-add fused
        }
        
        // Horizontal sum of 8 floats
        float temp[8];
        _mm256_storeu_ps(temp, sum);
        out[i] = sqrtf(temp[0] + temp[1] + temp[2] + temp[3] + 
                       temp[4] + temp[5] + temp[6] + temp[7]);
    }
}

Benchmarking Results

Expect 3-5x speedup from SIMD versus scalar code. AVX-512 typically provides 10-15x improvement over naive implementations due to wider registers and better pipelining.

# Benchmark comparison
import time

def benchmark(func, *args, iterations=1000):
    start = time.perf_counter()
    for _ in range(iterations):
        func(*args)
    return (time.perf_counter() - start) / iterations

# Typical results:
# Scalar:    50ms per 100K vectors
# NumPy:     12ms per 100K vectors (4x speedup)
# Numba:      8ms per 100K vectors (6x speedup)
# AVX2:       5ms per 100K vectors (10x speedup)

Failure Modes

Alignment requirements: AVX2 requires 32-byte alignment for best performance. _mm256_loadu_ps handles unaligned loads but is slower. Ensure your arrays are aligned when possible.

Throttling: Sustained AVX2/AVX-512 use can cause thermal throttling on some CPUs. The "AVX offset" reduces base clock speed. Consider this when capacity planning.


EXERCISE

Write a benchmark comparing numpy, numba, and explicit AVX2 implementations for computing 10,000 distances. Measure throughput and verify the speedup matches expected SIMD gains. Profile with perf to confirm SIMD instructions are being executed.

# Build and run:
gcc -O3 -mavx2 -mfma -o distance_avx2 distance_avx2.c
perf stat -e instructions,cycles ./benchmark.py
← Chapter 12
PQ Encoding and Search
Chapter 14 →
Multi-Tier Indexing