SIMD Distance Computation — Vector Database Internals (Chapter 13)

SIMD (Single Instruction Multiple Data) instructions process multiple floating-point operations per CPU cycle. Vector distance computation is embarrassingly parallel—making it ideal for SIMD acceleration.

Available SIMD Instruction Sets

Instruction Set	Vector Width	Operations/Cycle
SSE4.2	128 bits	4 float32
AVX2	256 bits	8 float32
AVX-512	512 bits	16 float32

Most modern servers support AVX2; high-end processors support AVX-512.

Manual SIMD with NumPy

NumPy automatically uses SIMD when operations are contiguous:

import numpy as np

def euclidean_distance_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """
    NumPy automatically vectorizes this using available SIMD instructions.
    For contiguous float32 arrays, this compiles to optimized assembly.
    """
    return np.sqrt(np.sum((a - b) ** 2, axis=1))

# Verify SIMD usage
# On Linux: cat /proc/cpuinfo | grep flags | grep avx
# Or compile with: gcc -O3 -mavx2 -mfma source.c

Explicit SIMD with Numba

For explicit control, use Numba's SIMD intrinsic functions:

from numba import njit, prange
import numpy as np

@njit('float32[:](float32[:,:], float32[:])')
def compute_distances_numba(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
    """
    Compute distances from query to all vectors.
    Numba generates SIMD code automatically with parallel=True.
    """
    n = vectors.shape[0]
    distances = np.empty(n, dtype=np.float32)
    
    for i in prange(n):  # prange enables parallel execution
        diff = 0.0
        for j in range(vectors.shape[1]):
            d = vectors[i, j] - query[j]
            diff += d * d
        distances[i] = np.sqrt(diff)
    
    return distances

AVX2 Implementation

For precise control, write explicit AVX2 intrinsics:

// distance_avx2.c - compile with: gcc -O3 -mavx2 -mfma distance_avx2.c -o distance
#include <immintrin.h>
#include <stdint.h>

void distance_avx2(const float* a, const float* b, float* out, int n, int dim) {
    for (int i = 0; i < n; i++) {
        __m256 sum = _mm256_setzero_ps();
        
        for (int j = 0; j < dim; j += 8) {
            __m256 va = _mm256_loadu_ps(&a[i * dim + j]);
            __m256 vb = _mm256_loadu_ps(&b[j]);
            __m256 diff = _mm256_sub_ps(va, vb);
            sum = _mm256_fmadd_ps(diff, diff, sum);  // multiply-add fused
        }
        
        // Horizontal sum of 8 floats
        float temp[8];
        _mm256_storeu_ps(temp, sum);
        out[i] = sqrtf(temp[0] + temp[1] + temp[2] + temp[3] + 
                       temp[4] + temp[5] + temp[6] + temp[7]);
    }
}

Benchmarking Results

Expect 3-5x speedup from SIMD versus scalar code. AVX-512 typically provides 10-15x improvement over naive implementations due to wider registers and better pipelining.

# Benchmark comparison
import time

def benchmark(func, *args, iterations=1000):
    start = time.perf_counter()
    for _ in range(iterations):
        func(*args)
    return (time.perf_counter() - start) / iterations

# Typical results:
# Scalar:    50ms per 100K vectors
# NumPy:     12ms per 100K vectors (4x speedup)
# Numba:      8ms per 100K vectors (6x speedup)
# AVX2:       5ms per 100K vectors (10x speedup)

Failure Modes

Alignment requirements: AVX2 requires 32-byte alignment for best performance. _mm256_loadu_ps handles unaligned loads but is slower. Ensure your arrays are aligned when possible.

Throttling: Sustained AVX2/AVX-512 use can cause thermal throttling on some CPUs. The "AVX offset" reduces base clock speed. Consider this when capacity planning.