13. SIMD Distance Computation
SIMD (Single Instruction Multiple Data) instructions process multiple floating-point operations per CPU cycle. Vector distance computation is embarrassingly parallel—making it ideal for SIMD acceleration.
Available SIMD Instruction Sets
| Instruction Set | Vector Width | Operations/Cycle |
|---|---|---|
| SSE4.2 | 128 bits | 4 float32 |
| AVX2 | 256 bits | 8 float32 |
| AVX-512 | 512 bits | 16 float32 |
Most modern servers support AVX2; high-end processors support AVX-512.
Manual SIMD with NumPy
NumPy automatically uses SIMD when operations are contiguous:
import numpy as np
def euclidean_distance_numpy(a: np.ndarray, b: np.ndarray) -> np.ndarray:
"""
NumPy automatically vectorizes this using available SIMD instructions.
For contiguous float32 arrays, this compiles to optimized assembly.
"""
return np.sqrt(np.sum((a - b) ** 2, axis=1))
# Verify SIMD usage
# On Linux: cat /proc/cpuinfo | grep flags | grep avx
# Or compile with: gcc -O3 -mavx2 -mfma source.c
Explicit SIMD with Numba
For explicit control, use Numba's SIMD intrinsic functions:
from numba import njit, prange
import numpy as np
@njit('float32[:](float32[:,:], float32[:])')
def compute_distances_numba(vectors: np.ndarray, query: np.ndarray) -> np.ndarray:
"""
Compute distances from query to all vectors.
Numba generates SIMD code automatically with parallel=True.
"""
n = vectors.shape[0]
distances = np.empty(n, dtype=np.float32)
for i in prange(n): # prange enables parallel execution
diff = 0.0
for j in range(vectors.shape[1]):
d = vectors[i, j] - query[j]
diff += d * d
distances[i] = np.sqrt(diff)
return distances
AVX2 Implementation
For precise control, write explicit AVX2 intrinsics:
// distance_avx2.c - compile with: gcc -O3 -mavx2 -mfma distance_avx2.c -o distance
#include <immintrin.h>
#include <stdint.h>
void distance_avx2(const float* a, const float* b, float* out, int n, int dim) {
for (int i = 0; i < n; i++) {
__m256 sum = _mm256_setzero_ps();
for (int j = 0; j < dim; j += 8) {
__m256 va = _mm256_loadu_ps(&a[i * dim + j]);
__m256 vb = _mm256_loadu_ps(&b[j]);
__m256 diff = _mm256_sub_ps(va, vb);
sum = _mm256_fmadd_ps(diff, diff, sum); // multiply-add fused
}
// Horizontal sum of 8 floats
float temp[8];
_mm256_storeu_ps(temp, sum);
out[i] = sqrtf(temp[0] + temp[1] + temp[2] + temp[3] +
temp[4] + temp[5] + temp[6] + temp[7]);
}
}
Benchmarking Results
Expect 3-5x speedup from SIMD versus scalar code. AVX-512 typically provides 10-15x improvement over naive implementations due to wider registers and better pipelining.
# Benchmark comparison
import time
def benchmark(func, *args, iterations=1000):
start = time.perf_counter()
for _ in range(iterations):
func(*args)
return (time.perf_counter() - start) / iterations
# Typical results:
# Scalar: 50ms per 100K vectors
# NumPy: 12ms per 100K vectors (4x speedup)
# Numba: 8ms per 100K vectors (6x speedup)
# AVX2: 5ms per 100K vectors (10x speedup)
Failure Modes
Alignment requirements: AVX2 requires 32-byte alignment for best performance. _mm256_loadu_ps handles unaligned loads but is slower. Ensure your arrays are aligned when possible.
Throttling: Sustained AVX2/AVX-512 use can cause thermal throttling on some CPUs. The "AVX offset" reduces base clock speed. Consider this when capacity planning.
Write a benchmark comparing numpy, numba, and explicit AVX2 implementations for computing 10,000 distances. Measure throughput and verify the speedup matches expected SIMD gains. Profile with perf to confirm SIMD instructions are being executed.
# Build and run:
gcc -O3 -mavx2 -mfma -o distance_avx2 distance_avx2.c
perf stat -e instructions,cycles ./benchmark.py