HOW-TO · RAG

How to Optimize Vector Search Query Performance

intermediate · 20 min · By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

Vector database with indexed embeddings, benchmark tool

What this does

Vector search query optimization aims to make your retrieval layer return results as quickly as possible while keeping recall at an acceptable level. This involves tuning index parameters, adjusting the ANN algorithm, batching queries, and choosing an embedding dimension that your database handles efficiently.

Steps

  1. Profile current performance. Measure baseline latency over at least 500 queries under realistic load.
python benchmark_queries.py --host localhost --port 6333 \
  --collection my_collection --num-queries 500
  2. Check your ANN index configuration. HNSW parameters usually have the most impact. Raising ef_construction at index build time produces a higher-quality graph, which gives higher recall at the same query-time ef in exchange for a slower build. Raising ef at query time trades latency for recall.

  3. Validate that recall is acceptable before aggressively optimizing latency. Compare ANN results against exact nearest-neighbor results on your test set.

  4. Reduce embedding dimensionality if your vectors are oversized. Some models that produce 1536-dimension vectors retain most of their retrieval quality at 384–768 dimensions. Either retrain or switch to a model with a smaller output dimension, or truncate and re-normalize the vectors if the model was trained to support truncation (for example, with Matryoshka-style training). In both cases, re-index and re-check recall.

  5. Enable query batching if your pipeline processes multiple queries concurrently. Most vector databases serve one batched request more efficiently than the same queries sent one at a time, because per-request overhead is paid once.

  6. Adjust quantization settings. Product quantization (PQ) or scalar quantization (SQ) reduces memory footprint and typically increases throughput, at some cost to recall.
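
The recall check and the quantization trade-off described in the steps can be illustrated with a small, self-contained sketch. It is not tied to any particular vector database: the per-dimension 8-bit scheme, the helper names (fit_scalar_quantizer, top_k, recall_at_k), and the random Gaussian data are all assumptions made for illustration. It quantizes a toy corpus, searches the reconstructed vectors by brute force, and compares the result with exact search on the original vectors.

```python
import random

def fit_scalar_quantizer(vectors):
    """Learn a per-dimension (offset, step) pair that maps floats onto 0..255."""
    params = []
    for d in range(len(vectors[0])):
        column = [v[d] for v in vectors]
        lo, hi = min(column), max(column)
        step = (hi - lo) / 255 or 1.0  # fall back to 1.0 for a constant dimension
        params.append((lo, step))
    return params

def quantize(vector, params):
    """Encode one float vector as one byte per dimension."""
    return bytes(
        min(255, max(0, round((x - lo) / step)))
        for x, (lo, step) in zip(vector, params)
    )

def dequantize(codes, params):
    """Approximately reconstruct the float vector from its byte codes."""
    return [lo + c * step for c, (lo, step) in zip(codes, params)]

def top_k(query, vectors, k):
    """Exact (brute-force) k nearest neighbours by squared Euclidean distance."""
    def dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, vectors[i]))
    return sorted(range(len(vectors)), key=dist)[:k]

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy data (an assumption): random Gaussian vectors, not real embeddings.
random.seed(0)
DIM, N, K = 32, 300, 10
corpus = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(20)]

params = fit_scalar_quantizer(corpus)
codes = [quantize(v, params) for v in corpus]
restored = [dequantize(c, params) for c in codes]

recalls = [
    recall_at_k(top_k(q, restored, K), top_k(q, corpus, K))
    for q in queries
]
mean_recall = sum(recalls) / len(recalls)

# The float32 size assumes the database stores 4 bytes per dimension.
print(f"bytes per vector: {4 * DIM} as float32, {len(codes[0])} as uint8")
print(f"mean recall@{K} after quantization: {mean_recall:.3f}")
```

On real embeddings the recall loss depends on the data distribution and the quantizer your database actually uses, so treat the number printed here as a demonstration of the measurement, not as a prediction.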

Verification

After applying optimizations, run the benchmark again and compare:

python benchmark_queries.py --host localhost --port 6333 \
  --collection my_collection --num-queries 500 --ef 128

Expected output: Per-query latency should show a clear reduction. A healthy target is p50 latency under 20ms and p99 under 100ms for 1M-document collections. Confirm recall remains above 95% of the exact baseline:

python verify_recall.py --collection my_collection --ef 128

Expected output: Recall@10: 0.971 (or equivalent). Values below 0.95 indicate the index configuration is too aggressive.
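
For readers who want to see where p50 and p99 figures come from, the following is a minimal sketch of computing them from raw per-query timings. The fake_query function is a hypothetical stand-in for a real search call, and the warm-up count is an arbitrary choice; neither comes from the benchmark script used above.

```python
import statistics
import time

def latency_percentiles(samples_ms):
    """Return (p50, p99) for a list of per-query latencies in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]

def time_queries(run_query, queries, warmup=10):
    """Time each call to run_query and return latencies in milliseconds."""
    for q in queries[:warmup]:  # untimed warm-up so cold caches do not skew results
        run_query(q)
    samples = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        samples.append((time.perf_counter() - start) * 1000)
    return samples

def fake_query(q):
    """Hypothetical stand-in for a real vector search call."""
    return sum(x * x for x in q)

queries = [[float(i + j) for j in range(64)] for i in range(500)]
samples = time_queries(fake_query, queries)
p50, p99 = latency_percentiles(samples)
print(f"p50={p50:.4f} ms  p99={p99:.4f} ms  "
      f"meets targets: {p50 < 20 and p99 < 100}")
```

Replacing fake_query with a function that issues a real search request gives end-to-end latency, including network and client overhead, which is usually the figure that matters for a RAG pipeline.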

Common failures

  • Over-quantization reducing recall: Aggressive PQ settings cut latency but drop recall below acceptable thresholds. Always validate recall whenever changing quantization.
  • Inconsistent p99 latency: A cold index or cache thrashing causes sporadic spikes. Pre-warm the index or increase memory allocation.
  • Mismatched embedding models: Query embeddings generated by a different model than the stored embeddings produce poor retrieval even with an optimal index.
  • Index not rebuilt after configuration changes: Some databases require explicit index rebuilds for parameter changes to take effect.
  • Single-core CPU saturation: Single-threaded indexing saturates one core while the remaining cores on a multi-core machine sit idle. Enable parallel index building if your database supports it.
  • Version mismatch: The installed package or runtime differs from the command shown. Check the version first and rerun the smallest verification command.
  • Local environment drift: Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
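
For the last two failures, a short script can print which interpreter and binary are actually in use. This is a sketch under assumptions: the default tool name ollama is taken from the target environment above, and the script assumes the tool accepts a --version flag. Swap in whichever binary your setup depends on.

```python
import shutil
import subprocess
import sys

def environment_report(tool="ollama"):
    """Collect the interpreter and tool paths that are actually in use."""
    report = {
        "python": sys.executable,
        "python_version": sys.version.split()[0],
        "tool_path": shutil.which(tool),  # None if the tool is not on PATH
        "tool_version": None,
    }
    if report["tool_path"]:
        try:
            # Assumes the tool supports a --version flag.
            result = subprocess.run(
                [report["tool_path"], "--version"],
                capture_output=True, text=True, timeout=10,
            )
            report["tool_version"] = (result.stdout or result.stderr).strip()
        except (OSError, subprocess.TimeoutExpired):
            pass  # leave the version as None if the call fails
    return report

for key, value in environment_report().items():
    print(f"{key}: {value}")
```

If tool_path is None or points somewhere unexpected, such as a different virtual environment, fix that before adjusting any index parameters.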

Related guides

  • improve-embedding-quality-retrieval
  • implement-hybrid-search-keyword-semantic