How to Optimize RAG for Low Latency
Prerequisites: a deployed RAG pipeline and performance monitoring tools.
What this does
This guide explains how to reduce end-to-end latency in a Retrieval-Augmented Generation pipeline running on Ollama. By tuning retrieval granularity, embedding models, and chunking strategies, you can bring p95 latency under 1 second for typical single-document answers.
Steps
Profile the pipeline end-to-end. Identify which stage dominates latency: embedding, retrieval, or generation.

    import time, ollama, chromadb

    client = chromadb.Client()
    col = client.get_collection("docs")

    start = time.time()

    embed_start = time.time()  # measure embedding time
    query_vec = ollama.embeddings(model="mxbai-embed-large", prompt="your query")["embedding"]
    embed_time = (time.time() - embed_start) * 1000

    retrieve_start = time.time()
    results = col.query(query_embeddings=[query_vec], n_results=5)
    retrieve_time = (time.time() - retrieve_start) * 1000

    gen_start = time.time()
    response = ollama.generate(model="llama3.2", prompt=f"Context: {results['documents'][0]}\nQuestion: your query")
    gen_time = (time.time() - gen_start) * 1000

    print(f"Embed: {embed_time:.1f}ms | Retrieve: {retrieve_time:.1f}ms | Generate: {gen_time:.1f}ms | Total: {(time.time()-start)*1000:.1f}ms")

Expected output:

    Embed: 45.2ms | Retrieve: 12.1ms | Generate: 820.3ms | Total: 877.6ms

Switch to a faster embedding model.
mxbai-embed-large is optimized for speed and quality. Replace any large transformer embedder.

    # Use a lighter embedder - update your pipeline config
    model_name = "mxbai-embed-large"  # ~500MB, <50ms per query on CPU

Reduce chunk size and overlap. Smaller chunks (512 tokens with 64-token overlap) improve retrieval precision and reduce context size passed to the generator, cutting generation latency.
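
    def chunk_text(text, chunk_size=512, overlap=64):
        # Sizes are counted in whitespace-separated words, a rough proxy for tokens.
        words = text.split()
        step = chunk_size - overlap  # requires overlap < chunk_size
        chunks = []
        for i in range(0, len(words), step):
            chunks.append(" ".join(words[i:i + chunk_size]))
            if i + chunk_size >= len(words):
                break  # the tail is already covered; skip a redundant short chunk
        return chunks

Enable retrieval caching. Cache query vectors for repeated or near-duplicate queries.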

    from functools import lru_cache

    import ollama

    @lru_cache(maxsize=1000)
    def cached_embed(prompt):
        # A cache hit requires an identical prompt string.
        return ollama.embeddings(model="mxbai-embed-large", prompt=prompt)["embedding"]

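The caching step mentions near-duplicate queries, but lru_cache only returns a hit when the argument is identical. The following is a minimal sketch of one way to widen the hit rate, assuming that lowercasing and collapsing whitespace is an acceptable definition of "near-duplicate" for your queries. normalize_query and make_cached_embedder are hypothetical helpers, not part of the original pipeline, and the embedder is passed in as a callable so the sketch runs without a model.

```python
from functools import lru_cache
from typing import Callable, Sequence, Tuple


def normalize_query(prompt: str) -> str:
    """Hypothetical normalizer: lowercase and collapse runs of whitespace."""
    return " ".join(prompt.lower().split())


def make_cached_embedder(embed_fn: Callable[[str], Sequence[float]], maxsize: int = 1000):
    """Wrap any embedding callable with an LRU cache keyed on the normalized prompt."""

    @lru_cache(maxsize=maxsize)
    def _embed_normalized(key: str) -> Tuple[float, ...]:
        # Store a tuple so callers cannot mutate the cached vector in place.
        return tuple(embed_fn(key))

    def embed(prompt: str) -> Tuple[float, ...]:
        return _embed_normalized(normalize_query(prompt))

    embed.cache_info = _embed_normalized.cache_info  # expose hit/miss counters
    return embed


# Stand-in embedder for demonstration; swap in a real embedding call in practice.
calls = []


def fake_embed(text: str):
    calls.append(text)
    return [float(len(text)), 0.0]


embed = make_cached_embedder(fake_embed)
embed("What is RAG?")
embed("  what is   rag? ")  # same normalized key, so this is served from the cache
print(len(calls))                # 1
print(embed.cache_info().hits)   # 1
```

Note that in this sketch the embedder receives the normalized text rather than the original prompt, which may shift the vector slightly. If that matters, cache on the normalized key but embed the first-seen original string instead.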
Verification
python3 profile_latency.py
# Expected: Embed: <60ms | Retrieve: <20ms | Generate: <800ms | Total: <1000ms
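
The verification command reports a single run, while the target stated in this guide is a p95 figure. Below is a minimal sketch of how repeated timings could be summarized, assuming the pipeline is wrapped in a zero-argument callable. time_calls, percentile, fake_pipeline, and the nearest-rank method are illustrative choices, not part of the original script.

```python
import math
import time
from typing import Callable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn: Callable[[], object], runs: int = 20, warmup: int = 1) -> List[float]:
    """Return per-call latencies in milliseconds, skipping warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up so one-time model loading does not skew the samples
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


# Hypothetical stand-in for one full embed -> retrieve -> generate round trip.
def fake_pipeline():
    time.sleep(0.001)


samples = time_calls(fake_pipeline, runs=20)
print(f"p50: {percentile(samples, 50):.1f}ms | p95: {percentile(samples, 95):.1f}ms")
```

With only 20 runs, the nearest-rank p95 is the second-slowest sample, so a larger run count gives a steadier estimate.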
Common failures
- Generator latency dominates. If generation takes more than 800ms, switch to a smaller model (e.g., llama3.2:1b) or enable Ollama's streaming mode to reduce first-token latency.
- Embedding model not cached. The first call is slow (~500ms) while the model loads; subsequent calls reuse the in-memory model. Always warm up before profiling.
- Collection not indexed. ChromaDB builds indexes lazily. Call col.count() after loading to trigger index build.
- Large context passed to generator. Passing 10k tokens from retrieval degrades generation speed and quality. Limit to top-3 chunks.
- No GPU acceleration. Ollama falls back to CPU when it does not detect a supported GPU. Run nvidia-smi or ollama ps to confirm GPU usage, and adjust the num_gpu model option if too few layers are offloaded.
- Version mismatch - The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
- Local environment drift - Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
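
The first failure above suggests streaming to cut first-token latency. The following is a minimal sketch of measuring time to the first chunk separately from total generation time, written against any iterator of chunks so it runs without a model. measure_stream and fake_stream are hypothetical helpers, and passing in the iterator returned by ollama.generate(..., stream=True) is an assumption about how you would wire it into your pipeline.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[object]) -> Tuple[float, float, int]:
    """Consume a stream and return (first_chunk_ms, total_ms, chunk_count)."""
    start = time.perf_counter()
    first_ms = None
    count = 0
    for _ in chunks:
        if first_ms is None:
            # Time until the first chunk arrives: what a user perceives as responsiveness.
            first_ms = (time.perf_counter() - start) * 1000
        count += 1
    total_ms = (time.perf_counter() - start) * 1000
    if first_ms is None:
        first_ms = total_ms  # empty stream: nothing ever arrived
    return first_ms, total_ms, count


def fake_stream(n: int = 5, delay_s: float = 0.002) -> Iterator[str]:
    """Stand-in for a streamed generation that yields one chunk per delay."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"tok{i}"


first_ms, total_ms, count = measure_stream(fake_stream())
print(f"First chunk: {first_ms:.1f}ms | Total: {total_ms:.1f}ms | Chunks: {count}")
```

The timer starts inside measure_stream, so the measurement is only meaningful when the stream is lazy and does no work until it is iterated, as a Python generator does. Streaming does not shorten total generation time; it only moves the first visible output earlier.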