What this does

This guide explains how to reduce end-to-end latency in a Retrieval-Augmented Generation (RAG) pipeline running on Ollama. By profiling each stage, choosing a lighter embedding model, tuning chunk size, and caching query embeddings, you can aim for a p95 latency under 1 second for typical single-document answers.

Steps

Profile the pipeline end-to-end. Identify which stage dominates latency: embedding, retrieval, or generation.

```
import time

import chromadb
import ollama

# chromadb.Client() is in-memory; use chromadb.PersistentClient if your collection lives on disk
client = chromadb.Client()
col = client.get_collection("docs")

query = "your query"
start = time.perf_counter()  # perf_counter is monotonic, unlike time.time()

# Stage 1: embed the query
query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
embed_time = (time.perf_counter() - start) * 1000

# Stage 2: retrieve the nearest chunks
retrieve_start = time.perf_counter()
results = col.query(query_embeddings=[query_vec], n_results=5)
retrieve_time = (time.perf_counter() - retrieve_start) * 1000

# Stage 3: generate an answer from the retrieved chunks
context = "\n\n".join(results["documents"][0])
gen_start = time.perf_counter()
response = ollama.generate(model="llama3.2", prompt=f"Context: {context}\nQuestion: {query}")
gen_time = (time.perf_counter() - gen_start) * 1000

total_time = (time.perf_counter() - start) * 1000
print(f"Embed: {embed_time:.1f}ms | Retrieve: {retrieve_time:.1f}ms | Generate: {gen_time:.1f}ms | Total: {total_time:.1f}ms")
```

Example output (exact values depend on hardware and model): Embed: 45.2ms | Retrieve: 12.1ms | Generate: 820.3ms | Total: 877.6ms
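A single run says little about p95 latency, which is the target this guide sets. The sketch below is one way to collect repeated measurements and summarize them. `measure_latency_ms`, `summarize`, and the `fake_pipeline` stand-in are hypothetical names introduced here for illustration; replace `fake_pipeline` with a function that runs your embed, retrieve, and generate stages.

```python
import statistics
import time


def measure_latency_ms(run_once, runs=20, warmup=2):
    """Call run_once repeatedly and return the wall-clock latency of each run in ms."""
    for _ in range(warmup):
        run_once()  # warm-up calls are executed but not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000)
    return samples


def summarize(samples):
    """Return the median and p95 of the samples (inclusive percentile method)."""
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(samples, n=20, method="inclusive")[-1]
    return {"median_ms": statistics.median(samples), "p95_ms": p95}


if __name__ == "__main__":
    # Stand-in workload (assumption): swap in your real pipeline call.
    def fake_pipeline():
        time.sleep(0.01)

    print(summarize(measure_latency_ms(fake_pipeline, runs=20, warmup=2)))
```

With only 20 runs the p95 is a rough estimate; increase `runs` when you need a more stable number.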

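Switch to a faster embedding model if embedding dominates. mxbai-embed-large offers a reasonable balance of speed and retrieval quality; if your pipeline currently uses a larger transformer embedder, replace it. Vectors from different models are not comparable, so re-embed the stored documents whenever you change the embedder.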
```
# Use a lighter embedder: update your pipeline config
model_name = "mxbai-embed-large"  # hundreds of MB on disk; measure per-query latency on your own hardware
```

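Reduce chunk size and overlap. Smaller chunks (for example, 512 tokens with a 64-token overlap) improve retrieval precision and shrink the context passed to the generator, which cuts generation latency. The helper below counts whitespace-separated words as a rough stand-in for tokens.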

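```
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into overlapping chunks of whitespace-separated words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break  # this window already reached the end; avoid a redundant tail chunk
    return chunks
```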

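Enable retrieval caching. Cache query vectors so that repeated queries skip the embedding call. lru_cache matches on the exact query string, so near-duplicate queries hit the cache only if you normalize them first.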

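```
from functools import lru_cache

import ollama

@lru_cache(maxsize=1000)
def cached_embed(prompt):
    # The returned list is shared by every cache hit, so callers must not mutate it
    return ollama.embeddings(model="mxbai-embed-large", prompt=prompt)["embedding"]
```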
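To let trivially different queries share a cache entry, the key can be normalized before lookup. The following is a minimal sketch, assuming lowercasing and whitespace collapsing are acceptable for your queries. `normalize_query` and `make_cached_embedder` are hypothetical helpers, and `fake_embed` stands in for a real embedding call. Note that the embedder receives the normalized text, which may change the vector slightly compared with the raw query.

```python
from functools import lru_cache


def normalize_query(text):
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(text.lower().split())


def make_cached_embedder(embed_fn, maxsize=1000):
    """Wrap embed_fn (str -> sequence of floats) in an LRU cache keyed on the normalized query."""

    @lru_cache(maxsize=maxsize)
    def _embed_normalized(key):
        # Store a tuple so a caller cannot mutate the cached vector.
        return tuple(embed_fn(key))

    def embed(text):
        return _embed_normalized(normalize_query(text))

    embed.cache_info = _embed_normalized.cache_info  # expose hit/miss counters
    return embed


if __name__ == "__main__":
    calls = []

    def fake_embed(text):
        # Stand-in (assumption) for a real call such as ollama.embeddings(...)["embedding"].
        calls.append(text)
        return [float(len(text)), 0.0]

    embed = make_cached_embedder(fake_embed)
    embed("What is RAG?")
    embed("  what is   rag? ")  # same key after normalization, so served from the cache
    print(len(calls), embed.cache_info().hits)  # prints: 1 1
```

Convert the cached tuple with `list(...)` if the vector store expects a list.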

Verification

Save the profiling script from the first step as profile_latency.py, then run it:

```
python3 profile_latency.py
# Expected: Embed: <60ms | Retrieve: <20ms | Generate: <800ms | Total: <1000ms
```
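The comparison against these thresholds can also be automated. The sketch below parses one line of profiler output in the format shown in this guide and reports the stages that exceed their budget. `BUDGETS_MS`, `parse_profile_line`, and `over_budget` are hypothetical names; the budget values are the ones listed above.

```python
import re

# Budgets in milliseconds, taken from the expected values in this guide.
BUDGETS_MS = {"Embed": 60.0, "Retrieve": 20.0, "Generate": 800.0, "Total": 1000.0}

STAGE_PATTERN = re.compile(r"(\w+):\s*([0-9]+(?:\.[0-9]+)?)ms")


def parse_profile_line(line):
    """Turn 'Embed: 45.2ms | Retrieve: 12.1ms | ...' into {'Embed': 45.2, ...}."""
    return {name: float(value) for name, value in STAGE_PATTERN.findall(line)}


def over_budget(timings, budgets=BUDGETS_MS):
    """Return the stages whose measured time is missing or not below its budget."""
    return {
        name: timings.get(name)
        for name, limit in budgets.items()
        if timings.get(name) is None or timings[name] >= limit
    }


if __name__ == "__main__":
    line = "Embed: 45.2ms | Retrieve: 12.1ms | Generate: 820.3ms | Total: 877.6ms"
    print(over_budget(parse_profile_line(line)))  # prints: {'Generate': 820.3}
```

An empty result means every stage is within budget; a stage mapped to `None` was missing from the output line.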

Common failures

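Generator latency dominates. If generation takes more than 800 ms, switch to a smaller model (e.g., llama3.2:1b) or enable Ollama's streaming mode to reduce time to first token. Streaming does not shorten total generation time, but the user sees output sooner.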
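Embedding model not loaded. The first call is slow (~500ms) because Ollama has to load the model into memory; subsequent calls reuse the loaded model until Ollama unloads it after an idle period. Always send a warm-up request before profiling.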
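Collection not warmed up. The first query against a freshly loaded ChromaDB collection can be slower than later ones. Run one throwaway col.query(...) after loading so that one-time initialization is not counted in the profile.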
Large context passed to generator. Passing around 10k tokens of retrieved text slows generation and can reduce answer quality. Limit the prompt to the top 3 chunks, for example by setting n_results=3 in col.query.
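No GPU acceleration. Ollama uses a supported GPU automatically when it detects one and falls back to CPU otherwise. Run ollama ps to see whether the loaded model is placed on GPU or CPU, and nvidia-smi to confirm that the driver sees the card and that the Ollama process is using it.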
Version mismatch. The installed package or runtime differs from the command shown; check the version first and rerun the smallest verification command.
Local environment drift. Another service, virtual environment, model, or path is being used; print the active binary path and configuration before changing the guide steps.
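For the large-context failure above, one option is to cap the retrieved text before building the prompt. This is a minimal sketch, assuming the chunks arrive ordered best match first. `build_context` is a hypothetical helper, and the default `max_words=1500` is an illustrative budget rather than a recommended value.

```python
def build_context(chunks, max_chunks=3, max_words=1500):
    """Join the highest-ranked chunks until either limit is reached.

    chunks is assumed to be ordered best match first, as a vector store
    query normally returns them.
    """
    selected = []
    used_words = 0
    for chunk in chunks[:max_chunks]:
        n_words = len(chunk.split())
        if selected and used_words + n_words > max_words:
            break  # keep at least one chunk, then stop at the word budget
        selected.append(chunk)
        used_words += n_words
    return "\n\n".join(selected)


if __name__ == "__main__":
    ranked = ["alpha " * 4, "beta " * 4, "gamma " * 4, "delta " * 4]
    context = build_context(ranked, max_chunks=3, max_words=10)
    print(len(context.split()))  # prints: 8 (two chunks of four words each)
```

In the profiling script, the chunk list would be `results["documents"][0]`.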

How to Optimize RAG for Low Latency
