RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
HOW-TO · RAG

How to Optimize RAG for Low Latency

advanced·30 min·By Fredoline Eruo
Target environment
Ubuntu 24.04 · Ollama 0.4.x
PREREQUISITES

RAG pipeline deployed, performance monitoring tools

What this does

This guide explains how to reduce end-to-end latency in a Retrieval-Augmented Generation pipeline running on Ollama. By profiling each stage and then tuning the embedding model, chunking strategy, and query caching, you can often bring p95 latency under 1 second for typical single-document answers. Actual numbers depend on your hardware and model choice.

Steps

  1. Profile the pipeline end-to-end. Identify which stage dominates latency: embedding, retrieval, or generation.

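    import time, ollama, chromadb
    
    # chromadb.Client() is in-memory only. If your collection lives on disk,
    # create the client with chromadb.PersistentClient(path=...) instead.
    client = chromadb.Client()
    col = client.get_collection("docs")
    
    query = "your query"
    start = time.perf_counter()
    
    # Stage 1: embed the query
    embed_start = time.perf_counter()
    query_vec = ollama.embeddings(model="mxbai-embed-large", prompt=query)["embedding"]
    embed_time = (time.perf_counter() - embed_start) * 1000
    
    # Stage 2: retrieve the nearest chunks
    retrieve_start = time.perf_counter()
    results = col.query(query_embeddings=[query_vec], n_results=5)
    retrieve_time = (time.perf_counter() - retrieve_start) * 1000
    
    # Stage 3: generate. Join the chunks so the prompt holds plain text, not a Python list repr.
    context = "\n\n".join(results["documents"][0])
    gen_start = time.perf_counter()
    response = ollama.generate(model="llama3.2", prompt=f"Context: {context}\nQuestion: {query}")
    gen_time = (time.perf_counter() - gen_start) * 1000
    
    total_time = (time.perf_counter() - start) * 1000
    print(f"Embed: {embed_time:.1f}ms | Retrieve: {retrieve_time:.1f}ms | Generate: {gen_time:.1f}ms | Total: {total_time:.1f}ms")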
    

    Example output (values vary with hardware and model): Embed: 45.2ms | Retrieve: 12.1ms | Generate: 820.3ms | Total: 877.6ms

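  2. Switch to a faster embedding model. If profiling shows the embedding stage is slow because your pipeline uses a heavier embedder, replace it with a lighter one. This guide uses mxbai-embed-large as its balance of speed and quality. Vectors from different models are not comparable, so re-embed the whole collection after switching.

    # Use a lighter embedder - update your pipeline config
    model_name = "mxbai-embed-large"  # this guide's figures: ~500MB, <50ms per query on CPU; measure on your own hardware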
    
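  3. Reduce chunk size and overlap. Smaller chunks (512 tokens with 64-token overlap) improve retrieval precision and reduce the context passed to the generator, which cuts generation latency. The helper below approximates tokens with whitespace-separated words, so real token counts will run somewhat higher.

    def chunk_text(text, chunk_size=512, overlap=64):
        # Sizes are counted in whitespace-separated words, not model tokens.
        if not 0 <= overlap < chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        words = text.split()
        chunks = []
        for i in range(0, len(words), chunk_size - overlap):
            chunks.append(" ".join(words[i:i + chunk_size]))
            if i + chunk_size >= len(words):
                break  # this window already reached the end; avoid a redundant tail chunk
        return chunks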
    
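  4. Enable retrieval caching. Cache query vectors so that repeated queries skip the embedding call. lru_cache matches on the exact prompt string, so near-duplicate queries still miss unless you normalize them first.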

    from functools import lru_cache
    
    @lru_cache(maxsize=1000)
    def cached_embed(prompt):
        return ollama.embeddings(model="mxbai-embed-large", prompt=prompt)["embedding"]
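
The cache in step 4 only hits on identical strings. As a minimal sketch, assuming that lowercasing and collapsing whitespace do not change what a query means for your data, the query can be normalized before the lookup. `normalize_query` and `make_cached_embedder` are hypothetical helper names, and `embed_fn` stands in for whatever embedding call your pipeline uses.

```python
from functools import lru_cache


def normalize_query(prompt: str) -> str:
    """Lowercase and collapse runs of whitespace so trivial variants share a key."""
    return " ".join(prompt.lower().split())


def make_cached_embedder(embed_fn, maxsize=1000):
    """Wrap embed_fn (str -> sequence of floats) with a normalized LRU cache."""

    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str):
        # Store a tuple so callers cannot mutate the cached vector in place.
        return tuple(embed_fn(normalized))

    def embed(prompt: str):
        # Return a fresh list on every call, built from the cached tuple.
        return list(_cached(normalize_query(prompt)))

    embed.cache_info = _cached.cache_info  # expose hit/miss counters
    return embed


if __name__ == "__main__":
    calls = []

    def fake_embed(text):
        # Stand-in for the real embedding call; records each invocation.
        calls.append(text)
        return [float(len(text)), 0.0]

    embed = make_cached_embedder(fake_embed)
    embed("What is RAG?")
    embed("  what is   rag? ")  # same key after normalization
    print(f"embedding calls: {len(calls)}, cache hits: {embed.cache_info().hits}")
```

In the real pipeline, `embed_fn` could be a small function that wraps the `ollama.embeddings(...)["embedding"]` call from step 4. Note that the embedder then sees the normalized text, and that paraphrased queries would need a semantic cache, which this sketch does not attempt.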
    

Verification

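Save the profiling script from step 1 as profile_latency.py, run it once to warm up both models, then run it again and read the second result:

python3 profile_latency.py
# Expected: Embed: <60ms | Retrieve: <20ms | Generate: <800ms | Total: <1000ms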
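
A single run does not say much about p95, which is the target this guide states. The sketch below is one way to collect repeated timings and compute a nearest-rank percentile. `percentile` and `measure_ms` are hypothetical helpers, and the lambda in the demo is a placeholder workload that you would replace with a function running one full query.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def measure_ms(fn, runs=50, warmup=3):
    """Call fn() repeatedly and return per-call wall-clock latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded (model load, index load)
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


if __name__ == "__main__":
    # Placeholder workload; swap in a function that runs one full RAG query.
    samples = measure_ms(lambda: sum(range(10_000)), runs=20)
    print(f"p50: {percentile(samples, 50):.2f}ms | p95: {percentile(samples, 95):.2f}ms")
```

Nearest-rank is used here because it always returns a latency that was actually observed. With few runs the p95 is close to the maximum, so more runs give a steadier figure.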

Common failures

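  • Generator latency dominates. If generation takes more than 800ms, switch to a smaller model (e.g., llama3.2:1b) or enable Ollama's streaming mode. Streaming does not shorten total generation time, but it cuts perceived latency to the time to first token.
  • Embedding model not loaded. The first call is slow (~500ms) because Ollama has to load the model into memory; subsequent calls reuse it. Always warm up before profiling.
  • Collection index not warmed up. ChromaDB loads its index lazily, so the first query after startup can be slow. Call col.count() or run a throwaway query after loading, before you measure.
  • Large context passed to generator. Passing 10k tokens from retrieval degrades generation speed and quality. Limit the prompt to the top 3 chunks (n_results=3 in the step 1 query).
  • No GPU acceleration. Ollama uses a supported GPU automatically and falls back to CPU when it does not detect one. Run ollama ps while a model is loaded to see the CPU/GPU split, or nvidia-smi to confirm GPU usage. To change how many layers are offloaded, set the num_gpu option.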
  • Version mismatch. The installed package or runtime differs from the command shown. Check the version first, then rerun the smallest verification command.
  • Local environment drift. Another service, virtual environment, model, or path is being used. Print the active binary path and configuration before changing the guide steps.
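
To judge whether streaming helps with the first failure above, it is useful to measure time to first token separately from total generation time. The following is a sketch under the assumption that the model output arrives as an iterable of text pieces. `time_stream` is a hypothetical helper, and the fake generator stands in for a real streaming response.

```python
import time


def time_stream(chunks):
    """Consume an iterable of text pieces; return (text, first_token_ms, total_ms)."""
    start = time.perf_counter()
    first_token_ms = None
    parts = []
    for piece in chunks:
        if first_token_ms is None and piece:
            # First non-empty piece: this is the latency the user perceives.
            first_token_ms = (time.perf_counter() - start) * 1000
        parts.append(piece)
    total_ms = (time.perf_counter() - start) * 1000
    return "".join(parts), first_token_ms, total_ms


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a streaming model: three pieces, roughly 10 ms apart.
        for piece in ["Local ", "RAG ", "answer."]:
            time.sleep(0.01)
            yield piece

    text, ttft_ms, total_ms = time_stream(fake_stream())
    print(f"{text!r} | first token: {ttft_ms:.1f}ms | total: {total_ms:.1f}ms")
```

With the ollama Python client, calling `ollama.generate(model="llama3.2", prompt=..., stream=True)` yields partial responses, so a generator expression such as `(part["response"] for part in stream)` should adapt it to this helper. Create the stream immediately before calling `time_stream` so the timer covers the whole request.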

Related guides

  • How to Build Multi-Modal RAG for Images and Text (build-multi-modal-rag-images-text)
  • How to Set Up ChromaDB from Scratch (setup-chromadb-scratch)