RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Enterprise-Scale RAG

12. Latency Budgeting

Chapter 12 of 24 · 20 min
KEY INSIGHT

Latency budgets transform vague "make it faster" requirements into specific engineering targets. When every team knows its budget allocation, optimization effort goes to the stages that are actually over. When a budget is violated, the stage responsible is immediately identifiable.

Latency budgeting allocates time budgets to each RAG pipeline stage. If your target is 3-second queries, you might budget: vector search 200ms, ACL filtering 100ms, context assembly 150ms, LLM inference 2,000ms, network overhead 550ms. The sum equals your target.

Budgeting forces explicit tradeoffs. If LLM inference takes 2.5 seconds with a 70B model, the other stages above still need 1,000ms, so the query lands at 3.5 seconds. You cannot hit a 3-second target without model changes, caching, or target relaxation. These are engineering decisions, not preferences.
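
The arithmetic behind that tradeoff can be checked mechanically. A minimal sketch, assuming the stage figures from the 3-second example with LLM inference raised to 2,500ms; `budget_gap_ms` is a hypothetical helper, not part of any library:

```python
def budget_gap_ms(target_ms: int, stage_ms: dict[str, int]) -> int:
    """Return how far the summed stage latencies exceed the target.

    A result of zero or less means the stages fit inside the target.
    """
    return sum(stage_ms.values()) - target_ms


observed = {
    "vector_search": 200,
    "acl_filtering": 100,
    "context_assembly": 150,
    "llm_inference": 2500,  # the 70B-model case from the text
    "network_overhead": 550,
}

gap = budget_gap_ms(3000, observed)
print(f"over budget by {gap} ms")  # over budget by 500 ms
```

A positive gap is the amount that has to come from somewhere: a faster model, a cache hit, or a larger target.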

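import time

# Latency budget allocation (milliseconds)
LATENCY_BUDGET_MS = {
    "vector_search": 200,
    "acl_filtering": 100,
    "context_assembly": 150,
    "llm_inference": 2000,
    "network_overhead": 500,  # 500 + 50 below = the 550ms of overhead in the allocation above
    "safety_filtering": 50,
}
# Derive the total so it cannot drift from the per-stage values (3000 here).
LATENCY_BUDGET_MS["total"] = sum(LATENCY_BUDGET_MS.values())


def elapsed_ms(since: float) -> float:
    """Milliseconds elapsed since a time.perf_counter() reading."""
    return (time.perf_counter() - since) * 1000


# Budget verification in code.
# User, Context, vector_search, and acl_filter come from the surrounding application.
# Note: assert statements are skipped under `python -O`; production code
# would log or emit a metric instead.
async def timed_retrieve(query: str, user: User) -> tuple[Context, dict]:
    timings = {}

    stage_start = time.perf_counter()
    results = await vector_search(query)
    timings["vector_search"] = elapsed_ms(stage_start)
    assert timings["vector_search"] < LATENCY_BUDGET_MS["vector_search"], \
        f"Vector search exceeded budget: {timings['vector_search']:.0f}ms"

    # Restart the clock so this stage is not charged for the previous one.
    stage_start = time.perf_counter()
    filtered = await acl_filter(results, user)
    timings["acl_filtering"] = elapsed_ms(stage_start)
    assert timings["acl_filtering"] < LATENCY_BUDGET_MS["acl_filtering"], \
        f"ACL filtering exceeded budget: {timings['acl_filtering']:.0f}ms"

    # ... continue for all stages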

Identify the critical path. The critical path is the longest sequence of dependent operations. Parallel operations (fetch user permissions and load document cache simultaneously) do not add to the critical path. Sequential operations (search then filter then assemble) do.
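
The distinction between total work and the critical path can be made concrete with a small calculation. A minimal sketch, assuming each sequential step is modeled as a list of operations that run in parallel; `critical_path_ms` is a hypothetical helper and the 40ms and 60ms figures are illustrative only:

```python
# One sequential step on the request path: the operations inside a step
# run in parallel, each given as a (name, milliseconds) pair.
Step = list[tuple[str, int]]


def critical_path_ms(steps: list[Step]) -> int:
    """Sum, over the sequential steps, of the slowest operation in each step."""
    return sum(max(ms for _, ms in step) for step in steps)


pipeline: list[Step] = [
    [("fetch_user_permissions", 40), ("load_document_cache", 60)],  # parallel
    [("search", 200)],
    [("filter", 100)],
    [("assemble", 150)],
]

total_work = sum(ms for step in pipeline for _, ms in step)
print(critical_path_ms(pipeline), total_work)  # 510 550
```

Only the slower of the two parallel operations counts, so the critical path is 40ms shorter than the total work. Speeding up the permissions fetch would not change the critical path at all.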

LLM inference dominates budgets: in the 3-second allocation above it takes 2,000ms, two thirds of the total. Even with aggressive optimization, 2-second inference is optimistic for frontier models. Strategies to reduce LLM time:

  • Smaller models for simple queries (classification, routing)
  • Speculative execution (start generating while final chunks are retrieved)
  • Caching frequent query types (FAQ queries)
  • Reduced context length (retrieve fewer, more targeted chunks)

# Query classification to select appropriate model
async def classify_and_route(query: str) -> tuple[str, str]:
    # Fast classification model (100ms)
    intent = await small_model.classify(query)

    if intent == "factual_lookup":
        return query, "gpt-4o-mini"  # Fast, cheap
    elif intent == "analysis":
        return query, "gpt-4o"  # Slow, capable
    elif intent == "synthesis":
        return query, "claude-3-opus"  # Slowest, best reasoning
    # Unrecognized intent: return a default rather than falling through to None.
    # The choice of default here is an assumption.
    return query, "gpt-4o"
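
The caching strategy from the list above can be sketched just as briefly. This is a minimal exact-match cache with a time-to-live, assuming answers to FAQ-style queries can be reused for a few minutes; `TTLCache` and its normalization rule are hypothetical, not a library API:

```python
import time
from typing import Optional


class TTLCache:
    """Tiny exact-match answer cache keyed on a normalized query string."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Collapse case and whitespace so trivially different phrasings share an entry.
        return " ".join(query.lower().split())

    def get(self, query: str) -> Optional[str]:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired: drop it and report a miss
            return None
        return answer

    def put(self, query: str, answer: str) -> None:
        self._store[self._key(query)] = (time.monotonic(), answer)


cache = TTLCache(ttl_seconds=300)
cache.put("What is the refund policy?", "Refunds within 30 days.")
hit = cache.get("  what is the REFUND policy? ")
miss = cache.get("How do I reset my password?")
```

A hit skips LLM inference entirely, which is why caching can recover more budget than any other strategy for repeated queries. In a pipeline with ACL filtering, the cache key would likely also need to include the user's permission scope so that one user's answer is never served to another.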

Budget violations cascade. When one stage exceeds its budget, downstream stages receive less time. If vector search takes 1 second instead of 200ms, you have 800ms less for everything else. Implement per-stage deadlines and graceful degradation when budgets are exceeded.
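
A per-stage deadline with a fallback might look like the following. This is a minimal sketch built on `asyncio.wait_for`; `with_deadline` and `slow_stage` are hypothetical names, and the stage stands in for any step whose work can be skipped:

```python
import asyncio


async def with_deadline(coro, budget_ms: float, fallback):
    """Await `coro`; if it runs past `budget_ms`, cancel it and return `fallback`."""
    try:
        return await asyncio.wait_for(coro, timeout=budget_ms / 1000)
    except asyncio.TimeoutError:
        return fallback


async def slow_stage(chunks: list[str]) -> list[str]:
    # Stand-in for a stage that is far too slow on this request.
    await asyncio.sleep(0.5)
    return sorted(chunks)


async def main() -> list[str]:
    chunks = ["c", "a", "b"]
    # Degrade by passing the input through unchanged instead of stalling later stages.
    return await with_deadline(slow_stage(chunks), budget_ms=50, fallback=chunks)


result = asyncio.run(main())
print(result)  # ['c', 'a', 'b']
```

The stage is cut off at its 50ms budget, so downstream stages keep their full allocation at the cost of a lower-quality intermediate result. Which stages may degrade this way is a product decision; a security step such as ACL filtering should fail closed rather than fall back.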

The budget review cycle. Quarterly, review actual latency distributions against budgets. If p50 vector search is 80ms but p99 is 800ms, the 200ms budget is met at the median but violated in the tail. Decide which percentile each budget applies to, then adjust budgets based on data, not assumptions.
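
A review of this kind reduces to comparing percentiles of recorded latencies against the budget. A minimal sketch using the nearest-rank percentile definition; the sample data is synthetic, constructed to match the 80ms and 800ms figures in the example, and `percentile` and `review` are hypothetical helpers:

```python
import math


def percentile(samples_ms: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def review(samples_ms: list[float], budget_ms: float) -> dict[str, bool]:
    """Report, per percentile, whether the stage stays within its budget."""
    return {f"p{p}": percentile(samples_ms, p) <= budget_ms for p in (50, 99)}


# Synthetic vector-search latencies: mostly 80ms with a slow tail of 800ms.
vector_search_ms = [80.0] * 97 + [800.0] * 3

print(percentile(vector_search_ms, 50), percentile(vector_search_ms, 99))  # 80.0 800.0
print(review(vector_search_ms, budget_ms=200))  # {'p50': True, 'p99': False}
```

The same stage passes or fails depending on the percentile chosen, which is why a budget is only meaningful once it names the percentile it applies to.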

EXERCISE

You have a 5-second query budget and these observed p99 latencies: vector search 400ms, ACL filtering 200ms, context assembly 300ms, LLM inference 6 seconds, network overhead 500ms. What is over budget, and by how much? Propose three solutions with tradeoffs.
