12. Latency Budgeting
Latency budgeting allocates time budgets to each RAG pipeline stage. If your target is 3-second queries, you might budget: vector search 200ms, ACL filtering 100ms, context assembly 150ms, LLM inference 2,000ms, network overhead 550ms. The sum equals your target.
Budgeting forces explicit tradeoffs. If LLM inference takes 2.5 seconds with a 70B model, you cannot hit a 3-second target without model changes, caching, or target relaxation. These are engineering decisions, not preferences.
# Latency budget allocation
LATENCY_BUDGET_MS = {
"vector_search": 200,
"acl_filtering": 100,
"context_assembly": 150,
"llm_inference": 2000,
"network_overhead": 400,
"safety_filtering": 50,
"total": 3000
}
# Budget verification in code
async def timed_retrieve(query: str, user: User) -> tuple[Context, dict]:
timings = {}
start = time.time()
results = await vector_search(query)
timings["vector_search"] = elapsed_ms(start)
assert timings["vector_search"] < LATENCY_BUDGET_MS["vector_search"], \
f"Vector search exceeded budget: {timings['vector_search']}ms"
filtered = await acl_filter(results, user)
timings["acl_filtering"] = elapsed_ms(start)
assert timings["acl_filtering"] < LATENCY_BUDGET_MS["acl_filtering"], \
f"ACL filtering exceeded budget: {timings['acl_filtering']}ms"
# ... continue for all stages
Identify the critical path. The critical path is the longest sequence of dependent operations. Parallel operations (fetch user permissions and load document cache simultaneously) do not add to the critical path. Sequential operations (search then filter then assemble) do.
LLM inference dominates budgets. Even with aggressive optimization, 2-second inference is optimistic for frontier models. Strategies to reduce LLM time:
- Smaller models for simple queries (classification, routing)
- Speculative execution (start generating while final chunks are retrieved)
- Caching frequent query types (FAQ queries)
- Reduced context length (retrieve fewer, more targeted chunks)
# Query classification to select appropriate model
async def classify_and_route(query: str) -> tuple[str, str]:
# Fast classification model (100ms)
intent = await small_model.classify(query)
if intent == "factual_lookup":
return query, "gpt-4o-mini" # Fast, cheap
elif intent == "analysis":
return query, "gpt-4o" # Slow, capable
elif intent == "synthesis":
return query, "claude-3-opus" # Slowest, best reasoning
Budget violations cascade. When one stage exceeds its budget, downstream stages receive less time. If vector search takes 1 second instead of 200ms, you have 800ms less for everything else. Implement per-stage deadlines and graceful degradation when budgets are exceeded.
The budget review cycle. Quarterly, review actual latency distributions against budgets. If p50 vector search is 80ms but p99 is 800ms, your 200ms budget is correct for p50 but violated for tail latency. Adjust budgets based on data, not assumptions.
Given a 5-second query budget with these observed latencies: vector search p99 400ms, ACL filtering p99 200ms, context assembly p99 300ms, LLM inference p99 6 seconds, network overhead p99 500ms. What is over budget? Propose three solutions with tradeoffs.