SLA Monitoring — Enterprise-Scale RAG (Chapter 11)

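RAG systems need explicit SLA definitions because users experience end-to-end latency, not component latency. Suppose your embedding service responds in 50 ms, your vector search in 30 ms, your context assembly in 20 ms, and your LLM inference in 5 seconds. The components sum to about 5.1 seconds, yet users complain that queries take 10 seconds. The difference typically hides between components, in queueing, network hops, and retries, where no per-component dashboard shows it.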

Define SLAs at the system boundary, not per-component:

Query p50 latency: 1.5 seconds
Query p99 latency: 5 seconds
Indexing freshness (document upload to searchable): 60 seconds
System availability: 99.9%
Retrieval precision@10: >0.85 (measured via relevance sampling)
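
The latency targets above can be written down as data and checked against measured end-to-end latencies. The following is a minimal sketch, not a definitive implementation: the names QuerySLA, percentile, and check_latency_sla are hypothetical, and the nearest-rank percentile is one simple choice among several percentile definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySLA:
    # Targets copied from the SLA list above (seconds).
    p50_latency_s: float = 1.5
    p99_latency_s: float = 5.0


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, so no floating-point rounding is involved.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def check_latency_sla(latencies_s: list[float], sla: QuerySLA = QuerySLA()) -> dict:
    """Compare measured end-to-end latencies (hypothetical input) with the SLA."""
    p50 = percentile(latencies_s, 50)
    p99 = percentile(latencies_s, 99)
    return {
        "p50_s": p50,
        "p99_s": p99,
        "p50_ok": p50 <= sla.p50_latency_s,
        "p99_ok": p99 <= sla.p99_latency_s,
    }
```

The latencies fed to check_latency_sla should be measured at the system boundary, for example at the API gateway, so that queueing and network time are included.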

Distributed tracing is essential for diagnosing latency issues. A single query touches 5+ services. Without trace IDs linking spans, you cannot identify which service adds latency.

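# OpenTelemetry instrumentation for RAG components
# vector_db, apply_acl, and assemble_context are application-specific
# and assumed to be defined elsewhere in your codebase.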
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def retrieve_with_tracing(query: str, user_id: str):
    with tracer.start_as_current_span("retrieve") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("query.length", len(query))
        
        with tracer.start_as_current_span("vector_search") as search_span:
            search_results = await vector_db.search(query, k=10)
            search_span.set_attribute("result.count", len(search_results))
        
        with tracer.start_as_current_span("access_control_filter") as acl_span:
            filtered = apply_acl(search_results, user_id)
            acl_span.set_attribute("filtered.count", len(filtered))
        
        with tracer.start_as_current_span("context_assembly") as ctx_span:
            context = assemble_context(filtered)
            ctx_span.set_attribute("context.length", len(context))
        
        return context
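
Once spans are exported, the per-stage breakdown shows which step dominates and how much time falls outside any child span. The sketch below assumes span timings have already been pulled out of your tracing backend into simple records. SpanRecord and latency_breakdown are hypothetical names, not part of OpenTelemetry, and the child spans are assumed to run sequentially without overlap, as they do in retrieve_with_tracing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    # Hypothetical flattened view of an exported span.
    name: str
    start_ms: float
    end_ms: float


def latency_breakdown(root: SpanRecord, children: list[SpanRecord]) -> dict:
    """Split the root span's duration into per-stage time and untraced time.

    Assumes the children run one after another and do not overlap.
    """
    total = root.end_ms - root.start_ms
    per_stage = {c.name: c.end_ms - c.start_ms for c in children}
    # Time inside the root span that no child span accounts for.
    untraced = total - sum(per_stage.values())
    slowest = max(per_stage, key=per_stage.get) if per_stage else None
    return {
        "total_ms": total,
        "per_stage_ms": per_stage,
        "untraced_ms": untraced,
        "slowest": slowest,
    }
```

A large untraced_ms value suggests that latency is accumulating somewhere that is not yet instrumented, which is a cue to add spans there.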

Key metrics to monitor:

Metric	Alert Threshold	Dashboard
Query p99 latency	>5s	Latency percentiles over time
Indexing backlog	>1000 pending	Ingest queue depth
Embedding queue depth	>500	Processing pipeline health
Cache hit rate	<80%	Cache efficiency
Error rate	>0.1%	System availability
Relevance score	<0.85	Retrieval quality sampling
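
The thresholds in the table can be expressed as rules and evaluated against a snapshot of current metric values. This is a minimal sketch: the metric keys, AlertRule, and breached_metrics are hypothetical names, and a real deployment would more likely encode the same rules in its monitoring system's own alerting configuration.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AlertRule:
    metric: str
    breached: Callable[[float, float], bool]  # operator.gt or operator.lt
    threshold: float


# Thresholds taken from the table above; metric keys are illustrative.
RULES = [
    AlertRule("query_p99_latency_s", operator.gt, 5.0),
    AlertRule("indexing_backlog", operator.gt, 1000),
    AlertRule("embedding_queue_depth", operator.gt, 500),
    AlertRule("cache_hit_rate", operator.lt, 0.80),
    AlertRule("error_rate", operator.gt, 0.001),  # 0.1%
    AlertRule("relevance_score", operator.lt, 0.85),
]


def breached_metrics(snapshot: dict[str, float]) -> list[str]:
    """Return the metrics whose current value crosses its alert threshold.

    Metrics missing from the snapshot are skipped in this sketch; in
    production a missing metric may itself deserve an alert.
    """
    return [
        rule.metric
        for rule in RULES
        if rule.metric in snapshot
        and rule.breached(snapshot[rule.metric], rule.threshold)
    ]
```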

Synthetic monitoring catches regressions before users do. Run a fixed set of representative queries against your production system every 60 seconds. Measure end-to-end latency and score the quality of the returned results. Alert if p50 latency exceeds 2 seconds or relevance drops below 0.80.
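
One round of such a check might look like the sketch below. It is illustrative only: run_query and score_relevance are hypothetical callables that you would supply (a client for your query endpoint and a relevance scorer), and using the mean score across probes is an assumption, not something the thresholds above prescribe.

```python
import time
from statistics import median
from typing import Any, Callable, Sequence


def run_synthetic_check(
    run_query: Callable[[str], Any],               # hypothetical: calls the production endpoint
    score_relevance: Callable[[str, Any], float],  # hypothetical: returns a score in [0, 1]
    probes: Sequence[str],
    p50_limit_s: float = 2.0,
    min_relevance: float = 0.80,
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Run every probe query once and compare the results with the alert limits."""
    if not probes:
        raise ValueError("at least one probe query is required")
    latencies, scores = [], []
    for query in probes:
        started = clock()
        result = run_query(query)
        latencies.append(clock() - started)  # end-to-end time seen by the probe
        scores.append(score_relevance(query, result))
    p50 = median(latencies)
    relevance = sum(scores) / len(scores)
    alerts = []
    if p50 > p50_limit_s:
        alerts.append("p50_latency")
    if relevance < min_relevance:
        alerts.append("relevance")
    return {"p50_s": p50, "relevance": relevance, "alerts": alerts}
```

Scheduling is left to the caller: a cron job or the monitoring platform's scheduler would invoke this every 60 seconds and forward any entries in alerts to the paging system.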

Real user monitoring (RUM) complements synthetic checks. Track actual user queries, latency, and engagement, and segment the results by user group, because latency tolerance differs between groups. A query that takes 3 seconds might be acceptable for analysts but unacceptable for customer support agents.
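
Segmenting RUM data by role makes that difference measurable. In the sketch below, the role names and the function slow_query_share are hypothetical, and so are the budgets: the 3-second analyst budget echoes the example above, while the 1.5-second budget for support agents is purely an illustrative assumption.

```python
from collections import defaultdict
from typing import Iterable

# Hypothetical per-role latency budgets in seconds (illustrative values).
LATENCY_BUDGET_S = {"analyst": 3.0, "support_agent": 1.5}


def slow_query_share(events: Iterable[tuple[str, float]]) -> dict[str, float]:
    """Return, per role, the fraction of queries slower than that role's budget.

    Each event is a (role, latency_s) pair; roles without a budget are ignored.
    """
    totals: dict[str, int] = defaultdict(int)
    slow: dict[str, int] = defaultdict(int)
    for role, latency_s in events:
        budget = LATENCY_BUDGET_S.get(role)
        if budget is None:
            continue
        totals[role] += 1
        if latency_s > budget:
            slow[role] += 1
    return {role: slow[role] / totals[role] for role in totals}
```

With these budgets, the same 3-second query counts as on time for an analyst and as slow for a support agent.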

Failure modes in SLA monitoring: alert fatigue (so many alerts that real issues are ignored), dashboard blindness (too many metrics to focus on any of them), and missing the right metrics (you monitor infrastructure but miss business outcomes).