Semantic Caching — Enterprise-Scale RAG (Chapter 13)

Semantic caching stores query-response pairs indexed by semantic similarity rather than exact string matches. When a user asks "How do I reset my password?", a cached answer for "I forgot my login credentials" can be retrieved if similarity exceeds the configured threshold.
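
To make the threshold test concrete, here is a minimal sketch using toy 3-dimensional vectors in place of real embeddings. The function names `cosine_similarity` and `is_cache_hit` are illustrative and not part of any library; the 0.85 default mirrors the threshold used later in this chapter.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_cache_hit(query_vec, cached_vec, threshold=0.85):
    """Treat the cached entry as a hit only at or above the threshold."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Toy vectors standing in for real embeddings (illustrative values only).
cached = [1.0, 0.0, 0.0]
near = [0.9, 0.1, 0.0]   # points in almost the same direction as `cached`
far = [0.0, 1.0, 0.0]    # orthogonal to `cached`

print(is_cache_hit(near, cached))  # similar direction: served from cache
print(is_cache_hit(far, cached))   # unrelated direction: cache miss
```

Real sentence embeddings have hundreds of dimensions, but the comparison is the same: one similarity score checked against one configured threshold.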

The core implementation uses a vector database as the cache store, here Redis with vector search enabled:

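import hashlib

import numpy as np
from redis import Redis
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer


class SemanticCache:
    # Assumes a search index named "idx:queries" already exists over hashes
    # with the "cache:" key prefix, with "embedding" declared as a FLOAT32
    # vector field using the COSINE distance metric and the model's
    # dimension (384 for all-MiniLM-L6-v2).
    def __init__(self, model_name="all-MiniLM-L6-v2",
                 similarity_threshold=0.85):
        self.model = SentenceTransformer(model_name)
        # Leave decode_responses off: the embedding field holds raw bytes.
        self.redis = Redis(host='localhost', port=6379)
        self.threshold = similarity_threshold

    def _embed(self, query: str) -> np.ndarray:
        return self.model.encode(query).astype(np.float32)

    @staticmethod
    def _key(query: str) -> str:
        # Hash the full query so that long queries sharing their first
        # 100 characters do not overwrite each other.
        return "cache:" + hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get_cached_response(self, query: str) -> str | None:
        embedding = self._embed(query)
        # KNN queries with parameters require query dialect 2.
        knn = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .sort_by("score")
            .return_fields("score", "response")
            .dialect(2)
        )
        results = self.redis.ft("idx:queries").search(
            knn, query_params={"vec": embedding.tobytes()}
        )

        # search() returns a Result object; the matches are in .docs.
        if not results.docs:
            return None

        top_result = results.docs[0]
        # With the COSINE metric, Redis returns distance = 1 - similarity.
        similarity = 1 - float(top_result.score)

        if similarity >= self.threshold:
            return top_result.response
        return None

    def store_response(self, query: str, response: str, ttl: int = 3600):
        key = self._key(query)
        self.redis.hset(key, mapping={
            "embedding": self._embed(query).tobytes(),
            "response": response
        })
        self.redis.expire(key, ttl)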
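
The listing queries an index named idx:queries but never creates it. As a sketch of what that index might look like, the helper below builds the arguments for a raw FT.CREATE command. The helper name `build_create_index_args`, the FLAT index type, and the choice of the COSINE metric are assumptions made for illustration; the dimension of 384 matches all-MiniLM-L6-v2, and the "cache:" prefix matches the keys written by store_response.

```python
def build_create_index_args(index_name="idx:queries", prefix="cache:", dim=384):
    """Return FT.CREATE arguments for a hash index with one vector field."""
    return [
        "FT.CREATE", index_name,
        "ON", "HASH",
        "PREFIX", "1", prefix,        # index only keys that start with the prefix
        "SCHEMA",
        "response", "TEXT",
        # "6" is the number of attribute tokens that follow for the vector field.
        "embedding", "VECTOR", "FLAT", "6",
        "TYPE", "FLOAT32",
        "DIM", str(dim),
        "DISTANCE_METRIC", "COSINE",
    ]


print(" ".join(build_create_index_args()))
```

With a redis-py client in hand, the command could be issued once at deployment time with `client.execute_command(*build_create_index_args())`. COSINE is what makes the "1 - distance" conversion in get_cached_response meaningful; with a different metric the conversion would need to change.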

Failure Modes:

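Embedding drift: Vectors produced by different versions of an embedding model are not comparable, so after a model change cached entries can match the wrong queries or stop matching at all. Mitigation requires cache TTL limits and invalidating (or re-embedding) the cache whenever the model changes.
False positives: A high similarity score does not guarantee that the cached answer applies. Two queries can be worded almost identically and still need different answers, for example when they differ only by a negation or a product name. A threshold of 0.85 is a reasonable starting point for general queries, but medical or legal queries may require 0.95.
Memory pressure: Without a bound, the cache grows until Redis runs out of memory. Set maxmemory together with an eviction policy such as allkeys-lru (appropriate when the instance holds only cache data), and monitor cache size.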
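
Two of these mitigations can be sketched in a few lines: per-domain thresholds, and cache keys namespaced by embedding model so that entries written under one model are never reused under another. The names `THRESHOLDS`, `threshold_for`, `cache_key`, and `accept` are hypothetical, and the domain labels are assumed to come from some upstream classifier that this chapter does not define.

```python
import hashlib

# Hypothetical per-domain thresholds, using the 0.85 and 0.95 values above.
THRESHOLDS = {"general": 0.85, "medical": 0.95, "legal": 0.95}


def threshold_for(domain):
    """Unknown domains fall back to the strictest configured threshold."""
    return THRESHOLDS.get(domain, max(THRESHOLDS.values()))


def accept(similarity, domain):
    """Serve a cached answer only if it clears the domain's threshold."""
    return similarity >= threshold_for(domain)


def cache_key(model_name, query):
    """Namespace keys by model so a model change starts from an empty cache."""
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"cache:{model_name}:{digest}"


print(accept(0.90, "general"))   # clears 0.85
print(accept(0.90, "medical"))   # does not clear 0.95
```

Namespacing keys this way only isolates models if the search index is also scoped to the model-specific prefix; an index over the bare "cache:" prefix would still return entries from every model.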

Cache warming on startup preloads frequently accessed queries and their responses into the cache, reducing cold-start latency by 40-60% in typical deployments.
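
One way to warm the cache is sketched below, assuming a log of past queries is available and that `cache` exposes a store_response(query, response) method like the class in this chapter. The functions `select_warm_queries` and `warm_cache`, and the `answer_fn` callback that produces a fresh response, are illustrative names rather than an established API.

```python
from collections import Counter


def select_warm_queries(query_log, top_n=100):
    """Return the top_n most frequent queries from a log of past queries."""
    return [query for query, _count in Counter(query_log).most_common(top_n)]


def warm_cache(cache, query_log, answer_fn, top_n=100):
    """Compute and store responses for the most frequent past queries."""
    warmed = select_warm_queries(query_log, top_n)
    for query in warmed:
        cache.store_response(query, answer_fn(query))
    return warmed
```

Selecting by raw frequency is the simplest policy; it treats differently worded versions of the same question as separate queries, so a production warm-up might first group the log by semantic similarity.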

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
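
A small helper can capture most of that record automatically. This is a sketch using only the standard library; the function names and the default package list are assumptions based on the imports in this chapter, and `data_path` is a placeholder to be filled in by the reader.

```python
import json
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_record(packages=("redis", "numpy", "sentence-transformers"),
                       data_path="(unset)"):
    """Collect runtime details worth noting beside an exercise result."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
    }


print(json.dumps(environment_record(), indent=2))
```

The observed output, and constraints such as vector count or CPU/GPU backend, still need to be noted by hand.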