RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Enterprise-Scale RAG

13. Semantic Caching

Chapter 13 of 24 · 15 min
KEY INSIGHT

Semantic caching can reduce LLM inference costs by 30-70% for repetitive queries, but the similarity threshold must be tuned per domain: a threshold set too low returns irrelevant cached responses, and one set too high rarely produces a hit, which defeats the purpose of the cache.

Semantic caching stores query-response pairs indexed by semantic similarity rather than exact string matches. When a user asks "How do I reset my password?", a cached answer for "I forgot my login credentials" can be retrieved if similarity exceeds the configured threshold.
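To make the threshold check concrete, here is a minimal sketch of the lookup logic with no Redis or embedding model involved. The three-dimensional vectors are hand-picked toy values standing in for real embeddings, and `cosine_similarity` and `lookup` are hypothetical helper names used only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def lookup(query_vec, cache, threshold):
    """Return the most similar cached response, or None on a cache miss.

    `cache` is a list of (vector, response) pairs.
    """
    best_response, best_sim = None, -1.0
    for vec, response in cache:
        sim = cosine_similarity(query_vec, vec)
        if sim > best_sim:
            best_response, best_sim = response, sim
    return best_response if best_sim >= threshold else None


# Toy vectors standing in for embeddings (assumed values, illustration only).
cache = [
    ([0.9, 0.1, 0.0], "Use the 'Forgot password' link on the sign-in page."),
    ([0.0, 0.2, 0.9], "Invoices are emailed on the first of each month."),
]
near_duplicate = [0.8, 0.2, 0.0]  # points almost the same way as the first entry
unrelated = [0.5, 0.5, 0.5]       # not close to either entry

hit = lookup(near_duplicate, cache, threshold=0.85)
miss = lookup(unrelated, cache, threshold=0.85)
```

A linear scan like this is only practical for very small caches; the point of the sketch is the comparison against the threshold, which a vector index performs at scale.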

The core implementation uses a vector database as the cache store:

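import hashlib

import numpy as np
from redis import Redis
from redis.commands.search.query import Query
from redis.exceptions import ResponseError
from sentence_transformers import SentenceTransformer

INDEX_NAME = "idx:queries"
KEY_PREFIX = "cache:"

class SemanticCache:
    def __init__(self, model_name="all-MiniLM-L6-v2",
                 similarity_threshold=0.85):
        self.model = SentenceTransformer(model_name)
        self.redis = Redis(host='localhost', port=6379)
        self.threshold = similarity_threshold
        self._ensure_index()

    def _ensure_index(self):
        # Create the vector index once; requires the RediSearch module (Redis Stack).
        try:
            self.redis.ft(INDEX_NAME).info()
        except ResponseError:
            dim = self.model.get_sentence_embedding_dimension()
            self.redis.execute_command(
                "FT.CREATE", INDEX_NAME, "ON", "HASH", "PREFIX", "1", KEY_PREFIX,
                "SCHEMA", "response", "TEXT",
                "embedding", "VECTOR", "FLAT", "6",
                "TYPE", "FLOAT32", "DIM", dim, "DISTANCE_METRIC", "COSINE",
            )

    def _embed(self, query: str) -> np.ndarray:
        return self.model.encode(query).astype(np.float32)

    def _key(self, query: str) -> str:
        # Hash the full query so long queries sharing a prefix do not collide.
        return KEY_PREFIX + hashlib.sha256(query.encode("utf-8")).hexdigest()

    def get_cached_response(self, query: str) -> str | None:
        embedding = self._embed(query)
        knn = (
            Query("*=>[KNN 1 @embedding $vec AS score]")
            .sort_by("score")
            .return_fields("response", "score")
            .dialect(2)  # KNN syntax requires query dialect 2
        )
        results = self.redis.ft(INDEX_NAME).search(
            knn, query_params={"vec": embedding.tobytes()}
        )

        if not results.docs:
            return None

        top_result = results.docs[0]
        # With the COSINE metric, Redis returns distance = 1 - cosine similarity.
        similarity = 1 - float(top_result.score)

        if similarity >= self.threshold:
            return top_result.response
        return None

    def store_response(self, query: str, response: str, ttl: int = 3600):
        key = self._key(query)
        self.redis.hset(key, mapping={
            "embedding": self._embed(query).tobytes(),
            "response": response,
        })
        self.redis.expire(key, ttl)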

Failure Modes:

  • Embedding drift: When the embedding model is upgraded or replaced, vectors written by the old model are no longer comparable with new query embeddings, so cached responses become semantically misaligned with newer queries. Mitigate this with cache TTL limits and periodic invalidation.
  • False positives: A high similarity score does not guarantee that the cached answer applies. A threshold of 0.85 works for most general queries, but medical or legal queries may require 0.95.
  • Memory pressure: Without limits, the cache grows unbounded. Set maxmemory, set the Redis eviction policy (maxmemory-policy) to allkeys-lru, and monitor cache size.
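One way to reduce the first two failure modes is to scope cache keys to the embedding model and to choose the threshold per domain. The sketch below assumes a key layout of `cache:<model>:<sha256>` and the helper names `cache_key` and `threshold_for`; both are illustrative choices, not part of any library. The threshold values are the ones quoted above.

```python
import hashlib


def cache_key(query: str, model_name: str, prefix: str = "cache") -> str:
    """Build a key scoped to one embedding model.

    Including the model name keeps vectors from different models in separate
    key spaces, so switching models starts from an empty cache instead of
    comparing incompatible embeddings.
    """
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
    return f"{prefix}:{model_name}:{digest}"


def threshold_for(domain: str) -> float:
    """Return a stricter threshold for high-stakes domains (assumed mapping)."""
    strict_domains = {"medical", "legal"}
    return 0.95 if domain in strict_domains else 0.85


old_key = cache_key("How do I reset my password?", "all-MiniLM-L6-v2")
new_key = cache_key("How do I reset my password?", "some-newer-model")  # hypothetical model name
```

With this layout, entries written under the old model simply expire through their TTL, and a vector index would need its key prefix to match the active model's namespace.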

Cache warming on startup preloads frequently accessed queries and their responses into the cache, reducing cold-start latency by 40-60% in typical deployments.
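A warming pass can be sketched as follows, assuming a query log is available at startup. `warm_cache`, `answer_fn`, and `store_fn` are hypothetical names: `answer_fn` stands in for wherever responses come from (a persisted snapshot or a fresh LLM call), and `store_fn` for a method such as `store_response`. The usage lines substitute a plain dictionary so the sketch runs without Redis.

```python
from collections import Counter
from typing import Callable, Iterable


def warm_cache(
    query_log: Iterable[str],
    answer_fn: Callable[[str], str],
    store_fn: Callable[[str, str], None],
    top_n: int = 100,
) -> list[str]:
    """Store answers for the top_n most frequent queries; return those queries."""
    # Normalize lightly so trivially different spellings are counted together.
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    warmed = []
    for query, _count in counts.most_common(top_n):
        store_fn(query, answer_fn(query))
        warmed.append(query)
    return warmed


# Usage with in-memory stand-ins for the log, the answer source, and the cache.
log = [
    "Reset password", "reset password ", "Billing date",
    "reset password", "Export data", "billing date",
]
store = {}
warmed = warm_cache(
    log,
    answer_fn=lambda q: f"answer for: {q}",
    store_fn=store.__setitem__,
    top_n=2,
)
```

Capping the pass with `top_n` keeps startup time bounded; how large that cap should be depends on the deployment.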

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Implement a semantic cache that stores embeddings in Redis. Test it with five query variations and measure which threshold values produce false positives for your specific use case.
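For the measurement step, one possible approach is to label each query variation by hand and count false positives at several candidate thresholds. The sketch below assumes you have already computed a similarity score for each variation; the five scores and labels are made-up placeholders, and `sweep_thresholds` is a hypothetical helper name.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Count cache hits and false positives at each threshold.

    scored_pairs: list of (similarity, same_intent) tuples, where same_intent
    is a human label saying whether the cached answer really applies.
    """
    report = {}
    for t in thresholds:
        hits = [(sim, ok) for sim, ok in scored_pairs if sim >= t]
        false_positives = sum(1 for _sim, ok in hits if not ok)
        report[t] = {"hits": len(hits), "false_positives": false_positives}
    return report


# Placeholder scores for five query variations (not measured data).
pairs = [(0.97, True), (0.91, True), (0.88, False), (0.86, True), (0.72, False)]
report = sweep_thresholds(pairs, thresholds=[0.80, 0.85, 0.90, 0.95])

for threshold, stats in report.items():
    print(threshold, stats)
```

Reading the report from the lowest threshold upward shows where false positives disappear and how many legitimate hits are lost in exchange.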
