20. Common RAG Failures
RAG systems fail in predictable patterns. Recognizing common failures enables targeted debugging. This chapter documents the most frequent production issues with diagnosis and fixes.
Failure 1: Semantic Mismatch
Users and documents describe the same concept in different words, and the embedding model fails to bridge the gap. The query "brain freeze headaches" retrieves nothing, while the relevant documents discuss "ice cream headache triggers."
# Diagnosis: Check vocabulary overlap between queries and chunks
from collections import Counter

def diagnose_semantic_mismatch(
    failed_queries: list[str],
    failed_chunks: list[str]
) -> dict:
    """Identify vocabulary gaps."""
    query_terms = Counter(word for q in failed_queries for word in q.lower().split())
    chunk_terms = Counter(word for c in failed_chunks for word in c.lower().split())
    # Find longer terms (5+ characters) that appear in queries but never in chunks
    missing_terms = {
        term: count for term, count in query_terms.items()
        if term not in chunk_terms and len(term) > 4
    }
    return {"terms_in_queries_not_chunks": missing_terms}

# Example result:
# {'terms_in_queries_not_chunks': {'freeze': 5, 'brain': 3, 'headaches': 4}}
Fix: Fine-tune embedding model on domain-specific query-document pairs. Add query expansion with synonyms.
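The query-expansion half of this fix can be sketched with a small synonym table applied before retrieval. This is a minimal illustration under stated assumptions: the `SYNONYMS` table and the `expand_query` helper are hypothetical names, and in practice the entries would come from domain experts or from mining query logs.

```python
# Hypothetical synonym table; real entries would come from domain experts or logs.
SYNONYMS: dict[str, list[str]] = {
    "brain freeze": ["ice cream headache", "cold-stimulus headache"],
}


def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the original query plus one lowercased variant per matching synonym."""
    lowered = query.lower()
    variants = [query]
    for phrase, alternatives in synonyms.items():
        if phrase in lowered:
            # Substitute each alternative phrasing into the lowercased query
            variants.extend(lowered.replace(phrase, alt) for alt in alternatives)
    return variants


if __name__ == "__main__":
    for variant in expand_query("What causes brain freeze?"):
        print(variant)
```

Each variant can be retrieved separately and the results merged, so a document that only uses the phrase "ice cream headache" still has a chance to match.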
Failure 2: Chunk Boundary Problems
Required information spans multiple chunks. A troubleshooting guide says "Step 1: Check the config file. Step 2: Update the auth token in config.yaml. Step 3: Restart the server." These three steps land in three separate chunks, and none contains the complete instructions.
# Diagnosis: Check if related content spans chunks
import re

def diagnose_chunk_boundaries(
    failed_query: str,
    relevant_chunks: list[dict]
) -> dict:
    """Detect if answer spans chunks."""
    # Match sequential markers as whole words so "then" does not match "authentication"
    sequential_markers = re.compile(r"\b(step \d+|first|then|next)\b")
    has_sequential = any(
        sequential_markers.search(c["text"].lower())
        for c in relevant_chunks
    )
    return {
        "chunks_needed_for_answer": len(relevant_chunks),
        "has_sequential_content": has_sequential,
        "recommendation": "Increase overlap or use larger chunks"
    }

# Example output:
# {'chunks_needed_for_answer': 4, 'has_sequential_content': True,
#  'recommendation': 'Increase overlap or use larger chunks'}
Fix: Increase chunk overlap from 10% to 20-30% or use variable chunk sizes that respect semantic boundaries.
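To make the overlap setting concrete, the sketch below splits a word list into fixed-size windows that share a configurable fraction of their content. The function name `chunk_with_overlap` and its parameters are illustrative assumptions, and it splits on words rather than model tokens for simplicity.

```python
def chunk_with_overlap(
    words: list[str],
    chunk_size: int,
    overlap_ratio: float,
) -> list[list[str]]:
    """Split words into windows of chunk_size that overlap by overlap_ratio."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    text = "Step 1: Check the config file. Step 2: Update the auth token."
    for chunk in chunk_with_overlap(text.split(), chunk_size=6, overlap_ratio=0.3):
        print(" ".join(chunk))
```

With a larger overlap, the end of one step and the start of the next are more likely to share a chunk, at the cost of storing and embedding more redundant text.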
Failure 3: Metadata Filter Conflicts
Metadata filters exclude relevant documents. A query about "authentication" with the filter {category: "security"} misses authentication content that is tagged "troubleshooting."
# Diagnosis: Check how many relevant chunks are excluded by filters
def diagnose_filter_conflicts(
    query: str,
    retrieved_chunks: list[dict],
    all_relevant_chunks: list[dict]
) -> dict:
    """Identify if filters exclude relevant content."""
    # retrieved_chunks: results with the filter applied
    # all_relevant_chunks: chunks known to be relevant, regardless of metadata
    retrieved_ids = {c["id"] for c in retrieved_chunks}
    relevant_ids = {c["id"] for c in all_relevant_chunks}
    missed = relevant_ids - retrieved_ids
    return {
        "total_relevant": len(relevant_ids),
        "retrieved": len(retrieved_ids & relevant_ids),
        "missed_by_filter": len(missed),
        # Expressed as a percentage (0-100) to match the key name
        "percentage_missed": len(missed) / len(relevant_ids) * 100 if relevant_ids else 0
    }
Fix: Don't apply filters by default. Apply them only when the user explicitly restricts scope. If filters are required, maintain multiple metadata indexes and make sure categories overlap, so that content belonging to both "security" and "troubleshooting" is reachable through either filter.
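A minimal sketch of an opt-in filter follows. It assumes each chunk carries a hypothetical `categories` list rather than a single category value, which is one simple way to let categories overlap; the helper name `apply_optional_filter` is also an assumption, not part of any particular vector store API.

```python
from typing import Optional


def apply_optional_filter(
    chunks: list[dict],
    category: Optional[str] = None,
) -> list[dict]:
    """Filter by category only when the caller explicitly asks for one."""
    if category is None:
        return chunks  # no filter by default
    # A chunk may belong to several categories (hypothetical "categories" field)
    return [c for c in chunks if category in c.get("categories", [])]


if __name__ == "__main__":
    chunks = [
        {"id": 1, "categories": ["security"]},
        {"id": 2, "categories": ["troubleshooting", "security"]},
        {"id": 3, "categories": ["billing"]},
    ]
    print([c["id"] for c in apply_optional_filter(chunks)])
    print([c["id"] for c in apply_optional_filter(chunks, "security")])
```

Because chunk 2 is tagged with both categories, a "security" filter no longer hides authentication content that was filed under troubleshooting.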
Failure 4: Context Length Exceeds Model Limits
Retrieving 20 chunks for a verbose technical topic can exceed the model's context window. The model then sees truncated or incomplete context.
# Diagnosis: Check token counts at each pipeline stage
# count_tokens is a placeholder; substitute your tokenizer's counting function
from your_rag_library import count_tokens

def diagnose_context_length(
    assembled_context: str,
    model_max_tokens: int,
    expected_response_tokens: int
) -> dict:
    """Detect context overflow."""
    context_tokens = count_tokens(assembled_context)
    # Reserve room for the response plus a fixed 500-token margin
    available_for_context = model_max_tokens - expected_response_tokens - 500
    overflow_tokens = context_tokens - available_for_context
    return {
        "context_tokens": context_tokens,
        "available_tokens": available_for_context,
        "overflow_percentage": (
            overflow_tokens / available_for_context * 100
            if overflow_tokens > 0 else 0
        ),
        "recommendation": "Reduce top_k or segment into multiple queries"
    }
Fix: Reduce top_k from 20 to 10, or implement query decomposition that handles topic complexity through multiple queries.
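Instead of a fixed top_k, the context can also be trimmed to a token budget. The sketch below keeps the highest-ranked chunks until the budget is used up. The helper `fit_chunks_to_budget` is a hypothetical name, and the default whitespace word count is only a crude stand-in for a real tokenizer, which should be passed in as `count_tokens`.

```python
from typing import Callable


def fit_chunks_to_budget(
    ranked_chunks: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep top-ranked chunks, in order, until the token budget is exhausted."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # adding this chunk would overflow the budget
        kept.append(chunk)
        used += cost
    return kept


if __name__ == "__main__":
    ranked = ["a b c", "d e", "f g h i"]
    print(fit_chunks_to_budget(ranked, budget_tokens=5))
```

Stopping at the first chunk that does not fit preserves the ranking order; skipping it and trying smaller, lower-ranked chunks is an alternative that trades relevance for fuller use of the budget.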
Failure 5: Hallucination Despite Relevant Context
The model generates content not present in context even when correct answers exist within the context.
# Diagnosis: Check model behavior with context-only prompts
# compare_to_context is a placeholder for your own grounding check
from your_rag_library import compare_to_context

def diagnose_hallucination(
    query: str,
    context_chunks: list[dict],
    response: str,
    ground_truth: str
) -> dict:
    """Identify hallucination patterns."""
    # Fraction (0.0-1.0) of response phrases that appear in the context
    grounded_fraction = compare_to_context(response, context_chunks)
    return {
        "response_grounded_in_context": grounded_fraction,
        "hallucination_detected": grounded_fraction < 0.7,
        "recommendation": "Lower temperature or use longer context"
    }
Fix: Decrease temperature from 0.7 to 0.1-0.3 for RAG applications. Add explicit prompt instructions to cite context. Check if context ordering confuses the model.
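The diagnostic above imports `compare_to_context` from a placeholder library. One rough way to approximate such a check, offered here as an assumption rather than a standard method, is to measure how many of the response's word n-grams also occur in the retrieved chunks. The name `grounded_fraction` and the default n-gram size of 3 are illustrative choices.

```python
import re


def grounded_fraction(response: str, context_chunks: list[dict], n: int = 3) -> float:
    """Fraction of the response's word n-grams that also occur in the context."""

    def ngrams(text: str) -> set[tuple[str, ...]]:
        words = re.findall(r"[a-z0-9]+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    response_grams = ngrams(response)
    if not response_grams:
        return 0.0  # response too short to score
    context_grams: set[tuple[str, ...]] = set()
    for chunk in context_chunks:
        context_grams |= ngrams(chunk["text"])
    return len(response_grams & context_grams) / len(response_grams)


if __name__ == "__main__":
    context = [{"text": "Restart the server after updating the auth token."}]
    print(grounded_fraction("Restart the server after updating the auth token", context))
    print(grounded_fraction("Delete the database and reinstall everything", context))
```

Lexical overlap penalizes faithful paraphrases, so a score below the 0.7 threshold is a prompt for manual review, not proof of hallucination.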
Run all five diagnostic functions against your production logs. For each failed query, identify which failure mode explains it and record the root cause frequency. Expect retrieval issues to dominate (70%+) over generation issues.
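Recording root cause frequency can be as simple as tallying one label per failed query. The sketch below assumes each failed query has already been assigned a failure-mode label by hand or by the diagnostics; the label strings and the helper name `failure_mode_frequencies` are hypothetical.

```python
from collections import Counter


def failure_mode_frequencies(labels: list[str]) -> dict[str, float]:
    """Return each failure mode's share of all failures, most common first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {mode: count / total for mode, count in counts.most_common()}


if __name__ == "__main__":
    # Hypothetical labels assigned to four failed queries
    labels = [
        "semantic_mismatch",
        "chunk_boundary",
        "semantic_mismatch",
        "hallucination",
    ]
    for mode, share in failure_mode_frequencies(labels).items():
        print(f"{mode}: {share:.0%}")
```

Summing the shares of the retrieval-side modes and comparing them with the generation-side modes shows whether your own logs match the expected split.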