Common RAG Failures — RAG Systems: Part 1 (Chapter 20)

RAG systems fail in predictable patterns. Recognizing common failures enables targeted debugging. This chapter documents the most frequent production issues with diagnosis and fixes.

Failure 1: Semantic Mismatch

User queries and documents describe the same concept in different words. The query "brain freeze headaches" retrieves nothing relevant even though the documents discuss "ice cream headache triggers."

# Diagnosis: Check vocabulary overlap between queries and chunks
import re
from collections import Counter

def _tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics so punctuation does not hide matches."""
    return re.findall(r"[a-z0-9']+", text.lower())

def diagnose_semantic_mismatch(
    failed_queries: list[str],
    failed_chunks: list[str]
) -> dict:
    """Identify vocabulary gaps between failed queries and their chunks."""
    query_terms = Counter(word for q in failed_queries for word in _tokenize(q))
    chunk_terms = {word for c in failed_chunks for word in _tokenize(c)}

    # Query terms longer than four characters that never appear in any chunk
    missing_terms = {
        term: count for term, count in query_terms.items()
        if term not in chunk_terms and len(term) > 4
    }

    return {"terms_in_queries_not_chunks": missing_terms}

# Example result (illustrative counts):
# {'terms_in_queries_not_chunks': {'brain': 3, 'freeze': 5, 'headaches': 4}}

Fix: Fine-tune the embedding model on domain-specific query-document pairs. Add query expansion so that each query is also searched with its known synonyms.
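
The query-expansion half of this fix can be sketched in a few lines. This is a minimal illustration, not a production design: the `SYNONYMS` table and the `expand_query` function are hypothetical names, and a real system would more likely draw its synonyms from a curated domain glossary or from query logs.

```python
# Hypothetical synonym table for illustration; a real table would be
# curated per domain or mined from query logs.
SYNONYMS: dict[str, list[str]] = {
    "brain freeze": ["ice cream headache"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the lowercased query plus one variant per matching synonym."""
    lowered = query.lower()
    variants = [lowered]
    for phrase, alternatives in synonyms.items():
        if phrase in lowered:
            # Swap the user's phrase for each wording the documents use
            variants.extend(lowered.replace(phrase, alt) for alt in alternatives)
    return variants

print(expand_query("What causes brain freeze"))
# ['what causes brain freeze', 'what causes ice cream headache']
```

Each variant would then be embedded and searched separately, with the result lists merged before ranking.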

Failure 2: Chunk Boundary Problems

Required information spans multiple chunks. A troubleshooting guide says "Step 1: Check the config file. Step 2: Update the auth token in config.yaml. Step 3: Restart the server." These three steps land in three separate chunks, so no single chunk contains the complete instructions.

# Diagnosis: Check if related content spans chunks
import re

# Whole-word markers suggesting a chunk is one part of a longer sequence.
# Word boundaries matter: a plain substring test for "then" also matches
# "authentication".
SEQUENTIAL_PATTERN = re.compile(r"\b(?:step \d+|first|then|next)\b")

def diagnose_chunk_boundaries(
    failed_query: str,
    relevant_chunks: list[dict]
) -> dict:
    """Detect if the answer to failed_query spans several chunks."""
    has_sequential = any(
        SEQUENTIAL_PATTERN.search(c["text"].lower())
        for c in relevant_chunks
    )

    return {
        "query": failed_query,
        "chunks_needed_for_answer": len(relevant_chunks),
        "has_sequential_content": has_sequential,
        "recommendation": "Increase overlap or use larger chunks"
    }

# Example output for the three-step guide above:
# {'query': 'how do I update the auth token', 'chunks_needed_for_answer': 3,
#  'has_sequential_content': True,
#  'recommendation': 'Increase overlap or use larger chunks'}

Fix: Increase chunk overlap from 10% to 20-30% or use variable chunk sizes that respect semantic boundaries.
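
To make the overlap setting concrete, here is a minimal sliding-window chunker over a list of words. The function name `chunk_with_overlap` and its parameters are assumptions for illustration; it works on whitespace-separated words rather than model tokens and does not attempt to respect semantic boundaries.

```python
def chunk_with_overlap(
    words: list[str],
    chunk_size: int = 200,
    overlap_ratio: float = 0.25,
) -> list[list[str]]:
    """Split words into fixed-size windows that share overlap_ratio of their content."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    # Distance between the starts of consecutive chunks
    step = max(1, int(chunk_size * (1 - overlap_ratio)))

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

words = "a b c d e f g h i j".split()
print(chunk_with_overlap(words, chunk_size=4, overlap_ratio=0.25))
# [['a', 'b', 'c', 'd'], ['d', 'e', 'f', 'g'], ['g', 'h', 'i', 'j']]
```

With 25% overlap, the last word of each chunk is repeated at the start of the next, so a step that falls on a boundary appears whole in at least one chunk as long as it is shorter than the overlap.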

Failure 3: Metadata Filter Conflicts

Metadata filters exclude relevant documents. A query for "authentication" with the filter {category: "security"} misses authentication content whose metadata marks it as "troubleshooting."

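# Diagnosis: Check how many relevant chunks the filtered search failed to return
def diagnose_filter_conflicts(
    query: str,
    retrieved_chunks: list[dict],
    all_relevant_chunks: list[dict]
) -> dict:
    """Identify if filters exclude relevant content.

    retrieved_chunks is the result of the filtered search for query;
    all_relevant_chunks is the labeled relevant set, ignoring filters.
    """
    retrieved_ids = {c["id"] for c in retrieved_chunks}
    relevant_ids = {c["id"] for c in all_relevant_chunks}

    # Relevant chunks absent from the filtered results. Rerun the query
    # without the filter to confirm the filter, not ranking, caused the miss.
    missed = relevant_ids - retrieved_ids

    return {
        "query": query,
        "total_relevant": len(relevant_ids),
        "retrieved": len(retrieved_ids & relevant_ids),
        "missed_by_filter": len(missed),
        # Expressed as a percentage (0-100), matching the key name
        "percentage_missed": 100 * len(missed) / len(relevant_ids) if relevant_ids else 0.0
    }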

Fix: Don't apply filters by default. Apply them only when the user explicitly restricts scope. If filters are required, index the metadata so that categories can overlap: content such as authentication troubleshooting should match both the "security" and the "troubleshooting" filter.
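
The "no filter by default" rule can be sketched as follows. Everything here is illustrative: `apply_filter` and `retrieve` are hypothetical names, the chunks are plain dicts with a `metadata` field, and a case-insensitive keyword match stands in for the vector search a real pipeline would use.

```python
from typing import Optional

def apply_filter(chunks: list[dict], metadata_filter: Optional[dict]) -> list[dict]:
    """Keep chunks matching every key in the filter; no filter keeps everything."""
    if not metadata_filter:
        return list(chunks)
    return [
        c for c in chunks
        if all(c.get("metadata", {}).get(k) == v for k, v in metadata_filter.items())
    ]

def retrieve(query: str, chunks: list[dict], user_filter: Optional[dict] = None) -> list[dict]:
    """Search all chunks unless the user explicitly passed a filter."""
    candidates = apply_filter(chunks, user_filter)
    # Keyword match is a stand-in for the real similarity search
    return [c for c in candidates if query.lower() in c["text"].lower()]

chunks = [
    {"id": 1, "text": "Authentication overview", "metadata": {"category": "security"}},
    {"id": 2, "text": "Fixing authentication errors", "metadata": {"category": "troubleshooting"}},
]
print([c["id"] for c in retrieve("authentication", chunks)])
# [1, 2]
print([c["id"] for c in retrieve("authentication", chunks, {"category": "security"})])
# [1]
```

The second call shows the failure described above: the explicit filter drops the troubleshooting chunk even though it is relevant.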

Failure 4: Context Length Exceeds Model Limits

Retrieving 20 chunks for a verbose technical topic can exceed the model's context window. The model then sees truncated or incomplete context.

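# Diagnosis: Check the assembled context against the model's token budget
# Placeholder import: substitute your tokenizer's counting function
from your_rag_library import count_tokens

def diagnose_context_length(
    assembled_context: str,
    model_max_tokens: int,
    expected_response_tokens: int,
    prompt_overhead_tokens: int = 500
) -> dict:
    """Detect context overflow."""
    context_tokens = count_tokens(assembled_context)
    # Reserve room for the response plus a fixed overhead, assumed here to
    # cover the system prompt, instructions, and the question itself
    available_for_context = (
        model_max_tokens - expected_response_tokens - prompt_overhead_tokens
    )

    overflow = context_tokens - available_for_context
    if overflow <= 0:
        overflow_percentage = 0.0
    elif available_for_context > 0:
        overflow_percentage = overflow / available_for_context * 100
    else:
        overflow_percentage = float("inf")  # no room for any context

    return {
        "context_tokens": context_tokens,
        "available_tokens": available_for_context,
        "overflow_percentage": overflow_percentage,
        "recommendation": "Reduce top_k or segment into multiple queries"
    }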

Fix: Reduce top_k from 20 to 10, or implement query decomposition so that a complex topic is covered by several narrower queries, each with its own smaller context.
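
Rather than hard-coding a smaller top_k, one option is to let the token budget decide how many chunks are kept. The sketch below is an assumption-laden illustration: `fit_chunks_to_budget` and `approx_token_count` are hypothetical names, and the word-count tokenizer is only a crude stand-in that should be replaced with the tokenizer of the target model.

```python
from typing import Callable

def approx_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def fit_chunks_to_budget(
    ranked_chunks: list[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = approx_token_count,
) -> list[str]:
    """Keep the highest-ranked chunks, in rank order, until the budget is used up."""
    selected = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected

ranked = ["a b c", "d e", "f g h i"]
print(fit_chunks_to_budget(ranked, budget_tokens=5))
# ['a b c', 'd e']
```

Stopping at the first chunk that does not fit preserves rank order; skipping it and trying lower-ranked chunks would fill the budget more tightly at the cost of including less relevant text.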

Failure 5: Hallucination Despite Relevant Context

The model generates content that is not present in the context, even when the correct answer is.

# Diagnosis: Measure how much of the response is supported by the retrieved context
# Placeholder import: compare_to_context is expected to return a score from 0 to 1
from your_rag_library import compare_to_context

def diagnose_hallucination(
    query: str,
    context_chunks: list[dict],
    response: str,
    ground_truth: str
) -> dict:
    """Identify hallucination patterns.

    query and ground_truth are not used by this check.
    """
    # Fraction (0 to 1) of response phrases that also appear in the context
    grounded_fraction = compare_to_context(response, context_chunks)

    return {
        "response_grounded_in_context": grounded_fraction,
        "hallucination_detected": grounded_fraction < 0.7,
        "recommendation": "Lower temperature or add explicit grounding instructions"
    }
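
The `compare_to_context` helper is left undefined above. One rough way to approximate it, assuming that lexical overlap is an acceptable proxy for grounding, is to measure what fraction of the response's word trigrams also occur in the context. The name `grounded_fraction` and the choice of trigrams are assumptions; paraphrased but correct answers will score low under this measure, so an entailment model or an LLM judge may be a better fit in practice.

```python
import re

def _tokens(text: str) -> list[str]:
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z0-9']+", text.lower())

def _ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def grounded_fraction(response: str, context_chunks: list[dict], n: int = 3) -> float:
    """Fraction of response n-grams that also appear somewhere in the context."""
    response_ngrams = _ngrams(_tokens(response), n)
    if not response_ngrams:
        return 0.0
    context_ngrams: set[tuple[str, ...]] = set()
    for chunk in context_chunks:
        context_ngrams |= _ngrams(_tokens(chunk["text"]), n)
    return len(response_ngrams & context_ngrams) / len(response_ngrams)

context = [{"text": "Restart the server after updating the auth token."}]
print(grounded_fraction("Restart the server after updating the auth token.", context))
# 1.0
print(grounded_fraction("Reinstall the operating system today", context))
# 0.0
```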

Fix: Decrease temperature from 0.7 to 0.1-0.3 for RAG applications. Add explicit prompt instructions to cite context. Check if context ordering confuses the model.
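
The prompt-instruction part of this fix might look like the sketch below. The function name `build_grounded_prompt` and the exact wording of the instructions are illustrative assumptions, not a tested template; the point is that passages are numbered so the model can cite them, and that the prompt gives the model a way to decline.

```python
def build_grounded_prompt(query: str, context_chunks: list[dict]) -> str:
    """Assemble a prompt that numbers each passage and asks for citations."""
    numbered = "\n\n".join(
        f"[{i}] {chunk['text']}"
        for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context passages below.\n"
        "Cite the passage number in square brackets after each claim.\n"
        "If the passages do not contain the answer, say that you do not know.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    "How do I update the auth token?",
    [{"text": "Update the auth token in config.yaml."},
     {"text": "Restart the server."}],
)
print(prompt)
```

Cited passage numbers in the response also give the diagnostic above something concrete to check: a claim with no citation, or a citation to a passage that does not support it, is a candidate hallucination.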