RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Systems: Part 1

20. Common RAG Failures

Chapter 20 of 22 · 25 min
KEY INSIGHT

Roughly 80% of RAG failures trace to retrieval problems, not generation problems. Debug retrieval before adjusting prompts or models.

RAG systems fail in predictable patterns. Recognizing common failures enables targeted debugging. This chapter documents the most frequent production issues with diagnosis and fixes.

Failure 1: Semantic Mismatch

Users and documents describe the same concept with different words, and the embedding model fails to bridge the gap. The query "brain freeze headaches" retrieves nothing relevant while the documents discuss "ice cream headache triggers."

# Diagnosis: Check vocabulary overlap between queries and chunks
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics so punctuation does not hide matches."""
    return re.findall(r"[a-z0-9]+", text.lower())

def diagnose_semantic_mismatch(
    failed_queries: list[str],
    failed_chunks: list[str]
) -> dict:
    """Identify vocabulary gaps."""
    query_terms = Counter(word for q in failed_queries for word in tokenize(q))
    chunk_terms = Counter(word for c in failed_chunks for word in tokenize(c))

    # Find longer terms (5+ characters) that appear in queries but never in chunks.
    # There is no stemming, so "headaches" and "headache" count as different terms.
    missing_terms = {
        term: count for term, count in query_terms.items()
        if term not in chunk_terms and len(term) > 4
    }

    return {"terms_in_queries_not_chunks": missing_terms}

# Example result (counts are illustrative):
# {'terms_in_queries_not_chunks': {'freeze': 5, 'brain': 3, 'headaches': 4}}

Fix: Fine-tune the embedding model on domain-specific query-document pairs, and add query expansion with synonyms.
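Query expansion with synonyms can be sketched in a few lines. This is a minimal illustration, not a production approach: the `SYNONYMS` map and the `expand_query` helper are hypothetical names, and the synonym entries are hand-written for the brain-freeze example rather than taken from any real thesaurus.

```python
import re

# Hypothetical, hand-written synonym map. In practice the entries would come
# from domain experts, search logs, or a thesaurus built for your corpus.
SYNONYMS: dict[str, list[str]] = {
    "brain freeze": ["ice cream", "cold stimulus"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the original query plus one lowercased variant per known synonym."""
    variants = [query]
    lowered = query.lower()
    for phrase, alternatives in synonyms.items():
        # Whole-phrase match so the phrase does not fire inside longer words
        pattern = re.compile(rf"\b{re.escape(phrase)}\b")
        if pattern.search(lowered):
            for alt in alternatives:
                variants.append(pattern.sub(lambda _m, alt=alt: alt, lowered))
    return variants

print(expand_query("brain freeze headaches"))
```

Each variant would then be sent to the retriever and the results merged, deduplicating by chunk id, so the "ice cream headaches" variant can reach documents the original wording missed.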

Failure 2: Chunk Boundary Problems

Required information spans multiple chunks. A troubleshooting guide says "Step 1: Check the config file. Step 2: Update the auth token in config.yaml. Step 3: Restart the server." These three steps land in three separate chunks, so no single chunk contains the complete instructions.

# Diagnosis: Check if related content spans chunks
import re

# Whole-word match so "then" does not fire inside words like "authentication"
SEQUENTIAL_PATTERN = re.compile(r"\b(?:step \d+|first|then|next)\b")

def diagnose_chunk_boundaries(
    failed_query: str,
    relevant_chunks: list[dict]
) -> dict:
    """Detect if answer spans chunks."""
    # Check for sequential markers in any of the relevant chunks
    has_sequential = any(
        SEQUENTIAL_PATTERN.search(c["text"].lower())
        for c in relevant_chunks
    )

    return {
        "chunks_needed_for_answer": len(relevant_chunks),
        "has_sequential_content": has_sequential,
        "recommendation": "Increase overlap or use larger chunks"
    }

# Example output for the three-step guide above:
# {'chunks_needed_for_answer': 3, 'has_sequential_content': True,
#  'recommendation': 'Increase overlap or use larger chunks'}

Fix: Increase chunk overlap from 10% to 20-30% or use variable chunk sizes that respect semantic boundaries.
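To make the overlap setting concrete, here is a minimal sketch of fixed-size chunking with a configurable overlap ratio. The function name `chunk_with_overlap` and the word-level splitting are assumptions for illustration; a real pipeline would more likely count tokens and respect sentence or section boundaries.

```python
def chunk_with_overlap(
    words: list[str],
    chunk_size: int,
    overlap_ratio: float
) -> list[list[str]]:
    """Split a word list into chunks that share overlap_ratio of their words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

text = ("Step 1: Check the config file. Step 2: Update the auth token "
        "in config.yaml. Step 3: Restart the server.")
for chunk in chunk_with_overlap(text.split(), chunk_size=10, overlap_ratio=0.3):
    print(" ".join(chunk))
```

With a 30% overlap, the last three words of each chunk are repeated at the start of the next, so a step that straddles a boundary appears whole in at least one of the two neighbors more often than with 10%.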

Failure 3: Metadata Filter Conflicts

Metadata filters exclude relevant documents. The query "authentication" with the filter {category: "security"} misses authentication content whose metadata category is "troubleshooting."

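# Diagnosis: Check how many relevant chunks are excluded by filters
def diagnose_filter_conflicts(
    query: str,
    retrieved_chunks: list[dict],
    all_relevant_chunks: list[dict]
) -> dict:
    """Identify if filters exclude relevant content.

    retrieved_chunks: results returned with the metadata filter applied.
    all_relevant_chunks: chunks known to be relevant, e.g. hand-labeled or
    retrieved with the filter disabled.
    """
    retrieved_ids = {c["id"] for c in retrieved_chunks}
    relevant_ids = {c["id"] for c in all_relevant_chunks}

    # Relevant chunks that never made it into the filtered results
    missed = relevant_ids - retrieved_ids

    return {
        "total_relevant": len(relevant_ids),
        "retrieved": len(retrieved_ids & relevant_ids),
        "missed_by_filter": len(missed),
        # Percentage (0-100) of relevant chunks that were missed
        "percentage_missed": len(missed) / len(relevant_ids) * 100 if relevant_ids else 0
    }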

Fix: Don't apply filters by default. Apply them only when the user explicitly restricts scope. If filters are required, maintain multiple metadata indexes so that categories overlap and a document relevant to two categories is reachable from both.
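The "filter only on request" rule might look like the sketch below. It assumes a hypothetical chunk schema in which each chunk carries a "categories" list rather than a single category, so one chunk can belong to both "security" and "troubleshooting". The `filter_chunks` helper is illustrative and works on an in-memory list; a vector store would express the same idea through its own filter syntax.

```python
from typing import Optional

def filter_chunks(chunks: list[dict], category: Optional[str] = None) -> list[dict]:
    """Return candidate chunks, narrowing by category only when one is given."""
    if category is None:
        return chunks  # no explicit scope from the user: search everything
    # Multi-valued categories let a chunk match more than one filter
    return [c for c in chunks if category in c.get("categories", [])]

chunks = [
    {"id": 1, "text": "Reset the auth token", "categories": ["security", "troubleshooting"]},
    {"id": 2, "text": "Install the server", "categories": ["setup"]},
]
print([c["id"] for c in filter_chunks(chunks)])
print([c["id"] for c in filter_chunks(chunks, "security")])
```

Because chunk 1 is tagged with both categories, a query scoped to either "security" or "troubleshooting" still reaches it.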

Failure 4: Context Length Exceeds Model Limits

Retrieving 20 chunks for a verbose technical topic can exceed the model's context window. The model then sees truncated or incomplete context.

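# Diagnosis: Check token counts at each pipeline stage
from typing import Callable

def estimate_tokens(text: str) -> int:
    """Crude placeholder (about 4 characters per token). Use your model's tokenizer instead."""
    return len(text) // 4

def diagnose_context_length(
    assembled_context: str,
    model_max_tokens: int,
    expected_response_tokens: int,
    count_tokens: Callable[[str], int] = estimate_tokens
) -> dict:
    """Detect context overflow."""
    context_tokens = count_tokens(assembled_context)
    # Hold back 500 tokens as a safety margin; adjust for your prompt overhead
    available_for_context = model_max_tokens - expected_response_tokens - 500

    overflow = context_tokens - available_for_context
    if overflow <= 0:
        overflow_percentage = 0
    elif available_for_context <= 0:
        overflow_percentage = float("inf")  # no room for any context at all
    else:
        overflow_percentage = overflow / available_for_context * 100

    return {
        "context_tokens": context_tokens,
        "available_tokens": available_for_context,
        "overflow_percentage": overflow_percentage,
        "recommendation": "Reduce top_k or segment into multiple queries"
    }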

Fix: Reduce top_k from 20 to 10, or implement query decomposition that handles topic complexity through multiple queries.
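As an alternative to a fixed top_k, the context can be trimmed to a token budget. The sketch below is one way to do that under stated assumptions: `fit_chunks_to_budget` is a hypothetical helper, the chunks are assumed to arrive in rank order, and the caller supplies the token counter. The word-count lambda in the usage lines is only a stand-in for a real tokenizer.

```python
from typing import Callable

def fit_chunks_to_budget(
    ranked_chunks: list[str],
    token_budget: int,
    count_tokens: Callable[[str], int]
) -> list[str]:
    """Keep the highest-ranked chunks that fit in token_budget, in rank order."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > token_budget:
            break  # stop at the first chunk that does not fit
        kept.append(chunk)
        used += cost
    return kept

# Stand-in counter for the example only: one token per whitespace-separated word
word_count = lambda text: len(text.split())
ranked = ["check the config file", "update the auth token", "restart the server now please"]
print(fit_chunks_to_budget(ranked, token_budget=8, count_tokens=word_count))
```

Stopping at the first chunk that does not fit keeps the result a strict prefix of the ranking, which makes the behavior easy to reason about; skipping oversized chunks and continuing is another reasonable policy.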

Failure 5: Hallucination Despite Relevant Context

The model generates content that is not in the retrieved context, even when the context contains the correct answer.

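# Diagnosis: Measure how much of the response is grounded in the retrieved context
import re

def compare_to_context(response: str, context_chunks: list[dict]) -> float:
    """Crude lexical stand-in: fraction of response words that occur in the context.
    Replace with your own grounding check (e.g. an NLI model or an LLM judge)."""
    context_text = " ".join(c["text"] for c in context_chunks).lower()
    context_words = set(re.findall(r"[a-z0-9]+", context_text))
    response_words = re.findall(r"[a-z0-9]+", response.lower())
    if not response_words:
        return 0.0
    return sum(w in context_words for w in response_words) / len(response_words)

def diagnose_hallucination(
    query: str,
    context_chunks: list[dict],
    response: str,
    ground_truth: str
) -> dict:
    """Identify hallucination patterns."""
    # Fraction (0-1) of the response that appears in the context
    factual_percentage = compare_to_context(response, context_chunks)

    return {
        "response_grounded_in_context": factual_percentage,
        "hallucination_detected": factual_percentage < 0.7,
        "recommendation": "Lower temperature or add explicit citation instructions"
    }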

Fix: Decrease temperature from 0.7 to 0.1-0.3 for RAG applications. Add explicit prompt instructions to cite context. Check if context ordering confuses the model.
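The prompt-side part of this fix can be sketched as a small prompt builder. The wording of the instructions and the names `build_grounded_prompt` and `GENERATION_SETTINGS` are illustrative assumptions, not a prescribed template; sampling parameter names also vary by inference backend.

```python
def build_grounded_prompt(query: str, context_chunks: list[dict]) -> str:
    """Number each chunk and instruct the model to answer only from them."""
    numbered = "\n\n".join(
        f"[{i}] {chunk['text']}"
        for i, chunk in enumerate(context_chunks, start=1)
    )
    return (
        "Answer the question using only the numbered context below. "
        "Cite the supporting passage after each claim, e.g. [1]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical sampling settings, within the 0.1-0.3 range suggested above
GENERATION_SETTINGS = {"temperature": 0.2}

chunks = [
    {"text": "Update the auth token in config.yaml."},
    {"text": "Restart the server after changing the config."},
]
print(build_grounded_prompt("How do I rotate the auth token?", chunks))
```

Numbering the chunks also makes ordering experiments cheap: reorder `context_chunks`, rebuild the prompt, and compare the grounding score of the two responses.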

EXERCISE

Run all five diagnostic functions against your production logs. For each failed query, identify which failure mode explains it and record the root cause frequency. Expect retrieval issues to dominate (70%+) over generation issues.
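Recording root cause frequency needs little more than a counter. The sketch below assumes you have already labeled each failed query by hand or from the diagnostics; the `root_cause` field and the label strings are hypothetical, and the sample data is made up for illustration.

```python
from collections import Counter

def tally_root_causes(labeled_failures: list[dict]) -> dict[str, float]:
    """Return the share of failures per root cause, most frequent first."""
    counts = Counter(f["root_cause"] for f in labeled_failures)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cause: n / total for cause, n in counts.most_common()}

# Made-up labels for illustration only
failures = [
    {"query": "brain freeze headaches", "root_cause": "semantic_mismatch"},
    {"query": "rotate auth token", "root_cause": "chunk_boundary"},
    {"query": "ice cream pain", "root_cause": "semantic_mismatch"},
    {"query": "summarize the guide", "root_cause": "hallucination"},
]
print(tally_root_causes(failures))
```

Grouping the first four failure modes as retrieval-side and the fifth as generation-side then gives the split the exercise asks you to check.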
