Hallucination Detection — RAG Evaluation and Metrics (Chapter 10)

Hallucinations in RAG systems occur when the model produces confident-sounding claims that the retrieved context does not support. Detecting these failures is harder than measuring relevance because fabricated information can be phrased fluently while remaining disconnected from the source material.

Causes of Hallucination in RAG Pipelines

Hallucination typically stems from three sources: contextual mismatch, instruction drift, and parametric knowledge conflicts. A contextual mismatch happens when the retrieved chunks do not contain the information needed to answer the query, and the generative model fills the gap anyway. Instruction drift occurs when the system prompt or generation settings, such as a high sampling temperature, cause the model to deviate from grounded output. Parametric knowledge conflicts arise when the LLM's prior knowledge from training disagrees with the retrieved context and the model favors its own prior, overriding or modifying the provided facts.
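
Contextual mismatch can sometimes be caught before generation by checking whether the retrieved chunks mention the terms the query asks about. The following is a minimal sketch under stated assumptions, not a standard metric: the function name `query_coverage` and the `STOPWORDS` list are hypothetical and chosen only for illustration, and plain token matching will miss synonyms and paraphrases.

```python
import re
from typing import List, Set

# Hypothetical, deliberately tiny stop list for illustration only.
STOPWORDS = {
    "the", "a", "an", "of", "in", "on", "for", "to", "and", "or", "by",
    "is", "are", "was", "were", "what", "which", "who", "how", "did", "do",
}

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def query_coverage(query: str, context_chunks: List[str]) -> float:
    """Fraction of query content words that appear in any retrieved chunk."""
    query_terms = tokenize(query) - STOPWORDS
    if not query_terms:
        return 1.0  # Nothing to check
    context_terms = tokenize(" ".join(context_chunks))
    return len(query_terms & context_terms) / len(query_terms)

coverage = query_coverage(
    "What was revenue growth in Q3 2024?",
    [
        "Q3 2024 saw significant market expansion.",
        "Customer acquisition costs decreased by 12%.",
    ],
)
print(coverage)  # 0.5: 'q3' and '2024' are covered, 'revenue' and 'growth' are not
```

A low coverage value suggests the retriever did not return the needed material, so any specific answer about the uncovered terms deserves extra scrutiny.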

Detection Methods

The NER overlap method extracts named entities from the generated answer and verifies their presence in the retrieved context. If a high percentage of entities appear only in the generated text and not in the source chunks, that answer warrants flagging.

import re
from typing import List, Set

def extract_entities(text: str) -> Set[str]:
    """Simple pattern-based entity extraction for demonstration."""
    # In production, use spaCy or a fine-tuned NER model
    patterns = [
        r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b',  # Capitalized word sequences (rough proxy for proper nouns)
        r'\b\d+(?:\.\d+)*(?:%|\b)',  # Numbers, decimals, and percentages
    ]
    entities = set()
    for pattern in patterns:
        entities.update(re.findall(pattern, text))
    return entities

def hallucination_score(
    generated: str,
    context_chunks: List[str]
) -> dict:
    """Calculate entity overlap between answer and context."""
    gen_entities = extract_entities(generated)

    if not gen_entities:
        # Nothing to verify, so the answer is left unflagged
        return {
            "score": 1.0,
            "flagged": False,
            "entities_only_in_answer": [],
            "total_entities": 0,
            "grounded_entities": 0
        }

    context_text = " ".join(context_chunks)
    ctx_entities = extract_entities(context_text)

    present_in_context = gen_entities & ctx_entities
    missing_from_context = gen_entities - ctx_entities

    # Fraction of answer entities that also appear in the context
    score = len(present_in_context) / len(gen_entities)

    return {
        "score": score,
        "flagged": score < 0.5,  # Flag if less than half of the entities are grounded
        "entities_only_in_answer": sorted(missing_from_context),
        "total_entities": len(gen_entities),
        "grounded_entities": len(present_in_context)
    }

# Usage
result = hallucination_score(
    generated="The company achieved 47% revenue growth in Q3 2024.",
    context_chunks=[
        "Q3 2024 saw significant market expansion.",
        "Customer acquisition costs decreased by 12%."
    ]
)
print(result)
# {'score': 0.3333333333333333, 'flagged': True, 'entities_only_in_answer': ['47%', 'The'], 'total_entities': 3, 'grounded_entities': 1}
# 'The' is a false positive of the capitalized-word pattern and 'Q3' is not matched at all,
# which is why a real NER model is preferable to these regexes.

Attribution-Based Detection

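A more reliable approach measures similarity between answer spans and source chunks. Sentence-level attribution assigns each answer sentence its maximum similarity score across all retrieved chunks, and sentences whose best score falls below a threshold are flagged. The example uses TF-IDF cosine similarity as a lightweight lexical stand-in; dense sentence embeddings would capture paraphrase better.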

import re
from typing import List

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sentence_attribution(
    answer: str,
    context_chunks: List[str],
    threshold: float = 0.3
) -> List[dict]:
    """Score each answer sentence against retrieved context."""
    # Naive split on sentence-ending punctuation; use a proper sentence tokenizer in production
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', answer) if s.strip()]

    if not sentences:
        return []

    if not context_chunks:
        return [{"sentence": s, "attribution_score": 0.0, "flagged": True}
                for s in sentences]

    # Fit one vocabulary over sentences and chunks so their vectors are comparable
    vectorizer = TfidfVectorizer()
    vectorizer.fit(sentences + context_chunks)

    sentence_vecs = vectorizer.transform(sentences)
    context_vecs = vectorizer.transform(context_chunks)

    # Rows correspond to answer sentences, columns to context chunks
    similarities = cosine_similarity(sentence_vecs, context_vecs)

    results = []
    for sentence, row in zip(sentences, similarities):
        best = float(np.max(row))
        results.append({
            "sentence": sentence,
            "attribution_score": best,
            "flagged": best < threshold,
            "best_chunk_index": int(np.argmax(row))
        })
    return results

Failure Modes to Watch

A common failure mode is contextual hallucination, where the model reads the chunk correctly but infers conclusions the chunk does not support. Entity overlap will not catch this, because every entity in the answer can be grounded while the inference is not. Another failure mode involves numbers: models often hallucinate specific statistics while the surrounding text remains accurate. Temporal hallucination occurs when dates or sequences of events are fabricated.
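
Numeric and temporal fabrications can be targeted with a narrower check than general entity overlap: collect every number, percentage, and year in the answer and report those that never occur in the context. The sketch below is one possible approach, not a definitive implementation. The names `NUMBER_PATTERN`, `extract_numbers`, and `unsupported_numbers` are hypothetical, and the exact-string comparison assumes the answer and context format numbers the same way (so "47%" and "47 percent" would not match, and "1,200" is split at the comma).

```python
import re
from typing import List, Set

# Matches plain numbers, decimals, years, and percentages such as 47, 3.5, 2024, 12%.
NUMBER_PATTERN = re.compile(r"\b\d+(?:\.\d+)?(?:%|\b)")

def extract_numbers(text: str) -> Set[str]:
    """Return the set of numeric tokens found in the text."""
    return set(NUMBER_PATTERN.findall(text))

def unsupported_numbers(answer: str, context_chunks: List[str]) -> List[str]:
    """Return numeric tokens in the answer that never occur in the context."""
    context_numbers = extract_numbers(" ".join(context_chunks))
    return sorted(extract_numbers(answer) - context_numbers)

missing = unsupported_numbers(
    "The company achieved 47% revenue growth in Q3 2024.",
    [
        "Q3 2024 saw significant market expansion.",
        "Customer acquisition costs decreased by 12%.",
    ],
)
print(missing)  # ['47%']
```

Treating any unsupported number as grounds for review is stricter than an aggregate overlap score, which suits this failure mode because a single fabricated statistic can invalidate an otherwise accurate answer.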