Faithfulness — RAG Evaluation and Metrics (Chapter 6)

Faithfulness measures whether a generated answer stays grounded in the retrieved context without introducing hallucinated information. The RAGAS implementation prompts an LLM to verify each claim in the answer against the source context.

The evaluation works in two steps: the answer is split into discrete claims, and each claim is then checked against the context. Claims that cannot be verified from the context reduce the faithfulness score. The final score is the number of supported claims divided by the total number of claims, so it ranges from 0 (nothing supported) to 1 (everything supported).
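
The scoring step reduces to a simple ratio. The sketch below is a minimal illustration, not the RAGAS implementation: the `ClaimVerdict` type and the `faithfulness_score` helper are hypothetical names, and the verdicts are hand-labeled here, where RAGAS would obtain them from an LLM judge.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    """One claim extracted from an answer and whether the context supports it."""
    claim: str
    supported: bool


def faithfulness_score(verdicts: list[ClaimVerdict]) -> float:
    """Return the fraction of claims supported by the context.

    Assumption: an answer with no extractable claims is scored 0.0 here;
    a real implementation may treat that case differently.
    """
    if not verdicts:
        return 0.0
    supported = sum(1 for v in verdicts if v.supported)
    return supported / len(verdicts)


# Hand-labeled verdicts for illustration; an LLM judge would produce these.
verdicts = [
    ClaimVerdict("Returns are accepted within 30 days.", supported=True),
    ClaimVerdict("Refunds are issued to the original payment method.", supported=True),
    ClaimVerdict("Return shipping is always free.", supported=False),
]
print(round(faithfulness_score(verdicts), 2))  # 0.67
```

Two of the three claims are supported, so the score is 2/3. Keeping the per-claim verdicts, rather than only the final ratio, is what later makes failure analysis possible.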

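# Note: evaluate() calls an LLM judge, so model credentials must be configured.
from datasets import Dataset  # Hugging Face `datasets` package
from ragas import evaluate
from ragas.metrics import faithfulness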

# Example with a faithful answer vs a hallucinated one
faithful_data = [
    {
        "user_input": "How long does standard shipping take?",
        "retrieved_contexts": [
            "Standard shipping in the continental US takes 5-7 business days. "
            "Alaska and Hawaii may take 7-10 business days. "
            "International shipping varies by destination."
        ],
        "response": "Standard shipping typically takes 5-7 business days "
                   "within the continental United States."
    }
]

hallucinated_data = [
    {
        "user_input": "How long does standard shipping take?",
        "retrieved_contexts": [
            "Standard shipping in the continental US takes 5-7 business days. "
            "Alaska and Hawaii may take 7-10 business days. "
            "International shipping varies by destination."
        ],
        "response": "Standard shipping takes 2-3 business days within the "
                   "continental United States. Alaska and Hawaii receive "
                   "overnight shipping options."
    }
]

faithful_ds = Dataset.from_list(faithful_data)
hallucinated_ds = Dataset.from_list(hallucinated_data)

faithful_result = evaluate(faithful_ds, metrics=[faithfulness])
hallucinated_result = evaluate(hallucinated_ds, metrics=[faithfulness])

print(f"Faithful answer score: {faithful_result['faithfulness']}")
print(f"Hallucinated answer score: {hallucinated_result['faithfulness']}")
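# Illustrative output; scores are LLM-judged and can vary by model and run:
# Faithful answer score: 1.0
# Hallucinated answer score: 0.0  (neither claim is supported by the context)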

The hallucinated example shows how Faithfulness catches fabrications. The answer claims 2-3 business days, while the context specifies 5-7. It also claims overnight shipping for Alaska and Hawaii, while the context gives only a 7-10 business day estimate and says nothing about overnight options. Each unsupported claim lowers the proportion of verified claims, and with it the score.

Common causes of low Faithfulness include prompt instructions that encourage elaboration beyond the context, language models whose strong world knowledge overrides retrieved facts, and insufficient or ambiguous context that admits multiple interpretations. Each cause requires a different remediation.
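
For the first two causes, one common remediation is an explicit grounding instruction in the generation prompt. The following is a minimal sketch under stated assumptions: `build_grounded_prompt` is a hypothetical helper, and the instruction wording is illustrative rather than a prompt prescribed by RAGAS or any model vendor.

```python
def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a generation prompt that restricts the answer to the given context."""
    # Number the passages so the model (and a reviewer) can refer to them.
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context passages below.\n"
        "If the context does not contain the answer, say that you do not know.\n"
        "Do not add details that are not stated in the context.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How long does standard shipping take?",
    ["Standard shipping in the continental US takes 5-7 business days."],
)
print(prompt)
```

Whether such an instruction actually raises Faithfulness depends on the model, so the effect should be confirmed by re-running the metric on the same evaluation set before and after the change.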

When Faithfulness drops in production, examine which claims failed verification. If failed claims cluster around specific topics, the retrieval system may be fetching the wrong contexts for those topics. If failed claims show the model's general knowledge overriding retrieved facts, the generation prompt may need grounding instructions. Systematic failure analysis of this kind surfaces the actual root cause.
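
The triage described above can be sketched as a simple count over logged failures. This is an illustration only: the record layout and the `topic` and `failure_type` labels are hypothetical and are assumed to come from manual review or a separate classifier, not from RAGAS itself.

```python
from collections import Counter

# Hypothetical log of claims that failed verification.
failed_claims = [
    {"claim": "Shipping takes 2-3 business days.", "topic": "shipping",
     "failure_type": "contradicts_context"},
    {"claim": "Alaska gets overnight shipping.", "topic": "shipping",
     "failure_type": "not_in_context"},
    {"claim": "Returns are free worldwide.", "topic": "returns",
     "failure_type": "not_in_context"},
    {"claim": "Hawaii orders ship overnight.", "topic": "shipping",
     "failure_type": "not_in_context"},
]


def summarize_failures(records: list[dict]) -> tuple[Counter, Counter]:
    """Count failed claims per topic and per failure type."""
    by_topic = Counter(r["topic"] for r in records)
    by_type = Counter(r["failure_type"] for r in records)
    return by_topic, by_type


by_topic, by_type = summarize_failures(failed_claims)
print(by_topic.most_common())  # [('shipping', 3), ('returns', 1)]
print(by_type.most_common())   # [('not_in_context', 3), ('contradicts_context', 1)]
```

A heavy concentration in one topic suggests inspecting retrieval for that topic first, while failures spread evenly across topics point more toward the generation prompt.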