RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
RAG Evaluation and Metrics

06. Faithfulness

Chapter 6 of 18 · 15 min
KEY INSIGHT

Faithfulness scores quantify hallucination by measuring what proportion of answer claims can be verified in retrieved context.

Faithfulness measures whether a generated answer stays grounded in the retrieved context without introducing hallucinated information. The RAGAS implementation prompts an LLM to verify each claim in the answer against the source context.

The evaluation splits the answer into discrete claims, then checks each claim against the context. Claims that cannot be verified from the context reduce the faithfulness score. The final score is the number of verifiable claims divided by the total number of claims, so it ranges from 0 to 1.
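The proportion described above can be sketched in a few lines. This is a minimal illustration, not the RAGAS source: `faithfulness_score` is a hypothetical helper, the verdicts are assumed to come from a judge LLM, and returning NaN for an answer with no claims is an assumption of this sketch.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Return the fraction of claims judged as supported by the context.

    `verdicts` holds one boolean per claim: True if the claim could be
    verified from the retrieved context, False otherwise.
    """
    if not verdicts:
        # Assumption of this sketch: no claims means the score is undefined.
        return float("nan")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for an answer split into four claims,
# three of which the judge could verify from the context.
score = faithfulness_score([True, True, True, False])
print(score)  # 0.75
```

An answer whose claims are all verified scores 1.0, and an answer with no verifiable claims scores 0.0.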

# Assumes ragas >= 0.2, where EvaluationDataset accepts the
# user_input / retrieved_contexts / response schema used below.
from ragas import EvaluationDataset, evaluate
from ragas.metrics import faithfulness

# Example with a faithful answer vs a hallucinated one
faithful_data = [
    {
        "user_input": "How long does standard shipping take?",
        "retrieved_contexts": [
            "Standard shipping in the continental US takes 5-7 business days. "
            "Alaska and Hawaii may take 7-10 business days. "
            "International shipping varies by destination."
        ],
        "response": "Standard shipping typically takes 5-7 business days "
                   "within the continental United States."
    }
]

hallucinated_data = [
    {
        "user_input": "How long does standard shipping take?",
        "retrieved_contexts": [
            "Standard shipping in the continental US takes 5-7 business days. "
            "Alaska and Hawaii may take 7-10 business days. "
            "International shipping varies by destination."
        ],
        "response": "Standard shipping takes 2-3 business days within the "
                   "continental United States. Alaska and Hawaii receive "
                   "overnight shipping options."
    }
]

faithful_ds = EvaluationDataset.from_list(faithful_data)
hallucinated_ds = EvaluationDataset.from_list(hallucinated_data)

# evaluate() calls a judge LLM. By default ragas uses an OpenAI model
# (OPENAI_API_KEY must be set); pass llm=... to use a different judge.
faithful_result = evaluate(faithful_ds, metrics=[faithfulness])
hallucinated_result = evaluate(hallucinated_ds, metrics=[faithfulness])

# Depending on the ragas version, this lookup returns a single float
# or a list with one score per row.
print(f"Faithful answer score: {faithful_result['faithfulness']}")
print(f"Hallucinated answer score: {hallucinated_result['faithfulness']}")
# Scores are LLM-judged and can vary between runs. Expected:
# faithful answer near 1.0 (every claim is supported by the context);
# hallucinated answer near 0.0 (neither claim is supported by the context).

The hallucinated example demonstrates how Faithfulness catches fabrications. The answer claims 2-3 business days when the context specifies 5-7, which contradicts the source. The answer also claims overnight shipping for Alaska and Hawaii when the context says nothing about overnight options, which is unsupported. Each fabrication reduces the proportion of verified claims, and here neither claim can be verified.

Common causes of low Faithfulness include prompt instructions that encourage elaboration beyond the context, language models whose strong world knowledge overrides retrieved facts, and ambiguous or insufficient context that allows multiple interpretations. Each cause requires a different remediation.
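For the first cause above, one possible remediation is a generation prompt that restricts the model to the retrieved context. The sketch below is an illustration under assumptions: `build_grounded_prompt` and its wording are hypothetical, not part of RAGAS or any particular framework, and the instructions should be tuned for the model in use.

```python
def build_grounded_prompt(question: str, contexts: list[str]) -> str:
    """Assemble a prompt that tells the model to answer only from context."""
    # Number each context chunk so the model can tell the passages apart.
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know. "
        "Do not add details that are not stated in the context.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "How long does standard shipping take?",
    ["Standard shipping in the continental US takes 5-7 business days."],
)
print(prompt)
```

Re-running the Faithfulness evaluation before and after a prompt change like this shows whether the instruction actually reduced unsupported claims.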

When Faithfulness drops in production, examine which claims failed verification. If failed claims cluster around specific topics, the retrieval system may be fetching the wrong contexts for those topics. If failed claims come from the model's general knowledge overriding the context, the generation prompt may need grounding instructions. Systematic failure analysis surfaces the actual root cause.
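The clustering step can be sketched as a simple tally. Everything here is hypothetical: the records, the `topic` field (assumed to be assigned by hand or by a separate tagger), and the `failures_by_topic` helper. RAGAS does not produce topic labels.

```python
from collections import Counter

# Hypothetical failed-claim records collected from low-scoring answers.
failed_claims = [
    {"claim": "Standard shipping takes 2-3 business days.", "topic": "shipping"},
    {"claim": "Alaska and Hawaii receive overnight shipping.", "topic": "shipping"},
    {"claim": "Returns are accepted within 90 days.", "topic": "returns"},
]


def failures_by_topic(records: list[dict]) -> Counter:
    """Count failed claims per topic to reveal where failures cluster."""
    return Counter(record["topic"] for record in records)


counts = failures_by_topic(failed_claims)
total = sum(counts.values())
for topic, n in counts.most_common():
    print(f"{topic}: {n} of {total} failed claims")
```

A topic that dominates the tally is a candidate for a retrieval check. Failures spread evenly across topics point more toward the generation prompt.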

EXERCISE

Generate 10 test examples covering your system's common queries. Run RAGAS Faithfulness evaluation on all examples. For any example scoring below 0.8, manually identify which claims failed verification and categorize the failure as a retrieval error (wrong context) or a generation error (misunderstanding correct context).
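A minimal sketch of the triage step in this exercise, assuming you have already collected one Faithfulness score per example. The example IDs, scores, and manual labels are hypothetical placeholders. Only the 0.8 cutoff comes from the exercise.

```python
from collections import Counter

THRESHOLD = 0.8  # cutoff from the exercise

# Hypothetical per-example Faithfulness scores keyed by example ID.
scores = {"q01": 1.0, "q02": 0.5, "q03": 0.85, "q04": 0.0}


def needs_review(scores: dict[str, float], threshold: float = THRESHOLD) -> list[str]:
    """Return the IDs of examples scoring below the threshold, sorted."""
    return sorted(qid for qid, s in scores.items() if s < threshold)


flagged = needs_review(scores)
print(flagged)  # ['q02', 'q04']

# Hypothetical labels assigned after manually reading the failed claims.
manual_labels = {"q02": "generation", "q04": "retrieval"}
summary = Counter(manual_labels[qid] for qid in flagged)
print(dict(summary))
```

The summary shows whether the flagged failures lean toward retrieval or generation.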
