RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Advanced RAG — Chunking, Retrieval, Re-ranking

21. RAGAS Faithfulness

Chapter 21 of 24 · 20 min
KEY INSIGHT

Faithfulness measures whether the generated answer stays within the bounds of what the retrieved context actually supports.

### Definition

An answer is faithful if it can be entirely attributed to the retrieved context, without introducing facts that are absent from, or contradicted by, the context. The score is the number of supported claims divided by the total number of claims in the answer. A faithfulness score of 1.0 means every claim in the answer maps to a supporting passage in the context. A score of 0.5 means half the claims are unsupported.

### Why It Matters

High retrieval relevance scores do not guarantee answer faithfulness. A retrieval system can return relevant chunks, but the LLM can still confabulate related-but-absent facts. Faithfulness validation catches this failure mode.

### RAGAS Faithfulness Implementation

```python
import re

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def computeFaithfulness(
    answer: str,
    context: str,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    RAGAS-style faithfulness: break answer into claims, check each against context.
    """
    # Step 1: Decompose answer into atomic claims
    claims_prompt = (
        "Break the following answer into independent factual claims. "
        "Each claim should be a single verifiable statement. "
        "List one per line.\n\nAnswer:\n{answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You extract factual claims accurately."},
            {"role": "user", "content": claims_prompt.format(answer=answer)}
        ],
        temperature=0.0,
        max_tokens=512
    )
    raw_claims = response.choices[0].message.content or ""
    # Drop blank lines and any list markers ("-", "*", "1.", "2)") the model adds
    claims = [
        re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", line).strip()
        for line in raw_claims.split("\n")
        if line.strip()
    ]
    print(f"[FAITHFULNESS] Found {len(claims)} claims")

    # Step 2: Verify each claim against context
    supported = 0
    claim_results = []
    verification_prompt = (
        "Given the context below, determine if the claim is supported "
        "by the context. Answer YES if the claim is entailed by the context. "
        "Answer NO if the claim contradicts the context or is not supported.\n\n"
        "Context:\n{context}\n\nClaim:\n{claim}"
    )
    for claim in claims:
        verification = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer only YES or NO."},
                {"role": "user", "content": verification_prompt.format(context=context, claim=claim)}
            ],
            temperature=0.0,
            max_tokens=32
        )
        verdict = (verification.choices[0].message.content or "").strip().upper()
        # Check the start of the reply: a substring test would also match
        # replies such as "NO, although the context says yes elsewhere"
        is_supported = verdict.startswith("YES")
        supported += 1 if is_supported else 0
        claim_results.append({
            "claim": claim,
            "supported": is_supported,
            "verdict": verdict
        })

    # An answer with no extractable claims scores 0.0 rather than dividing by zero
    faithfulnessScore = supported / len(claims) if claims else 0.0
    return {
        "faithfulness_score": faithfulnessScore,
        "total_claims": len(claims),
        "supported_claims": supported,
        "claim_details": claim_results
    }
```

### Aggregated Evaluation

```python
from statistics import mean


def evaluateFaithfulnessOnDataset(
    dataset: list[dict],
    verbose: bool = True
) -> dict:
    """
    dataset: list of {"question", "answer", "context"}
    """
    if not dataset:
        # mean(), min() and max() all raise on an empty sequence
        raise ValueError("dataset must contain at least one item")

    scores = []
    for i, item in enumerate(dataset):
        result = computeFaithfulness(item["answer"], item["context"])
        scores.append(result["faithfulness_score"])
        if verbose:
            status = "✅" if result["faithfulness_score"] == 1.0 else "⚠️"
            print(f"{status} Q{i+1}: faithfulness={result['faithfulness_score']:.2f} "
                  f"({result['supported_claims']}/{result['total_claims']} claims)")
    return {
        "mean_faithfulness": mean(scores),
        "min_faithfulness": min(scores),
        "max_faithfulness": max(scores),
        "scores": scores
    }
```

### Failure Modes

The claim decomposition step can merge multiple facts into one claim, making the verdict ambiguous (partially true, partially false). Refine the decomposition prompt so that it generates shorter, atomic claims.

Claims that the context only partially implies, without stating them explicitly, confuse the binary YES/NO classifier. In a production evaluation pipeline, add a "PARTIAL" category.
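One way the "PARTIAL" category could be handled is sketched below. It assumes the verification prompt has been changed to allow a third answer, and it covers only the parsing and scoring side, with no API calls. The names `parse_verdict` and `weighted_faithfulness` are hypothetical, and giving a partial claim half credit is an assumption to tune, not something RAGAS prescribes.

```python
import re


def parse_verdict(raw: str) -> str:
    """Normalize a judge reply to YES, NO, or PARTIAL.

    Only the first word counts, so a reply like "NO, the context says yes
    elsewhere" is not misread. Unparseable replies are treated as NO, a
    conservative default (assumption).
    """
    match = re.match(r"\W*(YES|NO|PARTIAL)\b", raw.strip().upper())
    return match.group(1) if match else "NO"


def weighted_faithfulness(verdicts: list[str], partial_weight: float = 0.5) -> float:
    """Average per-claim credit: YES = 1, NO = 0, PARTIAL = partial_weight.

    partial_weight = 0.5 is an illustrative choice, not a standard value.
    """
    if not verdicts:
        return 0.0
    credit = {"YES": 1.0, "PARTIAL": partial_weight, "NO": 0.0}
    return sum(credit[v] for v in verdicts) / len(verdicts)
```

Setting `partial_weight=0.0` reproduces the strict YES/NO score while still recording which claims were partial, which makes them easy to pull out for manual review.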

EXERCISE

Run faithfulness evaluation on 20 question-answer pairs from your production pipeline. Identify the top 3 lowest-scoring cases and manually analyze what claims are unsupported. (15 min)
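For the manual analysis step, a small helper can surface the worst cases. This is a minimal sketch that assumes you have collected the per-item result dicts, each with a `faithfulness_score` and a `claim_details` list of `{"claim", "supported"}` entries as returned by `computeFaithfulness`. The name `lowest_scoring_cases` is hypothetical.

```python
def lowest_scoring_cases(results: list[dict], k: int = 3) -> list[dict]:
    """Return the k lowest-scoring results with their unsupported claims.

    Ties keep their original dataset order because sorted() is stable.
    """
    ranked = sorted(enumerate(results), key=lambda pair: pair[1]["faithfulness_score"])
    report = []
    for index, result in ranked[:k]:
        unsupported = [
            detail["claim"]
            for detail in result["claim_details"]
            if not detail["supported"]
        ]
        report.append({
            "index": index,  # position of the item in the original dataset
            "faithfulness_score": result["faithfulness_score"],
            "unsupported_claims": unsupported,
        })
    return report
```

Each entry in the returned list points back to the dataset item by `index`, so the unsupported claims can be read side by side with the retrieved context.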
