Advanced RAG — Chunking, Retrieval, Re-ranking

21. RAGAS Faithfulness

Chapter 21 of 24 · 20 min

KEY INSIGHT

Faithfulness measures whether the generated answer stays within the bounds of what the retrieved context actually supports.

### Definition

An answer is faithful if every claim in it can be attributed to the retrieved context: it introduces no facts that are absent from the context or contradicted by it. The score is the fraction of claims in the answer that the context supports. A faithfulness score of 1.0 means every claim in the answer maps to a supporting passage in the context. A score of 0.5 means half the claims are unsupported.

### Why It Matters

High retrieval relevance scores do not guarantee answer faithfulness. A retrieval system can return relevant chunks, and the LLM can still confabulate facts that are related to the context but absent from it. Faithfulness validation catches this failure mode.

### RAGAS Faithfulness Implementation

```python
from openai import OpenAI

client = OpenAI()


def computeFaithfulness(
    answer: str,
    context: str,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    RAGAS-style faithfulness: break answer into claims, check each against context.
    """
    # Step 1: Decompose answer into atomic claims
    claims_prompt = (
        "Break the following answer into independent factual claims. "
        "Each claim should be a single verifiable statement. "
        "List one per line.\n\nAnswer:\n{answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You extract factual claims accurately."},
            {"role": "user", "content": claims_prompt.format(answer=answer)}
        ],
        temperature=0.0,
        max_tokens=512
    )
    # message.content can be None, so fall back to an empty string
    raw_claims = response.choices[0].message.content or ""
    claims = [c.strip() for c in raw_claims.split("\n") if c.strip()]
    print(f"[FAITHFULNESS] Found {len(claims)} claims")

    # Step 2: Verify each claim against context
    supported = 0
    claim_results = []
    for claim in claims:
        verification_prompt = (
            "Given the context below, determine if the claim is supported "
            "by the context. Answer YES if the claim is entailed by the context. "
            "Answer NO if the claim contradicts the context or is not supported.\n\n"
            "Context:\n{context}\n\nClaim:\n{claim}"
        )
        verification = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer only YES or NO."},
                {"role": "user", "content": verification_prompt.format(context=context, claim=claim)}
            ],
            temperature=0.0,
            max_tokens=32
        )
        verdict = (verification.choices[0].message.content or "").strip().upper()
        # Check the start of the reply so a "NO" that mentions "YES" later is not miscounted
        is_supported = verdict.startswith("YES")
        supported += 1 if is_supported else 0
        claim_results.append({
            "claim": claim,
            "supported": is_supported,
            "verdict": verdict
        })

    # An answer with no extractable claims scores 0.0
    faithfulnessScore = supported / len(claims) if claims else 0.0
    return {
        "faithfulness_score": faithfulnessScore,
        "total_claims": len(claims),
        "supported_claims": supported,
        "claim_details": claim_results
    }
```

### Aggregated Evaluation

```python
from statistics import mean


def evaluateFaithfulnessOnDataset(
    dataset: list[dict],
    verbose: bool = True
) -> dict:
    """
    dataset: list of {"question", "answer", "context"}
    """
    # mean(), min(), and max() all raise on an empty sequence
    if not dataset:
        raise ValueError("dataset must contain at least one item")

    scores = []
    for i, item in enumerate(dataset):
        result = computeFaithfulness(item["answer"], item["context"])
        scores.append(result["faithfulness_score"])
        if verbose:
            status = "✅" if result["faithfulness_score"] == 1.0 else "⚠️"
            print(f"{status} Q{i+1}: faithfulness={result['faithfulness_score']:.2f} "
                  f"({result['supported_claims']}/{result['total_claims']} claims)")

    return {
        "mean_faithfulness": mean(scores),
        "min_faithfulness": min(scores),
        "max_faithfulness": max(scores),
        "scores": scores
    }
```

### Failure Modes

The claim decomposition step can merge multiple facts into one claim, which makes the verdict ambiguous when the claim is partially true and partially false. Refine the decomposition prompt so that it produces shorter, atomic claims. Claims that the context implies only in part, without stating them explicitly, confuse the YES/NO classifier; add a "PARTIAL" category in a production evaluation pipeline.

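The "PARTIAL" category mentioned under Failure Modes could be handled along the lines of the sketch below. It is a minimal illustration, not part of RAGAS, which uses a binary verdict. The names `parse_verdict` and `score_verdicts`, and the `partial_weight` of 0.5, are assumptions made for this example; the verifier prompt would also need to be changed so that the model is allowed to answer PARTIAL.

```python
from collections import Counter

# Hypothetical label set for this sketch; RAGAS itself uses a binary verdict.
VALID_VERDICTS = ("YES", "NO", "PARTIAL")


def parse_verdict(raw: str) -> str:
    """Map raw model output to YES, NO, or PARTIAL.

    Anything unrecognized is treated as NO, so a malformed reply
    can never raise the score.
    """
    words = raw.strip().upper().split()
    first = words[0].strip(".,:;!") if words else ""
    return first if first in VALID_VERDICTS else "NO"


def score_verdicts(verdicts: list[str], partial_weight: float = 0.5) -> dict:
    """Aggregate three-way verdicts into a single score.

    partial_weight is an assumed tuning knob: 0.0 treats PARTIAL as
    unsupported, 1.0 treats it as fully supported.
    """
    counts = Counter(verdicts)
    total = len(verdicts)
    if total == 0:
        return {"score": 0.0, "counts": dict(counts)}
    credit = counts["YES"] + partial_weight * counts["PARTIAL"]
    return {"score": credit / total, "counts": dict(counts)}


# Illustrative replies, as a verifier model might return them.
raw_replies = ["YES", "partial.", "No", "Maybe"]
verdicts = [parse_verdict(r) for r in raw_replies]
result = score_verdicts(verdicts)
print(verdicts, result["score"])
```

Reporting the PARTIAL count alongside the score keeps ambiguous claims visible for manual review instead of folding them silently into a single number.
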
EXERCISE

Run faithfulness evaluation on 20 question-answer pairs from your production pipeline. Identify the top 3 lowest-scoring cases and manually analyze what claims are unsupported. (15 min)
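
For the triage step of the exercise, a small helper like the one below could surface the lowest-scoring cases together with their unsupported claims. It is a sketch under assumptions: `lowest_scoring_cases` is a hypothetical name, each entry is assumed to have the shape returned by `computeFaithfulness` plus a caller-supplied `"question"` key, and the sample data is made up purely to show that shape.

```python
def lowest_scoring_cases(results: list[dict], k: int = 3) -> list[dict]:
    """Return the k lowest-scoring results with their unsupported claims.

    Assumes each entry looks like the dict returned by computeFaithfulness,
    plus a caller-supplied "question" key (hypothetical).
    """
    ranked = sorted(results, key=lambda r: r["faithfulness_score"])
    worst = []
    for r in ranked[:k]:
        unsupported = [d["claim"] for d in r["claim_details"] if not d["supported"]]
        worst.append({
            "question": r.get("question", ""),
            "faithfulness_score": r["faithfulness_score"],
            "unsupported_claims": unsupported,
        })
    return worst


# Made-up placeholder results, only to show the expected shape.
sample_results = [
    {"question": "Q1", "faithfulness_score": 1.0,
     "claim_details": [{"claim": "A", "supported": True}]},
    {"question": "Q2", "faithfulness_score": 0.5,
     "claim_details": [{"claim": "B", "supported": True},
                       {"claim": "C", "supported": False}]},
    {"question": "Q3", "faithfulness_score": 0.0,
     "claim_details": [{"claim": "D", "supported": False}]},
]

for case in lowest_scoring_cases(sample_results, k=2):
    print(case["question"], case["faithfulness_score"], case["unsupported_claims"])
```

The manual analysis still matters: the helper only lists which claims the verifier rejected, not whether the verifier was right to reject them.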