KEY INSIGHT
Faithfulness measures whether the generated answer stays within the bounds of what the retrieved context actually supports.
### Definition
An answer is faithful if every claim it makes can be attributed to the retrieved context: it introduces no facts that are absent from the context and none that the context contradicts. The faithfulness score is the fraction of the answer's claims that the context supports. A score of 1.0 means every claim maps to a supporting passage in the context; a score of 0.5 means half the claims are unsupported.
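As a small worked example of the scoring rule in this definition, the sketch below computes the score from per-claim verdicts. The function name `faithfulness_from_verdicts` and the list-of-booleans input are illustrative assumptions, not part of any library.

```python
def faithfulness_from_verdicts(verdicts: list[bool]) -> float:
    """Return the fraction of claims marked as supported by the context."""
    if not verdicts:
        # Assumption: an answer with no extractable claims scores 0.0.
        return 0.0
    # True counts as 1 and False as 0, so sum() is the number of supported claims.
    return sum(verdicts) / len(verdicts)


# Four claims, two of them supported by the context.
print(faithfulness_from_verdicts([True, True, False, False]))  # 0.5
# Three claims, all supported.
print(faithfulness_from_verdicts([True, True, True]))  # 1.0
```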
### Why It Matters
High retrieval relevance scores do not guarantee answer faithfulness. A retrieval system can return relevant chunks, but the LLM can still confabulate related-but-absent facts. Faithfulness validation catches this failure mode.
### RAGAS Faithfulness Implementation
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def computeFaithfulness(
    answer: str,
    context: str,
    model: str = "gpt-4o-mini"
) -> dict:
    """
    RAGAS-style faithfulness: break answer into claims, check each against context.
    """
    # Step 1: Decompose answer into atomic claims
    claims_prompt = (
        "Break the following answer into independent factual claims. "
        "Each claim should be a single verifiable statement. "
        "List one per line.\n\nAnswer:\n{answer}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You extract factual claims accurately."},
            {"role": "user", "content": claims_prompt.format(answer=answer)}
        ],
        temperature=0.0,
        max_tokens=512
    )
    # content can be None, so fall back to an empty string before splitting
    raw_claims = response.choices[0].message.content or ""
    claims = [c.strip() for c in raw_claims.split("\n") if c.strip()]
    print(f"[FAITHFULNESS] Found {len(claims)} claims")

    # Step 2: Verify each claim against context
    supported = 0
    claim_results = []
    for claim in claims:
        verification_prompt = (
            "Given the context below, determine if the claim is supported "
            "by the context. Answer YES if the claim is entailed by the context. "
            "Answer NO if the claim contradicts the context or is not supported.\n\n"
            "Context:\n{context}\n\nClaim:\n{claim}"
        )
        verification = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Answer only YES or NO."},
                {"role": "user", "content": verification_prompt.format(context=context, claim=claim)}
            ],
            temperature=0.0,
            max_tokens=32
        )
        verdict = (verification.choices[0].message.content or "").strip().upper()
        # Match the leading token so a reply like "NO, although YES for ..." is not counted as supported
        is_supported = verdict.startswith("YES")
        supported += int(is_supported)
        claim_results.append({
            "claim": claim,
            "supported": is_supported,
            "verdict": verdict
        })

    # An answer with no extractable claims scores 0.0
    faithfulness_score = supported / len(claims) if claims else 0.0
    return {
        "faithfulness_score": faithfulness_score,
        "total_claims": len(claims),
        "supported_claims": supported,
        "claim_details": claim_results
    }
```
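The decomposition step above splits the model output on newlines, but models sometimes return a numbered or bulleted list even when asked for one claim per line. The sketch below shows one way to strip those markers before verification. The helper name `parse_claims` and the regular expression are assumptions made for illustration; they are not part of RAGAS or the OpenAI SDK.

```python
import re

# Matches a leading bullet ("-", "*", "•") or a list number such as "1." or "2)",
# followed by whitespace. Requiring the whitespace keeps claims that begin with
# "3.5 million" or "-5 degrees" intact.
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_claims(raw: str) -> list[str]:
    """Split raw model output into claims, dropping list markers and blank lines."""
    claims = []
    for line in raw.splitlines():
        cleaned = _LIST_PREFIX.sub("", line).strip()
        if cleaned:
            claims.append(cleaned)
    return claims


raw_output = "1. Paris is the capital of France.\n- The Seine flows through Paris.\n\n"
print(parse_claims(raw_output))
# ['Paris is the capital of France.', 'The Seine flows through Paris.']
```

A helper like this could replace the list comprehension that builds `claims`, leaving the rest of the function unchanged.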
### Aggregated Evaluation
```python
from statistics import mean


def evaluateFaithfulnessOnDataset(
    dataset: list[dict],
    verbose: bool = True
) -> dict:
    """
    Score every item with computeFaithfulness (defined above) and aggregate.

    dataset: list of {"question", "answer", "context"}
    """
    # mean(), min() and max() all raise on an empty sequence, so fail early
    if not dataset:
        raise ValueError("dataset must contain at least one item")

    scores = []
    for i, item in enumerate(dataset):
        result = computeFaithfulness(item["answer"], item["context"])
        scores.append(result["faithfulness_score"])
        if verbose:
            # Only a perfect score gets the check mark
            status = "✅" if result["faithfulness_score"] == 1.0 else "⚠️"
            print(f"{status} Q{i+1}: faithfulness={result['faithfulness_score']:.2f} "
                  f"({result['supported_claims']}/{result['total_claims']} claims)")

    return {
        "mean_faithfulness": mean(scores),
        "min_faithfulness": min(scores),
        "max_faithfulness": max(scores),
        "scores": scores
    }
```
### Failure Modes
The claim decomposition step can merge several facts into one claim, which makes the verdict ambiguous when part of the claim is supported and part is not. Tighten the decomposition prompt so that it produces shorter, atomic claims. Claims that the context partially implies but does not state explicitly also confuse a binary YES/NO classifier; in a production evaluation pipeline, add a "PARTIAL" category for these cases.
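One way to handle a "PARTIAL" category is sketched below: parse the verifier's reply into three verdicts and give partial support fractional credit. The `Verdict` enum, the half-credit weight, and both function names are hypothetical choices for illustration. The verification prompt would also need to be changed to offer PARTIAL as an allowed answer.

```python
from enum import Enum


class Verdict(Enum):
    YES = "YES"
    PARTIAL = "PARTIAL"
    NO = "NO"


# Hypothetical weights: a partially supported claim earns half credit.
VERDICT_WEIGHTS = {Verdict.YES: 1.0, Verdict.PARTIAL: 0.5, Verdict.NO: 0.0}


def parse_verdict(raw: str) -> Verdict:
    """Map a raw model reply to a verdict; unrecognized replies count as NO."""
    text = raw.strip().upper()
    for verdict in (Verdict.PARTIAL, Verdict.YES, Verdict.NO):
        if text.startswith(verdict.value):
            return verdict
    # Conservative default: anything that is not clearly YES or PARTIAL is unsupported.
    return Verdict.NO


def weighted_faithfulness(raw_verdicts: list[str]) -> float:
    """Average the per-claim weights; an empty list scores 0.0."""
    verdicts = [parse_verdict(r) for r in raw_verdicts]
    if not verdicts:
        return 0.0
    return sum(VERDICT_WEIGHTS[v] for v in verdicts) / len(verdicts)


# YES = 1.0, "partial." = 0.5, "No" = 0.0, "maybe" falls back to NO = 0.0
print(weighted_faithfulness(["YES", "partial.", "No", "maybe"]))  # 0.375
```

Treating unrecognized replies as NO keeps the score from being inflated by malformed verifier output, at the cost of occasionally under-scoring a faithful answer.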