11. Evaluating with LLMs
Using LLMs to evaluate RAG outputs allows for nuanced assessment beyond string matching. This approach is called LLM-as-Judge and scales evaluation to thousands of examples without human annotation.
Designing Evaluation Prompts
Effective evaluation prompts specify the evaluation criteria, provide explicit scoring rubrics, and include chain-of-thought reasoning to justify the score. Ambiguous prompts produce inconsistent ratings.
EVALUATION_PROMPT = """You are an expert evaluator assessing a RAG system's response quality.

## Question
{question}

## Retrieved Context
{context}

## Generated Answer
{answer}

## Evaluation Criteria
Score the answer on a scale of 1-5 for each dimension:

1. **Faithfulness** (1-5): Does the answer only make claims supported by the context?
   - 1: Multiple unsupported claims or fabrications
   - 3: Minor unsupported detail mixed with supported claims
   - 5: Every claim traceable to context

2. **Relevance** (1-5): Does the answer address the actual question asked?
   - 1: Answer addresses wrong topic entirely
   - 3: Partially addresses the question with gaps
   - 5: Directly and fully answers the question

3. **Conciseness** (1-5): Does the answer avoid unnecessary verbosity?
   - 1: Contains extensive off-topic material
   - 3: Mixes relevant and irrelevant content
   - 5: Contains only relevant information

## Output Format
Return a JSON object with your scores and reasoning:
{{
    "faithfulness": <integer 1-5>,
    "relevance": <integer 1-5>,
    "conciseness": <integer 1-5>,
    "reasoning": "<explain each score>"
}}
"""
Implementing LLM Evaluation
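The prompt above doubles the braces around its JSON example. With Python-style `{placeholder}` substitution, which LangChain's default template format also follows, doubled braces render as literal braces while single-braced names are filled in. A minimal sketch using plain `str.format` on a shortened stand-in template (`MINI_PROMPT` is a hypothetical name for illustration, not the full prompt):

```python
# Shortened stand-in for the evaluation prompt (hypothetical, illustration only).
MINI_PROMPT = """## Question
{question}
## Output Format
{{
"faithfulness": <integer 1-5>
}}
"""

# Single braces are substituted; doubled braces collapse to literal { and }.
rendered = MINI_PROMPT.format(question="What is RAG?")
print(rendered)
```

Forgetting to double the braces would make the formatter treat the JSON keys as missing placeholders and raise an error at render time.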
from concurrent.futures import ThreadPoolExecutor
from typing import List

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel


class EvaluationResult(BaseModel):
    """Scores and justification returned by the judge model."""
    faithfulness: int
    relevance: int
    conciseness: int
    reasoning: str


def evaluate_rag_response(
    question: str,
    context: List[str],
    answer: str,
    model: ChatOpenAI
) -> EvaluationResult:
    """Evaluate a single RAG response using an LLM judge."""
    prompt = ChatPromptTemplate.from_template(EVALUATION_PROMPT)
    # JsonOutputParser yields a plain dict parsed from the model's reply.
    parser = JsonOutputParser(pydantic_object=EvaluationResult)
    chain = prompt | model | parser
    result = chain.invoke({
        "question": question,
        # Separate retrieved chunks so the judge can tell them apart.
        "context": "\n\n---\n\n".join(context),
        "answer": answer
    })
    # Validate field names and types against the schema.
    return EvaluationResult(**result)


def batch_evaluate(
    dataset: List[dict],
    model: ChatOpenAI,
    max_concurrency: int = 5
) -> List[EvaluationResult]:
    """Evaluate multiple responses concurrently."""
    results = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        futures = [
            executor.submit(
                evaluate_rag_response,
                item["question"],
                item["context"],
                item["answer"],
                model
            )
            for item in dataset
        ]
        # Collect in submission order so results line up with the dataset.
        for future in futures:
            results.append(future.result())
    return results
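Judge models occasionally return scores outside the rubric's range, and the integer type annotations alone do not catch that. The following is a minimal standard-library sketch of a range check plus a per-dimension average over a batch. It assumes each result is available as a plain dict; `validate_scores` and `summarize` are hypothetical helper names, and the sample scores are made up.

```python
from statistics import mean

DIMENSIONS = ("faithfulness", "relevance", "conciseness")


def validate_scores(result: dict) -> dict:
    """Raise ValueError unless every dimension is an integer from 1 to 5."""
    for dim in DIMENSIONS:
        score = result.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim} must be an integer in 1-5, got {score!r}")
    return result


def summarize(results: list) -> dict:
    """Mean score per dimension over validated results."""
    valid = [validate_scores(r) for r in results]
    return {dim: mean(r[dim] for r in valid) for dim in DIMENSIONS}


# Made-up scores for illustration only.
sample = [
    {"faithfulness": 5, "relevance": 4, "conciseness": 3, "reasoning": "placeholder"},
    {"faithfulness": 3, "relevance": 4, "conciseness": 5, "reasoning": "placeholder"},
]
summary = summarize(sample)
print(summary)
```

Rejecting an out-of-range result outright, rather than clamping it, keeps malformed judge output from silently shifting the averages.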
Agreement and Calibration
LLM judges exhibit position bias: the order in which candidate answers are presented influences the verdict, so the same answer can be rated differently depending on whether it appears first or last. Using pairwise comparison instead of absolute scores often improves agreement. Cross-evaluation, where multiple LLMs score the same outputs, reveals the systematic biases of individual judges.
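One common way to surface position bias in pairwise comparison is to judge each pair twice with the order swapped and keep only verdicts that survive the swap. The sketch below assumes a hypothetical judge callable that takes `(question, first, second)` and returns `"first"` or `"second"`; the two toy judges stand in for real model calls.

```python
from typing import Callable

# Hypothetical judge interface: returns "first" or "second" for the preferred answer.
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge in both presentation orders; keep only order-consistent verdicts."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "inconsistent"  # the verdict flipped with the order


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge that is fully position-biased."""
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    """Toy judge whose verdict depends on content (length), not position."""
    return "first" if len(first) >= len(second) else "second"


biased = debiased_preference(always_first, "q", "short", "a longer answer")
stable = debiased_preference(prefers_longer, "q", "short", "a longer answer")
print(biased, stable)
```

The share of pairs that come back as `"inconsistent"` gives a rough measure of how strongly a given judge is affected by presentation order.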
Generate 100 evaluation pairs for pairwise comparison. Calculate the agreement rate between two different judge models and analyze which dimension shows the most disagreement.
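As a starting point for the agreement calculation, the sketch below computes the raw agreement rate per dimension between two judges and picks the dimension with the lowest rate. It assumes each judge's output is a list of dicts mapping a dimension to the preferred answer (`"A"` or `"B"`) for each pair; that layout and the function name `agreement_by_dimension` are assumptions, and the verdicts shown are made-up toy data.

```python
DIMENSIONS = ("faithfulness", "relevance", "conciseness")


def agreement_by_dimension(judge_1: list, judge_2: list) -> dict:
    """Fraction of pairs on which both judges prefer the same answer, per dimension."""
    if not judge_1 or len(judge_1) != len(judge_2):
        raise ValueError("both judges must rate the same non-empty set of pairs")
    rates = {}
    for dim in DIMENSIONS:
        matches = sum(v1[dim] == v2[dim] for v1, v2 in zip(judge_1, judge_2))
        rates[dim] = matches / len(judge_1)
    return rates


# Made-up verdicts for four pairs, for illustration only.
judge_1 = [
    {"faithfulness": "A", "relevance": "A", "conciseness": "B"},
    {"faithfulness": "B", "relevance": "A", "conciseness": "A"},
    {"faithfulness": "A", "relevance": "B", "conciseness": "A"},
    {"faithfulness": "A", "relevance": "B", "conciseness": "B"},
]
judge_2 = [
    {"faithfulness": "A", "relevance": "A", "conciseness": "A"},
    {"faithfulness": "B", "relevance": "A", "conciseness": "B"},
    {"faithfulness": "A", "relevance": "A", "conciseness": "A"},
    {"faithfulness": "A", "relevance": "B", "conciseness": "B"},
]

rates = agreement_by_dimension(judge_1, judge_2)
most_disputed = min(rates, key=rates.get)
print(rates, most_disputed)
```

Raw agreement does not correct for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa is a natural next step once the basic rates are in place.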