Evaluating with LLMs — RAG Evaluation and Metrics (Chapter 11)

Using an LLM to evaluate RAG outputs allows nuanced assessment that string matching cannot provide. This approach, called LLM-as-Judge, scales evaluation to thousands of examples without requiring a human to annotate each one.

Designing Evaluation Prompts

Effective evaluation prompts state the evaluation criteria, give an explicit scoring rubric for each criterion, and ask the judge for reasoning that justifies each score. Ambiguous prompts produce inconsistent ratings.

EVALUATION_PROMPT = """You are an expert evaluator assessing a RAG system's response quality.

## Question
{question}

## Retrieved Context
{context}

## Generated Answer
{answer}

## Evaluation Criteria
Score the answer on a scale of 1-5 for each dimension:

1. **Faithfulness** (1-5): Does the answer only make claims supported by the context?
   - 1: Multiple unsupported claims or fabrications
   - 3: Minor unsupported detail mixed with supported claims
   - 5: Every claim traceable to context

2. **Relevance** (1-5): Does the answer address the actual question asked?
   - 1: Answer addresses wrong topic entirely
   - 3: Partially addresses the question with gaps
   - 5: Directly and fully answers the question

3. **Conciseness** (1-5): Does the answer avoid unnecessary verbosity?
   - 1: Contains extensive off-topic material
   - 3: Mixes relevant and irrelevant content
   - 5: Contains only relevant information

## Output Format
Return a JSON object with your scores and reasoning:
{{
    "faithfulness": <integer 1-5>,
    "relevance": <integer 1-5>,
    "conciseness": <integer 1-5>,
    "reasoning": "<explain each score>"
}}
"""
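
The doubled braces in the Output Format section are escapes: under Python's format-string rules, which ChatPromptTemplate.from_template follows by default, `{{` and `}}` render as literal braces while `{question}`, `{context}`, and `{answer}` are substituted. The sketch below shows this with plain `str.format`, with no LLM call. `TEMPLATE` is a shortened stand-in for EVALUATION_PROMPT and `render_prompt` is a hypothetical helper, both introduced only for illustration.

```python
from typing import List

# Shortened stand-in for EVALUATION_PROMPT (assumption: same placeholder names).
TEMPLATE = """## Question
{question}

## Retrieved Context
{context}

## Generated Answer
{answer}

## Output Format
{{
    "faithfulness": <integer 1-5>
}}
"""


def render_prompt(template: str, question: str, context: List[str], answer: str) -> str:
    """Fill the placeholders; context chunks are joined with a visible separator."""
    return template.format(
        question=question,
        context="\n\n---\n\n".join(context),
        answer=answer,
    )


rendered = render_prompt(
    TEMPLATE,
    question="What is the capital of France?",
    context=["Paris is the capital of France.", "France is in Europe."],
    answer="Paris.",
)
print(rendered)  # the Output Format section now shows single braces
```

Printing the rendered prompt for a few dataset items before running a full evaluation is a cheap way to catch a missing placeholder or an unescaped brace.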

Implementing LLM Evaluation

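from typing import List

# JsonOutputParser and ChatPromptTemplate are provided by langchain_core.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class EvaluationResult(BaseModel):
    """Scores returned by the LLM judge; each score is an integer from 1 to 5."""
    faithfulness: int = Field(ge=1, le=5)
    relevance: int = Field(ge=1, le=5)
    conciseness: int = Field(ge=1, le=5)
    reasoning: str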

def evaluate_rag_response(
    question: str,
    context: List[str],
    answer: str,
    model: ChatOpenAI
) -> EvaluationResult:
    """Evaluate a single RAG response using an LLM judge."""
    
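    # from_template uses f-string-style formatting by default, so the doubled
    # braces in EVALUATION_PROMPT render as literal JSON braces.
    prompt = ChatPromptTemplate.from_template(EVALUATION_PROMPT)
    parser = JsonOutputParser(pydantic_object=EvaluationResult)

    # Render the prompt, call the judge model, then parse its reply as JSON.
    chain = prompt | model | parser

    result = chain.invoke({
        "question": question,
        # Join chunks with a visible separator so the judge can tell them apart.
        "context": "\n\n---\n\n".join(context),
        "answer": answer
    })

    # JsonOutputParser returns a plain dict; constructing the model validates it.
    return EvaluationResult(**result)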

def batch_evaluate(
    dataset: List[dict],
    model: ChatOpenAI,
    max_concurrency: int = 5
) -> List[EvaluationResult]:
    """Evaluate multiple responses concurrently, preserving dataset order."""
    from concurrent.futures import ThreadPoolExecutor

    results = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as executor:
        # Submit every item up front; the pool caps the number of in-flight calls.
        futures = [
            executor.submit(
                evaluate_rag_response,
                item["question"],
                item["context"],
                item["answer"],
                model
            )
            for item in dataset
        ]

        # Collect in submission order so results[i] corresponds to dataset[i].
        # future.result() re-raises any exception raised in the worker thread.
        for future in futures:
            results.append(future.result())

    return results
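
Once batch_evaluate has returned, the per-item scores usually need to be rolled up into dataset-level numbers. The following is a minimal sketch of one way to do that, not a prescribed method: `Scores` is a hypothetical stand-in for EvaluationResult (same three score fields, no pydantic dependency), and the `fail_below` threshold of 3 is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Scores:
    """Stand-in for EvaluationResult with the same three score fields."""
    faithfulness: int
    relevance: int
    conciseness: int


DIMENSIONS = ("faithfulness", "relevance", "conciseness")


def summarize(results: List[Scores], fail_below: int = 3) -> Dict[str, Dict[str, float]]:
    """Per-dimension mean score and share of items scoring below `fail_below`."""
    if not results:
        raise ValueError("results must not be empty")
    summary = {}
    for dim in DIMENSIONS:
        values = [getattr(r, dim) for r in results]
        summary[dim] = {
            "mean": mean(values),
            # Fraction of items whose score falls under the (assumed) threshold.
            "fail_rate": sum(v < fail_below for v in values) / len(values),
        }
    return summary


summary = summarize([Scores(5, 4, 3), Scores(1, 4, 5)])
```

Reporting a failure rate next to the mean is useful because a mean of 3 can hide a split between perfect and fabricated answers, as in the two-item example here.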

Agreement and Calibration

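LLM judges exhibit position bias: when a prompt presents several candidate answers, the order in which they appear (for example, whether the ideal answer comes first or last) influences the scores. Pairwise comparison often improves agreement over absolute scoring, particularly when each pair is judged in both orders so that order effects cancel. Cross-model evaluation, in which several different LLMs score the same outputs, reveals systematic biases such as one judge scoring consistently higher than the others.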
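
Two of these ideas can be sketched without calling any model. The first function judges a pair in both presentation orders and reports a winner only when the two verdicts agree; the second measures how far each judge's scores sit from the panel average on the same items. The `PairwiseJudge` signature, the "first"/"second" return convention, and both function names are assumptions made for this illustration; in practice the judge callable would wrap an LLM call.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical judge signature: (question, first_answer, second_answer)
# -> "first" or "second", naming the answer it prefers by position.
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(question: str, answer_a: str, answer_b: str,
                        judge: PairwiseJudge) -> str:
    """Judge the pair in both orders; return "A", "B", or "tie"."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first

    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"

    if a_wins_forward and a_wins_backward:
        return "A"
    if not a_wins_forward and not a_wins_backward:
        return "B"
    return "tie"  # the verdict flipped with the order, so it is not trusted


def judge_offsets(scores: Dict[str, List[int]]) -> Dict[str, float]:
    """Mean deviation of each judge from the per-item panel mean.

    `scores` maps a judge name to its scores for the same items, in the same order.
    """
    n_items = len(next(iter(scores.values())))
    if any(len(s) != n_items for s in scores.values()):
        raise ValueError("every judge must score the same items")
    panel_means = [mean(s[i] for s in scores.values()) for i in range(n_items)]
    return {
        name: mean(s[i] - panel_means[i] for i in range(n_items))
        for name, s in scores.items()
    }
```

A judge that always prefers whichever answer is shown first produces a "tie" under this scheme, which is the intended behaviour: its preference carries no information about the answers. A consistently positive or negative offset from `judge_offsets` suggests a lenient or strict judge whose scores may need recalibrating before they are pooled with the others.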