18. LangChain Evaluation

Chapter 18 of 18 · 20 min

Evaluation quantifies chain quality. LangChain's evaluation framework supports built-in string matchers, LLM-as-judge graders, and custom evaluators. Without evaluation you cannot iterate confidently; you are guessing whether a change helps.

Built-in evaluators compare responses against references.

from langchain.evaluation import load_evaluator

# String equality: scores 1 only when prediction and reference are identical
exact_evaluator = load_evaluator("exact_match")
result = exact_evaluator.evaluate_strings(
    prediction="The capital is Paris",
    reference="Paris"
)
print(result)  # {'score': 0} because the two strings are not identical
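
Strict equality is brittle: a prediction that contains the right answer still scores 0 if any other character differs. As a rough illustration of what a looser comparison involves, the sketch below implements exact match in plain Python with optional case and punctuation normalization. The function name `normalized_exact_match` and its flags are hypothetical and are not LangChain's implementation.

```python
import string


def normalized_exact_match(prediction: str, reference: str,
                           ignore_case: bool = False,
                           ignore_punctuation: bool = False) -> int:
    """Return 1 if the strings are equal after optional normalization, else 0."""

    def normalize(text: str) -> str:
        text = text.strip()
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            # Drop ASCII punctuation characters such as "." and ","
            text = text.translate(str.maketrans("", "", string.punctuation))
        return text

    return int(normalize(prediction) == normalize(reference))


# Extra words still fail: normalization does not extract the answer
print(normalized_exact_match("The capital is Paris", "Paris"))  # 0
# Case and trailing punctuation are forgiven when the flags are set
print(normalized_exact_match("paris.", "Paris",
                             ignore_case=True, ignore_punctuation=True))  # 1
```

Normalization only forgives surface differences. When the answer is embedded in a longer sentence, a semantic or LLM-based evaluator is the better fit.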

Semantic similarity uses embeddings.

from langchain.evaluation import EmbeddingDistance, load_evaluator

# Uses the evaluator's default embedding model unless you pass embeddings=...
embedding_evaluator = load_evaluator(
    "embedding_distance", distance_metric=EmbeddingDistance.COSINE
)
result = embedding_evaluator.evaluate_strings(
    prediction="Paris is the capital of France",
    reference="Paris is France's capital city"
)
print(f"Cosine distance: {result['score']}")  # Lower distance means more similar
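
To see why a lower score is better, it helps to compute the metric by hand. The sketch below assumes cosine distance is defined as 1 minus cosine similarity and uses toy vectors in place of real embeddings; the helper name `cosine_distance` is illustrative.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

print(f"same direction: {abs(same_direction):.3f}")  # close to 0: very similar
print(f"orthogonal:     {orthogonal:.3f}")           # 1: unrelated
```

Vectors pointing the same way give a distance near 0 regardless of their length, which is why the metric suits comparing texts of different sizes.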

For complex answers, use LLM grading.

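from langchain.evaluation import load_evaluator

# "cot_qa" asks the grading LLM to reason step by step before its verdict.
# A configured model is required; pass llm=... to choose which one grades.
qa_evaluator = load_evaluator("cot_qa")
result = qa_evaluator.evaluate_strings(
    prediction="Paris is the capital.",
    input="What is the capital of France?",
    reference="Paris"
)
print(result)  # Typically includes the grader's reasoning and a score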
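
The LLM-as-judge pattern can be sketched without any model at all: build a grading prompt, send it to a judge, and parse the verdict into a score. Everything below is a hypothetical illustration, including the prompt wording, `grade_with_judge`, and the canned `fake_judge` that stands in for a real model call. It is not how LangChain implements its graders.

```python
from typing import Callable, Optional

# Hypothetical grading prompt; real graders use their own wording
GRADER_PROMPT = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Student answer: {prediction}\n"
    "Explain your reasoning, then end with a line reading CORRECT or INCORRECT."
)


def grade_with_judge(judge: Callable[[str], str], question: str,
                     reference: str, prediction: str) -> dict:
    """Ask the judge for a verdict and map it to 1, 0, or None."""
    prompt = GRADER_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    )
    reply = judge(prompt).strip()
    verdict = reply.splitlines()[-1].strip().upper() if reply else ""
    # None signals a reply that could not be parsed into a verdict
    score: Optional[int] = {"CORRECT": 1, "INCORRECT": 0}.get(verdict)
    return {"reasoning": reply, "value": verdict, "score": score}


def fake_judge(prompt: str) -> str:
    # Stand-in for a model call: always returns the same canned reply
    return "The student answer names Paris, which matches the reference.\nCORRECT"


result = grade_with_judge(
    fake_judge,
    question="What is the capital of France?",
    reference="Paris",
    prediction="Paris is the capital.",
)
print(result["score"])  # 1
```

Returning `None` for an unparseable reply keeps grader failures visible instead of silently counting them as wrong answers.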

Build a custom evaluator for domain-specific criteria.

from typing import Any, Optional

from langchain.evaluation import StringEvaluator

class LengthValidator(StringEvaluator):
    """Passes when the prediction is between 50 and 200 characters long."""

    @property
    def evaluation_name(self) -> str:
        return "length_check"

    def _evaluate_strings(self, *, prediction: str,
                          reference: Optional[str] = None,
                          input: Optional[str] = None,
                          **kwargs: Any) -> dict:
        length = len(prediction)
        passed = 50 <= length <= 200
        return {"score": int(passed),
                "reasoning": f"Length {length} is {'valid' if passed else 'invalid'}"}

validator = LengthValidator()
result = validator.evaluate_strings(prediction="Short")
print(result)  # Score 0: 5 characters is below the 50-character minimum

Create a benchmark dataset and run batch evaluation.

from langchain.chains import RetrievalQA
from langchain.evaluation import load_evaluator

test_cases = [
    {"input": "What is 2+2?", "reference": "4"},
    {"input": "Capital of Japan?", "reference": "Tokyo"},
    {"input": "Color of sky?", "reference": "Blue"},
]

# llm and retriever are assumed to be defined earlier
# (a chat model and a vector-store retriever)
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Load the grader once, outside the loop
evaluator = load_evaluator("qa")

scores = []
for case in test_cases:
    result = qa_chain.invoke({"query": case["input"]})
    score = evaluator.evaluate_strings(
        prediction=result["result"],
        reference=case["reference"],
        input=case["input"]
    )
    # Count a missing or unparseable verdict as a failure
    scores.append(score.get("score") or 0)

print(f"Average score: {sum(scores)/len(scores):.2%}")
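
An average alone hides which questions failed. The sketch below shows one way to keep the failing cases next to the aggregate. The `summarize` helper and the `graded` rows are made up for illustration and are not output from a real run.

```python
def summarize(results):
    """Return the pass rate and the cases that did not score 1.

    Each item is a dict with 'input', 'prediction', 'reference', and
    'score' (1, 0, or None when the grader gave no usable verdict).
    """
    if not results:
        raise ValueError("no results to summarize")
    failures = [r for r in results if r["score"] != 1]
    passed = len(results) - len(failures)
    return {"average": passed / len(results), "failures": failures}


# Illustrative rows only; replace with the graded output of your own chain
graded = [
    {"input": "What is 2+2?", "prediction": "4", "reference": "4", "score": 1},
    {"input": "Capital of Japan?", "prediction": "Kyoto", "reference": "Tokyo", "score": 0},
    {"input": "Color of sky?", "prediction": "Blue", "reference": "Blue", "score": 1},
]

report = summarize(graded)
print(f"Average score: {report['average']:.2%}")
for failure in report["failures"]:
    print(f"FAILED: {failure['input']!r} "
          f"expected {failure['reference']!r}, got {failure['prediction']!r}")
```

Keeping the prediction and reference with each failure makes it easier to tell a retrieval miss from a grading mistake.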

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
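
One way to capture those details is a short script that writes them out next to your results. This is a minimal sketch using only the standard library; the helper name `environment_record` and the data path shown are placeholders to adapt.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages, data_path):
    """Collect interpreter, platform, and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "data_path": data_path,
    }


# "data/test_cases.json" is a placeholder path, not a file this chapter ships
record = environment_record(["langchain"], data_path="data/test_cases.json")
print(json.dumps(record, indent=2))
```

Paste the printed record beside the observed output so a later rerun can be compared against the same versions.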

EXERCISE

Create a RAG pipeline, write 5 test questions with known answers, run evaluation on all 5, and report the average score with specific failure cases.