18. LangChain Evaluation
Evaluation quantifies chain quality. LangChain's evaluation framework supports built-in string matchers, LLM-as-judge patterns, and custom evaluators. Without evaluation, you cannot iterate confidently; you are only guessing whether a change helps.
Built-in evaluators compare responses against references.
from langchain.evaluation import load_evaluator

# String equality: the score is 1 only when prediction and reference are identical
exact_evaluator = load_evaluator("exact_match")
result = exact_evaluator.evaluate_strings(
    prediction="The capital is Paris",
    reference="Paris",
)
print(result)  # score is 0 here because the strings differ; identical strings score 1
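Exact matching is strict: casing, punctuation, or extra words all produce a score of 0. The sketch below shows one way to normalize both strings before comparing them. It is plain Python, not LangChain's implementation, and the helper name `normalized_exact_match` is hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalized_exact_match(prediction, reference):
    """Return 1 if the strings match after lowercasing, stripping
    punctuation, and collapsing whitespace; otherwise return 0."""

    def normalize(text):
        return " ".join(text.lower().translate(_PUNCTUATION).split())

    return int(normalize(prediction) == normalize(reference))


lenient = normalized_exact_match("Paris.", "paris")
strict = normalized_exact_match("The capital is Paris", "Paris")
print(lenient, strict)
```

Normalization only forgives surface differences. A prediction that wraps the answer in a full sentence still fails, which is the case embedding-based and LLM-based evaluators are meant to handle.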
Semantic similarity uses embeddings.
from langchain.evaluation import load_evaluator

# Needs an embedding model; unless you pass embeddings=..., LangChain typically
# falls back to OpenAI embeddings, which require an API key.
embedding_evaluator = load_evaluator("embedding_distance", distance_metric="cosine")
result = embedding_evaluator.evaluate_strings(
    prediction="Paris is the capital of France",
    reference="Paris is France's capital city",
)
print(f"Cosine distance: {result['score']}")  # Lower means more similar
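To make the "lower is better" reading concrete, here is a minimal sketch of cosine distance on plain Python lists. It assumes the evaluator reports 1 minus cosine similarity, which is the usual definition; the function name and the toy vectors are illustrative and are not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing in the same direction: distance close to 0.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors: distance close to 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

Because the measure depends only on direction, not magnitude, two texts with similar meaning can score close to 0 even when their lengths differ.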
For complex answers, use LLM grading.
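from langchain.evaluation import load_evaluator

# "cot_qa" asks a grading LLM to reason step by step before giving a verdict.
# Pass llm=... to choose the grader; otherwise LangChain uses its default chat model.
qa_evaluator = load_evaluator("cot_qa")
result = qa_evaluator.evaluate_strings(
    prediction="Paris is the capital.",
    input="What is the capital of France?",
    reference="Paris",
)
print(result)  # Typically includes the grader's reasoning, a verdict, and a score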
Build a custom evaluator for domain-specific criteria.
from langchain.evaluation import StringEvaluator


class LengthValidator(StringEvaluator):
    """Pass when the prediction is between 50 and 200 characters long."""

    criteria = {"length": "Output must be between 50 and 200 characters"}

    @property
    def evaluation_name(self):
        return "length_check"

    def _evaluate_strings(self, prediction, reference=None, **kwargs):
        length = len(prediction)
        passed = 50 <= length <= 200
        return {
            "score": int(passed),
            "reasoning": f"Length {length} is {'valid' if passed else 'invalid'}",
        }


validator = LengthValidator()
result = validator.evaluate_strings(prediction="Short")
print(result)  # Score 0: "Short" has 5 characters, below the 50-character minimum
Create a benchmark dataset and run batch evaluation.
from langchain.chains import RetrievalQA
from langchain.evaluation import load_evaluator

test_cases = [
    {"input": "What is 2+2?", "reference": "4"},
    {"input": "Capital of Japan?", "reference": "Tokyo"},
    {"input": "Color of sky?", "reference": "Blue"},
]

# `llm` and `retriever` are assumed to be defined already (not shown here).
qa_chain = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)

# Load the evaluator once and reuse it for every test case.
evaluator = load_evaluator("qa")

scores = []
for case in test_cases:
    result = qa_chain.invoke({"query": case["input"]})
    score = evaluator.evaluate_strings(
        prediction=result["result"],
        reference=case["reference"],
        input=case["input"],
    )
    scores.append(score["score"])

print(f"Average score: {sum(scores)/len(scores):.2%}")
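An average alone hides which questions failed. The sketch below aggregates per-case results and keeps the failures for inspection. It is plain Python with a hypothetical `summarize` helper, and the predictions and scores in `sample_results` are made-up placeholders rather than real model output. It also assumes an LLM grader may occasionally return no parseable score, so such cases are counted separately instead of being averaged.

```python
def summarize(results, threshold=1):
    """Aggregate evaluation results.

    Each result is a dict with 'input', 'reference', 'prediction', and
    'score' keys. A missing or None score is treated as ungraded.
    """
    graded = [r for r in results if r.get("score") is not None]
    failures = [r for r in graded if r["score"] < threshold]
    average = sum(r["score"] for r in graded) / len(graded) if graded else None
    return {
        "average": average,
        "graded": len(graded),
        "ungraded": len(results) - len(graded),
        "failures": failures,
    }


# Placeholder data for illustration only.
sample_results = [
    {"input": "What is 2+2?", "reference": "4", "prediction": "4", "score": 1},
    {"input": "Capital of Japan?", "reference": "Tokyo", "prediction": "Kyoto", "score": 0},
    {"input": "Color of sky?", "reference": "Blue", "prediction": "Blue", "score": None},
]

report = summarize(sample_results)
print(f"Average over {report['graded']} graded cases: {report['average']:.2%}")
for failure in report["failures"]:
    print(f"FAILED: {failure['input']!r} expected {failure['reference']!r}, "
          f"got {failure['prediction']!r}")
```

Keeping the failing inputs next to the expected and actual answers makes it easier to tell whether a low score comes from retrieval, generation, or the grader itself.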
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
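One way to capture the package version and runtime is sketched below, using only the standard library. The helper name `describe_environment` and the default package list are assumptions; add whichever packages your run depends on.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=("langchain",)):
    """Collect the Python runtime and installed versions of the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # Package is not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


record = describe_environment()
print(json.dumps(record, indent=2))
```

Saving this record next to the observed output lets a later reader tell whether a different score comes from the chain or from a changed environment.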
Create a RAG pipeline, write 5 test questions with known answers, run evaluation on all 5, and report the average score with specific failure cases.