12. Building Test Datasets
A well-constructed test dataset determines whether evaluation results reflect real-world performance. A poor dataset produces misleading metrics: the tests pass while users get bad answers in production.
Dataset Requirements
Each example in a RAG evaluation dataset needs four components: a query, a ground truth answer, context chunks, and expected retrieval documents. Each component serves a different evaluation purpose.
The Query field represents how users actually phrase questions. The Ground Truth field records what a perfect generator would produce, so the evaluator has a reference to score against. The Context field records the chunks the retriever should find, and the expected retrieval documents list the documents it should return.
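A quick structural check can catch incomplete examples before they reach an evaluation run. The sketch below is one way to do that, not a prescribed schema: the field names are assumptions for illustration, and `expected_doc_ids` in particular is a hypothetical name for the expected retrieval documents.

```python
from typing import Dict, List

# Assumed field names and types. "expected_doc_ids" is a hypothetical name
# for the expected retrieval documents described above.
REQUIRED_FIELDS = {
    "query": str,
    "ground_truth": str,
    "contexts": list,
    "expected_doc_ids": list,
}


def find_schema_problems(example: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the example is usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in example:
            problems.append(f"missing field: {field}")
            continue
        value = example[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field} should be {expected_type.__name__}")
        elif not value or (isinstance(value, str) and not value.strip()):
            # Empty lists and blank strings carry no evaluation signal.
            problems.append(f"{field} is empty")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a dataset in a single pass.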
Synthetic vs. Human-Crafted Data
Human-crafted test sets capture natural language variation and the real distribution of query intent. The main drawback is scale: creating 500 high-quality examples manually takes dozens of hours.
Synthetic data generation creates examples programmatically, trading some naturalness for scale. A practical approach combines the two: use human-crafted examples as seed data, then generate variations with an LLM.
```python
import json
from typing import List, Optional


def create_test_example(
    case_id: str,
    query: str,
    ground_truth: str,
    contexts: List[str],
    metadata: Optional[dict] = None,
) -> dict:
    """Create a structured test example."""
    return {
        "id": case_id,
        "query": query,
        "ground_truth": ground_truth,
        "contexts": contexts,
        "metadata": metadata or {},
        "version": "1.0",
    }


def load_test_dataset(filepath: str) -> List[dict]:
    """Load test examples from JSONL format (one JSON object per line)."""
    dataset = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, which json.loads would reject
                dataset.append(json.loads(line))
    return dataset


def save_test_dataset(dataset: List[dict], filepath: str) -> None:
    """Save test examples to JSONL format."""
    with open(filepath, "w", encoding="utf-8") as f:
        for example in dataset:
            # ensure_ascii=False keeps non-ASCII queries readable in the file
            f.write(json.dumps(example, ensure_ascii=False) + "\n")
```
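The seed-and-expand approach described earlier can be sketched as follows. This is a minimal illustration under stated assumptions: `rewrite_query` is a hypothetical callable that wraps whichever LLM you use and returns alternative phrasings of a query, and the `source` and `seed_id` metadata keys are invented here for traceability.

```python
from typing import Callable, Dict, List


def expand_seed_examples(
    seeds: List[Dict],
    rewrite_query: Callable[[str], List[str]],
    max_variants_per_seed: int = 3,
) -> List[Dict]:
    """Create synthetic variants of human-written seed examples.

    rewrite_query is a hypothetical wrapper around an LLM call; it is
    assumed to return paraphrases that preserve the query's meaning.
    """
    expanded = []
    for seed in seeds:
        expanded.append(seed)  # always keep the human-written original
        seen = {seed["query"].strip().lower()}
        kept = 0
        for variant in rewrite_query(seed["query"]):
            key = variant.strip().lower()
            if not key or key in seen:
                continue  # drop empty or duplicate rewrites
            if kept >= max_variants_per_seed:
                break
            seen.add(key)
            kept += 1
            expanded.append({
                **seed,  # ground truth and contexts are inherited from the seed
                "id": f"{seed['id']}-syn{kept}",
                "query": variant.strip(),
                "metadata": {
                    **(seed.get("metadata") or {}),
                    "source": "synthetic",
                    "seed_id": seed["id"],
                },
            })
    return expanded
```

Because each variant inherits the seed's ground truth and contexts, the sketch is only valid when the rewrite keeps the original meaning. Spot-checking a sample of synthetic examples by hand is a reasonable safeguard.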
Diversity Sampling
Test datasets must cover the diversity of the retrieval corpus and query distribution. A dataset containing only simple factual questions fails to test complex reasoning chains.
| Query Type | Example | Evaluation Focus |
|---|---|---|
| Factual | "What year was X founded?" | Chunk retrieval precision |
| Comparative | "How does X compare to Y?" | Multi-hop retrieval |
| Instructive | "Explain how to perform X" | Instruction following |
| Diagnostic | "Why is X not working?" | Causal reasoning |
| Edge case | Ambiguous or out-of-scope query | Graceful degradation |
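One way to check coverage of the query types in the table above is to count examples per type and flag any that fall short. The sketch assumes each example stores its type under a `query_type` key in `metadata`; that key and the lowercase labels are illustrative assumptions, not part of the record format shown earlier.

```python
from collections import Counter
from typing import Dict, List

# Query types from the table above, written as assumed lowercase labels.
EXPECTED_TYPES = ["factual", "comparative", "instructive", "diagnostic", "edge_case"]


def query_type_coverage(dataset: List[Dict], min_per_type: int = 1) -> Dict:
    """Count examples per query type and list types below min_per_type."""
    counts = Counter(
        # Examples without a label are grouped under "unlabeled".
        (example.get("metadata") or {}).get("query_type", "unlabeled")
        for example in dataset
    )
    underrepresented = [t for t in EXPECTED_TYPES if counts[t] < min_per_type]
    return {"counts": dict(counts), "underrepresented": underrepresented}
```

A non-empty `underrepresented` list signals which kinds of queries still need examples before the dataset is used for evaluation.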
Test Dataset Validation
After creating the dataset, validate it against the production query distribution. If production queries frequently use domain-specific terminology that is absent from the test set, the evaluation results will not generalize to production.
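```python
import re
from collections import Counter
from typing import List


def analyze_query_distributions(
    test_queries: List[str],
    production_queries: List[str],
) -> dict:
    """Compare test and production query vocabularies."""

    def tokenize(text: str) -> List[str]:
        return re.findall(r"\b\w+\b", text.lower())

    test_tokens = Counter()
    for query in test_queries:
        test_tokens.update(tokenize(query))

    production_tokens = Counter()
    for query in production_queries:
        production_tokens.update(tokenize(query))

    # Normalize by corpus size so a small test set can be compared
    # with a much larger production log.
    test_total = max(sum(test_tokens.values()), 1)
    production_total = max(sum(production_tokens.values()), 1)

    # Missing terms: in production but absent from the test set, ordered by
    # production frequency so the slice below is a genuine top 50.
    missing_terms = sorted(
        (term for term in production_tokens if term not in test_tokens),
        key=lambda term: production_tokens[term],
        reverse=True,
    )

    # Overrepresented terms: relative frequency in the test set is more
    # than 3x the relative frequency in production.
    overrepresented = sorted(
        (
            term
            for term, count in test_tokens.items()
            if count / test_total
            > 3 * production_tokens.get(term, 0) / production_total
        ),
        key=lambda term: test_tokens[term],
        reverse=True,
    )

    return {
        "missing_terms": missing_terms[:50],
        "overrepresented_terms": overrepresented[:50],
        "test_total_unique": len(test_tokens),
        "production_total_unique": len(production_tokens),
    }
```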
Load your production query logs from the past month, cluster them by query type, and build a test dataset that mirrors the natural distribution rather than an arbitrary mix.
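Once production queries carry a type label, the per-type size of the test set can be derived from the observed proportions. The sketch below assumes the labels already exist (the clustering step is out of scope here) and uses largest-remainder rounding, a choice made for this example so that the counts add up to the requested size.

```python
from collections import Counter
from typing import Dict, List


def allocate_by_distribution(
    production_labels: List[str],
    test_size: int,
) -> Dict[str, int]:
    """Split test_size across query types in proportion to production traffic."""
    if not production_labels or test_size <= 0:
        return {}
    counts = Counter(production_labels)
    total = sum(counts.values())
    # Exact (fractional) share of the test set for each label.
    exact = {label: test_size * n / total for label, n in counts.items()}
    allocation = {label: int(share) for label, share in exact.items()}
    leftover = test_size - sum(allocation.values())
    # Give the remaining slots to the labels with the largest fractional
    # parts; ties are broken by label name for determinism.
    by_remainder = sorted(
        exact,
        key=lambda label: (exact[label] - allocation[label], label),
        reverse=True,
    )
    for label in by_remainder[:leftover]:
        allocation[label] += 1
    return allocation
```

Strictly proportional allocation can assign zero examples to rare query types. If edge cases matter for your evaluation, consider enforcing a small minimum per type on top of these counts.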