Building Test Datasets — RAG Evaluation and Metrics (Chapter 12)

A well-constructed test dataset determines whether evaluation results reflect real-world performance. A poorly constructed one yields misleading metrics: the system passes its tests yet still gives users a bad experience in production.

Dataset Requirements

Test datasets for RAG evaluation require four components per example: query, ground truth answer, context chunks, and expected retrieval documents. Each component serves a different evaluation purpose.

The query represents how users actually phrase questions. The ground truth answer tells the evaluator what a perfect generator would produce. The context chunks and the expected retrieval documents tell the evaluator what the retriever should find.
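As a rough illustration of these components, the sketch below checks that a record carries each one before it enters the test set. The names `query`, `ground_truth`, and `contexts` match the example structure used later in this section. The name `expected_doc_ids` is a hypothetical field for the expected retrieval documents, and `find_missing_fields` is an illustrative helper, not part of any library.

```python
from typing import Dict, List

# Required fields and their expected types. "expected_doc_ids" is a
# hypothetical field name for the expected retrieval documents.
REQUIRED_FIELDS: Dict[str, type] = {
    "query": str,
    "ground_truth": str,
    "contexts": list,
    "expected_doc_ids": list,
}

def find_missing_fields(example: dict) -> List[str]:
    """Return the required fields that are absent, empty, or of the wrong type."""
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        value = example.get(name)
        if not isinstance(value, expected_type) or len(value) == 0:
            problems.append(name)
    return problems

# Illustrative record that lacks the expected retrieval documents.
example = {
    "query": "What is the refund window?",
    "ground_truth": "Refunds are accepted within 30 days of purchase.",
    "contexts": ["Our policy allows refunds within 30 days of purchase."],
}
print(find_missing_fields(example))  # ['expected_doc_ids']
```

Running a check like this at load time surfaces incomplete records early, before they silently skew retrieval or generation metrics.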

Synthetic vs. Human-Crafted Data

Human-crafted test sets capture natural language variation and the real distribution of query intent. The main drawback is scale: creating 500 high-quality examples manually takes dozens of hours.

Synthetic data generation creates examples programmatically, sacrificing some naturalness for scale. A practical approach combines the two: use human-crafted examples as seed data, then generate variations of them with an LLM.
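The seed-and-expand idea can be sketched as follows. This is a minimal outline under stated assumptions: `expand_seed_examples` and the `paraphrase` callable are hypothetical names, and `fake_paraphrase` is a stand-in for a real LLM call so that the example runs without any external service. Each synthetic variant reuses the seed's ground truth and contexts, which only holds if the rewrite preserves the meaning of the original query.

```python
from typing import Callable, List

def expand_seed_examples(
    seeds: List[dict],
    paraphrase: Callable[[str], List[str]],
) -> List[dict]:
    """Generate query variations for each seed, reusing its answer and contexts."""
    expanded = []
    for seed in seeds:
        # Keep the human-written seed and tag its origin.
        seed_meta = {**seed.get("metadata", {}), "source": "human"}
        expanded.append({**seed, "metadata": seed_meta})

        seen = {seed["query"].strip().lower()}
        for i, variant in enumerate(paraphrase(seed["query"])):
            key = variant.strip().lower()
            if not key or key in seen:
                continue  # drop empty or duplicate rewrites
            seen.add(key)
            expanded.append({
                "id": f"{seed['id']}-syn{i}",
                "query": variant,
                "ground_truth": seed["ground_truth"],
                "contexts": seed["contexts"],
                "metadata": {"source": "synthetic", "seed_id": seed["id"]},
            })
    return expanded

# Stand-in for an LLM call (assumption): a real pipeline would prompt a model here.
def fake_paraphrase(query: str) -> List[str]:
    return [query.replace("What is", "Tell me"), query]

seeds = [{
    "id": "q1",
    "query": "What is the refund window?",
    "ground_truth": "30 days.",
    "contexts": ["Refunds are accepted within 30 days."],
}]
dataset = expand_seed_examples(seeds, fake_paraphrase)
print([ex["id"] for ex in dataset])  # ['q1', 'q1-syn0']
```

Recording `source` and `seed_id` in the metadata makes it possible to report metrics separately for human and synthetic examples, and to trace a failing variant back to its seed.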

import json
from typing import List, Optional

def create_test_example(
    case_id: str,
    query: str,
    ground_truth: str,
    contexts: List[str],
    metadata: Optional[dict] = None
) -> dict:
    """Create a structured test example."""
    return {
        "id": case_id,
        "query": query,
        "ground_truth": ground_truth,
        "contexts": contexts,
        "metadata": metadata or {},
        "version": "1.0"
    }

def load_test_dataset(filepath: str) -> List[dict]:
    """Load test examples from JSONL format (one JSON object per line)."""
    dataset = []
    with open(filepath, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing in json.loads
            dataset.append(json.loads(line))
    return dataset

def save_test_dataset(
    dataset: List[dict],
    filepath: str
) -> None:
    """Save test examples to JSONL format (one JSON object per line)."""
    with open(filepath, "w", encoding="utf-8") as f:
        for example in dataset:
            f.write(json.dumps(example) + "\n")

Diversity Sampling

Test datasets must cover the diversity of the retrieval corpus and query distribution. A dataset containing only simple factual questions fails to test complex reasoning chains.

Query Type	Example	Evaluation Focus
Factual	"What year was X founded?"	Chunk retrieval precision
Comparative	"How does X compare to Y?"	Multi-hop retrieval
Instructive	"Explain how to perform X"	Instruction following
Diagnostic	"Why is X not working?"	Causal reasoning
Edge case	Ambiguous or out-of-scope query	Graceful degradation
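One way to keep every query type represented is to sample a fixed quota from each type rather than sampling the pool as a whole. The sketch below assumes each example stores its type under `metadata["query_type"]`; that key, the function name `stratified_sample`, and the example pool are illustrative assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List

def stratified_sample(
    examples: List[dict],
    per_type: int,
    seed: int = 0,
) -> List[dict]:
    """Sample up to `per_type` examples from each query type."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible

    # Group examples by query type; untagged examples fall under "unknown".
    by_type: Dict[str, List[dict]] = defaultdict(list)
    for example in examples:
        query_type = example.get("metadata", {}).get("query_type", "unknown")
        by_type[query_type].append(example)

    sampled = []
    for query_type in sorted(by_type):
        group = by_type[query_type]
        # Take the whole group when it is smaller than the quota.
        sampled.extend(rng.sample(group, min(per_type, len(group))))
    return sampled

# Illustrative pool: 10 factual, 2 comparative, 1 edge-case example.
pool = (
    [{"id": f"f{i}", "metadata": {"query_type": "factual"}} for i in range(10)]
    + [{"id": f"c{i}", "metadata": {"query_type": "comparative"}} for i in range(2)]
    + [{"id": "e0", "metadata": {"query_type": "edge_case"}}]
)
sampled = stratified_sample(pool, per_type=2)
print(len(sampled))  # 5
```

A per-type quota prevents the most common type (here, factual) from crowding out rarer ones, at the cost of no longer mirroring the true type frequencies.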

Test Dataset Validation

After creating the dataset, validate it against the production query distribution. If production queries frequently use domain-specific terminology not present in the test set, the evaluation results do not generalize.

from collections import Counter
from typing import List
import re

def analyze_query_distributions(
    test_queries: List[str],
    production_queries: List[str]
) -> dict:
    """Compare test and production query distributions."""
    
    def tokenize(text: str) -> List[str]:
        return re.findall(r'\b\w+\b', text.lower())
    
    test_tokens = Counter()
    for query in test_queries:
        test_tokens.update(tokenize(query))
    
    production_tokens = Counter()
    for query in production_queries:
        production_tokens.update(tokenize(query))
    
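    # Missing terms: in production but absent from the test set,
    # most frequent in production first.
    missing_terms = sorted(
        set(production_tokens) - set(test_tokens),
        key=lambda term: production_tokens[term],
        reverse=True
    )
    
    # Overrepresented terms: relative frequency in test exceeds 3x the
    # relative frequency in production. Raw counts are normalized because
    # the two query sets usually differ in size.
    test_total = sum(test_tokens.values()) or 1
    production_total = sum(production_tokens.values()) or 1
    overrepresented = {
        term: count
        for term, count in test_tokens.items()
        if count / test_total
        > 3 * production_tokens.get(term, 0) / production_total
    }
    
    return {
        "missing_terms": missing_terms[:50],  # 50 most frequent missing terms
        "overrepresented_terms": list(overrepresented)[:50],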
        "test_total_unique": len(test_tokens),
        "production_total_unique": len(production_tokens)
    }
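
A term-by-term report can be complemented by a single summary number. The sketch below computes a frequency-weighted vocabulary coverage: the share of production token occurrences whose token also appears somewhere in the test set. The function name `vocabulary_coverage` and the sample queries are illustrative assumptions, and the tokenizer is the same simple word-level regex used above.

```python
import re
from collections import Counter
from typing import List

def vocabulary_coverage(
    test_queries: List[str],
    production_queries: List[str],
) -> float:
    """Share of production token occurrences whose token appears in the test set."""

    def tokenize(text: str) -> List[str]:
        return re.findall(r"\b\w+\b", text.lower())

    test_vocab = {tok for query in test_queries for tok in tokenize(query)}
    production_counts = Counter(
        tok for query in production_queries for tok in tokenize(query)
    )

    total = sum(production_counts.values())
    if total == 0:
        return 0.0  # no production tokens to cover

    covered = sum(
        count for tok, count in production_counts.items() if tok in test_vocab
    )
    return covered / total

test_queries = ["how do I reset my password"]
production_queries = ["reset password", "reset mfa token"]
print(vocabulary_coverage(test_queries, production_queries))  # 0.6
```

Weighting by frequency means that a common production term missing from the test set lowers the score more than a rare one, which matches how often users would actually hit the gap.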