Synthetic Data Generation — RAG Evaluation and Metrics (Chapter 13)

Generating synthetic test data programmatically solves the scaling problem while maintaining evaluation coverage. The key is controlling the generation process to produce diverse, high-quality examples without introducing systematic biases.

Query Variation Generation

Generate multiple query phrasings for the same context to test retrieval reliability. Variations should differ in vocabulary, syntax, and specificity level while remaining answerable by the same source material.

import json
from typing import List, Optional

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

QUERY_VARIATION_GENERATOR = PromptTemplate.from_template("""You are generating query variations for RAG system testing.

## Source Document
{document}

## Original Query
{original_query}

## Task
Generate {count} alternative queries that:
1. Ask for the same information as the original
2. Use different vocabulary than the original
3. Have varying levels of specificity
4. Are natural and do not mention the document explicitly

Return a JSON array of query strings.
""")

def generate_query_variations(
    document: str,
    original_query: str,
    count: int = 5,
    model: Optional[ChatOpenAI] = None,
) -> List[str]:
    """Generate alternative phrasings for a test query."""
    # Fall back to a default model so the function also works when none is passed.
    if model is None:
        model = ChatOpenAI(model="gpt-4o")
    # Pipe the prompt into the chat model; invoking the chain returns a chat message.
    chain = QUERY_VARIATION_GENERATOR | model
    result = chain.invoke({
        "document": document,
        "original_query": original_query,
        "count": count
    })
    # Assumes the model returns a bare JSON array, as the prompt requests;
    # json.loads raises json.JSONDecodeError on anything else.
    return json.loads(result.content)

# Example usage
variations = generate_query_variations(
    document="The quarterly report shows Q3 revenue of $4.2M, up 15% year-over-year.",
    original_query="What was the Q3 revenue?",
    count=5,
    model=ChatOpenAI(model="gpt-4o")
)

print(variations)
# Illustrative output, truncated to three of the five variations (actual output varies by run):
# ["How much revenue did the company generate in the third quarter?",
#  "What does the earnings report say about Q3 financial performance?",
#  "State the revenue figure reported for the third quarter.", ...]
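
Generated variations are only useful if they really do differ from the original query. One rough way to check the "different vocabulary" requirement is a token-overlap filter, sketched below. This is a minimal sketch under stated assumptions: the helper names (jaccard, filter_diverse) and the 0.6 threshold are illustrative choices, not part of any library, and lexical overlap is only a crude proxy for phrasing diversity.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase a query and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity: |A & B| / |A | B|, in the range [0, 1]."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def filter_diverse(
    original: str,
    candidates: List[str],
    max_overlap: float = 0.6,  # assumed threshold; tune it on your own data
) -> List[str]:
    """Keep candidates that are lexically distinct from the original and from each other."""
    kept: List[str] = []
    for cand in candidates:
        references = [original] + kept
        if all(jaccard(cand, ref) <= max_overlap for ref in references):
            kept.append(cand)
    return kept


candidates = [
    "What was the Q3 revenue?",     # exact duplicate of the original
    "What was the revenue in Q3?",  # same words, reordered
    "How much money did the company bring in during the third quarter?",
]
print(filter_diverse("What was the Q3 revenue?", candidates))
# ['How much money did the company bring in during the third quarter?']
```

A filter like this catches near-duplicates cheaply, but it cannot tell whether a variation still asks for the same information; that check needs a human reviewer or a separate model-based judgment.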

Negative Example Generation

Negative examples test the retriever's ability to recognize queries that the document cannot answer. Generate queries that seem related to the document content but cannot be answered from it.

NEGATIVE_QUERY_GENERATOR = PromptTemplate.from_template("""You are generating negative test cases for RAG retrieval.

## Source Document
{document}

## Task
Generate {count} queries that:
1. Are plausible to someone unfamiliar with the document
2. Sound like they might be answered by the document
3. CANNOT actually be answered using the document
4. Test whether the retriever can distinguish answerable from unanswerable queries

Return a JSON array of query strings.
""")

def generate_negative_queries(
    document: str,
    count: int = 3
) -> List[str]:
    """Generate queries that should NOT retrieve this document."""
    chain = NEGATIVE_QUERY_GENERATOR | ChatOpenAI(model="gpt-4o")
    result = chain.invoke({"document": document, "count": count})
    import json
    return json.loads(result.content)

# Example
neg_queries = generate_negative_queries(
    document="The quarterly report shows Q3 revenue of $4.2M, up 15% year-over-year.",
    count=3
)
print(neg_queries)
# ["What was the Q4 revenue projection?",
#  "How many employees does the company have?",
#  "What is the CEO's outlook for next year?"]
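
Once negative queries exist, one possible way to use them is to measure how often the retriever still returns the source document with high confidence. The sketch below assumes a hypothetical retriever interface (a callable that maps a query to a ranked list of document ID and score pairs); the function name, the 0.75 threshold, the document IDs, and the canned scores are all illustrative and would need to be adapted to a real retriever.

```python
from typing import Callable, List, Tuple

# Hypothetical retriever interface: query -> [(doc_id, similarity score), ...],
# sorted from most to least similar. Real retrievers expose different APIs.
Retriever = Callable[[str], List[Tuple[str, float]]]


def false_positive_rate(
    retriever: Retriever,
    negative_queries: List[str],
    source_doc_id: str,
    score_threshold: float = 0.75,  # assumed cutoff for a "confident" match
    top_k: int = 3,
) -> float:
    """Fraction of negative queries for which the source document appears in the
    top_k results with a score at or above score_threshold."""
    if not negative_queries:
        return 0.0
    hits = 0
    for query in negative_queries:
        results = retriever(query)[:top_k]
        if any(doc_id == source_doc_id and score >= score_threshold
               for doc_id, score in results):
            hits += 1
    return hits / len(negative_queries)


# Toy stand-in retriever with hard-coded scores, for illustration only.
def fake_retriever(query: str) -> List[Tuple[str, float]]:
    canned = {
        "What was the Q4 revenue projection?": [("q3_report", 0.82), ("faq", 0.40)],
        "How many employees does the company have?": [("hr_handbook", 0.77), ("q3_report", 0.31)],
    }
    return canned.get(query, [])


rate = false_positive_rate(
    fake_retriever,
    ["What was the Q4 revenue projection?", "How many employees does the company have?"],
    source_doc_id="q3_report",
)
print(rate)  # 0.5
```

A high rate suggests the retriever matches on surface similarity (shared terms such as "revenue") rather than on whether the document can actually answer the query.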

Adversarial Query Generation

Generate queries designed to stress-test retrieval edge cases: multi-hop questions that require information from multiple documents, queries with conflicting temporal modifiers, and queries with ambiguous references such as pronouns or vague noun phrases.

ADVERSARIAL_QUERY_GENERATOR = PromptTemplate.from_template("""Generate adversarial test cases for RAG evaluation.

Document topics to test: {topics}

Generate {count} queries that test:
1. Multi-hop reasoning requiring 2+ documents
2. Temporal confusion (comparing different time periods)
3. Reference ambiguity (pronouns, vague noun phrases)

Each query should be answerable only with correct multi-document retrieval.
Return a JSON array of objects, each with "query" and "required_documents" fields.
""")
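
Because this prompt asks for structured objects rather than plain strings, the model output is worth validating before it enters a test set. The following is a minimal sketch of such a validator, assuming the output is a JSON array whose items carry the "query" and "required_documents" fields named in the prompt. The function names, the document IDs in the example, and the optional code-fence stripping are assumptions for illustration.

```python
import json
from typing import Any, Dict, List

# Markdown code-fence marker, built indirectly so this listing stays readable.
FENCE = "`" * 3


def _strip_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence, if the model added one."""
    text = text.strip()
    lines = text.splitlines()
    if len(lines) >= 3 and text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (possibly with a language tag) and the closing one.
        return "\n".join(lines[1:-1])
    return text


def parse_adversarial_cases(raw: str, min_documents: int = 2) -> List[Dict[str, Any]]:
    """Parse model output and keep only well-formed multi-document test cases."""
    data = json.loads(_strip_fence(raw))
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of test cases")
    cases: List[Dict[str, Any]] = []
    for item in data:
        if not isinstance(item, dict):
            continue
        query = item.get("query")
        docs = item.get("required_documents")
        if not isinstance(query, str) or not query.strip():
            continue  # missing or empty query
        if not isinstance(docs, list) or not all(isinstance(d, str) for d in docs):
            continue  # required_documents must be a list of document IDs
        if len(set(docs)) < min_documents:
            continue  # not a genuine multi-document case
        cases.append({"query": query.strip(), "required_documents": list(docs)})
    return cases


# Hand-written stand-in for model output; the document IDs are hypothetical.
raw_output = json.dumps([
    {"query": "How did Q3 revenue compare with the Q2 figure?",
     "required_documents": ["q3_report", "q2_report"]},
    {"query": "What was it last year?", "required_documents": ["q3_report"]},
    {"query": "", "required_documents": ["a", "b"]},
])
cases = parse_adversarial_cases(raw_output)
print(len(cases))  # 1
```

Dropping cases that list fewer than two distinct documents enforces the prompt's multi-document requirement mechanically, instead of trusting the model to follow it.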