Synthetic Preference Data — RLHF, DPO, and PPO (Chapter 12)

Human annotation is expensive and slow. Synthetic preference data—generated by AI systems rather than humans—offers a path to faster, cheaper, and more scalable alignment signals. The key challenge is maintaining quality: synthetic preferences must correlate with real human preferences.
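
One way to check that synthetic labels track human ones is to compare them on a small human-labeled validation set. The sketch below is a minimal illustration, assuming each comparison has a pair ID and a label of "A" or "B". The function name `agreement_rate` and the sample data are hypothetical.

```python
def agreement_rate(synthetic_labels, human_labels):
    """Fraction of pairs where the synthetic label matches the human label.

    Both arguments are dicts mapping a pair ID to "A" or "B".
    Only pair IDs present in both dicts are compared.
    """
    shared = synthetic_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no overlapping pair IDs to compare")
    matches = sum(synthetic_labels[k] == human_labels[k] for k in shared)
    return matches / len(shared)


# Made-up labels for illustration only.
synthetic = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
human = {"p1": "A", "p2": "B", "p3": "B", "p4": "B", "p5": "A"}

rate = agreement_rate(synthetic, human)
print(f"Judge/human agreement: {rate:.0%}")
```

A low agreement rate on such a set suggests revisiting the judge prompt or the judge model before generating data at scale.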

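LLM-as-judge: Use a capable model (GPT-4, Claude, etc.) to pick the better response in each pair. The judge should generally be at least as capable as the model being aligned; a weaker judge cannot reliably tell which of two strong responses is better.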

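def generate_synthetic_preference(prompt, response_a, response_b, judge_model):
    """
    Generate a synthetic preference using an LLM judge.

    Returns a dict with prompt, chosen, rejected, confidence, and the raw
    judgment, or None if the judge's verdict cannot be parsed.
    call_judge_model is assumed to be defined elsewhere and to return the
    judge's reply as a string.
    """
    judge_prompt = f"""You are evaluating AI assistant responses for helpfulness and harmlessness.

USER QUERY: {prompt}

RESPONSE A: {response_a}

RESPONSE B: {response_b}

Which response is better? Consider:
- Accuracy of information
- Completeness of answer
- Clarity and readability
- Appropriateness for the query

Respond in this format:
PREFERENCE: [A/B]
CONFIDENCE: [high/medium/low]
REASONING: [brief explanation]
"""

    judgment = call_judge_model(judge_prompt, model=judge_model)

    # Parse "KEY: value" lines. Split on the first colon only, and keep the
    # first occurrence of each key so later lines cannot overwrite it.
    fields = {}
    for line in judgment.splitlines():
        key, sep, value = line.partition(':')
        if sep:
            fields.setdefault(key.strip().upper(), value.strip())

    # Tolerate answers such as "[A]" or "a"
    preference = fields.get('PREFERENCE', '').strip('[] ').upper()
    if preference not in ('A', 'B'):
        return None  # Unparseable verdict: skip rather than silently pick B

    confidence = fields.get('CONFIDENCE', '').strip('[] ').lower() or None

    chosen = response_a if preference == 'A' else response_b
    rejected = response_b if preference == 'A' else response_a

    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "confidence": confidence,
        "judgment": judgment
    }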
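
LLM judges can also favor whichever response appears first or second in the prompt. A common mitigation is to query the judge twice with the order swapped and keep only pairs where both verdicts agree. The sketch below illustrates the idea under stated assumptions: `judge_fn` is a hypothetical callable that returns "A" if it prefers the first response shown and "B" if it prefers the second, and the two stub judges exist only to make the example runnable.

```python
def judge_both_orders(prompt, response_a, response_b, judge_fn):
    """Query the judge twice with the response order swapped.

    judge_fn(prompt, first, second) is a hypothetical callable returning
    "A" if it prefers `first` and "B" if it prefers `second`.
    Returns a preference record, or None if the two verdicts disagree.
    """
    first_pass = judge_fn(prompt, response_a, response_b)
    second_pass = judge_fn(prompt, response_b, response_a)

    # Map the swapped verdict back onto the original A/B labels.
    second_pass_unswapped = {"A": "B", "B": "A"}.get(second_pass)

    if first_pass not in ("A", "B") or first_pass != second_pass_unswapped:
        return None  # inconsistent or unparseable: discard the pair

    if first_pass == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


# Stub judges for illustration only.
def always_first(prompt, first, second):
    return "A"  # a purely position-biased judge


def prefers_shorter(prompt, first, second):
    return "A" if len(first) <= len(second) else "B"


biased = judge_both_orders("q", "short", "a much longer answer", always_first)
consistent = judge_both_orders("q", "short", "a much longer answer", prefers_shorter)
print(biased)      # the position-biased judge contradicts itself
print(consistent)  # the order-independent judge yields a usable pair
```

This doubles the number of judge calls, so it trades cost for a cleaner training signal.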

Constitutional AI critique-revise: Generate a response, have the model critique its own output, then revise the response based on that critique. The preference is inferred rather than judged: the revised response is assumed to be better than the original.

def critique_and_revise(prompt, initial_response, critique_model):
    """
    Constitutional AI style preference generation.
    """
    # Critique
    critique_prompt = f"""Review this response for issues:

USER: {prompt}
RESPONSE: {initial_response}

Identify any problems with accuracy, completeness, or safety.
"""
    critique = critique_model(critique_prompt)
    
    # Revision
    revision_prompt = f"""Given this critique, revise the response:

USER: {prompt}
ORIGINAL: {initial_response}
CRITIQUE: {critique}

Provide an improved response:
"""
    revised = critique_model(revision_prompt)
    
    # Preference: revised is better than original
    return {
        "prompt": prompt,
        "chosen": revised,
        "rejected": initial_response,
        "critique": critique
    }
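
A revision that merely restates the original carries no preference signal, so it can help to drop such pairs before training. The sketch below is one possible sanity filter, not part of the Constitutional AI method itself. The function name `filter_revision_pairs` and the normalization rule (ignore case and whitespace) are assumptions chosen for illustration.

```python
def filter_revision_pairs(pairs):
    """Keep only pairs whose revision differs from the original.

    Whitespace and letter case are ignored when comparing, so a revision that
    only re-wraps or re-capitalizes the original is treated as unchanged.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    return [
        pair for pair in pairs
        if normalize(pair["chosen"]) != normalize(pair["rejected"])
    ]


# Made-up records in the same shape that critique_and_revise returns.
pairs = [
    {"prompt": "p1", "chosen": "Water boils at 100 C at sea level.",
     "rejected": "Water boils at 90 C.", "critique": "Wrong temperature."},
    {"prompt": "p2", "chosen": "The answer  is 4.",
     "rejected": "the answer is 4.", "critique": "No issues found."},
]

kept = filter_revision_pairs(pairs)
print(f"Kept {len(kept)} of {len(pairs)} pairs")
```

This filter only removes degenerate pairs. It does not verify that a changed revision is actually better, which still calls for a judge or a human spot check.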

Failure mode: judge model bias. The judge has its own preferences and biases that leak into the training signal. If the judge prefers verbose responses, the policy will learn to be verbose—even when conciseness is better.

# Detect judge bias: check how often the longer response is the preferred one
def detect_length_bias(synthetic_preferences, threshold=0.15):
    longer_preferred = 0
    compared = 0
    for p in synthetic_preferences:
        len_chosen = len(p["chosen"].split())
        len_rejected = len(p["rejected"].split())
        if len_chosen == len_rejected:
            continue  # Ties carry no length signal, so skip them
        compared += 1
        longer_preferred += len_chosen > len_rejected

    if compared == 0:
        print("No pairs with differing lengths; cannot estimate length bias")
        return False

    pct_longer_preferred = longer_preferred / compared
    print(f"Longer responses preferred: {pct_longer_preferred:.1%}")

    # If length were unrelated to quality, the longer response would win
    # about half the time; flag large deviations in either direction
    if abs(pct_longer_preferred - 0.5) > threshold:
        print("WARNING: Significant length bias in judge")
        return True
    return False
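
A fixed deviation threshold does not account for sample size: 65% on 20 pairs may be noise, while 65% on 200 pairs is unlikely to be. One way to separate the two cases is an exact two-sided binomial test against a 50% null rate. The sketch below uses only the standard library. The function name `length_bias_p_value` is hypothetical, and it assumes ties have already been excluded from the counts.

```python
from math import comb


def length_bias_p_value(n_longer_chosen, n_pairs):
    """Exact two-sided binomial test against a null rate of 0.5.

    Returns the probability, under the null, of observing an outcome that is
    no more likely than the observed count.
    """
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    if not 0 <= n_longer_chosen <= n_pairs:
        raise ValueError("n_longer_chosen must be between 0 and n_pairs")

    # With p = 0.5, outcome k has probability C(n, k) / 2**n, so outcomes
    # can be compared through their binomial coefficients alone.
    weights = [comb(n_pairs, k) for k in range(n_pairs + 1)]
    observed = weights[n_longer_chosen]
    tail = sum(w for w in weights if w <= observed)
    return tail / 2 ** n_pairs


p_small = length_bias_p_value(13, 20)    # 65% of 20 pairs
p_large = length_bias_p_value(130, 200)  # 65% of 200 pairs
print(f"n=20:  p = {p_small:.3f}")
print(f"n=200: p = {p_large:.2e}")
```

A small p-value says only that length and preference are associated. It does not show that the judge is wrong, since longer responses can be genuinely better on some prompts.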

Self-generated preferences: Fine-tune a model on its own outputs, where preferences come from self-critique or external evaluation. This can bootstrap alignment without external annotation.
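
One possible way to build such pairs is to sample several responses per prompt, score each one, and pair the best against the worst. The sketch below is a minimal illustration under stated assumptions: `generate_fn` and `score_fn` are hypothetical callables standing in for the policy model and for the self-critique or external evaluator, and the stubs and `min_margin` default are toy values.

```python
from itertools import cycle


def build_self_preference_pair(prompt, generate_fn, score_fn,
                               n_samples=4, min_margin=0.1):
    """Sample responses and pair the best-scoring with the worst-scoring.

    generate_fn(prompt) -> str and score_fn(prompt, response) -> float are
    hypothetical callables. Returns None when the score gap is below
    min_margin, since near-ties give a weak preference signal.
    """
    candidates = [generate_fn(prompt) for _ in range(n_samples)]
    scored = sorted(
        ((score_fn(prompt, c), c) for c in candidates),
        key=lambda item: item[0],
    )
    (low_score, worst), (high_score, best) = scored[0], scored[-1]

    margin = high_score - low_score
    if margin < min_margin:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst, "margin": margin}


# Stubs for illustration only: canned outputs and a toy scorer.
canned = cycle(["Paris.", "The capital of France is Paris.", "I am not sure."])


def stub_generate(prompt):
    return next(canned)


def stub_score(prompt, response):
    score = 0.0
    if "Paris" in response:
        score += 0.5   # mentions the expected answer
    if len(response.split()) > 3:
        score += 0.25  # written as a full sentence
    return score


pair = build_self_preference_pair(
    "What is the capital of France?", stub_generate, stub_score, n_samples=3
)
print(pair)
```

Because the model supplies both the candidates and, in the self-critique case, the scores, its own errors and biases can be reinforced. Periodically checking the resulting pairs against an external signal helps guard against that.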