Data Collection — RLHF, DPO, and PPO (Chapter 7)

Preference data is the foundation of alignment. The quality, quantity, and diversity of your preference data directly determine how well your aligned model behaves.

Response generation: You need diverse responses to compare. The standard approach is to sample from your SFT model with varied temperature and top-p settings. Higher temperature produces more varied but potentially lower-quality responses. That spread is useful: a preference pair carries a training signal only when the two responses differ in quality, so the candidate pool should include both strong and weak responses.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

def generate_preference_pairs(model, tokenizer, prompts, num_responses=2, temperature=0.7):
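    """
    Generate multiple responses per prompt for preference annotation.

    Returns a list of {"prompt": str, "responses": list[str]} records;
    annotators (or a judge model) later decide which response is preferred.
    """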
    pairs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        responses = []
        for _ in range(num_responses):
            with torch.no_grad():
                # Jitter the temperature per sample to increase diversity:
                # sample_temp is uniform in [0.3, 0.3 + temperature).
                sample_temp = 0.3 + temperature * torch.rand(1).item()
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=256,
                    do_sample=True,
                    temperature=sample_temp,
                    top_p=0.95,
                )
            response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            responses.append(response)
        
        pairs.append({
            "prompt": prompt,
            "responses": responses
        })
    
    return pairs
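
When num_responses is larger than 2, the records returned above hold a list of candidates rather than ready-made pairs. One way to turn them into annotation tasks, sketched here under the assumption that every unordered pair should be compared once, is to enumerate combinations. The helper name to_comparison_tasks and the output field names are illustrative and not part of any library.

```python
from itertools import combinations

def to_comparison_tasks(records):
    """Expand {"prompt", "responses"} records into one task per response pair."""
    tasks = []
    for record in records:
        # Drop exact duplicates (order preserved): comparing a response
        # with an identical copy of itself carries no preference signal.
        unique = list(dict.fromkeys(record["responses"]))
        for response_a, response_b in combinations(unique, 2):
            tasks.append({
                "prompt": record["prompt"],
                "response_a": response_a,
                "response_b": response_b,
            })
    return tasks
```

The number of tasks grows quadratically with the number of distinct responses per prompt, so for large candidate pools you may prefer to sample a subset of pairs instead.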

Annotation pipeline: Human annotation is expensive but remains the reference standard for preferences. Depending on budget and stakes, you can supplement or replace general-purpose human annotation with:

Synthetic preferences from LLMs: Use GPT-4 or Claude to label preferences. This is fast and cheap but introduces model-specific biases.
Constitutional AI self-critique: Models generate responses, critique them, and revise. Preferences are inferred from the revision process.
Expert annotation: Domain experts label specific types of responses. Best for high-stakes applications.

# Synthetic preference with LLM judge
def synthetic_preference(prompt, response_a, response_b, judge_model="gpt-4"):
    judge_prompt = f"""Compare these two responses to the prompt: '{prompt}'
    
Response A: {response_a}
    
Response B: {response_b}
    
Which response is better? Respond with ONLY 'A' or 'B'."""
    
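    # Call your LLM API here (call_llm is a placeholder for your own client)
    judgment = call_llm(judge_prompt, model=judge_model)
    verdict = judgment.strip().strip("'\".").upper()

    # Match the verdict exactly. A substring test such as `"A" in judgment`
    # misreads any reply that merely contains the letter A.
    if verdict not in ("A", "B"):
        raise ValueError(f"Unexpected judge output: {judgment!r}")
    a_wins = verdict == "A"

    return {
        "prompt": prompt,
        "chosen": response_a if a_wins else response_b,
        "rejected": response_b if a_wins else response_a,
    }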
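
LLM judges are known to be sensitive to the order in which responses are presented, which is one of the biases mentioned above. A common mitigation, sketched here, is to query the judge in both orders and keep the pair only when the two verdicts agree. The function name consistent_preference is hypothetical, and judge is assumed to be any callable that takes (prompt, first, second) and returns "A" or "B"; wrap your own API call to fit that shape.

```python
def consistent_preference(prompt, response_a, response_b, judge):
    """Return a chosen/rejected record, or None if the verdict depends on order."""
    forward = judge(prompt, response_a, response_b)   # response_a shown as "A"
    backward = judge(prompt, response_b, response_a)  # response_a shown as "B"

    # Map both verdicts back to the original labelling.
    a_wins_forward = forward == "A"
    a_wins_backward = backward == "B"
    if a_wins_forward != a_wins_backward:
        return None  # verdict flipped with the order: discard or send to a human

    if a_wins_forward:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

This doubles the number of judge calls per pair. Whether that cost is worth it depends on how often your judge flips its verdict, which you can estimate on a small sample first.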

Failure mode: preference drift. As models improve, human annotators raise their standards. A response rated "good" in 2023 might be rated "average" in 2025. This temporal drift makes it difficult to combine datasets collected at different times. Mitigation: record when each preference was collected and weight recent data more heavily.
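
One way to weight recent data more heavily, offered as a sketch rather than a prescribed method, is an exponential decay on the age of each record. The field name collected_at, the function name recency_weight, and the one-year half-life are all assumptions chosen for illustration; the right decay rate depends on how quickly standards shift in your domain.

```python
from datetime import date

def recency_weight(collected_at, reference_date, half_life_days=365.0):
    """Weight that halves for every `half_life_days` of age (1.0 at age zero)."""
    # Clamp at zero so records dated after the reference date are not upweighted.
    age_days = max((reference_date - collected_at).days, 0)
    return 0.5 ** (age_days / half_life_days)

# Illustrative records; "collected_at" is an assumed metadata field.
records = [
    {"prompt": "p1", "collected_at": date(2023, 1, 1)},
    {"prompt": "p2", "collected_at": date(2025, 1, 1)},
]
today = date(2025, 1, 1)
weights = [recency_weight(r["collected_at"], today) for r in records]
```

The resulting weights can serve as per-example loss weights or as sampling probabilities when mixing datasets from different collection periods.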