07. Data Collection
Preference data is the foundation of alignment. The quality, quantity, and diversity of your preference data directly determine how well your aligned model behaves.
Response generation: You need diverse responses to compare. The standard approach is sampling from your SFT model with varied temperature and top-p settings. Higher temperature produces more varied but potentially lower-quality responses. That spread is useful: a preference pair carries a training signal only when the two responses differ in quality, so the dataset needs both good and bad responses.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def generate_preference_pairs(model, tokenizer, prompts, num_responses=2, temperature=0.7):
    """
    Generate multiple responses per prompt for preference annotation.
    """
    pairs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        prompt_len = inputs["input_ids"].shape[1]
        responses = []
        for _ in range(num_responses):
            # Vary temperature to increase diversity:
            # drawn uniformly from [0.3, 0.3 + temperature)
            sample_temp = temperature * torch.rand(1).item() + 0.3
            with torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=256,
                    do_sample=True,
                    temperature=sample_temp,
                    top_p=0.95,
                )
            # Decode only the newly generated tokens, not the echoed prompt
            response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
            responses.append(response)
        pairs.append({
            "prompt": prompt,
            "responses": responses
        })
    return pairs
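Annotators and judge models usually compare two responses at a time, so the per-prompt response lists produced by generate_preference_pairs need to be expanded into pairs before labeling. A minimal sketch, assuming the output format of that function; the helper name to_comparison_pairs is hypothetical, and skipping identical responses is an assumption about what is worth annotating:

```python
from itertools import combinations


def to_comparison_pairs(samples):
    """Expand each prompt's response list into unordered (A, B) pairs."""
    comparisons = []
    for sample in samples:
        for response_a, response_b in combinations(sample["responses"], 2):
            # Identical responses carry no preference signal, so skip them
            if response_a.strip() == response_b.strip():
                continue
            comparisons.append({
                "prompt": sample["prompt"],
                "response_a": response_a,
                "response_b": response_b,
            })
    return comparisons
```

With num_responses=2 this yields at most one comparison per prompt; the number of pairs grows quadratically as num_responses increases, which matters for annotation cost.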
Annotation pipeline: Human annotation is expensive but provides ground-truth preferences. For scalability, you can use:
- Synthetic preferences from LLMs: Use GPT-4 or Claude to label preferences. This is fast and cheap but introduces model-specific biases.
- Constitutional AI self-critique: Models generate responses, critique them, and revise. Preferences are inferred from the revision process.
- Expert annotation: Domain experts label specific types of responses. Best for high-stakes applications.
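# Synthetic preference with LLM judge
def synthetic_preference(prompt, response_a, response_b, judge_model="gpt-4"):
    judge_prompt = f"""Compare these two responses to the prompt: '{prompt}'
Response A: {response_a}
Response B: {response_b}
Which response is better? Respond with ONLY 'A' or 'B'."""
    # Call your LLM API here; call_llm is a placeholder you must supply
    judgment = call_llm(judge_prompt, model=judge_model)
    # Match the verdict exactly. A substring test such as `"A" in judgment`
    # would also fire on any reply that merely contains the letter A.
    verdict = judgment.strip().strip("'\".").upper()
    if verdict not in ("A", "B"):
        raise ValueError(f"Unexpected judge output: {judgment!r}")
    a_wins = verdict == "A"
    return {
        "prompt": prompt,
        "chosen": response_a if a_wins else response_b,
        "rejected": response_b if a_wins else response_a
    }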
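The self-critique option in the list above can be sketched in the same style. This is a simplified illustration, not the full Constitutional AI procedure: the generate argument is a hypothetical callable that maps a prompt string to a completion string, the principle text is a placeholder, and labeling the revision as "chosen" assumes the revision really is an improvement, which should be spot-checked on a sample.

```python
def self_critique_preference(prompt, generate, principle="Be helpful, honest, and harmless."):
    """Build one preference pair from a draft -> critique -> revision loop."""
    draft = generate(prompt)
    critique = generate(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique this response against the principle: {principle}"
    )
    revision = generate(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    # Assumption: the revised response is preferred over the first draft
    return {"prompt": prompt, "chosen": revision, "rejected": draft}
```

The returned dictionary uses the same prompt/chosen/rejected keys as synthetic_preference, so both sources can feed one dataset.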
Failure mode: preference drift. As models improve, human annotators change their standards. A response rated "good" in 2023 might be rated "average" in 2025. This temporal drift makes it difficult to combine datasets collected at different times. Mitigation: include temporal metadata and weight recent data more heavily.
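One way to weight recent data more heavily is an exponential decay on each example's age. A minimal sketch, assuming each record carries a collected_at date (a hypothetical field name) and that a one-year half-life is a sensible starting point; both are illustrative choices to tune against your own data, not recommendations from a specific source:

```python
from datetime import date


def recency_weight(collected_at, today, half_life_days=365.0):
    """Weight halves for every `half_life_days` of age; 1.0 for data collected today."""
    age_days = max((today - collected_at).days, 0)
    return 0.5 ** (age_days / half_life_days)


records = [
    {"prompt": "p1", "collected_at": date(2023, 6, 1)},
    {"prompt": "p2", "collected_at": date(2025, 6, 1)},
]
today = date(2025, 6, 1)
for record in records:
    # Older annotations get a smaller weight, newer ones a larger weight
    record["weight"] = recency_weight(record["collected_at"], today)
```

The resulting weights can scale each pair's loss term or set its sampling probability during training.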
Design a preference data collection pipeline for a customer service chatbot. List 10 diverse prompt categories that cover common user intents. For each category, write 5 example prompts. Then outline the response generation strategy and annotation guidelines. Consider edge cases like ambiguous queries, conflicting user goals, and responses that are safe but unhelpful.