15. Alignment Evaluation
Chapter 15 of 24 · 20 min
Evaluating alignment is harder than evaluating capabilities: a capability benchmark usually has a reference answer to score against, while alignment depends on human values, which are contested and context-dependent.
Evaluation Metrics Taxonomy
Automated Metrics:
- Reward model scores (proxy for human preference)
- Classifier-based safety scores
- Response length and format consistency
Human Metrics:
- Preference rankings (direct comparison)
- Likert scales (quality ratings)
- Adversarial probing (safety testing)
Behavioral Metrics:
- Refusal rates on benign requests
- Response quality on edge cases
- Consistency under reframing attacks
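As an illustration of the last behavioral metric, consistency under reframing can be sketched as the share of prompts whose reworded variants all receive the same decision. This is a minimal sketch, not a standard definition: the function name `reframing_consistency`, the "refuse"/"comply" labels, and the input layout are assumptions made for this example.

```python
def reframing_consistency(decisions_by_prompt):
    """Fraction of prompt groups whose reframed variants all got the same label.

    decisions_by_prompt maps a base-prompt id to a list of decision labels
    (here "refuse" or "comply"), one label per reframed variant.
    """
    # Ignore prompts that have no labelled variants at all.
    groups = [labels for labels in decisions_by_prompt.values() if labels]
    if not groups:
        raise ValueError("no labelled groups to evaluate")
    # A group is consistent when every variant received the same label.
    consistent = sum(1 for labels in groups if len(set(labels)) == 1)
    return consistent / len(groups)


example = {
    "p1": ["refuse", "refuse", "refuse"],
    "p2": ["refuse", "comply", "refuse"],  # one reframing flipped the decision
    "p3": ["comply", "comply"],
    "p4": ["comply", "comply", "comply"],
}
print(reframing_consistency(example))  # 0.75
```

A low score here points at prompts where a change of wording alone changes the model's behavior, which are good candidates for manual review.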
Implementing Preference Evaluation
def evaluate_preference_alignment(model, eval_pairs):
    """Evaluate how often model prefers the same response as humans."""
    # reward_model(model, prompt, response) -> float is assumed to be defined elsewhere.
    correct = 0
    total = 0
    for pair in eval_pairs:
        prompt = pair["prompt"]
        human_preferred = pair["chosen"]
        human_rejected = pair["rejected"]
        # Score both responses
        score_chosen = reward_model(model, prompt, human_preferred)
        score_rejected = reward_model(model, prompt, human_rejected)
        # Count agreement with the human label; a tie counts as disagreement
        if score_chosen > score_rejected:
            correct += 1
        total += 1
    # Avoid a ZeroDivisionError on an empty evaluation set
    if total == 0:
        raise ValueError("eval_pairs is empty")
    return correct / total
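An agreement rate computed on a few hundred pairs carries real sampling noise, so it may help to report an interval alongside the point estimate. The sketch below uses the Wilson score interval, which is one common choice and not something the chapter prescribes. The helper name `wilson_interval` and the example counts are assumptions.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: the scores agreed with humans on 80 of 100 pairs.
low, high = wilson_interval(80, 100)
print(f"agreement 0.80, interval {low:.3f} to {high:.3f}")
```

If the intervals of two models overlap heavily, the evaluation set is probably too small to separate them.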
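def evaluate_safety_behavior(model, test_prompts):
    """Evaluate safety-related behavior patterns."""
    # is_benign, is_jailbreak, is_harmful, model_refused and handled_appropriately
    # are assumed to be classifier helpers defined elsewhere.
    results = {
        "benign_refusals": 0,
        "jailbreak_successes": 0,
        "harmful_request_handling": 0,
        "total": len(test_prompts)
    }
    for prompt in test_prompts:
        response = model.generate(prompt)
        # Over-refusal: a harmless request was declined
        if is_benign(prompt) and model_refused(response):
            results["benign_refusals"] += 1
        # Safety failure: a jailbreak attempt was not refused
        if is_jailbreak(prompt) and not model_refused(response):
            results["jailbreak_successes"] += 1
        # Desired behavior: a harmful request was handled appropriately
        if is_harmful(prompt) and handled_appropriately(response):
            results["harmful_request_handling"] += 1
    # These are raw counts; divide each by the number of prompts in its own
    # category (not by "total") to obtain rates.
    return results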
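The counters above share one "total", but each one only makes sense relative to the prompts of its own category. One way to turn such outcomes into rates is sketched here. The record layout (a category name plus a boolean saying whether the event of interest occurred) and the name `per_category_rates` are assumptions for illustration.

```python
from collections import defaultdict


def per_category_rates(records):
    """Turn (category, event_occurred) records into one rate per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, event_occurred in records:
        totals[category] += 1
        hits[category] += int(event_occurred)
    # Each rate uses the prompt count of its own category as the denominator.
    return {category: hits[category] / totals[category] for category in totals}


# Hypothetical outcomes: True means "refused" for benign prompts, "succeeded"
# for jailbreaks, and "handled appropriately" for harmful prompts.
records = [
    ("benign", False), ("benign", True), ("benign", False), ("benign", False),
    ("jailbreak", True), ("jailbreak", False),
    ("harmful", True), ("harmful", True), ("harmful", False),
]
rates = per_category_rates(records)
print(rates["benign"], rates["jailbreak"])  # 0.25 0.5
```

Note that the three rates point in different directions: lower is better for benign refusals and jailbreak successes, higher is better for harmful-request handling.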
A/B Testing for Alignment Changes
Compare two model versions on human preference:
# Collect paired comparisons
python collect_preference_data.py \
    --model-a base_model \
    --model-b aligned_model \
    --prompts eval_prompts.json \
    --output comparison_results.json

# Analyze statistical significance
python analyze_preference.py \
    --data comparison_results.json \
    --min-samples 200
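The chapter does not show the analysis script itself, so the following is only a sketch of one test such an analysis might run: an exact two-sided sign test on the paired comparisons, with ties dropped beforehand. The function name `sign_test_p_value` and the example counts are assumptions.

```python
import math


def sign_test_p_value(wins_b, n):
    """Exact two-sided binomial test of H0: each model wins a comparison with p = 0.5.

    wins_b is the number of comparisons won by model B, n the number of
    non-tied comparisons.
    """
    if n <= 0 or not 0 <= wins_b <= n:
        raise ValueError("need n > 0 and 0 <= wins_b <= n")
    observed = math.comb(n, wins_b)
    # Sum the probability of every outcome at least as unlikely as the observed one.
    # Under p = 0.5 each outcome k has probability comb(n, k) / 2**n.
    tail = sum(math.comb(n, k) for k in range(n + 1) if math.comb(n, k) <= observed)
    return tail / 2**n


# Hypothetical result: the aligned model won 120 of 200 non-tied comparisons.
p_value = sign_test_p_value(120, 200)
print(f"p = {p_value:.4f}")
```

With a small p-value the preference for one model is unlikely to be noise. With a large one, collecting more comparisons is usually the next step.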
Red-Teaming Evaluations
Adversarial evaluation finds alignment failures:
def red_team_evaluation(model, attack_budget=1000):
    """Automate red-teaming with attack generation."""
    # ATTACK_CATEGORIES, generate_attack, classify_severity and aggregate_results
    # are assumed to be defined elsewhere.
    attacks = []
    # Split the budget evenly across categories; any remainder is left unused
    per_category = attack_budget // len(ATTACK_CATEGORIES)
    for category in ATTACK_CATEGORIES:
        for _ in range(per_category):
            # Generate attack prompt
            attack = generate_attack(model, category)
            # Test response
            response = model.generate(attack)
            # Score severity
            severity = classify_severity(response)
            attacks.append({
                "attack": attack,
                "response": response,
                "severity": severity,
                "category": category
            })
    return aggregate_results(attacks)
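The `aggregate_results` helper is not defined in this chapter. One plausible shape for it, assuming `severity` is an integer on a 0 to 5 scale and that anything at or above a threshold counts as a failure, is sketched below. The threshold parameter and the output fields are assumptions, not part of the original code.

```python
from collections import defaultdict


def aggregate_results(attacks, severity_threshold=3):
    """Summarize red-team records per attack category."""
    by_category = defaultdict(list)
    for record in attacks:
        by_category[record["category"]].append(record["severity"])

    summary = {}
    for category, severities in by_category.items():
        # An attempt counts as a failure when its severity reaches the threshold.
        failures = sum(1 for s in severities if s >= severity_threshold)
        summary[category] = {
            "attempts": len(severities),
            "failures": failures,
            "failure_rate": failures / len(severities),
            "max_severity": max(severities),
        }
    return summary


# Hypothetical records in the same layout that red_team_evaluation builds.
sample = [
    {"attack": "a1", "response": "r1", "severity": 0, "category": "roleplay"},
    {"attack": "a2", "response": "r2", "severity": 4, "category": "roleplay"},
    {"attack": "a3", "response": "r3", "severity": 1, "category": "encoding"},
    {"attack": "a4", "response": "r4", "severity": 2, "category": "encoding"},
]
report = aggregate_results(sample)
print(report["roleplay"]["failure_rate"])  # 0.5
```

Reporting the maximum severity next to the failure rate keeps a single severe failure from being averaged away.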
EXERCISE
Build a simple evaluation suite with 50 test prompts covering safety, helpfulness, and honesty categories. Score your aligned model on each category and identify the weakest area.