23. Benchmarking Alignment
Chapter 23 of 24 · 20 min
Standard benchmarks measure capabilities, not alignment. This chapter covers specialized benchmarks for evaluating alignment quality.
Existing Alignment Benchmarks
BBQ: Biases in Question Answering
- Tests for demographic biases in model responses
- Multiple-choice format with known correct answers
BOLD: Bias in Large Language Models
- Tests model outputs for demographic bias across categories
TruthfulQA: Truthfulness evaluation
- Tests whether models generate false statements
- Adversarial questions where models are likely to err
RealToxicityPrompts: Toxicity detection
- Tests tendency to generate toxic content
- Uses Perspective API for scoring
Implementing Custom Alignment Benchmarks
class AlignmentBenchmark:
def __init__(self, model, evaluator_config):
self.model = model
self.categories = {
"safety": SafetyTests(),
"helpfulness": HelpfulnessTests(),
"honesty": HonestyTests(),
"fairness": FairnessTests()
}
def run(self):
results = {}
for category, tests in self.categories.items():
category_results = []
for test in tests:
try:
result = self.run_single_test(test)
category_results.append(result)
except TestError as e:
print(f"Test {test.name} failed: {e}")
category_results.append({"error": str(e)})
results[category] = {
"scores": [r["score"] for r in category_results if "score" in r],
"details": category_results
}
return self.summarize(results)
def run_single_test(self, test):
prompt = test.prompt()
response = self.model.generate(prompt)
return {
"prompt": prompt,
"response": response,
"expected_behavior": test.expected(),
"observed_behavior": test.evaluate(response),
"score": test.score(response)
}
Preference Agreement Benchmark
def preference_agreement_benchmark(model, eval_pairs, human_reference):
"""
Measure how often model preferences match human preferences.
"""
agreements = 0
total = 0
for pair in eval_pairs:
prompt = pair["prompt"]
chosen = pair["chosen"]
rejected = pair["rejected"]
# Get model preference
model_chooses_chosen = model_prefers(model, prompt, chosen, rejected)
# Check against human
human_chooses_chosen = (human_reference[prompt]["preferred"] == "chosen")
if model_chooses_chosen == human_chooses_chosen:
agreements += 1
total += 1
return {
"agreement_rate": agreements / total,
"agreements": agreements,
"total": total
}
Red-Team Benchmarking
# Automated red-team benchmark
python red_team_benchmark.py \
--model aligned_model \
--attack_count 500 \
--categories ["jailbreak", "social_engineering", "harmful_content"] \
--output red_team_results.json
# Generate attack severity report
python analyze_attacks.py \
--results red_team_results.json \
--severity_threshold 0.7 \
--output attack_report.html
Benchmark Comparison Table
| Benchmark | Focus | Format | Limitations |
|---|---|---|---|
| BBQ | Bias detection | MCQ | Limited coverage |
| TruthfulQA | Factuality | Open-ended | Human evaluation needed |
| RealToxicity | Toxicity | Generation | Perspective API dependency |
| Custom | Task-specific | Flexible | No standardization |
EXERCISE
Run your aligned model through TruthfulQA and RealToxicityPrompts benchmarks. Compare results against the base model and document areas of improvement and regression.