17. Constitutional AI
Chapter 17 of 24 · 20 min
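Constitutional AI (CAI) is Anthropic's approach to alignment in which a written set of principles (a "constitution") guides model behavior. The model critiques and revises its own outputs against these principles, and AI-generated feedback replaces much of the human labeling that would otherwise be needed for each training example.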
The Constitutional Framework
A constitution is a set of principles the model should follow:
CONSTITUTION = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that would be most helpful and informative.",
    "Choose the response that is most likely to be factual and accurate.",
    "Avoid responses that are deceptive, manipulative, or evasive.",
    "Prefer responses that acknowledge uncertainty when appropriate.",
    # ... 16 principles total in Anthropic's constitution
]

def constitutional_principle(index):
    # Wrap around so any non-negative step index maps to a principle
    return CONSTITUTION[index % len(CONSTITUTION)]
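The modulo lookup above cycles through the principles in a fixed order. Sampling a principle at random for each step is another option. The following is a minimal sketch under stated assumptions: `sample_principle`, `PRINCIPLES`, and the seeded `random.Random` instance are illustrative names that do not appear in the chapter's code, and the three-item list stands in for the full constitution.

```python
import random

# Stand-in for the full CONSTITUTION list; shortened for illustration.
PRINCIPLES = [
    "Choose the response that is least likely to contain harmful or unethical content.",
    "Choose the response that would be most helpful and informative.",
    "Choose the response that is most likely to be factual and accurate.",
]

def sample_principle(rng, principles=PRINCIPLES):
    """Pick one principle uniformly at random (hypothetical helper)."""
    return rng.choice(principles)

rng = random.Random(0)  # seeded so repeated runs pick the same sequence
picked = [sample_principle(rng) for _ in range(5)]
for p in picked:
    print(p)
```

Passing the random generator in explicitly keeps the sampling reproducible, which helps when comparing training runs.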
The CAI Training Process
Stage 1: Supervised Learning on Helpful Responses
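def cai_stage1_training(model, prompts, helpful_responses, optimizer):
    """Standard SFT on helpful demonstrations (assumes a PyTorch-style optimizer)."""
    for prompt, response in zip(prompts, helpful_responses):
        # compute_sft_loss is not defined here; assumed to return a scalar tensor.
        loss = compute_sft_loss(model, prompt, response)
        optimizer.zero_grad()
        loss.backward()  # backpropagate
        optimizer.step()  # apply the update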
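`compute_sft_loss` is not defined in this chapter. One common choice, assumed here, is the mean negative log-likelihood of the response tokens, with the prompt tokens masked out so the model is trained only on what it should say. The sketch below works on plain Python lists so it runs without a deep learning framework; `sft_loss` and its two arguments are hypothetical names.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response_token: True for response tokens, False for prompt tokens.
    """
    kept = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not kept:
        raise ValueError("no response tokens to train on")
    return -sum(kept) / len(kept)

# Two prompt tokens (masked out) followed by three response tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = sft_loss(log_probs, mask)
print(round(loss, 4))  # 0.9242
```

The two prompt tokens do not affect the result: only the three response-token probabilities (0.5, 0.25, 0.5) enter the average.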
Stage 2: Constitutional Critique and Revision
def constitutional_critique(model, response, principle):
    """Generate critique of response based on constitutional principle."""
    critique_prompt = f"""Review the following response according to this principle:
Principle: {principle}
Response: {response}
Identify specific ways the response violates or fails to meet the principle.
Be specific and constructive."""
    critique = model.generate(critique_prompt)
    return critique

def constitutional_revision(model, response, critique):
    """Revise response based on critique."""
    revision_prompt = f"""Original response: {response}
Critique: {critique}
Rewrite the response to address the critique while maintaining its helpfulness.
Focus on improving the specific issues identified."""
    revised = model.generate(revision_prompt)
    return revised
def cai_stage2_training(model, prompts, initial_responses, optimizer):
    """Train on critiqued and revised responses."""
    for prompt, initial in zip(prompts, initial_responses):
        current = initial
        for principle in CONSTITUTION:
            # Critique the latest version, then revise it, so revisions accumulate
            critique = constitutional_critique(model, current, principle)
            current = constitutional_revision(model, current, critique)
        # Train on the final revision (assumes a PyTorch-style loss and optimizer)
        loss = compute_sft_loss(model, prompt, current)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
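The data flow of the critique and revision step can be traced without a real model. The following is a minimal sketch, assuming a hypothetical `StubModel` whose `generate` method returns numbered placeholder strings; `critique_and_revise` is an illustrative name, and the prompts are shortened versions of the ones above. This sketch chains the revisions, so each principle is applied to the output of the previous revision.

```python
class StubModel:
    """Hypothetical stand-in for a language model; records the prompts it receives."""

    def __init__(self):
        self.calls = []

    def generate(self, prompt):
        self.calls.append(prompt)
        # Return a numbered tag so the data flow is easy to follow.
        return f"output_{len(self.calls)}"


def critique_and_revise(model, response, principles):
    """Apply critique then revision once per principle, chaining the results."""
    current = response
    for principle in principles:
        critique = model.generate(
            f"Principle: {principle}\nResponse: {current}\nCritique the response."
        )
        current = model.generate(
            f"Original response: {current}\nCritique: {critique}\nRewrite the response."
        )
    return current


model = StubModel()
final = critique_and_revise(model, "draft", ["be harmless", "be honest"])
print(final)             # output_4
print(len(model.calls))  # 4
```

Each principle costs two generation calls, one for the critique and one for the revision, so the number of calls grows linearly with the number of principles applied.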
RLAIF: AI-Assisted Preference Generation
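Constitutional AI also enables scalable preference annotation. Instead of human raters, a feedback model compares pairs of responses against the principles, and its judgments become the preference data. This is the basis of Reinforcement Learning from AI Feedback (RLAIF):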
def rlaif_preference_generation(prompt, responses, clm_model):
    """Use a model to generate preferences between responses."""
    # generate_pairs and parse_judgment are helpers not defined in this chapter.
    for response_a, response_b in generate_pairs(responses):
        # Ask model which is better according to principles
        judgment_prompt = f"""Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}
Which response is better according to the principles of helpfulness and harmlessness?
Consider accuracy, safety, and usefulness. Answer with A or B."""
        judgment = clm_model.generate(judgment_prompt)
        # The parser needs both responses to map the verdict to chosen/rejected;
        # it is assumed to return None when the verdict is unclear.
        preference = parse_judgment(judgment, response_a, response_b)
        if preference is None:
            continue
        yield {"prompt": prompt, "chosen": preference.chosen, "rejected": preference.rejected}
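The helpers `generate_pairs` and `parse_judgment` are left undefined above. The following is one possible sketch, not a definitive implementation. It assumes that the parser is given the two responses alongside the judgment text, that a verdict is expressed by the standalone letter A or B, and that an ambiguous judgment is skipped by returning `None`. The `Preference` tuple is a hypothetical container for the `chosen` and `rejected` fields.

```python
import re
from collections import namedtuple
from itertools import combinations

Preference = namedtuple("Preference", ["chosen", "rejected"])

def generate_pairs(responses):
    """Yield every unordered pair of responses."""
    return combinations(responses, 2)

def parse_judgment(judgment, response_a, response_b):
    """Map a free-text verdict to a Preference, or None if it is ambiguous."""
    letters = set(re.findall(r"\b([AB])\b", judgment))
    if letters == {"A"}:
        return Preference(chosen=response_a, rejected=response_b)
    if letters == {"B"}:
        return Preference(chosen=response_b, rejected=response_a)
    return None  # no verdict, or both letters mentioned

pairs = list(generate_pairs(["r1", "r2", "r3"]))
print(pairs)
print(parse_judgment("Response B is better.", "r1", "r2"))
print(parse_judgment("A and B are equally good.", "r1", "r2"))
```

Returning `None` for unclear verdicts is a conservative choice: it drops some pairs but avoids writing guessed labels into the preference data. A judgment that uses "A" as an ordinary word alongside the letter B is also treated as ambiguous, which is why the prompt benefits from asking for a single-letter answer.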
EXERCISE
Implement a simplified Constitutional AI training loop with 4 principles (helpful, harmless, honest, clear). Train for one epoch and compare the resulting model's outputs against those of a standard SFT baseline.