15. Constitutional AI
Constitutional AI (CAI) aligns models by training them to critique and revise their own outputs according to a written set of principles. This approach reduces reliance on human labels for every training example.
The CAI Training Process
class ConstitutionalAI:
"""Implement Constitutional AI training pipeline."""
def __init__(self, model, principles: list[str]):
self.model = model
self.principles = principles
def generate_critique(self, response: str, prompt: str) -> str:
"""Generate critique based on constitutional principles."""
critique_prompt = f"""Review the following response and identify
any ways it violates the principles below.
Principles:
{chr(10).join(f'{i+1}. {p}' for i, p in enumerate(self.principles))}
Prompt: {prompt}
Response: {response}
Identify specific violations:"""
return self.model.generate(critique_prompt)
def generate_revision(self, response: str, critique: str) -> str:
"""Revise response to address critique."""
revision_prompt = f"""Please revise the response below to address
the identified critique.
Original response: {response}
Critique: {critique}
Revised response:"""
return self.model.generate(revision_prompt)
def preference_dataset_from_constitutional(
self, prompt: str, initial_response: str
) -> tuple[str, str, str]:
"""Create preference pair from constitutional critique cycle."""
critique = self.generate_critique(initial_response, prompt)
revision = self.generate_revision(initial_response, critique)
# Preference: revised response is better than initial
return prompt, initial_response, revision
Training with Preference Pairs
def train_constitutional_preference_model(
base_model, constitutional_ai, training_prompts, learning_rate=1e-5
):
"""Fine-tune model using constitutional preference pairs."""
optimizer = torch.optim.AdamW(base_model.parameters(), lr=learning_rate)
for prompt in training_prompts:
initial = base_model.generate(prompt)
prompt, response_a, response_b = constitutional_ai.preference_dataset_from_constitutional(
prompt, initial
)
# Reward model prefers revision (response_b)
reward_a = compute_safety_reward(prompt, response_a)
reward_b = compute_safety_reward(prompt, response_b)
preference_target = torch.tensor([0.0, 1.0]) # B preferred over A
# Compute preference loss
loss = preference_loss(
reward_a, reward_b, preference_target
)
loss.backward()
optimizer.step()
optimizer.zero_grad()
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Define a custom constitution of 5 principles for a domain-specific application. Run the critique-revision cycle on 20 challenging prompts and measure how often revisions actually improve safety.