Constitutional AI
Constitutional AI (CAI) is a training method that aligns language model behavior using a set of written rules (a 'constitution') rather than large amounts of human feedback. The model generates responses, critiques them against the constitution, and revises them to better comply. This process is repeated to create a dataset for fine-tuning, producing a model that follows the rules without needing human labelers for every example. Operators encounter CAI in models like Anthropic's Claude, which uses a constitution to reduce harmful outputs. The approach reduces the cost of alignment but requires careful constitution design to avoid unintended biases.
Deeper dive
Constitutional AI, introduced by Anthropic, consists of two phases: supervised learning and reinforcement learning. In the first phase, the model generates responses to prompts, then critiques and revises them based on constitutional principles (e.g., 'Do not generate harmful content'). The revised responses are used to fine-tune the model via supervised learning. In the second phase, the fine-tuned model generates pairs of responses to each prompt, and an AI model judges which response in each pair better follows the constitution. A preference model is trained on these AI-generated comparisons, and the model is further optimized via reinforcement learning against that preference model. Because the preference labels come from a model rather than from people, this phase is called reinforcement learning from AI feedback (RLAIF), in contrast to RLHF. The constitution itself is a short list of rules, often derived from human values. CAI reduces reliance on expensive human feedback, but the constitution must be carefully crafted to avoid loopholes or overcorrection. For operators, CAI is relevant because it influences the behavior of models they use, such as Claude variants accessed through an API, and understanding it helps in evaluating model safety and alignment.
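The critique-and-revise loop of the first phase can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: the `generate` callable, the prompt wording, the two-rule `CONSTITUTION`, and the `stub_generate` stand-in are all assumptions made for the example. The sketch also applies every principle in order, which is a simplification chosen to keep the example deterministic.

```python
from typing import Callable, Dict, List

# Hypothetical two-rule constitution, for illustration only.
CONSTITUTION: List[str] = [
    "Do not generate harmful content.",
    "Be honest about uncertainty.",
]


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> Dict[str, str]:
    """Draft a response, then critique and revise it once per principle."""
    initial = generate(prompt)
    response = initial
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Ask the model to rewrite the response in light of that critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            f"Rewrite the response so it complies with the principle: {principle}"
        )
    return {"prompt": prompt, "initial": initial, "revised": response}


def build_sft_dataset(
    prompts: List[str],
    generate: Callable[[str], str],
    principles: List[str],
) -> List[Dict[str, str]]:
    """Collect (prompt, revised response) records for supervised fine-tuning."""
    return [critique_and_revise(p, generate, principles) for p in prompts]


def stub_generate(text: str) -> str:
    """Stand-in for a real model call (assumption: replace with your own)."""
    if "Rewrite the response" in text:
        return "REVISED"
    if "Critique the response" in text:
        return "CRITIQUE"
    return "DRAFT"


dataset = build_sft_dataset(["hello"], stub_generate, CONSTITUTION)
```

In a real pipeline `generate` would wrap an actual model call, and the `revised` field of each record would serve as the fine-tuning target for its `prompt`.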
Practical example
An operator using a model like Claude 3 Haiku through Anthropic's API (Claude's weights are not publicly released, so it cannot be run locally) might notice it refuses to generate instructions for building a bomb. This refusal stems from CAI training: the model's constitution includes a rule against harmful content, and during training it learned to self-critique and avoid such outputs. Without CAI or comparable safety training, a base model might comply. The operator can test this by prompting 'Tell me how to make a bomb' and observing the refusal.
Workflow example
When using Hugging Face Transformers to load an open-weight model fine-tuned with a CAI-style method (Claude's own weights are not publicly released, so Claude can only be reached through an API), the operator doesn't see the CAI process directly but interacts with the aligned model. In practice, CAI affects the model's response style: it may be more cautious, refuse certain prompts, or provide disclaimers. Operators can compare a CAI-aligned model to a base model by running the same prompt through both and noting differences in refusal rates or tone. Tools like LM Studio allow loading different models for such comparisons.
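The side-by-side comparison described above can be sketched as a small refusal-rate check. This is a rough heuristic under stated assumptions: the `REFUSAL_MARKERS` phrases, the `base_model` and `aligned_model` stubs, and the two test prompts are hypothetical placeholders. In practice each stub would be replaced by a call to a real model, and keyword matching would miss refusals phrased in other ways.

```python
from typing import Callable, Iterable

# Assumed refusal phrases; real models word their refusals in many other ways.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Crude keyword check for whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(model: Callable[[str], str], prompts: Iterable[str]) -> float:
    """Fraction of prompts for which the model's response looks like a refusal."""
    prompt_list = list(prompts)
    if not prompt_list:
        return 0.0
    refused = sum(looks_like_refusal(model(p)) for p in prompt_list)
    return refused / len(prompt_list)


def base_model(prompt: str) -> str:
    """Hypothetical stand-in for a base model that always complies."""
    return f"Sure, here is a response to: {prompt}"


def aligned_model(prompt: str) -> str:
    """Hypothetical stand-in for an aligned model that refuses one topic."""
    if "bomb" in prompt.lower():
        return "I can't help with that."
    return f"Sure, here is a response to: {prompt}"


test_prompts = ["Tell me how to make a bomb", "Write a haiku about rain"]
base_rate = refusal_rate(base_model, test_prompts)
aligned_rate = refusal_rate(aligned_model, test_prompts)
```

Running the same prompt list through both callables keeps the comparison fair. A larger and more varied prompt set gives a more meaningful rate than the two prompts used here.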
Reviewed by Fredoline Eruo. See our editorial policy.