AI Alignment

AI alignment is the problem of ensuring that a model's outputs match the operator's intended goals and values. In practice, a poorly aligned model might generate harmful, biased, or off-topic responses even when prompted correctly. Alignment techniques include fine-tuning on curated datasets (e.g., instruction tuning), reinforcement learning from human feedback (RLHF), and system-level guardrails such as prompt filters. For local operators, alignment matters because a model that behaves unpredictably wastes compute and may produce outputs that require manual review. Running a well-aligned model (e.g., Llama 3.1 Instruct) reduces the need for post-generation filtering, though it does not remove it.
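As a rough illustration of the system-level guardrails mentioned above, a prompt filter can be sketched as a check that runs before the model is ever called. The pattern list, the function names, and the `generate` callable are all hypothetical placeholders; a real deployment would need a far more careful policy than a keyword blocklist.

```python
import re

# Hypothetical blocklist: illustrative patterns only, not a vetted safety policy.
BLOCKED_PATTERNS = [
    r"\bphishing\b",
    r"\bcredit card numbers?\b",
]

BLOCKED_MESSAGE = "Request blocked by prompt filter."


def is_prompt_allowed(prompt: str) -> bool:
    """Return False if the prompt matches any blocked pattern (case-insensitive)."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKED_PATTERNS)


def guarded_generate(prompt: str, generate) -> str:
    """Call `generate` (any function mapping prompt -> text) only if the prompt passes."""
    if not is_prompt_allowed(prompt):
        return BLOCKED_MESSAGE
    return generate(prompt)
```

Here `generate` stands in for whatever function sends the prompt to the local model. A filter like this complements an aligned model rather than replacing it, since keyword matching is easy to evade.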

Alignment is a broad research area that addresses the gap between a model's training objective (e.g., next-token prediction) and the operator's actual intent. Early models like GPT-2 could generate toxic text because they were trained only to predict the next token of internet text, with no further training toward human preferences. Modern alignment typically involves two stages: supervised fine-tuning (SFT) on high-quality instruction-response pairs, followed by preference optimization. The second stage is usually RLHF, where a reward model scores outputs and the policy is updated to maximize reward, or direct preference optimization (DPO), a simpler alternative that trains on preference pairs directly without a separate reward model. For local operators, the key practical implication is that aligned models (e.g., Llama 3.1 Instruct, Mistral 7B Instruct) are safer out of the box but may be more censored or refuse certain requests. Unaligned base models (e.g., the base Llama 3.1 checkpoints) offer more flexibility but require the operator to implement their own safety measures. Alignment may also interact with quantization: some operators report that instruction-tuned models retain coherence better at low bit-widths, although this is anecdotal and worth verifying on the operator's own prompts.
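To make the DPO idea concrete, here is a minimal sketch of the per-example DPO loss for a single preference pair, written in plain Python. It assumes the summed log-probabilities of the chosen and rejected responses have already been computed under both the policy being trained and a frozen reference model; the function and argument names are illustrative, and `beta=0.1` is only a commonly used starting value, not a recommendation from this article.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log(sigmoid(beta * (chosen_margin - rejected_margin)))."""
    # How much more the policy favors each response than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow in exp.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

The loss is log(2) when the policy and reference agree, falls below that as the policy shifts probability toward the chosen response, and rises above it when the policy favors the rejected one. No reward model appears anywhere, which is the sense in which DPO is simpler than RLHF.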

An operator running Llama 3.1 8B on an RTX 3060 12 GB might choose the Instruct version (aligned) over the base version. With the Instruct model, a prompt like 'Write a phishing email' will likely be refused. The base model, which simply continues the text it is given, might generate a convincing phishing email, which could be dangerous if used irresponsibly. The trade-off: alignment reduces flexibility but increases safety for general use.

When pulling a model via ollama pull llama3.1:8b, Ollama defaults to the Instruct variant. To get the unaligned base model, the operator must pull an explicit tag instead. Ollama's library usually marks pre-trained variants with text in the tag rather than base, so confirm the exact tag on the model's library page before pulling. In LM Studio, the model card typically indicates whether a model is 'instruct' or 'base'. Operators should check a model's alignment status before deploying it in a chatbot or API endpoint to avoid unexpected outputs.
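The pre-deployment check described above can be partly automated. The sketch below guesses a model's variant from its tag string alone. The marker words are assumptions based on common naming habits, not an official scheme, and the example tags in the tests are hypothetical, so an "unknown" result means the model card still has to be read.

```python
import re

# Assumed naming conventions; publishers are free to use other words.
INSTRUCT_MARKERS = {"instruct", "chat"}
BASE_MARKERS = {"base", "text"}


def guess_variant(tag: str) -> str:
    """Return 'instruct', 'base', or 'unknown' from the tokens in a model tag."""
    # Split on common tag separators so only whole tokens are matched.
    tokens = set(re.split(r"[:\-_/.]", tag.lower()))
    if tokens & INSTRUCT_MARKERS:
        return "instruct"
    if tokens & BASE_MARKERS:
        return "base"
    return "unknown"
```

A deployment script could refuse to start a chatbot endpoint unless `guess_variant` returns "instruct" or the operator explicitly overrides the check. A bare tag with no marker returns "unknown" because the tag by itself does not say which variant the default points to.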

Reviewed by Fredoline Eruo. See our editorial policy.
