Large language models

RLHF (Reinforcement Learning from Human Feedback)

RLHF (Reinforcement Learning from Human Feedback) is a training method that fine-tunes a language model using human preferences as a reward signal. After pretraining, the model generates multiple responses to each prompt, and human raters rank them. A reward model is trained to predict these rankings, and the language model is then optimized with reinforcement learning (typically PPO) to maximize that reward. The goal is to align model outputs with human preferences, making them more helpful, harmless, and honest. Operators encounter RLHF indirectly: models like Llama 3.1 Instruct or Mistral 7B Instruct have been instruction-tuned after pretraining (with RLHF or related methods such as supervised fine-tuning and DPO), so they follow instructions better and refuse harmful requests more often than base checkpoints.
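To make the ranking step concrete, here is a minimal sketch of how one human ranking could be expanded into the pairwise records that reward-model training usually consumes. The function name and the example strings are hypothetical. The 'prompt' / 'chosen' / 'rejected' field names follow a common convention for preference datasets and are an assumption here, not something the description above prescribes.

```python
from itertools import combinations


def ranking_to_pairs(prompt, ranked_responses):
    """Expand one human ranking (best response first) into pairwise records."""
    pairs = []
    # combinations() preserves input order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": prompt, "chosen": better, "rejected": worse})
    return pairs


# Hypothetical ranking of three responses to a single prompt.
example = ranking_to_pairs(
    "Explain what a KL penalty is.",
    ["clear, correct answer", "correct but rambling answer", "off-topic answer"],
)
for record in example:
    print(record["chosen"], ">", record["rejected"])
```

A ranking of n responses yields n(n-1)/2 pairs, which is why a modest amount of rater time can produce a fairly large preference dataset.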

Deeper dive

RLHF consists of three stages. First, a supervised fine-tuning (SFT) stage trains the model on high-quality demonstrations to teach basic instruction-following. Second, a reward model is trained on human preference data: for each prompt, the model generates several responses, humans rank them, and the reward model learns to assign higher scores to preferred responses. Third, the SFT model is fine-tuned using reinforcement learning (often Proximal Policy Optimization, PPO) to maximize the reward model's score, while a KL penalty prevents the policy from diverging too far from the SFT model. Alternatives include Direct Preference Optimization (DPO), which skips the explicit reward model and the reinforcement learning loop by optimizing the policy directly on preference pairs.

For operators, RLHF matters because it shapes how 'aligned' a model is: an RLHF-tuned model will refuse harmful prompts and follow instructions more reliably, but may also refuse some benign requests. Running an RLHF-tuned model locally uses the same runtime and hardware as any other model of that size, apart from applying the model's chat template, which tools like Ollama handle automatically. The alignment is baked into the weights.
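The objectives behind these stages are easier to see as formulas. The sketch below writes three of them as plain Python functions on scalar values: the pairwise (Bradley-Terry) reward-model loss, a KL-penalized reward of the kind used during the PPO stage, and the DPO loss for a single preference pair. The function names, the default beta=0.1, and the sample numbers are illustrative assumptions. Real training code applies the same formulas to batches of tensors.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def reward_model_loss(score_chosen, score_rejected):
    """Pairwise (Bradley-Terry) loss: small when the chosen response scores higher."""
    return -log_sigmoid(score_chosen - score_rejected)


def kl_shaped_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting away from the SFT model.

    logp_policy - logp_sft is a single-sample estimate of the KL term;
    beta (an assumed value here) sets how strongly drift is penalized.
    """
    return reward - beta * (logp_policy - logp_sft)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: uses only policy and reference log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


# Illustrative numbers only: the loss shrinks as the chosen response pulls ahead.
print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
print(kl_shaped_reward(reward=1.0, logp_policy=-5.0, logp_sft=-6.0))
print(dpo_loss(-4.0, -7.0, ref_logp_chosen=-5.0, ref_logp_rejected=-5.0))
```

Note that dpo_loss never calls a reward model. The log-probability ratio between the policy and the frozen reference model plays that role, which is what lets DPO drop the second and third stages into one supervised-style step.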

Practical example

The Llama 3.1 8B base model (not RLHF-tuned) might complete a prompt like 'How to pick a lock?' with detailed instructions, because it simply continues the text it is given. The Llama 3.1 8B Instruct model (RLHF-tuned) is more likely to refuse, with a reply along the lines of 'I can't provide instructions for illegal activities.' Both models have the same architecture and VRAM requirements (~5 GB at Q4); only the weights differ, and the Instruct weights encode the alignment. Operators downloading from Hugging Face see 'meta-llama/Meta-Llama-3.1-8B' (base) vs 'meta-llama/Meta-Llama-3.1-8B-Instruct' (RLHF-tuned).
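The ~5 GB figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes roughly 8.0e9 parameters and about 4.5 effective bits per weight for a 4-bit quantization format (the extra half bit stands in for per-block scale factors). Both numbers are assumptions for illustration, and the helper name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed values: ~8.0e9 parameters, ~4.5 effective bits per weight at Q4.
N_PARAMS = 8.0e9
q4 = weight_memory_gb(N_PARAMS, 4.5)
fp16 = weight_memory_gb(N_PARAMS, 16)

print(f"Q4 weights:   ~{q4:.1f} GB")
print(f"FP16 weights: ~{fp16:.1f} GB")
```

The estimate covers weights only. The KV cache and runtime overhead come on top, which is roughly how a Q4 model of this size ends up near 5 GB in practice. The calculation is the same for the base and Instruct checkpoints because the parameter count is the same.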

Workflow example

In a typical workflow, an operator runs ollama pull llama3.1:8b to get the Instruct (RLHF-tuned) model. When querying it, the model's refusal behavior is immediately apparent: asking 'Write a phishing email', for example, returns a refusal. If the operator instead pulls the base model (ollama pull llama3.1:8b-text), it may comply or simply continue the text. RLHF adds no extra component at runtime; the tuned model is just a different set of weights. Operators can also run preference tuning themselves using tools like TRL (Transformer Reinforcement Learning) from Hugging Face, which requires a preference dataset and, for PPO-style training, a separate reward model.
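The comparison above can also be scripted. The following is a minimal sketch that sends the same prompt to both tags through Ollama's local HTTP API using only the Python standard library. It assumes an Ollama server is listening on the default address http://localhost:11434 and that both tags have already been pulled. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local address


def build_request(model, prompt, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt):
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


def compare_models(prompt, tags=("llama3.1:8b", "llama3.1:8b-text")):
    """Print each model's reply to the same prompt, one after the other."""
    for tag in tags:
        print(f"--- {tag} ---")
        print(generate(tag, prompt))


# Requires a running Ollama server with both tags pulled:
# compare_models("How to pick a lock?")
```

Nothing in the request distinguishes an RLHF-tuned model from a base one. Only the tag changes, which is the practical meaning of the alignment living in the weights.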

Reviewed by Fredoline Eruo. See our editorial policy.