RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts: there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend.

© 2026 runlocalai.co · Independently operated
Large language models

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) is a method for fine-tuning language models to align with human preferences without an explicit reinforcement learning (RL) loop. Unlike RLHF, which trains a separate reward model and then optimizes the policy with PPO, DPO optimizes the model directly on pairs of preferred and dispreferred responses using a binary cross-entropy loss. This removes reward model training and on-policy sampling, which typically makes DPO cheaper to run and more stable to train. Operators encounter DPO when fine-tuning models like Llama 3 or Mistral on preference datasets (e.g., Anthropic HH-RLHF) to improve helpfulness or reduce harmful outputs.

Deeper dive

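DPO reframes preference learning as a supervised learning problem. Given a dataset of prompts, each with two responses (chosen and rejected), DPO updates the model to raise the log-probability of the chosen response relative to the rejected one, with both measured against a frozen reference model. A parameter β controls how strongly the model is held to that reference: a higher β keeps the policy closer, a lower β lets it drift further. The reference is usually the starting checkpoint, typically a supervised fine-tuned model. The key insight is that the optimal policy under the Bradley-Terry preference model can be expressed in closed form, so the reward can be rewritten in terms of the policy itself and the RL step is bypassed. In practice, each training step needs forward passes through the policy and the reference model for both responses in a pair, plus a backward pass through the policy only. There is no reward model and no value network, so the memory footprint is smaller than PPO's, although the reference model still has to be held in memory or have its log-probabilities precomputed. Operators using Hugging Face TRL can run DPO with a few lines of code, and it often achieves comparable or better alignment than PPO while being simpler to tune.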
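For reference, the objective described above can be written out as follows. The notation is introduced here for illustration and is not taken from this page: x is the prompt, y_w the chosen response, y_l the rejected response, π_θ the policy being trained, π_ref the frozen reference, σ the logistic sigmoid, and D the preference dataset.

```latex
% Closed-form optimum of the KL-constrained reward maximization problem
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left(\frac{1}{\beta}\, r(x, y)\right)

% DPO loss obtained by substituting the implied reward into the
% Bradley-Terry preference likelihood (the partition function Z(x) cancels)
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The term inside the sigmoid is the difference between the implicit rewards of the chosen and rejected responses, so minimizing the loss widens that margin while β sets how strongly the policy is held to the reference.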

Practical example

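An operator wants to fine-tune Llama 3.1 8B to be more concise. They collect 1,000 prompts and, for each, generate two responses: a concise one (chosen) and a verbose one (rejected). Using Hugging Face TRL's DPOTrainer, they load the base model as the policy, use the same base as the frozen reference, set β=0.1, and train for 1 epoch on a single RTX 4090 (24 GB VRAM). Two 16-bit copies of an 8B model need roughly 32 GB for weights alone, so fitting in 24 GB in practice means a parameter-efficient setup such as LoRA on a quantized base, where the base with adapters disabled serves as the reference. Training takes ~2 hours and yields a model that produces shorter answers, ideally without sacrificing quality, which is worth verifying on held-out prompts.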
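A rough way to sanity-check whether a setup like this fits on a 24 GB card is to estimate weight memory alone. The sketch below is back-of-the-envelope arithmetic, not a measurement: the parameter count is rounded to 8 billion, the helper name weights_gib is hypothetical, and gradients, optimizer state, activations, and quantization overhead are all ignored.

```python
def weights_gib(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 8e9   # assumption: ~8B parameters, rounded
VRAM_GIB = 24    # RTX 4090

# Policy and reference both held in 16-bit precision (2 bytes per parameter).
bf16_one_copy = weights_gib(N_PARAMS, 2)
bf16_two_copies = 2 * bf16_one_copy

# A single 4-bit quantized base (0.5 bytes per parameter), as used with
# LoRA-style adapters where the base doubles as the reference.
four_bit_base = weights_gib(N_PARAMS, 0.5)

print(f"one 16-bit copy:   {bf16_one_copy:.1f} GiB")
print(f"two 16-bit copies: {bf16_two_copies:.1f} GiB (fits: {bf16_two_copies < VRAM_GIB})")
print(f"one 4-bit base:    {four_bit_base:.1f} GiB (fits: {four_bit_base < VRAM_GIB})")
```

Under these assumptions the two full 16-bit copies exceed 24 GiB before any training state is counted, while a single 4-bit base leaves headroom for adapters and activations.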

Workflow example

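In a typical DPO workflow, an operator first prepares a preference dataset in JSON format with 'prompt', 'chosen', and 'rejected' fields. They then run a script using Hugging Face Transformers and TRL. In recent TRL releases the call looks roughly like: from trl import DPOConfig, DPOTrainer; trainer = DPOTrainer(model=model, ref_model=ref_model, args=DPOConfig(output_dir="dpo-out", beta=0.1), train_dataset=train_dataset, processing_class=tokenizer); trainer.train(). Here "dpo-out" is a placeholder path, and older releases accepted beta directly on DPOTrainer and took the tokenizer as tokenizer=, so check the signature for the installed version. The training loop computes log-probabilities for both chosen and rejected responses under the policy and the reference model, calculates the DPO loss, and updates the policy weights. After training, the model can be saved and used with Ollama or llama.cpp by converting to GGUF format (merging any LoRA adapters into the base weights first).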
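To make the dataset shape and the loss step concrete, here is a minimal sketch in plain Python. It is not TRL's implementation: the function name dpo_loss, the example record, and the log-probability values are all made up for illustration, and a real trainer would compute the summed token log-probabilities with the model and operate on batched tensors.

```python
import json
import math

# One preference record with the three fields described above (example text).
record = {
    "prompt": "Explain what a GPU does.",
    "chosen": "A GPU runs many simple calculations in parallel.",
    "rejected": "Well, to begin with, it is worth noting that a GPU is ...",
}
line = json.dumps(record)  # one line of a JSONL preference file


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted probability toward the chosen response: loss drops.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(loss_at_start, loss_after_shift)
```

At the start of training the policy equals the reference, so every pair contributes log(2) to the loss; the loss falls only as the policy raises the chosen response's log-probability relative to the rejected one, measured against the reference.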

Reviewed by Fredoline Eruo. See our editorial policy.
