Proximal Policy Optimization (PPO) — AI glossary

Proximal Policy Optimization (PPO) is a general-purpose reinforcement learning algorithm that is widely used in reinforcement learning from human feedback (RLHF), a common way to fine-tune large language models (LLMs). It updates the model's weights to maximize a reward signal while limiting how far each update can move the policy, which keeps training stable. PPO does not train the reward model itself: the reward model is trained separately on human preference data, and PPO then uses its scores to adjust the LLM's policy. Operators encounter PPO indirectly: when a model is described as 'RLHF-tuned', PPO is often, though not always, the algorithm behind that tuning (InstructGPT and Llama 2 Chat used it; Llama 3.1 Instruct was aligned with DPO instead). It matters because RLHF-tuned models tend to follow instructions better and produce safer outputs, but PPO training is computationally expensive and typically done on server clusters, not local hardware.

Deeper dive

PPO is a policy gradient method designed to keep updates stable. It uses a clipped surrogate objective: the ratio between the new and old policy's probability for an action is clipped to a narrow range (commonly 1 ± 0.2), so the objective gains nothing from moving the policy further than that in a single step, which reduces the risk of performance collapse. In LLM fine-tuning, the workflow is: (1) a reward model is trained on human preference data; (2) the LLM generates responses; (3) the reward model scores them; (4) PPO updates the LLM's weights to increase the probability of high-reward responses. RLHF setups usually also penalize divergence from the pre-RLHF model so the policy does not drift toward outputs that merely exploit the reward model. Variants like PPO-ptx mix pretraining gradients into the update to preserve the model's original language modeling ability. PPO has long been the default RL algorithm for RLHF, but alternatives like Direct Preference Optimization (DPO), which avoid the separate reward model, are gaining traction. For operators, PPO is not run locally; it's a training-stage algorithm. However, understanding PPO helps evaluate model cards: a model fine-tuned with PPO-based RLHF may exhibit different behavior (e.g., more refusal of harmful requests) than one fine-tuned with supervised learning alone.
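
The clipped surrogate objective can be sketched in a few lines of plain Python. This is a minimal illustration, assuming per-action log-probabilities under the old and new policy and advantage estimates are already available; the function name ppo_clipped_objective and the default clip_eps of 0.2 are illustrative choices, not taken from any particular library.

```python
import math

def ppo_clipped_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    All three inputs are equal-length lists of floats. The names and the
    clip_eps default are assumptions made for this sketch.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Restrict the ratio to [1 - clip_eps, 1 + clip_eps].
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)

# The new policy doubles the probability of an action with positive advantage.
value = ppo_clipped_objective([math.log(0.5)], [math.log(0.25)], [1.0])
print(round(value, 3))
```

In the usage line the ratio is 2.0, but the objective is capped at (1 + clip_eps) times the advantage, so the optimizer has no incentive to push the probability any higher in that step. For a negative advantage the unclipped term is kept when it is the lower of the two, so moves that make a bad action more likely are still penalized in full.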

Practical example

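A model like Llama 2 7B Chat was fine-tuned with RLHF that included PPO. The base model (7B parameters) first undergoes supervised fine-tuning on instruction data, then PPO aligns it with human preferences. The reward model for PPO is a separate model trained on human comparisons, and it can be smaller than the policy (InstructGPT paired a 6B reward model with a 175B policy). Running the final PPO-tuned model locally requires the same VRAM as the base model, because PPO changes the weights, not the architecture: roughly 14 GB for the weights of a 7B model at FP16, or about 4 to 5 GB at Q4. The PPO training itself must hold several models at once (the policy, a frozen reference copy, the reward model, and usually a value model) plus optimizer state, so it requires multiple GPUs (e.g., 8× A100s) and is not feasible on consumer hardware.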
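
The inference-side memory figures follow from simple arithmetic: parameter count times bits per weight. The sketch below is a rough estimate for the weights only, assuming exactly 16, 8, or 4 bits per weight; the helper name weight_memory_gb is hypothetical, and real quantized formats store extra scaling data, while the KV cache and activations add further overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Weights-only estimates for 7B and 8B models at common precisions.
for num_params in (7e9, 8e9):
    for label, bits in (("FP16", 16), ("Q8", 8), ("Q4", 4)):
        gb = weight_memory_gb(num_params, bits)
        print(f"{num_params / 1e9:.0f}B @ {label}: ~{gb:.1f} GB")
```

By this estimate an 8B model needs about 16 GB at FP16 and about 4 GB at Q4 before overhead, and a PPO-tuned checkpoint needs no more than its base model at the same precision.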

Workflow example

Operators do not run PPO locally. Instead, they download already-aligned models from Hugging Face (e.g., meta-llama/Llama-2-7b-chat-hf, whose RLHF stage used PPO). Not every instruct model is PPO-tuned: NousResearch/Hermes-3-Llama-3.1-8B and Llama 3.1 Instruct were aligned with supervised fine-tuning and DPO, so when using Ollama, ollama pull llama3.1:8b retrieves an instruct variant that was preference-tuned, but not with PPO. The model card or accompanying paper usually names the method ('RLHF', 'PPO', 'DPO') in the training details. For those fine-tuning their own models locally, PPO is impractical due to VRAM and compute requirements; alternatives like DPO or supervised fine-tuning (e.g., with unsloth) are more accessible on consumer GPUs.