Reinforcement Learning (RL)
Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment, receiving rewards or penalties for its actions. The goal is to maximize cumulative reward over time. In the context of local AI, RL is used to fine-tune language models via techniques like RLHF (Reinforcement Learning from Human Feedback), where a reward model scores outputs and the base model is updated to produce higher-scoring responses. This process typically requires significant compute and is done offline, not during inference on consumer hardware.
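To make "cumulative reward over time" concrete, here is a minimal sketch of a discounted return. The discount factor gamma is an assumption not mentioned above; it is a common convention that weights near-term rewards more heavily than distant ones, and the function name is illustrative only.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards, each discounted by gamma per elapsed time step."""
    total = 0.0
    # Walk backwards so each step folds in the already-discounted future return.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three steps of reward; gamma=0.5 is chosen only to keep the arithmetic easy.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```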
Deeper dive
RL differs from supervised learning: instead of learning from labeled examples, the agent explores actions and learns from feedback that may arrive only after many steps. The core components are the policy (which action to take in a given state), the reward signal, and the value function (the expected future reward from a state). For LLMs, RLHF typically involves three stages: 1) supervised fine-tuning on human demonstrations, 2) training a reward model on human preference comparisons, and 3) optimizing the policy (the LLM) using Proximal Policy Optimization (PPO) to maximize the reward model's score. The aim is to align model outputs with human preferences. Operators rarely run RL training locally because of the VRAM and time requirements: where it fits in memory at all, RLHF on a 7B model can take days on a single high-end GPU. However, they may use models that have already been aligned with RLHF or related preference-tuning methods (e.g., Llama 3.1 Instruct) or run inference with reward models for reranking.
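The PPO step in stage 3 can be sketched with its clipped surrogate objective. This is a simplified illustration in plain Python, not a training loop: the function name, the clip_eps default of 0.2, and the use of per-sample log-probabilities and advantages as inputs are assumptions, and the KL penalty against a reference model that RLHF setups commonly add is left out.

```python
import math


def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of sampled actions."""
    terms = []
    for new, old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new - old)
        # Keep the ratio within [1 - eps, 1 + eps] to limit the update size.
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the clipped and unclipped terms.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# A large ratio with a positive advantage is capped at (1 + clip_eps) * advantage.
print(ppo_clipped_objective([1.0], [0.0], [1.0]))
```

The clipping is what makes the update "proximal": once the new policy has moved far enough from the old one on a sample, that sample stops contributing extra gain to the objective.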
Practical example
Training a 7B model with RLHF on a single RTX 4090 (24 GB VRAM) is impractical: PPO requires the policy, reward model, and reference model to be loaded simultaneously, and at 16-bit precision their weights alone exceed the available VRAM before gradients, optimizer state, and activations are counted. Instead, operators download already aligned models like Llama 3.1 8B Instruct, whose post-training (supervised fine-tuning plus preference optimization) was done by Meta on a cluster. For inference, the operator runs ollama run llama3.1:8b and gets helpful, safe responses without running RL themselves.
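A back-of-the-envelope estimate shows why the three models do not fit. The sketch below assumes 16-bit weights (2 bytes per parameter) and a reward model the same size as the policy; both are assumptions for illustration, and the estimate ignores gradients, optimizer state, and activations, which add substantially more.

```python
def weights_gib(params_billion, bytes_per_param=2):
    """Approximate memory for model weights alone, in GiB."""
    return params_billion * 1e9 * bytes_per_param / 2**30


policy = weights_gib(7)     # the LLM being trained
reference = weights_gib(7)  # frozen copy of the starting model
reward = weights_gib(7)     # assumed here to match the policy's size
total = policy + reference + reward

print(f"{policy:.1f} GiB per model, {total:.1f} GiB for all three (card has 24 GB)")
```

Quantization or a smaller reward model would lower these numbers, but the training-time overhead left out of this estimate pushes in the other direction.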
Workflow example
In a typical workflow, an operator using Hugging Face Transformers might load a reward model for reranking: from transformers import AutoModelForSequenceClassification; reward_model = AutoModelForSequenceClassification.from_pretrained('OpenAssistant/reward-model-deberta-v3-large'). They generate multiple candidate responses from a base model, score each with the reward model, and select the highest-scoring one. This is a lightweight RL-inspired technique, often called best-of-n sampling; it updates no model weights, so it fits on consumer GPUs.
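The reranking step can be sketched as follows. The helper names best_of_n and make_reward_scorer are hypothetical, and the scorer assumes the reward model emits a single logit per (prompt, response) pair that can be read as the reward; check the model card before relying on that. The heavy imports are deferred so the selection logic can be tried without downloading a model.

```python
def best_of_n(prompt, candidates, score_fn):
    """Return (best_candidate, scores), where score_fn(prompt, candidate) -> float."""
    scores = [score_fn(prompt, candidate) for candidate in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


def make_reward_scorer(model_name="OpenAssistant/reward-model-deberta-v3-large"):
    """Build a score_fn backed by a sequence-classification reward model."""
    # Imported lazily so best_of_n works without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    def score(prompt, candidate):
        # Encode the prompt and candidate as a sentence pair.
        inputs = tokenizer(prompt, candidate, return_tensors="pt", truncation=True)
        with torch.no_grad():
            # Assumption: one logit per pair, used directly as the reward.
            return model(**inputs).logits[0].item()

    return score


# Toy scorer (longer answer wins) standing in for a real reward model.
best, scores = best_of_n("What is RL?", ["An ML method.", "Agents learn from rewards."],
                         lambda prompt, candidate: float(len(candidate)))
print(best, scores)
```

With a real reward model, the last call would pass make_reward_scorer() as score_fn, and the candidates would come from sampling the base model several times on the same prompt.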
Reviewed by Fredoline Eruo. See our editorial policy.