
Reinforcement Learning

RL for game-playing, robotics, and LLM alignment: PPO and DPO, plus post-training with RLHF and RLAIF.

Setup walkthrough

  1. pip install gymnasium stable-baselines3 (Gymnasium for environments, SB3 for RL algorithms).
  2. Train a PPO agent on CartPole (the "hello world" of RL):
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=50_000)
model.save("ppo_cartpole")

# Test the trained policy
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    # Gymnasium's step() returns (obs, reward, terminated, truncated, info)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        break
  3. Your first trained agent takes 30-90 seconds on CPU; CartPole typically solves within a minute.
  4. For more complex environments: Atari games (gym.make("ALE/Breakout-v5")) train CNN-based PPO in 1-4 hours on GPU.
  5. For continuous control (robotics): MuJoCo environments (gym.make("HalfCheetah-v4")) or Isaac Gym for GPU-parallel RL.
  6. For post-training LLM alignment: TRL (pip install trl) for RLHF/DPO training. DPO training on a 7B model takes 1-4 hours on a 24 GB GPU with LoRA.
  7. For research: pip install torchrl (Meta's RL library) or pip install skrl for multi-agent, multi-GPU RL.
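PPO's core trick, hidden behind model.learn above, is the clipped surrogate objective: each update is limited by clipping the new-to-old policy probability ratio. A minimal single-sample sketch in plain Python (the epsilon default matches SB3's clip_range; the numbers are illustrative):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    ratio = pi_new(a|s) / pi_old(a|s); clipping keeps one gradient step
    from moving the policy too far from the data-collecting policy.
    """
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# A 50% probability increase on a positive-advantage action is capped at 1+eps:
print(ppo_clip_objective(1.5, 1.0))  # 1.2
```

The min with the unclipped term means the clip only removes the incentive to move further, never the penalty for moving in the wrong direction.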

The cheap setup

RL training cost varies wildly with environment complexity. CartPole and other classic-control tasks train on any CPU, a $300 laptop included; CNN-based Atari PPO is better run on a GPU (1-4 hours, as above). For GPU-accelerated RL (Isaac Gym, sample-efficient DRL): a used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) trains MuJoCo locomotion policies in 30-60 minutes. For LLM alignment (DPO/RLHF): the same GPU trains DPO on a 7B model with QLoRA in 2-4 hours. Total: ~$400-500. RL at $400 covers toy-to-medium environments and small-model alignment. Large-scale RL (million-timestep training, 70B alignment) needs 24+ GB.
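The DPO objective mentioned above reduces to a logistic loss on log-probability margins, which is why it fits on modest GPUs: no reward model, no rollouts. A minimal sketch in plain Python (the beta default and log-prob values are illustrative, not taken from TRL):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Arguments are per-sequence log-probabilities under the policy being
    trained (pi_*) and the frozen reference model (ref_*).
    """
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Zero margin: the policy agrees with the reference, loss is log(2)
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 4))  # 0.6931
```

The loss falls as the policy widens the chosen-over-rejected gap relative to the reference; beta controls how hard it is pushed away from that reference.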

The serious setup

Used RTX 3090 24 GB (~$700-900, see /hardware/rtx-3090). Isaac Gym parallel RL: runs 4096+ parallel environments on a single GPU, training locomotion policies in minutes. For LLM alignment: DPO on 7B models in 1-2 hours, on 32B models (QLoRA) in 4-8 hours. For multi-agent RL (MALib, RLlib): the RTX 3090 serves as the policy-network trainer with CPU workers collecting experience. A dual-RTX 3090 build adds headroom for PPO with large policy networks. Total: ~$1,800-2,200. RL at this tier is viable for research: it is the same hardware used for RLHF post-training at AI labs, just with fewer GPUs.

Common beginner mistake

The mistake: Training PPO on CartPole for 500K timesteps, seeing a perfect reward, then deploying the policy on a real robot, where it immediately crashes.

Why it fails: CartPole is a toy: perfect state observations, zero latency, deterministic physics, no sensor noise. The PPO policy learns to exploit simulator specifics; it jerks the cart at precise timings that don't exist in reality. The sim-to-real gap is the fundamental challenge of RL.

The fix: Add domain randomization from day one. Randomize physics parameters (mass, friction, actuator delay), add observation noise, and randomize initial conditions. If your policy works across randomized environments, it has a chance at sim-to-real transfer. Never train on a single deterministic environment and expect real-world deployment. RL policies are world-class at exploiting simulator bugs; give them a robust simulator to exploit.
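A domain-randomization pass can be as simple as resampling physics and noise parameters at every episode. A hedged sketch in plain Python (parameter names and ranges are illustrative; map them onto whatever your simulator actually exposes):

```python
import random

def sample_episode_params(rng):
    """Resample simulator parameters at every env.reset()."""
    return {
        "cart_mass": rng.uniform(0.8, 1.2),         # +/-20% around nominal
        "friction": rng.uniform(0.0, 0.05),
        "actuator_delay_steps": rng.randint(0, 2),  # whole control steps
        "obs_noise_std": rng.uniform(0.0, 0.02),    # Gaussian sensor noise
    }

def noisy_observation(obs, std, rng):
    """Corrupt each observation dimension before the policy sees it."""
    return [x + rng.gauss(0.0, std) for x in obs]

rng = random.Random(0)
params = sample_episode_params(rng)
```

Apply the sampled parameters when the environment resets and pass observations through noisy_observation before model.predict; a policy that stays performant across these draws is a far better sim-to-real candidate.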

Recommended setup for reinforcement learning

Recommended runtimes

Browse all tools for runtimes that fit this workload.

Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
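The VRAM ceiling is easy to model on the back of an envelope: weights plus full-context KV cache dominate at inference time. A rough sketch (the example uses a Llama-2-7B-like shape as an assumption; activation memory and framework overhead are ignored):

```python
def vram_estimate_gb(params_billions, bytes_per_weight,
                     ctx_len, n_layers, n_kv_heads, head_dim,
                     kv_bytes=2):
    """Rough inference footprint in GB: weights + full-context KV cache."""
    weights = params_billions * 1e9 * bytes_per_weight
    # Factor of 2 covers both K and V tensors per layer
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * ctx_len * kv_bytes
    return (weights + kv_cache) / 1e9

# 7B at 4-bit (~0.5 bytes/weight), 8K context, 32 layers x 32 KV heads x 128 dim:
print(round(vram_estimate_gb(7, 0.5, 8192, 32, 32, 128), 1))  # 7.8
```

Note how the fp16 KV cache at 8K context rivals the 4-bit weights themselves: this is the overhead that spec-sheet VRAM shopping misses.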

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

What breaks first

The errors most operators hit when running reinforcement learning locally. Each links to a diagnose+fix walkthrough.

Before you buy

Verify your specific hardware can handle reinforcement learning before committing money.

Specialized buyer guides
Updated 2026 roundup