RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Classical ML algorithms

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in environments where outcomes are partly random and partly under the control of a decision-maker. It consists of states, actions, transition probabilities (the chance of moving from one state to another given an action), and rewards, usually together with a discount factor that weights near-term reward above distant reward. The goal is to find a policy, a mapping from states to actions, that maximizes expected cumulative reward over time. In local AI, MDPs appear in reinforcement learning (RL) contexts, such as training agents for games or robotics, but are less common in typical LLM workflows.
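For readers who prefer notation, the definition above can be written compactly. This is the common textbook form, not something specific to this catalog; the discount factor γ and the infinite-horizon sum are assumptions of this sketch, and other formulations (finite horizon, average reward) exist.

```latex
\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)

\pi^{*} = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \right], \qquad 0 \le \gamma < 1
```

Here S is the set of states, A the set of actions, P the transition probabilities, R the reward function, and π a policy mapping states to actions.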

Deeper dive

MDPs formalize sequential decision problems. At each step, the agent observes a state, chooses an action, receives a reward, and transitions to a new state according to probabilities that depend only on the current state and action (the Markov property). The solution is an optimal policy. When the transition probabilities and rewards are known, it can be computed with dynamic programming (value iteration, policy iteration); when they are not, RL algorithms (Q-learning, PPO) estimate it from sampled experience. In local AI, MDPs are foundational for RL but rarely used directly with LLMs. However, some advanced LLM applications (e.g., RLHF, tool-use agents) borrow MDP concepts: the state is the conversation context, actions are token generations or tool calls, and rewards come from human feedback or task success. Operators training custom RL agents on local hardware (e.g., using Stable-Baselines3 on a GPU) will encounter MDPs when defining environments.
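As a rough illustration of the dynamic-programming route, here is a minimal value-iteration sketch on a made-up four-state corridor. The layout, the slip probability, the discount factor, and every function name are assumptions chosen for the example; it uses only the Python standard library.

```python
# Toy MDP: a 1-D corridor with 4 states; state 3 is the terminal goal.
# All numbers (layout, slip probability, discount) are illustrative assumptions.
N_STATES = 4
GOAL = 3
ACTIONS = ("left", "right")
SLIP = 0.1    # probability the agent stays put instead of moving
GAMMA = 0.9   # discount factor


def transitions(state, action):
    """Return a list of (probability, next_state, reward) for a state-action pair."""
    if state == GOAL:
        return [(1.0, state, 0.0)]  # terminal state: absorbing, no further reward
    step = 1 if action == "right" else -1
    target = min(max(state + step, 0), N_STATES - 1)  # walls clamp the move
    reward = 1.0 if target == GOAL else 0.0
    return [(1.0 - SLIP, target, reward), (SLIP, state, 0.0)]


def q_value(state, action, values):
    """Expected return of taking `action` in `state`, then following `values`."""
    return sum(p * (r + GAMMA * values[s2]) for p, s2, r in transitions(state, action))


def value_iteration(tol=1e-8):
    """Apply the Bellman optimality backup until the values stop changing."""
    values = [0.0] * N_STATES
    while True:
        new_values = [max(q_value(s, a, values) for a in ACTIONS) for s in range(N_STATES)]
        delta = max(abs(n - v) for n, v in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = [max(ACTIONS, key=lambda a: q_value(s, a, values)) for s in range(N_STATES)]
    return values, policy


values, policy = value_iteration()
```

The greedy policy moves right in every non-terminal state, and the values shrink with distance from the goal because of discounting and slipping. The action reported for the terminal state is arbitrary, since both actions have the same value there.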

Practical example

Suppose you train a simple game-playing agent on an RTX 3060 using Stable-Baselines3. The game is a grid world: states are grid positions, actions are up/down/left/right, transitions are deterministic (or stochastic with a slip probability), and the reward is +1 for reaching the goal. This is an MDP. You define it as a Gymnasium environment, then run PPO for 1 million timesteps. The training loop repeatedly observes a state, samples an action, and receives the next state and reward, which is exactly the MDP cycle.
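The grid world described above could be sketched roughly as follows. This is a hypothetical, dependency-free illustration: the `GridWorld` class mirrors the shape of Gymnasium's `reset()` and `step()` return values but does not import or subclass `gymnasium.Env`, and the grid size, step limit, and zero reward for non-goal steps are assumptions.

```python
import random


class GridWorld:
    """Minimal grid-world MDP with a Gymnasium-style reset/step interface.

    Hypothetical illustration only: it does not subclass gymnasium.Env.
    """

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=4, slip=0.0, max_steps=100, seed=None):
        self.size = size
        self.slip = slip              # chance that a random action replaces the chosen one
        self.max_steps = max_steps
        self.goal = (size - 1, size - 1)
        self.rng = random.Random(seed)
        self.pos = (0, 0)
        self.steps = 0

    def reset(self):
        """Start a new episode in the top-left corner; return (state, info)."""
        self.pos = (0, 0)
        self.steps = 0
        return self.pos, {}

    def step(self, action):
        """Apply one action; return (state, reward, terminated, truncated, info)."""
        if self.rng.random() < self.slip:
            action = self.rng.choice(list(self.MOVES))
        d_row, d_col = self.MOVES[action]
        row = min(max(self.pos[0] + d_row, 0), self.size - 1)  # walls clamp movement
        col = min(max(self.pos[1] + d_col, 0), self.size - 1)
        self.pos = (row, col)
        self.steps += 1
        terminated = self.pos == self.goal
        truncated = not terminated and self.steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


# One pass through the MDP cycle with a fixed action sequence.
env = GridWorld(size=4, slip=0.0, seed=0)
state, _ = env.reset()
trajectory = []
for action in [1, 1, 1, 3, 3, 3]:  # down three times, then right three times
    next_state, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state
    if terminated or truncated:
        break
```

To train on this with Stable-Baselines3 you would additionally subclass `gymnasium.Env` and declare `observation_space` and `action_space`, since the library relies on those to build the policy network.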

Workflow example

When using Hugging Face's trl library for RLHF on a local LLM, the underlying formulation is an MDP. The state is the current conversation prefix, the action is the next token, and the reward comes from a preference model. The PPO trainer samples trajectories (state-action-reward sequences) and updates the policy. In older trl releases, operators see this in code as trainer = PPOTrainer(config, model, ref_model, tokenizer, ...) followed by trainer.step(queries, responses, scores); newer releases restructured the PPO trainer, so the exact calls depend on the installed version. Either way, the MDP is implicit but governs the training loop.
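To make the implicit MDP visible, the following toy rollout treats the token prefix as the state and the next token as the action. It does not use trl at all: `toy_policy`, `toy_reward_model`, the vocabulary, and the scoring rule are hypothetical stand-ins for the LLM and the preference model.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]  # toy vocabulary (assumption)
MAX_NEW_TOKENS = 5


def toy_policy(state, rng):
    """Stand-in for the LLM: choose the next token given the current prefix."""
    return rng.choice(VOCAB)


def toy_reward_model(tokens):
    """Stand-in for a learned preference model: score a finished response."""
    return float(tokens.count("yes")) - float(tokens.count("no"))


def rollout(prompt, rng):
    """Generate one episode as a list of (state, action, reward) steps."""
    state = list(prompt)
    steps = []
    for _ in range(MAX_NEW_TOKENS):
        action = toy_policy(state, rng)
        next_state = state + [action]  # the transition is deterministic: append the token
        response = next_state[len(prompt):]
        done = action == "<eos>" or len(response) == MAX_NEW_TOKENS
        # The reward arrives only once the response is complete.
        reward = toy_reward_model(response) if done else 0.0
        steps.append((tuple(state), action, reward))
        state = next_state
        if done:
            break
    return steps


episode = rollout(["Is", "it", "an", "MDP", "?"], random.Random(0))
```

Each step's next state is just the previous prefix plus the chosen token, so all the randomness sits in the policy rather than the environment. Real RLHF pipelines typically also add a per-token penalty for drifting from the reference model, which this sketch omits.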

Reviewed by Fredoline Eruo. See our editorial policy.
