Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in environments where outcomes are partly random and partly under the control of a decision-maker. It consists of states, actions, transition probabilities (the chance of moving from one state to another given an action), and rewards. The goal is to find a policy, a mapping from states to actions, that maximizes expected cumulative reward over time, often with a discount factor that weights near-term rewards more heavily. In local AI, MDPs appear in reinforcement learning (RL) contexts, such as training agents for games or robotics, but are less common in typical LLM workflows.
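To make the four components concrete, here is a minimal sketch of a two-state MDP written as plain Python dictionaries, plus a function that follows a fixed policy and sums the rewards. The state names, action names, probabilities, and reward values are all invented for illustration; nothing here comes from a particular library.

```python
import random

# Hypothetical two-state MDP, invented for illustration.
STATES = ["idle", "busy"]
ACTIONS = ["wait", "work"]

# TRANSITIONS[(state, action)] -> list of (next_state, probability)
TRANSITIONS = {
    ("idle", "wait"): [("idle", 1.0)],
    ("idle", "work"): [("busy", 0.9), ("idle", 0.1)],
    ("busy", "wait"): [("idle", 0.5), ("busy", 0.5)],
    ("busy", "work"): [("busy", 1.0)],
}

# REWARDS[(state, action)] -> immediate reward for taking that action
REWARDS = {
    ("idle", "wait"): 0.0,
    ("idle", "work"): -1.0,
    ("busy", "wait"): 0.0,
    ("busy", "work"): 2.0,
}

# A deterministic policy: one action per state.
POLICY = {"idle": "work", "busy": "work"}


def rollout(policy, start_state, steps, rng):
    """Follow the policy for a fixed number of steps and return the total reward."""
    state, total = start_state, 0.0
    for _ in range(steps):
        action = policy[state]
        total += REWARDS[(state, action)]
        # Sample the next state from the transition distribution.
        next_states, probs = zip(*TRANSITIONS[(state, action)])
        state = rng.choices(next_states, weights=probs, k=1)[0]
    return total


if __name__ == "__main__":
    print(rollout(POLICY, "idle", steps=20, rng=random.Random(0)))
```

Because the next state is sampled, two rollouts with different seeds can return different totals; a policy is judged by its expected total, not by a single run.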
Deeper dive
MDPs formalize sequential decision problems. At each step, the agent observes a state, chooses an action, receives a reward, and transitions to a new state according to probabilities that depend only on the current state and action (the Markov property). The solution is an optimal policy. When the transition probabilities and rewards are known, it can be computed with dynamic programming (value iteration, policy iteration); when they are not, RL algorithms (Q-learning, PPO) learn a policy from sampled experience. In local AI, MDPs are foundational for RL but rarely used directly with LLMs. However, some advanced LLM applications (e.g., RLHF, tool-use agents) borrow MDP concepts: the state is the conversation context, actions are generated tokens or tool calls, and rewards come from human feedback or task success. Operators training custom RL agents on local hardware (e.g., using Stable-Baselines3 on a GPU) will encounter MDPs when defining environments.
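As a rough illustration of the dynamic-programming route, the sketch below runs value iteration on a tiny MDP whose transition probabilities and rewards are fully known. The two states, the 0.8/0.2 split, the rewards, and the discount factor gamma = 0.9 are assumptions chosen only to keep the arithmetic easy to follow.

```python
def value_iteration(states, actions, transitions, rewards, gamma=0.9, tol=1e-8):
    """Return (values, greedy_policy) for a finite MDP.

    transitions[(s, a)] is a list of (next_state, probability) pairs;
    rewards[(s, a)] is the immediate reward for taking action a in state s.
    """
    values = {s: 0.0 for s in states}

    def q_value(s, a):
        # Immediate reward plus discounted expected value of the next state.
        return rewards[(s, a)] + gamma * sum(
            p * values[s2] for s2, p in transitions[(s, a)]
        )

    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a) for a in actions)
            delta = max(delta, abs(best - values[s]))
            values[s] = best  # in-place update
        if delta < tol:
            break

    policy = {s: max(actions, key=lambda a: q_value(s, a)) for s in states}
    return values, policy


# Hypothetical MDP: staying in "B" pays 1 per step; "move" from "A" reaches "B"
# with probability 0.8 and otherwise leaves the agent in "A".
states = ["A", "B"]
actions = ["stay", "move"]
transitions = {
    ("A", "stay"): [("A", 1.0)],
    ("A", "move"): [("B", 0.8), ("A", 0.2)],
    ("B", "stay"): [("B", 1.0)],
    ("B", "move"): [("A", 1.0)],
}
rewards = {("A", "stay"): 0.0, ("A", "move"): 0.0,
           ("B", "stay"): 1.0, ("B", "move"): 0.0}

values, policy = value_iteration(states, actions, transitions, rewards)
print(values, policy)
```

With these numbers the value of "B" converges to 1 / (1 - 0.9) = 10 and the value of "A" to 7.2 / 0.82, roughly 8.78, so the greedy policy is to move from "A" and stay in "B". Value iteration needs the full transition table, which is why RL algorithms are used instead when the environment can only be sampled.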
Practical example
Suppose you train a simple game-playing agent on an RTX 3060 using Stable-Baselines3. The game is a grid world: states are grid positions, actions are up/down/left/right, transitions are deterministic (or stochastic with a slip probability), and the reward is +1 for reaching the goal. This is an MDP. You define it as a Gymnasium environment, then run PPO for 1 million timesteps. The training loop repeatedly observes a state, samples an action from the policy, and receives the next state and reward, which is exactly the MDP cycle.
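The grid world described above might look roughly like the following. This is a dependency-free sketch: the GridWorld class is hypothetical and does not inherit from gymnasium.Env, it only mirrors the return shapes of Gymnasium's reset() and step(). The grid size, start cell, goal cell, step limit, and the reward of 0 on non-goal steps are assumptions.

```python
import random


class GridWorld:
    """Toy grid-world MDP that mirrors Gymnasium's reset()/step() return shapes.

    Hypothetical class for illustration; it is not a gymnasium.Env subclass.
    """

    # action index -> (row delta, column delta): up, down, left, right
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=4, slip_prob=0.0, max_steps=100, seed=None):
        self.size = size
        self.slip_prob = slip_prob
        self.max_steps = max_steps
        self.goal = (size - 1, size - 1)  # assumed: goal in the bottom-right corner
        self.rng = random.Random(seed)
        self.pos = (0, 0)
        self.steps = 0

    def reset(self):
        """Start a new episode in the top-left corner; return (observation, info)."""
        self.pos = (0, 0)
        self.steps = 0
        return self.pos, {}

    def step(self, action):
        """Apply one action; return (observation, reward, terminated, truncated, info)."""
        # With probability slip_prob the agent slips and moves in a random direction.
        if self.rng.random() < self.slip_prob:
            action = self.rng.choice(list(self.MOVES))
        d_row, d_col = self.MOVES[action]
        # Clamp to the grid so a move into a wall leaves the agent in place.
        row = min(max(self.pos[0] + d_row, 0), self.size - 1)
        col = min(max(self.pos[1] + d_col, 0), self.size - 1)
        self.pos = (row, col)
        self.steps += 1
        terminated = self.pos == self.goal
        truncated = self.steps >= self.max_steps and not terminated
        reward = 1.0 if terminated else 0.0
        return self.pos, reward, terminated, truncated, {}


# One episode under a uniformly random policy.
env = GridWorld(size=4, slip_prob=0.1, seed=0)
obs, info = env.reset()
done = False
while not done:
    action = env.rng.choice(list(GridWorld.MOVES))
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
```

To train on this with Stable-Baselines3 you would instead subclass gymnasium.Env, declare observation_space and action_space, and return observations in a form that matches the declared space, for example an integer cell index or a small array.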
Workflow example
When using Hugging Face's trl library for RLHF on a local LLM, the underlying formulation is an MDP. The state is the current conversation prefix (prompt plus tokens generated so far), the action is the next token, and the reward comes from a preference model that scores the response. The PPO trainer samples trajectories (state-action-reward sequences) and updates the policy. In trl versions that expose this PPOTrainer interface (it has changed across releases), operators see this in code: trainer = PPOTrainer(config, model, ref_model, tokenizer, ...) and trainer.step(queries, responses, scores). The MDP is implicit but governs the training loop.
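The token-level MDP can be made visible with a toy sampler that does not use trl at all. In the sketch below, toy_policy stands in for the language model and toy_reward stands in for the preference model; both functions, the vocabulary, and the choice to give the whole score to the final token are illustrative assumptions, not trl's behavior.

```python
import random

VOCAB = ["yes", "no", "maybe", "<eos>"]  # hypothetical toy vocabulary


def toy_policy(state, rng):
    """Stand-in for an LLM: picks the next token uniformly at random."""
    return rng.choice(VOCAB)


def toy_reward(tokens):
    """Stand-in for a preference model: scores a finished response."""
    return float(tokens.count("yes")) - float(tokens.count("no"))


def sample_trajectory(prompt, max_new_tokens, rng):
    """Return a list of (state, action, reward) steps for one sampled response."""
    state = tuple(prompt)
    steps = []
    for _ in range(max_new_tokens):
        action = toy_policy(state, rng)   # action: the next token
        steps.append([state, action, 0.0])
        state = state + (action,)         # deterministic transition: append the token
        if action == "<eos>":
            break
    if steps:
        response = [action for _, action, _ in steps]
        steps[-1][2] = toy_reward(response)  # assumed: reward arrives only at the end
    return [tuple(step) for step in steps]


trajectory = sample_trajectory(["Is", "it", "safe", "?"], 6, random.Random(0))
for state, action, reward in trajectory:
    print(len(state), action, reward)
```

Each step's state is the prompt plus every token generated so far, which is why the transition is deterministic once the token is chosen; all of the randomness sits in the policy. A real PPO setup would feed trajectories like these, with model log-probabilities attached, into the policy update.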
Reviewed by Fredoline Eruo. See our editorial policy.