Solving MDPs

An MDP is defined by a tuple {S, A, R, P} representing states, actions, rewards, and transition probabilities. The goal is to find an action policy that maximizes future-discounted expected reward, and solving an MDP involves iteratively updating the value function and the action policy. MDPs sit within a larger family of AI models, distinguished from Markov chains, HMMs, and POMDPs by whether the agent has control over actions and whether states are observable. Using MDPs to model human decisions raises challenges: real-world states are rarely cleanly conceptualized, and it is unclear where rewards come from. By contrast, solving a Markov chain means identifying its stationary distribution, while an HMM's goal is to estimate latent state transition and emission probabilities from a sequence of observations. Reinforcement learning addresses the MDP setting where rewards and transition probabilities are not known in advance and optimal policies must be learned, either model-based or model-free.

  • Markov Decision Processes
  • MDP Framework
  • Reinforcement Learning
  • AI Models
  • Human Decision Modeling


Presentation Transcript


  1. Solving MDPs (CS786, 27th January 2022)

  2. The MDP framework. An MDP is the tuple {S, A, R, P}: a set of states (S), a set of actions (A), possible rewards (R) for each {s, a} combination, and transition probabilities, where P(s'|s, a) is the probability of reaching state s' given you took action a while in state s.
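As a concrete illustration (not from the lecture), the tuple can be written down directly in Python; the state names, actions, and numbers below are made up for the sketch.

```python
# A minimal sketch of the {S, A, R, P} tuple for a toy two-state MDP.
# All names and numbers here are illustrative, not from the lecture.

S = ["low", "high"]            # states
A = ["wait", "search"]         # actions

# P[(s, a)] maps each next state s' to P(s' | s, a)
P = {
    ("low", "wait"):    {"low": 1.0},
    ("low", "search"):  {"low": 0.7, "high": 0.3},
    ("high", "wait"):   {"high": 1.0},
    ("high", "search"): {"high": 0.8, "low": 0.2},
}

# R[(s, a)] is the reward for each {s, a} combination
R = {
    ("low", "wait"):    0.0,
    ("low", "search"):  1.0,
    ("high", "wait"):   0.5,
    ("high", "search"): 2.0,
}
```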

  3. Solving an MDP. Solving an MDP is equivalent to finding an action policy AP(s) that tells you what action to take whenever you reach a state s. The typical rational solution is to maximize future-discounted expected reward.
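To make "future-discounted expected reward" concrete, here is a minimal sketch of a discounted return computation; the discount factor 0.9 and the reward sequence are arbitrary example values, not from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards r_t weighted by gamma**t (gamma in [0, 1))."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: three steps of reward
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```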

  4. Solution strategy. Notation: P(s'|s, a) is the probability of moving to s' from s via action a, and R(s', a) is the reward received for reaching state s' via action a. Update the value function and action policy iteratively. https://towardsdatascience.com/getting-started-with-markov-decision-processes-reinforcement-learning-ada7b4572ffb
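The iterative update the slide refers to is commonly implemented as value iteration. The sketch below assumes the toy S, A, P, R dictionaries from the earlier example (with rewards keyed on (s, a)) and an assumed discount factor gamma; it is one possible implementation, not necessarily the one used in the course.

```python
def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
    """Iteratively update state values, then read off a greedy policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            # Q-value of each action: immediate reward plus discounted future value
            q_values = [
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
                for a in A
            ]
            best = max(q_values)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Greedy action policy AP(s): pick the action with the highest Q-value
    AP = {
        s: max(A, key=lambda a: R[(s, a)] + gamma *
               sum(p * V[s2] for s2, p in P[(s, a)].items()))
        for s in S
    }
    return V, AP
```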

  5. Part of a larger universe of AI models. Two questions place a model in this universe: does the agent have control over actions, and are the states observable? No control with hidden states gives an HMM; control with hidden states gives a POMDP; no control with observable states gives a Markov chain; control with observable states gives an MDP.

  6. Modeling human decisions? States are seldom nicely conceptualized in the real world. Where do rewards come from? Storing transition probabilities is hard. Do people really look ahead into the infinite time horizon?

  7. Markov chain. The goal in solving a Markov chain is to identify the stationary distribution. (Diagram: a two-state Sunny/Rainy chain with transition probabilities 0.1 and 0.3.)
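A minimal sketch of finding the stationary distribution numerically, assuming the diagram's 0.1 and 0.3 are the Sunny-to-Rainy and Rainy-to-Sunny probabilities (the exact assignment did not survive the slide export):

```python
import numpy as np

# Row-stochastic transition matrix; rows and columns are [Sunny, Rainy].
T = np.array([[0.9, 0.1],    # from Sunny: stay 0.9, move to Rainy 0.1 (assumed)
              [0.3, 0.7]])   # from Rainy: move to Sunny 0.3, stay 0.7 (assumed)

# The stationary distribution pi satisfies pi = pi @ T. One simple way to find
# it is to raise T to a large power; every row converges to pi.
pi = np.linalg.matrix_power(T, 1000)[0]
print(pi)  # approximately [0.75, 0.25]
```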

  8. HMM. The HMM goal is to estimate the latent state transition and emission probabilities from a sequence of observations.
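Full estimation of transition and emission probabilities is usually done with the Baum-Welch (EM) algorithm; the sketch below shows only its building block, the forward pass that scores an observation sequence under candidate parameters. All parameter values are illustrative, not from the lecture.

```python
import numpy as np

# Illustrative HMM: 2 hidden states, 2 observation symbols.
T = np.array([[0.9, 0.1],    # hidden-state transition probabilities
              [0.3, 0.7]])
E = np.array([[0.8, 0.2],    # emission probabilities P(obs | hidden state)
              [0.1, 0.9]])
init = np.array([0.5, 0.5])  # initial hidden-state distribution

def forward_likelihood(obs):
    """P(observation sequence | T, E, init), computed by the forward recursion."""
    alpha = init * E[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ T) * E[:, o]
    return alpha.sum()

print(forward_likelihood([0, 0, 1]))
```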

  9. MDP vs. RL. In an MDP, {S, A, R, P} are known. In RL, R and P are not known to begin with; they are learned from experience, and the optimal policy is updated sequentially to account for increased information about rewards and transition probabilities. Model-based RL learns the transition probabilities P as well as the optimal policy. Model-free RL learns only the optimal policy, not the transition probabilities P.
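A standard example of model-free RL is Q-learning, sketched below: it learns action values directly from experience without ever estimating P. The environment interface (reset, step, actions) and all hyperparameter values are assumptions for illustration, not from the lecture.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Model-free RL sketch: learn Q(s, a) from experience, never estimating P.

    `env` is assumed to expose reset() -> state,
    step(action) -> (next_state, reward, done), and a list env.actions.
    """
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s2, r, done = env.step(a)
            # Temporal-difference update toward r + gamma * max_a' Q(s', a')
            target = r + gamma * max(Q[(s2, act)] for act in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```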
