
Introduction to Reinforcement Learning for Large Scale Statistical Machine Learning Seminar
Explore the fundamentals of reinforcement learning in the context of large-scale statistical machine learning. Understand the concepts, strategies, and models involved in solving RL problems and discover the differences between RL, supervised learning, AI, and planning. Dive into the world of RL to grasp how agents learn behavior through trial-and-error interactions with dynamic environments.
Presentation Transcript
COSC 878: Seminar on Large Scale Statistical Machine Learning
Today's Plan: course website (http://people.cs.georgetown.edu/~huiyang/cosc-878/), join the Google group (https://groups.google.com/forum/#!forum/cosc878), student introductions, team-up and presentation scheduling, first talk.
Reinforcement Learning: A Survey. Grace, 1/13/15.
What is Reinforcement Learning? The problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment.
Solving RL problems: there are two strategies. (1) Search in the space of behaviors for one behavior that performs well in the environment (genetic algorithms, genetic programming). (2) Use statistical methods and dynamic programming to estimate the utility of taking actions in states of the world. We focus on the second strategy.
What we learn in RL: the agent's job is to find a policy \pi that maximizes some long-run measure of reinforcement. A policy \pi maps states to actions; reinforcement = reward.
Difference between RL and supervised learning: in RL there is no presentation of input/output pairs, i.e., no training data. We only observe the immediate reward and do not know which actions are best in the long run. RL also needs to evaluate the system online while it is learning, so online performance matters.
Difference between RL and AI/planning: AI planning algorithms are less general; they require a predefined model of state transitions and assume determinism. RL only assumes that the state space can be enumerated and stored in memory.
Models of optimality: the difficult part is deciding how the future should be taken into account. Three models: finite horizon, infinite horizon (discounted), and average reward.
Finite horizon: at a given moment in time, the agent optimizes its expected reward over the next h steps and ignores what happens after step h.
Infinite horizon: maximize the long-run reward E[\sum_{t=0}^{\infty} \gamma^t r_t], with no limit on the number of future steps. Future rewards are discounted geometrically by a discount factor \gamma between 0 and 1, which makes this model mathematically more tractable than the finite horizon.
Average reward: maximize the long-run average reward, lim_{h \to \infty} E[(1/h) \sum_{t=0}^{h-1} r_t]. This is the limiting case of the infinite horizon as \gamma approaches 1. Weakness: it cannot tell when the large rewards arrive; if we prefer a policy that also gains large initial rewards, this model has no way to express that. Cure: maximize both the long-run average and the initial rewards (the bias-optimal model).
Compare model optimality. (Figure: a start state with an upper, a middle, and a lower path; all unlabeled arrows produce a reward of 0, and a single action is available at each step.)
Compare model optimality, finite horizon (h = 4): upper line: 0+0+2+2+2 = 6; middle: 0+0+0+0+0 = 0; lower: 0+0+0+0+0 = 0.
Compare model optimality, infinite horizon (\gamma = 0.9): upper line: 0·0.9^0 + 0·0.9^1 + 2·0.9^2 + 2·0.9^3 + 2·0.9^4 + ... = 2·0.9^2·(1 + 0.9 + 0.9^2 + ...) = 1.62·1/(1−0.9) = 16.2; middle: 10·0.9^5 + 10·0.9^6 + ... ≈ 59; lower: 11·0.9^6 + 11·0.9^7 + ... ≈ 58.5.
Compare model optimality, average reward: upper line ≈ 2, middle ≈ 10, lower ≈ 11.
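A minimal Python sketch that reproduces these numbers. The reward streams are an assumption reconstructed from the slide arithmetic (0-indexed: the upper path pays +2 per step from step 2 on, the middle +10 from step 5 on, the lower +11 from step 6 on), and since the slide's finite-horizon sum shows five terms, the sketch sums the first five rewards.

```python
# Sketch: compare the three optimality models on the three-path example.
# Reward streams are inferred from the slide arithmetic, not from the original figure.

def rewards(start, value, horizon):
    """Reward stream: 0 before `start`, then `value` at every step, truncated at `horizon`."""
    return [value if t >= start else 0 for t in range(horizon)]

def finite_horizon(r, h=5):
    return sum(r[:h])                                   # sum of the first h rewards

def discounted(r, gamma=0.9):
    return sum(gamma**t 	* x for t, x in enumerate(r))   # geometric discounting

def average(r):
    return sum(r) / len(r)                              # long-run average (approximate)

H = 10000  # long truncation to approximate the infinite horizon
paths = {"upper": rewards(2, 2, H), "middle": rewards(5, 10, H), "lower": rewards(6, 11, H)}

for name, r in paths.items():
    print(name, finite_horizon(r), round(discounted(r), 1), round(average(r), 1))
# upper  6 16.2  2.0   -> finite horizon prefers the upper path
# middle 0 59.0 10.0   -> discounted infinite horizon prefers the middle path
# lower  0 58.5 11.0   -> average reward prefers the lower path
```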
Parameters: the finite-horizon and infinite-horizon models have parameters h and \gamma, respectively. These parameters affect which policy the optimality model prefers, so choose them carefully for your application. The average-reward model's advantage is that it is not influenced by such parameters.
Markov process. Markov property (the memoryless property): the next state of a system depends only on its current state, Pr(S_{i+1} | S_i, ..., S_0) = Pr(S_{i+1} | S_i). A Markov process is a stochastic process with the Markov property. [A. A. Markov, '06]
Family of Markov models: Markov chain, hidden Markov model, Markov decision process, partially observable Markov decision process, multi-armed bandit.
Markov chain (S, M): a discrete-time Markov process. Example: Google PageRank [L. Page et al., '99]. State S = a web page; transition probability M = how a random web surfer moves between pages; PageRank measures how likely a random surfer will land on a page:
PageRank(s) = \lambda / N + (1 - \lambda) \sum_{s' \to s} PageRank(s') / |out(s')|
where \lambda is the random jump factor, N is the number of pages, the sum is over pages s' linked to s, and |out(s')| is the number of outlinks of s'. The stable-state distribution of such a Markov chain is PageRank. (Figure: a small web graph over pages A-E with their PageRank values.)
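A small sketch of PageRank computed by power iteration, i.e., as the stable-state distribution of the chain with random jumps. The link structure among pages A-E below is hypothetical, since the actual edges appear only in the slide's figure.

```python
# Sketch: PageRank as the stationary distribution of a Markov chain,
# computed by power iteration. The link structure below is an assumed example.

links = {            # page -> pages it links to (hypothetical graph over A-E)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C", "E"],
    "E": ["A", "D"],
}
lam = 0.15           # random jump factor
N = len(links)
pr = {p: 1.0 / N for p in links}   # start from the uniform distribution

for _ in range(100):               # power iteration until (approximate) convergence
    new = {p: lam / N for p in links}              # mass from random jumps
    for p, outs in links.items():
        for q in outs:
            new[q] += (1 - lam) * pr[p] / len(outs)  # mass following outlinks
    pr = new

print({p: round(v, 3) for p, v in pr.items()})     # approximate PageRank values
```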
Hidden Markov model (S, M, O, e): a Markov chain whose states are hidden; observable symbols are emitted with some probability according to the state [Leonard E. Baum et al., '66]. (Figure: a chain s_0 -> s_1 -> s_2 emitting observations o_0, o_1, o_2.) Notation: s_i = hidden state, p_i = transition probability, o_i = observation, e_i = observation (emission) probability.
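To make the emission idea concrete, here is a short sketch that samples a sequence from a toy HMM. The states, symbols, and probabilities are invented for illustration only.

```python
# Sketch: generating a sequence from a toy HMM (hidden states emit observations).
import random

start = {"Rainy": 0.6, "Sunny": 0.4}                       # initial state distribution
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},            # M: transition probabilities
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5}, # e: emission probabilities
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def draw(dist):
    """Sample a key from a {key: probability} dict."""
    return random.choices(list(dist), weights=list(dist.values()))[0]

s = draw(start)
for t in range(5):
    o = draw(emit[s])          # the observer sees o, never the hidden state s
    print(t, s, o)
    s = draw(trans[s])         # the hidden chain moves on
```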
Markov decision process (S, M, A, R, \gamma): an MDP extends a Markov chain with actions and rewards [R. Bellman, '57]. (Figure: a trajectory s_0 -> s_1 -> s_2 -> s_3 with actions a_0, a_1, a_2, rewards r_0, r_1, r_2, and transition probabilities p_0, p_1, p_2.) Notation: s_i = state, a_i = action, r_i = reward, p_i = transition probability.
Definition of MDP: a tuple (S, M, A, R, \gamma), where
S: state space;
M: transition matrix, M_a(s, s') = P(s' | s, a);
A: action space;
R: reward function, R(s, a) = the immediate reward for taking action a in state s;
\gamma: discount factor, 0 < \gamma \le 1.
A policy \pi(s) gives the action taken at state s. The goal is to find an optimal policy \pi^* that maximizes the expected total reward.
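For concreteness, a minimal sketch of how such a tuple might be written down in Python. The two-state, two-action MDP below is an invented example, not one from the slides; later sketches reuse it.

```python
# Sketch: an MDP (S, M, A, R, gamma) as plain Python dictionaries (toy example).

S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9

# M[a][s][s'] = P(s' | s, a): transition probabilities
M = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[s][a] = immediate reward for taking action a in state s
R = {"s0": {"stay": 0.0, "go": 1.0}, "s1": {"stay": 2.0, "go": 0.0}}

# A deterministic policy maps states to actions
pi = {"s0": "go", "s1": "stay"}
```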
Policy: the rule by which the agent selects an action a at state s, \pi(s) = a. Example (grid-world figure): \pi(s0) = move right and up, \pi(s1) = move right and up, \pi(s2) = move right. [Slide altered from Carlos Guestrin's ML lecture]
Value of a policy: the expected long-term reward starting from state s, written V^\pi(s). V^\pi(s0) = E[R(s0) + \gamma R(s1) + \gamma^2 R(s2) + \gamma^3 R(s3) + \gamma^4 R(s4) + ...], where future rewards are discounted by \gamma \in [0, 1). (Figure: starting from s0 and following \pi, the agent reaches possible next states s1, then s2, and so on, collecting rewards R(s0), R(s1), R(s2), ...) [Slide altered from Carlos Guestrin's ML lecture]
Computing the value of a policy. The value function can be unrolled one step:
V^\pi(s_0) = E_\pi[R(s_0, \pi) + \gamma R(s_1, \pi) + \gamma^2 R(s_2, \pi) + \gamma^3 R(s_3, \pi) + ...]
= E_\pi[R(s_0, \pi) + \gamma \sum_{i=1}^{\infty} \gamma^{i-1} R(s_i, \pi)]
= R(s_0, \pi) + \gamma E_\pi[\sum_{i=1}^{\infty} \gamma^{i-1} R(s_i, \pi)]
= R(s_0, \pi) + \gamma \sum_{s'} P(s' | s_0, \pi(s_0)) V^\pi(s')
where s_0 is the current state and s' ranges over the possible next states.
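A short sketch of iterative policy evaluation based on the last line of this derivation, reusing the toy MDP and policy pi from the sketch after the MDP definition slide.

```python
# Sketch: iterative policy evaluation on the toy MDP above.
# Repeated backup: V(s) <- R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) * V(s')

V = {s: 0.0 for s in S}            # initialize every state value to 0
for _ in range(200):               # repeat the backup until the values stop changing
    V = {s: R[s][pi[s]] + gamma * sum(M[pi[s]][s][s2] * V[s2] for s2 in S)
         for s in S}

print({s: round(v, 2) for s, v in V.items()})   # approximate V^pi
```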
Optimality: the Bellman equation. The Bellman equation for an MDP [R. Bellman, '57] is a recursive definition of the optimal (state-)value function V^*:
V^*(s) = max_a [ R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^*(s') ]
Optimal policy:
\pi^*(s) = argmax_a [ R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^*(s') ]
Optimality: the Bellman equation, relationship between V^* and Q^*. The Bellman equation can be rewritten in terms of the action-value function Q^*(s, a):
V^*(s) = max_a Q^*(s, a)
Q^*(s, a) = R(s, a) + \gamma \sum_{s'} P(s' | s, a) V^*(s')
Optimal policy: \pi^*(s) = argmax_a Q^*(s, a)
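A sketch of value iteration written with this Q-form of the Bellman backup, again on the assumed toy MDP from the earlier sketch.

```python
# Sketch: value iteration via the Q-form of the Bellman equation (toy MDP above).

V = {s: 0.0 for s in S}
for _ in range(200):                                   # repeat the Bellman backup
    Q = {s: {a: R[s][a] + gamma * sum(M[a][s][s2] * V[s2] for s2 in S)
             for a in A}
         for s in S}
    V = {s: max(Q[s].values()) for s in S}             # V*(s) = max_a Q*(s, a)

pi_star = {s: max(Q[s], key=Q[s].get) for s in S}      # pi*(s) = argmax_a Q*(s, a)
print({s: round(v, 2) for s, v in V.items()}, pi_star)
```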
MDP algorithms solve the Bellman equation for the optimal value V^*(s) and the optimal policy \pi^*(s). Model-based approaches: value iteration, policy iteration, modified policy iteration, prioritized sweeping. Model-free approaches: temporal difference (TD) learning, Q-learning. [Bellman '57; Howard '60; Puterman and Shin '78; Singh and Sutton '96; Sutton and Barto '98; Sutton '88; Watkins '92] [Slide altered from Carlos Guestrin's ML lecture]
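Finally, a sketch of tabular Q-learning, one of the model-free approaches listed above, run on the same toy MDP, which is used only as a simulator here; the learning rate, exploration rate, and step count are arbitrary illustrative choices.

```python
# Sketch: tabular Q-learning (model-free) on the toy MDP above.
# M and R are used only inside the simulator; the learner never reads them directly.
import random

def step(s, a):
    """Simulate the environment: sample s' ~ P(.|s, a) and return (s', reward)."""
    s2 = random.choices(S, weights=[M[a][s][x] for x in S])[0]
    return s2, R[s][a]

Q = {s: {a: 0.0 for a in A} for s in S}
alpha, eps = 0.1, 0.1                      # learning rate and exploration rate
s = random.choice(S)
for _ in range(20000):
    a = random.choice(A) if random.random() < eps else max(Q[s], key=Q[s].get)
    s2, r = step(s, a)
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    Q[s][a] += alpha * (r + gamma * max(Q[s2].values()) - Q[s][a])
    s = s2

print({s: max(Q[s], key=Q[s].get) for s in S})   # greedy policy learned from samples
```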