Reinforcement Learning for Multi-Agent Transportation World in PD-World

Discover how the Q-Learning and SARSA algorithms are applied to a 3-agent transportation world, the PD-World. Explore strategies for transporting blocks efficiently, hyperparameter settings that optimize the RL system, effective agent coordination, and quick adaptation to changes in the world.



Presentation Transcript


  1. COSC 4368 Group Project, Spring 2024: Learning Paths Using Reinforcement Learning for a 3-Agent Transportation World. Eick: Q-Learning and SARSA for the PD-World

  2. The PD-World. Goal: transport all blocks from the pickup cells to the dropoff cells! The world is the 5x5 grid of cells (1,1) through (5,5). Pickup cells: (1,5), (2,4), (5,2); each initially contains 5 blocks. Dropoff cells: (1,1), (3,1), (4,5); each has a capacity of 5 blocks and is initially empty. Initial State: the black agent is in cell (1,3), the red agent is in cell (3,3), the blue agent is in cell (3,5), and each pickup cell contains 5 blocks. Terminal State: the dropoff cells contain 5 blocks each.
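
The following is a minimal Python sketch, not part of the handout, of how this world configuration could be encoded; all names (GRID_ROWS, PICKUP_CELLS, DROPOFF_CELLS, INITIAL_POSITIONS) are my own choices.

```python
# World parameters as described on this slide (coordinates are (row, column)).
GRID_ROWS, GRID_COLS = 5, 5

PICKUP_CELLS = {(1, 5): 5, (2, 4): 5, (5, 2): 5}     # cell -> blocks initially available
DROPOFF_CELLS = {(1, 1): 0, (3, 1): 0, (4, 5): 0}    # cell -> blocks stored so far
DROPOFF_CAPACITY = 5

# Initial agent positions; the agents act in the order red, blue, black (see slide 4).
INITIAL_POSITIONS = {"red": (3, 3), "blue": (3, 5), "black": (1, 3)}
```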

  3. Key Questions to be Addressed
  a. Despite receiving quite limited feedback, is the RL approach able to learn decent strategies for transporting blocks from sources to destinations?
  b. What RL hyperparameter settings lead to the best performance for the PD-World?
  c. Is the RL approach able to learn efficient paths from block sources to block destinations?
  d. Is the RL approach able to learn good agent coordination strategies?
  e. How well and how quickly can the RL approach adapt to changes in the PD-World?

  4. PD-World Operators. There are six operators: North, South, East, and West are applicable in every state and move the agent to the adjacent cell in that direction, except that leaving the grid is not allowed. Pickup is only applicable if the agent is in a pickup cell that contains at least one block and the agent does not already carry a block. Dropoff is only applicable if the agent is in a dropoff cell that contains fewer than 5 blocks and the agent carries a block. Movement of Agents: the red, blue, and black agents alternate applying operators, with the red agent acting first, the blue agent second, and the black agent third. Moreover, agents are not allowed to occupy the same cell at the same time, which creates a blockage problem that limits agent efficiency when agents work in close proximity.
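
A sketch of an operator-applicability check for a single agent under these rules; it assumes (row, column) coordinates with row 1 at the top (so "north" decreases the row index) and folds the no-shared-cell constraint into applicability. The function and argument names are hypothetical.

```python
OPERATORS = ("north", "south", "east", "west", "pickup", "dropoff")
MOVES = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

def applicable_operators(pos, carrying, pickups, dropoffs, others,
                         rows=5, cols=5, capacity=5):
    """Return the operators the agent at `pos` may apply; `others` holds the
    cells currently occupied by the other two agents."""
    ops = []
    for op, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        # Stay on the grid and never move onto a cell occupied by another agent.
        if 1 <= r <= rows and 1 <= c <= cols and (r, c) not in others:
            ops.append(op)
    if not carrying and pickups.get(pos, 0) > 0:
        ops.append("pickup")
    if carrying and pos in dropoffs and dropoffs[pos] < capacity:
        ops.append("dropoff")
    return ops
```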

  5. Rewards in the PD-World. Rewards: picking up a block in a pickup cell: +13; dropping off a block in a dropoff cell: +13; applying north, south, east, or west: -1. Experiment Setup: the 3-agent system is run for a fixed number of operator applications, e.g. 4000. If a terminal state is reached, the system is reset to the initial state, but the Q-Tables are not reinitialized, and the three agents continue to operate in the PD-World until the operator-application limit has been reached.
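
A one-function sketch of this reward signal (the function name is my own):

```python
def reward(operator):
    """+13 for a successful pickup or dropoff, -1 for any of the four moves."""
    return 13 if operator in ("pickup", "dropoff") else -1
```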

  6. 2024 Policies.
  PRandom: if pickup or dropoff is applicable, choose this operator; otherwise, choose an applicable operator randomly.
  PExploit: if pickup or dropoff is applicable, choose this operator; otherwise, with probability 0.80 apply the applicable operator with the highest q-value (breaking ties by rolling a die among operators with the same utility), and with probability 0.20 choose a different applicable operator randomly.
  PGreedy: if pickup or dropoff is applicable, choose this operator; otherwise, apply the applicable operator with the highest q-value (breaking ties by rolling a die among operators with the same utility).
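
A sketch of how the three policies could be implemented; `applicable` is the list returned by an applicability check such as `applicable_operators` above, and `q_values` maps each operator to its current q-value for the agent's RL state (both are assumed interfaces, not from the handout).

```python
import random

def choose_action(policy, applicable, q_values):
    # All three policies take pickup/dropoff whenever it is applicable.
    for op in ("pickup", "dropoff"):
        if op in applicable:
            return op

    def greedy():
        best = max(q_values.get(op, 0.0) for op in applicable)
        ties = [op for op in applicable if q_values.get(op, 0.0) == best]
        return random.choice(ties)          # break ties by "rolling a die"

    if policy == "PRANDOM":
        return random.choice(applicable)
    if policy == "PGREEDY":
        return greedy()
    if policy == "PEXPLOIT":
        g = greedy()
        if random.random() < 0.80:
            return g
        others = [op for op in applicable if op != g]
        return random.choice(others) if others else g
    raise ValueError(f"unknown policy {policy!r}")
```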

  7. Performance Measures
  a. Bank account of the 3-agent system.
  b. Number of operators applied to reach a terminal state from the initial state; this can happen multiple times in a single experiment!

  8. Multi-Agent Learning Strategy. There are two approaches to choose from when implementing 3-agent reinforcement learning:
  a. Each agent uses its own reinforcement learning strategy and Q-Table. However, the positions the other agents occupy are visible to each agent and can be part of the employed reinforcement learning state space.
  b. A single reinforcement learning strategy and Q-Table is used to move all three agents, selecting an operator for each agent and then executing the three selected operators in the order red-blue-black.
  Extra credit is given to groups who devise and implement both 3-agent learning approaches and compare their results.
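
At the data-structure level, the two approaches might look as follows (a rough sketch; all names are mine):

```python
from collections import defaultdict

# (a) Independent learners: one Q-table per agent, keyed by that agent's RL state.
q_tables = {color: defaultdict(float) for color in ("red", "blue", "black")}
# q_tables["red"][(rl_state, operator)] is the red agent's estimate.

# (b) Centralized learner: a single Q-table over joint states and joint actions.
joint_q = defaultdict(float)
# joint_q[(joint_state, (a_red, a_blue, a_black))] is the estimate for the joint
# action that is executed in red-blue-black order.
```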

  9. PD-World State Space. The actual state space of the PD-World is as follows: (i, j, i′, j′, i″, j″, x, x′, x″, a, b, c, d, e, f), with (i, j), (i′, j′), (i″, j″) the positions of the red, blue, and black agents. Moreover, two agents cannot be in the same cell! x, x′, x″ are 1 if the red, blue, and black agent, respectively, carries a block and 0 if not. (a, b, c, d, e, f) are the numbers of blocks in cells (1,1), (3,1), (4,5), (1,5), (2,4), (5,2). Initial State: (3,3, 3,5, 1,3, 0,0,0, 0,0,0, 5,5,5). Terminal States: (*,*,*,*,*,*, 0,0,0, 5,5,5, 0,0,0); there are several of those. Remark: the actual reinforcement learning approach will likely use a simplified state space that aggregates multiple states of the actual state space into a single state of the reinforcement learning state space.
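
For concreteness, the full state could be represented as a flat tuple in the same order; this is only an illustration (the field names and the `terminal` helper are mine), using the initial agent positions given on slide 2.

```python
from collections import namedtuple

WorldState = namedtuple("WorldState", [
    "i", "j", "ip", "jp", "ipp", "jpp",   # red, blue, black positions
    "x", "xp", "xpp",                     # carry flags (0 or 1)
    "a", "b", "c", "d", "e", "f"])        # blocks at (1,1),(3,1),(4,5),(1,5),(2,4),(5,2)

INITIAL_STATE = WorldState(3, 3, 3, 5, 1, 3, 0, 0, 0, 0, 0, 0, 5, 5, 5)

def terminal(s):
    # Terminal: every dropoff cell holds 5 blocks (all pickups are then empty
    # and no agent carries a block).
    return (s.a, s.b, s.c) == (5, 5, 5)
```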

  10. Mapping State Spaces to RL State Spaces. Most worlds have enormously large or even non-finite state spaces. Moreover, how quickly Q/TD learning learns is inversely proportional to the size of the state space. Consequently, smaller state spaces are used as RL-state spaces, and the original state space is rarely used as the RL-state space. [Diagram: World State Space → Reduction → RL-State Space]

  11. Remark: There will be more discussion of RL-state spaces in the lecture on Wed., March 20!!

  12. Very Simple Reinforcement Learning State Space 0. Original State Space: (i, j, i′, j′, i″, j″, x, x′, x″, a, b, c, d, e, f). Red agent simplified space: (i, j, x). Blue agent simplified space: (i′, j′, x′). Black agent simplified space: (i″, j″, x″). Comments:
  1. The algorithm initially learns paths between pickup cells and dropoff cells; different paths are learned for x=0 and for x=1.
  2. Minor complication: the q-values along those paths will decrease as soon as the particular pickup cell runs out of blocks or the particular dropoff cell cannot store any further blocks, as it is no longer attractive to visit these locations.
  3. The state space ignores the positions of the other agents and is therefore prone to blockage.
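
A sketch of the State Space 0 reduction, reusing the `WorldState` field names from the earlier sketch (an assumption, not the handout's notation):

```python
def rl_state_space0(world, agent):
    """Project the full world state onto the given agent's (i, j, x) RL state."""
    if agent == "red":
        return (world.i, world.j, world.x)
    if agent == "blue":
        return (world.ip, world.jp, world.xp)
    return (world.ipp, world.jpp, world.xpp)   # black agent
```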

  13. Reinforcement Learning State Space 1, which considers Blocking. Original State Space: (i, j, i′, j′, i″, j″, x, x′, x″, a, b, c, d, e, f). Red agent simplified space: (i, j, x, r). Blue agent simplified space: (i′, j′, x′, be). Black agent simplified space: (i″, j″, x″, bk). Comments:
  1. r = min(d(red, blue), d(red, black)), be = min(d(blue, red), d(blue, black)), and bk is defined analogously, where d measures the Manhattan distance between two agents.
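
The blocking feature can be computed as the Manhattan distance to the nearest other agent; a sketch, with `positions` an assumed dict from agent color to (row, column):

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def nearest_other_distance(agent, positions):
    """Distance feature r / be / bk for the given agent."""
    return min(manhattan(positions[agent], positions[other])
               for other in positions if other != agent)

# Example: rl_state_space0(world, "red") + (nearest_other_distance("red", positions),)
```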

  14. Complicated Reinforcement Learning State Space 2. Original State Space: (i, j, i′, j′, i″, j″, x, x′, x″, a, b, c, d, e, f). Red agent simplified space: (i, j, x, r, a′, b′, c′, d′, e′, f′). Blue agent simplified space: (i′, j′, x′, be, a′, b′, c′, d′, e′, f′). Black agent simplified space: (i″, j″, x″, bk, a′, b′, c′, d′, e′, f′), with a′, b′, c′, d′, e′, f′ being Boolean variables that are 1 if a pickup cell still has blocks left or if a dropoff cell still has capacity; otherwise, if the respective pickup cell is empty or the respective dropoff cell is full, the value of the respective Boolean variable is 0.
  Advantage: no need to unlearn paths; as no unlearning occurs with this state space, the team of agents might become more efficient in later runs, after already solving the transportation problem multiple times.
  Disadvantage: the space is 64 times larger than Space 1; there might be ways to reduce it further.
  Remark: if you want to simplify the above state space, you could drop r, be, and bk, but in this case you might face a lot of agent blockage.
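
The six Boolean availability flags might be derived from the block counts as follows (a sketch; it reuses the dictionaries from the configuration sketch and assumes a′..f′ follow the same cell order as a..f):

```python
def availability_flags(pickups, dropoffs, capacity=5):
    """(a', b', c', d', e', f'): 1 while a dropoff still has capacity or a
    pickup still has blocks, 0 once it is full / empty."""
    dropoff_flags = tuple(1 if dropoffs[c] < capacity else 0
                          for c in ((1, 1), (3, 1), (4, 5)))
    pickup_flags = tuple(1 if pickups[c] > 0 else 0
                         for c in ((1, 5), (2, 4), (5, 2)))
    return dropoff_flags + pickup_flags
```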

  15. Analysis of Attractive Paths. See also: http://horstmann.com/gridworld/gridworld-manual.html and http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html

  16. Remark: This is the QL approach you must use!!! TD-Q-Learning for the PD-World. Goal: measure the utility of using action a in state s, denoted by Q(a,s); the following update formula is used every time the agent reaches state s′ from s by applying action a:
  Q(a,s) ← (1-α)·Q(a,s) + α·[R(s,a,s′) + γ·max_a′ Q(a′,s′)]
  α is the learning rate; γ is the discount factor. a′ has to be an applicable operator in s′; e.g., pickup and dropoff are not applicable in a pickup/dropoff cell if it is empty/full! The q-values of non-applicable operators are therefore not considered when taking the max. R(s,a,s′) is the reward of reaching s′ from s by applying a; e.g., -1 for moving and +13 for picking up or dropping off blocks in the PD-World.
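
A minimal Python sketch of this update rule; `q` maps (state, action) pairs to values, `applicable_next` lists the operators applicable in s′, and the default α and γ values are illustrative, not prescribed by the slides.

```python
def q_learning_update(q, s, a, r, s_next, applicable_next, alpha=0.3, gamma=0.5):
    """Q(a,s) <- (1-alpha)*Q(a,s) + alpha*(r + gamma * max over applicable a' of Q(a',s'))."""
    best_next = max((q.get((s_next, ap), 0.0) for ap in applicable_next), default=0.0)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
```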

  17. SARSA. Approach: SARSA selects, using the policy π, the action a′ to be applied to s′ and then updates the q-values as follows:
  Q(a,s) ← Q(a,s) + α·[R(s) + γ·Q(a′,s′) − Q(a,s)]
  SARSA vs. Q-Learning: SARSA uses the action actually taken for the update and is therefore more realistic, as it uses the employed policy; however, it has problems with convergence. Q-Learning is an off-policy learning algorithm geared towards the optimal behavior, although this might not be realistic to accomplish in practice, as in most applications policies are needed that allow for some exploration.
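
A matching sketch of the SARSA update; `a_next` is the action the policy actually selected in s′ (contrast with the max over actions in the Q-learning sketch), and again the default α and γ are only illustrative.

```python
def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.3, gamma=0.5):
    """Q(a,s) <- Q(a,s) + alpha*(r + gamma*Q(a',s') - Q(a,s))."""
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (r + gamma * q.get((s_next, a_next), 0.0) - old)
```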

  18. SARSA Pseudo-Code. [Pseudo-code figure not reproduced in this transcript.]
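
Since the figure did not survive the transcript, the following is a generic single-agent SARSA control loop written against the helpers sketched earlier (`choose_action`, `sarsa_update`, `OPERATORS`) and an assumed environment interface (`reset`, `applicable`, `step`, `terminal`); it illustrates the idea and is not the slide's pseudo-code.

```python
def q_for(q, s):
    """View of the Q-table restricted to state s (operator -> value)."""
    return {op: q.get((s, op), 0.0) for op in OPERATORS}

def run_sarsa(env, policy, steps, alpha=0.3, gamma=0.5):
    q = {}
    s = env.reset()
    a = choose_action(policy, env.applicable(s), q_for(q, s))
    for _ in range(steps):
        s_next, r = env.step(a)                 # apply operator, observe reward
        a_next = choose_action(policy, env.applicable(s_next), q_for(q, s_next))
        sarsa_update(q, s, a, r, s_next, a_next, alpha, gamma)
        s, a = s_next, a_next
        if env.terminal():
            s = env.reset()                     # reset the world, keep the Q-table
            a = choose_action(policy, env.applicable(s), q_for(q, s))
    return q
```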

  19. 4368 Group Project in a Nutshell. [Diagram] The RL-System is configured by a policy, a learning rate, a discount rate, an RL-state space, and a utility update (Q-Learning/SARSA); the central question is: what design leads to the best RL-System performance?
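
One way to make this design space explicit in code is a small configuration object; the class and field names below are my own, not part of the handout.

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    policy: str        # "PRANDOM", "PEXPLOIT", or "PGREEDY"
    algorithm: str     # "Q-LEARNING" or "SARSA"
    alpha: float       # learning rate
    gamma: float       # discount rate
    state_space: str   # e.g. "SPACE0", "SPACE1", or "SPACE2"
```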

  20. Suggested Implementation Steps
  • Write a function aplop: (i, j, i′, j′, x, x′, a, b, c, d, e, f) → 2^{n,s,e,w,p,d} that returns the set of applicable operators in state (i, j, i′, j′, x, x′, a, b, c, d, e, f).
  • Write a function apply: (i, j, i′, j′, x, x′, a, b, c, d, e, f) × {n,s,e,w,p,d} → (i′, j′, x′, a′, b′, c′, d′, e′, f′) that returns the resulting state components after applying an operator.
  • Implement the q-table data structure.
  • Implement the SARSA/Q-Learning q-table update.
  • Implement the 3 policies.
  • Write functions that enable an agent to act according to a policy for n steps and that also compute the performance variables.
  • Develop visualization functions for Q-Tables.
  • Develop a visualization function for the evolution of the PD-World.
  • Develop functions to run experiments 1-4.
  • Develop visualization functions for attractive paths.
  • Develop functions to analyze agent coordination.
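
As one possible starting point for the q-table and Q-Table-visualization steps above, here is a sketch of a dictionary-based Q-table together with a crude "best move per cell" text view for State Space 0-style keys (i, j, x); all names are mine.

```python
from collections import defaultdict

q_table = defaultdict(float)        # (rl_state, operator) -> q-value

def best_action_grid(q, carrying, rows=5, cols=5,
                     moves=("north", "south", "east", "west")):
    """Print-ready grid showing the first letter of the best move in each cell."""
    lines = []
    for i in range(1, rows + 1):
        row = []
        for j in range(1, cols + 1):
            state = (i, j, carrying)
            row.append(max(moves, key=lambda op: q.get((state, op), 0.0))[0].upper())
        lines.append(" ".join(row))
    return "\n".join(lines)

# Example: print(best_action_grid(q_table, carrying=0))
```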
