Challenges and Trends in Research on Reinforcement Learning

Explore current research problems in reinforcement learning, including switching between habits and goals, cognitive maps in rats and humans, the model-free versus model-based learning debate, and the blend of strategies seen in behavioral predictions. Discover how different levels of training shape decision-making processes in animals and humans.

  • Reinforcement Learning
  • Cognitive Science
  • Behavioral Studies
  • Decision-Making
  • Training

Presentation Transcript


  1. Research problems in RL (CS786, 9th March 2021)

  2. Announcement: research paper timelines. Topic due by 15th March: either reviewing literature addressing a specific question in cognitive science, or programming a previously published model. Extended abstract due by 31st March: 400 words describing the methods and/or the scope of the paper in detail. First draft due by 15th April: should be a nearly complete version of the paper; I will give comments for improvement by 20th April.

  3. RL problems actively being studied in my lab SWITCHING BETWEEN HABITS AND GOALS

  4. Cognitive maps in rats and men

  5. Rats learned a spatial model. Rats behave as if they had some sense of p(s'|s,a). This was not explicitly trained; it generalized from previous experience. The corresponding paper is recommended reading, as is Tolman's biography. http://psychclassics.yorku.ca/Tolman/Maps/maps.htm

  6. The model-free vs model-based debate. Model-free learning: learn stimulus-response mappings = habits. But what about goal-based decision-making? Do animals not learn the physics of the world in making decisions? Model-based learning: learn what to do based on the way the world is currently set up = thoughtful responding? People have argued for two systems: thinking fast and slow (Balleine & O'Doherty, 2010).

  7. Predictions meet data. Behavior appears to be a mix of both strategies. What does this mean? This is an active area of research.

  8. Some hunches. [Figures contrasting behavior after moderate training vs extensive training] (Holland, 2004; Killcross & Coutureau, 2003)

  9. Current consensus: in moderately trained tasks, people behave as if they are using model-based RL; in highly trained tasks, people behave as if they are using model-free RL. Nuance: repetitive training on a small set of examples favors model-free strategies, while limited training on a larger set of examples favors model-based strategies (Fulvio, Green & Schrater, 2014).

  10. Big-ticket application: how to practically shift behavior from habitual to goal-directed in the digital space. The reverse shift is understood pretty well by social media designers.

  11. The social media habituation cycle [cycle diagram linking reward and state]

  12. Designed based on cognitive psychology principles

  13. Competing claims. "First World kids are miserable!" https://journals.sagepub.com/doi/full/10.1177/2167702617723376 (Twenge, Joiner, Rogers & Martin, 2017). "Not true!" https://www.nature.com/articles/s41562-018-0506-1 (Orben & Przybylski, 2019)

  14. Big-ticket application: how to change computer interfaces from promoting habitual to promoting thoughtful engagement. This depends on being able to measure habitual vs thoughtful behavior online. Bharadwaj & Srivastava (2019)

  15. RL problems actively being studied in my lab DESIGNING BETTER STATE SPACES

  16. The state space problem in model-free RL. The number of states quickly becomes too large, even for trivial applications. Learning becomes too dependent on the right choice of exploration parameters, and explore-exploit tradeoffs become harder to solve. Even for tic-tac-toe, the state space contains 765 unique states.

  17. Solution approach: cluster states. Design features to stand in for important situation elements: close to win, close to loss, opponent fork, block fork, center, corner, empty side.

  18. What's the basis for your evaluation? Use domain knowledge to spell out what is better: φ1(s) = self center, opponent corner; φ2(s) = opponent corner, self center; φ3(s) = self fork, opponent center; φ4(s) = opponent fork, self center; ... as many as you can think of. These are basis functions (see the sketch below).
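As an illustration, here is a minimal sketch of such hand-crafted basis functions for tic-tac-toe. The board encoding, the particular feature set, and the helper names are assumptions made for this example, not code from the lecture.

LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def two_in_a_line(board, player):
    # True if 'player' holds two cells of some line and the third cell is empty
    return any(sum(board[i] == player for i in line) == 2 and
               any(board[i] == '' for i in line) for line in LINES)

def phi(board, me='X', opp='O'):
    # Binary feature vector that stands in for the raw board state
    return [
        float(two_in_a_line(board, me)),                    # close to win
        float(two_in_a_line(board, opp)),                   # close to loss
        float(board[4] == me),                              # I hold the center
        float(any(board[i] == me for i in (0, 2, 6, 8))),   # I hold a corner
        float(any(board[i] == '' for i in (1, 3, 5, 7))),   # an empty side remains
    ]

# Example: board cells are 'X', 'O', or '' (empty), indexed 0..8 row by row
print(phi(['X', '', '', '', 'X', 'O', 'O', '', '']))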

  19. Value function approximation. RL methods have traditionally approximated the state value function using linear basis functions: V(s) ≈ w·φ(s) = Σ_i w_i φ_i(s), where w is a k-valued parameter vector and k is the number of features that are part of the function (a sketch follows below). Implicit assumption: all features contribute independently to the evaluation.
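A minimal sketch of this linear form, assuming NumPy and a feature map like the tic-tac-toe one above; the weights are placeholders that would normally be learned.

import numpy as np

# Linear value-function approximation: v_hat(s) = w . phi(s), one weight per feature
def v_hat(w, phi_s):
    return float(np.dot(w, phi_s))

w = np.zeros(5)                              # k = 5 features, untrained weights
print(v_hat(w, [1.0, 0.0, 1.0, 1.0, 1.0]))   # 0.0 until the weights are learned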

  20. Function approximation in Q-learning. Approximate the Q table with linear basis functions: Q(s,a) ≈ w·φ(s,a). Update the weights: w ← w + α δ φ(s,a), where δ = r + γ max_a' Q(s',a') - Q(s,a) is the TD term (see the sketch below).
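A short sketch of this update rule, again assuming NumPy; the feature vectors and the available actions are invented for the example.

import numpy as np

def q_hat(w, phi_sa):
    # Linear approximation of the Q table: Q(s, a) ~ w . phi(s, a)
    return float(np.dot(w, phi_sa))

def q_learning_update(w, phi_sa, r, next_phis, alpha=0.1, gamma=0.99):
    # next_phis holds phi(s', a') for every action a' available in s'
    td_target = r + gamma * max(q_hat(w, p) for p in next_phis)
    delta = td_target - q_hat(w, phi_sa)          # the TD term
    return w + alpha * delta * np.asarray(phi_sa)

w = np.zeros(3)
w = q_learning_update(w, [1.0, 0.0, 1.0], r=1.0,
                      next_phis=[[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])
print(w)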

  21. Non-linear approximations. The universal approximation theorem: a neural network with even one hidden layer can approximately represent any continuous-valued function. Neural nets were always attractive for their representational generality, but were hard to train. That changed with the GPU revolution ten years ago.

  22. The big idea: approximate Q values using non-linear function approximation, Q(s, a) ≈ f(s, a; θ), where θ are the parameters of the neural network and f(x; θ) is the output of the network for input x (a sketch follows below). This combines both association and reinforcement principles: association buys us state inference; reinforcement buys us action policy learning. https://www.nature.com/articles/nature14236
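A minimal sketch of a network in this role, assuming PyTorch (the lecture does not prescribe a framework); as in DQN, the network maps a state vector to one Q value per action.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Small MLP standing in for f(s; theta); the output has one Q value per action
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q = QNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)      # a dummy state vector
print(q(state))                # one Q estimate per action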

  23. Conv nets basics. [Figure: an image patch convolved with a filter; see the sketch below.] https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
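To make the operation concrete, here is a bare-bones 2D convolution in NumPy (strictly a cross-correlation, as in most deep learning libraries); the filter and patch are made up for illustration.

import numpy as np

def conv2d(image, kernel):
    # Slide the filter over the image; at each position, sum the
    # element-wise products to produce one entry of the feature map.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])   # responds to vertical edges
patch = np.random.rand(6, 6)
print(conv2d(patch, edge_filter).shape)   # (4, 4) feature map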

  24. Discriminability from diverse filtering

  25. The Atari test bench: a very popular RL test bench. Limited space of actions, non-stop reward feedback, free to use (an interaction sketch follows below). Earlier methods used features handcrafted for each game.
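For reference, a sketch of the standard agent-environment loop on an Atari game; the Gymnasium package, its Atari extras, and the environment id are assumptions here, not part of the slides.

import gymnasium as gym

# Random-policy rollout on an Atari game (requires the Atari ROMs/extras)
env = gym.make("ALE/Breakout-v5")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(1000):
    action = env.action_space.sample()    # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
env.close()
print(total_reward)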

  26. Schematic illustration of the convolutional neural network. V Mnih et al. Nature 518, 529-533 (2015) doi:10.1038/nature14236

  27. Deep Q network: the basic Q-learning algorithm augmented in a number of different ways: use of experience replay (a sketch follows below), use of batch learning, use of non-linear function approximation.
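A minimal experience-replay buffer, as an illustration of the first of these additions; the capacity and interface are choices made for this sketch.

import random
from collections import deque

class ReplayBuffer:
    # Stores transitions as they occur and samples random mini-batches later,
    # breaking the correlation between consecutive updates.
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)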

  28. AlphaZero Figure 1: Training AlphaZero for 700,000 steps. Elo ratings were computed from evaluation games between different players when given one second per move. a Performance of AlphaZero in chess, compared to 2016 TCEC world-champion program Stockfish. b Performance of AlphaZero in shogi, compared to 2017 CSA world-champion program Elmo. c Performance of AlphaZero in Go, compared to AlphaGo Lee and AlphaGo Zero (20 block / 3 day) (29).

  29. Secret ingredient: some algorithmic innovations (MCTS); mostly, just lots and lots of computation: 5000 TPUs to generate game-play, 64 TPUs to train the DQN. This work closes a long chapter in game-based AI research, and brings research in RL to a dead end! https://www.quora.com/Is-reinforcement-learning-a-dead-end

  30. Summary: deep reinforcement learning is the cognitive architecture of the moment, perhaps of the future also. It beautifully combines the cognitive concepts of association and reinforcement, with excellent generalizability across toy domains. Limitations exist: timing, higher-order structure, computational complexity, etc.

  31. RL is as intelligent as a railway engine. You tell it what to do: shape behavior using reward signals. It does what you tell it to do, after tons of cost-free simulations. It can work in specific toy domains, but does not work as a model of real, real-time learning. https://www.sciencedirect.com/science/article/pii/S0921889005800259

  32. Where do rewards/labels come from? "To get a shi-fei (this-not-that discrimination) from the xin (heart-mind) without its having been constructed there is like going to Yue today and arriving yesterday, like getting something from nothing." (Zhuangzi Yinde, 4th century B.C.) Preferences are constructed (Slovic, 1995; Gilboa & Schmeidler, 2000), from past experience (Srivastava & Schrater, 2015).

  33. Elephants don't play chess: the world as its own model. Subsumption architecture: don't try to model the world with states and rewards; give individual robot components their own (simple, maybe hardwired) goals, and tweak components until you get behavior that looks reasonable (a toy sketch follows below). Big success: the Roomba! https://en.wikipedia.org/wiki/BEAM_robotics http://cid.nada.kth.se/en/HeideggerianAI.pdf
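A toy sketch of a subsumption-style controller, where each layer is a simple rule and higher-priority layers suppress lower ones; the sensors and behaviors are invented for this example, not taken from Brooks' work.

def avoid_obstacle(sensors):
    # Highest-priority layer: react to contact
    return "turn_left" if sensors.get("bump") else None

def seek_charger(sensors):
    # Middle layer: recharge when the battery runs low
    return "go_to_dock" if sensors.get("battery", 1.0) < 0.2 else None

def wander(sensors):
    # Lowest layer: default exploratory behavior
    return "move_forward"

LAYERS = [avoid_obstacle, seek_charger, wander]   # highest priority first

def act(sensors):
    for layer in LAYERS:
        command = layer(sensors)
        if command is not None:                   # this layer subsumes the rest
            return command

print(act({"bump": False, "battery": 0.1}))       # -> "go_to_dock"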

  34. Summary: the principles of association and reinforcement are used prominently in existing ML methods. They work well for specific applications, but not as general models of learning to be in the world. Much remains to be learned about internal representations, the processes controlling internal representations, embodied priors, and how embodied priors interact with those processes.
