
Efficient Multi-Task Reinforcement Learning Innovations
Discover cutting-edge research in reinforcement learning, including provably efficient multi-task RL with model transfer, heterogeneous multi-task online RL motivated by healthcare robotics, and the multi-player episodic RL problem. Dive deep into the formal setup of the MPERL problem and explore how inter-task information sharing enhances learning outcomes.
Presentation Transcript
Provably Efficient Multi-Task Reinforcement Learning with Model Transfer
Chicheng Zhang and Zhi Wang
NeurIPS 2021
Heterogeneous Multi-Task Online Reinforcement Learning (RL)
A group of assistive robots is deployed to provide personalized healthcare services (Kubota et al., 2020). Each robot learns the preferences of its paired individuals through interactions.
Question: if the robots receive similar yet nonidentical feedback, how can they learn to perform their respective tasks faster in an online RL setting?
Multi-Player Episodic RL (MPERL)
A set of M players (robots) concurrently interact with their respective environments, each represented as an episodic MDP.
[Illustration: two players, Alice and Bob; actions = cognitive activities to recommend. The mean rewards r_p(s, a) of actions 1-3 (values such as 0.4, 0.5, 0.6) are similar but not identical across the two players.]
The ε-MPERL Problem
A set of M players (robots) concurrently interact with their respective environments, each represented as an episodic MDP (as in the Alice/Bob illustration above, with actions = cognitive activities to recommend).
Dissimilarity assumption: for all players p, q and all (s, a),
|r_p(s, a) − r_q(s, a)| ≤ ε and ‖P_p(·|s, a) − P_q(·|s, a)‖₁ ≤ ε / H,
where ε is the dissimilarity parameter.
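To make the dissimilarity condition concrete, here is a minimal Python sketch that checks it for two tabular MDPs given as NumPy arrays; the function name and array layout are illustrative assumptions, not from the talk.

```python
import numpy as np

def satisfies_dissimilarity(r_p, r_q, P_p, P_q, eps, H):
    """Check the epsilon-MPERL dissimilarity condition for players p and q.

    r_p, r_q : (S, A) arrays of mean rewards.
    P_p, P_q : (S, A, S) arrays of transition probabilities.
    Returns True iff, for every (s, a),
        |r_p(s, a) - r_q(s, a)| <= eps   and
        || P_p(.|s, a) - P_q(.|s, a) ||_1 <= eps / H.
    """
    rewards_close = np.all(np.abs(r_p - r_q) <= eps)
    transitions_close = np.all(np.abs(P_p - P_q).sum(axis=-1) <= eps / H)
    return bool(rewards_close and transitions_close)
```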
The ε-MPERL Problem: formal setup
M episodic, tabular, H-layered MDPs {M_p}_{p=1}^M with shared state and action spaces and a common initial state distribution.
For episodes k = 1, 2, …, K:
  For players p = 1, 2, …, M:
    Player p interacts with M_p using policy π_p^(k) for one episode, obtaining trajectory τ_p^(k).
  All M trajectories {τ_p^(k)}_{p=1}^M are shared among the players.
Collective regret: Reg(K) = Σ_{p=1}^{M} Σ_{k=1}^{K} ( V_p^* − V_p^{π_p^(k)} ), i.e., the sum over players and episodes of (optimal value of player p) − (value of player p executing π_p^(k)).
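A minimal sketch of the interaction protocol and the collective-regret bookkeeping described above. The env/agent interfaces (current_policy, run_episode, optimal_value, policy_value, update) are hypothetical placeholders for illustration, not an API from the paper.

```python
def run_eps_mperl(envs, agents, num_episodes):
    """envs[p]: the episodic MDP of player p; agents[p]: player p's learner."""
    collective_regret = 0.0
    for k in range(num_episodes):                      # episodes k = 1, ..., K
        shared_trajectories = []
        for p, (env, agent) in enumerate(zip(envs, agents)):
            policy = agent.current_policy()            # pi_p^(k)
            trajectory = env.run_episode(policy)       # one episode in M_p
            shared_trajectories.append((p, trajectory))
            # Regret of player p in episode k: optimal value minus value of pi_p^(k).
            collective_regret += env.optimal_value() - env.policy_value(policy)
        for agent in agents:                           # all M trajectories are shared
            agent.update(shared_trajectories)
    return collective_regret
```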
Baseline: individual single-task learning
Each player learns separately using a state-of-the-art online tabular RL algorithm, e.g., Strong-Euler (Simchowitz and Jamieson, 2019), achieving a collective regret of:
(Gap-independent bound) Õ( M √(H² S A K) )
(Gap-dependent bound) Õ( Σ_{p=1}^{M} [ Σ_{(s,a) ∉ Z_{p,opt}} H³ ln K / gap_p(s, a) + |Z_{p,opt}| · H³ ln K / gap_{p,min} ] )
where gap_p(s, a) = V_p^*(s) − Q_p^*(s, a), Z_{p,opt} = { (s, a) : gap_p(s, a) = 0 }, and gap_{p,min} = min{ gap_p(s, a) : (s, a) ∉ Z_{p,opt} }.
Can we do better with inter-task information sharing?
The benefit of multi-task learning
(Wang, Zhang, Singh, Riek, Chaudhuri, 2021): in a multi-task multi-armed bandit setting, information sharing sometimes does not help, information-theoretically.
Example: for a fixed ε and Δ < ε/4, consider a two-armed instance in which, for every player, arm 1 has mean 0.5 + Δ and arm 2 has mean 0.5.
Claim: any sublinear-regret algorithm must incur Ω( M ln K / Δ ) regret on this instance, no better than the individual single-task learning baseline. Intuitively, since Δ < ε/4, feedback from other players (which may be biased by up to ε) cannot reveal which arm is better for a given player, so each player must explore on its own.
Key observation: the benefit of multi-task learning depends on the interaction between ε and the suboptimality gaps gap_p(s, a).
Key notion: subpar state-action pairs
Subpar state-action pairs: I^ε = { (s, a) : gap_p(s, a) exceeds a threshold of order Hε for some player p ∈ [M] }.
Example (optimal action values for Alice and Bob): Alice: action 1 = 2.6, action 2 = 2.4, action 3 = 1.5; Bob: action 1 = 2.4, action 2 = 2.2, action 3 = 1.6. For a small ε, the pair with action 3 is subpar, while the pair with action 2 is not.
Subpar state-action pairs are those amenable to inter-task information sharing.
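A small Python sketch of how the subpar set could be computed if the per-player optimal values were known; the `threshold` argument stands in for the paper's exact cutoff (of order Hε), which is an assumption here.

```python
import numpy as np

def subpar_pairs(V_star, Q_star, threshold):
    """Identify subpar state-action pairs.

    V_star : (M, S) array, per-player optimal state values.
    Q_star : (M, S, A) array, per-player optimal action values.
    Returns the set of (s, a) such that the suboptimality gap
    gap_p(s, a) = V*_p(s) - Q*_p(s, a) exceeds `threshold` for some player p.
    """
    gaps = V_star[:, :, None] - Q_star              # (M, S, A) suboptimality gaps
    is_subpar = (gaps > threshold).any(axis=0)      # large gap for at least one player
    return {(int(s), int(a)) for s, a in zip(*np.nonzero(is_subpar))}
```

For instance, with the illustrative Alice/Bob values above and a threshold of 0.5, only the pair containing action 3 would be flagged as subpar.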
Our results
For ε-MPERL problems, assuming a known dissimilarity parameter ε:
- Our algorithm, Multi-Task-Euler(ε), achieves gap-dependent and gap-independent regret upper bounds.
- We also show gap-dependent and gap-independent regret lower bounds that nearly match the upper bounds for constant H.
Our results: gap-independent bounds
Collective regret, split by state-action pairs: the non-subpar set (I^ε)^C and the subpar set I^ε (logarithmic factors omitted):
- Individual Strong-Euler: M √(H² |(I^ε)^C| K) + M √(H² |I^ε| K)
- Multi-task-Euler(ε): M √(H² |(I^ε)^C| K) + √(M H² |I^ε| K)
- Lower bound: of the same form as the Multi-task-Euler(ε) bound, matching it up to factors of H
On subpar pairs, information sharing improves the dependence on the number of players from M to √M.
Our results: gap-dependent bounds
For player p's contribution to the collective regret (up to constant factors), split by state-action pairs:
- Individual Strong-Euler: |Z_{p,opt}| H³ ln K / gap_{p,min} + Σ_{(s,a) ∈ (I^ε)^C \ Z_{p,opt}} H³ ln K / gap_p(s, a) + Σ_{(s,a) ∈ I^ε \ Z_{p,opt}} H³ ln K / gap_p(s, a)
- Multi-task-Euler(ε): |Z_{p,opt}| H³ ln K / gap_{p,min} + Σ_{(s,a) ∈ (I^ε)^C \ Z_{p,opt}} H³ ln K / gap_p(s, a) + Σ_{(s,a) ∈ I^ε \ Z_{p,opt}} (1/M) · H³ ln K / gap_p(s, a)
- Lower bound: Σ_{(s,a) ∈ (I^ε)^C \ Z_{p,opt}} H² ln K / gap_p(s, a) + Σ_{(s,a) ∈ I^ε \ Z_{p,opt}} (1/M) · H² ln K / gap_p(s, a)
On subpar pairs, each player's contribution is reduced by a factor of M relative to individual learning.
Multi-Task-Euler(ε): main ideas
For each player p, Multi-Task-Euler(ε):
1. Maintains two model estimates of M_p: (1) an individual estimate, built from player p's own data, and (2) an aggregate model estimate, built from all players' shared data.
2. Performs a "heterogeneous" optimistic value iteration using both estimates to obtain Q̄_p, a tight upper confidence bound of Q_p^*, and executes its greedy policy.
A similar algorithmic idea of "model transfer" has appeared in prior works, e.g., (Taylor, Jong, & Stone, 2008), (Pazis & Parr, 2016).
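The following Python sketch illustrates the "heterogeneous" optimistic value iteration idea: at each layer, the upper confidence bound on Q_p^* is the tighter of a bound built from the player's own data and a bound built from the pooled data inflated by an ε-dependent bias term. The array layout, bonus arguments, and the ε·(H − h) bias form are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def heterogeneous_optimistic_q(r_ind, P_ind, bonus_ind,
                               r_agg, P_agg, bonus_agg, eps, H):
    """Backward induction producing an optimistic Q for one player.

    r_* : (H, S, A) estimated mean rewards; P_* : (H, S, A, S) estimated transitions;
    bonus_* : (H, S, A) exploration bonuses. `ind` = individual data, `agg` = pooled data.
    """
    S, A = r_ind.shape[1], r_ind.shape[2]
    V_next = np.zeros(S)                     # optimistic values at layer h + 1
    Q = np.zeros((H, S, A))
    for h in reversed(range(H)):
        # Upper confidence bound from the player's own (individual) model.
        ucb_ind = r_ind[h] + P_ind[h] @ V_next + bonus_ind[h]
        # Upper confidence bound from the aggregate model, with an added
        # dissimilarity bias term (assumed here to be eps * (H - h)).
        ucb_agg = r_agg[h] + P_agg[h] @ V_next + bonus_agg[h] + eps * (H - h)
        # Keep the tighter of the two bounds, capped by the trivial bound H - h.
        Q[h] = np.minimum.reduce([ucb_ind, ucb_agg, np.full((S, A), float(H - h))])
        V_next = Q[h].max(axis=1)            # greedy optimistic value at layer h
    return Q  # the greedy policy w.r.t. Q is then executed
```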
Technical overview
Upper bounds: a new surplus bound in the multi-task setting, which bounds each optimistic surplus by the minimum of an individual-data term (scaling with 1/√n_p(s, a), the player's own visit count) and an aggregate-data term (scaling with 1/√n(s, a), the pooled visit count, plus an ε-dependent bias), and combine it with the "clipping trick" (Simchowitz & Jamieson, 2019).
Lower bounds: combine the multi-task bandit lower bounds (Wang, Zhang, Singh, Riek, Chaudhuri, 2021) with a standard bandit-to-RL conversion.
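For context, a hedged sketch of the clipping operator from Simchowitz & Jamieson (2019) and of how clipped surpluses typically enter the regret decomposition. Here E_{p,h}^{(k)}(s, a) denotes the surplus of player p at layer h of episode k; the threshold gap_p(s, a)/(4H) and the omitted lower-order terms are illustrative assumptions, not taken from the talk.

```latex
% Clipping operator: zero out quantities that are small relative to a threshold.
\[
  \operatorname{clip}\left[ x \mid \delta \right] \;:=\; x \cdot \mathbf{1}\{ x \ge \delta \}.
\]
% Schematic use: the per-episode regret of player p is bounded (up to constants
% and lower-order terms) by the expected surpluses, clipped at a gap-dependent
% threshold; summing over episodes turns 1/sqrt(n) terms into (ln K)/gap terms.
\[
  V_p^{*}(s_0) - V_p^{\pi_p^{(k)}}(s_0)
  \;\lesssim\;
  \sum_{h=1}^{H} \mathbb{E}\left[
    \operatorname{clip}\!\left[ E_{p,h}^{(k)}(s_h, a_h) \,\middle|\, \frac{\mathrm{gap}_p(s_h, a_h)}{4H} \right]
  \right].
\]
```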
Conclusion and open problems
We study ε-MPERL, a new multi-task RL setting; this complements existing multi-task RL settings (e.g., Brunskill & Li, 2013; Liu, Guo, & Brunskill, 2016; Pazis & Parr, 2016).
We give upper and lower bounds on the collective regret that are nearly matching for constant episode length H.
Open questions:
- Improve the dependence on H in the collective regret bounds.
- Improve the dependence on Z_{p,opt}, similar to recent works (e.g., Xu, Ma, & Du, 2021).
- Extensions to RL with function approximation.
Thank you!