
Constrained Reinforcement Learning for Network Slicing in 5G
This presentation explores how Constrained Reinforcement Learning (CRL) is applied to address network slicing challenges in 5G, enabling efficient resource allocation for diverse services such as manufacturing, entertainment, and smart cities. Traditional optimization methods are compared with learning-based approaches, emphasizing adaptability to uncertainty and epistemic variability. The study formulates the problem as a Markov Decision Process (MDP) and examines solutions such as REINFORCE, TRPO, and PPO. Additionally, a Constrained Markov Decision Process (CMDP) is introduced to handle cost-aware network slicing scenarios.
Presentation Transcript
A Constrained Reinforcement Learning Based Approach for Network Slicing
Yongshuai Liu (UC Davis), Jiaxin Ding (SJTU), Xin Liu (UC Davis)
HDR-Nets 2020, October 13, 2020
CRL Based Network Slicing: Problem Statement
5G network slicing enables the multiplexing of logical networks on the same physical network infrastructure. Network slicing is essentially a resource allocation problem.
[Figure: 5G network slicing over a shared physical infrastructure (MEC server, edge access node, base station), with example slices: eMBB for communication, entertainment, and Internet; mMTC for manufacturing, monitoring, retail, and smart city; URLLC for autonomous driving, factory automation, and remote surgery.]
CRL Based Network Slicing: Why a Learning-Based Approach
Traditional optimization approaches require accurate mathematical models with known parameters. Traditional methods do not adapt to epistemic uncertainty.
CRL Based Network Slicing: System Architecture
[Figure: the VNF manager / service provider feeds state inputs to the network slice manager's policy engine, which outputs actions (slices) to the physical infrastructure (MEC server, edge access node, base station) hosting a video slice, a VoLTE slice, and an edge URLLC slice, and receives reward/constraint feedback. Legend: S: state, A: action (slices), R/C: reward/constraint.]
Problem Formulation: Markov Decision Process (MDP)
- States S = {s_1, s_2, ..., s_n}
- Actions A = {a_1, a_2, ..., a_n}
- Reward R: S × A × S → ℝ
- Transition probability P(s_{t+1} | s_t, a_t)
- Discount factor γ
- Policy π_θ: S → A
[Figure: agent-environment loop, in which the agent observes state s_t, takes action a_t, and the environment returns reward R_{t+1} and next state s_{t+1}.]
Solutions
- Objective: maximize the expected discounted return of the policy.
- Candidate policy-gradient methods: REINFORCE, TRPO, and PPO (standard forms sketched below).
References:
- Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." 2000.
- Schulman, John, et al. "Trust region policy optimization." 2015.
- Schulman, John, et al. "Proximal policy optimization algorithms." 2017.
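The equations on the original slide did not survive the transcript, so the following are the standard textbook forms of the objective and the three methods from the cited papers, not necessarily the slide's exact expressions.

```latex
% Standard policy-gradient forms (assumed notation: return G_t, advantage estimate \hat{A}_t).
\begin{align*}
\text{Objective:}\quad & \max_{\theta}\; J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t} R_t\Big] \\
\text{REINFORCE:}\quad & \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big] \\
\text{TRPO:}\quad & \max_{\theta}\; \mathbb{E}\Big[\tfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}\, \hat{A}_t\Big]
   \;\;\text{s.t.}\;\; \mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\big] \le \delta \\
\text{PPO:}\quad & \max_{\theta}\; \mathbb{E}\Big[\min\big(r_t(\theta)\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],
   \quad r_t(\theta) = \tfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
\end{align*}
```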
Problem Formulation: Constrained Markov Decision Process (CMDP)
- States S = {s_1, s_2, ..., s_n}
- Actions A = {a_1, a_2, ..., a_n}
- Reward R: S × A × S → ℝ
- Cost C: S × A × S → ℝ
- Transition probability P(s_{t+1} | s_t, a_t)
- Discount factor γ
- Policy π_θ: S → A, where each action a is feasible
[Figure: agent-environment loop as in the MDP, with the environment returning both reward R_{t+1} and cost C_{t+1} along with the next state s_{t+1}.]
Problem Formulation: Constraints
- Discounted cumulative constraints
- Instantaneous constraints: each action a is feasible
(Conventional forms of both constraint types are sketched below.)
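The slide's formulas are not recoverable from the transcript; the following is the conventional way to write the two constraint types in CMDP notation, with assumed constraint limits d_i.

```latex
% Conventional CMDP constraint forms (assumed notation: costs C_i, limits d_i).
\begin{align*}
\text{Discounted cumulative constraint:}\quad
  & J_{C_i}(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, C_i(s_t, a_t, s_{t+1})\Big] \le d_i \\
\text{Instantaneous constraint:}\quad
  & C_j(s_t, a_t) \le d_j \quad \text{for every step } t
  \quad (\text{i.e., each action } a_t \text{ is feasible})
\end{align*}
```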
Case Study
- Three types of users [1], i.e., three network slices: video, VoLTE, URLLC
- Time-slotted system (1-second slots)
- State: the number of active users in each network slice
- Action: bandwidth allocation to each network slice
- Reward: overall throughput
[1] R. Li, Z. Zhao, Q. Sun, I. Chih-Lin, C. Yang, X. Chen, M. Zhao, and H. Zhang, "Deep reinforcement learning for resource management in network slicing," IEEE Access, vol. 6, pp. 74429-74441, 2018.
Case Study
- Cumulative constraint: user satisfaction over time, where user satisfaction = min(bandwidth allocated / bandwidth required, 1)
- Explicit instantaneous constraint: available bandwidth
- Implicit instantaneous constraint: latency (FIFO queue)
- Dynamic hidden structure: users of each slice arrive and depart following an unknown Poisson process, and the arrival/departure rates depend on user satisfaction. This is the key reason why a learning-based approach can outperform traditional approaches. (A minimal environment sketch follows.)
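To make the case-study MDP concrete, here is a minimal, hypothetical environment sketch. The per-user bandwidth demands, total bandwidth budget, arrival/departure rates, and the exact coupling between satisfaction and user dynamics are illustrative assumptions, not values or formulas from the paper.

```python
# Minimal sketch of the case-study environment under assumed parameters.
import numpy as np

class SlicingEnv:
    SLICES = ["video", "volte", "urllc"]
    DEMAND = np.array([4.0, 0.1, 1.0])        # assumed per-user bandwidth demand (Mbps)
    TOTAL_BW = 20.0                           # assumed total bandwidth budget (Mbps)
    BASE_ARRIVAL = np.array([1.0, 2.0, 0.5])  # assumed base Poisson arrival rates per slot

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.users = np.array([3, 5, 1])      # state: active users per slice

    def step(self, action):
        """action: per-slice bandwidth allocation; explicit constraint: sum(action) <= TOTAL_BW."""
        action = np.asarray(action, dtype=float)
        required = self.users * self.DEMAND
        # User satisfaction per slice: min(bandwidth allocated / bandwidth required, 1).
        satisfaction = np.ones(len(self.SLICES))
        np.divide(action, required, out=satisfaction, where=required > 0)
        satisfaction = np.minimum(satisfaction, 1.0)
        # Reward: overall delivered throughput.
        reward = float(np.minimum(action, required).sum())
        # Cost signal for the cumulative constraint: per-slice dissatisfaction.
        cost = 1.0 - satisfaction
        # Hidden dynamics (assumed coupling): arrival/departure rates depend on satisfaction.
        arrivals = self.rng.poisson(self.BASE_ARRIVAL * (0.5 + 0.5 * satisfaction))
        departures = self.rng.binomial(self.users, 0.2 + 0.3 * (1.0 - satisfaction))
        self.users = np.maximum(self.users + arrivals - departures, 0)
        return self.users.copy(), reward, cost

env = SlicingEnv()
state, reward, cost = env.step([12.0, 0.5, 1.0])  # one time slot with a sample allocation
```

An agent would choose `action` each slot to maximize cumulative reward while keeping the cumulative dissatisfaction cost below its limit; the safe layer discussed on later slides keeps each action within the bandwidth and latency constraints.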
Case Study: Problem Formulation
- Cumulative constraints (user satisfaction): handled by IPO
- Explicit instantaneous constraint (feasible bandwidth allocation): handled by the safe layer
- Implicit instantaneous constraint (latency requirement): handled by the safe layer
IPO: Interior-Point Policy Optimization (Cumulative Constraints)
Solutions to CMDP and Their Limitations
- Objective: maximize reward subject to the cost constraints (sketched below).
- Lagrangian relaxation method: sensitive to the initialization of hyperparameters; large reward and constraint variation during training.
- Constrained Policy Optimization (CPO): quadratic approximation; complicated computation and implementation; difficult to handle mean-value constraints and multiple constraints.
References:
- Chow, Yinlam, et al. "Risk-constrained reinforcement learning with percentile risk criteria." 2017.
- Achiam, Joshua, et al. "Constrained policy optimization." 2017.
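For context, the CMDP objective and the standard Lagrangian relaxation are written below in textbook form (not necessarily the slide's exact expressions); the coupled update of the multipliers λ is what makes the relaxed method sensitive to initialization and step sizes.

```latex
% CMDP objective and its standard Lagrangian relaxation (assumed notation).
\begin{align*}
\text{CMDP objective:}\quad & \max_{\theta}\; J_R(\pi_\theta)
  \quad \text{s.t.}\quad J_{C_i}(\pi_\theta) \le d_i,\;\; i = 1, \dots, m \\
\text{Lagrangian relaxation:}\quad & \min_{\lambda \ge 0}\, \max_{\theta}\;
  L(\theta, \lambda) = J_R(\pi_\theta) - \sum_{i=1}^{m} \lambda_i \big( J_{C_i}(\pi_\theta) - d_i \big)
\end{align*}
```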
IPO: Interior-Point Policy Optimization (Intuition)
Add the constraints as penalty terms to the objective function:
- If a constraint is satisfied, the penalty added to the reward function is zero.
- If the constraint is violated, the penalty goes to negative infinity.
IPO: Interior-Point Policy Optimization (Penalty Function)
Denote the constraint slack, i.e., how far each cumulative cost sits below its limit. The ideal penalty function is zero when the constraint holds and negative infinity when it is violated; the logarithmic barrier function is a differentiable approximation of this ideal penalty (sketched below).
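One consistent way to write this construction, assuming the constraints are expressed as J_{C_i}(π_θ) - d_i ≤ 0 (the slide's exact notation is not recoverable from the transcript):

```latex
% Ideal indicator penalty and its logarithmic-barrier approximation (assumed notation).
\begin{align*}
\text{Let } x_i &= J_{C_i}(\pi_\theta) - d_i \quad (\text{constraint } i \text{ holds when } x_i \le 0) \\
\text{Ideal penalty:}\quad I(x_i) &=
  \begin{cases} 0, & x_i \le 0 \\ -\infty, & x_i > 0 \end{cases} \\
\text{Logarithmic barrier:}\quad \varphi(x_i) &= \frac{\log(-x_i)}{t}, \quad t > 0
  \quad (\text{differentiable approximation of } I)
\end{align*}
```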
IPO: Interior-Point Policy Optimization (Objective Function)
The barrier penalties are added to the reward objective, turning the constrained problem into an unconstrained one (a sketch of the form appears below).
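Under the same assumed notation, the penalized objective would combine a PPO-style clipped surrogate with the barrier terms; this is a sketch of the form, not necessarily the slide's exact expression.

```latex
\max_{\theta}\;\; J^{\mathrm{CLIP}}(\theta) \;+\; \sum_{i=1}^{m} \varphi\big( J_{C_i}(\pi_\theta) - d_i \big)
```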
IPO: Interior-Point Policy Optimization (Performance Bound)
Theorem: the maximum gap between the optimal value of the constrained optimization problem and the value achieved by IPO is bounded by m/t, where m is the number of constraints and t is the hyperparameter of the logarithmic barrier function.
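Stated as an inequality, with J* the optimal constrained value and J^IPO the value attained by IPO:

```latex
J^{*} - J^{\mathrm{IPO}} \;\le\; \frac{m}{t}
```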
IPO Summary
- Outperforms state-of-the-art methods.
- Achieves monotonic improvement of the policy.
- Can handle multiple constraints.
- Easy to compute and implement.
- Easy-to-tune hyperparameters.
- Can easily incorporate instantaneous constraints.
Safe Layer (Instantaneous Constraints)
Safe Layer: Instantaneous Constraints
- Explicit instantaneous constraints (bandwidth)
- Implicit instantaneous constraints (latency)
(A minimal projection sketch follows.)
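As an illustration of how such a safe layer could project a proposed allocation onto the feasible set, here is a minimal sketch. The minimum-rate rule used for the latency (FIFO queue) constraint is an assumed heuristic, and the numbers are placeholders, not the paper's construction.

```python
# Minimal sketch of a safety layer that projects the policy's proposed allocation
# onto the feasible set (assumed heuristic, not the paper's exact construction).
import numpy as np

def safe_layer(proposed, total_bw, min_rate):
    """proposed: raw per-slice allocation from the policy,
    total_bw: available bandwidth (explicit constraint),
    min_rate: assumed per-slice rate floor that keeps FIFO queueing delay
              within the latency budget (implicit constraint)."""
    min_rate = np.asarray(min_rate, dtype=float)
    action = np.maximum(np.asarray(proposed, dtype=float), min_rate)  # meet the latency floor first
    if action.sum() > total_bw:
        # Scale down only the slack above the floor so the bandwidth budget holds.
        slack = action - min_rate
        budget = max(total_bw - min_rate.sum(), 0.0)
        scale = budget / slack.sum() if slack.sum() > 0 else 0.0
        action = min_rate + slack * scale
    return action

# Example: a proposal that exceeds a 20 Mbps budget is scaled back toward the floors.
print(safe_layer([12.0, 2.0, 9.0], total_bw=20.0, min_rate=[2.0, 0.5, 1.0]))
```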
CRL Based Network Slicing: Evaluation
Traditional baselines:
- User number
- Packet number
- Traffic demand
RL algorithms:
- PPO
Conclusion
- Network slicing is formulated as a CMDP
- IPO handles the cumulative constraints
- Safe layer handles the instantaneous constraints
- Other challenges: sample efficiency, trustworthiness
Q&A