SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems
Presentation from the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) covering SwiftRL: Processing-in-Memory implementations of RL algorithms, Roofline analysis of RL workloads, and the significance of offline RL in sectors such as healthcare, finance, and robotics. Real-world PIM architectures such as UPMEM PIM and Samsung HBM-PIM are discussed.
Presentation Transcript
2024 IEEE International Symposium on Performance Analysis of Systems and Software SwiftRL: Towards Efficient Reinforcement Learning on Real Processing-In-Memory Systems Kailash Gogineni, Sai Santosh D, Juan Gómez Luna, Karthik Gogineni, Peng Wei, Tian Lan, Mohammad Sadrosadati, Onur Mutlu, Guru Venkataramani
Outline: Reinforcement learning algorithms, Processing-in-memory (PIM), PIM implementation of RL algorithms, Evaluation, Conclusion
Reinforcement Learning (RL) Algorithms
Offline RL is vital in healthcare, finance, and robotics, enabling safe policy optimization where real-time learning is risky. Processing large datasets requires significant computational power and memory.
Representative RL algorithms:
- Q-learning (applications: AlphaGo, robotics)
- SARSA (applications: robotics, adaptive control systems)
Source: https://adabhishekdabas.medium.com/rl-world-3fc4dc38a73d
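For reference (not on the slide itself), the standard tabular update rules behind these two algorithms, with learning rate \alpha and discount factor \gamma, are:

    \text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\,[\, r + \gamma \max_{a'} Q(s',a') - Q(s,a) \,]
    \text{SARSA:}\quad\;\;\; Q(s,a) \leftarrow Q(s,a) + \alpha\,[\, r + \gamma\, Q(s',a') - Q(s,a) \,]

where, in SARSA, a' is the next action actually selected by the policy being followed (on-policy), rather than the greedy maximum (off-policy).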
Reinforcement Learning Algorithms: Roofline Analysis
Roofline plot for RL workloads: constrained by DRAM bandwidth due to repeated memory accesses during the training phase.
[Figure: Roofline model showing the performance characteristics of the CPU versions of the RL workloads, where Q refers to the Q-learner, S to the SARSA learner, and 1M and 20M indicate the dataset size in millions of transitions.]
Observation: all workloads fall in the memory-bound region of the Roofline.
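To make the memory-bound placement concrete, a minimal sketch of the underlying roofline arithmetic is below; the peak throughput, peak bandwidth, and per-update FLOP/byte figures are hypothetical illustrations, not the paper's measured values. A kernel is memory-bound when its operational intensity falls left of the machine's ridge point.

    #include <stdio.h>

    int main(void) {
        /* Hypothetical machine parameters (illustrative only). */
        double peak_gflops = 500.0;    /* compute roof (GFLOP/s) */
        double peak_bw_gbs = 50.0;     /* DRAM bandwidth roof (GB/s) */
        double ridge = peak_gflops / peak_bw_gbs;   /* FLOP/byte where the roof bends */

        /* Operational intensity of a kernel: useful FLOPs per byte moved from DRAM. */
        double kernel_flops = 10.0;    /* FLOPs per Q-table update (illustrative) */
        double kernel_bytes = 64.0;    /* DRAM traffic per update in bytes (illustrative) */
        double oi = kernel_flops / kernel_bytes;

        /* Attainable performance = min(compute roof, OI * bandwidth roof). */
        double attainable = oi * peak_bw_gbs < peak_gflops ? oi * peak_bw_gbs : peak_gflops;

        printf("ridge: %.2f FLOP/byte, kernel OI: %.2f FLOP/byte\n", ridge, oi);
        printf("%s-bound, attainable %.1f GFLOP/s\n",
               oi < ridge ? "memory" : "compute", attainable);
        return 0;
    }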
Outline: Reinforcement learning algorithms, Processing-in-memory (PIM), PIM implementation of RL algorithms, Evaluation, Conclusion
Processing-in-Memory (PIM)
PIM advocates memory-centric computing, positioning processing elements near or inside the memory arrays.
Real-world PIM architectures are becoming a reality: UPMEM PIM, Samsung HBM-PIM, Samsung AxDIMM, SK Hynix AiM, Alibaba HB-PNM.
These PIM systems share some common characteristics:
1. A host processor (CPU or GPU) has access to (1) standard main memory and (2) PIM-enabled memory.
2. PIM-enabled memory contains multiple PIM processing elements (PEs) with high-bandwidth, low-latency memory access.
3. PIM PEs run at only a few hundred MHz and have a small number of registers and a small (or no) cache/scratchpad.
4. PIM PEs may need to communicate via the host processor.
UPMEM PIM System
[Diagram: a host CPU with standard main memory (DRAM chips) alongside PIM-enabled memory; each PIM chip couples PIM processing elements (core, scratchpad/cache, instruction memory) with memory arrays (ranks/banks).]
In our work, we use the UPMEM PIM architecture:
- General-purpose processing cores called DRAM Processing Units (DPUs)
- Up to 24 PIM threads per DPU, called tasklets
- 32-bit integer arithmetic, but multiplication/division are emulated, as are floating-point operations; 8-bit integer multiplication is natively supported
- 64-MB DRAM bank (MRAM) and 64-KB scratchpad (WRAM) per DPU
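As a rough illustration of how these pieces are used, here is a minimal DPU-side kernel sketch written against the UPMEM SDK (me(), mram_read(), mem_alloc() come from the SDK headers); the transition buffer layout and symbol names are hypothetical assumptions, not SwiftRL's actual code.

    #include <defs.h>      /* me(), NR_TASKLETS */
    #include <mram.h>      /* mram_read() */
    #include <alloc.h>     /* mem_alloc() for the WRAM heap */
    #include <stdint.h>

    #define CHUNK_BYTES 256                      /* per-transfer staging size (multiple of 8) */

    __mram_noinit uint8_t transitions[1 << 20];  /* per-DPU transition chunk, written by the host */
    __host uint32_t bytes_per_tasklet;           /* set by the host before launch */

    int main(void) {
        uint8_t *wram_buf = mem_alloc(CHUNK_BYTES);  /* small buffer in the 64-KB WRAM scratchpad */
        uint32_t start = me() * bytes_per_tasklet;   /* me() = this tasklet's id */

        for (uint32_t off = 0; off < bytes_per_tasklet; off += CHUNK_BYTES) {
            /* Stream the next piece of this tasklet's share from MRAM into WRAM. */
            mram_read(&transitions[start + off], wram_buf, CHUNK_BYTES);
            /* ... apply the Q-learning/SARSA update to the transitions now in wram_buf ... */
        }
        return 0;
    }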
Outline: Reinforcement learning algorithms, Processing-in-memory (PIM), PIM implementation of RL algorithms, Evaluation, Conclusion
PIM Implementation: Operational Workflow
The execution phase comprises four main steps: (1) loading the input dataset chunks into individual DRAM banks of the PIM-enabled memory, (2) executing the RL workload (kernel) on the PIM cores in parallel, each operating on a different chunk of data, (3) retrieving the partial results from the DRAM banks to the host CPU, and (4) aggregating the partial results on the host processor.
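A minimal host-side sketch of these four steps using the UPMEM SDK's dpu.h API follows; the kernel binary name, the symbol names ("transitions", "q_partial"), and the sizes are hypothetical placeholders, not SwiftRL's actual code.

    #include <dpu.h>
    #include <stdint.h>

    #define NR_DPUS     64
    #define CHUNK_BYTES (1 << 20)
    #define Q_BYTES     (500 * 6 * sizeof(int32_t))   /* e.g., Taxi: 500 states x 6 actions */

    int main(void) {
        struct dpu_set_t set, dpu;
        uint32_t i;
        static uint8_t chunks[NR_DPUS][CHUNK_BYTES];  /* dataset split into per-DPU chunks */
        static int32_t q_partial[NR_DPUS][500 * 6];   /* per-DPU partial Q-tables */

        DPU_ASSERT(dpu_alloc(NR_DPUS, NULL, &set));
        DPU_ASSERT(dpu_load(set, "./rl_dpu_kernel", NULL));   /* hypothetical binary name */

        /* (1) Load a different chunk of transitions into each DPU's MRAM bank. */
        DPU_FOREACH(set, dpu, i) {
            DPU_ASSERT(dpu_prepare_xfer(dpu, chunks[i]));
        }
        DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_TO_DPU, "transitions", 0, CHUNK_BYTES, DPU_XFER_DEFAULT));

        /* (2) Run the RL kernel on all PIM cores in parallel. */
        DPU_ASSERT(dpu_launch(set, DPU_SYNCHRONOUS));

        /* (3) Retrieve the partial Q-tables from each DPU. */
        DPU_FOREACH(set, dpu, i) {
            DPU_ASSERT(dpu_prepare_xfer(dpu, q_partial[i]));
        }
        DPU_ASSERT(dpu_push_xfer(set, DPU_XFER_FROM_DPU, "q_partial", 0, Q_BYTES, DPU_XFER_DEFAULT));

        /* (4) Aggregate the partial results into the final Q-table on the host. */
        /* ... host-side reduction over q_partial ... */

        DPU_ASSERT(dpu_free(set));
        return 0;
    }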
Tabular Q-learning Implementation
Tabular Q-learning is a widely used model-free, off-policy RL workload that learns through a trial-and-error approach. Our PIM implementation divides the training dataset (X) equally among the PIM cores.
Q-learning updates in PIM cores: PIM cores perform Q-learning updates concurrently, each core handling its assigned data chunk.
Result aggregation: partial results from the PIM cores are combined into the final Q-table on the host processor.
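A minimal sketch (ours, not the paper's code) of the per-core update loop each PIM core would run over its chunk of logged transitions; the transition struct is a hypothetical layout, and FP32 is used here only for clarity (the PIM variants evaluated later also use an INT32 fixed-point form).

    typedef struct { int s, a, s_next; float r; } transition_t;   /* hypothetical layout */

    void q_update_chunk(float *Q, int n_actions,
                        const transition_t *chunk, int n,
                        float alpha, float gamma) {
        for (int i = 0; i < n; i++) {
            const transition_t *t = &chunk[i];
            /* Off-policy target: greedy bootstrap over the next state's actions. */
            float best_next = Q[t->s_next * n_actions + 0];
            for (int a = 1; a < n_actions; a++) {
                float v = Q[t->s_next * n_actions + a];
                if (v > best_next) best_next = v;
            }
            float *q_sa = &Q[t->s * n_actions + t->a];
            *q_sa += alpha * (t->r + gamma * best_next - *q_sa);
        }
    }

The per-core Q-tables produced this way are the partial results that step (4) of the operational workflow combines on the host processor.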
Multi-agent Tabular Q-learning Implementation
Multi-agent training is increasingly applied across diverse fields. We developed a variant of Q-learning optimized for hardware adaptability.
Multi-agent Q-learning PIM implementation:
1. Data loading and training: agent-specific datasets are loaded into PIM cores for concurrent training of independent learners.
2. Independent operation: each agent is pinned to a core and trained iteratively, with the final Q-tables retrieved directly.
SARSA Implementation
SARSA, an on-policy reinforcement learning algorithm, directly updates Q-values using the next action chosen according to the policy and its associated Q-value.
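Relative to the Q-learning sketch above, the on-policy difference amounts to a one-line change in the target: it bootstraps from the next action the policy actually took rather than the greedy maximum. Again, the transition layout is a hypothetical assumption and FP32 is shown only for clarity.

    typedef struct { int s, a, s_next, a_next; float r; } sarsa_transition_t;  /* hypothetical layout */

    void sarsa_update_chunk(float *Q, int n_actions,
                            const sarsa_transition_t *chunk, int n,
                            float alpha, float gamma) {
        for (int i = 0; i < n; i++) {
            const sarsa_transition_t *t = &chunk[i];
            /* On-policy target: bootstrap from the action actually taken in s_next. */
            float target = t->r + gamma * Q[t->s_next * n_actions + t->a_next];
            float *q_sa = &Q[t->s * n_actions + t->a];
            *q_sa += alpha * (target - *q_sa);
        }
    }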
Tabular Q-learning & SARSA Variants
Each learner is paired with a sampling strategy (SEQ: sequential, STR: stride-based, RAN: random) and a data type (FP32 or INT32), yielding twelve variants:
Q-learner: Q-learner-SEQ-FP32, Q-learner-SEQ-INT32, Q-learner-STR-FP32, Q-learner-STR-INT32, Q-learner-RAN-FP32, Q-learner-RAN-INT32
SARSA: SARSA-SEQ-FP32, SARSA-SEQ-INT32, SARSA-STR-FP32, SARSA-STR-INT32, SARSA-RAN-FP32, SARSA-RAN-INT32
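Because the DPUs emulate floating-point operations (and 32-bit multiplies) in software, the INT32 variants keep Q-values in a fixed-point format. A rough sketch of what such an update can look like is below; the 16-bit fractional scaling is an illustrative assumption, not necessarily the paper's exact scaling optimization.

    #include <stdint.h>

    #define FP_SHIFT 16                               /* 16 fractional bits (Q16.16-style) */
    #define TO_FP(x)     ((int32_t)((x) * (1 << FP_SHIFT)))
    #define FP_MUL(a, b) ((int32_t)(((int64_t)(a) * (int64_t)(b)) >> FP_SHIFT))

    /* q_sa += alpha * (r + gamma * best_next - q_sa), entirely in fixed point. */
    static inline void q_update_fixed(int32_t *q_sa, int32_t best_next,
                                      int32_t r, int32_t alpha, int32_t gamma) {
        int32_t td_error = r + FP_MUL(gamma, best_next) - *q_sa;
        *q_sa += FP_MUL(alpha, td_error);
    }

    /* Example: alpha = 0.1 and gamma = 0.9 converted once on the host: */
    /* int32_t alpha_fp = TO_FP(0.1), gamma_fp = TO_FP(0.9); */

Integer multiplies are still partly emulated on the DPU, but they are far cheaper than emulated FP32, which is consistent with the kernel-time reduction reported in the evaluation below.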
Outline: Reinforcement learning algorithms, Processing-in-memory (PIM), PIM implementation of RL algorithms, Evaluation, Conclusion
Evaluation Methodology
Algorithms & datasets: tabular Q-learning, SARSA, and multi-agent Q-learners (independent); Frozen Lake & Taxi from OpenAI Gym; 1 million & 5 million experiences.
Evaluated systems: UPMEM PIM system with 2,524 PIM cores @ 425 MHz and 158 GB of DRAM; Intel Xeon Silver 4110 CPU; NVIDIA RTX 3090 (Ampere architecture).
We evaluate the performance of the RL algorithms on PIM using: performance scaling across PIM cores; comparison of FP32 & fixed-point representations + scaling optimization; comparison to CPU & GPU.
More details in the paper.
Evaluation: Strong Scaling - FL
Tabular Q-learning & SARSA: strong scaling from 125 to 2,000 PIM cores. Dataset: Frozen Lake.
The PIM kernel time decreases by about 2× each time we double the number of PIM cores.
Evaluation: Strong Scaling - FL (FP32 & INT32)
Tabular Q-learning & SARSA: strong scaling from 125 to 2,000 PIM cores. Dataset: Frozen Lake.
With 125 PIM cores, the fixed-point (INT32) representation accelerates the PIM kernel time by about 11× over FP32.
Evaluation: Strong Scaling - Taxi
Tabular Q-learning & SARSA: strong scaling from 125 to 2,000 PIM cores. Dataset: Taxi.
Similar PIM kernel time trends are seen in the Taxi environment.
Evaluation: Strong Scaling - Taxi (Communication)
Tabular Q-learning & SARSA: strong scaling from 125 to 2,000 PIM cores. Dataset: Taxi.
In the Taxi task, Q-learner-STR-INT32 with 2,000 PIM cores spends up to 21.19% of its time on inter-PIM-core synchronization. The Taxi environment requires 47× more PIM-core data exchange (Q-values) than Frozen Lake.
Evaluation: Comparing PIM to CPU (SEQ & STR)
[Bar chart: execution time in seconds for Q_learner_Taxi and SARSA_Taxi under CPU_SEQ, CPU_STR, PIM_SEQ_INT32, and PIM_STR_INT32.]
Key Takeaway 1. Our INT32 PIM implementation with sequential and stride-based sampling exhibits slower execution times than the CPU, which benefits from enhanced cache locality and lower CPU-cache latencies under these access patterns.
Evaluation: Comparing PIM to CPU (RAN) 140 122.01 120 111.48 Key Takeaway 2. Key Takeaway 2. In both taxi and FL environments, our fixed-point PIM implementation with random sampling demonstrates superior performance. Execution time (seconds) 100 80.94 80 60 45.66 40 16.68 16.61 13.73 20 9.42 0 Q_learner_Taxi SARSA_Taxi Q_learner_FL SARSA_FL RL Algorithms CPU_RAN PIM_RAN_INT32 George Washington University | ETH Z rich ISPASS'24 2/20/2025 22
Evaluation: Comparing PIM to GPU (SEQ)
[Bar chart: execution time in seconds for Q_learner_Taxi, SARSA_Taxi, Q_learner_FL, and SARSA_FL under GPU_SEQ and PIM_SEQ_INT32.]
Q-learner-SEQ-INT32-FL achieves a 4.84× speedup over the GPU version due to INT32 instructions.
Evaluation: Multi-agent Q-learning (I)
The UPMEM architecture accelerates the training of multiple independent Q-learners, each with 10,000 transitions (Frozen Lake). This configuration utilizes the fixed-point representation and the scaling optimization.
Evaluation: Multi-agent Q-learning (II)
Key Takeaway 3. Memory-intensive RL algorithms with minimal inter-PIM-core communication, such as multi-agent Q-learning, are best suited for the UPMEM PIM architecture.
Outline: Reinforcement learning algorithms, Processing-in-memory (PIM), PIM implementation of RL algorithms, Evaluation, Conclusion
Conclusion
- Adapted and implemented RL algorithms on a PIM architecture to explore memory-centric systems for RL training
- Explored optimization strategies for enhancing RL workload performance across data types and sampling strategies (SEQ, RAN, STR)
- Compared PIM-based Q-learning & SARSA on UPMEM PIM (2,000 cores) to CPU & GPU
- Achieved near-linear scaling: 15× performance improvement with a 16× increase in PIM cores (125 to 2,000)
Thank You! Q & A. Contact: Kailash Gogineni (kailashg26@gwu.edu)