PipeSwitch: Fast Pipelined Context Switching for DL Applications
PipeSwitch is a system presented at OSDI 2020 that enables GPU-efficient multiplexing of deep learning applications through fine-grained time-sharing. It achieves millisecond-scale context-switching latency and high throughput by optimizing GPU memory allocation and model transmission. The design combines pipelined context switching with standby worker initialization to serve memory-intensive DL workloads, addressing the limited GPU memory and slow switching speeds commonly faced in deep learning tasks.
Presentation Transcript
PipeSwitch : Fast Pipelined Context Switching For Deep Learning Applications OSDI 2020 Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin Johns Hopkins University, ByteDance Inc. Presented by Goeun Lee
Index - Introduction - PipeSwitch Overview - PipeSwitch Design - Evaluation
Introduction 1. GPU cluster
Introduction 2. Fine-Grained Time-Sharing GPU The gap: precious GPU memory and slow switching. - Unlike host memory, which can reach several TB, GPU memory is limited and intended for task execution, not for storing the state of idle applications. - Keeping models resident in GPU memory cannot support training tasks, which are memory-intensive, or even multiple inference tasks with large models. The opportunity: DL workloads have well-defined structures. - DNN models consist of multiple layers stacked one on another. - The computation of DNN models proceeds layer by layer.
Introduction Goal = Fast context switching. Enable GPU-efficient multiplexing of multiple DL apps with fine-grained time-sharing; achieve millisecond-scale context-switching latencies and high throughput.
PipeSwitch Overview To start a new task, the controller waits for or stops the current task. A standby worker initializes its environment. The memory daemon allocates GPU memory to the standby worker and transmits the model used by the new task.
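A minimal sketch of this workflow, in Python, is given below. The class and method names (`Controller`, `MemoryDaemon`, `Worker`, `start_task`, etc.) are illustrative assumptions, not PipeSwitch's actual code; the real system is built on PyTorch with a controller, a memory daemon, and active/standby worker processes.

```python
# Illustrative sketch of the switching workflow described above.
# All names are hypothetical; this is not PipeSwitch's implementation.

class MemoryDaemon:
    def assign_gpu_memory(self, worker): ...      # hand pre-allocated GPU memory to the worker
    def transmit_model(self, model, worker): ...  # start (pipelined) model transmission

class Worker:
    def stop_or_wait(self): ...                   # stop a preemptible task or wait for it to finish
    def execute(self, task): ...                  # run the new task on the GPU

class Controller:
    def __init__(self, memory_daemon, standby_workers):
        self.daemon = memory_daemon
        self.standby = standby_workers            # pool of pre-initialized standby workers
        self.active = None                        # worker currently owning the GPU

    def start_task(self, task):
        if self.active is not None:
            self.active.stop_or_wait()            # wait for or stop the current task
        worker = self.standby.pop()               # standby worker: environment already initialized
        self.daemon.assign_gpu_memory(worker)     # memory daemon owns the GPU memory
        self.daemon.transmit_model(task.model, worker)
        worker.execute(task)                      # the standby worker becomes the active worker
        self.active = worker
```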
PipeSwitch Design Pipelined Model Transmission
PipeSwitch Design Pipelined Model Transmission The basic way to pipeline is at per-layer granularity, but this introduces two sources of system overhead: - the overhead of invoking multiple PCIe calls to transmit the data, and - the synchronization overhead between transmission and computation. PipeSwitch uses grouping to minimize both sources of overhead.
PipeSwitch Design Pipelined Model Transmission (slide with the formula for the optimal grouping strategy; the exact expression is not recoverable from this transcript)
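The sketch below illustrates the overlap of PCIe transmission and computation at group granularity using PyTorch CUDA streams. It assumes the layer groups are already given and that parameters start in pinned host memory; it is an illustration of the pipelining idea, not PipeSwitch's implementation (which also handles synchronization flags and the optimal grouping algorithm).

```python
import torch

def pipelined_load_and_execute(groups, x):
    """Overlap host-to-GPU transmission of layer groups with computation.

    `groups` is a list of lists of CPU-resident nn.Module layers whose
    parameters live in pinned host memory; grouping amortizes the per-call
    PCIe and synchronization overheads mentioned above.
    """
    copy_stream = torch.cuda.Stream()       # stream for host-to-device parameter copies
    compute_stream = torch.cuda.Stream()    # stream for forward computation
    ready = []                              # per-group events: "parameters are on the GPU"

    # Enqueue asynchronous transmission of every group on the copy stream.
    for group in groups:
        with torch.cuda.stream(copy_stream):
            for layer in group:
                layer.to("cuda", non_blocking=True)   # async copy from pinned memory
            evt = torch.cuda.Event()
            evt.record(copy_stream)
        ready.append(evt)

    # Compute group i as soon as its transmission finishes, while later
    # groups are still being transmitted in the background.
    x = x.to("cuda", non_blocking=True)
    with torch.cuda.stream(compute_stream):
        for group, evt in zip(groups, ready):
            compute_stream.wait_event(evt)            # synchronize only at group granularity
            for layer in group:
                x = layer(x)
    torch.cuda.current_stream().wait_stream(compute_stream)
    return x
```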
PipeSwitch Design Unified Memory Management A naïve solution for GPU memory management uses the native cudaMallocManaged function for GPU memory allocation. - DL applications have large models and generate large amounts of intermediate results. - The native cudaMalloc function and CUDA unified memory are designed for general-purpose applications. A DL task stores the DNN model and the intermediate results. The DNN model is fixed, and the intermediate results change in a simple, regular pattern: intermediate results in a training task are first-in-last-out, so memory allocation and release can be handled by a stack-like mechanism.
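A toy sketch of the stack-like allocation idea for intermediate results follows. It is only an illustration of the mechanism under the first-in-last-out assumption; the class name, the byte-offset bookkeeping, and the fixed model region are simplifications, not PipeSwitch's memory daemon.

```python
class StackAllocator:
    """Toy allocator for a single pre-allocated GPU memory region.

    The DNN model occupies a fixed region at the bottom; intermediate results
    are allocated on top and freed in first-in-last-out order, so allocation
    and release are pointer bumps instead of general-purpose malloc/free calls.
    """

    def __init__(self, capacity_bytes, model_bytes):
        self.capacity = capacity_bytes
        self.top = model_bytes            # model occupies [0, model_bytes)
        self.offsets = []                 # allocation stack for intermediate results

    def alloc(self, nbytes):
        if self.top + nbytes > self.capacity:
            raise MemoryError("out of pre-allocated GPU memory region")
        offset = self.top
        self.offsets.append(offset)
        self.top += nbytes
        return offset                     # caller indexes into the big GPU buffer

    def free_last(self):
        # Intermediate results are released in reverse order of allocation
        # (first-in-last-out), so only the most recent allocation is freed.
        self.top = self.offsets.pop()
```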
PipeSwitch Design Unified Memory Management Design goals: - Minimize memory allocation overhead - Minimize memory footprint and avoid extra memory copies - Minimize IPC overhead - Pin memory
PipeSwitch Design Active-Standby Worker Switching A naïve solution is to use separate processes: - start the new task only after the current task has stopped. Another solution is to use one process: - the current and new tasks share the same process. The active-standby worker switching mechanism hides the overhead of both task cleaning and task initialization, and also ensures process-level isolation.
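Below is a minimal Python sketch of the active-standby idea using pre-started processes. The helpers `init_cuda_and_runtime` and `run_task` are hypothetical placeholders; the sketch only shows how one-time initialization is paid before a task arrives and how cleanup of the old task can overlap with starting the new one, while separate processes keep isolation.

```python
import multiprocessing as mp

def init_cuda_and_runtime():
    # Placeholder for the expensive one-time setup a real worker would do
    # (creating the CUDA context, initializing the DL framework, pinning memory).
    pass

def run_task(task):
    # Placeholder for actually executing an inference or training task.
    print("running", task)

def worker_loop(task_queue):
    # Standby worker: pay the initialization cost up front, before any task arrives.
    init_cuda_and_runtime()
    while True:
        task = task_queue.get()            # block until the controller activates this worker
        run_task(task)

def switch_to(new_task, standby_queue, active_proc=None):
    # Stop (or wait for) the active worker; its cleanup overlaps with the
    # standby worker starting the new task, hiding both overheads while
    # keeping process-level isolation between tasks.
    if active_proc is not None:
        active_proc.terminate()
    standby_queue.put(new_task)

if __name__ == "__main__":
    q = mp.Queue()
    standby = mp.Process(target=worker_loop, args=(q,), daemon=True)
    standby.start()                        # initialized ahead of time, off the critical path
    switch_to("task-0", q)
    standby.join(timeout=1)
```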
PipeSwitch Design Active-Standby Worker Switching
Reference Performance of GPU: https://swconsulting.tistory.com/1364 GPU cluster: https://m.blog.naver.com/PostView.nhn?blogId=uclick2016&logNo=221926240897&proxyReferer=https:%2F%2Fwww.google.com%2F PipeSwitch presentation PDF: https://www.usenix.org/sites/default/files/conference/protected-files/osdi20_slides_bai.pdf PipeSwitch review: https://developpaper.com/%E3%80%90osdi20%E3%80%91pipeswitch-fast-pipelined-context-switching-for-dl/