PipeSwitch: Fast Pipelined Context Switching for DL Applications


PipeSwitch is a system presented at OSDI 2020 that enables GPU-efficient multiplexing of deep learning applications with fine-grained time-sharing. It achieves millisecond-scale context switching latencies and high throughput by optimizing GPU memory allocation and model transmission. The design combines pipelined context switching with standby worker initialization to improve performance for memory-intensive DL workloads. With a structured approach to managing GPU resources, PipeSwitch addresses the limited GPU memory and slow switching speeds commonly faced in deep learning tasks.

  • Pipelined Context Switching
  • DL Applications
  • GPU Efficiency
  • Deep Learning
  • GPU Memory Optimization

Presentation Transcript


  1. PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications. OSDI 2020. Zhihao Bai, Zhen Zhang, Yibo Zhu, Xin Jin (Johns Hopkins University, ByteDance Inc.). Presented by Goeun Lee

  2. Index - Introduction - PipeSwitch Overview - PipeSwitch Design - Evaluation

  3. Introduction

  4. Introduction 1. GPU cluster

  5. Introduction 1. GPU cluster

  6. Introduction 1. GPU cluster

  7. Introduction 2. Fine-Grained Time-Sharing GPU The gap: precious GPU memory and slow switching. - Unlike host memory, which can be several TB, GPU memory is limited and intended for task execution, not for storing the state of idle applications. - Keeping models resident in the GPU cannot support memory-intensive training tasks, or even multiple inference tasks with large models. The opportunity: DL workloads have well-defined structures. - DNN models consist of multiple layers stacked one on another. - The computation of DNN models takes place layer by layer, as sketched below.
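
A minimal sketch (assuming PyTorch; the toy layer stack is illustrative, not from the slides) of the layer-by-layer structure this slide refers to: a DNN is a stack of layers, and the forward pass runs them one after another.

    # Sketch: DNN computation proceeds layer by layer (assumes PyTorch is installed).
    import torch
    import torch.nn as nn

    model = nn.Sequential(              # a toy stack of layers standing in for a real DNN
        nn.Linear(1024, 4096), nn.ReLU(),
        nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 10),
    )

    x = torch.randn(32, 1024)
    for layer in model:                 # computation happens one layer after another,
        x = layer(x)                    # so layer i can run while later layers are still being fetched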

  8. Introduction Goal: fast context switching. Enable GPU-efficient multiplexing of multiple DL apps with fine-grained time-sharing, and achieve millisecond-scale context switching latencies and high throughput.

  9. PipeSwitch Overview

  10. PipeSwitch Overview To start a new task, the controller waits for the current task to finish or preempts it. The standby worker initializes its environment. The memory daemon allocates GPU memory to the standby worker and transmits the model used by the new task, as sketched below.
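
A hedged, high-level sketch of this switching sequence. The class and method names (Controller, assign_memory, transmit_model, etc.) are illustrative stand-ins, not PipeSwitch's actual code.

    # Illustrative control flow for starting a new task on the GPU (hypothetical names).
    class Controller:
        def __init__(self, memory_daemon, active_worker, standby_worker):
            self.daemon, self.active, self.standby = memory_daemon, active_worker, standby_worker

        def start_new_task(self, task):
            self.active.stop_or_wait()                        # 1. wait for / preempt the current task
            # 2. the standby worker's environment (CUDA context etc.) is already initialized
            self.daemon.assign_memory(self.standby)           # 3. hand GPU memory to the standby worker
            self.daemon.transmit_model(task, self.standby)    #    and transmit the new task's model
            self.active, self.standby = self.standby, self.active  # 4. swap roles
            self.active.run(task)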

  11. PipeSwitch Design

  12. PipeSwitch Design Pipelined Model Transmission

  13. PipeSwitch Design Pipelined Model Transmission

  14. PipeSwitch Design Pipelined Model Transmission The basic way to pipeline is at per-layer granularity, but this brings two sources of system overhead: - The overhead of invoking many PCIe calls to transmit the data. - Synchronization overhead between transmission and computation. Grouping is used to minimize these two sources of overhead; a sketch of the pipelined transmission follows.
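
A minimal sketch of pipelining model transmission with computation using two CUDA streams in PyTorch. The group boundaries and layer sizes are made up for illustration; the real system implements this at a lower level, and truly asynchronous copies also require pinned host memory.

    # Sketch: transmit group i+1 over PCIe while computing group i (assumes a CUDA GPU).
    import torch
    import torch.nn as nn

    device = torch.device("cuda")
    copy_stream = torch.cuda.Stream()        # stream for host-to-GPU transmission
    compute_stream = torch.cuda.Stream()     # stream for forward computation

    # Group layers to amortize per-transfer PCIe overhead and synchronization cost.
    groups = [
        nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()),
        nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()),
        nn.Sequential(nn.Linear(4096, 10)),
    ]

    x = torch.randn(32, 1024, device=device)
    events = []
    for g in groups:                         # issue all transfers on the copy stream
        with torch.cuda.stream(copy_stream):
            g.to(device, non_blocking=True)
            ev = torch.cuda.Event()
            ev.record(copy_stream)
            events.append(ev)

    with torch.cuda.stream(compute_stream):
        for g, ev in zip(groups, events):
            compute_stream.wait_event(ev)    # compute a group only after its weights have arrived
            x = g(x)

    torch.cuda.synchronize()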

  15. PipeSwitch Design Pipelined Model Transmission
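
This slide gives the condition for choosing an optimal grouping. As a stand-in rather than the paper's exact formula, here is a hedged sketch of the two-stage pipeline cost model that grouping trades off, with hypothetical per-group transmission times T and computation times C.

    # Sketch: makespan of pipelined transmission + computation for a given grouping.
    # T[g]: PCIe transmission time of group g, C[g]: GPU computation time of group g.
    def pipelined_makespan(T, C):
        transmit_done = 0.0
        compute_done = 0.0
        for t, c in zip(T, C):
            transmit_done += t                                   # transfers are serialized on PCIe
            compute_done = max(compute_done, transmit_done) + c  # compute waits for its own data
        return compute_done

    # Example: finer groups allow more overlap, coarser groups cut per-call overhead.
    print(pipelined_makespan([2, 2, 2], [3, 3, 3]))   # three groups -> 11.0
    print(pipelined_makespan([6], [9]))               # one group, no overlap -> 15.0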

  16. PipeSwitch Design Unified Memory Management A naïve solution for GPU memory management is to use the native cudaMalloc or cudaMallocManaged functions for allocation. - DL applications have large models and generate large amounts of intermediate results. - The native cudaMalloc function and CUDA unified memory are designed for general-purpose applications. A DL task stores the DNN model and the intermediate results. The DNN model is fixed, and the intermediate results change in a simple, regular pattern. Intermediate results in a training task are released first-in-last-out, so memory allocation and release can be handled by a stack-like mechanism, sketched below.
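
A hedged sketch of the stack-like allocation the slide describes (illustrative only, not PipeSwitch's allocator): because intermediate results are released in first-in-last-out order, allocation can be a pointer bump and release a pointer rewind over one pre-allocated GPU region.

    # Sketch: stack-like memory pool for intermediate results (pure-Python model of the idea).
    class StackAllocator:
        def __init__(self, capacity_bytes):
            self.capacity = capacity_bytes
            self.top = 0                      # current stack pointer (offset into the pool)

        def alloc(self, size):
            if self.top + size > self.capacity:
                raise MemoryError("pool exhausted")
            offset = self.top
            self.top += size                  # allocation is just a pointer bump
            return offset

        def free_to(self, offset):
            self.top = offset                 # first-in-last-out release: rewind the pointer

    pool = StackAllocator(capacity_bytes=8 << 30)   # e.g. one large region obtained once from the GPU
    act1 = pool.alloc(256 << 20)                    # activations of layer 1
    act2 = pool.alloc(256 << 20)                    # activations of layer 2
    pool.free_to(act2)                              # the backward pass frees them in reverse order
    pool.free_to(act1)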

  17. PipeSwitch Design Unified Memory Management - Minimize memory allocation overhead - Minimize memory footprint and avoid extra memory copies - Minimize IPC overhead - Pin memory (see the sketch below)
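
For the pin-memory point, a minimal PyTorch sketch: keeping the model in pinned (page-locked) host memory lets the host-to-GPU copy run asynchronously via DMA instead of blocking the CPU.

    # Sketch: pinned host memory enables asynchronous host-to-GPU copies (assumes a CUDA GPU).
    import torch

    host_params = torch.empty(64 << 20, pin_memory=True)   # page-locked staging buffer on the host
    gpu_params = torch.empty(64 << 20, device="cuda")

    stream = torch.cuda.Stream()
    with torch.cuda.stream(stream):
        gpu_params.copy_(host_params, non_blocking=True)    # DMA copy can overlap with other work
    stream.synchronize()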

  18. PipeSwitch Design Active-Standby Worker Switching A naïve solution is to use separate processes: start the new task only after the current task has stopped. Another solution is to use one process, where the current and new tasks share the same process. The active-standby worker switching mechanism hides the overhead of both task cleaning and task initialization, and also ensures process-level isolation; a sketch follows.
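
A hedged sketch of the active-standby idea (process and queue names are illustrative): standby workers are separate processes whose initialization happens ahead of time, so a switch only has to hand a task to a ready worker while keeping process-level isolation.

    # Sketch: pre-initialized standby worker processes (illustrative, not PipeSwitch's code).
    import multiprocessing as mp

    def worker(task_queue, done_queue):
        # Initialization (CUDA context, allocator warm-up) happens once, off the critical path.
        while True:
            task = task_queue.get()          # block until the controller activates this worker
            if task is None:
                break
            # ... run inference or a training iteration for `task` ...
            done_queue.put(task)

    if __name__ == "__main__":
        tasks, done = mp.Queue(), mp.Queue()
        standby = [mp.Process(target=worker, args=(tasks, done)) for _ in range(2)]
        for p in standby:
            p.start()                        # workers initialize now, before any switch happens
        tasks.put("infer:resnet152")         # switching = handing a task to an already-ready worker
        print(done.get())
        for _ in standby:
            tasks.put(None)
        for p in standby:
            p.join()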

  19. PipeSwitch Design Active-Standby Worker Switching

  20. PipeSwitch Design Active-Standby Worker Switching

  21. Evaluation

  22. Evaluation

  23. Evaluation

  24. Evaluation

  25. Evaluation

  26. Reference
  - Performance of GPU: https://swconsulting.tistory.com/1364
  - GPU cluster: https://m.blog.naver.com/PostView.nhn?blogId=uclick2016&logNo=221926240897&proxyReferer=https:%2F%2Fwww.google.com%2F
  - PipeSwitch presentation PDF: https://www.usenix.org/sites/default/files/conference/protected-files/osdi20_slides_bai.pdf
  - PipeSwitch review: https://developpaper.com/%E3%80%90osdi20%E3%80%91pipeswitch-fast-pipelined-context-switching-for-dl/
