Optimized Environment for Transparent Virtualization of GPUs

Explore the VOCL framework for transparent virtualization of GPUs, addressing challenges in GPU computing provisioning, offering efficient resource management, and optimizing data transfer. Discover the motivation, contributions, related work, framework details, optimizations, experimental results, and future work outlined in this study.

  • Virtualization
  • GPUs
  • VOCL framework
  • Transparent
  • Optimization




Presentation Transcript


  1. VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units
     Shucai Xiao (1), Pavan Balaji (2), Qian Zhu (3), Rajeev Thakur (2), Susan Coghlan (2), Heshan Lin (1), Gaojin Wen (4), Jue Hong (4), and Wu-chun Feng (1)
     1. Virginia Tech   2. Argonne National Laboratory   3. Accenture Technology Labs   4. Chinese Academy of Sciences

  2. Motivation
     • GPUs are widely used as accelerators for scientific computation
     • Many applications have been parallelized on GPUs
     • Significant speedups are reported relative to execution on CPUs

  3. GPU-Based Supercomputers
     [Figure: examples of GPU-based supercomputers]

  4. Challenges of GPU Computing
     • Provisioning limitations
       - Not all computing nodes are configured with GPUs
       - Budget and power-consumption considerations
       - Multiple stages of investment
     • Programmability
       - Current GPU programming models: CUDA and OpenCL
       - CUDA and OpenCL support only the use of local GPUs

  5. Our Contributions
     • Virtual OpenCL (VOCL) framework for transparent virtualization of GPUs
       - Remote GPUs appear as virtual local GPUs
       - A program can use non-local GPUs
       - A program can use more GPUs than can be installed locally
     • Efficient resource management
     • Optimization of data transfer across different machines
     [Figure: software stack, applications call VOCL, which dispatches to native OpenCL locally or over MPI to remote nodes]

  6. Outline
     • Motivation and Contributions
     • Related Work
     • VOCL Framework
     • VOCL Optimization
     • Experimental Results
     • Conclusion and Future Work

  7. Existing Frameworks for GPU Virtualization
     • rCUDA
       - Good performance: relative overhead is about 2% compared to execution on a local GPU (GeForce 9800)
       - Lacks support for CUDA C extensions such as the kernel<<<...>>>() launch syntax
       - Partial support for asynchronous data transfer
     • MOSIX-VCL
       - Transparent virtualization
       - Large overhead even for local GPUs (average overhead: 25.95% local, 317.42% remote)
       - No support for asynchronous data transfer

  8. Outline
     • Motivation and Contributions
     • Related Work
     • VOCL Framework
     • VOCL Optimization
     • Experimental Results
     • Conclusion and Future Work

  9. Virtual OpenCL (VOCL) Framework Components
     • VOCL library, on the local node
     • VOCL proxy process, on each remote node
     [Figure: the application on the local node calls the OpenCL API implemented by the VOCL library, which communicates via MPI with proxy processes on remote nodes; each proxy calls the native OpenCL library to drive its GPUs]

  10. VOCL Library
     • Located on each local node; implements the OpenCL functionality
     • Application Programming Interface (API) compatibility
       - VOCL functions have the same interfaces as their OpenCL counterparts
       - VOCL is transparent to application programs
     • Application Binary Interface (ABI) compatibility
       - No recompilation is needed
       - Statically linked programs need relinking; dynamically linked programs preload the library via an environment variable (see the sketch below)
     • Deals with both local and remote GPUs in a system
       - Local GPUs: calls native OpenCL functions
       - Remote GPUs: uses MPI functions to forward the calls to remote nodes
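  As a hedged illustration of this transparency (my sketch, not code from the paper): the host program below uses only standard OpenCL calls and compiles against an ordinary OpenCL header. Under VOCL it would run unmodified; preloading the VOCL library (for example via LD_PRELOAD on Linux, an assumption about the environment-variable mechanism the slide mentions) lets VOCL intercept each call and decide whether to execute it locally or forward it over MPI.

     /* Unmodified OpenCL host code; the application has no knowledge of
      * whether the GPUs it finds are local or VOCL-virtualized. */
     #include <CL/cl.h>
     #include <stdio.h>

     int main(void)
     {
         cl_platform_id platform;
         cl_device_id device;
         cl_uint num;

         if (clGetPlatformIDs(1, &platform, &num) != CL_SUCCESS)
             return 1;
         if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, &num)
                 != CL_SUCCESS)
             return 1;
         printf("Found %u GPU device(s)\n", num);
         return 0;
     }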

  11. VOCL Abstraction: GPUs on Multiple Nodes
     • OpenCL object handle values
       - On the same node, each OpenCL object has a unique handle value (OCLH1 != OCLH2)
       - On different nodes, different OpenCL objects could share the same handle value (OCLH2 == OCLH3)
     • Each OpenCL object is therefore translated to a VOCL object with a distinct handle value (see the lookup sketch below):

     struct voclObj {
         voclHandle vocl;      /* handle returned to the application */
         oclHandle  ocl;       /* native OpenCL handle on the owning node */
         MPI_Comm   com;       /* communicator to the owning node's proxy */
         int        nodeIndex;
     }
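  A minimal sketch (illustration only, not VOCL source) of looking up a VOCL handle. It assumes the slide's struct is extended with a next pointer and kept in a global list; a real implementation would more likely use a hash table.

     #include <mpi.h>

     typedef void *voclHandle;
     typedef void *oclHandle;

     struct voclObj {
         voclHandle vocl;       /* handle returned to the application */
         oclHandle  ocl;        /* native handle on the owning node   */
         MPI_Comm   com;        /* communicator to that node's proxy  */
         int        nodeIndex;
         struct voclObj *next;  /* assumption: list linkage added     */
     };

     static struct voclObj *voclObjList = NULL;

     static struct voclObj *vocl_lookup(voclHandle h)
     {
         for (struct voclObj *p = voclObjList; p; p = p->next)
             if (p->vocl == h)
                 return p;      /* yields OpenCL handle, comm, node */
         return NULL;           /* unknown handle */
     }

  A VOCL wrapper such as clSetKernelArg() would first translate its handle argument through vocl_lookup(); for a local object it calls the native OpenCL function directly, and for a remote object it forwards the native handle and arguments over com to the proxy on nodeIndex.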

  12. VOCL Proxy
     • Daemon process, initialized by the administrator; located on each remote node
     • Receives data-communication requests in a separate thread (see the dispatch sketch below)
     • Receives input data from, and sends output data to, the application process
     • Calls native OpenCL functions for GPU computation
     [Figure: one application on the local node driving proxies on two remote nodes over MPI; each proxy uses the native OpenCL library to access its GPUs]
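  A minimal receive-dispatch loop as a sketch of the proxy's structure. The wire format (a function ID followed by marshalled arguments) and the tag FUNC_CALL_TAG are inventions for illustration; the actual VOCL protocol is not specified on this slide.

     #include <mpi.h>

     #define FUNC_CALL_TAG 100   /* invented message tag */

     void proxy_loop(MPI_Comm app_comm)
     {
         int func_id;
         MPI_Status status;

         for (;;) {
             /* Wait for the next forwarded OpenCL call from any
              * application process. */
             MPI_Recv(&func_id, 1, MPI_INT, MPI_ANY_SOURCE,
                      FUNC_CALL_TAG, app_comm, &status);
             switch (func_id) {
             /* One case per forwarded OpenCL API: receive the
              * marshalled arguments, call the native function, and
              * send the result back to status.MPI_SOURCE. */
             default:
                 break;
             }
         }
     }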

  13. Outline
     • Motivation and Contributions
     • Related Work
     • VOCL Framework
     • VOCL Optimization
     • Experimental Results
     • Conclusion and Future Work

  14. Overhead in VOCL
     • Local GPUs
       - Translation between OpenCL and VOCL handles
       - Overhead is negligible
     • Remote GPUs
       - Translation between VOCL and OpenCL handles
       - Data communication between different machines
     [Figure: call path from the host through VOCL to OpenCL on the local node, and over the network to OpenCL and the GPU on the remote node]

  15. Data Transfer: Between Host Memory and Device Memory
     • Pipelining approach (see the sketch below)
       - With a single block, each stage is executed one after another
       - With multiple blocks, the transfer of the first stage of one block can be overlapped with the second stage of another block
     • A buffer pool for data staging is pre-allocated in the proxy
     [Figure: data flows from the local node's CPU memory over the network into the proxy's buffer pool, and from there across PCIe into GPU memory]
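  A sketch of the proxy-side write pipeline under stated assumptions: NBLOCKS buffer-pool slots of BLOCK_SIZE bytes, an invented message tag, and a pre-created command queue and device buffer. All names are illustrative, not from the VOCL source.

     #include <mpi.h>
     #include <CL/cl.h>

     #define NBLOCKS    4             /* buffer-pool slots (assumption) */
     #define BLOCK_SIZE (1 << 20)     /* 1 MB per block (assumption)    */
     #define DATA_TAG   101           /* invented message tag           */

     void pipelined_write(MPI_Comm comm, int src, cl_command_queue queue,
                          cl_mem dev_buf, size_t total)
     {
         static char pool[NBLOCKS][BLOCK_SIZE];  /* pre-allocated pool */
         MPI_Request req[NBLOCKS];
         cl_event evt;
         size_t off;
         int i, posted = 0;

         /* Post receives for the first window of blocks. */
         for (off = 0; off < total && posted < NBLOCKS; off += BLOCK_SIZE)
             MPI_Irecv(pool[posted], BLOCK_SIZE, MPI_BYTE, src, DATA_TAG,
                       comm, &req[posted++]);

         for (off = 0, i = 0; off < total;
              off += BLOCK_SIZE, i = (i + 1) % NBLOCKS) {
             size_t len = (total - off < BLOCK_SIZE) ? total - off
                                                     : BLOCK_SIZE;

             /* Wait for block i to arrive over the network ...       */
             MPI_Wait(&req[i], MPI_STATUS_IGNORE);
             /* ... then push it to the GPU while later blocks are
              * still in flight: the two stages overlap across blocks. */
             clEnqueueWriteBuffer(queue, dev_buf, CL_FALSE, off, len,
                                  pool[i], 0, NULL, &evt);

             if (off + (size_t)NBLOCKS * BLOCK_SIZE < total) {
                 /* The GPU must finish reading slot i before MPI may
                  * overwrite it with a later block. */
                 clWaitForEvents(1, &evt);
                 clReleaseEvent(evt);
                 MPI_Irecv(pool[i], BLOCK_SIZE, MPI_BYTE, src, DATA_TAG,
                           comm, &req[i]);
             } else {
                 clReleaseEvent(evt);
             }
         }
         clFinish(queue);   /* drain any writes still in the pipeline */
     }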

  16. Environment for Program Execution
     • Local node: 2 Magny-Cours AMD CPUs, 64 GB memory
     • Remote node
       - Host: 2 Magny-Cours AMD CPUs, 64 GB memory
       - 2 Tesla M2070 GPUs (6 GB global memory each)
       - CUDA 3.2 (OpenCL 1.1 specification)
     • Network connection: QDR InfiniBand
     [Figure: application on the local node, proxy on the remote node, GPUs attached over PCIe, nodes connected by InfiniBand]

  17. Micro-benchmark Results
     • Continuously transfer a window of data blocks one after another, then call clFinish() to wait for completion (a measurement sketch follows):

     for (i = 0; i < N; i++) {
         clEnqueueWriteBuffer(...);
     }
     clFinish();

     • With pipelining, the achieved remote bandwidth rises from about 50% to about 80% of the local GPU's bandwidth as the data block size grows from 512 KB to 32 MB
     [Figure: GPU memory write bandwidth for local OpenCL, VOCL remote with pipelining, and VOCL remote without pipelining, for block sizes from 512 KB to 32 MB]
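  A minimal sketch of how such a bandwidth figure can be measured, assuming queue, buf, and host are already set up; the helper name and timing approach are my illustration, not the paper's benchmark code.

     #include <CL/cl.h>
     #include <sys/time.h>

     double write_bandwidth(cl_command_queue queue, cl_mem buf,
                            const void *host, size_t block, int n)
     {
         struct timeval t0, t1;

         gettimeofday(&t0, NULL);
         for (int i = 0; i < n; i++)
             clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, block, host,
                                  0, NULL, NULL);
         clFinish(queue);               /* wait for the whole window */
         gettimeofday(&t1, NULL);

         double sec = (t1.tv_sec - t0.tv_sec)
                    + (t1.tv_usec - t0.tv_usec) / 1e6;
         return (double)block * n / sec / 1e9;   /* GB/s */
     }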

  18. Kernel Argument Setting
     • On a remote GPU, each clSetKernelArg() call is forwarded from the local node to the remote node, followed by clEnqueueNDRangeKernel():

     __kernel void foo(int a, __global int *b) {}

     int a;
     cl_mem b = clCreateBuffer(...);
     clSetKernelArg(hFoo, 0, sizeof(int), &a);
     clSetKernelArg(hFoo, 1, sizeof(cl_mem), &b);
     clEnqueueNDRangeKernel(..., hFoo, ...);

     • Overhead of kernel execution for aligning one pair of sequences (6K letters) with Smith-Waterman (unit: ms):

     Function name            Local GPU   Remote GPU   Overhead    Calls
     clSetKernelArg                4.33       420.45     416.02   86,028
     clEnqueueNDRangeKernel     1210.85      1316.92     106.07   12,288
     Total time                 1215.18      1737.37     522.19 (42.97%)

  19. Kernel Argument Setting Caching
     • Store the arguments locally on each clSetKernelArg() call; send them to the remote node in a single batch when clEnqueueNDRangeKernel() is called (see the sketch below)
     • Overhead of functions related to kernel execution for aligning the same pair of sequences (unit: ms):

     Function name            Local GPU   Remote GPU   Overhead    Calls
     clSetKernelArg                4.33         4.03      -0.30   86,028
     clEnqueueNDRangeKernel     1210.85      1344.01     133.71   12,288
     Total time                 1215.18      1348.04     132.71 (10.92%)
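  A sketch of argument caching inside a preloaded clSetKernelArg() wrapper. The cache layout, size limits, and flush behavior are illustrative assumptions; a real implementation would key the cache by the kernel handle rather than assume a single kernel.

     #include <CL/cl.h>
     #include <string.h>

     #define MAX_ARGS 64

     struct cachedArg {
         cl_uint index;
         size_t  size;
         char    value[32];   /* assumes scalars and cl_mem handles fit */
     };

     static struct cachedArg argCache[MAX_ARGS];
     static int numCachedArgs = 0;

     cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index,
                           size_t arg_size, const void *arg_value)
     {
         /* Record the argument locally instead of sending one MPI
          * message per call. */
         argCache[numCachedArgs].index = arg_index;
         argCache[numCachedArgs].size  = arg_size;
         if (arg_value != NULL)
             memcpy(argCache[numCachedArgs].value, arg_value, arg_size);
         numCachedArgs++;
         return CL_SUCCESS;
     }

     /* The clEnqueueNDRangeKernel() wrapper would then marshal
      * argCache[0 .. numCachedArgs) into the single kernel-launch
      * message and reset numCachedArgs to 0. */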

  20. Outline
     • Motivation and Contributions
     • Related Work
     • VOCL Framework
     • VOCL Optimization
     • Experimental Results
     • Conclusion and Future Work

  21. Evaluation via Application Kernels
     • Three application kernels: matrix multiplication, matrix transpose, and Smith-Waterman
     • Metrics
       - Program execution time
       - Relative overhead, and its relationship to the percentage of time spent in kernel execution
     [Figure: experimental setup with the application and the proxy connected over InfiniBand]

  22. Matrix Multiplication
     • Multiple problem instances are issued consecutively:

     for (i = 0; i < N; i++) {
         clEnqueueWriteBuffer(...);
         clEnqueueNDRangeKernel(...);
         clEnqueueReadBuffer(...);
     }
     clFinish();

     • Kernel execution accounts for a large share of the total time, so the slowdown of VOCL with a remote GPU relative to local execution stays within a few percent for matrix sizes from 1K x 1K to 6K x 6K
     [Figures: time percentage of kernel execution, and program execution time with percentage of slowdown, for OpenCL, VOCL local, and VOCL remote]

  23. Matrix Transpose
     • Multiple problem instances are issued consecutively (same loop structure as matrix multiplication)
     • Kernel execution takes only a small percentage of the total time (under 8%), so data transfer dominates and the remote slowdown is much larger than for matrix multiplication
     [Figures: time percentage of kernel execution, and program execution time with percentage of slowdown, for matrix sizes from 1K x 1K to 6K x 6K]

  24. Smith-Waterman
     • Each alignment issues many kernel launches:

     for (i = 0; i < N; i++) {
         clEnqueueWriteBuffer(...);
         for (j = 0; j < M; j++) {
             clEnqueueNDRangeKernel(...);
         }
         clEnqueueReadBuffer(...);
     }
     clFinish();

     • Two observations
       1. Smith-Waterman needs a large number of kernel launches, and a large number of small messages are transferred
       2. MPI in the proxy is initialized to support multiple threads, which handles small-message transfers poorly
     [Figures: time percentage of kernel execution, and program execution time with percentage of slowdown, for sequence sizes from 1K to 6K]

  25. Outline
     • Motivation and Contributions
     • Related Work
     • VOCL Framework
     • VOCL Optimization
     • Experimental Results
     • Conclusion and Future Work

  26. Conclusions
     • Virtual OpenCL framework
       - Based on the OpenCL programming model
       - Uses MPI internally for data communication
     • VOCL framework optimizations
       - Kernel argument caching
       - Pipelined GPU memory writes and reads
     • Application kernel verification
       - SGEMM, n-body, matrix transpose, and Smith-Waterman
       - Reasonable virtualization cost

  27. Future Work
     • Extensions to the VOCL framework
       - Live task migration (already done)
       - Super-GPU
       - Performance model for GPU utilization
       - Resource management strategies
       - Energy-efficient computing

  28. For More Information
     Shucai Xiao
     Email: shucai@vt.edu
     Synergy website: http://synergy.cs.vt.edu/
     Thanks. Questions?
