Strategies for Large-Scale Matrix Multiplication with Multi-GPUs


Addressing the challenges of solving large-scale matrix multiplications with multiple GPUs: limited GPU RAM size, overlapping communications and computations, reducing communication, load balancing, and ease of use. The presentation covers the tile algorithm, Level 3 BLAS in tiles, an efficient pipeline with full overlap of communications and computations, and the use of temporal locality for tile reuse and cache management. It also shows how to optimize GPU performance through tile-caching strategies, efficient LRU operations, and heap-based memory management that avoids the performance degradation caused by high-frequency mallocs.

  • Multi-GPU
  • Large-scale
  • Matrix Multiplication
  • Tile Algorithm
  • Cache Management


Presentation Transcript


  1. BLASX. Linnan Wang, Georgia Institute of Technology; Wei Wu, University of Tennessee, Knoxville; Yang Yi, NEC Laboratories, USA; Zenglin Xu, UESTC; Jianxiong Xiao, Princeton.

  2. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  3. TILE ALGORITHM. Each matrix is partitioned into tiles, and every output tile is computed as C_ij = alpha * sum_k (A_ik * B_kj) + beta * C_ij, iterating over the tile indices i, j, and k.
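To make the tiling concrete, here is a minimal host-side sketch of the tile loop, assuming square T x T tiles, column-major storage, and N divisible by T; the helper tile() and the naive inner loops are illustrative, not BLASX's actual code.

```cuda
// Naive tile algorithm on the host: every output tile C_ij accumulates
// alpha * A_ik * B_kj over k and is scaled by beta once.
const int N = 1024;   // matrix dimension (assumed divisible by T)
const int T = 256;    // tile dimension

// Address of tile (ti, tj) inside a column-major N x N matrix.
inline double* tile(double* M, int ti, int tj) {
    return M + tj * T * N + ti * T;
}

void tiled_gemm(double alpha, double* A, double* B, double beta, double* C) {
    for (int ti = 0; ti < N / T; ++ti)
        for (int tj = 0; tj < N / T; ++tj) {
            double* Cij = tile(C, ti, tj);
            for (int j = 0; j < T; ++j)              // C_ij *= beta
                for (int i = 0; i < T; ++i)
                    Cij[j * N + i] *= beta;
            for (int tk = 0; tk < N / T; ++tk) {     // C_ij += alpha * A_ik * B_kj
                double* Aik = tile(A, ti, tk);
                double* Bkj = tile(B, tk, tj);
                for (int j = 0; j < T; ++j)
                    for (int i = 0; i < T; ++i) {
                        double s = 0.0;
                        for (int k = 0; k < T; ++k)
                            s += Aik[k * N + i] * Bkj[j * N + k];
                        Cij[j * N + i] += alpha * s;
                    }
            }
        }
}
```

Each C_ij depends only on the i-th row of A's tiles and the j-th column of B's tiles, which is what later allows tiles to be streamed to GPUs as independent tasks.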

  4. LEVEL 3 BLAS IN TILES: GEMM, SYRK, TRSM, TRMM, SYR2K, SYMM.

  5. PIPELINE. GPU out-of-core operation, illustrated over steps 1-8 on two streams: in each step one stream transfers tiles (e.g., A00 and B00, then A10 and B01) while the other computes on tiles already resident (e.g., A00*B00, A10*B01), and finished tiles such as C00 and C11 are written back as they complete.

  6. PIPELINE. Perfect overlap of communications and computations.
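A hedged sketch of how a two-stream pipeline like this can be expressed with CUDA streams, assuming tiles of T*T doubles packed contiguously in pinned host memory (cudaMallocHost), which is what lets cudaMemcpyAsync actually overlap with the kernels; the kernel and buffer names are illustrative, and the loop shows only the overlap pattern, not the full GEMM tile schedule.

```cuda
#include <cuda_runtime.h>

#define T 256
#define NTILES 8

// C tile += A tile * B tile, all three packed as T x T column-major blocks.
__global__ void tile_gemm_kernel(const double* A, const double* B, double* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row inside the tile
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column inside the tile
    if (i < T && j < T) {
        double s = 0.0;
        for (int k = 0; k < T; ++k) s += A[k * T + i] * B[j * T + k];
        C[j * T + i] += s;
    }
}

// While one stream computes on its buffer, the other stream transfers the
// next tiles into the second buffer, so communication hides behind compute.
void pipeline(double* hA, double* hB, double* hC) {
    cudaStream_t s[2];
    double *dA[2], *dB[2], *dC[2];
    size_t bytes = (size_t)T * T * sizeof(double);
    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&s[b]);
        cudaMalloc((void**)&dA[b], bytes);
        cudaMalloc((void**)&dB[b], bytes);
        cudaMalloc((void**)&dC[b], bytes);
    }
    dim3 block(16, 16), grid(T / 16, T / 16);
    for (int t = 0; t < NTILES; ++t) {
        int b = t % 2;                               // alternate buffers and streams
        size_t off = (size_t)t * T * T;
        cudaMemcpyAsync(dA[b], hA + off, bytes, cudaMemcpyHostToDevice, s[b]);
        cudaMemcpyAsync(dB[b], hB + off, bytes, cudaMemcpyHostToDevice, s[b]);
        cudaMemcpyAsync(dC[b], hC + off, bytes, cudaMemcpyHostToDevice, s[b]);
        tile_gemm_kernel<<<grid, block, 0, s[b]>>>(dA[b], dB[b], dC[b]);
        cudaMemcpyAsync(hC + off, dC[b], bytes, cudaMemcpyDeviceToHost, s[b]);
    }
    for (int b = 0; b < 2; ++b) cudaStreamSynchronize(s[b]);
}
```

Because each buffer is only ever touched by its own stream, the in-order execution of a stream guarantees that a buffer is not overwritten before its previous result has been copied back.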

  7. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  8. TEMPORAL LOCALITY: tile reuse.

  9. TILE CACHE. L1 tile cache: each GPU's onboard RAM. L2 tile cache: the combined RAM spaces of the GPUs, which are connected through PCIe x16 links and PCIe switches to the host RAM.

  10. L1 TILE CACHE. The L1 tile cache is an LRU: search is O(log n), insertion (LRU in) is O(1), and eviction (LRU out) is O(1).
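An illustrative C++ sketch of an LRU with the complexities on the slide: a std::list keeps recency order so insertion and eviction are O(1), while a std::map index gives O(log n) search (the index update during insertion also costs O(log n) in this sketch). TileKey and the stored GPU address are assumptions, not BLASX's actual data structures.

```cuda
#include <map>
#include <list>
#include <utility>

// Key identifying a tile: which matrix it belongs to and its (ti, tj) index.
struct TileKey {
    int matrix_id, ti, tj;
    bool operator<(const TileKey& o) const {
        if (matrix_id != o.matrix_id) return matrix_id < o.matrix_id;
        if (ti != o.ti) return ti < o.ti;
        return tj < o.tj;
    }
};

class TileLRU {
    using Entry = std::pair<TileKey, void*>;          // key + GPU address of the tile
    std::list<Entry> order_;                          // most recently used at the front
    std::map<TileKey, std::list<Entry>::iterator> index_;
public:
    // Search is O(log n); a hit also promotes the tile to most recently used.
    void* search(const TileKey& k) {
        auto it = index_.find(k);
        if (it == index_.end()) return nullptr;                // miss
        order_.splice(order_.begin(), order_, it->second);     // O(1) promotion
        return it->second->second;
    }
    // "LRU in": O(1) list push (plus the O(log n) index update in this sketch).
    void insert(const TileKey& k, void* gpu_addr) {
        order_.emplace_front(k, gpu_addr);
        index_[k] = order_.begin();
    }
    // "LRU out": O(1) eviction of the least recently used tile.
    void* evict() {
        if (order_.empty()) return nullptr;
        Entry lru = order_.back();
        index_.erase(lru.first);
        order_.pop_back();
        return lru.second;                                     // freed GPU address for reuse
    }
};
```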

  11. L1 TILE CACHE. High-frequency GPU mallocs lead to performance degradation.

  12. L1 TILE CACHE. Solution: pre-allocate a big chunk of GPU RAM and manage it with BLASX_Malloc, i.e., heap-based GPU memory management. The heap is tracked by a meta-data list of blocks, an occupied list, and an empty list, with a hash table over the occupied list that maps tile addresses to block addresses. The result is steady performance.
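A minimal sketch of the idea behind heap-based GPU memory management, assuming fixed-size blocks (one tile per block); the function name blasx_malloc and the layout below are simplifications of the meta-data/occupied/empty lists described on the slide, not the real BLASX_Malloc.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <unordered_map>

// One big pre-allocated GPU chunk, carved into fixed-size blocks.
struct GpuHeap {
    char* base = nullptr;                          // start of the pre-allocated chunk
    size_t block = 0;                              // bytes per block (one tile)
    std::vector<int> empty;                        // free block indices ("empty list")
    std::unordered_map<void*, int> occupied;       // tile address -> block index ("occupied list")
};

void heap_init(GpuHeap& h, size_t block_bytes, int nblocks) {
    h.block = block_bytes;
    cudaMalloc((void**)&h.base, block_bytes * (size_t)nblocks);  // the only real cudaMalloc
    for (int i = nblocks - 1; i >= 0; --i) h.empty.push_back(i);
}

// Allocation never calls cudaMalloc again: pop a block off the empty list.
void* blasx_malloc(GpuHeap& h) {
    if (h.empty.empty()) return nullptr;           // caller must evict a tile from the LRU first
    int idx = h.empty.back();
    h.empty.pop_back();
    void* addr = h.base + (size_t)idx * h.block;
    h.occupied[addr] = idx;
    return addr;
}

// Freeing just moves the block back to the empty list.
void blasx_free(GpuHeap& h, void* addr) {
    auto it = h.occupied.find(addr);
    if (it == h.occupied.end()) return;
    h.empty.push_back(it->second);
    h.occupied.erase(it);
}
```

Because every allocation and free is a cheap list operation on an already-resident chunk, the per-tile cost stays constant instead of degrading with high-frequency device mallocs.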

  13. L2 TILE CACHE. The L2 tile cache consists of multiple LRUs grouped by P2P accessibility (e.g., group 0 holds LRU0 and LRU1, group 1 holds LRU2 and LRU3). On an L1 tile cache miss, the other LRUs in the same group are searched.

  14. TILE CACHE USAGE. L1 tile cache hit: the tile is found in the local LRU and its GPU address is used directly. L1 tile cache miss: search the other LRUs within the P2P group. L2 tile cache hit: initiate a P2P transfer. L2 tile cache miss: retrieve the tile from host RAM.
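A hedged sketch of this lookup path, assuming one LRU index per GPU (simplified to a hash map here) and that peer access between the GPUs in the group has already been enabled with cudaDeviceEnablePeerAccess; all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <unordered_map>
#include <vector>

using TileIndex = std::unordered_map<long long, void*>;   // tile id -> device address

// Returns a device pointer on GPU `dev` that holds the requested tile.
void* fetch_tile(int dev, long long tile_id,
                 std::vector<TileIndex>& lru,              // one index per GPU
                 const std::vector<int>& group,            // GPUs reachable from dev via P2P
                 const void* host_tile,                     // host copy of the tile
                 void* local_buf,                           // a free block on dev (from the heap)
                 size_t bytes, cudaStream_t stream) {
    // L1 hit: the tile already lives in this GPU's onboard RAM.
    auto hit = lru[dev].find(tile_id);
    if (hit != lru[dev].end()) return hit->second;

    // L1 miss: search the other LRUs inside the P2P group (the L2 tile cache).
    for (int peer : group) {
        if (peer == dev) continue;
        auto p = lru[peer].find(tile_id);
        if (p != lru[peer].end()) {
            // L2 hit: GPU-to-GPU P2P transfer, bypassing host RAM.
            cudaMemcpyPeerAsync(local_buf, dev, p->second, peer, bytes, stream);
            lru[dev][tile_id] = local_buf;
            return local_buf;
        }
    }

    // L2 miss: retrieve the tile from host RAM.
    cudaMemcpyAsync(local_buf, host_tile, bytes, cudaMemcpyHostToDevice, stream);
    lru[dev][tile_id] = local_buf;
    return local_buf;
}
```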

  15. TILE CACHE RESULTS. Without the tile cache, communication volume is excessive; with the tile cache, it is much lower.

  16. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  17. DEFINE TASKS. A task solves one output tile C_ij, and tasks are independent: 1. reading a task's inputs is free of data dependencies; 2. concurrently writing a task's output is free of data races; 3. the workload of each task varies.
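As a small illustration, a task can be represented as a plain record; the fields below are assumptions for the sketch, not BLASX's actual layout.

```cuda
// One task owns one output tile C_ij, so concurrent writes cannot conflict.
struct Task {
    int i, j;        // indices of the output tile C_ij this task computes
    int priority;    // higher when the tiles it needs are likely already cached
};
```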

  18. TASK SCHEDULING. Infrastructure: each GPU (GPU0, GPU1) has a computation thread and a reservation station whose slots (1-8) hold a priority, a task, and a slot id tied to the D2H, kernel, and H2D streams; CPU computation threads handle CPU tasks; and a non-blocking task queue allows efficient concurrent queue operations.

  19. COMPUTATION THREADS. GPU computation thread: binds to a dedicated CPU core, submits instructions to its GPU, and overlaps tasks on GPU streams. CPU computation thread: can be disabled or enabled, and solves tasks with a multi-threaded CPU BLAS.

  20. RESERVATION STATION. Each slot of a GPU's reservation station holds a priority (the potential for cache hits), a task (the metadata of a future task), and a slot id (matched with a stream id). The stations are fed by a non-blocking task queue that allows efficient concurrent queue operations.

  21. TASK SCHEDULING. Strategies: demand-driven task assignment, work stealing, and priority scheduling, all built on the non-blocking task queue that allows efficient concurrent queue operations.
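A hedged sketch of demand-driven assignment with work stealing: each worker (one per GPU, plus optional CPU workers) asks for a task only when it has a free pipeline slot, and steals from a peer when its own queue is empty. BLASX uses a non-blocking queue; a mutex-guarded deque is used here only to keep the sketch short.

```cuda
#include <deque>
#include <mutex>
#include <vector>

struct Task { int i, j; };                  // the output tile C_ij this task computes

struct WorkerQueue {                        // one per GPU (and per enabled CPU worker)
    std::deque<Task> q;                     // kept ordered so high-priority tasks sit at the front
    std::mutex m;
};

// Demand driven: a worker calls this only when it has a free pipeline slot.
bool next_task(int self, std::vector<WorkerQueue>& workers, Task& out) {
    {   // 1. Try the worker's own queue first (highest-priority task at the front).
        std::lock_guard<std::mutex> g(workers[self].m);
        if (!workers[self].q.empty()) {
            out = workers[self].q.front();
            workers[self].q.pop_front();
            return true;
        }
    }
    // 2. Work stealing: take from the tail of another worker to balance load.
    for (size_t victim = 0; victim < workers.size(); ++victim) {
        if ((int)victim == self) continue;
        std::lock_guard<std::mutex> g(workers[victim].m);
        if (!workers[victim].q.empty()) {
            out = workers[victim].q.back();
            workers[victim].q.pop_back();
            return true;
        }
    }
    return false;                           // no work left anywhere
}
```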

  22. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  23. USABILITY. BLASX features CBLAS and FORTRAN interfaces and out-of-core computing. Calling BLASX is as easy as OpenBLAS, MKL, or GotoBLAS: just change the BLAS linkage.
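Backward compatibility in practice: an existing CBLAS call needs no source change; only the link line switches to BLASX. The -lblasx flag in the comment below is an assumption about the library name for illustration, not something stated on the slides.

```cuda
// before:  cc app.c -lopenblas        (or MKL, GotoBLAS, ...)
// after:   cc app.c -lblasx           (library name assumed for illustration)
#include <cblas.h>
#include <stdlib.h>

int main(void) {
    int n = 4096;    // may exceed GPU RAM: the multiplication runs out-of-core
    double* A = (double*)calloc((size_t)n * n, sizeof(double));
    double* B = (double*)calloc((size_t)n * n, sizeof(double));
    double* C = (double*)calloc((size_t)n * n, sizeof(double));
    // The standard CBLAS call is unchanged: C = 1.0 * A * B + 0.0 * C, column-major.
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    free(A); free(B); free(C);
    return 0;
}
```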

  24. PERFORMANCE. Benchmarked on 3 K40c GPUs: BLASX vs. cuBLAS-XT vs. SuperMatrix.

  25. PERFORMANCE. Benchmarked on 3 K40c GPUs: BLASX vs. MAGMA vs. cuBLAS-XT vs. SuperMatrix.

  26. PERFORMANCE. Heterogeneous benchmarks on 2 K40c + 2 TITAN X GPUs: BLASX achieves a 4x speedup over cuBLAS-XT.
