Strategies for Large-Scale Matrix Multiplication with Multi-GPUs


Addressing the challenges of solving large-scale matrix multiplications with multiple GPUs: limited GPU RAM size, overlapping communications and computations, reducing communication, load balancing, and ease of use. The presentation covers the tile algorithm, Level 3 BLAS in tiles, an efficient pipeline with full overlap of communications and computations, and the use of temporal locality for tile reuse and cache management. It also shows how to optimize GPU performance through tile-caching strategies, efficient LRU operations, and heap-based memory management that avoids the performance degradation caused by high-frequency mallocs.

  • Multi-GPU
  • Large-scale
  • Matrix Multiplication
  • Tile Algorithm
  • Cache Management


Presentation Transcript


  1. BLASX. Linnan Wang, Georgia Institute of Technology; Wei Wu, University of Tennessee, Knoxville; Yang Yi, NEC Laboratories, USA; Zenglin Xu, UESTC; Jianxiong Xiao, Princeton.

  2. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  3. TILE ALGORITHM. Each matrix is partitioned into tiles, and every output tile is computed as C_ij = alpha * sum_k (A_ik * B_kj) + beta * C_ij, iterating over the tile indices i, j, and k.
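To make the tiling concrete, here is a minimal host-side sketch of the tile loop, assuming square T x T tiles, column-major storage, and N divisible by T; the helper tile() and the naive inner loops are illustrative, not BLASX's actual code.

```cuda
// Naive tile algorithm on the host: every output tile C_ij accumulates
// alpha * A_ik * B_kj over k and is scaled by beta once.
const int N = 1024;   // matrix dimension (assumed divisible by T)
const int T = 256;    // tile dimension

// Address of tile (ti, tj) inside a column-major N x N matrix.
inline double* tile(double* M, int ti, int tj) {
    return M + tj * T * N + ti * T;
}

void tiled_gemm(double alpha, double* A, double* B, double beta, double* C) {
    for (int ti = 0; ti < N / T; ++ti)
        for (int tj = 0; tj < N / T; ++tj) {
            double* Cij = tile(C, ti, tj);
            for (int j = 0; j < T; ++j)              // C_ij *= beta
                for (int i = 0; i < T; ++i)
                    Cij[j * N + i] *= beta;
            for (int tk = 0; tk < N / T; ++tk) {     // C_ij += alpha * A_ik * B_kj
                double* Aik = tile(A, ti, tk);
                double* Bkj = tile(B, tk, tj);
                for (int j = 0; j < T; ++j)
                    for (int i = 0; i < T; ++i) {
                        double s = 0.0;
                        for (int k = 0; k < T; ++k)
                            s += Aik[k * N + i] * Bkj[j * N + k];
                        Cij[j * N + i] += alpha * s;
                    }
            }
        }
}
```

Each C_ij depends only on the i-th row of A's tiles and the j-th column of B's tiles, which is what later allows tiles to be streamed to GPUs as independent tasks.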

  4. LEVEL 3 BLAS IN TILES: GEMM, SYRK, TRSM, TRMM, SYR2K, SYMM.

  5. PIPELINE. GPU out-of-core operation, illustrated over steps 1-8 on two streams: in each step one stream transfers tiles (e.g., A00 and B00, then A10 and B01) while the other computes on tiles already resident (e.g., A00*B00, A10*B01), and finished tiles such as C00 and C11 are written back as they complete.

  6. PIPELINE. Perfect overlap of communications and computations.
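A hedged sketch of how a two-stream pipeline like this can be expressed with CUDA streams, assuming tiles of T*T doubles packed contiguously in pinned host memory (cudaMallocHost), which is what lets cudaMemcpyAsync actually overlap with the kernels; the kernel and buffer names are illustrative, and the loop shows only the overlap pattern, not the full GEMM tile schedule.

```cuda
#include <cuda_runtime.h>

#define T 256
#define NTILES 8

// C tile += A tile * B tile, all three packed as T x T column-major blocks.
__global__ void tile_gemm_kernel(const double* A, const double* B, double* C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // row inside the tile
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // column inside the tile
    if (i < T && j < T) {
        double s = 0.0;
        for (int k = 0; k < T; ++k) s += A[k * T + i] * B[j * T + k];
        C[j * T + i] += s;
    }
}

// While one stream computes on its buffer, the other stream transfers the
// next tiles into the second buffer, so communication hides behind compute.
void pipeline(double* hA, double* hB, double* hC) {
    cudaStream_t s[2];
    double *dA[2], *dB[2], *dC[2];
    size_t bytes = (size_t)T * T * sizeof(double);
    for (int b = 0; b < 2; ++b) {
        cudaStreamCreate(&s[b]);
        cudaMalloc((void**)&dA[b], bytes);
        cudaMalloc((void**)&dB[b], bytes);
        cudaMalloc((void**)&dC[b], bytes);
    }
    dim3 block(16, 16), grid(T / 16, T / 16);
    for (int t = 0; t < NTILES; ++t) {
        int b = t % 2;                               // alternate buffers and streams
        size_t off = (size_t)t * T * T;
        cudaMemcpyAsync(dA[b], hA + off, bytes, cudaMemcpyHostToDevice, s[b]);
        cudaMemcpyAsync(dB[b], hB + off, bytes, cudaMemcpyHostToDevice, s[b]);
        cudaMemcpyAsync(dC[b], hC + off, bytes, cudaMemcpyHostToDevice, s[b]);
        tile_gemm_kernel<<<grid, block, 0, s[b]>>>(dA[b], dB[b], dC[b]);
        cudaMemcpyAsync(hC + off, dC[b], bytes, cudaMemcpyDeviceToHost, s[b]);
    }
    for (int b = 0; b < 2; ++b) cudaStreamSynchronize(s[b]);
}
```

Because each buffer is only ever touched by its own stream, the in-order execution of a stream guarantees that a buffer is not overwritten before its previous result has been copied back.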

  7. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  8. TEMPORAL LOCALITY: tile reuse.

  9. TILE CACHE. L1 tile cache: each GPU's onboard RAM. L2 tile cache: the combined RAM spaces of the GPUs, which are connected through PCIe x16 links and PCIe switches to the host RAM.

  10. L1 TILE CACHE. The L1 tile cache is an LRU: search is O(log n), insertion (LRU in) is O(1), and eviction (LRU out) is O(1).
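An illustrative C++ sketch of an LRU with the complexities on the slide: a std::list keeps recency order so insertion and eviction are O(1), while a std::map index gives O(log n) search (the index update during insertion also costs O(log n) in this sketch). TileKey and the stored GPU address are assumptions, not BLASX's actual data structures.

```cuda
#include <map>
#include <list>
#include <utility>

// Key identifying a tile: which matrix it belongs to and its (ti, tj) index.
struct TileKey {
    int matrix_id, ti, tj;
    bool operator<(const TileKey& o) const {
        if (matrix_id != o.matrix_id) return matrix_id < o.matrix_id;
        if (ti != o.ti) return ti < o.ti;
        return tj < o.tj;
    }
};

class TileLRU {
    using Entry = std::pair<TileKey, void*>;          // key + GPU address of the tile
    std::list<Entry> order_;                          // most recently used at the front
    std::map<TileKey, std::list<Entry>::iterator> index_;
public:
    // Search is O(log n); a hit also promotes the tile to most recently used.
    void* search(const TileKey& k) {
        auto it = index_.find(k);
        if (it == index_.end()) return nullptr;                // miss
        order_.splice(order_.begin(), order_, it->second);     // O(1) promotion
        return it->second->second;
    }
    // "LRU in": O(1) list push (plus the O(log n) index update in this sketch).
    void insert(const TileKey& k, void* gpu_addr) {
        order_.emplace_front(k, gpu_addr);
        index_[k] = order_.begin();
    }
    // "LRU out": O(1) eviction of the least recently used tile.
    void* evict() {
        if (order_.empty()) return nullptr;
        Entry lru = order_.back();
        index_.erase(lru.first);
        order_.pop_back();
        return lru.second;                                     // freed GPU address for reuse
    }
};
```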

  11. L1 TILE CACHE. High-frequency GPU mallocs lead to performance degradation.

  12. L1 TILE CACHE. Solution: pre-allocate a big chunk of GPU RAM and manage it with BLASX_Malloc, i.e., heap-based GPU memory management. The heap is tracked by a meta-data list of blocks, an occupied list, and an empty list, with a hash table over the occupied list that maps tile addresses to block addresses. The result is steady performance.
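A minimal sketch of the idea behind heap-based GPU memory management, assuming fixed-size blocks (one tile per block); the function name blasx_malloc and the layout below are simplifications of the meta-data/occupied/empty lists described on the slide, not the real BLASX_Malloc.

```cuda
#include <cuda_runtime.h>
#include <vector>
#include <unordered_map>

// One big pre-allocated GPU chunk, carved into fixed-size blocks.
struct GpuHeap {
    char* base = nullptr;                          // start of the pre-allocated chunk
    size_t block = 0;                              // bytes per block (one tile)
    std::vector<int> empty;                        // free block indices ("empty list")
    std::unordered_map<void*, int> occupied;       // tile address -> block index ("occupied list")
};

void heap_init(GpuHeap& h, size_t block_bytes, int nblocks) {
    h.block = block_bytes;
    cudaMalloc((void**)&h.base, block_bytes * (size_t)nblocks);  // the only real cudaMalloc
    for (int i = nblocks - 1; i >= 0; --i) h.empty.push_back(i);
}

// Allocation never calls cudaMalloc again: pop a block off the empty list.
void* blasx_malloc(GpuHeap& h) {
    if (h.empty.empty()) return nullptr;           // caller must evict a tile from the LRU first
    int idx = h.empty.back();
    h.empty.pop_back();
    void* addr = h.base + (size_t)idx * h.block;
    h.occupied[addr] = idx;
    return addr;
}

// Freeing just moves the block back to the empty list.
void blasx_free(GpuHeap& h, void* addr) {
    auto it = h.occupied.find(addr);
    if (it == h.occupied.end()) return;
    h.empty.push_back(it->second);
    h.occupied.erase(it);
}
```

Because every allocation and free is a cheap list operation on an already-resident chunk, the per-tile cost stays constant instead of degrading with high-frequency device mallocs.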

  13. L2 TILE CACHE. The L2 tile cache consists of multiple LRUs grouped by P2P accessibility (e.g., group 0 holds LRU0 and LRU1, group 1 holds LRU2 and LRU3). On an L1 tile cache miss, the other LRUs in the same group are searched.

  14. TILE CACHE USAGE. L1 tile cache hit: the tile is found in the local LRU and its GPU address is used directly. L1 tile cache miss: search the other LRUs within the P2P group. L2 tile cache hit: initiate a P2P transfer. L2 tile cache miss: retrieve the tile from host RAM.
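A hedged sketch of this lookup path, assuming one LRU index per GPU (simplified to a hash map here) and that peer access between the GPUs in the group has already been enabled with cudaDeviceEnablePeerAccess; all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <unordered_map>
#include <vector>

using TileIndex = std::unordered_map<long long, void*>;   // tile id -> device address

// Returns a device pointer on GPU `dev` that holds the requested tile.
void* fetch_tile(int dev, long long tile_id,
                 std::vector<TileIndex>& lru,              // one index per GPU
                 const std::vector<int>& group,            // GPUs reachable from dev via P2P
                 const void* host_tile,                     // host copy of the tile
                 void* local_buf,                           // a free block on dev (from the heap)
                 size_t bytes, cudaStream_t stream) {
    // L1 hit: the tile already lives in this GPU's onboard RAM.
    auto hit = lru[dev].find(tile_id);
    if (hit != lru[dev].end()) return hit->second;

    // L1 miss: search the other LRUs inside the P2P group (the L2 tile cache).
    for (int peer : group) {
        if (peer == dev) continue;
        auto p = lru[peer].find(tile_id);
        if (p != lru[peer].end()) {
            // L2 hit: GPU-to-GPU P2P transfer, bypassing host RAM.
            cudaMemcpyPeerAsync(local_buf, dev, p->second, peer, bytes, stream);
            lru[dev][tile_id] = local_buf;
            return local_buf;
        }
    }

    // L2 miss: retrieve the tile from host RAM.
    cudaMemcpyAsync(local_buf, host_tile, bytes, cudaMemcpyHostToDevice, stream);
    lru[dev][tile_id] = local_buf;
    return local_buf;
}
```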

  15. TILE CACHE RESULTS. Without the tile cache, communication volume is excessive; with the tile cache, it is much lower.

  16. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  17. DEFINE TASKS. A task solves one output tile C_ij, and tasks are independent: 1. reading a task's inputs is free of data dependencies; 2. concurrently writing a task's output is free of data races; 3. the workload of each task varies.
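As a small illustration, a task can be represented as a plain record; the fields below are assumptions for the sketch, not BLASX's actual layout.

```cuda
// One task owns one output tile C_ij, so concurrent writes cannot conflict.
struct Task {
    int i, j;        // indices of the output tile C_ij this task computes
    int priority;    // higher when the tiles it needs are likely already cached
};
```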

  18. TASK SCHEDULING. Infrastructure: each GPU (GPU0, GPU1) has a computation thread and a reservation station whose slots (1-8) hold a priority, a task, and a slot id tied to the D2H, kernel, and H2D streams; CPU computation threads handle CPU tasks; and a non-blocking task queue allows efficient concurrent queue operations.

  19. COMPUTATION THREADS. GPU computation thread: binds to a dedicated CPU core, submits instructions to its GPU, and overlaps tasks on GPU streams. CPU computation thread: can be disabled or enabled, and solves tasks with a multi-threaded CPU BLAS.

  20. RESERVATION STATION. Each slot of a GPU's reservation station holds a priority (the potential for cache hits), a task (the metadata of a future task), and a slot id (matched with a stream id). The stations are fed by a non-blocking task queue that allows efficient concurrent queue operations.

  21. TASK SCHEDULING. Strategies: demand-driven task assignment, work stealing, and priority scheduling, all built on the non-blocking task queue that allows efficient concurrent queue operations.
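A hedged sketch of demand-driven assignment with work stealing: each worker (one per GPU, plus optional CPU workers) asks for a task only when it has a free pipeline slot, and steals from a peer when its own queue is empty. BLASX uses a non-blocking queue; a mutex-guarded deque is used here only to keep the sketch short.

```cuda
#include <deque>
#include <mutex>
#include <vector>

struct Task { int i, j; };                  // the output tile C_ij this task computes

struct WorkerQueue {                        // one per GPU (and per enabled CPU worker)
    std::deque<Task> q;                     // kept ordered so high-priority tasks sit at the front
    std::mutex m;
};

// Demand driven: a worker calls this only when it has a free pipeline slot.
bool next_task(int self, std::vector<WorkerQueue>& workers, Task& out) {
    {   // 1. Try the worker's own queue first (highest-priority task at the front).
        std::lock_guard<std::mutex> g(workers[self].m);
        if (!workers[self].q.empty()) {
            out = workers[self].q.front();
            workers[self].q.pop_front();
            return true;
        }
    }
    // 2. Work stealing: take from the tail of another worker to balance load.
    for (size_t victim = 0; victim < workers.size(); ++victim) {
        if ((int)victim == self) continue;
        std::lock_guard<std::mutex> g(workers[victim].m);
        if (!workers[victim].q.empty()) {
            out = workers[victim].q.back();
            workers[victim].q.pop_back();
            return true;
        }
    }
    return false;                           // no work left anywhere
}
```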

  22. MOTIVATION. Question: how to solve large-scale matrix multiplications with multiple GPUs? Challenges: 1. limited GPU RAM size; 2. overlapping communications and computations; 3. reducing communications; 4. load balancing; 5. ease of use and backward compatibility.

  23. USABILITY. BLASX features CBLAS and FORTRAN interfaces and out-of-core computing. Calling BLASX is as easy as OpenBLAS, MKL, or GotoBLAS: just change the BLAS linkage.
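Backward compatibility in practice: an existing CBLAS call needs no source change; only the link line switches to BLASX. The -lblasx flag in the comment below is an assumption about the library name for illustration, not something stated on the slides.

```cuda
// before:  cc app.c -lopenblas        (or MKL, GotoBLAS, ...)
// after:   cc app.c -lblasx           (library name assumed for illustration)
#include <cblas.h>
#include <stdlib.h>

int main(void) {
    int n = 4096;    // may exceed GPU RAM: the multiplication runs out-of-core
    double* A = (double*)calloc((size_t)n * n, sizeof(double));
    double* B = (double*)calloc((size_t)n * n, sizeof(double));
    double* C = (double*)calloc((size_t)n * n, sizeof(double));
    // The standard CBLAS call is unchanged: C = 1.0 * A * B + 0.0 * C, column-major.
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    free(A); free(B); free(C);
    return 0;
}
```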

  24. PERFORMANCE. Benchmarked on 3 K40c GPUs: BLASX vs. cuBLAS-XT vs. SuperMatrix.

  25. PERFORMANCE. Benchmarked on 3 K40c GPUs: BLASX vs. MAGMA vs. cuBLAS-XT vs. SuperMatrix.

  26. PERFORMANCE. Heterogeneous benchmarks on 2 K40c + 2 TITAN X GPUs: BLASX achieves a 4x speedup over cuBLAS-XT.
