Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices
This presentation introduces Sparse-TPU (STPU), an approach that modifies systolic arrays to handle sparse matrix workloads efficiently, achieving roughly 16X speedup and 12X lower energy on average compared to a traditional TPU. It covers matrix packing to create regularity, the STPU dataflow, PE design, packing-algorithm optimizations, and evaluation.
Presentation Transcript
Sparse-TPU: Adapting Systolic Arrays for Sparse Matrices
Xin He, S. Pal, A. Amarnath, S. Feng, D. H. Park, A. Rovinski, H. Ye, Y. Chen, R. Dreslinski, T. Mudge
University of Michigan
Goal of this paper: extend the range of TPU-like accelerators by modifying systolic arrays to execute sparse matrix workloads efficiently, with minimal modification to the original TPU structure (12% area overhead in the floating-point implementation). The proposed Sparse-TPU (STPU) achieves 16X speedup and 12X lower energy on average vs. the TPU.
Outline: matrix packing to create regularity; Sparse-TPU dataflow; Sparse-TPU PE design; packing algorithm optimization; evaluation.
TPU-like Systolic Architecture: the kernel of the TPU is a 2D systolic array. [Figure: animation of operands streaming through the 2D systolic array; the element labels did not survive extraction.] Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." In ISCA, 2017.
TPU-like Systolic Architecture: massively parallel PEs work in synchronized steps, giving high throughput for regular linear algebra functions, but the array is unable to handle irregular operations such as sparse matrix computation. Sparse-TPU (STPU) aims at performing sparse matrix-vector multiplication in a highly scalable way through hardware/algorithm co-design. Jouppi, Norman P., et al. "In-datacenter performance analysis of a tensor processing unit." In ISCA, 2017.
SpMV on a TPU-like systolic array: when the sparse matrix and the vector are mapped directly onto the array, zero entries still occupy PE cells.
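To make the cost concrete, here is a small made-up illustration (a sketch, not from the paper): with a dense mapping, the fraction of PEs holding useful work is simply the matrix density.

```python
import numpy as np

# Toy illustration: map an 8x8 sparse matrix directly onto an 8x8 PE array.
# Every zero entry still occupies a PE, so the fraction of PEs doing useful
# work equals the matrix density.
rng = np.random.default_rng(0)
A = np.where(rng.random((8, 8)) < 0.2, 1.0, 0.0)        # ~20% nonzero pattern
useful_fraction = np.count_nonzero(A) / A.size
print(f"PEs holding a nonzero: {useful_fraction:.0%}")  # roughly 20%
```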
Sparse Matrix Packing: a greedy algorithm for column merging maps the sparse matrix and vector onto the array far more compactly.
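As a rough sketch of what greedy column merging looks like (the function name and toy matrix are illustrative, not the authors' exact implementation), the code below merges columns whose nonzero rows do not overlap, so each merged group can share one physical column of the array:

```python
import numpy as np

def greedy_pack_columns(A):
    """Greedily merge columns of A into collision-free groups (sketch).

    Two columns collide if they have a nonzero in the same row; columns in a
    collision-free group can share one physical column of the systolic array.
    Returns a list of groups, each a list of original column indices.
    """
    groups = []        # merged column groups
    group_masks = []   # boolean row-occupancy mask per group
    for j in range(A.shape[1]):
        col_mask = A[:, j] != 0
        for g, mask in enumerate(group_masks):
            if not np.any(mask & col_mask):     # no shared nonzero row
                groups[g].append(j)
                group_masks[g] = mask | col_mask
                break
        else:                                   # no compatible group: start a new one
            groups.append([j])
            group_masks.append(col_mask)
    return groups

# Example: a 6x6 sparse matrix whose 6 columns pack into 2 merged columns.
A = np.array([[1, 0, 0, 2, 0, 0],
              [0, 3, 0, 0, 0, 4],
              [0, 0, 5, 0, 6, 0],
              [7, 0, 0, 0, 0, 8],
              [0, 9, 0, 1, 0, 0],
              [0, 0, 2, 0, 3, 0]])
print(greedy_pack_columns(A))   # [[0, 1, 2], [3, 4, 5]]
```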
Dataflow with the packed matrix: the prior packed-matrix dataflow of Kung et al. (referred to here as KMZ) streams vector elements in parallel over parallel buses, with MUXes in each PE to select the matching element; this design is not scalable. Kung et al. "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization." In ASPLOS, 2019.
Dataflow of Sparse-TPU (STPU): the key feature is sequential streaming of vector inputs with overlapped execution, in contrast to the parallel streaming dataflow. [Figure: PE arrays comparing the parallel streaming dataflow with the STPU dataflow.]
Dataflow with a large matrix: take a 512x512 sparse matrix as an example and assume matrix packing reduces the number of columns by half. The sparse matrix tiles into blocks B0-B15, while the packed matrix needs only half of the column blocks (B0, B1, B4, B5, B8, B9, B12, B13). Execution drops from 2048 cycles to 1032 cycles, reducing execution time by ~50%.
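One way to reproduce the slide's cycle counts (an assumption about the tiling and about the small pipeline fill/drain term, which the slide does not spell out):

```python
ARRAY = 128                 # 128 x 128 systolic array (from the methodology slide)
N = 512                     # 512 x 512 sparse matrix

row_tiles = N // ARRAY                                    # 4 row blocks
col_tiles = N // ARRAY                                    # 4 column blocks -> 16 tiles B0..B15
cycles_unpacked = row_tiles * col_tiles * ARRAY           # 16 * 128 = 2048 cycles

packed_col_tiles = (N // 2) // ARRAY                      # packing halves the columns -> 2
cycles_packed = row_tiles * packed_col_tiles * ARRAY + 8  # 1024 + assumed fill/drain = 1032

print(cycles_unpacked, cycles_packed)                     # 2048 1032
```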
Sparse-TPU PE modification: two modest modifications to the TPU PE enable simple index matching and support overlapped input. Four PE modes support the STPU dataflow: index mismatch, index matched, multiply-and-accumulate, and bypass.
PE modes. Index matched mode: input X matches the index of the stored weight, and the input value is stored. Index mismatch mode: input X doesn't match the index of the stored weight. Bypass mode: always bypass the partial sum when a PE stores a zero entry. Mult-and-Accu mode: perform the MAC operation when the last element of the group is indicated.
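A non-cycle-accurate functional sketch of these modes (the class and field names are illustrative, and the overlapped-input handling is omitted):

```python
from dataclasses import dataclass

@dataclass
class StpuPe:
    """Functional sketch of one STPU PE, covering the four modes on the slide."""
    weight: float        # stored matrix value (0.0 means the PE holds no nonzero)
    index: int           # index that the stored weight expects to match
    operand: float = 0.0 # input value latched after an index match

    def step(self, x_value, x_index, is_end, partial_in):
        """Consume one streamed (value, index) pair plus the incoming partial sum."""
        if self.weight == 0.0:
            return partial_in                    # bypass mode: zero entry, forward the partial sum
        if x_index == self.index:
            self.operand = x_value               # index-matched mode: store the input value
        # index-mismatch mode: otherwise ignore the input and keep waiting
        if is_end:                               # END flag marks the last element of the group
            return partial_in + self.weight * self.operand   # mult-and-accu mode
        return partial_in
```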
Optimizations to improve matrix packing: matrix packing is NP-complete. Naïve greedy column packing selects whole matrix columns as packing candidates and doesn't allow any collision. Two optimizations improve on this: partition-wise packing and collision-aware packing.
Optimizations to improve matrix packing: matrix partitioning. Horizontally partition the matrix and apply column packing on each partition instead of on the whole matrix; compared with naïve packing of the entire sparse matrix, this improves density by 13.85X.
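A minimal sketch of partition-wise packing, reusing greedy_pack_columns from the earlier sketch (the 128-row partition height matches the array height; both names are illustrative):

```python
def partition_wise_pack(A, partition_rows=128):
    """Pack each horizontal partition of A independently (sketch).

    Columns that collide somewhere in the full matrix often do not collide
    within a single 128-row partition, so per-partition packing yields much
    denser column groups than packing the whole matrix at once.
    """
    packing_per_partition = []
    for r0 in range(0, A.shape[0], partition_rows):
        block = A[r0:r0 + partition_rows, :]                  # one horizontal partition
        packing_per_partition.append(greedy_pack_columns(block))
    return packing_per_partition
```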
Optimizations to improve matrix packing: collision-aware packing relaxes the collision-free constraint during packing.
Optimizations to improve matrix packing. Example: the relaxed-collision method makes it possible to pack a matrix that collision-free packing cannot compress. [Figure: original sparse matrix vs. packed matrix.]
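A sketch of the relaxed-collision idea: allow a column to join a group as long as the number of colliding rows stays under a small threshold (the threshold value, and how the colliding entries are ultimately resolved, are assumptions not covered here):

```python
import numpy as np

def pack_with_collisions(A, max_collisions=2):
    """Relaxed-collision column merging (sketch).

    Unlike collision-free packing, a column may join a group even if a few
    rows collide, provided the collision count stays at or below
    max_collisions. Handling of the colliding entries is omitted.
    """
    groups, masks = [], []
    for j in range(A.shape[1]):
        col_mask = A[:, j] != 0
        for g, mask in enumerate(masks):
            if np.count_nonzero(mask & col_mask) <= max_collisions:
                groups[g].append(j)
                masks[g] = mask | col_mask
                break
        else:
            groups.append([j])
            masks.append(col_mask)
    return groups
```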
Experimental methodology. Systolic array architecture: 128 x 128 PEs, for STPU, TPU, and KMZ. Performance, power, and area (PPA): a cycle-accurate C++ simulator models performance; hardware synthesis of the PE RTL with place and route estimates area (28nm process); PrimeTime estimates power. Datasets: 15 synthetic sparse matrices (size 500 to 10,000; density 0.001 to 0.3) and 13 real-world sparse matrices from the UFL sparse matrix collection. PE number representation: index width 16, value width float32/int8, END width 1.
Experimental Results: PPA of the different PEs. The STPU PE incurs 13% area overhead, while the KMZ PE incurs 112% area overhead relative to the TPU PE. [Figure: PPA comparison of the TPU, STPU, and KMZ PEs.]
Baseline and proposed schemes: TPU-like systolic array; KMZ systolic array; STPU-C0 scheme (partition-wise packing, no collision allowed during packing); STPU-C1 scheme (partition-wise packing with collision-aware packing).
Packing Algorithm Efficiency: density improvement of 140X in one case and 5.7X in another. [Figures: density improvement plotted against (matrix rank, density).]
Tradeoff analysis: evaluation on fixed-size (1000x1000) sparse matrices with different densities. [Figures: int8 and float32 versions, plotted against density.]
Results on UFL sparse matrices, which are very sparse. Compared to the TPU, STPU-C0 and STPU-C1 achieve 24.14X and 21.50X speedup, respectively. In the int8 implementation, STPU-C0 and STPU-C1 achieve 7.83X and 6.60X energy reduction, respectively; for the float32 counterparts, they reduce energy consumption by 33.58X and 28.54X, respectively.
Evaluating Bandwidth: the bandwidth requirement of STPU. For highly sparse matrices, STPU's bandwidth requirement is 5.5X lower than the baseline; for low-sparsity matrices, it is 55% higher than the baseline. [Figure: bandwidth requirement vs. matrix rank at three densities.]
Summary: packing the sparse matrix creates regularity, making it possible to map it onto a systolic array efficiently. STPU makes modest modifications to the PEs to enable a low-cost sequential input dataflow and overlapped data processing. Further optimizations of the packing algorithm improve the scalability of STPU, which handles large sparse matrices (through horizontal partitioning) and matrices of high nonzero density (through relaxed-collision packing).
Thanks!