Analysis of Sparse Convolutional Neural Networks & Deep Compression Techniques

This presentation analyzes the impact of sparsity in convolutional neural networks, focusing on memory efficiency and performance. It covers Deep Compression pruning, Structured Sparsity Learning, and the use of the Caffe framework for building and running CNNs.

  • CNN Analysis
  • Sparse Networks
  • Deep Compression
  • Caffe Framework
  • Neural Networks




Presentation Transcript


  1. Analysis of Sparse Convolutional Neural Networks. Sabareesh Ganapathy, Manav Garg, Prasanna Venkatesh Srinivasan. UNIVERSITY OF WISCONSIN-MADISON

  2. Convolutional Neural Network. State of the art in image classification. Terminology: feature maps, weights. Layers: convolution, ReLU, pooling, fully connected. Example: AlexNet, 2012. (Image taken from mdpi.com)

  3. Sparsity in Networks. Trained networks occupy a large amount of memory (~200 MB for AlexNet), leading to many DRAM accesses. Filters can be sparse; the sparsity percentages of the filters in AlexNet's convolutional layers are shown below.

  Layer   Sparsity (%)
  CONV2   6
  CONV3   7
  CONV4   7

This sparsity is too low, and thresholding weights at inference time leads to accuracy loss, so sparsity must be accounted for during training.

  4. Sparsity in Networks. Two training-time approaches were considered. Deep Compression prunes low-magnitude weights and retrains to recover accuracy; it reports reduced storage and a speedup in fully connected layers. Structured Sparsity Learning (SSL) in deep neural networks adds locality optimization for sparse weights, giving memory savings and a speedup in convolutional layers. Layer sparsity percentages are shown below; a pruning sketch follows.

  Layer   Sparsity (%)
  CONV1   14
  CONV2   54
  CONV3   72
  CONV4   68
  CONV5   58
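
To make the pruning step concrete, here is a minimal NumPy sketch of magnitude-based pruning in the spirit of Deep Compression. The threshold rule, target sparsity, and tensor shape are illustrative assumptions, and the retraining pass that recovers accuracy is not shown.

```python
import numpy as np

def magnitude_prune(weights, sparsity_target):
    # Zero out the smallest-magnitude weights until roughly the requested
    # fraction of entries is zero (the pruning half of Deep Compression;
    # the retraining step that recovers accuracy is omitted).
    flat = np.abs(weights).ravel()
    k = int(sparsity_target * flat.size)           # number of weights to drop
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold             # keep only larger weights
    return weights * mask, mask

# Illustrative example: prune a CONV3-shaped AlexNet filter bank to ~70% sparsity
w = np.random.randn(384, 256, 3, 3).astype(np.float32)
pruned, mask = magnitude_prune(w, 0.70)
print("sparsity: %.2f" % (1.0 - np.count_nonzero(pruned) / pruned.size))
```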

  5. Caffe Framework. Caffe is an open-source framework for building and running convolutional neural networks. It provides Python and C++ interfaces for inference, with source code in C++ and CUDA, and uses an efficient data structure (the Blob) for feature maps and weights. The Caffe Model Zoo is a repository of CNN models; pretrained models for both the base and the compressed versions of AlexNet are available there and were used for this analysis. A small usage sketch follows.
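
A minimal pycaffe sketch of loading a Model Zoo network and inspecting its blobs and weights; the file names are placeholders, and this only illustrates the interface rather than the project's analysis code.

```python
# Assumes Caffe with the Python interface (pycaffe) is installed and that an
# AlexNet deploy.prototxt / .caffemodel pair from the Model Zoo is on disk;
# the file names below are placeholders.
import caffe

caffe.set_mode_cpu()
net = caffe.Net('deploy.prototxt',           # network definition
                'bvlc_alexnet.caffemodel',   # pretrained weights
                caffe.TEST)

# Blobs hold the feature maps; params hold each layer's weights and biases.
for name, blob in net.blobs.items():
    print('feature map', name, blob.data.shape)
for name, params in net.params.items():
    print('weights    ', name, params[0].data.shape)
```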

  6. Convolution = Matrix Multiply. Example (AlexNet CONV1): IFM 227x227x3, filters 11x11x3x96, stride 4, OFM 55x55x96. Each filter looks at an 11x11x3 input volume at 55 locations along both W and H, so the IFM is unrolled into a 363x3025 matrix and the weights into a 96x363 matrix; the OFM is then Weights x IFM. The matrix multiply (GEMM) is implemented with BLAS libraries: MKL on the CPU and cuBLAS on the GPU. An im2col sketch follows.
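
A minimal NumPy sketch of the im2col lowering described above (single image, no padding); Caffe's implementation performs the same reshaping and then calls an MKL or cuBLAS GEMM on the resulting matrices.

```python
import numpy as np

def im2col(ifm, k, stride):
    # Unroll an IFM of shape (C, H, W) into a (C*k*k, num_positions)
    # matrix so that convolution becomes a single GEMM.
    C, H, W = ifm.shape
    out = (H - k) // stride + 1
    cols = np.empty((C * k * k, out * out), dtype=ifm.dtype)
    for i in range(out):
        for j in range(out):
            patch = ifm[:, i*stride:i*stride+k, j*stride:j*stride+k]
            cols[:, i*out + j] = patch.ravel()
    return cols

# AlexNet CONV1 shapes: 227x227x3 input, 96 filters of 11x11x3, stride 4
ifm = np.random.randn(3, 227, 227).astype(np.float32)
weights = np.random.randn(96, 3 * 11 * 11).astype(np.float32)  # 96 x 363

cols = im2col(ifm, k=11, stride=4)   # 363 x 3025
ofm = weights @ cols                 # GEMM: 96 x 3025, i.e. 55x55x96
print(cols.shape, ofm.shape)
```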

  7. Sparse Matrix Multiply. For sparse networks, the weight matrix can be stored in a sparse format such as Compressed Sparse Row (CSR). The matrix is converted into three arrays: array A contains the non-zero values, array JA holds the column index of each element in A, and array IA holds the cumulative count of non-zero values in the preceding rows. This representation saves memory and could allow more efficient computation. Wen-Wei's new Caffe branch for sparse convolution represents convolutional-layer weights in CSR format and uses sparse matrix-multiply routines (CSRMM): the weights are sparse, while the IFM and the output stay dense. The MKL library is used on the CPU and cuSPARSE on the GPU. A CSR sketch follows.
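
A SciPy sketch of the CSR arrays (A, JA, IA) and a CSRMM-style multiply on a toy matrix; the project used MKL and cuSPARSE CSRMM routines, which this only mimics.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny weight matrix with mostly zero entries
dense = np.array([[0, 0, 3, 0],
                  [1, 0, 0, 2],
                  [0, 0, 0, 0],
                  [0, 4, 0, 0]], dtype=np.float32)

csr = csr_matrix(dense)
print("A  (values)      :", csr.data)     # [3. 1. 2. 4.]
print("JA (column index):", csr.indices)  # [2 0 3 1]
print("IA (row pointers):", csr.indptr)   # [0 1 3 3 4]

# CSRMM-style multiply: sparse weights times a dense IFM matrix, dense output
ifm = np.random.randn(4, 5).astype(np.float32)
ofm = csr @ ifm
assert np.allclose(ofm, dense @ ifm)
```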

  8. Analysis Framework. gem5-gpu was initially planned as the simulation framework, but it turned out to be far too slow for networks of this size. The analysis was therefore performed by running Caffe and CUDA programs on native hardware: an AWS system with 2 Intel Xeon cores running at 2.4 GHz for the CPU analysis, and the dodeca system with an NVIDIA GeForce GTX 1080 GPU for the GPU analysis.

  9. Convolutional Layer Analysis. Deep Compression and SSL trained networks were used for the analysis, and both showed similar trends. The memory savings obtained with the sparse representation are given below (a sketch of how such a factor can be estimated follows).

  Layer   Memory saving
  CONV1   1.11
  CONV2   2.48
  CONV3   2.72
  CONV4   2.53
  CONV5   2.553

The time taken for the multiplication was recorded. The conversion time to CSR format was not included, since the weights are sparsified only once for a set of IFMs.
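
A back-of-the-envelope SciPy sketch of how a dense-versus-CSR memory-saving factor can be estimated; the shape and sparsity below are illustrative, and the table above comes from the trained models, not from this random example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def memory_saving(weights):
    # Rough dense-vs-CSR storage ratio for a 2D weight matrix
    # (float32 values, int32 CSR index arrays).
    dense_bytes = weights.size * 4
    csr = csr_matrix(weights)
    sparse_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
    return dense_bytes / sparse_bytes

# Hypothetical CONV3-shaped weight matrix at ~70% sparsity
w = np.random.randn(384, 256 * 3 * 3).astype(np.float32)
w[np.random.rand(*w.shape) < 0.70] = 0.0
print("estimated memory saving: %.2fx" % memory_saving(w))
```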

  10. CPU: CSRMM vs GEMM. CSRMM is slower than GEMM on the CPU; the overhead depends on the sparsity percentage.

  11. GPU: CSRMM vs GEMM. The CSRMM overhead is even larger on the GPU, since GPU dense operations are faster than their CPU counterparts.

  12. Fully Connected Layer (FC) Analysis. Fully connected layers form the final layers of a typical CNN and are implemented as a matrix-vector multiply (GEMV). Caffe's internal data structure (the Blob) was modified to represent the FC-layer weights in sparse format, and sparse matrix-vector multiplication (SpMV) was used for the sparse computation. The Deep Compression model was used for the analysis. (Image taken from petewarden.com) A small SpMV sketch follows.
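
A SciPy sketch of the SpMV idea for an FC layer; the layer size and density are illustrative assumptions, and the project's actual path went through Caffe's modified Blob plus MKL/cuSPARSE routines rather than SciPy.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Hypothetical FC6-sized layer: 4096 outputs, 9216 inputs, ~90% sparse weights
W = sparse_random(4096, 9216, density=0.10, format='csr', dtype=np.float32)
x = np.random.randn(9216).astype(np.float32)   # flattened input feature vector

y_sparse = W @ x                 # SpMV on CSR weights
y_dense  = W.toarray() @ x       # equivalent dense GEMV
assert np.allclose(y_sparse, y_dense, atol=1e-3)
```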

  13. FC Layer Analysis. A speedup of about 3x was observed on both the CPU and the GPU.

  14. Matrix Multiply Analysis. Custom C++ and CUDA programs were written to measure the time taken by the matrix-multiplication routines alone. This made it possible to vary the sparsity of the weight matrix and find the break-even point at which CSRMM becomes faster than GEMM. The size of the weight matrix was chosen to match the largest AlexNet CONV layer, and the zeros were distributed randomly in the weight matrix. A benchmark sketch follows.
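
A rough SciPy/NumPy sketch of such a sparsity sweep; the matrix dimensions and sparsity points are illustrative, and the project's measurements used C++/CUDA programs calling MKL and cuSPARSE rather than this Python code.

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random

# Weight matrix with CONV-layer-like dimensions multiplied against a dense
# IFM matrix; the weight sparsity is swept to look for the point where
# CSRMM catches up with GEMM. Shapes and sparsity points are illustrative.
M, K, N = 384, 3456, 3025
ifm = np.random.randn(K, N).astype(np.float32)

for sparsity in (0.95, 0.97, 0.99):
    W_sparse = sparse_random(M, K, density=1.0 - sparsity,
                             format='csr', dtype=np.float32)
    W_dense = W_sparse.toarray()

    t0 = time.perf_counter(); W_dense @ ifm;  t_gemm  = time.perf_counter() - t0
    t0 = time.perf_counter(); W_sparse @ ifm; t_csrmm = time.perf_counter() - t0
    print("sparsity %.2f: GEMM/CSRMM time ratio %.2f"
          % (sparsity, t_gemm / t_csrmm))
```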

  15. Matrix Multiply Analysis. [Plots: speedup of CSRMM over GEMM for random sparsity on the CPU (y-axis 0 to 2.5) and on the GPU (y-axis 0 to 0.6), with weight-matrix sparsity swept from 0.945 to 0.995 on the x-axis.]

  16. GPU Memory Scaling Experiment. Motivation: since the sparse matrix representation occupies significantly less space than the dense one, a larger working set can fit in the cache/memory system in the sparse case. Implementation: because the weight matrix is the operand passed in sparse format, its size was increased to a larger value.

  17. GPU Results (GEMM vs CSRMM).

  Weight matrix 256 x 1200
  Format   Total memory used (MB)   Runtime (ns)
  Dense    3377                     60473
  Sparse   141                      207910

  Weight matrix 25600 x 24000
  Format   Total memory used (MB)   Runtime (ns)
  Dense    5933                     96334
  Sparse   1947                     184496

While GEMM is still faster in both cases, the GEMM/CSRMM time ratio increases from 0.29 to 0.52 as the weight-matrix dimensions grow.

  18. GPU: Sparse x Sparse. The IFM is also sparse because of the ReLU activation; IFM sparsity in the CONV layers of AlexNet is given below.

  Layer   Sparsity (%)
  CONV2   23.57
  CONV3   56.5
  CONV4   64.7
  CONV5   68.5

The CUSP library was used in a custom program for the sparse x sparse multiply of the IFM and the weights. A speedup of 4x and memory savings of 4.2x were observed compared to the GEMM routine. This has not yet been scaled to the typical dimensions of AlexNet; that work is in progress. A sketch of the sparse x sparse idea follows.
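
A SciPy sketch of a sparse x sparse multiply on the CPU; the project used the CUSP library on the GPU, and the shapes and densities below are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Both operands sparse: CSR weights times a CSR IFM matrix (zeros from ReLU).
# Dimensions and densities are illustrative, not AlexNet's.
W   = sparse_random(384, 2304, density=0.30, format='csr', dtype=np.float32)
IFM = sparse_random(2304, 169, density=0.40, format='csr', dtype=np.float32)

OFM = W @ IFM                      # sparse x sparse -> sparse result
print("output nonzeros:", OFM.nnz, "of", OFM.shape[0] * OFM.shape[1])

# Sanity check against the dense GEMM result
assert np.allclose(OFM.toarray(), W.toarray() @ IFM.toarray(), atol=1e-3)
```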

  19. CONCLUSION. Representing matrices in sparse format results in significant memory savings, as expected. We didn't observe any practical computational benefit for convolutional layers using the library routines provided by MKL and cuSPARSE, on either CPU or GPU. Fully connected layers showed around 3x speedup for layers with high sparsity. With a larger dataset and more GPU memory, we might see the convolutional runtime drop for the sparse representation. Sparse x sparse computation showed promising results and will be implemented in Caffe.

  20. THANK YOU QUESTIONS ? UNIVERSITY OF WISCONSIN-MADISON
