Efficient Hardware Architectures for Deep Neural Network Processing

Discover new hardware architectures designed for efficient deep neural network processing, including SCNN accelerators for compressed-sparse Convolutional Neural Networks. Learn about convolution operations, memory size versus access energy, dataflow decisions for reuse, and Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow strategies. Dive into concepts such as intra and inter PE parallelism, and understand how these architectures enhance performance for inference and deep learning tasks.

  • Hardware Architectures
  • Deep Learning
  • Neural Networks
  • Efficiency
  • Convolution Operations

Presentation Transcript


  1. New hardware architectures for efficient deep net processing

  2. SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. 9 authors @ NVIDIA, MIT, Berkeley, Stanford. ISCA 2017.

  3. Convolution operation
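
For reference, the dense convolution a CNN layer computes can be written as a plain loop nest over batch (N), output channels (K), input channels (C), output positions, and filter positions. The sketch below uses NumPy and the variable names common in accelerator papers (N, K, C, R, S, H, W); it is illustrative and not taken from the slides.

```python
import numpy as np

def conv_layer(inputs, weights):
    """Dense convolution as a plain loop nest (stride 1, no padding).
    inputs:  (N, C, H, W)  input activations
    weights: (K, C, R, S)  filters
    returns: (N, K, H-R+1, W-S+1) output activations
    """
    N, C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out = np.zeros((N, K, H - R + 1, W - S + 1))
    for n in range(N):                          # batch
        for k in range(K):                      # output channels (filters)
            for c in range(C):                  # input channels
                for y in range(H - R + 1):      # output rows
                    for x in range(W - S + 1):  # output cols
                        for r in range(R):      # filter rows
                            for s in range(S):  # filter cols
                                out[n, k, y, x] += (
                                    inputs[n, c, y + r, x + s] * weights[k, c, r, s]
                                )
    return out
```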

  4. Reuse

  5. Memory: size vs. access energy

  6. Dataflow decides reuse

  7. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. N = 1 for inference. Reuse activations: Input Stationary (IS). Reuse filters.
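
A minimal sketch of what the Input Stationary part of the loop order means: each input activation is fetched once and then reused against every filter weight that touches it, with N = 1 as in inference. This is an illustrative reordering of the loop nest above, not the paper's exact pseudocode.

```python
import numpy as np

def conv_input_stationary(inputs, weights):
    """Input-stationary ordering: hold each input activation and reuse it
    across all filters, instead of re-fetching inputs per output.
    inputs:  (C, H, W)    single image (N = 1 for inference)
    weights: (K, C, R, S) filters
    """
    C, H, W = inputs.shape
    K, _, R, S = weights.shape
    out = np.zeros((K, H - R + 1, W - S + 1))
    for c in range(C):
        for y in range(H):
            for x in range(W):
                a = inputs[c, y, x]      # fetched once ...
                for k in range(K):       # ... reused across all K filters
                    for r in range(R):
                        for s in range(S):
                            oy, ox = y - r, x - s  # output coordinate this product feeds
                            if 0 <= oy <= H - R and 0 <= ox <= W - S:
                                out[k, oy, ox] += a * weights[k, c, r, s]
    return out
```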

  8. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Inter-PE parallelism. Intra-PE parallelism. Output coordinate.

  9. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Intra-PE parallelism: Cartesian Product (CP), all-to-all multiplications. Each PE contains F*I multipliers. A vector of F filter weights and a vector of I input activations are fetched. Multiplier outputs are sent to the accumulator to compute partial sums. The accumulator has F*I adders to match multiplier throughput. Each partial sum is written at the matching coordinate in the output activation space.
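
The Cartesian Product step can be sketched as follows: a vector of F weight values and a vector of I activation values (both already non-zero) are multiplied all-to-all, and each of the F*I products is accumulated at the output coordinate derived from its weight and activation coordinates. The coordinate tuples and the dictionary standing in for the PE's accumulator banks below are assumptions for illustration.

```python
def cartesian_product_step(weight_vals, weight_coords, act_vals, act_coords, accum):
    """One PT-IS-CP inner step: all-to-all multiply of F weights by I activations.
    weight_vals:   F non-zero filter weight values
    weight_coords: F tuples (k, r, s), output channel and filter offsets
    act_vals:      I non-zero input activation values
    act_coords:    I tuples (y, x), activation positions in the input plane
    accum:         dict mapping output coordinate (k, oy, ox) to a partial sum,
                   standing in for the PE's accumulator banks
    """
    for w, (k, r, s) in zip(weight_vals, weight_coords):
        for a, (y, x) in zip(act_vals, act_coords):
            # In hardware all F*I products are formed in parallel by F*I
            # multipliers; here the all-to-all structure is a double loop.
            oy, ox = y - r, x - s  # output coordinate this product contributes to
            accum[(k, oy, ox)] = accum.get((k, oy, ox), 0.0) + w * a
    return accum

# Example: F = 2 weights and I = 2 activations give 4 products.
acc = cartesian_product_step(
    [0.5, -1.0], [(0, 0, 0), (1, 1, 1)],
    [2.0, 3.0],  [(4, 4), (5, 5)],
    {})
# acc == {(0, 4, 4): 1.0, (0, 5, 5): 1.5, (1, 3, 3): -2.0, (1, 4, 4): -3.0}
```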

  10. Inter PE parallelism

  11. Last week: Inter PE parallelism

  12. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Planar Tiled (PT): division of the input activation map for inter-PE parallelism. Each PE processes C*Wt*Ht inputs. Output halos: the accumulation buffer contains incomplete partial sums that are communicated to neighboring PEs.
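
A rough sketch of the Planar Tiled part: the H*W activation plane is divided into Ht*Wt tiles, one per PE, and because an R*S filter reaches across tile boundaries, outputs near a tile edge are incomplete and form a halo that must be exchanged with neighboring PEs. The tile bookkeeping and the stride-1 halo widths (R-1 rows, S-1 columns) below are assumptions for illustration.

```python
def plan_tiles(H, W, Ht, Wt, R, S):
    """Split an H x W activation plane into Ht x Wt tiles (one per PE) and
    report the output-halo width each tile exchanges with its neighbors.
    For a stride-1 R x S filter, outputs within R-1 rows or S-1 cols of a
    tile edge also receive contributions from activations owned by the
    neighboring PE, so those partial sums are incomplete locally.
    """
    tiles = []
    for ty in range(0, H, Ht):
        for tx in range(0, W, Wt):
            tiles.append({
                "rows": (ty, min(ty + Ht, H)),  # activation rows owned by this PE
                "cols": (tx, min(tx + Wt, W)),  # activation cols owned by this PE
                "halo_rows": R - 1,             # incomplete partial-sum rows at the edge
                "halo_cols": S - 1,             # incomplete partial-sum cols at the edge
            })
    return tiles

# e.g. a 56x56 plane on an 8x8 PE array with 3x3 filters:
# plan_tiles(56, 56, 7, 7, 3, 3) -> 64 tiles, each with a 2-wide output halo
```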

  13. Planar Tiled-Input Stationary-Cartesian Product-Sparse dataflow, or PT-IS-CP. Inter-PE parallelism. Intra-PE parallelism. Output coordinate.

  14. Last week: sparsity in the FC layer. Dynamic sparsity (input dependent), created by the ReLU operator on activations during inference: activations might be 0. Static sparsity of the network, created by pruning during training: weights might be 0. Weight quantization and sharing: a 4-bit index into a table of shared weights. Only non-zero weights and non-zero activations are stored and processed.
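
To make the FC-layer picture concrete, the sketch below combines the three ideas on the slide: skip zero activations (dynamic sparsity), store only non-zero weights (static sparsity from pruning), and look weight values up through a small table of shared weights via a 4-bit index. The per-row (column, code) storage layout is a simplification assumed for illustration, not the exact format from last week's slides.

```python
import numpy as np

def sparse_fc(act, rows, codebook):
    """One sparse FC layer with weight sharing.
    act:      dense input activation vector (many entries are 0 after ReLU)
    rows:     for each output neuron, a list of (col, code) pairs giving the
              positions of its non-zero weights and their 4-bit codebook indices
    codebook: table of at most 16 shared weight values
    """
    out = np.zeros(len(rows))
    nz_cols = set(np.flatnonzero(act).tolist())   # skip zero activations (dynamic sparsity)
    for j, row in enumerate(rows):
        s = 0.0
        for col, code in row:                     # only non-zero weights are stored (static sparsity)
            if col in nz_cols:
                s += act[col] * codebook[code]    # decode the shared weight via its 4-bit index
        out[j] = s
    return np.maximum(out, 0.0)                   # ReLU creates the next layer's dynamic sparsity
```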

  15. Weight sparsity (statically from pruning) and activation sparsity (dynamically from ReLU) are present in conv layers too.

  16. Weight sparsity (statically from pruning) and activation sparsity (dynamically from ReLU) are present in conv layers too.

  17. Weight sparsity (statically from pruning) and activation sparsity (dynamically from ReLU) are present in conv layers too.

  18. Last week: Compressed format to store non-zero weights

  19. Example compressed weights for PE0: v, x, p.
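
One way to read the v / x / p labels is as a CSC-like compressed format: a value array (v), an index array (x) locating each non-zero within its column, and column pointers (p) into the value array. The exact encoding on the slide may differ (for example, relative run-length indices instead of absolute ones), so the sketch below is an assumption.

```python
import numpy as np

def compress_columns(W):
    """CSC-style compression keeping only non-zero weights.
    Returns three arrays loosely matching the v / x / p labels:
      v : non-zero weight values
      x : row index of each non-zero value within its column
      p : pointer to where each column starts in v (length = num_cols + 1)
    """
    v, x, p = [], [], [0]
    for col in range(W.shape[1]):
        rows = np.flatnonzero(W[:, col])
        v.extend(W[rows, col].tolist())
        x.extend(rows.tolist())
        p.append(len(v))
    return np.array(v), np.array(x), np.array(p)

# Example: only the non-zeros of this 4x4 matrix are stored.
W = np.array([[0., 2., 0., 0.],
              [1., 0., 0., 3.],
              [0., 0., 0., 0.],
              [0., 4., 5., 0.]])
v, x, p = compress_columns(W)
# v = [1, 2, 4, 5, 3], x = [1, 0, 3, 3, 1], p = [0, 1, 3, 4, 5]
```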

  20. Similar compression of zero weights and activations. Not all weights (F) and input activations (I) are stored and fetched in the dataflow.

  21. PE hardware architecture

  22. Other details in the paper: implementation, evaluation.

  23. Next Friday: student presentations. Tuesday: minor 1. 3-4 more lectures on architecture.
