
Data Approximation Framework for Network-On-Chip Architectures
This presentation by Rahul Boyapati, Jiayi Huang, Pritam Majumder, Ki Hwan Yum, and Eun Jung Kim explores approximation in NoCs as a way to achieve higher throughput and mitigate the memory bandwidth bottleneck. Approximation increases data similarity, which improves compression rates, so the framework leverages the inaccuracy tolerance of applications to raise effective bandwidth. Prior hardware approximation work covers compute approximation and storage approximation, but not the NoC. The main idea is to approximate cache blocks at the source before compression, send the compressed network representation, and decode it precisely at the destination. The challenges are hardware cost, latency overhead on the critical path, and quality control, since computing the error for every word adds power and latency; this calls for a light-weight design. The architecture overview shows the tile, NI, and router components and the injection and ejection paths.
Presentation Transcript
APPROX-NoC: A Data Approximation Framework for Network-On-Chip Architectures
Rahul Boyapati, Jiayi Huang, Pritam Majumder, Ki Hwan Yum, Eun Jung Kim

Motivation
- Perfect accuracy is not required: computer vision, machine learning, graph processing
- Large amount of data movement across the NoC: video frames, neuron weights, graph weights
- Leveraging inaccuracy to provide a high-throughput NoC

Hardware Approximation
- Compute approximation
  - Variable-voltage-based ALUs [Esmaeilzadeh et al. ASPLOS 12]
  - Analog-based circuit designs [St. Amant et al. ISCA 14]
  - Neural network acceleration [Esmaeilzadeh et al. MICRO 12, Moreau et al. HPCA 15]
- Storage approximation
  - Approximate main memory [Sampson et al. MICRO 13, Liu et al. ASPLOS 11]
  - Approximate cache [San Miguel et al. MICRO 15, MICRO 16]
- No previous research on approximation in NoCs

Approximation in NoCs
- Why do we need approximation in NoCs?
  - Higher throughput
  - Mitigate the memory bandwidth bottleneck
- Approximation increases data similarity, which improves the compression rate
- Leverage the inaccuracy tolerance of applications to improve effective bandwidth

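Main Idea
[Figure: a cache block passes through VAXX at the source, is compressed, crosses the network, and is decompressed at the destination. Legend: precise encoding/decoding, approximate encoding.]
- Cache block at the source: 0xA 0xB 0xC 0xD 0xE 0xF
- Approximated block after VAXX: 0xA 0xB 0xE 0xD 0xE 0xB
- Encoding table: e0 = uncompressed, e1 = 0xB, e2 = 0xE
- Network representation after compression: e0+0xA e1 e2 e0+0xD e2 e1
- Decompressed block at the destination: 0xA 0xB 0xE 0xD 0xE 0xB
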
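The encode and decode steps of this example can be sketched in Python. This is a minimal illustration, assuming the three-entry encoding table shown on the slide (e0 = uncompressed, e1 = 0xB, e2 = 0xE). The names `encode_block` and `decode_block` are hypothetical, and the approximation step is taken as given: the already approximated block is the input.

```python
# Encoding table from the slide: e1 and e2 stand for whole words,
# e0 marks a word that is sent uncompressed together with its value.
DICTIONARY = {0xB: "e1", 0xE: "e2"}
REVERSE = {code: word for word, code in DICTIONARY.items()}


def encode_block(words):
    """Encode each word as a dictionary code, or as ("e0", word) if it has no entry."""
    encoded = []
    for word in words:
        code = DICTIONARY.get(word)
        encoded.append((code, None) if code is not None else ("e0", word))
    return encoded


def decode_block(encoded):
    """Invert encode_block: look codes up, pass e0 payloads through unchanged."""
    return [payload if code == "e0" else REVERSE[code] for code, payload in encoded]


approximated = [0xA, 0xB, 0xE, 0xD, 0xE, 0xB]   # block after VAXX, from the slide
network = encode_block(approximated)
restored = decode_block(network)
```

With this table, only two of the six approximated words travel uncompressed. The original block (with 0xC and 0xF) would need four uncompressed words, which is the gain the slide illustrates.
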
Challenges
- Value approximation and compression are not cheap
  - Latency overhead (on the critical path)
  - Hardware cost
- Quality control is important
  - Error calculation for every word
  - Power and latency overhead for error compute
- Should be a light-weight design

APPROX-NoC Architecture Overview
[Figure: tiles, each attached through a network interface (NI) to a router. Inside the NI, the injection side (from processor or MC) shows the injection queue, the Approx? check, VAXX, and the compressor (Compr); the ejection side (to processor or MC) shows the decompressor (Decompr) and the ejection queue.]
- VAXX approximates words to similar data to improve the compression rate

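APPROX-NoC Operation Flow Chart
[Figure: flow chart for a cache block entering the NI]
- Approximable? If no, bypass the approximation logic to reduce overhead on the critical path
- If yes, the block enters the Approximate Value Compute Logic (AVCL), which performs data-type-aware approximation
  - Int or float? Float words go through mantissa extraction; integer words go directly to the approximate logic
- The output goes to the compressor; the approximation is integrated with the compression unit in a plug-and-play manner
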
Integer Approximation Datapath
- Simple for integers: the complete word (bits 31..0) is passed to the approximate logic
- Abstraction
  - Calculate the error budget based on the threshold
  - Detect the number of bits in the error budget, e.g. n bits
  - Approximate the least significant (n-1) don't-care bits to obtain compression-friendly data patterns
- Output: approximated integer (bits 31..0)

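The three abstraction steps can be sketched in Python. This is only an illustration under stated assumptions: the word is treated as an unsigned 32-bit value, and the don't-care bits are simply cleared to zero, which is one possible compression-friendly choice and not necessarily what the hardware does. The names `dont_care_bits` and `approximate_int` are hypothetical.

```python
def dont_care_bits(value: int, threshold_pct: int) -> int:
    """Number of low bits that may change without exceeding the error budget."""
    budget = value * threshold_pct // 100      # error budget for this word
    n = budget.bit_length()                    # bits needed to hold the budget
    return max(n - 1, 0)                       # the slide's (n-1) don't-care bits


def approximate_int(value: int, threshold_pct: int) -> int:
    """Clear the don't-care bits of an unsigned 32-bit word (assumed policy)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit word")
    k = dont_care_bits(value, threshold_pct)
    return value & ~((1 << k) - 1)


word = 1000
approx = approximate_int(word, 10)   # budget 100 -> n = 7 -> 6 don't-care bits
```

Any change confined to the low n-1 bits is smaller than 2^(n-1), and an n-bit budget is at least 2^(n-1), so the result stays within the budget. Here 1000 becomes 960, an error of 40 against a budget of 100.
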
Floating-Point Approximation Datapath
- Representation (IEEE 754): $(-1)^{\text{sign}} \times (1 + .\text{mantissa}) \times 2^{(\text{exponent} - \text{bias})}$
  - Bit layout: sign (bit 31), exponent (bits 30-23), mantissa (bits 22-0)
- Abstraction: no floating-point operation is needed for FP approximation
  - Extract the mantissa bits and normalize them as an integer (0..0 1 mantissa, with the leading 1 at bit 23)
  - Approximate like an integer
  - Concatenate the sign and exponent with the approximated mantissa to recover the approximate float value

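A minimal Python sketch of this datapath follows. It uses only integer bit operations on the float32 encoding, as the slide describes. The assumptions are mine: don't-care bits are cleared to zero, and zero, subnormal, infinity and NaN encodings are passed through untouched. The function names are hypothetical.

```python
import struct


def approximate_significand(sig: int, threshold_pct: int) -> int:
    """Integer-style approximation: clear the (n-1) don't-care bits (assumed policy)."""
    n = (sig * threshold_pct // 100).bit_length()
    return sig & ~((1 << max(n - 1, 0)) - 1)


def approximate_float(x: float, threshold_pct: int) -> float:
    """Approximate a float32 value by working on its bit pattern only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    if exponent in (0, 0xFF):                 # zero, subnormal, inf, NaN: pass through
        return struct.unpack("<f", struct.pack("<I", bits))[0]
    sign_exp = bits & 0xFF800000              # bits 31..23 are kept as they are
    sig = (1 << 23) | (bits & 0x7FFFFF)       # 0..0 1 mantissa as a 24-bit integer
    approx = approximate_significand(sig, threshold_pct)
    new_bits = sign_exp | (approx & 0x7FFFFF)  # drop the implicit leading 1 again
    return struct.unpack("<f", struct.pack("<I", new_bits))[0]


result = approximate_float(3.14159, 10)
```

Because the exponent is untouched, the relative error on the 24-bit significand equals the relative error on the float value. With a 10% threshold, 3.14159 becomes 3.0 (about 4.5% error), and its mantissa ends in a long run of zeros.
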
Approximate Value Compute Logic (AVCL)
[Figure: AVCL datapath for a 32-bit word, with int/float multiplexers, float exponent detection, the approximate logic, and an approx? output multiplexer]
- Unified logic for both integer and floating-point one-word data
- Fast error budget compute
  - e: error threshold (0-100)
  - $\mathit{error\_budget} = \mathit{given\_value} \times (e/100) = \mathit{given\_value} / (100/e)$
  - 100/e is predefined (e.g. 100/25 = 4 = binary 100)
  - Only shifting bits

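The shift-only error budget can be sketched as below. The slide's example (e = 25) divides by an exact power of two. For thresholds where 100/e is not a power of two, this sketch rounds the divisor up to the next power of two so the budget is never exceeded; that rounding rule is my assumption, not something the slides state. The function names are hypothetical, and e is restricted to 1..100 because e = 0 means no approximation.

```python
def shift_for_threshold(threshold_pct: int) -> int:
    """Precompute the right shift that stands in for dividing by 100/e.

    Assumption: 100/e is rounded up to the next power of two, so the
    shifted budget never exceeds the exact budget.
    """
    if not 1 <= threshold_pct <= 100:
        raise ValueError("threshold must be in 1..100")
    divisor = -(-100 // threshold_pct)          # ceil(100 / e)
    return (divisor - 1).bit_length()           # ceil(log2(divisor))


def error_budget(value: int, shift: int) -> int:
    """Error budget for a non-negative word using a single right shift."""
    return value >> shift


shift_25 = shift_for_threshold(25)   # 100/25 = 4 = 0b100 -> shift by 2
shift_10 = shift_for_threshold(10)   # 100/10 = 10 -> rounded up to 16 -> shift by 4
```

For a word of 1000, the 25% threshold gives a budget of 250 (exact), and the 10% threshold gives 62, a conservative stand-in for the exact value of 100.
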
APPROX-NoC Implementation Cases
- Plug the VAXX approximate engine into compression units
  - Frequent pattern compression (FP-COMP) [Das et al. HPCA 08]
  - Dictionary-based compression (DI-COMP) [Jin et al. MICRO 08]
- Frequent-Pattern-Based VAXX (FP-VAXX)
  - Approximate the value
  - Compress the approximated pattern
- Dictionary-Based VAXX (DI-VAXX)
  - Use a TCAM to store the tracked approximated patterns
  - Approximation is off the critical path

Frequent Pattern VAXX (FP-VAXX)
[Figure: the given word and the error threshold enter the Approximate Value Compute Logic (AVCL); the resulting approximate pattern enters the frequent pattern compressor, which outputs the encoded pattern]
- First approximate the value with AVCL
- Then compress the approximate pattern using frequent pattern compression

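The two-step FP-VAXX pipeline can be illustrated with a toy example. The pattern table below is a simplified, hypothetical one written for this sketch; it is not the FP-COMP table from the cited work. The approximation again clears the don't-care bits of an unsigned 32-bit word, which is an assumed policy.

```python
def approximate_int(value: int, threshold_pct: int) -> int:
    """Clear the (n-1) don't-care bits derived from the error budget (assumed policy)."""
    n = (value * threshold_pct // 100).bit_length()
    return value & ~((1 << max(n - 1, 0)) - 1)


def classify(word: int) -> str:
    """Return the first matching pattern from a simplified, hypothetical table."""
    if word == 0:
        return "zero"
    if word <= 0xFF:
        return "one_byte"
    if word <= 0xFFFF:
        return "halfword"
    if word & 0xFFFF == 0:
        return "halfword_zero_padded"
    return "uncompressed"


word = 0x12345678
before = classify(word)                      # no pattern matches the precise word
approx = approximate_int(word, 10)           # 24 don't-care bits -> 0x12000000
after = classify(approx)                     # low halfword is now all zeros
```

The precise word matches no pattern, while the approximated word falls into a compressible class. This is the effect the slide relies on: approximation first, then an unmodified compressor.
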
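Dictionary-Based VAXX (DI-VAXX)
[Figure: on fill and update, the word and the error threshold pass through the Approximate Value Compute Logic (AVCL) and the resulting approximate pattern is written to the dictionary with its encoded index; on lookup, the given word is matched against the stored patterns and a match returns the encoded index. Values shown: update word 1001; dictionary patterns 010X (e0) and 10XX (e1); given word 1010; encoded index e1.]
- Use a TCAM to store the approximated patterns
- Precompute the approximate patterns while updating and filling the dictionary
- Approximation is off the critical path
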
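A software stand-in for the TCAM lookup is sketched below, using the 4-bit values from the slide. Each entry is stored as a (value, care mask) pair, and X marks a don't-care bit. The helper names are hypothetical, and reading the update word 1001 as the source of the pattern 10XX is my interpretation of the figure.

```python
def parse_pattern(pattern: str):
    """Turn a ternary string such as "10XX" into a (value, care_mask) pair."""
    value = care = 0
    for ch in pattern:
        value = (value << 1) | int(ch == "1")
        care = (care << 1) | int(ch != "X")
    return value, care


def make_pattern(word: int, dont_care: int, width: int) -> str:
    """Precompute the stored pattern: keep the high bits, mark the low bits as X."""
    bits = format(word, f"0{width}b")
    return bits[: width - dont_care] + "X" * dont_care


def lookup(entries, word: int):
    """Return the index of the first entry whose cared-for bits equal the word's."""
    for index, (value, care) in enumerate(entries):
        if (word & care) == (value & care):
            return index
    return None


dictionary = [parse_pattern("010X"), parse_pattern("10XX")]   # e0 and e1 on the slide
hit = lookup(dictionary, 0b1010)     # matches 10XX
miss = lookup(dictionary, 0b1111)    # matches nothing
```

Because the don't-care bits are decided when an entry is written, the lookup itself is a plain ternary match. That is why the slide can describe the approximation as being off the critical path.
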
Methodology
- Workloads
  - Parsec 3.0
  - SSCA2 graph application
  - Synthetic workload from benchmark traces
- Architecture
  - 32 out-of-order cores at 2 GHz
  - 32 KB L1I$ and 64 KB L1D$, 2-way
  - 2 MB L2 bank and 16 directories
- NoC
  - 4x4 2D concentrated mesh
  - 2 GHz, 3-stage router
  - 4 virtual channels, 4-flit buffer
  - 64-bit flit, X-Y routing
- Tools
  - Gem5 for full-system performance
  - Pin-based simulator for application output error
  - In-house NoC simulator for the synthetic study

Packet Latency and Data Quality
[Figure: packet latency in cycles, broken into Queue_lat, Net_lat and Decode_lat (axis ticks 0 to 30; the values 37 and 48 also appear as labels), together with Data_approx_quality (axis 0.95 to 1), for Baseline, DI-COMP, DI-VAXX, FP-COMP and FP-VAXX on blackscholes, bodytrack, canneal, fluidanimate, streamcluster, swaptions, x264, ssca2 and AVG]
- Synthetic study: benchmark trace permutations, 75% approximable data packets and 10% error threshold
- DI-VAXX reduces latency by 11% and 40% compared to DI-COMP and Baseline
- FP-VAXX reduces latency by 21% and 46% over FP-COMP and Baseline
- For the data-intensive benchmark SSCA2, DI-VAXX outperforms DI-COMP by 22% and FP-VAXX outperforms FP-COMP by 36%
- Data value quality is higher than 97% (< 3% error)

Compression Ratio
[Figure: compression ratio (axis 0 to 3) for DI-COMP, DI-VAXX, FP-COMP and FP-VAXX]
- Synthetic study: benchmark trace permutations, 75% approximable data packets and 10% error threshold
- Approximation can improve the compression ratio by up to 41%
- DI-VAXX and FP-VAXX improve the compression ratio by 10% and 30% in geomean
- A higher compression ratio reduces the number of flits, and thus reduces queuing and contention

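The link between compression ratio and flit count can be made concrete with a small calculation. The 64-bit flit width comes from the methodology slide. The 512-bit (64-byte) cache block is a hypothetical payload size, since the slides do not state one, and header flits are ignored.

```python
import math

FLIT_BITS = 64            # flit width from the methodology slide
BLOCK_BITS = 512          # hypothetical 64-byte cache block (not stated in the slides)


def data_flits(compression_ratio: float, payload_bits: int = BLOCK_BITS) -> int:
    """Flits needed to carry a payload after compression by the given ratio."""
    compressed_bits = math.ceil(payload_bits / compression_ratio)
    return math.ceil(compressed_bits / FLIT_BITS)


uncompressed = data_flits(1.0)
modest = data_flits(1.3)
doubled = data_flits(2.0)
```

Under these assumptions an uncompressed block needs 8 data flits, a ratio of 1.3 needs 7, and a ratio of 2.0 needs 4. Fewer flits per packet means less time spent in queues and less contention for links.
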
Throughput - Uniform Random
[Figure: packet latency in cycles (axis 0 to 60) versus injection rate in flits/cycle/node (axis 0.1 to 0.6) for Baseline, DI-COMP, DI-VAXX, FP-COMP and FP-VAXX]
- Synthetic study: Streamcluster trace permutations, 1:3 data to control packet ratio
- 75% approximable data packets and 10% error threshold
- VAXX improves the throughput by up to 40%

Throughput - Transpose
[Figure: packet latency in cycles (axis 0 to 60) versus injection rate in flits/cycle/node (axis 0.1 to 0.6) for Baseline, DI-COMP, DI-VAXX, FP-COMP and FP-VAXX]
- Synthetic study: Streamcluster trace permutations, 1:3 data to control packet ratio
- 75% approximable data packets and 10% error threshold
- VAXX improves the throughput by up to 69%

Application Error and Full System Performance
[Figure: application error rate (axis 0% to 100%) and normalized performance (axis 0 to 1.2) at data error budgets of 0%, 10% and 20% for blackscholes, bodytrack, canneal, fluidanimate, streamcluster, swaptions, x264 and ssca2]
- Application errors are less than 5% except for streamcluster and swaptions
- Performance is improved by up to 10% and 14% in swaptions and SSCA2

Power Consumption and Area Overhead
[Figure: normalized power (axis 0.6 to 1) for Baseline, DI-COMP, DI-VAXX, FP-COMP and FP-VAXX]
- Approximation power consumption is compensated by flit reduction
- Area overhead (45 nm)
  Scheme    Area overhead
  DI-VAXX   0.0037 mm^2
  FP-VAXX   0.0029 mm^2

Conclusions
- NoC data approximation framework for leveraging inaccuracy to provide high throughput
- Light-weight approximate compute logic that supports both integer and floating-point data
- Low-cost microarchitecture implementations of VAXX
- APPROX-NoC achieves up to 21% average packet latency reduction and 69% throughput improvement

Thank You & Questions Jiayi Huang jyhuang@cse.tamu.edu