
Balancing Cache Capacity and On-Chip Bandwidth with FLEXclusion
Explore the innovative approach of FLEXclusion for balancing cache capacity and on-chip bandwidth in multi-level cache hierarchies. Discover the impact on performance, traffic, and design choices for cache inclusion.
Presentation Transcript
FLEXclusion: Balancing Cache Capacity and On-chip Bandwidth via Flexible Exclusion. Jaewoong Sim, Jaekyu Lee, Moinuddin K. Qureshi, Hyesoon Kim
Outline: Motivation, FLEXclusion, Design, Monitoring & Operation, Extension, Evaluations, Conclusion. 2/26
Introduction. Today's processors have multi-level cache hierarchies, with design options for each level: size, inclusion property, number of levels, and so on. Design choices for cache inclusion: Inclusion: upper-level cache blocks always exist in the lower-level cache. Exclusion: upper-level cache blocks must not exist in the lower-level cache. Non-inclusion: the lower-level cache may contain the upper-level cache blocks. [Figure: upper-level and lower-level caches under inclusion, non-inclusion, and exclusion.] 3/26
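The three inclusion properties above can be read as set relations between the blocks held by the two levels. The following is a minimal sketch under that reading; the function name `classify_inclusion` and the use of plain Python sets of block addresses are illustrative assumptions, not part of the presentation. Note that a non-inclusive hierarchy is allowed to pass through states that happen to look inclusive or exclusive, so this only classifies a single snapshot.

```python
def classify_inclusion(upper: set, lower: set) -> str:
    """Classify one snapshot of the blocks held by two cache levels.

    upper, lower: sets of block addresses (hypothetical representation).
    """
    if upper <= lower:
        # Every upper-level block also exists in the lower level.
        return "inclusive"
    if upper.isdisjoint(lower):
        # No block is held by both levels at once.
        return "exclusive"
    # Some, but not all, upper-level blocks are duplicated below.
    return "non-inclusive"


print(classify_inclusion({1, 2}, {1, 2, 3}))
print(classify_inclusion({1, 2}, {3, 4}))
print(classify_inclusion({1, 2}, {2, 3}))
```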
Trend of Cache Size Ratio. Trend of total non-LLC capacity to LLC capacity: a high ratio indicates more data duplication with inclusion or non-inclusion. Example: L2: 4 x 256KB, L3: 6MB gives more than 15% duplication in the L3. [Figure: ratio of non-LLC to LLC sizes of Intel's processors over the past 10 years; x-axis: year (2000 to 2012); y-axis: ratio of non-LLC to LLC (0 to 0.2); data points run from Pentium III and Pentium 4 parts through Core 2 Duo to Core i3/i5/i7 parts; annotations: "Multi-Core Era Begins", "More Duplication".] For capacity: exclusion is a better option. 4/26
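The "more than 15%" figure follows directly from the sizes quoted on the slide. A short worked calculation, assuming (as the slide does) that only the four private L2 caches count toward non-LLC capacity; the helper name `non_llc_to_llc_ratio` is illustrative:

```python
def non_llc_to_llc_ratio(num_cores: int, l2_kb_per_core: int, llc_kb: int) -> float:
    """Ratio of total private L2 capacity to shared LLC capacity."""
    return (num_cores * l2_kb_per_core) / llc_kb


# Sizes from the slide: L2 = 4 x 256KB, L3 = 6MB.
ratio = non_llc_to_llc_ratio(num_cores=4, l2_kb_per_core=256, llc_kb=6 * 1024)
print(f"{ratio:.3f}")  # 1024 / 6144, about 0.167
```

With inclusion or non-inclusion, up to this fraction of the L3 can hold blocks that are also in the L2 caches, which is the capacity that exclusion recovers.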
On-Chip Traffic. What about on-chip traffic? Each design also has a different impact on on-chip traffic. [Figure: fill and victim flows between L2, L3 (LLC), and DRAM. Exclusive hierarchy: both clean and dirty L2 victims are sent to the L3 (more traffic). Non-inclusive hierarchy: dirty L2 victims are sent to the L3, while clean victims are silently dropped.] For bandwidth: non-inclusion is a better option. 5/26
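The traffic difference comes entirely from how clean L2 victims are handled. A minimal sketch that counts L2-to-L3 victim insertions for a stream of evictions; the function name and the example eviction stream are hypothetical:

```python
def l2_to_l3_insertions(victim_dirty_bits, exclusive: bool) -> int:
    """Count L2 victims that are sent to the L3.

    victim_dirty_bits: one boolean per evicted L2 block (True = dirty).
    exclusive: True models the exclusive hierarchy, False the non-inclusive one.
    """
    # Dirty victims always go to the L3; clean victims go only under exclusion.
    return sum(1 for dirty in victim_dirty_bits if dirty or exclusive)


victims = [False, False, True, False]  # hypothetical stream: 3 clean, 1 dirty
print(l2_to_l3_insertions(victims, exclusive=True))   # 4
print(l2_to_l3_insertions(victims, exclusive=False))  # 1
```

The more read-dominated the workload, the larger the share of clean victims, and the wider the gap between the two hierarchies.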
Static Inclusion. Question: which design do we want to choose? [Figure: scatter plot of benchmarks; x-axis: performance of exclusion relative to non-inclusion (0.95 to 1.25); y-axis: L2->L3 traffic difference (IPKI, 0 to 60); benchmarks plotted: mcf, bwaves, leslie3d, soplex, sphinx3, omnetpp, wrf, bzip2, hmmer, calculix, gcc, h264ref, xalancbmk.] More bandwidth consumption on exclusion: we want to go for non-inclusion. More performance benefit on exclusion: we want to go for exclusion. 6/26
Static Inclusion: Problem. Each policy has its advantages and disadvantages. Non-inclusion provides less capacity but uses on-chip bandwidth more efficiently. Exclusion provides more capacity but uses on-chip bandwidth less efficiently. Workloads have diverse capacity and bandwidth requirements. Problem: no single static cache configuration works best for all workloads. 7/26
Our Solution: Flexible Exclusion. Dynamically change cache inclusion according to the workload requirement! 8/26
Our Solution: Flexible Exclusion. Provide both non-inclusion and exclusion to capture the best of the capacity and bandwidth requirements. Key observation: non-inclusion and exclusion require similar hardware. Benefits of FLEXclusion: reduced on-chip traffic compared to exclusion, and improved performance compared to non-inclusion. 9/26
FLEXclusion Overview. Goal: adapt cache inclusion between non-inclusion and exclusion. Overall design: monitoring logic, plus a few logic blocks in the hardware to control traffic. 11/26
Design. EXCL-REG: controls the L2 clean victim data flow. NICL-GATE: controls incoming blocks from memory. Monitoring and policy decision logic: switches the operating mode. Monitoring logic is required in many modern cache mechanisms! [Figure: L2 cache and last-level cache with the L2 line fill, L3 line fill, and L2 clean victim paths; EXCL-REG, NICL-GATE, and the policy decision and information collection logic.] 12/26
Non-inclusive Mode (PDL signals 0). Clean L2 victims are silently dropped. Incoming blocks are installed into both L2 and L3. Blocks that hit in L3 keep residing in the L3. Non-inclusive mode follows typical non-inclusive behavior. [Figure: datapath with L2 cache, last-level cache, EXCL-REG, NICL-GATE, and the policy decision and information collection logic.] 13/26
Exclusive Mode (PDL signals 1). Clean L2 victims are inserted into L3. Incoming blocks are installed only into L2. Blocks that hit in L3 are invalidated in the L3. Performs similarly to a typical exclusive design except for L3 insertions from L2. [Figure: datapath with L2 cache, last-level cache, EXCL-REG, NICL-GATE, and the policy decision and information collection logic.] 14/26
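The two operating modes differ in only three places: what happens to clean L2 victims, where incoming blocks are installed, and whether an L3 hit invalidates the L3 copy. The toy model below is a sketch of those rules only, under simplifying assumptions that are not from the presentation: both levels are fully associative with LRU replacement, L3 evictions to memory are not modeled, and the class name `FlexHierarchy` and its fields are hypothetical.

```python
from collections import OrderedDict


class FlexHierarchy:
    """Toy L2/L3 pair whose inclusion behavior is selected by one flag."""

    def __init__(self, l2_blocks: int, l3_blocks: int, exclusive: bool = False):
        self.l2 = OrderedDict()  # addr -> dirty flag, oldest entry first
        self.l3 = OrderedDict()
        self.l2_blocks, self.l3_blocks = l2_blocks, l3_blocks
        self.exclusive = exclusive  # stands in for the PDL signal (1 = exclusive)
        self.l2_to_l3 = 0           # victims sent from L2 to L3

    def _insert_l3(self, addr, dirty):
        old_dirty = self.l3.pop(addr, False)  # merge with an existing copy
        self.l3[addr] = dirty or old_dirty
        if len(self.l3) > self.l3_blocks:
            self.l3.popitem(last=False)       # LRU eviction; write-back not modeled

    def _fill_l2(self, addr, dirty):
        if len(self.l2) >= self.l2_blocks:
            victim, victim_dirty = self.l2.popitem(last=False)
            # Dirty victims always go to L3; clean ones only in exclusive mode.
            if victim_dirty or self.exclusive:
                self.l2_to_l3 += 1
                self._insert_l3(victim, victim_dirty)
        self.l2[addr] = dirty

    def access(self, addr, write: bool = False) -> str:
        if addr in self.l2:
            self.l2[addr] = self.l2[addr] or write
            self.l2.move_to_end(addr)
            return "L2 hit"
        if addr in self.l3:
            if self.exclusive:
                dirty = self.l3.pop(addr)     # L3 copy is invalidated
            else:
                self.l3.move_to_end(addr)     # L3 copy stays resident
                dirty = False
            self._fill_l2(addr, dirty or write)
            return "L3 hit"
        if not self.exclusive:
            self._insert_l3(addr, False)      # install into L3 as well as L2
        self._fill_l2(addr, write)
        return "miss"
```

Running the same read-only stream through both settings of the flag shows the trade-off from the earlier slides: the exclusive setting sends every L2 victim to the L3, while the non-inclusive setting sends none but duplicates blocks across levels.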
Requirement Monitoring. The set-dueling method is used to capture the performance and traffic behavior of exclusion and non-inclusion. Sampling sets follow their original behavior and monitor cache misses and insertions. Other sets follow the winning policy. [Figure: L2, PDL, ICL, and LLC; LLC sets 0 to 7 are marked as non-inclusive sets, exclusive sets, or following sets, with cache-miss and insertion events from the sampling sets feeding counters.] 15/26
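Set dueling can be sketched as a fixed mapping from set index to role, plus one pair of counters per sampled policy. The mapping below (one non-inclusive and one exclusive sampling set out of every 32) and all names are illustrative assumptions; the presentation does not state how sampling sets are chosen.

```python
from dataclasses import dataclass, field

NICL, EXCL, FOLLOWER = "non-inclusive", "exclusive", "follower"


def set_role(set_index: int, period: int = 32) -> str:
    """Assign each LLC set a role (hypothetical mapping)."""
    slot = set_index % period
    if slot == 0:
        return NICL      # always behaves non-inclusively
    if slot == 1:
        return EXCL      # always behaves exclusively
    return FOLLOWER      # follows whichever policy is winning


def _zeroed() -> dict:
    return {NICL: 0, EXCL: 0}


@dataclass
class DuelingCounters:
    misses: dict = field(default_factory=_zeroed)
    insertions: dict = field(default_factory=_zeroed)

    def record(self, set_index: int, miss: bool = False, insertion: bool = False) -> None:
        """Count an event only if it happened in a sampling set."""
        role = set_role(set_index)
        if role == FOLLOWER:
            return
        self.misses[role] += int(miss)
        self.insertions[role] += int(insertion)


counters = DuelingCounters()
counters.record(0, miss=True)        # non-inclusive sampling set
counters.record(1, insertion=True)   # exclusive sampling set
counters.record(5, miss=True)        # follower set: ignored
print(counters.misses, counters.insertions)
```

Because follower sets never update the counters, the comparison between the two policies stays unbiased regardless of which mode the rest of the cache is currently running in.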
Operating Region. The decision of the winning policy is made by the Policy Decision Logic (PDL). The basic operating mode is determined by Perfth. Extensions of FLEXclusion use Insertionth for further performance/traffic optimization. [Figure: operating regions; x-axis: exclusion performance relative to non-inclusion (cache miss), with 1.0 and Perfth marked; y-axis: L3 IPKI difference, with Insertionth marked; conditions shown: Miss(NICL) compared with Miss(EX) against Perfth, and Ins(EX) compared with Ins(NICL) against Insertionth; regions: non-inclusive, exclusive (bypass), non-inclusive (aggressive), exclusive.] 16/26
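One possible reading of the operating-region diagram is a two-threshold decision over the sampled miss and insertion counters. The sketch below rests on two assumptions that the transcript does not confirm: that each condition compares a plain difference of counters against its threshold, and that the extensions apply in the quadrants as laid out in the figure (bypass when exclusive mode shows a large insertion gap, aggressive non-inclusion when non-inclusive mode shows a small one). The thresholds Perfth and Insertionth are written `perf_th` and `insertion_th`; the function name and return strings are hypothetical.

```python
def choose_mode(miss_nicl: int, miss_ex: int,
                ins_nicl: int, ins_ex: int,
                perf_th: int, insertion_th: int) -> str:
    """Pick an operating region from sampled counters (assumed conditions)."""
    # Assumption: exclusion wins when it removes enough misses.
    exclusive = (miss_nicl - miss_ex) > perf_th
    # Assumption: traffic is "high" when exclusion inserts far more blocks.
    high_traffic = (ins_ex - ins_nicl) > insertion_th

    if exclusive:
        return "exclusive+bypass" if high_traffic else "exclusive"
    return "non-inclusive" if high_traffic else "non-inclusive+aggressive"


# Exclusion saves many misses but costs many insertions.
print(choose_mode(100, 50, 10, 60, perf_th=20, insertion_th=30))
# Exclusion barely helps and insertion traffic is similar.
print(choose_mode(100, 95, 10, 15, perf_th=20, insertion_th=30))
```

Only the first comparison selects the basic mode; the second refines it, which matches the slide's statement that Insertionth is used by the extensions.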
Extensions of FLEXclusion. Per-core policy: isolates each application's behavior. Aggressive non-inclusion: improves performance in non-inclusive mode. Bypass on exclusive mode: reduces traffic in exclusive mode. Detailed explanations are in the paper. [Figure: L2/LLC flows (hit on LLC, clean victim, line fill from DRAM) for aggressive non-inclusive mode and for bypass on exclusive mode.] 17/26
FLEXclusion Operation. A FLEXclusive cache changes its operating mode at run-time. FLEXclusion does not require any special actions on a switch from non-inclusive to exclusive mode, or on a switch from exclusive to non-inclusive mode. [Figure: timeline of a block in a FLEXclusive hierarchy as the mode changes from non-inclusive to exclusive and back to non-inclusive, showing L2 fills, evictions, dirty evictions, and LLC hits; annotation: "Written back into the same position!"] 18/26
Evaluations. MacSim simulator: a cycle-level in-house simulator (now public); power results with Orion (Wang+ [MICRO 02]). Baseline processor: 4-core, 4.0GHz, private L1 and L2, shared L3. Workloads: Group A (low MPKI): bzip2, gcc, hmmer, h264, xalancbmk, calculix; Group B (high MPKI): mcf, omnetpp, bwaves, soplex, leslie3d, wrf, sphinx3; multi-programmed: 2-MIX-S, 2-MIX-A, 4-MIX-S. Other results in the paper: multi-programmed workloads, per-core, aggressive mode, bypass, threshold sensitivity. 20/26
Evaluations: Performance/Traffic. [Figure, performance: performance relative to exclusion (y-axis 0.75 to 1.05) for non-inclusion and FLEXclusion at 1MB, 2MB, and 4MB, across bzip2, gcc, hmmer, h264ref, xalancbmk, calculix, mcf, omnetpp, bwaves, leslie3d, soplex, wrf, sphinx3, and the average; annotation: "AVG. 6.3% loss for 1MB".] FLEXclusion performs similarly to exclusion: 5.9% improvement over non-inclusion!! [Figure, traffic: L3 IPKI normalized to non-inclusion (y-axis 0 to 35) for exclusion and FLEXclusion at 1MB, 2MB, and 4MB, across the same benchmarks and the average.] 72.6% reduction over exclusion!! 21/26
Evaluations: Effective Cache Size. Running the same benchmark on 1, 2, and 4 cores (4MB L3). [Figure: three panels (1 core, 2 cores, 4 cores) showing, per benchmark and on average, the share (0% to 100%) of exclusive mode versus non-inclusive mode; benchmarks: wrf, leslie3d, calculix, mcf, bwaves, gcc, xalancbmk, soplex, h264, omnetpp, bzip2, sphinx3, hmmer.] With one core, one thread is enjoying the cache!! With more cores, threads are competing for the shared cache, and the FLEXclusive cache is configured in exclusive mode more often!! FLEXclusion adapts inclusion to the effective cache size for each workload!! 22/26
Evaluations: Traffic & Power. What is the impact of the L3 insertion traffic reduction on total traffic? FLEXclusion effectively reduces the traffic. L3 insertion takes up more than 40%! Reduced to ~10% with FLEXclusion!! [Figure, traffic: traffic normalized to exclusion for exclusion, FLEXclusion, and non-inclusion, broken down into Data (L2->L3), Data (MC<->Caches), Data (L3->L2), and Address, across the single-threaded (A+B) benchmarks bzip2, gcc, hmmer, h264, xalancbmk, calculix, mcf, omnetpp, bwaves, leslie3d, soplex, wrf, sphinx3, and the average.] [Figure, power: power consumption normalized to exclusion for exclusion, FLEXclusion, and non-inclusion on A+B, 2-MIX-S, 4-MIX-S, and 2-MIX-A; annotation: "20% Reduction".] 23/26
Conclusions & Future Work. FLEXclusion balances performance and on-chip bandwidth consumption depending on the workload requirement, with negligible hardware changes: 5.9% performance improvement over non-inclusion, and 72.6% L3 insertion traffic reduction over exclusion (20% power reduction). Future work: a more generic FLEXclusion that includes the inclusion property; impact on the on-chip network. 25/26
Q/A Thank you! 26/26