Blacklisting Memory Scheduler for Improved Performance and Fairness
Addressing main memory interference problems, the blacklisting memory scheduler tackles inter-application interference to achieve high performance and fairness at a low cost. By enforcing application-aware memory scheduling and avoiding full ranking, it optimizes performance without sacrificing simplicity.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
The Blacklisting Memory Scheduler Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 1
Main Memory Interference Problem Core Core Main Memory Core Core Causes interference between applications requests 2
Inter-Application Interference Results in Performance Degradation 6 5 Slowdown 4 3 2 1 0 leslie3d (core 0) mcf (core 1) High application slowdowns due to interference 3
Tackling Inter-Application Interference: Application-aware Memory Scheduling Enforce Ranks Request Buffer Monitor Rank Highest Ranked AID 2 1 App. ID (AID) Request 4 2 = = = = = = = = Req 1 Req 2 Req 3 Req 4 Req 5 Req 5 1 4 1 1 3 2 3 3 1 4 Full ranking increases critical path latency and area significantly to improve performance and fairness Req 7 Req 8 1 3 4
Performance vs. Fairness vs. Simplicity Fairness FRFCFS App-unaware PARBS App-aware (Ranking) Ideal ATLAS Low performance and fairness TCM Our Solution (No Ranking) Blacklisting Performance Is it essential to give up simplicity to optimize for performance and/or fairness? Our solution achieves all three goals Complex Very Simple Simplicity 5
Outline Introduction Problems with Application-aware Schedulers Key Observations The Blacklisting Memory Scheduler Design Evaluation Conclusion 6
Outline Introduction Problems with Application-aware Schedulers Key Observations The Blacklisting Memory Scheduler Design Evaluation Conclusion 7
Problems with Previous Application-aware Memory Schedulers 1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns 8
Problems with Previous Application-aware Memory Schedulers 1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns 9
Ranking Increases Hardware Complexity Enforce Ranks Request Buffer Next Highest Highest Ranked AID Ranked AID Monitor Rank App. ID (AID) Request 2 1 Req 1 Req 2 Req 3 Req 4 Req 5 Req 5 1 4 1 1 3 4 = = = = = = = = 4 2 3 3 1 4 Hardware complexity increases with application/core count Req 7 Req 8 1 3 10
Ranking Increases Hardware Complexity From synthesis of RTL implementations using a 32nm library 9 80000 Scheduler Area (in square um) 8 70000 Critical Path Latency (in ns) 7 60000 1.8x 6 50000 App-unaware App-unaware 5 8x 40000 App-aware App-aware 4 30000 3 20000 2 10000 1 Ranking-based application-aware schedulers incur high hardware cost 0 0 11
Problems with Previous Application-aware Memory Schedulers 1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns 12
Ranking Causes Unfair Slowdowns GemsFDTD GemsFDTD (high memory intensity) 30 30 Number of Requests Number of Requests 20 20 App-unaware App-unaware 10 10 Ranking Ranking 0 0 0 0 20 20 40 40 60 60 80 80 100 100 Execution Time (in 1000s of Cycles) Execution Time (in 1000s of Cycles) sjeng sjeng (low memory intensity) 30 Number of Requests Full ordered ranking of applications GemsFDTD denied request service 20 App-unaware 10 Ranking 0 0 20 40 60 80 100 Execution Time (in 1000s of Cycles) 13
Ranking Causes Unfair Slowdowns sjeng GemsFDTD (low memory intensity) (high memory intensity) 8 8 7 7 6 6 Slowdown Slowdown 5 5 App-unaware App-unaware 4 4 Ranking Ranking 3 3 2 2 1 1 0 0 Ranking-based application-aware schedulers cause unfair slowdowns 14
Problems with Previous Application-aware Memory Schedulers 1. Full ranking increases hardware complexity 2. Full ranking causes unfair slowdowns Our Goal: Design a memory scheduler with Low Complexity, High Performance, and Fairness 15
Outline Introduction Problems with application-aware schedulers Key Observations The Blacklisting Memory Scheduler Design Evaluation Conclusion 16
Key Observation 1: Group Rather Than Rank Observation 1: Sufficient to separate applications into two groups, rather than do full ranking Monitor Rank Group Interference Causing Vulnerable 2 1 4 1 2 > 2 3 3 4 3 1 4 Benefit 1: Low complexity compared to ranking 17
Key Observation 1: Group Rather Than Rank GemsFDTD (high memory intensity) 30 Number of Requests 20 App-unaware 10 Ranking Grouping 0 0 20 40 60 80 100 Execution Time (in 1000s of Cycles) sjeng (low memory intensity) No denial of request service 30 Number of Requests 20 App-unaware 10 Ranking Grouping 0 0 20 40 60 80 100 Execution Time (in 1000s of Cycles) 18
Key Observation 1: Group Rather Than Rank sjeng GemsFDTD (low memory intensity) (high memory intensity) 8 8 7 7 6 6 Slowdown Slowdown 5 5 App-unaware App-unaware 4 4 Ranking Ranking Grouping Grouping 3 3 2 2 1 1 0 0 Benefit 2: Lower slowdowns than ranking 19
Key Observation 1: Group Rather Than Rank Observation 1: Sufficient to separate applications into two groups, rather than do full ranking Monitor Rank Group Interference Causing Vulnerable 2 1 4 1 2 > 2 3 3 4 3 1 4 How to classify applications into groups? 20
Key Observation 2 Observation 2: Serving a large number of consecutive requests from an application causes interference Basic Idea: Group applications with a large number of consecutive requests as interference-causing Blacklisting Deprioritize blacklisted applications Clear blacklist periodically (1000s of cycles) Benefits: Lower complexity Finer grained grouping decisions Lower unfairness 21
Outline Introduction Problems with application-aware schedulers Key Observations The Blacklisting Memory Scheduler Design Evaluation Conclusion 22
The Blacklisting Memory Scheduler (BLISS) 1. Monitor 2. Blacklist 1. Monitor 4. Clear Periodically Periodically 1. Monitor 2. Blacklist 3. Prioritize 3. Prioritize 4. Clear Request Buffer Req Blacklist Blacklist 0 1 1 0 0 0 1 0 AID ? ? ? ? ? ? ? ? Req 1 Req 2 Req 3 Req 4 Req 5 Req 6 Req 7 Req 8 Simple and scalable design AID2 AID1 AID1 AID1 AID1 0 1 2 3 0 0 0 0 1 AID Blacklist Last Req AID 0 1 2 3 3 2 1 0 1 0 # Consecutive Requests Last Req AID 3 1 2 3 4 4 # Consecutive Requests 0 0 1 23
Outline Introduction Problems with application-aware schedulers Key Observations The Blacklisting Memory Scheduler Design Evaluation Conclusion 24
Methodology Configuration of our simulated baseline system 24 cores 4 channels, 8 banks/channel DDR3 1066 DRAM 512 KB private cache/core Workloads SPEC CPU2006, TPC-C, Matlab , NAS 80 multiprogrammed workloads 25
Metrics System Performance: shared i IPC i = Weighted Speedup alone i IPC Fairness: alone i IPC = Maximum Slowdown max shared i IPC Complexity: Critical path latency and area from synthesis with 32 nm library 26
Previous Memory Schedulers FRFCFS[Zuravleff and Robinson, US Patent 1997, Rixner et al., ISCA 2000] Prioritizes row-buffer hits and older requests Application-unaware + Low complexity - Low performance and fairness FRFCFS-Cap [Mutlu and Moscibroda, MICRO 2007] Caps number of consecutive row-buffer hits PARBS [Mutlu and Moscibroda, ISCA 2008] Batches oldest requests from each application; prioritizes batch Employs ranking within a batch Application-aware + High performance and fairness - High complexity ATLAS[Kim et al., HPCA 2010] Prioritizes applications with low memory-intensity TCM[Kim et al., MICRO 2010] Always prioritizes low memory-intensity applications Shuffles thread ranks of high memory-intensity applications 27
Performance and Fairness FRFCFS ATLAS FRFCFS-Cap TCM PARBS Blacklisting 15 13 Unfairness 11 5% 9 21% 7 5 7.5 8 8.5 9 9.5 10 Performance 1. Blacklisting achieves the highest performance 2. Blacklisting balances performance and fairness 28
Complexity FRFCFS ATLAS FRFCFS-Cap TCM PARBS Blacklisting Scheduler Area (sq. um) 120000 100000 80000 43% 60000 40000 70% 20000 0 0 2 4 6 8 10 12 Critical Path Latency (ns) Blacklisting reduces complexity significantly 29
Performance vs. Fairness vs. Simplicity Close to fairest FRFCFS FRFCFS -Cap PARBS ATLAS TCM Blacklisting Fairness Ideal Highest performance Performance Blacklisting is the closest scheduler to ideal Close to simplest Simplicity 30
Summary Applications requests interfere at main memory Prevalent solution approach Application-aware memory request scheduling Key shortcoming of previous schedulers: Full ranking High hardware complexity Unfair application slowdowns Our Solution: Blacklisting memory scheduler Sufficient to group applications rather than rank Group by tracking number of consecutive requests Much simpler than application-aware schedulers at higher performance and fairness 31
The Blacklisting Memory Scheduler Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 32
DRAM Memory Organization Columns Row hit Row miss (2-3x latency of row hit) Rows Bank 0 Bank 0 Bank 1 Bank 1 Bank 2 Bank 2 Bank 3 Bank 3 Memory Controller Row Row Buffer Buffer Row Buffer Row Buffer Row Buffer FR-FCFS Memory Scheduler [Zuravleff and Robinson, US Patent 97; Rixner et al., ISCA 00] Row-buffer hit first Older request first Unaware of inter-application interference 34
Tackling Inter-Application Interference: Application-aware Memory Scheduling Monitor application memory access characteristics (e.g., memory intensity) Rank applications based on memory access characteristics Prioritize requests at the memory controller, based on ranking 35
Performance and Fairness 10 15 Maximum Slowdown Weighted Speedup FRFCFS FRFCFS 8 FRFCFS-Cap FRFCFS-Cap 10 6 PARBS PARBS ATLAS ATLAS 4 5 TCM TCM 2 Blacklisting Blacklisting 0 0 5% higher system performance and 21% lower maximum slowdown than TCM 36
Complexity Results 12 120000 10 100000 App-unaware FRFCFS Area (in square um) Latency (in ns) FRFCFS-Cap FRFCFS-Cap 8 80000 PARBS PARBS 6 60000 ATLAS ATLAS 4 40000 App-aware TCM Blacklisting Blacklisting 2 20000 0 0 Blacklisting achieves Blacklisting achieves 70% lower latency than TCM 43% lower area than TCM 37
Understanding Why Blacklisting Works 0.4 0.5 Fraction of Requests Fraction of Requests 0.4 0.3 FRFCFS FRFCFS 0.3 0.2 PARBS PARBS 0.2 0.1 TCM TCM 0.1 Blacklisting Blacklisting 0 0 0 10 20 0 10 20 Streak Length libquantum Streak Length calculix (High memory-intensity application) Blacklisting shifts the request distribution towards the left towards the right (Low memory-intensity application) Blacklisting shifts the request distribution 38
Harmonic Speedup 0.35 0.3 FRFCFS Harmonic Speedup 0.25 FRFCFS-Cap 0.2 PARBS ATLAS 0.15 TCM 0.1 Blacklisting 0.05 0 39
Effect of Workload Memory Intensity 1.4 2 Maximum Slowdown Weighted Speedup 1.2 (Normalized) FRFCFS 1.5 (Normalized) 1 FRFCFS-Cap 0.8 1 PARBS 0.6 ATLAS 0.4 0.5 TCM 0.2 Blacklisting 0 0 25 50 75 100 Avg 25 50 75 100 Avg 40
Combining FRFCFS-Cap and Blacklisting 1.2 1.4 Maximum Slowdown 1.2 1 FRFCFS Weighted Speedup 1 0.8 FRFCFS-Cap 0.8 0.6 Blacklisting 0.6 0.4 0.4 FRFCFS-Cap- Blacklisting 0.2 0.2 0 0 41
Sensitivity to Blacklisting Threshold 1.4 1.2 1.2 1 Maximum Slowdown Weighted Speedup FRFCFS 1 0.8 Blacklisting-1 0.8 Blaclisting-2 0.6 Blacklisting-4 0.6 Blacklisting-8 0.4 0.4 Blacklisting-16 0.2 0.2 0 0 42
Sensitivity to Clearing Interval 1.4 1.2 FRFCFS 1.2 Maximum Slowdown 1 Weighted Speedup 1 0.8 Blacklisting- 1000 Blacklisting- 10000 Blacklisting- 100000 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 43
Sensitivity to Core Count 20 35 Maximum Slowdown Weighted Speedup 30 15 25 20 FRFCFS 10 15 PARBS 10 5 TCM 5 Blacklisting 0 0 16 24 Core Count 32 64 16 24 Core Count 32 64 44
Sensitivity to Channel Count 15 40 Weighted Speedup Maximum Slowdown 30 10 FRFCFS 20 PARBS 5 10 TCM Blacklisting 0 0 1 2 4 8 1 2 4 8 Channel Count Channel Count 45
Breakdown of Benefits 1.4 1.2 1.2 Maximum Slowdown 1 Weighted Speedup 1 FRFCFS 0.8 0.8 TCM 0.6 Grouping 0.6 0.4 Blacklisting 0.4 0.2 0.2 0 0 47
BLISS vs. Criticality-aware Scheduling 1.4 1.4 1.2 1.2 Maximum Slowdown Weighted Speedup FRFCFS 1 1 PARBS 0.8 0.8 TCM Crit-MaxStall 0.6 0.6 Crit-TotalStall 0.4 0.4 Blacklisting 0.2 0.2 0 0 48
Sub-row Interleaving 12 12 Maximum Slowdown 10 10 FRFCFS-Row Weighted Speedup FRFCFS 8 8 FRFCFS-Cap 6 6 PARBS ATLAS 4 4 TCM 2 2 Blacklisting 0 0 49