
Multi-Tiered Branch Predictor for Pipelined Processors Overview
This presentation covers the architecture and benefits of a multi-tiered branch predictor system designed to minimize costly mispredictions in modern pipelined processors, including the hierarchy of predictors, performance benefits, and speculative state management.
Multi-tiered branch predictor for pipelined processors
Francis Wang and Shana Matthew
Background
- Branch predictors guess the direction of branches before the outcome is known for certain.
- In modern processors, the fetch stage is decoupled from the execution units.
- Mispredictions are very costly, which motivates the development of highly accurate branch predictors.
Multi-tiered branch predictor
A hierarchy of predictors:
- Branch target buffer (BTB): predicts the next PC before the instruction is fetched
- Primary branch predictor (Gshare): small and fast, but less accurate
- Secondary branch predictor (TAGE): large and slow, but more accurate
Pipeline diagram
A 6-stage pipeline models a deeply pipelined processor.
Performance benefit
Two-tier design (BTB + TAGE):

  BTB | TAGE | Penalty
   X  |  0   |   4
   0  |  1   |   2
   1  |  1   |   0

Three-tier design (BTB + Gshare + TAGE):

  BTB | Gshare | TAGE | Penalty
   X  |   X    |  0   |   4
   X  |   0    |  1   |   2
   0  |   1    |  1   |   1
   1  |   1    |  1   |   0

(1 = correct prediction, 0 = misprediction, X = don't care; penalty in cycles)
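The penalty table above can be captured as a small lookup function. This is a sketch only: the cycle counts are read off the slide's three-tier table, and the function name and argument names are illustrative.

```python
# Sketch of the per-branch redirect penalty implied by the slide's
# three-tier table (BTB + Gshare + TAGE). Cycle counts are assumptions
# taken from the table; names are illustrative.
def penalty(btb_ok, gshare_ok, tage_ok):
    """Redirect penalty in cycles for one branch, given which
    predictors agreed with the eventual outcome."""
    if not tage_ok:
        return 4   # wrong until execute: full pipeline flush
    if not gshare_ok:
        return 2   # TAGE overrides Gshare later in the front end
    if not btb_ok:
        return 1   # Gshare overrides the BTB one stage later
    return 0       # all tiers agreed: no bubble
```

The ordering of the checks encodes the tier hierarchy: a deeper predictor being wrong dominates any shallower disagreement.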
Speculative state
Speculative state is managed by three boolean epoch counters:
- epGshare: Gshare redirection
- epTAGE: TAGE redirection
- epEx: branch resolution
Until the execute stage, all instructions are speculative. An instruction is dropped if its epoch does not match the current epoch counter.
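The epoch mechanism can be sketched as a software model, in the spirit of the Python models the talk uses for verification. The `Instr` record and the check-all-three-epochs rule are simplifying assumptions; in the real pipeline each stage would compare only the epochs relevant to it.

```python
# Sketch of epoch-based squashing with three 1-bit epoch counters.
# Counter names follow the slide; the Instr record is illustrative.
from dataclasses import dataclass

@dataclass
class Instr:
    pc: int
    epGshare: int  # epoch values the instruction was fetched under
    epTAGE: int
    epEx: int

class Epochs:
    def __init__(self):
        self.epGshare = 0  # flipped on a Gshare redirection
        self.epTAGE = 0    # flipped on a TAGE redirection
        self.epEx = 0      # flipped when a branch resolves down a new path

    def redirect_gshare(self): self.epGshare ^= 1
    def redirect_tage(self):   self.epTAGE ^= 1
    def redirect_ex(self):     self.epEx ^= 1

    def is_stale(self, i: Instr) -> bool:
        # An in-flight instruction is dropped if its recorded epochs no
        # longer match the current counters (a redirection happened after
        # it was fetched).
        return (i.epGshare, i.epTAGE, i.epEx) != \
               (self.epGshare, self.epTAGE, self.epEx)
```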
Redirections
- On a redirection, the epoch counters and the next PC are updated.
- If multiple redirections occur in the same cycle, the redirection furthest down the pipeline takes precedence.
- Ephemeral History Registers (EHRs) eliminate rule conflicts between pipeline stages that all trigger redirections and enforce an ordering between the different redirections.
- The internal rule doRedirect fires before all other rules in the same cycle, ensuring that the effect of a redirection is felt the cycle after it takes place.
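The same-cycle arbitration rule ("furthest down the pipeline wins") can be sketched as follows; the stage names and numeric priorities are illustrative, not taken from the source code.

```python
# Sketch of same-cycle redirection arbitration: when several stages
# request a redirect in the same cycle, the one furthest down the
# pipeline wins (execute > TAGE > Gshare). Priorities are illustrative.
PRIORITY = {"execute": 3, "tage": 2, "gshare": 1}

def arbitrate(redirects):
    """redirects: list of (stage, target_pc) raised this cycle.
    Returns the winning target PC, or None if nothing fired."""
    if not redirects:
        return None
    stage, target = max(redirects, key=lambda r: PRIORITY[r[0]])
    return target
```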
Branch target buffer
Contains a table of:
- Previously-used next addresses (branch targets) that were not simply PC+4
- The PCs that generated those branch targets
Tightly coupled to the fetch stage: it predicts the next address before the instruction is even fetched from memory.
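A minimal software model of the BTB described above might look like this. The direct-mapped organization and field layout are assumptions; only the tag/target pairing and the PC+4 fall-through come from the slide.

```python
# Sketch of a small direct-mapped BTB. The slide's table stores
# (PC, branch target) pairs; the indexing scheme here is an assumption.
class BTB:
    def __init__(self, entries=4):
        self.entries = entries
        self.table = {}  # index -> (tag_pc, target)

    def predict(self, pc):
        # Predict the next PC before the instruction is fetched:
        # on a hit, return the stored branch target; otherwise fall
        # through to PC+4.
        entry = self.table.get((pc >> 2) % self.entries)
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4

    def update(self, pc, next_pc):
        # Only remember next addresses that were not simply PC+4.
        if next_pc != pc + 4:
            self.table[(pc >> 2) % self.entries] = (pc, next_pc)
```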
Gshare predictor
- N-bit global history register
- 2^N-entry branch history table (BHT) of 2-bit saturating prediction counters:
  00: Strongly Not Taken, 01: Weakly Not Taken, 10: Weakly Taken, 11: Strongly Taken
- The global history register is XORed with the PC to give the BHT index.
- A counter is incremented if the branch was taken and decremented otherwise.
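The Gshare scheme above can be written as a short software model, similar in spirit to the Python models used for tandem verification. The history length, PC shift, and counter initialization here are illustrative choices, not the talk's configuration.

```python
# Sketch of a Gshare predictor model. History length and the PC >> 2
# alignment shift are illustrative assumptions.
class Gshare:
    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                           # global history register
        self.bht = [1] * (1 << history_bits)   # 2-bit counters, Weakly Not Taken

    def index(self, pc):
        # Global history XORed with the PC gives the BHT index.
        return (self.ghr ^ (pc >> 2)) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 (Weakly/Strongly Taken) predict taken.
        return self.bht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        # Saturating increment on taken, decrement otherwise.
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)
        # Shift the resolved outcome into the global history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```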
TAGE predictor
- Each predictor table is indexed with a different history length, capturing correlations with both distant and near branches.
- Table T0 provides the default prediction.
- The upper tables each have a 3-bit saturating prediction counter and a 2-bit saturating useful counter.
- The matching table that uses the longest history length takes precedence.
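A heavily simplified model of the TAGE lookup and update is sketched below. It keeps the structure from the slide (a default table T0 plus tagged tables at increasing history lengths, with longest-history match winning) but omits the useful counters and real allocation policy, and the hashing and sizes are assumptions.

```python
# Sketch of a simplified TAGE model: T0 default table plus tagged tables
# indexed with different history lengths. Useful counters, folded history
# hashing, and the real allocation policy are omitted; sizes are illustrative.
class TinyTAGE:
    def __init__(self, hist_lens=(2, 4, 8, 16), entries=64):
        self.hist_lens = hist_lens
        self.entries = entries
        self.base = [1] * entries                  # T0: 2-bit default counters
        self.tables = [dict() for _ in hist_lens]  # key -> 3-bit counter
        self.ghr = 0

    def _key(self, pc, hlen):
        # Combine the PC with the most recent hlen history bits.
        return (pc >> 2) ^ (self.ghr & ((1 << hlen) - 1))

    def predict(self, pc):
        # The matching table using the longest history takes precedence.
        for t, hlen in sorted(zip(self.tables, self.hist_lens),
                              key=lambda x: -x[1]):
            ctr = t.get(self._key(pc, hlen))
            if ctr is not None:
                return ctr >= 4        # 3-bit counter: >= 4 means taken
        return self.base[(pc >> 2) % self.entries] >= 2  # T0 default

    def update(self, pc, taken):
        predicted = self.predict(pc)
        # Update the provider: the longest matching table, else T0.
        for t, hlen in sorted(zip(self.tables, self.hist_lens),
                              key=lambda x: -x[1]):
            k = self._key(pc, hlen)
            if k in t:
                t[k] = min(7, t[k] + 1) if taken else max(0, t[k] - 1)
                break
        else:
            i = (pc >> 2) % self.entries
            self.base[i] = min(3, self.base[i] + 1) if taken \
                else max(0, self.base[i] - 1)
        # On a misprediction, allocate in the shortest non-matching table.
        if predicted != taken:
            for t, hlen in zip(self.tables, self.hist_lens):
                k = self._key(pc, hlen)
                if k not in t:
                    t[k] = 4 if taken else 3   # weakly biased toward outcome
                    break
        self.ghr = ((self.ghr << 1) | int(taken)) & 0xFFFF
```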
Trace dumps
Instrumentation printouts in the processor source code record the state of the pipeline at every cycle. Note the redirection in cycle 5818.
Tandem verification
Trace dumps from the benchmarks are fed into Python models, which are used to verify that the BTB, Gshare, and TAGE implementations are functionally correct.
Utilization and timing
- Critical path: within the DDR3 interface logic, with a logical depth of 2 and 97.7% of the delay coming from routing.
- Yosys synthesis report: the critical path within the processor itself runs from the instruction cache to the PC redirection register.
Benchmark
Big processor benchmarks were modified to repeat the computation, normalized to about 10,000,000 instructions.
Configuration:
- 4-entry BTB
- 256-entry Gshare
- 5-component, 64-entry TAGE
For the BTB, only the hit rate on branch instructions is quoted.
Performance
The benchmarks do not reflect realistic program behavior.
Effect of table sizing
Better benchmarks are needed for a meaningful comparison.