Multi-Tiered Branch Predictor for Pipelined Processors Overview

Explore the architecture and benefits of a multi-tiered branch predictor system designed for modern processors to minimize costly mispredictions. Learn about the hierarchy of predictors, performance benefits, speculative state management, and more in this detailed analysis.

  • Branch Predictor
  • Pipeline Processor
  • Performance Benefits
  • Speculative State
  • Processor Architecture

Presentation Transcript


  1. Multi-tiered branch predictor for pipelined processors
     Francis Wang and Shana Matthew

  2. Background
     • Branch predictors guess the direction of a branch before its outcome is known for certain.
     • In modern processors, the fetch stage is decoupled from the execution units.
     • Mispredictions are very costly, which motivates the development of highly accurate branch predictors.

  3. Multi-tiered branch predictor
     • Hierarchy of predictors:
       • Branch target buffer (BTB): predicts the next PC before the instruction is fetched
       • Primary branch predictor (Gshare): small and fast, but less accurate
       • Secondary branch predictor (TAGE): large and slower, but more accurate

  4. Pipeline diagram
     • A 6-stage pipeline is used to model a deeply pipelined processor.

  5. Performance benefit
     Misprediction penalties (in cycles), without and with the Gshare tier:

     BTB   TAGE   Penalty
     X     0      4
     0     1      2
     1     1      0

     BTB   Gshare   TAGE   Penalty
     X     X        0      4
     X     0        1      2
     0     1        1      1
     1     1        1      0

  6. Speculative state
     • Speculative state is managed by three boolean epoch counters:
       • epGshare: Gshare redirection
       • epTAGE: TAGE redirection
       • epEx: branch resolution
     • Until the execute stage, all instructions are speculative.
     • An instruction is dropped if its epoch does not match the current epoch counter.
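The epoch mechanism above can be sketched as a minimal Python model. This is an illustrative single-epoch version (the slides use three separate epoch bits, epGshare, epTAGE, and epEx); the class and method names are hypothetical.

```python
class EpochFilter:
    """Single-bit epoch filter: in-flight instructions tagged with a
    stale epoch are squashed instead of being executed."""

    def __init__(self):
        self.epoch = 0  # 1-bit epoch counter

    def redirect(self):
        # A redirection flips the epoch; every instruction still in
        # flight with the old tag becomes stale.
        self.epoch ^= 1

    def is_stale(self, instr_epoch):
        # Drop instructions whose tag does not match the current epoch.
        return instr_epoch != self.epoch
```

With three such bits, a stage drops an instruction if any of the epoch tags it carries mismatches the corresponding counter.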

  7. Redirections
     • On a redirection, the epoch counters and next PC are updated.
     • If multiple redirections occur in the same cycle, the redirection furthest down the pipeline takes precedence.
     • Ephemeral History Registers (EHRs) are used to:
       • eliminate rule conflicts between pipeline stages that all trigger redirections
       • enforce ordering between different redirections
     • The internal rule doRedirect fires before all other rules in the same cycle, ensuring that the effect of a redirection is felt the cycle after it takes place.
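The precedence rule ("furthest down the pipeline wins") can be modeled with a few lines of Python. The function name and the (depth, target) encoding are hypothetical; the hardware uses EHRs to resolve this within one cycle.

```python
def pick_redirect(requests):
    """Choose the winning redirection among same-cycle requests.

    requests: list of (stage_depth, target_pc) tuples, where a larger
    stage_depth means a stage further down the pipeline.  The deepest
    stage's redirection takes precedence; returns None if no stage
    requested a redirection this cycle.
    """
    if not requests:
        return None
    return max(requests, key=lambda r: r[0])[1]
```

For example, if both the Gshare stage and the execute stage redirect in the same cycle, the execute stage's target wins, since its resolution is authoritative.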

  8. Branch target buffer
     • Contains a table of:
       • previously-used next addresses (branch targets) that were not simply PC+4
       • the PCs that generated those branch targets
     • Tightly coupled to the fetch stage.
     • Predicts the next address before the instruction is even fetched from memory.
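A direct-mapped BTB of this kind might look like the following Python sketch (the slides mention Python models are used to check the hardware; the indexing scheme and eviction policy here are assumptions, not the authors' exact implementation).

```python
class BTB:
    """Direct-mapped branch target buffer: remembers (PC, target)
    pairs for instructions whose next address was not PC+4."""

    def __init__(self, entries=4):
        self.entries = entries
        self.table = [None] * entries  # each slot: (pc, target) or None

    def predict(self, pc):
        slot = self.table[(pc >> 2) % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]   # previously-seen branch target
        return pc + 4        # default: fall through

    def update(self, pc, next_pc):
        idx = (pc >> 2) % self.entries
        if next_pc != pc + 4:
            self.table[idx] = (pc, next_pc)   # remember the taken target
        elif self.table[idx] is not None and self.table[idx][0] == pc:
            self.table[idx] = None            # evict a now-stale entry
```

Because the lookup needs only the PC, the prediction is available before the instruction itself returns from memory, which is what lets the BTB sit in front of the fetch stage.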

  9. Gshare predictor
     • N-bit global history register.
     • 2^N-entry branch history table (BHT) of 2-bit saturating prediction counters:
       • 00: strongly not taken
       • 01: weakly not taken
       • 10: weakly taken
       • 11: strongly taken
     • The global history register is XORed with the PC to give the BHT index.
     • A counter is incremented if the branch was taken and decremented otherwise.
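A functional Python model of such a gshare predictor (in the spirit of the tandem-verification models mentioned later) could be sketched as follows; the default size and the exact PC bits used for hashing are assumptions.

```python
class Gshare:
    """Gshare: an N-bit global history register XORed with PC bits
    indexes a 2^N-entry table of 2-bit saturating counters."""

    def __init__(self, n_bits=8):
        self.mask = (1 << n_bits) - 1
        self.history = 0
        # 2-bit counters, 0..3; initialize to weakly not taken (01).
        self.bht = [1] * (1 << n_bits)

    def _index(self, pc):
        # XOR global history with the word-aligned PC bits.
        return (self.history ^ (pc >> 2)) & self.mask

    def predict(self, pc):
        return self.bht[self._index(pc)] >= 2  # True means predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)  # saturate at 11
        else:
            self.bht[i] = max(0, self.bht[i] - 1)  # saturate at 00
        # Shift the resolved outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

The XOR folds branch history into the index, so the same static branch can train different counters under different global histories.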

  10. TAGE predictor
     • Each predictor table is indexed with a different history length, capturing correlations with both distant and nearby branches.
     • Table T0 provides the default prediction.
     • The upper tables each have a 3-bit saturating prediction counter and a 2-bit saturating useful counter.
     • The matching table that uses the longest history length takes precedence.
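The longest-match precedence and on-misprediction allocation can be illustrated with a heavily simplified Python sketch. This is not the authors' 5-component, 64-entry design: the table count, history lengths, hash functions, and the omission of the useful counters are all simplifications made for clarity.

```python
class TinyTage:
    """Simplified TAGE: a base table T0 of 2-bit counters plus tagged
    tables indexed with increasing history lengths; the longest
    matching history wins, and mispredictions allocate upward."""

    def __init__(self, entries=16):
        self.entries = entries
        self.hist_lens = [2, 4, 8]         # geometric history lengths
        self.history = 0
        self.t0 = [1] * entries            # 2-bit counters, weakly not taken
        self.tables = [{} for _ in self.hist_lens]  # idx -> [tag, 3-bit ctr]

    def _index_tag(self, pc, length):
        h = self.history & ((1 << length) - 1)   # truncate history
        return ((pc >> 2) ^ h) % self.entries, ((pc >> 2) ^ (h * 3)) & 0xFF

    def _provider(self, pc):
        # Scan from longest to shortest history: longest match provides.
        for i in reversed(range(len(self.tables))):
            idx, tag = self._index_tag(pc, self.hist_lens[i])
            e = self.tables[i].get(idx)
            if e is not None and e[0] == tag:
                return i, e
        return None

    def predict(self, pc):
        hit = self._provider(pc)
        if hit is not None:
            return hit[1][1] >= 4                # 3-bit counter, 0..7
        return self.t0[(pc >> 2) % self.entries] >= 2  # T0 default

    def update(self, pc, taken):
        pred = self.predict(pc)
        hit = self._provider(pc)
        if hit is not None:                      # train the provider
            _, e = hit
            e[1] = min(7, e[1] + 1) if taken else max(0, e[1] - 1)
        else:                                    # train T0
            i = (pc >> 2) % self.entries
            self.t0[i] = min(3, self.t0[i] + 1) if taken else max(0, self.t0[i] - 1)
        if pred != taken:
            # Mispredicted: allocate in a table with a longer history.
            nxt = hit[0] + 1 if hit is not None else 0
            if nxt < len(self.tables):
                idx, tag = self._index_tag(pc, self.hist_lens[nxt])
                self.tables[nxt][idx] = [tag, 4 if taken else 3]
        self.history = ((self.history << 1) | int(taken)) & 0xFF
```

The key idea survives the simplification: short-history tables catch simple patterns cheaply, while longer-history tables are only populated where the shorter ones keep mispredicting.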

  11. Trace dumps
     • Instrumentation printouts in the processor source code record the state of the pipeline at every cycle.
     • Note the redirection in cycle 5818.

  12. Tandem verification
     • Trace dumps from the benchmarks are fed into Python models.
     • Used to verify that the BTB, Gshare, and TAGE are functionally correct.

  13. Utilization and timing
     • Critical path: within the DDR3 interface logic, with a logic depth of 2 and 97.7% of the delay coming from routing.
     • Yosys synthesis report: the critical path within the processor runs from the instruction cache to the PC redirection register.

  14. Benchmarks
     • Big processor benchmarks, modified to repeat the computation and normalized to about 10,000,000 instructions.
     • Configuration: 4-entry BTB, 256-entry Gshare, 5-component 64-entry TAGE.
     • For the BTB, only the hit rate on branch instructions is quoted.

  15. Performance
     • The benchmarks do not reflect realistic program behavior.

  16. Effect of table sizing
     • Better benchmarks are needed for a meaningful comparison.
