
Multi-Tiered Branch Predictor for Pipelined Processors Overview
This presentation covers the architecture and benefits of a multi-tiered branch predictor system designed to minimize costly mispredictions in modern pipelined processors, including the hierarchy of predictors, performance benefits, and speculative state management.
Multi-tiered branch predictor for pipelined processors
Francis Wang and Shana Matthew
Background
- Branch predictors guess the direction of branches before the outcome is known for certain.
- In modern processors, the fetch stage is decoupled from the execution units.
- Mispredictions are very costly, which motivates the development of highly accurate branch predictors.
Multi-tiered branch predictor
A hierarchy of predictors:
- Branch target buffer (BTB): predicts the next PC before the instruction is fetched
- Primary branch predictor (Gshare): small and fast, but less accurate
- Secondary branch predictor (TAGE): large and slow, but more accurate
Pipeline diagram
A 6-stage pipeline models a deeply pipelined processor.
Performance benefit
Two-tier design (BTB + TAGE):

  BTB | TAGE | Penalty
   X  |  0   |   4
   0  |  1   |   2
   1  |  1   |   0

Three-tier design (BTB + Gshare + TAGE):

  BTB | Gshare | TAGE | Penalty
   X  |   X    |  0   |   4
   X  |   0    |  1   |   2
   0  |   1    |  1   |   1
   1  |   1    |  1   |   0

(1 = correct prediction, 0 = misprediction, X = don't care; penalty in cycles)
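The penalty table above can be captured as a small lookup function. This is a sketch only: the cycle counts are read off the slide's three-tier table, and the function name and argument names are illustrative.

```python
# Sketch of the per-branch redirect penalty implied by the slide's
# three-tier table (BTB + Gshare + TAGE). Cycle counts are assumptions
# taken from the table; names are illustrative.
def penalty(btb_ok, gshare_ok, tage_ok):
    """Redirect penalty in cycles for one branch, given which
    predictors agreed with the eventual outcome."""
    if not tage_ok:
        return 4   # wrong until execute: full pipeline flush
    if not gshare_ok:
        return 2   # TAGE overrides Gshare later in the front end
    if not btb_ok:
        return 1   # Gshare overrides the BTB one stage later
    return 0       # all tiers agreed: no bubble
```

The ordering of the checks encodes the tier hierarchy: a deeper predictor being wrong dominates any shallower disagreement.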
Speculative state
Speculative state is managed by three boolean epoch counters:
- epGshare: Gshare redirection
- epTAGE: TAGE redirection
- epEx: branch resolution
Until the execute stage, all instructions are speculative. An instruction is dropped if its epoch does not match the current epoch counter.
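The epoch mechanism can be sketched as a software model, in the spirit of the Python models the talk uses for verification. The `Instr` record and the check-all-three-epochs rule are simplifying assumptions; in the real pipeline each stage would compare only the epochs relevant to it.

```python
# Sketch of epoch-based squashing with three 1-bit epoch counters.
# Counter names follow the slide; the Instr record is illustrative.
from dataclasses import dataclass

@dataclass
class Instr:
    pc: int
    epGshare: int  # epoch values the instruction was fetched under
    epTAGE: int
    epEx: int

class Epochs:
    def __init__(self):
        self.epGshare = 0  # flipped on a Gshare redirection
        self.epTAGE = 0    # flipped on a TAGE redirection
        self.epEx = 0      # flipped when a branch resolves down a new path

    def redirect_gshare(self): self.epGshare ^= 1
    def redirect_tage(self):   self.epTAGE ^= 1
    def redirect_ex(self):     self.epEx ^= 1

    def is_stale(self, i: Instr) -> bool:
        # An in-flight instruction is dropped if its recorded epochs no
        # longer match the current counters (a redirection happened after
        # it was fetched).
        return (i.epGshare, i.epTAGE, i.epEx) != \
               (self.epGshare, self.epTAGE, self.epEx)
```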
Redirections
- On a redirection, the epoch counters and the next PC are updated.
- If multiple redirections occur in the same cycle, the redirection furthest down the pipeline takes precedence.
- Ephemeral History Registers (EHRs) eliminate rule conflicts between pipeline stages that all trigger redirections and enforce an ordering between the different redirections.
- The internal rule doRedirect fires before all other rules in the same cycle, ensuring that the effect of a redirection is felt the cycle after it takes place.
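The same-cycle arbitration rule ("furthest down the pipeline wins") can be sketched as follows; the stage names and numeric priorities are illustrative, not taken from the source code.

```python
# Sketch of same-cycle redirection arbitration: when several stages
# request a redirect in the same cycle, the one furthest down the
# pipeline wins (execute > TAGE > Gshare). Priorities are illustrative.
PRIORITY = {"execute": 3, "tage": 2, "gshare": 1}

def arbitrate(redirects):
    """redirects: list of (stage, target_pc) raised this cycle.
    Returns the winning target PC, or None if nothing fired."""
    if not redirects:
        return None
    stage, target = max(redirects, key=lambda r: PRIORITY[r[0]])
    return target
```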
Branch target buffer
Contains a table of:
- Previously-used next addresses (branch targets) that were not simply PC+4
- The PCs that generated those branch targets
Tightly coupled to the fetch stage: it predicts the next address before the instruction is even fetched from memory.
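A minimal software model of the BTB described above might look like this. The direct-mapped organization and field layout are assumptions; only the tag/target pairing and the PC+4 fall-through come from the slide.

```python
# Sketch of a small direct-mapped BTB. The slide's table stores
# (PC, branch target) pairs; the indexing scheme here is an assumption.
class BTB:
    def __init__(self, entries=4):
        self.entries = entries
        self.table = {}  # index -> (tag_pc, target)

    def predict(self, pc):
        # Predict the next PC before the instruction is fetched:
        # on a hit, return the stored branch target; otherwise fall
        # through to PC+4.
        entry = self.table.get((pc >> 2) % self.entries)
        if entry is not None and entry[0] == pc:
            return entry[1]
        return pc + 4

    def update(self, pc, next_pc):
        # Only remember next addresses that were not simply PC+4.
        if next_pc != pc + 4:
            self.table[(pc >> 2) % self.entries] = (pc, next_pc)
```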
Gshare predictor
- N-bit global history register
- 2^N-entry branch history table (BHT) of 2-bit saturating prediction counters:
  00: Strongly Not Taken, 01: Weakly Not Taken, 10: Weakly Taken, 11: Strongly Taken
- The global history register is XORed with the PC to give the BHT index.
- A counter is incremented if the branch was taken and decremented otherwise.
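The Gshare scheme above can be written as a short software model, similar in spirit to the Python models used for tandem verification. The history length, PC shift, and counter initialization here are illustrative choices, not the talk's configuration.

```python
# Sketch of a Gshare predictor model. History length and the PC >> 2
# alignment shift are illustrative assumptions.
class Gshare:
    def __init__(self, history_bits=8):
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                           # global history register
        self.bht = [1] * (1 << history_bits)   # 2-bit counters, Weakly Not Taken

    def index(self, pc):
        # Global history XORed with the PC gives the BHT index.
        return (self.ghr ^ (pc >> 2)) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 (Weakly/Strongly Taken) predict taken.
        return self.bht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        # Saturating increment on taken, decrement otherwise.
        if taken:
            self.bht[i] = min(3, self.bht[i] + 1)
        else:
            self.bht[i] = max(0, self.bht[i] - 1)
        # Shift the resolved outcome into the global history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```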
TAGE predictor
- Each predictor table is indexed with a different history length, capturing correlations with both distant and near branches.
- Table T0 provides the default prediction.
- The upper tables each have a 3-bit saturating prediction counter and a 2-bit saturating useful counter.
- The matching table that uses the longest history length takes precedence.
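A heavily simplified model of the TAGE lookup and update is sketched below. It keeps the structure from the slide (a default table T0 plus tagged tables at increasing history lengths, with longest-history match winning) but omits the useful counters and real allocation policy, and the hashing and sizes are assumptions.

```python
# Sketch of a simplified TAGE model: T0 default table plus tagged tables
# indexed with different history lengths. Useful counters, folded history
# hashing, and the real allocation policy are omitted; sizes are illustrative.
class TinyTAGE:
    def __init__(self, hist_lens=(2, 4, 8, 16), entries=64):
        self.hist_lens = hist_lens
        self.entries = entries
        self.base = [1] * entries                  # T0: 2-bit default counters
        self.tables = [dict() for _ in hist_lens]  # key -> 3-bit counter
        self.ghr = 0

    def _key(self, pc, hlen):
        # Combine the PC with the most recent hlen history bits.
        return (pc >> 2) ^ (self.ghr & ((1 << hlen) - 1))

    def predict(self, pc):
        # The matching table using the longest history takes precedence.
        for t, hlen in sorted(zip(self.tables, self.hist_lens),
                              key=lambda x: -x[1]):
            ctr = t.get(self._key(pc, hlen))
            if ctr is not None:
                return ctr >= 4        # 3-bit counter: >= 4 means taken
        return self.base[(pc >> 2) % self.entries] >= 2  # T0 default

    def update(self, pc, taken):
        predicted = self.predict(pc)
        # Update the provider: the longest matching table, else T0.
        for t, hlen in sorted(zip(self.tables, self.hist_lens),
                              key=lambda x: -x[1]):
            k = self._key(pc, hlen)
            if k in t:
                t[k] = min(7, t[k] + 1) if taken else max(0, t[k] - 1)
                break
        else:
            i = (pc >> 2) % self.entries
            self.base[i] = min(3, self.base[i] + 1) if taken \
                else max(0, self.base[i] - 1)
        # On a misprediction, allocate in the shortest non-matching table.
        if predicted != taken:
            for t, hlen in zip(self.tables, self.hist_lens):
                k = self._key(pc, hlen)
                if k not in t:
                    t[k] = 4 if taken else 3   # weakly biased toward outcome
                    break
        self.ghr = ((self.ghr << 1) | int(taken)) & 0xFFFF
```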
Trace dumps
Instrumentation printouts in the processor source code record the state of the pipeline at every cycle. Note the redirection in cycle 5818.
Tandem verification
Trace dumps from the benchmarks are fed into Python models, which are used to verify that the BTB, Gshare, and TAGE implementations are functionally correct.
Utilization and timing
- Critical path: within the DDR3 interface logic, with a logical depth of 2 and 97.7% of the delay coming from routing.
- Yosys synthesis report: the critical path within the processor itself runs from the instruction cache to the PC redirection register.
Benchmark
Big processor benchmarks were modified to repeat the computation, normalized to about 10,000,000 instructions.
Configuration:
- 4-entry BTB
- 256-entry Gshare
- 5-component, 64-entry TAGE
For the BTB, only the hit rate on branch instructions is quoted.
Performance
The benchmarks do not reflect realistic program behavior.
Effect of table sizing
Better benchmarks are needed for a meaningful comparison.