WO2021108007A1 - Apparatus and method for branch prediction - Google Patents

Publication number
WO2021108007A1
Authority
WO
WIPO (PCT)
Prior art keywords
branch
tage
instructions
instruction
predictor
Prior art date
Application number
PCT/US2020/050680
Other languages
English (en)
Inventor
Fangping LIU
Weiyu Chen
Richard VAN
Gang Liu
Sang Wook Do
Original Assignee
Futurewei Technologies, Inc.
Priority date
Filing date
Publication date
Application filed by Futurewei Technologies, Inc.
Publication of WO2021108007A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842 Speculative instruction execution
    • G06F9/3848 Speculative instruction execution using hybrid branch prediction, e.g. selection between prediction techniques
    • G06F9/3802 Instruction prefetching
    • G06F9/3808 Instruction prefetching for instruction reuse, e.g. trace cache, branch target cache
    • G06F9/3854 Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3858 Result writeback, i.e. updating the architectural state or memory

Definitions

  • a microprocessor comprising a fetch stage configured to retrieve instructions from a memory; a buffer configured to store instructions retrieved by the fetch stage; one or more processors configured to execute instructions stored in the buffer; a branch predictor including a branch target buffer (BTB), including a BTB table storing one or more BTB entries, each of the BTB entries including first branch information for a first branch instruction, the first branch information including a program counter (PC), a branch target address, and an attribute associated with corresponding second branch information for a second branch instruction, the second branch information including a PC, a branch target address, and an attribute, and a tagged geometric (TAGE) history length branch predictor, including multiple TAGE tables storing one or more TAGE entries, each of the TAGE entries including the first branch information associated with the corresponding second branch information; and the branch predictor configured to provide predicted branch PCs and target addresses for the first and second branch instructions corresponding to a single PC identifying one of the instructions stored in the buffer.
  • BTB branch target buffer
  • PC program counter
  • TAGE tagged geometric
  • branch predictor configured to chain table entries associated with the first branch instruction and table entries associated with the second branch instruction together by concatenating an entry in the BTB table for the first branch entry to include information from the second branch entry; and concatenating an entry in the TAGE table for the first branch entry to include information from the second branch instruction such that a prediction of taken for the first branch entry automatically predicts a prediction of taken for the second branch entry.
  • the branch predictor further configured, when more than one second branch instruction exists in a same branch prediction block, to modify the BTB entry for the first branch instruction to include information for multiple second branch instructions; and modify the TAGE entry indexed with corresponding global history registers to include information for one of the second branches, wherein each of the entries in the TAGE table corresponds to a separate global history and table index, and each of the one or more second branches corresponds to a different execution direction.
  • the branch predictor further configured to add an entry to the TAGE table that includes one of the first branch instructions associated with a not taken entry.
  • each BTB table entry includes a branch indexing address that maps to a program counter of an instruction being fetched from memory and a predicted target branch address corresponding to the program counter; each TAGE table entry includes one or more tagged predictor components that are indexed according to different lengths that provide a predicted direction for the first and second branch instructions; and the branch predictor is further configured to match the predicted direction of the first and second branch instruction direction predicted by the TAGE branch predictor with the first branch instruction and the associated second branch instruction of the BTB that includes the predicted branch target addresses, and output the predicted branch PCs and target addresses of the first and second branch instructions into the one or more processors.
  • the branch predictor is further configured to input the program counter into the BTB to look up the corresponding first and second branch target addresses; and input the program counter and a global history register value into the TAGE branch predictor to look up the corresponding first and second branch instructions.
  • the concatenating the entry in the BTB table is performed when the first branch instruction is a single branch prediction and the first branch instruction and the second branch instruction are direct branches and identified as taken.
  • the concatenating the entry in the TAGE table is performed when the first branch instruction is a single branch prediction, the first branch instruction and the second branch instruction are identified as taken, the first branch instruction and the second branch instruction are predicted as taken by the TAGE branch predictor, and the first branch instruction and the second branch instruction are direct branches.
  • when more than one second branch instruction exists in a same branch prediction block, the branch predictor is further configured to chain the first branch instructions and the second branch instructions together by modifying an entry in the BTB table for the one or more first branch instructions to include a single second branch instruction out of multiple branches in the second branch prediction block; and modifying entries in the TAGE table such that each of the one or more first branch instructions has a separate entry corresponding to a different one of the one or more second branch instructions, wherein each of the entries in the TAGE table corresponds to a separate global history and table index, and each of the one or more second branch instructions corresponds to a different execution direction.
  • when more than one second branch instruction exists in a same branch prediction block, the branch predictor is further configured to chain a single one of the first branch instructions and a single one of the second branch instructions together by modifying an entry in the BTB table for the single first branch instruction to include a single second branch instruction; and modifying entries in the TAGE table such that the single first branch instruction has separate entries, each corresponding to a different one of the one or more second branch instructions, wherein each of the entries in the TAGE table corresponds to a separate global history and table index, and each of the one or more second branch instructions corresponds to a different execution direction.
  • branch target predictor comprises a co-processor or field-programmable gate array.
  • a computer-implemented method for branch prediction comprising retrieving, in a fetch stage, instructions from a memory; storing instructions retrieved by the fetch stage in a buffer; executing the instructions stored in the buffer by one or more processors; predicting, by a branch predictor, including storing one or more branch target buffer (BTB) entries in a BTB table, each of the BTB entries including first branch information for a first branch instruction, the first branch information including a program counter (PC), a branch target address, and an attribute associated with corresponding second branch information for a second branch instruction, the second branch information including a PC, a branch target address and an attribute, and storing one or more tagged geometric (TAGE) entries in multiple TAGE tables, each of the TAGE entries including the first branch information associated with the corresponding second branch information; and providing, by the branch predictor, predicted branch PCs and target addresses for the first and second branch instructions corresponding to a single PC identifying one of the instructions stored in the buffer.
  • FIGS. 1A and 1B are respectively block diagrams of a computer system and a microprocessor that can be incorporated into such a computer system.
  • FIG. 2 illustrates the pipeline of a microprocessor like that of FIG. 1B.
  • FIG. 3 illustrates one example embodiment of the branch predictor in FIG. 2.
  • FIG. 4 illustrates an example of training a dual branch predictor (DBP) by chaining together two consecutive branches.
  • DBP dual branch predictor
  • FIG. 5 illustrates an example of training a dual branch predictor (DBP) when multiple branches exist on the same BP fetch block.
  • FIG. 6 illustrates a flow diagram of training the dual branch predictor of FIG. 3.
  • FIGS. 7A and 7B illustrate an example of a dual branch prediction (DBP) flow in accordance with the disclosed embodiments.
  • FIGS. 8A and 8B illustrate another example of a dual branch prediction flow in accordance with the disclosed embodiments.
  • FIGS. 8C and 8D illustrate another example of a dual branch prediction flow in accordance with the disclosed embodiments.
  • FIG. 9 shows an example of a computing system in which the microprocessor architecture disclosed herein may be implemented.
  • FIG. 10 illustrates a schematic diagram of a general-purpose network component or computer system.
  • Modern computing architectures increasingly rely on speculation to boost instruction-level parallelism. For example, data that is likely to be read in the near future is speculatively prefetched, and predicted values are speculatively used before actual values are available. Accurate prediction mechanisms are a driving force behind these techniques, so increasing the accuracy of predictors increases the performance benefit of speculation.
  • the branch predictor in the disclosed technology increases the accuracy of branch prediction using a dual branch predictor (DBP) in a single cycle.
  • the DBP provides dual branch prediction per cycle using a modified branch target buffer (BTB) and tagged geometric (TAGE) history length predictors.
  • FIGS. 1A and 1B are respectively block diagrams of a computer system and a microprocessor such as can be incorporated into such a computer system.
  • the computer system 100 includes a computer 105, one or more input devices 101 and one or more output devices 103.
  • input devices 101 include a keyboard or mouse.
  • output devices 103 include monitors or printers.
  • the computer 105 includes memory 107 and microprocessor 120, where in this simplified representation the memory 107 is represented as a single block.
  • the memory 107 can include ROM memory, RAM memory and non-volatile memory and, depending on the embodiment, include separate memory for data and instructions.
  • FIG. 1B illustrates one embodiment for the microprocessor 120 of FIG. 1A and also includes the memory 107.
  • the microprocessor 120 includes control logic 125, a processing section 140, an input interface 121, and an output interface 123.
  • the dashed lines represent control signals exchanged between the control logic 125 and the other elements of the microprocessor 120 and the memory 107.
  • the solid lines represent the flow of data and instructions within the microprocessor 120 and between the microprocessor 120 and memory 107.
  • the processing block 140 includes combinatorial logic 143 that is configured to execute instructions and registers 141 in which the combinatorial logic stores instructions and data while executing these instructions.
  • specific elements or units such as an arithmetic and logic unit (ALU), floating point unit, and other specific elements commonly used in executing instructions are not explicitly shown in the combinatorial logic 143 block.
  • the combinatorial logic 143 is connected to the memory 107 to receive and execute instructions and supply back the results.
  • the combinatorial logic 143 is also connected to the input interface 121 to receive input from input devices 101 or other sources and to the output interface 123 to provide output to output devices 103 or other destinations.
  • FIG. 2 schematically illustrates an embodiment of the pipelined operation of a microprocessor such as can be used in the processing section 140 of microprocessor 120 represented in FIG. 1B.
  • the different stages of the pipeline can be executed by the combinatorial logic 143, with the various buffers being part of the registers 141.
  • the various pipeline stages can be executed through software, hardware elements, firmware, or various combinations of these as further discussed below.
  • the registers 141, including the various buffers explicitly represented in FIG. 2, are commonly implemented as volatile random access memory (RAM), but some embodiments may also employ some amount of non-volatile memory.
  • Pipelined microprocessors have a number of variations, where the embodiment of FIG. 2 shows a fairly typical example of a microprocessor integrated circuit in a block representation, but other embodiments can have differing numbers of stages with differing names and functions and can be implemented in hardware, firmware, software and various combinations of these.
  • the first stage is a fetch stage 201 that retrieves instructions from the instruction memory and loads them in an instruction/decode buffer 203.
  • the fetched instruction can be stored in the instruction/decode buffer based on the starting and ending addresses of a fetch block and ordered from earliest to latest.
  • FIG. 2 illustrates these subsequent stages as a particular set of stages and buffers.
  • Other microprocessor embodiments may have more or fewer stages and may use differing names, but FIG. 2 can be taken as a representative embodiment for purposes of this discussion.
  • the various functional stages are represented as blocks and can be implemented through various logic circuitry on the microprocessor's integrated circuit, depending on the embodiment; and buffer sections are represented as the series of smaller squares, where these buffers can be implemented by RAM or other memory and may be distinct or portions of a shared memory, again depending on the embodiment.
  • a decode stage 205 decodes the instructions from the instruction/decode buffer 203 and places them in the dispatch buffer 207 for subsequent execution.
  • Dispatch stage 209 issues the decoded instructions, distributing them to reservation stations 211, after which they are executed as grouped at 213. The results upon finishing execution are then placed in the reorder/completion buffer 215, pass through completion stage 217 and then go to the store buffer 219. Finally, the pipeline ends with the retire stage 221, with any results being written back to memory or provided as output as needed.
  • the pipeline structure allows for efficient execution of instructions, as while one set of instructions is at one stage, the next set can follow a cycle behind in the preceding stage.
  • However, if an instruction is a branch instruction, such as an if-then-else or a jump type instruction, once this instruction is to be executed the pipeline can be stalled until a needed new instruction is fetched and propagates through the pipeline.
  • FIG. 2 shows a branch instruction at 213a in the far-left slot of the execute stage. A taken branch can only redirect instructions when executed. If this branch requires another instruction to be fetched, this will need to be reported back to the fetch stage 201 so that it can retrieve the needed instruction, which must then propagate through the pipeline.
  • a microprocessor can use branch prediction, where if an instruction with a branch is fetched, the result of this branch can be predicted, and any needed instructions can then be fetched and speculatively executed.
  • FIG. 2 includes such a branch predictor 500.
  • the branch predictor 500 receives information on branching instructions that have been fetched and then makes predictions for instructions to be fetched and speculatively executed.
  • the branch predictor 500 can predict the starting and ending addresses of a fetch block, then the fetch stage 201 can use this information to fetch the actual instructions of the fetch block.
  • the correctness of a branch prediction is determined after the branch is executed.
  • An n-cycle branch predictor usually includes a “target predictor” and a “direction predictor.”
  • a target prediction for the target predictor includes one or more branch/target pairs in the given fetch block, while the direction prediction predicts whether any instruction of the fetch block is taken or not. Those two pieces of information are combined to choose the earliest taken branch in the fetch block. If none is chosen, the fetch block is predicted not taken.
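The combination of target and direction predictions described above can be sketched as follows. This is a minimal illustration rather than the disclosed hardware: the names `BranchTarget` and `select_earliest_taken`, and the representation of the direction prediction as a set of taken branch addresses, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class BranchTarget:
    """One branch/target pair reported by the target predictor (hypothetical type)."""
    branch_pc: int  # address of the branch inside the fetch block
    target: int     # predicted target address of that branch


def select_earliest_taken(
    pairs: Iterable[BranchTarget], taken_pcs: Set[int]
) -> Optional[BranchTarget]:
    """Combine both predictions and return the earliest taken branch.

    Returns None when no branch is predicted taken, which models a fetch
    block that is predicted not taken.
    """
    taken = [pair for pair in pairs if pair.branch_pc in taken_pcs]
    if not taken:
        return None
    # The earliest branch is the one with the lowest address in the block.
    return min(taken, key=lambda pair: pair.branch_pc)
```

For example, with branches at addresses 0x104 and 0x108 both predicted taken, the pair at 0x104 is selected because it comes first in the fetch block.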
  • Most target predictors, e.g., branch target buffers (BTBs), can be looked up in 1 cycle, while a state-of-the-art direction predictor, such as a tagged geometric (TAGE) length predictor, needs at least a few cycles to get results.
  • FIG. 3 illustrates one example embodiment of the branch predictor in FIG. 2.
  • the branch predictor includes a branch target buffer (BTB) predictor 500A and a TAGE length predictor 500B.
  • the fetch stage 201 (FIG. 2) includes the branch predictor 500 that generates branch target addresses that are stored or provided to one or more BTB predictors (or BTB tables) and TAGE predictors (or TAGE tables).
  • the generated branch target addresses are relative to a program counter (PC) (not shown) identifying the instructions stored in the buffer.
  • Although the BTB predictor 500A and the TAGE predictor 500B are shown internal to the branch predictor 500, the BTB predictor 500A and the TAGE predictor 500B may or may not be located in the microprocessor 120 proximate to certain elements of the branch predictor 500 or fetch stage 201.
  • a BTB predictor 500A is conventionally a single small memory cache in a processor that stores branch information including predicted branch targets.
  • the BTB provides a view (for example, in the form of a tree) of all the branch instructions.
  • a BTB predictor 500A can be used to predict the target of a predicted taken branch instruction based on the address of the branch instruction. Predicting the target of the branch instruction can prevent pipeline stalls by not waiting for the branch instruction to reach the execution stage of the pipeline to compute the branch target address.
  • the branch's target instruction decode may be performed in the same cycle or the cycle after the branch instruction instead of having multiple cycles between the branch instruction and the target of the predicted taken branch instruction.
  • the BTB predictor 500A stores entries in a BTB table (not shown) where addresses of taken branch instructions are stored together with their target addresses.
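A conventional BTB of the kind described above can be modeled as a small direct-mapped table. The sketch below is illustrative only: the class name, the table size, and the indexing scheme (dropping the two low address bits) are assumptions, not details taken from the disclosure.

```python
from typing import Dict, Optional, Tuple


class BranchTargetBuffer:
    """Direct-mapped table of taken branches and their targets (hypothetical model)."""

    def __init__(self, num_entries: int = 1024) -> None:
        self.num_entries = num_entries
        # index -> (branch address used as tag, predicted target address)
        self.entries: Dict[int, Tuple[int, int]] = {}

    def _index(self, pc: int) -> int:
        # Assumption: instructions are at least 4-byte aligned.
        return (pc >> 2) % self.num_entries

    def update(self, branch_pc: int, target: int) -> None:
        """Record a taken branch; an aliasing older entry is overwritten."""
        self.entries[self._index(branch_pc)] = (branch_pc, target)

    def lookup(self, pc: int) -> Optional[int]:
        """Return the predicted target for pc, or None on a BTB miss."""
        entry = self.entries.get(self._index(pc))
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None
```

A hit supplies the target without waiting for the branch to reach the execute stage; a miss leaves the fetch stage to continue sequentially.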
  • Other branch prediction components that may be included in the BTB predictor 500A or implemented separately include a branch history table and a pattern history table.
  • a branch history table can predict the direction of a branch (taken vs. not taken) as a function of the branch address.
  • a pattern history table can assist with direction prediction of a branch as a function of the pattern of branches encountered leading up to the given branch which is to be predicted.
  • a TAGE predictor 500B is a common form of branch predictor used for predicting branch instruction outcomes.
  • TAGE branch predictor may have a number of TAGE prediction tables to provide a TAGE prediction for the branch instruction outcome, where each TAGE prediction table comprises a number of prediction entries trained based on previous branch instruction outcomes.
  • the TAGE prediction tables are looked up based on an index determined as a function of a target instruction address and a portion of previous execution history which is indicative of execution behavior preceding instruction at the target instruction address. The portion of the previous execution history used to determine the index has different lengths for different TAGE prediction tables of the TAGE predictor 500B.
  • By tracking predictions for different lengths of previous execution history, the TAGE predictor 500B can be relatively accurate: when there is a match against a longer pattern of previous execution history, it is more likely that the predicted branch instruction outcome will be relevant to the current target instruction address. That said, the TAGE predictor 500B is also able to record predictions for shorter lengths of previous execution history in case there is no match against a longer length of previous execution history. In general, the TAGE predictor provides the direction for each branch taken in the system 100.
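The lookup across tables with different history lengths can be sketched as below. This is a simplified model under stated assumptions: the history lengths (4, 8, 16, 32), the XOR-folding hash, the table and tag widths, and the names `fold`, `TaggedTable` and `tage_predict` are all illustrative choices, and usefulness counters and entry allocation policy are omitted.

```python
from typing import Dict, List, Optional, Tuple

# Assumed geometric series of history lengths, one per tagged table.
HISTORY_LENGTHS = [4, 8, 16, 32]


def fold(history: int, length: int, width: int) -> int:
    """XOR-fold the most recent `length` history bits down to `width` bits."""
    bits = history & ((1 << length) - 1)
    folded = 0
    while bits:
        folded ^= bits & ((1 << width) - 1)
        bits >>= width
    return folded


class TaggedTable:
    """One tagged predictor component indexed by address and history."""

    def __init__(self, history_length: int, index_bits: int = 8, tag_bits: int = 8) -> None:
        self.history_length = history_length
        self.index_bits = index_bits
        self.tag_bits = tag_bits
        self.rows: Dict[int, Tuple[int, bool]] = {}  # index -> (tag, taken)

    def _index(self, pc: int, history: int) -> int:
        mask = (1 << self.index_bits) - 1
        return (pc ^ fold(history, self.history_length, self.index_bits)) & mask

    def _tag(self, pc: int, history: int) -> int:
        mask = (1 << self.tag_bits) - 1
        return ((pc >> self.index_bits) ^ fold(history, self.history_length, self.tag_bits)) & mask

    def train(self, pc: int, history: int, taken: bool) -> None:
        self.rows[self._index(pc, history)] = (self._tag(pc, history), taken)

    def lookup(self, pc: int, history: int) -> Optional[bool]:
        row = self.rows.get(self._index(pc, history))
        if row is not None and row[0] == self._tag(pc, history):
            return row[1]
        return None  # miss in this table


def tage_predict(tables: List[TaggedTable], pc: int, history: int, fallback: bool) -> bool:
    """Use the hit from the longest-history table, else the fallback prediction."""
    for table in sorted(tables, key=lambda t: t.history_length, reverse=True):
        outcome = table.lookup(pc, history)
        if outcome is not None:
            return outcome
    return fallback
```

Searching from the longest history downward reflects the reasoning in the text: a match on a longer pattern is more likely to be relevant, while shorter-history tables still cover the case where no long match exists.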
  • a TAGE predictor 500B typically also includes a fallback predictor which provides a fallback prediction for the branch instruction outcome in case the lookup of the TAGE prediction tables misses in all of the TAGE prediction tables.
  • the fallback prediction could also be used if confidence in the prediction output by one of the TAGE prediction tables is less than a given threshold.
  • the fallback predictor is implemented as a bimodal predictor, which provides a number of 2-bit confidence counters which track whether a branch should be strongly predicted taken, weakly predicted taken, weakly predicted not taken or strongly predicted not taken.
  • Although the bimodal predictor is typically much less accurate than the TAGE prediction tables once the TAGE prediction tables have been suitably trained, providing the bimodal predictor as a fallback predictor to handle cases when the TAGE prediction tables have not yet reached a sufficient level of confidence can be useful to improve performance.
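The 2-bit confidence counters of a bimodal predictor can be sketched as saturating counters. The counter encoding (0 = strongly not taken through 3 = strongly taken), the initial state, the table size, and the class name are assumptions made for this example.

```python
from typing import List


class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by branch address (hypothetical model)."""

    def __init__(self, num_counters: int = 1024) -> None:
        # Assumption: every counter starts at 1, i.e. weakly not taken.
        self.counters: List[int] = [1] * num_counters

    def _index(self, pc: int) -> int:
        return pc % len(self.counters)

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of the two taken states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the observed outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Because a strongly taken counter needs two contrary outcomes before its prediction flips, a single anomalous outcome does not change the prediction.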
  • two branch predictions per cycle, i.e., dual branch prediction (DBP)
  • fetching two branch predictions per cycle doubles the fetch capacity of the microprocessor.
  • FIG. 4 illustrates an example of training a dual branch predictor (DBP) by chaining together two consecutive branches.
  • A microprocessor cannot execute an application, such as application 202, faster than it fetches the application's instructions.
  • a branch predictor such as branch predictor 500
  • the illustrated embodiment shows two consecutive branch prediction (BP) fetch blocks 402 and 404, where each fetch block includes a branch (e.g., branch 1 or branch 2).
  • each BP fetch block 402 and 404 is observed during a different cycle.
  • two cycles are required in which to observe the BP fetch blocks 402 and 404 (and corresponding branches 1 and 2).
  • each BP fetch block 402 and 404 can be a single fetch block. In one other embodiment, each BP fetch block 402 and 404 can be split into multiple fetch blocks.
  • a branch predictor such as a dynamic branch predictor, uses the history of program execution to guess (predict) whether a branch should be taken.
  • Dynamic branch predictors maintain a table of branch instructions that the processor has executed.
  • One such dynamic branch predictor is a BTB, which has a table including the destination of the branch and a history of whether the branch was taken.
  • Still another well-known predictor is the TAGE predictor.
  • the TAGE predictor has TAGE prediction tables that comprise a number of prediction entries trained based on previous branch instruction outcomes.
  • to predict more than a single branch requires more than a single cycle. In the case of two branches, two cycles would be required to make the prediction, one for each branch.
  • When branch 1 is predicted in a subsequent fetch cycle, the branch predictor 500 also automatically predicts branch 2. Likewise, for the TAGE table, entry 408 for branch 1 from the first cycle is attached with information about the entry for branch 2 from the second cycle. When branch 1 is predicted in a subsequent fetch cycle, the branch predictor 500 also automatically predicts branch 2. These modified entries can be used to make dual predictions in the future. Thus, two branches (branch 1 and branch 2) can be predicted in the same fetch cycle.
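The chaining of a branch 1 entry to branch 2 information can be sketched as follows for the BTB side. This is an illustrative model only: the `BtbEntry` type, the `chained` field, and the use of a dictionary keyed by fetch address are assumptions, and the corresponding TAGE entry is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class BtbEntry:
    """BTB entry for a branch, optionally carrying the next branch (hypothetical layout)."""
    branch_pc: int
    target: int
    chained: Optional["BtbEntry"] = None  # branch 2 information attached to branch 1


def chain(first: BtbEntry, second: BtbEntry) -> None:
    """Attach the branch 2 information to the branch 1 entry."""
    first.chained = second


def predict_dual(btb: Dict[int, BtbEntry], pc: int) -> List[Tuple[int, int]]:
    """Return (branch PC, target) pairs available from one lookup of pc."""
    entry = btb.get(pc)
    if entry is None:
        return []
    predictions = [(entry.branch_pc, entry.target)]
    if entry.chained is not None:
        # A single lookup of branch 1 also yields the branch 2 prediction.
        predictions.append((entry.chained.branch_pc, entry.chained.target))
    return predictions
```

Before chaining, a lookup yields one prediction; after chaining, the same lookup yields two, which is the sense in which two branches can be predicted in the same fetch cycle.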
  • FIG. 5 illustrates an example of training a dual branch predictor (DBP) when multiple branches exist on the same BP fetch block.
  • fetch block 502 has two branches: branch 1a and branch 1b. If branch 1a is taken, three different branches (branches 2a, 2b and 2c) exist in fetch block 504. Thus, three different execution paths exist, one for each of the different branches (the arrow from branch 2b shows the selected execution path). Although not depicted, a similar scenario would be true for branch 1b (i.e., when branch 1b of fetch block 502 is selected, multiple different branches (e.g., branches 2x, 2y and 2z) would exist in the fetch block 504). In this situation, further modification is made to the BTB table 506A (and 506B) and the TAGE table 508A (and 508B).
  • the entries are modified to attach all possible branches from branch 2 to the branch 1 entry.
  • the entries in the BTB table 506A for DBP are modified to attach branches 2a, 2b and 2c to branch 1a.
  • the entries in the BTB table 506B are modified to attach branches 2x, 2y and 2z to branch 1b.
  • the entries for DBP are also modified for all possible branches from branch 2 to the branch 1 entry.
  • Since the TAGE entries provide a path direction for branch 2, there will be three different path directions (in TAGE table 508A, one for each of branch 2a, 2b and 2c; in TAGE table 508B, one for each of branch 2x, 2y and 2z). Accordingly, the execution history (and table index) for each branch 2 option will be different. For example, in the illustrated embodiment, there will be at least 3 different TAGE entries for branch 1. In TAGE table 508A, branch 1a will have separate entries for branch 2a, branch 2b and branch 2c. In TAGE table 508B, branch 1b will have separate entries for branch 2x, 2y and 2z. In one embodiment, each entry has a different execution history and table index.
  • the branch 2 entries in the TAGE tables 508A and 508B will be respectively attached to branch 1a or branch 1b, as shown.
  • the “not taken” entry in the TAGE tables 508A and 508B indicates that none of the branch 2 entries have been taken, and the process would continue to the next fetch block with the follow through address as the predicted fetch address.
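The selection among several attached branch 2 candidates can be sketched as below. This is a hedged illustration: the dictionary layouts, the use of `None` to model the "not taken" entry, the branch labels as string keys, and the function name `predict_second_branch` are assumptions rather than the disclosed table format.

```python
from typing import Dict, Optional, Tuple


def predict_second_branch(
    btb_candidates: Dict[int, Dict[str, int]],
    tage_entries: Dict[Tuple[int, int], Optional[str]],
    first_branch_pc: int,
    global_history: int,
    fall_through: int,
) -> int:
    """Return the predicted fetch address that follows branch 1.

    btb_candidates maps a branch 1 address to every attached branch 2 and
    its target. tage_entries maps (branch 1 address, global history) to the
    branch 2 chosen for that history, with None modeling "not taken".
    """
    candidates = btb_candidates.get(first_branch_pc, {})
    choice = tage_entries.get((first_branch_pc, global_history))
    if choice is None or choice not in candidates:
        # No branch 2 is taken: continue with the next sequential address.
        return fall_through
    return candidates[choice]
```

Each (address, history) key stands for a separate TAGE entry, so the same branch 1 can lead to branch 2a, 2b or 2c depending on the execution history that preceded it.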
  • FIG. 6 illustrates a flow diagram of training the dual branch predictor of FIG. 3.
  • the branch predictor 500 performs the procedures. However, it is appreciated that any other functional unit or processing unit may implement the processes described herein, and the disclosure is not limited to implementation by the branch predictor.
  • the training process 600 trains the branch predictor 500 by observing single branch predictions (SBPs).
  • SBPs single branch predictions
  • the BTB predictor 500A and TAGE predictor 500B may be trained.
  • Steps 604 - 610 describe the flow of training the BTB predictor 500A.
  • the branch predictor 500 determines whether the detected branch is an SBP. If the detected branch (e.g., branch 1a) is not an SBP (e.g., branch 2b has been detected in addition to branch 1a, as shown in FIG. 5), then the branch has already been trained and no further operation (NOOP) is necessary (step 604a).
  • NOOP no further operation
  • the branch predictor 500 determines whether branch 2 (e.g., branch 2b in this example) has also been “taken,” such that both branches have been “taken.” If both branches have not been “taken,” then no further operation (NOOP) is necessary (step 606a) and the process 600 continues with the SBP. Otherwise, if both branches are “taken,” then the branch predictor 500 proceeds to step 608 to determine whether both branch 1a and branch 2b are direct branches.
  • If both branches are not direct branches, then no further operation (NOOP) is necessary and the process 600 proceeds to step 608a. If the branch predictor 500 determines that both branches are direct, then entries in the BTB table (e.g., BTB tables 406, 506A or 506B) are modified by attaching the branch 2 information to the branch 1 entry, as described above.
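The BTB training checks of steps 604 to 610 can be sketched as a sequence of guards. The `ObservedBranch` type, the dictionary used as the BTB table, and the function names are assumptions for illustration; the step numbers in the comments follow the text.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ObservedBranch:
    """Outcome of one executed branch (hypothetical record)."""
    pc: int
    target: int
    taken: bool
    direct: bool  # True for a direct branch, False for an indirect one


def should_chain_btb(is_single_prediction: bool, b1: ObservedBranch, b2: ObservedBranch) -> bool:
    """Return True only when every BTB training condition holds."""
    if not is_single_prediction:
        return False  # step 604a: already trained, NOOP
    if not (b1.taken and b2.taken):
        return False  # step 606a: both branches must be taken, NOOP
    return b1.direct and b2.direct  # step 608: both must be direct branches


def train_btb(btb: Dict[int, dict], is_single_prediction: bool,
              b1: ObservedBranch, b2: ObservedBranch) -> bool:
    """Attach the branch 2 information to the branch 1 entry when allowed."""
    if not should_chain_btb(is_single_prediction, b1, b2):
        return False
    btb[b1.pc] = {"target": b1.target, "second": (b2.pc, b2.target)}
    return True
```

Restricting chaining to direct branches keeps the attached target fixed, since the target of a direct branch does not change between executions.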
  • Steps 612 - 620 describe the flow of training the TAGE predictor 500B.
  • the branch predictor 500 determines whether the detected branch is an SBP. If the detected branch (e.g., branch 1a) is not a SBP (e.g., branch 2b has been detected in addition to branch 1a, as shown in FIG. 5), then the branch has already been trained and no further operation (NOOP) is necessary (step 612a).
  • branch predictor 500 determines whether branch 2 (e.g., branch 2b in this example) has also been “taken,” such that both branches have been “taken.” If both branches have not been “taken,” then no further operation (NOOP) is necessary (step 614a) and the process 600 continues with the SBP. Otherwise, if both branches are “taken,” then the branch predictor 500 proceeds to step 616 to determine if TAGE predicts that both branches are “taken” (this also provides the direction or path of the branches).
  • If both branches have not been “taken” in TAGE, then no further operation (NOOP) is necessary (step 616a) and the process 600 continues with the SBP. If both branches have been “taken,” then the process 600 continues to step 618.
  • the branch predictor 500 determines whether both branch 1a and branch 2b are direct branches. If both branches are not direct branches, then no further operation (NOOP) is necessary and the process 600 proceeds to step 618a. Otherwise, if the branch predictor 500 determines that both branches are direct, then entries in the TAGE table (e.g., TAGE tables 408, 508A or 508B) are modified by attaching the branch 2 information to each of the branch 1 entries, as described above.
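
The TAGE training flow in steps 612 - 620 adds one check to the BTB flow (whether TAGE itself predicts both branches as taken) and updates every branch 1 entry rather than one. The sketch below is a simplified model under assumptions: `train_tage`, the list-of-dicts representation of the branch 1 entries, the keyword flags, and the label "620" for the update step are hypothetical names chosen for illustration.

```python
def train_tage(branch1_entries, branch2_info, *, single_prediction,
               both_taken, tage_both_taken, both_direct):
    """Return the label of the step at which TAGE training stopped."""
    if not single_prediction:
        return "612a"      # not an SBP: already trained, NOOP
    if not both_taken:
        return "614a"      # both branches were not taken: NOOP
    if not tage_both_taken:
        return "616a"      # TAGE does not predict both as taken: NOOP
    if not both_direct:
        return "618a"      # not both direct branches: NOOP
    for entry in branch1_entries:
        entry["second"] = branch2_info   # attach to each branch 1 entry
    return "620"


# Two hypothetical TAGE entries that both belong to branch 1a.
entries = [{"table": 1, "tag": 0x2A}, {"table": 3, "tag": 0x51}]
step = train_tage(entries, "2b", single_prediction=True, both_taken=True,
                  tage_both_taken=True, both_direct=True)
```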
  • the dual branch predictor has been trained and may be used in implementation.
  • FIGS. 7A and 7B illustrate an example of a dual branch prediction (DBP) flow in accordance with the disclosed embodiments.
  • the DBP is the trained branch predictor 500 with the BTB predictor 500A and the TAGE predictor 500B, as illustrated in FIG. 3.
  • the PC which serves as input into the BTB predictor 500A of the branch predictor 500, may advance through addresses of a block (fetch block) of compiled instructions, incrementing by a particular number of bytes depending on the length of each instruction and how many instructions are fetched.
  • there may be a branch instruction associated with the branch target address to which the PC will jump, or a conditional branch instruction whose condition is to be evaluated to yield a branch direction result.
  • the BTB predictor 500A is a cache memory associated with the instruction fetch stage 201 of the pipeline, as described above.
  • the BTB predictor 500A retains 3-tuples in its table, each of which contains the address of a previously executed instruction, information that permits a prediction as to whether or not the instruction branch will be taken, and the most recent target address for that branch.
  • the fetch stage 201 compares the instruction address (based on the PC) against the instruction addresses in the BTB table of BTB predictor 500A.
  • once the branch is actually resolved at the execute stage, the BTB can be updated with the corrected prediction information and target address.
  • the fetched address is found in the BTB table of the BTB predictor 500A.
  • the BTB table for DBP includes branch entries for branch 1a and branch 1b, which have been previously trained as explained above.
  • the entry for branch 1a includes the target address (and prediction status) for branch 1a as well as the target addresses for branches 2a, 2b and 2c.
  • the entry for branch 1b includes the target address (and prediction status) for branch 1b as well as the target addresses for branches 2x, 2y and 2z.
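
As a rough illustration of the trained BTB table just described, each entry can be modeled as a tuple of branch address, prediction status and target address, extended with the attached branch 2 targets. The addresses, the `BTBEntry` type and the `btb_lookup` helper below are invented for the example and are not taken from the disclosure.

```python
from typing import Dict, NamedTuple, Optional


class BTBEntry(NamedTuple):
    branch_pc: int                  # address of the previously executed branch
    predict_taken: bool             # prediction status for the branch
    target: int                     # most recent target address
    second_targets: Dict[str, int]  # attached branch 2 targets, keyed by label


def btb_lookup(table: Dict[int, BTBEntry], fetch_pc: int) -> Optional[BTBEntry]:
    """Compare the fetch address against the table; None models a BTB miss."""
    return table.get(fetch_pc)


# Hypothetical contents after training: branch 1a and branch 1b entries.
btb_table = {
    0x1000: BTBEntry(0x1000, True, 0x2000,
                     {"2a": 0x3000, "2b": 0x3100, "2c": 0x3200}),
    0x1008: BTBEntry(0x1008, True, 0x2400,
                     {"2x": 0x3400, "2y": 0x3500, "2z": 0x3600}),
}

hit = btb_lookup(btb_table, 0x1000)    # entry for branch 1a
miss = btb_lookup(btb_table, 0x1004)   # no branch recorded at this address
```

The lookup returns every candidate branch 2 target at once, which is why a separate direction prediction is still needed to choose among them.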
  • although the BTB predictor 500A provides target addresses for the predicted branch (and the prediction status), it provides no information as to the direction or path of execution for the predicted branch (when taken).
  • the branch predictor 500 utilizes the TAGE predictor 500B.
  • the TAGE predictor 500B in addition to the PC input, also receives as input an execution history.
  • the TAGE predictor 500B may include TAGE tables used to generate a prediction for DBP and a fallback predictor used to provide a fallback prediction in case the TAGE tables cannot provide a suitable prediction for DBP (e.g., fallback to SBP).
  • the TAGE predictor 500B is looked up based on lookup information which is derived as a function of the program counter (PC) which represents a target instruction address for which a prediction is required, and/or previous execution history (e.g., from a global history register (GHR)) which represents a history of execution behavior leading up to the instruction associated with the PC address.
  • the execution history could include a sequence of taken/not taken indications for a number of branches leading up to the instruction represented by the PC, or portions of one or more previously executed instruction addresses, or could represent a call stack history representing a sequence of function calls which led to the instruction represented by the PC.
  • the execution history could also include an indication of how long ago the branch was encountered previously (e.g. number of intervening instructions or intervening branch instructions). By considering some aspects of history leading up to a given PC address in addition to the address itself, this can provide a more accurate prediction for branches which may have different behavior depending on the previous execution sequence.
  • the fallback predictor could be selected in cases where the main TAGE predictor cannot provide a prediction of sufficiently high confidence.
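
One common way to derive lookup information from the PC and a global history register is to fold the history down to the index width and XOR it with the PC. The sketch below assumes that scheme; the disclosure does not specify the hash, tag width, table layout, or confidence threshold, so `fold`, `tage_index`, `pc_tag`, `predict` and `min_conf` are all illustrative assumptions.

```python
def fold(history: int, length: int, bits: int) -> int:
    """XOR-fold the youngest `length` history bits down to `bits` bits."""
    h = history & ((1 << length) - 1)
    out = 0
    while h:
        out ^= h & ((1 << bits) - 1)
        h >>= bits
    return out


def tage_index(pc: int, ghr: int, hist_len: int, index_bits: int) -> int:
    """Table index as a function of the PC and the execution history."""
    return (pc ^ fold(ghr, hist_len, index_bits)) & ((1 << index_bits) - 1)


def pc_tag(pc: int, index_bits: int) -> int:
    """Partial tag taken from PC bits above the index (assumed 8 bits)."""
    return (pc >> index_bits) & 0xFF


def predict(tables, fallback, pc, ghr, min_conf=2):
    """tables: list of (hist_len, index_bits, entries), shortest history first.

    entries maps index -> (tag, taken, confidence). The longest-history
    table with a matching, sufficiently confident entry wins; otherwise
    the fallback predictor supplies the prediction.
    """
    for hist_len, index_bits, entries in reversed(tables):
        entry = entries.get(tage_index(pc, ghr, hist_len, index_bits))
        if entry and entry[0] == pc_tag(pc, index_bits) and entry[2] >= min_conf:
            return entry[1], "tage"
    return fallback(pc), "fallback"


pc, ghr = 0x40, 0b1011
idx = tage_index(pc, ghr, hist_len=4, index_bits=4)
tables = [(4, 4, {idx: (pc_tag(pc, 4), True, 3)})]
always_not_taken = lambda _pc: False
```

With these hypothetical contents, `predict(tables, always_not_taken, pc, ghr)` uses the tagged entry, while a different history value misses in the table and selects the fallback.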
  • the fetched address is found in the TAGE table of the TAGE predictor 500B.
  • the TAGE predictor 500B determines that branches 1a and 2b have been taken, which provides the specific direction/path for the branches.
  • the TAGE predictor is not aware of the target addresses for the selected branches 1a and 2b. These target addresses may be obtained from the BTB predictor 500A, which, as noted above, has acquired the target addresses for each branch.
  • the results from the BTB predictor 500A (target addresses) and the TAGE predictor 500B (branch direction/path) may be combined into a single look up table 702 that provides a dual (two) branch prediction in a single cycle.
  • the two branch target addresses (i.e., the addresses of branch 1a and branch 2b)
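
The combination step can be pictured as a join: TAGE supplies which branches are taken, and the BTB supplies where they go. The following is a minimal sketch under assumed data shapes; `dual_branch_prediction`, the label keys and the addresses are hypothetical and only stand in for the look up table 702.

```python
def dual_branch_prediction(btb_targets, tage_path):
    """Pair each branch that TAGE predicts as taken with its BTB target.

    btb_targets: label -> target address, as provided by the BTB entry.
    tage_path:   labels of the branches TAGE predicts taken, in order.
    """
    return [(label, btb_targets[label]) for label in tage_path]


# Targets attached to the branch 1a entry (illustrative addresses).
btb_targets = {"1a": 0x2000, "2a": 0x3000, "2b": 0x3100, "2c": 0x3200}

# TAGE selects the path branch 1a followed by branch 2b.
prediction = dual_branch_prediction(btb_targets, ("1a", "2b"))
```

The result holds two target addresses produced from a single lookup of each predictor, which is the property the dual prediction relies on.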
  • FIGS. 8A and 8B illustrate another example of a dual branch prediction flow in accordance with the disclosed embodiments.
  • a single entry (single direction/path) of the branch 2 is attached to the branch 1 (i.e., only a partial view of the tree obtained by the BTB table will be stored).
  • only a single branch 2 option is stored in the BTB predictor table (no changes are required to the TAGE predictor table since each branch 1 entry has a single branch 2 option). If the stored option is not selected, then the system falls back to single branch prediction (SBP).
  • fetch block 802 has two branches: branch 1a and branch 1b. If branch 1a is taken, three different branches (branches 2a, 2b and 2c) exist in fetch block 504. Thus, three different execution paths exist for each of the different branches (the arrow from branch 2b shows the selected execution path). Although not depicted, a similar scenario would be true for branch 1b (i.e., when branch 1b of fetch block 502 is selected, multiple different branches (e.g., branches 2x, 2y and 2z) would exist in the fetch block 504). In this situation, further modification is made to the BTB table 506A (and 506B) and the TAGE table 508A (and 508B).
  • the branch 1 entries are modified to select and attach a single branch from the three different branch 2 entries (unlike the example in FIG. 5, in which all three branch 2 entries are attached).
  • the entry in the BTB table 806A is modified to attach branch 2b to branch 1a
  • the entry in the BTB table 806B is modified to attach branch 2x to branch 1b.
  • the entries are modified for all possible branches from branch 2 to the branch 1 entry.
  • the TAGE entries are similar to those presented in FIG. 5, above. For example, in the illustrated embodiment, there will be at least 3 different TAGE entries for branch 1.
  • branch 1a will have separate entries for branch 2a, branch 2b and branch 2c.
  • branch 1b will have separate entries for branch 2x, 2y and 2z.
  • each entry with a different execution history and table index.
  • the branch 2 entries in the TAGE tables 808A and 808B will be respectively attached to branch 1a or branch 1b, as shown.
  • the “not taken” entry in the TAGE tables 808A and 808B indicates that none of the branch 2 entries have been taken, and the process would continue to the next fetch block with the follow through address as the predicted fetch address.
  • the fetched address is found in the BTB table of the BTB predictor 500A.
  • the BTB table for the DBP includes a single branch entry for branch 1a and branch 1b, which have been previously trained as explained above.
  • the entry for branch 1a includes the target address (and prediction status) for branch 1a as well as the target address for branch 2b.
  • the entry for branch 1b includes the target address (and prediction status) for branch 1b as well as the target address for branch 2x.
  • although the BTB predictor 500A provides target addresses for the predicted branch (and the prediction status), it provides no information as to the direction or path of execution for the predicted branch (when taken).
  • the branch predictor 500 utilizes the TAGE predictor 500B.
  • the fetched address is found in the TAGE table of the TAGE predictor 500B.
  • the TAGE predictor 500B determines that branches 1a and 2b have been taken, which provides the specific direction/path for the branches.
  • the TAGE predictor is not aware of the target addresses for the selected branches 1a and 2b. These target addresses may be obtained from the BTB predictor 500A, which, as noted above, has acquired the target addresses for each branch.
  • the results from the BTB predictor 500A (target addresses) and the TAGE predictor 500B (branch direction/path) may be combined into a single look up table 802B that provides a dual (two) branch prediction in a single cycle.
  • the two branch target addresses (i.e., the addresses of branch 1a and branch 2b)
  • FIGS. 8C and 8D illustrate another example of a dual branch prediction flow in accordance with the disclosed embodiments.
  • the depicted embodiment is similar to the example presented in FIGS. 8A and 8B.
  • the TAGE predictor 500B outputs results as branch 1a and branch 2c (instead of branch 1a and branch 2b).
  • the output results from the TAGE predictor 500B are not found in the BTB table of the BTB predictor 500A. That is, there is no entry for branch 1a, branch 2c in the BTB table. Only an entry for branch 1a, branch 2b and an entry for branch 1b, branch 2x exist in the BTB table.
  • the target addresses for branch 1a and branch 2c may not be acquired from the BTB table.
  • the combined output 802D results in branch 1a for branch 1 and “Not Taken” for branch 2. Accordingly, the branch predictor 500 may fall back to single branch prediction (SBP).
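
The single-option variant and its fallback can be sketched as follows. The table shape, addresses, and the `predict` helper are assumptions for illustration; only the behavior (a dual prediction when the stored branch 2 option matches the TAGE output, otherwise “Not Taken” for branch 2 and a fall back to SBP) comes from the description above.

```python
NOT_TAKEN = "Not Taken"

# Each branch 1 entry stores its own target plus a single branch 2 option.
btb = {
    "1a": {"target": 0x2000, "second": {"2b": 0x3100}},
    "1b": {"target": 0x2400, "second": {"2x": 0x3400}},
}


def predict(btb, first, second):
    """Combine the TAGE path (first, second) with the stored BTB targets."""
    entry = btb[first]
    second_target = entry["second"].get(second)
    if second_target is None:
        # The branch 2 chosen by TAGE was not stored: predict branch 1 only.
        return (first, entry["target"]), NOT_TAKEN, "SBP"
    return (first, entry["target"]), (second, second_target), "DBP"


stored_option = predict(btb, "1a", "2b")    # FIGS. 8A and 8B style outcome
missing_option = predict(btb, "1a", "2c")   # FIGS. 8C and 8D style outcome
```

Storing one option per entry keeps the BTB entry small, at the cost of losing the second prediction whenever TAGE selects a path that was not stored.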
  • FIG. 9 shows an example of a computing system in which the microprocessor architecture disclosed herein may be implemented.
  • the computing system 900 includes at least one processor 902, which could be a single central processing unit (CPU) or an arrangement of multiple processor cores of a multi-core architecture.
  • the processor 902 includes a pipeline 904, an instruction cache 906, and a data cache 908 (and other circuitry, not shown).
  • the processor 902 is connected to a processor bus 910, which enables communication with an external memory system 912 and an input/output (I/O) bridge 914.
  • the I/O bridge 914 enables communication over an I/O bus 916, with various different I/O devices 918A-918D (e.g., disk controller, network interface, display adapter, and/or user input devices such as a keyboard or mouse).
  • the external memory system 912 is part of a hierarchical memory system that includes multi-level caches, including the first level (L1) instruction cache 906 and data cache 908, and any number of higher level (L2, L3, ...) caches within the external memory system 912.
  • Other circuitry (not shown) in the processor 902 supporting the caches 906 and 908 includes a translation lookaside buffer (TLB) and various other circuitry for handling a miss in the TLB or the caches 906 and 908.
  • the TLB is used to translate an address of an instruction being fetched or data being referenced from a virtual address to a physical address, and to determine whether a copy of that address is in the instruction cache 906 or data cache 908, respectively.
  • the external memory system 912 also includes a main memory interface 920, which is connected to any number of memory modules (not shown) serving as main memory (e.g., Dynamic Random Access Memory modules).
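
The virtual-to-physical translation performed by the TLB mentioned above can be modeled, very loosely, as a small map from virtual page numbers to physical frame numbers. The 4 KiB page size, the `translate` helper and the addresses are assumptions for this sketch; the disclosure does not specify them.

```python
PAGE_BITS = 12  # assumed 4 KiB pages


def translate(tlb, vaddr):
    """Return the physical address for vaddr, or None on a TLB miss."""
    vpn = vaddr >> PAGE_BITS                    # virtual page number
    offset = vaddr & ((1 << PAGE_BITS) - 1)     # offset within the page
    frame = tlb.get(vpn)
    if frame is None:
        return None                             # miss: handled by other circuitry
    return (frame << PAGE_BITS) | offset


tlb = {0x12345: 0x00ABC}            # hypothetical cached translation
paddr = translate(tlb, 0x12345678)  # hit
missed = translate(tlb, 0x54321000) # miss
```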
  • FIG. 10 illustrates a schematic diagram of a general-purpose network component or computer system.
  • the general-purpose network component or computer system 1000 includes a processor 1002 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 1004, and memory, such as ROM 1006 and RAM 1008, input/output (I/O) devices 1010, and a network 1012, such as the Internet or any other well-known type of network, that may include network connectivity devices, such as a network interface.
  • a processor 1002 is not so limited and may comprise multiple processors.
  • the processor 1002 may be implemented as one or more CPU chips, cores (e.g., a multi-core processor), FPGAs, ASICs, and/or DSPs, and/or may be part of one or more ASICs.
  • the processor 1002 may be configured to implement any of the schemes described herein.
  • the processor 1002 may be implemented using hardware, software, or both.
  • the secondary storage 1004 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if the RAM 1008 is not large enough to hold all working data.
  • the secondary storage 1004 may be used to store programs that are loaded into the RAM 1008 when such programs are selected for execution.
  • the ROM 1006 is used to store instructions and perhaps data that are read during program execution.
  • the ROM 1006 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 1004.
  • the RAM 1008 is used to store volatile data and perhaps to store instructions. Access to both the ROM 1006 and the RAM 1008 is typically faster than to the secondary storage 1004.
  • At least one of the secondary storage 1004 or RAM 1008 may be configured to store routing tables, forwarding tables, or other tables or information disclosed herein.
  • the technology described herein can be implemented using hardware, firmware, software, or a combination of these.
  • the software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein.
  • the processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media.
  • computer readable media may comprise computer readable storage media and communication media.
  • Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
  • a computer readable medium or media does (do) not include propagated, modulated or transitory signals.
  • Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
  • modulated data signal means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
  • communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.
  • some or all of the software can be replaced by dedicated hardware logic components.
  • illustrative types of hardware logic components include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc.
  • some of the elements used to execute the instructions issued in Figure 2 can use specific hardware elements.
  • software stored on a storage device
  • the one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.
  • each process associated with the disclosed technology may be performed continuously and by one or more computing devices.
  • Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The disclosure relates to branch prediction techniques that can improve the performance of pipelined microprocessors. A microprocessor for branch predictor selection includes a fetch stage, a buffer, one or more processors, and a branch predictor. The branch predictor includes a branch target buffer (BTB) that includes a BTB table storing BTB entries. Each of the BTB entries includes first branch information for a first branch instruction and second branch information for a second branch instruction. The branch predictor also includes a tagged geometric history length (TAGE) branch predictor that includes multiple TAGE tables storing TAGE entries. The branch predictor provides predicted branch program counters and target addresses for the first and second branch instructions corresponding to a single program counter identifying one of the instructions stored in the buffer.
PCT/US2020/050680 2020-05-30 2020-09-14 Branch prediction apparatus and method WO2021108007A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063032570P 2020-05-30 2020-05-30
US63/032,570 2020-05-30

Publications (1)

Publication Number Publication Date
WO2021108007A1 true WO2021108007A1 (fr) 2021-06-03

Family

ID=72659923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/050680 WO2021108007A1 (fr) 2020-05-30 2020-09-14 Branch prediction apparatus and method

Country Status (1)

Country Link
WO (1) WO2021108007A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023249729A1 (fr) * 2022-06-24 2023-12-28 Microsoft Technology Licensing, Llc Providing extended branch target buffer (BTB) entries for storing trunk branch metadata and leaf branch metadata

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160034279A1 (en) * 2014-07-31 2016-02-04 International Business Machines Corporation Branch prediction using multi-way pattern history table (pht) and global path vector (gpv)
US20170090935A1 (en) * 2015-09-30 2017-03-30 Ecole Polytechnique Federale De Lausanne (Epfl) Unified prefetching into instruction cache and branch target buffer
US20180060073A1 (en) * 2016-08-30 2018-03-01 Advanced Micro Devices, Inc. Branch target buffer compression
EP3306467A1 (fr) * 2016-10-10 2018-04-11 VIA Alliance Semiconductor Co., Ltd. Branch prediction using multiple byte offsets in hash of instruction block fetch address and branch pattern to generate conditional branch prediction indexes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LEE, J.SMITH, A.: "Branch Prediction Strategies and Branch Target Buffer Design", January 1984, UNIVERSITY OF CALIFORNIA

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023249729A1 (fr) * 2022-06-24 2023-12-28 Microsoft Technology Licensing, Llc Providing extended branch target buffer (BTB) entries for storing trunk branch metadata and leaf branch metadata
US11915002B2 (en) 2022-06-24 2024-02-27 Microsoft Technology Licensing, Llc Providing extended branch target buffer (BTB) entries for storing trunk branch metadata and leaf branch metadata

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20781196

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20781196

Country of ref document: EP

Kind code of ref document: A1