Detailed Description
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, those solutions will be described in detail below with reference to the drawings of the embodiments. It is obvious that the described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention shall fall within the scope of protection of the present invention.
The following terms are used herein:
Conditional branch instruction: an instruction that may change the program flow. If the branch condition is true, the next instruction to be executed changes (the branch is taken); otherwise, execution continues with the next sequential instruction.
Branch predictor: circuitry that guesses which direction a branch will take before the branch instruction finishes executing. By using a branch predictor, the flow of the instruction pipeline can be kept full, improving the performance of the processor's instruction pipeline.
TAGE predictor: TAGE is short for TAgged GEometric history length branch predictor, a hybrid predictor. Its advantage is that it can predict a given branch instruction using branch history sequences of different lengths, evaluate the prediction accuracy of the branch under each history length, and select the prediction with the best historical accuracy as the basis for the final branch prediction.
Saturation counter: also known as a bimodal predictor, it is typically a state machine with 4 states, namely: strongly not-selected, weakly not-selected, weakly selected, and strongly selected. When a branch instruction is resolved, the corresponding state machine is modified. If the branch is not taken, the state value decreases toward "strongly not-selected"; if the branch is taken, the state value increases toward "strongly selected".
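The behavior of such a 2-bit saturation counter can be illustrated with the following sketch (Python is used here purely for illustration; the state encoding follows the definition above):

```python
class SaturatingCounter:
    """Two-bit saturation counter: 0 = strongly not-selected,
    1 = weakly not-selected, 2 = weakly selected, 3 = strongly selected."""

    def __init__(self, value=1):
        self.value = value  # start weakly not-selected

    def predict_selected(self):
        # The high bit alone decides the predicted direction.
        return self.value >= 2

    def update(self, taken):
        if taken:
            # Branch taken: move toward "strongly selected", saturating at 3.
            self.value = min(self.value + 1, 3)
        else:
            # Branch not taken: move toward "strongly not-selected", floor at 0.
            self.value = max(self.value - 1, 0)
```

Note that the prediction depends only on the counter's high bit, a property the storage-splitting scheme described later in this document exploits.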
Processor core: the core of a processor. A processor may have multiple (two or more) cores, but each core belongs to only one processor. The processor core serves as the processor's compute engine: all computation, instruction acceptance/storage, and data processing are performed in the processor core.
Pipelined processor: a processor having a pipeline of stages, each stage performing a different task on program instructions. In a standard pipelined processor there are typically five stages: instruction fetch, instruction decode, operand fetch, execute, and write-back of results.
Dual-port memory cell: a memory cell that can support a read and a write simultaneously.
Single-port memory cell: a memory cell that can perform only a read or a write at any one time, not both simultaneously.
Hereinafter, aspects of embodiments of the present invention will be described based on the above terms.
Generally, a hardware processor having one or more computational cores may execute instructions (e.g., threads of instructions) to operate on data, such as to perform arithmetic, logical, or other functions. In some examples, the executed instruction operations (e.g., threads) include one or more branch operations (e.g., branch instructions).
In some examples, a branch operation is either unconditional (e.g., the branch is taken each time the instruction is executed) or conditional (e.g., the direction taken for the branch depends on a condition). For example, the instruction to be executed after a conditional branch (e.g., a conditional jump) is not known exactly until the condition on which the branch depends is resolved. In this case, rather than waiting until the condition is determined, a branch predictor of the processor may perform branch prediction to predict whether a branch will be taken and/or predict a target instruction (e.g., target address) for the branch. In some examples, if a branch is predicted to be taken, the processor fetches and speculatively executes instruction(s) for the direction (e.g., path) of the taken branch, such as instructions found at the predicted branch target address. Instructions executed after a branch prediction are speculative in the sense that the processor has not yet determined whether the prediction is correct. In some examples, the processor resolves the branch instruction at the back end of the pipeline circuit (e.g., in an execution, retirement, and/or writeback unit/circuit). In some examples, if the branch instruction is determined by the processor (e.g., by the back end) to not be taken, then all instructions following the branch instruction that are currently in the pipeline circuit are flushed (e.g., discarded). In some examples, a branch predictor learns from past behavior of branches to predict the next (e.g., incoming) branch.
The processor architecture diagram of FIG. 1 illustrates a hardware processor 100 that includes at least one branch predictor 104(1)-104(N) and at least one (e.g., data-load-dependent) branch redirect circuit 102(1)-102(N), where the hardware processor 100 may be a pipelined processor. Although multiple branch predictors are depicted in FIG. 1, a single branch predictor may be utilized for branch prediction for the compute cores 106(1)-106(N). In some examples, the branch predictors are distributed, with each computational core including its own local branch predictor 104(1)-104(N). Each local branch predictor 104(1)-104(N) may share data, such as a history of branch instructions executed by the processor 100.
In some examples, N is any integer greater than two. Hardware processor 100 may be coupled to system memory 114 to form a computing system. The computational core of hardware processor 100 may include, for example, any of instruction fetch circuitry, decoders, execution circuitry, or retirement circuitry (or other units or circuitry discussed herein) as pipeline circuitry for the computational core.
Hardware processor 100 may also include registers 108. Registers 108 may include, for example, one or more general-purpose registers 110 on which to perform (e.g., logical or arithmetic) operations, in addition to, or in lieu of, accessing data in system memory 114. Registers 108 may include one or more architectural register files 112. In some examples, the processor 100 (e.g., a branch predictor thereof) will populate branch history data (e.g., context data) into one or more registers 108 based on the instruction (e.g., a branch instruction). In another embodiment, the branch history may be saved to system memory 114. The branch history may include a global history of the branch instruction (e.g., including a history of paths taken through a series of branches of currently executing program code to reach the branch instruction), as well as an address identifier of the branch instruction (e.g., an instruction pointer value or program counter value associated with the branch instruction).
The system memory 114 may include (e.g., store) one or more of the following: operating System (OS) code 116, or application code 118.
The branch redirect circuit 102 for the core 106 is used to redirect incorrect predictions.
The branch predictor 104 in the processor 100 may employ techniques such as a Gshare predictor, a TAGE predictor, a path-history-based branch predictor, and so on. Because the TAGE predictor performs better in processors, the embodiments of the present invention take only the TAGE predictor as an example to describe the solution of the embodiments of the present invention.
An exemplary structure of a TAGE predictor is shown in FIG. 2. It splits branches into history-independent and history-dependent branches, which are predicted using a base prediction table and tag prediction tables, respectively. Specifically, as shown in FIG. 2, the predictor includes a base predictor and n (typically four) tagged branch predictors. The base predictor uses a base prediction table, shown schematically as T0, to predict history-independent branches; the tag prediction tables corresponding to the four tagged branch predictors, denoted T1, T2, T3, and T4, are used to predict history-dependent branches.
Each entry (row) of the base prediction table T0 contains a 2-bit saturation counter ctr. The base predictor indexes this 2-bit saturation counter directly by the program counter PC (e.g., PC bits folded to match the number of T0 entries). Each tag prediction table has a certain number of entries (table rows), and the number may differ between tables. Each entry, however, includes three parts, namely: a saturation counter ctr indicating whether the branch instruction jumps, a flag bit (tag) for matching the PC, and a useful bit u indicating whether the current entry is valid.
In addition, the branch predictor includes a history register h that records branch history information.
When a branch instruction enters the predictor, the prediction results of the five tables are obtained, and then the result with the highest priority is selected as the final prediction according to the priorities of the five tables.
An exemplary branch prediction process based on the branch predictor shown in FIG. 2 includes:
(1) Each branch instruction corresponds to a program counter value PC. First, part of the PC bits index the T0 table to obtain the value of its two-bit saturation counter.
(2) The history register h is divided into 4 equal parts, and two different hash calculations are performed on part of the PC bits together with, respectively, the first 1/4, 2/4, 3/4, and 4/4 of the history register's bit width, yielding 8 result values that serve as the index values and tag values of the 4 tables T1-T4.
(3) The index values obtained in step (2) select the corresponding entries (table rows) of the 4 tables T1, T2, T3, and T4; the tag bits of those entries are read out and compared with the tag values obtained in step (2). If they are equal, the ctr of the corresponding entry is read out; otherwise, the prediction of that table is ignored.
(4) Steps (2) and (3) yield one or more predicted values, whose priority order is T4 > T3 > T2 > T1 > T0; the final predicted value is selected according to this priority.
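The lookup steps (1)-(4) above can be sketched as follows (an illustrative Python model; the `fold` hash, the index/tag computations, and the table sizes are simplified placeholder assumptions, not the actual hardware hash functions):

```python
def fold(history_bits, length):
    """Compress the most recent `length` history bits into a small value
    (placeholder for the hardware's history-folding hash)."""
    h = history_bits & ((1 << length) - 1)
    folded = 0
    while h:
        folded ^= h & 0xFF
        h >>= 8
    return folded


def tage_predict(pc, history_bits, t0, tagged_tables):
    """Sketch of the TAGE lookup.

    t0: list of 2-bit saturation counters, indexed by low-order PC bits.
    tagged_tables: list of T1..T4 models, each a dict with 'entries'
    (list of {'tag', 'ctr'} or None) and 'hist_len' (history length used).
    Returns (predicted_taken, provider_table_number).
    """
    # Step (1): base prediction from T0, indexed by low-order PC bits.
    base_ctr = t0[pc % len(t0)]
    prediction = (base_ctr >= 2, 0)

    # Steps (2)-(4): probe T1..T4 with geometrically longer histories;
    # a hit in a longer-history table has higher priority, so later
    # matches simply overwrite earlier ones.
    for n, table in enumerate(tagged_tables, start=1):
        folded = fold(history_bits, table['hist_len'])
        index = (pc ^ folded) % len(table['entries'])   # hash 1: index value
        tag = (pc ^ (folded << 1)) & 0xFF               # hash 2: tag value
        entry = table['entries'][index]
        if entry is not None and entry['tag'] == tag:   # step (3): tag match
            prediction = (entry['ctr'] >= 2, n)         # step (4): priority
    return prediction
```

A miss in every tagged table leaves the T0 base prediction as the final result, matching the T4 > T3 > T2 > T1 > T0 priority order.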
In the branch prediction process, the memory (e.g., SRAM) of each predictor must be accessed repeatedly: the stored historical execution results must be read, and the prediction results must be written back to the memory for updating, which causes read-write conflicts in the memory.
However, conventional methods either decouple the front-end pipeline from the branch predictor, or stall the pipeline when a read-write conflict occurs; the former increases the complexity and area of the processor, while the latter costs processor performance.
To this end, an embodiment of the present invention provides a branch predictor that includes a base predictor and a plurality of tagged branch predictors, as shown in FIG. 2. Unlike conventional branch predictors, however, in embodiments of the present invention each tagged branch predictor employs a dual storage unit structure. That is, each tagged branch predictor has two storage units: a first storage unit and a second storage unit.
The first storage unit is a single-port storage unit. It stores, from the tag prediction table corresponding to the current tagged branch predictor, the high-order bits (of a preset number of bits) of the saturation counter used for branch jump prediction, together with the flag bits used for branch hit judgment. The second storage unit is a dual-port storage unit. It stores, from the same tag prediction table, the low-order bits (of a preset number of bits) of the saturation counter used for saturation updating, together with the valid bit indicating the validity of the prediction table entry of the current tagged branch predictor. In this way, the part of each tagged branch predictor that needs frequent updating is split out and implemented in a dual-port memory, which resolves read-write conflicts while ensuring the balance of processor performance and area.
Preferably, the high-order bit of the saturation counter is its most significant bit, and the low-order bits are the remaining bits of the saturation counter (herein, "high bit" and "low bit" may each denote one or more bits). In branch predictors, and in particular in the TAGE predictor described with FIG. 2, the most significant bit of the saturation counter determines the branch jump prediction, and it does not need to be updated when the prediction is correct.
In general, for the saturation counter: when a tagged branch predictor predicts correctly, i.e., the corresponding tag prediction table among T1 to T4 hits, the saturation counter in the tag prediction table of that correctly predicting tagged branch predictor is updated; the saturation counters in the tag prediction tables of the other tagged branch predictors are updated when their predictions are wrong. If all the tagged branch predictors mispredict, the 2-bit saturation counter in the base prediction table T0, indexed directly from the PC (e.g., PC bits folded to the number of T0 entries), is updated. As for the valid bit: when one tagged branch predictor predicts correctly while the other tagged branch predictors mispredict, the valid bit of the correctly predicting tagged branch predictor is incremented by 1. In addition, the valid bits of all the tagged branch predictors have a maximum value, and in practical applications they are periodically updated according to whether they have reached that maximum.
In actual entry updating, however, the tagged branch predictors predict correctly far more often than not, and in the correct case only part of the saturation counter's bits need updating. Based on this, in one possible approach, after a branch prediction completes, the branch predictor performs a data update operation on the second storage unit alone, or on both the first storage unit and the second storage unit, according to the branch prediction result.
When the branch prediction result is correct, the branch predictor writes and updates the low-order bits of the saturation counter stored in the second storage unit through the data write port of the second storage unit of the correctly predicting tagged branch predictor. In this case, since the second storage unit has both a data write port and a data read port, data reads are not affected while data is written. And since the branch predictor's prediction is correct most of the time, storing the low-order bits of the saturation counter in the dual-port second storage unit both keeps data operations on the storage unit effective and avoids read-write conflicts, achieving an effective balance between processor area and performance.
When the branch prediction result is a misprediction, the branch predictor uses the single port of the first storage unit of the mispredicting tagged branch predictor as a data write port to write and update the high-order bits of the saturation counter stored in the first storage unit; and it writes and updates the low-order bits of the saturation counter stored in the second storage unit through the data write port among the dual ports of that predictor's second storage unit. In this case, a misprediction introduces a bubble and may require writing a new entry in the tag prediction table, so both the high-order and low-order bits of the saturation counter need updating. However, because the misprediction probability is very low, storing the high-order bits in the single-port first storage unit and using its single port as a data write port during updates does not noticeably affect processor performance, while effectively reducing processor area.
In addition, when the branch predictor receives a data read instruction from the processor to which it belongs, each tagged branch predictor uses the single port of its first storage unit as a data read port to provide read access to the high-order bits of the saturation counter, and provides read access to the low-order bits of the saturation counter through the data read port among the dual ports of its second storage unit.
According to this embodiment, the storage of the tagged-branch-predictor part of the branch predictor is divided into two storage units, a first storage unit and a second storage unit. The first storage unit stores the high-order bits (of a preset number of bits) of the saturation counter used for branch jump prediction in the tag prediction table of the current tagged branch predictor, together with the flag bits used for branch hit judgment; the second storage unit stores the low-order bits (of a preset number of bits) of the saturation counter used for saturation updating, together with the valid bit indicating the validity of the prediction table entry of the current tagged branch predictor. Since the flag bits and the high-order counter bits need no update when the branch prediction is correct, only the low-order counter bits and the valid bit need periodic updating. By splitting out, according to these different update conditions, the part of each predictor that needs updating and storing it in a second storage unit implemented as a dual-port memory, while implementing the first storage unit as a single-port memory, the balance of processor performance and area is ensured.
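The split update path described above can be sketched as follows (an illustrative Python model; the class and field names such as `ctr_hi`, `ctr_lo`, `tag`, and `u` follow the terms used above, but the interfaces are assumptions, not the actual hardware design):

```python
class SinglePortUnit:
    """Single-port memory model: a read and a write cannot occur in the
    same cycle, so every write potentially blocks a lookup."""

    def __init__(self):
        self.data = {}
        self.writes = 0  # count writes, to show which unit gets traffic

    def write(self, **fields):
        self.data.update(fields)
        self.writes += 1


class DualPortUnit(SinglePortUnit):
    """Dual-port memory model: a dedicated read port lets reads proceed
    while the write port is busy."""


def update_on_result(hi_unit, lo_unit, correct,
                     new_ctr_hi, new_ctr_lo, new_u, new_tag=None):
    """Apply the update policy: correct predictions touch only the
    dual-port unit; mispredictions touch both units."""
    # The low counter bits and the valid (useful) bit change after every
    # prediction, so the dual-port second unit is always written; its
    # separate read port keeps lookups flowing with no read-write conflict.
    lo_unit.write(ctr_lo=new_ctr_lo, u=new_u)
    if not correct:
        # Misprediction (the rare case): the high counter bit, and for a
        # newly written entry also the tag, go through the single port.
        hi_unit.write(ctr_hi=new_ctr_hi, tag=new_tag)
```

Running this model shows the intended traffic pattern: the single-port first unit is written only on the rare misprediction, which is why a single port suffices there.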
The above process is illustrated below with reference to FIGS. 2 and 3, taking the TAGE predictor as an example.
As mentioned previously, the TAGE predictor mainly consists of: a T0-level base predictor, which provides a default prediction result; and Tn-level tagged branch predictors (4 in the example of FIG. 2), each of which contains three logic components: (1) an N-bit tag (flag bits), used for hit judgment on prediction table entries (i.e., branch hit judgment); (2) an M-bit counter (saturation counter), indicating the predicted value of the tag prediction table entry; (3) a J-bit useful counter, indicating whether the tag prediction table entry is in use (valid or not).
In this example, the above three logic components are divided between two storage units (as shown in FIG. 3), where:
(1) the first storage unit (TAGE_HI) includes: the most significant bit of the counter (ctr_hi, 1 bit), used as the jump prediction value, and the N-bit tag (flag bits);
(2) the second storage unit (TAGE_LO) includes: the (M-1)-bit remainder of the counter (ctr_lo, the low-order bits excluding the most significant bit), used for saturation updating of the counter, and the J-bit useful counter (u, the valid bit).
The second storage unit has two ports and can support simultaneous read and write operations.
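The split of the M-bit counter between TAGE_HI and TAGE_LO can be sketched as follows (illustrative Python; M = 3 is an assumed counter width, a design choice not fixed by the text):

```python
M = 3  # illustrative counter width in bits

def split_counter(ctr):
    """Split an M-bit saturation counter into the 1-bit high part stored
    in TAGE_HI and the (M-1)-bit low part stored in TAGE_LO."""
    ctr_hi = (ctr >> (M - 1)) & 1
    ctr_lo = ctr & ((1 << (M - 1)) - 1)
    return ctr_hi, ctr_lo

def join_counter(ctr_hi, ctr_lo):
    """Recombine the two parts when the full counter value is read."""
    return (ctr_hi << (M - 1)) | ctr_lo

def predict_jump(ctr_hi):
    """Only the most significant bit decides the jump prediction, so a
    correct prediction never needs to rewrite TAGE_HI."""
    return ctr_hi == 1
```

For example, with M = 3 the counter value 5 (binary 101) splits into ctr_hi = 1 in TAGE_HI and ctr_lo = 1 in TAGE_LO, and a correct-prediction update that moves the counter from 5 to 6 changes only ctr_lo.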
Based on the above arrangement, access to the memory can be implemented as:
(1) Read: for a normal read operation, the base predictor and both the high and low storage portions of the tagged branch predictor are accessed, i.e., both the first storage unit and the second storage unit are read.
(2) Write: when the branch prediction is correct, only the second storage unit needs updating and the first storage unit does not, so only the second storage unit needs to support simultaneous read-write operation; when the branch prediction is wrong, a bubble exists after the misprediction, and both the first storage unit and the second storage unit are updated.
According to the above example, since the tag and the counter's most significant bit need no update when the branch prediction is correct, and only the counter's low-order bits and the useful counter may need updating every cycle, the storage is split according to these different update conditions: the storage unit that must support simultaneous read and write uses dual ports, and the other storage unit uses a single port, thereby ensuring the balance of processor performance and area.
FIG. 4 is a block diagram of a pipelined processor according to another embodiment of the present disclosure. The pipelined processor 500 of this embodiment includes the branch predictor described in the previous embodiments. It should be understood that the pipelined processor 500 may be a single-core processor or a multi-core processor.
In some examples, each core of the pipelined processor 500 includes a branch prediction stage, an instruction fetch stage, a decode stage, an allocation stage, an execution stage, and a write-back (e.g., retirement) stage. Each of the above stages may include different levels of circuitry. Alternatively, the above pipeline stages may be subdivided into a larger number of stages. In addition, additional pipeline stages may also be included, such as a prefetch stage, an instruction pointer generation (IP Gen) stage, and so on.
In some examples, the pipelined processor 500 receives an Instruction Pointer (IP) that identifies the next instruction to be input into the processor. For example, the IP generation stage may select an instruction pointer (e.g., a memory address) that identifies the next instruction in a program sequence to be fetched and executed by a core (e.g., a logic core). In some examples, the pipelined processor 500 (e.g., the IP generation stage) increments the memory address of the most recently fetched instruction by a predetermined amount X (e.g., 1) each clock cycle.
However, in the case of an exception, or when a branch instruction is taken, the pipelined processor 500 (e.g., the IP generation stage) may select an instruction pointer that is not that of the next sequential instruction in program order. In some examples, the pipelined processor 500 (e.g., a branch prediction stage) predicts whether a conditional branch instruction will be taken, e.g., to reduce the branch penalty.
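The IP generation behavior described above can be sketched as follows (an illustrative Python model; X = 4 bytes is an assumed fixed instruction size, not taken from the text):

```python
def next_ip(ip, predicted_taken=False, branch_target=None, x=4):
    """IP generation sketch: normally advance the instruction pointer by a
    predetermined amount X each cycle; on a predicted-taken branch (or an
    exception redirect), select the branch target instead."""
    if predicted_taken and branch_target is not None:
        return branch_target
    return ip + x
```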
Based on the above processor, an embodiment of the present invention further provides a chip that includes at least the branch predictor, processor core, or processor described above. In practical applications, the chip may further include hardware, a controller, and the like for implementing various functions according to actual requirements; as long as the chip includes the branch predictor, processor core, or processor described above, it falls within the scope of the present invention.
Further, an embodiment of the present invention also provides a control device, which includes at least the branch predictor or the processor core or the processor or the chip as described above. In practical applications, the control device may be implemented as any suitable device, such as a mobile control device, an industrial control device, a desktop control device, and so on.
In addition, an embodiment of the invention further provides a branch prediction method. FIG. 5 is a schematic step diagram of a branch prediction method according to another embodiment of the invention. The branch prediction method of this embodiment comprises the following steps:
S510: judge whether the prediction result of the branch predictor is correct or wrong; if the prediction is correct, go to step S520; if the prediction is wrong, go to step S530.
In a possible manner, before this step the method may further include: acquiring the program counter value PC corresponding to the branch instruction; according to the PC, indexing and reading the saturation counter value in the base prediction table corresponding to the base predictor in the branch predictor; according to the PC, indexing and reading the saturation counter value in the tag prediction table corresponding to each tagged branch predictor in the branch predictor; and determining the prediction result from the saturation counter values read from the base prediction table and the tag prediction tables.
Reading the saturation counter value in the tag prediction table corresponding to each tagged branch predictor may be implemented as follows: for each tagged branch predictor, the single port of the first storage unit serves as a data read port for read access to the high-order bits of the saturation counter; and read access to the low-order bits of the saturation counter is performed through the data read port among the dual ports of the second storage unit.
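The read access described in this step can be sketched as follows (an illustrative Python model; the per-entry layout and the 2-bit ctr_lo width are assumptions for the example):

```python
def read_tagged_entry(hi_mem, lo_mem, index, lo_width=2):
    """Read one entry of a tagged branch predictor, as in the pre-step of
    S510. hi_mem holds (tag, ctr_hi) pairs read through the first unit's
    single port; lo_mem holds (ctr_lo, u) pairs read through the second
    unit's dedicated read port, so both reads fit in the same cycle even
    while the second unit's write port is busy."""
    tag, ctr_hi = hi_mem[index]
    ctr_lo, u = lo_mem[index]
    full_ctr = (ctr_hi << lo_width) | ctr_lo  # recombine the split counter
    return tag, full_ctr, u
```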
S520: if the prediction is correct, for the correctly predicting tagged branch predictor among the plurality of tagged branch predictors, write and update, through the data write port among the dual ports of that predictor's second storage unit, the low-order bits (of a preset number of bits) of the saturation counter used for saturation updating in the tag prediction table stored in the second storage unit.
S530: if the prediction is wrong, for each mispredicting tagged branch predictor among the plurality of tagged branch predictors of the branch predictor, use the single port of that predictor's single-port first storage unit as a write port to write and update the high-order bits (of a preset number of bits) of the saturation counter used for branch jump prediction in the tag prediction table stored in the first storage unit; and write and update, through the data write port among the dual ports of the second storage unit, the low-order bits (of a preset number of bits) of the saturation counter used for saturation updating in the tag prediction table stored in the second storage unit.
When a misprediction occurs, all of the tagged branch predictors are in error, and the operation of step S530 is performed for each tagged branch predictor.
It should be understood that the branch prediction method of this embodiment is described only briefly; for details and the corresponding beneficial effects, reference may be made to the foregoing description of the branch predictor, which is not repeated here.
In addition, for the specific implementation of each step, and for the specific working processes of the devices and modules described above, those skilled in the art may refer to the corresponding steps and descriptions in the foregoing embodiments, which, for convenience and brevity of description, are not repeated here.
It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.
The above-described method according to an embodiment of the present invention may be implemented in hardware or firmware, or as software or computer code that can be stored in a recording medium such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the method described herein can be processed by such software stored on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It is understood that a computer, processor, microprocessor controller, or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods described herein. Furthermore, when a general-purpose computer accesses code for implementing the methods shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing those methods.
Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The above embodiments are only used for illustrating the embodiments of the present invention, and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.