CN116149733A - Instruction branch prediction system, method, apparatus, computer device and storage medium - Google Patents


Info

Publication number
CN116149733A
Authority
CN
China
Prior art keywords
prediction
instruction
address
predicted
stage
Prior art date
Legal status
Pending
Application number
CN202310215287.9A
Other languages
Chinese (zh)
Inventor
刘亮
张馨予
张茜歌
王春萌
李伟立
易江芳
孙玉峰
蔡昊
Current Assignee
Peking University
State Grid Corp of China SGCC
Beijing Smartchip Microelectronics Technology Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Peking University
State Grid Corp of China SGCC
Beijing Smartchip Microelectronics Technology Co Ltd
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date
Filing date
Publication date
Application filed by Peking University, State Grid Corp of China SGCC, Beijing Smartchip Microelectronics Technology Co Ltd, and Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202310215287.9A
Publication of CN116149733A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3836Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F9/3842Speculative instruction execution
    • G06F9/3844Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F9/3806Instruction prefetching for branches, e.g. hedging, branch folding using address prediction, e.g. return stack, branch history buffer
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses an instruction branch prediction system, method, apparatus, computer device and storage medium. The instruction branch prediction system comprises a prediction unit, a prediction target address queue and an instruction fetch unit, wherein: the prediction unit is used for performing multi-level branch prediction on branch instructions; the prediction target address queue is configured to record the instruction address of the corresponding branch instruction together with the first-level prediction result when a first-level prediction result is received, and to overwrite the corresponding first-level prediction result with the second-level prediction result when the second-level prediction result corresponding to the branch instruction is received; the instruction fetch unit is used for acquiring a target predicted address from the prediction target address queue so as to perform the corresponding instruction fetch operation. In this way, the prediction unit is separated from the original coupled structure, and the decoupling effect of the prediction unit is effectively improved.

Description

Instruction branch prediction system, method, apparatus, computer device and storage medium
Technical Field
The present invention relates to the field of integrated circuit design technologies, and in particular, to an instruction branch prediction system, method, apparatus, computer device, and storage medium.
Background
In modern high performance processors, the pipeline may be divided into front-end processing units and back-end processing units. The front-end processing unit is configured to provide instructions to be executed to the back-end processing unit, so that the efficiency of the front-end processing unit directly affects the execution speed of the back-end processing unit.
In the related art, the prediction unit and the instruction fetch unit in the front-end processing unit are decoupled to improve the performance of the front-end processing unit. However, the decoupling effect for prediction units having a multi-level branch prediction structure remains to be improved.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems in the related art to some extent. Therefore, a first object of the present invention is to propose an instruction branch prediction system that redesigns the logic which directs instruction addresses for fetching by using the prediction results generated by the prediction unit, and separates the prediction unit from the original coupled structure, so as to improve the decoupling effect of the prediction unit.
A second object of the present invention is to provide an instruction branch prediction method.
A third object of the present invention is to provide an instruction branch prediction apparatus.
A fourth object of the invention is to propose a computer device.
A fifth object of the present invention is to propose a computer readable storage medium.
In order to achieve the above objective, an embodiment of the first aspect of the present invention provides an instruction branch prediction system. The instruction branch prediction system includes a prediction unit, a prediction target address queue, and an instruction fetch unit, where the input end of the prediction target address queue is connected to the output end of the prediction unit, and the output end of the prediction target address queue is connected to the input end of the instruction fetch unit; wherein: the prediction unit is used for performing multi-level branch prediction on branch instructions, the multi-level branch prediction including first-level branch prediction and second-level branch prediction; branch prediction is performed on a branch instruction through the first-level branch prediction to obtain and output a first-level prediction result, and branch prediction is performed on the branch instruction through the second-level branch prediction to obtain and output a second-level prediction result; the first-level prediction result and the second-level prediction result each include a predicted address corresponding to the branch instruction; the prediction target address queue is configured to record the instruction address of the corresponding branch instruction together with the first-level prediction result when the first-level prediction result is received, and to overwrite the corresponding first-level prediction result with the second-level prediction result when the second-level prediction result corresponding to the branch instruction is received; the instruction fetch unit is used for acquiring, from the prediction target address queue, a target predicted address corresponding to a target instruction address, and performing the corresponding instruction fetch operation according to the acquired target predicted address.
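For concreteness, the following minimal C++ sketch illustrates the entry layout and overwrite rule just described; the type names, field widths and level encoding are illustrative assumptions, not taken from the patent text.

    #include <cstdint>

    // Illustrative sketch of one prediction target address queue (PTQ) entry.
    struct Prediction {
        bool          taken;    // whether the branch is predicted taken
        std::uint64_t target;   // predicted address for the next fetch
        int           level;    // 1 = first-level result, 2 = second-level, ...
    };

    struct PtqEntry {
        std::uint64_t instr_addr;   // instruction address, written once at enqueue
        Prediction    pred;         // most recent prediction level received so far
        bool          valid;
    };

    // Overwrite rule: a later (higher-level, more accurate) result covers the
    // earlier one, while the recorded instruction address stays unchanged.
    void write_prediction(PtqEntry& e, const Prediction& p) {
        if (e.valid && p.level >= e.pred.level)
            e.pred = p;   // e.g. the second-level result covers the first-level one
    }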
According to one embodiment of the present invention, the prediction target address queue is further configured to delete, from the prediction target address queue, the target entry containing the corresponding target instruction address and target predicted address when the instruction fetch unit performs an instruction fetch operation according to the acquired target predicted address and obtains the corresponding target instruction block from a first instruction cache space.
According to one embodiment of the present invention, the prediction target address queue is further configured to filter the predicted address corresponding to each instruction address in the prediction target address queue, so as to obtain the predicted addresses that meet a prefetch condition as prefetch addresses; a predicted address meets the prefetch condition when it and its corresponding instruction address are located in different cache lines; the prefetch address is used for acquiring the prefetch instruction block corresponding to the prefetch address from a second instruction cache space.
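A hedged sketch of this prefetch filter follows, assuming a 64-byte cache line (the line size is not specified in the text):

    #include <cstdint>

    constexpr std::uint64_t kLineBytes = 64;   // assumed I-cache line size

    inline std::uint64_t line_of(std::uint64_t addr) { return addr / kLineBytes; }

    // A predicted address qualifies for prefetch only when it falls in a
    // different cache line than the instruction address in the same entry.
    inline bool meets_prefetch_condition(std::uint64_t instr_addr,
                                         std::uint64_t pred_addr) {
        return line_of(instr_addr) != line_of(pred_addr);
    }

    // Qualifying addresses are line-aligned before being sent to the second
    // instruction cache space for prefetching.
    inline std::uint64_t aligned_prefetch_addr(std::uint64_t pred_addr) {
        return pred_addr & ~(kLineBytes - 1);
    }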
According to one embodiment of the present invention, the prediction target address queue is further configured to, when recording the instruction address of the corresponding branch instruction and the first-level prediction result to the corresponding entry of the prediction target address queue, return the entry index of that entry to the prediction unit, so that the prediction unit writes the corresponding second-level prediction result into the corresponding entry according to the corresponding entry index.
According to one embodiment of the invention, the prediction target address queue has a dequeue pointer; the prediction target address queue is further configured to delete, from the prediction target address queue, the entry currently pointed to by the dequeue pointer and update the dequeue pointer when the instruction fetch unit performs an instruction fetch operation according to the predicted address included in that entry and obtains the corresponding instruction block from the first instruction cache space.
According to one embodiment of the invention, the prediction target address queue has a prefetch pointer; the prediction target address queue is further configured to, when a prefetch request sent by the processor front end is received, return to the processor front end the predicted address included in the entry currently pointed to by the prefetch pointer together with the corresponding prefetch validity signal, and update the prefetch pointer, so that, when the received signal is a prefetch-valid signal, the processor front end uses the predicted address corresponding to that signal as the prefetch address and obtains the corresponding prefetch instruction block from the second instruction cache space; the prefetch-valid signal indicates that its corresponding predicted address meets the prefetch condition.
According to one embodiment of the invention, the prediction target address queue has an enqueue pointer and a read pointer; the prediction target address queue is further configured to record, when the first-level prediction result is received, the instruction address of the corresponding branch instruction and the first-level prediction result to the entry currently pointed to by the enqueue pointer and update the enqueue pointer, and to return, when a read request of the instruction fetch unit is received, the predicted address included in the entry currently pointed to by the read pointer to the instruction fetch unit and update the read pointer.
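Putting the pointers of these embodiments together, the following sketch (reusing the Prediction and PtqEntry types above; the circular-buffer organization and queue size are assumptions) shows a PTQ with enqueue, read, dequeue and prefetch pointers:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // PTQ as a circular buffer with the four pointers described above.
    template <std::size_t N>
    struct Ptq {
        std::array<PtqEntry, N> entries{};
        std::size_t enq = 0;  // enqueue pointer: tail, where level-1 results land
        std::size_t rd  = 0;  // read pointer: next entry the fetch unit reads
        std::size_t deq = 0;  // dequeue pointer: head, freed only on fetch success
        std::size_t pf  = 0;  // prefetch pointer: scans ahead for candidates

        // Record a level-1 result and return the entry index to the prediction
        // unit so that later levels can overwrite the same entry.
        std::size_t enqueue(std::uint64_t instr_addr, const Prediction& p) {
            std::size_t idx = enq;
            entries[idx] = PtqEntry{instr_addr, p, true};
            enq = (enq + 1) % N;
            return idx;
        }

        // Serve a read request from the fetch unit; note this does NOT dequeue.
        const PtqEntry& read() {
            const PtqEntry& e = entries[rd];
            rd = (rd + 1) % N;
            return e;
        }
    };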
In order to achieve the above object, an embodiment of the second aspect of the present invention provides an instruction branch prediction method, which is applied to an instruction branch prediction system. The instruction branch prediction system includes a prediction unit, a prediction target address queue, and an instruction fetch unit; the input end of the prediction target address queue is connected to the output end of the prediction unit, and the output end of the prediction target address queue is connected to the input end of the instruction fetch unit; the prediction unit is used for performing multi-level branch prediction on branch instructions, the multi-level branch prediction including first-level branch prediction and second-level branch prediction. The method comprises the following steps: the prediction unit performs branch prediction on a branch instruction through the first-level branch prediction to obtain and output a first-level prediction result, and performs branch prediction on the branch instruction through the second-level branch prediction to obtain and output a second-level prediction result, where the first-level prediction result and the second-level prediction result each include a predicted address corresponding to the branch instruction; the prediction target address queue records the instruction address of the corresponding branch instruction and the first-level prediction result when the first-level prediction result is received, and overwrites the corresponding first-level prediction result with the second-level prediction result when the second-level prediction result corresponding to the branch instruction is received; and the instruction fetch unit acquires, from the prediction target address queue, a target predicted address corresponding to a target instruction address, and performs the corresponding instruction fetch operation according to the acquired target predicted address.
In order to achieve the above object, an embodiment of the third aspect of the present invention provides an instruction branch prediction apparatus, which is applied to an instruction branch prediction system. The instruction branch prediction system includes a prediction unit, a prediction target address queue, and an instruction fetch unit; the input end of the prediction target address queue is connected to the output end of the prediction unit, and the output end of the prediction target address queue is connected to the input end of the instruction fetch unit; the prediction unit is used for performing multi-level branch prediction on branch instructions, the multi-level branch prediction including first-level branch prediction and second-level branch prediction. The apparatus comprises: a branch prediction module, used for the prediction unit to perform branch prediction on a branch instruction through the first-level branch prediction to obtain and output a first-level prediction result, and to perform branch prediction on the branch instruction through the second-level branch prediction to obtain and output a second-level prediction result, where the first-level prediction result and the second-level prediction result each include a predicted address corresponding to the branch instruction; a recording module, used for the prediction target address queue to record the instruction address of the corresponding branch instruction and the first-level prediction result when the first-level prediction result is received, and to overwrite the corresponding first-level prediction result with the second-level prediction result when the second-level prediction result corresponding to the branch instruction is received; and a predicted address acquisition module, used for the instruction fetch unit to acquire, from the prediction target address queue, a target predicted address corresponding to a target instruction address, and to perform the corresponding instruction fetch operation according to the acquired target predicted address.
To achieve the above object, an embodiment of the fourth aspect of the present invention provides a computer device including a memory and a processor, wherein the memory stores a computer program, and the processor implements the steps of the instruction branch prediction method according to any one of the foregoing embodiments when executing the computer program.
To achieve the above object, an embodiment of the fifth aspect of the present invention proposes a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements the steps of the instruction branch prediction method according to any one of the foregoing embodiments.
According to the embodiments provided by the invention, the direction logic of the instruction stream is redesigned: the program counter address required for fetching is directed directly by the higher-level prediction result, and participation of the pre-decode result is eliminated, so that the whole prediction unit with the multi-level overlay prediction structure is separated from the original structure, and the decoupling effect of the prediction unit is improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1a is a schematic structural diagram of a front-end processing unit and a back-end processing unit provided according to the present specification.
Fig. 1b is a schematic diagram of a pipeline front-end processing unit according to the present disclosure.
Fig. 1c is a schematic structural diagram of a decoupled architecture of a pipeline front-end processing unit according to the present disclosure.
FIG. 1d is a schematic diagram of a cache prefetch scheme according to the present disclosure.
FIG. 1e is a schematic diagram of an application scenario of an instruction branch prediction system according to one embodiment of the present disclosure.
Fig. 2a is a schematic structural diagram of an instruction branch prediction system according to an embodiment of the present disclosure.
Fig. 2b is a schematic flow chart of prediction result writing according to one embodiment of the present disclosure.
Fig. 3 is a flow chart of an instruction branch prediction method according to an embodiment of the present disclosure.
Fig. 4 is a block diagram showing the structure of an instruction branch prediction apparatus according to an embodiment of the present disclosure.
Fig. 5 is a block diagram of a computer device according to one embodiment of the present disclosure.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
In modern high performance processors, the pipeline may be divided into front-end processing units and back-end processing units. Referring to fig. 1a, the front-end processing unit and the back-end processing unit are connected together through an instruction buffer or an instruction queue to form a producer-consumer working model, that is, the front-end processing unit fills the instruction buffer with instructions to be executed, and the back-end processing unit fetches the instructions from the instruction buffer and executes them. It will be appreciated that the speed at which the front-end processing unit fills the instruction buffer with instructions directly affects the execution speed of the back-end processing unit and thus the efficiency of the overall pipeline. Therefore, the efficiency of the front-end processing unit must be improved to provide enough instructions for the rich functional components of the back-end of the pipeline.
In the structure of the conventional pipeline front-end processing unit, referring to fig. 1b, the branch prediction unit predicts the subsequent instruction address by taking the content of the current Program Counter (PC) as input, and updates the PC once the prediction is completed. Meanwhile, the instruction fetch unit accesses the instruction cache according to the PC address and puts the fetched new instructions into the instruction issue queue to be sent to the decode stage. The prediction unit and the instruction fetch unit are thus tightly coupled and must work in close cooperation. When the fetch unit stalls due to a cache miss, the prediction unit also has to stop predicting; when the prediction unit makes a misprediction, the fetch unit needs to issue a new memory address and fetch again. This close coupling severely limits the improvement of pipeline front-end performance. Meanwhile, separate optimization techniques for the fetch unit and the prediction unit can interfere with each other, making the effectiveness of a front-end design difficult to evaluate and analyze, which poses a great challenge to processor front-end designers. Therefore, decoupling the prediction unit from the fetch unit, so as to eliminate the strong correlation between them at work and enlarge the design space of both, is of great significance for improving the performance of the front-end processing unit.
In the related art, a design scheme for a decoupled pipeline front-end processing unit has been proposed. Drawing on the design idea of decoupling the processor front end from the back end, a first-in first-out queue structure, namely the fetch target queue (Fetch Target Queue, FTQ), is introduced between the prediction unit and the instruction fetch unit, and the other structures are adjusted accordingly.
Referring to FIG. 1c, the FTQ queue is located between the prediction unit and the instruction fetch unit. The prediction unit predicts the subsequent instruction address with the content of the current Program Counter (PC) as input. After the prediction is completed, it updates the PC and writes the predicted target address into the tail of the FTQ queue; meanwhile, the instruction fetch unit reads an instruction address from the head of the FTQ queue, accesses the instruction cache according to that address, and fetches the corresponding instructions to be sent to the decode stage. When a later pipeline stage finds that a branch prediction is wrong or the program control flow needs to be redirected, the entries in the FTQ queue are emptied in time; meanwhile, the instruction fetch unit fetches again using the correct PC content, and the prediction unit predicts using the correct PC content.
On the one hand, when a branch misprediction occurs, the front-end processing unit needs to be updated quickly to restore it to a correct state; what mainly needs updating is the global branch history information in the prediction unit and the address sent to the instruction store. The related art proposes a processing structure for recording history information on the speculative path, the speculative history queue (Speculative History Queue, SHQ). The SHQ queue maintains the speculatively updated global branch history information during processor operation. During prediction, the speculatively updated global history is written into the SHQ queue; if updated global history information already exists in the SHQ queue, the prediction unit replaces the content of the history register with the new global history information to keep the global history up to date. When an instruction commits at the back end of the pipeline, its actual branch outcome is used to update the global history register, and the corresponding SHQ entry is removed; when a misprediction occurs, the history information corresponding to the mispredicted branch instruction and all subsequently allocated SHQ entries need to be removed from the SHQ queue.
On the other hand, the FTQ queue stores the predicted target addresses generated by the prediction unit, which simultaneously serve as the fetch addresses of the fetch unit, so the predicted target addresses in the FTQ queue can be used to access the second-level cache in advance by means of prefetching, thereby reducing the miss rate of the instruction cache. Referring to fig. 1d, the related art implements an FTQ-queue-based cache prefetch scheme by adding a prefetch instruction queue (Prefetch Instruction Queue, PIQ). The PIQ queue holds all predicted target addresses in the FTQ queue that are waiting to be prefetched. To further ensure the effectiveness of prefetching, a series of prefetch policies, such as cache probe filtering, filtering based on the number of FTQ entries, cache miss filtering, and instruction block eviction filtering, can be combined.
However, contemporary high-performance processors have undergone significant microarchitectural changes relative to the processor structures described above. The prediction unit no longer uses a single prediction technique, but adopts a multi-level overlay branch prediction structure: several branch predictors jointly complete the prediction process, and each level's prediction result is delivered in a different cycle. Branch prediction is performed in conjunction with instruction fetching, the instruction stream is redirected immediately upon finding a prediction error, the fetched instructions are placed into the instruction buffer after all prediction stages are completed, and the relevant branch prediction information is saved.
In view of the above, in the decoupled design of the prediction unit and the instruction fetch unit, it is necessary to separate the branch predictor having the multi-level overlay prediction structure from the preceding structures in the processor, so as to form a prediction unit capable of working independently. The difficulty is that, in the coupled design, higher-level branch prediction combines the type information obtained from instruction expansion and pre-decode processing with the branch target address information of the instructions obtained from the fetch unit to determine whether the earlier prediction needs to be corrected. A decoupled prediction unit, however, sits before the instruction fetch unit, cannot obtain the contents of the instruction block, and therefore cannot perform instruction expansion or pre-decoding, so the redirection logic at this stage needs to be redesigned.
Second, the related art uses the FTQ queue as a global data structure to hold branch prediction information. In a multi-level overlay branch prediction structure, the branch prediction information changes continuously: updating the predictor state requires the earlier prediction results of an instruction, the queue must record the instruction address sequence used for fetching, and the branch prediction result of each instruction block must also be stored and sent with the subsequent fetch to be passed to the corresponding pipeline stage for processing. For convenience of use, much of this information ends up recorded repeatedly and passed back and forth, wasting hardware resources.
Therefore, for contemporary high performance processors, the decoupling structure of the front-end processing unit needs to be redesigned. In particular, there is a need for improvements in the corresponding architecture for branch predictors having a multi-level overlay prediction architecture, as well as for designing new and efficient global data structures to record constantly updated branch prediction information and to keep as little information as possible to fully utilize hardware resources.
In order to separate an existing branch predictor having a multi-level overlay prediction structure from the preceding structures in a processor, it is necessary to provide an instruction branch prediction system, method, apparatus, computer device, and storage medium. A structure similar to the FTQ queue, the prediction target address queue (PTQ), is introduced into the front-end processing unit of the processor and is responsible for recording the instruction execution sequence produced by branch prediction. The prediction unit writes each level's prediction result into the tail of the PTQ queue as it is generated, and the instruction fetch unit reads the instruction address required for fetching from the head of the PTQ queue. Only when the instruction fetch unit successfully reads the corresponding instruction block from the instruction cache can the corresponding instruction address in the PTQ queue complete its dequeue operation, which ensures normal pipelining of the instruction cache and solves the redirection problem of the instruction fetch unit. A fast recovery mechanism for the multi-level overlay prediction structure is designed in the prediction unit: a higher-level prediction result directly covers the lower-level prediction result to redirect the PC address, without combining the type information and branch target address information obtained from the fetch unit after instruction expansion and pre-decode processing to decide whether the lower-level prediction result needs correction. Further, a prefetch pointer is introduced into the PTQ queue to determine whether the corresponding instruction block needs to be prefetched. If prefetching is needed, the prefetch pointer is used for reading the instruction address to be sent to the second-level instruction cache, thereby optimizing the prefetch mechanism of the instruction cache. In this way, the prediction unit can generate, in advance and unaffected by the instruction fetch unit, the instruction stream sequence needed by the subsequent pipeline stages, which effectively improves the overall working efficiency of the front-end processing unit.
Fig. 1e is a schematic application scenario diagram of the instruction branch prediction system, method, apparatus, computer device and storage medium provided in the present specification. Taking a prediction unit with a three-level overlay prediction structure as an example, the output end of the prediction unit is connected to the input end of the prediction target address queue PTQ, and the output end of the PTQ queue is connected to the input end of the instruction fetch unit. The prediction unit writes the branch prediction result generated by each level of branch prediction into the PTQ queue as soon as it is generated, and the instruction fetch unit reads an instruction address from the head of the PTQ queue so as to perform the corresponding instruction fetch operation according to that address.
In this scenario example, the prediction unit does not wait until the third-level branch prediction result is generated to write into the PTQ queue; instead, it writes each level's branch prediction result into the PTQ queue as it is generated, updating the prediction results in the PTQ queue in real time by overwriting. Specifically, after performing first-level branch prediction on a branch instruction to generate a first-level prediction result, the prediction unit writes the first-level prediction result and the instruction address corresponding to the branch instruction into the tail of the PTQ queue; after performing second-level branch prediction on the branch instruction to generate a second-level prediction result, it overwrites the first-level prediction result in the corresponding entry of the PTQ queue; and after performing third-level branch prediction on the branch instruction to generate a third-level prediction result, it overwrites the second-level prediction result in the corresponding entry of the PTQ queue. Thus, for the multi-level overlay prediction structure, the prediction unit's writes to the PTQ queue also form an overlay structure: a later, more accurate prediction result covers the corresponding earlier prediction result, while the instruction address written each time remains unchanged. By redesigning the logic that directs instruction addresses for fetching using the prediction results generated by the prediction unit, and separating the prediction unit from the original coupled structure, the decoupling effect of the prediction unit can be improved. Further, this overwriting mode of writing enables fast redirection of the pipeline while fewer wrong-path instructions are executed, allowing the pipeline to resume execution quickly.
Each entry of the PTQ queue holds not only an instruction address but also the branch prediction information of the corresponding instruction block, and this information is retained throughout. Therefore, in the pre-decode stage, the predicted branch target address can be compared directly against the target address obtained by decoding, to determine whether an unconditional direct branch instruction successfully changed the instruction execution path of the processor.
In some embodiments, when the prediction unit writes the first-level prediction result corresponding to a branch instruction and the corresponding instruction address into the target entry at the tail of the PTQ queue, the PTQ queue returns the entry index of that target entry to the prediction unit, so that when the prediction unit performs second-level branch prediction on the branch instruction and generates the corresponding second-level prediction result, it can write the second-level prediction result into the corresponding entry according to the entry index, covering the corresponding first-level prediction result; similarly, when the prediction unit performs third-level branch prediction on the branch instruction and generates the corresponding third-level prediction result, it can write the third-level prediction result into the corresponding entry according to the entry index, covering the corresponding second-level prediction result.
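Continuing the earlier Ptq sketch, the returned entry index might be used as follows; the addresses, queue size and function name are purely illustrative:

    // Hypothetical driver showing how the returned entry index is reused.
    void example_overwrite_flow() {
        Ptq<32> ptq;
        // Level 1: enqueue the instruction address with the first-level
        // result, and keep the index the queue hands back.
        std::size_t idx = ptq.enqueue(0x40000000, Prediction{true, 0x40001000, 1});
        // Level 2 disagrees on the target: its result covers the level-1 result.
        write_prediction(ptq.entries[idx], Prediction{true, 0x40002000, 2});
        // Level 3 confirms level 2: the entry is overwritten again; the stored
        // instruction address 0x40000000 never changes.
        write_prediction(ptq.entries[idx], Prediction{true, 0x40002000, 3});
    }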
In this scenario example, the instruction address at the head of the PTQ queue may complete its dequeue operation only when the instruction fetch unit, according to the instruction address read from the head of the PTQ queue, sends that address to the instruction cache space along with the fetch request and successfully reads the corresponding instruction block from the instruction cache space. The dequeue operation of the PTQ queue is thereby distinguished from the fetch request sent to the instruction cache space. The instruction fetch unit reads the instruction address from the PTQ queue according to the read pointer and sends it to the instruction cache space.
In some embodiments, while branch prediction accuracy remains at a high level, the prefetch mechanism of the instruction cache can be optimized by means of the instruction addresses in the PTQ queue. Consider the case where the instruction fetch unit fails to obtain the corresponding instruction block from the first-level instruction cache according to an instruction address in the PTQ queue, so that prefetching from the second-level instruction cache is needed. Specifically, an additional prefetch pointer is introduced into the PTQ to determine whether the instruction block corresponding to the instruction address pointed to by the prefetch pointer needs prefetching. If prefetching is required, the prefetch pointer is used to read the instruction address that needs to be sent to the second-level instruction cache. When the PTQ queue is initialized, the prefetch pointer coincides with the queue head pointer, and no prefetch request is sent to the second-level instruction cache at that moment; as the PTQ queue dequeues, the prefetch pointer keeps moving toward the tail whenever the second-level instruction cache is idle, and during this movement it checks whether the instruction address in each PTQ entry and the predicted address in the corresponding prediction result lie in the same cache line of the first-level instruction cache. If they are found not to be in the same cache line, the predicted address can be aligned and then sent to the second-level instruction cache for prefetching.
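A sketch of this prefetch-pointer walk, reusing the Ptq type and prefetch-filter helpers above; the idle-signal plumbing and the use of std::optional to model "no request this cycle" are assumptions:

    #include <optional>

    // When the second-level instruction cache is idle, advance the prefetch
    // pointer one entry toward the tail; return a line-aligned prefetch
    // address if that entry's predicted address crosses into another line.
    template <std::size_t N>
    std::optional<std::uint64_t> step_prefetch(Ptq<N>& q, bool l2_idle) {
        if (!l2_idle || q.pf == q.enq) return std::nullopt;  // nothing to scan
        const PtqEntry& e = q.entries[q.pf];
        q.pf = (q.pf + 1) % N;
        if (e.valid && meets_prefetch_condition(e.instr_addr, e.pred.target))
            return aligned_prefetch_addr(e.pred.target);     // send to L2 I-cache
        return std::nullopt;
    }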
In this scenario example, evaluation of how empty or full the PTQ queue is shows that the queue is empty for only a relatively small fraction of the time, which indicates that the PTQ queue holds an instruction execution sequence generated by branch prediction most of the time. This also confirms that the PTQ queue plays its role in the PTQ-based front-end decoupled structure: the prediction unit, unaffected by the instruction fetch component, generates in advance the instruction stream sequence needed by the subsequent pipeline stages. Further, the overall working efficiency of the PTQ-based front-end decoupled structure is measured as the number of instructions issued into the fetch buffer per cycle. The evaluation result shows that the instruction fetch efficiency of running programs improves to a certain extent after decoupling. Because the PTQ queue exists, the branch prediction results are always recorded in it, and as the number of entries in the PTQ queue grows, the third-level prediction results are continuously updated into the queue, so most of the instruction address sequence in the PTQ queue is identical to the instruction sequence of the actual program run, and branch mispredictions at the back end of the pipeline are reduced accordingly.
The embodiment of the present disclosure provides an instruction branch prediction system. Referring to fig. 2a, the instruction branch prediction system 200 includes a prediction unit 210, a prediction target address queue 220, and an instruction fetch unit 230; the input end of the prediction target address queue 220 is connected to the output end of the prediction unit 210, and the output end of the prediction target address queue 220 is connected to the input end of the instruction fetch unit 230.
The prediction unit 210 is configured to perform multi-level branch prediction on branch instructions; the multi-level branch prediction includes first-level branch prediction and second-level branch prediction. Branch prediction is performed on a branch instruction through the first-level branch prediction to obtain and output a first-level prediction result, and branch prediction is performed on the branch instruction through the second-level branch prediction to obtain and output a second-level prediction result; the first-level prediction result and the second-level prediction result each include a predicted address corresponding to the branch instruction.
The prediction target address queue 220 is configured to record the instruction address of the corresponding branch instruction and the first-level prediction result when the first-level prediction result is received, and to overwrite the corresponding first-level prediction result with the second-level prediction result when the second-level prediction result corresponding to the branch instruction is received.
The instruction fetch unit 230 is configured to obtain the target predicted address corresponding to the target instruction address from the prediction target address queue, and perform the corresponding instruction fetch operation according to the obtained target predicted address.
Here, a branch instruction may be, among other things, a conditional branch instruction, an unconditional direct jump instruction, and the like. The prediction unit is a branch predictor with a multi-level overlay prediction structure. The predicted address is the next target address following the corresponding instruction address, and may be the branch target address when the corresponding branch instruction is predicted taken. The target instruction address is the instruction address included in the entry of the prediction target address queue currently read by the fetch unit. The target predicted address is the predicted address included in the current prediction result of that entry; it may be the predicted address included in the first-level prediction result or the predicted address included in the second-level prediction result.
It is understood that the prediction unit may include a plurality of branch predictors: the first-level prediction result may be produced by the first-level predictor, and the second-level prediction result by the second-level predictor. Both the first-level and the second-level prediction results include an indication of whether the branch instruction is predicted taken and the branch target address when it is taken. The accuracy of the second-level prediction result is higher than that of the first-level prediction result.
In some cases, after the prediction unit is decoupled from the instruction fetch unit, the higher-level branch prediction cannot see the instructions fetched by the fetch unit, because the prediction unit sits before the fetch unit; thus the lower-level branch prediction result cannot be corrected with the help of instruction expansion, pre-decode information, and the like, in order to direct the instruction stream and determine the final branch target address. On balance, the accuracy of the higher-level branch prediction can be kept at a high level even without the guidance of instruction expansion, pre-decode information, and the like, so the lower-level prediction result can be covered directly by the higher-level prediction result to solve the instruction-stream redirection problem; the existing branch predictor with a multi-level overlay prediction structure in the processor can thereby be separated from the preceding structures to form a prediction unit capable of working independently. Further, since the multi-level overlay prediction structure produces results at multiple levels, waiting for the prediction unit to give its final result before updating the prediction target address queue would lose the earlier prediction results. Therefore, the prediction target address queue can be updated after each level's prediction result is generated.
Specifically, referring to fig. 2b, after the prediction unit performs first-level branch prediction on the current branch instruction and generates a first-level prediction result, it outputs the first-level prediction result to perform a write operation on the prediction target address queue, writing the instruction address corresponding to the current branch instruction and the first-level prediction result into the queue; the prediction target address queue records the received instruction address and first-level prediction result in the current tail entry. After the prediction unit performs second-level branch prediction on the current branch instruction and generates a second-level prediction result, it outputs the second-level prediction result to perform a write operation on the prediction target address queue, writing the instruction address corresponding to the current branch instruction and the second-level prediction result into the queue; the prediction target address queue records the received instruction address and second-level prediction result in the entry where the first-level prediction result of the current branch instruction is located, thereby covering the corresponding first-level prediction result with the second-level prediction result of the current branch instruction.
It can be understood that the instruction address written into the prediction target address queue is the same after the first-level and the second-level branch prediction of a branch instruction in the same instruction block, so when the prediction target address queue records the corresponding instruction address and the second-level prediction result, the corresponding first-level prediction result can be covered by the second-level prediction result.
The multi-level branch prediction includes at least first-level branch prediction and second-level branch prediction. Accordingly, the prediction results generated by the prediction unit include at least a first-level prediction result and a second-level prediction result. With continued reference to FIG. 2b, if the prediction unit includes n levels of branch prediction, n levels of prediction results may be generated for the current branch instruction, all of which are recorded in the prediction target address queue in the overwriting manner described above.
Further, the instruction fetch unit obtains the target predicted address corresponding to the target instruction address in the current head entry of the prediction target address queue. If the current prediction result recorded for the target instruction address in the head entry is a first-level prediction result, the fetch unit acquires the corresponding target predicted address from the first-level prediction result; if it is a second-level prediction result, the fetch unit acquires the corresponding target predicted address from the second-level prediction result. According to the obtained target predicted address, the fetch unit can perform the corresponding instruction fetch operation in the instruction cache space.
In this specification, the instruction address and the predicted address may be program counter addresses, i.e., PC addresses. The predicted address may also be the next target address after the corresponding instruction address, formed by sequential addressing, when the branch instruction is predicted not taken.
In the above embodiment, the instruction-stream direction logic is redesigned: the program counter address required for fetching is directed directly by the higher-level prediction result, and participation of the pre-decode result is eliminated, so that the whole prediction unit with the multi-level overlay prediction structure is separated from the original structure and the decoupling effect of the prediction unit is improved. Meanwhile, a new and efficient global data structure is used to record the continuously updated branch prediction information, which reduces repeatedly recorded information and the amount of information that must be stored, thereby improving the utilization of hardware resources.
In some embodiments, the prediction target address queue is further configured to delete, from the prediction target address queue, the target entry containing the corresponding target instruction address and target predicted address when the instruction fetch unit performs an instruction fetch operation according to the obtained target predicted address and obtains the corresponding target instruction block from the first instruction cache space.
The first instruction cache space may include a level-one instruction cache; further, the level-one instruction cache may be a high-speed cache.
It will be appreciated that, since a branch instruction is generally used for transferring control between instruction blocks, the instruction fetch unit performs the corresponding instruction fetch operation according to the target predicted address, and may read sequentially from the location pointed to by the target predicted address in the first instruction cache space, so as to obtain the target instruction block corresponding to the target predicted address.
In some cases, because the result of multi-level branch prediction changes, after the instruction fetch unit performs the corresponding fetch operation according to the target predicted address, the target predicted address corresponding to the target instruction address in the prediction target address queue may change into a new predicted address, and the fetch unit needs to fetch again. In order to ensure normal pipelining of the instruction cache and solve the redirection problem of the instruction fetch unit, the dequeue operation of the prediction target address queue can be distinguished from the fetch request sent to the first instruction cache space.
Specifically, the target instruction address and the target predicted address are recorded in the same target entry of the prediction target address queue. The instruction fetch unit acquires the target predicted address corresponding to the target instruction address from that target entry, and then sends a fetch request to the first instruction cache space according to the acquired target predicted address so as to perform the corresponding fetch operation. Only when the instruction fetch unit successfully obtains the target instruction block corresponding to the target predicted address from the first instruction cache space can the contents of the target entry in the prediction target address queue be deleted to complete the dequeue operation for that entry.
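Under the same assumptions as the earlier sketches, the separation between issuing the fetch request (via the read pointer) and retiring the entry (via the dequeue pointer) might look like this:

    // Called when the first instruction cache space answers a fetch request.
    // On a miss the entry is kept, so the same address can be fetched again
    // after a redirect or a cache fill; only a successful fetch dequeues.
    template <std::size_t N>
    void on_fetch_response(Ptq<N>& q, bool got_target_instruction_block) {
        if (!got_target_instruction_block) return;  // keep entry for re-fetch
        q.entries[q.deq].valid = false;             // delete the target entry
        q.deq = (q.deq + 1) % N;                    // dequeue completes
    }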
In some embodiments, the instruction blocks stored in the first instruction cache space may include instruction blocks prefetched, by a prefetch operation, from an instruction cache space of a lower level than the first instruction cache space.
In other embodiments, the first instruction cache space may further include a prefetch instruction cache space, in which the instruction blocks prefetched by the prefetch operation from the lower-level instruction cache space are stored.
In some embodiments, the prediction target address queue is further configured to filter the predicted address corresponding to each instruction address in the prediction target address queue, so as to obtain the predicted addresses that meet the prefetch condition as prefetch addresses.
Here, a predicted address meets the prefetch condition when it and its corresponding instruction address are located in different cache lines; the prefetch address is used for acquiring the prefetch instruction block corresponding to the prefetch address from the second instruction cache space.
The second instruction cache space can be a storage space such as a second-level instruction cache, a third-level instruction cache, or main memory. A cache line is the unit in which an instruction cache space stores instructions and instruction blocks, and is a contiguous address space.
In some cases, the prediction target address queue stores the instruction execution sequence predicted by the prediction unit, and when prediction accuracy is maintained at a high level, the instruction execution sequence in the prediction target address queue is very close to the effective instruction execution sequence of the actual program run, so the prefetch mechanism of the instruction cache can be optimized by means of the instruction addresses in the prediction target address queue.
Specifically, for each entry in the prediction target address queue, it can be checked whether the address of the instruction block containing the entry's instruction address and the predicted address in the corresponding prediction result lie in the same cache line of the first instruction cache space. If, for any entry, they are found not to be in the same cache line, the predicted address meets the prefetch condition; it is determined to be a prefetch address and, after alignment, is sent to the second instruction cache space for prefetching. If they are found to be in the same cache line of the first instruction cache space, the predicted address does not meet the prefetch condition and need not be sent to the second instruction cache space for prefetching.
In the above embodiment, a new cache prefetch policy is designed based on the prediction target address queue: by filtering each entry of the prediction target address queue, the predicted addresses satisfying the prefetch condition are determined to be prefetch addresses and sent to the second instruction cache space for prefetching. Since the prefetch addresses sent to the second instruction cache space all come from the prediction target address queue, an efficient prefetch mechanism can be realized without an additional prefetch instruction queue, improving the hit rate and speed of the fetch unit's instruction fetches.
In some embodiments, the prediction target address queue is further configured to, when recording the instruction address of the corresponding branch instruction and the first-level prediction result to the corresponding entry of the prediction target address queue, return the entry index of the entry to the prediction unit, so that the prediction unit writes the corresponding second-level prediction result to the corresponding entry according to the corresponding entry index.
Here the corresponding entry is the queue-tail entry of the predicted target address queue.
Specifically, the predicted target address queue records the instruction address of the received branch instruction and the first-stage prediction result into its current queue-tail entry, and at the same time returns the entry index of that entry to the prediction unit. The entry index travels with the pipeline, so that when the prediction unit performs second-stage branch prediction on the same branch instruction and generates a second-stage prediction result, it can use the index to write that result into the same entry, so that the second-stage prediction result for the branch instruction overwrites its first-stage prediction result.
It will be appreciated that which entry is the queue-tail entry of the predicted target address queue changes as writes proceed.
Further, when the second-stage prediction result obtained for a branch instruction differs from its first-stage prediction result, the predicted address corresponding to that instruction address changes once the first-stage result has been overwritten. Since the predicted target address queue records the instruction execution sequence generated by the prediction unit, the contents of the entries following the entry holding that instruction address must be redirected.
Specifically, after the second-stage prediction result of the current branch instruction overwrites its first-stage prediction result, if the two results differ, the contents of the entries following the entry holding the new second-stage prediction result can be overwritten or rewritten, so that the instruction addresses and predicted addresses in those subsequent entries are redirected.
Illustratively, in cycle t, the prediction unit performs first-stage branch prediction on the branch instruction I1 to obtain its first-stage prediction result R11, and writes the instruction address addr1 and R11 into the predicted target address queue. The queue records the received addr1 and R11 in the current queue-tail entry T1 and returns the entry index S1 of that entry to the prediction unit. In cycle t+1, the prediction unit continues with first-stage branch prediction of the next branch instruction I2, obtains its first-stage prediction result R12, and writes the instruction address addr2 and R12 into the predicted target address queue. The queue records the received addr2 and R12 in the current queue-tail entry T2 and returns the entry index S2 to the prediction unit.
In cycle T, the prediction unit performs second-stage branch prediction on the branch instruction I1 to obtain its second-stage prediction result R21 and, according to the entry index S1, writes the instruction address addr1 and R21 into entry T1 of the predicted target address queue; the queue records them into T1 so that R21 overwrites the first-stage result R11. Provided no branch misprediction occurs and the pipeline does not need to be redirected, the prediction unit may continue in cycle T+1 with second-stage branch prediction of the next branch instruction I2, obtain its second-stage prediction result R22, and write the instruction address addr2 and R22 into entry T2 according to the entry index S2, so that R22 overwrites the first-stage result R12.
If a branch misprediction occurs for I1, if the back-end processing unit of the processor needs to redirect the predicted address corresponding to addr1, or if that predicted address changes after R21 overwrites R11 in cycle T, the subsequent instruction stream in the predicted target address queue may be redirected, with entry T2 as the starting position of the redirection.
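The worked example above can be condensed into a few lines of Python. This is a minimal sketch assuming a simple list-backed queue; PTAQ, write_first_stage, and write_second_stage are illustrative names, and dropping the trailing entries stands in for the redirect of the subsequent instruction stream.

    class PTAQ:
        """Toy predicted target address queue: each entry maps an
        instruction address to its current predicted address."""

        def __init__(self):
            self.entries = []

        def write_first_stage(self, instr_addr, pred_addr):
            """Record a first-stage result at the queue tail and return the
            entry index so the second stage can overwrite the same entry."""
            self.entries.append({"addr": instr_addr, "pred": pred_addr})
            return len(self.entries) - 1

        def write_second_stage(self, idx, pred_addr):
            """Overwrite the first-stage result in entry idx. If the
            prediction changed, the entries after idx describe a stale
            instruction stream, so the redirect starts at the next entry."""
            changed = self.entries[idx]["pred"] != pred_addr
            self.entries[idx]["pred"] = pred_addr
            if changed:
                del self.entries[idx + 1:]
            return changed

    q = PTAQ()
    s1 = q.write_first_stage(0x100, 0x200)  # cycle t:   I1 -> R11, index S1
    s2 = q.write_first_stage(0x104, 0x300)  # cycle t+1: I2 -> R12, index S2
    q.write_second_stage(s1, 0x280)         # cycle T: R21 != R11, T2 dropped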
It will be appreciated that in some cases, when a branch misprediction occurs or the back-end processing unit needs to redirect the control flow, the pipeline must be flushed and the front-end processing unit must recover quickly. Compared with writing a final prediction result into the predicted target address queue only after the prediction unit has produced every stage of prediction for the current branch instruction, letting the prediction unit write the queue as each stage of prediction is generated allows the pipeline to be redirected while fewer wrong-path instructions have been executed, reducing the instructions that must be flushed and letting the pipeline resume execution quickly.
It should be noted that fast recovery of the front-end processing unit also includes fast recovery of the branch predictor state. Specifically, when a branch misprediction occurs, the history information the prediction unit uses must also be updated and restored in time, covering fast recovery of the global history, fast recovery of the predictor metadata, and fast recovery of the contents of the return address stack (Return Address Stack, RAS). In the related-art processor design, the global history travels with the pipeline and is ultimately stored as a copy in the corresponding entry of the processor's fetch target queue (Fetch Target Queue, FTQ), so this specification implements no additional fast-recovery mechanism for the global history. For the predictor metadata, the related-art design performs no speculative update of each branch predictor's metadata; the updates complete at the instruction commit stage, so this specification implements no additional fast-recovery mechanism for predictor metadata either. For fast recovery of the RAS contents, a classical top-of-stack recovery mechanism may be employed: the RAS top-of-stack address corresponding to each instruction block is saved and written into the processor's FTQ as part of that block's branch prediction information, so that when a pipeline redirect occurs, the top entry of the RAS can be quickly restored to a correct state using the RAS top-of-stack address in the corresponding FTQ entry.
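As an illustration of the classical top-of-stack recovery mechanism, the Python sketch below snapshots the RAS top-of-stack address into a hypothetical FTQ entry when an instruction block is predicted and rewrites the top entry from that snapshot on a redirect; the data structures and names are assumptions for illustration, not this specification's hardware.

    class RAS:
        """Toy return address stack with top-of-stack snapshot recovery."""

        def __init__(self):
            self.stack = []

        def top(self):
            return self.stack[-1] if self.stack else None

        def restore_top(self, saved_top):
            """On a pipeline redirect, rewrite the top entry with the
            address saved in the corresponding FTQ entry."""
            if self.stack:
                self.stack[-1] = saved_top
            else:
                self.stack.append(saved_top)

    ras = RAS()
    ras.stack.append(0x400)                # a call pushes its return address
    ftq_entry = {"ras_top": ras.top()}     # snapshot saved with the block's
                                           # branch prediction information
    ras.stack[-1] = 0x999                  # wrong-path updates corrupt the top
    ras.restore_top(ftq_entry["ras_top"])  # redirect: top restored to 0x400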
In the above embodiment, based on the prediction-unit structure decoupled through the predicted target address queue, the write operations to the queue form an overwrite structure, which solves the problem of redirecting the instruction stream in a front end decoupled by the predicted target address queue and enables fast recovery of the pipeline.
In some implementations, the predicted target address queue has a dequeue pointer. The predicted target address queue is further configured to delete the entry currently pointed to by the dequeue pointer from the predicted target address queue and update the dequeue pointer when the instruction fetching unit performs an instruction fetching operation according to the predicted address included in the entry currently pointed to by the dequeue pointer and obtains the corresponding instruction block from the first instruction cache space.
The dequeue pointer may be the queue-head pointer of the predicted target address queue, and the entry it currently points to may be the queue-head entry.
It will be appreciated that each entry in the predicted target address queue not only stores the corresponding target instruction address but also records branch prediction information for the corresponding instruction block, including whether a branch is taken, the branch target address, and so on. Thus, the instruction fetch unit can obtain the predicted address from the entry to which the dequeue pointer currently points.
Specifically, a dequeue pointer deq_ptr is maintained inside the predicted target address queue and points to its current queue-head entry. The instruction fetch unit obtains the target predicted address from that entry. When the fetch unit performs a fetch operation according to the predicted address in the current queue-head entry and obtains the corresponding instruction block from the first instruction cache space, the fetch has succeeded, and the queue can remove the contents of the current queue-head entry through the internally maintained dequeue pointer, completing the dequeue operation. Afterwards, the dequeue pointer automatically increments by 1 to point to the new queue-head entry.
In some embodiments, the predicted target address queue may further provide a dequeue port (deq port) for completing dequeue operations on its entries: when the instruction fetch unit performs a fetch according to the predicted address in the current queue-head entry and obtains the corresponding instruction block, the dequeue port removes that entry from the queue via the dequeue pointer.
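A minimal sketch of this dequeue flow, assuming a fixed-depth array-backed queue; N, queue, and fetch_succeeded are illustrative names rather than the specification's signals.

    N = 16  # assumed queue depth

    def dequeue(queue, deq_ptr, fetch_succeeded):
        """Retire the queue-head entry once the fetch unit has obtained the
        corresponding instruction block; return the updated dequeue pointer."""
        if fetch_succeeded:
            queue[deq_ptr] = None        # remove the head entry's contents
            deq_ptr = (deq_ptr + 1) % N  # point at the new queue-head entry
        return deq_ptr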
In some implementations, the predicted target address queue has a prefetch pointer. The queue is further configured, upon receiving a prefetch request sent by the processor front end, to return to the front end the predicted address in the entry currently pointed to by the prefetch pointer together with a prefetch validity signal for that address, and to update the prefetch pointer. When the received signal is a prefetch valid signal, the processor front end takes the corresponding predicted address as the prefetch address and acquires the corresponding prefetch instruction block from the second instruction cache space; the prefetch valid signal indicates that its predicted address meets the prefetch condition.
Here the processor front end is the front-end processing unit of the processor containing the predicted target address queue. The prefetch request is generated by the front end when the instruction fetch unit fails to obtain the corresponding target instruction block from the first instruction cache space according to the current target predicted address taken from the predicted target address queue.
Specifically, a prefetch pointer prefetch_ptr is maintained within the predicted target address queue. When the queue receives a prefetch request from the processor front end, meaning the fetch unit did not obtain the corresponding target instruction block from the first instruction cache space according to the current target predicted address and the block must be prefetched from the second instruction cache space, the prefetch pointer automatically increments by 1 to point to the entry after the one holding the current target predicted address, that is, the next entry that may need prefetching. The queue then checks whether the instruction block address of the instruction address in the entry pointed to by the prefetch pointer and the predicted address in the corresponding prediction result lie in the same cache line of the first instruction cache space, and returns a reply to the processor front end containing the prefetch validity signal for the predicted address in that entry. If they are not in the same cache line, the predicted address is a prefetch address meeting the prefetch condition, and the queue may return a reply containing the predicted address and a prefetch valid signal, indicating that the front end should prefetch according to that address so as to obtain the corresponding prefetch instruction block from the second instruction cache space.
If they are in the same cache line, the predicted address does not meet the prefetch condition and no prefetch is needed. The queue may return a reply containing the predicted address and a prefetch invalid signal, indicating that the processor front end does not need to prefetch according to that address.
In some embodiments, the predicted target address queue may also expose a prefetch port through which the prefetch address pointed to by the prefetch pointer, the one that needs to be sent to the instruction cache, is read out of the queue. The prefetch request may be generated by the instruction fetch unit or by the first instruction cache space.
In the above embodiment, the prefetch strategy of filtering the entries of the predicted target address queue with the prefetch pointer and the prefetch validity signal selects only qualifying instruction block addresses for prefetching, which improves the hit rate and hit speed of the instruction fetch unit and saves hardware resources to a certain extent.
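The request-reply protocol can be sketched as follows, reusing the assumed 64-byte cache line from the earlier filter sketch; handle_prefetch_request and the entry layout are illustrative, and valid models the prefetch validity signal.

    LINE_SIZE = 64  # assumed cache line size in bytes
    N = 16          # assumed queue depth

    def cache_line(addr: int) -> int:
        return addr // LINE_SIZE

    def handle_prefetch_request(queue, prefetch_ptr):
        """Advance the prefetch pointer and reply with the predicted address
        of the entry it now points to plus its validity signal."""
        prefetch_ptr = (prefetch_ptr + 1) % N  # next entry that may need it
        entry = queue[prefetch_ptr]
        # Valid means the predicted address and its instruction address sit
        # in different cache lines, i.e. the prefetch condition is met; the
        # front end then prefetches from the second instruction cache space.
        valid = cache_line(entry["pred"]) != cache_line(entry["addr"])
        return entry["pred"], valid, prefetch_ptr

    queue = [{"addr": 0x1000, "pred": 0x2044} for _ in range(N)]
    print(handle_prefetch_request(queue, 0))  # (8260, True, 1): 0x2044 qualifies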
In some implementations, the predicted target address queue has an enqueue pointer and a read pointer. The queue is further configured to record the instruction address of the corresponding branch instruction and the first-stage prediction result into the entry currently pointed to by the enqueue pointer and update the enqueue pointer, and, when a read request from the instruction fetch unit is received, to return the predicted address in the entry currently pointed to by the read pointer to the fetch unit and update the read pointer.
Specifically, an enqueue pointer enq_ptr and a read pointer read_ptr are maintained in the predicted target address queue; the enqueue pointer points to the current queue-tail entry and the read pointer to the current queue-head entry. The prediction unit writes each generated prediction result into the empty queue-tail entry indicated by the enqueue pointer, and the instruction fetch unit reads the address it needs for fetching from the queue head through the read pointer. When the predicted target address queue is initialized it is an empty queue: both pointers point to its first empty entry, only enqueue operations are allowed, and the fetch unit cannot obtain valid contents from the queue. After an enqueue operation occurs, for example the prediction unit writes a target instruction address and a first-stage prediction result into the queue and the queue records them at the position indicated by the enqueue pointer, the enqueue pointer automatically increments by 1 to point to the next entry to be written, implementing appends at the queue tail. After a read operation occurs, for example the queue receives a read request from the fetch unit and the contents of the entry indicated by the read pointer are read out, the read pointer automatically increments by 1 to point to the next entry to be read.
It will be appreciated that the predicted address included in the entry currently pointed to by the read pointer is the target predicted address.
Further, in some embodiments, the predicted target address queue receives a reset signal when the instruction fetch unit fails to obtain the corresponding instruction block from the instruction cache. The queue then resets the program counter address used for fetching and resets the read pointer so that it points to the entry following the one indicated by the queue-head pointer.
In some embodiments, the enqueue pointer is the queue-tail pointer of the predicted target address queue. The internal structure of the queue can be designed as a circular queue containing N entries. Initially, the queue-head and queue-tail pointers both point to the first entry and the queue is empty. While the queue is empty, only enqueue operations are allowed, and the instruction fetch unit cannot obtain a valid instruction address from it. After an enqueue operation occurs, that is, after the branch prediction unit writes a generated prediction result into the queue, the queue-tail pointer moves to the next empty entry. When the queue-head and queue-tail pointers coincide again, the queue must judge from the current dequeue or enqueue operation whether it is full. While the queue is full, only dequeue operations are allowed; the prediction unit can no longer write new prediction results into the queue and must predict again later.
The predicted target address queue may also provide an enqueue port (enq port) through which the prediction unit writes new entry contents into the queue, and a read port through which the instruction fetch unit reads the predicted addresses that need to be sent to the instruction cache. The read port reads the contents of the queue entries in order through the read pointer.
Further, the predicted target address queue may provide a corresponding write port for each level of branch prediction. For example, after the prediction unit performs first-stage branch prediction on the branch instruction I1 and generates a first-stage prediction result, the target instruction address of I1 and the first-stage result may be written into the target entry T1 through the enq port, and the queue returns the entry index of T1 to the prediction unit. After the prediction unit performs second-stage branch prediction on I1 and generates a second-stage prediction result, the entry index of T1 can be used to write the target instruction address of I1 and the second-stage result into T1 through the write port corresponding to second-stage branch prediction, namely the writeidx1 port. After the prediction unit performs third-stage branch prediction on I1 and generates a third-stage prediction result, the entry index of T1 can likewise be used to write the target instruction address of I1 and the third-stage result into T1 through the writeidx2 port corresponding to third-stage branch prediction.
It should be noted that the predicted target address queue may implement the dequeue operation through cooperation of the read pointer and the dequeue pointer. Specifically, when the queue receives a dequeue operation instruction, it checks whether advancing the dequeue pointer would pass the read pointer. If it would not, the dequeue pointer automatically increments by 1 to point to the new queue-head entry.
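Putting the pointer discipline together, the sketch below models the circular queue with enqueue, read, and dequeue pointers, using occupancy counters to tell full from empty when the pointers coincide; the depth, names, and counter-based test are design assumptions for illustration.

    class CircularPTAQ:
        """Toy circular predicted target address queue."""

        def __init__(self, n=16):
            self.entries = [None] * n
            self.n = n
            self.enq_ptr = 0   # tail: next entry the prediction unit writes
            self.read_ptr = 0  # next entry handed to the fetch unit
            self.deq_ptr = 0   # head: next entry retired after a fetch
            self.count = 0     # occupied entries (enqueued, not dequeued)
            self.unread = 0    # occupied entries not yet read

        def enqueue(self, instr_addr, pred_addr):
            """Append at the tail; return the entry index, or None when the
            queue is full and the prediction unit must re-predict later."""
            if self.count == self.n:
                return None
            idx = self.enq_ptr
            self.entries[idx] = (instr_addr, pred_addr)
            self.enq_ptr = (self.enq_ptr + 1) % self.n
            self.count += 1
            self.unread += 1
            return idx

        def read(self):
            """Give the fetch unit the next entry, or None when the queue
            holds no valid unread content."""
            if self.unread == 0:
                return None
            entry = self.entries[self.read_ptr]
            self.read_ptr = (self.read_ptr + 1) % self.n
            self.unread -= 1
            return entry

        def dequeue(self):
            """Retire the head entry after a successful fetch; the dequeue
            pointer must not pass the read pointer."""
            if self.count - self.unread > 0:
                self.entries[self.deq_ptr] = None
                self.deq_ptr = (self.deq_ptr + 1) % self.n
                self.count -= 1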
The embodiment of the specification provides an instruction transfer prediction method which is applied to an instruction transfer prediction system, wherein the instruction transfer prediction system comprises a prediction unit, a prediction target address queue and a fetching unit, the input end of the prediction target address queue is connected with the output end of the prediction unit, and the output end of the prediction target address queue is connected with the input end of the fetching unit; the prediction unit is used for carrying out multi-stage branch prediction on the transfer instruction; the multi-level branch prediction includes a first level branch prediction and a second level branch prediction. Referring to fig. 3, the instruction branch prediction method includes the following steps.
S310, the prediction unit carries out branch prediction on the transfer instruction through first-stage branch prediction to obtain a first-stage prediction result and output the result, and carries out branch prediction on the transfer instruction through second-stage branch prediction to obtain a second-stage prediction result and output the result; the first-stage prediction result and the second-stage prediction result respectively comprise a prediction address corresponding to the transfer instruction.
S320, the predicted target address queue records the instruction address of the corresponding transfer instruction and the first-stage predicted result under the condition that the first-stage predicted result is received, and covers the corresponding first-stage predicted result by using the second-stage predicted result under the condition that the second-stage predicted result corresponding to the transfer instruction is received.
S330, the instruction fetching unit acquires a target predicted address corresponding to the target instruction address from the predicted target address queue, and performs corresponding instruction fetching operation according to the acquired target predicted address.
It should be noted that, for the description of the prediction unit, the prediction target address queue, and the instruction fetch unit in the above embodiment, please refer to the description of the prediction unit, the prediction target address queue, and the instruction fetch unit of the instruction transfer prediction system in this specification, and details thereof are not repeated here.
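Tying the three steps together, the following short Python sketch, under the same illustrative assumptions as the earlier snippets, plays the roles of the prediction unit (S310), the predicted target address queue (S320), and the instruction fetch unit (S330) for a single branch instruction; all addresses are made up.

    # S310: the prediction unit produces a first-stage result R11 and a
    # differing second-stage result R21 for a branch at address 0x100.
    queue = []                              # toy predicted target address queue
    instr_addr, r11, r21 = 0x100, 0x200, 0x280

    # S320: record the first-stage result at the tail, keep the entry index,
    # then overwrite the same entry when the second-stage result arrives.
    queue.append({"addr": instr_addr, "pred": r11})
    idx = len(queue) - 1
    queue[idx]["pred"] = r21                # second stage overwrites the first

    # S330: the fetch unit takes the target predicted address from the head
    # entry and fetches from it.
    print(hex(queue[0]["pred"]))            # -> 0x280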
The embodiment of the specification provides an instruction transfer prediction device which is applied to an instruction transfer prediction system, wherein the instruction transfer prediction system comprises a prediction unit, a prediction target address queue and a fetching unit, the input end of the prediction target address queue is connected with the output end of the prediction unit, and the output end of the prediction target address queue is connected with the input end of the fetching unit; the prediction unit is used for carrying out multi-stage branch prediction on the transfer instruction; the multi-level branch prediction includes a first level branch prediction and a second level branch prediction. Referring to fig. 4, the instruction branch prediction apparatus 400 includes: branch prediction module 410, logging module 420, prediction address acquisition module 430.
The branch prediction module 410 is configured to perform branch prediction on the branch instruction by the prediction unit through first-stage branch prediction, obtain a first-stage prediction result, output the first-stage prediction result, and perform branch prediction on the branch instruction through second-stage branch prediction, obtain a second-stage prediction result, and output the second-stage prediction result; the first-stage prediction result and the second-stage prediction result respectively comprise a prediction address corresponding to the transfer instruction.
The recording module 420 is configured to cause the predicted target address queue to record the instruction address of the corresponding branch instruction and the first-stage prediction result when the first-stage prediction result is received, and to overwrite the corresponding first-stage prediction result with the second-stage prediction result when the second-stage prediction result corresponding to the branch instruction is received.
The predicted address obtaining module 430 is configured to obtain, by the instruction fetching unit, a target predicted address corresponding to the target instruction address from the predicted target address queue, and perform a corresponding instruction fetching operation according to the obtained target predicted address.
For specific limitations of the instruction branch prediction apparatus, reference may be made to the above limitations of the instruction branch prediction method, and no further description is given here. The various modules in the instruction branch prediction apparatus described above may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
The embodiment of the present disclosure further provides a computer device, referring to fig. 5, where the computer device 500 includes a memory 510, a processor 520, and a computer program 530 stored in the memory 510 and capable of running on the processor 520, and when the processor 520 executes the computer program 530, the foregoing instruction transfer prediction method is implemented.
The present description also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the foregoing instruction transfer prediction method.
It should be noted that the logic and/or steps represented in the flowcharts or otherwise described herein may be considered as an ordered listing of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one of, or a combination of, the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; either directly or indirectly, through intermediaries, or both, may be in communication with each other or in interaction with each other, unless expressly defined otherwise. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. The instruction transfer prediction system is characterized by comprising a prediction unit, a prediction target address queue and a fetching unit, wherein the input end of the prediction target address queue is connected with the output end of the prediction unit, and the output end of the prediction target address queue is connected with the input end of the fetching unit; wherein:
the prediction unit is used for carrying out multi-stage branch prediction on the transfer instruction; the multi-level branch prediction includes a first level branch prediction and a second level branch prediction; branch prediction is carried out on the transfer instruction through the first-stage branch prediction, a first-stage prediction result is obtained and output, and branch prediction is carried out on the transfer instruction through the second-stage branch prediction, a second-stage prediction result is obtained and output; the first-stage prediction result and the second-stage prediction result respectively comprise a prediction address corresponding to the transfer instruction;
the prediction target address queue is configured to record an instruction address of the corresponding branch instruction and the first-stage prediction result when the first-stage prediction result is received, and cover the corresponding first-stage prediction result with the second-stage prediction result when the second-stage prediction result corresponding to the branch instruction is received;
The instruction fetching unit is used for acquiring a target predicted address corresponding to a target instruction address from the predicted target address queue and performing corresponding instruction fetching operation according to the acquired target predicted address.
2. The system of claim 1, wherein the predicted target address queue is further configured to delete the corresponding target instruction address and a target entry in which the target predicted address is located from the predicted target address queue if the instruction fetch unit performs an instruction fetch operation according to the obtained target predicted address and obtains a corresponding target instruction block from a first instruction cache space.
3. The system of claim 1, wherein the predicted target address queue is further configured to filter a predicted address corresponding to any instruction address in the predicted target address queue to obtain a predicted address that meets a prefetch condition as a prefetch address; a predicted address meeting the prefetch condition is a predicted address located in a different cache line from its corresponding instruction address; the prefetch address is used for acquiring a prefetch instruction block corresponding to the prefetch address from a second instruction cache space.
4. The system of claim 1, wherein the prediction target address queue is further configured to, when recording the instruction address of the corresponding branch instruction and the first-level prediction result to the corresponding entry of the prediction target address queue, return an entry index of the entry to the prediction unit, so that the prediction unit writes the corresponding second-level prediction result to the corresponding entry according to the corresponding entry index.
5. The system of claim 2, wherein the predicted target address queue has a dequeue pointer;
the predicted target address queue is further configured to delete, when the instruction fetching unit performs an instruction fetching operation according to a predicted address included in an entry currently pointed to by the dequeue pointer and obtains a corresponding instruction block from the first instruction cache space, the entry currently pointed to by the dequeue pointer from the predicted target address queue, and update the dequeue pointer.
6. The system of claim 3, wherein the predicted target address queue has a prefetch pointer;
the predicted target address queue is further configured to return, when a prefetch request sent by a front end of a processor is received, a predicted address included in an entry currently pointed to by the prefetch pointer and a prefetch validity signal corresponding to the predicted address to the front end of the processor, and update the prefetch pointer, so that when the received prefetch validity signal is a prefetch valid signal, the front end of the processor takes the predicted address corresponding to the prefetch valid signal as a prefetch address and acquires a prefetch instruction block corresponding to the prefetch address from the second instruction cache space; the prefetch valid signal is used for indicating that the predicted address corresponding to the prefetch valid signal is a predicted address meeting the prefetch condition.
7. The system of any one of claims 1 to 6, wherein the predicted target address queue has an enqueue pointer and a read pointer;
the predicted target address queue is further configured to record, when the first-level predicted result is received, an instruction address of the corresponding branch instruction and the first-level predicted result to an entry currently pointed to by the enqueuing pointer and update the enqueuing pointer, and return, when a read request of the fetch unit is received, a predicted address included in the entry currently pointed to by the read pointer to the fetch unit and update the read pointer.
8. The instruction transfer prediction method is characterized by being applied to an instruction transfer prediction system, wherein the instruction transfer prediction system comprises a prediction unit, a prediction target address queue and a fetching unit, the input end of the prediction target address queue is connected with the output end of the prediction unit, and the output end of the prediction target address queue is connected with the input end of the fetching unit; the prediction unit is used for carrying out multi-stage branch prediction on the transfer instruction; the multi-level branch prediction includes a first level branch prediction and a second level branch prediction; the method comprises the following steps:
The prediction unit performs branch prediction on the transfer instruction through the first-stage branch prediction to obtain a first-stage prediction result and output the first-stage prediction result, and performs branch prediction on the transfer instruction through the second-stage branch prediction to obtain a second-stage prediction result and output the second-stage prediction result; the first-stage prediction result and the second-stage prediction result respectively comprise a prediction address corresponding to the transfer instruction;
the prediction target address queue records the instruction address of the corresponding transfer instruction and the first-stage prediction result under the condition that the first-stage prediction result is received, and covers the corresponding first-stage prediction result by using the second-stage prediction result under the condition that the second-stage prediction result corresponding to the transfer instruction is received;
and the instruction fetching unit acquires a target predicted address corresponding to a target instruction address from the predicted target address queue, and performs corresponding instruction fetching operation according to the acquired target predicted address.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of claim 8 when executing the computer program.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of claim 8.