CN110825442B - Instruction prefetching method and processor - Google Patents

Instruction prefetching method and processor

Info

Publication number
CN110825442B
Authority
CN
China
Prior art keywords
instruction
fetch
branch prediction
starting address
address
Prior art date
Legal status
Active
Application number
CN201910985665.5A
Other languages
Chinese (zh)
Other versions
CN110825442A
Inventor
崔泽汉
Current Assignee
Chengdu Haiguang Microelectronics Technology Co Ltd
Original Assignee
Chengdu Haiguang Microelectronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Haiguang Microelectronics Technology Co Ltd
Publication of CN110825442A
Application granted
Publication of CN110825442B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Embodiments of the invention provide an instruction prefetching method and a processor. The instruction prefetching method includes: obtaining a second next fetch start address, where the branch prediction direction corresponding to the second next fetch start address is opposite to the branch prediction direction corresponding to the first next fetch start address obtained by branch prediction; and executing an instruction prefetch process according to the obtained second next fetch start address. With the instruction prefetching method provided by the embodiments of the invention, the probability that the instruction of a redirected fetch hits in the first-level cache can be increased, the instruction prefetching effect can be improved, and the redirection overhead can be reduced.

Description

Instruction prefetching method and processor
Technical Field
The embodiment of the invention relates to the technical field of processors, in particular to an instruction prefetching method and a processor.
Background
Modern processors typically employ pipelining to process instructions in parallel and improve instruction processing efficiency. Instruction fetch, as a basic stage of the pipeline, mainly reads instructions from the first-level cache for subsequent processing; to increase the probability that an instruction hits the first-level cache at fetch time, most modern processors support instruction prefetching techniques.
Instruction prefetching means that, after the fetch address of an instruction has been determined and before the instruction is formally fetched, it is judged whether the instruction is already stored in the first-level cache; if not, the instruction is moved from a cache below the first-level cache into the first-level cache, so that the instruction hits in the first-level cache when it is formally fetched, thereby increasing the probability of the instruction hitting in the first-level cache at fetch time.
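As a minimal illustration of this process, the C sketch below models the L1 instruction cache as a small set of line tags; the 64-byte line size, the round-robin replacement and all function names are assumptions made purely for illustration, not terminology from the patent.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model: the L1 instruction cache is represented only by the line
     * addresses (tags) it currently holds; 64-byte cache lines are assumed. */
    #define L1_LINES 4
    static uint64_t l1_tag[L1_LINES];
    static bool     l1_valid[L1_LINES];

    static bool l1_contains(uint64_t fetch_addr) {
        uint64_t tag = fetch_addr >> 6;                 /* 64-byte line granularity */
        for (int i = 0; i < L1_LINES; i++)
            if (l1_valid[i] && l1_tag[i] == tag)
                return true;
        return false;
    }

    static void move_to_l1_from_lower(uint64_t fetch_addr) {
        static int victim = 0;                          /* trivial replacement policy */
        l1_tag[victim] = fetch_addr >> 6;
        l1_valid[victim] = true;
        victim = (victim + 1) % L1_LINES;
        printf("prefetch line 0x%llx into L1\n", (unsigned long long)(fetch_addr >> 6));
    }

    /* Instruction prefetch: after the fetch address is determined and before the
     * instruction is formally fetched, check the L1 cache and, on a miss, move
     * the corresponding line up from the lower-level cache so the real fetch hits. */
    void instruction_prefetch(uint64_t fetch_addr) {
        if (!l1_contains(fetch_addr))
            move_to_l1_from_lower(fetch_addr);
    }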
How to increase the probability that an instruction hits in the first-level cache at fetch time has long been a research focus of those skilled in the art.
Disclosure of Invention
The embodiment of the invention provides an instruction prefetching method and a processor, which are used for improving the probability of instruction hit in a first-level cache during instruction fetching.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
An instruction prefetching method, comprising:
obtaining a second next fetch start address, where the branch prediction direction corresponding to the obtained second next fetch start address is opposite to the branch prediction direction corresponding to the first next fetch start address obtained by branch prediction; and
executing an instruction prefetch process according to the obtained second next fetch start address.
An embodiment of the present invention further provides a processor, where the processor includes at least one processor core, where the processor core includes at least the following logic:
an instruction cache hit prediction unit, configured to obtain a second next fetch start address, where the branch prediction direction corresponding to the obtained second next fetch start address is opposite to the branch prediction direction corresponding to the first next fetch start address obtained by branch prediction, and to execute an instruction prefetch process according to the obtained second next fetch start address.
An embodiment of the present invention further provides a processor, where the processor includes at least one processor core, and the processor core at least includes: logic to implement the method described above.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following advantages:
the instruction prefetching method provided by the embodiment of the invention can obtain a second next instruction fetching starting address, wherein the second next instruction fetching starting address is a next instruction fetching starting address which is not selected by branch prediction, and the branch prediction direction corresponding to the second next instruction fetching starting address is opposite to the branch prediction direction corresponding to the first next instruction fetching starting address obtained by branch prediction; therefore, the instruction prefetching process is executed according to the obtained second next instruction fetching starting address, and the redirection overhead can be reduced through the prefetched instruction for redirecting instruction fetching when redirection occurs in the follow-up process.
Because the second next instruction fetch starting address is the same as the redirected instruction fetch starting address when the branch prediction is wrong and redirection is performed, the method and the device for pre-fetching instructions execute the instruction pre-fetching process according to the second next instruction fetch starting address, can improve the hit probability of the redirected instruction fetch in the first-level cache when redirection is performed subsequently, and reduce the redirection overhead. Therefore, by the instruction prefetching method provided by the embodiment of the invention, the probability of hit of the instruction subjected to redirection instruction fetching in the first-level cache can be improved, the instruction prefetching effect can be improved, and the redirection overhead can be reduced.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present application, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is an alternative block diagram of a computer system architecture;
FIG. 2 is an alternative block diagram of a processor coupled to a memory;
FIG. 3 is an alternative block diagram of a processor including a processor core that uses pipelining;
FIG. 4 is an architectural block diagram of a processor associated with instruction prefetching;
FIG. 5 is a block diagram of an architecture of a processor according to an embodiment of the present invention;
FIG. 6 is a block diagram of another architecture of a processor according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an example of a process for performing instruction prefetching according to an embodiment of the present invention;
FIG. 8 is a block diagram of a further architecture of a processor according to an embodiment of the present invention;
FIG. 9 is a flowchart of the method steps for filtering the second next fetch start address according to one embodiment of the present invention;
FIG. 10 is a block diagram of a processor according to an embodiment of the present invention;
FIG. 11 is a block diagram of yet another architecture of a processor according to an embodiment of the present invention;
FIG. 12 is a flowchart of an alternative method step for a method of instruction prefetching according to an embodiment of the present invention.
Detailed Description
At present, when redirection occurs, instruction prefetching cannot guarantee that an instruction for redirecting instruction fetching can be hit in a first-level cache, so that the embodiment of the invention provides an improved instruction prefetching method.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention; it is obvious that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
As an alternative example of the present disclosure, FIG. 1 illustrates a block diagram of a computer system architecture; it should be noted that the block diagram is shown to facilitate understanding of the disclosure of the embodiments of the present invention, which are not limited to the architecture shown in fig. 1.
Referring to fig. 1, a computer system 1 may include: a processor 11, a memory 12 coupled to the processor 11, and a south bridge 13 coupled to the processor.
The processor 11 may comprise a CISC (complex instruction set computer) microprocessor, a RISC (reduced instruction set computer) microprocessor, a VLIW (very long instruction word) microprocessor, a processor implementing a combination of instruction sets, or any other processor device, such as a digital signal processor.
Processor 11 may integrate at least one processor core 100 for executing at least one instruction. Processor core 100 represents any type of architected processor core, such as a RISC processor core, a CISC processor core, a VLIW processor core, or a hybrid processor core, and may be implemented in any suitable manner. Where processor 11 integrates multiple processor cores 100, the cores may be homogeneous or heterogeneous in architecture and/or instruction set; in an alternative implementation, some processor cores may be in-order while other processor cores are out-of-order, and in another alternative implementation, two or more processor cores may execute the same instruction set, while other processor cores execute only a subset of that instruction set or a different instruction set.
As an alternative example, the processor 11 may integrate the memory controller and the like, and provide the memory interface and the like to the outside; the processor 11 may be coupled to the memory 12 through a memory interface. Meanwhile, the processor 11 may be coupled to a processor bus, and coupled to the south bridge 13 through the processor bus.
As an alternative example, the south bridge 13 may integrate a bus interface 14 that communicates with the other components of the computer system, such that the processor 11 signals with most of the other components of the computer system 1 via the south bridge 13; the components of the computer system can be added and adjusted according to actual conditions, and are not explained one by one here;
in an alternative example, the bus interface 14 integrated by south bridge 13 includes, but is not limited to: a memory (such as a hard disk) bus interface, a USB bus interface, a network controller bus interface, a PCIE bus interface, etc.
It should be noted that the coupling structure of the processor and the south bridge in the exemplary block diagram of fig. 1 is basic, but the detailed refinement structure of the processor and the south bridge may be set, adjusted and/or expanded according to the specific use case, and is not fixed.
In other computer system architectures, such as those with separate south and north bridges, memory control may also be provided by the north bridge, such as the north bridge being primarily responsible for signal passing between the graphics card, memory, and processor, and coupling the processor up and the south bridge down; the south bridge is mainly responsible for signal transmission among the hard disk, the peripheral equipment, various IO (input/output) interfaces with lower bandwidth requirements, the memory and the processor.
The above is a computer architecture of a processor and south bridge type, and in other examples of the computer architecture, the computer architecture may also be implemented by SoC (System on Chip); for example, the SoC may integrate a processor, a memory controller, an IO interface, and the like, and the SoC may be coupled with other components such as an external memory, an IO device, and a network card, so as to build a computer architecture on a single main chip.
It should be further noted that the architecture described above is not limited to computer systems, but may be used in other devices such as handheld devices and other devices having embedded applications; some examples of handheld devices include cellular phones, internet protocol devices, digital cameras, Personal Digital Assistants (PDAs), or handheld PCs (personal computers). Other devices with embedded applications may include network computers (Net PCs), set-top boxes, servers, Wide Area Network (WAN) switches, or any other system that can execute one or more instructions of at least one disclosed embodiment of the invention.
In addition, the processor described above is not limited to a central processing unit (CPU); it may also be an accelerator (e.g., a graphics accelerator or a digital signal processing unit), a graphics processing unit (GPU), a field-programmable gate array (FPGA), or any other processor having an instruction execution function. Although illustrated as a single processor, in practice a computer architecture may have multiple processors, each with at least one processor core.
As an alternative example of the present disclosure, fig. 2 illustrates a block diagram of a processor coupled to a memory; it should be noted that the block diagram is shown to facilitate understanding of the disclosure of the embodiments of the present invention, which are not limited to the architecture shown in fig. 2.
Referring to fig. 2, the processor 11 may include: at least one processor core 100 (the multiple processor cores case shown in figure 2 is only one optional example); at least one private cache 210 may reside inside each processor core 100; meanwhile, at least one shared cache 220 resides outside of the processor core 100 and is shared by the at least one processor core 100; shared cache 220 accesses memory 12 and passes signals between processor core 100 and memory 12. Optionally, on the basis of the architecture shown in fig. 2, in the embodiment of the present invention, an external shared cache may also be disposed outside the processor 11, and the external shared cache transfers signals between the processor 11 and the memory 12.
It should be noted that the processor may also include other circuits (not shown) that are not necessary for understanding the disclosure of the embodiments of the present invention, and the embodiments of the present invention are not described in detail since the other circuits are not necessary for understanding the disclosure of the embodiments of the present invention.
A cache is a storage unit with very high access speed located between the processor core 100 and the memory 12, and generally has a multi-level structure; most commonly a three-level structure is used, divided into a first-level (L1) cache, a second-level (L2) cache and a third-level (L3) cache. Of course, embodiments of the present invention may also support structures with more or fewer than three cache levels.
As an alternative example, each processor core 100 may internally integrate the L1 cache and the L2 cache, i.e., the private cache 210 may include: an L1 cache and an L2 cache; the shared cache 220 may include an L3 cache, the L3 cache being shared by the at least one processor core 100; of course, this cache arrangement is merely an example, and it is also possible to integrate the L2 and L3 caches as shared caches, or in the case of more than three-level cache structures, the L1, L2, and L3 caches may all be integrated within the processor core as private caches.
Modern microprocessor architectures generally use pipeline (pipeline) technology to implement parallel processing of multiple instructions, and combine Branch Prediction (Branch Prediction), out of order execution (out of order execution), and other technologies to improve the execution efficiency of the pipeline. As an alternative example of the present disclosure, FIG. 3 illustratively shows a block diagram of a processor including a processor core that uses pipelining; it should be noted that the block diagram is shown to facilitate understanding of the disclosure of the embodiments of the present invention, and the embodiments of the present invention are not limited to the architecture shown in fig. 3.
As an optional example, the processing procedure of a five-stage pipeline may be divided into instruction fetch, instruction decode, execute, memory access, and write back; in order to avoid the pipeline delay caused by having to wait for the execution result of a branch instruction before the next fetch can be determined, the front end of the pipeline may further be provided with a branch prediction unit to implement branch prediction.
Referring to fig. 3, the processor 11 may include: branch prediction unit 101, instruction fetch unit 102, decode unit 103, execution engine unit 104, memory access unit 105, write back unit 106, level one (L1) cache 210, and at least one lower level cache 220.
The branch prediction unit 101, the instruction fetch unit 102, the decode unit 103, the execution engine unit 104, the access unit 105, and the write-back unit 106 may be logic circuit units integrated in a processor core, including but not limited to the processor core 100 shown in fig. 1 or fig. 2.
The L1 cache 210 includes an instruction cache that stores instructions primarily through instruction cache blocks; the lower level cache 220 is lower in level than the L1 cache 210, which may be one or more caches lower in level than the L1 cache, such as at least one of the L2 cache, the L3 cache, etc. below the L1 cache;
optionally, in the embodiment of the present invention, the L1 cache 210 is a separate instruction cache (storing instructions only by an instruction cache block), and the at least one lower level cache 220 caches data and instructions together (both data cache blocks and instruction cache blocks); as an example of a three-level cache architecture, the L1 cache is an instruction cache that only caches instructions, and the L2 cache and the L3 cache (an optional implementation of the at least one lower level cache 220) may cache data and instructions together.
In embodiments of the invention, all or part of the L1 cache 210 and the at least one lower level cache 220 may be integrated within a processor core; as an example, the L1 cache 210 may be integrated within a processor core, and all or a portion of at least one lower level cache 220 may be integrated outside of the processor core (e.g., both the L2 and L3 caches are integrated within the processor core, or alternatively, the L2 cache is integrated within the processor core and the L3 cache resides outside of the processor core);
it should be noted that, no matter how the cache hierarchy integrated within the processor core and residing outside the processor core is set; optionally, generally, the previous-layer cache may cache information from the next-layer cache, for example, the L1 cache may cache information from the L2 cache, although this configuration is only optional, and the embodiment of the present invention is not limited thereto.
In modern processors, branch prediction unit 101 runs at the very front of the pipeline and controls instruction fetching; the branch prediction unit determines a branch prediction result through branch prediction techniques; the branch prediction result comprises at least a branch prediction direction (the predicted direction of a branch instruction, e.g., jump or no jump), the target address of the branch instruction, the current fetch address, the next fetch start address, and the like; in one implementation, the branch prediction unit may make branch predictions based on the historical execution information and execution results of branch instructions.
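The components of a branch prediction result listed above can be gathered into a small C structure; the field names below are illustrative assumptions rather than terminology from the patent.

    #include <stdbool.h>
    #include <stdint.h>

    /* One branch prediction result, as produced by the branch prediction unit. */
    struct branch_prediction_result {
        bool     taken;                /* branch prediction direction: jump / no jump          */
        uint64_t branch_target;        /* target address of the predicted branch instruction   */
        uint64_t fetch_start_address;  /* current fetch start address                          */
        uint64_t fetch_end_address;    /* together with the start, forms the current fetch address */
        uint64_t next_fetch_start;     /* next fetch start address on the selected path        */
    };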
Based on the current instruction fetch address predicted by branch prediction unit 101, instruction fetch unit 102 may fetch instructions (including, but not limited to, fetch branch instructions, logical operation instructions, access instructions, etc.) from L1 cache 210 and feed into decode units 103; in an alternative implementation, instruction fetch unit 102 may deposit the fetched instruction into an instruction register of processor core 100 for decoding by decode unit 103 reading the instruction from the instruction register.
The decoding unit 103 may interpret the instruction to obtain a decoding result; the decoded result may be machine-executable operation information derived from interpreting the instruction, such as machine-executable uops (micro-instructions) formed by interpreting the operation code, operands, and control fields of the instruction; optionally, the decode unit 103 may read the source operands from the register file and parse the operation codes to generate the control signals. Optionally, the decode unit 103 may support Instruction Prefetch (Instruction Prefetch) techniques.
The execution engine unit 104 may perform operations based on the decoding result of the decoding unit to generate an execution result (the execution result corresponds to the instruction function of the instruction fetch instruction and relates to access, a logical operation result, instruction jump, and the like); optionally, the execution engine unit 104 may support out-of-order execution techniques.
Memory access unit 105 may perform memory accesses based on the results of execution of the memory access instructions by execution engine unit 104.
Write back unit 106 may write back the execution results to the register file based on the execution results of the instruction by execution engine unit 104 or the LOAD instruction by the memory access unit.
It should be noted that fig. 3 exemplarily shows a five-stage pipeline processor core architecture, and as technology adjusts, logic circuit units at different stages in the pipeline may be integrated or separated, and the architecture is not fixed; meanwhile, the processor core of the embodiment of the invention can also be applied to other pipeline technologies such as a four-stage pipeline and the like.
It is understood that the processor core may also include other circuits (not shown) that are not necessary for understanding the disclosure of the embodiments of the present invention, and the embodiments of the present invention are not described in detail since the other circuits are not necessary for understanding the disclosure of the embodiments of the present invention.
As an alternative implementation, FIG. 4 shows a block diagram of an instruction prefetch related processor architecture, it should be noted that the block diagram is shown for the purpose of facilitating understanding of the disclosure of embodiments of the present invention, which are not limited to the architecture shown in FIG. 4; the processor may also include other circuitry (not shown) that is not necessary for an understanding of embodiments of the present invention and will not be described in detail herein.
In an alternative example, as shown in fig. 3 and 4, the input of the branch prediction unit 101 is a current fetch start address, and the current fetch start address can be selected from a next fetch start address and a redirected fetch start address; according to the current fetch starting address and the internal information, the branch prediction unit can generate a branch prediction result; the branch prediction result at least comprises: branch prediction direction (e.g., branch instruction jump or no jump), target address of branch instruction, current fetch ending address, and next fetch starting address, etc., although branch prediction results may also include other things that are not necessary for understanding embodiments of the present invention;
the current instruction fetch starting address and the current instruction fetch ending address constitute a current instruction fetch address, and are used for reading a corresponding instruction from an L1 cache;
the next fetch starting address can be used as the current fetch starting address of the next branch prediction, and when a branch prediction error occurs and redirection is carried out, the redirected fetch starting address is used as the current fetch starting address of the branch prediction unit; the branch prediction unit can be continuously driven to operate, and the current fetch address is continuously output;
the redirected fetch starting address is used for correcting the branch prediction error when the branch prediction error is found and redirected; when redirection occurs, the redirected fetch starting address can be used as an input of branch prediction, the branch prediction unit can output a redirected fetch address (corresponding to the current fetch address during redirection) through branch prediction, and the redirected fetch address can be used for reading an instruction (namely, an instruction for redirecting fetch) in a cache so as to correct the instruction read when a branch prediction error occurs;
the redirection process also involves flushing instructions in the pipeline that are erroneously fetched, etc.; generally, the redirected fetch start address is provided by the decode unit or the execution engine unit, and in some cases, such as when a branch prediction error is found in the fetch stage, the redirected fetch start address may also be provided by the logic circuit unit involved in the fetch stage; although fig. 4 illustrates an example in which the execution engine unit 104 outputs the redirected fetch start address, the output of the redirected fetch start address is not limited to that shown in fig. 4.
As an alternative implementation, in order to support instruction prefetching, the embodiment of the present invention is further provided with an instruction fetch queue 1021 and an instruction cache hit prediction unit 1022; optionally, instruction fetch queue 1021 and instruction cache hit prediction unit 1022 may be part of instruction fetch unit 102;
the current fetch address output by the branch prediction unit 101 may be stored in the fetch queue 1021 and queued in the fetch queue 1021 (i.e., the current fetch address output by the branch prediction unit 101 at each branch prediction may be stored in the fetch queue 1021 and queued in the fetch queue 1021), and the fetch queue 1021 may schedule the fetch address in the queue to fetch from the L1 cache 210;
To increase the probability that an instruction hits in the L1 cache 210 when it is fetched according to the fetch queue 1021, the instruction cache hit prediction unit 1022 may predict whether the current fetch address hits in the L1 cache 210 when branch prediction unit 101 outputs the current fetch address; if a miss is predicted, the instruction cache hit prediction unit 1022 may issue an instruction prefetch request to the lower-level cache 220, causing the lower-level cache 220 to move the corresponding instruction into the L1 cache 210 in advance; the instruction prefetch request may include the information needed for the fetch, such as the current fetch address, or the current fetch start address may be used directly, etc.;
after the lower-level cache 220 moves the instruction to the L1 cache 210 in advance, when the instruction queue 1021 fetches the instruction from the L1 cache 210 according to the current instruction fetching address, the instruction has a very high probability of hitting in the L1 cache, so that the situation that the instruction is not hit in the L1 cache during instruction fetching can be reduced, and the instruction fetching efficiency is improved.
This instruction prefetching approach, or any approach based on the same principle, uses the time a fetch address spends queued in the fetch queue to move the instructions to be read from the lower-level cache into the L1 cache in advance, so that misses in the L1 cache can be reduced when instructions are formally fetched from it. The essence of this way of instruction prefetching is that the access latency of prefetching from the lower-level cache is hidden behind the queuing delay of the fetch address in the fetch queue. However, the inventors have found that this way of instruction prefetching fails in at least the following case:
when a branch prediction is wrong and redirection is performed, because instructions which are wrongly fetched in a pipeline are emptied, an instruction fetching queue is emptied, instruction prefetching cannot be suitable for fetching during redirection, instructions which are redirected are possibly not hit in a first-level cache, and at the moment, corresponding instructions need to be moved from a lower-level cache to the first-level cache on the spot, and redirection overhead is increased undoubtedly.
It can be understood that, when a branch is predicted incorrectly, the fetch queue is emptied when redirection is performed, which may not hide the access delay of instruction prefetching to the lower level cache, and therefore instruction prefetching cannot be applied to fetch during redirection; when the redirect instruction address does not hit in the first-level cache, the corresponding instruction needs to be moved from the lower-level cache to the first-level cache on the spot, which undoubtedly increases the redirect overhead caused by the branch prediction error. It should be noted that the redirect instruction fetch address is an instruction fetch address predicted by the branch prediction unit based on the redirect instruction fetch start address, and the redirect instruction fetch address may be considered as a current instruction fetch address for fetching instructions at the time of redirection, and may be used to read instructions of the redirect instruction fetch from the L1 cache.
In order to solve the above problems, the present inventors propose an improved instruction prefetch method, which can obtain a next instruction fetch start address that is not selected by a branch prediction unit when performing branch prediction, so as to execute an instruction prefetch process according to the next instruction fetch start address, and improve the probability of hit of an instruction for redirecting instruction fetch in a first-level cache when redirection occurs.
It should be noted that, when the branch prediction unit performs branch prediction, branch prediction directions are mainly divided into two types: branch instruction jump and branch instruction do not jump; the branch prediction unit may select one of a plurality of possible fetch addresses as output based on the predicted direction (jump or not jump) of the branch instruction;
specifically, the next fetch start address generated by the branch prediction unit can be divided into two types: the next step of the corresponding instruction-taking starting address of the branch prediction direction which is jumped and the next step of the corresponding instruction-taking starting address of the branch prediction direction which is not jumped; the branch prediction unit may select a next fetch start address according to a branch prediction direction predicted during branch prediction, so as to be used as a next fetch start address (e.g., as an input of next branch prediction);
for example, if the predicted branch prediction direction is a jump, the next fetch start address corresponding to the branch prediction direction of the jump is selected, and if the predicted branch prediction direction is a no-jump, the next fetch start address corresponding to the branch prediction direction of the no-jump is selected.
Upon finding a branch prediction error to redirect, the inventors of the present invention found: the redirected fetch starting address is the next fetch starting address which is not selected by the branch prediction unit; therefore, the instruction prefetching process is executed according to the next instruction fetching starting address which is not selected by the branch prediction unit, and the instruction prefetching process is executed according to the redirected instruction fetching starting address in advance when redirection possibly occurs subsequently; the redirected fetch starting address is related to the redirected fetch address, so that the probability of hitting the redirected fetch instruction in the first-level cache can be improved when the subsequent redirection really occurs.
For convenience of the following description, the embodiments of the present invention may refer to a branch prediction direction predicted by the branch prediction unit as a first branch prediction direction, and a next fetch start address corresponding to the first branch prediction direction as a first next fetch start address (i.e., when a branch is predicted, a next fetch start address selected by the branch prediction unit is a first next fetch start address); the prediction direction opposite to the first branch prediction direction is a second branch prediction direction (i.e., the second branch prediction direction is opposite to the branch prediction direction predicted by the branch prediction unit), and the next fetch start address corresponding to the second branch prediction direction is a second next fetch start address (i.e., when a branch is predicted, the next fetch start address not selected by the branch prediction unit is the second next fetch start address).
As an alternative example of the disclosure of the embodiments of the present invention, fig. 5 shows an architecture block diagram of a processor provided by the embodiments of the present invention; it should be noted that the processor may also include other circuitry (not shown) that is not necessary for understanding the present disclosure;
referring to fig. 5, in the embodiment of the present invention, the branch prediction unit 101 outputs a second next fetch start address in addition to the current fetch start address and the first next fetch start address;
wherein, the current fetch start address is fed into the fetch queue 1021 to be queued for fetching, and the current fetch start address can be fed into the instruction cache hit prediction unit 1022 for instruction prefetch operation;
the first next step is to fetch the initial address of the instruction as the input of the next branch prediction;
the second next fetch start address is fed to the instruction cache hit prediction unit 1022 for instruction prefetch operation;
it is understood that the first next fetch start address is a next fetch start address selected by the branch prediction unit during branch prediction, the first branch prediction direction corresponding to the first next fetch start address is the branch prediction direction predicted by the branch prediction unit, and the second branch prediction direction corresponding to the second next fetch start address is opposite to the first branch prediction direction.
In the embodiment of the present invention, the instruction cache hit prediction unit 1022 not only executes the instruction prefetching process according to the current fetch start address, but also executes the instruction prefetching process according to the second next fetch start address;
from the perspective of the instruction cache prediction unit 1022, the instruction cache hit prediction unit 1022 may fetch the second next fetch start address; the obtained branch prediction direction corresponding to the second next fetch starting address is opposite to the branch prediction direction corresponding to the first next fetch starting address obtained by branch prediction; the instruction cache prediction unit 1022 may perform an instruction prefetch process based on the second next fetch start address fetched. Furthermore, when redirection is carried out subsequently, the probability that the instruction of the redirection instruction fetching hits in the first-level cache is improved, the instruction prefetching effect is improved, and the redirection overhead is reduced.
Optionally, as an optional implementation of performing the instruction prefetching process according to the second next fetch starting address, the instruction cache hit prediction unit 1022 may determine whether the second next fetch starting address hits in the L1 cache 210; if the prediction is not hit, the instruction cache hit prediction unit 1022 may generate an instruction prefetch request corresponding to the second next fetch start address, which may be used to request the lower level cache 220 to move an instruction corresponding to the second next fetch start address to the L1 cache 210;
in an alternative implementation, the instruction cache hit prediction unit 1022 may forward the generated instruction prefetch request, which may include the second next fetch start address, to the lower cache 220; so that the lower level cache 220 may move instructions corresponding to the starting address of the second next fetch instruction to the L1 cache 210.
For example, the solid lines in FIG. 5 illustrate the instruction cache hit prediction unit 1022 performing instruction prefetching according to the current fetch start address, and the dashed lines in FIG. 5 illustrate an exemplary implementation of instruction prefetching according to the second next fetch start address.
When the branch prediction is wrong and the redirection is performed, the second next instruction fetching starting address is the redirected instruction fetching starting address, so that the instruction cache hit prediction unit executes the instruction prefetching process according to the second next instruction fetching starting address, the probability of hitting the redirected instruction fetching in the L1 cache can be improved when the redirection is actually performed in the follow-up process, and the redirection overhead is reduced. The embodiment of the invention can improve the hit probability of the instruction of the redirection instruction fetch in the first-level cache, improve the instruction prefetching effect and reduce the redirection overhead when the subsequent redirection occurs.
As an example, the branch prediction logic may include: if the branch prediction unit predicts that the branch instruction does not jump (i.e. the first branch prediction direction is not jumping), or predicts that no branch instruction exists, the current fetch ending address may be a preset boundary (e.g. a boundary corresponding to 64 bytes), and the selected first next fetch starting address may be the current fetch ending address + 1;
if the branch prediction unit predicts a branch instruction jump (i.e. the first branch prediction direction is a jump), the output current fetch ending address is the ending address of the branch instruction, and the next fetch starting address is the target address of the branch instruction (i.e. the address of the branch instruction jump).
It should be apparent that the branch prediction logic shown above is only an alternative example, and for convenience of illustration only, other possible branch prediction logic may be supported by embodiments of the present invention.
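Under the example logic above, the selected (first) and unselected (second) next fetch start addresses could be computed as in the following C sketch; the function and field names are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    struct next_fetch_addrs {
        uint64_t first;    /* next fetch start address on the predicted direction */
        uint64_t second;   /* next fetch start address on the opposite direction  */
    };

    /* Example branch prediction logic from the text: on a predicted non-taken
     * branch (or no branch), fetching falls through past the current fetch end
     * address; on a predicted taken branch, it continues at the branch target.
     * The address that is not selected becomes the second next fetch start
     * address and is used only for instruction prefetching. */
    struct next_fetch_addrs select_next_fetch(bool predicted_taken,
                                              uint64_t current_fetch_end,
                                              uint64_t branch_target)
    {
        uint64_t fallthrough = current_fetch_end + 1;
        struct next_fetch_addrs n;
        if (predicted_taken) {
            n.first  = branch_target;   /* fed back as the input of the next branch prediction */
            n.second = fallthrough;     /* fed to the instruction cache hit prediction unit     */
        } else {
            n.first  = fallthrough;
            n.second = branch_target;
        }
        return n;
    }

In the example of FIG. 6 that follows, predicted_taken would be false, so first is the fall-through address and second is the target address of the branch instruction.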
For example, as shown in fig. 6, taking the example that the branch prediction unit 101 predicts that the branch instruction does not jump, the first branch prediction direction predicted by the branch prediction unit is no jump, the first next fetch start address may be the current fetch end address +1, correspondingly, the second branch prediction direction not predicted by the branch prediction unit is a jump, and the second next fetch start address not selected by the branch prediction unit is the target address of the branch instruction;
as shown in fig. 6, based on the second next fetch start address, the instruction cache hit prediction unit 1022 may predict whether the target address of the branch instruction hits in the L1 cache 210; if a miss is predicted, instruction cache hit prediction unit 1022 may feed an instruction prefetch request, which may include the target address of the branch instruction, to lower level cache 220; so that the lower level cache 220 may move instructions corresponding to the target address of the branch instruction to the L1 cache;
if the branch prediction of the branch prediction unit is wrong, the execution engine unit can redirect and output a redirected fetch starting address (namely a target address of the branch instruction) to the branch prediction unit, wherein the target address of the branch instruction is used as a branch prediction input so as to correct the branch prediction error;
in the above process, the redirection process also involves emptying the fetch queue, so that the redirection fetch address (corresponding to the current fetch address during redirection) predicted by the branch prediction unit cannot be applied to instruction prefetching; in the embodiment of the present invention, since the instruction cache hit prediction unit 1022 is used to perform instruction prefetching on the redirected fetch start address in advance, when fetching an instruction from the L1 cache based on the redirected fetch address, the redirected fetch address has a very high probability of hitting in the L1 cache, and there is no need to move the instruction from a lower level cache on the spot, so that the redirection overhead can be greatly reduced.
The above example is described by taking an example of a predicted branch instruction not jumping, but of course, the embodiment of the present invention may also support a scenario of predicted branch instruction jumping; it will be appreciated that in the branch prediction logic of the above example, if the branch prediction unit predicts a branch instruction jump, but the branch instruction does not actually jump, the redirected fetch start address is the corresponding target address when not jumping (i.e. the predicted current fetch end address +1 when not jumping); if the branch prediction unit predicts that the branch instruction does not jump but actually jumps, the redirected fetch starting address is a corresponding target address during jumping; the redirected fetch start address corresponds to a second next fetch start address not selected by the branch prediction unit.
As an optional example of the disclosure of the embodiment of the present invention, the L1 cache 210 and the lower level cache 220 may both store instructions through an instruction cache block, where the instruction cache block has at least a Tag field, and the Tag field may record at least a part of the fetch start address; optionally, an optional implementation of determining whether the obtained second next fetch start address hits in the L1 cache according to the embodiment of the present invention may be:
determining whether the obtained second next fetch start address matches the Tag field of any instruction cache block in the L1 cache, and if not, determining that the obtained second next fetch start address misses in the L1 cache;
correspondingly, an instruction prefetch request corresponding to the obtained second next fetch start address can be generated, so that the lower-level cache moves the instruction cache block whose Tag field matches the obtained second next fetch start address into the L1 cache, thereby prefetching the corresponding instruction in the case where the obtained second next fetch start address misses in the L1 cache; that is, the generated instruction prefetch request may be specifically used to request the lower-level cache to move the instruction cache block whose Tag field matches the second next fetch start address into the L1 cache.
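The Tag comparison and the resulting prefetch request described above can be sketched in C as follows; the 64-byte block size, the way the Tag is extracted from the address, and all names are assumptions for illustration.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define LINE_SHIFT 6                 /* assume 64-byte instruction cache blocks */

    struct icache_block {
        uint64_t tag;                    /* Tag field: upper bits of the fetch start address */
        bool     valid;
        uint8_t  bytes[1 << LINE_SHIFT]; /* instruction bytes held by this block */
    };

    struct icache {
        struct icache_block *blocks;
        size_t               nblocks;
    };

    static uint64_t tag_of(uint64_t fetch_start_addr) {
        return fetch_start_addr >> LINE_SHIFT;
    }

    /* Hit prediction: does the second next fetch start address match the Tag field
     * of any instruction cache block currently held in the L1 cache? */
    bool l1_tag_match(const struct icache *l1, uint64_t second_next_fetch_start) {
        uint64_t tag = tag_of(second_next_fetch_start);
        for (size_t i = 0; i < l1->nblocks; i++)
            if (l1->blocks[i].valid && l1->blocks[i].tag == tag)
                return true;
        return false;
    }

    /* A prefetch request simply carries the address whose matching block the
     * lower-level cache should move into the L1 cache. */
    struct prefetch_request { uint64_t fetch_start_addr; };

    /* On a predicted miss, fill in a prefetch request for the lower-level cache;
     * returns true when a request should actually be sent. */
    bool maybe_prefetch(const struct icache *l1, uint64_t second_next_fetch_start,
                        struct prefetch_request *req) {
        if (l1_tag_match(l1, second_next_fetch_start))
            return false;                /* predicted hit in L1: no prefetch needed */
        req->fetch_start_addr = second_next_fetch_start;
        return true;                     /* send the request to the lower-level cache */
    }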
For example, FIG. 7 shows an illustration of an instruction prefetch process performed according to a second next fetch start address, as described with reference to FIG. 7:
the L1 cache 210 has a plurality of instruction cache blocks, each instruction cache block having a corresponding Tag field, although an instruction cache block may also include fields indicating instruction address, content, etc., and will not be described one by one herein;
the lower level cache 220 also has a plurality of instruction cache blocks, each instruction cache block having a corresponding Tag field;
assuming that the instruction to be fetched corresponding to the second next fetch start address is instruction a, that is, the instruction corresponding to the branch prediction direction that is not predicted by the branch prediction unit is instruction a; instruction A is stored in the lower level cache 220, and the Tag field of the instruction cache block storing instruction A is Tag 1;
After the instruction cache hit prediction unit 1022 obtains the second next fetch start address, it may determine whether the second next fetch start address matches the Tag field of any instruction cache block in the L1 cache, that is, whether the second next fetch start address hits in the L1 cache; if not, indicating that the L1 cache does not hold the instruction cache block containing instruction A, the instruction cache hit prediction unit 1022 may feed an instruction prefetch request (including the second next fetch start address) to the lower-level cache;
the lower-level cache searches for a Tag1 matched with the starting address of the second next instruction fetch, and determines that the starting address of the second next instruction fetch hits an instruction cache block corresponding to the Tag1 in the lower-level cache, so that the instruction cache block corresponding to the Tag1 is moved to the L1 cache, and the L1 cache has an instruction cache block for storing the instruction A;
furthermore, when redirection occurs subsequently, the redirected instruction fetching start address can hit Tag1 in the L1 cache, and an instruction corresponding to the redirected instruction fetching address is read from an instruction cache block corresponding to the hit Tag1, so that the instruction for redirecting instruction fetching is read from the L1 cache, the condition that the instruction is moved from a lower-level cache to a first-level cache on the spot is reduced, and the redirection overhead is reduced.
It should be noted that, when an instruction is read, the embodiments of the present invention may determine whether the fetch start address hits in a cache (e.g., an L1 cache and a lower cache) based on the match between the fetch start address and the Tag field of the instruction cache block; upon a hit, the instruction cache block in which the fetch start address hits may be determined based on the matching Tag field, such that the corresponding instruction is read from the hitting instruction cache block further based on the fetch address (including the fetch start address and the fetch end address).
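Once a fetch start address has hit an instruction cache block, the fetch address (start and end) selects the instruction bytes within that block; a minimal C sketch, assuming the whole fetch range falls inside one 64-byte block, is given below.

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define LINE_BYTES 64   /* assumed instruction cache block size */

    /* Copy the instruction bytes in [fetch_start, fetch_end] out of a cache block
     * that the fetch start address has already hit; assumes the whole range lies
     * inside this single block, as in the examples above. */
    size_t read_from_hit_block(const uint8_t block[LINE_BYTES],
                               uint64_t fetch_start, uint64_t fetch_end,
                               uint8_t *out)
    {
        uint64_t lo = fetch_start % LINE_BYTES;   /* offset of the first byte in the block */
        uint64_t hi = fetch_end   % LINE_BYTES;   /* offset of the last byte in the block  */
        size_t   n  = (size_t)(hi - lo + 1);
        memcpy(out, block + lo, n);
        return n;
    }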
Optionally, in this embodiment of the present invention, the L1 cache may be a dedicated instruction cache (caching instructions only), and the lower-level caches may cache data and instructions together (containing both data cache blocks and instruction cache blocks); as an example of a three-level cache architecture, the L1 cache, as an instruction cache, stores instructions only in instruction cache blocks, while the L2 cache and the L3 cache (an optional implementation of the lower-level cache) store data in data cache blocks and instructions in instruction cache blocks;
as an alternative example, the data cache block and the instruction cache block each comprise: a Tag field, a data field and an ECC field; the data field of the data cache block can record data information, and the data field of the instruction cache block can record instruction information; an ECC (Error Correcting Code) field may be used to provide ECC protection for the data cache block and the instruction cache block, in another implementation, the instruction cache block may use parity protection, and an ECC space corresponding to the ECC field of the instruction cache block may be left free to store other information.
As an optional example of the disclosure of the embodiment of the present invention, before the instruction cache hit prediction unit obtains the second next fetch start address, the embodiment of the present invention may filter the unnecessary second next fetch start address, so as to reduce the second next fetch start address where instruction prefetching is unnecessary, and reduce the possibility of polluting the L1 cache.
It can be understood that executing the instruction prefetch process according to the second next fetch start address only pays off in reducing the redirection overhead when a branch prediction error actually leads to a redirection; if the branch prediction is correct, excessive instruction prefetch operations on the second next fetch start address would reduce the space utilization of the L1 cache and pollute its storage space (for example, the L1 cache would hold too many instructions that will not be used in the short term);
the embodiment of the invention can set the filtering condition, and filter the second next instruction fetching starting address output by the branch prediction unit when the filtering condition is met, so that the instruction cache hit prediction unit can obtain the filtered second next instruction fetching starting address, thereby reducing the possibility that the instruction cache hit prediction unit obtains the unnecessary second next instruction fetching starting address.
As an alternative implementation, fig. 8 shows a further architecture block diagram of the processor provided in the embodiment of the present invention, and in conjunction with fig. 5 and fig. 8, a filtering unit 300 may also be provided in the embodiment of the present invention, where the filtering unit 300 may be implemented by a logic circuit unit integrated in the processor core or residing outside the processor core;
in an embodiment of the present invention, the second next fetch start address output by the branch prediction unit 101 may be fed to the filter unit 300; the filtering unit 300 may filter the second next fetch start address output by the branch prediction unit when the filtering condition is satisfied; as one example, the filter condition may indicate at least that the current likelihood of correctness of the branch prediction meets a predetermined likelihood of correctness condition; the current correct possibility of the branch prediction reaches a preset correct possibility condition, which indicates that the current branch prediction is very correct, and the second next-step instruction fetch initial address output by the branch prediction unit is filtered, so that unnecessary instruction prefetching of the prediction unit hit by an instruction cache can be reduced;
the instruction cache hit prediction unit 1022 may fetch the second next fetch start address filtered by the filter unit 300 to perform the instruction prefetch process.
For example, the current likelihood of correctness for the branch prediction meeting the predetermined likelihood of correctness condition may include:
the current confidence state of the branch prediction being higher than a set confidence state, where the higher the current confidence state of the branch prediction, the more likely the branch prediction is currently to be correct;
furthermore, the embodiment of the invention can define a set confidence state, and when the current confidence state of the branch prediction is not higher than the set confidence state, the branch prediction has higher error probability at present, and redirection is most likely to occur; and if the current confidence level state of the branch prediction is higher than the set confidence level state, the possibility that the branch prediction is correct at present is considered to be higher, the possibility of redirection is lower, and at the moment, the instruction prefetching process does not need to be executed according to the second next instruction fetching starting address.
Optionally, in addition to using the current correct possibility of branch prediction to reach the predetermined correct possibility condition as the filtering condition, the embodiment of the present invention may further set the following filtering condition:
the second next step is that the fetch starting address and the current fetch starting address correspond to the same instruction cache block; it can be understood that one instruction cache block may store a plurality of instructions, and when the second next instruction fetch start address and the current instruction fetch start address correspond to the same instruction cache block, the second next instruction fetch start address and the current instruction fetch start address belong to the same basic block (cache line or cache block) of the instruction cache, at this time, even if the branch prediction is wrong, after the instruction prefetch process is executed according to the current instruction fetch start address, the instruction cache block corresponding to the current instruction fetch start address stored in the L1 cache may correspond to the second next instruction fetch start address, and at this time, it is not necessary to execute the instruction prefetch process according to the second next instruction fetch start address.
Optionally, a method for filtering the second next fetch start address may be as shown in fig. 9; fig. 9 is only one optional filtering method, and the embodiment of the present invention may also support using only one of the following filtering conditions: the second next fetch start address and the current fetch start address correspond to the same instruction cache block, or the current confidence state of the branch prediction is higher than the set confidence state;
referring to fig. 9, the method may include:
step S10, determining whether the second next fetch start address and the current fetch start address correspond to the same instruction cache block, if yes, performing step S11, and if no, performing step S12.
Optionally, in the embodiment of the present invention, the tag field matching the second next fetch start address and the tag field matching the current fetch start address may be looked up; if the two tag fields are identical, the second next fetch start address and the current fetch start address are considered to correspond to the same instruction cache block; otherwise, they are considered not to correspond to the same instruction cache block;
optionally, in another implementation, the embodiment of the present invention may also set a fetch start address range corresponding to each tag field; if the second next fetch start address and the current fetch start address both fall within the fetch start address range corresponding to the same tag field, they are considered to correspond to the same instruction cache block; otherwise, they are considered not to correspond to the same instruction cache block.
And step S11, filtering the second next instruction fetching starting address.
Step S12, determining whether the current confidence level state of the branch prediction is higher than the set confidence level state, if so, performing step S11, and if not, performing step S13.
Optionally, for each branch prediction performed by the branch prediction unit, the branch prediction result carries information such as the branch prediction direction (taken or not taken), and also carries information about the strength of the confidence state of the branch prediction; taking the simplest 2-bit saturating counter as an example, its four states correspond to four confidence states of the branch prediction: strongly taken, weakly taken, weakly not taken, and strongly not taken; optionally, among these four confidence states, the states at the two ends have the highest confidence (i.e., strongly taken and strongly not taken have the highest confidence), and the states in the middle have the lowest confidence (i.e., weakly taken and weakly not taken have the lowest confidence); that is, the confidence decreases from the end states toward the middle states.
If the current confidence state of the branch prediction is higher than the set confidence state, the branch prediction is currently likely to be correct and redirection is unlikely, so instruction prefetching does not need to be performed according to the second next fetch start address, and the second next fetch start address may be filtered out; if the current confidence state of the branch prediction is not higher than the set confidence state, the branch prediction currently has a relatively high probability of being wrong and redirection is likely, so the second next fetch start address is not filtered out (a sketch of this filtering flow is given after the description of steps S10 to S13 below).
And step S13, reserving the second next instruction fetching starting address.
The retained second next fetch start address, i.e., the second next fetch start address remaining after filtering, may be obtained by the instruction cache hit prediction unit, so that the instruction cache hit prediction unit performs the instruction prefetch process based on the obtained second next fetch start address.
Alternatively, steps S10 to S13 may be performed by the filter unit.
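The following Python sketch models the filtering flow of steps S10 to S13, assuming a 2-bit saturating counter as the source of the confidence state; the state encoding, the 64-byte block size, and treating the two "strong" end states as being above the set confidence state are illustrative assumptions rather than details taken from the patent.

from enum import IntEnum

class Counter2Bit(IntEnum):
    # States of a 2-bit saturating branch-prediction counter.
    STRONG_NOT_TAKEN = 0
    WEAK_NOT_TAKEN = 1
    WEAK_TAKEN = 2
    STRONG_TAKEN = 3

def is_high_confidence(state):
    # The end states carry the highest confidence; the middle states the lowest.
    return state in (Counter2Bit.STRONG_NOT_TAKEN, Counter2Bit.STRONG_TAKEN)

def should_filter(second_next_addr, current_addr, confidence_state, block_size=64):
    # Return True to filter out (drop) the second next fetch start address.
    # S10 -> S11: filter if both addresses fall in the same instruction cache block.
    if (second_next_addr // block_size) == (current_addr // block_size):
        return True
    # S12 -> S11: filter if the confidence state is above the set confidence state.
    if is_high_confidence(confidence_state):
        return True
    # S13: otherwise retain the address for prefetching.
    return False

# A weakly-taken prediction to a different cache block is retained (step S13).
assert should_filter(0x2000, 0x2010, Counter2Bit.STRONG_TAKEN)      # same block
assert should_filter(0x3000, 0x2000, Counter2Bit.STRONG_NOT_TAKEN)  # high confidence
assert not should_filter(0x3000, 0x2000, Counter2Bit.WEAK_TAKEN)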
As an optional example of the disclosure of the embodiment of the present invention, the instruction cache hit prediction unit needs to execute the instruction prefetch process according to the current fetch start address and according to the second next fetch start address at the same time; accordingly, the frequency of port accesses to the instruction cache hit prediction unit rises, and port access conflicts at the instruction cache hit prediction unit are likely to occur;
in order to reduce port access conflicts at the instruction cache hit prediction unit, the embodiment of the present invention may use a queuing mechanism for the second next fetch start address; optionally, fig. 10 shows another architecture block diagram of the processor according to the embodiment of the present invention, and in conjunction with fig. 5 and fig. 10, a prediction queue 400 may further be provided to queue the second next fetch start address output by the branch prediction unit 101; the prediction queue 400 may be implemented by logic circuit units integrated within the processor core or residing outside the processor core;
as shown in FIG. 10, the branch prediction unit may output a second next fetch start address that is not selected each time the branch prediction unit predicts, and the second next fetch start address output by the branch prediction unit may be queued in the prediction queue 400;
accordingly, the instruction cache hit prediction unit 1022 may obtain a second next fetch start address queued in the prediction queue, and perform an instruction prefetch process according to the obtained second next fetch start address.
As an alternative implementation, the instruction cache hit prediction unit 1022 may have a port that is used to obtain a fetch start address for instruction prefetching; when the path through which this port obtains the current fetch start address has an access gap (i.e., the port is momentarily idle), a second next fetch start address may be obtained from the prediction queue through the port, so that the instruction prefetch process is executed according to the obtained second next fetch start address.
Optionally, the instruction cache hit prediction unit 1022 may obtain the second next fetch start address from the prediction queue through the port according to the queuing order of the second next fetch start addresses in the prediction queue. Of course, a priority may also be set for each second next fetch start address queued in the prediction queue, and the instruction cache hit prediction unit 1022 may obtain, through the port, the second next fetch start address with the highest priority in the prediction queue; that is, the higher the priority of a second next fetch start address, the earlier it is obtained by the instruction cache hit prediction unit. Optionally, in one optional example, the lower the confidence of the branch prediction, the higher the priority that may be assigned to the corresponding second next fetch start address; that is, the more likely the branch prediction is to be wrong, the earlier the corresponding second next fetch start address is obtained by the instruction cache hit prediction unit for instruction prefetching.
Optionally, the cases in which the path through which the port of the instruction cache hit prediction unit obtains the current fetch start address has an access gap include, but are not limited to: the port currently has no current fetch start address to obtain, or there is an idle period after the port has obtained the current fetch start address and the corresponding instruction prefetching has been performed.
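To illustrate how the prediction queue could be drained through this single port only when it has an access gap, here is a Python sketch; the entry layout, the port_busy flag, and the two draining policies (plain queuing order versus lowest-confidence-first priority, mirroring the options described above) are illustrative assumptions.

from collections import deque

class PredictionQueue:
    # Queue of (second_next_fetch_addr, confidence) entries awaiting prefetch.
    def __init__(self):
        self.entries = deque()

    def push(self, addr, confidence):
        self.entries.append((addr, confidence))

    def pop_for_prefetch(self, port_busy, by_priority=False):
        # Return an address to prefetch, or None when the port has no access gap.
        if port_busy or not self.entries:
            return None
        if by_priority:
            # Lower branch-prediction confidence -> higher priority, since such
            # predictions are the most likely to cause a redirection.
            entry = min(self.entries, key=lambda e: e[1])
            self.entries.remove(entry)
        else:
            entry = self.entries.popleft()  # plain queuing (FIFO) order
        return entry[0]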
Optionally, if the prediction queue is full (that is, the prediction queue has no empty slot and the number of second next fetch start addresses queued in it has reached the upper limit), a second next fetch start address that has not yet been added to the prediction queue may be discarded, for example the second next fetch start address newly output by the branch prediction unit is discarded, until the prediction queue again has an empty slot; optionally, on the other hand, in the embodiment of the present invention, when the prediction queue is full, the second next fetch start address newly output by the branch prediction unit may instead overwrite an entry of the prediction queue, for example an old second next fetch start address in the prediction queue is discarded (for example, the second next fetch start address that has been queued the longest, or any second next fetch start address in the prediction queue, or the like), so that the second next fetch start address newly output by the branch prediction unit is added to the prediction queue;
it can be understood that, as an optional implementation, the branch prediction unit may output, for each branch prediction, the second next fetch start address that was not selected (that is, each branch prediction may newly produce a second next fetch start address), and each newly output second next fetch start address may request to enter the prediction queue 400 for queuing, so that the instruction cache hit prediction unit 1022 can subsequently obtain it and prefetch instructions; however, when the prediction queue is full, the second next fetch start address newly output by the branch prediction unit may have to wait a long time before entering the prediction queue, and the corresponding redirection may already have occurred during the waiting time.
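A compact Python sketch of the two full-queue policies discussed above follows; the list-based queue, the fixed capacity, and the policy names are simplifying assumptions.

def enqueue_second_next_addr(queue, addr, capacity, policy="drop_new"):
    # Insert a newly produced second next fetch start address into the queue.
    # "drop_new": when full, discard the new address.
    # "overwrite_oldest": when full, discard the entry queued the longest so the
    # newest address still gets a chance to be prefetched before its redirection.
    if len(queue) < capacity:
        queue.append(addr)
        return True
    if policy == "overwrite_oldest":
        queue.pop(0)          # drop the oldest queued address
        queue.append(addr)
        return True
    return False              # "drop_new": the new address is discarded

# With capacity 2, a third address is either dropped or displaces the oldest entry.
q = [0x100, 0x140]
assert enqueue_second_next_addr(q, 0x180, capacity=2) is False
assert enqueue_second_next_addr(q, 0x180, capacity=2, policy="overwrite_oldest") is True
assert q == [0x140, 0x180]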
In an alternative implementation, the instruction cache hit prediction unit may have only one port (physical port); of course, it may also have multiple ports, and the number of ports actually provided may be determined according to the actual situation and the acceptable hardware overhead; in general, adding ports increases the hardware overhead of the instruction cache hit prediction unit.
As an alternative example of the disclosure of the embodiment of the present invention, the filtering unit shown in fig. 8 and the prediction queue shown in fig. 10 may be used in combination, so that the advantages of the filtering unit and the advantages of the prediction queue may be combined.
Optionally, fig. 11 shows another architecture block diagram of the processor according to the embodiment of the present invention; in fig. 11, the filtering unit 300 may filter out the second next fetch start address output by the branch prediction unit when the filtering condition is satisfied, and the second next fetch start address remaining after filtering by the filtering unit 300 may enter the prediction queue 400; the prediction queue 400 may queue the filtered second next fetch start addresses; the instruction cache hit prediction unit 1022 may obtain a second next fetch start address from the prediction queue through the port when the path through which the port obtains the current fetch start address has an access gap, so as to execute the instruction prefetch process according to the obtained second next fetch start address;
the specific contents of the filtering unit 300 can be described with reference to the corresponding parts above, and are not described herein again; similarly, the details of the prediction queue 400 and the details of the instruction cache hit prediction unit 1022 obtaining the second next fetch start address from the prediction queue 400 may refer to the corresponding parts described above, and are not described herein again.
As an alternative example, fig. 12 illustrates optional method steps of an instruction prefetching method provided by an embodiment of the present invention, and the method shown in fig. 12 may be executed by a processor; optionally, the method shown in fig. 12 may be executed by logic circuit units provided in the processor (i.e., the method is executed by hardwired logic circuits), and in some cases, the method shown in fig. 12 may also be executed by logic circuit units in the processor controlled by program code (i.e., the method is executed by a combination of software and hardware).
Referring to fig. 12, the method may include:
in step S20, the branch prediction unit outputs the second next fetch start address.
Optionally, step S20 may be performed by the branch prediction unit; in addition to the conventional branch prediction result (such as the first next fetch start address and the current fetch start address), the branch prediction unit also outputs the second next fetch start address that was not selected by the branch prediction; the branch prediction direction corresponding to the second next fetch start address is opposite to the branch prediction direction corresponding to the first next fetch start address obtained by branch prediction (a minimal sketch of producing both addresses is given after the description of the steps of fig. 12).
Step S21, determining whether the second next fetch start address and the current fetch start address correspond to the same instruction cache block, if yes, performing step S22, and if no, performing step S23.
And step S22, filtering the second next instruction fetching starting address.
Step S23, determining whether the current confidence level state of the branch prediction is higher than the set confidence level state, if so, performing step S22, and if not, performing step S24.
And step S24, feeding the filtered second next instruction fetching start address to the prediction queue.
Alternatively, steps S21 through S24 may be performed by the filter unit.
Step S25, determine whether the prediction queue is full, if yes, execute step S26, otherwise execute step S27.
And step S26, discarding the filtered second next instruction-fetching start address.
Alternatively, steps S25 and S26 may be performed by the prediction queue.
Optionally, in this embodiment of the present invention, when the prediction queue is full, a second next fetch start address that has not yet been added to the prediction queue is discarded, until the prediction queue again has an empty slot.
As an example, when the prediction queue is full, discarding the second next fetch start address that has not yet joined the prediction queue may be: discarding the newly filtered second next fetch start address as shown in FIG. 11 when the prediction queue is full;
as an alternative, if no filtering unit is used, when the prediction queue is full, discarding the second next fetch start address that has not yet been added to the prediction queue may be: when the prediction queue is full, the second next fetch start address newly output by the branch prediction unit as shown in FIG. 10 is discarded.
And step S27, when the path through which the port obtains the current fetch start address has an access gap, obtaining a second next fetch start address from the prediction queue through the port.
Optionally, the port may be used to fetch a fetch start address for instruction prefetching.
Step S28, execute the instruction prefetch process according to the obtained second next instruction fetch start address.
Alternatively, steps S27 and S28 may be performed by the instruction cache hit prediction unit;
optionally, the optional implementation of step S27 may refer to the corresponding description above, and is not described herein again; in an alternative example, the instruction cache hit prediction unit may obtain the second next fetch start address from the prediction queue through the port based on a queuing order of the second next fetch start address queued in the prediction queue or a priority of the second next fetch start address queued in the prediction queue.
Optionally, the optional implementation of step S28 may refer to the corresponding description above, and is not described herein again.
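As referenced in the description of step S20 above, the following Python sketch shows how a branch prediction could emit both the selected first next fetch start address and the unselected second next fetch start address; computing the fall-through path from a fixed fetch block size and the parameter names are illustrative assumptions.

def predict_next_fetch_addresses(current_fetch_addr, predicted_taken,
                                 predicted_target, fetch_block_size=64):
    # Returns (first_next_fetch_addr, second_next_fetch_addr).
    # The first address follows the predicted direction and feeds the next branch
    # prediction; the second follows the opposite direction and is only used for
    # instruction prefetching in case a redirection occurs.
    fall_through = current_fetch_addr + fetch_block_size  # sequential (not-taken) path
    if predicted_taken:
        return predicted_target, fall_through
    return fall_through, predicted_target

# Example: a branch at 0x4000 predicted taken to target 0x5000.
first, second = predict_next_fetch_addresses(0x4000, True, 0x5000)
assert (first, second) == (0x5000, 0x4040)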
While various embodiments of the present invention have been described above, the various alternatives described in the various embodiments can be combined and cross-referenced with one another without conflict to extend the range of possible embodiments, all of which may be regarded as embodiments disclosed by the present invention.
The scheme provided by the embodiment of the present invention can obtain the second next fetch start address that was not selected by branch prediction and execute the instruction prefetch process according to that address; when the branch prediction is wrong and redirection occurs, the second next fetch start address corresponds to the redirected fetch start address, so executing the instruction prefetch process according to the second next fetch start address can improve the probability that the redirected instruction fetch hits in the first-level cache when the subsequent redirection occurs, improving the instruction prefetching effect and reducing the redirection overhead.
Optionally, the embodiment of the present invention may further set a filtering condition and/or a queuing mechanism; by setting the filtering condition, unnecessary instruction prefetching for the second next fetch start address can be reduced, reducing the possibility of polluting the L1 cache; by setting a queuing mechanism for the second next fetch start address, port access conflicts of the instruction cache hit prediction unit can be reduced, so that port resources are used reasonably and effectively.
With reference to the drawings provided above, the processor provided in an embodiment of the present invention is described below from the point of view of the logic of the processor core; for details, reference may be made to the corresponding parts of the description above. It should be noted that the logic referred to in the embodiments of the present invention may be logic circuit units in the processor core.
As an alternative implementation, a processor provided in an embodiment of the present invention may include at least one processor core, where the processor core may include at least the following logic:
an instruction cache hit prediction unit for obtaining a second next fetch start address; the obtained branch prediction direction corresponding to the second next fetch starting address is opposite to the branch prediction direction corresponding to the first next fetch starting address obtained by branch prediction; and executing an instruction prefetching process according to the obtained second next instruction fetching starting address.
Optionally, the processor core may further include logic to:
the filtering unit is used for filtering the second next-step instruction-fetching starting address output by the branch prediction unit when the filtering condition is met; the filter condition at least indicates that the current correct possibility of the branch prediction reaches a predetermined correct possibility condition, or that a second next fetch start address output by the branch prediction unit corresponds to the same instruction cache block as the current fetch start address.
Optionally, the condition that the current correct possibility of the branch prediction reaches the predetermined correct possibility includes:
the current confidence state of the branch prediction is higher than the set confidence state, wherein the higher the current confidence state of the branch prediction, the higher the current likelihood of correctness of the branch prediction.
Optionally, the processor core may further include logic to:
and the prediction queue is used for queuing the filtered second next fetch start address.
Optionally, the prediction queue may further include logic to:
when the prediction queue is full, discarding a second next-step instruction-fetching start address which is not added into the prediction queue;
it will be appreciated that the logic of the prediction queue may be further configured with logic to discard the second next fetch start address that has not yet been added to the prediction queue when the prediction queue is full.
Optionally, the instruction cache hit prediction unit may include logic to:
when the path through which the port obtains the current fetch start address has an access gap, a second next fetch start address is obtained from the prediction queue through the port; the port is used to obtain a fetch start address for instruction prefetching;
it will be appreciated that the instruction cache hit prediction unit may set logic to fetch the second next fetch start address by the implementation described above.
Optionally, the instruction cache hit prediction unit may include logic to:
judging whether the obtained second next-step instruction-fetching initial address is hit in the first-level cache; if the obtained second next instruction fetch starting address is not hit in the first-level cache, generating an instruction prefetching request corresponding to the obtained second next instruction fetch starting address; the instruction prefetch request is used for requesting a lower-level cache to move an instruction corresponding to the second next-step fetch starting address to the first-level cache; the lower level cache is lower in hierarchy than the first level cache.
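To make the hit-prediction and prefetch-request logic above concrete, here is a Python sketch; modeling the first-level cache contents as a set of block-aligned addresses and using an issue_prefetch callback to stand in for the request to the lower-level cache are simplifying assumptions.

class InstructionCacheHitPredictionUnit:
    # Sketch: check the L1 instruction cache for the second next fetch start address
    # and, on a predicted miss, request the block from the lower-level cache.
    def __init__(self, block_size=64):
        self.block_size = block_size
        self.l1_blocks = set()  # block-aligned addresses assumed resident in the L1 cache

    def prefetch_if_miss(self, second_next_fetch_addr, issue_prefetch):
        # Returns True if a prefetch request was generated, False on a predicted hit.
        block_base = (second_next_fetch_addr // self.block_size) * self.block_size
        if block_base in self.l1_blocks:
            return False                # predicted L1 hit: no prefetch needed
        issue_prefetch(block_base)      # ask the lower-level cache to move the block to L1
        self.l1_blocks.add(block_base)  # record that the block is being filled
        return True

# Example: the lower-level cache request is modeled as a plain callback.
unit = InstructionCacheHitPredictionUnit()
requests = []
unit.prefetch_if_miss(0x7004, requests.append)
assert requests == [0x7000]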
For the detailed implementation of the logic circuit units described above, reference may be made to the corresponding descriptions in the foregoing parts, which are not repeated here. The various alternatives described above in connection with the various embodiments can be combined and cross-referenced without conflict and are considered disclosed embodiments of the present invention.
The processor provided by the embodiment of the present invention may include at least one processor core, where the processor core may include at least: logic to implement the instruction prefetch method provided by embodiments of the present invention; the specific form of the logic of the processor core is not limited to the foregoing description, and any logic that can implement the instruction prefetching method provided by the embodiment of the present invention is within the scope of the present invention.
Although the embodiments of the present invention have been disclosed, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (19)

1. An instruction prefetch method, comprising:
acquiring a second next instruction fetching starting address; the obtained branch prediction direction corresponding to the second next fetch starting address is opposite to the branch prediction direction corresponding to the first next fetch starting address obtained by branch prediction; the first next-step fetch starting address is the next-step fetch starting address selected by the branch prediction unit and is used as the input of next branch prediction; the second next instruction fetch starting address is a next instruction fetch starting address which is not selected by the branch prediction unit, and the second next instruction fetch starting address is used as a redirected instruction fetch starting address and is used as an input of branch prediction when redirection occurs;
and executing an instruction prefetching process according to the obtained second next instruction fetching starting address.
2. The method of claim 1, wherein prior to fetching the second next fetch start address, the method further comprises:
when the filtering condition is met, filtering a second next-step instruction-fetching starting address output by the branch prediction unit; the filter condition at least indicates that the current correct possibility of the branch prediction reaches a predetermined correct possibility condition, or that a second next fetch start address output by the branch prediction unit corresponds to the same instruction cache block as the current fetch start address.
3. The instruction prefetch method of claim 2, wherein the current likelihood of correctness of the branch prediction meeting a predetermined likelihood of correctness condition comprises:
the current confidence state of the branch prediction is higher than the set confidence state, wherein the higher the current confidence state of the branch prediction, the higher the current likelihood of correctness of the branch prediction.
4. The method of claim 2 or 3, wherein said fetching a second next fetch start address comprises:
and acquiring a filtered second next instruction-fetching starting address.
5. The method of claim 2, wherein after filtering the second next fetch start address output by the branch prediction unit, the method further comprises:
queuing the filtered second next fetch start address in the prediction queue.
6. The method of claim 1, wherein prior to fetching the second next fetch start address, the method further comprises:
the second next fetch start address output by the branch prediction unit is queued in the prediction queue.
7. The method of claim 5 or 6, wherein said fetching a second next fetch start address comprises:
when the path through which the port obtains the current fetch start address has an access gap, a second next fetch start address is obtained from the prediction queue through the port; the port is used to obtain a fetch start address for instruction prefetching.
8. The method of claim 7, wherein said fetching a second next fetch start address from a prediction queue via said port comprises:
acquiring a second next-step instruction-fetching starting address from the prediction queue according to the queuing sequence of the second next-step instruction-fetching starting address in the prediction queue through the port;
and/or acquiring a second next-step instruction-fetching start address with the highest priority in the prediction queue through the port, wherein the lower the current confidence of branch prediction is, the higher the priority of the corresponding second next-step instruction-fetching start address is.
9. The instruction prefetch method according to claim 5 or 6, further comprising:
when the prediction queue is full, discarding a second next fetch start address that has not yet been added to the prediction queue.
10. The method of claim 7, wherein the performing an instruction prefetch process according to the obtained second next fetch start address comprises:
judging whether the obtained second next-step instruction-fetching initial address is hit in the first-level cache;
if the obtained second next instruction fetch starting address is not hit in the first-level cache, generating an instruction prefetching request corresponding to the obtained second next instruction fetch starting address; the instruction prefetch request is used for requesting a lower-level cache to move an instruction corresponding to the second next-step fetch starting address to the first-level cache; the lower level cache is lower in hierarchy than the first level cache.
11. The method of claim 10, wherein the determining whether the obtained second next fetch start address hits in the first-level cache comprises:
if the obtained second next instruction fetch starting address is not matched with the Tag field of the instruction cache block of the first-level cache, determining that the obtained second next instruction fetch starting address misses the first-level cache;
the instruction prefetch request is specifically configured to request a lower-level cache to move an instruction cache block corresponding to a Tag field that matches the obtained second next-step instruction fetch start address to the first-level cache.
12. A processor comprising at least one processor core, the processor core comprising at least the following logic:
an instruction cache hit prediction unit for obtaining a second next fetch start address; the obtained branch prediction direction corresponding to the second next fetch starting address is opposite to the branch prediction direction corresponding to the first next fetch starting address obtained by branch prediction; executing an instruction prefetching process according to the obtained second next instruction fetching starting address;
the first next fetch starting address is the next fetch starting address selected by the branch prediction unit and is used as the input of the next branch prediction; the second next instruction fetch starting address is the next instruction fetch starting address which is not selected by the branch prediction unit, and the second next instruction fetch starting address is used as the redirected instruction fetch starting address and is used as the input of branch prediction when redirection occurs.
13. The processor of claim 12, wherein the processor core further comprises logic to:
the filtering unit is used for filtering the second next-step instruction-fetching starting address output by the branch prediction unit when the filtering condition is met; the filter condition at least indicates that the current correct possibility of the branch prediction reaches a predetermined correct possibility condition, or that a second next fetch start address output by the branch prediction unit corresponds to the same instruction cache block as the current fetch start address.
14. The processor of claim 13, wherein the current likelihood of correctness for the branch prediction meeting a predetermined likelihood of correctness condition comprises:
the current confidence state of the branch prediction is higher than the set confidence state, wherein the higher the current confidence state of the branch prediction, the higher the current likelihood of correctness of the branch prediction.
15. The processor of claim 13 or 14, wherein the processor core further comprises logic to:
and the prediction queue is used for queuing the filtered second next fetch start address.
16. The processor of claim 15, wherein the prediction queue further comprises logic to:
when the prediction queue is full, discarding a second next fetch start address that has not yet been added to the prediction queue.
17. The processor of claim 15, wherein the instruction cache hit prediction unit comprises logic to:
when the path through which the port obtains the current fetch start address has an access gap, a second next fetch start address is obtained from the prediction queue through the port; the port is used to obtain a fetch start address for instruction prefetching.
18. The processor of claim 12 or 17, wherein the instruction cache hit prediction unit comprises logic to:
judging whether the obtained second next-step instruction-fetching initial address is hit in the first-level cache;
if the obtained second next instruction fetch starting address is not hit in the first-level cache, generating an instruction prefetching request corresponding to the obtained second next instruction fetch starting address; the instruction prefetch request is used for requesting a lower-level cache to move an instruction corresponding to the second next-step fetch starting address to the first-level cache; the lower level cache is lower in hierarchy than the first level cache.
19. A processor, comprising at least one processor core, the processor core comprising at least: logic that implements the method of any of claims 1-11.
CN201910985665.5A 2019-04-30 2019-10-17 Instruction prefetching method and processor Active CN110825442B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019103622470 2019-04-30
CN201910362247 2019-04-30

Publications (2)

Publication Number Publication Date
CN110825442A CN110825442A (en) 2020-02-21
CN110825442B true CN110825442B (en) 2021-08-06

Family

ID=69549716

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910985665.5A Active CN110825442B (en) 2019-04-30 2019-10-17 Instruction prefetching method and processor

Country Status (1)

Country Link
CN (1) CN110825442B (en)


Also Published As

Publication number Publication date
CN110825442A (en) 2020-02-21
