CN112230992A - Instruction processing device comprising branch prediction loop, processor and processing method thereof

Info

Publication number
CN112230992A
Authority
CN
China
Prior art keywords
instruction
loop body
conditional branch
candidate
stored
Prior art date
Legal status
Granted
Application number
CN201910636741.1A
Other languages
Chinese (zh)
Other versions
CN112230992B (en
Inventor
刘东启
陈晨
Current Assignee
Hangzhou C Sky Microsystems Co Ltd
Original Assignee
Hangzhou C Sky Microsystems Co Ltd
Application filed by Hangzhou C Sky Microsystems Co Ltd filed Critical Hangzhou C Sky Microsystems Co Ltd
Priority to CN201910636741.1A priority Critical patent/CN112230992B/en
Publication of CN112230992A publication Critical patent/CN112230992A/en
Application granted granted Critical
Publication of CN112230992B publication Critical patent/CN112230992B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 9/30149: Instruction analysis, e.g. decoding, instruction word fields of variable length instructions
    • G06F 9/3806: Instruction prefetching for branches, e.g. hedging, branch folding, using address prediction, e.g. return stack, branch history buffer
    • G06F 9/3844: Speculative instruction execution using dynamic branch prediction, e.g. using branch history tables
    • G06F 9/3851: Instruction issuing, e.g. dynamic instruction scheduling or out-of-order instruction execution, from multiple instruction streams, e.g. multistreaming
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Advance Control (AREA)

Abstract

The invention discloses an instruction fetching method, which comprises the following steps: when the currently read instruction is a backward conditional branch instruction, determining whether a candidate instruction loop body indicated by the backward conditional branch instruction is consistent with an instruction loop body stored in a cache; if the candidate instruction loop body is consistent with the stored instruction loop body, the instruction to be executed is fetched from the stored instruction loop body in the cache; and if the candidate instruction loop body and the stored instruction loop body are not consistent, storing the candidate instruction loop body in a cache. The invention also discloses an instruction processing device for executing the method and a system on chip comprising the instruction processing device.

Description

Instruction processing device comprising branch prediction loop, processor and processing method thereof
Technical Field
The present invention relates to the field of processors, and more particularly, to processor cores and processors including a loop body buffer.
Background
Modern processors or processor cores process instructions in a pipelined manner. A typical pipeline includes stages such as instruction fetch, instruction decode, instruction issue, instruction execution, and instruction retirement. A processor typically contains a dedicated instruction fetch unit that fetches instructions and provides them to the next stage of the pipeline. To keep the next pipeline stage continuously supplied with instructions, the processor usually provides an instruction cache from which the instruction fetch unit fetches instructions one after another. Fetching instructions from an external storage device or from the instruction cache consumes energy, and reducing the power consumption of the instruction fetch unit and the instruction cache is a problem the art is constantly addressing.
In many processing procedures, the same processing steps need to be repeatedly performed multiple times in order to realize a specific function. Accordingly, in a processor, it is often necessary to repeatedly process an instruction fragment consisting of a plurality of instructions, and such an instruction fragment is referred to in the art as an instruction loop body. If the instruction loop body can be monitored and cached, the power consumption of fetching instructions from an external storage device or instruction cache can be reduced. Therefore, it is common in present day processors to provide a dedicated cache unit for caching the instruction loop body, and such a cache unit is referred to as a loop body buffer.
However, in one existing instruction loop body scheme, the loop body in the loop body buffer is cleared as soon as the loop terminates. In practice there is a high probability that a previously executed loop body will need to be executed again shortly after the loop is exited, so this prior-art handling causes the same loop body to be refilled needlessly. In addition, conventional instruction loop body schemes generally do not handle jump prediction for branch jump instructions inside the loop body. For a branch jump instruction issued from the loop body buffer, the determination of the subsequent instruction stream is therefore prone to error and must be corrected in later pipeline stages, so branch jump instructions from the loop body buffer cannot be processed efficiently.
Therefore, a new instruction loop body scheme is needed, which can solve or alleviate the above problems, and improve the processing efficiency of the instruction loop body, thereby reducing the power consumption of the processor as a whole.
Disclosure of Invention
To this end, the present invention provides a new instruction processing apparatus, processor and instruction processing method in an attempt to solve or at least alleviate at least one of the problems identified above.
According to an aspect of the present invention, there is provided an instruction fetch method comprising the steps of: when the currently read instruction is a backward conditional branch instruction, determining whether a candidate instruction loop body indicated by the backward conditional branch instruction is consistent with an instruction loop body stored in a cache; if the candidate instruction loop body is consistent with the stored instruction loop body, the instruction to be executed is fetched from the stored instruction loop body in the cache; and if the candidate instruction loop body is inconsistent with the stored instruction loop body, storing the candidate instruction loop body in a cache.
Optionally, in the instruction fetching method according to the present invention, information of the backward conditional branch instruction at the end of the instruction loop body is stored in the cache. The step of determining whether the candidate instruction loop body and the stored instruction loop body are consistent comprises: it is determined whether the address of the currently read backward conditional branch instruction and the address of the stored backward conditional branch instruction are consistent.
Optionally, the instruction fetching method according to the present invention further comprises the steps of: determining a length of the candidate instruction loop body before storing the candidate instruction loop body; and when the length of the candidate instruction loop body is larger than a first preset value, the candidate instruction loop body is not stored in the cache, and the currently read instruction is used as the instruction to be executed.
Optionally, the instruction fetching method according to the present invention further comprises the steps of: determining a number of consecutive executions of a candidate instruction loop body before storing the candidate instruction loop body; and when the number of continuous execution times of the candidate instruction loop body is smaller than a second predetermined value, not storing the candidate instruction loop body in the cache, and using the currently read instruction as the instruction to be executed.
Optionally, in the instruction fetching method according to the present invention, determining the length of the candidate instruction loop body includes: determining the length of the candidate instruction loop body according to the address of the currently read backward conditional branch instruction and the address of the jump target of the backward conditional branch instruction; and the step of determining the number of consecutive executions of the candidate instruction loop body comprises: recording the number of consecutive executions of the currently read backward conditional branch instruction to serve as the number of consecutive executions of the candidate instruction loop body.
Optionally, in the instruction fetching method according to the present invention, the step of storing the candidate instruction loop body in the cache includes: fetching instructions starting from the jump target of the currently read backward conditional branch instruction, and storing the fetched instructions into the cache until the fetched instruction is the backward conditional branch instruction.
Optionally, in the instruction fetching method according to the present invention, the step of storing the candidate instruction loop body further includes: if the fetched instruction is a conditional branch instruction, storing information related to the conditional branch instruction, wherein the information related to the conditional branch instruction comprises one or more of: the jump target address and jump direction prediction information of the conditional branch instruction.
Optionally, in the instruction fetching method according to the present invention, the step of storing the candidate instruction loop body further includes: if the read instruction indicates an abnormal condition, clearing the instructions currently stored in the cache and not storing the candidate instruction loop body; wherein the abnormal condition comprises one or more of: the fetched instruction is an instruction that is not allowed in the loop body; the fetched instruction is a conditional branch instruction and the indicated jump target address is out of the range of the loop body; the fetched instruction is a conditional branch instruction and the number of conditional branch instructions in the loop body exceeds a predetermined threshold.
Optionally, in the instruction fetching method according to the present invention, the step of storing the candidate instruction loop body further includes: storing, in association with the candidate instruction loop body, information related to the backward conditional branch instruction, wherein the related information comprises the address of the backward conditional branch instruction, the jump target address, and a jump direction prediction.
Optionally, in the instruction fetching method according to the present invention, the step of fetching an instruction to be executed from an instruction loop body stored in a cache includes: when the fetched instruction to be executed is a conditional branch instruction, the jump direction of the conditional branch instruction is also predicted from the stored jump direction prediction information of the conditional branch instruction.
Optionally, in the instruction fetching method according to the present invention, the instruction processing apparatus further includes a branch history table. The branch history table stores the jump history information of the conditional branch instruction; and predicting a jump direction of the conditional branch instruction comprises: the jump direction of the conditional branch instruction is predicted based on jump direction prediction information of the conditional branch instruction stored in a cache and jump history information of the conditional branch instruction stored in a branch history table.
Optionally, in the instruction fetching method according to the present invention, the cache comprises a buffer register.
According to an aspect of the present invention, there is provided an instruction processing apparatus. The instruction processing apparatus includes: an instruction fetch unit adapted to fetch instructions from a storage device coupled to an instruction processing apparatus; a normal instruction fetch unit coupled to the instruction fetch unit for fetching the instruction obtained by the instruction fetch unit as an instruction to be executed; and a loop body buffer, coupled to the instruction fetch unit and the normal fetch unit, adapted to store an instruction loop body having a backward conditional branch instruction at an end, wherein the loop body buffer is adapted to determine whether a candidate instruction loop body indicated by the backward conditional branch instruction is consistent with an instruction loop body already stored in the loop body buffer when the instruction currently fetched by the instruction fetch unit is the backward conditional branch instruction; if the candidate instruction loop body is consistent with the stored instruction loop body, the instruction to be executed is fetched from the instruction loop body stored in the loop body buffer; and storing the candidate instruction loop body in the loop body buffer if the candidate instruction loop body and the stored instruction loop body do not coincide.
According to another aspect of the invention, a system on a chip is provided, comprising an instruction processing apparatus or a processor according to the invention.
According to yet another aspect of the invention, an intelligent device is provided, comprising a system on chip according to the invention.
According to the scheme of the invention, the previous instruction loop body is retained in the loop body buffer, and when a new instruction loop body is the same as the previously stored one, the stored instruction loop body is used directly to issue instructions, which reduces the time spent refilling the instruction loop body and improves the utilization of the loop body buffer.
In addition, in the scheme of the invention, the filling operation of the instruction loop body is only carried out when a new instruction loop body meets a certain condition, for example, the length of the loop body is within a preset range or the loop body is repeated at least for preset times, so that the power consumption caused by mis-filling is reduced, and the use efficiency of the loop body buffer is further improved.
In addition, according to the scheme of the invention, when the conditional branch instruction is sent out from the loop body buffer, the jump direction prediction of the branch instruction is also sent out, and the prediction is made based on the loop body internal information and the historical execution record of the conditional branch instruction, so that the accuracy is higher, and the execution efficiency of the instruction can be improved.
Drawings
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description when read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.
FIG. 1 shows a schematic diagram of an instruction processing apparatus according to one embodiment of the invention;
FIG. 2 illustrates a schematic diagram of an instruction loop body, according to one embodiment of the invention;
FIG. 3 illustrates a schematic diagram of an instruction fetch unit, according to one embodiment of the invention;
FIG. 4 illustrates a schematic diagram of an instruction fetch unit, according to yet another embodiment of the present invention;
FIG. 5 illustrates a flow diagram of a method of instruction fetching according to one embodiment of the invention;
FIG. 6 shows a schematic diagram of a processor, according to an embodiment of the invention;
FIG. 7 shows a schematic diagram of a computer system according to one embodiment of the invention; and
FIG. 8 shows a schematic diagram of a system on chip (SoC) according to one embodiment of the invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 is a schematic diagram of an instruction processing apparatus 100 according to one embodiment of the invention. In some embodiments, instruction processing apparatus 100 may be a processor, a processor core of a multi-core processor, or a processing element in an electronic system.
As shown in FIG. 1, instruction processing apparatus 100 includes an instruction fetch unit 130. Instruction fetch unit 130 may fetch instructions to be processed from cache 110, memory 120, or other sources and send them to decode unit 140. Instructions fetched by instruction fetch unit 130 include, but are not limited to, high-level machine instructions, macro instructions, or the like. The processing apparatus 100 performs certain functions by executing these instructions. The instructions fetched by instruction fetch unit 130 may include portions that are repeatedly executed, and these portions constitute an instruction loop body. FIG. 2 shows a schematic diagram of an instruction sequence having an instruction loop body 200. As shown in FIG. 2, each rectangular box represents an instruction. A "sequential instruction" is an instruction executed in order; after it is executed, the instruction following it is executed. "BT XXXX" is a conditional branch instruction that, when executed, jumps to the location labeled XXXX to continue execution if the condition is met. In FIG. 2, since the location of the label "LABEL_1" precedes the instruction "BT LABEL_1", the instruction "BT LABEL_1" is a backward conditional branch instruction, i.e., when the condition is satisfied it jumps back to LABEL_1 and execution continues from there. Similarly, the instruction "BT LABEL_2" is a forward conditional branch instruction, because the location of the label LABEL_2 follows the instruction "BT LABEL_2". Thus, as long as the condition is satisfied, all instructions from the label "LABEL_1" to "BT LABEL_1" are executed repeatedly and constitute the instruction loop body 200. An instruction loop body is therefore characterized by a backward conditional branch instruction: the backward conditional branch instruction marks the end of the loop body, and the address of its jump target "LABEL_1" marks the start of the loop body.
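As a concrete illustration (this example is not part of the patent text), the following C++ sketch shows a source-level loop whose compiled form typically has the structure of FIG. 2: the loop body ends with a backward conditional branch, and an ordinary if statement after the loop corresponds to a forward conditional branch.

    // Illustrative only: the do-while loop below typically compiles to a body that
    // ends in a backward conditional branch, i.e. an instruction loop body as in FIG. 2.
    #include <cstdint>

    int64_t sum_clamped(const int32_t* data, int n) {
        if (n <= 0) return 0;
        int64_t acc = 0;
        int i = 0;
        // LABEL_1:                 <- jump target, start of the instruction loop body
        do {
            acc += data[i];         //    sequential instructions inside the loop body
            ++i;
        } while (i < n);            // BT LABEL_1: backward conditional branch, end of the loop body
        if (acc < 0) {              // an if after the loop compiles to a forward conditional
            acc = 0;                //    branch (like BT LABEL_2) that skips ahead
        }
        return acc;
    }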
Decode unit 140 receives incoming instructions from instruction fetch unit 130 and decodes the instructions to generate low-level micro-operations, microcode entry points, micro-instructions, or other low-level instructions or control signals that reflect or are derived from the received instructions. The low-level instructions or control signals may operate at a low level (e.g., circuit level or hardware level) to implement the operation of high-level instructions. Decoding unit 140 may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, microcode, look-up tables, hardware implementations, Programmable Logic Arrays (PLAs). The present invention is not limited to various mechanisms for implementing decoding unit 140, and any mechanism that can implement decoding unit 140 is within the scope of the present invention.
These decoded instructions are then sent to execution unit 150 and executed by execution unit 150. Execution unit 150 includes circuitry operable to execute instructions. Execution unit 150, when executing these instructions, receives data input from and generates data output to register set 170, cache 110, and/or memory 120.
In one embodiment, the register set 170 includes architectural registers, also referred to as registers. Unless otherwise specified or clearly evident, the phrases architectural register, register set, and register are used herein to refer to registers that are visible (e.g., software visible) to software and/or programmers and/or that are specified by macro-instructions to identify operands. These registers are different from other non-architected registers in a given microarchitecture (e.g., temporary registers, reorder buffers, retirement registers, etc.). According to one embodiment, the register set 170 may include a set of vector registers 175, where each vector register 175 may be 512 bits, 256 bits, or 128 bits wide, or may use a different vector width. Optionally, the register set 170 may also include a set of general purpose registers 176. General purpose registers 176 may be used when an execution unit executes an instruction, such as to store jump conditions and the like.
To avoid obscuring the description, a relatively simple instruction processing apparatus 100 has been shown and described. It should be understood that other embodiments may have more than one execution unit. For example, the apparatus 100 may include a plurality of different types of execution units, such as, for example, an arithmetic unit, an Arithmetic Logic Unit (ALU), an integer unit, a floating point unit, and so forth. Other embodiments of an instruction processing apparatus or processor may have multiple cores, logical processors, or execution engines.
Fig. 3 shows a schematic diagram of instruction fetch unit 130 in instruction processing apparatus 100 shown in fig. 1. While the instruction fetch unit 130 of FIG. 3 is designed accordingly to enable caching and processing of the instruction loop body of FIG. 2, it should be understood that the partitioning of the various components in the instruction fetch unit 130 of FIG. 3 is functional and may be rearranged and combined for physical implementation without departing from the scope of the present invention.
As shown in FIG. 3, instruction fetch unit 130 includes an instruction reading unit 210. The instruction reading unit 210 fetches instructions from an external storage device coupled to the instruction processing apparatus 100. As described above with reference to FIG. 1, the storage devices include cache 110, memory 120, or other sources. The instruction reading unit 210 determines the instruction address to be read from a PC (program counter) and reads the instruction from the external storage device. According to one embodiment, instruction reading unit 210 may not fetch instructions directly from the external storage device, but may fetch them from cache 110 after the instructions have been cached there. The present invention is not limited to a specific implementation of the instruction reading unit 210; all ways in which instructions can be fetched from an external storage device (either cache 110 or memory 120) according to a PC are within the scope of the present invention.
Normal fetch unit 220 is coupled to instruction fetch unit 210. The instruction fetched by the instruction fetch unit 210 is sent to the normal fetch unit 220, so as to be fetched by the normal fetch unit 220, and sent to the decode unit 140 as a subsequent instruction to be processed, i.e. an instruction to be decoded.
The loop body buffer 230 is coupled to the instruction fetch unit 210 and the normal fetch unit 220. The instructions fetched by the instruction fetch unit 210 are also sent to the loop body buffer 230. The loop body buffer 230 determines whether the currently fetched instruction triggers the loop body buffer 230 to issue an instruction from the stored instruction loop body. If the currently fetched instruction does not trigger an instruction issue from the instruction loop body, the loop body buffer 230 instructs the normal fetch unit 220 to issue the currently fetched instruction as an instruction to be subsequently processed. If the currently fetched instruction triggers the issue of an instruction from the instruction loop body, the loop body buffer 230 retrieves the instruction from the stored instruction loop body as an instruction to be subsequently processed for issue to the decode unit 140.
According to one embodiment, instructions fetched by instruction fetch unit 210 may be pre-processed before being sent to normal fetch unit 220. Preprocessing includes, but is not limited to, instruction encapsulation, etc., so that instructions may be subsequently processed in decode unit 140.
Optionally, instruction fetch unit 130 further includes a selection unit 240. The selection unit 240 is coupled to the loop body buffer 230 and the normal instruction fetch unit 220, and selects an instruction to be subsequently sent to the decode unit 140 for subsequent processing from the instructions sent from the loop body buffer 230 and the normal instruction fetch unit 220. According to one embodiment, when an instruction is issued from the loop body buffer 230, the instruction issued from the loop body buffer 230 is selected as a subsequent instruction to be processed. When no instruction is issued from the loop body buffer 230, the instruction issued from the normal instruction fetch unit 220 is selected as the instruction to be processed subsequently.
It should be understood that the processing logic in the selection unit 240 is illustrative. For example, according to another embodiment, the normal instruction fetch unit 220 sends the currently fetched instruction to the selection unit 240 only if the loop body buffer 230 indicates that the normal instruction fetch unit 220 is to issue instructions, so the selection unit 240 obtains instructions from only one source at a time, i.e., either the loop body buffer 230 or the normal instruction fetch unit 220, as instructions to be subsequently processed.
The present invention is not limited to a specific implementation of the selection unit 240. As long as the selection unit 240 selects the instruction issued by the normal instruction fetch unit 220 as the instruction to be processed subsequently when the loop body buffer 230 issues no instruction, and selects the instruction issued by the loop body buffer 230 when the loop body buffer 230 does issue an instruction, any implementation of the selection unit 240 is within the scope of the present invention.
The loop body buffer 230 stores an instruction loop body. As described above with reference to FIG. 2, the instruction loop body comprises the instructions starting from the jump target of a backward branch instruction up to and including that backward branch instruction; that is, the instruction loop body is characterized by a backward conditional branch instruction at its end (hereinafter referred to as the ending backward conditional branch instruction). Thus, while storing the instructions of the instruction loop body, loop body buffer 230 also stores, as characteristic information of the instruction loop body, information associated with the ending backward conditional branch instruction, including but not limited to the address of the conditional branch instruction, the jump target and its address, a jump direction prediction, and the like.
According to one embodiment, loop body buffer 230 includes an instruction cache 232 and an information cache 234. Instruction cache 232 stores the instructions of the instruction loop body, and information cache 234 stores characteristic information associated with the instruction loop body stored in instruction cache 232. The characteristic information associated with the instruction loop body includes information of the ending backward conditional branch instruction of the instruction loop body, information of each conditional branch instruction within the instruction loop body (including but not limited to the address of the conditional branch instruction, the jump target and its address, a jump direction prediction, and the like), the number of times the instruction loop body has been executed in a loop, and so on.
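As a hedged illustration of this organization, the contents of loop body buffer 230 might be modeled as below; the field names and layout are assumptions for illustration, not details taken from the patent. Later sketches in this description reuse these two types.

    #include <cstdint>
    #include <vector>

    // Information cache 234 entry: characteristic information of one conditional branch.
    struct BranchInfo {
        uint64_t addr = 0;              // address of the conditional branch instruction
        uint64_t target = 0;            // its jump target address
        bool     predict_taken = false; // jump direction prediction
    };

    // Hypothetical model of loop body buffer 230.
    struct LoopBodyBuffer {
        std::vector<uint32_t> instructions;     // instruction cache 232: the loop body instructions
        BranchInfo ending_branch;               // the ending backward conditional branch
        std::vector<BranchInfo> inner_branches; // conditional branches inside the loop body
        uint64_t loop_count = 0;                // how many times the loop body has executed in a loop
        bool valid = false;                     // whether a loop body is currently stored
    };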
It should be understood that the division of the instruction cache 232 and the information cache 234 is illustrative and that the instruction cache 232 and the information cache 234 may be merged and re-divided as desired without departing from the scope of the present invention.
Also included in the loop body buffer 230 is a control unit 236. The control unit 236 controls the operation of the loop body buffer 230, so as to control the issue of instructions from the loop body buffer 230 according to the currently read instruction. Optionally, the control unit 236 may also control the operation of the normal fetch unit 220 coupled to the loop body buffer 230. Thus, the control unit 236 may be disposed not only in the loop body buffer 230 but anywhere in instruction fetch unit 130, and may control the operation of the entire instruction fetch unit 130.
It should be appreciated that, for loop body buffer 230, the storage in which the instructions and their associated information are kept may take the form of, for example, a cache, a buffer, or even a register. The invention is not limited to a particular implementation of this storage; as long as the chosen form can supply instructions at high speed to the next stage of the instruction processing pipeline, it is within the scope of the invention.
The operations performed by the various components in instruction fetch unit 130 of instruction processing apparatus 100, in particular under the control of control unit 236, are further described below in conjunction with the instruction fetching method 500 shown in FIG. 5.
The method 500 begins at step S510. In step S510, it is determined whether the current instruction read by the instruction reading unit 210 is a backward conditional branch instruction. If the instruction is not a backward conditional branch instruction, the normal instruction fetch unit 220 is instructed in step S520 to issue the current instruction read by the instruction reading unit 210 as the subsequent instruction to be decoded. According to one embodiment, when the instruction is not a backward conditional branch instruction, loop body buffer 230 may remain inactive to reduce power consumption.
If the currently fetched instruction is a backward conditional branch instruction, then the instruction is likely to be a characteristic instruction of some instruction loop body. Therefore, in step S530, it is determined whether the candidate instruction loop body indicated by the instruction coincides with the instruction loop body already stored in the loop body buffer 230.
According to one embodiment, when an instruction loop body is stored in loop body buffer 230, characteristic information associated with the instruction loop body is also stored, including but not limited to information of the ending backward conditional branch instruction of the instruction loop body. Whether the candidate loop body is consistent with the instruction loop body stored in loop body buffer 230 can therefore be determined by comparing information (e.g., the instruction address) associated with the currently read backward conditional branch instruction with information (e.g., the instruction address) associated with the ending backward conditional branch instruction of the stored instruction loop body.
If the candidate instruction loop body is identical to the instruction loop body stored in the loop body buffer 230, for example, the address of the currently read backward conditional branch instruction is the same as the address of the stored end backward conditional branch instruction, then in step S540, the instruction is fetched from the instruction loop body stored in the loop body buffer 230 to be sent to the decode unit 140 as the instruction to be processed.
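Continuing the hypothetical LoopBodyBuffer sketch above, the consistency check of step S530 reduces to an address comparison, for example:

    // Hedged sketch of step S530: the candidate loop body indicated by the currently
    // read backward conditional branch matches the stored loop body when the branch
    // addresses coincide.
    bool matches_stored_loop(const LoopBodyBuffer& buf, uint64_t current_branch_addr) {
        return buf.valid && buf.ending_branch.addr == current_branch_addr;
    }
    // If it matches, instructions are issued from the buffer (step S540); otherwise the
    // candidate loop body becomes a candidate for filling (step S550), subject to the
    // checks of steps S570 and S580 described below.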
When an instruction to be subsequently processed is issued from loop body buffer 230, other components in instruction fetch unit 130, such as instruction fetch unit 210 and normal instruction fetch unit 220, may be stalled without affecting instruction fetching. Therefore, according to an embodiment of the present invention, when an instruction to be subsequently processed is sent out from the loop body buffer 230, the operations of the modules such as the instruction fetch unit 210 and the normal instruction fetch unit 220 can be stopped in order to reduce the power consumption of the instruction fetch unit 130.
If the candidate instruction loop body does not match the instruction loop body stored in the loop body buffer 230, or no instruction loop body is stored in the loop body buffer 230, the candidate instruction loop body is stored in the loop body buffer 230 in step S550. Meanwhile, the method 500 continues to step S520, where the normal instruction fetch unit 220 sends the current instruction read by the instruction reading unit 210 to the decode unit 140 as the subsequent instruction to be decoded.
Subsequently, when it is determined that instructions no longer need to be issued from the loop body buffer 230, the loop body buffer 230 is instructed to exit the instruction-issuing state, and the method returns to step S510 to read a new instruction and monitor for a new candidate instruction loop body. According to one embodiment of the present invention, when a conditional branch prediction error occurs during instruction processing in the instruction processing apparatus 100 and the whole pipeline is flushed, the instruction processing apparatus 100 issues a signal to terminate instruction issue from the loop body buffer, so that the instruction fetch unit 130, upon receiving the signal, determines that instructions no longer need to be issued from the loop body buffer 230. According to another embodiment of the present invention, when the loop body buffer 230 issues the ending backward conditional branch instruction and the corresponding jump prediction result indicates that the branch will not jump, it is determined that instructions no longer need to be issued from the loop body buffer 230, and the loop body buffer 230 is instructed to exit the instruction-issuing state.
According to one embodiment, when it is determined in step S530 that the candidate loop body and the stored instruction loop body are not consistent, step S550 is not executed immediately to store the candidate loop body in the loop body buffer 230; instead, it is first determined whether the candidate instruction loop body satisfies a predetermined condition, and step S550 is executed to store the candidate loop body only when the candidate instruction loop body satisfies the predetermined condition.
To this end, the method 500 further comprises step S570. In step S570, the length of the candidate instruction loop body is determined, and it is determined whether the length is greater than a length threshold. If the length of the candidate instruction loop body is greater than the length threshold, the candidate instruction loop body is not stored, and step S520 is executed to instruct the normal instruction fetch unit 220 to send the currently read instruction as the subsequent instruction to be decoded. According to one embodiment, the length threshold is the upper limit on the number of instructions the loop body buffer 230 can store: if the length of the candidate instruction loop body exceeds this upper limit, the candidate loop body cannot be stored in the loop body buffer 230. The present invention is not limited to a particular value of the length threshold, and any way of setting the length threshold in consideration of the capacity of the loop body buffer 230 is within the scope of the present invention.
According to one embodiment, the length of the candidate instruction loop body may be determined from the address of the currently read backward conditional branch instruction and the address of the target location to which the branch instruction jumps. For example, the length of the loop body may be taken as the distance between the address of the branch instruction and the jump target address, or the number of instructions in the instruction loop body may be derived from this distance and the space occupied by each instruction and used as the loop body length.
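A minimal sketch of this length check (step S570); the capacity and instruction size below are assumed values, not taken from the patent:

    #include <cstdint>

    constexpr uint64_t kLoopBufferCapacityBytes = 64;  // assumed capacity of loop body buffer 230
    constexpr uint64_t kInstructionBytes = 4;          // assumed fixed instruction size

    // Step S570 sketch: the loop body spans from the jump target up to and including
    // the backward conditional branch instruction itself.
    bool loop_body_fits(uint64_t branch_addr, uint64_t target_addr) {
        uint64_t length_bytes = branch_addr - target_addr + kInstructionBytes;
        return length_bytes <= kLoopBufferCapacityBytes;
    }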
The method 500 further includes step S580. In step S580, the number of consecutive executions of the candidate loop body indicated by the current conditional branch instruction is determined, and it is determined whether this number reaches a threshold number of times. If the number of consecutive executions does not reach the threshold, the candidate instruction loop body is not stored, and the method jumps to step S520 to instruct the normal instruction fetch unit 220 to send the currently read instruction as the subsequent instruction to be decoded.
According to one embodiment, the number threshold may be set according to the actual application. If the number threshold is set to a small value, a candidate instruction loop body is stored into the loop body buffer 230 sooner and the loop body buffer 230 can begin issuing instructions earlier, which speeds up the start-up of the loop body buffer 230. At the same time, however, a smaller number threshold also increases the probability that a filled candidate instruction loop body will turn out to be unsuitable, for example because it is not hit again afterwards.
According to one embodiment, the number of times the currently read backward conditional branch instruction has been repeatedly executed may be stored in the loop body buffer 230. Each time the currently read backward conditional branch instruction is processed, the corresponding repeat count is incremented by 1 and compared with the number threshold. If no repeat count is yet stored for the backward conditional branch instruction, the repeat count is set to 1 in the loop body buffer 230. According to one embodiment, once it has been determined that the candidate loop body indicated by the current conditional branch instruction is to be stored in the loop body buffer 230 as the instruction loop body, the repeat count for that backward conditional branch instruction no longer needs to be maintained.
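A hedged sketch of the repeat-count check of step S580; the threshold value and the helper names are assumptions:

    #include <cstdint>
    #include <unordered_map>

    constexpr unsigned kRepeatThreshold = 3;  // assumed "second predetermined value"

    struct LoopCandidateTracker {
        // branch address -> consecutive taken count for that backward conditional branch
        std::unordered_map<uint64_t, unsigned> taken_count;

        // Returns true once the candidate loop body has repeated often enough to be filled.
        bool should_fill(uint64_t branch_addr) {
            unsigned& n = taken_count[branch_addr];  // defaults to 0 if unseen; first increment yields 1
            ++n;
            return n >= kRepeatThreshold;
        }
    };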
It should be noted that the execution order of step S570 and step S580 may be changed as needed: for example, the determination of step S580 may be performed before that of step S570, the two may be performed simultaneously, or only one of them may be performed, without departing from the scope of the present invention. The processing of step S550 is performed when it is determined in step S570 that the length of the candidate loop body is not greater than the length threshold and/or when it is determined in step S580 that the number of consecutive executions of the candidate loop body reaches the number threshold.
According to one embodiment of the present invention, in step S550, in order to store the candidate instruction loop body in the loop body buffer 230, the address of the jump target of the currently read backward conditional branch instruction is first determined (step S551), and then the instruction fetch unit 210 is instructed to fetch instructions from the jump target position (step S552) and store the fetched instructions in the loop body buffer 230 (in particular, the instruction cache 232) in order until the instruction to be stored is the backward conditional branch instruction (step S553).
After the candidate instruction loop body has been stored in the loop body buffer 230, information about the candidate instruction loop body, in particular information about the backward conditional branch instruction at its end, may also be stored in step S554. As described above, the information related to the backward conditional branch instruction includes the address of the branch instruction, the jump target and its address, a jump direction prediction, and so on. Optionally, such related information may be stored in the information cache 234 of the loop body buffer 230.
According to one embodiment, step S550 further includes step S555, in which it is determined whether the instruction read in step S552 is a conditional branch instruction (including forward and backward conditional branch instructions), and if the read instruction is a conditional branch instruction, then in step S556, information related to the conditional branch instruction, such as an address of the conditional branch instruction, a jump target and its address, and a jump direction prediction, may also be stored, for example, in the information cache 234. Thus, when the conditional branch instruction is subsequently sent out from the loop body buffer 230, the information related to the instruction can be sent out at the same time, thereby improving the processing efficiency of the instruction in the subsequent pipeline.
Optionally, step S550 further includes step S557. In step S557, it is determined whether the instruction read in step S552 indicates an abnormal condition. If it does, then in step S558 all instructions of the candidate instruction loop body currently stored in the loop body buffer 230 are cleared, the related conditional branch information stored in the information cache is cleared, the processing of step S550 is not continued, and the method jumps to step S520 to instruct the normal instruction fetch unit 220 to send the currently read instruction as the subsequent instruction to be decoded.
According to one embodiment, an instruction read in step S552 indicates an abnormal condition when the read instruction is one that is not allowed in an instruction loop body. In this case, the candidate instruction loop body is evidently not a valid loop body and therefore cannot be cached.
According to another embodiment, an instruction read in step S552 indicates an abnormal condition when the read instruction is a conditional branch instruction whose indicated jump target address is out of the range of the candidate instruction loop body. In this case, the jump target of the read instruction falls outside the candidate instruction loop body, so the earlier identification of the candidate instruction loop body was incorrect and the candidate loop body cannot be cached.
According to yet another embodiment, an instruction read in step S552 indicates an abnormal condition when the read instruction is a conditional branch instruction and the number of conditional branch instructions in the candidate loop body exceeds a predetermined threshold. If the candidate loop body contains too many conditional branch instructions, the related information that would need to be stored may exceed the capacity of the information cache 234, which would prevent the loop body buffer 230 from working properly, and the candidate loop body is therefore not cached.
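The following sketch condenses the fill procedure of step S550 (S551 to S558) described above. It reuses the hypothetical LoopBodyBuffer and BranchInfo types sketched earlier; the helper names, the fixed 4-byte instruction size and the branch-count limit are assumptions rather than details from the patent.

    #include <cstddef>
    #include <cstdint>

    // Assumed decoded view of a fetched instruction.
    struct Insn {
        uint32_t raw = 0;
        uint64_t addr = 0;
        bool is_cond_branch = false;
        uint64_t target = 0;           // valid only for conditional branches
        bool allowed_in_loop = true;   // some instruction types are not allowed in a loop body
    };

    constexpr size_t kMaxInnerBranches = 4;  // assumed limit on stored branch-info entries

    // fetch_at stands in for instruction reading unit 210; any callable returning an Insn works.
    template <typename FetchFn>
    bool fill_loop_body(LoopBodyBuffer& buf, const Insn& ending_branch, FetchFn fetch_at) {
        buf = LoopBodyBuffer{};                      // start with an empty candidate
        uint64_t pc = ending_branch.target;          // S551/S552: start fetching at the jump target
        while (true) {
            Insn insn = fetch_at(pc);
            // S557/S558: abandon the candidate on any abnormal condition.
            bool target_out_of_range = insn.is_cond_branch &&
                (insn.target < ending_branch.target || insn.target > ending_branch.addr);
            bool too_many_branches = insn.is_cond_branch &&
                buf.inner_branches.size() >= kMaxInnerBranches;
            if (!insn.allowed_in_loop || target_out_of_range || too_many_branches) {
                buf = LoopBodyBuffer{};              // clear everything stored so far
                return false;
            }
            buf.instructions.push_back(insn.raw);    // S552/S553: store in instruction cache 232
            if (insn.addr == ending_branch.addr) {   // reached the ending backward conditional branch
                // S554: record its information in information cache 234.
                buf.ending_branch = BranchInfo{insn.addr, insn.target, true};
                buf.valid = true;
                return true;
            }
            if (insn.is_cond_branch) {               // S555/S556: record inner branch information
                buf.inner_branches.push_back(BranchInfo{insn.addr, insn.target, false});
            }
            pc += 4;                                 // assumed fixed 4-byte instruction size
        }
    }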
According to the above instruction fetching scheme, the instruction loop body stored in the loop body buffer 230 is not cleared immediately when the loop exits, so the loop body buffer 230 can be used right away to issue instructions the next time the characteristic instruction of that loop body (its backward conditional branch instruction) is read, which speeds up the start-up of the loop body buffer 230. In addition, by imposing predetermined conditions on caching an instruction loop body in the loop body buffer 230, the instruction fetching scheme increases the probability that a cached instruction loop body will actually be used.
FIG. 4 illustrates a schematic diagram of instruction fetch unit 130 according to another embodiment of the present invention. In the instruction fetch unit 130 shown in FIG. 4, components that are the same as those in the instruction fetch unit 130 shown in FIG. 3 are denoted by the same reference numerals and are not described again.
As shown in FIG. 4, instruction fetch unit 130 also includes a storage unit 410. The storage unit 410 stores therein a Branch History Table (BHT). The BHT stores therein jump history information of each conditional branch instruction. According to one embodiment, for a conditional branch instruction, the BHT records the result of whether the conditional branch jumped when executed the previous predetermined number of times. According to one embodiment, the predetermined number of times is 10 times. The historical results of the predetermined number of instruction jumps form a predetermined pattern corresponding to the conditional branch instruction. The predetermined pattern may be used to predict whether the current conditional branch instruction will jump.
According to one embodiment, assume that at the time of executing a conditional branch instruction, a 1 is recorded in the BHT if a jump occurs, and a 0 is recorded in the BHT if no jump occurs. For a conditional branch instruction, a list of 0's and 1's is formed to indicate the jump record for that instruction.
By further processing the jump records stored in the BHT, a correspondence between a predetermined number of jump history entries and the next jump direction can be obtained. For example:
For jump record 1, consisting of 11 jump history entries [1,1,1,0,0,0,0,0,1,1,1], a jump pattern 1 formed from the first 10 entries, [1,1,1,0,0,0,0,0,1,1], can be determined, and the subsequent jump result is a jump.
For jump record 2, consisting of another 11 jump history entries [1,0,1,0,0,1,0,0,1,1,0], a jump pattern 2 formed from the first 10 entries, [1,0,1,0,0,1,0,0,1,1], can be determined, and the subsequent jump result is no jump.
If the recent history of the conditional branch instruction currently to be predicted forms the same pattern as jump pattern 1, the BHT provides a jump prediction result of "jump" for that conditional branch instruction.
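As a hedged illustration (not the patent's implementation), a pattern-based BHT with a 10-entry jump history along the lines of the examples above could be modeled as follows:

    #include <cstdint>
    #include <unordered_map>

    constexpr unsigned kHistoryLen = 10;  // predetermined number of recorded jump outcomes

    // Hypothetical pattern-based branch history table: for each branch it maps the last
    // 10 jump outcomes (1 = jump, 0 = no jump) to the outcome that followed that pattern.
    struct BranchHistoryTable {
        std::unordered_map<uint64_t, std::unordered_map<uint16_t, bool>> pattern_table;
        std::unordered_map<uint64_t, uint16_t> history;  // rolling 10-bit history per branch

        bool predict(uint64_t branch_addr) const {
            auto h = history.find(branch_addr);
            auto t = pattern_table.find(branch_addr);
            if (h == history.end() || t == pattern_table.end()) return true;  // assumed default: jump
            auto p = t->second.find(h->second);
            return p == t->second.end() ? true : p->second;
        }

        // Called with the actual outcome after the branch executes (see the update discussion below).
        void record(uint64_t branch_addr, bool taken) {
            uint16_t& h = history[branch_addr];
            pattern_table[branch_addr][h] = taken;  // learn: this pattern was followed by `taken`
            h = static_cast<uint16_t>(((h << 1) | (taken ? 1u : 0u)) & ((1u << kHistoryLen) - 1u));
        }
    };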
It should be noted that the above description of jump prediction through a jump history stored in a BHT is illustrative, and the present invention is not limited thereto, and all ways in which a BHT can be utilized to provide jump prediction of conditional branch instructions are within the scope of the present invention.
The storage unit 410 is coupled to the instruction reading unit 210 and the normal instruction fetch unit 220. When it is determined that the instruction read by instruction reading unit 210 is a conditional branch instruction (whether a forward or a backward conditional branch instruction), the BHT-based jump prediction result is sent to normal instruction fetch unit 220. In this way, when an instruction to be decoded is issued from normal instruction fetch unit 220 to the next stage of the pipeline, the jump prediction information associated with the conditional branch instruction can be issued at the same time.
The storage unit 410 is also coupled to the loop body buffer 230. When an instruction is issued from loop body buffer 230 and that instruction is a conditional branch instruction, loop body buffer 230 provides jump prediction information for the conditional branch instruction based on both the jump prediction information associated with that conditional branch instruction stored in information cache 234 and the jump prediction provided by the BHT for that conditional branch instruction, which improves the accuracy of the jump prediction.
In addition, the actual jump result of an executed conditional branch instruction may be recorded in the BHT by other components in instruction processing apparatus 100, such as execution unit 150, in order to update the jump history information of that conditional branch instruction.
Optionally, the jump prediction of the stored branch instruction in loop body buffer 230 may also be updated based on the actual jump result, thereby providing a more accurate jump prediction.
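Reusing the BranchInfo and BranchHistoryTable sketches above, one possible way to combine the two prediction sources when a conditional branch is issued from the loop body buffer is shown below; the text does not specify the exact combination rule, so the rule here is purely an assumption.

    // Assumed combination rule: when the BHT and the stored loop-body prediction agree,
    // use that direction; otherwise fall back to the direction stored with the loop body.
    bool predict_from_loop_buffer(const BranchInfo& stored, const BranchHistoryTable& bht) {
        bool bht_taken = bht.predict(stored.addr);
        return (bht_taken == stored.predict_taken) ? bht_taken : stored.predict_taken;
    }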
According to the scheme of the present invention, when the conditional branch instruction is sent out from the loop body buffer 230, the BHT is used to perform prediction, and the jump condition of the branch instruction in the loop body buffer 230 is also recorded in the history information of the BHT, thereby improving the jump prediction accuracy of the BHT and the loop body buffer 230.
Accordingly, in step S540 of the method 500 shown in FIG. 5, when an instruction is issued from loop body buffer 230 and the issued instruction is a conditional branch instruction, a prediction result for that conditional branch instruction may be issued at the same time, taking into account both the jump prediction provided by the BHT and the jump prediction stored in loop body buffer 230.
As described above, the instruction processing apparatus according to the present invention may be implemented as a processor core, and the instruction processing method may be executed in the processor core. Processor cores may be implemented in different processors in different ways. For example, a processor core may be implemented as a general-purpose in-order core for general-purpose computing, a high-performance general-purpose out-of-order core for general-purpose computing, and a special-purpose core for graphics and/or scientific (throughput) computing. While a processor may be implemented as a CPU (central processing unit) that may include one or more general-purpose in-order cores and/or one or more general-purpose out-of-order cores, and/or as a coprocessor that may include one or more special-purpose cores. Such a combination of different processors may result in different computer system architectures. In one computer system architecture, the coprocessor is on a separate chip from the CPU. In another computer system architecture, the co-processor is in the same package as the CPU but on a separate die. In yet another computer system architecture, coprocessors are on the same die as the CPU (in which case such coprocessors are sometimes referred to as special-purpose logic such as integrated graphics and/or scientific (throughput) logic, or as special-purpose cores). In yet another computer system architecture, referred to as a system on a chip, the described CPU (sometimes referred to as an application core or application processor), coprocessors and additional functionality described above may be included on the same die.
Fig. 6 shows a schematic diagram of a processor 1100 according to an embodiment of the invention. As shown by the solid-line blocks in FIG. 6, according to one embodiment, processor 1100 includes a single core 1102A, a system agent unit 1110, and a bus controller unit 1116. As shown by the dashed boxes in FIG. 6, according to another embodiment of the invention, the processor 1100 may further include a plurality of cores 1102A-N, an integrated memory controller unit 1114 residing in the system agent unit 1110, and dedicated logic 1108.
According to one embodiment, processor 1100 may be implemented as a Central Processing Unit (CPU), where dedicated logic 1108 is integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and cores 1102A-N are one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, a combination of both). According to another embodiment, processor 1100 may be implemented as a coprocessor in which cores 1102A-N are a number of special purpose cores for graphics and/or science (throughput). According to yet another embodiment, processor 1100 may be implemented as a coprocessor in which cores 1102A-N are a plurality of general purpose in-order cores. Thus, the processor 1100 may be a general-purpose processor, coprocessor or special-purpose processor, such as, for example, a network or communication processor, compression engine, graphics processor, GPGPU (general purpose graphics processing unit), a high-throughput Many Integrated Core (MIC) coprocessor (including 30 or more cores), embedded processor, or the like. The processor may be implemented on one or more chips. Processor 1100 may be a part of, and/or may be implemented on, one or more substrates using any of a number of processing technologies, such as, for example, BiCMOS, CMOS, or NMOS.
The memory hierarchy includes one or more levels of cache within the cores, one or more shared cache units 1106, and external memory (not shown) coupled to the integrated memory controller unit 1114. The shared cache unit 1106 may include one or more mid-level caches, such as a level two (L2), a level three (L3), a level four (L4), or other levels of cache, a Last Level Cache (LLC), and/or combinations thereof. Although in one embodiment, the ring-based interconnect unit 1112 interconnects the integrated graphics logic 1108, the shared cache unit 1106, and the system agent unit 1110/integrated memory controller unit 1114, the invention is not so limited and any number of well-known techniques may be used to interconnect these units.
The system agent unit 1110 includes those components that coordinate and operate the cores 1102A-N. The system agent unit 1110 may include, for example, a Power Control Unit (PCU) and a display unit. The PCU may include the logic and components needed to adjust the power states of the cores 1102A-N and the integrated graphics logic 1108. The display unit is used to drive one or more externally connected displays.
The cores 1102A-N may have the core architecture described above with reference to fig. 1, and may be homogeneous or heterogeneous in terms of the architecture instruction set; that is, two or more of the cores 1102A-N may be capable of executing the same instruction set, while other cores may be capable of executing only a subset of that instruction set or a different instruction set.
FIG. 7 shows a schematic diagram of a computer system 1200, according to one embodiment of the invention. The computer system 1200 shown in fig. 7 may be applied to laptop devices, desktop devices, handheld PCs, personal digital assistants, engineering workstations, servers, network appliances, network hubs, switches, embedded processors, Digital Signal Processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular telephones, portable media players, handheld devices, and various other electronic devices. The invention is not so limited and all systems that may incorporate the processor and/or other execution logic disclosed in this specification are within the scope of the invention.
As shown in fig. 7, the system 1200 may include one or more processors 1210, 1215. These processors are coupled to a controller hub 1220. In one embodiment, the controller hub 1220 includes a Graphics Memory Controller Hub (GMCH) 1290 and an input/output hub (IOH) 1250 (which may be on separate chips). The GMCH 1290 includes a memory controller and a graphics controller, to which a memory 1240 and a coprocessor 1245 are coupled. The IOH 1250 couples an input/output (I/O) device 1260 to the GMCH 1290. Alternatively, the memory controller and the graphics controller are integrated into the processor, so that the memory 1240 and the coprocessor 1245 are coupled directly to the processor 1210; in this case the controller hub 1220 may include only the IOH 1250.
The optional nature of additional processors 1215 is represented in fig. 7 by dashed lines. Each processor 1210, 1215 may include one or more of the processing cores described herein, and may be some version of the processor 1100 shown in fig. 6.
Memory 1240 may be, for example, Dynamic Random Access Memory (DRAM), Phase Change Memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1220 communicates with the processors 1210, 1215 via a multi-drop bus such as a Front Side Bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1295.
In one embodiment, the coprocessor 1245 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, compression engine, graphics processor, GPGPU, embedded processor, or the like. In one embodiment, controller hub 1220 may include an integrated graphics accelerator.
In one embodiment, processor 1210 executes instructions that control data processing operations of a general type. Embedded in these instructions may be coprocessor instructions. The processor 1210 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1245. Thus, the processor 1210 issues these coprocessor instructions (or control signals representing coprocessor instructions) on a coprocessor bus or other interconnect to coprocessor 1245. Coprocessor 1245 accepts and executes received coprocessor instructions.
FIG. 8 shows a schematic diagram of a system on chip (SoC) 1500 in accordance with one embodiment of the present invention. The system on chip shown in fig. 8 includes the processor 1100 shown in fig. 6, and therefore components like those in fig. 6 have the same reference numerals. As shown in fig. 8, an interconnect unit 1502 is coupled to an application processor 1510, the system agent unit 1110, the bus controller unit 1116, the integrated memory controller unit 1114, one or more coprocessors 1520, a Static Random Access Memory (SRAM) unit 1530, a Direct Memory Access (DMA) unit 1532, and a display unit 1540 for coupling to one or more external displays. The application processor 1510 includes a set of one or more cores 1102A-N and the shared cache unit 1106. The coprocessor 1520 includes integrated graphics logic, an image processor, an audio processor, and a video processor. In one embodiment, the coprocessor 1520 comprises a special-purpose processor, such as, for example, a network or communication processor, compression engine, GPGPU, a high-throughput MIC processor, embedded processor, or the like.
The system on chip (SoC) according to the present invention can be used in various smart devices to implement corresponding functions in those devices. Such smart devices include, but are not limited to, vehicle-mounted devices, smart speakers, smart display devices, IoT devices, mobile terminals, personal digital terminals, and the like, and all systems that may incorporate the system on chip and/or other execution logic disclosed in this specification are within the scope of the invention.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
It should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out such a method or method element thus forms a means for carrying out the method or method element. Furthermore, the elements of the apparatus embodiments described herein are examples of apparatus for implementing the functions performed by those elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (26)

1. A method of instruction fetching comprising the steps of:
when the currently read instruction is a backward conditional branch instruction, determining whether a candidate instruction loop body indicated by the backward conditional branch instruction is consistent with an instruction loop body stored in a cache;
if the candidate instruction loop body is consistent with the stored instruction loop body, taking the instruction to be executed from the instruction loop body stored in the cache; and
storing the candidate instruction loop body in the cache if the candidate instruction loop body and the stored instruction loop body are not consistent.
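For illustration only (not part of the claim language), the following minimal C sketch models the decision of claims 1 and 2, assuming the consistency check is an address comparison and that the stored loop body is summarized by a small record; all names and types are hypothetical.

#include <stdbool.h>
#include <stdint.h>

/* Minimal state assumed for the stored instruction loop body. */
typedef struct {
    bool     valid;               /* a loop body is currently stored        */
    uint64_t backward_branch_pc;  /* address of its ending backward branch  */
} stored_loop_t;

typedef enum { FETCH_FROM_BUFFER, CAPTURE_CANDIDATE } fetch_action_t;

/* Claim 1, with the consistency check of claim 2: compare the address of the
 * currently read backward conditional branch with the stored one. */
fetch_action_t on_backward_cond_branch(const stored_loop_t *stored, uint64_t branch_pc)
{
    if (stored->valid && stored->backward_branch_pc == branch_pc)
        return FETCH_FROM_BUFFER;  /* consistent: take instructions from the cache */
    return CAPTURE_CANDIDATE;      /* inconsistent: store the candidate loop body  */
}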
2. A method of instruction fetching according to claim 1, wherein information of a backward conditional branch instruction at the end of the stored instruction loop body is also stored in the cache, and said step of determining whether the candidate instruction loop body and the stored instruction loop body are consistent comprises:
determining whether the address of the currently read backward conditional branch instruction and the address of the stored backward conditional branch instruction are consistent.
3. An instruction fetch method as claimed in claim 1 or 2, further comprising the steps of:
determining a length of the candidate instruction loop body prior to storing the candidate instruction loop body; and
when the length of the candidate instruction loop body is greater than a first predetermined value, the candidate instruction loop body is not stored in the cache, and the currently fetched instruction is used as the instruction to be executed.
4. A method of instruction fetching according to claim 3, further comprising the step of:
determining a number of consecutive executions of the candidate instruction loop body before storing the candidate instruction loop body; and
when the number of consecutive executions of the candidate instruction loop body is less than a second predetermined value, the candidate instruction loop body is not stored in the cache, and the currently fetched instruction is used as the instruction to be executed.
5. The instruction fetch method of claim 4, wherein the step of determining the length of the candidate instruction loop body comprises: determining the length of the candidate instruction loop body according to the address of the currently read backward conditional branch instruction and the address of the target location to which the backward conditional branch instruction is to jump; and
the step of determining the number of consecutive executions of the candidate instruction loop body comprises: the number of consecutive executions of the currently read backward conditional branch instruction is recorded to serve as the number of consecutive executions of the candidate instruction loop body.
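For illustration only, a C sketch of the gating described in claims 3-5, assuming the length is measured in bytes as the distance between the branch address and its jump target, and assuming concrete values for the first and second predetermined values; both assumptions are made only for this sketch.

#include <stdbool.h>
#include <stdint.h>

/* Assumed thresholds; the claims only call them "first/second predetermined value". */
#define MAX_LOOP_BYTES   256   /* first predetermined value (length limit)         */
#define MIN_CONSEC_EXECS   4   /* second predetermined value (execution count)     */

/* Claim 5: derive the candidate loop body length from the backward branch address
 * and its jump target (the target lies before a backward branch). */
static uint64_t loop_body_length(uint64_t branch_pc, uint64_t target_pc)
{
    return branch_pc - target_pc;
}

/* Claims 3 and 4: only cache the candidate body when it is short enough and has
 * already executed back-to-back often enough; otherwise keep normal fetching. */
bool should_cache_candidate(uint64_t branch_pc, uint64_t target_pc, uint32_t consec_execs)
{
    return loop_body_length(branch_pc, target_pc) <= MAX_LOOP_BYTES &&
           consec_execs >= MIN_CONSEC_EXECS;
}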
6. An instruction fetch method as claimed in any one of claims 1-5, wherein said step of storing said candidate instruction loop body in a cache comprises:
fetching instructions starting from the target location to which the currently read backward conditional branch instruction is to jump, and storing the fetched instructions into the cache until the fetched instruction is the backward conditional branch instruction.
7. An instruction fetch method as claimed in claim 6, wherein the step of storing the candidate instruction loop body further comprises:
if the fetched instruction is a conditional branch instruction, storing information related to the conditional branch instruction, wherein the information includes one or more of: a jump target address and jump direction prediction information of the conditional branch instruction.
8. An instruction fetch method as claimed in claim 6 or 7, wherein the step of storing a candidate instruction loop body further comprises:
if the fetched instruction indicates an abnormal condition, clearing the instructions currently stored in the cache and no longer storing the candidate instruction loop body; and
wherein the abnormal condition comprises one or more of: the fetched instruction is an instruction that is not allowed in the loop body, the fetched instruction is a conditional branch instruction and the indicated jump target address is out of range of the loop body, the fetched instruction is a conditional branch instruction and the number of conditional branch instructions in the loop body exceeds a predetermined threshold.
9. An instruction fetch method as claimed in any one of claims 6-8, wherein the step of storing a candidate instruction loop body further comprises:
storing relevant information for the backward conditional branch instruction in association with the candidate instruction loop body, wherein the relevant information includes an address of the backward conditional branch instruction, a jump target address, and a jump direction prediction.
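For illustration only, the following C sketch combines claims 6-9 into one capture routine operating on a pre-decoded instruction stream that starts at the jump target; the record layout, the buffer capacity and the conditional-branch limit are assumptions, and a real front end would decode instructions on the fly instead of receiving them as an array.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define LOOP_BUF_MAX  64   /* assumed buffer capacity                        */
#define MAX_COND_BR    4   /* assumed "predetermined threshold" of claim 8   */

/* Pre-decoded instruction record (hypothetical). */
typedef struct {
    uint64_t pc;
    uint32_t raw;
    bool     allowed_in_loop;  /* false for instructions not allowed in a loop body */
    bool     is_cond_branch;
    uint64_t branch_target;    /* valid only when is_cond_branch is true            */
} decoded_insn_t;

typedef struct { uint64_t pc, target; bool predict_taken; } br_info_t;

typedef struct {
    uint32_t  insns[LOOP_BUF_MAX];
    size_t    insn_count;
    br_info_t branches[MAX_COND_BR];   /* info of conditional branches inside the body */
    size_t    branch_count;
    br_info_t backward_branch;         /* claim 9: info of the ending backward branch  */
    bool      valid;
} loop_buffer_t;

/* Claims 6-9: store instructions from the jump target up to and including the
 * backward conditional branch, recording info for each conditional branch on the
 * way; on any abnormal condition (claim 8) give up and leave the buffer invalid. */
bool capture_loop_body(loop_buffer_t *buf,
                       const decoded_insn_t *stream, size_t n,  /* starts at the jump target */
                       uint64_t loop_start_pc, uint64_t backward_branch_pc)
{
    buf->insn_count = 0;
    buf->branch_count = 0;
    buf->valid = false;

    for (size_t i = 0; i < n; i++) {
        const decoded_insn_t *d = &stream[i];

        if (!d->allowed_in_loop || buf->insn_count >= LOOP_BUF_MAX)
            return false;                              /* abnormal: abort capture */

        if (d->is_cond_branch && d->pc != backward_branch_pc) {
            if (d->branch_target < loop_start_pc ||    /* target outside the loop body */
                d->branch_target > backward_branch_pc ||
                buf->branch_count >= MAX_COND_BR)      /* too many conditional branches */
                return false;
            buf->branches[buf->branch_count++] =
                (br_info_t){ .pc = d->pc, .target = d->branch_target,
                             .predict_taken = false };
        }

        buf->insns[buf->insn_count++] = d->raw;

        if (d->pc == backward_branch_pc) {             /* reached the backward branch  */
            buf->backward_branch = (br_info_t){ .pc = d->pc, .target = loop_start_pc,
                                                .predict_taken = true };
            buf->valid = true;
            return true;
        }
    }
    return false;                                      /* stream ended without the branch */
}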
10. An instruction fetch method as claimed in any one of claims 1-9, wherein fetching said instruction to be executed from an instruction loop body stored in said cache comprises:
when the fetched instruction to be executed is a conditional branch instruction, the jump direction of the conditional branch instruction is also predicted from the stored jump direction prediction information of the conditional branch instruction.
11. An instruction fetch method as defined in claim 10, wherein the instruction processing apparatus further comprises a branch history table in which jump history information of a conditional branch instruction is stored; and said predicting a jump direction of the conditional branch instruction comprises: predicting the jump direction of the conditional branch instruction according to the jump direction prediction information of the conditional branch instruction stored in the cache and the jump history information of the conditional branch instruction stored in the branch history table.
12. An instruction fetch method as claimed in any one of claims 1-11, wherein said cache comprises a buffer register.
13. An instruction processing apparatus comprising:
an instruction fetch unit adapted to fetch instructions from a storage device coupled to the instruction processing apparatus;
a normal fetch unit coupled to the instruction fetch unit for fetching the instruction obtained by the instruction fetch unit as an instruction to be executed; and
a loop body buffer coupled to the instruction fetch unit and the normal fetch unit and adapted to store an instruction loop body having a backward conditional branch instruction at an end, wherein the loop body buffer is adapted to
when the instruction currently read by the instruction fetch unit is a backward conditional branch instruction, determining whether a candidate instruction loop body indicated by the backward conditional branch instruction is consistent with an instruction loop body already stored in the loop body buffer;
if the candidate instruction loop body is consistent with the stored instruction loop body, taking the instruction to be executed from the instruction loop body stored in the loop body buffer; and
storing the candidate instruction loop body in the loop body buffer if the candidate instruction loop body and the stored instruction loop body are not consistent.
14. An instruction processing apparatus as claimed in claim 13, wherein said loop body buffer is further adapted to store information of backward conditional branch instructions at the end of said stored instruction loop body, an
The loop body buffer is adapted to determine whether a candidate loop body of instructions indicated by the backward conditional branch instruction and a loop body of instructions already stored in the loop body buffer are consistent by determining whether an address of the currently read backward conditional branch instruction and an address of the stored backward conditional branch instruction are consistent.
15. Instruction processing apparatus according to claim 13 or 14, the loop body buffer further adapted to:
determining a length of the candidate instruction loop body prior to storing the candidate instruction loop body; and
when the length of the candidate instruction loop body is greater than a first predetermined value, the candidate instruction loop body is not stored in the loop body buffer.
16. An instruction processing apparatus according to claim 15, wherein the loop body buffer is further adapted to:
determining a number of consecutive executions of the candidate instruction loop body before storing the candidate instruction loop body; and
when the number of consecutive executions of the candidate instruction loop body is less than a second predetermined value, the candidate instruction loop body is not stored in the loop body buffer.
17. An instruction processing apparatus according to claim 16, wherein the loop body buffer is adapted to determine the length of the candidate instruction loop body in dependence on the address of the currently fetched backward conditional branch instruction and the address of the target location to which the backward conditional branch instruction is to jump; and
to record the number of consecutive executions of the currently read backward conditional branch instruction to serve as the number of consecutive executions of the candidate instruction loop body.
18. An instruction processing apparatus as claimed in any one of claims 13 to 17, wherein said loop body buffer storing said candidate instruction loop body comprises:
fetching instructions starting from the target location to which the currently read backward conditional branch instruction is to jump, and storing the fetched instructions into the loop body buffer until the fetched instruction is the backward conditional branch instruction.
19. An instruction processing apparatus as claimed in claim 18, wherein said loop body buffer storing said candidate instruction loop body further comprises:
if the fetched instruction is a conditional branch instruction, storing information related to the conditional branch instruction, wherein the information includes one or more of: the address of the conditional branch instruction, the jump target address, and the jump direction prediction information.
20. An instruction processing apparatus as claimed in claim 18 or 19, wherein said loop body buffer stores a candidate instruction loop body further comprising:
if the fetched instruction indicates an abnormal condition, clearing the instructions currently stored in the loop body buffer, no longer storing the candidate instruction loop body, and instructing the normal fetch unit to use the currently read instruction as the instruction to be executed; and
wherein the abnormal condition comprises one or more of: the fetched instruction is an instruction that is not allowed in the loop body, the fetched instruction is a conditional branch instruction and the indicated jump target address is out of range of the loop body, the fetched instruction is a conditional branch instruction and the number of conditional branch instructions in the loop body exceeds a predetermined threshold.
21. An instruction processing apparatus as claimed in any one of claims 18-20, wherein said loop body buffer stores a candidate instruction loop body further comprising:
storing relevant information for the backward conditional branch instruction in association with the candidate instruction loop body, wherein the relevant information includes an address of the backward conditional branch instruction, a jump target address, and a jump direction prediction.
22. An instruction processing apparatus as claimed in any one of claims 13 to 21, wherein said loop body buffer fetching said instruction to be executed from said stored instruction loop body comprises:
when the fetched instruction to be executed is a conditional branch instruction, the jump direction of the conditional branch instruction is also predicted from the stored jump direction prediction information of the conditional branch instruction.
23. An instruction processing apparatus according to claim 22, further comprising a storage unit storing a branch history table in which jump history information of conditional branch instructions is stored; and
predicting a jump direction of the conditional branch instruction comprises: predicting a jump direction of the conditional branch instruction based on jump direction prediction information of the conditional branch instruction stored in the loop body buffer and jump history information of the conditional branch instruction stored in the branch history table.
24. An instruction processing apparatus as claimed in any one of claims 13 to 23, wherein said loop body buffer comprises:
an instruction cache adapted to store instructions in the instruction loop body; and
an information cache adapted to store the related information of each conditional branch instruction in the instruction loop body.
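For illustration only, claim 24's split into an instruction cache and an information cache could be modelled with plain C structures as below; the field names and capacities are assumptions made for this sketch.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define LOOP_BUF_MAX  64   /* assumed instruction cache capacity          */
#define MAX_COND_BR    4   /* assumed information cache capacity          */

/* Information cache entry: per-conditional-branch bookkeeping. */
typedef struct {
    uint64_t pc;            /* address of the conditional branch  */
    uint64_t target;        /* jump target address                */
    bool     predict_taken; /* stored jump direction prediction   */
} branch_entry_t;

/* Loop body buffer = instruction cache + information cache. */
typedef struct {
    uint32_t       insn_cache[LOOP_BUF_MAX];  /* instructions of the loop body       */
    size_t         insn_count;
    branch_entry_t info_cache[MAX_COND_BR];   /* related info of each cond. branch   */
    size_t         info_count;
    uint64_t       backward_branch_pc;        /* ending backward conditional branch  */
    bool           valid;
} loop_body_buffer_t;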
25. A system on a chip comprising an instruction processing apparatus as claimed in any one of claims 13 to 24.
26. A smart device comprising the system on a chip of claim 25.
CN201910636741.1A 2019-07-15 2019-07-15 Instruction processing device, processor and processing method thereof comprising branch prediction loop Active CN112230992B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910636741.1A CN112230992B (en) 2019-07-15 2019-07-15 Instruction processing device, processor and processing method thereof comprising branch prediction loop

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910636741.1A CN112230992B (en) 2019-07-15 2019-07-15 Instruction processing device, processor and processing method thereof comprising branch prediction loop

Publications (2)

Publication Number Publication Date
CN112230992A true CN112230992A (en) 2021-01-15
CN112230992B CN112230992B (en) 2023-05-23

Family

ID=74111146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910636741.1A Active CN112230992B (en) 2019-07-15 2019-07-15 Instruction processing device, processor and processing method thereof comprising branch prediction loop

Country Status (1)

Country Link
CN (1) CN112230992B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626084A (en) * 2021-09-03 2021-11-09 苏州睿芯集成电路科技有限公司 Method for optimizing instruction stream of extra-large cycle number by TAGE branch prediction algorithm
CN113760366A (en) * 2021-07-30 2021-12-07 浪潮电子信息产业股份有限公司 Method, system and related device for processing conditional jump instruction
CN113946540A (en) * 2021-10-09 2022-01-18 深圳市创成微电子有限公司 DSP processor and processing method for judging jump instruction
CN114116010A (en) * 2022-01-27 2022-03-01 广东省新一代通信与网络创新研究院 Architecture optimization method and device for loop body
CN115495155A (en) * 2022-11-18 2022-12-20 北京数渡信息科技有限公司 Hardware circulation processing device suitable for general processor
CN116048627A (en) * 2023-03-31 2023-05-02 北京开源芯片研究院 Instruction buffering method, apparatus, processor, electronic device and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1107110A2 (en) * 1999-11-30 2001-06-13 Texas Instruments Incorporated Instruction loop buffer
US20050015537A1 (en) * 2003-07-16 2005-01-20 International Business Machines Corporation System and method for instruction memory storage and processing based on backwards branch control information
CN102637149A (en) * 2012-03-23 2012-08-15 山东极芯电子科技有限公司 Processor and operation method thereof
CN104298488A (en) * 2014-09-29 2015-01-21 上海兆芯集成电路有限公司 Loop buffer guided by loop predictor
CN105242904A (en) * 2015-09-21 2016-01-13 中国科学院自动化研究所 Apparatus for processor instruction buffering and circular buffering and method for operating apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1107110A2 (en) * 1999-11-30 2001-06-13 Texas Instruments Incorporated Instruction loop buffer
US20050015537A1 (en) * 2003-07-16 2005-01-20 International Business Machines Corporation System and method for instruction memory storage and processing based on backwards branch control information
CN102637149A (en) * 2012-03-23 2012-08-15 山东极芯电子科技有限公司 Processor and operation method thereof
CN104298488A (en) * 2014-09-29 2015-01-21 上海兆芯集成电路有限公司 Loop buffer guided by loop predictor
CN105242904A (en) * 2015-09-21 2016-01-13 中国科学院自动化研究所 Apparatus for processor instruction buffering and circular buffering and method for operating apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHENG, Weimin et al.: "Computer System Architecture" (《计算机系统结构》), 28 February 2001 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113760366A (en) * 2021-07-30 2021-12-07 浪潮电子信息产业股份有限公司 Method, system and related device for processing conditional jump instruction
CN113760366B (en) * 2021-07-30 2024-02-09 浪潮电子信息产业股份有限公司 Method, system and related device for processing conditional jump instruction
CN113626084A (en) * 2021-09-03 2021-11-09 苏州睿芯集成电路科技有限公司 Method for optimizing instruction stream of extra-large cycle number by TAGE branch prediction algorithm
CN113626084B (en) * 2021-09-03 2023-05-19 苏州睿芯集成电路科技有限公司 Method for optimizing TAGE branch prediction algorithm for instruction stream with oversized cycle number
CN113946540A (en) * 2021-10-09 2022-01-18 深圳市创成微电子有限公司 DSP processor and processing method for judging jump instruction
CN113946540B (en) * 2021-10-09 2024-03-22 深圳市创成微电子有限公司 DSP processor and processing method for judging jump instruction thereof
CN114116010A (en) * 2022-01-27 2022-03-01 广东省新一代通信与网络创新研究院 Architecture optimization method and device for loop body
CN114116010B (en) * 2022-01-27 2022-05-03 广东省新一代通信与网络创新研究院 Architecture optimization method and device for processor cycle body
CN115495155A (en) * 2022-11-18 2022-12-20 北京数渡信息科技有限公司 Hardware circulation processing device suitable for general processor
CN116048627A (en) * 2023-03-31 2023-05-02 北京开源芯片研究院 Instruction buffering method, apparatus, processor, electronic device and readable storage medium

Also Published As

Publication number Publication date
CN112230992B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN112230992B (en) Instruction processing device, processor and processing method thereof comprising branch prediction loop
US10241797B2 (en) Replay reduction by wakeup suppression using early miss indication
US9367471B2 (en) Fetch width predictor
US7437537B2 (en) Methods and apparatus for predicting unaligned memory access
CN104298488A (en) Loop buffer guided by loop predictor
US7596683B2 (en) Switching processor threads during long latencies
CN112579175B (en) Branch prediction method, branch prediction device and processor core
US20080140996A1 (en) Apparatus and methods for low-complexity instruction prefetch system
WO2022187014A1 (en) Loop buffering employing loop characteristic prediction in a processor for optimizing loop buffer performance
JP2010501913A (en) Cache branch information associated with the last granularity of branch instructions in a variable length instruction set
CN110806900B (en) Memory access instruction processing method and processor
EP4020191A1 (en) Alternate path decode for hard-to-predict branch
US7346737B2 (en) Cache system having branch target address cache
US20060095746A1 (en) Branch predictor, processor and branch prediction method
US6823430B2 (en) Directoryless L0 cache for stall reduction
CN112559048B (en) Instruction processing device, processor and processing method thereof
US9395985B2 (en) Efficient central processing unit (CPU) return address and instruction cache
US7178013B1 (en) Repeat function for processing of repetitive instruction streams
EP3905034A1 (en) A code prefetch instruction
CN116302106A (en) Apparatus, method, and system for facilitating improved bandwidth of branch prediction units
US7389405B2 (en) Digital signal processor architecture with optimized memory access for code discontinuity
CN112395000B (en) Data preloading method and instruction processing device
US6957319B1 (en) Integrated circuit with multiple microcode ROMs
EP0415351A2 (en) Data processor for processing instruction after conditional branch instruction at high speed
US20230185572A1 (en) Instruction decode cluster offlining

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant