CN116339832A - Data processing device, method and processor - Google Patents


Info

Publication number: CN116339832A
Application number: CN202310339483.7A
Authority: CN (China)
Prior art keywords: instruction, micro-operand, current instruction, instructions
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 孔超
Assignee (current and original): Beijing Eswin Computing Technology Co Ltd
Application filed by Beijing Eswin Computing Technology Co Ltd
Priority to CN202310339483.7A; published as CN116339832A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 - Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F 9/3802 - Instruction prefetching
    • G06F 9/3804 - Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F 9/3836 - Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G06F 9/3842 - Speculative instruction execution
    • G06F 9/3848 - Speculative instruction execution using hybrid branch prediction, e.g. selection between prediction techniques

Abstract

The present disclosure provides a data processing apparatus, a data processing method, and a processor. The data processing apparatus includes an instruction fetch unit configured to fetch a plurality of instructions; a decoding unit configured to sequentially decode the plurality of instructions to obtain micro-operands corresponding to each instruction in the plurality of instructions; a branch prediction unit configured to store jump instruction information corresponding to each of the plurality of instructions and to perform a prediction operation on a current instruction according to the jump instruction information to obtain a prediction result; and a micro-operand storage unit configured to store instruction information and micro-operands corresponding to each of the plurality of instructions, and to send the micro-operands corresponding to the current instruction to a micro-operand queue to execute the current instruction if the prediction result indicates that the current instruction hits in the branch prediction unit.

Description

Data processing device, method and processor
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing apparatus, a method, and a processor.
Background
With the rapid development of information technology, information technology is promoting the development of human society and changing how people produce and live. A new generation of information processing terminals, represented by the high-performance processor (High Performance Processor), has become a technical foundation of the information age.
The fifth-generation reduced instruction set processor (Reduced Instruction Set Computer-Five, RISC-V) and the advanced reduced instruction set machine (Advanced RISC Machine, ARM) are increasingly used in high-performance processors because of advantages such as being fully open source, having a simple architecture, and being easy to port. The conventional front-end pipeline under the RISC-V based reduced instruction set architecture is referred to as the macro instruction translation engine (Macro Instruction Translation Engine, MITE). The MITE front-end pipeline includes instruction fetch, decode, register renaming, dispatch, issue, and similar processes. Depending on the implementation of the processor, the instruction fetch process may include a two-stage to four-stage pipeline.
In the related art, the power consumption of the MITE front-end pipeline accounts for about 28% of the power consumption of the whole processor. A lengthy front-end pipeline can increase pipeline latency when branch instructions are present and also adds extra power consumption to the processor.
Disclosure of Invention
The present disclosure provides a data processing apparatus, method, and processor.
According to a first aspect of the present disclosure, a data processing apparatus is provided, including an instruction fetch unit configured to fetch a plurality of instructions; a decoding unit configured to sequentially decode the plurality of instructions to obtain micro-operands corresponding to each instruction in the plurality of instructions; a branch prediction unit configured to store jump instruction information corresponding to each of the plurality of instructions and to perform a prediction operation on a current instruction according to the jump instruction information to obtain a prediction result; and a micro-operand storage unit configured to store instruction information and micro-operands corresponding to each of the plurality of instructions, and to send the micro-operands corresponding to the current instruction to a micro-operand queue to execute the current instruction if the prediction result indicates that the current instruction hits in the branch prediction unit.
For example, the micro-operand storage unit includes a tag storage subunit configured to store micro-operand valid bits corresponding to each of the plurality of instructions, virtual addresses corresponding to each of the plurality of instructions, and cache line valid bits corresponding to each of the plurality of instructions; and a data storage subunit configured to store micro-operands corresponding to each of the plurality of instructions.
For example, the micro-operand storage unit is further configured to: reading a virtual address corresponding to the current instruction from the tag storage subunit in the case that the prediction result indicates that the current instruction hits in the branch prediction unit; and reading the micro-operand corresponding to the virtual address from the data storage subunit in the case that the valid bit of the cache line corresponding to the current instruction is high.
For example, the branch prediction unit includes a branch target buffer configured to store an execution address of each jump instruction of the plurality of instructions and a jump address corresponding to the jump instruction.
For example, the branch prediction unit is further configured to determine whether the current instruction is a jump instruction and, in the case that the current instruction is a jump instruction, to determine, according to the execution address and the tag of the current instruction, whether a target tag corresponding to the tag of the current instruction exists in the branch target buffer, so as to obtain a prediction result.
For example, the prediction result includes a hit of the current instruction in the branch prediction unit or a miss of the current instruction in the branch prediction unit; wherein the current instruction hits in the branch prediction unit, indicating that a target tag corresponding to the tag of the current instruction exists in the branch target buffer; and a miss of the current instruction in the branch prediction unit indicating that there is no target tag in the branch target buffer corresponding to the tag of the current instruction.
For example, jump instructions include conditional jump instructions, unconditional jump instructions, and system call instructions.
For example, the branch prediction unit is further configured to obtain, in the case where the current instruction is a conditional jump instruction, a jump direction corresponding to the current instruction according to a mode information history table, where the mode information history table is configured to store history instruction information of each jump instruction; and obtaining the execution address of the subsequent instruction of the current instruction according to the jump direction and the mode information history table.
For example, the branch prediction unit is further configured to obtain a jump address corresponding to the current instruction from the branch target buffer in the case that the current instruction is an unconditional jump instruction.
For example, the branch prediction unit is further configured to obtain, in case the current instruction is a system call instruction, an execution address of an instruction subsequent to the current instruction according to the return address stack.
According to a second aspect of embodiments of the present disclosure, there is provided a data processing method applicable to the data processing apparatus provided in the first aspect of the present disclosure, the method comprising obtaining a plurality of instructions; sequentially decoding the plurality of instructions to obtain micro operands corresponding to the instructions in the plurality of instructions; storing instruction information and micro-operands corresponding to each of the plurality of instructions, and jump instruction information corresponding to each of the plurality of jump instructions; predicting the current instruction according to the jump instruction information to obtain a prediction result; and sending the micro-operand corresponding to the current instruction to a micro-operand queue to execute the current instruction if the predicted result indicates that the current instruction hits in the branch prediction unit.
According to a third aspect of embodiments of the present disclosure, there is provided a processor comprising the data processing apparatus provided in the first aspect of the present disclosure.
According to a technical scheme of the embodiments of the present disclosure, a data processing apparatus is provided. The apparatus stores the instruction information and micro-operands corresponding to each instruction in a micro-operand storage unit, and a branch prediction unit stores the jump instruction information corresponding to each jump instruction in the plurality of instructions and predicts the current instruction, so that when the current instruction hits in the branch prediction unit, the micro-operands corresponding to the current instruction are sent directly from the micro-operand storage unit to the micro-operand queue. This avoids the multi-cycle instruction fetch and decode process that the current instruction would otherwise undergo in the instruction fetch unit and the decoding unit, shortens the execution period of the instruction, and improves the operating efficiency of the processor.
Drawings
The above and other objects, features and advantages of the embodiments of the present disclosure will become more apparent from the following description of the embodiments of the present disclosure taken in conjunction with the accompanying drawings. It should be noted that throughout the appended drawings, like elements are represented by like or similar reference numerals. In the figure:
FIG. 1 shows a schematic diagram of a front-end pipeline according to the related art;
FIG. 2 shows a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a data processing apparatus according to another embodiment of the present disclosure;
FIG. 4 illustrates a flow diagram for reading micro-operands according to an embodiment of the present disclosure;
FIGS. 5A-5D and 6A-6D illustrate schematic diagrams of data stored in a micro-operand storage unit, respectively, according to an embodiment of the present disclosure;
FIG. 7 illustrates a schematic diagram of a micro-operand queue according to an embodiment of the present disclosure; and
FIG. 8 is a flow chart of a data processing method according to an embodiment of the present disclosure.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings. It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. Based on the described embodiments of the present disclosure, all other embodiments obtained by one of ordinary skill in the art without inventive effort are within the scope of this disclosure. In the following description, some specific embodiments are for descriptive purposes only and should not be construed as limiting the disclosure in any way, but are merely examples of embodiments of the disclosure. Conventional structures or constructions will be omitted when they may cause confusion in understanding the present disclosure. It should be noted that the shapes and dimensions of the various components in the figures do not reflect actual sizes and proportions, but merely illustrate the contents of the embodiments of the present disclosure.
Unless defined otherwise, technical or scientific terms used in the embodiments of the present disclosure should be in a general sense understood by those skilled in the art. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Some block diagrams and/or flowcharts are shown. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, when executed by the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). Additionally, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon, the computer program product being for use by or in connection with an instruction execution system.
Fig. 1 shows a schematic structure of a front-end pipeline according to the related art.
As shown in fig. 1, the front-end pipeline 100 includes an instruction buffer unit (Instruction Cache) 110, an instruction fetch unit (Instruction Fetch) 120, and a decode unit (Decoder) 130.
Instruction buffer unit 110 is configured to store a plurality of instructions.
Instruction fetch unit 120 is configured to sequentially fetch a plurality of instructions from instruction buffer unit 110.
The decoding unit 130 is configured to sequentially decode the instructions read by the instruction fetching unit 120, so as to obtain instruction information corresponding to each instruction.
In the example of FIG. 1, instruction fetch unit 120 includes a four-stage pipeline: a first instruction fetch stage 121, a second instruction fetch stage 122, a third instruction fetch stage 123, and a fourth instruction fetch stage 124.
For example, in the first instruction fetch stage 121, the value of the execution address (Program Counter, PC) corresponding to each instruction is determined.
For example, in the second instruction fetch stage 122, the access to the instruction buffer is completed, and part of the work of the branch target buffer (Branch Target Buffer, BTB) may be completed. Since the instruction address in the program is a virtual address, the virtual address must be translated into a physical address through a translation lookaside buffer (Translation Lookaside Buffer, TLB), and the instruction buffer is accessed according to the physical address. The BTB entry is also queried according to the PC value to determine whether the instruction is a branch instruction, whether the branch is taken (Taken), and the jump address.
For example, in the third instruction fetch stage 123, the response results of the instruction buffer unit 110 are fetched and placed into the instruction memory (Instruction Memory, IMem) response queue.
For example, in the fourth instruction fetch stage 124, fetch packets (Fetch Packets) are taken from the IMem response queue and fast-decoded (BR Decode). Whether the instruction is a branch jump instruction, and its jump address, are determined from the decoding result. The decoded branch jump address, the BTB entries, and the result of the backing predictor (Backing Predictor, BPD) are then compared (BR Checker) to update the BTB entries. Finally, the fetch packet is stored in a fetch buffer (Fetch Buffer), and the PC value and the branch prediction information are put into a fetch target queue (Fetch Target Queue, FTQ).
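As a rough, non-authoritative illustration of the four-stage flow above, the following Python sketch models one fetch as four sequential steps; all names and data shapes here (imem_response_queue, icache_read, the tuple-valued fetch packet) are assumptions made for this related-art description, not the actual hardware:

```python
from collections import deque

imem_response_queue = deque()  # IMem response queue, filled in the third stage
fetch_buffer = deque()         # Fetch Buffer, filled in the fourth stage

def icache_read(pc):
    # Placeholder standing in for TLB translation + instruction-cache access.
    return b"\x00" * 16

def fetch_one(pc):
    # Stage 1: determine the execution address (PC) for this fetch.
    fetch_pc = pc
    # Stage 2: translate the virtual PC, access the instruction cache,
    # and start the BTB lookup (all stubbed by icache_read here).
    data = icache_read(fetch_pc)
    # Stage 3: put the instruction-cache response into the response queue.
    imem_response_queue.append((fetch_pc, data))
    # Stage 4: pop a fetch packet, fast-decode branches (BR Decode),
    # check them against BTB/BPD (BR Checker), and fill the Fetch Buffer.
    fetch_buffer.append(imem_response_queue.popleft())
    return fetch_pc + 4  # next sequential PC

next_pc = fetch_one(0x8000_0000)  # four stages are consumed for one fetch
```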
In the related art, the four-stage instruction fetch pipeline applied to every instruction severely reduces the processor's execution efficiency for each instruction and increases the processor's power consumption. Especially in the presence of branch instructions, the lengthy front-end pipeline can increase pipeline latency and add extra power consumption. In addition, the insufficient throughput of the conventional front-end pipeline can leave back-end pipeline components idle.
In view of the problems in the related art, an embodiment of the present disclosure provides a data processing apparatus, including an instruction fetch unit configured to fetch a plurality of instructions; a decoding unit configured to sequentially decode the plurality of instructions to obtain micro-operands corresponding to each instruction in the plurality of instructions; a branch prediction unit configured to store jump instruction information corresponding to each of the plurality of instructions and to perform a prediction operation on a current instruction according to the jump instruction information to obtain a prediction result; and a micro-operand storage unit configured to store instruction information and micro-operands corresponding to each of the plurality of instructions, and to send the micro-operands corresponding to the current instruction to a micro-operand queue to execute the current instruction if the prediction result indicates that the current instruction hits in the branch prediction unit.
The embodiment of the disclosure thus provides a data processing apparatus that stores the instruction information and micro-operands corresponding to each instruction in a micro-operand storage unit, while a branch prediction unit stores the jump instruction information corresponding to each jump instruction in the plurality of instructions and predicts the current instruction, so that when the current instruction hits in the branch prediction unit, the micro-operands corresponding to the current instruction are sent directly from the micro-operand storage unit to the micro-operand queue. This avoids the multi-cycle instruction fetch and decode process that the current instruction would otherwise undergo in the instruction fetch unit and the decoding unit, shortens the execution period of the instruction, and improves the operating efficiency of the processor.
Fig. 2 shows a schematic configuration diagram of a data processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 2, the data processing apparatus 200 may include an instruction fetch unit 210, a decode unit 220, a branch prediction unit (Branch Prediction Unit, BPU) 230, a micro-operand storage unit 240, and a micro-operand Queue (Uops Queue) 250.
Instruction fetch unit 210 is configured to fetch a plurality of instructions.
The decoding unit 220 is configured to sequentially decode the plurality of instructions to obtain micro operands corresponding to each of the plurality of instructions.
The branch prediction unit 230 is configured to store jump instruction information corresponding to each of the plurality of instructions, and perform a prediction operation on a current instruction according to the jump instruction information, resulting in a prediction result.
The micro-operand storage unit 240 is configured to store instruction information and micro-operands corresponding to each of the plurality of instructions, and, in the event that the prediction result indicates that the current instruction hits in the branch prediction unit 230, to send the micro-operands corresponding to the current instruction to the micro-operand queue 250 to execute the current instruction.
In the disclosed embodiment, the plurality of instructions are processor instructions (i.e., CPU instructions).
For example, each instruction may be a jump instruction, a logic instruction (e.g., AND operation, OR operation, NOT operation), an arithmetic instruction (e.g., addition operation, subtraction operation), a multiplication instruction, a division instruction, and the like.
For example, instruction fetch unit 210 may include a two-stage pipeline, a three-stage pipeline, a four-stage pipeline, or the like, to complete the instruction fetching process for each instruction.
For example, where instruction fetch unit 210 includes a four-stage pipeline, instruction fetch unit 210 may be configured identically to instruction fetch unit 120 shown in FIG. 1. For brevity, the structure of the instruction fetch unit will not be described in detail in this disclosure.
For example, instruction fetch unit 210 may be configured to fetch instructions sequentially from the instruction buffer unit.
For example, the decoding unit 220 is configured to sequentially decode each instruction to obtain instruction information corresponding to each instruction of the plurality of instructions.
For example, the instruction information may include information such as a register index, an immediate size, a value of a PC to which the instruction corresponds, an instruction type, source register information, target register information, and the like.
In an embodiment of the present disclosure, branch prediction unit 230 is configured to store jump instruction information corresponding to each of the plurality of instructions, and to perform a prediction operation on the current instruction according to the jump instruction information to obtain a prediction result.
For example, the jump instruction information corresponding to each jump instruction may include the jump instruction type, a tag (Tag), a jump address (i.e., the PC value of the instruction executed after the jump instruction), valid bit information (Valid), and the like.
For example, the jump instruction type includes a conditional jump instruction, an unconditional jump instruction, a system call instruction, and the like.
For example, the tag represents a partial virtual address of each jump instruction for matching with the virtual address of each instruction obtained by decode unit 220 to determine whether each jump instruction hits in branch prediction unit 230.
For example, the valid bit information indicates whether the cache line in which each jump instruction is stored in branch prediction unit 230 is valid.
For example, the valid bit information may be 1 bit, taking either a low level "0" or a high level "1".
For example, in the case where the valid bit information is at a high level "1", it indicates that the cache line in which each jump instruction is stored in the branch prediction unit 230 is valid.
For example, in the case where the valid bit information is at a low level "0", it indicates that the cache line where each jump instruction is stored in the branch prediction unit 230 is invalid.
It will be appreciated that a "hit" in the embodiments of the present disclosure refers to whether the jump instruction information corresponding to a given jump instruction is stored in the branch prediction unit 230.
In the embodiment of the present disclosure, the micro-operand storage unit 240 is configured to store the instruction information and micro-operands corresponding to each instruction decoded by the decoding unit 220, and send the micro-operands corresponding to the current instruction to the micro-operand queue 250 for subsequent pipeline execution of the current instruction, in case the prediction result output by the branch prediction unit 230 indicates that the current instruction hits in the branch prediction unit 230.
It will be appreciated that in the case that the current instruction hits in the branch prediction unit 230, the micro-operand storage unit 240 must also have the instruction information and micro-operands corresponding to the current instruction, and the micro-operand storage unit 240 directly sends the instruction information and micro-operands corresponding to the current instruction to the micro-operand queue 250, so that the processing period of the processor for the instruction is shortened.
According to an embodiment of the present disclosure, in the case that the current instruction hits in the branch prediction unit, the micro-operands corresponding to the current instruction are sent directly from the micro-operand storage unit to the micro-operand queue. This avoids the multi-cycle instruction fetch and decode process that the current instruction would otherwise undergo in the instruction fetch unit and the decoding unit, shortens the execution period of the instruction, and improves the operating efficiency of the processor.
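The hit/miss dispatch just described can be summarized in a minimal Python sketch, assuming illustrative interfaces (bpu.predict_hit, uop_storage.read/write, and so on) that the patent itself does not define:

```python
def process(pc, bpu, uop_storage, uop_queue, fetch_unit, decoder):
    # Hit path: the micro-operands are already cached in the micro-operand
    # storage unit, so they are sent to the micro-operand queue directly,
    # spending no fetch or decode cycles.
    if bpu.predict_hit(pc):
        uop_queue.extend(uop_storage.read(pc))
    else:
        # Miss path: conventional multi-cycle fetch + decode, after which
        # the micro-operands are cached so a later hit takes the short path.
        uops = decoder.decode(fetch_unit.fetch(pc))
        uop_storage.write(pc, uops)
        uop_queue.extend(uops)
```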
Fig. 3 illustrates a schematic structure of a data processing apparatus according to another embodiment of the present disclosure.
As shown in FIG. 3, the data processing apparatus 300 includes a program counter generation unit 310, an instruction buffer unit 320, an instruction fetch unit 330, a decode unit 340, a branch prediction unit 350, a micro-operand storage unit 360, and a micro-operand queue 370.
The program counter generation unit 310 is configured to sequentially generate a plurality of PC addresses after the processor is powered on.
The instruction buffer unit 320 is configured to obtain instructions corresponding to respective PC addresses of the plurality of PC addresses based on the plurality of PC addresses.
Instruction fetch unit 330 is configured to fetch multiple instructions in sequence from instruction buffer unit 320.
The decoding unit 340 is configured to sequentially decode each instruction to obtain instruction information corresponding to each instruction in the plurality of instructions, and output the instruction information corresponding to each instruction to the micro-operand storage unit 360 and the micro-operand queue 370.
The branch prediction unit 350 is configured to store jump instruction information corresponding to each of the plurality of instructions, and perform a prediction operation on the current instruction according to the jump instruction information, resulting in a predicted outcome.
Micro-operand storage unit 360 is configured to store instruction information and micro-operands corresponding to each instruction and, in the event that the predicted outcome indicates a hit of the current instruction in branch prediction unit 350, send the micro-operands corresponding to the current instruction to micro-operand queue 370 for execution of the current instruction.
For example, the plurality of PC addresses generated by the program counter generation unit 310 are PC addresses sequentially incremented from an initial address (e.g., 0x8000 0000).
For example, the plurality of PC addresses may be 0x8000 0000, 0x8000 0004, 0x8000 0008, respectively.
For example, instruction fetch unit 330 includes a multi-stage pipeline. Each stage of pipeline is used for completing one instruction fetching stage, and sequentially completing instruction fetching process of each instruction.
For example, instruction fetch unit 330 may include a two-stage pipeline, a three-stage pipeline, a four-stage pipeline, and so forth. The embodiment of the present disclosure does not limit the structure of the instruction fetch unit 330, and the structure of the instruction fetch unit 330 may be set according to actual requirements.
For example, decode unit 340 decodes each instruction to obtain instruction information corresponding to each instruction including, but not limited to, a register index, an immediate size, a value of the PC to which the instruction corresponds, an instruction type, source register information, destination register information, and the like.
In the disclosed embodiment, branch prediction unit 350 includes a branch target buffer. The branch target buffer is configured to store an execution address (i.e., a PC address) of each jump instruction of the plurality of instructions, a tag, a jump instruction type, valid bit information, and a jump address (a PC address of a subsequent instruction of a current instruction, i.e., a Next PC) corresponding to the jump instruction.
For example, the branch prediction unit 350 is further configured to determine whether the current instruction is a jump instruction and, in the case that the current instruction is a jump instruction, to determine, according to the PC address (execution address) and the tag of the current instruction, whether a target tag corresponding to the tag of the current instruction exists in the branch target buffer, so as to obtain a prediction result.
For example, in the case where the current instruction is a conditional jump instruction, an unconditional jump instruction, or a system call instruction, the branch prediction unit 350 determines whether a target tag corresponding to the execution address and tag of the current instruction exists in the branch target buffer according to the PC address and tag of the current instruction, and obtains a prediction result.
For example, the prediction result includes a hit of the current instruction in the branch prediction unit 350 or a miss of the current instruction in the branch prediction unit 350.
For example, the current instruction hits in the branch prediction unit 350, indicating that there is a target tag in the branch target buffer corresponding to the tag of the current instruction.
For example, a miss of the current instruction in branch prediction unit 350 indicates that there is no target tag in the branch target buffer corresponding to the tag of the current instruction.
For example, the tag corresponding to the current instruction and the target tag corresponding thereto may be the same, or the target tag may be a partial tag of the tag corresponding to the current instruction.
Note that each tag in the embodiments of the present disclosure represents a partial virtual address or a full virtual address corresponding to each PC address.
For example, branch prediction unit 350 also includes a mode information history table (Pattern History Table, PHT) and a return address stack (Return Address Stack, RAS).
For example, the mode information history table is used to store branch history information for each jump instruction.
For example, the return address stack is used to store historical information for each system call instruction.
For example, the branch prediction unit 350 is further configured to obtain a jump address corresponding to the current instruction according to the branch target buffer to perform instruction jumps in case the current instruction is an unconditional jump instruction.
For example, the branch prediction unit 350 is further configured, in the case where the current instruction is a conditional jump instruction, to obtain a jump direction corresponding to the current instruction according to the mode information history table, and to obtain the execution address (Next PC) of the subsequent instruction of the current instruction according to the jump direction and the mode information history table.
For example, the branch prediction unit 350 is further configured to obtain, in the case where the current instruction is a system call instruction, an execution address of an instruction subsequent to the current instruction according to the return address stack.
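The per-type prediction flow above can be summarized in a short sketch; the instruction fields and the predictor interfaces (btb as a mapping, pht with predict_direction/next_pc, ras as a stack) are assumed shapes for illustration only, not the patent's interfaces:

```python
def predict_next_pc(inst, btb, pht, ras):
    """Choose the next execution address by jump-instruction type.
    inst is assumed to expose .pc (execution address) and .kind (type)."""
    if inst.kind == "unconditional":
        # Unconditional jump: the jump address is read from the BTB entry.
        return btb[inst.pc]
    if inst.kind == "conditional":
        # Conditional jump: the mode information history table supplies the
        # jump direction, and direction + table give the next address.
        taken = pht.predict_direction(inst.pc)
        return pht.next_pc(inst.pc, taken)
    if inst.kind == "system_call":
        # System call: the follow-on address comes from the return address stack.
        return ras[-1]
    return inst.pc + 4  # not a jump: sequential fall-through
```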
It will be appreciated that the structure of the branch prediction unit in the embodiments of the present disclosure is merely illustrative and not limiting of the embodiments of the present disclosure.
In the disclosed embodiment, micro-operand storage unit 360 includes a Tag storage subunit (Tag RAM) and a Data storage subunit (Data RAM).
The tag storage subunit is configured to store the micro-operand valid bits corresponding to each of the plurality of instructions, the virtual addresses corresponding to each of the plurality of instructions, and the cache line valid bits corresponding to each of the plurality of instructions.
The data storage subunit is configured to store the micro-operands corresponding to each of the plurality of instructions.
For example, micro-operand storage unit 360 may include 64 Cache lines. Each cache line may store 3 micro-operands (uops).
For example, each cache line in the tag storage subunit may be 35 bits long. The format of each cache line of the tag storage subunit may be expressed as:
TABLE 1
Field:        Uop_Vld | Uop_Tag | Uop0_Vld | Uop1_Vld | Uop2_Vld
Bit position: 34      | 33:3    | 2        | 1        | 0
In the example of Table 1, Uop_Vld represents the valid bit of each cache line and may be represented as a 1-bit binary value, such as "1" or "0". Uop_Tag represents the partial virtual address of each instruction and may be represented as a 31-bit binary value. Uop0_Vld, Uop1_Vld, and Uop2_Vld represent the valid bits of the first, second, and third micro-operands stored in each cache line, respectively; each may be represented as a 1-bit binary value, such as "1" or "0".
For example, each cache line in the data storage subunit may be 363 bits long. The internal format of each cache line of the data storage subunit may be expressed as:
TABLE 2
Field:        Uop2    | Uop1    | Uop0
Bit position: 362:242 | 241:121 | 120:0
In the example of Table 2, each Uop denotes a micro-operand: the first micro-operand Uop0, the second micro-operand Uop1, and the third micro-operand Uop2. Each micro-operand is 121 bits long.
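The bit layouts of Tables 1 and 2 can be exercised with a small sketch. This is a minimal illustration that treats a raw line as a Python integer; the function names (pack_tag_line, unpack_data_line) are not from the patent, and only the field positions and widths come from the tables:

```python
def pack_tag_line(line_vld, uop_tag, uop_vlds):
    """Pack one 35-bit Tag RAM line according to Table 1."""
    assert 0 <= uop_tag < (1 << 31) and len(uop_vlds) == 3
    word = (line_vld & 1) << 34       # Uop_Vld, bit 34
    word |= uop_tag << 3              # Uop_Tag, bits 33:3
    word |= (uop_vlds[0] & 1) << 2    # Uop0_Vld, bit 2
    word |= (uop_vlds[1] & 1) << 1    # Uop1_Vld, bit 1
    word |= (uop_vlds[2] & 1) << 0    # Uop2_Vld, bit 0
    return word

def unpack_data_line(word):
    """Split one 363-bit Data RAM line into three 121-bit micro-operands (Table 2)."""
    mask = (1 << 121) - 1
    return [(word >> (121 * i)) & mask for i in range(3)]  # [Uop0, Uop1, Uop2]

line = pack_tag_line(1, 0x001F5678, [1, 1, 0])  # a valid line with two valid uops
```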
For example, the virtual address corresponding to each instruction is 40 bits and may be expressed as:
TABLE 3
Field:        Tag  | Index | Offset
Bit position: 39:9 | 8:3   | 2:0
In the example of Table 3, Tag represents the partial virtual address to which each instruction corresponds, Index represents the cache line index to which each instruction corresponds, and Offset represents an offset indicating whether each instruction is a compressed instruction. The lowest bit of Offset defaults to "0".
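The field extraction of Table 3 can be sketched as follows. This is a minimal illustration assuming the address fits in a Python integer; only the bit positions come from Table 3:

```python
def split_virtual_address(va):
    """Decompose a 40-bit virtual address according to Table 3."""
    offset = va & 0x7                  # bits 2:0; the lowest bit defaults to 0
    index = (va >> 3) & 0x3F           # bits 8:3; cache line number 0..63
    tag = (va >> 9) & ((1 << 31) - 1)  # bits 39:9; the partial virtual address
    return tag, index, offset

tag, index, offset = split_virtual_address(0x80_0000_01E0)
assert index == 60  # this address selects cache line 60
```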
It should be noted that the structures of the cache line of the tag storage subunit, the cache line of the data storage subunit, and each PC address in the embodiments of the present disclosure are merely exemplary illustrations, and do not constitute limitations of the embodiments of the present disclosure.
According to an embodiment of the present disclosure, micro-operand storage unit 360 is further configured to read the virtual address corresponding to the current instruction from the tag storage subunit in the case that the prediction result indicates that the current instruction hits in branch prediction unit 350, and to read the micro-operands corresponding to that virtual address from the data storage subunit in the case that the valid bit of the cache line corresponding to the current instruction is high.
FIG. 4 illustrates a flow diagram for reading micro-operands according to an embodiment of the present disclosure.
As shown in fig. 4, the execution address 410 represents a virtual address corresponding to each instruction address. The execution address 410 includes an Offset (Offset), an Index (Index), and a Tag (Tag).
The execution address 410 is illustrated as 40 bits. The offset is 3 bits in length and the lowest bit of the offset defaults to "0". The index is 6 bits in length. The index indicates the serial number of the cache line, and the numerical range is 0-63. The tag represents a partial virtual address of the execution address 410.
In the example of FIG. 4, the read of each micro-operand stored in the micro-operand storage unit is illustrated using the 40-bit execution address 410 as an example. It is to be understood that this reading process is merely an exemplary illustration and is not to be construed as limiting the embodiments of the disclosure.
For example, in reading each micro-operand stored in the micro-operand storage unit, a target tag corresponding to an index value of the execution address 410 of the current instruction is read in the tag storage subunit 420 according to the index value.
For example, in the case where the index value of the execution address 410 of the current instruction is "60", the target tag "001F5678" stored in the cache line with the sequence number "60" in the tag storage subunit 420 is read. And determining whether the tag of the execution address 410 of the current instruction is equal to the read target tag "001F 5678".
For example, in the case where the tag of the execution address 410 of the current instruction is equal to the read target tag "001F5678", the valid bit of the output Match signal (Match Vld) is high-level "1". Otherwise, the valid bit of Match Vld is low level "0".
For example, the Match Vld is logically anded with the valid bit of the cache line with sequence number "60" to obtain the result after the and operation.
For example, in the case where the valid bit of the cache line with the sequence number "60" is high "1", and the valid bit of the Match Vld is high "1", the result signal after the logical and operation is high "1", indicating that the micro-operand corresponding to the current instruction can be read from the data storage subunit.
For example, the micro-operands corresponding to the current instruction are read from the data storage subunit 430 based on the index value of the virtual address of the current instruction, and are then output to the micro-operand queue.
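The read path of FIG. 4 can be summarized in a short sketch, reusing the tag value 0x001F5678 and line number 60 from the description above; the TagEntry type and the dictionary-backed RAMs are assumptions for illustration, not the hardware design:

```python
from dataclasses import dataclass

@dataclass
class TagEntry:
    line_vld: int  # Uop_Vld (bit 34): cache-line valid bit
    uop_tag: int   # Uop_Tag (bits 33:3): stored partial virtual address

def read_uops(va, tag_ram, data_ram):
    """FIG. 4 read path: index -> tag compare -> valid AND -> data read."""
    index = (va >> 3) & 0x3F            # bits 8:3 select one of 64 lines
    tag = (va >> 9) & ((1 << 31) - 1)   # bits 39:9 form the lookup tag
    entry = tag_ram[index]
    match_vld = 1 if entry.uop_tag == tag else 0  # Match Vld signal
    if match_vld & entry.line_vld:                # AND with the line valid bit
        return data_ram[index]                    # uops go to the uop queue
    return None  # miss: the conventional fetch/decode path must run

# Example with cache line 60 holding tag 0x001F5678, as in the description.
tag_ram = {60: TagEntry(line_vld=1, uop_tag=0x001F5678)}
data_ram = {60: ["Uop0", "Uop1", "Uop2"]}
va = (0x001F5678 << 9) | (60 << 3)  # an address whose index is 60
assert read_uops(va, tag_ram, data_ram) == ["Uop0", "Uop1", "Uop2"]
```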
In the embodiment of the disclosure, the process of reading the micro-operands corresponding to the virtual address from the data storage subunit is executed only after the current instruction hits in the branch prediction unit. This avoids the instruction fetch and decode stages of the conventional pipeline when the current instruction is a jump instruction, shortens the processing period of a hit jump instruction, and improves the execution efficiency of the processor.
Fig. 5A-5D and 6A-6D illustrate schematic diagrams of storing data in a micro-operand storage unit, respectively, according to an embodiment of the present disclosure.
It will be appreciated that to simplify the storing process of micro-operands, micro-operands between two jump instructions may be stored in the same cache line or multiple adjacent cache lines in the micro-operand storage unit.
For example, when the current instruction misses every cache line in the micro-operand storage unit and there is no unconditional jump instruction among the micro-operands to be written, each micro-operand is written in turn into a free cache line in the micro-operand storage unit.
As shown in FIG. 5A, the first Cache Line 0 of the micro-operand storage unit has stored a first micro-operand Mul. The second micro-operand and the third micro-operand are both null (Empty).
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand Add, micro-operand Sub, and micro-operand OR, respectively. Since no jump instruction exists among micro-operand Add, micro-operand Sub, and micro-operand OR, they are stored in sequence into Cache Line 1 of the micro-operand storage unit.
As shown in FIG. 5B, the first Cache Line 0 of the micro-operand storage unit has stored a first micro-operand Mul and a second micro-operand Div. The third micro-operand is null.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand Add, micro-operand Sub, and micro-operand OR, respectively. Since no jump instruction exists among micro-operand Add, micro-operand Sub, and micro-operand OR, they are stored in sequence into Cache Line 1 of the micro-operand storage unit.
As shown in FIG. 5C, the first Cache Line 0 of the micro-operand storage unit has stored a first micro-operand Mul and a second micro-operand Div. The third micro-operand is null.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand Add, micro-operand BR, and micro-operand OR, respectively. Since no unconditional jump instruction exists among micro-operand Add, micro-operand BR, and micro-operand OR, they are stored in sequence into Cache Line 1 of the micro-operand storage unit.
As shown in fig. 5D, the first Cache Line 0 of the micro-operand storage unit has stored the first micro-operand Mul and the second micro-operand Div. The third micro-operand is null.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand BR, micro-operand Add, and micro-operand OR, respectively. Since no unconditional jump instruction exists among micro-operand BR, micro-operand Add, and micro-operand OR, they are stored in sequence into Cache Line 1 of the micro-operand storage unit.
For example, when the current instruction misses every cache line in the micro-operand storage unit and an unconditional jump instruction (JAL) exists among the micro-operands to be written, the micro-operands up to and including the unconditional jump are written in sequence into a free cache line in the micro-operand storage unit.
As shown in fig. 6A, the first Cache Line 0 of the micro-operand storage unit has stored the first micro-operand Mul and the second micro-operand Div. The third micro-operand is null.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand JAL, micro-operand Add, and micro-operand OR, respectively. Because micro-operand JAL is an unconditional jump instruction, micro-operand JAL is stored into Cache Line 1 of the micro-operand storage unit, and the micro-operands Add and OR after micro-operand JAL are not stored in Cache Line 1.
As shown in fig. 6B, the first Cache Line 0 of the micro-operand storage unit has stored the first micro-operand Mul and the second micro-operand Div. The third micro-operand is null.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand Add, micro-operand JAL, and micro-operand OR, respectively. Because micro-operand JAL is an unconditional jump instruction, micro-operand Add and micro-operand JAL are stored into Cache Line 1 of the micro-operand storage unit, and the micro-operand OR after micro-operand JAL is not stored in Cache Line 1.
As shown in fig. 6C, the first Cache Line 0 of the micro-operand storage unit has stored the first micro-operand Mul and the second micro-operand BR. The third micro-operand is null.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are micro-operand JAL, micro-operand Add, and micro-operand OR, respectively. Because micro-operand JAL is an unconditional jump instruction, micro-operand JAL is stored into Cache Line 1 of the micro-operand storage unit, and the micro-operands Add and OR after micro-operand JAL are not stored in Cache Line 1.
As shown in fig. 6D, the first Cache Line 0 of the micro-operand storage unit has stored the first micro-operand Mul, the second micro-operand Div, and the third micro-operand Add.
In the subsequent processing cycle, the micro-operands obtained by the decoding unit are three micro-operands JAL. Since all three micro-operands are unconditional jump instructions, the three micro-operands JAL are stored in sequence into Cache Line 1 of the micro-operand storage unit.
For example, in the case that the current instruction misses in the branch prediction unit and misses every cache line in the micro-operand storage unit, and a conditional jump instruction and a system call instruction exist among the micro-operands, the conditional jump instruction and the system call instruction are written in order into an empty cache line in the micro-operand storage unit, and the micro-operands after the system call instruction are not stored in the micro-operand storage unit.
In the examples of FIGS. 5A to 5D and FIGS. 6A to 6D, Cache Line 0 represents the first cache line of the micro-operand storage unit, and Cache Line 1 represents the second cache line. Mul represents a multiplication operand, Sub a subtraction operand, OR an OR operation operand, Div a division operand, BR a branch operand, and JAL an unconditional jump instruction operand. In addition, the embodiments of the present disclosure are applicable to various micro-operands, including but not limited to those shown in FIGS. 5A to 5D and FIGS. 6A to 6D, and the types of the micro-operands are not limited by the embodiments of the present disclosure.
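The storing rules illustrated by FIGS. 5A to 6D admit a compact sketch. The following is one reading consistent with all eight figures, assuming a three-micro-operand decode group and a JAL predicate supplied by the caller; it is an illustration, not the patent's exact fill logic:

```python
def store_decode_group(uops, cache_line, is_uncond_jump):
    """Write one decode cycle's micro-operands into a free cache line.
    Once an unconditional jump (JAL) is seen, only further unconditional
    jumps are still written; ordinary micro-operands after it are dropped."""
    seen_jump = False
    for uop in uops:
        if seen_jump and not is_uncond_jump(uop):
            continue  # e.g. Add/OR after a JAL are not stored (FIGS. 6A-6C)
        cache_line.append(uop)
        seen_jump = seen_jump or is_uncond_jump(uop)
    return cache_line

# FIG. 6B: Add and JAL are kept, the OR after the JAL is dropped.
assert store_decode_group(["Add", "JAL", "OR"], [], lambda u: u == "JAL") == ["Add", "JAL"]
# FIG. 6D: consecutive JALs are all stored.
assert store_decode_group(["JAL", "JAL", "JAL"], [], lambda u: u == "JAL") == ["JAL", "JAL", "JAL"]
```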
FIG. 7 illustrates a schematic diagram of a micro-operand queue according to an embodiment of the present disclosure.
As shown in fig. 7, the micro-operand queue 700 includes a buffer unit 710, a read pointer 720, and a write pointer 730.
The buffer unit 710 is configured to store instruction information output from the decoding unit and the micro operand storage unit. For example, instruction information includes indexes, micro-operands, and the like.
For example, the buffer unit 710 may include 64 cache lines, and each cache line is used to store 3 micro-operands.
In the example of FIG. 7, micro-operand queue 700 is a high-capacity first-in first-out (First Input First Output, FIFO) queue. Micro-operand queue 700 can write 3 micro-operands and read 3 micro-operands per clock.
For example, micro-operands stored in a cache line corresponding to the location of the read pointer 720 are read based on the location of the read pointer 720.
For example, each micro-operand is written into a cache line corresponding to the location of the write pointer 730, depending on the location of the write pointer 730.
For example, in the case that the micro-operand queue 700 is not empty and the pipeline after the micro-operand queue 700 is not blocked, the read pointer 720 is incremented by 1 and the corresponding micro-operands are read according to the position of the read pointer 720.
For example, in the case that the micro-operand queue 700 is empty, or the pipeline after the micro-operand queue 700 is blocked, the read pointer 720 is not incremented and no micro-operands are read.
For example, when the decoding unit or the micro-operand storage unit outputs valid micro-operands and the micro-operand queue 700 is not full, the write pointer 730 is incremented by 1 and the corresponding micro-operands are written according to the position of the write pointer 730.
For example, in the case that neither the decoding unit nor the micro-operand storage unit outputs valid micro-operands, or the micro-operand queue 700 is full, the write pointer 730 is not incremented and no micro-operands are written.
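The pointer behavior described above can be sketched as a small FIFO model; a minimal illustration assuming one queue entry per 3-micro-operand group, not the hardware implementation:

```python
class UopQueue:
    def __init__(self, lines=64):
        self.lines = [None] * lines  # each entry holds one group of 3 uops
        self.read_ptr = 0
        self.write_ptr = 0
        self.count = 0

    def write(self, uop_group):
        # The write pointer advances by 1 only when valid micro-operands
        # arrive and the queue is not full; otherwise nothing is written.
        if uop_group and self.count < len(self.lines):
            self.lines[self.write_ptr] = uop_group
            self.write_ptr = (self.write_ptr + 1) % len(self.lines)
            self.count += 1

    def read(self, downstream_blocked=False):
        # The read pointer advances by 1 only when the queue is non-empty
        # and the later pipeline is not blocked; otherwise nothing is read.
        if self.count == 0 or downstream_blocked:
            return None
        group = self.lines[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.lines)
        self.count -= 1
        return group

q = UopQueue()
q.write(["Add", "Sub", "OR"])
assert q.read() == ["Add", "Sub", "OR"]
assert q.read() is None  # queue empty: the read pointer holds
```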
It should be noted that the micro operand queues in the embodiments of the present disclosure are merely exemplary illustrations, and do not constitute limitations of the embodiments of the present disclosure. In other embodiments, the cache unit of the micro-operand queue may include other numbers of cache lines, each of which may store other numbers of micro-operands, as embodiments of the present disclosure are not limited in this respect.
Fig. 8 shows a flow chart of a data processing method according to an embodiment of the present disclosure. As shown in fig. 8, a data processing method according to an embodiment of the present disclosure may include the following steps. It should be noted that the serial numbers of the respective steps in the following methods are merely representative of the steps for description, and should not be construed as representing the order of execution of the respective steps. The method need not be performed in the exact order shown unless explicitly stated.
As shown in fig. 8, the data processing method 800 is applied to a data processing apparatus and includes steps S810 to S850. It will be appreciated that the method may be applied to a data processing apparatus as shown in fig. 2 or 3. For brevity, the disclosure will not be repeated for the data processing apparatus.
In step S810, a plurality of instructions are acquired.
It is understood that step S810 may be performed by the instruction fetch unit shown in fig. 2 or the instruction fetch unit shown in fig. 3. For brevity, the instruction fetch unit shown in fig. 2 or the instruction fetch unit shown in fig. 3 will not be described in detail.
In step S820, the plurality of instructions are sequentially decoded to obtain micro-operands corresponding to each of the plurality of instructions.
It is understood that step S820 may be performed by the decoding unit shown in fig. 2 or the decoding unit shown in fig. 3. For brevity, the disclosure will not be repeated for the decoding units shown in fig. 2 or the decoding units shown in fig. 3.
In step S830, instruction information and micro operands corresponding to each of the plurality of instructions, and jump instruction information corresponding to each of the plurality of jump instructions are stored.
It will be appreciated that step S830 may be performed by the branch prediction unit and micro-operand storage unit shown in fig. 2 or the branch prediction unit and micro-operand storage unit shown in fig. 3. For brevity, the present disclosure will not be repeated for the branch prediction unit and micro-operand storage unit shown in fig. 2 or the branch prediction unit and micro-operand storage unit shown in fig. 3.
In step S840, the current instruction is predicted according to the jump instruction information, and a prediction result is obtained.
It will be appreciated that step S840 may be performed by the branch prediction unit shown in FIG. 2 or the branch prediction unit shown in FIG. 3. For brevity, this disclosure will not be repeated for the branch prediction unit shown in fig. 2 or the branch prediction unit shown in fig. 3.
In step S850, in the case where the prediction result indicates that the current instruction hits in the branch prediction unit, the micro-operand corresponding to the current instruction is sent to the micro-operand queue to execute the current instruction.
It is understood that step S850 may be performed by the micro-operand storage unit shown in fig. 2 or the micro-operand storage unit shown in fig. 3. For brevity, the disclosure will not be repeated for the micro-operand storage unit shown in fig. 2 or the micro-operand storage unit shown in fig. 3.
In another aspect of the disclosure, a processor is provided and includes a data processing apparatus.
For example, the data processing device may be a data processing device as shown in fig. 2 or fig. 3. For brevity, the disclosure will not be repeated for the data processing apparatus.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be combined and/or integrated in various ways without departing from the spirit and teachings of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
While the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. The scope of the disclosure should, therefore, not be limited to the above-described embodiments, but should be determined not only by the following claims, but also by the equivalents of the following claims.

Claims (12)

1. A data processing apparatus comprising:
an instruction fetch unit configured to fetch a plurality of instructions;
a decoding unit configured to sequentially decode the plurality of instructions to obtain micro-operands corresponding to each instruction in the plurality of instructions;
a branch prediction unit configured to store jump instruction information corresponding to each of the plurality of instructions, and execute a prediction operation on a current instruction according to the jump instruction information, to obtain a prediction result; and
a micro-operand storage unit configured to store instruction information and micro-operands corresponding to each of the plurality of instructions, and to send the micro-operands corresponding to the current instruction to a micro-operand queue to execute the current instruction if the prediction result indicates that the current instruction hits in the branch prediction unit.
2. The apparatus of claim 1, wherein the micro-operand storage unit comprises:
a tag storage subunit configured to store micro-operand valid bits corresponding to each of the plurality of instructions, virtual addresses corresponding to each of the plurality of instructions, and valid bits of a cache line corresponding to each of the plurality of instructions; and
a data storage subunit configured to store micro-operands corresponding to each of the plurality of instructions.
3. The apparatus of claim 2, wherein the micro-operand storage unit is further configured to:
in a case where the prediction result indicates that the current instruction hits in the branch prediction unit,
read a virtual address corresponding to the current instruction from the tag storage subunit; and
read the micro-operand corresponding to the virtual address from the data storage subunit in a case where the valid bit of the cache line corresponding to the current instruction is high.
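One way to picture the two subunits of claims 2 and 3 is the sketch below, which addresses the tag and data subunits with a shared entry index. The entry layout, the shared indexing, and all field names are assumptions made for illustration.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct MicroOp { uint32_t payload; };

    struct TagEntry {             // tag storage subunit entry (claim 2)
        bool     uopValid;        // micro-operand valid bit
        uint64_t virtualAddress;  // virtual address of the instruction
        bool     lineValid;       // valid bit of the corresponding cache line
    };

    struct MicroOpStorageUnit {
        std::vector<TagEntry> tags;  // tag storage subunit
        std::vector<MicroOp>  data;  // data storage subunit (same index as tags)

        // Claim 3: after a predicted hit, the virtual address is read from the
        // tag subunit, and the micro-operand is read from the data subunit only
        // when the cache-line valid bit is high.
        std::optional<MicroOp> readOnHit(std::size_t entry) const {
            const TagEntry& tag = tags[entry];
            if (tag.uopValid && tag.lineValid) {
                return data[entry];  // selected via tag.virtualAddress in hardware
            }
            return std::nullopt;     // fall back to the normal fetch/decode path
        }
    };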
4. The apparatus of claim 1, wherein the branch prediction unit comprises:
a branch target buffer configured to store an execution address of each jump instruction among the plurality of instructions and a jump address corresponding to the jump instruction.
5. The apparatus of claim 4, wherein the branch prediction unit is further configured to:
determine whether the current instruction is a jump instruction; and
in a case where the current instruction is a jump instruction, determine, according to the execution address and a tag of the current instruction, whether a target tag corresponding to the tag of the current instruction exists in the branch target buffer, to obtain the prediction result.
6. The apparatus of claim 5, wherein the prediction result comprises the current instruction hitting in the branch prediction unit or the current instruction missing in the branch prediction unit, wherein:
the current instruction hitting in the branch prediction unit indicates that a target tag corresponding to the tag of the current instruction exists in the branch target buffer; and
the current instruction missing in the branch prediction unit indicates that no target tag corresponding to the tag of the current instruction exists in the branch target buffer.
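The tag matching in claims 5 and 6 resembles a conventional direct-mapped branch target buffer lookup; the sketch below splits the execution address into index and tag bits, with the table size, bit widths, and 4-byte alignment all assumed for illustration.

    #include <cstdint>

    struct BtbEntry {
        bool     valid = false;
        uint64_t tag = 0;          // target tag stored for a jump instruction
        uint64_t jumpAddress = 0;  // jump address stored for that instruction
    };

    constexpr unsigned kIndexBits = 9;                         // 512 entries, assumed
    constexpr uint64_t kIndexMask = (1ull << kIndexBits) - 1;

    struct BranchTargetBuffer {
        BtbEntry entries[1u << kIndexBits];

        static uint64_t indexOf(uint64_t pc) { return (pc >> 2) & kIndexMask; }
        static uint64_t tagOf(uint64_t pc)   { return pc >> (2 + kIndexBits); }

        // Claim 6: a hit means a target tag matching the tag of the current
        // instruction exists in the buffer; a miss means no such tag exists.
        bool lookup(uint64_t pc, uint64_t& jumpAddress) const {
            const BtbEntry& e = entries[indexOf(pc)];
            if (e.valid && e.tag == tagOf(pc)) {
                jumpAddress = e.jumpAddress;  // predicted target on a hit
                return true;
            }
            return false;
        }
    };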
7. The apparatus of claim 5, wherein the jump instruction comprises a conditional jump instruction, an unconditional jump instruction, and a system call instruction.
8. The apparatus of claim 7, wherein the branch prediction unit is further configured to:
in the case where the current instruction is a conditional jump instruction,
obtain a jump direction corresponding to the current instruction according to a pattern history table, wherein the pattern history table is configured to store historical instruction information of each jump instruction; and
obtain an execution address of a subsequent instruction of the current instruction according to the jump direction and the pattern history table.
9. The apparatus of claim 7, wherein the branch prediction unit is further configured to:
in the case where the current instruction is an unconditional jump instruction,
obtain the jump address corresponding to the current instruction from the branch target buffer.
10. The apparatus of claim 7, wherein the branch prediction unit is further configured to:
in the case where the current instruction is a system call instruction,
obtain the execution address of a subsequent instruction of the current instruction from a return address stack.
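Claims 8 through 10 route the next-address decision by jump type. A compact dispatch sketch follows, in which the history-table entry format, the fall-through width, and the pop-on-call behavior of the return address stack are simplifying assumptions rather than claim language.

    #include <cstdint>
    #include <stack>
    #include <unordered_map>

    enum class JumpKind { Conditional, Unconditional, SystemCall };

    struct HistoryEntry {      // assumed pattern-history-table entry
        bool     taken;        // recorded jump direction
        uint64_t target;       // recorded execution address of the successor
    };

    struct JumpPredictor {
        std::unordered_map<uint64_t, HistoryEntry> patternHistory; // claim 8
        std::unordered_map<uint64_t, uint64_t>     btb;            // claim 9
        std::stack<uint64_t>                       returnStack;    // claim 10

        uint64_t nextAddress(uint64_t pc, JumpKind kind) {
            switch (kind) {
            case JumpKind::Conditional: {          // direction, then target, from the PHT
                auto it = patternHistory.find(pc);
                if (it != patternHistory.end() && it->second.taken) return it->second.target;
                return pc + 4;                     // assumed sequential fall-through
            }
            case JumpKind::Unconditional: {        // jump address straight from the BTB
                auto it = btb.find(pc);
                return it != btb.end() ? it->second : pc + 4;
            }
            case JumpKind::SystemCall: {           // successor address from the return stack
                if (returnStack.empty()) return pc + 4;
                uint64_t ret = returnStack.top();
                returnStack.pop();
                return ret;
            }
            }
            return pc + 4;                         // unreachable; silences warnings
        }
    };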
11. A data processing method applicable to the apparatus of claim 1, the method comprising:
acquiring a plurality of instructions;
sequentially decoding the plurality of instructions to obtain a micro-operand corresponding to each of the plurality of instructions;
storing instruction information and a micro-operand corresponding to each of the plurality of instructions, and jump instruction information corresponding to each jump instruction among the plurality of instructions;
performing a prediction operation on a current instruction according to the jump instruction information to obtain a prediction result; and
sending the micro-operand corresponding to the current instruction to a micro-operand queue to execute the current instruction in a case where the prediction result indicates that the current instruction hits in the branch prediction unit.
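To tie the method steps together, here is a self-contained end-to-end toy run in C++; the addresses, encodings, and map-based storage are inventions of this example only.

    #include <cstdint>
    #include <iostream>
    #include <queue>
    #include <unordered_map>

    int main() {
        std::unordered_map<uint64_t, uint32_t> uopStore;  // stored micro-operands
        std::unordered_map<uint64_t, uint64_t> jumpInfo;  // stored jump instruction info
        std::queue<uint32_t> uopQueue;                    // micro-operand queue

        // Steps 1-3: fetch, sequentially decode, and store.
        uopStore[0x1000] = 0xABCD;                        // micro-operand for the jump at 0x1000
        jumpInfo[0x1000] = 0x2000;                        // its recorded jump address

        // Step 4: perform the prediction operation on the current instruction.
        uint64_t pc = 0x1000;
        bool hit = jumpInfo.count(pc) != 0;

        // Step 5: on a hit, send the stored micro-operand to the queue.
        if (hit) uopQueue.push(uopStore[pc]);

        std::cout << "hit=" << hit << ", queued=" << uopQueue.size() << '\n';
        return 0;
    }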
12. A processor, comprising:
the apparatus of any one of claims 1 to 10.
CN202310339483.7A 2023-03-31 2023-03-31 Data processing device, method and processor Pending CN116339832A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310339483.7A CN116339832A (en) 2023-03-31 2023-03-31 Data processing device, method and processor

Publications (1)

Publication Number Publication Date
CN116339832A (en) 2023-06-27

Family

ID=86880384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310339483.7A Pending CN116339832A (en) 2023-03-31 2023-03-31 Data processing device, method and processor

Country Status (1)

Country Link
CN (1) CN116339832A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117389629A (en) * 2023-11-02 2024-01-12 北京市合芯数字科技有限公司 Branch prediction method, device, electronic equipment and medium


Similar Documents

Publication Publication Date Title
EP1442364B1 (en) System and method to reduce execution of instructions involving unreliable data in a speculative processor
US7473293B2 (en) Processor for executing instructions containing either single operation or packed plurality of operations dependent upon instruction status indicator
CN104657110B (en) Instruction cache with fixed number of variable length instructions
US9262327B2 (en) Signature based hit-predicting cache
EP1886217B1 (en) Caching instructions for a multiple-state processor
WO2006089194A2 (en) Unaligned memory access prediction
US9652234B2 (en) Instruction and logic to control transfer in a partial binary translation system
US20120204008A1 (en) Processor with a Hybrid Instruction Queue with Instruction Elaboration Between Sections
JP5335440B2 (en) Early conditional selection of operands
CN116339832A (en) Data processing device, method and processor
EP4020191A1 (en) Alternate path decode for hard-to-predict branch
US7346737B2 (en) Cache system having branch target address cache
US20220197660A1 (en) Context-based loop branch prediction
US20220197661A1 (en) Context-based memory indirect branch target prediction
US20210089305A1 (en) Instruction executing method and apparatus
CN111813447B (en) Processing method and processing device for data splicing instruction
US20230091167A1 (en) Core-based speculative page fault list
US11809873B2 (en) Selective use of branch prediction hints
CN112559037B (en) Instruction execution method, unit, device and system
CN113568663A (en) Code prefetch instruction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination