CN114217859A

CN114217859A - Device and method for submitting instructions out of order

Info

Publication number: CN114217859A
Application number: CN202111353500.XA
Authority: CN
Inventors: 刘权胜; 余红斌; 刘磊
Original assignee: Guangdong Saifang Technology Co ltd
Current assignee: Guangdong Saifang Technology Co ltd
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2022-03-22

Abstract

The invention discloses a device and a method for committing instructions out of order. The device comprises: an instruction cache; an instruction fetch module in signal connection with the instruction cache; an instruction queue module and a CPU branch processing module, each in signal connection with the instruction fetch module; a decoder in signal connection with the instruction queue module; a uop queue module in signal connection with the decoder; a rename register in signal connection with the uop queue module; an allocation module in signal connection with the rename register; a scheduler module in signal connection with the allocation module; and a plurality of execution units in signal connection with the scheduler module, wherein the execution units are in signal connection with a reorder buffer, and one of the execution units is in signal connection with a data buffer. The invention allows instructions that follow a not-yet-completed instruction in the pipeline to be committed out of order, reducing the performance loss caused by long-latency instructions or by L2/memory access delay.

Description

Device and method for submitting instructions out of order

Technical Field

The invention relates to the technical field of computers, in particular to a device and a method for submitting instructions out of order.

Background

Microprocessors have made tremendous progress in just a few decades. Processor performance has been improved continuously on several fronts, including hardware architecture, process technology, and the combination of software and hardware. Hardware architectures have evolved from single-issue scalar to multi-issue superscalar designs; from the earliest 3-stage pipelines to pipelines of a few dozen stages; from in-order execution to out-of-order execution; from no cache to multi-level caches; and from a single physical core to chip multiprocessors (CMP) and from a single logical core to multiple logical cores (SMT). Even in clustered systems for supercomputing, the instruction-level parallelism and thread-level parallelism exploited by processors have advanced greatly. The instruction-level parallel bandwidth demanded of a single-core microprocessor keeps rising, and the logic complexity of the chip grows many times over.

In existing microprocessors, every instruction must finish executing before it can be committed, and instructions are committed in order. While a long-latency instruction or a memory access instruction is unfinished, it blocks the pipeline: it prevents all subsequent instructions from committing, so the pipeline stalls.

Currently, server pipelines process up to 8 instructions per clock cycle, and pipelines in terminal devices process up to 6 instructions per clock cycle. CPU designers pursue better performance through such high-bandwidth processing capability. Instructions in the same clock cycle may depend on one another or on instructions from earlier clock cycles. Load, store, AMO and UC type instructions must access memory; on a DCACHE miss they may need to access L2, L3 and main memory, which can block the pipeline for hundreds of clock cycles and reduce the execution efficiency of subsequent instructions that do not depend on them.

Disclosure of Invention

In view of the deficiencies of the prior art, the invention aims to provide a device for out-of-order commit of instructions that allows instructions following a not-yet-completed instruction in the pipeline to be committed out of order, reducing the performance loss caused by long-latency instructions or by L2/memory access delay. To achieve the above objects and other advantages in accordance with the present invention, there is provided an apparatus for out-of-order commit of instructions, comprising:

an instruction cache; an instruction fetch module in signal connection with the instruction cache; an instruction queue module and a CPU branch processing module, each in signal connection with the instruction fetch module; a decoder in signal connection with the instruction queue module; a uop queue module in signal connection with the decoder; a rename register in signal connection with the uop queue module; an allocation module in signal connection with the rename register; a scheduler module in signal connection with the allocation module; and a plurality of execution units in signal connection with the scheduler module, wherein the execution units are in signal connection with a reorder buffer, and one of the execution units is in signal connection with a data buffer.

Preferably, the instruction cache memory is in signal connection with a CPU secondary cache module, the CPU secondary cache module is in signal connection with a CPU tertiary cache module, the CPU tertiary cache module is in signal connection with a memory module, and the CPU secondary cache module is in signal connection with a reordering buffer.

Preferably, the instruction cache is used for completing the translation from a physical address to a virtual address, the cache line management and instruction fetching according to the address of the instruction fetching module;

the instruction fetch module is used for selecting the fetch address according to conditions from the BPU, the execution units and the like, and for selecting the valid instructions according to the address offset;

the instruction queue module is used for caching macro instructions and performing the preliminary checks for translating macro instructions into UOPs;

the decoder is used to translate macro instructions into UOPs;

the UOP queue module is used for caching UOPs and for the loop processing of LOOP instructions;

the rename registers are used to map architectural registers to physical registers and to accomplish management of free physical registers.

Preferably, the allocation module is configured to complete allocation and recovery of entries in the scheduler module, the reorder buffer, the load buffer, and the store buffer;

the scheduler module is used for completing functions such as wake-up, arbitration, issue, prediction, and replay of UOPs;

the execution units are configured to complete the execution of all UOPs, including branch instructions, additions, subtractions, and all other micro-operations;

the data buffer is used for completing data reads and writes, including handling accesses that cross cache lines or pages; the data buffer and the instruction cache share the CPU secondary cache module;

the reorder buffer is used to maintain in-order commit of instructions, including maintaining the atomicity of macro instructions, the processing of events, the UOP state, and the architectural state used for RAT recovery in the renaming phase.

A method of out-of-order commit of instructions, comprising the steps of:

S1, determining whether a long-latency instruction can generate an exception, and sending the status to the reorder buffer;

S2, in the reorder buffer, allowing instructions that follow the not-yet-completed instruction to be committed, i.e., committing instructions out of order;

S3, when execution of the long-latency instruction completes, committing that instruction.
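Steps S1 to S3 can be illustrated with a small software model. This is only a sketch under assumptions: the `RobEntry` class, its `pending` flag, and the `release_entries` function are hypothetical names introduced here, and a real reorder buffer is hardware, not a Python list. The model contrasts a conventional in-order commit scan with a scan that also lets entries marked "exception-free but still executing" be passed.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    name: str
    done: bool = False     # execution finished
    pending: bool = False  # hypothetical flag: reported exception-free, result still outstanding


def release_entries(rob, allow_pending):
    """Scan from the oldest entry and return the names whose ROB entries can be released."""
    released = []
    for entry in rob:
        if entry.done or (allow_pending and entry.pending):
            released.append(entry.name)
        else:
            break  # an unfinished, unreported instruction still blocks younger ones
    return released


rob = [
    RobEntry("load", pending=True),  # S1: cache miss, but known not to raise an exception
    RobEntry("add", done=True),
    RobEntry("sub", done=True),
    RobEntry("mul"),                 # still executing, no early report
    RobEntry("xor", done=True),
]

in_order = release_entries(rob, allow_pending=False)
out_of_order = release_entries(rob, allow_pending=True)
print(in_order)      # []
print(out_of_order)  # ['load', 'add', 'sub']
```

With the conventional rule the unfinished load blocks everything behind it; with the pending flag, the load's entry and the two completed instructions behind it are released, and the load itself is truly committed only later (S3).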

Preferably, the method further comprises the following steps:

(1) in the renaming stage, according to the destination register of each instruction, allocating a free physical register from a physical register management queue;

(2) in the dispatch stage, instructions are dispatched to the reorder buffer while the instructions enter the execution units;

(3) after the execution of the instruction by the execution unit is completed, writing the execution result of the instruction into a physical register, and updating the execution completion state to a reordering buffer;

(4) in the reorder buffer, instructions are committed in program order; the reorder buffer entries are released, the architectural register mapping table is updated, and the freed physical registers are written back to the physical register management queue.
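The four steps above describe the conventional rename, dispatch, write-back and in-order commit flow. The following minimal sketch models that baseline; the `Renamer` class and its field names are assumptions made for illustration, and the model tracks only destination registers.

```python
from collections import deque


class Renamer:
    """Toy model of rename, dispatch, completion and in-order commit (names are hypothetical)."""

    def __init__(self, num_arch, num_phys):
        # Architectural register i starts out mapped to physical register i.
        self.spec_map = list(range(num_arch))         # mapping used at rename time
        self.arch_map = list(range(num_arch))         # committed mapping
        self.free = deque(range(num_arch, num_phys))  # physical register management queue
        self.rob = deque()                            # reorder buffer, oldest first

    def rename(self, arch_dst):
        new_phys = self.free.popleft()                # step (1): allocate a free register
        self.spec_map[arch_dst] = new_phys
        entry = {"dst": arch_dst, "phys": new_phys, "done": False}
        self.rob.append(entry)                        # step (2): dispatch into the ROB
        return entry

    def complete(self, entry):
        entry["done"] = True                          # step (3): result written, state updated

    def commit(self):
        committed = 0
        while self.rob and self.rob[0]["done"]:       # step (4): strictly in order
            entry = self.rob.popleft()
            old_phys = self.arch_map[entry["dst"]]
            self.arch_map[entry["dst"]] = entry["phys"]
            self.free.append(old_phys)                # previous mapping becomes free
            committed += 1
        return committed


r = Renamer(num_arch=4, num_phys=8)
a = r.rename(1)
b = r.rename(2)
r.complete(b)
stalled = r.commit()  # the older instruction a is unfinished, so nothing commits
r.complete(a)
retired = r.commit()  # both instructions now commit in order
print(stalled, retired, r.arch_map, list(r.free))
```

In this baseline the younger instruction `b` cannot commit while `a` is unfinished, which is exactly the stall that the out-of-order commit method is meant to avoid.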

Preferably, the reorder buffer must determine whether the instruction to be committed has completed execution. When the instruction writes a destination register, the instruction counts as completed only once the data for the destination register has been obtained.

Preferably, while a division instruction is executing in the divider, if it is determined that the division instruction will not generate an exception, instructions subsequent to the division instruction may commit first and free physical registers even though the division has not finished. The division instruction itself is committed after its execution finishes. Likewise, when a memory access instruction needs to read L2 and the instruction will not raise an exception, it does not block the instructions that follow it, and those instructions are committed early. The memory access instruction is committed once its execution completes.

Preferably, after the division instruction enters the divider, it is checked for exceptions such as data overflow or a divisor of 0; if the division instruction cannot generate an exception, a no-exception report is sent to the ROB. When load, AMO, LR/SC and similar instructions enter the execution unit, the TLB is queried first to determine whether the instruction's address is illegal; if the instruction will not generate an exception, the no-exception report is sent to the ROB even if L2 or memory still has to be accessed.
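The early exception check can be sketched as two small predicates. Everything here is an assumption for illustration: the function names, the three-valued result, the 4 KiB page size, and the dictionary standing in for a TLB are not taken from the patent, and a real check would cover more conditions (alignment, access type, privilege).

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


def div_early_report(divisor, traps_on_zero):
    """Decide at divider entry whether the division can still raise an exception."""
    if traps_on_zero and divisor == 0:
        return "exception"
    return "no_exception"


def load_early_report(vaddr, tlb):
    """Decide after the TLB lookup whether a load can still raise an exception."""
    entry = tlb.get(vaddr >> PAGE_SHIFT)
    if entry is None:
        return "unknown"       # TLB miss: the outcome is not known yet
    if not entry["readable"]:
        return "exception"     # illegal access
    return "no_exception"      # safe to let younger instructions commit


# Hypothetical TLB contents keyed by virtual page number.
tlb = {0x10: {"readable": True}, 0x11: {"readable": False}}

print(div_early_report(5, traps_on_zero=True))  # no_exception
print(div_early_report(0, traps_on_zero=True))  # exception
print(load_early_report(0x10040, tlb))          # no_exception
print(load_early_report(0x11000, tlb))          # exception
print(load_early_report(0x20000, tlb))          # unknown
```

Only the `no_exception` outcome would justify letting younger instructions commit ahead of the long-latency instruction; the other two outcomes keep the conventional in-order behaviour.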

Compared with the prior art, the invention has the following beneficial effects: instructions that follow a not-yet-completed instruction in the pipeline can be committed out of order, which reduces the performance loss caused by long-latency instructions or by L2/memory access delay, and the method is not limited to any particular instruction set, architecture, process or other condition.

Drawings

FIG. 1 is a diagram of the CPU pipeline architecture of the apparatus and method for out-of-order commit of instructions according to the present invention;

FIG. 2 is a block diagram of out-of-order commit of instructions in the ROB according to the apparatus and method of the present invention;

FIG. 3 is a flow diagram of long-latency instruction execution according to the apparatus and method of the present invention;

FIG. 4 is a diagram of out-of-order commit of instructions according to the apparatus and method of the present invention;

FIG. 5 is a flowchart of load, store and DIV instruction commit processing according to the apparatus and method of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIGS. 1-5, an apparatus for out-of-order commit of instructions comprises: an instruction cache; an instruction fetch module in signal connection with the instruction cache; an instruction queue module and a CPU branch processing module, each in signal connection with the instruction fetch module; a decoder in signal connection with the instruction queue module; a uop queue module in signal connection with the decoder; a rename register in signal connection with the uop queue module; an allocation module in signal connection with the rename register; a scheduler module in signal connection with the allocation module; and a plurality of execution units in signal connection with the scheduler module, wherein the execution units are in signal connection with a reorder buffer, and one of the execution units is in signal connection with a data buffer.

Furthermore, the instruction cache memory is in signal connection with a CPU second-level cache module, the CPU second-level cache module is in signal connection with a CPU third-level cache module, the CPU third-level cache module is in signal connection with a memory module, and the CPU second-level cache module is in signal connection with a reordering buffer area.

Further, the instruction cache is used for completing the translation from a physical address to a virtual address, the cache line management and instruction fetching according to the address of the instruction fetching module;

the decoder is used to translate macro instructions into UOPs;

Further, the allocation module is used for completing allocation and recovery of entries in the scheduler module, the reorder buffer, the load buffer and the store buffer;

A method of out-of-order commit of instructions, comprising the steps of:

Further, the method also comprises the following steps:

Further, the reorder buffer must determine whether the instruction to be committed has completed execution. When the instruction writes a destination register, the instruction counts as completed only once the data for the destination register has been obtained.

Further, while a division instruction is executing in the divider, if it is determined that the division instruction will not generate an exception, instructions subsequent to the division instruction may commit first and free physical registers even though the division has not finished. The division instruction itself is committed after its execution finishes. Likewise, when a memory access instruction needs to read L2 and the instruction will not raise an exception, it does not block the instructions that follow it, and those instructions are committed early. The memory access instruction is committed once its execution completes.

Furthermore, after the division instruction enters the divider, it is checked for exceptions such as data overflow or a divisor of 0; if the division instruction cannot generate an exception, a no-exception report is sent to the ROB. When load, AMO, LR/SC and similar instructions enter the execution unit, the TLB is queried first to determine whether the instruction's address is illegal; if the instruction will not generate an exception, the no-exception report is sent to the ROB even if L2 or memory still has to be accessed.

Table 1. Exception-check report control signals generated by the execution unit

After receiving a report from the execution unit, the ROB compares it with its entries. When Iu_ROB_index hits a ROB entry, the ROB determines whether the report describes an instruction that has not finished executing and will not generate an exception; if so, pending_val at position Iu_ROB_index of the ROB is set to 1. If Iu_ROB_ie_n_val is 0, indicating that the current instruction has completed execution, pending_val at position Iu_ROB_index is set to 0 and status in the ROB is set to the done state.

When the commit pointer commit_ptr points to an instruction in the ROB whose pending_val is 1, the ROB entry is released, and the instructions after that entry pass the current instruction and commit out of order. A committing instruction updates the architectural register mapping table according to its architectural register number dst_index and frees a physical register. If pending_val of the instruction updating the architectural register is set, the corresponding entry of the architectural register mapping table also sets its pending_val flag. When a physical register is released from the architectural register mapping table, its pending_val is carried over to the free physical register management queue.

When an instruction needs a free physical register for its destination register in the renaming stage, a physical register whose pending_val is 1 must not be allocated, because the instruction that writes it has not finished executing, i.e., has not truly committed. Only free physical registers whose pending_val is 0 can be allocated. When the long-latency instruction completes execution, it is committed, and pending_val in the reorder buffer, the architectural register mapping table and the free physical register management queue is set to 0; only then is the commit of the instruction complete.

In FIG. 5, the free physical register queue has k registers, the architectural register mapping table has m entries, and the reorder buffer has n entries. The execution unit determines that the instruction in entry 9 of the reorder buffer is a long-latency instruction, and after the reorder buffer receives the report from the execution unit it sets pending_val of entry 9 to 1. After the instruction is committed, pending_val of entry 2 of the architectural register mapping table, i.e., the entry selected by the dst_index of 2 recorded in the reorder buffer, is also set to 1. When the register is released back to the free physical register management queue, pending_val of physical register 17 is 1. The instruction corresponding to physical register 17 has not finished executing and has not truly committed, so physical register 17 must not be allocated to a renaming instruction until the execution unit returns the data of the instruction. When the execution unit finishes executing, pending_val in the reorder buffer, the architectural register mapping table and the free physical register queue is cleared to 0 at the same time; at that point the instruction is truly committed, and physical register 17 may be allocated to an instruction requesting a free physical register for renaming.
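The allocation rule for the free physical register management queue can be modelled as follows. This is a sketch under assumptions: the `FreeList` class and its method names are hypothetical, and pending_val is modelled as membership in a set rather than as a per-entry bit.

```python
from collections import deque


class FreeList:
    """Toy free physical register queue with a per-register pending_val flag."""

    def __init__(self):
        self.queue = deque()  # free physical register numbers, oldest first
        self.pending = set()  # registers whose pending_val is 1

    def release(self, phys, pending_val=False):
        # A register can be released while its producing instruction is still executing.
        self.queue.append(phys)
        if pending_val:
            self.pending.add(phys)

    def clear_pending(self, phys):
        # Called when the execution unit finally returns the result.
        self.pending.discard(phys)

    def allocate(self):
        # Only registers with pending_val == 0 may be handed to the rename stage.
        for phys in list(self.queue):
            if phys not in self.pending:
                self.queue.remove(phys)
                return phys
        return None  # nothing allocatable: rename has to wait


free_list = FreeList()
free_list.release(17, pending_val=True)  # producer of register 17 has not finished
free_list.release(18)

first = free_list.allocate()   # 18: register 17 is skipped
second = free_list.allocate()  # None: only the pending register is left
free_list.clear_pending(17)
third = free_list.allocate()   # 17: now allocatable
print(first, second, third)
```

The numbers mirror the example in the text: physical register 17 sits in the free queue but stays unallocatable until its pending_val is cleared.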

Assume the RISC-V instruction sequence shown in the following table, in which every instruction is assigned a reorder buffer entry, a load instruction buffer entry, a store instruction buffer entry and a destination register according to its type. The free physical register queue has k registers; the load instruction buffer has p entries; the store instruction buffer has q entries; the reorder buffer has n entries; and the architectural register mapping table has m entries.

Table 2. Example instruction sequence

When the 2 load instructions numbered 2 and 3 execute in the LSU and both determine that they will not generate an exception (for example, each instruction has obtained its PA and the access permissions of that PA raise no exception), then even if a load obtains no data at DC3, so that a DCACHE miss occurs and the data must be read from L2/memory, the load reports its execution state to the reorder buffer and pending_val of its reorder buffer entry is set to 1. At the same time, entry 1 and entry 2 of the load instruction buffer are released, and pending_val of those load instruction buffer entries is set to 1. When the commit pointer commit_ptr in the reorder buffer points to entry 1 in the ROB, the 2 load instructions release their reorder buffer entries and allow the instructions following the loads to commit. The 2 load instructions have not truly committed; they merely allow the instructions after them to commit out of order. The 2 load instructions are still executing in the LSU, and their status bit pending_val is also propagated to the architectural register mapping table and the free physical register queue. When a new instruction is renamed in the renaming stage, a register in the free physical register queue whose status bit pending_val is 1 cannot be allocated to the new instruction, because the instruction that writes that register has not finished executing and has not truly committed. Similarly, entries whose status bit pending_val is 1 in the free load instruction management queue and the free store instruction management queue cannot be allocated to new instructions.

Only after the load instruction in the LSU obtains its data does the LSU send an instruction execution completion signal, which clears pending_val to 0 in the physical register, the reorder buffer, the load instruction buffer entry, the store instruction buffer entry and the architectural register mapping table. Once pending_val of a physical register, a load instruction buffer entry or a store instruction buffer entry is 0, it may be allocated to a new instruction.

The AMO type instruction Amoxor, numbered 6, first reads data from memory, then performs an xor operation, and then writes the result back to memory. Once the load and store attributes of the AMO are determined not to cause an exception, out-of-order commit of subsequent instructions is supported, and the processing flow is similar to that of a load instruction. The Divw instruction numbered 9 has a long execution latency, and the Divw instruction defined by the RISC-V instruction set does not generate exceptions, so the Divw instruction sends its execution state to the reorder buffer as soon as it enters the divider pipeline. The processing flow of the division instruction is also similar to that of a load instruction.
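The behaviour of the two instructions can be illustrated with a simplified functional model. The function names and the dictionary standing in for memory are assumptions; the semantics follow the RISC-V definition, in which division by zero and signed overflow return defined results instead of trapping. Sign extension of the 32-bit results to 64 bits is omitted.

```python
MASK32 = 0xFFFFFFFF


def amoxor_w(memory, addr, rs2):
    """Read the word at addr, write back (old XOR rs2), and return the old value for rd."""
    old = memory[addr]
    memory[addr] = (old ^ rs2) & MASK32
    return old


def divw(a, b):
    """Signed 32-bit division that never raises an exception."""
    if b == 0:
        return -1                      # division by zero yields all ones
    if a == -2**31 and b == -1:
        return a                       # signed overflow yields the dividend
    q = abs(a) // abs(b)               # quotient rounds toward zero
    return q if (a < 0) == (b < 0) else -q


memory = {0x100: 0b1100}
old = amoxor_w(memory, 0x100, 0b1010)
print(old, memory[0x100])              # 12 6
print(divw(7, 0), divw(-2**31, -1), divw(-7, 2))
```

Because neither corner case of the division traps, a divider can report "no exception" as soon as the instruction enters it, which is the property the text relies on.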

After the LSU finishes executing the load instruction, it issues an instruction execution completion signal, which also serves as the commit indication for the load instruction. The load instruction writes its data to the physical register and forwards the data to the instructions that depend on it.

The instructions complete execution at the execution unit and generate instruction execution complete signals, as shown in the following table.

Table 3. Instruction execution completion signals generated by the execution unit

Signal name	Description
Iu_cmt_val	Instruction execution completion signal sent by the execution unit
Iu_cmt_dst_val	Indicates that the instruction has a destination register
Iu_cmt_dst_index	Architectural register encoding of the destination register
Iu_cmt_dst_phy	Physical register encoding of the destination register
Iu_cmt_load_index	Position of the load instruction in the load buffer
Iu_cmt_load_val	Load instruction execution completion indication
Iu_cmt_store_index	Position of the store instruction in the store buffer
Iu_cmt_store_val	Store instruction execution completion indication
Iu_cmt_index	Position of the instruction in the ROB
Iu_cmt_data	Data returned by the execution unit

The dst_phy field in entry Iu_cmt_index of the reorder buffer is compared with Iu_cmt_dst_phy, and if they are equal, pending_val is cleared to 0. The dst_phy field in entry Iu_cmt_dst_index of the architectural register mapping table is compared with Iu_cmt_dst_phy, and if they are equal, pending_val is cleared to 0. The pending_val of entry Iu_cmt_dst_phy of the free physical register management queue is cleared to 0. The pending_val of entry Iu_cmt_load_index of the free load instruction management queue is cleared to 0. The pending_val of entry Iu_cmt_store_index of the free store instruction management queue is cleared to 0. The physical registers of the instructions in the pipeline are compared, and the data Iu_cmt_data is forwarded to dependent instructions. The data of entry Iu_cmt_dst_phy of the physical register management queue is updated to Iu_cmt_data, and at this point instruction execution is complete.
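The handling of the completion signal can be sketched as one function that clears pending_val in each structure. The signal names are taken from Table 3; everything else is an assumption made for illustration, including the dictionary and list layouts and the choice to gate the register updates on Iu_cmt_dst_val and the buffer updates on Iu_cmt_load_val and Iu_cmt_store_val. Forwarding to dependent instructions is omitted.

```python
def on_completion(sig, rob, rat, free_pending, load_pending, store_pending, prf):
    """Clear pending_val everywhere once the execution unit reports completion."""
    if not sig["Iu_cmt_val"]:
        return
    phy = sig["Iu_cmt_dst_phy"]

    rob_entry = rob[sig["Iu_cmt_index"]]
    if rob_entry["dst_phy"] == phy:            # compare before clearing
        rob_entry["pending_val"] = 0

    if sig["Iu_cmt_dst_val"]:                  # assumed gating on a valid destination
        rat_entry = rat[sig["Iu_cmt_dst_index"]]
        if rat_entry["dst_phy"] == phy:
            rat_entry["pending_val"] = 0
        free_pending[phy] = 0                  # register becomes allocatable again
        prf[phy] = sig["Iu_cmt_data"]          # write the returned data

    if sig["Iu_cmt_load_val"]:
        load_pending[sig["Iu_cmt_load_index"]] = 0
    if sig["Iu_cmt_store_val"]:
        store_pending[sig["Iu_cmt_store_index"]] = 0


# Hypothetical state: ROB entry 9 wrote architectural register 2 via physical register 17.
rob = [{"dst_phy": 0, "pending_val": 0} for _ in range(16)]
rob[9] = {"dst_phy": 17, "pending_val": 1}
rat = [{"dst_phy": i, "pending_val": 0} for i in range(4)]
rat[2] = {"dst_phy": 17, "pending_val": 1}
free_pending = {17: 1}
load_pending = [0, 1, 0]
store_pending = [0, 0]
prf = [0] * 32

sig = {
    "Iu_cmt_val": 1, "Iu_cmt_dst_val": 1, "Iu_cmt_dst_index": 2, "Iu_cmt_dst_phy": 17,
    "Iu_cmt_load_index": 1, "Iu_cmt_load_val": 1,
    "Iu_cmt_store_index": 0, "Iu_cmt_store_val": 0,
    "Iu_cmt_index": 9, "Iu_cmt_data": 0xABCD,
}
on_completion(sig, rob, rat, free_pending, load_pending, store_pending, prf)
print(rob[9], rat[2], free_pending, load_pending, hex(prf[17]))
```

The comparisons against dst_phy guard against clearing a flag that now belongs to a different, younger mapping of the same entry.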

For a division instruction, only the physical registers need to be compared, and the commit of the instruction is completed at the same time.

The number of devices and the scale of the processes described herein are intended to simplify the description of the invention, and applications, modifications and variations of the invention will be apparent to those skilled in the art.

While embodiments of the invention have been described above, it is not limited to the applications set forth in the description and the embodiments, which are fully applicable in various fields of endeavor to which the invention pertains, and further modifications may readily be made by those skilled in the art, it being understood that the invention is not limited to the details shown and described herein without departing from the general concept defined by the appended claims and their equivalents.

Claims

1. An apparatus for out-of-order commit of instructions, comprising:

2. The apparatus of claim 1, wherein the instruction cache is signally connected to a CPU level two cache module, the CPU level two cache module is signally connected to a CPU level three cache module, the CPU level three cache module is signally connected to a memory module, and the CPU level two cache module is signally connected to a reorder buffer.

3. The apparatus for out-of-order commit of instructions of claim 1, wherein the instruction cache is configured to perform physical address to virtual address translation, cache line management and instruction fetching according to the address of the instruction fetch module;

the decoder is used to translate macro instructions into UOPs;

4. The apparatus for out-of-order commit of instructions of claim 1, wherein the allocation module is configured to complete allocation and recovery of entries in the scheduler module, the reorder buffer, the load buffer and the store buffer;

5. A method of out-of-order commit of instructions according to claim 1, comprising the steps of:

6. A method for out-of-order commit of instructions according to claim 5 further comprising the steps of:

7. The method of claim 6, wherein the reorder buffer must determine whether the instruction to be committed has completed execution. When the instruction writes a destination register, the instruction counts as completed only once the data for the destination register has been obtained.

8. The method of out-of-order commit of claim 6, wherein while a division instruction is executing in the divider, if it is determined that the division instruction will not raise an exception, instructions subsequent to the division instruction may commit first and free physical registers even though the division has not finished. The division instruction itself is committed after its execution finishes. Likewise, when a memory access instruction needs to read L2 and the instruction will not raise an exception, it does not block the instructions that follow it, and those instructions are committed early. The memory access instruction is committed once its execution completes.

9. The method of claim 6, wherein after the division instruction enters the divider, it is checked for exceptions such as data overflow or a divisor of 0; if the division instruction cannot generate an exception, a no-exception report is sent to the ROB. When load, AMO, LR/SC and similar instructions enter the execution unit, the TLB is queried first to determine whether the instruction's address is illegal; if the instruction will not generate an exception, the no-exception report is sent to the ROB even if L2 or memory still has to be accessed.