CN115729628A - Method for early commit of superscalar microprocessor store instructions without waiting for store data - Google Patents

Info

Publication number
CN115729628A
Authority
CN
China
Prior art keywords
instruction
data
rob
store
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211480671.3A
Other languages
Chinese (zh)
Inventor
尹飞
路冬冬
范好好
颜世云
何军
冯烁
蒋生健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI HIGH-PERFORMANCE INTEGRATED CIRCUIT DESIGN CENTER
Original Assignee
SHANGHAI HIGH-PERFORMANCE INTEGRATED CIRCUIT DESIGN CENTER
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHANGHAI HIGH-PERFORMANCE INTEGRATED CIRCUIT DESIGN CENTER
Priority to CN202211480671.3A
Publication of CN115729628A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a method for early commit of store instructions in a superscalar microprocessor without waiting for the store data, comprising the following steps: obtaining a memory access instruction from an instruction sequence, wherein memory access instructions comprise load instructions and store instructions; decomposing the store instruction into an address-calculation microinstruction and a store-data microinstruction, which enter the issue queue together and are then issued and executed independently; issuing the address-calculation microinstruction with priority, which performs the memory address calculation, the virtual-to-physical address translation, and the access to the memory subsystem; and, if the address-calculation microinstruction hits in the level-1 data cache (DCache) and obtains write permission, and the store instruction has become the oldest instruction in the instruction sequence, committing the store instruction regardless of whether the store-data microinstruction has been issued and executed. The invention can significantly improve the performance of a superscalar processor.

Description

Method for early commit of superscalar microprocessor store instructions without waiting for store data
Technical Field
The invention relates to the technical field of microarchitecture design optimization for superscalar microprocessors, and in particular to a method for early commit of superscalar microprocessor store instructions without waiting for the store data.
Background
Current superscalar microprocessors support out-of-order issue, out-of-order execution, speculative execution, and in-order commit. As shown in FIG. 1, the instruction pipeline typically includes several basic stages: fetch (Fetch), decode (Decode), register rename (Map), issue (Issue), execute (Execute), and commit (Retire). The pipeline contains a full-instruction reorder buffer (ROB), which ensures that all instructions commit in program order. In program order, instructions that commit earlier are called older instructions and instructions that commit later are called younger instructions.
An important measure of processor performance, the number of instructions executed per clock cycle (IPC), is closely related to the fetch, decode, and issue widths of the instruction pipeline and to the speed at which instructions execute and commit. In a reduced instruction set computer (RISC), instructions fall roughly into two types. The first is arithmetic instructions, whose operands all reside in registers inside the processor and whose execution takes a fixed number of clock cycles. The second is memory access instructions, whose data may reside in any level of the processor's caches or in external main memory, so their execution time is not fixed: some take a few clock cycles and some take hundreds. Memory access instructions are therefore often the main factor limiting the instruction commit rate.
Memory access instructions are divided into load instructions (Load) and store instructions (Store). To control out-of-order execution of memory access instructions, the processor contains a dedicated memory access queue that buffers in-flight memory access instructions. Physically, this queue may be a single merged queue (MQ) that holds both load and store instructions, or two independent queues, one holding load instructions (LQ) and one holding store instructions (SQ).
As shown in FIG. 2, a load instruction can report completion to the ROB once its data has been read from the level-1 data cache (DCache) or the main memory system and written into a register; the ROB then commits the load in program order. A store instruction first accesses the DCache or the main memory system to obtain write permission and the latest copy of the data, and then waits for the ROB to notify it that it has become the oldest instruction in the instruction sequence. Only then can the data be written into the DCache or sent to the main memory system and completion be reported to the ROB, after which the ROB commits the store in program order. The SQ and the ROB thus depend on each other, which makes store instructions commit more slowly than load instructions.
To guarantee correct program results, every memory consistency model requires that when a younger load instruction accesses the same address as an older store instruction, the load must read the data written by the store. Superscalar microprocessors commonly use register renaming and scoreboarding to implement out-of-order issue and out-of-order execution and thereby improve the IPC of the processor core. The register numbers carried in an instruction are logical register numbers, and the number of logical registers is limited by the instruction encoding (usually 32). The processor contains a larger number of physical registers, and register renaming allows more instructions that have no true register dependence to execute in parallel.
In the issue stage, all source registers of an instruction are checked for readiness, arbitration for the register write port or issue port is performed, and the other issue conditions are evaluated. A load instruction has only one source register, which specifies the main memory address to be loaded (read). A store instruction has two source registers: one specifies the main memory address to be stored (written), and the other holds the data to be stored. For a load and a store that access the same main memory address, the source register specifying the address is usually the same register. Once that address register is ready, the load may be issued before the store, because the store must additionally wait until the source register holding its data is ready. When the store instruction is later issued to the memory access unit, the load and all subsequent instructions are invalidated, and the instruction pipeline is notified to restart fetching and execution from the load. This process directly introduces pipeline bubbles, which both hurt processor performance and waste power.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for early commit of superscalar microprocessor store instructions without waiting for the store data. The method implements the function of the store instruction correctly, does not change the original memory ordering model, speeds up the instruction pipeline by reducing the conditions required for a store instruction to commit, and significantly improves the performance of the superscalar processor.
The technical solution adopted by the invention to solve this problem is a method for early commit of superscalar microprocessor store instructions without waiting for the store data, comprising: obtaining a memory access instruction from an instruction sequence, wherein memory access instructions comprise load instructions and store instructions; decomposing the store instruction into an address-calculation microinstruction and a store-data microinstruction, which enter the issue queue together and are then issued and executed independently;
issuing the address-calculation microinstruction with priority, which performs the memory address calculation, the virtual-to-physical address translation, and the access to the memory subsystem; and, if the address-calculation microinstruction hits in the level-1 data cache (DCache) and obtains write permission, and the store instruction has become the oldest instruction in the instruction sequence, committing the store instruction regardless of whether the store-data microinstruction has been issued and executed.
When the address-calculation microinstruction and the store-data microinstruction enter the issue queue together:
if the virtual-to-physical address translation raises an exception or the access permission check fails, the exception is recorded in the SQ, and after the exceptional completion is reported to the ROB the SQ entry enters the "waiting for exception handling" state;
if the virtual-to-physical address translation completes normally and the access permission check passes, but the access misses in the DCache or write permission is not obtained, the write request is suspended and the SQ entry enters the "waiting for DCache to obtain a writable copy" state;
if the virtual-to-physical address translation completes normally, the access permission check passes, and the access hits in the DCache and obtains write permission, but the ROB does not yet allow the store instruction to commit, the SQ entry enters the "waiting for ROB to allow commit" state;
if the virtual-to-physical address translation completes normally, the access permission check passes, the access hits in the DCache and obtains write permission, and the ROB allows the store instruction to commit, the SQ entry enters the "ready to report completion to ROB" state.
In the "waiting for exception handling" state, the SQ entry waits for the store instruction to become the oldest instruction in the ROB. Specifically: if the store instruction is not the oldest instruction in the ROB, the entry remains in the current state; once the store instruction becomes the oldest instruction in the ROB, the SQ entry is deleted and exception handling is performed.
In the "waiting for DCache to obtain a writable copy" state, the SQ entry waits for the data copy and write permission to be obtained from the next-level cache or main memory. Specifically: if the DCache has not obtained a writable copy, the entry remains in the current state; once the DCache obtains the writable copy, it is further determined whether the ROB allows the store instruction to commit. If the ROB does not allow it, the entry moves to the "waiting for ROB to allow commit" state; if the ROB allows it, the entry moves to the "ready to report completion to ROB" state.
In the "waiting for ROB to allow commit" state, the SQ entry waits for the ROB to give the commit-allowed signal. Specifically: if the commit-allowed signal has not been received from the ROB, the entry remains in the current state; once the signal is received, the entry moves to the "ready to report completion to ROB" state.
In the "ready to report completion to ROB" state, the SQ entry requests to report completion to the ROB. If the report does not succeed, the entry remains in the current state and continues to request; after completion has been successfully reported to the ROB, it is further determined whether the store data has been received. If the store data has not been received, the entry moves to the "waiting for store data" state; if it has been received, the entry moves to the "ready to write DCache" state.
In the "waiting for store data" state, the SQ entry waits for the store-data microinstruction to be issued to the memory access unit and to record its data in the SQ entry. If the store data has not been received, the entry remains in this state; once the store data is received, the SQ entry moves to the "ready to write DCache" state;
in the "ready to write DCache" state, the SQ entry requests to write the DCache. If the DCache write has not completed, the entry remains in this state and continues to request the write; once the store data has been successfully written into the DCache, the SQ entry is released.
The committed store instruction waits in a buffer. Once the store-data microinstruction meets its issue conditions it is issued, reads the store data from the register file, and enters the buffer, where it merges with the address-calculation microinstruction; the store data is then written into the level-1 data cache, after which it is visible to the memory model.
When a physical register number about to be reallocated equals the physical source register number of the store-data microinstruction, register renaming of the current instruction is blocked until the store-data microinstruction has been issued.
When the address of a coherence request from outside the core is the same as the address of the store instruction, the coherence request must be blocked until the store instruction has written its data into the level-1 data cache.
The cache line in the level-1 data cache accessed by the store instruction cannot be evicted by other, younger memory access instructions until the store instruction has written its data into the level-1 data cache.
Advantageous effects
By adopting the above technical solution, the invention has the following advantages and positive effects compared with the prior art. In deciding whether a store instruction can report completion to the ROB, the invention depends only on whether the ROB allows the store instruction to commit and on whether the address-calculation microinstruction split from it has been issued to the memory access unit and has completed the virtual-to-physical address translation, the permission check, and a hit on a writable copy in the DCache. It does not depend on whether the store-data microinstruction has read the data to be stored from the register file, been issued to the memory access unit, and written the data into the SQ entry. On the one hand, this reduces the conditions a single store instruction must meet to become committable, so resources such as its ROB entry can be released earlier and subsequent instructions enter the out-of-order execution window sooner. On the other hand, to support issuing and executing several store instructions in parallel, several sets of pipeline resources must be provided for address-calculation microinstructions, whereas fewer pipeline resources can be provided for store-data microinstructions. Because the execution speed of the store-data microinstruction is decoupled from the execution speed of the store instruction, this resource configuration does not affect performance in most cases, reduces hardware overhead and physical implementation difficulty, and helps improve the chip's performance-per-watt.
Drawings
FIG. 1 is a schematic diagram of an instruction pipeline in the background of the invention;
FIG. 2 is a diagram illustrating out-of-order execution of load and store instructions according to the background of the invention;
FIG. 3 is a diagram illustrating SQ entry management according to an embodiment of the present invention.
Detailed Description
The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.
The embodiment of the invention relates to a method for early commit of superscalar microprocessor store instructions without waiting for the store data, comprising the following steps:
obtaining a memory access instruction from an instruction sequence, wherein memory access instructions comprise load instructions and store instructions;
decomposing the store instruction into an address-calculation microinstruction and a store-data microinstruction, which enter the issue queue together and are then issued and executed independently;
issuing the address-calculation microinstruction with priority, which performs the memory address calculation, the virtual-to-physical address translation, and the access to the memory subsystem; and, if the address-calculation microinstruction hits in the level-1 data cache (DCache) and obtains write permission, and the store instruction has become the oldest instruction in the instruction sequence, committing the store instruction regardless of whether the store-data microinstruction has been issued and executed.
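The commit condition in the last step can be sketched as a simple predicate. This is a minimal illustration rather than the hardware logic of the invention: the names `StoreStatus`, `can_commit_baseline`, and `can_commit_early` are hypothetical, and the baseline function is only an assumed model of a conventional scheme, included for contrast.

```python
from dataclasses import dataclass


@dataclass
class StoreStatus:
    """Hypothetical per-store status bits (illustrative names only)."""
    addr_translated_ok: bool   # address microinstruction translated without exception
    dcache_hit_writable: bool  # line is in the DCache with write permission
    oldest_in_rob: bool        # store is the oldest instruction in the sequence
    data_received: bool        # store-data microinstruction has delivered its data


def can_commit_baseline(s: StoreStatus) -> bool:
    # Assumed conventional scheme: commit also waits for the store data.
    return (s.addr_translated_ok and s.dcache_hit_writable
            and s.oldest_in_rob and s.data_received)


def can_commit_early(s: StoreStatus) -> bool:
    # Scheme described above: data_received is deliberately not consulted.
    return s.addr_translated_ok and s.dcache_hit_writable and s.oldest_in_rob


status = StoreStatus(addr_translated_ok=True, dcache_hit_writable=True,
                     oldest_in_rob=True, data_received=False)
print(can_commit_baseline(status), can_commit_early(status))
```

The only difference between the two predicates is the `data_received` term, which is the dependence the method removes from the commit decision.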
Further, the committed store instruction waits in a dedicated buffer. The store-data microinstruction is issued once it meets its issue conditions, reads the store data from the register file, enters the buffer, and merges with the address-calculation microinstruction; the store data is then written into the level-1 data cache, after which it is visible to the memory model.
Further, to ensure that the store-data microinstruction can still read the store data from its source register after the store instruction has committed, the register renaming logic must apply additional control: when a physical register number about to be reallocated equals the physical source register number of the store-data microinstruction, register renaming of the current instruction is blocked until the store-data microinstruction has been issued.
Further, a store instruction that has committed but has not yet received its store data can neither be rolled back nor yet be seen by the memory model. Therefore, when the address of a coherence request from outside the core is the same as the address of the store instruction (addresses are compared at the minimum granularity handled by the cache coherence protocol), the coherence request must be blocked until the store instruction has written its data into the level-1 data cache.
Further, a store instruction that has committed but has not yet received its store data can neither be rolled back nor yet be seen by the memory model. Therefore, the cache line in the level-1 data cache accessed by the store instruction cannot be evicted by other, younger memory access instructions until the store instruction has written its data into the level-1 data cache.
Further, a committed store instruction can release instruction pipeline resources, such as its entry in the full-instruction reorder buffer (ROB), allowing subsequent younger instructions to continue to commit.
Further, store instructions that have committed but have not yet received their store data may wait in a separate buffer, or may continue to wait in the store instruction reorder queue (SQ), in which case a special flag must be set to distinguish them from uncommitted store instructions.
The present embodiment is described in detail below:
Assume that the assembly format of a store instruction is: st r1, disp(r2), where r1 is logical source register 1, which holds the data to be stored; r2 is logical source register 2, which holds the memory base address; and disp is the address offset relative to the memory base address. The memory address accessed by this store instruction equals the base address in register r2 plus the address offset.
After register renaming, the store instruction has the format: st p1, disp(p2), where p1 is physical source register 1, which holds the data to be stored, and p2 is physical source register 2, which holds the memory base address. The store instruction is decomposed into the following two microinstructions:
(1) Address-calculation microinstruction: st_addr disp(p2);
(2) Store-data microinstruction: st_data p1.
The two microinstructions enter the issue queue together and are then issued and executed independently. Like other instructions, each microinstruction is subject to issue-condition checks, including checking whether its source register is ready and competing for an issue port and a register file read port. To reduce the control complexity of the store instruction reorder queue (SQ), among microinstructions that meet their own issue conditions the address-calculation microinstruction is issued first and the store-data microinstruction afterwards.
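The decomposition and the issue ordering can be illustrated with a short sketch. It is a simplified model under stated assumptions: `MicroOp`, `split_store`, and `may_issue` are hypothetical names, port contention is not modeled, and the rule that st_data issues only after st_addr has issued is one reading of the ordering described above.

```python
from typing import NamedTuple, Optional, Set, Tuple


class MicroOp(NamedTuple):
    kind: str                   # "st_addr" or "st_data"
    src_reg: str                # the single physical source register it reads
    disp: Optional[int] = None  # address offset, used only by st_addr


def split_store(data_reg: str, base_reg: str, disp: int) -> Tuple[MicroOp, MicroOp]:
    """Split the renamed store `st data_reg, disp(base_reg)` into two microinstructions."""
    return MicroOp("st_addr", base_reg, disp), MicroOp("st_data", data_reg)


def may_issue(uop: MicroOp, ready_regs: Set[str], addr_uop_issued: bool) -> bool:
    """Issue check: source register ready, and st_data waits for st_addr to issue."""
    src_ready = uop.src_reg in ready_regs
    if uop.kind == "st_addr":
        return src_ready
    return src_ready and addr_uop_issued


addr_uop, data_uop = split_store("p1", "p2", 16)
print(addr_uop)
print(data_uop)
```

Each microinstruction reads exactly one source register, so the address path can proceed as soon as p2 is ready even if p1 is still being produced.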
After the address-calculation microinstruction (st_addr disp(p2)) is issued to the memory access unit, the memory base address is read from physical register p2 and added to the address offset (disp) to form the actual virtual address of the access (VAddr). Subsequent handling of the store instruction depends on whether an exception occurs, whether the access hits a copy in the DCache and obtains write permission, whether the ROB allows the store instruction to commit, whether the store data has been received, and so on. As shown in FIG. 3, SQ entries may be managed with a state machine, as follows:
(1) After the memory access instruction is issued to the memory access unit, if the virtual-to-physical address translation raises an exception or the access permission check fails, the exception is recorded in the SQ, and after the exceptional completion is reported to the ROB the SQ entry enters the "waiting for exception handling" state;
(2) After the memory access instruction is issued to the memory access unit, if the virtual-to-physical address translation completes normally and the access permission check passes, but the access misses in the DCache or write permission is not obtained, the write request is suspended and the SQ entry enters the "waiting for DCache to obtain a writable copy" state;
(3) After the memory access instruction is issued to the memory access unit, if the virtual-to-physical address translation completes normally, the access permission check passes, and the access hits in the DCache and obtains write permission, but the ROB does not yet allow the store instruction to commit, the SQ entry enters the "waiting for ROB to allow commit" state;
(4) After the memory access instruction is issued to the memory access unit, if the virtual-to-physical address translation completes normally, the access permission check passes, the access hits in the DCache and obtains write permission, and the ROB allows the store instruction to commit, the SQ entry enters the "ready to report completion to ROB" state;
(5) In the "waiting for exception handling" state, the SQ entry waits for the store instruction to become the oldest instruction in the ROB. If the store instruction is not the oldest instruction in the ROB, the entry remains in this state; once the store instruction becomes the oldest instruction in the ROB, the SQ entry is deleted and exception handling is performed;
(6) In the "waiting for DCache to obtain a writable copy" state, the SQ entry waits for the data copy and write permission to be obtained from the next-level cache or main memory. If the DCache has not obtained a writable copy, the entry remains in this state; once the DCache obtains the writable copy, it is further determined whether the ROB allows the store instruction to commit. If the ROB does not allow it, the entry moves to the "waiting for ROB to allow commit" state; if the ROB allows it, the entry moves to the "ready to report completion to ROB" state;
(7) In the "waiting for ROB to allow commit" state, the SQ entry waits for the ROB to give the commit-allowed signal. If the signal has not been received, the entry remains in this state; once the signal is received, the entry moves to the "ready to report completion to ROB" state;
(8) In the "ready to report completion to ROB" state, the SQ entry requests to report completion to the ROB. If the report does not succeed, the entry remains in this state and continues to request; after completion has been successfully reported to the ROB, it is further determined whether the store data has been received. If the store data has not been received, the entry moves to the "waiting for store data" state; if it has been received, the entry moves to the "ready to write DCache" state;
(9) In the "waiting for store data" state, the SQ entry waits for the store-data microinstruction to be issued to the memory access unit and to record its data in the SQ entry. If the store data has not been received, the entry remains in this state; once the store data is received, the SQ entry moves to the "ready to write DCache" state;
(10) In the "ready to write DCache" state, the SQ entry requests to write the DCache. If the DCache write has not completed, the entry remains in this state and continues to request the write; once the store data has been successfully written into the DCache, the SQ entry is released.
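The ten rules above can be sketched as a small software state machine. This is an illustrative model only, not the circuit of FIG. 3: the enum member names, the extra `RELEASED` state (standing for a deleted or released entry), and the keyword arguments standing in for hardware signals are all assumptions made for the example.

```python
from enum import Enum, auto


class SQState(Enum):
    WAIT_EXCEPTION = auto()  # "waiting for exception handling"
    WAIT_WRITABLE = auto()   # "waiting for DCache to obtain a writable copy"
    WAIT_ROB_ALLOW = auto()  # "waiting for ROB to allow commit"
    READY_REPORT = auto()    # "ready to report completion to ROB"
    WAIT_DATA = auto()       # "waiting for store data"
    READY_WRITE = auto()     # "ready to write DCache"
    RELEASED = auto()        # entry deleted or released (added for the sketch)


def initial_state(exception: bool, hit_writable: bool, rob_allows: bool) -> SQState:
    """Rules (1)-(4): state entered after the access reaches the memory access unit."""
    if exception:
        return SQState.WAIT_EXCEPTION
    if not hit_writable:
        return SQState.WAIT_WRITABLE
    return SQState.READY_REPORT if rob_allows else SQState.WAIT_ROB_ALLOW


def next_state(state: SQState, *, oldest_in_rob: bool = False, writable: bool = False,
               rob_allows: bool = False, report_ok: bool = False,
               data_received: bool = False, write_ok: bool = False) -> SQState:
    """Rules (5)-(10): one evaluation step of the per-entry state machine."""
    if state is SQState.WAIT_EXCEPTION:
        return SQState.RELEASED if oldest_in_rob else state
    if state is SQState.WAIT_WRITABLE:
        if not writable:
            return state
        return SQState.READY_REPORT if rob_allows else SQState.WAIT_ROB_ALLOW
    if state is SQState.WAIT_ROB_ALLOW:
        return SQState.READY_REPORT if rob_allows else state
    if state is SQState.READY_REPORT:
        if not report_ok:
            return state
        return SQState.READY_WRITE if data_received else SQState.WAIT_DATA
    if state is SQState.WAIT_DATA:
        return SQState.READY_WRITE if data_received else state
    if state is SQState.READY_WRITE:
        return SQState.RELEASED if write_ok else state
    return state
```

Note that `data_received` is first consulted in `READY_REPORT`, after completion has already been reported to the ROB, which reflects the early-commit property of the scheme.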
Meanwhile, when the ROB recognizes that the store instruction is the oldest instruction in the instruction sequence, it notifies the SQ that the store instruction is allowed to commit. At this point the store-data microinstruction (st_data p1) may still be waiting in the issue queue, may be reading the register file, or may already have been issued to the memory access unit and merged with the address-calculation microinstruction in the SQ. If the address-calculation microinstruction in the SQ encountered an exception and has reported exceptional completion to the ROB, the store instruction is removed from the head of the ROB and the SQ is notified to delete the corresponding entry; the instruction pipeline discards the store-data microinstruction, which may not yet have been issued, together with all instructions younger than the store instruction, and then enters the exception handler. If the address-calculation microinstruction of the store instruction has reported normal completion to the ROB, the store instruction is removed from the head of the ROB regardless of whether the store-data microinstruction has reached the SQ.
To ensure that this optimization does not alter the processor's memory consistency model, all committed store instructions must write their store data into the DCache in program order, and a store instruction that has committed but has not yet received its store data cannot write the DCache until the data arrives. For this reason, all committed store instructions, whether or not they have received their store data, enter a first-in first-out (FIFO) queue and write their data into the DCache in order once the conditions for writing the DCache are satisfied. In a specific implementation, an additional FIFO queue may be provided to buffer committed store instructions, or the SQ may continue to hold them. When the SQ is used for this purpose, store instructions occupy SQ entries for longer, which raises the probability that the instruction pipeline stalls because the SQ has no free entry for store instructions subsequently issued to the memory access unit. The number of SQ entries should therefore be increased appropriately to offset this side effect of the optimization.
For ease of control, store instructions that have committed but have not yet written the DCache continue to wait in the SQ, with separate head and tail pointers and state bits to identify them. Committed store instructions are examined one by one in program order: if a store instruction has received its data, the DCache write is performed; otherwise that store instruction and all committed store instructions after it continue to wait.
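The in-order drain of committed store instructions can be sketched as follows. This is a minimal software analogy, assuming a hypothetical `drain_committed` helper, a `deque` standing in for the committed region of the SQ, and a dictionary standing in for the DCache; a store whose data has not arrived is represented by `data` being `None`.

```python
from collections import deque


def drain_committed(committed: deque, dcache: dict) -> int:
    """Write committed stores to the DCache in program order.

    Stops at the first store that has not received its data, so no younger
    committed store can write ahead of it. Returns the number of stores written.
    """
    written = 0
    while committed and committed[0]["data"] is not None:
        store = committed.popleft()
        dcache[store["addr"]] = store["data"]
        written += 1
    return written


queue = deque([
    {"addr": 0x100, "data": 1},
    {"addr": 0x108, "data": None},  # committed, data not yet received
    {"addr": 0x110, "data": 3},     # data ready, but must wait behind 0x108
])
dcache = {}
drain_committed(queue, dcache)      # writes only the store to 0x100
queue[0]["data"] = 2                # the store-data microinstruction arrives
drain_committed(queue, dcache)      # now writes 0x108 and then 0x110
```

Stopping at the head of the queue, rather than skipping over it, is what preserves the write order required by the consistency model.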
Under the memory consistency model, store instructions that have been committed by the ROB but have not yet written their store data into the DCache need to be visible to other cores. In a multi-core architecture, therefore, if the address of a coherence request received by a core is the same as the address accessed by such a store instruction, processing of the coherence request must be blocked until the store instruction has written its data into the DCache. Note that in a multi-core architecture the minimum data granularity handled by the cache coherence protocol is generally one cache line, so the address of the coherence request and the address of the store instruction are compared at cache-line granularity. Assuming a cache line size of 128 bytes in the present processor, the lowest 7 address bits are ignored in this comparison. Because the store-data microinstruction split from the store instruction reads only one source register, and a committed store instruction is necessarily the oldest instruction in the instruction sequence, the store-data microinstruction cannot remain unissued for long because its source register is not ready; it can only be delayed briefly by an issue port or register read port conflict. A coherence request is therefore never blocked for long because its address matches that of a store instruction that has committed but not yet written the DCache, so no system deadlock arises.
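The cache-line-granularity comparison can be written out in a few lines. This sketch uses the 128-byte line size assumed in the text; the function names `same_cache_line` and `must_block_coherence_request` are hypothetical.

```python
CACHE_LINE_BYTES = 128                           # line size assumed in the text
LINE_SHIFT = CACHE_LINE_BYTES.bit_length() - 1   # log2(128) = 7 low bits ignored


def same_cache_line(addr_a: int, addr_b: int) -> bool:
    """Compare two addresses while ignoring the lowest LINE_SHIFT bits."""
    return (addr_a >> LINE_SHIFT) == (addr_b >> LINE_SHIFT)


def must_block_coherence_request(request_addr: int, pending_store_addrs) -> bool:
    """True if the request hits the line of any committed, not-yet-written store."""
    return any(same_cache_line(request_addr, a) for a in pending_store_addrs)


print(same_cache_line(0x1000, 0x107F))  # both addresses fall in the same 128-byte line
print(same_cache_line(0x1000, 0x1080))  # 0x1080 starts the next line
```

Addresses 0x1000 through 0x107F share one line, so a coherence request to any of them would be held while a committed store to that line is still waiting to write.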
A store instruction that has committed but has not yet written its store data into the DCache cannot be rolled back, and it obtained write permission for the DCache line it accesses before committing. That DCache line therefore must not be evicted by other, younger memory access instructions before the store data is written into the DCache. When a younger load or store instruction misses in the DCache and would need to evict the DCache line accessed by the committed store instruction, the younger instruction waits in the LQ or SQ. A younger flush instruction or kill instruction must likewise wait in the LQ or SQ. Once the corresponding store instruction has written its data into the DCache, the waiting younger memory access instructions are triggered to execute again.
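One way to picture this eviction protection is a tracker that pins a line while committed stores to it are still pending and parks younger accesses that would evict it. The sketch below is a software analogy under stated assumptions: the class name `PinnedLineTracker`, its methods, and the use of string identifiers for younger accesses are all hypothetical and do not come from the patent.

```python
from collections import defaultdict


class PinnedLineTracker:
    """Tracks DCache lines that committed, not-yet-written stores depend on."""

    def __init__(self):
        self.pinned = defaultdict(int)    # line address -> count of pending committed stores
        self.waiters = defaultdict(list)  # line address -> younger accesses parked in LQ/SQ

    def pin(self, line: int) -> None:
        """Called when a store to this line commits before its data is written."""
        self.pinned[line] += 1

    def try_evict(self, victim_line: int, access_id: str) -> bool:
        """Return True if the victim may be evicted; otherwise park the younger access."""
        if self.pinned.get(victim_line, 0) > 0:
            self.waiters[victim_line].append(access_id)
            return False
        return True

    def store_written(self, line: int) -> list:
        """Called when a committed store writes the DCache; returns accesses to re-execute."""
        self.pinned[line] -= 1
        if self.pinned[line] > 0:
            return []
        del self.pinned[line]
        return self.waiters.pop(line, [])
```

A counter is used rather than a single flag so that several committed stores to the same line keep it pinned until the last one has written its data; that detail is a design choice of the sketch, not something the text specifies.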
In the instruction pipeline, once the store instruction commits, the pipeline resources it occupies (such as its ROB entry) are released. The instruction whose destination register is P1 is certain to have executed and committed before the store instruction, so P1 has become a free physical register and can in principle be allocated to a new instruction at any time. However, if the store data microinstruction (st_data1) is still waiting in the issue queue, it still needs to read source register P1 when it executes. Therefore, if P1 is found to be about to be reassigned to another logical register, register renaming must be suspended and the instruction pipeline blocked until the store data microinstruction (st_data1) has issued.
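The renaming check can be sketched as a guard on the free-list allocator. This is an illustrative model only: `allocate_physical_register`, the list used as a free list, and the set of source registers of unissued store data microinstructions are hypothetical names introduced for this example.

```python
from typing import List, Optional, Set


def allocate_physical_register(free_list: List[int],
                               pending_st_data_sources: Set[int]) -> Optional[int]:
    """Return the next free physical register, or None if renaming must stall.

    Renaming stalls when the register about to be reallocated is still the
    source of a store data microinstruction that has not issued, because that
    microinstruction must read the old value first.
    """
    if not free_list:
        return None  # no free register at all
    candidate = free_list[0]
    if candidate in pending_st_data_sources:
        return None  # stall until the store data microinstruction issues
    return free_list.pop(0)
```

Once the store data microinstruction issues, its source register is removed from `pending_st_data_sources` and the same call succeeds on a later cycle.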
With this optimization, the decision on whether a store instruction can report completion to the ROB depends only on whether the ROB allows the store instruction to commit and whether the address calculation microinstruction decomposed from it has been issued to the memory access unit and has completed virtual-to-physical address translation, the permission check, and a hit on a writable copy in the DCache. It does not depend on whether the decomposed store data microinstruction has read the data to be stored from the register file, issued it to the memory access unit, or written it into the SQ entry. On the one hand, this reduces the conditions a single store instruction must meet to become committable, so resources such as its ROB entry can be released earlier and subsequent instructions enter the out-of-order execution window sooner. On the other hand, supporting parallel issue and execution of multiple store instructions requires several sets of pipeline resources for address calculation microinstructions, but fewer pipeline resources can be provided for store data microinstructions. Because the execution speed of the store data microinstruction is decoupled from that of the store instruction, this resource configuration does not affect performance in most cases, reduces hardware overhead and physical implementation difficulty, and helps improve the chip's performance-per-watt.
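The relaxed reporting condition can be written as a small predicate. The sketch below is an assumption-laden illustration: the `StoreStatus` fields and function names are hypothetical, and `can_report_waiting_for_data` models a design that also waits for the store data, included only as a point of comparison.

```python
from dataclasses import dataclass


@dataclass
class StoreStatus:
    rob_allows_commit: bool   # the ROB permits this store to commit
    address_translated: bool  # virtual-to-physical translation completed normally
    permission_ok: bool       # permission check passed
    writable_hit: bool        # DCache holds a writable copy of the line
    data_in_sq: bool          # store data has been written into the SQ entry


def address_side_done(s: StoreStatus) -> bool:
    """Conditions produced by the address calculation microinstruction."""
    return s.address_translated and s.permission_ok and s.writable_hit


def can_report_early(s: StoreStatus) -> bool:
    """Condition described in the text: the store data is not consulted."""
    return s.rob_allows_commit and address_side_done(s)


def can_report_waiting_for_data(s: StoreStatus) -> bool:
    """Comparison case (an assumption): also require the store data."""
    return can_report_early(s) and s.data_in_sq
```

A store whose address side is complete but whose data has not arrived satisfies the first predicate and not the second, which is exactly the window in which the ROB entry can be released earlier.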
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (11)

1. A method for early commit of superscalar microprocessor store instructions without waiting for store data, comprising:
obtaining memory access instructions from an instruction sequence, the memory access instructions comprising load instructions and store instructions;
decomposing a store instruction into an address calculation microinstruction and a store data microinstruction, wherein the address calculation microinstruction and the store data microinstruction enter an issue queue together and are issued and executed independently;
issuing the address calculation microinstruction preferentially to perform memory access address calculation, virtual-to-physical address translation, and access to the memory subsystem; and if the address calculation microinstruction hits in a first-level data Cache and obtains write permission, and the store instruction has become the oldest instruction in the instruction sequence, committing the store instruction regardless of whether the store data microinstruction has been issued and executed.
2. The method of claim 1, wherein when the address calculation microinstruction and the store data microinstruction enter the issue queue together:
if the virtual-to-physical address translation raises an exception or the access permission check fails, the exception is recorded in the store queue (SQ), and after completion with an exception has been reported to the full-instruction reorder buffer (ROB), the SQ entry enters the "wait for exception handling" state;
if the virtual-to-physical address translation completes normally and there is no access permission exception, but the access misses in the DCache or write permission has not been obtained, the write request is suspended and the SQ entry enters the "wait for DCache to obtain a writable copy" state;
if the virtual-to-physical address translation completes normally, there is no access permission exception, and the access hits in the DCache and obtains write permission, but the ROB does not allow the store instruction to commit, the SQ entry enters the "wait for ROB to allow commit" state;
if the virtual-to-physical address translation completes normally, there is no access permission exception, the access hits in the DCache and obtains write permission, and the ROB allows the store instruction to commit, the SQ entry enters the "ready to report completion to the ROB" state.
3. The method of claim 2, wherein in the "wait for exception handling" state the SQ entry in the SQ waits for the store instruction to become the oldest instruction in the ROB, specifically: if the store instruction is not the oldest instruction in the ROB, the SQ entry remains in the current state; after the store instruction becomes the oldest instruction in the ROB, the SQ entry is deleted and exception handling is performed.
4. The method of claim 2, wherein in the "wait for DCache to obtain a writable copy" state the SQ entry in the SQ waits to obtain a data copy and write permission from the next-level Cache or main memory, specifically: if the DCache has not obtained a writable copy, the SQ entry remains in the current state; when the DCache obtains the writable copy, it is further determined whether the ROB allows the store instruction to commit; if the ROB does not allow the commit, the SQ entry jumps to the "wait for ROB to allow commit" state; if the ROB allows the commit, the SQ entry jumps to the "ready to report completion to the ROB" state.
5. The method of claim 2, wherein in the "wait for ROB to allow commit" state the SQ entry in the SQ waits for the ROB to give a commit-allowed signal, specifically: if the commit-allowed signal from the ROB has not been received, the SQ entry remains in the current state; when the commit-allowed signal from the ROB is received, the SQ entry jumps to the "ready to report completion to the ROB" state.
6. The method of claim 2, wherein in the "ready to report completion to the ROB" state the SQ entry in the SQ requests to report completion to the ROB; if the report to the ROB has not yet succeeded, the SQ entry remains in the current state and continues requesting; after completion has been successfully reported to the ROB, it is further determined whether the store data has been received; if the store data has not been received, the SQ entry jumps to the "wait for store data" state; if the store data has been received, the SQ entry jumps to the "ready to write DCache" state.
7. The method of claim 6, wherein in the "wait for store data" state the SQ entry in the SQ waits for the store data microinstruction to be issued to the memory access unit and recorded in the SQ entry, and remains in that state if the store data has not been received; after the store data is received, the SQ entry jumps to the "ready to write DCache" state;
in the "ready to write DCache" state the SQ entry in the SQ requests to write the DCache; if the DCache write has not completed, the SQ entry remains in that state and continues requesting; if the store data is successfully written to the DCache, the SQ entry is released.
8. The method of claim 1, wherein the committed store instruction waits in a buffer; when the store data microinstruction satisfies its issue condition, it is issued, reads the store data from the register file, and enters the buffer to merge with the address calculation microinstruction; the store data is then written into the first-level data Cache, after which the store data is visible to the memory model.
9. The method of claim 1, wherein when the number of a physical register about to be reallocated equals the physical source register number of the store data microinstruction, register renaming of the current instruction is blocked until the store data microinstruction has issued.
10. The method of claim 1, wherein when the address of a coherency request received by the core is the same as the address accessed by the store instruction, execution of the coherency request is blocked until the store instruction has written its data into the first-level data Cache.
11. The method of claim 1, wherein the Cache line in the first-level data Cache accessed by the store instruction cannot be evicted by other younger memory access instructions until the store instruction has written its data into the first-level data Cache.
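The SQ entry states recited in claims 2 to 7 can be read as a small state machine. The following Python sketch is illustrative only and is not part of the claims: the enum members, the `Signals` fields and the function names are hypothetical labels for the conditions named in the claims, and `RELEASED` is an added terminal state covering both deletion after exception handling and release after the DCache write.

```python
from dataclasses import dataclass
from enum import Enum, auto


class SQState(Enum):
    WAIT_EXCEPTION = auto()  # "wait for exception handling"
    WAIT_WRITABLE = auto()   # "wait for DCache to obtain a writable copy"
    WAIT_ROB = auto()        # "wait for ROB to allow commit"
    READY_REPORT = auto()    # "ready to report completion to the ROB"
    WAIT_DATA = auto()       # "wait for store data"
    READY_WRITE = auto()     # "ready to write DCache"
    RELEASED = auto()        # entry deleted or released (label added for this sketch)


@dataclass
class Signals:
    exception: bool = False          # translation exception or permission error
    writable_hit: bool = False       # DCache holds a writable copy
    rob_allows_commit: bool = False  # ROB gives the commit-allowed signal
    oldest_in_rob: bool = False      # store is the oldest instruction in the ROB
    report_accepted: bool = False    # report to the ROB succeeded
    data_received: bool = False      # store data has reached the SQ entry
    write_done: bool = False         # DCache write completed


def initial_state(s: Signals) -> SQState:
    """State entered after the address calculation microinstruction executes (claim 2)."""
    if s.exception:
        return SQState.WAIT_EXCEPTION
    if not s.writable_hit:
        return SQState.WAIT_WRITABLE
    return SQState.READY_REPORT if s.rob_allows_commit else SQState.WAIT_ROB


def next_state(state: SQState, s: Signals) -> SQState:
    """One transition step following claims 3 to 7."""
    if state is SQState.WAIT_EXCEPTION:
        return SQState.RELEASED if s.oldest_in_rob else state
    if state is SQState.WAIT_WRITABLE:
        if not s.writable_hit:
            return state
        return SQState.READY_REPORT if s.rob_allows_commit else SQState.WAIT_ROB
    if state is SQState.WAIT_ROB:
        return SQState.READY_REPORT if s.rob_allows_commit else state
    if state is SQState.READY_REPORT:
        if not s.report_accepted:
            return state
        return SQState.READY_WRITE if s.data_received else SQState.WAIT_DATA
    if state is SQState.WAIT_DATA:
        return SQState.READY_WRITE if s.data_received else state
    if state is SQState.READY_WRITE:
        return SQState.RELEASED if s.write_done else state
    return state
```

Note that the store data is first consulted only after completion has been reported to the ROB, which is where the early commit shows up in the state sequence.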
CN202211480671.3A 2022-11-23 2022-11-23 Advanced submission method for unequal data of superscalar microprocessor storage instruction Pending CN115729628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211480671.3A CN115729628A (en) 2022-11-23 2022-11-23 Advanced submission method for unequal data of superscalar microprocessor storage instruction

Publications (1)

Publication Number Publication Date
CN115729628A true CN115729628A (en) 2023-03-03

Family

ID=85297903

Country Status (1)

Country Link
CN (1) CN115729628A (en)

Similar Documents

Publication Publication Date Title
JP5118652B2 (en) Transactional memory in out-of-order processors
US6481251B1 (en) Store queue number assignment and tracking
CN102483704B (en) Transactional memory system with efficient cache support
US8180967B2 (en) Transactional memory virtualization
US7181598B2 (en) Prediction of load-store dependencies in a processing agent
US6523109B1 (en) Store queue multimatch detection
US7962730B2 (en) Replaying memory operation assigned a load/store buffer entry occupied by store operation processed beyond exception reporting stage and retired from scheduler
US8127057B2 (en) Multi-level buffering of transactional data
US20010052053A1 (en) Stream processing unit for a multi-streaming processor
US20130232499A1 (en) Compare and exchange operation using sleep-wakeup mechanism
US6694424B1 (en) Store load forward predictor training
US9098327B2 (en) Method and apparatus for implementing a transactional store system using a helper thread
US20220188233A1 (en) Managing cached data used by processing-in-memory instructions
JP2007536626A (en) System and method for verifying a memory file that links speculative results of a load operation to register values
JPH0670779B2 (en) Fetch method
US6915395B1 (en) Active address content addressable memory
US6668287B1 (en) Software direct memory access
JP2001209535A (en) Command scheduling device for processors
US7519775B2 (en) Enforcing memory-reference ordering requirements at the L2 cache level
US11314509B2 (en) Processing of plural-register-load instruction
US20050283783A1 (en) Method for optimizing pipeline use in a multiprocessing system
US11194574B2 (en) Merging memory ordering tracking information for issued load instructions
CN115729628A (en) Advanced submission method for unequal data of superscalar microprocessor storage instruction
US8904150B2 (en) Microprocessor systems and methods for handling instructions with multiple dependencies
CN117472803B (en) Atomic instruction execution method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination