CN110647362A - Two-stage buffering transmitting device based on scoreboard principle

Info

Publication number: CN110647362A
Application number: CN201910858592.3A
Authority: CN
Inventors: 胡向东; 范好好; 李俊; 尹飞; 王国澎
Original assignee: Shanghai Integrated Circuits with Highperformance Center
Current assignee: Shanghai Integrated Circuits with Highperformance Center
Priority date: 2019-09-11
Filing date: 2019-09-11
Publication date: 2020-01-03
Anticipated expiration: 2039-09-11
Also published as: CN110647362B

Abstract

The invention relates to a two-stage buffering transmitting device based on the scoreboard principle. The device comprises a first-stage waiting queue and a second-stage transmitting queue. A speculative scoreboard is arranged between the first-stage waiting queue and the second-stage transmitting queue, and an accurate scoreboard is arranged at the issue point of the second-stage transmitting queue. The unlocking time of the speculative scoreboard is the time, determined from the number of cycles the instruction takes in the execution unit, at which the speculative scoreboard is unlocked after the instruction has been sent from the first-stage waiting queue to the second-stage transmitting queue. The unlocking time of the accurate scoreboard is the time, determined in the same way, at which the accurate scoreboard is unlocked after the instruction has been issued from the second-stage transmitting queue to the execution unit. The invention simplifies the complex issue-selection logic, adjusts the utilization of the second-stage queue and improves issue efficiency.

Description

Two-stage buffering transmitting device based on scoreboard principle

Technical Field

The invention relates to the technical field of instruction pipeline design of superscalar microprocessors, in particular to a two-stage buffering transmitting device based on a scoreboard principle.

Background

Modern superscalar processors typically include basic pipeline stages such as fetch, decode, rename, issue, execute and retire, and contain multiple execution units so that several instructions can execute in parallel. The transmitting component (the issue unit) is the bridge between the instruction pipeline and the execution units: it tracks the current state of the processor in real time, finds instructions in the instruction window that can run in parallel, and dynamically schedules them to the execution units. In the pipeline stages before the transmitting component, instructions enter in order and leave in order; in the transmitting component, instructions enter in order but leave out of order.

To support dynamic instruction scheduling, superscalar processors often use the scoreboard technique. Its principle is as follows: the state of every current operand is recorded centrally in a scoreboard status table, which indicates whether the operand is available, i.e. whether it can be read by an instruction. When an instruction is going to write an operand, the operand is blocked and becomes unavailable; when the instruction finishes executing, the operand is unlocked and becomes available again. While an operand is blocked, any instruction that uses it as a source operand cannot be issued to an execution unit. This eliminates read-after-write hazards and ensures that instructions with data dependencies execute strictly in program order. Instructions without data dependencies are unrelated on the scoreboard and can be issued and executed out of order.

Specifically, the scoreboard status table, which maintains the data dependencies between instructions in the system, contains n bits and records centrally whether each of n operands is available or blocked. Each bit corresponds to one operand: "0" means the operand is available, i.e. unlocked; "1" means the operand is unavailable, i.e. blocked.

Each instruction that has passed the rename stage carries an n-bit source scoreboard, in which each bit corresponds to one operand. A "0" at a position means that the instruction does not need the corresponding operand to be available; a "1" means that it does. Depending on the number of its source operands, the source scoreboard of an instruction may contain zero or more "1" bits.

Each instruction that has passed the rename stage also carries an n-bit target scoreboard, in which each bit corresponds to one operand. A "0" at a position means that the instruction will not modify the corresponding operand; a "1" means that it will. Depending on the number of its target operands, the target scoreboard of an instruction may contain zero or one "1" bit.

When an instruction enters the transmitting component from the preceding stage, if the mth bit of its target scoreboard is "1", the mth bit of the scoreboard status table is set to "1", i.e. blocked. After the instruction has been issued to an execution unit and has finished executing, the mth bit of the scoreboard status table is cleared to "0", i.e. unlocked.

In every cycle, each instruction in the transmitting component compares its source scoreboard with the scoreboard status table. If both have a "1" at the same position, there is a read-after-write hazard on that operand and the instruction is not allowed to issue.
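The blocking, unlocking and comparison steps described above can be sketched with integer bitmasks. This is a minimal illustration, not the patented circuit: the class name `ScoreboardStatusTable`, its method names and the helper `bit` are assumptions introduced here.

```python
class ScoreboardStatusTable:
    """Status table as a bitmask: bit m == 1 means operand m is blocked."""

    def __init__(self) -> None:
        self.bits = 0  # every operand starts out available

    def block(self, target_scoreboard: int) -> None:
        # Set the bit of each operand the entering instruction will write.
        self.bits |= target_scoreboard

    def unlock(self, target_scoreboard: int) -> None:
        # Clear those bits once the writing instruction has finished executing.
        self.bits &= ~target_scoreboard

    def may_issue(self, source_scoreboard: int) -> bool:
        # A "1" at the same position in both masks is a read-after-write hazard.
        return (self.bits & source_scoreboard) == 0


def bit(m: int) -> int:
    """Bitmask with only position m set (hypothetical helper)."""
    return 1 << m


table = ScoreboardStatusTable()
producer_target = bit(5)            # producer writes operand 5
consumer_source = bit(5) | bit(7)   # consumer reads operands 5 and 7

table.block(producer_target)
blocked_check = table.may_issue(consumer_source)   # False: operand 5 is blocked
table.unlock(producer_target)
ready_check = table.may_issue(consumer_source)     # True: no shared "1" remains
```

The comparison is a single AND followed by a zero test, which is why the scheme maps naturally onto wide bit vectors in hardware.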

Because the instructions buffered in the transmitting component have already passed stages such as decode and rename, they carry a large amount of information, and the control logic for searching and selecting among them is complex. In the physical implementation this leads to dense wiring, many logic levels and long delays, which makes the transmitting component a difficult part of the physical design. As processor frequency and issue bandwidth increase, the instruction issue logic tends to become a critical path of the pipeline. The design of the transmitting component therefore has to balance performance against physical implementation.

Consequently, a conventional single-level buffer design often cannot meet both the performance and the timing requirements of the transmitting component. With a two-level buffer design, instructions are stored in two separate buffers. For the same total number of buffered instructions, this reduces the difficulty of the physical implementation compared with a single-level issue buffer and allows a higher processor frequency.

To obtain good performance from a two-level design and make full use of both buffers, instructions whose operands are not ready should, as far as possible, stay in the first-level buffer near the upstream end of the pipeline, while instructions whose operands are ready or nearly ready should be placed in the second-level buffer near the downstream end. Resources are then fully used and the performance impact is minimized. Controlling when an instruction is sent from the first-level buffer to the second-level buffer is therefore important.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a two-stage buffering transmitting device based on the scoreboard principle that simplifies the complex issue-selection logic, adjusts the utilization of the two-stage queues and improves issue efficiency.

The technical solution adopted by the invention is as follows. The two-stage buffering transmitting device based on the scoreboard principle comprises a first-stage waiting queue and a second-stage transmitting queue. A speculative scoreboard is arranged between the first-stage waiting queue and the second-stage transmitting queue and controls when each instruction is sent from the first-stage waiting queue to the second-stage transmitting queue. An accurate scoreboard is arranged at the issue point of the second-stage transmitting queue and controls when each instruction is issued from the second-stage transmitting queue. The unlocking time of the speculative scoreboard is the time, determined from the number of cycles the instruction takes in the execution unit, at which the speculative scoreboard is unlocked after the instruction has been sent from the first-stage waiting queue to the second-stage transmitting queue. The unlocking time of the accurate scoreboard is the time, determined in the same way, at which the accurate scoreboard is unlocked after the instruction has been issued from the second-stage transmitting queue to the execution unit.

Suppose the mth bit of an instruction's target scoreboard is a valid position and the instruction takes N execution cycles. The mth bit of the speculative scoreboard is blocked immediately after the instruction enters the first-stage waiting queue from the rename stage, and is unlocked in cycle N-2 after the instruction is issued from the second-stage transmitting queue to an execution unit.

For a single-beat instruction, the mth bit of the speculative scoreboard is unlocked when the instruction is sent from the first-stage waiting queue to the second-stage transmitting queue. For a LOAD-class instruction, the number of execution cycles is taken to be the number needed when the access hits the level-1 data cache.

When an instruction enters the first-stage waiting queue, it is not allowed to be sent to the second-stage transmitting queue as long as any bit of the speculative scoreboard that corresponds to a valid position of the instruction's source scoreboard is blocked.
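The interplay of the two scoreboards can be sketched as a small software model: the speculative scoreboard gates the move from the waiting queue to the transmitting queue, and the accurate scoreboard gates issue. The names `Instr`, `TwoStageBuffer`, `enter`, `promote` and `issue` are assumptions introduced for illustration, and the model moves every eligible instruction at once, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instr:
    name: str
    source_sb: int  # source scoreboard bitmask
    target_sb: int  # target scoreboard bitmask


class TwoStageBuffer:
    def __init__(self) -> None:
        self.speculative_sb = 0
        self.accurate_sb = 0
        self.wait_queue: List[Instr] = []
        self.transmit_queue: List[Instr] = []

    def enter(self, instr: Instr) -> None:
        # On entry from rename, the target bit is blocked in both scoreboards.
        self.speculative_sb |= instr.target_sb
        self.accurate_sb |= instr.target_sb
        self.wait_queue.append(instr)

    def promote(self) -> List[str]:
        # Move instructions whose sources are clear in the speculative scoreboard.
        moved = [i for i in self.wait_queue
                 if (i.source_sb & self.speculative_sb) == 0]
        self.wait_queue = [i for i in self.wait_queue if i not in moved]
        self.transmit_queue.extend(moved)
        return [i.name for i in moved]

    def issue(self) -> List[str]:
        # Issue instructions whose sources are clear in the accurate scoreboard.
        issued = [i for i in self.transmit_queue
                  if (i.source_sb & self.accurate_sb) == 0]
        self.transmit_queue = [i for i in self.transmit_queue if i not in issued]
        return [i.name for i in issued]


buf = TwoStageBuffer()
buf.enter(Instr("producer", source_sb=0, target_sb=1 << 2))
buf.enter(Instr("consumer", source_sb=1 << 2, target_sb=1 << 3))

step1 = buf.promote()            # ['producer']: consumer waits on bit 2
step2 = buf.issue()              # ['producer']
buf.speculative_sb &= ~(1 << 2)  # speculative unlock of bit 2 (earlier of the two)
step3 = buf.promote()            # ['consumer']
step4 = buf.issue()              # []: accurate bit 2 is still blocked
buf.accurate_sb &= ~(1 << 2)     # accurate unlock of bit 2
step5 = buf.issue()              # ['consumer']
```

In this trace the consumer occupies a transmitting-queue entry only after the speculative scoreboard indicates that its operand is close to ready, which is the behaviour the two-scoreboard arrangement is meant to achieve.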

Suppose the mth bit of an instruction's target scoreboard is a valid position and the instruction takes N execution cycles. The mth bit of the accurate scoreboard is blocked immediately after the instruction enters the first-stage waiting queue from the rename stage, and is unlocked in cycle N-1 after the instruction is issued from the second-stage transmitting queue to the execution unit.

For a single-beat instruction, the mth bit of the accurate scoreboard is unlocked when the instruction is issued from the second-stage transmitting queue to the execution unit. For a LOAD instruction, the number of execution cycles is taken to be the number needed when the LOAD hits the level-1 data cache, i.e. 4 beats. After a LOAD instruction is issued, the scoreboard number set by the instruction is recorded. In the 3rd beat after issue, that scoreboard number is decoded into a 64-bit vector and the block on the mth bit of the accurate scoreboard is removed. In the 4th beat, the DCache hit signal determines whether the speculation succeeded. If it succeeded, subsequent operation continues. If it failed, instruction issue in that beat is blocked and the mth bit of the accurate scoreboard is blocked again; it is unlocked only after the LOAD instruction has completed.
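As a worked example of the unlock timing stated above (cycle N-2 for the speculative scoreboard, cycle N-1 for the accurate scoreboard, both counted from issue to the execution unit), the following sketch computes the unlock beat for a multi-cycle instruction. The function name `unlock_beat` and the constant names are assumptions; single-beat instructions are excluded because the text handles them separately.

```python
# Beats by which each scoreboard unlocks ahead of the end of execution,
# as stated in the text: N-2 for the speculative one, N-1 for the accurate one.
LEAD_BEATS = {"speculative": 2, "accurate": 1}


def unlock_beat(exec_cycles: int, scoreboard: str) -> int:
    """Beat, counted from issue to the execution unit, at which bit m unlocks."""
    if exec_cycles < 2:
        raise ValueError("single-beat instructions are handled separately")
    return exec_cycles - LEAD_BEATS[scoreboard]


LOAD_HIT_CYCLES = 4  # a LOAD is assumed to hit the level-1 data cache
load_speculative = unlock_beat(LOAD_HIT_CYCLES, "speculative")
load_accurate = unlock_beat(LOAD_HIT_CYCLES, "accurate")
print(load_speculative, load_accurate)  # 2 3
```

For a LOAD with N = 4, the accurate scoreboard unlocks in beat 3, one beat before the DCache hit signal is checked in beat 4.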

There are 3 first-stage waiting queues: an integer waiting queue, a floating-point waiting queue and a memory-access waiting queue. There are 9 second-stage transmitting queues: 3 integer transmitting queues, 2 floating-point transmitting queues, 2 memory-access transmitting queues, 1 integer store-data transmitting queue and 1 floating-point store-data transmitting queue. Instructions in the integer waiting queue are sent to one of the 3 integer transmitting queues according to the pipeline assigned to them; instructions in the floating-point waiting queue are sent to one of the 2 floating-point transmitting queues according to the assigned pipeline. A LOAD instruction in the memory-access waiting queue is sent to one of the 2 memory-access transmitting queues according to the assigned pipeline. A STORE instruction is sent both to one of the 2 memory-access transmitting queues according to the assigned pipeline and, depending on the type of the data to be stored, to either the integer store-data transmitting queue or the floating-point store-data transmitting queue.

Advantageous effects

With the above technical solution, the invention has the following advantages and positive effects compared with the prior art. No instruction has more than 3 dispatch ports from the waiting buffer to the transmitting buffer, which greatly reduces logic complexity and the difficulty of the physical implementation, improves issue efficiency and helps raise the processor frequency. A speculative scoreboard is placed between the first-stage waiting queue and the second-stage transmitting queue and controls when an instruction moves from the first-stage waiting queue to the second-stage transmitting queue; an accurate scoreboard is used at the outlet of the second-stage transmitting queue. As a result, instructions whose operands are clearly not ready wait in the first-stage waiting queue as far as possible, while instructions whose operands are ready or nearly ready wait for issue in the second-stage transmitting queue. Together the two scoreboards control when instructions are sent from the first-level buffer to the second-level buffer, adjust the utilization of the second-stage queue and improve issue efficiency.

Drawings

FIG. 1 is a schematic structural view of the present invention;

FIG. 2 is a schematic diagram of instructions being sent from the first-stage waiting queue to the second-stage transmitting queue;

FIG. 3 is a schematic diagram of issue from the second-stage transmitting queue;

FIG. 4 is a schematic diagram of a speculative scoreboard being blocked;

FIG. 5 is a schematic view of the speculative scoreboard being unlocked;

FIG. 6 is a schematic diagram of the accurate scoreboard being pre-unlocked by a LOAD-like instruction;

FIG. 7 is a schematic diagram of the accurate scoreboard being blocked again after the pre-unlock by a LOAD-class instruction fails;

FIG. 8 is a schematic diagram of the accurate scoreboard being unlocked by a LOAD-class instruction.

Detailed Description

The invention will be further illustrated with reference to the following specific examples. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

The embodiment of the invention relates to a two-stage buffering transmitting device based on the scoreboard principle, comprising a first-stage waiting queue and a second-stage transmitting queue. A speculative scoreboard is arranged between the first-stage waiting queue and the second-stage transmitting queue and controls when each instruction is sent from the first-stage waiting queue to the second-stage transmitting queue. An accurate scoreboard is arranged at the issue point of the second-stage transmitting queue and controls precisely when each instruction is issued from the second-stage transmitting queue.

This embodiment is primarily directed to a processor that employs the scoreboard technique and requires out-of-order issue. The first-level buffer near the upstream end of the pipeline is referred to in this embodiment as the first-stage waiting queue, and the second-level buffer near the downstream end is referred to as the second-stage transmitting queue.

There are 3 first-stage waiting queues: an integer waiting queue, a floating-point waiting queue and a memory-access waiting queue. There are 9 second-stage transmitting queues: 3 integer transmitting queues, 2 floating-point transmitting queues, 2 memory-access transmitting queues, 1 integer store-data transmitting queue and 1 floating-point store-data transmitting queue. Instructions in the integer waiting queue are sent to one of the 3 integer transmitting queues according to the pipeline assigned to them; instructions in the floating-point waiting queue are sent to one of the 2 floating-point transmitting queues according to the assigned pipeline. A LOAD instruction in the memory-access waiting queue is sent to one of the 2 memory-access transmitting queues according to the assigned pipeline. A STORE instruction is sent both to one of the 2 memory-access transmitting queues according to the assigned pipeline and, depending on the type of the data to be stored, to either the integer store-data transmitting queue or the floating-point store-data transmitting queue.
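The routing from the 3 waiting queues to the 9 transmitting queues can be sketched as a lookup. The queue names (`integer_tq0`, `float_store_data_tq` and so on), the function `route` and its parameters are hypothetical labels chosen for this illustration; only the queue counts and the STORE double dispatch come from the text.

```python
from typing import List, Optional

# Number of transmitting queues (one per assigned pipeline) for each class.
PIPELINES = {"integer": 3, "float": 2, "memory": 2}
STORE_DATA_TYPES = ("integer", "float")


def route(op_class: str, pipeline: int, is_store: bool = False,
          store_data_type: Optional[str] = None) -> List[str]:
    """Return the transmitting queue(s) that receive an instruction."""
    if pipeline not in range(PIPELINES[op_class]):
        raise ValueError("pipeline index out of range for this class")
    queues = [f"{op_class}_tq{pipeline}"]
    if op_class == "memory" and is_store:
        # A STORE also goes to the store-data queue matching its data type.
        if store_data_type not in STORE_DATA_TYPES:
            raise ValueError("STORE needs an integer or float data type")
        queues.append(f"{store_data_type}_store_data_tq")
    return queues


def all_transmit_queues() -> List[str]:
    names = [f"{c}_tq{p}" for c, n in PIPELINES.items() for p in range(n)]
    names += [f"{t}_store_data_tq" for t in STORE_DATA_TYPES]
    return names


int_route = route("integer", 2)                    # ['integer_tq2']
load_route = route("memory", 0)                    # ['memory_tq0']
store_route = route("memory", 1, is_store=True,
                    store_data_type="float")       # ['memory_tq1', 'float_store_data_tq']
queue_count = len(all_transmit_queues())           # 9
```

No instruction class fans out to more than 3 transmitting queues in this layout, which is consistent with the dispatch-port limit the description cites as an advantage.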

In this embodiment, instructions that have been decoded upstream are distributed to the 3 first-stage waiting queues according to their type: integer, floating-point or memory-access. Instructions for which the speculative scoreboard is still blocked remain in the first-stage waiting queue until it is unlocked; instructions for which the speculative scoreboard is unlocked are sent to a second-stage transmitting queue (see FIG. 2). In the second-stage transmitting queue, any instruction for which the accurate scoreboard is unlocked can be sent to the execution unit immediately. FIG. 3 is a schematic diagram of issue from the second-stage transmitting queue.

The unlocking time of the speculative scoreboard is the time, determined from the number of cycles the instruction takes in the execution unit, at which the speculative scoreboard is unlocked after the instruction has been sent from the first-stage waiting queue to the second-stage transmitting queue. In this embodiment the speculative scoreboard has the same width as the accurate scoreboard, and each bit has two states: "0" means unlocked and "1" means blocked.

For a single beat instruction, the mth bit of the speculative scoreboard is unlocked when an instruction is transmitted from the primary wait queue to the secondary transmit queue.

For a LOAD class instruction, the number of execution cycles is considered to be the same as the number of execution cycles when hitting the primary data Cache.

When an instruction enters the first-stage waiting queue, it is not allowed to be sent to the second-stage transmitting queue as long as any bit of the speculative scoreboard that corresponds to a valid position of the instruction's source scoreboard is blocked. This ensures that instructions whose operands are not ready wait in the first-stage waiting queue and do not occupy entries of the second-stage transmitting queue.

As shown in FIG. 4, the valid position of the target scoreboard of instruction j is position 2. After instruction j is sent from the rename stage to the first-stage waiting queue, position 2 of the speculative scoreboard is blocked immediately, according to the valid position of the target scoreboard. As shown in FIG. 5, the valid position of the target scoreboard of instruction j is position 2 and its number of execution cycles is N. In the Nth cycle after instruction j is sent from the first-stage waiting queue to the second-stage transmitting queue, position 2 of the speculative scoreboard is unlocked, according to the valid position of the target scoreboard.

The unlocking time of the accurate scoreboard is the time, determined from the number of cycles the instruction takes in the execution unit, at which the accurate scoreboard is unlocked after the instruction has been issued from the second-stage transmitting queue to the execution unit.

For a single beat instruction, the mth bit of the precision scoreboard is unlocked when an instruction is transmitted from the secondary issue queue to the execution unit.

For a LOAD instruction, the number of execution cycles is taken to be the number needed when the instruction hits the level-1 data cache; if the DCache is missed, the mth bit of the accurate scoreboard is blocked again and is unlocked only after the instruction has completed. Specifically, an integer LOAD-class instruction is always speculated to hit the DCache, i.e. its execution time is taken to be 4 beats. After a LOAD instruction is issued, the scoreboard number set by the instruction is recorded. In the 3rd beat after issue, that scoreboard number is decoded into a 64-bit vector and the corresponding scoreboard bit is cleared. In the 4th beat, the DCache hit signal given by DBOX determines whether the speculation succeeded. If it succeeded, subsequent operation continues. If it failed, instruction issue in that beat is blocked and the speculatively unlocked scoreboard bit is restored to the blocked state.

When the speculation succeeds, dependent subsequent instructions can be issued 1 beat earlier, because the scoreboard is released one beat early. A high DCache hit rate means a high speculation success rate and therefore better performance. If the LOAD instruction misses the DCache, a speculative LOAD miss occurs; in that case, issue is stalled if the instruction to be issued depends on the speculative LOAD instruction.

As shown in FIG. 6, instruction j is a LOAD-class instruction and the valid position of its target scoreboard is position 2. In the 3rd beat after instruction j is issued from the second-stage transmitting queue to the execution unit, position 2 of the accurate scoreboard is unlocked, according to the valid position of the target scoreboard. As shown in FIG. 7, instruction j is a LOAD-class instruction and the valid position of its target scoreboard is position 2. In the 4th beat after instruction j is issued from the second-stage transmitting queue to the execution unit, a DCache miss signal is received and position 2 of the accurate scoreboard is blocked again. As shown in FIG. 8, instruction j is a LOAD-class instruction and the valid position of its target scoreboard is position 2. When instruction j has really completed, position 2 of the accurate scoreboard is unlocked, according to the valid position of the target scoreboard.
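The LOAD sequence of FIG. 6 to FIG. 8 (pre-unlock in beat 3, hit check in beat 4, re-block on a miss, final unlock on completion) can be traced with a short sketch. The function `trace_load_bit` and the parameter `miss_completion_beat` are assumptions; the text does not say how many beats a missing LOAD takes, so the value 8 is purely illustrative.

```python
from typing import List, Tuple


def trace_load_bit(dcache_hit: bool,
                   miss_completion_beat: int = 8) -> List[Tuple[int, bool]]:
    """Return (beat, blocked) pairs for bit m of the accurate scoreboard.

    Beats are counted from the issue of the LOAD to the execution unit.
    """
    if miss_completion_beat <= 4:
        raise ValueError("a missing LOAD must complete after the beat-4 check")
    blocked = True  # bit m was blocked when the LOAD entered the waiting queue
    last_beat = 4 if dcache_hit else miss_completion_beat
    trace = []
    for beat in range(1, last_beat + 1):
        if beat == 3:
            blocked = False   # pre-unlock, speculating on a DCache hit
        elif beat == 4 and not dcache_hit:
            blocked = True    # miss signalled: restore the blocked bit
        elif beat == miss_completion_beat and not dcache_hit:
            blocked = False   # the LOAD has really completed
        trace.append((beat, blocked))
    return trace


hit_trace = trace_load_bit(dcache_hit=True)
# [(1, True), (2, True), (3, False), (4, False)]
miss_trace = trace_load_bit(dcache_hit=False)
# [(1, True), (2, True), (3, False), (4, True), (5, True),
#  (6, True), (7, True), (8, False)]
```

On a hit the bit stays clear from beat 3 onward, so a dependent instruction gains one beat; on a miss the bit is clear only during beat 3 and is blocked again from beat 4 until completion.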

It can be seen that the invention places a speculative scoreboard between the first-stage waiting queue and the second-stage transmitting queue to control when an instruction moves from the first-stage waiting queue to the second-stage transmitting queue, and uses an accurate scoreboard at the outlet of the second-stage transmitting queue. As a result, instructions whose operands are clearly not ready wait in the first-stage waiting queue as far as possible, while instructions whose operands are ready or nearly ready wait for issue in the second-stage transmitting queue. Together the two scoreboards control when instructions are sent from the first-level buffer to the second-level buffer and adjust the utilization of the second-stage queue.

Claims

1. A two-stage buffering transmitting device based on the scoreboard principle, comprising a first-stage waiting queue and a second-stage transmitting queue, wherein a speculative scoreboard is arranged between the first-stage waiting queue and the second-stage transmitting queue and is used to control when each instruction is sent from the first-stage waiting queue to the second-stage transmitting queue; an accurate scoreboard is arranged at the issue point of the second-stage transmitting queue and is used to control when each instruction is issued from the second-stage transmitting queue; the unlocking time of the speculative scoreboard is the time, determined from the number of cycles the instruction takes in the execution unit, at which the speculative scoreboard is unlocked after the instruction has been sent from the first-stage waiting queue to the second-stage transmitting queue; and the unlocking time of the accurate scoreboard is the time, determined from the number of cycles the instruction takes in the execution unit, at which the accurate scoreboard is unlocked after the instruction has been issued from the second-stage transmitting queue to the execution unit.

2. The two-stage buffering transmitting device based on the scoreboard principle as claimed in claim 1, wherein, assuming that the mth bit of the target scoreboard of an instruction is a valid position and the number of execution cycles is N, the mth bit of the speculative scoreboard is blocked immediately after the instruction enters the first-stage waiting queue from the rename stage; and the mth bit of the speculative scoreboard is unlocked in cycle N-2 after the instruction is issued from the second-stage transmitting queue to an execution unit.

3. The two-stage buffering transmitting device based on the scoreboard principle as claimed in claim 2, wherein, for a single-beat instruction, the mth bit of the speculative scoreboard is unlocked when the instruction is sent from the first-stage waiting queue to the second-stage transmitting queue; and for a LOAD-class instruction, the number of execution cycles is taken to be the number needed when the instruction hits the level-1 data cache.

4. The two-stage buffering transmitting device based on the scoreboard principle as claimed in claim 2, wherein, when an instruction enters the first-stage waiting queue, if any bit of the speculative scoreboard that corresponds to a valid position of the source scoreboard of the instruction is blocked, the instruction is prohibited from being sent to the second-stage transmitting queue.

5. The two-stage buffering transmitting device based on the scoreboard principle as claimed in claim 1, wherein, assuming that the mth bit of the target scoreboard of an instruction is a valid position and the number of execution cycles is N, the mth bit of the accurate scoreboard is blocked immediately after the instruction enters the first-stage waiting queue from the rename stage; and the mth bit of the accurate scoreboard is unlocked in cycle N-1 after the instruction is issued from the second-stage transmitting queue to the execution unit.

6. The two-stage buffering transmitting device based on the scoreboard principle as claimed in claim 5, wherein, for a single-beat instruction, the mth bit of the accurate scoreboard is unlocked when the instruction is issued from the second-stage transmitting queue to the execution unit; for a LOAD instruction, the number of execution cycles is taken to be the number needed when the LOAD instruction hits the level-1 data cache, i.e. 4 beats; after a LOAD instruction is issued, the scoreboard number set by the instruction is recorded; in the 3rd beat after issue, the scoreboard number set by the LOAD instruction is decoded into a 64-bit vector and the block on the mth bit of the accurate scoreboard is removed; in the 4th beat, whether the speculation succeeded is judged according to the DCache hit signal; if the speculation succeeded, subsequent operation continues; and if the speculation failed, instruction issue in that beat is blocked and the mth bit of the accurate scoreboard is blocked again and is unlocked only after the LOAD instruction has completed.

7. The two-stage buffering transmitting device based on the scoreboard principle as claimed in claim 1, wherein there are 3 first-stage waiting queues, namely an integer waiting queue, a floating-point waiting queue and a memory-access waiting queue; there are 9 second-stage transmitting queues, namely 3 integer transmitting queues, 2 floating-point transmitting queues, 2 memory-access transmitting queues, 1 integer store-data transmitting queue and 1 floating-point store-data transmitting queue; instructions in the integer waiting queue are sent to one of the 3 integer transmitting queues according to the pipeline assigned to them; instructions in the floating-point waiting queue are sent to one of the 2 floating-point transmitting queues according to the assigned pipeline; a LOAD instruction in the memory-access waiting queue is sent to one of the 2 memory-access transmitting queues according to the assigned pipeline; and a STORE instruction is sent both to one of the 2 memory-access transmitting queues according to the assigned pipeline and, according to the type of the data to be stored, to the integer store-data transmitting queue or the floating-point store-data transmitting queue.