WO2021203560A1

WO2021203560A1 - Instruction withering-based multi-instruction out-of-order transmission method and processor

Info

Publication number: WO2021203560A1
Application number: PCT/CN2020/098961
Authority: WO
Inventors: 虞致国; 马晓杰; 魏敬和; 顾晓峰
Original assignee: 江南大学
Priority date: 2020-04-07
Filing date: 2020-06-29
Publication date: 2021-10-14
Also published as: CN111538534A; CN111538534B

Abstract

An instruction withering-based multi-instruction out-of-order issue method and a processor, belonging to the field of processor design. The cumbersome arbitration structure of the conventional issue architecture is abandoned, and an instruction withering circuit is added. An instruction age array is used to characterize the time for which each instruction has been stored in the CPU, and a wake-up status bit is added. An instruction whose age has exceeded a withering threshold is stored in a sedimentation pool for direct issue by the CPU. Furthermore, circuit structures such as the instruction request circuit, the instruction allocation circuit and the wake-up circuit are improved, effectively improving the timing of the critical path in a multi-issue processor. When waking up instructions, the wake-up of instructions having a short execution cycle is delayed, and instructions having a long execution cycle are woken up in advance, so as to ensure that instructions can be executed back to back. This satisfies the requirements of a high performance-to-power ratio, low delay and high IPC in a modern superscalar out-of-order processor, and solves the problem in the prior art that delay increases as the number of entries in a processor's issue queue increases.

Description

Multi-instruction out-of-sequence launch method and processor based on instruction withering

Technical field

The invention relates to a multi-instruction out-of-order emission method and a processor based on instruction withering, and belongs to the field of processor design.

Background art

In the more than ten years since the end of Dennard scaling, single-core CPU performance has improved only slowly. In this context, it is necessary to re-examine the core microarchitecture in order to obtain high single-core performance.

Among the many structures of the CPU, the instruction launch architecture is one of the important architectures to achieve the high performance of the CPU. The instruction issue architecture schedules the execution of instructions by selecting and issuing instructions from the instructions to be issued in the instruction issue queue in each cycle. In order to achieve high performance, the instruction issue architecture must achieve high IPC (Instructions per clock, the number of instructions executed per cycle) with low latency. At the same time, low latency is an important consideration in the process of designing the instruction issue architecture, because the instruction issue architecture is a timing critical path in the processor, and the delay of the instruction issue architecture will have a significant impact on the CPU's operating frequency.

The traditional multi-instruction out-of-order issue architecture uses an arbitration circuit to select the instructions that can be issued. Its advantage is that the oldest instruction can be accurately selected for issue, which ensures the efficiency of the processor pipeline. However, as the number of entries in the issue queue grows, the delay of the arbitration circuit increases accordingly.

In modern processors, in pursuit of high IPC, the issue queue is often designed with many entries, which results in significant delay in the arbitration circuit, making the instruction issue circuit a critical path in the processor and the bottleneck of the processor's clock frequency.

In view of the above requirements and challenges, there is an urgent need for a multi-instruction out-of-order issue architecture based on instruction withering that achieves low latency and high IPC.

The multi-instruction out-of-order issue architecture designed by the present invention can effectively determine the relative age of instructions while keeping the impact on processor pipeline efficiency as small as possible, and the delay of its timing path does not increase with the number of entries in the issue queue. This keeps the delay as small as possible in a processor with a large number of entries, which provides a guarantee for increasing the processor's clock frequency.

Summary of the invention

To solve the problem that, when issuable instructions are selected through an arbitration circuit, the delay of the arbitration circuit increases with the number of entries in the issue queue, the present invention provides a multi-instruction out-of-order issue method based on instruction withering, and a processor.

A method for multi-instruction out-of-order issue, in which an instruction withering circuit is added to the out-of-order instruction issue architecture of the processor; the instruction withering circuit is used to store newly allocated instructions in the issue queue and to perform the withering operation on the instructions in the issue queue. The method includes:

Set the highest bit of the instruction age corresponding to each instruction in the instruction withering circuit as the wake-up status bit of the instruction, with the remaining bits of the instruction age indicating the intrinsic age of the instruction; the wake-up status bit indicates whether the corresponding instruction has been awakened, so that within the issue queue the instruction age of an awakened instruction is greater than that of a non-awakened instruction;

Set the withering threshold. When the instruction age of an instruction exceeds the withering threshold, the instruction age array triggers the withering signal, which causes the instruction to wither; instructions that have withered can be randomly selected for issue without arbitration, realizing out-of-order issue of multiple instructions;

Each instruction in the transmit queue determines the transmit sequence according to the instruction age and the wake-up state.
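The age encoding in the steps above can be illustrated with a minimal Python sketch. The field width `AGE_BITS` and the helper names `make_age` and `has_withered` are assumptions made for illustration only; the text does not fix them at this point.

```python
AGE_BITS = 4               # assumed width of the intrinsic-age field (illustrative)
WAKE_BIT = 1 << AGE_BITS   # highest bit of the age word: wake-up status bit


def make_age(intrinsic_age: int, awakened: bool) -> int:
    """Pack the intrinsic age and the wake-up status bit into one age word."""
    if not 0 <= intrinsic_age < WAKE_BIT:
        raise ValueError("intrinsic age does not fit in AGE_BITS")
    return (WAKE_BIT if awakened else 0) | intrinsic_age


def has_withered(age_word: int, threshold: int) -> bool:
    """An instruction withers when its age word exceeds the threshold."""
    return age_word > threshold
```

Because the wake-up status bit is the most significant bit, `make_age(1, True)` compares greater than `make_age(15, False)`, which is the ordering between awakened and non-awakened instructions that the method requires.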

Optionally, when the instruction is awakened, the method delays the awakening of the instruction with a short execution cycle, and wakes up the instruction with a long execution cycle in advance, so as to ensure that the instruction can be executed back-to-back.

Optionally, when the method wakes up an instruction, after the earlier of two successive instructions is issued, the processor waits for the earlier instruction to finish executing and then wakes up the later instruction.

Optionally, the instruction out-of-sequence issuing architecture further includes an instruction distribution circuit, an instruction request circuit based on a class adder, and a dynamic delay wake-up circuit;

The instruction allocation circuit is used to allocate multiple instructions sent from the physical register to the idle entries in the transmit queue;

The instruction request circuit based on the class adder is used to count the total number of idle signals of the entries in the issue queue and to encode the number of idle signals with a special code; if the encoded total number of idle signals is smaller than the instruction issue width, which is encoded in the same way, an instruction request signal is sent to the physical register file;

The dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of an instruction to be issued is equal to the destination register number of an issued instruction. At the same time, the wake-up circuit recognizes the execution cycle of the instruction to be issued through the instruction execution discrimination circuit, and adjusts the timing of the wake-up signal according to that execution cycle, to ensure that instructions can be executed back to back.

Optionally, the instruction withering circuit includes an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit;

The instruction age array is used to indicate the instruction age of each instruction in the issue queue and whether it has been awakened;

The issue queue is used to store the instructions sent from the physical register; the issue queue is designed as a non-compressed structure, that is, when the instruction in a certain entry is issued and the entry becomes idle, the other entries are not shifted. In addition to temporarily storing the physical register number of the current instruction, each entry also records the wake-up state of the current instruction and whether the entry is idle;

The withering threshold adjuster is used to dynamically adjust and output the withering threshold according to the number of free entries in the sedimentation pool and the age values of the instructions still remaining in the issue queue;

The sedimentation pool is used for storing withered instructions that meet the withering conditions;

The global age feature extraction circuit is used to count the global age features.

Optionally, the input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold x, namely:

Among them, σ is the variance of the instruction age, μ is the expectation of the instruction age, α is the adjustment coefficient, and α satisfies

Optionally, the instruction request circuit based on a class adder includes a class-addition layer followed by log2(n/2) shift logic layers, where n represents the number of entries in the issue queue.

Optionally, when counting the total number of idle signals of the entries, the instruction request circuit based on the class adder inputs the idle signal sequence of the entries into the class-addition layer, which computes the number of idle signals, applies the special encoding, and outputs the specially encoded number of idle signals; the output of the class-addition layer is sent to the subsequent log2(n/2) shift logic layers, and the final statistical result is compared with the instruction issue width, which is encoded in the same way, to determine whether an instruction request signal needs to be sent.

Optionally, the class-addition layer is composed of class-addition units; inputting the idle signal sequence of the entries into the class-addition layer, computing the number of idle signals, applying the special encoding, and outputting the specially encoded total number of idle signals includes:

When counting the total number of idle signals of the entries, the idle signal sequence of the entries is input to the class-addition layer. Each class-addition unit takes two binary digits of the idle signal sequence as input, performs an AND operation and an exclusive-OR operation on them, and then compares the two results:

If they are equal and the result of the AND operation is 1, the code representing 1, "01", is output; that is, the sum of the two binary digits input to the class-addition unit is 1, and its code is "01";

If they are equal and the result of the AND operation is 0, the code representing 0, "10", is output; that is, the sum of the two binary digits input to the class-addition unit is 0, and its code is "10";

If they are not equal, the code representing 2, "00", is output; that is, the sum of the two binary digits input to the class-addition unit is 2, and its code is "00";

The number of coding bits is n.
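The sum-to-code mapping of a single class-addition unit can be sketched in Python as follows. This is a minimal illustration, not the circuit itself: the function name `class_add_unit` is hypothetical, and the sketch keys directly on the sum of the two input bits ("10" for 0, "01" for 1, "00" for 2), using the AND and XOR results only to tell the three sums apart. The exact comparison wiring inside the unit is an assumption.

```python
def class_add_unit(a: int, b: int) -> str:
    """Encode the sum of two idle bits: 0 -> "10", 1 -> "01", 2 -> "00"."""
    and_bit = a & b   # 1 only when both entries are idle
    xor_bit = a ^ b   # 1 when exactly one entry is idle
    if and_bit:
        return "00"   # sum is 2
    if xor_bit:
        return "01"   # sum is 1
    return "10"       # sum is 0
```

In this code the single 1 moves one place to the right for each idle entry counted, which is what allows the later layers to add counts by shifting.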

Optionally, the subsequent log2(n/2) shift logic layers are composed of right shifters; inputting the output of the class-addition layer into the subsequent log2(n/2) shift logic layers and comparing the result with the instruction issue width, which is encoded in the same way, to determine whether an instruction request signal needs to be sent includes:

Each right shifter takes the output of one class-addition unit as the data to be shifted and the output of another class-addition unit as the shift amount, and shifts the data to the right by n bits, where n here is the decimal number corresponding to the shift amount.

Optionally, the rear log2 (n/2) level shift logic layer has a tree structure and is connected layer by layer.

Optionally, the dynamic delayed wake-up circuit is composed of a comparator, an instruction execution discrimination circuit, and a register. The inputs of the wake-up circuit are the source register number of the instruction to be issued and the destination register number of the issued instruction. The comparator compares whether the source register number of the instruction to be issued and the destination register number of the issued instruction are equal, and if they are equal, a wake-up signal is sent. At the same time, the wake-up circuit recognizes the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and outputs its number of cycles; the register holds the wake-up signal to be sent according to that number of cycles, so as to adjust the timing of the wake-up signal.

Optionally, the instruction execution discrimination circuit is implemented by a read-only RAM; the number of execution cycles corresponding to each kind of instruction is written into the read-only RAM in advance, and the pre-stored number of cycles is read by supplying the type code of the instruction as the address, so as to obtain the execution cycle of the corresponding instruction.

The present application also provides a processor. The instruction out-of-order issue architecture of the processor includes an instruction distribution circuit, an instruction decay circuit, an instruction request circuit based on an adder-like circuit, and a dynamic delay wake-up circuit;

The instruction withering circuit is used to store the newly allocated instructions in the emission queue, and implement the withering operation on the instructions in the emission queue according to the instruction age of each instruction; the instructions that have withered can be randomly selected for emission without arbitration;

The instruction request circuit based on the class adder is used to count the total number of idle signals of the entries in the issue queue and to encode the number of idle signals with a special code; if the encoded total number of idle signals is smaller than the instruction issue width, which is encoded in the same way, an instruction request signal is sent to the physical register file;

Optionally, the highest bit of the instruction age of each instruction is set as the wake-up status bit of the instruction, and the remaining bits of the instruction age represent the intrinsic age of the instruction; the wake-up status bit indicates whether the corresponding instruction has been awakened, so that within the issue queue the instruction age of an awakened instruction is greater than that of a non-awakened instruction.

The beneficial effects of the present invention are:

This application abandons the lengthy arbitration structure of the traditional issue architecture and adds an instruction withering circuit. It uses an instruction age array to characterize the time an instruction has been stored in the CPU, adds a wake-up status bit, and stores instructions that have exceeded the withering threshold in the sedimentation pool for direct issue by the CPU. It also improves the circuit structure of the instruction request circuit, the instruction allocation circuit, the wake-up circuit and so on, effectively improving the timing of the critical path in a multi-issue processor. Specifically, when counting the total number of idle signals of the entries, the improved instruction request circuit uses class-addition units that perform AND and XOR operations on their two input signals, instead of the logical addition that a traditional instruction request circuit uses when counting idle-entry information, which reduces the time the instruction request circuit takes to count the idle signals. When waking up instructions, the wake-up of instructions with a short execution cycle is delayed and instructions with a long execution cycle are woken up in advance, to ensure that instructions can be executed back to back. This meets the requirements of modern superscalar out-of-order processors for a high performance-to-power ratio, low latency, and high IPC, and solves the problem in the prior art that delay increases as the number of entries in the processor's issue queue increases.

Description of the drawings

In order to explain the technical solutions in the embodiments of the present invention more clearly, the following will briefly introduce the drawings needed in the description of the embodiments. Obviously, the drawings in the following description are only some embodiments of the present invention. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative work.

FIG. 1 is a schematic diagram of the overall composition of the multi-instruction out-of-sequence issuing architecture based on instruction fading according to the present invention.

Figure 2 is a schematic diagram of the composition of the instruction withering circuit of the present invention.

Fig. 3 is a schematic diagram of the composition of the instruction distribution circuit of the present invention.

Fig. 4 is a schematic diagram of the composition of the instruction request circuit based on the class adder of the present invention.

Fig. 5 is a schematic diagram of the composition of the dynamic delayed wake-up circuit of the present invention.

Fig. 6 is a schematic diagram of the pipeline for adjusting the wake-up sequence through the wake-up circuit.

Detailed description

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below in conjunction with the accompanying drawings.

Embodiment one:

This embodiment provides a processor. FIG. 1 is a schematic diagram of the overall composition of the multi-instruction out-of-order issue architecture of the processor. The architecture includes an instruction distribution circuit, an instruction withering circuit, an instruction request circuit based on a class adder, and a dynamic delayed wake-up circuit.

Among them, the instruction distribution circuit distributes the register-renamed instructions to the entries of the instruction issue queue. The instruction issue queue contains multiple entries, and each entry holds one instruction to be issued. If there is an idle entry in the instruction issue queue, it accepts an instruction sent by the distribution circuit.

All instructions to be issued that have just entered an entry are in the unawakened state. If the source register number of an instruction is equal to the destination register number of an issued instruction, the instruction is awakened by the wake-up circuit. Withering of all instructions in the entries is carried out by the instruction withering circuit, and all instructions that have completed withering are eventually issued, which realizes out-of-order issue of multiple instructions in a superscalar out-of-order processor.

The composition of the instruction withering circuit is shown in Figure 2. It includes an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit.

Instructions newly allocated by the instruction allocation circuit enter the instruction withering circuit and are stored in idle entries of the issue queue. At the same time, the corresponding instruction age in the instruction age array is initialized to a random value between 0 and 1.

Whenever an instruction is issued, the age increment signal is sent to the age array, and the age of each instruction in the issue queue that has not yet been issued is increased by one accordingly.

The withering threshold adjuster adjusts and outputs the withering threshold according to the idle-entry information of the sedimentation pool and the global age feature. If an instruction age in the instruction age array is greater than the withering threshold, the instruction age array outputs the withering signal; the instruction that receives the withering signal performs the withering operation and moves from the issue queue into the sedimentation pool, and the corresponding entry in the issue queue is set to the idle state to await a newly allocated instruction.

The withered instructions in the sedimentation pool can be issued without going through arbitration.
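The flow from allocation through aging to withering can be sketched as a small behavioral model in Python. This is a rough sketch under stated assumptions, not the circuit: the class and method names are hypothetical, the threshold is passed in as a plain number rather than computed by the adjuster, ages start at zero, the wake-up status bit is left out, a withering instruction stays in the queue when the pool is full, and the pool is drained first-in first-out although any withered instruction may be chosen.

```python
class WitherCircuit:
    """Behavioral model of the issue queue, age array and sedimentation pool."""

    def __init__(self, n_entries: int, pool_size: int):
        self.queue = [None] * n_entries   # None marks an idle entry
        self.ages = [0] * n_entries       # one age counter per entry
        self.pool = []                    # sedimentation pool
        self.pool_size = pool_size

    def allocate(self, instr) -> bool:
        """Store a new instruction in the first idle entry (non-compressed queue)."""
        for i, slot in enumerate(self.queue):
            if slot is None:
                self.queue[i] = instr
                self.ages[i] = 0          # assumed initial age
                return True
        return False

    def on_issue(self) -> None:
        """Age-increment signal: every instruction still queued gets one older."""
        for i, slot in enumerate(self.queue):
            if slot is not None:
                self.ages[i] += 1

    def wither(self, threshold: int) -> None:
        """Move instructions older than the threshold into the pool."""
        for i, slot in enumerate(self.queue):
            if slot is None or self.ages[i] <= threshold:
                continue
            if len(self.pool) < self.pool_size:
                self.pool.append(slot)
                self.queue[i] = None      # entry becomes idle again

    def issue_from_pool(self):
        """Issue a withered instruction; no age arbitration is performed."""
        if not self.pool:
            return None
        instr = self.pool.pop(0)
        self.on_issue()
        return instr
```

The point of the model is that `issue_from_pool` never compares ages: the only age comparison is against the threshold in `wither`, which is done independently for each entry.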

The inputs of the withering threshold adjuster are the idle-entry information of the sedimentation pool and the global age feature; the global age feature value is output by the global age feature extraction circuit. The adjuster adjusts and outputs the withering threshold according to the number of idle entries in the sedimentation pool and the age values of all current instructions.

The input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold. The withering threshold x is:

Where α satisfies

σ is the variance of the instruction age, and μ is the expectation of the instruction age. The characteristic values are derived as follows:

A modern processor can process hundreds of millions of instructions per second, and the initial age value is a random value between 0 and 1. Under this large-sample condition, the instruction age in the processor can be regarded as continuous, and according to the law of large numbers it can be regarded as obeying a normal distribution:

where σ is the variance of the instruction age, and μ is the expectation of the instruction age.

Construct the function g(x):

Transforming (2) gives:

Taking the first derivative of (3):

Taking the second derivative of (3):

Setting (4) to 0 gives:

Substituting (6) into (5) gives:

and therefore:

So that the withering age taken with x as the threshold is as large as possible and the impact on pipeline efficiency is as small as possible, the following should hold:

Taking the lowest threshold, the constraint on the adjustment coefficient α is:

In summary:

and α satisfies:

The instruction age array is essentially an array of counters. Each counter has a fixed bit width, determined by n and s, and represents the instruction age of the corresponding instruction: the low-order bits are the age count bits, and the highest bit is the wake-up status bit. Here n represents the number of entries in the issue queue and s represents the instruction issue width.

Whenever a newly allocated instruction enters the issue queue of the instruction withering circuit, the corresponding instruction age is set to zero;

Whenever an instruction is issued, the instruction age corresponding to each instruction that has not been issued increases by 1;

Whenever an instruction in the issue queue is awakened, the wake-up status bit of the instruction age corresponding to that instruction is set to 1. If the instruction age corresponding to an instruction is greater than the withering threshold, the withering signal is output to the issue queue.

The transmit queue includes n entries, and each entry stores an instruction to be transmitted and an idle bit of the entry.

The sedimentation pool is an instruction queue whose number of entries is much smaller than the instruction issue queue, in which there are withering instructions that meet the withering conditions, and the withering instructions in the sedimentation pool can be directly launched without arbitration.

Figure 3 shows the composition diagram of the instruction distribution circuit. The instruction allocation circuit is used to allocate multiple instructions sent from the physical register to the idle entries in the transmit queue.

The instruction distribution circuit includes s entry number selection circuits. The inputs of each entry number selection circuit are the idle signal sequence of n/s entries of the issue queue in the instruction withering circuit and the corresponding issue queue entry numbers. The entry number selection circuit selects an issue queue entry number according to whether the input idle signals are valid. If multiple idle signals are valid, the first one is selected; if no idle signal is valid, the output is the maximum value representable in the data bits (the upper limit value), which indicates that no entry has been selected. The entry number output by the entry distribution circuit is compared with the upper limit value: if they are equal, the valid signal is set to 1, and if they are not equal, it is set to 0. Each instruction to be allocated that is input to the allocation circuit is written into the corresponding entry according to the entry number and the valid signal. Here s represents the instruction issue width and n represents the number of entries in the issue queue.

The entry number selection circuit is composed of a selector array. As shown in Fig. 2, the first column of selectors takes the entry numbers as input. Since the first idle entry needs to be selected, each selector selects an entry number based on the idle signal of the smaller entry number. The entry number inputs of the second selection layer are the entry numbers output by the first selection layer, and the selection signal is again the idle signal of the smaller entry number, and so on, for a total of log2(n) selection layers. The selection result of the log2(n)-th selection layer is output to the all-empty entry selector. The selection signal of this selector is the selection signal of the log2(n)-th selection layer, and the data to be selected are the selection result of the log2(n)-th selection layer and the upper limit value. If the selection signal is 0, the upper limit value is output as the final entry number; otherwise, the selection result of the log2(n)-th selection layer is output as the final entry number, where n represents the number of entries.
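The layered selection described above can be sketched in Python as a tree of two-input selectors. This is a minimal sketch under assumptions: the function name `select_entry` and the parameter `id_bits` are hypothetical, the number of inputs is taken to be a power of two, `id_bits` is assumed wide enough that the all-ones upper limit value never coincides with a real entry number, and the final selection signal is modeled by OR-ing the idle signals up the tree.

```python
def select_entry(idle, entry_numbers, id_bits):
    """Return the number of the first idle entry, or the all-ones upper limit if none is idle."""
    if len(idle) != len(entry_numbers) or len(idle) & (len(idle) - 1):
        raise ValueError("sketch assumes equally long, power-of-two input lists")
    upper_limit = (1 << id_bits) - 1
    # Each node carries (some entry below is idle, selected entry number).
    level = list(zip(idle, entry_numbers))
    while len(level) > 1:
        next_level = []
        for (idle_lo, num_lo), (idle_hi, num_hi) in zip(level[0::2], level[1::2]):
            # Two-input selector steered by the idle signal of the smaller entry number.
            next_level.append((idle_lo | idle_hi, num_lo if idle_lo else num_hi))
        level = next_level
    any_idle, number = level[0]
    # Final selector: fall back to the upper limit value when nothing is idle.
    return number if any_idle else upper_limit
```

For example, `select_entry([0, 1, 0, 1], [0, 1, 2, 3], 3)` yields entry 1, and with no idle inputs the function returns the upper limit value. The loop runs log2(n) times for n inputs, matching the number of selection layers in the text.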

Figure 4 shows a schematic diagram of the composition of the instruction request circuit. The instruction request circuit is used to count the total number of entry idle signals and to encode that number with a special code. If the encoded total number of idle signals is less than the instruction issue width, which is encoded in the same way, an instruction request signal is issued to the physical register file. The instruction request circuit is composed of two parts: a class-addition layer and the subsequent log2(n/2) shift logic layers.

The class-addition layer is composed of class-addition units. When counting the total number of idle signals of the entries, the idle signal sequence of the entries is input to the class-addition layer, which computes the number of idle signals, applies the special encoding, and outputs the specially encoded total number of idle signals. The output of the class-addition layer is sent to the subsequent log2(n/2) shift logic layers, which output the final statistical result. The statistical result is compared with the instruction issue width, which is encoded in the same way, to determine whether an instruction request signal needs to be sent.

Specifically, when counting the total number of idle signals of the entries, the idle signal sequence of the entries is input to the class-addition layer. Each class-addition unit takes two binary digits of the idle signal sequence as input, performs an AND operation and an exclusive-OR operation on them, and then compares the two results:

If they are equal and the result of the AND operation is 1, the code representing 1, "01", is output; that is, the sum of the two binary digits input to the class-addition unit is 1, and its code is "01";

If they are equal and the result of the AND operation is 0, the code representing 0, "10", is output; that is, the sum of the two binary digits input to the class-addition unit is 0, and its code is "10";

If they are not equal, the code representing 2, "00", is output; that is, the sum of the two binary digits input to the class-addition unit is 2, and its code is "00";

The number of coding bits is n.

The subsequent log2(n/2) shift logic layers are composed of right shifters. Inputting the output of the class-addition layer into these shift logic layers and comparing the result with the instruction issue width, which is encoded in the same way, to determine whether an instruction request signal needs to be sent includes:

Each right shifter takes the output of one class-addition unit as the data to be shifted and the output of another class-addition unit as the shift amount, and shifts the data to the right by n bits, where n here is the decimal number corresponding to the shift amount.

For example, if the data to be shifted is "01" and the shift amount is "00", then according to the above coding rule "00" represents 2, so "01" is shifted to the right by 2 bits.
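The way right shifts take the place of addition can be sketched in Python for a whole queue. This is a sketch under assumptions rather than the circuit: the names `encode`, `decode` and `count_idle` are hypothetical, every code is taken to be n bits wide with a single 1 that sits `count` places to the right of the top bit (all zeros when the count reaches n), and the number of entries is assumed to be a power of two.

```python
def encode(count: int, n: int) -> int:
    """n-bit code: a single 1 placed `count` positions right of the top bit (0 when count == n)."""
    return (1 << (n - 1)) >> count


def decode(code: int, n: int) -> int:
    """Inverse of encode: how far the single 1 has moved away from the top bit."""
    return n if code == 0 else n - code.bit_length()


def count_idle(idle_bits) -> int:
    """Count idle entries with a class-addition layer followed by right-shift layers."""
    n = len(idle_bits)
    if n < 2 or n & (n - 1):
        raise ValueError("sketch assumes a power-of-two number of entries")
    # Class-addition layer: one code per pair of idle bits.
    codes = [encode(a + b, n) for a, b in zip(idle_bits[0::2], idle_bits[1::2])]
    # log2(n/2) shift layers: shifting one code right by the count held in the
    # other code yields the code of the two counts added together.
    while len(codes) > 1:
        codes = [data >> decode(amount, n)
                 for data, amount in zip(codes[0::2], codes[1::2])]
    return decode(codes[0], n)
```

With n = 2 the codes are "10", "01" and "00" for counts 0, 1 and 2, as in the coding rule above. For eight entries with idle bits `[1, 1, 0, 1, 0, 0, 1, 1]`, the pair counts 2, 1, 0, 2 are combined by two shift layers into a total of 5.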

Figure 5 shows the schematic diagram of the wake-up circuit. The wake-up circuit is composed of a comparator, an instruction execution discrimination circuit, and a register.

The inputs of the wake-up circuit are the source register number of the instruction to be issued and the destination register number of the issued instruction. The comparator compares whether the two are equal, and if they are equal, a wake-up signal is sent. At the same time, the wake-up circuit recognizes the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and outputs its number of cycles. The register holds the wake-up signal to be sent according to that number of cycles, so as to adjust the timing of the wake-up signal: the wake-up of instructions with a short execution cycle is delayed, and instructions with a long execution cycle are woken up in advance, which ensures that the instructions in the pipeline can be executed back to back and improves the efficiency of the pipeline.

Figure 6 shows a schematic diagram of the pipeline after the instruction wake-up adjustment. Instruction A requires three execution cycles, and instructions B, C, and D each require one execution cycle. After the wake-up circuit adjusts the wake-up sequence, the wake-up of instruction D is delayed by two cycles relative to instruction A, so the two instructions B and C can be inserted between instructions A and D and executed back to back. All 4 instructions are thus executed back to back with no pipeline bubbles, and the execution efficiency of the pipeline is improved.
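The Figure 6 example can be worked through with a small Python sketch. Everything named here is an assumption for illustration: the table `EXEC_CYCLES` stands in for the pre-written read-only memory, the instruction type codes `"long_op"` and `"short_op"` are invented, and the delay rule (hold the wake-up signal for one cycle less than the execution time of the instruction that produces the awaited result) is one reading of the example, not a rule stated in the text.

```python
# Hypothetical latency table standing in for the read-only memory contents.
EXEC_CYCLES = {"long_op": 3, "short_op": 1}


def wakeup_match(source_regs, issued_dest_reg) -> bool:
    """Comparator: does the issued instruction produce one of the awaited sources?"""
    return issued_dest_reg in source_regs


def wake_cycle(producer_issue_cycle: int, producer_type: str) -> int:
    """Cycle in which the held wake-up signal is finally released (assumed rule)."""
    delay = EXEC_CYCLES[producer_type] - 1
    return producer_issue_cycle + delay
```

With instruction A modeled as a `"long_op"` issued in cycle 0, `wake_cycle(0, "long_op")` gives cycle 2, two cycles later than an undelayed wake-up, which leaves cycles 1 and 2 free for B and C before D executes in cycle 3.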

Embodiment two:

This embodiment provides a multi-instruction out-of-order issue method based on instruction withering, which is used in the processor described in the first embodiment. In this processor the physical register file is actually read, and each entry in the issue queue stores a physical register number. The method includes:

S1: When the physical register file receives the instruction request signal from the instruction request circuit, it outputs an appropriate instruction to the instruction distribution circuit.

S2: The instruction allocation circuit allocates the instructions output by the physical register file to the entries in the instruction issue queue:

The instruction allocation circuit includes s entry-number selection circuits. The input of each entry-number selection circuit is the idle signal sequence of n/s entries of the issue queue in the instruction withering circuit, together with the corresponding issue queue entry numbers. The entry-number selection circuit selects an issue queue entry number according to whether the input idle signals are valid. If several idle signals are valid, the first one is selected; if no idle signal is valid, the output is the maximum value that the data bits can represent, which indicates that no entry has been selected.

The entry number output by the entry allocation circuit is compared with this upper-limit value. If they are equal, the valid signal is set to 1; if they are not equal, it is set to 0. Each instruction to be allocated at the input of the allocation circuit is written into the corresponding entry according to the entry number and the valid signal.

Here, s is the instruction issue width and n is the number of entries in the issue queue.
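The selection rule in S2 can be sketched as follows. This is a minimal illustration, assuming that the n entries are split into s contiguous groups and that entry numbers are plain integers; `select_entry`, `allocate`, and `data_bits` are hypothetical names, not taken from the patent.

```python
def select_entry(idle, entry_numbers, data_bits):
    """Return the number of the first idle entry, or the largest value that
    fits in data_bits when no entry is idle."""
    no_entry = (1 << data_bits) - 1
    for is_idle, number in zip(idle, entry_numbers):
        if is_idle:
            return number
    return no_entry


def allocate(idle, s, data_bits):
    """Run one selector per group of n/s entries.
    Splitting into contiguous groups is an assumption of this sketch."""
    group = len(idle) // s
    return [
        select_entry(idle[g * group:(g + 1) * group],
                     range(g * group, (g + 1) * group),
                     data_bits)
        for g in range(s)
    ]


# 8 entries, issue width 2, 4-bit entry numbers (15 means "no entry selected").
print(allocate([0, 1, 1, 0, 0, 0, 0, 0], s=2, data_bits=4))   # prints [1, 15]
```

In the example the first selector picks entry 1, the first idle entry in its group, while the second selector finds no idle entry and returns the all-ones value.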

S3: Each time the issue queue in the instruction withering circuit receives a new instruction, the instruction age in the instruction age array corresponding to that instruction's entry is reset to zero. Each time the instruction withering circuit issues an instruction, the instruction age of every instruction still in the issue queue is incremented by one. The highest bit of an instruction's age is the wake-up status bit of the instruction, and the remaining bits represent its intrinsic age. After an instruction in the issue queue is awakened, the highest bit of its age information is set to one, which ensures that the age of an awakened instruction is greater than that of a non-awakened instruction.

When the instruction age exceeds the withering threshold, the instruction age array will trigger the withering signal, which causes the instruction to wither. The instruction that has withered enters the sedimentation pool from the emission queue, and the entry in the emission queue is set to be idle.

The sedimentation pool is an instruction queue whose number of entries is much smaller than that of the issue queue. It holds instructions that have withered, and the withered instructions in the sedimentation pool can be randomly selected for issue.

The issue queue in the withering circuit is designed as a non-compressed structure: when the instruction in a certain entry is issued and the entry becomes idle, the other entries are not shifted. In addition to temporarily storing the physical register number of the current instruction, each entry also records the wake-up status of the current instruction and whether the entry is idle;
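The age bookkeeping in S3 can be illustrated with a short sketch. Several points are assumptions rather than statements from the text: the age is 4 bits wide, the intrinsic age saturates instead of overflowing into the wake-up bit, the threshold is compared against the full age value including the wake-up bit, and an instruction stays in the queue while the sedimentation pool is full. The class and method names are hypothetical.

```python
AGE_BITS = 4                       # assumed: 1 wake-up bit + 3 intrinsic age bits
WAKE_BIT = 1 << (AGE_BITS - 1)
INTRINSIC_MASK = WAKE_BIT - 1


class WitherQueue:
    def __init__(self, n_entries, pool_size, threshold):
        self.entries = [None] * n_entries   # None marks an idle entry
        self.ages = [0] * n_entries
        self.pool = []                      # sedimentation pool
        self.pool_size = pool_size
        self.threshold = threshold

    def allocate(self, idx, instr):
        self.entries[idx] = instr
        self.ages[idx] = 0                  # a new instruction starts at age zero

    def wake(self, idx):
        self.ages[idx] |= WAKE_BIT          # awakened ages outrank all others

    def on_issue(self):
        """Age every instruction still queued; call once per issued instruction."""
        for i, instr in enumerate(self.entries):
            if instr is not None:
                intrinsic = min((self.ages[i] & INTRINSIC_MASK) + 1, INTRINSIC_MASK)
                self.ages[i] = (self.ages[i] & WAKE_BIT) | intrinsic

    def wither(self):
        """Move instructions whose age exceeds the threshold into the pool."""
        for i, instr in enumerate(self.entries):
            if (instr is not None and self.ages[i] > self.threshold
                    and len(self.pool) < self.pool_size):
                self.pool.append(instr)
                self.entries[i] = None      # entry becomes idle, nothing shifts


q = WitherQueue(n_entries=4, pool_size=2, threshold=5)
q.allocate(0, "add")
q.allocate(1, "mul")
q.wake(0)
q.on_issue()
q.wither()
print(q.pool, q.entries)   # prints ['add'] [None, 'mul', None, None]
```

Because the wake-up bit is the top bit, the awakened instruction's age (9) exceeds the threshold while the non-awakened one (1) does not, and `mul` keeps its original entry after `add` leaves.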

S4: The idle signals of the entries in the issue queue are transmitted to the instruction request circuit at the same time. The instruction request circuit counts the number of free entries in the issue queue. If the number of free entries is greater than the instruction issue width, the request circuit sends an instruction request signal to the physical register file, and the physical register file accepts the request signal and sends instructions to the instruction allocation circuit;
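The counting step in S4 can be sketched with the shift-style encoding that the claims describe for the class adder, where 0, 1, and 2 are coded as "10", "01", and "00". The claims name an AND and an XOR per unit; this sketch uses OR and XOR, because with OR the three listed cases coincide with sums 0, 1, and 2. That substitution, the power-of-two queue size, the fixed n-bit code width, and all function names are assumptions of the sketch.

```python
def encode_count(count, width):
    """Shift-style code: a single 1 at the top bit, moved right once per count.
    With width 2 this gives 0 -> 0b10, 1 -> 0b01, 2 -> 0b00."""
    return (1 << (width - 1)) >> count


def decode_count(code, width):
    """Inverse of encode_count; the all-zero code stands for count == width."""
    return width if code == 0 else width - code.bit_length()


def pair_unit(a, b, width):
    """Class addition unit for two idle bits (OR/XOR variant, an assumption)."""
    or_result, xor_result = a | b, a ^ b
    if or_result == xor_result:
        return encode_count(1 if or_result else 0, width)
    return encode_count(2, width)


def shift_merge(data_code, shift_code, width):
    """Right-shift one unit's code by the count carried in the other unit's code."""
    return data_code >> decode_count(shift_code, width)


def encoded_idle_total(idle):
    n = len(idle)   # assumed to be a power of two, at least 2
    codes = [pair_unit(idle[i], idle[i + 1], n) for i in range(0, n, 2)]
    while len(codes) > 1:   # log2(n/2) shift layers, combined as a tree
        codes = [shift_merge(codes[i], codes[i + 1], n)
                 for i in range(0, len(codes), 2)]
    return codes[0]


def needs_request(idle, issue_width):
    """A smaller code means more idle entries, so 'less than' in the encoded
    domain corresponds to 'more free entries than the issue width'."""
    return encoded_idle_total(idle) < encode_count(issue_width, len(idle))


idle = [1, 0, 1, 1, 0, 0, 0, 1]    # four idle entries out of eight
print(needs_request(idle, 2), needs_request(idle, 4))   # prints True False
```

The comparison never converts the code back to a binary count at the top level: because more idle entries push the single 1 further right, an ordinary magnitude comparison of the two codes is enough.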

S5: During instruction issue, the wake-up circuit compares the destination register number of the currently issued instruction with the source register numbers of each instruction in the issue queue. If the numbers are equal, a wake-up signal is generated; at the same time, whether to delay the wake-up signal is decided according to the execution cycle of the instruction, so that instructions with a long execution cycle are woken up in advance and instructions with a short execution cycle are woken up with a delay. The wake-up signal sets the wake-up status bit in the instruction age corresponding to the instruction to 1, which ensures that the age of an awakened instruction is greater than that of a non-awakened instruction.

Part of the steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.

The above are only the preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

A multi-instruction out-of-order issuing method, characterized in that an instruction withering circuit is added to the processor's out-of-order issuing architecture to store newly allocated instructions in the issue queue and to perform the withering operation on the instructions in the issue queue; the method includes:

Setting the highest bit of the instruction age corresponding to each instruction in the instruction withering circuit as the wake-up status bit of the instruction, with the remaining bits of the instruction age indicating the intrinsic age of the instruction; the wake-up status bit indicates whether the corresponding instruction has been awakened, and in the issue queue the age of an awakened instruction is greater than that of a non-awakened instruction;

Setting the withering threshold; when the instruction age of a certain instruction exceeds the withering threshold, the instruction age array triggers the withering signal, which causes the instruction to wither; an instruction that has withered can be randomly selected for issue without arbitration, realizing out-of-order issue of multiple instructions;

The issue order of the instructions in the issue queue is determined according to the instruction age and the wake-up state.
The method according to claim 1, characterized in that when the instruction is awakened, the method delays the wake-up of the instruction with a short execution cycle, and wakes up the instruction with a long execution cycle in advance, so as to ensure that the instruction can be executed back-to-back.
The method according to claim 2, characterized in that, when waking up an instruction, after the earlier of two successive instructions is issued, the processor waits for that earlier instruction to be executed before awakening the later instruction.
The method according to claim 3, wherein the instruction out-of-order issue architecture further comprises an instruction distribution circuit, an instruction request circuit based on a class adder, and a dynamic delay wake-up circuit;

The instruction allocation circuit is used to allocate multiple instructions sent from the physical register to the idle entries in the transmit queue;

The instruction request circuit based on the class adder is used to count the total number of idle signals for entries in the issue queue and to encode that number with a special code; if the specially encoded total number of idle signals is less than the instruction issue width that has undergone the same encoding, an instruction request signal is sent to the physical register file;

The dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of the instruction to be issued is equal to the destination register number of the issued instruction. At the same time, the wake-up circuit identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and, according to that execution cycle, adjusts the timing of the wake-up signal to ensure that instructions can be executed back to back.
The method according to claim 4, wherein the instruction withering circuit comprises an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit;

The instruction age array is used to indicate the instruction age of each instruction in the emission queue and whether it is awakened;

The transmit queue is used to store the instructions sent from the physical register; the transmit queue is designed as a non-compressed structure, that is, when the physical register number of the instruction in a certain table item is transmitted and becomes idle, other table items will not be shifted. In addition to temporarily storing the physical register number of the current instruction, each entry also records the wake-up status of the current instruction and whether the entry is idle;

The withering threshold adjuster is used to dynamically adjust and output the withering threshold according to the number of free entries in the sedimentation pool and the age value of the instructions still remaining in the transmission queue;

The sedimentation pool is used for storing withered instructions that meet the withering conditions;

The global age feature extraction circuit is used to count the global age features.
The method according to claim 5, wherein the input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold x, namely:

Among them, σ is the variance of the instruction age, μ is the expectation of the instruction age, α is the adjustment coefficient, and α satisfies
The method according to claim 4, wherein the instruction request circuit based on a class adder includes a class addition layer and a subsequent log2(n/2)-layer shift logic, where n represents the number of entries in the issue queue.
The method according to claim 7, wherein, when counting the total number of idle signals of the entries, the instruction request circuit based on the class adder inputs the idle signal sequence of the entries into the class addition layer, which calculates the number of idle signals, applies the special encoding, and outputs the specially encoded total number of idle signals; the output of the class addition layer is sent to the subsequent log2(n/2)-layer shift logic, which outputs the final statistical result; and the statistical result is compared with the instruction issue width that has undergone the same special encoding to determine whether an instruction request signal needs to be sent.
The method according to claim 8, wherein the class addition layer is composed of class addition units; and inputting the idle signal sequence of the entries into the class addition layer, calculating the number of idle signals, applying the special encoding, and outputting the specially encoded total number of idle signals includes:

When counting the total number of idle signals of the entries, the idle signal sequence of the entries is input to the class addition layer. Each class addition unit takes two binary digits of the idle signal sequence as input, performs an AND operation and an exclusive-OR operation on them, and then compares the two results:

If they are equal and the result of the AND operation is 1, the unit outputs the code representing 1, "01", which means that the sum of the two binary digits input to the class addition unit is 1;

If they are equal and the result of the AND operation is 0, the unit outputs the code representing 0, "10", which means that the sum of the two binary digits input to the class addition unit is 0;

If they are not equal, the unit outputs the code representing 2, "00", which means that the sum of the two binary digits input to the class addition unit is 2;

The number of coding bits is n.
The method according to claim 9, wherein the subsequent log2(n/2)-layer shift logic is composed of right shifters; and inputting the output of the class addition layer into the subsequent log2(n/2)-layer shift logic and comparing the result with the instruction issue width that has undergone the same special encoding, to determine whether an instruction request signal needs to be sent, includes:

The right shifter takes the output of one class addition unit as the data to be shifted and the output of another class addition unit as the shift amount, and shifts the data to the right by n bits, where n here is the decimal value of the shift amount.
The method according to claim 10, wherein the subsequent log2(n/2)-layer shift logic has a tree structure and is connected layer by layer.
The method according to claim 4, wherein the dynamic delayed wake-up circuit is composed of a comparator, an instruction execution discrimination circuit, and a register;

The inputs of the wake-up circuit are the source register number of the instruction to be issued and the destination register number of the issued instruction. The comparator checks whether the source register number of the instruction to be issued and the destination register number of the issued instruction are equal, and if they are equal, a wake-up signal is sent. At the same time, the wake-up circuit identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and outputs its cycle count. The register holds the pending wake-up signal for that number of cycles, thereby adjusting the timing of the wake-up signal.
The method according to claim 12, wherein the instruction execution discrimination circuit is implemented with a read-only RAM in which the number of execution cycles corresponding to each instruction is written in advance; the instruction type code is input as the address, the pre-stored cycle count is read out of the RAM, and the execution cycle of the corresponding instruction is thereby obtained.
A processor, characterized in that the instruction out-of-order issue architecture of the processor includes an instruction allocation circuit, an instruction withering circuit, an instruction request circuit based on a class adder, and a dynamic delayed wake-up circuit;

The instruction allocation circuit is used to allocate multiple instructions sent from the physical register to the idle entries in the transmit queue;

The instruction withering circuit is used to store the newly allocated instructions in the emission queue, and implement the withering operation on the instructions in the emission queue according to the instruction age of each instruction; the instructions that have withered can be randomly selected for emission without arbitration;

The instruction request circuit based on the class adder is used to count the total number of idle signals for entries in the issue queue and to encode that number with a special code; if the encoded total number of idle signals is less than the instruction issue width that has undergone the same encoding, an instruction request signal is sent to the physical register file;

The dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of the instruction to be issued is equal to the destination register number of the issued instruction. At the same time, the wake-up circuit identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and, according to that execution cycle, adjusts the timing of the wake-up signal to ensure that instructions can be executed back to back.
The processor according to claim 14, wherein the highest bit of the instruction age of each instruction is set as the wake-up status bit of the instruction, and the remaining bits of the instruction age represent the intrinsic age of the instruction; the wake-up status bit indicates whether the corresponding instruction has been awakened, and in the issue queue the age of an awakened instruction is greater than that of a non-awakened instruction.