WO2021203560A1 - Instruction withering-based multi-instruction out-of-order transmission method and processor - Google Patents

Info

Publication number
WO2021203560A1
Authority
WO
WIPO (PCT)
Prior art keywords
instruction
circuit
age
wake
instructions
Prior art date
Application number
PCT/CN2020/098961
Other languages
French (fr)
Chinese (zh)
Inventor
虞致国
马晓杰
魏敬和
顾晓峰
Original Assignee
江南大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江南大学 filed Critical 江南大学
Publication of WO2021203560A1 publication Critical patent/WO2021203560A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003 Arrangements for executing specific machine instructions
    • G06F9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
    • G06F9/3001 Arithmetic instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3836 Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The invention relates to a multi-instruction out-of-order issue method and processor based on instruction withering, and belongs to the field of processor design.
  • The instruction issue architecture is one of the key structures for achieving high CPU performance.
  • The instruction issue architecture schedules instruction execution by selecting and issuing, in each cycle, instructions from those waiting in the instruction issue queue.
  • To achieve high performance, the instruction issue architecture must deliver high IPC (instructions per clock, the number of instructions executed per cycle) with low latency.
  • Low latency is an important consideration when designing the instruction issue architecture, because it lies on a timing-critical path in the processor, and its delay has a significant impact on the CPU's operating frequency.
  • The traditional multi-instruction out-of-order issue architecture uses an arbitration circuit to select the instructions that can be issued.
  • The advantage is that the oldest instruction can be accurately selected for issue, which ensures the efficiency of the processor pipeline.
  • The drawback is that the delay of the arbitration circuit increases accordingly as the number of entries in the issue queue increases.
  • The multi-instruction out-of-order issue architecture designed by the present invention can effectively compare instruction ages while keeping the impact on pipeline efficiency as small as possible, and the delay of the timing path does not increase with the number of entries in the issue queue. This keeps the delay as small as possible in a processor with a large number of entries and supports raising the processor's clock frequency.
  • To solve the problem that the delay of the arbitration circuit increases with the number of entries in the issue queue, the present invention provides a multi-instruction out-of-order issue method based on instruction withering, and a processor.
  • A multi-instruction out-of-order issue method in which an instruction withering circuit is added to the instruction out-of-order issue architecture of the processor; the circuit stores newly allocated instructions in the issue queue and performs the withering operation on the instructions in the issue queue;
  • The wake-up status bit indicates whether the corresponding instruction has been awakened; in the issue queue, the instruction age of an awakened instruction is greater than that of a non-awakened instruction.
  • The issue order of each instruction in the issue queue is determined by its instruction age and wake-up state.
  • The method delays the wake-up of instructions with a short execution cycle and wakes up instructions with a long execution cycle in advance, so that instructions can be executed back to back.
  • The processor waits for the preceding instruction to be executed and then wakes up the succeeding instruction.
  • The instruction out-of-order issue architecture further includes an instruction allocation circuit, an instruction request circuit based on an adder-like circuit, and a dynamic delayed wake-up circuit;
  • The instruction allocation circuit allocates multiple instructions sent from the physical register file to idle entries in the issue queue;
  • The instruction request circuit based on the adder-like circuit counts the total number of entry idle signals in the issue queue and encodes that number with a special code; if the encoded total is smaller than the instruction issue width, which is encoded in the same way, it sends an instruction request signal to the physical register file;
  • The dynamic delayed wake-up circuit sends a wake-up signal when the source register number of an instruction to be issued equals the destination register number of an issued instruction. It also identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and adjusts the timing of the wake-up signal accordingly, so that instructions can be executed back to back.
  • The instruction withering circuit includes an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit;
  • The instruction age array indicates the instruction age of each instruction in the issue queue and whether it has been awakened;
  • The issue queue stores the instructions sent from the physical register file. It is designed as a non-compressed structure: when the instruction in an entry is issued and the entry becomes idle, the other entries are not shifted. Besides temporarily storing the physical register number of the current instruction, each entry also records the wake-up state of the current instruction and whether the entry is idle;
  • The withering threshold adjuster dynamically adjusts and outputs the withering threshold according to the number of free entries in the sedimentation pool and the age values of the instructions still in the issue queue;
  • The sedimentation pool stores withered instructions that meet the withering conditions;
  • The global age feature extraction circuit computes the global age features.
  • the input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold x, namely:
  • is the variance of the instruction age
  • is the expectation of the instruction age
  • is the adjustment coefficient
  • The instruction request circuit based on the adder-like circuit includes an adder-like layer followed by a log2(n/2)-level shift logic layer, where n represents the number of entries in the issue queue.
  • When counting the total number of entry idle signals, the instruction request circuit inputs the idle signal sequence of the entries into the adder-like layer, which computes the number of idle signals, applies the special encoding, and outputs the encoded number of idle signals.
  • The adder-like layer is composed of adder-like units. Inputting the idle signal sequence into the adder-like layer, computing and specially encoding the number of idle signals, and outputting the encoded total number of idle signals includes:
  • Each adder-like unit takes two binary digits of the idle signal sequence as inputs, performs an AND operation and an exclusive-OR operation on them, and then compares the two results.
  • Output code "01" represents 1: the sum of the two binary inputs of the adder-like unit is 1;
  • Output code "10" represents 0: the sum of the two binary inputs of the adder-like unit is 0;
  • The number of coding bits is n.
  • The subsequent log2(n/2)-level shift logic layer is composed of right shifters. Inputting the output of the adder-like layer into this layer and comparing the result with the instruction issue width, which is specially encoded in the same way, to decide whether an instruction request signal needs to be sent includes:
  • Each right shifter takes the output of one adder-like unit as the data to be shifted and the output of another adder-like unit as the shift amount, and shifts the data to the right by that number of bits, where the number is the decimal value corresponding to the shift amount.
  • The subsequent log2(n/2)-level shift logic layer has a tree structure and is connected level by level.
  • The dynamic delayed wake-up circuit is composed of a comparator, an instruction execution discrimination circuit, and a register. Its inputs are the source register number of the instruction to be issued and the destination register number of the issued instruction, which are compared by the comparator.
  • The wake-up circuit identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and outputs its cycle count; the register holds the wake-up signal to be sent according to that cycle count, thereby adjusting the timing of the wake-up signal.
  • The instruction execution discrimination circuit is implemented by a read-only RAM: the number of execution cycles for each instruction type is written into the read-only RAM in advance, and the pre-stored cycle count is read by using the type code of the instruction as the address, which yields the execution cycle of the corresponding instruction.
  • the present application also provides a processor.
  • The instruction out-of-order issue architecture of the processor includes an instruction allocation circuit, an instruction withering circuit, an instruction request circuit based on an adder-like circuit, and a dynamic delayed wake-up circuit;
  • The instruction allocation circuit allocates multiple instructions sent from the physical register file to idle entries in the issue queue;
  • The instruction withering circuit stores newly allocated instructions in the issue queue and performs the withering operation on the instructions in the issue queue according to the instruction age of each instruction; withered instructions can be selected at random for issue without arbitration;
  • The instruction request circuit based on the adder-like circuit counts the total number of entry idle signals in the issue queue and encodes that number with a special code; if the encoded total is smaller than the instruction issue width, which is encoded in the same way, it sends an instruction request signal to the physical register file;
  • The dynamic delayed wake-up circuit sends a wake-up signal when the source register number of an instruction to be issued equals the destination register number of an issued instruction. It also identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and adjusts the timing of the wake-up signal accordingly, so that instructions can be executed back to back.
  • The highest bit of each instruction's age is the wake-up status bit, and the remaining bits represent the intrinsic age of the instruction. The wake-up status bit indicates whether the instruction has been awakened, so in the issue queue the instruction age of an awakened instruction is greater than that of a non-awakened instruction.
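The age layout described above can be illustrated with a minimal Python sketch. The field width `AGE_BITS` and the helper names are assumptions made for illustration only; the text does not fix them. The point shown is that placing the wake-up status bit above the intrinsic age bits makes any awakened instruction compare as older than any non-awakened one.

```python
AGE_BITS = 4                 # assumed number of intrinsic age bits (illustrative)
WAKE_BIT = 1 << AGE_BITS     # highest bit of the age word: wake-up status bit


def make_age(intrinsic_age: int, awake: bool) -> int:
    """Pack the wake-up status bit above the intrinsic age bits."""
    assert 0 <= intrinsic_age < WAKE_BIT
    return (WAKE_BIT if awake else 0) | intrinsic_age


def is_awake(age: int) -> bool:
    """Read the wake-up status bit of an age word."""
    return bool(age & WAKE_BIT)


def intrinsic_age(age: int) -> int:
    """Read the intrinsic age (all bits below the wake-up status bit)."""
    return age & (WAKE_BIT - 1)


# An awakened instruction with the smallest intrinsic age still compares
# greater than a non-awakened instruction with the largest intrinsic age.
old_but_asleep = make_age(WAKE_BIT - 1, awake=False)
young_but_awake = make_age(0, awake=True)
```

With this packing a single unsigned comparison of age words orders instructions first by wake-up state and then by intrinsic age.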
  • This application abandons the lengthy arbitration structure of the traditional issue architecture and adds an instruction withering circuit. It uses an instruction age array to characterize how long each instruction has been stored in the CPU, adds a wake-up status bit, and stores instructions that have exceeded the withering threshold into the sedimentation pool, from which the CPU can issue them directly.
  • It also improves the circuit structure of the instruction request circuit, the instruction allocation circuit, and the wake-up circuit, which effectively improves the timing of the critical path in a multi-instruction issue processor.
  • When counting idle entries, the improved instruction request circuit uses adder-like units that perform AND and XOR operations on two input signals, instead of the logical addition used by a traditional instruction request circuit, which reduces the time the instruction request circuit needs to count the total number of entry idle signals.
  • When waking up instructions, the wake-up of instructions with a short execution cycle is delayed and instructions with a long execution cycle are awakened in advance, so that instructions can be executed back to back.
  • This meets the requirements of modern superscalar out-of-order processors for performance-to-power ratio, low latency, and high IPC, and solves the prior-art problem that the delay increases as the number of entries in the issue queue increases.
  • Fig. 1 is a schematic diagram of the overall composition of the multi-instruction out-of-order issue architecture based on instruction withering according to the present invention.
  • Fig. 2 is a schematic diagram of the composition of the instruction withering circuit of the present invention.
  • Fig. 3 is a schematic diagram of the composition of the instruction allocation circuit of the present invention.
  • Fig. 4 is a schematic diagram of the composition of the instruction request circuit based on the adder-like circuit of the present invention.
  • Fig. 5 is a schematic diagram of the composition of the dynamic delayed wake-up circuit of the present invention.
  • Fig. 6 is a schematic diagram of the pipeline when the wake-up sequence is adjusted by the wake-up circuit.
  • Fig. 1 is a schematic diagram of the overall composition of the multi-instruction out-of-order issue architecture of the processor.
  • The multi-instruction out-of-order issue architecture includes an instruction allocation circuit, an instruction withering circuit, an instruction request circuit based on an adder-like circuit, and a dynamic delayed wake-up circuit.
  • The instruction allocation circuit distributes register-renamed instructions to the entries of the instruction issue queue.
  • The instruction issue queue contains multiple entries, each holding one instruction to be issued. If the instruction issue queue has an idle entry, it accepts an instruction from the allocation circuit.
  • All instructions that have just entered an entry are in the non-awakened state. If the source register number of an instruction equals the destination register number of an issued instruction, the instruction is awakened by the wake-up circuit. Withering of all instructions in the entries is carried out by the instruction withering circuit, and every instruction that has completed withering is eventually issued. In this way, out-of-order issue of multiple instructions is achieved in a superscalar out-of-order processor.
  • Fig. 2 is a schematic diagram of the composition of the instruction withering circuit.
  • The instruction withering circuit includes an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit.
  • Instructions newly allocated by the instruction allocation circuit enter the instruction withering circuit and are stored in idle entries of the issue queue. At the same time, the corresponding instruction age in the instruction age array is initialized to a random value between 0 and 1.
  • An age increment signal is sent to the age array, and the age of each instruction in the issue queue that has not been issued is increased by one accordingly.
  • The withering threshold adjuster adjusts and outputs the withering threshold according to the free-entry information of the sedimentation pool and the global age features. If an instruction age in the instruction age array is greater than the withering threshold, the instruction age array outputs a withering signal. The instruction that receives the withering signal performs the withering operation: it moves from the issue queue into the sedimentation pool, and its entry in the issue queue is set to the idle state to await a newly allocated instruction.
  • The withered instructions in the sedimentation pool can be issued without going through arbitration.
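The withering flow described above can be sketched in Python as follows. This is a simplified behavioural model under stated assumptions, not the circuit itself: the class and method names are hypothetical, the wake-up status bit is left out, the threshold is passed in rather than computed by the adjuster, and an instruction is held back in the queue when the pool is full (a choice the text does not spell out).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    instr: Optional[str] = None   # None marks an idle entry
    age: int = 0


@dataclass
class WitheringCircuit:
    queue: List[Entry]            # non-compressed issue queue
    pool_capacity: int            # sedimentation pool size (much smaller than the queue)
    pool: List[str] = field(default_factory=list)

    def allocate(self, instr: str, initial_age: int = 0) -> bool:
        """Store a new instruction in the first idle entry, if any."""
        for entry in self.queue:
            if entry.instr is None:
                entry.instr, entry.age = instr, initial_age
                return True
        return False

    def tick(self, threshold: int) -> None:
        """Age every waiting instruction, then wither those above the threshold."""
        for entry in self.queue:
            if entry.instr is None:
                continue
            entry.age += 1
            if entry.age > threshold and len(self.pool) < self.pool_capacity:
                self.pool.append(entry.instr)      # move into the sedimentation pool
                entry.instr, entry.age = None, 0   # entry becomes idle; others do not shift

    def issue(self) -> Optional[str]:
        """Issue any withered instruction; no arbitration among pool entries."""
        return self.pool.pop() if self.pool else None
```

For example, an instruction allocated at age 0 and ticked three times against a threshold of 2 moves to the pool on the third tick, while a younger instruction allocated later stays in its original entry.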
  • The inputs of the withering threshold adjuster are the idle-entry information of the sedimentation pool and the global age features, the latter being output by the global age feature extraction circuit. The adjuster adjusts and outputs the withering threshold according to the number of idle entries in the sedimentation pool and all current instruction age values.
  • the input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold.
  • the withering threshold x is:
  • the initial age value is a random value between 0 and 1.
  • The instruction age in the processor can be regarded as continuous, and according to the law of large numbers it can be regarded as obeying a normal distribution: $\text{age} \sim N(\mu, \sigma^2)$
  • where $\sigma^2$ is the variance of the instruction age
  • and $\mu$ is the expectation of the instruction age
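As a small illustration of the two statistics named above, the following Python sketch computes the expectation and variance of a set of instruction ages. Treating these two values as the global age features, and using the population variance rather than the sample variance, are assumptions; how the adjuster combines them with the adjustment coefficient is not modelled here.

```python
from statistics import fmean, pvariance


def global_age_features(ages):
    """Return (expectation, variance) of the instruction ages currently in the queue.

    Population variance is an assumption; the text only says "variance".
    """
    return fmean(ages), pvariance(ages)


# Hypothetical ages of eight waiting instructions.
mu, var = global_age_features([1, 2, 3, 4, 5, 6, 7, 8])
```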
  • The instruction age array is essentially a counter array. Each counter represents the instruction age of the corresponding instruction: the low-order bits are the age counting bits, and the highest bit is the wake-up status bit.
  • When an instruction is awakened, the wake-up status bit in the instruction age corresponding to that instruction is set to 1. If the instruction age corresponding to an instruction is greater than the withering threshold, a withering signal is output to the issue queue. In the counter width, n represents the number of entries in the issue queue and s represents the instruction issue width.
  • The issue queue includes n entries, and each entry stores an instruction to be issued and an idle bit for the entry.
  • The sedimentation pool is an instruction queue with far fewer entries than the instruction issue queue. It holds withered instructions that meet the withering conditions, and the withered instructions in the sedimentation pool can be issued directly without arbitration.
  • Fig. 3 is a schematic diagram of the composition of the instruction allocation circuit.
  • The instruction allocation circuit allocates multiple instructions sent from the physical register file to idle entries in the issue queue.
  • The instruction allocation circuit includes s entry number selection circuits. The inputs of each entry number selection circuit are the idle signal sequence of n/s entries of the issue queue in the instruction withering circuit and the corresponding issue queue entry numbers.
  • Each entry number selection circuit selects an issue queue entry number according to which input idle signals are valid. If several idle signals are valid, the first one is selected; if no idle signal is valid, the output is the maximum value representable in the entry number's data bits, which indicates that no entry is selected.
  • The entry number output by the circuit is compared with this upper-limit value. If they are equal, the valid signal is set to 1; if they are not equal, it is set to 0.
  • Each instruction to be allocated at the inputs of the allocation circuit is written into the corresponding entry according to the entry number and the valid signal. Here s represents the instruction issue width and n represents the number of entries in the issue queue.
  • The entry number selection circuit is composed of a selector array. As shown in Fig. 2, the first column of selectors takes the entry numbers as inputs. Because the first free entry must be selected, each selector chooses an entry number based on the idle signal of the smaller entry number. The second-level inputs are the entry numbers output by the first selection layer, again selected by the idle signal of the smaller entry number, and so on, for a total of log2(n) selection layers. The result of the log2(n)-th selection layer is output to the all-empty entry selector, whose selection signal is the selection signal of the log2(n)-th selection layer and whose data inputs are the result of the log2(n)-th selection layer and the upper-limit value.
  • If the selection signal is 0, the upper-limit value is output as the final entry number; if it is not 0, the result of the log2(n)-th selection layer is output as the final entry number. Here n represents the number of entries.
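The selector tree described above can be modelled with a short Python sketch. It is a behavioural approximation under assumptions: the function names are hypothetical, each group size is assumed to be a power of two, and the valid signal is left out so that only the entry number output is shown.

```python
from typing import List


def select_entry(idle: List[bool], entry_numbers: List[int], number_bits: int) -> int:
    """Pick the first idle entry with a layer-by-layer selector tree.

    Returns the all-ones value of the entry-number field when no entry is idle.
    Assumes len(idle) is a power of two.
    """
    none_selected = (1 << number_bits) - 1
    layer = list(zip(idle, entry_numbers))
    while len(layer) > 1:
        next_layer = []
        for i in range(0, len(layer), 2):
            (idle_lo, num_lo), (idle_hi, num_hi) = layer[i], layer[i + 1]
            # The idle signal of the smaller entry number is the select signal.
            next_layer.append((idle_lo or idle_hi, num_lo if idle_lo else num_hi))
        layer = next_layer
    any_idle, number = layer[0]
    # Final "all-empty" selector: fall back to the upper-limit value.
    return number if any_idle else none_selected


def select_for_all_slots(idle: List[bool], s: int, number_bits: int) -> List[int]:
    """Run s selection circuits, each over its own group of n/s entries."""
    group = len(idle) // s
    return [
        select_entry(idle[g * group:(g + 1) * group],
                     list(range(g * group, (g + 1) * group)),
                     number_bits)
        for g in range(s)
    ]
```

Splitting the queue into s groups lets each issue slot find a free entry independently, so the depth of each tree depends on n/s rather than on n.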
  • Fig. 4 is a schematic diagram of the composition of the instruction request circuit.
  • The instruction request circuit counts the total number of entry idle signals and encodes that number with a special code. If the encoded total is less than the instruction issue width, which is encoded in the same way, it sends an instruction request signal to the physical register file.
  • The instruction request circuit consists of two parts: an adder-like layer and a subsequent log2(n/2)-level shift logic layer.
  • The adder-like layer is composed of adder-like units. When counting the total number of entry idle signals, the idle signal sequence of the entries is input to the adder-like layer, which computes the number of idle signals, applies the special encoding, and outputs the encoded total. The output of the adder-like layer is sent to the subsequent log2(n/2)-level shift logic layer, which outputs the final count. This count is compared with the instruction issue width, which is specially encoded in the same way, to decide whether an instruction request signal needs to be sent.
  • The idle signal sequence of the entries is input to the adder-like layer. Each adder-like unit takes two binary digits of the idle signal sequence as inputs, performs an AND operation and an exclusive-OR operation on them, and then compares the two results:
  • Output code "01" represents 1: the sum of the two binary inputs of the adder-like unit is 1;
  • Output code "10" represents 0: the sum of the two binary inputs of the adder-like unit is 0;
  • The number of coding bits is n.
  • The subsequent log2(n/2)-level shift logic layer is composed of right shifters. Inputting the output of the adder-like layer into this layer and comparing the result with the instruction issue width, which is specially encoded in the same way, to decide whether an instruction request signal needs to be sent includes:
  • Each right shifter takes the output of one adder-like unit as the data to be shifted and the output of another adder-like unit as the shift amount, and shifts the data to the right by that number of bits, where the number is the decimal value corresponding to the shift amount.
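One way to read the special encoding is as a single set bit that moves one position to the right for every idle entry counted, so that two partial counts are combined by a right shift instead of an addition. The Python sketch below follows that reading. Several details are assumptions rather than statements from the text: the code for a sum of 2 (the set bit shifted two places, which is "00" at width 2), the widening of every code to n bits, the `decode` step that turns a code into a shift amount, and the use of an unsigned comparison against the encoded issue width.

```python
from typing import List


def adder_like_unit(a: int, b: int, width: int) -> int:
    """Encode a + b for two idle bits as a leading one shifted right by the sum."""
    top = 1 << (width - 1)      # code for a count of 0 ("10" when width is 2)
    if a & b:                   # AND output set: both idle, sum 2 (assumed code)
        return top >> 2
    if a ^ b:                   # XOR output set: exactly one idle, sum 1 ("01")
        return top >> 1
    return top                  # neither idle, sum 0


def decode(code: int, width: int) -> int:
    """Recover the count: how far the leading one has been shifted."""
    return width if code == 0 else (width - 1) - (code.bit_length() - 1)


def encode_idle_count(idle: List[int]) -> int:
    """Adder-like layer followed by log2(n/2) levels of right shifters."""
    width = len(idle)           # n, assumed to be a power of two and at least 2
    codes = [adder_like_unit(idle[i], idle[i + 1], width)
             for i in range(0, width, 2)]
    while len(codes) > 1:
        # One code is the data to shift, its neighbour supplies the shift amount.
        codes = [codes[i] >> decode(codes[i + 1], width)
                 for i in range(0, len(codes), 2)]
    return codes[0]


def needs_request(idle: List[int], issue_width: int) -> bool:
    """True when the idle count is below the issue width, compared in encoded form."""
    width = len(idle)
    encoded_issue_width = (1 << (width - 1)) >> issue_width
    # Fewer idle entries means less shifting, hence a larger code value.
    return encode_idle_count(idle) > encoded_issue_width
```

For an 8-entry queue with two idle entries, the encoded total is the leading one shifted right twice, and the request condition holds for an issue width of 4 but not for an issue width of 2.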
  • Fig. 5 is a schematic diagram of the wake-up circuit.
  • The wake-up circuit is composed of a comparator, an instruction execution discrimination circuit, and a register.
  • The inputs of the wake-up circuit are the source register number of the instruction to be issued and the destination register number of the issued instruction.
  • The comparator checks whether the source register number of the instruction to be issued and the destination register number of the issued instruction are equal. If they are equal, a wake-up signal is sent. At the same time, the wake-up circuit identifies the execution cycle of the instruction to be issued through the instruction execution discrimination circuit and outputs its cycle count.
  • The register holds the wake-up signal to be sent according to the cycle count of the instruction to be issued, thereby adjusting the timing of the wake-up signal: the wake-up of instructions with a short execution cycle is delayed and instructions with a long execution cycle are awakened in advance, so that the instructions in the pipeline can be executed back to back and pipeline efficiency improves.
  • Fig. 6 is a schematic diagram of the pipeline after the instruction wake-up adjustment.
  • Instruction A requires three execution cycles, and instructions B, C, and D each require one execution cycle.
  • The wake-up of instruction D is delayed by two cycles relative to instruction A, so two instructions, B and C, can be inserted between A and D and executed back to back. All 4 instructions are then executed back to back with no delay bubble, and the execution efficiency of the pipeline improves.
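The timing in the Fig. 6 example can be reproduced with a small Python sketch. The rule used here, delaying the wake-up by the cycle count of the issued instruction minus one, is an assumption chosen because it matches the two-cycle delay in the example; the type codes and the contents of `CYCLE_ROM` are hypothetical stand-ins for the read-only RAM.

```python
from typing import Optional, Tuple

# Hypothetical type codes and cycle counts standing in for the read-only RAM.
CYCLE_ROM = {0: 1, 1: 3}   # type code -> number of execution cycles


def wake_up_cycle(issue_cycle: int, issued_type: int, dest_reg: int,
                  waiting_src_regs: Tuple[int, ...]) -> Optional[int]:
    """Return the cycle in which the waiting instruction may issue, or None.

    Comparator: a wake-up is generated only when the destination register of
    the issued instruction matches a source register of the waiting one.
    Delay register: the signal is held for (cycles - 1) extra cycles (assumed rule).
    """
    if dest_reg not in waiting_src_regs:
        return None
    delay = CYCLE_ROM[issued_type] - 1
    return issue_cycle + 1 + delay


# Instruction A (3 cycles) issues in cycle 0 and writes register 5;
# instruction D reads register 5, so it may issue in cycle 3.
d_issue = wake_up_cycle(0, issued_type=1, dest_reg=5, waiting_src_regs=(5, 2))
```

Cycles 1 and 2 are then free for the independent one-cycle instructions B and C, which gives the bubble-free sequence A, B, C, D.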
  • This embodiment provides a multi-instruction out-of-order issue method based on instruction withering, applied in the processor described in the first embodiment. The processor actually reads the physical register file, and each entry in the issue queue stores a physical register number. The method includes:
  • The instruction allocation circuit allocates the instructions output by the physical register file to the entries of the instruction issue queue:
  • The instruction allocation circuit includes s entry number selection circuits. The inputs of each entry number selection circuit are the idle signal sequence of n/s entries of the issue queue in the instruction withering circuit and the corresponding issue queue entry numbers.
  • Each entry number selection circuit selects an issue queue entry number according to which input idle signals are valid. If several idle signals are valid, the first one is selected; if no idle signal is valid, the output is the maximum value representable in the entry number's data bits, which indicates that no entry is selected.
  • The entry number output by the circuit is compared with this upper-limit value. If they are equal, the valid signal is set to 1; if they are not equal, it is set to 0.
  • Each instruction to be allocated at the inputs of the allocation circuit is written into the corresponding entry according to the entry number and the valid signal.
  • Here s represents the instruction issue width
  • and n represents the number of entries in the issue queue.
  • Instructions newly allocated by the instruction allocation circuit enter the instruction withering circuit and are stored in idle entries of the issue queue. At the same time, the corresponding instruction age in the instruction age array is initialized to a random value between 0 and 1.
  • When an instruction's age exceeds the withering threshold, the instruction age array triggers the withering signal, which causes the instruction to wither.
  • The withered instruction moves from the issue queue into the sedimentation pool, and its entry in the issue queue is set to idle.
  • The sedimentation pool is an instruction queue with far fewer entries than the issue queue. It holds withered instructions, which can be selected at random for issue.
  • The issue queue in the withering circuit is designed as a non-compressed structure: when the instruction in an entry is issued and the entry becomes idle, the other entries are not shifted. Besides temporarily storing the physical register number of the current instruction, each entry also records the wake-up state of the current instruction and whether the entry is idle;
  • The idle signals of the issue queue entries are also sent to the instruction request circuit, which counts the number of free entries in the issue queue. If the number of free entries in the issue queue is greater than the instruction issue width, the request circuit sends an instruction request signal to the physical register file, which accepts the request and sends instructions to the instruction allocation circuit;
  • The wake-up circuit compares the destination register number of the currently issued instruction with the source register numbers of each instruction in the issue queue. If the numbers are equal, a wake-up signal is generated, and the circuit decides, according to the execution cycle of the instruction, whether to delay sending it: instructions with a long execution cycle are awakened in advance, and the wake-up of instructions with a short execution cycle is delayed.
  • The wake-up signal sets the wake-up status bit to 1 in the instruction age of the corresponding instruction, which ensures that the age of an awakened instruction is greater than that of a non-awakened instruction.
  • Part of the steps in the embodiments of the present invention may be implemented by software, and the corresponding software program may be stored in a readable storage medium, such as an optical disc or a hard disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Advance Control (AREA)

Abstract

An instruction withering-based multi-instruction out-of-order transmission method and a processor, belonging to the field of processor design. The lengthy arbitration structure of the conventional transmission architecture is abandoned, and an instruction withering circuit is added. An instruction age array is used to characterize the time for which an instruction has been stored in the CPU, and a wake-up state bit is added to each age. An instruction whose age has exceeded a withering threshold is stored in a deposition pool for direct transmission by the CPU. Furthermore, circuit structures such as the instruction request circuit, the instruction allocation circuit and the wake-up circuit are improved, which effectively improves the timing of a critical path in a multi-instruction transmission processor. When instructions are woken up, instructions with a short execution cycle are woken up after a delay and instructions with a long execution cycle are woken up in advance, so that instructions can be executed back to back. This satisfies the requirements of a high performance-to-power ratio, low delay and high IPC in a modern superscalar out-of-order processor, and solves the prior-art problem that delay keeps increasing as the number of entries in a processor's transmission queue increases.

Description

Multi-instruction out-of-order issue method and processor based on instruction withering
Technical field
The invention relates to a multi-instruction out-of-order issue method and a processor based on instruction withering, and belongs to the field of processor design.
Background
In the more than ten years since the end of Dennard scaling, single-core CPU performance has improved particularly slowly. Against this background, it is necessary to revisit the core microarchitecture in order to obtain high single-core performance.
Among the many structures of a CPU, the instruction issue architecture is one of the most important for achieving high performance. It schedules execution by selecting and issuing, in every cycle, instructions from those waiting in the instruction issue queue. To achieve high performance, the instruction issue architecture must deliver high IPC (instructions per clock, the number of instructions executed per cycle) at low latency. Low latency is an important design consideration because the instruction issue architecture lies on a timing-critical path of the processor, so its delay has a significant impact on the CPU clock frequency.
The traditional multi-instruction out-of-order issue architecture uses an arbitration circuit to select the instructions that can be issued. Its advantage is that the oldest instruction can be selected accurately, which preserves the efficiency of the processor pipeline; however, the delay of the arbitration circuit grows with the number of issue queue entries.
In modern processors, the issue queue is often designed with many entries in pursuit of high IPC. The arbitration delay then becomes significant, making the instruction issue circuit a critical path in the processor and the bottleneck of its clock frequency.
Given these requirements and challenges, a multi-instruction out-of-order issue architecture based on instruction withering that meets conditions such as low latency and high IPC is urgently needed.
In the multi-instruction out-of-order issue architecture designed by the present invention, instruction ages can still be distinguished effectively and the impact on pipeline efficiency is kept as small as possible, while the delay of the timing path does not increase with the number of issue queue entries. This keeps the delay as small as possible in processors with a large number of entries and supports an increase in the processor clock frequency.
Summary of the invention
To solve the problem that, in the current approach of selecting issuable instructions through an arbitration circuit, the delay of the arbitration circuit grows with the number of issue queue entries, the present invention provides a multi-instruction out-of-order issue method and a processor based on instruction withering.
A multi-instruction out-of-order issue method, in which an instruction withering circuit is added to the instruction out-of-order issue architecture of the processor to store newly allocated instructions in the issue queue and to perform the withering operation on the instructions in the issue queue; the method includes:
setting the highest bit of the instruction age corresponding to each instruction in the instruction withering circuit as the wake-up state bit of the instruction, the remaining bits of the instruction age representing the intrinsic age of the instruction; the wake-up state bit indicates whether the corresponding instruction has been awakened, so that the age of an awakened instruction in the issue queue is greater than the age of an instruction that has not been awakened;
setting a withering threshold; when the instruction age of an instruction exceeds the withering threshold, the instruction age array triggers a withering signal that causes the instruction to wither; a withered instruction can be randomly selected for issue without arbitration, realizing out-of-order issue of multiple instructions;
the issue order of each instruction in the issue queue is determined by its instruction age and wake-up state.
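The effect of placing the wake-up state bit above the intrinsic-age bits can be illustrated with a minimal Python sketch. The field width `AGE_BITS` and the helper names `pack_age` and `has_withered` are hypothetical and chosen only for illustration; the text does not fix them here.

```python
AGE_BITS = 4  # hypothetical width of the intrinsic-age field (assumption)


def pack_age(awake: bool, intrinsic_age: int) -> int:
    """Build an instruction age whose highest bit is the wake-up state bit."""
    if not 0 <= intrinsic_age < (1 << AGE_BITS):
        raise ValueError("intrinsic age does not fit in AGE_BITS")
    return (int(awake) << AGE_BITS) | intrinsic_age


def has_withered(packed_age: int, threshold: int) -> bool:
    """An instruction withers once its age exceeds the withering threshold."""
    return packed_age > threshold


# Even the youngest awakened instruction compares greater than the
# oldest instruction that has not been awakened.
youngest_awake = pack_age(True, 0)
oldest_asleep = pack_age(False, (1 << AGE_BITS) - 1)
print(youngest_awake > oldest_asleep)
```

Because the wake-up bit is the most significant bit, a single unsigned comparison orders instructions first by wake-up state and then by intrinsic age.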
Optionally, when waking up instructions, the method delays the wake-up of instructions with a short execution cycle and advances the wake-up of instructions with a long execution cycle, so that instructions can execute back to back.
Optionally, when waking up instructions, after the earlier of two instructions that have a sequential order has been issued, the processor waits until the earlier instruction has finished executing before waking up the later instruction.
Optionally, the instruction out-of-order issue architecture further includes an instruction allocation circuit, an adder-like instruction request circuit, and a dynamic delayed wake-up circuit;
the instruction allocation circuit is used to allocate the multiple instructions sent from the physical registers to idle entries in the issue queue;
the adder-like instruction request circuit is used to count the total number of idle signals of the issue queue entries and to represent that number in a special encoding; if the encoded total of idle signals is less than the instruction issue width expressed in the same encoding, an instruction request signal is sent to the physical register file;
the dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of an instruction waiting to be issued equals the destination register number of an issued instruction; at the same time, the wake-up circuit identifies the execution cycle of the waiting instruction through an instruction execution discrimination circuit and adjusts the order of the wake-up signals accordingly, so that instructions can execute back to back.
Optionally, the instruction withering circuit includes an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit;
the instruction age array represents the instruction age of each instruction in the issue queue and whether the instruction has been awakened;
the issue queue stores the instructions sent from the physical registers; it is designed as a non-compressed structure, that is, when the instruction whose physical register numbers are held in an entry is issued and the entry becomes idle, the other entries are not shifted; besides temporarily storing the physical register numbers of the current instruction, each entry also records the wake-up state of that instruction and whether the entry is idle;
the withering threshold adjuster dynamically adjusts and outputs the withering threshold according to the number of idle entries in the sedimentation pool and the age values of the instructions still remaining in the issue queue;
the sedimentation pool holds the withered instructions that satisfy the withering condition;
the global age feature extraction circuit computes the global age features.
Optionally, the input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold x, namely:
Figure PCTCN2020098961-appb-000001
where σ is the variance of the instruction age, μ is the expectation of the instruction age, and α is the adjustment coefficient, which satisfies
Figure PCTCN2020098961-appb-000002
Optionally, the adder-like instruction request circuit includes an adder-like layer followed by log2(n/2) shift logic layers, where n represents the number of entries in the issue queue.
Optionally, when counting the total number of entry idle signals, the adder-like instruction request circuit feeds the idle signal sequence of the entries into the adder-like layer, which operates on the idle signals, applies the special encoding, and outputs the specially encoded number of idle signals; the output of the adder-like layer is fed into the subsequent log2(n/2) shift logic layers, which output the final count; the count is compared with the instruction issue width, expressed in the same special encoding, to determine whether an instruction request signal needs to be sent.
Optionally, the adder-like layer is composed of adder-like units; feeding the idle signal sequence of the entries into the adder-like layer, operating on the idle signals, applying the special encoding, and outputting the specially encoded number of idle signals includes:
when counting the total number of entry idle signals, the idle signal sequence of the entries is fed into the adder-like layer; each adder-like unit takes two binary digits of the idle signal sequence as inputs, performs an AND operation and an XOR operation on them, and then compares the two results:
if they are equal and the AND result is 1, the unit outputs the code representing 1, "01", meaning that the sum of the two binary inputs of the adder-like unit is 1, encoded as "01";
if they are equal and the AND result is 0, the unit outputs the code representing 0, "10", meaning that the sum of the two binary inputs of the adder-like unit is 0, encoded as "10";
if they are not equal, the unit outputs the code representing 2, "00", meaning that the sum of the two binary inputs of the adder-like unit is 2, encoded as "00";
the number of encoding bits is n.
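The sum-to-code table stated above (0 as "10", 1 as "01", 2 as "00") can be sketched in Python as follows. Note that the way the AND and XOR results are combined here follows the usual half-adder reading (AND marks a sum of 2, XOR marks a sum of 1); that reading is an assumption made for this illustration and is not the literal case analysis in the text. The function name `adder_like_unit` is hypothetical.

```python
def adder_like_unit(idle_a: int, idle_b: int) -> str:
    """Encode the number of idle entries among two idle signals.

    Output codes follow the table in the text:
    sum 0 -> "10", sum 1 -> "01", sum 2 -> "00".
    """
    and_bit = idle_a & idle_b  # 1 only when both entries are idle
    xor_bit = idle_a ^ idle_b  # 1 when exactly one entry is idle
    if and_bit:
        return "00"  # both idle: sum is 2
    if xor_bit:
        return "01"  # exactly one idle: sum is 1
    return "10"      # neither idle: sum is 0


for a in (0, 1):
    for b in (0, 1):
        print(a, b, adder_like_unit(a, b))
```

One property of this table is that the code for a sum k equals the bit pattern "10" shifted right by k positions, which is what lets later layers accumulate counts with shifters instead of adders.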
Optionally, the subsequent log2(n/2) shift logic layers are composed of right shifters; feeding the output of the adder-like layer into the subsequent log2(n/2) shift logic layers and comparing the result with the instruction issue width, expressed in the same special encoding, to determine whether an instruction request signal needs to be sent includes:
each right shifter takes the output of one adder-like unit as the data to be shifted and the output of another adder-like unit as the shift amount, and shifts the data right by n bits, where n is the decimal number corresponding to the shift amount.
Optionally, the subsequent log2(n/2) shift logic layers form a tree structure in which each layer is connected to the next.
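A minimal Python sketch of how such a shifter tree could count idle entries is given below. Several details are assumptions made for illustration only: the code width is taken as `issue_width + 2` bits so that counts above the issue width remain distinguishable, the shift amount is obtained by decoding the second operand back into a count, the first layer computes the pair sum directly instead of modelling the AND/XOR unit, and the entry count is a power of two. All function names are hypothetical.

```python
from typing import List


def encode(count: int, width: int) -> int:
    """Special code: a single 1 bit, shifted right once per counted idle entry."""
    return (1 << (width - 1)) >> count


def decode(code: int, width: int) -> int:
    """Count represented by a code; an all-zero code means 'width or more'."""
    if code == 0:
        return width
    return (width - 1) - (code.bit_length() - 1)


def combine(data_code: int, shift_code: int, width: int) -> int:
    """Right shifter: shift one code by the count the other code represents."""
    return data_code >> decode(shift_code, width)


def count_idle(idle_bits: List[int], width: int) -> int:
    """Adder-like first layer followed by log2(n/2) shifter layers."""
    codes = [encode(a + b, width) for a, b in zip(idle_bits[0::2], idle_bits[1::2])]
    while len(codes) > 1:
        codes = [combine(x, y, width) for x, y in zip(codes[0::2], codes[1::2])]
    return codes[0]


def should_request(idle_bits: List[int], issue_width: int) -> bool:
    """Request when the encoded idle total is less than the encoded issue width."""
    width = issue_width + 2  # assumed code width
    return count_idle(idle_bits, width) < encode(issue_width, width)


idle = [1, 1, 0, 1, 0, 0, 1, 0]  # four idle entries out of eight
print(should_request(idle, 2), should_request(idle, 4))
```

Under this encoding a larger idle count yields a smaller code, so "encoded total less than encoded issue width" corresponds to more idle entries than the issue width.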
Optionally, the dynamic delayed wake-up circuit is composed of a comparator, an instruction execution discrimination circuit, and a register; the inputs of the wake-up circuit are the source register number of an instruction waiting to be issued and the destination register number of an issued instruction; the comparator checks whether the two numbers are equal and, if so, a wake-up signal is sent; at the same time, the wake-up circuit identifies the execution cycle of the waiting instruction through the instruction execution discrimination circuit, which outputs the instruction's cycle count, and the register holds the pending wake-up signal according to that cycle count, thereby adjusting the order of the wake-up signals.
Optionally, the instruction execution discrimination circuit is implemented with a read-only RAM in which the number of execution cycles of each kind of instruction is written in advance; the class code of the input instruction is used as the address to read the pre-stored cycle count from the RAM, giving the operation cycle of the corresponding instruction.
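The comparator, lookup table, and holding register described above can be sketched roughly in Python. The table contents, the class codes, and the choice to hold the wake-up signal for exactly the looked-up number of cycles are all assumptions made for illustration; the text does not give these values or the exact mapping from cycle count to delay.

```python
from typing import Optional

# Hypothetical read-only table: instruction class code -> execution cycles.
LATENCY_ROM = {
    0b00: 1,  # e.g. a single-cycle class (assumed)
    0b01: 3,  # e.g. a three-cycle class (assumed)
    0b10: 4,  # e.g. a four-cycle class (assumed)
}


def schedule_wakeup(now: int, src_reg: int, issued_dest_reg: int,
                    class_code: int) -> Optional[int]:
    """Return the cycle at which the wake-up signal is delivered, or None.

    The comparator fires only when the register numbers match; the signal
    is then held for the cycle count read from the table.
    """
    if src_reg != issued_dest_reg:
        return None  # no match, no wake-up signal
    return now + LATENCY_ROM[class_code]


print(schedule_wakeup(now=10, src_reg=7, issued_dest_reg=7, class_code=0b01))
print(schedule_wakeup(now=10, src_reg=7, issued_dest_reg=5, class_code=0b01))
```

Using the class code as a table address keeps the cycle lookup to a single read, which matches the read-only RAM arrangement the text describes.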
The present application also provides a processor whose instruction out-of-order issue architecture includes an instruction allocation circuit, an instruction withering circuit, an adder-like instruction request circuit, and a dynamic delayed wake-up circuit;
the instruction allocation circuit is used to allocate the multiple instructions sent from the physical registers to idle entries in the issue queue;
the instruction withering circuit is used to store newly allocated instructions in the issue queue and to perform the withering operation on the instructions in the issue queue according to the instruction age of each instruction; a withered instruction can be randomly selected for issue without arbitration;
the adder-like instruction request circuit is used to count the total number of idle signals of the issue queue entries and to represent that number in a special encoding; if the encoded total of idle signals is less than the instruction issue width expressed in the same encoding, an instruction request signal is sent to the physical register file;
the dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of an instruction waiting to be issued equals the destination register number of an issued instruction; at the same time, the wake-up circuit identifies the execution cycle of the waiting instruction through an instruction execution discrimination circuit and adjusts the order of the wake-up signals accordingly, so that instructions can execute back to back.
Optionally, the highest bit of the instruction age of each instruction is set as the wake-up state bit of the instruction, and the remaining bits of the instruction age represent the intrinsic age of the instruction; the wake-up state bit indicates whether the corresponding instruction has been awakened, so that the age of an awakened instruction in the issue queue is greater than the age of an instruction that has not been awakened.
The beneficial effects of the present invention are:
The present application abandons the lengthy arbitration structure of the traditional issue architecture and adds an instruction withering circuit. An instruction age array characterizes the time each instruction has been stored in the CPU, with one additional wake-up state bit, and instructions that have exceeded the withering threshold are placed in the sedimentation pool so that the CPU can issue them directly. The instruction request circuit, instruction allocation circuit, wake-up circuit and other circuit structures are also improved, which effectively improves the timing of multi-instruction issue, a critical path in the processor. Specifically, when counting the total number of entry idle signals, the improved instruction request circuit uses adder-like units that perform AND and XOR operations on two input signals, replacing the logical addition used by traditional instruction request circuits when counting free entries, which shortens the time needed to count the idle signals. When waking up instructions, instructions with a short execution cycle are woken up after a delay and instructions with a long execution cycle are woken up in advance, so that instructions can execute back to back. This meets the requirements of a high performance-to-power ratio, low latency and high IPC in modern superscalar out-of-order processors, and solves the prior-art problem that delay keeps increasing as the number of issue queue entries increases.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative work.
Figure 1 is a schematic diagram of the overall composition of the multi-instruction out-of-order issue architecture based on instruction withering according to the present invention.
Figure 2 is a schematic diagram of the composition of the instruction withering circuit of the present invention.
Figure 3 is a schematic diagram of the composition of the instruction allocation circuit of the present invention.
Figure 4 is a schematic diagram of the composition of the adder-like instruction request circuit of the present invention.
Figure 5 is a schematic diagram of the composition of the dynamic delayed wake-up circuit of the present invention.
Figure 6 is a schematic diagram of the pipeline after the wake-up circuit has adjusted the wake-up order.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings.
Embodiment 1:
This embodiment provides a processor. Figure 1 is a schematic diagram of the overall composition of the multi-instruction out-of-order issue architecture of the processor, which includes an instruction allocation circuit, an instruction withering circuit, an adder-like instruction request circuit, and a dynamic delayed wake-up circuit.
The instruction allocation circuit allocates register-renamed instructions to the entries of the instruction issue queue. The instruction issue queue contains multiple entries, each holding one instruction waiting to be issued; when an idle entry appears in the instruction issue queue, it accepts an instruction allocated by the allocation circuit.
All instructions that have just entered an entry are in the non-awakened state. If the source register number of an instruction equals the destination register number of an issued instruction, the instruction is awakened by the wake-up circuit. The instruction withering circuit withers the instructions in the entries, and every instruction that has withered is eventually issued, realizing out-of-order issue of multiple instructions in a superscalar out-of-order issue processor.
The composition of the instruction withering circuit is shown in Figure 2. The instruction withering circuit includes an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit.
A newly allocated instruction leaving the instruction allocation circuit enters the instruction withering circuit and is stored in an idle issue queue entry. At the same time, the corresponding instruction age in the instruction age array is initialized to a random value between 0 and 1.
Whenever an instruction is issued, an age increment signal is sent to the age array, and the age of every instruction in the issue queue that has not been issued is incremented by 1.
The withering threshold adjuster adjusts and outputs the withering threshold according to the idle-entry information of the sedimentation pool and the global age threshold. If an instruction age in the instruction age array is greater than the withering threshold, the instruction age array outputs a withering signal; the instruction that receives the withering signal performs the withering operation and moves from the issue queue into the sedimentation pool, and the corresponding issue queue entry is set to idle, waiting for a newly allocated instruction.
A withered instruction in the sedimentation pool can be issued without arbitration.
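The aging and withering steps above can be traced with a small Python model. It is a sketch under stated assumptions: ages are plain integers, the threshold is passed in as a fixed number rather than produced by the threshold adjuster, and the pool capacity check is an added assumption. The names `Entry`, `on_issue` and `wither` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    instr: Optional[str] = None  # None marks an idle entry
    age: int = 0

    @property
    def idle(self) -> bool:
        return self.instr is None


def on_issue(queue: List[Entry]) -> None:
    """Age increment signal: every occupied entry grows one step older."""
    for entry in queue:
        if not entry.idle:
            entry.age += 1


def wither(queue: List[Entry], pool: List[str], threshold: int,
           pool_capacity: int) -> None:
    """Move instructions older than the threshold into the sedimentation pool.

    Entries are freed in place; no other entry is shifted (non-compressed queue).
    """
    for entry in queue:
        if not entry.idle and entry.age > threshold and len(pool) < pool_capacity:
            pool.append(entry.instr)
            entry.instr = None
            entry.age = 0


queue = [Entry("add", 3), Entry(), Entry("mul", 1), Entry("ld", 4)]
pool: List[str] = []
on_issue(queue)
wither(queue, pool, threshold=3, pool_capacity=2)
print(pool, [e.idle for e in queue])
```

After the call, the withered instructions sit in `pool`, from which any of them may be picked for issue, and their former entries are idle and ready for newly allocated instructions.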
The inputs of the withering threshold adjuster are the idle-entry information of the sedimentation pool and the global age feature; the global age feature value is output by the global age feature extraction circuit. The adjuster adjusts and outputs the withering threshold according to the number of idle entries in the sedimentation pool and the current age values of all instructions.
The input of the withering threshold adjuster is the age of each instruction in the instruction age array, and the output is the withering threshold x, namely:
Figure PCTCN2020098961-appb-000003
where α satisfies
Figure PCTCN2020098961-appb-000004
σ is the variance of the instruction age, and μ is the expectation of the instruction age. The characteristic values are derived as follows:
A modern processor can process hundreds of millions of instructions per second, and the initial age is a random value between 0 and 1. Under this large-sample condition, the age in the processor can be regarded as continuous and, by the law of large numbers, as following a normal distribution:
Figure PCTCN2020098961-appb-000005
where σ is the variance of the instruction age and μ is the expectation of the instruction age.
Construct the function g(x):
Figure PCTCN2020098961-appb-000006
Transforming equation (2):
Figure PCTCN2020098961-appb-000007
Taking the first derivative of (3):
Figure PCTCN2020098961-appb-000008
Taking the second derivative of (3):
Figure PCTCN2020098961-appb-000009
Setting equation (4) to 0 gives
Figure PCTCN2020098961-appb-000010
Substituting equation (6) into (5) so that
Figure PCTCN2020098961-appb-000011
gives
Figure PCTCN2020098961-appb-000012
For the ages withered under threshold x to be as large as possible and the impact on pipeline efficiency to be as small as possible, the following should hold:
Figure PCTCN2020098961-appb-000013
Taking the lowest threshold, the constraint on the adjustment coefficient α is
Figure PCTCN2020098961-appb-000014
In summary,
Figure PCTCN2020098961-appb-000015
and α satisfies
Figure PCTCN2020098961-appb-000016
所述指令年龄阵列本质为计数器阵列,每个计数器总共
Figure PCTCN2020098961-appb-000017
位,代表相应指令的指令年龄,其中低
Figure PCTCN2020098961-appb-000018
位为年龄记数位,最高位1位为唤醒状态位。
The instruction age array is essentially a counter array, each counter in total
Figure PCTCN2020098961-appb-000017
Bit, representing the instruction age of the corresponding instruction, where low
Figure PCTCN2020098961-appb-000018
The bit is the age counting bit, and the highest bit is the wake-up status bit.
Whenever a newly allocated instruction enters the issue queue of the instruction withering circuit, the corresponding instruction age is set to zero;
Whenever an instruction is issued, the instruction age of every instruction that was not issued is incremented by 1;
Whenever an instruction in the issue queue is woken up, the wake-up status bit of its instruction age is set to 1. If the instruction age of an instruction exceeds the withering threshold, a withering signal is output to the issue queue. Here n denotes the number of entries in the issue queue and s denotes the instruction issue width.
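The three counter rules above can be illustrated with a small software model. This is only a sketch under stated assumptions: the field width AGE_BITS, the class name AgeArray, the saturating increment, and the choice to compare the whole counter (wake-up bit included) against the threshold are all hypothetical and are not given in the source, which specifies the bit widths only as figures.

```python
AGE_BITS = 4                # assumed width of the age-count field (hypothetical)
WAKE_BIT = 1 << AGE_BITS    # most significant bit: wake-up status bit
AGE_MASK = WAKE_BIT - 1     # low bits: age count


class AgeArray:
    """Software model of the counter array, one counter per issue-queue entry."""

    def __init__(self, n_entries):
        self.counters = [0] * n_entries

    def allocate(self, idx):
        # A newly allocated instruction starts with age zero and is not woken.
        self.counters[idx] = 0

    def on_issue(self, issued, occupied):
        # Every occupied entry that was not issued ages by one.
        for idx in occupied:
            if idx not in issued:
                age = min((self.counters[idx] & AGE_MASK) + 1, AGE_MASK)  # saturation is an assumption
                self.counters[idx] = (self.counters[idx] & WAKE_BIT) | age

    def wake(self, idx):
        # Setting the top bit makes a woken instruction compare older than any non-woken one.
        self.counters[idx] |= WAKE_BIT

    def wither_signals(self, threshold, occupied):
        # Entries whose counter exceeds the threshold raise a withering signal.
        return [idx for idx in occupied if self.counters[idx] > threshold]
```

Because the wake-up status bit sits above the age count, a plain unsigned comparison of two counters already ranks every woken instruction above every non-woken one, which is the ordering property the text relies on.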
The issue queue contains n entries; each entry holds an instruction waiting to be issued together with an entry idle bit.
The sedimentation pool is an instruction queue with far fewer entries than the issue queue. It holds the withered instructions that have met the withering condition, and withered instructions in the sedimentation pool can be issued directly without arbitration.
Figure 3 shows the structure of the instruction allocation circuit. The instruction allocation circuit assigns the multiple instructions sent from the physical registers to free entries in the issue queue.
The instruction allocation circuit contains s entry-number selection circuits. The inputs of each entry-number selection circuit are the idle-signal sequence of n/s entries of the issue queue in the instruction withering circuit and the corresponding issue-queue entry numbers. Each selection circuit selects an issue-queue entry number according to whether the input idle signals are valid: if several idle signals are valid, it selects the entry number of the first valid idle signal; if no idle signal is valid, it outputs the maximum value representable in its data width, indicating that no entry has been selected. The entry number output by the entry allocation circuit is compared with this upper-limit value; the valid signal is set to 1 if they are equal and to 0 otherwise. Each instruction awaiting allocation at the inputs of the allocation circuit is written into the corresponding entry according to the entry number and the valid signal. Here s denotes the instruction issue width and n denotes the number of entries in the issue queue.
The entry-number selection circuit is built from a selector array, as shown in Figure 2. The first column of selectors takes the entry numbers as input; because the first free entry must be selected, each selector chooses an entry number according to the idle signal of the smaller entry number. The entry-number inputs of the second layer are the entry numbers selected by the first selection layer, with the idle signal of the smaller entry number as the select signal, and so on, for a total of log2(n) selection layers. The result of the log2(n) selection layers is passed to the all-empty entry selector, whose select signal is the select signal of the log2(n)-th selection layer and whose data inputs are the result of the log2(n) selection layers and the upper-limit value. If the select signal is 0, the upper-limit value is output as the final entry number; otherwise, the result of the log2(n) selection layers is output as the final entry number. Here n denotes the number of entries.
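The selector tree described above can be modelled in a few lines. This is a minimal sketch, not the circuit of Figure 2: the function name select_first_free, the power-of-two entry count, and the choice of an all-ones sentinel one bit wider than the entry numbers are assumptions made for illustration.

```python
def select_first_free(free):
    """Return the lowest-numbered free entry, or an all-ones sentinel if none is free.

    `free` is a list of idle signals (truthy = entry is free).
    """
    n = len(free)
    assert n > 0 and n & (n - 1) == 0, "sketch assumes a power-of-two entry count"
    sentinel = (1 << n.bit_length()) - 1  # assumed upper-limit value: all ones, one bit wider than an index

    # Each node carries (idle flag, entry number); leaves are the entries themselves.
    level = [(bool(f), i) for i, f in enumerate(free)]
    while len(level) > 1:  # log2(n) selection layers
        nxt = []
        for k in range(0, len(level), 2):
            lo, hi = level[k], level[k + 1]
            # The idle signal of the smaller entry number is the select signal.
            nxt.append(lo if lo[0] else hi)
        level = nxt

    any_free, idx = level[0]
    # Final selector: fall back to the upper-limit value when nothing is free.
    return idx if any_free else sentinel
```

For example, with idle signals [0, 0, 1, 0, 1, 0, 0, 0] the sketch returns entry 2, and with eight zeros it returns the sentinel 15.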
Figure 4 shows the structure of the instruction request circuit. The instruction request circuit counts the total number of entry idle signals and encodes this count with a special code. If the encoded idle-signal count is less than the instruction issue width encoded in the same way, it sends an instruction request signal to the physical register file. The instruction request circuit consists of two parts: an adder-like layer and the subsequent log2(n/2) shift logic layers.
The adder-like layer is made up of adder-like computation units. To count the idle signals, the idle-signal sequence of the entries is fed into the adder-like layer, which computes the number of idle signals and applies the special encoding, outputting the encoded idle-signal count. The output of the adder-like layer is fed into the subsequent log2(n/2) shift logic layers, which produce the final count. This count is compared with the instruction issue width, encoded in the same way, to determine whether an instruction request signal needs to be sent.
Specifically, when counting the idle signals, the idle-signal sequence of the entries is fed into the adder-like layer. Each adder-like unit takes two binary digits of the idle-signal sequence as input, performs an AND operation and an XOR operation on them, and then compares the two results:
If they are equal and the AND result is 1, the unit outputs the code for 1, "01", indicating that the sum of its two binary inputs is 1;
If they are equal and the AND result is 0, the unit outputs the code for 0, "10", indicating that the sum of its two binary inputs is 0;
If they are not equal, the unit outputs the code for 2, "00", indicating that the sum of its two binary inputs is 2;
The code width is n bits.
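The sum-to-code mapping stated above (0 maps to "10", 1 to "01", 2 to "00") can be sketched as follows. Note the assumption: this sketch derives the sum of the two inputs from the AND result (as a carry) and the XOR result (as a sum bit), in the manner of a half adder, and then applies the stated mapping. It does not reproduce the literal equal/not-equal comparison of the text, and the function name adder_like_unit is hypothetical.

```python
def adder_like_unit(a, b):
    """Encode the number of set idle signals among two input bits.

    Assumption: the sum is taken as 2 * (a AND b) + (a XOR b); the codes
    themselves ("10", "01", "00") are the ones listed in the text.
    """
    carry = a & b      # AND operation
    sum_bit = a ^ b    # XOR operation
    total = 2 * carry + sum_bit
    return {0: "10", 1: "01", 2: "00"}[total]
```

Read as a bit pattern, each code is "10" shifted right by the count it represents, which is what allows the later layers to combine partial counts with right shifters.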
The subsequent log2(n/2) shift logic layers are built from right shifters. Feeding the output of the adder-like layer into the subsequent log2(n/2) shift logic layers and comparing the result with the instruction issue width, encoded in the same way, to determine whether an instruction request signal needs to be sent, includes:
Each right shifter takes the output of one adder-like unit as the data to be shifted and the output of another adder-like unit as the shift amount, and shifts the data right by n bits, where n is the decimal number corresponding to the shift amount.
For example, if the data to be shifted is "01" and the shift amount is "00", then under the encoding rule above "01" is shifted right by 2 bits.
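The shift-based combination can be checked with a small model. This is a sketch under assumptions: a count k is taken to be encoded as a single 1 bit shifted right k places from the top of a fixed-width word (consistent with the codes "10", "01", "00" for 0, 1, 2), and the names WIDTH, encode, shift_amount and combine are hypothetical.

```python
WIDTH = 4  # assumed code width for illustration


def encode(count, width=WIDTH):
    """Code for `count`: a single 1 shifted right `count` places from the top bit."""
    return (1 << (width - 1)) >> count


def shift_amount(code, width=WIDTH):
    """Decimal count represented by a code; an all-zero code means `width` or more."""
    return width - code.bit_length()


def combine(data_code, shift_code, width=WIDTH):
    """Add two encoded counts by right-shifting one by the value of the other."""
    return data_code >> shift_amount(shift_code, width)
```

With a 2-bit width, combine(0b01, 0b00, 2) shifts "01" right by 2 and yields "00", matching the example in the text. Under this encoding a larger count gives a numerically smaller code, so an encoded idle count that is less than the encoded issue width would correspond to more free entries than the issue width; this reading is an inference from the encoding, not a statement made by the source.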
Figure 5 shows the structure of the wake-up circuit. The wake-up circuit consists of a comparator, an instruction execution discrimination circuit, and a register.
The inputs of the wake-up circuit are the source register numbers of the instructions waiting to be issued and the destination register numbers of the issued instructions. The comparator checks whether the source register number of a waiting instruction equals the destination register number of an issued instruction; if they are equal, a wake-up signal is sent. At the same time, the wake-up circuit uses the instruction execution discrimination circuit to identify the execution cycles of the instruction to be issued and outputs its cycle count. The register holds the wake-up signal that is about to be sent according to this cycle count, thereby adjusting the order of the wake-up signals: wake-up is delayed for instructions with short execution cycles and advanced for instructions with long execution cycles. This ensures that the instructions in the pipeline can execute back to back and improves pipeline efficiency.
Figure 6 shows the pipeline after wake-up adjustment. Instruction A requires three execution cycles, and instructions B, C and D each require one. With the wake-up order adjusted by the wake-up circuit, instruction D is woken two cycles later than instruction A, so that two back-to-back instructions B and C can be inserted between A and D. All four instructions therefore execute back to back with no delay bubbles, which improves the execution efficiency of the pipeline.
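The timing in the Figure 6 example can be sketched as a comparator followed by a latency lookup and a delay. This is only an illustration under assumptions: the table LATENCY_ROM, the function wake_cycle, and the rule that the wake-up is held for the producer's execution cycles minus one (measured from the producer's issue cycle) are hypothetical choices that reproduce the two-cycle delay of the example; the source does not state the exact delay rule.

```python
# Hypothetical lookup table: instruction class -> execution cycles.
LATENCY_ROM = {"A": 3, "B": 1, "C": 1, "D": 1}


def wake_cycle(producer_issue_cycle, producer_class, src_reg, dst_reg):
    """Cycle at which a dependent instruction is woken, or None if there is no match.

    src_reg: source register number of the waiting instruction.
    dst_reg: destination register number of the issued (producer) instruction.
    """
    if src_reg != dst_reg:  # comparator: no dependency, no wake-up signal
        return None
    delay = LATENCY_ROM[producer_class] - 1  # assumed delay rule
    return producer_issue_cycle + delay
```

With instruction A issued at cycle 0 and instruction D reading A's destination register, the sketch wakes D at cycle 2, leaving two slots in which independent single-cycle instructions such as B and C can issue.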
Embodiment 2
This embodiment provides a multi-instruction out-of-order issue method based on instruction withering, for use in the processor described in Embodiment 1. The issue architecture of this processor is a non-data-capture architecture: the CPU actually reads the physical register file only after the issue stage, and each entry in the issue queue stores physical register numbers. The method includes:
S1: When the physical register file receives the instruction request signal from the instruction request circuit, it outputs suitable instructions to the instruction allocation circuit.
S2: The instruction allocation circuit assigns the instructions output by the physical register file to entries in the issue queue:
The instruction allocation circuit contains s entry-number selection circuits. The inputs of each entry-number selection circuit are the idle-signal sequence of n/s entries of the issue queue in the instruction withering circuit and the corresponding issue-queue entry numbers. Each selection circuit selects an issue-queue entry number according to whether the input idle signals are valid: if several idle signals are valid, it selects the entry number of the first valid idle signal; if no idle signal is valid, it outputs the maximum value representable in its data width, indicating that no entry has been selected.
The entry number output by the entry allocation circuit is compared with the upper-limit value; the valid signal is set to 1 if they are equal and to 0 otherwise. Each instruction awaiting allocation at the inputs of the allocation circuit is written into the corresponding entry according to the entry number and the valid signal.
Here s denotes the instruction issue width and n denotes the number of entries in the issue queue.
A newly allocated instruction that has passed through the instruction allocation circuit enters the instruction withering circuit and is stored in a free issue-queue entry. At the same time, the corresponding instruction age in the instruction age array is initialized to a random value between 0 and 1.
S3: Each time the issue queue in the instruction withering circuit accepts a new instruction, the instruction age in the instruction age array corresponding to that instruction's entry is set to zero. Each time the instruction withering circuit issues an instruction, the instruction ages of the instructions still in the issue queue are incremented by one. The most significant bit of an instruction's age is the wake-up status bit of the instruction, and the remaining bits represent the intrinsic age of the instruction. After an instruction in the issue queue is woken up, the most significant bit of its age information is set to one, which ensures that the age of a woken instruction is greater than that of a non-woken instruction.
When an instruction age exceeds the withering threshold, the instruction age array triggers a withering signal and the instruction withers. The withered instruction moves from the issue queue into the sedimentation pool, and its entry in the issue queue is marked as free.
The sedimentation pool is an instruction queue with far fewer entries than the issue queue. It holds withered instructions, and withered instructions in the sedimentation pool can be randomly selected for issue.
The issue queue in the withering circuit is designed as a non-compacting structure: when the physical register numbers of the instruction in an entry have been issued and the entry becomes free, the other entries are not shifted. Besides temporarily storing the physical register numbers of the current instruction, each entry also records the wake-up state of the current instruction and whether the entry is free.
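The movement of withered instructions in step S3 can be sketched with a small software model. This is a sketch under assumptions, not the patented circuit: the class name WitherQueue, the use of None to mark a free entry, and the behaviour when the sedimentation pool is full (the instruction simply stays in the issue queue) are hypothetical.

```python
import random


class WitherQueue:
    """Non-compacting issue queue with a small sedimentation pool."""

    def __init__(self, n_entries, pool_size):
        self.entries = [None] * n_entries  # None marks a free entry; entries never shift
        self.ages = [0] * n_entries
        self.pool = []                     # withered instructions
        self.pool_size = pool_size

    def insert(self, idx, instr):
        # Store a newly allocated instruction in a free entry with age zero.
        self.entries[idx] = instr
        self.ages[idx] = 0

    def wither(self, threshold):
        # Move every instruction whose age exceeds the threshold into the pool.
        for idx, instr in enumerate(self.entries):
            too_old = instr is not None and self.ages[idx] > threshold
            if too_old and len(self.pool) < self.pool_size:  # full-pool handling is an assumption
                self.pool.append(instr)
                self.entries[idx] = None   # the entry becomes free in place

    def issue_from_pool(self, rng=random):
        # Withered instructions are picked at random, without arbitration.
        if not self.pool:
            return None
        return self.pool.pop(rng.randrange(len(self.pool)))
```

Because freed entries are cleared in place rather than compacted, entry numbers stay stable, which is what lets the per-entry age counters and idle signals be indexed directly by entry number.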
S4: The idle signals of the issue-queue entries are also passed to the instruction request circuit, which counts the free entries in the issue queue. If the number of free entries in the issue queue is greater than the instruction issue width, the request circuit sends an instruction request signal to the physical register file; the physical register file accepts the request signal and sends instructions to the instruction allocation circuit.
S5: During instruction issue, the wake-up circuit compares the destination register number of the instruction currently being issued with the source register numbers of the instructions in the issue queue. If the numbers are equal, it generates a wake-up signal and decides, according to the execution cycles of that instruction, whether the wake-up signal should be delayed: instructions with long execution cycles are woken early, and instructions with short execution cycles are issued later. The wake-up signal sets the wake-up status bit in the corresponding instruction age to 1, ensuring that the age of a woken instruction is greater than that of an instruction that has not been woken.
Some of the steps in the embodiments of the present invention may be implemented in software, and the corresponding software program may be stored in a readable storage medium such as an optical disc or a hard disk.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (15)

  1. A multi-instruction out-of-order issue method, characterized in that an instruction withering circuit is added to the out-of-order issue architecture of a processor, the circuit being used to store newly allocated instructions in an issue queue and to perform withering operations on the instructions in the issue queue; the method comprises:
    setting the most significant bit of the instruction age corresponding to each instruction in the instruction withering circuit as the wake-up status bit of the instruction, the remaining bits of the instruction age representing the intrinsic age of the instruction; the wake-up status bit indicates whether the corresponding instruction has been woken up, and the age of a woken instruction in the issue queue is greater than that of a non-woken instruction;
    setting a withering threshold; when the instruction age of an instruction exceeds the withering threshold, the instruction age array triggers a withering signal so that the instruction withers; withered instructions can be randomly selected for issue without arbitration, realizing out-of-order issue of multiple instructions;
    the issue order of the instructions in the issue queue is determined according to instruction age and wake-up state.
  2. The method according to claim 1, characterized in that, when waking instructions, the method delays the wake-up of instructions with short execution cycles and advances the wake-up of instructions with long execution cycles, to ensure that instructions can execute back to back.
  3. The method according to claim 2, characterized in that, when waking instructions, after the earlier of two instructions that have a sequential order has been issued, the processor waits until the earlier instruction has finished executing before waking the later instruction.
  4. The method according to claim 3, characterized in that the instruction out-of-order issue architecture further comprises an instruction allocation circuit, an adder-like instruction request circuit, and a dynamic delayed wake-up circuit;
    the instruction allocation circuit is used to assign the multiple instructions sent from the physical registers to free entries in the issue queue;
    the adder-like instruction request circuit is used to count the total number of entry idle signals in the issue queue and to encode this count with a special code; if the specially encoded idle-signal count is less than the instruction issue width encoded in the same way, it sends an instruction request signal to the physical register file;
    the dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of an instruction to be issued equals the destination register number of an issued instruction; at the same time, the wake-up circuit identifies the execution cycles of the instruction to be issued through the instruction execution discrimination circuit and adjusts the order of the wake-up signals according to those execution cycles, to ensure that instructions can execute back to back.
  5. The method according to claim 4, characterized in that the instruction withering circuit comprises an instruction age array, an issue queue, a withering threshold adjuster, a sedimentation pool, and a global age feature extraction circuit;
    the instruction age array is used to represent the instruction age of each instruction in the issue queue and whether it has been woken up;
    the issue queue is used to hold the instructions sent from the physical registers; the issue queue is designed as a non-compacting structure, that is, when the physical register numbers of the instruction in an entry have been issued and the entry becomes free, the other entries are not shifted; besides temporarily storing the physical register numbers of the current instruction, each entry also records the wake-up state of the current instruction and whether the entry is free;
    the withering threshold adjuster is used to dynamically adjust and output the withering threshold according to the number of free entries in the sedimentation pool and the age values of the instructions still remaining in the issue queue;
    the sedimentation pool is used to hold withered instructions that have met the withering condition;
    the global age feature extraction circuit is used to compute global age statistics.
  6. The method according to claim 5, characterized in that the input of the withering threshold adjuster is the age of each instruction in the instruction age array and the output is the withering threshold x, namely:
    Figure PCTCN2020098961-appb-100001
    where σ is the variance of the instruction age, μ is the expectation of the instruction age, and α is the adjustment coefficient, α satisfying
    Figure PCTCN2020098961-appb-100002
  7. The method according to claim 4, characterized in that the adder-like instruction request circuit comprises an adder-like layer and the subsequent log2(n/2) shift logic layers, where n denotes the number of entries in the issue queue.
  8. The method according to claim 7, characterized in that, when counting the total number of entry idle signals, the adder-like instruction request circuit feeds the idle-signal sequence of the entries into the adder-like layer, which computes the number of idle signals and applies the special encoding, outputting the specially encoded idle-signal count; the output of the adder-like layer is fed into the subsequent log2(n/2) shift logic layers, which output the final count, and this count is compared with the instruction issue width, encoded in the same way, to determine whether an instruction request signal needs to be sent.
  9. The method according to claim 8, characterized in that the adder-like layer is made up of adder-like computation units; feeding the idle-signal sequence of the entries into the adder-like layer, computing the number of idle signals, applying the special encoding, and outputting the specially encoded idle-signal count comprises:
    when counting the total number of entry idle signals, feeding the idle-signal sequence of the entries into the adder-like layer, where each adder-like unit takes two binary digits of the idle-signal sequence as input, performs an AND operation and an XOR operation on them, and then compares the two results:
    if they are equal and the AND result is 1, the unit outputs the code for 1, "01", indicating that the sum of its two binary inputs is 1;
    if they are equal and the AND result is 0, the unit outputs the code for 0, "10", indicating that the sum of its two binary inputs is 0;
    if they are not equal, the unit outputs the code for 2, "00", indicating that the sum of its two binary inputs is 2;
    the code width is n bits.
  10. The method according to claim 9, characterized in that the subsequent log2(n/2) shift logic layers are built from right shifters; feeding the output of the adder-like layer into the subsequent log2(n/2) shift logic layers and comparing the result with the instruction issue width, encoded in the same way, to determine whether an instruction request signal needs to be sent comprises:
    each right shifter takes the output of one adder-like unit as the data to be shifted and the output of another adder-like unit as the shift amount, and shifts the data right by n bits, where n is the decimal number corresponding to the shift amount.
  11. The method according to claim 10, characterized in that the subsequent log2(n/2) shift logic layers form a tree structure in which each layer is connected to the next.
  12. The method according to claim 4, characterized in that the dynamic delayed wake-up circuit consists of a comparator, an instruction execution discrimination circuit, and a register;
    the inputs of the wake-up circuit are the source register numbers of the instructions to be issued and the destination register numbers of the issued instructions; the comparator checks whether the source register number of an instruction to be issued equals the destination register number of an issued instruction, and if they are equal a wake-up signal is sent; at the same time, the wake-up circuit identifies the execution cycles of the instruction to be issued through the instruction execution discrimination circuit and outputs its cycle count, and the register holds the wake-up signal that is about to be sent according to this cycle count, thereby adjusting the order of the wake-up signals.
  13. The method according to claim 12, characterized in that the instruction execution discrimination circuit is implemented with a read-only RAM in which the number of execution cycles for each kind of instruction is written in advance; the class code of the input instruction is used as the address to read out the cycle count stored in the RAM, thereby obtaining the operation cycles of the corresponding instruction.
  14. A processor, characterized in that the instruction out-of-order issue architecture of the processor comprises an instruction allocation circuit, an instruction withering circuit, an adder-like instruction request circuit, and a dynamic delayed wake-up circuit;
    the instruction allocation circuit is used to assign the multiple instructions sent from the physical registers to free entries in the issue queue;
    the instruction withering circuit is used to store newly allocated instructions in the issue queue and to perform withering operations on the instructions in the issue queue according to the instruction age of each instruction; withered instructions can be randomly selected for issue without arbitration;
    the adder-like instruction request circuit is used to count the total number of entry idle signals in the issue queue and to encode this count with a special code; if the encoded idle-signal count is less than the instruction issue width encoded in the same way, it sends an instruction request signal to the physical register file;
    the dynamic delayed wake-up circuit is used to send a wake-up signal when the source register number of an instruction to be issued equals the destination register number of an issued instruction; at the same time, the wake-up circuit identifies the execution cycles of the instruction to be issued through the instruction execution discrimination circuit and adjusts the order of the wake-up signals according to those execution cycles, to ensure that instructions can execute back to back.
  15. The processor according to claim 13, characterized in that the most significant bit of the instruction age of each instruction is set as the wake-up status bit of the instruction, and the remaining bits of the instruction age represent the intrinsic age of the instruction; the wake-up status bit indicates whether the corresponding instruction has been woken up, and the age of a woken instruction in the issue queue is greater than that of a non-woken instruction.
PCT/CN2020/098961 2020-04-07 2020-06-29 Instruction withering-based multi-instruction out-of-order transmission method and processor WO2021203560A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010264562.2 2020-04-07
CN202010264562.2A CN111538534B (en) 2020-04-07 2020-04-07 Multi-instruction out-of-order transmitting method and processor based on instruction wither

Publications (1)

Publication Number Publication Date
WO2021203560A1 (en) 2021-10-14

Family

ID=71978534

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/098961 WO2021203560A1 (en) 2020-04-07 2020-06-29 Instruction withering-based multi-instruction out-of-order transmission method and processor

Country Status (2)

Country Link
CN (1) CN111538534B (en)
WO (1) WO2021203560A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099854B (en) * 2020-11-10 2021-04-23 北京微核芯科技有限公司 Method and device for scheduling out-of-order queue and judging queue cancellation item
EP4027236A4 (en) 2020-11-10 2023-07-26 Beijing Vcore Technology Co.,Ltd. Method and device for scheduling out-of-order queues and determining queue cancel items
CN113254079B (en) * 2021-06-28 2021-10-01 广东省新一代通信与网络创新研究院 Method and system for realizing self-increment instruction
CN117908968A (en) * 2022-10-11 2024-04-19 深圳市中兴微电子技术有限公司 Instruction sending method, device, equipment and medium based on compression type transmission queue
CN117742796A (en) * 2023-12-11 2024-03-22 上海合芯数字科技有限公司 Instruction awakening method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706714A (en) * 2009-11-23 2010-05-12 北京龙芯中科技术服务中心有限公司 System and method for issuing instruction, processor and design method thereof
CN101826000A (en) * 2010-01-29 2010-09-08 北京龙芯中科技术服务中心有限公司 Interrupt response determining method, device and microprocessor core for pipeline microprocessor
US20190171453A1 (en) * 2016-04-28 2019-06-06 Oracle International Corporation Method for managing software threads dependent on condition variables
CN109885857A (en) * 2018-12-26 2019-06-14 苏州中晟宏芯信息科技有限公司 Instruction issue control method, instruction execution verification method, system and storage medium
CN110297662A (en) * 2019-07-04 2019-10-01 深圳芯英科技有限公司 Instruct method, processor and the electronic equipment of Out-of-order execution

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7721071B2 (en) * 2006-02-28 2010-05-18 Mips Technologies, Inc. System and method for propagating operand availability prediction bits with instructions through a pipeline in an out-of-order processor
US7464253B2 (en) * 2006-10-02 2008-12-09 The Regents Of The University Of California Tracking multiple dependent instructions with instruction queue pointer mapping table linked to a multiple wakeup table by a pointer
CN104932945B (en) * 2015-06-18 2018-05-18 合肥工业大学 A kind of out of order multi-emitting scheduler of task level and its dispatching method

Also Published As

Publication number Publication date
CN111538534A (en) 2020-08-14
CN111538534B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
WO2021203560A1 (en) Instruction withering-based multi-instruction out-of-order transmission method and processor
Lee et al. Warped-compression: Enabling power efficient GPUs through register compression
US8250395B2 (en) Dynamic voltage and frequency scaling (DVFS) control for simultaneous multi-threading (SMT) processors
US6981129B1 (en) Breaking replay dependency loops in a processor using a rescheduled replay queue
US8904153B2 (en) Vector loads with multiple vector elements from a same cache line in a scattered load operation
CN101156132B (en) Method and device for unaligned memory access prediction
US7464253B2 (en) Tracking multiple dependent instructions with instruction queue pointer mapping table linked to a multiple wakeup table by a pointer
US20120060016A1 (en) Vector Loads from Scattered Memory Locations
US7603543B2 (en) Method, apparatus and program product for enhancing performance of an in-order processor with long stalls
US11106494B2 (en) Memory system architecture for multi-threaded processors
US6928533B1 (en) Data processing system and method for implementing an efficient out-of-order issue mechanism
JPH10124391A (en) Processor and method for executing store convergence by merged store operation
US11204770B2 (en) Microprocessor having self-resetting register scoreboard
CN113778522B (en) Instruction transmitting processing method in transmitting unit
Sassone et al. Matrix scheduler reloaded
CN113467830A (en) Processor having delay shifter and control method using the same
CN111552366B (en) Dynamic delay wake-up circuit and out-of-order instruction transmitting architecture
US6988185B2 (en) Select-free dynamic instruction scheduling
US20080244224A1 (en) Scheduling a direct dependent instruction
CN111538533B (en) Class adder-based instruction request circuit and out-of-order instruction transmitting architecture
US7328327B2 (en) Technique for reducing traffic in an instruction fetch unit of a chip multiprocessor
TW522339B (en) Method and apparatus for buffering microinstructions between a trace cache and an allocator
CN115878190B (en) Method applied to instruction scheduling filling among transmission queues
Sharkey et al. Instruction packing: Toward fast and energy-efficient instruction scheduling
CN117369878A (en) Instruction processing method and device of double-emission pipeline, electronic equipment and medium

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20930430

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20930430

Country of ref document: EP

Kind code of ref document: A1