CN112379928B - Instruction scheduling method and processor comprising instruction scheduling unit - Google Patents

Instruction scheduling method and processor comprising instruction scheduling unit

Publication number
CN112379928B
Authority
CN
China
Prior art keywords
register file
physical register
execution unit
read port
file read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011253606.8A
Other languages
Chinese (zh)
Other versions
CN112379928A (en
Inventor
薛大庆
崔泽汉
胡世文
耿恒生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd
Priority to CN202011253606.8A
Publication of CN112379928A
Application granted
Publication of CN112379928B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3854 Instruction completion, e.g. retiring, committing or graduating
    • G06F9/3856 Reordering of instructions, e.g. using queues or age tags
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/22 Microcontrol or microprogram arrangements
    • G06F9/223 Execution means for microinstructions irrespective of the microinstruction function, e.g. decoding of microinstructions and nanoinstructions; timing of microinstructions; programmable logic arrays; delays and fan-out problems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30098 Register arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 Operand accessing
    • G06F9/383 Operand prefetching
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The present disclosure provides an instruction scheduling method and a processor including an instruction scheduling unit. The method includes: generating a first microinstruction according to a first task that needs to be executed, wherein the first task does not need a target operand and the first microinstruction comprises a control field; selecting according to the control field and dispatching the first microinstruction to a first instruction scheduling queue; and providing the first microinstruction from the first instruction scheduling queue to a first execution unit for processing, wherein the first execution unit does not have a physical register file write port. Because the execution unit that processes instructions without a target operand needs no additional dedicated physical register file write port, the overall execution bandwidth of the execution scheduling unit of the processor core can be increased and the throughput that simultaneous multithreading demands of the execution scheduling unit is better supported, while the area cost and routing difficulty of adding physical register file ports are avoided and the timing constraints caused by complex routing are reduced.

Description

Instruction scheduling method and processor comprising instruction scheduling unit
Technical Field
Embodiments of the present disclosure relate to an instruction scheduling method and a processor including an instruction scheduling unit.
Background
A processor (e.g., a central processing unit, CPU) is one of the main devices of an electronic computer and its core component; its main functions are to interpret computer instructions and to process the data used by computer software. The CPU is responsible for reading, decoding, and executing instructions, and its main functions include processing instructions, executing operations, controlling timing, and processing data.
An arithmetic logic unit (ALU), also called an arithmetic unit, is an execution unit of a CPU: a logic unit that executes instructions such as addition, subtraction, multiplication, division, AND, OR, and NOT. The ALU is the core component of all central processing units. When the computer runs, the operations of the arithmetic unit and their types are determined by the controller. The data processed by the arithmetic unit comes from memory or from a register; the result is written back to memory or held temporarily in a register, according to the type of the target operand. The arithmetic unit operates on commands received from the controller, that is, every operation it performs is directed by control signals sent by the controller, so the arithmetic unit is an execution component. In various microarchitectures, different execution units (logic units in the instruction scheduling unit that perform specific logic operations according to their purpose, such as the ALU and the AGU) pick up ready microinstructions for processing. How to design the microarchitecture of a high-performance CPU soundly is therefore important for improving the execution bandwidth of a CPU core and for meeting the execution-bandwidth demands of simultaneous multithreading.
Disclosure of Invention
Embodiments of the present disclosure provide an instruction scheduling method and a processor comprising an instruction scheduling unit. To address the suboptimal allocation of hardware resources to instructions that have no target operand, such instructions are identified during processor design, and an execution unit for processing them is added without adding a dedicated physical register file (PRF) write port. This can increase the overall execution bandwidth of the execution scheduling unit of a processor core and better support the throughput that simultaneous multithreading demands of the execution scheduling unit, while avoiding the area cost and routing difficulty of adding PRF ports and reducing the timing constraints caused by complex routing.
At least one embodiment of the present disclosure provides an instruction scheduling method, including:
generating a first microinstruction according to a first task that needs to be executed, wherein the first task does not require a target operand, the first microinstruction comprising a control field;
selecting according to the control field, and dispatching the first microinstruction to a first instruction scheduling queue;
providing the first microinstruction from the first instruction scheduling queue to a first execution unit for processing, wherein the first execution unit does not have a physical register file write port.
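The three steps above can be sketched in software as follows. This is only an illustrative model, not the disclosed hardware: the control-field values, the class names, and the use of "branch" and "add" as example tasks are assumptions introduced here.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical control-field encodings; the disclosure does not give values.
CTRL_NO_TARGET = 1   # task needs no target operand
CTRL_DEFAULT = 0     # task writes a target operand


@dataclass
class MicroOp:
    name: str
    needs_target: bool
    control_field: int


@dataclass
class ExecutionUnit:
    name: str
    has_prf_write_port: bool

    def execute(self, uop: MicroOp) -> str:
        # A unit without a PRF write port cannot produce a target operand.
        if uop.needs_target and not self.has_prf_write_port:
            raise ValueError(f"{self.name} cannot write back {uop.name}")
        return f"{self.name}:{uop.name}"


def generate(name: str, needs_target: bool) -> MicroOp:
    """Step 1: build the microinstruction and set its control field."""
    field = CTRL_DEFAULT if needs_target else CTRL_NO_TARGET
    return MicroOp(name, needs_target, field)


def dispatch(uop: MicroOp, first_queue: deque, second_queue: deque) -> None:
    """Step 2: select the scheduling queue according to the control field."""
    if uop.control_field == CTRL_NO_TARGET:
        first_queue.append(uop)
    else:
        second_queue.append(uop)


def issue(queue: deque, unit: ExecutionUnit) -> str:
    """Step 3: hand the oldest queued microinstruction to the execution unit."""
    return unit.execute(queue.popleft())


first_q, second_q = deque(), deque()
first_unit = ExecutionUnit("EX_first", has_prf_write_port=False)
second_unit = ExecutionUnit("EX_second", has_prf_write_port=True)

for op in (generate("branch", needs_target=False),
           generate("add", needs_target=True)):
    dispatch(op, first_q, second_q)

results = [issue(first_q, first_unit), issue(second_q, second_unit)]
```

In this sketch the queue is chosen purely from the control field, so the first execution unit never sees a microinstruction that would need a write port.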
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, the first task requires at most a single source operand, and the first execution unit either has no physical register file read port or includes a single first physical register file read port.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, in response to a single source operand being needed, the single source operand is read from a physical register file through the first physical register file read port.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, reading the single source operand from a physical register file through the first physical register file read port includes: the first physical register file read port of the first execution unit and one of the other execution units having at least one physical register file read port multiplex one source operand read from a physical register file such that the first physical register file read port reads the one source operand as the single source operand required by the first execution unit.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, a second execution unit has at least one second physical register file read port, the first physical register file read port of the first execution unit is coupled to one second physical register file read port of the second execution unit, and the single source operand is read from a physical register file through the first physical register file read port, including:
multiplexing a second physical register file read port of the second execution unit and the first physical register file read port of the first execution unit to read the single source operand from the physical register file.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, a second execution unit has at least one second physical register file read port, a third execution unit has at least one third physical register file read port, and the first physical register file read port of the first execution unit is coupled through a multiplexer to one second physical register file read port of the second execution unit and to one third physical register file read port of the third execution unit, so as to obtain a source operand through one of the second physical register file read port of the second execution unit and the third physical register file read port of the third execution unit, and reading the single source operand from the physical register file through the first physical register file read port includes:
obtaining the single source operand from the physical register file via one of the second physical register file read port of the second execution unit and the third physical register file read port of the third execution unit, the single source operand then being obtained by the first execution unit via the first physical register file read port.
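The multiplexer-based borrowing of a read port can be modeled roughly as below. This is a sketch under stated assumptions: the one-read-per-cycle port model, the "first idle port wins" selection policy, and all names are hypothetical and are not taken from the disclosure.

```python
class ReadPort:
    """One PRF read port; it can serve at most one read per cycle."""

    def __init__(self, owner, prf):
        self.owner = owner
        self.prf = prf
        self.busy = False

    def read(self, reg):
        if self.busy:
            raise RuntimeError(f"read port of {self.owner} already used this cycle")
        self.busy = True
        return self.prf[reg]


def mux_read(candidate_ports, reg):
    """Select the first idle port among the candidates and read through it.

    Returns (value, owner), or None when every candidate port is busy,
    in which case the borrowing unit would have to retry in a later cycle.
    """
    for port in candidate_ports:
        if not port.busy:
            return port.read(reg), port.owner
    return None


prf = [0, 11, 22, 33]                  # toy physical register file
second_port = ReadPort("second", prf)
third_port = ReadPort("third", prf)

second_port.read(1)                    # the second unit uses its own port
borrowed = mux_read([second_port, third_port], reg=3)
blocked = mux_read([second_port, third_port], reg=2)
```

Here the first execution unit obtains its single source operand through the third unit's port because the second unit's port is already in use in the same cycle; once both are busy, no operand can be borrowed.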
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, a second execution unit has at least one second physical register file read port, the first physical register file read port of the first execution unit and the second physical register file read port of the second execution unit are respectively connected to an output of a demultiplexer, and the single source operand is obtained from the physical register file through an input of the demultiplexer, so that one of the first physical register file read port of the first execution unit and the second physical register file read port of the second execution unit obtains the single source operand, and the single source operand is read from the physical register file through the first physical register file read port, including:
the single source operand passing through the input of the demultiplexer is output from the output of the demultiplexer and read by the first physical register file read port of the first execution unit.
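As a rough illustration of the demultiplexer arrangement, the sketch below routes one operand read from the physical register file to exactly one of two consumers. The function name, the select encoding, and the register name `p7` are assumptions made for this example only.

```python
def demux(value, select, num_outputs=2):
    """Route one input value to exactly one of num_outputs outputs.

    Outputs that are not selected carry None (no operand this cycle).
    """
    if not 0 <= select < num_outputs:
        raise ValueError("select out of range")
    outputs = [None] * num_outputs
    outputs[select] = value
    return outputs


# Hypothetical select encoding for the two consumers of the shared read.
FIRST_UNIT, SECOND_UNIT = 0, 1

prf = {"p7": 42}                       # toy physical register file
to_first, to_second = demux(prf["p7"], select=FIRST_UNIT)
```

Only one physical read of the register file takes place; the select signal decides which execution unit's read port receives the operand.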
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, the other execution unit having at least one physical register file read port has two or more physical register file read ports, and the first physical register file read port of the first execution unit and one physical register file read port of the other execution units having at least one physical register file read port multiplex one source operand read from a physical register file, so that after the first physical register file read port reads the single source operand, the other execution units having at least one physical register file read port continue to schedule execution of other microinstructions having one or more source operands based on the remaining physical register file read ports without waiting.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, the first task requires N source operands and the first execution unit includes N first physical register file read ports, where N is an integer and N ≥ 2; in response to the N source operands being required, the respective source operands are read from the physical register file through the N first physical register file read ports.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, the first execution unit includes 2 first physical register file read ports, the second execution unit has at least one second physical register file read port, and the third execution unit has at least one third physical register file read port; one first physical register file read port of the first execution unit and one second physical register file read port of the second execution unit are respectively connected to an output of a first demultiplexer, and a first source operand is obtained from the physical register file through an input of the first demultiplexer, so that one of the one first physical register file read port of the first execution unit and the one second physical register file read port of the second execution unit obtains the first source operand; the other first physical register file read port of the first execution unit and one third physical register file read port of the third execution unit are respectively connected to an output of a second demultiplexer, and a second source operand is obtained from the physical register file through an input of the second demultiplexer, so that one of the other first physical register file read port of the first execution unit and the one third physical register file read port of the third execution unit obtains the second source operand; in response to 2 source operands being required, reading the respective source operands from the physical register file through the 2 first physical register file read ports includes:
said first source operand passing through an input of said first demultiplexer is output from an output of said first demultiplexer and read by said one first physical register file read port of said first execution unit;
said second source operand passing through an input of said second demultiplexer is output from an output of said second demultiplexer and read by said another first physical register file read port of said first execution unit.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, the second execution unit has two or more second physical register file read ports, wherein the second execution unit continues, without waiting, to schedule execution of other microinstructions having one or more source operands using the remaining second physical register file read ports, that is, those other than the one connected to the first demultiplexer; the third execution unit has two or more third physical register file read ports, wherein the third execution unit continues, without waiting, to schedule execution of other microinstructions having one or more source operands using the remaining third physical register file read ports, that is, those other than the one connected to the second demultiplexer.
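The port budget described above can be expressed as a small check. This is a simplified sketch: it assumes each source operand consumes exactly one read port in a cycle, and the function names are hypothetical.

```python
def remaining_ports(total_ports, lent_ports):
    """Read ports an execution unit still has after lending some out."""
    if not 0 <= lent_ports <= total_ports:
        raise ValueError("cannot lend more ports than the unit has")
    return total_ports - lent_ports


def can_issue(num_source_operands, total_ports, lent_ports):
    """True if the unit can issue a microinstruction this cycle without waiting.

    Assumes one read port per source operand.
    """
    return num_source_operands <= remaining_ports(total_ports, lent_ports)


# Second execution unit with two read ports, one lent via the first demultiplexer.
one_operand_ok = can_issue(1, total_ports=2, lent_ports=1)
two_operands_ok = can_issue(2, total_ports=2, lent_ports=1)
```

With two ports and one lent, the unit can still issue a one-operand microinstruction in the same cycle, but a two-operand microinstruction would need both ports.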
For example, in an instruction scheduling method provided in at least one embodiment of the present disclosure, the number of first execution units is greater than or equal to two; the output of a multiplexer that outputs write data is connected to the data write path of the memory access unit, and the results processed by the at least two first execution units are each written to the memory access unit; or the output of each first execution unit is connected to the data write path of the memory access unit, and the result processed by each first execution unit is written to the memory access unit.
For example, in an instruction scheduling method provided in at least one embodiment of the present disclosure, the method further includes:
generating a second microinstruction based on a second task to be executed, wherein the second task requires a target operand and requires more than two source operands;
dispatching the second microinstruction to a second instruction scheduling queue;
and providing the second microinstruction from the second instruction scheduling queue to the other execution unit with at least one physical register file read port for processing, wherein the other execution unit with at least one physical register file read port has a physical register file write port.
For example, in an instruction scheduling method provided in at least one embodiment of the present disclosure, the method further includes:
generating a second microinstruction based on a second task to be executed, wherein the second task requires a target operand and requires more than two source operands;
dispatching the second microinstruction to a second instruction scheduling queue;
providing the second microinstruction from the second instruction scheduling queue to the second execution unit and/or the third execution unit for processing, wherein the second execution unit and/or the third execution unit has a physical register file write port.
For example, in an instruction scheduling method provided by at least one embodiment of the present disclosure, the second microinstruction does not include a control field, or the second microinstruction is provided with a control field different from the control field of the first microinstruction; where a different control field is set for the second microinstruction, dispatching the second microinstruction to the second instruction scheduling queue includes:
selecting according to the different control field, and dispatching the second microinstruction to the second instruction scheduling queue.
For example, in an instruction scheduling method provided in at least one embodiment of the present disclosure, the first instruction scheduling queue and the second instruction scheduling queue are separate instruction scheduling queues, or the first instruction scheduling queue and the second instruction scheduling queue belong to a unified instruction scheduling queue.
At least one embodiment of the present disclosure provides a processor, including an instruction scheduling unit, the instruction scheduling unit including:
a first instruction scheduling queue configured to receive a microinstruction dispatched according to a selection based on the control field of the microinstruction, wherein the microinstruction is generated according to a first task that needs to be executed, and the first task does not need a target operand;
a first execution unit configured to receive the microinstruction from the first instruction scheduling queue for processing, wherein the first execution unit does not have a physical register file write port.
For example, in a processor provided in at least one embodiment of the present disclosure, the first execution unit does not have a physical register file read port or the first execution unit includes a single first physical register file read port.
For example, in a processor provided in at least one embodiment of the present disclosure, the instruction scheduling unit further includes a second execution unit having at least one second physical register file read port, and the first physical register file read port of the first execution unit is coupled to one second physical register file read port of the second execution unit.
For example, in a processor provided in at least one embodiment of the present disclosure, the instruction scheduling unit further includes a second execution unit, a third execution unit, and a multiplexer, the second execution unit has at least one second physical register file read port, the third execution unit has at least one third physical register file read port, and the first physical register file read port of the first execution unit is coupled to one second physical register file read port of the second execution unit and one third physical register file read port of the third execution unit through the multiplexer, respectively.
For example, in a processor provided in at least one embodiment of the present disclosure, the instruction scheduling unit further includes a second execution unit and a demultiplexer, the second execution unit has at least one second physical register file read port, the first physical register file read port of the first execution unit and one second physical register file read port of the second execution unit are respectively connected to an output of the demultiplexer, and the single source operand is obtained from the physical register file through an input of the demultiplexer, so that one of the first physical register file read port of the first execution unit and one second physical register file read port of the second execution unit obtains the single source operand.
For example, in a processor provided by at least one embodiment of the present disclosure, the instruction scheduling unit further includes a second execution unit, a third execution unit, a first demultiplexer, and a second demultiplexer, the first execution unit includes 2 first physical register file read ports, the second execution unit has at least one second physical register file read port, the third execution unit has at least one third physical register file read port,
a first physical register file read port of the first execution unit and a second physical register file read port of the second execution unit are respectively connected with an output end of the first demultiplexer, a first source operand is obtained from the physical register file through an input end of the first demultiplexer, so that one of the first physical register file read port of the first execution unit and a second physical register file read port of the second execution unit obtains the first source operand,
the other first physical register file read port of the first execution unit and the one third physical register file read port of the third execution unit are respectively connected with the output end of the second demultiplexer, and a second source operand is obtained from the physical register file through the input end of the second demultiplexer, so that one of the other first physical register file read port of the first execution unit and the one third physical register file read port of the third execution unit obtains the second source operand.
For example, in a processor provided in at least one embodiment of the present disclosure, the processor further includes a decode and dispatch unit, where the instruction dispatch unit is connected to the decode and dispatch unit, and the decode and dispatch unit decodes an input instruction to generate a microinstruction with a control field and dispatches the microinstruction with the control field.
For example, in a processor provided in at least one embodiment of the present disclosure, the first instruction scheduling queue is a separate instruction scheduling queue, or the first instruction scheduling queue is at least a part of a unified instruction scheduling queue.
For example, in at least one embodiment of the present disclosure, a processor is provided, which includes a central processing unit.
Drawings
In order to illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of an instruction scheduling unit in a microarchitecture of a CPU core;
FIG. 2 is a flow chart illustrating an instruction scheduling method according to some embodiments of the present disclosure;
FIG. 3 is a schematic diagram of an instruction scheduling unit with an additional execution unit without a PRF write port according to some embodiments of the present disclosure;
FIG. 4 is a schematic diagram of an instruction scheduling unit with an additional execution unit that does not require a PRF write port and borrows a PRF read port according to some embodiments of the present disclosure;
FIG. 5 is a schematic diagram of an instruction scheduling unit with an additional execution unit that does not require a PRF write port and borrows a PRF read port according to further embodiments of the present disclosure;
FIG. 6 is a schematic diagram of an instruction scheduling unit with additional execution units that do not require a PRF write port and share a PRF read port according to some embodiments of the present disclosure;
FIG. 7 is a schematic diagram of an instruction scheduling unit with additional execution units that do not require a PRF write port and share a PRF read port according to yet further embodiments of the present disclosure;
FIG. 8 is a schematic diagram of an instruction scheduling unit with multiple additional execution units that do not require a PRF write port and borrow a PRF read port according to some embodiments of the present disclosure; and
FIG. 9 is a schematic diagram of an instruction scheduling method according to some embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without inventive step, are intended to be within the scope of the present disclosure.
Unless otherwise defined, all terms (including technical and scientific terms) used in the embodiments of the present disclosure have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The use of "first," "second," and similar terms in the embodiments of the disclosure is not intended to indicate any order, quantity, or importance, but rather to distinguish one element from another. The terms "a," "an," "the," and similar referents do not denote a limitation of quantity, but rather denote the presence of at least one. Similarly, the word "comprising" or "comprises", and the like, means that the element or item preceding the word includes the elements or items listed after the word and their equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. Flowcharts are used in the disclosed embodiments to illustrate the steps of the methods according to the disclosed embodiments. It should be understood that the steps are not necessarily performed in the exact order shown. Rather, various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to or removed from these processes.
FIG. 1 is a diagram of an instruction scheduling unit in a microarchitecture of a processor (e.g., CPU) core; the microarchitecture of the CPU core in FIG. 1 is shown with a simplified execution model (floating-point operations are omitted).
As shown in FIG. 1, the scheduling queues of the instruction scheduling unit receive dispatched microinstructions. In a design with separate instruction scheduling queues, each queue receives the microinstructions of its corresponding class. For example, if the microinstruction is a fixed-point compute operation, it is sent to the ALU scheduling queue. As another example, if the microinstruction is a memory access operation: (a) it is sent to the Address Generation Unit (AGU) scheduling queue for virtual address calculation, where the AGU is the execution unit that calculates the virtual address of the memory access; (b) it is also sent to the memory access unit, which performs the memory access operation after the virtual address has been calculated; and (c) if it is a fixed-point write operation, it is additionally sent to the ALU scheduling queue or a separate write scheduling queue to generate the source operand of the write. In a CPU architecture with a unified instruction scheduling queue, by contrast, the different execution units (e.g., AGU and ALU) each pick their ready microinstructions during the instruction scheduling stage. The ALU scheduling queue and the AGU scheduling queue schedule all received microinstructions out of order, selecting executable microinstructions from among them and issuing them.
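The routing rules for separate scheduling queues can be summarized in a short sketch. The enum members and destination labels are names chosen for this illustration; the split of memory access operations into a load case and a fixed-point store case is an assumption used to show rule (c).

```python
from enum import Enum, auto


class Kind(Enum):
    FIXED_POINT_COMPUTE = auto()
    LOAD = auto()                # memory access that is not a write
    FIXED_POINT_STORE = auto()   # memory access that is a fixed-point write


def route(kind):
    """Return the destinations a dispatched microinstruction is sent to."""
    if kind is Kind.FIXED_POINT_COMPUTE:
        return ["ALU queue"]
    # (a) address calculation and (b) the memory access itself
    destinations = ["AGU queue", "memory access unit"]
    if kind is Kind.FIXED_POINT_STORE:
        # (c) the write data is produced by the ALU or a separate write queue
        destinations.append("ALU or write queue")
    return destinations


routes = {kind.name: route(kind) for kind in Kind}
```

A fixed-point store therefore occupies three destinations at once, which is why a write-port-free data path for store data is attractive.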
Microinstructions issued from the ALU scheduling queue or the AGU scheduling queue read the required register source operands (hereinafter simply referred to as source operands) from a Physical Register File (PRF), are executed by the corresponding execution units, and write their execution results back to the physical register file or to a memory access unit (LSU) according to the type of the target operand. The physical register file stores the source operands and target operands of instructions and generally has a plurality of read and write ports to support concurrent execution by a plurality of ALUs and AGUs. The memory access unit is a logic unit that performs memory access operations (reads/writes).
It should be noted that, in the embodiments of the present disclosure, unless otherwise specified, an execution unit refers to a pipelined execution unit (Pipe), i.e., a pipelined arithmetic unit in which an arithmetic task is decomposed into a plurality of sub-tasks that are executed in sequence in the stages of a pipeline. A multi-stage pipeline allows different stages of a plurality of different tasks to run simultaneously in the same clock cycle.
Micro-architectures of high-performance processor (e.g., CPU) cores include, for example, the SunnyCove, Zen2, and A77 micro-architectures. In both SunnyCove and Zen2, branch instructions share specific execution units with other ALU instructions, so those units require dedicated PRF read and write ports to support ALU operations. A77 has a dedicated branch instruction execution unit, which needs only a PRF read port and no PRF write port. In addition, SunnyCove and A77 use a dedicated PRF read port to send data to the memory access unit, a data path that needs no PRF write port; Zen2 sends data to the memory access unit through other ALU components, which therefore require dedicated PRF read and write ports to support ALU operations.
To increase the execution bandwidth of the CPU core, the three micro-architectures described above (SunnyCove, Zen2, and A77) all add execution units for different instruction types. The inventors of the present disclosure found that the PRF must then provide each newly added execution unit with a dedicated PRF read port, a dedicated PRF write port, or both. Because a high-performance CPU has a large number of physical registers (for example, the SunnyCove and Zen types support more than 180 physical registers), adding a dedicated PRF read port or write port requires connecting all the physical registers to a newly added multiplexer, which greatly increases routing difficulty and tightens timing convergence constraints. Moreover, all three micro-architectures place branch instructions and store data paths in different execution units, which objectively increases the number of physical register read and write ports required.
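Dynamic instruction ratio analysis according to a typical CPU benchmark, CpuSpec2017 (see the table below):
[Table: dynamic instruction ratios for the CpuSpec2017 benchmark set; provided as image BDA0002772398970000101 in the original publication]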
Based on the above behavior analysis of the CpuSpec2017 benchmark set, the inventors found that instructions without a target operand account for more than one third of the instructions in the fixed-point benchmark set; even considering that more than 70% of conditional jump instructions can be merged with the adjacent condition-setting ALU instructions, the proportion of instructions without a target operand remains considerable.
At present, no literature addresses how to remedy the shortcomings in hardware resource allocation for instructions without target operands in the design of a high-performance processor (such as a CPU).
At least one embodiment of the present disclosure provides an instruction scheduling method, including:
generating a first microinstruction according to a first task needing to be executed, wherein the first task does not need a target operand, and the first microinstruction comprises a control field;
selecting a first instruction scheduling queue according to the control field, and distributing the first microinstruction to the first instruction scheduling queue;
providing the first microinstruction from the first instruction scheduling queue to a first execution unit for processing, wherein the first execution unit does not have a physical register file write port.
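The three steps above can be sketched in Python as a toy software model. This is only an illustration under stated assumptions, not the disclosed hardware: the names `MicroOp`, `decode`, `dispatch`, and `ExecutionUnit`, the boolean encoding of the control field, and the example instruction names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    needs_target_operand: bool
    control_field: bool  # hypothetical encoding: True = may go to the first execution unit


def decode(name: str, needs_target_operand: bool) -> MicroOp:
    """Step 1: generate a microinstruction and set its control field."""
    return MicroOp(name, needs_target_operand, control_field=not needs_target_operand)


class ExecutionUnit:
    def __init__(self, label: str, has_prf_write_port: bool):
        self.label = label
        self.has_prf_write_port = has_prf_write_port

    def execute(self, uop: MicroOp) -> str:
        # A unit without a PRF write port cannot write a register result back.
        if uop.needs_target_operand and not self.has_prf_write_port:
            raise ValueError(f"{self.label} cannot write back the result of {uop.name}")
        return f"{uop.name} executed on {self.label}"


first_queue, other_queue = deque(), deque()


def dispatch(uop: MicroOp) -> None:
    """Step 2: select the scheduling queue according to the control field."""
    (first_queue if uop.control_field else other_queue).append(uop)


first_unit = ExecutionUnit("first execution unit", has_prf_write_port=False)
alu = ExecutionUnit("ALU", has_prf_write_port=True)

for uop in (decode("branch", False), decode("add", True)):
    dispatch(uop)

# Step 3: each queue provides its microinstructions to its own execution unit.
results = [first_unit.execute(first_queue.popleft()), alu.execute(other_queue.popleft())]
```

In this model the microinstruction without a target operand reaches the unit that has no PRF write port, while the microinstruction that produces a register result is kept on an ordinary ALU.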
The instruction scheduling method of this embodiment optimizes the handling of instructions without a target operand, and thereby remedies the shortcomings in hardware resource allocation for such instructions in the design of a high-performance processor (such as a CPU). By adding an execution unit that processes instructions without a target operand and needs no additional dedicated PRF write port, the method can increase the overall execution bandwidth of the instruction scheduling unit of the CPU core, better support the throughput that simultaneous multithreading (such as SMT2, SMT4, and the like) demands of the instruction scheduling unit, avoid the area cost and routing difficulty caused by additional PRF ports, and relax the timing constraints caused by complex routing.
At least one embodiment of the present disclosure provides an instruction scheduling method, for example, the instruction scheduling method may be applied to an instruction scheduling unit having a discrete instruction scheduling queue, and may also be applied to an instruction scheduling unit having a unified instruction scheduling queue.
For example, with discrete instruction scheduling queues, each execution unit corresponds to one instruction scheduling queue. In the microinstruction distribution stage, the distribution unit distributes each microinstruction that meets certain constraint conditions to the instruction scheduling queue of its expected execution unit. An execution unit can select ready microinstructions only from its own instruction scheduling queue and cannot read from the other instruction scheduling queues.
With a unified instruction scheduling queue, by contrast, all execution units share one instruction scheduling queue: the distribution unit writes all the microinstructions into the same instruction scheduling queue (i.e., the unified instruction scheduling queue), and the ready microinstructions are distributed to their expected execution units in the instruction execution scheduling stage.
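The difference between the two queue organizations can be illustrated with a short Python sketch. The tuple layout `(name, expected_unit, ready)` and the function names are hypothetical and chosen only for this illustration.

```python
# Each microinstruction is (name, expected_unit, ready); this layout is hypothetical.
uops = [("add", "ALU", True), ("load_addr", "AGU", False), ("sub", "ALU", True)]


def pick_discrete(queues: dict, unit: str):
    """Discrete queues: a unit may pick a ready microinstruction only from its own queue."""
    for i, (name, _, ready) in enumerate(queues[unit]):
        if ready:
            del queues[unit][i]
            return name
    return None


def pick_unified(queue: list, unit: str):
    """Unified queue: every unit scans the shared queue for a ready microinstruction of its kind."""
    for i, (name, expected_unit, ready) in enumerate(queue):
        if ready and expected_unit == unit:
            del queue[i]
            return name
    return None


# Distribution stage, discrete case: route each microinstruction by expected execution unit.
discrete = {"ALU": [], "AGU": []}
for uop in uops:
    discrete[uop[1]].append(uop)

# Distribution stage, unified case: everything is written into one queue.
unified = list(uops)

discrete_picks = (pick_discrete(discrete, "ALU"), pick_discrete(discrete, "AGU"))
unified_picks = (pick_unified(unified, "ALU"), pick_unified(unified, "AGU"))
```

In both cases the ALU picks `add` and the AGU picks nothing because its only microinstruction is not ready; the routing decision is simply made at distribution time in the discrete case and at scheduling time in the unified case.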
The following instruction scheduling method is mainly described by taking an instruction scheduling unit with a discrete instruction scheduling queue as an example, and of course, each example herein is also applicable to an instruction scheduling unit with a unified instruction scheduling queue.
At least one embodiment of the present disclosure further provides an instruction scheduling unit including a first instruction scheduling queue and a first execution unit. The first instruction scheduling queue is configured to receive a first microinstruction that is dispatched to it through selection according to a control field of the first microinstruction, wherein the first microinstruction is generated according to a first task that needs to be executed, and the first task does not need a target operand. The first execution unit is configured to receive the first microinstruction from the first instruction scheduling queue for processing, wherein the first execution unit does not have a physical register file write port.
Likewise, the instruction scheduling unit may have discrete instruction scheduling queues or a unified instruction scheduling queue. The instruction scheduling unit is described below mainly using discrete instruction scheduling queues as an example; of course, the instruction scheduling unit of each example herein may also have a unified instruction scheduling queue.
At least one embodiment of the present disclosure also provides a processor including an instruction scheduling unit. The instruction scheduling unit may adopt any of the instruction scheduling units illustrated herein, and the embodiments of the present disclosure are not limited herein and are not described herein again.
In addition, the following description mainly takes a simplified schematic diagram of a core micro-architecture of a CPU as an example, and may also be applied to core micro-architectures of other types of processors.
Fig. 2 is a flowchart illustrating an instruction scheduling method according to some embodiments of the disclosure. As shown in fig. 2, the instruction scheduling method includes steps S1 to S3.
S1, generating a first micro-instruction according to a first task needing to be executed, wherein the first task does not need a register target operand (hereinafter referred to as the target operand for short), and the first micro-instruction comprises a control domain, namely the first micro-instruction is an instruction which does not need the target operand;
S2, selecting according to the control domain, and distributing the first micro-instruction to a first instruction scheduling queue;
S3, providing the first micro-instruction from the first instruction scheduling queue to a first execution unit for processing, wherein the first execution unit does not have a physical register file write port.
Fig. 3 is a schematic diagram of an instruction dispatch unit with an additional execution unit without a PRF write port according to some embodiments of the present disclosure.
For example, in some examples, a processor includes an instruction fetch unit and a decode and dispatch unit.
As shown in fig. 2 and fig. 3 in combination, before step S1, the instruction fetch unit fetches an instruction from the instruction cache and hands it to the decode and dispatch unit.
First, step S1 requires some modifications to the decoding unit of the CPU core. For example, in some examples, according to the first task to be executed, the decoding unit in the decode and dispatch unit decodes the input instruction and generates a first microinstruction including a control field, which indicates whether the instruction scheduling unit can distribute the current microinstruction to the newly added first execution unit.
Second, the decode and dispatch unit dispatches microinstructions, according to their type, to different dispatch queues in the instruction dispatch unit, e.g., discrete instruction dispatch queues such as the ALU dispatch queue and the AGU dispatch queue shown in fig. 2. Thus, for step S2, in some examples, the dispatch unit in the decode and dispatch unit dispatches the first microinstruction to a first instruction dispatch queue (e.g., the ALU dispatch queue of FIG. 3) according to the control field of the first microinstruction, so that it can be provided to the first execution unit.
For example, in some examples, a first microinstruction that can be dispatched to the first instruction dispatch queue must satisfy at least the following condition: (a) it has no destination operand. Condition (a) means that the first task does not require a destination operand, so the first execution unit does not require a PRF write port.
As shown in FIG. 3, the microinstructions dispatched to the newly added first execution unit are destination operand-free, and the first execution unit does not need to add a PRF write port.
As can be seen from the above, the embodiments of the present disclosure classify instructions in the decoding stage and mark which instructions can enter the added first execution unit. For example, in some examples, the first microinstruction includes one or more of: the present disclosure is not limited to these listed instructions; all instructions without target operands fall within its scope and are not enumerated exhaustively here.
FIG. 4 is a schematic diagram of an instruction dispatch unit with an added execution unit that does not require a PRF write port and borrows a PRF read port according to some embodiments of the present disclosure. FIG. 5 is a schematic diagram of an instruction dispatch unit with an added execution unit that does not require a PRF write port and borrows a PRF read port according to further embodiments of the present disclosure. FIG. 6 is a schematic diagram of an instruction dispatch unit with an added execution unit that does not require a PRF write port and shares a PRF read port according to some embodiments of the present disclosure. As shown in FIGS. 4-6, the microinstructions dispatched to the newly added first execution unit are all free of destination operands, so the first execution unit does not require the addition of a PRF write port.
Based on the above, at least one embodiment of the present disclosure adds to the CPU core a first execution unit that, at a minimum, requires no dedicated PRF write port. The first execution unit executes microinstructions that do not require target operands (including, but not limited to, branch instructions). As a result, the scheduling pressure on the multiple execution units can be better balanced and the overall execution bandwidth increased, in particular to meet the execution bandwidth demands of simultaneous multithreading (SMT2, SMT4, etc.), while the area, physical implementation effort, and timing convergence difficulty of the CPU design are kept as low as possible.
For example, in some examples, the first microinstruction may further satisfy any one of the following conditions: (i) no source operand is required; (ii) a single source operand is required. Condition (i) means that the first task does not require source operands, in which case the first execution unit does not require a PRF read port. Condition (ii) means that the first task requires a single source operand, in which case the first execution unit uses only one PRF read port. For example, the instruction scheduling method further includes: when the first task requires a single source operand, reading the single source operand from the PRF using this PRF read port of the first execution unit.
To facilitate understanding of the disclosed embodiments and to distinguish the first execution unit from other execution units, a first execution unit that has no target operand and requires only one source operand is named the control execution unit, abbreviated as CRLU. It should be noted that CRLU is merely an example name and does not limit the content or characteristics of the first execution unit.
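Conditions (a), (i), and (ii) can be summarized as a small port-requirement check. The following Python sketch is illustrative only; the function name and the returned dictionary are hypothetical, and it assumes one PRF read port per source operand read in the same beat.

```python
def first_unit_prf_ports(num_source_operands: int, has_target_operand: bool) -> dict:
    """Return the PRF ports the first execution unit needs for one microinstruction."""
    if has_target_operand:
        # Condition (a) fails: the microinstruction must go to another execution unit.
        raise ValueError("microinstruction has a target operand")
    if num_source_operands < 0:
        raise ValueError("operand count cannot be negative")
    # No target operand means no PRF write port is ever needed.
    return {"read_ports": num_source_operands, "write_ports": 0}


no_source = first_unit_prf_ports(0, has_target_operand=False)      # condition (i)
single_source = first_unit_prf_ports(1, has_target_operand=False)  # condition (ii)
```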
Finally, an instruction scheduling queue (e.g., the ALU scheduling queue and the AGU scheduling queue in fig. 3) in the instruction scheduling unit queues and schedules out-of-order all the received microinstructions, and selects ready-to-execute microinstructions from the queue for transmission. Thus, for step S3, for example, in some examples, the first microinstruction is provided from the first instruction dispatch queue to a first execution unit, such as a CRLU, for processing.
For example, in some examples, a ready microinstruction issued by an instruction dispatch queue is provided to the corresponding execution unit (e.g., the first microinstruction is provided to the first execution unit); the issued microinstruction also reads its source operands from the PRF and is then executed by that execution unit. Further, the execution result may be written back to the PRF or the memory access unit according to the destination operand type.
FIG. 7 is a schematic diagram of an instruction dispatch unit with additional execution units that do not require a PRF write port and share a PRF read port according to further embodiments of the disclosure.
For example, in some examples, the issued ready microinstructions include not only the first microinstruction corresponding to the first task described above, but also other microinstructions corresponding to other tasks (e.g., the other tasks require more than two source operands). Accordingly, the instruction dispatch unit further includes execution units for microinstructions other than the first microinstruction. For example, as illustrated in fig. 3-7, the instruction dispatch unit includes four ALUs (which also require register target operands) and two AGUs (which do not require register target operands, because an AGU is a virtual address calculation unit, i.e., a logic unit that calculates memory access virtual addresses), together with an ALU scheduling queue and an AGU scheduling queue.
For the sake of clarity, other microinstructions besides the first microinstruction are referred to as second microinstructions, and the ALU and AGU in fig. 3-7 are also only examples, and the micro-architecture of the processor core is not limited to these two execution units, and may include other types of execution units, which is not limited by the embodiment of the present disclosure.
Meanwhile, the numbers of the ALUs and the AGUs in fig. 3 to fig. 7 are only schematic, and the numbers of the ALUs and the AGUs in each example are not limited, and are adjusted freely according to actual needs.
For example, in some examples, the second microinstruction does not include a control domain, such that the dispatch unit is enabled to provide the first microinstruction including the control domain to the first execution unit via the first instruction dispatch queue, while providing the second microinstruction without the control domain to the other corresponding execution units.
As another example, in some examples, the second microinstruction carries a control field (a second control field) that differs from the control field of the first microinstruction (a first control field); that is, the two control fields are encoded differently. For example, one bit in the microinstruction is selected to express the kind of control field: the bit represents the first control field when it is 0 and the second control field when it is 1. Selection is then performed according to the second control field of the second microinstruction, and the second microinstruction is distributed to a second instruction scheduling queue to be provided to the other corresponding execution units.
In some examples, as shown in fig. 9, an instruction scheduling method includes the steps of:
(T1) a dispatch unit of the decode and dispatch unit receives a first microinstruction comprising a control field sent by a decode unit of the decode and dispatch unit, wherein the first microinstruction is generated by the decode unit according to a first task and does not require a target operand.
(T2) the distribution unit judges whether the current microinstruction belongs to the first microinstruction according to the control domain: if yes, distributing the current micro-instruction to a first instruction scheduling queue to be provided to a first execution unit; if not, the current microinstruction is distributed to other corresponding instruction scheduling queues to be provided to other corresponding execution units.
(T3) determining whether the distribution unit is empty: if yes, microinstruction distribution is finished and the method jumps to step (T4) below; if not, the method jumps back to step (T1), looping until the distribution unit is empty.
(T4) instructing the dispatch unit to enter a sleep state.
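Steps (T1) to (T4) amount to a simple distribution loop, sketched below in Python. The constants follow the one-bit encoding example given earlier (0 for the first control field, 1 for the second); the function name, the tuple layout, and the `"sleep"` state string are hypothetical.

```python
from collections import deque

FIRST_CONTROL_FIELD = 0   # bit value 0 marks the first control field
SECOND_CONTROL_FIELD = 1  # bit value 1 marks the second control field


def run_distribution(pending: deque):
    """Distribute (name, control_field) pairs until the distribution unit is empty."""
    first_queue, other_queue = [], []
    while pending:                                   # (T3) loop until empty
        name, control_field = pending.popleft()      # (T1) receive the next microinstruction
        if control_field == FIRST_CONTROL_FIELD:     # (T2) judge by the control field
            first_queue.append(name)
        else:
            other_queue.append(name)
    state = "sleep"                                  # (T4) nothing left to distribute
    return first_queue, other_queue, state


first_q, other_q, final_state = run_distribution(
    deque([("branch", FIRST_CONTROL_FIELD), ("add", SECOND_CONTROL_FIELD), ("jump", FIRST_CONTROL_FIELD)])
)
```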
Since most ALU operations require two source operands, the improvements in the examples below are designed on the basis that each ALU has two PRF read ports. However, the embodiments of the present disclosure do not limit the number of PRF read ports of each ALU. For example, an ALU included in the instruction scheduling unit may have more than two PRF read ports, and the examples below apply equally, because some micro-architecture designs include microinstructions with more than two source operands, such as the FMA (multiply-add) instruction of an X86 processor, which requires three source operands. Alternatively, the instruction scheduling unit may include an ALU having a single PRF read port; the specific design depends on actual needs. In other words, for a microinstruction that requires K source operands (K is an integer and K ≥ 1), the execution unit needs to read the K source operands from the PRF, which is not described in detail herein.
It is worth noting that, because of area, timing convergence, and other constraints in a specific CPU design, an execution unit generally has m (m < K) PRF read ports. In that case, the K source operands are either read through the m read ports in a time-sharing manner (i.e., over multiple beats) or read by borrowing the PRF read ports of other execution units. In short, however many PRF read ports an execution unit has, K source operands must be read from the PRF to supply the K source operands that the execution unit requires.
It should be further noted that the embodiments of the present disclosure may include one or more first microinstructions and corresponding first execution units. For example, fig. 3 to fig. 7 each schematically illustrate a single first execution unit, whereas fig. 8 illustrates multiple first execution units. The embodiments of the present disclosure do not limit the specific number of first execution units. If multiple first execution units are added to the instruction scheduling unit, their PRF read/write ports may be designed in the same manner or in different manners; for example, each of the first execution units may be an execution unit designed according to any example of the present disclosure. This is not limited or described again here and may be adjusted freely according to actual needs.
For example, in the example of fig. 3, the first execution unit is a CRLU to which only one dedicated PRF read port is added. Because the CRLU supports the various instructions without target operands throughout instruction classification, distribution, and scheduling, the area and timing constraints of the design can be optimized, and, unlike other micro-architectures, the example is not limited to supporting only branch instructions.
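As a rough illustration of the time-sharing case, the number of beats needed to read K source operands through m read ports can be estimated as the ceiling of K/m. This formula is an assumption made for the sketch (each port reads one operand per beat, and no ports are borrowed from other execution units); the function name is hypothetical.

```python
def beats_to_read(k_operands: int, m_ports: int) -> int:
    """Beats needed to read k_operands through m_ports, one operand per port per beat."""
    if k_operands < 0 or m_ports < 1:
        raise ValueError("need k_operands >= 0 and m_ports >= 1")
    # Integer ceiling division: ceil(k_operands / m_ports).
    return -(-k_operands // m_ports)


fma_on_two_ports = beats_to_read(3, 2)   # three source operands, two read ports
add_on_two_ports = beats_to_read(2, 2)   # two source operands, two read ports
```

Under these assumptions, a three-operand instruction on a two-port execution unit needs a second beat, which is the cost that borrowing a read port from another execution unit would avoid.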
For example, in some examples, when a single source operand is required for the first microinstruction satisfying condition (ii) above, then the first execution unit is a CRLU that requires only one PRF read port and does not require a PRF write port, as shown in fig. 4-6, where the CRLU no longer adds a dedicated PRF read port like the first execution unit shown in fig. 3, but instead reads the required single source operand from the PRF using the following method:
the PRF read port of the CRLU (denoted as the first PRF read port) and one PRF read port of another execution unit having at least one PRF read port multiplex one source operand read from the PRF, such that the first PRF read port reads that source operand as the single source operand required by the CRLU.
For example, another execution unit having at least one PRF read port as described herein may be an ALU of FIGS. 4-6 having two PRF read ports. When the other execution unit with at least one PRF read port has only one PRF read port, the first PRF read port can also read one source operand as a single source operand required by the CRLU for the CRLU which only needs one source operand, except that the other execution unit with only one PRF read port cannot continue to schedule and execute other microinstructions with the source operand at the same beat.
As shown in fig. 4, the instruction scheduling unit includes not only a plurality of ALUs and a plurality of AGUs, but also a newly added CRLU and a multiplexer 4a. The CRLU has a first PRF read port and reads the required single source operand from the PRF through the first PRF read port. In the example of FIG. 4, each ALU has two PRF read ports and each AGU has two PRF read ports. The first PRF read port of the CRLU is coupled via the multiplexer 4a to one second PRF read port of the ALU 401 and one third PRF read port of the ALU 402, so that the single source operand is obtained via one of the second PRF read port of the ALU 401 and the third PRF read port of the ALU 402.
In some examples, as shown in fig. 4, in the instruction scheduling method, reading the required single source operand from the PRF through the first PRF read port of the CRLU further comprises: obtaining the single source operand from the PRF via one of a second PRF read port of the ALU 401 and a third PRF read port of the ALU 402, after which the CRLU obtains the required single source operand via the first PRF read port.
In the example of fig. 4, the CRLU selects, through the multiplexer 4a, PRF read port data borrowed from one of its left and right adjacent ALUs. That is, the CRLU temporarily borrows a single source operand read from the PRF by a PRF read port of one of the ALU 401 and the ALU 402 in a given beat, and the target ALU from which the single source operand is borrowed temporarily does not read that source operand for itself in the same beat.
For example, in some examples, the target ALU (e.g., one of the ALU 401 and the ALU 402) from which the CRLU has borrowed a single source operand may, without waiting, continue to schedule and execute other microinstructions having one or more source operands using its remaining PRF read ports (i.e., the PRF read ports of the target ALU other than the one coupled to the multiplexer 4a). Those other microinstructions must correspond to the target ALU, and the number of source operands they require cannot exceed the number of remaining PRF read ports of the target ALU.
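The borrowing arrangement of fig. 4 can be modeled per beat in a few lines of Python. This is a simplified sketch: the `Alu` class, the `lent_ports` counter, and the `crlu_borrow` function are hypothetical names, and the multiplexer 4a is reduced to an index that selects the lending ALU.

```python
from dataclasses import dataclass


@dataclass
class Alu:
    label: str
    read_ports: int = 2
    lent_ports: int = 0  # read ports lent to the CRLU in the current beat

    def remaining_ports(self) -> int:
        return self.read_ports - self.lent_ports

    def can_issue(self, num_sources: int) -> bool:
        # A microinstruction fits only if its sources do not exceed the remaining ports.
        return num_sources <= self.remaining_ports()


def crlu_borrow(alus: list, select: int) -> Alu:
    """Multiplexer model: `select` chooses which neighbouring ALU lends one read port."""
    target = alus[select]
    if target.remaining_ports() < 1:
        raise RuntimeError(f"{target.label} has no read port left to lend")
    target.lent_ports += 1
    return target


alu_401, alu_402 = Alu("ALU 401"), Alu("ALU 402")
target = crlu_borrow([alu_401, alu_402], select=1)  # the CRLU borrows from ALU 402
```

After the borrow, the lending ALU can still issue a single-source microinstruction in the same beat but not a two-source one, while the other ALU is unaffected.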
As shown in fig. 5, the instruction scheduling unit includes not only a plurality of ALUs and a plurality of AGUs, but also a newly added CRLU. The CRLU has a first PRF read port, and the desired single source operand is read from the PRF through the first PRF read port. In the example of FIG. 5, each ALU has two PRF read ports and each AGU has two PRF read ports. The first PRF read port of the CRLU is directly coupled to a second PRF read port of the ALU 501, i.e., the single source operand required by the CRLU can be read from the PRF through the first PRF read port.
In some examples, as shown in fig. 5, in the instruction scheduling method, a single source operand required for the CRLU may be read from the PRF through the first PRF read port, further comprising: one second PRF read port of the ALU 501 and the first PRF read port of the CRLU are multiplexed to read a single source operand from the PRF.
In the example of fig. 5, the CRLU temporarily borrows a single source operand read from the PRF by a PRF read port of the ALU 501 in the same beat, while the ALU 501, which has been temporarily borrowed by the CRLU with the single source operand, temporarily does not read the single source operand in the same beat.
For example, in some examples, the ALU 501 from which the CRLU has borrowed a single source operand may, without waiting, continue to schedule and execute other microinstructions having one or more source operands using its remaining PRF read ports (i.e., the PRF read ports of the ALU 501 other than the one coupled to the CRLU). Those other microinstructions must correspond to the ALU 501, and the number of source operands they require cannot exceed the number of remaining PRF read ports of the ALU 501.
As shown in fig. 6, the instruction scheduling unit includes not only a plurality of ALUs and a plurality of AGUs but also a newly added CRLU and a demultiplexer 6a. The first PRF read port of the CRLU and a second PRF read port of the ALU 601 are each coupled to an output of the demultiplexer 6a, and a desired single source operand is obtained from the PRF through an input of the demultiplexer 6a, such that one of the first PRF read port of the CRLU and a second PRF read port of the ALU 601 obtains the single source operand.
In some examples, as shown in fig. 6, in the instruction scheduling method, reading a single source operand from the PRF through the first PRF read port of the CRLU, further comprises: a single source operand that passes through an input of the demultiplexer 6a is output from an output of the demultiplexer 6a and is read by the first PRF read port of the CRLU.
In the example of fig. 6, the CRLU temporarily shares a single source operand read from the PRF by one PRF read port of ALU 601 in the same beat, while the ALU 601 that temporarily shares the single source operand does not read the single source operand in the same beat.
For example, in some examples, the ALU 601 whose read port is temporarily shared by the CRLU may, without waiting, continue to schedule and execute other microinstructions having one or more source operands using its remaining PRF read ports (i.e., the PRF read ports of the ALU 601 other than the one coupled to the CRLU). Those other microinstructions must correspond to the ALU 601, and the number of source operands they require cannot exceed the number of remaining PRF read ports of the ALU 601.
It is noted that the borrowing in fig. 4 and fig. 5 (the CRLU temporarily borrows one PRF read port of another ALU for a single source operand read from the PRF) and the sharing in fig. 6 (the CRLU temporarily shares one PRF read port of another ALU for a single source operand read from the PRF) are the same from the perspective of the scheduling algorithm: within a given beat, once one execution unit reads the single source operand, the other execution unit cannot read it again in that beat. They differ as follows. Borrowing means that the CRLU gets its turn to read the single source operand through its first PRF read port only if the lending ALU does not need to read it. Sharing means that the CRLU and the sharing ALU have equal opportunity and standing when reading the single source operand; which execution unit actually reads depends on the corresponding scheduling algorithm.
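The distinction between borrowing and sharing can be expressed as two arbitration functions. In the Python sketch below, the function names are hypothetical, and round-robin is only one assumed policy for the sharing case, since the text leaves the choice to the scheduling algorithm.

```python
def arbitrate_borrow(alu_wants: bool, crlu_wants: bool):
    """Borrowing: the ALU owns the port; the CRLU reads only when the ALU does not need it."""
    if alu_wants:
        return "ALU"
    return "CRLU" if crlu_wants else None


def arbitrate_share(alu_wants: bool, crlu_wants: bool, last_winner: str):
    """Sharing: both units have equal standing; round-robin is an assumed tie-break policy."""
    if alu_wants and crlu_wants:
        return "CRLU" if last_winner == "ALU" else "ALU"
    if alu_wants:
        return "ALU"
    return "CRLU" if crlu_wants else None


borrow_conflict = arbitrate_borrow(alu_wants=True, crlu_wants=True)
share_conflict = arbitrate_share(alu_wants=True, crlu_wants=True, last_winner="ALU")
```

In both functions exactly one unit reads the operand per beat; only the priority rule differs.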
As can be seen from fig. 4 and fig. 6, the above-mentioned other execution units having at least one PRF read port may be an ALU having two PRF read ports (i.e. two source operands are required), may also be execution units having more than 2 PRF read ports, and may also be execution units having one PRF read port, which is not limited and will not be described in detail in this disclosure.
In addition, as can be seen from fig. 4 and fig. 6, the following technical effects can be achieved by optimizing the instruction distribution and scheduling algorithm according to the embodiment of the present disclosure: the first PRF read port of the CRLU and one PRF read port of the other execution unit with at least one PRF read port multiplex one source operand read from the PRF such that after the first PRF read port reads a single source operand, the other execution unit with at least one PRF read port continues to schedule execution of other microinstructions with one or more source operands based on the remaining PRF read ports without waiting.
For example, in some examples, for step S1, the first task requires N source operands, N is an integer and N ≧ 2, i.e., the first execution unit includes N first PRF read ports. The instruction scheduling method further comprises the following steps: when the first task needs N source operands, reading each source operand from the PRF through the N first PRF read ports respectively. The following description specifically refers to a case where the first task requires 2 source operands and the first execution unit includes 2 first PRF read ports, and details of an example where N is greater than 2 are not repeated.
As shown in fig. 7, the instruction scheduling unit includes not only a plurality of ALUs and a plurality of AGUs, but also a newly added first execution unit 703 (e.g., a CRLU that requires a single source operand or an ALU that requires two source operands) and a plurality of demultiplexers. In the example of fig. 7, each ALU and AGU, except the first execution unit 703, has two PRF read ports.
In some examples, as shown in fig. 7, the first execution unit 703 may be an ALU 703 that requires two source operands, and correspondingly, the instruction dispatch unit includes at least two demultiplexers, a first demultiplexer 7a and a second demultiplexer 7b.
The ALU 703 includes two first PRF read ports, the ALU701 has two second PRF read ports, and the ALU702 has two third PRF read ports. A first PRF read port of the ALU 703 and a second PRF read port of the ALU701 are connected to the output of the first demultiplexer 7a, respectively, and a first source operand is obtained from the PRF through the input of the first demultiplexer 7a, such that one of the first PRF read port of the ALU 703 and the second PRF read port of the ALU701 obtains the first source operand. The other first PRF read port of the ALU 703 and a third PRF read port of the ALU702 are connected to the output of the second demultiplexer 7b, respectively, and a second source operand is obtained from the PRF through the input of the second demultiplexer 7b, so that one of the other first PRF read port of the ALU 703 and a third PRF read port of the ALU702 obtains the second source operand.
For example, in the example of fig. 7, in the instruction scheduling method, in response to 2 source operands being required, reading the respective source operands from the PRF through the two first PRF read ports includes:
a first source operand, passing through the input of the first demultiplexer 7a, is output from the output of the first demultiplexer 7a and read by a first PRF read port of ALU 703;
a second source operand, passing through an input of the second demultiplexer 7b, is output from an output of the second demultiplexer 7b and is read by another first PRF read port of the ALU 703.
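The routing through the two demultiplexers of fig. 7 can be sketched as follows. This is an illustrative model only: the function names, the boolean select signals, and the representation of the PRF as a dictionary from register number to value are all assumptions, not details given in the disclosure.

```python
def demux(value, to_unit_703):
    """1-to-2 demultiplexer: deliver value to exactly one of two outputs.

    Returns (output towards ALU 703, output towards the neighbouring ALU).
    """
    return (value, None) if to_unit_703 else (None, value)


def read_beat(prf, reg_a, reg_b, sel_7a, sel_7b):
    """Model one beat of operand delivery in fig. 7.

    prf maps register numbers to values (assumed representation).
    sel_7a / sel_7b are hypothetical select signals: True routes the operand
    to ALU 703, False routes it to ALU 701 (for 7a) or ALU 702 (for 7b).
    """
    to_703_first, to_701 = demux(prf[reg_a], sel_7a)
    to_703_second, to_702 = demux(prf[reg_b], sel_7b)
    return {
        "ALU703": (to_703_first, to_703_second),
        "ALU701": to_701,
        "ALU702": to_702,
    }
```

When both selects point at ALU 703, it receives both source operands in the same beat and the two neighbouring ALUs receive nothing through those ports.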
For example, in the example of fig. 7, within the same beat the first execution unit 703 temporarily shares the source operand read from the PRF through one second PRF read port of the ALU701 and the source operand read from the PRF through one third PRF read port of the ALU702, while the ALU701 and the ALU702 whose source operands are temporarily shared do not read the corresponding source operands in that beat.
For example, in some examples, the ALU701 whose source operand is temporarily shared by the first execution unit 703 may continue, without waiting, to schedule execution of other microinstructions with one or more source operands on the ALU701 using its remaining PRF read ports (i.e., the PRF read ports of the ALU701 other than the one connected to the output of the demultiplexer 7a), provided that the number of source operands needed by those microinstructions is not greater than the number of remaining PRF read ports of the ALU701.
For another example, in some examples, the ALU702 whose source operand is temporarily shared by the first execution unit 703 may continue to schedule execution of other microinstructions with one or more source operands on the ALU702 using its remaining PRF read ports (i.e., the PRF read ports of the ALU702 other than the one connected to the output of the demultiplexer 7b), provided that the number of source operands needed by those microinstructions is not greater than the number of remaining PRF read ports of the ALU702.
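The condition on the remaining read ports stated above amounts to a simple comparison. The sketch below is illustrative; the function name and parameters are hypothetical.

```python
def can_issue_on_lender(total_ports, ports_lent, operands_needed):
    """Check whether a microinstruction fits on an execution unit that has
    given up some of its PRF read ports for this beat.

    total_ports: PRF read ports the unit has (2 for the ALUs in fig. 7)
    ports_lent: ports currently used by the first execution unit
    operands_needed: source operands the candidate microinstruction needs
    """
    remaining = total_ports - ports_lent
    return operands_needed <= remaining
```

For a two-port ALU that has given up one port, a microinstruction with one source operand can still issue in the same beat, whereas one with two source operands cannot.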
As can be seen from the above description, sharing a source operand means that the first execution unit and the ALU it shares with have equal opportunity and standing when reading a source operand, and which of them reads depends entirely on the corresponding scheduling algorithm. Consequently, the sum N1 of the numbers of source operands required by the first execution unit 703 and by the two ALUs 701 and 702 whose source operands it temporarily shares must be less than or equal to the total number N2 of PRF read ports of the two ALUs 701 and 702.
For example, in the example of fig. 7, let n denote the total number of PRF read ports available to the ALU701, the ALU702, and the first execution unit 703. If the ALU701 includes 2 second PRF read ports and the ALU702 includes 2 third PRF read ports, then n = 2 + 2 = 4. When the ALU701 requires a source operands, the first execution unit 703 requires b source operands, and the ALU702 requires c source operands, their total is s = a + b + c, and s cannot exceed n.
For example, if the ALU701 needs 1 source operand, the first execution unit 703 needs 2 source operands, and the ALU702 needs 1 source operand, then s = 1 + 2 + 1 = 4, which does not exceed n, and normal instruction invocation can be implemented.
For another example, if the ALU701 needs 1 source operand, the first execution unit 703 needs 2 source operands, and the ALU702 needs 2 source operands, then s = 1 + 2 + 2 = 5, which exceeds n = 4, and normal instruction invocation cannot be implemented.
For another example, if the ALU701 needs 1 source operand, the ALU702 needs 1 source operand, and the first execution unit 703 needs only 1 source operand, then s = 1 + 1 + 1 = 3, which does not exceed n, and normal instruction invocation can be implemented, because some PRF read ports may simply be unused in some beats. Therefore, in the example of fig. 7, the first execution unit 703 may serve not only as an ALU executing microinstructions that require 2 source operands, but also as a CRLU executing microinstructions that require 1 source operand; the present disclosure does not limit this, and the choice can be adjusted freely according to actual needs.
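The three numeric cases above can be checked with a short sketch. The function name and the default port counts are assumptions for illustration. The sketch tests only the aggregate condition that the total number of required source operands does not exceed the total number of read ports; it does not model conflicts at an individual demultiplexer.

```python
def fits_read_port_budget(a, b, c, ports_701=2, ports_702=2):
    """Check the aggregate read-port budget of fig. 7 for one beat.

    a, b, c: source operands needed by ALU 701, the first execution
    unit 703, and ALU 702 respectively.
    Returns (ok, s, n) with s = a + b + c and n = ports_701 + ports_702.
    """
    n = ports_701 + ports_702
    s = a + b + c
    return s <= n, s, n


print(fits_read_port_budget(1, 2, 1))  # (True, 4, 4)
print(fits_read_port_budget(1, 2, 2))  # (False, 5, 4)
print(fits_read_port_budget(1, 1, 1))  # (True, 3, 4)
```

The printed tuples correspond, in order, to the three cases discussed in the text.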
Note that the one or more execution units temporarily shared by the first execution unit 703 may be not only the ALU having 2 PRF read ports described in the above example, but also other execution units having more than 2 PRF read ports, or a CRLU having only one PRF read port. When the execution unit temporarily shared by the first execution unit 703 has only one PRF read port, the first PRF read port of the first execution unit 703 can still read one source operand from the PRF; in that case, however, the temporarily shared execution unit cannot continue to schedule execution of other microinstructions that have source operands in the same beat.
FIG. 8 is a schematic diagram of an instruction dispatch unit with multiple additional execution units that do not require PRF write ports and borrow PRF read ports according to some embodiments of the present disclosure.
For example, in some examples, the instruction dispatch unit includes not only the plurality of ALUs and the plurality of AGUs, but also a plurality of newly added first execution units. For example, as shown in fig. 8, the instruction scheduling unit includes two newly added CRLU execution units, namely CRLU 801 and CRLU 802; the CRLU 801 is provided with a multiplexer 81a, and the CRLU 802 is provided with a multiplexer 82a. In the example of fig. 8, the arrangement of the CRLU 801 and the multiplexer 81a is the same as in the example of fig. 4, as is the arrangement of the CRLU 802 and the multiplexer 82a; for the specific design details, refer to the description relating to fig. 4, which is not repeated here.
For example, in some examples, the outputs of the CRLU 801 and the CRLU 802 are respectively connected to the input of the multiplexer 83b for outputting write data, and the output of the multiplexer 83b is connected to one write data path of the memory access unit, so as to respectively write the results of the processing performed by the CRLU 801 and the CRLU 802 into the memory access unit, as shown in fig. 8.
For another example, in some examples, for a design supporting 2 or more write data paths, the multiplexer 83b for write data output in fig. 8 is no longer needed; instead, the output of each CRLU is directly connected to a corresponding write data path of the memory access unit, and the result processed by each CRLU is written into the memory access unit.
It should be noted that embodiments of the present disclosure are not limited to adding the 2 CRLUs shown in fig. 8 to the instruction scheduling unit; more than 2 CRLUs may also be added. Likewise, the plurality of first execution units connected to the multiplexer 83b for write data output are not limited to CRLUs designed as in the example of fig. 8, but may also be execution units such as CRLUs or ALUs designed as in any other example; the embodiments of the present disclosure do not limit this.
As shown in fig. 4 to fig. 8, the first execution units added in the instruction scheduling unit borrow or share the source operands of the ALU execution units adjacent to the first execution units, but the first execution units of the instruction scheduling unit may also borrow or share the source operands of other execution units (e.g., ALU or AGU) that are not adjacent to the first execution units.
The instruction scheduling method according to the above embodiments of the present disclosure provides an execution unit that does not require a PRF write port, and selects and operates that execution unit through a microinstruction having a control field. This increases the overall execution bandwidth of the instruction scheduling unit of the CPU core and better supports the throughput that simultaneous multithreading demands of the instruction scheduling unit. It also better resolves the conflict between the extra area needed for added execution units and the limited silicon area budget, and avoids the area cost, routing difficulty, and timing degradation that the complex routing of additional dedicated PRF read/write ports would cause. It should be noted that, in the embodiments of the present disclosure, for the specific functions and technical effects of the instruction scheduling unit, reference may be made to the description above of the instruction scheduling method, and details are not repeated here.
The following points need to be explained:
(1) The drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to common designs.
(2) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only a specific embodiment of the present disclosure, but the scope of the present disclosure is not limited thereto, and the scope of the present disclosure should be subject to the scope of the claims.

Claims (25)

1. An instruction scheduling method, comprising:
generating a first microinstruction according to a first task that needs to be executed, wherein the first task does not require a target operand, the first microinstruction comprising a control domain;
selecting according to the control domain, and distributing the first microinstruction to a first instruction scheduling queue;
providing the first microinstruction from the first instruction dispatch queue to a first execution unit for processing, wherein the first execution unit does not have a physical register file write port.
2. The instruction scheduling method of claim 1,
the first task requires at most a single source operand, and the first execution unit does not have a physical register file read port or the first execution unit includes a single first physical register file read port.
3. The instruction scheduling method of claim 2, further comprising:
in response to a single source operand being required, the single source operand is read from a physical register file through the first physical register file read port.
4. The instruction scheduling method of claim 3,
reading the single source operand from a physical register file through the first physical register file read port, comprising:
the first physical register file read port of the first execution unit and one of the other execution units having at least one physical register file read port multiplex one source operand read from a physical register file such that the first physical register file read port reads the one source operand as the single source operand required by the first execution unit.
5. The instruction scheduling method of claim 4,
a second execution unit having at least a second physical register file read port, the first physical register file read port of the first execution unit coupled with a second physical register file read port of the second execution unit,
reading the single source operand from a physical register file through the first physical register file read port, comprising:
multiplexing a second physical register file read port of the second execution unit and the first physical register file read port of the first execution unit to read the single source operand from the physical register file.
6. The instruction scheduling method of claim 4,
a second execution unit having at least one second physical register file read port, a third execution unit having at least one third physical register file read port, the first physical register file read port of the first execution unit being coupled to one of the second physical register file read port of the second execution unit and one of the third physical register file read port of the third execution unit via a multiplexer, respectively, such that a target operand is obtainable via one of the second physical register file read port of the second execution unit and one of the third physical register file read port of the third execution unit,
reading, by the first physical register file read port, the single source operand from the physical register file, comprising:
obtaining the single source operand from the physical register file via one of a second physical register file read port of the second execution unit and a third physical register file read port of the third execution unit, and then obtaining the single source operand from the first execution unit via the first physical register file read port.
7. The instruction scheduling method of claim 4,
a second execution unit having at least one second physical register file read port, said first physical register file read port of said first execution unit and one second physical register file read port of said second execution unit being connected to an output of a demultiplexer, respectively, to retrieve said single source operand from said physical register file via an input of said demultiplexer such that one of said first physical register file read port of said first execution unit and one second physical register file read port of said second execution unit retrieves said single source operand,
reading, by the first physical register file read port, the single source operand from the physical register file, comprising:
the single source operand passing through an input of the demultiplexer is output from an output of the demultiplexer and read by the first physical register file read port of the first execution unit.
8. The instruction scheduling method of claim 4, wherein,
the other execution units having at least one physical register file read port have two or more physical register file read ports,
the first physical register file read port of the first execution unit and one of the other execution units with at least one physical register file read port multiplex one source operand read from the physical register file so that after the first physical register file read port reads the single source operand, the other execution units with at least one physical register file read port continue to schedule execution of other microinstructions with one or more source operands based on the remaining physical register file read ports without waiting.
9. The instruction scheduling method of claim 1,
the first task requires N source operands, the first execution unit includes N first physical register file read ports, N is an integer and N ≧ 2, wherein respective source operands are read from the physical register file through the N first physical register file read ports, respectively, in response to the N source operands being required.
10. The instruction scheduling method of claim 9,
the first execution unit includes 2 first physical register file read ports, the second execution unit has at least one second physical register file read port, the third execution unit has at least one third physical register file read port,
a first physical register file read port of the first execution unit and a second physical register file read port of the second execution unit are respectively connected with an output end of a first demultiplexer, a first source operand is obtained from the physical register file through an input end of the first demultiplexer, so that one of the first physical register file read port of the first execution unit and a second physical register file read port of the second execution unit obtains the first source operand,
the other first physical register file read port of the first execution unit and the third physical register file read port of the third execution unit are respectively connected with the output end of a second demultiplexer, and a second source operand is obtained from the physical register file through the input end of the second demultiplexer, so that one of the other first physical register file read port of the first execution unit and the third physical register file read port of the third execution unit obtains the second source operand;
for responding to the requirement of 2 source operands, respectively reading each source operand from the physical register file through 2 first physical register file reading ports, comprising:
said first source operand passing through an input of said first demultiplexer is output from an output of said first demultiplexer and read by said one first physical register file read port of said first execution unit;
said second source operand passing through an input of said second demultiplexer is output from an output of said second demultiplexer and read by said another first physical register file read port of said first execution unit.
11. The instruction scheduling method of claim 10,
the second execution unit has two or more second physical register file read ports, wherein the second execution unit continues, without waiting, to schedule execution of other microinstructions having one or more source operands based on the second physical register file read ports remaining after removing the one coupled to the first demultiplexer;
the third execution unit has two or more third physical register file read ports, wherein the third execution unit continues, without waiting, to schedule execution of other microinstructions having one or more source operands based on the third physical register file read ports remaining after removing the one coupled to the second demultiplexer.
12. The instruction scheduling method according to any one of claims 1 to 11,
the number of the first execution units is greater than or equal to two;
the output ends of the at least two first execution units are connected with the input end of a multiplexer for writing data output, the output end of the multiplexer for writing data output is connected with a data writing path of the memory access unit, and the results processed by the at least two first execution units are written into the memory access unit respectively; or the output end of each first execution unit is respectively connected with the data writing path of the memory access unit, and the result processed by each first execution unit is written into the memory access unit.
13. The instruction scheduling method of claim 8, further comprising:
generating a second microinstruction based on a second task that needs to be executed, wherein the second task requires a target operand and requires more than two source operands;
issuing the second microinstruction to a second instruction dispatch queue;
and providing the second microinstruction from the second instruction scheduling queue to the other execution unit with at least one physical register file read port for processing, wherein the other execution unit with at least one physical register file read port has a physical register file write port.
14. The instruction scheduling method of claim 11, further comprising:
generating a second microinstruction based on a second task to be executed, wherein the second task requires a target operand and requires more than two source operands;
distributing the second microinstruction to a second instruction scheduling queue;
providing the second microinstruction from the second instruction dispatch queue to the second execution unit and/or the third execution unit for processing, wherein the second execution unit and/or the third execution unit has a physical register file write port.
15. The instruction scheduling method according to claim 13 or 14,
the second microinstruction does not comprise a control domain, or the second microinstruction is provided with a control domain different from the control domain of the first microinstruction;
wherein different control domains are set for the second microinstruction, and the second microinstruction is distributed to a second instruction scheduling queue, including:
and selecting according to the different control domains, and distributing the second micro-instruction to the second instruction scheduling queue.
16. The instruction scheduling method according to claim 13 or 14,
the first instruction scheduling queue and the second instruction scheduling queue are discrete instruction scheduling queues, or the first instruction scheduling queue and the second instruction scheduling queue belong to a unified instruction scheduling queue.
17. A processor comprising an instruction dispatch unit, the instruction dispatch unit comprising:
a first instruction scheduling queue configured to select according to a control domain of a first microinstruction, receive the distributed first microinstruction, wherein the first microinstruction is generated according to a first task that needs to be executed, and a target operand is not needed by the first task;
a first execution unit configured to receive the first microinstruction from the first instruction dispatch queue for processing, wherein the first execution unit does not have a physical register file write port.
18. The processor of claim 17,
the first execution unit has no physical register file read port or the first execution unit includes a single first physical register file read port.
19. The processor of claim 18, wherein the instruction dispatch unit further comprises a second execution unit having at least a second physical register file read port, the first physical register file read port of the first execution unit coupled with a second physical register file read port of the second execution unit.
20. The processor of claim 18, wherein the instruction dispatch unit further comprises a second execution unit, a third execution unit, and a multiplexer,
the second execution unit has at least one second physical register file read port, the third execution unit has at least one third physical register file read port, the first physical register file read port of the first execution unit is coupled with one second physical register file read port of the second execution unit and one third physical register file read port of the third execution unit, respectively, through the multiplexer.
21. The processor of claim 18, wherein the instruction dispatch unit further comprises a second execution unit and a demultiplexer,
the second execution unit has at least one second physical register file read port, the first physical register file read port of the first execution unit and one second physical register file read port of the second execution unit are respectively connected with the output end of the demultiplexer, and a single source operand is obtained from the physical register file through the input end of the demultiplexer, so that one of the first physical register file read port of the first execution unit and one second physical register file read port of the second execution unit obtains the single source operand.
22. The processor of claim 17, wherein the instruction dispatch unit further comprises a second execution unit, a third execution unit, a first demultiplexer and a second demultiplexer,
the first execution unit includes 2 first physical register file read ports, the second execution unit has at least one second physical register file read port, the third execution unit has at least one third physical register file read port,
a first physical register file read port of the first execution unit and a second physical register file read port of the second execution unit are respectively connected with an output end of the first demultiplexer, a first source operand is obtained from the physical register file through an input end of the first demultiplexer, so that one of the first physical register file read port of the first execution unit and a second physical register file read port of the second execution unit obtains the first source operand,
the other first physical register file read port of the first execution unit and the third physical register file read port of the third execution unit are respectively connected with the output end of the second demultiplexer, and a second source operand is obtained from the physical register file through the input end of the second demultiplexer, so that one of the other first physical register file read port of the first execution unit and the third physical register file read port of the third execution unit obtains the second source operand.
23. The processor of claim 17, further comprising a decode and distribute unit,
the instruction scheduling unit is connected with the decoding and distributing unit, and the decoding and distributing unit decodes an input instruction to generate a microinstruction with a control domain and distributes the microinstruction with the control domain.
24. The processor of claim 17,
the first instruction scheduling queue is a discrete instruction scheduling queue, or the first instruction scheduling queue belongs to at least one part of a unified instruction scheduling queue.
25. The processor of any one of claims 17 to 24, wherein the processor comprises a central processing unit.
CN202011253606.8A 2020-11-11 2020-11-11 Instruction scheduling method and processor comprising instruction scheduling unit Active CN112379928B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011253606.8A CN112379928B (en) 2020-11-11 2020-11-11 Instruction scheduling method and processor comprising instruction scheduling unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011253606.8A CN112379928B (en) 2020-11-11 2020-11-11 Instruction scheduling method and processor comprising instruction scheduling unit

Publications (2)

Publication Number Publication Date
CN112379928A CN112379928A (en) 2021-02-19
CN112379928B true CN112379928B (en) 2023-04-07

Family

ID=74582031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011253606.8A Active CN112379928B (en) 2020-11-11 2020-11-11 Instruction scheduling method and processor comprising instruction scheduling unit

Country Status (1)

Country Link
CN (1) CN112379928B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114461278B (en) * 2022-04-13 2022-06-21 海光信息技术股份有限公司 Method for operating instruction scheduling queue, operating device and electronic device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5542058A (en) * 1992-07-06 1996-07-30 Digital Equipment Corporation Pipelined computer with operand context queue to simplify context-dependent execution flow
CN103440210A (en) * 2013-08-21 2013-12-11 复旦大学 Register file reading and isolating method controlled by asynchronous clock

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU5634086A (en) * 1985-05-06 1986-11-13 Wang Laboratories, Inc. Information processing system with enhanced instruction execution and support control
US6304954B1 (en) * 1998-04-20 2001-10-16 Rise Technology Company Executing multiple instructions in multi-pipelined processor by dynamically switching memory ports of fewer number than the pipeline
EP1761844A2 (en) * 2004-06-25 2007-03-14 Koninklijke Philips Electronics N.V. Instruction processing circuit
DE102006025713B9 (en) * 2005-10-28 2013-10-17 Infineon Technologies Ag Cryptographic device and cryptographic method for calculating a result of a modular multiplication
EP2695055B1 (en) * 2011-04-07 2018-06-06 VIA Technologies, Inc. Conditional load instructions in an out-of-order execution microprocessor
CN104615409B (en) * 2014-05-27 2017-07-07 上海兆芯集成电路有限公司 The method jumped over the processor of MOV instruction and used by the processor
US20170116154A1 (en) * 2015-10-23 2017-04-27 The Intellisis Corporation Register communication in a network-on-a-chip architecture
CN106126336B (en) * 2016-06-17 2019-06-04 上海兆芯集成电路有限公司 Processor and dispatching method
CN106155814B (en) * 2016-07-04 2019-04-05 合肥工业大学 A kind of reconfigurable arithmetic unit that supporting multiple-working mode and its working method
CN108415730B (en) * 2018-01-30 2021-06-01 上海兆芯集成电路有限公司 Micro instruction scheduling method and device using same
GB2580316B (en) * 2018-12-27 2021-02-24 Graphcore Ltd Instruction cache in a multi-threaded processor

US10853077B2 (en) Handling Instruction Data and Shared resources in a Processor Having an Architecture Including a Pre-Execution Pipeline and a Resource and a Resource Tracker Circuit Based on Credit Availability
EP0496407A2 (en) Parallel pipelined instruction processing system for very long instruction word
US7437544B2 (en) Data processing apparatus and method for executing a sequence of instructions including a multiple iteration instruction
Baudisch et al. Evaluation of speculation in out-of-order execution of synchronous dataflow networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant