CN115390924A - Instruction execution method, execution engine, processor, chip and electronic equipment - Google Patents

Instruction execution method, execution engine, processor, chip and electronic equipment

Info

Publication number
CN115390924A
Authority
CN
China
Prior art keywords
data
logical operation
pipeline
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210981393.3A
Other languages
Chinese (zh)
Inventor
崔泽汉
王博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haiguang Information Technology Co Ltd
Original Assignee
Haiguang Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haiguang Information Technology Co Ltd filed Critical Haiguang Information Technology Co Ltd
Priority to CN202210981393.3A priority Critical patent/CN115390924A/en
Publication of CN115390924A publication Critical patent/CN115390924A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3867 Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
    • G06F9/3885 Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The embodiments of the present application provide an instruction execution method, an execution engine, a processor, a chip and an electronic device. The method includes: inputting data of a plurality of instructions into an execution engine through a plurality of pipelines, with the data of one instruction input into one pipeline, where the logical operation unit in the execution engine is divided into a plurality of logical operation groups and one logical operation group is configured to one pipeline for its exclusive use; selecting the data input by each pipeline into the logical operation group configured for that pipeline, so that each logical operation group performs a logical operation on the data input by its pipeline; and outputting the logical operation result of each logical operation group. The embodiments can reduce the extent to which the data bit-width resources of the logical operation unit sit idle and are wasted, and improve the resource utilization of the logical operation unit.

Description

Instruction execution method, execution engine, processor, chip and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of processors, in particular to an instruction execution method, an execution engine, a processor, a chip and electronic equipment.
Background
To improve performance, a processor may adopt a superscalar architecture to increase its instruction-level parallelism. A superscalar architecture can be regarded as a processor design in which a processor using it (a superscalar processor for short) can execute multiple instructions in one clock cycle.
When an execution engine of the superscalar processor executes an instruction, a logical operation unit in the execution engine performs logical operations on the data that the instruction operates on (the instruction's data for short). How to improve the resource utilization of the logical operation unit thus becomes a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, the present disclosure provides an instruction execution method, an execution engine, a processor, a chip and an electronic device, so as to improve resource utilization of a logic operation unit.
In order to achieve the above object, the embodiments of the present application provide the following technical solutions.
In a first aspect, an embodiment of the present application provides an instruction execution method, including:
inputting data of a plurality of instructions into an execution engine through a plurality of pipelines, wherein the data of one instruction is input into one pipeline; the logical operation unit in the execution engine is divided into a plurality of logical operation groups, and one logical operation group is configured to one pipeline for its exclusive use;
selecting the data input by each pipeline into the logical operation group configured for that pipeline, so that each logical operation group performs a logical operation on the data input by its pipeline;
and outputting the logical operation result of each logical operation group.
In a second aspect, an embodiment of the present application provides an execution engine, including: a data selector and a logical operation unit; the logical operation unit is divided into a plurality of logical operation groups, and one logical operation group is configured to one pipeline for its exclusive use;
the execution engine acquires data of a plurality of instructions input by a plurality of pipelines in the instruction issue stage, with one pipeline inputting the data of one instruction;
the data selector is configured to select the data input by each pipeline into the logical operation group correspondingly configured for that pipeline;
the logical operation group is configured to obtain the data input by its corresponding pipeline from the data selector, perform a logical operation on that data, and output the logical operation result.
In a third aspect, an embodiment of the present application provides a processor including the execution engine as described above.
In a fourth aspect, an embodiment of the present application provides a chip including the processor as described above.
In a fifth aspect, an embodiment of the present application provides an electronic device including the chip as described above.
According to the instruction execution method provided by the embodiments of the present application, a logical operation unit in an execution engine can be divided into a plurality of logical operation groups, and one logical operation group is configured to one pipeline for its exclusive use. On this basis, data of a plurality of instructions can be input into the execution engine through a plurality of pipelines, with the data of one instruction input into one pipeline. Based on the configuration relationship between the pipelines and the logical operation groups, the data input by each pipeline can be selected into the logical operation group configured for that pipeline, so that each logical operation group performs a logical operation on the data input by its pipeline. The embodiments of the present application can then output the logical operation result of each logical operation group to complete instruction execution.
It can be seen that, with the logical operation unit divided into a plurality of logical operation groups, each pipeline in the processor that supports logical operations can use one logical operation group exclusively to perform logical operations on data. After the data input by each pipeline is selected into its correspondingly configured logical operation group, the plurality of logical operation groups can process the data input by the plurality of pipelines in parallel. Because different logical operation groups correspond to different data bit-width ranges of the logical operation unit, the data bit-width resources in different ranges of the logical operation unit can process the data input by the plurality of pipelines in parallel; the data bit-width resources of the logical operation unit are thereby fully utilized, the extent to which they sit idle and are wasted is reduced, and the resource utilization of the logical operation unit is improved.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a diagram illustrating an example of two pipelines respectively occupying a logical operation unit.
FIG. 2 is another exemplary diagram of two pipelines respectively occupying logical operation units.
FIG. 3A is a diagram illustrating an example of resource grouping for logical operation units.
FIG. 3B is a diagram of an example of a unit execution unit of the logical operation unit.
FIG. 3C is a flow chart of a method for dividing logical operation units into logical operation groups.
Fig. 3D is an exemplary diagram of a logical operation group of the logical operation unit.
FIG. 4 is a flow chart of a method of instruction execution.
FIG. 5A is a diagram of a system architecture that enables instruction execution.
FIG. 5B is another system architecture diagram that enables instruction execution.
FIG. 5C is yet another system architecture diagram for implementing instruction execution.
FIG. 6 is a diagram of an exemplary implementation of a method for instruction execution.
FIG. 7 is another flowchart of a method of instruction execution.
FIG. 8 is a diagram of another exemplary implementation of a method for instruction execution.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
After an instruction is decoded, the superscalar processor may issue the decoded instruction (also referred to as a microinstruction) to an execution engine for execution by way of multi-issue. That is, multiple pipelines may input the data of multiple instructions to the execution engine during the same instruction issue cycle. The instructions referred to herein may be SIMD (Single Instruction Multiple Data) instructions. SIMD is a technique for improving data-level parallelism: a single instruction performs the same operation on multiple data items. That is, a superscalar processor using SIMD technology can initiate the same operation on multiple data items with a single instruction, thereby improving the instruction execution efficiency of the processor and reducing its power consumption.
When an instruction is executed, the logical operation unit in the execution engine may perform logical operations on data input by multiple pipelines. To save area in a superscalar processor, each pipeline is not provided with its own fully functional logical operation unit; instead, only some pipelines are provided with a logical operation unit of a given function, or multiple pipelines compete for the hardware resources of the same logical operation unit to process the data they input. For example, clock cycles of the logical operation unit may be distributed among the pipelines, so that each pipeline occupies the logical operation unit to process data in its own clock cycles. The embodiments of the present application mainly aim at optimizing the situation in which multiple pipelines compete for the hardware resources of the same logical operation unit.
For ease of understanding, fig. 1 shows an example in which two pipelines occupy a logical operation unit in turn. As shown in FIG. 1, data 101 and data 102 are data on which instructions initiate operations (for example, data on which SIMD instructions initiate the same operation); data 101 is input to the execution engine via pipeline 111, and data 102 is input to the execution engine via pipeline 112.
in the execution engine, the logic operation unit 113 is occupied by the pipeline 111 and the pipeline 112, respectively, so that input data of the pipeline 111 and the pipeline 112 needs to be controlled by a pipeline selection signal, and is input to the logic operation unit 113 after data selection is performed by the data selector 114; that is, the data 101 input by the pipeline 111 and the data 102 input by the pipeline 112 need to be selected by the data selector 114 and then input into the logical operation unit 113;
when a plurality of pipelines respectively occupy the logic operation unit, the logic operation unit can be taken as a whole and is distributed to one pipeline for independent use in one clock period; that is, the data selector selects data input by a pipeline in one clock cycle and sends the data to the logic operation unit for processing; the data selector 114 can therefore select either the data 101 input by the pipeline 111 or the data 102 input by the pipeline 112 to the logic operation unit 113 based on the pipeline that the logic operation unit 113 is assigned to use at the current clock cycle; for example, the logic operation unit 113 is allocated to the pipeline 111 for independent use in the current clock cycle, and the data selector 114 can select the data 101 to the logic operation unit 113 for logic operation; after the logical operation unit 113 completes the logical operation of the data 101, the data selector 114 may select the data 102 to the logical operation unit 113 for the logical operation. That is, under the condition that the logic operation unit cannot be executed in a pipeline, the delay of the logic operation unit for processing the data input by the next pipeline is determined by the time for the logic operation unit to complete the logic operation; for example, the logical operation unit is used as a divider for executing division operation, and the divider cannot be executed in a pipelined manner as an example, the time for the divider to complete the division operation is 11 clock cycles, and the delay for the divider to process the data input by the next pipeline is 11 cycles; it should be noted that the specific delay value is only an example.
It can be understood that when multiple pipelines occupy a logical operation unit in turn, with the logical operation unit used by one pipeline in one clock cycle, then whenever the data bit width of the data input by that pipeline is lower than the data bit width of the logical operation unit (for example, lower than the maximum data bit width the logical operation unit supports), the computing resources in the high data bits of the logical operation unit are wasted, and the logical operation unit suffers from low resource utilization.
It should be noted that instructions (e.g., SIMD instructions) may need to operate on data of different bit widths, so the logical operation unit needs to support logical operations at different data bit widths. For example, if the bit width of a single-precision floating-point number is 32 bits and the bit width of a floating-point register is 256 bits, the logical operation unit may need to process instructions with a data bit width of 32 bits (one single-precision floating-point number), 64 bits (two single-precision floating-point numbers), 128 bits (four single-precision floating-point numbers), and 256 bits (eight single-precision floating-point numbers). Accordingly, the logical operation unit needs to support logical operations on floating-point data of 32-bit, 64-bit, 128-bit, and 256-bit data bit widths. On this basis, if the logical operation unit processes the data input by one pipeline in one clock cycle, then whenever the data bit width of that data is lower than the data bit width of the logical operation unit, the data bit-width resources in the high bits of the logical operation unit sit idle, and the resource utilization of the logical operation unit is low.
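The element counts above follow from simple division. As a sketch (the 32-bit element size and 256-bit register width are the example's figures, not fixed by the scheme):

```python
# Sketch based on the example above: how many 32-bit single-precision
# elements fit in each instruction data bit width that a 256-bit logical
# operation unit must support. Values are the example's, not mandatory.
SINGLE_PRECISION_BITS = 32  # bit width of one single-precision float

def element_count(instruction_bits: int,
                  element_bits: int = SINGLE_PRECISION_BITS) -> int:
    """Number of equal-width elements packed into one instruction's data."""
    assert instruction_bits % element_bits == 0
    return instruction_bits // element_bits

for width in (32, 64, 128, 256):
    print(width, element_count(width))  # 1, 2, 4 and 8 elements respectively
```

The same helper gives the double-precision case by passing `element_bits=64`.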
In one example, building on fig. 1, and taking a logical operation unit that supports 256-bit logical operations with two pipelines each inputting 128-bit data, fig. 2 shows another example of two pipelines occupying the logical operation unit in turn. With reference to fig. 1 and fig. 2, the data bit width of the logical operation unit 113 is 256 bits, the data 101 input by pipeline 111 is 128 bits, and the data 102 input by pipeline 112 is 128 bits. When the data 101 and the data 102 are input into the data selector 114, if the logical operation unit 113 is allocated to pipeline 111 for exclusive use in the current clock cycle, the data selector 114 selects the data 101 into the logical operation unit 113 for logical operation; the logical operation unit 113 then performs the logical operation on the data 101 using the data bit-width resource of the lower 128 bits, while the data bit-width resource of the upper 128 bits sits idle.
After the logical operation unit 113 completes the logical operation on the data 101, the data selector 114 may select the data 102 into the logical operation unit 113, so that the logical operation unit 113 performs the logical operation on the data 102 using the data bit-width resource of the lower 128 bits, while the data bit-width resource of the upper 128 bits remains idle. That is, the logical operation unit 113 processes the data 101 and the data 102 using only the data bit-width resource of the lower 128 bits, and the data bit-width resource of the upper 128 bits of the logical operation unit 113 is wasted.
It can be seen from the above that when multiple pipelines compete for the hardware resources of the same logical operation unit, if the logical operation unit as a whole can only be used by one pipeline at a time (for example, per clock cycle), then whenever the data bit width of the data input by that pipeline is lower than the data bit width of the logical operation unit, the unit's data bit-width resources sit idle and are wasted, and the logical operation unit suffers from low resource utilization.
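The waste described above can be quantified with a small, illustrative utilization model (a sketch using the 256-bit/128-bit example figures, not part of the patent itself):

```python
# Illustrative only: fraction of a logical operation unit's bit-width
# resource actually used when pipelines occupy it one per clock cycle.
def utilization(unit_bits: int, data_bits_per_cycle: list) -> float:
    """Used bits divided by (unit width * number of cycles)."""
    used = sum(data_bits_per_cycle)
    available = unit_bits * len(data_bits_per_cycle)
    return used / available

# Two 128-bit inputs serialized over two cycles on a 256-bit unit: 50%.
print(utilization(256, [128, 128]))  # 0.5
# The same two inputs handled in one cycle by two 128-bit groups: 100%.
print(utilization(256, [128 + 128]))  # 1.0
```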
Based on this, the embodiments of the present application provide an improved instruction execution scheme: by grouping the resources of the logical operation unit, the logical operation unit is divided into a plurality of logical operation groups, and one logical operation group is configured to one pipeline for its exclusive use. The pipelines configured with logical operation groups can thus use their respective groups simultaneously (that is, the pipelines multiplex the logical operation unit at the same time instead of each using it exclusively in different clock cycles), ensuring that the data bit-width resources of the logical operation unit are fully utilized and improving its resource utilization.
Based on this idea, the embodiments of the present application may divide the logical operation unit into a plurality of logical operation groups and configure one logical operation group to one pipeline for its exclusive use, where the number of logical operation groups may be less than or equal to the number of pipelines in the processor. For example, with one logical operation group configured for exclusive use by one pipeline, the number of logical operation groups into which the logical operation unit is divided may be less than or equal to the number of pipelines in the superscalar processor. In one example, with 4 pipelines in the superscalar processor, the logical operation unit can be divided into 4 logical operation groups, one configured to each pipeline for exclusive use, so that all 4 pipelines are configured with a logical operation group. In another example, the logical operation unit may be divided into 2 logical operation groups configured to two of the 4 pipelines, one group to one pipeline for exclusive use, while the other two pipelines are not configured with a logical operation group. As an alternative implementation, the number of logical operation groups may be set to a value less than or equal to the number of pipelines, and the logical operation unit may then be divided into that number of logical operation groups.
As an alternative implementation, fig. 3A shows an example of resource grouping of the logical operation unit. As shown in fig. 3A, the logical operation unit 113 may be divided into n logical operation groups 311 to 31n, with one logical operation group configured to one pipeline for its exclusive use. Each logical operation group performs logical operations on data through its share of the data bit-width resource, and the number n of groups may be less than or equal to the number of pipelines in the superscalar processor.
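One way to picture the grouping in fig. 3A is as a partition of the unit's bit range. The equal-width split below is an assumption for illustration; as noted later, the scheme also allows groups of unequal size:

```python
# Hypothetical sketch: split a logical operation unit's [0, unit_bits)
# bit-width resource into n contiguous, equal groups, one per pipeline.
def group_bit_ranges(unit_bits: int, n_groups: int):
    """Return the (low, high) bit range of each logical operation group."""
    assert unit_bits % n_groups == 0
    width = unit_bits // n_groups
    return [(i * width, (i + 1) * width) for i in range(n_groups)]

# A 256-bit unit shared by two pipelines as two 128-bit groups.
print(group_bit_ranges(256, 2))  # [(0, 128), (128, 256)]
```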
In one example, if the number of logical operation groups into which the logical operation unit is divided corresponds to the number of pipelines in the processor, the embodiments of the present application may divide the data bit-width resource of the logical operation unit 113 into groups according to the number of pipelines, thereby obtaining logical operation groups whose number corresponds to the number of pipelines. Since one logical operation group is configured for the exclusive use of one pipeline, the data input by each pipeline configured with a logical operation group can undergo logical operations in the group it uses exclusively.
In some embodiments, as an alternative implementation of dividing the logical operation unit into a plurality of logical operation groups, the embodiments of the present application may combine the unit execution units of the logical operation unit to obtain the logical operation groups. The logical operation unit may be formed of a plurality of mutually independent, replicated unit execution units: a unit execution unit can be regarded as a unit-bit-width logical operation circuit, and a plurality of replicated unit-bit-width logical operation circuits form the logical operation unit. That is, as a circuit for performing logical operations, the logical operation unit may be formed of a plurality of independent, repeated unit circuits, where a unit circuit is the unit execution unit described above.
It should be noted that the unit bit width of a unit execution unit can be regarded as the minimum bit width at which the logical operation unit can independently complete a data operation. For example, one unit-bit-width logical operation circuit may be the minimum-bit-width logical operation circuit with which the logical operation unit performs data operations, and each unit-bit-width logical operation circuit can independently perform a logical operation on a single data item.
In one example, fig. 3B shows an example of the unit execution units of a logical operation unit. As shown in fig. 3B, taking a logical operation unit with a 256-bit data bit width and unit execution units with a 32-bit data bit width (that is, a 32-bit unit execution unit can independently complete a 32-bit logical operation), the logical operation unit may be formed of 8 32-bit unit execution units. For example, if the unit execution units performing division in a single-precision floating-point divider have a 32-bit data bit width, a 256-bit single-precision floating-point divider may be formed of 8 32-bit unit execution units.
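The fig. 3B composition can be sketched as follows; this models only the replication count, not the divider's internal circuitry:

```python
# Sketch: a logical operation unit built from replicated unit execution
# units, per the 256-bit / 32-bit example in the text.
class LogicalOperationUnit:
    def __init__(self, total_bits: int, unit_bits: int = 32):
        assert total_bits % unit_bits == 0
        self.unit_bits = unit_bits
        # number of independent, replicated unit execution units
        self.n_units = total_bits // unit_bits

divider = LogicalOperationUnit(256)  # 256-bit single-precision divider
print(divider.n_units)  # 8 unit execution units of 32 bits each
```

For double-precision operands, the same model applies with `unit_bits=64`, giving 4 unit execution units.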
It should be noted that the 32-bit data bit width of the unit execution unit is only an example; the specific value may be determined according to the actual situation. For example, for 32-bit single-precision instructions (e.g., SIMD instructions), the data bit width (i.e., unit bit width) of the unit execution units in the logical operation unit may be no less than 32 bits, while for 64-bit double-precision instructions (e.g., SIMD instructions), the data bit width of the unit execution units may be no less than 64 bits. It should further be noted that the logical operation unit may support multiple floating-point data formats, for example single-precision floating-point data (32 bits per item) and double-precision floating-point data (64 bits per item). It can be appreciated that for a 256-bit instruction (e.g., a SIMD instruction), the data comprises 8 items if single-precision floating-point data is processed and 4 items if double-precision floating-point data is processed.
As an alternative implementation, based on the plurality of unit execution units forming the logical operation unit, the embodiments of the present application may combine the unit execution units to obtain a number of logical operation groups equal to or less than the number of pipelines of the processor. For example, when the number of logical operation groups equals the number of pipelines of the processor, the embodiments may combine the unit execution units according to the unit execution units of the logical operation unit and the number of pipelines, obtaining logical operation groups whose number corresponds to the number of pipelines. In the embodiments of the present application, one logical operation group can be regarded as a combination of several unit execution units; accordingly, the data bit width of one logical operation group is the sum of the data bit widths of the combined unit execution units.
As an alternative implementation, fig. 3C shows a flowchart of an optional method, provided by an embodiment of the present application, for dividing a logical operation unit into logical operation groups. Optionally, the method flow may be carried out during the processor design phase (e.g., the superscalar processor design phase); in other possible implementations, the method flow may be carried out while the processor is running, for example, the processor may divide the logical operation unit into a plurality of logical operation groups by dynamically adjusting the connections and links between hardware circuits.
Referring to fig. 3C, the method flow may include the following steps.
In step S31, the plurality of unit execution units of the logical operation unit are determined; one unit execution unit supports logical operations on data of the unit bit width, which is the data bit width of a unit execution unit.
The logical operation unit may be formed of a plurality of unit execution units; the data bit width of one unit execution unit may be referred to as the unit bit width of the logical operation unit, and one unit execution unit supports logical operations on data of the unit bit width. For example, a single-precision floating-point divider may be formed of a plurality of unit execution units performing division, with one unit execution unit supporting division on 32-bit data. The embodiments of the present application may determine the plurality of unit execution units forming the logical operation unit; for example, a 256-bit divider is formed of 8 32-bit unit execution units performing division.
In step S32, according to the number of logical operation groups, the unit execution units are combined to obtain a number of logical operation groups equal to or less than the number of pipelines; the data bit width of one logical operation group is the sum of the data bit widths of the combined unit execution units.
After determining the plurality of unit execution units of the logical operation unit, the embodiments of the present application may, based on the set number of logical operation groups (equal to or less than the number of pipelines), combine the unit execution units to obtain that number of logical operation groups, and then configure one logical operation group to one pipeline supporting logical operations for its exclusive use. For example, the number of logical operation groups may be set to a value equal to or less than the number of pipelines, and the unit execution units of the logical operation unit may then be combined according to that number to obtain the logical operation groups.
Optionally, when the number of the logical operation groups is equal to the number of the pipelines of the processor, the embodiment of the present application may combine the unit execution units according to the number of the pipelines of the processor on the basis of the unit execution units, so as to obtain the logical operation groups whose number corresponds to the number of the pipelines of the processor.
One logical operation group of the embodiment of the application can be obtained by combining a plurality of unit execution units, and a plurality of logical operation groups with the number equal to or less than the number of pipelines of the processor can be obtained by combining the unit execution units.
In some embodiments, the number of unit execution units in each logical operation group may be the same (so that the data bit widths of the logical operation groups are the same) or different (so that the data bit widths of the logical operation groups are different); alternatively, the number of unit execution units may be the same for some logical operation groups and different for others.
Optionally, in an implementation that combines the unit execution units to obtain a plurality of logical operation groups, the embodiment of the present application may determine the combined number of unit execution units corresponding to one logical operation group (that is, the number of unit execution units required to combine into one logical operation group) according to the number of logical operation groups and the number of unit execution units; the combined number of unit execution units is then combined into one logical operation group, so that a plurality of logical operation groups whose number is equal to or less than the number of pipelines are obtained from the plurality of unit execution units in the logical operation unit.
In one implementation example, the embodiment of the present application may divide the number of unit execution units by the number of logical operation groups (the number of logical operation groups may correspond to the number of pipelines supporting the logical operation, and may be less than or equal to the number of pipelines in the processor) to obtain the combined number of unit execution units corresponding to one logical operation group; the combined number of unit execution units is then combined into one logical operation group, thereby obtaining the plurality of logical operation groups.
For ease of understanding, based on the example of fig. 3B and taking two pipelines inputting data as an example, fig. 3D exemplarily shows an example diagram of the logical operation groups of a logical operation unit. As shown in fig. 3D, a 256-bit logical operation unit may be formed by 8 32-bit unit execution units, and 2 logical operation groups may be configured in this example based on the number of pipelines being 2; thus, the combined number of unit execution units corresponding to one logical operation group is 4, that is, in this example, 4 unit execution units are combined to obtain one 128-bit logical operation group; further, the first 4 unit execution units may be combined into one logical operation group, and the last 4 unit execution units may be combined into another logical operation group.
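As an illustrative sketch (not the patent's implementation), the grouping arithmetic above can be modeled in a few lines of Python; the unit counts and bit widths are the ones from the example (8 units of 32 bits, 2 pipelines), and the function name is an assumption:

```python
# Sketch of combining unit execution units into logical operation groups.
# Assumption: the combined number per group is num_units // num_groups,
# and units are grouped in bit-width order, as in the example above.

def combine_units(num_units, unit_bits, num_groups):
    """Return the (low_bit, high_bit) bit-width range of each logical operation group."""
    per_group = num_units // num_groups  # combined number per group
    groups = []
    for g in range(num_groups):
        lo = g * per_group * unit_bits
        hi = (g + 1) * per_group * unit_bits - 1
        groups.append((lo, hi))
    return groups

# 256-bit logical operation unit: 8 unit execution units of 32 bits,
# combined into 2 groups (one per pipeline).
groups = combine_units(num_units=8, unit_bits=32, num_groups=2)
print(groups)  # [(0, 127), (128, 255)] -> lower and upper 128-bit groups
```

With 4 pipelines the same arithmetic would yield four 64-bit groups, matching the "equal or smaller number of groups" rule described above.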
As an alternative implementation, when the combined number of unit execution units are combined into one logical operation group, the present embodiment may combine the unit execution units sequentially within the logical operation unit (for example, in the bit-width order of the unit execution units in the logical operation unit), so as to obtain a plurality of logical operation groups from the plurality of unit execution units in the logical operation unit. The data bit widths of the plurality of logical operation groups may thus be in order (e.g., sequentially increasing or sequentially decreasing); further, the plurality of logical operation groups have different data bit width ranges, for example, the plurality of logical operation groups include a logical operation group with a lower data bit width range and a logical operation group with a higher data bit width range. In one example, as shown in fig. 3D, when the unit execution units are sequentially combined in the order of data bit width, the logical operation group combined from the first 4 unit execution units may be the group with the lower data bit width range (e.g., the lower 128 bits), and the logical operation group combined from the last 4 unit execution units may be the group with the higher data bit width range (e.g., the upper 128 bits).
Furthermore, after the logical operation unit is divided into a plurality of logical operation groups, the embodiment of the present application can allocate one logical operation group to one pipeline for independent use, so that each of the plurality of pipelines in the superscalar processor configured with a logical operation group has an independently used logical operation group, providing a basis for the plurality of pipelines to multiplex the logical operation unit and use their respective logical operation groups simultaneously.
In some embodiments, the logical operation groups may be configured to the pipelines in matching order for independent use, according to the order of the pipelines and the order of the logical operation groups; of course, the embodiment of the present application may also support configuring the logical operation groups to the pipelines out of order, as long as one logical operation group is configured for one pipeline.
After the logical operation unit is divided into a plurality of logical operation groups and each pipeline supporting the logical operation is configured with an independently used logical operation group, the embodiment of the present application can have each logical operation group process the data input by its corresponding pipeline in the instruction execution stage, so that the plurality of logical operation groups in the logical operation unit are fully used by the plurality of pipelines and the resource utilization rate of the logical operation unit is improved.
As an alternative implementation, fig. 4 illustrates an alternative flowchart of an instruction execution method provided by an embodiment of the present application. Referring to fig. 4, the method flow may include the following steps.
In step S41, data of a plurality of instructions is input to the execution engine through a plurality of pipelines, and data of one instruction is input to one pipeline; the logic operation unit in the execution engine is divided into a plurality of logic operation groups, and one logic operation group is configured to one pipeline to be used independently.
In this embodiment, a plurality of pipelines may input the data of a plurality of instructions to the execution engine, one pipeline inputting the data of one instruction, where the data is operated on by the instruction. In one example, the instruction that initiates the data operation may be a SIMD instruction, which can initiate the same operation on multiple data through a single instruction; for example, multiple SIMD instructions may be issued simultaneously, so that the data of different SIMD instructions are input to the execution engine via different pipelines; thus, multiple pipelines may input the data of multiple SIMD instructions to the execution engine, one pipeline inputting the data of one SIMD instruction, and the multiple data on which a single SIMD instruction initiates an operation may be input to the execution engine through one pipeline. As an alternative implementation, based on a multiple-issue approach at the instruction issue stage, multiple pipelines may input the data of multiple instructions to the execution engine, with one pipeline inputting the data of one instruction.
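To illustrate with hypothetical values how one SIMD instruction initiates the same operation on multiple data input through a single pipeline, a 128-bit operand can be viewed as four 32-bit lanes (the function names and lane-wise addition are illustrative assumptions, not the patent's operation set):

```python
# Hypothetical sketch: one SIMD instruction applies the same operation
# to every 32-bit lane of a 128-bit operand input through one pipeline.

def simd_lanes(value, lane_bits=32, total_bits=128):
    """Split a total_bits-wide operand into lanes of lane_bits each (low lane first)."""
    mask = (1 << lane_bits) - 1
    return [(value >> (i * lane_bits)) & mask
            for i in range(total_bits // lane_bits)]

def simd_add(a, b, lane_bits=32):
    """Lane-wise addition: a single instruction, the same operation on multiple data."""
    mask = (1 << lane_bits) - 1
    result = 0
    for i, (la, lb) in enumerate(zip(simd_lanes(a), simd_lanes(b))):
        result |= ((la + lb) & mask) << (i * lane_bits)
    return result

a = (4 << 96) | (3 << 64) | (2 << 32) | 1    # lanes [1, 2, 3, 4]
b = (40 << 96) | (30 << 64) | (20 << 32) | 10  # lanes [10, 20, 30, 40]
print(simd_lanes(simd_add(a, b)))  # [11, 22, 33, 44]
```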
In conjunction with the foregoing description, the execution engine is provided with a logical operation unit for executing a logical operation, the logical operation unit is divided into a plurality of logical operation groups (the number of the plurality of logical operation groups may be less than or equal to the number of pipelines in the processor), and one logical operation group is allocated to one pipeline for independent use. For the way of dividing the logic operation unit into a plurality of logic operation groups, reference may be made to the description of the corresponding parts above, and details are not repeated here.
In step S42, the data input to each pipeline is selected to the logical operation group configured for each pipeline, so that each logical operation group performs a logical operation on the data input to each pipeline.
One pipeline uses its configured logical operation group to perform logical operations on data. For the data input by the plurality of pipelines, the execution engine can select input data for each logical operation group based on the configuration relationship between the pipelines and the logical operation groups (namely, one pipeline is configured with one independently used logical operation group); thus, the execution engine selects the data input by each pipeline to the logical operation group configured for that pipeline, and each logical operation group performs a logical operation on the data input by its correspondingly configured pipeline.
In some embodiments, a data selector in the execution engine may perform the above-described operation of selecting the data input by each pipeline to the logical operation group configured for that pipeline. As an alternative implementation, the data input by the plurality of pipelines can be transmitted to the data selector, so that the data selector can select the data input by each pipeline to the logical operation group configured for that pipeline based on the configuration relationship between the pipelines and the logical operation groups.
For ease of understanding, FIG. 5A is a diagram illustrating an alternative system architecture for implementing instruction execution according to an embodiment of the present application, and as shown in FIG. 5A, a plurality of pipelines 111 through 11n input data, and during an instruction issue stage, data 101 through 10n may be input to an execution engine via the plurality of pipelines 111 through 11 n; wherein, the pipeline 111 inputs the data 101, the pipeline 112 inputs the data 102, and so on, and the pipeline 11n inputs the data 10n; the data 101 to 10n input by the plurality of pipelines 111 to 11n may be delivered to the data selector 114 of the execution engine;
in the case where the logical operation unit 113 of the execution engine is divided into a plurality of logical operation groups 311 to 31n, one logical operation group may be configured for one pipeline for independent use; for example, the logical operation group 311 is configured for use by the pipeline 111, the logical operation group 312 is configured for use by the pipeline 112, and so on, and the logical operation group 31n is configured for use by the pipeline 11n; thus, after the data selector 114 obtains the data 101 to 10n input by the pipelines 111 to 11n, the data selector 114 may, based on the configuration relationship between the pipelines and the logical operation groups, select the data 101 input by the pipeline 111 to the logical operation group 311 for logical operation, select the data 102 input by the pipeline 112 to the logical operation group 312 for logical operation, and so on, and select the data 10n input by the pipeline 11n to the logical operation group 31n for logical operation.
In a further implementation example, the execution engine may be provided with a plurality of data selectors, the number of data selectors corresponding to the number of logical operation groups, with one logical operation group configured with one data selector for data selection, so that each data selector can pass the data input by a pipeline to the corresponding logical operation group. Optionally, fig. 5B is an exemplary diagram illustrating another alternative system architecture for implementing instruction execution according to an embodiment of the present application. Referring to fig. 5A and 5B, in the system architecture illustrated in fig. 5B, the data selector illustrated in fig. 5A is replaced by a plurality of data selectors 511 to 51n, with one data selector correspondingly configured for one logical operation group; for example, pipeline 111 and logical operation group 311 are configured with data selector 511, pipeline 112 and logical operation group 312 are configured with data selector 512, and so on, and pipeline 11n and logical operation group 31n are configured with data selector 51n;
on this basis, when the data of a plurality of instructions are input into the execution engine through a plurality of pipelines, one pipeline can input its data to the data selector corresponding to its correspondingly configured logical operation group; that is, the embodiment of the present application can input the data of each pipeline into the data selector corresponding to the logical operation group configured for that pipeline; for example, data 101 is input through pipeline 111 to the correspondingly configured data selector 511, data 102 is input through pipeline 112 to the correspondingly configured data selector 512, and so on, and data 10n is input through pipeline 11n to the correspondingly configured data selector 51n;
furthermore, the input data can be transmitted through each data selector to the correspondingly configured logical operation group, so that each logical operation group obtains the data input by its correspondingly configured pipeline and performs the logical operation; for example, data selector 511 may pass the incoming data 101 to the correspondingly configured logical operation group 311, data selector 512 may pass the incoming data 102 to the correspondingly configured logical operation group 312, and so on, and data selector 51n may pass the incoming data 10n to the correspondingly configured logical operation group 31n.
It should be noted that, the multiple data selectors provided in the embodiment of the present application are only an optional implementation manner, and in the case of setting the configuration relationship between the pipelines and the logical operation groups, the embodiment of the present application may also transmit data input by the multiple pipelines to the logical operation groups correspondingly configured to the pipelines through one data selector. For example, in the case where a pipeline identifier (for example, a pipeline number) is set for the pipeline, and a group identifier (for example, a group number) is set for the logical operation group, the data selector may determine the pipeline identifier to which the input data corresponds, and thereby select the data to the logical operation group corresponding to the group identifier according to the configuration relationship of the pipeline and the logical operation group (for example, the configuration relationship may record the correspondence relationship of the pipeline identifier and the group identifier). It should be noted that, in the case that a plurality of data selectors are provided, as a possible implementation, the embodiments of the present application may implement the configuration relationship setting of the pipeline, the data selectors, and the logical operation groups through internal circuit connection of the processors.
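A minimal sketch of the single-data-selector variant, assuming (as an illustration only) that the configuration relationship is a table mapping pipeline identifiers to group identifiers; the identifier strings are hypothetical:

```python
# Sketch: one data selector routes each pipeline's input to its logical
# operation group via a configuration relationship (pipeline id -> group id).
# The identifiers and the table contents are illustrative assumptions.

CONFIG = {"pipe0": "group0", "pipe1": "group1"}  # configuration relationship

def select(inputs, config=CONFIG):
    """Route each pipeline's data to its correspondingly configured group."""
    routed = {}
    for pipe_id, data in inputs.items():
        group_id = config[pipe_id]  # look up the group for this pipeline
        routed[group_id] = data
    return routed

inputs = {"pipe0": 0xAAAA, "pipe1": 0xBBBB}
print(select(inputs))  # {'group0': 43690, 'group1': 48059}
```

In hardware this table would be fixed by internal circuit connections, as the paragraph above notes; the dictionary lookup simply models that wiring.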
In step S43, the logical operation result of each logical operation group is output.
After a logical operation group completes a logical operation on data, the logical operation group may output a logical operation result. In some embodiments, a pipeline may be configured with a result data bus, and after a logic operation group completes a logic operation of data, the logic operation group may output a logic operation result to the result data bus corresponding to the configured pipeline.
In a further alternative implementation, fig. 5C is a diagram illustrating a further alternative system architecture for implementing instruction execution according to an embodiment of the present application, and in conjunction with fig. 5B and fig. 5C, in the system architecture shown in fig. 5C, a plurality of pipelines are correspondingly configured with a plurality of result data buses 521 to 52n, wherein one pipeline is correspondingly configured with one result data bus; for example, pipeline 111 corresponds to configuration result data bus 521, pipeline 112 corresponds to configuration result data bus 522, and so on, pipeline 11n corresponds to configuration result data bus 52n; therefore, after one logic operation group completes the logic operation of the data, one logic operation group can output the logic operation result to the result data bus corresponding to the assembly line configured correspondingly; for example, after the logical operation group 311 completes the logical operation of the data 101, the logical operation result may be output to the result data bus 521 corresponding to the pipeline 111, after the logical operation group 312 completes the logical operation of the data 102, the logical operation result may be output to the result data bus 522 corresponding to the pipeline 112, and so on, and after the logical operation group 31n completes the logical operation of the data 10n, the logical operation result may be output to the result data bus 52n corresponding to the pipeline 11 n.
According to the instruction execution method provided by the embodiment of the present application, a logical operation unit in an execution engine can be divided into a plurality of logical operation groups, and one logical operation group is configured to one pipeline for independent use; on this basis, the data of a plurality of instructions can be input into the execution engine through a plurality of pipelines, with the data of one instruction input into one pipeline; based on the configuration relationship between the pipelines and the logical operation groups, the data input by each pipeline can be selected to the logical operation group configured for that pipeline, so that each logical operation group performs a logical operation on the data input by its pipeline; further, the embodiment of the present application can output the logical operation result of each logical operation group to complete instruction execution.
It can be seen that, in the embodiment of the present application, in the case of dividing the logical operation unit into a plurality of logical operation groups, each pipeline supporting logical operations in the processor may use one logical operation group alone to perform logical operations on data; therefore, after the data input by each pipeline is selected to the correspondingly configured logic operation groups, the plurality of logic operation groups can process the data input by a plurality of pipelines in parallel; and different logical operation groups correspond to different data bit width ranges in the logical operation unit, so that the data bit width resources in different ranges in the logical operation unit can process data input by a plurality of pipelines in parallel, the data bit width resources in the logical operation unit are fully utilized, the degree of idle waste of the data bit width resources of the logical operation unit is reduced, and the resource utilization rate of the logical operation unit is improved.
It can be understood that, while keeping the data bit width of the whole logical operation unit unchanged, dividing the logical operation unit into a plurality of logical operation groups and allocating one logical operation group to one pipeline for independent use reduces the data bit width of the logical operation unit that each pipeline can occupy. However, when the data bit width of the data input by one pipeline is lower than the data bit width of the whole logical operation unit, that pipeline uses its correspondingly configured logical operation group to perform the logical operation, and the other logical operation groups can be used by other pipelines for their data operations; this avoids the waste of data bit width resources that occurs when the whole logical operation unit is exclusively occupied by one pipeline, and improves the resource utilization rate of the logical operation unit. It can be seen that, in the embodiment of the present application, by finely dividing the data bit width resources of the logical operation unit and configuring multiple logical operation groups used by multiple pipelines, the data bit width resource utilization rate of the logical operation unit can be improved as a whole, the throughput rate of the logical operation unit of the processor is improved, and the instruction execution efficiency is further improved.
In an implementation example, taking an example that the data bit width of the logical operation unit is 256 bits, the number of pipelines supporting the logical operation is two, and two pipelines input two 128-bit data, fig. 6 exemplarily illustrates an implementation example of the instruction execution method according to the embodiment of the present application, and as shown in fig. 6, the 256-bit logical operation unit may be divided into 2 128-bit logical operation groups (e.g., logical operation groups 311 and 312) based on two pipelines supporting the logical operation (e.g., the pipeline 111 and the pipeline 112);
in an alternative implementation, based on that the data bit width of the unit execution unit in the logical operation unit is 32 bits, the embodiment of the present application may combine 4 unit execution units with a lower bit width in the logical operation unit into one lower logical operation group (e.g., the logical operation group 311), so as to configure the pipeline and the data selector for the lower logical operation group; combining 4 unit execution units with high bit width in the logic operation unit into a logic operation group with high bit (such as logic operation group 312), thereby configuring a pipeline and a data selector for the logic operation group with high bit; in the present example, the pipeline 111, the data selector 511, and the logical operation group 311 may set a configuration relationship, and the pipeline 112, the data selector 512, and the logical operation group 312 may set a configuration relationship;
furthermore, the data 101 input by the pipeline 111 is 128 bits, and the data selector 511 can transmit the 128-bit data 101 to the lower 128-bit logical operation group 311 for logical operation, so that the logical operation result of logical operation group 311 is output to the result data bus 521 corresponding to the pipeline 111; meanwhile, the data selector 512 transmits the 128-bit data 102 input by the pipeline 112 to the upper 128-bit logical operation group 312 for logical operation, so that the logical operation result of logical operation group 312 is output to the result data bus 522 corresponding to the pipeline 112, rather than leaving the upper 128-bit data bit width resource of the logical operation unit idle and wasted.
It can be seen from this example that, in the embodiment of the present application, data input by multiple pipelines can be processed in parallel through multiple logical operation groups of different data bit width resources, so that the data bit width resources of the logical operation unit can be fully utilized to a great extent, and the resource utilization rate of the logical operation unit is improved.
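The 256-bit example of fig. 6 can be simulated end to end; the division operation, operand values, and dictionary-based result buses are illustrative assumptions standing in for the hardware:

```python
# End-to-end sketch of the fig. 6 example: a 256-bit logical operation
# unit divided into two 128-bit groups, each serving one pipeline, with
# each result placed on that pipeline's own result data bus.
# The operation (integer division) and values are illustrative.

def group_op(a, b, width=128):
    """One logical operation group: division as a stand-in operation."""
    mask = (1 << width) - 1
    return (a // b) & mask

# Two pipelines each input one instruction's 128-bit operands.
pipeline_inputs = {
    0: (1000, 8),   # pipeline 111 -> lower 128-bit group
    1: (729, 27),   # pipeline 112 -> upper 128-bit group
}

result_buses = {}
for pipe, (a, b) in pipeline_inputs.items():
    # In hardware the two groups operate in parallel; here each group's
    # result goes to its pipeline's result data bus.
    result_buses[pipe] = group_op(a, b)

print(result_buses)  # {0: 125, 1: 27}
```

The point of the sketch is the shape of the data flow: both 128-bit operations complete in one pass, with neither half of the 256-bit unit left idle.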
As an alternative implementation, the following cases may exist for the data bit width of the data input by a pipeline relative to the data bit width of its logical operation group:
in case one, the data bit width of the data input by the pipeline is less than the data bit width of the logical operation packet. For example, the data bit width of the logical operation packet is 128 bits, while the data bit width of the data input to the pipeline is 32 bits, 64 bits, and so on. In this case, the logic operation groups configured in each pipeline can not influence each other to run independently; that is, the data input to each pipeline can be operated by the logical operation packets arranged in each pipeline, and in this case, although there is a part of unused data bit wide resource in each logical operation packet, the logical operation unit can ensure the resource utilization rate by processing the data input to a plurality of pipelines in parallel as a whole.
In case two, the data bit width of the data input by the pipeline is equal to the data bit width of the logical operation group. For example, the data bit width of the logical operation group is 128 bits, and the data bit width of the data input to the pipeline is also 128 bits. In this case, the data input by each pipeline can be operated on by the logical operation group configured for that pipeline, and the data bit width resource of each logical operation group is fully utilized; thus, the data bit width resource of the whole logical operation unit is fully utilized, and the resource utilization rate of the logical operation unit can be greatly improved.
In one example, taking the data bit width of the logical operation group as 128 bits, for case one and case two, when the logical operation unit performs 32-bit, 64-bit, and 128-bit single-precision floating-point logical operations (for example, single-precision floating-point division), each pipeline inputs data of no more than 128 bits, which is processed by the logical operation group configured for that pipeline. Further, taking the data bit width of the logical operation unit as 256 bits, the logical operation unit may simultaneously support logical operations on two single-precision floating-point data items whose data bit widths are no more than 128 bits, where the two data items may be operated on by SIMD instructions for single-precision floating-point logical operations.
In case three, the data bit width of the data input by the pipeline is greater than the data bit width of the logical operation group; for example, the data bit width of the logical operation group is 128 bits, and the data bit width of the data input to the pipeline is 256 bits. In this case, the logical operation group correspondingly configured for the pipeline cannot complete the operation on the input data in one pass, and therefore needs to perform multiple logical operations on the data. The embodiment of the present application may divide the data input by the pipeline into a plurality of data groups according to the data bit width of the logical operation group, where the data bit width of one data group is not greater than the data bit width of the logical operation group; the logical operation group correspondingly configured for the pipeline then performs logical operations on the plurality of data groups in sequence, and the logical operation results of the data groups are spliced to obtain the final logical operation result of the data.
For example, if the data input by the pipeline is 256 bits and the data bit width of the logical operation group is 128 bits, the data input by the pipeline can be divided into two 128-bit data groups, namely a lower 128-bit data group and an upper 128-bit data group; the logical operation group correspondingly configured for the pipeline can then perform logical operations on the lower 128-bit data group and the upper 128-bit data group in two passes, obtaining in turn the logical operation result of the lower 128-bit data group and the logical operation result of the upper 128-bit data group; the logical operation results of the lower 128-bit and upper 128-bit data groups are then spliced to obtain the final logical operation result of the data.
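This split-operate-splice flow can be sketched arithmetically: split the 256-bit input into its low and high 128-bit portions, operate on each in turn, then splice with shifts. Bitwise NOT is an illustrative stand-in for the logical operation:

```python
# Sketch of case three: a 128-bit logical operation group processes a
# 256-bit input in two passes, then the two results are spliced.
# The operation (bitwise NOT) is an illustrative stand-in.

GROUP_BITS = 128
MASK = (1 << GROUP_BITS) - 1

def logical_op(x):
    """Stand-in 128-bit logical operation."""
    return ~x & MASK

data = (0xDEADBEEF << 128) | 0x12345678   # 256-bit input from the pipeline

low  = data & MASK                  # lower 128-bit data group
high = (data >> GROUP_BITS) & MASK  # upper 128-bit data group

r_low  = logical_op(low)            # first pass
r_high = logical_op(high)           # second pass

result = (r_high << GROUP_BITS) | r_low   # splice in bit-width order
assert result == ~data & ((1 << 256) - 1)  # same as one 256-bit operation
```

The final assertion shows why the splice is correct for bitwise operations: processing each half and concatenating gives the same bits as one full-width pass.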
Based on case three, as an alternative implementation, fig. 7 exemplarily shows another alternative flowchart of the instruction execution method provided by the embodiment of the present application. Referring to fig. 7, the method flow may include the following steps.
In step S71, the data of a plurality of instructions is input to the execution engine through a plurality of pipelines; if the data bit width of the data input by a pipeline is greater than the data bit width of its correspondingly configured logical operation group, the data input by the pipeline comprises a plurality of data groups, where the data bit width of one data group is not greater than the data bit width of the logical operation group correspondingly configured for the pipeline.
The portion of step S71 that is the same as step S41 shown in fig. 4 is not described here again. It should be noted that, when a pipeline inputs data whose bit width is greater than the data bit width of the logical operation group correspondingly configured for the pipeline, the data input by the pipeline needs to be divided into a plurality of data groups, with the data bit width of one data group not greater than the data bit width of the logical operation group. As an optional implementation, the pipeline may divide the data into a plurality of data groups according to the data bit width of its correspondingly configured logical operation group, such that the data bit width of one data group is not greater than the data bit width of the logical operation group.
In one example, if the data input to the pipeline is 256 bits and the data bit width of the logical operation group corresponding to the pipeline is lower than 256 bits (e.g., 128 bits, 64 bits, etc.), the 256 bits of data input to the pipeline may be divided into a plurality of data groups corresponding to different data bit width ranges, with the data bit width of each data group not greater than the data bit width of the logical operation group. For example, when the data bit width of the logical operation group is 128 bits, the 256 bits of data input by the pipeline may be divided into a lower 128-bit data group and an upper 128-bit data group; or, when the data bit width of the logical operation group is 64 bits, the 256 bits of data input by the pipeline may be divided into 4 data groups whose data bit width ranges increase in sequence, with each data group being 64 bits. In an alternative implementation, the data input to the pipeline may be divided by the pipeline itself, resulting in the plurality of data groups.
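The division into data groups can be written generically for any group bit width; the helper name is an assumption, and the widths are the ones from the example (256-bit data, 64-bit groups):

```python
# Sketch: divide pipeline data of total_bits into data groups of at most
# group_bits each, in increasing bit-width order (lowest group first).

def split_into_groups(data, total_bits, group_bits):
    mask = (1 << group_bits) - 1
    n = (total_bits + group_bits - 1) // group_bits  # number of data groups
    return [(data >> (i * group_bits)) & mask for i in range(n)]

# 256-bit data, 64-bit logical operation group -> 4 data groups.
data = (4 << 192) | (3 << 128) | (2 << 64) | 1
print(split_into_groups(data, 256, 64))  # [1, 2, 3, 4]
```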
In step S72, the plurality of data groups of the pipeline are sequentially selected to the logical operation group configured for the pipeline, so that the logical operation group configured for the pipeline performs logical operations on the plurality of data groups in sequence.
When the data input by the pipeline comprises a plurality of data groups, the embodiment of the present application can select one data group at a time and pass it to the logical operation group correspondingly configured for the pipeline, so that the logical operation group performs a logical operation on one data group at a time (it can be understood that, because the data bit width of one data group is not greater than that of the logical operation group, the logical operation group has enough data bit width resources to process one data group in one logical operation); thus, the plurality of data groups of the pipeline are selected to the correspondingly configured logical operation group over multiple passes, and the logical operation group performs the logical operations on the plurality of data groups over multiple passes.
As an optional implementation, in the embodiment of the present application, the plurality of data packets may be selected, in order of their data bit-width ranges, to the logical operation group configured for the pipeline, so that the logical operation group completes the logical operations of the plurality of data packets in that order, performing the logical operation on one data packet at a time.
In one example, assume the data bit width of the logical operation group is 128 bits and the 256-bit data input by the pipeline is divided into a lower 128-bit data packet and an upper 128-bit data packet. Following the bit-width order of the data packets, the lower 128-bit data packet may be selected first so that the logical operation group configured for the pipeline performs its logical operation; after the logical operation group completes the logical operation on the lower 128-bit data packet, the upper 128-bit data packet may be selected for the logical operation group to process.
As an alternative implementation, the data packets of the pipeline may be input into the logical operation group configured for the pipeline through a data selector (e.g., the data selector configured corresponding to the pipeline); for this aspect, reference may be made to the corresponding description above, which is not repeated here.
In step S73, the logical operation results of the plurality of data packets obtained by the logical operation group are spliced to obtain the logical operation result of the data input by the pipeline, and the logical operation result of the data is output.
In some embodiments, after completing the logical operation of one data packet, the logical operation group may temporarily store the logical operation result of that data packet within the group and obtain the next data packet for logical operation from the data selector. After the logical operation group completes the logical operations of all data packets of the pipeline, it may splice the logical operation results of the plurality of data packets to obtain the logical operation result of the data. As an optional implementation, in the embodiment of the present application, the logical operation results of the plurality of data packets may be spliced in the bit-width order of the data packets to obtain the logical operation result of the data.
Taking a 128-bit logical operation group that sequentially processes a lower 128-bit data packet and an upper 128-bit data packet as an example: after completing the logical operation of the lower 128-bit data packet, the logical operation group may temporarily store its result and obtain the next, upper 128-bit data packet from the data selector for logical operation; after the logical operation group completes the logical operation of the upper 128-bit data packet, the logical operation results of the lower and upper 128-bit data packets may be spliced to obtain the final logical operation result of the data input by the pipeline.
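The splicing step above can be sketched as follows (an illustrative model, assuming per-packet results are held as integers; the low packet's result occupies the low bit range):

```python
def splice_results(results, group_width: int) -> int:
    """Splice per-packet results back into one value, lowest packet first.

    Each element of results is the logical operation result of one data
    packet, in the same low-to-high order the packets were processed.
    """
    value = 0
    for index, part in enumerate(results):
        value |= part << (index * group_width)
    return value

# Lower-128-bit result then upper-128-bit result recombine into 256 bits.
combined = splice_results([0x5555, 0xAAAA], 128)
```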
Further, the logical operation group can output the logical operation result of the data to a result data bus correspondingly configured to the pipeline.
It can be seen that, in the embodiment of the present application, when the logical operation unit is divided into a plurality of logical operation groups and the data bit width of the data input by a pipeline is greater than the data bit width of a logical operation group, the data input by the pipeline may comprise a plurality of data packets, with the data bit width of each data packet not greater than that of the logical operation group. The logical operation group configured for the pipeline can then process the plurality of data packets of the pipeline in sequence, and the logical operation results of the packets are spliced to obtain the final logical operation result, thereby handling the case where the input data is wider than the logical operation group. Because the data bit width of each data packet is not greater than that of the logical operation group, even when a plurality of logical operation groups process the data input by a plurality of pipelines in parallel, each group has sufficient data bit width resources to process one data packet of its pipeline in one operation. This ensures that data operations proceed correctly while the data bit width resources of each logical operation group are used reasonably and efficiently.
In an implementation example, take two pipelines and a 256-bit logical operation unit, with each pipeline inputting 256-bit data. Fig. 8 exemplarily shows another implementation example of the instruction execution method provided by the embodiment of the present application. In conjunction with fig. 6 and 8, assume that data 101 input by pipeline 111 is 256 bits and data 102 input by pipeline 112 is 256 bits. Since the input data exceeds the computation resources of a 128-bit logical operation group, data 101 may be divided, according to the 128-bit width of the logical operation group, into a lower 128-bit data packet 1011 and an upper 128-bit data packet 1012; meanwhile, data 102 may be divided into a lower 128-bit data packet 1021 and an upper 128-bit data packet 1022;
when performing a logical operation on the data 101, according to the data bit width order of the data packet 1011 and the data packet 1012, the data selector 511 configured in the pipeline 111 may first transmit the data packet 1011 with the lower 128 bits to the logical operation packet 311 configured correspondingly in the pipeline 111 to perform the logical operation; after the logical operation packet 311 completes the logical operation of the data packet 1011, the logical operation result of the data packet 1011 may be temporarily stored, and the data selector 511 may transmit the data packet 1012 with the 128 bits higher to the logical operation packet 311 for the logical operation.
It should be noted that the time at which the logical operation group 311 can begin processing data packet 1012 is determined by when group 311 completes its current logical operation and by whether that operation can be executed in a pipelined manner. For example, for a division operation, if one division takes 11 clock cycles to complete and division cannot be pipelined, the data selector 511 transmits data packet 1012 to logical operation group 311 only after a delay of 11 clock cycles following the transmission of data packet 1011. It should be further noted that, if a logical operation can be pipelined, the logical operation group can accept the next data packet one clock cycle after starting the operation; but operations such as division generally cannot be pipelined, so the group can process the next data packet only after the 11 clock cycles of the division have completed. The 11-clock-cycle latency for the divider to complete a division is merely an example; different processor products and different operation instructions may give different latencies. In addition, compared with logical operations that can be executed in a pipelined manner, the scheme provided by the embodiment of the present application yields a greater resource utilization advantage for logical operations that cannot be pipelined.
As further shown in fig. 6 and fig. 8, after the logical operation of the data packet 1012 is completed, the logical operation packet 311 may splice the logical operation results of the data packet 1011 and the data packet 1012, and then output the result to the result data bus 521 configured in the pipeline 111; in the case where the logical operation cannot be executed in a pipelined manner, the delay time for completing the logical operation of the plurality of data packets in the logical operation packet is determined by the number of the plurality of data packets and the time for completing one logical operation in the logical operation packet (for example, the number of the plurality of data packets is multiplied by the time for completing one logical operation in the logical operation packet); taking the division operation as an example, in the case that the logic operation cannot be executed in a pipeline, if one division operation is completed for 11 clock cycles, the delay of completing the division operation of the low 128-bit data packet and the high 128-bit data packet by one logic operation packet is 22 clock cycles.
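The non-pipelined latency rule stated above (total delay = number of packets times the time for one operation) can be written out directly; the 11-cycle divider figure is the example value from the text, not a fixed property of any particular processor:

```python
def non_pipelined_latency(num_packets: int, cycles_per_op: int) -> int:
    """Total cycles when the operation cannot be pipelined: each data packet
    must wait for the previous packet's operation to finish completely."""
    return num_packets * cycles_per_op

# Two 128-bit packets through an 11-cycle (non-pipelined) divider.
latency = non_pipelined_latency(2, 11)  # 22 clock cycles
```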
With further reference to fig. 6 and 8, similarly, when performing the logic operation on the data 102, the data selector 512 disposed in the pipeline 112 may first transmit the low 128-bit data packet 1021 to the logic operation packet 312 disposed in the pipeline 112 for performing the logic operation; after the logical operation group 312 completes the logical operation of the data group 1021, the logical operation result of the data group 1021 may be temporarily stored, and the data selector 512 may transmit the data group 1022 with high 128 bits to the logical operation group 312 for the logical operation; logical operation group 312 may splice the logical operation results of data group 1021 and data group 1022 after completing the logical operation of data group 1022, and then output the result onto result data bus 522 of pipeline 112.
Further, taking the division operation as an example on the basis of the above implementation example: if the division operation cannot be executed in a pipelined manner, then to execute a 256-bit single-precision floating-point division, the pipeline divides the input data into a lower 128-bit data packet and an upper 128-bit data packet and executes the division in two passes. The lower 128-bit data packet is first directed to the correspondingly configured logical operation group to execute the division; after the time for one division operation has elapsed (e.g., a delay of 11 clock cycles), the division result of the lower 128-bit data packet is temporarily stored and the upper 128-bit data packet is directed to the logical operation group to execute the division. After the time for one more division operation has elapsed, the division results of the lower and upper 128-bit data packets are spliced to obtain the final division result, which is output to the result data bus correspondingly configured for the pipeline, thereby completing the 256-bit single-precision floating-point division operation.
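Putting the split, per-packet operation, and splice steps together, the two-pass control flow described above might be sketched as a behavioral model (illustrative names only; a toy per-packet bitwise operation stands in for the floating-point division hardware, which is not modeled here):

```python
def execute_split_op(data: int, data_width: int, group_width: int, op):
    """Run op on each packet of data in low-to-high order, then splice.

    op models one pass through the logical operation group and must return
    a result no wider than group_width bits.
    """
    mask = (1 << group_width) - 1
    result = 0
    for shift in range(0, data_width, group_width):
        packet = (data >> shift) & mask   # data selector picks the next packet
        partial = op(packet) & mask       # one pass through the operation group
        result |= partial << shift        # temporarily store, then splice
    return result

# With a toy per-packet operation (bitwise NOT within the packet), two-pass
# split execution matches operating on the full 256-bit value at once.
full_not = execute_split_op((1 << 200) | 7, 256, 128, lambda p: ~p)
```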
It should be noted that, in the embodiment of the present application, when the data bit width of the data input by the pipeline is greater than the data bit width of the logical operation group, the logical operation group must operate multiple times to complete the logical operation, which increases the latency of the operation on that data; however, even in this case, the throughput of the logical operation unit remains unchanged. When the data input by the pipeline is not wider than the logical operation group, the throughput of the logical operation unit is multiplied.
Taking a 256-bit logical operation unit divided into two 128-bit logical operation groups as an example: for 32-bit, 64-bit, and 128-bit single-precision floating-point logical operations (e.g., single-precision floating-point division), the latency of completing a data operation in the embodiment of the present application is the time of one logical operation (e.g., 11 clock cycles for one division), with a throughput of two data items, so the throughput of the logical operation unit is improved. For a 256-bit single-precision floating-point logical operation, the latency is the time of two logical operations, and the throughput is still two data items. Therefore, the embodiment of the present application can double the throughput of 32-bit, 64-bit, and 128-bit single-precision floating-point logical operations (that is, when the data input by the pipeline is not wider than the logical operation group, both the throughput of the logical operation unit and the utilization of its data bit width resources increase), while for 256-bit single-precision floating-point logical operations, although the latency increases, the throughput of the logical operation unit remains unchanged. On the whole, the embodiment of the present application can improve the utilization of the data bit width resources of the logical operation unit, improve its throughput, and improve instruction execution efficiency.
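The throughput/latency trade-off above reduces to simple arithmetic; the following sketch (names and the 11-cycle figure are illustrative, with latency computed for the non-pipelined case) summarizes it:

```python
def split_unit_metrics(num_groups: int, group_width: int,
                       data_width: int, cycles_per_op: int):
    """Throughput (operations in flight) and latency (cycles) when a unit
    split into num_groups groups processes data_width-bit operands and the
    operation cannot be pipelined."""
    passes = -(-data_width // group_width)   # ceil(data_width / group_width)
    latency = passes * cycles_per_op         # one full operation per pass
    return num_groups, latency               # one operation per group

# 256-bit unit split into two 128-bit groups, 11-cycle division:
narrow = split_unit_metrics(2, 128, 128, 11)  # (2, 11): doubled throughput
wide = split_unit_metrics(2, 128, 256, 11)    # (2, 22): same throughput,
                                              # doubled latency
```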
Aiming at the problem of idle waste of the data bit width resources of a logical operation unit, and unlike approaches in which the logical operation unit is used as a whole and occupied exclusively by different pipelines in different clock cycles, the embodiment of the present application divides the logical operation unit into a plurality of logical operation groups, with one group configured for one pipeline for independent use. A plurality of pipelines can therefore use their respectively configured logical operation groups to perform logical operations on data at the same time, multiplexing the logical operation unit simultaneously: one pipeline needs only part of the unit's data bit width resources to perform its data operation, and the remaining data bit width resources can be used by other pipelines at the same time. This improves the utilization of the data bit width resources of the logical operation unit and improves the instruction-level parallelism and instruction execution efficiency of the processor.
It should be further noted that the logical operation unit in the embodiments of the present application may be a circuit for performing logical operations in a processor, such as an ALU (Arithmetic and Logic Unit). The logical operation unit may comprise circuit devices that perform different types of logical operations, such as dividers, multipliers, and adders.
In combination with the system architecture exemplarily provided in the foregoing, an embodiment of the present application further provides an execution engine, comprising: a data selector and a logical operation unit; the logical operation unit is divided into a plurality of logical operation groups, and one logical operation group is configured for one pipeline for independent use;
the execution engine acquires data of a plurality of instructions input by a plurality of pipelines in an instruction transmitting stage, and one pipeline inputs data of one instruction;
the data selector is used for selecting the data input by each pipeline to the logic operation groups correspondingly configured for each pipeline;
the logical operation group is used for obtaining data corresponding to the pipeline input from the data selector, performing logical operation on the data corresponding to the pipeline input, and outputting a logical operation result.
In some embodiments, the plurality of unit execution units may form a logical operation unit, and the plurality of logical operation groups may be divided according to the plurality of unit execution units and the number of logical operation groups; wherein the number of logical operation groups is less than or equal to the number of the plurality of pipelines.
As an alternative implementation, a combined number of unit execution units may be combined to obtain one logical operation group, where the combined number may be determined based on the number of logical operation groups and the number of the plurality of unit execution units; the unit execution unit supports logical operation on data with unit bit width, wherein the unit bit width is the data bit width of the unit execution unit, and the data bit width of one logical operation group is the sum of the data bit widths of the combined unit execution units.
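The combining rule above (combined number = number of unit execution units divided by number of groups; a group's width is the sum of its units' widths) can be sketched as follows, with illustrative names and an even, contiguous assignment of units to groups assumed:

```python
def form_groups(unit_widths, num_groups):
    """Combine unit execution units into num_groups logical operation groups.

    Each group combines len(unit_widths) // num_groups adjacent units; a
    group's data bit width is the sum of the widths of its combined units.
    """
    combined = len(unit_widths) // num_groups
    groups = []
    for i in range(num_groups):
        members = unit_widths[i * combined:(i + 1) * combined]
        groups.append(sum(members))
    return groups

# Four 64-bit unit execution units (a 256-bit logical operation unit)
# combined into two 128-bit logical operation groups.
group_widths = form_groups([64, 64, 64, 64], 2)
```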
In some embodiments, a data bit width of data input by one pipeline is less than or equal to a data bit width of a corresponding configured logical operation packet; or, the data bit width of the data input by one pipeline is greater than the data bit width of the corresponding configured logical operation packet.
As an optional implementation, if the data bit width of the data input by the pipeline is greater than the data bit width of the correspondingly configured logical operation group, the data input by the pipeline comprises a plurality of data packets, and the data bit width of one data packet is not greater than the data bit width of the logical operation group; the data input by the pipeline is divided into the plurality of data packets according to the data bit width of the correspondingly configured logical operation group;
in this case, the selecting, by the data selector, of the data input by each pipeline to the logical operation group correspondingly configured for each pipeline may include: selecting the plurality of data packets of the pipeline, in sequence, to the logical operation group configured for the pipeline;
the logical operation group obtaining the data input by the corresponding pipeline from the data selector and performing the logical operation on that data may include:
obtaining the plurality of data packets of the corresponding pipeline from the data selector in sequence, and performing logical operations on the plurality of data packets in sequence; splicing the logical operation results of the plurality of data packets to obtain the logical operation result of the data input by the pipeline; wherein, after the logical operation of one data packet is completed, the logical operation result of that data packet is temporarily stored in the logical operation group, and the next data packet for logical operation is obtained from the data selector.
In some embodiments, the data selector may comprise a plurality of data selectors, with one data selector configured for data selection for each logical operation group; one data selector is used for acquiring the data input by the pipeline corresponding to its logical operation group and transmitting that data to the corresponding logical operation group.
In some further embodiments, a plurality of result data buses corresponding in number to the plurality of pipelines may be further provided in the execution engine, with one result data bus configured for each pipeline; each logical operation group may output its logical operation result to the result data bus corresponding to its configured pipeline.
Further, embodiments of the present application also provide a processor (e.g., a superscalar processor), which may include the execution engine provided by the embodiments of the present application.
Further, an embodiment of the present application also provides a chip, where the chip may include the processor provided in the embodiment of the present application.
Further, an electronic device, such as a terminal device or a server device, is also provided in the embodiments of the present application, and the electronic device may include the chip provided in the embodiments of the present application.
While various embodiments of the present disclosure have been described above, the alternatives described in the various embodiments can, where there is no conflict, be combined and cross-referenced, thereby extending the variety of possible embodiments that may be considered disclosed by the embodiments of the present disclosure.
Although the embodiments of the present application are disclosed above, the present application is not limited thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present disclosure, and it is intended that the scope of the present disclosure be defined by the appended claims.

Claims (18)

1. An instruction execution method, comprising:
inputting data of a plurality of instructions into an execution engine through a plurality of pipelines, and inputting data of one instruction into one pipeline; the logic operation unit in the execution engine is divided into a plurality of logic operation groups, and one logic operation group is configured to one pipeline for independent use;
selecting the data input by each pipeline to the logical operation group configured for each pipeline, so that each logical operation group performs a logical operation on the data input by each pipeline;
and outputting the logic operation result of each logic operation group.
2. The method of claim 1, further comprising:
dividing the logical operation unit into a plurality of logical operation groups according to a plurality of unit execution units forming the logical operation unit and the number of logical operation groups; wherein the number of logical operation groups is less than or equal to the number of the plurality of pipelines.
3. The method of claim 2, wherein the dividing the logical operation unit into a plurality of logical operation groups according to a number of unit execution units forming the logical operation unit and a number of logical operation groups comprises:
determining a plurality of unit execution units of the logical operation unit, wherein one unit execution unit supports logical operation on data with a unit bit width, and the unit bit width is the data bit width of the unit execution unit;
combining the unit execution units according to the number of the logical operation groups to obtain a plurality of logical operation groups; the data bit width of one logical operation group is the sum of the data bit widths of the combined unit execution units.
4. The method of claim 3, wherein combining the unit execution units according to the number of logical operation groups to obtain a plurality of logical operation groups comprises:
determining the combined number of the unit execution units corresponding to one logic operation group according to the number of the logic operation groups and the number of the plurality of unit execution units;
and combining the combined number of unit execution units into one logic operation group, so as to obtain a plurality of logic operation groups through combination of the plurality of unit execution units.
5. The method according to any one of claims 1 to 4, wherein a data bit width of data inputted by one pipeline is smaller than or equal to a data bit width of a corresponding configured logical operation packet; or the data bit width of the data input by one pipeline is greater than the data bit width of the corresponding configured logical operation packet.
6. The method of claim 5, wherein if the data bit width of the data inputted by the pipeline is greater than the data bit width of the corresponding configured logical operation packet, the data inputted by the pipeline comprises a plurality of data packets, and the data bit width of one data packet is not greater than the data bit width of the logical operation packet; the data input by the pipeline is divided into a plurality of data packets according to the data bit width of the corresponding configured logic operation packet.
7. The method of claim 6, wherein the selecting the data input by each pipeline to the logical operation group configured for each pipeline comprises:
selecting the plurality of data packets of the pipeline, in sequence, to the logical operation group configured for the pipeline, so that the logical operation group configured for the pipeline performs logical operations on the plurality of data packets in sequence; wherein, after the logical operation group completes the logical operation of one data packet, the logical operation result of that data packet is temporarily stored and the next data packet for logical operation is obtained;
the outputting the logical operation result of each logical operation group comprises:
splicing the logical operation results of the plurality of data packets obtained by the logical operation group to obtain the logical operation result of the data input by the pipeline, and outputting the logical operation result of the data.
8. The method of claim 5, wherein inputting data for the plurality of instructions into the execution engine via the plurality of pipelines comprises: inputting data of a plurality of instructions into a data selector of an execution engine through a plurality of pipelines;
the selecting the data input by each pipeline to the logic operation groups configured for each pipeline comprises: transmitting the data input by each pipeline to the logic operation groups configured for each pipeline through the data selector;
the outputting the logical operation result of each logical operation group comprises: outputting the logical operation result of the logical operation group to the result data bus corresponding to its correspondingly configured pipeline; wherein one pipeline is correspondingly configured with one result data bus.
9. The method of claim 8, wherein the data selector comprises a plurality of data selectors, and one data selector for data selection is configured for one logical operation group;
the data selector inputting data of a plurality of instructions into the execution engine through a plurality of pipelines comprises: inputting the data of each pipeline into a data selector corresponding to the logic operation group configured in each pipeline;
the transmitting the data input by each pipeline to the logic operation groups configured in each pipeline through the data selector comprises the following steps: and transmitting the input data to the corresponding logic operation group of the data selector through each data selector.
10. An execution engine, comprising: a data selector and a logical operation unit; the logical operation unit is divided into a plurality of logical operation groups, and one logical operation group is configured for one pipeline for independent use;
the execution engine acquires data of a plurality of instructions input by a plurality of pipelines in an instruction transmitting stage, and data of one instruction is input by one pipeline;
the data selector is used for selecting the data input by each pipeline to the logic operation groups correspondingly configured for each pipeline;
the logical operation group is used for obtaining the data input by the corresponding pipeline from the data selector, performing a logical operation on the data input by the corresponding pipeline, and outputting a logical operation result.
11. The execution engine of claim 10, wherein a plurality of unit execution units form the logical operation unit, and wherein the plurality of logical operation groups are divided according to the plurality of unit execution units and the number of logical operation groups; wherein the number of logical operation groups is less than or equal to the number of the plurality of pipelines.
12. The execution engine of claim 11, wherein a combined number of unit execution units are combined to obtain a logical operation group, the combined number being determined based on the number of logical operation groups and the number of the plurality of unit execution units; the unit execution unit supports logical operation on data with unit bit width, wherein the unit bit width is the data bit width of the unit execution unit, and the data bit width of one logical operation group is the sum of the data bit widths of the combined unit execution units.
13. The execution engine of any of claims 10-12, wherein if the data bit width of the data input by the pipeline is greater than the data bit width of the correspondingly configured logical operation group, the data input by the pipeline comprises a plurality of data packets, and the data bit width of one data packet is not greater than the data bit width of the logical operation group; the data input by the pipeline is divided into the plurality of data packets according to the data bit width of the correspondingly configured logical operation group;
the data selector being used for selecting the data input by each pipeline to the logical operation group correspondingly configured for each pipeline comprises: selecting the plurality of data packets of the pipeline, in sequence, to the logical operation group configured for the pipeline;
the logical operation group being used for obtaining the data input by the corresponding pipeline from the data selector and performing the logical operation on the data input by the corresponding pipeline comprises:
obtaining the plurality of data packets of the corresponding pipeline from the data selector in sequence, and performing logical operations on the plurality of data packets in sequence; splicing the logical operation results of the plurality of data packets to obtain the logical operation result of the data input by the pipeline; wherein, after the logical operation of one data packet is completed, the logical operation result of that data packet is temporarily stored in the logical operation group, and the next data packet for logical operation is obtained from the data selector.
14. The execution engine of any of claims 10-12, wherein the data selector comprises: a plurality of data selectors, one data selector for selecting data is correspondingly configured for one logic operation group; and one data selector is used for acquiring data input by a pipeline corresponding to the logic operation grouping and transmitting the data to the corresponding logic operation grouping.
15. The execution engine of any of claims 10-12, further comprising: a plurality of result data buses corresponding in number to the plurality of pipelines, wherein one pipeline is correspondingly configured with one result data bus; and each logical operation group outputs its logical operation result to the result data bus corresponding to its configured pipeline.
16. A processor comprising an execution engine according to any of claims 10-15.
17. A chip comprising the processor of claim 16.
18. An electronic device comprising the chip of claim 17.
CN202210981393.3A 2022-08-16 2022-08-16 Instruction execution method, execution engine, processor, chip and electronic equipment Pending CN115390924A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210981393.3A CN115390924A (en) 2022-08-16 2022-08-16 Instruction execution method, execution engine, processor, chip and electronic equipment

Publications (1)

Publication Number Publication Date
CN115390924A true CN115390924A (en) 2022-11-25

Family

ID=84121421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210981393.3A Pending CN115390924A (en) 2022-08-16 2022-08-16 Instruction execution method, execution engine, processor, chip and electronic equipment

Country Status (1)

Country Link
CN (1) CN115390924A (en)

Similar Documents

Publication Publication Date Title
EP3096222B1 (en) Pipelined cascaded digital signal processing structures and methods
US9792118B2 (en) Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision filter vector processing operations with reduced sample re-fetching and power consumption, and related vector processor systems and methods
US9684509B2 (en) Vector processing engines (VPEs) employing merging circuitry in data flow paths between execution units and vector data memory to provide in-flight merging of output vector data stored to vector data memory, and related vector processing instructions, systems, and methods
US9977676B2 (en) Vector processing engines (VPEs) employing reordering circuitry in data flow paths between execution units and vector data memory to provide in-flight reordering of output vector data stored to vector data memory, and related vector processor systems and methods
US10140124B2 (en) Reconfigurable microprocessor hardware architecture
US7146486B1 (en) SIMD processor with scalar arithmetic logic units
US20070083733A1 (en) Reconfigurable circuit and control method therefor
US20150143079A1 (en) VECTOR PROCESSING ENGINES (VPEs) EMPLOYING TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION CORRELATION / COVARIANCE VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS
US7734896B2 (en) Enhanced processor element structure in a reconfigurable integrated circuit device
US9727526B2 (en) Apparatus and method of vector unit sharing
US7958179B2 (en) Arithmetic method and device of reconfigurable processor
CN115390924A (en) Instruction execution method, execution engine, processor, chip and electronic equipment
CN112074810A (en) Parallel processing apparatus
CN116974510A (en) Data stream processing circuit, circuit module, electronic chip, method and device
CN115390925A (en) Data processing method of instruction, related device and electronic equipment
US7047271B2 (en) DSP execution unit for efficient alternate modes for processing multiple data sizes
CN116796816B (en) Processor, computing chip and computing device
CN112463218A (en) Instruction emission control method and circuit, data processing method and circuit
US7007059B1 (en) Fast pipelined adder/subtractor using increment/decrement function with reduced register utilization
US8150949B2 (en) Computing apparatus
CN220208247U (en) Division operation circuit
WO2022141321A1 (en) Dsp and parallel computing method therefor
JP3144859B2 (en) Arithmetic logic unit
CN115543914A (en) Configurable multi-path SIMD execution path design method, vector operation unit and processor
CN117370721A (en) Vector processor with vector reduction method and element reduction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination