WO2021182222A1

WO2021182222A1 - Computation device and computation method

Info

Publication number: WO2021182222A1
Application number: PCT/JP2021/008131
Authority: WO
Inventors: 成司西村
Original assignee: 株式会社エヌエスアイテクス; 株式会社デンソー
Priority date: 2020-03-11
Filing date: 2021-03-03
Publication date: 2021-09-16
Also published as: JP7393519B2; JPWO2021182222A1

Abstract

A computation device (10) is provided with a pipeline that includes a plurality of function units (12) having the same function and performs computation by chaining. This computation device (10) is provided with: an execution assessment part (22) for assessing whether there is a function unit (12) that is not executing commands; and a control part (24) that, when one of the function units (12) is executing commands, causes a function unit (12) currently not executing commands to execute the commands that can be executed in parallel to execution of commands by the one of the function units (12).

Description

Arithmetic device and arithmetic method

Cross-reference to related applications

This application is based on Patent Application No. 2020-042169 filed on March 11, 2020 and claims the benefit of priority thereto, and the entire contents of that patent application are incorporated herein by reference.

This disclosure relates to an arithmetic unit and an arithmetic method.

Conventionally, arithmetic units have been used that include a plurality of functional units, each being an arithmetic element that processes instructions, and that perform pipeline processing (hereinafter also referred to as a "multi-stage arithmetic pipeline") on a series of input instructions using the plurality of functional units.

For example, Patent Document 1 discloses a configuration in which pipeline processing is performed by a plurality of functional units such as a multiplier MUL and an adder ADD. In the configuration described in Patent Document 1, the multiplier MUL multiplies each pair of corresponding elements of the simultaneously input values "a" and "b" and sequentially outputs the products to the adder ADD, and the adder ADD sequentially adds each product to its own previous output.

Patent Document 1: Japanese Unexamined Patent Publication No. 2012-69081

Here, an arithmetic unit that executes instructions (calculations) by chaining using a multi-stage arithmetic pipeline incurs overhead for the start-up and shut-down of the pipeline, and when executing a complicated operation that combines addition and multiplication, the pipeline may be started up and shut down multiple times. Further, when executing a complicated calculation, there may be a functional unit that has to wait for another functional unit to complete its calculation before performing its own. That is, in some cases, calculation by chaining using a pipeline cannot be performed efficiently.

An object of the present disclosure is to provide an arithmetic unit and an arithmetic method capable of more efficiently performing arithmetic operations by chaining using a pipeline.

The arithmetic unit of one aspect of the present disclosure includes a pipeline including a plurality of functional units having the same function and performs calculations by chaining. The arithmetic unit includes a determination unit that determines the presence or absence of a functional unit that is not executing an instruction, and a control unit that, when any of the functional units is executing an instruction, causes a functional unit that is not executing an instruction to execute an instruction that can be executed in parallel with that instruction execution.

According to the present invention, calculations by chaining using a pipeline can be performed more efficiently.

The above and other objectives, features, and advantages of the present disclosure will become clearer from the following detailed description with reference to the accompanying drawings. The drawings are as follows.
FIG. 1 is a schematic configuration diagram of an arithmetic unit according to an embodiment.
FIG. 2 is a schematic diagram showing the chaining of the embodiment.
FIG. 3 is a schematic diagram showing the chaining of the embodiment.
FIG. 4 is a flowchart showing the flow of the chaining calculation process of the embodiment.

Hereinafter, embodiments of the present disclosure will be described with reference to the drawings. It should be noted that the embodiments described below show an example of the case where the present disclosure is carried out, and the present disclosure is not limited to the specific configuration described below. In carrying out the present disclosure, a specific configuration according to the embodiment may be appropriately adopted.

FIG. 1 is a schematic configuration diagram of the arithmetic unit 10 of the present embodiment. The arithmetic unit 10 of the present embodiment includes a pipeline including a plurality of functional units 12, and performs arithmetic operations by chaining.

The functional units 12 are arithmetic elements such as, for example, an LD that copies data from a memory to a register, an ST that copies data from a register to a memory, an ADD (adder) having an addition function, a MUL (multiplier) having a multiplication function, and a DIV (divider) having a division function. These functional units 12 are provided in a multi-pipe vector arithmetic unit (hereinafter referred to as the "vector arithmetic unit") 20 that executes pipeline processing.

Next, the pipeline processing by the arithmetic unit 10 of the present embodiment will be described.

First, as an example, assume a vector arithmetic unit 20 configured with four pipeline stages (this count is hereinafter referred to as "the number of pipeline stages") and five functional units (LD, ST, ADD, MUL, DIV). In the following description, the hardware state of the vector arithmetic unit 20 having four pipeline stages is referred to as the default mode.

This default mode can be reconfigured into mode 1 and mode 2 with a smaller number of pipeline stages. In mode 1, the number of pipeline stages is two, and the number of functional units 12 is eight (LD, ST, two ADDs, two MULs, and two DIVs). In mode 2, the number of pipeline stages is one, and the number of functional units 12 is 14 (LD, ST, 4 ADD, 4 MUL, 4 DIV). As described above, in the mode 1 and the mode 2 in which the number of stages is smaller than that in the default mode, there are a plurality of functional units 12 having the same functions such as ADD, MUL, and DIV.

As an example, in the arithmetic unit 10 of the present embodiment, all variations of the hardware mode (combinations of the number of pipeline stages and the number of functional units 12) are defined in advance, and the current hardware mode is held as a flag value in a dedicated register (mode register). The hardware mode is set by a dedicated control instruction.
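The predefined hardware modes and the mode register can be pictured with the following minimal Python sketch. The stage and unit counts come from the description above; the flag encoding (0, 1, 2) and the names HardwareMode, ModeRegister, and set_mode are assumptions made only for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareMode:
    stages: int   # number of pipeline stages
    units: dict   # number of functional units per type


# All variations are defined in advance. The flag values 0/1/2 are hypothetical.
MODES = {
    0: HardwareMode(4, {"LD": 1, "ST": 1, "ADD": 1, "MUL": 1, "DIV": 1}),  # default mode
    1: HardwareMode(2, {"LD": 1, "ST": 1, "ADD": 2, "MUL": 2, "DIV": 2}),  # mode 1
    2: HardwareMode(1, {"LD": 1, "ST": 1, "ADD": 4, "MUL": 4, "DIV": 4}),  # mode 2
}


class ModeRegister:
    """Holds the current hardware mode as a flag value."""

    def __init__(self):
        self.flag = 0  # start in the default mode

    def set_mode(self, flag):
        # Stands in for the dedicated control instruction (hypothetical interface).
        if flag not in MODES:
            raise ValueError(f"undefined hardware mode flag: {flag}")
        self.flag = flag

    @property
    def mode(self):
        return MODES[self.flag]


reg = ModeRegister()
reg.set_mode(1)
print(reg.mode.stages, sum(reg.mode.units.values()))
```

Keeping the mode table fixed and storing only a flag mirrors the statement that every variation is defined in advance and selected at run time.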

As described above, the arithmetic unit 10 of the present embodiment can reconfigure the pipeline from the default mode by including a plurality of functional units 12 having the same function. Then, when a functional unit 12 is executing an instruction, the arithmetic unit 10 of the present embodiment causes another functional unit 12 that is not executing an instruction to execute an instruction that can be executed in parallel. For this purpose, the arithmetic unit 10 of the present embodiment has a function (the Out-of-Order function) of determining, when a certain functional unit 12 is executing an instruction, whether or not another functional unit 12 is executing an instruction.

In order to realize the Out-of-Order function, the functional unit 12 of the present embodiment has a state register that holds an identifier that identifies whether or not an instruction is being executed. Based on this state register, the arithmetic unit 10 determines the functional unit 12 that is not executing the instruction, and assigns the instruction to each functional unit 12. In the present embodiment, as an example, the state register of the functional unit 12 that is executing the instruction is "1", and the state register of the functional unit 12 that is not executing the instruction is "0".

Since the total number of functional units 12 that contribute to the calculation (the product of the number of pipeline stages and the number of corresponding functional units 12) is constant, the state register only needs enough bits to cover that total. In the above example, the number of pipeline stages is at most four and there are three types of functional units 12 that exist in multiples (ADD, MUL, DIV), so 3 × 4 = 12 bits suffice for the state register. In addition, to execute the Out-of-Order function, it is also necessary to determine whether the LD and ST functional units 12 are executing, so the LD and ST each also have a state register. The LD and ST state registers, however, need only 1 bit each regardless of the number of pipeline stages.
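The arithmetic in this paragraph can be checked with a short Python sketch. The mode figures are taken from the description; the variable names, and the summing of the two register groups into one total, are illustrative assumptions rather than anything stated in the disclosure.

```python
MAX_STAGES = 4                            # maximum number of pipeline stages
REPLICATED_TYPES = ("ADD", "MUL", "DIV")  # types that exist in multiples
SINGLE_TYPES = ("LD", "ST")               # one unit each in every mode

# (number of pipeline stages, units per replicated type) for each mode
MODES = {"default": (4, 1), "mode 1": (2, 2), "mode 2": (1, 4)}


def contributing_units(stages, units_per_type):
    """Product of the stage count and the per-type unit count."""
    return stages * units_per_type


# The product is the same in every mode.
per_type_totals = {name: contributing_units(*cfg) for name, cfg in MODES.items()}

replicated_bits = len(REPLICATED_TYPES) * MAX_STAGES  # 3 x 4
load_store_bits = len(SINGLE_TYPES) * 1               # 1 bit each for LD and ST
total_bits = replicated_bits + load_store_bits        # combined total (assumption)

print(per_type_totals, replicated_bits, load_store_bits, total_bits)
```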

Next, as an example, the case of chaining the operation "((A + B) + C) × D" in mode 1 will be described with reference to FIG. 2. In mode 1, as described above, there are two each of the ADD, MUL, and DIV functional units 12. The vector register width corresponding to each value (A, B, C, D) in FIG. 2 and in FIG. 3 described later is fixed, and is, for example, 64 bits or 32 bits.

In mode 1, first, a process (ADD instruction) for writing the operation result of "A + B" to a register (intermediate register) is issued. At the time of issuing the ADD instruction, none of the functional units 12 is executing the instruction yet, so the state registers of all the functional units 12 are "0". By issuing the ADD instruction here, the state register of the functional unit 12 corresponding to the first ADD is set to "1".

Then, in the next instruction cycle, the ADD instruction that adds "C" to the result of the first ADD instruction is issued without waiting for the completion of the first ADD instruction. Since the state register of the second ADD is "0", it is determined that there is an ADD that is not executing an instruction, the state register of the second ADD is set to "1", and chaining with the first ADD instruction is performed.

Furthermore, in the next instruction cycle, a multiplication instruction of "D" for "(A + B) + C" is issued without waiting for the end of the two preceding ADD instructions. At this time, since the state register of the MUL is "0", the state register of the first MUL is set to "1", and chaining with the two preceding ADD instructions is performed.
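The issue sequence just described for "((A + B) + C) × D" in mode 1 can be traced with a minimal Python sketch. It models only the state registers; the data flow of the chaining itself is not modelled, and the dictionary layout and the issue function are illustrative assumptions.

```python
# State registers in mode 1: two units each of ADD, MUL, and DIV.
# 0 = not executing an instruction, 1 = executing an instruction.
state = {"ADD": [0, 0], "MUL": [0, 0], "DIV": [0, 0]}


def issue(op):
    """Assign op to the first idle unit of its type and mark that unit busy.

    Returns the index of the chosen unit, or None if every unit is busy.
    """
    registers = state[op]
    for index, busy in enumerate(registers):
        if busy == 0:
            registers[index] = 1
            return index
    return None


# One instruction per instruction cycle, none waiting for the previous one to finish:
# A + B, then (A + B) + C, then ((A + B) + C) x D.
trace = [(op, issue(op)) for op in ("ADD", "ADD", "MUL")]

print(trace)
print(state)
```

After the three issues, both ADD state registers and the first MUL state register hold 1, which matches the sequence in the text.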

Next, the case where the operation "(A + B) × (C + D)" is chained in mode 1 will be described with reference to FIG. 3. First, the operations "(A + B)" and "(C + D)" can be executed independently. Therefore, one ADD performs the operation "(A + B)" and another ADD performs the operation "(C + D)". That is, these two ADD instructions are issued at the same time without either waiting for the other to finish. In mode 1, hardware resources can thus be assigned to these two ADD instructions simultaneously.

In the example of FIG. 3, one MUL functional unit 12 is vacant. Here, if the next instruction (subsequent instruction) has a MUL instruction that has no dependency on the above chaining operation, the MUL instruction can be executed at the same time. As a result, the operating rate of the arithmetic function unit can be made higher than that in the default mode, and higher effective performance can be exhibited even with the same hardware resources (number of arithmetic units) as before.

Further, each functional unit 12 of the present embodiment includes a mask register 30 (see FIGS. 2 and 3) indicating the execution state of an instruction. A functional unit 12 that executes the next instruction using the calculation results of a plurality of other functional units 12 executes the next instruction after the mask registers 30 of those other functional units 12 indicate the end of execution of their instructions.

More specifically, the mask register 30 is provided corresponding to the vector register length, and its bits are rewritten from "0" to "1" as the operation progresses. When the calculation in a functional unit 12 is completed, all bits of its mask register 30 are "1". Whether or not the calculation by the plurality of functional units 12 is completed is then determined by the AND (logical product) of the mask registers 30 of the plurality of functional units 12 that executed the previous instructions. That is, the functional unit 12 that executes the next instruction does not perform its calculation until the calculation by the plurality of functional units 12 that execute the previous instructions is completed.

That is, in the example of FIG. 3, when the operation of "A + B" is completed, all the register areas become "1", and when the operation of "C + D" is completed, all the register areas become "1". Then, when the calculation of "A + B" and the calculation of "C + D" are completed, the calculation of "(A + B) × (C + D)" is started.
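The completion check based on the mask registers 30 can be sketched in Python as follows. The vector length of 4, the one-bit-per-element layout, and the names MaskRegister and producers_done are assumptions chosen for illustration; the disclosure only states that the mask register corresponds to the vector register length and that the logical AND of the mask registers is used.

```python
VECTOR_LENGTH = 4  # hypothetical number of vector elements


class MaskRegister:
    """One bit per vector element: 0 = not yet computed, 1 = computed."""

    def __init__(self, length=VECTOR_LENGTH):
        self.bits = [0] * length

    def mark_done(self, index):
        self.bits[index] = 1


def producers_done(masks):
    """AND the mask registers position by position, then require every bit to be 1."""
    combined = [all(column) for column in zip(*(m.bits for m in masks))]
    return all(combined)


add_ab = MaskRegister()  # the ADD computing A + B
add_cd = MaskRegister()  # the ADD computing C + D

# The two ADDs progress asynchronously: A + B finishes first.
for i in range(VECTOR_LENGTH):
    add_ab.mark_done(i)
add_cd.mark_done(0)
ready_early = producers_done([add_ab, add_cd])  # C + D is still in progress

for i in range(1, VECTOR_LENGTH):
    add_cd.mark_done(i)
ready_late = producers_done([add_ab, add_cd])   # both ADDs have finished

print(ready_early, ready_late)
```

In this sketch the MUL for "(A + B) × (C + D)" would start only once producers_done returns True.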

As a result, even if the plurality of functional units 12 execute the previous instructions asynchronously, the functional unit 12 that executes the next instruction waits for those functional units 12 to complete their instructions before performing its calculation, so the next instruction can be executed without causing an error.

Even if the number of stages of the pipeline that executes the next instruction is larger than the number of stages of the pipeline that executes the previous instruction, for example when the preceding pipeline has two stages and the subsequent pipeline has four stages, the mask register 30 determines the degree of progress of the instruction (calculation) by the preceding pipeline as described above. Therefore, even when the operation by chaining is performed by combining pipelines having different numbers of stages, the instruction can be executed without causing an error.

In this way, the arithmetic unit 10 of the present embodiment reconfigures the four-stage pipeline into a pipeline having a smaller number of stages (a two-stage pipeline), and can thereby reduce the overhead of starting up and shutting down the pipeline. In this case, the two-stage pipeline is provided with a plurality of (at least two) functional units 12 having the same function.

Specifically, when the calculation "((A + B) + C) × D" is performed by the above-mentioned four-stage pipeline, the calculation "A + B = E" is first performed by chaining, after which "(E + C) × D" must be performed in a new chaining. For this reason, in a four-stage pipeline, the pipeline must be started up and shut down twice, incurring the overhead twice. Further, when the calculation result of "A + B = E" is temporarily stored in the memory before the calculation "(E + C) × D" is performed, the result "E" must be read back from the memory, which makes the processing inefficient.

On the other hand, the two-stage pipeline has two functional units 12 of each type having the same function (ADD, MUL, DIV). When a certain functional unit 12 is executing an instruction, it is determined whether or not there is another functional unit 12 that is not executing an instruction (the Out-of-Order function), and that other functional unit 12 is made to execute an executable instruction in parallel with the instruction execution by the first functional unit 12.

As a result, in a two-stage pipeline, the operation "((A + B) + C) × D" can be performed in a single chaining, so the pipeline start-up and shut-down (the overhead) occur only once. Further, the two-stage pipeline does not need to temporarily store the calculation result in the memory as the four-stage pipeline does, so more efficient processing becomes possible.

For this reason, by reducing the number of pipeline stages and forming a pipeline including a plurality of functional units 12 having the same function, as in the present embodiment, the overhead time required to start up and shut down the pipeline can be reduced and the efficiency of calculation can be improved.

In order to execute chaining by such a pipeline, as shown in FIG. 1, the arithmetic unit 10 of the present embodiment includes an execution determination unit 22 and a control unit 24.

The execution determination unit 22 is a component that executes the Out-of-Order function, and determines whether or not there is a functional unit 12 that is not executing an instruction. The execution determination unit 22 determines the functional unit 12 that is not executing the instruction based on the state register.

The control unit 24 causes a functional unit 12 that is not executing an instruction to execute an executable instruction. Specifically, when any of the functional units 12 is executing an instruction, the control unit 24 of the present embodiment causes a functional unit 12 that is not executing an instruction to execute an instruction that can be executed in parallel with that instruction execution.

The arithmetic unit 10 controls according to the number of stages of the pipeline, in other words, according to the mode. For example, in a four-stage pipeline (default mode), the Out-of-Order function is not executed, and in a two-stage or one-stage pipeline (mode 1 or mode 2), the Out-of-Order function is executed. In other words, the Out-of-Order function is performed on a pipeline having a plurality of functional units 12 having the same function. The mode is appropriately selected according to a series of operations to be executed by the vector arithmetic unit 20.

The present disclosure is not limited to this, and the Out-of-Order function may also be executed in the default mode. That is, simultaneous execution may be possible not only among functional units 12 having the same function but also among any of the different functional units 12 (LD, ST, MUL, ADD, DIV) whose instructions have no dependency on one another.

The arithmetic unit 10 of the present embodiment stores an instruction waiting to be executed by the functional unit 12 in the instruction waiting buffer 14. Then, when there is a functional unit 12 capable of executing the instructions stored in the instruction waiting buffer 14, the instructions are sequentially read from the instruction waiting buffer 14 and executed by the functional unit 12. This makes it possible to efficiently assign the instruction to the functional unit 12 that is not executing the instruction.
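A minimal Python sketch of the instruction waiting buffer 14 follows. It tracks only unit availability through the state registers; dependency checking between instructions is deliberately left out, and the Dispatcher class with its submit, dispatch, and complete methods is a hypothetical interface, not one described in the disclosure.

```python
from collections import deque


class Dispatcher:
    """Queues waiting instructions and hands them to idle functional units."""

    def __init__(self, unit_counts):
        # One state register bit per unit: 0 = idle, 1 = executing.
        self.state = {op: [0] * count for op, count in unit_counts.items()}
        self.waiting = deque()  # stands in for the instruction waiting buffer

    def _find_idle(self, op):
        for index, busy in enumerate(self.state[op]):
            if busy == 0:
                return index
        return None

    def submit(self, op):
        self.waiting.append(op)

    def dispatch(self):
        """Read the buffer in order and start every instruction that has an idle unit."""
        started = []
        still_waiting = deque()
        while self.waiting:
            op = self.waiting.popleft()
            unit = self._find_idle(op)
            if unit is None:
                still_waiting.append(op)  # no idle unit yet, keep it queued
            else:
                self.state[op][unit] = 1
                started.append((op, unit))
        self.waiting = still_waiting
        return started

    def complete(self, op, unit):
        self.state[op][unit] = 0


d = Dispatcher({"ADD": 2, "MUL": 2})
for op in ("ADD", "ADD", "ADD", "MUL"):
    d.submit(op)
first = d.dispatch()   # the third ADD stays queued while the MUL is started
d.complete("ADD", 0)
second = d.dispatch()  # the queued ADD now finds an idle unit

print(first, list(d.waiting), second)
```

Letting the MUL start while the third ADD is still queued is one possible reading of assigning instructions to idle units; a real design would also have to confirm that the bypassing instruction has no dependency on the queued one.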

FIG. 4 is a flowchart showing the flow of the chaining calculation process for executing the Out-of-Order function. This chaining calculation process is executed by a program stored in a recording medium included in the arithmetic unit 10. When this program is executed, the method corresponding to the program is executed.

Since the chaining operation process shown in FIG. 4 executes the Out-of-Order function, the mode register is checked before the process is executed to determine whether or not the current mode allows the Out-of-Order function to be executed. If the mode does not allow the Out-of-Order function to be executed, the normal chaining operation process that does not execute the Out-of-Order function is executed. Alternatively, the pipeline is reconfigured into a mode in which the Out-of-Order function can be executed.

First, in step S100, the execution determination unit 22 checks the state register of each functional unit 12 and determines the presence or absence of a functional unit 12 that is not executing an instruction (Out-of-Order function). The functional unit 12 whose presence or absence is determined here is one capable of executing the given instruction. For example, when the given instruction is an ADD instruction, the execution determination unit 22 determines whether or not there is a functional unit 12 capable of executing this ADD instruction.

In the next step S102, if there is a functional unit 12 that is not executing an instruction, the process proceeds to step S106, whereas if there is no such functional unit 12, the process proceeds to step S104.

In step S104, since there is no functional unit 12 capable of executing the instruction, the instruction is queued in the instruction waiting buffer 14 as an instruction waiting to be executed, and the process returns to step S100.

In step S106, the control unit 24 assigns an instruction to the functional unit 12 that is not executing the instruction.

In the next step S108, the control unit 24 sets the state register of the functional unit 12 to which the instruction is assigned to "1".

In the next step S110, the functional unit 12 executes the instruction assigned to it.

In the next step S112, the control unit 24 determines whether or not the functional unit 12 has completed the execution of the assigned instruction, and the process proceeds to step S114 in the case of an affirmative determination and to step S116 in the case of a negative determination. When there are a plurality of functional units 12 executing instructions, it is determined in step S112 for each functional unit 12 whether or not the execution of its instruction is completed.

In step S114, the control unit 24 sets the state register of the functional unit 12 that has completed the instruction to "0", and proceeds to step S116.

In step S116, the control unit 24 determines whether or not there is a next instruction, and if there is a next instruction, the process returns to step S100 and each step is executed for the next instruction.

On the other hand, if it is determined in step S116 that there is no next instruction, all of the input series of arithmetic instructions have been completed, and this chaining is terminated.
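The flow of steps S100 to S116 can be approximated by the cycle-based Python sketch below. The fixed latency per instruction, the one-instruction-per-cycle issue rate, and the function name run_chaining are assumptions introduced for illustration; dependencies between instructions and the mask registers 30 are not modelled.

```python
def run_chaining(instructions, unit_counts, latency=3):
    """Issue instructions in order to idle units and return (event log, final state).

    latency is a hypothetical fixed execution time in cycles for every instruction.
    """
    state = {op: [0] * n for op, n in unit_counts.items()}  # state registers
    remaining = {}              # (op, unit) -> cycles left for the running instruction
    pending = list(instructions)
    log = []
    cycle = 0
    while pending or remaining:
        # S112 / S114: for each executing unit, check completion and clear its state register.
        for op, unit in [key for key, left in remaining.items() if left == 0]:
            state[op][unit] = 0
            del remaining[(op, unit)]
            log.append((cycle, "done", op, unit))
        # S116: is there a next instruction?
        if pending:
            op = pending[0]
            idle = [i for i, busy in enumerate(state[op]) if busy == 0]  # S100
            if idle:                                 # S102: an idle unit exists
                unit = idle[0]                       # S106: assign the instruction
                state[op][unit] = 1                  # S108: state register set to 1
                remaining[(op, unit)] = latency      # S110: execution starts
                pending.pop(0)
                log.append((cycle, "issue", op, unit))
            # otherwise S104: the instruction keeps waiting and S100 is retried next cycle
        for key in remaining:
            remaining[key] -= 1
        cycle += 1
    return log, state


log, final_state = run_chaining(["ADD", "ADD", "ADD"], {"ADD": 2}, latency=3)
issues = [event for event in log if event[1] == "issue"]
print(issues)
print(final_state)
```

With two ADD units and three ADD instructions, the third instruction waits until the first unit has finished, which corresponds to the loop through step S104 back to step S100.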

As described above, when any of the functional units 12 is executing an instruction, the arithmetic unit 10 of the present embodiment causes a functional unit 12 that is not executing an instruction to execute an instruction that can be executed in parallel with that instruction execution. As a result, the arithmetic unit 10 of the present embodiment can perform arithmetic operations by chaining using a pipeline more efficiently.

Although the present disclosure has been described above using the above-described embodiment, the technical scope of the present disclosure is not limited to the scope described in the above-described embodiment. Various changes or improvements can be made to the above embodiments without departing from the gist of the disclosure, and the modified or improved forms are also included in the technical scope of the present disclosure.

For example, in the above embodiment, the embodiment in which the four-stage pipeline is reconstructed into a two-stage pipeline or a one-stage pipeline has been described, but the present disclosure is not limited to this. For example, a pipeline having five or more stages may be reconfigured into a pipeline having a smaller number of stages. Further, the vector arithmetic unit 20 may be composed of, for example, a pipeline fixed in two stages without having the concept of restructuring the pipeline.

Claims

1. An arithmetic unit (10) that includes a pipeline including a plurality of functional units (12) having the same function and performs arithmetic operations by chaining, the arithmetic unit comprising:
a determination unit (22) that determines the presence or absence of a functional unit that is not executing an instruction; and
a control unit (24) that, when any of the functional units is executing an instruction, causes a functional unit that is not executing an instruction to execute an instruction that can be executed in parallel with the instruction execution by that functional unit.
2. The arithmetic unit according to claim 1, wherein
each functional unit includes a mask register (30) indicating the execution state of an instruction, and
a functional unit that executes a next instruction using the calculation results of a plurality of other functional units executes the next instruction after the mask registers of the plurality of other functional units indicate the end of execution of their instructions.
3. The arithmetic unit according to claim 1 or 2, wherein
an identifier that identifies whether or not an instruction is being executed is set for each functional unit, and
the determination unit determines the functional unit that is not executing an instruction based on the identifier.
4. The arithmetic unit according to any one of claims 1 to 3, wherein
an instruction waiting to be executed by a functional unit is stored in a storage medium (14), and
when there is a functional unit capable of executing the instruction stored in the storage medium, the instruction is sequentially read from the storage medium and executed by that functional unit.
5. A calculation method by chaining using a pipeline including a plurality of functional units having the same function, the calculation method comprising:
a first step of determining the presence or absence of a functional unit that is not executing an instruction; and
a second step of, when any of the functional units is executing an instruction, causing a functional unit that is not executing an instruction to execute an instruction that can be executed in parallel with the instruction execution by that functional unit.