CN111124492A

CN111124492A - Instruction generation method, apparatus, instruction execution method, processor and electronic device

Info

Publication number: CN111124492A
Application number: CN201911300243.6A
Authority: CN
Inventors: 蒋宇翔; 王晓阳
Original assignee: Hygon Information Technology Co Ltd
Current assignee: Chengdu Haiguang Microelectronics Technology Co Ltd
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2020-05-08
Anticipated expiration: 2039-12-16
Also published as: CN111124492B; WO2021120712A1; WO2021120712A8

Abstract

The present application relates to an instruction generation method, an apparatus, an instruction execution method, a processor and an electronic device. The method includes: determining that an instruction execution unit supports data pass-through; when generating an instruction, setting a first flag in the i-th instruction for instructing to write the destination data of the i-th instruction into the pass-through path; A second identifier is set in the instruction to indicate that the source operand is obtained from the through path; the i-th instruction and the i+j-th instruction are sent to the instruction execution unit, so that when the instruction execution unit executes the instruction, according to the first The flag writes the result of the i-th instruction to the pass-through path, and fetches the required source operands from the pass-through path according to the second flag. By setting the first flag for instructing to write the destination data into the pass-through path and the second flag for indicating that the source operand comes from the pass-through path, the hardware is forced to realize the explicit data pass-through by means of software, and the memory is reduced. of visits.

Description

Instruction generation method and device, instruction execution method, processor and electronic equipment

Technical Field

The application belongs to the technical field of computers, and particularly relates to an instruction generation method, an instruction generation device, an instruction execution method, a processor and electronic equipment.

Background

The power consumption is a key point of attention in computing application, in typical high-intensity computing application, 70-80% of the power consumption is used by a computing unit, and in the computing unit, 50% of the power consumption is used for reading and writing data. In typical scientific computing and machine learning applications, matrix multiplication is one of the most popular use cases, and in matrix multiplication computing applications, 35-40% of the power is used to access Vector General Purpose Registers (VGPR).

Disclosure of Invention

In view of the above, an object of the present application is to provide an instruction generating method, an instruction generating apparatus, an instruction executing method, a processor, and an electronic device, so as to solve the problem that a large amount of memory access is required in typical high-strength computing applications, which results in a large amount of power consumption.

The embodiment of the application is realized as follows:

in a first aspect, an embodiment of the present application provides an instruction generating method, where the method includes: determining that the instruction execution unit supports data pass-through; when an instruction is generated, if the destination data of an ith instruction is required to be used as a source operand of an (i + j) th instruction, setting a first identifier for indicating that the destination data of the ith instruction is written into a through path in the ith instruction, and setting a second identifier for indicating that the source operand is obtained from the through path in the (i + j) th instruction; wherein i and j are positive integers; sending the ith instruction and the (i + j) th instruction to the instruction execution unit, so that the instruction execution unit writes the result of the ith instruction into a through path according to the first identifier when executing the ith instruction, and acquires a required source operand from the through path according to the second identifier when executing the (i + j) th instruction. In the embodiment of the application, the first identifier for indicating that the destination data is written into the through path and the second identifier for indicating that the source operand is from the through path are arranged in the instruction, and the hardware is forced to realize the explicit data through by means of software, so that the access times of the memory can be greatly reduced, and a large amount of power consumption is saved.

With reference to a possible implementation manner of the embodiment of the first aspect, sending the ith instruction and the (i + j) th instruction to the instruction execution unit includes: splicing the ith instruction and the (i + j) th instruction according to a generation sequence to obtain an instruction block; sending the instruction block to a decoder, so that the decoder sequentially acquires first key information in the ith instruction from the instruction block and sends the first key information to the instruction execution unit, acquires second key information in the (i + j) th instruction and sends the second key information to the instruction execution unit, wherein the first key information comprises the first identifier, and the second key information comprises the second identifier. In the embodiment of the application, the instruction blocks are obtained by splicing a plurality of instructions according to the generation sequence, so that when a plurality of instruction blocks exist, the instructions generated in the embodiment of the application can be correctly executed, and because the hardware needs to complete the execution of the whole instruction block when executing the current instruction block, the hardware can be switched to other instruction blocks, and the situation that the data direct connection has logic errors due to the fact that the hardware needs to switch to the instructions for realizing other functions in the process of executing the instructions is avoided.

In a second aspect, an embodiment of the present application further provides an instruction execution method, where the method includes: acquiring an instruction to be executed; obtaining key information in the instruction to be executed, wherein the key information comprises: source operand address information and destination address information, wherein the source operand address information is used for indicating the source of a source operand, and the destination address information is used for indicating a writing path of destination data; judging whether the source operand indicated by the source operand address information is from a through path or not; when the source operand indicated by the source operand address information is from a through path, acquiring a required source operand from the through path; judging whether a write path of the destination data indicated by the destination address information is the through path; and when the write path of the destination data indicated by the destination address information is the through path, writing the result of executing the instruction to be executed into the through path. In the embodiment of the application, the source operand address information, the destination address information and other key information in the instruction to be executed are acquired, the required source operand is directly acquired from the through path when the source operand indicated by the source operand address information is from the through path, and the instruction execution result is directly written into the through path when the write path of the destination data indicated by the destination address information is the through path, so that the access of a memory is reduced, and the power consumption is reduced.

With reference to a possible implementation manner of the embodiment of the second aspect, the determining whether the source operand indicated by the source operand address information is sourced from a direct path includes: determining whether the source operand indicated by the source operand address information originates from a through path by determining whether the source operand address information contains a second identifier; when the source operand address information contains the second identification, the source operand indicated by the source operand address information is characterized to be derived from a through path. In the embodiment of the application, the second identifier for indicating that the source operand is obtained from the through path is arranged in the instruction, so that whether the source operand indicated by the source operand address information is sourced from the through path can be quickly judged by judging whether the source operand address information contains the first identifier.

With reference to a possible implementation manner of the embodiment of the second aspect, the determining whether a write path of destination data indicated by the destination address information is the through path includes: judging whether a write path of destination data indicated by the destination address information is the through path by judging whether the destination address information contains a first identifier; and when the destination address information contains the first identifier, representing that a write path of destination data indicated by the destination address information is the through path. In the embodiment of the application, the first identifier for indicating that the target data of the instruction to be executed is written into the through path is set in the instruction, so that whether the write path of the target data indicated by the destination address information is the through path can be quickly judged by judging whether the destination address information contains the first identifier.

In a third aspect, an embodiment of the present application further provides a processor, including: an instruction execution unit and a processor core; a processor core to determine that the instruction execution unit supports data pass-through; when an instruction is generated, if the destination data of the ith instruction is required to be used as the source operand of the (i + j) th instruction, setting a first identifier for indicating that the destination data of the ith instruction is written into a through path in the ith instruction, and setting a second identifier for indicating that the source operand is obtained from the through path in the (i + j) th instruction; i is a positive integer; and the instruction execution unit is also used for sending the ith instruction and the (i + j) th instruction to the instruction execution unit; wherein i and j are positive integers; the instruction execution unit is configured to, when the ith instruction is executed, write a result of the ith instruction into a pass-through path according to the first identifier, and, when the (i + j) th instruction is executed, obtain a required source operand from the pass-through path according to the second identifier.

With reference to a possible implementation manner of the third aspect, the processor further includes a decoder, where the processor core is configured to splice the ith instruction and the (i + j) th instruction according to a generation order to obtain an instruction block, and send the instruction block to the decoder; the decoder is configured to sequentially acquire first key information in the ith instruction from the instruction block, send the first key information to the instruction execution unit, acquire second key information in the (i + j) th instruction, and send the second key information to the instruction execution unit, where the first key information includes the first identifier, and the second key information includes the second identifier.

In a fourth aspect, an embodiment of the present application further provides a processor, including: a decoder and an instruction execution unit; the decoder is used for acquiring an instruction to be executed and acquiring key information in the instruction to be executed, wherein the key information comprises: source operand address information and destination address information, wherein the source operand address information is used for indicating the source of a source operand, and the destination address information is used for indicating a writing path of destination data; and further for sending the key information to an instruction execution unit; the instruction execution unit is used for judging whether the source operand indicated by the source operand address information is from a direct path or not; when the source operand indicated by the source operand address information is from a through path, acquiring a required source operand from the through path; and further for judging whether a write path of the destination data indicated by the destination address information is the through path; and when the write path of the destination data indicated by the destination address information is the through path, writing the result of executing the instruction to be executed into the through path.

In combination with one possible implementation manner of the embodiment of the fourth aspect, the instruction execution unit is configured to determine whether the source operand indicated by the source operand address information is derived from a direct path by determining whether the source operand address information includes a second identifier; when the source operand address information contains the second identification, the source operand indicated by the source operand address information is characterized to be derived from a through path.

With reference to one possible implementation manner of the embodiment of the fourth aspect, the instruction execution unit is configured to determine whether a write path of destination data indicated by the destination address information is the through path by determining whether the destination address information includes a first identifier; and when the destination address information contains the first identifier, representing that a write path of destination data indicated by the destination address information is the through path.

In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a processor as provided in the above-mentioned third aspect embodiment and/or in connection with one possible implementation of the third aspect embodiment, or a processor as provided in the above-mentioned fourth aspect embodiment and/or in connection with any one possible implementation of the fourth aspect embodiment.

In a sixth aspect, an embodiment of the present application further provides an instruction generating apparatus, where the apparatus includes: the device comprises a determining module, a generating module and a sending module; the determining module is used for determining that the instruction execution unit supports data direct connection; the generating module is used for setting a first identifier for indicating that the destination data of the ith instruction is written into a through path and setting a second identifier for indicating that the source operand is obtained from the through path in the ith + j instruction if the destination data of the ith instruction is required to be used as the source operand of the (i + j) th instruction when the instruction is generated; i is a positive integer; a sending module, configured to send the ith instruction and the (i + j) th instruction to the instruction execution unit, so that the instruction execution unit writes a result of the ith instruction into a through path according to the first identifier when executing the ith instruction, and obtains a required source operand from the through path according to the second identifier when executing the (i + j) th instruction.

Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and drawings.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts. The foregoing and other objects, features and advantages of the application will be apparent from the accompanying drawings. Like reference numerals refer to like parts throughout the drawings. The drawings are not intended to be to scale as practical, emphasis instead being placed upon illustrating the subject matter of the present application.

Fig. 1 shows a flowchart of an instruction generation method provided in an embodiment of the present application.

Fig. 2 is a schematic diagram illustrating fields in a VOP3 instruction according to an embodiment of the present disclosure.

Fig. 3 is a logic diagram for executing a complete instruction block according to an embodiment of the present application.

Fig. 4 shows a flowchart of an instruction execution method according to an embodiment of the present application.

Fig. 5 shows a hardware schematic diagram of a matrix multiplication application provided in an embodiment of the present application.

Fig. 6 shows a functional block diagram of an instruction generating apparatus according to an embodiment of the present application.

Fig. 7 shows a block diagram of a processor according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely in the description herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Further, the term "and/or" in the present application is only one kind of association relationship describing the associated object, and means that three kinds of relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone.

In view of the current problem of large memory access and thus large power consumption in typical high-intensity computing applications, for example, in matrix multiplication applications, it is necessary to repeatedly and continuously obtain the source operands a, B and C from the vector general purpose registers VGPR, perform a + B + C computation, and then write the multiplied result into the VGPR. For example, when instruction0 (instruction 0) is executed, source operand a, source operand B, and source operand C (in this case, 0) are acquired from VGPR, and when the calculation is completed, the calculation result needs to be written into VGPR; when instruction1 is executed, source operand a, source operand B, and source operand C are obtained from VGPR (at this time, the result of the previous calculation is C1 ═ a0+ B0+ C0), and when the calculation is completed, the calculation result needs to be written into VGPR, and the above steps are repeated until the matrix calculation is completed. Therefore, the embodiment of the present application provides a new instruction, so that the access of the VGPR can be converted into data pass-through, that is, the destination data (calculation result) of the i-th instruction can be used as the source data of the i + j-th instruction, so that the hardware directly skips the VGPR writing of the destination data (calculation result) of the i-th instruction and the VGPR reading of the source operand of the i + j-th instruction. Wherein i and j are both positive integers, and the maximum value of j does not exceed the maximum number of stages of data through supported by hardware, for example, assuming that the maximum number of stages of data through supported by hardware is 2, the maximum value of j is 2.

Referring to fig. 1, steps included in a method for generating an instruction according to an embodiment of the present application will be described with reference to fig. 1.

Step S101: it is determined that the instruction execution unit supports data pass-through.

In the embodiment of the present application, before generating the Instruction, a compiler (program software) detects whether an Instruction Execution unit (Instruction Execution) supports data pass-through, and when it is determined that the Instruction Execution unit (hardware) supports data pass-through, step S102 is performed. Currently, most hardware supports 1/2 levels of implicit data pass-through: when hardware executes the instruction1 or instruction2, if it is detected that source data can be obtained from a direct path (forwarding path), reading of VGPR may be skipped, and the source data may be obtained directly from the direct path, but this approach still has a large number of memory write operations of the instruction0 (previous instruction), and there is also a case where forwarding is unsuccessful, that is, even if hardware supports implicit data direct pass, source data may not necessarily be obtained from the direct path.

Step S102: when an instruction is generated, if the destination data of an ith instruction is required to be used as a source operand of an (i + j) th instruction, a first identifier used for indicating that the destination data of the ith instruction is written into a through path is set in the ith instruction, and a second identifier used for indicating that the source operand is obtained from the through path is set in the (i + j) th instruction.

When determining that hardware supports data pass-through, when generating an instruction, if the destination data of the ith instruction needs to be used as the source operand of the (i + j) th instruction, setting a first identifier for indicating to write the destination data of the ith instruction into a pass-through path in the ith instruction.

Wherein i and j are both positive integers, and the maximum value of j does not exceed the maximum number of stages of data through supported by hardware, for example, assuming that the maximum number of stages of data through supported by hardware is 2, the maximum value of j is 2. If the target data of the previous instruction is required to be used as the source operand of the next instruction, a first identifier is set in the ith instruction when the ith instruction is generated, and a second identifier is set in the (i + 1) th instruction when the (i + 1) th instruction is generated. If the target data of the previous instruction is required to be used as the source operand of the next instruction, a first identifier is set in the ith instruction when the ith instruction is generated, and a second identifier is set in the (i + 2) th instruction when the (i + 2) th instruction is generated.

For ease of understanding, the present application is described with reference to generating a vector operation instruction VOP3(VectorOperation with 3 operands) instruction with 3 operands, as shown in fig. 2. The setting type is "110010", i.e., 110010 means that the instruction is a VOP3 instruction. The meaning of each field in the VOP3 instruction is shown in table 1.

TABLE 1

Note that the number of bits (bit width) of each field in table 1 is relatively fixed, and the position thereof may be changed, for example, operandd 0_ ID0 may no longer be [ 40: 32] which may be a value between [ 8: 0] this number of bits, and the rest of the fields are similar.

Wherein, the Operand0_ ID0 field, Operand1_ ID1 field, and Operand2_ ID2 field are used to indicate the source of the Operand, i.e. where to obtain the source Operand, that is, for the Operand0, if VF ═ 125 (second identifier) in this field, it indicates that Operand0 originated from the through path, otherwise Operand operanded 0 is obtained from the position pointed to by Operand0_ ID 0; for the Operand Operand1, if VF in the field is 125 (second identification), it indicates that the Operand Operand1 originates from a direct path, otherwise the Operand Operand1 is obtained from the position pointed to by Operand0_ ID 1; for Operand Operand2, if VF in this field is 125 (second identification), then it indicates that Operand Operand2 originates from the pass-through path, otherwise Operand Operand2 is fetched from the location pointed to by Operand0_ ID 2. The Result _ ID field and the DF field are used to indicate the write path of the destination data, and if DF is equal to 1 (first identifier), the destination data is written directly to the pass-through path, otherwise it is written to the location pointed to by the Result _ ID. Note that, the VF value for indicating that the operand is derived from the pass-through path is not limited to 125, and similarly, the value for indicating that the destination data is written into the pass-through path is not limited to 1.

As described in table 1, if the destination data of the ith instruction needs to be used as the source Operand of the (i + j) th instruction, the value in the destination pass-through DF field of the ith instruction is set to 1, and the value of the VF in the address field of the (i + j) th instruction used for indicating the source of the source Operand is set to 125, that is, if the vector pass-through VF value in the Operand0_ ID0 field is set to 125, it indicates that the Operand operanded 0 is from the pass-through path; if the vector pass VF value in the Operand0_ ID1 field is set to 125, it indicates that Operand1 originates from the pass-through path; if the vector pass VF value in the Operand0_ ID2 field is set to 125, it indicates that Operand operands 2 originate from the pass-through path. In the embodiment of the application, by setting a first identifier (DF ═ 1) for indicating that destination data is written into a pass-through path and setting a second identifier (VF ═ 125) for indicating that a source operand is derived from the pass-through path in an instruction, explicit data pass-through is forcibly realized by hardware by means of software, so that forwarding can be certainly executed.

Step S103: sending the ith instruction and the (i + j) th instruction to the instruction execution unit, so that the instruction execution unit writes the result of the ith instruction into a through path according to the first identifier when executing the ith instruction, and acquires a required source operand from the through path according to the second identifier when executing the (i + j) th instruction.

And after the ith instruction and the (i + j) th instruction are generated, the ith instruction and the (i + j) th instruction are sent to an instruction execution unit to be executed. When the ith instruction is executed, the instruction execution unit writes the result of the ith instruction into the through path according to the first identifier, and when the (i + j) th instruction is executed, the instruction execution unit acquires the required source operand from the through path according to the second identifier. For example, if DF in the ith instruction is 1, the result of the ith instruction is written directly into the pass-through path, and if VF in the operandd 0_ ID0 field in the (i + j) th instruction is 125, the source Operand operandd 0 is obtained directly from the pass-through path.

Considering that, when executing an instruction to realize the same function, an instruction to realize another function is not allowed to be inserted halfway, therefore, in order to ensure that the pass-through scheme provided by the application can be correctly implemented when there are a plurality of instructions for implementing different functions, in the embodiment of the application, or when the ith instruction and the (i + j) th instruction are sent to the instruction execution unit, the ith instruction and the (i + j) th instruction are spliced according to the generation sequence to obtain an instruction block (namely an instruction group or an instruction set), then sending the instruction block to a decoder (hardware) so that the decoder sequentially acquires first key information in the ith instruction from the instruction block and sends the first key information to an instruction execution unit, enabling the instruction execution unit to write the execution result of the ith instruction into the through path according to the first identifier in the first key information; and acquiring second key information in the (i + j) th instruction, and sending the second key information to the instruction execution unit, so that the instruction execution unit acquires the required source operand from the through path according to a second identifier in the second key information. The first key information comprises a first identifier, and the second key information comprises a second identifier. Therefore, when the instruction is executed, the switching to other instruction blocks can be carried out only after all the instructions in the current instruction block are executed. Wherein the group of commands includes a group header and a group body. The group header defines how many resources are used by the instruction group, and the instruction group comprises how many instructions; the group body contains all the instructions of the instruction group.

In order to avoid switching to other instruction blocks during the execution of the current instruction block, a blocking lock (lock) can be added to lock the Arbitration logic (Arbitration) when the hardware runs an instruction block. In each cycle, the decoder reads an instruction from the instruction block to execute, when an instruction with "BS ═ 1" (indicating the start of the instruction block) is encountered, the "lock" logic is enabled, the arbitration logic keeps track of the current "wave _ id", i.e. the arbitration logic can only select an instruction from the current instruction block. When an instruction of "BE ═ 1" (representing the end of an instruction block) is encountered, the "lock" logic will BE disabled, causing the arbitration logic to unlock, entering normal mode. In other words, the execution side, where the hardware must complete the entire instruction block, can switch to other instruction blocks, whose logic diagram is shown in fig. 3.

In order to support a new instruction, an embodiment of the present application further provides an instruction execution method, as shown in fig. 4, and the steps included therein will be described below with reference to fig. 4.

Step S201: and acquiring an instruction to be executed.

Step S202: obtaining key information in the instruction to be executed, wherein the key information comprises: source operand address information and destination address information.

After the instruction to be executed is obtained, key information in the instruction to be executed is obtained, wherein the key information comprises: source operand address information and destination address information. For example, the source Operand address information corresponding to the operatand 0_ ID0 field, the operatand 1_ ID1 field, and the operatand 2_ ID2 field in the instruction shown in fig. 1 is obtained, and the destination address information corresponding to the Result _ ID field and the DF field in the instruction shown in fig. 1 is obtained. The source operand address information is used for indicating the source of the source operand, and the destination address information is used for indicating the writing path of destination data.

Step S203: and judging whether the source operand indicated by the source operand address information is from a through path or not.

After the source operand address information is acquired, whether the source operand indicated by the source operand address information is from a through path is judged, if so, step S204 is executed, and if not, the source operand is acquired from the address pointed by the source operand address information.

The process of determining whether the indicated source operand is derived from the pass-through path may be: determining whether the source operand indicated by the source operand address information originates from a through path by determining whether the source operand address information contains a second identifier; when the source operand address information contains a second identification (VF 125), the source operand indicated by the source operand address information is characterized to be sourced from a through path. It should be noted that the VF value used to indicate that the operand is from the pass-through path is not limited to 125.

Step S204: obtaining a required source operand from the pass-through path.

When the source operand indicated by the source operand address information is derived from the pass-through path, the required source operand is obtained from the pass-through path.

Step S205: and judging whether the write path of the destination data indicated by the destination address information is the through path.

After the destination address information is acquired, it is determined whether or not the write path of the destination data indicated by the destination address information is a through path, and if so, step S206 is executed, and if not, the destination data is written to the address indicated by the destination address information.

Optionally, the process of determining whether the write path of the destination data indicated by the destination address information is the through path may be: determining whether a write path of destination data indicated by the destination address information is the through path by determining whether the destination address information contains a first flag (DF ═ 1); and when the destination address information contains the first identification, representing that the write path of the destination data indicated by the destination address information is the through path. Note that the value for instructing to write the destination data to the through path is not limited to 1.

Step S206: and writing the result of executing the instruction to be executed into the through path.

And when the write path of the destination data indicated by the destination address information is a through path, writing the result of executing the instruction to be executed into the through path.

In the embodiment of the application, when a compiler (a software program) detects that hardware can use pass-through data as source data, the compiler explicitly passes through destination data of an instruction0 to a source of an instruction1 or an instruction2 to realize data pass-through, so that the hardware skips VGPR writing of the instruction0 and VGPR reading of the instruction1 or the instruction2, and a large amount of power consumption is saved. For ease of understanding, an example of applying the method provided herein in matrix multiplication will be described below with reference to fig. 5.

It should be noted that the 3 a temporary registers (Temp Register For a) in fig. 5 are physically the same temporary Register, but 3 Unit times are delayed, and similarly, the 2B temporary registers (Temp Register For B) in the figure are physically the same temporary Register, but 2 Unit times are delayed, and2 Unit times are represented, since there is only one read port For the 2B temporary registers (Temp Register For B), three input operands (a, B, C) are required to be staggered in time, and may be delayed by the temporary Register, and finally aligned at the entry of the Arithmetic Unit (arithmetric Logic Unit, ALU), that is, the operand a obtained from VGPR is temporarily placed in the temporary Register of a at the first time, and the operand B obtained from VGPR is placed in the temporary Register of ALU at the second time, and the operand B obtained from VGPR is placed in the temporary Register of B at the third time, and the operand B obtained from VGPR is placed in the temporary Register of ALU C at the third time, and the operand C obtained from the temporary Register of ALU Register For a, and the operand C obtained from the Arithmetic Unit (ALU Register For the third time, the operand C + r — C.

As is clear from fig. 5, after the method provided by the embodiment of the present application is applied, only the read of the VGPR of the first instruction and the write of the VGPR of the last instruction are involved, and the read and write of the VGPR of a large number of intermediate instructions are omitted, so that a large amount of power consumption can be reduced. Next, explanation will be given with an example of multiplying a specific matrix a by a matrix B, in which the result of the detection 0 is used as the source operand of the detection 1 and the result of the detection 1 is used as the source operand of the detection 2 when matrix multiplication is performed. In conventional mode, instruction0 writes the result to VGPR, instruction1 reads its source operands from VGPR, instruction1 writes the result to VGPR, and instruction2 reads its source operands from VGPR. The following is represented by C_64x64＝A_64x64*B_64x64By way of example, it may be saidIt is to be understood that the matrix size of 64X64 is used herein as an example only and is not limited thereto. And assume that there are 64 arithmetic operation units (ALUs), each with a VGPR space of 200x64 bit.

The calculation process is roughly as follows:

1) matrix a is loaded in linear mode to LDS (Local Data Share, Local Share unit):

a (0,0) → LDS (Address 0); // A (0,0) is stored at the location of Address0 of LDS;

a (0,1) → LDS (Address 1); // A (0,1) is stored at the location of Address1 of LDS;

a (0,2) → LDS (Address 2); // A (0,2) is stored at the location of Address2 of LDS;

……

2) matrix B is loaded into the VGPR space as shown in Table 2.

TABLE 2

ALU0	ALU1	ALU2	……	ALU62	ALU63
						B0,0	B0,1	B0,2	……	B0,62	B0,63
B1,0	B1,1	B1,2	……	B1,62	B1,63
						……	……	……	……	……	……
B63,0	B63,1	B63,2	……	B63,62	B63,63

During calculation, elements in the matrix A are loaded into 64 ALUs one by one in parallel and multiplied by elements corresponding to columns stored in 64 vector general registers respectively, and the 64 ALUs accumulate multiplication results generated by the elements in the same row in the matrix A and the corresponding elements in the matrix B one by one in parallel in sequence to obtain all elements in the same row in the matrix C, so that multiplication operation of the matrix A and the second matrix B is completed.

3) Calculating a matrix C:

the instructions for calculating matrix C in the normal mode are as follows:

m0_ register is start _ address; the initial address of a register of/M0, wherein the register of M0 is used for storing the address of each element in the reading matrix A and automatically updating to the address corresponding to the next element after the 64 ALUs read the corresponding element in the matrix A from the LDS according to the current address of the register of M0 in parallel;

//-----------------------------------------

// Calculate the first row of Matrix C (first row of calculation Matrix C):

// C (0,0) is calculated on ALU _ Index0 ALU _ Index ═ 0(ALU0 calculates C (0,0)).

// C (0,1) is calculated on ALU _ Index1 ALU _ Index ═ 1(ALU1 calculates C (0,0)).

//......

The execution instruction for each ALU to compute a corresponding element in the first row of the matrix C is as follows:

Block_Start::C(0,ALU_Index)＝LDS_Direct(M0_register)*B(0,ALU_Index)；

C(0,ALU_Index)＝LDS_Direct(M0_register)*B(1,ALU_Index)+C(0,ALU_Index)；

C(0,ALU_Index)＝LDS_Direct(M0_register)*B(2,ALU_Index)+C(0,ALU_Index)；

C(0,ALU_Index)＝LDS_Direct(M0_register)*B(3,ALU_Index)+C(0,ALU_Index)；

C(0,ALU_Index)＝LDS_Direct(M0_register)*B(4,ALU_Index)+C(0,ALU_Index)；

......

Block_End::C(0,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+

C(0，ALU_Index)；

//-----------------------------------------

// calculating the second row of Matrix C:

// C (1,0) is calculated on ALU _ Index0(ALU0 calculates C (1,0)).

// C (1,1) is calculated on ALU _ Index1(ALU1 calculates C (1,1)).

//......

The execution instruction for each ALU to compute a corresponding element in the second row of the matrix C is as follows:

Block_Start::C(1,ALU_Index)＝LDS_Direct(M0_register)*B(0,ALU_Index)；

C(1,ALU_Index)＝LDS_Direct(M0_register)*B(1,ALU_Index)+C(1,ALU_Index)；

C(1,ALU_Index)＝LDS_Direct(M0_register)*B(2,ALU_Index)+C(1,ALU_Index)；

C(1,ALU_Index)＝LDS_Direct(M0_register)*B(3,ALU_Index)+C(1,ALU_Index)；

C(1,ALU_Index)＝LDS_Direct(M0_register)*B(4,ALU_Index)+C(1,ALU_Index)；

......

Block_End::C(1,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+

C(1,ALU_Index)；

......

//-----------------------------------------

// Calculate the last row of Matrix C:

// C (63,0) is calculated on ALU _ Index0(ALU0 calculates C (63,0)).

// C (63,1) is calculated on ALU _ Index1(ALU1 calculates C (63,1)).

//......

The execution instruction for each ALU to compute a corresponding element in the last row of the matrix C is as follows:

Block_Start::C(63,ALU_Index)＝LDS_Direct(M0_register)*B(0,ALU_Index)；

C(63,ALU_Index)＝LDS_Direct(M0_register)*B(1,ALU_Index)+C(63,ALU_Index)；

C(63,ALU_Index)＝LDS_Direct(M0_register)*B(2,ALU_Index)+C(63,ALU_Index)；

C(63,ALU_Index)＝LDS_Direct(M0_register)*B(3,ALU_Index)+C(63,ALU_Index)；

......

Block_End::C(63,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+

C(63,ALU_Index)；

referring to the above instruction table, it can be seen that: each row of the computation matrix C requires 64 instructions, so the total number of instructions is 64x64 — 4096. Each instruction is executed once on each thread, so the total number of executions is 64x64x 64. The first instruction of the instruction block for calculating each row of matrix C is, for example, as follows:

C(63,ALU_Index)＝LDS_Direct(M0_register)*B(0,ALU_Index)；

there is one VGPR read and one VGPR write, and such an instruction occurs a total of 64 times, so there are 64x64 reads and 64x64 writes. Other lines of the instruction block, for example, are as follows:

C(63,ALU_Index)＝LDS_Direct(M0_register)*B(1,ALU_Index)+C(63,ALU_Index)；

there are two VGPR reads and one VGPR write, and this instruction occurs 63x64 times in total, so there are 2x63x64x64 reads and 63x64x64 writes. A summary of the reading and writing of VPGR is shown in Table 3.

TABLE 3

By using the explicit vector pass-through technique provided by the embodiment of the present application, the instruction for calculating the matrix C is as follows:

M0_register＝start_address；

//-----------------------------------------

// Calculate the first row of Matrix C (first row of calculation Matrix C):

//C(0,0)is calculated on ALU_Index0:ALU_Index＝0.

//C(0,1)is calculated on ALU_Index1:ALU_Index＝1.

//......

//-----------------------------------------

Block_Start::Forwarding＝LDS_Direct(M0_register)*B(0,ALU_Index)；

Forwarding＝LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding；

......

Block_End::C(0,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding；

//-----------------------------------------

// calculating the second row of Matrix C:

//C(1,0)is calculated on ALU_Index0.

//C(1,1)is calculated on ALU_Index1.

//...........

//-----------------------------------------

Block_Start::Forwarding＝LDS_Direct(M0_register)*B(0,ALU_Index)；

Forwarding＝LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding；

......

Block_End::C(1,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding；

......

//-----------------------------------------

// Calculate the last row of Matrix C:

//C(63,0)is calculated on ALU_Index0.

//C(63,1)is calculated on ALU_Index1.

//...........

//-----------------------------------------

Block_Start::Forwarding＝LDS_Direct(M0_register)*B(0,ALU_Index)；

Forwarding＝LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding；

Forwarding＝LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding；

......

Block_End::C(63,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding；

referring to the above instruction table, it can be seen that: each row of the computation matrix C requires 64 instructions, so the total number of instructions is 64x64 — 4096. Each instruction is executed once on each thread, so the total number of executions is 64x64x 64. The last instruction of the instruction block for each row of the computation matrix C is, for example, as follows:

C(63,ALU_Index)＝LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding；

here, there is one VGPR read and one VGPR write, and such an instruction occurs a total of 64 times, so there are 64x64 reads and 64x64 writes. Other lines of the instruction block, for example, are as follows:

Forwarding＝LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding；

here, there is one VGPR read, and such an instruction occurs a total of 64 times, thus a total of 63x64x64 reads. A summary of the reading and writing of VPGR is shown in Table 4.

TABLE 4

In summary, in the typical matrix multiplication example described above, the number of VGPR reads and writes is from 3x2 using the explicit vector pass technique of the present application¹⁸Optimization to 2¹⁸And is reduced to about 1/3, so that a great deal of energy consumption can be saved.

As shown in fig. 6, an embodiment of the present application further provides an instruction generating apparatus 100, including: a determination module 110, a generation module 120, and a transmission module 130.

A determining module 110, configured to determine that the instruction execution unit supports data pass-through.

A generating module 120, configured to, when an instruction is generated, if it is necessary to use destination data of an ith instruction as a source operand of an (i + j) th instruction, set, in the ith instruction, a first identifier for indicating to write the destination data of the ith instruction into a through path, and set, in the (i + j) th instruction, a second identifier for indicating to obtain the source operand from the through path; i is a positive integer.

A sending module 130, configured to send the ith instruction and the (i + j) th instruction to the instruction execution unit, so that the instruction execution unit writes a result of the ith instruction into a through path according to the first identifier when executing the ith instruction, and obtains a required source operand from the through path according to the second identifier when executing the (i + j) th instruction.

Optionally, the sending module 130 is configured to splice the ith instruction and the (i + j) th instruction according to a generation order to obtain an instruction block; sending the instruction block to a decoder, so that the decoder sequentially acquires first key information in the ith instruction from the instruction block and sends the first key information to the instruction execution unit, acquires second key information in the (i + j) th instruction and sends the second key information to the instruction execution unit, wherein the first key information comprises the first identifier, and the second key information comprises the second identifier.

The instruction generating apparatus 100 provided in the embodiment of the present application has the same implementation principle and the same technical effect as those of the foregoing method embodiments, and for the sake of brief description, reference may be made to corresponding contents in the foregoing method embodiments for parts that are not mentioned in the apparatus embodiments.

As shown in fig. 7, fig. 7 is a block diagram illustrating a structure of a processor 200 according to an embodiment of the present disclosure. The processor 200 includes: a processor core 210 (kernel), a decoder 220, and an instruction execution unit 230. The processor core 210, decoder 220, and instruction execution unit 230 are connected by a bus interconnect.

The processor core 210 has program code embodied therein, which when executed, generates instructions, and, accordingly, the processor core 210 is configured to determine that the instruction execution unit 230 supports data pass-through; when an instruction is generated, if the destination data of the ith instruction is required to be used as the source operand of the (i + j) th instruction, setting a first identifier for indicating that the destination data of the ith instruction is written into a through path in the ith instruction, and setting a second identifier for indicating that the source operand is obtained from the through path in the (i + j) th instruction; i is a positive integer; and is further configured to send the ith instruction and the (i + j) th instruction to the instruction execution unit 230; wherein i and j are positive integers.

The instruction execution unit 230 is configured to, when the ith instruction is executed, write a result of the ith instruction into a pass-through path according to the first identifier, and, when the (i + j) th instruction is executed, obtain a required source operand from the pass-through path according to the second identifier.

Optionally, the processor core is further configured to splice the ith instruction and the (i + j) th instruction according to a generation sequence to obtain an instruction block; the block of instructions is sent to the decoder 220. Correspondingly, the decoder 220 is configured to sequentially obtain first key information in the ith instruction from the instruction block, send the first key information to the instruction execution unit 230, obtain second key information in the (i + j) th instruction, and send the second key information to the instruction execution unit 230. Wherein the first key information includes the first identifier, and the second key information includes the second identifier.

In addition, the decoder 220 is further configured to obtain an instruction to be executed, and obtain key information in the instruction to be executed, where the key information includes: source operand address information and destination address information, wherein the source operand address information is used for indicating the source of a source operand, and the destination address information is used for indicating a writing path of destination data; and also for issuing the critical information to the instruction execution unit 230. Accordingly, the instruction execution unit 230 is configured to: judging whether the source operand indicated by the source operand address information is from a through path or not; when the source operand indicated by the source operand address information is from a through path, acquiring a required source operand from the through path; judging whether a write path of the destination data indicated by the destination address information is the through path; and when the write path of the destination data indicated by the destination address information is the through path, writing the result of executing the instruction to be executed into the through path.

Optionally, the instruction execution unit 230 is configured to determine whether the source operand indicated by the source operand address information originates from a direct path by determining whether the source operand address information contains a second identifier; when the source operand address information contains the second identification, the source operand indicated by the source operand address information is characterized to be derived from a through path. Optionally, the instruction execution unit 230 is configured to determine whether a write path of destination data indicated by the destination address information is the through path by determining whether the destination address information contains a first identifier; and when the destination address information contains the first identifier, representing that a write path of destination data indicated by the destination address information is the through path.

The processor 200 may be an integrated circuit chip having signal processing capability. The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor 200 may be any conventional processor or the like.

The embodiment of the application also provides electronic equipment comprising the processor, and the electronic equipment can be equipment such as a computer and a server.

The present embodiment also provides a non-volatile computer-readable storage medium (hereinafter referred to as a storage medium), where the storage medium stores a computer program, and when the computer program is executed by the processor 200, the computer program executes the steps included in the instruction generating method and the instruction executing method in the above embodiments.

It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for generating instructions, wherein the method comprises:

Determine that the instruction execution unit supports data pass-through;

When generating an instruction, if the destination data of the i-th instruction needs to be used as the source operand of the i+j-th instruction, the i-th instruction is set to indicate that the destination data of the i-th instruction is to be used as the source operand of the i+j-th instruction. Write the first identifier of the straight-through path, and set the second identifier in the i+jth instruction to indicate that the source operand is obtained from the straight-through path; wherein, i and j are both positive integers;

Send the i-th instruction and the i+j-th instruction to the instruction execution unit, so that the instruction execution unit executes the i-th instruction according to the first identifier The result of the i-th instruction is written into the pass-through path, and when the i+j-th instruction is executed, the required source operand is obtained from the pass-through path according to the second identifier.

2. The method according to claim 1, wherein sending the i-th instruction and the i+j-th instruction to the instruction execution unit comprises:

splicing the i-th instruction and the i+j-th instruction according to the generation order to obtain an instruction block;

sending the instruction block to the decoder, so that the decoder sequentially obtains the first key information in the i-th instruction from the instruction block, and sends the first key information to the instruction an execution unit, and acquires second key information in the i+jth instruction, and sends the second key information to the instruction execution unit, where the first key information includes the first identifier, and the The second key information includes the second identifier.

3. An instruction execution method, wherein the method comprises:

Get the command to be executed;

Acquire key information in the instruction to be executed, the key information includes: source operand address information and destination address information, the source operand address information is used to indicate the source of the source operand, and the destination address information Used to indicate the write path of the destination data;

Determine whether the source operand indicated by the source operand address information originates from the through path;

When the source operand indicated by the source operand address information comes from the through path, obtain the required source operand from the through path;

Determine whether the write path of the destination data indicated by the destination address information is the through path;

When the writing path of the destination data indicated by the destination address information is the through path, the result of executing the to-be-executed instruction is written into the through path.

4. The method according to claim 3, wherein judging whether the source operand indicated by the source operand address information originates from a through path, comprising:

Determine whether the source operand indicated by the source operand address information originates from a through path by judging whether the source operand address information contains a second identifier;

When the source operand address information includes the second identifier, it indicates that the source operand indicated by the source operand address information originates from a straight-through path.

5. The method according to claim 3, wherein judging whether the write path of the destination data indicated by the destination address information is the through path, comprising:

judging whether the writing path of the destination data indicated by the destination address information is the through path by judging whether the destination address information contains the first identifier;

When the destination address information includes the first identifier, the writing path representing the destination data indicated by the destination address information is the through path.

6. A processor, characterized in that, comprising:

instruction execution unit;

The processor core is used to determine that the instruction execution unit supports data pass-through; and when generating an instruction, if the destination data of the i-th instruction needs to be used as the source operand of the i+j-th instruction, then in the i-th instruction A first flag is set in the instruction to indicate that the destination data of the i-th instruction is written into the pass-through path, and set in the i+j-th instruction to indicate that the source operand is obtained from the pass-through path i is a positive integer; and is also used to send the i-th instruction and the i+j-th instruction to the instruction execution unit; wherein, i and j are both positive integers;

The instruction execution unit is configured to, when executing the i-th instruction, write the result of the i-th instruction into the pass-through path according to the first identifier, and, when executing the i+j-th instruction When , the required source operand is obtained from the through path according to the second identifier.

7 . The processor according to claim 6 , wherein the processor further comprises a decoder, and the processor core is configured to convert the i-th instruction, the i+j-th instruction, and the i+j-th instruction according to the generation order. The instructions are spliced to obtain an instruction block, and the instruction block is sent to the decoder;

The decoder is configured to sequentially acquire the first key information in the i-th instruction from the instruction block, send the first key information to the instruction execution unit, and acquire the i-th instruction + the second key information in the j instructions, and send the second key information to the instruction execution unit, the first key information includes the first identifier, and the second key information includes the first key Second identification.

8. A processor, characterized in that, comprising:

The decoder is used to obtain the instruction to be executed, and obtain the key information in the instruction to be executed, the key information includes: source operand address information and destination address information, the source operand address information is used to indicate the source operation The source of the data, the destination address information is used to indicate the writing path of the destination data; and also used to send the key information to the instruction execution unit;

The instruction execution unit is configured to determine whether the source operand indicated by the source operand address information originates from a through path; source operand required for path acquisition; and also used to determine whether the write path of the destination data indicated by the destination address information is the through path; the write path of the destination data indicated by the destination address information In the case of the straight-through path, the result of executing the instruction to be executed is written into the straight-through path.

9 . The processor according to claim 8 , wherein the instruction execution unit is configured to determine the source indicated by the source operand address information by judging whether the source operand address information contains a second identifier. 10 . Whether the operand originates from the through path; when the source operand address information includes the second identifier, it indicates that the source operand indicated by the source operand address information originates from the through path.

10 . The processor according to claim 8 , wherein the instruction execution unit is configured to judge whether the destination address information indicates the destination data by judging whether the destination address information contains a first identifier. 11 . Whether the write path is the through path; when the destination address information includes the first identifier, the write path representing the destination data indicated by the destination address information is the through path.

11. An electronic device, comprising: the processor according to claim 6 or 7, or the processor according to any one of claims 8-10.

12. An instruction generation device, characterized in that the device comprises:

a determining module for determining that the instruction execution unit supports data pass-through;

The generation module is used for generating the instruction, if the destination data of the i-th instruction needs to be used as the source operand of the i+j-th instruction, then the i-th instruction is set to indicate that the i-th instruction is to be used as the source operand. The purpose data of the instruction is written into the first identifier of the through path, and the i+jth instruction is set to indicate that the source operand is obtained from the through path. The second identifier; i is a positive integer;

A sending module, configured to send the i-th instruction and the i+j-th instruction to the instruction execution unit, so that when the instruction execution unit executes the i-th instruction, according to the A flag writes the result of the i-th instruction into the pass-through path, and, when the i+j-th instruction is executed, obtains a required source operand from the pass-through path according to the second flag.