CN113591031A - Low-power-consumption matrix operation method and device

Low-power-consumption matrix operation method and device

Info

Publication number
CN113591031A
CN113591031A (application CN202111155568.7A)
Authority
CN
China
Prior art keywords: value, path, operand, preset, logic unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111155568.7A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Muxi Technology Beijing Co ltd
Original Assignee
Muxi Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Muxi Technology Beijing Co ltd filed Critical Muxi Technology Beijing Co ltd
Priority to CN202111155568.7A
Publication of CN113591031A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/16Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Computational Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Power Sources (AREA)

Abstract

The invention provides a low-power-consumption matrix operation method and device, relating to chip technology. An operand is checked against preset special values to obtain a judgment result. If the judgment result indicates that the operand is a preset special value, a preset path corresponding to that special value is obtained, where the preset path either skips the operation of the multiplication arithmetic logic unit or skips all operations after the multiplication arithmetic logic unit, depending on the special value. The operand is then processed along the preset path to obtain the operation result. By judging whether an operand is a special value and, if so, routing it along a preset path that bypasses the multiplication arithmetic logic unit, or everything after it, the scheme avoids the extra power consumed by the multiplier and register-file accesses on the original computation path, and thereby reduces the power consumption of the matrix operation unit.

Description

Low-power-consumption matrix operation method and device
Technical Field
The invention relates to chip technology, and in particular to a low-power-consumption matrix operation method and device.
Background
A matrix operation unit, as typified by the MMA (Matrix Multiply and Accumulate) operation unit, is an important operation unit of high-throughput computation chips such as GPUs and AI chips, and is widely used in GPUs, neural-network accelerators and similar chips. A large number of machine learning applications can be implemented using MMA operations. For example, the convolutional layers of a Convolutional Neural Network (CNN) are usually implemented with MMA, and the Transformer units used in NLP also rely on MMA. A general matrix operation unit can be expressed as D = A × B + C, where A is an M × K matrix, B is a K × N matrix, and C and D are both M × N matrices.
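As a point of reference for the discussion that follows, the sketch below spells out the computation D = A × B + C that such a matrix operation unit performs, written as a plain nested loop. It is illustrative only; the function name and the pure-Python formulation are assumptions, not part of the patent.

```python
def mma_reference(A, B, C):
    """Reference D = A x B + C: A is M x K, B is K x N, C and D are M x N."""
    M, K, N = len(A), len(A[0]), len(B[0])
    D = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = C[m][n]                    # start from the addend matrix C
            for k in range(K):
                acc += A[m][k] * B[k][n]     # one multiply-add per (m, n, k)
            D[m][n] = acc
    return D
```

Every element of D therefore costs K multiplications and K additions; these are the operations that the bypass paths described later avoid when an operand is a special value.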
The existing MMA operation unit usually executes matrix multiply-accumulate on the operands specified by an MMA instruction following a fixed computation flow, which results in very high power consumption of the matrix operation unit.
Disclosure of Invention
The embodiment of the invention provides a low-power-consumption matrix operation method and device, which can reduce the power consumption of a matrix operation unit.
In a first aspect of the embodiments of the present invention, a low-power-consumption matrix operation method is provided, performed before operands are processed by the multiplication arithmetic logic unit, and including:
judging the operand based on a preset special value to obtain a judgment result;
if the judgment result indicates that the operand is the preset special value, acquiring a preset path corresponding to the preset special value, wherein the preset path selectively skips the operation of the multiplication arithmetic logic unit according to the preset special value or skips all the operations after the multiplication arithmetic logic unit;
and processing the operand based on the preset path to obtain an operation result.
Optionally, in a possible implementation manner of the first aspect, the preset special values include a NaN value, a +INF value, a -INF value, a 0 value, a 1 value, and a -1 value.
Optionally, in a possible implementation manner of the first aspect, the preset path includes a first path that skips the operation of the multiplication arithmetic logic unit, and a second path that skips all operations after the multiplication arithmetic logic unit.
Optionally, in a possible implementation manner of the first aspect, if the determination result indicates that the operand is the preset special value, acquiring a preset path corresponding to the preset special value includes:
if the judgment result indicates that the operand is a 1 value or a -1 value, acquiring the first path;
and if the judgment result indicates that the operand is a NaN value, a +INF value, a -INF value or a 0 value, acquiring the second path.
Optionally, in a possible implementation manner of the first aspect, the processing the operand based on the preset path to obtain an operation result includes:
processing the 1 value based on the first path by directly outputting the other operand;
and processing the -1 value based on the first path by directly outputting the other operand with its sign bit inverted.
Optionally, in a possible implementation manner of the first aspect, the processing the operand based on the preset path to obtain an operation result includes:
processing the +INF value based on the second path by directly outputting an INF value with the same sign bit as the other operand;
processing the -INF value based on the second path by directly outputting an INF value whose sign bit is the inverse of the other operand's sign bit;
and processing the 0 value based on the second path by directly outputting the value read from the intermediate result register.
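Read together, the optional implementations above amount to a path-selection rule keyed on the special value. The sketch below restates that rule; the function name and the string labels for the two paths are illustrative assumptions, not terminology used by the patent.

```python
import math
from typing import Optional

def select_path(operand: float) -> Optional[str]:
    """Return the bypass path for a special operand, or None for the normal computation flow."""
    if operand in (1.0, -1.0):
        return "first"   # skip only the operation of the multiplication arithmetic logic unit
    if math.isnan(operand) or math.isinf(operand) or operand == 0.0:
        return "second"  # skip all operations after the multiplication arithmetic logic unit
    return None          # not a preset special value: use the original computation path
```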
In a second aspect of the embodiments of the present invention, there is provided a low power consumption matrix arithmetic device, including:
the judgment module is used for judging the operand based on a preset special value to obtain a judgment result;
a path module, configured to obtain a preset path corresponding to the preset special value if the judgment result indicates that the operand is the preset special value, where the preset path selectively skips the operation of the multiplicative arithmetic logic unit according to the preset special value, or skips all operations after the multiplicative arithmetic logic unit;
and the execution module is used for processing the operand based on the preset path and acquiring an operation result.
Optionally, in one possible implementation manner of the second aspect, the judging module is connected between the operand input unit and the multiplication arithmetic logic unit.
In a third aspect of the embodiments of the present invention, there is provided a low power consumption matrix arithmetic device, including: memory, a processor and a computer program, the computer program being stored in the memory, the processor running the computer program to perform the method of the first aspect of the invention as well as various possible aspects of the first aspect.
A fourth aspect of the embodiments of the present invention provides a readable storage medium, in which a computer program is stored, the computer program being, when executed by a processor, configured to implement the method according to the first aspect of the present invention and various possible aspects of the first aspect.
The invention provides a low-power-consumption matrix operation method and device that judge whether an operand is a special value and, if so, process it along a preset path that skips the operation of the multiplication arithmetic logic unit, or all operations after the multiplication arithmetic logic unit, thereby avoiding the extra power consumed by the multiplier and register-file accesses on the original computation path and reducing the power consumption of the matrix operation unit.
Drawings
Fig. 1 is a schematic diagram of an arithmetic unit according to an embodiment of the present invention.
FIG. 2 is a diagram of a four-stage pipeline of a multiply-accumulate unit according to an embodiment of the present invention.
Fig. 3 is a schematic flow chart of a low power consumption matrix operation method according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a first path and a second path provided by an embodiment of the invention.
Fig. 5 is a schematic structural diagram of a low power consumption matrix computing device according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a hardware structure of a low power consumption matrix arithmetic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "Comprises A, B and C" and "comprises A, B, C" mean that all three of A, B and C are included; "comprises A, B or C" means that one of A, B and C is included; and "comprises A, B and/or C" means that any one, any two, or all three of A, B and C are included.
It should be understood that, in the present invention, "B corresponding to A", "A corresponds to B", or "B corresponds to A" means that B is associated with A and B can be determined from A. Determining B from A does not mean determining B from A alone; B may be determined from A and/or other information. The matching of A and B means that the similarity of A and B is greater than or equal to a preset threshold.
As used herein, "if" may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Referring to fig. 1, which is a schematic diagram of an arithmetic unit according to an embodiment of the present invention, a general matrix arithmetic unit may be represented as D = A × B + C, where A is an M × K matrix, B is a K × N matrix, and C and D are both M × N matrices; in fig. 1, M = 4, K = 2, and N = 4.
Referring to fig. 2, which is a schematic diagram of a four-stage pipeline of a multiply-accumulate unit according to an embodiment of the present invention, a conventional MMA operation unit typically performs matrix multiply-accumulate on the operands specified by an MMA instruction following the fixed computation flow of fig. 2, and this conventional flow has the following problems:
1. the operation requires a large number of floating-point or integer multiply-add operations: M × N × K floating-point or integer multiplications and M × N × K floating-point or integer additions, which consume a large amount of energy;
2. the register file that temporarily stores the intermediate results must be accessed frequently during the operation, which also consumes a large amount of energy.
Therefore, the operation method in the prior art can cause the power consumption of the matrix operation unit to be very high.
To solve the above technical problem, referring to fig. 3, a flowchart of a low-power-consumption matrix operation method provided by an embodiment of the present invention is shown; the execution subject of the method shown in fig. 3 may be a software and/or hardware device. The execution subject of the present application may include, but is not limited to, at least one of: user equipment, network equipment, and the like. The user equipment may include, but is not limited to, a computer, a smart phone, a Personal Digital Assistant (PDA), the electronic equipment mentioned above, and the like. The network equipment may include, but is not limited to, a single network server, a server group of multiple network servers, or a cloud of numerous computers or network servers based on cloud computing, where cloud computing is a type of distributed computing in which a super virtual computer is formed from a cluster of loosely coupled computers. The present embodiment does not limit this. The low-power-consumption matrix operation method comprises the following steps S301 to S303:
S301, judging the operand based on the preset special values and acquiring a judgment result.
Specifically, in this step it is determined whether the operand is a preset special value, and a judgment result is obtained that indicates whether or not the operand is such a special value. It can be understood that, during the judgment, the server may compare the operand with the preset special values one by one, and if a comparison matches, the operand is a special value.
In practical applications, the preset special values include a NaN value, a +INF value, a -INF value, a 0 value, a 1 value, and a -1 value, where the NaN value denotes not-a-number and the +INF and -INF values denote positive and negative infinity.
It can be understood that, referring to fig. 2, there may be two operands, Amk and Bkn, and the server may check either operand against the preset special values.
It should be noted that this step is performed before the operands reach the multiplication arithmetic logic unit: since its purpose is to skip the multiplication of the operand, the judgment must take place upstream of the multiplier.
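A minimal sketch of the comparison in S301 is shown below, assuming IEEE 754 floating-point operands and using host-language predicates in place of hardware comparators; the function name is an assumption.

```python
import math

def is_preset_special_value(operand: float) -> bool:
    """S301: compare the operand against the preset special values NaN, +INF, -INF, 0, 1 and -1."""
    return (math.isnan(operand)
            or math.isinf(operand)
            or operand in (0.0, 1.0, -1.0))
```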
S302, if the judgment result indicates that the operand is the preset special value, acquiring a preset path corresponding to the preset special value, wherein the preset path selectively skips the operation of the multiplication arithmetic logic unit, or skips all operations after the multiplication arithmetic logic unit, according to the preset special value.
Specifically, in this step, after the operand is determined to be the preset special value, the corresponding preset path is found, and the operand is operated according to the preset path to skip the operation of the multiplication arithmetic logic unit or skip all the operations after the multiplication arithmetic logic unit, thereby avoiding the extra power consumption of the multiplication unit and the register file access in the original calculation path.
It can be understood that the present solution selects different paths for operation according to the different special values.
In practical applications, referring to fig. 4, the preset path may include a first path for skipping the operation of the multiply arithmetic logic unit and a second path for skipping all operations after the multiply arithmetic logic unit. That is, if it is determined that the first path is needed to process the operand, the multiplication can be skipped directly during the operation; if the second path is needed, all operations after the multiply arithmetic logic unit can be skipped.
In some embodiments, if the judgment result indicates that the operand is a 1 value or a -1 value, the first path is obtained and the operand is processed via the first path.
For example, if the operand is 1, the 1 value is processed based on the first path and the other operand is output directly. It can be understood that when one operand is 1, the product of the two operands is simply the other operand, so the scheme skips the multiplication of the two operands entirely and thereby reduces operation power consumption.
For another example, if the operand is -1, the -1 value is processed based on the first path and the other operand, with its sign bit inverted, is output directly. When one operand is -1, the product of the two operands is the other operand with its sign bit inverted, so the scheme again skips the multiplication and reduces operation power consumption.
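The -1 case only requires inverting the sign bit of the other operand rather than performing a multiplication. A bit-level sketch of that inversion for a 32-bit IEEE 754 value is shown below; the float32 width and the 0x80000000 mask are assumptions, since the patent does not fix an operand format.

```python
import struct

def flip_sign_bit_f32(x: float) -> float:
    """Invert the sign bit of a float32 value, as the first path does when one operand is -1."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]                  # reinterpret the float as raw bits
    return struct.unpack("<f", struct.pack("<I", bits ^ 0x80000000))[0]  # XOR the sign bit and convert back
```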
In other embodiments, if the judgment result indicates that the operand is a NaN value, a +INF value, a -INF value, or a 0 value, the second path is obtained and the operand is processed via the second path.
For example, if the operand is +INF, the +INF value is processed based on the second path, and an INF value with the same sign bit as the other operand is output directly.
For another example, if the operand is -INF, the -INF value is processed based on the second path, and an INF value whose sign bit is the inverse of the other operand's sign bit is output directly. When one operand is ±INF, the product of the two operands is an INF value, so the scheme skips the multiplication of the two operands entirely and reduces operation power consumption.
For another example, if the operand is 0, the 0 value is processed based on the second path, and the value read from the intermediate result register is output directly. When one operand is 0, the product of the two operands is 0, so the scheme skips the multiplication and simply reads the value from the intermediate result register, reducing operation power consumption.
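Putting the examples above together, one multiply-accumulate step with the bypasses applied to the first operand might look like the following sketch. The function name, the argument carrying the intermediate-result register value, and the NaN behaviour (propagating the NaN unchanged, which the text does not spell out) are assumptions.

```python
import math

def mac_step(a: float, b: float, intermediate: float) -> float:
    """One multiply-accumulate step, intermediate + a * b, with the special-value bypasses applied to a."""
    if a == 1.0:
        return intermediate + b        # first path: skip the multiplier, accumulate the other operand
    if a == -1.0:
        return intermediate - b        # first path: skip the multiplier, accumulate the sign-flipped operand
    if a == 0.0:
        return intermediate            # second path: re-emit the intermediate result register directly
    if math.isinf(a):
        sign = math.copysign(1.0, b)   # +INF: keep the other operand's sign bit
        if a < 0.0:
            sign = -sign               # -INF: invert the other operand's sign bit
        return sign * math.inf         # second path: emit the signed INF directly
    if math.isnan(a):
        return a                       # second path: propagate NaN (assumed, not detailed in the text)
    return intermediate + a * b        # not a special value: original computation path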
S303, processing the operand based on the preset path to obtain an operation result.
It can be understood that the scheme handles special values through the first path and the second path, skipping part of the multiplications, additions and intermediate-register accesses, and works together with the existing operation units to obtain the operation result.
Referring to fig. 5, which is a schematic structural diagram of a low power consumption matrix computing apparatus according to an embodiment of the present invention, the low power consumption matrix computing apparatus 50 includes:
the judging module 51 is configured to perform judgment processing on the operand based on a preset special value to obtain a judgment result;
a path module 52, configured to, if the determination result indicates that the operand is the preset special value, obtain a preset path corresponding to the preset special value, where the preset path selectively skips the operation of the multiplicative arithmetic logic unit according to the preset special value, or skips all operations after the multiplicative arithmetic logic unit;
and the execution module 53 is configured to process the operand based on the preset path to obtain an operation result.
Wherein the judging module 51 is connected between the operand input unit and the multiplying arithmetic logic unit to finish judging the operand before the multiplying arithmetic logic unit.
The apparatus in the embodiment shown in fig. 5 can be correspondingly used to perform the steps in the method embodiment shown in fig. 3, and the implementation principle and technical effect are similar, which are not described herein again.
Referring to fig. 6, which is a schematic diagram of a hardware structure of a low power consumption matrix computing device according to an embodiment of the present invention, the low power consumption matrix computing device 60 includes: a processor 61, memory 62 and computer programs; wherein
A memory 62 for storing the computer program, which may also be a flash memory (flash). The computer program is, for example, an application program, a functional module, or the like that implements the above method.
A processor 61 for executing the computer program stored in the memory to implement the steps performed by the apparatus in the above method. Reference may be made in particular to the description relating to the preceding method embodiment.
Alternatively, the memory 62 may be separate or integrated with the processor 61.
When the memory 62 is a device separate from the processor 61, the apparatus may further include:
a bus 63 for connecting the memory 62 and the processor 61.
The present invention also provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to implement the methods provided by the various embodiments described above.
The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general purpose or special purpose computer. An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.
In the above embodiments of the apparatus, it should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A low-power-consumption matrix operation method, performed before operands are processed by a multiplication arithmetic logic unit, comprising:
judging the operand based on a preset special value to obtain a judgment result;
if the judgment result indicates that the operand is the preset special value, acquiring a preset path corresponding to the preset special value, wherein the preset path selectively skips the operation of the multiplication arithmetic logic unit according to the preset special value or skips all the operations after the multiplication arithmetic logic unit;
and processing the operand based on the preset path to obtain an operation result.
2. The method of claim 1, wherein the preset special values include a NaN value, a +INF value, a -INF value, a 0 value, a 1 value, and a -1 value.
3. The method of claim 2, wherein the preset path comprises a first path that skips the operation of the multiplying arithmetic logic unit and a second path that skips all operations subsequent to the multiplying arithmetic logic unit.
4. The method of claim 3, wherein, if the judgment result indicates that the operand is the preset special value, obtaining a preset path corresponding to the preset special value comprises:
if the judgment result indicates that the operand is a 1 value or a -1 value, acquiring the first path;
and if the judgment result indicates that the operand is a NaN value, a +INF value, a -INF value or a 0 value, acquiring the second path.
5. The method of claim 4, wherein processing the operand based on the preset path to obtain the operation result comprises:
processing the 1 value based on the first path by directly outputting the other operand;
and processing the -1 value based on the first path by directly outputting the other operand with its sign bit inverted.
6. The method according to claim 4 or 5, wherein processing the operand based on the preset path to obtain the operation result comprises:
processing the +INF value based on the second path by directly outputting an INF value with the same sign bit as the other operand;
processing the -INF value based on the second path by directly outputting an INF value whose sign bit is the inverse of the other operand's sign bit;
and processing the 0 value based on the second path by directly outputting the value read from the intermediate result register.
7. A low power matrix arithmetic device, comprising:
the judging module is used for judging the operands based on the preset special value to obtain a judging result;
a path module, configured to obtain a preset path corresponding to the preset special value if the judgment result indicates that the operand is the preset special value, where the preset path selectively skips operations of the multiplicative arithmetic logic unit according to the preset special value, or skips all operations after the multiplicative arithmetic logic unit;
and the execution module is used for processing the operand based on the preset path and acquiring an operation result.
8. The apparatus of claim 7, wherein the decision module is coupled between the operand input unit and the multiply arithmetic logic unit.
9. A low-power matrix arithmetic device, comprising: memory, a processor and a computer program, the computer program being stored in the memory, the processor running the computer program to perform the method of any of claims 1 to 6.
10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 6.
CN202111155568.7A 2021-09-30 2021-09-30 Low-power-consumption matrix operation method and device Pending CN113591031A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111155568.7A CN113591031A (en) 2021-09-30 2021-09-30 Low-power-consumption matrix operation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111155568.7A CN113591031A (en) 2021-09-30 2021-09-30 Low-power-consumption matrix operation method and device

Publications (1)

Publication Number Publication Date
CN113591031A true CN113591031A (en) 2021-11-02

Family

ID=78242487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111155568.7A Pending CN113591031A (en) 2021-09-30 2021-09-30 Low-power-consumption matrix operation method and device

Country Status (1)

Country Link
CN (1) CN113591031A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048456A (en) * 2023-04-03 2023-05-02 摩尔线程智能科技(北京)有限责任公司 Matrix multiplier, method of matrix multiplication, and computing device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5262973A (en) * 1992-03-13 1993-11-16 Sun Microsystems, Inc. Method and apparatus for optimizing complex arithmetic units for trivial operands
WO1997012316A1 (en) * 1995-09-26 1997-04-03 Advanced Micro Devices, Inc. Floating point processing unit with forced arithmetic results
CN101689105A (en) * 2007-03-15 2010-03-31 线性代数技术有限公司 A processor exploiting trivial arithmetic operations
CN101692202A (en) * 2009-09-27 2010-04-07 北京龙芯中科技术服务中心有限公司 64-bit floating-point multiply accumulator and method for processing flowing meter of floating-point operation thereof
WO2012031177A1 (en) * 2010-09-03 2012-03-08 Advanced Micro Devices, Inc. Method and apparatus for performing floating-point division
CN109388373A (en) * 2018-10-12 2019-02-26 胡振波 Multiplier-divider for low-power consumption kernel
CN112148251A (en) * 2019-06-26 2020-12-29 英特尔公司 System and method for skipping meaningless matrix operations

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5262973A (en) * 1992-03-13 1993-11-16 Sun Microsystems, Inc. Method and apparatus for optimizing complex arithmetic units for trivial operands
WO1997012316A1 (en) * 1995-09-26 1997-04-03 Advanced Micro Devices, Inc. Floating point processing unit with forced arithmetic results
CN101689105A (en) * 2007-03-15 2010-03-31 线性代数技术有限公司 A processor exploiting trivial arithmetic operations
CN101692202A (en) * 2009-09-27 2010-04-07 北京龙芯中科技术服务中心有限公司 64-bit floating-point multiply accumulator and method for processing flowing meter of floating-point operation thereof
WO2012031177A1 (en) * 2010-09-03 2012-03-08 Advanced Micro Devices, Inc. Method and apparatus for performing floating-point division
CN109388373A (en) * 2018-10-12 2019-02-26 胡振波 Multiplier-divider for low-power consumption kernel
CN112148251A (en) * 2019-06-26 2020-12-29 英特尔公司 System and method for skipping meaningless matrix operations

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048456A (en) * 2023-04-03 2023-05-02 摩尔线程智能科技(北京)有限责任公司 Matrix multiplier, method of matrix multiplication, and computing device

Similar Documents

Publication Publication Date Title
CN107608715B (en) Apparatus and method for performing artificial neural network forward operations
US20210264273A1 (en) Neural network processor
JP5647859B2 (en) Apparatus and method for performing multiply-accumulate operations
CN100495326C (en) Array multiplication with reduced bandwidth requirement
US20190171941A1 (en) Electronic device, accelerator, and accelerating method applicable to convolutional neural network computation
CN110163357B (en) Computing device and method
CN111915001B (en) Convolution calculation engine, artificial intelligent chip and data processing method
CN108733347B (en) Data processing method and device
CN111045728A (en) Computing device and related product
CN116974868A (en) Chip power consumption estimation device, method, electronic equipment and storage medium
CN113591031A (en) Low-power-consumption matrix operation method and device
CN116795324A (en) Mixed precision floating-point multiplication device and mixed precision floating-point number processing method
CN115713104A (en) Data processing circuit for neural network, neural network circuit and processor
CN110750300A (en) Hybrid computing device based on memristor memory internal processing
CN111078286A (en) Data communication method, computing system and storage medium
CN111198714B (en) Retraining method and related product
CN111798363B (en) Graphics processor
CN115374388B (en) Multidimensional array compression and decompression method and device
CN114020476B (en) Job processing method, device and medium
Moon et al. A 32-bit RISC microprocessor with DSP functionality: Rapid prototyping
US20230098421A1 (en) Method and apparatus of dynamically controlling approximation of floating-point arithmetic operations
US20240176588A1 (en) Operation unit, processing device, and operation method of processing device
CN110770696B (en) Processing core operation suppression based on contribution estimation
TW202416185A (en) Deep fusion of kernel execution
CN111222632A (en) Computing device, computing method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20211102)