CN112862086A - Neural network operation processing method and device and computer readable medium - Google Patents

Neural network operation processing method and device and computer readable medium

Info

Publication number
CN112862086A
CN112862086A
Authority
CN
China
Prior art keywords
function
neural network
processing method
value
input value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011574026.9A
Other languages
Chinese (zh)
Inventor
李坤傧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Lanyang Intelligent Technology Co ltd
Original Assignee
Nanjing Lanyang Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Lanyang Intelligent Technology Co ltd filed Critical Nanjing Lanyang Intelligent Technology Co ltd
Priority to CN202011574026.9A priority Critical patent/CN112862086A/en
Publication of CN112862086A publication Critical patent/CN112862086A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a neural network operation processing method, a device and a computer readable medium, comprising the following steps: receiving the input values of two functions of the operation; checking the value range of the input value of at least one of the functions; and selecting one of the input values of the two functions to execute the operation according to the check result. The invention provides a neural network operation processing method and device that reduce power consumption, shorten processing time and improve performance, and allow hardware to be maximally reused across activation-normalization, normalization-activation and activation-weight architecture designs.

Description

Neural network operation processing method and device and computer readable medium
Technical Field
The invention discloses a neural network operation processing method, a neural network operation processing device and a computer readable medium, and relates to the technical fields of low-power design and neural network computation.
Background
With the rapid development of artificial intelligence technology, neural network computation has been widely and successfully applied in data-intensive fields such as images, speech and text. In specific application scenarios such as Deep Neural Networks (DNN) and Convolutional Neural Networks (CNN), multiplication is one of the most basic operations, so the power consumption and processing time of multiplication operations usually account for a large part of the total power consumption and processing time.
In the prior art, a common conventional method is to check whether one of the two inputs to the multiplication operation is zero. For example, the Eyeriss accelerator checks whether the input from the feature map is zero, to prevent the MAC datapath from switching when the input is zero. If the input is N-bit data, an N-bit comparator is required; the N-bit data in the Eyeriss accelerator is 16-bit data. When the input data size is a runtime variable, such as 16-bit, 8-bit, or 4-bit data, the comparator must also be capable of comparing variable bit-length data. See reference 1 for details: Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid-State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017.
To achieve deeper neural networks, such as a 1001-layer network, and contrary to the traditional "post-activation" scheme, Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun proposed the "pre-activation" scheme for the weight layer in reference 2: "Identity Mappings in Deep Residual Networks". As shown in FIG. 1, where FIG. 1(a) is a "post-activation" diagram and FIG. 1(b) is a "pre-activation" diagram, the activation function ReLu (rectified linear unit) is executed after BN (batch normalization). In reference 2, the weight layer is called a convolutional layer. Each of the blocks shown in FIG. 1, such as BN, ReLu, Weight, and Addition, is a layer of the neural network model; "layer" and "function" are used interchangeably. Wherein:
ReLu function: f(x) = max(0, x); if the input x is negative, the output of ReLu is f(x) = 0, otherwise the output of ReLu is x.
BN (batch normalization): for specific details, see reference 3: Ioffe, S., Szegedy, C., Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: ICML (2015).
Since there are other types of normalization, the calculation can be summarized as: g(x) = (x - μ) x λ. A simplified illustration is shown in FIG. 2. When A is the input (x - μ) and B is the normalized scaling factor λ, the output of the normalization-activation is f(C) = f(B x A), where f is the activation function (ReLu in this case).
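As a plain-software illustration of the normalization-activation computation above (a hedged sketch only, not the claimed hardware; the names norm_act, mu and lam are assumptions for this example):

    def norm_act(x, mu, lam):
        # A = x - mu is the mean-subtracted input; B = lam is the scaling factor.
        a = x - mu
        b = lam
        c = b * a              # C = B x A
        return max(0.0, c)     # f(C), with f = ReLu

The sign check described later predicts whether C will be negative without first computing the product.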
Furthermore, in conventional methods, such as the method used in Eyeriss, the PE result needs to be stored in a memory, such as the Global Buffer in Eyeriss; the data is then read from the memory, ReLu is executed, and the ReLu result is stored back in memory (e.g., DRAM in Eyeriss). Eyeriss performs run-length coding (RLC) on the ReLu results before storing them into DRAM. Eyeriss then reads back the compressed ReLu results, performs RLC decoding, stores the decoded results into the Global Buffer together with the filter weights, and finally performs multiplication or multiply-add/multiply-accumulate using a PE array with multiple MACs. This approach requires storing the ReLu results in memory and then reading them from memory to perform the "Weight layer" processing, which relatively lengthens the processing time.
In short, in the prior art, performing large numbers of multiplication calculations typically imposes heavy demands on computation, memory footprint and bandwidth, which places high requirements on hardware implementations of large-scale neural networks.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the defects of the prior art, the invention provides a neural network operation processing method, a device and a computer readable medium, which reduce the power consumption and processing time of multiplication operations by checking the sign or value-range information of at least one of the inputs.
The invention adopts the following technical scheme for solving the technical problems:
In a first aspect, the present invention discloses a neural network operation processing method, including: receiving input values of two functions of the operation; checking a value range of an input value of at least one of the functions; and determining the execution strategy of the function according to the check result.
Further, the checking the value range of the input value of the at least one function includes checking a sign of the input value of the at least one function, specifically checking sign information or a sign bit of the input value of the function; the operation comprises a multiplication operation.
Further, the determining the execution policy of the function includes: if the range of the input value of the checked function is a positive value, one of the two functions is executed in a first mode, otherwise, one of the two functions is executed in a second mode.
Executing one of the two functions in the first mode comprises: performing a multiplication operation.
Executing one of the two functions in the second mode comprises: assigning the output of one of the two functions to a set constant value.
Further, the method further comprises: predicting the range of the operation result by checking the value range of the input value of at least one function; if the range of the predicted operation result is a positive value, the function is executed, otherwise, the function is not executed.
The determining the execution policy of the function specifically includes: if the value range of the input value of the checked function indicates that the input value of the checked function is a negative value, the function is not executed; and executing the function if the value range of the input value of the checked function indicates that the input value of the checked function is a positive value.
If the function is not executed, the result is the constant 0.
The determining the execution policy of the function may also include: if the value range of the input value of the checked first function is a positive value, selecting a first coefficient as the input value of a second function; and if the range of the input value of the checked first function is a negative value, selecting the second coefficient as the input value of the second function.
The performing of the operation comprises multiplying the input value of the examined function with the selected coefficient.
Further, the executing of the function further comprises: if the value range of the input value of the checked first function is a positive value, selecting a first coefficient as an input and multiplying the input value of the first function by the selected first coefficient; if the range of the input value of the checked first function is negative, selecting a first coefficient and a second coefficient as inputs and multiplying the input value of the first function by the selected first coefficient and the selected second coefficient.
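For illustration only (a hedged sketch, not limiting the claims; select_and_multiply and its parameter names are invented for this example), the coefficient-selection policy above can be expressed as:

    def select_and_multiply(x, first_coeff, second_coeff):
        # Check the value range (sign) of the checked function's input x,
        # then multiply by the coefficient selected for the second function.
        coeff = first_coeff if x >= 0 else second_coeff
        return x * coeff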
As a preferred embodiment of the present invention, the multiplication operation is implemented as a multiplier, a shifter, a series of adders and shifters, or a series of AND gates and adders.
As a preferred embodiment of the present invention, the two functions are a normalization function and an activation function, respectively. Wherein the normalization function comprises: a layer normalization function, an instance normalization function, a group normalization function, or a switchable normalization function; the activation function includes a ReLu function, a PReLu function, a ReLu6 function, or a ReLuN function.
In a second aspect, the present invention also discloses a neural network operation processing apparatus, including a memory, a processor and a computer program stored on the memory and operable on the processor, where the processor includes a plurality of multipliers, and is characterized in that the processor implements the following steps when executing the computer program: receiving input values of two functions of the operation; checking a value range of an input value of at least one of the functions; and determining the execution strategy of the function according to the check result.
In a third aspect, the invention also discloses a computer readable medium having a non-volatile program code executable by a processor, the program code causing the processor to execute the neural network operation processing method.
Compared with the prior art, the invention adopting the technical scheme has the following technical effects: a neural network operation processing method and apparatus for reducing power consumption, shortening processing time and improving performance are provided.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
In order to make the aforementioned and other objects, features and advantages of the invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.
The accompanying drawings are described as follows.
Drawings
FIG. 1 is a schematic diagram of a "post-activation" scheme and a "pre-activation" scheme in the prior art.
Fig. 2 is a diagram illustrating the execution of the activation function ReLu after BN in the prior art.
FIG. 3 is a schematic of the process of the present invention.
Fig. 4 is a schematic diagram of a neg indicator.
Fig. 5 is a schematic diagram of an exemplary PE array.
Fig. 6 is an exemplary circuit schematic for skip detection.
FIG. 7 is a schematic diagram of the application of "ReLu-Weight" fusion.
FIG. 8 is a schematic diagram of the "ReLu-Weight" fusion method in the present invention.
FIG. 9 is another exemplary circuit schematic for skip detection.
FIG. 10 is a schematic diagram of the "PReLu-Weight" fusion method in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
The technical scheme of the invention is further explained in detail below with reference to the accompanying drawings:
FIG. 3 shows a schematic diagram of the processing method of the present invention, where the normalization function is a BN function and the activation function is a ReLu function; the ReLu function may be replaced by other types of activation functions such as PReLu, ReLu6, and ReLuN, and the BN function may be replaced by other types of normalization functions such as layer normalization, instance normalization, group normalization, or switchable normalization.
The first design innovation point of the invention is as follows: when the activation function ReLu is executed after the BN function is executed, the proposed method is applied to reduce power consumption and processing time. The method comprises the following specific steps:
When applying the ReLu function to the multiplication result C = A x B, i.e., f(C) = f(A x B) as shown in FIG. 2, the proposed method checks the sign of A and the sign of B: if they differ, C will be negative and f(C) will be zero. Thus, unnecessary multiplication calculations can be avoided merely by checking the signs of the two inputs.
In one specific embodiment of the present invention, A = -1 and B = 7 are shown in 4-bit 2's complement form in the table below.
Then C = A x B = -7 and f(C) = 0. In 2's complement, the sign bit of A is 1'b1 and the sign bit of B is 1'b0; the two sign bits differ.
(Table: in 4-bit 2's complement, A = -1 is encoded as 1111 and B = 7 as 0111.)
As shown in FIG. 4, the exemplary detection circuit for checking whether C is negative (neg) can be implemented as a simple XOR gate. If the neg indicator is true (i.e., neg = 1'b1), indicating that C will be negative, the multiplication used to calculate C = A x B can be avoided or skipped to reduce power consumption.
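A behavioural sketch of this sign check (assuming ordinary signed integers; in hardware the check is just the XOR of the two sign bits shown in FIG. 4):

    def relu_mul(a, b):
        # neg indicator: XOR of the two sign bits.
        neg = (a < 0) ^ (b < 0)
        if neg:
            return 0       # multiplication avoided/skipped; f(C) forced to 0
        return a * b       # signs agree, so C = A x B is non-negative

For A = -1 and B = 7, neg is true and the multiplier never fires.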
In some implementations, the multiplication operation requires multiple cycles. For example, two-stage pipelined multipliers are implemented in Eyeriss, described in the background. Thus, avoiding the multiplication also shortens the processing time.
In another embodiment, a bit-serial multiplier is employed, so skipping the multiplication reduces the number of clock cycles the bit-serial multiplier needs to perform the operation. The bit-serial multiplier may be implemented with only a shifter, handling power-of-two multiplications; it may also employ an add-shift or AND-gate-and-shift architecture to handle arbitrary coefficients rather than only powers of two.
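A hedged sketch of such an add-shift bit-serial multiply (an unsigned operand b and the name bit_serial_mul are assumptions for this example); skipping the whole operation when neg is asserted saves all of its clock cycles, not just one:

    def bit_serial_mul(a, b, nbits=8):
        # One "clock cycle" per bit of b, using only an AND gate,
        # an adder and a shifter.
        acc = 0
        for i in range(nbits):
            if (b >> i) & 1:
                acc += a << i
        return acc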
In another embodiment, as shown in FIG. 5, a PE array with multiply or multiply-add/multiply-accumulate (MAC) units operating in parallel is used. If all the neg indicators (neg1, neg2, neg3, neg4) indicate that C1-C4 will be negative, so that f(C1)-f(C4) are all 0, processing time can be reduced since no multiply or MAC operation needs to be performed. If any neg indicator N indicates that its CN will be positive, the associated MAC unit still needs to compute CN, so the total processing time cannot be reduced; however, the power consumption of the other MAC units, whose neg indicators indicate negative results, can still be reduced.
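A software sketch of this PE-array behaviour (inputs_a, weights_b and acc are illustrative names; real hardware gates the MAC datapaths rather than branching):

    def pe_array_step(inputs_a, weights_b, acc):
        # One neg indicator per MAC lane, as in FIG. 5.
        negs = [(a < 0) ^ (b < 0) for a, b in zip(inputs_a, weights_b)]
        if all(negs):
            return acc                 # all lanes skip: the whole cycle is saved
        for i, (a, b) in enumerate(zip(inputs_a, weights_b)):
            if not negs[i]:            # lanes flagged neg stay idle (power saving)
                acc[i] += a * b
        return acc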
The example in the table below shows A = -1 and B = 7 in 4-bit sign-magnitude form. In this example, since C = -7, f(C) is 0.
(Table: in 4-bit sign-magnitude, A = -1 is encoded as 1001 and B = 7 as 0111.)
The present method can still use the simple XOR gate described above to generate the neg indicator from the sign bit of A and the sign bit of B. Numerical systems other than 2's complement and sign-magnitude may also be employed.
The proposed method can also be combined with a check to determine if the input is zero. An example of such a detection circuit is shown in fig. 6.
In another embodiment, input B is designed as an unsigned data type. In this case, if the circuit is designed specifically for this use case, the XOR gate can be omitted and the sign bit of A used directly as the neg indicator.
The second design innovation point of the invention is the optimization method proposed for the "pre-activation" of the weight layer, namely executing the activation function ReLu (rectified linear unit) after the BN function. Two pre-activation operations are shown in FIG. 7, namely BN1→ReLu1 and BN2→ReLu2, where BN1 and BN2 denote the first and second BN layers, and ReLu1 and ReLu2 denote the first and second ReLu layers, respectively.
We now introduce another optimization method for the second aspect of the invention. In conventional designs, layer Weight1 and layer BN2 can be fused so as to optimize the multiplication amount of the two layers. Applying our proposed method to layer ReLu2 and layer Weight2 not only reduces data access but also reduces the power consumption of executing multiplications.
As mentioned in the background of the present application, conventional methods, such as those used in Eyeriss, require storing PE results to memory (e.g., the Global Buffer in Eyeriss), reading the data from memory, executing ReLu, and then storing the ReLu results to memory (e.g., DRAM in Eyeriss). Eyeriss performs run-length coding (RLC) on the ReLu results before storing them into DRAM. Eyeriss then reads back the compressed ReLu results, performs RLC decoding, stores the decoded results into the Global Buffer together with the filter weights, and finally performs multiplication or multiply-add/multiply-accumulate using a PE array with multiple MACs.
The conventional approach performs the ReLu function after the multiplier-adder unit, while the present invention performs layer fusion of the "ReLu layer" and the "Weight layer". Furthermore, the method uses a very simple circuit to perform the ReLu function before the arithmetic unit, such as a multiplier or a multiplier-adder unit. With this fusion, the invention does not need to store the results of the ReLu function in memory and then read them back from memory to execute the "Weight layer" processing. The approach works equally well for other activation functions used in the design.
Consider a 1 x 1 convolution as a weight layer with a 4-channel M x N input feature map. The output feature for position (0,0) is:
y0(0,0) = w10*x0(0,0) + w11*x1(0,0) + w12*x2(0,0) + w13*x3(0,0);
where w10-w13 are the weights of the first kernel in this weight layer. Typically, a weight layer has multiple kernels.
x0(0,0), x1(0,0), x2(0,0), x3(0,0) are the input feature map data at position (0,0).
FIG. 8 illustrates this layer fusion for one embodiment, where the layer preceding the weight layer is the ReLu layer. A is the input of the ReLu layer, B is the weight of the Weight layer, and f() is the ReLu function. With such layer fusion, only a single memory access is needed to read A, and the ReLu result f(A) is provided directly as one of the inputs to the multiply or multiply-add/multiply-accumulate operation. FIG. 8 is a simplified diagram that does not show the accumulation portion of the weight layer. When either f(A) or B is zero, the multiplication step may be skipped.
As previously mentioned, the ReLu function is f(x) = max(0, x); that is, if the input x is negative, the output of the ReLu function is f(x) = 0, otherwise the output of the ReLu function is x.
Detecting whether f(A) is zero can be done by checking whether A is less than or equal to 0, as shown in FIG. 9(a). Another exemplary implementation of this detection re-uses the aforementioned XOR gate of FIG. 6 in the manner shown in FIG. 9(b). Furthermore, if the circuit is designed specifically for this use case, the XOR gate can be omitted and the sign bit of A used directly as the neg indicator.
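Tying this to the 1 x 1 convolution example above, a hedged sketch of the fused "ReLu-Weight" computation (fused_relu_weight and its argument names are invented for this illustration):

    def fused_relu_weight(a_vals, weights):
        # a_vals: pre-ReLu inputs x0..x3; weights: kernel weights w10..w13.
        y = 0
        for a, w in zip(a_vals, weights):
            if a <= 0 or w == 0:   # skip signal: f(a) = 0 or the weight is zero
                continue           # multiply skipped; no f(a) stored to memory
            y += w * a             # f(a) = a when a > 0
        return y

For position (0,0) this yields y0(0,0) directly from A and the weights, with no intermediate f(A) written to or read from memory.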
If there are multiple PEs, for example 8 PEs, then there will be skip0-skip7. If all of these skip signals indicate that an operation is to be skipped, we can not only reduce power consumption but also skip processing cycles. In this case, look-ahead circuits may be implemented to check the signs of several A values in advance. Since the sign-check circuit is very simple, a corresponding look-ahead circuit is also easy to implement.
For example, using A(s,t) to represent the input of the s-th PE in cycle t:
A(0,0), A(1,0), A(2,0), ..., A(7,0) represent inputs A of PE0-PE7 in cycle 0;
A(0,1), A(1,1), A(2,1), ..., A(7,1) represent inputs A of PE0-PE7 in cycle 1;
A(0,2), A(1,2), A(2,2), ..., A(7,2) represent inputs A of PE0-PE7 in cycle 2;
A(0,3), A(1,3), A(2,3), ..., A(7,3) represent inputs A of PE0-PE7 in cycle 3.
If all these skip signals indicate that the operation is to be skipped, four cycles can be saved, thereby improving performance.
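A minimal sketch of that look-ahead check over the 8-PE, 4-cycle window (the layout a_sched[t][s] = A(s,t) is an assumption of this example):

    def lookahead_skip(a_sched):
        # True when every scheduled input maps to f(A) = 0,
        # so all cycles in the window can be dropped.
        return all(a <= 0 for cycle in a_sched for a in cycle)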
In addition to ReLu, other types of activation functions may also benefit from this layer fusion, for example the parameterised ReLu (PReLu). In another embodiment of the present invention, a PReLu function is employed, wherein:
f(x) = max(αx, x), α ∈ [0, 1); or f(x) = max(αx, x), α ∈ (0, 1).
(Equivalently: f(x) = x for x ≥ 0; f(x) = αx for x < 0.)
In the above embodiment, the "PReLu-Weight" fusion calculation C = B x f(A) can be rewritten as:
C = B x A, for A ≥ 0;
C = α x B x A, for A < 0.
One implementation is shown in FIG. 10(a). α x B can be computed in the pipeline, thus eliminating the need to first store the intermediate result B x A to memory and read it back to execute the PReLu function, i.e., α x (B x A).
In another embodiment, since B and α are constants, α x B need not be calculated at runtime; it can be pre-calculated offline as B'. Furthermore, the sign of A can be checked first in order to read out either B or B' from the memory. As shown in FIG. 10(b), the data read from the memory is B or B', selected according to the sign of A. The neg and skip circuits shown in FIG. 9 can be used in this embodiment.
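A hedged sketch of this "PReLu-Weight" fusion with the offline-precomputed B' (the names below are illustrative):

    # Offline, once: b_prime = alpha * b (both are constants at inference time).
    def fused_prelu_weight(a, b, b_prime):
        coeff = b if a >= 0 else b_prime   # single memory read, chosen by sign of A
        return coeff * a                   # one multiply instead of two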
If the fusion method of the invention is not adopted, two multiplication operations are needed to obtain the final C, namely:
C = B x f(A), where
f(A) = A, for A ≥ 0;
f(A) = α x A, for A < 0.
Similarly, the PReLu function can be used for normalization-activation fusion. When A is the input and B is the normalized scaling factor, the output of normalization-activation is f(C) = f(B x A):
f(C) = B x A, for B x A ≥ 0;
f(C) = α x B x A, for B x A < 0.
On the other hand, the above-described fusion methods for "ReLu-Weight" and "PReLu-Weight" fusion can also be applied to ReLu-Normalization and PReLu-Normalization fusions.
Another benefit of the present invention is that hardware can be maximally reused across the activation-normalization, normalization-activation and activation-weight architecture designs.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention. Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (19)

1. A neural network operation processing method, characterized by comprising:
receiving input values of two functions of the operation;
checking the value range of the input value of at least one function;
and determining the execution strategy of the function according to the check result.
2. The neural network operation processing method of claim 1, wherein: the checking of the value range of the input value of the at least one function comprises checking the sign of the input value of the at least one function.
3. The neural network operation processing method of claim 1, wherein: the operation comprises a multiplication operation.
4. The neural network operation processing method of claim 1, wherein determining the execution strategy of the function includes:
if the range of the input value of the checked function is a positive value, one of the two functions is executed in a first mode, otherwise, one of the two functions is executed in a second mode.
5. The neural network operation processing method of claim 4, wherein the performing one of the two functions in the first mode comprises: a multiplication operation is performed.
6. The neural network operation processing method of claim 4, wherein the performing one of the two functions in the second mode includes: and assigning the output of one of the two functions as a set constant value.
7. The neural network arithmetic processing method of claim 1, wherein the method further comprises: predicting the range of the operation result by checking the value range of the input value of at least one function;
if the range of the predicted operation result is a positive value, the function is executed, otherwise, the function is not executed.
8. The neural network operation processing method of claim 1, wherein the determining an execution strategy of the function specifically includes:
if the value range of the input value of the checked function indicates that the input value of the checked function is a negative value, the function is not executed;
and executing the function if the value range of the input value of the checked function indicates that the input value of the checked function is a positive value.
9. The neural network operation processing method of claim 1, wherein the determining of the execution strategy of the function includes:
if the input value range of the checked first function is a positive value, selecting a first coefficient as an input parameter of a second function;
and if the input value range of the checked first function is a negative value, selecting a second coefficient as the input parameter of the second function.
10. The neural network operation processing method of claim 9, wherein the execution of the function includes multiplying the input value of the checked function by the selected coefficient.
11. The neural network operation processing method of claim 1, wherein the execution of the operation further comprises:
if the value range of the input value of the checked first function is a positive value, selecting a first coefficient as an input and multiplying the input value of the first function by the selected first coefficient;
if the range of the input value of the checked first function is negative, selecting a first coefficient and a second coefficient as inputs and multiplying the input value of the first function by the selected first coefficient and the selected second coefficient.
12. The neural network arithmetic processing method of claim 2, wherein: the checking of the value range of the input value of the at least one function comprises checking the sign information or the sign bit of the input value of the function.
13. The neural network operation processing method according to claim 7 or 8, wherein: if the operation is not executed, the result is the constant 0.
14. The neural network operation processing method of claim 3, wherein the multiplication operation is implemented as a multiplier, a shifter, a series of adders and shifters, or a series of AND gates and adders.
15. The neural network arithmetic processing method of claim 1, wherein the two functions are a normalization function and an activation function, respectively.
16. The neural network arithmetic processing method of claim 15, wherein the normalization function includes: a layer normalization function, an instance normalization function, a group normalization function, or a switchable normalization function.
17. The neural network arithmetic processing method of claim 15, wherein the activation function includes a ReLu function, a PReLu function, a ReLu6 function, or a ReLuN function.
18. A neural network arithmetic processing device, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, the processor including a plurality of multipliers, wherein the processor implements the following steps when executing the computer program:
receiving input values of two functions of the operation;
checking a value range of an input value of at least one of the functions;
and determining the execution strategy of the function according to the check result.
19. A computer-readable medium having non-volatile program code executable by a processor, the program code causing the processor to perform the method of any of claims 1 to 17.
CN202011574026.9A 2020-12-25 2020-12-25 Neural network operation processing method and device and computer readable medium Pending CN112862086A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011574026.9A CN112862086A (en) 2020-12-25 2020-12-25 Neural network operation processing method and device and computer readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011574026.9A CN112862086A (en) 2020-12-25 2020-12-25 Neural network operation processing method and device and computer readable medium

Publications (1)

Publication Number Publication Date
CN112862086A true CN112862086A (en) 2021-05-28

Family

ID=75997412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011574026.9A Pending CN112862086A (en) 2020-12-25 2020-12-25 Neural network operation processing method and device and computer readable medium

Country Status (1)

Country Link
CN (1) CN112862086A (en)

Similar Documents

Publication Publication Date Title
CN107608715B (en) Apparatus and method for performing artificial neural network forward operations
CN106951962B (en) Complex arithmetic unit, method and electronic device for neural network
US20240211252A1 (en) Computer processor for higher precision computations using a mixed-precision decomposition of operations
CN111213125B (en) Efficient direct convolution using SIMD instructions
CN108701250B (en) Data fixed-point method and device
US20210264273A1 (en) Neural network processor
EP3719639B1 (en) Systems and methods to perform floating-point addition with selected rounding
CN111915001B (en) Convolution calculation engine, artificial intelligent chip and data processing method
KR20080089313A (en) Method and apparatus for performing multiplicative functions
US10579338B2 (en) Apparatus and method for processing input operand values
CN113853601A (en) Apparatus and method for matrix operation
Li et al. Accelerating binarized neural networks via bit-tensor-cores in turing gpus
US20050172210A1 (en) Add-compare-select accelerator using pre-compare-select-add operation
US20140207838A1 (en) Method, apparatus and system for execution of a vector calculation instruction
US20230161555A1 (en) System and method performing floating-point operations
CN116795324A (en) Mixed precision floating-point multiplication device and mixed precision floating-point number processing method
CN112988110A (en) Floating point processing device and data processing method
WO2020161458A1 (en) Encoding special value in anchored-data element
CN112862086A (en) Neural network operation processing method and device and computer readable medium
CN111459548A (en) Dual load instruction
CN115713104A (en) Data processing circuit for neural network, neural network circuit and processor
CN115344826A (en) Computing device, operating method, and machine-readable storage medium
US6615228B1 (en) Selection based rounding system and method for floating point operations
Kageyama et al. Implementation of Floating‐Point Arithmetic Processing on Content Addressable Memory‐Based Massive‐Parallel SIMD matriX Core
US8180822B2 (en) Method and system for processing the booth encoding 33RD term

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination