CN111931919B - Sparse neural network computing method and device based on systolic array
- Publication number: CN111931919B (application CN202011013121.1A)
- Authority: CN (China)
- Prior art keywords: weight, sub, feature, calculation, neural network
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The application discloses a sparse neural network computing method based on a systolic array, which comprises: acquiring a feature map corresponding to n weights, the size of the feature map being x × y; equally dividing the feature map into n sub-feature blocks along the x-axis direction, each sub-feature block having a size of (x/n) × y and corresponding to one weight; calculating each sub-feature block according to the position of its weight in the weight matrix to obtain a calculation result; and regenerating a weight matrix according to the calculation result and outputting it. Because sparse convolution is computed in a systolic-array manner, data reuse is more thorough; because the convolution is performed block by block, the calculation is more flexible and efficient; and because the weights are encoded before being input to the downstream device, only nonzero weights enter the architecture for calculation, which reduces the overhead of the encoding unit.
Description
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a sparse neural network computing method and device based on a systolic array.
Background
With the continuous development of Artificial Intelligence (AI), the field has evolved from early hand-crafted feature engineering to learning directly from massive data, and is now applied in machine vision, speech recognition and natural language processing. The Convolutional Neural Network (CNN) is one of the most representative network structures in deep learning, is increasingly favored in the field of artificial intelligence, and is particularly effective in image processing.
As networks become larger and more complex, the computing resources required for convolutional training have multiplied, and reducing storage and computation cost becomes critical for neural networks with more layers and nodes. Prior work shows that the convolutional layers account for roughly 90-95% of the computation time but only a small fraction of the parameters, so their value density is high, while the fully connected layers account for roughly 5-10% of the computation time but about 95% of the parameters, so their value density is lower. Therefore, compressing and accelerating deep models by network pruning, weight sharing and the like is a necessary means of saving parameters and computation time while keeping the network working normally.
Pruning methods in the prior art are commonly used to remove redundant, low-information weights from a pre-trained CNN model. Training compact CNNs under sparsity constraints is currently the mainstream approach, and the sparsity constraints are usually introduced into the optimization problem as l0 or l1 norm regularizers. For example, a Cambricon-X architecture may be adopted for processing CNN calculations with sparse weights, in which a Buffer Controller control unit removes weights with a value of 0 and transmits the remaining weights to the corresponding computing units, which then perform the multiply-add operations. However, the data fed into that architecture still includes weights whose value is zero, so after all weights are input the zero-valued weights must be removed and the nonzero weights selected for encoding, which not only seriously reduces the calculation efficiency but also consumes a large amount of system resources.
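As a generic illustration of why encoding only the nonzero weights (together with their positions) before they enter the datapath saves work, the following Python sketch compresses a weight matrix into (value, row, column) triples; the function name and the triple format are hypothetical and are not the encoding used by Cambricon-X or by the present application.

```python
import numpy as np

def encode_nonzero_weights(weights: np.ndarray):
    """Return only the nonzero weights of a 2-D weight matrix as
    (value, row, column) triples, so that zero weights never reach the
    computing units."""
    rows, cols = np.nonzero(weights)
    return [(float(weights[r, c]), int(r), int(c)) for r, c in zip(rows, cols)]

# A 3x3 kernel with two nonzero entries: only two triples are produced.
kernel = np.array([[0.0, 0.5, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, -1.2]])
print(encode_nonzero_weights(kernel))  # [(0.5, 0, 1), (-1.2, 2, 2)]
```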
Disclosure of Invention
The application provides a sparse neural network computing method based on a systolic array, aiming to solve the problems of low efficiency and high resource consumption in the weight encoding and decoding process of the prior art.
In a first aspect, the present application provides a sparse neural network computing method based on a systolic array architecture, including:
acquiring a feature map corresponding to n weights; the size of the feature map is x × y;
equally dividing the feature map into n sub-feature blocks along the x-axis direction, wherein the size of each sub-feature block is (x/n) × y; each sub-feature block corresponds to one weight;
calculating each sub-feature block according to the position of the weight in the weight matrix to obtain a calculation result;
and regenerating a weight matrix according to the calculation result and outputting the weight matrix.
In some embodiments, the step of equally dividing the feature map into n sub-feature blocks along the x-axis direction comprises:
acquiring the parallel number and the step value of an input channel;
and taking the parallel number as a group and the step value as an output channel, and dividing the feature map into n sub-feature blocks.
In some embodiments, the calculating each sub-feature block according to the position of the weight in the weight matrix includes:
acquiring the position of a weight corresponding to the current sub-feature block in a weight matrix;
if the current weight is the first weight in the weight matrix, or the position of the current weight is the same as the position of the previous weight, calculating the partial sum in the current weight matrix;
and multiplying the weight by the sub-feature block and adding the partial sum to obtain a calculation result.
In some embodiments, further comprising:
and if the position of the current weight is different from the position of the previous weight, shifting the partial sum calculated for the current weight upward by a number of rows equal to the difference between the row positions of the current weight and the previous weight in the weight matrix.
In some embodiments, further comprising:
and if the current weight is the last weight in the weight matrix, multiplying the weight by the sub-feature block, adding the partial sum to obtain a calculation result, and ending the calculation.
In some embodiments, the step value is preferably 2.
In a second aspect, the present application further provides an apparatus corresponding to the method of the first aspect, including:
the input unit is configured to acquire a feature map corresponding to n weights; the size of the feature map is x × y;
a data distribution unit configured to equally divide the feature map into n sub-feature blocks along the x-axis direction, wherein the size of each sub-feature block is (x/n) × y; each sub-feature block corresponds to one weight;
the systolic array computing unit is configured to calculate each sub-feature block according to the position of the weight in the weight matrix to obtain a calculation result;
and the output unit is configured to regenerate the weight matrix according to the calculation result and output the weight matrix.
In some embodiments, the data distribution unit comprises:
the acquisition subunit is configured to acquire the parallel number and the step value of the input channel;
and the dividing subunit is configured to take the parallel number as a group and the step value as an output channel, and to divide the feature map into n sub-feature blocks named by group and output channel.
In some embodiments, the systolic array computation unit comprises:
the positioning subunit is configured to acquire the position of the weight corresponding to the current sub-feature block in the weight matrix;
the first execution subunit is configured to calculate a partial sum in the current weight matrix if the current weight is a first weight in the weight matrix, or the position of the current weight is the same as the position of the previous weight;
and the operation subunit is configured to multiply the weight by the sub-feature block and add the partial sum to obtain a calculation result.
In some embodiments, the systolic array computation unit further comprises:
and the second execution subunit is configured to, if the position of the current weight is different from the position of the previous weight, shift the calculated partial sum upward by a number of rows equal to the difference between the row positions of the current weight and the previous weight in the weight matrix.
In some embodiments, the systolic array computation unit further comprises:
and the third execution subunit is configured to, if the current weight is the last weight in the weight matrix, multiply the weight by the sub-feature block, add the partial sum to obtain a calculation result, and then end the calculation.
The application provides a sparse neural network computing method based on a systolic array architecture, which comprises: acquiring a feature map corresponding to n weights, the size of the feature map being x × y; equally dividing the feature map into n sub-feature blocks along the x-axis direction, each sub-feature block having a size of (x/n) × y and corresponding to one weight; calculating each sub-feature block according to the position of its weight in the weight matrix to obtain a calculation result; and regenerating a weight matrix according to the calculation result and outputting it. Because sparse convolution is computed in a systolic-array manner, data reuse is more thorough; because convolution is performed block by block, the calculation is more flexible and efficient; and because the weights are encoded before being input to the downstream device, only nonzero weights enter the architecture for calculation, which reduces the overhead of the encoding unit.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below. It will be obvious to those skilled in the art that other drawings can be obtained from these drawings without any creative effort.
FIG. 1 is a flowchart of a sparse neural network computing method based on a systolic array architecture according to the present application;
FIG. 2 is a diagram illustrating an exploded step of step S300 of the method of FIG. 1, in one embodiment;
FIG. 3 is a schematic diagram of the operations performed on a partial sum according to the method provided herein;
FIG. 4 is a flowchart of the calculation of the method provided by the present application at a step size of 1;
FIG. 5 is a flowchart of the calculation of the method provided by the present application at a step size of 2;
FIG. 6 is a schematic structural diagram of a sparse neural network computing device based on a systolic array architecture according to the present application;
fig. 7 is a block diagram of a data distribution unit in the apparatus shown in fig. 6;
fig. 8 is a block diagram of a systolic array computing unit in the apparatus shown in fig. 6.
Detailed Description
Referring to fig. 1, a flowchart of a sparse neural network computing method based on a systolic array architecture is shown.
As can be seen from fig. 1, a sparse neural network computing method based on a systolic array architecture provided in an embodiment of the present application includes:
S100: acquiring a feature map corresponding to n weights; the size of the feature map is x × y;
in this embodiment, the obtained feature map is derived from data input into the computing architecture from the previous stage, and the n weights determine the number of sub-feature blocks into which the feature map is to be divided, so that after the feature map is divided into sub-feature blocks, each sub-feature block corresponds to one weight. The feature map may generally be represented in the form x × y, such as 64 × 8 or 128 × 8.
S200: equally dividing the feature map into n sub-feature blocks along the x-axis direction, wherein the size of each sub-feature block is (x/n) × y; each sub-feature block corresponds to one weight;
in the present embodiment, step S200 mainly performs the data distribution function: the entire feature map is divided into a plurality of sub-feature blocks of the same size. Convolution calculated in units of blocks is more efficient than convolution calculated in units of rows. In some embodiments, the value of x is therefore required to be an integer multiple of n; for example, when x = 64 and y = 8, n = 8 may be set, that is, the current feature map may be divided into 8 sub-feature blocks of size 8 × 8.
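To make the block partition concrete, the following NumPy sketch splits the 64 × 8 feature map from the example above into 8 sub-feature blocks of size 8 × 8 along the x axis. It is an illustration only; the function name and the generated data are hypothetical and not part of the application.

```python
import numpy as np

def split_feature_map(fmap: np.ndarray, n: int) -> list:
    """Split an x-by-y feature map into n equal sub-feature blocks along the x axis.

    Assumes x is an integer multiple of n, as required in the description.
    """
    x, y = fmap.shape
    assert x % n == 0, "x must be an integer multiple of n"
    step = x // n
    return [fmap[i * step:(i + 1) * step, :] for i in range(n)]

# Example from the text: a 64x8 feature map split into 8 sub-feature blocks of 8x8.
fmap = np.arange(64 * 8, dtype=np.float32).reshape(64, 8)
blocks = split_feature_map(fmap, 8)
print(len(blocks), blocks[0].shape)  # 8 (8, 8)
```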
Further, in a possible embodiment, step S200 can be decomposed into:
S210: acquiring the parallel number and the step value of the input channels. For a feature map of input size x × y, different parallel numbers and step values of the input channels lead to different divisions; taking x = 64 and y = 8 as an example, the distribution unit may distribute the data differently under different configurations. The parallel number of the input channels represents the number of computing units that can compute in parallel, and the step value (stride) represents the number of inputs corresponding to each computing unit.
S220: denoting the parallel number as the group (abbreviated g) and the step value as the output channel (abbreviated s), the feature map is divided into n sub-feature blocks. The different distribution modes may be denoted s1g1, s1g2, s2g1, and so on; correspondingly, the sub-feature blocks divided under different distribution modes have different characteristics for the subsequent calculation process.
After the feature map is divided into n sub-feature blocks, each sub-feature block is allocated to a computing unit to execute an individual computing process, and the number of the computing units corresponds to the number of the sub-feature blocks, which is specifically shown in step S300:
s300: calculating each sub-feature block according to the position of the weight in the weight matrix to obtain a calculation result;
in this embodiment, the computing unit executing step S300 is a systolic array computing unit. Taking a 3 × 3 convolution with an input-channel parallelism of 1 as an example, after the 64 × 8 feature map is divided into 8 × 8 sub-feature blocks and input into 8 systolic array computing units, each systolic array computing unit is provided with 8 × 8 = 64 multiply-accumulate units, matching the sub-feature block size, to perform the calculation. Specifically, in the possible embodiment shown in fig. 2, the calculation process of the systolic array computing unit includes:
S310: acquiring the position, in the weight matrix, of the weight corresponding to the current sub-feature block. The n weights are usually arranged in matrix form, and the calculation differs depending on the position of the weight in the weight matrix, so the current position is recorded before calculation.
S321: if the current weight is the first weight in the weight matrix, or the position of the current weight is the same as the position of the previous weight, calculating the partial sum in the current weight matrix. In this embodiment, the first weight is the weight located in the lowest row of the weight matrix; since no other weight has been input before it, no shift operation needs to be performed. Similarly, if the current weight is not the first weight, it is determined whether the current weight and the previously input weight are located at the same position in the weight matrix; if so, the partial sum and the subsequent operations can be calculated in the same computing unit;
further, if the position of the current weight is different from the position of the previous weight, the method further includes:
S322: shifting the partial sum calculated for the current weight upward by a number of rows equal to the difference between the row positions of the current weight and the previous weight in the weight matrix. As shown in fig. 3, if the calculation performed by the current weight in computing unit 0 is the first-weight calculation or a calculation at the same position as the previous weight, the partial sum does not need to be moved, and the result is shown in (1); if the position of the current weight differs from that of the previous weight by one row, the partial sum is shifted upward by one row, the result is shown in (2), and computing unit 1 completes the subsequent calculation steps; if the positions differ by two rows, the partial sum is shifted upward by two rows, the result is shown in (3), and computing unit 2 completes the subsequent calculation steps.
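A minimal NumPy sketch of the row shift described in S322. The function name is hypothetical, and filling the vacated rows with zeros is an assumption made for illustration; in the hardware described above the shifted partial sums are taken over by a neighbouring computing unit rather than zero-filled.

```python
import numpy as np

def shift_partial_sum(psum: np.ndarray, row_diff: int) -> np.ndarray:
    """Shift a partial-sum block upward by row_diff rows, where row_diff is the
    difference between the row positions of the current and previous weights."""
    if row_diff <= 0:
        return psum
    shifted = np.zeros_like(psum)
    shifted[:-row_diff, :] = psum[row_diff:, :]  # rows move up; vacated rows become zero
    return shifted

psum = np.arange(16, dtype=np.float32).reshape(4, 4)
print(shift_partial_sum(psum, 1))  # shift by one row, as in case (2) of FIG. 3
print(shift_partial_sum(psum, 2))  # shift by two rows, as in case (3) of FIG. 3
```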
S330: multiplying the weight by the sub-feature block and adding the partial sum to obtain a calculation result.
The weight is multiplied by the sub-feature block, and after the multiplication is finished, the result is accumulated with the shifted partial sum. For example, in FIG. 4 and FIG. 5, weight kernel 1 is multiplied by the value in the PE, the partial sum is then shifted, the multiplication result is added to the partial sum, and the operation ends.
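Putting S310 to S330 together, the following self-contained Python sketch processes one sub-feature block against a stream of (nonzero weight, row position) pairs. It reuses the zero-filling assumption of the previous sketch and assumes the row positions arrive in non-decreasing order; it is an illustration of the described steps, not the exact hardware dataflow.

```python
import numpy as np

def process_sub_block(sub_block: np.ndarray, weight_stream) -> np.ndarray:
    """Accumulate one sub-feature block against a stream of (weight, row) pairs:
    keep the partial sum in place when the row position is unchanged (S321),
    shift it when the row position changes (S322), then multiply-accumulate (S330)."""
    psum = np.zeros_like(sub_block)
    prev_row = None
    for weight, row in weight_stream:
        if prev_row is not None and row != prev_row:
            diff = row - prev_row                # rows by which the partial sum moves up
            shifted = np.zeros_like(psum)
            shifted[:-diff, :] = psum[diff:, :]
            psum = shifted
        psum = weight * sub_block + psum         # multiply by the block, add the partial sum
        prev_row = row
    return psum

block = np.ones((8, 8), dtype=np.float32)        # one 8x8 sub-feature block
stream = [(0.5, 0), (-1.2, 2)]                   # two nonzero weights at rows 0 and 2
print(process_sub_block(block, stream))
```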
Step S330 performs the calculation for each weight of the feature map, so as to complete the calculation of the entire feature map comprising a plurality of sub-feature blocks. Further, while step S330 is executed, the following step is also needed:
S331: if the current weight is the last weight in the weight matrix, the calculation ends after the weight is multiplied by the sub-feature block and the partial sum is added to obtain the calculation result. When the current weight is determined to be the last weight, the calculation result obtained at this point means that the results for all weights have been obtained, and the input of weights can stop.
S400: regenerating a weight matrix according to the calculation result and outputting the weight matrix;
in the data distribution unit, the data are scattered and distributed to different computing units for calculation. In the output unit, the hardware rearranges the scattered data back into the same whole as before input. As in fig. 4 and fig. 5, the data are distributed over 64 registers and input to the computing units, and in the output unit the output results are written back to the same tile addresses and output.
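A small NumPy sketch of this scatter/gather idea, pairing with the splitting sketch given earlier; the function name is hypothetical, and the concatenation order is assumed to match the original block order.

```python
import numpy as np

def reassemble_feature_map(blocks: list) -> np.ndarray:
    """Stitch the per-block results back into a single x-by-y map, i.e. the
    inverse of splitting the feature map along the x axis."""
    return np.concatenate(blocks, axis=0)

# Round trip with the earlier example: 8 blocks of 8x8 rebuild the 64x8 map.
fmap = np.arange(64 * 8, dtype=np.float32).reshape(64, 8)
blocks = [fmap[i * 8:(i + 1) * 8, :] for i in range(8)]
assert np.array_equal(reassemble_feature_map(blocks), fmap)
```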
Further, in a feasible embodiment, when the step value (stride) set in step S210 is 2, the method of the present application achieves better calculation efficiency. This is illustrated by the calculation flowcharts shown in fig. 4 and fig. 5: because of the way the convolution is computed, typically only one of every four outputs is a valid output, so by default each computing unit reaches only 25% computational efficiency. In the architecture provided by the application, the efficiency can be raised to 50% by inputting a plurality of feature maps for calculation. Fig. 4 is the calculation flowchart for step value = 1 and fig. 5 is the calculation flowchart for step value = 2. As can be seen from fig. 4, each computing unit PE corresponds to one input register (Reg), whereas in fig. 5, when the step value is 2, each computing unit PE corresponds to two input registers, so the calculation efficiency is effectively improved.
According to the above technical solution, the sparse neural network computing method based on a systolic array architecture provided by the application comprises: acquiring a feature map corresponding to n weights, the size of the feature map being x × y; equally dividing the feature map into n sub-feature blocks along the x-axis direction, each sub-feature block having a size of (x/n) × y and corresponding to one weight; calculating each sub-feature block according to the position of its weight in the weight matrix to obtain a calculation result; and regenerating a weight matrix according to the calculation result and outputting it. Because sparse convolution is computed in a systolic-array manner, data reuse is more thorough; because convolution is performed block by block, the calculation is more flexible and efficient; and because the weights are encoded before being input to the downstream device, only nonzero weights enter the architecture for calculation, which reduces the overhead of the encoding unit.
Fig. 6 is a schematic structural diagram of a sparse neural network computing device based on a systolic array architecture according to the present application;
as can be seen from fig. 6, the present application further provides an apparatus corresponding to the foregoing method, including:
an input unit 100 configured to acquire a feature map corresponding to n weights; the size of the feature map is x × y;
a data distribution unit 200 configured to equally divide the feature map into n sub-feature blocks along the x-axis direction, wherein the size of each sub-feature block is (x/n) × y; each sub-feature block corresponds to one weight;
the systolic array computing unit 300 is configured to compute each sub-feature block according to the position of the weight in the weight matrix to obtain a computation result;
and an output unit 400 configured to regenerate and output the weight matrix according to the calculation result.
Further, as shown in fig. 7, the data distribution unit 200 includes:
an obtaining subunit 210 configured to obtain the parallel number and the step value of the input channel;
and a dividing subunit 220 configured to take the parallel number as a group and the step value as an output channel, and to divide the feature map into n sub-feature blocks named by group and output channel.
Further, as shown in fig. 8, the systolic array computing unit 300 includes:
a positioning subunit 310 configured to obtain a position of a weight corresponding to the current sub-feature block in the weight matrix;
the first execution subunit 320 is configured to calculate a partial sum in the current weight matrix if the current weight is a first weight in the weight matrix, or if the position of the current weight is the same as the position of the previous weight;
and the operation subunit 330 is configured to multiply the weight by the sub-feature block and add the partial sum to obtain a calculation result.
Further, the systolic array computing unit 300 further includes:
the second execution subunit 340 is configured to, if the position of the current weight is different from the position of the previous weight, move the calculated portion of the current weight matrix by a number of rows upwards, where the number of rows upwards is a difference between the number of rows at the position of the current weight and the position of the previous weight in the weight matrix.
Further, the systolic array computing unit 300 further includes:
the third execution subunit 350 is configured to, if the current weight is the last weight in the weight matrix, perform multiplication on the weight, the sub-feature block, and the partial sum to obtain a calculation result, and then end the calculation.
For the functions of the apparatus in the above embodiment, the functional roles of the structural units in executing the above method are referred to the descriptions in the above method embodiments, and are not described herein again.
The steps of a method or algorithm described in this application may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a UE. In the alternative, the processor and the storage medium may reside as discrete components in a UE.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (11)
1. A sparse neural network computing method based on a systolic array architecture is characterized in that the method comprises the following steps:
constructing a calculation method and a device for realizing a sparse neural network based on a systolic array architecture; the computing method comprises a method for distributing and processing sparse network data, and the device designed based on the computing method comprises an input unit, a data distribution unit, a systolic array computing unit and an output unit;
wherein the input unit is designed to obtain the corresponding feature map according to the n nonzero weights of the convolutional layer currently being calculated; the feature map is data obtained from the operation of the preceding convolutional layer in the computing architecture, and its size is x × y;
the data distribution unit is designed to equally divide the feature map, according to the positions of the nonzero weights, into n sub-feature blocks of the same size (x/n) × y, with the parallel number recorded as the group and the step value recorded as the output channel, and to distribute the encoded data, in units of blocks, to the systolic array computing unit; during distribution, the corresponding sub-feature blocks are distributed according to the nonzero weights, and only the nonzero weights are input into the systolic array computing unit for calculation;
and the systolic array computing unit is designed to read the corresponding sub-feature block according to the position of the nonzero weight in the weight matrix, and to perform the calculation with each sub-feature block to obtain a result containing only nonzero-weight operations.
2. The sparse neural network computing method based on the systolic array architecture of claim 1, wherein the step of equally dividing the feature map into n sub-feature blocks along the x-axis direction comprises:
acquiring the parallel number and the step value of an input channel;
and taking the parallel number as a group and the step value as an output channel, and dividing the feature map into n sub-feature blocks.
3. The sparse neural network computing method based on the systolic array architecture of claim 1, wherein the computing each sub-feature block according to the position of the weight in the weight matrix comprises:
acquiring the position of a weight corresponding to the current sub-feature block in a weight matrix;
if the current weight is the first weight in the weight matrix, or the position of the current weight is the same as the position of the previous weight, calculating the partial sum in the current weight matrix;
and multiplying the weight by the sub-feature block and adding the partial sum to obtain a calculation result.
4. The sparse neural network computing method based on the systolic array architecture of claim 3, characterized by further comprising:
and if the position of the current weight is different from the position of the previous weight, shifting the partial sum calculated for the current weight upward by a number of rows equal to the difference between the row positions of the current weight and the previous weight in the weight matrix.
5. The sparse neural network computing method based on the systolic array architecture of claim 3, characterized by further comprising:
and if the current weight is the last weight in the weight matrix, multiplying the weight by the sub-feature block, adding the partial sum to obtain a calculation result, and ending the calculation.
6. The sparse neural network computing method based on the systolic array architecture of claim 2, characterized in that said step value is preferably 2.
7. A sparse neural network computing device based on a systolic array architecture, the computing device comprising:
the input unit is configured to acquire a feature map corresponding to n weights; the feature map is derived from data input into the computing architecture from the previous stage; the size of the feature map is x × y;
a data distribution unit configured to equally divide the feature map into n sub-feature blocks along the x-axis direction, wherein the size of each sub-feature block is (x/n) × y; each sub-feature block corresponds to one weight; during distribution the corresponding sub-feature blocks are distributed according to the nonzero weights, and only the nonzero weights are input into the systolic array computing unit for calculation;
the systolic array computing unit is configured to calculate each sub-feature block according to the position of the weight in the weight matrix to obtain a calculation result;
and the output unit is configured to regenerate the weight matrix according to the calculation result and output the weight matrix.
8. The sparse neural network computing device based on a systolic array architecture of claim 7, characterized in that said data distribution unit comprises:
the acquisition subunit is configured to acquire the parallel number and the step value of the input channel;
and the dividing subunit is configured to take the parallel number as a group and the step value as an output channel, and to divide the feature map into n sub-feature blocks named by group and output channel.
9. The sparse neural network computing device based on a systolic array architecture of claim 7, wherein the systolic array computing unit includes:
the positioning subunit is configured to acquire the position of the weight corresponding to the current sub-feature block in the weight matrix;
the first execution subunit is configured to calculate a partial sum in the current weight matrix if the current weight is a first weight in the weight matrix, or the position of the current weight is the same as the position of the previous weight;
and the operation subunit is configured to multiply the weight by the sub-feature block and add the partial sum to obtain a calculation result.
10. The sparse neural network computing device based on systolic array architecture of claim 9, characterized in that said systolic array computing unit further comprises:
and the second execution subunit is configured to, if the position of the current weight is different from the position of the previous weight, shift the calculated partial sum upward by a number of rows equal to the difference between the row positions of the current weight and the previous weight in the weight matrix.
11. The sparse neural network computing device based on systolic array architecture of claim 9, characterized in that said systolic array computing unit further comprises:
and the third execution subunit is configured to, if the current weight is the last weight in the weight matrix, multiply the weight by the sub-feature block, add the partial sum to obtain a calculation result, and then end the calculation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011013121.1A CN111931919B (en) | 2020-09-24 | 2020-09-24 | Sparse neural network computing method and device based on systolic array |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011013121.1A CN111931919B (en) | 2020-09-24 | 2020-09-24 | Sparse neural network computing method and device based on systolic array |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111931919A CN111931919A (en) | 2020-11-13 |
CN111931919B true CN111931919B (en) | 2021-04-27 |
Family
ID=73334121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011013121.1A Active CN111931919B (en) | 2020-09-24 | 2020-09-24 | Sparse neural network computing method and device based on systolic array |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111931919B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107239824A (en) * | 2016-12-05 | 2017-10-10 | Beijing DeePhi Intelligent Technology Co., Ltd. | Apparatus and method for realizing sparse convolutional neural network accelerator
CN111626410A (en) * | 2019-02-27 | 2020-09-04 | Institute of Semiconductors, Chinese Academy of Sciences | Sparse convolutional neural network accelerator and calculation method
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10838910B2 (en) * | 2017-04-27 | 2020-11-17 | Falcon Computing | Systems and methods for systolic array design from a high-level program |
- 2020-09-24: application CN202011013121.1A filed in CN (granted as CN111931919B, status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN111931919A (en) | 2020-11-13 |
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant