CN111126569A - Convolutional neural network device supporting pruning sparse compression and calculation method - Google Patents

Convolutional neural network device supporting pruning sparse compression and calculation method

Info

Publication number
CN111126569A
Authority
CN
China
Prior art keywords
weight
source data
neural network
processing unit
zero
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911312338.XA
Other languages
Chinese (zh)
Other versions
CN111126569B (en)
Inventor
丁永林
曹学成
廖湘萍
邱蔚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 52 Research Institute
Original Assignee
CETHIK Group Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETHIK Group Ltd filed Critical CETHIK Group Ltd
Priority to CN201911312338.XA priority Critical patent/CN111126569B/en
Publication of CN111126569A publication Critical patent/CN111126569A/en
Application granted granted Critical
Publication of CN111126569B publication Critical patent/CN111126569B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a convolutional neural network device supporting pruning sparse compression and a calculation method. The convolutional neural network device comprises a weight buffer, a zero value detection unit, a source data buffer, a source data first-in first-out queue, a weight first-in first-out queue, a convolution block processing unit and a target data buffer. Unconstrained pruning and sparsification can be carried out according to the positions of the zero weights, so that unconstrained support for pruning and sparsification is realized, the pruning and sparsification effect is markedly improved, and the convolution calculation efficiency is improved.

Description

Convolutional neural network device supporting pruning sparse compression and calculation method
Technical Field
The application belongs to the field of artificial intelligence chip design and FPGA design, and particularly relates to a convolutional neural network device supporting pruning sparse compression and a calculation method.
Background
Artificial intelligence is a branch of computer science that has attracted increasing attention. By simulating human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), it takes the place of people in complex tasks that previously had to be accomplished by people.
Image recognition is a typical application field of artificial intelligence. Its goal is to let computers, rather than people, process large amounts of physical information and, through acquisition, preprocessing, and feature extraction of image information, finally classify it and even make decisions.
Convolutional Neural Networks (CNNs) are a class of deep feed-forward neural networks that include convolution operations. They are one of the representative algorithms of deep learning and are widely used in various image recognition algorithms.
Convolutional neural networks are inspired by the structure of the visual system: convolution kernels play the role of neurons, and the multilayer network models the transfer of information between neurons. Input-layer data passes through several hidden layers and the result is finally delivered to the output layer, realizing image recognition; the recognition accuracy generally grows with the number of hidden layers. The core operation of the whole network is the convolution layer, and the amount of convolution computation grows geometrically as the number of network layers increases. Reducing the amount of convolution computation is therefore vital for improving the performance of a convolutional neural network and saving its power consumption.
The conventional way to reduce the amount of convolution computation is mainly pruning and sparsification of the weight data. By pruning and sparsifying the weights, the redundancy of the convolution kernels can be reduced and the computational complexity lowered. In learning on the ImageNet data set, a convolutional neural network sparsified by 90% runs 2 to 10 times faster than a conventional network of the same structure, while the output classification accuracy drops by only 2%.
However, due to the limitations of the prior art, chip designs and FPGA designs cannot fully support pruning and sparsification of weight data with arbitrary structure; only certain specific sparsity patterns can be supported. Because too many constraints are imposed on pruning and sparsification at the algorithm level, the actual pruning and sparsification effect is greatly reduced.
Therefore, in chip design and FPGA design there is a need for unconstrained support of pruning and sparsification.
Disclosure of Invention
The convolutional neural network device and calculation method supporting pruning sparse compression provided by the application achieve unconstrained support for pruning and sparsification, markedly reduce the convolution computation load, and improve the convolution calculation efficiency.
In order to achieve the purpose, the technical scheme adopted by the application is as follows:
a convolutional neural network device supporting pruning sparse compression comprises a weight buffer, a zero value detection unit, a source data buffer, a source data first-in first-out queue, a weight first-in first-out queue, a convolutional block processing unit and a target data buffer, wherein:
the weight buffer is used for storing the weights of the convolutional neural network and outputting the weights to the zero value detection unit, and each weight corresponds to different position information;
the zero value detection unit is used for judging whether the received weight value is zero or not, outputting a non-zero weight value to the weight value first-in first-out queue and outputting position information corresponding to the non-zero weight value to the source data buffer;
the source data buffer is used for storing source data of the convolutional neural network and outputting the corresponding source data to the source data first-in first-out queue according to the position information corresponding to the received nonzero weight;
the source data first-in first-out queue is used for storing the source data output by the source data buffer and outputting S source data to the convolution block processing unit according to a first-in first-out principle;
the weight first-in first-out queue is used for storing the non-zero weights output by the zero value detection unit and outputting the non-zero weights to the convolution block processing unit according to the first-in first-out principle;
the convolution block processing unit is used for calculating convolution operation of S target data in parallel according to the received nonzero weight and S source data and outputting the S target data obtained through calculation to the target data buffer;
and the target data buffer is used for buffering S target data output by the convolution block processing unit.
Preferably, the convolution block processing unit includes S multiply-accumulators, each multiply-accumulator for calculating a convolution operation of a single target data.
Preferably, the S multiply-accumulators perform their calculations in parallel.
Preferably, the pixels of the input layer of the convolutional neural network are M rows × N columns, and the number S of multiply-accumulators satisfies one of the following relations:
S > N;
or, S = N;
or, S < N.
The application also provides a calculation method based on the convolutional neural network device supporting pruning sparse compression according to any one of the above technical solutions. The calculation method involves the weight buffer, the zero value detection unit, the source data buffer, the source data first-in first-out queue, the weight first-in first-out queue, the convolution block processing unit and the target data buffer, and includes:
outputting weights to the zero value detection unit through the weight buffer, wherein each weight corresponds to different position information;
the zero value detection unit judges whether the received weight value is zero or not, outputs a non-zero weight value to the weight value first-in first-out queue, and outputs position information corresponding to the non-zero weight value to the source data buffer;
the source data buffer outputs corresponding source data to the source data first-in first-out queue according to the received position information corresponding to the nonzero weight value;
the source data first-in first-out queue outputs S source data to the convolution block processing unit according to the first-in first-out principle;
the weight first-in first-out queue outputs a non-zero weight to the convolution block processing unit according to the first-in first-out principle;
the convolution block processing unit is used for calculating convolution operation of S target data in parallel according to the received nonzero weight and S source data and outputting the S target data obtained through calculation to the target data buffer;
the target data buffer buffers the S target data output by the convolution block processing unit.
Preferably, the convolution block processing unit includes S multiply-accumulators, each multiply-accumulator for calculating a convolution operation of a single target data.
Preferably, the S multiply-accumulators perform their calculations in parallel.
Preferably, the pixels of the input layer of the convolutional neural network are M rows × N columns, and the number S of multiply-accumulators satisfies one of the following relations:
S > N;
or, S = N;
or, S < N.
According to the convolutional neural network device and calculation method supporting pruning sparse compression, before the convolution operation it is judged whether each weight is zero, and according to the result only the non-zero weights and the source data corresponding to them are output. Pruning and sparsification thus take the input data as the entry point, reducing the redundancy of the convolution kernels and lowering the computational complexity. At the same time, unconstrained pruning and sparsification can be performed according to the positions of the zero weights, realizing unconstrained support for pruning and sparsification, markedly improving the pruning and sparsification effect, and improving the convolution calculation efficiency.
Drawings
FIG. 1 is a schematic structural diagram of a convolutional neural network device supporting pruning sparsification compression according to the present application;
FIG. 2 is a schematic diagram of the internal structure of the convolution block processing unit according to the present application;
FIG. 3 is a schematic diagram of the operation of a single multiply-accumulator according to the present application;
FIG. 4 is a schematic diagram of the present application in which 1024 multiply-accumulate units complete the convolution operations that generate an entire row of target data of the output layer.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
As shown in fig. 1, in one embodiment, a convolutional neural network device supporting pruning sparsification compression is provided, which can achieve unconstrained support for pruning and sparsification.
Specifically, the Convolutional Neural Network (CNN) device supporting pruning sparse compression of the present embodiment includes a weight buffer (WGT BUF), a zero value detection unit, a source data buffer (SRC BUF), a source data first-in first-out queue (SRC FIFO), a weight first-in first-out queue (WGT FIFO), a convolution BLOCK processing unit (CONV BLOCK), and a target data buffer (DES BUF), where:
the weight buffer is used for storing the weights of the convolutional neural network and outputting the weights to the zero value detection unit, and each weight corresponds to different position information;
the zero value detection unit is used for judging whether the received weight value is zero or not, outputting a non-zero weight value to the weight value first-in first-out queue and outputting position information corresponding to the non-zero weight value to the source data buffer;
the source data buffer is used for storing source data of the convolutional neural network and outputting the corresponding source data to the source data first-in first-out queue according to the position information corresponding to the received nonzero weight;
the source data first-in first-out queue is used for storing the source data output by the source data buffer and outputting S source data to the convolution block processing unit according to a first-in first-out principle;
the weight first-in first-out queue is used for storing the non-zero weights output by the zero value detection unit and outputting the non-zero weights to the convolution block processing unit according to the first-in first-out principle;
the convolution block processing unit is used for calculating convolution operation of S target data in parallel according to the received nonzero weight and S source data and outputting the S target data obtained through calculation to the target data buffer;
and the target data buffer is used for buffering S target data output by the convolution block processing unit.
The above-mentioned non-zero weight is understood to be a weight whose value is non-zero, and a zero weight is understood to be a weight whose value is zero.
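For clarity, the following is a minimal software sketch of this dataflow, assuming plain Python lists stand in for the hardware buffers and FIFOs; the function names (zero_value_detect, fetch_source, conv_block) and the toy numbers are illustrative and not taken from the patent.

def zero_value_detect(weights):
    """Keep only the non-zero weights; report their positions to the source data buffer."""
    wgt_fifo, positions = [], []
    for pos, w in enumerate(weights):
        if w != 0:
            wgt_fifo.append(w)      # non-zero weight -> weight first-in first-out queue
            positions.append(pos)   # position info   -> source data buffer
    return wgt_fifo, positions

def fetch_source(src_buf, positions):
    """For each surviving weight position, fetch the S source values it multiplies."""
    return [src_buf[pos] for pos in positions]   # each entry holds S source values

def conv_block(wgt_fifo, src_fifo, S):
    """S multiply-accumulators computing S target data in parallel."""
    acc = [0.0] * S
    for w, src_group in zip(wgt_fifo, src_fifo):
        for i in range(S):
            acc[i] += w * src_group[i]
    return acc

# Toy example: 4 weights (two pruned to zero), S = 3 target data.
S = 3
weights = [0.5, 0.0, -0.2, 0.0]
src_buf = [[1, 2, 3], [9, 9, 9], [4, 5, 6], [9, 9, 9]]   # per-position source groups
wgt_fifo, positions = zero_value_detect(weights)
src_fifo = fetch_source(src_buf, positions)
des_buf = conv_block(wgt_fifo, src_fifo, S)   # only non-zero weights reach this stage
print(des_buf)                                # approximately [-0.3, 0.0, 0.3]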
As shown in FIG. 2, in one embodiment, the convolution block processing unit includes S multiply-accumulators, each for calculating the convolution operation of a single target data. The S multiply-accumulators compute in parallel, that is, the convolution block processing unit can carry out the convolution operations of S target data at once, which improves the computation efficiency.
It should be noted that the number of multiply-accumulate units included in the convolution block processing unit is adjusted according to the computing capability and computing requirement of the computer device.
In one embodiment, if the pixels of the input layer of the convolutional neural network are M rows × N columns, the number S of multiply-accumulators satisfies one of the following relations:
S > N; or, S = N; or, S < N.
According to the above relations, the number of parallel computations performed by the convolution block processing unit at a time can be larger than, equal to, or smaller than the number of pixels in one row of the input layer, i.e. there is no necessary constraint between the two.
To facilitate an understanding of the apparatus of the present application, further details are provided by the following examples.
Example 1
Suppose the input layer of the convolutional neural network has 768 rows × 1024 columns of pixels, 4 input channels, 3 × 3 convolution kernels, a stride of 1, and 1 output channel; the output layer of the network then also has 768 rows × 1024 columns of pixels.
As shown in fig. 3, first, a multiply-accumulator is taken as an example for explanation:
When the target data in row m, column n of the output layer needs to be calculated, the source data around row m, column n is taken. For each input channel, the positions of the 3 × 3 convolution kernel are defined, from top-left to bottom-right, as (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2). The source data window contributing to the target data in row m, column n of each input channel is likewise a 3 × 3 structure, whose positions are (m-1,n-1), (m-1,n), (m-1,n+1), (m,n-1), (m,n), (m,n+1), (m+1,n-1), (m+1,n), (m+1,n+1). The weight at each kernel position is multiplied by the source datum at the corresponding window position; that is, according to the convolution formula, the target data y(m,n) in row m, column n of the output layer is:
y(m,n) =
// input channel 1
(x(m-1,n-1)*w(0,0) + x(m-1,n)*w(0,1) + x(m-1,n+1)*w(0,2) +
x(m,n-1)*w(1,0) + x(m,n)*w(1,1) + x(m,n+1)*w(1,2) +
x(m+1,n-1)*w(2,0) + x(m+1,n)*w(2,1) + x(m+1,n+1)*w(2,2))
+
// input channel 2
(x(m-1,n-1)*w(0,0) + x(m-1,n)*w(0,1) + x(m-1,n+1)*w(0,2) +
x(m,n-1)*w(1,0) + x(m,n)*w(1,1) + x(m,n+1)*w(1,2) +
x(m+1,n-1)*w(2,0) + x(m+1,n)*w(2,1) + x(m+1,n+1)*w(2,2))
+
// input channel 3
(x(m-1,n-1)*w(0,0) + x(m-1,n)*w(0,1) + x(m-1,n+1)*w(0,2) +
x(m,n-1)*w(1,0) + x(m,n)*w(1,1) + x(m,n+1)*w(1,2) +
x(m+1,n-1)*w(2,0) + x(m+1,n)*w(2,1) + x(m+1,n+1)*w(2,2))
+
// input channel 4
(x(m-1,n-1)*w(0,0) + x(m-1,n)*w(0,1) + x(m-1,n+1)*w(0,2) +
x(m,n-1)*w(1,0) + x(m,n)*w(1,1) + x(m,n+1)*w(1,2) +
x(m+1,n-1)*w(2,0) + x(m+1,n)*w(2,1) + x(m+1,n+1)*w(2,2))
where x and w in each block denote the source data and weights of the corresponding input channel.
According to the above calculation process, after 3 × 3 × 4 = 36 cycles the multiply-accumulator completes the convolution operation for the target data in row m, column n of the output layer.
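For illustration, a minimal Python sketch of this per-pixel computation follows, assuming dense (all non-zero) weights; the array names, shapes and the function target_pixel are illustrative and not taken from the patent.

import numpy as np

C, K = 4, 3                        # input channels, kernel size
x = np.random.rand(C, 770, 1026)   # source data with a 1-pixel border (illustrative padding)
w = np.random.rand(C, K, K)        # one 3 x 3 kernel per input channel

def target_pixel(m, n):
    """Dense convolution for one output position (m, n): 3 x 3 x 4 = 36 MAC steps."""
    acc, cycles = 0.0, 0
    for c in range(C):
        for i in range(K):
            for j in range(K):
                acc += x[c, m - 1 + i, n - 1 + j] * w[c, i, j]
                cycles += 1        # one multiply-accumulate per cycle
    return acc, cycles

y_mn, cycles = target_pixel(m=5, n=7)
print(cycles)                      # 36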
The above assumes that all weights w(0,0) to w(2,2) are non-zero. In the present application, however, the weight first-in first-out queue stores only the non-zero weights: zero weights are discarded, and the source data at the positions corresponding to zero weights is discarded as well. In other words, only the non-zero weights and the corresponding source data take part in the convolution operation, which markedly improves the operation efficiency.
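Continuing the same illustrative sketch (and reusing x, w, C and K from above), the zero-skipping variant below streams only the weight/position pairs that survive zero-value detection, so the number of multiply-accumulate cycles shrinks with the sparsity.

def sparse_target_pixel(m, n):
    """Zero-skipping convolution for one output position (illustrative sketch)."""
    # Zero-value detection: keep (channel, i, j, weight) only where the weight is non-zero.
    nonzero = [(c, i, j, w[c, i, j])
               for c in range(C) for i in range(K) for j in range(K)
               if w[c, i, j] != 0]
    acc = 0.0
    for c, i, j, wv in nonzero:                  # one cycle per non-zero weight
        acc += x[c, m - 1 + i, n - 1 + j] * wv
    return acc, len(nonzero)                     # cycles = number of non-zero weights

# With 80% of the weights pruned to zero, roughly 36 * 0.2 = 7.2 cycles remain on average.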
On the basis of one multiply-accumulator, S multiply-accumulators are taken as an example for explanation:
As shown in FIG. 4, in a preferred embodiment S = N is set, that is, the convolution block processing unit includes 1024 multiply-accumulators; it calculates the convolution operations of 1024 target data at a time and generates an entire row of target data of the output layer.
At this point the source data first-in first-out queue outputs S source data at a time; each source datum is fed into one multiply-accumulator and convolved with the convolution kernel inside it. The convolution kernels in all multiply-accumulators are identical, so the S target data are calculated simultaneously, and when the calculation finishes an entire row of target data of the output layer is obtained.
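Reusing the illustrative arrays above, the following vectorized sketch shows this whole-row mode: each non-zero weight is broadcast to S multiply-accumulators, each of which accumulates one target datum of row m, so one cycle is spent per non-zero weight.

def sparse_target_row(m, S=1024):
    """Whole-row mode (S = N): S accumulators, one per output column of row m."""
    acc = np.zeros(S)
    cycles = 0
    for c in range(C):
        for i in range(K):
            for j in range(K):
                wv = w[c, i, j]
                if wv == 0:
                    continue                        # pruned weight: skipped entirely
                src = x[c, m - 1 + i, j : j + S]    # S source values for this weight position
                acc += wv * src                     # S multiply-accumulates in parallel (one cycle)
                cycles += 1
    return acc, cycles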
For the convolutional neural network device of the present application, another possible solution is to set S < N, i.e. S is chosen smaller than the number of target data in one entire row of the output layer, here S < 1024.
For example, S is chosen as 200, i.e. the convolution block processing unit includes 200 multiply-accumulators, and one pass simultaneously calculates the convolution operations of 200 target data, producing part of one row of the output layer. After the current pass finishes, the next 200 source data are taken from the not-yet-calculated data and convolved, until all data have been processed.
If fewer than 200 source data remain in the last pass, for example only 32, the calculation is performed by only the first 32 multiply-accumulators, or the 32 remaining data are assigned to any 32 of the multiply-accumulators.
For the convolutional neural network device of the present application, yet another possible solution is to set S > N, i.e. S is chosen larger than the number of target data in one entire row of the output layer, here S > 1024.
For example, S is chosen as 2000, that is, the convolution block processing unit is composed of 2000 multiply-accumulators, and one pass simultaneously calculates the convolution operations of 2000 target data, covering more than one entire row of target data of the output layer. After the pass finishes, the next 2000 source data are taken from the not-yet-calculated data and convolved, until all data have been processed.
If fewer than 2000 source data remain in the last pass, for example only 432, the calculation is performed by only the first 432 multiply-accumulators, or the 432 remaining data are assigned to any 432 of the multiply-accumulators.
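A short sketch of this tiling, assuming the 768 × 1024 target data of the output layer are processed in flattened row-major order, S at a time; the chunking logic and names are illustrative.

M, N = 768, 1024            # output layer: rows x columns
total = M * N               # 786432 target data in row-major order
S = 2000                    # number of multiply-accumulators in the convolution block

passes = 0
for start in range(0, total, S):
    chunk = min(S, total - start)   # the last pass may use fewer multiply-accumulators (here 432)
    # the convolution block would compute `chunk` target data in parallel in this pass
    passes += 1
print(passes)               # 394 passes; 786432 - 393 * 2000 = 432 remain for the final pass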
In an actual convolutional neural network compressed by a basic pruning and sparsification algorithm, the non-zero weights are only about 1/5 of all weights. After the zero weights are removed, the convolution block processing unit therefore needs on average only 7.2 cycles (36 × 1/5) to complete the convolution operations of S target data.
Since the data are pruned according to the positions of the zero weights, unconstrained support for pruning and sparsification is realized, the pruning and sparsification effect is markedly improved, and the convolution calculation efficiency is improved.
A convolutional neural network apparatus supporting pruning sparsification compression may be a computer device, which may be a terminal. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
In another embodiment, a calculation method is provided, where the calculation method is implemented based on the foregoing convolutional neural network device supporting pruning sparse compression, that is, the calculation method involves a weight buffer, a zero detection unit, a source data buffer, a source data first-in first-out queue, a weight first-in first-out queue, a convolutional block processing unit, and a target data buffer, and the calculation method includes:
outputting weights to the zero value detection unit through the weight buffer, wherein each weight corresponds to different position information;
the zero value detection unit judges whether the received weight value is zero or not, outputs a non-zero weight value to the weight value first-in first-out queue, and outputs position information corresponding to the non-zero weight value to the source data buffer;
the source data buffer outputs corresponding source data to the source data first-in first-out queue according to the received position information corresponding to the nonzero weight value;
the source data first-in first-out queue outputs S source data to the convolution block processing unit according to the first-in first-out principle;
the weight first-in first-out queue outputs a non-zero weight to the convolution block processing unit according to the first-in first-out principle;
the convolution block processing unit is used for calculating convolution operation of S target data in parallel according to the received nonzero weight and S source data and outputting the S target data obtained through calculation to the target data buffer;
the target data buffer buffers the S target data output by the convolution block processing unit.
Specifically, the convolution block processing unit includes S multiply-accumulators, each of which is used to calculate a convolution operation of a single target data, and S of the multiply-accumulators are calculated in parallel.
If the pixels of the input layer of the convolutional neural network are M rows × N columns, the number S of multiply-accumulators satisfies one of the following relations:
S > N; or, S = N; or, S < N. That is, the number of multiply-accumulators in the convolution block processing unit can be adjusted according to the actual situation.
In the calculation method provided by this embodiment, before the convolution operation it is judged whether each weight is zero, and according to the result only the non-zero weights and the source data corresponding to them are output. Pruning and sparsification thus take the input data as the entry point, reducing the redundancy of the convolution kernels and lowering the computational complexity. At the same time, unconstrained pruning and sparsification can be performed according to the positions of the zero weights, markedly improving the pruning and sparsification effect and improving the convolution calculation efficiency.
For further limitation of the calculation method, reference may be made to the above-mentioned limitation on the convolutional neural network device supporting pruning sparseness compression, and details thereof are not repeated here.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (8)

1. A convolutional neural network device supporting pruning sparse compression, characterized in that the convolutional neural network device comprises a weight buffer, a zero value detection unit, a source data buffer, a source data first-in first-out queue, a weight first-in first-out queue, a convolution block processing unit and a target data buffer, wherein:
the weight buffer is used for storing the weights of the convolutional neural network and outputting the weights to the zero value detection unit, and each weight corresponds to different position information;
the zero value detection unit is used for judging whether the received weight value is zero or not, outputting a non-zero weight value to the weight value first-in first-out queue and outputting position information corresponding to the non-zero weight value to the source data buffer;
the source data buffer is used for storing source data of the convolutional neural network and outputting the corresponding source data to the source data first-in first-out queue according to the position information corresponding to the received nonzero weight;
the source data first-in first-out queue is used for storing the source data output by the source data buffer and outputting S source data to the convolution block processing unit according to a first-in first-out principle;
the weight first-in first-out queue is used for storing the non-zero weights output by the zero value detection unit and outputting the non-zero weights to the convolution block processing unit according to the first-in first-out principle;
the convolution block processing unit is used for calculating convolution operation of S target data in parallel according to the received nonzero weight and S source data and outputting the S target data obtained through calculation to the target data buffer;
and the target data buffer is used for buffering S target data output by the convolution block processing unit.
2. The convolutional neural network device supporting pruning sparsification compression as claimed in claim 1, wherein the convolutional block processing unit comprises S multiply-accumulators, each multiply-accumulator for calculating a convolution operation of a single target data.
3. The convolutional neural network device supporting pruning sparsification compression as claimed in claim 2, wherein the S multiply-accumulators perform their calculations in parallel.
4. The convolutional neural network device supporting pruning sparsification compression as claimed in claim 2, wherein the pixels of the input layer of the convolutional neural network are M rows × N columns, and the number S of multiply-accumulators satisfies one of the following relations:
S > N;
or, S = N;
or, S < N.
5. A calculation method based on the convolutional neural network device supporting pruning sparse compression according to claim 1, the calculation method involving a weight buffer, a zero value detection unit, a source data buffer, a source data first-in first-out queue, a weight first-in first-out queue, a convolution block processing unit, and a target data buffer, the calculation method comprising:
outputting weights to the zero value detection unit through the weight buffer, wherein each weight corresponds to different position information;
the zero value detection unit judges whether the received weight value is zero or not, outputs a non-zero weight value to the weight value first-in first-out queue, and outputs position information corresponding to the non-zero weight value to the source data buffer;
the source data buffer outputs corresponding source data to the source data first-in first-out queue according to the received position information corresponding to the nonzero weight value;
the source data first-in first-out queue outputs S source data to the convolution block processing unit according to the first-in first-out principle;
the weight first-in first-out queue outputs a non-zero weight to the convolution block processing unit according to the first-in first-out principle;
the convolution block processing unit is used for calculating convolution operation of S target data in parallel according to the received nonzero weight and S source data and outputting the S target data obtained through calculation to the target data buffer;
the target data buffer buffers the S target data output by the convolution block processing unit.
6. The calculation method of claim 5, wherein the convolution block processing unit includes S multiply-accumulators, each multiply-accumulator for calculating a convolution operation of a single target data.
7. The calculation method of claim 6, wherein the S multiply-accumulators perform their calculations in parallel.
8. The calculation method of claim 6, wherein the pixels of the input layer of the convolutional neural network are M rows × N columns, and the number S of multiply-accumulators satisfies one of the following relations:
S > N;
or, S = N;
or, S < N.
CN201911312338.XA 2019-12-18 2019-12-18 Convolutional neural network device supporting pruning sparse compression and calculation method Active CN111126569B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911312338.XA CN111126569B (en) 2019-12-18 2019-12-18 Convolutional neural network device supporting pruning sparse compression and calculation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911312338.XA CN111126569B (en) 2019-12-18 2019-12-18 Convolutional neural network device supporting pruning sparse compression and calculation method

Publications (2)

Publication Number Publication Date
CN111126569A true CN111126569A (en) 2020-05-08
CN111126569B CN111126569B (en) 2022-11-11

Family

ID=70498305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911312338.XA Active CN111126569B (en) 2019-12-18 2019-12-18 Convolutional neural network device supporting pruning sparse compression and calculation method

Country Status (1)

Country Link
CN (1) CN111126569B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738435A (en) * 2020-06-22 2020-10-02 上海交通大学 Online sparse training method and system based on mobile equipment
CN113592072A (en) * 2021-07-26 2021-11-02 中国人民解放军国防科技大学 Sparse convolution neural network accelerator oriented to memory access optimization
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
US20180218518A1 (en) * 2017-02-01 2018-08-02 Nvidia Corporation Data compaction and memory bandwidth reduction for sparse neural networks
CN108510066A (en) * 2018-04-08 2018-09-07 清华大学 A kind of processor applied to convolutional neural networks
CN110070178A (en) * 2019-04-25 2019-07-30 北京交通大学 A kind of convolutional neural networks computing device and method
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection
US20190347554A1 (en) * 2018-05-14 2019-11-14 Samsung Electronics Co., Ltd. Method and apparatus for universal pruning and compression of deep convolutional neural networks under joint sparsity constraints

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107229967A (en) * 2016-08-22 2017-10-03 北京深鉴智能科技有限公司 A kind of hardware accelerator and method that rarefaction GRU neutral nets are realized based on FPGA
US20180218518A1 (en) * 2017-02-01 2018-08-02 Nvidia Corporation Data compaction and memory bandwidth reduction for sparse neural networks
CN108280514A (en) * 2018-01-05 2018-07-13 中国科学技术大学 Sparse neural network acceleration system based on FPGA and design method
CN108510066A (en) * 2018-04-08 2018-09-07 清华大学 A kind of processor applied to convolutional neural networks
US20190347554A1 (en) * 2018-05-14 2019-11-14 Samsung Electronics Co., Ltd. Method and apparatus for universal pruning and compression of deep convolutional neural networks under joint sparsity constraints
CN110070178A (en) * 2019-04-25 2019-07-30 北京交通大学 A kind of convolutional neural networks computing device and method
CN110222835A (en) * 2019-05-13 2019-09-10 西安交通大学 A kind of convolutional neural networks hardware system and operation method based on zero value detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANGSHUMAN PARASHAR ET AL.: "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks", 《HTTPS://ARXIV.ORG/ABS/1708.04485》 *
L. LU ET AL.: "An Efficient Hardware Accelerator for Sparse Convolutional Neural Networks on FPGAs", 《2019 IEEE 27TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738435A (en) * 2020-06-22 2020-10-02 上海交通大学 Online sparse training method and system based on mobile equipment
CN111738435B (en) * 2020-06-22 2024-03-29 上海交通大学 Online sparse training method and system based on mobile equipment
CN113592072A (en) * 2021-07-26 2021-11-02 中国人民解放军国防科技大学 Sparse convolution neural network accelerator oriented to memory access optimization
CN113592072B (en) * 2021-07-26 2024-05-14 中国人民解放军国防科技大学 Sparse convolutional neural network accelerator for memory optimization
CN114780910A (en) * 2022-06-16 2022-07-22 千芯半导体科技(北京)有限公司 Hardware system and calculation method for sparse convolution calculation

Also Published As

Publication number Publication date
CN111126569B (en) 2022-11-11

Similar Documents

Publication Publication Date Title
CN111684473B (en) Improving performance of neural network arrays
CN111126569B (en) Convolutional neural network device supporting pruning sparse compression and calculation method
US11307865B2 (en) Data processing apparatus and method
CN110050267B (en) System and method for data management
CN109478144B (en) Data processing device and method
Ju et al. An FPGA implementation of deep spiking neural networks for low-power and fast classification
CN109543832B (en) Computing device and board card
KR102637735B1 (en) Neural network processing unit including approximate multiplier and system on chip including the same
CN110163357B (en) Computing device and method
CN113051216B (en) MobileNet-SSD target detection device and method based on FPGA acceleration
CN110163350B (en) Computing device and method
CN110766127B (en) Neural network computing special circuit and related computing platform and implementation method thereof
CN112115801B (en) Dynamic gesture recognition method and device, storage medium and terminal equipment
CN111353591A (en) Computing device and related product
US20210089888A1 (en) Hybrid Filter Banks for Artificial Neural Networks
Vinh et al. Facial expression recognition system on SoC FPGA
CN114925320B (en) Data processing method and related device
Maurya et al. Complex human activities recognition based on high performance 1D CNN model
Gonzalez et al. An inference hardware accelerator for EEG-based emotion detection
Feng et al. An Efficient Model-Compressed EEGNet Accelerator for Generalized Brain-Computer Interfaces With Near Sensor Intelligence
CN110716751B (en) High-parallelism computing platform, system and computing implementation method
CN112836793B (en) Floating point separable convolution calculation accelerating device, system and image processing method
US20200104207A1 (en) Data processing apparatus and method
Tapiador-Morales et al. Event-based row-by-row multi-convolution engine for dynamic-vision feature extraction on fpga
CN111382848A (en) Computing device and related product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20200519

Address after: No. 36, Ma Cheng Road, Hangzhou City, Zhejiang Province, 310012

Applicant after: NO.52 RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY GROUP Corp.

Address before: Yuhang District, Hangzhou City, Zhejiang Province, 311121 West No. 1500 Building 1 room 311

Applicant before: CETHIK GROUP Co.,Ltd.

TA01 Transfer of patent application right
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant