CN111275184B - Method, system, device and storage medium for realizing neural network compression - Google Patents

Method, system, device and storage medium for realizing neural network compression

Info

Publication number
CN111275184B
Authority
CN
China
Prior art keywords
weight
data
value
neural network
weight data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010039749.2A
Other languages
Chinese (zh)
Other versions
CN111275184A (en)
Inventor
陈弟虎
萧嘉乐
粟涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202010039749.2A priority Critical patent/CN111275184B/en
Publication of CN111275184A publication Critical patent/CN111275184A/en
Application granted granted Critical
Publication of CN111275184B publication Critical patent/CN111275184B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/082 - Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method, a system, a device and a storage medium for realizing neural network compression, wherein the method comprises the following steps: S1, acquiring weight data of a neural network; S2, training the weight data with a preset local variance as a constraint condition to obtain new weight data; S3, detecting whether the recognition precision of the neural network based on the new weight data meets a preset requirement, and if so, executing step S4, otherwise returning to step S2; S4, converting the representation form of the new weight data according to a preset mode to realize compression of the weight data; S5, detecting whether the compression ratio of the weight data meets the preset requirement, and if so, completing the compression, otherwise returning to step S2. The invention realizes compression of the weight data by converting the representation form of the weight data of the neural network without losing the recognition precision of the neural network, thereby improving the computing performance of the accelerator, and can be widely applied in neural network data processing.

Description

Method, system, device and storage medium for realizing neural network compression
Technical Field
The present invention relates to neural network data processing technologies, and in particular, to a method, a system, an apparatus, and a storage medium for implementing neural network compression.
Background
Convolutional neural networks, as one kind of deep neural network, are widely used in fields such as image classification and target recognition because their recognition accuracy is superior to that of conventional recognition algorithms based on empirical models. Notably, most of the computation in a convolutional neural network is parallel convolution, so to improve the energy efficiency ratio of the computation it is necessary to design hardware accelerators dedicated to accelerating convolutional neural network computation. Such hardware accelerators face two problems: how to increase the parallelism of the computation, and how to increase the effective memory bandwidth of the accelerator.
For the latter problem, both academia and industry have carried out research, and the mainstream solutions can be divided into three types: pruning of the network, quantization of the network, and compression of the network data. Pruning of the network refers to observing the contribution of each term to the final result during training and checking whether redundant zero values exist in the computation, and then deleting the redundant values from the network to slim it down. Quantization of the network refers to representing the feature values and weight values required by convolution with low-precision data representations, i.e., 16-bit or 8-bit fixed-point data, and then retraining, so as to reduce the overall bandwidth occupied by the network weights and feature values. Compression of the network data refers to compressing the weights in the network by borrowing practices from audio and video coding and decoding, for example compressing them in the form of a Huffman code.
The three mainstream solutions above for increasing the effective bandwidth of the accelerator all have limitations. Pruning makes the network sparse; when applied to a general hardware accelerator computing architecture, the utilization of the computing units becomes low and the energy efficiency ratio improves only to a limited extent. With mainstream quantization strategies, the minimum quantization bit width is about 8 bits on the premise of not reducing accuracy. Variable-length coding and decoding requires an additional codec module on the hardware side, and decoding may cause the data to lose the regularity of the original computation mode, producing misaligned data, so that the actual computation cannot use the theoretical on-chip storage bandwidth.
Disclosure of Invention
In order to solve one of the above technical problems, an object of the present invention is to provide a method, a system, an apparatus, and a storage medium for implementing neural network compression, which can effectively compress weights of a neural network without losing recognition accuracy.
The first technical scheme adopted by the invention is as follows:
a method of implementing neural network compression, comprising the steps of:
s1, acquiring weight data of the neural network;
s2, training the weight data by using a preset local variance as a constraint condition to obtain new weight data;
s3, detecting whether the recognition precision of the neural network based on the new weight data meets the preset requirement, and if so, executing the step S4; otherwise, return to step S2;
s4, converting the representation form of the new weight data according to a preset mode, and compressing the weight data;
s5, detecting whether the compression ratio of the weight data meets the preset requirement, and if so, completing the compression step; otherwise, the process returns to step S2.
Further, the weight data is data quantized as fixed-point data, and step S4 specifically includes:
converting the new weight data into a representation form of local mean data plus difference data;
wherein, a local mean value corresponds to a plurality of weights, and a difference value corresponds to a weight.
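For instance (an illustrative example, not taken from the embodiments below), with X = 8 and Y = 4, a local group of four 8-bit weights 118, 119, 121 and 122 could be stored as a single 8-bit local mean of 120 plus the four 4-bit signed differences -2, -1, +1 and +2, so only one value per group keeps the full bit width.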
Further, the preset local variance is calculated by the following formula:
[Formula image: the maximum local variance S² expressed in terms of Num and Y]
wherein Num represents the number of weights within a local region, and Y represents the bit width of the difference.
Further, the compression rate of the weight data is calculated in the following way:
the compression ratio is obtained by combining the local size values of the weight dimensions of the neural network, the bit width of the difference, the bit width of the local mean, and a first preset formula.
Further, the first preset formula is as follows:
[Formula image: compression ratio expressed in terms of X, Y and the local size values Npart, Mpart, K2part and K1part]
wherein Npart, Mpart, K2part and K1part are the local size values in the four weight dimensions, Y represents the bit width of the difference, and X represents the bit width of the weights in the weight data.
Further, the method also comprises the step of designing the on-chip storage module, which specifically comprises the following steps:
determining a mean value address generator and a mean value storage module according to the local mean value data;
determining a difference address generator and a difference storage module according to the difference data;
and determining a configurable shifter according to the bit width of the weight value and the bit width of the difference value, wherein the shifter is used for restoring the compressed bit width to the uncompressed bit width.
The second technical scheme adopted by the invention is as follows:
a system for implementing neural network compression, comprising:
the data acquisition module is used for acquiring weight data of the neural network;
the weight training module is used for training the weight data by adopting a preset local variance as a constraint condition to obtain new weight data;
the precision detection module is used for detecting whether the recognition precision of the neural network based on the new weight data meets the preset requirement or not, and if so, jumping to the form conversion module; otherwise, returning to the weight value training module;
the form conversion module is used for converting the representation form of the new weight data according to a preset mode and realizing the compression of the weight data;
the compression ratio detection module is used for detecting whether the compression ratio of the weight data meets the preset requirement or not, and if so, completing the compression step; otherwise, returning to the weight value training module.
Further, the system also includes an on-chip storage design module, which includes:
the local mean value addressing unit is used for determining a mean value address generator and a mean value storage module according to the local mean value data;
the searching addressing unit is used for determining a difference value address generator and a difference value storage module according to the difference value data;
and the shifter is used for recovering the compressed bit width into the uncompressed bit width.
The third technical scheme adopted by the invention is as follows:
an apparatus for implementing neural network compression, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The fourth technical scheme adopted by the invention is as follows:
a storage medium having stored therein processor-executable instructions for performing the method as described above when executed by a processor.
The invention has the following beneficial effects: the invention realizes compression of the weight data by converting the representation form of the weight data of the neural network without losing the recognition precision of the neural network, thereby improving the computing performance of the accelerator and making the invention suitable for hardware accelerators sensitive to power consumption and performance.
Drawings
FIG. 1 is a flow diagram of steps in a method of implementing neural network compression in an embodiment;
FIG. 2 is a diagram illustrating a standard convolution process in an embodiment;
FIG. 3 is a diagram illustrating weight splitting after variance constraint in the embodiment;
FIG. 4 is a diagram illustrating the transmission of weights to a computing unit according to an embodiment;
FIG. 5 is a schematic diagram of the operation of the configurable shifter in an embodiment;
FIG. 6 is a block diagram of a system for implementing neural network compression according to an embodiment.
Detailed Description
As shown in fig. 1, the present embodiment provides a method for implementing neural network compression, including the following steps:
s101, weight data of the neural network are obtained.
The weight data in this embodiment is data quantized as fixed-point data, and can specifically be an X-bit quantization result, such as 8 bits or 16 bits. The neural network can be a convolutional neural network, a common deep network, or the like.
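As a point of reference, the following is a minimal sketch of one common way such an X-bit fixed-point quantization result could be produced (symmetric quantization with a single scale per tensor); the details are an assumption for illustration and not the procedure used in this embodiment:

```python
import numpy as np

def quantize_fixed_point(weights, x_bits=8):
    """Quantize float weights to signed X-bit integers sharing one scale."""
    q_max = (1 << (x_bits - 1)) - 1              # e.g. 127 for 8 bits
    scale = np.abs(weights).max() / q_max        # single scale for the tensor
    q = np.clip(np.rint(weights / scale), -q_max - 1, q_max).astype(np.int32)
    return q, scale                              # dequantize as q * scale
```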
S102, training the weight data by using a preset local variance as a constraint condition to obtain new weight data.
A constraint condition of local variance is added to the weight training process according to the actual compression ratio requirement, so that weights with smaller local differences are obtained after training.
The preset local variance may be specifically calculated by:
consider a standard convolution process, as shown in FIG. 2. And performing convolution calculation on the input features with the row number W, the column number D and the channel number M and convolution kernels with kernels of N channels K multiplied by K to obtain output features with the size of W multiplied by D multiplied by N. Then the maximum space for local variance constraints on the weights is K x M x N and the constraints can be done in K, K, M and N four weight dimensions. Considering that the bit width of the weight data input from the beginning is X bits, and the bit width of the difference data after the target compression is Y bits. Then the statistical maximum variance S in the presence of the Y-bit data difference can be found from the variance calculation2Is (where Num is the number of local internal weights, related to the selected local size):
[Formula image: the maximum local variance S² expressed in terms of Num and Y]
When the theoretical maximum variance S² is applied to the weight training process, note that Num in the formula is a variable: it depends on the dimensions and range selected for the local regions, and its maximum value is K × K × M × N. The local regions must also tile the whole exactly, that is, when choosing the local range, the size selected in each dimension must evenly divide the maximum value in that dimension. This guarantees that the local partitions have no redundancy while all weights are variance constrained.
The weight value can be trained by using the existing training method, which is not the key point of this embodiment and is not repeated here.
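As an illustration only, a local-variance constraint of this kind could be folded into an ordinary training loop as a penalty term; the following PyTorch-style sketch, including the block layout, the threshold s2_max and the penalty weight lam, is an assumption for illustration and not the exact training procedure of this embodiment:

```python
import torch

def local_variance_penalty(weight, group, s2_max):
    """Penalty on local weight blocks whose variance exceeds the allowed maximum.

    weight: 4-D convolution weight of shape (N, M, K, K)
    group:  local sizes (Npart, Mpart, K2part, K1part); each size is assumed
            to divide the corresponding weight dimension exactly
    s2_max: maximum local variance allowed for the target Y-bit difference
    """
    n, m, k2, k1 = weight.shape
    gn, gm, gk2, gk1 = group
    # split the weight tensor into non-overlapping local blocks
    blocks = (weight
              .reshape(n // gn, gn, m // gm, gm, k2 // gk2, gk2, k1 // gk1, gk1)
              .permute(0, 2, 4, 6, 1, 3, 5, 7)
              .reshape(-1, gn * gm * gk2 * gk1))
    var = blocks.var(dim=1, unbiased=False)           # variance of each block
    return torch.clamp(var - s2_max, min=0.0).sum()   # penalize only the excess

# usage inside an existing training step (task_loss, lam assumed to exist):
# loss = task_loss + lam * local_variance_penalty(conv.weight, (2, 2, 1, 1), s2_max)
```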
S103, detecting whether the recognition precision of the neural network based on the new weight data meets a preset requirement, and if so, executing a step S104; otherwise, the process returns to step S102.
The new weight data is applied to the neural network, the network is run, and it is detected whether the recognition precision of the network meets a preset requirement; the recognition precision can be, for example, the accuracy of image recognition. If the requirement is met, the weight data is acceptable; if not, the procedure returns to step S102 to continue training until the preset requirement is met.
And S104, converting the new representation form of the weight data according to a preset mode, and compressing the weight data.
Specifically, step S104 specifically includes: and converting the new weight data into a representation form of local mean data and difference data, wherein one local mean corresponds to a plurality of weights, and one difference corresponds to one weight.
As shown in FIG. 3, once the maximum variance has been applied to the training process so that the originally unordered local weights are limited to a range, the fixed-point representation of each weight can be changed from the original X bits to a local mean plus a difference, where the local mean is represented with X bits and the difference with Y bits. The original weights are thus decomposed into two parts: one part is the original-size set of weights represented by the differences, and the other part is the set of local mean weights that must be indexed through address conversion.
Fig. 3 illustrates an example of weight splitting for a local size of 2 × 2, and it is easy to find that the address index of the local average value is related to the local size. If the four-dimensional addressing is converted into one-dimensional addressing representation, the address index of the local mean value can be obtained as follows:
[Formula image: one-dimensional address index of the local mean as a function of the weight coordinates and the local size values Npart, Mpart, K2part and K1part]
wherein Npart, Mpart, K2part and K1part are the local size values in each weight dimension.
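A small NumPy sketch of this decomposition is given below for illustration; the even tiling of the four weight dimensions follows the description above, while the rounding of the mean and the flat-index helper mean_index are assumptions rather than the exact address conversion of the embodiment:

```python
import numpy as np

def split_weights(weight, group, y_bits):
    """Decompose fixed-point weights into local means plus per-weight differences.

    weight: integer array of shape (N, M, K, K)
    group:  (Npart, Mpart, K2part, K1part), each dividing its dimension exactly
    """
    n, m, k2, k1 = weight.shape
    gn, gm, gk2, gk1 = group
    blocks = (weight
              .reshape(n // gn, gn, m // gm, gm, k2 // gk2, gk2, k1 // gk1, gk1)
              .transpose(0, 2, 4, 6, 1, 3, 5, 7)
              .reshape(-1, gn * gm * gk2 * gk1))
    means = np.rint(blocks.mean(axis=1)).astype(weight.dtype)   # one X-bit mean per block
    diffs = (blocks - means[:, None]).astype(np.int64)          # one difference per weight
    lo, hi = -(1 << (y_bits - 1)), (1 << (y_bits - 1)) - 1
    assert diffs.min() >= lo and diffs.max() <= hi, "difference does not fit in Y bits"
    return means, diffs

def mean_index(coord, dims, group):
    """Hypothetical flat index of the local mean for weight coordinate (n, m, k2, k1)."""
    ni, mi, k2i, k1i = coord
    n, m, k2, k1 = dims
    gn, gm, gk2, gk1 = group
    return (((ni // gn) * (m // gm) + mi // gm) * (k2 // gk2)
            + k2i // gk2) * (k1 // gk1) + k1i // gk1
```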
S105, detecting whether the compression rate of the weight data meets a preset requirement, and if so, completing the compression step; otherwise, the process returns to step S102.
Specifically, step S105 includes: calculating the compression ratio by combining the local size values of the weight dimensions of the neural network, the bit width of the difference, the bit width of the local mean, and the first preset formula.
According to the above description, the weight data originally represented with X bits can be decomposed into a small number of local means represented with X bits plus the original number of differences represented with Y bits, thereby compressing the weights. The compression ratio A can be obtained by comparing the storage space occupied by the weights before and after compression:
[Formula image: compression ratio A expressed in terms of X, Y and the local size values Npart, Mpart, K2part and K1part]
wherein Npart, Mpart, K2part and K1part are the local size values in the four weight dimensions, Y represents the bit width of the difference, and X represents the bit width of the weights in the weight data. It can be seen from the formula that the compression ratio is mainly determined by the original X bits and the Y bits of the difference; when Y is half of X, the compression ratio is approximately 50%.
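Under the storage accounting just described (one X-bit mean per local block of Npart × Mpart × K2part × K1part weights and one Y-bit difference per weight), the compression ratio can be estimated with a small helper; reading the ratio as compressed storage over original storage is an assumption consistent with the remark about approximately 50%:

```python
def compression_ratio(x_bits, y_bits, group):
    """Compressed storage divided by original storage for one local grouping.

    Every weight keeps a Y-bit difference and every block of
    Npart * Mpart * K2part * K1part weights shares one X-bit mean.
    """
    block = 1
    for g in group:
        block *= g
    return y_bits / x_bits + 1.0 / block

# e.g. X = 8, Y = 4 with a 2 x 2 block gives 4/8 + 1/4 = 0.75,
# while a 4 x 4 block gives 4/8 + 1/16 = 0.5625, approaching ~50%.
```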
Further as an optional implementation manner, after step S105, step S106 of designing an on-chip storage module is further included, and step S106 specifically includes steps S1061, S1062, and S1063:
s1061, determining a mean value address generator and a mean value storage module according to the local mean value data;
s1062, determining a difference address generator and a difference storage module according to the difference data;
s1063, determining a configurable shifter according to the bit width of the weight and the bit width of the difference, wherein the shifter is used for recovering the compressed bit width to the bit width which is not compressed.
During the convolution calculation, the weights need to be buffered in on-chip storage to reduce the number of off-chip memory accesses and increase calculation efficiency. For the compressed weights, a dedicated on-chip storage structure therefore needs to be designed in the hardware accelerator, so that the on-chip decompression process does not affect calculation efficiency and the input required by the calculation array is not limited by the output bandwidth of the on-chip storage.
The on-chip storage of the weights may be organized as shown in FIG. 4. The address generators produce the address information needed to address the mean and difference memory blocks; note that the address generator for the mean memory block is configurable, depending on the local sizes of the dimensions used and the bit widths before and after compression, as discussed above for weight splitting. The weight storage block is mainly divided into a difference storage block and a mean storage block, which store the decomposed weights. After the weights are read out with the addresses from the address generators, each difference weight is restored to the normal bit width through a configurable shifter and then added to the corresponding mean weight, yielding the final weight output to the calculation array; this process is the decompression. The configurable shifter here is also related to the compression parameters.
The configurable shifter operates as shown in FIG. 5. Its inputs are the Y-bit difference to be amplified and the shift parameters a and b, where X = Y + a + b. The Y-bit difference is amplified according to the parameters, i.e., shifted left by a bits, with 0s or 1s padded on the right; then b bits of 0 or 1 are prepended to the result to form the complete X-bit weight. Whether 0 or 1 is padded in the a and b positions is determined by the sign of the difference: 0 is padded when the difference is positive and 1 when it is negative.
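A software model of this decompression step might look like the sketch below; the bit-level padding convention (0s for a positive difference, 1s for a negative one) follows the description above, while the two's-complement interpretation of the restored value is an assumption:

```python
def decompress_weight(diff, mean, y_bits, a, b):
    """Model of the configurable shifter plus the final addition (X = y_bits + a + b)."""
    x_bits = y_bits + a + b
    pad = 0 if diff >= 0 else 1
    raw = diff & ((1 << y_bits) - 1)                     # Y-bit two's-complement pattern
    shifted = (raw << a) | (pad * ((1 << a) - 1))        # shift left by a, pad on the right
    restored = (pad * (((1 << b) - 1) << (y_bits + a))) | shifted  # prepend b pad bits
    if restored >= 1 << (x_bits - 1):                    # reinterpret as signed X-bit value
        restored -= 1 << x_bits
    return mean + restored                               # add the corresponding local mean
```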
The method of this embodiment can compress convolutional layers of any size, and can also compress a complete network (formed by combining multiple convolutional layers). The preset local variance can also be calculated with other formulas, or the variance constraint can be replaced by another constraint, while still compressing the data without losing the recognition precision of the neural network.
In summary, compared with the existing method, the method of the embodiment at least has the following beneficial effects:
(1) and on the premise of not losing the identification precision of the neural network, compressing the weight of the convolutional neural network so as to be suitable for the hardware accelerator sensitive to power consumption and performance.
(2) The variance constraint and the compression process can be completely configurable, can be compressed according to the requirements of compression rate and precision, and can be adapted to the convolutional layer with any size.
(3) And the regular and aligned compression is realized while the weight is compressed, so that the waste of input bandwidth caused by the decompression of the weight in the actual weight reading process of the hardware accelerator is avoided.
(4) The embodiment realizes a method for really and effectively reducing the weight in the convolutional neural network, thereby realizing the improvement of the performance of the hardware accelerator calculation.
As shown in fig. 6, the present embodiment provides a system for implementing neural network compression, including:
the data acquisition module is used for acquiring weight data of the neural network;
the weight training module is used for training the weight data by adopting a preset local variance as a constraint condition to obtain new weight data;
the precision detection module is used for detecting whether the recognition precision of the neural network based on the new weight data meets the preset requirement or not, and if so, jumping to the form conversion module; otherwise, returning to the weight value training module;
the form conversion module is used for converting the representation form of the new weight data according to a preset mode and realizing the compression of the weight data;
the compression ratio detection module is used for detecting whether the compression ratio of the weight data meets the preset requirement or not, and if so, completing the compression step; otherwise, returning to the weight value training module.
As a further optional implementation, the system further includes an on-chip memory design module, including:
the local mean value addressing unit is used for determining a mean value address generator and a mean value storage module according to the local mean value data;
the searching addressing unit is used for determining a difference value address generator and a difference value storage module according to the difference value data;
and the shifter is used for recovering the compressed bit width into the uncompressed bit width.
The system for realizing neural network compression of the embodiment can execute the method for realizing neural network compression provided by the method embodiment of the invention, can execute any combination implementation steps of the method embodiment, and has corresponding functions and beneficial effects of the method.
The embodiment also provides a device for realizing neural network compression, which comprises:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The device for realizing neural network compression of the embodiment can execute the method for realizing neural network compression provided by the method embodiment of the invention, can execute any combination implementation steps of the method embodiment, and has corresponding functions and beneficial effects of the method.
The present embodiments also provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method as described above.
The storage medium of this embodiment may execute the method for implementing neural network compression provided by the method embodiment of the present invention, may execute any combination of the implementation steps of the method embodiment, and has corresponding functions and advantages of the method.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A method of implementing neural network compression, comprising the steps of:
s1, acquiring weight data of the neural network;
s2, training the weight data by using the maximum variance as a constraint condition to obtain new weight data;
s3, detecting whether the recognition precision of the neural network based on the new weight data meets the preset requirement, and if so, executing the step S4; otherwise, return to step S2;
s4, converting the representation form of the new weight data according to a preset mode, and compressing the weight data;
s5, detecting whether the compression ratio of the weight data meets the preset requirement, and if so, completing the compression step; otherwise, return to step S2;
the weight data is quantized data of fixed point data, and the step S4 specifically includes:
converting the new weight data into a representation form of local mean data plus difference data;
wherein, a local mean value corresponds to a plurality of weight values, a difference value corresponds to a weight value, and the maximum variance is applied to the training process, so that the original local unordered weight value is limited in a range, the weight value can be modified from the original fixed-point representation mode of X bits into the representation mode of the local mean value plus the difference value, wherein the local mean value is represented by X-bit data, and the difference value is represented by Y bits, the original weight value can be decomposed into two parts, one part is the partial weight value of the original size represented by the difference value, and the other part is the local mean weight value which needs to be indexed through address conversion;
the maximum variance is calculated by the following formula:
[Formula image: the maximum local variance S² expressed in terms of Num and Y]
wherein, Num represents the number of local internal weights, and Y represents the bit width of the difference.
2. The method of claim 1, wherein the compression rate of the weight data is calculated by:
calculating to obtain a compression ratio by combining the dimension local size value of the neural network, the bit width of the difference value, the bit width of the weight in the weight data and a first preset formula;
the first preset formula is as follows:
[Formula image: compression ratio expressed in terms of X, Y and the local size values Npart, Mpart, K2part and K1part]
wherein Npart, Mpart, K2part and K1part represent the local size values in the four weight dimensions, Y represents the bit width of the difference, and X represents the bit width of the weights in the weight data input at first.
3. The method of claim 1, further comprising a step of designing an on-chip storage module, specifically:
determining a mean value address generator and a mean value storage module according to the local mean value data;
determining a difference address generator and a difference storage module according to the difference data;
and determining a configurable shifter according to the bit width of the weight value and the bit width of the difference value, wherein the shifter is used for restoring the compressed bit width to the uncompressed bit width.
4. A system for implementing neural network compression, comprising:
the data acquisition module is used for acquiring weight data of the neural network;
the weight training module is used for training the weight data by adopting the maximum variance as a constraint condition to obtain new weight data;
the precision detection module is used for detecting whether the recognition precision of the neural network based on the new weight data meets the preset requirement or not, and if so, jumping to the form conversion module; otherwise, returning to the weight value training module;
the form conversion module is used for converting the representation form of the new weight data according to a preset mode and realizing the compression of the weight data;
the compression ratio detection module is used for detecting whether the compression ratio of the weight data meets the preset requirement or not, and if so, completing the compression step; otherwise, returning to the weight value training module;
the weight data is data quantized by fixed point data, and the form conversion module is specifically configured to:
converting the new weight data into a representation form of local mean data plus difference data;
wherein, a local mean value corresponds to a plurality of weight values, a difference value corresponds to a weight value, and the maximum variance is applied to the training process, so that the original local unordered weight value is limited in a range, the weight value can be modified from the original fixed-point representation mode of X bits into the representation mode of the local mean value plus the difference value, wherein the local mean value is represented by X-bit data, and the difference value is represented by Y bits, the original weight value can be decomposed into two parts, one part is the partial weight value of the original size represented by the difference value, and the other part is the local mean weight value which needs to be indexed through address conversion;
the maximum variance is calculated by the following formula:
[Formula image: the maximum local variance S² expressed in terms of Num and Y]
wherein, Num represents the number of local internal weights, and Y represents the bit width of the difference.
5. The system of claim 4, further comprising an on-chip storage design module comprising:
the local mean value addressing unit is used for determining a mean value address generator and a mean value storage module according to the local mean value data;
the searching addressing unit is used for determining a difference value address generator and a difference value storage module according to the difference value data;
and the shifter is used for recovering the compressed bit width into the uncompressed bit width.
6. An apparatus for implementing neural network compression, comprising:
at least one processor;
at least one memory for storing at least one program;
the at least one program, when executed by the at least one processor, causes the at least one processor to implement a method of implementing neural network compression as claimed in any one of claims 1 to 3.
7. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method of any one of claims 1-3.
CN202010039749.2A 2020-01-15 2020-01-15 Method, system, device and storage medium for realizing neural network compression Active CN111275184B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010039749.2A CN111275184B (en) 2020-01-15 2020-01-15 Method, system, device and storage medium for realizing neural network compression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010039749.2A CN111275184B (en) 2020-01-15 2020-01-15 Method, system, device and storage medium for realizing neural network compression

Publications (2)

Publication Number Publication Date
CN111275184A CN111275184A (en) 2020-06-12
CN111275184B true CN111275184B (en) 2022-05-03

Family

ID=70998717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010039749.2A Active CN111275184B (en) 2020-01-15 2020-01-15 Method, system, device and storage medium for realizing neural network compression

Country Status (1)

Country Link
CN (1) CN111275184B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644254A (en) * 2017-09-09 2018-01-30 复旦大学 Convolutional neural network weight parameter quantization training method and system
CN109840589A (en) * 2019-01-25 2019-06-04 深兰人工智能芯片研究院(江苏)有限公司 Method, apparatus and system for running convolutional neural networks on an FPGA
CN109993293A (en) * 2019-02-28 2019-07-09 中山大学 Deep learning accelerator suitable for stacked hourglass networks
CN110555510A (en) * 2018-05-31 2019-12-10 耐能智慧股份有限公司 Method for compressing pre-trained deep neural network model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11568237B2 (en) * 2018-05-10 2023-01-31 Samsung Electronics Co., Ltd. Electronic apparatus for compressing recurrent neural network and method thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107644254A (en) * 2017-09-09 2018-01-30 复旦大学 Convolutional neural network weight parameter quantization training method and system
CN110555510A (en) * 2018-05-31 2019-12-10 耐能智慧股份有限公司 Method for compressing pre-trained deep neural network model
CN109840589A (en) * 2019-01-25 2019-06-04 深兰人工智能芯片研究院(江苏)有限公司 Method, apparatus and system for running convolutional neural networks on an FPGA
CN109993293A (en) * 2019-02-28 2019-07-09 中山大学 Deep learning accelerator suitable for stacked hourglass networks

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Adaptive Weight Compression for Memory-Efficient Neural Networks; Jong Hwan Ko et al.; https://dl.acm.org/doi/pdf/10.5555/3130379.3130424; 20171231; pp. 199-204 *
Cross-Entropy Pruning for Compressing Convolutional Neural Networks; Rongxin Bao et al.; Neural Computation; 20181231; Vol. 30, No. 11; pp. 3128-3149 *
DeepIoT: Compressing Deep Neural Network Structures for Sensing Systems with a Compressor-Critic Framework; Shuochao Yao et al.; https://arxiv.org/pdf/1706.01215.pdf; 20171122; pp. 1-14 *
A Survey of Deep Neural Network Compression and Acceleration; 纪荣嵘 et al.; Journal of Computer Research and Development; 20181231; Vol. 55, No. 9; pp. 1871-1888 *
A neural network compression method for sentiment analysis fusing group-sparse and exclusive-sparse regularization terms; 黄磊 et al.; Journal of Beijing University of Chemical Technology (Natural Science Edition); 20191231; Vol. 46, No. 21; pp. 103-112 *

Also Published As

Publication number Publication date
CN111275184A (en) 2020-06-12

Similar Documents

Publication Publication Date Title
US20230237332A1 (en) Learning compressible features
WO2023045204A1 (en) Method and system for generating finite state entropy coding table, medium, and device
US20180046895A1 (en) Device and method for implementing a sparse neural network
EP1891545B1 (en) Compressing language models with golomb coding
CN111832719A (en) Fixed point quantization convolution neural network accelerator calculation circuit
CN108416427A (en) Convolution kernel accumulates data flow, compressed encoding and deep learning algorithm
CN112702600B (en) Image coding and decoding neural network layered fixed-point method
CN114647399B (en) Low-energy-consumption high-precision approximate parallel fixed-width multiplication accumulation device
CN111507465A (en) Configurable convolutional neural network processor circuit
WO2019080670A1 (en) Gene sequencing data compression method and decompression method, system, and computer readable medium
CN114222129A (en) Image compression encoding method, image compression encoding device, computer equipment and storage medium
CN113741858A (en) In-memory multiply-add calculation method, device, chip and calculation equipment
CN112580805A (en) Method and device for quantizing neural network model
CN114328898A (en) Text abstract generating method and device, equipment, medium and product thereof
Barbarioli et al. Hierarchical residual encoding for multiresolution time series compression
TW202013261A (en) Arithmetic framework system and method for operating floating-to-fixed arithmetic framework
Fuketa et al. Image-classifier deep convolutional neural network training by 9-bit dedicated hardware to realize validation accuracy and energy efficiency superior to the half precision floating point format
CN113902097A (en) Run-length coding accelerator and method for sparse CNN neural network model
CN111275184B (en) Method, system, device and storage medium for realizing neural network compression
Park et al. GRLC: Grid-based run-length compression for energy-efficient CNN accelerator
US20230325374A1 (en) Generation method and index condensation method of embedding table
KR102502162B1 (en) Apparatus and method for compressing feature map
Ascia et al. Improving inference latency and energy of network-on-chip based convolutional neural networks through weights compression
Shaila et al. Block encoding of color histogram for content based image retrieval applications
CN112734021A (en) Neural network acceleration method based on bit sparse calculation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant