CN114970853A - Cross-range quantization convolutional neural network compression method - Google Patents


Info

Publication number
CN114970853A
CN114970853A (application CN202210260332.8A)
Authority
CN
China
Prior art keywords: quantization, neural network, convolutional neural, range, weight
Prior art date
Legal status
Pending
Application number
CN202210260332.8A
Other languages
Chinese (zh)
Inventor
邢晓芬
杨弈才
郭锴凌
徐向民
Current Assignee
South China University of Technology SCUT
Zhongshan Institute of Modern Industrial Technology of South China University of Technology
Original Assignee
South China University of Technology SCUT
Zhongshan Institute of Modern Industrial Technology of South China University of Technology
Priority date
Filing date
Publication date
Application filed by South China University of Technology SCUT, Zhongshan Institute of Modern Industrial Technology of South China University of Technology filed Critical South China University of Technology SCUT
Priority to CN202210260332.8A priority Critical patent/CN114970853A/en
Publication of CN114970853A publication Critical patent/CN114970853A/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N3/084 Backpropagation, e.g. using gradient descent


Abstract

The invention discloses a cross-range quantization convolutional neural network compression method, comprising the following steps: determining the quantization bits of the quantized convolutional neural network, the quantization range under those bits, and a step-based quantization function; training a full-precision convolutional neural network and using it to initialize the quantized convolutional neural network and the quantization step sizes; in the forward propagation stage, quantizing the weight parameters and activation values of the convolutional neural network, applying conventional quantization to values within the quantization threshold range and, for values outside the range, subtracting the quantization threshold before applying conventional quantization; in the back propagation stage, using a gradient approximation to make the non-differentiable quantization function differentiable. On the basis of conventional quantization, the invention applies a different quantization mode to values outside the quantization range, achieving compression and acceleration of the convolutional neural network while preserving image recognition accuracy.

Description

Cross-range quantization convolutional neural network compression method
Technical Field
The invention belongs to the field of image recognition, and relates to a cross-range quantized convolutional neural network compression method.
Background
In recent years, deep convolutional neural networks have developed rapidly and achieved excellent performance in tasks such as image recognition and object detection, and more and more intelligent devices deploy them. However, their deployment and application are limited by the large amounts of memory and computing resources they occupy. Researchers have therefore begun to study convolutional neural network compression, which lightens a bloated deep convolutional neural network while preserving its performance.
At present, four convolutional neural network compression methods are mainly used for image recognition tasks: low-rank decomposition, pruning, knowledge distillation, and quantization.
(1) Low rank decomposition
In general, a weight matrix contains much redundant information and can be treated as not being of full rank. It can therefore be decomposed into several smaller matrices with fewer parameters, lower rank, and simpler form, and the original matrix reconstructed through operations such as outer products of these small matrices, thereby reducing memory and accelerating computation. Low-rank decomposition methods generally aim to minimize the reconstruction error while preserving the performance of the convolutional neural network.
(2) Pruning
Pruning typically compresses a pre-trained convolutional neural network. The bigger and deeper a network is, the more likely it contains redundant and ineffective parameters; pruning removes them so they no longer participate in inference, thereby compressing the network. Pruning is generally divided into unstructured and structured pruning. Unstructured pruning sets small weights to zero; although model performance is largely preserved after pruning, the compression is difficult to realize in hardware. Structured pruning removes whole modules such as convolution kernels, which is friendly to hardware deployment, and is therefore the main research direction in the pruning field.
(3) Knowledge distillation
Knowledge distillation transfers the 'knowledge' of a deeper, more complex, or better-performing large convolutional neural network to a relatively simple small one to improve the latter's performance; the small network can then replace the large one in the actual deep learning task, saving memory and accelerating computation. Distillation generally takes two forms: making the final output of the small network imitate the final output of the large network, or making intermediate outputs of the small network imitate intermediate outputs of the large network. The two forms can be used independently or together.
(4) Quantization
Quantization generally refers to representing the weight parameters, and often also the activation values, of the convolutional neural network with lower-bit values, which greatly reduces memory and speeds up computation. By bit-width, quantization divides into binary quantization and multi-bit quantization: binary quantization has very high compression and computational efficiency but causes a substantial performance loss, while multi-bit quantization incurs less performance loss while still ensuring good compression and computational efficiency.
Current multi-bit quantization methods generally use learnable quantization functions. Some treat the floating-point quantization range of each convolutional or fully-connected layer as a learnable parameter: DSQ (Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In ICCV, 2019) explicitly makes the left and right thresholds of each convolutional layer's floating-point range learnable, while LSQ (Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. In ICLR, 2020) uses the quantization step as a learnable parameter, indirectly making the range of floating-point numbers involved in quantization learnable.
Quantization methods such as LSQ and DSQ truncate values that fall outside the quantization range; these values are few, but often important. Truncating them to the quantization threshold causes a certain loss of information, and the truncation operation itself is not differentiable; both effects hinder normal training of the low-bit quantized convolutional neural network and cause a certain loss of accuracy.
The lower the quantization bit-width, the more severely the classification accuracy of the corresponding quantized convolutional neural network degrades. By applying cross-range quantization to the values of a low-bit convolutional neural network that exceed the quantization range, the actual amount of computation increases, but since these values are few, classification accuracy can be effectively improved over the conventional quantization mode at the cost of only a small amount of extra computation.
The four methods above achieve convolutional neural network compression from different angles, but quantization usually offers the larger compression ratio and computational speedup. Most existing quantization methods adopt a truncating quantization function: values beyond the quantization threshold range are uniformly truncated to the threshold. Although these larger values make up a small proportion of all parameters, they are often important, and truncating them causes an information loss that affects network performance; moreover, the truncation operation is not differentiable, which further limits the training and updating of the quantized network's parameters. If the information carried by values outside the quantization threshold range can be preserved, the performance of the quantized convolutional neural network can be effectively improved.
Disclosure of Invention
Addressing the defect that conventional quantization methods uniformly apply a simple truncation to values outside the quantization threshold, the invention provides a cross-range quantization convolutional neural network compression method. In image recognition tasks, it adds only a small amount of computation on top of a conventional quantization method and achieves higher classification accuracy than that method.
The invention is realized by at least one of the following technical schemes.
A convolution neural network compression method of cross-range quantization comprises the following steps:
preprocessing an original image to obtain a preprocessed image;
carrying out cross-range quantization and training on the weight and the activation value of the convolutional neural network to construct a low bit quantization convolutional neural network;
and performing image recognition on the preprocessed image by using the quantized convolutional neural network.
Further, the weight quantization process of the convolutional neural network comprises:
in the initialization stage of the quantization convolutional neural network, initializing the weight parameters of the quantization convolutional neural network by using the weight parameters of the full-precision convolutional neural network, simultaneously calculating the statistical information of the weight and the activation value of the full-precision convolutional neural network, and initializing the quantization step length of the weight and the activation value of the quantization convolutional neural network by using the statistical information and the set quantization bits;
secondly, in a forward propagation stage of the training process, performing cross-range quantization on the weight and the activation value;
and thirdly, in the back propagation stage of the training process, deriving and updating the quantized convolutional neural network parameters according to the cross entropy loss function.
Further, the initialization of the quantization step s_W of each layer's weight W of the quantized convolutional neural network is computed jointly from the set quantization bits and the distribution of the full-precision convolutional neural network's weight parameters.
Further, samples are classified with the full-precision model while the distribution of each layer's activation values is recorded; the quantization step s_A of each layer's activation A of the quantized convolutional neural network is computed jointly from the set quantization bits and the distribution of the full-precision network's activation values.
Further, for the set quantization bits, the quantization range thresholds Q_W and Q_A of each layer's weight W and activation value A of the quantized convolutional neural network are fixed values.
Further, the quantization function includes a Round operation: for an input floating-point number x, x is rounded to the nearest integer, converting the original floating-point number into a low-bit integer.
Further, for the original floating point number input, dividing the original floating point number input by a quantization step length for scaling, and if the scaled value is within the range of a quantization threshold value, obtaining quantized output by using Round operation; if the scaled value exceeds the range of the quantization threshold, the quantization threshold is subtracted first, then the value after the quantization threshold is subtracted is subjected to Round operation to obtain quantization output, and if the quantization output at the moment still exceeds the quantization threshold, the quantization threshold is cut off; when convolution calculation is performed, the value outside the quantization threshold range is represented as the quantization threshold plus the quantization value minus the quantization threshold.
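The cross-range rule just described can be sketched as a scalar function (an illustrative implementation, not the patent's reference code; the function name is mine, and note that Python's built-in round uses half-to-even rounding, whereas the patent's Round is ordinary rounding):

```python
def cross_range_quantize(x, q_neg, q_pos):
    """Quantize a value x (already divided by the quantization step).

    In-range values are simply rounded. A value beyond a threshold has
    that threshold subtracted, the remainder is rounded, and the result
    is clipped so the output never exceeds twice the threshold.
    """
    if q_neg <= x <= q_pos:
        return round(x)                 # conventional quantization
    if x > q_pos:
        # cross-range on the positive side: q_pos + Round(x - q_pos), clipped
        extra = min(round(x - q_pos), q_pos)
        return q_pos + extra
    # cross-range on the negative side, symmetric handling
    extra = max(round(x - q_neg), q_neg)
    return q_neg + extra
```

For a two-bit weight range [-2, 1], a scaled value of 2.7 maps to 1 + Round(1.7) = 2, whereas conventional truncation would have produced 1; the extra information beyond the threshold is thus partially preserved.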
Further, the Round operation in the quantization function is not differentiable; in the back propagation stage of the training process the quantization function is made differentiable through a gradient approximation.
Further, in the back propagation stage, the gradient of the quantization step parameter is scaled down to ensure that training of the quantized convolutional neural network converges.
Further, the reduction coefficient of the weight quantization step gradient of each convolution layer is related to the set quantization bits and the layer weight parameter number; the reduction coefficient of the activation value quantization step gradient of each convolution layer is related to the set quantization bits and the number of activation value parameters.
Compared with the prior art, the invention has the following beneficial effects:
the method can compress and accelerate the conventional convolutional neural network, realizes the light weight of the convolutional neural network, and promotes the application of an image recognition algorithm to light-weight equipment.
Drawings
FIG. 1 is a diagram illustrating an implementation of a cross-range quantization convolutional neural network compression method and system thereof according to the present invention;
FIG. 2 is a diagram illustrating an implementation of cross-range weight two-bit quantization according to the present invention;
FIG. 3 illustrates a two-bit quantization process for cross-range activation values according to the present invention;
FIG. 4 is a flow chart of image recognition according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments, but the embodiments of the present invention are not limited thereto.
The principle of the invention is as follows: quantizing the weights and activation values of a convolutional neural network greatly reduces its memory footprint and computation. Most existing low-bit quantization methods truncate values that exceed the quantization threshold, i.e., a value beyond the threshold Q is truncated to Q; this loses the information carried by those values, and the truncation operation is not differentiable, which further limits training updates of the network. These two effects show that the truncation of conventional quantization hurts the performance of the quantized convolutional neural network. The invention provides a cross-range quantization compression method and system that preserve the information of values exceeding the quantization range threshold Q; in theory, compared with the original quantization method, a larger performance improvement can be obtained by adding only a small amount of computation.
Example 1
As shown in fig. 1 and 4, a convolutional neural network compression method with cross-range quantization includes the following steps:
S1, preprocessing the original image with operations such as zero padding, random cropping, random flipping, and normalization to obtain a preprocessed image.
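A minimal numpy sketch of this preprocessing step follows. The pad and crop sizes and the per-image normalization are illustrative (CIFAR-style defaults); the patent does not fix exact values, and the function name is mine:

```python
import numpy as np

def preprocess(img, pad=4, crop=32, rng=None):
    """Zero-pad, randomly crop, randomly flip, and normalize one HWC image."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, c = img.shape
    # zero padding on all sides
    padded = np.zeros((h + 2 * pad, w + 2 * pad, c), dtype=np.float32)
    padded[pad:pad + h, pad:pad + w] = img
    # random crop back to the target size
    top = int(rng.integers(0, padded.shape[0] - crop + 1))
    left = int(rng.integers(0, padded.shape[1] - crop + 1))
    out = padded[top:top + crop, left:left + crop]
    # random horizontal flip
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # scale to [0, 1] and normalize per image
    out = out / 255.0
    return (out - out.mean()) / (out.std() + 1e-8)
```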
S2, quantizing and training the weight and the activation value of the convolutional neural network, and constructing a low-bit quantized convolutional neural network;
specifically, the method comprises the following steps:
firstly, training a full-precision convolutional neural network as a pre-training model (including but not limited to common convolutional neural networks such as ResNet and MobileNet) by using the preprocessed image;
and secondly, setting quantization bits, and calculating the quantization threshold of the weight and the activation value.
The quantization bit-width is set to b bits, and the basic structure of the convolutional neural network is: convolution layer → batch normalization layer → ReLU activation function. Because the ReLU function truncates values below zero to zero, the quantization thresholds for the weights and the activation values differ. Let Q_W^l and Q_W^r be the left and right boundaries of the weight quantization range; then

Q_W^l = -2^(b-1),  Q_W^r = 2^(b-1) - 1.

Let Q_A^l and Q_A^r be the left and right boundaries of the activation quantization range; then

Q_A^l = 0,  Q_A^r = 2^b - 1.
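The threshold computation for a given bit-width can be sketched as follows, using the conventional signed range for weights and the non-negative range for post-ReLU activations (the function name is illustrative):

```python
def quant_thresholds(b):
    """Quantization range boundaries for b-bit quantization.

    Signed weights use a symmetric low-bit integer range; activations
    after ReLU are non-negative, so their range starts at zero.
    """
    qw_l, qw_r = -(2 ** (b - 1)), 2 ** (b - 1) - 1   # weight range
    qa_l, qa_r = 0, 2 ** b - 1                        # activation range
    return (qw_l, qw_r), (qa_l, qa_r)
```

For two-bit quantization this gives a weight range of [-2, 1] and an activation range of [0, 3].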
Thirdly, initializing the weight parameters of the quantized convolutional neural network with the weight parameters of the full-precision convolutional neural network;
fourthly, calculating the statistical information of the weight and the activation value of the full-precision convolutional neural network, and then initializing the quantization step length of the weight and the activation value of the quantization convolutional neural network by using the statistical information and preset quantization bits.
Let the weight of the current convolutional layer be a matrix W ∈ R^(K×Cd²), i.e. K rows and Cd² columns, where C is the number of input channels, K is the number of convolution kernels, and d is the convolution kernel size. The quantization step s_W of the current layer's weight is initialized as:

s_W = 2‖W‖₁ / (KCd² · √(Q_W^r)),

where ‖W‖₁ is the L1 norm of the weight W, so ‖W‖₁/(KCd²) is the mean absolute weight.
A certain number of samples (e.g., 128) are classified with the full-precision convolutional neural network and the activation values of each layer are recorded during inference; the quantization step s_A of each layer's activation A of the quantized convolutional neural network is determined by the preset quantization bits and the mean of the recorded activations' L1 norm. The activation quantization step of the current layer is thus initialized from the statistics of the pre-trained full-precision network's activations and the activation quantization threshold.
Let the input activation of the current layer be a matrix A ∈ R^(B×CHW), i.e. B rows and CHW columns, where B is the number of input samples (e.g., 128), C is the number of input channels, and H and W are the height and width of the input feature map. The quantization step s_A of the current layer's activation is initialized as:

s_A = 2‖A‖₁ / (BCHW · √(Q_A^r)),

where ‖A‖₁ is the L1 norm of the activation A.
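Both initializations share the same form: twice the mean absolute value divided by the square root of the positive range boundary (the LSQ-style convention). A sketch, with an illustrative function name:

```python
import numpy as np

def init_step(x, q_r):
    """Step-size initialization: s = 2 * mean(|x|) / sqrt(q_r).

    x is the full-precision weight or activation tensor, q_r the positive
    boundary of its quantization range.
    """
    return 2.0 * np.abs(x).mean() / np.sqrt(q_r)
```

Calling it with the full-precision weight tensor and Q_W^r yields s_W; calling it with recorded activations and Q_A^r yields s_A.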
Sixthly, in the forward propagation stage, the weight W is uniformly quantized based on a learnable step size, where w_i denotes the i-th element of W, i.e. a floating-point number. The floating-point number w_i is divided by the weight quantization step s_W and converted to an integer with the Round function; a value exceeding the quantization threshold undergoes cross-range quantization, i.e. the threshold is first subtracted, the remainder is rounded, and if the result still exceeds the threshold it is truncated to the threshold. Let w̄_i be the quantized integer and ŵ_i the final quantized output; as shown in FIG. 2, the specific formulas are:

w̄_i = Round(w_i/s_W),                                 if Q_W^l ≤ w_i/s_W ≤ Q_W^r
w̄_i = Q_W^r + min(Round(w_i/s_W − Q_W^r), Q_W^r),     if w_i/s_W > Q_W^r
w̄_i = Q_W^l + max(Round(w_i/s_W − Q_W^l), Q_W^l),     if w_i/s_W < Q_W^l

ŵ_i = w̄_i · s_W.
Seventhly, in the forward propagation stage, the activation value A is uniformly quantized based on a learnable step size, where a_i denotes the i-th element of A. The floating-point number a_i is divided by the activation quantization step s_A and converted to an integer with the Round operation; a value exceeding the quantization threshold undergoes cross-range quantization in the same way as the weights. Let ā_i be the quantized integer and â_i the final quantized output; as shown in FIG. 3, the specific formulas are:

ā_i = Round(a_i/s_A),                                 if Q_A^l ≤ a_i/s_A ≤ Q_A^r
ā_i = Q_A^r + min(Round(a_i/s_A − Q_A^r), Q_A^r),     if a_i/s_A > Q_A^r

â_i = ā_i · s_A.
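The forward quantize-dequantize step can be vectorized over a whole tensor. A numpy sketch (my own illustration; np.rint rounds half to even, whereas the patent's Round is ordinary rounding):

```python
import numpy as np

def fake_quantize(x, s, q_l, q_r):
    """Scale by the step s, round with the cross-range rule, and rescale."""
    y = np.asarray(x, dtype=float) / s
    q = np.rint(y)                                   # conventional quantization
    # cross-range handling above the upper threshold
    q = np.where(y > q_r, q_r + np.minimum(np.rint(y - q_r), q_r), q)
    # cross-range handling below the lower threshold (only relevant for weights)
    q = np.where(y < q_l, q_l + np.maximum(np.rint(y - q_l), q_l), q)
    return q * s
```

With two-bit activation thresholds (0, 3) and step 1.0, an input of 7.2 quantizes to 3 + min(Round(4.2), 3) = 6 rather than being truncated to 3.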
Eighthly, in the back propagation stage, the gradient of the quantization step parameter is scaled down to ensure that training of the quantized convolutional neural network converges. The Round operation in the quantization function is not differentiable; its gradient is approximated based on the straight-through estimator (STE). Let the input be x; the specific formula is:

∂Round(x)/∂x ≈ 1.
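The straight-through estimator amounts to rounding in the forward pass while treating Round as the identity in the backward pass. A framework-free sketch (function names are illustrative):

```python
def ste_round_forward(x):
    # forward pass: ordinary rounding of the scaled input
    # (Python's round uses half-to-even rounding)
    return round(x)

def ste_round_backward(grad_output):
    # backward pass: d Round(x)/dx is approximated as 1,
    # so the upstream gradient passes through unchanged
    return grad_output * 1.0
```

In a framework such as PyTorch the same effect is typically obtained with a custom autograd function whose backward returns the incoming gradient unchanged.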
Ninthly, in the differentiation process, because all weights of a convolutional layer share one quantization step and the number of weight parameters is several orders of magnitude larger than the number of quantization steps, the chain rule makes the gradient magnitude of the quantization step several orders of magnitude larger than that of a single weight parameter w_i; the same holds for the activation values. Such an extreme imbalance in gradient magnitudes is detrimental to the training and convergence of the quantized convolutional neural network, so the gradient of the quantization step must be scaled down. The scaling coefficient of the quantization step gradient is related to the preset quantization bits and the number of weight parameters or activation values in the layer.
For a convolutional layer's weight matrix W ∈ R^(K×Cd²), where C is the number of input channels, K is the number of convolution kernels, and d is the convolution kernel size, the corresponding quantization step gradient scaling coefficient g_W is:

g_W = 1 / √(KCd² · Q_W^r).
For a convolutional layer's input activation matrix A ∈ R^(B×CHW), where B is the number of input samples (e.g., 128), C is the number of input channels, and H and W are the feature map dimensions, the corresponding quantization step gradient scaling coefficient g_A is independent of the number of input samples:

g_A = 1 / √(CHW · Q_A^r).
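The two scaling coefficients can be computed directly from the layer shape and bit-width; a sketch with illustrative parameter names:

```python
import math

def grad_scale_coeffs(b, k, c, d, h, w):
    """Step-gradient scaling coefficients: 1/sqrt(num_params * positive threshold).

    b: quantization bits; k: convolution kernels; c: input channels;
    d: kernel size; h, w: feature map height and width.
    """
    qw_r = 2 ** (b - 1) - 1          # positive weight threshold
    qa_r = 2 ** b - 1                # positive activation threshold
    g_w = 1.0 / math.sqrt(k * c * d * d * qw_r)
    g_a = 1.0 / math.sqrt(c * h * w * qa_r)   # independent of batch size
    return g_w, g_a
```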
Tenthly, the quantized convolutional neural network is updated using cross entropy as the loss function and stochastic gradient descent as the optimizer.
As a preferred embodiment, the parameters of the first convolutional layer, the last fully-connected layer, and the batch normalization layers of the quantized convolutional neural network are not quantized; the original input to the quantized convolutional neural network is also not quantized.
As a preferred embodiment, the quantization function includes a Round operation: for an input floating-point number x, x is rounded to the nearest integer. If x is within the range of the quantization threshold Q, conventional quantization is applied to x, giving quantized output Round(x); if x exceeds the range of Q, the threshold Q is first subtracted and the remainder is conventionally quantized, giving quantized output Q + Round(x − Q).
And S3, performing image recognition on the preprocessed image by using the quantized convolutional neural network.
The following experiments demonstrate the method of the present invention on the convolutional neural networks ResNet20 and ResNet32.
Example 2
This example was conducted with the convolutional neural network ResNet20 on the public dataset CIFAR-10, applying two-bit and three-bit quantization to ResNet20. The experimental comparison results are shown in Table 1; LSQ is from reference 1 (Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. International Conference on Learning Representations (ICLR), 2020).
In the table, all methods use the same training setup. All quantized convolutional neural networks are initialized from the same full-precision ResNet20 and trained with a stochastic gradient descent optimizer with momentum 0.9, an initial learning rate lr = 0.01, 400 training epochs, a cosine learning rate schedule that finally decays to zero, weight decay weight_decay = 1e-4, and a sample batch size of 128.
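The cosine schedule in this setup decays the learning rate from 0.01 to zero over the training run; a sketch of the standard form (the function name is illustrative):

```python
import math

def cosine_lr(step, total_steps, lr0=0.01):
    """Cosine learning-rate schedule decaying from lr0 to zero."""
    return 0.5 * lr0 * (1.0 + math.cos(math.pi * step / total_steps))
```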
TABLE 1 comparison results with LSQ two bit quantization
TABLE 2 comparison results with LSQ three bit quantization
From Tables 1 and 2, the cross-range quantization method of the present invention effectively improves the accuracy of the ResNet20 convolutional neural network, with an improvement of more than 1% under two-bit quantization. Cross-range quantizing the activation values alone already improves quantization accuracy effectively, and cross-range quantizing both the weights and the activation values improves it more markedly. As the rightmost columns of Tables 1 and 2 show, the small amount of computation added by cross-range quantization yields a larger accuracy gain than directly increasing the quantization bits.
Example 3
This example conducts an experiment with the convolutional neural network ResNet32 on the public dataset CIFAR-100, applying two-bit quantization to ResNet32. The results are shown in Table 3. The parameter settings are the same as in Example 2.
TABLE 3 comparison results with LSQ two bit quantization
From Table 3, the cross-range quantization method of the present invention improves the accuracy of the ResNet32 convolutional neural network by 1.54% under two-bit quantization; the effect is significant.
The above embodiments are preferred embodiments of the present invention, but the present invention is not limited to the above embodiments, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be construed as equivalents thereof, and all such changes, modifications, substitutions, combinations, and simplifications are intended to be included in the scope of the present invention.

Claims (10)

1. A cross-range quantization convolutional neural network compression method, characterized by comprising the following steps:
preprocessing an original image to obtain a preprocessed image;
carrying out cross-range quantization and training on the weights and activation values of the convolutional neural network to construct a low-bit quantized convolutional neural network;
and performing image recognition on the preprocessed image by using the quantized convolutional neural network.
2. The convolutional neural network compression method for quantization across the range of claim 1, wherein the weight quantization process of the convolutional neural network comprises:
firstly, in the initialization stage of the quantized convolutional neural network, initializing the weight parameters of the quantized convolutional neural network with the weight parameters of the full-precision convolutional neural network, computing statistics of the weights and activation values of the full-precision convolutional neural network, and initializing the quantization step sizes of the weights and activation values of the quantized convolutional neural network from these statistics and the set number of quantization bits;
secondly, performing cross-range quantization on the weights and activation values in the forward propagation stage of the training process;
and thirdly, in the back propagation stage of the training process, differentiating the cross-entropy loss function and updating the parameters of the quantized convolutional neural network.
3. The cross-range quantization convolutional neural network compression method according to claim 2, wherein the initialization of the quantization step size s_W of the weight W of each layer of the quantized convolutional neural network is jointly computed from the set number of quantization bits and the distribution of the weight parameters of the full-precision convolutional neural network.
4. The cross-range quantization convolutional neural network compression method according to claim 2, wherein samples are classified and identified with the full-precision model and the distribution information of each layer's activation values is recorded; the quantization step size s_A of the activation values A of each layer of the quantized convolutional neural network is jointly computed from the set number of quantization bits and the distribution of the activation values of the full-precision convolutional neural network.
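Claims 3 and 4 state only that the step sizes are computed jointly from the bit width and the parameter distribution, without giving a formula. A minimal sketch, assuming the widely used LSQ-style rule s = 2·mean(|v|)/√Q_P (the function name and the concrete formula are assumptions, not taken from the patent):

```python
import math

def init_step_size(values, num_bits):
    """LSQ-style quantization step-size initialization (sketch).

    Computes the step from the set bit width and the distribution
    of full-precision values, here via their mean absolute value.
    """
    q_pos = 2 ** (num_bits - 1) - 1          # positive quantization threshold
    mean_abs = sum(abs(v) for v in values) / len(values)
    return 2.0 * mean_abs / math.sqrt(q_pos)
```

The same routine would be run once per layer, over the weights for s_W and over recorded activation statistics for s_A.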
5. The cross-range quantization convolutional neural network compression method according to claim 2, wherein, for the set number of quantization bits, the quantization range thresholds Q_W and Q_A of the weight W and the activation value A of each layer of the quantized convolutional neural network are both fixed values.
6. The cross-range quantization convolutional neural network compression method according to claim 2, wherein the quantization function comprises a Round operation, that is, an input floating-point number x is rounded to the corresponding integer value; the Round operation converts the original floating-point number into a low-bit integer.
7. The cross-range quantization convolutional neural network compression method according to claim 6, wherein an original floating-point input is scaled by dividing it by the quantization step size; if the scaled value is within the quantization threshold range, a Round operation is applied to obtain the quantized output; if the scaled value exceeds the quantization threshold range, the quantization threshold is first subtracted, the Round operation is then applied to the remainder to obtain the quantized output, and if this output still exceeds the quantization threshold, it is clipped to the quantization threshold; during convolution computation, a value outside the quantization threshold range is represented as the quantization threshold plus the quantized value of the input minus the quantization threshold.
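The quantization rule of claim 7 can be sketched for a single scalar as follows (an illustrative sketch, assuming symmetric treatment of negative values; the original operates on whole weight and activation tensors):

```python
def cross_range_quantize(x, step, q_thr):
    # scale the floating-point input by the quantization step
    y = x / step
    sign = 1.0 if y >= 0 else -1.0
    m = abs(y)
    if m <= q_thr:
        # within the quantization threshold range: plain rounding
        return sign * round(m)
    # cross-range branch: subtract the threshold, round the
    # remainder, and clip the remainder back to the threshold
    r = min(round(m - q_thr), q_thr)
    # the out-of-range value is represented as the threshold
    # plus the quantized remainder
    return sign * (q_thr + r)
```

With step = 1.0 and q_thr = 2, an input of 2.6 lands in the cross-range branch and is represented as 2 + round(0.6) = 3, so the representable range is effectively doubled at the cost of one extra subtraction and comparison.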
8. The cross-range quantization convolutional neural network compression method according to claim 6, wherein the Round operation in the quantization function is non-differentiable, and a gradient approximation is applied to the quantization function in the back propagation stage of the training process to make it differentiable.
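A minimal sketch of the gradient approximation in claim 8, assuming the standard straight-through estimator (the clipping bound 2·q_thr for the extended cross-range interval is an assumption):

```python
def round_ste_grad(x, step, q_thr, grad_out):
    # straight-through estimator: treat Round as the identity in
    # the backward pass, so the incoming gradient passes through
    # unchanged wherever the scaled input was not clipped
    y = abs(x / step)
    # zero gradient where the value was clipped (the cross-range
    # scheme extends the representable range to 2 * q_thr)
    return grad_out if y <= 2 * q_thr else 0.0
```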
9. The cross-range quantization convolutional neural network compression method according to claim 6, wherein, in the back propagation stage, the gradient of the quantization step-size parameter is scaled down to ensure convergence of the training of the quantized convolutional neural network.
10. The cross-range quantization convolutional neural network compression method according to claim 8, wherein the reduction coefficient of the weight quantization step-size gradient of each convolutional layer is related to the set number of quantization bits and the number of weight parameters, and the reduction coefficient of the activation-value quantization step-size gradient of each convolutional layer is related to the set number of quantization bits and the number of activation-value parameters.
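Claims 9 and 10 state only that the reduction coefficient depends on the bit width and the parameter count. A minimal sketch assuming the LSQ convention g = 1/√(N·Q_P) (the function name and the concrete formula are assumptions):

```python
import math

def step_grad_scale(num_params, num_bits):
    # reduction coefficient for the step-size gradient: larger
    # layers and wider quantization ranges get a smaller
    # step-size learning signal, stabilizing training
    q_pos = 2 ** (num_bits - 1) - 1   # positive quantization threshold
    return 1.0 / math.sqrt(num_params * q_pos)
```

The coefficient would be computed separately per convolutional layer, once with the layer's weight count for s_W and once with its activation count for s_A.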
CN202210260332.8A 2022-03-16 2022-03-16 Cross-range quantization convolutional neural network compression method Pending CN114970853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210260332.8A CN114970853A (en) 2022-03-16 2022-03-16 Cross-range quantization convolutional neural network compression method


Publications (1)

Publication Number Publication Date
CN114970853A true CN114970853A (en) 2022-08-30

Family

ID=82975543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210260332.8A Pending CN114970853A (en) 2022-03-16 2022-03-16 Cross-range quantization convolutional neural network compression method

Country Status (1)

Country Link
CN (1) CN114970853A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116451770A (en) * 2023-05-19 2023-07-18 北京百度网讯科技有限公司 Compression method, training method, processing method and device of neural network model
CN116451770B (en) * 2023-05-19 2024-03-01 北京百度网讯科技有限公司 Compression method, training method, processing method and device of neural network model
CN116721399A (en) * 2023-07-26 2023-09-08 之江实验室 Point cloud target detection method and device for quantitative perception training
CN116721399B (en) * 2023-07-26 2023-11-14 之江实验室 Point cloud target detection method and device for quantitative perception training
CN117095271A (en) * 2023-10-20 2023-11-21 第六镜视觉科技(西安)有限公司 Target identification method, device, electronic equipment and storage medium
CN117095271B (en) * 2023-10-20 2023-12-29 第六镜视觉科技(西安)有限公司 Target identification method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
He et al. Asymptotic soft filter pruning for deep convolutional neural networks
CN110378468B (en) Neural network accelerator based on structured pruning and low bit quantization
Zhuang et al. Discrimination-aware channel pruning for deep neural networks
CN111247537B (en) Method and system for effectively storing sparse neural network and sparse convolutional neural network
CN110880038B (en) System for accelerating convolution calculation based on FPGA and convolution neural network
Sung et al. Resiliency of deep neural networks under quantization
CN114970853A (en) Cross-range quantization convolutional neural network compression method
CN108510067B (en) Convolutional neural network quantification method based on engineering realization
CN110222821B (en) Weight distribution-based convolutional neural network low bit width quantization method
CN108491926B (en) Low-bit efficient depth convolution neural network hardware accelerated design method, module and system based on logarithmic quantization
CN111079781B (en) Lightweight convolutional neural network image recognition method based on low rank and sparse decomposition
CN107516129A (en) The depth Web compression method decomposed based on the adaptive Tucker of dimension
CN108304928A (en) Compression method based on the deep neural network for improving cluster
CN111147862B (en) End-to-end image compression method based on target coding
CN109635935A (en) Depth convolutional neural networks model adaptation quantization method based on the long cluster of mould
CN110020721B (en) Target detection deep learning network optimization method based on parameter compression
CN112329910A (en) Deep convolutional neural network compression method for structure pruning combined quantization
CN113269312B (en) Model compression method and system combining quantization and pruning search
CN112884149B (en) Random sensitivity ST-SM-based deep neural network pruning method and system
CN114677548A (en) Neural network image classification system and method based on resistive random access memory
Jiang et al. A low-latency LSTM accelerator using balanced sparsity based on FPGA
Ma et al. A survey of sparse-learning methods for deep neural networks
CN112686384A (en) Bit-width-adaptive neural network quantization method and device
CN116976428A (en) Model training method, device, equipment and storage medium
Hossain et al. Computational Complexity Reduction Techniques for Deep Neural Networks: A Survey

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination