CN111652366A - Combined neural network model compression method based on channel pruning and quantitative training - Google Patents

Combined neural network model compression method based on channel pruning and quantitative training

Info

Publication number
CN111652366A
Authority
CN
China
Prior art keywords
layer
pruning
model
quantization
training
Prior art date
Legal status
Pending
Application number
CN202010388100.1A
Other languages
Chinese (zh)
Inventor
徐磊
何林
苏华友
刘小龙
罗荣
张海涛
李君宝
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN202010388100.1A
Publication of CN111652366A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a combined neural network model compression method based on channel pruning and quantization training. Step 1: sparse training of the model; step 2: pruning the trained model; step 3: fine-tuning the model; step 4: after pruning is finished, quantizing the model and constructing a conventional floating-point computation graph; step 5: inserting pseudo-quantization modules at the positions in the computation graph corresponding to the convolution calculations, namely two pseudo-quantization modules at the convolution weights and at the activation values, quantizing the weights and activations to 8-bit integers; step 6: dynamically quantizing and training the model until convergence; step 7: quantized inference; step 8: finally obtaining the pruned and quantized model. While preserving the model accuracy, the invention greatly reduces the time and space consumption of the model by means of the two techniques of pruning and quantization.

Description

Combined neural network model compression method based on channel pruning and quantitative training
Technical Field
The invention belongs to the technical field of data processing, and in particular relates to a combined neural network model compression method based on channel pruning and quantization training.
Background
Existing neural network pruning algorithms can be divided into three main steps: sparse training, cutting off the channels with little influence, and fine-tuning on a data set. Existing pruning algorithms often evaluate the importance of a channel by computing the average of the convolution filter parameters. However, this evaluation only considers the influence of the convolution operation on the feature map and ignores the influence of the BN layer, so a network pruned in this way suffers a considerable loss in performance. In terms of quantization, the existing approach is mainly static quantization performed after model training is completed. The quantization parameters obtained in this way carry certain errors, and there is no way to adjust them on a data set, so the accuracy of the quantized model suffers a certain loss.
Disclosure of Invention
In order to solve the problem that neural network models are difficult to deploy on general computing equipment because of their large number of parameters and large amount of computation, the invention designs a combined neural network model compression method based on channel pruning and quantization training, which greatly reduces the time and space consumption of the model through the two techniques of pruning and quantization while maintaining the model accuracy.
The invention is realized by the following technical scheme:
a combined neural network model compression method based on channel pruning and quantitative training comprises the following steps:
step 1: sparse training: during training, an L1 norm penalty is applied to the parameters of the BN layers that follow the convolutional layers to be sparsified, so that these parameters acquire structured sparsity in preparation for the subsequent channel pruning;
step 2: pruning the trained model: according to the correspondence between convolutional layers and BN layers in the model, the channels corresponding to small γ parameters in the BN layers are pruned, proceeding layer by layer from shallow to deep to form a new channel-pruned model;
step 3: fine-tuning the model: the pruned model is trained further on the data set with the learning rate appropriately reduced to a small fraction of its previous value, until the model accuracy no longer improves, which ends the channel pruning;
step 4: after pruning is finished, the model is quantized: a conventional floating-point computation graph is constructed;
step 5: pseudo-quantization modules are inserted at the positions in the computation graph corresponding to the convolution calculations, namely two pseudo-quantization modules at the convolution weights and at the activation values, quantizing the weights and activations to 8-bit integers;
step 6: the model is dynamically quantized and trained until convergence; during quantization training, both the convolutional layer weights and the activation values need to be quantized;
step 7: quantized inference: the quantization parameters of the convolutional layer weights and activation values, namely the scaling coefficient S and the zero point Z, are saved, completing the quantization training;
step 8: finally, the pruned and quantized model is obtained.
Further, the sparse training of step 1 specifically comprises:
step 1.1: constructing the original convolutional neural network model, traversing each layer of the model, finding the BN layer that follows each convolutional layer, and adding each such BN layer to a BN layer list;
step 1.2: setting the training hyper-parameters of the original convolutional neural network model, wherein the sparsification coefficient λ is between 0.0001 and 0.01;
step 1.3: after the hyper-parameters are set, performing sparse training;
carrying out forward propagation, and computing the gradient of each layer's parameters by backward propagation; before the gradients are applied, an L1 norm penalty is imposed on the γ parameters of each BN layer in the BN layer list;
collecting the absolute values of all γ parameters of the BN layers during training, sorting them, and listing the γ value at each quantile;
judging the sparsification level from these values, where the smaller the parameter values, the higher the sparsification level;
the training process is continued until neither the accuracy index nor the sparsification level increases any more;
after training stops, saving the trained model and its structure, and computing the number of parameters and the computational cost of the model.
Further, in step 1.3, an L1 norm penalty is imposed on the γ parameter in each BN layer in the BN layer list, as shown below,
L' = L + λ·Ω(w),  Ω(w) = Σ|γ|    (1)

In the above formula, Ω(w) represents the L1 norm of the BN layer γ parameters; it is multiplied by the sparsification coefficient λ and added to the original objective function L to form the new objective function L';
the calculation process of the BN layer is shown as the following formula,
z_out = γ·(z_in - μ)/√(σ² + ε) + β    (2)

where z_in represents the input tensor of the layer, μ and σ² are the per-channel mean and variance of the tensor, ε is a small value that ensures the stability of the numerical calculation, and γ and β are the two trainable parameters of the BN layer, representing the scaling and the offset of the layer respectively; the γ parameter is the target parameter of the L1 norm penalty.
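By way of illustration, the penalty of step 1.3 can be sketched in PyTorch as follows; the helper names and the specific λ value are assumptions made for this example and are not prescribed by the invention.

```python
import torch
import torch.nn as nn

def collect_bn_layers(model: nn.Module):
    """Gather BN layers for the BN layer list. For simplicity this sketch takes
    every BatchNorm2d module; a full implementation would keep only BN layers
    that directly follow a convolutional layer."""
    return [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]

def add_l1_penalty_to_bn(bn_layers, lam=0.01):
    """After loss.backward(), add the sub-gradient of lam * |gamma|_1 to the
    gradient of each BN scaling parameter gamma, i.e. apply the penalty before
    the gradients are used by the optimizer."""
    for bn in bn_layers:
        if bn.weight.grad is not None:
            bn.weight.grad.add_(lam * torch.sign(bn.weight.data))

# Usage inside one training step (sketch):
#   loss = criterion(model(x), y)
#   loss.backward()
#   add_l1_penalty_to_bn(bn_layers, lam=0.01)   # lambda chosen in [0.0001, 0.01]
#   optimizer.step()
```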
Further, the model pruning of step 2 specifically comprises:
step 2.1: traversing the model from front to back and finding the BN layer corresponding to each convolutional layer; if a convolutional layer has no corresponding BN layer, it is not pruned; the output layer of the network cannot be pruned either, because its output channels are constrained by the target task; the parts to be pruned are marked and the pruning information is summarized into a table; a part with a short-circuit (shortcut) connection can be regarded as having multiple inputs;
step 2.2: globally sorting the γ parameters of all BN layers and computing the pruning threshold of the γ parameters from the pruning ratio; computing the minimum of the per-layer maxima of the γ parameters and taking this value as the upper limit of the pruning threshold, since exceeding it would cause an entire layer to be cut off, as illustrated in the sketch following step 2.3.3 below;
step 2.3: traversing the pruning information table of step 2.1 from front to back and classifying the entries into the following three cases:
the combination of an unconstrained convolutional layer plus a BN layer;
the combination of a constrained convolutional layer plus a BN layer;
a residual block structure with a short-circuit connection;
step 2.4: redefining the network model according to the number of channels remaining after pruning each layer, and saving the parameters of the new pruned model.
Further, the step 2.3 is specifically,
step 2.3.1: for the combination of an unconstrained convolutional layer plus a BN layer: the convolution filters corresponding to the input-channel pruning mask and to the output pruning mask are pruned; the output pruning mask is composed of the indices of the convolution filters whose γ parameters in the BN layer of the current convolutional layer are smaller than the threshold, and pruning is realized by recombining the parameters of the convolutional layer and of the BN layer; if the current convolutional layer is a depthwise separable convolution, i.e. the number of groups equals the number of input channels, the output pruning mask is the same as the input pruning mask, and the number of groups after pruning equals the number of remaining convolution filters;
step 2.3.2: for the combination of a constrained convolutional layer plus a BN layer: the output channels are not pruned; only the convolution filters corresponding to the input-channel pruning mask are pruned, again by recombining the parameters of the convolutional layer and of the BN layer;
step 2.3.3: for a residual block structure with a short-circuit connection: the number of output channels of the last layer in the residual block must equal the number of input channels of the residual block, so during pruning the output pruning mask of the last layer of the residual block is set equal to the output pruning mask of the layer preceding the residual block, which guarantees a well-formed model structure after pruning.
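An illustrative PyTorch-style sketch of the global threshold of step 2.2 and the channel masks of step 2.3 is given below; the function names and the 40% pruning ratio are assumptions of the example only.

```python
import torch
import torch.nn as nn

def compute_prune_threshold(bn_layers, prune_ratio=0.4):
    """Globally sort |gamma| over all BN layers and take the value at the
    requested percentile as the pruning threshold; cap it by the smallest
    per-layer maximum so that no layer is pruned away completely
    (prune_ratio is assumed to be < 1)."""
    all_gammas = torch.cat([bn.weight.data.abs().flatten() for bn in bn_layers])
    sorted_gammas, _ = torch.sort(all_gammas)
    threshold = sorted_gammas[int(len(sorted_gammas) * prune_ratio)]
    upper_limit = min(bn.weight.data.abs().max() for bn in bn_layers)
    return torch.min(threshold, upper_limit)

def channel_masks(bn_layers, threshold):
    """Output pruning mask per layer: keep the channels whose |gamma| is not
    below the threshold; the kept indices define the remaining filters."""
    return [bn.weight.data.abs() >= threshold for bn in bn_layers]
```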
Further, step 6, dynamically quantizing and training the model until convergence, specifically comprises:
step 6.1: in quantization training, the input is still an unquantized floating-point number; the convolutional layer parameters pass through a pseudo-quantization module before taking part in the floating-point computation, the intermediate convolution itself is carried out entirely in floating point, and the activation values produced by the activation function are then passed through another pseudo-quantization module;
step 6.2: because the weight distribution of a convolutional layer is concentrated, fixed and independent of the model input, the convolutional layer parameters are quantized layer by layer; the activation values are influenced by the model input and can fluctuate over a large range, so a channel-by-channel quantization method is adopted for them; the quantization is computed by the following formulas,
clamp(r; a, b) := min(max(r, a), b)    (3)

s(a, b, n) := (b - a)/(n - 1)    (4)

q(r; a, b, n) := [(clamp(r; a, b) - a)/s(a, b, n)]·s(a, b, n) + a    (5)

where r represents the floating-point number being quantized; [a, b] is the quantization range; n is the number of quantization levels, with n = 2^8 = 256 for 8-bit integer quantization; [·] denotes rounding to the nearest integer; and q(r; a, b, n) is the result after quantization;
in the above formulas, the quantization result can be computed from a floating-point value once the quantization range [a, b] is determined; during training, because the input and the model parameters change constantly, the distribution of the quantities to be quantized must be observed in order to determine the quantization range;
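A minimal sketch of the pseudo-quantization computation of formulas (3) to (5) is given below, assuming the step size of formula (4) is (b - a)/(n - 1) as reconstructed above; it quantizes and immediately dequantizes, so training continues in floating point.

```python
import torch

def fake_quantize(r: torch.Tensor, a: float, b: float, n: int = 256) -> torch.Tensor:
    """Simulated (fake) quantization of formulas (3)-(5): clamp to [a, b],
    round to one of n levels, then map back to floating point so the rest of
    the computation can stay in float during training."""
    s = (b - a) / (n - 1)                 # quantization step, formula (4)
    r_clamped = torch.clamp(r, a, b)      # formula (3)
    q = torch.round((r_clamped - a) / s)  # nearest integer level
    return q * s + a                      # dequantized result, formula (5)
```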
step 6.3: for the quantization range of the convolutional layer parameters, for each convolutional layer parameter tensor w, take a := min w and b := max w; this quantization range converges as the convolutional layer parameters converge;
step 6.4: for the quantization range of the activation value, the quantization range needs to be calculated independently for each channel;
so that the quantization range of the activation values reflects their distribution over the entire data set, an exponential moving average of the observed quantization range is computed, as follows,

S_t = α×Y_t + (1 - α)×S_{t-1}    (6)

where α is the moving-average coefficient, with a value between 0 and 1 taken close to 1 so that the average reflects the long-term statistics; Y_t is the value observed at the current step, and S_t is the exponential moving average at time t;
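The per-channel moving-average range of formula (6) can be sketched as follows; the value α = 0.99 and the assumption of NCHW activation tensors are illustrative only.

```python
import torch

class EmaRange:
    """Per-channel exponential moving average of the observed activation range,
    following formula (6)."""
    def __init__(self, alpha: float = 0.99):
        self.alpha = alpha
        self.min_val = None
        self.max_val = None

    def update(self, x: torch.Tensor):
        # x: activation tensor of shape (batch, channels, H, W)
        cur_min = x.amin(dim=(0, 2, 3))
        cur_max = x.amax(dim=(0, 2, 3))
        if self.min_val is None:
            self.min_val, self.max_val = cur_min, cur_max
        else:
            self.min_val = self.alpha * cur_min + (1 - self.alpha) * self.min_val
            self.max_val = self.alpha * cur_max + (1 - self.alpha) * self.max_val
```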
step 6.5: the BN layer operation is fused into the convolution during training; the parameters of the fused convolutional layer are computed as follows,

w_fold := γ·w/√(EMA(σ_B²) + ε)    (7)

where γ is the γ parameter of the BN layer, EMA(σ_B²) is the exponential moving average of the per-batch variance of the convolutional layer outputs, ε is a small constant, and w and w_fold are the convolutional layer weights before and after fusion, respectively.
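A minimal sketch of the weight fusion of formula (7) follows; the bias and β folding are analogous and omitted, and the function name is an assumption of this example.

```python
import torch

def fold_bn_into_conv(conv_weight: torch.Tensor,
                      bn_gamma: torch.Tensor,
                      bn_running_var: torch.Tensor,
                      eps: float = 1e-5) -> torch.Tensor:
    """Fold the BN scaling into the convolution weights, formula (7):
    w_fold = gamma * w / sqrt(EMA(var) + eps).
    conv_weight has shape (out_channels, in_channels, kH, kW); the BN
    parameters are per output channel."""
    scale = bn_gamma / torch.sqrt(bn_running_var + eps)
    return conv_weight * scale.reshape(-1, 1, 1, 1)
```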
Further, the quantized inference of step 7 specifically comprises:
quantizing the fused bias parameters to 32-bit integers, with their scaling coefficient S taken as the product of the convolutional layer weight scaling coefficient and the input scaling coefficient, and their zero point Z set to 0;
the quantized inference needs to solve the problem of carrying out the convolution matrix operations using only integers, as follows;
first, the scaling coefficient S and the zero point Z of each layer are calculated from the obtained quantization range:

S = s(a, b, n)
Z = [q(0.0; a, b, n)]    (8)

the functions in the two equations above were defined earlier; a quantized value can be dequantized back to a floating-point number using the scaling coefficient S and the zero point Z, according to the dequantization formula

r = S(q - Z)    (9)
consider the multiplication of two N×N real matrices r_1 and r_2 with result r_3; for α ∈ {1, 2, 3} and 1 ≤ i, j ≤ N, let r_α^(i,j) denote the entry in the i-th row and j-th column of r_α, let (S_α, Z_α) denote the quantization parameters of matrix r_α, and let q_α^(i,j) denote the corresponding quantized entries.
The dequantization formula can then be written entry-wise as:

r_α^(i,j) = S_α·(q_α^(i,j) - Z_α)    (10)

From the matrix multiplication one obtains:

S_3·(q_3^(i,k) - Z_3) = Σ_{j=1}^{N} S_1·(q_1^(i,j) - Z_1)·S_2·(q_2^(j,k) - Z_2)    (11)

The above formula can be rewritten as:

q_3^(i,k) = Z_3 + M·Σ_{j=1}^{N} (q_1^(i,j) - Z_1)·(q_2^(j,k) - Z_2)    (12)

where

M := S_1·S_2/S_3    (13)

M is the only non-integer quantity in the formula, but in practice its value lies between 0 and 1, so representing M with a 32-bit fixed-point number meets the precision requirement; q_3 is the quantized result of the computation.
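The integer-only evaluation of formulas (12) and (13) can be sketched as follows; representing M as a normalized fixed-point multiplier plus a shift is one common realization and is an assumption of this example rather than a prescription of the invention.

```python
import numpy as np

def quantize_multiplier(M: float):
    """Represent the real multiplier M in (0, 1) as a 31-bit integer M0 and a
    right shift, so that M is approximately M0 * 2^(-31 - shift)."""
    assert 0.0 < M < 1.0
    shift = 0
    while M < 0.5:
        M *= 2.0
        shift += 1
    return int(round(M * (1 << 31))), shift

def quantized_matmul(q1, Z1, q2, Z2, S1, S2, S3, Z3):
    """Integer-only evaluation of formula (12):
    q3 = Z3 + M * sum_j (q1 - Z1)(q2 - Z2), with M = S1*S2/S3 (formula (13))
    applied as a fixed-point multiply followed by a shift (truncating rounding).
    q1 and q2 are assumed to be uint8 arrays, Z1/Z2/Z3 integer zero points."""
    M0, shift = quantize_multiplier(S1 * S2 / S3)
    acc = (q1.astype(np.int32) - Z1) @ (q2.astype(np.int32) - Z2)   # int32 accumulator
    scaled = (acc.astype(np.int64) * M0) >> (31 + shift)            # apply M without floats
    return np.clip(Z3 + scaled, 0, 255).astype(np.uint8)
```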
The invention has the beneficial effects that:
1. The invention introduces no extra variables during pruning; it directly constrains the BN layer parameters and thereby makes better use of the scaling effect of the BN γ parameters, achieving a better pruning effect than existing pruning methods and better acceleration on common hardware.
2. Unlike prior methods that train first and quantize afterwards, the quantization training method directly simulates the quantization process during training, overcoming the defect of post-training quantization that the quantization parameters cannot be adjusted. The quantization method of the invention has a smaller precision loss and is suitable for various common convolution models.
3. The inference scheme designed by the invention completely avoids floating-point operations and achieves better acceleration on certain hardware.
4. The two model compression methods are orthogonal to each other, can be used independently or jointly, and can achieve a better model compression effect.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of the pruning algorithm of the present invention.
FIG. 3 is a schematic diagram of the pruning algorithm of the present invention.
FIG. 4 is a schematic diagram of pruning in a multi-layer structure according to the pruning algorithm of the present invention.
FIG. 5 is a schematic diagram of pruning in a residual structure by the pruning algorithm of the present invention.
FIG. 6 is a flow chart of the quantization algorithm of the present invention.
FIG. 7 is a schematic diagram of the quantized inference computation of the present invention.
FIG. 8 is a diagram of the computation of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A combined neural network model compression method based on channel pruning and quantization training comprises the following steps. Channel pruning reduces the number of neural network channels; quantization training replaces floating-point operations with integer operations.
Step 1: the sparse training model applies L1 norm punishment to the BN layer parameters after the convolutional layer needing to be sparse in the training process, so that the parameters have the characteristic of structured sparsity and are prepared for next channel cutting;
step 2: training model pruning, pruning channels corresponding to the convolutional layers with small gamma parameters in the BN layer according to the corresponding relation between the convolutional layers and the BN layer in the model in the pruning process, and pruning each layer shallowly and deeply to form a new model after channel pruning;
and step 3: fine adjustment of the model, continuing training the model after pruning on the data set, and properly reducing the learning rate to the previous one
Figure BDA0002484838070000061
Training until the model precision is not improved any more, and ending channel pruning;
and 4, step 4: quantizing the model after pruning is finished, and constructing a conventional floating point number calculation graph; 32-bit floating point numbers are adopted in general training;
and 5: inserting pseudo quantization modules at corresponding positions of convolution calculation in a calculation diagram, and inserting two pseudo quantization modules at convolution weight positions and activation value positions in order to simulate quantization effects in actual quantization estimation, so as to quantize the weight and activation values into 8-bit integer;
step 6: dynamically quantizing the training model until convergence, wherein in quantization training, the weight and the activation value of the convolutional layer need to be quantized;
and 7: quantizing reasoning, namely saving the quantization parameters of the convolutional layer weight and the activation value, scaling the coefficient S and the zero point Z, and finishing quantization training;
and 8: finally, a model after pruning and quantification is obtained.
Further, the sparse training of step 1 specifically comprises:
step 1.1: constructing the original convolutional neural network model, traversing each layer of the model, finding the BN layer that follows each convolutional layer, and adding each such BN layer to a BN layer list;
step 1.2: setting the training hyper-parameters of the original convolutional neural network model, wherein the sparsification coefficient λ is between 0.0001 and 0.01, generally 0.01, and the remaining training hyper-parameters are the same as in training without sparsification;
step 1.3: after the hyper-parameters are set, performing sparse training;
carrying out forward propagation, and computing the gradient of each layer's parameters by backward propagation; before the gradients are applied, an L1 norm penalty is imposed on the γ parameters of each BN layer in the BN layer list;
collecting the absolute values of all γ parameters of the BN layers during training, sorting them, and listing the γ value at each quantile;
judging the sparsification level from these values, where the smaller the parameter values, the higher the sparsification level;
the training process is continued until neither the accuracy index nor the sparsification level increases any more;
after training stops, saving the trained model and its structure, and computing the number of parameters and the computational cost of the model.
Further, in step 1.3, an L1 norm penalty is imposed on the γ parameter in each BN layer in the BN layer list, as shown below,
L' = L + λ·Ω(w),  Ω(w) = Σ|γ|    (1)

In the above formula, Ω(w) represents the L1 norm of the BN layer γ parameters; it is multiplied by the sparsification coefficient λ and added to the original objective function L to form the new objective function L';
the calculation process of the BN layer is shown as the following formula,
z_out = γ·(z_in - μ)/√(σ² + ε) + β    (2)

where z_in represents the input tensor of the layer, μ and σ² are the per-channel mean and variance of the tensor, ε is a small value that ensures the stability of the numerical calculation, and γ and β are the two trainable parameters of the BN layer, representing the scaling and the offset of the layer respectively; the γ parameter is the target parameter of the L1 norm penalty.
Further, the model pruning of step 2 specifically comprises:
step 2.1: traversing the model from front to back and finding the BN layer corresponding to each convolutional layer; if a convolutional layer has no corresponding BN layer, it is not pruned; the output layer of the network cannot be pruned either, because its output channels are constrained by the target task; the parts to be pruned are marked and the pruning information is summarized into a table; a part with a short-circuit (shortcut) connection can be regarded as having multiple inputs;
step 2.2: globally sorting the γ parameters of all BN layers and computing the pruning threshold of the γ parameters from the pruning ratio; computing the minimum of the per-layer maxima of the γ parameters and taking this value as the upper limit of the pruning threshold, since exceeding it would cause an entire layer to be cut off;
step 2.3: traversing the pruning information table of step 2.1 from front to back and classifying the entries into the following three cases:
the combination of an unconstrained convolutional layer plus a BN layer;
the combination of a constrained convolutional layer plus a BN layer;
a residual block structure with a short-circuit connection;
step 2.4: redefining the network model according to the number of channels remaining after pruning each layer, and saving the parameters of the new pruned model.
Further, the step 2.3 is specifically,
step 2.3.1: for the combination of an unconstrained convolutional layer plus a BN layer: the convolution filters corresponding to the input-channel pruning mask (i.e. the pruning result of the output channels of the previous layer) and to the output pruning mask are pruned; the output pruning mask is composed of the indices of the convolution filters whose γ parameters in the BN layer of the current convolutional layer are smaller than the threshold, and pruning is realized by recombining the parameters of the convolutional layer and of the BN layer; if the current convolutional layer is a depthwise separable convolution, i.e. the number of groups equals the number of input channels, the output pruning mask is the same as the input pruning mask, and the number of groups after pruning equals the number of remaining convolution filters;
step 2.3.2: for the combination of a constrained convolutional layer plus a BN layer: the output channels are not pruned; only the convolution filters corresponding to the input-channel pruning mask (i.e. the pruning result of the output channels of the previous layer) are pruned, again by recombining the parameters of the convolutional layer and of the BN layer;
step 2.3.3: for a residual block structure with a short-circuit connection: the number of output channels of the last layer in the residual block must equal the number of input channels of the residual block, so during pruning the output pruning mask of the last layer of the residual block is set equal to the output pruning mask of the layer preceding the residual block, which guarantees a well-formed model structure after pruning.
Further, step 6, dynamically quantizing and training the model until convergence, specifically comprises:
step 6.1: in quantization training, the input is still an unquantized floating-point number; the convolutional layer parameters pass through a pseudo-quantization module before taking part in the floating-point computation, the intermediate convolution itself is carried out entirely in floating point, and the activation values produced by the activation function are then passed through another pseudo-quantization module;
step 6.2: because the weight distribution of a convolutional layer is concentrated, fixed and independent of the model input, the convolutional layer parameters are quantized layer by layer; the activation values are influenced by the model input and can fluctuate over a large range, so a channel-by-channel quantization method is adopted for them; the quantization is computed by the following formulas,
clamp(r; a, b) := min(max(r, a), b)    (3)

s(a, b, n) := (b - a)/(n - 1)    (4)

q(r; a, b, n) := [(clamp(r; a, b) - a)/s(a, b, n)]·s(a, b, n) + a    (5)

where r represents the floating-point number being quantized; [a, b] is the quantization range; n is the number of quantization levels, with n = 2^8 = 256 for 8-bit integer quantization; [·] denotes rounding to the nearest integer; and q(r; a, b, n) is the result after quantization;
in the above formulas, the quantization result can be computed from a floating-point value once the quantization range [a, b] is determined; during training, because the input and the model parameters change constantly, the distribution of the quantities to be quantized must be observed in order to determine the quantization range;
step 6.3: for the quantization range of the convolutional layer parameters, for each convolutional layer parameter tensor w, take a := min w and b := max w; this quantization range converges as the convolutional layer parameters converge;
step 6.4: for the quantization range of the activation value, the quantization range needs to be calculated independently for each channel;
because the activation values are unstable at the beginning of training, they are not quantized at that stage; generally, after about one quarter of the training process, the activation values are observed channel by channel and their quantization range is computed in the same way as above; so that the quantization range of the activation values reflects their distribution over the entire data set, an exponential moving average of the observed quantization range is computed, as follows,

S_t = α×Y_t + (1 - α)×S_{t-1}    (6)

where α is the moving-average coefficient, with a value between 0 and 1 taken close to 1 so that the average reflects the long-term statistics; Y_t is the value observed at the current step, and S_t is the exponential moving average at time t;
step 6.5: for networks containing BN layers, BN is a stand-alone operation in ordinary training, whereas in a model optimized for quantization the BN operation is usually merged into the convolutional layer; in order to simulate the effect of this difference, the BN layer operation is fused into the convolution during training; the parameters of the fused convolutional layer are computed as follows,

w_fold := γ·w/√(EMA(σ_B²) + ε)    (7)

where γ is the γ parameter of the BN layer, EMA(σ_B²) is the exponential moving average of the per-batch variance of the convolutional layer outputs, ε is a small constant, and w and w_fold are the convolutional layer weights before and after fusion, respectively.
In the quantization training process, the quantization ranges and the BN layer parameters need to be frozen at an appropriate time, so that the network can learn its weights under static quantization parameters and BN statistics and thus better simulate the real inference process. The two are usually frozen in turn at 10% to 20% of the whole training process.
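The freezing schedule described above can be sketched as follows; the module attribute names and the exact freezing fractions are assumptions of this example.

```python
import torch

def maybe_freeze(model, step, total_steps,
                 bn_freeze_fraction=0.1, range_freeze_fraction=0.2):
    """Freeze BN statistics and then the quantization ranges once the given
    fractions of training are reached, so the network finishes learning its
    weights under static quantization parameters."""
    if step >= int(total_steps * bn_freeze_fraction):
        for m in model.modules():
            if isinstance(m, torch.nn.BatchNorm2d):
                m.eval()                      # stop updating running mean/variance
    if step >= int(total_steps * range_freeze_fraction):
        for m in model.modules():
            if hasattr(m, "ema_range"):       # pseudo-quantization modules (assumed attribute)
                m.ema_range.frozen = True     # stop updating the observed [a, b] range
```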
Further, the quantized inference of step 7 specifically comprises:
quantizing the fused bias parameters to 32-bit integers, with their scaling coefficient S taken as the product of the convolutional layer weight scaling coefficient and the input scaling coefficient, and their zero point Z set to 0;
in the quantized inference, the invention realizes fully integer inference, i.e. no floating-point operations are used during model inference; the quantized inference needs to solve the problem of carrying out the convolution matrix operations using only integers, as follows;
first, the scaling coefficient S and the zero point Z of each layer are calculated from the obtained quantization range:

S = s(a, b, n)
Z = [q(0.0; a, b, n)]    (8)

the functions in the two equations above were defined earlier; a quantized value can be dequantized back to a floating-point number using the scaling coefficient S and the zero point Z, according to the dequantization formula

r = S(q - Z)    (9)
consider the multiplication of two N×N real matrices r_1 and r_2 with result r_3; for α ∈ {1, 2, 3} and 1 ≤ i, j ≤ N, let r_α^(i,j) denote the entry in the i-th row and j-th column of r_α, let (S_α, Z_α) denote the quantization parameters of matrix r_α, and let q_α^(i,j) denote the corresponding quantized entries.
The dequantization formula can then be written entry-wise as:

r_α^(i,j) = S_α·(q_α^(i,j) - Z_α)    (10)

From the matrix multiplication one obtains:

S_3·(q_3^(i,k) - Z_3) = Σ_{j=1}^{N} S_1·(q_1^(i,j) - Z_1)·S_2·(q_2^(j,k) - Z_2)    (11)

The above formula can be rewritten as:

q_3^(i,k) = Z_3 + M·Σ_{j=1}^{N} (q_1^(i,j) - Z_1)·(q_2^(j,k) - Z_2)    (12)

where

M := S_1·S_2/S_3    (13)

M is the only non-integer quantity in the formula, but in practice its value lies between 0 and 1, so representing M with a 32-bit fixed-point number meets the precision requirement; q_3 is the quantized result of the computation.
Example 2
An improved YOLOv3 network was compressed using the pruning algorithm of the present invention. The improved YOLOv3 architecture employs MobileNetv2 as the feature extractor and replaces the ordinary convolutions with depthwise separable convolutions to reduce the amount of computation. The improved YOLOv3 network achieves a test-set mAP of 78.46% on the VOC data set. For a 512 × 512 input image, the computational cost is 4.15 GMACs and the model has 6.775M parameters.
The model is trained on the VOC training set for 80 epochs using standard data augmentation methods including random cropping, perspective transformation and horizontal flipping, with mixup augmentation additionally applied. The Adam optimizer and a cosine annealing learning rate schedule are used, with an initial learning rate of 4e-3 and a batch size of 16. The subsequent sparsification training and fine-tuning both use the same hyper-parameter settings.
In the sparse training, the sparsification coefficient is set to 0.01 and the model is trained from scratch on the VOC data set for 80 epochs, reaching a test-set mAP of 75.65%. After pruning 40% of the channels and 20 epochs of fine-tuning, the model finally reaches a test-set mAP of 75.44%, a precision drop of 3.0% compared with the unpruned model. The computational cost drops to 1.74 GMACs and the number of parameters to 2.31M, reductions of 58.1% and 65.9% respectively compared with the unpruned model.
The pruned model is then quantized using the quantization training algorithm of the invention. Int8 quantization is adopted, the pruned model is quantization-trained on the VOC data set, and the same hyper-parameter settings are used. The BN layer parameters are frozen after 10 epochs and the quantization parameters after 15 epochs. The final quantized model achieves 76.74% mAP on the test set, 1.7% lower than the original model.
The speed of the models is tested on a platform with an E5-2630 v4 CPU. The tests are performed on the VOC test set and the results are shown in the table below.
Table 1: Pruning and quantization model speed test results
The table shows that the combined pruning and quantization method of the present invention greatly accelerates even a small model such as MobileNetv2, with only a small loss of precision.

Claims (7)

1. A combined neural network model compression method based on channel pruning and quantitative training is characterized by comprising the following steps:
step 1: sparse training: during training, an L1 norm penalty is applied to the parameters of the BN layers that follow the convolutional layers to be sparsified, so that these parameters acquire structured sparsity in preparation for the subsequent channel pruning;
step 2: pruning the trained model: according to the correspondence between convolutional layers and BN layers in the model, the channels corresponding to small γ parameters in the BN layers are pruned, proceeding layer by layer from shallow to deep to form a new channel-pruned model;
step 3: fine-tuning the model: the pruned model is trained further on the data set with the learning rate appropriately reduced to a small fraction of its previous value, until the model accuracy no longer improves, which ends the channel pruning;
step 4: after pruning is finished, the model is quantized: a conventional floating-point computation graph is constructed;
step 5: pseudo-quantization modules are inserted at the positions in the computation graph corresponding to the convolution calculations, namely two pseudo-quantization modules at the convolution weights and at the activation values, quantizing the weights and activations to 8-bit integers;
step 6: the model is dynamically quantized and trained until convergence; during quantization training, both the convolutional layer weights and the activation values need to be quantized;
step 7: quantized inference: the quantization parameters of the convolutional layer weights and activation values, namely the scaling coefficient S and the zero point Z, are saved, completing the quantization training;
step 8: finally, the pruned and quantized model is obtained.
2. The compression method according to claim 1, wherein the sparse training of step 1 specifically comprises:
step 1.1: constructing the original convolutional neural network model, traversing each layer of the model, finding the BN layer that follows each convolutional layer, and adding each such BN layer to a BN layer list;
step 1.2: setting the training hyper-parameters of the original convolutional neural network model, wherein the sparsification coefficient λ is between 0.0001 and 0.01;
step 1.3: after the hyper-parameters are set, performing sparse training;
carrying out forward propagation, and computing the gradient of each layer's parameters by backward propagation; before the gradients are applied, an L1 norm penalty is imposed on the γ parameters of each BN layer in the BN layer list;
collecting the absolute values of all γ parameters of the BN layers during training, sorting them, and listing the γ value at each quantile;
judging the sparsification level from these values, where the smaller the parameter values, the higher the sparsification level;
the training process is continued until neither the accuracy index nor the sparsification level increases any more;
after training stops, saving the trained model and its structure, and computing the number of parameters and the computational cost of the model.
3. The compression method according to claim 2, wherein an L1 norm penalty is imposed on the gamma parameter in each BN layer in the BN layer list in step 1.3, as shown in the following formula,
L' = L + λ·Ω(w),  Ω(w) = Σ|γ|    (1)

In the above formula, Ω(w) represents the L1 norm of the BN layer γ parameters; it is multiplied by the sparsification coefficient λ and added to the original objective function L to form the new objective function L';
the calculation process of the BN layer is shown as the following formula,
z_out = γ·(z_in - μ)/√(σ² + ε) + β    (2)

where z_in represents the input tensor of the layer, μ and σ² are the per-channel mean and variance of the tensor, ε is a small value that ensures the stability of the numerical calculation, and γ and β are the two trainable parameters of the BN layer, representing the scaling and the offset of the layer respectively; the γ parameter is the target parameter of the L1 norm penalty.
4. The compression method according to claim 1, wherein the model pruning of step 2 specifically comprises:
Step 2.1: traversing the model from front to back, finding out the corresponding BN layer behind each convolutional layer, if no corresponding BN layer exists, pruning the convolutional layer, for a network output layer, because an output channel is limited by a target task, the output channel cannot be pruned, a pruning part needs to be marked, the pruning information is summarized into a table, and for a part with short-circuit connection, the part can be regarded as a plurality of inputs;
step 2.2: and globally sorting the gamma parameters in all BN layers, calculating the pruning threshold of the gamma parameters according to the pruning ratio, calculating the minimum value of the maximum values of the gamma parameters of all BN layers, and taking the value as the upper limit of the pruning threshold. Exceeding this threshold will result in a layer being completely cut;
step 2.3: traversing the pruning information table of step 2.1 from front to back and classifying the entries into the following three cases:
the combination of an unconstrained convolutional layer plus a BN layer;
the combination of a constrained convolutional layer plus a BN layer;
a residual block structure with a short-circuit connection;
step 2.4: redefining the network model according to the number of channels remaining after pruning each layer, and saving the parameters of the new pruned model.
5. The compression method according to claim 4, characterized in that said step 2.3 is, in particular,
step 2.3.1: for the combination of an unconstrained convolutional layer plus a BN layer: the convolution filters corresponding to the input-channel pruning mask and to the output pruning mask are pruned; the output pruning mask is composed of the indices of the convolution filters whose γ parameters in the BN layer of the current convolutional layer are smaller than the threshold, and pruning is realized by recombining the parameters of the convolutional layer and of the BN layer; if the current convolutional layer is a depthwise separable convolution, i.e. the number of groups equals the number of input channels, the output pruning mask is the same as the input pruning mask, and the number of groups after pruning equals the number of remaining convolution filters;
step 2.3.2: for the combination of a constrained convolutional layer plus a BN layer: the output channels are not pruned; only the convolution filters corresponding to the input-channel pruning mask are pruned, again by recombining the parameters of the convolutional layer and of the BN layer;
step 2.3.3: for a residual block structure with a short-circuit connection: the number of output channels of the last layer in the residual block must equal the number of input channels of the residual block, so during pruning the output pruning mask of the last layer of the residual block is set equal to the output pruning mask of the layer preceding the residual block, which guarantees a well-formed model structure after pruning.
6. The compression method according to claim 1, wherein step 6, dynamically quantizing and training the model until convergence, specifically comprises:
step 6.1: in quantization training, the input is still an unquantized floating-point number; the convolutional layer parameters pass through a pseudo-quantization module before taking part in the floating-point computation, the intermediate convolution itself is carried out entirely in floating point, and the activation values produced by the activation function are then passed through another pseudo-quantization module;
step 6.2: because the weight distribution of a convolutional layer is concentrated, fixed and independent of the model input, the convolutional layer parameters are quantized layer by layer; the activation values are influenced by the model input and can fluctuate over a large range, so a channel-by-channel quantization method is adopted for them; the quantization is computed by the following formulas,
clamp(r; a, b) := min(max(r, a), b)    (3)

s(a, b, n) := (b - a)/(n - 1)    (4)

q(r; a, b, n) := [(clamp(r; a, b) - a)/s(a, b, n)]·s(a, b, n) + a    (5)

where r represents the floating-point number being quantized; [a, b] is the quantization range; n is the number of quantization levels, with n = 2^8 = 256 for 8-bit integer quantization; [·] denotes rounding to the nearest integer; and q(r; a, b, n) is the result after quantization;
in the above formulas, the quantization result can be computed from a floating-point value once the quantization range [a, b] is determined; during training, because the input and the model parameters change constantly, the distribution of the quantities to be quantized must be observed in order to determine the quantization range;
step 6.3: for the quantization range of the convolutional layer parameters, for each convolutional layer parameter tensor w, take a := min w and b := max w; this quantization range converges as the convolutional layer parameters converge;
step 6.4: for the quantization range of the activation value, the quantization range needs to be calculated independently for each channel;
so that the quantization range of the activation values reflects their distribution over the entire data set, an exponential moving average of the observed quantization range is computed, as follows,

S_t = α×Y_t + (1 - α)×S_{t-1}    (6)

where α is the moving-average coefficient, with a value between 0 and 1 taken close to 1 so that the average reflects the long-term statistics; Y_t is the value observed at the current step, and S_t is the exponential moving average at time t;
step 6.5: the BN layer operation is fused into the convolution during training; the parameters of the fused convolutional layer are computed as follows,

w_fold := γ·w/√(EMA(σ_B²) + ε)    (7)

where γ is the γ parameter of the BN layer, EMA(σ_B²) is the exponential moving average of the per-batch variance of the convolutional layer outputs, ε is a small constant, and w and w_fold are the convolutional layer weights before and after fusion, respectively.
7. The compression method according to claim 1, wherein the quantized inference of step 7 specifically comprises:
quantizing the fused bias parameters to 32-bit integers, with their scaling coefficient S taken as the product of the convolutional layer weight scaling coefficient and the input scaling coefficient, and their zero point Z set to 0;
the quantized inference needs to solve the problem of carrying out the convolution matrix operations using only integers, as follows;
first, the scaling coefficient S and the zero point Z of each layer are calculated from the obtained quantization range:

S = s(a, b, n)
Z = [q(0.0; a, b, n)]    (8)

the functions in the two equations above were defined earlier; a quantized value can be dequantized back to a floating-point number using the scaling coefficient S and the zero point Z, according to the dequantization formula

r = S(q - Z)    (9)
consider the multiplication of two N×N real matrices r_1 and r_2 with result r_3; for α ∈ {1, 2, 3} and 1 ≤ i, j ≤ N, let r_α^(i,j) denote the entry in the i-th row and j-th column of r_α, let (S_α, Z_α) denote the quantization parameters of matrix r_α, and let q_α^(i,j) denote the corresponding quantized entries.
The dequantization formula can then be written entry-wise as:

r_α^(i,j) = S_α·(q_α^(i,j) - Z_α)    (10)

From the matrix multiplication one obtains:

S_3·(q_3^(i,k) - Z_3) = Σ_{j=1}^{N} S_1·(q_1^(i,j) - Z_1)·S_2·(q_2^(j,k) - Z_2)    (11)

The above formula can be rewritten as:

q_3^(i,k) = Z_3 + M·Σ_{j=1}^{N} (q_1^(i,j) - Z_1)·(q_2^(j,k) - Z_2)    (12)

where

M := S_1·S_2/S_3    (13)

M is the only non-integer quantity in the formula, but in practice its value lies between 0 and 1, so representing M with a 32-bit fixed-point number meets the precision requirement; q_3 is the quantized result of the computation.
CN202010388100.1A 2020-05-09 2020-05-09 Combined neural network model compression method based on channel pruning and quantitative training Pending CN111652366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010388100.1A CN111652366A (en) 2020-05-09 2020-05-09 Combined neural network model compression method based on channel pruning and quantitative training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010388100.1A CN111652366A (en) 2020-05-09 2020-05-09 Combined neural network model compression method based on channel pruning and quantitative training

Publications (1)

Publication Number Publication Date
CN111652366A true CN111652366A (en) 2020-09-11

Family

ID=72343243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010388100.1A Pending CN111652366A (en) 2020-05-09 2020-05-09 Combined neural network model compression method based on channel pruning and quantitative training

Country Status (1)

Country Link
CN (1) CN111652366A (en)

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860472A (en) * 2020-09-24 2020-10-30 成都索贝数码科技股份有限公司 Television station caption detection method, system, computer equipment and storage medium
CN111932690A (en) * 2020-09-17 2020-11-13 北京主线科技有限公司 Pruning method and device based on 3D point cloud neural network model
CN112101487A (en) * 2020-11-17 2020-12-18 深圳感臻科技有限公司 Compression method and device for fine-grained recognition model
CN112132219A (en) * 2020-09-24 2020-12-25 天津锋物科技有限公司 General deployment scheme of deep learning detection model based on mobile terminal
CN112149724A (en) * 2020-09-14 2020-12-29 浙江大学 Electroencephalogram data feature extraction method based on intra-class compactness
CN112149829A (en) * 2020-10-23 2020-12-29 北京百度网讯科技有限公司 Method, device, equipment and storage medium for determining network model pruning strategy
CN112183725A (en) * 2020-09-27 2021-01-05 安徽寒武纪信息科技有限公司 Method of providing neural network, computing device, and computer-readable storage medium
CN112396179A (en) * 2020-11-20 2021-02-23 浙江工业大学 Flexible deep learning network model compression method based on channel gradient pruning
CN112488070A (en) * 2020-12-21 2021-03-12 上海交通大学 Neural network compression method for remote sensing image target detection
CN112488291A (en) * 2020-11-03 2021-03-12 珠海亿智电子科技有限公司 Neural network 8-bit quantization compression method
CN112581423A (en) * 2020-09-29 2021-03-30 宁波大学 Neural network-based rapid detection method for automobile surface defects
CN112598020A (en) * 2020-11-24 2021-04-02 深兰人工智能(深圳)有限公司 Target identification method and system
CN112613610A (en) * 2020-12-25 2021-04-06 国网江苏省电力有限公司信息通信分公司 Deep neural network compression method based on joint dynamic pruning
CN112784839A (en) * 2021-02-03 2021-05-11 华南理工大学 Scene character detection model lightweight method based on mobile terminal, electronic equipment and storage medium
CN112800268A (en) * 2021-03-02 2021-05-14 安庆师范大学 Quantification and approximate nearest neighbor searching method for image visual characteristics
CN112836819A (en) * 2021-01-26 2021-05-25 北京奇艺世纪科技有限公司 Neural network model generation method and device
CN112836751A (en) * 2021-02-03 2021-05-25 歌尔股份有限公司 Target detection method and device
CN112884144A (en) * 2021-02-01 2021-06-01 上海商汤智能科技有限公司 Network quantization method and device, electronic equipment and storage medium
CN113011581A (en) * 2021-02-23 2021-06-22 北京三快在线科技有限公司 Neural network model compression method and device, electronic equipment and readable storage medium
CN113159297A (en) * 2021-04-29 2021-07-23 上海阵量智能科技有限公司 Neural network compression method and device, computer equipment and storage medium
CN113160062A (en) * 2021-05-25 2021-07-23 烟台艾睿光电科技有限公司 Infrared image target detection method, device, equipment and storage medium
CN113269312A (en) * 2021-06-03 2021-08-17 华南理工大学 Model compression method and system combining quantization and pruning search
CN113408723A (en) * 2021-05-19 2021-09-17 北京理工大学 Convolutional neural network pruning and quantization synchronous compression method for remote sensing application
CN113554147A (en) * 2021-04-27 2021-10-26 北京小米移动软件有限公司 Sample feature processing method and device, electronic equipment and storage medium
CN113570055A (en) * 2021-06-04 2021-10-29 合肥工业大学 Convolutional neural network compression method based on pre-quantization and scaling coefficient pruning
CN113627595A (en) * 2021-08-06 2021-11-09 温州大学 Probability-based MobileNet V1 network channel pruning method
CN113705791A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Neural network inference quantification method and device, electronic equipment and storage medium
CN113850385A (en) * 2021-10-12 2021-12-28 北京航空航天大学 Coarse and fine granularity combined neural network pruning method
CN114386588A (en) * 2022-03-23 2022-04-22 杭州雄迈集成电路技术股份有限公司 Neural network quantification method and device, and neural network reasoning method and system
WO2022088063A1 (en) * 2020-10-30 2022-05-05 华为技术有限公司 Method and apparatus for quantizing neural network model, and method and apparatus for processing data
WO2022095675A1 (en) * 2020-11-04 2022-05-12 安徽寒武纪信息科技有限公司 Neural network sparsification apparatus and method and related product
CN114565076A (en) * 2022-01-18 2022-05-31 中国人民解放军国防科技大学 Adaptive incremental streaming quantile estimation method and device
CN114626527A (en) * 2022-03-25 2022-06-14 中国电子产业工程有限公司 Neural network pruning method and device based on sparse constraint retraining
CN115170917A (en) * 2022-06-20 2022-10-11 美的集团(上海)有限公司 Image processing method, electronic device, and storage medium
CN115496207A (en) * 2022-11-08 2022-12-20 荣耀终端有限公司 Neural network model compression method, device and system
WO2022262660A1 (en) * 2021-06-15 2022-12-22 华南理工大学 Pruning and quantization compression method and system for super-resolution network, and medium
CN115797477A (en) * 2023-01-30 2023-03-14 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Pruning type image compression sensing method and system for light weight deployment
CN116069743A (en) * 2023-03-06 2023-05-05 齐鲁工业大学(山东省科学院) Fluid data compression method based on time sequence characteristics
CN116167413A (en) * 2023-04-20 2023-05-26 国网山东省电力公司济南供电公司 Method and system for quantized pruning joint optimization of deep convolutional neural network
CN116405127A (en) * 2023-06-09 2023-07-07 北京理工大学 Compression method and device of underwater acoustic communication preamble signal detection model
CN116468101A (en) * 2023-03-21 2023-07-21 美的集团(上海)有限公司 Model pruning method, device, electronic equipment and readable storage medium
CN116611495A (en) * 2023-06-19 2023-08-18 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116894189A (en) * 2023-09-11 2023-10-17 中移(苏州)软件技术有限公司 Model training method, device, equipment and readable storage medium
CN117497194B (en) * 2023-12-28 2024-03-01 苏州元脑智能科技有限公司 Biological information processing method and device, electronic equipment and storage medium

CN112800268A (en) * 2021-03-02 2021-05-14 安庆师范大学 Quantification and approximate nearest neighbor searching method for image visual characteristics
CN113554147A (en) * 2021-04-27 2021-10-26 北京小米移动软件有限公司 Sample feature processing method and device, electronic equipment and storage medium
CN113159297B (en) * 2021-04-29 2024-01-09 上海阵量智能科技有限公司 Neural network compression method, device, computer equipment and storage medium
CN113159297A (en) * 2021-04-29 2021-07-23 上海阵量智能科技有限公司 Neural network compression method and device, computer equipment and storage medium
CN113408723A (en) * 2021-05-19 2021-09-17 北京理工大学 Convolutional neural network pruning and quantization synchronous compression method for remote sensing application
CN113160062A (en) * 2021-05-25 2021-07-23 烟台艾睿光电科技有限公司 Infrared image target detection method, device, equipment and storage medium
CN113269312B (en) * 2021-06-03 2021-11-09 华南理工大学 Model compression method and system combining quantization and pruning search
CN113269312A (en) * 2021-06-03 2021-08-17 华南理工大学 Model compression method and system combining quantization and pruning search
CN113570055A (en) * 2021-06-04 2021-10-29 合肥工业大学 Convolutional neural network compression method based on pre-quantization and scaling coefficient pruning
WO2022262660A1 (en) * 2021-06-15 2022-12-22 华南理工大学 Pruning and quantization compression method and system for super-resolution network, and medium
CN113627595B (en) * 2021-08-06 2023-07-25 温州大学 Probability-based MobileNet V1 network channel pruning method
CN113627595A (en) * 2021-08-06 2021-11-09 温州大学 Probability-based MobileNet V1 network channel pruning method
CN113705791B (en) * 2021-08-31 2023-12-19 上海阵量智能科技有限公司 Neural network reasoning quantification method and device, electronic equipment and storage medium
CN113705791A (en) * 2021-08-31 2021-11-26 上海阵量智能科技有限公司 Neural network inference quantification method and device, electronic equipment and storage medium
CN113850385A (en) * 2021-10-12 2021-12-28 北京航空航天大学 Coarse and fine granularity combined neural network pruning method
CN114565076A (en) * 2022-01-18 2022-05-31 中国人民解放军国防科技大学 Adaptive incremental streaming quantile estimation method and device
CN114386588A (en) * 2022-03-23 2022-04-22 杭州雄迈集成电路技术股份有限公司 Neural network quantification method and device, and neural network reasoning method and system
CN114626527B (en) * 2022-03-25 2024-02-09 中国电子产业工程有限公司 Neural network pruning method and device based on sparse constraint retraining
CN114626527A (en) * 2022-03-25 2022-06-14 中国电子产业工程有限公司 Neural network pruning method and device based on sparse constraint retraining
CN115170917A (en) * 2022-06-20 2022-10-11 美的集团(上海)有限公司 Image processing method, electronic device, and storage medium
CN115170917B (en) * 2022-06-20 2023-11-07 美的集团(上海)有限公司 Image processing method, electronic device and storage medium
CN115496207B (en) * 2022-11-08 2023-09-26 荣耀终端有限公司 Neural network model compression method, device and system
CN115496207A (en) * 2022-11-08 2022-12-20 荣耀终端有限公司 Neural network model compression method, device and system
CN115797477A (en) * 2023-01-30 2023-03-14 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Pruning type image compression sensing method and system for light weight deployment
CN116069743A (en) * 2023-03-06 2023-05-05 齐鲁工业大学(山东省科学院) Fluid data compression method based on time sequence characteristics
CN116468101A (en) * 2023-03-21 2023-07-21 美的集团(上海)有限公司 Model pruning method, device, electronic equipment and readable storage medium
CN116167413A (en) * 2023-04-20 2023-05-26 国网山东省电力公司济南供电公司 Method and system for quantized pruning joint optimization of deep convolutional neural network
CN116405127B (en) * 2023-06-09 2023-09-12 北京理工大学 Compression method and device of underwater acoustic communication preamble signal detection model
CN116405127A (en) * 2023-06-09 2023-07-07 北京理工大学 Compression method and device of underwater acoustic communication preamble signal detection model
CN116611495A (en) * 2023-06-19 2023-08-18 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116611495B (en) * 2023-06-19 2024-03-01 北京百度网讯科技有限公司 Compression method, training method, processing method and device of deep learning model
CN116894189A (en) * 2023-09-11 2023-10-17 中移(苏州)软件技术有限公司 Model training method, device, equipment and readable storage medium
CN116894189B (en) * 2023-09-11 2024-01-05 中移(苏州)软件技术有限公司 Model training method, device, equipment and readable storage medium
CN117497194B (en) * 2023-12-28 2024-03-01 苏州元脑智能科技有限公司 Biological information processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111652366A (en) Combined neural network model compression method based on channel pruning and quantitative training
CN110135580B (en) Convolution network full integer quantization method and application method thereof
CN111489364B (en) Medical image segmentation method based on lightweight full convolution neural network
CN110874631A (en) Convolutional neural network pruning method based on feature map sparsification
CN112052951B (en) Pruning neural network method, system, equipment and readable storage medium
CN114118402A (en) Self-adaptive pruning model compression algorithm based on grouping attention mechanism
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
CN112329922A (en) Neural network model compression method and system based on mass spectrum data set
CN113222138A (en) Convolutional neural network compression method combining layer pruning and channel pruning
CN109615068A (en) Method and apparatus for quantizing feature vectors in a model
CN110111266B (en) Approximate information transfer algorithm improvement method based on deep learning denoising
CN114139683A (en) Neural network accelerator model quantization method
CN111695624A (en) Data enhancement strategy updating method, device, equipment and storage medium
CN114118406A (en) Quantitative compression method of convolutional neural network
CN114187261A (en) Non-reference stereo image quality evaluation method based on multi-dimensional attention mechanism
CN112651500B (en) Method for generating quantization model and terminal
CN114140641A (en) Image classification-oriented multi-parameter self-adaptive heterogeneous parallel computing method
CN115170902B (en) Training method of image processing model
CN113160081A (en) Depth face image restoration method based on perception deblurring
CN116757255A (en) Method for improving weight reduction of MobileNetV2 distracted driving behavior detection model
CN113554104B (en) Image classification method based on deep learning model
CN114372565B (en) Target detection network compression method for edge equipment
CN116309171A (en) Method and device for enhancing monitoring image of power transmission line
CN113947203A (en) YOLOV3 model pruning method for intelligent vehicle-mounted platform
CN113033804B (en) Convolution neural network compression method for remote sensing image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination