CN113627595A - Probability-based MobileNet V1 network channel pruning method - Google Patents

Probability-based MobileNet V1 network channel pruning method

Info

Publication number
CN113627595A
Authority
CN
China
Prior art keywords
pruning
channel
network
mobilenet
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110903135.9A
Other languages
Chinese (zh)
Other versions
CN113627595B (en)
Inventor
赵汉理
史开杰
潘飞
卢望龙
黄辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenzhou University
Original Assignee
Wenzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenzhou University filed Critical Wenzhou University
Priority to CN202110903135.9A priority Critical patent/CN113627595B/en
Publication of CN113627595A publication Critical patent/CN113627595A/en
Application granted granted Critical
Publication of CN113627595B publication Critical patent/CN113627595B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a probability-based MobileNetV1 network channel pruning method. The method comprises three stages: pre-training, pruning and fusion. Pre-training stage: train with the cross-entropy loss plus an L1 loss on the BN scaling factors to obtain a pre-trained model. Pruning stage: using the properties of the BN and ReLU layers built into the MobileNetV1 network design, compute the probability that the output of each BN channel is less than 0, and prune the channels for which this probability is high. Fusion stage: because the influence of a pruned channel on accuracy usually resides in the offset factor of the BN layer following the depthwise convolution, the invention fuses it into the offset factor of the next BN layer to obtain the final pruned network. Implementing the invention shortens the time needed to obtain a pruned network and reduces the computation of the network, while keeping the accuracy as close as possible to that of the pre-trained network.

Description

Probability-based MobileNet V1 network channel pruning method
Technical Field
The invention relates to the field of neural network pruning algorithms, in particular to a probability-based MobileNet V1 pruning algorithm.
Background
Convolutional Neural Networks (CNNs) have received much attention from industry because they achieve very high recognition and detection accuracy in the field of computer vision. However, the speed of convolutional neural network computation constrains eventual hardware deployment, so how to accelerate neural network computation while achieving high accuracy is a very important problem. MobileNetV1 (see: Howard A G, Zhu M, Chen B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications [J]. arXiv preprint arXiv:1704.04861, 2017.) was an early step in reducing the computational load of neural networks. However, for a given task and a given network, not all channels are important to the output, and channels that have little influence on the final output can be deleted. The currently popular pruning methods judge channel importance using only the scaling factor of the Batch Normalization (BN) layers in the network design, without fully considering the offset factor of the BN layer or the architecture of the neural network; moreover, these pruning methods require three processes, pre-training, pruning and fine-tuning, so the whole pruning pipeline takes a huge amount of time. In view of this, the channel-importance criterion of the present invention considers the scaling factor and the offset factor of BN together with the ReLU layer behind the BN layer, and the computations contained in pruned channels are fused into the bias factors of the following layer so that the fine-tuning step can be removed. The invention uses the mathematical properties of the BN and ReLU layers commonly used in network design to compute the probability that a given channel can be deleted for a given task. Since performance on the task may degrade after a channel is deleted, the invention also proposes offset-factor fusion: for a pruned channel, its contribution to the downstream computation is usually concentrated in the offset factor of the BN, and the computation related to this offset factor is fused into the offset factor of the next BN layer. Compared with a non-fusing method, this yields a pruned MobileNetV1 model with higher accuracy, and the fusion requires no extra parameters. Finally, no fine-tuning stage is needed, which speeds up the whole pruning pipeline.
Disclosure of Invention
The technical problem to be solved by the embodiments of the invention is to provide a probability-based MobileNetV1 network channel pruning method that makes full use of the BN and ReLU layers in the network design and selects channels for pruning from a probabilistic point of view, removing the channels that most deserve pruning; meanwhile, the constants remaining after pruning a depthwise convolution are fused into the offset factor of the next BN layer; and the fine-tuning stage is removed, accelerating the whole pruning algorithm.
In order to solve the above technical problem, an embodiment of the present invention provides a probability-based MobileNetV1 network channel pruning method, including the following steps:
step S1, given a training set and a test set, during the training of MobileNetV1 compute, besides the cross-entropy loss loss_cls between the predicted labels and the true labels, the L1 loss loss_norm of the BN scaling factors; compute gradients with these two loss functions and update the parameters of MobileNetV1, obtaining a pre-trained model;
step S2, given a parameter z ∈ [2,4], define Z = β + z×|γ|, where β and γ are the trainable parameters of the BN layer: the offset factor and the scaling factor, respectively; compute Z for every channel of all BN layers in the pre-trained model of step S1;
step S3, for each Z output by step S2, prune the channel if Z < 0, otherwise do not prune; in MobileNetV1 the basic module is the depthwise-separable convolution, which consists of a depthwise convolution and a pointwise convolution; in the depthwise convolution the input channels and the output channels correspond one-to-one, so a given input channel and its output channel must be pruned or kept together, otherwise the convolution pattern would be broken; for a given pair of input and output channels there are 4 cases: 1) input channel not pruned, output channel not pruned; 2) input channel not pruned, output channel pruned; 3) input channel pruned, output channel not pruned; 4) input channel pruned, output channel pruned; channel pruning is performed on the depthwise convolutions matching cases 2), 3) and 4), obtaining a preliminary pruned MobileNetV1 model;
step S4, in case 3) of step S3 the output channel outputs a constant, and this constant is unaffected by the network input; the related computation result is fused into the offset factor of the next BN layer, which lessens the drop in accuracy of the pruned network, and the fusion adds no extra parameters;
and step S5, output and save the final pruned MobileNetV1 model.
As a further improvement, in step S1 the given training set comprises images and their corresponding class labels. The loss consists of two terms: 1) the cross-entropy loss loss_cls; 2) the L1 loss loss_norm. The total loss is the weighted sum Loss = loss_cls + 10⁻⁵ × loss_norm. The gradients required for back-propagation are computed from the total loss and the parameters of the MobileNetV1 model are updated, yielding the final pre-trained MobileNetV1 model.
As a further improvement, in step S1 a training set D_train = {(Image_i, Label_i) | i ∈ [1, M]} and a test set D_test = {(Image_j, Label_j) | j ∈ [1, N]} are given; Image_i denotes the i-th sample of the training set, Label_i the true label of the i-th training sample, Image_j the j-th sample of the test set, Label_j the true label of the j-th test sample, M the number of samples in D_train and N the number of samples in D_test. The parameters of the given MobileNetV1 network and of the stochastic gradient descent optimizer SGD are initialized; the MobileNetV1 parameters comprise the iteration number q, the network parameters θ_q and the network parameters θ_best of the best model:

θ_q = {(W_l^q, B_l^q) | l ∈ [1, L]}

where l indexes the network layers (L is the number of layers), W denotes the parameters of the corresponding convolution layer, and B denotes the learnable parameters of the BN layer, namely a scaling factor γ and an offset factor β; W_l^q denotes the parameters of the l-th convolution layer at the q-th training iteration, and B_l^q the learnable parameters of the l-th BN layer at the q-th training iteration. The iteration number q is initialized to 1 and incremented by 1 each time, for 150 iterations in total; the network parameters θ_q are initialized to θ_1, and the best-model parameters θ_best are initialized to θ_1. The initialization of the SGD optimizer comprises the learning rate 0.01, the momentum 0.9 and the weight decay coefficient 4×10⁻⁵.
For iteration q, the samples of the training set D_train = {(Image_i, Label_i) | i ∈ [1, M]} are fed into MobileNetV1 for forward computation, yielding the corresponding prediction set P_train = {(Image_i, Predict_i) | i ∈ [1, M]}, where Predict_i denotes the label predicted by MobileNetV1 for training sample Image_i.
According to the preset cross-entropy loss function and L1 loss function, the error between the predicted labels Predict_i and the true labels Label_i of the training set D_train gives the cross-entropy loss value; the scaling factors γ of all BN layers in MobileNetV1 give the L1 loss value; the two are added to give the final loss value, which is back-propagated to adjust the network parameters θ_q of MobileNetV1. The loss consists of: 1) the cross-entropy loss loss_cls; 2) the L1 loss loss_norm, where loss_norm acts only on the BN scaling factors γ. The loss functions are:

loss_cls = −(1/M) × Σ_{i=1}^{M} Label_i × log(Predict_i)

loss_norm = Σ_{b=1}^{A} |γ_b|

Loss = loss_cls + 10⁻⁵ × loss_norm

where Label_i denotes the true label of an image, Predict_i the predicted label output by MobileNetV1, M the number of training samples, γ_b the scaling factor of one channel of one BN layer (in the invention γ and β are scalars; sub- and superscripts are attached only where needed), and A the number of scaling factors over all BN layers in MobileNetV1, i.e. the sum of the channel counts of all BN layers in the network. Loss is the final loss value; it is back-propagated and the parameters θ_q of the MobileNetV1 network are updated.
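As an illustration of the above loss, the following PyTorch-style sketch (the model and tensor names are illustrative, not from the patent) computes Loss = loss_cls + 10⁻⁵ × loss_norm by summing |γ| over the weights of all BN layers:

    import torch
    import torch.nn as nn

    def total_loss(model: nn.Module, logits: torch.Tensor,
                   labels: torch.Tensor, l1_weight: float = 1e-5) -> torch.Tensor:
        # loss_cls: cross entropy between predicted and true labels
        loss_cls = nn.functional.cross_entropy(logits, labels)
        # loss_norm: L1 norm of the scaling factors (gamma == bn.weight) of all BN layers
        loss_norm = sum(bn.weight.abs().sum()
                        for bn in model.modules()
                        if isinstance(bn, nn.BatchNorm2d))
        return loss_cls + l1_weight * loss_norm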
The MobileNetV1 network is evaluated with the test set D_test; if the network parameters θ_q achieve the highest test accuracy so far, set θ_best = θ_q. At the end of each parameter update it is checked whether the number of training iterations has reached the maximum of 150; if so, the training stage ends and the method proceeds to step S2; otherwise training continues with q = q + 1.
The network parameters θ_q are updated as follows:

W_l^{q+1} = W_l^q − η × ∂Loss/∂W_l^q

B_l^{q+1} = B_l^q − η × ∂Loss/∂B_l^q

where W_l^q and B_l^q denote the parameters of the l-th convolution layer and the parameters of the l-th BN layer among the model network parameters of the q-th iteration; W_l^{q+1} and B_l^{q+1} denote the network parameters of the (q+1)-th iteration obtained from the q-th update; η denotes the learning rate, 0.01, among the hyper-parameters; ∂Loss/∂W_l^q and ∂Loss/∂B_l^q denote the gradients of the corresponding convolution-layer parameters and BN-layer parameters, obtained by the chain rule.
The accuracy of the MobileNetV1 network on the test set is computed. The samples of the test set D_test are fed as input to MobileNetV1 and computed layer by layer through the network to obtain the prediction set P_test = {(Image_j, Predict_j) | j ∈ [1, N]}. Taking the true labels Label_j of D_test as reference, the predicted labels Predict_j of P_test are compared one by one with the true labels Label_j of D_test, and the accuracy on D_test is computed. The test accuracy of the current MobileNetV1 parameters θ_q is denoted ACC_q, and the accuracy of the best-model parameters θ_best is denoted ACC_best (initialized to 0); if ACC_q > ACC_best, then θ_best = θ_q. After 150 training iterations the pre-trained MobileNetV1 with parameters θ_best is obtained.
Step S2, given z ∈ [2,4], let Z = β + z×|γ|, where β and γ are the offset factor and the scaling factor of the BN layer, respectively. Z is computed for all BN layers of the MobileNetV1 model parameters θ_best obtained in step S1.
The principle is as follows. The BN layer computes:

x̂ = (x − E(x)) / √(Var(x) + ε)

ŷ = γ × x̂ + β

where x and ŷ denote the input and the output of the BN layer; E(x) and Var(x) are statistics obtained during network training; ε prevents the denominator from being 0 and equals 10⁻⁵; γ and β are called the scaling factor and the offset factor of the BN layer, respectively.
The BN output ŷ has variance γ² and mean β. If Z = β + z×|γ| ≤ 0, then from a probabilistic point of view the output ŷ of the BN layer is less than or equal to 0 with very high probability; conversely, the probability that the BN output is less than 0 is small. Meanwhile, a ReLU layer follows the BN layer:

ReLU(x) = max(0, x)

Therefore, when ŷ is less than or equal to 0 with high probability, the ReLU outputs 0 with the same probability, and network computation whose value is 0 can be pruned; otherwise, no pruning is performed.
For the MobileNetV1 network parameters θ_best saved in step S1, the Z values of all channels of the BN layer in each layer are computed according to Z = β + z×|γ|, and the Z value of channel idx of layer l is stored in the array entry Z_l^idx, giving the array {Z_l^idx}.
Step S3, according to the array {Z_l^idx} obtained in step S2, the parameters θ_best of the MobileNetV1 trained in step S1 are pruned. The depthwise-separable convolution consists of two parts: 1) the depthwise convolution; 2) the pointwise convolution. For the channel pruning of the pointwise convolution, refer to the network slimming method (Liu Z, Li J, Shen Z, et al. Learning efficient convolutional networks through network slimming [C]// Proceedings of the IEEE International Conference on Computer Vision. 2017: 2736-2744.). For the depthwise convolution, the input channels and the output channels correspond one-to-one; as shown in Table 1, under the Z criterion of step S2 there are 4 cases in total: 1) input channel not pruned, output channel not pruned; 2) input channel not pruned, output channel pruned; 3) input channel pruned, output channel not pruned; 4) input channel pruned, output channel pruned.
Table 1: the 4 cases involved in depthwise-convolution pruning

Case  Input channel  Output channel  Action
1)    not pruned     not pruned      keep
2)    not pruned     pruned          prune directly
3)    pruned         not pruned      prune and fuse (step S4)
4)    pruned         pruned          prune directly

In Table 1, for a given pair of input and output channels: if case 1) holds, both channels are important and no pruning is needed; if case 2) holds, then no matter how important the input channel is, the output channel is unimportant and its output is 0, so it is pruned directly; if case 3) holds, the output of the input channel is 0 by the formula of step S2, the input of the output channel is fixed at 0 and the output channel outputs a constant, so it can be pruned, and the invention proceeds to step S4 to preserve the accuracy of the computed values; if case 4) holds, the input channel outputs 0 and the output channel outputs 0, so it is pruned directly. This yields the preliminary pruned MobileNetV1 with parameters θ_p1.
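The case analysis above reduces to two boolean masks per depthwise convolution; the following sketch (mask names are illustrative, not from the patent) keeps a pair only in case 1) and records case-3 positions for the fusion of step S4:

    import torch

    def depthwise_keep_masks(in_keep: torch.Tensor, out_keep: torch.Tensor):
        # in_keep / out_keep: boolean masks (Z >= 0) of the BN layers before and
        # after the depthwise convolution; channels correspond one-to-one.
        keep = in_keep & out_keep        # case 1): keep the channel pair
        case3 = ~in_keep & out_keep      # case 3): prune, then fuse its constant
        # cases 2) and 4) (output channel pruned) are simply pruned.
        return keep, case3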
Step S4, the accuracy on the test set of the pruned MobileNetV1 parameters θ_p1 obtained in step S3 is usually lower than that of the parameters θ_best of step S1. This is because case 3) of Table 1 was pruned directly, which introduces errors between the values computed by the MobileNetV1 network before and after pruning. The invention computes the related values directly and fuses them into the β of the next BN layer, reducing the error between the network's computed values before and after pruning, so that the accuracy of the pruned network on the test set can approach that of θ_best.
The numerical computation involved in case 3) of Table 1 is as follows. For case 3) in Table 1, the output of layer l−1 is the input of layer l; for ease of explanation consider the k-th channel, so x_k^l = 0. The output of layer l is:

ŷ_k^l = γ_k^l × (x_k^l − E(x_k^l)) / √(Var(x_k^l) + ε) + β_k^l = β_k^l − γ_k^l × E(x_k^l) / √(Var(x_k^l) + ε)

where ŷ_k^l is the output of the k-th channel of the l-th BN layer in the MobileNetV1 with parameters θ_best; E(x_k^l) is the mean obtained during training and is a fixed constant in the test phase; Var(x_k^l) is likewise fixed, so ŷ_k^l is also a fixed constant. The constant ŷ_k^l enters the corresponding next-layer pointwise convolution as:

x^{l+1} = Σ_{k∈K1} W_k^{l+1} ∗ ŷ_k^l + Σ_{k∈K3} W_k^{l+1} × ŷ_k^l

where K1 and K3 are the sets of channel positions corresponding to cases 1) and 3) of Table 1; the outputs of cases 2) and 4) of Table 1 are 0 and can be omitted from the formula; W_k^{l+1} denotes the convolution weight, l the layer index and k the channel index (training is complete at this point and we are in the test phase, so the iteration index q of step S1 is dropped). At test time, in Σ_{k∈K3} W_k^{l+1} × ŷ_k^l the weight W_k^{l+1} is a fixed constant and ŷ_k^l is also a fixed constant, so Σ_{k∈K3} W_k^{l+1} × ŷ_k^l is a fixed constant. Meanwhile, the (l+1)-th BN layer computes:

ŷ^{l+1} = γ^{l+1} × (x^{l+1} − E(x^{l+1})) / √(Var(x^{l+1}) + ε) + β^{l+1}

In case 3) of Table 1, the offset factor of the (l+1)-th BN layer without pruning is β^{l+1}. Under pruning, the invention fuses the constants of case 3) into the offset factor of the (l+1)-th BN layer; the new offset factor is

β_fusion^{l+1} = β^{l+1} + γ^{l+1} × (Σ_{k∈K3} W_k^{l+1} × ŷ_k^l) / √(Var(x^{l+1}) + ε)

where the subscript fusion marks the offset factor after fusion, and β_fusion^{l+1} replaces β^{l+1}. Changing the β parameter of the (l+1)-th BN layer in this way achieves, as far as possible, the same numerical results under pruning as the unpruned MobileNetV1 network. Therefore, through the above computation, the BN parameters in θ_p1 are updated with β_fusion^{l+1}, yielding the final pruned MobileNetV1 with parameters θ_p.
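A minimal sketch of this fusion, assuming PyTorch modules for the depthwise BN (layer l), the following pointwise convolution and its BN (layer l+1); the β_fusion formula above is implemented line by line, and the helper name is illustrative:

    import torch
    import torch.nn as nn

    @torch.no_grad()
    def fuse_case3(bn_dw: nn.BatchNorm2d, conv_pw: nn.Conv2d,
                   bn_pw: nn.BatchNorm2d, case3: torch.Tensor) -> None:
        # constant output of a depthwise channel whose input was pruned:
        # y_hat_k = beta_k - gamma_k * E(x_k) / sqrt(Var(x_k) + eps)
        y_hat = bn_dw.bias - bn_dw.weight * bn_dw.running_mean \
                / torch.sqrt(bn_dw.running_var + bn_dw.eps)
        w = conv_pw.weight[:, :, 0, 0]            # 1x1 conv weights, [out, in]
        # C = sum over case-3 channels of W^{l+1}_k * y_hat_k
        C = (w[:, case3] * y_hat[case3]).sum(dim=1)
        # beta_fusion = beta^{l+1} + gamma^{l+1} * C / sqrt(Var^{l+1} + eps)
        bn_pw.bias += bn_pw.weight * C / torch.sqrt(bn_pw.running_var + bn_pw.eps)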
And step S5, save the final pruned network model.
Implementation of the embodiments of the invention has the following beneficial effects:
1. Compared with existing network channel pruning methods, the method reduces the time consumed by the whole pruning algorithm, requiring only two processes: pre-training and pruning. Traditional methods add a third process, fine-tuning, to obtain better results. With only two processes, the invention can remove a considerable amount of computation while keeping accuracy on par with the pre-trained network.
2. Compared with existing network channel pruning methods, the method decides clearly, from a probabilistic point of view, whether a given channel needs to be pruned. Compared with methods based only on the magnitude of parameters, this is more reasonable and more interpretable.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings obtained from them by those skilled in the art without inventive effort also fall within the scope of the present invention.
Fig. 1 is a flowchart of a probability-based MobileNetV1 network channel pruning method according to an embodiment of the present invention;
Detailed Description of the Invention
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, in the embodiment of the present invention, a probability-based MobileNetV1 network channel pruning method is provided, where the method includes the following steps:
Step S1, a training set D_train = {(Image_i, Label_i) | i ∈ [1, M]} and a test set D_test = {(Image_j, Label_j) | j ∈ [1, N]} are given; Image_i denotes the i-th sample of the training set, Label_i the true label of the i-th training sample, Image_j the j-th sample of the test set and Label_j the true label of the j-th test sample. Every Image has dimensions 3 × 224 × 224 (3 is the number of channels, the first 224 the image height and the second 224 the image width; the batch size is ignored here and has no influence on the operations), and every Label has dimension 1000 (1000 is the number of classes to be distinguished; the batch size is again ignored). M denotes the number of samples in D_train and N the number of samples in D_test. The parameters of the given MobileNetV1 network (the network structure is shown in Table 2) and of the stochastic gradient descent optimizer SGD are initialized; the MobileNetV1 parameters comprise the iteration number q, the network parameters θ_q and the network parameters θ_best of the best model:

θ_q = {(W_l^q, B_l^q) | l ∈ [1, L]}

where l indexes the network layers (L is the number of layers), W denotes the parameters of the corresponding convolution layer and B the learnable parameters of the BN layer, namely a scaling factor γ and an offset factor β; W_l^q denotes the parameters of the l-th convolution layer at the q-th training iteration and B_l^q the learnable parameters of the l-th BN layer at the q-th training iteration. The iteration number q is initialized to 1 and incremented by 1 each time, for 150 iterations in total; θ_q is initialized to θ_1 and θ_best to θ_1. The initialization of the SGD optimizer comprises the learning rate 0.01, the momentum 0.9 and the weight decay coefficient 4×10⁻⁵ (weight decay is applied to the convolution parameters but not to the BN parameters).
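A sketch of this optimizer initialization, making the parenthetical note explicit via separate SGD parameter groups (the grouping helper is illustrative; FC weights are assumed to be decayed like convolution weights):

    import torch
    import torch.nn as nn

    def build_sgd(model: nn.Module) -> torch.optim.SGD:
        decay, no_decay = [], []
        for m in model.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                decay += list(m.parameters(recurse=False))
            elif isinstance(m, nn.BatchNorm2d):
                no_decay += list(m.parameters(recurse=False))
        return torch.optim.SGD(
            [{"params": decay, "weight_decay": 4e-5},      # convolution/FC weights
             {"params": no_decay, "weight_decay": 0.0}],   # BN gamma and beta
            lr=0.01, momentum=0.9)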
For iteration q, the samples of the training set D_train = {(Image_i, Label_i) | i ∈ [1, M]} are fed into MobileNetV1 for forward computation, yielding the corresponding prediction set P_train = {(Image_i, Predict_i) | i ∈ [1, M]}, where Predict_i denotes the label predicted by MobileNetV1 for training sample Image_i.
The loss function is computed. According to the preset cross-entropy loss function and L1 loss function, the error between the predicted labels Predict_i and the true labels Label_i of the training set D_train gives the cross-entropy loss value; the scaling factors γ of all BN layers in MobileNetV1 give the L1 loss value; the two are added to give the final loss value, which is back-propagated to adjust the network parameters θ_q of MobileNetV1. The loss consists of: 1) the cross-entropy loss loss_cls; 2) the L1 loss loss_norm, where loss_norm acts only on the BN scaling factors γ. The loss functions are:

loss_cls = −(1/M) × Σ_{i=1}^{M} Label_i × log(Predict_i)

loss_norm = Σ_{b=1}^{A} |γ_b|

Loss = loss_cls + 10⁻⁵ × loss_norm

where Label_i denotes the true label of an image, Predict_i the predicted label output by MobileNetV1, M the number of training samples, γ_b the scaling factor of one channel of one BN layer, and A the number of scaling factors over all BN layers in MobileNetV1, i.e. the sum of the channel counts of all BN layers in the network. Loss is the final loss value; it is back-propagated and the parameters θ_q of the MobileNetV1 network are updated.
The MobileNetV1 network is evaluated with the test set D_test; if the network parameters θ_q achieve the highest test accuracy so far, set θ_best = θ_q. At the end of each parameter update it is checked whether the number of training iterations has reached the maximum of 150; if so, the training stage ends and the method proceeds to step S2; otherwise training continues with q = q + 1.

The network parameters θ_q are updated as follows:

W_l^{q+1} = W_l^q − η × ∂Loss/∂W_l^q

B_l^{q+1} = B_l^q − η × ∂Loss/∂B_l^q

where W_l^q and B_l^q denote the parameters of the l-th convolution layer and the parameters of the l-th BN layer among the model network parameters of the q-th iteration; W_l^{q+1} and B_l^{q+1} denote the network parameters of the (q+1)-th iteration obtained from the q-th update; η denotes the learning rate, 0.01, among the hyper-parameters; ∂Loss/∂W_l^q and ∂Loss/∂B_l^q denote the gradients of the corresponding convolution-layer parameters and BN-layer parameters, obtained by the chain rule.
The accuracy of the MobileNetV1 network on the test set is computed. The samples of the test set D_test are fed as input to MobileNetV1 and computed layer by layer through the network to obtain the prediction set P_test = {(Image_j, Predict_j) | j ∈ [1, N]}. Taking the true labels Label_j of D_test as reference, the predicted labels Predict_j of P_test are compared one by one with the true labels Label_j of D_test, and the accuracy on D_test is computed. The test accuracy of the current MobileNetV1 parameters θ_q is denoted ACC_q, and the accuracy of the best-model parameters θ_best is denoted ACC_best (initialized to 0); if ACC_q > ACC_best, then θ_best = θ_q. After 150 training iterations the pre-trained MobileNetV1 with parameters θ_best is obtained.
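A minimal evaluation sketch matching this paragraph (the loader and model names are placeholders); θ_best is kept as the state dict with the highest test accuracy so far:

    import copy
    import torch

    @torch.no_grad()
    def accuracy(model, test_loader, device="cuda") -> float:
        model.eval()
        correct = total = 0
        for images, labels in test_loader:
            pred = model(images.to(device)).argmax(dim=1)
            correct += (pred == labels.to(device)).sum().item()
            total += labels.numel()
        return correct / total

    # acc_q = accuracy(model, test_loader)
    # if acc_q > acc_best:
    #     acc_best, theta_best = acc_q, copy.deepcopy(model.state_dict())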
Table 2: default BN layer and ReLU layer after each layer of convolution
Figure BDA0003200698500000126
Figure BDA0003200698500000131
In the above table, s1 indicates that the convolution kernel step size is 1, s2 indicates that the convolution kernel step size is 2, dw indicates the depth convolution, and no dw indicates the point-by-point convolution (except the first one indicating the standard convolution).
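One depthwise-separable row pair of Table 2 can be sketched as follows (an assumed PyTorch rendering; channel counts and strides are taken from the corresponding table rows, with BN and ReLU after each convolution as the caption states):

    import torch.nn as nn

    def dw_separable(in_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
        return nn.Sequential(
            # depthwise: groups == in_ch, so input/output channels pair one-to-one
            nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                      groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # pointwise: 1x1 convolution mixing channels
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    # e.g. the first separable block of Table 2: dw_separable(32, 64, stride=1)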
Step S2, given z ∈ [2,4], let Z = β + z×|γ|, where β and γ are the offset factor and the scaling factor of the BN layer, respectively, and both are scalars (i.e. single numbers). Z is computed for all BN layers of the MobileNetV1 model parameters θ_best obtained in step S1.
The principle is as follows. The BN layer computes:

x̂ = (x − E(x)) / √(Var(x) + ε)

ŷ = γ × x̂ + β

where x and ŷ denote the input and the output of the BN layer; E(x) and Var(x) are statistics obtained during network training; ε prevents the denominator from being 0 and equals 10⁻⁵; γ and β are called the scaling factor and the offset factor of the BN layer, respectively.
The BN output ŷ has variance γ² and mean β. If Z = β + z×|γ| ≤ 0, then from a probabilistic point of view the output ŷ of the BN layer is less than or equal to 0 with very high probability; conversely, the probability that the BN output is less than 0 is small. Meanwhile, a ReLU layer follows the BN layer:

ReLU(x) = max(0, x)

Therefore, when ŷ is less than or equal to 0 with high probability, the ReLU outputs 0 with the same probability, and network computation whose value is 0 can be pruned; otherwise, no pruning is performed.
For the MobileNetV1 network parameters θ_best saved in step S1, the Z values of all channels of the BN layer in each layer are computed according to Z = β + z×|γ|, and the Z value of channel idx of layer l is stored in the array entry Z_l^idx, giving the array {Z_l^idx}.
Step S3, according to the array {Z_l^idx} obtained in step S2, the parameters θ_best of the MobileNetV1 trained in step S1 are pruned. The depthwise-separable convolution consists of two parts: 1) the depthwise convolution; 2) the pointwise convolution. For the channel pruning of the pointwise convolution, refer to the network slimming method (Liu Z, Li J, Shen Z, et al. Learning efficient convolutional networks through network slimming [C]// Proceedings of the IEEE International Conference on Computer Vision. 2017: 2736-2744.). For the depthwise convolution, the input channels and the output channels correspond one-to-one; as shown in Table 1, under the Z criterion of step S2 there are 4 cases in total: 1) input channel not pruned, output channel not pruned; 2) input channel not pruned, output channel pruned; 3) input channel pruned, output channel not pruned; 4) input channel pruned, output channel pruned.
In Table 1, for a given pair of input and output channels: if case 1) holds, both channels are important and no pruning is needed; if case 2) holds, then no matter how important the input channel is, the output channel is unimportant and its output is 0, so it is pruned directly; if case 3) holds, the output of the input channel is 0 by the formula of step S2, the input of the output channel is fixed at 0 and the output channel outputs a constant, so it can be pruned, and the invention proceeds to step S4 to preserve the accuracy of the computed values; if case 4) holds, the input channel outputs 0 and the output channel outputs 0, so it is pruned directly. This yields the preliminary pruned MobileNetV1 with parameters θ_p1.
Step S4, the accuracy on the test set of the pruned MobileNetV1 parameters θ_p1 obtained in step S3 is usually lower than that of the parameters θ_best of step S1. This is because case 3) of Table 1 was pruned directly, which introduces errors between the values computed by the MobileNetV1 network before and after pruning. The invention computes the related values directly and fuses them into the β of the next BN layer, reducing the error between the network's computed values before and after pruning, so that the accuracy of the pruned network on the test set can approach that of θ_best.
The numerical computation involved in case 3) of Table 1 is as follows. For case 3) in Table 1, the output of layer l−1 is the input of layer l; for ease of explanation consider the k-th channel, so x_k^l = 0. The output of layer l is:

ŷ_k^l = γ_k^l × (x_k^l − E(x_k^l)) / √(Var(x_k^l) + ε) + β_k^l = β_k^l − γ_k^l × E(x_k^l) / √(Var(x_k^l) + ε)

where ŷ_k^l is the output of the k-th channel of the l-th BN layer in the MobileNetV1 with parameters θ_best; E(x_k^l) is the mean obtained during training and is a fixed constant in the test phase; Var(x_k^l) is likewise fixed, so ŷ_k^l is also a fixed constant scalar (it is eventually fused into the scalar β; because of broadcasting, ŷ_k^l can be treated as a scalar here). The constant ŷ_k^l enters the corresponding next-layer pointwise convolution as:

x^{l+1} = Σ_{k∈K1} W_k^{l+1} ∗ ŷ_k^l + Σ_{k∈K3} W_k^{l+1} × ŷ_k^l

where K1 and K3 are the sets of channel positions corresponding to cases 1) and 3) of Table 1; the outputs of cases 2) and 4) of Table 1 are 0 and can be omitted from the formula; W_k^{l+1} denotes the convolution weight, l the layer index and k the channel index (training is complete at this point and we are in the test phase, so the iteration index q of step S1 is dropped). At test time, in Σ_{k∈K1} W_k^{l+1} ∗ ŷ_k^l the inputs ŷ_k^l change with the network input, so forward computation must be performed in the pruned network just as in the unpruned one; in Σ_{k∈K3} W_k^{l+1} × ŷ_k^l, the weight W_k^{l+1} is a fixed constant and ŷ_k^l is also a fixed constant, so Σ_{k∈K3} W_k^{l+1} × ŷ_k^l is a fixed constant. Meanwhile, the (l+1)-th BN layer computes:

ŷ^{l+1} = γ^{l+1} × (x^{l+1} − E(x^{l+1})) / √(Var(x^{l+1}) + ε) + β^{l+1}

In case 3) of Table 1, the offset factor of the (l+1)-th BN layer without pruning is β^{l+1}. Under pruning, the invention fuses the constants of case 3) into the offset factor of the (l+1)-th BN layer; the new offset factor is

β_fusion^{l+1} = β^{l+1} + γ^{l+1} × (Σ_{k∈K3} W_k^{l+1} × ŷ_k^l) / √(Var(x^{l+1}) + ε)

where the subscript fusion marks the offset factor after fusion, and β_fusion^{l+1} replaces β^{l+1}. Changing the β parameter of the (l+1)-th BN layer in this way achieves, as far as possible, the same numerical results under pruning as the unpruned MobileNetV1 network. Therefore, through the above computation, the BN parameters in θ_p1 are updated with β_fusion^{l+1}, yielding the final pruned MobileNetV1 with parameters θ_p.
And step S5, save the final pruned network model.
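Putting steps S1 to S5 together, a high-level driver might look as follows; the helper names (pretrain, compute_Z, prune_channels, fuse_all_case3) are hypothetical stand-ins for the sketches above, not functions defined by the patent:

    import torch

    def prune_mobilenet_v1(model, train_loader, test_loader, z: float = 3.0):
        theta_best = pretrain(model, train_loader, test_loader)     # step S1
        model.load_state_dict(theta_best)
        masks = {name: (v >= 0)                                     # step S2
                 for name, v in compute_Z(model, z).items()}
        pruned = prune_channels(model, masks)                       # step S3
        fuse_all_case3(pruned, masks)                               # step S4
        torch.save(pruned.state_dict(), "mobilenet_v1_pruned.pth")  # step S5
        return pruned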
The embodiment of the invention has the following beneficial effects:
1. Compared with existing network channel pruning methods, the method reduces the time of the whole algorithm, requiring only two processes: pre-training and pruning. Traditional methods add a third process, fine-tuning, to obtain better results. With only two processes, the invention can remove a considerable amount of computation while keeping accuracy on par with the pre-trained network.
2. Compared with existing network channel pruning methods, the method decides clearly, from a probabilistic point of view, whether a given channel needs to be pruned.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by relevant hardware instructed by a program, and the program may be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc.
While the invention has been described in connection with what is presently considered to be the most practical and preferred embodiment, it is to be understood that the invention is not to be limited to the disclosed embodiment, but on the contrary, is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (2)

1. A probability-based MobileNet V1 network channel pruning method is characterized by comprising the following steps:
step S1, given a training set and a test set, during the training of MobileNetV1 computing, besides the cross-entropy loss loss_cls between the predicted labels and the true labels, the L1 loss loss_norm of the BN scaling factors; computing gradients with these two loss functions and updating the parameters of MobileNetV1, obtaining a pre-trained model;
step S2, given a parameter z ∈ [2,4], defining Z = β + z×|γ|, where β and γ are the trainable parameters of the BN layer: the offset factor and the scaling factor, respectively; computing Z for every channel of all BN layers in the pre-trained model of step S1;
step S3, for each Z output by step S2, pruning the channel if Z < 0 and otherwise not pruning; in MobileNetV1 the basic module is the depthwise-separable convolution, which comprises a depthwise convolution and a pointwise convolution; in the depthwise convolution the input channels and the output channels correspond one-to-one, so a given input channel and its output channel must be pruned or kept together, otherwise the convolution pattern would be broken; for a given pair of input and output channels there are 4 cases: 1) input channel not pruned, output channel not pruned; 2) input channel not pruned, output channel pruned; 3) input channel pruned, output channel not pruned; 4) input channel pruned, output channel pruned; performing channel pruning on the depthwise convolutions matching cases 2), 3) and 4), obtaining a preliminary pruned MobileNetV1 model;
step S4, for case 3) of step S3, the output channel outputs a constant that is unaffected by the network input; fusing the related computation result into the offset factor of the next BN layer, which lessens the drop in accuracy of the pruned network, the fusion adding no extra parameters;
and step S5, outputting and saving the final pruned MobileNetV1 model.
2. The probability-based MobileNetV1 network channel pruning method according to claim 1, wherein in step S1 the given training set comprises images and corresponding class labels; the loss consists of two terms: 1) the cross-entropy loss loss_cls; 2) the L1 loss loss_norm; the total loss is the weighted sum Loss = loss_cls + 10⁻⁵ × loss_norm; the gradients required for back-propagation are computed from the total loss and the parameters of the MobileNetV1 model are updated, obtaining the pre-trained MobileNetV1 model.
CN202110903135.9A 2021-08-06 2021-08-06 Probability-based MobileNet V1 network channel pruning method Active CN113627595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110903135.9A CN113627595B (en) 2021-08-06 2021-08-06 Probability-based MobileNet V1 network channel pruning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110903135.9A CN113627595B (en) 2021-08-06 2021-08-06 Probability-based MobileNet V1 network channel pruning method

Publications (2)

Publication Number Publication Date
CN113627595A true CN113627595A (en) 2021-11-09
CN113627595B CN113627595B (en) 2023-07-25

Family

ID=78383364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110903135.9A Active CN113627595B (en) 2021-08-06 2021-08-06 Probability-based MobileNet V1 network channel pruning method

Country Status (1)

Country Link
CN (1) CN113627595B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764471A (en) * 2018-05-17 2018-11-06 西安电子科技大学 The neural network cross-layer pruning method of feature based redundancy analysis
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining
CN111291806A (en) * 2020-02-02 2020-06-16 西南交通大学 Identification method of label number of industrial product based on convolutional neural network
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training
KR102165273B1 (en) * 2019-04-02 2020-10-13 국방과학연구소 Method and system for channel pruning of compact neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764471A (en) * 2018-05-17 2018-11-06 西安电子科技大学 The neural network cross-layer pruning method of feature based redundancy analysis
CN109635936A (en) * 2018-12-29 2019-04-16 杭州国芯科技股份有限公司 A kind of neural networks pruning quantization method based on retraining
KR102165273B1 (en) * 2019-04-02 2020-10-13 국방과학연구소 Method and system for channel pruning of compact neural networks
CN111291806A (en) * 2020-02-02 2020-06-16 西南交通大学 Identification method of label number of industrial product based on convolutional neural network
CN111652366A (en) * 2020-05-09 2020-09-11 哈尔滨工业大学 Combined neural network model compression method based on channel pruning and quantitative training

Also Published As

Publication number Publication date
CN113627595B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
WO2022141754A1 (en) Automatic pruning method and platform for general compression architecture of convolutional neural network
CN107729999A (en) Consider the deep neural network compression method of matrix correlation
CN113435590B (en) Edge calculation-oriented searching method for heavy parameter neural network architecture
CN111259940A (en) Target detection method based on space attention map
US11610154B1 (en) Preventing overfitting of hyperparameters during training of network
CN112766399B (en) Self-adaptive neural network training method for image recognition
CN112381763A (en) Surface defect detection method
US8626676B2 (en) Regularized dual averaging method for stochastic and online learning
US11574193B2 (en) Method and system for training of neural networks using continuously differentiable models
Hebbal et al. Multi-objective optimization using deep Gaussian processes: application to aerospace vehicle design
CN114139683A (en) Neural network accelerator model quantization method
CN112766603A (en) Traffic flow prediction method, system, computer device and storage medium
CN113627595A (en) Probability-based MobileNet V1 network channel pruning method
CN115599918B (en) Graph enhancement-based mutual learning text classification method and system
CN116681945A (en) Small sample class increment recognition method based on reinforcement learning
CN116579408A (en) Model pruning method and system based on redundancy of model structure
US20200372363A1 (en) Method of Training Artificial Neural Network Using Sparse Connectivity Learning
He et al. GA-based optimization of generative adversarial networks on stock price prediction
Simon et al. Towards a robust differentiable architecture search under label noise
CN114511069A (en) Method and system for improving performance of low bit quantization model
CN111652430A (en) Internet financial platform default rate prediction method and system
Wang et al. Exploring quantization in few-shot learning
US20230325664A1 (en) Method and apparatus for generating neural network
CN112052626B (en) Automatic design system and method for neural network
US11900238B1 (en) Removing nodes from machine-trained network based on introduction of probabilistic noise during training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant