CN112149803A - Channel pruning method suitable for deep neural network - Google Patents

Channel pruning method suitable for deep neural network

Info

Publication number
CN112149803A
CN112149803A CN202011002072.1A
Authority
CN
China
Prior art keywords
layer
pruning
channel
channels
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011002072.1A
Other languages
Chinese (zh)
Inventor
陈彦明
闻翔
施巍松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202011002072.1A priority Critical patent/CN112149803A/en
Publication of CN112149803A publication Critical patent/CN112149803A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of convolutional neural network compression, in particular to a channel pruning method suitable for a deep neural network. The invention can greatly reduce the storage space and the computation of a model without reducing its performance, yielding a lightweight neural network model that has promising applications in edge devices and vehicle-mounted systems.

Description

Channel pruning method suitable for deep neural network
Technical Field
The invention relates to the technical field of convolutional neural network compression, in particular to a channel pruning method suitable for a deep neural network.
Background
Current deep convolutional neural networks have achieved great success in fields such as image classification and target detection, but as network performance improves, network models become deeper and larger, and the required storage space and computation cost keep increasing. This makes deep neural network technology difficult to apply in daily life, especially on resource-constrained devices. In view of this, more and more attention is being paid to making a model occupy as few resources as possible without reducing its performance. To address this problem, the invention adopts a channel pruning method that reduces the size of the model while hardly affecting its performance.
In the field of deep neural network model compression, compression methods can be divided into four categories: weight quantization, tensor decomposition, knowledge distillation, and pruning. In most scenarios, weight quantization and tensor decomposition require special software and hardware support and cannot make use of basic linear algebra subprogram (BLAS) libraries. For knowledge distillation, the structure of the student network is difficult to determine and generally requires specialized manual design. Pruning is gaining increasing attention due to its simple concept and efficient performance.
In view of this, the present invention provides a channel pruning method suitable for a deep neural network.
Disclosure of Invention
The present invention is directed to a channel pruning method for a deep neural network, so as to solve the problems mentioned in the background art.
In order to achieve the purpose, the invention provides the following technical scheme:
a channel pruning method suitable for a deep neural network comprises the following steps:
step 1: in order to make unimportant channels easy to identify, L1 regularization is applied to the convolutional layer weights and the BN layer scaling factors respectively, and the network model is trained sparsely;
step 2: combining the convolutional layer weights and the BN layer scaling factors to obtain the channel importance vector S_l of the l-th convolutional layer (1 ≤ l ≤ L), and combining the channel importance vectors of all layers to obtain the global channel importance vector S;
and step 3: in the pruning process, if all the channels of a certain layer are judged to be unimportant, the pruning operation on that layer is cancelled for the current round, so that all the channels remaining after the previous pruning are retained.
Preferably, when the network model is trained sparsely, the objective function is optimized according to the following formula:
LOSS = L(f(X, W), y) + α1·R(W) + α2·R(γ),
R(W) = Σ_{l=1}^{L} ||W^l||_1,
R(γ) = Σ_{l=1}^{L} ||γ^l||_1,
where X represents the input data set, y represents the corresponding labels, W represents the set of trainable weights of all convolutional layers, W^l represents the weights of the l-th convolutional layer, γ^l denotes the scaling factors of the l-th layer, n_{i+1} is the number of output channels, n_i is the number of input channels, H is the convolution kernel height, G is the convolution kernel width, γ is the set of all BN layer scaling factors with each output channel corresponding to one scaling factor, and α1 and α2 are hyper-parameters that balance the normal LOSS function and the sparse regularization terms. The invention introduces L1 regularization on the convolutional layer weights and the BN layer scaling factors respectively to push unimportant channels toward zero, which makes it convenient to screen out unimportant channels jointly from the weights and the scaling factors and delete them.
Preferably, the single layer channel importance vector may utilize the following formula:
[Equation image: the single-layer channel importance vector S_l, computed from the l-th layer convolutional weights W^l and the BN scaling factors γ^l]
where S_l is a vector with n_{i+1} values, each value representing the importance of one output channel; because the convolutional layer is coupled with its BN layer, judging importance from a single layer alone may mistakenly delete an important channel.
Preferably, pruning evaluates channel importance in a global scope, so it may happen that all channels of a certain layer are evaluated as having low importance and the entire layer would be deleted during pruning, which would destroy the structure of the model; therefore, in the pruning process, if all channels of a layer would be deleted, the pruning operation on that layer is cancelled for the current round, so that all the channels remaining after the previous pruning are retained.
Compared with the prior art, the invention has the following beneficial effects: evaluating channel importance jointly from the convolutional layer weights and the BN layer scaling factors avoids judging importance from a single convolutional layer or BN layer alone, which ignores whether the corresponding scaling factor or channel is important; that is, a channel that appears unimportant in the convolutional layer may still have an important scaling factor, and its output feature map would then be deleted by mistake. The method provided by the invention can greatly reduce the storage space and the computation of the model without reducing its performance, yielding a lightweight neural network model that has promising applications in edge devices and vehicle-mounted systems.
Drawings
FIG. 1 is a display diagram of convolutional layer weight and BN layer scaling factors and channel importance vectors in the present invention;
FIG. 2 is a flow chart of a channel pruning method suitable for a deep neural network according to the present invention;
FIG. 3 is the pruning result of the proposed method of the present invention on the cifar-10 dataset using the VGG19_bn model;
FIG. 4 is the pruning result of the proposed method of the present invention on the cifar-100 dataset using the VGG19_bn model;
FIG. 5 is a diagram of GPU memory footprint during training before and after pruning on the cifar-10 and cifar-100 datasets using the VGG19_bn model in accordance with the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-5, the present invention provides a technical solution:
examples
Considering the pruning process of the VGG19_bn network on the cifar-10 and cifar-100 data sets: VGG19_bn originally has 16 convolutional layers and 3 fully-connected layers. However, in current models the fully-connected layers are generally replaced by a global pooling layer, so the VGG19_bn model to be pruned has 16 convolutional layers and 1 fully-connected layer.
The cifar-10 data set comprises 50000 training pictures and 10000 test pictures in 10 classes; each class has 5000 pictures in the training set and 1000 pictures in the test set. The cifar-100 data set also consists of 50000 training pictures and 10000 test pictures. Unlike cifar-10, cifar-100 has 100 classes, each class having 500 pictures in the training set and 100 pictures in the test set.
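The patent does not name a software framework; the following is a minimal data-loading sketch under the assumption that PyTorch/torchvision is used (the mini-batch size of 64 matches the experiments reported later, while the augmentation choices are illustrative only):

```python
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.RandomCrop(32, padding=4),   # common CIFAR augmentation (assumed, not stated in the patent)
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# swap CIFAR10 for CIFAR100 to reproduce the cifar-100 experiments
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=T.ToTensor())

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)
```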
A channel pruning method suitable for a deep neural network comprises the following steps:
step 1, initializing weight parameters and setting values of hyper-parameters in a model.
Step 2: to make it convenient to identify unimportant channels, L1 regularization is applied to the convolutional layer weights W and the BN layer scaling factors γ respectively, and the network model is trained sparsely.
When the network model is trained by using a sparse mode, the target optimization function is as follows:
LOSS = L(f(X, W), y) + α1·R(W) + α2·R(γ),
R(W) = Σ_{l=1}^{L} ||W^l||_1,
R(γ) = Σ_{l=1}^{L} ||γ^l||_1,
where X represents the input data set, y represents the corresponding labels, W represents the set of trainable weights of all convolutional layers, W^l represents the weights of the l-th convolutional layer, γ^l denotes the scaling factors of the l-th layer, n_{i+1} is the number of output channels, n_i is the number of input channels, H is the convolution kernel height, G is the convolution kernel width, and γ is the set of all BN layer scaling factors, one scaling factor per output channel. α1 and α2 balance the normal LOSS function and the sparse regularization terms. L(f(X, W), y) is the normal LOSS function on the dataset X and R(·) represents the sparse regularization term; L1 regularization is selected in the present invention to push unimportant channels toward zero, in order to better screen out unimportant channels for deletion.
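A minimal sketch of this sparse-training objective, assuming a PyTorch implementation; the function name and the default values of alpha1 and alpha2 (taken from the experiments reported later) are illustrative only:

```python
import torch
import torch.nn as nn

def sparse_loss(model, outputs, targets, alpha1=1e-6, alpha2=1e-4):
    """LOSS = L(f(X, W), y) + alpha1 * R(W) + alpha2 * R(gamma), with R(.) as the L1 norm."""
    ce = nn.functional.cross_entropy(outputs, targets)          # normal loss L(f(X, W), y)
    r_w = sum(m.weight.abs().sum() for m in model.modules()
              if isinstance(m, nn.Conv2d))                      # R(W): L1 on conv weights
    r_gamma = sum(m.weight.abs().sum() for m in model.modules()
                  if isinstance(m, nn.BatchNorm2d))             # R(gamma): L1 on BN scaling factors
    return ce + alpha1 * r_w + alpha2 * r_gamma

# usage inside a training step:
#   loss = sparse_loss(model, model(images), labels)
#   loss.backward(); optimizer.step()
```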
Step 3: combining the convolutional layer weights and the BN layer scaling factors, calculate the channel importance vector S_l of the l-th layer (1 ≤ l ≤ L), and combine the channel importance vectors of all layers to obtain the global channel importance vector S. The channel importance vector is calculated with the following formula:
[Equation images: the per-layer channel importance vector S_l, computed from the l-th layer convolutional weights W^l and the BN scaling factors γ^l, and the global channel importance vector S formed from all S_l]
where S_l is a vector with n_{i+1} values, each value representing the importance of one output channel. After the channel importance has been calculated, the importance of each channel is judged on a global scale.
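The exact per-channel formula appears only as an equation image in the original; the sketch below assumes, for illustration, that each output channel's score is the product of its filter's L1 norm and the absolute value of its BN scaling factor, and that every convolutional layer is immediately followed by a BN layer (PyTorch, hypothetical helper names):

```python
import torch
import torch.nn as nn

def layer_importance(conv, bn):
    """One importance score per output channel of a conv layer (assumed combination:
    L1 norm of the filter multiplied by the absolute BN scaling factor)."""
    w_l1 = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # per-filter L1 norm
    gamma = bn.weight.detach().abs()                       # |gamma| of the following BN layer
    return w_l1 * gamma                                    # S_l, length n_{i+1}

def global_importance(model):
    """Concatenate the per-layer vectors S_l into the global importance vector S."""
    convs = [m for m in model.modules() if isinstance(m, nn.Conv2d)]
    bns = [m for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    per_layer = [layer_importance(c, b) for c, b in zip(convs, bns)]  # assumes conv->BN pairing
    return torch.cat(per_layer), per_layer, list(zip(convs, bns))
```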
Step 4: sort the values of the global channel importance vector S by magnitude to obtain a new channel importance vector S_sort.
Step 5: set the pruning rate P and delete the channels corresponding to the smallest P% of the values in S_sort to obtain a new model.
As shown in fig. 1, the grey channels represent unimportant channels that will be deleted. When an unimportant channel of a convolutional layer is deleted, the corresponding scaling factor in the BN layer and the corresponding output feature map are also deleted, and so is the corresponding input channel of the next convolutional layer. Pruning evaluates channel importance on a global scale, so it may happen that all channels of a certain layer are evaluated as having low importance and the entire layer would be deleted during pruning, which would destroy the structure of the model. The invention addresses this problem in a simple and efficient manner: during pruning, if all channels of a layer would be deleted, the pruning operation on that layer is cancelled, so that the channels remaining after the previous pruning are retained.
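A sketch of steps 4-5 together with this whole-layer safeguard, again assuming PyTorch; `prune_rate` is the pruning rate P expressed as a fraction, and `prev_masks` holds the boolean channel masks left by the previous pruning round (all-ones before the first round):

```python
import torch

def build_masks(per_layer_scores, prev_masks, prune_rate):
    """Steps 4-5 plus the safeguard: sort the global importance vector S, remove the
    channels whose scores fall into the smallest `prune_rate` fraction, but never
    empty a layer completely."""
    s_global = torch.cat(per_layer_scores)
    s_sort, _ = torch.sort(s_global)                          # S_sort, ascending
    threshold = s_sort[int(prune_rate * s_global.numel())]    # smallest P% are pruned
    masks = []
    for s_l, prev in zip(per_layer_scores, prev_masks):
        mask = s_l > threshold                                # True = keep this output channel
        if mask.sum() == 0:                                   # the entire layer would be deleted
            mask = prev.clone()                               # cancel pruning for this layer and
        masks.append(mask)                                    # keep last round's channels instead
    return masks
```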
Step 6: assign values to the parameters of the new model by copying, from the old model, the values of the channels that correspond to the channels kept in the new model.
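A sketch of this weight-copying step for one convolutional layer and its BN layer, assuming PyTorch; `out_idx` and `in_idx` are index tensors of the kept output channels of the current and previous layer (for example obtained from the boolean masks above with `mask.nonzero().flatten()`):

```python
import torch

def copy_conv(old_conv, new_conv, out_idx, in_idx):
    """Copy the kept output channels (and the matching input channels) of the old
    conv layer into the new, narrower conv layer."""
    w = old_conv.weight.data.index_select(0, out_idx)          # kept output channels
    w = w.index_select(1, in_idx)                              # kept input channels
    new_conv.weight.data.copy_(w)
    if old_conv.bias is not None:
        new_conv.bias.data.copy_(old_conv.bias.data.index_select(0, out_idx))

def copy_bn(old_bn, new_bn, out_idx):
    """The BN scaling factors, biases and running statistics follow the kept channels."""
    new_bn.weight.data.copy_(old_bn.weight.data.index_select(0, out_idx))
    new_bn.bias.data.copy_(old_bn.bias.data.index_select(0, out_idx))
    new_bn.running_mean.copy_(old_bn.running_mean.index_select(0, out_idx))
    new_bn.running_var.copy_(old_bn.running_var.index_select(0, out_idx))
```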
Step 7: fine-tune the new model to recover its accuracy.
Compared with one-shot pruning, pruning the network iteratively makes it much easier to recover the accuracy of the model. After each pruning round, the network is fine-tuned to recover its accuracy, and the fine-tuned network is taken as the network to be pruned in the next round. One-shot pruning is more likely to delete important channels of the network model, seriously damaging the accuracy in a way that fine-tuning cannot recover. In contrast, iterative pruning yields a smoother model, does not harm the performance of the network, and can even improve the accuracy after fine-tuning.
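A sketch of the iterative prune-and-fine-tune loop described here, reusing the helpers sketched above; `rebuild_and_copy`, `finetune_fn` and `acc_fn` are hypothetical placeholders for building the slimmer model, the fine-tuning loop and the evaluation routine:

```python
import torch

def iterative_prune(model, prune_rate, rounds, finetune_fn, acc_fn):
    """Steps 4-8: prune a fraction of the channels each round, then fine-tune to
    recover accuracy, instead of removing everything in a single shot."""
    prev_masks = None
    for r in range(rounds):
        _, per_layer_scores, _ = global_importance(model)        # step 3 helper above
        if prev_masks is None:                                    # before the first round
            prev_masks = [torch.ones_like(s, dtype=torch.bool) for s in per_layer_scores]
        masks = build_masks(per_layer_scores, prev_masks, prune_rate)  # steps 4-5 + safeguard
        model = rebuild_and_copy(model, masks)   # hypothetical: build slim model, copy weights (step 6)
        finetune_fn(model)                       # step 7: recover accuracy after each pruning
        print(f"round {r}: accuracy = {acc_fn(model):.2f}")
        prev_masks = masks
    return model
```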
Step 8: repeat steps 4-7 according to how much the accuracy has dropped, to obtain the final lightweight model.
The pruning procedure is shown in FIG. 2.
Experiment of
The effects of the present invention can be further illustrated by the following experiments.
In the experiments, the proposed method is used to prune the VGG19_bn model. All initialized networks are trained from scratch using stochastic gradient descent (SGD), with the weight decay set to 10^-4, the Nesterov momentum set to 0.9, and the initial learning rate set to 0.1. On the CIFAR datasets we train for 160 epochs with a mini-batch size of 64, dividing the learning rate by 10 at epochs 80 and 120. During sparse training, the hyper-parameters α1 and α2 are the terms used to balance the normal LOSS function and the sparse regularization; we set their values empirically, and on VGG we set α1 = 10^-6, α2 = 10^-4.
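A sketch of this training configuration, assuming PyTorch; `make_vgg19_bn` is a hypothetical model constructor standing in for the modified VGG19_bn described above:

```python
import torch

model = make_vgg19_bn(num_classes=10)          # hypothetical constructor for the modified VGG19_bn
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=1e-4)   # SGD, weight decay 10^-4
# 160 epochs in total, learning rate divided by 10 at epochs 80 and 120
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
alpha1, alpha2 = 1e-6, 1e-4                     # sparsity weights used for VGG in the experiments
```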
After the sparse training, the channel importance vectors are obtained and, following steps 4 and 5, the corresponding number of channels is deleted according to the set proportion.
FIG. 3 shows the pruning effect of VGG19_bn on the cifar-10 dataset. It can be seen from the figure that, on cifar-10, when the invention reduces the computation by 48.92%, the accuracy of the model is not only not reduced but improved by 0.07%. When the computation is reduced by 85.50%, only 0.46M (million) parameters are retained, and the accuracy of the model drops by only 0.35%.
FIG. 4 shows the pruning effect of VGG19_bn on the cifar-100 dataset; cifar-100 has more classes and is more difficult to prune than cifar-10. It can be seen from the figure that when the computation is reduced by 42.48%, the accuracy of the model is improved by 0.85%. When the computation is reduced by 62.10%, only 2.49M parameters are retained, and the accuracy of the model drops by only 0.22%.
FIG. 5 shows the GPU memory usage during training before and after pruning. It can be seen that the pruned model occupies fewer memory resources, which is very important for devices with limited resources.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. A channel pruning method suitable for a deep neural network, characterized by comprising the following steps:
step 1: in order to make unimportant channels easy to identify, L1 regularization is applied to the convolutional layer weights and the BN layer scaling factors respectively, and the network model is trained sparsely;
step 2: combining the convolutional layer weights and the BN layer scaling factors to obtain the channel importance vector S_l of the l-th layer (1 ≤ l ≤ L), and combining the channel importance vectors of all layers to obtain the global channel importance vector S;
step 3: in the pruning process, if all the channels of a certain layer are judged to be unimportant, the pruning operation on that layer is cancelled for the current round, so that all the channels remaining after the previous pruning are retained.
2. The channel pruning method suitable for the deep neural network according to claim 1, wherein: when the network model is trained sparsely, the objective function is optimized according to the following formula:
LOSS = L(f(X, W), y) + α1·R(W) + α2·R(γ),
R(W) = Σ_{l=1}^{L} ||W^l||_1,
R(γ) = Σ_{l=1}^{L} ||γ^l||_1,
where X represents the input data set, y represents the corresponding labels, W represents the set of trainable weights of all convolutional layers, W^l represents the weights of the l-th convolutional layer, γ^l denotes the scaling factors of the l-th layer, n_{i+1} is the number of output channels, n_i is the number of input channels, H is the convolution kernel height, G is the convolution kernel width, γ is the set of all BN layer scaling factors with each output channel corresponding to one scaling factor, α1 and α2 are hyper-parameters to balance the normal LOSS function and the sparse regularization terms, L(f(X, W), y) is the normal LOSS function on the dataset X, and R(·) represents a sparse regularization term; L1 regularization is introduced on the convolutional layer weights and the BN layer scaling factors respectively to push unimportant channels toward zero, so that unimportant channels are screened out jointly from the weights and the scaling factors and deleted.
3. The channel pruning method suitable for the deep neural network according to claim 1, wherein: the single layer channel importance vector may utilize the following formula:
[Equation image: the single-layer channel importance vector S_l, computed from the l-th layer convolutional weights W^l and the BN scaling factors γ^l]
where S_l is a vector with n_{i+1} values, each value representing the importance of one output channel; because the convolutional layer is coupled with its BN layer, judging importance from a single layer alone may mistakenly delete an important channel.
4. The channel pruning method suitable for the deep neural network according to claim 1, wherein: pruning evaluates channel importance on a global scale, so it may happen that all channels of a certain layer are evaluated as having low importance and the entire layer would be deleted during pruning, which would destroy the structure of the model; therefore, in the pruning process, if all channels of a layer would be deleted, the pruning operation on that layer is cancelled for the current round, so that all the channels remaining after the previous pruning are retained.
CN202011002072.1A 2020-09-22 2020-09-22 Channel pruning method suitable for deep neural network Pending CN112149803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011002072.1A CN112149803A (en) 2020-09-22 2020-09-22 Channel pruning method suitable for deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011002072.1A CN112149803A (en) 2020-09-22 2020-09-22 Channel pruning method suitable for deep neural network

Publications (1)

Publication Number Publication Date
CN112149803A true CN112149803A (en) 2020-12-29

Family

ID=73896117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011002072.1A Pending CN112149803A (en) 2020-09-22 2020-09-22 Channel pruning method suitable for deep neural network

Country Status (1)

Country Link
CN (1) CN112149803A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837381A (en) * 2021-09-18 2021-12-24 杭州海康威视数字技术股份有限公司 Network pruning method, device, equipment and medium for deep neural network model
CN113837381B (en) * 2021-09-18 2024-01-05 杭州海康威视数字技术股份有限公司 Network pruning method, device, equipment and medium of deep neural network model
CN114154626A (en) * 2021-12-14 2022-03-08 中国人民解放军国防科技大学 Deep neural network filter pruning method based on filter weight comprehensive evaluation
CN114154626B (en) * 2021-12-14 2022-08-16 中国人民解放军国防科技大学 Filter pruning method for image classification task

Similar Documents

Publication Publication Date Title
CN109657156B (en) Individualized recommendation method based on loop generation countermeasure network
CN108765506B (en) Layer-by-layer network binarization-based compression method
CN109002889B (en) Adaptive iterative convolution neural network model compression method
CN111582397B (en) CNN-RNN image emotion analysis method based on attention mechanism
Bhandari et al. Optimal sub-band adaptive thresholding based edge preserved satellite image denoising using adaptive differential evolution algorithm
CN110428045A (en) Depth convolutional neural networks compression method based on Tucker algorithm
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN112016674A (en) Knowledge distillation-based convolutional neural network quantification method
CN112149803A (en) Channel pruning method suitable for deep neural network
CN111415323B (en) Image detection method and device and neural network training method and device
CN112418261B (en) Human body image multi-attribute classification method based on prior prototype attention mechanism
CN111833322B (en) Garbage multi-target detection method based on improved YOLOv3
CN113112446A (en) Tunnel surrounding rock level intelligent judgment method based on residual convolutional neural network
CN114511576B (en) Image segmentation method and system of scale self-adaptive feature enhanced deep neural network
CN109949200B (en) Filter subset selection and CNN-based steganalysis framework construction method
CN112488070A (en) Neural network compression method for remote sensing image target detection
WO2020260656A1 (en) Pruning and/or quantizing machine learning predictors
CN115035418A (en) Remote sensing image semantic segmentation method and system based on improved deep LabV3+ network
CN111582091A (en) Pedestrian identification method based on multi-branch convolutional neural network
CN112183742A (en) Neural network hybrid quantization method based on progressive quantization and Hessian information
Lacey et al. Stochastic layer-wise precision in deep neural networks
CN114283088A (en) Low-dose CT image noise reduction method and device
CN114819143A (en) Model compression method suitable for communication network field maintenance
WO2020165490A1 (en) A method, an apparatus and a computer program product for video encoding and video decoding
CN114882278A (en) Tire pattern classification method and device based on attention mechanism and transfer learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20201229)