CN110619385A - Structured network model compression acceleration method based on multi-stage pruning - Google Patents

Structured network model compression acceleration method based on multi-stage pruning Download PDF

Info

Publication number
CN110619385A
CN110619385A (application number CN201910820048.XA)
Authority
CN
China
Prior art keywords
pruning
network model
filter
layer
sensitivity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910820048.XA
Other languages
Chinese (zh)
Other versions
CN110619385B (en)
Inventor
刘欣刚
吴立帅
钟鲁豪
韩硕
王文涵
代成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910820048.XA priority Critical patent/CN110619385B/en
Publication of CN110619385A publication Critical patent/CN110619385A/en
Application granted granted Critical
Publication of CN110619385B publication Critical patent/CN110619385B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a structured network model compression and acceleration method based on multi-stage pruning, belonging to the technical field of model compression and acceleration. The method comprises the following steps: obtaining a pre-trained model by training the network to a complete initial state; measuring the sensitivity of each convolutional layer and obtaining its sensitivity-pruning rate curve by the control-variable method; performing single-layer pruning in order of sensitivity from low to high, then fine-tuning and retraining the network model; selecting samples as a validation set and measuring the information entropy of each filter's output feature map; performing iterative flexible pruning in order of output entropy, then fine-tuning and retraining the network model; and finally performing hard pruning and retraining the network model to restore its performance, obtaining and storing the lightweight model. The invention can compress large-scale convolutional neural networks while preserving the original network performance, reducing local storage occupation, floating-point operations, and runtime GPU memory usage, thereby achieving a lightweight network.

Description

Structured network model compression acceleration method based on multi-stage pruning
Technical Field
The invention relates to the technical field of model compression and acceleration, in particular to a structured network model compression and acceleration method based on multi-stage pruning.
Background
Deep convolutional neural networks have been widely and successfully applied in fields such as computer vision and natural language processing. As attention to convolutional neural networks has grown, networks with ever more layers and increasingly complex structures have emerged in rapid succession, deep convolutional neural networks have been applied to more and more research fields, and higher requirements have been placed on the development of hardware devices.
While deep learning has developed rapidly, hardware capability has not improved at the same pace; the progress of convolutional neural networks today depends on increases in the computing power and storage space of computing devices, in particular the parallel computing power of graphics processors. Running a neural network on mobile embedded devices is very difficult, because it consumes a large amount of storage space and generates enormous numbers of floating-point operations. Taking the classic VGG-16 network as an example, recognizing a single 224 × 224 color image requires over 130 million parameters for the original network alone, occupying more than 520 MB of storage space; one forward pass additionally occupies nearly 13 MB of storage for intermediate feature maps and performs over 30.9 billion floating-point operations. Such enormous cost severely restricts the application of convolutional neural networks on embedded devices.
Much research in recent years has shown that neural networks in fact contain huge numbers of redundant parameters, i.e. they are over-parameterized, leaving large room for optimization in actual deployment and thereby demonstrating the practical feasibility of model compression. Model pruning has been widely studied as an efficient and highly general model compression method, but the compression achieved by existing pruning methods is very limited: many parameter-level pruning algorithms cannot obtain actual storage compression or reduced computation, and many filter-level pruning algorithms struggle to achieve both parameter reduction and real network acceleration. It is therefore important to design an efficient structured network model compression algorithm.
Disclosure of Invention
The invention aims, in view of the above problems, to provide a more efficient model compression and acceleration method with stronger domain adaptability.
The invention relates to a structured network model compression accelerating method based on multi-stage pruning, which comprises the following steps:
s1: acquiring a pre-training model, and training an original network model to be processed on a training data set to obtain a complete network model;
s2: measuring the sensitivity of the convolutional layers of the original network model based on a pre-training model, and obtaining a sensitivity-pruning rate change curve of each convolutional layer by a control variable method;
s3: carrying out sensitivity interlayer iteration pruning, carrying out single-layer pruning on the current network model according to the sensitivity sequence from low to high, and finely adjusting the network model;
s4: measuring the importance index of the filter, selecting a sample as a verification set, and measuring the information entropy of the filter output characteristic diagram of the current network model, namely the output image entropy;
s5: performing iterative flexible pruning on the current network model according to the magnitude sequence of the entropy of the output images, and finely adjusting a retraining model;
s6: and (4) hard pruning, and performing retraining on the current network model to obtain and store the lightweight model.
Wherein, step S1 includes the following steps:
s11: initializing original network parameters of a network model to be processed;
s12: and pre-training on the training set to obtain a complete network model.
Wherein, step S2 includes the following steps:
s21: setting a maximum pruning rate range and a pruning rate increase step;
s22: performing layer-by-layer sensitivity calculation on the convolutional layer by using a control variable method to obtain a sensitivity coefficient S of the ith convolutional layeri
Si≡Acc(L,0)-Acc(L,-i),1≤i≤L
Acc (L,0) represents the recognition rate of the original network model on the test data set, Acc (L, -i) represents the recognition rate on the test data set after the filter of the ith convolution layer is deleted according to a certain ratio and the non-ith convolution layer is kept unchanged;
s23: and establishing a corresponding relation between the sensitivity sequence of each convolutional layer under the current set pruning rate and the pruning rate to obtain a sensitivity-pruning rate change curve of the convolutional layer.
Wherein, step S3 includes the following steps:
s31: calculating the F norm W of each filter of each convolution layeri,j||F
Wherein, wi,jJ-th filter, w, representing the ith convolutional layeri,j(c,k1,k2) (k) th representing a two-dimensional parameter matrix on the c-th channel in the jth filter1,k2) A parameter value;
s32: performing single-layer hard pruning according to a sensitivity order, permanently deleting the filters determined to be deleted from the current network model, deleting the failed output characteristic channels according to the corresponding relation between the filters and the output characteristic channels, and then executing the same operation on the next convolution layer until all the convolution layers are traversed;
s33: and (4) loading the residual network parameters to the model after pruning, and carrying out fine tuning retraining on the training data set.
Wherein, step S4 includes the following steps:
s41: randomly sampling a training data set to construct a verification set;
s42: for the remaining filters (filters of the current network model), one forward propagation on the validation set, the output image entropy of each filter is calculated:
wherein Ei,jEntropy of the output image of the jth filter for the ith convolutional layer, pk,lRepresents a pixel pair with a central pixel of k and a neighborhood pixel of l in the feature map, Hi,j[s][t]And (d) a parameter value representing a (s, t) position in an output characteristic diagram of the jth filter of the ith convolutional layer.
S43: and carrying out logarithmic normalization analysis on the output image entropy, and establishing the corresponding relation between the output image entropy of each filter under the current pruning rate and the pruning rate.
Wherein, step S5 includes the following steps:
s51: according to the entropy sorting of the output images of the filters, the pruning priority order of the filters between single layers is determined: the smaller the entropy of the output image of the filter is, the higher the pruning priority of the filter is;
performing flexible pruning layer by layer on the current network model, temporarily zeroing a filter to be deleted, temporarily zeroing a failed output characteristic channel according to the corresponding relation between the filter and the output characteristic channel, and then executing the same operation on the next convolution layer;
s52: and (4) loading the residual network parameters of the network with sparse filter stages, and performing fine tuning retraining on a training data set.
Wherein, step S6 includes the following steps:
s61: acquiring the sparsity of each filter of the filter-level sparse network;
s62: and sequencing the sparsity of each filter of the current network model, and deleting the filters with corresponding ratios according to the target pruning rate.
S63: and loading the rest network parameters, performing retraining to improve the network performance, and storing the structure and the parameters of the final lightweight network model.
To evaluate the overall performance of the multi-level structured model compression and acceleration algorithm, the number of floating-point operations generated by one forward propagation through the original network and through the pruned network can be counted to assess the acceleration effect, and parameter statistics can be computed for the new network structure to assess the compression effect.
In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:
the deep convolutional neural network model is compressed by a multi-level structured pruning method, the limitation of the existing neural network on embedded edge equipment is considered, the original network is improved by adopting a filter pruning method, on the basis of keeping the performance of the original network, the storage space occupied by network parameters is reduced to the maximum extent, the video memory occupied by a middle activation layer during the operation of the network is reduced, the floating point operation times in the forward propagation process are reduced, the operation efficiency of the network is improved, and the aim of lightening the network is fulfilled. The invention can effectively reduce the parameter redundancy of large-scale deep convolution, expand the application scene on the neural network edge equipment and reduce the hardware dependence.
Drawings
Fig. 1 is a flowchart of the iterative pruning method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of first-stage sensitivity-based hard pruning according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of second-stage image-entropy-based flexible pruning according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
Referring to fig. 1, the method for compressing and accelerating the structured network model based on the multi-level pruning, provided by the invention, comprises the following specific implementation steps:
s1: acquiring a pre-training model, and training an original network model to be processed on a training data set to obtain a complete network model;
s2: measuring the sensitivity of the convolutional layers of the original network model based on a pre-training model, and obtaining a sensitivity-pruning rate change curve of each convolutional layer by a control variable method;
s3: carrying out sensitivity interlayer iteration pruning, carrying out single-layer pruning on the current network model according to the sensitivity sequence from low to high, and finely adjusting the network model;
s4: measuring the importance index of the filter, selecting a sample as a verification set, and measuring the information entropy of the filter output characteristic diagram of the current network model, namely the output image entropy;
s5: performing iterative flexible pruning on the current network model according to the magnitude sequence of the entropy of the output images, and finely adjusting a retraining model;
s6: and (4) hard pruning, and performing retraining on the current network model to obtain and store the lightweight model.
The invention improves upon existing pruning algorithms that proceed sequentially, layer by layer, from the first to the last convolutional layer: the pruning sensitivity of each convolutional layer is evaluated, and because a convolutional neural network is a globally interconnected system, the influence of each filter on network performance is considered in a global sense. Starting from the original network, filters are gradually deleted from a single convolutional layer while all other variables are held constant, and the resulting drops in recognition rate on the same data set are compared across convolutional layers; the drop in recognition rate is then taken as the reference for defining the filter importance index (sensitivity). Sensitivity analysis is performed on the convolutional layers one by one using the control-variable method, and the sensitivity of the i-th convolutional layer is defined as:
S_i ≡ Acc(L, 0) − Acc(L, −i), 1 ≤ i ≤ L
where S_i is the sensitivity coefficient of the i-th convolutional layer, Acc(L, 0) is the recognition rate of the original model, and Acc(L, −i) is the recognition rate on the test data set after the filters of the i-th convolutional layer are deleted at a given ratio while all other layers are kept unchanged.
Because the invention follows a greedy pruning strategy, pruning a filter removes the corresponding feature channel of that convolutional layer's output feature map, and the loss of that channel causes the corresponding channel of every filter in the next convolutional layer to become invalid; these invalid parameters can be ignored when evaluating filter importance. The pruning process is therefore always accompanied by a large amount of cross-layer pruning, and the finally achieved pruning rate can be greater than the preset pruning rate.
The higher the sensitivity of a convolutional layer, the greater the performance degradation when it is pruned, indicating that its filters are more important. In the initial iterations a larger pruning rate is chosen and low-sensitivity layers are pruned first, because pruning low-sensitivity convolutional layers damages network performance less and the loss can be recovered quickly in the early iterations; the high-sensitivity layers are then pruned later, on a model that is still equivalent in performance to the original network. This ordering is the preferred choice when both network performance and the iterative pruning rate are taken into account. After the network scale has been gradually reduced, the pruning rate is lowered and pruning proceeds in small increments: filters of higher sensitivity are pruned and performance is restored by longer iterative retraining. The first-stage pruning adopts a hard pruning strategy, because the purpose of the first stage is to approach the target pruning rate quickly, rapidly deleting the most redundant filters in a relatively coarse manner.
Fig. 2 shows a schematic diagram of first-stage sensitivity hard pruning. The process has two phases: hard pruning in order of convolutional-layer sensitivity, and fine-tuning retraining of the model. The filters to be deleted are determined by the sensitivity analysis, the filters and the corresponding convolutional channels are deleted, and if the overall pruning rate of the current network has not reached the initial target pruning rate, the next pruning iteration begins.
Fine pruning is then performed to further increase the pruning rate, slowly deleting the filters with weaker functionality. Fig. 3 shows a schematic diagram of second-stage image-entropy flexible pruning; this second, fine pruning stage is carried out once first-stage pruning reaches a bottleneck in increasing the pruning rate.
The invention provides a more accurate filter importance measure: the entropy (information content) of the output two-dimensional image. Traditional filter importance criteria are usually based on the nuclear norm of the filter or the sparsity of the filter's output feature map; most of these ideas consider the filter's influence on the loss function from a mathematical point of view and pay little attention to the filter's essential function. Inspired by the image information entropy of classical digital image processing, the invention considers the amount of information in the feature map extracted by each filter, which is directly related to the filter's essential function. The larger the entropy value, the better the filter acts as a feature selector, i.e. the lower its pruning priority during pruning. The filter importance index is defined as the two-dimensional image entropy of the output feature map (which reflects both the gray-level distribution and the spatial information among pixels), and the pruning order of the filters in each layer is determined by the normalized entropy. Compared with the F-norm and feature-sparsity distributions, the resulting histograms of output entropy for the filters of each convolutional layer show better discriminability.
The output image entropy of each filter is defined as follows:
E_{i,j} = − Σ_k Σ_l p_{k,l} · log2 p_{k,l}
where E_{i,j} is the output image entropy of the j-th filter of the i-th convolutional layer, p_{k,l} is the frequency of pixel pairs in the feature map whose central pixel value is k and whose neighborhood pixel value is l, and H_{i,j}[s][t] is the value at position (s, t) of the output feature map of the j-th filter of the i-th convolutional layer, from which these pixel pairs are counted.
Finally, the relevant performance indicators of the invention include an acceleration analysis of the new network structure. The total number of floating-point operations of the original network is
Flops = Σ_i 2 · K_i² · N_{i−1} · N_i · W_i · H_i
where Flops denotes the total number of floating-point operations in the original network, counting both floating-point additions and multiplications; K_i is the convolution kernel size; and N_i, W_i and H_i are the number of channels of the i-th intermediate feature map (equal to the number of filters of the convolutional layer that produces it) and its length and width, with subscripts distinguishing the different convolutional layers;
the calculation mode of the total floating point operation times in the pruned lightweight network is as follows:
wherein Flops represents the total floating-point operation times in the pruned lightweight network, including floating-point addition operation and floating-point multiplication operation. PiRepresenting the pruning rate of the final filter stage of the ith convolutional layer. In actual operation, the measurement formula is also suitable for a full connection layer, and only K is required to be 1;
performing parameter level compression ratio analysis on the new network structure:
wherein P represents the compression rate of the parameter level after pruning, and the numerator denominator item respectively represents the number of each convolution layer filter before and after pruning.
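The acceleration and compression statistics above can be computed with a few lines of Python. The layer descriptions and example numbers below are purely illustrative assumptions, and bias, activation and pooling operations are ignored.

def conv_flops(layers):
    """layers: list of (in_channels, out_channels, kernel_size, out_h, out_w).
    The factor 2 counts both floating-point multiplications and additions."""
    return sum(2 * cin * k * k * cout * h * w for cin, cout, k, h, w in layers)

def compression_rate(filters_before, filters_after):
    """Parameter-level compression rate P from per-layer filter counts."""
    return sum(filters_before) / sum(filters_after)

# Example with two hypothetical convolutional layers before and after pruning:
before = [(3, 64, 3, 224, 224), (64, 128, 3, 112, 112)]
after  = [(3, 32, 3, 224, 224), (32, 64, 3, 112, 112)]
print(conv_flops(before) / conv_flops(after))   # floating-point operation acceleration factor
print(compression_rate([64, 128], [32, 64]))    # parameter-level compression rate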
Feasibility tests of the invention were carried out on three widely used convolutional neural networks (LeNet-5, AlexNet and VGG-16). The experimental results show that the proposed multi-level structured pruning scheme can effectively compress the original networks while maintaining their performance. With essentially unchanged recognition rate on the test data set, the multi-level structured pruning method achieves a pruning rate of over 60% and a 5.6× acceleration in floating-point operations on LeNet-5; on AlexNet it achieves a 94% pruning rate and a 117.5× acceleration in floating-point operations, with a convolutional-layer parameter compression ratio of up to 192.3×; on VGG-16 it achieves a 78.6% pruning rate, reduces floating-point operations by 54.5%, and reduces runtime GPU memory usage by 31.9%. The model compression method of the invention can thus quickly recover the original performance of the network; for common large networks it effectively reduces storage space and floating-point operations and lowers the dependence on hardware.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (6)

1. The structured network model compression acceleration method based on multi-stage pruning is characterized by comprising the following steps:
s1: obtaining a pre-training model:
training an original network model to be processed on a training data set to obtain a complete network model;
s2: measuring the sensitivity of the convolutional layers of the original network model based on a pre-training model, and obtaining a sensitivity-pruning rate change curve of each convolutional layer by a control variable method;
s3: sensitivity interlayer iterative pruning:
carrying out single-layer pruning on the current network model according to the sensitivity sequence from low to high, and finely adjusting the network model;
s4: measurement filter importance index:
selecting a sample as a verification set, and measuring the information entropy of a filter output characteristic diagram of the current network model, namely the output image entropy;
s5: iterative pruning of image entropy:
performing iterative flexible pruning on the current network model according to the magnitude sequence of the entropy of the output images, and finely adjusting a retraining model;
s6: and (4) hard pruning, and performing retraining on the current network model to obtain and store the lightweight model.
2. The method for compressing and accelerating the structured network model based on multi-level pruning according to claim 1, wherein the step S1 comprises the following steps:
s11: initializing original network parameters of a network model to be processed;
s12: and pre-training on the training set to obtain a complete network model.
3. The method for compressing and accelerating the structured network model based on multi-level pruning according to claim 1, wherein the step S2 comprises the following steps:
s21: setting a maximum pruning rate range and a pruning rate increase step;
s22: performing layer-by-layer sensitivity calculation on the convolutional layer by using a control variable method to obtain a sensitivity coefficient S of the ith convolutional layeri
Si≡Acc(L,0)-Acc(L,-i),1≤i≤L
Acc (L,0) represents the recognition rate of the original network model on the test data set, Acc (L, -i) represents the recognition rate on the test data set after the filter of the ith convolution layer is deleted according to a certain ratio and the non-ith convolution layer is kept unchanged;
s23: and establishing a corresponding relation between the sensitivity sequence of each convolutional layer under the current set pruning rate and the pruning rate to obtain a sensitivity-pruning rate change curve of the convolutional layer.
4. The method for compressing and accelerating the structured network model based on multi-level pruning according to claim 1, wherein the step S3 comprises the following steps:
s31: calculating the F norm W of each filter of each convolution layeri,j||F
Wherein, wi,jJ-th filter, w, representing the ith convolutional layeri,j(c,k1,k2) (k) th representing a two-dimensional parameter matrix on the c-th channel in the jth filter1,k2) A parameter value;
s32: performing single-layer hard pruning according to a sensitivity order, permanently deleting the filters determined to be deleted from the current network model, deleting the failed output characteristic channels according to the corresponding relation between the filters and the output characteristic channels, and then executing the same operation on the next convolution layer until all the convolution layers are traversed;
s33: and (4) loading the residual network parameters to the model after pruning, and carrying out fine tuning retraining on the training data set.
5. The method for compressing and accelerating the structured network model based on multi-level pruning according to claim 1, wherein the step S4 comprises the following steps:
s41: randomly sampling a training data set to construct a verification set;
s42: for the filters of the current network model, carrying out forward propagation on the verification set once, and calculating the output image entropy of each filter:
wherein Ei,jEntropy of the output image of the jth filter for the ith convolutional layer, pk,lRepresents a pixel pair with a central pixel of k and a neighborhood pixel of l in the feature map, Hi,j[s][t]And (d) a parameter value representing a (s, t) position in an output characteristic diagram of the jth filter of the ith convolutional layer.
S43: and carrying out logarithmic normalization analysis on the output image entropy, and establishing the corresponding relation between the output image entropy of each filter under the current pruning rate and the pruning rate.
6. The method for compressing and accelerating the structured network model based on multi-level pruning according to claim 1, wherein the step S5 comprises the following steps:
s51: according to the entropy sorting of the output images of the filters, the pruning priority order of the filters between single layers is determined: the smaller the entropy of the output image of the filter is, the higher the pruning priority of the filter is;
performing flexible pruning layer by layer on the current network model, temporarily zeroing a filter to be deleted, temporarily zeroing a failed output characteristic channel according to the corresponding relation between the filter and the output characteristic channel, and then executing the same operation on the next convolution layer;
s52: and (4) loading the residual network parameters of the network with sparse filter stages, and performing fine tuning retraining on a training data set.
CN201910820048.XA 2019-08-31 2019-08-31 Structured network model compression acceleration method based on multi-stage pruning Expired - Fee Related CN110619385B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910820048.XA CN110619385B (en) 2019-08-31 2019-08-31 Structured network model compression acceleration method based on multi-stage pruning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910820048.XA CN110619385B (en) 2019-08-31 2019-08-31 Structured network model compression acceleration method based on multi-stage pruning

Publications (2)

Publication Number Publication Date
CN110619385A true CN110619385A (en) 2019-12-27
CN110619385B CN110619385B (en) 2022-07-29

Family

ID=68922910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910820048.XA Expired - Fee Related CN110619385B (en) 2019-08-31 2019-08-31 Structured network model compression acceleration method based on multi-stage pruning

Country Status (1)

Country Link
CN (1) CN110619385B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180046915A1 (en) * 2016-08-12 2018-02-15 Beijing Deephi Intelligence Technology Co., Ltd. Compression of deep neural networks with proper use of mask
CN109711528A (en) * 2017-10-26 2019-05-03 北京深鉴智能科技有限公司 Based on characteristic pattern variation to the method for convolutional neural networks beta pruning
CN109657780A (en) * 2018-06-15 2019-04-19 清华大学 A kind of model compression method based on beta pruning sequence Active Learning
CN109711532A (en) * 2018-12-06 2019-05-03 东南大学 A kind of accelerated method inferred for hardware realization rarefaction convolutional neural networks
CN109886397A (en) * 2019-03-21 2019-06-14 西安交通大学 A kind of neural network structure beta pruning compression optimization method for convolutional layer
CN110059823A (en) * 2019-04-28 2019-07-26 中国科学技术大学 Deep neural network model compression method and device
CN110097178A (en) * 2019-05-15 2019-08-06 电科瑞达(成都)科技有限公司 It is a kind of paid attention to based on entropy neural network model compression and accelerated method
CN110119811A (en) * 2019-05-15 2019-08-13 电科瑞达(成都)科技有限公司 A kind of convolution kernel method of cutting out based on entropy significance criteria model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIAN-HAO LUO ET AL: "ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression", 《ARXIV》 *
彭冬亮等: "基于GoogLeNet模型的剪枝算法", 《控制与决策》 *
马治楠等: "基于深层卷积神经网络的剪枝优化", 《电子技术应用》 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079691A (en) * 2019-12-27 2020-04-28 中国科学院重庆绿色智能技术研究院 Pruning method based on double-flow network
CN111382581A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 One-time pruning compression method in machine translation
CN111401516B (en) * 2020-02-21 2024-04-26 华为云计算技术有限公司 Searching method for neural network channel parameters and related equipment
CN111367657A (en) * 2020-02-21 2020-07-03 重庆邮电大学 Computing resource collaborative cooperation method based on deep reinforcement learning
CN111401516A (en) * 2020-02-21 2020-07-10 华为技术有限公司 Neural network channel parameter searching method and related equipment
CN111340225A (en) * 2020-02-28 2020-06-26 中云智慧(北京)科技有限公司 Deep convolution neural network model compression and acceleration method
CN111507224B (en) * 2020-04-09 2022-08-30 河海大学常州校区 CNN facial expression recognition significance analysis method based on network pruning
CN111507224A (en) * 2020-04-09 2020-08-07 河海大学常州校区 CNN facial expression recognition significance analysis method based on network pruning
CN111563455A (en) * 2020-05-08 2020-08-21 南昌工程学院 Damage identification method based on time series signal and compressed convolution neural network
CN111667054A (en) * 2020-06-05 2020-09-15 北京百度网讯科技有限公司 Method and device for generating neural network model, electronic equipment and storage medium
CN111667054B (en) * 2020-06-05 2023-09-01 北京百度网讯科技有限公司 Method, device, electronic equipment and storage medium for generating neural network model
CN111881828A (en) * 2020-07-28 2020-11-03 浙江大学 Obstacle detection method for mobile terminal equipment
CN111881828B (en) * 2020-07-28 2022-05-06 浙江大学 Obstacle detection method for mobile terminal equipment
CN112101547B (en) * 2020-09-14 2024-04-16 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112101547A (en) * 2020-09-14 2020-12-18 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112183725A (en) * 2020-09-27 2021-01-05 安徽寒武纪信息科技有限公司 Method of providing neural network, computing device, and computer-readable storage medium
CN112183725B (en) * 2020-09-27 2023-01-17 安徽寒武纪信息科技有限公司 Method of providing neural network, computing device, and computer-readable storage medium
CN112508187A (en) * 2020-10-22 2021-03-16 联想(北京)有限公司 Machine learning model compression method, device and equipment
CN112464810A (en) * 2020-11-25 2021-03-09 创新奇智(合肥)科技有限公司 Smoking behavior detection method and device based on attention map
CN112561054A (en) * 2020-12-03 2021-03-26 中国科学院光电技术研究所 Neural network filter pruning method based on batch characteristic heat map
CN112488297A (en) * 2020-12-03 2021-03-12 深圳信息职业技术学院 Neural network pruning method, model generation method and device
CN112488297B (en) * 2020-12-03 2023-10-13 深圳信息职业技术学院 Neural network pruning method, model generation method and device
CN112733925A (en) * 2021-01-04 2021-04-30 国网山东省电力公司枣庄供电公司 Method and system for constructing light image classification network based on FPCC-GAN
CN112766452A (en) * 2021-01-05 2021-05-07 同济大学 Dual-environment particle swarm optimization method and system
CN112734036B (en) * 2021-01-14 2023-06-02 西安电子科技大学 Target detection method based on pruning convolutional neural network
CN112734036A (en) * 2021-01-14 2021-04-30 西安电子科技大学 Target detection method based on pruning convolutional neural network
CN113128664A (en) * 2021-03-16 2021-07-16 广东电力信息科技有限公司 Neural network compression method, device, electronic equipment and storage medium
CN112884149A (en) * 2021-03-19 2021-06-01 华南理工大学 Deep neural network pruning method and system based on random sensitivity ST-SM
CN112884149B (en) * 2021-03-19 2024-03-22 华南理工大学 Random sensitivity ST-SM-based deep neural network pruning method and system
WO2022217704A1 (en) * 2021-04-12 2022-10-20 平安科技(深圳)有限公司 Model compression method and apparatus, computing device and storage medium
CN112927173A (en) * 2021-04-12 2021-06-08 平安科技(深圳)有限公司 Model compression method and device, computing equipment and storage medium
CN113011588B (en) * 2021-04-21 2023-05-30 华侨大学 Pruning method, device, equipment and medium of convolutional neural network
CN113011588A (en) * 2021-04-21 2021-06-22 华侨大学 Pruning method, device, equipment and medium for convolutional neural network
WO2023030513A1 (en) * 2021-09-05 2023-03-09 汉熵通信有限公司 Internet of things system
CN113837381A (en) * 2021-09-18 2021-12-24 杭州海康威视数字技术股份有限公司 Network pruning method, device, equipment and medium for deep neural network model
CN113837381B (en) * 2021-09-18 2024-01-05 杭州海康威视数字技术股份有限公司 Network pruning method, device, equipment and medium of deep neural network model
CN113837284A (en) * 2021-09-26 2021-12-24 天津大学 Double-branch filter pruning method based on deep learning
CN113837284B (en) * 2021-09-26 2023-09-15 天津大学 Double-branch filter pruning method based on deep learning
CN114049514A (en) * 2021-10-24 2022-02-15 西北工业大学 Image classification network compression method based on parameter reinitialization
CN114049514B (en) * 2021-10-24 2024-03-19 西北工业大学 Image classification network compression method based on parameter reinitialization
CN115099400A (en) * 2022-03-14 2022-09-23 北京石油化工学院 Poisson distribution-based neural network hybrid differential pruning method and device
CN115099400B (en) * 2022-03-14 2024-08-06 北京石油化工学院 Neural network hybrid differential pruning method and pruning device based on poisson distribution
CN114998648A (en) * 2022-05-16 2022-09-02 电子科技大学 Performance prediction compression method based on gradient architecture search
CN118228842A (en) * 2024-05-22 2024-06-21 北京灵汐科技有限公司 Data processing method, data processing device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110619385B (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN110619385B (en) Structured network model compression acceleration method based on multi-stage pruning
CN112016674B (en) Knowledge distillation-based convolutional neural network quantification method
CN114037844B (en) Global rank perception neural network model compression method based on filter feature map
CN110909667B (en) Lightweight design method for multi-angle SAR target recognition network
CN109308696B (en) No-reference image quality evaluation method based on hierarchical feature fusion network
CN108960314B (en) Training method and device based on difficult samples and electronic equipment
CN113159173A (en) Convolutional neural network model compression method combining pruning and knowledge distillation
CN112052951B (en) Pruning neural network method, system, equipment and readable storage medium
CN111079899A (en) Neural network model compression method, system, device and medium
CN112668630B (en) Lightweight image classification method, system and equipment based on model pruning
CN113420651A (en) Lightweight method and system of deep convolutional neural network and target detection method
CN112101487B (en) Compression method and device for fine-grained recognition model
CN113255910A (en) Pruning method and device for convolutional neural network, electronic equipment and storage medium
CN113837940A (en) Image super-resolution reconstruction method and system based on dense residual error network
CN110059823A (en) Deep neural network model compression method and device
Huang et al. Compressing multidimensional weather and climate data into neural networks
CN114819061A (en) Sparse SAR target classification method and device based on transfer learning
CN111401140B (en) Offline learning method of intelligent video monitoring system in edge computing environment
CN116453096A (en) Image foreign matter detection method, device, electronic equipment and storage medium
CN114972753A (en) Lightweight semantic segmentation method and system based on context information aggregation and assisted learning
CN114065831A (en) Hyperspectral image classification method based on multi-scale random depth residual error network
CN117421657A (en) Sampling and learning method and system for noisy labels based on oversampling strategy
CN115564043B (en) Image classification model pruning method and device, electronic equipment and storage medium
CN112613604A (en) Neural network quantification method and device
CN115905546B (en) Graph convolution network literature identification device and method based on resistive random access memory

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220729

CF01 Termination of patent right due to non-payment of annual fee