CN111027693A - Neural network compression method and system based on weight-removing pruning - Google Patents

Neural network compression method and system based on weight-removing pruning

Info

Publication number
CN111027693A
Authority
CN
China
Prior art keywords
neural network
pruned
layer
pruning
cut
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911174083.5A
Other languages
Chinese (zh)
Inventor
王睿
宋昆
王帅杰
崔增皓
陈亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB
Priority to CN201911174083.5A
Publication of CN111027693A
Legal status: Pending (current)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a neural network compression method and system based on weight-removing pruning, wherein the method comprises the following steps: determining the parameters to be pruned in a neural network to be pruned; setting the parameters to be pruned in the neural network to 0 to obtain the pruned neural network; and modifying the bottom layer calculation function of the pruned neural network, so that, during operation of the pruned neural network, a calculation is skipped whenever the parameters involved in it are zero. By reducing the complexity of the neural network through weight-removing pruning, the method lowers the computing-power requirement on the computing device, so that the neural network can be applied on edge devices.

Description

Neural network compression method and system based on weight-removing pruning
Technical Field
The invention relates to the technical field of edge calculation, in particular to a neural network compression method and system based on weight-removing pruning.
Background
Deep neural networks are applied in many everyday and engineering scenarios, and their application on edge devices currently has to rely on the computing power of cloud servers, while the number of edge devices is increasing sharply with the development of the Internet of Things. Therefore, at some point in the future, cloud computing will have to support the large-scale application of deep neural networks in daily life.
Against this background, edge computing has emerged to relieve the computing pressure on cloud servers. However, the computing resources of edge devices are limited, and in the face of the huge computation load of deep neural networks, the computing power of edge devices falls short, resulting in serious computation latency. Therefore, how to efficiently deploy deep neural networks on edge devices with limited computing resources has become a concern for those skilled in the art.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a neural network compression method based on weight-removing pruning, so as to address the problem that existing deep neural networks require enormous computing power that existing edge devices do not possess, and therefore cannot be deployed directly on edge devices. By reducing the complexity of the neural network, the computing-power requirement on the computing device is lowered, so that the neural network can be applied on edge devices.
In order to solve the above technical problem, the present invention provides a neural network compression method based on weight-removing pruning, the method comprising:
determining the parameters to be pruned in a neural network to be pruned;
setting the parameters to be pruned in the neural network to 0 to obtain the pruned neural network;
and modifying the bottom layer calculation function of the pruned neural network, so that, during operation of the pruned neural network, a calculation is skipped if the parameters involved in it are zero.
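The patent does not provide source code for the modified bottom layer calculation function. The following is a minimal NumPy sketch of the idea for a fully-connected layer stored as a weight matrix: output rows whose pruned weights are all zero skip their multiply-accumulate work entirely. The function name and layer layout are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def pruned_dense_forward(x, W, b):
    """Forward pass of a dense layer whose pruned weights were set to 0.

    A minimal sketch (not the patent's actual implementation): rows of W
    that are entirely zero correspond to pruned outputs, so their
    multiply-accumulate work is skipped and only the bias is emitted.
    """
    y = np.empty(W.shape[0], dtype=x.dtype)
    for i, w_row in enumerate(W):
        if not w_row.any():          # all weights of this output are pruned
            y[i] = b[i]              # skip the computation: Yi = bi
        else:
            y[i] = w_row @ x + b[i]  # normal computation for retained weights
    return y

# Toy usage: output 1 is fully pruned, so its dot product is never computed.
W = np.array([[0.2, -0.1, 0.4],
              [0.0,  0.0, 0.0],
              [0.5,  0.3, -0.2]])
b = np.array([0.1, 0.2, 0.3])
x = np.array([1.0, 2.0, 3.0])
print(pruned_dense_forward(x, W, b))
```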
Further, the determining of the parameters to be pruned in the neural network to be pruned is specifically:
sequentially determining the parameters that should be pruned in each layer of the neural network to be pruned.
Further, the setting of the parameters to be pruned in the neural network to be pruned to 0 is specifically:
setting the parameters to be pruned in each layer of the neural network to 0 in sequence, thereby completing the pruning operation of each layer one by one, wherein the pruning operations of the individual layers are independent of one another;
and after every layer of the neural network to be pruned has finished its pruning operation, retraining the neural network until it converges to obtain the pruned neural network.
Further, the determination of the parameters to be pruned in each layer of the neural network to be pruned is specifically:
introducing a parameter β for indicating whether the parameters in the layer to be pruned should be pruned, and obtaining the following formula for the output feature map of the layer after pruning:
Yi = βi * Wi * Xi + bi
wherein β is an array, β = (β1, β2, ..., βN); βi, the ith element of the β array, takes the value 0 or 1. When the value of βi is 1, the parameters corresponding to the ith element Wi of the weight matrix W of the layer to be pruned should be preserved; when the value of βi is 0, the parameters corresponding to Wi should be clipped.
Further, the determining process of the parameter β specifically includes:
initializing β so that all of its elements are 1, and iterating using the following formula:
[iteration formula — reproduced as an image in the original publication]
The iteration is stopped when the number of elements of β equal to 0 reaches a preset number, wherein βi denotes the ith element of the β array, Wi denotes the ith element of the weight matrix W of the layer to be pruned, bi is an offset value, λ is a proportionality coefficient, N is the number of channels, F denotes the Frobenius norm, Xi is the ith element of the input matrix, and Yi is the ith element of the output matrix.
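The formula itself is only available as an image in the published text. Based on the symbols listed above (Frobenius norm, proportionality coefficient λ, number of channels N, per-channel terms Wi and Xi), it appears to follow the LASSO-style channel-selection objective commonly used for this kind of pruning; the block below is a hedged reconstruction under that assumption, not a verbatim copy of the patent's formula.

```latex
% Presumed per-layer selection problem (reconstruction, not verbatim):
% find the 0/1 mask \beta that best preserves the layer output Y while the
% penalty term \lambda \|\beta\|_1 drives more entries of \beta to zero.
\hat{\beta} \;=\; \arg\min_{\beta}\;
  \frac{1}{2N}\,\Bigl\lVert Y - \sum_{i=1}^{N} \beta_i\, W_i\, X_i \Bigr\rVert_F^{2}
  \;+\; \lambda\,\lVert \beta \rVert_1,
\qquad \beta_i \in \{0,1\}
```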
Accordingly, in order to solve the above technical problem, the present invention further provides a neural network compression system based on weight-removing pruning, the system comprising:
the cutting parameter determining module is used for determining parameters to be cut in the neural network to be cut;
the parameter cutting module is used for setting the parameters to be cut in the neural network to be cut to 0 to obtain the neural network after cutting;
and the bottom layer calculation function modification module is used for modifying the bottom layer calculation function of the pruned neural network, so that the pruned neural network skips the current calculation if the parameter related to the current calculation is zero in the operation process.
Further, the clipping parameter determining module is specifically configured to:
and sequentially determining parameters which should be cut in each layer in the neural network to be cut.
Further, the parameter clipping module is specifically configured to:
setting the parameters to be cut in each layer in the neural network to be cut to 0 in sequence, thereby completing the cutting operation of each layer one by one; wherein, the pruning operation among the layers is mutually independent;
and after each layer in the neural network to be pruned finishes pruning operation, retraining the neural network to be convergent to obtain the pruned neural network.
Further, the process by which the clipping parameter determining module determines the parameters to be clipped for each layer in the neural network to be pruned specifically includes:
introducing a parameter β for indicating whether the parameters in the layer to be pruned should be pruned, and obtaining the following formula for the output feature map of the layer after pruning:
Yi = βi * Wi * Xi + bi
wherein β is an array, β = (β1, β2, ..., βN); βi, the ith element of the β array, takes the value 0 or 1. When the value of βi is 1, the parameters corresponding to the ith element Wi of the weight matrix W of the layer to be pruned should be preserved; when the value of βi is 0, the parameters corresponding to Wi should be clipped.
Further, the determining process of the parameter β specifically includes:
initializing β so that all of its elements are 1, and iterating using the following formula:
[iteration formula — reproduced as an image in the original publication]
The iteration is stopped when the number of elements of β equal to 0 reaches a preset number, wherein βi denotes the ith element of the β array, Wi denotes the ith element of the weight matrix W of the layer to be pruned, bi is an offset value, λ is a proportionality coefficient, N is the number of channels, F denotes the Frobenius norm, Xi is the ith element of the input matrix, and Yi is the ith element of the output matrix.
The technical scheme of the invention has the following beneficial effects:
1. the invention greatly improves the operation speed of the neural network, reduces the operation time of the neural network and meets the calculation requirements of edge nodes;
2. the invention optimizes the neural network, carries out weight removal pruning on the neural network, improves the running speed of CNN and other networks, and meets the calculation requirement of edge equipment;
3. the invention accelerates the network operation speed without generating great influence on the accuracy of the network;
4. the invention does not change the network structure;
5. the invention is simple to operate and easy to get started with.
Drawings
FIG. 1 is an overall flow chart of the neural network compression method based on weight-removing pruning of the present invention;
FIG. 2 is a flow chart of pruning each layer in the neural network compression method based on weight-removing pruning of the present invention;
FIG. 3 is a graph of the running speed of the original Lenet network model;
FIG. 4 is a graph of the running speed of the Lenet network model after pruning;
FIG. 5 is a graph of the running speed of the original cifar10 network model;
FIG. 6 is a graph of the running speed of the cifar10 network model after pruning.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
First embodiment
Referring to FIG. 1 to FIG. 6, the present embodiment provides a neural network compression method based on weight-removing pruning, which includes:
s101, determining parameters to be cut in a neural network to be cut;
s102, setting parameters to be cut in the neural network to be pruned to be 0 to obtain the pruned neural network;
s103, modifying the bottom layer calculation function of the pruned neural network, so that if the parameters related to the current calculation are zero in the operation process of the pruned neural network, the current calculation is skipped.
When the method of this embodiment is run, the weight information of the network model to be pruned is first input, and the first layer of the network model is pruned. After pruning of the current layer is finished, it is judged whether any layer has not yet been pruned; if such a layer exists, pruning continues with the next layer. After all layers have been pruned, the network model is retrained until it fits, and finally the pruned network model is output, as shown in FIG. 1.
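The layer-by-layer flow of FIG. 1 can be summarized in a short, self-contained Python sketch. The β-selection step described further below is replaced here by a simple magnitude-based stand-in, and all names (prune_layer, compress_network) are hypothetical, not the patent's API.

```python
import numpy as np

def prune_layer(W, keep_ratio):
    """Zero out the lowest-magnitude output channels of one layer.

    A simplified stand-in for the beta-selection step described below;
    channels are ranked by their L2 norm purely for illustration.
    """
    norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
    n_keep = max(1, int(round(keep_ratio * W.shape[0])))
    beta = np.zeros(W.shape[0])
    beta[np.argsort(norms)[-n_keep:]] = 1.0        # keep the strongest channels
    return W * beta[:, None], beta

def compress_network(layers, keep_ratios):
    """Layer-by-layer pruning flow of FIG. 1 (retraining step omitted)."""
    pruned = []
    for (W, b), ratio in zip(layers, keep_ratios): # prune layers one by one,
        W_pruned, _ = prune_layer(W, ratio)        # independently of each other
        pruned.append((W_pruned, b))
    # ... after all layers are pruned, the network would be retrained to fit
    return pruned

layers = [(np.random.randn(6, 25), np.zeros(6)), (np.random.randn(16, 150), np.zeros(16))]
print([w.shape for w, _ in compress_network(layers, [0.6, 0.68])])
```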
Specifically, in this embodiment, the determination of the parameters to be pruned in the neural network to be pruned in the above S101 is a key part of the scheme, and the process specifically includes:
introducing a parameter β for indicating whether the parameters in the layer to be pruned should be pruned, and obtaining the following formula for the output feature map of the layer after pruning:
Yi = βi * Wi * Xi + bi
wherein β is an array, β = (β1, β2, ..., βN); βi, the ith element of the β array, takes the value 0 or 1. When the value of βi is 1, the parameters corresponding to the ith element Wi of the weight matrix W of the layer to be pruned should be preserved; when the value of βi is 0, the parameters corresponding to Wi should be clipped, i.e. the corresponding channel is masked to zero, in which case Yi equals bi.
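As a concrete toy illustration of the masked output Yi = βi * Wi * Xi + bi (an example constructed for this text, not taken from the patent), the sketch below applies a 0/1 channel mask to a small layer:

```python
import numpy as np

# Toy layer with N = 3 output channels: W has shape (channels, inputs).
W = np.array([[0.5, -0.2],
              [0.1,  0.4],
              [-0.3, 0.7]])
b = np.array([0.05, 0.10, 0.15])
X = np.array([2.0, 1.0])

beta = np.array([1.0, 0.0, 1.0])    # channel 1 is selected for pruning

# Yi = beta_i * Wi * Xi + bi, computed for every channel i at once.
Y = beta * (W @ X) + b
print(Y)                            # the masked channel outputs only its bias b1
```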
The parameters to be pruned in each layer of the neural network are determined in turn according to the parameter β; the parameters to be pruned in each layer are then set to 0 in sequence, completing the pruning operation layer by layer, where the pruning operations of the individual layers are independent of one another.
further, in this embodiment, the determination process of the parameter β is shown in fig. 2, and specifically includes:
β and λ are initialized, with all elements of β being 1, and the iteration proceeds using the following formula:
[iteration formula — reproduced as an image in the original publication]
The iteration is stopped when the number of elements of β equal to 0 reaches a preset number, wherein βi denotes the ith element of the β array, Wi denotes the ith element of the weight matrix W of the layer to be pruned, bi is an offset value, λ is a proportionality coefficient, N is the number of channels, F denotes the Frobenius norm, Xi is the ith element of the input matrix, and Yi is the ith element of the output matrix.
When pruning a given layer, λ is first set to a very small value and is gradually increased as the iteration proceeds. During this process the minimum point of the formula moves in the direction of a smaller β, so the number of zero values in β gradually increases while the formula is kept at its minimum. When β satisfies the stopping condition (the number of 0 elements in β reaches the preset number), the iteration stops and the operation moves on to the next layer.
For each layer, it is first judged whether the number of 0s in β satisfies the condition; if not, the above formula is used to iterate. Once the number of 0s in β satisfies the pruning condition of the current layer, pruning of that layer ends and the procedure returns to prune the remaining layers. After every layer of the neural network to be pruned has finished its pruning operation, the neural network is retrained until it converges, yielding the pruned neural network.
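A minimal sketch of this per-layer loop is given below, assuming channels are scored by how much they contribute to reconstructing the layer output and that λ is grown geometrically until the preset number of zeros is reached. The scoring rule and helper names are illustrative approximations, not the patent's exact iteration.

```python
import numpy as np

def select_beta(contribs, Y, target_zeros, lam=1e-4, lam_growth=1.5, max_steps=200):
    """Iteratively grow lambda until enough beta entries are driven to zero.

    contribs[i] is the contribution Wi*Xi of channel i to the layer output Y
    (both flattened to vectors).  Each step keeps the channels whose
    contribution to reconstructing Y outweighs the current penalty lambda;
    this is an illustrative approximation of the patent's iteration.
    """
    beta = np.ones(len(contribs))
    for _ in range(max_steps):
        scores = np.array([np.dot(c, Y) ** 2 / (np.dot(c, c) + 1e-12) for c in contribs])
        beta = (scores > lam).astype(float)        # channels below the penalty go to 0
        if int((beta == 0).sum()) >= target_zeros: # preset number of zeros reached
            break
        lam *= lam_growth                          # gradually increase lambda
    return beta

# Toy usage: 4 channels, prune until at least 2 of them are zeroed out.
rng = np.random.default_rng(0)
contribs = [rng.normal(size=8) * s for s in (2.0, 0.1, 1.5, 0.05)]
Y = contribs[0] + contribs[2]
print(select_beta(contribs, Y, target_zeros=2))
```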
The effect of the solution of the invention is further illustrated below by practical application examples:
the effect of the invention was first tested on a shallow network. First, selecting a Lenet network as an original network for an mnist data set, and pruning results for each layer are shown in table 1:
TABLE 1 Lenet network pruning results
Layer name                          Conv1    Conv2    Ip1
Percentage of parameters retained   0.6      0.68     0.054
It can be seen that less than 70% of the parameters are retained in the Lenet convolution layers, and that a very large share of the parameters of the last layer is pruned away; the original model may simply be too redundant there, with few of those parameters contributing to the final result. The results of FIG. 3 and FIG. 4 were obtained after running the model for 100 forward propagations. The mean and variance of the running time and the model accuracy were computed; the results are shown in Table 2:
TABLE 2 results of 100 forward propagation before and after pruning for the Lenet network model
[Table 2 — reproduced as an image in the original publication]
The final run results show that the FPS increases, i.e., the running speed is improved, while the accuracy does not decrease and even increases slightly; this is possibly because the original model was overfitted and the pruned model removes the overfitted parameters. Therefore, the method works well on networks with few layers and on overfitted networks.
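The timing protocol used here (100 forward propagations, then the mean and variance of the run time and the resulting FPS) can be reproduced with a few lines. The sketch below assumes a generic forward(x) callable and is not tied to the framework used in the experiments.

```python
import time
import numpy as np

def benchmark(forward, x, runs=100):
    """Time `runs` forward propagations and report mean/variance and FPS."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        forward(x)
        times.append(time.perf_counter() - start)
    times = np.array(times)
    return {"mean_s": times.mean(), "var_s2": times.var(), "fps": 1.0 / times.mean()}

# Toy usage with a stand-in "network": a single matrix multiplication.
W = np.random.randn(512, 512)
print(benchmark(lambda x: W @ x, np.random.randn(512)))
```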
Next, the effect of the present invention was tested on a deep network: pruning was performed on the Resnet network with cifar10 selected as the data set, and the pruning results are shown in Table 3 below:
TABLE 3 Resnet network pruning results
[Table 3 — reproduced as images in the original publication]
The results of the Resnet network model after 100 forward propagations before and after pruning are shown in Table 4 below:
TABLE 4 results of the Resnet network model after 100 forward propagations before and after pruning
[Table 4 — reproduced as an image in the original publication]
It can be seen that the running time of the model is, as before, improved.
In conclusion, by performing weight-removing pruning on the neural network, the invention can greatly increase the running speed of the neural network and reduce its running time, so as to meet the computing requirements of edge devices; at the same time, the invention accelerates network operation without significantly affecting the accuracy of the network, does not change the network structure, and has the advantages of being simple to operate and easy to use.
Second embodiment
The present embodiment provides a neural network compression system based on weight-removing pruning, the system comprising:
the cutting parameter determining module is used for determining parameters to be cut in the neural network to be cut;
the parameter cutting module is used for setting the parameters to be cut in the neural network to be cut to 0 to obtain the neural network after cutting;
and the bottom layer calculation function modification module is used for modifying the bottom layer calculation function of the pruned neural network, so that the pruned neural network skips the current calculation if the parameter related to the current calculation is zero in the operation process.
The neural network compression system based on weight-removing pruning of this embodiment corresponds to the neural network compression method based on weight-removing pruning of the first embodiment; the functions realized by the functional modules of the system correspond one-to-one to the steps of the method, and the detailed description is therefore omitted here.
Furthermore, it should be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (10)

1. A neural network compression method based on weight-removing pruning is characterized by comprising the following steps:
determining parameters to be cut in a neural network to be cut;
setting the parameters to be cut in the neural network to be pruned to be 0 to obtain the pruned neural network;
and modifying the bottom layer calculation function of the pruned neural network, so that if the parameters related to the current calculation are zero in the operation process of the pruned neural network, the current calculation is skipped.
2. The neural network compression method based on weight-removing pruning as claimed in claim 1, wherein the determining of the parameters to be pruned in the neural network to be pruned is specifically:
and sequentially determining parameters which should be cut in each layer in the neural network to be cut.
3. The neural network compression method based on weight-removing pruning as claimed in claim 2, wherein the setting of the parameters to be pruned in the neural network to be pruned to 0 is specifically:
setting the parameters to be cut in each layer in the neural network to be cut to 0 in sequence, thereby completing the cutting operation of each layer one by one; wherein, the pruning operation among the layers is mutually independent;
and after each layer in the neural network to be pruned finishes pruning operation, retraining the neural network to be convergent to obtain the pruned neural network.
4. The neural network compression method based on weight-removing pruning as claimed in claim 3, wherein the determination of the parameters to be pruned in each layer of the neural network to be pruned is specifically as follows:
introducing a parameter β for indicating whether the parameters in the layer to be pruned should be pruned, and obtaining the following formula for the output feature map of the layer after pruning:
Yi = βi * Wi * Xi + bi
wherein β is an array, β = (β1, β2, ..., βN); βi, the ith element of the β array, takes the value 0 or 1; when the value of βi is 1, the parameters corresponding to the ith element Wi of the weight matrix W of the layer to be pruned should be preserved; when the value of βi is 0, the parameters corresponding to Wi should be clipped; bi is an offset value.
5. The neural network compression method based on weight-removing pruning as claimed in claim 4, wherein the determination process of the parameter β specifically comprises:
initializing β so that all of its elements are 1, and iterating using the following formula:
[iteration formula — reproduced as an image in the original publication]
stopping the iteration when the number of elements of β equal to 0 reaches a preset number, wherein βi denotes the ith element of the β array, Wi denotes the ith element of the weight matrix W of the layer to be pruned, bi is an offset value, λ is a proportionality coefficient, N is the number of channels, F denotes the Frobenius norm, Xi is the ith element of the input matrix, and Yi is the ith element of the output matrix.
6. A neural network compression system based on weight-removing pruning, comprising:
the cutting parameter determining module is used for determining parameters to be cut in the neural network to be cut;
the parameter cutting module is used for setting the parameters to be cut in the neural network to be cut to 0 to obtain the neural network after cutting;
and the bottom layer calculation function modification module is used for modifying the bottom layer calculation function of the pruned neural network, so that the pruned neural network skips the current calculation if the parameter related to the current calculation is zero in the operation process.
7. The neural network compression system based on weight-removing pruning of claim 6, wherein the clipping-parameters determination module is specifically configured to:
and sequentially determining parameters which should be cut in each layer in the neural network to be cut.
8. The neural network compression system based on weight-removing pruning of claim 7, wherein the parameter clipping module is specifically configured to:
setting the parameters to be cut in each layer in the neural network to be cut to 0 in sequence, thereby completing the cutting operation of each layer one by one; wherein, the pruning operation among the layers is mutually independent;
and after each layer in the neural network to be pruned finishes pruning operation, retraining the neural network to be convergent to obtain the pruned neural network.
9. The neural network compression system based on weight-removing pruning as claimed in claim 8, wherein the process by which the parameter determining module determines the parameters to be pruned for each layer in the neural network to be pruned specifically comprises:
introducing a parameter β for indicating whether the parameters in the layer to be pruned should be pruned, and obtaining the following formula for the output feature map of the layer after pruning:
Yi = βi * Wi * Xi + bi
wherein β is an array, β = (β1, β2, ..., βN); βi, the ith element of the β array, takes the value 0 or 1; when the value of βi is 1, the parameters corresponding to the ith element Wi of the weight matrix W of the layer to be pruned should be preserved; when the value of βi is 0, the parameters corresponding to Wi should be clipped.
10. The neural network compression system based on weight-removing pruning of claim 9, wherein the determination of the parameter β is specifically:
initializing β so that all of its elements are 1, and iterating using the following formula:
[iteration formula — reproduced as an image in the original publication]
stopping the iteration when the number of elements of β equal to 0 reaches a preset number, wherein βi denotes the ith element of the β array, Wi denotes the ith element of the weight matrix W of the layer to be pruned, bi is an offset value, λ is a proportionality coefficient, N is the number of channels, F denotes the Frobenius norm, Xi is the ith element of the input matrix, and Yi is the ith element of the output matrix.
CN201911174083.5A 2019-11-26 2019-11-26 Neural network compression method and system based on weight-removing pruning Pending CN111027693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911174083.5A CN111027693A (en) 2019-11-26 2019-11-26 Neural network compression method and system based on weight-removing pruning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911174083.5A CN111027693A (en) 2019-11-26 2019-11-26 Neural network compression method and system based on weight-removing pruning

Publications (1)

Publication Number Publication Date
CN111027693A true CN111027693A (en) 2020-04-17

Family

ID=70206814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911174083.5A Pending CN111027693A (en) 2019-11-26 2019-11-26 Neural network compression method and system based on weight-removing pruning

Country Status (1)

Country Link
CN (1) CN111027693A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111582446A (en) * 2020-04-28 2020-08-25 北京达佳互联信息技术有限公司 System for neural network pruning and neural network pruning processing method
CN111582446B (en) * 2020-04-28 2022-12-06 北京达佳互联信息技术有限公司 System for neural network pruning and neural network pruning processing method
CN113033804A (en) * 2021-03-29 2021-06-25 北京理工大学重庆创新中心 Convolution neural network compression method for remote sensing image
CN113033804B (en) * 2021-03-29 2022-07-01 北京理工大学重庆创新中心 Convolution neural network compression method for remote sensing image

Similar Documents

Publication Publication Date Title
CN110379416B (en) Neural network language model training method, device, equipment and storage medium
CN108536650B (en) Method and device for generating gradient lifting tree model
US20200342322A1 (en) Method and device for training data, storage medium, and electronic device
JP2019533257A (en) Neural architecture search
US20220405641A1 (en) Method for recommending information, recommendation server, and storage medium
WO2015089148A2 (en) Reducing dynamic range of low-rank decomposition matrices
US11068655B2 (en) Text recognition based on training of models at a plurality of training nodes
CN110663049A (en) Neural network optimizer search
WO2021103597A1 (en) Method and device for model compression of neural network
CN109616093A (en) End-to-end phoneme synthesizing method, device, equipment and storage medium
CN112101547B (en) Pruning method and device for network model, electronic equipment and storage medium
CN111027693A (en) Neural network compression method and system based on weight-removing pruning
CN111984414B (en) Data processing method, system, equipment and readable storage medium
CN110874634A (en) Neural network optimization method and device, equipment and storage medium
Pietron et al. Retrain or not retrain?-efficient pruning methods of deep cnn networks
CN112200313A (en) Deep learning model reasoning acceleration method, system, equipment and medium
CN114675975B (en) Job scheduling method, device and equipment based on reinforcement learning
CN112861996A (en) Deep neural network model compression method and device, electronic equipment and storage medium
CN116644804A (en) Distributed training system, neural network model training method, device and medium
KR20220097329A (en) Method and algorithm of deep learning network quantization for variable precision
US20160189026A1 (en) Running Time Prediction Algorithm for WAND Queries
CN113163004A (en) Industrial Internet edge task unloading decision method, device and storage medium
CN113033804B (en) Convolution neural network compression method for remote sensing image
CN113887721A (en) Post-training quantization compression method and system in voice recognition task
CN111062477A (en) Data processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200417