CN110942141A - Deep neural network pruning method based on global sparse momentum SGD - Google Patents
- Publication number
- CN110942141A (application number CN201911202397.1A)
- Authority
- CN
- China
- Prior art keywords
- parameters
- momentum
- parameter
- sgd
- deep neural
- Prior art date
- Legal status: Pending (assumed; not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a deep neural network pruning method based on global sparse momentum SGD, comprising the following steps. S1: activity screening; S2: active and negative update; S3: pruning the trained model. A sufficient number of training iterations are performed on the DNN, each iteration applying the update rules introduced in S1 and S2, after which training is complete. Subsequently, by setting most of the parameters to zero in S3, a DNN model with only a few non-zero parameters is obtained. In each training iteration, an activity screening technique selects the parameters that are important for the data of the current iteration; an activity update is performed on these parameters and a negative update on the remaining parameters. This deep neural network pruning method based on global sparse momentum SGD can achieve a high compression rate on DNNs without obvious accuracy loss, so the sparse model produced by pruning can be stored in much less space, achieving a better balance of accuracy and efficiency.
Description
Technical Field
The invention relates to the field of deep neural network pruning methods, in particular to a deep neural network pruning method based on global sparse momentum SGD.
Background
In the fields of computer vision, natural language processing and the like, Deep Neural Networks (DNNs) have become an indispensable tool.
In recent years, the accuracy of DNNs has been greatly improved by larger data sets, increased network depth, novel regularization and optimization methods, and innovations in network architecture. However, as DNNs grow deeper, their parameter counts, energy consumption, required floating-point operations (FLOPs) and memory usage also increase, making them harder and harder to deploy on platforms with limited computing resources, such as mobile devices. Thus, DNN compression and acceleration techniques have been extensively studied in recent years, mainly including pruning, parameter quantization, knowledge distillation, and the like.
The purpose of DNN pruning is to reduce the number of non-zero parameters in a DNN, i.e. to introduce sparsity into its parameter tensors. This class of techniques has received wide attention, mainly for four reasons: first, pruning is a general technique that can be applied to any DNN; second, pruning effectively reduces the number of non-zero parameters of the network, so the pruned network occupies less storage space; third, on hardware supporting sparse matrix and tensor operations, DNN sparsity can improve execution speed; fourth, DNN pruning can be applied in combination with other DNN compression and acceleration techniques.
Although existing DNN pruning techniques can reduce the number of non-zero parameters of a DNN to some extent and thereby improve the balance of accuracy and efficiency, these methods have significant limitations.
1. Some methods perform pruning on a trained DNN model (for example, sorting all parameters by absolute value and setting a certain proportion of those with smaller absolute values to 0), which causes a loss of accuracy, so the model must be retrained. On the one hand, given a global compression rate (e.g. requiring the number of non-zero parameters to be 25% of the total, i.e. a 4× compression), a pruning ratio must be set for each layer in advance. Because a DNN has many layers, it is difficult to choose an appropriate pruning ratio for every layer, and the deeper the DNN, the harder this becomes. On the other hand, the pruned model is difficult to retrain and its accuracy is hard to recover effectively: the sparser the DNN parameters, the harder the training and the lower the resulting accuracy.
2. Other methods model the trade-off between compression rate and accuracy as an optimization problem, which is then solved in some way. On the one hand, some methods explicitly add the sparsity of the model to the optimization objective; since sparsity is measured with the L0 norm, which is not differentiable, end-to-end training is impossible. On the other hand, methods that do not add sparsity directly to the objective instead use sparsity-inducing regularization terms to shrink the absolute values of the parameters and then prune by absolute value. These allow end-to-end training, but the regularization coefficients are not directly reflected in the final sparsity, so different coefficient values often have to be chosen manually, and many attempts must be made to reach the desired final sparsity.
Therefore, a deep neural network pruning method based on the global sparse momentum SGD is provided.
Disclosure of Invention
The invention aims to provide a deep neural network pruning method based on global sparse momentum SGD to solve the problems in the background art.
In order to achieve the purpose, the invention provides the following technical scheme: a deep neural network pruning method based on a global sparse momentum SGD comprises the following steps:
s1: activity screening; the specific implementation mode is as follows:
the invention is based on the optimization method of momentum random gradient descent, therefore, firstly, a common optimization method of momentum random gradient descent is introduced, wherein k is iteration times, L is an objective function, α is a learning rate, w is a certain parameter, z is the accumulated momentum of the parameter, η is weight attenuation intensity, β is a momentum coefficient, in the forward propagation process, an objective function value is calculated by input data and labels, in the backward propagation process, a partial derivative of each parameter is calculated by the objective function value as the gradient thereof, then, each parameter is updated, and the updating rule is as follows:
w(k+1)←w(k)-αz(k+1).
First the momentum z of w is updated, and then w is updated using z;
let the input data of this iteration be x and the label be y; for any parameter w, its importance measure is:
T(w) = | w · ∂L(x, y, Θ)/∂w |,
where Θ represents the set of all current parameters of the whole model, and L(x, y, Θ) is the loss value of the current model on inputs x and y; the partial derivative of this value with respect to w is the gradient of w;
In S1, according to the above formula, at the beginning of each training iteration, after the gradient of each parameter is obtained through back propagation, the importance measure of each parameter, i.e. the T value, is computed;
s2: active and negative update;
after the importance measure T of each parameter is obtained in S1, set the required global compression rate to P; that is, only a fraction 1/P of all parameters of the entire network are non-zero. Let |Θ| denote the total number of parameters and Q the number of non-zero parameters, so that Q = |Θ|/P;
in each training iteration, updating Q parameters with the maximum T value by using gradients, and updating other parameters by only using weight attenuation;
assuming that the parameter matrix of a certain layer is W and the corresponding accumulated momentum is Z, the update rule can be formally expressed as:
Z(k+1) ← βZ(k) + ηW(k) + B(k) ⊙ ∂L/∂W(k),
W(k+1) ← W(k) − αZ(k+1),
where ⊙ denotes element-wise multiplication of corresponding entries of two matrices of the same size. The binary mask matrix B is computed as follows:
if the T value corresponding to a certain element of W, i.e. a certain parameter of the DNN, ranks among the top Q of the T values of all parameters, i.e. it is greater than or equal to the Q-th largest T value, then the corresponding position in the B matrix is 1; otherwise it is 0;
obviously, the B matrices of all layers contain Q ones in total, and all other entries are 0; that is, in each training iteration only the gradients of Q parameters participate in the update, which is called an "activity update", while the updates of the other parameters come only from weight decay, which is called a "negative update";
s3: pruning the trained model;
and (3) carrying out training iteration for DNN for enough times, wherein after each iteration is updated through the updating rules introduced in S1 and S2, the training is completed, and the obtained model has other parameters close to 0 except Q parameters. Setting these parameters to 0 yields a model with only Q non-zero parameters
Preferably, SGD refers to stochastic gradient descent and DNN refers to deep neural network.
Preferably, in each training iteration of S1, an "activity screening" technique is applied to select the parameters that are important to the data of the current iteration, perform "activity update" on the parameters, and perform "negative update" on the remaining parameters.
Preferably, in S1, a small portion of the parameters with the largest T values is selected for "active update", and the remaining parameters receive "negative update".
Preferably, in S2, the unimportant parameters receive "negative updates" repeatedly as training progresses, and thus move ever closer to 0, eventually becoming arbitrarily close to 0.
Preferably, in S2, when training completes, the absolute values of all parameters of the model except Q of them are close to 0.
Preferably, setting these near-zero parameters to exactly 0 does not affect the accuracy of the model.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides global sparse momentum SGD, a novel stochastic gradient descent (SGD) optimization method, for DNN pruning. In each training iteration, the usual SGD method updates all parameters with the gradient of the objective function. In global sparse momentum SGD, however, only a few of the more important parameters are updated with the gradient of the objective function, and most parameters are updated only by weight decay. Thus, as training progresses, most parameters become arbitrarily close to 0, and when training finishes, removing these near-zero parameters does not affect the accuracy of the network.
2. The global sparse momentum SGD described above enables the invention to achieve very high compression ratios (i.e. very low non-zero parameter ratios) on DNNs without significant loss of accuracy, so the sparse model produced by pruning can be stored in much less space, achieving a better balance of accuracy and efficiency.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a technical scheme that: a deep neural network pruning method based on a global sparse momentum SGD comprises the following steps:
s1: activity screening;
in each training iteration, an activity screening technology is applied to select a small part of parameters which are important to the data of the current iteration, activity updating is carried out on the parameters, and negative updating is carried out on most of the parameters. The activity screening was performed as follows.
First, we introduce the common momentum stochastic gradient descent (momentum SGD) optimization method. Let k be the iteration index, L the objective function, α the learning rate, w a parameter, z the accumulated momentum of that parameter, η the weight decay strength (i.e., the L2 regularization coefficient), and β the momentum coefficient (usually 0.9). In forward propagation, the objective function value is computed from the input data and labels; in backward propagation, the partial derivative with respect to each parameter is computed as its gradient, and each parameter is updated by:
z(k+1) ← βz(k) + ηw(k) + ∂L/∂w(k),
w(k+1) ← w(k) − αz(k+1).
Note that here the momentum z of w is updated first and then w is updated with z.
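As a concrete sketch, one momentum-SGD step of this kind can be written in a few lines of Python (illustrative only; the function name and hyperparameter defaults are placeholders, not taken from the patent):

```python
import numpy as np

def momentum_sgd_step(w, z, grad, lr=0.01, beta=0.9, wd=1e-4):
    """One momentum-SGD step with weight decay:
    z(k+1) <- beta*z(k) + eta*w(k) + dL/dw, then w(k+1) <- w(k) - alpha*z(k+1)."""
    z = beta * z + wd * w + grad   # momentum accumulates weight decay + gradient
    w = w - lr * z                 # parameter moves against the accumulated momentum
    return w, z
```

Note that, as in the text, the momentum `z` is updated first and the parameter `w` is then moved by the new momentum.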
In the present invention, we first decide which parameters to actively update, which requires measuring the importance of each parameter to the model in the current iteration. Let the input data of this iteration be x and the label be y. For any parameter w, its importance measure is:
T(w) = | w · ∂L(x, y, Θ)/∂w |,
where Θ represents the set of all current parameters of the entire model and L(x, y, Θ) is the loss value of the current model on inputs x and y; the partial derivative of this value with respect to w is the gradient of w.
S1 thus consists of computing, at the beginning of each training iteration and after the gradients are obtained by back propagation, the importance measure (the T value) of each parameter according to the above formula, then selecting the small portion of parameters with the largest T values for "activity update" and performing "negative update" on the remaining parameters.
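The screening step can be sketched in Python as follows (a minimal illustration, assuming the first-order importance measure T(w) = |w · ∂L/∂w|; the function names `importance_T` and `top_q_mask` are placeholders, not from the patent):

```python
import numpy as np

def importance_T(params, grads):
    """Importance T(w) = |w * dL/dw| for every parameter."""
    return np.abs(params * grads)

def top_q_mask(T, Q):
    """Binary mask B with 1 at the Q globally largest T values, 0 elsewhere."""
    thresh = np.sort(T.ravel())[-Q]       # Q-th largest T value
    return (T >= thresh).astype(T.dtype)
```

In a multi-layer network the threshold would be taken over the concatenated T values of all layers, so that the Q selected parameters are chosen globally rather than per layer.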
S2: active and negative update;
after the importance measure T is obtained for each parameter in S1, let the global compression rate we need be P (i.e. all parameters of the whole network have a non-zero value of 1/P), let | Θ | represent the total number of all parameters, and let the total number of non-zero parameter values be Q, which is obviously | Θ |/P.
The key point of the invention is that in each training iteration, Q parameters with the maximum T value are updated by using gradient, and other parameters are updated only by using weight decay (weight decay).
Assuming that the parameter matrix of a certain layer is W and the corresponding accumulated momentum is Z, the update rule can be formally expressed as:
Z(k+1) ← βZ(k) + ηW(k) + B(k) ⊙ ∂L/∂W(k),
W(k+1) ← W(k) − αZ(k+1),
where ⊙ denotes element-wise multiplication of corresponding entries of two matrices of the same size. The binary mask matrix B is computed as follows:
if the size of the T value corresponding to a certain element (i.e., a certain parameter value of DNN) in W is located at the top Q of the T values of all parameters (i.e., is greater than or equal to the qth value of the T values of all parameters), then the corresponding position in the B matrix is 1, otherwise it is 0.
Obviously, the B matrices of all layers contain Q ones in total and the other entries are 0; that is, in each training iteration only the gradients of Q parameters participate in the update, which is called "active updating", while the remaining parameters are updated only by weight decay, which is called "negative updating".
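Combining the mask with the momentum update, a single global-sparse-momentum step might look like this in Python (a minimal sketch, assuming the masked update rule Z ← βZ + ηW + B ⊙ ∂L/∂W; the function name and defaults are placeholders):

```python
import numpy as np

def gsm_step(W, Z, grad, B, lr=0.01, beta=0.9, wd=1e-4):
    """Global sparse momentum step: weight decay acts on every parameter,
    but the gradient term is masked so only B==1 entries receive it."""
    Z = beta * Z + wd * W + B * grad   # Z(k+1) <- beta*Z + eta*W + B (.) dL/dW
    W = W - lr * Z                     # W(k+1) <- W(k) - alpha*Z(k+1)
    return W, Z
```

With `wd > 0`, an entry whose mask stays 0 over many iterations receives only the decay term, so its magnitude shrinks toward 0 — exactly the "negative update" behavior described above.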
S3: pruning the trained model;
and (3) carrying out enough times of training iterations on the DNN, wherein after each iteration is updated through the updating rules introduced in S1 and S2, the training is completed, and the absolute values of other parameters except Q parameters in the model are very close to 0. In this case, the accuracy of the model is not affected by setting these parameters very close to 0. Thus, a DNN model with only Q non-zero parameters is obtained.
The invention provides global sparse momentum SGD, a novel stochastic gradient descent (SGD) optimization method, for DNN pruning. In each training iteration, the usual SGD method updates all parameters with the gradient of the objective function. In global sparse momentum SGD, however, only a few of the more important parameters are updated with the gradient, and most parameters are updated only by weight decay. Thus, as training progresses, most parameters become arbitrarily close to 0, and removing these near-zero parameters after training does not affect the accuracy of the network. This enables very high compression ratios (i.e. very low non-zero parameter ratios) on DNNs without significant loss of accuracy, so the sparse model produced by pruning can be stored in much less space, achieving a better balance of accuracy and efficiency. The compression results are shown in Table 1 below.
TABLE 1 compression results table
Here, a compression ratio of 300× means that only 1/300 of all the parameters of the model are non-zero.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents. The invention is not limited to the embodiments described above, and all changes and modifications that fall within the scope of the invention are embraced by the appended claims.
Claims (7)
1. The deep neural network pruning method based on the global sparse momentum SGD is characterized by comprising the following steps of:
s1: activity screening; the specific implementation mode is as follows:
the method is based on the common momentum stochastic gradient descent optimization method, in which k is the iteration index, L is the objective function, α is the learning rate, w is a parameter, z is the accumulated momentum of that parameter, η is the weight decay strength, and β is the momentum coefficient. In forward propagation, the objective function value is computed from the input data and labels; in backward propagation, the partial derivative of the objective with respect to each parameter is computed as its gradient, and each parameter is then updated by the rule:
z(k+1) ← βz(k) + ηw(k) + ∂L/∂w(k),
w(k+1) ← w(k) − αz(k+1).
First the momentum z of w is updated, and then w is updated using z;
let the input data of this iteration be x and the label be y; for any parameter w, its importance measure is:
T(w) = | w · ∂L(x, y, Θ)/∂w |,
where Θ represents the set of all current parameters of the whole model and L(x, y, Θ) is the loss value of the current model on inputs x and y; the partial derivative of this value with respect to w is the gradient of w;
In S1, according to the above formula, at the beginning of each training iteration, after the gradient of each parameter is obtained through back propagation, the importance measure of each parameter, i.e. the T value, is computed;
s2: active and negative update;
after the importance measure T of each parameter is obtained in S1, set the required global compression rate to P; that is, only a fraction 1/P of all parameters of the entire network are non-zero. Let |Θ| denote the total number of parameters and Q the number of non-zero parameters, so that Q = |Θ|/P;
in each training iteration, updating Q parameters with the maximum T value by using gradients, and updating other parameters by only using weight attenuation;
assuming that the parameter matrix of a certain layer is W and the corresponding accumulated momentum is Z, the update rule can be formally expressed as:
Z(k+1) ← βZ(k) + ηW(k) + B(k) ⊙ ∂L/∂W(k),
W(k+1) ← W(k) − αZ(k+1),
where ⊙ denotes element-wise multiplication of corresponding entries of two matrices of the same size. The binary mask matrix B is computed as follows:
if the T value corresponding to a certain element of W, i.e. a certain parameter of the DNN, ranks among the top Q of the T values of all parameters, i.e. it is greater than or equal to the Q-th largest T value, then the corresponding position in the B matrix is 1; otherwise it is 0;
obviously, the B matrices of all layers contain Q ones in total, and all other entries are 0; that is, in each training iteration only the gradients of Q parameters participate in the update, which is called an "activity update", while the updates of the other parameters come only from weight decay, which is called a "negative update";
s3: pruning the trained model;
and training the DNN for enough times, wherein each iteration is updated through the updating rules introduced in S1 and S2, and then the training is completed, so that a DNN model with only Q nonzero parameters is obtained.
2. The global sparse momentum (SGD) -based deep neural network pruning method according to claim 1, characterized in that: SGD refers to stochastic gradient descent and DNN refers to deep neural network.
3. The global sparse momentum (SGD) -based deep neural network pruning method according to claim 1, characterized in that: in each training iteration of S1, an "activity screening" technique is applied to select parameters that are important to the data of the current iteration, perform "activity update" on the parameters, and perform "negative update" on the remaining parameters.
4. The global sparse momentum (SGD) -based deep neural network pruning method according to claim 1, characterized in that: in S1, a small portion of the parameters with the largest T values is selected for "active update", and the remaining parameters receive "negative update".
5. The global sparse momentum (SGD) -based deep neural network pruning method according to claim 1, characterized in that: in S2, the unimportant parameters receive "negative updates" repeatedly as training progresses, and thus move ever closer to 0, eventually becoming arbitrarily close to 0.
6. The global sparse momentum (SGD) -based deep neural network pruning method according to claim 1, characterized in that: in S2, when training completes, the absolute values of all parameters of the model except Q of them are close to 0.
7. The global sparse momentum (SGD) -based deep neural network pruning method according to claim 6, characterized in that: setting these near-zero parameters to exactly 0 does not affect the accuracy of the model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911202397.1A CN110942141A (en) | 2019-11-29 | 2019-11-29 | Deep neural network pruning method based on global sparse momentum SGD |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110942141A true CN110942141A (en) | 2020-03-31 |
Family
ID=69909354
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911202397.1A Pending CN110942141A (en) | 2019-11-29 | 2019-11-29 | Deep neural network pruning method based on global sparse momentum SGD |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110942141A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112734029A * | 2020-12-30 | 2021-04-30 | 中国科学院计算技术研究所 | Neural network channel pruning method, storage medium and electronic equipment |
WO2024083180A1 * | 2022-10-20 | 2024-04-25 | International Business Machines Corporation | DNN training algorithm with dynamically computed zero-reference |
CN116562346A * | 2023-07-07 | 2023-08-08 | 深圳大学 | L0 norm-based artificial neural network model compression method and device |
CN116562346B * | 2023-07-07 | 2023-11-10 | 深圳大学 | L0 norm-based artificial neural network model compression method and device |
- 2019-11-29 CN CN201911202397.1A patent/CN110942141A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200331 |