CN110942141A - Deep neural network pruning method based on global sparse momentum SGD - Google Patents

Deep neural network pruning method based on global sparse momentum SGD

Info

Publication number
CN110942141A
Authority
CN
China
Prior art keywords
parameters
momentum
parameter
sgd
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911202397.1A
Other languages
Chinese (zh)
Inventor
丁贵广 (Ding Guiguang)
丁霄汉 (Ding Xiaohan)
郭雨晨 (Guo Yuchen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911202397.1A priority Critical patent/CN110942141A/en
Publication of CN110942141A publication Critical patent/CN110942141A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Abstract

The invention discloses a deep neural network pruning method based on global sparse momentum SGD, comprising the following steps: S1, activity screening; S2, active and negative updates; S3, pruning of the trained model. The DNN is trained for a sufficient number of iterations, each iteration applying the update rules introduced in S1 and S2, after which training is complete. Most of the parameters are then set to zero in S3, yielding a DNN model with only a few non-zero parameters. In each training iteration, the activity screening technique selects the parameters that are important for the data of the current iteration; these parameters receive an active update, while the remaining parameters receive a negative update. The deep neural network pruning method based on global sparse momentum SGD can achieve a high compression rate on a DNN without significant loss of accuracy, so the sparse model produced by pruning can be stored in far less storage space, achieving a better balance between accuracy and efficiency.

Description

Deep neural network pruning method based on global sparse momentum SGD
Technical Field
The invention relates to the field of deep neural network pruning methods, and in particular to a deep neural network pruning method based on global sparse momentum SGD (stochastic gradient descent).
Background
In the fields of computer vision, natural language processing and the like, Deep Neural Networks (DNNs) have become an indispensable tool.
In recent years, the accuracy of DNNs has been greatly improved by larger datasets, deeper networks, novel regularization and optimization methods, and innovations in network architecture. However, as DNNs become deeper and deeper, their number of parameters, energy consumption, required floating-point operations (FLOPs) and memory usage also keep growing, making them increasingly difficult to deploy on platforms with limited computing resources, such as mobile devices. Consequently, DNN compression and acceleration techniques have been extensively studied in recent years, mainly including pruning, parameter quantization, knowledge distillation, and the like.
The purpose of DNN pruning is to reduce the number of non-zero parameters in a DNN, i.e. to introduce sparsity into its parameter tensors. This class of techniques has received wide attention mainly for four reasons: first, pruning is a general technique that can be applied to any DNN; second, pruning effectively reduces the number of non-zero parameters of the network, so the pruned network occupies less storage space; third, on hardware that supports sparse matrix and tensor operations, DNN sparsity can increase computation speed; fourth, DNN pruning can be applied in combination with other DNN compression and acceleration techniques.
Although existing DNN pruning techniques can reduce the number of non-zero parameters of a DNN to some extent and thereby achieve a reasonable balance between accuracy and efficiency, these methods have significant limitations.
1. Some methods perform pruning on an already trained DNN model (for example, by sorting all parameters by absolute value and setting a certain proportion of those with the smallest absolute values to 0), which causes a loss of accuracy, so the model must be retrained. On the one hand, given a global compression rate (e.g. requiring the number of non-zero parameters to be 25% of the total parameter count, i.e. a compression of 4 times), a pruning ratio must be set for each layer in advance. Because a DNN has many layers, it is difficult to set an appropriate pruning ratio for every layer, and the deeper the DNN, the more difficult this becomes. On the other hand, the pruned model is difficult to train and its accuracy is hard to recover effectively: the sparser the DNN parameters, the harder the training and the lower the resulting accuracy.
2. Some methods model the trade-off between compression rate and accuracy as an optimization problem, which is then solved in some way. On the one hand, some methods explicitly add the sparsity of the model to the optimization objective, but end-to-end training is then impossible, because model sparsity is measured with the L0 norm, which is non-differentiable. On the other hand, if sparsity is not added to the objective directly, regularization terms are used to drive the absolute values of the parameters down, and pruning is then performed according to those absolute values; end-to-end training is possible in this case, but the regularization coefficients are not directly reflected in the final sparsity, so different coefficient values often have to be chosen manually, and many attempts are needed to reach the desired final sparsity.
Therefore, a deep neural network pruning method based on global sparse momentum SGD is provided.
Disclosure of Invention
The invention aims to provide a deep neural network pruning method based on global sparse momentum SGD to solve the problems raised in the background art above.
In order to achieve the above purpose, the invention provides the following technical scheme: a deep neural network pruning method based on global sparse momentum SGD, comprising the following steps:
S1: activity screening; the specific implementation is as follows:
The invention builds on the momentum stochastic gradient descent (momentum SGD) optimization method, so the common momentum SGD method is introduced first. Let k be the iteration number, L the objective function, α the learning rate, w a given parameter, z the accumulated momentum of that parameter, η the weight decay strength, and β the momentum coefficient. In the forward propagation process, the objective function value is computed from the input data and labels; in the backward propagation process, the partial derivative of the objective function with respect to each parameter is computed as its gradient; each parameter is then updated according to the following rule:
z(k+1) ← βz(k) + ηw(k) + ∂L/∂w(k)
w(k+1) ← w(k) - αz(k+1)
The momentum z of w is updated first, and w is then updated using z.
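For illustration only, the update rule above can be rendered as the following minimal Python sketch for a single scalar parameter; the names w, z, grad, alpha, beta and eta simply mirror the symbols defined above and are not part of the claimed method.

```python
def momentum_sgd_step(w, z, grad, alpha, beta, eta):
    """One momentum SGD step; grad is the gradient of the objective L with respect to w."""
    z = beta * z + eta * w + grad   # update the accumulated momentum z of w first
    w = w - alpha * z               # then update w using the new z
    return w, z
```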
Let the input data of the current iteration be x and the label be y; for any parameter w, its importance measure is:
T(w) = |∂L(x, y, Θ)/∂w · w|
Here Θ denotes the set of all current parameters of the whole model; L(x, y, Θ) denotes the loss function value of the current model on the input x and label y, and the partial derivative of this value with respect to w is the gradient of w.
In S1, according to the above formula, at the beginning of each training iteration the gradient of each parameter is obtained through back propagation, and the importance measure of each parameter, i.e. its T value, is then computed.
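As an illustrative sketch only (PyTorch is assumed here purely for automatic differentiation; model, criterion, x and y are hypothetical placeholders rather than elements defined by the invention), the T values of all parameters can be obtained from a single backward pass roughly as follows:

```python
import torch

def compute_importance(model, criterion, x, y):
    """Return, per parameter tensor, the importance T = |(dL/dw) * w| element-wise."""
    model.zero_grad()
    loss = criterion(model(x), y)     # forward pass: objective value L(x, y, Θ)
    loss.backward()                   # backward pass: gradients dL/dw for every parameter
    importance = {}
    for name, p in model.named_parameters():
        if p.grad is not None:
            importance[name] = (p.grad * p.detach()).abs()   # T value of every entry
    return importance
```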
S2: active and negative updates;
After the importance measure T of each parameter has been obtained in S1, let the required global compression rate be P, i.e. a fraction 1/P of all parameters of the entire network are non-zero; let |Θ| denote the total number of parameters and Q the total number of non-zero parameters, so that clearly Q = |Θ|/P.
In each training iteration, the Q parameters with the largest T values are updated using their gradients, while the other parameters are updated using weight decay only.
Assuming that the parameter matrix of a certain layer is W and the corresponding accumulated momentum is Z, the update rule can be formally expressed as:
Z(k+1) ← βZ(k) + ηW(k) + B(k) ⊙ ∂L/∂W(k)
W(k+1) ← W(k) - αZ(k+1)
Here ⊙ denotes the element-wise multiplication of two matrices of the same size. The B matrix is determined as follows:
If the T value corresponding to an element of W, i.e. a parameter of the DNN, is among the top Q of the T values of all parameters, i.e. greater than or equal to the Q-th largest of all T values, the corresponding position of the B matrix is 1; otherwise it is 0.
Clearly, the B matrices of all layers together contain exactly Q ones, with all other entries 0; that is, in each training iteration only the gradients of Q parameters participate in the update, which is called an "active update". The updates of the other parameters result only from weight decay and are called "negative updates".
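A possible construction of the B matrices from the T values is sketched below; this is a sketch under the assumption that the T values are held per layer in a dictionary of PyTorch tensors (as in the previous sketch), and ties at the threshold may admit slightly more than Q ones.

```python
import torch

def build_masks(importance, P):
    """Build binary masks B with approximately Q = |Θ|/P ones in total across all layers."""
    all_T = torch.cat([t.flatten() for t in importance.values()])
    Q = max(1, all_T.numel() // P)                           # Q = |Θ| / P
    threshold = torch.topk(all_T, Q, largest=True).values.min()
    # positions whose T value is >= the Q-th largest global T value receive a 1
    return {name: (t >= threshold).float() for name, t in importance.items()}
```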
S3: pruning the trained model;
The DNN is trained for a sufficient number of iterations, each iteration being updated through the rules introduced in S1 and S2; training is then complete, and in the resulting model all parameters other than Q of them are close to 0. Setting these parameters to 0 yields a model with only Q non-zero parameters.
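Purely as an illustration of step S3 (a PyTorch model is again assumed; the threshold rule below retains the Q entries of largest magnitude, which after training with the above rules are the only ones not already near 0), the final pruning could look like this:

```python
import torch

@torch.no_grad()
def prune_to_Q(model, Q):
    """Zero every parameter outside the Q entries of largest absolute value."""
    all_w = torch.cat([p.flatten().abs() for p in model.parameters()])
    threshold = torch.topk(all_w, Q, largest=True).values.min()
    for p in model.parameters():
        p.mul_((p.abs() >= threshold).float())   # set the near-zero parameters exactly to 0
    return sum(int((p != 0).sum()) for p in model.parameters())   # roughly Q non-zero parameters left
```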
Preferably, SGD refers to stochastic gradient descent and DNN refers to deep neural network.
Preferably, in each training iteration of S1, the "activity screening" technique is applied to select the parameters that are important for the data of the current iteration; an "active update" is performed on these parameters and a "negative update" on the remaining parameters.
Preferably, in S1, the small portion of parameters with the largest T values is selected for "active update", and the remaining parameters receive a "negative update".
Preferably, in S2, as training progresses, the unimportant parameters are "negatively updated" again and again and therefore move closer and closer to 0, eventually approaching 0 arbitrarily closely.
Preferably, in S2, when training is complete, the absolute values of all parameters of the model except Q of them are close to 0.
Preferably, setting these parameters, which are very close to 0, to 0 does not affect the accuracy of the model.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides global sparse momentum SGD, a novel stochastic gradient descent (SGD) optimization method, and applies it to DNN pruning. In each training iteration, the usual SGD method updates all parameters with the gradients derived from the objective function. In global sparse momentum SGD, by contrast, only a small number of more important parameters are updated with the gradient of the objective function, while most parameters are updated only by weight decay. As training progresses, most parameters therefore become arbitrarily close to 0, so when training is finished, removing these near-zero parameters does not affect the accuracy of the network.
2. The global sparse momentum SGD described above enables the invention to achieve very high compression ratios (i.e. very low non-zero parameter ratios) on DNNs without significant loss of accuracy, which allows the sparse model produced by pruning to be stored in far less storage space, thereby achieving a better balance between accuracy and efficiency.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a technical scheme that: a deep neural network pruning method based on a global sparse momentum SGD comprises the following steps:
s1: activity screening;
in each training iteration, an activity screening technology is applied to select a small part of parameters which are important to the data of the current iteration, activity updating is carried out on the parameters, and negative updating is carried out on most of the parameters. The activity screening was performed as follows.
First, we introduce a general method of optimization of random gradient descent of driving variables (momentum SGD), let k be the number of iterations, L be the objective function, α be the learning rate, w be a certain parameter, z be the accumulated momentum of the parameter, η be the weight attenuation strength (i.e., L2 regularization coefficient), β be the momentum coefficient (usually 0.9), in the forward propagation process, the objective function value is calculated from the input data and labels.
z(k+1) ← βz(k) + ηw(k) + ∂L/∂w(k)
w(k+1) ← w(k) - αz(k+1)
Note that here the momentum z of w is updated first and then w is updated with z.
In the present invention, we first decide which parameters to actively update, which requires a measure of the importance of each parameter to the model in one iteration. Let the input data of this iteration be x and the label be y; for any parameter w, its importance measure is:
T(w) = |∂L(x, y, Θ)/∂w · w|
Here Θ denotes the set of all current parameters of the entire model. L(x, y, Θ) denotes the loss function value of the current model on the inputs x and y, and the partial derivative of this value with respect to w is the gradient of w.
In S1, according to the above formula, at the beginning of each training iteration the gradient of each parameter is obtained through back propagation and the importance measure (i.e. T value) of each parameter is computed; the small portion of parameters with the largest T values is then selected for "active update", and the remaining parameters receive a "negative update".
S2: active and negative updates;
After the importance measure T of each parameter has been obtained in S1, let the required global compression rate be P (i.e. a fraction 1/P of all parameters of the whole network are non-zero), let |Θ| denote the total number of parameters, and let Q be the total number of non-zero parameters, which is clearly |Θ|/P.
The key point of the invention is that, in each training iteration, the Q parameters with the largest T values are updated using their gradients, while the other parameters are updated using weight decay only.
Assuming that the parameter matrix of a certain layer is W and the corresponding accumulated momentum is Z, the update rule can be formally expressed as:
Z(k+1) ← βZ(k) + ηW(k) + B(k) ⊙ ∂L/∂W(k)
W(k+1) ← W(k) - αZ(k+1)
Here ⊙ denotes the element-wise multiplication of two matrices of the same size. The B matrix is determined as follows:
If the T value corresponding to a certain element of W (i.e. a certain parameter of the DNN) is among the top Q of the T values of all parameters (i.e. greater than or equal to the Q-th largest of all T values), then the corresponding position of the B matrix is 1; otherwise it is 0.
Obviously, the B matrices of all layers together contain exactly Q ones, with all other entries 0; that is, in each training iteration only the gradients of Q parameters participate in the update, which is called "active updating", while the updates of the other parameters result only from weight decay and are called "negative updating".
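For concreteness, one such iteration for a single layer can be sketched as follows; W, Z, B and grad are assumed to be same-shaped arrays (NumPy is used only for illustration), and alpha, beta and eta are the learning rate, momentum coefficient and weight decay strength defined earlier.

```python
import numpy as np

def gsm_layer_step(W, Z, B, grad, alpha, beta, eta):
    """One global-sparse-momentum update of a layer with parameter matrix W."""
    # positions where B == 1 receive the gradient term ("active update");
    # positions where B == 0 accumulate only the weight decay term ("negative update")
    Z = beta * Z + eta * W + B * grad
    W = W - alpha * Z
    return W, Z

# tiny usage example with a 2x2 layer in which only one entry is active
W = np.array([[0.5, -0.3], [0.2, 0.8]])
Z = np.zeros_like(W)
B = np.array([[1.0, 0.0], [0.0, 0.0]])
grad = np.array([[0.1, 0.4], [-0.2, 0.3]])
W, Z = gsm_layer_step(W, Z, B, grad, alpha=0.01, beta=0.9, eta=1e-4)
```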
S3: pruning the trained model;
and (3) carrying out enough times of training iterations on the DNN, wherein after each iteration is updated through the updating rules introduced in S1 and S2, the training is completed, and the absolute values of other parameters except Q parameters in the model are very close to 0. In this case, the accuracy of the model is not affected by setting these parameters very close to 0. Thus, a DNN model with only Q non-zero parameters is obtained.
The invention provides a global sparse momentum SGD (sparse gradient D), a novel SGD (random gradient descent) optimization method, and is used for DNN pruning. In each training iteration, the usual SGD method updates all parameters with the gradient found by the objective function. However, in the global sparse momentum SGD, only a few more important parameters are updated with the gradient of the objective function, and most parameters are updated only with weight decay (weight decay). Thus, as training progresses, most parameters become infinitely close to 0. Therefore, after training is finished, the accuracy of the network cannot be influenced by removing the parameters infinitely close to 0; the global sparse momentum SGD according to the above allows the present invention to achieve very high compression ratios (i.e. very low non-zero parameter ratios) on DNN without significant loss of precision, which allows the sparse model generated by pruning to be stored with much less storage space, thus achieving a better balance of precision and efficiency, wherein the compression results are as in table 1 below.
TABLE 1 compression results table
[Table 1 is provided as an image in the original publication.]
Here, a compression ratio of 300X means that only 1/300 of all the parameters of the model are non-zero.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents; the invention is not limited to the embodiments described above, and all changes and modifications that fall within the scope of the invention are intended to be embraced by the appended claims.

Claims (7)

1. A deep neural network pruning method based on global sparse momentum SGD, characterized by comprising the following steps:
S1: activity screening; the specific implementation is as follows:
the method is based on the common momentum stochastic gradient descent optimization method, wherein k is the iteration number, L is the objective function, α is the learning rate, w is a given parameter, z is the accumulated momentum of that parameter, η is the weight decay strength, and β is the momentum coefficient; in the forward propagation process, the objective function value is computed from the input data and labels; in the backward propagation process, the partial derivative of the objective function with respect to each parameter is computed as its gradient; each parameter is then updated according to the following rule:
z(k+1) ← βz(k) + ηw(k) + ∂L/∂w(k)
w(k+1) ← w(k) - αz(k+1)
the momentum z of w is updated first, and w is then updated using z;
let the input data of the current iteration be x and the label be y; for any parameter w, its importance measure is:
T(w) = |∂L(x, y, Θ)/∂w · w|
wherein Θ represents the set of all current parameters of the whole model; L(x, y, Θ) represents the loss function value of the current model on the input x and label y, and the partial derivative of this value with respect to w is the gradient of w;
in S1, according to the above formula, at the beginning of each training iteration the gradient of each parameter is obtained through back propagation, and the importance measure of each parameter, i.e. its T value, is then computed;
S2: active and negative updates;
after the importance measure T of each parameter has been obtained in S1, the required global compression rate is set to P, i.e. a fraction 1/P of all parameters of the entire network are non-zero; the total number of parameters is denoted |Θ| and the total number of non-zero parameters is Q, so that clearly Q = |Θ|/P;
in each training iteration, the Q parameters with the largest T values are updated using their gradients, while the other parameters are updated using weight decay only;
assuming that the parameter matrix of a certain layer is W and the corresponding accumulated momentum is Z, the update rule can be formally expressed as:
Z(k+1) ← βZ(k) + ηW(k) + B(k) ⊙ ∂L/∂W(k)
W(k+1) ← W(k) - αZ(k+1)
where ⊙ denotes the element-wise multiplication of two matrices of the same size, and the B matrix is determined as follows:
if the T value corresponding to an element of W, i.e. a parameter of the DNN, is among the top Q of the T values of all parameters, i.e. greater than or equal to the Q-th largest of all T values, the corresponding position of the B matrix is 1; otherwise it is 0;
clearly, the B matrices of all layers together contain exactly Q ones, with all other entries 0; that is, in each training iteration only the gradients of Q parameters participate in the update, which is called an "active update", while the updates of the other parameters result only from weight decay and are called "negative updates";
S3: pruning the trained model;
the DNN is trained for a sufficient number of iterations, each iteration being updated through the update rules introduced in S1 and S2; training is then complete, and a DNN model with only Q non-zero parameters is obtained.
2. The deep neural network pruning method based on global sparse momentum SGD according to claim 1, characterized in that: SGD refers to stochastic gradient descent and DNN refers to deep neural network.
3. The deep neural network pruning method based on global sparse momentum SGD according to claim 1, characterized in that: in each training iteration of S1, the "activity screening" technique is applied to select the parameters that are important for the data of the current iteration; an "active update" is performed on these parameters and a "negative update" on the remaining parameters.
4. The deep neural network pruning method based on global sparse momentum SGD according to claim 1, characterized in that: in S1, the small portion of parameters with the largest T values is selected for "active update", and the remaining parameters receive a "negative update".
5. The deep neural network pruning method based on global sparse momentum SGD according to claim 1, characterized in that: in S2, as training progresses, the unimportant parameters are "negatively updated" again and again and therefore move closer and closer to 0, eventually approaching 0 arbitrarily closely.
6. The deep neural network pruning method based on global sparse momentum SGD according to claim 1, characterized in that: in S2, when training is complete, the absolute values of all parameters of the model except Q of them are close to 0.
7. The deep neural network pruning method based on global sparse momentum SGD according to claim 6, characterized in that: setting these parameters, which are very close to 0, to 0 does not affect the accuracy of the model.
CN201911202397.1A 2019-11-29 2019-11-29 Deep neural network pruning method based on global sparse momentum SGD Pending CN110942141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911202397.1A CN110942141A (en) 2019-11-29 2019-11-29 Deep neural network pruning method based on global sparse momentum SGD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911202397.1A CN110942141A (en) 2019-11-29 2019-11-29 Deep neural network pruning method based on global sparse momentum SGD

Publications (1)

Publication Number Publication Date
CN110942141A true CN110942141A (en) 2020-03-31

Family

ID=69909354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911202397.1A Pending CN110942141A (en) 2019-11-29 2019-11-29 Deep neural network pruning method based on global sparse momentum SGD

Country Status (1)

Country Link
CN (1) CN110942141A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112734029A (en) * 2020-12-30 2021-04-30 中国科学院计算技术研究所 Neural network channel pruning method, storage medium and electronic equipment
WO2024083180A1 (en) * 2022-10-20 2024-04-25 International Business Machines Corporation Dnn training algorithm with dynamically computed zero-reference.
CN116562346A (en) * 2023-07-07 2023-08-08 深圳大学 L0 norm-based artificial neural network model compression method and device
CN116562346B (en) * 2023-07-07 2023-11-10 深圳大学 L0 norm-based artificial neural network model compression method and device

Similar Documents

Publication Publication Date Title
CN110942141A (en) Deep neural network pruning method based on global sparse momentum SGD
Sharma Deep challenges associated with deep learning
CN109902183B (en) Knowledge graph embedding method based on diverse graph attention machine mechanism
CN107688849A (en) A kind of dynamic strategy fixed point training method and device
CN111985523A (en) Knowledge distillation training-based 2-exponential power deep neural network quantification method
Yu et al. Accelerating convolutional neural networks by group-wise 2D-filter pruning
Hacene et al. Attention based pruning for shift networks
US20200125960A1 (en) Small-world nets for fast neural network training and execution
CN109146057A (en) A kind of high-precision neural network engineering method based on computation of table lookup
Chang et al. Automatic channel pruning via clustering and swarm intelligence optimization for CNN
CN110110860B (en) Self-adaptive data sampling method for accelerating machine learning training
Sbrana et al. N-BEATS-RNN: Deep learning for time series forecasting
AU2021102597A4 (en) Remote sensing image classification method based on pruning compression neural network
US20220207374A1 (en) Mixed-granularity-based joint sparse method for neural network
KOO Back-propagation
CN109074348A (en) For being iterated the equipment and alternative manner of cluster to input data set
CN106250686B (en) A kind of collective communication function modelling method of concurrent program
Rong et al. Soft Taylor pruning for accelerating deep convolutional neural networks
CN111967528A (en) Image identification method for deep learning network structure search based on sparse coding
CN110766072A (en) Automatic generation method of computational graph evolution AI model based on structural similarity
Sarkar et al. An incremental pruning strategy for fast training of CNN models
CN111291898B (en) Multi-task sparse Bayesian extreme learning machine regression method
CN111783977B (en) Neural network training process intermediate value storage compression method and device based on regional gradient update
CN111783976A (en) Neural network training process intermediate value storage compression method and device based on window gradient updating
CN112488248A (en) Method for constructing proxy model based on convolutional neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200331)