CN114372572A - Residual error neural network compression method based on activation information - Google Patents

Residual error neural network compression method based on activation information

Info

Publication number
CN114372572A
CN114372572A (application CN202210045279.XA)
Authority
CN
China
Prior art keywords
residual
degree
gradient
calculating
hidden layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210045279.XA
Other languages
Chinese (zh)
Inventor
秦国庆 (Qin Guoqing)
夏应林 (Xia Yinglin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jilin Xuyuguan Technology Co.,Ltd.
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202210045279.XA priority Critical patent/CN114372572A/en
Publication of CN114372572A publication Critical patent/CN114372572A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, in particular to a residual neural network compression method based on activation information. The method trains a residual neural network to determine the weights of each layer in the network; obtains the overall gradient anomaly degree of each hidden layer from the activation information, and from it calculates the necessity of performing a residual operation between each hidden-layer association combination; and compresses and simplifies the residual neural network according to the removal reasonableness of each residual operation combination derived from that necessity. By evaluating the necessity of residual operations and the reasonableness of removing them, residual operations with little influence are removed from the neural network, so that the network is compressed, its demands on the storage space and computing performance of hardware devices are reduced, and it can be deployed on low-power equipment.

Description

Residual error neural network compression method based on activation information
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a residual error neural network compression method based on activation information.
Background
To improve a network's feature-extraction capability and inference accuracy, so that the loss function converges faster and to a lower value, the number of layers of the neural network is usually increased. In practice, however, increasing the number of layers is accompanied by vanishing or exploding gradients. This problem is currently addressed by residual operations inside the neural network.
However, some residual operations in current networks are not actually effective, yet they still consume computation. Therefore, to run such a network on low-power devices, the residual operations that have little influence on the network need to be removed so that the network can be compressed.
Disclosure of Invention
In order to solve the above technical problem, an object of the present invention is to provide a residual neural network compression method based on activation information, and the adopted technical solution is specifically as follows:
training a residual error neural network to determine the weight of each layer in the network;
acquiring activation information of each neuron in the current hidden layer based on the weight, calculating the gradient abnormal degree of each neuron by using the activation information, and calculating the overall gradient abnormal degree of the current hidden layer according to the gradient abnormal degree of each neuron; respectively calculating the necessity of performing residual operation between each hidden layer association combination based on the overall gradient abnormal degree;
calculating the preference degree of each residual operation according to the necessity of the residual operation, calculating the average preference degree of each residual operation combination by using the preference degree of each residual operation, and sequencing a plurality of residual operation combinations from large to small based on the average preference degree to obtain a combination sequence; and sequentially removing one residual operation combination in the combination sequence, calculating the reasonable removal degree of the residual operation combination according to the loss function of the residual neural network, and compressing and simplifying the residual neural network according to the reasonable removal degree.
Further, the method for obtaining activation information of each neuron in the current hidden layer based on the weight includes:
and carrying out weighted summation on the input value of the previous layer of the current hidden layer and the corresponding weight value of the previous layer, and substituting the summation result into an activation function formula to obtain the activation information of the corresponding neuron in the current hidden layer.
Further, the method for calculating the abnormal gradient degree of each neuron by using the activation information comprises the following steps:
and deriving the activation information to obtain a derivative, setting the maximum gradient of the activation function, calculating the gradient abnormal degree of the corresponding neuron by combining the derivative and the maximum gradient, wherein the derivative and the gradient abnormal degree are in a negative correlation relationship.
Further, the method for calculating the overall gradient anomaly degree of the current hidden layer from the gradient anomaly degree of each neuron comprises the following steps:
setting an abnormal degree threshold, when the abnormal degree of the gradient is greater than or equal to the abnormal degree threshold, considering the gradient of the neuron to be normal, counting a first number of the neurons corresponding to the normal gradient in the current hidden layer, calculating a ratio between the first number and the total number of the neurons in the current hidden layer, and taking the ratio as the whole abnormal degree of the gradient of the current hidden layer.
Further, the method for calculating the necessity of the residual error operation includes:
and carrying out weighted summation on the overall gradient abnormal degree of each hidden layer in the hidden layer association combination, further calculating an abnormal degree average value, taking the abnormal degree average value as the necessity of carrying out residual error operation on the corresponding hidden layer association combination, wherein the weight corresponding to each hidden layer is set according to the sequence of the layer number of the hidden layer.
Further, the method for calculating the preference degree of each residual operation according to the necessity of the residual operation comprises:
sequencing each residual operation from large to small according to the importance of the residual operation, and numbering;
and calculating the ratio between the total number of residual operations and the corresponding number of the current residual operations, and calculating the preference degree of the current residual operations by combining the ratio and the necessity thereof.
Further, the method for obtaining the reasonable degree of removal of the residual operation combination includes:
calculating a difference value between loss function values of the corresponding residual error neural networks before and after the residual error operation combination is removed, and obtaining the reasonable removal degree corresponding to the residual error operation combination according to the difference value, wherein the difference value and the reasonable removal degree are in a positive correlation relationship.
Further, the method for compressing and simplifying the residual neural network according to the reasonable degree of removal comprises the following steps:
and sequentially calculating the removal reasonable degree of each residual operation combination according to the combination sequence, and removing the corresponding residual operation combination and the subsequent residual operation combinations when the removal reasonable degree is greater than a reasonable threshold value.
Further, the calculation formula of the number of residual operation combinations is:
C(SN, s) = SN!/(s!(SN - s)!)
wherein SN is the total number of the residual operations; s is the number of residual operations contained in the combination of residual operations.
The embodiment of the invention has at least the following beneficial effects: by evaluating the necessity of residual operations and the reasonableness of removing them, residual operations with little influence are removed from the neural network, so that the network is compressed, its demands on the storage space and computing performance of hardware devices are reduced, and it can be deployed on low-power equipment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart illustrating steps of a residual neural network compression method based on activation information according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a residual error operation according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a residual neural network ResNet-152 according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means and effects adopted by the present invention to achieve its intended purpose, the residual neural network compression method based on activation information, its specific implementation, structure, features and effects are described in detail below in conjunction with the accompanying drawings and the preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of the residual neural network compression method based on activation information in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart illustrating steps of a residual neural network compression method based on activation information according to an embodiment of the present invention is shown, where the method includes the following steps:
and S001, training a residual error neural network to determine the weight of each layer in the network.
Specifically, referring to fig. 2, a residual operation is an operation adopted to avoid the vanishing-gradient or exploding-gradient problem in a deep (many-layer) neural network. Feature extraction (an operation such as convolution) applied to data X yields F(X); to prevent abnormalities in the extracted information F(X) from harming subsequent feature extraction, the extracted data must carry no less information than the original data X, so the original data is linearly superimposed after feature extraction, i.e., H(X) = F(X) + X.
A residual neural network is a network in which, after each feature extraction (residual operation), the data before extraction is linearly superimposed on the result. Referring to fig. 3, which shows a schematic structural diagram of the residual neural network ResNet-152, the network has 152 convolutional layers in total and belongs to the deep neural networks; to prevent the gradient problem, a residual operation is performed after every two convolution operations (Conv), shown as the black curves spanning every two convolutional layers in the figure.
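As an illustration of the residual operation H(X) = F(X) + X, the following is a minimal sketch of a two-convolution residual block; PyTorch, the layer sizes and the class name are assumptions for illustration, since the patent does not prescribe an implementation:

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Two convolutions F(X) followed by the skip addition H(X) = F(X) + X."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.act = nn.Sigmoid()  # the embodiment uses the sigmoid activation

        def forward(self, x):
            f = self.act(self.conv1(x))
            f = self.conv2(f)
            return self.act(f + x)  # linear superposition of the original data X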
Training a residual error neural network by using the collected data set, wherein the specific training process comprises the following steps:
(1) The acquired data are divided in a ratio of 8:2, with 80% of the data used as the training set of the network and 20% used as the test set.
(2) The network obtains its initial internal weight parameters randomly, computes an inference result with these random parameters, and then adjusts the internal weight parameters by back-propagating the difference between the inference value and the label value.
(3) The activation function of the network adopts a sigmoid function.
(4) The loss functions of the network can be roughly divided into two types according to different task types, cross entropy loss functions are adopted for classification tasks, and mean square error loss functions are adopted for regression tasks.
(5) When the number of rounds of network training reaches a set stopping condition or the loss function converges to a set value, the network training can be stopped, and the network training is completed at the moment.
It should be noted that, in the trained residual error neural network, the weights of all layers in the network are determined and are fixed values.
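A hedged sketch of training steps (1) to (5) follows; the optimiser, batch size, learning rate and stopping values are illustrative assumptions rather than values prescribed by the embodiment:

    import torch
    from torch.utils.data import DataLoader, random_split

    def train_residual_network(model, dataset, task="classification",
                               max_epochs=100, loss_target=1e-3, lr=1e-3):
        # (1) split the collected data 8:2 into a training set and a test set
        n_train = int(0.8 * len(dataset))
        train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
        loader = DataLoader(train_set, batch_size=32, shuffle=True)

        # (4) cross-entropy loss for classification, mean squared error for regression
        criterion = (torch.nn.CrossEntropyLoss() if task == "classification"
                     else torch.nn.MSELoss())
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)

        # (2) weights start from random initialisation; the loss between the
        # inference result and the label is back-propagated to adjust them
        for epoch in range(max_epochs):
            epoch_loss = 0.0
            for x, y in loader:
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()
            # (5) stop when the round limit is reached or the loss converges
            if epoch_loss / len(loader) < loss_target:
                break
        return model, test_set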
S002, acquiring activation information of each neuron in the current hidden layer based on the weight, calculating the gradient abnormal degree of each neuron by using the activation information, and calculating the overall gradient abnormal degree of the current hidden layer according to the gradient abnormal degree of each neuron; and respectively calculating the necessity of performing residual operation between each implicit layer association combination based on the overall gradient abnormal degree.
Specifically, because vanishing or exploding gradients are mostly caused by the gradient of the activation function being too small, the test-set data are input one by one into the trained residual neural network to obtain the activation information of each hidden layer. The activation information is obtained as follows:
(1) and carrying out weighted summation on the input value of the previous layer of the current hidden layer and the corresponding weight value of the previous layer, and substituting the summation result into an activation function formula to obtain the activation information of the corresponding neuron in the current hidden layer.
Specifically, assuming that the summation result of any neuron in the current hidden layer is x, the summation result is substituted into an activation function formula to obtain activation information sg of the neuron, where the activation information sg is:
sg(x) = 1/(1 + e^(-x))
(2) and deriving the activation information to obtain a derivative, setting the maximum gradient of the activation function, calculating the gradient abnormal degree of the corresponding neuron by combining the derivative and the maximum gradient, wherein the derivative and the gradient abnormal degree are in a negative correlation relationship.
Specifically, the derivative formula is:
sd(x) = sg(x)(1 - sg(x)) = e^(-x)/(1 + e^(-x))^2
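Written out as plain functions, the activation information sg and its derivative sd used below are (a minimal sketch; the function names follow the text):

    import numpy as np

    def sg(x):
        """Activation information: sigmoid of the weighted sum x."""
        return 1.0 / (1.0 + np.exp(-x))

    def sd(x):
        """Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0."""
        s = sg(x)
        return s * (1.0 - s)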
the purpose of calculating the derivative of the activation information corresponding to each neuron is to perform a residual operation more necessary if the total number of derivatives of the activation function in the current hidden layer is too small.
And calculating the gradient abnormal degree corresponding to each neuron according to the derivative, wherein the calculation formula is as follows:
pb(x) = 1 - e^((sd(x) - yk)/yk)
where pb(x) is the gradient abnormality degree; yk is the maximum gradient of the activation function, set to 0.25 in the embodiment of the invention; since the values of sd(x) are all smaller than yk, (sd(x) - yk)/yk lies in the range [-1, 0].
It should be noted that the smaller sd(x) is (tending to 0), the more sd(x) - yk tends to -yk, so e^((sd(x) - yk)/yk) tends to e^(-1); the smaller this value, the larger the gradient abnormality degree pb. Conversely, the larger sd(x) is (tending to yk), the more sd(x) - yk tends to 0, so e^((sd(x) - yk)/yk) tends to e^0; the closer this value is to 1, the smaller the gradient abnormality degree pb.
(3) Setting an abnormal degree threshold, when the abnormal degree of the gradient is greater than or equal to the abnormal degree threshold, considering the gradient of the neuron to be normal, counting a first number of the neurons corresponding to the normal gradient in the current hidden layer, calculating a ratio between the first number and the total number of the neurons in the current hidden layer, and taking the ratio as the overall abnormal degree of the gradient of the current hidden layer.
Specifically, an abnormal degree threshold nk is set, when the abnormal degree of the gradient is greater than the abnormal degree threshold, the gradient of the corresponding neuron is normal, otherwise, when the abnormal degree of the gradient is less than the abnormal degree threshold, the gradient of the corresponding neuron is abnormal, and then the judgment formula is:
the gradient of the neuron is normal if pb(x) >= nk, and abnormal if pb(x) < nk
preferably, in the embodiment of the present invention, the threshold value of the degree of abnormality is an empirical value, and nk is 0.7.
Counting a first quantity S of neurons corresponding to normal gradient in each hidden layer, substituting the first quantity S into a calculation formula of the overall gradient abnormal degree to obtain the overall gradient abnormal degree of the corresponding hidden layer, wherein the calculation formula of the overall gradient abnormal degree is as follows:
yc=S/I
wherein yc is the overall gradient anomaly; i is the total number of neurons contained in the hidden layer.
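Continuing the sketch above (reusing sd and numpy as np), the per-neuron gradient abnormality, the threshold decision with nk = 0.7 and the overall gradient anomaly yc = S/I of a hidden layer could look as follows; the function names are illustrative:

    def gradient_abnormality(x, yk=0.25):
        """pb(x) = 1 - e^((sd(x) - yk)/yk) for the weighted-sum input x of one neuron."""
        return 1.0 - np.exp((sd(x) - yk) / yk)

    def overall_gradient_anomaly(layer_inputs, yk=0.25, nk=0.7):
        """yc = S/I, where S counts the neurons whose gradient abnormality pb(x)
        reaches the threshold nk and I is the total number of neurons in the layer."""
        pb_values = [gradient_abnormality(x, yk) for x in layer_inputs]
        S = sum(1 for p in pb_values if p >= nk)  # first quantity S
        I = len(layer_inputs)                     # total number of neurons
        return S / I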
Further, since a residual operation spans hidden layers and the number of layers spanned is variable, the necessity of a residual operation over an associated group of N hidden layers must take the span into account. The necessity is therefore calculated as follows: the overall gradient anomaly degrees of the hidden layers in the hidden-layer association combination are weighted and summed, and the resulting anomaly average is taken as the necessity of performing a residual operation on that combination, where the weight of each hidden layer is set according to the order of its layer number.
As an example, the calculation formula of the necessity of the residual operation in the embodiment of the present invention is:
[formula given only as an image in the original: By is a weighted average of the overall gradient anomaly yc over the N hidden layers in the combination, with weights set by the layer numbers n]
where By is the necessity of the residual operation; N is the number of hidden layers contained in the hidden-layer association combination; n is the layer number of each hidden layer in the combination.
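Because the necessity formula above appears only as an image, the following sketch assumes one concrete reading of the description: the overall gradient anomaly of each hidden layer is weighted by its layer number n and the weighted sum is averaged over the N layers of the combination. The weighting is an assumption, not the patent's exact formula:

    def residual_necessity(layer_numbers, yc_values):
        """By for one hidden-layer association combination.
        layer_numbers: layer number n of each hidden layer in the combination.
        yc_values: overall gradient anomaly yc of each of those layers.
        Assumes weights equal to the layer numbers (an interpretation of the text)."""
        N = len(layer_numbers)
        weighted_sum = sum(n * yc for n, yc in zip(layer_numbers, yc_values))
        return weighted_sum / N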
Step S003, calculating the preference degree of each residual operation according to the necessity of the residual operation, calculating the average preference degree of each residual operation combination by using the preference degree of each residual operation, and sequencing a plurality of residual operation combinations from large to small based on the average preference degree to obtain a combination sequence; and sequentially removing a residual operation combination in the combination sequence, calculating the reasonable removal degree of the residual operation combination according to the loss function of the residual neural network, and compressing and simplifying the residual neural network according to the reasonable removal degree.
Specifically, after the necessity of each residual operation is obtained, combinations of residual operations to be removed must be selected according to that necessity. Because the neural network has a serial structure, later residual operations are affected by earlier ones, so the embodiment of the present invention computes a removal preference degree from both the position and the necessity of each residual operation; this yields preference degrees for the various combinations and allows them to be verified in order later. The processing is as follows:
(1) Sort the residual operations by importance from largest to smallest and number them accordingly; then compute the ratio of the total number of residual operations to the number of the current residual operation, and compute the preference degree of the current residual operation from this ratio and its necessity.
Specifically, the calculation formula of the preferred degree is as follows:
[formula given only as an image in the original: yx_s combines the ratio SN/s with the necessity By_s of the s-th residual operation]
where yx_s is the preference degree of the s-th residual operation; SN is the total number of residual operations; By_s is the necessity of the s-th residual operation.
(2) Obtaining the number of residual operation combinations according to the total number SN of the residual operations, wherein the calculation formula of the number of the residual operation combinations is as follows:
C(SN, s) = SN!/(s!(SN - s)!)
wherein SN is the total number of the residual operations; s is the number of residual operations contained in the combination of residual operations.
(3) And calculating the average preference degree of each residual operation combination according to the preference degree of each residual operation in the residual operation combinations, and sequencing all the residual operation combinations from large to small according to the average preference degree to obtain a combination sequence.
Specifically, the calculation formula of the average preference degree is as follows:
xz = (1/k) Σ yx_s, summed over the residual operations contained in the combination
wherein xz is the average preference degree of the residual error operation combination; k is the number of residual operations included in the combination of residual operations.
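A sketch of steps (1) to (3) above follows. Since the preference-degree formula appears only as an image, the combination of the ratio SN/s with the necessity By_s is assumed here to be a product, and the enumeration considers all fixed-size subsets of residual operations; both are assumptions for illustration:

    from itertools import combinations

    def preference_degrees(necessities):
        """yx_s for residual operations already sorted by importance, largest first.
        necessities[s-1] is By_s of the s-th operation.
        Assumes yx_s = (SN / s) * By_s as the combination of ratio and necessity."""
        SN = len(necessities)
        return [(SN / s) * by for s, by in enumerate(necessities, start=1)]

    def ranked_combinations(preferences, size):
        """Enumerate all combinations of `size` residual operations, compute the
        average preference xz of each, and sort them from large to small."""
        ranked = []
        for combo in combinations(range(len(preferences)), size):
            xz = sum(preferences[i] for i in combo) / len(combo)
            ranked.append((xz, combo))
        ranked.sort(key=lambda item: item[0], reverse=True)
        return ranked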
Further, to verify that removing a given combination of residual operations is reasonable, i.e., that it does not have an excessive influence on the neural network, the removal reasonableness of the combination must be calculated on the verification set. After the combination is removed, the degree to which the network output on the verification data changes between the networks before and after removal is compared; the smaller the change, the more reasonable the removal. The removal reasonableness of a residual operation combination is obtained as follows: for the residual neural network, the difference between the network inference value and the label value is measured by the cross-entropy loss function. When the structure of the network changes, the inference value changes while the label value does not, so the loss function value changes accordingly; the influence of removing a residual operation combination can therefore be judged simply by comparing the change in the loss function.
Specifically, calculating a difference value between the loss function values of the corresponding residual error neural networks before and after the residual error operation combination is removed, obtaining a removal reasonable degree of the corresponding residual error operation combination according to the difference value, wherein the difference value and the removal reasonable degree are in a positive correlation relationship, and then the calculation formula for removing the reasonable degree is as follows:
[formula given only as an image in the original: po is computed from the loss values es_q and es_p and the convergence value min(es_q); a larger loss difference gives a larger po]
where po is the removal reasonableness; es_q is the loss function value obtained before the residual operation combination is removed; es_p is the loss function value obtained after the corresponding residual operation combination is removed; min(es_q) is the convergence value of the loss function of the residual neural network before removal.
A reasonableness threshold is set, and the removal reasonableness of each residual operation combination is calculated in turn following the combination sequence; when the removal reasonableness exceeds the threshold, the corresponding residual operation combination and the residual operation combinations after it are removed, which completes the compression and simplification of the residual neural network.
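Finally, a hedged sketch of the compression step. The removal-reasonableness formula appears only as an image, so the relative loss change (es_p - es_q)/min(es_q) is used as an assumed stand-in; evaluate_loss, remove_operations, es_q_min and the threshold value are illustrative parameters, not names from the patent:

    def removal_reasonableness(es_q, es_p, es_q_min):
        """Assumed form of po: loss change caused by the removal, normalised by the
        convergence value of the pre-removal loss function."""
        return (es_p - es_q) / es_q_min

    def compress_network(network, combo_sequence, evaluate_loss, remove_operations,
                         es_q_min, threshold=0.05):
        """Walk the combination sequence; when the removal reasonableness of a
        combination exceeds the threshold, remove it together with the combinations
        after it, as the embodiment describes."""
        es_q = evaluate_loss(network)  # verification-set loss before any removal
        for idx, combo in enumerate(combo_sequence):
            candidate = remove_operations(network, combo)  # network without this combination
            es_p = evaluate_loss(candidate)
            po = removal_reasonableness(es_q, es_p, es_q_min)
            if po > threshold:
                for later_combo in combo_sequence[idx:]:
                    network = remove_operations(network, later_combo)
                break
        return network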
In summary, the embodiment of the present invention provides a residual neural network compression method based on activation information, which trains a residual neural network to determine the weights of each layer in the network; obtains the overall gradient anomaly degree of each hidden layer from the activation information and from it calculates the necessity of performing a residual operation between each hidden-layer association combination; and compresses and simplifies the residual neural network according to the removal reasonableness of each residual operation combination derived from that necessity. By evaluating the necessity of residual operations and the reasonableness of removing them, residual operations with little influence are removed from the neural network, so that the network is compressed, its demands on the storage space and computing performance of hardware devices are reduced, and it can be deployed on low-power equipment.
It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A residual error neural network compression method based on activation information is characterized by comprising the following steps:
training a residual error neural network to determine the weight of each layer in the network;
acquiring activation information of each neuron in the current hidden layer based on the weight, calculating the gradient abnormal degree of each neuron by using the activation information, and calculating the overall gradient abnormal degree of the current hidden layer according to the gradient abnormal degree of each neuron; respectively calculating the necessity of performing residual operation between each hidden layer association combination based on the overall gradient abnormal degree;
calculating the preference degree of each residual operation according to the necessity of the residual operation, calculating the average preference degree of each residual operation combination by using the preference degree of each residual operation, and sequencing a plurality of residual operation combinations from large to small based on the average preference degree to obtain a combination sequence; and sequentially removing one residual operation combination in the combination sequence, calculating the reasonable removal degree of the residual operation combination according to the loss function of the residual neural network, and compressing and simplifying the residual neural network according to the reasonable removal degree.
2. The method of claim 1, wherein the method for obtaining activation information of each neuron in a current hidden layer based on the weight comprises:
and carrying out weighted summation on the input value of the previous layer of the current hidden layer and the corresponding weight value of the previous layer, and substituting the summation result into an activation function formula to obtain the activation information of the corresponding neuron in the current hidden layer.
3. The method of claim 2, wherein the method of using the activation information to calculate the degree of gradient abnormality for each neuron comprises:
and deriving the activation information to obtain a derivative, setting the maximum gradient of the activation function, calculating the gradient abnormal degree of the corresponding neuron by combining the derivative and the maximum gradient, wherein the derivative and the gradient abnormal degree are in a negative correlation relationship.
4. The method of claim 1, wherein said method of calculating an overall gradient anomaly degree for a current hidden layer from said gradient anomaly degree for each neuron comprises:
setting an abnormal degree threshold, when the abnormal degree of the gradient is greater than or equal to the abnormal degree threshold, considering the gradient of the neuron to be normal, counting a first number of the neurons corresponding to the normal gradient in the current hidden layer, calculating a ratio between the first number and the total number of the neurons in the current hidden layer, and taking the ratio as the whole abnormal degree of the gradient of the current hidden layer.
5. The method of claim 1, wherein the calculation of the necessity of the residual operation comprises:
and carrying out weighted summation on the overall gradient abnormal degree of each hidden layer in the hidden layer association combination, further calculating an abnormal degree average value, taking the abnormal degree average value as the necessity of carrying out residual error operation on the corresponding hidden layer association combination, wherein the weight corresponding to each hidden layer is set according to the sequence of the layer number of the hidden layer.
6. The method of claim 1, wherein the method of calculating a degree of preference for each residual operation based on the necessity of the residual operation comprises:
sequencing each residual operation from large to small according to the importance of the residual operation, and numbering;
and calculating the ratio between the total number of residual operations and the corresponding number of the current residual operations, and calculating the preference degree of the current residual operations by combining the ratio and the necessity thereof.
7. The method of claim 1, wherein the method for obtaining a reasonable degree of elimination of the residual operation combination comprises:
calculating a difference value between loss function values of the corresponding residual error neural networks before and after the residual error operation combination is removed, and obtaining the reasonable removal degree corresponding to the residual error operation combination according to the difference value, wherein the difference value and the reasonable removal degree are in a positive correlation relationship.
8. The method of claim 1, wherein the method of compression reduction of the residual neural network by a reasonable degree of said removing comprises:
and sequentially calculating the removal reasonable degree of each residual operation combination according to the combination sequence, and removing the corresponding residual operation combination and the subsequent residual operation combinations when the removal reasonable degree is greater than a reasonable threshold value.
9. The method of claim 1, wherein the number of combinations of residual operations is calculated by the formula:
C(SN, s) = SN!/(s!(SN - s)!)
wherein SN is the total number of the residual operations; s is the number of residual operations contained in the combination of residual operations.
CN202210045279.XA 2022-01-15 2022-01-15 Residual error neural network compression method based on activation information Pending CN114372572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210045279.XA CN114372572A (en) 2022-01-15 2022-01-15 Residual error neural network compression method based on activation information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210045279.XA CN114372572A (en) 2022-01-15 2022-01-15 Residual error neural network compression method based on activation information

Publications (1)

Publication Number Publication Date
CN114372572A true CN114372572A (en) 2022-04-19

Family

ID=81144303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210045279.XA Pending CN114372572A (en) 2022-01-15 2022-01-15 Residual error neural network compression method based on activation information

Country Status (1)

Country Link
CN (1) CN114372572A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230526

Address after: Zone C, No. 56 Hangzhou Road, Kuancheng District, Changchun City, Jilin Province, 130000 DS99-3-D538-928

Applicant after: Jilin Xuyuguan Technology Co.,Ltd.

Address before: 100084 Tsinghua Yuan, Beijing, Haidian District

Applicant before: Qin Guoqing