CN113837378A - Convolutional neural network compression method based on agent model and gradient optimization - Google Patents

Convolutional neural network compression method based on agent model and gradient optimization

Info

Publication number
CN113837378A
Authority
CN
China
Prior art keywords
network
model
pruning
micro
accuracy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111039434.9A
Other languages
Chinese (zh)
Inventor
刘德荣
李佳鑫
王永华
赵博
饶煊
吴球业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202111039434.9A priority Critical patent/CN113837378A/en
Publication of CN113837378A publication Critical patent/CN113837378A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Aiming at the limitations of the prior art, the invention provides a convolutional neural network compression method based on a proxy model and gradient optimization. The method uses the proxy model to predict the network accuracy of a candidate structure, so the network of each structure does not need to be trained, which greatly reduces the search time. Secondly, the method uses a differentiable structure parameter to generate the pruning rates and predicts the network accuracy directly through the proxy model, thereby establishing a direct relation between network accuracy and structure parameter, so the pruning rates can be trained directly and rapid, automatic pruning is realized. The method takes global information into consideration, so an optimal sub-network structure can be found. After the target network is pruned by the method, its parameter count and computation are effectively compressed while its accuracy drops only slightly or remains almost unchanged; the pruned network is hardware-friendly and can be deployed on many platforms.

Description

Convolutional neural network compression method based on agent model and gradient optimization
Technical Field
The invention relates to the technical field of deep learning, in particular to pruning techniques for convolutional neural networks in model compression, and more particularly to a convolutional neural network compression method based on a proxy model and gradient optimization.
Background
In recent years, deep neural networks have achieved great success in fields such as image classification, object detection, and semantic segmentation. However, huge network models are difficult to deploy on most resource-limited devices, which greatly limits the application of artificial intelligence. Therefore, many model compression methods have been proposed.
Neural network pruning is a common model compression method. For example, Chinese patent application publication No. CN111931914A, published on 2020.11.13, discloses a convolutional neural network channel pruning method based on model fine-tuning. Neural network pruning removes redundant parameters from a neural network by some criterion, so as to reduce the size and computation of the model while ensuring that the accuracy of the pruned network decreases only slightly.
In the prior art, some methods rely on manually designed pruning rules and pruning rates, where the pruning rate of each layer can only be determined through layer-by-layer experiments, which costs a great deal of time; moreover, because each layer is tested and pruned independently, the network is not treated as a whole and the interdependence among layers is ignored, so the resulting structure is not optimal. In some automatic pruning methods, the search space is discrete: gradient optimization is possible only after a continuous relaxation of the discrete space, or heuristic search is performed with evolutionary algorithms, simulated annealing, and the like, so the resulting structure may not be optimal, or a large time cost is needed to find the optimal structure. In addition, many existing methods train each candidate structure for a certain number of rounds in order to select a better structure during the search, which also takes a lot of time for training and evaluation. Thus, the prior art has certain limitations.
Disclosure of Invention
Aiming at the limitation of the prior art, the invention provides a convolutional neural network compression method based on a proxy model and gradient optimization, and the technical scheme adopted by the invention is as follows:
a convolutional neural network compression method based on a proxy model and gradient optimization comprises the following stages:
an initialization stage:
acquiring a proxy model, a target network to be compressed and a target compression rate of the target network, and initializing a differentiable structure parameter and an experience pool; entering the model preheating stage after initialization is completed;
a model preheating stage:
performing the following steps in each round of the model preheating stage: randomly generating a structure vector composed of the pruning rate of each layer of the target network according to the target compression rate, obtaining the network accuracy evaluation value corresponding to the structure vector, and adding the randomly generated structure vector and its network accuracy evaluation value into the experience pool as a sample;
after the number of samples in the experience pool exceeds a preset batch size, the following step is also performed in each round of the model preheating stage: extracting a batch of samples from the experience pool to train the proxy model; when the number of executed rounds exceeds the preset number of model preheating rounds, the model preheating stage ends and the search stage begins;
a searching stage:
when the number of executed rounds has not reached a preset maximum number of running rounds, performing the following steps in each round of the search stage: generating a structure vector from the differentiable structure parameter, and inputting the generated structure vector into the proxy model to obtain the corresponding network accuracy predicted value and floating-point operation count (FLOPs); optimizing and updating the differentiable structure parameter according to the network accuracy predicted value and the FLOPs;
when the number of executed rounds has not reached the preset number of search rounds, the following steps are also performed in each round of the search stage: obtaining the network accuracy evaluation value corresponding to the structure vector generated by the differentiable structure parameter, and adding that structure vector and its network accuracy evaluation value into the experience pool as a sample; extracting a batch of samples from the experience pool to train the proxy model;
when the number of executed rounds reaches the maximum number of running rounds, ending the searching stage and entering a network pruning stage;
pruning and recovering stages:
generating a structure vector from the differentiable structure parameter; pruning the target network by deleting filters according to the generated structure vector; and performing recovery training on the accuracy of the sub-network obtained by pruning;
wherein the maximum number of running rounds is greater than the number of search rounds and the number of model preheating rounds.
Compared with the prior art, the invention uses the proxy model to predict the network accuracy of a candidate structure, so the network of each structure does not need to be trained, which greatly reduces the search time. Secondly, the method uses a differentiable structure parameter to generate the pruning rates and predicts the network accuracy directly through the proxy model, thereby establishing a direct relation between network accuracy and structure parameter, so the pruning rates can be trained directly and rapid, automatic pruning is realized. The method takes global information into consideration, so an optimal sub-network structure can be found. After the target network is pruned by the method, its parameter count and computation are effectively compressed while its accuracy drops only slightly or remains almost unchanged; the pruned network is hardware-friendly and can be deployed on many platforms.
Preferably, the network accuracy evaluation value is obtained by:
pruning the target network by setting the outputs of the corresponding filter channels to 0 according to the structure vector, and evaluating the pruning result to obtain the network accuracy evaluation value of the structure vector.
Preferably, in the model preheating stage a Gaussian distribution N(μ, σ²) is used to randomly generate the structure vector a = (a1, a2, ..., an), where μ = ratio_target, σ = 1, and ratio_target is the target compression rate of the target network, i.e. the pruning rate of each layer of the target network.
As a preferred scheme, the proxy model is a multilayer perceptron trained by a stochastic gradient descent method, and the loss function of the proxy model is the root-mean-square loss:
loss = sqrt( (1/N) * Σ_{j=1..N} (SAcc_j - Acc_j)^2 );
where SAcc_j is the network accuracy predicted value corresponding to structure vector a_j, Acc_j is the network accuracy evaluation value corresponding to a_j, and N is the batch size.
As a preferred embodiment, in the search stage the structure vector a is generated from the differentiable structure parameter A = (A1, A2, ..., An) according to the following formula:
ai = sigmoid(Ai) * (1 - min_a) + min_a, i = 1, 2, ..., n;
where min_a denotes the minimum pruning rate, min_a ∈ [0, 1].
As a preferable scheme, the differentiable structure parameter is updated by a stochastic gradient descent method, and the loss function used when updating the differentiable structure parameter is:
loss = a_loss + γ*f_loss;
wherein a_loss = -SAcc, SAcc being the network accuracy predicted value; f_loss is the constraint term on the floating-point operation count (FLOPs); and γ is a penalty coefficient.
As a preferred solution, the sub-networks obtained by pruning in the pruning and recovery phase are trained to recover accuracy using a knowledge distillation method.
The present invention also provides the following:
a convolutional neural network compression system based on proxy model and gradient optimization, comprising:
an initialization module:
the method comprises the steps of obtaining a proxy model, a target network to be compressed and a target compression ratio of the target network, and initializing a micro-structure parameter and an experience pool; entering a model preheating stage after initialization is completed;
a model preheating module:
for performing the following steps in each round of the model preheating stage: randomly generating a structure vector composed of the pruning rate of each layer of the target network according to the target compression rate, obtaining the network accuracy evaluation value corresponding to the structure vector, and adding the randomly generated structure vector and its network accuracy evaluation value into the experience pool as a sample;
after the number of samples in the experience pool exceeds a preset batch size, the following step is also performed in each round of the model preheating stage: extracting a batch of samples from the experience pool to train the proxy model; when the number of executed rounds exceeds the preset number of model preheating rounds, the model preheating stage ends and the search stage begins;
a search module:
for performing the following steps in each round of the search stage when the number of executed rounds has not reached a preset maximum number of running rounds: generating a structure vector from the differentiable structure parameter, and inputting the generated structure vector into the proxy model to obtain the corresponding network accuracy predicted value and floating-point operation count (FLOPs); and optimizing and updating the differentiable structure parameter according to the network accuracy predicted value and the FLOPs;
when the number of executed rounds has not reached the preset number of search rounds, the following steps are also performed in each round of the search stage: obtaining the network accuracy evaluation value corresponding to the structure vector generated by the differentiable structure parameter, and adding that structure vector and its network accuracy evaluation value into the experience pool as a sample; extracting a batch of samples from the experience pool to train the proxy model;
when the number of executed rounds reaches the maximum number of running rounds, ending the searching stage and entering a network pruning stage;
pruning and recovering module:
for generating a structure vector from the differentiable structure parameter; pruning the target network by deleting filters according to the generated structure vector; and performing recovery training on the accuracy of the sub-network obtained by pruning;
wherein the maximum number of running rounds is greater than the number of search rounds and the number of model preheating rounds.
A medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the aforementioned convolutional neural network compression method based on a proxy model and gradient optimization.
A computer device comprising a medium, a processor and a computer program stored in the medium and executable by the processor, the computer program when executed by the processor implementing the steps of the aforementioned convolutional neural network compression method based on a proxy model and gradient optimization.
Drawings
Fig. 1 is a schematic diagram illustrating a progressive stage of a convolutional neural network compression method based on a proxy model and gradient optimization according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of the loop logic of the convolutional neural network compression method based on a proxy model and gradient optimization according to embodiment 1 of the present invention;
fig. 3 is a schematic diagram of a training principle provided in embodiment 1 of the present invention;
fig. 4 is a fitting effect diagram of the proxy model in embodiment 2 of the present invention;
fig. 5 is a schematic diagram of the trends of the model accuracy and the proxy model prediction as the structure parameters are optimized during the search process in embodiment 2 of the present invention;
fig. 6 is a schematic diagram of a convolutional neural network compression system based on a proxy model and gradient optimization according to embodiment 3 of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
it should be understood that the embodiments described are only some embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without any creative effort belong to the protection scope of the embodiments in the present application.
The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims. In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely to distinguish one from another and are not necessarily used to describe a particular order or sequence, nor are they to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. The invention is further illustrated below with reference to the figures and examples.
In order to solve the limitation of the prior art, the present embodiment provides a technical solution, and the technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
Example 1
Referring to fig. 1 and fig. 2, a convolutional neural network compression method based on a proxy model and gradient optimization includes the following stages:
s1, initialization stage:
acquiring a proxy model S, a target network M to be compressed and a target compression rate ratio_target of the target network M, and initializing a differentiable structure parameter A and an experience pool RelayBuffer; entering the model preheating stage after initialization is completed;
s2, model preheating stage:
performing the following steps in each round of the model preheating stage: randomly generating a structure vector composed of the pruning rate of each layer of the target network according to the target compression rate, obtaining the network accuracy evaluation value corresponding to the structure vector, and adding the randomly generated structure vector and its network accuracy evaluation value into the experience pool as a sample;
after the number of samples len(RelayBuffer) in the experience pool exceeds a preset batch size batch_size, the following step is also performed in each round of the model preheating stage: extracting a batch of batch_size samples from the experience pool RelayBuffer to train the proxy model; when the number of executed rounds epoch exceeds the preset number of model preheating rounds epoch_warmup, the model preheating stage ends and the search stage begins;
s3, search stage:
when the number of executed rounds epoch has not reached a preset maximum number of running rounds max_epoch, performing the following steps in each round of the search stage: generating a structure vector from the differentiable structure parameter, and inputting the generated structure vector into the proxy model to obtain the corresponding network accuracy predicted value and floating-point operation count (FLOPs); optimizing and updating the differentiable structure parameter according to the network accuracy predicted value and the FLOPs;
when the number of executed rounds epoch has not reached the preset number of search rounds epoch_S, the following steps are also performed in each round of the search stage: obtaining the network accuracy evaluation value corresponding to the structure vector generated by the differentiable structure parameter, and adding that structure vector and its network accuracy evaluation value into the experience pool as a sample; extracting a batch of samples from the experience pool to train the proxy model;
when the number of executed rounds epoch reaches the maximum number of running rounds max _ epoch, ending the searching stage and entering a network pruning stage;
s4, pruning and recovery stage:
generating a structure vector from the differentiable structure parameter; pruning the target network by deleting filters according to the generated structure vector; and performing recovery training on the accuracy of the sub-network obtained by pruning;
wherein the maximum number of running rounds max_epoch > the number of search rounds epoch_S > the number of model preheating rounds epoch_warmup.
Compared with the prior art, the invention uses the proxy model to predict the network accuracy of a candidate structure, so the network of each structure does not need to be trained, which greatly reduces the search time. Secondly, the method uses a differentiable structure parameter to generate the pruning rates and predicts the network accuracy directly through the proxy model, thereby establishing a direct relation between network accuracy and structure parameter, so the pruning rates can be trained directly and rapid, automatic pruning is realized. The method takes global information into consideration, so an optimal sub-network structure can be found. After the target network is pruned by the method, its parameter count and computation are effectively compressed while its accuracy drops only slightly or remains almost unchanged; the pruned network is hardware-friendly and can be deployed on many platforms.
Specifically, the invention is mainly applicable to neural networks containing convolutional layers, such as convolutional neural networks like VGG, ResNet, MobileNet and the like. The scheme prunes convolutional layers in the convolutional neural network, and achieves the purpose of compressing the network by pruning channels or filters in the convolutional layers.
The role of the proxy model is to quickly evaluate the searched sub-networks in the search stage, which saves a great deal of time and cost compared with directly training and evaluating each sub-network. Second, it establishes a mapping between network structure and accuracy, so that the structure vector can be optimized directly with a stochastic gradient descent method and a better structure can be found.
The input of the proxy model is a structure vector, which consists of the pruning rate of each layer of the network. Evaluating the accuracy of a network generally involves two metrics: Top1 and Top5. Top1 is the accuracy with which the first-ranked category is the true category, and represents the real performance of the network; Top5 is the accuracy with which the true category appears among the top five ranked categories. In an alternative embodiment, the network accuracy predicted value output by the proxy model is a prediction of the Top1 accuracy of the sub-network obtained by pruning the target network according to the input structure vector.
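For reference, these two accuracy metrics can be computed as follows. This is a minimal PyTorch sketch for illustration only; the function name is not from the patent:

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """Fraction of samples whose true category is among the top-k predictions."""
    topk = logits.topk(k, dim=1).indices            # (batch, k) predicted classes
    hit = (topk == labels.unsqueeze(1)).any(dim=1)  # true class among the top k?
    return hit.float().mean().item()

# topk_accuracy(logits, labels, k=1) gives Top1; k=5 gives Top5.
```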
In particular, it should be noted that the pruned networks are not trained in the model preheating stage but evaluated directly, so the accuracies obtained are low. What matters at this stage, however, is the relative performance of the different structures.
As a preferred embodiment, the network accuracy evaluation value is obtained by:
pruning the target network by setting the outputs of the corresponding filter channels to 0 according to the structure vector, and evaluating the pruning result to obtain the network accuracy evaluation value of the structure vector.
Specifically, the step of pruning the target network in the embodiment of the present invention is as follows:
calculating the L1 norm of each filter in each layer of the target network as the filter's importance; ranking the filters by importance, the lower-ranked filters being considered unimportant; calculating the number k of filters to be cut in each layer from the structure vector a and the layer's number of output channels out_channels, and selecting the k lowest-ranked filters; and pruning the selected k filters.
In the pruning performed to obtain the network accuracy evaluation value of a structure vector, the invention does not actually delete the filters; instead, the outputs of the corresponding channels are set to 0, which removes their influence on the output, so the channel behaves as if it were cut. This achieves an effect equivalent to deleting the filters while preserving the original structure of the target network, making it convenient to prune a different network structure in each round, and is thus more flexible and efficient.
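The masking described above can be sketched as follows in PyTorch. One interpretive assumption: consistent with O' = O * ai later in the text, each entry ai of the structure vector is treated here as the fraction of filters retained in layer i, so the number of filters to cut is k = O - round(O * ai); all names are illustrative:

```python
import torch

def channel_mask(conv_weight: torch.Tensor, a_i: float) -> torch.Tensor:
    """0/1 mask over output channels that zeroes the least important filters.

    conv_weight: weights of one conv layer, shape (out_channels, in_channels, kh, kw).
    a_i: entry of the structure vector for this layer (fraction of filters kept).
    """
    out_channels = conv_weight.shape[0]
    # The L1 norm of each filter serves as its importance score.
    importance = conv_weight.abs().sum(dim=(1, 2, 3))
    # Number of filters to cut for this layer.
    k = out_channels - int(round(out_channels * a_i))
    mask = torch.ones(out_channels, device=conv_weight.device)
    if k > 0:
        # Indices of the k least important filters.
        prune_idx = torch.topk(importance, k, largest=False).indices
        mask[prune_idx] = 0.0
    return mask  # multiply the layer's output by this mask, channel-wise
```

Multiplying each output feature map by this mask emulates deleting the filters while keeping the network structure intact, which is what allows a different structure vector to be evaluated in every round.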
As a preferred embodiment, in the model preheating stage a Gaussian distribution N(μ, σ²) is used to randomly generate the structure vector a = (a1, a2, ..., an), where μ = ratio_target, σ = 1, and ratio_target is the target compression rate of the target network, i.e. the pruning rate of each layer of the target network.
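A minimal sketch of this sampling step (PyTorch assumed). Since N(ratio_target, 1) can produce values outside a valid pruning-rate range, some clipping is needed in practice; the clipping bounds below are an assumption, as the patent does not state how out-of-range samples are handled:

```python
import torch

def sample_structure_vector(n: int, ratio_target: float) -> torch.Tensor:
    """Draw one structure vector a = (a1, ..., an) from N(ratio_target, 1)."""
    a = torch.normal(mean=ratio_target, std=1.0, size=(n,))
    # Clip to a plausible valid range (assumed, not specified in the patent).
    return a.clamp(0.0, 1.0)
```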
As a preferred embodiment, the proxy model is a multilayer perceptron (MLP), trained by a stochastic gradient descent method, with the root-mean-square loss as its loss function:
loss = sqrt( (1/N) * Σ_{j=1..N} (SAcc_j - Acc_j)^2 );
where SAcc_j is the network accuracy predicted value corresponding to structure vector a_j, Acc_j is the network accuracy evaluation value corresponding to a_j, and N is the batch size.
In particular, using the root-mean-square loss allows the proxy model to fit the accuracy of the pruned target network.
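A sketch of the proxy model and one training step on a batch drawn from the experience pool, assuming PyTorch. The 256-unit hidden layer and the optimizer settings follow embodiment 2, and the dummy batch is for illustration only:

```python
import torch
import torch.nn as nn

class ProxyModel(nn.Module):
    """MLP mapping a structure vector to a predicted network accuracy SAcc."""
    def __init__(self, n_layers: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_layers, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        return self.net(a).squeeze(-1)

def rmse_loss(sacc: torch.Tensor, acc: torch.Tensor) -> torch.Tensor:
    # Root-mean-square loss between predicted and evaluated accuracies.
    return torch.sqrt(torch.mean((sacc - acc) ** 2))

proxy = ProxyModel(n_layers=9)
opt = torch.optim.SGD(proxy.parameters(), lr=0.1, weight_decay=5e-4, momentum=0.9)

a_batch = torch.rand(128, 9)   # dummy batch of structure vectors from the pool
acc_batch = torch.rand(128)    # their network accuracy evaluation values
opt.zero_grad()
rmse_loss(proxy(a_batch), acc_batch).backward()
opt.step()
```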
As a preferred embodiment, in the search stage the structure vector a is generated from the differentiable structure parameter A = (A1, A2, ..., An) according to the following formula:
ai = sigmoid(Ai) * (1 - min_a) + min_a, i = 1, 2, ..., n;
where min_a denotes the minimum pruning rate, min_a ∈ [0, 1].
Specifically, the above formula guarantees that each ai lies in [min_a, 1] and is continuously differentiable with respect to Ai.
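As a sketch (PyTorch; min_a = 0.2 as in embodiment 2):

```python
import torch

MIN_A = 0.2  # minimum pruning rate min_a

def structure_vector(A: torch.Tensor) -> torch.Tensor:
    # ai = sigmoid(Ai) * (1 - min_a) + min_a keeps every ai in (min_a, 1)
    # and is continuously differentiable with respect to Ai.
    return torch.sigmoid(A) * (1.0 - MIN_A) + MIN_A
```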
The invention aims to find the structure parameter that makes the accuracy of the pruned target network highest. Since the proxy model predicts this accuracy, maximizing the accuracy amounts to maximizing the network accuracy predicted value SAcc output by the proxy model. Therefore, the loss for updating the structure parameter A can be designed as:
a_loss = -SAcc;
Because different devices accept different model sizes, in order to compress the network to a target size the invention adds a constraint on the size of the network model when updating the structure parameters, with the number of floating-point operations (FLOPs) used as the index of model size. The FLOPs are calculated as follows, where the FLOPs of each convolutional layer are:
flops_conv=k*k*I*O*W*H;
wherein k is the kernel_size of the convolutional layer, I is the number of input channels, O is the number of output channels, and W and H are the width and height of the output feature map;
and the FLOPs of a fully connected layer are calculated as:
flops_fc=I*O;
wherein, I is the number of input channels, and O is the number of output channels.
The FLOPs of all layers are calculated separately and summed to obtain the FLOPs of the whole network; the constraint function on FLOPs is then obtained as follows (the formula is given only as an image in the original publication; it penalizes the deviation of the pruned network's FLOPs from the target):
wherein max_flops is the FLOPs of the target network without pruning, and flops_target is the FLOPs of the desired pruned network.
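The per-layer formulas above translate directly into code. In this sketch the list of layer shapes is a placeholder describing the unpruned network, output channels are scaled as O' = O * ai (explained below), and, since the exact constraint function appears only as an image in the original publication, the hinge-style penalty is one plausible form rather than the patent's formula:

```python
import torch

def conv_flops(k, I, O, W, H):
    # flops_conv = k * k * I * O * W * H
    return k * k * I * O * W * H

def network_flops(layers, a: torch.Tensor) -> torch.Tensor:
    """FLOPs of the pruned network; `layers` lists (k, I, O, W, H) per conv layer."""
    total = torch.zeros(())
    for (k, I, O, W, H), a_i in zip(layers, a):
        # Pruning scales the output channels: O' = O * ai. (The matching
        # shrinkage of the next layer's input channels is omitted for brevity.)
        total = total + conv_flops(k, I, O * a_i, W, H)
    return total

def f_loss(flops: torch.Tensor, flops_target: float, max_flops: float) -> torch.Tensor:
    # Assumed hinge-style constraint: penalize only FLOPs above the target.
    return torch.clamp((flops - flops_target) / max_flops, min=0.0)
```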
Therefore, as a preferred embodiment, the differentiable structure parameter is updated by a stochastic gradient descent method, and the loss function used when updating the differentiable structure parameter is:
loss = a_loss + γ*f_loss;
wherein a_loss = -SAcc, SAcc being the network accuracy predicted value; f_loss is the constraint term on the floating-point operation count (FLOPs); and γ is a penalty coefficient.
Specifically, SAcc is the output of the proxy model for the structure vector a, so SAcc is differentiable with respect to a, i.e. with respect to the structure parameter A. From the FLOPs calculation it can be seen that the FLOPs depend on the output channels of each network layer, and the number of output channels after pruning is O' = O * ai; thus the FLOPs are also differentiable with respect to a, i.e. with respect to A. Therefore, both parts of the loss function loss are differentiable with respect to the structure parameter A, so A can be optimized with a gradient descent method according to loss.
Referring to fig. 3, when the gradient propagates backwards, the gradient computation from the loss function to the structure parameter is complete, with no truncation in the middle.
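Putting the pieces together, one search-phase update of A might look like the following, reusing ProxyModel, structure_vector, network_flops and f_loss from the sketches above; the hyper-parameters and FLOPs figures follow embodiment 2, and the layer list is a placeholder:

```python
import torch

A = torch.zeros(9, requires_grad=True)   # differentiable structure parameter
opt_A = torch.optim.SGD([A], lr=0.01, weight_decay=5e-4, momentum=0.9)
gamma = 10.0                             # penalty coefficient γ
layers = [(3, 16, 16, 32, 32)] * 9       # placeholder (k, I, O, W, H) per layer

a = structure_vector(A)                  # differentiable with respect to A
sacc = proxy(a.unsqueeze(0)).squeeze()   # network accuracy predicted value SAcc
loss = -sacc + gamma * f_loss(network_flops(layers, a), 20.41e6, 40.81e6)

opt_A.zero_grad()
loss.backward()   # gradients flow through both the proxy model and the FLOPs
opt_A.step()
```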
As a preferred solution, the sub-networks obtained by pruning in the pruning and recovery phase are trained to recover accuracy using a knowledge distillation method.
Specifically, knowledge distillation can migrate the knowledge of a teacher network (a large network) into a student network (a small network) to improve the efficiency of small network training, so that the accuracy of the small network can approach or even recover to the accuracy of the large network. Thus, knowledge distillation can be used to train the compressed network to improve the accuracy of the network.
The embodiment of the invention takes the target network as a teacher network, takes the searched sub-networks as student networks, and trains the student networks by using a knowledge distillation method, so that the accuracy of the student networks can be close to that of the teacher network, and the accuracy of the networks before pruning can be recovered.
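A standard knowledge distillation loss (Hinton et al.) can serve this purpose. In the following sketch the temperature T and mixing weight alpha are assumed values; the patent does not specify the distillation hyper-parameters:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 4.0, alpha: float = 0.9) -> torch.Tensor:
    """Soft teacher-student term plus hard cross-entropy on the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```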
Example 2
This embodiment further describes embodiment 1 with specific parameter settings and a more concrete example, wherein:
the target network is ResNet20, which has been trained on 305 epochs on the public data set CIFAR10, and finally the target network has a Top1 accuracy of 92.44%.
The proxy model is a multilayer perceptron (MLP), namely a neural network with the three-layer structure FC(n_layers, 256), ReLU(), FC(256, 1), where n_layers is the number of network layers to be pruned. The input of the proxy model is the structure vector, i.e. the pruning rate of each layer of the network. The target network in this embodiment is the residual network ResNet20, which has 20 network layers in total: a convolutional input layer, 9 residual blocks, and a fully connected output layer. Each residual block contains two convolutional layers and one residual connection. To keep the output channels at both ends of the residual connection consistent, only the first layer of each residual block is pruned in this embodiment, so there are 9 convolutional layers to be pruned, i.e. n_layers = 9.
The target compression rate ratio_target of the target network is 0.5; since 9 convolutional layers need to be pruned, the structure vector a has dimension 9, i.e. n = 9.
In the present embodiment, the dataset used to assess the accuracy of the network is the public dataset CIFAR10. To evaluate the accuracy of the network quickly, only a portion (1/10) of the training set is used in this embodiment, yielding the network accuracy evaluation value.
The batch size batch_size is 128.
In the present embodiment, the proxy model is optimized using the stochastic gradient descent (SGD) method, with the initial learning rate lr set to 0.1, the weight decay weight_decay set to 0.0005, and the momentum coefficient momentum set to 0.9; the learning rate is updated using the cosine annealing method.
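These settings correspond to the following PyTorch configuration, reusing the ProxyModel sketch from embodiment 1; the scheduler's T_max is an assumption tied to the round counts of this embodiment:

```python
import torch

optimizer = torch.optim.SGD(proxy.parameters(), lr=0.1,
                            weight_decay=0.0005, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=400)

# after each training round: optimizer.step(); scheduler.step()
```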
The number of model warm-up rounds, epoch _ warmup, is 250.
Specifically, the proxy model can be tested after the training of the model preheating stage is completed: all (a, Acc) pairs are extracted from the experience pool, each a is input into the proxy model to obtain the corresponding SAcc, and all SAcc and Acc values are plotted to obtain the fitting-effect graph of the proxy model. As can be seen from fig. 4, the proxy model fits the target network effectively.
When generating the structure vector a from the structure parameter A, the minimum pruning rate min_a is 0.2.
The constraint function on FLOPs is as given above (the formula appears as an image in the original publication).
In this embodiment, max_flops, the FLOPs of ResNet20 without pruning, is 40.81M; since the target compression rate ratio_target is 0.5, flops_target, the FLOPs of the desired pruned network, is 20.41M.
For the loss function used to update the structure parameter A:
loss = a_loss + γ*f_loss;
γ is the penalty coefficient of f_loss; in this embodiment, γ = 10.
In this embodiment, when the structure parameters are optimized with the stochastic gradient descent (SGD) method, the initial learning rate lr is set to 0.01, the weight decay weight_decay to 0.0005, and the momentum coefficient momentum to 0.9; the learning rate is updated with the cosine annealing method.
Referring to fig. 5, the trends of the model accuracy and the proxy model prediction as the structure parameters are optimized during the search show that the optimization of the structure parameters gradually converges.
In this embodiment, the number of search rounds epoch_S is 350, and the maximum number of running rounds max_epoch is 400.
In the pruning and recovery stage, the pruned sub-network is trained for 700 epochs in this embodiment to ensure sufficient training. The final results are given in Table 1 below:
TABLE 1 comparison of model size and accuracy before and after pruning
(Table 1 appears as an image in the original publication.)
As can be seen from Table 1, the method provided by the invention can effectively compress the target network to the specified compression rate, and ensure that the accuracy of the compressed network is almost unchanged.
Example 3
A convolutional neural network compression system based on proxy model and gradient optimization, please refer to fig. 6, which includes:
the initialization module 1:
the method comprises the steps of obtaining a proxy model, a target network to be compressed and a target compression ratio of the target network, and initializing a micro-structure parameter and an experience pool; entering a model preheating stage after initialization is completed;
model preheating module 2:
for performing the following steps in each round of the model preheating stage: randomly generating a structure vector composed of the pruning rate of each layer of the target network according to the target compression rate, obtaining the network accuracy evaluation value corresponding to the structure vector, and adding the randomly generated structure vector and its network accuracy evaluation value into the experience pool as a sample;
after the number of samples in the experience pool exceeds a preset batch size, the following step is also performed in each round of the model preheating stage: extracting a batch of samples from the experience pool to train the proxy model; when the number of executed rounds exceeds the preset number of model preheating rounds, the model preheating stage ends and the search stage begins;
the searching module 3:
for performing the following steps in each round of the search stage when the number of executed rounds has not reached a preset maximum number of running rounds: generating a structure vector from the differentiable structure parameter, and inputting the generated structure vector into the proxy model to obtain the corresponding network accuracy predicted value and floating-point operation count (FLOPs); and optimizing and updating the differentiable structure parameter according to the network accuracy predicted value and the FLOPs;
when the number of executed rounds has not reached the preset number of search rounds, the following steps are also performed in each round of the search stage: obtaining the network accuracy evaluation value corresponding to the structure vector generated by the differentiable structure parameter, and adding that structure vector and its network accuracy evaluation value into the experience pool as a sample; extracting a batch of samples from the experience pool to train the proxy model;
when the number of executed rounds reaches the maximum number of running rounds, ending the searching stage and entering a network pruning stage;
pruning and recovery module 4:
for generating a structure vector from the differentiable structure parameter; pruning the target network by deleting filters according to the generated structure vector; and performing recovery training on the accuracy of the sub-network obtained by pruning;
wherein the maximum number of running rounds is greater than the number of search rounds and the number of model preheating rounds.
Example 4
A medium having stored thereon a computer program which, when executed by a processor, implements the steps of the convolutional neural network compression method based on proxy model and gradient optimization of embodiment 1.
Example 5
A computer device comprising a medium, a processor, and a computer program stored in the medium and executable by the processor, the computer program when executed by the processor implementing the steps of the proxy model and gradient optimization based convolutional neural network compression method of embodiment 1.
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. And are neither required nor exhaustive of all embodiments. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims (10)

1. A convolutional neural network compression method based on a proxy model and gradient optimization is characterized by comprising the following stages:
an initialization stage:
acquiring a proxy model, a target network to be compressed and a target compression rate of the target network, and initializing a differentiable structure parameter and an experience pool; entering the model preheating stage after initialization is completed;
a model preheating stage:
performing the following steps in each round of the model preheating stage: randomly generating a structure vector composed of the pruning rate of each layer of the target network according to the target compression rate, obtaining the network accuracy evaluation value corresponding to the structure vector, and adding the randomly generated structure vector and its network accuracy evaluation value into the experience pool as a sample;
after the number of samples in the experience pool exceeds a preset batch size, the following step is also performed in each round of the model preheating stage: extracting a batch of samples from the experience pool to train the proxy model; when the number of executed rounds exceeds the preset number of model preheating rounds, the model preheating stage ends and the search stage begins;
a searching stage:
when the number of executed rounds has not reached a preset maximum number of running rounds, performing the following steps in each round of the search stage: generating a structure vector from the differentiable structure parameter, and inputting the generated structure vector into the proxy model to obtain the corresponding network accuracy predicted value and floating-point operation count (FLOPs); optimizing and updating the differentiable structure parameter according to the network accuracy predicted value and the FLOPs;
when the number of executed rounds has not reached the preset number of search rounds, the following steps are also performed in each round of the search stage: obtaining the network accuracy evaluation value corresponding to the structure vector generated by the differentiable structure parameter, and adding that structure vector and its network accuracy evaluation value into the experience pool as a sample; extracting a batch of samples from the experience pool to train the proxy model;
when the number of executed rounds reaches the maximum number of running rounds, ending the searching stage and entering a network pruning stage;
pruning and recovering stages:
generating a structure vector from the differentiable structure parameter; pruning the target network by deleting filters according to the generated structure vector; and performing recovery training on the accuracy of the sub-network obtained by pruning;
wherein the maximum number of running rounds is greater than the number of search rounds and the number of model preheating rounds.
2. The convolutional neural network compression method based on proxy model and gradient optimization of claim 1, wherein the network accuracy evaluation value is obtained by:
pruning the target network by setting the outputs of the corresponding filter channels to 0 according to the structure vector, and evaluating the pruning result to obtain the network accuracy evaluation value of the structure vector.
3. The convolutional neural network compression method based on proxy model and gradient optimization of claim 1, characterized in that in the model preheating stage a Gaussian distribution N(μ, σ²) is used to randomly generate the structure vector a = (a1, a2, ..., an), where μ = ratio_target, σ = 1, and ratio_target is the target compression rate of the target network, i.e. the pruning rate of each layer of the target network.
4. The convolutional neural network compression method based on the proxy model and the gradient optimization of claim 1, wherein the proxy model is a multilayer perceptron, the proxy model is trained by a stochastic gradient descent method, and the loss function of the proxy model is root-mean-square loss:
loss = sqrt( (1/N) * Σ_{j=1..N} (SAcc_j - Acc_j)^2 );
where SAcc_j is the network accuracy predicted value corresponding to structure vector a_j, Acc_j is the network accuracy evaluation value corresponding to a_j, and N is the batch size.
5. The convolutional neural network compression method based on proxy model and gradient optimization of claim 1, wherein in the search stage the structure vector a is generated from the differentiable structure parameter A = (A1, A2, ..., An) by the following formula:
ai=sigmoid(Ai)*(1-min_a)+min_a,i=1,2,...,n;
where min _ a represents the minimum pruning rate, min _ a ∈ [0,1 ].
6. The convolutional neural network compression method based on proxy model and gradient optimization of claim 1, wherein the differentiable structure parameter is updated by a stochastic gradient descent method, and the loss function used in the process of updating the differentiable structure parameter is as follows:
loss=a_loss+γ*f_loss;
wherein a_loss = -SAcc, SAcc being the network accuracy predicted value; f_loss is the constraint term on the floating-point operation count (FLOPs); and γ is a penalty coefficient.
7. The convolutional neural network compression method based on proxy model and gradient optimization of claim 1, wherein a knowledge distillation method is used to train the sub-networks obtained by pruning in the pruning and recovery stage to recover their accuracy.
8. A convolutional neural network compression system based on a proxy model and gradient optimization, comprising:
initialization module (1):
the method comprises the steps of obtaining a proxy model, a target network to be compressed and a target compression ratio of the target network, and initializing a micro-structure parameter and an experience pool; entering a model preheating stage after initialization is completed;
model preheating module (2):
for performing the following steps in each round of the model preheating stage: randomly generating a structure vector composed of the pruning rate of each layer of the target network according to the target compression rate, obtaining the network accuracy evaluation value corresponding to the structure vector, and adding the randomly generated structure vector and its network accuracy evaluation value into the experience pool as a sample;
after the number of samples in the experience pool exceeds a preset batch size, the following step is also performed in each round of the model preheating stage: extracting a batch of samples from the experience pool to train the proxy model; when the number of executed rounds exceeds the preset number of model preheating rounds, the model preheating stage ends and the search stage begins;
search module (3):
for performing the following steps in each round of the search stage when the number of executed rounds has not reached a preset maximum number of running rounds: generating a structure vector from the differentiable structure parameter, and inputting the generated structure vector into the proxy model to obtain the corresponding network accuracy predicted value and floating-point operation count (FLOPs); and optimizing and updating the differentiable structure parameter according to the network accuracy predicted value and the FLOPs;
when the number of executed rounds has not reached the preset number of search rounds, the following steps are also performed in each round of the search stage: obtaining the network accuracy evaluation value corresponding to the structure vector generated by the differentiable structure parameter, and adding that structure vector and its network accuracy evaluation value into the experience pool as a sample; extracting a batch of samples from the experience pool to train the proxy model;
when the number of executed rounds reaches the maximum number of running rounds, ending the searching stage and entering a network pruning stage;
pruning and recovery module (4):
for generating a structure vector from the differentiable structure parameter; pruning the target network by deleting filters according to the generated structure vector; and performing recovery training on the accuracy of the sub-network obtained by pruning;
wherein the maximum number of running rounds is greater than the number of search rounds and the number of model preheating rounds.
9. A medium having a computer program stored thereon, characterized in that: the computer program when executed by a processor implements the steps of the convolutional neural network compression method based on proxy model and gradient optimization of any of claims 1 to 7.
10. A computer device, characterized by: comprising a medium, a processor and a computer program stored in the medium and executable by the processor, which computer program, when executed by the processor, carries out the steps of the proxy model and gradient optimization based convolutional neural network compression method as claimed in any one of claims 1 to 7.
CN202111039434.9A 2021-09-06 2021-09-06 Convolutional neural network compression method based on agent model and gradient optimization Pending CN113837378A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111039434.9A CN113837378A (en) 2021-09-06 2021-09-06 Convolutional neural network compression method based on agent model and gradient optimization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111039434.9A CN113837378A (en) 2021-09-06 2021-09-06 Convolutional neural network compression method based on agent model and gradient optimization

Publications (1)

Publication Number Publication Date
CN113837378A true CN113837378A (en) 2021-12-24

Family

ID=78962250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111039434.9A Pending CN113837378A (en) 2021-09-06 2021-09-06 Convolutional neural network compression method based on agent model and gradient optimization

Country Status (1)

Country Link
CN (1) CN113837378A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998648A (en) * 2022-05-16 2022-09-02 电子科技大学 Performance prediction compression method based on gradient architecture search
WO2023248454A1 (en) * 2022-06-24 2023-12-28 日立Astemo株式会社 Computation device and computation method
CN117910536A (en) * 2024-03-19 2024-04-19 浪潮电子信息产业股份有限公司 Text generation method, and model gradient pruning method, device, equipment and medium thereof


Similar Documents

Publication Publication Date Title
CN113837378A (en) Convolutional neural network compression method based on agent model and gradient optimization
Tung et al. Clip-q: Deep network compression learning by in-parallel pruning-quantization
Gomez et al. Learning sparse networks using targeted dropout
CN110366734B (en) Optimizing neural network architecture
CN108229667B (en) Trimming based on artificial neural network classification
CN109844773B (en) Processing sequences using convolutional neural networks
CN110175628A (en) A kind of compression algorithm based on automatic search with the neural networks pruning of knowledge distillation
US20210004677A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
CN114037844A (en) Global rank perception neural network model compression method based on filter characteristic diagram
CN110428045A (en) Depth convolutional neural networks compression method based on Tucker algorithm
CN112101547B (en) Pruning method and device for network model, electronic equipment and storage medium
CN108875752A (en) Image processing method and device, computer readable storage medium
CN111598238A (en) Compression method and device of deep learning model
CN110222171A (en) A kind of application of disaggregated model, disaggregated model training method and device
CN109885723A (en) A kind of generation method of video dynamic thumbnail, the method and device of model training
CN109919252A (en) The method for generating classifier using a small number of mark images
CN112766496B (en) Deep learning model safety guarantee compression method and device based on reinforcement learning
CN112001496A (en) Neural network structure searching method and system, electronic device and storage medium
CN113705811A (en) Model training method, device, computer program product and equipment
CN114282666A (en) Structured pruning method and device based on local sparse constraint
CN112101364A (en) Semantic segmentation method based on parameter importance incremental learning
CN111506748B (en) Method and device for managing intelligent database for face recognition
CN112507114A (en) Multi-input LSTM-CNN text classification method and system based on word attention mechanism
CN112115131A (en) Data denoising method, device and equipment and computer readable storage medium
CN108039168A (en) Acoustic model optimization method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination