CN117131920A - Model pruning method based on network structure search

Info

Publication number
CN117131920A
CN117131920A (application CN202311396457.4A)
Authority
CN
China
Prior art keywords: model, pruning, layer, network structure, training
Prior art date
Legal status: Granted
Application number
CN202311396457.4A
Other languages
Chinese (zh)
Other versions
CN117131920B (en)
Inventor
丁晓嵘
张新
李晓梅
聂明杰
李楠
王柏春
张兴业
田晓宇
高涵
项乾
Current Assignee: Beijing Smart Water Development Research Institute
Original Assignee
Beijing Smart Water Development Research Institute
Priority date
Filing date
Publication date
Application filed by Beijing Smart Water Development Research Institute
Priority to CN202311396457.4A
Publication of CN117131920A
Application granted
Publication of CN117131920B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a model pruning method based on network structure search. The method comprises the following steps: S1, collecting network structure data, and loading and training an improved model; S2, performing a network structure search to obtain the sensitivity of each convolution layer to pruning; S3, determining the sensitivity of each convolution layer to pruning and adapting the pruning ratio of each layer accordingly; S4, pruning the model, namely removing the channels of relatively low importance and the corresponding filters in each layer according to each layer's adaptive pruning ratio; S5, fine-tuning the pruned model while adapting the clipping rate of each layer, and terminating the iteration once the performance on the test set reaches the expected target; S6, testing the model, and storing and outputting the models that meet the device and performance requirements. The application removes redundant channels and the corresponding convolution kernels while preserving model performance, so that the model is small, has low computational complexity and low battery power consumption, and can be ported to embedded devices.

Description

Model pruning method based on network structure search
Technical Field
The application relates to the field of deep neural network model compression, and in particular to a model pruning method based on network structure search, which is particularly suitable for solving the problem of excessive computation and parameter counts caused by porting deep neural network models to edge and mobile devices.
Background
Fields of artificial intelligence such as computer vision and natural language processing have benefited from the development of deep learning, which has pushed their performance to unprecedented heights. However, as the depth and width of neural networks keep increasing, the gains in performance come with large storage requirements and huge amounts of computation, which greatly hinders the commercialization of deep learning methods.
With the increasing maturity of artificial intelligence technology and the improved computing power of edge devices, deploying AI on mobile terminals has become feasible. Compared with a deep learning model on the server side, an edge model can relieve server-side pressure and offers fast response, high stability and better security. However, deploying deep learning models on edge devices is challenging: embedded devices such as mobile terminals are not designed for compute-intensive tasks, so their computing capacity and storage resources are limited, and constraints such as battery capacity must also be considered when productizing deep learning methods at the edge. Therefore, a mobile-side model must be small, have low computational complexity and low battery power consumption, and support flexible deployment and updates. The diversity of deployment platforms is another challenge: the hardware platforms of different mobile terminals differ greatly, as do their computing and storage capacities, and each edge device has a maximum computation and parameter budget it can bear, so the same task may require models of different sizes on different hardware.
Under such circumstances, there is a need for a method that compresses a deep neural network model while preserving its performance, and that constrains the network structure adaptively according to the parameter and computation budgets of the mobile platform.
Disclosure of Invention
The application provides a model pruning method based on network structure search, which aims to solve the defects in the prior art.
In order to achieve the above purpose, the present application adopts the following technical scheme.
A model pruning method based on network structure search, the method comprising:
S1, acquiring network structure data: loading and training the improved model;
S2, performing a network structure search to obtain the sensitivity of each convolution layer to pruning;
S3, determining the sensitivity of each convolution layer to pruning and adapting the pruning ratio of each layer accordingly;
S4, pruning the model, namely removing the channels of relatively low importance and the corresponding filters in each layer according to each layer's adaptive pruning ratio;
S5, fine-tuning the pruned model while adapting the clipping rate of each layer, and terminating the iteration once the performance on the test set reaches the expected target;
S6, testing the model, and storing and outputting the models that meet the device and performance requirements.
Further, the step S1 specifically includes:
S1.1, taking a large number of photos with a camera, and performing data augmentation on the photos by random cropping, random flipping and left-right mirroring to expand the data set;
S1.2, annotating the data with annotation software, and storing the original images and the annotations in the same folder;
S1.3, improving the model: adding the sandwich principle and in-situ distillation to improve the performance of the network, and accelerating the acquisition of network structure information by sharing convolution kernels.
Further, the S1.3 model improvement specifically includes:
(1) First, defining the network structure according to the task, setting the network structure search range, and randomly and independently sampling the number of channels of each layer during model training;
(2) During the random sampling of channels, sampling the upper and lower limits of the set threshold with emphasis, so that the accuracy of the model stays within a preset range;
(3) During the network structure search, adopting in-situ distillation and using the set upper pruning limit ratio to guide the training of the other ratios;
(4) Sharing convolution kernels to accelerate the collection of network structure information; the closer a convolution kernel is to the front (the lower its channel index), the more times it is reused.
Further, the number of channels of each layer is randomly sampled during model training, and during the random sampling of channels the upper and lower limits of the set threshold are sampled with emphasis. The output of a single neuron can be represented by equation (1), and equation (2) gives the output of the first k channels:

y = \sum_{i=1}^{n} w_i x_i    (1)

y_k = \sum_{i=1}^{k} w_i x_i,  k_0 \le k \le n    (2)

where w and x are the upper-layer convolution kernel weights and the input feature map, n is the number of input channels, y_k satisfying equation (2) is the output of the first k channels, and k and k_0 define the structure search range.
Further, obtaining the sensitivity of each convolution layer to pruning in S2 specifically includes:
S2.1, performing a small number of training iterations on the improved model using the training and validation data until convergence;
S2.2, loading the model, randomly deleting convolution kernels of each layer within the set threshold, recording the final relevant data of the network, and sorting and dividing the data to obtain the sensitivity of the different convolution layers to the model output.
Further, the S4 model pruning specifically includes: applying L1 regularization to the scaling factors of the batch normalization (BN) layers so that the scaling factors tend to zero, sorting the channels in ascending order of importance, adapting the pruning rate of each layer, and pruning the channels of relatively low importance and the corresponding filters in each layer. L1 regularization, also known as Lasso regularization, penalizes the complexity of the model by adding the sum of the absolute values of the target parameters to the model's loss function.
Further, as shown in formula (3), the scaling factors of the BN layers are introduced into the objective function as a sparsity penalty term, and the network weights and the scaling factors are trained jointly so that many scaling factors in the resulting model tend to 0:

L = \sum_{(x,y)} l(f(x, w), y) + \lambda \sum_{\gamma \in \Gamma} |\gamma|    (3)

where the first term of the objective function is the training loss function, L1 regularization |\gamma| is adopted as the sparsification penalty on the scaling factors, (x, y) are the training input data and corresponding labels, w denotes the trainable parameters, and \lambda is a balance factor.
Further, S5 specifically includes: fine-tuning the model after clipping: the clipping rate of each layer is adapted according to the change of the absolute difference of the optimization target relative to the start of the iteration, the sensitivity of the corresponding convolution layer to pruning, and the requirements of the edge device on the amount of computation and the number of parameters; the iteration is terminated once the performance on the test set reaches the expected target.
Further, the model test in S6 specifically includes: feeding the test data into the model for testing, and comparing the recognition results of the same targets before and after model pruning.
Further, the model output in S6 specifically includes: exporting the trained model with the optimal parameters to an h5 file so that the model can easily be ported to embedded devices.
Compared with the prior art, the scheme of the application has the following beneficial effects:
the application aims to solve the problem of deep learning method productization, and simultaneously starts with pruning on the requirements of the convolutional layer on pruning sensitivity, the importance of channels and the parameter quantity and the calculation quantity of edge equipment.
And redundant channels and corresponding convolution kernels are removed under the condition of ensuring the performance of the model, so that the model has the conditions of small size, low computational complexity, low power consumption of a battery, flexible issuing, updating and deployment and the like, and the portability of the network on embedded equipment is ensured.
Drawings
The application is further described below with reference to the drawings and the detailed description.
FIG. 1 is a flow chart of a method of pruning a model based on network structure search of the present application;
FIG. 2 is a flow chart of a conventional structure search;
FIG. 3 is a flow chart of a structure search based on a shared convolution kernel;
FIG. 4 is a flow chart of a conventional pruning;
FIG. 5 is a flow chart of pruning based on convolution layer sensitivity according to the present application.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
The application is mainly applied to deep-learning-based classification and target detection tasks, and provides a model pruning method based on network structure search that compresses the network model as much as possible by limiting the model's computation and parameters while preserving accuracy. Taking the recognition task of a traditional water meter as an example, the task adopts an SSD model; the model parameters can be reduced to 0.5% of the original model while the recognition accuracy reaches 99% within 3 seconds, and the method can be deployed on embedded devices.
FIG. 1 is a block flow diagram of the present application, and a specific implementation of the present application will be described.
Step 1, data preparation: data acquisition, data labeling and data division;
and 1.1, taking a large number of photos by using a camera, and carrying out data augmentation on the photos by using random cutting, random inversion, left and right mirror images and other processes to expand a data set.
Step 1.2, annotate the data with annotation software, save the annotations as txt files, and store the original images and annotations in the same folder.
Step 1.3, divide the images and the corresponding annotations into a training set, a validation set and a test set, accounting for 60%, 20% and 20% respectively.
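A minimal sketch of the 60%/20%/20% split, assuming the images are addressed by file path:

```python
# 60/20/20 train/validation/test split sketch (file layout is assumed for illustration).
import random

def split_dataset(image_paths, seed=0):
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]

# train_set, val_set, test_set = split_dataset(glob.glob("dataset/*.jpg"))
```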
Step 2, model improvement: the sandwich principle and in-situ distillation are added to improve the performance of the network, and the collection of network structure information is accelerated by sharing convolution kernels. This comprises the following substeps:
step 2.1, firstly defining a network structure according to tasks, setting a structure search range, and randomly and independently sampling the channel number of each layer in the training of a model. The method comprises the following specific steps: the SSD target detection network is used as a network structure for accurately identifying the traditional water meter, and the network structure is adjusted. In the traditional SSD detection network, VGG16 is used as a feature extraction network of a backbone, multi-layer convolution is added behind the backbone network to realize multi-scale detection of the target, and for accurate identification of the water meter, single-layer detection is adopted because similar feelings of the detected target on different convolution layers are not greatly different. Setting a structure searching range, namely, the sampling rate of the convolution kernel, wherein the pruning rate adopted in the accurate identification of the traditional water meter is [0.25,1], and independently sampling each layer, so that the structure searching range is enlarged to the greatest extent.
Step 2.2, during the random sampling of channels, sample the upper and lower limits of the set threshold with emphasis so that the accuracy of the model stays within a preset range. The output of a single neuron can be represented by equation (1), where w and x are the upper-layer convolution kernel weights and the input feature map, respectively, and n is the number of input channels:

y = \sum_{i=1}^{n} w_i x_i    (1)

The quantity y_k satisfying equation (2) is the output of the first k channels, where k and k_0 delimit the structure search range defined in step 2.1:

y_k = \sum_{i=1}^{k} w_i x_i,  k_0 \le k \le n    (2)

In general, the more neurons are kept, the closer the output is to the full-width value. Given k_0, the output of the network layer stays within a defined range, and the trade-off between the performance of the network layer and its resource consumption can be adjusted directly by adding or deleting channels.
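A minimal sketch of equations (1) and (2) with a shared convolution kernel: the sub-network that keeps the first k output channels simply slices the first k filters of the full weight tensor, so the front filters are reused by every sub-network. The tensor shapes below are illustrative assumptions:

```python
# Shared-kernel slicing sketch: the first k filters of one weight tensor serve all widths.
import torch
import torch.nn.functional as F

full_weight = torch.randn(64, 32, 3, 3)   # full layer: 64 output channels, 32 input channels
full_bias = torch.randn(64)

def forward_first_k(x, k):
    """Output of the first k channels, i.e. equation (2) with k_0 <= k <= 64."""
    return F.conv2d(x, full_weight[:k], full_bias[:k], padding=1)

x = torch.randn(1, 32, 38, 38)            # assumed feature-map size
y_full = forward_first_k(x, 64)           # equation (1): all n channels
y_sub = forward_first_k(x, 16)            # a narrower sub-network reusing the same front filters
```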
Step 2.3, during the network structure search, adopt in-situ distillation and use the set upper pruning limit ratio to guide the training of the other ratios, thereby improving the performance of the network. The sandwich principle means that several networks are trained simultaneously during the structure search: the maximum-width network is trained with the real labels, while the other widths use the predictions of the maximum-width network as labels, so the predicted labels can be reused in place during training without extra computational cost.
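A hedged sketch of one training step under the sandwich principle with in-situ distillation, written here with a classification-style loss for brevity (an SSD detection loss would replace it): the maximum width is trained on the real labels, and its detached predictions supervise the smallest and a few randomly sampled widths. The helper set_width is hypothetical; it stands for restricting every layer to the first fraction of its shared filters.

```python
# Sandwich-rule / in-situ distillation sketch. `set_width` is a hypothetical helper that
# keeps only the first (ratio * channels) filters of the shared kernels in every layer.
import random
import torch.nn.functional as F

def train_step(model, set_width, images, labels, optimizer, num_random=2):
    optimizer.zero_grad()

    # 1) Largest width: trained with the real labels (hard targets).
    set_width(model, 1.0)
    logits_max = model(images)
    loss = F.cross_entropy(logits_max, labels)
    soft_targets = logits_max.detach().softmax(dim=1)   # in-situ distillation targets

    # 2) Smallest width plus a few random widths: trained against the soft targets.
    for ratio in [0.25] + [random.uniform(0.25, 1.0) for _ in range(num_random)]:
        set_width(model, ratio)
        logits = model(images)
        loss = loss + F.kl_div(logits.log_softmax(dim=1), soft_targets, reduction="batchmean")

    loss.backward()
    optimizer.step()
    return loss.item()
```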
Step 2.4, accelerate the acquisition of network structure information by sharing convolution kernels: the closer a convolution kernel is to the front (the lower its channel index), the more times it is reused across sub-networks.
Fig. 2 illustrates a conventional network structure search: according to a certain search strategy, candidate structures are drawn from the search space and alternately evaluated for performance.
The network structure search of the present application is shown in fig. 3: all the sub-networks are searched first and then evaluated uniformly. Because each layer is sampled independently, the number of sub-networks is very large, and the amount of computation would not be reduced if the sub-networks were stored one by one.
Step 3, collecting network structure data: load and train the improved model, record the relevant data of the model under different network structures, and finally summarize the sensitivity of the different convolution layers to the model results. This comprises the following substeps:
Step 3.1, perform a small number of training iterations on the improved model of step 2 using the training and validation data until convergence;
Step 3.2, load the model of step 3.1, randomly delete convolution kernels of each layer within the set threshold, and finally record the relevant data of the network (including the number of convolution kernels of each layer and the final accuracy).
Fig. 4 is a schematic diagram of conventional model pruning: a large model is trained and pruned according to a set pruning strategy to obtain several sub-network models, the sub-networks are fine-tuned to recover accuracy, and finally the optimal sub-network obtained by evaluation is taken as the result of pruning. If every sub-network were fine-tuned, a significant amount of computing resources and training time would be required.
Fig. 5 is a schematic diagram of pruning flow based on convolutional layer sensitivity according to the present application.
In step 2, the application stores the weights of the sub-networks in the corresponding shared convolution kernels, so the network weights do not need to be fine-tuned. The channels are batch-normalized: during training, the BN layer computes the mean and variance and updates these two statistics with a moving average, and the statistics are used when the model is loaded after training is finished. Although the application does not use fine-tuning to recover model accuracy, a small amount of forward propagation through the network is required to update the mean and variance of each sub-network. Each sub-network performs the BN (Batch Normalization) mean and variance update, and then the relevant data is recorded.
Step 3.3, sort and divide the data to obtain the sensitivity of the different convolution layers to the model output. Only one or more convolution kernels in a single convolution layer are removed at a time, and the sensitivity of each convolution layer to the pruning ratio is recorded.
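A minimal sketch of the per-layer sensitivity scan of step 3: one layer at a time, a fraction of its convolution kernels is dropped, the BN running statistics are refreshed with a few forward passes (no weight fine-tuning), and the resulting accuracy is recorded. apply_layer_ratio and evaluate are hypothetical placeholders for the application's own routines.

```python
# Per-layer pruning-sensitivity scan sketch; helper functions are hypothetical placeholders.
import torch

def recalibrate_bn(model, calib_loader, num_batches=20):
    """Refresh BN running mean/variance of a sub-network with a few forward passes."""
    model.train()                      # BN updates its running statistics in train mode
    with torch.no_grad():
        for i, (images, _) in enumerate(calib_loader):
            if i >= num_batches:
                break
            model(images)

def sensitivity_scan(model, layers, ratios, apply_layer_ratio, evaluate, calib_loader):
    results = {}                       # {layer_name: {ratio: accuracy}}
    for name in layers:
        results[name] = {}
        for r in ratios:               # e.g. [0.25, 0.5, 0.75]
            sub = apply_layer_ratio(model, name, r)   # keep only the first r*channels in this layer
            recalibrate_bn(sub, calib_loader)
            results[name][r] = evaluate(sub)          # accuracy on the validation set
    return results
```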
Step 4, model pruning: apply L1 regularization to the scaling factors of the BN layers so that the scaling factors tend to zero, sort the channels in ascending order of importance, and, referring to the influence of the different convolution layers on the result obtained in step 3, adapt the pruning ratio of each layer and prune the channels of relatively low importance and the corresponding filters in each layer. L1 regularization, also known as Lasso regularization, penalizes the complexity of the model by adding the sum of the absolute values of the target parameters to the model's loss function.
Step 4.1, as shown in formula (3), introduce the scaling factors of the BN layers into the objective function as a sparsity penalty term, and train the network weights and the scaling factors jointly so that many scaling factors in the resulting model tend to 0:

L = \sum_{(x,y)} l(f(x, w), y) + \lambda \sum_{\gamma \in \Gamma} |\gamma|    (3)

The first term is the training loss function, where f(x, w) is the prediction obtained with x as input and w as parameters, and l is the loss computed from the prediction and the ground truth y. L1 regularization |\gamma| is used as the sparsification penalty on the scaling factors, (x, y) are the training input data and corresponding labels, w denotes the trainable parameters, and \lambda is a balance factor.
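As a sketch of formula (3) in training code, the task loss is augmented with lambda times the L1 norm of every BN scaling factor; the lambda value below is an assumed example, not a value given by the application.

```python
# Sparsity-regularized objective sketch for formula (3); the lambda value is an assumed example.
import torch.nn as nn

def objective_with_bn_sparsity(model, task_loss, lam=1e-4):
    l1_gamma = 0.0
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            l1_gamma = l1_gamma + m.weight.abs().sum()   # gamma is stored in BatchNorm2d.weight
    return task_loss + lam * l1_gamma

# Inside the training loop:
#   loss = objective_with_bn_sparsity(model, criterion(model(images), targets))
#   loss.backward(); optimizer.step()
```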
Step 4.2, sort the BN scaling factors of each convolution layer from large to small, i.e., rank the channels by importance, and adapt the pruning ratio of each layer according to the sensitivity of each convolution layer to pruning determined in step 3. Without changing the original number of network layers, the unimportant channels and the corresponding convolution kernels are removed, yielding a more compact network that is easy to port.
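A sketch of the per-layer channel selection: rank the |γ| values of each BN layer and keep the largest fraction allowed by that layer's adaptive pruning ratio (obtained from the step-3 sensitivity). Propagating the kept indices to the convolution weights and to the next layer's input channels is omitted here.

```python
# Channel-selection sketch: rank channels by |gamma| and keep the top fraction per layer.
import torch
import torch.nn as nn

def select_channels(model, keep_ratios):
    """keep_ratios: {bn_layer_name: fraction of channels to keep}, from the step-3 sensitivity."""
    kept = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d) and name in keep_ratios:
            gamma = m.weight.detach().abs()
            k = max(1, int(round(keep_ratios[name] * gamma.numel())))
            kept[name] = torch.argsort(gamma, descending=True)[:k]   # most important channel indices
    return kept

# The returned index sets are then used to slice the convolution filters feeding each BN layer
# and the corresponding input channels of the following layer.
```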
Step 5, model fine-tuning after clipping: adapt the clipping rate of each layer according to the change of the absolute difference of the optimization target relative to the start of the iteration, the sensitivity of the corresponding convolution layer to pruning, and the requirements of the edge device on the amount of computation and the number of parameters; terminate the iteration once the performance on the test set reaches the expected target.
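The application does not spell out the exact update rule for the per-layer clipping rate; the sketch below is one plausible reading, in which insensitive layers are pruned harder while the model is still above the device's parameter budget and the optimization target has barely changed since the start of the iteration.

```python
# Hypothetical adaptive clipping-rate update; the concrete rule is not specified by the application.
def adapt_clip_rates(clip_rates, sensitivities, current_params, budget_params,
                     objective_delta, step=0.05, min_rate=0.25, max_rate=1.0):
    """clip_rates / sensitivities: {layer_name: value}. Insensitive layers are pruned harder
    while the model exceeds the device's parameter budget and the objective is stable."""
    over_budget = current_params > budget_params
    for layer, rate in clip_rates.items():
        if over_budget and objective_delta < 0.01:       # objective barely changed this iteration
            rate -= step * (1.0 - sensitivities[layer])  # prune insensitive layers more aggressively
        clip_rates[layer] = min(max_rate, max(min_rate, rate))
    return clip_rates
```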
Step 6, model test: feed the test data into the model for testing, and compare the recognition results of the same targets before and after pruning.
Step 7, model output: export the trained model with the optimal parameters to an h5 file so that it can be ported to embedded devices.
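A minimal sketch of exporting the final weights to an h5 file with h5py; the one-dataset-per-parameter layout is an assumption, since the application does not fix a file schema.

```python
# HDF5 export sketch (assuming a PyTorch model); one dataset per parameter tensor is an assumed layout.
import h5py

def export_to_h5(model, path="pruned_model.h5"):
    with h5py.File(path, "w") as f:
        for name, param in model.state_dict().items():
            f.create_dataset(name, data=param.cpu().numpy())

# export_to_h5(pruned_model)   # the resulting .h5 file is then copied to the embedded device
```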
It should be noted that the foregoing detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is intended to include the plural unless the context clearly indicates otherwise. Furthermore, it will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or groups thereof.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the above detailed description, reference is made to the accompanying drawings, which form a part hereof. In the drawings, like numerals typically identify like components unless context indicates otherwise. The illustrated embodiments described in the detailed description, drawings, and claims are not meant to be limiting. Other embodiments may be utilized, and other changes may be made, without departing from the spirit or scope of the subject matter presented herein.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A model pruning method based on network structure search, the method comprising:
S1, acquiring network structure data: loading and training the improved model;
S2, performing a network structure search to obtain the sensitivity of each convolution layer to pruning;
S3, determining the sensitivity of each convolution layer to pruning and adapting the pruning ratio of each layer accordingly;
S4, pruning the model, namely removing the channels of relatively low importance and the corresponding filters in each layer according to each layer's adaptive pruning ratio;
S5, fine-tuning the pruned model while adapting the clipping rate of each layer, and terminating the iteration once the performance on the test set reaches the expected target;
S6, testing the model, and storing and outputting the models that meet the device and performance requirements.
2. The method for pruning a model for network structure search according to claim 1, wherein S1 specifically comprises:
S1.1, taking a large number of photos with a camera, and performing data augmentation on the photos by random cropping, random flipping and left-right mirroring to expand the data set;
S1.2, annotating the data with annotation software, and storing the original images and the annotations in the same folder;
S1.3, improving the model: adding the sandwich principle and in-situ distillation to improve the performance of the network, and accelerating the acquisition of network structure information by sharing convolution kernels.
3. The method for pruning a model for network structure search according to claim 2, wherein the S1.3 model improvement specifically comprises:
(1) First, defining the network structure according to the task, setting the network structure search range, and randomly and independently sampling the number of channels of each layer during model training;
(2) During the random sampling of channels, sampling the upper and lower limits of the set threshold with emphasis, so that the accuracy of the model stays within a preset range;
(3) During the network structure search, adopting in-situ distillation and using the set upper pruning limit ratio to guide the training of the other ratios;
(4) Sharing convolution kernels to accelerate the collection of network structure information; the closer a convolution kernel is to the front (the lower its channel index), the more times it is reused.
4. The method for model pruning for network structure search according to claim 3,
the number of channels of each layer is randomly sampled during model training, and during the random sampling of channels the upper and lower limits of the set threshold are sampled with emphasis; the output of a single neuron can be represented by equation (1), and equation (2) gives the output of the first k channels:

y = \sum_{i=1}^{n} w_i x_i    (1)

y_k = \sum_{i=1}^{k} w_i x_i,  k_0 \le k \le n    (2)

wherein w and x are the upper-layer convolution kernel weights and the input feature map, n is the number of input channels, y_k satisfying equation (2) is the output of the first k channels, and k and k_0 define the structure search range.
5. The method for pruning a model for a network structure search according to claim 1, wherein,
obtaining the sensitivity of each convolution layer to pruning in S2 specifically comprises:
S2.1, performing a small number of training iterations on the improved model using the training and validation data until convergence;
S2.2, loading the model, randomly deleting convolution kernels of each layer within the set threshold, recording the final relevant data of the network, and sorting and dividing the data to obtain the sensitivity of the different convolution layers to the model output.
6. The method for pruning a model for a network structure search according to claim 1, wherein,
the S4 model pruning specifically comprises: applying L1 regularization to the scaling factors of the BN layers so that the scaling factors tend to zero, sorting the channels in ascending order of importance, adapting the pruning rate of each layer, and pruning the channels of relatively low importance and the corresponding filters in each layer.
7. The method for model pruning for network structure search according to claim 6, wherein,
as shown in formula (3), the scaling factors of the BN layers are introduced into the objective function as a sparsity penalty term, and the network weights and the scaling factors are trained jointly so that many scaling factors in the resulting model tend to 0:

L = \sum_{(x,y)} l(f(x, w), y) + \lambda \sum_{\gamma \in \Gamma} |\gamma|    (3)

wherein the first term of the objective function is the training loss function, L1 regularization |\gamma| is adopted as the sparsification penalty on the scaling factors, (x, y) are the training input data and corresponding labels, w denotes the trainable parameters, and \lambda is a balance factor.
8. The method for pruning a model for a network structure search according to claim 1, wherein,
S5 specifically comprises: fine-tuning the model after clipping: the clipping rate of each layer is adapted according to the change of the absolute difference of the optimization target relative to the start of the iteration, the sensitivity of the corresponding convolution layer to pruning, and the requirements of the edge device on the amount of computation and the number of parameters; the iteration is terminated once the performance on the test set reaches the expected target.
9. The method for pruning a model for a network structure search according to claim 1, wherein,
the model test in S6 specifically comprises: feeding the test data into the model for testing, and comparing the recognition results of the same targets before and after model pruning.
10. The method for pruning a model for a network structure search according to claim 1, wherein,
the model output in S6 specifically comprises: exporting the trained model with the optimal parameters to an h5 file so that the model can be ported to embedded devices.
CN202311396457.4A (priority date 2023-10-26, filing date 2023-10-26) Model pruning method based on network structure search, granted as CN117131920B, status: Active

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311396457.4A CN117131920B (en) 2023-10-26 2023-10-26 Model pruning method based on network structure search


Publications (2)

Publication Number Publication Date
CN117131920A (en) 2023-11-28
CN117131920B CN117131920B (en) 2024-01-30

Family

ID=88851194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311396457.4A Active CN117131920B (en) 2023-10-26 2023-10-26 Model pruning method based on network structure search

Country Status (1)

Country Link
CN (1) CN117131920B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021164752A1 (en) * 2020-02-21 2021-08-26 华为技术有限公司 Neural network channel parameter searching method, and related apparatus
CN114330644A (en) * 2021-12-06 2022-04-12 华中光电技术研究所(中国船舶重工集团公司第七一七研究所) Neural network model compression method based on structure search and channel pruning
WO2022141754A1 (en) * 2020-12-31 2022-07-07 之江实验室 Automatic pruning method and platform for general compression architecture of convolutional neural network
CN114881136A (en) * 2022-04-27 2022-08-09 际络科技(上海)有限公司 Classification method based on pruning convolutional neural network and related equipment
CN115018039A (en) * 2021-03-05 2022-09-06 华为技术有限公司 Neural network distillation method, target detection method and device
US20220351043A1 (en) * 2021-04-30 2022-11-03 Chongqing University Adaptive high-precision compression method and system based on convolutional neural network model
CN115688908A (en) * 2022-09-28 2023-02-03 东南大学 Efficient neural network searching and training method based on pruning technology
US20230084203A1 (en) * 2021-09-06 2023-03-16 Baidu Usa Llc Automatic channel pruning via graph neural network based hypernetwork
US20230153623A1 (en) * 2021-11-18 2023-05-18 GM Global Technology Operations LLC Adaptively pruning neural network systems


Also Published As

Publication number Publication date
CN117131920B (en) 2024-01-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant