CN113065653A - Design method of lightweight convolutional neural network for mobile terminal image classification - Google Patents

Design method of lightweight convolutional neural network for mobile terminal image classification

Info

Publication number
CN113065653A
Authority
CN
China
Prior art keywords
network
channels
mainnet
module
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110462584.4A
Other languages
Chinese (zh)
Inventor
袁海英 (Yuan Haiying)
成君鹏 (Cheng Junpeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN202110462584.4A
Publication of CN113065653A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Abstract

The invention provides a method for designing a lightweight convolutional neural network for mobile terminal image classification, comprising the following steps: design a lightweight convolutional neural network in which a MainNet and an AuxiliaryNet process input features at different resolutions; train and test the network model; apply structured pruning to the trained network, determining the clipping thresholds in MainNet and AuxiliaryNet with a cumulative-ratio method and a k-means method, respectively; reconstruct the lightweight network according to the clipping results, adjusting the channel count of each layer to balance classification accuracy against model complexity; and reconstruct, retrain and re-prune the model multiple times to obtain a final model with excellent performance. The network model finally obtained by this method has high classification accuracy, its parameter count and computational cost are far smaller than those of other mainstream networks, and it effectively eases the pressure of deploying convolutional neural networks on mobile terminals.

Description

Design method of lightweight convolutional neural network for mobile terminal image classification
Technical Field
The invention relates to the technical field of deep learning and image processing, in particular to a design method of a lightweight convolutional neural network for mobile terminal image classification.
Background
Deep learning shows huge potential in the computer vision field (image processing, target detection, video analysis and the like), and with the rapid growth of demand for embedded systems and mobile terminal devices in industry, security, traffic, the Internet and other fields, convolutional neural network models for real-time image classification tasks face new technical challenges. Most convolutional neural network models must run on a PC or a server because they involve a huge amount of computation and parameters, and such large-scale models cannot be deployed on resource-limited mobile terminal devices. Therefore, for the problem of real-time image classification in mobile terminal applications, research on general and efficient lightweight convolutional neural network structure design and model compression has broad application prospects and important engineering value.
Disclosure of Invention
The purpose of the invention is as follows: when a mobile terminal device executes a real-time image classification task, it is limited by hardware resources and application scenarios, and a large-scale convolutional neural network model is often difficult to deploy. The invention improves the existing lightweight convolutional neural network models in a targeted manner, greatly reducing the computation and storage cost of the model while preserving its performance, so that the model is easy to deploy on mobile terminal devices. The computational cost of a convolutional neural network depends largely on the input image size: since it scales with the product of the input height and width, halving both dimensions reduces the computation to a quarter, i.e. the model saves 75% of the computation when the input size is halved. The invention therefore designs a lightweight convolutional neural network for mobile terminal image classification consisting mainly of a MainNet part and an AuxiliaryNet part, where the AuxiliaryNet extracts feature information from the input image, the MainNet extracts feature information from the downsampled input image, and the output features of the two parts are concatenated. Different pruning strategies are adopted for MainNet and AuxiliaryNet, reducing the computation and parameter count of the model while preserving its classification performance; and the network structure is reconstructed according to the pruning results, balancing classification performance against computational complexity and further reducing the model's computation and parameters.
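As a quick check of that 75% figure, a minimal Python sketch (the layer sizes here are arbitrary): the multiply-accumulate count of a convolution layer scales with the product of the feature map height and width, so halving the input side length quarters the computation.

    def conv_flops(h, w, c_in, c_out, k=3):
        # Multiply-accumulate count of a stride-1, 'same'-padded convolution layer.
        return h * w * c_in * c_out * k * k

    full = conv_flops(224, 224, 32, 32)
    half = conv_flops(112, 112, 32, 32)
    print(1 - half / full)  # 0.75, i.e. a 75% saving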
The technical scheme is as follows: to achieve this purpose, the invention provides a method for designing a lightweight convolutional neural network for mobile terminal image classification, comprising the following steps:
Step 1: design a lightweight convolutional neural network with depthwise separable convolution as its main structure. The designed network consists of 3 to 4 modules: 3 modules are used when the resolution of the original input image is below 224 × 224, and 4 modules when it is 224 × 224 or above. Each module consists of a MainNet part and an AuxiliaryNet part. MainNet is the main body of the lightweight network, and the channel count of its output feature map equals that of its input feature map. AuxiliaryNet is a supplementary network to MainNet that acquires the feature map information before downsampling; its output channel count is controlled by a coefficient α, whose default initial value is 1. The outputs of the two networks are concatenated by branch fusion and passed to the next module, with the output channel count controlled by a coefficient β: if MainNet and AuxiliaryNet output a and b channels respectively, the channel count after branch fusion is c = β(a + b), where β has a default initial value of 1.
Step 2: train the network of step 1 on the training set for 160 epochs; after training, test it on the test set and record the model's accuracy and computational cost.
Step 3: perform channel pruning on the trained network. Channel pruning is a form of structured pruning that simplifies the model by removing unimportant channels. During pruning, the clipping threshold is determined from the γ factors of the BN layers, which are added to the loss function as a regularization term.
The clipping thresholds in the MainNet and AuxiliaryNet networks are determined with a cumulative-ratio method and a k-means method, respectively. For MainNet, the threshold is determined from the ratio of a running sum to the total: let the f factors, sorted from small to large, be γ1, γ2, ..., γk, ..., γf, with total sum Zsum, and accumulate them in sequence, denoted
Zk = γ1 + γ2 + ... + γk.
During accumulation, when Zk/Zsum first reaches a preset ratio, the index k is recorded and the threshold is th = (γk-1 + γk)/2. For AuxiliaryNet, k-means clustering with two clusters is adopted: the γ factors in the neighborhood of 0 form one class, the remaining factors form the other, and the minimum value of the latter class is taken as the threshold.
Step 4: reconstruct the model according to the clipping results of step 3. Clipping rates p and q are defined to reflect the clipping of each convolution layer in MainNet and AuxiliaryNet respectively, where n is the number of bottleneck repetitions in MainNet, X is the number of channels to be clipped in a layer, and Y is the total number of channels in that layer (for MainNet, Y is the post-expansion channel count):
p = (X1/Y1 + X2/Y2 + ... + Xn/Yn) / n
q = X/Y
(The formula for p is given as an equation image in the original publication; it is rendered here, from the definitions above, as the mean clipping ratio over the n repeated bottleneck layers.)
A layer with a high clipping rate is less important, and its channel count is reduced; a layer with a low clipping rate is more important, and its channel count is increased. Model reconstruction is realized by reassigning the channel-control coefficients α and β as α′ and β′ according to the clipping rates p and q. α controls the channel count of AuxiliaryNet and is adjusted in each module according to that module's AuxiliaryNet clipping rate q. β controls the number of output channels after branch fusion: β in the first module is adjusted according to the MainNet clipping rate p of the second module, β in the second module according to p of the third module, and so on, while β in the last module is left unchanged. α and β are reassigned as α′ and β′:
(The reassignment formulas for α′ and β′ are given as equation images in the original publication; a high clipping rate decreases the corresponding coefficient and a low clipping rate increases it.)
Step 5: train the reconstructed network model on the training set for 160 epochs; after training, prune it according to the method of step 3, test the pruned model on the test set, and record its accuracy and computational cost.
Step 6: taking the model accuracy and computational cost obtained in step 2 as the reference, judge whether the computational cost of the model obtained in step 5 has decreased while the accuracy is maintained (an accuracy drop of less than 1%). If the accuracy is maintained and the cost has decreased, repeat steps 4 and 5; if the accuracy drops by more than 1% or the cost no longer decreases, output the current model as the final model.
Optionally, the MainNet in step 1 is specifically as follows: MainNet is the main body of the designed lightweight network and processes the feature map information after downsampling; if the resolution of the input image is K × K, the resolution of the downsampled image is (K/2) × (K/2). The basic unit of MainNet is the bottleneck, repeated n times within a module. The bottleneck comprises the following operations: (1) a pointwise convolution expands the channels: with Cin input channels, the expanded channel count is Cin · t, where t is the expansion coefficient controlling the degree of expansion; (2) a depthwise convolution processes the data, extracting precise image feature information in the higher spatial dimension; (3) a pointwise convolution reduces the channel count back to the input dimension, giving an output channel count Cout = Cin; that is, the channel count of each MainNet bottleneck is unchanged before the dimension raising and after the reduction. Furthermore, the bottleneck adopts a residual structure, linearly adding the input to the output.
Further, the expansion coefficient t decreases as the resolution of the feature maps fed into successive modules decreases: when there are 3 modules, the value of t in each module is 6, 4 and 2, respectively; when there are 4 modules, the value of t is 6, 4, 2 and 1, respectively.
Optionally, when there are 3 modules, the value of n in each module is 3, 4 and 1, respectively; when there are 4 modules, the value of n is 2, 3, 2 and 1, respectively.
Optionally, the AuxiliaryNet in step 1 is specifically as follows: AuxiliaryNet is a supplementary network to MainNet that acquires the feature map information before downsampling. First, AuxiliaryNet controls the number of channels of the input image through a pointwise convolution; unlike in MainNet, the pointwise convolution here reduces the number of input channels, with the degree of reduction controlled by a coefficient α whose default initial value is 1. When the input feature map has Cin channels, the feature map after the pointwise convolution has Cin · α channels. Second, AuxiliaryNet performs a 3 × 3 depthwise convolution with stride 2, ensuring that the output feature map size is consistent with that of MainNet. Finally, a pointwise convolution fuses the channel information, and the output feature map has Cout = Cin · α channels.
The advantages and beneficial effects of the technical scheme adopted by the invention are as follows:
The invention provides a method for designing a lightweight convolutional neural network for mobile terminal image classification. Through the lightweight design of the network structure, the network scale is greatly reduced without sacrificing classification performance, effectively cutting storage cost and computation. A targeted pruning strategy is adopted, the lightweight network is reconstructed according to the clipping results, and the channel count of each layer is adjusted to balance classification accuracy against model complexity; the model is reconstructed, retrained and re-pruned multiple times to obtain a final model with excellent performance. The resulting network model has good classification performance, its parameter count and computation are far smaller than those of mainstream networks, and it can meet the deployment requirements of mobile terminal devices with limited hardware resources.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a schematic diagram of the structure of branch fusion;
FIG. 3 is a schematic structural diagram of a designed MainNet;
FIG. 4 is a schematic structural diagram of a designed AuxiliaryNet;
Detailed Description
This embodiment relates to a method for designing a lightweight convolutional neural network for mobile terminal image classification; the specific flow is shown in fig. 1 and comprises the following 6 steps:
Step 1: design a lightweight convolutional neural network with depthwise separable convolution as its main structure. The designed network consists of 3 to 4 modules: 3 modules are used when the resolution of the original input image is below 224 × 224, and 4 modules when it is 224 × 224 or above. Each module consists of a MainNet part and an AuxiliaryNet part. MainNet is the main body of the lightweight network, and the channel count of its output feature map equals that of its input feature map. AuxiliaryNet is a supplementary network to MainNet that acquires the feature map information before downsampling; its output channel count is controlled by a coefficient α, whose default initial value is 1. The outputs of the two networks are concatenated by branch fusion (see fig. 2) and passed to the next module, with the output channel count controlled by a coefficient β: if MainNet and AuxiliaryNet output a and b channels respectively, the channel count after branch fusion is c = β(a + b), where β has a default initial value of 1.
The MainNet in step 1 is shown in fig. 3 and is specifically as follows: MainNet is the main body of the designed lightweight network and processes the feature map information after downsampling; if the resolution of the input image is K × K, the resolution of the downsampled image is (K/2) × (K/2). The basic unit of MainNet is the bottleneck, repeated n times within a module. The bottleneck comprises the following operations: (1) a pointwise convolution expands the channels: with Cin input channels, the expanded channel count is Cin · t, where t is the expansion coefficient controlling the degree of expansion; (2) a depthwise convolution processes the data, extracting precise image feature information in the higher spatial dimension; (3) a pointwise convolution reduces the channel count back to the input dimension, giving an output channel count Cout = Cin; that is, the channel count of each MainNet bottleneck is unchanged before the dimension raising and after the reduction. Furthermore, the bottleneck adopts a residual structure, linearly adding the input to the output.
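For concreteness, the bottleneck just described could be sketched in PyTorch as below; the patent gives no source code, so the BN/ReLU placement and bias settings are assumptions following common practice:

    import torch
    import torch.nn as nn

    class Bottleneck(nn.Module):
        """Sketch of a MainNet bottleneck: pointwise expansion by t,
        3x3 depthwise convolution, pointwise projection back to C_in,
        with a residual connection."""
        def __init__(self, channels: int, t: int):
            super().__init__()
            hidden = channels * t  # expanded channel count C_in * t
            self.block = nn.Sequential(
                nn.Conv2d(channels, hidden, 1, bias=False),  # pointwise expand
                nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden,
                          bias=False),                        # depthwise 3x3
                nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, channels, 1, bias=False),  # pointwise project: C_out = C_in
                nn.BatchNorm2d(channels),
            )

        def forward(self, x):
            return x + self.block(x)  # residual: input linearly added to output

    y = Bottleneck(16, t=6)(torch.randn(1, 16, 32, 32))  # shape preserved: (1, 16, 32, 32)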
The expansion coefficient t decreases as the resolution of the feature maps fed into successive modules decreases: when there are 3 modules, the value of t in each module is 6, 4 and 2, respectively; when there are 4 modules, the value of t is 6, 4, 2 and 1, respectively.
When there are 3 modules, the value of n in each module is 3, 4 and 1, respectively; when there are 4 modules, the value of n is 2, 3, 2 and 1, respectively.
Referring to fig. 4, the AuxiliaryNet in step 1 is specifically as follows: AuxiliaryNet is a supplementary network to MainNet that acquires the feature map information before downsampling. First, AuxiliaryNet controls the number of channels of the input image through a pointwise convolution; unlike in MainNet, the pointwise convolution here reduces the number of input channels, with the degree of reduction controlled by a coefficient α whose default initial value is 1. When the input feature map has Cin channels, the feature map after the pointwise convolution has Cin · α channels. Second, AuxiliaryNet performs a 3 × 3 depthwise convolution with stride 2, ensuring that the output feature map size is consistent with that of MainNet. Finally, a pointwise convolution fuses the channel information, and the output feature map has Cout = Cin · α channels.
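Likewise, a minimal sketch of an AuxiliaryNet block and the branch fusion, under the same assumptions (the stand-in tensor below replaces a real MainNet branch):

    import torch
    import torch.nn as nn

    class AuxiliaryBlock(nn.Module):
        """Sketch of an AuxiliaryNet block: pointwise reduction controlled
        by alpha, stride-2 3x3 depthwise convolution (matching MainNet's
        downsampled feature size), then pointwise channel fusion."""
        def __init__(self, in_ch: int, alpha: float = 1.0):
            super().__init__()
            mid = max(1, round(in_ch * alpha))  # C_in * alpha channels
            self.block = nn.Sequential(
                nn.Conv2d(in_ch, mid, 1, bias=False),  # pointwise: reduce channels
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 3, stride=2, padding=1, groups=mid,
                          bias=False),                  # depthwise, stride 2
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 1, bias=False),    # pointwise: fuse channel information
                nn.BatchNorm2d(mid),
            )

        def forward(self, x):
            return self.block(x)

    x = torch.randn(1, 32, 64, 64)
    aux_out = AuxiliaryBlock(32)(x)            # b = 32 channels at 32x32
    main_out = torch.randn(1, 32, 32, 32)      # stand-in for the MainNet output (a = 32)
    fused = torch.cat([main_out, aux_out], 1)  # a + b channels; a following pointwise
                                               # layer would set c = beta * (a + b)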
In this example, the input image size is 32 × 32 × 3 and 3 modules are used; the network structure is shown in Table 1.
Table 1 lightweight convolutional neural network architecture designed by the present invention
(The layer-by-layer structure of Table 1 is reproduced as images in the original publication.)
Step 2: train the network of step 1 on the training set for 160 epochs. During training, the γ factors of the BN layers are added to the loss function with an L1 regularization constraint, so that the network weights W and the factors γ are trained jointly. The loss function is:
L = Σ(x,y) l(f(x, W), y) + λ · Σγ∈Γ g(γ)
the first item of the Loss function is a Loss function of the network, and a cross entropy function is adopted, wherein x is data input by training, y is a label, and W is network weight. The second term is the L1 regular constraint term of the BN layer gamma factor. Wherein gamma is a scaling factor of the BN layer, each layer of channel has corresponding gamma, and the gamma value of the channel with lower importance is smaller; lambda is a hyper-parameter with a value of 0.0001 for balancing the first and second terms; a constraint function g () is applied to the γ factor of the BN layer, g (γ) ═ γ |.
After training is finished, testing is performed on the test set, and the accuracy and computational cost of the model are recorded.
Step 3: perform channel pruning on the trained network. Channel pruning is a form of structured pruning that simplifies the model by removing unimportant channels. During pruning, the clipping threshold is determined from the γ factors of the BN layers, which were constrained through the loss function above.
The clipping thresholds in the MainNet and AuxiliaryNet networks are determined with a cumulative-ratio method and a k-means method, respectively. For MainNet, the threshold is determined from the ratio of a running sum to the total: let the f factors, sorted from small to large, be γ1, γ2, ..., γk, ..., γf, with total sum Zsum, and accumulate them in sequence, denoted
Zk = γ1 + γ2 + ... + γk.
During accumulation, when Zk/Zsum first reaches a preset ratio, the index k is recorded and the threshold is th = (γk-1 + γk)/2. For AuxiliaryNet, k-means clustering with two clusters is adopted: the γ factors in the neighborhood of 0 form one class, the remaining factors form the other, and the minimum value of the latter class is taken as the threshold.
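The two threshold rules might be sketched as follows (NumPy and scikit-learn; since the preset cumulative ratio appears only as an equation image in the original, it is left as a parameter here):

    import numpy as np
    from sklearn.cluster import KMeans

    def mainnet_threshold(gammas: np.ndarray, ratio: float) -> float:
        # Sort gammas ascending, find the first k where the running sum
        # reaches `ratio` of the total, and return th = (g[k-1] + g[k]) / 2.
        g = np.sort(gammas)
        k = int(np.searchsorted(np.cumsum(g) / g.sum(), ratio))
        return float((g[k - 1] + g[k]) / 2) if k > 0 else float(g[0]) / 2

    def auxiliarynet_threshold(gammas: np.ndarray) -> float:
        # k-means with 2 clusters: a near-zero class and the rest;
        # the minimum of the larger-centre class is the threshold.
        km = KMeans(n_clusters=2, n_init=10).fit(gammas.reshape(-1, 1))
        big = int(np.argmax(km.cluster_centers_.ravel()))
        return float(gammas[km.labels_ == big].min())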
Step 4: reconstruct the model according to the clipping results of step 3. Clipping rates p and q are defined to reflect the clipping of each convolution layer in MainNet and AuxiliaryNet respectively, where n is the number of bottleneck repetitions in MainNet, X is the number of channels to be clipped in a layer, and Y is the total number of channels in that layer (for MainNet, Y is the post-expansion channel count):
p = (X1/Y1 + X2/Y2 + ... + Xn/Yn) / n
q = X/Y
A layer with a high clipping rate is less important, and its channel count is reduced; a layer with a low clipping rate is more important, and its channel count is increased. Model reconstruction is realized by reassigning the channel-control coefficients α and β as α′ and β′ according to the clipping rates p and q. α controls the channel count of AuxiliaryNet and is adjusted in each module according to that module's AuxiliaryNet clipping rate q. β controls the number of output channels after branch fusion: β in the first module is adjusted according to the MainNet clipping rate p of the second module, β in the second module according to p of the third module, and so on, while β in the last module is left unchanged. α and β are reassigned as α′ and β′:
(The reassignment formulas for α′ and β′ are given as equation images in the original publication; a high clipping rate decreases the corresponding coefficient and a low clipping rate increases it.)
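Given binary keep/clip masks produced by the pruning step, p and q could be computed as below; treating p as the mean per-layer clipping ratio over the n repetitions is an assumption, since the exact formula appears as an equation image:

    import torch

    def clipping_rates(mainnet_masks, aux_mask):
        # mainnet_masks: one 0/1 tensor per repeated bottleneck layer (0 = clipped);
        # aux_mask: 0/1 tensor for an AuxiliaryNet layer.
        p = sum((m == 0).sum() / m.numel() for m in mainnet_masks) / len(mainnet_masks)
        q = (aux_mask == 0).sum() / aux_mask.numel()
        return float(p), float(q)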
Step 5: train the reconstructed network model on the training set for 160 epochs; after training, prune it according to the method of step 3, test the pruned model on the test set, and record its accuracy and computational cost.
Step 6: taking the model accuracy and computational cost obtained in step 2 as the reference, judge whether the computational cost of the model obtained in step 5 has decreased while the accuracy is maintained (an accuracy drop of less than 1%). If the accuracy is maintained and the cost has decreased, repeat steps 4 and 5; if the accuracy drops by more than 1% or the cost no longer decreases, output the current model as the final model.
To verify the effectiveness of the model, classical lightweight convolutional neural networks such as SqueezeNet, MobileNet V1/V2 and ShuffleNet V1/V2 were selected for comparison. All experiments were completed in the same environment on the CIFAR100 dataset under the PyTorch deep learning framework. CIFAR100 contains 100 classes of 600 images each (500 training images and 100 test images) at a size of 32 × 32 × 3. Stochastic gradient descent was used as the training algorithm, with momentum set to 0.9 and weight decay set to 0.0001. The initial learning rate was 0.2 and was reduced by cosine annealing; the number of training epochs was 160, the batch size was 128, and the loss function was cross entropy. The experimental results are shown in Table 2:
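This training recipe corresponds to a standard PyTorch setup, sketched below with a placeholder model and dummy data in place of the designed network and the CIFAR100 loaders:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 100))  # placeholder model
    train_loader = [(torch.randn(128, 3, 32, 32),
                     torch.randint(0, 100, (128,)))]                  # dummy CIFAR100-shaped batch

    optimizer = torch.optim.SGD(model.parameters(), lr=0.2,
                                momentum=0.9, weight_decay=1e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=160)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(160):              # 160 training epochs
        for x, y in train_loader:         # batch size 128
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()                  # cosine learning-rate decay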
TABLE 2 comparison of the inventive model with other models (Params and Flops do not include fully connected layers)
(Table 2 is reproduced as an image in the original publication.)
The experimental results show that the image classification accuracy of the designed model is superior to that of the other lightweight networks, while the parameter count and computational complexity are greatly reduced: the final network model has only 0.22M parameters and 13.15M FLOPs, and achieves a classification accuracy of 70.82% on the CIFAR100 dataset.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit its scope; any changes made by applying the principles of the present invention without inventive effort shall fall within the scope of the present invention.

Claims (5)

1. A method for designing a lightweight convolutional neural network for mobile terminal image classification, characterized by comprising the following steps:
Step 1: design a lightweight convolutional neural network with depthwise separable convolution as its main structure. The designed network consists of 3 to 4 modules: 3 modules are used when the resolution of the original input image is below 224 × 224, and 4 modules when it is 224 × 224 or above. Each module consists of a MainNet part and an AuxiliaryNet part. MainNet is the main body of the lightweight network, and the channel count of its output feature map equals that of its input feature map. AuxiliaryNet is a supplementary network to MainNet that acquires the feature map information before downsampling; its output channel count is controlled by a coefficient α, whose default initial value is 1. The outputs of the two networks are concatenated by branch fusion and passed to the next module, with the output channel count controlled by a coefficient β: if MainNet and AuxiliaryNet output a and b channels respectively, the channel count after branch fusion is c = β(a + b), where β has a default initial value of 1;
Step 2: train the network of step 1 on the training set for 160 epochs; after training, test it on the test set and record the model's accuracy and computational cost;
Step 3: perform channel pruning on the trained network. Channel pruning is a form of structured pruning that simplifies the model by removing unimportant channels. During pruning, the clipping threshold is determined from the γ factors of the BN layers, which are added to the loss function as a regularization term;
the clipping thresholds in the MainNet and AuxiliaryNet networks are determined with a cumulative-ratio method and a k-means method, respectively. For MainNet, the threshold is determined from the ratio of a running sum to the total: let the f factors, sorted from small to large, be γ1, γ2, ..., γk, ..., γf, with total sum Zsum, and accumulate them in sequence, denoted
Zk = γ1 + γ2 + ... + γk.
During accumulation, when Zk/Zsum first reaches a preset ratio, the index k is recorded and the threshold is th = (γk-1 + γk)/2. For AuxiliaryNet, k-means clustering with two clusters is adopted: the γ factors in the neighborhood of 0 form one class, the remaining factors form the other, and the minimum value of the latter class is taken as the threshold;
Step 4: reconstruct the model according to the clipping results of step 3. Clipping rates p and q are defined to reflect the clipping of each convolution layer in MainNet and AuxiliaryNet respectively, where n is the number of bottleneck repetitions in MainNet, X is the number of channels to be clipped in a layer, and Y is the total number of channels in that layer (for MainNet, Y is the post-expansion channel count):
p = (X1/Y1 + X2/Y2 + ... + Xn/Yn) / n
q = X/Y
a layer with a high clipping rate is less important, and its channel count is reduced; a layer with a low clipping rate is more important, and its channel count is increased. Model reconstruction is realized by reassigning the channel-control coefficients α and β as α′ and β′ according to the clipping rates p and q. α controls the channel count of AuxiliaryNet and is adjusted in each module according to that module's AuxiliaryNet clipping rate q. β controls the number of output channels after branch fusion: β in the first module is adjusted according to the MainNet clipping rate p of the second module, β in the second module according to p of the third module, and so on, while β in the last module is left unchanged. α and β are reassigned as α′ and β′:
(The reassignment formulas for α′ and β′ are given as equation images in the original publication; a high clipping rate decreases the corresponding coefficient and a low clipping rate increases it.)
Step 5: train the reconstructed network model on the training set for 160 epochs; after training, prune it according to the method of step 3, test the pruned model on the test set, and record its accuracy and computational cost;
Step 6: taking the model accuracy and computational cost obtained in step 2 as the reference, judge whether the computational cost of the model obtained in step 5 has decreased while the accuracy is maintained (an accuracy drop of less than 1%); if the accuracy is maintained and the cost has decreased, repeat steps 4 and 5; if the accuracy drops by more than 1% or the cost no longer decreases, output the current model as the final model.
2. The method for designing a lightweight convolutional neural network for mobile terminal image classification according to claim 1, wherein the MainNet in step 1 is specifically as follows:
the MainNet is a main body of the designed lightweight network and is used for processing the characteristic diagram information after down sampling; if the resolution of the input image is K x K, the resolution of the down-sampled image is (K/2) x (K/2); the basic module of the MainNet is bottleeck, and the repetition frequency of the bottleeck in the module is n; the bottleeck includes the following operations: (1) performing pointwise convolution to expand channels, wherein the number of input channels is CinThe number of channels after expansion is CinT, where t is the expansion coefficient that controls the degree of expansion; (2) performing depthwise convolution operation to realize data processing, and extracting accurate image characteristic information in a higher spatial dimension; (3) reducing the channel number to input dimension by adopting pointwise convolution operation, and outputting the channel number Cout=CinThat is, the number of channels of each bottleeck of the MainNet before and after the liter dimension is kept unchanged; furthermore, bottleeck adopts a residual structure and linearly adds the input and output results.
3. The method for designing a lightweight convolutional neural network for mobile terminal image classification according to claim 2, wherein the expansion coefficient t decreases as the resolution of the feature maps fed into successive modules decreases; when there are 3 modules, the value of t in each module is 6, 4 and 2, respectively; when there are 4 modules, the value of t is 6, 4, 2 and 1, respectively.
4. The method for designing a lightweight convolutional neural network for mobile terminal image classification according to claim 1, wherein, for the number of repetitions n, when there are 3 modules, the value of n in each module is 3, 4 and 1, respectively; when there are 4 modules, the value of n is 2, 3, 2 and 1, respectively.
5. The method for designing a lightweight convolutional neural network for mobile terminal image classification according to claim 1, wherein the AuxiliaryNet in step 1 is specifically as follows:
AuxiliaryNet is a supplementary network to MainNet that acquires the feature map information before downsampling. First, AuxiliaryNet controls the number of channels of the input image through a pointwise convolution; unlike in MainNet, the pointwise convolution here reduces the number of input channels, with the degree of reduction controlled by a coefficient α whose default initial value is 1. When the input feature map has Cin channels, the feature map after the pointwise convolution has Cin · α channels. Second, AuxiliaryNet performs a 3 × 3 depthwise convolution with stride 2, ensuring that the output feature map size is consistent with that of MainNet. Finally, a pointwise convolution fuses the channel information, and the output feature map has Cout = Cin · α channels.
CN202110462584.4A 2021-04-27 2021-04-27 Design method of lightweight convolutional neural network for mobile terminal image classification Pending CN113065653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110462584.4A CN113065653A (en) 2021-04-27 2021-04-27 Design method of lightweight convolutional neural network for mobile terminal image classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110462584.4A CN113065653A (en) 2021-04-27 2021-04-27 Design method of lightweight convolutional neural network for mobile terminal image classification

Publications (1)

Publication Number Publication Date
CN113065653A true CN113065653A (en) 2021-07-02

Family

ID=76568097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110462584.4A Pending CN113065653A (en) 2021-04-27 2021-04-27 Design method of lightweight convolutional neural network for mobile terminal image classification

Country Status (1)

Country Link
CN (1) CN113065653A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554084A (en) * 2021-07-16 2021-10-26 华侨大学 Vehicle re-identification model compression method and system based on pruning and light-weight convolution
CN113743591A (en) * 2021-09-14 2021-12-03 北京邮电大学 Method and system for automatically pruning convolutional neural network
CN114677545A (en) * 2022-03-29 2022-06-28 电子科技大学 Lightweight image classification method based on similarity pruning and efficient module

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018102037A4 (en) * 2018-12-09 2019-01-17 Ge, Jiahao Mr A method of recognition of vehicle type based on deep learning
CN111882040A (en) * 2020-07-30 2020-11-03 中原工学院 Convolutional neural network compression method based on channel number search
CN111967305A (en) * 2020-07-01 2020-11-20 华南理工大学 Real-time multi-scale target detection method based on lightweight convolutional neural network
CN112418396A (en) * 2020-11-20 2021-02-26 北京工业大学 Sparse activation perception type neural network accelerator based on FPGA
CN112528830A (en) * 2020-12-07 2021-03-19 南京航空航天大学 Lightweight CNN mask face pose classification method combined with transfer learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2018102037A4 (en) * 2018-12-09 2019-01-17 Ge, Jiahao Mr A method of recognition of vehicle type based on deep learning
CN111967305A (en) * 2020-07-01 2020-11-20 华南理工大学 Real-time multi-scale target detection method based on lightweight convolutional neural network
CN111882040A (en) * 2020-07-30 2020-11-03 中原工学院 Convolutional neural network compression method based on channel number search
CN112418396A (en) * 2020-11-20 2021-02-26 北京工业大学 Sparse activation perception type neural network accelerator based on FPGA
CN112528830A (en) * 2020-12-07 2021-03-19 南京航空航天大学 Lightweight CNN mask face pose classification method combined with transfer learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
白士磊; 殷柯欣; 朱建启: "Traffic sign detection algorithm based on lightweight YOLOv3" (轻量级YOLOv3的交通标志检测算法), Computer and Modernization (计算机与现代化), no. 09, 15 September 2020 (2020-09-15) *
邵伟平; 王兴; 曹昭睿; 白帆: "Design of lightweight convolutional neural network based on MobileNet and YOLOv3" (基于MobileNet与YOLOv3的轻量化卷积神经网络设计), Journal of Computer Applications (计算机应用), no. 1, 10 July 2020 (2020-07-10) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113554084A (en) * 2021-07-16 2021-10-26 华侨大学 Vehicle re-identification model compression method and system based on pruning and light-weight convolution
CN113554084B (en) * 2021-07-16 2024-03-01 华侨大学 Vehicle re-identification model compression method and system based on pruning and light convolution
CN113743591A (en) * 2021-09-14 2021-12-03 北京邮电大学 Method and system for automatically pruning convolutional neural network
CN113743591B (en) * 2021-09-14 2023-12-26 北京邮电大学 Automatic pruning convolutional neural network method and system
CN114677545A (en) * 2022-03-29 2022-06-28 电子科技大学 Lightweight image classification method based on similarity pruning and efficient module
CN114677545B (en) * 2022-03-29 2023-05-23 电子科技大学 Lightweight image classification method based on similarity pruning and efficient module

Similar Documents

Publication Publication Date Title
CN113065653A (en) Design method of lightweight convolutional neural network for mobile terminal image classification
CN108846445B (en) Image processing method
CN109165660B (en) Significant object detection method based on convolutional neural network
CN110084221B (en) Serialized human face key point detection method with relay supervision based on deep learning
CN109740731B (en) Design method of self-adaptive convolution layer hardware accelerator
CN107506822B (en) Deep neural network method based on space fusion pooling
WO2018068421A1 (en) Method and device for optimizing neural network
CN108847223B (en) Voice recognition method based on deep residual error neural network
CN110969250A (en) Neural network training method and device
CN112215755B (en) Image super-resolution reconstruction method based on back projection attention network
JP2020119518A (en) Method and device for transforming cnn layers to optimize cnn parameter quantization to be used for mobile devices or compact networks with high precision via hardware optimization
CN112489164A (en) Image coloring method based on improved depth separable convolutional neural network
CN112001294A (en) YOLACT + + based vehicle body surface damage detection and mask generation method and storage device
CN113674172A (en) Image processing method, system, device and storage medium
Huai et al. Zerobn: Learning compact neural networks for latency-critical edge systems
CN113689517A (en) Image texture synthesis method and system of multi-scale channel attention network
Porziani et al. Automatic shape optimisation of structural parts driven by BGM and RBF mesh morphing
CN112164077A (en) Cell example segmentation method based on bottom-up path enhancement
Cusulin et al. A numerical method for spatial diffusion in age‐structured populations
CN108960326B (en) Point cloud fast segmentation method and system based on deep learning framework
CN111461988A (en) Seismic velocity model super-resolution technology based on multi-task learning
He et al. A lightweight multi-scale feature integration network for real-time single image super-resolution
Mokhtar et al. Pedestrian wind factor estimation in complex urban environments
CN110097116A (en) A kind of virtual sample generation method based on independent component analysis and Density Estimator
CN115482434A (en) Small sample high-quality generation method based on multi-scale generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination