CN112016639B - Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet - Google Patents


Info

Publication number
CN112016639B
CN112016639B (application CN202011199528.8A)
Authority
CN
China
Prior art keywords
feature map
convolution
output
characteristic
module
Prior art date
Legal status
Active
Application number
CN202011199528.8A
Other languages
Chinese (zh)
Other versions
CN112016639A (en)
Inventor
Xie Luofeng (谢罗峰)
Zhu Yangyang (朱杨洋)
Xie Zhengfeng (谢政峰)
Yin Ming (殷鸣)
Yin Guofu (殷国富)
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202011199528.8A
Publication of CN112016639A
Application granted
Publication of CN112016639B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a flexible separable convolution framework, a feature extraction method, and their application in VGG and ResNet. The framework comprises a feature map cluster-partitioning module, a first convolution module, a second convolution module, a feature map fusion module, and an attention (SE) module. The cluster-partitioning module divides the feature maps into maps characterizing the main information and maps characterizing supplementary information; the first convolution module applies ordinary convolution to the main-information maps; the second convolution module applies grouped convolution to the supplementary-information maps; the fusion module first concatenates the convolved maps, then adds the original feature maps and applies an activation; the SE module multiplies the extracted channel weights with the feature maps to generate the output feature maps. Combining ordinary convolution, grouped convolution, a residual branch, and the SE attention mechanism, the framework reduces the computation and parameter count of the network while preserving accuracy, and can be dropped into a neural-network convolution layer in plug-and-play fashion.

Description

Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet
Technical Field
The invention relates to the technical field of image processing, and in particular to a flexible separable convolution framework, a feature extraction method, and their application in VGG and ResNet.
Background
In recent years, deep convolutional neural networks have shown excellent performance in computer vision tasks such as image recognition, object detection, and semantic segmentation. However, traditional deep convolutional neural networks need a large number of parameters and floating-point operations to reach satisfactory accuracy, and their inference time is long. In real application scenarios such as mobile or embedded devices, limited memory and computing resources make practical deployment on small devices difficult, and the low-latency requirements of practical applications are hard to meet. Although hardware will keep improving, compression theory for deep convolutional neural network models is crucial at present and remains a hot topic in deep convolutional neural network research.
Disclosure of Invention
The invention provides a flexible separable convolution framework, a feature extraction method, and their application in VGG and ResNet. The convolution framework comprises an ordinary convolution module, a grouped convolution module, a residual branch module, and an attention (SE) module; it reduces the computation and parameter count of the network while preserving accuracy, and can be used in ordinary convolution layers of a neural network, such as the convolution layers of a VGG convolutional neural network and the residual blocks of a ResNet network.
In order to achieve the purpose, the invention adopts the following technical scheme:
a flexible separable deep learning convolution framework comprises a feature map clustering division module, a first convolution operation module, a second convolution operation module, a feature map fusion module, an attention mechanism SE module, M input channels and N output channels;
the characteristic graph clustering and dividing module is used for dividing the M input characteristic graphs into a characteristic main information characteristic graph and a characteristic supplementary information characteristic graph;
the first convolution operation module is used for performing common convolution and BN operation on the characteristic main information characteristic diagram and outputting a main characteristic diagram;
the second convolution operation module is used for performing packet convolution and BN operation on the characteristic supplementary information feature graph and outputting a supplementary feature graph;
the feature map fusion module is used for splicing the output main feature map and the output supplementary feature map along the depth direction, adding the spliced output main feature map and the output supplementary feature map to the preprocessed input feature map, and then outputting the input feature map through the ReLU activation;
and the attention mechanism SE module is used for multiplying the extracted feature map channel weight output by the feature map fusion module and the feature map output by the feature map fusion module to output N output feature maps.
Further, the feature map cluster-partitioning module divides the input feature maps, according to the hyper-parameter supplementary-information ratio α ∈ (0, 1), into main-information feature maps Mrep and supplementary-information feature maps Mred.
Further, the main-information feature maps Mrep = (1-α)M, and the supplementary-information feature maps Mred = αM.
Further, the preprocessed input feature maps are the input feature maps after a 1×1 convolution and a BN operation.
The invention also provides a feature map processing method for the flexible separable deep-learning convolution framework, comprising the following steps (a minimal code sketch follows the steps):
(1) acquire M input feature maps and divide them, according to the hyper-parameter supplementary-information ratio α, into main-information feature maps and supplementary-information feature maps;
(2) apply ordinary convolution and a BN operation to the main-information feature maps to output the main feature maps, and apply grouped convolution and a BN operation to the supplementary-information feature maps to output the supplementary feature maps;
(3) concatenate the output main and supplementary feature maps along the depth direction, add the result to the input feature maps after a 1×1 convolution and BN operation, and output through a ReLU activation;
(4) extract channel weights from the activated output and multiply them with that output to generate the N output feature maps.
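The following PyTorch sketch puts steps (1)-(4) together. It is a minimal sketch under stated assumptions: the class name FSConv, the kernel size k, the group count, and the SE reduction ratio r are illustrative choices rather than values fixed by the patent, and the channel split here is a simple slice standing in for the clustering division described above.

    import math
    import torch
    import torch.nn as nn

    class FSConv(nn.Module):
        """Sketch of the flexible separable convolution framework (assumed names)."""
        def __init__(self, m_in, n_out, alpha=0.25, k=3, groups=4, r=16):
            super().__init__()
            m_red = int(alpha * m_in)   # supplementary-information maps (assumes alpha*m_in >= 1)
            m_rep = m_in - m_red        # main-information maps
            n_red = int(alpha * n_out)
            n_rep = n_out - n_red
            g = math.gcd(groups, math.gcd(m_red, n_red))  # group count must divide both widths
            # step (2a): ordinary k x k convolution + BN on the main-information maps
            self.rep = nn.Sequential(
                nn.Conv2d(m_rep, n_rep, k, padding=k // 2, bias=False),
                nn.BatchNorm2d(n_rep))
            # step (2b): cheap grouped convolution + BN on the supplementary maps
            self.red = nn.Sequential(
                nn.Conv2d(m_red, n_red, k, padding=k // 2, groups=g, bias=False),
                nn.BatchNorm2d(n_red))
            # residual shortcut: 1 x 1 convolution + BN so the input matches the concatenation
            self.shortcut = nn.Sequential(
                nn.Conv2d(m_in, n_out, 1, bias=False),
                nn.BatchNorm2d(n_out))
            self.relu = nn.ReLU(inplace=True)
            # step (4): squeeze-and-excitation channel weights
            hidden = max(1, n_out // r)
            self.se = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(n_out, hidden, 1), nn.ReLU(inplace=True),
                nn.Conv2d(hidden, n_out, 1), nn.Sigmoid())
            self.m_rep = m_rep

        def forward(self, x):
            # step (1): split the input maps by the supplementary-information ratio alpha
            x_rep, x_red = x[:, :self.m_rep], x[:, self.m_rep:]
            # step (3): concatenate along depth, add the preprocessed input, activate
            y = torch.cat([self.rep(x_rep), self.red(x_red)], dim=1)
            y = self.relu(y + self.shortcut(x))
            # step (4): reweight the channels with the extracted SE weights
            return y * self.se(y)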
The invention applies the flexible separable deep-learning convolution framework to the VGG convolutional neural network, where the convolution layers adopt the framework.
The invention also applies the flexible separable deep-learning convolution framework to the residual block of a ResNet network: the block comprises two convolution frameworks, and a hyper-parameter channel scaling factor β is set on the output channels of the first framework and the input channels of the second.
The invention has the following beneficial effects:
(1) After the input feature maps are divided into maps characterizing the main information and maps characterizing supplementary information, the main-information maps undergo "ordinary convolution + BN" to preserve information integrity, while the supplementary-information maps undergo low-cost "grouped convolution + BN" to supplement the overall features, reducing information loss. Using different kinds of convolution in the same layer naturally makes the generated output feature maps more likely to differ, improving network performance. Because grouped convolution extracts information differently from ordinary convolution, communication among the input channels is reduced and the filter depth shrinks, so generating the output feature maps costs fewer parameters and less computation (see the rough count after this list). Grouped convolution loses some features, but a moderate loss in the supplementary information is quite acceptable.
(2) A residual shortcut branch is introduced: the input and output feature maps are added before activation and output. This eases backward propagation of distinguishable information from the front layers and passes their distinguishable main information directly into the rear layers, fusing input-layer and output-layer features and providing multi-level semantic information. The rear-layer information becomes more distinguishable, the rear layers no longer need to re-extract feature diversity from the front-layer maps by convolution, and learning the network parameters becomes easier.
(3) The convolution framework provided by the invention is plug-and-play; in particular, it can be applied to the VGG convolutional neural network.
(4) To apply the convolution framework to a ResNet network, the original structure of the basic residual module is preserved and the overall framework of the shallow network is unchanged, making the module plug-and-play inside shallow residual networks. Only the number of intermediate channels of the residual module changes: a channel scaling factor β is set on the output channels of the 1st convolution layer and the input channels of the 2nd, so the input and output channels of the improved residual block stay consistent with the original residual block.
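A rough, bias-free per-layer parameter count illustrates effect (1); the function name and the channel/group numbers below are illustrative only:

    def conv_params(c_in, c_out, k, g=1):
        # each filter of a grouped convolution only sees c_in / g input channels
        return (c_in // g) * k * k * c_out

    print(conv_params(256, 256, 3))       # ordinary 3x3 convolution: 589824
    print(conv_params(256, 256, 3, g=8))  # grouped 3x3 convolution:   73728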
Drawings
FIG. 1 is a schematic diagram of the feature map cluster-partitioning principle and the convolution operation principle of the present invention.
Fig. 2 is a schematic structural diagram of a convolution frame provided in the present invention.
Fig. 3 is a schematic structural diagram of the convolution framework applied to the ResNet network residual block according to the present invention.
FIG. 4 is a table of the classification performance of FSConv_VGG-16 with different hyper-parameters on CIFAR-10.
FIG. 5 is a comparison of FSConv_VGG-16 with state-of-the-art methods for compressing VGG-16.
FIG. 6 is a comparison of FSConv_VGG-16 on CIFAR-100 with state-of-the-art methods for compressing VGG-16.
FIG. 7 is a table of the classification performance of FSBneck_ResNet-20 with different hyper-parameters on CIFAR-10.
FIG. 8 is a table exploring the possibility of FSBneck_ResNet-20 replacing the baseline ResNet-56/110 on CIFAR-10.
FIG. 9 is a comparison of FSBneck_ResNet-20 on CIFAR-10 with state-of-the-art methods for compressing ResNet.
FIG. 10 is a comparison of FSBneck_ResNet-20 on CIFAR-100 with state-of-the-art methods for compressing ResNet.
Detailed Description
Example 1
This embodiment provides a flexible separable deep-learning convolution framework that improves network performance while reducing computation and network parameters, on the premise of preserving accuracy.
As shown in fig. 1, the flexible separable deep-learning convolution framework of this embodiment includes a feature map cluster-partitioning module, a first convolution module, a second convolution module, a feature map fusion module, an attention (SE) module, M input channels, and N output channels.
The feature map cluster-partitioning module divides the M input feature maps, according to the hyper-parameter supplementary-information ratio α ∈ (0, 1), into main-information feature maps and supplementary-information feature maps; in this embodiment the split is Mrep = (1-α)M main-information maps and Mred = αM supplementary-information maps (a worked example follows).
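A minimal numeric sketch of the split, assuming M = 64 and α = 0.25 (both values illustrative):

    M, alpha = 64, 0.25
    M_red = int(alpha * M)   # 16 supplementary-information maps -> grouped convolution
    M_rep = M - M_red        # 48 main-information maps -> ordinary convolution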
The first convolution module applies ordinary convolution and a BN operation to the main-information feature maps and outputs the main feature maps; this embodiment uses an ordinary k×k convolution on the main-information maps to preserve the integrity of their information.
The second convolution module applies grouped convolution and a BN operation to the supplementary-information feature maps and outputs the supplementary feature maps; this embodiment uses grouped convolution on the supplementary-information maps to supplement the overall features and reduce information loss.
The feature map fusion module concatenates the output main and supplementary feature maps along the depth direction, adds the result to the preprocessed input feature maps, and outputs through a ReLU activation. Preprocessing means adjusting the shape C×H×W of the input feature maps with "1×1 convolution + BN" so that it matches the shape of the depth-wise concatenation of the output main and supplementary feature maps.
The attention (SE) module extracts channel weights from the output of the feature map fusion module, multiplies the extracted weights with that output, and produces the N output feature maps.
The method for processing feature maps with the flexible separable deep-learning convolution framework comprises the following steps (a quick shape check follows the steps):
(1) acquire M input feature maps and divide them, according to the hyper-parameter supplementary-information ratio α, into main-information feature maps Mrep = (1-α)M and supplementary-information feature maps Mred = αM;
(2) apply ordinary convolution and a BN operation to the main-information feature maps to output the main feature maps, and apply grouped convolution and a BN operation to the supplementary-information feature maps to output the supplementary feature maps;
(3) concatenate the output main and supplementary feature maps along the depth direction, add the result to the input feature maps after a 1×1 convolution and BN operation, and output through a ReLU activation;
(4) extract channel weights from the activated output and multiply them with that output to generate the N output feature maps.
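A quick shape check of the FSConv sketch given earlier (names and sizes illustrative): 64 input maps in, 128 output maps out, with α = 0.25 routing 16 maps through grouped convolution.

    x = torch.randn(8, 64, 32, 32)       # batch of 8, M = 64 input maps
    block = FSConv(64, 128, alpha=0.25)
    print(block(x).shape)                # torch.Size([8, 128, 32, 32]), N = 128 output maps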
Example 2
As shown in fig. 2, this embodiment provides a VGG convolutional neural network comprising convolution layers, pooling layers, and fully-connected layers. Only the structure of the convolution layers is changed, replacing them with the convolution framework of Embodiment 1; the resulting network is called FSConv_VGG.
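A hedged sketch of the swap, using torchvision's vgg16_bn as a stand-in for the VGG variant and the FSConv class from the Embodiment 1 sketch; the α value and the decision to keep the 3-channel stem convolution ordinary are illustrative assumptions.

    import torch.nn as nn
    import torchvision.models as models

    vgg = models.vgg16_bn(num_classes=10)
    features = list(vgg.features)
    for i, layer in enumerate(features):
        # swap each 3x3 convolution, keeping the 3-channel stem ordinary
        if isinstance(layer, nn.Conv2d) and layer.in_channels > 3:
            features[i] = FSConv(layer.in_channels, layer.out_channels, alpha=0.25)
            # FSConv already ends in BN + ReLU + SE, so the following BatchNorm2d
            # and ReLU layers could be replaced by nn.Identity() to avoid doubling.
    vgg.features = nn.Sequential(*features)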
Example 3
As shown in fig. 3, this embodiment provides a ResNet-20 network whose residual block comprises a first convolution layer and a second convolution layer connected in sequence, both adopting the convolution framework of Embodiment 1 with the same structure. A hyper-parameter channel scaling factor β is introduced on the output channels of the first convolution layer and the input channels of the second; the value of β depends on the device memory and computing resources. This network is called FSBneck_ResNet-20.
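A sketch of such a residual block under stated assumptions: FSBneck is an illustrative name, FSConv is the Embodiment 1 sketch (each frame already carries its own shortcut branch), and whether the block's outer identity shortcut is retained is left as in fig. 3.

    import torch.nn as nn

    class FSBneck(nn.Module):
        def __init__(self, channels, alpha=0.25, beta=0.5):
            super().__init__()
            mid = max(4, int(channels * beta))       # beta scales the intermediate width
            self.conv1 = FSConv(channels, mid, alpha=alpha)
            self.conv2 = FSConv(mid, channels, alpha=alpha)

        def forward(self, x):
            # input and output widths match the original residual block
            return self.conv2(self.conv1(x))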
The original convolution layers of VGG-16 and the original residual modules of ResNet-20 were replaced as in Embodiment 2 and Embodiment 3, respectively, and the effectiveness of the convolution framework of Embodiment 1 was verified on the public datasets CIFAR-10 and CIFAR-100.
VGG-16 on CIFAR-10/100
The CIFAR-10/100 datasets consist of 50,000 training color images and 10,000 test color images of size 32 × 32 pixels, containing 10 and 100 classes respectively. VGG-16, with 13 convolution layers and 3 fully-connected layers, was originally designed for 1000-class ImageNet. For CIFAR-10/100, a variant widely used in the literature was selected: VGG-15 with 2 fully-connected layers, equipped with batch normalization after each layer. The proposed FSConv replaces all 3 × 3 convolution layers, with all other configurations unchanged. All models were trained on CIFAR-10/100 for 200 epochs, optimized with SGD (momentum 0.9, weight decay 5e-4, batch size 128, initial learning rate 0.1, divided by 10 every 50 epochs). Images were augmented by random horizontal flipping and random cropping after 4-pixel zero padding to prevent overfitting (a sketch of this recipe follows).
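A sketch of that training recipe in PyTorch; the dataset path, variable names, and the use of torchvision's vgg16_bn as a stand-in for the networks under test are illustrative assumptions.

    import torch
    import torchvision
    import torchvision.transforms as T

    transform = T.Compose([
        T.RandomHorizontalFlip(),
        T.RandomCrop(32, padding=4),   # zero-pad 4 px, then a random 32x32 crop
        T.ToTensor(),
    ])
    train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                             transform=transform)
    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
    model = torchvision.models.vgg16_bn(num_classes=10)   # stand-in network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)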
First, the effect of FSConv's one hyper-parameter, the supplementary-information ratio α, on model performance is analyzed; the model is then compared with advanced methods.
Exploring the effect of hyper-parameters on model performance on CIFAR-10
FSConv contains only one hyper-parameter: the supplementary-information ratio α. Setting α divides the same layer's input feature maps into maps containing supplementary information and maps containing main information. The more feature maps in a layer go through ordinary convolution, the more easily the model extracts information from each type of map and the higher its performance, but also the higher the computation cost. FIG. 4 shows FSConv_VGG-16's exploration of the effect of α on model performance on CIFAR-10. The ratio (1-α) is the share of the input feature maps M used for ordinary convolution (Mrep); the experiments show that as (1-α) decreases, computation and parameters fall sharply while the model suffers no serious accuracy loss. (1-α) can therefore be tuned to the specific hardware so that computation and parameters meet the usage requirements.
Comparison with advanced models
The comparison covers different types of model-compression methods: GhostNet and SPConv.
As shown in FIG. 5, on the CIFAR-10 dataset the FSConv_VGG-16 model of Embodiment 2 matches the accuracy of the baseline VGG-16_Baseline (93.8%) while compressing computation to 32.58% and parameters to 37.97% of the original. This shows considerable redundancy in the VGG model: each layer's input feature maps contain a limited number of main-information classes for generating the output maps, so extracting the supplementary information does not require expensive ordinary convolution; cheap grouped convolution suffices. Compared with advanced models, FSConv outperforms all competitors, with better performance at fewer floating-point operations and similar parameter counts.
Having verified FSConv's advanced performance on small-image classification on CIFAR-10, a greater challenge was posed by verifying its fine-grained recognition ability on the CIFAR-100 dataset (20 superclasses of 5 classes each, 100 classes in total).
As shown in FIG. 6, on the CIFAR-100 dataset one FSConv_VGG-16 configuration exceeds the baseline by 1.79% on Top-5 and 0.97% on Top-1 while compressing computation to 31.27% and parameters to 37.09% of the original; another exceeds VGG-16_Baseline by 1.77% on Top-5 and 0.82% on Top-1 while compressing computation to 19.50% and parameters to 25.24%. Meanwhile, advanced models with larger computation and similar or larger parameter counts than FSConv_VGG-16 achieve lower accuracy than our model.
ResNet on CIFAR-10/100
ResNet consists of three stages with 16, 32, and 64 filters respectively, far fewer feature maps than VGG-16's 64-128-256-512. ResNet-20 contains only 0.27M parameters, about 1.8% of VGG-16's. The ResNet-20 structure is therefore more compact: each layer has fewer feature maps carrying supplementary information, and most maps carry main information, so lightening the network effectively is more challenging for the FSBneck model. The FSBneck model replaces all residual modules of the ResNet-20 baseline, with all other configurations unchanged. All models were trained on CIFAR-10/100 for 200 epochs, optimized with SGD (momentum 0.9, weight decay 5e-4, batch size 128, initial learning rate 0.1, divided by 10 every 75 epochs). Images were augmented by random horizontal flipping and random cropping after 4-pixel zero padding to prevent overfitting.
First, the effects of FSBneck's two hyper-parameters, the supplementary-information ratio α and the channel scaling factor β, on model performance are analyzed; the feasibility of the shallow ResNet-20 replacing the deep ResNet-56/110 is then explored; finally the model is compared with advanced methods.
Exploring the influence of the hyper-parameters on the performance of the FSBneck model on CIFAR-10
ResNet-20 is already much smaller than VGG-16, with far fewer feature maps carrying the same kind of main information; most main-information classes may have only one feature map. To avoid losing main information, the channels of the residual module's first convolution output / second convolution input are expanded, increasing the number of feature maps per convolution layer that contain the same kind of main information; the network can then be compressed effectively by adjusting the layers' supplementary-information ratio α.
As shown in FIG. 7, adjusting the two hyper-parameters α and β simultaneously yields ResNet-20 configurations with similar parameter counts or similar floating-point computation. The table shows that (1) at similar parameter counts, network performance rises with computation; (2) at similar computation, network performance still rises with parameter count. In practice, for a specific mobile device, a hyper-parameter combination suited to its hardware performance can therefore be chosen flexibly.
Possibility of replacing ResNet-56/110 with FSBneck_ResNet-20
As shown in FIG. 8, observing the classification performance of the baseline ResNet-20/56/110 on CIFAR-10 shows that performance rises as the network deepens. However, the figure also shows that the deep networks' superior performance is bought with a sharp increase in computation and parameters, and the accuracy gain is not proportional to that cost: the slope of the accuracy-versus-cost relationship keeps decreasing, i.e., an ever larger cost buys an ever smaller performance gain. Deep networks are also harder to train. To address this, computation and parameters are increased moderately by widening the channels of the shallow ResNet-20; the cost stays well below that of the deep networks, yet ResNet-20's performance reaches or even exceeds that of the deep ResNet-56/110.
FSBneck is likewise compared with several representative advanced model-compression methods on ResNet: GhostNet and SPConv.
The parameter count of each advanced model is taken as the memory budget of the mobile device to be deployed. In the experiments, keeping the parameter count similar to the advanced model's, the two hyper-parameters α and β of FSBneck_ResNet-20 are adjusted to obtain models with different computation. The table shows only part of the compressed networks; readers can tune the two hyper-parameters to their actual requirements so that floating-point computation and parameters fit the target mobile device, obtaining an excellent compressed network.
As shown in FIG. 9, against CIFAR-10_ResNet-20 the FSBneck_ResNet-20 model achieves the best network performance, with slightly more parameters than the advanced networks but less computation. Against CIFAR-10_ResNet-56, FSBneck_ResNet-20 can compress the original ResNet-56's computation to as little as 17.85% and its parameters to 25.00%, with accuracy dropping only 0.84%; with markedly less computation than the most advanced model, its performance is 1.10% better. Compressing ResNet-56's computation to 33.18% and parameters to 37.34% reaches accuracy comparable to the baseline. Against CIFAR-10_ResNet-110, FSBneck_ResNet-20 compresses computation to as little as 14.58% and parameters to 24.81% with a 1.61% accuracy drop; at half the computation of the most advanced model, its performance is 0.64% better. Compressing ResNet-110's computation to 33.48% and parameters to 37.42% reaches accuracy comparable to the baseline.
As shown in FIG. 10, against CIFAR-100_ResNet-20 the FSBneck_ResNet-20 model compresses the original ResNet-20's computation to 35.11% and its parameters to 47.05% of the original with accuracy comparable to the baseline. Against CIFAR-100_ResNet-56, it can compress computation to as little as 17.85% and parameters to 25.51% while reaching Top-5 accuracy comparable to the baseline; compressing computation to 33.18% and parameters to 37.77%, its Top-5 accuracy exceeds the baseline model by 0.29%. Against CIFAR-100_ResNet-110, it can compress computation to as little as 14.13% and parameters to 26.94% and still exceed the baseline model's Top-5 accuracy; compressing computation to 25.12% and parameters to 37.28%, its Top-5 accuracy exceeds the baseline model by 0.8%. Meanwhile, advanced models with larger computation and similar or larger parameter counts than FSBneck_ResNet-20 achieve lower accuracy.
The above describes only a preferred embodiment of the present invention; the scope of the invention is not limited thereto, and any modification or replacement based on the technical solution and inventive concept provided herein falls within the scope of the invention.

Claims (7)

1. A flexible separable deep-learning convolution framework, characterized in that it comprises a feature map cluster-partitioning module, a first convolution module, a second convolution module, a feature map fusion module, an attention (SE) module, M input channels, and N output channels;
the feature map cluster-partitioning module divides the M input feature maps into feature maps characterizing the main information and feature maps characterizing supplementary information;
the first convolution module applies ordinary convolution and a BN operation to the main-information feature maps and outputs the main feature maps;
the second convolution module applies grouped convolution and a BN operation to the supplementary-information feature maps and outputs the supplementary feature maps;
the feature map fusion module concatenates the output main and supplementary feature maps along the depth direction, adds the result to the preprocessed input feature maps, and outputs through a ReLU activation;
and the attention (SE) module multiplies the channel weights extracted from the fusion module's output with that output to produce the N output feature maps.
2. The flexible separable deep-learning convolution framework of claim 1, wherein the feature map cluster-partitioning module divides the input feature maps, according to the hyper-parameter supplementary-information ratio α ∈ (0, 1), into main-information feature maps Mrep and supplementary-information feature maps Mred.
3. The flexible separable deep-learning convolution framework of claim 2, wherein the main-information feature maps Mrep = (1-α)M and the supplementary-information feature maps Mred = αM.
4. The flexible separable deep-learning convolution framework of claim 1, wherein the preprocessed input feature maps are the input feature maps after a 1×1 convolution and a BN operation.
5. Use of a flexible separable deep-learning convolution framework according to any one of claims 1 to 4 in a VGG convolutional neural network.
6. Use of a flexible separable deep-learning convolution framework according to any one of claims 1 to 4 in a ResNet network residual block, characterized in that two of the flexible separable deep-learning convolution frameworks are included, and a hyper-parameter channel scaling factor β is set on the output channels of the first convolution framework and the input channels of the second.
7. A feature extraction method of a flexible separable deep-learning convolution framework, characterized by comprising the following steps:
(1) acquiring M input feature maps and dividing them, according to the hyper-parameter supplementary-information ratio α, into main-information feature maps Mrep = (1-α)M and supplementary-information feature maps Mred = αM;
(2) applying ordinary convolution and a BN operation to the main-information feature maps to output the main feature maps, and applying grouped convolution and a BN operation to the supplementary-information feature maps to output the supplementary feature maps;
(3) concatenating the output main and supplementary feature maps along the depth direction, adding the result to the input feature maps after a 1×1 convolution and BN operation, and outputting through a ReLU activation;
(4) extracting channel weights from the activated output and multiplying them with that output to generate the N output feature maps.
CN202011199528.8A 2020-11-02 2020-11-02 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet Active CN112016639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011199528.8A CN112016639B (en) 2020-11-02 2020-11-02 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011199528.8A CN112016639B (en) 2020-11-02 2020-11-02 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet

Publications (2)

Publication Number Publication Date
CN112016639A (en) 2020-12-01
CN112016639B (en) 2021-01-26

Family

ID=73527739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011199528.8A Active CN112016639B (en) 2020-11-02 2020-11-02 Flexible separable convolution framework and feature extraction method and application thereof in VGG and ResNet

Country Status (1)

Country Link
CN (1) CN112016639B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949460B (en) * 2021-02-26 2024-02-13 陕西理工大学 Human behavior network model based on video and identification method
CN113850368A (en) * 2021-09-08 2021-12-28 深圳供电局有限公司 Lightweight convolutional neural network model suitable for edge-end equipment
CN117524252B (en) * 2023-11-13 2024-04-05 北方工业大学 Light-weight acoustic scene perception method based on drunken model


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704866B (en) * 2017-06-15 2021-03-23 清华大学 Multitask scene semantic understanding model based on novel neural network and application thereof
CN110796027B (en) * 2019-10-10 2023-10-17 天津大学 Sound scene recognition method based on neural network model of tight convolution
CN111311538B (en) * 2019-12-28 2023-06-06 北京工业大学 Multi-scale lightweight road pavement detection method based on convolutional neural network
CN111523546B (en) * 2020-04-16 2023-06-16 湖南大学 Image semantic segmentation method, system and computer storage medium
CN111753736A (en) * 2020-06-24 2020-10-09 北京软通智慧城市科技有限公司 Human body posture recognition method, device, equipment and medium based on packet convolution

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108470208A (en) * 2018-02-01 2018-08-31 华南理工大学 It is a kind of based on be originally generated confrontation network model grouping convolution method
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN110197217A (en) * 2019-05-24 2019-09-03 中国矿业大学 It is a kind of to be interlocked the image classification method of fused packet convolutional network based on depth
CN110490866A (en) * 2019-08-22 2019-11-22 四川大学 Metal based on depth characteristic fusion increases material forming dimension real-time predicting method
CN110680278A (en) * 2019-09-10 2020-01-14 广州视源电子科技股份有限公司 Electrocardiosignal recognition device based on convolutional neural network
CN110782001A (en) * 2019-09-11 2020-02-11 东南大学 Improved method for using shared convolution kernel based on group convolution neural network
CN110728200A (en) * 2019-09-23 2020-01-24 武汉大学 Real-time pedestrian detection method and system based on deep learning
CN110766721A (en) * 2019-09-30 2020-02-07 南京航空航天大学 Carrier landing cooperative target detection method based on airborne vision
CN111209921A (en) * 2020-01-07 2020-05-29 南京邮电大学 License plate detection model based on improved YOLOv3 network and construction method
CN111461144A (en) * 2020-03-31 2020-07-28 中国科学院计算技术研究所 Method for accelerating convolutional neural network

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Hang Zhang et al. ResNeSt: Split-Attention Networks. arXiv:2004.08955v1 [cs.CV], 2020, pp. 1-22. *
Jie Hu et al. Squeeze-and-Excitation Networks. arXiv:1709.01507v4 [cs.CV], 2019, pp. 1-13. *
Kai Han et al. GhostNet: More Features from Cheap Operations. arXiv:1911.11907v2 [cs.CV], 2020, pp. 1-10. *
Qiulin Zhang et al. Split to Be Slim: An Overlooked Redundancy in Vanilla Convolution. arXiv:2006.12085v1 [cs.CV], 2020, pp. 1-7. *
Ting Zhang et al. Interleaved Group Convolutions. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4373-4382. *
Yuan Runyi. Research on Parallel Algorithms for Inspection and Segmentation Based on Deep Learning. China Master's Theses Full-text Database, Information Science and Technology, 2019, No. 12, pp. I140-60. *
Zhou Yue et al. Design of Convolutional Neural Networks Based on Grouped Modules. Microelectronics & Computer, 2019, Vol. 36, No. 2, pp. 68-72. *
Wang Hanyang. Research and Application of Object Detection Algorithms Based on Fully Convolutional Networks. China Master's Theses Full-text Database, Information Science and Technology, 2019, No. 9, pp. I138-648. *

Also Published As

Publication number Publication date
CN112016639A (en) 2020-12-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant