CN110659653A - Feature extraction model and feature extraction method capable of fully retaining image features - Google Patents
- Publication number: CN110659653A
- Application number: CN201910865573.3A
- Authority: CN (China)
- Prior art keywords: feature extraction, convolution, pooling, extraction model, convolutional neural
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The backbone networks of current deep convolutional neural networks all derive from the original image classification networks; when they are applied to fields such as target detection, semantic segmentation, and target segmentation, the traditional approach of repeatedly discarding feature information leads to an insufficient amount of information for later analysis. To solve this problem, the present invention provides a feature extraction model capable of sufficiently preserving image features, used for performing a lossless feature extraction operation on an input image of arbitrary resolution, comprising: a plurality of convolution operation layers, each of which is composed of a channel separation convolution and a 1 × 1 convolution; and a plurality of pooling operation layers in which pooling has a step size of 1 and boundary padding is performed alternately with 0 and 1 pixels, wherein the number of pooling operation layers is an even number.
Description
Technical Field
The invention belongs to the field of digital image processing and deep learning, relates to the machine vision research direction, and particularly relates to a feature extraction model and a feature extraction method capable of fully retaining image features.
Background
Digital image analysis technology plays an important role in modern society, and machine vision has become an important research topic in many industries. The development of machine vision technology has gradually moved away from the hand-designed algorithms of traditional digital image processing in favor of deep learning, with the Convolutional Neural Network (CNN) as its representative, in order to achieve highly accurate analysis results. However, the backbone networks of existing CNN models all originate from the original image classification networks, such as VGG and ResNet; when these models are applied to fields such as target detection, semantic segmentation, and target segmentation, the conventional backbone network continuously discards feature information, which results in an insufficient amount of information during later analysis.
The convolutional neural network is a common deep learning network architecture and is inspired by a biological natural visual cognition mechanism. The CNN can derive an effective representation of the original image, which enables the CNN to capture the visual regularity directly from the original pixels with little pre-processing.
At present, almost all deep convolutional neural networks and their backbone networks adopt the structure of an image classification network and use the feature extraction scheme of image classification. On the one hand, the input image must be scaled and warped to a fixed resolution; on the other hand, 4 to 5 rounds of refining, extracting, and discarding of information are performed during feature extraction, which is unreasonable for network models in urgent need of accurate information. Therefore, in fields involving complex image analysis, such as detection and segmentation, the inability of the traditional deep convolutional neural network to sufficiently retain image data has become an increasingly pressing problem.
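As an illustration of the downsampling described above (an editorial sketch, not part of the patent): the five stride-2 poolings of a classification-style backbone shrink a 224 × 224 input to 7 × 7, discarding most spatial positions along the way.

```python
def downsampled_size(size, num_poolings, stride=2):
    """Spatial size after repeated stride-2 poolings (exact halving assumed)."""
    for _ in range(num_poolings):
        size //= stride
    return size

# Resolution after each of 0..5 poolings for a 224x224 input.
sizes = [downsampled_size(224, n) for n in range(6)]
print(sizes)  # [224, 112, 56, 28, 14, 7]
```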
Disclosure of Invention
In order to solve the above problems, the present invention provides a feature extraction model which can fully retain image features without fixing the resolution of an input image, and adopts the following technical scheme:
the invention provides a feature extraction model capable of fully retaining image features, which is used for carrying out a lossless feature extraction operation on an input image of any resolution, and is characterized by comprising: a plurality of convolution operation layers, each of which is composed of a channel separation convolution and a 1 × 1 convolution; and a plurality of pooling operation layers in which pooling has a step size of 1 and boundary padding is performed alternately with 0 and 1 pixels, wherein the number of pooling operation layers is an even number.
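As a sketch of what such a convolution operation layer computes (an illustrative NumPy implementation; the function name, loop structure, and random inputs are editorial assumptions, not the patent's code), a channel-separated (depthwise) convolution followed by a 1 × 1 pointwise convolution preserves the spatial resolution for any input size:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_weights):
    """Channel-separated (depthwise) convolution followed by a 1x1 (pointwise)
    convolution, with 'same' zero padding so the spatial resolution of the
    input is preserved for an input of any resolution.

    x:          (C, H, W) input feature map
    dw_kernels: (C, k, k) one spatial kernel per channel
    pw_weights: (C_out, C) 1x1 convolution weights
    """
    c, h, w = x.shape
    k = dw_kernels.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    # Depthwise step: each channel is convolved with its own kernel.
    dw = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                dw[ch, i, j] = np.sum(xp[ch, i:i+k, j:j+k] * dw_kernels[ch])
    # Pointwise 1x1 step: a linear mix across channels at every pixel.
    return np.tensordot(pw_weights, dw, axes=([1], [0]))

x = np.random.rand(4, 10, 13)  # arbitrary (non-square) resolution
out = depthwise_separable_conv(x, np.random.rand(4, 3, 3), np.random.rand(8, 4))
print(out.shape)  # (8, 10, 13): channels mixed, resolution unchanged
```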
The feature extraction model capable of sufficiently retaining image features provided by the invention can also have the following technical feature: after a traditional pooling operation, the resolution of the feature map is reduced by a factor of 2 × 2; if the convolution kernel size of the subsequent convolution layer is k × k, then relative to the feature map before pooling the convolution operation has a receptive field of f × f, where f = (k - 1) × 2 + 1. Accordingly, the convolution operation layer following a pooling operation layer needs to perform its convolution with a kernel of size f × f.
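The kernel-enlargement rule f = (k − 1) × 2 + 1 can be captured in a one-line helper (an editorial sketch; the function name is assumed):

```python
def enlarged_kernel(k):
    """Kernel size needed after removing a stride-2 pooling so that the
    receptive field matches the original pooling + kxk convolution:
    f = (k - 1) * 2 + 1."""
    return (k - 1) * 2 + 1

# A 3x3 convolution after a removed stride-2 pooling becomes 5x5.
print(enlarged_kernel(3))  # 5
```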
The invention also provides a feature extraction method capable of fully retaining image features, which is characterized by comprising the following steps: step D1, using the feature extraction model as the backbone network for feature extraction in a convolutional neural network, wherein the convolutional neural network is used for executing point-to-point prediction output; step D2, the feature extraction model extracts the features of the input image to obtain feature data; step D3, the convolutional neural network directly acquires the feature data and performs subsequent operations, wherein the feature extraction model is the feature extraction model of claim 1 or 2 capable of sufficiently preserving image features.
Action and Effect of the invention
According to the feature extraction model capable of sufficiently retaining image features, since each convolution layer is formed by a channel separation convolution and a 1 × 1 convolution, the model can be applied to input images of any resolution, and when performing convolution operations on the input image, the consumption of GPU memory by the convolution operation can be reduced while the data resolution for subsequent operations is kept unchanged. Moreover, because the pooling operation layers use a step size of 1, boundary filling is performed alternately with 0 and 1 pixels, and the pooling operation is performed an even number of times in the whole feature extraction process, the feature map resolution remains unchanged throughout feature extraction. The feature extraction model continues the reasonable structural design of the traditional CNN backbone network while correcting its defect of being suitable only for image classification, so that the feature extraction operation retains sufficient information and the model is better suited to algorithms in fields such as target detection and image segmentation. The feature extraction model can serve as the backbone network of CNN-family models in the machine vision field to form various specialized algorithms, and the low-loss characteristic of its feature information helps meet the demand of various algorithms for higher-precision calculation results.
Drawings
FIG. 1 is a schematic diagram of a network design execution flow in an embodiment of the present invention;
FIG. 2 is a diagram illustrating a comparison between a feature extraction scheme according to an embodiment of the present invention and a conventional CNN backbone extraction method;
FIG. 3 is a diagram illustrating the pooling operation of padding values alternating between 0 and 1 in an embodiment of the present invention;
FIG. 4 is a flow chart of a feature extraction method in an embodiment of the invention;
FIG. 5 is a schematic diagram comparing the backbone network structure of VGG16 and the corresponding backbone network structure of the feature extraction model in the embodiment of the present invention; and
fig. 6 is a theoretical diagram of calculation of the amount of information in the embodiment of the present invention.
Detailed Description
For various conventional convolutional neural networks, the backbone networks used to extract features from input images repeatedly refine, extract, and discard information on the basis of convolution layers and pooling layers. To solve the problem that such convolutional neural networks cannot fully retain image data, as shown in fig. 1, the inventor realizes model optimization through the following steps:
step S1, directly executing the convolution operation on the image, replacing the traditional convolution operation with a combination of channel separation convolution and 1 × 1 convolution;
step S2, using a pooling operation with step size 1 to replace the traditional pooling operation with step size 2, wherein the pooling operation must be performed an even number of times and boundary padding is applied alternately with 0 and 1 pixels, so as to ensure that the input and output resolutions are unchanged;
step S3, after each pooling operation, performing the convolution operation with an enlarged convolution kernel, so that the receptive field of the convolution stays consistent with that of the traditional convolution-pooling operation;
step S4, after the feature operation, dynamically executing feature concentration and extraction operations according to the actual algorithm requirements.
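The alternating-padding scheme of step S2 can be checked numerically. The sketch below (an illustrative NumPy implementation, not code from the patent) shows that 2 × 2 max pooling with stride 1 maps a dimension of size n to n + 2·pad − 1, so pairing pad = 1 with pad = 0 restores the original resolution after every two poolings:

```python
import numpy as np

def max_pool_stride1(x, pad):
    """2x2 max pooling with stride 1 and symmetric zero padding of `pad`
    pixels; each spatial dimension goes from n to n + 2*pad - 1."""
    xp = np.pad(x, pad)
    h, w = xp.shape
    out = np.empty((h - 1, w - 1))
    for i in range(h - 1):
        for j in range(w - 1):
            out[i, j] = xp[i:i+2, j:j+2].max()
    return out

x = np.random.rand(7, 9)
y = max_pool_stride1(x, pad=1)  # 7x9 -> 8x10
z = max_pool_stride1(y, pad=0)  # 8x10 -> 7x9: back to the input resolution
print(x.shape, y.shape, z.shape)
```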
The comparison between the optimized feature extraction model obtained through the above process and the traditional CNN backbone network is shown in fig. 2. The optimized feature extraction model is applicable to input images of any resolution and performs convolution operations directly on the image data, so more image feature data is retained. Meanwhile, for convolutional neural networks with scale-parameter requirements, the features can be further extracted and concentrated after the optimized feature extraction model completes feature extraction (i.e., "feature simplification" in fig. 2), realizing an optimized feature extraction model that retains feature information to the greatest extent according to need.
Therefore, based on the above idea, the inventor proposes a feature extraction model capable of sufficiently preserving image features and an application thereof in a neural network model, and the feature extraction model and the application thereof are described in detail below with reference to the drawings and the embodiments.
< example >
In the platform for realizing this embodiment, the operating system is Ubuntu 16.04, the deep learning framework is PyTorch 1.0.1, the graphics processing library is OpenCV 3.2.0, the CUDA version is 9.0, and the image acceleration computing unit is an NVIDIA 1080 Ti GPU.
The feature extraction model of the present embodiment, which can sufficiently retain image features, includes a plurality of convolution operation layers and a plurality of pooling operation layers.
Wherein each convolution operation layer is composed of a channel separation convolution and a1 × 1 convolution.
Each pooling operation layer is a pooling layer with a step size of 1.
In this embodiment, the number of pooling operation layers is an even number, and boundary padding (padding) of each pooling operation layer is alternately performed with 0 and 1 pixels.
In this embodiment, when the pooling operation layer alternately performs pooling operations of different padding values, the resolution of the feature map is as shown in fig. 3, and the resolution of the feature map remains unchanged before and after the even-numbered pooling operations are performed.
The feature extraction model can realize lossless feature extraction on an input image; in practical use, the feature extraction model serves as the backbone network for feature extraction in a convolutional neural network. Compared with the traditional convolutional neural network model, the new convolutional neural network model using the feature extraction model has the same structures, such as the number of layers, stages, and branch connections, while the receptive fields of the convolution operation layers connected after each pooling operation layer in the new convolutional neural network are kept consistent with the corresponding receptive fields in the original convolutional neural network.
Specifically, if the convolution kernel size of a convolution layer in the original convolutional neural network is k × k, and its receptive field relative to the feature map before pooling is f × f, then f is:
f=(k-1)×2+1 (1)
at this time, the convolution operation layer that is concatenated after the pooling operation layer in the new convolution neural network needs to perform a convolution operation with a convolution kernel of size f × f.
Fig. 4 is a flowchart of a feature extraction method in an embodiment of the present invention.
As shown in fig. 4, the feature extraction method based on the feature extraction model includes the following steps:
step D1, using the feature extraction model as the backbone network for feature extraction in the convolutional neural network to form a lossless convolutional neural network, wherein the lossless convolutional neural network is used for executing point-to-point prediction output;
d2, the feature extraction model extracts the features of the input image to obtain feature data;
and D3, directly acquiring the characteristic data by the convolutional neural network and performing subsequent operation.
In steps D2 and D3 of this embodiment, since the original convolutional neural network performs point-to-point prediction output, and the feature map does not need to change the resolution, the feature data obtained by the feature extraction model performing the feature operation can be directly input to the subsequent part of the convolutional neural network for performing the operation.
In this embodiment, the subsequent parts and other parts of the lossless convolutional neural network except the backbone network used for feature extraction are the same as those of the conventional convolutional neural network, and the number of layers, stages, branch connections and other structures of the backbone network are also the same as those of the conventional convolutional neural network, which is not described herein again.
Next, taking VGG16 as the reference model, the feature extraction model of this embodiment is used as the backbone network for feature extraction of VGG16, and a backbone network structure whose feature map resolution is 1/4 of the original image is built. The backbone network structure of the original VGG16 and that of the corresponding feature extraction model are shown in fig. 5. The specific implementation process of the new VGG16 using the feature extraction model is as follows:
step T1, by analogy with the original VGG16, replacing the original convolution layers with channel separation convolution plus 1 × 1 convolution (convolution operation layers), and keeping the number of feature map channels of each stage consistent with VGG16;
step T2, building 4 layers of pooling operation layers correspondingly by analogy with the original VGG16, wherein the step size of all pooling operation layers is 1: wherein, the first and third pooling operation layers use padding of 1; the pooling operation layers of the second and the fourth layers use padding equal to 0;
step T3, since combining the stride-2 pooling layer in the original VGG16 with a 3 × 3 convolution layer yields a receptive field of 5 × 5, after the replacement the convolution kernel size of the channel separation convolution connected after every pooling operation layer is set to 5 × 5;
step T4, since the feature map that VGG16 needs to use in the model after feature extraction has 1/4 of the length and height of the original image, two (3 × 3 convolution layer + stride = 2 pooling layer) combinations are also connected after the backbone network structure of the corresponding feature extraction model.
After the input image passes through the backbone network structure of the corresponding feature extraction model and the two (3 × 3 convolution layer + stride = 2 pooling layer) combinations for feature extraction, the obtained feature data is the feature map the original VGG16 requires for subsequent analysis.
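The resolution behavior of steps T1 through T4 can be traced with a small sketch (the stage grouping and function name are assumed for illustration; this is not the patent's Table 1): the lossless backbone keeps the resolution unchanged through all five stages, and only the two trailing conv + stride-2 pooling combinations reduce it to 1/4.

```python
def trace(h, w):
    """Trace feature-map resolution through the modified VGG16-style backbone:
    five stages of separable convs + stride-1 poolings (resolution kept),
    then two (3x3 conv + stride-2 pool) combinations (resolution / 4)."""
    shapes = []
    for _ in range(5):            # lossless stages: resolution unchanged
        shapes.append((h, w))
    for _ in range(2):            # trailing stride-2 reductions
        h, w = h // 2, w // 2
        shapes.append((h, w))
    return shapes

print(trace(224, 224)[-1])  # (56, 56), i.e. 224/4 as the original VGG16 expects
```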
In this embodiment, the detailed layer design of the backbone network structure of the new VGG16 is shown in the following table 1:
table 1 detailed layer design
Compared with the backbone network structure of the original VGG16, the new VGG16 using the feature extraction model of this embodiment retains more effective feature information.
For a stride of 2 pooling operation, it is equivalent to discarding 3 of the 4 neighboring feature pixels. This operation is equivalent to discarding a portion of the valid signature information. The range of influence of 4 neighboring pixels for a convolution kernel of 3 x 3 is 4 x 4.
Regard the image processing of the invention as a random process and assume that the amount of information carried by each pixel is equal, namely i0. Each pixel undergoes 9 operations of a 3 × 3 convolution kernel, so each convolution operation on a pixel contributes an effective information amount of i0/9. The distribution of information content in the 4 × 4 range affected by the pooling operation is shown in fig. 6.
After the maximum pooling operation with stride 2, ideally all of the information content of the gray portion in fig. 6 is retained, and the ratio of the retained information content is 25/36.
The original VGG16 backbone network performs the maximum pooling operation 5 times, so the retained information amount is equivalent to (25/36)^5 ≈ 0.1615 of the original.
In this example, the retained amount of effective information is therefore (36/25)^5 ≈ 6.1917 times that of the original VGG16. Thus, a VGG16 model using the feature extraction model of this embodiment retains more feature information, so that its calculation results have higher accuracy.
Action and Effect of the Embodiment
According to the feature extraction model capable of sufficiently retaining image features provided in this embodiment, since each convolution layer is formed by a channel separation convolution and a 1 × 1 convolution, the model can be applied to input images of any resolution, and when performing convolution operations on the input image, the consumption of GPU memory by the convolution operation can be reduced while the data resolution for subsequent operations is kept unchanged. Moreover, because the pooling operation layers use a step size of 1, boundary filling is performed alternately with 0 and 1 pixels, and the pooling operation is performed an even number of times in the whole feature extraction process, the feature map resolution remains unchanged throughout feature extraction. The feature extraction model continues the reasonable structural design of the traditional CNN backbone network while correcting its defect of being suitable only for image classification, so that the feature extraction operation retains sufficient information and the model is better suited to algorithms in fields such as target detection and image segmentation. The feature extraction model of this embodiment can serve as the backbone network of CNN-family models in the machine vision field to form various specialized algorithms, and the low-loss characteristic of its feature information helps meet the demand of various algorithms for higher-precision calculation results.
In addition, the feature extraction model of the embodiment can be applied to the fields of target detection, semantic segmentation, target segmentation and the like, is used as a backbone network to be fused into a specific algorithm model, and is trained and used in a general training mode of a conventional deep learning algorithm. Therefore, the original algorithm model can completely reserve the information in the image when performing feature extraction, and the lossless feature extraction operation of the images with different resolutions is realized.
The above-described embodiments are merely illustrative of specific embodiments of the present invention, and the present invention is not limited to the description of the above-described embodiments.
In an embodiment, a convolutional neural network is used to perform point-to-point prediction output. In other embodiments, if features in the original convolutional neural network need to be further extracted and concentrated according to the scale parameters, the feature extraction model can also reduce the resolution of the feature map by combining pooling and a small amount of convolution operations after performing feature operation to obtain feature data, and the algorithm structure for further concentrating the features can be specifically designed according to specific operation needs.
Claims (3)
1. A feature extraction model capable of sufficiently preserving image features for performing lossless feature extraction operation on an input image of an arbitrary resolution, comprising:
a plurality of convolution operation layers, each of which is composed of a channel separation convolution and a 1 × 1 convolution;
a plurality of pooling operation layers in which pooling has a step size of 1, boundary padding is alternately performed with 0 and 1 pixels,
wherein the number of the pooling operation layers is an even number.
2. The feature extraction model capable of sufficiently preserving image features as claimed in claim 1, wherein:
after a traditional pooling operation, the resolution of the feature map is reduced by a factor of 2 × 2; if the convolution kernel size of the subsequent traditional convolution layer is k × k, then relative to the feature map before pooling the convolution operation has a receptive field of f × f, where f is:
f=(k-1)×2+1 (1)
at this time, the convolution operation layer correspondingly connected after the pooling operation layer performs a convolution operation with a convolution kernel of size f × f.
3. A feature extraction method capable of sufficiently retaining image features is characterized by comprising the following steps of:
step D1, using the feature extraction model as a backbone network for feature extraction in a convolutional neural network to form a lossless convolutional neural network, wherein the lossless convolutional neural network is used for executing point-to-point prediction output;
step D2, the feature extraction model extracts the features of the input image to obtain feature data;
step D3, the lossless convolution neural network directly acquires the characteristic data and carries out subsequent operation,
wherein, the feature extraction model is the feature extraction model of claim 1 or 2 which can sufficiently retain the image features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910865573.3A CN110659653A (en) | 2019-09-12 | 2019-09-12 | Feature extraction model and feature extraction method capable of fully retaining image features |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910865573.3A CN110659653A (en) | 2019-09-12 | 2019-09-12 | Feature extraction model and feature extraction method capable of fully retaining image features |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110659653A true CN110659653A (en) | 2020-01-07 |
Family
ID=69037137
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910865573.3A Pending CN110659653A (en) | 2019-09-12 | 2019-09-12 | Feature extraction model and feature extraction method capable of fully retaining image features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110659653A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112131996A (en) * | 2020-09-17 | 2020-12-25 | 东南大学 | Road side image multi-scale pedestrian rapid detection method based on channel separation convolution |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108062756A (en) * | 2018-01-29 | 2018-05-22 | 重庆理工大学 | Image, semantic dividing method based on the full convolutional network of depth and condition random field |
CN108960288A (en) * | 2018-06-07 | 2018-12-07 | 山东师范大学 | Threedimensional model classification method and system based on convolutional neural networks |
US20190130573A1 (en) * | 2017-10-30 | 2019-05-02 | Rakuten, Inc. | Skip architecture neural network machine and method for improved semantic segmentation |
CN109886273A (en) * | 2019-02-26 | 2019-06-14 | 四川大学华西医院 | A kind of CMR classification of image segmentation system |
- 2019-09-12: application CN201910865573.3A filed in CN; patent CN110659653A, status active, Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190130573A1 (en) * | 2017-10-30 | 2019-05-02 | Rakuten, Inc. | Skip architecture neural network machine and method for improved semantic segmentation |
CN108062756A (en) * | 2018-01-29 | 2018-05-22 | 重庆理工大学 | Image, semantic dividing method based on the full convolutional network of depth and condition random field |
CN108960288A (en) * | 2018-06-07 | 2018-12-07 | 山东师范大学 | Threedimensional model classification method and system based on convolutional neural networks |
CN109886273A (en) * | 2019-02-26 | 2019-06-14 | 四川大学华西医院 | A kind of CMR classification of image segmentation system |
Non-Patent Citations (1)
Title |
---|
MICROSTRONG0305: "A Detailed Introduction to the Receptive Field in Convolutional Neural Networks" (卷积神经网络中感受野的详细介绍), Baidu PaddlePaddle AI Studio Community * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112131996A (en) * | 2020-09-17 | 2020-12-25 | 东南大学 | Road side image multi-scale pedestrian rapid detection method based on channel separation convolution |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200107 |