CN113837199A - Image feature extraction method based on cross-layer residual two-way pyramid network - Google Patents

Image feature extraction method based on cross-layer residual two-way pyramid network

Info

Publication number
CN113837199A
Authority
CN
China
Prior art keywords
network
feature map
feature
residual
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111002973.5A
Other languages
Chinese (zh)
Other versions
CN113837199B (en)
Inventor
胡杰
谢礼浩
安永鹏
熊宗权
徐文才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology (WUT)
Priority to CN202111002973.5A
Publication of CN113837199A
Application granted
Publication of CN113837199B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image feature extraction method based on a cross-layer residual two-way pyramid network, which comprises the steps of: inputting an original RGB color image into the residual network ResNet50 for preliminary feature extraction and building a bottom-up feature pyramid network DTFPN; implementing a cross-layer residual network on the basis of ResNet50; and obtaining the output feature maps P1″, P2″, P3″, P4″ and P5″ after processing by the feature pyramid network FPN. The invention further alleviates the network degradation problem of ResNet50 and blends features of different levels to extract deeper features, markedly enhancing the feature extraction capability of ResNet50. It overcomes the defect that high-level features in a feature pyramid network (FPN) lack low-level detail texture information, and achieves efficient fusion of feature map information across layers.

Description

Image feature extraction method based on cross-layer residual two-way pyramid network
Technical Field
The invention relates to the fields of computer vision, artificial intelligence, pattern recognition and the like, in particular to an image feature extraction method based on a cross-layer residual two-way pyramid network.
Background
With the development of artificial intelligence, convolutional neural networks have become the main method for extracting image features; well-known feature extraction networks include LeNet5, AlexNet, the Visual Geometry Group network (VGG), GoogLeNet, the residual network (ResNet), and others.
LeNet5, born in 1994, is one of the earliest convolutional neural networks and helped drive the development of deep learning. It consists of two convolutional layers, two pooling layers and two fully connected layers; the convolutions use 5x5 kernels with a stride of 1, and downsampling uses max pooling.
AlexNet won the ImageNet competition in 2012. It is a deeper and wider version of LeNet, containing 630 million connections, 60 million parameters and 650,000 neurons, with 5 convolutional layers, 3 of which are followed by max pooling layers, and finally 3 fully connected layers. AlexNet won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) by a significant margin, reducing the top-5 error rate from the previous 25.8% to 16.4%. Its main technical points are: (1) using the rectified linear unit (ReLU) as the activation function of the convolutional neural network (CNN), which solves the gradient vanishing problem of the sigmoid function in deep networks; (2) using Dropout during training to randomly ignore a portion of the neurons and avoid overfitting; (3) using overlapping max pooling in the CNN, where the stride is smaller than the pooling kernel so that outputs overlap, improving the richness of the features (earlier CNNs generally used average pooling, while AlexNet uses max pooling throughout to avoid the blurring effect of average pooling); (4) using data augmentation to reduce overfitting and improve the generalization ability of the model.
The Visual Geometry Group network (VGG) was the first to use smaller 3x3 convolution kernels in each convolutional layer and to combine them into convolution sequences; it is characterized by many consecutive convolutions and a huge amount of computation. A key advance of VGG is that stacking multiple 3x3 convolutions in sequence can emulate the effect of a larger receptive field. VGG models showed that depth benefits classification accuracy; another important idea is that convolution can replace full connection. The overall parameter count reaches about 140 million, concentrated mainly in the first fully connected layer; after replacing it with convolution, the parameter count decreases with no loss of accuracy.
GoogLeNet, the first Inception architecture, appeared in the ILSVRC 2014 competition and took first place by a large margin. The Inception network in that competition is commonly called Inception V1; its biggest characteristic is that it achieves very good classification performance while controlling the amount of computation and the number of parameters: a top-5 error rate of 6.67%, less than half that of AlexNet. Inception V1 is 22 layers deep, deeper than AlexNet's 8 layers or VGG's 19 layers, yet it requires only about 1.5 billion floating-point operations and has only about 5 million parameters, roughly 1/12 of AlexNet's 60 million, while achieving far better accuracy, which makes the model both excellent and very practical. Versions V2, V3 and V4 were subsequently introduced on the basis of Inception V1.
The residual network ResNet, proposed in 2015, won first place in the ImageNet classification task; because it is both simple and practical, many later methods build on ResNet50 or ResNet101. ResNet introduces the residual structure and uses Batch Normalization, effectively alleviating the gradient vanishing or explosion and network degradation problems of deep networks; as a result, ultra-deep ResNet feature extraction networks perform far better than earlier networks and achieve excellent results in image detection, image classification, image segmentation and other fields.
The feature pyramid network (FPN) constructs a feature pyramid that can be trained end to end: the high-level features extracted by the feature extraction network are upsampled and fused with the low-level features, enriching the semantic information of the low-level features. For small targets, FPN increases the resolution of the feature map, i.e., it operates on a larger feature map to obtain more information about small targets.
The feature maps output by residual network modules can themselves be connected by residual connections, forming a cross-layer residual network module that spans multiple residual layers of the original module: if the input of a residual network module is x and the desired output is H(x), then by passing the input x directly to the output as an initial result, the target that the module must learn becomes F(x) = H(x) - x. This changes the learning target of the residual network module, and learning F(x) is much easier than learning H(x). Re-optimizing the ResNet structure in this way into a cross-layer residual network can further alleviate the network degradation problem of ResNet and allows features of different levels to be mixed to extract deeper features. In FPN, high-level features are fused into low-level features; although this greatly enriches the information of the low-level features, the high-level features themselves are not improved, and they also need to be supplemented with low-level texture information, so feature fusion is insufficient and network performance is limited.
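The residual learning idea can be written down compactly. The following is a minimal illustrative sketch in PyTorch, not the exact structure claimed below: the wrapper name CrossLayerResidual, the inner stack and the 1x1 projection on the skip path are assumptions borrowed from standard ResNet practice for matching shapes.

    import torch
    import torch.nn as nn

    class CrossLayerResidual(nn.Module):
        """Long-skip residual wrapper: y = F(x) + x, so the wrapped stack
        'inner' only has to learn the residual F(x) = H(x) - x."""
        def __init__(self, inner: nn.Module, in_ch: int, out_ch: int, stride: int = 1):
            super().__init__()
            self.inner = inner  # e.g. several consecutive residual layers
            # 1x1 projection so the skip matches the inner path's output shape
            self.proj = (nn.Identity() if in_ch == out_ch and stride == 1
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.inner(x) + self.proj(x)  # H(x) = F(x) + x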
Summary of the invention:
In order to overcome the defects of the background art, the invention provides an image feature extraction method based on a cross-layer residual two-way pyramid network, which achieves two goals while keeping a feature extraction speed comparable to that of the original network: (1) it further alleviates the network degradation problem of the residual network ResNet50 and blends features of different levels to extract deeper features, markedly enhancing the feature extraction capability of ResNet50; (2) it overcomes the defect that high-level features in the feature pyramid network (FPN) lack low-level detail texture information, achieving efficient fusion of feature map information across layers.
In order to solve the above technical problems, the invention adopts the following technical scheme:
An image feature extraction method based on a cross-layer residual two-way pyramid network comprises the following steps:
Step S1, inputting an original RGB color image into the residual network ResNet50 for preliminary feature extraction: the conv1 convolutional network module 1 of ResNet50 outputs a feature map P0; the conv2_x residual network module 2 outputs feature maps P1 and P1′ (P1 = P1′); the conv3_x residual network module 3 outputs a feature map P2; the conv4_x residual network module 4 outputs a feature map P3; the conv5_x residual network module 5 outputs a feature map P4;
Step S2, downsampling the feature map P1′ and fusing it with the feature map P2 to obtain a feature map P2′; downsampling P2′ and fusing it with P3 to obtain P3′; downsampling P3′ and fusing it with P4 to obtain P4′; this yields the bottom-up feature pyramid network DTFPN;
Step S3, forming a cross-layer residual network module from the feature maps P1′, P2, P2′ and their intermediate network (spanning multiple residual layers of a residual network module of ResNet50); forming a cross-layer residual network module from P2′, P3, P3′ and their intermediate network; forming a cross-layer residual network module from P3′, P4, P4′ and their intermediate network; thereby implementing a cross-layer residual network based on ResNet50;
Step S4, inputting the feature maps P1′, P2′, P3′ and P4′ into the feature pyramid network FPN and establishing with it the cross-layer residual two-way pyramid network; after FPN processing, the output feature maps P1″, P2″, P3″, P4″ and P5″ are obtained.
Preferably, in step S1, the width and height of the feature map Pi+1 (i = 0,1,2,3) are 1/2 of those of the feature map Pi, and the number of channels of Pi+1 is twice the number of channels of Pi.
Preferably, step S2 comprises:
S2.1, applying to the feature map P1′ a downsampling operation with a 1x1 convolution kernel and a stride of 2, which halves the width and height of the feature map and doubles the number of channels; inputting the downsampled feature map into a rectified linear unit to adjust the distribution of the feature map data, and adding the adjusted feature map to the feature map P2 to obtain a feature map P2′;
S2.2, applying to the feature map P2′ a downsampling operation with a 1x1 convolution kernel and a stride of 2, which halves the width and height of the feature map and doubles the number of channels; inputting the downsampled feature map into a rectified linear unit to adjust the distribution of the feature map data, and adding the adjusted feature map to the feature map P3 to obtain a feature map P3′;
S2.3, applying to the feature map P3′ a downsampling operation with a 1x1 convolution kernel and a stride of 2, which halves the width and height of the feature map and doubles the number of channels; inputting the downsampled feature map into a rectified linear unit to adjust the distribution of the feature map data, and adding the adjusted feature map to the feature map P4 to obtain a feature map P4′.
Preferably, in step S3, the feature maps P1, P2′ and P3′ are passed through the conv3_x residual network module 3, the conv4_x residual network module 4 and the conv5_x residual network module 5 of the residual network ResNet50 of step S1, respectively, to obtain the feature maps P2, P3 and P4.
Preferably, the cross-layer residual network module in step S3 is a module spanning multiple residual layers of a residual network module of the residual network ResNet50.
The invention has the beneficial effects that:
(1) The method builds a bottom-up feature pyramid network (DTFPN) on the feature maps output by the residual network modules of ResNet50, supplementing high-level feature map information with low-level texture detail; this effectively overcomes the defect that high-level features in the feature pyramid network (FPN) lack low-level detail texture information and achieves efficient fusion of feature map information across layers.
(2) On the basis of the bottom-up feature pyramid network (DTFPN) in (1), a cross-layer residual network (Cross-layer ResNet50) is built on ResNet50, further alleviating the network degradation problem of ResNet50, mixing features of different levels to extract deeper features, and markedly enhancing the feature extraction capability of ResNet50.
Compared with a Faster R-CNN (Faster Region-based Convolutional Neural Network) built on the ResNet50-FPN feature extraction network, a Faster R-CNN built on the proposed cross-layer residual two-way pyramid network (Cross-layer Residual Bi-FPN) improves the average precision AP(0.5:0.95) of target detection on the KITTI dataset by 3.8%, while the inference speed remains almost unchanged.
Drawings
FIG. 1 is a diagram of an overall network framework for implementing the solution of the present invention;
FIG. 2 is a diagram illustrating a detailed structure of a bottom-up Feature Pyramid Network (DTFPN) according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a Cross-layer residual structure of a Cross-layer residual network (Cross-layer ResNet50) according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and examples.
The invention provides an image feature extraction method based on a cross-layer residual two-way pyramid network (Cross-layer Residual Bi-FPN), comprising a cross-layer residual network (Cross-layer ResNet50) designed on the basis of the residual network ResNet50 and a feature pyramid network (FPN), where the cross-layer residual network contains a brand-new bottom-up feature pyramid network (DTFPN). The invention is realized by the following steps:
S1, the feature maps output by the convolutional network module 1 (conv1), residual network module 2 (conv2_x), residual network module 3 (conv3_x), residual network module 4 (conv4_x) and residual network module 5 (conv5_x) when the original image passes through the ResNet50 backbone are defined as P0, P1, P1′ (P1 = P1′), P2, P3 and P4, respectively.
S2, the inputs of this step are the feature maps P1′, P2, P3 and P4 output in step S1. P1′ is downsampled and fused with P2 to obtain P2′; P2′ is downsampled and fused with P3 to obtain P3′; P3′ is downsampled and fused with P4 to obtain P4′, thereby constructing the bottom-up feature pyramid network (DTFPN). The outputs of this step are the feature maps P1′, P2′, P3′ and P4′.
S3, the inputs of this step are the feature map P1 output in step S1 and the feature maps P2′ and P3′ output in step S2. P1, P2′ and P3′ are passed through the residual network modules conv3_x, conv4_x and conv5_x of ResNet50, respectively, to obtain the feature maps P2, P3 and P4, so that P1′, P2, P2′ and their intermediate network form a cross-layer residual network module (spanning multiple residual layers of a residual network module of ResNet50), P2′, P3, P3′ and their intermediate network form a cross-layer residual network module, and P3′, P4, P4′ and their intermediate network form a cross-layer residual network module, thereby realizing the cross-layer residual network (Cross-layer ResNet50).
S4, the inputs of this step are the feature maps P1′, P2′, P3′ and P4′ output in step S2. They are input into the feature pyramid network (FPN), establishing the cross-layer residual two-way pyramid network (Cross-layer Residual Bi-FPN); after FPN processing, the feature maps P1″, P2″, P3″, P4″ and P5″ are output.
By designing a new bottom-up feature pyramid and a cross-layer residual structure, the invention forms a new feature extraction network that, at a feature extraction speed comparable to that of the original network, further alleviates the degradation problem of ResNet50, mixes features of different levels to extract deeper features, remedies the lack of low-level detail information in the high-level features of the feature pyramid network (FPN), and achieves efficient fusion of feature map information across layers. The method performs excellently in tasks such as image target detection and semantic segmentation.
FIG. 1 is the overall network framework diagram of the technical solution of this embodiment. The network comprises a cross-layer residual network (Cross-layer ResNet50) designed on the basis of the residual network ResNet50 and a feature pyramid network (FPN), where the cross-layer residual network contains a brand-new bottom-up feature pyramid network (DTFPN); together they constitute the cross-layer residual two-way pyramid network (Cross-layer Residual Bi-FPN). The detailed steps for building the whole feature extraction network are as follows:
s1, as shown in the attached table, the residual network ResNet50 feature extraction part is composed of a convolutional network module 1(conv1), a residual network module 2(conv2_ x), a residual network module 3(conv3_ x), a residual network module 4(conv4_ x), and a residual network module 5(conv5_ x), where "conv 2_ x" and each residual network module thereafter are composed of a plurality of residual layer structures. The original RGB color image is input into a residual network ResNet50 for preliminary feature extraction, and defined as "conv 1", "conv 2_ x", "conv 3_ x", "conv 4_ x" and "conv 5_ x" respectively output feature maps P0, P1, P1 '(P1 ═ P1'), P2, P3 and P4. The width and height of the characteristic map Pi (i is 0,1,2,3) are 1/2 of the characteristic map Pi +1, and the number of channels of the characteristic map Pi is 2 times of the number of channels of the characteristic map Pi + 1.
[Attached table (rendered as an image in the original): network architecture of the ResNet50 feature extraction part in this example.]
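For concreteness, the stage outputs P0-P4 can be read off a standard torchvision ResNet50, whose layer1-layer4 attributes correspond to conv2_x-conv5_x. This is a minimal sketch under that assumption (backbone_features is an illustrative name; the cross-layer rewiring of the later stages is added in step S3):

    import torch
    from torchvision.models import resnet50

    net = resnet50(weights=None)  # or weights="IMAGENET1K_V1" for a pretrained start

    def backbone_features(x: torch.Tensor):
        p0 = net.relu(net.bn1(net.conv1(x)))   # conv1   -> P0
        p1 = net.layer1(net.maxpool(p0))       # conv2_x -> P1 (= P1')
        p2 = net.layer2(p1)                    # conv3_x -> P2
        p3 = net.layer3(p2)                    # conv4_x -> P3
        p4 = net.layer4(p3)                    # conv5_x -> P4
        return p0, p1, p2, p3, p4

    feats = backbone_features(torch.randn(1, 3, 224, 224))
    print([tuple(f.shape) for f in feats])
    # -> [(1, 64, 112, 112), (1, 256, 56, 56), (1, 512, 28, 28),
    #     (1, 1024, 14, 14), (1, 2048, 7, 7)]: each later stage halves
    #    width/height and doubles channels, as stated above.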
The input of step S2 is the feature maps P1′, P2, P3 and P4 output in step S1. The feature map P1′ is downsampled and then fused with P2 to obtain P2′; P2′ is downsampled and then fused with P3 to obtain P3′; P3′ is downsampled and then fused with P4 to obtain P4′, thereby constructing the bottom-up feature pyramid network (DTFPN). The outputs of this step are the feature maps P1′, P2′, P3′ and P4′. The details of downsampling and fusion are shown in FIG. 3 and are further described below:
s2.1, the input of the step is the characteristic map P1' and P2 output by the step S1. The feature map P1 'is downsampled by a convolution kernel size of 1x1 and a step size of 2, the width and the height of the feature map are reduced by 1/2, the number of channels is increased by 1 time, and the feature map P1' is guaranteed to be the same as the feature map P2 after downsampling. Next, the feature map obtained after downsampling the feature map P1 'is input to a modified linear unit (ReLU) to adjust the distribution of the feature map data, and the output feature map is added to the feature map P2 to obtain a feature map P2'. This step outputs a feature map P2'.
S2.2, the input of the step is the characteristic diagram P3 output by the step S1 and the characteristic diagram P2' output by the step S2.1. The feature map P2 'is downsampled by a convolution kernel size of 1x1 and a step size of 2, the width and the height of the feature map are reduced by 1/2, the number of channels is increased by 1 time, and the feature map P2' is guaranteed to be the same as the feature map P3 after downsampling. Next, the feature map obtained after downsampling the feature map P2 'is input to a modified linear unit (ReLU) to adjust the distribution of the feature map data, and the output feature map is added to the feature map P3 to obtain a feature map P3'. This step outputs a feature map P3'.
S2.3, the input of the step is the characteristic map P4 output by the step S1 and the characteristic map P3' output by the step S2.2. The feature map P3 'is downsampled by a convolution kernel size of 1x1 and a step size of 2, the width and the height of the feature map are reduced by 1/2, the number of channels is increased by 1 time, and the feature map P3' is guaranteed to be the same as the feature map P4 after downsampling. Next, the feature map obtained after downsampling the feature map P3 'is input to a modified linear unit (ReLU) to adjust the distribution of the feature map data, and the output feature map is added to the feature map P4 to obtain a feature map P4'. This step outputs a feature map P4'.
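One DTFPN fusion step can be sketched as follows (a PyTorch sketch under the same assumptions; DownsampleFuse is an illustrative name). The 1x1 convolution with stride 2 halves the width and height and doubles the channels so the two operands have the same shape, the ReLU adjusts the distribution, and element-wise addition performs the fusion:

    import torch
    import torch.nn as nn

    class DownsampleFuse(nn.Module):
        """Fuse a lower-level map Pi' into the next backbone map P(i+1)."""
        def __init__(self, in_ch: int):
            super().__init__()
            # 1x1 kernel, stride 2: width/height halved, channels doubled
            self.down = nn.Conv2d(in_ch, 2 * in_ch, kernel_size=1, stride=2)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, lower: torch.Tensor, upper: torch.Tensor) -> torch.Tensor:
            return self.relu(self.down(lower)) + upper

    fuse2 = DownsampleFuse(256)
    p1p = torch.randn(1, 256, 56, 56)   # P1'
    p2  = torch.randn(1, 512, 28, 28)   # P2 from conv3_x
    p2p = fuse2(p1p, p2)                # P2', shape (1, 512, 28, 28)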
The inputs of step S3 are the feature map P1 output in step S1 and the feature maps P2′ and P3′ output in step S2. P1, P2′ and P3′ are passed through the residual network module 3 (conv3_x), residual network module 4 (conv4_x) and residual network module 5 (conv5_x) of ResNet50 in step S1, respectively, to obtain the feature maps P2, P3 and P4. As shown in FIG. 3, the feature maps P1′, P2, P2′ and their intermediate network (spanning multiple residual layers of a ResNet50 residual network module) form a cross-layer residual network module; P2′, P3, P3′ and their intermediate network form a cross-layer residual network module; and P3′, P4, P4′ and their intermediate network form a cross-layer residual network module, realizing the cross-layer residual network (Cross-layer ResNet50) based on ResNet50.
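Chaining the backbone stages and fusion steps makes the cross-layer residual wiring explicit: each later stage consumes the fused map of the previous level, so each triple (Pi′, P(i+1), P(i+1)′) together with the stage between them forms one cross-layer residual module. A sketch continuing the illustrative modules above:

    def cross_layer_forward(net, fuse2, fuse3, fuse4, x):
        p0  = net.relu(net.bn1(net.conv1(x)))
        p1p = net.layer1(net.maxpool(p0))   # conv2_x; P1' = P1
        p2  = net.layer2(p1p)               # conv3_x
        p2p = fuse2(p1p, p2)                # long skip around conv3_x -> P2'
        p3  = net.layer3(p2p)               # conv4_x consumes P2'
        p3p = fuse3(p2p, p3)                # -> P3'
        p4  = net.layer4(p3p)               # conv5_x consumes P3'
        p4p = fuse4(p3p, p4)                # -> P4'
        return p1p, p2p, p3p, p4p

    # with fuse3 = DownsampleFuse(512) and fuse4 = DownsampleFuse(1024)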
The inputs of step S4 are the feature maps P1′, P2′, P3′ and P4′ output in step S2. They are input into the feature pyramid network (FPN), which together with the preceding stages establishes the cross-layer residual two-way pyramid network (Cross-layer Residual Bi-FPN). After FPN processing, the feature maps P1″, P2″, P3″, P4″ and P5″ are output, completing all steps of extracting feature maps from an original image with the cross-layer residual two-way pyramid network (Cross-layer Residual Bi-FPN).
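The final FPN pass can be sketched with torchvision's FeaturePyramidNetwork. How P5″ is produced is not spelled out in this text, so the extra max-pooled level below is an assumption borrowed from common FPN practice:

    import torch.nn.functional as F
    from collections import OrderedDict
    from torchvision.ops import FeaturePyramidNetwork

    fpn = FeaturePyramidNetwork(in_channels_list=[256, 512, 1024, 2048],
                                out_channels=256)

    def fpn_outputs(p1p, p2p, p3p, p4p):
        outs = fpn(OrderedDict([("p1", p1p), ("p2", p2p), ("p3", p3p), ("p4", p4p)]))
        p1pp, p2pp, p3pp, p4pp = outs.values()
        p5pp = F.max_pool2d(p4pp, kernel_size=1, stride=2)  # assumed P5'' level
        return p1pp, p2pp, p3pp, p4pp, p5pp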
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (5)

1. An image feature extraction method based on a cross-layer residual two-way pyramid network, characterized by comprising the following steps:
step S1, inputting an original RGB color image into the residual network ResNet50 for preliminary feature extraction, wherein the conv1 convolutional network module 1 of ResNet50 outputs a feature map P0, the conv2_x residual network module 2 outputs feature maps P1 and P1′ (P1 = P1′), the conv3_x residual network module 3 outputs a feature map P2, the conv4_x residual network module 4 outputs a feature map P3, and the conv5_x residual network module 5 outputs a feature map P4;
step S2, downsampling the feature map P1′ and fusing it with the feature map P2 to obtain a feature map P2′, downsampling P2′ and fusing it with P3 to obtain P3′, and downsampling P3′ and fusing it with P4 to obtain P4′, thereby obtaining the bottom-up feature pyramid network DTFPN;
step S3, forming a cross-layer residual network module from the feature maps P1′, P2, P2′ and their intermediate network (spanning multiple residual layers of a residual network module of ResNet50), forming a cross-layer residual network module from P2′, P3, P3′ and their intermediate network, and forming a cross-layer residual network module from P3′, P4, P4′ and their intermediate network, thereby implementing a cross-layer residual network based on ResNet50;
step S4, inputting the feature maps P1′, P2′, P3′ and P4′ into the feature pyramid network FPN and establishing with it the cross-layer residual two-way pyramid network, wherein after FPN processing the output feature maps P1″, P2″, P3″, P4″ and P5″ are obtained.
2. The image feature extraction method based on the cross-layer residual two-way pyramid network as claimed in claim 1, wherein: in step S1, the width and height of the feature map Pi+1 (i = 0,1,2,3) are 1/2 of those of the feature map Pi, and the number of channels of Pi+1 is twice the number of channels of Pi.
3. The image feature extraction method based on the cross-layer residual two-way pyramid network as claimed in claim 1, wherein step S2 comprises:
S2.1, applying to the feature map P1′ a downsampling operation with a 1x1 convolution kernel and a stride of 2, which halves the width and height of the feature map and doubles the number of channels; inputting the downsampled feature map into a rectified linear unit to adjust the distribution of the feature map data, and adding the adjusted feature map to the feature map P2 to obtain a feature map P2′;
S2.2, applying to the feature map P2′ a downsampling operation with a 1x1 convolution kernel and a stride of 2, which halves the width and height of the feature map and doubles the number of channels; inputting the downsampled feature map into a rectified linear unit to adjust the distribution of the feature map data, and adding the adjusted feature map to the feature map P3 to obtain a feature map P3′;
S2.3, applying to the feature map P3′ a downsampling operation with a 1x1 convolution kernel and a stride of 2, which halves the width and height of the feature map and doubles the number of channels; inputting the downsampled feature map into a rectified linear unit to adjust the distribution of the feature map data, and adding the adjusted feature map to the feature map P4 to obtain a feature map P4′.
4. The image feature extraction method based on the cross-layer residual two-way pyramid network as claimed in claim 1, wherein in step S3 the feature maps P1, P2′ and P3′ are passed through the conv3_x residual network module 3, the conv4_x residual network module 4 and the conv5_x residual network module 5 of the residual network ResNet50 of step S1, respectively, to obtain the feature maps P2, P3 and P4.
5. The image feature extraction method based on the cross-layer residual two-way pyramid network as claimed in claim 1, wherein the cross-layer residual network module in step S3 is a module spanning multiple residual layers of a residual network module of the residual network ResNet50.
CN202111002973.5A 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual two-way pyramid network Active CN113837199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111002973.5A CN113837199B (en) 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual two-way pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111002973.5A CN113837199B (en) 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual two-way pyramid network

Publications (2)

Publication Number Publication Date
CN113837199A true CN113837199A (en) 2021-12-24
CN113837199B CN113837199B (en) 2024-01-09

Family

ID=78961539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111002973.5A Active CN113837199B (en) Image feature extraction method based on cross-layer residual two-way pyramid network

Country Status (1)

Country Link
CN (1) CN113837199B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100473A (en) * 2022-06-29 2022-09-23 武汉兰丁智能医学股份有限公司 Lung cell image classification method based on parallel neural network

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339893A (en) * 2020-02-21 2020-06-26 哈尔滨工业大学 Pipeline detection system and method based on deep learning and unmanned aerial vehicle
CN111507359A (en) * 2020-03-09 2020-08-07 杭州电子科技大学 Self-adaptive weighting fusion method of image feature pyramid
US20200258218A1 (en) * 2018-03-14 2020-08-13 Dalian University Of Technology Method based on deep neural network to extract appearance and geometry features for pulmonary textures classification
US20200272825A1 (en) * 2019-05-27 2020-08-27 Beijing Dajia Internet Information Technology Co., Ltd. Scene segmentation method and device, and storage medium
CN111753677A (en) * 2020-06-10 2020-10-09 杭州电子科技大学 Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112507861A (en) * 2020-12-04 2021-03-16 江苏科技大学 Pedestrian detection method based on multilayer convolution feature fusion

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200258218A1 (en) * 2018-03-14 2020-08-13 Dalian University Of Technology Method based on deep neural network to extract appearance and geometry features for pulmonary textures classification
US20200272825A1 (en) * 2019-05-27 2020-08-27 Beijing Dajia Internet Information Technology Co., Ltd. Scene segmentation method and device, and storage medium
CN111339893A (en) * 2020-02-21 2020-06-26 哈尔滨工业大学 Pipeline detection system and method based on deep learning and unmanned aerial vehicle
CN111507359A (en) * 2020-03-09 2020-08-07 杭州电子科技大学 Self-adaptive weighting fusion method of image feature pyramid
CN111753677A (en) * 2020-06-10 2020-10-09 杭州电子科技大学 Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112507861A (en) * 2020-12-04 2021-03-16 江苏科技大学 Pedestrian detection method based on multilayer convolution feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZENG YU et al.: "Convolutional networks with cross-layer neurons for image recognition", Information Sciences, pages 241-254 *
FU Xueyang: "Research on domain-knowledge-driven deep learning for single image deraining", China Doctoral Dissertations Full-text Database, Information Science and Technology, no. 12, pages 1-109 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100473A (en) * 2022-06-29 2022-09-23 武汉兰丁智能医学股份有限公司 Lung cell image classification method based on parallel neural network

Also Published As

Publication number Publication date
CN113837199B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
WO2020244261A1 (en) Scene recognition system for high-resolution remote sensing image, and model generation method
CN110929610B (en) Plant disease identification method and system based on CNN model and transfer learning
Teow Understanding convolutional neural networks using a minimal model for handwritten digit recognition
CN107480726A (en) A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN109492666A (en) Image recognition model training method, device and storage medium
CN108764195A (en) Handwriting model training method, hand-written character recognizing method, device, equipment and medium
Wu et al. Convolutional reconstruction-to-sequence for video captioning
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
Wang et al. Sketch-guided scenery image outpainting
Lin et al. Lateral refinement network for contour detection
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN113837199A (en) Image feature extraction method based on cross-layer residual error double-path pyramid network
CN110222817A (en) Convolutional neural networks compression method, system and medium based on learning automaton
Zhang et al. Skip-attention encoder–decoder framework for human motion prediction
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
CN111914993B (en) Multi-scale deep convolutional neural network model construction method based on non-uniform grouping
CN116704079B (en) Image generation method, device, equipment and storage medium
Asaad Keras Deep Learning for Pupil Detection Method
CN110766083A (en) Alexnet mural image classification method based on feature fusion
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
Lei et al. English Letter Recognition Based on TensorFlow Deep Learning
CN115620064A (en) Point cloud down-sampling classification method and system based on convolutional neural network
Zhou et al. Research on knowledge distillation algorithm based on Yolov5 attention mechanism
Lu et al. Mixseg: a lightweight and accurate mix structure network for semantic segmentation of apple leaf disease in complex environments
Zhang Research on Applying Dense Convolutional Neural Network in Chinese Character Font Recognition

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant