CN113837199B - Image feature extraction method based on cross-layer residual double-path pyramid network - Google Patents

Image feature extraction method based on cross-layer residual double-path pyramid network

Info

Publication number
CN113837199B
CN113837199B (application CN202111002973.5A)
Authority
CN
China
Prior art keywords
feature
network
feature map
residual
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111002973.5A
Other languages
Chinese (zh)
Other versions
CN113837199A (en)
Inventor
胡杰
谢礼浩
安永鹏
熊宗权
徐文才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT
Priority to CN202111002973.5A
Publication of CN113837199A
Application granted
Publication of CN113837199B
Legal status: Active
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image feature extraction method based on a cross-layer residual double-path pyramid network. An original RGB color image is input into the residual network ResNet50 for preliminary feature extraction, and a bottom-up feature pyramid network DTFPN is constructed; a cross-layer residual network is then realized on the basis of the residual network ResNet50; finally, the output feature maps P1''', P2''', P3''', P4''' and P5''' are obtained after processing by the feature pyramid network FPN. The invention further alleviates the network degradation problem of the residual network ResNet50, mixes and utilizes features of different levels to further extract deep features, and thereby significantly enhances the feature extraction capability of the residual network ResNet50. It also overcomes the defect that high-level features in the feature pyramid network (FPN) lack low-level detail texture information, and realizes efficient fusion of feature map information at each level.

Description

Image feature extraction method based on cross-layer residual double-path pyramid network
Technical Field
The invention relates to the fields of computer vision, artificial intelligence, pattern recognition and the like, and in particular to an image feature extraction method based on a cross-layer residual double-path pyramid network.
Background
With the development of artificial intelligence, convolutional neural networks have become the dominant method for extracting image features. Well-known feature extraction networks include LeNet-5, AlexNet, the Visual Geometry Group network (VGG), GoogLeNet and the residual network (ResNet).
LeNet-5, proposed in 1994, was one of the earliest convolutional neural networks and drove the development of deep learning. It consists of two convolutional layers, two pooling layers and two fully connected layers; the convolutions use 5x5 kernels with a stride of 1, and max-pooling is used for downsampling.
AlexNet won the ImageNet competition in 2012 and can be regarded as a deeper and wider version of LeNet. It contains 630 million connections, 60 million parameters and 650,000 neurons, with 5 convolutional layers, 3 of which are followed by max-pooling layers, and finally 3 fully connected layers. AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a significant margin, reducing the top-5 error rate from the previous 25.8% to 16.4%. The main technical points of AlexNet are: (1) Rectified linear units (ReLU) are used as the activation function of the convolutional neural network (CNN), which solves the gradient vanishing problem of the sigmoid function in deep networks. (2) Dropout is used during training to randomly ignore a portion of the neurons and avoid overfitting of the model. (3) Overlapping max-pooling is used in the CNN, with a stride smaller than the pooling kernel, so that the outputs overlap, which improves the richness of the features. Previously, CNNs commonly used average pooling, whereas AlexNet uses max-pooling throughout, avoiding the blurring effect of average pooling. (4) Data augmentation is used to reduce overfitting and improve the generalization ability of the model.
The Visual Geometry Group network (VGG) was the first network to use smaller 3x3 convolution kernels in each convolutional layer and to combine them into convolution sequences; it is characterized by a large number of successive convolutions and a large computational cost. An important contribution of VGG is showing that the effect of a larger receptive field can be obtained by stacking several 3x3 convolutions in sequence. The VGG models indicate that depth is beneficial for improving classification accuracy, and another important idea is that convolution can replace full connection. The network has about 140 million parameters in total; its main characteristic is that the first fully connected layer is replaced by convolution, which reduces the number of parameters without loss of accuracy.
GoogLeNet, the first Inception architecture, appeared in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 and won first place by a large margin. The Inception network used in that competition is generally called Inception V1; its biggest feature is that it achieves very good classification performance while keeping the amount of computation and the number of parameters under control, with a top-5 error rate of 6.67%, less than half that of AlexNet. Inception V1 is 22 layers deep, deeper than the 8 layers of AlexNet or the 19 layers of VGG. However, it requires only about 1.5 billion floating point operations and only 5 million parameters, about 1/12 of AlexNet's 60 million, while achieving accuracy far superior to AlexNet, making it a very practical and excellent model. Versions V2, V3 and V4 were subsequently derived from Inception V1.
The residual network ResNet was proposed in 2015 and won first place in the ImageNet classification task. Because it is simple and practical, many later methods are built on top of ResNet50 or ResNet101. ResNet introduces the residual structure and, together with batch normalization (Batch Normalization), effectively solves the problems of gradient vanishing or gradient explosion and network degradation in deep networks, so that the performance of very deep ResNets is greatly improved compared with earlier feature extraction networks, and excellent results have been obtained in fields such as image detection, image classification and image segmentation.
The feature pyramid network (FPN) constructs a feature pyramid that can be trained end to end; it upsamples the high-level features extracted by the feature extraction network and fuses them with the low-level features, thereby enriching the semantic information of the low-level features. For small objects, the FPN increases the resolution of the feature maps, i.e. it operates on larger feature maps and thus obtains more information about small objects.
Inspired by the residual network ResNet, the feature maps output by residual network modules can themselves be connected by residual connections to form a cross-layer residual network module (one that spans multiple residual layers of the original residual network modules). That is, assuming the input of a residual network module is x and the expected output is H(x), if we pass the input x directly to the output as an initial result, the target that the module needs to learn becomes F(x) = H(x) - x; this changes the learning target of the module, and learning F(x) is much easier than learning H(x). Optimizing the ResNet structure again in this way to form a cross-layer residual network can further reduce the degradation problem of ResNet and allow features of different levels to be mixed and used to further extract deep features. In the feature pyramid network (FPN), high-level features are fused into low-level features; although this greatly enriches the information of the low-level features, the information of the high-level features is not improved. The high-level features also need to be supplemented with low-level texture information, so the feature fusion is insufficient and the network performance is limited.
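The following is a minimal PyTorch-style sketch of this idea, not taken from the patent itself: a stack of residual layers is wrapped so that the whole stack learns F(x) = H(x) - x, with an optional projection on the skip path when the spatial size or channel count changes. The class and argument names are illustrative assumptions.

```python
import torch.nn as nn

class CrossLayerResidual(nn.Module):
    """Wrap a stack of residual layers so the whole stack learns
    F(x) = H(x) - x: the input is carried across the stack and added
    back to its output (a projection matches shapes when needed)."""
    def __init__(self, layers: nn.Module, projection: nn.Module = None):
        super().__init__()
        self.layers = layers          # e.g. an entire conv3_x-style stage
        self.projection = projection  # e.g. a 1x1 stride-2 conv if shapes differ

    def forward(self, x):
        identity = x if self.projection is None else self.projection(x)
        return self.layers(x) + identity  # H(x) = F(x) + x
```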
Summary of the invention
To overcome the shortcomings of the prior art, the invention provides an image feature extraction method based on a cross-layer residual double-path pyramid network, which achieves two goals while keeping a feature extraction speed comparable to the original network: (1) Further alleviating the network degradation problem of the residual network ResNet50 and mixing and utilizing features of different levels to further extract deep features, thereby significantly enhancing the feature extraction capability of ResNet50. (2) Overcoming the defect that high-level features in the feature pyramid network (FPN) lack low-level detail texture information, and realizing efficient fusion of feature map information at each level.
In order to solve the technical problems, the invention adopts the following technical scheme:
an image feature extraction method based on a cross-layer residual double-path pyramid network comprises the following steps:
step S1, the original RGB color image is input into the residual network ResNet50 for preliminary feature extraction; the conv1 convolutional network module 1 of ResNet50 outputs the feature map P0, the conv2_x residual network module 2 outputs the feature maps P1 and P1' (P1 = P1'), the conv3_x residual network module 3 outputs the feature map P2, the conv4_x residual network module 4 outputs the feature map P3, and the conv5_x residual network module 5 outputs the feature map P4;
step S2, the feature map P1' is downsampled and then fused with the feature map P2 to obtain the feature map P2', the feature map P2' is downsampled and then fused with the feature map P3 to obtain the feature map P3', and the feature map P3' is downsampled and then fused with the feature map P4 to obtain the feature map P4', thereby constructing the bottom-up feature pyramid network DTFPN;
step S3, the feature maps P1', P2' and the intermediate network between them form a cross-layer residual network module (spanning a plurality of residual layers of the residual network ResNet50); the feature maps P2', P3' and their intermediate network form a cross-layer residual network module; the feature maps P3', P4' and their intermediate network form a cross-layer residual network module; a cross-layer residual network based on the residual network ResNet50 is thus realized;
step S4, the feature maps P1', P2', P3', P4' are input to the feature pyramid network FPN, and together with the FPN a cross-layer residual double-path pyramid network is established; the feature maps P1', P2', P3', P4' are processed by the feature pyramid network FPN to obtain the output feature maps P1''', P2''', P3''', P4''' and P5'''.
Preferably, in step S1, the width and height of the feature map P(i+1) are 1/2 those of the feature map Pi, and the number of channels of P(i+1) is 2 times that of Pi, where i = 0, 1, 2, 3.
Preferably, step 2 includes:
S2.1, a downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P1', halving its width and height and doubling its number of channels; the downsampled feature map of P1' is then input to a rectified linear unit to adjust the distribution of the feature map data, and the adjusted feature map is added to the feature map P2 to obtain the feature map P2';
S2.2, a downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P2', halving its width and height and doubling its number of channels; the downsampled feature map of P2' is then input to a rectified linear unit to adjust the distribution of the feature map data, and the adjusted feature map is added to the feature map P3 to obtain the feature map P3';
S2.3, a downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P3', halving its width and height and doubling its number of channels; the downsampled feature map of P3' is then input to a rectified linear unit to adjust the distribution of the feature map data, and the adjusted feature map is added to the feature map P4 to obtain the feature map P4'.
Preferably, in step 3, the feature maps P1, P2' and P3' are passed through the conv3_x residual network module 3, the conv4_x residual network module 4 and the conv5_x residual network module 5 of the residual network ResNet50 of step S1 to obtain the feature maps P2, P3 and P4, respectively.
Preferably, the cross-layer residual network module in step 3 refers to a module that spans a plurality of residual layers of the residual network ResNet50.
The invention has the beneficial effects that:
(1) The invention builds a bottom-up feature pyramid network (Down to Top Feature Pyramid Network, DTFPN) on the feature maps output by each residual network module of the residual network ResNet50, so that low-level texture detail information supplements the high-level feature map information. This effectively overcomes the defect that high-level features in the feature pyramid network (Feature Pyramid Network, FPN) lack low-level detail texture information and realizes efficient fusion of feature map information at each level.
(2) On the basis of the bottom-up feature pyramid network (DTFPN) in (1), a cross-layer residual network (Cross-layer ResNet50) based on the residual network ResNet50 is built, which further reduces the network degradation problem of ResNet50, mixes and utilizes features of different levels to further extract deep features, and significantly enhances the feature extraction capability of ResNet50.
Compared with a Faster Region-based Convolutional Neural Network (Faster R-CNN) built on the ResNet50-FPN feature extraction network, a Faster R-CNN built on the cross-layer residual double-path pyramid network (Cross-layer residual Bi-FPN) provided by the invention improves the average target detection accuracy AP(0.5-0.95) on the KITTI dataset by 3.8%, while the network inference speed remains almost unchanged.
Drawings
Fig. 1 is the overall network framework diagram of the technical solution according to an embodiment of the present invention;
FIG. 2 is a schematic diagram showing a detail of a bottom-up feature pyramid network (Down to Top Feature Pyramid Network, DTFPN) according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a Cross-layer residual structure of a Cross-layer residual network (Cross-layer res net 50) according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
The invention provides an image feature extraction method based on a cross-layer residual double-path pyramid network (Cross-layer residual Bi-FPN), which comprises a cross-layer residual network (Cross-layer ResNet50) designed on the basis of the residual network ResNet50 and a feature pyramid network (FPN), where the cross-layer residual network contains a completely new bottom-up feature pyramid network (Down to Top Feature Pyramid Network, DTFPN). The implementation of the invention is divided into the following steps. S1, the feature maps output by the original image through the convolutional network module 1 (conv1), residual network module 2 (conv2_x), residual network module 3 (conv3_x), residual network module 4 (conv4_x) and residual network module 5 (conv5_x) of the ResNet50 backbone are defined as P0, P1 and P1' (P1 = P1'), P2, P3 and P4, respectively. S2, the inputs of this step are the feature maps P1', P2, P3 and P4 output by step S1; the feature map P1' is downsampled and fused with P2 to obtain P2', P2' is downsampled and fused with P3 to obtain P3', and P3' is downsampled and fused with P4 to obtain P4', thereby constructing the bottom-up feature pyramid network (DTFPN); this step outputs the feature maps P1', P2', P3' and P4'. S3, the inputs of this step are the feature map P1 output by step S1 and the feature maps P2', P3' output by step S2; the feature maps P1, P2' and P3' are passed through residual network module 3 (conv3_x), residual network module 4 (conv4_x) and residual network module 5 (conv5_x) of ResNet50 to obtain the feature maps P2, P3 and P4, respectively, so that P1', P2' and their intermediate network form a cross-layer residual network module (spanning a plurality of residual layers of ResNet50), P2', P3' and their intermediate network form a cross-layer residual network module, and P3', P4' and their intermediate network form a cross-layer residual network module, thus realizing the cross-layer residual network (Cross-layer ResNet50). S4, the inputs of this step are the feature maps P1', P2', P3', P4' output by step S2; these feature maps are input into the feature pyramid network (FPN), establishing the cross-layer residual double-path pyramid network (Cross-layer residual Bi-FPN) together with the FPN; the feature maps P1', P2', P3', P4' are processed by the FPN to output the feature maps P1''', P2''', P3''', P4''' and P5'''. By designing a new bottom-up feature pyramid network and a cross-layer residual structure, the invention forms a new feature extraction network that, while keeping a feature extraction speed comparable to the original network, further reduces the network degradation problem of ResNet50, mixes and utilizes features of different levels to further extract deep features, overcomes the defect that high-level features in the FPN lack low-level detail information, and realizes efficient fusion of feature map information at each level.
The invention can be applied to tasks such as image target detection and semantic segmentation, where it performs excellently.
Figure 1 is the overall network framework diagram of the technical solution of this embodiment. The constructed cross-layer residual double-path pyramid network (Cross-layer residual Bi-FPN) comprises a cross-layer residual network (Cross-layer ResNet50) designed on the basis of the residual network ResNet50 and a feature pyramid network (Feature Pyramid Network, FPN), where the cross-layer residual network contains a completely new bottom-up feature pyramid network (Down to Top Feature Pyramid Network, DTFPN). The detailed steps for building the whole feature extraction network are as follows:
S1, as shown in the attached Table 1, the feature extraction part of the residual network ResNet50 consists of convolutional network module 1 (conv1), residual network module 2 (conv2_x), residual network module 3 (conv3_x), residual network module 4 (conv4_x) and residual network module 5 (conv5_x), where conv2_x and each subsequent residual network module is composed of several residual layer structures. The original RGB color image is input into ResNet50 for preliminary feature extraction, and the outputs of conv1, conv2_x, conv3_x, conv4_x, conv5_x are defined as the feature maps P0, P1 and P1' (P1 = P1'), P2, P3, P4, respectively. The width and height of the feature map P(i+1) are 1/2 those of the feature map Pi, and the number of channels of P(i+1) is 2 times that of Pi (i = 0, 1, 2, 3).
Table 1: Network architecture of the ResNet50 feature extraction part in this example
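As an illustration of step S1 only, the sketch below extracts the five feature maps P0-P4 from a standard torchvision ResNet50; it assumes the torchvision package and uses its module names (conv1, layer1-layer4) to stand in for conv1 and conv2_x-conv5_x. It is a minimal sketch under those assumptions, not the patent's implementation.

```python
import torch
from torchvision.models import resnet50

net = resnet50(weights=None)  # the backbone; pretrained weights are optional

def extract_backbone_features(x, net):
    # conv1 (7x7, stride 2) + BN + ReLU -> P0
    p0 = net.relu(net.bn1(net.conv1(x)))
    # conv2_x: 3x3 max-pool (stride 2) followed by layer1 -> P1 (= P1')
    p1 = net.layer1(net.maxpool(p0))
    p2 = net.layer2(p1)   # conv3_x -> P2
    p3 = net.layer3(p2)   # conv4_x -> P3
    p4 = net.layer4(p3)   # conv5_x -> P4
    return p0, p1, p2, p3, p4

# Example: a 3x800x800 RGB image gives P1..P4 with 256/512/1024/2048 channels.
feats = extract_backbone_features(torch.randn(1, 3, 800, 800), net)
```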
S2, the inputs of this step are the feature maps P1', P2, P3 and P4 output by step S1. The feature map P1' is downsampled and then fused with the feature map P2 to obtain the feature map P2', the feature map P2' is downsampled and then fused with the feature map P3 to obtain the feature map P3', and the feature map P3' is downsampled and then fused with the feature map P4 to obtain the feature map P4', thereby constructing the bottom-up feature pyramid network (DTFPN). The outputs of this step are the feature maps P1', P2', P3', P4'. The details of the downsampling and fusion are shown in Figure 3 and are further described below in conjunction with Figure 3; a minimal code sketch of one fusion step is given after step S2.3:
S2.1, the inputs of this step are the feature maps P1' and P2 output by step S1. A downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P1', which halves its width and height and doubles its number of channels, ensuring that the downsampled P1' has the same size as the feature map P2. The downsampled feature map of P1' is then input to a rectified linear unit (ReLU) to adjust the distribution of the feature map data, and the output feature map is added to the feature map P2 to obtain the feature map P2'. This step outputs the feature map P2'.
S2.2, the inputs of this step are the feature map P3 output by step S1 and the feature map P2' output by step S2.1. A downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P2', which halves its width and height and doubles its number of channels, ensuring that the downsampled P2' has the same size as the feature map P3. The downsampled feature map of P2' is then input to a rectified linear unit (ReLU) to adjust the distribution of the feature map data, and the output feature map is added to the feature map P3 to obtain the feature map P3'. This step outputs the feature map P3'.
S2.3, the inputs of this step are the feature map P4 output by step S1 and the feature map P3' output by step S2.2. A downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P3', which halves its width and height and doubles its number of channels, ensuring that the downsampled P3' has the same size as the feature map P4. The downsampled feature map of P3' is then input to a rectified linear unit (ReLU) to adjust the distribution of the feature map data, and the output feature map is added to the feature map P4 to obtain the feature map P4'. This step outputs the feature map P4'.
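Below is a minimal PyTorch-style sketch of one DTFPN fusion step as described in S2.1-S2.3 (1x1 stride-2 convolution, ReLU, element-wise addition). The class and variable names are illustrative assumptions, not the patent's code.

```python
import torch.nn as nn

class DownsampleFuse(nn.Module):
    """One bottom-up DTFPN fusion step: a 1x1 stride-2 convolution halves the
    width and height and doubles the channels of the lower-level map, a ReLU
    adjusts the distribution, and the result is added to the next-level map."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.down = nn.Conv2d(in_channels, in_channels * 2, kernel_size=1, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, p_low, p_high):
        # p_low: P_i' (larger, half the channels); p_high: P_(i+1) from the backbone
        return self.relu(self.down(p_low)) + p_high

# Chaining the three fusion steps of step S2 (channel widths follow ResNet50):
# p2p = DownsampleFuse(256)(p1, p2)    # S2.1 -> P2'
# p3p = DownsampleFuse(512)(p2p, p3)   # S2.2 -> P3'
# p4p = DownsampleFuse(1024)(p3p, p4)  # S2.3 -> P4'
```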
S3, the inputs of this step are the feature map P1 output by step S1 and the feature maps P2', P3' output by step S2. The feature maps P1, P2' and P3' are passed through residual network module 3 (conv3_x), residual network module 4 (conv4_x) and residual network module 5 (conv5_x) of the residual network ResNet50 of step S1 to obtain the feature maps P2, P3 and P4, respectively. As shown in Figure 3, the feature maps P1', P2' and their intermediate network form a cross-layer residual network module (spanning a plurality of residual layers of the residual network ResNet50), the feature maps P2', P3' and their intermediate network form a cross-layer residual network module, and the feature maps P3', P4' and their intermediate network form a cross-layer residual network module, thereby realizing the cross-layer residual network (Cross-layer ResNet50) based on the residual network ResNet50.
S4, the inputs of this step are the feature maps P1', P2', P3', P4' output by step S2. These feature maps are input into the feature pyramid network (FPN), establishing the cross-layer residual double-path pyramid network (Cross-layer residual Bi-FPN) together with the FPN. The feature maps P1', P2', P3', P4' are processed by the FPN to output the feature maps P1''', P2''', P3''', P4''' and P5'''. This completes all the steps by which the original image is input into the cross-layer residual double-path pyramid network (Cross-layer residual Bi-FPN) to extract feature maps.
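For step S4, the fused maps P1'-P4' can be fed to a standard FPN. The sketch below uses torchvision's generic FeaturePyramidNetwork with a LastLevelMaxPool extra block to produce the fifth output map; this is a stand-in under the assumption that a standard FPN implementation is acceptable here, not the patent's own FPN, and the tensor shapes assume an 800x800 input.

```python
from collections import OrderedDict
import torch
from torchvision.ops import FeaturePyramidNetwork
from torchvision.ops.feature_pyramid_network import LastLevelMaxPool

# Dummy tensors standing in for the fused maps P1'..P4' (800x800 input assumed).
p1p = torch.randn(1, 256, 200, 200)
p2p = torch.randn(1, 512, 100, 100)
p3p = torch.randn(1, 1024, 50, 50)
p4p = torch.randn(1, 2048, 25, 25)

fpn = FeaturePyramidNetwork(
    in_channels_list=[256, 512, 1024, 2048],  # channel widths of P1'..P4'
    out_channels=256,                          # common FPN output width
    extra_blocks=LastLevelMaxPool(),           # adds the extra, coarsest map
)

outs = fpn(OrderedDict([("p1", p1p), ("p2", p2p), ("p3", p3p), ("p4", p4p)]))
# outs holds five maps, corresponding to P1''', P2''', P3''', P4''' and P5'''.
```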
It will be understood that modifications and variations will be apparent to those skilled in the art from the foregoing description, and it is intended that all such modifications and variations be included within the scope of the following claims.

Claims (3)

1. An image feature extraction method based on a cross-layer residual double-path pyramid network, characterized by comprising the following steps:
step S1, an original RGB color image is input into the residual network ResNet50 for preliminary feature extraction, wherein the conv1 convolutional network module 1 of ResNet50 outputs the feature map P0, the conv2_x residual network module 2 outputs the feature maps P1 and P1' (P1 = P1'), the conv3_x residual network module 3 outputs the feature map P2, the conv4_x residual network module 4 outputs the feature map P3, and the conv5_x residual network module 5 outputs the feature map P4;
step S2, the feature map P1' is downsampled and then fused with the feature map P2 to obtain the feature map P2', the feature map P2' is downsampled and then fused with the feature map P3 to obtain the feature map P3', and the feature map P3' is downsampled and then fused with the feature map P4 to obtain the feature map P4', thereby obtaining the bottom-up feature pyramid network DTFPN;
step S3, the feature maps P1', P2' and the intermediate network between them form a cross-layer residual network module, wherein the intermediate network consists of a plurality of residual layers of the residual network ResNet50 that the module spans; the feature maps P2', P3' and their intermediate network form a cross-layer residual network module; the inputs of this step are the feature map P1' output by step S1 and the feature maps P2', P3' output by step S2, and the feature maps P1', P2' and P3' are passed through the residual network module 3, the residual network module 4 and the residual network module 5 of the residual network ResNet50 of step S1 to obtain the feature maps P2, P3 and P4, respectively; the feature maps P3', P4' and their intermediate network form a cross-layer residual network module; a cross-layer residual network based on the residual network ResNet50 is thus realized;
step S4, the feature maps P1', P2', P3', P4' are input to the feature pyramid network FPN, and together with the FPN a cross-layer residual double-path pyramid network is established; the feature maps P1', P2', P3', P4' are processed by the feature pyramid network FPN to obtain the output feature maps P1''', P2''', P3''', P4''' and P5'''.
2. The image feature extraction method based on the cross-layer residual double-path pyramid network according to claim 1, characterized in that: in step S1, the width and height of the feature map P(i+1) are 1/2 those of the feature map Pi, and the number of channels of P(i+1) is 2 times that of Pi, where i = 0, 1, 2, 3.
3. The image feature extraction method based on the cross-layer residual double-path pyramid network according to claim 1, wherein step S2 includes:
S2.1, a downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P1', halving its width and height and doubling its number of channels; the downsampled feature map of P1' is then input to a rectified linear unit to adjust the distribution of the feature map data, and the adjusted feature map is added to the feature map P2 to obtain the feature map P2';
S2.2, a downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P2', halving its width and height and doubling its number of channels; the downsampled feature map of P2' is then input to a rectified linear unit to adjust the distribution of the feature map data, and the adjusted feature map is added to the feature map P3 to obtain the feature map P3';
S2.3, a downsampling operation with a 1x1 convolution kernel and a stride of 2 is applied to the feature map P3', halving its width and height and doubling its number of channels; the downsampled feature map of P3' is then input to a rectified linear unit to adjust the distribution of the feature map data, and the adjusted feature map is added to the feature map P4 to obtain the feature map P4'.
CN202111002973.5A 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual double-path pyramid network Active CN113837199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111002973.5A CN113837199B (en) 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual double-path pyramid network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111002973.5A CN113837199B (en) 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual double-path pyramid network

Publications (2)

Publication Number Publication Date
CN113837199A CN113837199A (en) 2021-12-24
CN113837199B (en) 2024-01-09

Family

ID=78961539

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111002973.5A Active CN113837199B (en) 2021-08-30 2021-08-30 Image feature extraction method based on cross-layer residual double-path pyramid network

Country Status (1)

Country Link
CN (1) CN113837199B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100473A (en) * 2022-06-29 2022-09-23 武汉兰丁智能医学股份有限公司 Lung cell image classification method based on parallel neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339893A (en) * 2020-02-21 2020-06-26 哈尔滨工业大学 Pipeline detection system and method based on deep learning and unmanned aerial vehicle
CN111507359A (en) * 2020-03-09 2020-08-07 杭州电子科技大学 Self-adaptive weighting fusion method of image feature pyramid
CN111753677A (en) * 2020-06-10 2020-10-09 杭州电子科技大学 Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112507861A (en) * 2020-12-04 2021-03-16 江苏科技大学 Pedestrian detection method based on multilayer convolution feature fusion

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108428229B (en) * 2018-03-14 2020-06-16 大连理工大学 Lung texture recognition method based on appearance and geometric features extracted by deep neural network
CN110136136B (en) * 2019-05-27 2022-02-08 北京达佳互联信息技术有限公司 Scene segmentation method and device, computer equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339893A (en) * 2020-02-21 2020-06-26 哈尔滨工业大学 Pipeline detection system and method based on deep learning and unmanned aerial vehicle
CN111507359A (en) * 2020-03-09 2020-08-07 杭州电子科技大学 Self-adaptive weighting fusion method of image feature pyramid
CN111753677A (en) * 2020-06-10 2020-10-09 杭州电子科技大学 Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN112163449A (en) * 2020-08-21 2021-01-01 同济大学 Lightweight multi-branch feature cross-layer fusion image semantic segmentation method
CN112507861A (en) * 2020-12-04 2021-03-16 江苏科技大学 Pedestrian detection method based on multilayer convolution feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Convolutional networks with cross-layer neurons for image recognition; Zeng Yu et al.; Information Sciences; pp. 241-254 *
Research on domain-knowledge-driven deep learning for single image rain removal; Fu Xueyang; China Doctoral Dissertations Full-text Database, Information Science and Technology (No. 12); pp. 1-109 *

Also Published As

Publication number Publication date
CN113837199A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
CN111242288B (en) Multi-scale parallel deep neural network model construction method for lesion image segmentation
WO2022001623A1 (en) Image processing method and apparatus based on artificial intelligence, and device and storage medium
WO2022252272A1 (en) Transfer learning-based method for improved vgg16 network pig identity recognition
CN108304826A (en) Facial expression recognizing method based on convolutional neural networks
CN109949255A (en) Image rebuilding method and equipment
CN113807355A (en) Image semantic segmentation method based on coding and decoding structure
CN108846444A (en) The multistage depth migration learning method excavated towards multi-source data
CN113159073A (en) Knowledge distillation method and device, storage medium and terminal
CN111860528B (en) Image segmentation model based on improved U-Net network and training method
CN113706545A (en) Semi-supervised image segmentation method based on dual-branch nerve discrimination dimensionality reduction
CN107563430A (en) A kind of convolutional neural networks algorithm optimization method based on sparse autocoder and gray scale correlation fractal dimension
Zhang et al. Channel-wise and feature-points reweights densenet for image classification
CN113837199B (en) Image feature extraction method based on cross-layer residual double-path pyramid network
CN111310820A (en) Foundation meteorological cloud chart classification method based on cross validation depth CNN feature integration
Dong et al. Field-matching attention network for object detection
CN116912253B (en) Lung cancer pathological image classification method based on multi-scale mixed neural network
CN117152438A (en) Lightweight street view image semantic segmentation method based on improved deep LabV3+ network
CN114494284B (en) Scene analysis model and method based on explicit supervision area relation
CN114332491A (en) Saliency target detection algorithm based on feature reconstruction
CN113052810B (en) Small medical image focus segmentation method suitable for mobile application
Fan et al. EGFNet: Efficient guided feature fusion network for skin cancer lesion segmentation
CN113269702A (en) Low-exposure vein image enhancement method based on cross-scale feature fusion
CN113724266A (en) Glioma segmentation method and system
CN117456286B (en) Ginseng grading method, device and equipment
Wu et al. Lightweight stepless super-resolution of remote sensing images via saliency-aware dynamic routing strategy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant