Construction method of multichannel feature space pyramid
Technical Field
The invention relates to a feature space pyramid construction method, and in particular to a multi-channel image feature space pyramid construction method based on a cascading idea. The method is used for extracting feature representations of targets in current deep-learning-based target detection, and belongs to the technical fields of computer vision and target detection.
Background
Object detection aims at detecting object instances in an image: identifying the objects it contains (e.g. people, vehicles, planes, birds) and locating their positions in the image. Current target detection architectures based on deep convolutional neural networks consist mainly of three steps: feature extraction, region proposal generation, and classification and localization. Detection performance depends largely on whether the features extracted by the feature extraction stage, the deep convolutional neural network, are sufficient. During target detection with a deep convolutional neural network, the shallow feature maps extracted by the earlier convolutional layers carry little semantic information (used later to decide what a target is) but accurate target position information (used later to locate the target). The deep feature maps extracted by the later convolutional layers carry rich semantic information but only coarse position information. Meanwhile, because target scales within an image are inconsistent, semantic feature information appears in different convolutional layers depending on target size. If a target is large (i.e. it occupies a large area of the image), its semantic feature information appears in the later convolutional layers; conversely, if a target is small (i.e. it occupies a small area of the image), its position information must come from the earlier convolutional layers, since the target is likely to disappear in subsequent layers. This makes feature extraction particularly important when detecting images that contain multi-scale (large, medium, small) targets.
Disclosure of Invention
The invention aims to solve the above technical problems in the prior art and overcome its defects by providing a construction method of a multi-channel feature space pyramid that fully extracts feature information from images containing multi-scale targets.
The invention adopts the following technical scheme:
The construction method of the multichannel feature space pyramid comprises the following steps:
Step (1), inputting a picture to be detected;
Step (2), repeating the steps (3) to (10) aiming at the multi-scale target detection task, and continuing training the neural network model until the loss function of the network reaches a convergence state;
Step (3), selecting a deep-learning-based convolutional neural network to extract picture features (here the convolutional neural network ResNet-101 is selected; other convolutional neural networks, such as ResNet-50, VGGNet or GoogLeNet, may also be selected);
Step (4), marking the last-layer feature map of each group of convolutional blocks of the convolutional neural network ResNet-101 as C1, C2, C3, C4 and C5. Because C1 occupies a large amount of GPU (graphics processing unit) memory, only C2, C3, C4 and C5 are used to construct the four-layer feature space pyramid C in this method;
Step (5), on the basis of the feature space pyramid C, convolving layers C2, C3, C4 and C5 with a convolution kernel of size 1×1×256 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P0;
Step (6), on the basis of the feature space pyramid C, convolving layers C3, C4 and C5 with a convolution kernel of size 1×1×512 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P1;
Step (7), on the basis of the feature space pyramid C, convolving layers C4 and C5 with a convolution kernel of size 1×1×1024 and enhancing the feature representation with a feature fusion unit, so as to obtain an enhanced feature space pyramid P2;
Step (8), on the basis of the feature space pyramid C, convolving layer C5 with a convolution kernel of size 1×1×2048 to obtain a feature space pyramid P3;
Step (9), integrating the layer with the strongest semantic information from each of the four fusion-enhanced feature space pyramids P0, P1, P2 and P3 into a final feature space pyramid P;
Step (10), sending the feature space pyramid P into a target detection network to detect the category and position of the target.
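As a framework-agnostic sketch of steps (5) to (9), the snippet below records which layers of C feed each pyramid channel and the width of its 1×1 projection, and derives the level of each pyramid that enters the final pyramid P (the dictionary names are hypothetical illustrations, not from the original):

```python
# Sketch of the four pyramid channels described in steps (5)-(8).
# Each pyramid takes a subset of backbone levels and a 1x1 projection width.
pyramid_spec = {
    "P0": {"inputs": ["C2", "C3", "C4", "C5"], "channels": 256},
    "P1": {"inputs": ["C3", "C4", "C5"], "channels": 512},
    "P2": {"inputs": ["C4", "C5"], "channels": 1024},
    "P3": {"inputs": ["C5"], "channels": 2048},
}

# Step (9): the strongest-semantics layer of each pyramid is the one built
# from its shallowest input after top-down fusion (P0_2, P1_3, P2_4, P3_5).
final_pyramid = {
    name: spec["inputs"][0].replace("C", name + "_")
    for name, spec in pyramid_spec.items()
}
print(final_pyramid)  # {'P0': 'P0_2', 'P1': 'P1_3', 'P2': 'P2_4', 'P3': 'P3_5'}
```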
Preferably, in step (3) a deep convolutional neural network ResNet-101 is used, which contains 101 convolutional layers arranged in 4 groups of convolutional blocks.
Step (4) constructs the feature pyramid C; the specific steps are as follows:
(i) After an input picture enters the deep convolutional neural network ResNet-101, features are extracted in the first layer of the network with a convolution kernel of size 7×7×64, followed by a 3×3 max-pooling layer, yielding a 64-dimensional feature map C1 at 1/2 the size of the original image;
(ii) Further, in layers 2-10 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×64, 3×3×64 and 1×1×256, is applied to the feature map C1, the convolution operation being repeated 3 times, finally yielding a 256-dimensional feature map C2 at 1/4 the size of the original image;
(iii) Further, in layers 11-22 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×128, 3×3×128 and 1×1×512, is applied to the feature map C2, the convolution operation being repeated 4 times, yielding a feature map C3 with dimension 512 at 1/8 the size of the original image;
(iv) Further, in layers 23-91 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×256, 3×3×256 and 1×1×1024, is applied to the feature map C3, the convolution operation being repeated 23 times, yielding a feature map C4 with dimension 1024 at 1/16 the size of the original image;
(v) Further, in layers 92-100 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×512, 3×3×512 and 1×1×2048, is applied to the feature map C4, the convolution operation being repeated 3 times, finally yielding a feature map C5 with dimension 2048 at 1/32 the size of the original image.
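As a quick sanity check on the stage scales above, the snippet below computes the shape of each backbone map (a sketch; the 224×224 input size is an assumed example, not from the original):

```python
# Spatial sizes of C1..C5 for an assumed 224x224 input, following the
# 1/2, 1/4, 1/8, 1/16, 1/32 scales stated for ResNet-101 above.
input_size = 224
strides = {"C1": 2, "C2": 4, "C3": 8, "C4": 16, "C5": 32}
channels = {"C1": 64, "C2": 256, "C3": 512, "C4": 1024, "C5": 2048}

shapes = {name: (channels[name], input_size // s, input_size // s)
          for name, s in strides.items()}
print(shapes["C5"])  # (2048, 7, 7)
```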
Preferably, step (5) constructs the feature space pyramid P0 from the feature space pyramid C:
(i) Reducing the dimension of the feature map C5 with a 1×1×256 convolution kernel to obtain P0_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×256 convolution kernel to obtain C4', then up-sampling the feature map P0_5 by linear interpolation to obtain P0_5'; at this point P0_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×256 convolution kernel to obtain C3', then up-sampling P0_4 by linear interpolation to obtain P0_4'; at this point C3' and P0_4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_3;
(iv) Reducing the dimension of the feature map C2 with a 1×1×256 convolution kernel to obtain C2', then up-sampling P0_3 by linear interpolation to obtain P0_3'; at this point C2' and P0_3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_2.
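The top-down construction of P0 can be sketched in numpy (a minimal illustration, assuming random projection weights, 2× nearest-neighbour up-sampling in place of linear interpolation, and element-wise addition as the fusion unit; none of these choices are fixed by the original):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """1x1 convolution = per-pixel linear projection of channels (random weights)."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """2x nearest-neighbour up-sampling (stand-in for linear interpolation)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Backbone maps C2..C5 for an assumed 224x224 input: (channels, H, W).
C = {2: rng.standard_normal((256, 56, 56)),
     3: rng.standard_normal((512, 28, 28)),
     4: rng.standard_normal((1024, 14, 14)),
     5: rng.standard_normal((2048, 7, 7))}

# Steps (i)-(iv): project each level to 256 channels, up-sample the coarser
# map, and fuse (assumed here: element-wise addition).
P0 = {5: conv1x1(C[5], 256)}
for lvl in (4, 3, 2):
    lateral = conv1x1(C[lvl], 256)               # C{lvl}'
    P0[lvl] = lateral + upsample2x(P0[lvl + 1])  # fusion unit (assumed: add)

print(P0[2].shape)  # (256, 56, 56)
```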
Preferably, the step (6) establishes the feature pyramid P1, which is specifically implemented as follows:
(i) Reducing the dimension of the feature map C5 with a 1×1×512 convolution kernel to obtain P1_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×512 convolution kernel to obtain C4', then up-sampling the feature map P1_5 by linear interpolation to obtain P1_5'; at this point P1_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×512 convolution kernel to obtain C3', then up-sampling P1_4 by linear interpolation to obtain P1_4'; at this point P1_4' and C3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_3.
Preferably, the step (7) establishes a feature pyramid P2, which is specifically implemented as follows:
(i) Reducing the dimension of the feature map C5 with a 1×1×1024 convolution kernel to obtain P2_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×1024 convolution kernel to obtain C4', then up-sampling the feature map P2_5 by linear interpolation to obtain P2_5'; at this point P2_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P2_4;
Preferably, step (8) establishes the feature pyramid P3, which is specifically implemented as follows: reducing the dimension of the feature map C5 with a 1×1×2048 convolution kernel to obtain P3_5;
Preferably, step (9) establishes the final feature pyramid, which is specifically implemented as follows: within each pyramid, the bottom-layer feature map fully fuses the feature semantics of the upper-layer feature maps with its own position information; the final feature space pyramid P is therefore assembled from the lowest-level feature map of each pyramid.
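The assembled pyramid P can be sketched as follows (assuming a 224×224 input; the shapes follow from the 1×1 projection widths of steps (5) to (8) and the 1/4 to 1/32 backbone scales):

```python
# Final pyramid P: the lowest (strongest-semantics) level of each channel.
# Shapes are (channels, H, W) for an assumed 224x224 input.
P = {
    "P0_2": (256, 56, 56),    # from pyramid P0, scale 1/4
    "P1_3": (512, 28, 28),    # from pyramid P1, scale 1/8
    "P2_4": (1024, 14, 14),   # from pyramid P2, scale 1/16
    "P3_5": (2048, 7, 7),     # from pyramid P3, scale 1/32
}

# Consistency check: each level halves spatial size and doubles channel width.
levels = list(P.values())
for (c0, h0, w0), (c1, h1, w1) in zip(levels, levels[1:]):
    assert c1 == 2 * c0 and h1 == h0 // 2 and w1 == w0 // 2
print("pyramid P is consistent")
```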
Preferably, the final feature space pyramid P is fed into a convolutional neural network for subsequent detection and localization. The specific implementation is as follows: a classifier is applied to the fused feature maps to detect the targets contained in the picture (e.g. people, automobiles, birds, planes), and a locator is then used to obtain the coordinate positions of the targets in the picture.
Preferably, the convolutional neural network is one or more of ResNet-101, ResNet-50, VGGNet or GoogLeNet.
By adopting the above technical scheme, the invention has the following advantages: the method constructs several feature space pyramids over different channels using a cascading idea, then integrates the feature map with the strongest semantics from each pyramid into a single feature space pyramid, thereby enhancing both semantic features and position information and ultimately improving the accuracy of target classification and localization. The proposed multi-channel feature space pyramid construction method shows good robustness and detection capability for multi-scale target detection.
Drawings
FIG. 1 is a flow chart of an overall implementation of the present invention.
Fig. 2 is a block diagram of a convolutional neural network used in the present invention.
Fig. 3 is a diagram of a weight calculation method used in the present invention.
Fig. 4 is a picture before detection with the present method.
Fig. 5 is the same picture after detection with the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Examples:
As shown in fig. 1-5, the present embodiment is a method for constructing a multi-channel feature space pyramid, which is implemented according to the following steps:
Step (1), inputting a picture to be detected;
Step (2), repeating the steps (3) to (10) aiming at the multi-scale target detection task, and continuing training the neural network model until the loss function of the network reaches a convergence state;
Step (3), selecting a deep-learning-based convolutional neural network to extract picture features (in this embodiment the deep convolutional neural network ResNet-101 is selected; other convolutional neural networks, such as ResNet-50, VGGNet or GoogLeNet, may also be selected);
Step (4), marking the last-layer feature map of each group of convolutional blocks of the convolutional neural network ResNet-101 as C1, C2, C3, C4 and C5. Because C1 occupies a large amount of GPU (graphics processing unit) memory, only C2, C3, C4 and C5 are used to construct the four-layer feature space pyramid C in this method;
Step (5), on the basis of the feature space pyramid C, convolving layers C2, C3, C4 and C5 with a convolution kernel of size 1×1×256 and enhancing the feature representation with a feature fusion unit, obtaining an enhanced feature space pyramid P0;
Step (6), on the basis of the feature space pyramid C, convolving layers C3, C4 and C5 with a convolution kernel of size 1×1×512 and enhancing the feature representation with a feature fusion unit, obtaining an enhanced feature space pyramid P1;
Step (7), on the basis of the feature space pyramid C, convolving layers C4 and C5 with a convolution kernel of size 1×1×1024 and enhancing the feature representation with a feature fusion unit, obtaining an enhanced feature space pyramid P2;
Step (8), on the basis of the feature space pyramid C, convolving layer C5 with a convolution kernel of size 1×1×2048 to obtain a feature space pyramid P3;
Step (9), integrating the layer with the strongest semantic information from each of the four fusion-enhanced feature space pyramids P0, P1, P2 and P3 into a final feature space pyramid P;
Step (10), sending the feature space pyramid P into a target detection network to detect the category and position of the target.
The deep-learning convolutional neural network model adopted in steps (3) and (4) of the construction method of the multichannel feature space pyramid is ResNet-101, concretely realized as follows:
First, a deep convolutional neural network ResNet-101 is used, containing 101 convolutional layers arranged in 4 groups of convolutional blocks. Its structure is shown in fig. 2:
(i) After the input picture enters the deep convolutional neural network ResNet-101, features are extracted in the first layer of the network with a convolution kernel of size 7×7×64, followed by a 3×3 max-pooling layer, yielding a 64-dimensional feature map C1 at 1/2 the size of the original image;
(ii) Further, in layers 2-10 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×64, 3×3×64 and 1×1×256, is applied to the feature map C1, the convolution operation being repeated 3 times, yielding a 256-dimensional feature map C2 at 1/4 the size of the original image;
(iii) Further, in layers 11-22 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×128, 3×3×128 and 1×1×512, is applied to the feature map C2, the convolution operation being repeated 4 times, yielding a feature map C3 with dimension 512 at 1/8 the size of the original image;
(iv) Further, in layers 23-91 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×256, 3×3×256 and 1×1×1024, is applied to the feature map C3, the convolution operation being repeated 23 times, yielding a feature map C4 with dimension 1024 at 1/16 the size of the original image;
(v) Further, in layers 92-100 of the deep convolutional neural network ResNet-101, a group of three convolution kernels, 1×1×512, 3×3×512 and 1×1×2048, is applied to the feature map C4, the convolution operation being repeated 3 times, yielding a feature map C5 with dimension 2048 at 1/32 the size of the original image;
The construction method of the multi-channel feature space pyramid in this embodiment includes the following step (5), constructing the feature space pyramid P0 from the feature space pyramid C:
In the feature space pyramid C obtained from the deep convolutional neural network ResNet-101, the semantic information of each layer differs: C2 has the smallest dimension and the shallowest semantic features, but contains the most target position information. To balance semantic features and position information, a feature fusion unit is used to augment the feature space pyramid C. Because the fusion unit requires its two input feature maps to agree in dimension and size, it is preceded by a lateral convolution for dimension reduction and a vertical 2× up-sampling step: the lateral convolution aligns the dimensions, and the up-sampling makes the spatial sizes consistent. The specific steps are as follows:
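The fusion unit itself can be sketched as follows (a minimal illustration: the original refers to a weight calculation method in fig. 3 whose details are not reproduced here, so a weighted element-wise sum with fixed scalar weights is assumed; the function name and weights are hypothetical):

```python
import numpy as np

def fuse(lateral, top_down, w_lateral=0.5, w_top=0.5):
    """Assumed fusion unit: weighted element-wise sum of two aligned maps.

    Both inputs must already have the same (channels, H, W) shape, i.e. the
    lateral 1x1 convolution and the 2x up-sampling have been applied.
    """
    assert lateral.shape == top_down.shape, "fusion inputs must be aligned"
    return w_lateral * lateral + w_top * top_down

# Example: fusing C4' with the up-sampled P0_5' (both 256 x 14 x 14).
rng = np.random.default_rng(1)
c4_prime = rng.standard_normal((256, 14, 14))
p05_prime = rng.standard_normal((256, 14, 14))
p04 = fuse(c4_prime, p05_prime)
print(p04.shape)  # (256, 14, 14)
```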
(i) Reducing the dimension of the feature map C5 with a 1×1×256 convolution kernel to obtain P0_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×256 convolution kernel to obtain C4', then up-sampling the feature map P0_5 by linear interpolation to obtain P0_5'; at this point P0_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×256 convolution kernel to obtain C3', then up-sampling P0_4 by linear interpolation to obtain P0_4'; at this point C3' and P0_4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_3;
(iv) Reducing the dimension of the feature map C2 with a 1×1×256 convolution kernel to obtain C2', then up-sampling P0_3 by linear interpolation to obtain P0_3'; at this point C2' and P0_3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P0_2;
The construction method of the multi-channel feature space pyramid in this embodiment includes the following step (6), establishing the feature pyramid P1:
(i) Reducing the dimension of the feature map C5 with a 1×1×512 convolution kernel to obtain P1_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×512 convolution kernel to obtain C4', then up-sampling the feature map P1_5 by linear interpolation to obtain P1_5'; at this point P1_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_4;
(iii) Reducing the dimension of the feature map C3 with a 1×1×512 convolution kernel to obtain C3', then up-sampling P1_4 by linear interpolation to obtain P1_4'; at this point P1_4' and C3' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P1_3;
The construction method of the multi-channel feature space pyramid in this embodiment includes the following step (7), establishing the feature pyramid P2:
(i) Reducing the dimension of the feature map C5 with a 1×1×1024 convolution kernel to obtain P2_5;
(ii) Reducing the dimension of the feature map C4 with a 1×1×1024 convolution kernel to obtain C4', then up-sampling the feature map P2_5 by linear interpolation to obtain P2_5'; at this point P2_5' and C4' have the same size and dimension, and the two feature maps are input into a feature fusion unit to obtain the feature map P2_4;
The construction method of the multi-channel feature space pyramid in this embodiment includes the following step (8), establishing the feature pyramid P3: reducing the dimension of the feature map C5 with a 1×1×2048 convolution kernel to obtain P3_5;
The construction method of the multi-channel feature space pyramid in this embodiment includes the following step (9), establishing the final feature pyramid: within each pyramid, the bottom-layer feature map fully fuses the feature semantics of the upper-layer feature maps with its own position information. The final feature space pyramid P is therefore assembled from the lowest-level feature map of each pyramid.
In step (10) of the construction method of the multi-channel feature space pyramid of this embodiment, the feature maps in the final feature space pyramid P are fed into a convolutional neural network for detection and localization. The specific implementation is as follows: a classifier is applied to the fused feature maps to detect the targets contained in the picture (e.g. people, automobiles, birds, planes), and a locator is then used to obtain the coordinate positions of the targets in the picture.
As can easily be seen by comparing fig. 4 and fig. 5, the improvement in detecting the small-scale targets in the picture is remarkable.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.