CN114612709A - Multi-scale target detection method guided by image pyramid characteristics - Google Patents

Multi-scale target detection method guided by image pyramid characteristics

Info

Publication number
CN114612709A
Authority
CN
China
Prior art keywords
features
image
network
convolution
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210185676.7A
Other languages
Chinese (zh)
Inventor
陈苏婷
马文妍
张艳艳
张闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202210185676.7A priority Critical patent/CN114612709A/en
Publication of CN114612709A publication Critical patent/CN114612709A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention discloses a multi-scale target detection method guided by image pyramid features, which comprises the following steps: S1, taking a color image as the network input, using an FPN as the target detection framework, and extracting image features with an ordered down-sampling method; S2, taking the same color image as input, and extracting the position information and detail features of each level of the image pyramid with the constructed double-bottleneck convolutional network; S3, inputting the image features of each level extracted in step S2, together with the corresponding deep features of the backbone network, into the constructed hierarchical feature fusion module to fuse the high-resolution weak-semantic features with the low-resolution strong-semantic features; and S4, introducing a reconstructed loss function based on Focal loss to complete target detection. The invention not only strengthens spatial position information but also avoids losing a large amount of detail information during down-sampling, thereby improving the ability of the target detection network to discriminate small targets and adjacent targets.

Description

Multi-scale target detection method guided by image pyramid characteristics
Technical Field
The invention relates to a multi-scale target detection method, in particular to a multi-scale target detection method guided by image pyramid characteristics.
Background
The target detection task is to accurately predict a category and a coordinate position for each target in a natural image. It is widely applied in many fields, from autonomous driving and smart communities to video surveillance, and therefore has great research value. However, targets of different classes may have similar appearances and sizes, while targets of the same class may differ greatly in appearance and size. Backgrounds are complex and varied, and targets occlude one another; these factors make target detection one of the most challenging tasks in computer vision. Traditional target detection methods obtain image features through hand-crafted feature extraction operators, whose weak representation ability and poor generalization limit their further development.
In recent years, the advent of deep learning has greatly promoted the development of computer vision. Target detection algorithms based on convolutional neural networks (CNNs) extract target features from an image through convolution operations, so the network obtains deeper and more discriminative features and better handles complex conditions such as occlusion, deformation and illumination change. Current CNN-based target detection algorithms can be divided into two broad categories: two-stage algorithms based on region proposals, and one-stage algorithms based on regression.
However, target detection algorithms still have many shortcomings, especially in multi-scale target detection. Existing methods cannot reliably detect the same target at different scales, in particular small targets and adjacent targets, whose semantic information vanishes as the network deepens.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to provide a multi-scale target detection method guided by image pyramid features, which improves the discrimination of small targets and adjacent targets, so as to solve the problem that high-level feature maps in multi-scale target detection tend to confuse the features of adjacent targets and to overlook small targets.
The technical scheme is as follows: the multi-scale target detection method comprises the following steps:
S1, taking a color image as the network input, using an FPN with a ResNet-101 backbone network as the target detection framework, and extracting image features with an ordered down-sampling method;
S2, taking the same color image as in step S1 as input, and extracting the position information and detail features of each level of the image pyramid with the constructed double-bottleneck convolutional network;
S3, inputting the image features of each level extracted in step S2, together with the corresponding deep features of the backbone network, into the constructed hierarchical feature fusion module to fuse the high-resolution weak-semantic features with the low-resolution strong-semantic features;
and S4, introducing a reconstructed loss function based on Focal loss, training the multiple tasks, and completing target detection.
In step S1, the ordered down-sampling method is implemented as follows:
S11, sliding a window with a set step length over the feature map of each sampling layer of the convolutional neural network, sorting the values inside the window in ascending order, and extracting the four values in turn to generate four new feature maps; the width and height of each new feature map are half those of the original feature map, and the output of the ordered down-sampling method is:

F_l^j = M_j(F_l), j = 1, 2, 3, 4

where F_l ∈ R^(W×H×D) denotes the feature map of a sampling layer of the convolutional neural network, W, H and D denote its width, height and number of channels, and l is the hierarchical index of the sampling layer; M_j(·) denotes the operation of extracting the j-th value in the sliding window, four values being extracted in turn from each window; F_l^j denotes the j-th new feature map output by the l-th down-sampling layer, four new feature maps being generated for each down-sampling layer;
S12, the four new feature maps are concatenated and fed into a small convolutional network for feature refinement and channel adjustment; the final output feature map F_l' ∈ R^(W'×H'×D') serves as the input of the next layer of the backbone network, where W', H' and D' denote its width, height and number of channels.
In step S2, the double-bottleneck convolutional network is constructed as follows:
S21, defining the input of the double-bottleneck convolutional network as:

I = {I_i}, i = 0, 1, …, n−1

where I_i ∈ R^(H*×W*×3) denotes an image of height H* and width W*, which is also an input image of the target detection model; i is the hierarchical index of both the image pyramid and the backbone network;
S22, inputting the i-th level image of the image pyramid into the double-bottleneck convolutional network, and extracting the surface-layer edge features of the image through a 5×5 convolution kernel followed by a 3×3 convolution kernel;
S23, inputting the extracted edge features into a residual network unit with 2 bottleneck structures to extract detail features, and using a side connection with a 1×1 convolution kernel to pass the precisely localized edge information on to the extracted texture detail features;
each bottleneck structure is composed of 2 1×1 convolution kernels, used respectively for dimensionality reduction and dimensionality expansion of the feature map channels, and 2 3×3 convolution kernels used for learning shallow features;
S24, obtaining a feature map with the same scale as the corresponding backbone network level as the output of the residual network unit;
S25, taking the images of the different scales as input, defining the output of the double-bottleneck convolutional network as:

P_i = h_i(I_i), i = 0, 1, …, n−1
P = {P_0, P_1, …, P_{n−1}}

where P_i denotes the features extracted from the i-th level image of the image pyramid, and P denotes the set of features extracted from all levels of the image pyramid.
In step S3, the hierarchical feature fusion module is a feature fusion module based on element-wise addition; its output is defined as:

O_1i = f_3{ BN[f_1(T(h_i(I_i)))] + BN[f_2(g_i(I_0))] }

where f_1(·) and f_2(·) are two 3×3 convolution units used to parameterize the feature maps, and f_3(·) is a 1×1 convolution unit used as a linear transformation of the feature map; BN[·] is the batch normalization operation on the convolution features; T(·) denotes a bilinear interpolation operation along the channel dimension, used to adjust the channel dimensions of the two different types of features; h(·) and g(·) are the output feature map of the double-bottleneck convolutional network and the feature map of the backbone network, respectively, and i is the hierarchical index of the image pyramid and the backbone network; I_0 and I_i denote the original image and the i-th level image of the image pyramid, respectively.
In step S4, the classification loss function is as follows:

L_cls(p, p*) = −α_t (1−p)^γ log(p),  if p* = 1
L_cls(p, p*) = −(1−α_t) p^γ log(1−p),  if p* = −1

where p and p* are the sample prediction and the sample ground truth, respectively; α_t ∈ (0, 1) is a weight factor introduced for class 1 and 1−α_t is the weight factor introduced for class −1; (1−p)^γ is the modulation factor;
the position regression loss term is expressed as:

L_reg(t, t*) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − t_i*)

where L_reg(t, t*) is expressed by the smoothed L1 loss; t = {x, y, w, h} denotes the predicted bounding box position, where {x, y} are the center coordinates and {w, h} the width and height of the bounding box; t* is the sample label of t;
the target detection loss function is expressed as:

L = (1/N_cls) Σ_w Σ_k L_cls(p_k, p_k*) + λ (1/N_reg) Σ_w Σ_k L_reg(t_k, t_k*)

where N_cls is the number of classified samples and N_reg is the number of regressed samples; w is the number of batches of training images; k is the index of a single sample within each batch of training samples; λ is the loss balance term.
Compared with the prior art, the invention has the following remarkable effects:
1. A double-bottleneck convolutional network is used to extract the shallow features of each level of the multi-scale image pyramid; a hierarchical feature fusion module introduces high-resolution shallow features, rich in detail and position information, into the target detection model; a new ordered down-sampling method preserves the large amount of detail information lost by the original down-sampling; and the reconstructed loss function alleviates the imbalance between foreground and background classification;
2. The invention makes full use of the shallow high-resolution features of the multi-scale image pyramid while avoiding the loss of detail information during down-sampling, thereby improving the discrimination of small and adjacent targets by the target detection network and further improving model performance.
Drawings
FIG. 1 is a schematic diagram of the double-bottleneck feature pyramid network (DBFP-Net) of the present invention;
FIG. 2 is a schematic diagram of the ordered down-sampling method of the present invention;
FIG. 3 is a schematic diagram of an implementation of the ordered down-sampling method of the present invention;
FIG. 4 is a framework diagram of the double-bottleneck convolutional network of the present invention;
FIG. 5(a) is a schematic diagram of the feature fusion module based on element-wise addition in the present invention, (b) of the feature fusion module based on element-wise multiplication, and (c) of the feature fusion module based on feature map concatenation;
FIG. 6 shows partial detection results of DBFP-Net on MS COCO.
Detailed Description
The invention is described in further detail below with reference to the drawings and the detailed description.
FIG. 1 is a schematic diagram of the double-bottleneck feature pyramid network (DBFP-Net). First, an FPN (Feature Pyramid Network) is used as the object detection framework and ResNet-101 as its backbone to extract deep features; a color image is input, and a new ordered down-sampling feature enhancement method replaces the original down-sampling in the backbone network, retaining all values inside the sliding window during down-sampling and alleviating the loss of detail information in the deep features. Then, a double-bottleneck convolutional network is constructed to extract the image features of each level of the image pyramid, and a hierarchical feature fusion module is constructed to fuse the high-resolution, weak-semantic features extracted by the double-bottleneck convolutional network with the low-resolution, strong-semantic features of the deep layers of the backbone network, so that the shallow features and additional spatial information extracted from the image pyramid are provided to the target detection network. Finally, a reconstructed loss function based on Focal loss is introduced to alleviate the imbalance between foreground and background classification.
The method takes the FPN as the target detection framework and ResNet-101 as the backbone network to extract deep features; a color image is input, and the original down-sampling method in the backbone network is replaced by the new ordered down-sampling feature enhancement method, so that all values inside the sliding window are retained during down-sampling and the loss of detail information in the deep features is alleviated.
(1) Ordered down-sampling method
In the successive down-sampling operations of a CNN (convolutional neural network), max pooling is commonly used, and the newly generated feature map is smaller than the original one, so a large amount of information of the original feature map is lost. This is very disadvantageous for the multi-scale object detection task, which needs abundant feature information to identify objects of many types. Therefore, as shown in FIG. 2, the ordered down-sampling feature enhancement method proposed by the invention slides a 2×2 window with a step length of 2 over the feature map of each sampling layer of the CNN, sorts the values inside the window in ascending order, and extracts the four values in turn, thereby retaining all values in the window. Four new feature maps are finally generated; their width and height are half those of the original feature map, and they lose none of its information. The output of the ordered down-sampling method is defined as:
F_l^j = M_j(F_l), j = 1, 2, 3, 4

where F_l ∈ R^(W×H×D) denotes the feature map of a sampling layer of the convolutional neural network (CNN), W, H and D denote its width, height and number of channels, and l is the hierarchical index of the CNN sampling layer; M_j(·) denotes the operation by which the ordered down-sampling method extracts the j-th value in the sliding window, four values being extracted in turn from each window; F_l^j denotes the j-th new feature map output by the l-th down-sampling layer, four new feature maps being generated for each down-sampling layer.
A specific implementation of the ordered down-sampling method is shown in FIG. 3. The four new feature maps are concatenated and fed into a small convolutional network for feature refinement and channel adjustment; to stay consistent with the output of the original sampling layer of the backbone network, the small convolutional network consists of a 1×1 convolution kernel and a 3×3 convolution kernel. The final output feature map F_l' ∈ R^(W'×H'×D') serves as the input of the next layer of the backbone network, where W', H' and D' denote its width, height and number of channels.
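As an illustration only, the following PyTorch-style sketch implements the ordered down-sampling idea described above. The class name, the channel widths and the exact layers of the small refinement network are assumptions for this sketch; only the sorting-and-splitting scheme follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OrderedDownsample(nn.Module):
    """Ordered (sorting) down-sampling over non-overlapping 2x2 windows with stride 2."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Small convolutional network for feature refinement and channel adjustment
        # (described above as one 1x1 and one 3x3 convolution kernel).
        self.refine = nn.Sequential(
            nn.Conv2d(4 * in_channels, out_channels, kernel_size=1),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # x: (N, D, H, W) feature map of a sampling layer; H and W assumed even.
        n, d, h, w = x.shape
        # Gather the four values of every non-overlapping 2x2 window.
        patches = F.unfold(x, kernel_size=2, stride=2)        # (N, D*4, H/2 * W/2)
        patches = patches.view(n, d, 4, h // 2, w // 2)
        # Sort the four values of each window in ascending order.
        ordered, _ = torch.sort(patches, dim=2)
        # The j-th sorted value over all windows forms the j-th new feature map F_l^j.
        new_maps = [ordered[:, :, j] for j in range(4)]        # each (N, D, H/2, W/2)
        # Concatenate the four maps, then refine and adjust channels.
        return self.refine(torch.cat(new_maps, dim=1))         # (N, D', H/2, W/2)
```

In the backbone, such a module would take the place of a stride-2 pooling layer, so that no value of the 2×2 window is discarded.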
(2) Double-bottleneck convolutional network
A double-bottleneck convolutional network is constructed to extract the image features of each level of the image pyramid, yielding feature maps of different scales that are rich in spatial information. Because the double-bottleneck convolutional network is a lightweight network whose parameters are shared across all levels of the image pyramid, the added computational complexity and parameter storage are negligible compared with the backbone network.
FIG. 4 is a framework diagram of the double-bottleneck convolutional network. Its input is a simple multi-scale image pyramid: a set of images of progressively lower resolution, arranged in a pyramid, obtained by successively down-sampling the same input image. The input of the double-bottleneck convolutional network is defined as:

I = {I_i}, i = 0, 1, …, n−1

where I_i ∈ R^(H*×W*×3) denotes an image of resolution H*×W* (height H*, width W*), which is also an input image of the backbone network; i is the hierarchical index of both the image pyramid and the backbone network. First, the i-th level image of the image pyramid is fed into the double-bottleneck convolutional network, and the surface-layer edge features of the image are extracted through a 5×5 convolution kernel followed by a 3×3 convolution kernel. Then, the extracted edge features are fed into a residual network unit with 2 bottleneck structures to extract detail features, and a side connection with a 1×1 convolution kernel transmits the precisely localized edge information to the extracted texture detail features. Each bottleneck structure is composed of 2 1×1 convolution kernels, used respectively for dimensionality reduction and dimensionality expansion of the feature map channels, and 2 3×3 convolution kernels used for learning shallow features. Finally, the output of the residual network unit is a feature map with the same scale as the corresponding backbone network level. Taking the images of different scales as input, the output of the double-bottleneck convolutional network is defined as:

P_i = h_i(I_i), i = 0, 1, …, n−1
P = {P_0, P_1, …, P_{n−1}}

where P_i denotes the features extracted from the i-th level image of the image pyramid, and P denotes the set of features extracted from all levels of the image pyramid.
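The following sketch illustrates one possible reading of this sub-network. Class names, channel widths, activation choices, the ordering of the convolutions inside the bottleneck, and the omission of any strides needed to match each backbone level's resolution are all assumptions, not taken from the patent.

```python
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """One bottleneck structure: 1x1 reduce, two 3x3 convs, 1x1 expand, residual add."""

    def __init__(self, channels, mid_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, 1),
        )

    def forward(self, x):
        return F.relu(x + self.body(x))


class DoubleBottleneckNet(nn.Module):
    """Lightweight sub-network shared by all image pyramid levels I_i."""

    def __init__(self, channels=64):
        super().__init__()
        # Surface-layer edge features: a 5x5 followed by a 3x3 convolution kernel.
        self.edge = nn.Sequential(
            nn.Conv2d(3, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Residual unit with two bottleneck structures for texture/detail features.
        self.detail = nn.Sequential(
            Bottleneck(channels, channels // 2),
            Bottleneck(channels, channels // 2),
        )
        # 1x1 side connection carrying the precisely localised edge information.
        self.side = nn.Conv2d(channels, channels, 1)

    def forward(self, image):
        # image: (N, 3, H*, W*), one level of the image pyramid.
        edge = self.edge(image)
        # P_i: detail features combined with the side-connected edge information.
        return self.detail(edge) + self.side(edge)
```

Because the same module weights are applied to every pyramid level, the extra parameter cost stays small regardless of how many levels the pyramid has.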
(3) Hierarchical feature fusion module
A hierarchical feature fusion module is constructed to fuse the high-resolution, weak-semantic features extracted by the double-bottleneck convolutional network with the low-resolution, strong-semantic features of the deep layers of the backbone network; the fused features are passed to the feature map of the corresponding FPN level, so that the shallow features and additional spatial information extracted from the image pyramid are provided to the target detection network. Three fusion methods are designed, as shown in FIG. 5 (a), (b) and (c), and their outputs are collectively defined as:

O_i = Φ_i(h_i(I_i), g_i(I_0))

where h(·) and g(·) are the output feature map of the double-bottleneck convolutional network and the feature map of the backbone network, respectively, i is the hierarchical index of the image pyramid and the backbone network, and O_i is the output of the feature fusion module at the i-th level; I_0 and I_i denote the original image and the i-th level image of the image pyramid, respectively; Φ_i(·,·) is the fusion equation of the feature fusion module. In general, if the image pyramid contains (n−1) down-sampled images, the number of levels of the image pyramid is n.
The feature fusion module based on element-wise addition is shown in FIG. 5(a): the output features h_i(I_i) of the double-bottleneck convolutional network are added to the backbone network features g_i(I_0). Because the double-bottleneck convolutional network shares its parameters across levels, the channel dimensions of these two different types of features must be adjusted; this is done with a bilinear interpolation operation T(·) along the channel dimension. The output O_1i of the element-wise addition is then defined as:

O_1i = f_3{ BN[f_1(T(h_i(I_i)))] + BN[f_2(g_i(I_0))] }

where f_1(·) and f_2(·) are two 3×3 convolution units used to parameterize the feature maps, f_3(·) is a 1×1 convolution unit used as a linear transformation of the feature map, and BN[·] is the batch normalization operation on the convolution features.
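A minimal sketch of this addition-based fusion follows. The names channel_interpolate and AddFusion are hypothetical, the composition of the fusion equation is inferred from the symbol descriptions above, and 1-D linear interpolation over the channel axis is used as one reading of the "channel-dimension bilinear interpolation" T(·).

```python
import torch.nn as nn
import torch.nn.functional as F


def channel_interpolate(x, out_channels):
    """T(.): adjust the channel count of x by interpolating along the channel axis."""
    n, c, h, w = x.shape
    flat = x.flatten(2).transpose(1, 2)                       # (N, H*W, C)
    flat = F.interpolate(flat, size=out_channels, mode="linear", align_corners=False)
    return flat.transpose(1, 2).reshape(n, out_channels, h, w)


class AddFusion(nn.Module):
    """Element-wise addition fusion: O_1i = f3(BN[f1(T(h_i(I_i)))] + BN[f2(g_i(I_0))])."""

    def __init__(self, channels):
        super().__init__()
        self.f1 = nn.Conv2d(channels, channels, 3, padding=1)   # 3x3 conv, shallow branch
        self.f2 = nn.Conv2d(channels, channels, 3, padding=1)   # 3x3 conv, backbone branch
        self.f3 = nn.Conv2d(channels, channels, 1)              # 1x1 linear transformation
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, shallow, backbone):
        # shallow = h_i(I_i) from the double-bottleneck network,
        # backbone = g_i(I_0); both are assumed to share the same spatial size.
        shallow = channel_interpolate(shallow, backbone.shape[1])
        fused = self.bn1(self.f1(shallow)) + self.bn2(self.f2(backbone))
        return self.f3(fused)                                   # O_1i, fed to the FPN level
```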
The feature fusion module based on element-wise multiplication is shown in FIG. 5(b): the output features h_i(I_i) of the double-bottleneck convolutional network are multiplied with the backbone network features g_i(I_0), the product is refined by a 1×1 convolution kernel, and the refined features are finally added back to the corresponding backbone network features through a side connection. The output O_2i of the element-wise multiplication is defined as:

O_2i = f_4[T(h_i(I_i)) ⊙ g_i(I_0)] + f_5[g_i(I_0)]

where f_4(·) is a 1×1 convolution unit used for feature refinement and f_5(·) is a 1×1 convolution unit used as a linear transformation of the feature map.
The feature fusion module based on feature map concatenation, shown in FIG. 5(c), is similar to the fusion used in U-Net. The invention also defines a feature map concatenation method, whose output O_3i is:

O_3i = concat[T(h_i(I_i)), g_i(I_0)]

where concat(·) denotes the feature map concatenation operation.
The above methods demonstrate the flexibility of the hierarchical feature fusion module. Different fusion methods may be selected as part of the model of the invention.
To compare the effectiveness of three different fusion methods, experiments implemented different feature fusion methods in the same target detection model. The comparison results are shown in table 1.
TABLE 1 hierarchical feature fusion method comparison
(Table 1 is reproduced as an image in the original publication; it reports AP, AP_S, AP_M and AP_L for the three fusion methods.)
The feature fusion method based on element-wise addition obtains 44.7 AP; the methods based on element-wise multiplication and feature map concatenation obtain 42.9 AP and 43.6 AP, respectively. In small target detection, the three feature fusion methods achieve similar results. However, in medium- and large-scale target detection, the model configured with the element-wise addition fusion improves by 4.1% and 3.6%, respectively, over the model configured with the element-wise multiplication fusion. Table 1 shows that the element-wise addition method obtains better detection results than the other two methods under the AP_S, AP_M and AP_L evaluation criteria. Therefore, the invention selects the feature fusion method based on element-wise addition.
(4) Loss function
To alleviate the extreme imbalance between foreground and background, a reconstructed loss function based on Focal loss is introduced. The classification loss function is as follows:

L_cls(p, p*) = −α_t (1−p)^γ log(p),  if p* = 1
L_cls(p, p*) = −(1−α_t) p^γ log(1−p),  if p* = −1

where p and p* are the sample prediction and the sample ground truth, respectively. α_t ∈ (0, 1) is a weight factor introduced for class 1 and 1−α_t is the weight factor introduced for class −1, which adjusts the weighting between foreground and background classification. (1−p)^γ is a modulation factor that automatically down-weights easy examples during training and quickly focuses the model on hard examples; that is, the higher the confidence of a sample's prediction, the smaller its contribution to the overall loss. As in Focal loss, γ is empirically set to 2 and α_t to 0.25.
The position regression loss term is expressed as:

L_reg(t, t*) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − t_i*)

where L_reg(t, t*) is represented by the smoothed L1 loss; t = {x, y, w, h} denotes the predicted bounding box position, where {x, y} are the center coordinates and {w, h} the width and height of the bounding box; t* is the sample label of t.
The target detection loss function used by the invention can be expressed as:

L = (1/N_cls) Σ_w Σ_k L_cls(p_k, p_k*) + λ (1/N_reg) Σ_w Σ_k L_reg(t_k, t_k*)

where N_cls is the number of classified samples and N_reg is the number of regressed samples; w is the number of batches of training images and k is the index of a single sample within each batch of training samples; λ is the loss balance term, empirically set to 2.
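A small sketch of these loss terms is given below. The function names and the exact normalisation by N_cls and N_reg are assumed readings of the description; only γ = 2, α_t = 0.25 and λ = 2 come from the text, and the focal term follows the standard Focal-loss form implied above.

```python
import torch
import torch.nn.functional as F


def focal_classification_loss(p, target, alpha_t=0.25, gamma=2.0):
    """Per-sample focal classification loss: alpha_t weights the positive class
    (p* = 1), 1 - alpha_t weights the negative class (p* = -1)."""
    p = p.clamp(1e-6, 1.0 - 1e-6)
    pos = -alpha_t * (1.0 - p) ** gamma * torch.log(p)          # p* = 1 term
    neg = -(1.0 - alpha_t) * p ** gamma * torch.log(1.0 - p)    # p* = -1 term
    return torch.where(target == 1, pos, neg)


def detection_loss(p, p_star, t, t_star, lam=2.0):
    """Multi-task loss: normalised focal classification term plus a smooth-L1
    box regression term computed on positive samples, balanced by lambda."""
    n_cls = p.numel()
    pos = p_star == 1
    n_reg = max(int(pos.sum()), 1)
    cls_term = focal_classification_loss(p, p_star).sum() / n_cls
    reg_term = F.smooth_l1_loss(t[pos], t_star[pos], reduction="sum") / n_reg
    return cls_term + lam * reg_term
```

Here p would typically be the sigmoid output of the classification head and t the predicted box parameters {x, y, w, h} for the same samples.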
To further validate the performance of the proposed model, the accuracy of DBFP-Net is compared with existing target detection algorithms on the MS COCO dataset, as shown in Table 2.
Table 2 Comparison with existing models on the MS COCO dataset
(Table 2 is reproduced as an image in the original publication; it compares DBFP-Net with the existing models discussed below.)
It can be seen that the lowest AP value comes from the SSD model, only 28.8%. The SSD constructs a feature pyramid structure based on R-FCN and predicts targets at multiple levels without combining features or scores, which improves detection speed and accuracy, but its recall for small targets is still not ideal. The FPN uses an encoding-decoding network structure with lateral connections to associate low-level feature maps across resolutions and semantic levels, provides high-level semantic information to shallow network features, and improves the robustness of the detection model to multi-scale targets. By introducing the shallow features of the multi-scale image pyramid on top of the FPN, the proposed method compensates for the information lost by the backbone network and exceeds the FPN by 8.5% in AP. RetinaNet and the proposed image-pyramid-feature-guided multi-scale target detection model are both based on the FPN framework and use Focal loss as the loss function, but DBFP-Net clearly improves on RetinaNet by 5.6% (44.7 AP vs. 39.1 AP). Compared with M2Det, the best-performing detection model based on the feature pyramid method, the proposed method still leads by 3.7% (44.7 AP vs. 41.0 AP). Compared with the recently popular anchor-free target detection methods, DBFP-Net exceeds CornerNet, FCOS and FSAF by 4.2% (44.7 AP vs. 40.5 AP), 2.6% (44.7 AP vs. 42.1 AP) and 1.8% (44.7 AP vs. 42.9 AP), respectively. In addition, under the AP_S, AP_M and AP_L evaluation criteria, the performance of the invention is also the best among the models in Table 2.
FIG. 6 shows detection results of the invention on the MS COCO dataset. The results show that for multi-scale objects in the dataset, the invention retains more information about the objects and can accurately give their locations and classes. DBFP-Net not only detects extremely small objects, but also easily detects occluded and dense objects. For example, in the second picture of the third row and the fourth picture of the fifth row, the occluded birds are detected well; in the second picture of the second row and the third picture of the fifth row, dense fruits and people are detected accurately. The invention is also highly robust to small objects with fast motion and pixel blur, such as the baseball in the second picture of the first row. In addition, objects with sparse features, such as skateboards and cups, are detected accurately.
In conclusion, compared with the classical FPN, the method of the invention achieves an 8.5% improvement and can effectively detect occluded targets and targets of different sizes.

Claims (5)

1. A multi-scale target detection method guided by image pyramid features is characterized by comprising the following steps:
S1, taking a color image as the network input, using an FPN with a ResNet-101 backbone network as the target detection framework, and extracting image features with an ordered down-sampling method;
S2, taking the same color image as in step S1 as input, and extracting the position information and detail features of each level of the image pyramid with the constructed double-bottleneck convolutional network;
S3, inputting the image features of each level extracted in step S2, together with the corresponding deep features of the backbone network, into the constructed hierarchical feature fusion module to fuse the high-resolution weak-semantic features with the low-resolution strong-semantic features;
and S4, introducing a reconstructed loss function based on Focal loss, training the multiple tasks, and completing target detection.
2. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S1 the ordered down-sampling method is implemented as follows:
S11, sliding a window with a set step length over the feature map of each sampling layer of the convolutional neural network, sorting the values inside the window in ascending order, and extracting the four values in turn to generate four new feature maps; the width and height of each new feature map are half those of the original feature map, and the output of the ordered down-sampling method is:

F_l^j = M_j(F_l), j = 1, 2, 3, 4

where F_l ∈ R^(W×H×D) denotes the feature map of a sampling layer of the convolutional neural network, W, H and D denote its width, height and number of channels, and l is the hierarchical index of the sampling layer; M_j(·) denotes the operation of extracting the j-th value in the sliding window, four values being extracted in turn from each window; F_l^j denotes the j-th new feature map output by the l-th down-sampling layer, four new feature maps being generated for each down-sampling layer;
S12, the four new feature maps are concatenated and fed into a small convolutional network for feature refinement and channel adjustment; the final output feature map F_l' ∈ R^(W'×H'×D') serves as the input of the next layer of the backbone network, where W', H' and D' denote its width, height and number of channels.
3. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S2 the double-bottleneck convolutional network is constructed as follows:
S21, defining the input of the double-bottleneck convolutional network as:

I = {I_i}, i = 0, 1, …, n−1

where I_i ∈ R^(H*×W*×3) denotes an image of height H* and width W*, which is also an input image of the target detection model; i is the hierarchical index of both the image pyramid and the backbone network;
S22, inputting the i-th level image of the image pyramid into the double-bottleneck convolutional network, and extracting the surface-layer edge features of the image through a 5×5 convolution kernel followed by a 3×3 convolution kernel;
S23, inputting the extracted edge features into a residual network unit with 2 bottleneck structures to extract detail features, and using a side connection with a 1×1 convolution kernel to pass the precisely localized edge information on to the extracted texture detail features;
each bottleneck structure is composed of 2 1×1 convolution kernels, used respectively for dimensionality reduction and dimensionality expansion of the feature map channels, and 2 3×3 convolution kernels used for learning shallow features;
S24, obtaining a feature map with the same scale as the corresponding backbone network level as the output of the residual network unit;
S25, taking the images of the different scales as input, defining the output of the double-bottleneck convolutional network as:

P_i = h_i(I_i), i = 0, 1, …, n−1
P = {P_0, P_1, …, P_{n−1}}

where P_i denotes the features extracted from the i-th level image of the image pyramid, and P denotes the set of features extracted from all levels of the image pyramid.
4. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S3 the hierarchical feature fusion module employs a feature fusion module based on element-wise addition; its output is defined as:

O_1i = f_3{ BN[f_1(T(h_i(I_i)))] + BN[f_2(g_i(I_0))] }

where f_1(·) and f_2(·) are two 3×3 convolution units used to parameterize the feature maps, and f_3(·) is a 1×1 convolution unit used as a linear transformation of the feature map; BN[·] is the batch normalization operation on the convolution features; T(·) denotes a bilinear interpolation operation along the channel dimension, used to adjust the channel dimensions of the two different types of features; h(·) and g(·) are the output feature map of the double-bottleneck convolutional network and the feature map of the backbone network, respectively, and i is the hierarchical index of the image pyramid and the backbone network; I_0 and I_i denote the original image and the i-th level image of the image pyramid, respectively.
5. The multi-scale target detection method guided by image pyramid features according to claim 1, wherein in step S4 the classification loss function is as follows:

L_cls(p, p*) = −α_t (1−p)^γ log(p),  if p* = 1
L_cls(p, p*) = −(1−α_t) p^γ log(1−p),  if p* = −1

where p and p* are the sample prediction and the sample ground truth, respectively; α_t ∈ (0, 1) is a weight factor introduced for class 1 and 1−α_t is the weight factor introduced for class −1; (1−p)^γ is the modulation factor;
the position regression loss term is expressed as:

L_reg(t, t*) = Σ_{i∈{x,y,w,h}} smooth_L1(t_i − t_i*)

where L_reg(t, t*) is expressed by the smoothed L1 loss; t = {x, y, w, h} denotes the predicted bounding box position, where {x, y} are the center coordinates and {w, h} the width and height of the bounding box; t* is the sample label of t;
the target detection loss function is expressed as:

L = (1/N_cls) Σ_w Σ_k L_cls(p_k, p_k*) + λ (1/N_reg) Σ_w Σ_k L_reg(t_k, t_k*)

where N_cls is the number of classified samples and N_reg is the number of regressed samples; w is the number of batches of training images; k is the index of a single sample within each batch of training samples; λ is the loss balance term.
CN202210185676.7A 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics Pending CN114612709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210185676.7A CN114612709A (en) 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210185676.7A CN114612709A (en) 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics

Publications (1)

Publication Number Publication Date
CN114612709A true CN114612709A (en) 2022-06-10

Family

ID=81860058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210185676.7A Pending CN114612709A (en) 2022-02-28 2022-02-28 Multi-scale target detection method guided by image pyramid characteristics

Country Status (1)

Country Link
CN (1) CN114612709A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403115A (en) * 2023-06-07 2023-07-07 江西啄木蜂科技有限公司 Large-format remote sensing image target detection method
CN116403115B (en) * 2023-06-07 2023-08-22 江西啄木蜂科技有限公司 Large-format remote sensing image target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination