CN113743521A - Target detection method based on multi-scale context awareness

Target detection method based on multi-scale context awareness

Info

Publication number
CN113743521A
CN113743521A
Authority
CN
China
Prior art keywords
features
feature
pyramid
scale
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111061082.7A
Other languages
Chinese (zh)
Other versions
CN113743521B (en)
Inventor
王伯英
汲如意
张立波
武延军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS
Priority to CN202111061082.7A
Publication of CN113743521A
Application granted
Publication of CN113743521B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on multi-scale context awareness, which comprises the following steps: 1) extracting features of an image at a plurality of scales; 2) enhancing the top-level feature among the multi-scale features through a hole residual block to obtain a top-level feature with high-level semantics; 3) fusing the features of adjacent layers to generate pyramid features; 4) aggregating the pyramid features to obtain a feature X_m; 5) further enhancing the feature X_m through a dependency enhancement module to generate an enhanced feature X_o; 6) matching the feature X_o to each pyramid feature by upsampling or downsampling respectively, and adding the matched features element-wise; 7) inputting the features obtained in step 6) into a candidate region generation network to generate candidate boxes, and extracting the features of the candidate boxes; 8) inputting the candidate box features into a head detection module for prediction, and then filtering the detection results of the candidate boxes by non-maximum suppression to obtain the category and position information of the objects.

Description

Target detection method based on multi-scale context awareness
Technical Field
The invention relates to the technical field of computer vision, in particular to target detection, and specifically to a target detection method based on multi-scale context awareness.
Background
Object detection is a practical and challenging computer vision task whose purpose is to identify and locate objects in an image. In recent years, with advances in deep learning, object detection has developed rapidly and is widely applied in robot navigation, intelligent video surveillance, industrial inspection, aerospace, and other fields. General-purpose detectors fall into two categories: single-stage and two-stage. Single-stage detectors process the input image directly to produce detection results, while two-stage detectors first extract candidate regions through a region proposal network (RPN) and then refine the detection results from those candidates. Early work detected objects directly on the highest-level features; however, because of their small spatial resolution, the highest-level features alone are poorly suited to detection. To address this, feature pyramid techniques that exploit multi-scale features have emerged. Mainstream feature pyramid work divides into two categories: neural architecture search (NAS) based and manually designed. NAS-FPN is representative of the NAS-based methods: it defines a search space and uses a reinforcement learning strategy to find the best-performing pyramid structure. NAS-based methods achieve higher performance but have clear drawbacks. First, the resulting structures are extremely complex and hard to interpret. Second, the structures are typically multi-layer stacks, which imposes a heavy parameter and computation burden. Third, the search cost is prohibitive, often amounting to thousands of TPU-hours. In contrast, non-NAS feature pyramids are designed by hand. FPN is the most widely applied manually designed module, and current FPN-based methods suffer from three problems: (1) Loss of highest-level context. Before fusion, a 1 × 1 convolutional layer reduces the number of feature channels. The top-level features typically have thousands of channels carrying rich contextual information, and this channel reduction discards much of it. (2) Insufficient context fusion. During fusion, high-level features are matched to shallower features by upsampling and then fused by element-wise addition. This simple aggregation strategy is suboptimal: different levels carry different context and should not be treated identically. (3) Semantic gaps between levels. Because feature propagation is unidirectional, low-level features cannot propagate to higher levels; moreover, high-level semantic information is diluted during propagation, producing semantic differences between levels after fusion.
Disclosure of Invention
In order to overcome the above problems, an object of the present invention is to provide a target detection method based on multi-scale context awareness, an electronic device, and a computer-readable storage medium. First, a hole residual block produces enhanced high-level features with a richer receptive field. Second, an interactive fusion method is adopted to better fuse the context information of adjacent layers. Third, an adaptive context aggregation block is proposed to solve the semantic-gap problem: under channel and spatial guidance, the network adaptively learns the weights of different layers to generate a discriminative context. Our method enables the network to obtain significant performance gains, and on this basis the present invention has been completed.
To achieve the above object, the present invention employs the following steps:
1) inputting the sample image into a backbone network to extract multi-scale features {C2, C3, C4, C5};
2) applying the hole residual block to the top-level feature C5 extracted by the backbone network, thereby generating an enhanced high-level feature P5 with a richer receptive field to compensate for the loss of high-level information;
3) generating features {P2, P3, P4, P5} through a cross-scale context aggregation module, which better fuses the context information of adjacent levels;
4) applying the adaptive context aggregation module to the features {P2, P3, P4, P5}, so that the network learns channel and spatial weights for the multi-scale features, and obtaining a feature X_m by weighted summation;
5) further enhancing the feature X_m through the dependency enhancement module to generate an enhanced feature X_o;
6) matching the feature X_o to the scales of the features {P2, P3, P4, P5} by upsampling or downsampling respectively, and adding the matched features element-wise to obtain features {O2, O3, O4, O5};
7) inputting the features {O2, O3, O4, O5} into a candidate region generation network to generate candidate boxes, while extracting the candidate box features with an RoI-Pooling layer;
8) the candidate box features are input to a head detection module (such as that of Faster R-CNN or Mask R-CNN) for prediction. The head detection module comprises a classification module and a regression module: the classification module generates the category of each candidate box, and the regression module predicts position-coordinate offsets. The offsets are used to correct the positions of the candidate boxes generated in step 7). Finally, the detection results are filtered by non-maximum suppression (NMS) to obtain the category and position of each object, and whether the category is a target category is judged.
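As an illustrative sketch of the NMS filtering at the end of step 8), the snippet below uses torchvision's nms operator; the IoU threshold of 0.5 and the dummy tensors are assumptions of this illustration, not values fixed by the patent.

```python
import torch
from torchvision.ops import nms

# Hypothetical inputs: boxes (K, 4) are corrected candidate boxes in (x1, y1, x2, y2)
# format; scores (K,) are classification confidences from the head detection module.
boxes = torch.rand(100, 4) * 400
boxes[:, 2:] += boxes[:, :2]                   # guarantee x2 > x1 and y2 > y1
scores = torch.rand(100)

keep = nms(boxes, scores, iou_threshold=0.5)   # indices of boxes surviving NMS
final_boxes, final_scores = boxes[keep], scores[keep]
```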
A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the above method.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the above-mentioned method.
The invention has the advantages that:
1) the invention provides a novel feature pyramid network, namely a multi-scale context-aware network, which comprises three modules: a hole residual block, a cross-scale context aggregation module, and an adaptive context aggregation module;
2) the target detection method based on multi-scale context awareness can obtain an obvious performance improvement over target detection baselines.
drawings
FIG. 1 is a flowchart of the target detection method based on multi-scale context awareness according to an embodiment of the present invention;
FIG. 2 is a diagram of the multi-scale context-aware target detection framework of the present invention, with the structure of the hole residual block on the right, where CCAB is the cross-scale context aggregation module, CAB is the channel-guided aggregation module, and SAB is the spatial-guided aggregation module;
FIG. 3 illustrates a network architecture diagram of a cross-scale context aggregation module;
FIG. 4 shows a network structure diagram of the adaptive context aggregation module, wherein (a) is the channel-guided aggregation module and (b) is the spatial-guided aggregation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings. The described embodiments are only some embodiments of the invention, not all embodiments.
Example 1
The invention discloses a target detection method based on multi-scale context awareness, which comprises the following steps:
step S1: constructing a backbone network, and performing pre-training on a large-scale classification data set to extract multi-scale features { C2, C3, C4, C5} of the input image; the backbone network can select an existing deep learning based neural network, such as a residual error network (ResNet) or a multi-branch residual error network (ResNeXt). The backbone network is pre-trained on large-scale taxonomic datasets (such as ImageNet or Microsoft COCO).
Step S2: construct the multi-scale context-aware network. Firstly, the hole residual block generates enhanced high-level features with richer receptive fields by stacking a plurality of residual blocks with different hole (dilation) rates, which reduces the loss of context information in the highest-level features; the residual block with the smallest hole rate comes first and the hole rates increase in order, i.e., the residual blocks are stacked from the smallest rate to the largest. Secondly, the cross-scale context aggregation block adopts an interactive fusion method to better fuse the context information of adjacent layers, providing a more effective supplement to the current layer. Thirdly, an adaptive context aggregation block is proposed to solve the semantic-gap problem: under channel and spatial guidance, the network adaptively learns the weights of different layers to generate a discriminative context.
The hole residual block. As shown in FIG. 2, after the backbone network extracts the top-level feature C5, we input it into the hole residual block to obtain the context-rich feature P5. Each residual block first uses a 1 × 1 convolutional layer to reduce the number of channels, then enhances context semantic information through a 3 × 3 dilated convolutional layer, whose enlarged effective kernel expands the receptive field so that the extracted features carry rich context semantics. Finally, a 1 × 1 convolutional layer restores the number of channels. Note that each 3 × 3 convolutional layer has a different hole rate, e.g., 2, 4, 6, 8.
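A minimal PyTorch sketch of such a block stack, assuming a bottleneck layout; the channel widths (2048 in, 256 reduced), the ReLU placements, and the residual addition are assumptions consistent with common ResNet practice rather than details fixed by the patent.

```python
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Bottleneck residual block whose 3x3 convolution is a dilated ('hole') convolution."""
    def __init__(self, channels, reduced, dilation):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, 1),                   # 1x1: reduce channels
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, 3,
                      padding=dilation, dilation=dilation),    # 3x3 dilated: larger receptive field
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1))                   # 1x1: restore channels

    def forward(self, x):
        return x + self.body(x)                                # residual connection

class HoleResidualBlock(nn.Module):
    """Stack of dilated residual blocks with hole rates in increasing order."""
    def __init__(self, channels=2048, reduced=256, rates=(2, 4, 6, 8)):
        super().__init__()
        self.blocks = nn.Sequential(
            *[DilatedResidualBlock(channels, reduced, r) for r in rates])

    def forward(self, c5):
        return self.blocks(c5)                                 # context-enriched P5
```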
The cross-scale context aggregation block. Features of adjacent levels are fused through the cross-scale context aggregation module; for example, the feature P4 is obtained by applying the module to the features P5 and C4. As shown in FIG. 3, let the inputs of the cross-scale aggregation block be f(i+1) and f(i); first, each input feature is enhanced by a 3 × 3 convolutional layer:
f(i+1)=Conv(f(i+1))
f(i)=Conv(f(i))
The two branches are then cross-fused: f(i+1) is matched to f(i) by upsampling, and f(i) is matched to f(i+1) by downsampling. The fusion is performed as follows:
h(i+1)=Conv(Down(f(i)))+Conv(f(i+1))
h(i)=Conv(Up(f(i+1)))+Conv(f(i))
o(i)=Conv(h(i))+Conv(Up(h(i+1)))
P(i)=Conv(o(i)+f(i))
Finally, we obtain the enhanced features {P2, P3, P4, P5} through the cross-scale context aggregation blocks.
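A sketch of one cross-scale context aggregation block following the four equations above. Assumptions of this illustration (not fixed by the patent): every Conv is a 3 × 3 convolution, Up is nearest-neighbor interpolation, and Down is adaptive max pooling.

```python
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(c):
    return nn.Conv2d(c, c, 3, padding=1)

class CrossScaleContextAggregation(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.enhance_hi = conv3x3(channels)   # enhances f(i+1)
        self.enhance_lo = conv3x3(channels)   # enhances f(i)
        self.conv_down = conv3x3(channels)    # on Down(f(i))
        self.conv_hi = conv3x3(channels)      # on f(i+1)
        self.conv_up = conv3x3(channels)      # on Up(f(i+1))
        self.conv_lo = conv3x3(channels)      # on f(i)
        self.conv_h_lo = conv3x3(channels)    # on h(i)
        self.conv_h_hi = conv3x3(channels)    # on Up(h(i+1))
        self.conv_out = conv3x3(channels)     # on o(i) + f(i)

    def forward(self, f_hi, f_lo):            # f_hi = f(i+1) (coarser), f_lo = f(i)
        f_hi = self.enhance_hi(f_hi)
        f_lo = self.enhance_lo(f_lo)
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode="nearest")
        down = lambda x, ref: F.adaptive_max_pool2d(x, ref.shape[-2:])
        h_hi = self.conv_down(down(f_lo, f_hi)) + self.conv_hi(f_hi)   # h(i+1)
        h_lo = self.conv_up(up(f_hi, f_lo)) + self.conv_lo(f_lo)       # h(i)
        o = self.conv_h_lo(h_lo) + self.conv_h_hi(up(h_hi, f_lo))      # o(i)
        return self.conv_out(o + f_lo)                                 # P(i)
```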
The adaptive context aggregation module. As shown in FIG. 2, the multi-scale features {P2, P3, P4, P5} are input into the channel-guided aggregation module and the spatial-guided aggregation module respectively, generating the corresponding features X_c and X_s. The two features are then fused by element-wise addition to obtain the enhanced feature X_m. Note that we first unify the multi-scale features to a common size (the P4 scale was chosen experimentally) before inputting them into the adaptive context aggregation block.
The channel-guided aggregation module. As shown in FIG. 4(a), given the pyramid features {P2, P3, P4, P5} output by the cross-scale context aggregation block, we obtain their global semantic representation through element-wise addition and input it into a global average pooling (GAP) layer, which outputs global channel information. A 1 × 1 convolutional layer is then used to compress the global channel information. Next, N convolutional layers act on the compressed global channel information to produce the channel weights of the pyramid features, and finally the channel weights and the pyramid features are combined by weighted summation to obtain the feature X_c, where N is the number of pyramid feature levels.
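A sketch of the channel-guided aggregation module under stated assumptions: the compressed width, the 1 × 1 weight convolutions, and the softmax that normalizes the N per-level channel weights are choices of this illustration, not details given in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelGuidedAggregation(nn.Module):
    """Channel-guided aggregation over N pyramid levels already resized to a common scale."""
    def __init__(self, channels=256, levels=4, reduced=64):
        super().__init__()
        self.compress = nn.Conv2d(channels, reduced, 1)              # 1x1 compression
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(reduced, channels, 1) for _ in range(levels)])  # N weight branches

    def forward(self, feats):                    # feats: list of N (B, C, H, W) tensors
        g = torch.stack(feats).sum(dim=0)        # element-wise addition -> global semantics
        g = F.adaptive_avg_pool2d(g, 1)          # GAP -> global channel information
        g = self.compress(g)
        w = torch.stack([conv(g) for conv in self.weight_convs])  # (N, B, C, 1, 1)
        w = torch.softmax(w, dim=0)              # normalize weights across levels (assumption)
        return sum(wi * fi for wi, fi in zip(w, feats))           # X_c
```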
The spatial-guided aggregation module. As shown in FIG. 4(b), a global semantic representation of the pyramid features {P2, P3, P4, P5} is first obtained by element-wise addition. Average pooling and max pooling operations are then used to generate two different kinds of spatial context information, which are fused with a Concat operation. Next, N 7 × 7 convolutional layers act on the fused context information to obtain the spatial weights of the pyramid features, and finally the feature X_s is obtained by weighted summation of the spatial weights and the pyramid features, where N is the number of pyramid feature levels.
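A sketch of the spatial-guided aggregation module; here the average and max pooling are interpreted as channel-wise pooling (CBAM-style), and the softmax normalization of the N spatial weight maps is an assumption.

```python
import torch
import torch.nn as nn

class SpatialGuidedAggregation(nn.Module):
    """Spatial-guided aggregation over N pyramid levels already resized to a common scale."""
    def __init__(self, levels=4):
        super().__init__()
        self.weight_convs = nn.ModuleList(
            [nn.Conv2d(2, 1, 7, padding=3) for _ in range(levels)])  # N 7x7 branches

    def forward(self, feats):                    # feats: list of N (B, C, H, W) tensors
        g = torch.stack(feats).sum(dim=0)        # element-wise addition -> global semantics
        avg = g.mean(dim=1, keepdim=True)        # average pooling across channels
        mx = g.max(dim=1, keepdim=True).values   # max pooling across channels
        ctx = torch.cat([avg, mx], dim=1)        # Concat -> (B, 2, H, W) spatial context
        w = torch.stack([conv(ctx) for conv in self.weight_convs])  # (N, B, 1, H, W)
        w = torch.softmax(w, dim=0)              # normalize weights across levels (assumption)
        return sum(wi * fi for wi, fi in zip(w, feats))             # X_s
```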
The dependency enhancement module. We apply a dependency enhancement module to the feature X_m to generate the more discriminative feature X_o. Experiments with existing attention blocks (such as SEBlock, CBAM, Non-local, and GCBlock) show that both GCBlock and Non-local work well, but Non-local imposes a significant parameter and computation burden compared with GCBlock. GCBlock (the global context block) is therefore selected as the default setting; by effectively capturing long-range dependencies, it further improves accuracy.
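Since GCBlock is named as the default dependency enhancement module, a simplified global context block in the style of the public GCNet design is sketched below; the reduction width and the LayerNorm placement follow GCNet conventions and are not details given in the patent.

```python
import torch
import torch.nn as nn

class GCBlock(nn.Module):
    """Simplified global context block (GCNet-style)."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.context_mask = nn.Conv2d(channels, 1, 1)      # per-position attention logits
        self.transform = nn.Sequential(
            nn.Conv2d(channels, reduced, 1),
            nn.LayerNorm([reduced, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1))

    def forward(self, x):                                  # x = X_m, shape (B, C, H, W)
        b, c, h, w = x.shape
        attn = self.context_mask(x).view(b, 1, h * w)
        attn = torch.softmax(attn, dim=-1)                 # attention over all positions
        ctx = torch.bmm(x.view(b, c, h * w), attn.transpose(1, 2)).view(b, c, 1, 1)
        return x + self.transform(ctx)                     # X_o
```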
The feature X_o is then matched to the scales of the features {P2, P3, P4, P5} by upsampling or downsampling respectively, and the features {O2, O3, O4, O5} are finally obtained by element-wise addition. The operation applied to X_o is determined by the scale of each layer's feature: for the i-th layer feature P_i, X_o is upsampled if it is smaller than P_i and downsampled if it is larger.
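A sketch of this redistribution step; the choice of nearest-neighbor upsampling and adaptive max pooling for downsampling is an assumption of the illustration.

```python
import torch.nn.functional as F

def redistribute(x_o, pyramid):
    """Match X_o to each pyramid level by resampling, then fuse by element-wise addition."""
    outs = []
    for p in pyramid:                                   # pyramid = [P2, P3, P4, P5]
        if x_o.shape[-1] < p.shape[-1]:                 # X_o smaller than P_i -> upsample
            r = F.interpolate(x_o, size=p.shape[-2:], mode="nearest")
        elif x_o.shape[-1] > p.shape[-1]:               # X_o larger than P_i -> downsample
            r = F.adaptive_max_pool2d(x_o, p.shape[-2:])
        else:
            r = x_o
        outs.append(p + r)                              # O_i = P_i + resampled X_o
    return outs                                         # [O2, O3, O4, O5]
```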
Step S3: construct the candidate region generation network, which generates detection boxes. For each point on the feature maps {O2, O3, O4, O5} obtained in step S2, it generates detection boxes with different scales and aspect ratios. The features of these detection boxes are then extracted through an RoI Align layer and input into two network layers: one performs classification, i.e., decides whether the object contained in the box belongs to the foreground; the other outputs the offset of the detection box relative to the ground-truth object box. The detection boxes are preliminarily corrected using the predicted offsets.
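For the RoI feature extraction, a sketch using torchvision's roi_align; the output size (7 × 7), the spatial scale (1/4 for the stride-4 map), and the single-level lookup are illustrative assumptions, since in practice candidate boxes are assigned across pyramid levels.

```python
import torch
from torchvision.ops import roi_align

# Hypothetical inputs: o2 is the finest fused map (stride 4); boxes is a (K, 5) tensor
# whose rows are (batch_index, x1, y1, x2, y2) in image coordinates.
o2 = torch.randn(1, 256, 200, 200)
boxes = torch.tensor([[0.0, 32.0, 48.0, 96.0, 128.0]])

roi_feats = roi_align(o2, boxes, output_size=(7, 7),
                      spatial_scale=0.25,    # image coords -> stride-4 feature coords
                      sampling_ratio=2)      # 7x7 feature per candidate box
```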
Step S4: construct the head detection module, which classifies and regresses the corrected detection boxes again. The head detection module includes a classification module, which outputs the classification result for each detection box, and a position regression module, which outputs the offset of each detection box relative to the ground-truth target.
Step S5: train the network with a gradient descent algorithm; training stops when a pre-specified number of epochs is reached.
Step S6: test the network.
Example 2
Embodiment 2 of the present invention provides an electronic device comprising a memory and a processor. The memory stores a target detection program based on multi-scale context awareness which, when executed by the processor, causes the processor to perform a target detection method based on multi-scale context awareness comprising the following steps:
1) performing multi-scale feature extraction on an input image by using a pre-trained backbone network;
2) fusing the extracted multi-scale features by adopting a multi-scale context-aware network;
3) inputting the fused features into a candidate region generation network to extract candidate boxes, and extracting the features of the candidate boxes through an RoI-Pooling layer;
4) inputting the extracted candidate box features into a head detector to obtain the category and position offset of each detection box. The offsets are used to correct the positions of the candidate boxes generated in step 3). Finally, the final detection result, namely the category and position of the object, is obtained by non-maximum suppression.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to perform a target detection method based on multi-scale context awareness, the method comprising:
1) performing multi-scale feature extraction on an input image by using a pre-trained backbone network;
2) fusing the extracted multi-scale features by adopting a multi-scale context-aware network;
3) inputting the fused features into a candidate region generation network to extract candidate boxes, and extracting the features of the candidate boxes through an RoI-Pooling layer;
4) inputting the extracted candidate box features into a head detector to obtain the category and position information of the detection boxes.
The above description is only a preferred example of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A target detection method based on multi-scale context awareness, comprising the following steps:
1) extracting a plurality of scale features of the image by using a backbone network;
2) enhancing the top-level feature among the multi-scale features through a hole residual block to obtain a top-level feature with high-level semantics;
3) fusing the features of adjacent layers through a cross-scale context aggregation module to generate pyramid features;
4) aggregating the pyramid features through an adaptive context aggregation module to obtain a feature X_m;
5) further enhancing the feature X_m through a dependency enhancement module to generate an enhanced feature X_o;
6) matching the feature X_o to the pyramid features by upsampling or downsampling respectively, and adding the matched features element-wise;
7) inputting the features obtained in the step 6) into a candidate area generation network to generate a candidate frame, and extracting the features of the candidate frame;
8) inputting the candidate box features into a head detection module for prediction to obtain the category and position coordinates of each candidate box; and then filtering the detection results of the candidate boxes by non-maximum suppression to obtain the category and position information of the objects in the candidate boxes.
2. The method of claim 1, wherein the hole residual block comprises a plurality of residual blocks with different hole rates; the top-level feature among the multi-scale features is input into the residual blocks in sequence, wherein each residual block first adopts a 1 × 1 convolutional layer to reduce the number of channels of the input data, then enhances the context semantic information of the input data through a 3 × 3 convolutional layer, and then restores the number of channels of the input data using a 1 × 1 convolutional layer; wherein the 3 × 3 convolutional layers in different residual blocks have different hole rates.
3. The method of claim 1 or 2, wherein the cross-scale context aggregation module generates pyramid features by:
31) respectively enhancing the two input adjacent-layer features f(i+1) and f(i) through a 3 × 3 convolutional layer;
32) upsampling the enhanced feature f(i+1) and matching and fusing it with the enhanced feature f(i) to obtain a feature h(i); downsampling the enhanced feature f(i) and matching and fusing it with the enhanced feature f(i+1) to obtain a feature h(i+1);
33) upsampling the feature h(i+1) and matching and fusing it with the feature h(i) to obtain a feature o(i);
34) matching and fusing the feature o(i) with the feature f(i) of the i-th layer to generate the pyramid feature.
4. The method of claim 1 or 2, wherein the adaptive context aggregation module comprises a channel-guided aggregation module and a spatial-guided aggregation module; the pyramid features are input into the channel-guided aggregation module and the spatial-guided aggregation module respectively to generate corresponding features X_c and X_s; the features X_c and X_s are then fused by element-wise addition to obtain an enhanced feature X_m.
5. The method of claim 4, wherein the channel-guided aggregation module first obtains a global semantic representation of the pyramid features and inputs it into a global average pooling layer; the global average pooling layer processes the global semantic representation to output global channel information; a 1 × 1 convolutional layer is then used to compress the global channel information, N convolutional layers act on the compressed global channel information to obtain the channel weights of the pyramid features, and the channel weights and the pyramid features are then combined by weighted summation to obtain the feature X_c; wherein N is the number of pyramid feature levels.
6. The method of claim 4, wherein the spatial-guided aggregation module first obtains a global semantic representation of the pyramid features; average pooling and maximum pooling operations are then applied to the global semantic representation respectively to generate two different kinds of spatial context information; the two kinds of spatial context information are then fused; N 7 × 7 convolutional layers then act on the fused spatial context information to obtain the spatial weights of the pyramid features, and the feature X_s is finally obtained by weighted summation of the spatial weights and the pyramid features; wherein N is the number of pyramid feature levels.
7. The method of claim 1, wherein the dependency enhancement module is an attention module (GCBlock).
8. The method of claim 1, wherein the candidate region generation network generates detection boxes with different scales and aspect ratios for each point on the features obtained in step 6); the features of these detection boxes are then extracted and input into two network layers, one for classification, i.e., identifying whether the object contained in the detection box belongs to the foreground, and the other for predicting and outputting the offset of the detection box relative to the ground-truth object box; the detection boxes are then corrected using the predicted offsets; and the corrected detection boxes are then classified and regressed again.
9. A server, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for carrying out the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
CN202111061082.7A 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness Active CN113743521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061082.7A CN113743521B (en) 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness


Publications (2)

Publication Number Publication Date
CN113743521A 2021-12-03
CN113743521B 2023-06-27

Family

ID=78737903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061082.7A Active CN113743521B (en) 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness

Country Status (1)

Country Link
CN (1) CN113743521B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104584A1 (en) * 2018-09-28 2020-04-02 Aptiv Technologies Limited Object detection system of a vehicle
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN111259758A (en) * 2020-01-13 2020-06-09 中国矿业大学 Two-stage remote sensing image target detection method for dense area
CN111461110A (en) * 2020-03-02 2020-07-28 华南理工大学 Small target detection method based on multi-scale image and weighted fusion loss
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111738110A (en) * 2020-06-10 2020-10-02 杭州电子科技大学 Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN112347859A (en) * 2020-10-15 2021-02-09 北京交通大学 Optical remote sensing image saliency target detection method
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
T. WANG et al.: "SSFENet: Spatial and Semantic Feature Enhancement Network for Object Detection", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
XIN, Y. et al.: "Reverse Densely Connected Feature Pyramid Network for Object Detection", Asian Conference on Computer Vision (ACCV 2018) *
TIAN Tingting et al.: "Remote sensing image object detection based on a multi-scale feature fusion network", Laser & Optoelectronics Progress *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920468A (en) * 2021-12-13 2022-01-11 松立控股集团股份有限公司 Multi-branch pedestrian detection method based on cross-scale feature enhancement
CN113920468B (en) * 2021-12-13 2022-03-15 松立控股集团股份有限公司 Multi-branch pedestrian detection method based on cross-scale feature enhancement
CN116052026A (en) * 2023-03-28 2023-05-02 石家庄铁道大学 Unmanned aerial vehicle aerial image target detection method, system and storage medium

Also Published As

Publication number Publication date
CN113743521B (en) 2023-06-27

Similar Documents

Publication Title
CN109522966B (en) Target detection method based on dense connection convolutional neural network
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN110046550B (en) Pedestrian attribute identification system and method based on multilayer feature learning
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN110348437B (en) Target detection method based on weak supervised learning and occlusion perception
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN113743521B (en) Target detection method based on multi-scale context awareness
CN113628294A (en) Image reconstruction method and device for cross-modal communication system
CN109522958A (en) Based on the depth convolutional neural networks object detection method merged across scale feature
CN110826609B (en) Double-current feature fusion image identification method based on reinforcement learning
CN114283120B (en) Domain-adaptive-based end-to-end multisource heterogeneous remote sensing image change detection method
CN115222946B (en) Single-stage instance image segmentation method and device and computer equipment
CN112307982A (en) Human behavior recognition method based on staggered attention-enhancing network
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN111507359A (en) Self-adaptive weighting fusion method of image feature pyramid
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN116012722A (en) Remote sensing image scene classification method
CN115830449A (en) Remote sensing target detection method with explicit contour guidance and spatial variation context enhancement
CN114758255A (en) Unmanned aerial vehicle detection method based on YOLOV5 algorithm
CN113723553A (en) Contraband detection method based on selective intensive attention
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN111582057A (en) Face verification method based on local receptive field
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant