CN113743521B - Target detection method based on multi-scale context awareness


Info

Publication number
CN113743521B
Authority
CN
China
Prior art keywords
features
feature
pyramid
scale
context
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111061082.7A
Other languages
Chinese (zh)
Other versions
CN113743521A (en)
Inventor
王伯英
汲如意
张立波
武延军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Priority to CN202111061082.7A priority Critical patent/CN113743521B/en
Publication of CN113743521A publication Critical patent/CN113743521A/en
Application granted granted Critical
Publication of CN113743521B publication Critical patent/CN113743521B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G06F18/254 - Fusion techniques of classification results, e.g. of results related to same input data
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method based on multi-scale context awareness, which comprises the following steps: 1) extracting features at a plurality of scales from an image; 2) enhancing the top-level feature among the multi-scale features through a dilated residual block to obtain an enhanced top-level feature with richer high-level semantics; 3) fusing the features of adjacent layers to generate pyramid features; 4) aggregating the pyramid features to obtain a feature X_m; 5) further enhancing the feature X_m through a dependency enhancement module to generate an enhanced feature X_o; 6) matching the feature X_o with the pyramid features by upsampling or downsampling, respectively, and adding them element-wise; 7) inputting the features obtained in step 6) into a candidate region generation network to generate candidate boxes, and extracting the features of the candidate boxes; 8) inputting the candidate-box features into a head detection module for prediction, and filtering the detection results of the candidate boxes with a non-maximum suppression method to obtain the category and position information of the objects.

Description

Target detection method based on multi-scale context awareness
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection method based on multi-scale context awareness.
Background
Object detection is a practical and challenging computer vision task whose purpose is to identify the objects in an image and localize them. In recent years, with the deepening of deep learning research, object detection has developed rapidly and is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace. General-purpose object detection is usually divided into two categories: single-stage and two-stage detection. Single-stage detectors process the input image directly to produce detection results. Two-stage detectors first extract candidate regions through an RPN and then refine the detection results based on those candidates. In early studies, object detection used the highest-level features directly to detect objects. However, because of their small spatial scale, the highest-level features are not well suited to detection. To address this problem, feature pyramid techniques that exploit multi-scale features were developed. Mainstream feature pyramid work falls into two categories: neural architecture search and hand-designed methods. NAS-FPN is representative of the search-based approach: it defines a search space and explores the best-performing pyramid structure with a reinforcement learning strategy. Search-based methods achieve higher performance but also have notable drawbacks. First, the resulting structures are extremely complex and hard to interpret. Second, the structures are typically multi-layered and therefore impose a heavy parameter and computation burden. Third, the search cost is prohibitive, involving thousands of TPU hours. In contrast, non-NAS feature pyramid methods are designed manually. FPN is a widely applied hand-designed module, and current FPN-based methods suffer from three problems. (1) The highest-level context information is lost: before fusion, a 1×1 convolution layer reduces the number of feature channels, and since the highest-level features typically have thousands of channels containing rich context information, this reduction discards a great deal of it. (2) The context fusion strategy is inadequate: during fusion, high-level features are matched to shallow features by upsampling and then fused by element-wise addition, but this simple aggregation strategy is not optimal, because levels containing different context information should not be treated identically. (3) There is a semantic gap between features of different levels: feature propagation is unidirectional, so bottom-level features cannot be propagated to higher levels; moreover, high-level semantic information is diluted during propagation, producing semantic differences between layers after fusion.
Disclosure of Invention
To overcome the above problems, the invention aims to provide a target detection method based on multi-scale context awareness, together with an electronic device and a computer-readable storage medium. First, enhanced high-level features with richer receptive fields are generated by a dilated residual block. Second, an interactive fusion method is adopted to better fuse the context information of adjacent layers. Third, an adaptive context aggregation block is proposed to address the semantic-gap problem: under channel and spatial guidance, the network adaptively learns weights for different layers to generate discriminative context. The method brings a significant performance gain to the detection network, which led to the completion of the present invention.
To achieve this object, the invention adopts the following steps:
1) Inputting the sample image into a backbone network to extract multi-scale features {C2, C3, C4, C5};
2) Applying the dilated residual block to the top-level feature C5 extracted by the backbone network to generate an enhanced high-level feature P5 with a richer receptive field, compensating for the loss of high-level information;
3) Fusing the context information of adjacent layers with the cross-scale context aggregation module to obtain the features {P2, P3, P4, P5};
4) Applying the adaptive context aggregation module to the features {P2, P3, P4, P5}, so that the network learns channel and spatial weights for the multi-scale features, and obtaining the feature X_m by weighted summation;
5) Further enhancing the feature X_m through the dependency enhancement module to generate the enhanced feature X_o;
6) Matching the feature X_o to the scale of each feature in {P2, P3, P4, P5} by upsampling or downsampling, and finally adding the matched features element-wise to obtain the features {O2, O3, O4, O5};
7) Inputting the features {O2, O3, O4, O5} into the candidate region generation network to generate candidate boxes, while extracting the features of the candidate boxes with a RoI-Pooling layer;
8) Inputting the candidate-box features into a head detection module (such as the detection head of Faster R-CNN or Mask R-CNN) for prediction. The head detection module comprises a classification module, which produces the category of each candidate box, and a regression module, which predicts position-coordinate offsets used to correct the candidate boxes generated in step 7). Finally, a non-maximum suppression method yields the final detection result, i.e., the category and position of each object, from which it is judged whether an object belongs to a target category.
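As an illustrative sketch of the final filtering in step 8), torchvision's non-maximum suppression can be applied to the corrected, scored boxes; the IoU threshold of 0.5 and the example boxes are assumptions, not values fixed by the invention:

```python
import torch
from torchvision.ops import nms

# Two heavily overlapping boxes and one separate box; NMS keeps the
# highest-scoring box of each overlapping group.
boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.90, 0.80, 0.75])

keep = nms(boxes, scores, iou_threshold=0.5)   # -> tensor([0, 2])
final_boxes = boxes[keep]                      # filtered detection results
```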
A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the above method.
A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor realizes the steps of the above method.
The invention has the following beneficial effects:
1) The invention provides a novel feature pyramid network, the multi-scale context awareness network, comprising three modules: a dilated residual block, a cross-scale context aggregation module, and an adaptive context aggregation module;
2) The target detection method based on multi-scale context awareness obtains a significant performance improvement over target detection baselines.
drawings
FIG. 1 is a flow chart of a target detection method based on multi-scale context awareness according to an embodiment of the present invention;
FIG. 2 shows the multi-scale context-aware target detection framework, with the structure of the dilated residual block on the right side, where CCAB is the cross-scale context aggregation module, CAB is the channel-guided aggregation module, and SAB is the spatial-guided aggregation module;
FIG. 3 shows the network structure of the cross-scale context aggregation module;
FIG. 4 shows the network structure of the adaptive context aggregation module, where (a) is the channel-guided aggregation module and (b) is the spatial-guided aggregation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings. The described embodiments are only some, but not all, embodiments of the invention.
Example 1
The target detection method based on multi-scale context awareness comprises the following steps:
step S1: constructing a backbone network, and pre-training on a large-scale classification dataset for extracting multi-scale features { C2, C3, C4, C5} of an input image; the backbone network may select existing deep learning based neural networks such as residual network (ResNet) or multi-branch residual network (ResNeXt) and the like. The backbone network is pre-trained on large-scale classification datasets (such as ImageNet or Microsoft COCO).
Step S2: constructing the multi-scale context awareness network. First, the dilated residual block stacks several residual blocks with different dilation rates, ordered from the smallest rate to the largest, to generate enhanced high-level features with richer receptive fields; this alleviates the loss of context information in the highest-level features. Second, the cross-scale context aggregation block adopts an interactive fusion method to better fuse the context information of adjacent layers, providing a more effective supplement to the current layer. Third, an adaptive context aggregation block is proposed to address the semantic-gap problem: under channel and spatial guidance, the network adaptively learns weights for different layers to generate discriminative context.
The dilated residual block. After the backbone network produces the top-level feature C5, it is input into the dilated residual block to obtain the context-rich feature P5, as shown in Fig. 2. Each residual block first uses a 1×1 convolution layer to reduce the number of channels, then enhances context semantic information with a 3×3 dilated convolution layer; the dilation enlarges the receptive field, so the extracted features carry rich context semantics. Finally, a 1×1 convolution layer restores the number of channels. Notably, each 3×3 convolution layer has a different dilation rate, such as 2, 4, 6, 8.
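A minimal PyTorch sketch of this block follows; the channel-reduction factor of 4 is an assumption, since the text only fixes the 1×1 / dilated 3×3 / 1×1 structure and the example dilation rates 2, 4, 6, 8:

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Bottleneck residual block with a dilated 3x3 convolution."""
    def __init__(self, channels, dilation, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, 1),                 # 1x1: reduce channels
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=dilation,     # 3x3 dilated conv:
                      dilation=dilation),                # enlarged receptive field
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),                 # 1x1: restore channels
        )

    def forward(self, x):
        return x + self.body(x)                          # residual connection

def make_dilated_residual_stack(channels, rates=(2, 4, 6, 8)):
    # Blocks stacked from small to large dilation rate, as described above.
    return nn.Sequential(*[DilatedResidualBlock(channels, r) for r in rates])

# Usage sketch: P5 = make_dilated_residual_stack(256)(c5_reduced)
```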
The cross-scale context aggregation block. Features of adjacent layers are fused by the cross-scale context aggregation module; for example, the module acts on features P5 and C4 to produce P4. As shown in Fig. 3, suppose the inputs to the cross-scale aggregation block are f(i+1) and f(i); first, each input feature is enhanced by a 3×3 convolution layer:
f(i+1)=Conv(f(i+1))
f(i)=Conv(f(i))
The two branches are then cross-fused: f(i+1) is matched to f(i) by upsampling, and f(i) is matched to f(i+1) by downsampling. The fusion is performed as follows:
h(i+1)=Conv(Down(f(i)))+Conv(f(i+1))
h(i)=Conv(Up(f(i+1)))+Conv(f(i))
o(i)=Conv(h(i))+Conv(Up(h(i+1)))
P(i)=Conv(o(i)+f(i))
Finally, we obtain the enhanced features {P2, P3, P4, P5} through the cross-scale context aggregation block.
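A minimal sketch implementing the four formulas above is given below; treating every Conv as a 3×3 convolution with an unchanged channel count, and realizing Up/Down with nearest-neighbor interpolation and adaptive max pooling, are assumptions beyond what the text specifies:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleAggregation(nn.Module):
    """Cross-scale context aggregation of two adjacent levels."""
    def __init__(self, channels):
        super().__init__()
        conv = lambda: nn.Conv2d(channels, channels, 3, padding=1)
        self.enh_hi, self.enh_lo = conv(), conv()      # input enhancement
        self.c_down, self.c_hi = conv(), conv()        # for h(i+1)
        self.c_up, self.c_lo = conv(), conv()          # for h(i)
        self.c_h_lo, self.c_h_hi = conv(), conv()      # for o(i)
        self.c_out = conv()                            # for P(i)

    def forward(self, f_hi, f_lo):                     # f_hi = f(i+1), f_lo = f(i)
        f_hi, f_lo = self.enh_hi(f_hi), self.enh_lo(f_lo)
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:], mode="nearest")
        down = lambda x, ref: F.adaptive_max_pool2d(x, ref.shape[-2:])
        h_hi = self.c_down(down(f_lo, f_hi)) + self.c_hi(f_hi)   # h(i+1)
        h_lo = self.c_up(up(f_hi, f_lo)) + self.c_lo(f_lo)       # h(i)
        o = self.c_h_lo(h_lo) + self.c_h_hi(up(h_hi, h_lo))      # o(i)
        return self.c_out(o + f_lo)                              # P(i)
```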
The adaptive context aggregation module. As shown in Fig. 2, the multi-scale features {P2, P3, P4, P5} are input into the channel-guided aggregation module and the spatial-guided aggregation module, respectively, to generate the corresponding features X_c and X_s. The two features are then fused by element-wise addition to obtain the enhanced feature X_m. Note that the multi-scale features must first be unified to a common scale (the P4 scale in our experiments) before being input into the adaptive context aggregation block.
The channel-guided aggregation module. As shown in Fig. 4(a), given the pyramid features {P2, P3, P4, P5} output by the cross-scale context aggregation block, their global semantic representation is obtained by element-wise addition. A global average pooling (GAP) layer then processes this global semantic representation and outputs global channel information, which is compressed by a 1×1 convolution layer. N convolution layers then act on the compressed global channel information to produce channel weights for the pyramid features, and the feature X_c is finally obtained by weighted summation of the channel weights and the pyramid features, where N is the number of pyramid feature layers.
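The following sketch illustrates one plausible reading of this module; the compression ratio and the sigmoid gating of the per-level channel weights are assumptions, since the text only specifies GAP, 1×1 compression, N convolution layers, and a weighted sum:

```python
import torch
import torch.nn as nn

class ChannelGuidedAggregation(nn.Module):
    """Channel-guided aggregation of N same-scale pyramid features."""
    def __init__(self, channels, num_levels, reduction=4):
        super().__init__()
        mid = channels // reduction
        self.gap = nn.AdaptiveAvgPool2d(1)            # global average pool
        self.compress = nn.Conv2d(channels, mid, 1)   # 1x1 channel compression
        self.heads = nn.ModuleList(                   # one conv per pyramid level
            [nn.Conv2d(mid, channels, 1) for _ in range(num_levels)]
        )

    def forward(self, feats):                         # feats: N maps at P4 scale
        g = sum(feats)                                # global semantic representation
        g = torch.relu(self.compress(self.gap(g)))    # compressed channel info
        weights = [torch.sigmoid(h(g)) for h in self.heads]
        return sum(w * f for w, f in zip(weights, feats))   # X_c
```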
The spatial-guided aggregation module. As shown in Fig. 4(b), a global semantic representation of the pyramid features {P2, P3, P4, P5} is first obtained by element-wise addition. Two kinds of spatial context information are then generated by average pooling and max pooling operations and fused by a concat operation. N 7×7 convolution layers then act on the fused context information to produce spatial weights for the pyramid features, and the feature X_s is finally obtained by weighted summation of the spatial weights and the pyramid features.
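A corresponding sketch, reading the average/max pooling as channel-wise pooling in the CBAM style and gating the spatial weights with a sigmoid (both assumptions), could look like this:

```python
import torch
import torch.nn as nn

class SpatialGuidedAggregation(nn.Module):
    """Spatial-guided aggregation of N same-scale pyramid features."""
    def __init__(self, num_levels):
        super().__init__()
        self.heads = nn.ModuleList(                       # one 7x7 conv per level
            [nn.Conv2d(2, 1, 7, padding=3) for _ in range(num_levels)]
        )

    def forward(self, feats):                             # feats: N maps at P4 scale
        g = sum(feats)                                    # global semantic rep.
        avg = g.mean(dim=1, keepdim=True)                 # average pool over channels
        mx = g.max(dim=1, keepdim=True).values            # max pool over channels
        ctx = torch.cat([avg, mx], dim=1)                 # fused spatial context
        weights = [torch.sigmoid(h(ctx)) for h in self.heads]
        return sum(w * f for w, f in zip(weights, feats))  # X_s
```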
The dependency enhancement module. A dependency enhancement module acts on the feature X_m to generate a more discriminative feature X_o. Experiments with existing attention blocks (e.g., SEBlock, CBAM, Non-local, and GCBlock) show that GCBlock and Non-local both work well, but Non-local brings a large number of parameters and heavy computation compared with GCBlock. Accordingly, GCBlock (the global context block) is selected as the default setting here; by effectively capturing long-range dependencies, it further improves accuracy.
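For illustration, a simplified global context block in the style of GCNet (Cao et al.) is sketched below; the bottleneck ratio is an assumption:

```python
import torch
import torch.nn as nn

class GlobalContextBlock(nn.Module):
    """Minimal GCBlock: global context modelling plus bottleneck transform."""
    def __init__(self, channels, ratio=4):
        super().__init__()
        mid = channels // ratio
        self.attn = nn.Conv2d(channels, 1, 1)          # context modelling
        self.transform = nn.Sequential(                # bottleneck transform
            nn.Conv2d(channels, mid, 1),
            nn.LayerNorm([mid, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
        )

    def forward(self, x):                              # x = X_m
        b, c, h, w = x.shape
        a = self.attn(x).view(b, 1, h * w).softmax(dim=-1)        # spatial attention
        ctx = torch.bmm(x.view(b, c, h * w), a.transpose(1, 2))   # (b, c, 1)
        ctx = ctx.view(b, c, 1, 1)                                # global context
        return x + self.transform(ctx)                            # X_o
```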
The feature X_o is matched to the scale of each feature in {P2, P3, P4, P5} by upsampling or downsampling, and the features {O2, O3, O4, O5} are then obtained by element-wise addition. The operation is performed separately for each layer: for the i-th layer feature Pi, if the scale of X_o is smaller than that of Pi, X_o is upsampled; if larger, it is downsampled.
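A short sketch of this scale-matching step, using nearest-neighbor interpolation for both directions (an assumption), is:

```python
import torch.nn.functional as F

def broadcast_and_add(x_o, pyramid):
    """Resize X_o to each pyramid level P_i and add element-wise.

    F.interpolate handles both upsampling (X_o smaller than P_i) and
    downsampling (X_o larger than P_i)."""
    outs = []
    for p in pyramid:                                   # {P2, P3, P4, P5}
        x = F.interpolate(x_o, size=p.shape[-2:], mode="nearest")
        outs.append(p + x)                              # {O2, O3, O4, O5}
    return outs
```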
Step S3: constructing the candidate region generation network, which generates detection boxes. For each point on the feature maps {O2, O3, O4, O5} obtained in step S2, it generates detection boxes with different scales and aspect ratios (an illustrative layout is sketched below). The features of the detection boxes are extracted through the RoI Align layer and finally input into two network layers: one classifies whether the object contained in a box belongs to the foreground; the other outputs the offset of the detection box relative to the real object box. The detection boxes are preliminarily corrected with the predicted offsets.
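For illustration only, the detection boxes generated at each feature-map point might be laid out as follows; the particular scales, aspect ratios, and stride handling are assumptions:

```python
import torch

def make_anchors(feat_h, feat_w, stride, scales=(32, 64, 128),
                 ratios=(0.5, 1.0, 2.0)):
    """Boxes of several scales and aspect ratios centred on each point.

    ratio is taken as width/height, so w = s*sqrt(r) and h = s/sqrt(r)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)   # (feat_h*feat_w*len(scales)*len(ratios), 4)
```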
Step S4: constructing the head detection module and reclassifying the corrected detection boxes. The head detection module comprises a classification module, which outputs the classification result of each detection box, and a position regression module, which outputs the offset of each detection box relative to the real target.
Step S5: training the network with a gradient descent algorithm. Training stops when a pre-specified number of epochs is reached.
Step S6: network testing.
Example 2
Embodiment 2 of the present invention provides an electronic device comprising a memory and a processor; when a target detection program based on multi-scale context awareness is executed by the processor, the processor performs a target detection method based on multi-scale context awareness comprising the following steps:
1) Performing multi-scale feature extraction on the input image by using a pre-trained backbone network;
2) Fusing the extracted multi-scale features by adopting a multi-scale context sensing network;
3) Inputting the fused features into a candidate region generation network to generate candidate boxes, and extracting the features of the candidate boxes through a RoI-Pooling layer;
4) The extracted candidate-box features are input to a detection head to obtain the category and position offset of each detection box. The offsets are used to correct the positions of the candidate boxes generated in step 3). Finally, the final detection result, i.e., the category and position of each object, is obtained through a non-maximum suppression method.
Example 3
Embodiment 3 of the present invention provides a computer-readable storage medium storing a program which, when executed by a processor, causes the processor to perform a target detection method based on multi-scale context awareness comprising the following steps:
1) Performing multi-scale feature extraction on the input image by using a pre-trained backbone network;
2) Fusing the extracted multi-scale features by adopting a multi-scale context sensing network;
3) Inputting the fused features into a candidate region generation network to generate candidate boxes, and extracting the features of the candidate boxes through a RoI-Pooling layer;
4) The extracted candidate-box features are input to a detection head to obtain the category and position information of each detection box.
The foregoing is merely a preferred example of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall be included in its protection scope.

Claims (9)

1. A target detection method based on multi-scale context awareness, comprising the following steps:
1) Extracting features at a plurality of scales from an image by using a backbone network;
2) Enhancing the top-level feature among the multi-scale features through dilated residual blocks to obtain an enhanced top-level feature with richer high-level semantics;
3) Fusing the features of adjacent layers through a cross-scale context aggregation module to generate pyramid features; the method for generating pyramid features by the cross-scale context aggregation module comprises: 31) enhancing the two input adjacent-layer features f(i+1) and f(i) through a 3×3 convolution layer, respectively; 32) upsampling the enhanced feature f(i+1) and matching and fusing it with the enhanced feature f(i) to obtain a feature h(i); downsampling the enhanced feature f(i) and matching and fusing it with the enhanced feature f(i+1) to obtain a feature h(i+1); 33) upsampling the feature h(i+1) and then matching and fusing it with the feature h(i) to obtain a feature o(i); 34) matching and fusing the feature o(i) and the i-th layer feature f(i) to generate the pyramid features;
4) Aggregating the pyramid features through an adaptive context aggregation module to obtain a feature X_m;
5) Further enhancing the feature X_m through a dependency enhancement module to generate an enhanced feature X_o;
6) Matching the feature X_o with the pyramid features by upsampling or downsampling, respectively, and adding the matched features element-wise;
7) Inputting the features obtained in step 6) into a candidate region generation network to generate candidate boxes, and extracting the features of the candidate boxes;
8) Inputting the features of the candidate boxes into a head detection module for prediction to obtain the category and position coordinates of each candidate box; and filtering the detection results of the candidate boxes by a non-maximum suppression method to obtain the category and position information of the objects in the candidate boxes.
2. The method of claim 1, wherein the dilated residual block comprises a plurality of residual blocks having different dilation rates; the top-level feature among the multi-scale features is input into each residual block in turn, wherein each residual block first adopts a 1×1 convolution layer to reduce the number of channels of the input data, then enhances the context semantic information of the input data through a 3×3 convolution layer, and then uses a 1×1 convolution layer to restore the number of channels; wherein the 3×3 convolution layers in different residual blocks have different dilation rates.
3. The method of claim 1 or 2, wherein the adaptive context aggregation module comprises a channel-guided aggregation module and a spatial-guided aggregation module; the pyramid features are respectively input into the channel-guided aggregation module and the spatial-guided aggregation module to generate corresponding features X_c and X_s; the features X_c and X_s are then fused by element-wise addition to obtain the enhanced feature X_m.
4. The method of claim 3, wherein the channel-guided aggregation module first obtains a global semantic representation of the pyramid features and inputs it into a global average pooling layer; the global average pooling layer processes the global semantic representation to output global channel information; a 1×1 convolution layer then compresses the global channel information, N convolution layers act on the compressed global channel information to obtain channel weights for the pyramid features, and the feature X_c is obtained by weighted summation of the channel weights and the pyramid features; where N is the number of pyramid feature layers.
5. The method of claim 3, wherein the spatial-guided aggregation module first obtains a global semantic representation of the pyramid features; average pooling and max pooling operations are then applied to the global semantic representation respectively to generate two kinds of spatial context information, which are then fused; N 7×7 convolution layers then act on the fused spatial context information to obtain spatial weights for the pyramid features, and the feature X_s is finally obtained by weighted summation of the spatial weights and the pyramid features; where N is the number of pyramid feature layers.
6. The method of claim 1, wherein the dependency enhancement module is an attention module GCBlock.
7. The method of claim 1, wherein the candidate region generation network generates detection boxes with different scales and aspect ratios for each point on the features obtained in step 6); the features of the detection boxes are then extracted and input into two network layers, one network layer being used for classification, i.e., identifying whether the object contained in a detection box belongs to the foreground, and the other network layer predicting and outputting the offset of the detection box relative to the real object box; the detection boxes are then corrected with the predicted offsets;
and the corrected detection boxes are then reclassified and regressed.
8. A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1 to 7.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202111061082.7A 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness Active CN113743521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111061082.7A CN113743521B (en) 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111061082.7A CN113743521B (en) 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness

Publications (2)

Publication Number Publication Date
CN113743521A (en) 2021-12-03
CN113743521B (en) 2023-06-27

Family

ID=78737903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111061082.7A Active CN113743521B (en) 2021-09-10 2021-09-10 Target detection method based on multi-scale context awareness

Country Status (1)

Country Link
CN (1) CN113743521B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113920468B (en) * 2021-12-13 2022-03-15 松立控股集团股份有限公司 Multi-branch pedestrian detection method based on cross-scale feature enhancement
CN116052026B (en) * 2023-03-28 2023-06-09 石家庄铁道大学 Unmanned aerial vehicle aerial image target detection method, system and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN111259758A (en) * 2020-01-13 2020-06-09 中国矿业大学 Two-stage remote sensing image target detection method for dense area
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111461110A (en) * 2020-03-02 2020-07-28 华南理工大学 Small target detection method based on multi-scale image and weighted fusion loss
CN111738110A (en) * 2020-06-10 2020-10-02 杭州电子科技大学 Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN112347859A (en) * 2020-10-15 2021-02-09 北京交通大学 Optical remote sensing image saliency target detection method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936861B2 (en) * 2018-09-28 2021-03-02 Aptiv Technologies Limited Object detection system of a vehicle

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111126202A (en) * 2019-12-12 2020-05-08 天津大学 Optical remote sensing image target detection method based on void feature pyramid network
CN111259758A (en) * 2020-01-13 2020-06-09 中国矿业大学 Two-stage remote sensing image target detection method for dense area
CN111461110A (en) * 2020-03-02 2020-07-28 华南理工大学 Small target detection method based on multi-scale image and weighted fusion loss
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111738110A (en) * 2020-06-10 2020-10-02 杭州电子科技大学 Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN112347859A (en) * 2020-10-15 2021-02-09 北京交通大学 Optical remote sensing image saliency target detection method
CN112200161A (en) * 2020-12-03 2021-01-08 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Reverse Densely Connected Feature Pyramid Network for Object Detection; Xin, Y. et al.; Asian Conference on Computer Vision (ACCV 2018); 530-545 *
SSFENet: Spatial and Semantic Feature Enhancement Network for Object Detection; T. Wang et al.; IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 1500-1504 *
Remote sensing image object detection based on a multi-scale feature fusion network; Tian Tingting et al.; Laser & Optoelectronics Progress, Vol. 59, No. 16; 427-435 *

Also Published As

Publication number Publication date
CN113743521A (en) 2021-12-03

Similar Documents

Publication Publication Date Title
CN110084292B (en) Target detection method based on DenseNet and multi-scale feature fusion
CN110929736B (en) Multi-feature cascading RGB-D significance target detection method
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN113569667B (en) Inland ship target identification method and system based on lightweight neural network model
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN110348437B (en) Target detection method based on weak supervised learning and occlusion perception
CN113139543B (en) Training method of target object detection model, target object detection method and equipment
CN111027576B (en) Cooperative significance detection method based on cooperative significance generation type countermeasure network
CN113743521B (en) Target detection method based on multi-scale context awareness
CN110674685B (en) Human body analysis segmentation model and method based on edge information enhancement
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
CN115222946A (en) Single-stage example image segmentation method and device and computer equipment
CN111507359A (en) Self-adaptive weighting fusion method of image feature pyramid
CN116012722A (en) Remote sensing image scene classification method
EP3671635B1 (en) Curvilinear object segmentation with noise priors
US20230154005A1 (en) Panoptic segmentation with panoptic, instance, and semantic relations
CN117079098A (en) Space small target detection method based on position coding
CN115830449A (en) Remote sensing target detection method with explicit contour guidance and spatial variation context enhancement
CN114359709A (en) Target detection method and device for remote sensing image
Xu et al. Scale-aware squeeze-and-excitation for lightweight object detection
CN117710841A (en) Small target detection method and device for aerial image of unmanned aerial vehicle
CN117372853A (en) Underwater target detection algorithm based on image enhancement and attention mechanism
Liu et al. Global-local attention mechanism based small object detection
CN114462490A (en) Retrieval method, retrieval device, electronic device and storage medium of image object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant