CN109902800B

CN109902800B - Method for detecting general object by using multi-stage backbone network based on quasi-feedback neural network

Info

Publication number: CN109902800B
Application number: CN201910058187.3A
Authority: CN
Inventors: 刘玉栋; 王勇涛; 汤帜
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2019-01-22
Filing date: 2019-01-22
Publication date: 2020-11-27
Anticipated expiration: 2039-01-22
Also published as: CN109902800A

Abstract

The invention discloses a general object detection method based on a multi-level backbone network. A multi-level backbone network based on a quasi-feedback neural network is established, and the connections among multiple backbone networks are used to simulate a feedback mechanism in a deep neural network, which enhances the extraction of general object features and improves the precision of object detection. The invention can be applied to a variety of object detectors: the backbone network of the detector is replaced with the multi-level backbone network provided by the invention, and the network structure of the other parts of the detector does not need to change. The method is simple and convenient, and the object detection precision is high.

Description

Method for detecting general object by using multi-stage backbone network based on quasi-feedback neural network

Technical Field

The invention belongs to the technical field of object detection and relates to computer vision and deep learning technology, in particular to a general object detection method using a double-backbone network based on a quasi-feedback neural network.

Background

General object detection is one of the most basic tasks in computer vision and is widely applied in practice, for example in autonomous driving, intelligent video surveillance, and remote sensing. In recent years, general object detection has advanced greatly thanks to the rapid development of deep neural networks.

Currently, general object detectors based on deep learning fall into two classes. One class is single-stage detectors, such as SSD (SSD: Single Shot MultiBox Detector) and RetinaNet (Focal Loss for Dense Object Detection). The other class is two-stage detectors, such as Faster R-CNN (Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks), FPN (Feature Pyramid Networks for Object Detection), Mask R-CNN, and Cascade R-CNN (Cascade R-CNN: Delving into High Quality Object Detection).

However, all of the above detectors use a unidirectional feedforward neural network to detect general objects: during training and testing, the features pass straight through the feedforward network to the output, and the network contains no feedback mechanism. This is because gradient descent in deep neural networks relies on backpropagation, so the connections of the network cannot contain loops. The human visual system, by contrast, does contain feedback loops, and a feedback mechanism can correct errors in the extracted features and further strengthen feature extraction. Existing detectors built on unidirectional feedforward neural networks therefore face a bottleneck, and their detection accuracy and precision are limited.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides a general object detection method using a multi-stage backbone network based on a quasi-feedback neural network. The method establishes the multi-stage backbone network and uses the connections among multiple backbone networks to simulate a feedback mechanism in a deep neural network, thereby enhancing the extraction of general object features and improving the precision of object detection.

The technical scheme of the invention is as follows:

A multi-stage backbone network based on a quasi-feedback neural network is established, and a feedback mechanism is simulated in a deep neural network by using the connections among multiple backbone networks, so that the extraction of general object features is enhanced and the object detection precision is improved. The method comprises the following steps:

1) Establish a multi-stage backbone network based on the quasi-feedback neural network.

The multi-stage backbone network may contain 2, 3, … backbone networks. The backbone networks have the same structure, and each may be a ResNet (residual network) or a ResNeXt (multi-branch residual network).

Each backbone network comprises several (typically 4) convolutional blocks, also called backbone network stages, and each stage comprises several convolutional layers.

The output of each stage of each backbone network is fed as input to the same stage of the next backbone network, which forms the quasi-feedback connection.

The quasi-feedback connection consists of a 1 × 1 convolutional layer and an upsampling operation. The 1 × 1 convolutional layer aligns the number of channels of the output features of a stage in the previous backbone network with the number of channels of the input features of the corresponding stage in the next backbone network, and the upsampling operation aligns the spatial sizes of those two sets of features. The quasi-feedback connection of the lowest stage needs no upsampling operation, because its input and output features have the same spatial size.

The output of the last backbone network is taken as the output of the multi-stage backbone network.

2) Input the general object image to be detected into a detector, such as Mask R-CNN or Cascade R-CNN.

3) Send the general object image into the multi-stage backbone network based on the quasi-feedback neural network established in step 1) to extract features; the output of the multi-stage backbone network is the extracted features.

4) Feed the features extracted by the multi-stage backbone network into the modules that follow the backbone network, which may be an RPN (region proposal network) or detection heads, depending on the specific detector.

5) Take the output of the modules that follow the multi-stage backbone network as the detection result of the detector.
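The data flow of steps 1) to 5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `multi_backbone_forward`, `detect`, `fuse`, and `head` are hypothetical, the toy scalar "stages" only stand in for convolutional blocks, and combining the fed-back feature with the stage input by addition is an assumption, since the text only states that the output is sent as input to the same stage of the next backbone network.

```python
from typing import Any, Callable, List, Sequence

Stage = Callable[[Any], Any]


def multi_backbone_forward(
    image: Any,
    backbones: Sequence[Sequence[Stage]],
    connections: Sequence[Sequence[Stage]],
    fuse: Callable[[Any, Any], Any],
) -> List[Any]:
    """Run identically structured backbones joined by quasi-feedback links.

    backbones[k][s]:   stage s of backbone k (the stem is folded into stage 0).
    connections[k][s]: maps the output of stage s in backbone k to the input
                       of stage s in backbone k + 1.
    fuse:              combines a stage input with the fed-back feature
                       (ASSUMPTION: the text does not fix this operation).
    Returns the per-stage outputs of the last backbone.
    """
    feedback: List[Any] = [None] * len(backbones[0])
    outputs: List[Any] = []
    for k, stages in enumerate(backbones):
        x = image
        outputs = []
        for s, stage in enumerate(stages):
            if k > 0:
                # Inject the feature fed back from the previous backbone.
                x = fuse(x, feedback[s])
            x = stage(x)
            outputs.append(x)
        if k + 1 < len(backbones):
            feedback = [connections[k][s](out) for s, out in enumerate(outputs)]
    return outputs


def detect(image, backbones, connections, fuse, head):
    """Steps 2) to 5): extract features, then hand them to the detector head."""
    features = multi_backbone_forward(image, backbones, connections, fuse)
    return head(features)


# Toy usage: two backbones, two scalar "stages" each, identity connections.
stages = [lambda x: x + 1, lambda x: x * 2]
identity = lambda y: y
features = multi_backbone_forward(
    1, [stages, stages], [[identity, identity]], fuse=lambda a, b: a + b
)
result = detect(
    1, [stages, stages], [[identity, identity]], lambda a, b: a + b, head=sum
)
```

Because only the backbone is swapped, `head` can be any existing RPN or detection head, which reflects the point that the rest of the detector is left unchanged.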

The general object detection method can be widely applied in detectors for practical applications such as autonomous driving, intelligent video surveillance, and remote-sensing object recognition, improving the precision of object detection.

Compared with the prior art, the invention has the beneficial effects that:

The invention provides a general object detection method using a multi-stage backbone network based on a quasi-feedback neural network. The multi-stage backbone network is established, and the connections between the backbone networks are used to simulate a feedback mechanism in the neural network, so that the extraction of general object features is enhanced and the accuracy of object detection is improved.

The method breaks with the conventional approach of using a purely feedforward network and extracts features with a multi-level backbone network based on a quasi-feedback neural network. It can be applied to a variety of object detectors: the backbone network of the detector is replaced with the multi-level backbone network provided by the invention, and the network structure of the other parts of the detector does not need to change. The method is simple and convenient, and the object detection precision is high. Experiments on MSCOCO, with an input image size of 800 × 1333 for both training and testing, show the following box mAP values on the test-dev set. After the backbone network of the detector is replaced with the corresponding two-level backbone network (for example, the ResNet101 backbone network with a two-level ResNet101 backbone network, and the ResNeXt152 backbone network with a two-level ResNeXt152 backbone network), the box mAP of FPN based on ResNet101 rises from 39.4% to 41.0%, that of Mask R-CNN based on ResNet101 from 40.1% to 41.8%, that of Cascade R-CNN based on ResNet101 from 42.8% to 44.3%, and that of Cascade Mask R-CNN based on ResNeXt152 from 48.3% to 50.0%. After the backbone network of the detector is replaced with the corresponding three-level backbone network (for example, the ResNet101 backbone network with a three-level ResNet101 backbone network, and the ResNeXt152 backbone network with a three-level ResNeXt152 backbone network), the box mAP of FPN based on ResNet101 rises from 39.4% to 42.0%, and that of Cascade Mask R-CNN based on ResNeXt152 from 48.3% to 51.2%.
(Note: MSCOCO is a large-scale dataset covering tasks such as object detection and segmentation; see http://cocodataset.org/#home. Box mAP is a metric for measuring detection performance; see http://cocodataset.org/#detection-eval.)

Drawings

Fig. 1 is a flow chart of the general object detection method provided by the present invention.

Fig. 2 is a schematic diagram of a conventional backbone network structure.

Fig. 3 is a schematic diagram of a connection structure between two adjacent backbone networks according to the present invention.

Fig. 4 is a schematic structural diagram of a feedback connection in an embodiment of the present invention.

Detailed Description

The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.

The invention proposes a multi-stage backbone network for general object detection, as shown in fig. 1. An existing general object detection framework has only one backbone network, as shown in fig. 2; the most commonly used backbone network at present is ResNet (residual network). To address the lack of feedback in general object detection, the embodiment of the invention uses multiple backbone networks as the detection network and simulates a feedback mechanism in the deep neural network through connections among these backbone networks, thereby enhancing feature extraction. The backbone networks are structurally identical and can be either ResNet (residual network) or ResNeXt (multi-branch residual network). Each backbone network has several convolutional blocks (stages), and each convolutional block contains several convolutional layers. The output of each convolutional block of a backbone network is connected to the input of the same convolutional block of the next backbone network to form a quasi-feedback connection, as shown in fig. 3.

The structure of the quasi-feedback connection (also called the feedback connection for convenience) is shown in fig. 4. It comprises a 1 × 1 convolutional layer and an upsampling operation: the 1 × 1 convolutional layer aligns the number of channels of the output features of a convolutional block in the previous backbone network with the number of channels of the input features of the corresponding convolutional block in the next backbone network, and the upsampling operation aligns the spatial sizes of the two. Note that the feedback connection of the lowest stage needs no upsampling operation, because its input and output features have the same spatial size.
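As a concrete illustration of the connection in fig. 4, the sketch below applies a 1 × 1 convolution followed by upsampling to a tiny feature map stored as nested Python lists. It is a sketch under assumptions rather than the code of the embodiment: the helper names are hypothetical, nearest-neighbour interpolation is assumed because the text does not name the upsampling method, and bias terms are omitted.

```python
from typing import List

# A feature map is indexed as [channel][row][column].
FeatureMap = List[List[List[float]]]


def conv1x1(x: FeatureMap, weight: List[List[float]]) -> FeatureMap:
    """Pointwise convolution: mix channels independently at every pixel.

    weight[o][c] is the contribution of input channel c to output channel o,
    so the output has len(weight) channels and the same spatial size as x.
    """
    height, width = len(x[0]), len(x[0][0])
    return [
        [
            [sum(w * x[c][i][j] for c, w in enumerate(row)) for j in range(width)]
            for i in range(height)
        ]
        for row in weight
    ]


def upsample_nearest(x: FeatureMap, factor: int) -> FeatureMap:
    """Nearest-neighbour upsampling (ASSUMED method); factor 1 is a no-op."""
    return [
        [
            [channel[i // factor][j // factor]
             for j in range(len(channel[0]) * factor)]
            for i in range(len(channel) * factor)
        ]
        for channel in x
    ]


def quasi_feedback(prev_out: FeatureMap, weight: List[List[float]],
                   factor: int) -> FeatureMap:
    """Align channels first, then spatial size, as described for fig. 4."""
    return upsample_nearest(conv1x1(prev_out, weight), factor)


# Two input channels of size 1 x 2, reduced to one channel and upsampled x2.
prev_out = [[[1.0, 2.0]], [[3.0, 4.0]]]
aligned = quasi_feedback(prev_out, weight=[[1.0, 1.0]], factor=2)
# aligned has 1 channel of size 2 x 4
```

Passing `factor=1` reproduces the lowest-stage case, where only the channel alignment is performed.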

Fig. 1 is a flow chart of the general object detection method provided by the invention. In the detection network to be improved, the conventional backbone network (such as ResNet or ResNeXt) is directly replaced with the multi-stage backbone network of the invention.

MSCOCO is a large-scale dataset covering tasks such as object detection and segmentation; see http://cocodataset.org/#home. Box mAP is a metric for measuring detection performance; see http://cocodataset.org/#detection-eval.

Take FPN (Feature Pyramid Networks for Object Detection) as an example. When the ResNet101 part of the network is replaced with the two-level ResNet101 backbone network of the invention, that is, the first-level and second-level backbone networks are both ResNet101, the object detection mAP on the MSCOCO test-dev set improves, with training and test images of size 800 × 1333. When the ResNet101 part of the network is replaced with the three-level ResNet101 backbone network of the invention, that is, the first-level, second-level, and third-level backbone networks are all ResNet101, the object detection mAP on the MSCOCO test-dev set also improves.
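For the ResNet101 case, the mismatch that each quasi-feedback connection has to bridge can be worked out from the stage layout. The sketch below is illustrative only: the channel widths and strides are those of the standard ResNet-50/101 bottleneck design and are not stated in the text, and `StageSpec` and `connection_plan` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class StageSpec:
    in_channels: int   # channels of the feature entering the stage
    out_channels: int  # channels of the feature leaving the stage
    in_stride: int     # downsampling factor of the stage input vs. the image
    out_stride: int    # downsampling factor of the stage output vs. the image


# ASSUMPTION: standard ResNet-50/101 layout (stem output: 64 channels, stride 4).
RESNET101_STAGES = [
    StageSpec(64, 256, 4, 4),
    StageSpec(256, 512, 4, 8),
    StageSpec(512, 1024, 8, 16),
    StageSpec(1024, 2048, 16, 32),
]


def connection_plan(stages: Sequence[StageSpec]) -> List[Dict[str, object]]:
    """For each stage, derive what the quasi-feedback connection must do.

    The connection takes the stage OUTPUT of the previous backbone and must
    match the stage INPUT of the next backbone in channels and spatial size.
    """
    plan = []
    for spec in stages:
        plan.append({
            # 1x1 convolution: (input channels, output channels)
            "conv1x1": (spec.out_channels, spec.in_channels),
            # upsampling factor needed to undo the stage's own downsampling
            "upsample": spec.out_stride // spec.in_stride,
        })
    return plan


plan = connection_plan(RESNET101_STAGES)
for index, entry in enumerate(plan, start=1):
    print(f"stage {index}: {entry}")
```

Under this assumed layout the first stage gets an upsampling factor of 1, which agrees with the statement that the lowest-stage connection needs no upsampling, while the remaining stages need a factor of 2.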

Specifically, the experimental results on MSCOCO, with an input image size of 800 × 1333 for both training and testing, show the following box mAP values on the test-dev set. After the backbone network of the detector is replaced with the corresponding two-level backbone network (for example, the ResNet101 backbone network with a two-level ResNet101 backbone network, and the ResNeXt152 backbone network with a two-level ResNeXt152 backbone network), the box mAP of FPN based on ResNet101 rises from 39.4% to 41.0%, that of Mask R-CNN based on ResNet101 from 40.1% to 41.8%, that of Cascade R-CNN based on ResNet101 from 42.8% to 44.3%, and that of Cascade Mask R-CNN based on ResNeXt152 from 48.3% to 50.0%. After the backbone network of the detector is replaced with the corresponding three-level backbone network (for example, the ResNet101 backbone network with a three-level ResNet101 backbone network, and the ResNeXt152 backbone network with a three-level ResNeXt152 backbone network), the box mAP of FPN based on ResNet101 rises from 39.4% to 42.0%, and that of Cascade Mask R-CNN based on ResNeXt152 from 48.3% to 51.2%.
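The absolute gains implied by these figures can be tabulated with a few lines of Python. The numbers are copied from the paragraph above; the dictionary layout and the function name `absolute_gains` are illustrative choices, and the Cascade Mask R-CNN entry is keyed by ResNeXt152, the backbone named in the parenthetical examples.

```python
from typing import Dict

# Box mAP (%) on MSCOCO test-dev, 800 x 1333 inputs, as reported in the text.
BOX_MAP: Dict[str, Dict[str, float]] = {
    "FPN / ResNet101": {"single": 39.4, "two-level": 41.0, "three-level": 42.0},
    "Mask R-CNN / ResNet101": {"single": 40.1, "two-level": 41.8},
    "Cascade R-CNN / ResNet101": {"single": 42.8, "two-level": 44.3},
    "Cascade Mask R-CNN / ResNeXt152": {
        "single": 48.3, "two-level": 50.0, "three-level": 51.2,
    },
}


def absolute_gains(table: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Return the mAP gain in percentage points over the single backbone."""
    gains = {}
    for detector, scores in table.items():
        baseline = scores["single"]
        gains[detector] = {
            variant: round(score - baseline, 1)
            for variant, score in scores.items()
            if variant != "single"
        }
    return gains


gains = absolute_gains(BOX_MAP)
for detector, gain in gains.items():
    print(detector, gain)
```

On these figures every two-level variant gains between 1.5 and 1.7 points, and the two reported three-level variants gain 2.6 and 2.9 points.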

It is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims

1. A general object detection method based on a multi-level backbone network, wherein a multi-level backbone network based on a quasi-feedback neural network is established, and a feedback mechanism is simulated in a deep neural network by using the connections among multiple backbone networks, so that the extraction of general object features is enhanced and the object detection precision is improved; the method comprises the following steps:

1) establishing a multi-level backbone network based on a quasi-feedback neural network; each level consists of one backbone network; the backbone networks of all levels have the same structure; each backbone network comprises a plurality of stages; each stage comprises a plurality of convolutional layers; for each backbone network, the output of each stage is sent as input to the same stage of the next-level backbone network to form a quasi-feedback connection;

2) collecting an image of a general object to be detected, and inputting the image into a detector;

3) sending the image into the multi-stage backbone network based on the quasi-feedback neural network established in the step 1) to extract features, wherein the output of the multi-stage backbone network is the extracted features;

4) the features extracted from the multilevel backbone network are sent to a subsequent detector module of the multilevel backbone network for detection;

5) and taking the output of the subsequent detector module of the multilevel backbone network as the detection result of the detector.

2. The general object detection method based on a multi-level backbone network according to claim 1, wherein the detection method is applied to an autonomous driving detector, an intelligent video surveillance detector, or a remote-sensing object recognition detector.

3. The general object detection method based on a multi-level backbone network according to claim 1, wherein the subsequent detector module of the multi-level backbone network is a region proposal network (RPN).

4. The general object detection method based on a multi-level backbone network according to claim 1, wherein the backbone network of each level is a residual network (ResNet) or a multi-branch residual network (ResNeXt).

5. The general object detection method based on a multi-level backbone network according to claim 1, wherein the backbone network of each level comprises 4 stages.

6. The general object detection method based on a multi-level backbone network according to claim 1, wherein the quasi-feedback connection comprises a 1 × 1 convolutional layer and an upsampling operation; the 1 × 1 convolutional layer aligns the number of channels of the output features of a stage of the previous-level backbone network with the number of channels of the input features of the corresponding stage of the next-level backbone network, and the upsampling operation aligns the spatial sizes of the features of the corresponding stages of the two levels of backbone networks.

7. The general object detection method based on a multi-level backbone network according to claim 6, wherein the input features and the output features of the lowest stage of the backbone network of each level have the same spatial size, and the quasi-feedback connection of that stage does not include an upsampling operation.

8. The general object detection method based on a multi-level backbone network according to claim 1, wherein the detector includes but is not limited to Mask R-CNN or Cascade R-CNN.