CN109472298B - Deep bidirectional feature pyramid enhanced network for small-scale target detection - Google Patents

Deep bidirectional feature pyramid enhanced network for small-scale target detection

Info

Publication number
CN109472298B
CN109472298B (application CN201811219005.8A)
Authority
CN
China
Prior art keywords
output
scale
target
pyramid
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201811219005.8A
Other languages
Chinese (zh)
Other versions
CN109472298A (en
Inventor
Pang Yanwei (庞彦伟)
Zhu Hailong (朱海龙)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201811219005.8A priority Critical patent/CN109472298B/en
Publication of CN109472298A publication Critical patent/CN109472298A/en
Application granted granted Critical
Publication of CN109472298B publication Critical patent/CN109472298B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a deep bidirectional feature pyramid enhancement network for small-scale target detection, which comprises the following steps: determining the backbone network at the network coding end; designing a bottom-up feature pyramid; designing a top-down feature pyramid; building the target detection sub-network: the two-stage detection strategy of Faster R-CNN is adopted, consisting of a candidate-box extraction stage and a target classification stage; in the RPN stage, a convolution with a 3x3 kernel is applied to the output feature map of each scale of the top-down feature pyramid to regress target boxes and to predict the probability that each box contains a target; ROI-pooling is then performed between the screened candidate target boxes and the output feature map of the top-down feature pyramid at the corresponding scale, and finally two fully-connected layers perform box refinement and classification of the specific target category; and outputting the object detection result.

Description

Deep bidirectional feature pyramid enhanced network for small-scale target detection
Technical Field
The invention belongs to a target detection technology in the fields of computer vision, pattern recognition, deep learning, artificial intelligence and the like, and particularly relates to a technology for detecting a target in a scene by using a deep convolutional neural network in an image or a video.
Background
In the field of deep object detection, as overall detection performance has continuously improved, small-scale object detection has become a new bottleneck, and new network structures have been proposed to address it. The feature pyramid network (FPN) [1] is representative. FPN introduces the pyramid idea, widely applied in traditional image processing, into the deep object detection framework, and greatly improves detection over a large scale range, particularly the detection performance on small-scale objects. The feature pyramid in FPN is constructed in a top-down manner, is integrated with the backbone network, and can be used in single-stage or two-stage object detection methods. The feature pyramid structure in DSSD [2] is similar to FPN but more complex to construct, and is used in single-stage object detection. The authors of BlitzNet [3] attempted to solve the multi-task problem of object detection and semantic segmentation simultaneously, using a feature pyramid structure for single-stage object detection. In DSOD [4], the authors propose a bottom-up, densely connected network architecture that merges more shallow-network features in the forward direction. Although these methods improve small-object detection to a certain extent, they still fall short of the requirements of practical scenarios.
Most existing methods fuse the backbone-network features of the previous scale with the current-scale features of the pyramid network through a skip-connection module. Some use a top-down feature pyramid structure, others a bottom-up one, and the features of different scales and different semantic levels are insufficiently exploited.
One challenge facing the computer vision field is recognizing objects over a large scale range. At present, object detection methods based on deep neural networks hold an overwhelming performance advantage in the field. However, most existing methods detect large-scale objects well, while their performance on small-scale objects is unsatisfactory. A common small-scale detection failure is shown in fig. 1. The reason is that, as the network deepens and the resolution of the feature maps correspondingly decreases, the information of small-scale objects is gradually submerged in the context during feature extraction. Yet in scenarios such as autonomous driving, the performance requirements for small-scale objects are very stringent.
References:
[1] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017, July). Feature pyramid networks for object detection. In CVPR (Vol. 1, No. 2, p. 4).
[2] Fu, C.Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A.C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
[3] Dvornik, N., Shmelkov, K., Mairal, J., & Schmid, C. (2017, October). BlitzNet: A real-time deep network for scene understanding. In ICCV 2017 - International Conference on Computer Vision (p. 11).
[4] Shen, Z., Liu, Z., Li, J., Jiang, Y.G., Chen, Y., & Xue, X. (2017, October). DSOD: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV) (Vol. 3, No. 6, p. 7).
Disclosure of the Invention
In order to solve the problem that small-scale object information is gradually submerged by the background as the network deepens, the invention provides a deep bidirectional feature pyramid enhancement network for small-scale target detection, so as to improve the scale robustness of object detection algorithms. The technical scheme is as follows:
a deep bidirectional feature pyramid enhancement network for small scale target detection, comprising:
(1) Determining the backbone network at the network coding end: a residual network is used as the backbone network; it comprises 5 convolution modules, each of which starts with a pooling layer or a convolution with a stride of two (strided convolution).
(2) Designing a bottom-up feature pyramid: during construction of the bottom-up feature pyramid, each feature fusion operation adds two paths of features element-wise. One path is the output of the current-scale pooling layer or stride-two convolution layer of the backbone network, passed through a convolution layer with a 1x1 kernel that fuses channel features and adjusts the channel dimension to 256. The other path is the output of the previous feature fusion in the bottom-up pyramid, passed through a convolution layer with a 3x3 kernel, a stride of two and 256 output channels. Fusion is performed successively at three scales, starting from the third convolution module of the backbone network;
(3) Designing a top-down feature pyramid: during construction of the top-down feature pyramid, each feature fusion operation adds three paths of features element-wise. The first path is the output of the last layer of the backbone convolution module whose output scale matches the current fusion module, passed through a convolution layer with a 1x1 kernel that fuses channel features and adjusts the channel dimension to 256. The second path is the output of the feature fusion module on the bottom-up pyramid at 1/2 the output scale of the current fusion module, passed through a convolution layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2. The third path is the output of the feature fusion module on the top-down pyramid at 1/2 the output scale of the current fusion module, likewise passed through a convolution layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2. Fusion is performed successively at three scales, starting from the output of the last convolution module of the backbone network;
(4) Building the target detection sub-network: the two-stage detection strategy of Faster R-CNN is adopted, consisting of a candidate-box extraction stage and a target classification stage. In the RPN stage, a convolution with a 3x3 kernel is applied to the output feature map of each scale of the top-down feature pyramid to regress target boxes and to predict the probability that each box contains a target. ROI-pooling is then performed between the screened candidate target boxes and the output feature map of the top-down feature pyramid at the corresponding scale, and finally two fully-connected layers perform box refinement and classification of the specific target category;
(5) Outputting the object detection result: given an input image, features are extracted by the backbone network and fused by the bottom-up and top-down feature pyramids; candidate target boxes are extracted and classified on the feature maps fused by the top-down feature pyramid, and the position and scale of each target are output. The target classification stage refines the regressed position information of the candidate boxes output by the RPN stage and outputs the final position and scale, while the target category is determined by the output of the classification stage. By fusing the multi-scale feature space and semantic space at the decoding end, a high-resolution prediction map is obtained and upsampled to the scale of the input image, yielding a pixel-level semantic segmentation map of the input image.
Compared with FPN, the network structure provided by the invention integrates bidirectional pyramids, both top-down and bottom-up, and can retain more small-scale object information from the shallower, detail-rich layers of the network. Since the bottom-up pyramid structure uses a smaller number of channels, it adds only a small amount of extra computation. The bottom-up pyramid performs feature fusion in the forward part of the backbone network and adds more channels for information transmission, thereby mitigating the information loss of small objects during forward propagation. The feature fusion module in the top-down feature pyramid draws on three information sources whose features sit at different semantic levels, which increases the diversity of information and helps retain information about small-scale objects.
Drawings
Figure 1 illustrates the small-scale object detection problem. In the left picture, the seated person in black clothes is missed; in the right picture, the two innermost children are missed
FIG. 2 depicts a deep bidirectional feature pyramid enhanced network for small scale target detection proposed by the present invention
FIG. 3 depicts feature fusion operations in a bottom-up feature pyramid and a top-down feature pyramid
FIG. 4 illustrates an overall object detection network architecture
FIG. 5 shows some experimental results of Resnet50-FPN compared to PEN proposed by the present invention.
FIG. 6 shows a general embodiment of the present invention
Detailed Description
The invention provides a deep bidirectional feature pyramid enhancement network for small-scale target detection; the network structure, shown in figure 2, enhances the forward transmission of features, in particular the retention of small-scale target information. The proposed network comprises a backbone convolutional neural network, a bottom-up feature pyramid and a top-down feature pyramid. Each fusion module of the top-down feature pyramid has three input sources: the backbone network, the bottom-up feature pyramid and the top-down pyramid itself. The feature fusion modules of the pyramid structures are shown in fig. 3. The fusion module in the top-down feature pyramid comprises upsampling, convolution and element-wise addition; the fusion module in the bottom-up feature pyramid comprises downsampling, convolution and element-wise addition. In the present invention, the bottom-up feature pyramid is used to enhance an FPN-like top-down feature pyramid, so that richer small-scale target information is retained during the forward pass of the network. The enhanced bidirectional feature pyramid network can be combined with a target detection sub-network (such as Faster R-CNN) to form the overall target detection network shown in fig. 4. Both the bottom-up and the top-down pyramid structures help mitigate the information loss of small-scale targets during the forward pass of a deep network, thereby improving small-scale detection performance.
The invention mainly comprises two aspects of overall network construction and parameter learning of the network. These two aspects will be described in detail below.
The first aspect is the construction of the overall network, which can be divided into a backbone network, a bottom-up feature pyramid, a top-down feature pyramid and a target detection sub-network.
Backbone network: in our implementation, a classical residual network is used as the backbone. A specific implementation may select a suitable backbone according to the application scenario and device constraints: for scenarios with high speed requirements and limited computing performance, a lightweight backbone such as ResNet-18 may be used; when device and efficiency requirements are loose but performance requirements are strict, a more complex backbone can be adopted. Here we take ResNet-50 as an example: the 50-layer residual network contains 5 convolution modules, each starting with a pooling layer or a convolution with a stride of two (strided convolution).
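As a purely illustrative sketch (the input size of 224 is an assumption for illustration; the halving-per-module behavior follows from each module opening with a stride-two layer, as stated above), the resolution schedule of the five convolution modules can be traced as follows:

```python
def backbone_scales(input_size, num_modules=5):
    """Side length of the feature map after each backbone convolution
    module; each module opens with a pooling layer or a stride-two
    convolution and therefore halves the spatial resolution."""
    sizes = []
    size = input_size
    for _ in range(num_modules):
        size //= 2  # the stride-2 entry of each module
        sizes.append(size)
    return sizes

print(backbone_scales(224))  # [112, 56, 28, 14, 7]
```

The last three entries correspond to the three scales fused by the pyramids described below.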
Bottom-up feature pyramid: the constructing operation of the bottom-up pyramid is completed by the operation of the left graph in fig. 3, in the constructing process of the bottom-up feature pyramid, each time the feature fusion operation completes the adding operation of two paths of features by corresponding elements, one path of output of a pooling layer of a backbone network with the current scale or a convolution layer with the step length of two convolution layers is subjected to channel feature fusion and channel direction dimension adjustment by one convolution layer with the convolution kernel of 1x1, the adjusted channels are uniform 256, the other path of output is output of the convolution layer with the channel number of 256 after the previous feature fusion in the bottom-up pyramid structure is subjected to 3x3, the step length is 2. Thus, the three scales are continuously fused from the third convolution module of the backbone network.
Top-down feature pyramid: the construction of the top-down feature pyramid is completed by the operation of the right diagram in fig. 3, and in the construction process of the top-down feature pyramid, three paths of features are fused by adding corresponding elements in each feature fusion operation. The first path is output after the last layer of convolution module of the backbone network with the same output scale as the current fusion module fuses channel features through a convolution layer with convolution kernel of 1 and adjusts the direction dimension of the channel to be uniform 256, the second path is output after the output of the feature fusion module corresponding to 1/2 scale of the output scale of the current fusion module on the bottom-up pyramid passes through a convolution kernel of 3x3, the number of output channels is 256 convolution layers, and then passes through an up-sampling output with multiple of 2, and the third path is output after the output of the feature fusion module corresponding to 1/2 scale of the output scale of the current fusion module on the top-down pyramid passes through a convolution kernel of 3x3, the number of output channels is 256 convolution layers, and then passes through the up-sampling output with multiple of 2. And continuously performing fusion of three scales from the output of the last convolution module of the backbone network in the top-down feature pyramid.
Target detection sub-network: similar to FPN, a strategy of two-stage detection in the fast-rcnn is adopted, and the two-stage detection is respectively a candidate frame extraction stage and a target classification stage. The rpn (regional pro-posal network) stage performs regression of the target box and prediction of the probability of being a target or not on the output feature map of each scale of the top-down feature pyramid using convolution with a convolution kernel of 3 × 3. And performing ROI-posing on the screened candidate target frame and an output feature map of a corresponding scale top-down feature pyramid, and finally performing frame adjustment and target specific category classification by using two full-connection layers. However, there is a difference from FPN in that the fused features of each scale of the pyramid in FPN are extracted by a convolutional layer with a convolution kernel of 3 × 3 and then detected. The proposed network followed each merge by a 3x3 convolutional layer and examined on the convolutional layer's output profile.
The second aspect is the learning of the network parameters, which can be divided into the following three parts.
Training and test data preparation: to demonstrate the effectiveness of the proposed network, a database is selected and divided into a training set for learning the network parameters and a test set for verifying the overall performance of the network. Given our interest in small-scale object detection, Microsoft's public COCO dataset is a good choice: its training and test splits are already defined and it provides objective evaluation criteria. All that remains is to process the dataset into the input format required by the chosen deep learning development framework (such as Caffe, TensorFlow, Caffe2, MXNet or PyTorch) and to apply some data augmentation operations. All our experiments are based on this dataset.
Network initialization and training hyper-parameter settings: we use a ResNet-50 model pre-trained on the ImageNet image recognition database as the initial values of the backbone parameters; the remaining parameters are initialized randomly. Training is performed on a single NVIDIA Titan X GPU. The training hyper-parameters are: the number of epochs over the dataset is set to 20, the initial learning rate to 1e-2 and reduced to 1/10 of its previous value after epochs 12 and 18, and the batch size to 2.
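Assuming the usual step-decay form for "reduced to 1/10 after epochs 12 and 18" (an interpretation, not spelled out in the text), the schedule can be written as:

```python
def learning_rate(epoch, base_lr=1e-2, milestones=(12, 18)):
    """Step-decay schedule matching the description above: the rate
    is divided by 10 after each milestone epoch has finished."""
    lr = base_lr
    for m in milestones:
        if epoch > m:
            lr /= 10.0
    return lr

print(learning_rate(1))   # 0.01 for epochs 1-12
print(learning_rate(13))  # dropped by 10x for epochs 13-18
print(learning_rate(19))  # dropped by another 10x for epochs 19-20
```

With 20 epochs this gives the three plateaus 1e-2, 1e-3 and 1e-4.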
Selection of the training strategy: a two-stage training strategy is adopted. In the first stage the backbone parameters are fixed and the parameters of the bottom-up and top-down feature pyramids and of the detection sub-network are tuned until convergence; in the second stage the whole network is fine-tuned.
Effects of the embodiments: with ResNet-50 as the backbone, the overall performance of the proposed network (PEN for short) and FPN on the COCO dataset is compared in Table 1. It can be seen that the proposed PEN increases the scale robustness of detection; in particular, the detection performance on small-scale targets is markedly improved.
Fig. 5 shows comparison results between the proposed network (PEN) and FPN. With the same backbone network (ResNet-50), the proposed PEN has a clear advantage over FPN in detecting small-scale objects (such as the people and cars in the examples).
Table 1 comparison of test performance on COCO Minival dataset
[Table 1 is reproduced as an image in the original publication.]

Claims (1)

1. A small-scale target detection method based on a deep bidirectional feature pyramid enhancement network, comprising the following steps:
(1) determining the backbone network at the network coding end: a residual network is used as the backbone network; it comprises 5 convolution modules, each of which starts with a pooling layer or a convolution with a stride of two (strided convolution);
(2) designing a bottom-up feature pyramid: during construction of the bottom-up feature pyramid, each feature fusion operation adds two paths of features element-wise; one path is the output of the current-scale pooling layer or stride-two convolution layer of the backbone network, passed through a convolution layer with a 1x1 kernel that fuses channel features and adjusts the channel dimension to 256; the other path is the output of the previous feature fusion in the bottom-up pyramid, passed through a convolution layer with a 3x3 kernel, a stride of two and 256 output channels; fusion is performed successively at three scales, starting from the third convolution module of the backbone network;
(3) designing a top-down feature pyramid: during construction of the top-down feature pyramid, each feature fusion operation adds three paths of features element-wise; the first path is the output of the last layer of the backbone convolution module whose output scale matches the current fusion module, passed through a convolution layer with a 1x1 kernel that fuses channel features and adjusts the channel dimension to 256; the second path is the output of the feature fusion module on the bottom-up pyramid at 1/2 the output scale of the current fusion module, passed through a convolution layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2; the third path is the output of the feature fusion module on the top-down pyramid at 1/2 the output scale of the current fusion module, likewise passed through a convolution layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2; fusion is performed successively at three scales, starting from the output of the last convolution module of the backbone network;
(4) building the target detection sub-network: the two-stage detection strategy of Faster R-CNN is adopted, consisting of a candidate-box extraction stage and a target classification stage; in the RPN stage, a convolution with a 3x3 kernel is applied to the output feature map of each scale of the top-down feature pyramid to regress target boxes and to predict the probability that each box contains a target; ROI-pooling is then performed between the screened candidate target boxes and the output feature map of the top-down feature pyramid at the corresponding scale, and finally two fully-connected layers perform box refinement and classification of the specific target category;
(5) outputting the object detection result: given an input image, features are extracted by the backbone network and fused by the bottom-up and top-down feature pyramids; candidate target boxes are extracted and classified on the feature maps fused by the top-down feature pyramid, and the position and scale of each target are output; the target classification stage refines the regressed position information of the candidate boxes output by the RPN stage and outputs the final position and scale, while the target category is determined by the output of the classification stage; by fusing the multi-scale feature space and semantic space at the decoding end, a high-resolution prediction map is obtained and upsampled to the scale of the input image, yielding a pixel-level semantic segmentation map of the input image.
CN201811219005.8A 2018-10-19 2018-10-19 Deep bidirectional feature pyramid enhanced network for small-scale target detection Expired - Fee Related CN109472298B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811219005.8A CN109472298B (en) 2018-10-19 2018-10-19 Deep bidirectional feature pyramid enhanced network for small-scale target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811219005.8A CN109472298B (en) 2018-10-19 2018-10-19 Deep bidirectional feature pyramid enhanced network for small-scale target detection

Publications (2)

Publication Number Publication Date
CN109472298A CN109472298A (en) 2019-03-15
CN109472298B true CN109472298B (en) 2021-06-01

Family

ID=65664134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811219005.8A Expired - Fee Related CN109472298B (en) 2018-10-19 2018-10-19 Deep bidirectional feature pyramid enhanced network for small-scale target detection

Country Status (1)

Country Link
CN (1) CN109472298B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858539A (en) * 2019-01-24 2019-06-07 武汉精立电子技术有限公司 A kind of ROI region extracting method based on deep learning image, semantic parted pattern
CN110084816B (en) * 2019-03-21 2021-04-06 深圳大学 Object segmentation method, device, computer-readable storage medium and computer equipment
CN109903339B (en) * 2019-03-26 2021-03-05 南京邮电大学 Video group figure positioning detection method based on multi-dimensional fusion features
CN110084124B (en) * 2019-03-28 2021-07-09 北京大学 Feature enhancement target detection method based on feature pyramid network
CN110580699A (en) * 2019-05-15 2019-12-17 徐州医科大学 Pathological image cell nucleus detection method based on improved fast RCNN algorithm
CN110334622B (en) * 2019-06-24 2022-04-19 电子科技大学 Pedestrian retrieval method based on adaptive feature pyramid
CN110348384B (en) * 2019-07-12 2022-06-17 沈阳理工大学 Small target vehicle attribute identification method based on feature fusion
CN110378297B (en) * 2019-07-23 2022-02-11 河北师范大学 Remote sensing image target detection method and device based on deep learning and storage medium
CN111104962B (en) * 2019-11-05 2023-04-18 北京航空航天大学青岛研究院 Semantic segmentation method and device for image, electronic equipment and readable storage medium
CN111695398A (en) * 2019-12-24 2020-09-22 珠海大横琴科技发展有限公司 Small target ship identification method and device and electronic equipment
CN111242122B (en) * 2020-01-07 2023-09-08 浙江大学 Lightweight deep neural network rotating target detection method and system
CN111460926B (en) * 2020-03-16 2022-10-14 华中科技大学 Video pedestrian detection method fusing multi-target tracking clues
CN111539435A (en) * 2020-04-15 2020-08-14 创新奇智(合肥)科技有限公司 Semantic segmentation model construction method, image segmentation equipment and storage medium
CN113591872A (en) * 2020-04-30 2021-11-02 华为技术有限公司 Data processing system, object detection method and device
CN111898615A (en) * 2020-06-16 2020-11-06 济南浪潮高新科技投资发展有限公司 Feature extraction method, device, equipment and medium of object detection model
CN112528976B (en) * 2021-02-09 2021-09-21 北京世纪好未来教育科技有限公司 Text detection model generation method and text detection method
CN112634273B (en) * 2021-03-10 2021-08-13 四川大学 Brain metastasis segmentation system based on deep neural network and construction method thereof
CN113011442A (en) * 2021-03-26 2021-06-22 山东大学 Target detection method and system based on bidirectional adaptive feature pyramid
CN113111736A (en) * 2021-03-26 2021-07-13 浙江理工大学 Multi-stage characteristic pyramid target detection method based on depth separable convolution and fusion PAN
CN113705320A (en) * 2021-05-24 2021-11-26 中国科学院深圳先进技术研究院 Training method, medium, and apparatus for surgical motion recognition model
CN113378815B (en) * 2021-06-16 2023-11-24 南京信息工程大学 Scene text positioning and identifying system and training and identifying method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063623A (en) * 2010-12-28 2011-05-18 中南大学 Method for extracting image region of interest by combining bottom-up and top-down ways
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 A video-based activity recognition method using bidirectional long short-term memory units
CN107391609A (en) * 2017-07-01 2017-11-24 南京理工大学 An image captioning method based on bidirectional multi-modal recurrent networks
CN107798691A (en) * 2017-08-30 2018-03-13 西北工业大学 A vision-based real-time landmark detection and tracking method for autonomous UAV landing
CN108171752A (en) * 2017-12-28 2018-06-15 成都阿普奇科技股份有限公司 A deep-learning-based method for video detection and tracking of marine ships

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Bidirectional Feature Pyramid Network with Recurrent Attention Residual Modules for Shadow Detection; Lei Zhu et al.; Computer Vision - ECCV 2018; 2018-09-14; pp. 122-136 *
BlitzNet: A Real-Time Deep Network for Scene Understanding; Nikita Dvornik et al.; arXiv:1708.02813v1 [cs.CV]; 2017-08-09; pp. 1-10 *
Feature Pyramid Networks for Object Detection; Tsung-Yi Lin et al.; arXiv:1612.03144v2 [cs.CV]; 2017-04-19; pp. 1-10 *
Hausdorff Distance Matching Using an Edge Pyramid Structure; Wei Yanfeng et al.; Journal of Computer-Aided Design & Computer Graphics; 2004-12-31; pp. 492-497 *

Also Published As

Publication number Publication date
CN109472298A (en) 2019-03-15

Similar Documents

Publication Publication Date Title
CN109472298B (en) Deep bidirectional feature pyramid enhanced network for small-scale target detection
CN111126472B (en) Improved target detection method based on SSD (Single Shot Detector)
Xu et al. Learning deep structured multi-scale features using attention-gated crfs for contour prediction
Fu et al. Foreground gating and background refining network for surveillance object detection
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
Zhuang et al. Unsupervised learning from video with deep neural embeddings
CN111027493B (en) Pedestrian detection method based on deep learning multi-network soft fusion
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN111460914A (en) Pedestrian re-identification method based on global and local fine-grained features
CN112036260B (en) Expression recognition method and system for multi-scale sub-block aggregation in natural environment
Zhou et al. Regional attention with architecture-rebuilt 3d network for rgb-d gesture recognition
Xu et al. BANet: A balanced atrous net improved from SSD for autonomous driving in smart transportation
Zhao et al. Multifeature fusion action recognition based on key frames
Murase et al. Algan: Anomaly detection by generating pseudo anomalous data via latent variables
Kan et al. A GAN-based input-size flexibility model for single image dehazing
Cao et al. A new region proposal network for far-infrared pedestrian detection
CN113361466B (en) Multispectral target detection method based on multi-mode cross guidance learning
Peng et al. Motion boundary emphasised optical flow method for human action recognition
CN110728238A (en) Personnel re-detection method of fusion type neural network
CN113450297A (en) Fusion model construction method and system for infrared image and visible light image
Xue et al. Multi‐scale pedestrian detection with global–local attention and multi‐scale receptive field context
Han et al. Feature fusion and adversary occlusion networks for object detection
CN115861861A (en) Lightweight acceptance method based on unmanned aerial vehicle distribution line inspection
CN115761220A (en) Target detection method for enhancing detection of occluded target based on deep learning
CN111582057B (en) Face verification method based on local receptive field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210601

Termination date: 20211019