CN109472298A

CN109472298A - Deep dual feature pyramid enhancement network for small-scale object detection

Info

Publication number: CN109472298A
Application number: CN201811219005.8A
Authority: CN
Inventors: 庞彦伟; 朱海龙
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-10-19
Filing date: 2018-10-19
Publication date: 2019-03-15
Anticipated expiration: 2038-10-19
Also published as: CN109472298B

Abstract

The present invention relates to a deep dual feature pyramid enhancement network for small-scale object detection, comprising: determining the backbone network at the encoder side of the network; designing a bottom-up feature pyramid; designing a top-down feature pyramid; an object detection sub-network, which uses the two-stage detection strategy of Faster R-CNN, consisting of a candidate box extraction stage and an object classification stage, in which the RPN stage applies a convolution with a 3x3 kernel to the output feature map of each scale of the top-down feature pyramid to regress object boxes and predict the probability that a box contains an object, the candidate boxes that survive screening are ROI-pooled against the output feature map of the corresponding scale of the top-down feature pyramid, and two fully connected layers finally refine the boxes and classify the objects into specific categories; and outputting the object detection result.

Description

Deep dual feature pyramid enhancement network for small-scale object detection

Technical field

The invention belongs to object detection technology in fields such as computer vision, pattern recognition, deep learning, and artificial intelligence. More specifically, it relates to techniques that use deep convolutional neural networks to detect objects in a scene from images or video.

Background art

In deep-learning-based object detection, as overall detection performance keeps improving, the detection of small-scale objects has become the new bottleneck, and several new network structures have been proposed to address it. The feature pyramid network (FPN) [1] is a representative example. FPN introduces the pyramid idea, widely used in traditional image processing, into the deep object detection framework and achieves a large improvement in detecting objects across a wide range of scales, especially small-scale objects. The feature pyramid in FPN is constructed in a top-down manner and integrated with the backbone network, and it can be used in both single-stage and two-stage object detection methods. The feature pyramid structure in DSSD [2] is similar to that of FPN but is constructed in a more complex way, and it is used for single-stage object detection. The authors of BlitzNet [3] use a feature pyramid structure to solve the multi-task problem of object detection and semantic segmentation simultaneously, again for single-stage object detection. In DSOD [4], the authors propose a bottom-up network structure based on dense connections, which merges more shallow-layer network features in the forward pass. Although these methods improve small-object detection performance to some extent, they still fall short of the requirements of real-world scenarios.

Most existing methods fuse the backbone network feature of the previous scale with the pyramid network feature of the current scale through a skip-connection module. Some use a top-down feature pyramid structure and others a bottom-up structure, but none makes sufficient use of features at different scales and different semantic levels.

Recognizing objects across a wide range of scales is a challenge for the field of computer vision. At present, object detection methods based on deep neural networks hold an overwhelming performance advantage in object detection. However, most existing methods detect large-scale objects well but perform unsatisfactorily on small-scale objects. A common small-scale object detection problem is shown in Fig. 1. The reason is that, as the network deepens, the resolution of the feature maps decreases accordingly, and the information of small-scale objects is gradually submerged in the contextual background during feature extraction. Yet scenarios such as autonomous driving place very strict requirements on detection performance for small-scale objects.

References:

[1] Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017, July). Feature pyramid networks for object detection. In CVPR (Vol. 1, No. 2, p. 4).

[2] Fu, C. Y., Liu, W., Ranga, A., Tyagi, A., & Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.

[3] Dvornik, N., Shmelkov, K., Mairal, J., & Schmid, C. (2017, October). BlitzNet: A real-time deep network for scene understanding. In ICCV 2017 - International Conference on Computer Vision (p. 11).

[4] Shen, Z., Liu, Z., Li, J., Jiang, Y. G., Chen, Y., & Xue, X. (2017, October). DSOD: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV) (Vol. 3, No. 6, p. 7).

Summary of the invention

To mitigate the problem that small-scale object information is gradually submerged by the background as the network deepens, the invention proposes a deep dual feature pyramid enhancement network for small-scale object detection, so as to improve the scale robustness of object detection algorithms. The technical solution is as follows:

A deep dual feature pyramid enhancement network for small-scale object detection, comprising:

(1) Determining the backbone network at the encoder side of the network: a residual network is used as the backbone network; the residual network comprises 5 convolution modules, each of which starts with a pooling layer or a convolution with a stride of two (stride convolution);

(2) Designing the bottom-up feature pyramid: in constructing the bottom-up feature pyramid, each feature fusion operation is performed by element-wise addition of two feature paths. One path takes the output of the pooling layer or stride-two convolutional layer of the backbone network at the current scale and passes it through a convolutional layer with a 1x1 kernel, which fuses channel features and adjusts the channel dimension to a uniform 256 channels. The other path is the result of the previous feature fusion in the bottom-up pyramid structure after it has passed through a convolutional layer with a 3x3 kernel, a stride of two, and 256 output channels. Fusion is performed consecutively at three scales, starting from the third convolution module of the backbone network;

(3) Designing the top-down feature pyramid: in constructing the top-down feature pyramid, each feature fusion operation fuses three feature paths by element-wise addition. The first path is the output of the last layer of the backbone convolution module whose scale equals the output scale of the current fusion module, passed through a convolutional layer with a kernel size of 1 that fuses channel features and adjusts the channel dimension to a uniform 256 channels. The second path is the output of the feature fusion module of the bottom-up pyramid at 1/2 of the output scale of the current fusion module, passed through a convolutional layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2. The third path is the output of the feature fusion module of the top-down pyramid at 1/2 of the output scale of the current fusion module, passed through a convolutional layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2. In the top-down feature pyramid, fusion is performed consecutively at three scales, starting from the output of the last convolution module of the backbone network;

(4) Object detection sub-network: the two-stage detection strategy of Faster R-CNN is used, consisting of a candidate box extraction stage and an object classification stage. In the RPN stage, a convolution with a 3x3 kernel is applied to the output feature map of each scale of the top-down feature pyramid to regress object boxes and predict the probability that a box contains an object. The candidate boxes that survive screening are then ROI-pooled against the output feature map of the corresponding scale of the top-down feature pyramid, and two fully connected layers finally refine the boxes and classify the objects into specific categories;

(5) Outputting the object detection result: given an input image, features are extracted by the backbone network and fused by the bottom-up and top-down feature pyramids, and candidate boxes are extracted and classified on the fused feature maps of the top-down feature pyramid. The final position and scale of an object are obtained by taking the position information of the candidate box output by the RPN stage and adjusting it with the position regression of the object classification stage; the category of the object is determined by the output of the object classification stage. By fusing the multi-scale feature space and the semantic space at the decoder side, a high-resolution prediction map is obtained; the prediction map is upsampled to the same scale as the image, yielding a pixel-level semantic segmentation map of the input image.

Compared with FPN, the proposed network structure fuses top-down and bottom-up dual pyramids at the same time and can retain more small-scale object information from the detail-rich shallow layers. Because the bottom-up pyramid structure uses a small number of channels, it adds only a small amount of extra computation. The bottom-up pyramid fuses features along the forward path of the backbone network and provides additional channels for information to propagate, thereby reducing the loss of small-object information during information transfer. The feature fusion module in the top-down feature pyramid draws on three information sources, each at a different semantic level, which increases the diversity of information and helps retain small-scale object information.

Brief description of the drawings

Fig. 1 illustrates the small-scale object detection problem. In the left image, the squatting person in dark clothing is missed; in the right image, the two innermost children are missed.

Fig. 2 depicts the deep dual feature pyramid enhancement network for small-scale object detection proposed by the invention.

Fig. 3 depicts the feature fusion operations in the bottom-up feature pyramid and the top-down feature pyramid.

Fig. 4 illustrates the overall object detection network architecture.

Fig. 5 compares some experimental results of ResNet50-FPN and the PEN proposed by the invention.

Fig. 6 illustrates a general embodiment of the invention.

Detailed description of the embodiments

The invention proposes a deep dual feature pyramid enhancement network for small-scale object detection. The network structure, shown in Fig. 2, strengthens the forward propagation of features and in particular the preservation of small-scale object information. The proposed network comprises a backbone convolutional neural network, a semantically bottom-up feature pyramid, and a semantically top-down feature pyramid. The top-down feature pyramid has three fusion input sources: the previous scale of the backbone network, the current scale of the top-down feature pyramid, and the current scale of the bottom-up feature pyramid. The feature fusion modules of the pyramid structures are shown in Fig. 3. The feature fusion module in the top-down feature pyramid comprises upsampling, convolution, and element-wise addition. The feature fusion module in the bottom-up feature pyramid comprises downsampling, convolution, and element-wise addition. In the invention, the bottom-up feature pyramid is used to enhance a top-down feature pyramid similar to that of FPN, so that richer small-scale object information is retained during the forward pass of the network. The enhanced dual feature pyramid network can serve as the main network and be combined with an object detection sub-network (such as Fast R-CNN) to form the overall object detection network shown in Fig. 4. Both the bottom-up and the top-down feature pyramid structures help reduce the loss of small-scale object information during the forward pass of the deep network, thereby improving small-scale object detection performance.

The invention mainly covers two aspects: the construction of the overall network and the learning of the network parameters. Each is described in detail below.

The first aspect is the construction of the overall network, which can be divided into four parts: the backbone network, the bottom-up feature pyramid, the top-down feature pyramid, and the object detection sub-network.

Backbone network: in our implementation, a classical residual network is used as the backbone network. A concrete implementation can choose a suitable backbone according to the requirements of the application scenario and the device. For example, where speed requirements are high and the computing power of the device is limited, a lightweight backbone such as ResNet18 should be chosen. Where device and efficiency constraints are loose but accuracy requirements are strict, a more complex backbone can be used. Here we take ResNet50 as an example: the 50-layer residual network comprises 5 convolution modules, each of which starts with a pooling layer or a convolution with a stride of two (stride convolution).
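Because each of the 5 convolution modules begins with a stride-two operation, the resolution of the feature map halves at every module. The following minimal sketch tabulates the resulting sizes. The function name `stage_sizes` is hypothetical, and the rounding rule (odd sizes round up, as with "same"-style padding) is an assumption that the text does not state.

```python
def stage_sizes(height, width, num_stages=5):
    """Return the (height, width) of the output of each backbone module.

    Assumption: every module halves the resolution exactly once (stride 2)
    and odd sizes round up. The patent text does not specify the padding.
    """
    sizes = []
    for _ in range(num_stages):
        height = (height + 1) // 2  # ceiling division by the stride of 2
        width = (width + 1) // 2
        sizes.append((height, width))
    return sizes


if __name__ == "__main__":
    # Illustrative 224x224 input; the input size is not taken from the patent.
    for index, (h, w) in enumerate(stage_sizes(224, 224), start=1):
        print(f"convolution module {index}: {h}x{w}")
```

Under these assumptions the last three modules, which feed the two pyramids, produce maps at 1/8, 1/16, and 1/32 of the input resolution.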

Bottom-up feature pyramid: the bottom-up feature pyramid is constructed with the operation shown in the left part of Fig. 3. In constructing the bottom-up feature pyramid, each feature fusion operation is performed by element-wise addition of two feature paths. One path takes the output of the pooling layer or stride-two convolutional layer of the backbone network at the current scale and passes it through a convolutional layer with a 1x1 kernel, which fuses channel features and adjusts the channel dimension to a uniform 256 channels. The other path is the result of the previous feature fusion in the bottom-up pyramid structure after it has passed through a convolutional layer with a 3x3 kernel, a stride of 2, and 256 output channels. In this way, fusion is performed consecutively at three scales, starting from the third convolution module of the backbone network.
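The shape bookkeeping of the bottom-up fusion can be checked with a short sketch. It tracks only tensor shapes, not values. The helper names `conv_out` and `bottom_up_shapes` are hypothetical, a padding of 1 is assumed for the 3x3 stride-two convolution, and the first fused scale is assumed to consist of the lateral path alone, since the text does not say what the second input is at that scale.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def bottom_up_shapes(backbone_sizes, channels=256):
    """Return the (channels, height, width) of each bottom-up fusion output.

    backbone_sizes: (height, width) of the backbone features at the three
    fused scales, ordered from shallow to deep.
    """
    fused = []
    previous = None
    for height, width in backbone_sizes:
        # Lateral path: a 1x1 convolution keeps the size, sets 256 channels.
        lateral = (channels, conv_out(height, 1, 1, 0), conv_out(width, 1, 1, 0))
        if previous is not None:
            # Downward path: 3x3, stride 2 (padding 1 is an assumption).
            down = (
                channels,
                conv_out(previous[1], 3, 2, 1),
                conv_out(previous[2], 3, 2, 1),
            )
            if down != lateral:
                raise ValueError(f"cannot add {down} and {lateral} element-wise")
        fused.append(lateral)  # element-wise addition preserves the shape
        previous = lateral
    return fused
```

For backbone features of 28x28, 14x14, and 7x7, both paths agree at every scale, so the element-wise addition is well defined and each fused map has 256 channels.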

Top-down feature pyramid: the top-down feature pyramid is constructed with the operation shown in the right part of Fig. 3. In constructing the top-down feature pyramid, each feature fusion operation fuses three feature paths by element-wise addition. The first path is the output of the last layer of the backbone convolution module whose scale equals the output scale of the current fusion module, passed through a convolutional layer with a kernel size of 1 that fuses channel features and adjusts the channel dimension to a uniform 256 channels. The second path is the output of the feature fusion module of the bottom-up pyramid at 1/2 of the output scale of the current fusion module, passed through a convolutional layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2. The third path is the output of the feature fusion module of the top-down pyramid at 1/2 of the output scale of the current fusion module, passed through a convolutional layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2. In the top-down feature pyramid, fusion is performed consecutively at three scales, starting from the output of the last convolution module of the backbone network.
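The three-path fusion described above can be illustrated on tiny single-channel maps. This is a sketch under stated assumptions, not the patented implementation: the 1x1 and 3x3 convolutions are omitted, nearest-neighbour interpolation is assumed for the 2x upsampling (the text does not name a mode), and the function names are hypothetical.

```python
def upsample2x(feature_map):
    """Nearest-neighbour 2x upsampling of a 2-D list (assumed mode)."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]  # repeat columns
        out.append(wide)
        out.append(list(wide))  # repeat the widened row
    return out


def add_maps(*maps):
    """Element-wise sum of equally sized 2-D lists."""
    height, width = len(maps[0]), len(maps[0][0])
    for current in maps:
        if len(current) != height or any(len(row) != width for row in current):
            raise ValueError("feature maps must share one spatial size")
    return [
        [sum(current[i][j] for current in maps) for j in range(width)]
        for i in range(height)
    ]


def top_down_fuse(lateral, bottom_up_half, top_down_half):
    """Fuse the backbone path with the two upsampled half-scale paths."""
    return add_maps(lateral, upsample2x(bottom_up_half), upsample2x(top_down_half))


if __name__ == "__main__":
    print(top_down_fuse([[1, 1], [1, 1]], [[2]], [[3]]))
```

The two half-scale inputs are brought to the scale of the backbone path before the addition, which is why all three paths can be summed element by element.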

Object detection sub-network: as in FPN, the two-stage detection strategy of Faster R-CNN is used, consisting of a candidate box extraction stage and an object classification stage. In the RPN (region proposal network) stage, a convolution with a 3x3 kernel is applied to the output feature map of each scale of the top-down feature pyramid to regress object boxes and predict the probability that a box contains an object. The candidate boxes that survive screening are then ROI-pooled against the output feature map of the corresponding scale of the top-down feature pyramid, and two fully connected layers finally refine the boxes and classify the objects into specific categories. There is one difference from FPN, however. In FPN, the fused feature at each pyramid scale is led out through a convolutional layer with a 3x3 kernel before detection. In the proposed network, each fusion is immediately followed by a 3x3 convolutional layer, and detection is performed on the output feature map of that convolutional layer.
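The box adjustment performed in the second stage can be sketched as follows. The text says only that the boxes are adjusted; the centre-and-size parameterization below is the one commonly used with Faster R-CNN and is an assumption here, as is the function name `apply_deltas`.

```python
import math


def apply_deltas(box, deltas):
    """Refine a candidate box with predicted regression offsets.

    box:    (x1, y1, x2, y2) corner coordinates of the candidate box.
    deltas: (dx, dy, dw, dh), where dx and dy shift the centre in units of
            the box size and dw and dh scale the size on a log scale.
            This parameterization is assumed, not stated in the patent.
    """
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    width, height = x2 - x1, y2 - y1
    centre_x, centre_y = x1 + 0.5 * width, y1 + 0.5 * height

    new_centre_x = centre_x + dx * width
    new_centre_y = centre_y + dy * height
    new_width = width * math.exp(dw)
    new_height = height * math.exp(dh)

    return (
        new_centre_x - 0.5 * new_width,
        new_centre_y - 0.5 * new_height,
        new_centre_x + 0.5 * new_width,
        new_centre_y + 0.5 * new_height,
    )
```

With all offsets equal to zero the candidate box is returned unchanged, so the regression only has to learn corrections relative to the RPN output.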

The second aspect is the learning of the network parameters, which can be divided into the following three parts.

Training and test data preparation: to demonstrate the effectiveness of the proposed network, a database must be selected and divided into a training set and a test set. The training set is used to learn the network parameters, and the test set is used to evaluate the overall performance of the network. Since we focus on the detection of small-scale objects, the COCO dataset released by Microsoft is a good choice: it is already split into training and test sets and provides relatively objective evaluation criteria. All that remains is to convert the dataset into the form required by the network input and to apply some data augmentation, which depends on the chosen deep learning framework, such as Caffe, TensorFlow, Caffe2, MXNet, or PyTorch. All our experiments are based on this dataset.

Network initialization and training hyperparameter settings: we use a ResNet50 model trained on the ImageNet image recognition database as the initial value of the backbone network parameters; the remaining parameters are randomly initialized. Training is carried out on a single NVIDIA TITAN X GPU. The training hyperparameters are as follows: the number of passes over the dataset (epochs) is set to 20; the initial learning rate is set to 1e-2 and is reduced to 1/10 of its current value after the 12th epoch and again after the 18th epoch; the batch size is set to 2.
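The learning rate schedule can be written down as a small function. This sketch assumes 1-indexed epochs and reads the two reductions as successive, so the rate is 1e-2, then 1e-3, then 1e-4. The constant and function names are hypothetical.

```python
INITIAL_LR = 1e-2       # initial learning rate from the text
NUM_EPOCHS = 20         # passes over the dataset
DECAY_AFTER = (12, 18)  # the rate drops after these epochs finish
BATCH_SIZE = 2


def learning_rate(epoch):
    """Learning rate used during the given 1-indexed epoch."""
    if not 1 <= epoch <= NUM_EPOCHS:
        raise ValueError(f"epoch must be in 1..{NUM_EPOCHS}")
    # Count how many decay boundaries have already been passed.
    drops = sum(1 for boundary in DECAY_AFTER if epoch > boundary)
    return INITIAL_LR * (0.1 ** drops)


if __name__ == "__main__":
    for epoch in (1, 12, 13, 18, 19, 20):
        print(epoch, learning_rate(epoch))
```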

Choice of training strategy: we use a two-stage training strategy. In the first stage, the parameters of the backbone network are frozen, and the parameters of the bottom-up and top-down feature pyramids and of the detection sub-network are adjusted until convergence. In the second stage, the whole network is fine-tuned jointly.
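The two training stages differ only in which parameter groups receive updates. A minimal sketch, with hypothetical group names standing in for the actual parameters of a framework-specific model:

```python
# Hypothetical names for the four parts of the network described above.
PARAMETER_GROUPS = (
    "backbone",
    "bottom_up_pyramid",
    "top_down_pyramid",
    "detection_subnetwork",
)


def trainable_groups(stage):
    """Map each parameter group to whether it is updated in this stage."""
    if stage == 1:
        # Stage 1: the backbone is frozen, everything else is trained.
        return {group: group != "backbone" for group in PARAMETER_GROUPS}
    if stage == 2:
        # Stage 2: the whole network is fine-tuned together.
        return {group: True for group in PARAMETER_GROUPS}
    raise ValueError("stage must be 1 or 2")
```

In a deep learning framework this mapping would typically be applied by toggling gradient computation for the parameters of each group before the corresponding stage starts.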

Effect of the embodiment: with ResNet50 selected as the backbone network in both cases, the overall performance of the proposed network (abbreviated PEN) and of FPN on the COCO dataset is compared in Table 1. It can be seen that the proposed PEN improves the scale robustness of detection, and the detection performance for small-scale objects in particular is clearly improved.

Fig. 5 shows some comparison results between the proposed network (pyramid enhancement network, abbreviated PEN) and FPN. With the same backbone network (such as ResNet50), the proposed PEN has a large advantage over FPN in detecting small-scale objects (such as the people and cars by the railway).

Table 1. Comparison of test performance on the COCO minival dataset

Claims

1. A deep dual feature pyramid enhancement network for small-scale object detection, comprising:

(1) Determining the backbone network at the encoder side of the network: a residual network is used as the backbone network; the residual network comprises 5 convolution modules, each of which starts with a pooling layer or a convolution with a stride of two (stride convolution);

(2) Designing the bottom-up feature pyramid: in constructing the bottom-up feature pyramid, each feature fusion operation is performed by element-wise addition of two feature paths; one path takes the output of the pooling layer or stride-two convolutional layer of the backbone network at the current scale and passes it through a convolutional layer with a 1x1 kernel, which fuses channel features and adjusts the channel dimension to a uniform 256 channels; the other path is the result of the previous feature fusion in the bottom-up pyramid structure after it has passed through a convolutional layer with a 3x3 kernel, a stride of two, and 256 output channels; fusion is performed consecutively at three scales, starting from the third convolution module of the backbone network;

(3) Designing the top-down feature pyramid: in constructing the top-down feature pyramid, each feature fusion operation fuses three feature paths by element-wise addition; the first path is the output of the last layer of the backbone convolution module whose scale equals the output scale of the current fusion module, passed through a convolutional layer with a kernel size of 1 that fuses channel features and adjusts the channel dimension to a uniform 256 channels; the second path is the output of the feature fusion module of the bottom-up pyramid at 1/2 of the output scale of the current fusion module, passed through a convolutional layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2; the third path is the output of the feature fusion module of the top-down pyramid at 1/2 of the output scale of the current fusion module, passed through a convolutional layer with a 3x3 kernel and 256 output channels and then upsampled by a factor of 2; in the top-down feature pyramid, fusion is performed consecutively at three scales, starting from the output of the last convolution module of the backbone network;

(4) Object detection sub-network: the two-stage detection strategy of Faster R-CNN is used, consisting of a candidate box extraction stage and an object classification stage; in the RPN stage, a convolution with a 3x3 kernel is applied to the output feature map of each scale of the top-down feature pyramid to regress object boxes and predict the probability that a box contains an object; the candidate boxes that survive screening are then ROI-pooled against the output feature map of the corresponding scale of the top-down feature pyramid, and two fully connected layers finally refine the boxes and classify the objects into specific categories;

(5) Outputting the object detection result: given an input image, features are extracted by the backbone network and fused by the bottom-up and top-down feature pyramids, and candidate boxes are extracted and classified on the fused feature maps of the top-down feature pyramid; the final position and scale of an object are obtained by taking the position information of the candidate box output by the RPN stage and adjusting it with the position regression of the object classification stage; the category of the object is determined by the output of the object classification stage; by fusing the multi-scale feature space and the semantic space at the decoder side, a high-resolution prediction map is obtained; the prediction map is upsampled to the same scale as the image, yielding a pixel-level semantic segmentation map of the input image.