CN111612065A - Multi-scale feature object detection algorithm based on ratio-adaptive pooling
Multi-scale feature object detection algorithm based on ratio-adaptive pooling

- Publication number: CN111612065A
- Application number: CN202010433145.6A
- Authority: CN (China)
- Prior art keywords: rois, training, feature, fpn, rpn
- Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Classifications
- G06F18/2415 — Pattern recognition; classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. likelihood ratio or false-acceptance versus false-rejection rate
- G06N3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/25 — Image or video recognition or understanding; image preprocessing; determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/464 — Image or video recognition or understanding; salient features, e.g. scale-invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
Abstract
The invention relates to a multi-scale feature object detection algorithm based on ratio-adaptive pooling (RAP). The method comprises the following steps: (1) collecting a large number of images, dividing them into a training set and a test set in a fixed proportion, and preprocessing the training set; (2) inputting the training set into a pre-trained convolutional neural network (ResNet50) for feature extraction to obtain the corresponding feature maps; (3) embedding an RPN in the RAP-combined FPN structure to generate features at different scales and training the RPN; (4) performing RoI Pooling on the RoIs of different scales generated in step (3), then computing the loss, classification, and finer bounding-box regression; (5) inputting the test-set images into the trained detection model to output the detection results. The method effectively alleviates the semantic-information loss of the FPN during fusion and improves detection accuracy.
Description
Technical Field
The invention relates to the field of image processing, and in particular to a multi-scale feature object detection algorithm based on ratio-adaptive pooling (hereinafter referred to as RAP).
Background
Target detection is widely applied in pedestrian detection, intelligent driving assistance, intelligent surveillance, flame and smoke detection, intelligent robotics, and other fields. Although target detection technology has developed rapidly, many problems remain, and improving detection accuracy has always been a central difficulty.
The feature pyramid network (hereinafter abbreviated as FPN) is mainly used to address the multi-scale problem of targets. It is a top-down structure of feature layers. Because the FPN reduces dimensionality through 1 × 1 convolutions in its top-down and lateral-connection paths, the number of channels is reduced and semantic information can be lost. In particular, the top layer of the FPN is directly reduced in dimension by a 1 × 1 convolution and then passed through a 3 × 3 convolution to generate a new top layer for final prediction; since this top layer is not fused with information from any other layer, the channel reduction loses semantic information.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-scale feature object detection algorithm based on ratio-adaptive pooling, which enhances semantic information during the top-down and lateral-connection fusion of the FPN and improves multi-scale object detection accuracy.
The technical scheme of the invention is as follows:
A multi-scale feature object detection algorithm based on ratio-adaptive pooling, which improves the traditional top-down feature pyramid network fusion structure; the method comprises the following steps:
the image to be measured is input to a convolutional neural network (ResNet50), CxC ═ C, characteristic map generated on behalf of each module of ResNet2,C3,C4,C5The overall framework retains the original structure of the FPN, and the transverse connection is that C is reduced to 256 dimension through convolution of 1 × 1 to M (M) as the transverse connection2,M3,M4,M5Then through the top most M5Enhanced by RAP module, output is recorded as P5;
P5 is upsampled and merged with the laterally connected M4; the output is denoted P4. This continues layer by layer until the last feature map P2 is output, which completes the enhancement process. P = {P2, P3, P4, P5} is sent on to subsequent detection.
The specific operation of the RAP enhancement is as follows: the topmost FPN feature M5 passes through a ratio-adaptation step whose pooling mode is adaptive average pooling; the pooling coefficients are chosen here as α = [0.1, 0.2, 0.3], and the three output feature maps of different resolutions are denoted {A, B, C};
then {A, B, C} are each passed through a 3 × 3 convolution and upsampled by bilinear interpolation back to the original input resolution; the outputs are denoted {E, F, G};
finally, the three feature maps of equal resolution are concatenated, the channel count is reduced to 256 by a 1 × 1 convolution, and the final enhancement result is output. At this point the ratio-adaptive pooling process is complete.
An RPN is embedded in the improved FPN for multi-scale feature fusion. For the layers {P2, P3, P4, P5}, anchor areas of {32², 64², 128², 256², 512²} are defined, and each layer additionally uses 3 aspect ratios {1:2, 1:1, 2:1}, so the whole feature pyramid has 15 kinds of anchors;
when training the RPN, only 256 anchors are selected for training, and the proportion of positive samples to negative samples is 1:1 approximately. The positive and negative samples are delimited, in order to ensure that at least a positive sample participates in training, for each real frame, the anchor with the maximum IoU (cross-over ratio) is taken as the positive sample; in the remaining anchors, if IoU of the corresponding anchor and any one of the real boxes is greater than 0.7, the corresponding anchor is taken as a positive sample of training, and if IoU of the anchor and any one of the real boxes is less than 0.3, the anchor is taken as a negative sample of training.
While the RPN is being trained, it also generates RoIs to send to the Fast R-CNN network. The RPN selects a large number of anchors (e.g., 12000) and roughly corrects their positions to obtain RoIs, then uses non-maximum suppression (NMS) to select the 2000 highest-scoring RoIs from those 12000.
The network then selects 128 of these 2000 RoIs for training. The sampling rule takes RoIs whose IoU with a ground-truth box is greater than 0.5 as positive samples (say N of them), and draws the remaining (128 − N) negative samples from RoIs whose IoU with every ground-truth box is below 0.5 and at least 0 (or 0.1), keeping the positive-to-negative ratio at approximately 1:3.
The 128 selected RoIs then undergo the subsequent operations of RoI Pooling, loss calculation, classification, finer bounding-box regression, and so on. The loss typically adopted for position regression is Smooth_L1 loss, and the cross-entropy loss is typically adopted for the classification problem. When computing the position-regression (Smooth L1) loss, negative samples do not participate.
The training sample set is the open-source PASCAL VOC2007 trainval dataset, 5011 pictures in total. Data preprocessing applies only horizontal and vertical flips, each with probability 0.5, and the bounding-box coordinate information is changed accordingly.
The data are input into the improved network for training. The experimental environment runs Ubuntu 16.04, with two GeForce GTX 1080Ti graphics cards (11 GB of video memory each), and the deep learning framework is PyTorch. In the experiments, ResNet50 is the feature-extraction network, Faster R-CNN + FPN + RAP is the object-detection framework, and mAP (mean Average Precision) is the evaluation metric. The network weights are initialized with the ImageNet-pretrained weights from the official website and optimized with stochastic gradient descent (SGD); the learning rate is 0.04, training lasts 14 epochs (one epoch is one pass over the 5011-image dataset), the rate is decayed in the tenth epoch, and the weight-decay coefficient is 0.0001.
The test set is the open-source PASCAL VOC2007 test dataset, 4952 pictures, which are input into the trained improved network for testing; the network weights are those saved in the last epoch. The picture size used in training and testing is (1000, 600): images are scaled toward a shorter side of 600 pixels while the longer side is kept under 1000 pixels.
The beneficial technical effects of the invention are as follows:
1. The application discloses a multi-scale feature object detection algorithm based on ratio-adaptive pooling (RAP). It improves the multi-layer feature-information fusion structure of the classic feature pyramid network (FPN) by adding a ratio-adaptive module and changing the position at which the 3 × 3 convolution in the original FPN acts, which enhances the semantic information of the FPN top layer and improves detection accuracy.
2. The RAP structure is particularly simple and can effectively improve object detection accuracy while adding only a small amount of computation compared with the FPN structure;
3. The RAP is also a portable structure: it can act on any multi-scale feature process, not just the feature pyramid.
Drawings
FIG. 1 is a flow chart of the object detection algorithm in the present application.
FIG. 2 is a schematic diagram of the FPN combined with the ratio-adaptive pooling (RAP) structure.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A multi-scale feature object detection algorithm based on ratio-adaptive pooling (RAP) is disclosed in the present application. Because the FPN reduces dimensionality through 1 × 1 convolutions in its top-down and lateral-connection paths, channels are reduced and semantic information is lost; the uppermost layer of the FPN is directly reduced in dimension by a 1 × 1 convolution and then passed through a 3 × 3 convolution to generate a new top layer for final prediction, and since that layer is not fused with information from other layers, the channel reduction loses semantic information. The method improves the multi-layer feature-information fusion structure of the classic FPN by adding a ratio-adaptive module and changing the position at which the 3 × 3 convolution in the original FPN acts, enhancing the semantic information of the FPN top layer and improving detection accuracy.
Before the method disclosed by this invention uses Faster R-CNN to detect targets, the Faster R-CNN must be trained; the method is therefore divided into two parts, the first being the model-training part and the second the target-detection part on the test set. The main flow is shown in FIG. 1. The method mainly comprises the following steps:
the first part mainly comprises the following steps:
(1) Acquire the sample sets to be used. The training sample set is the open-source PASCAL VOC2007 trainval dataset (5011 pictures) and the test sample set is the open-source PASCAL VOC2007 test dataset (4952 pictures). Data preprocessing applies only horizontal and vertical flips, each with probability 0.5, and the bounding-box coordinate information is changed accordingly, as the sketch below illustrates.
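By way of illustration, the flip-plus-coordinate update can be sketched as follows; this is a minimal sketch, and the function name and (x1, y1, x2, y2) array convention are assumptions rather than the patent's actual code:

```python
import random
import numpy as np

def random_flip(image, boxes, p=0.5):
    """Flip an HxWxC image horizontally and/or vertically with
    probability p each, updating (x1, y1, x2, y2) boxes to match."""
    h, w = image.shape[:2]
    boxes = boxes.copy()
    if random.random() < p:                          # horizontal flip
        image = image[:, ::-1, :]
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]      # mirror and swap x1, x2
    if random.random() < p:                          # vertical flip
        image = image[::-1, :, :]
        boxes[:, [1, 3]] = h - boxes[:, [3, 1]]      # mirror and swap y1, y2
    return image, boxes

img, bxs = random_flip(np.zeros((600, 1000, 3)),
                       np.array([[10., 20., 110., 220.]]))
```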
(2) Read in the weights of the base network ResNet50 pre-trained on ImageNet and take them as the initial parameters of the convolutional neural network. The training data are input into the network, and the convolutional neural network extracts features from the images: the convolution kernels compute feature maps that generally become smaller and smaller, and feature layers whose outputs have the same size are said to belong to the same network stage. For ResNet, the feature activations output by the final residual structure of each stage are used; these residual-block outputs are denoted {C2, C3, C4, C5}, corresponding to the outputs of conv2, conv3, conv4 and conv5, and can be collected as sketched below.
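A sketch of collecting the stage outputs, assuming torchvision's ResNet-50 (loading the actual ImageNet weights is omitted here for brevity):

```python
import torch
from torchvision.models import resnet50

backbone = resnet50()  # in practice, initialized with ImageNet-pretrained weights

def resnet_stages(x):
    """Return the final residual-block output of each ResNet stage."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)    # conv2_x output, stride 4,  256 channels
    c3 = backbone.layer2(c2)   # conv3_x output, stride 8,  512 channels
    c4 = backbone.layer3(c3)   # conv4_x output, stride 16, 1024 channels
    c5 = backbone.layer4(c4)   # conv5_x output, stride 32, 2048 channels
    return c2, c3, c4, c5

with torch.no_grad():
    c2, c3, c4, c5 = resnet_stages(torch.randn(1, 3, 600, 1000))
```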
(3) Add the RAP structure to the FPN; the implementation steps are as follows:
(3.1) The image to be detected is input into a convolutional neural network (ResNet50); the feature maps generated by the ResNet stages are denoted C = {C2, C3, C4, C5}. The overall framework retains the original structure of the FPN: in the lateral connections, each Cx is reduced to 256 channels by a 1 × 1 convolution, giving M = {M2, M3, M4, M5}. The topmost feature M5 is then enhanced by the RAP module, and the output is denoted P5;
(3.2) P5 is upsampled and merged with the laterally connected M4; the output is denoted P4. This continues layer by layer until the last feature map P2 is output, which completes the enhancement process. P = {P2, P3, P4, P5} is sent on to subsequent detection.
(3.3) The specific operation of the RAP enhancement is as follows: the topmost FPN feature M5 passes through a ratio-adaptation step whose pooling mode is adaptive average pooling; the pooling coefficients are chosen here as α = [0.1, 0.2, 0.3], and the three output feature maps of different resolutions are denoted {A, B, C};
(3.4) {A, B, C} are each passed through a 3 × 3 convolution and upsampled by bilinear interpolation back to the original input resolution; the outputs are denoted {E, F, G};
(3.5) Finally, the three feature maps of equal resolution are concatenated, the channel count is reduced to 256 by a 1 × 1 convolution, and the final enhancement result is output, as sketched below. At this point the ratio-adaptive pooling process is complete.
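A minimal PyTorch sketch of steps (3.3)–(3.5). The text does not spell out how each coefficient α maps to a pooled size, so computing it as α times the input resolution is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAP(nn.Module):
    """Ratio-adaptive pooling enhancement for the FPN top feature M5."""

    def __init__(self, channels=256, ratios=(0.1, 0.2, 0.3)):
        super().__init__()
        self.ratios = ratios
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in ratios)
        self.reduce = nn.Conv2d(channels * len(ratios), channels, 1)

    def forward(self, m5):
        h, w = m5.shape[-2:]
        branches = []
        for alpha, conv in zip(self.ratios, self.convs):
            # adaptive average pooling to a ratio of the input resolution (assumed)
            size = (max(1, int(h * alpha)), max(1, int(w * alpha)))
            a = F.adaptive_avg_pool2d(m5, size)            # {A, B, C}
            e = F.interpolate(conv(a), size=(h, w),        # {E, F, G}
                              mode="bilinear", align_corners=False)
            branches.append(e)
        # concatenate the three same-resolution maps, 1x1 conv back to 256
        return self.reduce(torch.cat(branches, dim=1))

p5 = RAP()(torch.randn(1, 256, 19, 32))   # enhanced top-level feature P5
```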
(4) Train the RPN network; the concrete steps are as follows:
(4.1) An RPN is embedded in the improved FPN for multi-scale feature fusion. For the layers {P2, P3, P4, P5}, anchor areas of {32², 64², 128², 256², 512²} are defined, and each layer additionally uses 3 aspect ratios {1:2, 1:1, 2:1}, so the whole feature pyramid has 15 kinds of anchors;
(4.2) When training the RPN, only 256 anchors are selected for training, with positive and negative samples in an approximately 1:1 ratio. Positive and negative samples are defined as follows: to ensure that at least one positive sample participates in training, for each ground-truth box the anchor with the highest IoU (intersection over union) is taken as a positive sample; among the remaining anchors, any anchor whose IoU with some ground-truth box exceeds 0.7 is taken as a positive training sample, and any anchor whose IoU with every ground-truth box is below 0.3 is taken as a negative training sample. A sketch of this rule follows.
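A sketch of the labelling rule using torchvision's box_iou; the thresholds follow the text, while the tensor layout and label encoding are assumptions:

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return 1 for positive, 0 for negative, -1 for ignored anchors.
    anchors: (A, 4) and gt_boxes: (G, 4), both in (x1, y1, x2, y2)."""
    iou = box_iou(anchors, gt_boxes)                  # (A, G) overlap matrix
    best, _ = iou.max(dim=1)                          # best IoU per anchor
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)
    labels[best < lo] = 0                             # IoU < 0.3 -> negative
    labels[best > hi] = 1                             # IoU > 0.7 -> positive
    labels[iou.argmax(dim=0)] = 1                     # best anchor per gt box
    return labels                                     # then sample 256 at ~1:1
```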
(4.3) While the RPN is trained, it also generates RoIs to send to the Fast R-CNN network. The RPN selects a large number of anchors (e.g., 12000) and roughly corrects their positions to obtain RoIs, then uses non-maximum suppression (NMS) to select the 2000 highest-scoring RoIs from those 12000, as sketched below.
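The proposal filtering can be sketched with torchvision's nms; the NMS IoU threshold of 0.7 is a common default and an assumption here, as the text does not state it:

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes, scores, pre_nms=12000, post_nms=2000, thr=0.7):
    """Keep the highest-scoring decoded anchors, suppress overlaps,
    and return the top post_nms surviving RoIs."""
    order = scores.argsort(descending=True)[:pre_nms]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, thr)[:post_nms]   # nms keeps descending-score order
    return boxes[keep], scores[keep]
```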
(4.4) The network then selects 128 of these 2000 RoIs for training. The sampling rule takes RoIs whose IoU with a ground-truth box is greater than 0.5 as positive samples (say N of them), and draws the remaining (128 − N) negative samples from RoIs whose IoU with every ground-truth box is below 0.5 and at least 0 (or 0.1), keeping the positive-to-negative ratio at approximately 1:3 (see the sketch below).
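A sketch of the 128-RoI mini-batch sampling; the lower IoU bound for negatives follows the text's "0 (or 0.1)", and the function signature is an assumption:

```python
import torch

def sample_rois(max_ious, batch=128, pos_fraction=0.25,
                pos_thr=0.5, neg_lo=0.1):
    """max_ious: best IoU of each RoI with any ground-truth box.
    Returns indices of ~32 positives and ~96 negatives (about 1:3)."""
    pos = torch.where(max_ious > pos_thr)[0]
    neg = torch.where((max_ious < pos_thr) & (max_ious >= neg_lo))[0]
    n_pos = min(pos.numel(), int(batch * pos_fraction))
    pos = pos[torch.randperm(pos.numel())[:n_pos]]
    neg = neg[torch.randperm(neg.numel())[:batch - n_pos]]
    return pos, neg
```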
(4.5) The 128 selected RoIs proceed to the subsequent RoI Pooling, where RoIs of different scales use different feature layers as the input of the RoI Align layer: a larger RoI uses a later pyramid layer, such as P5, while a smaller RoI uses an earlier layer, such as P4. A coefficient k is defined to determine which layer Pk a given RoI is assigned to:

k = ⌊k0 + log2(√(w·h) / 224)⌋

where 224 is the standard ImageNet input size; k0 is a reference value, set to 5, representing the P5 layer output (a RoI of the canonical 224 × 224 size maps to P5); and w and h are the width and height of the RoI. Assuming a RoI of 112 × 112, then k = k0 − 1 = 5 − 1 = 4, meaning that RoI should use the P4 feature layer. The value of k is rounded down in case the result is not an integer.
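The assignment rule as a small function; the clamp to the available levels P2–P5 is standard FPN practice and an assumption beyond the quoted formula:

```python
import math

def roi_level(w, h, k0=5, canonical=224, k_min=2, k_max=5):
    """k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

print(roi_level(112, 112))   # -> 4: a 112x112 RoI is pooled from P4
print(roi_level(448, 448))   # -> 5: larger RoIs fall on the top level
```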
(5) Compute the loss, classification and regression; when computing the position-regression loss, the negative samples do not participate. The loss typically adopted for position regression is Smooth_L1 loss, and the cross-entropy loss is typically adopted for the classification problem. The regression loss has the form

L_loc(v, v̂) = Σ_{i ∈ {x, y, w, h}} smooth_L1(v_i − v̂_i),  where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise,

where v = (vx, vy, vw, vh) are the ground-truth translation and scaling parameters and v̂ = (v̂x, v̂y, v̂w, v̂h) are the predicted translation and scaling parameters.
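A sketch of that regression loss in PyTorch; beta = 1 matches the classic Smooth L1 definition above, and summing over the four offsets follows the formula (applying it to positive samples only is per the text):

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Quadratic below beta, linear above; shapes (N, 4) -> (N,)."""
    diff = (pred - target).abs()
    per_coord = torch.where(diff < beta,
                            0.5 * diff ** 2 / beta,
                            diff - 0.5 * beta)
    return per_coord.sum(dim=-1)      # sum over (tx, ty, tw, th)

loss = smooth_l1(torch.zeros(8, 4), torch.ones(8, 4)).mean()
```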
In training, the experimental environment runs Ubuntu 16.04, with two GeForce GTX 1080Ti graphics cards (11 GB of video memory each), and the deep learning framework is PyTorch. In the experiments, ResNet50 is the feature-extraction network, Faster R-CNN + FPN + RAP is the object-detection framework, and mAP (mean Average Precision) is the evaluation metric. The network weights are initialized with the ImageNet-pretrained weights from the official website and optimized with stochastic gradient descent (SGD); the learning rate is 0.04, training lasts 14 epochs (one epoch is one pass over the 5011-image dataset), the rate is decayed in the tenth epoch, and the weight-decay coefficient is 0.0001.
The learning rates used in training are shown in Table 1.
Table 1. Network training parameters for this experiment

Epoch | Learning rate | Momentum | Weight decay
---|---|---|---
1–10 | 4e-3 | 0.9 | 0.0001
10–14 | 4e-4 | 0.9 | 0.0001
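Table 1's schedule maps onto PyTorch's SGD and MultiStepLR as follows; the model here is a stand-in, and the exact rates are taken from the table rather than the running text:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(256, 256, 3, padding=1)      # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=4e-3,
                            momentum=0.9, weight_decay=1e-4)
# decay the learning rate 10x after epoch 10 (4e-3 -> 4e-4, per Table 1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[10], gamma=0.1)
for epoch in range(14):
    # ... one full pass over the 5011 training images would go here ...
    optimizer.step()        # placeholder; the real update follows loss.backward()
    scheduler.step()
```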
The second part is the target-detection part. After the fused Faster R-CNN + FPN + RAP network is obtained through training, it is used to detect targets in the image to be tested, with the following steps:
(1) Preprocess the image to be detected. The training set and the test set use single-scale pictures: the longer side is at most 1000 pixels and the shorter side is scaled toward 600 pixels (see the sketch below).
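One way to realize the (1000, 600) size rule, assuming the usual shortest-side/longest-side convention; the helper name is illustrative:

```python
def target_size(w, h, short=600, long=1000):
    """Scale so the short side reaches 600 unless the long side
    would exceed 1000; returns the new (w, h)."""
    scale = min(short / min(w, h), long / max(w, h))
    return round(w * scale), round(h * scale)

print(target_size(500, 375))    # -> (800, 600)
print(target_size(1200, 400))   # -> (1000, 333): capped by the long side
```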
(2) Import the image to be tested into the convolutional neural network ResNet50, which extracts features from the input image. With the RPN embedded in the FPN, the test data's features generate feature maps; the network performs coarse foreground/background classification and coarse box regression, and finally performs finer classification and box regression through softmax.
(3) The number of convolutional-layer channels in all the above operations is 256; the convolutional neural network is the residual network ResNet50; the classifier is a softmax classifier; and mAP (mean Average Precision) is used as the evaluation metric.
(4) The training sample set is the open-source PASCAL VOC2007 dataset, whose training targets cover all 20 categories, including common classes such as car, cat, aeroplane, person, bicycle and horse. The test results are shown in Table 2:
Table 2. Detection results based on the improved FPN fusion structure
Here, Baseline denotes the detection result of Faster R-CNN (ResNet-50) + FPN, and Ours denotes the detection result of the Faster R-CNN (ResNet-50) + FPN + RAP algorithm.
(4.1) As Table 2 shows, the RAP network effectively improves object detection accuracy. Using the FPN combined with the RAP structure of FIG. 2, the mAP improves by 1.6% relative to the Baseline (the original FPN structure), and the accuracy of every category is higher than the Baseline's, demonstrating the effectiveness of the improved RAP structure.
What has been described above is only a preferred embodiment of the present application, and the present invention is not limited to the above embodiment. It is to be understood that other modifications and variations directly derivable or suggested by those skilled in the art without departing from the spirit and concept of the present invention are to be considered as included within the scope of the present invention.
Claims (6)
1. A multi-scale feature object detection algorithm based on ratio-adaptive pooling, which improves the traditional top-down feature pyramid network fusion structure, the method comprising the following steps:
the image to be measured is input to a convolutional neural network (ResNet50), CxC ═ C, characteristic map generated on behalf of each module of ResNet2,C3,C4,C5}, integral frameThe original structure of FPN is kept, and the transverse connection is that C is reduced to 256 dimension through 1 × 1 convolution and M is recorded as M ═ M { (M) }2,M3,M4,M5Then through the top most M5Enhanced by RAP module, output is recorded as P5;
P5 is upsampled and merged with the laterally connected M4; the output is denoted P4; this continues layer by layer until the last feature map P2 is output, which completes the enhancement process, and P = {P2, P3, P4, P5} is sent on to subsequent detection;
wherein the specific operation of the RAP enhancement is as follows: the topmost FPN feature M5 passes through a ratio-adaptation step whose pooling mode is adaptive average pooling; the pooling coefficients are chosen as α = [0.1, 0.2, 0.3], and the three output feature maps of different resolutions are denoted {A, B, C};
then {A, B, C} are each passed through a 3 × 3 convolution and upsampled by bilinear interpolation back to the original input resolution; the outputs are denoted {E, F, G};
and finally, the three feature maps of equal resolution are concatenated, the channel count is reduced to 256 by a 1 × 1 convolution, and the final enhancement result is output, completing the ratio-adaptive pooling process.
2. The method of claim 1, wherein an RPN is embedded in the improved FPN for multi-scale feature fusion; for the layers {P2, P3, P4, P5}, anchor areas of {32², 64², 128², 256², 512²} are defined, and each layer additionally uses 3 aspect ratios {1:2, 1:1, 2:1}, so the whole feature pyramid has 15 kinds of anchors.
3. The method as claimed in claims 1 and 2, wherein, when training the RPN, only 256 anchors are selected for training, with positive and negative samples in an approximately 1:1 ratio; positive and negative samples are defined as follows: to ensure that at least one positive sample participates in training, for each ground-truth box the anchor with the highest IoU (intersection over union) is taken as a positive sample; among the remaining anchors, any anchor whose IoU with some ground-truth box exceeds 0.7 is taken as a positive training sample, and any anchor whose IoU with every ground-truth box is below 0.3 is taken as a negative training sample.
4. The method according to claim 3, wherein, while the RPN is trained, the RoIs generated by the RPN are sent to the Fast R-CNN network; the RPN selects a large number of anchors (e.g., 12000) and roughly corrects their positions to obtain RoIs, then uses non-maximum suppression (NMS) to select the 2000 highest-scoring RoIs from those 12000.
5. The method of claim 4, wherein the network selects 128 of the 2000 RoIs for training; the sampling rule takes RoIs whose IoU with a ground-truth box is greater than 0.5 as positive samples (say N of them), and draws the remaining (128 − N) negative samples from RoIs whose IoU with every ground-truth box is below 0.5 and at least 0 (or 0.1), keeping the positive-to-negative ratio at approximately 1:3.
6. The method of claim 5, wherein the 128 selected RoIs undergo subsequent RoI Pooling, loss calculation, classification, finer bounding-box regression, and so on; the loss typically adopted for position regression is Smooth_L1 loss, and the cross-entropy loss is typically adopted for the classification problem; when the position-regression (Smooth L1) loss is computed, the negative samples do not participate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010433145.6A CN111612065A (en) | 2020-05-21 | 2020-05-21 | Multi-scale characteristic object detection algorithm based on ratio self-adaptive pooling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010433145.6A CN111612065A (en) | 2020-05-21 | 2020-05-21 | Multi-scale characteristic object detection algorithm based on ratio self-adaptive pooling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111612065A true CN111612065A (en) | 2020-09-01 |
Family
ID=72201920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010433145.6A Pending CN111612065A (en) | 2020-05-21 | 2020-05-21 | Multi-scale characteristic object detection algorithm based on ratio self-adaptive pooling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111612065A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109886357A (en) * | 2019-03-13 | 2019-06-14 | 哈尔滨工程大学 | A kind of adaptive weighting deep learning objective classification method based on Fusion Features |
CN110084124A (en) * | 2019-03-28 | 2019-08-02 | 北京大学 | Feature based on feature pyramid network enhances object detection method |
CN110321923A (en) * | 2019-05-10 | 2019-10-11 | 上海大学 | Object detection method, system and the medium of different scale receptive field Feature-level fusion |
Non-Patent Citations (4)
Title |
---|
CHAOXU GUO ET AL: "AugFPN: Improving Multi-scale Feature Learning for Object Detection", 《HTTPS://ARXIV.ORG/ABS/1912.05384V1》 * |
ROSS GIRSHICK: "Fast R-CNN", 《2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION》 * |
SHAOQING REN ET AL: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 * |
YIQING ZHANG ET AL: "Mask-Refined R-CNN: A Network for Refining Object Details in Instance Segmentation", 《SENSORS》 * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200901 |