CN111612065A - Multi-scale feature object detection algorithm based on ratio-adaptive pooling
Multi-scale feature object detection algorithm based on ratio-adaptive pooling

- Publication number: CN111612065A
- Application number: CN202010433145.6A
- Authority: CN (China)
- Prior art keywords: rois, training, feature, fpn, rpn
- Legal status: Pending (an assumption, not a legal conclusion; no legal analysis has been performed)
Classifications
- G06F18/2415 — Pattern recognition; classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. likelihood ratio or false-acceptance versus false-rejection rate
- G06N3/045 — Computing arrangements based on biological models; neural networks; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06V10/25 — Image or video recognition or understanding; image preprocessing; determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/464 — Image or video recognition or understanding; salient features, e.g. scale-invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
Abstract
The invention relates to a multi-scale feature object detection algorithm based on ratio-adaptive pooling (RAP). The method comprises the following steps: (1) collecting a large number of images, dividing them into a training set and a test set in a fixed proportion, and preprocessing the training set; (2) inputting the training set into a pre-trained convolutional neural network (ResNet50) for feature extraction to obtain the corresponding feature maps; (3) embedding an RPN in the RAP-combined FPN structure to generate features at different scales and training the RPN; (4) performing RoI Pooling on the RoIs of different scales generated in step (3), then computing the loss, classification, and finer bounding-box regression; (5) inputting the test-set images into the trained detection model to output the detection results. The method effectively alleviates the semantic-information loss of the FPN during fusion and improves detection accuracy.
Description
Technical Field
The invention relates to the field of image processing, and in particular to a multi-scale feature object detection algorithm based on ratio-adaptive pooling (hereinafter referred to as RAP).
Background
Target detection is widely applied in pedestrian detection, intelligent driving assistance, intelligent surveillance, flame and smoke detection, intelligent robotics, and other fields. Although target detection technology has developed rapidly, many problems remain, and improving detection accuracy has always been a central difficulty.
The feature pyramid network (hereinafter abbreviated as FPN) is mainly used to address the multi-scale problem of targets. It is a top-down structure of feature layers. Because the FPN reduces dimensionality through 1 × 1 convolutions in its top-down and lateral-connection paths, the number of channels is reduced and semantic information can be lost. In particular, the top layer of the FPN is directly reduced in dimension by a 1 × 1 convolution and then passed through a 3 × 3 convolution to generate a new top layer for final prediction; since this top layer is not fused with information from any other layer, the channel reduction loses semantic information.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a multi-scale feature object detection algorithm based on ratio-adaptive pooling, which enhances semantic information during the top-down and lateral-connection fusion of the FPN and improves multi-scale object detection accuracy.
The technical scheme of the invention is as follows:
A multi-scale feature object detection algorithm based on ratio-adaptive pooling, which improves the traditional top-down feature pyramid network fusion structure; the method comprises the following steps:
the image to be measured is input to a convolutional neural network (ResNet50), CxC ═ C, characteristic map generated on behalf of each module of ResNet2,C3,C4,C5The overall framework retains the original structure of the FPN, and the transverse connection is that C is reduced to 256 dimension through convolution of 1 × 1 to M (M) as the transverse connection2,M3,M4,M5Then through the top most M5Enhanced by RAP module, output is recorded as P5;
P5 is upsampled and merged with the laterally connected M4; the output is denoted P4. This continues layer by layer until the last feature map P2 is output, which completes the enhancement process. P = {P2, P3, P4, P5} is sent on to subsequent detection.
The specific operation of the RAP enhancement is as follows: the topmost FPN feature M5 passes through a ratio-adaptation step whose pooling mode is adaptive average pooling; the pooling coefficients are chosen here as α = [0.1, 0.2, 0.3], and the three output feature maps of different resolutions are denoted {A, B, C};
then {A, B, C} are each passed through a 3 × 3 convolution and upsampled by bilinear interpolation back to the original input resolution; the outputs are denoted {E, F, G};
finally, the three feature maps of equal resolution are concatenated, the channel count is reduced to 256 by a 1 × 1 convolution, and the final enhancement result is output. At this point the ratio-adaptive pooling process is complete.
An RPN is embedded in the improved FPN for multi-scale feature fusion. For the layers {P2, P3, P4, P5}, anchor areas of {32², 64², 128², 256², 512²} are defined, and each layer additionally uses 3 aspect ratios {1:2, 1:1, 2:1}, so the whole feature pyramid has 15 kinds of anchors;
when training the RPN, only 256 anchors are selected for training, and the proportion of positive samples to negative samples is 1:1 approximately. The positive and negative samples are delimited, in order to ensure that at least a positive sample participates in training, for each real frame, the anchor with the maximum IoU (cross-over ratio) is taken as the positive sample; in the remaining anchors, if IoU of the corresponding anchor and any one of the real boxes is greater than 0.7, the corresponding anchor is taken as a positive sample of training, and if IoU of the anchor and any one of the real boxes is less than 0.3, the anchor is taken as a negative sample of training.
While the RPN is being trained, it also generates RoIs to send to the Fast R-CNN network. The RPN selects a large number of anchors (e.g., 12000) and roughly corrects their positions to obtain RoIs, then uses non-maximum suppression (NMS) to select the 2000 highest-scoring RoIs from those 12000.
The network then selects 128 of these 2000 RoIs for training. The sampling rule takes RoIs whose IoU with a ground-truth box is greater than 0.5 as positive samples (say N of them), and draws the remaining (128 − N) negative samples from RoIs whose IoU with every ground-truth box is below 0.5 and at least 0 (or 0.1), keeping the positive-to-negative ratio at approximately 1:3.
The 128 selected RoIs then undergo the subsequent operations of RoI Pooling, loss calculation, classification, finer bounding-box regression, and so on. The loss typically adopted for position regression is Smooth_L1 loss, and the cross-entropy loss is typically adopted for the classification problem. When computing the position-regression (Smooth L1) loss, negative samples do not participate.
The training sample set is the open-source PASCAL VOC2007 trainval dataset, 5011 pictures in total. Data preprocessing applies only horizontal and vertical flips, each with probability 0.5, and the bounding-box coordinate information is changed accordingly.
The data are input into the improved network for training. The experimental environment runs Ubuntu 16.04, with two GeForce GTX 1080Ti graphics cards (11 GB of video memory each), and the deep learning framework is PyTorch. In the experiments, ResNet50 is the feature-extraction network, Faster R-CNN + FPN + RAP is the object-detection framework, and mAP (mean Average Precision) is the evaluation metric. The network weights are initialized with the ImageNet-pretrained weights from the official website and optimized with stochastic gradient descent (SGD); the learning rate is 0.04, training lasts 14 epochs (one epoch is one pass over the 5011-image dataset), the rate is decayed in the tenth epoch, and the weight-decay coefficient is 0.0001.
The test set is the open-source PASCAL VOC2007 test dataset, 4952 pictures, which are input into the trained improved network for testing; the network weights are those saved in the last epoch. The picture size used in training and testing is (1000, 600): images are scaled toward a shorter side of 600 pixels while the longer side is kept under 1000 pixels.
The beneficial technical effects of the invention are as follows:
1. The application discloses a multi-scale feature object detection algorithm based on ratio-adaptive pooling (RAP). It improves the multi-layer feature-information fusion structure of the classic feature pyramid network (FPN) by adding a ratio-adaptive module and changing the position at which the 3 × 3 convolution in the original FPN acts, which enhances the semantic information of the FPN top layer and improves detection accuracy.
2. The RAP structure is particularly simple and can effectively improve object detection accuracy while adding only a small amount of computation compared with the FPN structure;
3. The RAP is also a portable structure: it can act on any multi-scale feature process, not just the feature pyramid.
Drawings
FIG. 1 is a flow chart of the object detection algorithm in the present application.
FIG. 2 is a schematic diagram of the FPN combined with the ratio-adaptive pooling (RAP) structure.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
A multi-scale feature object detection algorithm based on ratio-adaptive pooling (RAP) is disclosed in the present application. Because the FPN reduces dimensionality through 1 × 1 convolutions in its top-down and lateral-connection paths, channels are reduced and semantic information is lost; the uppermost layer of the FPN is directly reduced in dimension by a 1 × 1 convolution and then passed through a 3 × 3 convolution to generate a new top layer for final prediction, and since that layer is not fused with information from other layers, the channel reduction loses semantic information. The method improves the multi-layer feature-information fusion structure of the classic FPN by adding a ratio-adaptive module and changing the position at which the 3 × 3 convolution in the original FPN acts, enhancing the semantic information of the FPN top layer and improving detection accuracy.
Before the method disclosed by this invention uses Faster R-CNN to detect targets, the Faster R-CNN must be trained; the method is therefore divided into two parts, the first being the model-training part and the second the target-detection part on the test set. The main flow is shown in FIG. 1. The method mainly comprises the following steps:
the first part mainly comprises the following steps:
(1) Acquire the sample sets to be used. The training sample set is the open-source PASCAL VOC2007 trainval dataset (5011 pictures) and the test sample set is the open-source PASCAL VOC2007 test dataset (4952 pictures). Data preprocessing applies only horizontal and vertical flips, each with probability 0.5, and the bounding-box coordinate information is changed accordingly, as the sketch below illustrates.
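By way of illustration, the flip-plus-coordinate update can be sketched as follows; this is a minimal sketch, and the function name and (x1, y1, x2, y2) array convention are assumptions rather than the patent's actual code:

```python
import random
import numpy as np

def random_flip(image, boxes, p=0.5):
    """Flip an HxWxC image horizontally and/or vertically with
    probability p each, updating (x1, y1, x2, y2) boxes to match."""
    h, w = image.shape[:2]
    boxes = boxes.copy()
    if random.random() < p:                          # horizontal flip
        image = image[:, ::-1, :]
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]      # mirror and swap x1, x2
    if random.random() < p:                          # vertical flip
        image = image[::-1, :, :]
        boxes[:, [1, 3]] = h - boxes[:, [3, 1]]      # mirror and swap y1, y2
    return image, boxes

img, bxs = random_flip(np.zeros((600, 1000, 3)),
                       np.array([[10., 20., 110., 220.]]))
```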
(2) Read in the weights of the base network ResNet50 pre-trained on ImageNet and take them as the initial parameters of the convolutional neural network. The training data are input into the network, and the convolutional neural network extracts features from the images: the convolution kernels compute feature maps that generally become smaller and smaller, and feature layers whose outputs have the same size are said to belong to the same network stage. For ResNet, the feature activations output by the final residual structure of each stage are used; these residual-block outputs are denoted {C2, C3, C4, C5}, corresponding to the outputs of conv2, conv3, conv4 and conv5, and can be collected as sketched below.
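A sketch of collecting the stage outputs, assuming torchvision's ResNet-50 (loading the actual ImageNet weights is omitted here for brevity):

```python
import torch
from torchvision.models import resnet50

backbone = resnet50()  # in practice, initialized with ImageNet-pretrained weights

def resnet_stages(x):
    """Return the final residual-block output of each ResNet stage."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    c2 = backbone.layer1(x)    # conv2_x output, stride 4,  256 channels
    c3 = backbone.layer2(c2)   # conv3_x output, stride 8,  512 channels
    c4 = backbone.layer3(c3)   # conv4_x output, stride 16, 1024 channels
    c5 = backbone.layer4(c4)   # conv5_x output, stride 32, 2048 channels
    return c2, c3, c4, c5

with torch.no_grad():
    c2, c3, c4, c5 = resnet_stages(torch.randn(1, 3, 600, 1000))
```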
(3) Add the RAP structure to the FPN; the implementation steps are as follows:
(3.1) The image to be detected is input into a convolutional neural network (ResNet50); the feature maps generated by the ResNet stages are denoted C = {C2, C3, C4, C5}. The overall framework retains the original structure of the FPN: in the lateral connections, each Cx is reduced to 256 channels by a 1 × 1 convolution, giving M = {M2, M3, M4, M5}. The topmost feature M5 is then enhanced by the RAP module, and the output is denoted P5;
(3.2) P5 is upsampled and merged with the laterally connected M4; the output is denoted P4. This continues layer by layer until the last feature map P2 is output, which completes the enhancement process. P = {P2, P3, P4, P5} is sent on to subsequent detection.
(3.3) The specific operation of the RAP enhancement is as follows: the topmost FPN feature M5 passes through a ratio-adaptation step whose pooling mode is adaptive average pooling; the pooling coefficients are chosen here as α = [0.1, 0.2, 0.3], and the three output feature maps of different resolutions are denoted {A, B, C};
(3.4) {A, B, C} are each passed through a 3 × 3 convolution and upsampled by bilinear interpolation back to the original input resolution; the outputs are denoted {E, F, G};
(3.5) Finally, the three feature maps of equal resolution are concatenated, the channel count is reduced to 256 by a 1 × 1 convolution, and the final enhancement result is output, as sketched below. At this point the ratio-adaptive pooling process is complete.
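A minimal PyTorch sketch of steps (3.3)–(3.5). The text does not spell out how each coefficient α maps to a pooled size, so computing it as α times the input resolution is an assumption here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RAP(nn.Module):
    """Ratio-adaptive pooling enhancement for the FPN top feature M5."""

    def __init__(self, channels=256, ratios=(0.1, 0.2, 0.3)):
        super().__init__()
        self.ratios = ratios
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in ratios)
        self.reduce = nn.Conv2d(channels * len(ratios), channels, 1)

    def forward(self, m5):
        h, w = m5.shape[-2:]
        branches = []
        for alpha, conv in zip(self.ratios, self.convs):
            # adaptive average pooling to a ratio of the input resolution (assumed)
            size = (max(1, int(h * alpha)), max(1, int(w * alpha)))
            a = F.adaptive_avg_pool2d(m5, size)            # {A, B, C}
            e = F.interpolate(conv(a), size=(h, w),        # {E, F, G}
                              mode="bilinear", align_corners=False)
            branches.append(e)
        # concatenate the three same-resolution maps, 1x1 conv back to 256
        return self.reduce(torch.cat(branches, dim=1))

p5 = RAP()(torch.randn(1, 256, 19, 32))   # enhanced top-level feature P5
```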
(4) Train the RPN network; the concrete steps are as follows:
(4.1) An RPN is embedded in the improved FPN for multi-scale feature fusion. For the layers {P2, P3, P4, P5}, anchor areas of {32², 64², 128², 256², 512²} are defined, and each layer additionally uses 3 aspect ratios {1:2, 1:1, 2:1}, so the whole feature pyramid has 15 kinds of anchors;
(4.2) When training the RPN, only 256 anchors are selected for training, with positive and negative samples in an approximately 1:1 ratio. Positive and negative samples are defined as follows: to ensure that at least one positive sample participates in training, for each ground-truth box the anchor with the highest IoU (intersection over union) is taken as a positive sample; among the remaining anchors, any anchor whose IoU with some ground-truth box exceeds 0.7 is taken as a positive training sample, and any anchor whose IoU with every ground-truth box is below 0.3 is taken as a negative training sample. A sketch of this rule follows.
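A sketch of the labelling rule using torchvision's box_iou; the thresholds follow the text, while the tensor layout and label encoding are assumptions:

```python
import torch
from torchvision.ops import box_iou

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """Return 1 for positive, 0 for negative, -1 for ignored anchors.
    anchors: (A, 4) and gt_boxes: (G, 4), both in (x1, y1, x2, y2)."""
    iou = box_iou(anchors, gt_boxes)                  # (A, G) overlap matrix
    best, _ = iou.max(dim=1)                          # best IoU per anchor
    labels = torch.full((anchors.size(0),), -1, dtype=torch.long)
    labels[best < lo] = 0                             # IoU < 0.3 -> negative
    labels[best > hi] = 1                             # IoU > 0.7 -> positive
    labels[iou.argmax(dim=0)] = 1                     # best anchor per gt box
    return labels                                     # then sample 256 at ~1:1
```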
(4.3) While the RPN is trained, it also generates RoIs to send to the Fast R-CNN network. The RPN selects a large number of anchors (e.g., 12000) and roughly corrects their positions to obtain RoIs, then uses non-maximum suppression (NMS) to select the 2000 highest-scoring RoIs from those 12000, as sketched below.
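The proposal filtering can be sketched with torchvision's nms; the NMS IoU threshold of 0.7 is a common default and an assumption here, as the text does not state it:

```python
import torch
from torchvision.ops import nms

def select_proposals(boxes, scores, pre_nms=12000, post_nms=2000, thr=0.7):
    """Keep the highest-scoring decoded anchors, suppress overlaps,
    and return the top post_nms surviving RoIs."""
    order = scores.argsort(descending=True)[:pre_nms]
    boxes, scores = boxes[order], scores[order]
    keep = nms(boxes, scores, thr)[:post_nms]   # nms keeps descending-score order
    return boxes[keep], scores[keep]
```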
(4.4) The network then selects 128 of these 2000 RoIs for training. The sampling rule takes RoIs whose IoU with a ground-truth box is greater than 0.5 as positive samples (say N of them), and draws the remaining (128 − N) negative samples from RoIs whose IoU with every ground-truth box is below 0.5 and at least 0 (or 0.1), keeping the positive-to-negative ratio at approximately 1:3 (see the sketch below).
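A sketch of the 128-RoI mini-batch sampling; the lower IoU bound for negatives follows the text's "0 (or 0.1)", and the function signature is an assumption:

```python
import torch

def sample_rois(max_ious, batch=128, pos_fraction=0.25,
                pos_thr=0.5, neg_lo=0.1):
    """max_ious: best IoU of each RoI with any ground-truth box.
    Returns indices of ~32 positives and ~96 negatives (about 1:3)."""
    pos = torch.where(max_ious > pos_thr)[0]
    neg = torch.where((max_ious < pos_thr) & (max_ious >= neg_lo))[0]
    n_pos = min(pos.numel(), int(batch * pos_fraction))
    pos = pos[torch.randperm(pos.numel())[:n_pos]]
    neg = neg[torch.randperm(neg.numel())[:batch - n_pos]]
    return pos, neg
```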
(4.5) The 128 selected RoIs proceed to the subsequent RoI Pooling, where RoIs of different scales use different feature layers as the input of the RoI Align layer: a larger RoI uses a later pyramid layer, such as P5, while a smaller RoI uses an earlier layer, such as P4. A coefficient k is defined to determine which layer Pk a given RoI is assigned to:

k = ⌊k0 + log2(√(w·h) / 224)⌋

where 224 is the standard ImageNet input size; k0 is a reference value, set to 5, representing the P5 layer output (a RoI of the canonical 224 × 224 size maps to P5); and w and h are the width and height of the RoI. Assuming a RoI of 112 × 112, then k = k0 − 1 = 5 − 1 = 4, meaning that RoI should use the P4 feature layer. The value of k is rounded down in case the result is not an integer.
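The assignment rule as a small function; the clamp to the available levels P2–P5 is standard FPN practice and an assumption beyond the quoted formula:

```python
import math

def roi_level(w, h, k0=5, canonical=224, k_min=2, k_max=5):
    """k = floor(k0 + log2(sqrt(w*h) / 224)), clamped to P2..P5."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

print(roi_level(112, 112))   # -> 4: a 112x112 RoI is pooled from P4
print(roi_level(448, 448))   # -> 5: larger RoIs fall on the top level
```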
(5) Compute the loss, classification and regression; when computing the position-regression loss, the negative samples do not participate. The loss typically adopted for position regression is Smooth_L1 loss, and the cross-entropy loss is typically adopted for the classification problem. The regression loss has the form

L_loc(v, v̂) = Σ_{i ∈ {x, y, w, h}} smooth_L1(v_i − v̂_i),  where smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise,

where v = (vx, vy, vw, vh) are the ground-truth translation and scaling parameters and v̂ = (v̂x, v̂y, v̂w, v̂h) are the predicted translation and scaling parameters.
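A sketch of that regression loss in PyTorch; beta = 1 matches the classic Smooth L1 definition above, and summing over the four offsets follows the formula (applying it to positive samples only is per the text):

```python
import torch

def smooth_l1(pred, target, beta=1.0):
    """Quadratic below beta, linear above; shapes (N, 4) -> (N,)."""
    diff = (pred - target).abs()
    per_coord = torch.where(diff < beta,
                            0.5 * diff ** 2 / beta,
                            diff - 0.5 * beta)
    return per_coord.sum(dim=-1)      # sum over (tx, ty, tw, th)

loss = smooth_l1(torch.zeros(8, 4), torch.ones(8, 4)).mean()
```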
In training, the experimental environment runs Ubuntu 16.04, with two GeForce GTX 1080Ti graphics cards (11 GB of video memory each), and the deep learning framework is PyTorch. In the experiments, ResNet50 is the feature-extraction network, Faster R-CNN + FPN + RAP is the object-detection framework, and mAP (mean Average Precision) is the evaluation metric. The network weights are initialized with the ImageNet-pretrained weights from the official website and optimized with stochastic gradient descent (SGD); the learning rate is 0.04, training lasts 14 epochs (one epoch is one pass over the 5011-image dataset), the rate is decayed in the tenth epoch, and the weight-decay coefficient is 0.0001.
The learning rates used in training are shown in Table 1.
Table 1. Network training parameters for this experiment

Epoch | Learning rate | Momentum | Weight decay
---|---|---|---
1–10 | 4e-3 | 0.9 | 0.0001
10–14 | 4e-4 | 0.9 | 0.0001
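Table 1's schedule maps onto PyTorch's SGD and MultiStepLR as follows; the model here is a stand-in, and the exact rates are taken from the table rather than the running text:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(256, 256, 3, padding=1)      # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=4e-3,
                            momentum=0.9, weight_decay=1e-4)
# decay the learning rate 10x after epoch 10 (4e-3 -> 4e-4, per Table 1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[10], gamma=0.1)
for epoch in range(14):
    # ... one full pass over the 5011 training images would go here ...
    optimizer.step()        # placeholder; the real update follows loss.backward()
    scheduler.step()
```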
The second part is the target-detection part. After the fused Faster R-CNN + FPN + RAP network is obtained through training, it is used to detect targets in the image to be tested, with the following steps:
(1) Preprocess the image to be detected. The training set and the test set use single-scale pictures: the longer side is at most 1000 pixels and the shorter side is scaled toward 600 pixels (see the sketch below).
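One way to realize the (1000, 600) size rule, assuming the usual shortest-side/longest-side convention; the helper name is illustrative:

```python
def target_size(w, h, short=600, long=1000):
    """Scale so the short side reaches 600 unless the long side
    would exceed 1000; returns the new (w, h)."""
    scale = min(short / min(w, h), long / max(w, h))
    return round(w * scale), round(h * scale)

print(target_size(500, 375))    # -> (800, 600)
print(target_size(1200, 400))   # -> (1000, 333): capped by the long side
```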
(2) Import the image to be tested into the convolutional neural network ResNet50, which extracts features from the input image. With the RPN embedded in the FPN, the test data's features generate feature maps; the network performs coarse foreground/background classification and coarse box regression, and finally performs finer classification and box regression through softmax.
(3) The number of convolutional-layer channels in all the above operations is 256; the convolutional neural network is the residual network ResNet50; the classifier is a softmax classifier; and mAP (mean Average Precision) is used as the evaluation metric.
(4) The training sample set is the open-source PASCAL VOC2007 dataset, whose training targets cover all 20 categories, including common classes such as car, cat, aeroplane, person, bicycle and horse. The test results are shown in Table 2:
Table 2. Detection results based on the improved FPN fusion structure
Here, Baseline denotes the detection result of Faster R-CNN (ResNet-50) + FPN, and Ours denotes the detection result of the Faster R-CNN (ResNet-50) + FPN + RAP algorithm.
(4.1) As Table 2 shows, the RAP network effectively improves object detection accuracy. Using the FPN combined with the RAP structure of FIG. 2, the mAP improves by 1.6% relative to the Baseline (the original FPN structure), and the accuracy of every category is higher than the Baseline's, demonstrating the effectiveness of the improved RAP structure.
What has been described above is only a preferred embodiment of the present application, and the present invention is not limited to the above embodiment. It is to be understood that other modifications and variations directly derivable or suggested by those skilled in the art without departing from the spirit and concept of the present invention are to be considered as included within the scope of the present invention.
Claims (6)
1. A multi-scale feature object detection algorithm based on ratio-adaptive pooling, which improves the traditional top-down feature pyramid network fusion structure, the method comprising the following steps:
the image to be measured is input to a convolutional neural network (ResNet50), CxC ═ C, characteristic map generated on behalf of each module of ResNet2,C3,C4,C5}, integral frameThe original structure of FPN is kept, and the transverse connection is that C is reduced to 256 dimension through 1 × 1 convolution and M is recorded as M ═ M { (M) }2,M3,M4,M5Then through the top most M5Enhanced by RAP module, output is recorded as P5;
P5 is upsampled and merged with the laterally connected M4; the output is denoted P4; this continues layer by layer until the last feature map P2 is output, which completes the enhancement process, and P = {P2, P3, P4, P5} is sent on to subsequent detection;
wherein the specific operation of the RAP enhancement is as follows: the topmost FPN feature M5 passes through a ratio-adaptation step whose pooling mode is adaptive average pooling; the pooling coefficients are chosen as α = [0.1, 0.2, 0.3], and the three output feature maps of different resolutions are denoted {A, B, C};
then {A, B, C} are each passed through a 3 × 3 convolution and upsampled by bilinear interpolation back to the original input resolution; the outputs are denoted {E, F, G};
and finally, the three feature maps of equal resolution are concatenated, the channel count is reduced to 256 by a 1 × 1 convolution, and the final enhancement result is output, completing the ratio-adaptive pooling process.
2. The method of claim 1, wherein an RPN is embedded in the improved FPN for multi-scale feature fusion; for the layers {P2, P3, P4, P5}, anchor areas of {32², 64², 128², 256², 512²} are defined, and each layer additionally uses 3 aspect ratios {1:2, 1:1, 2:1}, so the whole feature pyramid has 15 kinds of anchors.
3. The method as claimed in claims 1 and 2, wherein, when training the RPN, only 256 anchors are selected for training, with positive and negative samples in an approximately 1:1 ratio; positive and negative samples are defined as follows: to ensure that at least one positive sample participates in training, for each ground-truth box the anchor with the highest IoU (intersection over union) is taken as a positive sample; among the remaining anchors, any anchor whose IoU with some ground-truth box exceeds 0.7 is taken as a positive training sample, and any anchor whose IoU with every ground-truth box is below 0.3 is taken as a negative training sample.
4. The method according to claim 3, wherein, while the RPN is trained, the RoIs generated by the RPN are sent to the Fast R-CNN network; the RPN selects a large number of anchors (e.g., 12000) and roughly corrects their positions to obtain RoIs, then uses non-maximum suppression (NMS) to select the 2000 highest-scoring RoIs from those 12000.
5. The method of claim 4, wherein the network selects 128 of the 2000 RoIs for training; the sampling rule takes RoIs whose IoU with a ground-truth box is greater than 0.5 as positive samples (say N of them), and draws the remaining (128 − N) negative samples from RoIs whose IoU with every ground-truth box is below 0.5 and at least 0 (or 0.1), keeping the positive-to-negative ratio at approximately 1:3.
6. The method of claim 5, wherein the 128 selected RoIs undergo subsequent RoI Pooling, loss calculation, classification, finer bounding-box regression, and so on; the loss typically adopted for position regression is Smooth_L1 loss, and the cross-entropy loss is typically adopted for the classification problem; when the position-regression (Smooth L1) loss is computed, the negative samples do not participate.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010433145.6A CN111612065A (en) | 2020-05-21 | 2020-05-21 | Multi-scale characteristic object detection algorithm based on ratio self-adaptive pooling |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010433145.6A CN111612065A (en) | 2020-05-21 | 2020-05-21 | Multi-scale characteristic object detection algorithm based on ratio self-adaptive pooling |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111612065A true CN111612065A (en) | 2020-09-01 |
Family
ID=72201920
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010433145.6A Pending CN111612065A (en) | 2020-05-21 | 2020-05-21 | Multi-scale characteristic object detection algorithm based on ratio self-adaptive pooling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111612065A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109886357A (en) * | 2019-03-13 | 2019-06-14 | 哈尔滨工程大学 | A kind of adaptive weighting deep learning objective classification method based on Fusion Features |
CN110084124A (en) * | 2019-03-28 | 2019-08-02 | 北京大学 | Feature based on feature pyramid network enhances object detection method |
CN110321923A (en) * | 2019-05-10 | 2019-10-11 | 上海大学 | Object detection method, system and the medium of different scale receptive field Feature-level fusion |
Non-Patent Citations (4)
Title |
---|
CHAOXU GUO ET AL: "AugFPN: Improving Multi-scale Feature Learning for Object Detection", 《HTTPS://ARXIV.ORG/ABS/1912.05384V1》 * |
ROSS GIRSHICK: "Fast R-CNN", 《2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION》 * |
SHAOQING REN ET AL: "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", 《IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE》 * |
YIQING ZHANG ET AL: "Mask-Refined R-CNN: A Network for Refining Object Details in Instance Segmentation", 《SENSORS》 * |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200901 |