CN112949635B - Target detection method based on feature enhancement and IoU perception - Google Patents
Target detection method based on feature enhancement and IoU perception
- Publication number
- CN112949635B (application CN202110268913.1A)
- Authority
- CN
- China
- Prior art keywords
- roi
- iou
- feature
- network
- classification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
- G06V10/267—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a target detection method based on feature enhancement and IoU perception, and belongs to the field of computer vision target detection. The method uses the spatial information of the convolutional feature map in the RoI classification regression network to improve the accuracy of target classification and localization, uses an attention mechanism to suppress the background information in the RoI features and enhance their semantic information, and uses an IoU re-scoring strategy to increase the correlation between the classification score and the bounding box confidence so that high-quality bounding boxes are retained. Through the RoI classification regression branch network, the method effectively exploits the spatial information in the feature map and improves the classification and localization ability of the target detection model; the features in the RoI classification regression branch network are enhanced at the bounding-box level by the semantic segmentation branch network and the attention mechanism; and through the IoU prediction branch network and the strategy of re-scoring the category score with the predicted IoU, the correlation between the category score of the target and the bounding box confidence is strengthened, and the localization accuracy of the bounding box is effectively improved.
Description
Technical Field
The invention relates to a target detection method based on feature enhancement and IoU (Intersection over Union) perception, and belongs to the field of computer vision target detection.
Background
Target detection is a fundamental task in the field of computer vision and is widely applied in aerospace, robot navigation, intelligent video surveillance and other fields. In recent years, with the development of deep-learning-based target detection algorithms, two types of detection frameworks have gradually formed: one-stage target detectors and two-stage target detectors. One-stage detectors are fast, but their detection accuracy is relatively low. Two-stage detectors perform classification and bounding box coordinate regression on the target twice, so their detection accuracy is generally higher and they are more widely used in industry.
In a two-stage target detection algorithm, the first stage generally uses a region proposal network to perform binary classification and bounding box coordinate regression on a large number of preset anchors in the image, and outputs a set of regions of interest (RoIs) that may contain targets. In the second stage, the RoI classification regression network performs multi-class classification and bounding box coordinate regression on the regions of interest, and the final detection result is obtained after post-processing. When the bounding box coordinates predicted by the region proposal network are inaccurate, the RoI feature map used by the RoI classification regression network may contain background information, which degrades classification and localization accuracy.
These difficulties mean that current two-stage target detection technology has the following defects:
1. Existing algorithms generally focus on improving the representational power of the feature extraction network, but neglect feature enhancement aimed at the RoI classification and localization tasks in target detection.
2. When classifying and regressing the RoIs, existing algorithms make little use of spatial information and do not fully exploit the inherent structural information in the feature map.
3. Existing algorithms typically use the non-maximum suppression algorithm to remove redundant target boxes during post-processing. In non-maximum suppression, however, the category score of a target box is used to represent its localization confidence, so a bounding box with a lower category score but accurate localization may be suppressed, which degrades detection performance.
Therefore, how to overcome the defects of existing target detection algorithms and achieve efficient and robust target detection is an urgent technical problem to be solved.
Disclosure of Invention
The invention aims to provide a target detection method based on feature enhancement and IoU perception that overcomes the above defects and problems of two-stage target detection technology.
The innovation points of the invention are as follows: the method first uses the spatial information of the convolutional feature map in the RoI classification regression network to improve the accuracy of target classification and localization, uses an attention mechanism to suppress the background information in the RoI features and enhance their semantic information, and uses an IoU re-scoring strategy to increase the correlation between the category score and the bounding box confidence so that high-quality bounding boxes are retained.
The invention is realized by adopting the following technical scheme:
a target detection method based on feature enhancement and IoU perception comprises the following steps:
step 1: and acquiring a target detection data set, and carrying out preprocessing operation on the image to form a training data set.
Specifically, in step 1, the image preprocessing operation specifically includes:
step 1.1: scaling the short side of the input image to 600 pixels;
step 1.2: and (4) randomly and horizontally turning the image to amplify the data.
Step 2: and constructing a target detection network based on feature enhancement and IoU perception based on a two-stage target detection network, namely, Faster R-CNN.
Specifically, the step 2 includes the steps of:
Step 2.1: construct a backbone feature extraction network whose input is the preprocessed image and whose output is the feature map of the image.
The backbone feature extraction network can be any convolutional network, such as VGG-16 or ResNet.
Step 2.2: after the backbone feature extraction network of step 2.1, build an RoI pooling network to obtain a number of regions of interest (RoIs) from the feature map output in step 2.1.
Wherein the RoI pooling algorithm uses RoI Align to extract the RoI features from the last feature map of the conv4_x module of the ResNet network described in step 2.1.
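As an illustration only, this pooling step could be realized with the RoI Align operator provided by torchvision; the 1/16 spatial scale (the stride of the ResNet conv4_x stage), the channel count and the example box below are assumptions, not values taken from the patent:

```python
import torch
from torchvision.ops import roi_align

# Feature map of the conv4_x stage for one image (1024 channels at 1/16 resolution
# is typical for ResNet-50/101; the shapes here are illustrative).
feat = torch.randn(1, 1024, 38, 57)

# Each RoI row is (batch_index, x1, y1, x2, y2) in input-image coordinates.
rois = torch.tensor([[0., 48., 64., 320., 240.]])

# 7 x 7 RoI features for the classification/regression branch,
# 14 x 14 RoI features for the segmentation and IoU prediction branches.
cls_feat = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16, sampling_ratio=2)
seg_feat = roi_align(feat, rois, output_size=(14, 14), spatial_scale=1 / 16, sampling_ratio=2)
```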
Step 2.3: and after the RoI pooling network is obtained in the step 2.2, a RoI classification regression branch network is built, the features of the plurality of RoIs obtained in the step 2.2 are extracted, the classification score and the position of the boundary frame of each RoI are predicted, and a final target detection result is output.
For the RoI classification regression branch network described in step 2.3, the feature map size of the RoI Align output is 7 × 7.
In particular, said step 2.3 comprises the following steps:
Step 2.3.1: the RoI classification regression branch network performs feature extraction with two consecutive 3 × 3 convolutions with padding 0; the output feature map is denoted X and has size 3 × 3 × 512;
Step 2.3.2: perform dense prediction on the feature map X and classify the feature vector at each position in turn:
S_i = σ(φ_θ(X_i))    (1)
where i is the position index in the feature map and X_i is the i-th feature vector; φ_θ(·) is the feature vector classification function, implemented with a 1 × 1 convolutional layer; σ(·) is the softmax operation, which outputs a category score vector S_i of length K + 1, where K is the number of categories in the training data set.
In the training stage, the category of each feature vector is the same as the label of the RoI it belongs to, and the classification task is computed with a cross-entropy loss function.
In the testing stage, the prediction score at each position of the feature map X is first calculated; the category score of the RoI is the mean S of the category scores at all positions, calculated as follows:
S = (1/9) · Σ_i S_i, i ∈ {0, 1, …, 8}    (2)
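A minimal PyTorch-style sketch of this dense classification (equations (1) and (2)); the batch size and the number of categories K are illustrative, not values from the patent:

```python
import torch
import torch.nn as nn

K = 20                                            # number of categories (illustrative)
phi_theta = nn.Conv2d(512, K + 1, kernel_size=1)  # feature vector classification function, a 1x1 conv

X = torch.randn(8, 512, 3, 3)                     # RoI feature maps after the two 3x3 convs
S_pos = phi_theta(X).softmax(dim=1)               # eq. (1): per-position category score vectors, (8, K+1, 3, 3)
S = S_pos.mean(dim=(2, 3))                        # eq. (2): RoI category score = mean over the 9 positions
```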
Step 2.3.3: for bounding box regression, unlike Faster R-CNN, which regresses the center point coordinates of the bounding box and the scaling of its width and height (t_x, t_y, t_w, t_h), the coordinates of each edge of the bounding box are regressed independently. The bounding box coordinates are parameterized as follows:
t_x1 = (x_1 - x_1a)/w_a,  t_x2 = (x_2 - x_2a)/w_a
t_y1 = (y_1 - y_1a)/h_a,  t_y2 = (y_2 - y_2a)/h_a    (3)
wherein (x_1, y_1, x_2, y_2) are the coordinates of the left, top, right and bottom edges of the predicted bounding box and (x*_1, y*_1, x*_2, y*_2) are the corresponding ground-truth edge coordinates; (t_x1, t_y1, t_x2, t_y2) are the predicted coordinate offsets and (t*_x1, t*_y1, t*_x2, t*_y2) are the regression targets, computed from the ground-truth edges in the same way; (x_1a, y_1a, x_2a, y_2a) are the coordinates of the left, top, right and bottom edges of the anchor box, w_a is the width of the anchor box, and h_a is its height.
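As a plain illustration of the parameterization in equation (3), the following sketch encodes a box against an anchor (proposal) box and decodes predicted offsets back to absolute edge coordinates; the function names are illustrative:

```python
def encode_edges(box, anchor):
    x1, y1, x2, y2 = box                       # left, top, right, bottom edges of the box
    x1a, y1a, x2a, y2a = anchor
    wa, ha = x2a - x1a, y2a - y1a              # anchor width and height
    return ((x1 - x1a) / wa, (y1 - y1a) / ha,
            (x2 - x2a) / wa, (y2 - y2a) / ha)  # (t_x1, t_y1, t_x2, t_y2)

def decode_edges(t, anchor):
    t_x1, t_y1, t_x2, t_y2 = t
    x1a, y1a, x2a, y2a = anchor
    wa, ha = x2a - x1a, y2a - y1a
    return (x1a + t_x1 * wa, y1a + t_y1 * ha,
            x2a + t_x2 * wa, y2a + t_y2 * ha)  # predicted (x1, y1, x2, y2)
```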
Step 2.3.4: and regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions.
For each edge, it is regressed using a network whose parameters are not shared; the calculation formula is as follows:
t_x1 = φ_θx1(gmp(X_0, X_3, X_6)),
t_y1 = φ_θy1(gmp(X_0, X_1, X_2)),
t_x2 = φ_θx2(gmp(X_2, X_5, X_8)),    (4)
t_y2 = φ_θy2(gmp(X_6, X_7, X_8))
wherein X_i, i ∈ {0, 1, …, 8}, denotes the feature vector at the corresponding position of the feature map X; φ(·) denotes the coordinate regression function, implemented with a 1 × 1 convolutional layer; and gmp(·) denotes the global max pooling function.
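A sketch of the per-edge regression of equation (4), assuming the 3 × 3 RoI feature map is indexed row-major as X_0 … X_8; the module and parameter names are illustrative:

```python
import torch
import torch.nn as nn

class EdgeRegressor(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        # one unshared regressor per edge (phi_theta_x1, ..., phi_theta_y2), each a 1x1 conv
        self.fc = nn.ModuleDict({e: nn.Conv2d(channels, 1, kernel_size=1)
                                 for e in ("x1", "y1", "x2", "y2")})

    def forward(self, X):                          # X: (N, 512, 3, 3), positions 0..8 row-major
        def gmp(rows, cols):                       # global max pooling over the selected positions
            return X[:, :, rows, cols].amax(dim=2, keepdim=True).unsqueeze(-1)
        t_x1 = self.fc["x1"](gmp([0, 1, 2], [0, 0, 0]))   # left column  -> X0, X3, X6
        t_y1 = self.fc["y1"](gmp([0, 0, 0], [0, 1, 2]))   # top row      -> X0, X1, X2
        t_x2 = self.fc["x2"](gmp([0, 1, 2], [2, 2, 2]))   # right column -> X2, X5, X8
        t_y2 = self.fc["y2"](gmp([2, 2, 2], [0, 1, 2]))   # bottom row   -> X6, X7, X8
        return torch.cat([t_x1, t_y1, t_x2, t_y2], dim=1).flatten(1)  # (N, 4) edge offsets
```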
Step 2.4: build a semantic segmentation branch network after the RoI pooling network of step 2.2, build a feature enhancement module based on an attention mechanism, and enhance the RoI features of step 2.3 with the extracted semantic segmentation feature map.
Specifically, step 2.4 includes the steps of:
step 2.4.1: and (3) adding segmentation labels at the pixel level to the target detection data set in the step 1.
Specifically, the coordinates of the target frame of the input image are rounded and mapped onto the RoI feature map, pixels falling within the target frame are labeled as positive samples, and the remaining pixels are labeled as negative samples.
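A minimal sketch of this pixel-level label generation, assuming the ground-truth box and the RoI are given as [x1, y1, x2, y2] in image coordinates and the mask grid is 14 × 14; the helper name is illustrative:

```python
import torch

def make_roi_mask(roi, gt_box, mask_size=14):
    """Map a ground-truth box into the RoI grid; pixels inside it become positive labels."""
    rx1, ry1, rx2, ry2 = roi
    scale_x = mask_size / max(rx2 - rx1, 1e-6)
    scale_y = mask_size / max(ry2 - ry1, 1e-6)
    # map the ground-truth box into the RoI grid and round to pixel indices
    gx1 = int(round((gt_box[0] - rx1) * scale_x))
    gy1 = int(round((gt_box[1] - ry1) * scale_y))
    gx2 = int(round((gt_box[2] - rx1) * scale_x))
    gy2 = int(round((gt_box[3] - ry1) * scale_y))
    gx1, gy1 = max(gx1, 0), max(gy1, 0)
    gx2, gy2 = min(gx2, mask_size), min(gy2, mask_size)
    mask = torch.zeros(mask_size, mask_size)
    mask[gy1:gy2, gx1:gx2] = 1.0          # pixels inside the box are positive samples
    return mask                           # remaining pixels stay 0 (negative samples)
```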
Step 2.4.2: step 2.4, the input of the semantic segmentation branching network is a RoI feature map with the size of 14 × 14 × C obtained by a RoI pooling layer, and feature extraction is carried out by using two convolution layers with the size of 3 × 3 to obtain a feature map X with the size of 14 × 14 × 512 mask To X mask Activating by using a 3 x 3 convolutional layer and a sigmoid function, and outputting a final RoI partition prediction;
step 2.4.3: using feature maps X mask The RoI characteristics are enhanced.
Specifically, a feature enhancement module is designed, and the input of the feature enhancement module comprises the RoI features output by the RoI classification regression branch intermediate layer in the step 2.3 and the feature map X output by the semantic segmentation branch intermediate layer in the step 2.4 mask . For the RoI feature, converting the channel number of the feature map from C to 512 dimensions by using a 1 × 1 convolution; for semantic segmentation feature X mask Firstly, downsampling the feature map by using a bilinear interpolation algorithm, then performing feature transformation on the downsampled feature map by using 1 × 1 convolution, obtaining an attention map at a pixel level by using a sigmoid function, and finally multiplying the feature map of the RoI branch by the attention map to obtain an enhanced feature map.
For the semantic segmentation branch network described in step 2.4, the feature map size output by the RoI Align is 14 × 14.
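A sketch of the attention-based feature enhancement module described above; the 7 × 7 spatial size of the RoI-branch feature and the channel counts are assumptions consistent with the RoI Align output sizes stated in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhancement(nn.Module):
    def __init__(self, roi_channels=1024, mask_channels=512):
        super().__init__()
        self.roi_proj = nn.Conv2d(roi_channels, 512, kernel_size=1)    # C -> 512 on the RoI feature
        self.attn_proj = nn.Conv2d(mask_channels, 512, kernel_size=1)  # feature transform on X_mask

    def forward(self, roi_feat, x_mask):
        # roi_feat: (N, C, 7, 7) from the RoI classification regression branch
        # x_mask:   (N, 512, 14, 14) from the semantic segmentation branch
        roi_feat = self.roi_proj(roi_feat)
        x_mask = F.interpolate(x_mask, size=roi_feat.shape[-2:],
                               mode="bilinear", align_corners=False)    # bilinear downsampling
        attention = torch.sigmoid(self.attn_proj(x_mask))               # pixel-level attention map
        return roi_feat * attention                                     # enhanced RoI feature
```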
Step 2.5: build an IoU prediction branch network after the RoI pooling network of step 2.2; its input is the semantic segmentation feature map extracted by the semantic segmentation branch network of step 2.4, and its output is the IoU between each predicted RoI and the ground-truth target box matched to it.
Specifically, step 2.5 comprises the steps of:
Step 2.5.1: the input of the IoU prediction branch network of step 2.5 is the intermediate-layer output feature X_mask of the semantic segmentation branch network of step 2.4. X_mask is transformed with a 1 × 1 convolutional layer, a 512-dimensional feature vector is then obtained with global average pooling, and finally a fully connected layer and a sigmoid function are used for activation, outputting the predicted IoU value for each RoI;
Step 2.5.2: in the training phase, only the positive-sample RoIs participate in the training of the IoU prediction branch network;
Step 2.5.3: in the testing phase, the classification scores are re-scored using the predicted IoU; in the re-scoring formula, the classification score predicted by the RoI classification regression branch network of step 2.3 is combined with the IoU value predicted by the IoU prediction branch network of step 2.5, S'_i denotes the re-scored classification score, and γ denotes a hyper-parameter in the interval [0, 1].
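A sketch of the IoU prediction branch and the test-time re-scoring; because the re-scoring formula itself is not reproduced above, the weighted geometric mean used in rescore() below is only an assumed form, while the layer sizes follow the text:

```python
import torch
import torch.nn as nn

class IoUHead(nn.Module):
    def __init__(self, in_channels=512):
        super().__init__()
        self.transform = nn.Conv2d(in_channels, 512, kernel_size=1)   # 1x1 conv on X_mask
        self.fc = nn.Linear(512, 1)                                    # fully connected layer

    def forward(self, x_mask):                        # x_mask: (N, 512, 14, 14)
        x = self.transform(x_mask)
        x = x.mean(dim=(2, 3))                        # global average pooling -> (N, 512)
        return torch.sigmoid(self.fc(x)).squeeze(1)   # predicted IoU in [0, 1] per RoI

def rescore(cls_score, pred_iou, gamma=0.5):
    """Assumed re-scoring rule: blend the category score with the predicted IoU."""
    return cls_score.pow(1 - gamma) * pred_iou.pow(gamma)
```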
Step 3: construct a loss function and train the target detection network of step 2 on the training data set to obtain a target detection model.
Wherein, the loss function calculation formula is as follows:
L = L_RPN + α·L_cls + β·L_loc + δ·L_seg + η·L_IoU    (6)
wherein L denotes the multi-task loss function; L_RPN denotes the RPN loss function of Faster R-CNN; L_cls and L_loc denote the classification loss function and the position regression loss function of Faster R-CNN, respectively; L_seg denotes the semantic segmentation loss function, in which only the positive-sample RoIs participate in the training of the semantic segmentation task and which is supervised with a cross-entropy loss; L_IoU denotes the IoU prediction loss function; α, β, δ and η denote the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively, where the regression loss weight β and the segmentation loss weight δ are set to 1 and the IoU prediction loss weight η is set to 0.5.
The IoU prediction loss function is computed over the positive-sample RoIs, wherein I_i denotes the real IoU, Î_i denotes the output value of the IoU prediction branch network of step 2.5, and N_pos denotes the number of positive-sample RoIs.
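A sketch of the multi-task loss of equation (6); the exact form of the IoU prediction loss is not reproduced above, so the L2 loss over positive-sample RoIs below, and the default value used for α, are assumptions:

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_iou, real_iou, pos_mask):
    # assumed L2 form, averaged over the N_pos positive-sample RoIs
    n_pos = pos_mask.sum().clamp(min=1)
    return F.mse_loss(pred_iou[pos_mask], real_iou[pos_mask], reduction="sum") / n_pos

def total_loss(l_rpn, l_cls, l_loc, l_seg, l_iou, alpha=1.0, beta=1.0, delta=1.0, eta=0.5):
    # beta = delta = 1 and eta = 0.5 follow the text; alpha = 1.0 is an assumed default
    return l_rpn + alpha * l_cls + beta * l_loc + delta * l_seg + eta * l_iou
```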
Step 4: acquire a test image, preprocess it (e.g., resize it), and input it to the target detection model obtained in step 3 to obtain the target classification and localization results of the test image.
Advantageous effects
Compared with the prior art, the method has the following advantages:
through the RoI classification regression branch network, the spatial information in the feature map can be effectively utilized, and the classification and positioning capabilities of the target detection model are effectively improved; the features in the RoI classification regression branch network are enhanced through the semantic segmentation branch network and the attention mechanism at the boundary box level; through IoU prediction branch networks and a strategy of re-scoring the category scores by using prediction IoU, the relevance between the category scores of the targets and the confidence coefficient of the bounding box is improved, and the positioning accuracy of the bounding box is effectively improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a feature enhancement and IoU perception based object detection network architecture provided by the present invention;
FIG. 3 is a schematic diagram illustrating mask generation in a semantic segmentation branch network according to the present invention;
FIG. 4 is a schematic diagram of a feature enhancement module based on an attention mechanism according to the present invention.
Detailed Description
The method of the present invention is described in further detail below with reference to the accompanying drawings and examples. The method comprises the following steps:
step S1: acquiring a target detection data set and carrying out preprocessing operation on the image to form a training data set;
the image preprocessing operation steps are as follows:
step S11: scaling the short side of the input image to 600 pixels;
step S12: data augmentation is performed using random horizontal flipping of the image.
Step S2: referring to fig. 2, fig. 2 shows a target detection network based on feature enhancement and IoU perception, which is built based on a two-stage target detection network, fast R-CNN;
the target detection network construction steps based on feature enhancement and IoU perception are as follows:
step S21: constructing a trunk characteristic extraction network, inputting a preprocessed image, and outputting a characteristic graph of the image;
the backbone feature extraction network can be any convolution network, such as VGG-16, ResNet, and the like.
Step S22: constructing a RoI pooling network after the backbone feature extraction network in the step S21, and obtaining a plurality of regions of interest RoI of the output feature map in the step S21;
the method for constructing the RoI pooling network comprises the following steps:
the RoI pooling algorithm uses RoI Align to extract the RoI feature from the last feature map in the conv4_ x module of the ResNet network in step S21;
for the RoI classification regression network branch, the characteristic map size of the RoI Align output is 7 x 7;
for the semantic split branching network, the characteristic graph size of the RoI Align output is 14 x 14.
Step S23: constructing a RoI classification regression branch network after the RoI pooling network in the step S22, performing feature extraction on the RoIs in the step S22, predicting the classification score and the position of a boundary frame of each RoI, and outputting a final target detection result;
the method for constructing the RoI classification regression branch network comprises the following steps:
step S231: the RoI classification regression branch network performs feature extraction by using two continuous padding-0 3X 3 convolutions, an output feature map is marked as X, and the size of the feature map is 3X 512;
step S232: carrying out dense prediction on the feature map X, and classifying the feature vectors of each position in sequence, wherein the calculation formula is as follows:
S i =σ(φ θ (X i )) (1)
wherein i is the position number in the feature map, X i Is the ith feature vector. Phi is a θ () For the feature vector classification function, a 1 × 1 convolutional layer implementation is used. σ () is a softmax operation for outputting a class score vector S of length K +1 i And K is the number of categories in the training data set.
In the training stage, the category of each feature vector is the same as the label of the RoI it belongs to, and the classification task is computed with a cross-entropy loss function.
In the testing stage, the prediction score at each position of the feature map X is first calculated; the category score of the RoI is the mean S of the category scores at all positions, calculated as follows:
S = (1/9) · Σ_i S_i, i ∈ {0, 1, …, 8}    (2)
Step S233: for bounding box regression, unlike Faster R-CNN, which regresses the center point coordinates of the bounding box and the scaling of its width and height (t_x, t_y, t_w, t_h), in the embodiment of the present invention the coordinates of each edge of the bounding box are regressed independently. The bounding box coordinates are parameterized as follows:
t_x1 = (x_1 - x_1a)/w_a,  t_x2 = (x_2 - x_2a)/w_a
t_y1 = (y_1 - y_1a)/h_a,  t_y2 = (y_2 - y_2a)/h_a    (3)
wherein (x_1, y_1, x_2, y_2) are the coordinates of the left, top, right and bottom edges of the predicted bounding box and (x*_1, y*_1, x*_2, y*_2) are the corresponding ground-truth edge coordinates; (t_x1, t_y1, t_x2, t_y2) are the predicted coordinate offsets and (t*_x1, t*_y1, t*_x2, t*_y2) are the regression targets, computed from the ground-truth edges in the same way; (x_1a, y_1a, x_2a, y_2a) are the coordinates of the left, top, right and bottom edges of the anchor box, w_a is the width of the anchor box, and h_a is its height.
Step S234: regress the coordinate of each edge from the features at the corresponding positions. For each edge, it is regressed using a network whose parameters are not shared; the calculation formula is as follows:
t_x1 = φ_θx1(gmp(X_0, X_3, X_6)),
t_y1 = φ_θy1(gmp(X_0, X_1, X_2)),
t_x2 = φ_θx2(gmp(X_2, X_5, X_8)),    (4)
t_y2 = φ_θy2(gmp(X_6, X_7, X_8))
wherein X_i, i ∈ {0, 1, …, 8}, denotes the feature vector at the corresponding position of the feature map X; φ(·) denotes the coordinate regression function, implemented with a 1 × 1 convolutional layer; and gmp(·) denotes the global max pooling function.
Step S24: build a semantic segmentation branch network after the RoI pooling network of step S22, build a feature enhancement module based on an attention mechanism, and enhance the RoI features of step S23 with the extracted semantic segmentation feature map;
The semantic segmentation branch network construction steps are as follows:
step S241: for the target detection data set of step S1, segmentation labels at the pixel level are added. Referring to fig. 3, fig. 3 illustrates a mask generation process in a semantic segmentation branch network. Specifically, the coordinates of a target frame of the input image are rounded and mapped onto the RoI feature map, pixels falling in the target frame are marked as positive samples, and the rest pixels are marked as negative samples;
Step S242: the input of the semantic segmentation branch network of step S24 is the 14 × 14 × C RoI feature map obtained from the RoI pooling layer. Feature extraction is carried out with two 3 × 3 convolutional layers to obtain a feature map X_mask of size 14 × 14 × 512; X_mask is then activated with a 3 × 3 convolutional layer and a sigmoid function to output the final RoI segmentation prediction;
Step S243: use the feature map X_mask to enhance the RoI features. Specifically, referring to fig. 4, which shows the specific structure of the feature enhancement module, its inputs comprise: the RoI features output by the intermediate layer of the RoI classification regression branch of step S23 and the feature map X_mask output by the intermediate layer of the semantic segmentation branch of step S24. For the RoI features, a 1 × 1 convolution converts the number of channels of the feature map from C to 512; for the semantic segmentation feature X_mask, the feature map is first downsampled with a bilinear interpolation algorithm, then transformed with a 1 × 1 convolution, and a pixel-level attention map is obtained with a sigmoid function; finally, the feature map of the RoI branch is multiplied by the attention map to obtain the enhanced feature map.
Step S25: build an IoU prediction branch network after the RoI pooling network of step S22; its input is the semantic segmentation feature map extracted by the semantic segmentation branch network of step S24, and its output is the IoU between each predicted RoI and the ground-truth target box matched to it.
The IoU prediction branch network building steps are as follows:
Step S251: the input of the IoU prediction branch network of step S25 is the intermediate-layer output feature X_mask of the semantic segmentation branch network of step S24. X_mask is transformed with a 1 × 1 convolutional layer, a 512-dimensional feature vector is then obtained with global average pooling, and finally a fully connected layer and a sigmoid function are used for activation, outputting the predicted IoU value for each RoI;
Step S252: in the training phase, only the positive-sample RoIs participate in the training of the IoU prediction branch network;
Step S253: in the testing phase, the classification scores are re-scored using the predicted IoU; in the re-scoring formula, the classification score predicted by the RoI classification regression branch network of step S23 is combined with the IoU value predicted by the IoU prediction branch network of step S25, S'_i denotes the re-scored classification score, and γ denotes a hyper-parameter in the interval [0, 1].
Step S3: and constructing a loss function, and training the target detection network in the step S2 according to the training data set to obtain a target detection model.
Wherein, the loss function calculation formula is as follows:
L = L_RPN + α·L_cls + β·L_loc + δ·L_seg + η·L_IoU    (6)
wherein L denotes the multi-task loss function; L_RPN denotes the RPN loss function of Faster R-CNN; L_cls and L_loc denote the classification loss function and the position regression loss function of Faster R-CNN, respectively; L_seg denotes the semantic segmentation loss function, in which only the positive-sample RoIs participate in the training of the semantic segmentation task and which is supervised with a cross-entropy loss; L_IoU denotes the IoU prediction loss function; α, β, δ and η denote the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively, where the regression loss weight β and the segmentation loss weight δ are set to 1 and the IoU prediction loss weight η is set to 0.5.
The IoU prediction loss function is computed over the positive-sample RoIs, wherein I_i denotes the real IoU, Î_i denotes the output value of the IoU prediction branch network of step S25, and N_pos denotes the number of positive-sample RoIs.
Step S4: the test image is acquired, and is preprocessed (e.g., size changed), and then the target detection model obtained in step S3 is input, so as to obtain the target classification and positioning result of the test image.
Claims (8)
1. A target detection method based on feature enhancement and IoU perception is characterized by comprising the following steps:
step 1: acquiring a target detection data set, and carrying out preprocessing operation on the image to form a training data set;
step 2: building a target detection network based on feature enhancement and IoU perception based on the two-stage target detection network Faster R-CNN;
the method comprises the following steps:
step 2.1: constructing a trunk characteristic extraction network, inputting a preprocessed image, and outputting a characteristic graph of the image;
step 2.2: after the trunk feature extraction network in the step 2.1, a RoI pooling network is built to obtain a plurality of regions of interest RoI of the output feature map in the step 2.1;
step 2.3: after the RoI pooling network is obtained in the step 2.2, a RoI classification regression branch network is built, the features of the RoIs obtained in the step 2.2 are extracted, the classification score and the position of a boundary frame of each RoI are predicted, and a final target detection result is output;
step 2.4: establishing a semantic segmentation branch network after the RoI pooling network in the step 2.2, establishing a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step 2.3 by using the extracted semantic segmentation feature map, wherein the method comprises the following steps:
step 2.4.1: for the target detection data set in the step 1, adding segmentation labels at the pixel level;
rounding and mapping the coordinates of a target frame of the input image to a RoI characteristic diagram, marking pixels falling in the target frame as positive samples, and marking the rest pixels as negative samples;
step 2.4.2: the input of the semantic segmentation branch network of step 2.4 is the 14 × 14 × C RoI feature map obtained from the RoI pooling layer; feature extraction is carried out with two 3 × 3 convolutional layers to obtain a feature map X_mask of size 14 × 14 × 512, and X_mask is then activated with a 3 × 3 convolutional layer and a sigmoid function to output the final RoI segmentation prediction;
step 2.4.3: using the feature map X_mask to enhance the RoI features, which specifically comprises the following steps:
designing a feature enhancement module whose inputs comprise the RoI features output by the intermediate layer of the RoI classification regression branch of step 2.3 and the feature map X_mask output by the intermediate layer of the semantic segmentation branch of step 2.4; for the RoI features, converting the number of channels of the feature map from C to 512 with a 1 × 1 convolution; for the semantic segmentation feature X_mask, firstly downsampling it with a bilinear interpolation algorithm, then performing feature transformation on it with a 1 × 1 convolution, obtaining a pixel-level attention map with a sigmoid function, and finally multiplying the feature map of the RoI branch by the attention map to obtain the enhanced feature map;
step 2.5: building IoU a prediction branch network after the RoI pooling network of step 2.2, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network of step 2.4, and the output is IoU between the predicted RoI and a real target box matched with the predicted RoI;
step 3: constructing a loss function and training the target detection network of step 2 according to the training data set to obtain a target detection model;
step 4: obtaining a test image, preprocessing the test image, and inputting it to the target detection model obtained in step 3 to obtain the target classification and positioning result of the test image.
2. The object detection method based on feature enhancement and IoU perception as claimed in claim 1, wherein in step 1, the image preprocessing operation comprises the following steps:
step 1.1: scaling the short side of the input image to 600 pixels;
step 1.2: randomly flipping the image horizontally for data augmentation.
3. The feature enhancement and IoU perception-based target detection method of claim 1, wherein in step 2.2, the RoI pooling algorithm uses RoI Align.
4. The object detection method based on feature enhancement and IoU perception of claim 1, wherein, for the RoI classification regression branch network of step 2.3, the feature map size of RoI Align output is 7 x 7.
5. The object detection method based on feature enhancement and IoU perception according to claim 1, wherein the step 2.3 includes the following steps:
step 2.3.1: the RoI classification regression branch network performs feature extraction with two consecutive 3 × 3 convolutions with padding 0; the output feature map is denoted X and has size 3 × 3 × 512;
step 2.3.2: performing dense prediction on the feature map X and classifying the feature vector at each position in turn:
S_i = σ(φ_θ(X_i))    (1)
wherein i is the position index in the feature map and X_i is the i-th feature vector; φ_θ(·) is the feature vector classification function, implemented with a 1 × 1 convolutional layer; σ(·) is the softmax operation, outputting a category score vector S_i of length K + 1, wherein K is the number of categories in the training data set;
in the training stage, the category of each feature vector is the same as the RoI label where the feature vector is located, and the classification task is calculated by using a cross entropy loss function;
in the testing stage, firstly calculating the prediction score at each position of the feature map X; the category score of the RoI is the mean S of the category scores at all positions, calculated as follows:
S = (1/9) · Σ_i S_i, i ∈ {0, 1, …, 8}    (2)
step 2.3.3: for bounding box regression, unlike Faster R-CNN, which regresses the center point coordinates of the bounding box and the scaling of its width and height (t_x, t_y, t_w, t_h), the coordinates of each edge of the bounding box are regressed independently, and the bounding box coordinates are parameterized as follows:
t_x1 = (x_1 - x_1a)/w_a,  t_x2 = (x_2 - x_2a)/w_a
t_y1 = (y_1 - y_1a)/h_a,  t_y2 = (y_2 - y_2a)/h_a    (3)
wherein (x_1, y_1, x_2, y_2) are the coordinates of the left, top, right and bottom edges of the predicted bounding box and (x*_1, y*_1, x*_2, y*_2) are the corresponding ground-truth edge coordinates; (t_x1, t_y1, t_x2, t_y2) are the predicted coordinate offsets and (t*_x1, t*_y1, t*_x2, t*_y2) are the regression targets, computed from the ground-truth edges in the same way; (x_1a, y_1a, x_2a, y_2a) are the coordinates of the left, top, right and bottom edges of the anchor box, w_a is the width of the anchor box, and h_a is the height of the anchor box;
step 2.3.4: regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions;
for each edge, it is regressed using a network whose parameters are not shared; the calculation formula is as follows:
t_x1 = φ_θx1(gmp(X_0, X_3, X_6)),
t_y1 = φ_θy1(gmp(X_0, X_1, X_2)),
t_x2 = φ_θx2(gmp(X_2, X_5, X_8)),    (4)
t_y2 = φ_θy2(gmp(X_6, X_7, X_8))
wherein X_i, i ∈ {0, 1, …, 8}, denotes the feature vector at the corresponding position of the feature map X, φ(·) denotes the coordinate regression function, implemented with a 1 × 1 convolutional layer, and gmp(·) denotes the global max pooling function.
6. The method of claim 1, wherein the size of the feature map output by the RoI Align is 14 x 14 for the semantic segmentation branch network of step 2.4.
7. The object detection method based on feature enhancement and IoU perception according to claim 1, wherein the step 2.5 includes the steps of:
step 2.5.1: the input of the IoU prediction branch network of step 2.5 is the intermediate-layer output feature X_mask of the semantic segmentation branch network of step 2.4; X_mask is transformed with a 1 × 1 convolutional layer, a 512-dimensional feature vector is then obtained with global average pooling, and finally a fully connected layer and a sigmoid function are used for activation, outputting the predicted IoU value for each RoI;
step 2.5.2: in the training phase, only the positive-sample RoIs participate in the training of the IoU prediction branch network;
step 2.5.3: in the testing phase, the classification scores are re-scored using the predicted IoU; in the re-scoring formula, the classification score predicted by the RoI classification regression branch network of step 2.3 is combined with the IoU value predicted by the IoU prediction branch network of step 2.5, S'_i denotes the re-scored classification score, and γ denotes a hyper-parameter in the interval [0, 1].
8. The object detection method based on feature enhancement and IoU perception as claimed in claim 1, wherein in step 3, the loss function is calculated as follows:
L = L_RPN + α·L_cls + β·L_loc + δ·L_seg + η·L_IoU    (6)
wherein L denotes the multi-task loss function; L_RPN denotes the RPN loss function of Faster R-CNN; L_cls and L_loc denote the classification loss function and the position regression loss function of Faster R-CNN, respectively; L_seg denotes the semantic segmentation loss function, in which only the positive-sample RoIs participate in the training of the semantic segmentation task and which is supervised with a cross-entropy loss; L_IoU denotes the IoU prediction loss function; α, β, δ and η denote the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively, wherein the regression loss weight β and the segmentation loss weight δ are set to 1 and the IoU prediction loss weight η is set to 0.5;
the IoU prediction loss function is computed over the positive-sample RoIs from the real IoU of each positive-sample RoI and the corresponding output of the IoU prediction branch network of step 2.5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110268913.1A CN112949635B (en) | 2021-03-12 | 2021-03-12 | Target detection method based on feature enhancement and IoU perception |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110268913.1A CN112949635B (en) | 2021-03-12 | 2021-03-12 | Target detection method based on feature enhancement and IoU perception |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112949635A CN112949635A (en) | 2021-06-11 |
CN112949635B true CN112949635B (en) | 2022-09-16 |
Family
ID=76229263
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110268913.1A Active CN112949635B (en) | 2021-03-12 | 2021-03-12 | Target detection method based on feature enhancement and IoU perception |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112949635B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113658199B (en) * | 2021-09-02 | 2023-11-03 | 中国矿业大学 | Regression correction-based chromosome instance segmentation network |
CN116340807B (en) * | 2023-01-10 | 2024-02-13 | 中国人民解放军国防科技大学 | Broadband Spectrum Signal Detection and Classification Network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108830192A (en) * | 2018-05-31 | 2018-11-16 | 珠海亿智电子科技有限公司 | Vehicle and detection method of license plate under vehicle environment based on deep learning |
CN110163207A (en) * | 2019-05-20 | 2019-08-23 | 福建船政交通职业学院 | One kind is based on Mask-RCNN ship target localization method and storage equipment |
CN110287960A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院信息工程研究所 | The detection recognition method of curve text in natural scene image |
CN111079739A (en) * | 2019-11-28 | 2020-04-28 | 长沙理工大学 | Multi-scale attention feature detection method |
CN111767799A (en) * | 2020-06-01 | 2020-10-13 | 重庆大学 | Improved down-going human target detection algorithm for fast R-CNN tunnel environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10679351B2 (en) * | 2017-08-18 | 2020-06-09 | Samsung Electronics Co., Ltd. | System and method for semantic segmentation of images |
- 2021-03-12: CN application CN202110268913.1A (patent CN112949635B, status: active)
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108830192A (en) * | 2018-05-31 | 2018-11-16 | 珠海亿智电子科技有限公司 | Vehicle and detection method of license plate under vehicle environment based on deep learning |
CN110163207A (en) * | 2019-05-20 | 2019-08-23 | 福建船政交通职业学院 | One kind is based on Mask-RCNN ship target localization method and storage equipment |
CN110287960A (en) * | 2019-07-02 | 2019-09-27 | 中国科学院信息工程研究所 | The detection recognition method of curve text in natural scene image |
CN111079739A (en) * | 2019-11-28 | 2020-04-28 | 长沙理工大学 | Multi-scale attention feature detection method |
CN111767799A (en) * | 2020-06-01 | 2020-10-13 | 重庆大学 | Improved down-going human target detection algorithm for fast R-CNN tunnel environment |
Non-Patent Citations (3)
Title |
---|
Mask R-CNN with Pyramid Attention Network for Scene Text Detection;Zhida Huang等;《arXiv》;20181122;全文 * |
Mask Scoring R-CNN;Zhaojin Huang等;《arXiv》;20190301;全文 * |
Oral leukoplakia segmentation based on Mask R-CNN with a spatial attention mechanism; Xie Fei et al.; Journal of Northwest University (Natural Science Edition); 2020-01-09 (No. 01); full text *
Also Published As
Publication number | Publication date |
---|---|
CN112949635A (en) | 2021-06-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108961235B (en) | Defective insulator identification method based on YOLOv3 network and particle filter algorithm | |
CN112418236B (en) | Automobile drivable area planning method based on multitask neural network | |
CN111179217A (en) | Attention mechanism-based remote sensing image multi-scale target detection method | |
CN111767944B (en) | Single-stage detector design method suitable for multi-scale target detection based on deep learning | |
CN112418108B (en) | Remote sensing image multi-class target detection method based on sample reweighing | |
CN111368769B (en) | Ship multi-target detection method based on improved anchor point frame generation model | |
CN107944443A (en) | One kind carries out object consistency detection method based on end-to-end deep learning | |
CN113177560A (en) | Universal lightweight deep learning vehicle detection method | |
CN112949635B (en) | Target detection method based on feature enhancement and IoU perception | |
CN112883934A (en) | Attention mechanism-based SAR image road segmentation method | |
CN112364931A (en) | Low-sample target detection method based on meta-feature and weight adjustment and network model | |
CN106845458B (en) | Rapid traffic sign detection method based on nuclear overrun learning machine | |
CN111259796A (en) | Lane line detection method based on image geometric features | |
CN112686233B (en) | Lane line identification method and device based on lightweight edge calculation | |
CN114170230B (en) | Glass defect detection method and device based on deformable convolution and feature fusion | |
CN112101113B (en) | Lightweight unmanned aerial vehicle image small target detection method | |
CN114332473A (en) | Object detection method, object detection device, computer equipment, storage medium and program product | |
CN115019201B (en) | Weak and small target detection method based on feature refinement depth network | |
CN113326734A (en) | Rotary target detection method based on YOLOv5 | |
CN113420648A (en) | Target detection method and system with rotation adaptability | |
CN113011415A (en) | Improved target detection method and system based on Grid R-CNN model | |
CN114913504A (en) | Vehicle target identification method of remote sensing image fused with self-attention mechanism | |
CN115345932A (en) | Laser SLAM loop detection method based on semantic information | |
CN114882205A (en) | Target detection method based on attention mechanism | |
CN115082897A (en) | Monocular vision 3D vehicle target real-time detection method for improving SMOKE |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |