CN112949635B - Target detection method based on feature enhancement and IoU perception - Google Patents

Target detection method based on feature enhancement and IoU perception

Info

Publication number
CN112949635B
Authority
CN
China
Prior art keywords
roi
iou
feature
network
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110268913.1A
Other languages
Chinese (zh)
Other versions
CN112949635A (en)
Inventor
马波
安骄阳
刘龙耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110268913.1A priority Critical patent/CN112949635B/en
Publication of CN112949635A publication Critical patent/CN112949635A/en
Application granted granted Critical
Publication of CN112949635B publication Critical patent/CN112949635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a target detection method based on feature enhancement and IoU perception, and belongs to the field of computer vision target detection. The method uses the spatial information of the convolutional feature map in the RoI classification regression network to improve the accuracy of target classification and localization, uses an attention mechanism to suppress the background information in the RoI features and enhance their semantic information, and uses an IoU re-scoring strategy to increase the correlation between the classification score and the bounding box confidence, thereby retaining high-quality bounding boxes. Through the RoI classification regression branch network, the method makes effective use of the spatial information in the feature map and effectively improves the classification and localization abilities of the target detection model; the features in the RoI classification regression branch network are enhanced by a bounding-box-level semantic segmentation branch network and attention mechanism; and through the IoU prediction branch network and the strategy of re-scoring the category score with the predicted IoU, the correlation between the category score of the target and the bounding box confidence is increased, effectively improving the localization accuracy of the bounding box.

Description

Target detection method based on feature enhancement and IoU perception
Technical Field
The invention relates to a target detection method based on feature enhancement and IoU (Intersection over Union) perception, and belongs to the field of computer vision target detection.
Background
Target detection is a fundamental task in the field of computer vision and is widely applied in fields such as aerospace, robot navigation and intelligent video surveillance. In recent years, with the development of deep-learning-based target detection algorithms, two types of target detection frameworks have gradually formed: one-stage target detectors and two-stage target detectors. One-stage detectors are fast, but their detection accuracy is relatively low. Two-stage detectors perform classification and bounding box coordinate regression on the target twice, so the detection accuracy of such algorithms is generally higher, and they are more widely used in industry.
In a two-stage target detection algorithm, the first stage generally uses a region proposal network to perform binary classification and bounding box coordinate regression on a large number of preset anchors in the image, and outputs a set of regions of interest (RoI) that potentially contain targets. In the second stage, a RoI classification regression network performs multi-class classification and bounding box coordinate regression on the regions of interest, and the final detection result is obtained after post-processing. When the bounding box coordinates predicted by the region proposal network are inaccurate, the RoI feature map generated by the RoI classification regression network may contain background information, which degrades the classification and localization accuracy.
The above difficulties in target detection leave current two-stage target detection techniques with the following defects:
1. Existing algorithms generally focus on improving the feature expression capability of the feature extraction network, but neglect feature enhancement methods aimed at the RoI classification and localization tasks in target detection.
2. When existing algorithms classify and regress the RoI, they make little use of spatial information and do not fully exploit the inherent structural information in the feature map.
3. Existing algorithms typically use a non-maximum suppression algorithm to remove redundant target boxes during post-processing. However, non-maximum suppression uses the category score of the target box as a proxy for its localization confidence, which may cause bounding boxes with lower category scores but accurate localization to be suppressed, degrading the performance of the detection algorithm.
Therefore, how to overcome the defects of existing target detection algorithms and realize efficient and robust target detection is an urgent technical problem to be solved.
Disclosure of Invention
The invention aims to provide a target detection method based on feature enhancement and IoU perception that overcomes the defects of the prior art and effectively addresses the shortcomings of the two-stage target detection technique.
The innovation of the invention is as follows: the method uses the spatial information of the convolutional feature map in the RoI classification regression network to improve the accuracy of target classification and localization, uses an attention mechanism to suppress the background information in the RoI features and enhance their semantic information, and uses an IoU re-scoring strategy to increase the correlation between the category score and the bounding box confidence, thereby retaining high-quality bounding boxes.
The invention is realized by adopting the following technical scheme:
a target detection method based on feature enhancement and IoU perception comprises the following steps:
step 1: and acquiring a target detection data set, and carrying out preprocessing operation on the image to form a training data set.
Specifically, in step 1, the image preprocessing operation includes:
step 1.1: scaling the short side of the input image to 600 pixels;
step 1.2: randomly flipping the image horizontally for data augmentation.
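As an illustration of step 1, a minimal preprocessing sketch is given below; it assumes PyTorch/torchvision, PIL images, and that bounding box annotations are rescaled and flipped together with the image (the helper name and box format are illustrative, not taken from the patent).

```python
import random
import torchvision.transforms.functional as TF

def preprocess(image, boxes, short_side=600, flip_prob=0.5):
    """Scale the short side to 600 pixels and randomly flip horizontally (steps 1.1-1.2)."""
    w, h = image.size                                  # PIL image: (width, height)
    scale = short_side / min(w, h)
    new_w, new_h = round(w * scale), round(h * scale)
    image = TF.resize(image, (new_h, new_w))           # short side becomes 600 pixels
    boxes = [[v * scale for v in b] for b in boxes]    # boxes given as [x1, y1, x2, y2]

    if random.random() < flip_prob:                    # random horizontal flip for augmentation
        image = TF.hflip(image)
        boxes = [[new_w - b[2], b[1], new_w - b[0], b[3]] for b in boxes]
    return image, boxes
```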
Step 2: and constructing a target detection network based on feature enhancement and IoU perception based on a two-stage target detection network, namely, Faster R-CNN.
Specifically, step 2 includes the following steps:
step 2.1: construct a trunk feature extraction network; its input is the preprocessed image and its output is the feature map of the image.
The trunk feature extraction network can be any convolutional network, such as VGG-16 or ResNet.
Step 2.2: and after the trunk feature extraction network in the step 2.1 is obtained, a RoI pooling network is built, and a plurality of regions of interest RoI of the output feature map in the step 2.1 are obtained.
Wherein the RoI pooling algorithm uses RoI Align to extract the RoI feature from the last feature map in the conv4_ x module of the ResNet network described in step 2.1.
Step 2.3: and after the RoI pooling network is obtained in the step 2.2, a RoI classification regression branch network is built, the features of the plurality of RoIs obtained in the step 2.2 are extracted, the classification score and the position of the boundary frame of each RoI are predicted, and a final target detection result is output.
For the RoI classification regression branch network described in step 2.3, the feature map size of the RoI Align output is 7 × 7.
In particular, said step 2.3 comprises the following steps:
step 2.3.1: the RoI classification regression branch network performs feature extraction by using two continuous padding-0 3 × 3 convolutions, an output feature map is marked as X, and the size of the feature map is 3 × 3 × 512;
step 2.3.2: carrying out dense prediction on the feature map X, and classifying the feature vectors of each position in turn, wherein the formula is as follows:
$S_i = \sigma(\phi_\theta(X_i)) \quad (1)$
where $i$ is the position index in the feature map and $X_i$ is the $i$-th feature vector; $\phi_\theta(\cdot)$ is the feature vector classification function, implemented with a 1 × 1 convolutional layer; $\sigma(\cdot)$ is the softmax operation, which outputs a class score vector $S_i$ of length $K+1$, where $K$ is the number of categories in the training data set.
In the training stage, the category of each feature vector is the same as the label of the RoI in which it is located, and the classification task is computed with a cross-entropy loss function.
In the testing stage, firstly, the prediction score of each position on the feature map X is calculated, the category score of the RoI is the mean value S of all the position category scores, and the calculation formula is as follows:
$S = \frac{1}{9}\sum_{i=0}^{8} S_i \quad (2)$
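As a sketch of steps 2.3.1-2.3.2 and equations (1)-(2), the following PyTorch module performs the dense per-position classification on the 3 × 3 × 512 RoI feature map and averages the position scores at test time; the class count and layer names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DenseRoIClassifier(nn.Module):
    """Per-position classification on a 3 x 3 x 512 RoI feature map (Eq. 1-2)."""

    def __init__(self, in_channels=512, num_classes=20):   # K = 20 classes is an assumption
        super().__init__()
        # phi_theta: a 1 x 1 convolution giving K+1 logits at each of the 9 positions
        self.cls_conv = nn.Conv2d(in_channels, num_classes + 1, kernel_size=1)

    def forward(self, x):                     # x: (N, 512, 3, 3)
        logits = self.cls_conv(x)             # (N, K+1, 3, 3)
        scores = F.softmax(logits, dim=1)     # sigma: softmax over classes at each position (Eq. 1)
        roi_score = scores.mean(dim=(2, 3))   # RoI score: mean over the 9 positions (Eq. 2)
        return scores, roi_score
```

During training, each of the 9 position scores would be supervised with the cross-entropy loss against the RoI's label, as described above.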
step 2.3.3: for bounding box regression, instead of regressing the center point coordinates and the width/height scaling factors $(t_x, t_y, t_w, t_h)$ of the bounding box as in Faster R-CNN, the coordinates of each edge of the bounding box are regressed independently; the bounding box coordinates are parameterized as follows:
$t_{x1} = (x_1 - x_{1a})/w_a,\quad t_{x2} = (x_2 - x_{2a})/w_a$
$t_{y1} = (y_1 - y_{1a})/h_a,\quad t_{y2} = (y_2 - y_{2a})/h_a$
$t^{*}_{x1} = (x^{*}_1 - x_{1a})/w_a,\quad t^{*}_{x2} = (x^{*}_2 - x_{2a})/w_a \quad (3)$
$t^{*}_{y1} = (y^{*}_1 - y_{1a})/h_a,\quad t^{*}_{y2} = (y^{*}_2 - y_{2a})/h_a$
where $(x_1, y_1, x_2, y_2)$ denote the coordinates of the left, top, right and bottom edges of the predicted bounding box, $(x^{*}_1, y^{*}_1, x^{*}_2, y^{*}_2)$ denote the corresponding edge coordinates of the ground-truth box, $(t_{x1}, t_{y1}, t_{x2}, t_{y2})$ are the predicted coordinate offsets, $(t^{*}_{x1}, t^{*}_{y1}, t^{*}_{x2}, t^{*}_{y2})$ are the regression target offsets, $(x_{1a}, y_{1a}, x_{2a}, y_{2a})$ denote the coordinates of the left, top, right and bottom edges of the anchor box, $w_a$ denotes the width of the anchor box, and $h_a$ denotes the height of the anchor box.
Step 2.3.4: and regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions.
For each edge, it is regressed using a network whose parameters are not shared, the calculation formula is as follows:
$t_{x1} = \phi_{\theta_{x1}}(\mathrm{gmp}(X_0, X_3, X_6)),$
$t_{y1} = \phi_{\theta_{y1}}(\mathrm{gmp}(X_0, X_1, X_2)),$
$t_{x2} = \phi_{\theta_{x2}}(\mathrm{gmp}(X_2, X_5, X_8)), \quad (4)$
$t_{y2} = \phi_{\theta_{y2}}(\mathrm{gmp}(X_6, X_7, X_8))$
where $X_i$, $i \in \{0, 1, \ldots, 8\}$, denotes the feature vector at the corresponding position on the feature map X, $\phi_{\theta_{\cdot}}(\cdot)$ denotes the coordinate regression function, implemented with a 1 × 1 convolutional layer, and $\mathrm{gmp}(\cdot)$ denotes the global max pooling function.
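A sketch of the per-edge regression of equation (4), assuming the 3 × 3 positions are numbered row-major so that, for example, X0, X3, X6 form the left column; module and layer names are illustrative.

```python
import torch
import torch.nn as nn

class EdgeRegressor(nn.Module):
    """Regress each bounding box edge from the features on its own side of the 3 x 3 map (Eq. 4)."""

    def __init__(self, in_channels=512):
        super().__init__()
        # one regression head per edge, parameters not shared (phi_theta_x1, ..., phi_theta_y2)
        self.fc_x1 = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.fc_y1 = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.fc_x2 = nn.Conv2d(in_channels, 1, kernel_size=1)
        self.fc_y2 = nn.Conv2d(in_channels, 1, kernel_size=1)

    @staticmethod
    def gmp(x, rows, cols):
        """Global max pooling over three selected positions, kept as a 1 x 1 map."""
        sel = x[:, :, rows, cols]                              # (N, C, 3)
        return sel.max(dim=2, keepdim=True)[0].unsqueeze(-1)   # (N, C, 1, 1)

    def forward(self, x):                                      # x: (N, 512, 3, 3)
        t_x1 = self.fc_x1(self.gmp(x, [0, 1, 2], [0, 0, 0]))   # left column:  X0, X3, X6
        t_y1 = self.fc_y1(self.gmp(x, [0, 0, 0], [0, 1, 2]))   # top row:      X0, X1, X2
        t_x2 = self.fc_x2(self.gmp(x, [0, 1, 2], [2, 2, 2]))   # right column: X2, X5, X8
        t_y2 = self.fc_y2(self.gmp(x, [2, 2, 2], [0, 1, 2]))   # bottom row:   X6, X7, X8
        return torch.cat([t_x1, t_y1, t_x2, t_y2], dim=1).flatten(1)  # (N, 4) edge offsets
```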
Step 2.4: and (3) building a semantic segmentation branch network after the RoI pooling network in the step 2.1, building a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step 2.3 by using the extracted semantic segmentation feature map.
Specifically, step 2.4 includes the steps of:
step 2.4.1: add pixel-level segmentation labels to the target detection data set of step 1.
Specifically, the coordinates of the target frame of the input image are rounded and mapped onto the RoI feature map, pixels falling within the target frame are labeled as positive samples, and the remaining pixels are labeled as negative samples.
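As an illustration of step 2.4.1, a sketch of generating the binary segmentation label for one RoI is given below; the exact coordinate mapping convention (aligning the RoI box with its 14 × 14 grid) is an assumption made for illustration.

```python
import numpy as np

def make_roi_mask(roi_box, gt_box, grid_size=14):
    """Binary grid label: 1 for cells that fall inside the matched ground-truth box."""
    rx1, ry1, rx2, ry2 = roi_box                     # RoI box in image coordinates
    gx1, gy1, gx2, gy2 = gt_box                      # matched ground-truth box
    sx = grid_size / max(rx2 - rx1, 1e-6)            # scale from image to grid coordinates
    sy = grid_size / max(ry2 - ry1, 1e-6)
    x1 = int(round((gx1 - rx1) * sx))
    x2 = int(round((gx2 - rx1) * sx))
    y1 = int(round((gy1 - ry1) * sy))
    y2 = int(round((gy2 - ry1) * sy))
    mask = np.zeros((grid_size, grid_size), dtype=np.float32)
    mask[max(y1, 0):min(y2, grid_size), max(x1, 0):min(x2, grid_size)] = 1.0   # positive pixels
    return mask
```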
Step 2.4.2: step 2.4, the input of the semantic segmentation branching network is a RoI feature map with the size of 14 × 14 × C obtained by a RoI pooling layer, and feature extraction is carried out by using two convolution layers with the size of 3 × 3 to obtain a feature map X with the size of 14 × 14 × 512 mask To X mask Activating by using a 3 x 3 convolutional layer and a sigmoid function, and outputting a final RoI partition prediction;
step 2.4.3: use the feature map $X_{mask}$ to enhance the RoI features.
Specifically, a feature enhancement module is designed whose inputs comprise the RoI features output by the intermediate layer of the RoI classification regression branch of step 2.3 and the feature map $X_{mask}$ output by the intermediate layer of the semantic segmentation branch of step 2.4. For the RoI features, a 1 × 1 convolution converts the number of channels of the feature map from C to 512. For the semantic segmentation feature $X_{mask}$, the feature map is first downsampled with a bilinear interpolation algorithm, then transformed with a 1 × 1 convolution, and a pixel-level attention map is obtained with a sigmoid function; finally, the feature map of the RoI branch is multiplied by the attention map to obtain the enhanced feature map.
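A sketch of the attention-based feature enhancement module of step 2.4.3; the single-channel attention map and the downsampling target size (matching the RoI branch feature map) are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnhanceModule(nn.Module):
    """Enhance RoI features with a pixel-level attention map derived from X_mask."""

    def __init__(self, roi_channels, mask_channels=512, out_channels=512):
        super().__init__()
        self.roi_proj = nn.Conv2d(roi_channels, out_channels, kernel_size=1)   # C -> 512
        self.attn_proj = nn.Conv2d(mask_channels, 1, kernel_size=1)            # attention logits

    def forward(self, roi_feat, x_mask):
        # roi_feat: RoI branch feature map; x_mask: (N, 512, 14, 14) from the segmentation branch
        roi_feat = self.roi_proj(roi_feat)
        x_mask = F.interpolate(x_mask, size=roi_feat.shape[-2:],
                               mode='bilinear', align_corners=False)            # bilinear downsampling
        attn = torch.sigmoid(self.attn_proj(x_mask))                            # pixel-level attention
        return roi_feat * attn                                                  # enhanced feature map
```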
For the semantic segmentation branch network described in step 2.4, the feature map size output by the RoI Align is 14 × 14.
Step 2.5: and (3) building IoU a prediction branch network after the RoI pooling network in the step 2.1, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network in the step 2.4, and the output of the prediction branch network is IoU between the predicted RoI and a real target box matched with the predicted RoI.
Specifically, step 2.5 comprises the steps of:
step 2.5.1: the input of the IoU prediction branch network of step 2.5 is the intermediate-layer output feature $X_{mask}$ of the semantic segmentation branch network of step 2.4; $X_{mask}$ is transformed with a 1 × 1 convolutional layer, a 512-dimensional feature vector is then obtained by global average pooling, and finally a fully connected layer and a sigmoid function are applied to output the predicted IoU value for each RoI;
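A sketch of the IoU prediction branch of step 2.5.1 (layer names are illustrative; the sizes follow the description above).

```python
import torch
import torch.nn as nn

class IoUPredictor(nn.Module):
    """Predict, for each RoI, the IoU with its matched ground-truth box from X_mask."""

    def __init__(self, in_channels=512, hidden=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden, kernel_size=1)   # 1 x 1 transform of X_mask
        self.fc = nn.Linear(hidden, 1)                              # fully connected layer

    def forward(self, x_mask):                        # x_mask: (N, 512, 14, 14)
        x = self.conv(x_mask).mean(dim=(2, 3))        # global average pooling -> (N, 512)
        return torch.sigmoid(self.fc(x)).squeeze(1)   # predicted IoU in (0, 1), one value per RoI
```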
step 2.5.2: in the training phase, only the positive sample RoI participates in IoU training of the prediction branch network;
step 2.5.3: in the testing phase, the classification scores were re-scored using the predicted IoU, the calculation formula being as follows:
$S_i' = \left(\hat{S}_i\right)^{\gamma} \cdot \left(\widehat{IoU}_i\right)^{1-\gamma} \quad (5)$
where $\hat{S}_i$ denotes the classification score predicted by the RoI classification regression branch network of step 2.3, $S_i'$ denotes the re-scored classification score, $\gamma$ denotes a hyperparameter in the interval [0, 1], and $\widehat{IoU}_i$ denotes the IoU value predicted by the IoU prediction branch network of step 2.5.
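A one-line sketch of the re-scoring strategy in step 2.5.3; the geometric weighting shown here is one common form of such a re-scoring and is an assumption about the exact formula of equation (5).

```python
def rescore(cls_score, pred_iou, gamma=0.5):
    """Re-score: combine the classification score with the predicted IoU, weighted by gamma."""
    return (cls_score ** gamma) * (pred_iou ** (1.0 - gamma))
```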
step 3: construct a loss function and train the target detection network of step 2 according to the training data set to obtain the target detection model.
Wherein, the loss function calculation formula is as follows:
$L = L_{RPN} + \alpha L_{cls} + \beta L_{loc} + \delta L_{seg} + \eta L_{IoU} \quad (6)$
where $L$ denotes the multi-task loss function; $L_{RPN}$ denotes the RPN loss function of Faster R-CNN; $L_{cls}$ and $L_{loc}$ denote the classification loss function and the position regression loss function of Faster R-CNN, respectively; $L_{seg}$ denotes the semantic segmentation loss function, in which only positive-sample RoIs participate in the training of the semantic segmentation task, supervised with a cross-entropy loss; $L_{IoU}$ denotes the IoU prediction loss function; and $\alpha$, $\beta$, $\delta$, $\eta$ denote the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively.
The regression loss weight $\beta$ and the segmentation loss weight $\delta$ are set to 1, and the IoU prediction loss weight $\eta$ is set to 0.5.
The IoU prediction loss function is specifically as follows:
$L_{IoU} = \frac{1}{N_{pos}} \sum_{i=1}^{N_{pos}} \ell\!\left(I_i, \widehat{IoU}_i\right) \quad (7)$
where $I_i$ denotes the real IoU, $\widehat{IoU}_i$ denotes the output value of the IoU prediction branch network of step 2.5, $\ell(\cdot,\cdot)$ denotes the per-RoI regression loss between the real and predicted IoU, and $N_{pos}$ denotes the number of positive-sample RoIs.
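A sketch of assembling the multi-task loss of equation (6), with the IoU term averaged over positive RoIs as in equation (7); alpha = 1.0 and the squared-error form of the per-RoI IoU loss are illustrative assumptions, since neither is fixed by the text above.

```python
import torch.nn.functional as F

def total_loss(loss_rpn, loss_cls, loss_loc, loss_seg, pred_iou_pos, real_iou_pos,
               alpha=1.0, beta=1.0, delta=1.0, eta=0.5):
    """Multi-task loss of Eq. (6); the IoU term averages a per-RoI loss over positive RoIs (Eq. 7)."""
    n_pos = max(pred_iou_pos.numel(), 1)
    loss_iou = F.mse_loss(pred_iou_pos, real_iou_pos, reduction='sum') / n_pos   # assumed L2 form
    return loss_rpn + alpha * loss_cls + beta * loss_loc + delta * loss_seg + eta * loss_iou
```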
step 4: obtain a test image, preprocess it (e.g., resize it), and input it into the target detection model obtained in step 3 to obtain the target classification and localization results for the test image.
Advantageous effects
Compared with the prior art, the method has the following advantages:
through the RoI classification regression branch network, the spatial information in the feature map can be effectively utilized, and the classification and positioning capabilities of the target detection model are effectively improved; the features in the RoI classification regression branch network are enhanced through the semantic segmentation branch network and the attention mechanism at the boundary box level; through IoU prediction branch networks and a strategy of re-scoring the category scores by using prediction IoU, the relevance between the category scores of the targets and the confidence coefficient of the bounding box is improved, and the positioning accuracy of the bounding box is effectively improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a feature enhancement and IoU perception based object detection network architecture provided by the present invention;
FIG. 3 is a schematic diagram illustrating mask generation in a semantic segmentation branch network according to the present invention;
FIG. 4 is a schematic diagram of a feature enhancement module based on an attention mechanism according to the present invention.
Detailed Description
The method of the present invention is described in further detail below with reference to the examples and the accompanying drawings. The method comprises the following steps:
step S1: acquiring a target detection data set and carrying out preprocessing operation on the image to form a training data set;
the image preprocessing operation steps are as follows:
step S11: scaling the short side of the input image to 600 pixels;
step S12: data augmentation is performed using random horizontal flipping of the image.
Step S2: referring to fig. 2, fig. 2 shows a target detection network based on feature enhancement and IoU perception, which is built based on a two-stage target detection network, fast R-CNN;
the target detection network construction steps based on feature enhancement and IoU perception are as follows:
step S21: constructing a trunk characteristic extraction network, inputting a preprocessed image, and outputting a characteristic graph of the image;
the backbone feature extraction network can be any convolution network, such as VGG-16, ResNet, and the like.
Step S22: constructing a RoI pooling network after the backbone feature extraction network in the step S21, and obtaining a plurality of regions of interest RoI of the output feature map in the step S21;
the method for constructing the RoI pooling network comprises the following steps:
the RoI pooling algorithm uses RoI Align to extract the RoI feature from the last feature map in the conv4_ x module of the ResNet network in step S21;
for the RoI classification regression network branch, the characteristic map size of the RoI Align output is 7 x 7;
for the semantic split branching network, the characteristic graph size of the RoI Align output is 14 x 14.
Step S23: constructing a RoI classification regression branch network after the RoI pooling network in the step S22, performing feature extraction on the RoIs in the step S22, predicting the classification score and the position of a boundary frame of each RoI, and outputting a final target detection result;
the method for constructing the RoI classification regression branch network comprises the following steps:
step S231: the RoI classification regression branch network performs feature extraction by using two continuous padding-0 3X 3 convolutions, an output feature map is marked as X, and the size of the feature map is 3X 512;
step S232: carrying out dense prediction on the feature map X, and classifying the feature vectors of each position in sequence, wherein the calculation formula is as follows:
$S_i = \sigma(\phi_\theta(X_i)) \quad (1)$
where $i$ is the position index in the feature map and $X_i$ is the $i$-th feature vector; $\phi_\theta(\cdot)$ is the feature vector classification function, implemented with a 1 × 1 convolutional layer; $\sigma(\cdot)$ is the softmax operation, which outputs a class score vector $S_i$ of length $K+1$, where $K$ is the number of categories in the training data set.
In the training stage, the category of each feature vector is the same as the label of the RoI in which it is located, and the classification task is computed with a cross-entropy loss function.
In the testing stage, firstly, the prediction score of each position on the feature map X is calculated, the category score of the RoI is the mean value S of all the position category scores, and the calculation formula is as follows:
$S = \frac{1}{9}\sum_{i=0}^{8} S_i \quad (2)$
step S233: for bounding box regression, instead of regressing the center point coordinates and the width/height scaling factors $(t_x, t_y, t_w, t_h)$ of the bounding box as in Faster R-CNN, in the embodiment of the present invention the coordinates of each edge of the bounding box are regressed independently; the bounding box coordinates are parameterized as follows:
$t_{x1} = (x_1 - x_{1a})/w_a,\quad t_{x2} = (x_2 - x_{2a})/w_a$
$t_{y1} = (y_1 - y_{1a})/h_a,\quad t_{y2} = (y_2 - y_{2a})/h_a$
$t^{*}_{x1} = (x^{*}_1 - x_{1a})/w_a,\quad t^{*}_{x2} = (x^{*}_2 - x_{2a})/w_a \quad (3)$
$t^{*}_{y1} = (y^{*}_1 - y_{1a})/h_a,\quad t^{*}_{y2} = (y^{*}_2 - y_{2a})/h_a$
where $(x_1, y_1, x_2, y_2)$ denote the coordinates of the left, top, right and bottom edges of the predicted bounding box, $(x^{*}_1, y^{*}_1, x^{*}_2, y^{*}_2)$ denote the corresponding edge coordinates of the ground-truth box, $(t_{x1}, t_{y1}, t_{x2}, t_{y2})$ are the predicted coordinate offsets, $(t^{*}_{x1}, t^{*}_{y1}, t^{*}_{x2}, t^{*}_{y2})$ are the regression target offsets, $(x_{1a}, y_{1a}, x_{2a}, y_{2a})$ denote the coordinates of the left, top, right and bottom edges of the anchor box, $w_a$ denotes the width of the anchor box, and $h_a$ denotes the height of the anchor box.
Step S234: and regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions. For each edge, it is regressed using a network whose parameters are not shared, the calculation formula is as follows:
$t_{x1} = \phi_{\theta_{x1}}(\mathrm{gmp}(X_0, X_3, X_6)),$
$t_{y1} = \phi_{\theta_{y1}}(\mathrm{gmp}(X_0, X_1, X_2)),$
$t_{x2} = \phi_{\theta_{x2}}(\mathrm{gmp}(X_2, X_5, X_8)), \quad (4)$
$t_{y2} = \phi_{\theta_{y2}}(\mathrm{gmp}(X_6, X_7, X_8))$
where $X_i$, $i \in \{0, 1, \ldots, 8\}$, denotes the feature vector at the corresponding position on the feature map X, $\phi_{\theta_{\cdot}}(\cdot)$ denotes the coordinate regression function, implemented with a 1 × 1 convolutional layer, and $\mathrm{gmp}(\cdot)$ denotes the global max pooling function.
Step S24: building a semantic segmentation branch network behind the RoI pooling network in the step S21, building a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step S23 by using the extracted semantic segmentation feature map;
the semantic division branch network construction steps are as follows:
step S241: for the target detection data set of step S1, segmentation labels at the pixel level are added. Referring to fig. 3, fig. 3 illustrates a mask generation process in a semantic segmentation branch network. Specifically, the coordinates of a target frame of the input image are rounded and mapped onto the RoI feature map, pixels falling in the target frame are marked as positive samples, and the rest pixels are marked as negative samples;
step S242: the input of the semantic segmentation branch network of step S24 is the RoI feature map of size 14 × 14 × C obtained from the RoI pooling layer; feature extraction is performed using two 3 × 3 convolutional layers to obtain a feature map $X_{mask}$ of size 14 × 14 × 512; $X_{mask}$ is then activated with a 3 × 3 convolutional layer and a sigmoid function to output the final RoI segmentation prediction;
step S243: using the feature map $X_{mask}$ to enhance the RoI features. Specifically, referring to fig. 4, fig. 4 shows the specific structure of the feature enhancement module. Its inputs comprise: the RoI features output by the intermediate layer of the RoI classification regression branch of step S23 and the feature map $X_{mask}$ output by the intermediate layer of the semantic segmentation branch of step S24. For the RoI features, a 1 × 1 convolution converts the number of channels of the feature map from C to 512; for the semantic segmentation feature $X_{mask}$, the feature map is first downsampled with a bilinear interpolation algorithm, then transformed with a 1 × 1 convolution, a pixel-level attention map is then obtained with a sigmoid function, and finally the feature map of the RoI branch is multiplied by the attention map to obtain the enhanced feature map.
Step S25: and building IoU a prediction branch network after the RoI pooling network in the step S21, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network in the step S24, and the output of the prediction branch network is IoU between the predicted RoI and a real target box matched with the predicted RoI.
The IoU prediction branch network building steps are as follows:
step S251: the input of the IoU prediction branch network of step S25 is the intermediate-layer output feature $X_{mask}$ of the semantic segmentation branch network of step S24; $X_{mask}$ is transformed with a 1 × 1 convolutional layer, a 512-dimensional feature vector is then obtained by global average pooling, and finally a fully connected layer and a sigmoid function are applied to output the predicted IoU value for each RoI;
step S252: in the training phase, only the positive sample RoI participates in IoU training of the prediction branch network;
step S253: in the testing stage, the classification score is re-scored using the predicted IoU, and the calculation formula is as follows:
$S_i' = \left(\hat{S}_i\right)^{\gamma} \cdot \left(\widehat{IoU}_i\right)^{1-\gamma} \quad (5)$
where $\hat{S}_i$ denotes the classification score predicted by the RoI classification regression branch network of step S23, $S_i'$ denotes the re-scored classification score, $\gamma$ denotes a hyperparameter in the interval [0, 1], and $\widehat{IoU}_i$ denotes the IoU value predicted by the IoU prediction branch network of step S25.
Step S3: and constructing a loss function, and training the target detection network in the step S2 according to the training data set to obtain a target detection model.
Wherein, the loss function calculation formula is as follows:
$L = L_{RPN} + \alpha L_{cls} + \beta L_{loc} + \delta L_{seg} + \eta L_{IoU} \quad (6)$
where $L$ denotes the multi-task loss function; $L_{RPN}$ denotes the RPN loss function of Faster R-CNN; $L_{cls}$ and $L_{loc}$ denote the classification loss function and the position regression loss function of Faster R-CNN, respectively; $L_{seg}$ denotes the semantic segmentation loss function, in which only positive-sample RoIs participate in the training of the semantic segmentation task, supervised with a cross-entropy loss; $L_{IoU}$ denotes the IoU prediction loss function; and $\alpha$, $\beta$, $\delta$, $\eta$ denote the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively.
The regression loss weight $\beta$ and the segmentation loss weight $\delta$ are set to 1, and the IoU prediction loss weight $\eta$ is set to 0.5.
The IoU prediction loss function is specifically as follows:
$L_{IoU} = \frac{1}{N_{pos}} \sum_{i=1}^{N_{pos}} \ell\!\left(I_i, \widehat{IoU}_i\right) \quad (7)$
where $I_i$ denotes the real IoU, $\widehat{IoU}_i$ denotes the output value of the IoU prediction branch network of step S25, $\ell(\cdot,\cdot)$ denotes the per-RoI regression loss between the real and predicted IoU, and $N_{pos}$ denotes the number of positive-sample RoIs.
Step S4: the test image is acquired, and is preprocessed (e.g., size changed), and then the target detection model obtained in step S3 is input, so as to obtain the target classification and positioning result of the test image.

Claims (8)

1. A target detection method based on feature enhancement and IoU perception is characterized by comprising the following steps:
step 1: acquiring a target detection data set, and carrying out preprocessing operation on the image to form a training data set;
step 2: building a target detection network based on feature enhancement and IoU perception based on the two-stage target detection network Faster R-CNN;
the method comprises the following steps:
step 2.1: constructing a trunk characteristic extraction network, inputting a preprocessed image, and outputting a characteristic graph of the image;
step 2.2: after the trunk feature extraction network in the step 2.1, a RoI pooling network is built to obtain a plurality of regions of interest RoI of the output feature map in the step 2.1;
step 2.3: after the RoI pooling network is obtained in the step 2.2, a RoI classification regression branch network is built, the features of the RoIs obtained in the step 2.2 are extracted, the classification score and the position of a boundary frame of each RoI are predicted, and a final target detection result is output;
step 2.4: establishing a semantic segmentation branch network after the RoI pooling network in the step 2.2, establishing a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step 2.3 by using the extracted semantic segmentation feature map, wherein the method comprises the following steps:
step 2.4.1: for the target detection data set in the step 1, adding segmentation labels at the pixel level;
rounding and mapping the coordinates of a target frame of the input image to a RoI characteristic diagram, marking pixels falling in the target frame as positive samples, and marking the rest pixels as negative samples;
step 2.4.2: the input of the semantic segmentation branch network of step 2.4 is the RoI feature map of size 14 × 14 × C obtained from the RoI pooling layer; feature extraction is carried out with two 3 × 3 convolutional layers to obtain a feature map $X_{mask}$ of size 14 × 14 × 512; $X_{mask}$ is then activated with a 3 × 3 convolutional layer and a sigmoid function to output the final RoI segmentation prediction;
step 2.4.3: using the feature map $X_{mask}$ to enhance the RoI features, which specifically comprises the following steps:
designing a feature enhancement module whose inputs comprise the RoI features output by the intermediate layer of the RoI classification regression branch of step 2.3 and the feature map $X_{mask}$ output by the intermediate layer of the semantic segmentation branch of step 2.4; for the RoI features, converting the number of channels of the feature map from C to 512 with a 1 × 1 convolution; for the semantic segmentation feature $X_{mask}$, first downsampling it with a bilinear interpolation algorithm, then performing feature transformation with a 1 × 1 convolution, obtaining a pixel-level attention map with a sigmoid function, and finally multiplying the feature map of the RoI branch by the attention map to obtain the enhanced feature map;
step 2.5: building IoU a prediction branch network after the RoI pooling network of step 2.2, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network of step 2.4, and the output is IoU between the predicted RoI and a real target box matched with the predicted RoI;
step 3: constructing a loss function, training the target detection network of step 2 according to the training data set, and obtaining a target detection model;
step 4: obtaining a test image, preprocessing the test image, and inputting it into the target detection model obtained in step 3 to obtain the target classification and positioning result of the test image.
2. The object detection method based on feature enhancement and IoU perception as claimed in claim 1, wherein in step 1, the image preprocessing operation comprises the following steps:
step 1.1: scaling the short side of the input image to 600 pixels;
step 1.2: randomly flipping the image horizontally to augment the data.
3. The feature enhancement and IoU perception-based target detection method of claim 1, wherein in step 2.2, the RoI pooling algorithm uses RoI Align.
4. The object detection method based on feature enhancement and IoU perception of claim 1, wherein, for the RoI classification regression branch network of step 2.3, the feature map size of RoI Align output is 7 x 7.
5. The object detection method based on feature enhancement and IoU perception according to claim 1, wherein the step 2.3 includes the following steps:
step 2.3.1: the RoI classification regression branch network performs feature extraction by using two continuous padding-0 3 × 3 convolutions, an output feature map is marked as X, and the size of the feature map is 3 × 3 × 512;
step 2.3.2: carrying out dense prediction on the feature map X, and classifying the feature vectors of each position in turn, wherein the formula is as follows:
$S_i = \sigma(\phi_\theta(X_i)) \quad (1)$
wherein $i$ is the position index in the feature map and $X_i$ is the $i$-th feature vector; $\phi_\theta(\cdot)$ is the feature vector classification function, implemented with a 1 × 1 convolutional layer; $\sigma(\cdot)$ is the softmax operation, outputting a class score vector $S_i$ of length $K+1$, wherein $K$ is the number of categories in the training data set;
in the training stage, the category of each feature vector is the same as the label of the RoI in which it is located, and the classification task is computed with a cross-entropy loss function;
in the testing stage, the prediction score of each position on the feature map X is first computed, and the category score of the RoI is the mean $S$ of the category scores over all positions, computed as:
$S = \frac{1}{9}\sum_{i=0}^{8} S_i \quad (2)$
step 2.3.3: for bounding box regression, instead of regressing the center point coordinates and the width/height scaling factors $(t_x, t_y, t_w, t_h)$ of the bounding box as in Faster R-CNN, the coordinates of each edge of the bounding box are regressed independently; the bounding box coordinates are parameterized as follows:
$t_{x1} = (x_1 - x_{1a})/w_a,\quad t_{x2} = (x_2 - x_{2a})/w_a$
$t_{y1} = (y_1 - y_{1a})/h_a,\quad t_{y2} = (y_2 - y_{2a})/h_a$
$t^{*}_{x1} = (x^{*}_1 - x_{1a})/w_a,\quad t^{*}_{x2} = (x^{*}_2 - x_{2a})/w_a \quad (3)$
$t^{*}_{y1} = (y^{*}_1 - y_{1a})/h_a,\quad t^{*}_{y2} = (y^{*}_2 - y_{2a})/h_a$
wherein $(x_1, y_1, x_2, y_2)$ denote the coordinates of the left, top, right and bottom edges of the predicted bounding box, $(x^{*}_1, y^{*}_1, x^{*}_2, y^{*}_2)$ denote the corresponding edge coordinates of the ground-truth box, $(t_{x1}, t_{y1}, t_{x2}, t_{y2})$ are the predicted coordinate offsets, $(t^{*}_{x1}, t^{*}_{y1}, t^{*}_{x2}, t^{*}_{y2})$ are the regression target offsets, $(x_{1a}, y_{1a}, x_{2a}, y_{2a})$ denote the coordinates of the left, top, right and bottom edges of the anchor box, $w_a$ denotes the width of the anchor box, and $h_a$ denotes the height of the anchor box;
step 2.3.4: regressing the coordinate of each edge using the features at the corresponding positions;
for each edge, a network with unshared parameters is used for regression, computed as follows:
$t_{x1} = \phi_{\theta_{x1}}(\mathrm{gmp}(X_0, X_3, X_6)),\quad t_{y1} = \phi_{\theta_{y1}}(\mathrm{gmp}(X_0, X_1, X_2)),\quad t_{x2} = \phi_{\theta_{x2}}(\mathrm{gmp}(X_2, X_5, X_8)),\quad t_{y2} = \phi_{\theta_{y2}}(\mathrm{gmp}(X_6, X_7, X_8)) \quad (4)$
wherein $X_i$, $i \in \{0, 1, \ldots, 8\}$, denotes the feature vector at the corresponding position on the feature map X, $\phi_{\theta_{\cdot}}(\cdot)$ denotes the coordinate regression function, implemented with a 1 × 1 convolutional layer, and $\mathrm{gmp}(\cdot)$ denotes the global max pooling function.
6. The method of claim 1, wherein the size of the feature map output by the RoI Align is 14 x 14 for the semantic segmentation branch network of step 2.4.
7. The object detection method based on feature enhancement and IoU perception according to claim 1, wherein the step 2.5 includes the steps of:
step 2.5.1: the input of the IoU prediction branch network of step 2.5 is the intermediate-layer output feature $X_{mask}$ of the semantic segmentation branch network of step 2.4; $X_{mask}$ is transformed with a 1 × 1 convolutional layer, a 512-dimensional feature vector is then obtained by global average pooling, and finally a fully connected layer and a sigmoid function are applied to output the predicted IoU value for each RoI;
step 2.5.2: in the training phase, only the positive sample RoI participates in IoU training of the prediction branch network;
step 2.5.3: in the testing phase, the classification scores were re-scored using the predicted IoU, the calculation formula being as follows:
$S_i' = \left(\hat{S}_i\right)^{\gamma} \cdot \left(\widehat{IoU}_i\right)^{1-\gamma} \quad (5)$
wherein $\hat{S}_i$ denotes the classification score predicted by the RoI classification regression branch network of step 2.3, $S_i'$ denotes the re-scored classification score, $\gamma$ denotes a hyperparameter in the interval [0, 1], and $\widehat{IoU}_i$ denotes the IoU value predicted by the IoU prediction branch network of step 2.5.
8. The object detection method based on feature enhancement and IoU perception as claimed in claim 1, wherein in step 3, the loss function is calculated as follows:
$L = L_{RPN} + \alpha L_{cls} + \beta L_{loc} + \delta L_{seg} + \eta L_{IoU} \quad (6)$
wherein $L$ denotes the multi-task loss function; $L_{RPN}$ denotes the RPN loss function of Faster R-CNN; $L_{cls}$ and $L_{loc}$ denote the classification loss function and the position regression loss function of Faster R-CNN, respectively; $L_{seg}$ denotes the semantic segmentation loss function, wherein only positive-sample RoIs participate in the training of the semantic segmentation task, supervised with a cross-entropy loss; $L_{IoU}$ denotes the IoU prediction loss function; $\alpha$, $\beta$, $\delta$, $\eta$ denote the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively; the regression loss weight $\beta$ and the segmentation loss weight $\delta$ are set to 1, and the IoU prediction loss weight $\eta$ is set to 0.5;
the IoU predicted loss function is specifically as follows:
$L_{IoU} = \frac{1}{N_{pos}} \sum_{i=1}^{N_{pos}} \ell\!\left(I_i, \widehat{IoU}_i\right) \quad (7)$
wherein $I_i$ denotes the real IoU, $\widehat{IoU}_i$ denotes the output value of the IoU prediction branch network of step 2.5, $\ell(\cdot,\cdot)$ denotes the per-RoI regression loss between the real and predicted IoU, and $N_{pos}$ denotes the number of positive-sample RoIs.
CN202110268913.1A 2021-03-12 2021-03-12 Target detection method based on feature enhancement and IoU perception Active CN112949635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110268913.1A CN112949635B (en) 2021-03-12 2021-03-12 Target detection method based on feature enhancement and IoU perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110268913.1A CN112949635B (en) 2021-03-12 2021-03-12 Target detection method based on feature enhancement and IoU perception

Publications (2)

Publication Number Publication Date
CN112949635A CN112949635A (en) 2021-06-11
CN112949635B true CN112949635B (en) 2022-09-16

Family

ID=76229263

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110268913.1A Active CN112949635B (en) 2021-03-12 2021-03-12 Target detection method based on feature enhancement and IoU perception

Country Status (1)

Country Link
CN (1) CN112949635B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658199B * 2021-09-02 2023-11-03 China University of Mining and Technology Regression correction-based chromosome instance segmentation network
CN116340807B * 2023-01-10 2024-02-13 National University of Defense Technology Broadband Spectrum Signal Detection and Classification Network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830192A (en) * 2018-05-31 2018-11-16 珠海亿智电子科技有限公司 Vehicle and detection method of license plate under vehicle environment based on deep learning
CN110163207A (en) * 2019-05-20 2019-08-23 福建船政交通职业学院 One kind is based on Mask-RCNN ship target localization method and storage equipment
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN111079739A (en) * 2019-11-28 2020-04-28 长沙理工大学 Multi-scale attention feature detection method
CN111767799A (en) * 2020-06-01 2020-10-13 重庆大学 Improved down-going human target detection algorithm for fast R-CNN tunnel environment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108830192A (en) * 2018-05-31 2018-11-16 珠海亿智电子科技有限公司 Vehicle and detection method of license plate under vehicle environment based on deep learning
CN110163207A (en) * 2019-05-20 2019-08-23 福建船政交通职业学院 One kind is based on Mask-RCNN ship target localization method and storage equipment
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN111079739A (en) * 2019-11-28 2020-04-28 长沙理工大学 Multi-scale attention feature detection method
CN111767799A (en) * 2020-06-01 2020-10-13 重庆大学 Improved down-going human target detection algorithm for fast R-CNN tunnel environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Mask R-CNN with Pyramid Attention Network for Scene Text Detection; Zhida Huang et al.; arXiv; 2018-11-22; full text *
Mask Scoring R-CNN; Zhaojin Huang et al.; arXiv; 2019-03-01; full text *
Oral leukoplakia segmentation based on Mask R-CNN with a spatial attention mechanism; Xie Fei et al.; Journal of Northwest University (Natural Science Edition); 2020-01-09 (No. 01); full text *

Also Published As

Publication number Publication date
CN112949635A (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN108961235B (en) Defective insulator identification method based on YOLOv3 network and particle filter algorithm
CN108564097B (en) Multi-scale target detection method based on deep convolutional neural network
CN112418236B (en) Automobile drivable area planning method based on multitask neural network
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN111767944B (en) Single-stage detector design method suitable for multi-scale target detection based on deep learning
CN111368769B (en) Ship multi-target detection method based on improved anchor point frame generation model
CN112949635B (en) Target detection method based on feature enhancement and IoU perception
CN112418108B (en) Remote sensing image multi-class target detection method based on sample reweighing
CN112364931A (en) Low-sample target detection method based on meta-feature and weight adjustment and network model
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN106845458B (en) Rapid traffic sign detection method based on nuclear overrun learning machine
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN112883934A (en) Attention mechanism-based SAR image road segmentation method
CN111259796A (en) Lane line detection method based on image geometric features
CN112784757B (en) Marine SAR ship target significance detection and identification method
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN115620180A (en) Aerial image target detection method based on improved YOLOv5
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN114170230B (en) Glass defect detection method and device based on deformable convolution and feature fusion
CN112686233B (en) Lane line identification method and device based on lightweight edge calculation
CN113420648A (en) Target detection method and system with rotation adaptability
CN113326734A (en) Rotary target detection method based on YOLOv5
CN115019201B (en) Weak and small target detection method based on feature refinement depth network
CN114913504A (en) Vehicle target identification method of remote sensing image fused with self-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant