CN112949635B

CN112949635B - Target detection method based on feature enhancement and IoU perception

Info

Publication number: CN112949635B
Application number: CN202110268913.1A
Authority: CN
Inventors: 马波; 安骄阳; 刘龙耀
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2022-09-16
Anticipated expiration: 2041-03-12
Also published as: CN112949635A

Abstract

The invention relates to a target detection method based on feature enhancement and IoU perception, and belongs to the field of computer vision target detection. The method utilizes the spatial information of the convolution feature map in the RoI classification regression network to improve the accuracy of target classification and positioning, utilizes the attention mechanism to restrain the background information in the RoI feature and enhance the semantic information in the RoI feature, utilizes the IoU re-grading strategy to increase the correlation between the classification score and the confidence coefficient of the bounding box, and reserves the high-quality bounding box. The method can effectively utilize the spatial information in the characteristic diagram and effectively improve the classification and positioning capabilities of the target detection model through the RoI classification regression branch network; the features in the RoI classification regression branch network are enhanced through the semantic segmentation branch network and the attention mechanism at the boundary box level; through IoU prediction branch network and the strategy of re-scoring the category score by using prediction IoU, the relevance between the category score of the target and the confidence coefficient of the bounding box is improved, and the positioning accuracy of the bounding box is effectively improved.

Description

Target detection method based on feature enhancement and IoU perception

Technical Field

The invention relates to a target detection method based on feature enhancement and IoU (interaction over Union) perception, and belongs to the field of computer vision target detection.

Background

The target detection is used as a basic task in the field of computer vision, and is widely applied in the fields of aerospace, robot navigation, intelligent video monitoring and the like. In recent years, as the target detection algorithm based on deep learning develops, two types of target detection frameworks are gradually formed: a one-stage object detector, a two-stage object detector. Wherein, the speed of the one-stage detector is high, but the detection precision is relatively low. The two-stage detector performs two-time classification and boundary box coordinate regression on the target, so that the detection precision of the algorithm is generally higher, and the method is relatively more widely applied in the industry.

In the two-stage target detection algorithm, a region suggestion network is generally used to perform two-classification and bounding box coordinate regression on a large number of preset anchor points in an image in the first stage, and a group of regions of interest (RoI) with potential targets is output. And in the second stage, the region of interest is subjected to multi-classification and coordinate regression on the bounding box by the RoI classification regression network, and a final detection result is obtained after post-processing. When the coordinates of the bounding box predicted by the regional proposed network are inaccurate, part of background information may exist in the RoI feature map generated by the RoI classification regression network, so that the classification and positioning accuracy is affected.

The above-mentioned difficult problem in the target detection makes the current two-stage target detection technology have the following defects:

1. the existing algorithm generally considers improving the feature expression capability of a feature extraction network, but neglects a feature enhancement method aiming at the RoI classification and positioning task in target detection.

2. When the existing algorithm classifies and regresses the RoI, the utilization of spatial information is lacked, and inherent structural information in a characteristic diagram is not fully utilized.

3. Existing algorithms typically use non-maxima suppression algorithms to remove redundant target boxes during post-processing of the detection algorithm. However, in the non-maximum suppression algorithm, the confidence of the localization may be represented by using the category score of the target box, which may cause the bounding box with a lower category score but accurate localization to be suppressed, thereby affecting the performance of the detection algorithm.

Therefore, how to overcome the defects of the existing target detection algorithm and realize efficient and robust target detection is an urgent technical problem to be solved.

Disclosure of Invention

The invention aims to provide a target detection method based on feature enhancement and IoU perception, aiming at overcoming the defects in the prior art and effectively solving the defects and problems of the two-stage target detection technology.

The innovation points of the invention are as follows: the method comprises the steps of firstly utilizing spatial information of a convolution feature map in a RoI classification regression network to improve the accuracy of target classification and positioning, utilizing an attention mechanism to restrain background information in the RoI feature and enhance semantic information in the RoI feature, utilizing an IoU re-grading strategy to increase the correlation between a category score and a bounding box confidence coefficient, and keeping a high-quality bounding box.

The invention is realized by adopting the following technical scheme:

a target detection method based on feature enhancement and IoU perception comprises the following steps:

step 1: and acquiring a target detection data set, and carrying out preprocessing operation on the image to form a training data set.

Specifically, in step 1, the image preprocessing operation specifically includes:

step 1.1: scaling the short side of the input image to 600 pixels;

step 1.2: and (4) randomly and horizontally turning the image to amplify the data.

Step 2: and constructing a target detection network based on feature enhancement and IoU perception based on a two-stage target detection network, namely, Faster R-CNN.

Specifically, the step 2 includes the steps of:

step 2.1: and (3) constructing a trunk feature extraction network, inputting the trunk feature extraction network into a preprocessed image, and outputting the image as a feature map of the image.

The main feature extraction network can be any convolution network, such as VGG-16, ResNet, and the like.

Step 2.2: and after the trunk feature extraction network in the step 2.1 is obtained, a RoI pooling network is built, and a plurality of regions of interest RoI of the output feature map in the step 2.1 are obtained.

Wherein the RoI pooling algorithm uses RoI Align to extract the RoI feature from the last feature map in the conv4_ x module of the ResNet network described in step 2.1.

Step 2.3: and after the RoI pooling network is obtained in the step 2.2, a RoI classification regression branch network is built, the features of the plurality of RoIs obtained in the step 2.2 are extracted, the classification score and the position of the boundary frame of each RoI are predicted, and a final target detection result is output.

For the RoI classification regression branch network described in step 2.3, the feature map size of the RoI Align output is 7 × 7.

In particular, said step 2.3 comprises the following steps:

step 2.3.1: the RoI classification regression branch network performs feature extraction by using two continuous padding-0 3 × 3 convolutions, an output feature map is marked as X, and the size of the feature map is 3 × 3 × 512;

step 2.3.2: carrying out dense prediction on the feature map X, and classifying the feature vectors of each position in turn, wherein the formula is as follows:

S _i ＝σ(φ _θ (X _i )) (1)

wherein i is the position number in the feature map, X _i Is the ith feature vector. Phi is a _θ () For the feature vector classification function, a 1 × 1 convolutional layer implementation is used. σ () is a softmax operation for outputting a class score vector S of length K +1 _i And K is the number of categories in the training data set.

In the training stage, the category of each feature vector is the same as the RoI label where the feature vector is located, and the classification task uses a cross entropy loss function to calculate.

In the testing stage, firstly, the prediction score of each position on the feature map X is calculated, the category score of the RoI is the mean value S of all the position category scores, and the calculation formula is as follows:

step 2.3.3: in the case of the bounding box regression, the coordinates of the center point of the bounding box, which are different from the Faster R-CNN regression, and the width and the high scaling ratio (t) of the bounding box _x ,t _y ,t _w ,t _h ) Instead, the coordinates of each edge of the bounding box are independently regressed, and the bounding box coordinate parameterization process is as follows:

t _x1 ＝(x ₁ -x _1a )/w _a ,t _x2 ＝(x ₂ -x _2a )/w _a

t _y1 ＝(y ₁ -y _1a )/h _a ,t _y2 ＝(y ₂ -y _2a )/h _a

wherein (x) ₁ ,y ₁ ,x ₂ ,y ₂ ) Coordinates representing the left, upper, right, and lower four sides of the predicted bounding box,

is the target value of the regression of the bounding box coordinates, (t) _x1 ,t _y1 ,t _x2 ,t _y2 ) In order to be able to predict the coordinate offset,

is the target value of regression, (x) _1a ,y _1a ,x _2a ,y _2a ) Coordinates, w, representing the left, upper, right, and lower four edges of the anchor point frame _a Indicates the width of the anchor box, h _a Indicating the height of the anchor box.

Step 2.3.4: and regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions.

For each edge, it is regressed using a network whose parameters are not shared, the calculation formula is as follows:

t _x1 ＝φ _θx1 (gmp(X ₀ ,X ₃ ,X ₆ )),

t _y1 ＝φ _θy1 (gmp(X ₀ ,X ₁ ,X ₂ )),

t _x2 ＝φ _θx2 (gmp(X ₂ ,X ₅ ,X ₈ )), (4)

t _y2 ＝φ _θy2 (gmp(X ₆ ,X ₇ ,X ₈ ))

wherein, X _i I ∈ {0,1, …,8} represents the feature vector of the corresponding location on the feature map X, | represents the coordinate regression function, implemented using a 1 × 1 convolutional layer, and gmp (·) represents the global max pooling function.

Step 2.4: and (3) building a semantic segmentation branch network after the RoI pooling network in the step 2.1, building a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step 2.3 by using the extracted semantic segmentation feature map.

Specifically, step 2.4 includes the steps of:

step 2.4.1: and (3) adding segmentation labels at the pixel level to the target detection data set in the step 1.

Specifically, the coordinates of the target frame of the input image are rounded and mapped onto the RoI feature map, pixels falling within the target frame are labeled as positive samples, and the remaining pixels are labeled as negative samples.

Step 2.4.2: step 2.4, the input of the semantic segmentation branching network is a RoI feature map with the size of 14 × 14 × C obtained by a RoI pooling layer, and feature extraction is carried out by using two convolution layers with the size of 3 × 3 to obtain a feature map X with the size of 14 × 14 × 512 _mask To X _mask Activating by using a 3 x 3 convolutional layer and a sigmoid function, and outputting a final RoI partition prediction;

step 2.4.3: using feature maps X _mask The RoI characteristics are enhanced.

Specifically, a feature enhancement module is designed, and the input of the feature enhancement module comprises the RoI features output by the RoI classification regression branch intermediate layer in the step 2.3 and the feature map X output by the semantic segmentation branch intermediate layer in the step 2.4 _mask . For the RoI feature, converting the channel number of the feature map from C to 512 dimensions by using a 1 × 1 convolution; for semantic segmentation feature X _mask Firstly, downsampling the feature map by using a bilinear interpolation algorithm, then performing feature transformation on the downsampled feature map by using 1 × 1 convolution, obtaining an attention map at a pixel level by using a sigmoid function, and finally multiplying the feature map of the RoI branch by the attention map to obtain an enhanced feature map.

For the semantic segmentation branch network described in step 2.4, the feature map size output by the RoI Align is 14 × 14.

Step 2.5: and (3) building IoU a prediction branch network after the RoI pooling network in the step 2.1, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network in the step 2.4, and the output of the prediction branch network is IoU between the predicted RoI and a real target box matched with the predicted RoI.

Specifically, step 2.5 comprises the steps of:

step 2.5.1: step 2.5 the IoU predict the input of the branch network as the middle layer output characteristic X of the semantic segmentation branch network of step 2.4 _mask Using a 1X 1 convolutional layer pair X _mask Carrying out transformation, then obtaining a 512-dimensional feature vector by using global average pooling, finally activating by using a full-connection layer and a sigmoid function, and outputting IoU values predicted for each RoI;

step 2.5.2: in the training phase, only the positive sample RoI participates in IoU training of the prediction branch network;

step 2.5.3: in the testing phase, the classification scores were re-scored using the predicted IoU, the calculation formula being as follows:

wherein,

representing the classification score predicted by the RoI classification score branch network described in step 2.3, S _i ' represents the classification score after the re-scoring, and γ represents the interval [0,1 ]]The hyper-parameter in (a) is,

value IoU representing the IoU predicted branch network prediction at step 2.5.

And step 3: and constructing a loss function, and training the target detection network in the step S2 according to the training data set to obtain a target detection model.

Wherein, the loss function calculation formula is as follows:

L＝L _RPN +αL _cls +βL _loc +δL _seg +ηL _IoU (6)

wherein L represents a multitask loss function; l is _RPN Represents the RPN loss function of Faster R-CNN; l is _cls And L _loc Respectively representing a classification loss function and a position regression loss function of fast R-CNN; l is _seg Representing a semantic segmentation loss function, wherein only a positive sample RoI participates in the training of a semantic segmentation task, and monitoring is performed by using cross entropy loss; l is _IoU Representing IoU a predictive loss function; α, β, δ, η represent the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively, wherein the classification loss weight α is set to

The regression loss weight β and the segmentation loss weight δ are set to 1, and the IoU prediction loss weight η is set to 0.5.

The IoU predicted loss function is specifically as follows:

wherein, I _i The representation of the real IoU is shown,

representing the output value, N, of the IoU predicted branch network in said step 2.5 _pos Representing the number of positive samples RoI.

And 4, step 4: and (3) acquiring a test image, preprocessing the test image (such as size change), and inputting the target detection model acquired in the step (3) to obtain a target classification and positioning result of the test image.

Advantageous effects

Compared with the prior art, the method has the following advantages:

through the RoI classification regression branch network, the spatial information in the feature map can be effectively utilized, and the classification and positioning capabilities of the target detection model are effectively improved; the features in the RoI classification regression branch network are enhanced through the semantic segmentation branch network and the attention mechanism at the boundary box level; through IoU prediction branch networks and a strategy of re-scoring the category scores by using prediction IoU, the relevance between the category scores of the targets and the confidence coefficient of the bounding box is improved, and the positioning accuracy of the bounding box is effectively improved.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a diagram of a feature enhancement and IoU perception based object detection network architecture provided by the present invention;

FIG. 3 is a schematic diagram illustrating mask generation in a semantic segmentation branch network according to the present invention;

FIG. 4 is a schematic diagram of a feature enhancement module based on an attention mechanism according to the present invention.

Detailed Description

The process of the present invention is described in further detail below with reference to the following examples and combinations. The method comprises the following steps:

step S1: acquiring a target detection data set and carrying out preprocessing operation on the image to form a training data set;

the image preprocessing operation steps are as follows:

step S11: scaling the short side of the input image to 600 pixels;

step S12: data augmentation is performed using random horizontal flipping of the image.

Step S2: referring to fig. 2, fig. 2 shows a target detection network based on feature enhancement and IoU perception, which is built based on a two-stage target detection network, fast R-CNN;

the target detection network construction steps based on feature enhancement and IoU perception are as follows:

step S21: constructing a trunk characteristic extraction network, inputting a preprocessed image, and outputting a characteristic graph of the image;

the backbone feature extraction network can be any convolution network, such as VGG-16, ResNet, and the like.

Step S22: constructing a RoI pooling network after the backbone feature extraction network in the step S21, and obtaining a plurality of regions of interest RoI of the output feature map in the step S21;

the method for constructing the RoI pooling network comprises the following steps:

the RoI pooling algorithm uses RoI Align to extract the RoI feature from the last feature map in the conv4_ x module of the ResNet network in step S21;

for the RoI classification regression network branch, the characteristic map size of the RoI Align output is 7 x 7;

for the semantic split branching network, the characteristic graph size of the RoI Align output is 14 x 14.

Step S23: constructing a RoI classification regression branch network after the RoI pooling network in the step S22, performing feature extraction on the RoIs in the step S22, predicting the classification score and the position of a boundary frame of each RoI, and outputting a final target detection result;

the method for constructing the RoI classification regression branch network comprises the following steps:

step S231: the RoI classification regression branch network performs feature extraction by using two continuous padding-0 3X 3 convolutions, an output feature map is marked as X, and the size of the feature map is 3X 512;

step S232: carrying out dense prediction on the feature map X, and classifying the feature vectors of each position in sequence, wherein the calculation formula is as follows:

S _i ＝σ(φ _θ (X _i )) (1)

step S233: in the case of bounding box regression, the coordinates of the center point of the bounding box, which are different from the Faster R-CNN regression, and the scaling (t) of the width and height of the bounding box _x ,t _y ,t _w ,t _h ) In the embodiment of the present invention, the coordinates of each edge of the bounding box are regressed independently, and the parameterization process of the bounding box coordinates is as follows:

t _x1 ＝(x ₁ -x _1a )/w _a ,t _x2 ＝(x ₂ -x _2a )/w _a

t _y1 ＝(y ₁ -y _1a )/h _a ,t _y2 ＝(y ₂ -y _2a )/h _a

Step S234: and regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions. For each edge, it is regressed using a network whose parameters are not shared, the calculation formula is as follows:

t _x1 ＝φ _θx1 (gmp(X ₀ ,X ₃ ,X ₆ )),

t _y1 ＝φ _θy1 (gmp(X ₀ ,X ₁ ,X ₂ )),

t _x2 ＝φ _θx2 (gmp(X ₂ ,X ₅ ,X ₈ )), (4)

t _y2 ＝φ _θy2 (gmp(X ₆ ,X ₇ ,X ₈ ))

Step S24: building a semantic segmentation branch network behind the RoI pooling network in the step S21, building a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step S23 by using the extracted semantic segmentation feature map;

the semantic division branch network construction steps are as follows:

step S241: for the target detection data set of step S1, segmentation labels at the pixel level are added. Referring to fig. 3, fig. 3 illustrates a mask generation process in a semantic segmentation branch network. Specifically, the coordinates of a target frame of the input image are rounded and mapped onto the RoI feature map, pixels falling in the target frame are marked as positive samples, and the rest pixels are marked as negative samples;

step S242: step S24, the input of the semantic segmentation branching network is a RoI feature map with a size of 14 × 14 × C obtained by RoI pooling layers, and feature extraction is performed using two 3 × 3 convolutional layers to obtain a feature map X with a size of 14 × 14 × 512 _mask To X _mask Activating by using a 3 x 3 convolutional layer and a sigmoid function, and outputting a final RoI partition prediction;

step S243: using feature maps X _mask The RoI characteristics are enhanced. Specifically, referring to fig. 4, fig. 4 shows a specific structure of the feature enhancing module. The input of which comprises: the RoI characteristics output by the RoI classification regression branch intermediate layer in the step S23 and the characteristic diagram X output by the semantic segmentation branch intermediate layer in the step S24 _mask . For the RoI feature, converting the channel number of the feature map from C to 512 dimensions by using a 1 × 1 convolution; for semantic segmentation feature X _mask Firstly, downsampling the feature map by using a bilinear interpolation algorithm, then performing feature transformation on the downsampled feature map by using 1 × 1 convolution, then obtaining an attention map at a pixel level by using a sigmoid function, and finally multiplying the feature map of the RoI branch by the attention map to obtain an enhanced feature map.

Step S25: and building IoU a prediction branch network after the RoI pooling network in the step S21, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network in the step S24, and the output of the prediction branch network is IoU between the predicted RoI and a real target box matched with the predicted RoI.

The IoU prediction branch network building steps are as follows:

step S251: step S25 input of IoU prediction branch network is middle layer output characteristic X of the semantic division branch network of step S24 _mask Using a 1X 1 convolutional layer pair X _mask Carrying out transformation, then obtaining a 512-dimensional feature vector by using global average pooling, finally activating by using a full-connection layer and a sigmoid function, and outputting IoU values predicted for each RoI;

step S252: in the training phase, only the positive sample RoI participates in IoU training of the prediction branch network;

step S253: in the testing stage, the classification score is re-scored using the predicted IoU, and the calculation formula is as follows:

wherein,

a classification score, S, representing the RoI classification score Branch network prediction of step S23 _i ' represents the classification score after the re-scoring, and γ represents the interval [0,1 ]]The hyper-parameter in (a) is,

the value of IoU representing the IoU predicted branch network prediction described in step S25.

Step S3: and constructing a loss function, and training the target detection network in the step S2 according to the training data set to obtain a target detection model.

Wherein, the loss function calculation formula is as follows:

L＝L _RPN +αL _cls +βL _loc +δL _seg +ηL _IoU (6)

wherein L represents a multitask loss function; l is _RPN Represents the RPN loss function of Faster R-CNN; l is _cls And L _loc Respectively representing a classification loss function and a position regression loss function of the Faster R-CNN; l is _seg Representing a semantic segmentation loss function, wherein only a positive sample RoI participates in the training of a semantic segmentation task, and monitoring is performed by using cross entropy loss; l is _IoU Representing IoU a predictive loss function; α, β, δ, η represent the classification loss weight, regression loss weight, semantic segmentation loss weight and IoU prediction loss weight, respectively, wherein the classification loss weight α is set to

The IoU predicted loss function is specifically as follows:

wherein, I _i The representation of the real IoU is shown,

represents the stepIoU predicts the output value, N, of the branch network in step S25 _pos Representing the number of positive samples RoI.

Step S4: the test image is acquired, and is preprocessed (e.g., size changed), and then the target detection model obtained in step S3 is input, so as to obtain the target classification and positioning result of the test image.

Claims

1. A target detection method based on feature enhancement and IoU perception is characterized by comprising the following steps:

step 1: acquiring a target detection data set, and carrying out preprocessing operation on the image to form a training data set;

and 2, step: building a target detection network based on feature enhancement and IoU perception based on a two-stage target detection network Faster R-CNN;

the method comprises the following steps:

step 2.1: constructing a trunk characteristic extraction network, inputting a preprocessed image, and outputting a characteristic graph of the image;

step 2.2: after the trunk feature extraction network in the step 2.1, a RoI pooling network is built to obtain a plurality of regions of interest RoI of the output feature map in the step 2.1;

step 2.3: after the RoI pooling network is obtained in the step 2.2, a RoI classification regression branch network is built, the features of the RoIs obtained in the step 2.2 are extracted, the classification score and the position of a boundary frame of each RoI are predicted, and a final target detection result is output;

step 2.4: establishing a semantic segmentation branch network after the RoI pooling network in the step 2.2, establishing a feature enhancement module according to an attention mechanism, and enhancing the RoI features in the step 2.3 by using the extracted semantic segmentation feature map, wherein the method comprises the following steps:

step 2.4.1: for the target detection data set in the step 1, adding segmentation labels at the pixel level;

rounding and mapping the coordinates of a target frame of the input image to a RoI characteristic diagram, marking pixels falling in the target frame as positive samples, and marking the rest pixels as negative samples;

step 2.4.3: using feature maps X _mask The RoI characteristic is enhanced, and the method specifically comprises the following steps:

designing a characteristic enhancement module, wherein the input of the characteristic enhancement module comprises the RoI characteristic output by the RoI classification regression branch intermediate layer in the step 2.3 and the characteristic graph X output by the semantic segmentation branch intermediate layer in the step 2.4 _mask (ii) a For the RoI feature, converting the channel number of the feature map from C to 512 dimensions by using a 1 × 1 convolution; for semantic segmentation feature X _mask Firstly, downsampling the image by using a bilinear interpolation algorithm, then performing feature transformation on the image by using 1 × 1 convolution, obtaining an attention map at a pixel level by using a sigmoid function, and finally multiplying the feature map of the RoI branch by the attention map to obtain an enhanced feature map;

step 2.5: building IoU a prediction branch network after the RoI pooling network of step 2.2, wherein the input of the prediction branch network is the semantic segmentation feature map extracted by the semantic segmentation branch network of step 2.4, and the output is IoU between the predicted RoI and a real target box matched with the predicted RoI;

and 3, step 3: constructing a loss function, training the target detection network in the step S2 according to a training data set, and obtaining a target detection model;

and 4, step 4: and (3) obtaining a test image, preprocessing the test image, and inputting the target detection model obtained in the step (3) to obtain a target classification and positioning result of the test image.

2. The object detection method based on feature enhancement and IoU perception as claimed in claim 1, wherein in step 1, the image preprocessing operation comprises the following steps:

step 1.1: scaling the short side of the input image to 600 pixels;

3. The feature enhancement and IoU perception-based target detection method of claim 1, wherein in step 2.2, the RoI pooling algorithm uses roilign.

4. The object detection method based on feature enhancement and IoU perception of claim 1, wherein, for the RoI classification regression branch network of step 2.3, the feature map size of RoI Align output is 7 x 7.

5. The object detection method based on feature enhancement and IoU perception according to claim 1, wherein the step 2.3 includes the following steps:

S _i ＝σ(φ _θ (X _i )) (1)

wherein i is the position number in the feature map, X _i Is the ith feature vector; phi is a _θ () For the feature vector classification function, a 1 × 1 convolutional layer is used for implementation; σ () is a softmax operation for outputting a class score vector S of length K +1 _i Wherein K is the number of categories in the training data set;

in the training stage, the category of each feature vector is the same as the RoI label where the feature vector is located, and the classification task is calculated by using a cross entropy loss function;

is the target value of regression, (x) _1a ,y _1a ,x _2a ,y _2a ) Coordinates, w, representing the left, upper, right, and lower four edges of the anchor point frame _a Indicates the width of the anchor box, h _a Representing the height of the anchor box;

step 2.3.4: regressing the coordinates of the corresponding edges by using the characteristics of the characteristic positions;

wherein, X _i I ∈ {0,1, …,8} represents the feature vector for the corresponding location on the feature map X, φ _| () Expressing a coordinate regression function using one1 × 1 convolutional layer implementation, gmp (·) represents a global max pooling function.

6. The method of claim 1, wherein the size of the feature map output by the RoI Align is 14 x 14 for the semantic segmentation branch network of step 2.4.

7. The object detection method based on feature enhancement and IoU perception according to claim 1, wherein the step 2.5 includes the steps of:

wherein,

represents the classification score, S, predicted by the RoI classification regression branch network described in step 2.3 _i ' represents the classification score after the re-scoring, and γ represents the interval [0,1 ]]The hyper-parameter in (a) is,

value IoU representing the IoU predicted branch network prediction at step 2.5.

8. The object detection method based on feature enhancement and IoU perception as claimed in claim 1, wherein in step 3, the loss function is calculated as follows:

L＝L _RPN +αL _cls +βL _loc +δL _seg +ηL _IoU (6)

The regression loss weight β and the segmentation loss weight δ are set to 1, and the IoU prediction loss weight η is set to 0.5;

the IoU predicted loss function is specifically as follows:

wherein, I _i The representation of the real IoU is shown,