CN112101277B - Remote sensing target detection method based on image semantic feature constraint - Google Patents

Remote sensing target detection method based on image semantic feature constraint

Info

Publication number
CN112101277B
CN112101277B (application number CN202011018965.5A)
Authority
CN
China
Prior art keywords
feature
frame
remote sensing
image
network
Prior art date
Legal status
Active
Application number
CN202011018965.5A
Other languages
Chinese (zh)
Other versions
CN112101277A (en)
Inventor
孙斌
马付严
李树涛
孙俊
Current Assignee
Hunan University
Fujitsu Ltd
Original Assignee
Hunan University
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Hunan University and Fujitsu Ltd
Priority to CN202011018965.5A
Publication of CN112101277A
Application granted
Publication of CN112101277B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing target detection method based on image semantic feature constraints, which comprises the following steps: features of an input image are extracted with a deep residual network ResNet50 and a feature pyramid network and fused into a multi-scale feature map, which is input to a center estimation module; the output and input of the center estimation module are combined to obtain an image semantic feature map with negative samples filtered out. The extracted image semantic features constrain the generation of anchor points in arbitrary directions: a rotation candidate region generation network extracts candidate regions from the image semantic feature map after the negative samples are filtered, and a rotated region-of-interest pooling layer extracts a feature vector of uniform size for each candidate region. Two fully connected layer branches then complete the classification and regression tasks respectively, yielding the detection result and detected position of each candidate region in the input remote sensing image. The invention greatly reduces the computational cost and improves the detection speed and accuracy.

Description

Remote sensing target detection method based on image semantic feature constraint
Technical Field
The invention relates to an image target detection method, and in particular to a remote sensing target detection method based on image semantic feature constraints.
Background
The need for intelligent transportation and earth observation has drawn great attention to vehicle detection in remote sensing images. The task is to identify the class of each vehicle and accurately locate it in the remote sensing image. Despite the many efforts devoted to this task, vehicle detection remains very challenging because vehicles in remote sensing images vary widely in size and appearance. Detecting vehicles with arbitrary orientations is particularly difficult, since directly applying horizontal object detection methods often produces regions of interest (RoIs) that do not match the vehicle regions, thereby greatly expanding the search space.
Shaoqing Ren et al., in "Faster R-CNN: Towards real-time object detection with region proposal networks" (Advances in Neural Information Processing Systems, 2015, pp. 91-99), showed that presetting anchors (initially estimated object borders) of different sizes and aspect ratios and regressing the object positions in the image from these anchors is effective on open benchmarks. Most arbitrary-direction target detection methods adopt the same strategy. Taking the rotation candidate region generation network described by Jianqi Ma et al. in "Arbitrary-oriented scene text detection via rotation proposals" (IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018) as an example, rotation candidate regions (a set of candidate frames) are generated from anchors with angles, and the target positions are regressed based on these rotation candidate regions. Anchor-based detection algorithms perform well, but they usually start from a large number of densely distributed anchors; in the model training stage, the intersection-over-union of the real frame and each predicted frame must be computed so that predicted-frame negative samples whose intersection ratio falls below a threshold can be removed, which incurs a large computational cost. The anchor-free detection methods described by Hei Law and Jia Deng in "CornerNet: Detecting objects as paired keypoints" (Proceedings of the European Conference on Computer Vision, 2018, pp. 734-750) and by Kaiwen Duan et al. in "CenterNet: Keypoint triplets for object detection" (Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569-6578) predict bounding boxes from key points rather than from anchors of predetermined size and aspect ratio. However, since only key points are used to predict bounding boxes, the recall of anchor-free detection methods is lower than that of anchor-based methods.
Disclosure of Invention
The invention aims to solve the following technical problem: aiming at the problems in the prior art, the invention provides a remote sensing target detection method based on image semantic feature constraints, which uses semantic feature information in the image to constrain the generation of anchor points.
In order to solve the technical problems, the invention adopts the following technical scheme:
the remote sensing target detection method based on image semantic feature constraint is characterized by comprising the following steps of:
step 1): carrying out feature extraction on an input image by adopting a depth residual error network ResNet50 and a feature pyramid network, and fusing to obtain a multi-scale feature map fused with multi-scale information;
step 2): filtering the negative samples of the multi-scale feature images obtained through fusion through a center estimation module, and combining the center feature images output by the center estimation module and the feature images input by the center estimation module to filter the negative samples, so as to obtain image semantic feature images with the negative samples filtered;
step 3): restricting the generation of anchor points in any direction by using the extracted image semantic features, generating anchors in the image semantic feature map after negative samples are filtered, generating a network by rotating candidate regions based on the generated anchors to obtain candidate regions, and extracting feature vectors with uniform sizes for each candidate region by rotating a region of interest aggregation layer;
step 4): and aiming at the feature vectors extracted from each candidate region and with uniform size, respectively utilizing two full-connection layer branches to complete classification and regression tasks, and obtaining the detection result and detection position of each candidate region in the input remote sensing image.
Optionally, the detailed steps of step 1) include: Downsampling: the input remote sensing image is downsampled by the deep residual network ResNet50; taking the layers of ResNet50 over which the feature-map size stays constant as one stage, feature maps C2, C3, C4 and C5 of 4 stages and 4 scales are obtained. Upsampling: the feature maps C2, C3, C4 and C5 of 4 scales form a feature pyramid network; the feature map C5 is upsampled 2 times by bilinear interpolation with its feature dimension fixed to 256 by a 1×1 convolution layer, the feature dimension of the feature map C4 is fixed to 256 by a 1×1 convolution layer, and the two same-sized feature maps are added element-wise to obtain the fused feature map F4; the feature map F4 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C3 is fixed to 256, and the two are added element-wise to obtain the feature map F3; the feature map F3 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C2 is fixed to 256, and the two are added element-wise to obtain the feature map F2 fusing high-order and low-order features, which is output as the feature map fusing multi-scale information.
Optionally, the center estimation module in step 2) consists of a 1×1 convolution layer and an element-wise sigmoid activation layer; it converts the input feature map fusing multi-scale information into a center feature map of the same size that reflects the probability that a positive sample exists at each location, and multiplies the input feature map and the center feature map element-wise, so that in the resulting final feature map the element values of negative-sample regions are close to 0 while those of positive-sample regions are almost unchanged.
Optionally, step 2) is preceded by a step of training the center estimation module, and the branches of the center estimation module are supervised using a Focal Loss function Focal Loss during the training of the center estimation module, wherein a functional expression of the Focal Loss function Focal Loss is as follows:
fl = −(1 − p)^α · log(p)
in the above formula, fl is a function value of a Focal Loss function Focal Loss, p represents a probability that a sample is a positive sample, and α is a coefficient, where the positive sample refers to a sample in which an intersection ratio of a preset anchor point and a real frame in a remote sensing image is higher than a threshold, and the negative sample refers to a sample in which an intersection ratio of the preset anchor point and the real frame in the remote sensing image is lower than the threshold.
Optionally, the rotation candidate region generation network in step 3) includes one 3×3 convolution layer and two 1×1 convolution layers; the feature map is passed through the 3×3 convolution layer to obtain a feature map whose height H and width W are consistent with the input feature map, and this feature map is then passed through the two 1×1 convolution layers respectively to obtain two sets of feature maps containing category information and position information respectively.
Optionally, step 3) is preceded by a step of training the rotation candidate region generation network to generate candidate regions, where the judgment principle is that, when the intersection ratio of a candidate region and the real frame is computed, it must be judged whether the candidate region is a positive sample. A candidate region is a positive sample if it satisfies both of the following: 1) it has the highest intersection ratio with a real frame, or its intersection ratio with a real frame is not less than 0.7; 2) the included angle between it and the real frame is smaller than π/12. A candidate region is a negative sample if it satisfies either of the following: 1) its intersection ratio with every real frame is smaller than 0.3; 2) its intersection ratio with a real frame is larger than 0.7 but the included angle between them is larger than π/12. The tilted intersection ratio is then computed for the candidate regions of all positive and negative samples, and candidate regions that satisfy neither the positive-sample nor the negative-sample condition do not participate in training. The feature vectors of uniform size output by the rotated region-of-interest layer are then input into a fully convolutional network, the rotation candidate region generation network is supervised with the focal loss function, and this process is repeated until the training of the rotation candidate region generation network is finished.
Optionally, in step 4), when the classification and regression tasks are completed by the two fully connected layer branches, the classification task uses the Loss function Softmax Loss to supervise the network during training, and the frame regression task uses the Loss function Smooth L1 Loss. The regression variables are computed as:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a

t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a), t*_θ = θ* − θ_a

In the above formulas, (x, y, w, h, θ) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the predicted target frame; (x_a, y_a, w_a, h_a, θ_a) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the anchor frame; and (x*, y*, w*, h*, θ*) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the real target frame. t = (t_x, t_y, t_w, t_h, t_θ) is the offset of the predicted frame relative to the anchor frame, and t* = (t*_x, t*_y, t*_w, t*_h, t*_θ) is the offset of the real frame relative to the anchor frame. The frame regression task uses the Loss function Smooth L1 Loss to compute the loss between the two offsets:

L_reg(t*, t) = Σ_{i ∈ {x, y, w, h, θ}} smooth_L1(t*_i − t_i)

where L_reg(t*, t) is the total regression loss between the true and predicted offsets, t*_i − t_i is the difference between the true and predicted offsets for each component, and the smooth L1 loss for any x is:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.
in addition, the invention also provides a remote sensing target detection system of the image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and the microprocessor of the computer device is programmed or configured to execute the steps of the remote sensing target detection method of the image semantic feature constraint.
In addition, the invention also provides a remote sensing target detection system of the image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and a computer program which is programmed or configured to execute the remote sensing target detection method of the image semantic feature constraint is stored in the memory of the computer device.
Furthermore, the invention provides a computer readable storage medium having stored therein a computer program programmed or configured to perform the steps of the aforementioned remote sensing target detection method based on image semantic feature constraint.
Compared with the prior art, the invention has the following advantages: 1) The remote sensing target detection method based on image semantic feature constraints extracts features from the input image with a deep residual network ResNet50 and a feature pyramid network and fuses them into a multi-scale feature map, enabling accurate extraction of image features. 2) The invention filters negative samples through a center estimation module, combining the center feature map output by the module with its input feature map to obtain an image semantic feature map with negative samples filtered out. Semantic information filters out anchors with low probability of covering a vehicle region, and only the remaining anchors participate in generating rotation candidate regions, which adds an image semantic feature constraint to the detection method; subsequent computation operates only on the small number of generated candidate regions, so the performance advantage of anchor-based detection is retained while the detection speed is improved. 3) The invention extracts candidate regions from the filtered image semantic feature map through the rotation candidate region generation network and extracts a feature vector of uniform size for each candidate region through the rotated region-of-interest pooling layer; two fully connected layer branches then complete the classification and regression tasks respectively, yielding the detection result and detected position of each candidate region in the input remote sensing image, so that remote sensing target detection under image semantic feature constraints is realized and the detection result and position of each candidate region are obtained simultaneously. In summary, the invention fully exploits the semantic information in the image, greatly reduces the computational cost, and improves both detection speed and accuracy.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a remote sensing target detection network with image semantic feature constraint adopted in an embodiment of the present invention.
FIG. 3 is a coordinate representation of a prior art directional frame.
FIG. 4 is a representation of the directional border of an anchor in an embodiment of the present invention.
Fig. 5 is a schematic representation of the rotation anchors employed in an embodiment of the present invention, wherein (a) shows anchors of different angles and (b) shows an example of the anchors used in this embodiment.
Fig. 6 is a schematic diagram of a process for calculating candidate positive samples according to an embodiment of the present invention.
FIG. 7 is a graph of vehicle test results visualized on a DOTA test dataset in an embodiment of the invention.
Detailed Description
As shown in fig. 1 and fig. 2, the remote sensing target detection method with image semantic feature constraint according to the embodiment of the invention includes the following steps:
step 1): carrying out feature extraction on an input image by adopting a depth residual error network ResNet50 and a feature pyramid network, and fusing to obtain a multi-scale feature map fused with multi-scale information;
step 2): filtering the negative samples of the multi-scale feature images obtained through fusion through a center estimation module, and combining the center feature images output by the center estimation module and the feature images input by the center estimation module to filter the negative samples, so as to obtain image semantic feature images with the negative samples filtered;
step 3): restricting the generation of anchor points in any direction by using the extracted image semantic features, generating anchors in the image semantic feature map after negative samples are filtered, generating a network by rotating candidate regions based on the generated anchors to obtain candidate regions, and extracting feature vectors with uniform sizes for each candidate region by rotating a region of interest aggregation layer;
step 4): and aiming at the feature vectors extracted from each candidate region and with uniform size, respectively utilizing two full-connection layer branches to complete classification and regression tasks, and obtaining the detection result and detection position of each candidate region in the input remote sensing image.
In order to construct the remote sensing target detection network constrained by image semantic features shown in fig. 2, this embodiment uses the public remote sensing target detection dataset DOTA (the largest benchmark dataset with oriented-frame annotations in the remote sensing target detection field) and extracts the real labels corresponding to the class and position of the vehicle targets in the images. Images containing vehicles in the DOTA training set form the training set, and the DOTA test set is used as the test set. The original images in the constructed training set are cut into 1024×1024 sub-images with stride 512; for data amplification, the original images are also rescaled to 0.5 and 1.5 times their size according to the different target scales, and the same cutting operation is applied. The training set constructed in this way contains 106965 images in total, and the test set constructed in the same way contains 74058 images in total. The remote sensing image vehicle detection training and test sets are constructed so as to train and test the remote sensing target detection network constrained by image semantic features shown in fig. 2; a minimal sketch of this preprocessing is given below.
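A minimal sketch of the cropping and multi-scale amplification described above, assuming OpenCV for image handling; the function names and the zero-padding of border crops are illustrative assumptions, not part of the patented method:

```python
import cv2  # assumed dependency; any library with resize/pad works

def crop_with_stride(image, crop=1024, stride=512):
    """Cut an image into crop x crop sub-images with the given stride,
    zero-padding border crops so every sub-image has the full size."""
    h, w = image.shape[:2]
    subs = []
    for y in range(0, max(h - crop, 0) + stride, stride):
        for x in range(0, max(w - crop, 0) + stride, stride):
            sub = image[y:y + crop, x:x + crop]
            if sub.shape[0] < crop or sub.shape[1] < crop:
                sub = cv2.copyMakeBorder(sub, 0, crop - sub.shape[0],
                                         0, crop - sub.shape[1],
                                         cv2.BORDER_CONSTANT, value=0)
            subs.append(((x, y), sub))
    return subs

def multi_scale_crops(image, scales=(0.5, 1.0, 1.5)):
    """Rescale the original image and apply the same cutting at each scale."""
    out = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        out.extend((s, pos, sub) for pos, sub in crop_with_stride(resized))
    return out
```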
The deep residual network ResNet50 and the feature pyramid network FPN are divided into a downsampling process and an upsampling process, finally yielding F2, which fuses high-order and low-order features; the feature map F2 can represent target information of different scales in the remote sensing image. In this embodiment, the detailed steps of step 1) include: Downsampling: the input remote sensing image is downsampled by the deep residual network ResNet50; taking the layers of ResNet50 over which the feature-map size stays constant as one stage, feature maps C2, C3, C4 and C5 of 4 stages and 4 scales are obtained. Upsampling: the feature maps C2, C3, C4 and C5 of 4 scales form the feature pyramid network (FPN); the feature map C5 is upsampled 2 times by bilinear interpolation with its feature dimension fixed to 256 by a 1×1 convolution layer, the feature dimension of the feature map C4 is fixed to 256 by a 1×1 convolution layer, and the two same-sized feature maps are added element-wise to obtain the fused feature map F4; the feature map F4 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C3 is fixed to 256, and the two are added element-wise to obtain the feature map F3; the feature map F3 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C2 is fixed to 256, and the two are added element-wise to obtain the feature map F2 fusing high-order and low-order features, which is output as the feature map fusing multi-scale information.
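The top-down fusion of step 1) can be sketched as follows in PyTorch; the channel widths of C2-C5 follow the standard torchvision ResNet50 and are an assumption for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Fuse C2..C5 into F2: 1x1 convolutions fix the feature dimension to
    256, bilinear 2x upsampling aligns sizes, and element-wise addition
    merges the feature maps of two stages."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), dim=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])

    def up2(self, x):
        return F.interpolate(x, scale_factor=2, mode='bilinear',
                             align_corners=False)

    def forward(self, c2, c3, c4, c5):
        f5 = self.lateral[3](c5)
        f4 = self.lateral[2](c4) + self.up2(f5)   # fused feature map F4
        f3 = self.lateral[1](c3) + self.up2(f4)   # feature map F3
        f2 = self.lateral[0](c2) + self.up2(f3)   # F2 fuses all scales
        return f2
```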
In step 2) of this embodiment, a center estimation module is used to filter the negative samples in the input feature map, increasing the speed of vehicle detection. Because vehicles in remote sensing images are unevenly distributed and vary in scale, the intersection ratio of most preset anchors with the real frames falls below the threshold; anchors whose intersection ratio exceeds the threshold (positive samples) are far fewer, so the ratio of positive to negative samples is extremely unbalanced. To avoid spending a large amount of computation on negative samples during target detection, this embodiment uses a 1×1 convolution layer and an element-wise sigmoid activation layer to convert the feature map extracted by the deep residual network ResNet50 and the feature pyramid network FPN into a center feature map of the same size that reflects the probability that a positive sample exists at each location, and combines the extracted feature map with the center feature map element-wise to obtain an image semantic feature map with negative samples filtered out. Referring to fig. 2, the center estimation module in step 2) consists of a 1×1 convolution layer and an element-wise sigmoid activation layer; it converts the input feature map fusing multi-scale information into a center feature map of the same size reflecting the positive-sample probability and multiplies the input feature map and the center feature map element-wise, so that in the resulting final feature map the element values of negative-sample regions are close to 0 while those of positive-sample regions are almost unchanged; a sketch of this module follows.
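A minimal sketch of the center estimation module as described (one 1×1 convolution and an element-wise sigmoid whose output multiplies the input feature map); class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class CenterEstimation(nn.Module):
    """Produce a center map of the same spatial size whose values reflect
    the probability that a positive sample exists at each location, then
    multiply it element-wise into the input so that negative-sample
    regions are pushed toward 0 and positive regions stay almost unchanged."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat):
        center = torch.sigmoid(self.conv(feat))  # (B, 1, H, W) in (0, 1)
        return feat * center, center             # filtered map, center map
```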
In this embodiment, step 2) further includes a step of training the center estimation module, and the branches of the center estimation module are supervised by using a Focal Loss function Focal Loss in the process of training the center estimation module, where a functional expression of the Focal Loss function Focal Loss is as follows:
fl = −(1 − p)^α · log(p)
In the above formula, fl is the value of the Focal Loss function Focal Loss, p represents the probability that a sample is a positive sample, and α is a coefficient (α is usually taken as 2). A positive sample is a sample whose preset anchor has an intersection ratio with the real frame in the remote sensing image above the threshold, and a negative sample is one whose intersection ratio is below the threshold. For example, when a sample is classified as positive with probability p = 0.9 and α = 2, the factor (1 − p)^α = 0.01, so the sample contributes 100 times more to the ordinary cross-entropy loss than to the Focal Loss; the Focal Loss therefore effectively suppresses the contribution of easy-to-classify samples to the model.
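A minimal sketch of the Focal Loss fl = −(1 − p)^α·log(p); routing negative samples through the probability of the correct class (1 − p) follows the common Focal Loss formulation and is an assumption here:

```python
import torch

def focal_loss(p, is_positive, alpha=2.0, eps=1e-6):
    """fl = -(1 - p_t)^alpha * log(p_t), where p_t is the predicted
    probability of the true class; at p_t = 0.9 and alpha = 2 the factor
    (1 - p_t)^2 = 0.01, i.e. 1/100 of the cross-entropy contribution."""
    p_t = torch.where(is_positive, p, 1.0 - p).clamp(min=eps)
    return -((1.0 - p_t) ** alpha) * torch.log(p_t)
```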
Fig. 3 shows the labeling of a directional frame in the prior coordinate method. When anchor points in arbitrary directions are generated in the image semantic feature map after negative samples are filtered in step 3), this embodiment represents the oriented frame of an anchor differently, as shown in fig. 4: a tuple (x, y, w, h, θ) of 5 elements, where θ takes values in [−π/2, π/2), and a frame whose angle falls outside this range is moved back into it in the opposite direction.
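A small sketch of this angle normalization; shifting θ by π leaves the drawn rectangle unchanged, so the 5-element tuple stays equivalent:

```python
import math

def normalize_box(x, y, w, h, theta):
    """Move theta back into [-pi/2, pi/2) in the opposite direction;
    (w, h, theta) and (w, h, theta - pi) describe the same rectangle."""
    while theta >= math.pi / 2:
        theta -= math.pi
    while theta < -math.pi / 2:
        theta += math.pi
    return (x, y, w, h, theta)
```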
The rotation candidate regions are obtained by the rotation candidate region generation network based on the anchors in arbitrary directions. In this embodiment, the rotation candidate region generation network in step 3) includes one 3×3 convolution layer and two 1×1 convolution layers: the feature map is first passed through the 3×3 convolution layer to obtain a feature map whose height H and width W are consistent with the input feature map, and this feature map is then passed through the two 1×1 convolution layers respectively to obtain two sets of feature maps containing category information and position information respectively. The anchors used in this embodiment are shown in fig. 5: the anchors with angle information are composed of 2 aspect ratios (1:2 and 1:4) and 6 angles (−π/2, −π/3, −π/6, 0, π/6, π/3; see sub-graph (a) of fig. 5), so each element in the feature map of the rotation candidate region generation network corresponds to 2 × 6 = 12 anchors; a sketch of the anchor generation and the network head follows.
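A minimal sketch of the 2 × 6 = 12 rotated anchors and of the head of the rotation candidate region generation network; the base anchor size and the two-class score layout are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

ANGLES = (-math.pi/2, -math.pi/3, -math.pi/6, 0.0, math.pi/6, math.pi/3)

def anchors_per_location(base=32.0):
    """12 rotated anchors (w, h, theta) for one feature-map element:
    2 aspect ratios (1:2 and 1:4) x 6 angles; `base` is assumed."""
    anchors = []
    for rw, rh in ((1, 2), (1, 4)):
        s = base / math.sqrt(rw * rh)  # keep anchor area near base^2
        anchors.extend((rw * s, rh * s, a) for a in ANGLES)
    return anchors

class RotationRPNHead(nn.Module):
    """3x3 conv keeps H and W; two 1x1 convs then emit, per anchor, the
    category scores and the 5 position offsets (tx, ty, tw, th, t_theta)."""
    def __init__(self, dim=256, num_anchors=12):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(dim, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(dim, num_anchors * 5, kernel_size=1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)  # category map, position map
```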
As shown in fig. 6, step 3) further includes a step of training the rotation candidate region generation network to generate candidate regions, where the judgment principle is that, when the intersection ratio of a candidate region and the real frame is computed, it must be judged whether the candidate region is a positive sample. A candidate region is a positive sample if it satisfies both of the following: 1) it has the highest intersection ratio with a real frame, or its intersection ratio with a real frame is not less than 0.7; 2) the included angle between it and the real frame is smaller than π/12. A candidate region is a negative sample if it satisfies either of the following: 1) its intersection ratio with every real frame is smaller than 0.3; 2) its intersection ratio with a real frame is larger than 0.7 but the included angle between them is larger than π/12. The tilted intersection ratio is then computed for the candidate regions of all positive and negative samples, and candidate regions that satisfy neither the positive-sample nor the negative-sample condition do not participate in training. The feature vectors of uniform size output by the rotated region-of-interest layer are then input into a fully convolutional network, the rotation candidate region generation network is supervised with the focal loss function, and this process is repeated until the training of the rotation candidate region generation network is finished.
In this embodiment, the intersection ratio of a candidate region and the real frame is calculated as follows:
S1) input the candidate regions and real frames R_1, R_2, R_3, …;
S2) traverse and select any pair of a candidate region and a real frame (R_i, R_j) with i < j as the current rectangular frame pair; if the traversal is finished, end and exit; otherwise jump to step S3);
S3) set the point set PSet to the empty set;
S4) add the intersection points of rectangular frame R_i and rectangular frame R_j to the point set PSet;
S5) add the vertices of rectangular frame R_i that lie inside rectangular frame R_j to the point set PSet;
S6) add the vertices of rectangular frame R_j that lie inside rectangular frame R_i to the point set PSet;
S7) sort the point set PSet counterclockwise;
S8) compute the intersection region I from the point set PSet by triangulation;
S9) compute the intersection ratio IoU[i, j] of (R_i, R_j) as:
IoU[i, j] = Area(I) / (Area(R_i) + Area(R_j) − Area(I))
where Area(I) is the area where the candidate region and the real frame intersect, Area(R_i) is the area of the candidate region, and Area(R_j) is the area of the real frame;
S10) jump back to step S2).
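For illustration, the same tilted intersection ratio can be computed with a general polygon library; the use of shapely is an assumption here — the procedure above (intersection points, contained vertices, counterclockwise sorting, triangulation) computes the same quantity without it:

```python
import math
from shapely.geometry import Polygon  # assumed dependency

def corners(x, y, w, h, theta):
    """Four corners of an oriented frame (x, y, w, h, theta)."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]
    return [(x + c*px - s*py, y + s*px + c*py) for px, py in pts]

def rotated_iou(box_a, box_b):
    """IoU[i, j] = Area(I) / (Area(R_i) + Area(R_j) - Area(I))."""
    pa, pb = Polygon(corners(*box_a)), Polygon(corners(*box_b))
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter) if inter > 0 else 0.0
```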
Referring to fig. 2, a center mask segmentation module is used in this embodiment to improve the accuracy of the model. Considering the speed of vehicle detection, this embodiment uses only one 1×1 convolution layer and one element-wise sigmoid activation layer in the center estimation module, rather than the deep networks commonly used in semantic segmentation. To obtain a better estimation effect, this embodiment uses the center mask module to constrain the extraction of vehicle position information during the training phase. The feature vector of uniform size output by the rotated region-of-interest layer is input into a fully convolutional network, and the network is supervised with the focal loss function. The fully convolutional network makes the network attend to every pixel of the image and classify each pixel, and the focal loss computes the loss of each pixel so that the network focuses on difficult samples while the influence of simple samples is reduced, improving the accuracy of the model.
In this embodiment, when the classification and regression tasks in step 4) are completed by the two fully connected layer branches, the classification task uses the Loss function Softmax Loss to supervise the network during training, and the frame regression task uses the Loss function Smooth L1 Loss. The regression variables are computed as:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a

t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a), t*_θ = θ* − θ_a

In the above formulas, (x, y, w, h, θ) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the predicted target frame; (x_a, y_a, w_a, h_a, θ_a) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the anchor frame; and (x*, y*, w*, h*, θ*) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the real target frame. t = (t_x, t_y, t_w, t_h, t_θ) is the offset of the predicted frame relative to the anchor frame, and t* = (t*_x, t*_y, t*_w, t*_h, t*_θ) is the offset of the real frame relative to the anchor frame. The frame regression task uses the Loss function Smooth L1 Loss to compute the loss between the two offsets:

L_reg(t*, t) = Σ_{i ∈ {x, y, w, h, θ}} smooth_L1(t*_i − t_i)

where L_reg(t*, t) is the total regression loss between the true and predicted offsets, t*_i − t_i is the difference between the true and predicted offsets for each component, and the smooth L1 loss for any x is:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.
Fig. 7 shows the visualized vehicle detection results on the DOTA test dataset, wherein the box marked A is a detected truck and the remaining boxes are cars.
The initialization parameters of the deep residual network ResNet50 used in this embodiment are inherited from a model pre-trained on the ImageNet dataset; the initial learning rate of the training model is set to 0.01, and training runs for 12 epochs in total. After the 8th and the 11th epoch, the learning rate drops to 1/10 of that of the previous stage. The constructed remote sensing image vehicle detection training set is input into the constructed remote sensing target detection model constrained by image semantic features, and the model is trained with the SGD optimization algorithm; after 12 epochs, the trained remote sensing image vehicle detection model is obtained. The objective of the test phase is to obtain the position and class information of the vehicles in each image; the center mask segmentation module contained in the trained model is not required at test time, so the position and class of each vehicle are obtained only by regression and classification respectively. In the test phase, only predicted frames with scores higher than 0.3 are selected as vehicle frames, and non-maximum suppression with a threshold of 0.5 is applied to delete duplicates; a sketch of this schedule and filtering follows.
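A minimal sketch of this training schedule and test-time filtering; momentum, batching, and the rotated NMS routine are assumptions for illustration:

```python
import torch

def make_optimizer(model):
    """SGD with initial lr 0.01; lr drops to 1/10 of the previous stage
    after epochs 8 and 11 of the 12 total epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[8, 11],
                                                 gamma=0.1)
    return opt, sched

def filter_detections(boxes, scores, rotated_nms, score_thr=0.3, nms_thr=0.5):
    """Keep predicted frames with score > 0.3, then apply NMS with
    threshold 0.5 to delete duplicates; rotated_nms is any routine that
    understands oriented boxes (assumed provided)."""
    keep = scores > score_thr
    return rotated_nms(boxes[keep], scores[keep], nms_thr)
```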
Table 1 shows the quantitative evaluation results of the remote sensing target detection method with image semantic feature constraints of this embodiment and of other methods, where FR-O denotes the Faster R-CNN OBB detector, the official benchmark provided by DOTA. Mode 1 of the method of this embodiment denotes an example with only the image semantic feature constraint (without the center mask segmentation module); mode 2 denotes an example with both the image semantic feature constraint and the center mask segmentation module; mode 3 denotes an example that additionally uses anchors of multiple aspect ratios on the basis of mode 2. In testing, modes 1-3 of the method of this embodiment are superior to the other methods in mean average precision (mAP) and time cost. The method of this embodiment achieves a mAP of 76.9%, which is 40.2% higher than the official benchmark, and its vehicle detection performance is 6.2% higher than that of the rotation candidate region generation network alone. Mode 2, with the center mask segmentation module, improves the mAP by 3.4% over the method without this branch. The time costs of modes 1 to 3 are also reported in Table 1, with the best results in each case highlighted in bold. SV denotes the average precision of small vehicle detection, LV the average precision of large vehicle detection, and mAP the mean of the average precision over all classes.
Table 1: Quantitative evaluation results (average precision and run time) of the different methods.
In summary, directly applying a conventional horizontal anchor-based detection method to vehicle detection in arbitrary directions typically yields poor performance. Although rotation anchors have been used to address this problem, such designs incur significant computational cost because thousands of rotation anchors are generated in each level of the feature map. To solve this problem, this embodiment provides a remote sensing target detection method with image semantic feature constraints, which uses semantic information to filter out anchors with low probability of covering a vehicle region before the model computes and compares them; only anchors in arbitrary directions that survive this filtering participate in generating rotation candidate regions, and subsequent computation operates only on the small number of generated candidate regions, so the performance advantage of anchor-based detection is retained while the detection speed is improved. In general, this embodiment fully exploits the semantic information in the image, greatly reduces the computational cost, and improves the detection speed and accuracy.
In addition, the embodiment also provides a remote sensing target detection system of image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and the microprocessor of the computer device is programmed or configured to execute the steps of the remote sensing target detection method of the image semantic feature constraint.
In addition, the embodiment also provides a remote sensing target detection system of image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and a computer program programmed or configured to execute the remote sensing target detection method of image semantic feature constraint is stored in the memory of the computer device.
In addition, the present embodiment also provides a computer readable storage medium having stored therein a computer program programmed or configured to perform the steps of the aforementioned remote sensing target detection method based on image semantic feature constraint.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is directed to methods, apparatus (systems), and computer program products in accordance with embodiments of the present application, and to apparatus for performing functions specified in a flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (7)

1. The remote sensing target detection method based on image semantic feature constraint is characterized by comprising the following steps of:
step 1): carrying out feature extraction on an input image by adopting a depth residual error network ResNet50 and a feature pyramid network, and fusing to obtain a multi-scale feature map fused with multi-scale information;
step 2): filtering the negative samples of the multi-scale feature images obtained through fusion through a center estimation module, and combining the center feature images output by the center estimation module and the feature images input by the center estimation module to filter the negative samples, so as to obtain image semantic feature images with the negative samples filtered;
step 3): restricting the generation of anchor points in any direction by using the extracted image semantic features, generating anchors in the image semantic feature map after negative samples are filtered, generating a network by rotating candidate regions based on the generated anchors to obtain candidate regions, and extracting feature vectors with uniform sizes for each candidate region by rotating a region of interest aggregation layer;
step 4): aiming at the feature vectors extracted from each candidate region and with uniform size, the classification and regression tasks are completed by utilizing two full-connection layer branches respectively, and the detection result and detection position of each candidate region in the input remote sensing image are obtained;
the detailed steps of step 1) include: downsampling: downsampling an input remote sensing image through a depth residual error network ResNet50, and taking a layer with a constant characteristic diagram size of the depth residual error network ResNet50 as a stage to obtain 4-stage characteristic diagrams C2, C3, C4 and C5 with 4 scales; upsampling: forming a feature pyramid network by using feature graphs C2, C3, C4 and C5 with 4 scales, up-sampling the feature graph C5 by 2 times by using bilinear interpolation, fixing the feature dimension to be 256 through a 1*1 convolution layer, fixing the feature dimension to be 256 by using a 1*1 convolution layer by using the feature graph C4, and finally obtaining a fused feature graph F4 by adding elements to the feature graphs with the same size in two stages; up-sampling the feature map F4 by 2 times, fixing the feature dimension 256, fixing the feature map C3 to the feature dimension 256, and adding the two to obtain a feature map F3 according to elements; up-sampling the feature map F3 by 2 times, fixing the feature dimension 256, fixing the feature map C2 by 256, adding the two to obtain a feature map F2 fusing high-order features and low-order features according to elements, and outputting the feature map F2 as a feature map fusing multi-scale information;
the center estimation module in the step 2) consists of a 1*1 convolution layer and a sigmoid activation layer operated according to elements, and is used for converting an input characteristic diagram fused with multi-scale information into a center characteristic diagram which is consistent in size and reflects the existence probability of a positive sample, multiplying the input characteristic diagram fused with multi-scale information and the center characteristic diagram according to elements, wherein the element value of a negative sample area in the obtained final characteristic diagram is close to 0, and the element value of a positive sample area is almost unchanged;
the rotation candidate region generating network in the step 3) comprises a 3*3 convolution layer and two 1*1 convolution layers, and is used for outputting the feature map through the 3*3 convolution layers to obtain feature maps consistent with H and W of the input feature map, and respectively passing the feature maps through the two 1*1 convolution layers to obtain two groups of feature maps respectively containing category information and position information.
2. The method for remote sensing target detection based on image semantic feature constraint according to claim 1, wherein the step 2) further comprises a step of training a center estimation module, and branches of the center estimation module are supervised by using a Focal Loss function Focal Loss in the process of training the center estimation module, wherein a functional expression of the Focal Loss function Focal Loss is as follows:
fl = −(1 − p)^α · log(p)
in the above formula, fl is a function value of a Focal Loss function Focal Loss, p represents a probability that a sample is a positive sample, and α is a coefficient, where the positive sample refers to a sample in which an intersection ratio of a preset anchor point and a real frame in a remote sensing image is higher than a threshold, and the negative sample refers to a sample in which an intersection ratio of the preset anchor point and the real frame in the remote sensing image is lower than the threshold.
3. The remote sensing target detection method based on image semantic feature constraint according to claim 1, characterized in that step 3) is preceded by a step of training the rotation candidate region generation network to generate candidate regions, wherein the judgment principle is that, when the intersection ratio of a candidate region and the real frame is computed, it must be judged whether the candidate region is a positive sample: a candidate region is a positive sample if it satisfies both of the following: 1) it has the highest intersection ratio with a real frame, or its intersection ratio with a real frame is not less than 0.7; and 2) the included angle between it and the real frame is smaller than π/12; a candidate region is a negative sample if it satisfies either of the following: 1) its intersection ratio with every real frame is smaller than 0.3; or 2) its intersection ratio with a real frame is larger than 0.7 but the included angle between them is larger than π/12; the tilted intersection ratio is then computed for the candidate regions of all positive and negative samples, and candidate regions satisfying neither condition do not participate in training; the feature vectors of uniform size output by the rotated region-of-interest layer are then input into a fully convolutional network, the rotation candidate region generation network is supervised with the focal loss function, and this process is repeated until the training of the rotation candidate region generation network is finished.
4. The remote sensing target detection method based on image semantic feature constraint according to claim 1, characterized in that, in step 4), when the classification and regression tasks are completed by the two fully connected layer branches respectively, the classification task uses the Loss function Softmax Loss to supervise the network during training, and the frame regression task uses the Loss function Smooth L1 Loss; the regression variables are computed as:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a

t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a), t*_θ = θ* − θ_a

in the above formulas, (x, y, w, h, θ) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the predicted target frame; (x_a, y_a, w_a, h_a, θ_a) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the anchor frame; and (x*, y*, w*, h*, θ*) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the real target frame; t = (t_x, t_y, t_w, t_h, t_θ) is the offset of the predicted frame relative to the anchor frame, and t* = (t*_x, t*_y, t*_w, t*_h, t*_θ) is the offset of the real frame relative to the anchor frame; the frame regression task uses the Loss function Smooth L1 Loss to compute the loss between the two offsets:

L_reg(t*, t) = Σ_{i ∈ {x, y, w, h, θ}} smooth_L1(t*_i − t_i)

where L_reg(t*, t) is the total regression loss between the true and predicted offsets, t*_i − t_i is the difference between the true and predicted offsets for each component, and the smooth L1 loss for any x is:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.
5. a remote sensing object detection system of image semantic feature constraint, comprising a computer device comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor of the computer device is programmed or configured to perform the steps of the remote sensing object detection method of image semantic feature constraint of any one of claims 1 to 4.
6. A remote sensing object detection system of image semantic feature constraint, comprising a computer device comprising a microprocessor and a memory interconnected, characterized in that the memory of the computer device has stored therein a computer program programmed or configured to perform the remote sensing object detection method of image semantic feature constraint of any one of claims 1 to 4.
7. A computer readable storage medium having stored therein a computer program programmed or configured to perform the steps of the remote sensing target detection method based on image semantic feature constraint of any one of claims 1 to 4.
CN202011018965.5A 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint Active CN112101277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018965.5A CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011018965.5A CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Publications (2)

Publication Number Publication Date
CN112101277A CN112101277A (en) 2020-12-18
CN112101277B true CN112101277B (en) 2023-07-28

Family

ID=73755387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011018965.5A Active CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Country Status (1)

Country Link
CN (1) CN112101277B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700444B (en) * 2021-02-19 2023-06-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN112861744B (en) * 2021-02-20 2022-06-17 哈尔滨工程大学 Remote sensing image target rapid detection method based on rotation anchor point clustering
CN113095188A (en) * 2021-04-01 2021-07-09 山东捷讯通信技术有限公司 Deep learning-based Raman spectrum data analysis method and device
CN113468968B (en) * 2021-06-02 2023-04-07 中国地质大学(武汉) Remote sensing image rotating target detection method based on non-anchor frame
CN113505806B (en) * 2021-06-02 2023-12-15 北京化工大学 Robot grabbing detection method
CN113468993B (en) * 2021-06-21 2022-08-26 天津大学 Remote sensing image target detection method based on deep learning
CN113420819B (en) * 2021-06-25 2022-12-06 西北工业大学 Lightweight underwater target detection method based on CenterNet
CN113792357B (en) * 2021-09-09 2023-09-05 重庆大学 Tree growth model construction method and computer storage medium
CN114240946B (en) * 2022-02-28 2022-12-02 南京智莲森信息技术有限公司 Locator abnormality detection method, system, storage medium and computing device
CN117094343B (en) * 2023-10-19 2023-12-29 成都新西旺自动化科技有限公司 QR code decoding system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Remote sensing image target detection based on an improved rotation region generation network; 戴媛; 易本顺; 肖进胜; 雷俊锋; 童乐; 程志钦; Acta Optica Sinica (Issue 01); full text *
Deep learning method for near-shore ship recognition in remote sensing images; 王昌安; 田金文; 张强; 张英辉; Remote Sensing Information (Issue 02); full text *

Also Published As

Publication number Publication date
CN112101277A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112101277B (en) Remote sensing target detection method based on image semantic feature constraint
US11144889B2 (en) Automatic assessment of damage and repair costs in vehicles
CN111612008B (en) Image segmentation method based on convolution network
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN110084817B (en) Digital elevation model production method based on deep learning
CN113468967B (en) Attention mechanism-based lane line detection method, attention mechanism-based lane line detection device, attention mechanism-based lane line detection equipment and attention mechanism-based lane line detection medium
CN111738995B (en) RGBD image-based target detection method and device and computer equipment
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
Sumer et al. An adaptive fuzzy-genetic algorithm approach for building detection using high-resolution satellite images
CN110516514B (en) Modeling method and device of target detection model
CN113657560A (en) Weak supervision image semantic segmentation method and system based on node classification
CN111696110A (en) Scene segmentation method and system
CN113191204A (en) Multi-scale blocking pedestrian detection method and system
Saovana et al. Automated point cloud classification using an image-based instance segmentation for structure from motion
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN111860411A (en) Road scene semantic segmentation method based on attention residual error learning
Gomez-Donoso et al. Three-dimensional reconstruction using SFM for actual pedestrian classification
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN113673478B (en) Port large-scale equipment detection and identification method based on deep learning panoramic stitching
CN113538523B (en) Parking space detection tracking method, electronic equipment and vehicle
Li et al. Detection of Imaged Objects with Estimated Scales.
Hehn et al. Instance stixels: Segmenting and grouping stixels into objects
Kim et al. LiDAR Based 3D object detection using CCD information
CN111339934A (en) Human head detection method integrating image preprocessing and deep learning target detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant