CN112101277A - Remote sensing target detection method based on image semantic feature constraint - Google Patents

Remote sensing target detection method based on image semantic feature constraint

Info

Publication number
CN112101277A
Authority
CN
China
Prior art keywords
feature
frame
remote sensing
feature map
image semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011018965.5A
Other languages
Chinese (zh)
Other versions
CN112101277B (en)
Inventor
孙斌
马付严
李树涛
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Fujitsu Ltd
Original Assignee
Hunan University
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University, Fujitsu Ltd filed Critical Hunan University
Priority to CN202011018965.5A priority Critical patent/CN112101277B/en
Publication of CN112101277A publication Critical patent/CN112101277A/en
Application granted granted Critical
Publication of CN112101277B publication Critical patent/CN112101277B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing target detection method based on image semantic feature constraint, which comprises the following steps: extracting features of the input image with a deep residual network ResNet50 and a feature pyramid network and fusing them into a multi-scale feature map that is fed to a center estimation module, and combining the output and input of the center estimation module to obtain an image semantic feature map with negative samples filtered out; using the extracted image semantic features to constrain the generation of anchors in arbitrary directions, extracting candidate regions from the filtered image semantic feature map with a rotation candidate region generation network, and extracting a feature vector of uniform size for each candidate region with a rotated region-of-interest pooling layer; and completing the classification and regression tasks with two fully connected layer branches to obtain the detection result and position of each candidate region in the input remote sensing image. The invention greatly reduces the computational cost and improves the detection speed and accuracy.

Description

Remote sensing target detection method based on image semantic feature constraint
Technical Field
The invention relates to an image target detection method, in particular to a remote sensing target detection method based on image semantic feature constraint.
Background
The need for intelligent transportation and ground observation has generated great interest in vehicle detection in remote sensing images, which aims to identify the class of each vehicle and accurately locate it in the image. Although much effort has been devoted to this task, vehicle detection remains very challenging because of the varied sizes and appearances of vehicles in remote sensing images. Detecting vehicles in arbitrary orientations is especially difficult, since directly applying horizontal object detection methods often produces regions of interest (RoIs) that do not match the vehicle regions and greatly expands the search space.
Faster R-CNN, published by Shaoqing Ren et al. (in Advances in Neural Information Processing Systems, 2015, pp. 91-99), presets anchors (initially estimated object boxes) of different sizes and aspect ratios and regresses the object positions in the image from these preset anchors, a strategy that has proven effective on public benchmarks. Most arbitrary-direction target detection methods adopt the same strategy; taking the rotation candidate region generation network described in "Arbitrary-oriented scene text detection via rotation proposals" by Jianqi Ma et al. (IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018) as an example, this method generates rotated candidate regions (a set of candidate boxes) from anchors with angles and regresses and refines their positions. Anchor-based detection algorithms perform well, but they usually start from a large number of densely distributed anchors; during training, the intersection-over-union between the ground-truth boxes and the predicted boxes is computed and predicted boxes whose intersection-over-union is below a threshold are removed as negative samples, which incurs a large computational cost. The anchor-free detection methods described in "CenterNet: Keypoint triplets for object detection" by Kaiwen Duan et al. (in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569-6578) and "CornerNet: Detecting objects as paired keypoints" by Hei Law and Jia Deng (in Proceedings of the European Conference on Computer Vision, 2018, pp. 734-750) predict boxes from keypoints rather than from anchors of preset size and aspect ratio. However, because only keypoints are used to predict the bounding boxes, anchor-free detection methods have a lower recall rate than anchor-based detection methods.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the problems in the prior art, the invention provides a remote sensing target detection method based on image semantic feature constraint, which uses the semantic feature information in the image to constrain anchor generation, greatly reduces the computational cost, and improves the detection speed and accuracy.
In order to solve the technical problems, the invention adopts the technical scheme that:
a remote sensing target detection method based on image semantic feature constraint is characterized by comprising the following steps:
step 1): extracting features of the input image with a deep residual network ResNet50 and a feature pyramid network, and fusing them to obtain a multi-scale feature map that fuses multi-scale information;
step 2): passing the fused multi-scale feature map through a center estimation module, and combining the center feature map output by the center estimation module with the feature map input to the center estimation module to filter out negative samples, obtaining an image semantic feature map with negative samples filtered out;
step 3): using the extracted image semantic features to constrain the generation of anchors in arbitrary directions, generating anchors on the image semantic feature map with negative samples filtered out, obtaining candidate regions from the generated anchors with a rotation candidate region generation network, and extracting a feature vector of uniform size for each candidate region with a rotated region-of-interest pooling layer;
step 4): for the uniform-size feature vector extracted from each candidate region, completing the classification and regression tasks with two fully connected layer branches respectively, to obtain the detection result and position of each candidate region in the input remote sensing image.
Optionally, the detailed steps of step 1) include: down-sampling: down-sampling the input remote sensing image with the deep residual network ResNet50, where each group of ResNet50 layers over which the feature map size is unchanged is called a stage, obtaining feature maps C2, C3, C4 and C5 of 4 stages at 4 scales; up-sampling: forming a feature pyramid network from the feature maps C2, C3, C4 and C5 at 4 scales, up-sampling the feature map C5 by a factor of 2 with bilinear interpolation, fixing the feature dimension of the feature map C4 to 256 with a 1×1 convolution layer, and finally adding the two same-size feature maps element-wise to obtain the fused feature map F4; up-sampling the feature map F4 by a factor of 2 with the feature dimension fixed to 256, fixing the feature dimension of the feature map C3 to 256, and adding the two element-wise to obtain the feature map F3; up-sampling the feature map F3 by a factor of 2 with the feature dimension fixed to 256, fixing the feature dimension of the feature map C2 to 256, and adding the two element-wise to obtain the feature map F2, which fuses high-level and low-level features; the feature map F2 is output as the feature map fusing multi-scale information.
Optionally, the center estimation module in step 2) consists of a 1 × 1 convolution layer and an element-wise sigmoid activation layer, and is used to convert the input feature map of fused multi-scale information into a center feature map of the same size representing the probability that a positive sample is present, and to multiply the input feature map and the center feature map element-wise, so that element values in negative-sample regions of the final feature map are close to 0 while element values in positive-sample regions remain approximately unchanged.
Optionally, a step of training the center estimation module is further included before step 2); during training of the center estimation module, the Focal Loss function is used to supervise the center estimation branch, and the functional expression of the Focal Loss is as follows:
fl = -(1 - p)^α · log(p)
In the above formula, fl is the value of the Focal Loss function, p is the probability that a sample is a positive sample, and α is a coefficient; a positive sample is a sample whose preset anchor has an intersection-over-union with a ground-truth box in the remote sensing image above a threshold, and a negative sample is one whose intersection-over-union is below the threshold.
Optionally, the rotation candidate region generation network in step 3) comprises one 3 × 3 convolution layer and two 1 × 1 convolution layers, and is used to pass the input feature map through the 3 × 3 convolution layer to obtain a feature map whose H and W match those of the input, and then to pass that feature map through the two 1 × 1 convolution layers respectively to obtain two groups of feature maps containing category information and position information respectively.
Optionally, a step of training the rotation candidate region generation network is further included before step 3); when the rotation candidate region generation network is trained to generate candidate regions, whether a candidate region is a positive sample is decided from its intersection-over-union with the ground-truth box according to the following rules: a candidate region is a positive sample if it satisfies both: 1) its intersection-over-union with the ground-truth box is the highest or greater than 0.7; and 2) the angle between it and the ground-truth box is less than π/12; a candidate region is a negative sample if it satisfies either: 1) its intersection-over-union with the ground-truth box is less than 0.3; or 2) its intersection-over-union with the ground-truth box is greater than 0.7 but the angle between them is greater than π/12; the oblique intersection-over-union is then computed for all positive- and negative-sample candidate regions, and candidate regions satisfying neither criterion do not participate in training; the uniform-size feature vectors output by the rotated region-of-interest pooling layer are then fed into a full convolution network, a focal loss function is used to supervise the rotation candidate region generation network, and this process is repeated until training of the rotation candidate region generation network is completed.
Optionally, when the classification and regression tasks are completed with the two fully connected layer branches in step 4), the classification task is trained under supervision of the Softmax Loss function, and the bounding-box regression task uses the Smooth L1 Loss function; the regression variables are computed with the following expressions:
tx = (x - xa) / wa
ty = (y - ya) / ha
tw = log(w / wa)
th = log(h / ha)
tθ = θ - θa
tx* = (x* - xa) / wa
ty* = (y* - ya) / ha
tw* = log(w* / wa)
th* = log(h* / ha)
tθ* = θ* - θa
In the above formulas, (x, y, w, h, θ) are the center abscissa, center ordinate, width, height and rotation angle of the predicted target box, (xa, ya, wa, ha, θa) are the center abscissa, center ordinate, width, height and rotation angle of the anchor box, and (x*, y*, w*, h*, θ*) are the center abscissa, center ordinate, width, height and rotation angle of the ground-truth box; (tx, ty, tw, th, tθ) is the offset of the predicted box relative to the anchor box, and (tx*, ty*, tw*, th*, tθ*) is the offset of the ground-truth box relative to the anchor box. The bounding-box regression task uses the Smooth L1 Loss to compute the loss between the two offsets as follows:
Lreg(t*, t) = Σ i∈{x,y,w,h,θ} smoothL1(ti* - ti)
In the above formula, Lreg(t*, t) is the total regression loss between the true offset and the predicted offset, t* = (tx*, ty*, tw*, th*, tθ*) is the offset of the ground-truth box relative to the anchor box, t = (tx, ty, tw, th, tθ) is the offset of the predicted box relative to the anchor box, and ti* - ti is the difference between the true and predicted offsets; the smooth L1 loss for any x is defined as:
smoothL1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise.
in addition, the invention also provides a remote sensing target detection system of image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and the microprocessor of the computer device is programmed or configured to execute the steps of the remote sensing target detection method of image semantic feature constraint.
In addition, the invention also provides a remote sensing target detection system of image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and a computer program which is programmed or configured to execute the remote sensing target detection method of the image semantic feature constraint is stored in the memory of the computer device.
In addition, the invention also provides a computer readable storage medium having stored thereon a computer program programmed or configured to execute the steps of the aforementioned image semantic feature constrained remote sensing target detection method.
Compared with the prior art, the invention has the following advantages: 1) The remote sensing target detection method based on image semantic feature constraint extracts features of the input image with a deep residual network ResNet50 and a feature pyramid network and fuses them into a multi-scale feature map, so image features can be extracted accurately. 2) The fused feature map is passed through a center estimation module, and the center feature map output by the center estimation module is combined with the feature map input to it to obtain an image semantic feature map with negative samples filtered out; semantic information is used to filter out anchors that are unlikely to cover a vehicle region, only the remaining anchors participate in generating rotated candidate regions, an image semantic feature constraint is thereby added to the detection method, and the subsequent computation operates on only a small number of generated candidate regions, which preserves the performance advantage of anchor-based detection methods while improving detection speed. 3) Candidate regions are extracted from the filtered image semantic feature map with the rotation candidate region generation network, and a feature vector of uniform size is extracted for each candidate region with the rotated region-of-interest pooling layer; from these uniform-size feature vectors the classification and regression tasks are completed with two fully connected layer branches respectively, giving the detection result and position of each candidate region in the input remote sensing image, so that image-semantic-feature-constrained remote sensing target detection is realized and the detection result and position of each candidate region are obtained at the same time. In conclusion, the invention fully exploits the semantic information in the image, greatly reduces the computational cost, and improves the detection speed and accuracy.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a remote sensing target detection network structure constrained by image semantic features adopted in the embodiment of the present invention.
Fig. 3 shows the conventional coordinate representation of an oriented bounding box.
Fig. 4 shows the oriented bounding box representation of anchors used in an embodiment of the present invention.
Fig. 5 illustrates the rotation anchors employed in an embodiment of the present invention, where (a) shows anchors at different angles and (b) shows an example of the anchors used in this embodiment.
Fig. 6 is a schematic diagram illustrating a process of calculating a positive sample of a candidate region according to an embodiment of the present invention.
FIG. 7 shows vehicle detection results visualized on the DOTA test data set according to an embodiment of the present invention.
Detailed Description
As shown in fig. 1 and fig. 2, the method for detecting a remote sensing target constrained by image semantic features in the embodiment of the present invention includes the following steps:
step 1): extracting features of the input image with a deep residual network ResNet50 and a feature pyramid network, and fusing them to obtain a multi-scale feature map that fuses multi-scale information;
step 2): passing the fused multi-scale feature map through a center estimation module, and combining the center feature map output by the center estimation module with the feature map input to the center estimation module to filter out negative samples, obtaining an image semantic feature map with negative samples filtered out;
step 3): using the extracted image semantic features to constrain the generation of anchors in arbitrary directions, generating anchors on the image semantic feature map with negative samples filtered out, obtaining candidate regions from the generated anchors with a rotation candidate region generation network, and extracting a feature vector of uniform size for each candidate region with a rotated region-of-interest pooling layer;
step 4): for the uniform-size feature vector extracted from each candidate region, completing the classification and regression tasks with two fully connected layer branches respectively, to obtain the detection result and position of each candidate region in the input remote sensing image.
In order to construct the image-semantic-feature-constrained remote sensing target detection network shown in fig. 2, this embodiment uses the public remote sensing image target detection data set DOTA (the largest data set in the remote sensing target detection field annotated with oriented boxes) and extracts the ground-truth labels of the categories and positions of the vehicle targets in the images. Images containing vehicles in the DOTA training set are used as the training set, and the DOTA test set is used as the test set. The original images in the constructed training set are cropped into sub-images of size 1024 × 1024 with a stride of 512; for data augmentation, the original images are also rescaled to 0.5 times and 1.5 times to cover different target scales and cropped in the same way. The training set constructed in this way contains 106965 images in total, and the test set constructed in the same way contains 74058 images in total. With the constructed remote sensing image vehicle detection training and test sets, the image-semantic-feature-constrained remote sensing target detection network shown in fig. 2 can be trained and tested.
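As an illustration of the cropping and multi-scale augmentation just described, the following sketch cuts an image into 1024 × 1024 sub-images with a stride of 512 at scales 0.5, 1.0 and 1.5. It is only a minimal sketch assuming OpenCV is available; the function name, padding behaviour and label handling are not specified by the patent and are left out.

```python
import cv2

def crop_multiscale(image_path, crop_size=1024, stride=512, scales=(0.5, 1.0, 1.5)):
    """Rescale an image and cut it into overlapping square patches (illustrative only)."""
    image = cv2.imread(image_path)
    patches = []
    for s in scales:
        scaled = cv2.resize(image, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
        h, w = scaled.shape[:2]
        for y in range(0, max(h - crop_size, 0) + 1, stride):
            for x in range(0, max(w - crop_size, 0) + 1, stride):
                patch = scaled[y:y + crop_size, x:x + crop_size]
                # keep scale and offset so oriented-box labels can be mapped back
                patches.append((s, x, y, patch))
    return patches
```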
The deep residual network ResNet50 and the feature pyramid network FPN are divided into two processes, down-sampling and up-sampling, and finally yield the feature map F2 fusing high-level and low-level features, which can represent target information at different scales in the remote sensing image. In this embodiment, the detailed steps of step 1) include: down-sampling: down-sampling the input remote sensing image with the deep residual network ResNet50, where each group of ResNet50 layers over which the feature map size is unchanged is called a stage, obtaining feature maps C2, C3, C4 and C5 of 4 stages at 4 scales; up-sampling: forming a feature pyramid network (FPN) from the feature maps C2, C3, C4 and C5 at 4 scales, up-sampling the feature map C5 by a factor of 2 with bilinear interpolation, fixing the feature dimension of the feature map C4 to 256 with a 1×1 convolution layer, and finally adding the two same-size feature maps element-wise to obtain the fused feature map F4; up-sampling the feature map F4 by a factor of 2 with the feature dimension fixed to 256, fixing the feature dimension of the feature map C3 to 256, and adding the two element-wise to obtain the feature map F3; up-sampling the feature map F3 by a factor of 2 with the feature dimension fixed to 256, fixing the feature dimension of the feature map C2 to 256, and adding the two element-wise to obtain the feature map F2, which fuses high-level and low-level features; the feature map F2 is output as the feature map fusing multi-scale information.
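The top-down fusion of step 1) can be sketched as follows in PyTorch. The 1 × 1 lateral convolutions fix the feature dimension to 256 and bilinear interpolation doubles the spatial size before element-wise addition; the 1 × 1 convolution applied to C5 is an assumption needed to make the channel counts match, and the module name and channel defaults are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Sketch of the step-1 fusion: lateral 1x1 convs fix the feature dimension
    to 256, higher-level maps are upsampled by 2 with bilinear interpolation and
    added element-wise; F2 is returned as the multi-scale feature map."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="bilinear", align_corners=False)
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="bilinear", align_corners=False)
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="bilinear", align_corners=False)
        return p2  # feature map F2 fusing multi-scale information
```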
In step 2) of this embodiment, a center estimation module is used to filter negative samples out of the input feature map, improving the speed of vehicle detection. Because vehicles in remote sensing images are unevenly distributed and vary in size, the intersection-over-union of most preset anchors with the ground-truth boxes falls below the threshold; far fewer preset anchors exceed the threshold and count as positive samples, so the ratio of positive to negative samples is extremely unbalanced. To avoid spending a large amount of computation on negative samples during target detection, this embodiment uses a 1 × 1 convolution layer and an element-wise sigmoid activation layer to convert the feature map extracted by the deep residual network ResNet50 and the feature pyramid network FPN into a center feature map of the same size that represents the probability that a positive sample is present, and combines the extracted feature map with the center feature map element-wise to obtain the image semantic feature map with negative samples filtered out. Referring to fig. 2, the center estimation module in step 2) consists of a 1 × 1 convolution layer and an element-wise sigmoid activation layer; it converts the input feature map of fused multi-scale information into a center feature map of the same size representing the probability that a positive sample is present, and multiplies the input feature map and the center feature map element-wise, so that element values in negative-sample regions of the final feature map are close to 0 while element values in positive-sample regions remain approximately unchanged.
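A minimal sketch of the center estimation module, assuming a PyTorch implementation: a 1 × 1 convolution followed by an element-wise sigmoid yields the center map, which is multiplied element-wise with the input feature map. The single-channel center map is an assumption; the patent only states that the center feature map has a size consistent with the input.

```python
import torch
import torch.nn as nn

class CenterEstimation(nn.Module):
    """Sketch of the center estimation module in step 2): a 1x1 convolution and an
    element-wise sigmoid produce a center map giving the probability that a positive
    sample is present; multiplying it with the input suppresses negative regions."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feature_map):
        center_map = torch.sigmoid(self.conv(feature_map))  # (N, 1, H, W)
        filtered = feature_map * center_map                  # broadcast over channels
        return filtered, center_map
```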
In this embodiment, a step of training the center estimation module is further included before step 2); during training of the center estimation module, the Focal Loss function is used to supervise the center estimation branch, and the functional expression of the Focal Loss is as follows:
fl = -(1 - p)^α · log(p)
In the above formula, fl is the value of the Focal Loss function, p is the probability that a sample is a positive sample, and α is a coefficient (usually α = 2); a positive sample is a sample whose preset anchor has an intersection-over-union with a ground-truth box in the remote sensing image above a threshold, and a negative sample is one whose intersection-over-union is below the threshold. With α = 2, a sample considered a positive sample with probability 0.9 contributes 100 times less to the loss than under the ordinary cross-entropy loss, so the Focal Loss effectively limits the contribution of easily classified samples to the model.
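A sketch of the focal loss fl = -(1 - p)^α · log(p) used to supervise the center estimation branch is given below. The symmetric term applied to negative samples is an assumption not spelled out in the text, and the function signature is illustrative.

```python
import torch

def focal_loss(p, is_positive, alpha=2.0, eps=1e-6):
    """Focal loss sketch: fl = -(1 - p)^alpha * log(p) for positive samples;
    the mirrored term -(p)^alpha * log(1 - p) for negative samples is assumed.
    p: predicted positive probability; is_positive: boolean tensor of same shape."""
    p = p.clamp(eps, 1.0 - eps)
    pos_loss = -((1.0 - p) ** alpha) * torch.log(p)
    neg_loss = -(p ** alpha) * torch.log(1.0 - p)
    return torch.where(is_positive, pos_loss, neg_loss).mean()
```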
Fig. 3 illustrates the conventional coordinate method for annotating an oriented bounding box. In this embodiment, when anchors in arbitrary directions are generated on the image semantic feature map with negative samples filtered out in step 3), the oriented box of an anchor in an arbitrary direction is represented as shown in fig. 4, namely as a tuple (x, y, w, h, θ) of 5 elements, where θ takes values in [-π/2, π/2); a box whose angle falls outside this range is shifted back into the range in the opposite direction.
The rotation candidate region generation network obtains candidate regions from the anchors in arbitrary directions. In this embodiment, the rotation candidate region generation network in step 3) includes one 3 × 3 convolution layer and two 1 × 1 convolution layers; the input feature map is passed through the 3 × 3 convolution layer to obtain a feature map whose H and W match those of the input, and that feature map is then passed through the two 1 × 1 convolution layers respectively to obtain two groups of feature maps containing category information and position information respectively. As shown in sub-figure (b) of fig. 5, compared with a conventional candidate region generation network, the anchors carry angle information and are composed of 2 aspect ratios and 6 angles (1:2 and 1:4; -π/2, -π/3, -π/6, 0, π/6 and π/3, see sub-figure (a) of fig. 5), so each element of the feature map in the rotation candidate region generation network corresponds to 2 × 6 = 12 anchors.
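The rotation candidate region generation network head described above can be sketched as follows. The use of ReLU after the shared 3 × 3 convolution and the exact number of output channels per anchor (2 class scores and 5 box offsets) are assumptions; the patent only fixes one 3 × 3 convolution, two 1 × 1 convolutions and 12 anchors per feature-map element.

```python
import torch
import torch.nn as nn

class RotationRPNHead(nn.Module):
    """Sketch of the rotation candidate region generation network in step 3):
    one 3x3 convolution keeps H and W, then two 1x1 convolutions output category
    scores and (x, y, w, h, theta) offsets for the 12 anchors per location."""
    def __init__(self, in_channels=256, num_anchors=12):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.cls_head = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)  # object / background
        self.reg_head = nn.Conv2d(in_channels, num_anchors * 5, kernel_size=1)  # x, y, w, h, theta

    def forward(self, feature_map):
        shared = torch.relu(self.shared(feature_map))
        return self.cls_head(shared), self.reg_head(shared)
```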
As shown in fig. 6, a step of training the rotation candidate region generation network is further included before step 3); when the rotation candidate region generation network is trained to generate candidate regions, whether a candidate region is a positive sample is decided from its intersection-over-union with the ground-truth box according to the following rules: a candidate region is a positive sample if it satisfies both: 1) its intersection-over-union with the ground-truth box is the highest or greater than 0.7; and 2) the angle between it and the ground-truth box is less than π/12; a candidate region is a negative sample if it satisfies either: 1) its intersection-over-union with the ground-truth box is less than 0.3; or 2) its intersection-over-union with the ground-truth box is greater than 0.7 but the angle between them is greater than π/12; the oblique intersection-over-union is then computed for all positive- and negative-sample candidate regions, and candidate regions satisfying neither criterion do not participate in training; the uniform-size feature vectors output by the rotated region-of-interest pooling layer are then fed into a full convolution network, a focal loss function is used to supervise the rotation candidate region generation network, and this process is repeated until training of the rotation candidate region generation network is completed. A compact form of this assignment rule is sketched after the intersection-over-union procedure below.
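The sample assignment rule above can be written compactly as follows; the function name and arguments are illustrative, and the skewed intersection-over-union and the angle difference are assumed to be computed beforehand.

```python
import math

def assign_label(iou, angle_diff, is_best_match):
    """Sketch of the assignment rule: iou is the skewed IoU between a candidate
    and a ground-truth box, angle_diff the absolute angle difference in radians,
    is_best_match whether the candidate has the highest IoU with that ground truth.
    Returns 1 (positive), 0 (negative) or -1 (ignored during training)."""
    if (is_best_match or iou > 0.7) and angle_diff < math.pi / 12:
        return 1
    if iou < 0.3 or (iou > 0.7 and angle_diff > math.pi / 12):
        return 0
    return -1
```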
In this embodiment, the intersection-over-union between a candidate region and a ground-truth box is calculated as follows:
S1) input the candidate regions and the ground-truth boxes R1, R2, R3, ...;
S2) traverse and select any pair of a candidate region and a ground-truth box <Ri, Rj> (i < j) as the current rectangle pair; if the traversal is finished, end and exit, otherwise go to step S3);
S3) set the point set PSet to the empty set;
S4) add the intersection points of rectangle Ri and rectangle Rj to the point set PSet;
S5) add the vertices of rectangle Ri that lie inside rectangle Rj to the point set PSet;
S6) add the vertices of rectangle Rj that lie inside rectangle Ri to the point set PSet;
S7) sort the point set PSet counter-clockwise;
S8) compute the intersection region I of the point set PSet by triangulation;
S9) compute the intersection-over-union IoU[i, j] between <Ri, Rj> (i < j) with the following formula:
IoU[i, j] = Area(I) / (Area(Ri) + Area(Rj) - Area(I))
In the above formula, Area(I) is the area where the candidate region and the ground-truth box intersect, Area(Ri) is the area of the candidate region, and Area(Rj) is the area of the ground-truth box;
S10) go back to step S2).
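Instead of the clipping-and-triangulation procedure of steps S1) to S10), the same skewed intersection-over-union can be sketched with the shapely library, which performs the polygon intersection internally; this is an illustrative alternative, not the patent's own implementation.

```python
import math
from shapely.geometry import Polygon

def obb_to_polygon(cx, cy, w, h, theta):
    """Convert an oriented box (x, y, w, h, theta) into its four corner points."""
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * c - dy * s, cy + dx * s + dy * c))
    return Polygon(corners)

def skew_iou(box_a, box_b):
    """IoU[i, j] = Area(I) / (Area(Ri) + Area(Rj) - Area(I)) for two oriented boxes."""
    pa, pb = obb_to_polygon(*box_a), obb_to_polygon(*box_b)
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter + 1e-9)
```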
Referring to fig. 2, a center mask segmentation module is used in this embodiment to improve model accuracy. For the sake of vehicle detection speed, the center estimation module uses only a 1 × 1 convolution layer and an element-wise sigmoid activation layer rather than the deep networks commonly used in semantic segmentation. To obtain a better estimate, this embodiment uses the center mask module to constrain the extraction of vehicle position information during the training phase. The uniform-size feature vectors output by the rotated region-of-interest pooling layer are fed into a full convolution network, which is supervised with a focal loss function. The full convolution network makes the network attend to every pixel of the image and classifies each pixel, and the focal loss is then computed per pixel, so that the network focuses on hard samples, the influence of easy samples is reduced, and model accuracy is improved.
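A possible form of the center mask segmentation branch, assuming a PyTorch implementation, is sketched below. The depth and channel counts of the full convolution network are assumptions; the patent only states that a full convolution network classifies every pixel and is supervised with a focal loss during training.

```python
import torch
import torch.nn as nn

class CenterMaskHead(nn.Module):
    """Sketch of the center mask segmentation branch used only during training:
    a small fully convolutional head predicts a per-pixel foreground probability
    on the RoI feature and is supervised with the focal loss."""
    def __init__(self, in_channels=256, hidden=256):
        super().__init__()
        self.fcn = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=1),  # per-pixel foreground logit
        )

    def forward(self, roi_features):
        return torch.sigmoid(self.fcn(roi_features))
```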
In this embodiment, when the classification and regression tasks are completed with the two fully connected layer branches in step 4), the classification task is trained under supervision of the Softmax Loss function, and the bounding-box regression task uses the Smooth L1 Loss function; the regression variables are computed with the following expressions:
tx = (x - xa) / wa
ty = (y - ya) / ha
tw = log(w / wa)
th = log(h / ha)
tθ = θ - θa
tx* = (x* - xa) / wa
ty* = (y* - ya) / ha
tw* = log(w* / wa)
th* = log(h* / ha)
tθ* = θ* - θa
In the above formulas, (x, y, w, h, θ) are the center abscissa, center ordinate, width, height and rotation angle of the predicted target box, (xa, ya, wa, ha, θa) are the center abscissa, center ordinate, width, height and rotation angle of the anchor box, and (x*, y*, w*, h*, θ*) are the center abscissa, center ordinate, width, height and rotation angle of the ground-truth box; (tx, ty, tw, th, tθ) is the offset of the predicted box relative to the anchor box, and (tx*, ty*, tw*, th*, tθ*) is the offset of the ground-truth box relative to the anchor box. The bounding-box regression task uses the Smooth L1 Loss to compute the loss between the two offsets as follows:
Lreg(t*, t) = Σ i∈{x,y,w,h,θ} smoothL1(ti* - ti)
In the above formula, Lreg(t*, t) is the total regression loss between the true offset and the predicted offset, t* = (tx*, ty*, tw*, th*, tθ*) is the offset of the ground-truth box relative to the anchor box, t = (tx, ty, tw, th, tθ) is the offset of the predicted box relative to the anchor box, and ti* - ti is the difference between the true and predicted offsets; the smooth L1 loss for any x is defined as:
smoothL1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise.
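The regression encoding and the Smooth L1 loss above translate directly into code; the following PyTorch sketch uses illustrative function names and assumes that boxes and anchors are given as (x, y, w, h, θ) tensors.

```python
import torch

def encode_offsets(box, anchor):
    """Compute (tx, ty, tw, th, ttheta) of a box relative to an anchor, following
    the regression formulas above. Both inputs are tensors of shape (..., 5)."""
    x, y, w, h, theta = box.unbind(-1)
    xa, ya, wa, ha, ta = anchor.unbind(-1)
    return torch.stack(((x - xa) / wa, (y - ya) / ha,
                        torch.log(w / wa), torch.log(h / ha), theta - ta), dim=-1)

def smooth_l1(diff):
    """smoothL1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise."""
    absdiff = diff.abs()
    return torch.where(absdiff < 1.0, 0.5 * diff ** 2, absdiff - 0.5)

def regression_loss(t_star, t):
    """Total regression loss: sum of smooth L1 over the five offset components."""
    return smooth_l1(t_star - t).sum(dim=-1).mean()
```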
FIG. 7 shows vehicle detection results visualized on the DOTA test data set, where the box labeled A is a detected truck and the remaining boxes are cars.
The deep residual network ResNet50 used in this embodiment is initialized with parameters pre-trained on the ImageNet dataset, the initial learning rate is set to 0.01, and training runs for 12 epochs in total. After the 8th and 11th epochs the learning rate is reduced to 1/10 of its previous value. The constructed remote sensing image vehicle detection training set is fed into the constructed image-semantic-feature-constrained remote sensing target detection model, the model is trained with the SGD optimization algorithm, and after 12 epochs the trained remote sensing image vehicle detection model is obtained. The goal of the test stage is to obtain the position and category of the vehicles in each image; the center mask segmentation module contained in the trained model is not needed at test time, and the vehicle positions and categories are obtained by regression and classification respectively. At test time, only predicted vehicle boxes with scores higher than 0.3 are kept, and non-maximum suppression with a threshold of 0.5 is applied to remove duplicates.
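The test-time post-processing described above (score threshold 0.3, non-maximum suppression at IoU 0.5) can be sketched as follows, reusing the skew_iou sketch given earlier; this is a simple illustrative loop, not the implementation used in the experiments.

```python
def rotated_nms(boxes, scores, score_thr=0.3, iou_thr=0.5):
    """Keep predicted boxes whose score exceeds score_thr and suppress duplicates
    with skewed-IoU non-maximum suppression at iou_thr. boxes are (x, y, w, h, theta)
    tuples; skew_iou is the rotated IoU sketched earlier."""
    candidates = sorted(
        (b for b in zip(boxes, scores) if b[1] > score_thr),
        key=lambda b: b[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(skew_iou(box, k) <= iou_thr for k, _ in kept):
            kept.append((box, score))
    return kept
```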
Table 1 shows the quantitative evaluation results of the image-semantic-feature-constrained remote sensing target detection method of this embodiment and of other methods. FR-O denotes the Faster R-CNN OBB detector, which is the official baseline provided by DOTA. Mode 1 of the method of this embodiment denotes the variant with only the image semantic feature constraint (without the center mask segmentation module); mode 2 denotes the variant with both the image semantic feature constraint and the center mask segmentation module; mode 3 denotes the variant that additionally uses anchors of multiple aspect ratios on top of mode 2. In the tests, modes 1-3 of this embodiment outperform the other methods in mean average precision (mAP) and time cost. Modes 1-3 of the method of this embodiment obtain 76.9% mAP, 40.2% higher than the official baseline, and their vehicle detection performance is 6.2% higher than that of the rotation candidate region generation network used alone. Mode 2, with the center mask segmentation module, improves mAP by 3.4% over the method without this branch. The time costs of modes 1-3 of this embodiment are also reported in Table 1, with the best results in each case highlighted in bold. SV denotes the average precision of small vehicle detection, LV the average precision of large vehicle detection, and mAP the mean of the average precisions over all categories.
Table 1: quantitative evaluation of the results of the different methods (average precision + run time).
In summary, directly applying conventional horizontal anchor-based detection methods to vehicle detection in arbitrary directions usually gives poor performance. Although rotated anchors have been used to address this problem, such a design incurs a significant computational cost because thousands of rotated anchors are generated at every level of the feature map. To solve this problem, this embodiment provides a remote sensing target detection method based on image semantic feature constraint: before the model computes intersection-over-union, semantic information is used to filter out anchors that are unlikely to cover a vehicle region, anchors in arbitrary directions participate in the generation of rotated candidate regions, and the subsequent computation operates on only a small number of generated candidate regions, which preserves the performance advantage of anchor-based detection methods and improves detection speed. Overall, this embodiment fully exploits the semantic information in the image, greatly reduces the computational cost, and improves detection speed and accuracy.
In addition, the embodiment also provides an image semantic feature constrained remote sensing target detection system, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and the microprocessor of the computer device is programmed or configured to execute the steps of the image semantic feature constrained remote sensing target detection method.
In addition, the embodiment also provides an image semantic feature constrained remote sensing target detection system, which includes a computer device, where the computer device includes a microprocessor and a memory connected to each other, and a computer program programmed or configured to execute the aforementioned image semantic feature constrained remote sensing target detection method is stored in the memory of the computer device.
In addition, this embodiment also provides a computer readable storage medium having stored thereon a computer program programmed or configured to execute the steps of the aforementioned image semantic feature constrained remote sensing target detection method.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application; computer program instructions may be provided to a processor of a computer or other programmable data processing apparatus such that the instructions, which execute via the processor, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A remote sensing target detection method based on image semantic feature constraint is characterized by comprising the following steps:
step 1): extracting features of the input image with a deep residual network ResNet50 and a feature pyramid network, and fusing them to obtain a multi-scale feature map that fuses multi-scale information;
step 2): passing the fused multi-scale feature map through a center estimation module, and combining the center feature map output by the center estimation module with the feature map input to the center estimation module to filter out negative samples, obtaining an image semantic feature map with negative samples filtered out;
step 3): using the extracted image semantic features to constrain the generation of anchors in arbitrary directions, generating anchors on the image semantic feature map with negative samples filtered out, obtaining candidate regions from the generated anchors with a rotation candidate region generation network, and extracting a feature vector of uniform size for each candidate region with a rotated region-of-interest pooling layer;
step 4): for the uniform-size feature vector extracted from each candidate region, completing the classification and regression tasks with two fully connected layer branches respectively, to obtain the detection result and position of each candidate region in the input remote sensing image.
2. The image semantic feature constrained remote sensing target detection method according to claim 1, wherein the detailed steps of step 1) comprise: down-sampling: down-sampling the input remote sensing image with the deep residual network ResNet50, where each group of ResNet50 layers over which the feature map size is unchanged is called a stage, obtaining feature maps C2, C3, C4 and C5 of 4 stages at 4 scales; up-sampling: forming a feature pyramid network from the feature maps C2, C3, C4 and C5 at 4 scales, up-sampling the feature map C5 by a factor of 2 with bilinear interpolation, fixing the feature dimension of the feature map C4 to 256 with a 1×1 convolution layer, and finally adding the two same-size feature maps element-wise to obtain the fused feature map F4; up-sampling the feature map F4 by a factor of 2 with the feature dimension fixed to 256, fixing the feature dimension of the feature map C3 to 256, and adding the two element-wise to obtain the feature map F3; up-sampling the feature map F3 by a factor of 2 with the feature dimension fixed to 256, fixing the feature dimension of the feature map C2 to 256, and adding the two element-wise to obtain the feature map F2, which fuses high-level and low-level features; the feature map F2 is output as the feature map fusing multi-scale information.
3. The image semantic feature constrained remote sensing target detection method according to claim 1, wherein the center estimation module in step 2) consists of a 1 × 1 convolution layer and an element-wise sigmoid activation layer, and is used to convert the input feature map of fused multi-scale information into a center feature map of the same size representing the probability that a positive sample is present, and to multiply the input feature map and the center feature map element-wise, so that element values in negative-sample regions of the final feature map are close to 0 while element values in positive-sample regions remain approximately unchanged.
4. The image semantic feature constrained remote sensing target detection method according to claim 3, further comprising a step of training the center estimation module before step 2), wherein during training of the center estimation module the Focal Loss function is used to supervise the center estimation branch, the functional expression of the Focal Loss being:
fl = -(1 - p)^α · log(p)
In the above formula, fl is the value of the Focal Loss function, p is the probability that a sample is a positive sample, and α is a coefficient; a positive sample is a sample whose preset anchor has an intersection-over-union with a ground-truth box in the remote sensing image above a threshold, and a negative sample is one whose intersection-over-union is below the threshold.
5. The image semantic feature constrained remote sensing target detection method according to claim 1, wherein the rotation candidate region generation network in step 3) comprises one 3 × 3 convolution layer and two 1 × 1 convolution layers, and is used to pass the input feature map through the 3 × 3 convolution layer to obtain a feature map whose H and W match those of the input, and then to pass that feature map through the two 1 × 1 convolution layers respectively to obtain two groups of feature maps containing category information and position information respectively.
6. The image semantic feature constrained remote sensing target detection method according to claim 5, further comprising a step of training the rotation candidate region generation network before step 3), wherein when the rotation candidate region generation network is trained to generate candidate regions, whether a candidate region is a positive sample is decided from its intersection-over-union with the ground-truth box according to the following rules: a candidate region is a positive sample if it satisfies both: 1) its intersection-over-union with the ground-truth box is the highest or greater than 0.7; and 2) the angle between it and the ground-truth box is less than π/12; a candidate region is a negative sample if it satisfies either: 1) its intersection-over-union with the ground-truth box is less than 0.3; or 2) its intersection-over-union with the ground-truth box is greater than 0.7 but the angle between them is greater than π/12; the oblique intersection-over-union is then computed for all positive- and negative-sample candidate regions, and candidate regions satisfying neither criterion do not participate in training; the uniform-size feature vectors output by the rotated region-of-interest pooling layer are then fed into a full convolution network, a focal loss function is used to supervise the rotation candidate region generation network, and this process is repeated until training of the rotation candidate region generation network is completed.
7. The image semantic feature constrained remote sensing target detection method according to claim 1, wherein when the classification and regression tasks are completed with the two fully connected layer branches in step 4), the classification task is trained under supervision of the Softmax Loss function and the bounding-box regression task uses the Smooth L1 Loss function, the regression variables being computed with the following expressions:
tx = (x - xa) / wa
ty = (y - ya) / ha
tw = log(w / wa)
th = log(h / ha)
tθ = θ - θa
tx* = (x* - xa) / wa
ty* = (y* - ya) / ha
tw* = log(w* / wa)
th* = log(h* / ha)
tθ* = θ* - θa
In the above formulas, (x, y, w, h, θ) are the center abscissa, center ordinate, width, height and rotation angle of the predicted target box, (xa, ya, wa, ha, θa) are the center abscissa, center ordinate, width, height and rotation angle of the anchor box, and (x*, y*, w*, h*, θ*) are the center abscissa, center ordinate, width, height and rotation angle of the ground-truth box; (tx, ty, tw, th, tθ) is the offset of the predicted box relative to the anchor box, and (tx*, ty*, tw*, th*, tθ*) is the offset of the ground-truth box relative to the anchor box; the bounding-box regression task uses the Smooth L1 Loss to compute the loss between the two offsets as follows:
Lreg(t*, t) = Σ i∈{x,y,w,h,θ} smoothL1(ti* - ti)
In the above formula, Lreg(t*, t) is the total regression loss between the true offset and the predicted offset, t* = (tx*, ty*, tw*, th*, tθ*) is the offset of the ground-truth box relative to the anchor box, t = (tx, ty, tw, th, tθ) is the offset of the predicted box relative to the anchor box, and ti* - ti is the difference between the true and predicted offsets; the smooth L1 loss for any x is defined as:
smoothL1(x) = 0.5x², if |x| < 1; |x| - 0.5, otherwise.
8. An image semantic feature constrained remote sensing target detection system, comprising a computer device, the computer device comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor of the computer device is programmed or configured to perform the steps of the image semantic feature constrained remote sensing target detection method according to any one of claims 1 to 7.
9. An image semantic feature constrained remote sensing target detection system, comprising a computer device, the computer device comprising a microprocessor and a memory connected to each other, characterized in that the memory of the computer device stores a computer program programmed or configured to perform the image semantic feature constrained remote sensing target detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the steps of the image semantic feature constrained remote sensing target detection method according to any one of claims 1 to 7.
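
Purely as an illustrative aid and not as part of the claimed method, the following minimal Python sketch shows one possible way to apply the positive/negative sample assignment rule described in claim 6. The 0.7 and 0.3 intersection-over-union thresholds and the π/12 angle threshold come from the claim; the shapely-based skew_iou helper, the function names, and the (cx, cy, w, h, theta) box layout are assumptions of this sketch, and the "highest intersection-over-union" positive case, which requires a comparison across all candidates, is simplified here.

    import numpy as np
    from shapely.geometry import Polygon

    def box_to_polygon(box):
        # (cx, cy, w, h, theta) -> oriented-rectangle polygon.
        cx, cy, w, h, theta = box
        c, s = np.cos(theta), np.sin(theta)
        corners = [(cx + dx * c - dy * s, cy + dx * s + dy * c)
                   for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2),
                                  (w / 2, h / 2), (-w / 2, h / 2)]]
        return Polygon(corners)

    def skew_iou(box_a, box_b):
        # Skew (rotated-box) intersection-over-union via polygon intersection.
        pa, pb = box_to_polygon(box_a), box_to_polygon(box_b)
        inter = pa.intersection(pb).area
        union = pa.area + pb.area - inter
        return inter / union if union > 0 else 0.0

    def assign_label(candidate, gt_boxes):
        # Returns 1 (positive), 0 (negative) or -1 (neither: excluded from training).
        ious = np.array([skew_iou(candidate, gt) for gt in gt_boxes])
        best = int(np.argmax(ious))
        best_iou = float(ious[best])
        angle_diff = abs(candidate[4] - gt_boxes[best][4])
        # Positive: IoU with the real frame above 0.7 AND angle difference below pi/12.
        if best_iou > 0.7 and angle_diff < np.pi / 12:
            return 1
        # Negative: IoU below 0.3, or IoU above 0.7 but angle difference above pi/12.
        if best_iou < 0.3 or (best_iou > 0.7 and angle_diff > np.pi / 12):
            return 0
        return -1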
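Likewise, a minimal sketch of the regression-variable encoding and the Smooth L1 loss given in claim 7, assuming the same (cx, cy, w, h, theta) box layout; the function names are hypothetical and only illustrate the formulas.

    import numpy as np

    def encode_offsets(box, anchor):
        # Regression variables (t_x, t_y, t_w, t_h, t_theta) of a box relative to an anchor.
        x, y, w, h, t = box
        xa, ya, wa, ha, ta = anchor
        return np.array([(x - xa) / wa,
                         (y - ya) / ha,
                         np.log(w / wa),
                         np.log(h / ha),
                         t - ta])

    def smooth_l1(x):
        # Element-wise Smooth L1: 0.5*x^2 if |x| < 1, |x| - 0.5 otherwise.
        x = np.abs(x)
        return np.where(x < 1.0, 0.5 * x * x, x - 0.5)

    def regression_loss(pred_box, gt_box, anchor):
        # Total Smooth L1 loss between the real-frame and predicted-frame offsets.
        t_pred = encode_offsets(pred_box, anchor)
        t_true = encode_offsets(gt_box, anchor)
        return float(np.sum(smooth_l1(t_true - t_pred)))

    # Example call with made-up boxes:
    # regression_loss((10, 10, 4, 2, 0.10), (11, 10, 4, 2, 0.15), (10, 10, 4, 2, 0.0))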
CN202011018965.5A 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint Active CN112101277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018965.5A CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011018965.5A CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Publications (2)

Publication Number Publication Date
CN112101277A true CN112101277A (en) 2020-12-18
CN112101277B CN112101277B (en) 2023-07-28

Family

ID=73755387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011018965.5A Active CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Country Status (1)

Country Link
CN (1) CN112101277B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
戴媛; 易本顺; 肖进胜; 雷俊锋; 童乐; 程志钦: "Remote sensing image target detection based on an improved rotation region proposal network", Acta Optica Sinica, no. 01 *
王昌安; 田金文; 张强; 张英辉: "Deep learning method for inshore ship recognition in remote sensing images", Remote Sensing Information, no. 02 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700444B (en) * 2021-02-19 2023-06-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN112700444A (en) * 2021-02-19 2021-04-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN112861744A (en) * 2021-02-20 2021-05-28 哈尔滨工程大学 Remote sensing image target rapid detection method based on rotation anchor point clustering
CN112861744B (en) * 2021-02-20 2022-06-17 哈尔滨工程大学 Remote sensing image target rapid detection method based on rotation anchor point clustering
CN113111740A (en) * 2021-03-27 2021-07-13 西北工业大学 Characteristic weaving method for remote sensing image target detection
CN113095188A (en) * 2021-04-01 2021-07-09 山东捷讯通信技术有限公司 Deep learning-based Raman spectrum data analysis method and device
CN113468968A (en) * 2021-06-02 2021-10-01 中国地质大学(武汉) Remote sensing image rotating target detection method based on non-anchor frame
CN113505806A (en) * 2021-06-02 2021-10-15 北京化工大学 Robot grabbing detection method
CN113505806B (en) * 2021-06-02 2023-12-15 北京化工大学 Robot grabbing detection method
CN113468993A (en) * 2021-06-21 2021-10-01 天津大学 Remote sensing image target detection method based on deep learning
CN113420819A (en) * 2021-06-25 2021-09-21 西北工业大学 Lightweight underwater target detection method based on CenterNet
CN113792357A (en) * 2021-09-09 2021-12-14 重庆大学 Tree growth model construction method and computer storage medium
CN113792357B (en) * 2021-09-09 2023-09-05 重庆大学 Tree growth model construction method and computer storage medium
CN114240946A (en) * 2022-02-28 2022-03-25 南京智莲森信息技术有限公司 Locator abnormality detection method, system, storage medium and computing device
CN117094343A (en) * 2023-10-19 2023-11-21 成都新西旺自动化科技有限公司 QR code decoding system and method
CN117094343B (en) * 2023-10-19 2023-12-29 成都新西旺自动化科技有限公司 QR code decoding system and method

Also Published As

Publication number Publication date
CN112101277B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN112101277B (en) Remote sensing target detection method based on image semantic feature constraint
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN111612008B (en) Image segmentation method based on convolution network
CN111626176B (en) Remote sensing target rapid detection method and system based on dynamic attention mechanism
CN111738995B (en) RGBD image-based target detection method and device and computer equipment
CN110516514B (en) Modeling method and device of target detection model
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN115457395A (en) Lightweight remote sensing target detection method based on channel attention and multi-scale feature fusion
CN111274981B (en) Target detection network construction method and device and target detection method
WO2023116632A1 (en) Video instance segmentation method and apparatus based on spatio-temporal memory information
CN114627052A (en) Infrared image air leakage and liquid leakage detection method and system based on deep learning
CN113313094B (en) Vehicle-mounted image target detection method and system based on convolutional neural network
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN115457415A (en) Target detection method and device based on YOLO-X model, electronic equipment and storage medium
CN112991364A (en) Road scene semantic segmentation method based on convolution neural network cross-modal fusion
Sofla et al. Road extraction from satellite and aerial image using SE-Unet
CN115439718A (en) Industrial detection method, system and storage medium combining supervised learning and feature matching technology
Li et al. SPCS: a spatial pyramid convolutional shuffle module for YOLO to detect occluded object
Wan et al. Small object detection leveraging density‐aware scale adaptation
CN111339934A (en) Human head detection method integrating image preprocessing and deep learning target detection
CN117593548A (en) Visual SLAM method for removing dynamic feature points based on weighted attention mechanism
CN117789160A (en) Multi-mode fusion target detection method and system based on cluster optimization
CN115810020B (en) Semantic guidance-based coarse-to-fine remote sensing image segmentation method and system
CN117011819A (en) Lane line detection method, device and equipment based on feature guidance attention
CN113569600A (en) Method and device for identifying weight of object, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant