CN112101277B - Remote sensing target detection method based on image semantic feature constraint - Google Patents

Remote sensing target detection method based on image semantic feature constraint

Info

Publication number
CN112101277B
CN112101277B (application number CN202011018965.5A)
Authority
CN
China
Prior art keywords
feature
frame
remote sensing
image
network
Prior art date
Legal status
Active
Application number
CN202011018965.5A
Other languages
Chinese (zh)
Other versions
CN112101277A (en)
Inventor
孙斌
马付严
李树涛
孙俊
Current Assignee
Hunan University
Fujitsu Ltd
Original Assignee
Hunan University
Fujitsu Ltd
Priority date
Filing date
Publication date
Application filed by Hunan University and Fujitsu Ltd
Priority to CN202011018965.5A
Publication of CN112101277A
Application granted
Publication of CN112101277B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing target detection method based on image semantic feature constraints, which comprises the following steps: features of an input image are extracted with a deep residual network ResNet50 and a feature pyramid network and fused into a multi-scale feature map, which is input to a center estimation module; the output and input of the center estimation module are combined to obtain an image semantic feature map with negative samples filtered out. The extracted image semantic features constrain the generation of anchor points in arbitrary directions: a rotation candidate region generation network extracts candidate regions from the image semantic feature map after the negative samples are filtered, and a rotated region-of-interest pooling layer extracts a feature vector of uniform size for each candidate region. Two fully connected layer branches then complete the classification and regression tasks respectively, yielding the detection result and detected position of each candidate region in the input remote sensing image. The invention greatly reduces the computational cost and improves the detection speed and accuracy.

Description

Remote sensing target detection method based on image semantic feature constraint
Technical Field
The invention relates to an image target detection method, and in particular to a remote sensing target detection method based on image semantic feature constraints.
Background
The need for intelligent transportation and earth observation has drawn great attention to vehicle detection in remote sensing images. The task is to identify the class of each vehicle and accurately locate it in the remote sensing image. Despite the many efforts devoted to this task, vehicle detection remains very challenging because vehicles in remote sensing images vary widely in size and appearance. Detecting vehicles with arbitrary orientations is particularly difficult, since directly applying horizontal object detection methods often produces regions of interest (RoIs) that do not match the vehicle regions, thereby greatly expanding the search space.
Shaoqing Ren et al., in "Faster R-CNN: Towards real-time object detection with region proposal networks" (Advances in Neural Information Processing Systems, 2015, pp. 91-99), showed that presetting anchors (initially estimated object borders) of different sizes and aspect ratios and regressing the object positions in the image from these anchors is effective on open benchmarks. Most arbitrary-direction target detection methods adopt the same strategy. Taking the rotation candidate region generation network described by Jianqi Ma et al. in "Arbitrary-oriented scene text detection via rotation proposals" (IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111-3122, 2018) as an example, rotation candidate regions (a set of candidate frames) are generated from anchors with angles, and the target positions are regressed based on these rotation candidate regions. Anchor-based detection algorithms perform well, but they usually start from a large number of densely distributed anchors; in the model training stage, the intersection-over-union of the real frame and each predicted frame must be computed so that predicted-frame negative samples whose intersection ratio falls below a threshold can be removed, which incurs a large computational cost. The anchor-free detection methods described by Hei Law and Jia Deng in "CornerNet: Detecting objects as paired keypoints" (Proceedings of the European Conference on Computer Vision, 2018, pp. 734-750) and by Kaiwen Duan et al. in "CenterNet: Keypoint triplets for object detection" (Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6569-6578) predict bounding boxes from key points rather than from anchors of predetermined size and aspect ratio. However, since only key points are used to predict bounding boxes, the recall of anchor-free detection methods is lower than that of anchor-based methods.
Disclosure of Invention
The invention aims to solve the following technical problem: aiming at the problems in the prior art, the invention provides a remote sensing target detection method based on image semantic feature constraints, which uses semantic feature information in the image to constrain the generation of anchor points.
In order to solve the technical problems, the invention adopts the following technical scheme:
the remote sensing target detection method based on image semantic feature constraint is characterized by comprising the following steps of:
step 1): carrying out feature extraction on an input image by adopting a depth residual error network ResNet50 and a feature pyramid network, and fusing to obtain a multi-scale feature map fused with multi-scale information;
step 2): filtering the negative samples of the multi-scale feature images obtained through fusion through a center estimation module, and combining the center feature images output by the center estimation module and the feature images input by the center estimation module to filter the negative samples, so as to obtain image semantic feature images with the negative samples filtered;
step 3): restricting the generation of anchor points in any direction by using the extracted image semantic features, generating anchors in the image semantic feature map after negative samples are filtered, generating a network by rotating candidate regions based on the generated anchors to obtain candidate regions, and extracting feature vectors with uniform sizes for each candidate region by rotating a region of interest aggregation layer;
step 4): and aiming at the feature vectors extracted from each candidate region and with uniform size, respectively utilizing two full-connection layer branches to complete classification and regression tasks, and obtaining the detection result and detection position of each candidate region in the input remote sensing image.
Optionally, the detailed steps of step 1) include: Downsampling: the input remote sensing image is downsampled by the deep residual network ResNet50; taking the layers of ResNet50 over which the feature-map size stays constant as one stage, feature maps C2, C3, C4 and C5 of 4 stages and 4 scales are obtained. Upsampling: the feature maps C2, C3, C4 and C5 of 4 scales form a feature pyramid network; the feature map C5 is upsampled 2 times by bilinear interpolation with its feature dimension fixed to 256 by a 1×1 convolution layer, the feature dimension of the feature map C4 is fixed to 256 by a 1×1 convolution layer, and the two same-sized feature maps are added element-wise to obtain the fused feature map F4; the feature map F4 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C3 is fixed to 256, and the two are added element-wise to obtain the feature map F3; the feature map F3 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C2 is fixed to 256, and the two are added element-wise to obtain the feature map F2 fusing high-order and low-order features, which is output as the feature map fusing multi-scale information.
Optionally, the center estimation module in step 2) consists of a 1×1 convolution layer and an element-wise sigmoid activation layer; it converts the input feature map fusing multi-scale information into a center feature map of the same size that reflects the probability that a positive sample exists at each location, and multiplies the input feature map and the center feature map element-wise, so that in the resulting final feature map the element values of negative-sample regions are close to 0 while those of positive-sample regions are almost unchanged.
Optionally, step 2) is preceded by a step of training the center estimation module, and the branches of the center estimation module are supervised using a Focal Loss function Focal Loss during the training of the center estimation module, wherein a functional expression of the Focal Loss function Focal Loss is as follows:
fl = −(1 − p)^α · log(p)
in the above formula, fl is a function value of a Focal Loss function Focal Loss, p represents a probability that a sample is a positive sample, and α is a coefficient, where the positive sample refers to a sample in which an intersection ratio of a preset anchor point and a real frame in a remote sensing image is higher than a threshold, and the negative sample refers to a sample in which an intersection ratio of the preset anchor point and the real frame in the remote sensing image is lower than the threshold.
Optionally, the rotation candidate region generation network in step 3) includes one 3×3 convolution layer and two 1×1 convolution layers; the feature map is passed through the 3×3 convolution layer to obtain a feature map whose height H and width W are consistent with the input feature map, and this feature map is then passed through the two 1×1 convolution layers respectively to obtain two sets of feature maps containing category information and position information respectively.
Optionally, step 3) is preceded by a step of training the rotation candidate region generation network to generate candidate regions, where the judgment principle is that, when the intersection ratio of a candidate region and the real frame is computed, it must be judged whether the candidate region is a positive sample. A candidate region is a positive sample if it satisfies both of the following: 1) it has the highest intersection ratio with a real frame, or its intersection ratio with a real frame is not less than 0.7; 2) the included angle between it and the real frame is smaller than π/12. A candidate region is a negative sample if it satisfies either of the following: 1) its intersection ratio with every real frame is smaller than 0.3; 2) its intersection ratio with a real frame is larger than 0.7 but the included angle between them is larger than π/12. The tilted intersection ratio is then computed for the candidate regions of all positive and negative samples, and candidate regions that satisfy neither the positive-sample nor the negative-sample condition do not participate in training. The feature vectors of uniform size output by the rotated region-of-interest layer are then input into a fully convolutional network, the rotation candidate region generation network is supervised with the focal loss function, and this process is repeated until the training of the rotation candidate region generation network is finished.
Optionally, in step 4), when the classification and regression tasks are completed by the two fully connected layer branches, the classification task uses the Loss function Softmax Loss to supervise the network during training, and the frame regression task uses the Loss function Smooth L1 Loss. The regression variables are computed as:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a

t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a), t*_θ = θ* − θ_a

In the above formulas, (x, y, w, h, θ) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the predicted target frame; (x_a, y_a, w_a, h_a, θ_a) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the anchor frame; and (x*, y*, w*, h*, θ*) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the real target frame. t = (t_x, t_y, t_w, t_h, t_θ) is the offset of the predicted frame relative to the anchor frame, and t* = (t*_x, t*_y, t*_w, t*_h, t*_θ) is the offset of the real frame relative to the anchor frame. The frame regression task uses the Loss function Smooth L1 Loss to compute the loss between the two offsets:

L_reg(t*, t) = Σ_{i ∈ {x, y, w, h, θ}} smooth_L1(t*_i − t_i)

where L_reg(t*, t) is the total regression loss between the true and predicted offsets, t*_i − t_i is the difference between the true and predicted offsets for each component, and the smooth L1 loss for any x is:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.
in addition, the invention also provides a remote sensing target detection system of the image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and the microprocessor of the computer device is programmed or configured to execute the steps of the remote sensing target detection method of the image semantic feature constraint.
In addition, the invention also provides a remote sensing target detection system of the image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and a computer program which is programmed or configured to execute the remote sensing target detection method of the image semantic feature constraint is stored in the memory of the computer device.
Furthermore, the invention provides a computer readable storage medium having stored therein a computer program programmed or configured to perform the steps of the aforementioned remote sensing target detection method based on image semantic feature constraint.
Compared with the prior art, the invention has the following advantages: 1) The remote sensing target detection method based on image semantic feature constraints extracts features from the input image with a deep residual network ResNet50 and a feature pyramid network and fuses them into a multi-scale feature map, enabling accurate extraction of image features. 2) The invention filters negative samples through a center estimation module, combining the center feature map output by the module with its input feature map to obtain an image semantic feature map with negative samples filtered out. Semantic information filters out anchors with low probability of covering a vehicle region, and only the remaining anchors participate in generating rotation candidate regions, which adds an image semantic feature constraint to the detection method; subsequent computation operates only on the small number of generated candidate regions, so the performance advantage of anchor-based detection is retained while the detection speed is improved. 3) The invention extracts candidate regions from the filtered image semantic feature map through the rotation candidate region generation network and extracts a feature vector of uniform size for each candidate region through the rotated region-of-interest pooling layer; two fully connected layer branches then complete the classification and regression tasks respectively, yielding the detection result and detected position of each candidate region in the input remote sensing image, so that remote sensing target detection under image semantic feature constraints is realized and the detection result and position of each candidate region are obtained simultaneously. In summary, the invention fully exploits the semantic information in the image, greatly reduces the computational cost, and improves both detection speed and accuracy.
Drawings
FIG. 1 is a schematic diagram of a basic flow of a method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a remote sensing target detection network with image semantic feature constraint adopted in an embodiment of the present invention.
FIG. 3 is a coordinate representation of a prior art directional frame.
FIG. 4 is a representation of the directional border of an anchor in an embodiment of the present invention.
Fig. 5 is a schematic representation of the rotation anchors employed in an embodiment of the present invention, wherein (a) shows anchors of different angles and (b) shows an example of the anchors used in this embodiment.
Fig. 6 is a schematic diagram of a process for calculating candidate positive samples according to an embodiment of the present invention.
FIG. 7 is a graph of vehicle test results visualized on a DOTA test dataset in an embodiment of the invention.
Detailed Description
As shown in fig. 1 and fig. 2, the remote sensing target detection method with image semantic feature constraint according to the embodiment of the invention includes the following steps:
step 1): carrying out feature extraction on an input image by adopting a depth residual error network ResNet50 and a feature pyramid network, and fusing to obtain a multi-scale feature map fused with multi-scale information;
step 2): filtering the negative samples of the multi-scale feature images obtained through fusion through a center estimation module, and combining the center feature images output by the center estimation module and the feature images input by the center estimation module to filter the negative samples, so as to obtain image semantic feature images with the negative samples filtered;
step 3): restricting the generation of anchor points in any direction by using the extracted image semantic features, generating anchors in the image semantic feature map after negative samples are filtered, generating a network by rotating candidate regions based on the generated anchors to obtain candidate regions, and extracting feature vectors with uniform sizes for each candidate region by rotating a region of interest aggregation layer;
step 4): and aiming at the feature vectors extracted from each candidate region and with uniform size, respectively utilizing two full-connection layer branches to complete classification and regression tasks, and obtaining the detection result and detection position of each candidate region in the input remote sensing image.
In order to construct the remote sensing target detection network constrained by image semantic features shown in fig. 2, this embodiment uses the public remote sensing target detection dataset DOTA (the largest benchmark dataset with oriented-frame annotations in the remote sensing target detection field) and extracts the real labels corresponding to the class and position of the vehicle targets in the images. Images containing vehicles in the DOTA training set form the training set, and the DOTA test set is used as the test set. The original images in the constructed training set are cut into 1024×1024 sub-images with stride 512; for data amplification, the original images are also rescaled to 0.5 and 1.5 times their size according to the different target scales, and the same cutting operation is applied. The training set constructed in this way contains 106965 images in total, and the test set constructed in the same way contains 74058 images in total. The remote sensing image vehicle detection training and test sets are constructed so as to train and test the remote sensing target detection network constrained by image semantic features shown in fig. 2; a minimal sketch of this preprocessing is given below.
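A minimal sketch of the cropping and multi-scale amplification described above, assuming OpenCV for image handling; the function names and the zero-padding of border crops are illustrative assumptions, not part of the patented method:

```python
import cv2  # assumed dependency; any library with resize/pad works

def crop_with_stride(image, crop=1024, stride=512):
    """Cut an image into crop x crop sub-images with the given stride,
    zero-padding border crops so every sub-image has the full size."""
    h, w = image.shape[:2]
    subs = []
    for y in range(0, max(h - crop, 0) + stride, stride):
        for x in range(0, max(w - crop, 0) + stride, stride):
            sub = image[y:y + crop, x:x + crop]
            if sub.shape[0] < crop or sub.shape[1] < crop:
                sub = cv2.copyMakeBorder(sub, 0, crop - sub.shape[0],
                                         0, crop - sub.shape[1],
                                         cv2.BORDER_CONSTANT, value=0)
            subs.append(((x, y), sub))
    return subs

def multi_scale_crops(image, scales=(0.5, 1.0, 1.5)):
    """Rescale the original image and apply the same cutting at each scale."""
    out = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        out.extend((s, pos, sub) for pos, sub in crop_with_stride(resized))
    return out
```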
The deep residual network ResNet50 and the feature pyramid network FPN are divided into a downsampling process and an upsampling process, finally yielding F2, which fuses high-order and low-order features; the feature map F2 can represent target information of different scales in the remote sensing image. In this embodiment, the detailed steps of step 1) include: Downsampling: the input remote sensing image is downsampled by the deep residual network ResNet50; taking the layers of ResNet50 over which the feature-map size stays constant as one stage, feature maps C2, C3, C4 and C5 of 4 stages and 4 scales are obtained. Upsampling: the feature maps C2, C3, C4 and C5 of 4 scales form the feature pyramid network (FPN); the feature map C5 is upsampled 2 times by bilinear interpolation with its feature dimension fixed to 256 by a 1×1 convolution layer, the feature dimension of the feature map C4 is fixed to 256 by a 1×1 convolution layer, and the two same-sized feature maps are added element-wise to obtain the fused feature map F4; the feature map F4 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C3 is fixed to 256, and the two are added element-wise to obtain the feature map F3; the feature map F3 is upsampled 2 times with feature dimension fixed to 256, the feature dimension of the feature map C2 is fixed to 256, and the two are added element-wise to obtain the feature map F2 fusing high-order and low-order features, which is output as the feature map fusing multi-scale information.
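The top-down fusion of step 1) can be sketched as follows in PyTorch; the channel widths of C2-C5 follow the standard torchvision ResNet50 and are an assumption for illustration:

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Fuse C2..C5 into F2: 1x1 convolutions fix the feature dimension to
    256, bilinear 2x upsampling aligns sizes, and element-wise addition
    merges the feature maps of two stages."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), dim=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, dim, kernel_size=1) for c in in_channels])

    def up2(self, x):
        return F.interpolate(x, scale_factor=2, mode='bilinear',
                             align_corners=False)

    def forward(self, c2, c3, c4, c5):
        f5 = self.lateral[3](c5)
        f4 = self.lateral[2](c4) + self.up2(f5)   # fused feature map F4
        f3 = self.lateral[1](c3) + self.up2(f4)   # feature map F3
        f2 = self.lateral[0](c2) + self.up2(f3)   # F2 fuses all scales
        return f2
```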
In step 2) of this embodiment, a center estimation module is used to filter the negative samples in the input feature map, increasing the speed of vehicle detection. Because vehicles in remote sensing images are unevenly distributed and vary in scale, the intersection ratio of most preset anchors with the real frames falls below the threshold; anchors whose intersection ratio exceeds the threshold (positive samples) are far fewer, so the ratio of positive to negative samples is extremely unbalanced. To avoid spending a large amount of computation on negative samples during target detection, this embodiment uses a 1×1 convolution layer and an element-wise sigmoid activation layer to convert the feature map extracted by the deep residual network ResNet50 and the feature pyramid network FPN into a center feature map of the same size that reflects the probability that a positive sample exists at each location, and combines the extracted feature map with the center feature map element-wise to obtain an image semantic feature map with negative samples filtered out. Referring to fig. 2, the center estimation module in step 2) consists of a 1×1 convolution layer and an element-wise sigmoid activation layer; it converts the input feature map fusing multi-scale information into a center feature map of the same size reflecting the positive-sample probability and multiplies the input feature map and the center feature map element-wise, so that in the resulting final feature map the element values of negative-sample regions are close to 0 while those of positive-sample regions are almost unchanged; a sketch of this module follows.
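A minimal sketch of the center estimation module as described (one 1×1 convolution and an element-wise sigmoid whose output multiplies the input feature map); class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class CenterEstimation(nn.Module):
    """Produce a center map of the same spatial size whose values reflect
    the probability that a positive sample exists at each location, then
    multiply it element-wise into the input so that negative-sample
    regions are pushed toward 0 and positive regions stay almost unchanged."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, feat):
        center = torch.sigmoid(self.conv(feat))  # (B, 1, H, W) in (0, 1)
        return feat * center, center             # filtered map, center map
```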
In this embodiment, step 2) further includes a step of training the center estimation module, and the branches of the center estimation module are supervised by using a Focal Loss function Focal Loss in the process of training the center estimation module, where a functional expression of the Focal Loss function Focal Loss is as follows:
fl = −(1 − p)^α · log(p)
In the above formula, fl is the value of the Focal Loss function Focal Loss, p represents the probability that a sample is a positive sample, and α is a coefficient (α is usually taken as 2). A positive sample is a sample whose preset anchor has an intersection ratio with the real frame in the remote sensing image above the threshold, and a negative sample is one whose intersection ratio is below the threshold. For example, when a sample is classified as positive with probability p = 0.9 and α = 2, the factor (1 − p)^α = 0.01, so the sample contributes 100 times more to the ordinary cross-entropy loss than to the Focal Loss; the Focal Loss therefore effectively suppresses the contribution of easy-to-classify samples to the model.
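A minimal sketch of the Focal Loss fl = −(1 − p)^α·log(p); routing negative samples through the probability of the correct class (1 − p) follows the common Focal Loss formulation and is an assumption here:

```python
import torch

def focal_loss(p, is_positive, alpha=2.0, eps=1e-6):
    """fl = -(1 - p_t)^alpha * log(p_t), where p_t is the predicted
    probability of the true class; at p_t = 0.9 and alpha = 2 the factor
    (1 - p_t)^2 = 0.01, i.e. 1/100 of the cross-entropy contribution."""
    p_t = torch.where(is_positive, p, 1.0 - p).clamp(min=eps)
    return -((1.0 - p_t) ** alpha) * torch.log(p_t)
```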
Fig. 3 shows the labeling of a directional frame in the prior coordinate method. When anchor points in arbitrary directions are generated in the image semantic feature map after negative samples are filtered in step 3), this embodiment represents the oriented frame of an anchor differently, as shown in fig. 4: a tuple (x, y, w, h, θ) of 5 elements, where θ takes values in [−π/2, π/2), and a frame whose angle falls outside this range is moved back into it in the opposite direction.
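A small sketch of this angle normalization; shifting θ by π leaves the drawn rectangle unchanged, so the 5-element tuple stays equivalent:

```python
import math

def normalize_box(x, y, w, h, theta):
    """Move theta back into [-pi/2, pi/2) in the opposite direction;
    (w, h, theta) and (w, h, theta - pi) describe the same rectangle."""
    while theta >= math.pi / 2:
        theta -= math.pi
    while theta < -math.pi / 2:
        theta += math.pi
    return (x, y, w, h, theta)
```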
The rotation candidate regions are obtained by the rotation candidate region generation network based on the anchors in arbitrary directions. In this embodiment, the rotation candidate region generation network in step 3) includes one 3×3 convolution layer and two 1×1 convolution layers: the feature map is first passed through the 3×3 convolution layer to obtain a feature map whose height H and width W are consistent with the input feature map, and this feature map is then passed through the two 1×1 convolution layers respectively to obtain two sets of feature maps containing category information and position information respectively. The anchors used in this embodiment are shown in fig. 5: the anchors with angle information are composed of 2 aspect ratios (1:2 and 1:4) and 6 angles (−π/2, −π/3, −π/6, 0, π/6, π/3; see sub-graph (a) of fig. 5), so each element in the feature map of the rotation candidate region generation network corresponds to 2 × 6 = 12 anchors; a sketch of the anchor generation and the network head follows.
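A minimal sketch of the 2 × 6 = 12 rotated anchors and of the head of the rotation candidate region generation network; the base anchor size and the two-class score layout are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

ANGLES = (-math.pi/2, -math.pi/3, -math.pi/6, 0.0, math.pi/6, math.pi/3)

def anchors_per_location(base=32.0):
    """12 rotated anchors (w, h, theta) for one feature-map element:
    2 aspect ratios (1:2 and 1:4) x 6 angles; `base` is assumed."""
    anchors = []
    for rw, rh in ((1, 2), (1, 4)):
        s = base / math.sqrt(rw * rh)  # keep anchor area near base^2
        anchors.extend((rw * s, rh * s, a) for a in ANGLES)
    return anchors

class RotationRPNHead(nn.Module):
    """3x3 conv keeps H and W; two 1x1 convs then emit, per anchor, the
    category scores and the 5 position offsets (tx, ty, tw, th, t_theta)."""
    def __init__(self, dim=256, num_anchors=12):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(dim, num_anchors * 2, kernel_size=1)
        self.reg = nn.Conv2d(dim, num_anchors * 5, kernel_size=1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        return self.cls(x), self.reg(x)  # category map, position map
```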
As shown in fig. 6, step 3) further includes a step of training the rotation candidate region generation network to generate candidate regions, where the judgment principle is that, when the intersection ratio of a candidate region and the real frame is computed, it must be judged whether the candidate region is a positive sample. A candidate region is a positive sample if it satisfies both of the following: 1) it has the highest intersection ratio with a real frame, or its intersection ratio with a real frame is not less than 0.7; 2) the included angle between it and the real frame is smaller than π/12. A candidate region is a negative sample if it satisfies either of the following: 1) its intersection ratio with every real frame is smaller than 0.3; 2) its intersection ratio with a real frame is larger than 0.7 but the included angle between them is larger than π/12. The tilted intersection ratio is then computed for the candidate regions of all positive and negative samples, and candidate regions that satisfy neither the positive-sample nor the negative-sample condition do not participate in training. The feature vectors of uniform size output by the rotated region-of-interest layer are then input into a fully convolutional network, the rotation candidate region generation network is supervised with the focal loss function, and this process is repeated until the training of the rotation candidate region generation network is finished.
In this embodiment, the intersection ratio of a candidate region and the real frame is calculated as follows:
S1) input the candidate regions and real frames R_1, R_2, R_3, …;
S2) traverse and select any pair of a candidate region and a real frame (R_i, R_j) with i < j as the current rectangular frame pair; if the traversal is finished, end and exit; otherwise jump to step S3);
S3) set the point set PSet to the empty set;
S4) add the intersection points of rectangular frame R_i and rectangular frame R_j to the point set PSet;
S5) add the vertices of rectangular frame R_i that lie inside rectangular frame R_j to the point set PSet;
S6) add the vertices of rectangular frame R_j that lie inside rectangular frame R_i to the point set PSet;
S7) sort the point set PSet counterclockwise;
S8) compute the intersection region I from the point set PSet by triangulation;
S9) compute the intersection ratio IoU[i, j] of (R_i, R_j) as:
IoU[i, j] = Area(I) / (Area(R_i) + Area(R_j) − Area(I))
where Area(I) is the area where the candidate region and the real frame intersect, Area(R_i) is the area of the candidate region, and Area(R_j) is the area of the real frame;
S10) jump back to step S2).
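For illustration, the same tilted intersection ratio can be computed with a general polygon library; the use of shapely is an assumption here — the procedure above (intersection points, contained vertices, counterclockwise sorting, triangulation) computes the same quantity without it:

```python
import math
from shapely.geometry import Polygon  # assumed dependency

def corners(x, y, w, h, theta):
    """Four corners of an oriented frame (x, y, w, h, theta)."""
    c, s = math.cos(theta), math.sin(theta)
    pts = [(-w/2, -h/2), (w/2, -h/2), (w/2, h/2), (-w/2, h/2)]
    return [(x + c*px - s*py, y + s*px + c*py) for px, py in pts]

def rotated_iou(box_a, box_b):
    """IoU[i, j] = Area(I) / (Area(R_i) + Area(R_j) - Area(I))."""
    pa, pb = Polygon(corners(*box_a)), Polygon(corners(*box_b))
    inter = pa.intersection(pb).area
    return inter / (pa.area + pb.area - inter) if inter > 0 else 0.0
```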
Referring to fig. 2, a center mask segmentation module is used in this embodiment to improve the accuracy of the model. Considering the speed of vehicle detection, this embodiment uses only one 1×1 convolution layer and one element-wise sigmoid activation layer in the center estimation module, rather than the deep networks commonly used in semantic segmentation. To obtain a better estimation effect, this embodiment uses the center mask module to constrain the extraction of vehicle position information during the training phase. The feature vector of uniform size output by the rotated region-of-interest layer is input into a fully convolutional network, and the network is supervised with the focal loss function. The fully convolutional network makes the network attend to every pixel of the image and classify each pixel, and the focal loss computes the loss of each pixel so that the network focuses on difficult samples while the influence of simple samples is reduced, improving the accuracy of the model.
In this embodiment, when the classification and regression tasks in step 4) are completed by the two fully connected layer branches, the classification task uses the Loss function Softmax Loss to supervise the network during training, and the frame regression task uses the Loss function Smooth L1 Loss. The regression variables are computed as:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a

t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a), t*_θ = θ* − θ_a

In the above formulas, (x, y, w, h, θ) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the predicted target frame; (x_a, y_a, w_a, h_a, θ_a) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the anchor frame; and (x*, y*, w*, h*, θ*) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the real target frame. t = (t_x, t_y, t_w, t_h, t_θ) is the offset of the predicted frame relative to the anchor frame, and t* = (t*_x, t*_y, t*_w, t*_h, t*_θ) is the offset of the real frame relative to the anchor frame. The frame regression task uses the Loss function Smooth L1 Loss to compute the loss between the two offsets:

L_reg(t*, t) = Σ_{i ∈ {x, y, w, h, θ}} smooth_L1(t*_i − t_i)

where L_reg(t*, t) is the total regression loss between the true and predicted offsets, t*_i − t_i is the difference between the true and predicted offsets for each component, and the smooth L1 loss for any x is:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.
Fig. 7 shows the visualized vehicle detection results on the DOTA test dataset, wherein the box marked A is a detected truck and the remaining boxes are cars.
The initialization parameters of the deep residual network ResNet50 used in this embodiment are inherited from a model pre-trained on the ImageNet dataset; the initial learning rate of the training model is set to 0.01, and training runs for 12 epochs in total. After the 8th and the 11th epoch, the learning rate drops to 1/10 of that of the previous stage. The constructed remote sensing image vehicle detection training set is input into the constructed remote sensing target detection model constrained by image semantic features, and the model is trained with the SGD optimization algorithm; after 12 epochs, the trained remote sensing image vehicle detection model is obtained. The objective of the test phase is to obtain the position and class information of the vehicles in each image; the center mask segmentation module contained in the trained model is not required at test time, so the position and class of each vehicle are obtained only by regression and classification respectively. In the test phase, only predicted frames with scores higher than 0.3 are selected as vehicle frames, and non-maximum suppression with a threshold of 0.5 is applied to delete duplicates; a sketch of this schedule and filtering follows.
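A minimal sketch of this training schedule and test-time filtering; momentum, batching, and the rotated NMS routine are assumptions for illustration:

```python
import torch

def make_optimizer(model):
    """SGD with initial lr 0.01; lr drops to 1/10 of the previous stage
    after epochs 8 and 11 of the 12 total epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[8, 11],
                                                 gamma=0.1)
    return opt, sched

def filter_detections(boxes, scores, rotated_nms, score_thr=0.3, nms_thr=0.5):
    """Keep predicted frames with score > 0.3, then apply NMS with
    threshold 0.5 to delete duplicates; rotated_nms is any routine that
    understands oriented boxes (assumed provided)."""
    keep = scores > score_thr
    return rotated_nms(boxes[keep], scores[keep], nms_thr)
```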
Table 1 shows the quantitative evaluation results of the remote sensing target detection method with image semantic feature constraints of this embodiment and of other methods, where FR-O denotes the Faster R-CNN OBB detector, the official benchmark provided by DOTA. Mode 1 of the method of this embodiment denotes an example with only the image semantic feature constraint (without the center mask segmentation module); mode 2 denotes an example with both the image semantic feature constraint and the center mask segmentation module; mode 3 denotes an example that additionally uses anchors of multiple aspect ratios on the basis of mode 2. In testing, modes 1-3 of the method of this embodiment are superior to the other methods in mean average precision (mAP) and time cost. The method of this embodiment achieves a mAP of 76.9%, which is 40.2% higher than the official benchmark, and its vehicle detection performance is 6.2% higher than that of the rotation candidate region generation network alone. Mode 2, with the center mask segmentation module, improves the mAP by 3.4% over the method without this branch. The time costs of modes 1 to 3 are also reported in Table 1, with the best results in each case highlighted in bold. SV denotes the average precision of small vehicle detection, LV the average precision of large vehicle detection, and mAP the mean of the average precision over all classes.
Table 1: Quantitative evaluation results (average precision and run time) of the different methods.
In summary, directly applying a conventional horizontal anchor-based detection method to vehicle detection in arbitrary directions typically yields poor performance. Although rotation anchors have been used to address this problem, such designs incur significant computational cost because thousands of rotation anchors are generated in each level of the feature map. To solve this problem, this embodiment provides a remote sensing target detection method with image semantic feature constraints, which uses semantic information to filter out anchors with low probability of covering a vehicle region before the model computes and compares them; only anchors in arbitrary directions that survive this filtering participate in generating rotation candidate regions, and subsequent computation operates only on the small number of generated candidate regions, so the performance advantage of anchor-based detection is retained while the detection speed is improved. In general, this embodiment fully exploits the semantic information in the image, greatly reduces the computational cost, and improves the detection speed and accuracy.
In addition, the embodiment also provides a remote sensing target detection system of image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and the microprocessor of the computer device is programmed or configured to execute the steps of the remote sensing target detection method of the image semantic feature constraint.
In addition, the embodiment also provides a remote sensing target detection system of image semantic feature constraint, which comprises a computer device, wherein the computer device comprises a microprocessor and a memory which are connected with each other, and a computer program programmed or configured to execute the remote sensing target detection method of image semantic feature constraint is stored in the memory of the computer device.
In addition, the present embodiment also provides a computer readable storage medium having stored therein a computer program programmed or configured to perform the steps of the aforementioned remote sensing target detection method based on image semantic feature constraint.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is directed to methods, apparatus (systems), and computer program products in accordance with embodiments of the present application, and to apparatus for performing functions specified in a flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (7)

1. The remote sensing target detection method based on image semantic feature constraint is characterized by comprising the following steps of:
step 1): carrying out feature extraction on an input image by adopting a depth residual error network ResNet50 and a feature pyramid network, and fusing to obtain a multi-scale feature map fused with multi-scale information;
step 2): filtering the negative samples of the multi-scale feature images obtained through fusion through a center estimation module, and combining the center feature images output by the center estimation module and the feature images input by the center estimation module to filter the negative samples, so as to obtain image semantic feature images with the negative samples filtered;
step 3): restricting the generation of anchor points in any direction by using the extracted image semantic features, generating anchors in the image semantic feature map after negative samples are filtered, generating a network by rotating candidate regions based on the generated anchors to obtain candidate regions, and extracting feature vectors with uniform sizes for each candidate region by rotating a region of interest aggregation layer;
step 4): aiming at the feature vectors extracted from each candidate region and with uniform size, the classification and regression tasks are completed by utilizing two full-connection layer branches respectively, and the detection result and detection position of each candidate region in the input remote sensing image are obtained;
the detailed steps of step 1) include: downsampling: downsampling an input remote sensing image through a depth residual error network ResNet50, and taking a layer with a constant characteristic diagram size of the depth residual error network ResNet50 as a stage to obtain 4-stage characteristic diagrams C2, C3, C4 and C5 with 4 scales; upsampling: forming a feature pyramid network by using feature graphs C2, C3, C4 and C5 with 4 scales, up-sampling the feature graph C5 by 2 times by using bilinear interpolation, fixing the feature dimension to be 256 through a 1*1 convolution layer, fixing the feature dimension to be 256 by using a 1*1 convolution layer by using the feature graph C4, and finally obtaining a fused feature graph F4 by adding elements to the feature graphs with the same size in two stages; up-sampling the feature map F4 by 2 times, fixing the feature dimension 256, fixing the feature map C3 to the feature dimension 256, and adding the two to obtain a feature map F3 according to elements; up-sampling the feature map F3 by 2 times, fixing the feature dimension 256, fixing the feature map C2 by 256, adding the two to obtain a feature map F2 fusing high-order features and low-order features according to elements, and outputting the feature map F2 as a feature map fusing multi-scale information;
the center estimation module in the step 2) consists of a 1*1 convolution layer and a sigmoid activation layer operated according to elements, and is used for converting an input characteristic diagram fused with multi-scale information into a center characteristic diagram which is consistent in size and reflects the existence probability of a positive sample, multiplying the input characteristic diagram fused with multi-scale information and the center characteristic diagram according to elements, wherein the element value of a negative sample area in the obtained final characteristic diagram is close to 0, and the element value of a positive sample area is almost unchanged;
the rotation candidate region generating network in the step 3) comprises a 3*3 convolution layer and two 1*1 convolution layers, and is used for outputting the feature map through the 3*3 convolution layers to obtain feature maps consistent with H and W of the input feature map, and respectively passing the feature maps through the two 1*1 convolution layers to obtain two groups of feature maps respectively containing category information and position information.
2. The method for remote sensing target detection based on image semantic feature constraint according to claim 1, wherein the step 2) further comprises a step of training a center estimation module, and branches of the center estimation module are supervised by using a Focal Loss function Focal Loss in the process of training the center estimation module, wherein a functional expression of the Focal Loss function Focal Loss is as follows:
fl = −(1 − p)^α · log(p)
in the above formula, fl is a function value of a Focal Loss function Focal Loss, p represents a probability that a sample is a positive sample, and α is a coefficient, where the positive sample refers to a sample in which an intersection ratio of a preset anchor point and a real frame in a remote sensing image is higher than a threshold, and the negative sample refers to a sample in which an intersection ratio of the preset anchor point and the real frame in the remote sensing image is lower than the threshold.
3. The remote sensing target detection method based on image semantic feature constraint according to claim 1, characterized in that step 3) is preceded by a step of training the rotation candidate region generation network to generate candidate regions, wherein the judgment principle is that, when the intersection ratio of a candidate region and the real frame is computed, it must be judged whether the candidate region is a positive sample: a candidate region is a positive sample if it satisfies both of the following: 1) it has the highest intersection ratio with a real frame, or its intersection ratio with a real frame is not less than 0.7; and 2) the included angle between it and the real frame is smaller than π/12; a candidate region is a negative sample if it satisfies either of the following: 1) its intersection ratio with every real frame is smaller than 0.3; or 2) its intersection ratio with a real frame is larger than 0.7 but the included angle between them is larger than π/12; the tilted intersection ratio is then computed for the candidate regions of all positive and negative samples, and candidate regions satisfying neither condition do not participate in training; the feature vectors of uniform size output by the rotated region-of-interest layer are then input into a fully convolutional network, the rotation candidate region generation network is supervised with the focal loss function, and this process is repeated until the training of the rotation candidate region generation network is finished.
4. The remote sensing target detection method based on image semantic feature constraint according to claim 1, characterized in that, in step 4), when the classification and regression tasks are completed by the two fully connected layer branches respectively, the classification task uses the Loss function Softmax Loss to supervise the network during training, and the frame regression task uses the Loss function Smooth L1 Loss; the regression variables are computed as:

t_x = (x − x_a)/w_a, t_y = (y − y_a)/h_a, t_w = log(w/w_a), t_h = log(h/h_a), t_θ = θ − θ_a

t*_x = (x* − x_a)/w_a, t*_y = (y* − y_a)/h_a, t*_w = log(w*/w_a), t*_h = log(h*/h_a), t*_θ = θ* − θ_a

in the above formulas, (x, y, w, h, θ) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the predicted target frame; (x_a, y_a, w_a, h_a, θ_a) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the anchor frame; and (x*, y*, w*, h*, θ*) are the center-point abscissa, center-point ordinate, frame width, frame height and rotation angle of the real target frame; t = (t_x, t_y, t_w, t_h, t_θ) is the offset of the predicted frame relative to the anchor frame, and t* = (t*_x, t*_y, t*_w, t*_h, t*_θ) is the offset of the real frame relative to the anchor frame; the frame regression task uses the Loss function Smooth L1 Loss to compute the loss between the two offsets:

L_reg(t*, t) = Σ_{i ∈ {x, y, w, h, θ}} smooth_L1(t*_i − t_i)

where L_reg(t*, t) is the total regression loss between the true and predicted offsets, t*_i − t_i is the difference between the true and predicted offsets for each component, and the smooth L1 loss for any x is:

smooth_L1(x) = 0.5x², if |x| < 1; |x| − 0.5, otherwise.
5. a remote sensing object detection system of image semantic feature constraint, comprising a computer device comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor of the computer device is programmed or configured to perform the steps of the remote sensing object detection method of image semantic feature constraint of any one of claims 1 to 4.
6. A remote sensing object detection system of image semantic feature constraint, comprising a computer device comprising a microprocessor and a memory interconnected, characterized in that the memory of the computer device has stored therein a computer program programmed or configured to perform the remote sensing object detection method of image semantic feature constraint of any one of claims 1 to 4.
7. A computer readable storage medium having stored therein a computer program programmed or configured to perform the steps of the remote sensing target detection method based on image semantic feature constraint of any one of claims 1 to 4.
CN202011018965.5A 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint Active CN112101277B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011018965.5A CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011018965.5A CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Publications (2)

Publication Number Publication Date
CN112101277A CN112101277A (en) 2020-12-18
CN112101277B true CN112101277B (en) 2023-07-28

Family

ID=73755387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011018965.5A Active CN112101277B (en) 2020-09-24 2020-09-24 Remote sensing target detection method based on image semantic feature constraint

Country Status (1)

Country Link
CN (1) CN112101277B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700444B (en) * 2021-02-19 2023-06-23 中国铁道科学研究院集团有限公司铁道建筑研究所 Bridge bolt detection method based on self-attention and central point regression model
CN112861744B (en) * 2021-02-20 2022-06-17 哈尔滨工程大学 Remote sensing image target rapid detection method based on rotation anchor point clustering
CN113095188A (en) * 2021-04-01 2021-07-09 山东捷讯通信技术有限公司 Deep learning-based Raman spectrum data analysis method and device
CN113468968B (en) * 2021-06-02 2023-04-07 中国地质大学(武汉) Remote sensing image rotating target detection method based on non-anchor frame
CN113505806B (en) * 2021-06-02 2023-12-15 北京化工大学 Robot grabbing detection method
CN113468993B (en) * 2021-06-21 2022-08-26 天津大学 Remote sensing image target detection method based on deep learning
CN113420819B (en) * 2021-06-25 2022-12-06 西北工业大学 Lightweight underwater target detection method based on CenterNet
CN113792357B (en) * 2021-09-09 2023-09-05 重庆大学 Tree growth model construction method and computer storage medium
CN114240946B (en) * 2022-02-28 2022-12-02 南京智莲森信息技术有限公司 Locator abnormality detection method, system, storage medium and computing device
CN117094343B (en) * 2023-10-19 2023-12-29 成都新西旺自动化科技有限公司 QR code decoding system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110909642A (en) * 2019-11-13 2020-03-24 南京理工大学 Remote sensing image target detection method based on multi-scale semantic feature fusion
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Remote sensing image target detection based on an improved rotation region generation network; 戴媛; 易本顺; 肖进胜; 雷俊锋; 童乐; 程志钦; Acta Optica Sinica (Issue 01); full text *
Deep learning method for near-shore ship recognition in remote sensing images; 王昌安; 田金文; 张强; 张英辉; Remote Sensing Information (Issue 02); full text *

Also Published As

Publication number Publication date
CN112101277A (en) 2020-12-18

Similar Documents

Publication Publication Date Title
CN112101277B (en) Remote sensing target detection method based on image semantic feature constraint
US11144889B2 (en) Automatic assessment of damage and repair costs in vehicles
CN111612008B (en) Image segmentation method based on convolution network
CN111738110A (en) Remote sensing image vehicle target detection method based on multi-scale attention mechanism
CN110084817B (en) Digital elevation model production method based on deep learning
CN113468967B (en) Attention mechanism-based lane line detection method, attention mechanism-based lane line detection device, attention mechanism-based lane line detection equipment and attention mechanism-based lane line detection medium
CN111738995B (en) RGBD image-based target detection method and device and computer equipment
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
Biasutti et al. Lu-net: An efficient network for 3d lidar point cloud semantic segmentation based on end-to-end-learned 3d features and u-net
Sumer et al. An adaptive fuzzy-genetic algorithm approach for building detection using high-resolution satellite images
CN110516514B (en) Modeling method and device of target detection model
CN113657560A (en) Weak supervision image semantic segmentation method and system based on node classification
CN111696110A (en) Scene segmentation method and system
CN113191204A (en) Multi-scale blocking pedestrian detection method and system
Saovana et al. Automated point cloud classification using an image-based instance segmentation for structure from motion
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN111860411A (en) Road scene semantic segmentation method based on attention residual error learning
Gomez-Donoso et al. Three-dimensional reconstruction using SFM for actual pedestrian classification
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid
CN113673478B (en) Port large-scale equipment detection and identification method based on deep learning panoramic stitching
CN113538523B (en) Parking space detection tracking method, electronic equipment and vehicle
Li et al. Detection of Imaged Objects with Estimated Scales.
Hehn et al. Instance stixels: Segmenting and grouping stixels into objects
Kim et al. LiDAR Based 3D object detection using CCD information
CN111339934A (en) Human head detection method integrating image preprocessing and deep learning target detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant