CN115115936A - Remote sensing image target detection method in any direction based on deep learning - Google Patents

Remote sensing image target detection method in any direction based on deep learning Download PDF

Info

Publication number
CN115115936A
CN115115936A (application number CN202210759294.0A)
Authority
CN
China
Prior art keywords
target
deep learning
remote sensing
feature
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210759294.0A
Other languages
Chinese (zh)
Inventor
郭亮
祁文平
李亚超
熊涛
荆丹
许晴
吕艳
邢孟道
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210759294.0A
Publication of CN115115936A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/766 - Arrangements using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion of extracted features
    • G06V10/82 - Arrangements using pattern recognition or machine learning using neural networks
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting targets in any direction in remote sensing images based on deep learning, which comprises the following steps: acquiring a remote sensing image data set and dividing it into a training set and a test set; performing data enhancement on the training set and labeling it; constructing a deep learning network model with a SAN structure; designing an OVAL_IOU loss function based on the constructed network model and training the network model with the training set; testing the test set with the trained network model to obtain detection results; and post-processing the detection results to obtain the final target position information. The method provided by the invention overcomes the problems of a high false alarm rate and a high missed detection rate in the prior art, and improves detection precision and detection efficiency.

Description

Remote sensing image target detection method in any direction based on deep learning
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a remote sensing image target detection method in any direction based on deep learning.
Background
With the rapid development of imaging technology, the information acquired by unmanned aerial vehicles, satellites and other equipment has become more and more comprehensive, data processing methods have become more and more mature, image resolution has become higher and higher, and remote sensing images are applied more and more widely in various fields. A remote sensing image differs from a naturally shot image: its background is more complex, and its illumination and viewing angle are changeable; the targets in a remote sensing image are small, usually have large length-to-width ratios (such as ships and trucks), appear in arbitrary directions, and are poorly distinguished from the background. Due to these characteristics, target detection methods based on deep learning that were designed for natural scenes cannot obtain ideal results when directly applied to remote sensing images, and the obtained target frames contain considerable redundant background information. Therefore, a method suitable for target detection in remote sensing images needs to be designed to improve the recall rate and accuracy of target detection in remote sensing images.
In order to solve the problem of low target detection accuracy in remote sensing images, patent document one (publication number CN112102241A) proposes a single-stage remote sensing image target detection algorithm, in which YOLO V3 is used as the baseline network, the feature pyramid structure of YOLO V3 is replaced by a path aggregation network, and the feature map is up-sampled by transposed convolution to improve the detection rate of small targets. However, this method uses a horizontal rectangular frame to calibrate the position, so the detection result contains considerable redundant information, and it suffers from severe missed detections when targets are closely arranged.
Patent document two (publication number CN110674674A) proposes a rotated target detection method based on YOLO V3, which uses the 5-parameter method defined by OpenCV to represent a rectangle in any direction, and at the same time redesigns the anchor frame generation method and the positive and negative sample assignment method, so that the algorithm can obtain more positive samples and the network becomes easier to train; finally, a smooth_L1 loss function is used to regress the position, thereby realizing the detection of targets in any direction in remote sensing images. However, this method greatly increases the number of anchor frames, so the detection speed is reduced; meanwhile, the method is also based on position calibration with a rectangular frame, so the detection result contains considerable redundant information, which affects the detection precision. In addition, when the smooth_L1 loss function regresses the target position, the periodicity of the angle and the exchangeability of width and height cause sudden increases in the loss value, which increases the regression difficulty and further affects the detection efficiency and accuracy.
In conclusion, the existing remote sensing image target detection methods suffer from a high false alarm rate, a high missed detection rate and low detection efficiency.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for detecting targets in any direction of remote sensing images based on deep learning. The technical problem to be solved by the invention is realized by the following technical scheme:
a remote sensing image target detection method in any direction based on deep learning comprises the following steps:
acquiring a remote sensing image data set, and dividing the remote sensing image data set into a training set and a test set;
performing data enhancement and marking on the training set;
constructing a deep learning network model with an SAN structure;
designing an OVAL _ IOU loss function based on the constructed network model, and training the network model by using the training set;
testing the test set by using the trained network model to obtain a detection result;
and carrying out post-processing on the detection result to obtain final target position information.
In one embodiment of the present invention, the performing data enhancement on the training set and labeling comprises:
cutting the data of the training set into proper sizes and performing data enhancement on the data;
marking the target box in the training set by adopting an OpenCV representation method;
converting the OpenCV representation into a long-edge representation, marking the target boxes in the training set as:
{x_c, y_c, w_l, h_l, θ_l};
(w_l, h_l, θ_l) = (w, h, θ) if w ≥ h, and (h, w, θ + 90°) otherwise;
wherein (x_c, y_c) is the center coordinate of the target frame, w represents the first edge met when rotating clockwise from the x-axis, h is the adjacent edge, w_l ≥ h_l, θ is the angle swept by the rotation, θ ∈ (0°, 90°], and θ_l represents the angle swept when rotating clockwise from the x-axis to w_l, θ_l ∈ (0°, 180°].
In one embodiment of the invention, building a deep learning network model with a SAN structure includes: adopting darknet53 as the backbone feature extraction network to build a deep learning network framework, wherein the deep learning network framework comprises a feature extraction module, a SAN module, a feature fusion module and a feature integration module; wherein:
the feature extraction module takes the image to be detected as input, and outputs a feature map to the SAN module after feature extraction;
the SAN module is used for performing feature enhancement on the feature graph output by the feature extraction module and outputting the enhanced feature graph to the feature fusion module;
the feature fusion module is used for sampling deep features rich in semantic information and fusing the deep features with shallow features rich in outline information according to the feature map output by the attention module to obtain fusion features;
the feature integration module is used for performing feature integration on the fusion features to obtain detection results of different scales.
In one embodiment of the invention, the SAN module includes a CAM module and an S-SAM module to weight the feature map from different perspectives to achieve feature enhancement.
In one embodiment of the invention, the loss function is:
Loss = L_conf + L_cls + L_attention + L_reg
wherein L_conf denotes the confidence loss, L_cls the class loss, L_attention the attention loss, and L_reg the position regression loss.
In one embodiment of the invention, the confidence loss L_conf is calculated as:
L_conf = (1/N) Σ_{k=1}^{N} [ I_k^obj · BCE(p_k, y_k) + I_k^noobj · BCE(p_k, y_k) ]
wherein N denotes the number of all prediction boxes; I_k^obj indicates whether a target exists in the k-th prediction box, taking 1 if a target exists and 0 otherwise; I_k^noobj indicates whether no target exists in the k-th prediction box, taking 1 if no target exists and 0 otherwise; p_k denotes the probability that the k-th prediction box contains a target; y_k is the label value indicating whether there is a target in the k-th prediction box, with y_k = 1 denoting a target and y_k = 0 denoting no target; and BCE is the cross-entropy loss function.
In one embodiment of the invention, the class loss L_cls is calculated as:
L_cls = (1/N) Σ_{n=1}^{N} I_n^obj · Σ_c BCE(p_n(c), t_n(c))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; BCE is the cross-entropy loss function; p_n(c) denotes the probability that the n-th prediction box belongs to class c; and t_n(c) is the corresponding label value.
In one embodiment of the invention, the attention loss L_attention is calculated as:
L_attention = (1/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} BCE(u_ij, û_ij)
wherein h and w respectively denote the height and width of the foreground score map obtained after the feature layer passes through the S-SAM module; BCE is the cross-entropy loss function; u_ij denotes the foreground score of the pixel at position (i, j); and û_ij is the label value of the corresponding position, with û_ij = 1 when the position belongs to the foreground and û_ij = 0 otherwise.
In one embodiment of the present invention, the position regression loss L_reg is calculated as:
L_reg = (1/N) Σ_{n=1}^{N} I_n^obj · (1 - OVAL_IOU(b_n, g_n))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; b_n denotes the n-th prediction box and g_n its matched ground-truth box; and OVAL_IOU is a function that computes the IOU of two arbitrarily oriented rectangular boxes by ellipse approximation.
In an embodiment of the present invention, the post-processing the test result to obtain the location information of the target includes:
calculating a confidence value of each grid according to the detection result;
calculating a prediction frame according to the detection result and the anchor frame position information;
and screening the prediction frame based on the confidence value to obtain final target position information.
The invention has the beneficial effects that:
according to the invention, by adding a supervised attention structure in the deep learning network, noise information is effectively reduced and target information is enhanced; meanwhile, an OVAL _ IOU loss function is designed to carry out regression on the rotating rectangular frame, sudden increase of loss values in the existing method is eliminated, the optimization regression task of the position of the rotating rectangular frame is kept consistent with the measurement standard of the evaluation method, the target position regression is more direct and effective, the problems of high false alarm rate and high omission factor in the prior art are solved, and the detection precision and the detection efficiency are improved.
The present invention will be described in further detail with reference to the drawings and examples.
Drawings
Fig. 1 is a schematic flowchart of a method for detecting an object in any direction of a remote sensing image based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a rotating rectangular box OpenCV representation according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a long side representation of a rotating rectangular frame according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep learning network structure with SAN structure provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a SAN network architecture provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for detecting an object in any direction of a remote sensing image based on deep learning according to an embodiment of the present invention, which includes:
step 1: and acquiring a remote sensing image data set, and dividing the remote sensing image data set into a training set and a testing set.
In this embodiment, data may be obtained from public remote sensing image datasets, such as UCAS _ AOD, HRSC2016, DOTA, and the like. The acquired data set is then divided into a training set and a test set in a 7:3 ratio.
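As an illustration of this step, the following Python sketch splits an image folder into training and test lists in a 7:3 ratio; the directory name, file extension and random seed are illustrative assumptions and are not specified by this embodiment.

```python
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=0.7, seed=0):
    """Randomly split the images in image_dir into a training list and a test list."""
    files = sorted(Path(image_dir).glob("*.png"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_ratio)
    return files[:n_train], files[n_train:]

train_files, test_files = split_dataset("UCAS_AOD/images")
print(len(train_files), len(test_files))
```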
Step 2: and performing data enhancement on the training set and marking.
First, the data of the training set is cut to the appropriate size and data-enhanced.
The specific cutting size can be set as required, and the data enhancement method can be implemented by referring to the related art, and the embodiment is not described in detail herein.
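A minimal sketch of the cropping step is given below: it tiles a large remote sensing image into fixed-size patches with some overlap. The tile size of 416 and the overlap of 100 pixels are illustrative choices rather than values fixed by this embodiment.

```python
import numpy as np

def tile_image(image, tile=416, overlap=100):
    """Yield (x0, y0, crop) patches covering the whole image; border crops may be
    smaller than `tile` and would normally be padded before being fed to the network."""
    h, w = image.shape[:2]
    step = tile - overlap
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            yield x0, y0, image[y0:y0 + tile, x0:x0 + tile]

big_image = np.zeros((1024, 1024, 3), dtype=np.uint8)   # stand-in for a large remote sensing image
crops = list(tile_image(big_image))
print(len(crops), crops[0][2].shape)
```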
Then, the target boxes in the training set are labeled by using the OpenCV representation, as shown in fig. 2.
Generally, the labeling mode of the rotated rectangular frame in public data sets such as DOTA and UCAS_AOD is {x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4}, i.e. the 8-parameter method, in which (x_i, y_i), i = 1, 2, 3, 4, denotes the coordinates of the four vertices of the rotated rectangular frame. In this embodiment, however, the 8-parameter representation needs to be converted into a 5-parameter representation.
Specifically, the 8 parameters are converted with OpenCV into the OpenCV 5-parameter representation, i.e. {x_c, y_c, w, h, θ}, where (x_c, y_c) is the center coordinate of the rotated rectangular frame, i.e. the center coordinate of the target frame, w represents the first edge met when rotating clockwise from the x-axis, h is the other adjacent edge, and θ is the angle swept by the rotation, θ ∈ (0°, 90°].
Finally, the OpenCV representation is converted into the long-edge representation, as shown in fig. 3, and the target boxes in the training set are labeled as:
{x_c, y_c, w_l, h_l, θ_l};
(w_l, h_l, θ_l) = (w, h, θ) if w ≥ h, and (h, w, θ + 90°) otherwise;
where w_l ≥ h_l, θ_l denotes the angle swept when rotating clockwise from the x-axis to w_l, and θ_l ∈ (0°, 180°].
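The conversion from the OpenCV 5-parameter representation to the long-edge representation can be sketched as follows; the function name and the example values are for illustration only.

```python
def opencv_to_longedge(xc, yc, w, h, theta):
    """Convert (x_c, y_c, w, h, theta) with theta in (0, 90] to the long-edge
    representation (x_c, y_c, w_l, h_l, theta_l) with w_l >= h_l and theta_l in (0, 180]."""
    if w >= h:
        return xc, yc, w, h, theta
    return xc, yc, h, w, theta + 90.0

print(opencv_to_longedge(100.0, 50.0, 20.0, 60.0, 30.0))   # -> (100.0, 50.0, 60.0, 20.0, 120.0)
```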
Step 3: constructing a deep learning network model with a SAN structure.
Referring to fig. 4, fig. 4 is a schematic diagram of a deep learning network structure with a SAN structure according to an embodiment of the present invention. In this embodiment, a deep learning network framework is built by using darknet53 as the backbone feature extraction network; the framework comprises a feature extraction module, a SAN (Supervised Attention Network) module, a feature fusion module and a feature integration module; wherein:
the feature extraction module takes the image to be detected as input, and outputs a feature map to the SAN module after feature extraction;
the SAN module is used for performing feature enhancement on the feature graph output by the feature extraction module and outputting the enhanced feature graph to the feature fusion module;
the feature fusion module is used for sampling deep features rich in semantic information and fusing the deep features with shallow features rich in outline information according to the feature map output by the attention module to obtain fusion features;
the feature integration module is used for performing feature integration on the fusion features to obtain detection results of different scales.
Specifically, as shown in fig. 4, in this embodiment darknet53 is used as the backbone feature extraction network, feature maps of 5 scales are obtained after the residual blocks, and a supervised attention structure (SAN) is added behind feature maps 3, 4 and 5 to obtain 3 new feature layers 6, 7 and 8; then, the 8th feature map is convolved, up-sampled and fused with the 7th feature map, and the fused 11th feature map is convolved, up-sampled and fused with the 6th feature map; finally, detection results of 3 scales are obtained after the three feature layers pass through convolutional layers. Taking the detection result with the smallest scale as an example, its shape is 13 × 13 × [(5 + 1 + n) × 3], wherein 13 × 13 indicates that the original image is divided into 13 × 13 grids; [(5 + 1 + n) × 3] is the number of channels, where × 3 indicates three prediction boxes per grid (13 × 13 × 3 prediction boxes at this scale), 5 represents the offsets of the prediction box relative to the anchor box in center position, width, height and angle, 1 represents the probability that the prediction box contains a target, and n represents the number of classes, i.e. the probabilities that the target in the corresponding prediction box belongs to each class.
Further, referring to fig. 5, fig. 5 is a schematic diagram of the SAN network structure according to an embodiment of the present invention, which includes two modules, a Channel Attention Module (CAM) and a Supervised-pixel Attention Module (S-SAM), that weight the feature map from different angles so as to achieve feature enhancement.
Specifically, for the input feature map 3 of size 256 × 52 × 52, enhanced features are obtained after it is processed by the CAM module and the S-SAM module respectively.
For the CAM module, the results of global max pooling and global average pooling are respectively passed through fully-connected layers to obtain two features of size 256 × 1 × 1; the two features are added element-wise and activated by a sigmoid function, and a 256 × 1 × 1 channel weight is output.
For the S-SAM module, the results of max pooling and average pooling along the channel dimension are concatenated to obtain a 2 × 52 × 52 feature, which is passed through a convolution and a sigmoid activation to obtain a 1 × 52 × 52 foreground score map.
Finally, the weight obtained by the CAM module and the score map obtained by the S-SAM module are multiplied with the input feature map 3, so that feature map 6 is obtained, which has the same size as the input feature, namely 256 × 52 × 52.
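A minimal PyTorch sketch of the SAN structure described above is given below. The channel count follows the 256 × 52 × 52 example; the reduction ratio of the shared fully-connected layers and the 7 × 7 convolution kernel of the S-SAM branch are assumptions, since the embodiment does not specify them.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention: global max/avg pooling -> shared FC -> sigmoid channel weights."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        max_feat = self.fc(torch.amax(x, dim=(2, 3)))   # vector from global max pooling
        avg_feat = self.fc(torch.mean(x, dim=(2, 3)))   # vector from global average pooling
        return torch.sigmoid(max_feat + avg_feat).view(b, c, 1, 1)

class SSAM(nn.Module):
    """Supervised pixel attention: channel-wise max/avg -> conv -> sigmoid score map."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        max_map = torch.amax(x, dim=1, keepdim=True)    # 1 x H x W
        avg_map = torch.mean(x, dim=1, keepdim=True)    # 1 x H x W
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))

class SAN(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.cam = CAM(channels)
        self.ssam = SSAM()

    def forward(self, x):
        score = self.ssam(x)            # supervised foreground score map (used by the attention loss)
        out = x * self.cam(x) * score   # same size as the input feature map
        return out, score

feat3 = torch.randn(1, 256, 52, 52)
feat6, fg_score = SAN(256)(feat3)
print(feat6.shape, fg_score.shape)      # [1, 256, 52, 52] and [1, 1, 52, 52]
```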
The processing procedure is described in detail below, taking a network input picture of size 3 × 416 × 416 as an example.
Firstly, the 3 × 416 × 416 picture passes through a Conv2D Block to obtain feature map 0 with a size of 64 × 208 × 208;
then, feature maps of 5 scales are obtained through Res Blocks, wherein Res Block × n denotes the stacking of n Res Blocks;
feature map 0 passes through Res Block × 1 to obtain feature map 1 with a size of 64 × 208 × 208;
feature map 1 passes through Res Block × 2 to obtain feature map 2 with a size of 128 × 104 × 104;
feature map 2 passes through Res Block × 8 to obtain feature map 3 with a size of 256 × 52 × 52;
feature map 3 passes through Res Block × 8 to obtain feature map 4 with a size of 512 × 26 × 26;
feature map 4 passes through Res Block × 4 to obtain feature map 5 with a size of 1024 × 13 × 13.
Adding a Supervised Attention Net (SAN) behind the feature graphs 3,4 and 5 to obtain feature graphs 6, 7 and 8 respectively; the SAN does not change the feature map size.
Feature map 8 is convolved to obtain feature map 9 with a size of 512 × 13 × 13;
feature map 9 is convolved to obtain the detection result for large-scale targets, with a shape of 13 × 13 × [(5 + 1 + n) × 3];
feature map 9 is convolved and up-sampled to obtain feature map 10 with a size of 256 × 26 × 26;
feature map 10 is fused with feature map 7 to obtain feature map 11 with a size of 768 × 26 × 26;
feature map 11 is convolved to obtain feature map 12 with a size of 256 × 26 × 26;
feature map 12 is convolved to obtain the detection result for medium-scale targets, with a shape of 26 × 26 × [(5 + 1 + n) × 3];
feature map 12 is convolved and up-sampled to obtain feature map 13 with a size of 128 × 52 × 52;
feature map 13 is fused with feature map 6 to obtain feature map 14 with a size of 384 × 52 × 52;
feature map 14 is convolved to obtain feature map 15 with a size of 128 × 52 × 52;
feature map 15 is convolved to obtain the detection result for small-scale targets, with a shape of 52 × 52 × [(5 + 1 + n) × 3].
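The shape bookkeeping of the fusion path can be sketched as follows in PyTorch; the conv helper is a stand-in with random weights used only to illustrate how the channel counts 768 and 384 arise from up-sampling and concatenation.

```python
import torch
import torch.nn.functional as F

def conv(x, out_ch):
    # stand-in for a conv block (conv + BN + activation); random weights, shapes only
    return torch.nn.Conv2d(x.shape[1], out_ch, kernel_size=1)(x)

feat6 = torch.randn(1, 256, 52, 52)    # SAN output for feature map 3
feat7 = torch.randn(1, 512, 26, 26)    # SAN output for feature map 4
feat8 = torch.randn(1, 1024, 13, 13)   # SAN output for feature map 5

feat9  = conv(feat8, 512)                                    # 512 x 13 x 13
feat10 = F.interpolate(conv(feat9, 256), scale_factor=2)     # 256 x 26 x 26
feat11 = torch.cat([feat10, feat7], dim=1)                   # 768 x 26 x 26
feat12 = conv(feat11, 256)                                   # 256 x 26 x 26
feat13 = F.interpolate(conv(feat12, 128), scale_factor=2)    # 128 x 52 x 52
feat14 = torch.cat([feat13, feat6], dim=1)                   # 384 x 52 x 52
feat15 = conv(feat14, 128)                                   # 128 x 52 x 52
print(feat11.shape, feat14.shape)
```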
In the embodiment, an SAN structure is added behind a main feature layer, and the SAN structure comprises a CAM module and an S-SAM module, so that noise information can be effectively reduced, target information can be enhanced, and the detection precision is improved to a certain extent.
Step 4: designing an OVAL_IOU loss function based on the constructed network model, and training the network model by using the training set.
First, in order to solve the problem of sudden increases in the loss value caused by the periodicity of the angle and the exchangeability of width and height when the smooth_L1 loss function regresses the target position, this embodiment designs the OVAL_IOU loss function to regress the rotated rectangular frame.
Specifically, the OVAL_IOU loss function is expressed as:
Loss = L_conf + L_cls + L_attention + L_reg
wherein L_conf denotes the confidence loss, calculated as:
L_conf = (1/N) Σ_{k=1}^{N} [ I_k^obj · BCE(p_k, y_k) + I_k^noobj · BCE(p_k, y_k) ]
wherein N denotes the number of all prediction boxes; I_k^obj indicates whether a target exists in the k-th prediction box, taking 1 if a target exists and 0 otherwise; I_k^noobj indicates whether no target exists in the k-th prediction box, taking 1 if no target exists and 0 otherwise; p_k denotes the probability that the k-th prediction box contains a target; y_k is the label value indicating whether there is a target in the k-th prediction box, with y_k = 1 denoting a target and y_k = 0 denoting no target; and BCE is the cross-entropy loss function.
In this embodiment, BCE is calculated as:
BCE(p_k, y_k) = -[y_k · log(p_k) + (1 - y_k) · log(1 - p_k)]
L_cls denotes the class loss, calculated as:
L_cls = (1/N) Σ_{n=1}^{N} I_n^obj · Σ_c BCE(p_n(c), t_n(c))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; p_n(c) denotes the probability that the n-th prediction box belongs to class c; and t_n(c) is the corresponding label value.
L_attention denotes the attention loss, calculated as:
L_attention = (1/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} BCE(u_ij, û_ij)
wherein h and w respectively denote the height and width of the foreground score map obtained after the feature layer passes through the S-SAM module; BCE is the cross-entropy loss function; u_ij denotes the foreground score of the pixel at position (i, j); and û_ij is the label value of the corresponding position, with û_ij = 1 when the position belongs to the foreground and û_ij = 0 otherwise.
L_reg denotes the position regression loss, calculated as:
L_reg = (1/N) Σ_{n=1}^{N} I_n^obj · (1 - OVAL_IOU(b_n, g_n))
wherein b_n denotes the n-th prediction box and g_n its matched ground-truth box.
the OVAL _ IOU is a function for approximately calculating two rectangular frames IOU in any direction by using an ellipse, and the specific calculation process is as follows:
any ellipse in the plane can be represented as:
Figure BDA0003723672720000111
wherein a is a long axis, b is a short axis, and alpha is an included angle alpha epsilon (0 DEG, 180 DEG) of a long axis x axis]Will { x c ,y c The rotated rectangular box approximation, denoted w _ l, h _ l, θ _ l, is represented by an ellipse as:
Figure BDA0003723672720000112
defining:
Figure BDA0003723672720000113
Figure BDA0003723672720000114
Figure BDA0003723672720000115
wherein (i, j) represents the coordinates of the pixel point, oval represents the coordinates as { x } c ,y c And judging whether the pixel point P (i, j) is in the ellipse or not by using P (i, j, oval), wherein the oval P (i, j, oval) is 1, which indicates that the pixel point P (i, j) is in the oval, and otherwise, the pixel point P (i, j) is not in the oval. The intersection set of the two ellipses is then:
Figure BDA0003723672720000116
Figure BDA0003723672720000117
because P (i, j, oval) is not derivable, P (i, j, oval) is approximately replaced by a continuously derivable function F (i, j, oval), whose expression is as follows:
F(i,j,oval)=K(k_1+k_2,0.25)
wherein
Figure BDA0003723672720000118
k is an adjustable parameter for controlling the sensitivity of the target pixel.
The intersection and union formulas are accordingly updated as:
Inter = Σ_(i,j) F(i, j, oval_1) · F(i, j, oval_2)
Union = Σ_(i,j) F(i, j, oval_1) + Σ_(i,j) F(i, j, oval_2) - Inter
Then the OVAL_IOU calculation formula is:
OVAL_IOU = Inter / Union
the invention provides a new Loss function-OVAL _ IOU Loss. The OVAL _ IOU uses an ellipse to approximately replace a rectangular frame in any direction, then the IOUs of two ellipses are calculated as any two rotation rectangular frame IOUs, and the OVAL enables the optimization regression task of the positions of the rotation rectangular frames to be consistent with the measurement standard of the evaluation method, namely the IOUs of two target frames are used, so that the target position regression is more direct and effective.
And then, training the network model constructed in the step 3 by using training set data based on the constructed OVAL _ IOU loss function to obtain the trained network model.
Step 5: testing the test set by using the trained network model to obtain the detection results.
Specifically, assuming that the width and height of the input picture are 416 × 416, a total of 10647 detection results can be obtained, namely 13 × 13 × 3+26 × 26 × 3+52 × 52 × 3.
Step 6: carrying out post-processing on the detection results to obtain the final target position information.
61) Calculating the confidence value of each grid according to the detection result.
Specifically, after the picture is divided into n × n grids (n ∈ {13, 26, 52}), the prediction result of the (i, j)-th grid is (t_ijkx, t_ijky, t_ijkw, t_ijkh, t_ijkθ, p_ijkobj, p_ijkc1, p_ijkc2, ..., p_ijkcn).
Here i = 1, 2, ..., n and j = 1, 2, ..., n index the grids, and k = 1, 2, 3 indicates that there are three prediction boxes per grid; t_ijkx, t_ijky, t_ijkw, t_ijkh and t_ijkθ respectively denote the offsets of the k-th prediction box of the (i, j)-th grid relative to the center position, width, height and angle of the anchor box; p_ijkobj is the probability that the k-th prediction box of the (i, j)-th grid contains a target; n denotes the number of classes; and p_ijkcm denotes the probability that the target in the k-th prediction box of the (i, j)-th grid belongs to the m-th class (m = 1, 2, ..., n).
The confidence calculation formula is:
conf_ijk = p_ijkobj · max(p_ijkc1, p_ijkc2, ..., p_ijkcn)
A confidence threshold conf_thr is set, prediction boxes with conf_ijk < conf_thr are removed, and the prediction result is converted into (t_ijkx, t_ijky, t_ijkw, t_ijkh, t_ijkθ, conf_ijk, cls_id), where cls_id (cls_id = 0, 1, ..., n - 1) denotes the index of the class corresponding to the maximum value among (p_ijkc1, p_ijkc2, ..., p_ijkcn).
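Step 61) can be sketched as follows; the confidence threshold of 0.3 is an illustrative value.

```python
import numpy as np

def filter_predictions(offsets, p_obj, p_cls, conf_thr=0.3):
    """offsets: (M, 5) raw box offsets; p_obj: (M,) objectness; p_cls: (M, n_classes)."""
    conf = p_obj * p_cls.max(axis=1)          # conf_ijk = p_ijkobj * max class probability
    cls_id = p_cls.argmax(axis=1)
    keep = conf >= conf_thr
    return offsets[keep], conf[keep], cls_id[keep]

offsets = np.random.randn(10, 5)
kept, confs, ids = filter_predictions(offsets, np.random.rand(10), np.random.rand(10, 4))
print(kept.shape, confs, ids)
```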
62) Calculating the prediction boxes according to the detection result and the anchor box position information.
Specifically, based on the predicted offsets t_ijkx, t_ijky, t_ijkw, t_ijkh, t_ijkθ and the center, width, height and angle of the anchor box (x_aijk, y_aijk, w_aijk, h_aijk, θ_aijk), the prediction box (x_pijk, y_pijk, w_pijk, h_pijk, θ_pijk) is obtained by the following calculation formulas:
x_pijk = x_aijk + t_ijkx
y_pijk = y_aijk + t_ijky
w_pijk = w_aijk · exp(t_ijkw)
h_pijk = h_aijk · exp(t_ijkh)
θ_pijk = θ_aijk + t_ijkθ
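Step 62) follows the update equations above directly, as the following sketch shows; the example offsets and anchor are arbitrary.

```python
import numpy as np

def decode_box(t, anchor):
    """t: (tx, ty, tw, th, ttheta); anchor: (xa, ya, wa, ha, theta_a)."""
    tx, ty, tw, th, ttheta = t
    xa, ya, wa, ha, theta_a = anchor
    return (xa + tx, ya + ty, wa * np.exp(tw), ha * np.exp(th), theta_a + ttheta)

print(decode_box((1.5, -0.5, 0.1, 0.2, 5.0), (100.0, 50.0, 40.0, 20.0, 30.0)))
```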
63) Screening the prediction boxes based on the confidence values to obtain the final target position information.
Since there are many duplicate prediction boxes among those obtained in step 62), they need to be removed. In this embodiment, the NMS (non-maximum suppression) algorithm is used to remove redundant prediction boxes and retain the prediction box with the maximum confidence; the specific implementation steps are as follows:
a) setting an IOU threshold IOU_thr for deleting redundant prediction boxes;
b) sorting all prediction boxes according to cls_id;
c) taking out the prediction boxes of one class at a time, and building a candidate box list in descending order of confidence;
d) selecting the box B with the highest confidence, adding it to the output list, and deleting it from the candidate box list;
e) while the candidate box list is not empty, calculating the IOU values between box B and all boxes in the candidate box list, and deleting the boxes whose IOU is greater than IOU_thr;
f) repeating steps d) and e) until the candidate box list is empty;
g) repeating steps c), d), e) and f) until all classes have been processed, and returning the output list.
Thus, the final target position information is obtained.
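Steps a) to g) can be sketched as follows; the IOU function is passed in as a parameter (in practice a rotated IOU such as the elliptical oval_iou sketched earlier would be used), and the threshold and example values are illustrative.

```python
def rotated_nms(boxes, scores, cls_ids, iou_fn, iou_thr=0.3):
    """Greedy per-class suppression; boxes are (xc, yc, w_l, h_l, theta_l) tuples."""
    keep = []
    for c in set(cls_ids):
        idx = [i for i, k in enumerate(cls_ids) if k == c]
        idx.sort(key=lambda i: scores[i], reverse=True)   # descending confidence
        while idx:
            best = idx.pop(0)
            keep.append(best)
            idx = [i for i in idx if iou_fn(boxes[best], boxes[i]) <= iou_thr]
    return keep

boxes = [(100.0, 100.0, 60.0, 20.0, 30.0), (102.0, 101.0, 60.0, 20.0, 32.0), (300.0, 300.0, 40.0, 40.0, 0.0)]
scores = [0.9, 0.8, 0.7]
cls_ids = [0, 0, 1]
# A toy overlap measure stands in for the elliptical oval_iou here.
close_centres = lambda a, b: 1.0 if abs(a[0] - b[0]) + abs(a[1] - b[1]) < 10 else 0.0
print(rotated_nms(boxes, scores, cls_ids, close_centres))   # indices of the kept boxes
```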
According to the invention, by adding a supervised attention structure to the deep learning network, noise information is effectively reduced and target information is enhanced; an OVAL_IOU loss function is designed to regress the rotated rectangular frame, which eliminates the sudden increase of the loss value in existing methods and keeps the optimization regression task of the rotated rectangular frame position consistent with the metric used by the evaluation method, so that target position regression is more direct and effective. The problems of a high false alarm rate and a high missed detection rate in the prior art are thereby overcome, and detection precision and detection efficiency are improved.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A remote sensing image target detection method in any direction based on deep learning is characterized by comprising the following steps:
acquiring a remote sensing image data set, and dividing the remote sensing image data set into a training set and a test set;
performing data enhancement on the training set and marking;
constructing a deep learning network model with an SAN structure;
designing an OVAL _ IOU loss function based on the constructed network model, and training the network model by using the training set;
testing the test set by using the trained network model to obtain a detection result;
and carrying out post-processing on the detection result to obtain final target position information.
2. The method for detecting the target in any direction based on the remote sensing image of the deep learning according to claim 1, wherein the step of performing data enhancement and marking on the training set comprises the following steps:
cutting the data of the training set into proper sizes and performing data enhancement on the data;
marking the target box in the training set by adopting an OpenCV representation method;
converting the OpenCV representation into a long-edge representation, marking the target boxes in the training set as:
{x_c, y_c, w_l, h_l, θ_l};
(w_l, h_l, θ_l) = (w, h, θ) if w ≥ h, and (h, w, θ + 90°) otherwise;
wherein (x_c, y_c) is the center coordinate of the target frame, w represents the first edge met when rotating clockwise from the x-axis, h is the other adjacent edge, w_l ≥ h_l, θ is the angle swept by the rotation, θ ∈ (0°, 90°], and θ_l represents the angle swept when rotating clockwise from the x-axis to w_l, θ_l ∈ (0°, 180°].
3. The method for detecting the target in any direction based on the deep learning of the remote sensing image as claimed in claim 1, wherein the step of constructing the deep learning network model with the SAN structure comprises the following steps:
adopting darknet53 as the backbone feature extraction network to build a deep learning network framework, wherein the deep learning network framework comprises a feature extraction module, a SAN module, a feature fusion module and a feature integration module; wherein:
the feature extraction module takes the image to be detected as input, and outputs a feature map to the SAN module after feature extraction;
the SAN module is used for performing feature enhancement on the feature graph output by the feature extraction module and outputting the enhanced feature graph to the feature fusion module;
the feature fusion module is used for sampling deep features rich in semantic information and fusing the deep features with shallow features rich in outline information according to the feature map output by the attention module to obtain fusion features;
the feature integration module is used for performing feature integration on the fusion features to obtain detection results of different scales.
4. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 3, wherein the SAN module comprises a CAM module and an S-SAM module, which weight the feature map from different angles, so that feature enhancement is achieved.
5. The method for detecting the target in any direction based on the deep learning of the remote sensing image as claimed in claim 1, wherein the loss function is as follows:
Loss = L_conf + L_cls + L_attention + L_reg
wherein L_conf denotes the confidence loss, L_cls the class loss, L_attention the attention loss, and L_reg the position regression loss.
6. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the confidence loss L_conf is calculated as:
L_conf = (1/N) Σ_{k=1}^{N} [ I_k^obj · BCE(p_k, y_k) + I_k^noobj · BCE(p_k, y_k) ]
wherein N denotes the number of all prediction boxes; I_k^obj indicates whether a target exists in the k-th prediction box, taking 1 if a target exists and 0 otherwise; I_k^noobj indicates whether no target exists in the k-th prediction box, taking 1 if no target exists and 0 otherwise; p_k denotes the probability that the k-th prediction box contains a target; y_k is the label value indicating whether there is a target in the k-th prediction box, with y_k = 1 denoting a target and y_k = 0 denoting no target; and BCE is the cross-entropy loss function.
7. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the class loss L_cls is calculated as:
L_cls = (1/N) Σ_{n=1}^{N} I_n^obj · Σ_c BCE(p_n(c), t_n(c))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; BCE is the cross-entropy loss function; p_n(c) denotes the probability that the n-th prediction box belongs to class c; and t_n(c) is the corresponding label value.
8. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the attention loss L_attention is calculated as:
L_attention = (1/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} BCE(u_ij, û_ij)
wherein h and w respectively denote the height and width of the foreground score map obtained after the feature layer passes through the S-SAM module; BCE is the cross-entropy loss function; u_ij denotes the foreground score of the pixel at position (i, j); and û_ij is the label value of the corresponding position, with û_ij = 1 when the position belongs to the foreground and û_ij = 0 otherwise.
9. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the position regression loss L_reg is calculated as:
L_reg = (1/N) Σ_{n=1}^{N} I_n^obj · (1 - OVAL_IOU(b_n, g_n))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; b_n denotes the n-th prediction box and g_n its matched ground-truth box; and OVAL_IOU is a function that computes the IOU of two arbitrarily oriented rectangular boxes by ellipse approximation.
10. The method for detecting the target in any direction based on the deep learning of the remote sensing image as claimed in claim 1, wherein the step of performing post-processing on the test result to obtain the position information of the target comprises the following steps:
calculating a confidence value of each grid according to the detection result;
calculating a prediction frame according to the detection result and the anchor frame position information;
and screening the prediction frame based on the confidence value to obtain final target position information.
CN202210759294.0A 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning Pending CN115115936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759294.0A CN115115936A (en) 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759294.0A CN115115936A (en) 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning

Publications (1)

Publication Number Publication Date
CN115115936A true CN115115936A (en) 2022-09-27

Family

ID=83331094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759294.0A Pending CN115115936A (en) 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning

Country Status (1)

Country Link
CN (1) CN115115936A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681983A (en) * 2023-06-02 2023-09-01 中国矿业大学 Long and narrow target detection method based on deep learning
CN116681983B (en) * 2023-06-02 2024-06-11 中国矿业大学 Long and narrow target detection method based on deep learning

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN108596101B (en) Remote sensing image multi-target detection method based on convolutional neural network
CN109934153B (en) Building extraction method based on gating depth residual error optimization network
CN111738112B (en) Remote sensing ship image target detection method based on deep neural network and self-attention mechanism
CN108460341B (en) Optical remote sensing image target detection method based on integrated depth convolution network
CN112488210A (en) Three-dimensional point cloud automatic classification method based on graph convolution neural network
CN112396002A (en) Lightweight remote sensing target detection method based on SE-YOLOv3
Wang et al. A deep-learning-based sea search and rescue algorithm by UAV remote sensing
CN110189304B (en) Optical remote sensing image target on-line rapid detection method based on artificial intelligence
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN111753677B (en) Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN111126359A (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN114612835A (en) Unmanned aerial vehicle target detection model based on YOLOv5 network
CN115830471B (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
CN110490155B (en) Method for detecting unmanned aerial vehicle in no-fly airspace
CN113850783B (en) Sea surface ship detection method and system
CN114998566A (en) Interpretable multi-scale infrared small and weak target detection network design method
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
Chen et al. Remote sensing image ship detection under complex sea conditions based on deep semantic segmentation
Cao et al. Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks
CN115170816A (en) Multi-scale feature extraction system and method and fan blade defect detection method
CN115497002A (en) Multi-scale feature fusion laser radar remote sensing classification method
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN115995042A (en) Video SAR moving target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination