CN115115936A - Remote sensing image target detection method in any direction based on deep learning - Google Patents

Remote sensing image target detection method in any direction based on deep learning Download PDF

Info

Publication number
CN115115936A
CN115115936A (application number CN202210759294.0A)
Authority
CN
China
Prior art keywords
target
deep learning
remote sensing
feature
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210759294.0A
Other languages
Chinese (zh)
Inventor
郭亮
祁文平
李亚超
熊涛
荆丹
许晴
吕艳
邢孟道
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202210759294.0A
Publication of CN115115936A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/10 - Terrestrial scenes
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/766 - Arrangements using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion of extracted features
    • G06V10/82 - Arrangements using pattern recognition or machine learning using neural networks
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting targets in any direction in remote sensing images based on deep learning, which comprises the following steps: acquiring a remote sensing image data set and dividing it into a training set and a test set; performing data enhancement on the training set and labeling it; constructing a deep learning network model with a SAN structure; designing an OVAL_IOU loss function based on the constructed network model and training the network model with the training set; testing the test set with the trained network model to obtain detection results; and post-processing the detection results to obtain the final target position information. The method provided by the invention overcomes the problems of a high false alarm rate and a high missed detection rate in the prior art, and improves detection precision and detection efficiency.

Description

Remote sensing image target detection method in any direction based on deep learning
Technical Field
The invention belongs to the technical field of target detection, and particularly relates to a remote sensing image target detection method in any direction based on deep learning.
Background
With the rapid development of imaging technology, the information acquired by unmanned aerial vehicles, satellites and other equipment has become more and more comprehensive, data processing methods have become more and more mature, image resolution has become higher and higher, and remote sensing images are applied more and more widely in various fields. A remote sensing image differs from a naturally shot image: its background is more complex, and its illumination and viewing angle are changeable; the targets in a remote sensing image are small, usually have large length-to-width ratios (such as ships and trucks), appear in arbitrary directions, and are poorly distinguished from the background. Due to these characteristics, target detection methods based on deep learning that were designed for natural scenes cannot obtain ideal results when directly applied to remote sensing images, and the obtained target frames contain considerable redundant background information. Therefore, a method suitable for target detection in remote sensing images needs to be designed to improve the recall rate and accuracy of target detection in remote sensing images.
In order to solve the problem of low target detection accuracy in remote sensing images, patent document one (publication number CN112102241A) proposes a single-stage remote sensing image target detection algorithm, in which YOLO V3 is used as the baseline network, the feature pyramid structure of YOLO V3 is replaced by a path aggregation network, and the feature map is up-sampled by transposed convolution to improve the detection rate of small targets. However, this method uses a horizontal rectangular frame to calibrate the position, so the detection result contains considerable redundant information, and it suffers from severe missed detections when targets are closely arranged.
Patent document two (publication number CN110674674A) proposes a rotated target detection method based on YOLO V3, which uses the 5-parameter method defined by OpenCV to represent a rectangle in any direction, and at the same time redesigns the anchor frame generation method and the positive and negative sample assignment method, so that the algorithm can obtain more positive samples and the network becomes easier to train; finally, a smooth_L1 loss function is used to regress the position, thereby realizing the detection of targets in any direction in remote sensing images. However, this method greatly increases the number of anchor frames, so the detection speed is reduced; meanwhile, the method is also based on position calibration with a rectangular frame, so the detection result contains considerable redundant information, which affects the detection precision. In addition, when the smooth_L1 loss function regresses the target position, the periodicity of the angle and the exchangeability of width and height cause sudden increases in the loss value, which increases the regression difficulty and further affects the detection efficiency and accuracy.
In conclusion, the existing remote sensing image target detection methods suffer from a high false alarm rate, a high missed detection rate and low detection efficiency.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a method for detecting targets in any direction of remote sensing images based on deep learning. The technical problem to be solved by the invention is realized by the following technical scheme:
a remote sensing image target detection method in any direction based on deep learning comprises the following steps:
acquiring a remote sensing image data set, and dividing the remote sensing image data set into a training set and a test set;
performing data enhancement and marking on the training set;
constructing a deep learning network model with an SAN structure;
designing an OVAL _ IOU loss function based on the constructed network model, and training the network model by using the training set;
testing the test set by using the trained network model to obtain a detection result;
and carrying out post-processing on the detection result to obtain final target position information.
In one embodiment of the present invention, the performing data enhancement on the training set and labeling comprises:
cutting the data of the training set into proper sizes and performing data enhancement on the data;
marking the target box in the training set by adopting an OpenCV representation method;
converting the OpenCV representation into a long-edge representation, marking the target boxes in the training set as:
{x_c, y_c, w_l, h_l, θ_l};
(w_l, h_l, θ_l) = (w, h, θ) if w ≥ h, and (h, w, θ + 90°) otherwise;
wherein (x_c, y_c) is the center coordinate of the target frame, w represents the first edge met when rotating clockwise from the x-axis, h is the adjacent edge, w_l ≥ h_l, θ is the angle swept by the rotation, θ ∈ (0°, 90°], and θ_l represents the angle swept when rotating clockwise from the x-axis to w_l, θ_l ∈ (0°, 180°].
In one embodiment of the invention, building a deep learning network model with a SAN structure includes: adopting darknet53 as the backbone feature extraction network to build a deep learning network framework, wherein the deep learning network framework comprises a feature extraction module, a SAN module, a feature fusion module and a feature integration module; wherein:
the feature extraction module takes the image to be detected as input, and outputs a feature map to the SAN module after feature extraction;
the SAN module is used for performing feature enhancement on the feature graph output by the feature extraction module and outputting the enhanced feature graph to the feature fusion module;
the feature fusion module is used for sampling deep features rich in semantic information and fusing the deep features with shallow features rich in outline information according to the feature map output by the attention module to obtain fusion features;
the feature integration module is used for performing feature integration on the fusion features to obtain detection results of different scales.
In one embodiment of the invention, the SAN module includes a CAM module and an S-SAM module to weight the feature map from different perspectives to achieve feature enhancement.
In one embodiment of the invention, the loss function is:
Loss = L_conf + L_cls + L_attention + L_reg
wherein L_conf denotes the confidence loss, L_cls the class loss, L_attention the attention loss, and L_reg the position regression loss.
In one embodiment of the invention, the confidence loss L_conf is calculated as:
L_conf = (1/N) Σ_{k=1}^{N} [ I_k^obj · BCE(p_k, y_k) + I_k^noobj · BCE(p_k, y_k) ]
wherein N denotes the number of all prediction boxes; I_k^obj indicates whether a target exists in the k-th prediction box, taking 1 if a target exists and 0 otherwise; I_k^noobj indicates whether no target exists in the k-th prediction box, taking 1 if no target exists and 0 otherwise; p_k denotes the probability that the k-th prediction box contains a target; y_k is the label value indicating whether there is a target in the k-th prediction box, with y_k = 1 denoting a target and y_k = 0 denoting no target; and BCE is the cross-entropy loss function.
In one embodiment of the invention, the class loss L_cls is calculated as:
L_cls = (1/N) Σ_{n=1}^{N} I_n^obj · Σ_c BCE(p_n(c), t_n(c))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; BCE is the cross-entropy loss function; p_n(c) denotes the probability that the n-th prediction box belongs to class c; and t_n(c) is the corresponding label value.
In one embodiment of the invention, the attention loss L_attention is calculated as:
L_attention = (1/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} BCE(u_ij, û_ij)
wherein h and w respectively denote the height and width of the foreground score map obtained after the feature layer passes through the S-SAM module; BCE is the cross-entropy loss function; u_ij denotes the foreground score of the pixel at position (i, j); and û_ij is the label value of the corresponding position, with û_ij = 1 when the position belongs to the foreground and û_ij = 0 otherwise.
In one embodiment of the present invention, the position regression loss L_reg is calculated as:
L_reg = (1/N) Σ_{n=1}^{N} I_n^obj · (1 - OVAL_IOU(b_n, g_n))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; b_n denotes the n-th prediction box and g_n its matched ground-truth box; and OVAL_IOU is a function that computes the IOU of two arbitrarily oriented rectangular boxes by ellipse approximation.
In an embodiment of the present invention, the post-processing the test result to obtain the location information of the target includes:
calculating a confidence value of each grid according to the detection result;
calculating a prediction frame according to the detection result and the anchor frame position information;
and screening the prediction frame based on the confidence value to obtain final target position information.
The invention has the beneficial effects that:
according to the invention, by adding a supervised attention structure in the deep learning network, noise information is effectively reduced and target information is enhanced; meanwhile, an OVAL _ IOU loss function is designed to carry out regression on the rotating rectangular frame, sudden increase of loss values in the existing method is eliminated, the optimization regression task of the position of the rotating rectangular frame is kept consistent with the measurement standard of the evaluation method, the target position regression is more direct and effective, the problems of high false alarm rate and high omission factor in the prior art are solved, and the detection precision and the detection efficiency are improved.
The present invention will be described in further detail with reference to the drawings and examples.
Drawings
Fig. 1 is a schematic flowchart of a method for detecting an object in any direction of a remote sensing image based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a rotating rectangular box OpenCV representation according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a long side representation of a rotating rectangular frame according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a deep learning network structure with SAN structure provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a SAN network architecture provided by an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to fig. 1, fig. 1 is a schematic flow chart of a method for detecting an object in any direction of a remote sensing image based on deep learning according to an embodiment of the present invention, which includes:
step 1: and acquiring a remote sensing image data set, and dividing the remote sensing image data set into a training set and a testing set.
In this embodiment, data may be obtained from public remote sensing image datasets, such as UCAS _ AOD, HRSC2016, DOTA, and the like. The acquired data set is then divided into a training set and a test set in a 7:3 ratio.
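As an illustration of this step, the following Python sketch splits an image folder into training and test lists in a 7:3 ratio; the directory name, file extension and random seed are illustrative assumptions and are not specified by this embodiment.

```python
import random
from pathlib import Path

def split_dataset(image_dir, train_ratio=0.7, seed=0):
    """Randomly split the images in image_dir into a training list and a test list."""
    files = sorted(Path(image_dir).glob("*.png"))
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * train_ratio)
    return files[:n_train], files[n_train:]

train_files, test_files = split_dataset("UCAS_AOD/images")
print(len(train_files), len(test_files))
```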
Step 2: and performing data enhancement on the training set and marking.
First, the data of the training set is cut to the appropriate size and data-enhanced.
The specific cutting size can be set as required, and the data enhancement method can be implemented by referring to the related art, and the embodiment is not described in detail herein.
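A minimal sketch of the cropping step is given below: it tiles a large remote sensing image into fixed-size patches with some overlap. The tile size of 416 and the overlap of 100 pixels are illustrative choices rather than values fixed by this embodiment.

```python
import numpy as np

def tile_image(image, tile=416, overlap=100):
    """Yield (x0, y0, crop) patches covering the whole image; border crops may be
    smaller than `tile` and would normally be padded before being fed to the network."""
    h, w = image.shape[:2]
    step = tile - overlap
    for y0 in range(0, max(h - overlap, 1), step):
        for x0 in range(0, max(w - overlap, 1), step):
            yield x0, y0, image[y0:y0 + tile, x0:x0 + tile]

big_image = np.zeros((1024, 1024, 3), dtype=np.uint8)   # stand-in for a large remote sensing image
crops = list(tile_image(big_image))
print(len(crops), crops[0][2].shape)
```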
Then, the target boxes in the training set are labeled by using the OpenCV representation, as shown in fig. 2.
Generally, the labeling mode of the rotated rectangular frame in public data sets such as DOTA and UCAS_AOD is {x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4}, i.e. the 8-parameter method, in which (x_i, y_i), i = 1, 2, 3, 4, denotes the coordinates of the four vertices of the rotated rectangular frame. In this embodiment, however, the 8-parameter representation needs to be converted into a 5-parameter representation.
Specifically, the 8 parameters are converted with OpenCV into the OpenCV 5-parameter representation, i.e. {x_c, y_c, w, h, θ}, where (x_c, y_c) is the center coordinate of the rotated rectangular frame, i.e. the center coordinate of the target frame, w represents the first edge met when rotating clockwise from the x-axis, h is the other adjacent edge, and θ is the angle swept by the rotation, θ ∈ (0°, 90°].
Finally, the OpenCV representation is converted into the long-edge representation, as shown in fig. 3, and the target boxes in the training set are labeled as:
{x_c, y_c, w_l, h_l, θ_l};
(w_l, h_l, θ_l) = (w, h, θ) if w ≥ h, and (h, w, θ + 90°) otherwise;
where w_l ≥ h_l, θ_l denotes the angle swept when rotating clockwise from the x-axis to w_l, and θ_l ∈ (0°, 180°].
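The conversion from the OpenCV 5-parameter representation to the long-edge representation can be sketched as follows; the function name and the example values are for illustration only.

```python
def opencv_to_longedge(xc, yc, w, h, theta):
    """Convert (x_c, y_c, w, h, theta) with theta in (0, 90] to the long-edge
    representation (x_c, y_c, w_l, h_l, theta_l) with w_l >= h_l and theta_l in (0, 180]."""
    if w >= h:
        return xc, yc, w, h, theta
    return xc, yc, h, w, theta + 90.0

print(opencv_to_longedge(100.0, 50.0, 20.0, 60.0, 30.0))   # -> (100.0, 50.0, 60.0, 20.0, 120.0)
```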
Step 3: constructing a deep learning network model with a SAN structure.
Referring to fig. 4, fig. 4 is a schematic diagram of a deep learning network structure with a SAN structure according to an embodiment of the present invention. In this embodiment, a deep learning network framework is built by using darknet53 as the backbone feature extraction network; the framework comprises a feature extraction module, a SAN (Supervised Attention Network) module, a feature fusion module and a feature integration module; wherein:
the feature extraction module takes the image to be detected as input, and outputs a feature map to the SAN module after feature extraction;
the SAN module is used for performing feature enhancement on the feature graph output by the feature extraction module and outputting the enhanced feature graph to the feature fusion module;
the feature fusion module is used for sampling deep features rich in semantic information and fusing the deep features with shallow features rich in outline information according to the feature map output by the attention module to obtain fusion features;
the feature integration module is used for performing feature integration on the fusion features to obtain detection results of different scales.
Specifically, as shown in fig. 4, in this embodiment darknet53 is used as the backbone feature extraction network, feature maps of 5 scales are obtained after the residual blocks, and a supervised attention structure (SAN) is added behind feature maps 3, 4 and 5 to obtain 3 new feature layers 6, 7 and 8; then, the 8th feature map is convolved, up-sampled and fused with the 7th feature map, and the fused 11th feature map is convolved, up-sampled and fused with the 6th feature map; finally, detection results of 3 scales are obtained after the three feature layers pass through convolutional layers. Taking the detection result with the smallest scale as an example, its shape is 13 × 13 × [(5 + 1 + n) × 3], wherein 13 × 13 indicates that the original image is divided into 13 × 13 grids; [(5 + 1 + n) × 3] is the number of channels, where × 3 indicates three prediction boxes per grid (13 × 13 × 3 prediction boxes at this scale), 5 represents the offsets of the prediction box relative to the anchor box in center position, width, height and angle, 1 represents the probability that the prediction box contains a target, and n represents the number of classes, i.e. the probabilities that the target in the corresponding prediction box belongs to each class.
Further, referring to fig. 5, fig. 5 is a schematic diagram of the SAN network structure according to an embodiment of the present invention, which includes two modules, a Channel Attention Module (CAM) and a Supervised-pixel Attention Module (S-SAM), that weight the feature map from different angles so as to achieve feature enhancement.
Specifically, for the input feature map 3 of size 256 × 52 × 52, enhanced features are obtained after it is processed by the CAM module and the S-SAM module respectively.
For the CAM module, the results of global max pooling and global average pooling are respectively passed through fully-connected layers to obtain two features of size 256 × 1 × 1; the two features are added element-wise and activated by a sigmoid function, and a 256 × 1 × 1 channel weight is output.
For the S-SAM module, the results of max pooling and average pooling along the channel dimension are concatenated to obtain a 2 × 52 × 52 feature, which is passed through a convolution and a sigmoid activation to obtain a 1 × 52 × 52 foreground score map.
Finally, the weight obtained by the CAM module and the score map obtained by the S-SAM module are multiplied with the input feature map 3, so that feature map 6 is obtained, which has the same size as the input feature, namely 256 × 52 × 52.
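A minimal PyTorch sketch of the SAN structure described above is given below. The channel count follows the 256 × 52 × 52 example; the reduction ratio of the shared fully-connected layers and the 7 × 7 convolution kernel of the S-SAM branch are assumptions, since the embodiment does not specify them.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Channel attention: global max/avg pooling -> shared FC -> sigmoid channel weights."""
    def __init__(self, channels=256, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        max_feat = self.fc(torch.amax(x, dim=(2, 3)))   # vector from global max pooling
        avg_feat = self.fc(torch.mean(x, dim=(2, 3)))   # vector from global average pooling
        return torch.sigmoid(max_feat + avg_feat).view(b, c, 1, 1)

class SSAM(nn.Module):
    """Supervised pixel attention: channel-wise max/avg -> conv -> sigmoid score map."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        max_map = torch.amax(x, dim=1, keepdim=True)    # 1 x H x W
        avg_map = torch.mean(x, dim=1, keepdim=True)    # 1 x H x W
        return torch.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))

class SAN(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.cam = CAM(channels)
        self.ssam = SSAM()

    def forward(self, x):
        score = self.ssam(x)            # supervised foreground score map (used by the attention loss)
        out = x * self.cam(x) * score   # same size as the input feature map
        return out, score

feat3 = torch.randn(1, 256, 52, 52)
feat6, fg_score = SAN(256)(feat3)
print(feat6.shape, fg_score.shape)      # [1, 256, 52, 52] and [1, 1, 52, 52]
```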
The processing procedure is described in detail below, taking a network input picture of size 3 × 416 × 416 as an example.
Firstly, the 3 × 416 × 416 picture passes through a Conv2D Block to obtain feature map 0 with a size of 64 × 208 × 208;
then, feature maps of 5 scales are obtained through Res Blocks, wherein Res Block × n denotes the stacking of n Res Blocks;
feature map 0 passes through Res Block × 1 to obtain feature map 1 with a size of 64 × 208 × 208;
feature map 1 passes through Res Block × 2 to obtain feature map 2 with a size of 128 × 104 × 104;
feature map 2 passes through Res Block × 8 to obtain feature map 3 with a size of 256 × 52 × 52;
feature map 3 passes through Res Block × 8 to obtain feature map 4 with a size of 512 × 26 × 26;
feature map 4 passes through Res Block × 4 to obtain feature map 5 with a size of 1024 × 13 × 13.
Adding a Supervised Attention Net (SAN) behind the feature graphs 3,4 and 5 to obtain feature graphs 6, 7 and 8 respectively; the SAN does not change the feature map size.
Feature map 8 is convolved to obtain feature map 9 with a size of 512 × 13 × 13;
feature map 9 is convolved to obtain the detection result for large-scale targets, with a shape of 13 × 13 × [(5 + 1 + n) × 3];
feature map 9 is convolved and up-sampled to obtain feature map 10 with a size of 256 × 26 × 26;
feature map 10 is fused with feature map 7 to obtain feature map 11 with a size of 768 × 26 × 26;
feature map 11 is convolved to obtain feature map 12 with a size of 256 × 26 × 26;
feature map 12 is convolved to obtain the detection result for medium-scale targets, with a shape of 26 × 26 × [(5 + 1 + n) × 3];
feature map 12 is convolved and up-sampled to obtain feature map 13 with a size of 128 × 52 × 52;
feature map 13 is fused with feature map 6 to obtain feature map 14 with a size of 384 × 52 × 52;
feature map 14 is convolved to obtain feature map 15 with a size of 128 × 52 × 52;
feature map 15 is convolved to obtain the detection result for small-scale targets, with a shape of 52 × 52 × [(5 + 1 + n) × 3].
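The shape bookkeeping of the fusion path can be sketched as follows in PyTorch; the conv helper is a stand-in with random weights used only to illustrate how the channel counts 768 and 384 arise from up-sampling and concatenation.

```python
import torch
import torch.nn.functional as F

def conv(x, out_ch):
    # stand-in for a conv block (conv + BN + activation); random weights, shapes only
    return torch.nn.Conv2d(x.shape[1], out_ch, kernel_size=1)(x)

feat6 = torch.randn(1, 256, 52, 52)    # SAN output for feature map 3
feat7 = torch.randn(1, 512, 26, 26)    # SAN output for feature map 4
feat8 = torch.randn(1, 1024, 13, 13)   # SAN output for feature map 5

feat9  = conv(feat8, 512)                                    # 512 x 13 x 13
feat10 = F.interpolate(conv(feat9, 256), scale_factor=2)     # 256 x 26 x 26
feat11 = torch.cat([feat10, feat7], dim=1)                   # 768 x 26 x 26
feat12 = conv(feat11, 256)                                   # 256 x 26 x 26
feat13 = F.interpolate(conv(feat12, 128), scale_factor=2)    # 128 x 52 x 52
feat14 = torch.cat([feat13, feat6], dim=1)                   # 384 x 52 x 52
feat15 = conv(feat14, 128)                                   # 128 x 52 x 52
print(feat11.shape, feat14.shape)
```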
In the embodiment, an SAN structure is added behind a main feature layer, and the SAN structure comprises a CAM module and an S-SAM module, so that noise information can be effectively reduced, target information can be enhanced, and the detection precision is improved to a certain extent.
Step 4: designing an OVAL_IOU loss function based on the constructed network model, and training the network model by using the training set.
First, in order to solve the problem of sudden increases in the loss value caused by the periodicity of the angle and the exchangeability of width and height when the smooth_L1 loss function regresses the target position, this embodiment designs the OVAL_IOU loss function to regress the rotated rectangular frame.
Specifically, the OVAL_IOU loss function is expressed as:
Loss = L_conf + L_cls + L_attention + L_reg
wherein L_conf denotes the confidence loss, calculated as:
L_conf = (1/N) Σ_{k=1}^{N} [ I_k^obj · BCE(p_k, y_k) + I_k^noobj · BCE(p_k, y_k) ]
wherein N denotes the number of all prediction boxes; I_k^obj indicates whether a target exists in the k-th prediction box, taking 1 if a target exists and 0 otherwise; I_k^noobj indicates whether no target exists in the k-th prediction box, taking 1 if no target exists and 0 otherwise; p_k denotes the probability that the k-th prediction box contains a target; y_k is the label value indicating whether there is a target in the k-th prediction box, with y_k = 1 denoting a target and y_k = 0 denoting no target; and BCE is the cross-entropy loss function.
In this embodiment, BCE is calculated as:
BCE(p_k, y_k) = -[y_k · log(p_k) + (1 - y_k) · log(1 - p_k)]
L_cls denotes the class loss, calculated as:
L_cls = (1/N) Σ_{n=1}^{N} I_n^obj · Σ_c BCE(p_n(c), t_n(c))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; p_n(c) denotes the probability that the n-th prediction box belongs to class c; and t_n(c) is the corresponding label value.
L_attention denotes the attention loss, calculated as:
L_attention = (1/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} BCE(u_ij, û_ij)
wherein h and w respectively denote the height and width of the foreground score map obtained after the feature layer passes through the S-SAM module; BCE is the cross-entropy loss function; u_ij denotes the foreground score of the pixel at position (i, j); and û_ij is the label value of the corresponding position, with û_ij = 1 when the position belongs to the foreground and û_ij = 0 otherwise.
L_reg denotes the position regression loss, calculated as:
L_reg = (1/N) Σ_{n=1}^{N} I_n^obj · (1 - OVAL_IOU(b_n, g_n))
wherein b_n denotes the n-th prediction box and g_n its matched ground-truth box.
the OVAL _ IOU is a function for approximately calculating two rectangular frames IOU in any direction by using an ellipse, and the specific calculation process is as follows:
any ellipse in the plane can be represented as:
Figure BDA0003723672720000111
wherein a is a long axis, b is a short axis, and alpha is an included angle alpha epsilon (0 DEG, 180 DEG) of a long axis x axis]Will { x c ,y c The rotated rectangular box approximation, denoted w _ l, h _ l, θ _ l, is represented by an ellipse as:
Figure BDA0003723672720000112
defining:
Figure BDA0003723672720000113
Figure BDA0003723672720000114
Figure BDA0003723672720000115
wherein (i, j) represents the coordinates of the pixel point, oval represents the coordinates as { x } c ,y c And judging whether the pixel point P (i, j) is in the ellipse or not by using P (i, j, oval), wherein the oval P (i, j, oval) is 1, which indicates that the pixel point P (i, j) is in the oval, and otherwise, the pixel point P (i, j) is not in the oval. The intersection set of the two ellipses is then:
Figure BDA0003723672720000116
Figure BDA0003723672720000117
because P (i, j, oval) is not derivable, P (i, j, oval) is approximately replaced by a continuously derivable function F (i, j, oval), whose expression is as follows:
F(i,j,oval)=K(k_1+k_2,0.25)
wherein
Figure BDA0003723672720000118
k is an adjustable parameter for controlling the sensitivity of the target pixel.
The intersection and union formulas are accordingly updated as:
Inter = Σ_(i,j) F(i, j, oval_1) · F(i, j, oval_2)
Union = Σ_(i,j) F(i, j, oval_1) + Σ_(i,j) F(i, j, oval_2) - Inter
Then the OVAL_IOU calculation formula is:
OVAL_IOU = Inter / Union
the invention provides a new Loss function-OVAL _ IOU Loss. The OVAL _ IOU uses an ellipse to approximately replace a rectangular frame in any direction, then the IOUs of two ellipses are calculated as any two rotation rectangular frame IOUs, and the OVAL enables the optimization regression task of the positions of the rotation rectangular frames to be consistent with the measurement standard of the evaluation method, namely the IOUs of two target frames are used, so that the target position regression is more direct and effective.
And then, training the network model constructed in the step 3 by using training set data based on the constructed OVAL _ IOU loss function to obtain the trained network model.
Step 5: testing the test set by using the trained network model to obtain the detection results.
Specifically, assuming that the width and height of the input picture are 416 × 416, a total of 10647 detection results can be obtained, namely 13 × 13 × 3+26 × 26 × 3+52 × 52 × 3.
Step 6: carrying out post-processing on the detection results to obtain the final target position information.
61) Calculating the confidence value of each grid according to the detection result.
Specifically, after the picture is divided into n × n grids (n ∈ {13, 26, 52}), the prediction result of the (i, j)-th grid is (t_ijkx, t_ijky, t_ijkw, t_ijkh, t_ijkθ, p_ijkobj, p_ijkc1, p_ijkc2, ..., p_ijkcn).
Here i = 1, 2, ..., n and j = 1, 2, ..., n index the grids, and k = 1, 2, 3 indicates that there are three prediction boxes per grid; t_ijkx, t_ijky, t_ijkw, t_ijkh and t_ijkθ respectively denote the offsets of the k-th prediction box of the (i, j)-th grid relative to the center position, width, height and angle of the anchor box; p_ijkobj is the probability that the k-th prediction box of the (i, j)-th grid contains a target; n denotes the number of classes; and p_ijkcm denotes the probability that the target in the k-th prediction box of the (i, j)-th grid belongs to the m-th class (m = 1, 2, ..., n).
The confidence calculation formula is:
conf_ijk = p_ijkobj · max(p_ijkc1, p_ijkc2, ..., p_ijkcn)
A confidence threshold conf_thr is set, prediction boxes with conf_ijk < conf_thr are removed, and the prediction result is converted into (t_ijkx, t_ijky, t_ijkw, t_ijkh, t_ijkθ, conf_ijk, cls_id), where cls_id (cls_id = 0, 1, ..., n - 1) denotes the index of the class corresponding to the maximum value among (p_ijkc1, p_ijkc2, ..., p_ijkcn).
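Step 61) can be sketched as follows; the confidence threshold of 0.3 is an illustrative value.

```python
import numpy as np

def filter_predictions(offsets, p_obj, p_cls, conf_thr=0.3):
    """offsets: (M, 5) raw box offsets; p_obj: (M,) objectness; p_cls: (M, n_classes)."""
    conf = p_obj * p_cls.max(axis=1)          # conf_ijk = p_ijkobj * max class probability
    cls_id = p_cls.argmax(axis=1)
    keep = conf >= conf_thr
    return offsets[keep], conf[keep], cls_id[keep]

offsets = np.random.randn(10, 5)
kept, confs, ids = filter_predictions(offsets, np.random.rand(10), np.random.rand(10, 4))
print(kept.shape, confs, ids)
```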
62) Calculating the prediction boxes according to the detection result and the anchor box position information.
Specifically, based on the predicted offsets t_ijkx, t_ijky, t_ijkw, t_ijkh, t_ijkθ and the center, width, height and angle of the anchor box (x_aijk, y_aijk, w_aijk, h_aijk, θ_aijk), the prediction box (x_pijk, y_pijk, w_pijk, h_pijk, θ_pijk) is obtained by the following calculation formulas:
x_pijk = x_aijk + t_ijkx
y_pijk = y_aijk + t_ijky
w_pijk = w_aijk · exp(t_ijkw)
h_pijk = h_aijk · exp(t_ijkh)
θ_pijk = θ_aijk + t_ijkθ
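Step 62) follows the update equations above directly, as the following sketch shows; the example offsets and anchor are arbitrary.

```python
import numpy as np

def decode_box(t, anchor):
    """t: (tx, ty, tw, th, ttheta); anchor: (xa, ya, wa, ha, theta_a)."""
    tx, ty, tw, th, ttheta = t
    xa, ya, wa, ha, theta_a = anchor
    return (xa + tx, ya + ty, wa * np.exp(tw), ha * np.exp(th), theta_a + ttheta)

print(decode_box((1.5, -0.5, 0.1, 0.2, 5.0), (100.0, 50.0, 40.0, 20.0, 30.0)))
```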
63) Screening the prediction boxes based on the confidence values to obtain the final target position information.
Since there are many duplicate prediction boxes among those obtained in step 62), they need to be removed. In this embodiment, the NMS (non-maximum suppression) algorithm is used to remove redundant prediction boxes and retain the prediction box with the maximum confidence; the specific implementation steps are as follows:
a) setting an IOU threshold IOU_thr for deleting redundant prediction boxes;
b) sorting all prediction boxes according to cls_id;
c) taking out the prediction boxes of one class at a time, and building a candidate box list in descending order of confidence;
d) selecting the box B with the highest confidence, adding it to the output list, and deleting it from the candidate box list;
e) while the candidate box list is not empty, calculating the IOU values between box B and all boxes in the candidate box list, and deleting the boxes whose IOU is greater than IOU_thr;
f) repeating steps d) and e) until the candidate box list is empty;
g) repeating steps c), d), e) and f) until all classes have been processed, and returning the output list.
Thus, the final target position information is obtained.
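Steps a) to g) can be sketched as follows; the IOU function is passed in as a parameter (in practice a rotated IOU such as the elliptical oval_iou sketched earlier would be used), and the threshold and example values are illustrative.

```python
def rotated_nms(boxes, scores, cls_ids, iou_fn, iou_thr=0.3):
    """Greedy per-class suppression; boxes are (xc, yc, w_l, h_l, theta_l) tuples."""
    keep = []
    for c in set(cls_ids):
        idx = [i for i, k in enumerate(cls_ids) if k == c]
        idx.sort(key=lambda i: scores[i], reverse=True)   # descending confidence
        while idx:
            best = idx.pop(0)
            keep.append(best)
            idx = [i for i in idx if iou_fn(boxes[best], boxes[i]) <= iou_thr]
    return keep

boxes = [(100.0, 100.0, 60.0, 20.0, 30.0), (102.0, 101.0, 60.0, 20.0, 32.0), (300.0, 300.0, 40.0, 40.0, 0.0)]
scores = [0.9, 0.8, 0.7]
cls_ids = [0, 0, 1]
# A toy overlap measure stands in for the elliptical oval_iou here.
close_centres = lambda a, b: 1.0 if abs(a[0] - b[0]) + abs(a[1] - b[1]) < 10 else 0.0
print(rotated_nms(boxes, scores, cls_ids, close_centres))   # indices of the kept boxes
```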
According to the invention, by adding a supervised attention structure to the deep learning network, noise information is effectively reduced and target information is enhanced; an OVAL_IOU loss function is designed to regress the rotated rectangular frame, which eliminates the sudden increase of the loss value in existing methods and keeps the optimization regression task of the rotated rectangular frame position consistent with the metric used by the evaluation method, so that target position regression is more direct and effective. The problems of a high false alarm rate and a high missed detection rate in the prior art are thereby overcome, and detection precision and detection efficiency are improved.
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. A remote sensing image target detection method in any direction based on deep learning is characterized by comprising the following steps:
acquiring a remote sensing image data set, and dividing the remote sensing image data set into a training set and a test set;
performing data enhancement on the training set and marking;
constructing a deep learning network model with an SAN structure;
designing an OVAL _ IOU loss function based on the constructed network model, and training the network model by using the training set;
testing the test set by using the trained network model to obtain a detection result;
and carrying out post-processing on the detection result to obtain final target position information.
2. The method for detecting the target in any direction based on the remote sensing image of the deep learning according to claim 1, wherein the step of performing data enhancement and marking on the training set comprises the following steps:
cutting the data of the training set into proper sizes and performing data enhancement on the data;
marking the target box in the training set by adopting an OpenCV representation method;
converting the OpenCV representation into a long-edge representation, marking the target boxes in the training set as:
{x_c, y_c, w_l, h_l, θ_l};
(w_l, h_l, θ_l) = (w, h, θ) if w ≥ h, and (h, w, θ + 90°) otherwise;
wherein (x_c, y_c) is the center coordinate of the target frame, w represents the first edge met when rotating clockwise from the x-axis, h is the other adjacent edge, w_l ≥ h_l, θ is the angle swept by the rotation, θ ∈ (0°, 90°], and θ_l represents the angle swept when rotating clockwise from the x-axis to w_l, θ_l ∈ (0°, 180°].
3. The method for detecting the target in any direction based on the deep learning of the remote sensing image as claimed in claim 1, wherein the step of constructing the deep learning network model with the SAN structure comprises the following steps:
adopting darknet53 as the backbone feature extraction network to build a deep learning network framework, wherein the deep learning network framework comprises a feature extraction module, a SAN module, a feature fusion module and a feature integration module; wherein:
the feature extraction module takes the image to be detected as input, and outputs a feature map to the SAN module after feature extraction;
the SAN module is used for performing feature enhancement on the feature graph output by the feature extraction module and outputting the enhanced feature graph to the feature fusion module;
the feature fusion module is used for sampling deep features rich in semantic information and fusing the deep features with shallow features rich in outline information according to the feature map output by the attention module to obtain fusion features;
the feature integration module is used for performing feature integration on the fusion features to obtain detection results of different scales.
4. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 3, wherein the SAN module comprises a CAM module and an S-SAM module, which weight the feature map from different angles, so that feature enhancement is achieved.
5. The method for detecting the target in any direction based on the deep learning of the remote sensing image as claimed in claim 1, wherein the loss function is as follows:
Loss = L_conf + L_cls + L_attention + L_reg
wherein L_conf denotes the confidence loss, L_cls the class loss, L_attention the attention loss, and L_reg the position regression loss.
6. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the confidence loss L_conf is calculated as:
L_conf = (1/N) Σ_{k=1}^{N} [ I_k^obj · BCE(p_k, y_k) + I_k^noobj · BCE(p_k, y_k) ]
wherein N denotes the number of all prediction boxes; I_k^obj indicates whether a target exists in the k-th prediction box, taking 1 if a target exists and 0 otherwise; I_k^noobj indicates whether no target exists in the k-th prediction box, taking 1 if no target exists and 0 otherwise; p_k denotes the probability that the k-th prediction box contains a target; y_k is the label value indicating whether there is a target in the k-th prediction box, with y_k = 1 denoting a target and y_k = 0 denoting no target; and BCE is the cross-entropy loss function.
7. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the class loss L_cls is calculated as:
L_cls = (1/N) Σ_{n=1}^{N} I_n^obj · Σ_c BCE(p_n(c), t_n(c))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; BCE is the cross-entropy loss function; p_n(c) denotes the probability that the n-th prediction box belongs to class c; and t_n(c) is the corresponding label value.
8. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the attention loss L_attention is calculated as:
L_attention = (1/(h·w)) Σ_{i=1}^{h} Σ_{j=1}^{w} BCE(u_ij, û_ij)
wherein h and w respectively denote the height and width of the foreground score map obtained after the feature layer passes through the S-SAM module; BCE is the cross-entropy loss function; u_ij denotes the foreground score of the pixel at position (i, j); and û_ij is the label value of the corresponding position, with û_ij = 1 when the position belongs to the foreground and û_ij = 0 otherwise.
9. The method for detecting a target in any direction of a remote sensing image based on deep learning according to claim 5, wherein the position regression loss L_reg is calculated as:
L_reg = (1/N) Σ_{n=1}^{N} I_n^obj · (1 - OVAL_IOU(b_n, g_n))
wherein N denotes the number of all prediction boxes; I_n^obj and I_n^noobj are binary values, with I_n^obj = 1 indicating that the n-th prediction box contains a target and I_n^noobj = 1 indicating that the n-th prediction box does not contain a target; b_n denotes the n-th prediction box and g_n its matched ground-truth box; and OVAL_IOU is a function that computes the IOU of two arbitrarily oriented rectangular boxes by ellipse approximation.
10. The method for detecting the target in any direction based on the deep learning of the remote sensing image as claimed in claim 1, wherein the step of performing post-processing on the test result to obtain the position information of the target comprises the following steps:
calculating a confidence value of each grid according to the detection result;
calculating a prediction frame according to the detection result and the anchor frame position information;
and screening the prediction frame based on the confidence value to obtain final target position information.
CN202210759294.0A 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning Pending CN115115936A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759294.0A CN115115936A (en) 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759294.0A CN115115936A (en) 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning

Publications (1)

Publication Number Publication Date
CN115115936A true CN115115936A (en) 2022-09-27

Family

ID=83331094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759294.0A Pending CN115115936A (en) 2022-06-30 2022-06-30 Remote sensing image target detection method in any direction based on deep learning

Country Status (1)

Country Link
CN (1) CN115115936A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116681983A (en) * 2023-06-02 2023-09-01 中国矿业大学 Long and narrow target detection method based on deep learning
CN116681983B (en) * 2023-06-02 2024-06-11 中国矿业大学 Long and narrow target detection method based on deep learning

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN108596101B (en) Remote sensing image multi-target detection method based on convolutional neural network
CN109934153B (en) Building extraction method based on gating depth residual error optimization network
CN111738112B (en) Remote sensing ship image target detection method based on deep neural network and self-attention mechanism
CN108460341B (en) Optical remote sensing image target detection method based on integrated depth convolution network
CN112488210A (en) Three-dimensional point cloud automatic classification method based on graph convolution neural network
CN112396002A (en) Lightweight remote sensing target detection method based on SE-YOLOv3
Wang et al. A deep-learning-based sea search and rescue algorithm by UAV remote sensing
CN110189304B (en) Optical remote sensing image target on-line rapid detection method based on artificial intelligence
CN109886066A (en) Fast target detection method based on the fusion of multiple dimensioned and multilayer feature
CN111753677B (en) Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN111126359A (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN114612835A (en) Unmanned aerial vehicle target detection model based on YOLOv5 network
CN115830471B (en) Multi-scale feature fusion and alignment domain self-adaptive cloud detection method
CN110490155B (en) Method for detecting unmanned aerial vehicle in no-fly airspace
CN113850783B (en) Sea surface ship detection method and system
CN114998566A (en) Interpretable multi-scale infrared small and weak target detection network design method
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN112084897A (en) Rapid traffic large-scene vehicle target detection method of GS-SSD
Chen et al. Remote sensing image ship detection under complex sea conditions based on deep semantic segmentation
Cao et al. Multi angle rotation object detection for remote sensing image based on modified feature pyramid networks
CN115170816A (en) Multi-scale feature extraction system and method and fan blade defect detection method
CN115497002A (en) Multi-scale feature fusion laser radar remote sensing classification method
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN115995042A (en) Video SAR moving target detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination