CN116452936B - Rotation target detection method fusing optical and SAR image multi-modal information - Google Patents

Rotation target detection method fusing optical and SAR image multi-modal information

Info

Publication number
CN116452936B
Authority
CN
China
Prior art keywords
output
convolution
feature
image
sar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310446031.9A
Other languages
Chinese (zh)
Other versions
CN116452936A (en)
Inventor
徐凯
刘思远
汪安铃
汪子羽
梁栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202310446031.9A
Publication of CN116452936A
Application granted
Publication of CN116452936B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

Compared with the prior art, the method effectively addresses the problem that targets cannot be accurately identified because of the various interferences introduced during remote sensing image acquisition and transmission, and it also copes with targets of different categories that differ in shape, size and color or that overlap and occlude one another. The invention comprises the following steps: feature extraction from the SAR image dataset and the optical image dataset; cross-modal multi-scale feature fusion; and a two-stage rotation prediction head that locates and classifies targets at different angles. The invention alleviates the failures of single-modality detection caused by weather, illumination, object color and the like, as well as the low positioning accuracy of the rotated box, thereby improving the accuracy and efficiency of target detection.

Description

Rotation target detection method fusing optical and SAR image multi-modal information
Technical Field
The invention relates to the field of remote sensing image target detection, and in particular to a rotation target detection method that fuses optical and SAR image multi-modal information.
Background
Remote sensing image target detection is a technology that identifies and locates targets in remote sensing image data, and it has important application value in many fields, such as urban planning, agricultural resource management and environmental monitoring. However, various interferences may be introduced while remote sensing images are collected and transmitted, different target categories differ in shape, size, color and other characteristics, and targets may overlap or occlude one another, so that targets cannot always be identified accurately. Cross-modal remote sensing target detection refers to techniques that detect and identify targets across different remote sensing image modalities; it can improve the accuracy and robustness of remote sensing target detection and broaden its application range and scenarios.
Cross-modal remote sensing target detection fuses information from multi-source remote sensing data: by combining remote sensing images from different sensors or different wave bands, more comprehensive and accurate target information is obtained. Compared with a single-modality remote sensing image, cross-modal remote sensing imagery offers more wave bands and feature information and can provide better results for target detection and classification. In a single-modality remote sensing image, the data collected by the sensor only provide information from a specific wave band, so performance on detection and classification of some complex targets is not ideal. Cross-modal remote sensing imagery, by exploiting multiple wave bands and kinds of feature information, can effectively improve the accuracy and robustness of target detection and classification.
Remote sensing target detection typically requires a trade-off between high accuracy and high efficiency. Accurate target detection requires detection results whose position and shape are precise enough to represent the position, shape, size and other properties of the target. However, single-modality target detection is affected by remote sensing data quality, data annotation, target category and the like, and tends to produce recognition of limited accuracy, which restricts subsequent applications. The invention extracts features of images of the same target in several modalities and exploits the differences between the modality features so that the fused feature maps carry more feature information, allowing the subsequent two-stage rotation head to locate and classify targets more accurately. At present, domestic papers and patents on rotation target detection methods that fuse optical and SAR multi-modal images are still scarce.
Disclosure of Invention
The invention aims to solve the inaccuracy of single-modality remote sensing target detection caused by errors introduced during data acquisition, transmission and the like, and for this purpose provides a rotation target detection method that fuses optical and SAR image multi-modal information.
In order to achieve the above object, the technical scheme of the present invention is as follows:
A method for detecting a rotating target by combining optical and SAR image multi-mode information comprises the following steps:
11) Preparation of rotation target detection data and feature extraction fusing optical and SAR image multi-modal information: dividing and cutting the acquired remote sensing image data set; constructing a Transformer-UNet network based on an encoder-decoder structure to extract features of the remote sensing data;
12) Establishing a multi-modal multi-scale feature fusion module: constructing a framework for multi-modal feature fusion, extracting the multi-modal difference features and common features with a differential enhancement module and a common selection module, and fusing them into a multi-modal feature map;
13) Establishing a two-stage rotation prediction head module: constructing a two-stage prediction head that performs a second refinement on top of the first-stage classification and localization;
14) Training and tuning the established rotation target detection network fusing optical and SAR image multi-modal information with the divided training set and its labels until the preset number of epochs is reached, and finally keeping the corresponding parameters and the trained network;
15) Using the rotation target detection network fusing optical and SAR image multi-modal information obtained in step 14), inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original image through visualization.
The preparation of the rotation target detection data fusing optical and SAR image multi-modal information and the feature extraction comprise the following steps:
21) Dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2, and cutting the images into non-overlapping patches of a uniform size of 256 × 256 (a data-preparation sketch follows this item);
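The following is an illustrative sketch of the 6:2:2 split and the non-overlapping 256 × 256 cropping described in step 21). The file-handling details (paths, naming) are assumptions, not taken from the patent.

```python
import random
from pathlib import Path

import numpy as np
from PIL import Image


def split_dataset(image_paths, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split a list of image paths into train/validation/test sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]


def crop_non_overlapping(image_path, out_dir, size=256):
    """Cut an image into non-overlapping size x size tiles (border remainders are discarded)."""
    img = np.array(Image.open(image_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    h, w = img.shape[:2]
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            tile = img[top:top + size, left:left + size]
            Image.fromarray(tile).save(out_dir / f"{Path(image_path).stem}_{top}_{left}.png")
```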
22) Constructing a pair of parallel encoder-decoder Transformer-UNet networks A and B, where network A processes the optical remote sensing image and network B processes the SAR remote sensing image (a sketch of the basic blocks follows this list);
221) Constructing the double-layer DoubleConv convolution module, which comprises two convolution layers, two normalization layers and two ReLU activation functions; each convolution layer uses a kernel size of 3, padding of 1 and stride of 1;
222) Constructing the downsampling structure for feature extraction, which consists of a DoubleConv module and a max pooling layer;
223) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages; it comprises two convolution layers with kernel size 1 and stride 1 and one convolution layer with kernel size 3 and stride 1;
224) Constructing the upsampling structure for feature extraction, which consists of a ConvLSTM layer and a convolution layer; the ConvLSTM unit comprises an input gate, a forget gate and an output gate, with kernel_size (3, 3) and stride (2, 2);
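A minimal PyTorch sketch of the building blocks in steps 221)-222): a DoubleConv block (two 3×3 convolutions, each followed by normalization and ReLU) and a downsampling stage (DoubleConv plus max pooling). The choice of BatchNorm2d is an assumption; the patent only says "normalization layer".

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class Down(nn.Module):
    """One encoder stage: double convolution followed by 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DoubleConv(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.conv(x)           # feature map kept for the skip connection
        return feat, self.pool(feat)  # pooled map goes to the next stage
```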
23) The feature extraction for rotation target detection fusing optical and SAR image multi-modal information proceeds as follows:
231) Inputting the preprocessed optical remote sensing image, SAR remote sensing image and label data into the convolutional neural network and training the downsampling feature extraction model with a self-attention mechanism, with the following specific steps:
232) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into the three channel features V_SAR, Q_SAR and K_SAR; then executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
233) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs, with the following steps (a self-attention sketch follows this item):
applying a convolution with a 1×1 kernel to convert the optical remote sensing image into the three-channel feature matrices V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR;
multiplying the transpose of Q_OPT with K_OPT (dot product), applying softmax to the result, multiplying it with V_OPT, and then taking a weighted sum with the original feature map to obtain the self-attention feature map of the optical image; the self-attention feature map of the SAR image is obtained in the same way;
extracting a support feature map and a query feature map from the self-attention feature maps, reshaping them, computing the relation between the two maps with the cosine distance, obtaining the corresponding weights through global average pooling and a non-linear network containing 2 convolution layers and a ReLU layer, and obtaining the feature correlation after dot-product multiplication and normalization; the cross-correlation module for the SAR remote sensing image is identical to that for the optical remote sensing image;
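A hedged sketch of the self-attention step in 232)-233): 1×1 convolutions produce Q, K and V, softmax(Qᵀ K) weights V, and the result is combined with the input feature map as a residual. The learnable residual weight gamma is an assumption standing in for the "weighted sum with the original feature map".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.k = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # residual weight (assumption)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.k(x).flatten(2)                   # (B, C', HW)
        v = self.v(x).flatten(2)                   # (B, C, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (B, HW, HW) attention weights
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                # weighted sum with the input map
```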
24) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages, consisting of three convolution layers (a sketch of this block follows this list):
the first convolution layer uses a 1×1 kernel to reduce the dimension, decreasing the number of input channels and the number of model parameters;
the second convolution layer uses a 3×3 kernel to convolve the feature map and extract features;
the third convolution layer uses a 1×1 kernel to raise the dimension, increasing the number of channels of the convolved feature map and the expressive capacity of the model;
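A sketch of the Bottleneck block in 24): a 1×1 convolution to reduce channels, a 3×3 convolution to extract features, and a 1×1 convolution to raise channels again. The ReLU activations and the absence of a residual path are assumptions not fixed by the patent text.

```python
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1),              # reduce dimension
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),  # extract features
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),             # raise dimension
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```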
25) Constructing the upsampling ConvLSTM path, with the following steps (a sketch of one decoder stage follows this list):
performing a deconvolution (also called transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output.
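A sketch of one decoder stage from step 25): transposed-convolution upsampling, concatenation with the matching encoder output, a 3×3 convolution with instance normalization and LeakyReLU, a ConvLSTM pass, and a second convolution. The single-step ConvLSTM with zero-initialized state is a simplifying assumption.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel_size, padding=pad)
        self.hidden_ch = hidden_ch

    def forward(self, x, state=None):
        b, _, h, w = x.shape
        if state is None:
            zeros = x.new_zeros(b, self.hidden_ch, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c_prev + i * g
        h_new = o * torch.tanh(c)
        return h_new, (h_new, c)


class UpStep(nn.Module):
    """Upsample, fuse with the skip connection, then conv + ConvLSTM + conv."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv1 = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))
        self.lstm = ConvLSTMCell(out_ch, out_ch)
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                   # upsampled output
        x = torch.cat([x, skip], dim=1)  # combined output
        x = self.conv1(x)                # convolution output
        x, _ = self.lstm(x)              # ConvLSTM output
        return self.conv2(x)             # next convolution output
```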
The multi-modal feature fusion module of the rotation target detection method fusing optical and SAR image multi-modal information comprises the following steps:
31) Constructing a multi-modal feature fusion framework for the optical and SAR remote sensing images, comprising a differential enhancement module and a common selection module (a fusion-module sketch follows this list);
311) The differential enhancement module performs the following steps:
performing a difference operation on the extracted optical image features and SAR image features to obtain the feature map of the differing part;
computing attention weights with an hourglass-shaped 1×1 convolution to obtain the respective attention maps;
adding the obtained attention maps to the original feature maps in a residual manner to obtain the enhanced feature maps;
weighting and summing the enhanced feature maps of the optical and SAR remote sensing images to obtain the differential enhancement feature map;
312) The common selection module performs the following steps:
performing an addition operation on the extracted optical image features and SAR image features to obtain the feature map of the common part;
obtaining the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image from the common-part feature map through softmax;
multiplying the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image respectively with the input feature maps to obtain new feature maps;
weighting and summing the new feature maps of the optical and SAR remote sensing images to obtain the common-module feature map.
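A hedged sketch of the fusion modules in 311)-312): the differential-enhancement branch works on the difference of the two modality feature maps, the common-selection branch on their sum. The exact form of the "hourglass 1×1 convolution" (squeeze then expand channels), the sigmoid/softmax placement and the equal fusion weights are assumptions.

```python
import torch
import torch.nn as nn


class DifferentialEnhancement(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Hourglass-style 1x1 convolutions: squeeze channels, then restore them.
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, f_opt, f_sar):
        diff = f_opt - f_sar             # feature map of the differing part
        a = self.attn(diff)              # attention map computed from the difference
        e_opt = f_opt + a * f_opt        # residual-style enhancement of each modality
        e_sar = f_sar + a * f_sar
        return 0.5 * (e_opt + e_sar)     # weighted sum -> differential enhancement map


class CommonSelection(nn.Module):
    def forward(self, f_opt, f_sar):
        common = f_opt + f_sar           # feature map of the common part
        w = torch.softmax(common.flatten(2), dim=-1).view_as(common)
        return 0.5 * (w * f_opt + w * f_sar)  # weighted sum -> common-module map


class CrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.diff = DifferentialEnhancement(channels)
        self.common = CommonSelection()

    def forward(self, f_opt, f_sar):
        return self.diff(f_opt, f_sar) + self.common(f_opt, f_sar)
```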
The two-stage rotation prediction head network of the rotation target detection method fusing optical and SAR image multi-modal information comprises the following steps:
41) Constructing a feature pyramid structure to realize feature stitching, whose outputs serve as the inputs to the prediction heads (a stitching sketch follows this list), with the following specific steps:
411) Taking 4 feature maps of different sizes as input; the highest-level feature map passes through a C3+conv block to obtain a set of feature maps of the same size as the next level, is concatenated with the next-level feature maps, and passes through another C3+conv block to obtain a new set of feature maps; this process is repeated until the lowest level is reached;
412) Starting from the lowest-level feature map, outputting it to its head, concatenating it with the information output by the adjacent level, passing the result through a C3+conv block as a new output that also serves as the input to the next level, concatenating again with the output of that level and passing through a C3+conv block as a new output; this process is repeated until the highest level is reached;
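A simplified sketch of the top-down stitching in 411): each higher-level feature map is processed, resized to the next level, concatenated with it, and passed through another block. The c3_conv helper here is a plain stand-in (two 3×3 convolutions) for the C3+conv block, whose internals the patent does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def c3_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))


class TopDownNeck(nn.Module):
    def __init__(self, channels):  # channels per level, ordered highest -> lowest
        super().__init__()
        self.blocks = nn.ModuleList(
            c3_conv(channels[i] + channels[i + 1], channels[i + 1])
            for i in range(len(channels) - 1))

    def forward(self, feats):      # feats[0] is the highest (smallest) level
        out = [feats[0]]
        x = feats[0]
        for blk, lower in zip(self.blocks, feats[1:]):
            x = F.interpolate(x, size=lower.shape[-2:], mode="nearest")  # match size
            x = blk(torch.cat([x, lower], dim=1))                        # concatenate + C3+conv
            out.append(x)
        return out                 # per-level maps handed to the prediction heads
```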
42) Constructing the rotated box for remote sensing target detection and realizing target localization in two stages (a sample-selection sketch follows this list), with the following specific steps:
421) The first-stage anchor optimization module (ARM) uses an adaptive training sample selection (ATSS) strategy to adjust horizontal anchors into high-quality rotated anchors, as follows:
extracting all horizontal anchor points from the input feature image and treating them as first-stage candidate samples;
computing, for each candidate sample, the ratio between its center distance to every real target and the target size, and dividing all candidate samples into positive and negative samples by jointly considering these two factors;
for the positive samples, generating a group of high-quality rotated anchors, centered on the corresponding real targets, as the first-stage positive samples;
422) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples; these are input into the target detection network for classification and regression and screened according to the prediction results and their IoU with the real targets, and finally the sample with the largest IoU is selected as the positive sample for refinement, as follows:
inputting the rotated anchors obtained in the first stage into the target detection network to obtain detection results;
computing, from the detection results, the IoU between each rotated anchor and its corresponding real target, and selecting the sample with the largest IoU as the second-stage positive sample;
taking the positive samples obtained in the second stage as input positive samples and performing classification and regression through the target detection network again, further improving detection accuracy.
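An illustrative sketch of the two-stage sample selection in 421)-422). Stage one marks horizontal anchors as positive when they lie close to a ground-truth center relative to the target size (an ATSS-like rule, simplified here to a fixed threshold); stage two keeps, per ground truth, the predicted rotated box with the highest IoU as the positive sample for refinement. The rotated-IoU function is passed in abstractly because its exact form is not given in the patent.

```python
import torch


def stage_one_positives(anchor_centers, gt_centers, gt_sizes, thresh=0.5):
    """anchor_centers: (A, 2); gt_centers: (G, 2); gt_sizes: (G,) target scale."""
    dist = torch.cdist(anchor_centers, gt_centers)   # (A, G) center distances
    ratio = dist / gt_sizes.unsqueeze(0)             # distance relative to target size
    min_ratio, gt_idx = ratio.min(dim=1)             # closest ground truth per anchor
    positive = min_ratio < thresh                    # first-stage positive mask
    return positive, gt_idx


def stage_two_positives(rotated_iou, pred_boxes, gt_boxes):
    """Keep, for each ground truth, the prediction with the highest rotated IoU."""
    ious = rotated_iou(pred_boxes, gt_boxes)         # (P, G), user-supplied IoU function
    best_iou, best_pred = ious.max(dim=0)
    return best_pred, best_iou
```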
The network model training and result acquisition proceed as follows:
51) Inputting the preprocessed remote sensing image data into the rotation target detection network fusing optical and SAR image multi-modal information;
52) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR; executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
53) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs;
54) Performing a deconvolution on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output;
55) Inputting the multi-scale feature maps extracted from the two modalities into the cross-modal feature fusion module;
56) The differential enhancement module obtains the differing-part feature map of the optical and SAR images through a difference operation, enhances the original feature maps with the attention weights to obtain the enhanced feature maps, and produces the differential enhancement feature map through weighted summation;
57) The common selection module obtains the common-part feature map of the optical and SAR images through an addition operation, obtains the attention maps through softmax, and multiplies the attention maps with the original feature maps to obtain new feature maps;
58) The differential enhancement feature map and the common selection feature map are weighted and summed to obtain the cross-modal feature map;
59) Feature stitching of the 4 feature maps of different sizes: the highest-level feature map passes through a C3+conv block to obtain a set of feature maps of the same size as the next level, is concatenated with the next-level feature maps, and passes through another C3+conv block to obtain a new set of feature maps; this process is repeated until the lowest level is reached;
510) Starting from the lowest-level feature map, outputting it to its head, concatenating it with the information output by the adjacent level, passing the result through a C3+conv block as a new output that also serves as the input to the next level, concatenating again and passing through a C3+conv block as a new output; this process is repeated until the highest level is reached;
511) Inputting the feature maps into the prediction head; in the first stage, the ARM module adjusts horizontal anchors into high-quality rotated anchors using the ATSS strategy;
512) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples, inputs them into the target detection network for classification and regression, screens them according to the prediction results and their IoU with the real targets, and selects the sample with the largest IoU as the positive sample for refinement;
513) Computing the loss function and back-propagating to update the weight parameters;
514) Judging whether the number of training epochs reaches the preset number; if so, the trained detection model is obtained, otherwise returning to 52) to reload data and continue training (a schematic training-loop sketch follows this list);
515) Using the obtained rotation target detection network fusing optical and SAR image multi-modal information, inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original image through visualization.
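A schematic training loop for steps 51)-514): forward pass through the fusion detection network, loss computation, back-propagation, and the epoch check. The names model, criterion and train_loader are placeholders; the patent does not fix the optimizer, learning rate or the loss formulation.

```python
import torch


def train(model, criterion, train_loader, epochs, lr=1e-3, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):                      # 514) stop at the preset epoch count
        for opt_img, sar_img, targets in train_loader:
            opt_img, sar_img = opt_img.to(device), sar_img.to(device)
            preds = model(opt_img, sar_img)          # 52)-512) forward pass through the network
            loss = criterion(preds, targets)         # 513) loss computation
            optimizer.zero_grad()
            loss.backward()                          # back-propagate to the weight parameters
            optimizer.step()
    torch.save(model.state_dict(), "rotated_detector.pth")  # keep the trained parameters
    return model
```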
Advantageous effects
Compared with the prior art, in the rotation target detection method fusing optical and SAR image multi-modal information, the acquired optical and SAR remote sensing images first obtain good local-feature and global-structure information through a downsampling network equipped with an attention mechanism and a cross-correlation mechanism. The upsampling ConvLSTM lets these features better capture the relations and interactions between different positions and time points. Cross-modal feature fusion merges the features extracted from the two modalities, improving the expressive power and robustness of the features and making the model suitable for more complex and variable application scenarios, while the fused multi-scale feature maps make target localization and classification more accurate through the two-stage rotated box. In addition, in remote sensing image target detection, the method addresses the failure to accurately identify targets caused by the various interferences of image acquisition and transmission, as well as the problems of targets of different categories differing in shape, size and color or overlapping and occluding one another. The proposed method performs cross-modal feature fusion on remote sensing images of different modalities, so that more target features are available during detection and the localization and classification accuracy is greatly improved.
Drawings
FIG. 1 is a flow chart of the rotation target detection method fusing optical and SAR image multi-modal information;
FIG. 2 is a schematic diagram of a model structure of a method for detecting a rotating target by combining optical and SAR image multi-mode information;
FIG. 3 is a schematic diagram of a rotation target detection feature extracted from an attention mechanism module that fuses optical and SAR image multi-modality information;
FIG. 4 is a schematic diagram of a rotational target detection feature extraction cross-correlation module that fuses optical and SAR image multi-modality information;
FIG. 5 is a schematic diagram of a rotation target detection feature fusion structure that fuses optical and SAR image multi-modality information;
FIG. 6 is a schematic diagram of a structure of a rotation target detection feature fusion differential enhancement module for fusing optical and SAR image multi-modal information;
FIG. 7 is a schematic diagram of a structure of a rotation target detection feature fusion common selection module fusing optical and SAR image multi-modal information;
FIG. 8 is a schematic diagram of a rotating target detection two-stage rotating frame structure fusing optical and SAR image multi-modal information;
fig. 9 is a schematic diagram of a rotating target detection network result integrating optical and SAR image multi-modal information.
Detailed Description
For a further understanding and appreciation of the structural features and advantages achieved by the present invention, the following description of presently preferred embodiments is provided in connection with the accompanying drawings:
As shown in fig. 1, the method for detecting the rotation target by fusing optical and SAR image multi-mode information according to the present invention comprises the following steps:
Firstly, preparing the rotation target detection data fusing optical and SAR image multi-modal information and extracting features: dividing and cutting the acquired remote sensing image data set; constructing a Transformer-UNet network based on an encoder-decoder structure to extract features of the remote sensing data. The specific steps are as follows:
(1) Dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2, and cutting the images into non-overlapping patches of a uniform size of 256 × 256;
(2) Constructing the parallel encoder-decoder Transformer-UNet structure, where network A processes the optical remote sensing image and network B processes the SAR remote sensing image;
(2-1) Constructing the double-layer DoubleConv convolution module, which comprises two convolution layers, two normalization layers and two ReLU activation functions; each convolution layer uses a kernel size of 3, padding of 1 and stride of 1;
(2-2) Constructing the downsampling structure for feature extraction, which consists of a DoubleConv module and a max pooling layer;
(2-3) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages, comprising two convolution layers with kernel size 1 and stride 1 and one convolution layer with kernel size 3 and stride 1;
(2-4) Constructing the upsampling structure for feature extraction, which consists of a ConvLSTM layer and a convolution layer; the ConvLSTM unit comprises an input gate, a forget gate and an output gate, with kernel_size (3, 3) and stride (2, 2);
(3) The feature extraction for rotation target detection fusing optical and SAR image multi-modal information proceeds as follows:
(3-1) Inputting the preprocessed optical remote sensing image, SAR remote sensing image and label data into the convolutional neural network and training the downsampling feature extraction model with a self-attention mechanism, with the following specific steps:
(3-2) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into the three channel features V_SAR, Q_SAR and K_SAR; then executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
(3-3) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs, with the following steps:
applying a convolution with a 1×1 kernel to convert the optical remote sensing image into the three-channel feature matrices V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR;
multiplying the transpose of Q_OPT with K_OPT (dot product), applying softmax to the result, multiplying it with V_OPT, and then taking a weighted sum with the original feature map to obtain the self-attention feature map of the optical image; the self-attention feature map of the SAR image is obtained in the same way;
extracting a support feature map and a query feature map from the self-attention feature maps, reshaping them, computing the relation between the two maps with the cosine distance, obtaining the corresponding weights through global average pooling and a non-linear network containing 2 convolution layers and a ReLU layer, and obtaining the feature correlation after dot-product multiplication and normalization; the cross-correlation module for the SAR remote sensing image is identical to that for the optical remote sensing image;
(4) The Bottleneck layer used to connect the feature maps of the up-sampling and down-sampling phases is built up from three convolutional layers:
the convolution kernel of the first convolution layer is 1x1, and is used for reducing the dimension, reducing the number of input channels and reducing the number of model parameters;
the convolution kernel of the second convolution layer is 3x3 and is used for convoluting the feature map and extracting features;
the convolution kernel of the third convolution layer is 1x1, and is used for increasing the dimension, increasing the number of channels of the convolved feature map and increasing the expression capacity of the model;
(5) The upsampling ConvLSTM is constructed as follows:
performing a deconvolution (also called transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output;
Secondly, constructing the framework for multi-modal feature fusion, extracting the multi-modal difference features and common features with the differential enhancement module and the common selection module, and fusing them into a multi-modal feature map. The specific steps are as follows:
(1) Constructing the multi-modal feature fusion framework for the optical and SAR remote sensing images, comprising a differential enhancement module and a common selection module;
(1-1) Constructing the differential enhancement module: performing a difference operation on the extracted optical image features and SAR image features to obtain the feature map of the differing part;
computing attention weights with an hourglass-shaped 1×1 convolution to obtain the respective attention maps;
adding the obtained attention maps to the original feature maps in a residual manner to obtain the enhanced feature maps;
weighting and summing the enhanced feature maps of the optical and SAR remote sensing images to obtain the differential enhancement feature map;
(1-2) Constructing the common selection module: performing an addition operation on the extracted optical image features and SAR image features to obtain the feature map of the common part;
obtaining the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image from the common-part feature map through softmax;
multiplying the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image respectively with the input feature maps to obtain new feature maps;
weighting and summing the new feature maps of the optical and SAR remote sensing images to obtain the common-module feature map.
Thirdly, establishing a two-stage rotation prediction head module: constructing a two-stage prediction head module, and performing secondary fine tuning on the basis of the first-stage classification and positioning, wherein the method comprises the following specific steps of:
(1) The first-stage anchor optimization module (ARM) uses an adaptive training sample selection (ATSS) strategy to adjust horizontal anchors into high-quality rotated anchors;
(2) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples; these are input into the target detection network for classification and regression and screened according to the prediction results and their IoU with the real targets, and finally the sample with the largest IoU is selected as the positive sample for refinement.
Fourthly, training a rotation target detection model fusing optical and SAR image multi-mode information:
A rotation target detection model fusing optical and SAR image multi-modal information is constructed, and the processed remote sensing data images and labels are input into it to obtain a trained target detection network model. The training flow is shown in figure 1 and the structure of the target detection network in figure 2; the self-attention feature extraction is based on the Transformer-UNet structure, and the multi-modal feature fusion is shown in figure 5, yielding feature maps rich in feature information so that more features of the target are available during detection; the two-stage rotation prediction head shown in figure 8 greatly improves the localization and classification accuracy.
The method comprises the following specific steps:
51) Inputting the preprocessed remote sensing image data into the rotation target detection network fusing optical and SAR image multi-modal information;
52) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR; executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
53) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs;
54) Performing a deconvolution on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output;
55) Inputting the multi-scale feature maps extracted from the two modalities into the cross-modal feature fusion module;
56) The differential enhancement module obtains the differing-part feature map of the optical and SAR images through a difference operation, enhances the original feature maps with the attention weights to obtain the enhanced feature maps, and produces the differential enhancement feature map through weighted summation;
57) The common selection module obtains the common-part feature map of the optical and SAR images through an addition operation, obtains the attention maps through softmax, and multiplies the attention maps with the original feature maps to obtain new feature maps;
58) The differential enhancement feature map and the common selection feature map are weighted and summed to obtain the cross-modal feature map;
59) Feature stitching of the 4 feature maps of different sizes: the highest-level feature map passes through a C3+conv block to obtain a set of feature maps of the same size as the next level, is concatenated with the next-level feature maps, and passes through another C3+conv block to obtain a new set of feature maps; this process is repeated until the lowest level is reached;
510) Starting from the lowest-level feature map, outputting it to its head, concatenating it with the information output by the adjacent level, passing the result through a C3+conv block as a new output that also serves as the input to the next level, concatenating again and passing through a C3+conv block as a new output; this process is repeated until the highest level is reached;
511) Inputting the feature maps into the prediction head; in the first stage, the ARM module adjusts horizontal anchors into high-quality rotated anchors using the ATSS strategy;
512) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples, inputs them into the target detection network for classification and regression, screens them according to the prediction results and their IoU with the real targets, and selects the sample with the largest IoU as the positive sample for refinement;
513) Computing the loss function and back-propagating to update the weight parameters;
514) If the number of training epochs reaches the preset number, the trained detection model is obtained; otherwise, returning to step 52) to reload data and continue training.
Fifthly, obtaining the detection results of the rotation target detection network fusing optical and SAR image multi-modal information: inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original image through visualization (a visualization sketch follows this paragraph).
Fig. 9 is a schematic diagram of the results of the rotation target detection network fusing optical and SAR image multi-modal information; the detected targets include a wharf, automobiles and ships. As can be seen from fig. 9, the method achieves good localization and classification of the targets in the image.
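A hedged sketch of the visualization step: drawing predicted rotated boxes and class labels on the original image with OpenCV. The prediction format (cx, cy, w, h, angle in degrees, class id, score) is an assumption, not a format fixed by the patent.

```python
import cv2
import numpy as np


def draw_rotated_predictions(image, predictions, class_names):
    """predictions: iterable of (cx, cy, w, h, angle_deg, class_id, score)."""
    canvas = image.copy()
    for cx, cy, w, h, angle, cls_id, score in predictions:
        # Convert the rotated-box parameters into the 4 corner points and draw them.
        pts = cv2.boxPoints(((cx, cy), (w, h), angle)).astype(np.int32)
        cv2.polylines(canvas, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
        label = f"{class_names[int(cls_id)]} {score:.2f}"
        cv2.putText(canvas, label, (int(cx), int(cy)), cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, (0, 255, 0), 1)
    return canvas
```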
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. The method for detecting the rotating target by fusing the optical and SAR image multi-mode information is characterized by comprising the following steps of:
11) Preparation of rotation target detection data and feature extraction fusing optical and SAR image multi-modal information: dividing and cutting the acquired remote sensing image data set; constructing a Transformer-UNet network based on an encoder-decoder structure to extract features of the remote sensing data;
111) Dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2, and cutting the images into non-overlapping patches of a uniform size of 256 × 256;
112) Constructing the parallel encoder-decoder Transformer-UNet structure, where network A processes the optical remote sensing image and network B processes the SAR remote sensing image;
1121) Constructing the double-layer convolution module, which comprises two convolution layers, two normalization layers and two ReLU activation functions; each convolution layer uses a kernel size of 3, padding of 1 and stride of 1;
1122) Constructing the downsampling structure for feature extraction, which consists of a DoubleConv module and a max pooling layer;
1123) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages; it comprises two convolution layers with kernel size 1 and stride 1 and one convolution layer with kernel size 3 and stride 1;
1124) Constructing the upsampling structure for feature extraction, which consists of a ConvLSTM layer and a convolution layer; the ConvLSTM unit comprises an input gate, a forget gate and an output gate, with kernel_size (3, 3) and stride (2, 2);
113 The specific steps of rotation target detection feature extraction of the multi-mode information of the fusion optics and SAR image are as follows:
1131 Inputting the preprocessed optical remote sensing image, SAR remote sensing image and tag data into a convolutional neural network, and training a downsampling feature extraction model with a self-attention mechanism, wherein the method comprises the following specific steps of:
1132 Performing a normal convolution layer with a convolution kernel size of 1 x 1 to convert the optical remote sensing image into information V for each element in the optical image providing sequence OPT Weight Q of each element in an optical image providing sequence OPT For calculating the similarity K between Q and K in an optical image OPT Three channel characteristics; converting SAR remote sensing image into information V of each element in SAR image providing sequence SAR The SAR image provides the weight Q of each element in the sequence SAR For calculating the similarity K between Q and K in SAR images SAR Three channel characteristics; executing the encoder structure once to obtain 4 downsamplesOutputting a sample;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the input picture to obtain the first downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the first downsampled output to obtain the second downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the second downsampled output to obtain the third downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the third downsampled output to obtain the fourth downsampled output;
1133 ) The first downsampled output, the second downsampled output and the third downsampled output are each followed by a self-attention mechanism module and a cross-correlation module, comprising the following steps:
performing a convolution with a convolution kernel of 1×1 to convert the optical remote sensing image into a V_OPT, Q_OPT, K_OPT three-channel feature matrix, and the SAR remote sensing image into a V_SAR, Q_SAR, K_SAR three-channel feature matrix;
multiplying the transpose of Q_OPT with K_OPT by dot product, applying softmax to the result, multiplying with V_OPT by dot product, and then performing a weighted sum with the original feature map to obtain the optical image self-attention mechanism feature map; the SAR image self-attention mechanism feature map is obtained by the same process;
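A minimal sketch of the Q/K/V self-attention just described, shown for a single branch (the optical and SAR branches are identical in structure); the learnable residual weight gamma and the module names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Q/K/V self-attention over a feature map with a weighted residual add."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, kernel_size=1)
        self.k = nn.Conv2d(ch, ch, kernel_size=1)
        self.v = nn.Conv2d(ch, ch, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                           # (b, c, h*w)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)    # (b, h*w, h*w)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                        # weighted sum with original features
```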
extracting a support feature map and a query feature map from the self-attention mechanism feature map, reshaping them, generating the relation between the two maps using the cosine distance, obtaining the corresponding weights through global average pooling and a nonlinear network comprising 2 convolution layers and a ReLU layer, and obtaining the feature correlation after dot product multiplication and normalization; the cross-correlation module of the SAR remote sensing image is the same as that of the optical remote sensing image;
114 A Bottleneck layer for connecting the feature maps of the up-sampling and down-sampling phases is built up, consisting of three convolution layers:
the convolution kernel of the first convolution layer is 1×1, used for dimension reduction, reducing the number of input channels and the number of model parameters;
the convolution kernel of the second convolution layer is 3×3, used for convolving the feature map and extracting features;
the convolution kernel of the third convolution layer is 1×1, used for dimension increase, raising the number of channels of the convolved feature map and improving the expressive capacity of the model;
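A minimal sketch of this three-layer Bottleneck; normalization and activation between the convolutions are omitted, and the channel counts are illustrative assumptions:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 convolve -> 1x1 expand, as in step 114 )."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1),               # reduce channels
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),   # extract features
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),              # expand channels
        )

    def forward(self, x):
        return self.layers(x)
```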
115 Constructing up-sampling ConvLSTM, specifically comprising the following steps:
performing a deconvolution operation (also called transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4-fold downsampled case, 4-fold upsampling), obtaining upsampled output 1;
splicing upsampled output 1 with the third downsampled output to obtain combined output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 1 to obtain convolution output 1;
performing a ConvLSTM operation on convolution output 1 to obtain LSTM output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 1 to obtain convolution output 2;
performing a deconvolution operation on convolution output 2 to upsample it to 1/4 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 2;
splicing upsampled output 2 with the second downsampled output to obtain combined output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 2 to obtain convolution output 3;
performing a ConvLSTM operation on convolution output 3 to obtain LSTM output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 2 to obtain convolution output 4;
performing a deconvolution operation on convolution output 4 to upsample it to 1/2 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 3;
splicing upsampled output 3 with the first downsampled output to obtain combined output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 3 to obtain convolution output 5;
performing a ConvLSTM operation on convolution output 5 to obtain LSTM output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 3 to obtain the final upsampled output;
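A minimal sketch of one such upsampling stage (deconvolution, splice with the skip connection, conv + instance norm + LeakyReLU, ConvLSTM, conv + instance norm + LeakyReLU); the ConvLSTM module is passed in (for example the cell sketched after step 1124 )), and the deconvolution kernel/stride of 2 is an assumption:

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One upsampling stage of step 115 )."""
    def __init__(self, in_ch, skip_ch, out_ch, conv_lstm: nn.Module):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # deconvolution
        self.conv1 = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))
        self.conv_lstm = conv_lstm   # e.g. the ConvLSTMCell sketched above
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, x, skip, state):
        x = torch.cat([self.up(x), skip], dim=1)  # splice with the downsampled output
        x = self.conv1(x)
        x, state = self.conv_lstm(x, state)
        return self.conv2(x), state
```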
116 Completing construction of a network A for processing the optical remote sensing image and a network B for processing the SAR remote sensing image;
12 ) Establishing a multi-mode multi-scale feature fusion module: constructing a framework for multi-modal feature fusion, extracting the multi-modal difference features and common features by using a differential enhancement module and a common selection module, and fusing them into a multi-modal feature map;
121 ) Constructing a multi-mode feature fusion framework for the optical remote sensing image and the SAR remote sensing image, wherein the framework comprises a differential enhancement module and a common selection module;
1211 The differential enhancement module specifically comprises the following steps:
performing difference operation on the extracted optical image features and SAR image features to obtain a feature map of a difference part;
calculating attention weights through hourglass-type 1×1 convolutions to obtain separate attention maps;
adding the obtained attention map to the original feature map in a residual manner to obtain an enhanced feature map;
performing a weighted sum of the enhanced feature maps of the optical remote sensing image and the SAR remote sensing image to obtain the differential enhancement feature map;
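A minimal sketch of the differential enhancement module as read from step 1211 ); the hourglass reduction ratio, the sigmoid gating and the learnable fusion weights are illustrative assumptions about details the claim does not fix:

```python
import torch
import torch.nn as nn

class DifferentialEnhancement(nn.Module):
    """Difference features -> hourglass 1x1 attention -> residual enhancement -> weighted sum."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(                        # "hourglass" 1x1 convolutions
            nn.Conv2d(ch, ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, kernel_size=1),
            nn.Sigmoid())
        self.w_opt = nn.Parameter(torch.tensor(0.5))      # fusion weights (assumed learnable)
        self.w_sar = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_opt, f_sar):
        diff = f_opt - f_sar                              # difference-part feature map
        a = self.attn(diff)                               # attention map from the difference
        e_opt = f_opt + a * f_opt                         # residual enhancement of each branch
        e_sar = f_sar + a * f_sar
        return self.w_opt * e_opt + self.w_sar * e_sar    # differential enhancement feature map
```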
1212 ) The common selection module comprises the following specific steps:
performing an addition operation on the extracted optical image features and SAR image features to obtain a feature map of the common part;
obtaining the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image from the obtained common-part feature map by means of softmax;
multiplying the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image with the corresponding input feature maps, respectively, to obtain new feature maps;
122 ) Performing a weighted sum of the new feature maps of the optical remote sensing image and the SAR remote sensing image to obtain the common selection feature map;
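A minimal sketch of the common selection module of steps 1212 )–122 ); the 1×1 projections before the softmax and the learnable fusion weights are assumptions, and the softmax is applied over the spatial positions of each channel:

```python
import torch
import torch.nn as nn

class CommonSelection(nn.Module):
    """Shared features -> softmax attention per modality -> reweighted inputs -> weighted sum."""
    def __init__(self, ch):
        super().__init__()
        self.proj_opt = nn.Conv2d(ch, ch, kernel_size=1)   # assumed projections before softmax
        self.proj_sar = nn.Conv2d(ch, ch, kernel_size=1)
        self.w_opt = nn.Parameter(torch.tensor(0.5))
        self.w_sar = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_opt, f_sar):
        common = f_opt + f_sar                              # common-part feature map
        b, c, h, w = common.shape
        a_opt = torch.softmax(self.proj_opt(common).flatten(2), dim=-1).view(b, c, h, w)
        a_sar = torch.softmax(self.proj_sar(common).flatten(2), dim=-1).view(b, c, h, w)
        new_opt = a_opt * f_opt                             # reweight each input feature map
        new_sar = a_sar * f_sar
        return self.w_opt * new_opt + self.w_sar * new_sar  # common selection feature map
```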
13 ) Establishing a two-stage rotation prediction head module: constructing a two-stage prediction head module, and performing a second fine adjustment on the basis of the first-stage classification and localization;
14 ) Training the established network model with the divided training set and its corresponding labels, and adjusting the parameters until training reaches the preset number of epochs; finally, the corresponding parameters and the trained network are retained to detect other target images and obtain the results.
2. The method for detecting a rotation target by fusing optical and SAR image multi-mode information according to claim 1, wherein said establishing a two-stage rotation prediction head module comprises the steps of:
21 ) Constructing a feature pyramid structure to realize feature splicing, the outputs of which are taken as the inputs of the heads, with the following specific steps:
211 ) Inputting 4 feature maps of different sizes; the highest-layer feature map passes through a C3+conv to obtain a group of feature maps with the same size as the next-layer feature maps, which are spliced with the next-layer feature maps and then passed through a C3+conv to obtain a group of new feature maps; this process is repeated until the lowest layer is reached;
212 ) For the bottom-layer feature map, outputting it to the corresponding head; it is spliced with the information output by the previous layer, passed through a C3+conv as a new output and used as the input of the next layer, which is again spliced with the information output by the previous layer and passed through a C3+conv as a new output; this process is repeated until the highest layer is reached;
22 ) Constructing a rotation prediction head for remote sensing target detection, realizing target localization in two stages, with the following specific steps:
221 ) In the first stage, the ARM module uses the ATSS strategy to adjust horizontal anchors into high-quality rotated anchors, with the following steps:
extracting all horizontal anchors from the input feature map and regarding them as the candidate samples of the first stage;
calculating, for each candidate sample, the ratio between its center-point distance to all real targets (ground truth) and the target size, and dividing all candidate samples into positive and negative samples by comprehensively considering these two factors;
for the positive samples, generating a group of high-quality rotated anchors centered on the corresponding real targets as the positive samples of the first stage;
222 ) After the first-stage adjustment, the ARM obtains a group of candidate samples with rotated anchors for the second stage; these candidate samples are input into the target detection network for classification and regression and screened according to the IoU between the prediction results and the real targets, and finally the sample with the maximum IoU is selected as the positive sample for adjustment, with the following specific steps:
inputting the rotated anchors obtained in the first stage into the target detection network to obtain the detection results;
according to the detection results, calculating the IoU value between each rotated anchor and its corresponding real target, and selecting the sample with the maximum IoU value as the positive sample of the second stage;
taking the positive sample obtained in the second stage as the input positive sample, and performing classification and regression through the target detection network again to further improve the detection accuracy.
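A minimal sketch of the second-stage positive-sample selection; the rotated-IoU matrix between the first-stage anchors and the real targets is assumed to be computed elsewhere:

```python
import torch

def select_second_stage_positives(ious: torch.Tensor):
    """ious: (num_anchors, num_targets) rotated-IoU matrix.

    Returns, for each real target, the index of the max-IoU anchor and its IoU value,
    which are kept as the positive samples of the second stage."""
    best_iou, best_anchor = ious.max(dim=0)   # best anchor index per ground-truth target
    return best_anchor, best_iou
```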
3. The method for detecting a rotation target by fusing optical and SAR image multi-mode information according to claim 1, wherein the steps of training the network model and obtaining the results are as follows:
31 Inputting the preprocessed remote sensing image data into a rotating target detection network integrating optical and SAR image multi-mode information;
32 ) Performing a normal convolution layer with a convolution kernel size of 1×1 to convert the optical remote sensing image into V_OPT, Q_OPT, K_OPT three-channel features, and the SAR remote sensing image into V_SAR, Q_SAR, K_SAR three-channel features; executing the encoder structure once to obtain 4 downsampled outputs;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the input picture to obtain the first downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the first downsampled output to obtain the second downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the second downsampled output to obtain the third downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the third downsampled output to obtain the fourth downsampled output;
33 ) The first downsampled output, the second downsampled output and the third downsampled output are each followed by a self-attention mechanism module and a cross-correlation module;
34 ) Performing a deconvolution operation (transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4-fold downsampled case, 4-fold upsampling), obtaining upsampled output 1;
splicing upsampled output 1 with the third downsampled output to obtain combined output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 1 to obtain convolution output 1;
performing a ConvLSTM operation on convolution output 1 to obtain LSTM output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 1 to obtain convolution output 2;
performing a deconvolution operation on convolution output 2 to upsample it to 1/4 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 2;
splicing upsampled output 2 with the second downsampled output to obtain combined output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 2 to obtain convolution output 3;
performing a ConvLSTM operation on convolution output 3 to obtain LSTM output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 2 to obtain convolution output 4;
performing a deconvolution operation on convolution output 4 to upsample it to 1/2 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 3;
splicing upsampled output 3 with the first downsampled output to obtain combined output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 3 to obtain convolution output 5;
performing a ConvLSTM operation on convolution output 5 to obtain LSTM output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 3 to obtain the final upsampled output;
35 ) Inputting the multi-scale feature maps extracted from the two modalities into the cross-modal feature fusion module;
36 ) The differential enhancement module obtains the difference-part feature map of the optical image and the SAR image through a difference operation, enhances the original feature maps with the attention weights to obtain enhanced feature maps, and obtains the differential enhancement feature map through weighted summation;
37 ) The common selection module obtains the common-part feature map of the optical image and the SAR image through an addition operation, obtains attention maps through softmax, and multiplies the attention maps with the original feature maps to obtain new feature maps;
38 ) The differential enhancement feature map and the common selection feature map are subjected to weighted summation to obtain the cross-modal feature map;
39 ) Feature splicing of the 4 feature maps of different sizes: the highest-layer feature map passes through a C3+conv to obtain a group of feature maps with the same size as the next-layer feature maps, which are spliced with the next-layer feature maps and then passed through a C3+conv to obtain a group of new feature maps; this process is repeated until the lowest layer is reached;
310 ) For the bottom-layer feature map, outputting it to the corresponding head; it is spliced with the information output by the previous layer, passed through a C3+conv as a new output and used as the input of the next layer, which is again spliced with the information output by the previous layer and passed through a C3+conv as a new output; this process is repeated until the highest layer is reached;
311 ) Inputting the feature maps into the prediction head; in the first stage, the ARM module uses the ATSS strategy to adjust horizontal anchors into high-quality rotated anchors;
312 ) After the first-stage adjustment, the ARM obtains a group of candidate samples with rotated anchors for the second stage, inputs them into the target detection network for classification and regression, screens them according to the IoU between the prediction results and the real targets, and selects the sample with the largest IoU as the positive sample for adjustment;
313 ) Calculating the loss function and back-propagating to update the weight parameters;
314 ) Judging whether the number of training epochs reaches the preset number; if so, obtaining the trained model, otherwise returning to 32 ) to reload the data and continue training;
315 ) Using the obtained rotation target detection network fusing optical and SAR image multi-mode information, inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original images through visualization.
CN202310446031.9A 2023-04-22 2023-04-22 Rotation target detection method integrating optics and SAR image multi-mode information Active CN116452936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310446031.9A CN116452936B (en) 2023-04-22 2023-04-22 Rotation target detection method integrating optics and SAR image multi-mode information

Publications (2)

Publication Number Publication Date
CN116452936A (en) 2023-07-18
CN116452936B (en) 2023-09-29

Family

ID=87120068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310446031.9A Active CN116452936B (en) 2023-04-22 2023-04-22 Rotation target detection method integrating optics and SAR image multi-mode information

Country Status (1)

Country Link
CN (1) CN116452936B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117528233A (en) * 2023-09-28 2024-02-06 哈尔滨航天恒星数据系统科技有限公司 Zoom multiple identification and target re-identification data set manufacturing method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138786A1 (en) * 2017-06-06 2019-05-09 Sightline Innovation Inc. System and method for identification and classification of objects
CN112767418B (en) * 2021-01-21 2022-10-14 大连理工大学 Mirror image segmentation method based on depth perception

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012185712A (en) * 2011-03-07 2012-09-27 Mitsubishi Electric Corp Image collation device and image collation method
CN112307901A (en) * 2020-09-28 2021-02-02 国网浙江省电力有限公司电力科学研究院 Landslide detection-oriented SAR and optical image fusion method and system
CN112434745A (en) * 2020-11-27 2021-03-02 西安电子科技大学 Occlusion target detection and identification method based on multi-source cognitive fusion
WO2022142297A1 (en) * 2021-01-04 2022-07-07 Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences A robot grasping system and method based on few-shot learning
CN113283435A (en) * 2021-05-14 2021-08-20 陕西科技大学 Remote sensing image semantic segmentation method based on multi-scale attention fusion
CN113469094A (en) * 2021-07-13 2021-10-01 上海中科辰新卫星技术有限公司 Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN114387439A (en) * 2022-01-13 2022-04-22 中国电子科技集团公司第五十四研究所 Semantic segmentation network based on fusion of optical and PolSAR (polar synthetic Aperture Radar) features
CN114565856A (en) * 2022-02-25 2022-05-31 西安电子科技大学 Target identification method based on multiple fusion deep neural networks
US11631238B1 (en) * 2022-04-13 2023-04-18 Iangxi Electric Power Research Institute Of State Grid Method for recognizing distribution network equipment based on raspberry pi multi-scale feature fusion
CN115497005A (en) * 2022-09-05 2022-12-20 重庆邮电大学 YOLOV4 remote sensing target detection method integrating feature transfer and attention mechanism
CN115496928A (en) * 2022-09-30 2022-12-20 云南大学 Multi-modal image feature matching method based on multi-feature matching
CN115830471A (en) * 2023-01-04 2023-03-21 安徽大学 Multi-scale feature fusion and alignment domain self-adaptive cloud detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kai Xu; Siyuan Liu; Ziyi Wang. "Geometric Auto-Calibration of SAR Images Utilizing Constraints of Symmetric Geometry". IEEE Geoscience and Remote Sensing Letters, 2022, full text. *
Zhou Bo; Tong Haipeng; Chen Xiao; Xue Wei; Xu Kai. "The value of multimodal MRI in evaluating the expression level of tissue factor in glioblastoma". Journal of Third Military Medical University (第三军医大学学报), 2020, full text. *

Also Published As

Publication number Publication date
CN116452936A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
CN116452936B (en) Rotation target detection method integrating optics and SAR image multi-mode information
CN113239736B (en) Land coverage classification annotation drawing acquisition method based on multi-source remote sensing data
CN113901900A (en) Unsupervised change detection method and system for homologous or heterologous remote sensing image
CN116704357B (en) YOLOv 7-based intelligent identification and early warning method for landslide of dam slope
CN113610070A (en) Landslide disaster identification method based on multi-source data fusion
CN115049640B (en) Road crack detection method based on deep learning
CN115439458A (en) Industrial image defect target detection algorithm based on depth map attention
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
CN114119610A (en) Defect detection method based on rotating target detection
CN115861260A (en) Deep learning change detection method for wide-area city scene
Fan et al. A novel sonar target detection and classification algorithm
CN113408540B (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN114596503A (en) Road extraction method based on remote sensing satellite image
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN113673556A (en) Hyperspectral image classification method based on multi-scale dense convolution network
CN114743023B (en) Wheat spider image detection method based on RetinaNet model
CN114663654B (en) Improved YOLOv4 network model and small target detection method
CN115661655A (en) Southwest mountain area cultivated land extraction method with hyperspectral and hyperspectral image depth feature fusion
CN112989919B (en) Method and system for extracting target object from image
CN117788296B (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
CN116229174A (en) Hyperspectral multi-class change detection method based on spatial spectrum combined attention mechanism
CN114842001B (en) Remote sensing image detection system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant