CN116452936B - Rotation target detection method fusing optical and SAR image multi-modal information - Google Patents

Rotation target detection method fusing optical and SAR image multi-modal information

Info

Publication number
CN116452936B
Authority
CN
China
Prior art keywords
output
convolution
feature
image
sar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310446031.9A
Other languages
Chinese (zh)
Other versions
CN116452936A (en)
Inventor
徐凯
刘思远
汪安铃
汪子羽
梁栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University
Priority to CN202310446031.9A
Publication of CN116452936A
Application granted
Publication of CN116452936B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

Compared with the prior art, the method effectively addresses the problem that targets cannot be accurately identified because of the various interferences introduced during remote sensing image acquisition and transmission, and it also copes with targets of different categories that differ in shape, size and color or that overlap and occlude one another. The invention comprises the following steps: feature extraction from the SAR image dataset and the optical image dataset; cross-modal multi-scale feature fusion; and a two-stage rotation prediction head that locates and classifies targets at different angles. The invention alleviates the failures of single-modality detection caused by weather, illumination, object color and the like, as well as the low positioning accuracy of the rotated box, thereby improving the accuracy and efficiency of target detection.

Description

Rotation target detection method fusing optical and SAR image multi-modal information
Technical Field
The invention relates to the field of remote sensing image target detection, and in particular to a rotation target detection method that fuses optical and SAR image multi-modal information.
Background
Remote sensing image target detection is a technology that identifies and locates targets in remote sensing image data, and it has important application value in many fields, such as urban planning, agricultural resource management and environmental monitoring. However, various interferences may be introduced while remote sensing images are collected and transmitted, different target categories differ in shape, size, color and other characteristics, and targets may overlap or occlude one another, so that targets cannot always be identified accurately. Cross-modal remote sensing target detection refers to techniques that detect and identify targets across different remote sensing image modalities; it can improve the accuracy and robustness of remote sensing target detection and broaden its application range and scenarios.
Cross-modal remote sensing target detection fuses information from multi-source remote sensing data: by combining remote sensing images from different sensors or different wave bands, more comprehensive and accurate target information is obtained. Compared with a single-modality remote sensing image, cross-modal remote sensing imagery offers more wave bands and feature information and can provide better results for target detection and classification. In a single-modality remote sensing image, the data collected by the sensor only provide information from a specific wave band, so performance on detection and classification of some complex targets is not ideal. Cross-modal remote sensing imagery, by exploiting multiple wave bands and kinds of feature information, can effectively improve the accuracy and robustness of target detection and classification.
Remote sensing target detection typically requires a trade-off between high accuracy and high efficiency. Accurate target detection requires detection results whose position and shape are precise enough to represent the position, shape, size and other properties of the target. However, single-modality target detection is affected by remote sensing data quality, data annotation, target category and the like, and tends to produce recognition of limited accuracy, which restricts subsequent applications. The invention extracts features of images of the same target in several modalities and exploits the differences between the modality features so that the fused feature maps carry more feature information, allowing the subsequent two-stage rotation head to locate and classify targets more accurately. At present, domestic papers and patents on rotation target detection methods that fuse optical and SAR multi-modal images are still scarce.
Disclosure of Invention
The invention aims to solve the inaccuracy of single-modality remote sensing target detection caused by errors introduced during data acquisition, transmission and the like, and for this purpose provides a rotation target detection method that fuses optical and SAR image multi-modal information.
In order to achieve the above object, the technical scheme of the present invention is as follows:
A method for detecting a rotating target by combining optical and SAR image multi-mode information comprises the following steps:
11) Preparation of rotation target detection data and feature extraction fusing optical and SAR image multi-modal information: dividing and cutting the acquired remote sensing image data set; constructing a Transformer-UNet network based on an encoder-decoder structure to extract features of the remote sensing data;
12) Establishing a multi-modal multi-scale feature fusion module: constructing a framework for multi-modal feature fusion, extracting the multi-modal difference features and common features with a differential enhancement module and a common selection module, and fusing them into a multi-modal feature map;
13) Establishing a two-stage rotation prediction head module: constructing a two-stage prediction head that performs a second refinement on top of the first-stage classification and localization;
14) Training and tuning the established rotation target detection network fusing optical and SAR image multi-modal information with the divided training set and its labels until the preset number of epochs is reached, and finally keeping the corresponding parameters and the trained network;
15) Using the rotation target detection network fusing optical and SAR image multi-modal information obtained in step 14), inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original image through visualization.
The preparation of the rotation target detection data fusing optical and SAR image multi-modal information and the feature extraction comprise the following steps:
21) Dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2, and cutting the images into non-overlapping patches of a uniform size of 256 × 256 (a data-preparation sketch follows this item);
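The following is an illustrative sketch of the 6:2:2 split and the non-overlapping 256 × 256 cropping described in step 21). The file-handling details (paths, naming) are assumptions, not taken from the patent.

```python
import random
from pathlib import Path

import numpy as np
from PIL import Image


def split_dataset(image_paths, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and split a list of image paths into train/validation/test sets."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return paths[:n_train], paths[n_train:n_train + n_val], paths[n_train + n_val:]


def crop_non_overlapping(image_path, out_dir, size=256):
    """Cut an image into non-overlapping size x size tiles (border remainders are discarded)."""
    img = np.array(Image.open(image_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    h, w = img.shape[:2]
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            tile = img[top:top + size, left:left + size]
            Image.fromarray(tile).save(out_dir / f"{Path(image_path).stem}_{top}_{left}.png")
```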
22) Constructing a pair of parallel encoder-decoder Transformer-UNet networks A and B, where network A processes the optical remote sensing image and network B processes the SAR remote sensing image (a sketch of the basic blocks follows this list);
221) Constructing the double-layer DoubleConv convolution module, which comprises two convolution layers, two normalization layers and two ReLU activation functions; each convolution layer uses a kernel size of 3, padding of 1 and stride of 1;
222) Constructing the downsampling structure for feature extraction, which consists of a DoubleConv module and a max pooling layer;
223) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages; it comprises two convolution layers with kernel size 1 and stride 1 and one convolution layer with kernel size 3 and stride 1;
224) Constructing the upsampling structure for feature extraction, which consists of a ConvLSTM layer and a convolution layer; the ConvLSTM unit comprises an input gate, a forget gate and an output gate, with kernel_size (3, 3) and stride (2, 2);
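A minimal PyTorch sketch of the building blocks in steps 221)-222): a DoubleConv block (two 3×3 convolutions, each followed by normalization and ReLU) and a downsampling stage (DoubleConv plus max pooling). The choice of BatchNorm2d is an assumption; the patent only says "normalization layer".

```python
import torch
import torch.nn as nn


class DoubleConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1, stride=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class Down(nn.Module):
    """One encoder stage: double convolution followed by 2x2 max pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DoubleConv(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        feat = self.conv(x)           # feature map kept for the skip connection
        return feat, self.pool(feat)  # pooled map goes to the next stage
```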
23) The feature extraction for rotation target detection fusing optical and SAR image multi-modal information proceeds as follows:
231) Inputting the preprocessed optical remote sensing image, SAR remote sensing image and label data into the convolutional neural network and training the downsampling feature extraction model with a self-attention mechanism, with the following specific steps:
232) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into the three channel features V_SAR, Q_SAR and K_SAR; then executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
233) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs, with the following steps (a self-attention sketch follows this item):
applying a convolution with a 1×1 kernel to convert the optical remote sensing image into the three-channel feature matrices V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR;
multiplying the transpose of Q_OPT with K_OPT (dot product), applying softmax to the result, multiplying it with V_OPT, and then taking a weighted sum with the original feature map to obtain the self-attention feature map of the optical image; the self-attention feature map of the SAR image is obtained in the same way;
extracting a support feature map and a query feature map from the self-attention feature maps, reshaping them, computing the relation between the two maps with the cosine distance, obtaining the corresponding weights through global average pooling and a non-linear network containing 2 convolution layers and a ReLU layer, and obtaining the feature correlation after dot-product multiplication and normalization; the cross-correlation module for the SAR remote sensing image is identical to that for the optical remote sensing image;
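A hedged sketch of the self-attention step in 232)-233): 1×1 convolutions produce Q, K and V, softmax(Qᵀ K) weights V, and the result is combined with the input feature map as a residual. The learnable residual weight gamma is an assumption standing in for the "weighted sum with the original feature map".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention2d(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.k = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # residual weight (assumption)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.k(x).flatten(2)                   # (B, C', HW)
        v = self.v(x).flatten(2)                   # (B, C, HW)
        attn = F.softmax(torch.bmm(q, k), dim=-1)  # (B, HW, HW) attention weights
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                # weighted sum with the input map
```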
24) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages, consisting of three convolution layers (a sketch of this block follows this list):
the first convolution layer uses a 1×1 kernel to reduce the dimension, decreasing the number of input channels and the number of model parameters;
the second convolution layer uses a 3×3 kernel to convolve the feature map and extract features;
the third convolution layer uses a 1×1 kernel to raise the dimension, increasing the number of channels of the convolved feature map and the expressive capacity of the model;
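A sketch of the Bottleneck block in 24): a 1×1 convolution to reduce channels, a 3×3 convolution to extract features, and a 1×1 convolution to raise channels again. The ReLU activations and the absence of a residual path are assumptions not fixed by the patent text.

```python
import torch.nn as nn


class Bottleneck(nn.Module):
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1),              # reduce dimension
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),  # extract features
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),             # raise dimension
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```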
25) Constructing the upsampling ConvLSTM path, with the following steps (a sketch of one decoder stage follows this list):
performing a deconvolution (also called transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output.
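A sketch of one decoder stage from step 25): transposed-convolution upsampling, concatenation with the matching encoder output, a 3×3 convolution with instance normalization and LeakyReLU, a ConvLSTM pass, and a second convolution. The single-step ConvLSTM with zero-initialized state is a simplifying assumption.

```python
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution produces the input, forget, output and candidate gates.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, kernel_size, padding=pad)
        self.hidden_ch = hidden_ch

    def forward(self, x, state=None):
        b, _, h, w = x.shape
        if state is None:
            zeros = x.new_zeros(b, self.hidden_ch, h, w)
            state = (zeros, zeros)
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c_prev + i * g
        h_new = o * torch.tanh(c)
        return h_new, (h_new, c)


class UpStep(nn.Module):
    """Upsample, fuse with the skip connection, then conv + ConvLSTM + conv."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv1 = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))
        self.lstm = ConvLSTMCell(out_ch, out_ch)
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                   # upsampled output
        x = torch.cat([x, skip], dim=1)  # combined output
        x = self.conv1(x)                # convolution output
        x, _ = self.lstm(x)              # ConvLSTM output
        return self.conv2(x)             # next convolution output
```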
The multi-modal feature fusion module of the rotation target detection method fusing optical and SAR image multi-modal information comprises the following steps:
31) Constructing a multi-modal feature fusion framework for the optical and SAR remote sensing images, comprising a differential enhancement module and a common selection module (a fusion-module sketch follows this list);
311) The differential enhancement module performs the following steps:
performing a difference operation on the extracted optical image features and SAR image features to obtain the feature map of the differing part;
computing attention weights with an hourglass-shaped 1×1 convolution to obtain the respective attention maps;
adding the obtained attention maps to the original feature maps in a residual manner to obtain the enhanced feature maps;
weighting and summing the enhanced feature maps of the optical and SAR remote sensing images to obtain the differential enhancement feature map;
312) The common selection module performs the following steps:
performing an addition operation on the extracted optical image features and SAR image features to obtain the feature map of the common part;
obtaining the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image from the common-part feature map through softmax;
multiplying the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image respectively with the input feature maps to obtain new feature maps;
weighting and summing the new feature maps of the optical and SAR remote sensing images to obtain the common-module feature map.
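A hedged sketch of the fusion modules in 311)-312): the differential-enhancement branch works on the difference of the two modality feature maps, the common-selection branch on their sum. The exact form of the "hourglass 1×1 convolution" (squeeze then expand channels), the sigmoid/softmax placement and the equal fusion weights are assumptions.

```python
import torch
import torch.nn as nn


class DifferentialEnhancement(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        # Hourglass-style 1x1 convolutions: squeeze channels, then restore them.
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid())

    def forward(self, f_opt, f_sar):
        diff = f_opt - f_sar             # feature map of the differing part
        a = self.attn(diff)              # attention map computed from the difference
        e_opt = f_opt + a * f_opt        # residual-style enhancement of each modality
        e_sar = f_sar + a * f_sar
        return 0.5 * (e_opt + e_sar)     # weighted sum -> differential enhancement map


class CommonSelection(nn.Module):
    def forward(self, f_opt, f_sar):
        common = f_opt + f_sar           # feature map of the common part
        w = torch.softmax(common.flatten(2), dim=-1).view_as(common)
        return 0.5 * (w * f_opt + w * f_sar)  # weighted sum -> common-module map


class CrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.diff = DifferentialEnhancement(channels)
        self.common = CommonSelection()

    def forward(self, f_opt, f_sar):
        return self.diff(f_opt, f_sar) + self.common(f_opt, f_sar)
```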
The two-stage rotation prediction head network of the rotation target detection method fusing optical and SAR image multi-modal information comprises the following steps:
41) Constructing a feature pyramid structure to realize feature stitching, whose outputs serve as the inputs to the prediction heads (a stitching sketch follows this list), with the following specific steps:
411) Taking 4 feature maps of different sizes as input; the highest-level feature map passes through a C3+conv block to obtain a set of feature maps of the same size as the next level, is concatenated with the next-level feature maps, and passes through another C3+conv block to obtain a new set of feature maps; this process is repeated until the lowest level is reached;
412) Starting from the lowest-level feature map, outputting it to its head, concatenating it with the information output by the adjacent level, passing the result through a C3+conv block as a new output that also serves as the input to the next level, concatenating again with the output of that level and passing through a C3+conv block as a new output; this process is repeated until the highest level is reached;
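A simplified sketch of the top-down stitching in 411): each higher-level feature map is processed, resized to the next level, concatenated with it, and passed through another block. The c3_conv helper here is a plain stand-in (two 3×3 convolutions) for the C3+conv block, whose internals the patent does not specify.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def c3_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))


class TopDownNeck(nn.Module):
    def __init__(self, channels):  # channels per level, ordered highest -> lowest
        super().__init__()
        self.blocks = nn.ModuleList(
            c3_conv(channels[i] + channels[i + 1], channels[i + 1])
            for i in range(len(channels) - 1))

    def forward(self, feats):      # feats[0] is the highest (smallest) level
        out = [feats[0]]
        x = feats[0]
        for blk, lower in zip(self.blocks, feats[1:]):
            x = F.interpolate(x, size=lower.shape[-2:], mode="nearest")  # match size
            x = blk(torch.cat([x, lower], dim=1))                        # concatenate + C3+conv
            out.append(x)
        return out                 # per-level maps handed to the prediction heads
```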
42) Constructing the rotated box for remote sensing target detection and realizing target localization in two stages (a sample-selection sketch follows this list), with the following specific steps:
421) The first-stage anchor optimization module (ARM) uses an adaptive training sample selection (ATSS) strategy to adjust horizontal anchors into high-quality rotated anchors, as follows:
extracting all horizontal anchor points from the input feature image and treating them as first-stage candidate samples;
computing, for each candidate sample, the ratio between its center distance to every real target and the target size, and dividing all candidate samples into positive and negative samples by jointly considering these two factors;
for the positive samples, generating a group of high-quality rotated anchors, centered on the corresponding real targets, as the first-stage positive samples;
422) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples; these are input into the target detection network for classification and regression and screened according to the prediction results and their IoU with the real targets, and finally the sample with the largest IoU is selected as the positive sample for refinement, as follows:
inputting the rotated anchors obtained in the first stage into the target detection network to obtain detection results;
computing, from the detection results, the IoU between each rotated anchor and its corresponding real target, and selecting the sample with the largest IoU as the second-stage positive sample;
taking the positive samples obtained in the second stage as input positive samples and performing classification and regression through the target detection network again, further improving detection accuracy.
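An illustrative sketch of the two-stage sample selection in 421)-422). Stage one marks horizontal anchors as positive when they lie close to a ground-truth center relative to the target size (an ATSS-like rule, simplified here to a fixed threshold); stage two keeps, per ground truth, the predicted rotated box with the highest IoU as the positive sample for refinement. The rotated-IoU function is passed in abstractly because its exact form is not given in the patent.

```python
import torch


def stage_one_positives(anchor_centers, gt_centers, gt_sizes, thresh=0.5):
    """anchor_centers: (A, 2); gt_centers: (G, 2); gt_sizes: (G,) target scale."""
    dist = torch.cdist(anchor_centers, gt_centers)   # (A, G) center distances
    ratio = dist / gt_sizes.unsqueeze(0)             # distance relative to target size
    min_ratio, gt_idx = ratio.min(dim=1)             # closest ground truth per anchor
    positive = min_ratio < thresh                    # first-stage positive mask
    return positive, gt_idx


def stage_two_positives(rotated_iou, pred_boxes, gt_boxes):
    """Keep, for each ground truth, the prediction with the highest rotated IoU."""
    ious = rotated_iou(pred_boxes, gt_boxes)         # (P, G), user-supplied IoU function
    best_iou, best_pred = ious.max(dim=0)
    return best_pred, best_iou
```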
The network model training and result acquisition proceed as follows:
51) Inputting the preprocessed remote sensing image data into the rotation target detection network fusing optical and SAR image multi-modal information;
52) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR; executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
53) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs;
54) Performing a deconvolution on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output;
55) Inputting the multi-scale feature maps extracted from the two modalities into the cross-modal feature fusion module;
56) The differential enhancement module obtains the differing-part feature map of the optical and SAR images through a difference operation, enhances the original feature maps with the attention weights to obtain the enhanced feature maps, and produces the differential enhancement feature map through weighted summation;
57) The common selection module obtains the common-part feature map of the optical and SAR images through an addition operation, obtains the attention maps through softmax, and multiplies the attention maps with the original feature maps to obtain new feature maps;
58) The differential enhancement feature map and the common selection feature map are weighted and summed to obtain the cross-modal feature map;
59) Feature stitching of the 4 feature maps of different sizes: the highest-level feature map passes through a C3+conv block to obtain a set of feature maps of the same size as the next level, is concatenated with the next-level feature maps, and passes through another C3+conv block to obtain a new set of feature maps; this process is repeated until the lowest level is reached;
510) Starting from the lowest-level feature map, outputting it to its head, concatenating it with the information output by the adjacent level, passing the result through a C3+conv block as a new output that also serves as the input to the next level, concatenating again and passing through a C3+conv block as a new output; this process is repeated until the highest level is reached;
511) Inputting the feature maps into the prediction head; in the first stage, the ARM module adjusts horizontal anchors into high-quality rotated anchors using the ATSS strategy;
512) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples, inputs them into the target detection network for classification and regression, screens them according to the prediction results and their IoU with the real targets, and selects the sample with the largest IoU as the positive sample for refinement;
513) Computing the loss function and back-propagating to update the weight parameters;
514) Judging whether the number of training epochs reaches the preset number; if so, the trained detection model is obtained, otherwise returning to 52) to reload data and continue training (a schematic training-loop sketch follows this list);
515) Using the obtained rotation target detection network fusing optical and SAR image multi-modal information, inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original image through visualization.
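A schematic training loop for steps 51)-514): forward pass through the fusion detection network, loss computation, back-propagation, and the epoch check. The names model, criterion and train_loader are placeholders; the patent does not fix the optimizer, learning rate or the loss formulation.

```python
import torch


def train(model, criterion, train_loader, epochs, lr=1e-3, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):                      # 514) stop at the preset epoch count
        for opt_img, sar_img, targets in train_loader:
            opt_img, sar_img = opt_img.to(device), sar_img.to(device)
            preds = model(opt_img, sar_img)          # 52)-512) forward pass through the network
            loss = criterion(preds, targets)         # 513) loss computation
            optimizer.zero_grad()
            loss.backward()                          # back-propagate to the weight parameters
            optimizer.step()
    torch.save(model.state_dict(), "rotated_detector.pth")  # keep the trained parameters
    return model
```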
Advantageous effects
Compared with the prior art, in the rotation target detection method fusing optical and SAR image multi-modal information, the acquired optical and SAR remote sensing images first obtain good local-feature and global-structure information through a downsampling network equipped with an attention mechanism and a cross-correlation mechanism. The upsampling ConvLSTM lets these features better capture the relations and interactions between different positions and time points. Cross-modal feature fusion merges the features extracted from the two modalities, improving the expressive power and robustness of the features and making the model suitable for more complex and variable application scenarios, while the fused multi-scale feature maps make target localization and classification more accurate through the two-stage rotated box. In addition, in remote sensing image target detection, the method addresses the failure to accurately identify targets caused by the various interferences of image acquisition and transmission, as well as the problems of targets of different categories differing in shape, size and color or overlapping and occluding one another. The proposed method performs cross-modal feature fusion on remote sensing images of different modalities, so that more target features are available during detection and the localization and classification accuracy is greatly improved.
Drawings
FIG. 1 is a flow chart of the rotation target detection method fusing optical and SAR image multi-modal information;
FIG. 2 is a schematic diagram of a model structure of a method for detecting a rotating target by combining optical and SAR image multi-mode information;
FIG. 3 is a schematic diagram of a rotation target detection feature extracted from an attention mechanism module that fuses optical and SAR image multi-modality information;
FIG. 4 is a schematic diagram of a rotational target detection feature extraction cross-correlation module that fuses optical and SAR image multi-modality information;
FIG. 5 is a schematic diagram of a rotation target detection feature fusion structure that fuses optical and SAR image multi-modality information;
FIG. 6 is a schematic diagram of a structure of a rotation target detection feature fusion differential enhancement module for fusing optical and SAR image multi-modal information;
FIG. 7 is a schematic diagram of a structure of a rotation target detection feature fusion common selection module fusing optical and SAR image multi-modal information;
FIG. 8 is a schematic diagram of a rotating target detection two-stage rotating frame structure fusing optical and SAR image multi-modal information;
fig. 9 is a schematic diagram of a rotating target detection network result integrating optical and SAR image multi-modal information.
Detailed Description
For a further understanding and appreciation of the structural features and advantages achieved by the present invention, the following description of presently preferred embodiments is provided in connection with the accompanying drawings:
As shown in fig. 1, the method for detecting the rotation target by fusing optical and SAR image multi-mode information according to the present invention comprises the following steps:
Firstly, preparing the rotation target detection data fusing optical and SAR image multi-modal information and extracting features: dividing and cutting the acquired remote sensing image data set; constructing a Transformer-UNet network based on an encoder-decoder structure to extract features of the remote sensing data. The specific steps are as follows:
(1) Dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2, and cutting the images into non-overlapping patches of a uniform size of 256 × 256;
(2) Constructing the parallel encoder-decoder Transformer-UNet structure, where network A processes the optical remote sensing image and network B processes the SAR remote sensing image;
(2-1) Constructing the double-layer DoubleConv convolution module, which comprises two convolution layers, two normalization layers and two ReLU activation functions; each convolution layer uses a kernel size of 3, padding of 1 and stride of 1;
(2-2) Constructing the downsampling structure for feature extraction, which consists of a DoubleConv module and a max pooling layer;
(2-3) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages, comprising two convolution layers with kernel size 1 and stride 1 and one convolution layer with kernel size 3 and stride 1;
(2-4) Constructing the upsampling structure for feature extraction, which consists of a ConvLSTM layer and a convolution layer; the ConvLSTM unit comprises an input gate, a forget gate and an output gate, with kernel_size (3, 3) and stride (2, 2);
(3) The feature extraction for rotation target detection fusing optical and SAR image multi-modal information proceeds as follows:
(3-1) Inputting the preprocessed optical remote sensing image, SAR remote sensing image and label data into the convolutional neural network and training the downsampling feature extraction model with a self-attention mechanism, with the following specific steps:
(3-2) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into the three channel features V_SAR, Q_SAR and K_SAR; then executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
(3-3) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs, with the following steps:
applying a convolution with a 1×1 kernel to convert the optical remote sensing image into the three-channel feature matrices V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR;
multiplying the transpose of Q_OPT with K_OPT (dot product), applying softmax to the result, multiplying it with V_OPT, and then taking a weighted sum with the original feature map to obtain the self-attention feature map of the optical image; the self-attention feature map of the SAR image is obtained in the same way;
extracting a support feature map and a query feature map from the self-attention feature maps, reshaping them, computing the relation between the two maps with the cosine distance, obtaining the corresponding weights through global average pooling and a non-linear network containing 2 convolution layers and a ReLU layer, and obtaining the feature correlation after dot-product multiplication and normalization; the cross-correlation module for the SAR remote sensing image is identical to that for the optical remote sensing image;
(4) The Bottleneck layer used to connect the feature maps of the up-sampling and down-sampling phases is built up from three convolutional layers:
the convolution kernel of the first convolution layer is 1x1, and is used for reducing the dimension, reducing the number of input channels and reducing the number of model parameters;
the convolution kernel of the second convolution layer is 3x3 and is used for convoluting the feature map and extracting features;
the convolution kernel of the third convolution layer is 1x1, and is used for increasing the dimension, increasing the number of channels of the convolved feature map and increasing the expression capacity of the model;
(5) The upsampling ConvLSTM is constructed as follows:
performing a deconvolution (also called transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output;
Secondly, constructing the framework for multi-modal feature fusion, extracting the multi-modal difference features and common features with the differential enhancement module and the common selection module, and fusing them into a multi-modal feature map. The specific steps are as follows:
(1) Constructing the multi-modal feature fusion framework for the optical and SAR remote sensing images, comprising a differential enhancement module and a common selection module;
(1-1) Constructing the differential enhancement module: performing a difference operation on the extracted optical image features and SAR image features to obtain the feature map of the differing part;
computing attention weights with an hourglass-shaped 1×1 convolution to obtain the respective attention maps;
adding the obtained attention maps to the original feature maps in a residual manner to obtain the enhanced feature maps;
weighting and summing the enhanced feature maps of the optical and SAR remote sensing images to obtain the differential enhancement feature map;
(1-2) Constructing the common selection module: performing an addition operation on the extracted optical image features and SAR image features to obtain the feature map of the common part;
obtaining the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image from the common-part feature map through softmax;
multiplying the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image respectively with the input feature maps to obtain new feature maps;
weighting and summing the new feature maps of the optical and SAR remote sensing images to obtain the common-module feature map.
Thirdly, establishing a two-stage rotation prediction head module: constructing a two-stage prediction head module, and performing secondary fine tuning on the basis of the first-stage classification and positioning, wherein the method comprises the following specific steps of:
(1) The first-stage anchor optimization module (ARM) uses an adaptive training sample selection (ATSS) strategy to adjust horizontal anchors into high-quality rotated anchors;
(2) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples; these are input into the target detection network for classification and regression and screened according to the prediction results and their IoU with the real targets, and finally the sample with the largest IoU is selected as the positive sample for refinement.
Fourthly, training a rotation target detection model fusing optical and SAR image multi-mode information:
A rotation target detection model fusing optical and SAR image multi-modal information is constructed, and the processed remote sensing data images and labels are input into it to obtain a trained target detection network model. The training flow is shown in figure 1 and the structure of the target detection network in figure 2; the self-attention feature extraction is based on the Transformer-UNet structure, and the multi-modal feature fusion is shown in figure 5, yielding feature maps rich in feature information so that more features of the target are available during detection; the two-stage rotation prediction head shown in figure 8 greatly improves the localization and classification accuracy.
The method comprises the following specific steps:
51) Inputting the preprocessed remote sensing image data into the rotation target detection network fusing optical and SAR image multi-modal information;
52) Applying an ordinary convolution layer with a 1×1 kernel to convert the optical remote sensing image into the three channel features V_OPT, Q_OPT and K_OPT, and the SAR remote sensing image into V_SAR, Q_SAR and K_SAR; executing the encoder structure once to obtain 4 downsampled outputs:
applying a 3×3 convolution to the input image, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the first downsampled output;
applying a 3×3 convolution to the first downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the second downsampled output;
applying a 3×3 convolution to the second downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the third downsampled output;
applying a 3×3 convolution to the third downsampled output, followed by instance normalization, a ReLU and a max pooling operation with stride 1, to obtain the fourth downsampled output;
53) Executing the self-attention mechanism module and the cross-correlation module after the first, second and third downsampled outputs;
54) Performing a deconvolution on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4× downsampled case, a 4× upsampling), obtaining upsampled output 1;
concatenating upsampled output 1 with the third downsampled output to obtain combined output 1;
applying a 3×3 convolution to combined output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 1;
applying a ConvLSTM operation to convolution output 1 to obtain LSTM output 1;
applying a 3×3 convolution to LSTM output 1, followed by instance normalization and a LeakyReLU, to obtain convolution output 2;
performing a deconvolution on convolution output 2 to upsample it to 1/4 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 2;
concatenating upsampled output 2 with the second downsampled output to obtain combined output 2;
applying a 3×3 convolution to combined output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 3;
applying a ConvLSTM operation to convolution output 3 to obtain LSTM output 2;
applying a 3×3 convolution to LSTM output 2, followed by instance normalization and a LeakyReLU, to obtain convolution output 4;
performing a deconvolution on convolution output 4 to upsample it to 1/2 of the original image size (for the 4× downsampled case, a 2× upsampling), obtaining upsampled output 3;
concatenating upsampled output 3 with the first downsampled output to obtain combined output 3;
applying a 3×3 convolution to combined output 3, followed by instance normalization and a LeakyReLU, to obtain convolution output 5;
applying a ConvLSTM operation to convolution output 5 to obtain LSTM output 3;
applying a 3×3 convolution to LSTM output 3, followed by instance normalization and a LeakyReLU, to obtain the final upsampled output;
55) Inputting the multi-scale feature maps extracted from the two modalities into the cross-modal feature fusion module;
56) The differential enhancement module obtains the differing-part feature map of the optical and SAR images through a difference operation, enhances the original feature maps with the attention weights to obtain the enhanced feature maps, and produces the differential enhancement feature map through weighted summation;
57) The common selection module obtains the common-part feature map of the optical and SAR images through an addition operation, obtains the attention maps through softmax, and multiplies the attention maps with the original feature maps to obtain new feature maps;
58) The differential enhancement feature map and the common selection feature map are weighted and summed to obtain the cross-modal feature map;
59) Feature stitching of the 4 feature maps of different sizes: the highest-level feature map passes through a C3+conv block to obtain a set of feature maps of the same size as the next level, is concatenated with the next-level feature maps, and passes through another C3+conv block to obtain a new set of feature maps; this process is repeated until the lowest level is reached;
510) Starting from the lowest-level feature map, outputting it to its head, concatenating it with the information output by the adjacent level, passing the result through a C3+conv block as a new output that also serves as the input to the next level, concatenating again and passing through a C3+conv block as a new output; this process is repeated until the highest level is reached;
511) Inputting the feature maps into the prediction head; in the first stage, the ARM module adjusts horizontal anchors into high-quality rotated anchors using the ATSS strategy;
512) After the first-stage adjustment, the ARM obtains a group of rotated anchors as second-stage candidate samples, inputs them into the target detection network for classification and regression, screens them according to the prediction results and their IoU with the real targets, and selects the sample with the largest IoU as the positive sample for refinement;
513) Computing the loss function and back-propagating to update the weight parameters;
514) If the number of training epochs reaches the preset number, the trained detection model is obtained; otherwise, returning to step 52) to reload data and continue training.
Fifthly, obtaining the detection results of the rotation target detection network fusing optical and SAR image multi-modal information: inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original image through visualization (a visualization sketch follows this paragraph).
Fig. 9 is a schematic diagram of the results of the rotation target detection network fusing optical and SAR image multi-modal information; the detected targets include a wharf, automobiles and ships. As can be seen from fig. 9, the method achieves good localization and classification of the targets in the image.
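A hedged sketch of the visualization step: drawing predicted rotated boxes and class labels on the original image with OpenCV. The prediction format (cx, cy, w, h, angle in degrees, class id, score) is an assumption, not a format fixed by the patent.

```python
import cv2
import numpy as np


def draw_rotated_predictions(image, predictions, class_names):
    """predictions: iterable of (cx, cy, w, h, angle_deg, class_id, score)."""
    canvas = image.copy()
    for cx, cy, w, h, angle, cls_id, score in predictions:
        # Convert the rotated-box parameters into the 4 corner points and draw them.
        pts = cv2.boxPoints(((cx, cy), (w, h), angle)).astype(np.int32)
        cv2.polylines(canvas, [pts], isClosed=True, color=(0, 255, 0), thickness=2)
        label = f"{class_names[int(cls_id)]} {score:.2f}"
        cv2.putText(canvas, label, (int(cx), int(cy)), cv2.FONT_HERSHEY_SIMPLEX,
                    0.5, (0, 255, 0), 1)
    return canvas
```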
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the above embodiments and descriptions merely illustrate the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (3)

1. The method for detecting the rotating target by fusing the optical and SAR image multi-mode information is characterized by comprising the following steps of:
11) Preparation of rotation target detection data and feature extraction fusing optical and SAR image multi-modal information: dividing and cutting the acquired remote sensing image data set; constructing a Transformer-UNet network based on an encoder-decoder structure to extract features of the remote sensing data;
111) Dividing the data set into a training set, a validation set and a test set at a ratio of 6:2:2, and cutting the images into non-overlapping patches of a uniform size of 256 × 256;
112) Constructing the parallel encoder-decoder Transformer-UNet structure, where network A processes the optical remote sensing image and network B processes the SAR remote sensing image;
1121) Constructing the double-layer convolution module, which comprises two convolution layers, two normalization layers and two ReLU activation functions; each convolution layer uses a kernel size of 3, padding of 1 and stride of 1;
1122) Constructing the downsampling structure for feature extraction, which consists of a DoubleConv module and a max pooling layer;
1123) Constructing the Bottleneck layer that connects the feature maps of the upsampling and downsampling stages; it comprises two convolution layers with kernel size 1 and stride 1 and one convolution layer with kernel size 3 and stride 1;
1124) Constructing the upsampling structure for feature extraction, which consists of a ConvLSTM layer and a convolution layer; the ConvLSTM unit comprises an input gate, a forget gate and an output gate, with kernel_size (3, 3) and stride (2, 2);
113 The specific steps of rotation target detection feature extraction of the multi-mode information of the fusion optics and SAR image are as follows:
1131 Inputting the preprocessed optical remote sensing image, SAR remote sensing image and tag data into a convolutional neural network, and training a downsampling feature extraction model with a self-attention mechanism, wherein the method comprises the following specific steps of:
1132 Performing a normal convolution layer with a convolution kernel size of 1 x 1 to convert the optical remote sensing image into information V for each element in the optical image providing sequence OPT Weight Q of each element in an optical image providing sequence OPT For calculating the similarity K between Q and K in an optical image OPT Three channel characteristics; converting SAR remote sensing image into information V of each element in SAR image providing sequence SAR The SAR image provides the weight Q of each element in the sequence SAR For calculating the similarity K between Q and K in SAR images SAR Three channel characteristics; executing the encoder structure once to obtain 4 downsamplesOutputting a sample;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the input picture to obtain the first downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the first downsampled output to obtain the second downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the second downsampled output to obtain the third downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the third downsampled output to obtain the fourth downsampled output;
1133 ) The first downsampled output, the second downsampled output and the third downsampled output are each followed by a self-attention mechanism module and a cross-correlation module, comprising the following steps:
performing a convolution with a convolution kernel of 1×1 to convert the optical remote sensing image into a V_OPT, Q_OPT, K_OPT three-channel feature matrix, and the SAR remote sensing image into a V_SAR, Q_SAR, K_SAR three-channel feature matrix;
multiplying the transpose of Q_OPT with K_OPT by dot product, applying softmax to the result, multiplying with V_OPT by dot product, and then performing a weighted sum with the original feature map to obtain the optical image self-attention mechanism feature map; the SAR image self-attention mechanism feature map is obtained by the same process;
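A minimal sketch of the Q/K/V self-attention just described, shown for a single branch (the optical and SAR branches are identical in structure); the learnable residual weight gamma and the module names are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Q/K/V self-attention over a feature map with a weighted residual add."""
    def __init__(self, ch):
        super().__init__()
        self.q = nn.Conv2d(ch, ch, kernel_size=1)
        self.k = nn.Conv2d(ch, ch, kernel_size=1)
        self.v = nn.Conv2d(ch, ch, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable fusion weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2)                           # (b, c, h*w)
        k = self.k(x).flatten(2)
        v = self.v(x).flatten(2)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)    # (b, h*w, h*w)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                        # weighted sum with original features
```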
extracting a support feature map and a query feature map from the self-attention mechanism feature map, reshaping them, generating the relation between the two maps using the cosine distance, obtaining the corresponding weights through global average pooling and a nonlinear network comprising 2 convolution layers and a ReLU layer, and obtaining the feature correlation after dot product multiplication and normalization; the cross-correlation module of the SAR remote sensing image is the same as that of the optical remote sensing image;
114 A Bottleneck layer for connecting the feature maps of the up-sampling and down-sampling phases is built up, consisting of three convolution layers:
the convolution kernel of the first convolution layer is 1×1, used for dimension reduction, reducing the number of input channels and the number of model parameters;
the convolution kernel of the second convolution layer is 3×3, used for convolving the feature map and extracting features;
the convolution kernel of the third convolution layer is 1×1, used for dimension increase, raising the number of channels of the convolved feature map and improving the expressive capacity of the model;
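A minimal sketch of this three-layer Bottleneck; normalization and activation between the convolutions are omitted, and the channel counts are illustrative assumptions:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 convolve -> 1x1 expand, as in step 114 )."""
    def __init__(self, in_ch, mid_ch, out_ch):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=1),               # reduce channels
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=1, padding=1),   # extract features
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, stride=1),              # expand channels
        )

    def forward(self, x):
        return self.layers(x)
```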
115 Constructing up-sampling ConvLSTM, specifically comprising the following steps:
performing a deconvolution operation (also called transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4-fold downsampled case, 4-fold upsampling), obtaining upsampled output 1;
splicing upsampled output 1 with the third downsampled output to obtain combined output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 1 to obtain convolution output 1;
performing a ConvLSTM operation on convolution output 1 to obtain LSTM output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 1 to obtain convolution output 2;
performing a deconvolution operation on convolution output 2 to upsample it to 1/4 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 2;
splicing upsampled output 2 with the second downsampled output to obtain combined output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 2 to obtain convolution output 3;
performing a ConvLSTM operation on convolution output 3 to obtain LSTM output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 2 to obtain convolution output 4;
performing a deconvolution operation on convolution output 4 to upsample it to 1/2 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 3;
splicing upsampled output 3 with the first downsampled output to obtain combined output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 3 to obtain convolution output 5;
performing a ConvLSTM operation on convolution output 5 to obtain LSTM output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 3 to obtain the final upsampled output;
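A minimal sketch of one such upsampling stage (deconvolution, splice with the skip connection, conv + instance norm + LeakyReLU, ConvLSTM, conv + instance norm + LeakyReLU); the ConvLSTM module is passed in (for example the cell sketched after step 1124 )), and the deconvolution kernel/stride of 2 is an assumption:

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One upsampling stage of step 115 )."""
    def __init__(self, in_ch, skip_ch, out_ch, conv_lstm: nn.Module):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # deconvolution
        self.conv1 = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))
        self.conv_lstm = conv_lstm   # e.g. the ConvLSTMCell sketched above
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.LeakyReLU(inplace=True))

    def forward(self, x, skip, state):
        x = torch.cat([self.up(x), skip], dim=1)  # splice with the downsampled output
        x = self.conv1(x)
        x, state = self.conv_lstm(x, state)
        return self.conv2(x), state
```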
116 Completing construction of a network A for processing the optical remote sensing image and a network B for processing the SAR remote sensing image;
12 ) Establishing a multi-mode multi-scale feature fusion module: constructing a framework for multi-modal feature fusion, extracting the multi-modal difference features and common features by using a differential enhancement module and a common selection module, and fusing them into a multi-modal feature map;
121 ) Constructing a multi-mode feature fusion framework for the optical remote sensing image and the SAR remote sensing image, wherein the framework comprises a differential enhancement module and a common selection module;
1211 The differential enhancement module specifically comprises the following steps:
performing difference operation on the extracted optical image features and SAR image features to obtain a feature map of a difference part;
calculating attention weights through hourglass-type 1×1 convolutions to obtain separate attention maps;
adding the obtained attention map to the original feature map in a residual manner to obtain an enhanced feature map;
performing a weighted sum of the enhanced feature maps of the optical remote sensing image and the SAR remote sensing image to obtain the differential enhancement feature map;
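A minimal sketch of the differential enhancement module as read from step 1211 ); the hourglass reduction ratio, the sigmoid gating and the learnable fusion weights are illustrative assumptions about details the claim does not fix:

```python
import torch
import torch.nn as nn

class DifferentialEnhancement(nn.Module):
    """Difference features -> hourglass 1x1 attention -> residual enhancement -> weighted sum."""
    def __init__(self, ch, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(                        # "hourglass" 1x1 convolutions
            nn.Conv2d(ch, ch // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, kernel_size=1),
            nn.Sigmoid())
        self.w_opt = nn.Parameter(torch.tensor(0.5))      # fusion weights (assumed learnable)
        self.w_sar = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_opt, f_sar):
        diff = f_opt - f_sar                              # difference-part feature map
        a = self.attn(diff)                               # attention map from the difference
        e_opt = f_opt + a * f_opt                         # residual enhancement of each branch
        e_sar = f_sar + a * f_sar
        return self.w_opt * e_opt + self.w_sar * e_sar    # differential enhancement feature map
```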
1212 ) The common selection module comprises the following specific steps:
performing an addition operation on the extracted optical image features and SAR image features to obtain a feature map of the common part;
obtaining the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image from the obtained common-part feature map by means of softmax;
multiplying the attention map of the optical remote sensing image and the attention map of the SAR remote sensing image with the corresponding input feature maps, respectively, to obtain new feature maps;
122 ) Performing a weighted sum of the new feature maps of the optical remote sensing image and the SAR remote sensing image to obtain the common selection feature map;
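A minimal sketch of the common selection module of steps 1212 )–122 ); the 1×1 projections before the softmax and the learnable fusion weights are assumptions, and the softmax is applied over the spatial positions of each channel:

```python
import torch
import torch.nn as nn

class CommonSelection(nn.Module):
    """Shared features -> softmax attention per modality -> reweighted inputs -> weighted sum."""
    def __init__(self, ch):
        super().__init__()
        self.proj_opt = nn.Conv2d(ch, ch, kernel_size=1)   # assumed projections before softmax
        self.proj_sar = nn.Conv2d(ch, ch, kernel_size=1)
        self.w_opt = nn.Parameter(torch.tensor(0.5))
        self.w_sar = nn.Parameter(torch.tensor(0.5))

    def forward(self, f_opt, f_sar):
        common = f_opt + f_sar                              # common-part feature map
        b, c, h, w = common.shape
        a_opt = torch.softmax(self.proj_opt(common).flatten(2), dim=-1).view(b, c, h, w)
        a_sar = torch.softmax(self.proj_sar(common).flatten(2), dim=-1).view(b, c, h, w)
        new_opt = a_opt * f_opt                             # reweight each input feature map
        new_sar = a_sar * f_sar
        return self.w_opt * new_opt + self.w_sar * new_sar  # common selection feature map
```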
13 ) Establishing a two-stage rotation prediction head module: constructing a two-stage prediction head module, and performing a second fine adjustment on the basis of the first-stage classification and localization;
14 ) Training the established network model with the divided training set and its corresponding labels, and adjusting the parameters until training reaches the preset number of epochs; finally, the corresponding parameters and the trained network are retained to detect other target images and obtain the results.
2. The method for detecting a rotation target by fusing optical and SAR image multi-mode information according to claim 1, wherein said establishing a two-stage rotation prediction head module comprises the steps of:
21 ) Constructing a feature pyramid structure to realize feature splicing, the outputs of which are taken as the inputs of the heads, with the following specific steps:
211 ) Inputting 4 feature maps of different sizes; the highest-layer feature map passes through a C3+conv to obtain a group of feature maps with the same size as the next-layer feature maps, which are spliced with the next-layer feature maps and then passed through a C3+conv to obtain a group of new feature maps; this process is repeated until the lowest layer is reached;
212 ) For the bottom-layer feature map, outputting it to the corresponding head; it is spliced with the information output by the previous layer, passed through a C3+conv as a new output and used as the input of the next layer, which is again spliced with the information output by the previous layer and passed through a C3+conv as a new output; this process is repeated until the highest layer is reached;
22 ) Constructing a rotation prediction head for remote sensing target detection, realizing target localization in two stages, with the following specific steps:
221 ) In the first stage, the ARM module uses the ATSS strategy to adjust horizontal anchors into high-quality rotated anchors, with the following steps:
extracting all horizontal anchors from the input feature map and regarding them as the candidate samples of the first stage;
calculating, for each candidate sample, the ratio between its center-point distance to all real targets (ground truth) and the target size, and dividing all candidate samples into positive and negative samples by comprehensively considering these two factors;
for the positive samples, generating a group of high-quality rotated anchors centered on the corresponding real targets as the positive samples of the first stage;
222 ) After the first-stage adjustment, the ARM obtains a group of candidate samples with rotated anchors for the second stage; these candidate samples are input into the target detection network for classification and regression and screened according to the IoU between the prediction results and the real targets, and finally the sample with the maximum IoU is selected as the positive sample for adjustment, with the following specific steps:
inputting the rotated anchors obtained in the first stage into the target detection network to obtain the detection results;
according to the detection results, calculating the IoU value between each rotated anchor and its corresponding real target, and selecting the sample with the maximum IoU value as the positive sample of the second stage;
taking the positive sample obtained in the second stage as the input positive sample, and performing classification and regression through the target detection network again to further improve the detection accuracy.
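A minimal sketch of the second-stage positive-sample selection; the rotated-IoU matrix between the first-stage anchors and the real targets is assumed to be computed elsewhere:

```python
import torch

def select_second_stage_positives(ious: torch.Tensor):
    """ious: (num_anchors, num_targets) rotated-IoU matrix.

    Returns, for each real target, the index of the max-IoU anchor and its IoU value,
    which are kept as the positive samples of the second stage."""
    best_iou, best_anchor = ious.max(dim=0)   # best anchor index per ground-truth target
    return best_anchor, best_iou
```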
3. The method for detecting a rotation target by fusing optical and SAR image multi-mode information according to claim 1, wherein the steps of training the network model and obtaining the results are as follows:
31 Inputting the preprocessed remote sensing image data into a rotating target detection network integrating optical and SAR image multi-mode information;
32 ) Performing a normal convolution layer with a convolution kernel size of 1×1 to convert the optical remote sensing image into V_OPT, Q_OPT, K_OPT three-channel features, and the SAR remote sensing image into V_SAR, Q_SAR, K_SAR three-channel features; executing the encoder structure once to obtain 4 downsampled outputs;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the input picture to obtain the first downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the first downsampled output to obtain the second downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the second downsampled output to obtain the third downsampled output;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization, a ReLU and a max pooling operation with a stride of 1 on the third downsampled output to obtain the fourth downsampled output;
33 ) The first downsampled output, the second downsampled output and the third downsampled output are each followed by a self-attention mechanism module and a cross-correlation module;
34 ) Performing a deconvolution operation (transposed convolution) on the fourth downsampled output to upsample it to 1/8 of the original image size (for the 4-fold downsampled case, 4-fold upsampling), obtaining upsampled output 1;
splicing upsampled output 1 with the third downsampled output to obtain combined output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 1 to obtain convolution output 1;
performing a ConvLSTM operation on convolution output 1 to obtain LSTM output 1;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 1 to obtain convolution output 2;
performing a deconvolution operation on convolution output 2 to upsample it to 1/4 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 2;
splicing upsampled output 2 with the second downsampled output to obtain combined output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 2 to obtain convolution output 3;
performing a ConvLSTM operation on convolution output 3 to obtain LSTM output 2;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 2 to obtain convolution output 4;
performing a deconvolution operation on convolution output 4 to upsample it to 1/2 of the original image size (for the 4-fold downsampled case, 2-fold upsampling), obtaining upsampled output 3;
splicing upsampled output 3 with the first downsampled output to obtain combined output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on combined output 3 to obtain convolution output 5;
performing a ConvLSTM operation on convolution output 5 to obtain LSTM output 3;
performing a normal convolution with a convolution kernel size of 3×3, an instance normalization and a LeakyReLU on LSTM output 3 to obtain the final upsampled output;
35 ) Inputting the multi-scale feature maps extracted from the two modalities into the cross-modal feature fusion module;
36 ) The differential enhancement module obtains the difference-part feature map of the optical image and the SAR image through a difference operation, enhances the original feature maps with the attention weights to obtain enhanced feature maps, and obtains the differential enhancement feature map through weighted summation;
37 ) The common selection module obtains the common-part feature map of the optical image and the SAR image through an addition operation, obtains attention maps through softmax, and multiplies the attention maps with the original feature maps to obtain new feature maps;
38 ) The differential enhancement feature map and the common selection feature map are subjected to weighted summation to obtain the cross-modal feature map;
39 ) Feature splicing of the 4 feature maps of different sizes: the highest-layer feature map passes through a C3+conv to obtain a group of feature maps with the same size as the next-layer feature maps, which are spliced with the next-layer feature maps and then passed through a C3+conv to obtain a group of new feature maps; this process is repeated until the lowest layer is reached;
310 ) For the bottom-layer feature map, outputting it to the corresponding head; it is spliced with the information output by the previous layer, passed through a C3+conv as a new output and used as the input of the next layer, which is again spliced with the information output by the previous layer and passed through a C3+conv as a new output; this process is repeated until the highest layer is reached;
311 ) Inputting the feature maps into the prediction head; in the first stage, the ARM module uses the ATSS strategy to adjust horizontal anchors into high-quality rotated anchors;
312 ) After the first-stage adjustment, the ARM obtains a group of candidate samples with rotated anchors for the second stage, inputs them into the target detection network for classification and regression, screens them according to the IoU between the prediction results and the real targets, and selects the sample with the largest IoU as the positive sample for adjustment;
313 ) Calculating the loss function and back-propagating to update the weight parameters;
314 ) Judging whether the number of training epochs reaches the preset number; if so, obtaining the trained model, otherwise returning to 32 ) to reload the data and continue training;
315 ) Using the obtained rotation target detection network fusing optical and SAR image multi-mode information, inputting the preprocessed test data set into the loaded model for prediction, and marking the target prediction boxes and target categories on the original images through visualization.
CN202310446031.9A 2023-04-22 2023-04-22 Rotation target detection method integrating optics and SAR image multi-mode information Active CN116452936B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310446031.9A CN116452936B (en) 2023-04-22 2023-04-22 Rotation target detection method integrating optics and SAR image multi-mode information

Publications (2)

Publication Number Publication Date
CN116452936A (en) 2023-07-18
CN116452936B (en) 2023-09-29

Family

ID=87120068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310446031.9A Active CN116452936B (en) 2023-04-22 2023-04-22 Rotation target detection method integrating optics and SAR image multi-mode information

Country Status (1)

Country Link
CN (1) CN116452936B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117528233A (en) * 2023-09-28 2024-02-06 哈尔滨航天恒星数据系统科技有限公司 Zoom multiple identification and target re-identification data set manufacturing method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190138786A1 (en) * 2017-06-06 2019-05-09 Sightline Innovation Inc. System and method for identification and classification of objects
CN112767418B (en) * 2021-01-21 2022-10-14 大连理工大学 Mirror image segmentation method based on depth perception

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012185712A (en) * 2011-03-07 2012-09-27 Mitsubishi Electric Corp Image collation device and image collation method
CN112307901A (en) * 2020-09-28 2021-02-02 国网浙江省电力有限公司电力科学研究院 Landslide detection-oriented SAR and optical image fusion method and system
CN112434745A (en) * 2020-11-27 2021-03-02 西安电子科技大学 Occlusion target detection and identification method based on multi-source cognitive fusion
WO2022142297A1 (en) * 2021-01-04 2022-07-07 Guangzhou Institute Of Advanced Technology, Chinese Academy Of Sciences A robot grasping system and method based on few-shot learning
CN113283435A (en) * 2021-05-14 2021-08-20 陕西科技大学 Remote sensing image semantic segmentation method based on multi-scale attention fusion
CN113469094A (en) * 2021-07-13 2021-10-01 上海中科辰新卫星技术有限公司 Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN114387439A (en) * 2022-01-13 2022-04-22 中国电子科技集团公司第五十四研究所 Semantic segmentation network based on fusion of optical and PolSAR (polar synthetic Aperture Radar) features
CN114565856A (en) * 2022-02-25 2022-05-31 西安电子科技大学 Target identification method based on multiple fusion deep neural networks
US11631238B1 (en) * 2022-04-13 2023-04-18 Iangxi Electric Power Research Institute Of State Grid Method for recognizing distribution network equipment based on raspberry pi multi-scale feature fusion
CN115497005A (en) * 2022-09-05 2022-12-20 重庆邮电大学 YOLOV4 remote sensing target detection method integrating feature transfer and attention mechanism
CN115496928A (en) * 2022-09-30 2022-12-20 云南大学 Multi-modal image feature matching method based on multi-feature matching
CN115830471A (en) * 2023-01-04 2023-03-21 安徽大学 Multi-scale feature fusion and alignment domain self-adaptive cloud detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Kai Xu; Siyuan Liu; Ziyi Wang. "Geometric Auto-Calibration of SAR Images Utilizing Constraints of Symmetric Geometry". IEEE Geoscience and Remote Sensing Letters, 2022, full text. *
Zhou Bo; Tong Haipeng; Chen Xiao; Xue Wei; Xu Kai. "The value of multimodal MRI in evaluating the expression level of tissue factor in glioblastoma". Journal of Third Military Medical University (第三军医大学学报), 2020, full text. *

Also Published As

Publication number Publication date
CN116452936A (en) 2023-07-18

Similar Documents

Publication Publication Date Title
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
CN116452936B (en) Rotation target detection method integrating optics and SAR image multi-mode information
CN113239736B (en) Land coverage classification annotation drawing acquisition method based on multi-source remote sensing data
CN113901900A (en) Unsupervised change detection method and system for homologous or heterologous remote sensing image
CN116704357B (en) YOLOv 7-based intelligent identification and early warning method for landslide of dam slope
CN113610070A (en) Landslide disaster identification method based on multi-source data fusion
CN115049640B (en) Road crack detection method based on deep learning
CN115439458A (en) Industrial image defect target detection algorithm based on depth map attention
CN115238758A (en) Multi-task three-dimensional target detection method based on point cloud feature enhancement
CN114119610A (en) Defect detection method based on rotating target detection
CN115861260A (en) Deep learning change detection method for wide-area city scene
Fan et al. A novel sonar target detection and classification algorithm
CN113408540B (en) Synthetic aperture radar image overlap area extraction method and storage medium
CN114596503A (en) Road extraction method based on remote sensing satellite image
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN116977747B (en) Small sample hyperspectral classification method based on multipath multi-scale feature twin network
CN113673556A (en) Hyperspectral image classification method based on multi-scale dense convolution network
CN114743023B (en) Wheat spider image detection method based on RetinaNet model
CN114663654B (en) Improved YOLOv4 network model and small target detection method
CN115661655A (en) Southwest mountain area cultivated land extraction method with hyperspectral and hyperspectral image depth feature fusion
CN112989919B (en) Method and system for extracting target object from image
CN117788296B (en) Infrared remote sensing image super-resolution reconstruction method based on heterogeneous combined depth network
CN116229174A (en) Hyperspectral multi-class change detection method based on spatial spectrum combined attention mechanism
CN114842001B (en) Remote sensing image detection system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant