CN111898663B

CN111898663B - Cross-modal remote sensing image matching method based on transfer learning

Info

Publication number: CN111898663B
Application number: CN202010701646.8A
Authority: CN
Inventors: 杨文�; 徐芳; 夏桂松; 张瑞祥
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2020-07-20
Filing date: 2020-07-20
Publication date: 2022-05-13
Anticipated expiration: 2040-07-20
Also published as: CN111898663A

Abstract

The invention discloses a cross-modal remote sensing image matching method based on transfer learning. And simultaneously inputting the labeled cross-modal remote sensing image data and the unlabeled cross-modal data into the network. The network comprises two feature extractors, the parameter parts of which are shared, and the two feature extractors are respectively used for extracting the features of the optical image and the features of the SAR image. The training phase comprises two tasks, namely learning the metric criterion between the two modalities of the optical image and the SAR image by using the labeled data, and obfuscating the data of the same modality of different imaging devices. The invention can effectively transfer the metric criterion learned from the data with the label to the data without the label, and carry out high-precision matching on the cross-mode remote sensing image without the label.

Description

Cross-modal remote sensing image matching method based on transfer learning

Technical Field

The invention relates to the technical field of image processing, in particular to a cross-modal remote sensing image matching method based on transfer learning.

Background

In recent years, diversification of imaging systems has led to diversification of remote sensing images, and for example, satellites such as high-resolution-two, WorldView-2, and Sentiniel-2 can acquire optical images, and satellites such as high-resolution-three, TerrasAR-X, Sentinel-1 can acquire Synthetic Aperture Radar (SAR) images. These images of different modalities have completely different descriptive forms, both relevance and complementarity, for the same thing. Therefore, the matching between the images can provide more comprehensive and valuable information to overcome the defects caused by the extraction and the interpretation of the single-source remote sensing information.

With the development of deep learning, the cross-modal remote sensing image matching based on the convolutional neural network achieves unprecedented achievement. However, such methods require a large number of label samples and tend to focus only on data for a particular imaging device, such as the matching of a Sentinel-2 optical image to a Sentinel-1 synthetic aperture radar image. When the trained model is applied to the cross-mode image matching tasks of other imaging devices, such as matching of a WorldView-2 optical image with a Capella Space synthetic aperture radar image or matching of an unmanned aerial vehicle optical image with a synthetic aperture radar image, even if the two modes are the same, the performance of the model can be suddenly reduced.

The inventor of the present application finds that the method of the prior art has at least the following technical problems in the process of implementing the present invention:

the prior art has been through the collection of label samples of specific imaging device data to be matched. However, tagged sample data is often poor and difficult to obtain, for example, tagged samples of drone optical images and synthetic aperture radar images, resulting in poor cross-modal image matching performance.

Disclosure of Invention

The invention provides a cross-modal remote sensing image matching method based on transfer learning, which is used for solving or at least partially solving the technical problem of poor cross-modal remote sensing image matching performance in the prior art.

In order to solve the technical problem, the invention provides a cross-modal remote sensing image matching method based on transfer learning, which comprises the following steps:

s1: extracting the characteristics of the optical image with the label through a first characteristic extractor, and extracting the characteristics of the SAR image with the label through a second characteristic extractor;

s2: extracting the characteristics of the non-label optical image through a first characteristic extractor, and extracting the characteristics of the non-label SAR image through a second characteristic extractor;

s3: inputting the extracted features of the SAR image with the label and the features of the SAR image without the label to be matched into a gradient inversion layer and a second image discriminator, wherein the gradient inversion layer is used for automatically inverting the gradient direction in the backward propagation process, and the second image discriminator is used for discriminating which type of SAR image acquired by the imaging equipment is input into the network according to the extracted features of the SAR image; inputting the extracted characteristics of the optical image with the label and the characteristics of the optical image without the label to be matched into a gradient inversion layer and a first image discriminator, wherein the first image discriminator is used for discriminating which kind of imaging equipment obtains the optical image according to the extracted characteristics of the optical image;

s4: calculating a first loss function from the features of the labeled SAR image and the features of the labeled optical image extracted in S1

Wherein, F^sarRepresenting features extracted from a sample, the sample being a labelled SAR image, F^opt+Representing features extracted from a positive sample, an optical image matching the sample, F^opt-Representing the characteristics extracted from the negative sample, wherein the negative sample is an optical image which is not matched with the sample, m is a set threshold value, the first loss function is used for learning a measurement criterion between two modes of the optical image and the SAR image, and similarity calculation of data of the two modes is realized by optimizing that the distance between the sample and the positive sample is smaller than the distance between the sample and the negative sample;

s5: calculating a second loss function from the SAR image acquired by the imaging device characterized by the input

Wherein the content of the first and second substances,

representing features of SAR images, R, from some imaging device_λ(F^sar) Representation pair feature F^sarPerforming a gradient inversion operation, /)_t(·)＝tlog(D_sar(·))+(1-t)log(1-D_sar(.)), if the input feature is from a tagged SAR image, t is 0; if the input features are from the unlabeled SAR image to be matched, t is 1, and the second loss isThe loss function is used for closing the difference between SAR images acquired by different imaging devices;

s6: calculating a third loss function from the input optical image characterized by that type of imaging device

Wherein the content of the first and second substances,

representing features of an optical image, R, from some imaging device_λ(.) represents a gradient inversion operation on a certain feature, m_q(·)＝qlog(D_opt(·))+(1-q)log(1-D_opt(.)), wherein if the input feature is from a tagged optical image, q is 0; if the input features are from the unlabeled optical images to be matched, q is equal to 1, and a third loss function is used for closing the difference between the optical images acquired by different imaging devices;

s7: calculating a total loss function from the first loss function, the second loss function, and the third loss function

Wherein β represents a weight;

s8: training a first feature extractor, a second feature extractor, a first image discriminator and a second image discriminator through a back propagation algorithm based on a total loss function;

s9: the features of the SAR image and the optical image to be matched are extracted through the trained first feature extractor and the trained second feature extractor obtained in S8, and the euclidean distance of the features of the two modalities is calculated to determine the matching degree of the images of the two modalities, wherein the smaller the euclidean distance value, the higher the matching degree is.

In one embodiment, S1 specifically includes:

will sample I^sarInput to a second feature extractor E_sarIn (c), sample I is extracted^sarCharacteristic F of^sar(ii) a Positive sample I^opt+Input first feature extractor E_optIn (c), a positive sample I is taken^opt+Characteristic F of^opt+(ii) a Negative sample I^opt-Input to a first feature extractor E_optIn (c), a negative sample I is extracted^opt-Characteristic F of^opt-Wherein the first feature extractor E_opt(. o) and a second feature extractor E_sarBoth of them share some parameters, i.e. from the fourth layer.

Second image discriminator D at S3_sar(. DEG) and a first image discriminator D_optThe structures of the two classifiers are the same, the two classifiers are all deep convolutional networks, and each deep convolutional network comprises two full connection layers and a Sigmoid function and is equivalent to a two-classifier.

In one embodiment, during the training in S8, an Adam optimizer is used and the learning rate is set to 0.001.

One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:

the method utilizes transfer learning to transfer the metric criterion learned from the labeled cross-modal remote sensing image data to the cross-modal remote sensing images of other unlabeled imaging equipment, so as to improve the matching performance of the labeled cross-modal remote sensing image data. The network simultaneously inputs labeled cross-modal remote sensing image data (optical images and SAR images) and unlabeled cross-modal data. The network comprises two feature extractors, the parameter parts of which are shared, and the two feature extractors are respectively used for extracting the features of the optical image and the features of the SAR image. The training stage comprises the following two tasks, wherein the first task is to utilize the labeled data to learn the measurement criterion between the two modes of the optical image and the SAR image, and realize the similarity calculation of the data of the two modes by optimizing that the distance between the sample and the positive sample is less than the distance between the sample and the negative sample; the second task is to mix up the same modal data of different imaging devices, so that the model learned on the labeled cross-modal remote sensing image can be applied to the label-free cross-modal remote sensing image of the specific imaging device to be matched. The method for improving the cross-modal remote sensing image matching performance by using transfer learning can effectively transfer the metric criterion learned from the labeled data to the unlabeled data, carry out higher-precision matching on the unlabeled cross-modal remote sensing image, and improve the image matching performance.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a diagram of a deep convolutional neural network model used in the present invention.

Detailed Description

Aiming at the defects in the prior art, the invention aims to provide a method for improving the cross-modal remote sensing image matching performance by utilizing transfer learning, which can better solve the problem that the performance of a cross-modal matching model learned from data of a specific imaging device is suddenly reduced when the cross-modal matching model is applied to cross-modal remote sensing images of other imaging devices. The method can transfer the metric criteria learned from the labeled cross-modal remote sensing image data to the cross-modal remote sensing images of other unlabeled imaging equipment, and perform high-precision matching on the cross-modal remote sensing images of other unlabeled imaging equipment.

In order to achieve the technical effects, the main inventive concept of the invention is as follows:

extracting high-level semantic features of an optical image and an SAR image by using a deep convolutional neural network, learning a measurement criterion between two modes on labeled cross-mode remote sensing image data by using twin learning, and transferring the learned measurement criterion to the cross-mode remote sensing image of other unlabeled imaging equipment by using transfer learning.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The embodiment of the invention provides a cross-modal remote sensing image matching method based on transfer learning, which comprises the following steps:

and the first feature extractor and the second feature extractor are both deep convolutional neural networks.

specifically, the tagged SAR image is from one imaging device and the untagged SAR image is from another imaging device. Second image discriminator D_sar(. cndot.) is to determine from which imaging device the incoming data came, which can be understood as a two-classifier.The result of the determination can be used in step S5 if D is the case_sarAnd the imaging device from which the data comes can not be correctly judged according to the input features, and the data features of different imaging devices are considered to have the same distribution, namely the difference between SAR images acquired by different imaging devices is closed. The first image discriminator is similar and will not be described in detail.

in practical implementation, the value of m may be set according to practical situations, for example, set to 1.

Wherein the content of the first and second substances,

representing features of SAR images, R, from some imaging device_λ(F^sar) Representation pair feature F^sarPerforming a gradient inversion operation, /)_t(·)＝tlog(D_sar(·))+(1-t)log(1-D_sar(.)), if the input feature is from a tagged SAR image, t is 0; if the input features come from the unlabeled SAR images to be matched, t is 1, and the second loss function is used for closing the difference between the SAR images acquired by different imaging devices;

Wherein the content of the first and second substances,

representing features representing an optical image from some imaging device, R_λ(.) represents a gradient inversion operation on a certain feature, m_q(·)＝qlog(D_opt(·))+(1-q)log(1-D_opt(.)), wherein if the input feature is from a tagged optical image, q is 0; if the input features are from the unlabeled optical images to be matched, q is equal to 1, and a third loss function is used for closing the difference between the optical images acquired by different imaging devices;

Wherein β represents a weight; in one embodiment, wherein the weight β is 0.001.

Referring to fig. 1, the overall deep convolutional neural network model diagram used in the present invention includes a first feature extractor (feature extractor 1), a second feature extractor (feature extractor 2), a gradient inversion layer, a first discriminator (discriminator 1), and a second discriminator (discriminator 2). The first loss function calculates the triplet loss, the second loss function calculates the domain loss, and the third loss function calculates the domain loss 1.

In one embodiment, S1 specifically includes:

will sample I^sarInput to a second feature extractor E_sarIn (c), sample I is extracted^sarCharacteristic (F)^sar(ii) a Positive sample I^opt+Input first feature extractor E_optIn (c), a positive sample I is taken^opt+Characteristic F of^opt+(ii) a Negative sample I^opt-Input to a first feature extractor E_optIn (c), a negative sample I is extracted^opt-Characteristic F of^opt-Wherein the first feature extractor E_opt(. o) and a second feature extractor E_sarBoth of them share some parameters, i.e. from the fourth layer.

In one embodiment, the second image discriminator D in S3_sar(. DEG) and a first image discriminator D_optThe structures of the two classifiers are the same, the two classifiers are all deep convolutional networks, and each deep convolutional network comprises two full connection layers and a Sigmoid function and is equivalent to a two-classifier.

In the experimental process, the metric criterion learned on the cross-modal remote sensing image data set SEN1-2 is migrated to the cross-modal remote sensing image data set SpaceNet6 data set. In the SEN1-2 dataset, the optical images are from a Sentinel-2 satellite and the SAR images are from a Sentinel-1 satellite. In the SpaceNet6 dataset, the optical images were from the WorldView-2 satellite and the SAR images were from the Capella Space constellation. For the measurement of the matching precision, the method adopts the following analysis indexes: accuracy (Accuracy) and auc (area understhe ROC curve). The results of the experiments on the SpaceNet6 data set are shown in Table 1. According to the analysis of the matching precision, the method can transfer the metric criteria learned from the labeled cross-modal remote sensing image data to the cross-modal remote sensing images of other unlabeled imaging equipment, so that the matching performance of the labeled cross-modal remote sensing image data is improved.

TABLE 1 matching accuracy analysis on SpaceNet6 dataset

The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.

Claims

1. A cross-modal remote sensing image matching method based on transfer learning is characterized by comprising the following steps:

s3: inputting the extracted features of the SAR image with the label and the features of the SAR image without the label to be matched into a gradient inversion layer and a second image discriminator, wherein the gradient inversion layer is used for automatically inverting the gradient direction in the backward propagation process, and the second image discriminator for realizing the identity transformation in the forward propagation process is used for discriminating which type of SAR image acquired by the imaging equipment is input into the network according to the extracted features of the SAR image; inputting the extracted characteristics of the optical image with the label and the characteristics of the optical image without the label to be matched into a gradient inversion layer and a first image discriminator, wherein the first image discriminator is used for discriminating which kind of imaging equipment obtains the optical image according to the extracted characteristics of the optical image;

Wherein the content of the first and second substances,

representing features of SAR images, R, from some imaging device_λ(F^sar) Representation pair feature F^sarPerforming a gradient inversion operation, /)_t(·)＝tlog(D_sar(·))+(1-t)log(1-D_sar(.)), if the input feature is from a tagged SAR image, t is 0; if the input features are from unlabeled SAR images to be matched, t is 1, and a second loss function is used for closing the difference between the SAR images acquired by different imaging devices, D_sar() represents a second image discriminator;

Wherein the content of the first and second substances,

representing features of an optical image, R, from some imaging device_λ(.) represents a gradient inversion operation on a certain feature, m_q(·)＝qlog(D_opt(·))+(1-q)log(1-D_opt(.)), wherein if the input feature is from a tagged optical image, q is 0; if the input features are from unlabeled optical images to be matched, q is 1, a third loss function is used to bridge the differences between the optical images acquired by the different imaging devices, D_opt() represents a first image discriminator;

Wherein β represents a weight;

2. The matching method of the cross-modal remote sensing image based on the transfer learning of claim 1, wherein S1 specifically includes:

sample I^sarInput to a second feature extractor E_sarIn (c), sample I is extracted^sarCharacteristic F of^sar(ii) a Positive sample I^opt+Input first feature extractor E_optIn (c), a positive sample I is taken^opt+Characteristic F of^opt+(ii) a Negative sample I^opt-Input to a first feature extractor E_optIn (c), a negative sample I is extracted^opt-Characteristic (F)^opt-Wherein the first feature extractor E_opt(. o) and a second feature extractor E_sarBoth of these (are) the ResNet-34 model, and both share part of the parameters, i.e. share the parameters from the fourth layer.

3. The matching method of trans-modal remote sensing image based on transfer learning of claim 1Characterized in that the second image discriminator D at S3_sar(. cndot.) and a first image discriminator D_optThe structures of the two classifiers are the same, the two classifiers are all deep convolutional networks, and each deep convolutional network comprises two full connection layers and a Sigmoid function and is equivalent to a two-classifier.

4. The matching method for the trans-modal remote sensing image based on the transfer learning of claim 1, wherein in the training process in S8, an Adam optimizer is adopted, and the learning rate is set to be 0.001.