CN116012679B

CN116012679B - Self-supervision remote sensing representation learning method based on multi-level cross-modal interaction

Info

Publication number: CN116012679B
Application number: CN202211635290.8A
Authority: CN
Inventors: 孙显; 王佩瑾; 何琪彬; 闫志远; 赵一铭; 常浩; 毕涵博
Original assignee: Aerospace Information Research Institute of CAS
Current assignee: Aerospace Information Research Institute of CAS
Priority date: 2022-12-19
Filing date: 2022-12-19
Publication date: 2023-06-16
Anticipated expiration: 2042-12-19
Also published as: CN116012679A

Abstract

The application relates to the technical field of remote sensing image processing, in particular to a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction. The method comprises the following steps: s100, acquiring a multi-mode remote sensing image sample pair set A; s200, traversing A, for a _n,m Performing blocking and random mask processing to obtain a _n,m Corresponding H mask blocks and D non-mask blocks; s300, performing joint training on M neural network models respectively corresponding to the M modes, wherein the neural network models comprise a data coding model and a decoder, and the joint training process comprises the following steps: will a _n,m The corresponding target embedded vector sequence B is input to a data coding model corresponding to the mth mode; loss l=l employed by the joint training ¹ +L ² ，L ¹ For the first level loss, L ² Is a second level loss. The invention improves the characteristic representation capability of the deep learning model on remote sensing images of different modes.

Description

Self-supervision remote sensing representation learning method based on multi-level cross-modal interaction

Technical Field

The invention relates to the technical field of remote sensing image processing, in particular to a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction.

Background

The information acquisition approaches in the big data age are various, and multi-modal data has become a main form of data resources in recent years. The data of different modes has different characteristics, for example, the scattering points of the SAR remote sensing image simultaneously comprise amplitude information and frequency information, and the resolution ratio of the optical remote sensing image is higher than that of the SAR remote sensing image, and the optical remote sensing image contains more detail information. The existing remote sensing representation learning method is only suitable for carrying out feature representation on remote sensing images of one mode, and when a deep learning model inputs remote sensing images of other modes, the deep learning model has poor feature representation capability on the remote sensing images of the other modes; how to improve the feature representation capability of the deep learning model on remote sensing images of different modes is a problem to be solved urgently.

Disclosure of Invention

The invention aims to provide a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction, which improves the characteristic representation capability of a deep learning model on remote sensing images of different modalities.

According to the invention, a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction is provided, which comprises the following steps:

s100, acquiring a multi-mode remote sensing image sample pair set A= { a ₁ ,a ₂ ,…,a _N }，a _n For the nth multi-mode remote sensing image sample pair, the value range of N is 1 to N, and N is the number of the multi-mode remote sensing image sample pairs included by A; a, a _n ＝(a _n,1 ,a _n,2 ,…,a _n,M )，a _n,m Is a as _n The method comprises the steps that in the M-th mode remote sensing image, the value range of M is 1 to M, M is the number of modes included in each multi-mode remote sensing image sample pair in A, and the multi-modes comprise at least two of optics, SAR, hyperspectrum and near infrared; the remote sensing images of M modes included in each multi-mode remote sensing image sample pair are the remote sensing images of the same scene, and the sequence of the remote sensing images corresponding to different modes in each multi-mode remote sensing image sample pair is the same.

S200, traversing A, for a _n,m Performing blocking and random mask processing to obtain a _n,m Corresponding H mask blocks and D non-mask blocks.

S300, carrying out joint training on M neural network models respectively corresponding to the M modes, wherein the neural network model comprisesThe method comprises the steps of a data coding model and a decoder, wherein A is a multi-mode remote sensing image sample pair set required by one joint training, and the joint training process comprises the following steps: will a _n,m The corresponding target embedded vector sequence B is input to a data coding model corresponding to the mth mode; b= (f) ^n,0 _m ,f ^n,1 _m ,f ^n,2 _m ,…,f ^n,H+D _m )，f ^n,0 _m Is a as _n,m Corresponding global embedded vector to be learned, f ^n,i _m Is a as _n,m The corresponding local embedded vector of the ith block, wherein the value range of i is 1 to H+D, and f is when the ith block is a mask block ^n,i _m To include the ith block at a _n,m A local embedded vector to be learned of the medium position information; when the ith block is a non-masking block, f ^n,i _m For including pixel value information of the ith block and the ith block in a _n,m A local embedded vector of the position information.

Loss l=l employed by the joint training ¹ +L ² ，L ¹ For the first level loss, L ² For the second level loss, L ¹ And (3) with

Negative correlation and L ¹ And->

Positive correlation, ->

A global feature expression vector obtained for inputting B into a data coding model corresponding to the mth modality,/V>

The method comprises the steps of inputting an embedded vector which is not subjected to random masking processing and corresponds to a reconstruction image of an mth mode into a global feature expression vector obtained by a data coding model corresponding to the mth mode, wherein the reconstruction image of the mth mode is obtained by adopting the method that ∈>

An image input to a decoder corresponding to the mth mode,/th mode>

Is a as _n An average global feature representation vector corresponding to a mode other than the mth mode,/->

To a is of _n The target embedded vector sequence corresponding to the jth mode is input into the global feature expression vector obtained by the data coding model corresponding to the jth mode; />

For the division a in A _n The q-th multimode remote sensing image sample corresponds to the global feature expression vector of the m-th mode, the value range of q is 1 to N-1, sim () is the similarity, and tau is the preset temperature; l (L) ² And (3) with

Positive correlation, ->

Is->

Is characterized by a probability distribution of->

Is->

Is characterized by a probability distribution of->

To a is of _n The target embedded vector sequence corresponding to the g-th mode is input into the global feature expression vector obtained by the data coding model corresponding to the g-th mode, g=1, 2, …, M, g is not equal to M,>

is->

And->

KL divergence of (c).

Compared with the prior art, the method provided by the invention has obvious beneficial effects, can achieve quite technical progress and practicality by virtue of the technical scheme, has wide industrial utilization value, and has at least the following beneficial effects:

the invention discloses a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction, which utilizes a multi-modal remote sensing image sample pair set to perform joint training on M neural network models, wherein each multi-modal remote sensing image sample pair comprises M remote sensing images of different modes in the same scene, a data coding model of each neural network model corresponds to the input of the remote sensing image of one mode. Based on the specific multi-mode remote sensing image sample pair set and the specific loss adopted in the joint training process, the data coding model in the neural network model can learn the information of the remote sensing images from different modes, the learning capacity of the data coding model in the neural network model on the information among the modes is improved, and the characteristic representation capacity of the remote sensing images of different modes is further improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction provided by an embodiment of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

According to the invention, a self-supervision remote sensing representation learning method based on multi-level cross-modal interaction is provided, as shown in fig. 1, comprising the following steps:

s100, acquiring a multi-mode remote sensing image sample pair set A= { a ₁ ,a ₂ ,…,a _N }，a _n For the nth multi-mode remote sensing image sample pair, the value range of N is 1 to N, and N is the number of the multi-mode remote sensing image sample pairs included by A; a, a _n ＝(a _n,1 ,a _n,2 ,…,a _n,M )，a _n,m Is a as _n The method comprises the steps that in the M-th mode remote sensing image, the value range of M is 1 to M, M is the number of modes included in each multi-mode remote sensing image sample pair in A, and the multi-modes comprise at least two of optics, SAR, hyperspectrum and near infrared; each multi-mode remote sensing image sample pair comprises M mode remote sensing images which are identicalThe remote sensing images of a scene have the same sequence of the remote sensing images corresponding to different modes in each multi-mode remote sensing image sample pair.

According to the invention, the number of the remote sensing images included in each multi-mode remote sensing image sample pair is the same, the modes corresponding to different remote sensing images included in each multi-mode remote sensing image sample pair are different, and the sequence of the remote sensing images corresponding to different modes in the corresponding multi-mode remote sensing image sample pair is the same. As an embodiment, each multi-modal remote sensing image sample pair comprises an optical remote sensing image and a SAR remote sensing image, m=2, a _n ＝(a _n,1 ,a _n,2 )，a _n,1 For optical remote sensing image, a _n,2 Is an SAR remote sensing image. As another embodiment, each multi-modal remote sensing image sample pair includes an optical remote sensing image, a SAR remote sensing image, and a hyperspectral remote sensing image, m=3, a _n ＝(a _n,1 ,a _n,2 ,a _n,3 )，a _n,1 For optical remote sensing image, a _n,2 A is SAR remote sensing image _n,3 Is a hyperspectral remote sensing image.

According to the invention, a _n,m The corresponding set of mask blocks is YP, a _n,m The corresponding set of non-masked chunks is NYP; yp= (YP) ₁ ,yp ₂ ,…,yp _H )，yp _h Is a as _n,m The corresponding H mask block has the value range of H from 1 to H, and H is a _n,m A corresponding number of mask blocks; nyp= (NYP) ₁ ,nyp ₂ ,…,nyp _D )，nyp _d Is a as _n,m The corresponding D-th non-mask block has D ranging from 1 to D, D being a _n,m A corresponding number of non-masked chunks.

Alternatively, to a _n,m Performing blocking and random masking processes, including:

s210, will a _n,m Equally divided into Z x Z blocks, each block having a size (M ₀ /Z)×(N ₀ /Z)×C _m Z is a preset positive integer, M ₀ Is a as _n,m Length of N ₀ Is a as _n,m Is of width C _m Is a as _n,m A corresponding number of image channels.

S220, performing random masking on the z×z blocks according to a preset ratio k to obtain h=round (k×z×z) mask blocks and d=z×z-H non-mask blocks, where round () is a rounding.

Those skilled in the art will appreciate that any method of blocking and random masking in the prior art falls within the scope of the present invention.

S300, performing joint training on M neural network models respectively corresponding to the M modalities, wherein the neural network models comprise a data coding model and a decoder, A is a multi-modality remote sensing image sample pair set required by one joint training, and the joint training process comprises the following steps: will a _n,m The corresponding target embedded vector sequence B is input to a data coding model corresponding to the mth mode; b= (f) ^n,0 _m ,f ^n,1 _m ,f ^n,2 _m ,…,f ^n,H+D _m )，f ^n,0 _m Is a as _n,m Corresponding global embedded vector to be learned, f ^n,i _m Is a as _n,m The corresponding local embedded vector of the ith block, wherein the value range of i is 1 to H+D, and f is when the ith block is a mask block ^n,i _m To include the ith block at a _n,m A local embedded vector to be learned of the medium position information; when the ith block is a non-masking block, f ^n,i _m For including pixel value information of the ith block and the ith block in a _n,m A local embedded vector of the position information.

Optionally, the data coding model is a transducer data coding model, and the decoder is a linear layer; as an embodiment, the decoder is configured to predict the original pixel values corresponding to the mask blocks according to the output of the data encoding model. Those skilled in the art will appreciate that any configuration of data encoding models and decoders in the prior art falls within the scope of the present invention.

According to the invention, when said i-th block is a non-masked block, f ^n,i _m Acquisition method of (a)Comprising the following steps:

and S310, stretching the ith block into a corresponding one-dimensional pixel vector according to the pixel value information corresponding to the ith block.

According to the invention, the size of the ith block is: (M) ₀ /Z)×(N ₀ /Z)×C _m The dimension of the one-dimensional pixel vector after stretching the ith block into the corresponding one-dimensional pixel vector is (M ₀ /Z)×(N ₀ /Z)×C _m . For example, the size of the i-th block is: 5×5×1, stretching the i-th block into a corresponding one-dimensional pixel vector, wherein the dimension of the one-dimensional pixel vector is 25, the 1 st element in the one-dimensional pixel vector is the pixel value of the pixel point with the coordinates of (1, 1) of the i-th block, the 2 nd element in the one-dimensional pixel vector is the pixel value of the pixel point with the coordinates of (1, 2) of the i-th block, and so on, the 24 th element in the one-dimensional pixel vector is the pixel value of the pixel point with the coordinates of (5, 4) of the i-th block, and the 25 th element in the one-dimensional pixel vector is the pixel value of the pixel point with the coordinates of (5, 5) of the i-th block.

S320, inputting the one-dimensional pixel vector into a linear mapping layer to obtain a pixel embedded vector corresponding to the ith block.

The linear mapping layer functions to transform the one-dimensional pixel vector into a pixel embedding vector of a preset dimension. Those skilled in the art will appreciate that any linear mapping layer in the prior art falls within the scope of the present invention.

S330, the position information corresponding to the ith block is encoded into a corresponding one-dimensional position vector, and the dimension of the one-dimensional position vector is the same as that of the one-dimensional pixel vector; and inputting the one-dimensional position vector into a linear mapping layer to obtain a position embedded vector corresponding to the ith block.

According to the invention, the dimension of the one-dimensional position vector is (M ₀ /Z)×(N ₀ /Z)×C _m The dimension of the position embedded vector obtained by the one-dimensional position vector through the linear mapping layer is the same as the dimension of the pixel embedded vector obtained by the one-dimensional pixel vector through the linear mapping layer, and the dimension is the preset dimension.

S340, the position corresponding to the ith block is determinedAdding the embedded vector and the pixel embedded vector to obtain f ^n,i _m 。

According to the invention, the position embedded vector corresponding to the ith block has the same dimension as the pixel embedded vector, and can be subjected to addition operation, and f is obtained after the addition operation ^n,i _m Including both the pixel value information of the ith block and the pixel value information of the ith block in a _n,m Is included in the location information.

According to the present invention, when the ith block is a mask block, the ith block is replaced with a mask vector to be learned, the dimension of the mask vector is set to be the same as the dimension of the pixel embedding vector, and then the position embedding vector corresponding to the ith block is added with the mask vector to obtain f ^n,i _m The method comprises the steps of carrying out a first treatment on the surface of the F is of ^n,i _m To include the ith block at a _n,m A local embedded vector to be learned of the position information.

According to the invention, the loss L=L used for the joint training ¹ +L ² ，L ¹ For the first level loss, L ² For the second level loss, L ¹ And (3) with

Negative correlation and L ¹ And->

Positive correlation, ->

An image input to a decoder corresponding to the mth mode,/th mode>

Is a as _n The average global feature representation vector corresponding to a modality other than the mth modality,

For the division a in A _n The q-th multimode remote sensing image sample corresponds to the global feature expression vector of the m-th mode, the value range of q is 1 to N-1, sim () is the similarity, and tau is the preset temperature; l (L) ² And->

Positive correlation, ->

Is->

Is characterized by a probability distribution of->

Is->

Is characterized by a probability distribution of->

To a is of _n The target embedded vector sequence corresponding to the g-th mode is input into the global feature expression vector obtained by the data coding model corresponding to the g-th mode, g=1, 2, …, M, g not equal to M,

is->

And->

KL divergence of (c).

Alternatively to this, the method may comprise,

those skilled in the art will appreciate that any similarity calculation method in the prior art falls within the scope of the present invention.

Acquisition of

And->

Thereafter, the process of how to obtain the feature probability distribution and obtain the KL divergence is the prior art, and will not be described herein. Those skilled in the art will appreciate that any method of obtaining a characteristic probability distribution and a method of obtaining a KL divergence in the prior art fall within the scope of the present invention.

Optionally, the first level loss L ¹ The following relationship is satisfied:

for the division a in A _n The other multimode remote sensing image samples except the multimode remote sensing image samples correspond to the set of global feature representation vectors of the mth mode.

Optionally, the second level loss L ² The following relationship is satisfied:

according to the invention, the characteristic representation of the remote sensing images of different modes can be performed by using the data coding model of each neural network model after the training is finished.

The method does not need to label the remote sensing images in the multi-mode remote sensing image sample pair, belongs to a self-supervision learning method, and solves the problems of time and labor waste and high labeling cost caused by manually labeling the remote sensing images in the prior art.

While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A self-supervision remote sensing representation learning method based on multi-level cross-modal interaction is characterized by comprising the following steps:

s100, acquiring a multi-mode remote sensing image sample pair set A= { a ₁ ,a ₂ ,…,a _N }，a _n For the nth multi-mode remote sensing image sample pair, the value range of N is 1 to N, and N is the number of the multi-mode remote sensing image sample pairs included by A; a, a _n ＝(a _n,1 ,a _n,2 ,…,a _n,M )，a _n,m Is a as _n The method comprises the steps that in the M-th mode remote sensing image, the value range of M is 1 to M, M is the number of modes included in each multi-mode remote sensing image sample pair in A, and the multi-modes comprise at least two of optics, SAR, hyperspectrum and near infrared; the remote sensing images of M modes included in each multi-mode remote sensing image sample pair are the remote sensing images of the same scene, and the sequence of the remote sensing images corresponding to different modes in each multi-mode remote sensing image sample pair is the same;

s200, traversing A, for a _n,m Performing blocking and random mask processing to obtain a _n,m Corresponding H mask blocks and D non-mask blocks;

s300, performing joint training on M neural network models respectively corresponding to the M modalities, wherein the neural network models comprise a data coding model and a decoder, A is a multi-modality remote sensing image sample pair set required by one joint training, and the joint training process comprises the following steps: will a _n,m The corresponding target embedded vector sequence B is input to a data coding model corresponding to the mth mode; b= (f) ^n,0 _m ,f ^n,1 _m ,f ^n,2 _m ,…,f ^n,H+D _m )，f ^n,0 _m Is a as _n,m Corresponding global embedded vector to be learned, f ^n,i _m Is a as _n,m The corresponding local embedded vector of the ith block, wherein the value range of i is 1 to H+D, and f is when the ith block is a mask block ^n,i _m To include the ith blocka _n,m A local embedded vector to be learned of the medium position information; when the ith block is a non-masking block, f ^n,i _m For including pixel value information of the ith block and the ith block in a _n,m A local embedded vector of the medium position information;

Negative correlation and L ¹ And->

Positive correlation, ->

An image input to a decoder corresponding to the mth mode,/th mode>

Positive correlation, ->

Is->

Is characterized by a probability distribution of->

Is->

Is characterized by a probability distribution of->

is->

And->

KL divergence of (c).

2. The self-monitoring remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, wherein in S300,

3. The self-monitoring remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, wherein in S300,

4. the self-monitoring remote sensing representation learning method based on multi-level cross-modal interaction according to claim 1, wherein in S200, for a _n,m Performing blocking and random masking processes, including:

s210, will a _n,m Equally divided into Z x Z blocks, each block having a size (M ₀ /Z)×(N ₀ /Z)×C _m Z is a preset positive integer, M ₀ Is a as _n,m Length of N ₀ Is a as _n,m Is of width C _m Is a as _n,m A corresponding number of image channels;

5. The multiple-based system of claim 1A self-supervision remote sensing representation learning method of hierarchical cross-modal interaction is characterized in that in S300, when the ith block is a non-mask block, f ^n,i _m The acquisition method of (1) comprises the following steps:

s310, stretching the ith block into a corresponding one-dimensional pixel vector according to the pixel value information corresponding to the ith block;

s320, inputting the one-dimensional pixel vector into a linear mapping layer to obtain a pixel embedded vector corresponding to the ith block;

s330, the position information corresponding to the ith block is encoded into a corresponding one-dimensional position vector, and the dimension of the one-dimensional position vector is the same as that of the one-dimensional pixel vector; inputting the one-dimensional position vector to a linear mapping layer to obtain a position embedded vector corresponding to the ith block;

s340, adding the position embedded vector corresponding to the ith block and the pixel embedded vector to obtain f ^n,i _m 。

6. The self-supervised remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, wherein the data encoding model is a transducer data encoding model.

7. The method of claim 1, wherein the decoder is a linear layer.

8. The self-monitoring remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, wherein in S300,