CN116012679B - Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction - Google Patents

Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction

Info

Publication number
CN116012679B
Authority
CN
China
Prior art keywords
remote sensing
mode
vector
sensing image
block
Prior art date
Legal status
Active
Application number
CN202211635290.8A
Other languages
Chinese (zh)
Other versions
CN116012679A (en)
Inventor
孙显
王佩瑾
何琪彬
闫志远
赵一铭
常浩
毕涵博
Current Assignee
Aerospace Information Research Institute of CAS
Original Assignee
Aerospace Information Research Institute of CAS
Priority date
2022-12-19
Filing date
2022-12-19
Publication date
2023-06-16
Application filed by Aerospace Information Research Institute of CAS
Priority to CN202211635290.8A
Publication of CN116012679A
Application granted
Publication of CN116012679B

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to the technical field of remote sensing image processing, in particular to a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction. The method comprises the following steps: S100, acquiring a multi-modal remote sensing image sample pair set A; S200, traversing A, and performing blocking and random mask processing on each $a_{n,m}$ to obtain the H mask blocks and D non-mask blocks corresponding to $a_{n,m}$; S300, jointly training M neural network models corresponding to the M modalities, each comprising a data coding model and a decoder, the joint training including inputting the target embedded vector sequence B corresponding to $a_{n,m}$ into the data coding model corresponding to the m-th modality. The loss employed by the joint training is $L = L_1 + L_2$, where $L_1$ is the first-level loss and $L_2$ is the second-level loss. The invention improves the feature representation capability of deep learning models on remote sensing images of different modalities.

Description

Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction
Technical Field
The invention relates to the technical field of remote sensing image processing, in particular to a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction.
Background
Information acquisition channels in the big-data era are diverse, and multi-modal data has become the main form of data resources in recent years. Data of different modalities have different characteristics: for example, the scattering points of a SAR remote sensing image carry both amplitude information and frequency information, while an optical remote sensing image has higher resolution than a SAR remote sensing image and contains more detail information. Existing remote sensing representation learning methods are suited to feature representation for remote sensing images of only one modality, so when remote sensing images of other modalities are input, the deep learning model represents their features poorly. How to improve the feature representation capability of deep learning models on remote sensing images of different modalities is therefore a problem to be solved urgently.
Disclosure of Invention
The invention aims to provide a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction, which improves the feature representation capability of deep learning models on remote sensing images of different modalities.
According to the invention, a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction is provided, comprising the following steps:
s100, acquiring a multi-mode remote sensing image sample pair set A= { a 1 ,a 2 ,…,a N },a n For the nth multi-mode remote sensing image sample pair, the value range of N is 1 to N, and N is the number of the multi-mode remote sensing image sample pairs included by A; a, a n =(a n,1 ,a n,2 ,…,a n,M ),a n,m Is a as n The method comprises the steps that in the M-th mode remote sensing image, the value range of M is 1 to M, M is the number of modes included in each multi-mode remote sensing image sample pair in A, and the multi-modes comprise at least two of optics, SAR, hyperspectrum and near infrared; the remote sensing images of M modes included in each multi-mode remote sensing image sample pair are the remote sensing images of the same scene, and the sequence of the remote sensing images corresponding to different modes in each multi-mode remote sensing image sample pair is the same.
S200, traversing A, and performing blocking and random mask processing on $a_{n,m}$ to obtain the H mask blocks and D non-mask blocks corresponding to $a_{n,m}$.
S300, jointly training M neural network models corresponding one-to-one to the M modalities, where each neural network model comprises a data coding model and a decoder, and A is the multi-modal remote sensing image sample pair set required for one joint training. The joint training process comprises: inputting the target embedded vector sequence $B = (f_{n,0}^m, f_{n,1}^m, f_{n,2}^m, \ldots, f_{n,H+D}^m)$ corresponding to $a_{n,m}$ into the data coding model corresponding to the m-th modality, where $f_{n,0}^m$ is the global embedded vector to be learned corresponding to $a_{n,m}$, $f_{n,i}^m$ is the local embedded vector of the i-th block of $a_{n,m}$, and i ranges from 1 to H+D; when the i-th block is a mask block, $f_{n,i}^m$ is a local embedded vector to be learned comprising the position information of the i-th block in $a_{n,m}$; when the i-th block is a non-mask block, $f_{n,i}^m$ is a local embedded vector comprising the pixel value information of the i-th block and the position information of the i-th block in $a_{n,m}$.
The loss employed by the joint training is $L = L_1 + L_2$, where $L_1$ is the first-level loss and $L_2$ is the second-level loss. $L_1$ is negatively correlated with $\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})$ and with $\mathrm{sim}(v_n^m, \tilde{v}_n^m)$, and positively correlated with $\mathrm{sim}(v_n^m, v_q^m)$, where:
$v_n^m$ is the global feature representation vector obtained by inputting B into the data coding model corresponding to the m-th modality;
$\tilde{v}_n^m$ is the global feature representation vector obtained by inputting the embedded vectors, not subjected to random mask processing, of the reconstructed image of the m-th modality into the data coding model corresponding to the m-th modality, the reconstructed image of the m-th modality being the image obtained by inputting the output of the data coding model corresponding to the m-th modality into the decoder corresponding to the m-th modality;
$\bar{v}_n^{\neq m}$ is the average global feature representation vector of $a_n$ over the modalities other than the m-th modality, $\bar{v}_n^{\neq m} = \frac{1}{M-1} \sum_{j=1, j \neq m}^{M} v_n^j$, where $v_n^j$ is the global feature representation vector obtained by inputting the target embedded vector sequence of the j-th modality of $a_n$ into the data coding model corresponding to the j-th modality;
$v_q^m$ is the global feature representation vector of the m-th modality for the q-th multi-modal remote sensing image sample pair in A other than $a_n$, q ranging from 1 to N−1; sim() is the similarity and τ is a preset temperature.
$L_2$ is positively correlated with $\mathrm{KL}(P(v_n^m) \,\|\, P(v_n^g))$, where $P(v_n^m)$ is the feature probability distribution of $v_n^m$, $P(v_n^g)$ is the feature probability distribution of $v_n^g$, $v_n^g$ is the global feature representation vector obtained by inputting the target embedded vector sequence of the g-th modality of $a_n$ into the data coding model corresponding to the g-th modality, g = 1, 2, …, M, g ≠ m, and KL(·‖·) denotes the Kullback–Leibler divergence of the two feature probability distributions.
Compared with the prior art, the method provided by the invention has obvious beneficial effects. By virtue of the above technical scheme, it achieves considerable technical progress and practicality and has broad industrial utilization value; it has at least the following beneficial effects:
The invention discloses a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction that uses a set of multi-modal remote sensing image sample pairs to jointly train M neural network models, where each multi-modal remote sensing image sample pair comprises M remote sensing images of different modalities of the same scene and the data coding model of each neural network model takes remote sensing images of one modality as input. Based on this specific set of multi-modal remote sensing image sample pairs and the specific loss adopted during joint training, the data coding model in each neural network model can learn information from remote sensing images of different modalities, which improves its ability to learn inter-modality information and thus its feature representation capability for remote sensing images of different modalities.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of the self-supervised remote sensing representation learning method based on multi-level cross-modal interaction provided by an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
According to the invention, a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction is provided, as shown in Fig. 1, comprising the following steps:
s100, acquiring a multi-mode remote sensing image sample pair set A= { a 1 ,a 2 ,…,a N },a n For the nth multi-mode remote sensing image sample pair, the value range of N is 1 to N, and N is the number of the multi-mode remote sensing image sample pairs included by A; a, a n =(a n,1 ,a n,2 ,…,a n,M ),a n,m Is a as n The method comprises the steps that in the M-th mode remote sensing image, the value range of M is 1 to M, M is the number of modes included in each multi-mode remote sensing image sample pair in A, and the multi-modes comprise at least two of optics, SAR, hyperspectrum and near infrared; each multi-mode remote sensing image sample pair comprises M mode remote sensing images which are identicalThe remote sensing images of a scene have the same sequence of the remote sensing images corresponding to different modes in each multi-mode remote sensing image sample pair.
According to the invention, each multi-modal remote sensing image sample pair comprises the same number of remote sensing images, the remote sensing images within a pair correspond to pairwise different modalities, and the order of the modalities is the same in every multi-modal remote sensing image sample pair. As one embodiment, each multi-modal remote sensing image sample pair comprises an optical remote sensing image and a SAR remote sensing image: M = 2 and $a_n = (a_{n,1}, a_{n,2})$, where $a_{n,1}$ is the optical remote sensing image and $a_{n,2}$ is the SAR remote sensing image. As another embodiment, each multi-modal remote sensing image sample pair comprises an optical remote sensing image, a SAR remote sensing image and a hyperspectral remote sensing image: M = 3 and $a_n = (a_{n,1}, a_{n,2}, a_{n,3})$, where $a_{n,1}$ is the optical remote sensing image, $a_{n,2}$ is the SAR remote sensing image and $a_{n,3}$ is the hyperspectral remote sensing image.
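For illustration only, the following minimal Python sketch shows the structure of A for the two-modality embodiment; the image sizes, channel counts, random data and helper name are our own assumptions standing in for real co-registered imagery, not part of the patent:

```python
import numpy as np

# Assumed sizes: a 224x224 optical image (3 channels) and a co-registered
# 224x224 SAR image (1 channel) of the same scene form one sample pair a_n.
def make_sample_pair(rng: np.random.Generator):
    optical = rng.random((224, 224, 3), dtype=np.float32)  # a_{n,1}
    sar = rng.random((224, 224, 1), dtype=np.float32)      # a_{n,2}
    return (optical, sar)

rng = np.random.default_rng(0)
A = [make_sample_pair(rng) for _ in range(8)]  # N = 8 pairs, M = 2, fixed modality order
```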
S200, traversing A, and performing blocking and random mask processing on $a_{n,m}$ to obtain the H mask blocks and D non-mask blocks corresponding to $a_{n,m}$.
According to the invention, the set of mask blocks corresponding to $a_{n,m}$ is $YP = (yp_1, yp_2, \ldots, yp_H)$, where $yp_h$ is the h-th mask block corresponding to $a_{n,m}$, h ranges from 1 to H, and H is the number of mask blocks corresponding to $a_{n,m}$; the set of non-mask blocks corresponding to $a_{n,m}$ is $NYP = (nyp_1, nyp_2, \ldots, nyp_D)$, where $nyp_d$ is the d-th non-mask block corresponding to $a_{n,m}$, d ranges from 1 to D, and D is the number of non-mask blocks corresponding to $a_{n,m}$.
Optionally, performing blocking and random mask processing on $a_{n,m}$ comprises:
S210, dividing $a_{n,m}$ equally into Z×Z blocks, each of size $(M_0/Z) \times (N_0/Z) \times C_m$, where Z is a preset positive integer, $M_0$ is the length of $a_{n,m}$, $N_0$ is the width of $a_{n,m}$, and $C_m$ is the number of image channels of $a_{n,m}$.
S220, randomly masking the Z×Z blocks according to a preset ratio k to obtain H = round(k × Z × Z) mask blocks and D = Z × Z − H non-mask blocks, where round() denotes rounding.
Those skilled in the art will appreciate that any method of blocking and random masking in the prior art falls within the scope of the present invention.
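As a concrete illustration of S210–S220, the following Python sketch performs the blocking and random mask selection; the function name and the NumPy implementation are our own assumptions, not mandated by the method:

```python
import numpy as np

def block_and_mask(image: np.ndarray, Z: int, k: float, rng: np.random.Generator):
    """Split an (M0, N0, Cm) image into Z*Z equal blocks (S210) and randomly
    choose H = round(k*Z*Z) of them as mask blocks (S220)."""
    M0, N0, Cm = image.shape
    assert M0 % Z == 0 and N0 % Z == 0, "image must divide evenly into Z x Z blocks"
    bh, bw = M0 // Z, N0 // Z
    # blocks[p] is the p-th block in raster order, of size (M0/Z, N0/Z, Cm)
    blocks = (image.reshape(Z, bh, Z, bw, Cm)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(Z * Z, bh, bw, Cm))
    H = round(k * Z * Z)
    mask_idx = rng.choice(Z * Z, size=H, replace=False)      # H mask blocks
    nonmask_idx = np.setdiff1d(np.arange(Z * Z), mask_idx)   # D = Z*Z - H non-mask blocks
    return blocks, mask_idx, nonmask_idx
```

For example, a 224×224×3 image with Z = 14 and an assumed k = 0.75 yields H = round(0.75 × 196) = 147 mask blocks and D = 49 non-mask blocks, each of size 16×16×3.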
S300, jointly training M neural network models corresponding one-to-one to the M modalities, where each neural network model comprises a data coding model and a decoder, and A is the multi-modal remote sensing image sample pair set required for one joint training. The joint training process comprises: inputting the target embedded vector sequence $B = (f_{n,0}^m, f_{n,1}^m, f_{n,2}^m, \ldots, f_{n,H+D}^m)$ corresponding to $a_{n,m}$ into the data coding model corresponding to the m-th modality, where $f_{n,0}^m$ is the global embedded vector to be learned corresponding to $a_{n,m}$, $f_{n,i}^m$ is the local embedded vector of the i-th block of $a_{n,m}$, and i ranges from 1 to H+D; when the i-th block is a mask block, $f_{n,i}^m$ is a local embedded vector to be learned comprising the position information of the i-th block in $a_{n,m}$; when the i-th block is a non-mask block, $f_{n,i}^m$ is a local embedded vector comprising the pixel value information of the i-th block and the position information of the i-th block in $a_{n,m}$.
Optionally, the data coding model is a Transformer data coding model and the decoder is a linear layer; as one embodiment, the decoder is configured to predict the original pixel values corresponding to the mask blocks from the output of the data coding model. Those skilled in the art will appreciate that any configuration of data coding model and decoder in the prior art falls within the scope of the present invention.
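A minimal PyTorch skeleton of one modality's neural network model under this optional configuration might look as follows; the embedding width, depth and head count are assumed hyperparameters, and this is a sketch rather than the patented architecture itself:

```python
import torch
import torch.nn as nn

EMB = 256  # assumed preset embedding dimension

class ModalityModel(nn.Module):
    """One of the M models: a Transformer data coding model plus a
    linear-layer decoder that predicts original block pixel values from
    the encoder output (the embodiment described above)."""
    def __init__(self, block_pixels: int, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=EMB, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Linear(EMB, block_pixels)

    def forward(self, B: torch.Tensor):
        # B: (batch, 1 + H + D, EMB) target embedded vector sequence
        feats = self.encoder(B)
        v_global = feats[:, 0]               # global feature representation vector
        recon = self.decoder(feats[:, 1:])   # per-block pixel predictions
        return v_global, feats, recon
```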
According to the invention, when the i-th block is a non-mask block, the method for obtaining $f_{n,i}^m$ comprises the following steps:
S310, stretching the i-th block into a corresponding one-dimensional pixel vector according to the pixel value information corresponding to the i-th block.
According to the invention, the size of the i-th block is $(M_0/Z) \times (N_0/Z) \times C_m$, so the one-dimensional pixel vector obtained by stretching the i-th block has dimension $(M_0/Z) \times (N_0/Z) \times C_m$. For example, if the size of the i-th block is 5×5×1, the one-dimensional pixel vector has dimension 25: the 1st element of the one-dimensional pixel vector is the pixel value of the pixel at coordinate (1, 1) of the i-th block, the 2nd element is the pixel value at (1, 2), and so on, so that the 24th element is the pixel value at (5, 4) and the 25th element is the pixel value at (5, 5).
S320, inputting the one-dimensional pixel vector into a linear mapping layer to obtain the pixel embedding vector corresponding to the i-th block.
The linear mapping layer transforms the one-dimensional pixel vector into a pixel embedding vector of a preset dimension. Those skilled in the art will appreciate that any linear mapping layer in the prior art falls within the scope of the present invention.
S330, encoding the position information corresponding to the i-th block into a corresponding one-dimensional position vector whose dimension is the same as that of the one-dimensional pixel vector, and inputting the one-dimensional position vector into a linear mapping layer to obtain the position embedding vector corresponding to the i-th block.
According to the invention, the dimension of the one-dimensional position vector is $(M_0/Z) \times (N_0/Z) \times C_m$, and the position embedding vector obtained by passing the one-dimensional position vector through the linear mapping layer has the same dimension as the pixel embedding vector obtained by passing the one-dimensional pixel vector through the linear mapping layer, namely the preset dimension.
S340, adding the position embedding vector and the pixel embedding vector corresponding to the i-th block to obtain $f_{n,i}^m$.
According to the invention, the position embedding vector corresponding to the i-th block has the same dimension as the pixel embedding vector, so the two can be added; the $f_{n,i}^m$ obtained by the addition comprises both the pixel value information of the i-th block and the position information of the i-th block in $a_{n,m}$.
According to the invention, when the i-th block is a mask block, the i-th block is replaced by a mask vector to be learned whose dimension is set equal to the dimension of the pixel embedding vector; the position embedding vector corresponding to the i-th block is then added to the mask vector to obtain $f_{n,i}^m$, which is a local embedded vector to be learned comprising the position information of the i-th block in $a_{n,m}$.
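The following PyTorch sketch assembles the target embedded vector sequence B from S310–S340 and the mask-vector rule above; the one-hot position codes and dimensions are illustrative assumptions (the patent only requires the one-dimensional position vector to match the pixel vector's dimension):

```python
import torch
import torch.nn as nn

class BlockEmbedder(nn.Module):
    """Builds B = (f_{n,0}^m, f_{n,1}^m, ..., f_{n,H+D}^m)."""
    def __init__(self, block_pixels: int, num_blocks: int, emb: int = 256):
        super().__init__()
        self.pixel_proj = nn.Linear(block_pixels, emb)    # S320 linear mapping layer
        self.pos_proj = nn.Linear(block_pixels, emb)      # S330 linear mapping layer
        self.mask_vec = nn.Parameter(torch.zeros(emb))    # mask vector to be learned
        self.global_vec = nn.Parameter(torch.zeros(emb))  # f_{n,0}^m to be learned
        # Assumed position encoding: one one-hot code per block, with the
        # same dimension as the one-dimensional pixel vector.
        self.register_buffer("pos_codes", torch.eye(num_blocks, block_pixels))

    def forward(self, blocks: torch.Tensor, mask_idx: torch.Tensor):
        # blocks: (num_blocks, M0/Z, N0/Z, Cm) in raster order
        flat = blocks.flatten(1)                 # S310: one-dimensional pixel vectors
        pix_emb = self.pixel_proj(flat)          # S320: pixel embedding vectors
        pos_emb = self.pos_proj(self.pos_codes)  # S330: position embedding vectors
        is_masked = torch.zeros(len(flat), 1, dtype=torch.bool, device=flat.device)
        is_masked[mask_idx] = True
        # S340 for non-mask blocks; mask blocks use the learned mask vector instead.
        body = torch.where(is_masked, self.mask_vec + pos_emb, pix_emb + pos_emb)
        return torch.cat([self.global_vec[None, :], body], dim=0)
```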
According to the invention, the loss employed by the joint training is $L = L_1 + L_2$, where $L_1$ is the first-level loss and $L_2$ is the second-level loss. $L_1$ is negatively correlated with $\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})$ and with $\mathrm{sim}(v_n^m, \tilde{v}_n^m)$, and positively correlated with $\mathrm{sim}(v_n^m, v_q^m)$, where:
$v_n^m$ is the global feature representation vector obtained by inputting B into the data coding model corresponding to the m-th modality;
$\tilde{v}_n^m$ is the global feature representation vector obtained by inputting the embedded vectors, not subjected to random mask processing, of the reconstructed image of the m-th modality into the data coding model corresponding to the m-th modality, the reconstructed image of the m-th modality being the image obtained by inputting the output of the data coding model corresponding to the m-th modality into the decoder corresponding to the m-th modality;
$\bar{v}_n^{\neq m}$ is the average global feature representation vector of $a_n$ over the modalities other than the m-th modality, $\bar{v}_n^{\neq m} = \frac{1}{M-1} \sum_{j=1, j \neq m}^{M} v_n^j$, where $v_n^j$ is the global feature representation vector obtained by inputting the target embedded vector sequence of the j-th modality of $a_n$ into the data coding model corresponding to the j-th modality;
$v_q^m$ is the global feature representation vector of the m-th modality for the q-th multi-modal remote sensing image sample pair in A other than $a_n$, q ranging from 1 to N−1; sim() is the similarity and τ is a preset temperature.
$L_2$ is positively correlated with $\mathrm{KL}(P(v_n^m) \,\|\, P(v_n^g))$, where $P(v_n^m)$ is the feature probability distribution of $v_n^m$, $P(v_n^g)$ is the feature probability distribution of $v_n^g$, $v_n^g$ is the global feature representation vector obtained by inputting the target embedded vector sequence of the g-th modality of $a_n$ into the data coding model corresponding to the g-th modality, g = 1, 2, …, M, g ≠ m, and KL(·‖·) denotes the Kullback–Leibler divergence of the two feature probability distributions.
Optionally, the similarity is the cosine similarity, $\mathrm{sim}(u, w) = \frac{u \cdot w}{\lVert u \rVert \, \lVert w \rVert}$.
Those skilled in the art will appreciate that any similarity calculation method in the prior art falls within the scope of the present invention.
After $v_n^m$ and $v_n^g$ are obtained, the process of obtaining the feature probability distributions and computing the KL divergence is prior art and is not described here. Those skilled in the art will appreciate that any method of obtaining a feature probability distribution and any method of obtaining a KL divergence in the prior art fall within the scope of the present invention.
Optionally, the first-level loss $L_1$ satisfies the following relationship:
$L_1 = -\log \dfrac{\exp\!\left(\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})/\tau\right) + \exp\!\left(\mathrm{sim}(v_n^m, \tilde{v}_n^m)/\tau\right)}{\exp\!\left(\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})/\tau\right) + \exp\!\left(\mathrm{sim}(v_n^m, \tilde{v}_n^m)/\tau\right) + \sum_{v_q^m \in V_{\neg n}^m} \exp\!\left(\mathrm{sim}(v_n^m, v_q^m)/\tau\right)}$
where $V_{\neg n}^m$ is the set of global feature representation vectors of the m-th modality for the multi-modal remote sensing image sample pairs in A other than $a_n$.
Optionally, the second-level loss $L_2$ satisfies the following relationship:
$L_2 = \frac{1}{M-1} \sum_{g=1,\, g \neq m}^{M} \mathrm{KL}\!\left(P(v_n^m) \,\|\, P(v_n^g)\right)$
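Under the assumptions of cosine similarity and softmax feature probability distributions, the two losses could be sketched in PyTorch as follows; this is one formulation consistent with the stated correlations and the optional relationships above, not the only formulation the claims cover:

```python
import torch
import torch.nn.functional as F

def first_level_loss(v, v_bar, v_tilde, v_neg, tau=0.07):
    """L1: positives are the cross-modal average v_bar and the reconstruction
    feature v_tilde; negatives v_neg (shape (N-1, emb)) are the m-th-modality
    global features of the other sample pairs. tau is the preset temperature."""
    sim = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    pos = torch.exp(sim(v, v_bar) / tau) + torch.exp(sim(v, v_tilde) / tau)
    neg = torch.exp(sim(v.unsqueeze(0), v_neg) / tau).sum()
    return -torch.log(pos / (pos + neg))

def second_level_loss(v, v_others):
    """L2: mean KL divergence between the feature probability distribution of
    v (modality m) and those of the other modalities' global features."""
    p = F.softmax(v, dim=-1)  # P(v_n^m), assumed to be a softmax distribution
    kls = [F.kl_div(F.log_softmax(vg, dim=-1), p, reduction="sum")
           for vg in v_others]  # KL(P(v_n^m) || P(v_n^g)) for each g != m
    return torch.stack(kls).mean()

# Total loss for one (n, m):
# L = first_level_loss(v_m, v_bar, v_tilde, v_negs) + second_level_loss(v_m, v_others)
```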
According to the invention, after the training is finished, the data coding model of each neural network model can be used to perform feature representation of remote sensing images of the corresponding modality.
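As a hypothetical usage example reusing the sketches above (block_and_mask, BlockEmbedder and ModalityModel are our assumed helpers, not names from the patent; embedder and model stand for trained instances of them), extracting the representation of a new optical image could look like:

```python
import torch

# No masking at inference time: k = 0, so every block is a non-mask block.
blocks, mask_idx, _ = block_and_mask(optical_image, Z=14, k=0.0, rng=rng)
B = embedder(torch.from_numpy(blocks), torch.from_numpy(mask_idx))
v_global, feats, _ = model(B.unsqueeze(0))  # v_global: the feature representation
```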
The method does not require labeling the remote sensing images in the multi-modal remote sensing image sample pairs; it is a self-supervised learning method, and it avoids the time-consuming, labor-intensive and costly manual labeling of remote sensing images required in the prior art.
The invention discloses a self-supervised remote sensing representation learning method based on multi-level cross-modal interaction that uses a set of multi-modal remote sensing image sample pairs to jointly train M neural network models, where each multi-modal remote sensing image sample pair comprises M remote sensing images of different modalities of the same scene and the data coding model of each neural network model takes remote sensing images of one modality as input. Based on this specific set of multi-modal remote sensing image sample pairs and the specific loss adopted during joint training, the data coding model in each neural network model can learn information from remote sensing images of different modalities, which improves its ability to learn inter-modality information and thus its feature representation capability for remote sensing images of different modalities.
While certain specific embodiments of the invention have been described in detail by way of example, it will be appreciated by those skilled in the art that the above examples are for illustration only and are not intended to limit the scope of the invention. Those skilled in the art will also appreciate that many modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (8)

1. A self-supervised remote sensing representation learning method based on multi-level cross-modal interaction, characterized by comprising the following steps:
S100, acquiring a multi-modal remote sensing image sample pair set $A = \{a_1, a_2, \ldots, a_N\}$, wherein $a_n$ is the n-th multi-modal remote sensing image sample pair, n ranges from 1 to N, and N is the number of multi-modal remote sensing image sample pairs in A; $a_n = (a_{n,1}, a_{n,2}, \ldots, a_{n,M})$, wherein $a_{n,m}$ is the remote sensing image of the m-th modality in $a_n$, m ranges from 1 to M, M is the number of modalities in each multi-modal remote sensing image sample pair in A, and the modalities comprise at least two of optical, SAR, hyperspectral and near-infrared; the M remote sensing images in each multi-modal remote sensing image sample pair are remote sensing images of the same scene, and the order of the modalities is the same in every multi-modal remote sensing image sample pair;
S200, traversing A, and performing blocking and random mask processing on $a_{n,m}$ to obtain the H mask blocks and D non-mask blocks corresponding to $a_{n,m}$;
S300, jointly training M neural network models corresponding one-to-one to the M modalities, wherein each neural network model comprises a data coding model and a decoder, and A is the multi-modal remote sensing image sample pair set required for one joint training; the joint training process comprises: inputting the target embedded vector sequence $B = (f_{n,0}^m, f_{n,1}^m, f_{n,2}^m, \ldots, f_{n,H+D}^m)$ corresponding to $a_{n,m}$ into the data coding model corresponding to the m-th modality, wherein $f_{n,0}^m$ is the global embedded vector to be learned corresponding to $a_{n,m}$, $f_{n,i}^m$ is the local embedded vector of the i-th block of $a_{n,m}$, and i ranges from 1 to H+D; when the i-th block is a mask block, $f_{n,i}^m$ is a local embedded vector to be learned comprising the position information of the i-th block in $a_{n,m}$; when the i-th block is a non-mask block, $f_{n,i}^m$ is a local embedded vector comprising the pixel value information of the i-th block and the position information of the i-th block in $a_{n,m}$;
the loss employed by the joint training is $L = L_1 + L_2$, wherein $L_1$ is the first-level loss and $L_2$ is the second-level loss; $L_1$ is negatively correlated with $\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})$ and with $\mathrm{sim}(v_n^m, \tilde{v}_n^m)$, and positively correlated with $\mathrm{sim}(v_n^m, v_q^m)$, wherein $v_n^m$ is the global feature representation vector obtained by inputting B into the data coding model corresponding to the m-th modality; $\tilde{v}_n^m$ is the global feature representation vector obtained by inputting the embedded vectors, not subjected to random mask processing, of the reconstructed image of the m-th modality into the data coding model corresponding to the m-th modality, the reconstructed image of the m-th modality being the image obtained by inputting the output of the data coding model corresponding to the m-th modality into the decoder corresponding to the m-th modality; $\bar{v}_n^{\neq m}$ is the average global feature representation vector of $a_n$ over the modalities other than the m-th modality, $\bar{v}_n^{\neq m} = \frac{1}{M-1} \sum_{j=1, j \neq m}^{M} v_n^j$, wherein $v_n^j$ is the global feature representation vector obtained by inputting the target embedded vector sequence of the j-th modality of $a_n$ into the data coding model corresponding to the j-th modality; $v_q^m$ is the global feature representation vector of the m-th modality for the q-th multi-modal remote sensing image sample pair in A other than $a_n$, q ranging from 1 to N−1; sim() is the similarity and τ is a preset temperature; $L_2$ is positively correlated with $\mathrm{KL}(P(v_n^m) \,\|\, P(v_n^g))$, wherein $P(v_n^m)$ is the feature probability distribution of $v_n^m$, $P(v_n^g)$ is the feature probability distribution of $v_n^g$, $v_n^g$ is the global feature representation vector obtained by inputting the target embedded vector sequence of the g-th modality of $a_n$ into the data coding model corresponding to the g-th modality, g = 1, 2, …, M, g ≠ m, and KL(·‖·) is the KL divergence of the two feature probability distributions.
2. The self-supervised remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, characterized in that, in S300, the first-level loss satisfies
$L_1 = -\log \dfrac{\exp\!\left(\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})/\tau\right) + \exp\!\left(\mathrm{sim}(v_n^m, \tilde{v}_n^m)/\tau\right)}{\exp\!\left(\mathrm{sim}(v_n^m, \bar{v}_n^{\neq m})/\tau\right) + \exp\!\left(\mathrm{sim}(v_n^m, \tilde{v}_n^m)/\tau\right) + \sum_{v_q^m \in V_{\neg n}^m} \exp\!\left(\mathrm{sim}(v_n^m, v_q^m)/\tau\right)}$
wherein $V_{\neg n}^m$ is the set of global feature representation vectors of the m-th modality for the multi-modal remote sensing image sample pairs in A other than $a_n$.
3. The self-supervised remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, characterized in that, in S300, the similarity is the cosine similarity, $\mathrm{sim}(u, w) = \frac{u \cdot w}{\lVert u \rVert \, \lVert w \rVert}$.
4. the self-monitoring remote sensing representation learning method based on multi-level cross-modal interaction according to claim 1, wherein in S200, for a n,m Performing blocking and random masking processes, including:
s210, will a n,m Equally divided into Z x Z blocks, each block having a size (M 0 /Z)×(N 0 /Z)×C m Z is a preset positive integer, M 0 Is a as n,m Length of N 0 Is a as n,m Is of width C m Is a as n,m A corresponding number of image channels;
s220, performing random masking on the z×z blocks according to a preset ratio k to obtain h=round (k×z×z) mask blocks and d=z×z-H non-mask blocks, where round () is a rounding.
5. The self-supervised remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, characterized in that, in S300, when the i-th block is a non-mask block, the method for obtaining $f_{n,i}^m$ comprises:
S310, stretching the i-th block into a corresponding one-dimensional pixel vector according to the pixel value information corresponding to the i-th block;
S320, inputting the one-dimensional pixel vector into a linear mapping layer to obtain the pixel embedding vector corresponding to the i-th block;
S330, encoding the position information corresponding to the i-th block into a corresponding one-dimensional position vector whose dimension is the same as that of the one-dimensional pixel vector, and inputting the one-dimensional position vector into a linear mapping layer to obtain the position embedding vector corresponding to the i-th block;
S340, adding the position embedding vector and the pixel embedding vector corresponding to the i-th block to obtain $f_{n,i}^m$.
6. The self-supervised remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, wherein the data coding model is a Transformer data coding model.
7. The method of claim 1, wherein the decoder is a linear layer.
8. The self-supervised remote sensing representation learning method based on multi-level cross-modal interaction of claim 1, characterized in that, in S300, the second-level loss satisfies
$L_2 = \frac{1}{M-1} \sum_{g=1,\, g \neq m}^{M} \mathrm{KL}\!\left(P(v_n^m) \,\|\, P(v_n^g)\right)$
CN202211635290.8A 2022-12-19 2022-12-19 Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction Active CN116012679B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211635290.8A CN116012679B (en) 2022-12-19 2022-12-19 Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211635290.8A CN116012679B (en) 2022-12-19 2022-12-19 Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction

Publications (2)

Publication Number Publication Date
CN116012679A (en) 2023-04-25
CN116012679B (en) 2023-06-16

Family

ID=86036542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211635290.8A Active CN116012679B (en) 2022-12-19 2022-12-19 Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction

Country Status (1)

Country Link
CN (1) CN116012679B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A cross-modal association learning method based on multi-granularity hierarchical networks
CN115223057A (en) * 2022-08-02 2022-10-21 大连理工大学 Target detection unified model for multimodal remote sensing image joint learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113240056B (en) * 2021-07-12 2022-05-17 北京百度网讯科技有限公司 Multi-mode data joint learning model training method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A cross-modal association learning method based on multi-granularity hierarchical networks
CN115223057A (en) * 2022-08-02 2022-10-21 大连理工大学 Target detection unified model for multimodal remote sensing image joint learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of cross-modal retrieval models and feature extraction based on representation learning (基于表示学习的跨模态检索模型与特征抽取研究综述); 李志义; 黄子风; 许晓绵; 情报学报 (Journal of the China Society for Scientific and Technical Information), No. 04, pp. 86-99 *

Also Published As

Publication number Publication date
CN116012679A (en) 2023-04-25

Similar Documents

Publication Publication Date Title
CN108399406B (en) Method and system for detecting weakly supervised salient object based on deep learning
CN112634296B (en) RGB-D image semantic segmentation method and terminal for gate mechanism guided edge information distillation
CN113313164B (en) Digital pathological image classification method and system based on super-pixel segmentation and graph convolution
CN112801047B (en) Defect detection method and device, electronic equipment and readable storage medium
CN113269224A (en) Scene image classification method, system and storage medium
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN113988147A (en) Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device
CN117974693B (en) Image segmentation method, device, computer equipment and storage medium
CN115578589A (en) Unsupervised echocardiography section identification method
CN115131558A (en) Semantic segmentation method under less-sample environment
CN117693754A (en) Training masked automatic encoders for image restoration
Zhang et al. A joint convolution auto-encoder network for infrared and visible image fusion
Aliouat et al. EVBS-CAT: enhanced video background subtraction with a controlled adaptive threshold for constrained wireless video surveillance
CN116776014B (en) Multi-source track data representation method and device
CN116012679B Self-supervised remote sensing representation learning method based on multi-level cross-modal interaction
CN117011650A (en) Method and related device for determining image encoder
Li et al. Automated Tire visual inspection based on low rank matrix recovery
CN107273793A (en) A kind of feature extracting method for recognition of face
CN115148303B (en) Microorganism-drug association prediction method based on normalized graph neural network
CN115861713A (en) Carotid plaque ultrasonic image processing method based on multitask learning
Duan et al. A study on the generalized normalization transformation activation function in deep learning based image compression
Kebir et al. End-to-end deep auto-encoder for segmenting a moving object with limited training data
Das et al. Image splicing detection using feature based machine learning methods and deep learning mechanisms
Ye et al. GFSCompNet: remote sensing image compression network based on global feature-assisted segmentation
CN117853739B (en) Remote sensing image feature extraction model pre-training method and device based on feature transformation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant