CN115861810A

CN115861810A - Remote sensing image change detection method and system based on multi-head attention and self-supervision learning

Info

Publication number: CN115861810A
Application number: CN202211553468.4A
Authority: CN
Inventors: 樊庆宇; 李文成; 王军鹏; 陈岩; 简铮; 卢隆
Original assignee: Tianyi Cloud Technology Co Ltd
Current assignee: Tianyi Cloud Technology Co Ltd
Priority date: 2022-12-06
Filing date: 2022-12-06
Publication date: 2023-03-28

Abstract

The invention provides a remote sensing image change detection method and system based on multi-head attention and self-supervised learning. First, pre-weight knowledge (pre-trained weights) is learned from unlabeled remote sensing images using a Transformer-based self-supervised learning technique combined with multi-head attention. Second, in the supervised learning stage, the styles of the bi-temporal images are unified using the fast Fourier transform. Finally, the pre-weight knowledge from self-supervised learning is used for supervised learning on the bi-temporal images to predict a change mask image. This addresses problems such as insufficient detection accuracy caused by ignoring the large distribution difference between remote sensing images and the ImageNet data set.

Description

Remote sensing image change detection method and system based on multi-head attention and self-supervision learning

Technical Field

The invention belongs to the field of image processing in artificial intelligence, and particularly relates to a remote sensing image change detection method and system based on multi-head attention and self-supervision learning.

Background

Research on land use and land cover change has become a hot topic. In recent years, more and more practitioners have used remote sensing satellites or unmanned aerial vehicles to photograph the same region at different times in order to detect changes in its geographic appearance. The approach is mainly applied to monitoring urban expansion, monitoring deforestation, damage assessment, and the like.

Change detection in the remote sensing field is gradually moving from traditional pixel-comparison methods to deep learning methods. Remote sensing images usually contain complex backgrounds and foregrounds, which traditional pixel comparison or simple shallow feature learning cannot handle. Methods that extract high-level semantic features with deep learning are therefore being widely applied to change detection.

The main problems of remote sensing change detection include:

(1) The pre-training knowledge in current use is obtained by training on the ImageNet data set, but remote sensing images differ greatly in distribution from ImageNet data, and this difference directly affects the accuracy of remote sensing image change detection.

(2) Remote sensing images taken at the same location at different times are affected by natural factors such as weather and illumination, so the captured images differ greatly, and this difference directly affects the accuracy of change detection.

(3) Changed objects of the same semantic concept may show different image characteristics at different times and spatial locations, which makes feature encoding harder for deep learning; ordinary convolution operations have difficulty capturing these characteristics completely.

Most existing convolutional neural network change detection methods rely on pre-training knowledge from ImageNet data and conventional convolutional attention. Although such methods can detect some changes, they have the following problems:

1. Images captured at different times are not corrected for color style, so differences due to image noise exceed differences due to actual change, which reduces detection accuracy.

2. The large distribution difference between remote sensing images and the ImageNet data set is not considered; ImageNet pre-weights are used in model training, which directly affects detection accuracy.

3. Backgrounds and foregrounds in remote sensing images are complex, and convolutional feature extraction alone has difficulty capturing long-range dependencies between different objects.

The invention provides a remote sensing image change detection method that combines a multi-head attention mechanism with self-supervised learning. First, pre-weight knowledge (Wp) is learned from unannotated remote sensing images using a self-supervised learning technique based on a Transformer and Multi-Head Self Attention. Second, the styles of the bi-temporal images are unified using the fast Fourier transform. Finally, the pre-weight knowledge from self-supervised learning is used for supervised learning on the bi-temporal images, and a change mask image is predicted, so as to solve the above problems.

Disclosure of Invention

The present invention has been made in view of the above problems.

According to one aspect of the invention, a remote sensing image change detection method based on multi-head attention and self-supervision learning is provided, and the method comprises the following steps:

step one: build an unlabeled remote sensing data set RSCD at the scale of millions of images, and extract the HOG features of each image;

step two: in the first stage, self-supervised learning, perform feature encoding and multilayer perceptron decoding on the RSCD images and their HOG features, and save the learned pre-weight knowledge Wp;

step three: in the second stage, supervised learning, input the bi-temporal images T1 and T2 to be trained into the FFT image color style correction module, and output the color-corrected training images Gr;

step four: input Gr into a feature encoder based on the Transformer structure, and output the paired features f1 and f2;

step five: input the paired features f1 and f2 into a feature decoder, and predict the output change mask map;

step six: post-process the mask map: set a threshold score thresh, mark all mask-map values greater than thresh as changed and all values less than thresh as unchanged, and output the final change map.
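Step six can be illustrated with a minimal NumPy sketch. The function name `binarize_change_mask` and the default threshold of 0.5 are assumptions made for illustration; the source does not give a value for thresh, nor does it say how a score exactly equal to thresh is treated (here it is counted as unchanged).

```python
import numpy as np

def binarize_change_mask(score_map: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Mark values above `thresh` as changed (1) and all others as unchanged (0).

    `thresh=0.5` is an illustrative default, not a value taken from the source.
    """
    return (score_map > thresh).astype(np.uint8)

# Toy 2x2 score map standing in for the decoder's predicted change mask map.
scores = np.array([[0.1, 0.7],
                   [0.5, 0.9]])
change_map = binarize_change_mask(scores, thresh=0.5)
```

In practice thresh would presumably be tuned on validation data, since it trades missed changes against false alarms.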

The invention also provides a remote sensing image change detection system based on multi-head attention and self-supervision learning, which comprises:

a construction module: build an unlabeled remote sensing data set RSCD at the scale of millions of images, and extract the HOG features of each image;

a learning module: in the first stage, self-supervised learning, perform feature encoding and multilayer perceptron decoding on the RSCD images and their HOG features, and save the learned pre-weight knowledge Wp;

a correction module: in the second stage, supervised learning, input the bi-temporal images T1 and T2 to be trained into the FFT image color style correction module, and output the color-corrected training images Gr;

an encoding module: input Gr into a feature encoder based on the Transformer structure, and output the paired features f1 and f2;

a prediction module: input the paired features f1 and f2 into a feature decoder, and predict the output change mask map;

an imaging module: post-process the mask map: set a threshold score thresh, mark all mask-map values greater than thresh as changed and all values less than thresh as unchanged, and output the final change map.

Compared with the prior art, the invention has the following beneficial effects:

1. A self-supervised learning method based on the Transformer structure is provided, which takes the pixel values and HOG values of remote sensing images as learning targets. Without labeled data, self-supervised learning can yield pre-weights that outperform those obtained by supervised learning, which solves the data distribution gap in transferring pre-weight knowledge and improves the accuracy of the change detection algorithm;

2. FFT image color style correction effectively amplifies differences in changed regions and reduces differences in unchanged regions, which improves the detection accuracy for weak changes;

3. A multi-head attention mechanism based on the Transformer structure is used. It captures dependencies between objects well, so that targets with specific relationships are attended to more easily in the feature extraction stage, which directly improves detection accuracy.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.

FIG. 1 shows a schematic block diagram of the overall flow of a change detection architecture according to one embodiment of the present invention.

FIG. 2 shows a schematic block diagram of the first-stage flow for learning pre-weight knowledge by self-supervised learning, according to one embodiment of the present invention.

FIG. 3 shows a schematic block diagram of a second stage supervised learning change detection procedure in accordance with one embodiment of the present invention.

FIG. 4 is a diagram illustrating second stage FFT image color style correction according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described in the present application without inventive step, shall fall within the scope of protection of the present invention.

The first embodiment is as follows:

To solve the above problems, a remote sensing image change detection method based on multi-head attention and self-supervised learning is provided, as shown in FIG. 1. In the first stage, a large number of unlabeled remote sensing images are collected as the data set RSCD, and the histogram of oriented gradients (HOG) features of each image are extracted. The three-channel remote sensing image and the HOG feature map are combined into four-channel data, of which 75% is randomly masked. The masked four-channel image is fed to a feature encoder and a multilayer perceptron decoder, which output a predicted four-channel image. Finally, back-propagation learning is performed with an absolute-value loss function (L1 norm). The whole process can be regarded as learning the pre-weight knowledge (Wp).

In the second stage, the bi-temporal data and change labels are input, the images are corrected with FFT image color style correction, learning is performed with the feature encoder from the first stage and the feature decoder of the second stage, and the detected change mask image is output.

Specifically, according to an embodiment of the present invention, a method for detecting a change in a remote sensing image based on multi-head attention and self-supervised learning is provided, where the method includes:

step six: post-process the mask map: set a threshold score thresh, mark all mask-map values greater than thresh as changed and all values less than thresh as unchanged, and output the final change map.

Specifically, the first-stage self-supervised learning comprises modules such as unlabeled four-channel remote sensing image input, a feature encoder, and a multilayer perceptron decoder, as shown in FIG. 2. The specific steps are as follows:

(1) Extract the HOG features of each image in the remote sensing data set (RSCD), and fuse them channel-wise with the three-channel remote sensing image to obtain the four-channel input Iself;

(2) Apply random masking to Iself, where the masked pixel blocks are Imasked;

(3) Perform feature encoding on Imasked using a Transformer with a local multi-head attention structure to obtain the pre-weight knowledge Wp;

(4) Predict the four-channel output Pself of the masked pixel blocks using a multilayer perceptron feature decoder;

(5) Compute the loss between Pself and Imasked using the absolute-value loss function, and learn by back-propagation;

(6) Finally, save the learned pre-weight knowledge Wp as pre-training knowledge.
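The data preparation and loss in steps (1), (2) and (5) can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the patented implementation: the per-pixel gradient magnitude stands in for a real HOG feature map, the patch size of 16 is hypothetical, and the encoder and decoder are replaced by an all-zero placeholder prediction. Only the 75% mask ratio and the L1 loss come from the text.

```python
import numpy as np

def gradient_channel(rgb: np.ndarray) -> np.ndarray:
    """Simplified stand-in for a HOG feature map (assumption): normalized
    gradient magnitude of the grayscale image, shape (H, W)."""
    gray = rgb.mean(axis=2)
    gy, gx = np.gradient(gray)            # derivatives along rows, then columns
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)

def make_four_channel(rgb: np.ndarray) -> np.ndarray:
    """Channel-fuse the RGB image with the extra feature map: (H, W, 3) -> (H, W, 4)."""
    return np.concatenate([rgb, gradient_channel(rgb)[..., None]], axis=2)

def random_patch_mask(h, w, patch=16, ratio=0.75, rng=None) -> np.ndarray:
    """Boolean (H, W) mask that is True on a random `ratio` of patch blocks.
    The patch size is a hypothetical choice; h and w must be multiples of it."""
    rng = np.random.default_rng() if rng is None else rng
    gh, gw = h // patch, w // patch
    flat = np.zeros(gh * gw, dtype=bool)
    n_masked = int(round(ratio * gh * gw))
    flat[rng.choice(gh * gw, size=n_masked, replace=False)] = True
    grid = flat.reshape(gh, gw)
    return grid.repeat(patch, axis=0).repeat(patch, axis=1)

def masked_l1_loss(pred, target, mask) -> float:
    """Mean absolute error computed over the masked pixels only."""
    return float(np.abs(pred - target)[mask].mean())

rng = np.random.default_rng(0)
rgb = rng.random((64, 64, 3))             # toy image in place of an RSCD sample
i_self = make_four_channel(rgb)           # four-channel input Iself
mask = random_patch_mask(64, 64, patch=16, ratio=0.75, rng=rng)
p_self = np.zeros_like(i_self)            # placeholder for the decoder output Pself
loss = masked_l1_loss(p_self, i_self, mask)
```

In a real training loop the placeholder prediction would come from the Transformer encoder and multilayer perceptron decoder, and the loss would be back-propagated through both.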

Meanwhile, the change detection algorithm of the second-stage supervised learning comprises modules such as an FFT image color style correction module, a serialized feature encoder module, and a serialized feature decoder module (see FIG. 1). The overall change detection flow is shown in FIG. 3, and the specific process is as follows:

(1) Input the bi-temporal remote sensing images T1 and T2 and scale them to the specified size HxWx3.

(2) Perform color distribution correction on the T1 and T2 images using FFT image color style correction.

(3) Initialize part of the learnable parameters in the serialized feature encoder module using the pre-weight knowledge Wp from self-supervised learning.

(4) Perform feature encoding on the T1 and T2 remote sensing images using the feature encoder based on the multi-head attention Transformer (see the feature encoder module in FIG. 1) to obtain features f1 and f2 respectively, where the dimensions of f1 and f2 are (H/32, W/32, 8x4), and H and W are the width and height of the T1 image.

(5) Perform semantic serialization on the features f1 and f2, and combine the serialized features to obtain ftoken_unit.

(6) Perform Transformer encoding on the combined serialized feature ftoken_unit to obtain ftoken_transformer.

(7) Perform sequence separation on ftoken_transformer to obtain f1token and f2token.

(8) Perform Transformer decoding on f1token and f2token respectively to obtain f1token-transformer and f2token-transformer.

(9) Input the image feature sequences of T1 and T2 together with f1token-transformer and f2token-transformer respectively, and obtain the feature maps f1decoder and f2decoder after feature transformation, where the dimensions of f1decoder and f2decoder are (H/32, W/32, 8x4).

(10) Take the difference of f1decoder and f2decoder, then apply a deconvolution transformation to obtain the change mask image.
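The token flow in steps (5) to (7) can be traced with a shape-level NumPy sketch. Several things here are assumptions: the source does not define "semantic serialization", so a spatial-attention tokenizer with a hypothetical token count L = 4 is used; "8x4" is read as 32 channels; and a weight-free single-head self-attention stands in for the Transformer encoder, which in the described method would be multi-head and learned.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize(feat: np.ndarray, w_attn: np.ndarray) -> np.ndarray:
    """Assumed semantic tokenizer: (h, w, c) feature map -> (L, c) tokens.
    Each token is an attention-weighted average over all spatial positions."""
    h, w, c = feat.shape
    x = feat.reshape(h * w, c)
    attn = softmax(x @ w_attn, axis=0)        # (h*w, L), normalized over positions
    return attn.T @ x                         # (L, c)

def self_attention(tokens: np.ndarray) -> np.ndarray:
    """Weight-free single-head self-attention, a stand-in for the encoder."""
    c = tokens.shape[1]
    a = softmax(tokens @ tokens.T / np.sqrt(c), axis=1)
    return a @ tokens

rng = np.random.default_rng(0)
H, W, C, L = 256, 256, 32, 4                  # C = 8x4 read as 32; L is hypothetical
f1 = rng.standard_normal((H // 32, W // 32, C))
f2 = rng.standard_normal((H // 32, W // 32, C))
w_attn = rng.standard_normal((C, L))          # random stand-in for learned weights

# (5) serialize both features and combine the token sequences
ftoken_unit = np.concatenate([tokenize(f1, w_attn), tokenize(f2, w_attn)], axis=0)
# (6) encode the joint sequence so tokens from both dates can attend to each other
ftoken_transformer = self_attention(ftoken_unit)
# (7) separate the sequence back into per-date tokens
f1token, f2token = np.split(ftoken_transformer, 2, axis=0)
```

Encoding the two token sets jointly is what lets context from one date inform the representation of the other before the sequences are separated again.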

Specifically, the FFT image color style correction module in the change detection algorithm is shown in fig. 4, and the specific steps are as follows:

(1) Downscale the earlier-phase image T1 and the later-phase image T2 by a factor of four.

(2) Perform RGB channel decomposition on the scaled images T1_ds and T2_ds.

(3) Perform a fast Fourier transform on each channel of the images:

FTi(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} Ti_ds(x, y) \, e^{-j 2\pi (ux/M + vy/N)}, \quad i = 1, 2

where M and N are the width and height of the T1_ds image, and FT1 and FT2 denote the frequency spectra of the T1_ds and T2_ds images, respectively.

(4) Replace the low-frequency components of T2_ds with the low-frequency components of T1_ds:

FT2(location(FT1-A)) = FT1-A

where A denotes the high-frequency part of FT1, and location(FT1-A) denotes the location region corresponding to the low-frequency part in the FT1 spectrogram.

(5) Perform an inverse Fourier transform on the replaced spectrum FT2 of each channel.

(6) Fuse the inverse Fourier transform results of the channels into a three-channel RGB image and upsample it by a factor of four.
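Steps (2) to (5) above can be sketched with NumPy's FFT routines. This is a minimal sketch under assumptions: the source does not say how large the low-frequency region is, so the hypothetical parameter `beta` sets a centered window as a fraction of the image size, and the fourfold downscaling and upsampling of steps (1) and (6) are omitted for brevity.

```python
import numpy as np

def fft_style_correct(t1: np.ndarray, t2: np.ndarray, beta: float = 0.1) -> np.ndarray:
    """Replace the low-frequency spectrum of each t2 channel with that of t1.

    t1, t2: float arrays of identical shape (H, W, 3).
    beta:   hypothetical half-width of the low-frequency window, as a
            fraction of the image size (not specified in the source).
    """
    h, w, _ = t1.shape
    bh, bw = max(1, int(h * beta)), max(1, int(w * beta))
    ch, cw = h // 2, w // 2                      # DC position after fftshift
    out = np.empty(t2.shape, dtype=np.float64)
    for c in range(3):                           # RGB channel decomposition
        ft1 = np.fft.fftshift(np.fft.fft2(t1[..., c]))
        ft2 = np.fft.fftshift(np.fft.fft2(t2[..., c]))
        # FT2(location(FT1 - A)) = FT1 - A: copy the centered low-frequency block
        ft2[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1] = \
            ft1[ch - bh:ch + bh + 1, cw - bw:cw + bw + 1]
        # The window is symmetric about DC, so the result stays (numerically) real
        out[..., c] = np.real(np.fft.ifft2(np.fft.ifftshift(ft2)))
    return out

rng = np.random.default_rng(0)
t1_ds = rng.random((16, 16, 3))                  # toy stand-ins for T1_ds and T2_ds
t2_ds = rng.random((16, 16, 3)) * 0.5 + 0.3      # same scene size, different "style"
t2_corrected = fft_style_correct(t1_ds, t2_ds, beta=0.1)
```

Because the zero-frequency (DC) term lies inside the replaced window, each channel of the corrected image takes on the mean intensity of the corresponding T1 channel, while the untouched high frequencies keep the fine structure of T2.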

The invention uses self-supervised learning to address the data distribution gap in knowledge transfer learning and improve the accuracy of the remote sensing change detection algorithm. Combining HOG features with pixel values as the target of self-supervised learning effectively mitigates the over-fitting of self-supervised learning to noise in image pixel values, so that better pre-weight knowledge is learned. Applying the FFT to color style correction of remote sensing images effectively improves the detection accuracy for weak changes. A Transformer structure based on a multi-head attention mechanism is used; it captures dependencies between objects well, so that targets with specific relationships are attended to more easily in the feature extraction stage, which directly improves detection accuracy.

The second embodiment:

an encoding module: input Gr into a feature encoder based on the Transformer structure, and output the paired features f1 and f2;

an imaging module: post-process the mask map: set a threshold score thresh, mark all mask-map values greater than thresh as changed and all values less than thresh as unchanged, and output the final change map.

Specifically,

the first-stage self-supervised learning comprises modules such as unlabeled four-channel remote sensing image input, a feature encoder, and a multilayer perceptron decoder, as shown in FIG. 2. The specific steps are as follows:

(6) And finally, saving the learned pre-weight knowledge Wp as pre-training knowledge.

The change detection algorithm of the second-stage supervised learning comprises modules such as FFT image color style correction, a serialized feature encoder, and a serialized feature decoder. The overall change detection flow is shown in FIG. 3, and the specific process is as follows:

(2) Perform color distribution correction on the T1 and T2 images using FFT image color style correction.

(3) Initialize part of the learnable parameters in the serialized feature encoder module using the pre-weight knowledge Wp from self-supervised learning.

(5) Perform semantic serialization on the features f1 and f2, and combine the serialized features to obtain ftoken_unit.

(8) Perform Transformer decoding on f1token and f2token respectively to obtain f1token-transformer and f2token-transformer.

(10) Take the difference of f1decoder and f2decoder, then apply a deconvolution transformation to finally obtain the change mask image.

(2) Perform RGB channel decomposition on the scaled images T1_ds and T2_ds.

(3) Perform a fast Fourier transform on each channel of the images

where M and N are the width and height of the T1_ds image, and FT1 and FT2 denote the frequency spectra of the T1_ds and T2_ds images, respectively.

FT2(location(FT1-A)) = FT1-A

where A denotes the high-frequency part of FT1, and location(FT1-A) denotes the location region corresponding to the low-frequency part in the FT1 spectrogram.

(5) Perform an inverse Fourier transform on the replaced spectrum FT2 of each channel.

(6) Fuse the inverse Fourier transform results of the channels into a three-channel RGB image and upsample it by a factor of four.

Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention. It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means can be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

The above description is only for the specific embodiment of the present invention or the description thereof, and the protection scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and the changes or substitutions should be covered within the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A remote sensing image change detection method based on multi-head attention and self-supervision learning comprises the following steps:

step one: building an unlabeled remote sensing data set RSCD at the scale of millions of images, and extracting the HOG features of each image;

2. The remote sensing image change detection method based on multi-head attention and self-supervised learning as claimed in claim 1, wherein the first-stage self-supervised learning comprises unlabeled four-channel remote sensing image input, a feature encoder, and a multilayer perceptron decoder, and the specific steps are as follows:

(1) Extracting the HOG features of each image in the remote sensing data set RSCD, and fusing them channel-wise with the three-channel remote sensing image to obtain the four-channel input Iself;

(3) Performing feature encoding on Imasked using a Transformer with a local multi-head attention structure to obtain the feature Transformer encoder weights Wp;

3. The remote sensing image change detection method based on multi-head attention and self-supervised learning as claimed in claim 2, wherein the change detection algorithm of the second-stage supervised learning comprises FFT image color style correction, a serialized feature encoder, and a serialized feature decoder, and the specific steps are as follows:

(1) Inputting the bi-temporal remote sensing images T1 and T2, and scaling them to the specified size HxWx3;

(2) Performing color distribution correction on the T1 and T2 images using FFT image color style correction;

(3) Initializing part of the learnable parameters in the serialized feature encoder module using the pre-weight knowledge Wp from self-supervised learning;

(4) Performing feature encoding on the T1 and T2 remote sensing images using a feature encoder based on a multi-head attention Transformer to obtain features f1 and f2 respectively, wherein the dimensions of f1 and f2 are (H/32, W/32, 8x4), and H and W are the width and height of the T1 image;

(5) Performing semantic serialization on the features f1 and f2, and combining the serialized features to obtain ftoken_unit;

(6) Performing Transformer encoding on the combined serialized feature ftoken_unit to obtain ftoken_transformer;

(7) Performing sequence separation on ftoken_transformer to obtain f1token and f2token;

(8) Performing Transformer decoding on f1token and f2token respectively to obtain f1token-transformer and f2token-transformer;

(9) Inputting the image feature sequences of T1 and T2 together with f1token-transformer and f2token-transformer respectively, and obtaining the feature maps f1decoder and f2decoder after feature transformation, wherein the dimensions of f1decoder and f2decoder are (H/32, W/32, 8x4);

4. The remote sensing image change detection method based on multi-head attention and self-supervision learning as claimed in claim 3, wherein an FFT image color style correction algorithm in the change detection algorithm specifically comprises the following steps:

(1) Downscaling the earlier-phase image T1 and the later-phase image T2 by a factor of four;

(2) Performing RGB channel decomposition on the scaled images T1_ds and T2_ds;

(3) Performing a fast Fourier transform on each channel of the images

5. The remote sensing image change detection method based on multi-head attention and self-supervised learning as recited in claim 3, wherein the FFT image color style correction algorithm in the change detection algorithm further comprises:

(4) Replacing the low-frequency components of T2_ds with the low-frequency components of T1_ds:

FT2(location(FT1-A)) = FT1-A

wherein A denotes the high-frequency part of FT1, and location(FT1-A) denotes the location region corresponding to the low-frequency part in the FT1 spectrogram;

(5) Performing an inverse Fourier transform on the replaced spectrum FT2 of each channel

6. A remote sensing image change detection system based on multi-head attention and self-supervision learning is characterized in that: the system comprises:

a construction module: building an unlabeled remote sensing data set RSCD at the scale of millions of images, and extracting the HOG features of each image;

7. The remote sensing image change detection system based on multi-head attention and self-supervised learning as recited in claim 6, wherein: the first-stage self-supervised learning comprises modules such as unlabeled four-channel remote sensing image input, a feature encoder, and a multilayer perceptron decoder, and specifically comprises the following steps:

(3) Performing feature encoding on Imasked using a Transformer with a local multi-head attention structure to obtain the pre-weight knowledge Wp;

(6) Finally, saving the learned pre-weight knowledge Wp as pre-training knowledge.

8. The remote sensing image change detection system based on multi-head attention and self-supervised learning as recited in claim 7, wherein: the change detection algorithm of the second-stage supervised learning comprises modules such as FFT image color style correction, a serialized feature encoder, and a serialized feature decoder, and specifically comprises the following steps:

(3) Initializing the feature encoder subcomponents in the serialized feature encoder module using the pre-weight knowledge Wp from self-supervised learning;

(4) Performing feature encoding on the T1 and T2 remote sensing images using a feature encoder based on a multi-head attention Transformer to obtain features f1 and f2 respectively, wherein the dimensions of f1 and f2 are (H/32, W/32, 8x4), and H and W are the width and height of the T1 image;

(9) Inputting the image feature sequences of T1 and T2 together with f1token-transformer and f2token-transformer respectively, and obtaining the feature maps f1decoder and f2decoder after feature transformation, wherein the dimensions of f1decoder and f2decoder are (H/32, W/32, 8x4);

9. The remote sensing image change detection system based on multi-head attention and self-supervised learning as recited in claim 8, wherein: the FFT image color style correction module in the change detection algorithm comprises the following specific steps:

(2) Performing RGB channel decomposition on the scaled images T1_ds and T2_ds;

(3) Performing a fast Fourier transform on each channel of the images

10. The remote sensing image change detection system based on multi-head attention and self-supervised learning as recited in claim 8, wherein: the FFT image color style correction module in the change detection algorithm further comprises the following steps:

FT2(location(FT1-A)) = FT1-A

wherein A denotes the high-frequency part of FT1, and location(FT1-A) denotes the location region corresponding to the low-frequency part in the FT1 spectrogram;

(5) Performing an inverse Fourier transform on the replaced spectrum FT2 of each channel